
Posts

Mar, 5

Parallel Computing: The Elephant in the Room

Over the past few years, there has been a shift towards multi-core processors, driven partially by physical limitations. Mistaken assumptions about how effective and useful parallel systems can be have also motivated this change. In this paper, we seek to directly identify the barriers to parallel computation. The barriers are not, as conventional […]
Mar, 5

Inter-Block GPU Communication via Fast Barrier Synchronization

While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. […]
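As a rough illustration of the mechanism the abstract describes, a software GPU-wide barrier can be built from an atomic counter in global memory. The sketch below is a minimal, single-use version, not the paper's tuned lock-free or tree-based barriers; the counter name and the requirement that all blocks be resident simultaneously are assumptions.

// Minimal atomic-counter inter-block barrier (illustrative only; single-use,
// and it deadlocks unless all blocks of the grid are co-resident on the GPU).
#include <cuda_runtime.h>

__device__ unsigned int g_arrived = 0;   // blocks that have reached the barrier

__device__ void global_barrier(unsigned int num_blocks)
{
    __syncthreads();                       // all threads in this block arrive
    if (threadIdx.x == 0) {
        atomicAdd(&g_arrived, 1u);         // announce this block's arrival
        // spin until every block has arrived (atomicAdd of 0 reads the counter)
        while (atomicAdd(&g_arrived, 0u) < num_blocks) { }
    }
    __syncthreads();                       // release the rest of the block
}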
Mar, 5

Designing Efficient Many-Core Parallel Algorithms for All-Pairs Shortest-Paths Using CUDA

Finding the all-pairs shortest-paths on a large graph is a fundamental problem in many practical applications such as bioinformatics, internet node traffic and network routing. In this paper, we present the designs of two efficient parallel algorithms for many-core GPUs using CUDA. Our algorithms expose substantial fine-grained parallelism while maintaining minimal global communication. By using […]
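The excerpt does not spell out the two designs; as a baseline illustration of GPU all-pairs shortest paths, the classic approach launches one Floyd-Warshall relaxation kernel per intermediate vertex. The unblocked sketch below is generic; the matrix layout and launch loop are assumptions, and the paper's algorithms go well beyond this.

// Naive Floyd-Warshall relaxation for APSP (illustrative baseline only).
// dist is an n*n row-major matrix of current shortest path costs.
__global__ void fw_relax(float* dist, int n, int k)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        float via_k = dist[i * n + k] + dist[k * n + j];
        if (via_k < dist[i * n + j])
            dist[i * n + j] = via_k;       // relax path i -> k -> j
    }
}

// Host side: one kernel launch per intermediate vertex k.
// for (int k = 0; k < n; ++k)
//     fw_relax<<<grid, block>>>(d_dist, n, k);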
Mar, 4

A Stream Processor Cluster Architecture Model with the Hybrid Technology of MPI and CUDA

Nowadays, the compute capability of traditional cluster systems cannot keep up with the computing needs of practical applications, and aspects such as energy consumption and physical space have become a serious problem. As parallel computing hardware, however, the stream processor (SP) offers high floating-point performance. The NVIDIA GPU is a typical stream processor […]
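As a generic sketch of the MPI + CUDA hybrid model the excerpt refers to (not the paper's cluster architecture), MPI distributes work across nodes while each rank drives a local GPU; the rank-to-device mapping, chunk size, and kernel body are placeholders.

// Hybrid MPI + CUDA skeleton (illustrative; the kernel is a stand-in).
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void process_chunk(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // stand-in for the real stream kernel
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int devices = 0;
    cudaGetDeviceCount(&devices);
    cudaSetDevice(rank % devices);         // bind each MPI rank to a GPU

    const int n = 1 << 20;                 // per-rank chunk size (assumed)
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // ... MPI_Scatter host data and cudaMemcpy it to d_data ...

    process_chunk<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // ... cudaMemcpy results back and MPI_Gather them on rank 0 ...
    cudaFree(d_data);
    MPI_Finalize();
    return 0;
}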
Mar, 4

Formal Description and Optimization Based High-Performance Computing on CUDA

In recent years, with the development of GPUs, general-purpose computation on graphics processors has become a new field. Targeting GPU processing, this paper provides a formal description of the data-parallel mode, a detailed description of the CUDA programming model, and the principles of optimization. It shows by the […]
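For readers new to the data-parallel mode being formalized, the canonical CUDA illustration maps one thread to one output element; the vector addition below is the standard textbook example, not code from the paper.

// Standard CUDA vector addition: one thread computes one output element.
__global__ void vec_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
// Launch example: vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);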
Mar, 4

Scene Recognition Acceleration Using CUDA and OpenMP

Scene recognition has become a remarkable field in the image processing area, and many methods have been proposed in recent years; among them, the idea of extracting the scene gist from global features has been shown to achieve higher retrieval accuracy than many other methods. However, the process of extracting the gist is heavily time-consuming and […]
Mar, 4

Towards a Software Transactional Memory for Graphics Processors

The introduction of general purpose computing on many-core graphics processor systems, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization. Software transactional memory (STM) simplifies development of concurrent code by allowing the programmer to mark sections of code to be executed concurrently and atomically in an optimistic […]
Mar, 4

Some of the What?, Why?, How?, Who? and Where? of Graphics Processing Unit Computing for Bayesian Analysis

Over the last 20 years or so, a number of Bayesian researchers and groups have invested a good deal of time, effort and money in parallel computing for Bayesian analysis. The growth from “small research group” to “institutionally supported” cluster computational facilities has had a substantial impact on a number of areas of Bayesian analysis, […]
Mar, 4

Acceleration of Medical Image Registration using Graphics Processing Units in Computing Normalized Mutual Information

This paper presents a computational performance analysis of an accelerated medical image registration using Graphics Processing Units (GPUs). In our previous work, a multi-resolution approach using normalized mutual information (NMI) has proven to be useful in medical image registration. In this paper, we propose an acceleration of the NMI procedure using a GPU implementation because of […]
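The GPU-side bottleneck in NMI-based registration is typically the joint histogram from which the entropies are derived, with NMI(A,B) = (H(A) + H(B)) / H(A,B). The sketch below shows only that accumulation step; the bin count and quantization are assumptions, not the authors' implementation.

// Joint histogram accumulation for two intensity images (illustrative only).
#define BINS 64

__global__ void joint_histogram(const unsigned char* imgA,
                                const unsigned char* imgB,
                                unsigned int* hist,    // BINS*BINS joint bins
                                int num_voxels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_voxels) {
        int a = imgA[i] * BINS / 256;      // quantize intensities into bins
        int b = imgB[i] * BINS / 256;
        atomicAdd(&hist[a * BINS + b], 1u);
    }
}
// Marginal histograms for H(A) and H(B) follow by summing rows and columns,
// and the entropies are computed from the normalized bin probabilities.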
Mar, 4

Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures

We describe advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via GPU (graphics processing unit) programming. The developments are partly motivated by computational challenges arising in increasingly prevalent biological studies using high-throughput flow cytometry methods, generating many, very large data sets and requiring increasingly high-dimensional mixture models with large numbers […]
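The massively parallel step in such mixture analyses is usually the per-observation evaluation of component densities; the sketch below assumes diagonal-covariance Gaussian components purely for illustration and is not the authors' code.

// One thread per observation: evaluate the log-density of each mixture
// component (diagonal-covariance Gaussians assumed for brevity).
__global__ void mixture_logdens(const float* x,      // n*d observations
                                const float* mu,     // k*d component means
                                const float* var,    // k*d component variances
                                const float* logw,   // k log mixture weights
                                float* out,          // n*k log-densities
                                int n, int d, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int j = 0; j < k; ++j) {
        float acc = logw[j];
        for (int t = 0; t < d; ++t) {
            float diff = x[i * d + t] - mu[j * d + t];
            acc += -0.5f * (logf(2.0f * 3.14159265f * var[j * d + t])
                            + diff * diff / var[j * d + t]);
        }
        out[i * k + j] = acc;              // log w_j + log N(x_i | mu_j, var_j)
    }
}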
Mar, 4

Architecture-Aware Optimization Targeting Multithreaded Stream Computing

Optimizing program execution targeted for Graphics Processing Units (GPUs) can be very challenging. Efficiently mapping serial code to a GPU or stream-processing platform is a time-consuming task and is greatly hampered by a lack of detail about the underlying hardware. Programmers are left to rely on trial and error to produce […]
Mar, 4

Redesigning combustion modeling algorithms for the Graphics Processing Unit (GPU): Chemical kinetic rate evaluation and ordinary differential equation integration

Detailed modeling of complex combustion kinetics remains challenging and often intractable, due to prohibitive computational costs incurred when solving the associated large kinetic mechanisms. The Graphics Processing Unit (GPU), originally designed for graphics rendering on computer and gaming systems, has recently emerged as a powerful, cost-effective supplement to the Central Processing Unit (CPU) for dramatically […]
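The key property such GPU formulations rely on is that the chemistry of each grid cell can be integrated independently; the sketch below maps one cell per thread with explicit Euler sub-steps. The rate function, mechanism size, and step control are placeholders; the paper's kinetic rate evaluation and integrators are far more elaborate.

// One thread integrates the chemical state of one cell with explicit Euler
// sub-steps (illustrative only; real kinetics need stiff, adaptive solvers).
#define NSPECIES 8   // assumed small mechanism size for illustration

__device__ void eval_rates(const float* y, float* dydt)
{
    // Placeholder for chemical kinetic rate evaluation; a real mechanism
    // would compute Arrhenius rates and species production terms here.
    for (int s = 0; s < NSPECIES; ++s)
        dydt[s] = -0.1f * y[s];
}

__global__ void integrate_cells(float* state,        // ncells * NSPECIES
                                int ncells, float dt, int substeps)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= ncells) return;

    float y[NSPECIES], dydt[NSPECIES];
    for (int s = 0; s < NSPECIES; ++s)
        y[s] = state[cell * NSPECIES + s];

    float h = dt / substeps;
    for (int step = 0; step < substeps; ++step) {
        eval_rates(y, dydt);
        for (int s = 0; s < NSPECIES; ++s)
            y[s] += h * dydt[s];           // explicit Euler update
    }

    for (int s = 0; s < NSPECIES; ++s)
        state[cell * NSPECIES + s] = y[s];
}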
