high performance computing on graphics processing units: hgpu.org

Posts

Oct, 22

Hardware Transactional Memory for GPU Architectures

Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks […]

CUDA • OpenCL

Oct, 22

Low-Impact Profiling of Streaming, Heterogeneous Applications

Computer engineers are continually faced with the task of translating improvements in fabrication process technology (i.e., Moore’s Law) into architectures that allow computer scientists to accelerate application performance. As feature-size continues to shrink, architects of commodity processors are designing increasingly more cores on a chip. While additional cores can operate independently with some tasks (e.g. […]

Oct, 22

Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems

Checkpointing is an effective fault-tolerant technique to improve the reliability of large scale parallel computing systems. However, checkpointing causes a large number of computation nodes to store a huge amount of data into the file system simultaneously. It not only requires a huge storage space to store system state, but also brings a tremendous […]

OpenCL

Oct, 22

Parallelization of the distinct lattice spring model

The distinct lattice spring model (DLSM) is a newly developed numerical tool for modeling rock dynamics problems, i.e. dynamic failure and wave propagation. In this paper, parallelization of DLSM is presented. With the development of parallel computing technologies in both hardware and software, parallelization of a code is becoming easier than before. There are many […]

Oct, 22

Mapping Iterative Medical Imaging Algorithm on Cell Accelerator

Algebraic reconstruction techniques require about half the number of projections as that of Fourier backprojection methods, which makes these methods safer in terms of required radiation dose. Algebraic reconstruction technique (ART) and its variant OS-SART (ordered subset simultaneous ART) are techniques that provide faster convergence with comparatively good image quality. However, the prohibitively long processing […]

Oct, 21

Concurrent Algorithms and Data Structures for Many-Core Processors

The convergence of highly parallel many-core graphics processors with conventional multi-core processors is becoming a reality. To allow algorithms and data structures to scale efficiently on these new platforms, several important factors need to be considered. (i) The algorithmic design needs to utilize the inherent parallelism of the problem at hand. Sorting, which is one […]

Oct, 21

Solving Linear Recurrences on Hybrid GPU Accelerated Manycore Systems

The aim of this paper is to show that linear recurrence systems with constant coefficients can be efficiently solved on hybrid GPU accelerated manycore systems with modern Fermi GPU cards. The main idea is to use the recently developed divide-and-conquer algorithm which can be expressed in terms of Level 2 and 3 BLAS operations. The […]

CUDA

Oct, 21

Analysis and Implementation of eSTREAM and SHA-3 Cryptographic Algorithms

Invaluable benchmarking efforts have been made to measure the performance of eSTREAM portfolio stream ciphers and SHA-3 hash function candidates on multiple architectures. In this thesis we contribute to these efforts; we evaluate the performance of all eSTREAM ciphers and all second-round SHA-3 candidates on NVIDIA Graphics Processing Units (GPUs). Complementarily, we present the […]

CUDA

Oct, 21

On the Usage of GPUs for Efficient Motion Estimation in Medical Image Sequences

Images are ubiquitous in biomedical applications from basic research to clinical practice. With the rapid increase in resolution, dimensionality of the images and the need for real-time performance in many applications, computational requirements demand proper exploitation of multicore architectures. Towards this, GPU-specific implementations of image analysis algorithms are particularly promising. In this paper, we investigate […]

CUDA

Oct, 21

Computational Fluid Dynamics Using Graphics Processing Units: Challenges and Opportunities

A new paradigm for computing fluid flows is the use of Graphics Processing Units (GPU), which have recently become very powerful and convenient to use. In the past three years, we have implemented five different fluid flow algorithms on GPUs and have obtained significant speed-ups over a single CPU. Typically, it is possible to achieve […]

CUDA

Oct, 21

DOPA: GPU-based protein alignment using database and memory access optimizations

BACKGROUND: Smith-Waterman (S-W) algorithm is an optimal sequence alignment method for biological databases, but its computational complexity makes it too slow for practical purposes. Heuristics based approximate methods like FASTA and BLAST provide faster solutions but at the cost of reduced accuracy. Also, the expanding volume and varying lengths of sequences necessitate performance efficient restructuring […]

CUDA

Oct, 21

Massively parallel computation using graphics processors with application to optimal experimentation in dynamic control

The rapid growth in the performance of graphics hardware, coupled with recent improvements in its programmability, has led to its adoption in many non-graphics applications, including a wide variety of scientific computing fields. At the same time, a number of important dynamic optimal policy problems in economics are athirst for computing power to help overcome […]

CUDA
