high performance computing on graphics processing units: hgpu.org

Posts

Dec, 7

GPU Implementation of the Keccak Hash Function Family

Hash functions are one of the most important cryptographic primitives. Some of the currently employed hash functions like SHA-1 or MD5 are considered broken today. Therefore, in 2007 the US National Institute of Standards and Technology announced a competition for a new family of hash functions. Keccak is one of the five final candidates to […]

CUDA

Dec, 7

Parallelizing AES on multicores and GPUs

The AES block cipher is widely used and resource intensive. An existing sequential open source implementation of the algorithm was parallelized on multi-core microprocessors and GPUs. Performance results are presented.

CUDA

Dec, 7

An Efficient Parallel Motion Estimation Algorithm and X264 Parallelization in CUDA

H.264/AVC video encoders have been widely used for their high coding efficiency. Since the computational demand proportional to the frame resolution is constantly increasing, it has been of great interest to accelerate H.264/AVC by parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general purpose applications by exploiting fine-grain […]

CUDA

Dec, 7

Sparse-Matrix-CG-Solver in CUDA

This paper describes the implementation of a parallelized conjugate gradient solver for linear equation systems using CUDA-C. Given a real, symmetric and positive definite coefficient matrix and a right-hand side, the parallelized CG solver is able to find a solution for that system by exploiting the massive compute power of today's GPUs. Comparing sequential CPU implementations […]
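For readers unfamiliar with the method, the following is a minimal sequential sketch of the textbook conjugate gradient iteration in pure Python. It is not the paper's CUDA-C code: the dense list-of-rows storage, the function names (`dot`, `matvec`, `conjugate_gradient`), and the `tol` and `max_iter` parameters are assumptions chosen for illustration only.

```python
# Textbook conjugate gradient for A x = b, assuming A is real, symmetric and
# positive definite. Dense storage and all names here are illustrative
# assumptions, not taken from the paper.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = [0.0] * len(b)      # initial guess x0 = 0
    r = list(b)             # residual r0 = b - A x0 = b
    p = list(r)             # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:          # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)      # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old           # makes the next direction A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))  # close to [1/11, 7/11]
```

Each iteration consists of one matrix-vector product, two inner products and three vector updates, which are the operations a GPU implementation would typically parallelize.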

CUDA

Dec, 7

Accelerating Braided B+ Tree Searches on a GPU with CUDA

Previous work has shown that using the GPU as a brute force method for SELECT statements on a SQLite database table yields significant speedups. However, this requires that the entire table be selected and transformed from the B-Tree to row-column format. This paper investigates possible speedups by traversing B+ Trees in parallel on the GPU, […]

CUDA

Dec, 7

GPU-based solution of Continuous Time Markov Chains using CUSP

This technical report describes the parallelisation of the response-time analyser HYDRA using CUSP and the results of executing it on HECToR’s GPGPU testbed. We achieved good speed-ups in execution time, but these were outweighed by increased setup time.

CUDA

Dec, 7

Effective Mapping of Grammatical Evolution to CUDA Hardware Model

Several papers have shown that symbolic regression is suitable for data analysis and prediction in finance markets. The Grammatical Evolution (GE) has been successfully applied in solving various tasks including symbolic regression. However, performance of this method can limit the area of possible applications. This paper deals with utilizing mainstream graphics processing unit (GPU) for […]

CUDA

Dec, 7

Efficient Two-Level Preconditioned Conjugate Gradient Method on the GPU

We present an implementation of the Two-Level Preconditioned Conjugate Gradient Method for the GPU. We investigate a Truncated Neumann Series based preconditioner in combination with deflation and compare it with Block Incomplete Cholesky schemes. This combination exhibits fine-grain parallelism and hence we gain considerably in execution time. Its numerical performance is also comparable to the Block […]

CUDA

Dec, 7

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

We describe a GPU- and multicore-oriented implementation technique for a key component of finite element based simulation toolkits for partial differential equations on unstructured grids: Geometric Multigrid solvers. We use efficient sparse matrix-vector multiplications throughout the solver pipeline: within the coarse-grid solver, smoothers and even grid transfers. Our implementation can handle several low- and high-order […]

CUDA

Dec, 6

Automatic Fusions of CUDA-GPU Kernels for Parallel Map

When implementing a function mapping on a contemporary GPU, several contradictory performance factors affecting the distribution of computation into GPU kernels have to be balanced. A decomposition-fusion scheme suggests decomposing the computational problem into several simple functions implemented as standalone kernels and later fusing some of these functions into more complex kernels to […]

CUDA

Dec, 6

Multiprocessing Acceleration of H.264/AVC Motion Estimation Full Search Algorithm under CUDA Architecture

This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a way to partition the steps of the Full Search block matching algorithm in the CUDA architecture, and compare the performance with a theoretical model and two implementations (sequential and parallel using the OpenMP library). We obtained […]
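As background, Full Search block matching evaluates every candidate displacement inside a search window and keeps the one with the lowest matching cost. The sketch below is a plain sequential Python version that uses the sum of absolute differences (SAD) as the cost. It is not the paper's CUDA partitioning: the frame representation (lists of rows of integer pixel values), the function names, and the parameters `n` and `search_range` are assumptions for illustration.

```python
# Sequential Full Search block matching with a SAD cost.
# Frames are lists of rows of integer pixel values; all names and parameters
# are illustrative assumptions, not taken from the paper.

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Return (cost, dx, dy) for the best match of the current block, or
    None if no candidate block fits inside the reference frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate's SAD is independent of the others, which is why the algorithm lends itself to data-parallel hardware; only the final minimum requires a reduction.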

CUDA

Dec, 6

DTAM: Dense tracking and mapping in real-time

DTAM is a system for real-time camera tracking and reconstruction which relies not on feature extraction but on dense, every-pixel methods. As a single hand-held RGB camera flies over a static scene, we estimate detailed textured depth maps at selected keyframes to produce a surface patchwork with millions of vertices. We use the hundreds of […]
