high performance computing on graphics processing units: hgpu.org

Posts

Aug, 19

A Domain-Specific Language and Compiler for Stencil Computations on Short-Vector SIMD and GPU Architectures

Stencil computations are an integral part of applications in a number of scientific computing domains, such as image processing and partial differential equations. We describe a domain-specific language for regular stencil computations that allows the computations to be specified concisely. We describe a multi-target compiler for this DSL that generates optimized code for […]

CUDA
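As a rough illustration of what a regular stencil computation looks like, the following minimal sketch applies one sweep of a 5-point averaging stencil to a 2D grid. It is plain sequential Python, not the DSL or the compiler output described in the abstract; the function name `jacobi_step` and the choice of an averaging stencil are assumptions made for this example.

```python
def jacobi_step(grid):
    """Apply one 5-point averaging stencil sweep to the interior of a 2D grid.

    Hypothetical helper for illustration only; boundary cells are left unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so every update reads the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior cell becomes the mean of its four direct neighbours.
            out[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return out


if __name__ == "__main__":
    grid = [
        [1.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0],
    ]
    print(jacobi_step(grid))
```

Every interior update reads only a fixed neighbourhood of the old grid, so the updates are independent of one another, which is the property that makes stencils a natural fit for SIMD and GPU targets.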

Aug, 19

PARIS: A Parallel RSA-Prime Inspection Tool

Modern-day computer security relies heavily on cryptography to protect the data on which we have become increasingly reliant. As the Internet becomes more ubiquitous, methods of security must be better than ever. Validation tools can be leveraged to help increase our confidence in, and accountability for, the methods we employ to secure our systems. […]

CUDA

Aug, 19

Algorithms for Compression on GPUs

This project seeks to produce an algorithm for fast lossless data compression. It does so by exploiting highly parallel graphics processing units (GPUs), which have become easier to use over the last decade through simpler access. NVIDIA in particular has simplified GPU programming with its CUDA architecture. I […]

CUDA

Aug, 19

Transfer Time Reduction of Data Transfers between CPU and GPU

In real-time video processing, data transfer between the CPU and GPU is a time-critical action; time spent transferring data is processing time lost. Several variants of standard transfer methods were developed and evaluated on nine computers, and two smart decision algorithms were designed to help choose the fastest method for each occasion. Results showed that […]

CUDA, OpenCL

Aug, 18

Towards a Distributed GPU-Accelerated Matrix Inversion

We present an extension of a GPU-based matrix inversion algorithm for distributed memory contexts. Specifically, we implement and evaluate a message-passing variant of the Gauss-Jordan method (GJE) for matrix inversion on a cluster of nodes equipped with GPU hardware accelerators. The experimental evaluation of the proposal shows a significant runtime reduction when compared with both […]

CUDA
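For orientation, here is a minimal single-process sketch of matrix inversion by Gauss-Jordan elimination on an augmented matrix. It is not the message-passing, GPU-accelerated variant evaluated in the paper; the function name `gauss_jordan_inverse`, the partial pivoting strategy, and the singularity tolerance are assumptions made for this example.

```python
def gauss_jordan_inverse(a):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination.

    Illustrative sequential sketch with partial pivoting; hypothetical helper.
    """
    n = len(a)
    # Build the augmented matrix [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # tolerance is an assumption
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row, above and below.
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [rv - f * cv for rv, cv in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


if __name__ == "__main__":
    print(gauss_jordan_inverse([[4, 7], [2, 6]]))
```

The elimination step updates every non-pivot row independently, which is where a parallel implementation would plausibly distribute the work across devices or nodes.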

Aug, 18

A GPU implementation for improved granular simulations with LAMMPS

Granular mechanics plays an important role in many branches of science and engineering, from astrophysics applications in planetary and interstellar dust clouds to the processing of industrial mixtures and powders. In this context, a granular simulation model with improved adhesion and friction is implemented within the open source code LAMMPS (lammps.sandia.gov). The performance of this model […]

CUDA

Aug, 18

Permutation Index and GPU to Solve efficiently Many Queries

Similarity search is a fundamental operation for applications that deal with multimedia data. For a query in a multimedia database, it is meaningless to look for elements exactly equal to the query object. Instead, we need to measure the similarity (or dissimilarity) between the query object and each object of the database. The […]

CUDA
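The baseline that the abstract describes, measuring the dissimilarity between the query and every database object, can be sketched as a linear scan. This is only that brute-force baseline, not the permutation index of the paper; the function name `nearest_neighbours` and the use of Euclidean distance as the dissimilarity measure are assumptions made for this example.

```python
import math


def nearest_neighbours(query, database, k):
    """Return the k database objects closest to the query.

    Hypothetical helper: a brute-force scan using Euclidean distance.
    """
    # Compare the query against every object, then keep the k most similar.
    ranked = sorted(database, key=lambda obj: math.dist(query, obj))
    return ranked[:k]


if __name__ == "__main__":
    database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    print(nearest_neighbours((0.9, 0.9), database, k=2))
```

Each distance evaluation is independent of the others, so answering many queries at once maps naturally onto parallel hardware.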

Aug, 18

Encrypting video streams using OpenCL code on-demand

The amount of multimedia information transmitted through the web is very high and increasing. Generally, this kind of data is not correctly protected, since users do not appreciate the information that images and videos may contain. In this work, we present an architecture for safely managing multimedia transmission channels. The idea is to encrypt and encode […]

OpenCL

Aug, 18

Fast and Flexible: Parallel Packet Processing with GPUs and Click

We introduce Snap, a framework for packet processing that outperforms traditional software routers by exploiting the parallelism available on modern GPUs. While obtaining high performance, it remains extremely flexible, with packet-processing tasks implemented as simple modular elements that are composed to build fully functional routers and switches. Snap is based on the Click modular router, […]

CUDA

Aug, 17

Solving 3D viscous incompressible Navier-Stokes equations using CUDA

A CUDA implementation of the 3D viscous incompressible Navier-Stokes equations is proposed, using the BFECC (Back and Forth Error Compensation and Correction) scheme as the advection operator. The Poisson problem for pressure is solved with a CG (Conjugate Gradient) method, preconditioning the system with FFTs (Fast Fourier Transforms). Study cases such as Lid-Driven Cavity and Flow Past […]

CUDA
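The BFECC idea can be illustrated on a much smaller problem than the paper's 3D solver. The sketch below advects a 1D scalar field on a periodic grid: it steps forward, steps backward, uses half of the mismatch with the original field as an error estimate, and then steps forward again from the corrected field. The first-order upwind base scheme, the periodic boundary, the constant positive velocity, and all function names are assumptions made for this example.

```python
def upwind_forward(phi, c):
    """One first-order upwind step for positive velocity (c is the CFL number)."""
    n = len(phi)
    return [phi[i] - c * (phi[i] - phi[(i - 1) % n]) for i in range(n)]


def upwind_backward(phi, c):
    """The same base scheme applied with the velocity reversed."""
    n = len(phi)
    return [phi[i] + c * (phi[(i + 1) % n] - phi[i]) for i in range(n)]


def bfecc_step(phi, c):
    """One BFECC step on a periodic 1D grid (illustrative sketch only)."""
    forward = upwind_forward(phi, c)          # advect forward
    back = upwind_backward(forward, c)        # advect back to the start time
    # Half of the round-trip mismatch estimates the error of a single step.
    corrected = [p + 0.5 * (p - b) for p, b in zip(phi, back)]
    return upwind_forward(corrected, c)       # final forward step


if __name__ == "__main__":
    pulse = [0.0, 1.0, 0.0, 0.0]
    print(bfecc_step(pulse, 1.0))
```

With a CFL number of exactly 1 the base scheme is already an exact shift, so the correction vanishes; for smaller values the compensation step offsets part of the numerical diffusion of the first-order scheme.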

Aug, 17

Performance Analysis of a Symmetric Cryptography Algorithm on GPU and GPU Cluster

This article presents a performance analysis of the symmetric encryption algorithm AES (Advanced Encryption Standard) on a machine with one GPU and on a cluster of GPUs, for cases in which the memory required by the algorithm exceeds that of a single GPU. Two implementations were carried out, based on the C language, that use the […]

CUDA

Aug, 17

Formal specification and verification of OpenCL Kernel optimization

Computing general problems using the graphics processing unit (GPU) of a device is an emerging field. The parallel structure of the GPU allows for massive concurrency when executing a program. Therefore, by executing (a part of) the code on the GPU, a previously unused resource can be used to achieve a speed-up of an application. […]

OpenCL

