high performance computing on graphics processing units: hgpu.org

Posts

Aug, 21

Why it is time for a HyPE: A Hybrid Query Processing Engine for Efficient GPU Coprocessing in DBMS

GPU acceleration is a promising approach to speed up query processing in database systems by using low-cost graphics processors as coprocessors. Two major trends have emerged in this area: (1) The development of frameworks for scheduling tasks in heterogeneous CPU/GPU platforms, which is mainly in the context of coprocessing for applications and does not […]

Aug, 21

Parallel Voronoi Diagram computation on scaled distance planes using CUDA

Voronoi diagrams are fundamental data structures in computational geometry, with applications in many fields inside and outside computer science. This paper shows a CUDA algorithm to compute Voronoi diagrams on a 2D image where the distance between points cannot be directly computed in the Euclidean plane. The proposed method extends an existing Dijkstra-based GPU […]

CUDA

Aug, 21

Accurate Analytic Models to Estimate Execution Time on GPU Applications

Today's top-ranked HPC systems feature several GPUs, which deliver high processing speed at a low power budget for various parallel applications. Many scientific applications still demand more computing speed than is available today. A general approach to provide more processing speed is to scale the system. However, aspects such as interference, the amount […]

CUDA

Aug, 20

Efficient Heterogeneous Execution on Large Multicore and Accelerator Platforms: Case Study Using a Block Tridiagonal Solver

The algorithmic and implementation principles for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes are explored, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing […]

CUDA

Aug, 20

Simulating Dam-Break Flooding with Floating Objects through Intricate City Layouts Using GPU-based SPH Method

For fast transient dam-break flooding with floating bodies through intricate city layouts, traditional grid-based methods based on solving the two-dimensional (2D) Shallow Water Equations or the three-dimensional (3D) Reynolds-averaged Navier-Stokes equations have difficulty in modelling the 3D unsteady flow features and the moving objects in the flow, causing inaccuracies. In this […]

CUDA

Aug, 20

Using Modularity Metrics to assist Move Method Refactoring of Large System

For large software systems, refactoring activities can be a challenging task, since for keeping component complexity under control the overall architecture as well as many details of each component have to be considered. Product metrics are therefore often used to quantify several parameters related to the modularity of a software system. This paper devises an […]

CUDA

Aug, 20

Advanced CFD Modeling Using GeForce GPUs

Advanced applications of CFD for multiphysics modelling of electrokinetic, capillary, turbulent and rarefied hypersonic flows are discussed in this paper. Due to the complexity of the geometry involved and the underlying physics associated with the phenomena to be studied, multiphysics study requires enormous computational resources. The CFD computations are performed within a parallel environment for accelerating […]

CUDA

Aug, 20

Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials

The use of coprocessors or accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, defined as machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), […]

CUDA

Aug, 19

Performance Drawbacks for Matrix Multiplication using Set Associative Cache in GPU devices

Performance of shared memory processors shows negative performance impulses (drawbacks) in certain regions for execution of the basic matrix multiplication algorithm. In this paper we continue with analysis of GPU memory hierarchy and corresponding cache memory organization. We give a theoretical analysis of why a negative performance impulse appears for specific problem sizes. The main reason […]

CUDA

Aug, 19

A Domain-Specific Language and Compiler for Stencil Computations on Short-Vector SIMD and GPU Architectures

Stencil computations are an integral part of applications in a number of scientific computing domains, such as image processing and partial differential equations. We describe a domain-specific language for regular stencil computations that allows specification of the computations in a concise manner. We describe a multi-target compiler for this DSL that generates optimized code for […]

CUDA

Aug, 19

PARIS: A Parallel RSA-Prime Inspection Tool

Modern-day computer security relies heavily on cryptography as a means to protect the data that we have become increasingly reliant on. As the Internet becomes more ubiquitous, methods of security must be better than ever. Validation tools can be leveraged to help increase our confidence and accountability for methods we employ to secure our systems. […]

CUDA

Aug, 19

Algorithms for Compression on GPUs

This project seeks to produce an algorithm for fast lossless compression of data. This is attempted by utilisation of highly parallel graphics processing units (GPUs), which have become easier to use in the last decade through simpler access. nVidia in particular has simplified the programming of GPUs with its CUDA architecture. I […]

CUDA

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
