high performance computing on graphics processing units: hgpu.org

Posts

Jan, 31

Raytracing Dynamic Scenes on GPU

Raytracing dynamic scenes at interactive to real-time rates has received a lot of attention recently. In this dissertation, we present a few strategies for high-performance ray tracing on an off-the-shelf commodity Graphics Processing Unit (GPU) traditionally used for accelerating gaming and other graphics applications. We utilize the Grid data structure for spatially arranging the […]

CUDA

Jan, 31

Decompilation of LLVM IR

Recently, in many important domains, high-level languages have become the code representations with the widest platform support, surpassing any low-level language in their area with respect to completeness and importance as an exchange format (e.g. OpenCL for data-parallel computing, GLSL/HLSL for shader programs, JavaScript for the web). The code representations of many actively developed compiler frameworks [JVM, LLVM, FIRM] are […]

OpenCL

Jan, 31

The Virtual OpenCL (VCL) Cluster Platform

Heterogeneous computing systems can dramatically increase the performance of parallel applications on clusters. Currently, applications that utilize GPU and APU devices run their device-specific code only on devices of the same computer where the application runs. This paper presents the Virtual OpenCL (VCL) cluster platform that can run unmodified OpenCL applications transparently on clusters with […]

OpenCL

Jan, 31

Graphical processing unit implementation of an integrated shape-based active contour: Application to digital pathology

Commodity graphics hardware has become a cost-effective parallel platform to solve many general computational problems. In medical imaging, and more so in digital pathology, segmentation of multiple structures on high-resolution images is often a complex and computationally expensive task. Shape-based level set segmentation has recently emerged as a natural solution to segmenting overlapping and occluded […]

CUDA

Jan, 31

An OpenCL implementation for the solution of TDSE on GPU and CPU architectures

Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on Graphical Processing Units (GPUs). In the present work we report the development of a generic parallel single-GPU code for the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the OpenCL model. We […]

OpenCL

Jan, 30

Algorithmic Contributions to the Theory of Regular Chains

Regular chains, introduced about twenty years ago, have emerged as one of the major tools for solving polynomial systems symbolically. In this thesis, we focus on different algorithmic aspects of the theory of regular chains, from theoretical questions to high-performance implementation issues. The inclusion test for saturated ideals is a fundamental problem in this theory. […]

CUDA

Jan, 30

Fast CT Image Processing using Parallelized Non-local Means

Reducing the radiation dose delivered to patients has been an important concern since the introduction of X-ray computed tomography (CT). However, low-dose CT images tend to be severely degraded by noise. This paper proposes using parallelized non-local means (PNM) under a computation framework for improving low-dose X-ray CT images. For the proposed PNM method, the […]

CUDA

Jan, 30

Numerical Ocean Modeling and Simulation with CUDA

ROMS is software that models and simulates an ocean region using a finite difference grid and time stepping. ROMS simulations can take from hours to days to complete due to the compute-intensive nature of the software. As a result, the size and resolution of simulations are constrained by the performance limitations of modern computing hardware. […]

CUDA

Jan, 30

On CUDA implementation of a multichannel room impulse response reshaping algorithm based on p-norm optimization

By using room impulse response shortening and shaping, it is possible to reduce reverberation effects and therefore improve speech intelligibility. This may be achieved by a prefilter that modifies the overall impulse response to have a stronger attenuation. To achieve spatial robustness, multichannel approaches have been proposed. Unfortunately, these approaches suffer from a […]

CUDA

Jan, 30

How well do STARLAB and NBODY compare? II: Hardware and accuracy

Most recent progress in understanding the dynamical evolution of star clusters relies on direct N-body simulations. Owing to the computational demands, and the desire to model more complex and more massive star clusters, hardware calculational accelerators, such as GRAPE special-purpose hardware or, more recently, GPUs (i.e. graphics cards), are generally utilised. In addition, simulations can […]

Jan, 30

Efficient Password and Key recovery using Graphic Cards

Passwords are without doubt the most common means for authentication throughout all kinds of applications on computer systems, ranging from local or online-service user logins to the protection of sensitive data by password based encryption. However, wherever passwords are employed, these are prone to loss or disremembering, an effect which, especially driven by the advent […]

CUDA

Jan, 30

CUDA Expression Templates

Many algorithms require vector algebra operations such as the dot product, vector norms or component-wise manipulations. Especially for large-scale vectors, the efficiency of algorithms depends on an efficient implementation of those calculations. The calculation of vector operations benefits from the continually increasing chip-level parallelism on graphics hardware. Very efficient basic linear algebra libraries like […]

CUDA
