high performance computing on graphics processing units: hgpu.org

Posts

Sep, 13

Lossless data compression on GPGPU architectures

Modern graphics processors provide exceptional computational power, but only for certain computational models. While they have revolutionized computation in many fields, compression has been largely unaffected. This paper aims to explain the current issues and possibilities in GPGPU compression. This is done by a high-level overview of the GPGPU computational model in […]

Sep, 13

Parallel volume rendering implementation on graphics cards using CUDA

The ever-increasing amounts of volume data require high-end parallel visualization methods to process this data interactively. To meet the demands, programming on graphics cards offers an effective and fast approach to compute volume rendering methods due to the parallel architecture of today’s graphics cards. In this paper, we introduce a volume ray casting method working […]

CUDA • OpenGL

Sep, 13

Gate-Level Simulation with GPU Computing

Functional verification of modern digital designs is a crucial, time-consuming task impacting not only the correctness of the final product, but also its time to market. At the heart of most of today’s verification efforts is logic simulation, used heavily to verify the functional correctness of a design for a broad range of abstraction levels. […]

CUDA

Sep, 13

Software Challenges for Extreme Scale Computing: Going From Petascale to Exascale Systems

Preparing applications for a transition from petascale to exascale systems will require a very large investment in several areas of software research and development. The introduction of manycore nodes, the abundance of parallelism, an increase in system faults (including soft errors) and a complicated, multi-component software environment are some of the most challenging issues we […]

Sep, 12

Energy-efficient computing for extreme-scale science

A many-core processor design for high-performance systems draws from embedded computing’s low-power architectures and design processes, providing a radical alternative to cluster solutions. The computational power required to accurately model extreme problem spaces, such as climate change, requires more than a business-as-usual approach. Building ever-larger clusters of commercial off-the-shelf (COTS) hardware will be increasingly constrained […]

Sep, 12

Intermediate fabrics: virtual architectures for circuit portability and fast placement and routing

Although hardware/software partitioning of embedded applications onto FPGAs is widely known to have performance and power advantages, FPGA usage has been typically limited to hardware experts, due largely to several problems: 1) difficulty of integrating hardware design tools into well-established software tool flows, 2) increasingly lengthy FPGA design iterations due to placement and routing, and […]

Sep, 12

Task superscalar: using processors as functional units

The complexity of parallel programming greatly limits the effectiveness of chip-multiprocessors (CMPs). This paper presents the case for task superscalar pipelines, an abstraction of traditional out-of-order superscalar pipelines, that orchestrates an entire chip-multiprocessor to the same degree that out-of-order pipelines manage functional units. Task superscalar leverages an emerging class of task-based dataflow programming models to relieve […]

Sep, 12

Meta-simulation of large WSN on multi-core computers

With advances in wireless communications, large-scale Wireless Sensor Networks (WSN) are emerging with many applications. These networks are deployed to serve a single-objective application, with high optimization requirements such as performance enhancement and power saving. Application-specific optimization is achieved using formal models and evaluation-based simulations of distributed algorithms (DA) controlling such […]

Sep, 12

A scripting language for Digital Content Creation applications

Digital Content Creation (DCC) Applications (e.g. Blender, Autodesk 3ds Max) have long been used for the creation and editing of digital content. Due to current advancement in the field, the need for controlled automated work forced these applications to add support for small programming languages that gave power to artists without diving into many details. […]

OpenCL

Sep, 12

Symbolic crosschecking of floating-point and SIMD code

We present an effective technique for crosschecking an IEEE 754 floating-point program and its SIMD-vectorized version, implemented in KLEE-FP, an extension to the KLEE symbolic execution tool that supports symbolic reasoning on the equivalence between floating-point values. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially […]

Sep, 12

Implementing cartesian genetic programming classifiers on graphics processing units using GPU.NET

This paper investigates the use of a new Graphics Processing Unit (GPU) programming tool called ‘GPU.NET’ for implementing a Genetic Programming fitness evaluator. We find that the tool is able to help write software that accelerates fitness evaluation. For the first time, Cartesian Genetic Programming (CGP) was used with a GPU-based interpreter. With its code […]

Sep, 12

Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform

The client computing platform is moving towards a heterogeneous architecture that combines scalar-oriented CPU cores and throughput-oriented accelerator cores. Recognizing that existing programming models for such heterogeneous platforms are still difficult for most programmers, we advocate a shared virtual memory programming model to improve programmability. In this paper, we focus on performance, and demonstrate that […]

