high performance computing on graphics processing units: hgpu.org

Posts

Sep, 28

Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, China's first petascale supercomputer and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to explore the task, thread and data parallelism of the […]

Sep, 28

XML3D: interactive 3D graphics for the web

Web technologies provide the basis to distribute digital information worldwide and in realtime but they have also established the Web as a ubiquitous application platform. The Web evolved from simple text data to include advanced layout, images, audio, and recently streaming video. Today, as our digital environment becomes increasingly three-dimensional (e.g. 3D cinema, 3D video, […]

OpenGL

Sep, 28

Spark: modular, composable shaders for graphics hardware

In creating complex real-time shaders, programmers should be able to decompose code into independent, localized modules of their choosing. Current real-time shading languages, however, enforce a fixed decomposition into per-pipeline-stage procedures. Program concerns at other scales — including those that cross-cut multiple pipeline stages — cannot be expressed as reusable modules. We present a shading […]

Sep, 28

VoxelPipe: a programmable pipeline for 3D voxelization

We present a highly flexible and efficient software pipeline for programmable triangle voxelization. The pipeline, entirely written in CUDA, supports both fully conservative and thin voxelizations, multiple boolean, floating point, vector-typed render targets, user-defined vertex and fragment shaders, and a bucketing mode which can be used to generate 3D A-buffers containing the entire list of […]

CUDA

Sep, 28

Thread Block Compaction for Efficient SIMT Control Flow

Manycore accelerators such as graphics processor units (GPUs) organize processing units into single-instruction, multiple-data "cores" to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep, dubbed single-instruction, multiple-thread (SIMT) […]

CUDA

Sep, 27

Parallel implementations of probabilistic latent semantic analysis on graphic processing units

Probabilistic Latent Semantic Analysis (PLSA) has been successfully applied to many text mining tasks such as retrieval, clustering, and summarization. PLSA involves iterative computation over a large number of parameters and may take hours or even days to process a large dataset, so speeding up PLSA is strongly motivated in the text mining domain. […]

CUDA

Sep, 27

Software Development Tools Using GPGPU Potentialities

The paper examines the potential of various current software development tools for exploiting the parallel computing resources of graphics processors (GPUs). Examples illustrate the use of these tools for developing applications and implementing algorithms for scientific and technical calculations performed by GPGPU. The paper presents some classes of hard mathematical […]

OpenCL

Sep, 27

Fast On-line Statistical Learning on a GPGPU

On-line Machine Learning using Stochastic Gradient Descent is an inherently sequential computation. This makes it difficult to improve performance by simply employing parallel architectures. Langford et al. made a modification to the standard stochastic gradient descent approach which opens up the possibility of parallel computation. They also proved that there is no significant loss in […]

CUDA

Sep, 27

Intelligent GPGPU Classification in Volume Visualization: A framework based on Error-Correcting Output Codes

In volume visualization, defining the regions of interest is inherently an iterative trial-and-error process of finding the best parameters to classify and render the final image. Generally, the user requires a lot of expertise to analyze and edit these parameters through multi-dimensional transfer functions. In this paper, we present a framework of intelligent […]

OpenCL

Sep, 27

Fast Frequent Itemset Mining from Uncertain Databases using GPGPU

Frequent itemset mining from uncertain databases differs from conventional frequent itemset mining in that it must take uncertainty into account. To this end, some methods have already been proposed, but their performance is not satisfactory. Meanwhile, GPGPU (General-Purpose computing on GPU) has recently become an interesting research subject in the field of […]

CUDA

Sep, 27

A GPU approach to parallel replica-exchange polymer simulations

We investigate new programming techniques for parallel tempering Monte Carlo simulations of an elementary bead-spring homopolymer model using graphics processing units (GPUs). For a precise estimation of statistical quantities, like the peak structure of the specific heat, a large number of conformations with substantial statistical data is needed. Therefore the advantage of gathering this data […]

CUDA

Sep, 27

A framework to implement a multifrontal scheme on GPU architectures with OpenCL

In this work we analyze an open-source multifrontal solver implementation (UMFPACK) and modify it to transfer the computation load to an OpenCL device, typically a GPU. To achieve this result the dbOpenCL library has been created, which allows a neat integration of OpenCL code into existing C or C++ code. An analysis and profiling […]

OpenCL

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)