high performance computing on graphics processing units: hgpu.org

Posts

Nov, 19

Performance Analysis of Parallel Sorting Algorithms using GPU Computing

Sorting is a well-investigated problem in computer science. Many authors have devised sorting algorithms for the CPU (Central Processing Unit). Today, however, sorting on the CPU alone is not very efficient, and efficient sorting calls for parallelization. There are many ways to parallelize sorting, but at the present time GPU […]

CUDA

Nov, 19

Lattice QCD simulations using the OpenACC platform

In this article we will explore the OpenACC platform for programming Graphics Processing Units (GPUs). The OpenACC platform offers a directive based programming model for GPUs which avoids the detailed data flow control and memory management necessary in a CUDA programming environment. In the OpenACC model, programs can be written in high level languages with […]

CUDA

Nov, 16

Autotuning CUDA Compiler Parameters for Heterogeneous Applications using the OpenTuner Framework

A Graphics Processing Unit (GPU) is a parallel computing coprocessor specialized in accelerating vector operations. The enormous heterogeneity of parallel computing platforms justifies and motivates the development of automated optimization tools and techniques. The Algorithm Selection Problem consists in finding a combination of algorithms, or a configuration of an algorithm, that optimizes the solution of […]

CUDA

Nov, 16

Efficient Communications in Training Large Scale Neural Networks

We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, like broadcasts of parameters or reductions for sub-gradient aggregations, which for large messages quickly dominate overall execution […]

CUDA

Nov, 16

Data Acquisition with GPUs: The DAQ for the Muon g-2 Experiment at Fermilab

Graphical Processing Units (GPUs) have recently become a valuable computing tool for the acquisition of data at high rates and for a relatively low cost. The devices work by parallelizing the code into thousands of threads, each executing a simple process, such as identifying pulses from a waveform digitizer. The CUDA programming library can be […]

CUDA

Nov, 16

Automatic code generation methods applied to numerical linear algebra in high performance computing

Parallelism in today’s computer architectures is ubiquitous, whether in supercomputers, workstations, or portable devices such as smartphones. Efficiently exploiting these systems for a specific application requires a multidisciplinary effort that concerns Domain Specific Languages (DSL), code generation and optimization techniques, and application-specific numerical algorithms. In this PhD thesis, we present a method […]

CUDA

Nov, 16

Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors

The paper deals with parallelization of computing similarity measures between large vectors. Such computations are important components within many applications and consequently are of high importance. Rather than focusing on optimization of the algorithm itself, assuming specific measures, the paper assumes a general scheme for finding similarity measures for all pairs of vectors and investigates […]

Nov, 13

CUDA-API-wrappers: Thin C++-flavored wrappers for the CUDA runtime API

NVIDIA’s Runtime API for CUDA is intended for use in both C and C++ code. As such, it uses a C-style API, the lowest common denominator (with a few notable exceptions of templated function overloads). This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of […]

CUDA

Nov, 13

OpenCL-based optimizations for acceleration of object tracking on FPGAs and GPUs

OpenCL support across many heterogeneous nodes (FPGAs, GPUs, CPUs) has increased the programmability of these systems significantly. At the same time, it opens up new challenges and design choices for system designers and application programmers. While OpenCL offers a universal semantic to capture the parallel behavior of applications independent of the target architecture, some customization […]

OpenCL

Nov, 13

Executing Dynamic Data Rate Actor Networks on OpenCL Platforms

Heterogeneous computing platforms consisting of general purpose processors (GPPs) and graphics processing units (GPUs) have become commonplace in personal mobile devices and embedded systems. For years, programming of these platforms was very tedious and simultaneous use of all available GPP and GPU resources required low-level programming to ensure efficient synchronization and data transfer between processors. […]

OpenCL

Nov, 13

Hadoopcl2: Motivating the design of a distributed, heterogeneous programming system with machine-learning applications

Machine learning (ML) algorithms have garnered increased interest as they demonstrate improved ability to extract meaningful trends from large, diverse, and noisy data sets. While research is advancing the state-of-the-art in ML algorithms, it is difficult to drastically improve the real-world performance of these algorithms. Porting new and existing algorithms from single-node systems to multi-node […]

OpenCL

Nov, 13

Fractal Art Generation using GPUs

Fractal image generation algorithms exhibit extreme parallelizability. Using general purpose graphics processing unit (GPU) programming to implement escape-time algorithms for Julia sets of functions, parallel methods generate visually attractive fractal images much faster than traditional methods. Vastly improved speeds are achieved using this method of computation, which allow real-time generation and display of images. A comparison […]

CUDA

Recent source codes

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)