high performance computing on graphics processing units: hgpu.org

Posts

Jan, 2

System-Level Optimization and Code Generation for Graphics Processors using a Domain-Specific Language

As graphics processing units (GPUs) are increasingly used for general-purpose processing, efficient tooling for programming such parallel architectures becomes essential. Despite continuous efforts to improve the programmability of CUDA and OpenCL, they remain relatively low-level languages and require in-depth architecture knowledge to achieve high-performance implementations. Developers have to perform memory management manually to […]

CUDA, OpenCL

Jan, 2

GPU-accelerated Faster Mean Shift with euclidean distance metrics

Handling clustering problems is important in data statistics, pattern recognition and image processing. The mean-shift algorithm, a common unsupervised algorithm, is widely used to solve clustering problems. However, the mean-shift algorithm is restricted by its huge computational resource cost. In previous research [10], we proposed a novel GPU-accelerated Faster Mean-shift algorithm, which greatly speeds up the […]

CUDA

Jan, 2

A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud

INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed of a single task, while the multi-tenancy of INFaaS has not been explored yet. As the demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is […]

CUDA

Jan, 2

A Variant of Concurrent Constraint Programming on GPU

The number of cores on graphics processing units (GPUs) now reaches into the thousands, whereas the clock speed of processors stagnates. Unfortunately, constraint programming solvers do not yet take advantage of GPU parallelism. One reason is that constraint solvers were primarily designed within the mental frame of sequential computation. To solve this issue, we take a […]

CUDA

Jan, 2

PROGRAML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations

Machine learning (ML) is increasingly seen as a viable approach for building compiler optimization heuristics, but many ML methods cannot replicate even the simplest of the data flow analyses that are critical to making good optimization decisions. We posit that if ML cannot do that, then it is insufficiently able to reason about programs. We […]

OpenCL

Dec, 26

COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs

As CUDA has become the de facto programming model for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms has become a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to extra steps in the translation, the support is always […]

CUDA

Dec, 26

OpenCL-HPX Integration

Distributed applications combine the computational capabilities of heterogeneous nodes. As such, they offer challenges regarding data transfer and synchronization. HPX is a library for concurrent, parallel applications. It strives not only to address challenges regarding distributed systems, but also to conform to current and upcoming C++ standards. One of the solutions found in heterogeneous systems […]

OpenCL

Dec, 26

FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs

General sparse matrix-matrix multiplication (SpGEMM) is an integral part of many scientific computing, high-performance computing (HPC), and graph analytic applications. This paper presents a new compressed sparse vector (CSV) format for representing sparse matrices and FSpGEMM, an OpenCL-based HPC framework for accelerating general sparse matrix-matrix multiplication on FPGAs. The proposed FSpGEMM framework includes an FPGA […]

OpenCL

Dec, 26

High-Performance Interactive Scientific Visualization With Datoviz via the Vulkan Low-Level GPU API

We report initial work towards a new fast and scalable scientific visualization technology that leverages the Vulkan API to achieve unprecedented performance through GPUs. This technology is implemented in a C/C++ library called Datoviz that offers an intermediate-level API for scientific visualization libraries and software. Datoviz provides a unified graphics stack for 2-D, 3-D, graphical […]

Dec, 26

NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems

We introduce version 3 of NetKet, the machine learning toolbox for many-body quantum physics. NetKet is built around neural-network quantum states and provides efficient algorithms for their evaluation and optimization. This new version is built on top of JAX, a differentiable programming and accelerated linear algebra framework for the Python programming language. The most significant […]

Dec, 19

Optimization of Compiler-generated OpenCL CNN Kernels and Runtime for FPGAs

This work explores the viability of end-to-end convolutional neural network inference using OpenCL HLS kernels generated from TVM on Intel FPGAs. We explore layer-pipelined execution for small networks and time-multiplexed kernels for larger CNNs. Naively generated kernels do not produce efficient hardware. We propose a set of optimizations to increase parallelism, resource utilization, and more […]

OpenCL

Dec, 19

Evaluation of Pseudo-Random Number Generation on GPU Cards

Monte Carlo methods rely on sequences of random numbers to obtain solutions to many problems in science and engineering. In this work, we evaluate the performance of different pseudo-random number generators (PRNGs) of the cuRAND library on a number of modern NVIDIA GPU cards. As a numerical test, we generate pseudo-random number (PRN) sequences and […]

CUDA
