high performance computing on graphics processing units: hgpu.org

Posts

Feb, 10

Heterogeneous Distributed Big Data Clustering on Sparse Grids

Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The […]

OpenCL

Feb, 10

Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression

Hierarchical matrices are space and time efficient representations of dense matrices that exploit the low rank structure of matrix blocks at different levels of granularity. The hierarchically low rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity of dense matrices. In this paper, […]

CUDA

Feb, 3

ThunderGBM: Fast GBDTs and Random Forests on GPUs

Gradient Boosting Decision Trees (GBDTs) and Random Forests (RFs) have been used in many real-world applications. They are often a standard recipe for building state-of-the-art solutions to machine learning and data mining problems. However, training and prediction are very expensive computationally for large and high dimensional problems. This article presents an efficient and open source […]

CUDA

Feb, 3

Swizzle Inventor: Data Movement Synthesis for GPU Kernels

Utilizing memory and register bandwidth in modern architectures may require irregular data placement and movement, such as shuffles and broadcasts. We develop Swizzle Inventor to help programmers implement swizzle algorithms, by writing programs that omit swizzles and delegating the creation of those swizzles to an automatic synthesizer. Our synthesis algorithm scales to real-world programs, allowing […]

CUDA

Feb, 3

The OoO VLIW JIT Compiler for GPU Inference

Current trends in Machine Learning (ML) inference on hardware accelerated devices (e.g., GPUs, TPUs) point to alarmingly low utilization. As ML inference is increasingly time-bounded by tight latency SLOs, increasing data parallelism is not an option. The need for better efficiency motivates GPU multiplexing. Furthermore, existing GPU programming abstractions force programmers to micro-manage GPU resources […]

CUDA

Feb, 3

High Performance Algorithms for Counting Collisions and Pairwise Interactions

The problem of counting collisions or interactions is common in areas as computer graphics and scientific simulations. Since it is a major bottleneck in applications of these areas, a lot of research has been done on such subject, mainly focused on techniques that allow calculations to be performed within pruned sets of objects. This paper […]

CUDA

Feb, 3

DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression

DNNs have been quickly and broadly exploited to improve the data analysis quality in many complex science and engineering applications. Today’s DNNs are becoming deeper and wider because of increasing demand on the analysis quality and more and more complex applications to resolve. The wide and deep DNNs, however, require large amounts of resources, significantly […]

Jan, 27

Exploiting OpenMP & OpenACC to Accelerate a Molecular Docking Mini-App in Heterogeneous HPC Nodes

In drug discovery, molecular docking is the task in charge of estimating the position of a molecule when interacting with the docking site. This task is usually used to perform screening of a large library of molecules, in the early phase of the process. Given the amount of candidate molecules and the complexity of the […]

CUDA

Jan, 27

Supporting mixed-datatype matrix multiplication within the BLIS framework

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (GEMM) operation of the BLIS framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the computation is allowed to take place in a precision different from […]

Jan, 27

Efficient Implementation and Optimization of Geometric Multigrid Operations in the LIFT Framework

Geometric Multigrid (GMG) is an efficient method to solve partial differential equations. It consists of four operations (smooth, residual, restrict and prolongate), which are applied iteratively. All four operations are stencil computations and hence benefit from being executed on many-core architectures like GPUs. Programs executed on a GPU are usually written using low-level programming approaches […]

OpenCL

Jan, 27

AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Deep Neural Networks

Typically, Ultra-deep neural network(UDNN) tends to yield high-quality model, but its training process is usually resource intensive and time-consuming. Modern GPU’s scarce DRAM capacity is the primary bottleneck that hinders the trainability and the training efficiency of UDNN. In this paper, we present "AccUDNN", an accelerator that aims to make the utmost use of finite […]

CUDA

Jan, 27

Direct N-body code on low-power embedded ARM GPUs

This work arises on the environment of the ExaNeSt project aiming at design and development of an exascale ready supercomputer with low energy consumption profile but able to support the most demanding scientific and technical applications. The ExaNeSt compute unit consists of densely-packed low-power 64-bit ARM processors, embedded within Xilinx FPGA SoCs. SoC boards are […]

OpenCL

high performance computing on graphics processing units: hgpu.org

Posts

Heterogeneous Distributed Big Data Clustering on Sparse Grids

Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression

ThunderGBM: Fast GBDTs and Random Forests on GPUs

Swizzle Inventor: Data Movement Synthesis for GPU Kernels

The OoO VLIW JIT Compiler for GPU Inference

High Performance Algorithms for Counting Collisions and Pairwise Interactions

DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression

Exploiting OpenMP & OpenACC to Accelerate a Molecular Docking Mini-App in Heterogeneous HPC Nodes

Supporting mixed-datatype matrix multiplication within the BLIS framework

Efficient Implementation and Optimization of Geometric Multigrid Operations in the LIFT Framework

AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Deep Neural Networks

Direct N-body code on low-power embedded ARM GPUs

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)