high performance computing on graphics processing units: hgpu.org

Posts

May, 12

Improving Resource Efficiency in Virtualized Datacenters

In recent years there has been an extraordinary growth of the Internet of Things (IoT) and its protocols. The increasing diffusion of electronic devices with identification, computing and communication capabilities is laying ground for the emergence of a highly distributed service and networking environment. The above mentioned situation implies that there is an increasing demand […]

CUDA

May, 12

FPGA Implementation of Reduced Precision Convolutional Neural Networks

With the improvement in processing systems, machine learning applications are finding widespread use in almost all sectors of technology. Image recognition is one application of machine learning which has become widely popular with various architectures and systems aimed at improving recognition performance. With classification accuracy now approaching saturation point, many researchers are now focusing on […]

CUDA

May, 12

Arbitrarily large iterative tomographic reconstruction on multiple GPUs using the TIGRE toolbox

Tomographic image sizes keep increasing over time and while the GPUs that compute the tomographic reconstruction are also increasing in memory size, they are not doing so fast enough to reconstruct the largest datasets. This problem is often solved by reconstructing data in large clusters of GPUs with enough devices to fit the measured X-ray […]

CUDA

May, 12

Predictable GPGPU Computing in DNN-Driven Autonomous Systems

Graphics processing units (GPUs) are being widely used as co-processors in many domains to accelerate general-purpose workloads that are data-parallel and computationally intensive, i.e., GPGPU. An emerging usage domain is adopting GPGPU to accelerate inherently computation-intensive Deep Neural Network (DNN) workloads in autonomous systems. Such autonomous systems are usually time-sensitive, especially for autonomous driving systems. […]

CUDA

May, 12

Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges […]

CUDA

May, 8

Characterizing and Detecting CUDA Program Bugs

While CUDA has become a major parallel computing platform and programming model for general-purpose GPU computing, CUDA-induced bug patterns have not yet been well explored. In this paper, we conduct the first empirical study to reveal important categories of CUDA program bug patterns based on 319 bugs identified within 5 popular CUDA projects in GitHub. […]

CUDA

May, 8

FPGA-based acceleration of a particle simulation High Performance Computing application

This thesis studies the possibility of integrating FPGAs into High Performance Computing (HPC) systems. Such systems are hybrid platforms that exploit the parallel computation of GPUs in order to reach very high performance. Nevertheless, GPU-based systems are power-hungry and require a power consumption so large that […]

OpenCL

May, 8

TensorNetwork: A Library for Physics and Machine Learning

TensorNetwork is an open source library for implementing tensor network algorithms. Tensor networks are sparse data structures originally designed for simulating quantum many-body physics, but are currently also applied in a number of other research areas, including machine learning. We demonstrate the use of the API with applications in both physics and machine learning, with details […]

May, 5

Principles, Techniques, and Tools for Explicit and Automatic Parallelization

The end of Dennard scaling also brought an end to frequency scaling as a means to improve performance. Chip manufacturers had to abandon frequency and superscalar scaling as processors became increasingly power constrained. An architecture’s power budget became the limiting factor to performance gains, and computations had to be performed more energy-efficiently. Designers turned to […]

CUDA

May, 5

Compressed Learning of Deep Neural Networks for OpenCL-Capable Embedded Systems

Deep neural networks (DNNs) have been quite successful in solving many complex learning problems. However, DNNs tend to have a large number of learning parameters, leading to a large memory and computation requirement. In this paper, we propose a model compression framework for efficient training and inference of deep neural networks on embedded systems. Our […]

OpenCL

May, 5

An Architectural Journey into RISC Architectures for HPC Workloads

The race to the Exascale (i.e., 10^18 floating-point operations per second) together with the slow-down of Moore’s law are posing unprecedented challenges to the whole High-Performance Computing (HPC) community. Computer architects, system integrators and software engineers studying programming models for handling parallelism are especially called to the rescue in a moment like the one […]

May, 5

Full-stack Optimization for Accelerating CNNs with FPGA Validation

We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate arrays (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization and inference accuracy. […]

CUDA

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
