high performance computing on graphics processing units: hgpu.org

Posts

Aug, 28

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems

In an era of hardware diversity, the management of applications’ allocated memory is a complex task that can have significant performance repercussions. Non-uniformity in the memory hierarchy, along with heterogeneity and asymmetry of chip designs, make the costs of memory accesses unpredictable if the allocated memory is not managed carefully. Poor memory allocation and placement […]

CUDA

•

OpenCL

Aug, 28

Fully Concurrent GPU Data Structures

Building efficient concurrent data structures that scale to the level of GPU parallelism is a challenging problem. Solutions designed for the CPU do not scale to the thousands of cores that modern GPUs offer. In this dissertation, we show how to efficiently build a lock-based B-Tree on the GPU; then, we show how to extend […]

CUDA

Aug, 28

Long Code for Code Search

Thanks to the Transformer-based pretraining models, the performance of code search has been improved significantly. However, due to the restriction of multi-head self-attention and GPU memory, there is a limit on the input token length. The existing pretrained code models, such as GraphCodeBERT, CodeBERT, RoBERTa (code), take the first 256 tokens by default, which makes […]

Aug, 28

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Graphics processing units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and […]

Aug, 28

Exploring Thread Coarsening on FPGA

Over the past few years, there has been an increased interest in including FPGAs in data centers and high-performance computing clusters along with GPUs and other accelerators. As a result, it has become increasingly important to have a unified, high-level programming interface for CPUs, GPUs and FPGAs. This has led to the development of compiler […]

OpenCL

Aug, 21

BenchPress: A Deep Active Benchmark Generator

We develop BenchPress, the first ML benchmark generator for compilers that is steerable within feature space representations of source code. BenchPress synthesizes compiling functions by adding new code in any part of an empty or existing sequence by jointly observing its left and right context, achieving excellent compilation rate. BenchPress steers benchmark generation towards desired […]

OpenCL

Aug, 21

Performance analysis of matrix-free conjugate gradient kernels using SYCL

We examine the performance of matrix-free SYCL implementations of the conjugate gradient method for solving sparse linear systems of equations. Performance is tested on an NVIDIA A100-80GB device and a dual socket Intel Ice Lake CPU node using different SYCL implementations, and compared to CUDA BLAS (cuBLAS) implementations on the A100 GPU and MKL implementations […]

OpenCL

Aug, 21

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Training deep neural networks (DNNs) is becoming more and more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More […]

Aug, 21

Optimization of GPU workloads using natural language processing based on deep learning techniques

Setting program parameters is challenging due to the abstract relationship between hardware and software. Automatic optimization algorithms that are accurate are required to cope with the complexity and variety of current hardware and software. Autotuning has always relied on time-consuming trial and error approaches. Machine learning (ML) and Natural Language Processing (NLP) has flourished over […]

CUDA

•

OpenCL

Aug, 21

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, […]

CUDA

Aug, 7

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such graphical processing units (GPUs) in a distributed manner. In the academic field, researchers get access to this kind of resources through High […]

CUDA

Aug, 7

Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities

To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the […]

OpenCL

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Posts

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems

Fully Concurrent GPU Data Structures

Long Code for Code Search

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Exploring Thread Coarsening on FPGA

BenchPress: A Deep Active Benchmark Generator

Performance analysis of matrix-free conjugate gradient kernels using SYCL

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Optimization of GPU workloads using natural language processing based on deep learning techniques

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters

Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)