high performance computing on graphics processing units: hgpu.org

Posts

Jul, 24

eGPU: A 750 MHz Class Soft GPGPU for FPGA

This paper introduces the eGPU, a SIMT soft processor designed for FPGAs. Soft processors typically achieve modest operating frequencies, a fraction of the headline performance claimed by modern FPGA families, and obtain correspondingly modest performance results. We propose a GPGPU architecture structured specifically to take advantage of both the soft logic and embedded features of […]

Jul, 16

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires lots of effort and expertise in both the application and systems domains. This is […]

CUDA

•

OpenCL

Jul, 16

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Building large AI fleets to support the rapidly growing DL workloads is an active research topic for modern cloud providers. Generating accurate benchmarks plays an essential role in designing the fast-paced software and hardware solutions in this space. Two fundamental challenges to make this scalable are (i) workload representativeness and (ii) the ability to quickly […]

CUDA

Jul, 16

Tile-based Lightweight Integer Compression in GPU

GPUs are increasingly used for high-performance and interactive data analytics workloads due to their capability to accelerate computation using massive parallelism. A key constraint of GPU-based data analytics today is the limited memory capacity in GPU devices. Data compression is a powerful technique that can mitigate the capacity limitation in two ways: (1) fitting more […]

CUDA

Jul, 16

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Many applications such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNN) that poses different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack […]

CUDA

Jul, 16

Improving the Performance, Portability, and Productivity of Hardware Accelerators

With the end of Moore’s Law and Dennard’s scaling, attention is shifting to new ways of enhancing computer performance. Improving microprocessor performance is becoming increasingly complex, whereas computational power demands still grow tremendously fast. In recent years, we are witnessing a paradigm change: rather than using one single chip, the CPU, for computing everything, computers […]

CUDA

Jul, 9

Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc

Leveraging Graphics Processing Units (GPUs) to accelerate scientific software has proven to be highly successful, but in order to extract more performance, GPU programmers must overcome the high latency costs associated with their use. One method of reducing or hiding this latency cost is to use asynchronous streams to issue commands to the GPU. While […]

CUDA

Jul, 9

Modeling Parallel Programs using Large Language Models

Parallel software codes in high performance computing (HPC) continue to grow in complexity and scale as we enter the exascale era. A diverse set of emerging hardware and programming paradigms make developing, optimizing, and maintaining parallel software burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. […]

Jul, 9

Optimization Techniques for GPU Programming

In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization […]

CUDA

•

OpenCL

Jul, 9

Matrix Multiplication Using Only Addition

Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPU’s are of increasing interest. In this paper, we demonstrate a method of performing matrix multiplication without a scalar multiplier circuit. In many cases of practical interest, only […]

Jul, 9

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design […]

Jul, 2

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SyCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and […]

CUDA

high performance computing on graphics processing units: hgpu.org

Posts

eGPU: A 750 MHz Class Soft GPGPU for FPGA

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Tile-based Lightweight Integer Compression in GPU

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Improving the Performance, Portability, and Productivity of Hardware Accelerators

Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc

Modeling Parallel Programs using Large Language Models

Optimization Techniques for GPU Programming

Matrix Multiplication Using Only Addition

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Most viewed papers (last 30 days)