high performance computing on graphics processing units: hgpu.org

Posts

Dec, 31

Enabling Quantum Computer Simulations on AMD GPUs: a HIP Backend for Google’s qsim

Quantum computer simulators play a critical role in supporting the development and validation of quantum algorithms and hardware. This study focuses on porting Google’s qsim, a quantum computer simulator, to AMD Graphics Processing Units (GPUs). We leverage the existing qsim CUDA backend and harness the HIPIFY tool to provide a qsim HIP backend tailored for […]

CUDA

Dec, 31

Gaiwan: a Size-Polymorphic Typesystem for GPU Programs

General-purpose computing on graphics processing units (GPGPU) is increasingly used for number crunching tasks such as analyzing time series data. GPUs are a good fit for these tasks as they can execute many computations in parallel. To leverage this parallelism, the programmer is forced to carefully divide their input data into data blocks that are […]

OpenCL

Dec, 24

KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications

Nowadays, Graphics Processing Units (GPUs) dominate in a wide spectrum of computing realms and multi-task is increasingly applied in various complicated applications. To gain higher performance, multi-task programs require cumbersome programming efforts to take advantage of inter-kernel concurrency at source-code level. Although there exist works automatically scheduling kernels to enable inter-kernel concurrency, they all inevitably […]

CUDA

Dec, 24

FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data

While both the database and high-performance computing (HPC) communities utilize lossless compression methods to minimize floating-point data size, a disconnect persists between them. Each community designs and assesses methods in a domain-specific manner, making it unclear if HPC compression techniques can benefit database applications or vice versa. With the HPC community increasingly leaning towards in-situ […]

CUDA

Dec, 24

Experiences Building an MLIR-based SYCL Compiler

Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations and the inability to reason about host and device code simultaneously present major challenges […]

Dec, 24

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Binary code summarization, while invaluable for understanding code semantics, is challenging due to its labor-intensive nature. This study delves into the potential of large language models (LLMs) for binary code comprehension. To this end, we present BinSum, a comprehensive benchmark and dataset of over 557K binary functions and introduce a novel method for prompt synthesis […]

Dec, 24

Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations

This Master’s thesis investigates the performance dynamics of database operations – V-Search, Fuzzy Search, and Join – implemented on both Central Processing Units (CPU) and Graphics Processing Units (GPU). With the ever-increasing demand for efficient data processing, it has become crucial to understand and optimize the use of different hardware platforms for executing diverse database […]

CUDA

Dec, 18

Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay

HPC is a heterogeneous world in which host and device code are interleaved throughout the application. Given the significant performance advantage of accelerators, device code execution time is becoming the new bottleneck. Tuning the accelerated parts is consequently highly desirable but often impractical due to the large overall application runtime which includes unrelated host parts. […]

Dec, 18

Precision and Performance Analysis of C Standard Math Library Functions on GPUs

With the advent of GPU computing, executing large program sections on accelerators has become increasingly important. Efforts are being made to support the C standard library, LIBC, on GPUs via LLVM machinery. Therefore, the C standard math library, LIBM, must be supported on GPUs. So far, LLVM frontends, such as Clang, have relied on GPU […]

CUDA

Dec, 18

Application Performance Profiling on Intel GPUs with Oneprof and Onetrace

Modern supercomputing applications are complex programs built on optimized frameworks and accelerated on GPUs. As such, dedicated tools for profiling GPU kernel utilization and performance are needed to support development of these applications, which in turn accelerates progress for the scientific computing and machine learning communities. This paper presents the Oneprof and Onetrace tools from […]

OpenCL

Dec, 18

Principles for Automated and Reproducible Benchmarking

The diversity in processor technology used by High Performance Computing (HPC) facilities is growing, and so applications must be written in such a way that they can attain high levels of performance across a range of different CPUs, GPUs, and other accelerators. Measuring application performance across this wide range of platforms becomes crucial, but there […]

Dec, 18

cuSZ-I: High-Fidelity Error-Bounded Lossy Compression for Scientific Data on GPUs

Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. Compared to CPU-based scientific compressors, GPU-accelerated compressors exhibit substantially higher throughputs, which can thus better adapt to GPU-based scientific simulation applications. However, a critical limitation still lies in all existing GPU-accelerated error-bounded lossy compressors: they suffer from low compression ratios, which strictly […]

CUDA

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Enabling Quantum Computer Simulations on AMD GPUs: a HIP Backend for Google’s qsim

Gaiwan: a Size-Polymorphic Typesystem for GPU Programs

KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications

FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data

Experiences Building an MLIR-based SYCL Compiler

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations

Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay

Precision and Performance Analysis of C Standard Math Library Functions on GPUs

Application Performance Profiling on Intel GPUs with Oneprof and Onetrace

Principles for Automated and Reproducible Benchmarking

cuSZ-I: High-Fidelity Error-Bounded Lossy Compression for Scientific Data on GPUs

Recent source codes

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

Most viewed papers (last 30 days)