high performance computing on graphics processing units: hgpu.org

Posts

Dec, 24

Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations

This Master’s thesis investigates the performance dynamics of database operations – V-Search, Fuzzy Search, and Join – implemented on both Central Processing Units (CPU) and Graphics Processing Units (GPU). With the ever-increasing demand for efficient data processing, it has become crucial to understand and optimize the use of different hardware platforms for executing diverse database […]

CUDA

Dec, 18

Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay

HPC is a heterogeneous world in which host and device code are interleaved throughout the application. Given the significant performance advantage of accelerators, device code execution time is becoming the new bottleneck. Tuning the accelerated parts is consequently highly desirable but often impractical due to the large overall application runtime which includes unrelated host parts. […]

Dec, 18

Precision and Performance Analysis of C Standard Math Library Functions on GPUs

With the advent of GPU computing, executing large program sections on accelerators has become increasingly important. Efforts are being made to support the C standard library, LIBC, on GPUs via LLVM machinery. Therefore, the C standard math library, LIBM, must be supported on GPUs. So far, LLVM frontends, such as Clang, have relied on GPU […]

CUDA

Dec, 18

Application Performance Profiling on Intel GPUs with Oneprof and Onetrace

Modern supercomputing applications are complex programs built on optimized frameworks and accelerated on GPUs. As such, dedicated tools for profiling GPU kernel utilization and performance are needed to support development of these applications, which in turn accelerates progress for the scientific computing and machine learning communities. This paper presents the Oneprof and Onetrace tools from […]

OpenCL

Dec, 18

Principles for Automated and Reproducible Benchmarking

The diversity in processor technology used by High Performance Computing (HPC) facilities is growing, and so applications must be written in such a way that they can attain high levels of performance across a range of different CPUs, GPUs, and other accelerators. Measuring application performance across this wide range of platforms becomes crucial, but there […]

Dec, 18

cuSZ-I: High-Fidelity Error-Bounded Lossy Compression for Scientific Data on GPUs

Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. Compared to CPU-based scientific compressors, GPU-accelerated compressors exhibit substantially higher throughputs, which can thus better adapt to GPU-based scientific simulation applications. However, a critical limitation still lies in all existing GPU-accelerated error-bounded lossy compressors: they suffer from low compression ratios, which strictly […]

CUDA

Dec, 10

Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts

GPUs have cemented their position in computer systems, not restricted to graphics but also extensively used for general-purpose computing. With this comes a rapidly expanding population of developers using GPUs for programming. However, programming with GPUs is notoriously difficult due to their unique architecture and constant evolution. A large number of developers have encountered problems […]

CUDA • OpenCL

Dec, 10

Efficiently Processing Large Relational Joins on GPUs

With the growing interest in Machine Learning (ML), Graphics Processing Units (GPUs) have become key elements of any computing infrastructure. Their widespread deployment in data centers and the cloud raises the question of how to use them beyond ML use cases, with growing interest in employing them in a database context. In this paper, we […]

CUDA

Dec, 10

GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations

The Large Hadron Collider (LHC) at CERN will see an upgraded hardware configuration which will bring a new era of physics data taking and related computational challenges. To this end, it is necessary to exploit the ever increasing variety of computational architectures, featuring GPUs from multiple vendors and new accelerators. Performance portable frameworks, like SYCL, […]

CUDA

Dec, 10

Edge AI for Internet of Energy: Challenges and Perspectives

The digital landscape of the Internet of Energy (IoE) is on the brink of a revolutionary transformation with the integration of edge Artificial Intelligence (AI). This comprehensive review elucidates the promise and potential that edge AI holds for reshaping the IoE ecosystem. Commencing with a meticulously curated research methodology, the article delves into the myriad […]

Dec, 10

Compiler-centric across-stack deep learning acceleration

Optimizing the deployment of Deep Neural Networks (DNNs) is hard. Despite deep learning approaches increasingly providing state-of-the-art solutions to a variety of difficult problems, such as computer vision and natural language processing, DNNs can be prohibitively expensive, for example, in terms of inference time or memory usage. Effective exploration of the design space requires a […]

Dec, 3

A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators has become more and more pressing to design heterogeneous HPC platforms. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, […]

CUDA • OpenCL

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA
