high performance computing on graphics processing units: hgpu.org

Posts

Dec, 18

Application Performance Profiling on Intel GPUs with Oneprof and Onetrace

Modern supercomputing applications are complex programs built on optimized frameworks and accelerated on GPUs. As such, dedicated tools for profiling GPU kernel utilization and performance are needed to support development of these applications, which in turn accelerates progress for the scientific computing and machine learning communities. This paper presents the Oneprof and Onetrace tools from […]

OpenCL

Dec, 18

Principles for Automated and Reproducible Benchmarking

The diversity in processor technology used by High Performance Computing (HPC) facilities is growing, and so applications must be written in such a way that they can attain high levels of performance across a range of different CPUs, GPUs, and other accelerators. Measuring application performance across this wide range of platforms becomes crucial, but there […]

Dec, 18

cuSZ-I: High-Fidelity Error-Bounded Lossy Compression for Scientific Data on GPUs

Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. Compared to CPU-based scientific compressors, GPU-accelerated compressors exhibit substantially higher throughputs, which can thus better adapt to GPU-based scientific simulation applications. However, a critical limitation still lies in all existing GPU-accelerated error-bounded lossy compressors: they suffer from low compression ratios, which strictly […]

CUDA

Dec, 10

Compiler-centric across-stack deep learning acceleration

Optimizing the deployment of Deep Neural Networks (DNNs) is hard. Despite deep learning approaches increasingly providing state-of-the-art solutions to a variety of difficult problems, such as computer vision and natural language processing, DNNs can be prohibitively expensive, for example, in terms of inference time or memory usage. Effective exploration of the design space requires a […]

Dec, 10

Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts

GPUs have cemented their position in computer systems, not restricted to graphics but also extensively used for general-purpose computing. With this comes a rapidly expanding population of developers using GPUs for programming. However, programming with GPUs is notoriously difficult due to their unique architecture and constant evolution. A large number of developers have encountered problems […]

CUDA • OpenCL

Dec, 10

Efficiently Processing Large Relational Joins on GPUs

With the growing interest in Machine Learning (ML), Graphic Processing Units (GPUs) have become key elements of any computing infrastructure. Their widespread deployment in data centers and the cloud raises the question of how to use them beyond ML use cases, with growing interest in employing them in a database context. In this paper, we […]

CUDA

Dec, 10

GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations

The Large Hadron Collider (LHC) at CERN will see an upgraded hardware configuration which will bring a new era of physics data taking and related computational challenges. To this end, it is necessary to exploit the ever increasing variety of computational architectures, featuring GPUs from multiple vendors and new accelerators. Performance portable frameworks, like SYCL, […]

CUDA

Dec, 10

Edge AI for Internet of Energy: Challenges and Perspectives

The digital landscape of the Internet of Energy (IoE) is on the brink of a revolutionary transformation with the integration of edge Artificial Intelligence (AI). This comprehensive review elucidates the promise and potential that edge AI holds for reshaping the IoE ecosystem. Commencing with a meticulously curated research methodology, the article delves into the myriad […]

Dec, 3

A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators has become more and more pressing to design heterogeneous HPC platforms. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, […]

CUDA • OpenCL

Dec, 3

Testing and Mutation Testing for GPU Kernels

The increasing GPU performance and maturing computational platform make it possible to handle general-purpose computing jobs traditionally computed by the CPU. Also, just like what we did in the CPU program, we use testing to verify the correctness of the GPU program. However, the quality of the tests may remain unknown, which inspires us to […]

CUDA

Dec, 3

A Review of the Parallelization Strategies for Iterative Algorithms

Iteration-based algorithms have been widely used and achieved excellent results in many fields. However, in the big data era, data that needs to be processed is enormous in terms of both depth (the dimensionality of data) and breadth (the volume of data). Due to the slowdown of Moore’s Law, the computing power of single-core CPUs […]

CUDA • OpenCL

Dec, 3

CuPBoP-AMD: Extending CUDA to AMD Platforms

The proliferation of artificial intelligence applications has underscored the need for increased portability among graphic processing units (GPUs) from different vendors. With CUDA as one of the most popular GPU programming languages, CuPBoP (CUDA for Parallelized and Broad-range Processors) aims to provide NVIDIA’s proprietary CUDA language support to a variety of GPU and CPU platforms […]

CUDA • OpenCL

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)