high performance computing on graphics processing units: hgpu.org

Posts

Dec, 3

CuPBoP-AMD: Extending CUDA to AMD Platforms

The proliferation of artificial intelligence applications has underscored the need for increased portability among graphics processing units (GPUs) from different vendors. With CUDA as one of the most popular GPU programming languages, CuPBoP (CUDA for Parallelized and Broad-range Processors) aims to provide NVIDIA’s proprietary CUDA language support to a variety of GPU and CPU platforms […]

CUDA, OpenCL

Dec, 3

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs

Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse (SpGEMM) algorithms, evaluating their performance running in a distributed memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous […]

CUDA

Nov, 27

GT4Py: High Performance Stencils for Weather and Climate Applications using Python

All major weather and climate applications are currently developed using languages such as Fortran or C++. This is typical in the domain of high performance computing (HPC), where efficient execution is an important concern. Unfortunately, this approach leads to implementations that intermix optimizations for specific hardware architectures with the high-level numerical methods that are typical […]

CUDA

Nov, 27

Accelerating bioinformatics applications on CUDA-enabled multi-GPU systems

A wide range of bioinformatics applications have to deal with a continuously growing amount of data generated by high-throughput sequencing techniques. Exclusively CPU-based workstations fail to keep up with the task. Instead of employing dozens of CPU cluster nodes to increase the computational power, massively parallel accelerators like modern CUDA-enabled GPUs can be used to […]

CUDA

Nov, 27

Evaluation of FPGA-based high performance computing platforms

High performance computing is a topic that has risen to the top in the era of digitalization, AI and automation. Therefore, the search for more cost- and time-effective ways to implement HPC work is an extensively researched subject. One part of this is to have hardware that is capable of improving on these […]

OpenCL

Nov, 27

Frameworks in Medical Image Analysis with Deep Neural Networks

In recent years, deep neural network based medical image analysis has become quite powerful and achieved similar results performance-wise as experts. Consequently, the integration of these tools into the clinical routine as clinical decision support systems is highly desired. The benefits of automatic image analysis for clinicians are massive, ranging from improved diagnostic as well […]

Nov, 27

FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification

Highly parallelized workloads like machine learning training, inference and general HPC tasks are greatly accelerated using GPU devices. In a cloud computing cluster, serving a GPU’s computation power through multi-task sharing is in high demand, since there are always more task requests than the number of GPUs available. Existing GPU sharing solutions focus on reducing task-level […]

CUDA

Nov, 19

Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++

In this study, we present a novel dataset for training machine learning models translating between OpenMP Fortran and C++ code. To ensure reliability and applicability, the dataset is created from a range of representative open-source OpenMP benchmarks. It is also refined using a meticulous code similarity test. The effectiveness of our dataset is assessed using […]

CUDA

Nov, 19

GPU Auto-tuning Framework for Optimal Performance and Power Consumption

An auto-tuning framework for GPU devices is presented for tuning OpenCL application kernels. The GPU tuner employs a multi-objective optimization methodology to improve the performance and power consumption of applications. It efficiently explores a user-defined solution space comprising possible tunable algorithmic and hardware counter variations through code transformations. The methodology targets GPU code […]

OpenCL

Nov, 19

ExaNBody: a HPC framework for N-Body applications

Increasing heterogeneity among HPC platforms requires applications to be frequently ported and tuned, adding burden to developers. Fast evolution of hardware mandates adaptation of algorithms and data structures to get higher performance, while application complexity constantly grows accordingly. Ensuring portability while preserving high performance at large scale along with minimal changes to an already existing […]

CUDA

Nov, 19

AFOCL: Portable OpenCL Programming of FPGAs via Automated Built-in Kernel Management

OpenCL provides a consistent programming model across CPUs, GPUs, and FPGAs. However, to get reasonable performance out of FPGAs, OpenCL programs created for other platforms need to be modified. These modifications are often vendor-specific, limiting the portability of OpenCL programs between devices from different vendors. In this paper, we propose AFOCL: a cross-vendor portable programming […]

OpenCL

Nov, 19

CHARM-SYCL: New Unified Programming Environment for Multiple Accelerator Types

Addressing performance portability across diverse accelerator architectures has emerged as a major challenge in the development of application and programming systems for high-performance computing environments. Although recent programming systems that focus on performance portability have significantly improved productivity in an effort to meet this challenge, the problem becomes notably more complex when compute nodes are […]

CUDA

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
