high performance computing on graphics processing units: hgpu.org

Posts

Oct, 1

Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems

Deep Learning has achieved state-of-the-art performance in several artificial intelligence tasks like object recognition, speech recognition, machine translation, and summarization. Deep learning is a subset of machine learning that learns multiple levels of data representation using Neural Networks (NNs). The rise of deep learning can be attributed to the presence of large datasets and computation […]

CUDA

Sep, 24

Julia as a unifying end-to-end workflow language on the Frontier exascale system

We evaluate using Julia as a single language and ecosystem paradigm powered by LLVM to develop workflow components for high-performance computing. We run a Gray-Scott, 2-variable diffusion-reaction application using a memory-bound, 7-point stencil kernel on Frontier, the US Department of Energy’s first exascale supercomputer. We evaluate the feasibility, performance, scaling, and trade-offs of (i) the […]

Sep, 24

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

In this paper, we evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, utilizing the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, […]

CUDA

•

OpenCL

Sep, 24

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs

The heterogeneous computing paradigm has led to the need for portable and efficient programming solutions that can leverage the capabilities of various hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the portability and performance of the SYCL and CUDA languages for one fundamental bioinformatics application (Smith-Waterman protein database search) across different […]

CUDA

Sep, 24

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

As recently demonstrated, Deep Neural Networks (DNN), usually trained using single precision IEEE 754 floating point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed format have attracted considerable attention. In this paper, we focused on two families of formats that have already achieved interesting results in compressing binary32 numbers in […]

Sep, 24

Compiler-assisted distribution of OpenMP code for improved scalability

High performance computing is a complex field, with many homogeneous and heterogeneous hardware architectures, and numerous programming paradigms, libraries and compilers. OpenMP and netCDF are relatively widely used in Earth system research because they are comparatively easy to learn and yet can exploit the potential of a single compute node. However, Earth system scientists without […]

Sep, 17

Improving the Efficiency of OpenCL Kernels through Pipes

Over the past few years, there has been an increased interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by […]

OpenCL

Sep, 17

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work […]

CUDA

Sep, 17

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview

In recent history, GPUs became a key driver of compute performance in HPC. With the installation of the Frontier supercomputer, they became the enablers of the Exascale era; further largest-scale installations are in progress (Aurora, El Capitan, JUPITER). But the early-day dominance by NVIDIA and their CUDA programming model has changed: The current HPC GPU […]

CUDA

Sep, 17

host device — Generic programming in Cuda

We present patterns for Cuda/C++ to write save generic code which works both on the host and device side. Writing templated functions in Cuda/C++ both for the CPU and the GPU bears the problem that in general both __host__ and __device__ functions are instantiated, which leads to lots of compiler warnings or errors.

CUDA

Sep, 17

Unified Shared Memory: Friend or Foe?

Adopting heterogeneous execution on GPUs and FPGAs in managed runtime systems, such as Java, is a challenging task due to the complexities of the underlying virtual machine. The majority of the current work has been focusing on compiler toolchains to solve the challenge of transparent just-in-time compilation of different code segments onto the accelerators. However, […]

CUDA

•

OpenCL

Sep, 6

Scope is all you need: Transforming LLMs for HPC Code

With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and […]

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Posts

Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems

Julia as a unifying end-to-end workflow language on the Frontier exascale system

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

Compiler-assisted distribution of OpenMP code for improved scalability

Improving the Efficiency of OpenCL Kernels through Pipes

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview

host device — Generic programming in Cuda

Unified Shared Memory: Friend or Foe?

Scope is all you need: Transforming LLMs for HPC Code

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)