high performance computing on graphics processing units: hgpu.org

Posts

Sep, 24

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

As recently demonstrated, Deep Neural Networks (DNN), usually trained using single precision IEEE 754 floating point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed formats have attracted considerable attention. In this paper, we focused on two families of formats that have already achieved interesting results in compressing binary32 numbers in […]

Sep, 24

Compiler-assisted distribution of OpenMP code for improved scalability

High performance computing is a complex field, with many homogeneous and heterogeneous hardware architectures, and numerous programming paradigms, libraries and compilers. OpenMP and netCDF are relatively widely used in Earth system research because they are comparatively easy to learn and yet can exploit the potential of a single compute node. However, Earth system scientists without […]

Sep, 24

Julia as a unifying end-to-end workflow language on the Frontier exascale system

We evaluate using Julia as a single language and ecosystem paradigm powered by LLVM to develop workflow components for high-performance computing. We run a Gray-Scott, 2-variable diffusion-reaction application using a memory-bound, 7-point stencil kernel on Frontier, the US Department of Energy’s first exascale supercomputer. We evaluate the feasibility, performance, scaling, and trade-offs of (i) the […]

Sep, 17

Improving the Efficiency of OpenCL Kernels through Pipes

Over the past few years, there has been an increased interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by […]

OpenCL

Sep, 17

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work […]

CUDA

Sep, 17

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview

In recent history, GPUs became a key driver of compute performance in HPC. With the installation of the Frontier supercomputer, they became the enablers of the Exascale era; further largest-scale installations are in progress (Aurora, El Capitan, JUPITER). But the early-day dominance by NVIDIA and their CUDA programming model has changed: The current HPC GPU […]

CUDA

Sep, 17

__host__ __device__: Generic programming in Cuda

We present patterns for Cuda/C++ to write safe generic code that works on both the host and the device side. Writing templated functions in Cuda/C++ for both the CPU and the GPU has the problem that, in general, both __host__ and __device__ functions are instantiated, which leads to many compiler warnings or errors.

CUDA

Sep, 17

Unified Shared Memory: Friend or Foe?

Adopting heterogeneous execution on GPUs and FPGAs in managed runtime systems, such as Java, is a challenging task due to the complexities of the underlying virtual machine. The majority of the current work has been focusing on compiler toolchains to solve the challenge of transparent just-in-time compilation of different code segments onto the accelerators. However, […]

CUDA, OpenCL

Sep, 6

Scope is all you need: Transforming LLMs for HPC Code

With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and […]

Sep, 6

Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs

Sparse matrix-vector multiplication (SpMV) is central to many scientific, engineering, and other applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms has been widely studied, and the access behavior of the GPU is often the performance bottleneck. The Ampere GPU architecture […]
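For readers unfamiliar with the format named in this abstract, here is a minimal pure-Python sketch of SpMV over a CSR matrix. It is not the paper's GPU implementation. The function name `csr_spmv`, the array names, and the 3x3 example matrix are illustrative assumptions.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  -- nonzero entries of A, stored row by row
    col_idx -- column index of each entry in `values`
    row_ptr -- row_ptr[i]:row_ptr[i + 1] is the slice of `values` for row i
    x       -- dense input vector
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical example matrix (assumption, not from the paper):
# [[10, 0, 2],
#  [ 0, 3, 0],
#  [ 1, 0, 4]]
values = [10.0, 2.0, 3.0, 1.0, 4.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_spmv(values, col_idx, row_ptr, x)
print(y)  # [16.0, 6.0, 13.0]
```

The lookup `x[col_idx[k]]` is data-dependent, so reads of `x` follow the sparsity pattern rather than a regular stride. That is one common reason access behavior matters for SpMV performance.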

CUDA

Sep, 6

Fortran High-Level Synthesis: Reducing the barriers to accelerating HPC codes on FPGAs

In recent years the use of FPGAs to accelerate scientific applications has grown, with numerous applications demonstrating the benefit of FPGAs for high performance workloads. However, whilst High Level Synthesis (HLS) has significantly lowered the barrier to entry in programming FPGAs by enabling programmers to use C++, a major challenge is that most often these […]

OpenCL

Sep, 6

PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability

We propose a novel computing runtime that exposes remote compute devices via the cross-vendor open heterogeneous computing standard OpenCL and can execute compute tasks on the MEC cluster side across multiple servers in a scalable manner. Intermittent UE connection loss is handled gracefully even if the device’s IP address changes on the way. Network-induced latency […]

OpenCL

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)