high performance computing on graphics processing units: hgpu.org

Posts

Oct, 2

Exploiting dynamic sparse matrices for performance portable linear algebra operations

Sparse matrices and linear algebra are at the heart of scientific simulations. More than 70 sparse matrix storage formats have been developed over the years, targeting a wide range of hardware architectures and matrix types. Each format is developed to exploit the particular strengths of an architecture, or the specific sparsity patterns of matrices, and […]

CUDA

Sep, 11

Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism

Currently, offloading to accelerators requires users to identify which regions are to be executed on the device, what memory needs to be transferred, and how synchronization is to be resolved. On top of these manual tasks, many standard (C/C++ library) functions, such as file I/O or memory manipulation, cannot be directly executed on the device […]

CUDA

Sep, 11

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

Large transformer models display promising performance on a wide range of natural language processing (NLP) tasks. Although the AI community has expanded the model scale to the trillion parameter level, the practical deployment of 10-100 billion parameter models is still uncertain due to the latency, throughput, and memory constraints. In this paper, we proposed EnergonAI […]

CUDA

Sep, 11

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU

Sparse compiler is a promising solution for sparse tensor algebra optimization. In compiler implementation, reduction in sparse-dense hybrid algebra plays a key role in performance. Though GPU provides various reduction semantics that can better utilize the parallel computing and memory bandwidth capacity, the central question is: how to elevate the flexible reduction semantics to sparse […]

CUDA

Sep, 11

SCALSALE: Scalable SALE Benchmark Framework for Supercomputers

Supercomputers worldwide provide the necessary infrastructure for groundbreaking research. However, most supercomputers are not designed equally due to different desired figure of merit, which is derived from the computational bounds of the targeted scientific applications’ portfolio. In turn, the design of such computers becomes an optimization process that strives to achieve the best performances possible […]

Sep, 11

Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs

Recently, CPU-GPU heterogeneous parallelism has brought transformational performance milestones to static timing analysis (STA) algorithms. As the computing ecosystem continues to proliferate, performance portability has emerged as a new challenge when deploying the result to diverse heterogeneous computing platforms. Specifically, the optimal code written on a CPU-GPU architecture may not be optimal for other CPUGPU […]

OpenCL

Sep, 4

GGArray: A Dynamically Growable GPU Array

We present a dynamically Growable GPU array (GGArray) fully implemented in GPU that does not require synchronization with the host. The idea is to improve the programming of GPU applications that require dynamic memory, by offering a structure that does not require pre-allocating GPU VRAM for the worst case scenario. The GGArray is based on […]

CUDA

Sep, 4

Lina: a fast design optimisation tool for software-based FPGA programming

The continuous technology push on the semiconductor industry has led to the development of several alternate architectures for efficient computing. Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) are examples of devices used to accelerate applications. FPGAs are able to provide massive parallelism for suitable tasks when properly programmed. However, designing for FPGA is […]

OpenCL

Sep, 4

Towards making the most of NLP-based device mapping optimization for OpenCL kernels

Nowadays, we are living in an era of extreme device heterogeneity. Despite the high variety of conventional CPU architectures, accelerator devices, such as GPUs and FPGAs, also appear in the foreground exploding the pool of available solutions to execute applications. However, choosing the appropriate device per application needs is an extremely challenging task due to […]

OpenCL

Sep, 4

Understanding the Power of Evolutionary Computation for GPU Code Optimization

Achieving high performance for GPU codes requires developers to have significant knowledge in parallel programming and GPU architectures, and in-depth understanding of the application. This combination makes it challenging to find performance optimizations for GPU-based applications, especially in scientific computing. This paper shows that significant speedups can be achieved on two quite different scientific workloads […]

CUDA

Aug, 28

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems

In an era of hardware diversity, the management of applications’ allocated memory is a complex task that can have significant performance repercussions. Non-uniformity in the memory hierarchy, along with heterogeneity and asymmetry of chip designs, make the costs of memory accesses unpredictable if the allocated memory is not managed carefully. Poor memory allocation and placement […]

CUDA

•

OpenCL

Aug, 28

Fully Concurrent GPU Data Structures

Building efficient concurrent data structures that scale to the level of GPU parallelism is a challenging problem. Solutions designed for the CPU do not scale to the thousands of cores that modern GPUs offer. In this dissertation, we show how to efficiently build a lock-based B-Tree on the GPU; then, we show how to extend […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Exploiting dynamic sparse matrices for performance portable linear algebra operations

Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU

SCALSALE: Scalable SALE Benchmark Framework for Supercomputers

Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs

GGArray: A Dynamically Growable GPU Array

Lina: a fast design optimisation tool for software-based FPGA programming

Towards making the most of NLP-based device mapping optimization for OpenCL kernels

Understanding the Power of Evolutionary Computation for GPU Code Optimization

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems

Fully Concurrent GPU Data Structures

Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

Most viewed papers (last 30 days)