Posts

Oct, 2

Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform

This paper assesses and reports the experience of eleven application teams working to build, validate, and benchmark several HPC applications on a novel GPU-accelerated Arm testbed. The testbed consists of the latest, at time of writing, Arm Devkits from NVIDIA with server-class Arm CPUs and NVIDIA A100 GPUs. The applications and mini-apps are written using […]
Oct, 2

Exploiting dynamic sparse matrices for performance portable linear algebra operations

Sparse matrices and linear algebra are at the heart of scientific simulations. More than 70 sparse matrix storage formats have been developed over the years, targeting a wide range of hardware architectures and matrix types. Each format is developed to exploit the particular strengths of an architecture, or the specific sparsity patterns of matrices, and […]
Sep, 11

Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism

Currently, offloading to accelerators requires users to identify which regions are to be executed on the device, what memory needs to be transferred, and how synchronization is to be resolved. On top of these manual tasks, many standard (C/C++ library) functions, such as file I/O or memory manipulation, cannot be directly executed on the device […]
Sep, 11

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models

Large transformer models display promising performance on a wide range of natural language processing (NLP) tasks. Although the AI community has expanded the model scale to the trillion parameter level, the practical deployment of 10-100 billion parameter models is still uncertain due to latency, throughput, and memory constraints. In this paper, we propose EnergonAI […]
Sep, 11

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU

Sparse compilers are a promising solution for sparse tensor algebra optimization. In compiler implementation, reduction in sparse-dense hybrid algebra plays a key role in performance. Though GPUs provide various reduction semantics that can better utilize their parallel computing and memory bandwidth capacity, the central question is: how to elevate the flexible reduction semantics to sparse […]
Sep, 11

SCALSALE: Scalable SALE Benchmark Framework for Supercomputers

Supercomputers worldwide provide the necessary infrastructure for groundbreaking research. However, most supercomputers are not designed equally due to different desired figures of merit, which are derived from the computational bounds of the targeted scientific applications’ portfolio. In turn, the design of such computers becomes an optimization process that strives to achieve the best performance possible […]
Sep, 11

Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs

Recently, CPU-GPU heterogeneous parallelism has brought transformational performance milestones to static timing analysis (STA) algorithms. As the computing ecosystem continues to proliferate, performance portability has emerged as a new challenge when deploying the result to diverse heterogeneous computing platforms. Specifically, the optimal code written on a CPU-GPU architecture may not be optimal for other CPU-GPU […]
Sep, 4

GGArray: A Dynamically Growable GPU Array

We present a dynamically Growable GPU array (GGArray) fully implemented in GPU that does not require synchronization with the host. The idea is to improve the programming of GPU applications that require dynamic memory, by offering a structure that does not require pre-allocating GPU VRAM for the worst case scenario. The GGArray is based on […]
Sep, 4

Lina: a fast design optimisation tool for software-based FPGA programming

The continuous technology push on the semiconductor industry has led to the development of several alternate architectures for efficient computing. Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) are examples of devices used to accelerate applications. FPGAs are able to provide massive parallelism for suitable tasks when properly programmed. However, designing for FPGA is […]
Sep, 4

Towards making the most of NLP-based device mapping optimization for OpenCL kernels

Nowadays, we are living in an era of extreme device heterogeneity. Despite the high variety of conventional CPU architectures, accelerator devices, such as GPUs and FPGAs, also appear in the foreground, exploding the pool of available solutions to execute applications. However, choosing the appropriate device for each application's needs is an extremely challenging task due to […]
Sep, 4

Understanding the Power of Evolutionary Computation for GPU Code Optimization

Achieving high performance for GPU codes requires developers to have significant knowledge of parallel programming and GPU architectures, and an in-depth understanding of the application. This combination makes it challenging to find performance optimizations for GPU-based applications, especially in scientific computing. This paper shows that significant speedups can be achieved on two quite different scientific workloads […]
Aug, 28

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems

In an era of hardware diversity, the management of applications’ allocated memory is a complex task that can have significant performance repercussions. Non-uniformity in the memory hierarchy, along with heterogeneity and asymmetry of chip designs, makes the costs of memory accesses unpredictable if the allocated memory is not managed carefully. Poor memory allocation and placement […]

* * *

HGPU group © 2010-2022 hgpu.org

All rights belong to the respective authors
