high performance computing on graphics processing units: hgpu.org

Posts

Apr, 17

Fast Arbitrary Precision Floating Point on FPGA

Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be […]

Apr, 17

PM4Py-GPU: a High-Performance General-Purpose Library for Process Mining

Open-source process mining provides many algorithms for the analysis of event data which could be used to analyze mainstream processes (e.g., O2C, P2P, CRM). However, compared to commercial tools, they lack the performance and struggle to analyze large amounts of data. This paper presents PM4Py-GPU, a Python process mining library based on the NVIDIA RAPIDS […]

Apr, 10

Optimizing Performance and Energy Efficiency in Massively Parallel Systems

Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator […]

OpenCL

Apr, 10

Extending SYCL’s Programming Paradigm with Tensor-based SIMD Abstractions

Heterogeneous computing has emerged as an important method for supporting more than one kind of processors or accelerators in a program. There is generally a trade off between source code portability and device performance for heterogeneous programming. Thus, new programming abstractions to assist programmers to reduce their development efforts while minimizing performance penalties is extremely […]

Apr, 10

Performance Models for Heterogeneous Iterative Programs

This article describes techniques to model the performance of heterogeneous iterative programs, which can execute on multiple device types (CPUs and GPUs). We have classified iterative programs into two categories – static and dynamic, based on their workload distributions. Methods are described to model their performance on multi-device machines using linear regression from statistics. Experiments […]

CUDA

Apr, 10

Persistent Kernels for Iterative Memory-bound GPU Applications

Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as much as time/algorithm steps there are. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose a scheme for running memory-bound […]

CUDA

Apr, 10

ALPINIST: An Annotation-Aware GPU Program Optimizer

GPU programs are widely used in industry. To obtain the best performance, a typical development process involves the manual or semi-automatic application of optimizations prior to compiling the code. To avoid the introduction of errors, we can augment GPU programs with (pre- and postcondition-style) annotations to capture functional properties. However, keeping these annotations correct when […]

CUDA

•

OpenCL

Mar, 27

Advanced Joins on GPUs

Over the past years, the rise of General Purpose GPU (GPGPU) paradigm has become more evident in high-performance computing. The massive parallelism that GPUs offer at low cost is the catalyst for its adoption in numerous computational intensive applications, where tremendous speedup gains are reported due to the ease of parallelization of the algorithms they […]

CUDA

Mar, 27

One-shot tuner for deep learning compilers

Auto-tuning DL compilers are gaining ground as an optimizing back-end for DL frameworks. While existing work can generate deep learning models that exceed the performance of hand-tuned libraries, they still suffer from prohibitively long auto-tuning time due to repeated hardware measurements in large search spaces. In this paper, we take a neural-predictor inspired approach to […]

CUDA

Mar, 27

Simulation Methodologies for Mobile GPUs

GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the […]

OpenCL

Mar, 27

Data transfer optimizations for heterogeneous managed runtime systems

Nowadays, most programmable systems contain multiple hardware accelerators with different characteristics. In order to use the available hardware resources and improve the performance of their applications, developers must use a low-level language, such as C/C++. Succeeding the same goal from a high-level managed language (Java, Haskell, C#) poses several challenges such as the inability to […]

CUDA

•

OpenCL

Mar, 27

Migrating CUDA to oneAPI: A Smith-Waterman Case Study

To face the programming challenges related to heterogeneous computing, Intel recently introduced oneAPI, a new programming environment that allows code developed in Data Parallel C++ (DPC++) language to be run on different devices such as CPUs, GPUs, FPGAs, among others. To tackle CUDA-based legacy codes, oneAPI provides a compatibility tool (dpct) that facilitates the migration […]

CUDA