Posts
Apr, 17
Explicit caching HYB: a new high-performance SpMV framework on GPGPU
Sparse Matrix-Vector Multiplication (SpMV) is a critical operation in the iterative solvers of Finite Element Methods used in computer simulation. Since SpMV is a memory-bound algorithm, the efficiency of data movement heavily influences its performance on GPUs. In recent years, much research has been conducted on accelerating the performance of SpMV on […]
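As context for why SpMV is memory-bound, here is a minimal CSR (Compressed Sparse Row) SpMV sketch in plain Python; this is the standard baseline format, not the paper's HYB scheme, and the matrix data is made up for illustration. Note the irregular gather from the dense vector `x`: on a GPU that access pattern, not arithmetic, dominates the runtime.

```python
# CSR SpMV sketch: y = A @ x.
# values/col_idx are streamed once per nonzero, while x is accessed
# irregularly -- the memory traffic that makes SpMV memory-bound.

def spmv_csr(values, col_idx, row_ptr, x):
    """Multiply a CSR-stored sparse matrix by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # irregular gather from x
        y[i] = acc
    return y

# 3x3 example:  [[4, 0, 1],
#                [0, 2, 0],
#                [3, 0, 5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0])  # [5.0, 2.0, 8.0]
```

Formats such as ELL, COO, and hybrids like HYB exist precisely to regularize this access pattern for wide SIMD/GPU hardware.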
Apr, 17
Performance study on GPU offloading techniques using the Gauss matrix inverse algorithm
Inverting matrices is a crucial part of many algorithms in linear algebra, computer graphics, and data analysis. Many libraries provide algorithms for this, but none allow calling them from a GPU context. GPUs and accelerators are becoming more and more prevalent in high-performance computers. With no ready-to-use implementation available, scientists need to […]
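For reference, the Gauss (Gauss-Jordan) inversion algorithm the post's title refers to can be sketched on the host side in a few lines of plain Python; this is a generic textbook version with partial pivoting, not the paper's GPU offloading code.

```python
# Gauss-Jordan elimination with partial pivoting: invert an n x n matrix
# by reducing the augmented system [A | I] to [I | A^-1].

def gauss_inverse(a):
    n = len(a)
    # Build the augmented matrix [A | I].
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest pivot magnitude.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Normalize the pivot row.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from all other rows.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

inv = gauss_inverse([[4.0, 7.0], [2.0, 6.0]])
# A^-1 = (1/10) * [[6, -7], [-2, 4]] = [[0.6, -0.7], [-0.2, 0.4]]
```

The row elimination step is embarrassingly parallel across rows, which is what makes the algorithm a natural case study for GPU offloading techniques.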
Apr, 17
PM4Py-GPU: a High-Performance General-Purpose Library for Process Mining
Open-source process mining provides many algorithms for the analysis of event data, which can be used to analyze mainstream processes (e.g., O2C, P2P, CRM). However, compared to commercial tools, they lack performance and struggle to analyze large amounts of data. This paper presents PM4Py-GPU, a Python process mining library based on the NVIDIA RAPIDS […]
Apr, 17
Fast Arbitrary Precision Floating Point on FPGA
Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be […]
Apr, 10
Optimizing Performance and Energy Efficiency in Massively Parallel Systems
Heterogeneous systems are becoming increasingly relevant due to their performance and energy-efficiency capabilities, and are present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator […]
Apr, 10
Extending SYCL’s Programming Paradigm with Tensor-based SIMD Abstractions
Heterogeneous computing has emerged as an important method for supporting more than one kind of processor or accelerator in a program. There is generally a trade-off between source-code portability and device performance in heterogeneous programming. Thus, new programming abstractions that assist programmers in reducing their development effort while minimizing performance penalties are extremely […]
Apr, 10
Performance Models for Heterogeneous Iterative Programs
This article describes techniques to model the performance of heterogeneous iterative programs, which can execute on multiple device types (CPUs and GPUs). We have classified iterative programs into two categories, static and dynamic, based on their workload distributions. Methods are described to model their performance on multi-device machines using linear regression. Experiments […]
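A minimal sketch of the linear-regression modeling idea, in plain Python with made-up profiling numbers: fit a model `t = a + b * w` (execution time versus workload share) by ordinary least squares, then use it to predict a device's time for a candidate workload split. The sample data and the single-predictor form are assumptions for illustration, not the article's actual models.

```python
# Fit t = a + b * w by ordinary least squares on (workload, time) samples.

def fit_linear(ws, ts):
    """Return intercept a and slope b minimizing sum((a + b*w - t)^2)."""
    n = len(ws)
    mw = sum(ws) / n
    mt = sum(ts) / n
    b = (sum((w - mw) * (t - mt) for w, t in zip(ws, ts))
         / sum((w - mw) ** 2 for w in ws))
    a = mt - b * mw
    return a, b

# Hypothetical profiling samples: (workload fraction, measured seconds).
a, b = fit_linear([0.2, 0.4, 0.6, 0.8], [1.1, 2.1, 3.1, 4.1])
pred = a + b * 0.5  # predicted time at a 50% workload share
```

Given such a model per device, a scheduler can pick the workload split that equalizes the predicted per-device times for each iteration.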
Apr, 10
Persistent Kernels for Iterative Memory-bound GPU Applications
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts as the barrier required after advancing the solution at each time step. We propose a scheme for running memory-bound […]
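The launch-pattern contrast the excerpt describes can be sketched structurally; the following is plain Python standing in for GPU code, with a toy `step()` update, so it only illustrates the control flow, not real device execution. The conventional scheme pays one launch (and implicit barrier) per time step, while the persistent scheme launches once and keeps the time loop inside the kernel, relying on a device-wide synchronization instead.

```python
# Structural sketch: per-step kernel launches vs. one persistent kernel.

launches = {"conventional": 0, "persistent": 0}

def step(u):
    # Toy memory-bound update: 1D 3-point average with fixed boundaries.
    return ([u[0]]
            + [(u[i-1] + u[i] + u[i+1]) / 3.0 for i in range(1, len(u) - 1)]
            + [u[-1]])

def kernel_one_step(u):
    launches["conventional"] += 1   # one launch per time step;
    return step(u)                  # kernel exit is the implicit barrier

def persistent_kernel(u, n_steps):
    launches["persistent"] += 1     # a single launch
    for _ in range(n_steps):        # time loop lives inside the kernel;
        u = step(u)                 # a grid-wide barrier would go here
    return u

u0 = [0.0, 0.0, 9.0, 0.0, 0.0]
a = u0
for _ in range(100):
    a = kernel_one_step(a)          # 100 launches
b = persistent_kernel(u0, 100)      # 1 launch, same numerics
```

On real hardware the in-kernel barrier would be something like a CUDA cooperative-groups grid sync; the payoff is eliminating per-step launch overhead.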
Apr, 10
ALPINIST: An Annotation-Aware GPU Program Optimizer
GPU programs are widely used in industry. To obtain the best performance, a typical development process involves the manual or semi-automatic application of optimizations prior to compiling the code. To avoid the introduction of errors, we can augment GPU programs with (pre- and postcondition-style) annotations to capture functional properties. However, keeping these annotations correct when […]
Mar, 27
Advanced Joins on GPUs
Over the past years, the rise of the General-Purpose GPU (GPGPU) paradigm has become more evident in high-performance computing. The massive parallelism that GPUs offer at low cost is the catalyst for their adoption in numerous computationally intensive applications, where tremendous speedup gains are reported due to the ease of parallelization of the algorithms they […]
Mar, 27
One-shot tuner for deep learning compilers
Auto-tuning DL compilers are gaining ground as optimizing back-ends for DL frameworks. While existing work can generate deep learning models that exceed the performance of hand-tuned libraries, it still suffers from prohibitively long auto-tuning times due to repeated hardware measurements in large search spaces. In this paper, we take a neural-predictor-inspired approach to […]
Mar, 27
Data transfer optimizations for heterogeneous managed runtime systems
Nowadays, most programmable systems contain multiple hardware accelerators with different characteristics. In order to use the available hardware resources and improve the performance of their applications, developers must use a low-level language, such as C/C++. Achieving the same goal from a high-level managed language (Java, Haskell, C#) poses several challenges, such as the inability to […]