high performance computing on graphics processing units: hgpu.org

Posts

May, 3

TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing

Deep learning frameworks supported by the CUDA parallel computing platform boost advances in machine learning research. The advantage of parallel processing largely comes from the efficiency of matrix-matrix multiplication using many CUDA-enabled graphics processing units (GPUs). Therefore, for recurrent neural networks (RNNs), the usage of a zero-filled matrix representing variable lengths of sentences for a […]

CUDA

May, 3

Automatic Test Case Reduction for OpenCL

We report on an extension to the C-Reduce tool, for automatic reduction of C test cases, to handle OpenCL kernels. This enables an automated method for detecting bugs in OpenCL compilers, by generating large random kernels using the CLsmith generator, identifying kernels that yield result differences across OpenCL platforms and optimisation levels, and using our […]

OpenCL

May, 3

Polly-ACC: Transparent compilation to heterogeneous hardware

Programming today’s increasingly complex heterogeneous hardware is difficult, as it commonly requires the use of data-parallel languages, pragma annotations, specialized libraries, or DSL compilers. Adding explicit accelerator support into a larger code base is not only costly, but also introduces additional complexity that hinders long-term maintenance. We propose a new heterogeneous compiler that brings us […]

CUDA

May, 3

Exposing Errors Related to Weak Memory in GPU Applications

We present the systematic design of a testing environment that uses stressing and fuzzing to reveal errors in GPU applications that arise due to weak memory effects. We evaluate our approach on seven GPUs spanning three Nvidia architectures, across ten CUDA applications that use fine-grained concurrency. Our results show that applications that rarely or never […]

CUDA

Apr, 29

Array Program Transformation with Loo.py by Example: High-Order Finite Elements

To concisely and effectively demonstrate the capabilities of our program transformation system Loo.py, we examine a transformation path from two real-world Fortran subroutines as found in a weather model to a single high-performance computational kernel suitable for execution on modern GPU hardware. Along the transformation path, we encounter kernel fusion, vectorization, prefetching, parallelization, and algorithmic […]

OpenCL

Apr, 29

On the design of sparse hybrid linear solvers for modern parallel architectures

In the context of this thesis, our focus is on numerical linear algebra, more precisely on the solution of large sparse systems of linear equations. We focus on designing efficient parallel implementations of MaPHyS, a hybrid linear solver based on domain decomposition techniques. First we investigate the MPI+threads approach. In MaPHyS, the first level of parallelism […]

CUDA

Apr, 29

Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime

There are billions of lines of sequential code in today's software that do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving such a goal. […]

OpenCL

Apr, 29

Adaptive GPU Array Layout Auto-Tuning

Optimal performance is an important goal in compute intensive applications. For GPU applications, this requires a lot of experience and knowledge about the algorithms and the underlying hardware, making them an ideal target for autotuning approaches. We present an auto-tuner which optimizes array layouts in CUDA applications. Depending on the data and program parameters, kernels […]

CUDA

Apr, 29

Parallel Subgraph Mining on Hybrid Platforms: HPC Systems, Multi-Cores and GPUs

Frequent subgraph mining (FSM) is an important problem in numerous application areas, such as computational chemistry, bioinformatics, social networks, computer programming languages, etc. However, the problem is computationally hard because it requires enumerating possibly an exponential number of candidate subgraph patterns, and checking their presence in a single large graph or a database of graphs. […]

CUDA

Apr, 29

A Survey of Cache Bypassing Techniques

With increasing core-count, the cache demand of modern processors has also increased. However, due to strict area/power budgets and presence of poor data-locality workloads, blindly scaling cache capacity is both infeasible and ineffective. Cache bypassing is a promising technique to increase effective cache capacity without incurring power/area costs of a larger sized cache. However, injudicious […]

Apr, 26

GPU-Aware Non-contiguous Data Movement In Open MPI

Due to better parallel density and power efficiency, GPUs have become more popular for use in scientific applications. Many of these applications are based on the ubiquitous Message Passing Interface (MPI) programming paradigm, and take advantage of non-contiguous memory layouts to exchange data between processes. However, support for efficient non-contiguous data movements for GPU-resident data […]

CUDA

Apr, 26

Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures

The alpaka library defines and implements an abstract hierarchical redundant parallelism model. This model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. This makes it possible to achieve portability of performant codes across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a […]

CUDA
