high performance computing on graphics processing units: hgpu.org

Posts

Feb, 26

GPU Offloading in ExaHyPE Through C++ Standard Algorithms

The ISO C++17 standard introduces parallel algorithms, a parallel programming model promising portability across a wide variety of parallel hardware including multi-core CPUs, GPUs, and FPGAs. Since 2019, the NVIDIA HPC SDK compiler suite supports this programming model for multi-core CPUs and GPUs. ExaHyPE is a solver engine for hyperbolic partial differential equations for complex […]

CUDA

Feb, 12

A Survey on Optimization Techniques for Edge Artificial Intelligence (AI)

Artificial Intelligence (Al) models are being produced and used to solve a variety of current and future business and technical problems. Therefore, AI model engineering processes, platforms, and products are acquiring special significance across industry verticals. For achieving deeper automation, the number of data features being used while generating highly promising and productive AI models […]

Feb, 12

EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs

Iterative stencil computations are widely used in numerical simulations. They present a high degree of parallelism, high locality and mostly-coalesced memory access patterns. Therefore, GPUs are good candidates to speed up their computation. However, the development of stencil programs that can work with huge grids in distributed systems with multiple GPUs is not straightforward, since […]

OpenCL

Feb, 12

Improving Performance of Hardware Accelerators by Optimizing Data Movement: A Bioinformatics Case Study

Modern hardware accelerator cards create an accessible platform for developers to reduce execution times for computationally expensive algorithms. The most widely used systems, however, have dedicated memory spaces, resulting in the processor having to transfer data to the accelerator-card memory space before the computation can be executed. Currently, the performance increase from using an accelerator […]

OpenCL

Feb, 12

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning

Tensor program tuning is a non-convex objective optimization problem, to which search-based approaches have proven to be effective. At the core of the search-based approaches lies the design of the cost model. Though deep learning-based cost models perform significantly better than other methods, they still fall short and suffer from the following problems. First, their […]

CUDA

Feb, 12

ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills

Generalizable manipulation skills, which can be composed to tackle long-horizon and complex daily chores, are one of the cornerstones of Embodied AI. However, existing benchmarks, mostly composed of a suite of simulatable environments, are insufficient to push cutting-edge research works because they lack object-level topological and geometric variations, are not based on fully dynamic simulation, […]

CUDA

Feb, 5

Evaluation of Rust for GPGPU high-performance computing

Research within computer science constantly aims to find ways to improve computing performance in various ways. With the apparent death of Moore’s Law, researchers are focused on exploiting other ways of improving performance, for example, programming language optimizations. Fields such as web development, databases and, machine learning are adopting new programming languages but GPU programming […]

CUDA

Feb, 5

Optimization of massive data applications on heterogeneous architectures

In the last few years, the heterogeneous architectures have become dominant in each part of the computing industry: from heterogeneous GPU accelerators joining multi-core CPUs within the same chip, to Systems on Chip that integrate DSPs or. The main motivation of this thesis is the fact that there is no implementation with optimal solution for […]

OpenCL

Feb, 5

Revisiting Query Performance in GPU Database Systems

GPUs offer massive compute parallelism and high-bandwidth memory accesses. GPU database systems seek to exploit those capabilities to accelerate data analytics. Although modern GPUs have more resources (e.g., higher DRAM bandwidth) than ever before, judicious choices for query processing that avoid wasteful resource allocations are still advantageous. Database systems can save GPU runtime costs through […]

CUDA

Feb, 5

CMLCompiler: A Unified Compiler for Classical Machine Learning

Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to utilize the state-of-the-practice devices fully and performs poorly. Without a unified framework, the hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified […]

CUDA

Feb, 5

A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code

Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU’s parallel architecture. Directive-based programming models, such as OpenACC, are one such method that easily enables parallel computing by just adhering code annotations to code loops. Such abstract models, however, often prevent programmers from […]

CUDA

Jan, 29

Pulsar search acceleration using FPGAs and OpenCL templates

The Square Kilometre Array (SKA) is the world’s largest radio telescope currently under construction, and will employ elaborate signal processing to detect new pulsars, i.e. highly magnetised rotating neutron stars. This paper addresses the acceleration of demanding computations for this pulsar search on Field-Programmable Gate Arrays (FPGAs) using a new high-level design process based on […]

OpenCL

high performance computing on graphics processing units: hgpu.org

Posts

GPU Offloading in ExaHyPE Through C++ Standard Algorithms

A Survey on Optimization Techniques for Edge Artificial Intelligence (AI)

EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs

Improving Performance of Hardware Accelerators by Optimizing Data Movement: A Bioinformatics Case Study

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning

ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills

Evaluation of Rust for GPGPU high-performance computing

Optimization of massive data applications on heterogeneous architectures

Revisiting Query Performance in GPU Database Systems

CMLCompiler: A Unified Compiler for Classical Machine Learning

A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code

Pulsar search acceleration using FPGAs and OpenCL templates

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)