high performance computing on graphics processing units: hgpu.org

Posts

Dec, 11

Towards energy efficiency and productivity for decision making in mobile robot navigation

Our goal in this work is to make it easy and feasible to implement solutions for autonomous decision-making and planning under uncertainty on low-power mobile platforms. We focus on practical applications, such as autonomous driving and service robotics, that must run on mobile SoC platforms. These applications often have real-time execution constraints. The main challenge […]

OpenCL

Dec, 4

Fast convolution kernels on pascal GPU with high memory efficiency

The convolution computation is widely used in many fields, especially in CNNs. Because of the rapid growth of the training data in CNNs, GPUs have been used for the acceleration, and memory-efficient algorithms are focused because of thier high performance. In this paper, we propose two convolution kernels for single-channel convolution and multi-channel convolution respectively. […]

CUDA

Dec, 4

Three Contributions to the Theory and Practice of Optimizing Compilers

The theory and practice of optimizing compilers gather techniques that, from input computer programs, aim at generating code making best use of modern computer hardware. On the theory side, this thesis contributes new results and algorithms in polyhedral geometry. On the practical side, this thesis contributes techniques for the tuning of parameters of programs targeting […]

CUDA

Dec, 4

Efficient Incremental Text-to-Speech on GPUs

Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most of the existing speech synthesis pipelines deployed on GPU are still non-incremental, which uncovers limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural […]

Dec, 4

SkyFlow: Heterogeneous streaming for skyline computation using FlowGraph and SYCL

The skyline is an optimization operator widely used for multi-criteria decision making. It allows minimizing an n-dimensional dataset into its smallest subset. In this work we present SkyFlow, the first heterogeneous CPU+GPU graph-based engine for skyline computation on a stream of data queries. Two data flow approaches, Coarse-grained and Fine-grained, have been proposed for different […]

Dec, 4

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Transformer models have achieved state-of-the-art performance on various domains of applications and gradually becomes the foundations of the advanced large deep learning (DL) models. However, how to train these models over multiple GPUs efficiently is still challenging due to a large number of parallelism choices. Existing DL systems either rely on manual efforts to make […]

CUDA

Nov, 27

User-Driven Online Kernel Fusion for SYCL

Heterogeneous programming models are becoming increasingly popular to support the ever-evolving hardware architectures, especially for new and emerging specialized accelerators optimizing speciic tasks. While such programs provide performance portability of the existing applications across various heterogeneous architectures to some extent, short-running device kernels can affect an application performance due to overheads of data transfer, synchronization […]

OpenCL

Nov, 27

SciAI4Industry – Solving PDEs for industry-scale problems with deep learning

Solving partial differential equations with deep learning makes it possible to reduce simulation times by multiple orders of magnitude and unlock scientific methods that typically rely on large numbers of sequential simulations, such as optimization and uncertainty quantification. Two of the largest challenges of adopting scientific AI for industrial problem settings is that training datasets […]

Nov, 27

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad

Recent years have seen increased interest from the HPC community in Field Programmable Gate Arrays (FPGAs) as an alternative/additional accelerator. This has been largely due to the slowdown in the transistor scaling and the difficulty of gaining performance improvement and energy efficiency from the current processing solutions. General (scientific) software programmers have shied away from […]

OpenCL

Nov, 27

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also extensively been studied. The computational power provided by the modern GPUs as well as the rapid development of multicore CPU systems […]

CUDA

Nov, 27

Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment

Background and objectives. The computational biology area is growing up over the years. The interest in researching and developing computational tools for the acquisition, storage, organization, analysis, and visualization of biological data generates the need to create new hardware architectures and new software tools that allow processing big data in acceptable times. In this sense, […]

CUDA

•

OpenCL

Nov, 20

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Transformers have become central to recent advances in computer vision. However, training a vision Transformer (ViT) model from scratch can be resource intensive and time consuming. In this paper, we aim to explore approaches to reduce the training costs of ViT models. We introduce some algorithmic improvements to enable training a ViT model from scratch […]

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Posts

Towards energy efficiency and productivity for decision making in mobile robot navigation

Fast convolution kernels on pascal GPU with high memory efficiency

Three Contributions to the Theory and Practice of Optimizing Compilers

Efficient Incremental Text-to-Speech on GPUs

SkyFlow: Heterogeneous streaming for skyline computation using FlowGraph and SYCL

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

User-Driven Online Kernel Fusion for SYCL

SciAI4Industry – Solving PDEs for industry-scale problems with deep learning

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)