high performance computing on graphics processing units: hgpu.org

Posts

Jul, 4

Optimization of Heterogeneous Parallel Computing Systems using Machine Learning

Background: Heterogeneous parallel computing systems combine different resources (CPUs and GPUs) to achieve high performance and reduced latency and energy consumption. Programming applications that target various processing units requires employing different tools and programming models/languages. Furthermore, selecting the optimal implementation, which may either target different processing units (i.e. CPU or GPU) […]

CUDA

Jul, 4

Productivity, Portability, Performance: Data-Centric Python

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. […]

Jul, 4

Object Detection Based Handwriting Localization

We present an object detection based approach to localize handwritten regions in documents, which initially aims to enhance anonymization during data transmission. The concatenated fusion of original and preprocessed images containing both printed texts and handwritten notes or signatures is fed into the convolutional neural network, where the bounding boxes are learned to […]

Jul, 4

HALF: Holistic Auto Machine Learning for FPGAs

Deep Neural Networks (DNNs) are capable of solving complex problems in domains related to embedded systems, such as image and natural language processing. To efficiently implement DNNs on a specific FPGA platform for a given cost criterion, e.g. energy efficiency, an enormous number of design parameters have to be considered from the topology down to […]

Jun, 27

Improving Performance and Energy Efficiency of Heterogeneous Systems with rCUDA

In the last decade, the use of GPGPU (General Purpose computing on Graphics Processing Units) has become extremely popular in data centers around the world. GPUs (Graphics Processing Units) have been established as computational accelerators that are used alongside CPUs to form heterogeneous systems. The massively parallel nature of GPUs, traditionally intended for graphics computing, […]

CUDA

Jun, 27

Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of heterogeneous systems. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically […]

OpenCL

Jun, 27

Efficient heterogeneous matrix profile on a CPU + High Performance FPGA with integrated HBM

In this work, we study the problem of efficiently executing a state-of-the-art time series algorithm class – SCAMP – on a heterogeneous platform comprised of CPU + High Performance FPGA with integrated HBM (High Bandwidth Memory). The geometry of the algorithm (a triangular matrix walk) and the FPGA capabilities pose two challenges. First, several replicated […]

OpenCL

Jun, 27

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization […]

CUDA

Jun, 27

Lettuce: PyTorch-based Lattice Boltzmann Framework

The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent […]

CUDA

Jun, 20

GPUAPI: Multi-level Chapel Runtime API for GPUs

Chapel is inherently well suited not only for homogeneous nodes but also for heterogeneous nodes because it employs the concept of locales, distributed domains, forall/reduce constructs, and implicit communications. However, there is room for further improvement in supporting GPUs in Chapel. This paper addresses some of the key limitations of past approaches […]

CUDA • OpenCL

Jun, 20

Study and evaluation of improved automatic GPU offloading method

With the slowing down of Moore’s law, the use of hardware other than CPUs, such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), is increasing. However, when using heterogeneous hardware other than CPUs, the technical skill barriers, such as those for compute unified device architecture (CUDA) and open computing language (OpenCL), are high. Therefore, I […]

CUDA • OpenCL

Jun, 20

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

For many, Graphics Processing Units (GPUs) provide a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI […]

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)