Posts
Nov, 1
Designing a Modern Skeleton Programming Framework for Parallel and Heterogeneous Systems
Today’s society is increasingly software-driven and dependent on powerful computer technology. Therefore, it is important that advancements in low-level processor hardware are made available for exploitation by a growing number of programmers of differing skill levels. However, as we are approaching the end of Moore’s law, hardware designers are finding new and increasingly complex […]
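Skeleton programming separates what a computation does from how it is executed: the user supplies only per-element logic, and the framework owns the parallel backend. As a minimal sketch of that idea in plain C++17 (not the framework described in the paper; the class name Map and the use of std::execution::par are illustrative assumptions):

```cpp
// A minimal "map" skeleton in portable C++17. The user supplies only the
// per-element function; the skeleton owns the parallelization strategy
// (here simply std::execution::par).
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

template <typename F>
class Map {
    F f_;
public:
    explicit Map(F f) : f_(f) {}
    template <typename T>
    std::vector<T> operator()(const std::vector<T>& in) const {
        std::vector<T> out(in.size());
        std::transform(std::execution::par, in.begin(), in.end(),
                       out.begin(), f_);
        return out;
    }
};

int main() {
    Map square([](float x) { return x * x; });  // user code: what to compute
    std::vector<float> v{1.f, 2.f, 3.f, 4.f};
    auto r = square(v);                         // skeleton: how it runs
    std::printf("%f\n", r.back());              // prints 16.000000
}
```

The same user code can then be retargeted by swapping the execution policy, or a GPU backend, inside the skeleton, which is the portability argument skeleton frameworks make.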
Nov, 1
Towards Co-execution on Commodity Heterogeneous Systems: Optimizations for Time-Constrained Scenarios
Heterogeneous systems are now present everywhere, from powerful supercomputers to desktop computers and mobile devices, thanks to their excellent performance and energy efficiency. The ubiquity of these architectures in both desktop systems and medium-sized servers provides enough variability to tackle a wide range of problems, such as multimedia workloads, video encoding, image filtering and inference in […]
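Co-execution means splitting a single data-parallel kernel across the CPU and the GPU so both work simultaneously. A hedged sketch of a static split using standard OpenMP (the 80/20 ratio and the function name saxpy_coexec are illustrative; a co-execution runtime would presumably choose the split adaptively):

```cpp
// Static co-execution of one SAXPY loop: the first part of the range is
// offloaded to the GPU as an asynchronous OpenMP target task, while the
// host cores process the tail; taskwait joins the GPU work at the end.
void saxpy_coexec(float a, float* x, const float* y, int n) {
    const int split = static_cast<int>(n * 0.8f);  // assumed GPU share

    // GPU part, deferred via nowait so the host can keep going.
    #pragma omp target teams distribute parallel for nowait \
        map(tofrom: x[0:split]) map(to: y[0:split])
    for (int i = 0; i < split; ++i) x[i] = a * x[i] + y[i];

    // CPU part runs concurrently on the host cores.
    #pragma omp parallel for
    for (int i = split; i < n; ++i) x[i] = a * x[i] + y[i];

    #pragma omp taskwait  // wait for the GPU task before returning
}
```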
Nov, 1
Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling
While large neural networks demonstrate higher performance in various tasks, training large networks is difficult due to limitations on GPU memory size. We propose a novel out-of-core algorithm that enables faster training of extremely large-scale neural networks with sizes larger than the allotted GPU memory. Under a given memory budget constraint, our scheduling algorithm locally adapts […]
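The underlying idea of out-of-core training is to keep only a window of tensors resident on the GPU and spill the rest to host memory, prefetching them back before they are needed again. A toy planning sketch under that reading (this is not the paper's adaptive window algorithm; the greedy oldest-first eviction and the struct names are assumptions):

```cpp
// Toy sketch of budget-driven offloading (not the paper's algorithm).
// Walk the layers in execution order; tensors that no longer fit in the
// GPU budget are marked for transfer to host memory, to be prefetched
// back before their gradients are needed.
#include <cstdio>
#include <deque>
#include <vector>

struct Tensor { int layer; std::size_t bytes; };

std::vector<int> plan_offloads(const std::vector<Tensor>& acts,
                               std::size_t budget_bytes) {
    std::deque<Tensor> resident;   // the in-GPU "window"
    std::size_t used = 0;
    std::vector<int> offloaded;    // layers whose activations move to host
    for (const Tensor& t : acts) {
        used += t.bytes;
        resident.push_back(t);
        while (used > budget_bytes && resident.size() > 1) {
            offloaded.push_back(resident.front().layer);  // evict oldest
            used -= resident.front().bytes;
            resident.pop_front();
        }
    }
    return offloaded;
}

int main() {
    std::vector<Tensor> acts{{0, 400}, {1, 300}, {2, 500}, {3, 600}};
    for (int l : plan_offloads(acts, 1000))
        std::printf("offload activations of layer %d\n", l);
}
```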
Nov, 1
Not Half Bad: Exploring Half-Precision in Graph Convolutional Neural Networks
With the growing significance of graphs as an effective representation of data in numerous applications, efficient graph analysis using modern machine learning is receiving increasing attention. Deep learning approaches often operate over the entire adjacency matrix, as the input and intermediate network layers are all designed in proportion to the size […]
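For context, the computational core of a graph convolutional layer is H' = ReLU(A·H·W), where A is the (normalized) adjacency matrix; the matrix products are where a move to half precision would pay off. A dense fp32 sketch (standard C++ has no portable 16-bit float before C++23, so the fp16 candidates are only marked in comments):

```cpp
// One dense GCN layer, H' = ReLU(A * H * W), in plain C++ (fp32).
// Running these products in half precision is what the paper studies;
// this sketch stays in float and only marks where fp16 would go.
#include <vector>
#include <algorithm>

using Mat = std::vector<std::vector<float>>;

Mat matmul(const Mat& a, const Mat& b) {
    Mat c(a.size(), std::vector<float>(b[0].size(), 0.f));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t k = 0; k < b.size(); ++k)        // fp16 candidate:
            for (std::size_t j = 0; j < b[0].size(); ++j) // these FMAs dominate
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Mat gcn_layer(const Mat& A, const Mat& H, const Mat& W) {
    Mat Z = matmul(matmul(A, H), W);  // aggregate neighbors, then transform
    for (auto& row : Z)
        for (float& v : row) v = std::max(v, 0.f);  // ReLU
    return Z;
}
```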
Nov, 1
Memory Optimization for Deep Networks
Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor computation in top-of-the-line GPUs increased by 32x over the last five years, the total available memory only grew by 2.5x. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. In this paper, we […]
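A standard remedy the abstract alludes to is trading compute for memory: store only some intermediate outputs and recompute the rest during the backward pass. A toy sketch of this classic checkpointing idea (not necessarily the paper's specific method; tanh stands in for an arbitrary layer):

```cpp
// Checkpointing a chain of n layers: keep only every k-th activation
// during the forward pass and recompute the rest on the backward pass,
// trading roughly one extra forward sweep for O(n/k + k) stored
// activations instead of O(n).
#include <cmath>
#include <map>

float layer(float x, int /*i*/) { return std::tanh(x); }  // stand-in layer

float forward_with_checkpoints(float x, int n, int k,
                               std::map<int, float>& ckpt) {
    for (int i = 0; i < n; ++i) {
        if (i % k == 0) ckpt[i] = x;  // store a checkpoint, drop the rest
        x = layer(x, i);
    }
    return x;
}

// Recover the activation entering layer i by replaying from the nearest
// stored checkpoint (what a checkpointed backward pass does on demand).
float recompute_input(int i, int k, const std::map<int, float>& ckpt) {
    const int base = (i / k) * k;
    float x = ckpt.at(base);
    for (int j = base; j < i; ++j) x = layer(x, j);
    return x;
}
```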
Oct, 25
OpenCL Performance on the Intel Heterogeneous Architecture Research Platform
The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet, the identification of new optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems. Frameworks such as OpenCL enable computation orchestration on existing systems, and OpenCL’s availability through the Intel High Level Synthesis compiler allows users to architect […]
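The paper's OpenCL/HLS kernels are not shown in this excerpt; the locality idea most matrix-multiplication optimizations build on can be sketched in plain C++ as cache blocking (the tile size T = 32 is an assumption to be tuned per target):

```cpp
// Generic tiled matrix multiply in plain C++. Blocking by T keeps a
// TxT working set hot in fast memory, the same locality idea that
// OpenCL and HLS variants elaborate on for their targets.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t T = 32;  // tile size: an assumption, tune per target

void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += T)
        for (std::size_t kk = 0; kk < n; kk += T)
            for (std::size_t jj = 0; jj < n; jj += T)
                // multiply one TxT tile pair into the C tile
                for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + T, n); ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + T, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```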
Oct, 25
Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs
Heterogeneous systems are becoming increasingly prevalent. In order to exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today’s systems to tomorrow’s. Over the past decade and more, directives have been established as one of the promising paths to tackle programmatic challenges […]
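The directive-based pattern such compiler comparisons benchmark looks like the following standard OpenMP 4.5+ offload construct (compiler flags for targeting a V100 vary by toolchain, e.g. nvc++ -mp=gpu or clang++ with -fopenmp-targets=nvptx64-nvidia-cuda):

```cpp
// Canonical OpenMP offload of a vector addition: map the inputs to the
// device, distribute the loop across teams of threads, and map the
// result back. This is the baseline construct compiler studies measure.
void vec_add(const float* a, const float* b, float* c, int n) {
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```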
Oct, 25
Mixed-Precision Embedding Using a Cache
In recommendation systems, practitioners observed that an increase in the number of embedding tables and their sizes often leads to significant improvement in model performance. Given this and the business importance of these models to major internet companies, embedding tables for personalization tasks have grown to terabyte scale and continue to grow at a significant rate. […]
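One way to read the title: keep the backing table in a compact low-precision format and serve hot rows from a small full-precision cache. A hedged sketch of that structure (the int8-with-scale format, the class names, and the absent eviction policy are all illustrative assumptions, not the paper's design):

```cpp
// Sketch of a mixed-precision embedding lookup: the full table lives in
// low precision (int8 plus a per-row scale here), while a small cache
// holds frequently accessed rows in fp32. Eviction and training-time
// updates are omitted.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct QuantizedTable {
    std::vector<int8_t> data;   // rows * dim values, quantized
    std::vector<float> scale;   // one dequantization scale per row
    int dim;
    std::vector<float> dequantize(std::size_t row) const {
        std::vector<float> out(dim);
        for (int j = 0; j < dim; ++j)
            out[j] = data[row * dim + j] * scale[row];
        return out;
    }
};

struct CachedEmbedding {
    QuantizedTable table;
    std::unordered_map<std::size_t, std::vector<float>> cache;  // hot fp32 rows
    const std::vector<float>& lookup(std::size_t row) {
        auto it = cache.find(row);
        if (it == cache.end())  // miss: dequantize and promote to the cache
            it = cache.emplace(row, table.dequantize(row)).first;
        return it->second;
    }
};
```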
Oct, 25
Cross-platform programming model for many-core lattice Boltzmann simulations
We present a novel, hardware-agnostic implementation strategy for lattice Boltzmann (LB) simulations, which yields massive performance on homogeneous and heterogeneous many-core platforms. Based solely on C++17 Parallel Algorithms, our approach does not rely on any language extensions, external libraries, vendor-specific code annotations, or pre-compilation steps. Thanks in particular to a recently proposed GPU back-end to […]
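The programming model the abstract describes, C++17 Parallel Algorithms with no language extensions, looks like the following collision-step sketch (not the paper's code; the BGK-style relaxation is the textbook LB collision rule, and nvc++ -stdpar=gpu is one toolchain that offloads such loops to GPUs):

```cpp
// Per-cell LB collision expressed purely with C++17 Parallel Algorithms:
// std::for_each over an index range with a parallel execution policy.
// The same source can run on CPU threads or, with an offloading
// toolchain, on a GPU, with no vendor annotations in the code.
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

void relax_step(std::vector<double>& f, const std::vector<double>& feq,
                double omega) {
    std::vector<std::size_t> idx(f.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                  [&](std::size_t i) {
                      // BGK-style relaxation toward local equilibrium
                      f[i] += omega * (feq[i] - f[i]);
                  });
}
```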
Oct, 25
FlowPM: Distributed TensorFlow Implementation of the FastPM Cosmological N-body Solver
We present FlowPM, a Particle-Mesh (PM) cosmological N-body code implemented in Mesh-TensorFlow for GPU-accelerated, distributed, and differentiable simulations. We implement and validate the accuracy of a novel multi-grid scheme based on multiresolution pyramids to compute large scale forces efficiently on distributed platforms. We explore the scaling of the simulation on large-scale supercomputers and compare it […]
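For orientation, every particle-mesh solver starts by depositing particle mass onto a grid with cloud-in-cell (CIC) weights before the FFT-based force solve. A 1D sketch of CIC (FlowPM does this in 3D inside Mesh-TensorFlow; this standalone version is only illustrative):

```cpp
// Cloud-in-cell mass deposition in 1D with periodic boundaries: each
// particle shares its mass linearly between the two nearest grid
// points. This is the first step of every particle-mesh force solve.
#include <cmath>
#include <vector>

void cic_deposit(const std::vector<double>& pos,  // positions in [0, n)
                 std::vector<double>& rho) {      // density grid, size n
    const std::size_t n = rho.size();
    for (double x : pos) {
        const std::size_t i = static_cast<std::size_t>(std::floor(x));
        const double frac = x - static_cast<double>(i);
        rho[i % n]       += 1.0 - frac;  // weight to the left grid point
        rho[(i + 1) % n] += frac;        // remainder to the right one
    }
}
```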
Oct, 18
When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization
With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bounded applications to benefit from FPGA acceleration. However, we found that it is not easy to fully utilize the available bandwidth when developing some applications with high-level synthesis (HLS) tools. This is […]
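In Vitis-style HLS, reaching multiple HBM pseudo-channels concurrently means giving each top-level pointer its own AXI master bundle. A hedged sketch (pragma spellings follow Xilinx/Vitis HLS; the bundle-to-channel assignment happens at link time via a connectivity configuration that is omitted here):

```cpp
// HLS pattern for using two HBM pseudo-channels at once: separate AXI
// master bundles let the read and write streams move through different
// channels concurrently; PIPELINE targets one element per clock cycle.
extern "C" void stream_copy(const float* in, float* out, int n) {
    #pragma HLS INTERFACE m_axi port=in  bundle=gmem0  // e.g. one HBM channel
    #pragma HLS INTERFACE m_axi port=out bundle=gmem1  // a different channel
    for (int i = 0; i < n; ++i) {
        #pragma HLS PIPELINE II=1  // initiation interval of one cycle
        out[i] = in[i];
    }
}
```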
Oct, 18
Portable high-order finite element kernels I: Streaming Operations
This paper is devoted to the development of highly efficient kernels performing vector operations relevant to linear system solvers. In particular, we focus on the low arithmetic intensity operations (i.e., streaming operations) performed within the conjugate gradient iterative method, using the parameters specified in the CEED benchmark problems for high-order hexahedral finite elements. We propose […]
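Concretely, the streaming part of a conjugate gradient iteration consists of axpy-style vector updates and a dot product; the usual optimization is fusing them so each vector is read from memory once. A plain C++ sketch of such a fused kernel (the particular fusion shown is illustrative, not necessarily the paper's):

```cpp
// Fused CG streaming kernel: the solution update, the residual update,
// and the new residual norm are computed in a single pass, so r and Ap
// are each read once instead of twice. Memory traffic, not arithmetic,
// bounds these low-intensity operations.
#include <vector>

double fused_update_dot(std::vector<double>& x, std::vector<double>& r,
                        const std::vector<double>& p,
                        const std::vector<double>& Ap, double alpha) {
    double rr = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += alpha * p[i];   // x <- x + alpha p
        r[i] -= alpha * Ap[i];  // r <- r - alpha Ap
        rr   += r[i] * r[i];    // accumulate ||r||^2 in the same pass
    }
    return rr;
}
```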