high performance computing on graphics processing units: hgpu.org

Posts

Jul, 10

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model […]

CUDA

Jul, 10

DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware

Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train or infer with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. Cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. Tackling such a […]

Jul, 10

FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL

This dissertation primarily presents the design of the Digital Signal Processing (DSP) for transmission in the Bluetooth Low Energy Physical Layer (BLE PHY) and its implementation in a Field Programmable Gate Array (FPGA) device with Open Computing Language (OpenCL). The DSP design is based on the In-Phase/Quadrature-Phase (IQ) architecture to construct the modulation […]

OpenCL

Jul, 3

Novel Parallel Approaches to Efficiently Solve Spatial Problems on Heterogeneous CPU-GPU Systems

In recent years, approaches that seek to extract valuable information from large datasets have become particularly relevant. In this category, we can highlight spatial problems, which comprise the analysis of data distributed across two-dimensional scenarios. These usually involve processing (i) a series of features distributed across a given plane or (ii) […]

Jul, 3

Evaluation of Intel’s DPC++ Compatibility Tool in heterogeneous computing

The Intel DPC++ Compatibility Tool is a component of the Intel oneAPI Base Toolkit. This tool automatically transforms CUDA code into Data Parallel C++ (DPC++), thus assisting in the migration process. DPC++ is an implementation of the programming standard for heterogeneous computing known as SYCL, which unifies the development of parallel applications on CPUs, GPUs […]

CUDA

Jul, 3

Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes

High-performance Computing (HPC) scientific applications often face performance challenges when running on heterogeneous supercomputers, along with scalability, portability, and efficiency issues. For years, supercomputer architectures have been rapidly changing and becoming more complex, and this challenge will become even more complicated as we enter the exascale era, where computers will exceed one quintillion calculations […]

CUDA

Jul, 3

Tensor Computation Based on Heterogeneous Memory

Tensors, which generalize matrices to more than two dimensions, are fundamental to many disciplines, such as scientific computing and machine learning. Improving the performance and scalability of tensor computation is essential to those domains. The recent advance of heterogeneous memory is promising to deliver large-scale, high-performance tensor computation. However, it is challenging to leverage memory […]

OpenCL

Jul, 3

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both the memory and instruction bottlenecks. Our algorithm comes with an […]

Jun, 26

An experimental study of group-by and aggregation on CPU-GPU processors

Hash-based group-by and aggregation is a fundamental operator in database systems. Modern discrete GPUs (graphics processing units) have been considered as a way to accelerate it. However, data transfer over the PCIe (peripheral component interconnect express) bus reduces the gains. On recent architectures, the GPU and the CPU (central processing unit) are built into the same […]

OpenCL

Jun, 26

SnuHPL: high performance LINPACK for heterogeneous GPUs

These days, it is typical for a large-scale cluster system to have different kinds of GPUs. However, HPL (High-Performance LINPACK), the de facto standard LINPACK implementation for evaluating the performance of a cluster system, was originally designed to work only for homogeneous CPU-only systems. In this paper, we develop SnuHPL, an optimized HPL for clusters of […]

Jun, 26

tntorch: Tensor Network Learning with PyTorch

We present tntorch, a tensor learning framework that supports multiple decompositions (including Candecomp/Parafac, Tucker, and Tensor Train) under a unified interface. With our library, the user can learn and handle low-rank tensors with automatic differentiation, seamless GPU support, and the convenience of PyTorch’s API. Besides decomposition algorithms, tntorch implements differentiable tensor algebra, rank truncation, cross-approximation, […]

Jun, 26

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when choosing the proper hardware for training. In particular, CPU servers can be beneficial if […]

Recent source codes

tritonBLAS: A Lightweight Triton-based General Matrix Multiplication (GEMM) Library

hls4ml: Machine learning on FPGAs using HLS

ThunderKittens: Tile primitives for speedy kernels

NVIDIA Nemotron Parse 1.1

Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming

HipKittens: Fast and Furious AMD Kernels

Fortran xDSL dialects

mt4g: Memory Topology 4 GPUs

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Most viewed papers (last 30 days)