high performance computing on graphics processing units: hgpu.org

Posts

Jul, 4

Semantic Product Search

We study the problem of semantic matching in product search, that is, given a customer query, retrieve all semantically related products from the catalog. Pure lexical matching via an inverted index falls short in this respect due to several factors: a) lack of understanding of hypernyms, synonyms, and antonyms, b) fragility to morphological variants (e.g. […]

Jul, 4

PIConGPU: Predictive Simulations of Laser-Particle Accelerators with Manycore Hardware

The presented thesis establishes simulations on modern massively parallel computing hardware to investigate relativistic laser-driven plasmas. The latter are of special interest as they may provide a compact source for energetic ion beams. Computer simulations provide valuable insight into ultrafast plasma processes, evolving in the ultrahigh intensity (I_0 >> 10^18 W/cm^2) focus of the ultrashort […]

CUDA

Jul, 4

Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide instantaneous fair […]

Jun, 30

Automated Generation of OpenCL Programs Based on Algebra-Algorithmic Approach

The paper proposes the further development of algebra-algorithmic design and synthesis tools towards the development of OpenCL programs. A method for semi-automatic parallelization of cyclic operators is proposed. The particular feature of the approach consists in using high-level algebra-algorithmic program specifications (schemes) and a rewriting rules technique. The developed tools provide the construction of parallel algorithm […]

CUDA • OpenCL

Jun, 30

WCCV: Improving the Vectorization of IF-statements with Warp-Coherent Conditions

When vectorizing programs for modern processors with SIMD extensions, IF-statements pose a challenge: existing vectorization approaches often introduce redundant computations or they resort to inefficient masked instructions. In this paper, we introduce a new notion of warp-coherence for conditions that exhibit coherent run-time behavior on different lanes of a vector register. We demonstrate that warp-coherent […]

OpenCL

Jun, 30

Memory Bandwidth and Latency in HPC: System Requirements and Performance Impact

A major contributor to the deployment and operational costs of large-scale high-performance computing (HPC) clusters is the memory system. In terms of system performance, it is one of the most critical aspects of the system's design. However, the next generation of HPC systems poses significant challenges for the main memory, and it is questionable whether […]

Jun, 30

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

The validation and deployment of novel research ideas in the field of Deep Learning is often limited by the availability of efficient compute kernels for certain basic primitives. In particular, operations that cannot leverage existing vendor libraries (e.g., cuBLAS, cuDNN) are at risk of facing poor device utilization unless custom implementations are written by experts […]

CUDA

Jun, 30

HEATS: Heterogeneity- and Energy-Aware Task-based Scheduling

Cloud providers usually offer diverse types of hardware for their users. Customers exploit this option to deploy cloud instances featuring GPUs, FPGAs, architectures other than x86 (e.g., ARM, IBM Power8), or featuring certain specific extensions (e.g., Intel SGX). We consider in this work the instances used by customers to deploy containers, nowadays the de facto […]

Jun, 27

ReSYCLator: Transforming CUDA C++ source code into SYCL

CUDA, while very popular, is not as flexible with respect to target devices as OpenCL. While parallel algorithm research might address problems first with a CUDA C++ solution, those results are not easily portable to a target not directly supported by CUDA. In contrast, a SYCL C++ solution can operate on the larger variety of […]

CUDA • OpenCL

Jun, 27

Heterogeneous Active Messages (HAM) – Implementing Lightweight Remote Procedure Calls in C++

We present HAM (Heterogeneous Active Messages), a C++-only active messaging solution for heterogeneous distributed systems. Combined with a communication protocol, HAM can be used as a generic Remote Procedure Call (RPC) mechanism. It has been used in HAM-Offload to implement a low-overhead offloading framework for inter- and intra-node offloading between different architectures including accelerators like the […]

Jun, 27

Mirovia: A Benchmarking Suite for Modern Heterogeneous Computing

This paper presents Mirovia, a benchmark suite developed for modern-day heterogeneous computing. Previous benchmark suites such as Rodinia [1] and SHOC [2] are well written and have many desirable features. However, these tools were developed years ago when hardware was less powerful and software had fewer features. For example, unified memory was introduced in […]

CUDA

Jun, 23

A Static Analysis-based Cross-Architecture Performance Prediction Using Machine Learning

Porting code from CPU to GPU is costly and time-consuming; unless much time is invested in development and optimization, it is not obvious, a priori, how much speed-up is achievable or how much room is left for improvement. Knowing the potential speed-up a priori can be very useful: it can save hundreds of engineering hours, […]
