high performance computing on graphics processing units: hgpu.org

Posts

Mar, 29

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis

The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: […]

OpenCL

Mar, 29

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a […]

Mar, 22

Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs

Achieving high-performance GPU kernels requires optimizing algorithm implementations to the targeted GPU architecture. It is of utmost importance to fully use the compute and memory hierarchy, as well as available specialised hardware. Currently, vendor libraries like cuBLAS and cuDNN provide the best performing implementations of GPU algorithms. However the task of the library programmer is […]

CUDA

Mar, 22

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

As many-core accelerators keep integrating more processing units, it becomes increasingly more difficult for a parallel application to make effective use of all available resources. An effective way for improving hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks – a strategy known as […]

CUDA

•

OpenCL

Mar, 22

Learnergy: Energy-based Machine Learners

Throughout the last years, machine learning techniques have been broadly encouraged in the context of deep learning architectures. An interesting algorithm denoted as Restricted Boltzmann Machine relies on energy- and probabilistic-based nature to tackle with the most diverse applications, such as classification, reconstruction, and generation of images and signals. Nevertheless, one can see they are […]

CUDA

Mar, 22

Performance evaluation of deep learning on smartphones

Deep Learning powers a variety of applications from self driving cars and autonomous robotics to web search and voice assistants. It is fair to say that it is omnipresent and here to stay. It is deployed in all sorts of devices ranging from consumer electronics to Internet of Things (IoT). Such a deployment is categorized […]

Mar, 22

Towards automated kernel selection in machine learning systems: A SYCL case study

Automated tuning of compute kernels is a popular area of research, mainly focused on finding optimal kernel parameters for a problem with fixed input sizes. This approach is good for deploying machine learning models, where the network topology is constant, but machine learning research often involves changing network topologies and hyperparameters. Traditional kernel auto-tuning has […]

Mar, 15

Abstracting OpenCL for Multi-Application Workloads on CPU-FPGA Clusters

Field-programmable gate arrays (FPGAs) continue to see integration in data centres, where customized hardware accelerators provide improved performance for cloud workloads. However, existing programming models for such environments typically require a manual assignment of application tasks between CPUs and FPGA-based accelerators. Furthermore, coordinating the execution of tasks from multiple applications necessitates the use of a […]

OpenCL

Mar, 15

Automated test generation for OpenCL kernels using fuzzing and constraint solving

Graphics Processing Units (GPUs) are massively parallel processors offering performance acceleration and energy efficiency unmatched by current processors (CPUs) in computers. These advantages along with recent advances in the programmability of GPUs have made them attractive for general-purpose computations. Despite the advances in programmability, GPU kernels are hard to code and analyse due to the […]

OpenCL

Mar, 15

Data Movement Optimization for High-Performance Computing

Tuning codes to make efficient use of high-performance computing systems is known to be hard. Programmers have to schedule their computations to thousands of compute cores having the compute and data movement costs in mind. The necessary code transformations – for example, to overlap computation and inter-node communication – are well known. But the complex […]

CUDA

Mar, 15

Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL

When considering different hardware platforms, not just the time-to-solution can be of importance but also the energy necessary to reach it. This is not only the case with battery powered and mobile devices but also with high-performance parallel cluster systems due to financial and practical limits on power consumption and cooling. Recent developments in hard- […]

OpenCL

Mar, 15

Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application

New challenges in Astronomy and Astrophysics (AA) are urging the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing […]

CUDA

•

OpenCL

* * *

high performance computing on graphics processing units: hgpu.org

Posts

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Learnergy: Energy-based Machine Learners

Performance evaluation of deep learning on smartphones

Towards automated kernel selection in machine learning systems: A SYCL case study

Abstracting OpenCL for Multi-Application Workloads on CPU-FPGA Clusters

Automated test generation for OpenCL kernels using fuzzing and constraint solving

Data Movement Optimization for High-Performance Computing

Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL

Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)