high performance computing on graphics processing units: hgpu.org

Posts

Mar, 22

Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs

Achieving high-performance GPU kernels requires optimizing algorithm implementations for the targeted GPU architecture. It is of utmost importance to fully use the compute and memory hierarchy, as well as available specialised hardware. Currently, vendor libraries like cuBLAS and cuDNN provide the best-performing implementations of GPU algorithms. However, the task of the library programmer is […]

CUDA

Mar, 22

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

As many-core accelerators keep integrating more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way of improving hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks – a strategy known as […]

CUDA • OpenCL

Mar, 22

Learnergy: Energy-based Machine Learners

In recent years, machine learning techniques have been broadly encouraged in the context of deep learning architectures. An interesting algorithm, the Restricted Boltzmann Machine, relies on its energy-based and probabilistic nature to tackle diverse applications, such as classification, reconstruction, and generation of images and signals. Nevertheless, one can see they are […]

CUDA

Mar, 22

Performance evaluation of deep learning on smartphones

Deep learning powers a variety of applications, from self-driving cars and autonomous robotics to web search and voice assistants. It is fair to say that it is omnipresent and here to stay. It is deployed in all sorts of devices ranging from consumer electronics to the Internet of Things (IoT). Such a deployment is categorized […]

Mar, 22

Towards automated kernel selection in machine learning systems: A SYCL case study

Automated tuning of compute kernels is a popular area of research, mainly focused on finding optimal kernel parameters for a problem with fixed input sizes. This approach is good for deploying machine learning models, where the network topology is constant, but machine learning research often involves changing network topologies and hyperparameters. Traditional kernel auto-tuning has […]

Mar, 15

Abstracting OpenCL for Multi-Application Workloads on CPU-FPGA Clusters

Field-programmable gate arrays (FPGAs) continue to see integration in data centres, where customized hardware accelerators provide improved performance for cloud workloads. However, existing programming models for such environments typically require a manual assignment of application tasks between CPUs and FPGA-based accelerators. Furthermore, coordinating the execution of tasks from multiple applications necessitates the use of a […]

OpenCL

Mar, 15

Automated test generation for OpenCL kernels using fuzzing and constraint solving

Graphics Processing Units (GPUs) are massively parallel processors offering performance acceleration and energy efficiency unmatched by current processors (CPUs) in computers. These advantages along with recent advances in the programmability of GPUs have made them attractive for general-purpose computations. Despite the advances in programmability, GPU kernels are hard to code and analyse due to the […]

OpenCL

Mar, 15

Data Movement Optimization for High-Performance Computing

Tuning codes to make efficient use of high-performance computing systems is known to be hard. Programmers have to schedule their computations across thousands of compute cores while keeping the compute and data movement costs in mind. The necessary code transformations – for example, to overlap computation and inter-node communication – are well known. But the complex […]

CUDA

Mar, 15

Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL

When considering different hardware platforms, not only the time-to-solution can be of importance but also the energy necessary to reach it. This is the case not only with battery-powered and mobile devices but also with high-performance parallel cluster systems, due to financial and practical limits on power consumption and cooling. Recent developments in hard- […]

OpenCL

Mar, 15

Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application

New challenges in Astronomy and Astrophysics (AA) are urging the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing […]

CUDA • OpenCL

Mar, 8

Solving convex optimization problems on FPGA using OpenCL

The use of accelerators in HPC applications has seen enormous growth in the last decade. In the field of HPC, demands on throughput are steadily growing. Not all of the algorithms used have a clear HW architecture that performs best. Our work explores the performance of different HW architectures in solving a convex optimization […]

OpenCL

Mar, 8

Portable and Performant GPU/Heterogeneous Asynchronous Many-Task Runtime System

Asynchronous many-task (AMT) runtimes are maturing as a model for computing simulations on a diverse range of architectures at large scale. The Uintah AMT framework is driven by a philosophy of maintaining an application layer distinct from the underlying runtime while operating on an adaptive mesh grid. This model has enabled task developers to focus on […]

CUDA • OpenCL

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
