Posts
May 20
Glow: Graph Lowering Compiler Techniques for Neural Networks
This paper presents the design of Glow, a machine learning compiler for heterogeneous hardware. Glow takes a pragmatic approach to compilation that enables the generation of highly optimized code for multiple targets. It lowers the traditional neural network dataflow graph into a two-phase, strongly-typed intermediate representation. The high-level intermediate representation allows the optimizer to perform […]
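To make the lowering idea concrete, here is a minimal Python sketch of one rewrite step, turning a high-level FullyConnected node into simpler linear-algebra nodes a backend could map to hardware. The class and operation names are illustrative placeholders, not Glow's actual API.

```python
# Sketch of "graph lowering": a high-level, strongly-typed node is
# rewritten into lower-level nodes. Names are illustrative, not Glow's.
from dataclasses import dataclass

@dataclass
class Node:
    op: str
    inputs: tuple
    shape: tuple  # a strongly-typed IR carries shapes/types on every value

def lower(node):
    """Lower one high-level node into low-level nodes (one rewrite step)."""
    if node.op == "FullyConnected":
        x, w, b = node.inputs
        mm = Node("MatMul", (x, w), node.shape)
        return Node("BroadcastAdd", (mm, b), node.shape)
    return node  # already low-level

x = Node("Input", (), (64, 1024))
w = Node("Weight", (), (1024, 10))
b = Node("Weight", (), (10,))
fc = Node("FullyConnected", (x, w, b), (64, 10))
print(lower(fc))  # MatMul + BroadcastAdd, ready for target-specific codegen
```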
May 20
OpenCL Vector Swizzling Optimization under Global Value Numbering
Under the Heterogeneous System Architecture (HSA), devices work together to improve performance. The Open Computing Language (OpenCL) provides the functionality to carry out parallel computing across different devices. An OpenCL vector bundles identical data types and operates on all of them at once. However, optimizing vector swizzling is difficult because swizzling operations can span several OpenCL vectors. In this paper, new […]
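A rough Python sketch of why value numbering suits swizzles: a swizzle is just an index mask, nested swizzles compose into a single mask, and equivalent expressions then share one value number. The mask encoding and GVN table here are illustrative, not the paper's representation.

```python
# A swizzle such as v.wzyx selects vector lanes by index, so nested
# swizzles compose into one mask, and GVN can give syntactically
# different but equivalent expressions the same value number.
SWIZZLE = {"x": 0, "y": 1, "z": 2, "w": 3}

def mask(s):                      # "wzyx" -> (3, 2, 1, 0)
    return tuple(SWIZZLE[c] for c in s)

def compose(outer, inner):        # (v.inner).outer == v.composed
    return tuple(inner[i] for i in outer)

value_numbers = {}
def gvn(base, m):
    """Assign one value number per distinct (base, mask) pair."""
    key = (base, m)
    if key not in value_numbers:
        value_numbers[key] = len(value_numbers)
    return value_numbers[key]

a = gvn("v", compose(mask("yx"), mask("zw")))  # (v.zw).yx
b = gvn("v", mask("wz"))                       # v.wz, the same selection
print(a == b)                                  # True: one vector op suffices
```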
May 20
Generic System Calls for GPUs
GPUs are becoming first-class compute citizens and increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence. This enables them to run a wider variety of programs. However, a key aspect of general-purpose programming where GPUs still have room for improvement is the ability to invoke system calls. We explore how to […]
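As a conceptual model only (actual designs use shared virtual memory and interrupts, not host threads), the request/response structure of a GPU-invoked system call can be sketched in Python like this:

```python
# Conceptual model of GPU system calls: "GPU" worker threads place
# requests in a shared queue and a CPU-side service thread performs the
# actual call on their behalf. This mimics the protocol's shape only.
import threading, queue, os

requests = queue.Queue()

def cpu_service():
    while True:
        tid, name, args, reply = requests.get()
        if name == "shutdown":
            break
        if name == "getpid":                      # the one syscall we model
            reply.put(os.getpid())

def gpu_thread(tid):
    reply = queue.Queue()
    requests.put((tid, "getpid", (), reply))      # "invoke" the system call
    print(f"gpu thread {tid}: pid={reply.get()}") # block until serviced

svc = threading.Thread(target=cpu_service)
svc.start()
workers = [threading.Thread(target=gpu_thread, args=(i,)) for i in range(4)]
for t in workers: t.start()
for t in workers: t.join()
requests.put((0, "shutdown", (), None))
svc.join()
```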
May 20
Neural Multi-scale Image Compression
This study presents a new lossy image compression method that exploits the multi-scale features of natural images. Our model consists of two networks: a multi-scale lossy autoencoder and a parallel multi-scale lossless coder. The multi-scale lossy autoencoder extracts multi-scale image features into quantized variables, and the parallel multi-scale lossless coder enables rapid and accurate lossless coding […]
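A minimal PyTorch sketch of the two ingredients the excerpt names, an encoder applied at several image scales and a rounding quantizer with a straight-through gradient; the layer sizes and the number of scales are arbitrary placeholders, not the paper's architecture.

```python
# Toy multi-scale autoencoder: encode the image at two scales, quantize
# the features by rounding (straight-through gradient), decode the finest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Quantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)            # quantize to integers
    @staticmethod
    def backward(ctx, g):
        return g                         # straight-through estimator

class MultiScaleAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 8, 4, stride=2, padding=1)
        self.dec = nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1)
    def forward(self, x):
        codes = []
        for scale in (1.0, 0.5):                    # two image scales
            xs = F.interpolate(x, scale_factor=scale) if scale != 1.0 else x
            codes.append(Quantize.apply(self.enc(xs)))
        recon = self.dec(codes[0])                  # decode finest scale
        return codes, recon

model = MultiScaleAE()
img = torch.rand(1, 3, 32, 32)
codes, recon = model(img)
print([c.shape for c in codes], recon.shape)
```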
May 20
Deep Neural Machine Translation with Weakly-Recurrent Units
Recurrent neural networks (RNNs) have for years represented the state of the art in neural machine translation. Recently, new architectures have been proposed that can leverage parallel computation on GPUs better than classical RNNs. Faster training and inference, combined with different sequence-to-sequence modeling, also lead to performance improvements. While the new models completely depart from […]
May 12
Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
For reasons of both performance and energy efficiency, high-performance computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL framework supports portable programming across a wide range of computing devices and is gaining influence in programming next-generation accelerators. Characterizing the performance of these devices across a range of applications requires a diverse, portable and configurable […]
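A toy PyOpenCL harness illustrating this kind of portable characterization: enumerate every device on the system, run the same kernel on each, and time it with OpenCL profiling events. This is a sketch of the approach, not the paper's benchmark suite.

```python
# Run one SAXPY kernel on every available OpenCL device and report the
# kernel time from profiling events.
import numpy as np
import pyopencl as cl

SRC = ("__kernel void saxpy(__global float *y, __global const float *x,"
       " float a) { size_t i = get_global_id(0); y[i] += a * x[i]; }")

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.zeros(n, dtype=np.float32)

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        ctx = cl.Context(devices=[dev])
        q = cl.CommandQueue(
            ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
        mf = cl.mem_flags
        xb = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
        yb = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)
        k = cl.Program(ctx, SRC).build().saxpy
        evt = k(q, (n,), None, yb, xb, np.float32(2.0))
        evt.wait()
        ms = (evt.profile.end - evt.profile.start) * 1e-6
        print(f"{dev.name.strip()}: {ms:.3f} ms")
```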
May 12
EngineCL: Usability and Performance in Heterogeneous Computing
Heterogeneous systems composed of a CPU and a set of hardware accelerators have become one of the most common architectures today, thanks to their excellent performance and low energy consumption. However, their heterogeneity makes them very complex to program, and achieving performance portability across different devices is harder still. This paper presents EngineCL, a […]
May 12
Modeling and Evaluation of Synchronous Stochastic Gradient Descent in Distributed Deep Learning on Multiple GPUs
With huge amounts of training data, deep learning has achieved great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed across a cluster equipped with accelerators like GPUs. With the rapid increase in GPU computing power, data communication among GPUs has become a […]
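A toy numpy model of one synchronous S-SGD step on a linear regression problem; the gradient averaging stands in for the all-reduce whose communication cost such modeling work analyzes. The problem setup is a placeholder, not the paper's workload.

```python
# Synchronous SGD: each "GPU" computes a gradient on its shard of the
# minibatch, gradients are averaged (the all-reduce), and every worker
# applies the identical update.
import numpy as np

rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(512, 8)), rng.normal(size=8)
y = X @ w_true
w = np.zeros(8)
P, lr = 4, 0.1                       # number of workers, learning rate

for step in range(100):
    shards = np.array_split(rng.permutation(512)[:128], P)
    grads = []
    for idx in shards:               # each worker: local gradient
        err = X[idx] @ w - y[idx]
        grads.append(X[idx].T @ err / len(idx))
    g = np.mean(grads, axis=0)       # all-reduce: average across workers
    w -= lr * g                      # identical update everywhere
print(np.linalg.norm(w - w_true))    # converges toward the true weights
```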
May 12
Parallel Programming for FPGAs
This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems. Our goal is to give the reader an appreciation of the process of creating an optimized hardware design using HLS. Although the details are, of necessity, different from parallel programming for multicore processors or GPUs, many of the fundamental […]
May 12
Efficient Hardware Acceleration on SoC-FPGA with OpenCL
Field Programmable Gate Arrays (FPGAs) are overtaking conventional processors in the field of high-performance computing. With the advent of modern FPGA architectures and high-level synthesis tools, FPGAs can now easily be used to accelerate computationally intensive applications such as AI and cognitive computing. One of the advantages of raising the level of […]
May 5
Simulating Quantum Computers Using OpenCL
I present QCGPU, an open source Rust library for simulating quantum computers. QCGPU uses the OpenCL framework to enable acceleration by devices such as GPUs, FPGAs and DSPs. I perform a number of optimizations, including parallelizing operations such as the application of gates and the calculation of various state probabilities for the purpose of measurement. […]
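For intuition, here is what a state-vector simulator computes, written serially in numpy: applying a single-qubit gate updates every pair of amplitudes that differ only in the target bit, and measurement probabilities are the squared amplitude magnitudes. QCGPU performs the same updates in parallel via OpenCL kernels; this is only a serial analogue.

```python
# Apply a 2x2 gate to one qubit of an n-qubit state vector, then read
# the measurement probabilities as |amplitude|^2.
import numpy as np

def apply_gate(state, gate, target):
    out = state.copy()
    bit = 1 << target
    for i in range(len(state)):
        if not i & bit:                      # i has the target bit clear
            a0, a1 = state[i], state[i | bit]
            out[i] = gate[0, 0] * a0 + gate[0, 1] * a1
            out[i | bit] = gate[1, 0] * a0 + gate[1, 1] * a1
    return out

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate
state = np.zeros(4, dtype=complex)
state[0] = 1.0                                # |00>
state = apply_gate(state, H, target=0)
print(np.abs(state) ** 2)                     # [0.5, 0.5, 0.0, 0.0]
```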
May 5
Fast Simulation of Large-Scale Floods Based on GPU Parallel Computing
Computing speed is a significant issue for large-scale flood simulations that must respond in real time for disaster prevention and mitigation. Even today, most large-scale flood simulations are run on supercomputers because of the massive amounts of data and computation involved. In this work, a two-dimensional shallow water model based on an unstructured Godunov-type finite […]
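The computational core such models parallelize is a per-cell finite-volume update in which each cell's new state depends only on its neighbors' fluxes, which is exactly the independence GPUs exploit. Below is a deliberately simplified 1-D Lax-Friedrichs analogue in numpy; the paper itself uses a 2-D unstructured Godunov-type scheme.

```python
# 1-D shallow water equations on a dam-break initial state, advanced with
# Lax-Friedrichs interface fluxes. Every cell update is independent given
# its neighbors, so the loop body maps naturally onto GPU threads.
import numpy as np

g, n, dx, dt = 9.81, 200, 1.0, 0.01
h = np.where(np.arange(n) < n // 2, 2.0, 1.0)   # water depth: dam break
hu = np.zeros(n)                                # momentum

def flux(h, hu):
    u = hu / h
    return np.array([hu, hu * u + 0.5 * g * h * h])

for step in range(200):
    U = np.array([h, hu])
    F = flux(h, hu)
    # Lax-Friedrichs flux at the interface between cells i and i+1
    Fi = 0.5 * (F[:, :-1] + F[:, 1:]) - 0.5 * (dx / dt) * (U[:, 1:] - U[:, :-1])
    h[1:-1] -= dt / dx * (Fi[0, 1:] - Fi[0, :-1])
    hu[1:-1] -= dt / dx * (Fi[1, 1:] - Fi[1, :-1])

print(h.min(), h.max())  # the wave spreads out from the dam break
```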