high performance computing on graphics processing units: hgpu.org

Posts

Sep, 26

Generating GPU Code from a High-level Representation for Image Processing Kernels

We present a framework for representing image processing kernels based on decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access pattern of a kernel. The framework performs source-to-source translation of kernels expressed in high-level framework-specific C++ classes into low-level CUDA or OpenCL code with effective device-dependent optimizations such as […]

CUDA • OpenCL

Sep, 26

LLVM to PTX Backend

The low-level virtual machine (LLVM) compiler infrastructure is a mature and stable framework for implementing optimization and compiler passes. H. Rhodin presented an LLVM backend to generate Parallel Thread Execution (PTX) instructions from LLVM bitcode. PTX is used as an intermediate representation for parallel programming. This paper discusses Rhodin’s PTX generator. Due to the similarity between […]

CUDA

Sep, 26

A Uniform Platform to Support Multigenerational GPUs for High Performance Stream-based Computing

GPU-based computing, known as GPGPU, has become a popular field of high-performance computing. This paper focuses on the design and implementation of a uniform GPGPU application that is optimized for both legacy and recent GPU architectures. As a typical example of such a GPGPU application, this paper will discuss […]

CUDA • OpenCL • OpenGL

Sep, 26

High-level GPU computing with Jacket for MATLAB and C/C++

We describe a software platform for the rapid development of general purpose GPU (GPGPU) computing applications within the MATLAB computing environment, C, and C++: Jacket. Jacket provides thousands of GPU-tuned function syntaxes within MATLAB, C, and C++, including linear algebra, convolutions, reductions, and FFTs as well as signal, image, statistics, and graphics libraries. Additionally, Jacket […]

CUDA • OpenGL

Sep, 26

From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming

In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a […]

CUDA • OpenCL

Sep, 26

Identifying scalar behavior in CUDA kernels

We propose a compiler analysis pass for programs expressed in the Single Program, Multiple Data (SPMD) programming model. It statically identifies several kinds of regular patterns that can occur between adjacent threads, including common computations, memory accesses at consecutive locations or at the same location, and uniform control flow. This knowledge can be exploited by […]

CUDA

Sep, 26

Putting Automatic Polyhedral Compilation for GPGPU to Work

Automatic parallelization is becoming more important as parallelism becomes ubiquitous. The first step for achieving automation is to develop a theoretical foundation, for example, the polyhedron model. The second step is to implement the algorithms studied in the theoretical framework and get them to work in a compiler that can be used to parallelize real […]

CUDA

Sep, 26

Running unstructured grid-based CFD solvers on modern graphics hardware

Techniques used to implement an unstructured grid solver on modern graphics hardware are described. The three-dimensional Euler equations for inviscid, compressible flow are considered. Effective memory bandwidth is improved by reducing total global memory access and overlapping redundant computation, as well as using an appropriate numbering scheme and data layout. The applicability of per-block shared […]

CUDA

Sep, 25

Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs

We present a study of three important kernels that occur frequently in iterative statistical applications: K-Means, Multi-Dimensional Scaling (MDS), and PageRank. We implemented each kernel using OpenCL and evaluated their performance on an NVIDIA Tesla GPGPU card. By examining the underlying algorithms and empirically measuring the performance of various components of the kernel we explored […]

OpenCL

Sep, 25

Exploiting Heterogeneous Computing Platforms By Cataloging Best Solutions For Resource Intensive Seismic Applications

Large heterogeneous data centers of today lack methods to appraise the best fitting solutions regarding, among others, hardware acquisition cost, development time, and performance. Especially resource intensive applications benefit from increased data center utilization to leverage heterogeneous resources and accelerators. In this paper, we implement various methods to accelerate a seismic modeling application, which is […]

OpenCL

Sep, 25

Harnessing the Power of GPUs without Losing Abstractions in SaC and ArrayOL: A Comparative Study

Over recent years, using Graphics Processing Units (GPUs) has become an effective method for increasing the performance of many applications. However, these performance benefits from GPUs come at a price. First, extensive programming expertise and intimate knowledge of the underlying hardware are essential for gaining good speedups. Second, the expressibility of GPU-based programs is […]

CUDA • OpenCL

Sep, 25

Accelerating image recognition on mobile devices using GPGPU

The future multi-modal user interfaces of battery-powered mobile devices are expected to require computationally costly image analysis techniques. Graphics Processing Units are very well suited for parallel processing, and the addition of programmable stages and high-precision arithmetic provides opportunities to implement complete, energy-efficient algorithms. At the moment the […]

OpenCL • OpenGL

