high performance computing on graphics processing units: hgpu.org

Posts

Jun, 2

MapCG: writing parallel program portable between CPU and GPU

Graphics Processing Units (GPUs) have been playing an important role in the general purpose computing market recently. The common approach to programming GPUs today is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers are required to […]

CUDA

May, 11

Whole-function vectorization

Data-parallel programming languages are an important component in today’s parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and “general-purpose” languages like CUDA or OpenCL. Current implementations of those languages on CPUs rely solely on multi-threading to implement parallelism and ignore the additional intra-core parallelism provided by […]

OpenCL

May, 10

Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors

Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for general purpose computation. Several languages such as Brook, CUDA, and more recently OpenCL are being developed to fully harness the potential of these processors. These languages typically involve the control code running on the CPU and the performance-critical, data-parallel […]

OpenCL

Apr, 11

Simulation of bevel gear cutting with GPGPUs: performance and productivity

The desire for general purpose computation on graphics processing units caused the advance of new programming paradigms, e.g. OpenCL C/C++, CUDA C or the PGI Accelerator Model. In this paper, we apply these programming approaches to the software KegelSpan for simulating bevel gear cutting. This engineering application simulates an important manufacturing process in the automotive […]

CUDA • OpenCL

Apr, 7

Context-aware volume navigation

The trackball metaphor is exploited in many applications where volumetric data needs to be explored. Although it provides an intuitive way to inspect the overall structure of objects of interest, an in-detail inspection can be tedious or, when cavities occur, even impossible. Therefore we propose a context-aware navigation technique for the exploration of volumetric […]

OpenCL

Apr, 7

Practical examples of GPU computing optimization principles

In this paper, we provide examples to optimize signal processing or visual computing algorithms written for SIMT-based GPU architectures. These implementations demonstrate the optimizations for CUDA or its successors OpenCL and DirectCompute. We discuss the effect and optimization principles of memory coalescing, bandwidth reduction, processor occupancy, bank conflict reduction, local memory elimination and instruction optimization. […]

CUDA • OpenCL

Apr, 3

A Light-weight API for Portable Multicore Programming

Multicore nodes have become ubiquitous in just a few years. At the same time, writing portable parallel software for multicore nodes is extremely challenging. Widely available programming models such as OpenMP and Pthreads are not useful for devices such as graphics cards, and more flexible programming models such as RapidMind are only available commercially. OpenCL […]

CUDA • OpenCL

Apr, 3

Mobile visual computing

Summary form only given. I will talk about camera phones, how you can use the camera as a sensor that gives natural access to information about the real world around you (mobile augmented reality), and how you can use general computation capability to combine several input images into better or more interesting output images (mobile […]

OpenCL • OpenGL

Apr, 3

Energy consumption of Graphic Processing Units with respect to automotive use-cases

With the introduction of APIs like CUDA, Stream+ or OpenCL, modern Graphics Processing Units (GPUs) can be easily employed for general purpose computing. Plus, their comparatively low price per GFLOP makes them interesting candidates for coprocessors in future embedded Electronic Control Units (ECUs). Yet, as car manufacturers strive to reduce the Thermal Design Power (TDP) […]

CUDA • OpenCL

Apr, 2

Throughput-Effective On-Chip Networks for Manycore Accelerators

As the number of cores and threads in manycore compute accelerators such as Graphics Processing Units (GPUs) increases, so does the importance of on-chip interconnection network design. This paper explores throughput-effective networks-on-chip (NoCs) for future manycore accelerators that employ bulk-synchronous parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is “throughput-effective” if […]

CUDA

Apr, 2

MARC: A Many-Core Approach to Reconfigurable Computing

We present a Many-core Approach to Reconfigurable Computing (MARC), enabling efficient high-performance computing for applications expressed using parallel programming models such as OpenCL. The MARC system exploits abundant special FPGA resources such as distributed block memories and DSP blocks to implement complete single-chip high efficiency many-core micro architectures. The key benefits of MARC are that […]

OpenCL

Apr, 2

Real-time particle filtering with heuristics for 3D motion capture by monocular vision

Particle filtering is known as a robust approach for motion tracking by vision, at the cost of heavy computation in a high-dimensional pose space. In this work, we describe a number of heuristics that we demonstrate to jointly improve robustness and real-time performance for motion capture. 3D human motion capture by monocular vision without markers […]

OpenCL
