high performance computing on graphics processing units: hgpu.org

Posts

Sep, 27

A Novel GPU-based Parallel Implementation Scheme and Performance Analysis of Robot Forward Dynamics Algorithms

We propose a novel unifying scheme for parallel implementation of articulated robot dynamics algorithms. It is based on a unified Lie group notation for deriving the equations of motion of articulated robots, where various well-known forward algorithms differ only by their joint inertia matrix inversion strategies. This new scheme leads to a unified abstraction of […]

CUDA

Sep, 27

Solving Batched Linear Programs on GPU and Multicore CPU

Linear Programs (LPs) appear in a large number of applications, and offloading them to the GPU is a viable way to gain performance. Existing work on offloading and solving an LP on GPU suggests that performance is gained from large LPs (typically 500 constraints, 500 variables and above). In order to gain performance from GPU for […]

CUDA

Sep, 22

Tuning Stencil Codes in OpenCL for FPGAs

OpenCL is designed as a parallel programming framework to support heterogeneous computing platforms. The implicit or explicit parallelism in OpenCL kernel code enables efficient FPGA implementation from a high-level programming abstraction. However, FPGA architecture is completely different from GPU architecture, for which OpenCL is widely used. Tuning OpenCL codes to achieve high performance on FPGAs […]

OpenCL

Sep, 22

Bridging the Semantic Gaps of GPU Acceleration for Scaleout CNN-based Big Data Processing: Think Big, See Small

Convolutional Neural Networks (CNNs) have substantially advanced the state-of-the-art accuracy of object recognition, which is the core function of a myriad of modern multimedia processing techniques such as image/video processing, speech recognition, and natural language processing. GPU-based accelerators have gained increasing attention because the large number of highly parallel neurons in a CNN naturally matches the GPU […]

CUDA

Sep, 22

Characterization of Speech Recognition Systems on GPU Architectures

Automatic speech recognition is one of the most important applications in the area of cognitive computing. Mobile devices, such as smartphones, have incorporated speech recognition as one of the main interfaces for user interaction. This trend towards voice-based user interfaces is likely to continue in the coming years. Effective speech recognition systems require real-time recognition, […]

CUDA

Sep, 22

Efficient dictionary learning implementation on the GPU using OpenCL

The dictionary learning field offers a wide range of algorithms that are able to provide good sparse approximations and well trained dictionaries. These algorithms are very complex and this is reflected in the slow execution of their computationally intensive implementations. This article proposes efficient parallel implementations for the main algorithms in the field that significantly […]

OpenCL

Sep, 22

MCS 572: Introduction to Supercomputing

The goal of the course is to study parallel algorithms and their implementation on distributed and shared memory computers, using message passing, OpenMP, and threads. In the second half of the course we will consider general purpose graphics processing units. Prerequisites are a working knowledge of C (or willingness to acquire programming skills) and a […]

CUDA

Sep, 20

Acceleration of Block-Aware Matrix Factorization on Heterogeneous Platforms

Block-structured matrices arise in several contexts in circuit simulation problems. These matrices typically inherit the pattern of sparsity from the circuit connectivity. However, they are also characterized by dense spots or blocks. Direct factorization of those matrices has emerged as an attractive approach if the host memory is sufficiently large to store the block-structured matrix. […]

OpenCL

Sep, 20

Parallel Computational Fluid Dynamics With the Intel Xeon Phi Coprocessor

The Intel Xeon Phi coprocessor is a PCI Express form factor card designed to work in tandem with Intel Xeon processors to allow faster execution of highly parallelizable code. Efficient execution of highly parallel applications is achieved through the use of many smaller, lower-clock-speed cores, allowing for many more simultaneous execution […]

Sep, 20

A Compiler for Throughput Optimization of Graph Algorithms on GPUs

Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations, called throughput optimizations, are key to high performance for this application class. These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand. To address this problem, we have implemented these optimizations […]

CUDA

Sep, 20

Feynman Machine: The Universal Dynamical Systems Computer

Efforts at understanding the computational processes in the brain have met with limited success, despite their importance and potential uses in building intelligent machines. We propose a simple new model which draws on recent findings in Neuroscience and the Applied Mathematics of interacting Dynamical Systems. The Feynman Machine is a Universal Computer for Dynamical Systems, […]

OpenCL

Sep, 20

Runtime Support for Adaptive Power Capping on Heterogeneous SoCs

Power capping is a fundamental method for reducing the energy consumption of a wide range of modern computing environments, ranging from mobile embedded systems to datacentres. Unfortunately, maximising performance and system efficiency under static power caps remains challenging, while maximising performance under dynamic power caps has been largely unexplored. We present an adaptive power capping […]

OpenCL
