high performance computing on graphics processing units: hgpu.org

Posts

Oct, 8

GPU Concurrency Choices in Graph Analytics

Graph analytics is becoming ever more ubiquitous in today’s world. However, situational dynamic changes in input graphs, such as changes in traffic and weather patterns, lead to variations in concurrency. Moreover, graph algorithms are known to have data-dependent loops and fine-grain synchronization that make them hard to scale on parallel machines. Recent trends in […]

OpenCL

Oct, 8

BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images

In cryo-electron microscopy (EM), molecular structures are determined from large numbers of projection images of individual particles. To harness the full power of this single-molecule information, we use the Bayesian inference of EM (BioEM) formalism. By ranking structural models using posterior probabilities calculated for individual images, BioEM in principle addresses the challenge of working with […]

CUDA

Oct, 8

Rinnegan: Efficient Resource Use in Heterogeneous Architectures

Current processors provide a variety of different processing units to improve performance and power efficiency. For example, ARM’s big.LITTLE, AMD’s APUs, and Oracle’s M7 provide heterogeneous processors, on-die GPUs, and on-die accelerators. However, the performance experienced by programs using these processing units can vary widely due to contention from multiprogramming, thermal constraints and other issues. […]

OpenCL

Oct, 8

A Runtime Controller for OpenCL Applications on Heterogeneous System Architectures

Heterogeneous architectures nowadays are becoming very attractive in the embedded and mobile markets thanks to the possibility to exploit the best computational resource to optimize the performance per Watt figure of merit. Unfortunately, deciding the right resource to use and its operating frequency is a difficult problem that depends on the actual conditions in which […]

OpenCL

Oct, 4

Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU

Sparse matrix-vector multiplication (SpMV) is an important operation in computational science, and needs to be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) format on the GPU. Our method dynamically assigns different […]

CUDA
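For readers unfamiliar with the CSR layout mentioned in the abstract above, here is a minimal sequential sketch of a CSR sparse matrix-vector product. It is not the paper's algorithm (the abstract is truncated and its dynamic assignment scheme is not shown here); it is only the textbook row-by-row baseline. The names `csr_spmv`, `row_ptr`, `col_idx`, and `values`, and the small example matrix, are assumptions made for illustration.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i,
    col_idx[k] is the column of the k-th nonzero, values[k] its value.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 4x3 example matrix (row 2 is empty):
# [[10,  0,  0],
#  [ 0, 20, 30],
#  [ 0,  0,  0],
#  [40,  0, 50]]
row_ptr = [0, 1, 3, 3, 5]
col_idx = [0, 1, 2, 0, 2]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)
```

The inner loop length varies from row to row, which is why assigning work to GPU threads for CSR is a load-balancing problem in the first place.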

Oct, 4

APL on GPUs: A TAIL from the Past, Scribbled in Futhark

This paper demonstrates translation schemes by which programs written in a functional subset of APL can be compiled to code that is run efficiently on general purpose graphical processing units (GPGPUs). Furthermore, the generated programs can be straightforwardly interoperated with mainstream programming environments, such as Python, for example for purposes of visualization and user interaction. […]

OpenCL

Oct, 4

Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have gained significant traction in the field of machine learning, particularly due to their high accuracy in visual recognition. Recent works have pushed the performance of GPU implementations of CNNs to significantly improve their classification and training times. With these improvements, many frameworks have become available for implementing CNNs on both […]

OpenCL

Oct, 4

Explicit Fourth-Order Runge-Kutta Method on Intel Xeon Phi Coprocessor

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge-Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors’ own implementation, both using […]
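As a reminder of what the explicit fourth-order Runge-Kutta method computes, the following is a minimal pure-Python sketch of one classical RK4 step, applied to a tiny Markov chain. It is not the paper's Xeon Phi or Intel MKL implementation; the function names `rk4_step` and `markov_rhs` and the two-state generator matrix `Q` are hypothetical, chosen only to illustrate the method on the kind of problem the abstract mentions.

```python
def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) by one classical RK4 step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
    k4 = f(t + h, [yi + h * k for yi, k in zip(y, k3)])
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


# Hypothetical two-state generator matrix (each row sums to zero).
Q = [[-1.0, 1.0],
     [2.0, -2.0]]


def markov_rhs(t, p):
    """Right-hand side of dp/dt = p Q, with p a row vector of probabilities."""
    n = len(p)
    return [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]


p = [1.0, 0.0]  # start in state 0 with probability 1
t, h = 0.0, 0.01
for _ in range(1000):  # integrate up to t = 10
    p = rk4_step(markov_rhs, t, p, h)
    t += h

print(p)  # approaches the stationary distribution of this chain
```

In a real Markovian model the matrix would be large and very sparse, so the product `p Q` inside each stage is a sparse matrix-vector multiplication rather than the dense double loop used here.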

Oct, 4

Training a Feedback Loop for Hand Pose Estimation

We propose an entirely data-driven approach to estimating the 3D pose of a hand given a depth image. We show that we can correct the mistakes made by a Convolutional Neural Network trained to predict an estimate of the 3D pose by using a feedback loop. The components of this feedback loop are also Deep […]

CUDA

Sep, 30

GPU-based timetable generation

Throughout an academic year, educational institutions need to generate hundreds of different timetables; this complex task demands a considerable amount of time and human resources. In the past, timetables were generated by hand; today, as the task grows more complex, it is performed by specialized software that reduces time and cost. Since nearly 10 years […]

CUDA

Sep, 30

Programming Models and Tools for Many-Core Platforms

The negotiation between power consumption, performance, programmability, and portability drives all computing industry designs, in particular the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harness the computational power of heterogeneous many-core SoC. This thesis presents […]

OpenCL

Sep, 30

Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability

This paper presents a theoretical analysis and practical evaluation of the main bottlenecks towards a scalable distributed solution for the training of Deep Neuronal Networks (DNNs). The presented results show that the current state-of-the-art approach, using data-parallelized Stochastic Gradient Descent (SGD), is quickly turning into a heavily communication-bound problem. In addition, […]

CUDA
