high performance computing on graphics processing units: hgpu.org

Posts

Aug, 18

Efficient Simulation of Fluid Flow and Transport in Heterogeneous Media Using Graphics Processing Units (GPUs)

Networks of interconnected resistors, springs and beams, or pores are standard models for studying scalar and vector transport processes in heterogeneous materials and media, such as fluid flow in porous media, and conduction, deformations, and electric and dielectric breakdown in heterogeneous solids. The computation time and required memory are two limiting factors that hinder the […]

CUDA

Aug, 18

High Performance Computing via High Level Synthesis

As more and more powerful integrated circuits are appearing on the market, more and more applications, with very different requirements and workloads, are making use of the available computing power. This thesis is in particular devoted to High-Performance Computing applications, where those trends are carried to the extreme. In this domain, the primary aspects to […]

OpenCL

Aug, 18

Visual Analysis Algorithms for Embedded Systems

The main contribution of this thesis is the design and development of an optimized framework for realizing deep neural classifiers on embedded platforms. Deep convolutional networks exhibit unmatched performance in image classification. However, these deep classifiers demand huge computational power and memory storage. That is an issue on embedded devices due to limited […]

CUDA

Aug, 18

SODECL: An Open Source Library for Calculating Multiple Orbits of a System of Stochastic Differential Equations in Parallel

Stochastic differential equations (SDEs) are widely used to model systems affected by random processes. In general, the analysis of an SDE model requires numerical solutions to be generated many times over multiple parameter combinations. However, this process often requires considerable computational resources to be practicable. Due to the embarrassingly parallel nature of the task, devices […]

OpenCL

Aug, 11

Simple Iterative Incompressible Smoothed Particle Hydrodynamics

In this paper, a simple, robust, and general-purpose approach to implementing the Incompressible Smoothed Particle Hydrodynamics (ISPH) method is proposed. The new approach is well suited for implementation on CPUs and GPUs. The method is matrix-free and uses an iterative formulation to set up and solve the pressure Poisson equation. A novel approach is used […]

OpenCL

Aug, 11

A Deep Learning Approach for Automatic Code Optimization in the Tiramisu Compiler

Modern compilers offer more and more code optimization possibilities. This enables better use of sophisticated hardware architectures and available resources in order to accelerate programs. It is difficult to predict which optimizations will be beneficial for a given program, as it depends on the program, the execution environment, interaction with other optimizations, and other factors. […]

CUDA

Aug, 11

Live Migration of FPGA Applications

With the recent and growing trend of Field Programmable Gate Arrays (FPGAs) being deployed in data centers, cloud computing service providers are finding it difficult to manage these devices efficiently because traditional server management concepts and techniques are not yet available for FPGAs. In this thesis, we explore how to bring one of these […]

Aug, 11

Performance Comparison for Neuroscience Application Benchmarks

Researchers within the Human Brain Project and related projects have in the last couple of years expanded their needs for high-performance computing infrastructures. The needs arise from a diverse set of science challenges that range from large-scale simulations of brain models to processing of extreme-scale experimental data sets. The ICEI project, which is in the […]

CUDA

Aug, 11

GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

High-performance graph algorithms are difficult to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic ratio. To address these challenges, GraphBLAS is an innovative, ongoing effort by the […]

CUDA

Aug, 5

Parallelization of Coherent Point Drift for patient registration

Point set registration is a central part of any application where the correspondence between two data point sets is of interest, for instance patient data from medical examinations. There exist numerous algorithms that aim to solve the registration problem, one of which is the Coherent Point Drift (CPD) algorithm. In this thesis a […]

OpenCL

Aug, 5

Mapping a Guided Image Filter on the HARP Reconfigurable Architecture Using OpenCL

Intel recently introduced the Heterogeneous Architecture Research Platform (HARP). In this platform, the Central Processing Unit and a Field-Programmable Gate Array are connected through a high-bandwidth, low-latency interconnect, and both share DRAM. For this platform, Open Computing Language (OpenCL), a High-Level Synthesis (HLS) language, is made available. By making use of HLS, a faster […]

OpenCL

Aug, 5

Incremental Bounded Model Checking of Artificial Neural Networks in CUDA

Artificial neural networks (ANNs) are powerful computing systems employed for various applications due to their ability to generalize and to respond to unexpected inputs and patterns. However, implementations of ANNs for safety-critical systems might lead to failures that are hard to predict in the design phase, since ANNs are highly parallel and their parameters are hard to interpret. Here […]

CUDA
