high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability

Towards systematic exploration of tradeoffs for medical image registration on heterogeneous platforms

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems

Towards Unified Analysis of GPU Consistency

Towards Unified INT8 Training for Convolutional Neural Network

Towards user transparent parallel multimedia computing on GPU-clusters

Towards Utilizing GPUs in Information Visualization: A Model and Implementation of Image-Space Operations

Towards Utilizing Remote GPUs for CUDA Program Execution

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

Track finding in ATLAS using GPUs

Tracking 3d Pose of Rigid Object by Sparse Template Matching

Tracking and Clustering Salient Features in Image Sequences

Tracking humans interacting with the environment using efficient hierarchical sampling and layered observation models

Tracking Many Solution Paths of a Polynomial Homotopy on a Graphics Processing Unit

Tradeoff analysis and optimization of power delivery networks with on-chip voltage regulation

Tradeoffs in designing accelerator architectures for visual computing

Trainable Nonlinear Reaction Diffusion: A Flexible Framework for Fast and Effective Image Restoration

Training a Feedback Loop for Hand Pose Estimation

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Training DNN Models over Heterogeneous Clusters with Optimal Performance

Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)

Training Neural Networks Without Gradients: A Scalable ADMM Approach

Tranformation of CPU-based Applications To Leverage on Graphics Processors using CUDA

TransAxx: Efficient Transformers with Approximate Computing

TransCAIP: A Live 3D TV System Using a Camera Array and an Integral Photography Display with Interactive Control of Viewing Parameters

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework

Transfer Time Reduction of Data Transfers between CPU and GPU

Transform Coding for Hardware-accelerated Volume Rendering

Transformation of Scientific Algorithms to Parallel Computing Code: Single GPU and MPI multi GPU Backends with Subdomain Support

Transformations of High-Level Synthesis Codes for High-Performance Computing

Transforming and Optimizing Irregular Applications for Parallel Architectures

Transforming C OpenMP Programs for Verification in CIVL

Translating GPU binaries to tiered SIMD architectures with Ocelot

Translating OpenMP Device Constructs to OpenCL using Unnecessary Data Transfer Elimination

Translation-invariant two-dimensional discrete wavelet transform on graphics processing units

Transparent Acceleration for Heterogeneous Platforms With Compilation to OpenCL

Transparent Acceleration of Java-based Deep Learning Engines

Transparent Accelerator Migration in a Virtualized GPU Environment

Transparent Checkpoint-Restart for Hardware-Accelerated 3D Graphics

Transparent Checkpointing for OpenGL Applications on GPUs

Transparent Compiler and Runtime Specializations for Accelerating Managed Languages on FPGAs

Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems

Transparent FPGA Acceleration with TensorFlow

Transparent use of Java objects on the GPU in the JaMP/OpenMP framework

Trapping of giant-planet cores – I. vortex aided trapping at the outer dead zone edge

Tree Structured Analysis on GPU Power Study

Treecode and fast multipole method for N-body simulation with CUDA

TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization

Trellis: Portability Across Architectures with a High-level Framework

Tri-Hybrid Computational Fluid Dynamics on DOE’s Cray XK7, Titan

Triangular matrix inversion on Graphics Processing Unit

Triangular mesh simplification on the GPU

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

Trie Compression for GPU Accelerated Multi-Pattern Matching

TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing

triSYCL for Xilinx FPGA

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity

True 4D Image Denoising on the GPU

TRUST: the HPC open-source CFD platform – from CPU to GPU

TTC: A Tensor Transposition Compiler for Multiple Architectures

TuCCompi: A Multi-Layer Programing Model for Heterogeneous Systems with Auto-Tuning Capabilities

Tuned and asynchronous stencil kernels for CPU/GPU systems (thesis)

Tuned and GPU-accelerated parallel data mining from comparable corpora

Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems

Tuning a Finite Difference Computation for Parallel Vector Processors

Tuning A Hybrid GPU-CPU V-cycle Multilevel Preconditioner for Solving Large Real and Complex Systems of FEM Equations

Tuning Manifold Harmonics Filters

Tuning Stencil Codes in OpenCL for FPGAs

Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach

Turbo Bayesian Compressed Sensing

Tutorial 3: Methodologies and Performance Impacts of General Purpose Computing on GPUs

Tutoring LLM into a Better CUDA Optimizer

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM: End-to-End Optimization Stack for Deep Learning

Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors

Two Algorithms for Sorting On Heterogeneous Clusters

Two Approaches to Particle Simulation: OpenMPI and CUDA

Two improved GPU acceleration strategies for force-directed graph layout

Two Level Approach to Efficient Visualization of Protein Dynamics

Two Simple Single-pass GPU methods for Multi-channel Surface Voxelization of Dynamic Scenes

Two Stage Data Mining Technique for Fast Monsoon Onset Prediction

Two-electron integral evaluation on the graphics processor unit

Two-fluid compressible simulations on GPU cluster

Two-Level Approach to Efficient Visualization of Protein Dynamics

Two-stage compression for fast volume rendering of time-varying scalar data

Two-way partitioning of a recursive Gaussian filter in CUDA

Two-Way Real Time Fluid Simulation Using a Heterogeneous Multicore CPU and GPU Architecture

TWQCD’s dynamical DWF project

Type-safe Runtime Code Generation: Accelerate to LLVM

U-Net: Convolutional Networks for Biomedical Image Segmentation

UAV Path Planning with Parallel Genetic Algorithms on CUDA Architecture

uBench: Performance Impact of CUDA Block Geometry

UberFlow: a GPU-based particle engine

Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford

UCHPC – UnConventional High Performance Computing for Finite Element Simulations

Brief statistics for this page

Titles: 100

Download open PDFs: 92

Packages: 28

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
