high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Systematic Physics Constrained Parameter Estimation of Stochastic Differential Equations

SystemC simulation on GP-GPUs: CUDA vs. OpenCL

Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning

Tabu Search on GPU

Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform

Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

Tactics to Directly Map CNN graphs on Embedded FPGAs

Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures

Takagi Factorization on GPU using CUDA

Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

Taking the graphics processor beyond graphics

Taming irregular EDA applications on GPUs

Taming the complexities of the C11 and OpenCL memory models

Tamp: A Library for Compact Deep Neural Networks with Structured Matrices

Tangible video teleconference system using real-time image-based relighting

Tango: A Deep Neural Network Benchmark Suite for Various Accelerators

Tangram: a High-level Language for Performance Portable Code Synthesis

Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization

TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture

Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics processors

Tapping the supercomputer under your desk: solving dynamic equilibrium models with graphics processors?

Target Marker: A Visual Marker for Long Distances and Detection in Realtime on Mobile Devices

targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance

Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles

Targeting GPUs with OpenMP Directives on Summit: A Simple and Effective Fortran Experience

Targeting heterogeneous architectures via macro data flow

Task and Data Distribution in Hybrid Parallel Systems

Task management for irregular-parallel workloads on the GPU

Task migration of DSP application specified with a DFG and implemented with the BSP computing model on a CPU-GPU cluster

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Task Parallelism and Data Distribution: An Overview of Explicit Parallel Programming Languages

Task Parallelism and Synchronization: An Overview of Explicit Parallel Programming Languages

Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge

Task Partition Comparison between Multi-core System and GPU

Task Performance with List-Mode Data

Task Scheduling for Heterogeneous Multicore Systems

Task scheduling in hybrid CPU-GPU systems

Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment

Task Superscalar: An Out-of-Order Task Pipeline

Task superscalar: using processors as functional units

Task-based Conjugate-Gradient for multi-GPUs platforms

Task-based FMM for heterogeneous architectures

Task-Based Parallel Strategies for CFD Application in Heterogeneous CPU/GPU Resources

Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA

TBD: Benchmarking and Analyzing Deep Neural Network Training

TC-CIM: Empowering Tensor Comprehensions for Computing-In-Memory

tcFFT: Accelerating Half-Precision FFT through Tensor Cores

TCUDB: Accelerating Database with Tensor Processors

TDDFT in massively parallel computer architectures: the OCTOPUS project

Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code Translation

Teaching cardiac electrophysiology modeling to undergraduate students: laboratory exercises and GPU programming for the study of arrhythmias and spiral wave dynamics

Teaching graphics processing and architecture using a hardware prototyping approach

Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure

Teaching Parallel Programming Models on a Shallow-Water Code

Teaching Parallel Programming Using Java

Technical aspects of the GPU accelerated surgical simulator

Technical Report about Tiramisu: a Three-Layered Abstraction for Hiding Hardware Complexity from DSL Compilers

Techniques for designing GPGPU games

Techniques for efficient DCT/IDCT implementation on generic GPU

Techniques for efficient, real-time, 3D visualization of multi-modality cardiac data using consumer graphics hardware

Techniques for Mapping Synthetic Aperture Radar Processing Algorithms to Multi-GPU Clusters

Techniques to maximize memory bandwidth on the Rigel compute accelerator

TEDI: efficient shortest path query answering on graphs

TEG: GPU Performance Estimation Using a Timing Model

Telekine: Secure Computing with Cloud GPUs

Template Library for Multi-GPU Pseudorandom Number Recursion-based Generators

Temporal Blending for Adaptive SPH

Temporally Consistent Disparity and Optical Flow via Efficient Spatio-temporal Filtering

Temporospatial Epidemic Simulations Using Heterogeneous Computing

TENSILE: A Tensor granularity dynamic GPU memory scheduler method towards multiple dynamic workloads system

Tensor Computation Based on Heterogeneous Memory

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Tensor Processing Units for Financial Monte Carlo

Tensor Voting Accelerated by Graphics Processing Units (GPU)

TensorFlow Doing HPC

TensorFlow: A system for large-scale machine learning

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow.js: Machine Learning for the Web and Beyond

TensorNetwork for Machine Learning

TensorNetwork: A Library for Physics and Machine Learning

Tera-scale Astronomical Data Analysis and Visualization

TeraFLOP computing on a desktop PC with GPUs for 3D CFD

Teraflop per second gravitational lensing ray-shooting using graphics processing units

Termination Analysis for GPU Kernels

TESLA GPUs versus MPI with OpenMP for the Forward Modeling of Gravity and Gravity Gradient of Large Prisms Ensemble

Tesla vs. Xeon Phi vs. Radeon: A Compiler Writer’s Perspective

Test-driving Intel Xeon Phi

Testing and Exposing Weak Graphics Processing Unit Memory Models

Testing and Mutation Testing for GPU Kernels

Testing fine-grained parallelism for the ADMM on a factor-graph

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

Testing Tesla architecture for scientific computing: The performance of matrix-vector product

Tetrahedral Interpolation for Deformable Image Registration on GPUs

Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

Texture Cache Approximation on GPUs

Texture compression of light maps using smooth profile functions

Brief statistics for this page

Titles: 100

Download open PDFs: 94

Package packages: 31

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs


Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
