Papers on hgpu.org (.txt-file)

Trellis: Portability Across Architectures with a High-level Framework

Tri-Hybrid Computational Fluid Dynamics on DOE’s Cray XK7, Titan

Triangular matrix inversion on Graphics Processing Unit

Triangular mesh simplification on the GPU

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

Trie Compression for GPU Accelerated Multi-Pattern Matching

TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity

True 4D Image Denoising on the GPU

TRUST: the HPC open-source CFD platform – from CPU to GPU

TTC: A Tensor Transposition Compiler for Multiple Architectures

TuCCompi: A Multi-Layer Programing Model for Heterogeneous Systems with Auto-Tuning Capabilities

Tuned and asynchronous stencil kernels for CPU/GPU systems (thesis)

Tuned and GPU-accelerated parallel data mining from comparable corpora

Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems

Tuning a Finite Difference Computation for Parallel Vector Processors

Tuning Manifold Harmonics Filters

Tuning Stencil Codes in OpenCL for FPGAs

Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach

Turbo Bayesian Compressed Sensing

Tutorial 3: Methodologies and Performance Impacts of General Purpose Computing on GPUs

Tutoring LLM into a Better CUDA Optimizer

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM: End-to-End Optimization Stack for Deep Learning

Two Algorithms for Sorting On Heterogeneous Clusters

Two Approaches to Particle Simulation: OpenMPI and CUDA

Two improved GPU acceleration strategies for force-directed graph layout

Two Level Approach to Efficient Visualization of Protein Dynamics

Two Simple Single-pass GPU methods for Multi-channel Surface Voxelization of Dynamic Scenes

Two Stage Data Mining Technique for Fast Monsoon Onset Prediction

Two-electron integral evaluation on the graphics processor unit

Two-fluid compressible simulations on GPU cluster

Two-Level Approach to Efficient Visualization of Protein Dynamics

Two-stage compression for fast volume rendering of time-varying scalar data

Two-way partitioning of a recursive Gaussian filter in CUDA

Two-Way Real Time Fluid Simulation Using a Heterogeneous Multicore CPU and GPU Architecture

Type-safe Runtime Code Generation: Accelerate to LLVM

U-Net: Convolutional Networks for Biomedical Image Segmentation

UAV Path Planning with Parallel Genetic Algorithms on CUDA Architecture

uBench: Performance Impact of CUDA Block Geometry

UberFlow: a GPU-based particle engine

Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford

UCHPC – UnConventional High Performance Computing for Finite Element Simulations

Ultra-Fast Detection of Higher-Order Epistatic Interactions on GPUs

Ultra-Fast Displaying Spectral Domain Optical Doppler Tomography System Using a Graphics Processing Unit

Ultra-fast FFT protein docking on graphics processors

Ultra-Fast Hybrid CPU-GPU Multiple Scatter Simulation for 3D PET

Ultra-fast treatment plan optimization for volumetric modulated arc therapy (VMAT)

Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml

Ultrasound goes GPU: real-time simulation using CUDA

Ultrasound Image Simulation with GPU-based Ray Tracing

Uncertainty-Aware Guided Volume Segmentation

Uncluttering Graph Layouts Using Anisotropic Diffusion and Mass Transport

Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units

Under the Hood of SYCL – An Initial Performance Analysis With an Unstructured-mesh CFD Application

Understanding and Modeling the Synchronization Cost in the GPU Architecture

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures

Understanding GPU Triggering APIs for MPI+X Communication

Understanding GPU-Based Lossy Compression for Extreme-Scale Cosmological Simulations

Understanding Latency Hiding on GPUs

Understanding Performance Portability of Bioinformatics Applications in SYCL on an NVIDIA GPU

Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models

Understanding software approaches for GPGPU reliability

Understanding the Costs of Many-Task Computing Workloads on Intel Xeon Phi Coprocessors

Understanding the design trade-offs among current multicore systems for numerical computations

Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Understanding the efficiency of ray traversal on GPUs

Understanding the impact of CUDA tuning techniques for Fermi

Understanding the Impact of Hybrid Programming on Software Energy Efficiency

Understanding the Impact of Input Entropy on FPU, CPU, and GPU Power

Understanding the ISA impact on GPU Architecture

Understanding the Landscape of Ampere GPU Memory Errors

Understanding the Performance of HPC Applications

Understanding the Power of Evolutionary Computation for GPU Code Optimization

Understanding the SIMD Efficiency of Graph Traversal on GPU

Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts

Unfolding and Shrinking Neural Machine Translation Ensembles

UNICORN: A Bulk Synchronous Programming Model, Framework and Runtime for Hybrid CPU-GPU Clusters

Unified – A Sharp Turn in the Latest Era of Graphic Processors

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment

Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation

Unified Particle Physics for Real-Time Applications

Unified schemes for directive-based GPU offloading

Unified Shader Programming in C++

Unified Shared Memory: Friend or Foe?

Unified system of code transformation and execution for heterogeneous multi-core architectures

Unified Tables for Exponential and Logarithm Families

UniFL: Accelerating Federated Learning Using Heterogeneous Hardware Under a Unified Framework

Titles: 100
open PDFs: 91
packages: 28
