Papers on hgpu.org (.txt-file)
Task Partition Comparison between Multi-core System and GPU

Task Performance with List-Mode Data

Task Scheduling for Heterogeneous Multicore Systems

Task scheduling in hybrid CPU-GPU systems

Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment

Task Superscalar: An Out-of-Order Task Pipeline

Task superscalar: using processors as functional units

Task-based Conjugate-Gradient for multi-GPUs platforms

Task-based FMM for heterogeneous architectures

Task-Based Parallel Strategies for CFD Application in Heterogeneous CPU/GPU Resources

Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA

TBD: Benchmarking and Analyzing Deep Neural Network Training

TC-CIM: Empowering Tensor Comprehensions for Computing-In-Memory

tcFFT: Accelerating Half-Precision FFT through Tensor Cores

TCUDB: Accelerating Database with Tensor Processors

TDDFT in massively parallel computer architectures: the OCTOPUS project

Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code Translation

Teaching cardiac electrophysiology modeling to undergraduate students: laboratory exercises and GPU programming for the study of arrhythmias and spiral wave dynamics

Teaching graphics processing and architecture using a hardware prototyping approach

Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure

Teaching Parallel Programming Models on a Shallow-Water Code

Teaching Parallel Programming Using Java

Technical aspects of the GPU accelerated surgical simulator

Technical Report about Tiramisu: a Three-Layered Abstraction for Hiding Hardware Complexity from DSL Compilers

Techniques for designing GPGPU games

Techniques for efficient DCT/IDCT implementation on generic GPU

Techniques for Mapping Synthetic Aperture Radar Processing Algorithms to Multi-GPU Clusters

Techniques to maximize memory bandwidth on the Rigel compute accelerator

TEDI: efficient shortest path query answering on graphs

TEG: GPU Performance Estimation Using a Timing Model

Telekine: Secure Computing with Cloud GPUs

Template Library for Multi-GPU Pseudorandom Number Recursion-based Generators

Temporal Blending for Adaptive SPH

Temporally Consistent Disparity and Optical Flow via Efficient Spatio-temporal Filtering

Temporospatial Epidemic Simulations Using Heterogeneous Computing

TENSILE: A Tensor granularity dynamic GPU memory scheduler method towards multiple dynamic workloads system

Tensor Computation Based on Heterogeneous Memory

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Tensor Processing Units for Financial Monte Carlo

Tensor Voting Accelerated by Graphics Processing Units (GPU)

TensorFlow: A system for large-scale machine learning

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow.js: Machine Learning for the Web and Beyond

TensorNetwork for Machine Learning

TensorNetwork: A Library for Physics and Machine Learning

Tera-scale Astronomical Data Analysis and Visualization

TeraFLOP computing on a desktop PC with GPUs for 3D CFD

Teraflop per second gravitational lensing ray-shooting using graphics processing units

Termination Analysis for GPU Kernels

TESLA GPUs versus MPI with OpenMP for the Forward Modeling of Gravity and Gravity Gradient of Large Prisms Ensemble

Tesla vs. Xeon Phi vs. Radeon: A Compiler Writer’s Perspective

Testing and Exposing Weak Graphics Processing Unit Memory Models

Testing and Mutation Testing for GPU Kernels

Testing fine-grained parallelism for the ADMM on a factor-graph

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

Testing Tesla architecture for scientific computing: The performance of matrix-vector product

Tetrahedral Interpolation for Deformable Image Registration on GPUs

Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

Texture Cache Approximation on GPUs

Texture compression of light maps using smooth profile functions

Texture-based visualization of uncertainty in flow fields

Texture-Based Visualization of Unsteady 3D Flow by Real-Time Advection and Volumetric Illumination

Texturing and Modeling, Third Edition: A Procedural Approach (The Morgan Kaufmann Series in Computer Graphics)

TH-1: China’s first petaflop supercomputer

The ‘Chimera’: an off-the-shelf CPU/GPGPU/FPGA hybrid computing platform

The 3D Flow Field Around an Embedded Planet

The accelerating implementation of BLAST with stream processor

The Accelerator Wall: Limits of Chip Specialization

The AES Implantation Based on OpenCL for Multi/many Core Architecture

The AGILE library for image reconstruction in biomedical sciences using graphics card hardware acceleration

The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

The AlexNet Moment for Homomorphic Encryption: HCNN, the First Homomorphic CNN on Encrypted Data with GPUs

The Anatomy of High-Performance 2D Similarity Calculations

The ANTAREX Approach to Autotuning and Adaptivity for Energy Efficient HPC Systems

The ANTAREX Domain Specific Language for High Performance Computing

The Application of AI Technology in GPU Scheduling Algorithm Optimization

The Application of CUDA Architecture in Facial Expression Recognition

The application of GPU particle tracing to diffusion tensor field visualization

The Application Perspective: Seeking Productivity and Performance

The Arcane development framework

The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing

The architecture of the DecentVM: towards a decentralized virtual machine for many-core computing

The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product

The Astrophysical Multipurpose Software Environment

The battle of the giants: a case study of GPU vs FPGA optimisation for real-time image processing

The BiConjugate gradient method on GPUs

The Boat Hull Model: Adapting the Roofline Model to Enable Performance Prediction for Parallel Computing

The BondMachine toolkit: Enabling Machine Learning on FPGA

The Bones Source-to-Source Compiler Manual

The Case for Higher Computational Density in the Memory-Bound FDTD Method within Multicore Environments

The case for VOS: the vector operating system

The Celerity High-level API: C++20 for Accelerator Clusters

The Chamomile Scheme: An Optimized Algorithm for N-body simulations on Programmable Graphics Processing Units

The Comparisons of OpenCL and OpenMP Computing Paradigm

Titles: 100
open PDFs: 89
packages: 28
