Papers on hgpu.org (.txt-file)
Theano-MPI: a Theano-based Distributed Training Framework

Theano: A CPU and GPU Math Compiler in Python

Theano: A Python framework for fast computation of mathematical expressions

Theano: Deep Learning on GPUs with Python

TheanoLM – An Extensible Toolkit for Neural Network Language Modeling

Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

Theoretical and Numerical Analysis of Three Approaches to the GPGPU Application of the Explicit FDTD Method

Theory of square, rectangular, and microband electrodes through explicit GPU simulation

Thermal and Athermal Swarms of Self-Propelled Particles

Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms

Theseus: A Library for Differentiable Nonlinear Optimization

Thickness computation of trimmed B-Rep model using GPU ray tracing

THOR: A New and Flexible Global Circulation Model to Explore Planetary Atmospheres

THOR: A Transparent Heterogeneous Open Resource framework

Thorough Evaluation of GPU Shared Memory Load and Store Instructions

Thousand core chips: a technology perspective

Thread Block Compaction for Efficient SIMT Control Flow

Thread-safe lattice Boltzmann for high-performance computing on GPUs

Thread-Scalable Evaluation of Multi-Jet Observables

Three Contributions to the Theory and Practice of Optimizing Compilers

Three Dimensional Fast Fourier Transform CUDA Implementation

Three dimensional tracking of gold nanoparticles using digital holographic microscopy

Three storage formats for sparse matrices on GPGPUs

Three-Dimension Fountain Simulation Based on GPU and Particle System

Three-Dimensional Image Warping on Programmable Graphics Hardware

Three-dimensional LBM simulations of buoyancy-driven flow using Graphics processing units

Three-Dimensional Modeling of Long-Wave Runup: Simulation of Tsunami Inundation with GPU-SPHysics

Throughput-Effective On-Chip Networks for Manycore Accelerators

Throughput-Oriented Analytical Models for Performance Estimation on Programmable Hardware Accelerators

ThunderGBM: Fast GBDTs and Random Forests on GPUs

ThunderSVM: A Fast SVM Library on GPUs and CPUs

Thwarting Piracy: Anti-debugging Using GPU-assisted Self-healing Codes

Tight Binding Molecular Dynamics on CPU and GPU clusters

Tile Based Procedural Terrain Generation in Real-Time

Tile-based Lightweight Integer Compression in GPU

Tiled QR Decomposition and Its Optimization on CPU and GPU Computing System

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

Tiling for Performance Tuning on Different Models of GPUs

Tiling optimizations for stencil computations

Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation

Time dependent simulation of the Driven Lid Cavity at High Reynolds Number

Time Predictability of GPU Kernel on an HSA Compliant Platform

Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project

Time-stepping methods for the simulation of the self-assembly of nano-crystals in Matlab on a GPU

Time-varying clustering for local lighting and material design

TimeGraph: GPU scheduling for real-time multi-tasking environments

Tinker-HP: Accelerating Molecular Dynamics Simulations of Large Complex Systems with Advanced Point Dipole Polarizable Force Fields using GPUs and Multi-GPUs systems

TinyDL: Just-In-Time Deep Learning Solution For Constrained Embedded Systems

Tiramisu: A Code Optimization Framework for High Performance Systems

Titan: A Parallel Asynchronous Library for Multi-Agent and Soft-Body Robotics using NVIDIA CUDA

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning

tntorch: Tensor Network Learning with PyTorch

To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures

To GPU Synchronize or Not GPU Synchronize?

To Use or Not to Use: Graphics Processing Units for Pattern Matching Algorithms

Togpu: Automatic Source Transformation from C++ to CUDA using Clang/LLVM

TonY: An Orchestrator for Distributed Machine Learning Jobs

Toolchain for programming, simulating and studying the XMT many-core architecture

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

Tools for GPU Computing – Debugging and Performance Analysis of Heterogenous HPC Applications

Tools for GPU Computing–Debugging and Performance Analysis of Heterogenous HPC Applications

Tools for Reduced Precision Computation: A Survey

Top ten ways to make formal methods for HPC practical

Top-k Queries Processing With Uncertain Data on Graphics Processing Units

Topical perspective on massive threading and parallelism

TopicBERT for Energy Efficient Document Classification

Topology optimization design of 3D electrothermomechanical actuators by using GPU as a co-processor

Topology Optimization with Unstructured Meshes on Graphics Processing Units (GPUs)

Torch7: A Matlab-like Environment for Machine Learning

TorchAudio: Building Blocks for Audio and Speech Processing

TorchBench: Benchmarking PyTorch with High API Surface Coverage

Torchnet: An Open-Source Platform for (Deep) Learning Research

torchode: A Parallel ODE Solver for PyTorch

TorchOpt: An Efficient Library for Differentiable Optimization

TorchQC – A framework for efficiently integrating machine and deep learning methods in quantum dynamics and control

Toward a Generic Hybrid CPU-GPU Parallelization of Divide-and-Conquer Algorithms

Toward a GPU-Accelerated Immersed Boundary Method for Wind Forecasting Over Complex Terrain

Toward a Multi-level Parallel Framework on GPU Cluster with PetSC-CUDA for PDE-based Optical Flow Computation

Toward a multicore architecture for real-time ray-tracing

Toward a Practical Implementation of Exemplar-Based Noise Robust ASR

Toward Accelerating the Matrix Inversion Computation of Symmetric Positive-Definite Matrices on Heterogeneous GPU-Based Systems

Toward Acceleration of RSA Using 3D Graphics Hardware

Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks

Toward Auto-tuned Krylov Basis Computations with minimized Communication on Clusters of Accelerators

Toward Automatic Translation: From OpenACC to OpenMP 4

Toward Better Computation Models for Modern Machines

Toward efficient GPU-accelerated N-body simulations

Toward GPU Accelerated Data Stream Processing

Toward GPU-accelerated Traffic Simulation and Its Real-Time Challenge

Toward GPUs being mainstream in analytic processing: An initial argument using simple scan-aggregate queries

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs

Toward improved aeromechanics simulations using recent advancements in scientific computing

Toward large-scale Hybrid Monte Carlo simulations of the Hubbard model on graphics processing units

Toward OpenCL Automatic Multi-Device Support

Toward optimised skeletons for heterogeneous parallel architecture with performance cost model

Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions

Toward Practical Real-Time Photon Mapping: Efficient GPU Density Estimation

Titles: 100
open PDFs: 96
packages: 32
