high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Tinker-HP: Accelerating Molecular Dynamics Simulations of Large Complex Systems with Advanced Point Dipole Polarizable Force Fields using GPUs and Multi-GPUs systems

TinyDL: Just-In-Time Deep Learning Solution For Constrained Embedded Systems

Tiramisu: A Code Optimization Framework for High Performance Systems

Titan: A Parallel Asynchronous Library for Multi-Agent and Soft-Body Robotics using NVIDIA CUDA

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning

tntorch: Tensor Network Learning with PyTorch

To Co-Run, or Not To Co-Run: A Performance Study on Integrated Architectures

To GPU Synchronize or Not GPU Synchronize?

To Use or Not to Use: Graphics Processing Units for Pattern Matching Algorithms

Togpu: Automatic Source Transformation from C++ to CUDA using Clang/LLVM

TonY: An Orchestrator for Distributed Machine Learning Jobs

Toolchain for programming, simulating and studying the XMT many-core architecture

Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

Tools for GPU Computing – Debugging and Performance Analysis of Heterogenous HPC Applications

Tools for Reduced Precision Computation: A Survey

Top ten ways to make formal methods for HPC practical

Top-k Queries Processing With Uncertain Data on Graphics Processing Units

Top-Performance Tokenization and Small-Ruleset Regular Expression Matching: A Quantitative Performance Analysis and Optimization Study on the Cell/B.E. Processor

Topical perspective on massive threading and parallelism

TopicBERT for Energy Efficient Document Classification

Topology optimization design of 3D electrothermomechanical actuators by using GPU as a co-processor

Topology Optimization with Unstructured Meshes on Graphics Processing Units (GPUs)

Torch7: A Matlab-like Environment for Machine Learning

TorchAudio: Building Blocks for Audio and Speech Processing

TorchBench: Benchmarking PyTorch with High API Surface Coverage

Torchnet: An Open-Source Platform for (Deep) Learning Research

torchode: A Parallel ODE Solver for PyTorch

TorchOpt: An Efficient Library for Differentiable Optimization

TorchQC – A framework for efficiently integrating machine and deep learning methods in quantum dynamics and control

Toward a Generic Hybrid CPU-GPU Parallelization of Divide-and-Conquer Algorithms

Toward a GPU-Accelerated Immersed Boundary Method for Wind Forecasting Over Complex Terrain

Toward a Multi-level Parallel Framework on GPU Cluster with PetSC-CUDA for PDE-based Optical Flow Computation

Toward a multicore architecture for real-time ray-tracing

Toward a Practical Implementation of Exemplar-Based Noise Robust ASR

Toward Accelerating the Matrix Inversion Computation of Symmetric Positive-Definite Matrices on Heterogeneous GPU-Based Systems

Toward Acceleration of RSA Using 3D Graphics Hardware

Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks

Toward Auto-tuned Krylov Basis Computations with minimized Communication on Clusters of Accelerators

Toward Automatic Translation: From OpenACC to OpenMP 4

Toward Better Computation Models for Modern Machines

Toward efficient GPU-accelerated N-body simulations

Toward GPU Accelerated Data Stream Processing

Toward GPU-accelerated Traffic Simulation and Its Real-Time Challenge

Toward GPUs being mainstream in analytic processing: An initial argument using simple scan-aggregate queries

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs

Toward improved aeromechanics simulations using recent advancements in scientific computing

Toward large-scale Hybrid Monte Carlo simulations of the Hubbard model on graphics processing units

Toward OpenCL Automatic Multi-Device Support

Toward optimised skeletons for heterogeneous parallel architecture with performance cost model

Toward Performance Portability for CPUs and GPUs Through Algorithmic Compositions

Toward Practical Real-Time Photon Mapping: Efficient GPU Density Estimation

Toward Real-Time Dense 3d Reconstruction using Stereo Vision

Toward real-time kernel density estimate display for instrumentation

Towards a Benchmarking Suite for Kernel Tuners

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Towards a Distributed GPU-Accelerated Matrix Inversion

Towards a functional run-time for dense NLA domain

Towards a GPU-based Implementation of Interaction Nets

Towards a GPU-Based Simulation Framework for Deformable Surface Meshes

Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core

Brief statistics for this page

Titles: 100

Download open PDFs: 95

Packages: 29

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)