high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Texture-based visualization of uncertainty in flow fields

Texture-Based Visualization of Unsteady 3D Flow by Real-Time Advection and Volumetric Illumination

Texturing and Modeling, Third Edition: A Procedural Approach (The Morgan Kaufmann Series in Computer Graphics)

TH-1: China’s first petaflop supercomputer

The ‘Chimera’: an off-the-shelf CPU/GPGPU/FPGA hybrid computing platform

The 3D Flow Field Around an Embedded Planet

The Accelerated Universe

The accelerating implementation of BLAST with stream processor

The Accelerator Wall: Limits of Chip Specialization

The AES Implantation Based on OpenCL for Multi/many Core Architecture

The AGILE library for image reconstruction in biomedical sciences using graphics card hardware acceleration

The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

The AlexNet Moment for Homomorphic Encryption: HCNN, the First Homomorphic CNN on Encrypted Data with GPUs

The Anatomy of a Triton Attention Kernel

The Anatomy of High-Performance 2D Similarity Calculations

The ANTAREX Approach to Autotuning and Adaptivity for Energy Efficient HPC Systems

The ANTAREX Domain Specific Language for High Performance Computing

The Application of AI Technology in GPU Scheduling Algorithm Optimization

The Application of CUDA Architecture in Facial Expression Recognition

The application of GPU particle tracing to diffusion tensor field visualization

The Application Perspective: Seeking Productivity and Performance

The Arcane development framework

The Architecture and Evolution of CPU-GPU Systems for General Purpose Computing

The architecture of the DecentVM: towards a decentralized virtual machine for many-core computing

The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product

The Astrophysical Multipurpose Software Environment

The battle of the giants: a case study of GPU vs FPGA optimisation for real-time image processing

The BiConjugate gradient method on GPUs

The Boat Hull Model: Adapting the Roofline Model to Enable Performance Prediction for Parallel Computing

The BondMachine toolkit: Enabling Machine Learning on FPGA

The Bones Source-to-Source Compiler Manual

The Case for Higher Computational Density in the Memory-Bound FDTD Method within Multicore Environments

The case for VOS: the vector operating system

The Celerity High-level API: C++20 for Accelerator Clusters

The Chamomile Scheme: An Optimized Algorithm for N-body simulations on Programmable Graphics Processing Units

The Comparisons of OpenCL and OpenMP Computing Paradigm

The Complete Rank Transform: A Tool for Accurate and Morphologically Invariant Matching of Structures

The computer graphics wars heat up

The conjugate gradient solver accelerated by GPU for solving wave-propagation problems

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

The Correctness Illusion in LLM-Generated GPU Kernels

The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography

The CUDA Handbook: A Comprehensive Guide to GPU Programming

The CUDA implementation of the method of lines for the curvature dependent flows

The CUDA LATCH Binary Descriptor: Because Sometimes Faster Means Better

The DabR – A multitouch system for intuitive 3D scene navigation

The Deep Learning Compiler: A Comprehensive Survey

The density matrix renormalization group algorithm on kilo-processor architectures: implementation and trade-offs

The Design and Implementation of a GPU-enabled Multi-objective Tabu-search Intended for Real World and High-dimensional Applications

The Design and Implementation of a Verification Technique for GPU Kernels

The design and verification of Mumax3

The development and expansion of HOOMD-blue through six years of GPU proliferation

The development of Mellanox/NVIDIA GPUDirect over InfiniBand-a new model for GPU to GPU communications

The discrete dipole approximation code DDscat.C++: features, limitations and plans

The distributed diagonal force decomposition method for parallelizing molecular dynamics simulations

The Distribution of OpenCL Kernel Execution Across Multiple Devices

The Dual-Path Execution Model for Efficient GPU Control Flow

The Dynamical Kernel Scheduler – Part 1

The Ecological Footprint of Neural Machine Translation Systems

The effects of nutrient chemotaxis on bacterial aggregation patterns with non-linear degenerate cross diffusion

The Fast and Wideband MoM Based on GPU and Two-Path AFS Acceleration

The fast evaluation of hidden Markov models on GPU

The fast multipole method on parallel clusters, multicore processors, and graphics processing units

The Fast Multipole Method on the Cell processor

The Fat-Link Computation On Large GPU Clusters for Lattice QCD

The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming

The FFT on a GPU

The Flocking Based and GPU Accelerated Internet Traffic Classification

The Framework and Compilation Techniques for Directive-based GPU Cluster Programming

The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

The Future in Mobile Multicore Computing

The Future of Accelerator Programming: Abstraction, Performance or Can We Have Both?

The future of microprocessors

The GASPI API specification and its implementation GPI 2.0

The Geant4 Visualisation System – a multi-driver graphics system

The GeForce 6 series GPU architecture

The GeForce 6800

The Genetic Convolutional Neural Network Model Based on Random Sample

The GENGA Code: Gravitational Encounters in N-body simulations with GPU Acceleration

The GPU as a high performance computational resource

The GPU as numerical simulation engine

The GPU Computing Era

The GPU Computing Revolution: From Multi-Core CPUs To Many-Core Graphics Processors

The GPU Enhanced Parallel Computing for Large Scale Data Clustering

The GPU enters computing’s mainstream

The GPU on biomedical image processing for color and phenotype analysis

The GPU on irregular computing: performance issues and contributions

The GPU on the simulation of cellular computing models

The GPU vs Phi Debate: Risk Analytics Using Many-Core Computing

The GPU-based High-performance Pattern-matching Algorithm for Intrusion Detection

The GPU-based Parallel Ant Colony System

The GPU-based String Matching System in Advanced AC Algorithm

The gputools package enables GPU computing in R

The GPUVerify Method: a Tutorial Overview

The Graphics Card as a Streaming Computer

The Graphics Processor as a Mathematical Coprocessor in MATLAB

The Heisenberg spin glass model on GPU: myths and actual facts

The Hierarchical Memory Machine Model for GPUs

The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development

The impact of accelerator processors for high-throughput molecular modeling and simulation

Brief statistics for this page

Titles: 100

Download open PDFs: 85

Package packages: 19

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
