high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Predictive Data Race Detection for GPUs

Predictive Lazy Amplification: Synthesis and Rendering of Massive Procedural Scenes in Real Time

Predictive Modeling and Analysis of OP2 on Distributed Memory GPU Clusters

Predictive Runtime Code Scheduling for Heterogeneous Architectures

Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels

Prefiltered Single Scattering

Preliminary Experiences with the Uintah Framework on Intel Xeon Phi and Stampede

Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor

Preliminary implementation of two parallel programs for fractal image coding on GPUs

Preliminary implementation of VQ image coding using GPGPU

Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC

Preliminary results of autotuning GEMM kernels for the NVIDIA Kepler architecture-GeForce GTX 680

Preparing Ginkgo for AMD GPUs – A Testimonial on Porting CUDA Code to HIP

Pretraining large language models with MXFP4 on Native FP4 Hardware

Pretty Good Accuracy in Matrix Multiplication with GPUs

Pricing composable contracts on the GP-GPU

Pricing of cross-currency interest rate derivatives on Graphics Processing Units

Pricing the American Option Using Reconfigurable Hardware

Primal Dual Affine Scaling on GPUs

Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads

Principles for Automated and Reproducible Benchmarking

Principles towards Real-Time Simulation of Material Point Method on Modern GPUs

Principles, Techniques, and Tools for Explicit and Automatic Parallelization

Priority-Based Task Management in a GPGPU Megakernel

PRISM-PSY: Precise GPU-Accelerated Parameter Synthesis for Stochastic Systems

Prius: A Runtime for Hybrid Computing

Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs

PRNG Random Numbers on GPU

Probabilistic View-based 3D Curve Skeleton Computation on the GPU

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Probing biomolecular machines with graphics processors

Probing the Statistical Validity of the Ductile-to-Brittle Transition in Metallic Nanowires Using GPU Computing

Process Time Comparison between GPU and CPU

Processing Big Data in Main Memory and on GPU

Processing data streams with hard real-time constraints on heterogeneous systems

Processing Hard Sphere Collisions on a GPU Using OpenCL

Processing Large-scale XML Files on GPGPU Cluster

Processing Markov Logic Networks with GPUs

Processing MPI Derived Datatypes on Noncontiguous GPU-Resident Data

Processing Neocognitron of Face Recognition on High Performance Environment Based on GPU with CUDA Architecture

Processing of Synthetic Aperture Radar data with GPGPU

Processing OLTP Workloads on Hybrid CPU/GPU Systems

Processing Posting Lists Using OpenCL

Processing XPath Structural Constraints on GPU

Production Floating Point Applications on FPGAs

Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

Productive and Efficient Computational Science Through Domain-specific Abstractions

Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages

Productive Performance Engineering for Weather and Climate Modeling with Python

Productivity, Portability, Performance: Data-Centric Python

Professional CUDA C Programming

Profile Util library: A quick and easy way to get MPI, OpenMP and GPU runtime information

Profile-guided optimization of critical medical imaging algorithms

Profiling Apple Silicon Performance for ML Training

Profiling based Out-of-core Hybrid Method for Large Neural Networks

Profiling Concurrent Vision Inference Workloads on NVIDIA Jetson – Extended

Profiling General Purpose GPU Applications

Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms

Profiling High Level Heterogeneous Programs: Using the SPOC GPGPU framework for OCaml

Profiling of Data-Parallel Processors

ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler

Program Acceleration in a Heterogeneous Computing Environment Using OpenCL, FPGA, and CPU

Program Analysis and Machine Learning based Approach to Predict Power Consumption of CUDA Kernel

Program optimization carving for GPU computing?

Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+

Program Optimization of Stencil Based Application on the GPU-Accelerated System

Program optimization space pruning for a multithreaded GPU

Program Optimization Strategies for Data-Parallel Many-Core Processors

Program Optimization Study on a 128-Core GPU

PROGRAML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis

Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems

Programmability: Design Costs and Payoffs using AMD GPU Streaming Languages and Traditional Multi-Core Libraries

Programmable and Scalable Architecture for Graphics Processing Units

Programmable shaders for deformation rendering

Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems

Programming and Performance of Graphics Processors in Shock Waves Simulation by Finite Volume Method

Programming and Scheduling Model for Supporting Heterogeneous Accelerators in Linux

Programming Challenges for the Implementation of Numerical Quadrature in Atomic Physics on FPGA and GPU Accelerators

Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries

Programming Dense Linear Algebra Kernels on Vectorized Architectures

Programming Embedded Manycore: Refinement and Optimizing Compilation of a Parallel Action Language for Hierarchical State Machines

Programming finite-difference time-domain for graphics processor units using compute unified device architecture

Programming for scientific computing on peta-scale heterogeneous parallel systems

Programming framework for clusters with heterogeneous accelerators

Programming Frameworks for Distributed Smartphone Computing

Programming Future Parallel Architectures with Haskell and Intel ArBB

Programming GPUs with C++14 and Just-In-Time Compilation

Programming Heterogeneous Systems from an Image Processing DSL

Programming Heterogeneous Systems with General and Domain-Specific Frameworks

Programming hybrid systems with implicit memory based synchronization

Programming in CUDA for Kepler and Maxwell Architecture

Programming issues for video analysis on Graphics Processing Units

Programming Many-Core Chips

Programming Massively Parallel Architectures using MARTE: a Case Study

Programming Massively Parallel Processors: A Hands-on Approach

Programming Massively Parallel Processors with CUDA (audio course)

Programming model for a heterogeneous x86 platform

Programming Models and Runtimes for Heterogeneous Systems

Programming Models and Scheduling Techniques for Heterogeneous Architectures

Brief statistics for this page

Titles: 100

Download open PDFs: 91

Package packages: 20

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)