high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Application Synthesis and Optimization on Heterogeneous Parallel Processing Systems

Application-guided tool development for architecturally diverse computation

Application-independent accurate mouse placements on surfaces of arbitrary geometry

Applications of Linux-Based QT-CUDA Parallel Architecture

Applications of Many-Core Technologies to On-line Event Reconstruction in High Energy Physics Experiments

Applications Performance on GPGPUs with the Fermi Architecture

Applying Contact Angle to a Two-Dimensional Smoothed Particle Hydrodynamics (SPH) model on a Graphics Processing Unit (GPU) Platform

Applying Genetic Algorithms to Tune Heterogeneous Platform Configurations

Applying GPU Dynamic Parallelism to High-Performance Normalization of Gene Expressions

Applying graphics processor acceleration in a software defined radio prototyping environment

Applying Object Oriented Design Patterns to CUDA based Pyramidal Image Blending – An Experience

Applying OOC Techniques in the Reduction to Condensed Form for Very Large Symmetric Eigenproblems on GPUs

Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads

Applying Source Level Auto-Vectorization to Aparapi Java

Applying the “Simple Accelerator Modelling in MATLAB” (SAMM) Code to High Luminosity LHC Upgrade

Applying the Midas Touch of Reproducibility to High-Performance Computing

Applying the Parallel GPU Model to Radiation Therapy Treatment

Approaches for parallelizing reductions on modern GPUs

Approaches for the Parallelization of Software Implementation of Integer Multiplication

Approximate Belief Propagation by Hierarchical Averaging of Outgoing Messages

Approximate Dynamic Programming and Neural Networks on Game Hardware

Approximate dynamic programming with post-decision states as a solution method for dynamic economic models

Approximate Principal Direction Trees

Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms

Approximate Subdivision Surface Evaluation in the Language of Linear Algebra

Approximation of BEM matrices using GPGPUs

Approximation of Loop Subdivision Surfaces for Fast Rendering

Approximative inference for multivariate functional data on massively parallel processors

APPy: Annotated Parallelism for Python on GPUs

APTCC: Auto Parallelizing Translator From C To CUDA

APUNet: Revitalizing GPU as Packet Processing Accelerator

AQsort: Scalable Multi-Array In-Place Sorting with OpenMP

AQUAgpusph, a free 3D SPH solver accelerated with OpenCL

Aquila 2.0: Software Architecture for Cognitive Robotics

Aquila: An Open-Source GPU-Accelerated Toolkit for Cognitive Robotics Research

Arax: a runtime framework for decoupling applications from heterogeneous accelerators

Arbitrarily large iterative tomographic reconstruction on multiple GPUs using the TIGRE toolbox

Arbitrary dimension Reed-Solomon coding and decoding for extended RAID on GPUs

Arbitrary-Precision Arithmetics on the GPU

ArborX: A Performance Portable Search Library

ARC: Adaptive Ray-tracing with CUDA, a New Ray Tracing Code for Parallel GPUs

ArchesWeather: An efficient AI weather forecasting model at 1.5° resolution

Architecting an LTE Base Station with Graphics Processing Units

Architecting graphics processors for non-graphics compute acceleration

Architecting SOT-RAM Based GPU Register File

Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels

Architectural Analysis and Performance Characterization of NVIDIA GPUs using Microbenchmarking

Architectural Comparisons for a Quantum Monte Carlo Application

Architectural Considerations for Compiler-guided Unroll-and-Jam of CUDA Kernels

Architectural Exploration and Scheduling Methods for Coarse Grained Reconfigurable Arrays

Architectural explorations for streaming accelerators with customized memory layouts

Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems

Architectural Principles and Experimentation of Distributed High Performance Virtual Clusters

Architectural Support for the Stream Execution Model on General-Purpose Processors

Architectural Support for Virtual Memory in GPUs

Architecture Comparisons between Nvidia and ATI GPUs: Computation Parallelism and Data Communications

Architecture of the real-time target detection processing in an airborne hyperspectral demonstrator system

Architecture-Adaptive Code Variant Tuning

Architecture- and Workload-Aware Heterogeneous Algorithms for Sparse Matrix Vector Multiplication

Architecture-Aware Algorithms and Software for Peta and Exascale Computing

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

Architecture-Aware Optimization on a 1600-core Graphics Processor

Architecture-Aware Optimization Targeting Multithreaded Stream Computing

Architecture-based Performance Evaluation of Genetic Algorithms on Multi/Many-core Systems

Architecture, Design, and Experimental Evaluation of a Lightfield Descriptor Depth Buffer Algorithm on Reconfigurable Logic and on a GPU

Are Very Deep Neural Networks Feasible on Mobile Devices?

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-extend search space

Aristotle: A Performance Impact Indicator for the OpenCL Kernels Using Local Memory

ARK: GPU-driven Code Execution for Distributed Deep Learning

ARKCoS: Artifact-Suppressed Accelerated Radial Kernel Convolution on the Sphere

Array Languages Make Neural Networks Fast

Array Program Transformation with Loo.py by Example: High-Order Finite Elements

Array-Oriented Languages and Polyhedral Compilation

ART vs. NDK vs. GPU acceleration: A study of performance of image processing algorithms on Android

Articulated object tracking by rendering consistent appearance parts

Artifact-Free Decompression and Zooming of JPEG Compressed Images with Total Generalized Variation

Artifact-Free JPEG Decompression with Total Generalized Variation

Artificial Intelligence in Electric Machine Drives: Advances and Trends

Artificial neural network computation on graphic process unit

Artificial Neural Network Simulation on CUDA

ARVO-CL: The OpenCL version of the ARVO package – An efficient tool for computing the accessible surface area and the excluded volume of proteins via analytical equations

ASAMgpu V1.0 – a moist fully compressible atmospheric model using graphics processing units (GPUs)

Aspect-Driven Mixed-Precision Tuning Targeting GPUs

Aspects of GPU for general purpose high performance computing

Assembling large mosaics of electron microscope images using GPU

Assembly of finite element methods on graphics processors

Assembly-Free Large-Scale Modal Analysis on the GPU

Assembly-Free Structural Dynamics On CPU and GPU

Assessing Accelerator-Based HPC Reverse Time Migration

Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems

Assessing Intel OneAPI capabilities and cloud-performance for heterogeneous computing

Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment

Assessing opportunities of SYCL for biological sequence alignment on GPU-based systems

Assessing the feasibility of OpenCL CPU implementations for agent-based simulations

Assessing the hardness of SVP algorithms in the presence of CPUs and GPUs

Assessing the Impact of Compiler Optimizations on GPUs Reliability

Brief statistics for this page

Titles: 100

Download open PDFs: 95

Package packages: 19

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
