high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Assessing the Performance-Energy Balance of Graphics Processors for Spectral Unmixing

Assessment of GPU computational enhancement to a 2D flood model

Assessment of various GPU acceleration strategies in text categorization processing flow

Astronomical Photometric Data Reduction Using GPGPU

Astrophysical data mining with GPU. A case study: genetic classification of globular clusters

Astrophysical Particle Simulations on Heterogeneous CPU-GPU Systems

Astrophysical Particle Simulations with Custom GPU Clusters

Astrophysical particle simulations with large custom GPU clusters on three continents

Astrophysical particle simulations with large custom GPU clusters on three continents

Astrophysical Supercomputing with GPUs: Critical Decisions for Early Adopters

Astrophysical-oriented Computational multi-Architectural Framework

ASW: Accelerating Smith-Waterman Algorithm on Coupled CPU-GPU Architecture

AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference

Asymptotic Peak Utilisation in Heterogeneous Parallel CPU/GPU Pipelines: A Decentralised Queue Monitoring Strategy

Asynchronous Communication for Finite-Difference Simulations on GPU Clusters using CUDA and MPI

Asynchronous Communication Schemes for Finite Difference Methods on Multiple GPUs

Asynchronous Methods for Deep Reinforcement Learning

Asynchronous OpenCL/MPI numerical simulations of conservation laws

Asynchronous Parallel Computing Algorithm implemented in 1D Heat Equation with CUDA

Asynchronous Parallel Computing Model of Global Motion Estimation with CUDA

Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

Asynchronous-Many-Task Systems: Challenges and Opportunities – Scaling an AMR Astrophysics Code on Exascale machines using Kokkos and HPX

ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs

Atmospheric Chemistry

Atmospheric turbulence removal using convolutional neural network

Atomic-free Irregular Computations on GPUs

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations

Attack Signature Matching using Graphics Processors in High-Performance Intrusion Detection Systems

Attaining system performance points: revisiting the end-to-end argument in system design for heterogeneous many-core systems

Attention-based NMT Models as Feature Functions in Phrase-based SMT

ATTILA: a cycle-level execution-driven simulator for modern GPU architectures

Audiovisual Voice Activity Detection and Localization of Simultaneous Speech Sources

Augmented reality live-action compositing

Augmented reality usage for prototyping speed up

Augmenting LLM Code Translation with Compiler Analysis for C to Triton Kernel Generation

Augmenting Operating Systems With the GPU

Augur: a Modeling Language for Data-Parallel Probabilistic Inference

Aurally and visually enhanced audio search with soundtorch

AUTO-GC: Automatic translation of data mining applications to GPU clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters

Auto-Generation of Parallel Finite-Differencing Code for MPI, TBB and CUDA

Auto-optimization of a Feature Selection Algorithm

Auto-SpMV: Automated Optimizing SpMV Kernels on GPU

Auto-tunable GPU BLAS

Auto-tunable GPU BLAS (thesis)

Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems

Auto-tuning 3-D FFT library for CUDA GPUs

Auto-tuning a High-Level Language Targeted to GPU Codes

Auto-tuning a LOFAR radio astronomy pipeline in JavaCL

Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs

Auto-Tuning Dedispersion for Many-Core Accelerators

Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs

Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU

Auto-tuning interactive ray tracing using an analytical GPU architecture model

Auto-tuning of fast fourier transform on graphics processors

Auto-Tuning of Level 1 and Level 2 BLAS for GPUs

Auto-tuning on the macro scale: high level algorithmic auto-tuning for scientific applications

Auto-tuning Shallow water simulations on GPUs

Auto-tuning SkePU: a multi-backend skeleton programming framework for multi-GPU systems

Auto-tuning Streamed Applications on Intel Xeon Phi

Auto-Tunning of Data Communication on Heterogeneous Systems

Auto-Vectorizing a Large-scale Production Unstructured-mesh CFD Application

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication

AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

AutoMat – Automatic Differentiation for Generalized Standard Materials on GPUs

Automated and interactive approaches for optimal surface finding based segmentation of medical image data

Automated and parallel code generation for finite-differencing stencils with arbitrary data types

Automated Architecture Design for Deep Neural Networks

Automated architecture-aware mapping of streaming applications onto GPUs

Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow

Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models

Automated Deep Learning Optimization via DSL-Based Source Code Transformation

Automated development of applications for graphical processing units using rewriting rules

Automated Dynamic Analysis of CUDA Programs

Automated Enhanced Parallelization of Sequential C to Parallel OpenMP

Automated Generation of OpenCL Programs Based on Algebra-Algorithmic Approach

Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications

Automated image alignment for 2D gel electrophoresis in a high-throughput proteomics pipeline

Automated Long-Term Monitoring of Parallel Microfluidic Operations Applying a Machine Vision-Assisted Positioning Method

Automated Partitioning of Data-Parallel Kernels using Polyhedral Compilation

Automated pose estimation in 3D point clouds applying annealing particle filters and inverse kinematics on a GPU

Automated Runtime Analysis and Adaptation for Scalable Heterogeneous Computing

Automated Software Testing of Memory Performance in Embedded GPUs

Automated Techniques for Enabling Efficient MPI Application Migration

Automated test generation for OpenCL kernels using fuzzing and constraint solving

Automated Testing of Graphics Shader Compilers

Automated Tool to Generate Parallel CUDA code from a Serial C Code

Automatic abstraction and fault tolerance in cortical microarchitectures

Automatic acceleration of Numpy applications on GPUs and multicore CPUs

Automatic and Explicit Parallelization Approaches for Mathematical Simulation Models

Automatic and portable mapping of data parallel programs to OpenCL for GPU-based heterogeneous systems

Automatic bi-layer video segmentation based on sensor fusion

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper

Automatic C-to-CUDA Code Generation for Affine Programs

Automatic classification of object code using machine learning

Automatic Code Generation and Adaptive Grid Scheduling for GPU Cluster Computing

Automatic code generation and tuning for stencil kernels on modern shared memory architectures

Brief statistics for this page

Titles: 100

Doubles: 1

Download open PDFs: 93

Package packages: 16
