high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Towards a More Efficient Use of GPUs

Towards a Performance-Portable FFT Library for Heterogeneous Computing

Towards a Portable and Future-proof Particle-in-Cell Plasma Physics Code

Towards a robust, real-time face processing system using CUDA-enabled GPUs

Towards a Software Transactional Memory for Graphics Processors

Towards a Tunable Multi-Backend Skeleton Programming Framework for Multi-GPU Systems

Towards a Unified CPU-GPU code hybridization: A GPU Based Optimization Strategy Efficient on Other Modern Architectures

Towards a unified framework for rapid 3D computed tomography on commodity GPUs

Towards a Unified Sentiment Lexicon (USL) based on Graphics Processing Units (GPUs)

Towards a Unified Sentiment Lexicon Based on Graphics Processing Units

Towards Accelerated Computation of Atmospheric Equations Using CUDA

Towards accelerating molecular modeling via multi-scale approximation on a GPU

Towards accelerating Smoothed Particle Hydrodynamics simulations for free-surface flows on multi-GPU clusters

Towards acceleration of fault simulation using graphics processing units

Towards ad-hoc GPU acceleration of parallel eigensystem computations

Towards Adaptive GPU Resource Management for Embedded Real-Time Systems

Towards Alignment of Parallelism in SYCL and ISO C++

Towards an automatic generation of dense linear algebra solvers on parallel architectures

Towards an Effective Unified Programming Model for Many-Cores

Towards an embedded biologically-inspired machine vision processor

Towards an interactive and automated script feature analysis of 3D scanned cuneiform tablets

Towards Automated Kernel Generation in the Era of LLMs

Towards automated kernel selection in machine learning systems: A SYCL case study

Towards Automated Learning of Object Detectors

Towards Automatic C Programs Optimization and Parallelization using the PIPS-PoCC Integration

Towards automatic Digital Surface Model generation using a Graphics Processing Unit

Towards Automatic Learning of Heuristics for Mechanical Transformations of Procedural Code

Towards Automatic Transformation of Legacy Scientific Code into OpenCL for Optimal Performance on FPGAs

Towards Automating Multi-dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System

Towards autonomous resource management: Deep learning prediction of CPU-GPU load balancing

Towards Building Error Resilient GPGPU Applications

Towards Calculating HPC CUDA Kernel Performance on Nvidia GPUs

Towards Chip-on-Chip Neuroscience: Fast Mining of Frequent Episodes Using Graphics Processors

Towards chip-on-chip neuroscience: fast mining of neuronal spike streams using graphics hardware

Towards Co-execution on Commodity Heterogeneous Systems: Optimizations for Time-Constrained Scenarios

Towards Code Generation from Design Models for Embedded Systems on Heterogeneous CPU-GPU Platforms

Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems

Towards Distortion-Predictable Embedding of Neural Networks

Towards Distributed Heterogenous High-Performance Computing with ViennaCL

Towards Domain-specific Computing for Stencil Codes in HPC

Towards dynamic reconfigurable load-balancing for hybrid desktop platforms

Towards Efficient and Practical GPU Multitasking in the Era of LLM

Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA

Towards Efficient GPU Sharing on Multicore Processors

Towards Efficient Indexing of Spatiotemporal Trajectories on the GPU for Distance Threshold Similarity Searches

Towards Efficient Large-Scale Graph Neural Network Computing

Towards Efficient Risk Quantification-Using GPUs and Variance Reduction Technique

Towards energy efficiency and productivity for decision making in mobile robot navigation

Towards Enhancing Performance, Programmability, and Portability in Heterogeneous Computing

Towards fast and certified multiple-precision libraries

Towards Faster Cloth Simulation: Examining the Preconditioned Conjugate Gradient

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Towards fully user transparent task and data parallel image processing

Towards global composition of performance-aware components for GPU-based systems

Towards Good Practices for Very Deep Two-Stream ConvNets

Towards GPGPU Assisted Computing in Virtualized Environments

Towards GPU Parallelism Abstractions in Rust: A Case Study with Linear Pipelines

Towards GPU-Accelerated Large-Scale Graph Processing in the Cloud

Towards Green Computing: A Survey of Performance and Energy Efficiency of Different Platforms using OpenCL

Towards High Performance Java-based Deep Learning Frameworks

Towards High Speed Aerial Tracking of Agile Targets

Towards High-Performance and Cost-Effective Distributed Storage Systems with Information Dispersal Algorithms

Towards Improving Programmability of Heterogeneous Parallel Architectures

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

Towards Interactive Visual Exploration of Parallel Programs using a Domain-specific Language

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors

Towards large-scale network analytics

Towards Lattice Quantum Chromodynamics on FPGA devices

Towards making the most of NLP-based device mapping optimization for OpenCL kernels

Towards Memory-Efficient Answering of Tree-Shaped SPARQL Queries using GPUs

Towards metaprogramming for parallel systems on a chip

Towards microsecond biological molecular dynamics simulations on hybrid processors

Towards Modeling Energy Consumption of Xeon Phi

Towards multi-GPU support for visualization

Towards Multi-GPU Support in the Marrow Skeleton Framework

Towards On-Chip Optical FFTs for Convolutional Neural Networks

Towards On-Line Digital Doubles

Towards paradisEO-MO-GPU: a framework for GPU-based local search metaheuristics

Towards Parallel Programming Models for Predictability

Towards Path Tracing in Games

Towards Performance Portable Programming for Distributed Heterogeneous Systems

Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems

Towards Performance-Portable, Scalable, and Convenient Linear Algebra

Towards Portable Performance for Explicit Hydrodynamics Codes

Towards Porting a Real-World Seismological Application to the Intel MIC Architecture

Towards Predictable Real-Time Performance on Multi-Core Platforms

Towards Rapid Prototyping of Parallel and HPC Applications (GPU Focus)

Towards real time 2D to 3D registration for ultrasound-guided endoscopic and laparoscopic procedures

Towards real time 3D tracking and reconstruction on a GPU using Monte Carlo simulations

Towards real time vision based UUV navigation using GPU technology

Towards real-time radiation therapy: GPU accelerated superposition/convolution

Towards real-time tomography: Fast reconstruction algorithms and GPU implementation

Towards reverse engineering the brain: Modeling abstractions and simulation frameworks

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Towards robust automatic detection of vulnerable road users: monocular pedestrian tracking from a moving vehicle

Towards scalar synchronization in SIMT architectures

Towards shared memory consistency models for GPUs

Towards smart-pixel-based implementation of wideband active sonar echolocation system for multi-target detection

Towards solving the Table Maker’s Dilemma on GPU

Brief statistics for this page

Titles: 100

Download open PDFs: 92

Packages: 18

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Recent source codes

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench
