high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Distributed multi-node, multi-GPU, heterogeneous system for 3D image reconstruction in Electrical Capacitance Tomography – network performance and application analysis

Distributed OpenCL Distributing OpenCL Platform on Network Scale

Distributed OpenCL: a platform for distributed, heterogeneous computing for domain scientists

Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators

Distributed Password Cracking Platform

Distributed Texture Memory in a Multi-GPU Environment

Distributed time, conservative parallel logic simulation on GPUs

Distributed Training Large-Scale Deep Architectures

Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability

Distributed wideband software-defined radio receiver for heterogeneous systems

Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability

Distributed, combined CPU and GPU profiling within HPX using APEX

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Divergence Analysis

Divergence Analysis and Optimizations

Divergence Analysis with Affine Constraints

Divide and Conquer G-Buffer Ray Tracing

Divide-and-Conquer 3D Convex Hulls on the GPU

DiVinE-CUDA – A Tool for GPU Accelerated LTL Model Checking

DjiNN and Tonic: DNN as a Service and Its Implications for Future Warehouse Scale Computers

DL: A data layout transformation system for heterogeneous computing

DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications

DLL: A Blazing Fast Deep Neural Network Library

DMA-Assisted, Intranode Communication in GPU Accelerated Systems

dMath: A Scalable Linear Algebra and Math Library for Heterogeneous GP-GPU Architectures

dMath: Distributed Linear Algebra for DL

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators

Doctor AI: Interpretable Deep Learning for Modeling Electronic Health Records

Document Classification Using KNN on GPU

Document Image Binarization Using Image Segmentation Algorithm in Parallel Environment

Document Stream Clustering using GPUs

Dogwild! – Distributed Hogwild for CPU & GPU

Domain Decomposition method on GPU cluster

Domain Specific Languages for High Performance Computing

Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation

Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks

Domain-Specific Languages for Heterogeneous Parallel Computing

Domain-Specific On-Device Object Detection Method

Domain-Specific Optimizations Supporting Real-Time Image Compression

DOPA: GPU-based protein alignment using database and memory access optimizations

dOpenCL – Evaluation of an API-Forwarding Implementation

Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures

Double-Precision Floating-Point Data Visualizations Using Vulkan API

Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?

Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering

Dragon-Alpha&cu32: A Java-based Tensor Computing Framework With its High-Performance CUDA Library

DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function

DRiVE: An Example of Distributed Rendering in Virtual Environments

Dropbear: Machine Learning Marketplaces made Trustworthy with Byzantine Model Agreement

DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

Drug Drug Interaction Extraction from Biomedical Literature Using Syntax Convolutional Neural Network

DSDP: A Blind Docking Strategy Accelerated by GPUs

DSPSR: Digital Signal Processing Software for Pulsar Astronomy

DTAM: Dense tracking and mapping in real-time

Dual-RBF based surface reconstruction

Duality based optical flow algorithms with applications

DUODECIM – a structure for point scan compression and rendering

Duplicate Detection on GPUs

Dust-Dust Collisional Charging and Lightning in Protoplanetary Discs

DVM: Real-Time Kernel Generation for Dynamic AI Models

Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems

Dymaxion++: A Directive-based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems

Dynamic adaptation and distribution of binaries to heterogeneous architectures

Dynamic adaptation of broad phase collision detection algorithms

Dynamic Adaptation Techniques and Opportunities to Improve HPC Runtimes

Brief statistics for this page

Titles: 100

Download open PDFs: 97

Package packages: 26

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)