high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

The integrated implementation of surgical simulations through modeling by means of imaging, comprehension, visualization, deformation, and collision detection in virtual environments

The International Exascale Software Project roadmap

The K-Anonymity Approach in Preserving the Privacy of E-Services that Implement Data Mining

The Landscape of GPU-Centric Communication

The Lattice Boltzmann Equation Method for Complex Flows

The Lattice Boltzmann Simulation on Multi-GPU Systems

The lattice-Boltzmann method for simulating gaseous phenomena

The Linear Direct Sparse Solver on GPU for Bundle Adjustment Method

The Living Application: a Self-Organising System for Complex Grid Tasks

The magic volume lens: an interactive focus+context technique for volume rendering

The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

The method of improving performance of the GPU-accelerated 2D FDTD simulator

The Model of Computation of CUDA and its Formal Semantics

The MOPED framework: Object recognition and pose estimation for manipulation

The More We Share, The More We Have: Improving GPU performance through Register Sharing

The MOSIX Cluster Operating System for High-Performance Computing on Linux Clusters, Multi-Clusters, GPU Clusters and Clouds

The MOSIX Virtual OpenCL (VCL) Cluster Platform

The multi-GPU System with ExpEther

The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing

The multikernel: a new OS architecture for scalable multicore systems

The New Compiler Stack: A Survey on the Synergy of LLMs and Compilers

The nonequispaced FFT on graphics processing units

The Ocean Tensor Package

The OoO VLIW JIT Compiler for GPU Inference

The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science

The openip open source image processing library

The OpenMP Cluster Programming Model

The Optimization of Algorithms in the Process of Temporal Data Mining Using the Compute Unified Device Architecture

The optimization of parallel Smith-Waterman sequence alignment using on-chip memory of GPGPU

The orthorectified technology for UAV aerial remote sensing image based on the Programmable GPU

The Parallel Bayesian Toolbox for High-performance Bayesian Filtering in Metrology

The Parallel Processing Based on CUDA for Convolution Filter FDK Reconstruction of CT

The PEPPHER Approach to Programmability and Performance Portability for Heterogeneous many-core Architectures

The PEPPHER Composition Tool: Performance-Aware Dynamic Composition of Applications for GPU-based Systems

The Performance Analysis Based on Heterogeneous Parallel Processors for Anisotropic Diffusion Filters

The performances of R GPU implementations of the GMRES method

The Physics of Singular Dislocation Structures in Continuum Dislocation Dynamics

The Plasma Simulation Code: A modern particle-in-cell code with load-balancing and GPU support

The Possibility of Fast Large-Scale Numerical Simulation Implemented with Graphics Processing Units

The Potential for a GPU-Like Overlay Architecture for FPGAs

The Potential of the Intel Xeon Phi for Supervised Deep Learning

The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications

The Promises of Hybrid Hexagonal/Classical Tiling for GPU

The Q Continuum Simulation: Harnessing the Power of GPU Accelerated Supercomputers

The Reconstruction Toolkit (RTK), an open-source cone-beam CT reconstruction toolkit based on the Insight Toolkit (ITK)

The Reduction Problem in CUDA and Its Simulation with P Systems

The Research of Large-Scale 3D Scenes Rendering Optimization

The Research of Real-Time Shadow Rendering Algorithm of Virtual Scenes

The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations

The Rhombic Dodecahedron Map: An Efficient Scheme for Encoding Panoramic Video

The Risks of WebGL: Analysis, Evaluation and Detection

The Rodinia Benchmark Suite in SYCL

The role of GPU computing in medical image analysis and visualization

The role of multigrid algorithms for LQCD

The Saga of Landau-Gauge Propagators: Gathering New Ammo

The Scalable Heterogeneous Computing (SHOC) benchmark suite

The scoring sequences on profile Hidden Markov Models with delete states elimination by GPUs

The Security of Key Derivation Functions in WINRAR

The Shamrock code: I- Smoothed Particle Hydrodynamics on GPUs

The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches

The sparse matrix vector product on GPUs

The State of the Art in Interactive Global Illumination

The Stencil Processing Unit: GPGPU Done Right

The Study of the OpenCL Processing Models for the FPGA Devices

The system for visualization of synoptic objects

The Test and Evaluation Uses of Heterogeneous Computing: GPGPUs and Other Approaches

The TheLMA project: Multi-GPU implementation of the lattice Boltzmann method

The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Use of Automated Search in Deriving Software Testing Strategies

The Use of GPUs for Solving the Computed Tomography Problem

The use of overlapping subgrids to accelerate the FDTD on GPU devices

The Vectorization of the Tersoff Multi-Body Potential: An Exercise in Performance Portability

The VerCors Verifier: A Progress Report

The Virtual Marathon: Parallel Computing Supports Crowd Simulations

The Virtual OpenCL (VCL) Cluster Platform

The visible ear surgery simulator

The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware

The VOLNA-OP2 Tsunami Code (Version 1.0)

The VRE volume rendering engine

The Yin and Yang of Processing Data Warehousing Queries on GPU Devices

Theano-based Large-Scale Visual Recognition with Multiple GPUs

Theano-MPI: a Theano-based Distributed Training Framework

Theano: A CPU and GPU Math Compiler in Python

Theano: A Python framework for fast computation of mathematical expressions

Theano: Deep Learning on GPUs with Python

TheanoLM – An Extensible Toolkit for Neural Network Language Modeling

Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

Theoretical and Numerical Analysis of Three Approaches to the GPGPU Application of the Explicit FDTD Method

Theory of square, rectangular, and microband electrodes through explicit GPU simulation

Thermal and Athermal Swarms of Self-Propelled Particles

Brief statistics for this page

Titles: 100

Download open PDFs: 91

Package packages: 29

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs


Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
