high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Scalable and Parallel Implementation of a Financial Application on a GPU: With Focus on Out-of-Core Case

Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

Scalable approximate k-NN in multidimensional big data

Scalable Breadth-First Search on a GPU Cluster

Scalable Clustering for Vision using GPUs

Scalable Clustering Using Graphics Processors

Scalable communication for high-order stencil computations using CUDA-aware MPI

Scalable Data Clustering using GPU Clusters

Scalable Dense Linear Algebra on Heterogeneous Hardware

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

Scalable Distributed Fast Multipole Methods

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures

Scalable Fast Multipole Methods on Heterogeneous Architecture

Scalable framework for mapping streaming applications onto multi-GPU systems

Scalable GPU Acceleration of B-Spline Signal Processing Operations

Scalable GPU rendering of CSG models

Scalable GPU-Based Integrity Verification for Large Machine Learning Models

Scalable heterogeneous parallelism for atmospheric modeling and simulation

Scalable instruction set simulator for thousand-core architectures running on GPGPUs

Scalable Kernel Fusion for Memory-Bound GPU Applications

Scalable Lattice Boltzmann Solvers for CUDA GPU Clusters

Scalable learning for object detection with GPU hardware

Scalable Metropolis Monte Carlo for simulation of hard shapes

Scalable Molecular Dynamics Simulation Using FPGAs and Multicore Processors

Scalable Multi Agent Simulation on the GPU

Scalable Multi-Cache Simulation Using GPUs

Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer

Scalable multi-GPU implementation of the MAGFLOW simulator

Scalable Multi-GPU Simulation of Long-Range Molecular Dynamics

Scalable packet classification via GPU metaprogramming

Scalable Parallel Minimum Spanning Forest Computation

Scalable parallel programming with CUDA

Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-Core Architectures

Scalable Programming Models for Massively Multicore Processors

Scalable Query Evaluation in Relational Databases

Scalable Simulation of 3D Wave Propagation in Semi-Infinite Domains Using the Finite Difference Method on a GPU Based Cluster

Scalable Simulation of Tsunamis Generated by Submarine Landslides on GPU clusters

Scalable SMT-based verification of GPU kernel functions

Scalable Software Defined FM-radio receiver running on desktop computers

Scalable software defined receivers running on desktop computers using General Purpose Graphics Processing Units

Scalable Solution of Radiative Heat Transfer Problems by the Photon Monte Carlo Algorithm on Hybrid Computing Architectures

Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass

Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Memory-Bandwidth

Scalable Techniques for Scheduling and Mapping DSP Applications onto Embedded Multiprocessor Platforms

Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay

Scalable Verification Techniques for Data-Parallel Programs

Scalable, High Performance Fourier Domain Optical Coherence Tomography: Why FPGAs and Not GPGPUs

Scalar collapse in AdS with an OpenCL open source code

SCALE-Ahead-Of-Time Compilation of CUDA for AMD GPUs

Scale-dependent and example-based grayscale stippling

Scale-space ridge detection with GPU acceleration

Scaleable Sparse Matrix-Vector Multiplication with Functional Memory and GPUs

ScaleHLS: Scalable High-Level Synthesis through MLIR

Scaling behavior of topologically constrained polymer rings in a melt

Scaling Coupled Climate Models to Exascale: OpenACC-enabled ECEarth3 Earth System Model

Scaling CUDA for Distributed Heterogeneous Processors

Scaling Deep Learning on GPU and Knights Landing clusters

Scaling Deep Learning on Multiple In-Memory Processors

Scaling Fast Multipole Methods up to 4000 GPUs

Scaling GPU-Accelerated Databases beyond GPU Memory Size

Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters

Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer

Scaling Hierarchical N-body Simulations on GPU Clusters

Scaling High Performance Domain-Specific Language Implementation with Delite

Scaling IDS construction based on Non-negative Matrix factorization using GPU computing

Scaling LAPACK panel operations using parallel cache assignment

Scaling Lattice QCD beyond 100 GPUs

Scaling Monte Carlo Tree Search on Intel Xeon Phi

Scaling Multifluid Compressible Fluid Dynamics to 700,000 cores, 1.5 Pflop/s, and a Trillion Grid Cells

Scaling On-Device GPU Inference for Large Generative Models

Scaling Performance of FFT Computation on an Industrial Integrated GPU Co-processor: Experiments with Algorithm Adaptation

Scaling Radio Astronomy Signal Correlation on Heterogeneous Supercomputers Using Various Data Distribution Methodologies

Scaling Recurrent Neural Network Language Models

Scaling Results for a Discontinuous Galerkin Finite-Element Wave Solver on Multi-GPU Systems

Scaling Soft Matter Physics to Thousands of GPUs in Parallel

Scaling SU(2) to 1000 GPUs using HiRep

Scaling up scientific computations by using map-reduce-like control flow on NUMA architectures

Scaling-up spatially-explicit ecological models using graphics processors

SCALSALE: Scalable SALE Benchmark Framework for Supercomputers

Scan primitives for GPU computing

Scan Test Power Simulation on GPGPUs

Scandalously Parallelizable Mesh Generation

ScatterAlloc: Massively Parallel Dynamic Memory Allocation for the GPU

Scattering Parameters and Surface Normals from Homogeneous Translucent Materials using Photometric Stereo

Scattering Points in Parallel Coordinates

SCELib3.0: The new revision of SCELib, the parallel computational library of molecular properties in the Single Center Approach

Scene Boundary Detection Technique Based on Bottom-Up Attention System and OpenCL Parallel Implementation

Scene image classfying via the Partially Connected Neural Network

Scene independent real-time indirect illumination

Scene Recognition Acceleration Using CUDA and OpenMP

SCF: a device- and language-independent task coordination framework for reconfigurable, heterogeneous systems

SCGPSim: A fast SystemC simulator on GPUs

Scheduling (ir)regular applications on heterogeneous platforms

Scheduling a Parallel Sparse Direct Solver to Multiple GPUs

Scheduling by Work-Stealing in Hybrid Parallel Architectures

Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs

Scheduling data flow program in xkaapi: A new affinity based Algorithm for Heterogeneous Architectures

Scheduling Dataflow Execution Across Multiple Accelerators

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Brief statistics for this page

Titles: 100

Download open PDFs: 89

Package packages: 16


