high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning

SU(2) Lattice Gauge Theory Simulations on Fermi GPUs

SU(2) Lattice QCD Simulations on Fermi GPUs

Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models

Subdivision Surface Evaluation as Sparse Matrix-Vector Multiplication

Subpixel reconstruction antialiasing for deferred shading

Suitability of NVIDIA GPUs for SKA1-Low

Super Earths and Dynamical Stability of Planetary Systems: First Parallel GPU Simulations Using GENGA

Supercharging Federated Learning with Flower and NVIDIA FLARE

Supercomputing and stellar dynamics

Supercomputing with toys: harnessing the power of NVIDIA 8800GTX and playstation 3 for bioinformatics problem

Superconducting proximity effect in graphene under inhomogeneous strain

SUPERGLUE: A Shared Memory Framework Using Data Versioning for Dependency-Aware Task-Based Parallelization

SUperman: Efficient Permanent Computation on GPUs

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

SuperNeurons: FFT-based Gradient Sparsification in the Distributed Training of Deep Neural Networks

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Supervised Hashing with Deep Neural Networks

Support for Parallel Scan in OpenMP

Support Operator Rupture Dynamics on GPU

Support Vector Machines on GPU with Sparse Matrix Format

Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms

Supporting CUDA for an extended RISC-V GPU architecture

Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework

Supporting Heterogenous Computing Environments in SaC

Supporting input dependent access pattern algorithms on GPUs using GPUfs

Supporting Iteration in a Heterogeneous Data Flow Engine

Supporting mixed-datatype matrix multiplication within the BLIS framework

Supporting Preemptive Task Executions and Memory Copies in GPGPUs

Supporting x86-64 Address Translation for 100s of GPU Lanes

Surface Compression Using Dynamic Color Palettes

Surface Normal Integration for Convex Space-time Multi-view Reconstruction

Surface quality assessment of subdivision surfaces on programmable graphics hardware

Surface Reconstruction from Scattered Point via RBF Interpolation on GPU

Survey and Benchmarking of Machine Learning Accelerators

Survey of Domain-Specific Languages for FPGA Computing

Survey of GPU water simulation in game engine

Survey of HPC in US Research Institutions

Survey on Benchmarks for a GPU Based Multi Camera Stereo Matching Algorithm

Survey on Efficient Linear Solvers for Porous Media Flow Models on Recent Hardware Architectures

Survey On The Off-Chip Scheduling of Memory Accesses in the Memory Interface Of GPUs

Survey paper on Deep Learning on GPUs

Sustainable GPU Computing at Scale

Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale

SW# – GPU enabled exact alignments on genome scale

SW#db: GPU-accelerated exact sequence similarity database search

Swan: A tool for porting CUDA programs to OpenCL

SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors

Swarm-NG: a CUDA Library for Parallel n-body Integrations with focus on Simulations of Planetary Systems

Swarm’s flight: Accelerating the particles using C-CUDA

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer

Swendsen-Wang Multi-Cluster Algorithm for the 2D/3D Ising Model on Xeon Phi and GPU

Swept Volume approximation of polygon soups

SWIFOLD: Smith-Waterman implementation on FPGA with OpenCL for long DNA sequences

Switching to High Gear: Opportunities for Grand-Scale Real-Time Parallel Simulations

Swizzle Inventor: Data Movement Synthesis for GPU Kernels

SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection

SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2

SYCL Code Generation for Multigrid Methods

SYCL compute kernels for ExaHyPE

SYCL in the edge: performance and energy evaluation for heterogeneous acceleration

SYCL in the Edge: Performance Evaluation for Heterogeneous Acceleration

SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs

SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing

SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing

SYCLops: A SYCL Specific LLVM to MLIR Converter

Sylkan: Towards a Vulkan Compute Target Platform for SYCL

Symbolic Crosschecking of Data-Parallel Floating Point Code

Symbolic crosschecking of floating-point and SIMD code

Symbolic Differentiation in GPU Shaders

Symbolic Testing of OpenCL Code

Symphony: A Scheduler for Client-Server Applications on Coprocessor-based Heterogeneous Clusters

Synchronization and Coordination in Heterogeneous Processors

Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming

Synergia CUDA: GPU-accelerated accelerator modeling package

Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra

Synergistic execution of stream programs on multicores with accelerators

SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

SynPerf: A Hybrid Analytical-ML Framework for GPU Performance Prediction

Synthesis and rendering of bidirectional texture functions on arbitrary surfaces

Synthesis of Custom Networks of Heterogeneous Processing Elements for Complex Physical System Emulation

Synthesis of Embedded Software using Dataflow Schedule Graphs

Synthesis of GPU Programs from High-Level Models

Synthesis of Platform Architectures from OpenCL Programs

Synthesizing Benchmarks for Predictive Modeling

Synthesizing Software from a ForSyDe Model Targeting GPGPUs

Synthesizing Structured Traversals from Attribute Grammars

Synthesizing Subdivision Meshes Using Real Time Tessellation

Synthetic Aperture Beamformation using the GPU

Synthetic Aperture Radar imaging on a CUDA-enabled mobile platform

Synthetic Aperture Radar Processing with GPGPU

Syntix: A Profiling Based Resource Estimator for CUDA Kernels

System Design Principles for Heterogeneous Resource Management and Scheduling in Accelerator-Based Systems

System integration of FastSPECT III, a dedicated SPECT rodent-brain imager based on BazookaSPECT detector technology

System-Level Optimization and Code Generation for Graphics Processors using a Domain-Specific Language

Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

Systematic construction, verification and implementation methodology for LDPC codes

Systematic Performance Optimization of Cone-Beam Back-Projection on the Kepler Architecture

Brief statistics for this page

Titles: 100

Download open PDFs: 97

Package packages: 33
