high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Acceleration of Selective Cationic Antibacterial Peptides computation: A comparison of FPGA and GPU approaches

Acceleration of Solving Maxwell’s Equations Using Cluster of GPUs

Acceleration of spiking neural networks in emerging multi-core and GPU architectures

Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump Using CUDA-enabled GPU Hardware

Acceleration of stereo-matching on multi-core CPU and GPU

Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters

Acceleration of tensor-product operations for high-order finite element methods

Acceleration of the 3D ADI-FDTD method using graphics processor units

Acceleration of the GAMESS-UK electronic structure package on graphical processing units

Acceleration of the Method of Moments Calculations by Using Graphics Processing Units

Acceleration of the MMFF94 routines within OpenBabel using Eigen and OpenCL

Acceleration of the Smith-Waterman Algorithm using Single and Multiple Graphics Processors

Acceleration of the speed of tissue characterization algorithm for coronary plaque by employing GPGPU technique

Acceleration of Time-Domain Finite Element Method (TD-FEM) Using Graphics Processor Units (GPU)

Acceleration of TM cylinder EFIE with CUDA

Acceleration of Tsunami Wave Propagation Modeling based on Re-engineering of Computational Components

Acceleration of Variance of Color Differences-Based Demosaicing Using CUDA

Acceleration of Various Direct/Iterative Solvers for MoM by GPU and Its Computational Cost

Acceleration technique for volume rendering using 2D texture based ray plane casting on GPU

Acceleration Techniques for GPU-based Volume Rendering

Acceleration-as-a-Service: Exploiting Virtualised GPUs for a Financial Application

Accelerator Aware MPI Micro-benchmarking using CUDA, OpenACC and OpenCL

Accelerator weather forecasting

Accelerator-Oriented Algorithm Transformation for Temporal Data Mining

Accelerator: using data parallelism to program GPUs for general-purpose uses

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

AccFFT: A library for distributed-memory 3-D FFT on CPU and GPU architectures

AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

Accounting for Secondary Uncertainty: Efficient Computation of Portfolio Risk Measures on Multi and Many Core Architectures

Accounting for Uncertainty in Medical Data: A CUDA Implementation of Normalized Convolution

ACCTuner: OpenACC Auto-Tuner For Accelerated Scientific Applications

AccUDNN: A GPU Memory Efficient Accelerator for Training Ultra-deep Deep Neural Networks

accULL: An User-directed Approach to Heterogeneous Programming

Accuracy and performance of graphics processors: A Quantum Monte Carlo application case study

Accuracy, Memory, and Speed Strategies in GPU-Based Finite-Element Matrix-Generation

Accurate Analytic Models to Estimate Execution Time on GPU Applications

Accurate and Efficient Filtering using Anisotropic Filter Decomposition

Accurate Cross-Architecture Performance Modeling for Sparse Matrix-Vector Multiplication (SpMV) on GPUs

Accurate CUDA Performance Modeling for Sparse Matrix-Vector Multiplication

Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels

Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing

Accurate Models of NVIDIA Tensor Cores

Accurate multi-view reconstruction using robust binocular stereo and surface meshing

Accurate real-time stereo correspondence using intra- and inter-scanline optimization

Accurate Sequence Alignment using Distributed Filtering on GPU Clusters

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale

Achieving a single compute device image in OpenCL for multiple GPUs

Achieving High Throughput Sequencing with Graphics Processing Units

Achieving high-performance with a sparse direct solver on Intel KNL

Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability

Achieving O(1) IP lookup on GPU-based software routers

Achieving Speedup in Aggregate Risk Analysis using Multiple GPUs

Achieving TeraCUPS on Longest Common Subsequence Problem using GPGPUs

ACL2 Meets the GPU: Formalizing a CUDA-based Parallelizable All-Pairs Shortest Path Algorithm in ACL2

ACO on Multiple GPUs with CUDA for Faster Solution of QAPs

ACO with tabu search on a GPU for solving QAPs using move-cost adjusted thread assignment

Acquisition Method of Spread Spectrum Signals Based on GPU Acceleration

ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

Action Spotting and Recognition Based on a Spatiotemporal Orientation Analysis

Action-Based Multifield Video Visualization

Active Structured Learning for High-Speed Object Detection

Active thread compaction for GPU path tracing

Activity recognition from videos with parallel hypergraph matching on GPUs

Adaboost GPU-based Classifier for Direct Volume Rendering

AdaNet: A Scalable and Flexible Framework for Automatically Learning Ensembles

Adaptable particle-in-cell algorithms for graphical processing units

Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

Adaptation of algorithms for underwater sonar data processing to GPU-based systems

Adaptation of an acoustic propagation model to the parallel architecture of a graphics processor

Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments

Adaptation of the MapReduce programming framework to compute-intensive data-analytics kernels

Adapting a message-driven parallel application to GPU-accelerated clusters

Adapting data processing methods to modern GPU architecture

Adapting database components to heterogeneous environments

Adapting Irregular Computations to Large CPU-GPU Clusters in the MADNESS Framework

Adapting MoM with RWG Basis Functions to GPU Technology Using CUDA

Adapting Particle Filter Algorithms to Many-Core Architectures

Adapting the GA Approach to Solve Traveling Salesman Problems on CUDA Architecture

Adaptive algebraic multigrid on SIMD architectures

Adaptive and Hybrid Machine Learning Approaches Utilizing General Purpose Computing on Graphical Processing Units

Adaptive and Transparent Cache Bypassing for GPUs

Adaptive Data Migration in Load-Imbalanced HPC Applications

Adaptive discrete cosine transform-based image compression method on a heterogeneous system platform using Open Computing Language

Adaptive Dynamic Load Balancing in Heterogeneous Multiple GPUs-CPUs Distributed Setting: Case Study of B&B Tree Search

Adaptive enhancement and noise reduction in very low light-level video

Adaptive fast multipole methods on the GPU

Adaptive GPU Array Layout Auto-Tuning

Adaptive Hardware-accelerated Terrain Tessellation

Adaptive implementation selection in the SkePU skeleton programming library

Adaptive Input-aware Compilation for Graphics Engines

Adaptive Kinetic-Fluid Solvers for Heterogeneous Computing Architectures

Adaptive Line Tracking with Multiple Hypotheses for Augmented Reality

Adaptive load balancing for raycasting of non-uniformly bricked volumes

Adaptive Mesh Fluid Simulations on GPU

Adaptive Multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU

Adaptive OpenCL (ACL) Execution in GPU Architectures

Adaptive Optimization for OpenCL Programs on Embedded Heterogeneous Systems

Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing

Brief statistics for this page

Titles: 100

Download open PDFs: 89

Packages: 14
