high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

Where is the data? Why you cannot debate CPU vs. GPU performance without the answer 1,443 views

Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment 1,443 views

Accelerating Concurrent Heap on GPUs 1,442 views

How to Render FDTD Computations More Effective Using a Graphics Accelerator 1,441 views

Performance Evaluation of Optimized Implementations of Finite Difference Method for Wave Propagation Problems on GPU Architecture 1,440 views

Heuristic Optimization Methods for Improving Performance of Recursive General Purpose Applications on GPUs 1,440 views

Interactive Point Clouds Fairing on Many-Core System 1,440 views

Power analysis and optimizations for GPU architecture using a power simulator 1,439 views

torchode: A Parallel ODE Solver for PyTorch 1,437 views

A Survey on Hardware Accelerators for Large Language Models 1,437 views

DBMS Index for Hierarchical Data Using Nested Intervals and Residue Classes 1,436 views

Multicore performance optimization using partner cores 1,434 views

LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure 1,433 views

Implicit Feature-Based Alignment System for Radiotherapy 1,433 views

On the accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit and novel 16-bit number formats 1,432 views

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU 1,431 views

iGUARD: In-GPU Advanced Race Detection 1,429 views

A Systematic Literature Survey of Sparse Matrix-Vector Multiplication 1,429 views

BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs 1,427 views

Advanced Joins on GPUs 1,426 views

Fast Turnaround HLS Debugging using Dependency Analysis and Debug Overlays 1,426 views

Taking the graphics processor beyond graphics 1,425 views

The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product 1,425 views

Building a Personal High Performance Computer with Heterogeneous Processors 1,424 views

MGARD: A multigrid framework for high-performance, error-controlled data compression and refactoring 1,424 views

Performance prediction of deep learning applications training in GPU as a service systems 1,424 views

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis 1,424 views

GPTPU: Accelerating Applications using Edge Tensor Processing Units 1,422 views

Fast Isosurface Rendering on a GPU by Cell Rasterization 1,422 views

Non-deterministic parallelism considered useful 1,421 views

Deductive verification for SYCL 1,421 views

Software Testing – Test Suite Compilation and Execution Optimizations 1,421 views

Efficient code generation for hardware accelerators by refining partially specified implementation 1,421 views

Deep Graph Learning for Program Analysis and System Optimization 1,419 views

LS-CAT: A Large-Scale CUDA AutoTuning Dataset 1,419 views

Improving Performance and Energy Efficiency of Heterogeneous Systems with rCUDA 1,419 views

A Highly Parameterizable Framework for Conditional Restricted Boltzmann Machine Based Workloads Accelerated With FPGAs and OpenCL 1,419 views

Direct Self-Consistent Field Computations on GPU Clusters 1,416 views

Fast CUDA-Aware MPI Datatypes without Platform Support 1,413 views

PeriPy – A High Performance OpenCL Peridynamics Package 1,413 views

Using a GPU to accelerate die and mold fabrication 1,413 views

Comparison of different n-body algorithms on various hardware platforms using SYCL 1,413 views

Challenging cloning related problems with GPU-based algorithms 1,412 views

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs 1,412 views

Effective GPU Sharing Under Compiler Guidance 1,412 views

Performance Analysis of a High-level Abstractions-based Hydrocode on Future Computing Systems 1,412 views

General purpose lattice QCD code set Bridge++ 2.0 for high performance computing 1,411 views

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance 1,410 views

Autotuning CUDA: Applying NLP Techniques to LS-CAT 1,409 views

Optimization of Heterogeneous Parallel Computing Systems using Machine Learning 1,408 views

GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations 1,407 views

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours 1,406 views

From English To Foreign Languages: Transferring Pre-trained Language Models 1,405 views

Mixed precision in Graphics Processing Unit 1,404 views

SYCL in the edge: performance and energy evaluation for heterogeneous acceleration 1,404 views

Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing 1,404 views

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors 1,401 views

How much can we gain from Tensor Kernel Fusion on GPUs? 1,399 views

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs 1,398 views

Real-time Geometric Calibration on graphics processing unit with CUDA 1,398 views

LeXInt: GPU-accelerated Exponential Integrators package 1,396 views

Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities 1,395 views

Study for measurement method for coal volume on base of GPU 1,394 views

Julia as a unifying end-to-end workflow language on the Frontier exascale system 1,394 views

Reducing IO bandwidth for GPU based moment invariant classifier systems 1,394 views

Scalable instruction set simulator for thousand-core architectures running on GPGPUs 1,393 views

StreamBlocks: A compiler for heterogeneous dataflow computing 1,393 views

Case Study: GPU-based implementation of sequence pair based floorplanning using CUDA 1,392 views

Acceleration of the Method of Moments Calculations by Using Graphics Processing Units 1,391 views

LithOS: An Operating System for Efficient Machine Learning on GPUs 1,391 views

Experiences with implementing Kokkos’ SYCL backend 1,386 views

An Accelerated IHS Transform Fusion of Remote Sensing Image Data Based on GPU 1,386 views

Enhancing Performance of Simulations using GPGPU 1,385 views

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs 1,385 views

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems 1,384 views

Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing 1,384 views

Towards a Benchmarking Suite for Kernel Tuners 1,384 views

Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms 1,383 views

Exploring Applications in CUDA 1,382 views

Modular FPGA Systems with Support for Dynamic Workloads and Virtualisation 1,382 views

Heuristic Adaptability to Input Dynamics for SpMM on GPUs 1,380 views

Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations 1,379 views

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations 1,379 views

BAT: A Benchmark suite for AutoTuners 1,378 views

TorchBench: Benchmarking PyTorch with High API Surface Coverage 1,378 views

Better GPU Hash Tables 1,378 views

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores 1,378 views

Lightning: Scaling the GPU Programming Model Beyond a Single GPU 1,377 views

Asynchronous-Many-Task Systems: Challenges and Opportunities – Scaling an AMR Astrophysics Code on Exascale machines using Kokkos and HPX 1,369 views

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs 1,369 views

Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs 1,368 views

Multi-level parallelization for hybrid ACO 1,367 views

Retargeting and Respecializing GPU Workloads for Performance Portability 1,367 views

Ripple: Simplified Large-Scale Computation on Heterogeneous Architectures with Polymorphic Data Layout 1,367 views

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors 1,366 views

Character-level Transformer-based Neural Machine Translation 1,366 views

QArray: a GPU-accelerated constant capacitance model simulator for large quantum dot arrays 1,366 views

A Study on the Intersection of GPU Utilization and CNN Inference 1,364 views

Parallel computing with CUDA 1,364 views

INSTA-YOLO: Real-Time Instance Segmentation 1,364 views

Brief statistics for this page

Titles: 100

Total views: 140,466

Recent source codes

SIGMo: Scalable Isomorphism Graph Matching on GPUs

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

DGEMM without FP64 Arithmetic - using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

DGEMM without FP64 Arithmetic – using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

GEAK-agent: LLM-based AI agent, which can write correct and efficient GPU kernels automatically

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

OpenDwarfs 2025: re-engineered version of the OpenDwarfs benchmark suite, for compatibility with modern platforms

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

Specx: Speculative task-based runtime system

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

