high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation 780 views

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation 779 views

Forecasting time series with constraints 778 views

A Comprehensive Deep Learning Library Benchmark and Optimal Library Selection 778 views

CPPJoules: An Energy Measurement Tool for C++ 777 views

FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran 776 views

Communication-minimizing Asynchronous Tensor Parallelism 774 views

Exploring data flow design and vectorization with oneAPI for streaming applications on CPU+GPU 773 views

GPU Performance Portability needs Autotuning 764 views

Bridging Control-Centric and Data-Centric Optimization 762 views

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework 762 views

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems 761 views

GPU Auto-tuning Framework for Optimal Performance and Power Consumption 759 views

Anatomizing Deep Learning Inference in Web Browsers 758 views

Deep Learning Model Security: Threats and Defenses 756 views

Fastrack: Fast IO for Secure ML using GPU TEEs 755 views

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels 748 views

SUperman: Efficient Permanent Computation on GPUs 747 views

Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs 745 views

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks 745 views

GPUVM: GPU-driven Unified Virtual Memory 742 views

Can Tensor Cores Benefit Memory-Bound Kernels? (No!) 739 views

Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR 736 views

Acceleration for the many, not the few 730 views

LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs 727 views

LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations 725 views

Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL 724 views

Modernization and Optimization of MPI Codes 723 views

Accelerating Sparse Graph Neural Networks with Tensor Core Optimization 722 views

Efficient Configuration of Heterogeneous Resources and Task Scheduling Strategies in Deep Learning Auto-Tuning Systems 718 views

A Survey of General-purpose Polyhedral Compilers 715 views

No More Shading Languages: Compiling C++ to Vulkan Shaders 713 views

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms 711 views

Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code Translation 706 views

Exploration of Cryptocurrency Mining-Specific GPUs in AI Applications: A Case Study of CMP 170HX 705 views

CPU-GPU co-execution through the exploitation of hybrid technologies via SYCL 701 views

Kokkidio: Fast, expressive, portable code, based on Kokkos and Eigen 699 views

Mìmir: A real-time interactive visualization library for CUDA programs 699 views

A User’s Guide to KSig: GPU-Accelerated Computation of the Signature Kernel 691 views

Scheduling Languages: A Past, Present, and Future Taxonomy 690 views

MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction 687 views

Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach 686 views

Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks 682 views

Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data 680 views

Efficient allocation of image recognition and LLM tasks on multi-GPU system 680 views

Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems 675 views

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL 667 views

Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs 667 views

Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams 658 views

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation 654 views

Predicting GPUDirect Benefits for HPC Workloads 653 views

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations 650 views

Adaptive Optimization Techniques for High-Performance Computing 649 views

Optimal Workload Placement on Multi-Instance GPUs 642 views

On the Partitioning of GPU Power among Multi-Instances 641 views

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs 640 views

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication 634 views

Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU 633 views

WiLLM: An Open Wireless LLM Communication System 630 views

Leveraging the potential of task-based programming with OpenMP task graphs 630 views

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applications 628 views

Performance Portable Gradient Computations Using Source Transformation 626 views

Debunking the CUDA Myth Towards GPU-based AI Systems 598 views

Guardian: Safe GPU Sharing in Multi-Tenant Environments 594 views

FLASH: Fast All-to-All Communication in GPU Clusters 594 views

Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics 594 views

Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement 589 views

Specx: a C++ task-based runtime system for heterogeneous distributed architectures 582 views

Optimizing the optimizer increasing performance efficiency of modern compilers 579 views

Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization 577 views

Data Parallel Visualization and Rendering on the RAMSES Supercomputer with ANARI 571 views

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs 567 views

Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability 567 views

Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure? 553 views

RTCUDB: Building Databases with RT Processors 541 views

Development of a new framework for high performance volunteer computing 535 views

Keras Sig: Efficient Path Signature Computation on GPU in Keras 3 531 views

Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs 517 views

Scaling On-Device GPU Inference for Large Generative Models 514 views

Ilargi: a GPU Compatible Factorized ML Model Training Framework 511 views

LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters 505 views

The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations 505 views

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks 503 views

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling 489 views

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration 484 views

Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler 479 views

ConTraPh: Contrastive Learning for Parallelization and Performance Optimization 477 views

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning 461 views

Validation of GPU Computation in Decentralized, Trustless Networks 445 views

GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning 436 views

Understanding the Landscape of Ampere GPU Memory Errors 418 views

Pre-Training LLMs on a budget: A comparison of three optimizers 415 views

Kevin: Multi-Turn RL for Generating CUDA Kernels 411 views

Survey of HPC in US Research Institutions 398 views

Enabling Profile Guided Optimizations (PGO) for Graphics 396 views

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs 381 views

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency 378 views

GPU Acceleration of SQL Analytics on Compressed Data 369 views

NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers 341 views

A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline 335 views

Brief statistics for this page

Titles: 100

Total views: 62725

Recent source codes

SIGMo: Scalable Isomorphism Graph Matching on GPUs

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

DGEMM without FP64 Arithmetic - using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

DGEMM without FP64 Arithmetic – using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

GEAK-agent: LLM-based AI agent, which can write correct and efficient GPU kernels automatically

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

OpenDwarfs 2025: re-engineered version of the OpenDwarfs benchmark suite, for compatibility with modern platforms

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

Specx: Speculative task-based runtime system

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

