high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

Managing heterogeneous device memory using C++17 memory resources 485 views

pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations 485 views

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey 485 views

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters 484 views

GPU-based Private Information Retrieval for On-Device Machine Learning Inference 484 views

Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC 482 views

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers 481 views

Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing 480 views

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning 480 views

minimap2-fpga: Integrating hardware-accelerated chaining for efficient end-to-end long-read sequence mapping 477 views

Performance/power assessment of CNN packages on embedded automotive platforms 477 views

HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis 477 views

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation 476 views

Reducing branch divergence to speed up parallel execution of unit testing on GPUs 475 views

Assessing opportunities of SYCL for biological sequence alignment on GPU-based systems 473 views

GPUHarbor: Testing GPU Memory Consistency at Large 473 views

__host__ __device__ — Generic programming in Cuda 470 views

Generating Parallel OpenCL and OpenMP Programs from Dataflow Graphs 468 views

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory 466 views

An Evaluative Comparison of Performance Portability across GPU Programming Models 466 views

Managing, Profiling, and Optimizing Heterogeneous GPU Workloads 465 views

SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving 462 views

High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations 461 views

Energy-Efficient GPU Clusters Scheduling for Deep Learning 461 views

Compiler-assisted distribution of OpenMP code for improved scalability 461 views

Comparing SYCL data transfer strategies for tracking use cases 460 views

Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs 459 views

FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs 457 views

Exploring the Limits of Generic Code Execution on GPUs via Direct (OpenMP) Offload 456 views

Efficient GPU implementation of a class of array permutations 455 views

A portable C++ library for memory and compute abstraction on multi-core CPUs and GPUs 454 views

Prediction of Performance and Power Consumption of GPGPU Applications 454 views

Implementation of a motion estimation algorithm for Intel FPGAs using OpenCL 454 views

Improving the Efficiency of OpenCL Kernels through Pipes 453 views

Reverse-Mode AD of Reduce-by-Index and Scan in Futhark 452 views

Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations 451 views

Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge 451 views

Unified Shared Memory: Friend or Foe? 450 views

SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations 449 views

PopSparse: Accelerated block sparse matrix multiplication on IPU 449 views

Assessing the Impact of Compiler Optimizations on GPUs Reliability 448 views

An Autonomous Data Language 448 views

Improving Performance of Iterative Applications through Interleaved Execution of Approximated CUDA Kernels 447 views

Descend: A Safe GPU Systems Programming Language 447 views

Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs 446 views

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms 446 views

Reinforcement Learning Strategies for Compiler Optimization in High level Synthesis 445 views

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code 443 views

Quantifying OpenMP: Statistical Insights into Usage and Adoption 442 views

ExaNBody: a HPC framework for N-Body applications 442 views

Monadic Deep Learning 440 views

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU 437 views

EfficientBioAI: Making Bioimaging AI Models Efficient in Energy, Latency and Representation 434 views

Full-Scale File System Acceleration on GPU 433 views

Applying the Midas Touch of Reproducibility to High-Performance Computing 431 views

swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer 430 views

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach 428 views

HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU 426 views

MUPPET: Optimizing Performance in OpenMP via Mutation Testing 426 views

Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc 426 views

Creating a Dataset Supporting Translation Between OpenMP Fortran and C++ Code 425 views

TransAxx: Efficient Transformers with Approximate Computing 424 views

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter 424 views

Isolated Scheduling for Distributed Training Tasks in GPU Clusters 421 views

Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads 417 views

Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++ 417 views

An Evaluation of Directive-Based Parallelization on the GPU Using a Parboil Benchmark 416 views

Dynamic autotuning of SpMV kernel in CUSP library 415 views

Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators 415 views

E2C: A Visual Simulator to Reinforce Education of Heterogeneous Computing Systems 413 views

ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time 413 views

Sieve: Stratified GPU-Compute Workload Sampling 413 views

Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation 412 views

Pgx: Hardware-accelerated parallel game simulation for reinforcement learning 409 views

Interconnect Bandwidth Heterogeneity on AMD MI250x and Infinity Fabric 409 views

GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs 407 views

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs 403 views

Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems 401 views

ProtoX: A First Look 400 views

PyTorch Hyperparameter Tuning – A Tutorial for spotPython 398 views

Software Optimization and Orchestration for Heterogeneous and Distributed Architectures 397 views

Fast Knowledge Graph Completion using Graphics Processing Units 397 views

Bridging Control-Centric and Data-Centric Optimization 397 views

GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation 397 views

Efficiency without Tears: Securing Multilingual Programs with TRINITY 396 views

Memory Efficient Mixed-Precision Optimizers 394 views

Towards Alignment of Parallelism in SYCL and ISO C++ 389 views

Parallel Implementation of Lightweight Secure Hash Algorithm on CPU and GPU Environments 389 views

Communication-minimizing Asynchronous Tensor Parallelism 388 views

Dragon-Alpha&cu32: A Java-based Tensor Computing Framework With its High-Performance CUDA Library 386 views

qecGPT: decoding Quantum Error-correcting Codes with Generative Pre-trained Transformers 385 views

Graphtoy: Fast Software Simulation of Applications for AMD’s AI Engines 384 views

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs 384 views

Performance Optimization using Multimodal Modeling and Heterogeneous GNN 384 views

Improving Automatic Parallel Training via Balanced Memory Workload Optimization 381 views

A Heterogeneous Inference Framework for a Deep Neural Network 379 views

Adding fault tolerance to OpenCL: Through redundant heterogeneous computing 366 views

Compressed Real Numbers for AI: a case-study using a RISC-V CPU 363 views

Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core 362 views

Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs 362 views

Brief statistics for this page

Titles: 100

Total views: 43355

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

OpenMC Monte Carlo Code

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

Polygeist: C/C++ frontend for MLIR

Retargeting and Respecializing GPU Workloads for Performance Portability

Parallel Gaussian process with kernel approximation in CUDA


Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)