high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

Benchmarking optimization algorithms for auto-tuning GPU kernels 752 views

Domain-Specific On-Device Object Detection Method 752 views

Machine Learning for CUDA+MPI Design Rules 751 views

Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures 751 views

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training 750 views

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning 750 views

Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge 750 views

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU 749 views

Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml 747 views

Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure 747 views

FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators 747 views

BenchPress: A Deep Active Benchmark Generator 746 views

Fully Concurrent GPU Data Structures 745 views

cuPSO: GPU Parallelization for Particle Swarm Optimization Algorithms 743 views

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture 742 views

Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure 741 views

Performance portability evaluation of blocked stencil computations on GPUs 741 views

Deep Learning Models on CPUs: A Methodology for Efficient Training 739 views

Source-to-Source Automatic Differentiation of OpenMP Parallel Loops 739 views

Parallel programming in mobile devices with FancyJCL 738 views

Performance study on GPU offloading techniques using the Gauss matrix inverse algorithm 738 views

FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis 738 views

Studying the Potential of Automatic Optimizations in the Intel FPGA SDK for OpenCL 737 views

Monitoring Collective Communication Among GPUs 736 views

Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning 736 views

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs 733 views

Fast Arbitrary Precision Floating Point on FPGA 732 views

A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code 732 views

Principles for Automated and Reproducible Benchmarking 731 views

Collage: Automated Integration of Deep Learning Backends 731 views

A Ray Tracing Implementation Performance Comparison between the CPU and the GPU 731 views

A Variant of Concurrent Constraint Programming on GPU 731 views

Parallel and Heterogeneous Timing Analysis: Partition, Algorithm, and System 730 views

OpenCL-HPX Integration 730 views

Fast convolution kernels on pascal GPU with high memory efficiency 730 views

Towards making the most of NLP-based device mapping optimization for OpenCL kernels 729 views

Autotuning CUDA: Applying NLP Techniques to LS-CAT 728 views

Fast GPU bounding boxes on tree-structured scenes 728 views

ALPINIST: An Annotation-Aware GPU Program Optimizer 728 views

N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks 728 views

Concurrency Mapping to FPGAs with OpenCL: A Case Study with a Shallow Water Kernel 726 views

Capturing the Memory Topology of GPUs 725 views

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis 724 views

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism 724 views

GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices based on Fine-Grained Structured Weight Sparsity 723 views

GGArray: A Dynamically Growable GPU Array 720 views

Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads 720 views

EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models 719 views

The Application of AI Technology in GPU Scheduling Algorithm Optimization 718 views

MGARD: A multigrid framework for high-performance, error-controlled data compression and refactoring 718 views

IgNet. A Super-precise Convolutional Neural Network 718 views

Testing and Mutation Testing for GPU Kernels 715 views

Compiler Technologies in Deep Learning Co-Design: A Survey 715 views

Demystifying Dependency Bugs in Deep Learning Stack 711 views

OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials 710 views

Simulation Methodologies for Mobile GPUs 708 views

Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations 708 views

Providing performance portable numerics for Intel GPUs 706 views

An Open-source FPGA Library for Data Sorting 706 views

Deductive verification for SYCL 704 views

Extending MAGMA Portability with OneAPI 703 views

Three Contributions to the Theory and Practice of Optimizing Compilers 701 views

Multi-GPU thermal lattice Boltzmann simulations using OpenACC and MPI 700 views

On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention 699 views

SciAI4Industry – Solving PDEs for industry-scale problems with deep learning 698 views

CPU-GPU Layer-Switched Low Latency CNN Inference 698 views

Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation 698 views

Optimizing Deep Learning Models For Raspberry Pi 697 views

OpenMP Advisor 696 views

Safe and Practical GPU Acceleration in TrustZone 696 views

Accelerating bioinformatics applications on CUDA-enabled multi-GPU systems 696 views

Evaluation of Rust for GPGPU high-performance computing 696 views

GT4Py: High Performance Stencils for Weather and Climate Applications using Python 696 views

From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels 696 views

Using AI libraries for Incompressible Computational Fluid Dynamics 694 views

Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading 693 views

Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era 693 views

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs 693 views

Julia as a unifying end-to-end workflow language on the Frontier exascale system 693 views

OpenMP offload at the Exascale using Intel GPU Max 1550: evaluation of STREAmS compressible solver 693 views

High Performance Privacy Preserving AI 692 views

Performance Evaluation of Heterogeneous GPU Programming Frameworks for Hemodynamic Simulations 691 views

Manas: Mining Software Repositories to Assist AutoML 691 views

A tool set for random number generation on GPUs in R 691 views

Challenges and Opportunities in C/C++ Source-To-Source Compilation 690 views

Strega: An HTTP Server for FPGAs 689 views

Evaluating the Wide Area Classroom After 24,000 HPC Students 687 views

Cramming: Training a Language Model on a Single GPU in One Day 687 views

An OpenCL-Based FPGA Accelerator for Faster R-CNN 685 views

Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU 685 views

Real-Time High-Performance Computing for Embedded Control Systems 685 views

Evaluation of Pseudo-Random Number Generation on GPU Cards 685 views

QArray: a GPU-accelerated constant capacitance model simulator for large quantum dot arrays 685 views

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks 684 views

The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science 682 views

Challenges and Techniques for Transparent Acceleration of Unmodified Big Data Applications 681 views

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance 680 views

AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference 680 views

Behavioral graph fraud detection in E-commerce 679 views

Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment 679 views

Brief statistics for this page

Titles: 100

Total views: 71512

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

Experiences with implementing Kokkos’ SYCL backend

ROCm's implementation of Gromacs

GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

OpenMC Monte Carlo Code

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

