high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference 681 views

GPU Offloading in ExaHyPE Through C++ Standard Algorithms 681 views

Theseus: A Library for Differentiable Nonlinear Optimization 679 views

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU 679 views

The OpenMP Cluster Programming Model 678 views

SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems 677 views

Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang 677 views

End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning 676 views

GC3: An Optimizing Compiler for GPU Collective Communication 676 views

SCALSALE: Scalable SALE Benchmark Framework for Supercomputers 675 views

Optimizing a Hardware Network Stack to Realize an In-Network ML Inference Application 675 views

CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices 675 views

FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems 674 views

COX: Exposing CUDA Warp-Level Functions to CPUs 674 views

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU 674 views

Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments 673 views

PMT: Power Measurement Toolkit 672 views

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation 670 views

PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework 669 views

User-Driven Online Kernel Fusion for SYCL 663 views

CitiusSynapse: A Deep Learning Framework for Embedded Systems 662 views

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads 662 views

OpenMP Kernel Language Extensions for Performance Portable GPU Codes 661 views

Distributed, combined CPU and GPU profiling within HPX using APEX 660 views

FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data 658 views

Bayesian Optimization for auto-tuning GPU kernels 658 views

Can We Run in Parallel? Automating Loop Parallelization for TornadoVM 657 views

Analytical Performance Estimation during Code Generation on Modern GPUs 657 views

An approach to performance portability through generic programming 656 views

A Framework to Generate High-Performance Time-stepped Agent-based Simulations on Heterogeneous Hardware 656 views

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation 655 views

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad 655 views

Multi-line AI-assisted Code Authoring 653 views

An experimental study of group-by and aggregation on CPU-GPU processors 652 views

A systematic performance study of the parallel programming framework SkePU 3 using HPC-benchmarks 652 views

Kernel-as-a-Service: A Serverless Interface to GPUs 651 views

Code Generation for a Variety of Accelerators for a Graph DSL 650 views

Performance Models for Heterogeneous Iterative Programs 648 views

Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame 647 views

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems 647 views

Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors 647 views

Training DNN Models over Heterogeneous Clusters with Optimal Performance 646 views

Hybrid CPU/GPU/APU accelerated query, insert, update and erase operations in hash tables with string keys 646 views

GPU-Acceleration of Tensor Renormalization with PyTorch using CUDA 642 views

Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS 641 views

Harmonic CUDA: Asynchronous Programming on GPUs 639 views

Evaluation of FPGA-based high performance computing platforms 638 views

Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform 638 views

Exploiting dynamic sparse matrices for performance portable linear algebra operations 637 views

Fault Injection techniques for GPU Reliability Evaluation 636 views

Enabling Data Movement and Computation Pipelining in Deep Learning Compiler 636 views

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets 635 views

pyGSL: A Graph Structure Learning Toolkit 634 views

Benchmarking GPU and TPU Performance with Graph Neural Networks 632 views

Compute units in OpenMP: Extensions for heterogeneous parallel programming 632 views

Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models 631 views

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration 631 views

gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs 630 views

Implementation Techniques for SPMD Kernels on CPUs 630 views

A Programming Model for GPU Load Balancing 628 views

APPy: Annotated Parallelism for Python on GPUs 628 views

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs 627 views

TorchOpt: An Efficient Library for Differentiable Optimization 627 views

CuPBoP: CUDA for Parallelized and Broad-range Processors 626 views

Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs 626 views

Increased reliability on Intel GPUs via software diverse redundancy 625 views

Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views 625 views

Efficiently Processing Large Relational Joins on GPUs 625 views

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels 625 views

eGPU: A 750 MHz Class Soft GPGPU for FPGA 625 views

Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale 622 views

Improving the scalability of modern applications by parallel multi-core and many-core programming 621 views

Towards Performance Portable Programming for Distributed Heterogeneous Systems 620 views

OpenRAND: A Performance Portable, Reproducible Random Number Generation Library for Parallel Computations 620 views

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud 619 views

Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis 618 views

Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs 617 views

Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems 616 views

Experiences Migrating CUDA to SYCL: A Molecular Docking Case Study 616 views

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance 612 views

Kernel Tuning Toolkit 609 views

FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL 608 views

Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark 608 views

Minuet: Accelerating 3D Sparse Convolutions on GPUs 608 views

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey 607 views

Design and Implementation of ShenWei Universal C/C++ 607 views

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels 606 views

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach 605 views

Porting OpenACC to OpenMP on heterogeneous systems 605 views

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems 604 views

Experience Migrating OpenCL to SYCL: A Case Study on Searches for Potential Off-Target Sites of Cas9 RNA-Guided Endonucleases on AMD GPUs 603 views

Improving Performance of Hardware Accelerators by Optimizing Data Movement: A Bioinformatics Case Study 602 views

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models 602 views

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview 602 views

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications 599 views

Myths and Legends in High-Performance Computing 598 views

Optimization of Ported CFD Kernels on Intel Data Center GPU Max 1550 using oneAPI ESIMD 597 views

Towards energy efficiency and productivity for decision making in mobile robot navigation 597 views

Efficient Quantized Sparse Matrix Operations on Tensor Cores 597 views

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper 596 views

Brief statistics for this page

Titles: 100

Total views: 63824

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

Experiences with implementing Kokkos’ SYCL backend

ROCm's implementation of Gromacs

GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

OpenMC Monte Carlo Code

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

Recent source codes

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

Most viewed papers (last 30 days)