high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability 851 views

Principles towards Real-Time Simulation of Material Point Method on Modern GPUs 848 views

Performance portability analysis of SYCL with a classical CG on CPU, GPU, and FPGA 847 views

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors 847 views

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours 847 views

GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations 847 views

SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs 844 views

Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications 843 views

HALF: Holistic Auto Machine Learning for FPGAs 843 views

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels 843 views

Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism 843 views

Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs 841 views

DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence 840 views

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs 838 views

Comparison of different n-body algorithms on various hardware platforms using SYCL 837 views

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark 835 views

torchode: A Parallel ODE Solver for PyTorch 835 views

Evaluation of Intel’s DPC++ Compatibility Tool in heterogeneous computing 833 views

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding 832 views

Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes 832 views

COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs 831 views

MapReduce for Counting Word Frequencies with MPI and GPUs 831 views

Query Processing on Tensor Computation Runtimes 830 views

A Review of the Parallelization Strategies for Iterative Algorithms 830 views

Integrating SkePU’s algorithmic skeletons with GPI on a cluster 825 views

Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library 824 views

Accelerating LBM on a Tightly-Coupled Field Programmable Gate Array 820 views

Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow 818 views

SnuHPL: high performance LINPACK for heterogeneous GPUs 818 views

Homomorphic-Encrypted Volume Rendering 818 views

Performance analysis of matrix-free conjugate gradient kernels using SYCL 815 views

Experience of Migrating a Parallel Graph Coloring Program from CUDA to SYCL 815 views

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments 815 views

KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications 809 views

Flashlight: Enabling Innovation in Tools for Machine Learning 809 views

Impacts of Parallel Programming on Limited-Resource Hardware 809 views

Accelerating JPEG Decompression on GPUs 808 views

GPU-accelerated Faster Mean Shift with euclidean distance metrics 807 views

Application Performance Profiling on Intel GPUs with Oneprof and Onetrace 806 views

GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration 805 views

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU 804 views

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training 804 views

NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge 803 views

tntorch: Tensor Network Learning with PyTorch 802 views

GPU Algorithms for Efficient Exascale Discretizations 800 views

Performance Comparison of Different OpenCL Implementations of LBM Simulation on Commodity Computer Hardware 800 views

GPU Ray Tracing with Monte Carlo Methods 800 views

Reusing Auto-Schedules for Efficient DNN Compilation 798 views

Deep Learning Workload Scheduling in GPU Datacenters: A Survey 795 views

Seamless GPU acceleration for C++ based physics with the Metal Shading Language on Apple’s M series unified chips 792 views

Improving Performance and Energy Efficiency of GPUs through Locality Analysis 791 views

Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml 790 views

OpenMP Offloading in the Jetson Nano Platform 790 views

Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations 789 views

HLS Portability from Intel to Xilinx: A Case Study 789 views

Decompiling x86 Deep Neural Network Executables 788 views

Lina: a fast design optimisation tool for software-based FPGA programming 787 views

A Variant RSA Acceleration with Parallelization 786 views

High performance computing on Android devices – a case study 783 views

Pattern-based Programming Abstractions for Heterogeneous Parallel Computing 783 views

GCN Inference Acceleration using High-Level Synthesis 782 views

DeepAxe: A Framework for Exploration of Approximation and Reliability Trade-offs in DNN Accelerators 781 views

SYCL in the edge: performance and energy evaluation for heterogeneous acceleration 781 views

A Survey on Hardware Accelerators for Large Language Models 779 views

QGTC: Accelerating Quantized GNN via GPU Tensor Core 779 views

Understanding the Power of Evolutionary Computation for GPU Code Optimization 778 views

WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU 777 views

Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use 775 views

A readahead prefetcher for GPU file system layer 774 views

Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA 774 views

Intel oneAPI DPC++ FPGA Optimization Guide 773 views

A Systematic Literature Survey of Sparse Matrix-Vector Multiplication 772 views

Modeling GPU Dynamic Parallelism for Self Similar Density Workloads 771 views

Compiler-centric across-stack deep learning acceleration 769 views

Enabling Quantum Computer Simulations on AMD GPUs: a HIP Backend for Google’s qsim 768 views

Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing 763 views

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs 763 views

Gaiwan: a Size-Polymorphic Typesystem for GPU Programs 761 views

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters 760 views

Retargeting and Respecializing GPU Workloads for Performance Portability 760 views

Python-Based Quantum Chemistry Calculations with GPU Acceleration 760 views

PM4Py-GPU: a High-Performance General-Purpose Library for Process Mining 759 views

Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks 757 views

DISTAL: The Distributed Tensor Algebra Compiler 757 views

On the Compilation Performance of Current SYCL Implementations 756 views

Explicit caching HYB: a new high-performance SpMV framework on GPGPU 756 views

Optimization of Compiler-generated OpenCL CNN Kernels and Runtime for FPGAs 754 views

A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud 754 views

Securing GPU via Region-based Bounds Checking 753 views

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs 753 views

Brief statistics for this page

Titles: 100

Total views: 80601

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

Experiences with implementing Kokkos’ SYCL backend

ROCm's implementation of GROMACS

GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

OpenMC Monte Carlo Code

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

* * *

Recent source codes

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

Most viewed papers (last 30 days)