high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

Tensor Computation Based on Heterogeneous Memory 595 views

Performance portability study of epistasis detection using SYCL on NVIDIA GPU 595 views

DGEMM on Integer Matrix Multiplication Unit 594 views

FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL 594 views

Deep Language Models for Software Testing and Optimisation 594 views

Statistical Computing With Graphics Processing Units 594 views

Dropbear: Machine Learning Marketplaces made Trustworthy with Byzantine Model Agreement 593 views

Pulsar search acceleration using FPGAs and OpenCL templates 589 views

Thwarting Piracy: Anti-debugging Using GPU-assisted Self-healing Codes 589 views

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference 588 views

Thread-safe lattice Boltzmann for high-performance computing on GPUs 587 views

PySAGES: flexible, advanced sampling methods accelerated with GPUs 587 views

Porting numerical integration codes from CUDA to oneAPI: a case study 586 views

Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL) 585 views

Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality 585 views

GPU First – Execution of Legacy CPU Codes on GPUs 585 views

Software Optimization and Orchestration for Heterogeneous and Distributed Architectures 584 views

Novel Parallel Approaches to Efficiently Solve Spatial Problems on Heterogeneous CPU-GPU Systems 584 views

Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL 582 views

Solving MaxSAT with Matrix Multiplication 581 views

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis 581 views

Automated Buffer Sizing of Dataflow Applications in a High-Level Synthesis Workflow 580 views

Distributed Calculations with Algorithmic Skeletons for Heterogeneous Computing Environments 580 views

Efficient Incremental Text-to-Speech on GPUs 579 views

Exploring Thread Coarsening on FPGA 578 views

Gallatin: A General-Purpose GPU Memory Manager 576 views

GPU Load Balancing 575 views

A Domain-Extensible Compiler with Controllable Automation of Optimisations 575 views

A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures 569 views

Fuzzing Loop Optimizations in Compilers for C++ and Data-Parallel Languages 569 views

Simple and efficient GPU accelerated topology optimisation: Codes and applications 569 views

cuSZ-I: High-Fidelity Error-Bounded Lossy Compression for Scientific Data on GPUs 567 views

High Performance Simulation for Scalable Multi-Agent Reinforcement Learning 566 views

Understanding the Impact of Input Entropy on FPU, CPU, and GPU Power 566 views

RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing 564 views

Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training 563 views

A Survey on Optimization Techniques for Edge Artificial Intelligence (AI) 562 views

Auto-SpMV: Automated Optimizing SpMV Kernels on GPU 562 views

Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU 561 views

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication 561 views

SaLoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs 560 views

GPUNet: Searching the Deployable Convolution Neural Networks for GPUs 560 views

Improving Energy Efficiency of Basic Linear Algebra Routines on Heterogeneous Systems with Multiple GPUs 558 views

Edge AI for Internet of Energy: Challenges and Perspectives 558 views

Long Code for Code Search 557 views

ARK: GPU-driven Code Execution for Distributed Deep Learning 557 views

Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay 556 views

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s 555 views

BenchDirect: A Directed Language Model for Compiler Benchmarks 553 views

SkyFlow: Heterogeneous streaming for skyline computation using FlowGraph and SYCL 551 views

Seer: Predictive Runtime Kernel Selection for Irregular Problems 551 views

Frameworks in Medical Image Analysis with Deep Neural Networks 550 views

mu-grind: A Framework for Dynamically Instrumenting HLS-Generated RTL 549 views

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers 549 views

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework 548 views

Optimization of massive data applications on heterogeneous architectures 546 views

Static and Dynamic Analyses for Efficient GPU Execution 545 views

Understanding Performance Portability of Bioinformatics Applications in SYCL on an NVIDIA GPU 544 views

ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills 544 views

HiRace: Accurate and Fast Source-Level Race Checking of GPU Programs 543 views

PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability 540 views

Using scheduling entropy amplification in CUDA/OpenMP code to exhibit non-reproducibility issues 539 views

APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics 537 views

On the Three P’s of Parallel Programming for Heterogeneous Computing: Performance, Productivity, and Portability 537 views

Orca: FSS-based Secure Training with GPUs 531 views

Porting Batched Iterative Solvers onto Intel GPUs with SYCL 531 views

PIGEON: Optimizing CUDA Code Generator for End-to-End Training and Inference of Relational Graph Neural Networks 531 views

Analyzing GPU Performance in Virtualized Environments: A Case Study 530 views

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems 529 views

Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems 528 views

A Deep Learning Model for Loop Interchange 528 views

cuSLINK: Single-linkage Agglomerative Clustering on the GPU 528 views

Fortran High-Level Synthesis: Reducing the barriers to accelerating HPC codes on FPGAs 527 views

Hybrid quantum programming with PennyLane Lightning on HPC platforms 525 views

UniFL: Accelerating Federated Learning Using Heterogeneous Hardware Under a Unified Framework 525 views

Compilation and Design Space Exploration of Dataflow Programs for Heterogeneous CPU-GPU Platforms 523 views

FLIA: Architecture of Collaborated Mobile GPU and FPGA Heterogeneous Computing 523 views

Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies 521 views

An Asynchronous Dataflow-Driven Execution Model For Distributed Accelerator Computing 521 views

CHARM-SYCL: New Unified Programming Environment for Multiple Accelerator Types 519 views

SYCL compute kernels for ExaHyPE 518 views

Performant low-order matrix-free finite element kernels on GPU architectures 518 views

cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications 517 views

DSDP: A Blind Docking Strategy Accelerated by GPUs 515 views

Matrix Multiplication Using Only Addition 512 views

EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs 511 views

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation 511 views

CMLCompiler: A Unified Compiler for Classical Machine Learning 510 views

Hardware Checkpointing and Productive Debugging Flows for FPGAs 509 views

Scope is all you need: Transforming LLMs for HPC Code 508 views

Revisiting Query Performance in GPU Database Systems 506 views

Novel insights on atomic synchronization for sort-based group-by on GPUs 506 views

Efficient OpenCL system integration of non-blocking FPGA accelerators 506 views

A Study on the Intersection of GPU Utilization and CNN Inference 504 views

A Performance-Portable SYCL Implementation of CRK-HACC for Exascale 504 views

Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution 504 views

Out-of-the-box library support for DBMS operations on GPUs 503 views

Precision and Performance Analysis of C Standard Math Library Functions on GPUs 500 views

A High-Performance Computing Cluster for Distributed Deep Learning: A Practical Case of Weed Classification Using Convolutional Neural Network Models 499 views

Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications 499 views

Brief statistics for this page

Titles: 100

Total views: 54941

Code examples for paper on SYCL backend of Kokkos - IWOCL 2024

Experiences with implementing Kokkos’ SYCL backend

ROCm's implementation of Gromacs

GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

OpenMC Monte Carlo Code

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs


Recent source codes


GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

