2402

Views of posts on hgpu.org

Managing heterogeneous device memory using C++17 memory resources  485 views

pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations  485 views

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey  485 views

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters  484 views

GPU-based Private Information Retrieval for On-Device Machine Learning Inference  484 views

Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC  482 views

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers  481 views

Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing  480 views

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning  480 views

minimap2-fpga: Integrating hardware-accelerated chaining for efficient end-to-end long-read sequence mapping  477 views

Performance/power assessment of CNN packages on embedded automotive platforms  477 views

HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis  477 views

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation  476 views

Reducing branch divergence to speed up parallel execution of unit testing on GPUs  475 views

Assessing opportunities of SYCL for biological sequence alignment on GPU-based systems  473 views

GPUHarbor: Testing GPU Memory Consistency at Large  473 views

__host__ __device__ — Generic programming in Cuda  470 views

Generating Parallel OpenCL and OpenMP Programs from Dataflow Graphs  468 views

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory  466 views

An Evaluative Comparison of Performance Portability across GPU Programming Models  466 views

Managing, Profiling, and Optimizing Heterogeneous GPU Workloads  465 views

SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving  462 views

High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations  461 views

Energy-Efficient GPU Clusters Scheduling for Deep Learning  461 views

Compiler-assisted distribution of OpenMP code for improved scalability  461 views

Comparing SYCL data transfer strategies for tracking use cases  460 views

Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs  459 views

FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs  457 views

Exploring the Limits of Generic Code Execution on GPUs via Direct (OpenMP) Offload  456 views

Efficient GPU implementation of a class of array permutations  455 views

A portable C++ library for memory and compute abstraction on multi-core CPUs and GPUs  454 views

Prediction of Performance and Power Consumption of GPGPU Applications  454 views

Implementation of a motion estimation algorithm for Intel FPGAs using OpenCL  454 views

Improving the Efficiency of OpenCL Kernels through Pipes  453 views

Reverse-Mode AD of Reduce-by-Index and Scan in Futhark  452 views

Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations  451 views

Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge  451 views

Unified Shared Memory: Friend or Foe?  450 views

SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations  449 views

PopSparse: Accelerated block sparse matrix multiplication on IPU  449 views

Assessing the Impact of Compiler Optimizations on GPUs Reliability  448 views

An Autonomous Data Language  448 views

Improving Performance of Iterative Applications through Interleaved Execution of Approximated CUDA Kernels  447 views

Descend: A Safe GPU Systems Programming Language  447 views

Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs  446 views

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms  446 views

Reinforcement Learning Strategies for Compiler Optimization in High level Synthesis  445 views

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code  443 views

Quantifying OpenMP: Statistical Insights into Usage and Adoption  442 views

ExaNBody: a HPC framework for N-Body applications  442 views

Monadic Deep Learning  440 views

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU  437 views

EfficientBioAI: Making Bioimaging AI Models Efficient in Energy, Latency and Representation  434 views

Full-Scale File System Acceleration on GPU  433 views

Applying the Midas Touch of Reproducibility to High-Performance Computing  431 views

swCUDA: Auto parallel code translation framework from CUDA to ATHREAD for new generation sunway supercomputer  430 views

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach  428 views

HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU  426 views

MUPPET: Optimizing Performance in OpenMP via Mutation Testing  426 views

Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc  426 views

Creating a Dataset Supporting Translation Between OpenMP Fortran and C++ Code  425 views

TransAxx: Efficient Transformers with Approximate Computing  424 views

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter  424 views

Isolated Scheduling for Distributed Training Tasks in GPU Clusters  421 views

Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads  417 views

Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++  417 views

An Evaluation of Directive-Based Parallelization on the GPU Using a Parboil Benchmark  416 views

Dynamic autotuning of SpMV kernel in CUSP library  415 views

Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators  415 views

E2C: A Visual Simulator to Reinforce Education of Heterogeneous Computing Systems  413 views

ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time  413 views

Sieve: Stratified GPU-Compute Workload Sampling  413 views

Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation  412 views

Pgx: Hardware-accelerated parallel game simulation for reinforcement learning  409 views

Interconnect Bandwidth Heterogeneity on AMD MI250x and Infinity Fabric  409 views

GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs  407 views

Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs  403 views

Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems  401 views

ProtoX: A First Look  400 views

PyTorch Hyperparameter Tuning – A Tutorial for spotPython  398 views

Software Optimization and Orchestration for Heterogeneous and Distributed Architectures  397 views

Fast Knowledge Graph Completion using Graphics Processing Units  397 views

Bridging Control-Centric and Data-Centric Optimization  397 views

GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation  397 views

Efficiency without Tears: Securing Multilingual Programs with TRINITY  396 views

Memory Efficient Mixed-Precision Optimizers  394 views

Towards Alignment of Parallelism in SYCL and ISO C++  389 views

Parallel Implementation of Lightweight Secure Hash Algorithm on CPU and GPU Environments  389 views

Communication-minimizing Asynchronous Tensor Parallelism  388 views

Dragon-Alpha&cu32: A Java-based Tensor Computing Framework With its High-Performance CUDA Library  386 views

qecGPT: decoding Quantum Error-correcting Codes with Generative Pre-trained Transformers  385 views

Graphtoy: Fast Software Simulation of Applications for AMD’s AI Engines  384 views

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs  384 views

Performance Optimization using Multimodal Modeling and Heterogeneous GNN  384 views

Improving Automatic Parallel Training via Balanced Memory Workload Optimization  381 views

A Heterogeneous Inference Framework for a Deep Neural Network  379 views

Adding fault tolerance to OpenCL: Through redundant heterogeneous computing  366 views

Compressed Real Numbers for AI: a case-study using a RISC-V CPU  363 views

Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core  362 views

Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs  362 views

 

Brief statistics for this page

Titles: 100

Total views: 43355

 


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
