Views of posts on hgpu.org
Managing heterogeneous device memory using C++17 memory resources 485 views
pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations 485 views
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey 485 views
gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters 484 views
GPU-based Private Information Retrieval for On-Device Machine Learning Inference 484 views
Preliminary report: Initial evaluation of StdPar implementations on AMD GPUs for HPC 482 views
LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers 481 views
Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing 480 views
TLP: A Deep Learning-based Cost Model for Tensor Program Tuning 480 views
Performance/power assessment of CNN packages on embedded automotive platforms 477 views
HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis 477 views
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation 476 views
Reducing branch divergence to speed up parallel execution of unit testing on GPUs 475 views
Assessing opportunities of SYCL for biological sequence alignment on GPU-based systems 473 views
GPUHarbor: Testing GPU Memory Consistency at Large 473 views
__host__ __device__ — Generic programming in Cuda 470 views
Generating Parallel OpenCL and OpenMP Programs from Dataflow Graphs 468 views
LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory 466 views
An Evaluative Comparison of Performance Portability across GPU Programming Models 466 views
Managing, Profiling, and Optimizing Heterogeneous GPU Workloads 465 views
SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving 462 views
High-order thread-safe lattice Boltzmann model for HPC turbulent flow simulations 461 views
Energy-Efficient GPU Clusters Scheduling for Deep Learning 461 views
Compiler-assisted distribution of OpenMP code for improved scalability 461 views
Comparing SYCL data transfer strategies for tracking use cases 460 views
Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs 459 views
FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs 457 views
Exploring the Limits of Generic Code Execution on GPUs via Direct (OpenMP) Offload 456 views
Efficient GPU implementation of a class of array permutations 455 views
A portable C++ library for memory and compute abstraction on multi-core CPUs and GPUs 454 views
Prediction of Performance and Power Consumption of GPGPU Applications 454 views
Implementation of a motion estimation algorithm for Intel FPGAs using OpenCL 454 views
Improving the Efficiency of OpenCL Kernels through Pipes 453 views
Reverse-Mode AD of Reduce-by-Index and Scan in Futhark 452 views
Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations 451 views
Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge 451 views
Unified Shared Memory: Friend or Foe? 450 views
SIMULATeQCD: A simple multi-GPU lattice code for QCD calculations 449 views
PopSparse: Accelerated block sparse matrix multiplication on IPU 449 views
Assessing the Impact of Compiler Optimizations on GPUs Reliability 448 views
An Autonomous Data Language 448 views
Descend: A Safe GPU Systems Programming Language 447 views
Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs 446 views
Runtime Support for Performance Portability on Heterogeneous Distributed Platforms 446 views
Reinforcement Learning Strategies for Compiler Optimization in High level Synthesis 445 views
ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code 443 views
Quantifying OpenMP: Statistical Insights into Usage and Adoption 442 views
ExaNBody: a HPC framework for N-Body applications 442 views
Monadic Deep Learning 440 views
Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU 437 views
EfficientBioAI: Making Bioimaging AI Models Efficient in Energy, Latency and Representation 434 views
Full-Scale File System Acceleration on GPU 433 views
Applying the Midas Touch of Reproducibility to High-Performance Computing 431 views
Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach 428 views
HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU 426 views
MUPPET: Optimizing Performance in OpenMP via Mutation Testing 426 views
Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc 426 views
Creating a Dataset Supporting Translation Between OpenMP Fortran and C++ Code 425 views
TransAxx: Efficient Transformers with Approximate Computing 424 views
Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter 424 views
Isolated Scheduling for Distributed Training Tasks in GPU Clusters 421 views
Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads 417 views
An Evaluation of Directive-Based Parallelization on the GPU Using a Parboil Benchmark 416 views
Dynamic autotuning of SpMV kernel in CUSP library 415 views
Distributed OpenMP Offloading of OpenMC on Intel GPU MAX Accelerators 415 views
E2C: A Visual Simulator to Reinforce Education of Heterogeneous Computing Systems 413 views
ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time 413 views
Sieve: Stratified GPU-Compute Workload Sampling 413 views
Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation 412 views
Pgx: Hardware-accelerated parallel game simulation for reinforcement learning 409 views
Interconnect Bandwidth Heterogeneity on AMD MI250x and Infinity Fabric 409 views
GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs 407 views
Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs 403 views
Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems 401 views
ProtoX: A First Look 400 views
PyTorch Hyperparameter Tuning – A Tutorial for spotPython 398 views
Software Optimization and Orchestration for Heterogeneous and Distributed Architectures 397 views
Fast Knowledge Graph Completion using Graphics Processing Units 397 views
Bridging Control-Centric and Data-Centric Optimization 397 views
GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation 397 views
Efficiency without Tears: Securing Multilingual Programs with TRINITY 396 views
Memory Efficient Mixed-Precision Optimizers 394 views
Towards Alignment of Parallelism in SYCL and ISO C++ 389 views
Parallel Implementation of Lightweight Secure Hash Algorithm on CPU and GPU Environments 389 views
Communication-minimizing Asynchronous Tensor Parallelism 388 views
Dragon-Alpha&cu32: A Java-based Tensor Computing Framework With its High-Performance CUDA Library 386 views
qecGPT: decoding Quantum Error-correcting Codes with Generative Pre-trained Transformers 385 views
Graphtoy: Fast Software Simulation of Applications for AMD’s AI Engines 384 views
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs 384 views
Performance Optimization using Multimodal Modeling and Heterogeneous GNN 384 views
Improving Automatic Parallel Training via Balanced Memory Workload Optimization 381 views
A Heterogeneous Inference Framework for a Deep Neural Network 379 views
Adding fault tolerance to OpenCL: Through redundant heterogeneous computing 366 views
Compressed Real Numbers for AI: a case-study using a RISC-V CPU 363 views
Towards a GPU-Parallelization of the neXtSIM-DG Dynamical Core 362 views
Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs 362 views
Titles: 100
Total views: 43355