Views of posts on hgpu.org
AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference 681 views
GPU Offloading in ExaHyPE Through C++ Standard Algorithms 681 views
Theseus: A Library for Differentiable Nonlinear Optimization 679 views
Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU 679 views
The OpenMP Cluster Programming Model 678 views
SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems 677 views
End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning 676 views
GC3: An Optimizing Compiler for GPU Collective Communication 676 views
SCALSALE: Scalable SALE Benchmark Framework for Supercomputers 675 views
Optimizing a Hardware Network Stack to Realize an In-Network ML Inference Application 675 views
CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices 675 views
FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems 674 views
COX: Exposing CUDA Warp-Level Functions to CPUs 674 views
Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU 674 views
Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments 673 views
PMT: Power Measurement Toolkit 672 views
Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation 670 views
PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework 669 views
User-Driven Online Kernel Fusion for SYCL 663 views
CitiusSynapse: A Deep Learning Framework for Embedded Systems 662 views
Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads 662 views
OpenMP Kernel Language Extensions for Performance Portable GPU Codes 661 views
Distributed, combined CPU and GPU profiling within HPX using APEX 660 views
FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data 658 views
Bayesian Optimization for auto-tuning GPU kernels 658 views
Can We Run in Parallel? Automating Loop Parallelization for TornadoVM 657 views
Analytical Performance Estimation during Code Generation on Modern GPUs 657 views
An approach to performance portability through generic programming 656 views
Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation 655 views
Multi-line AI-assisted Code Authoring 653 views
An experimental study of group-by and aggregation on CPU-GPU processors 652 views
A systematic performance study of the parallel programming framework SkePU 3 using HPC-benchmarks 652 views
Kernel-as-a-Service: A Serverless Interface to GPUs 651 views
Code Generation for a Variety of Accelerators for a Graph DSL 650 views
Performance Models for Heterogeneous Iterative Programs 648 views
Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame 647 views
Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems 647 views
Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors 647 views
Training DNN Models over Heterogeneous Clusters with Optimal Performance 646 views
GPU-Acceleration of Tensor Renormalization with PyTorch using CUDA 642 views
Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS 641 views
Harmonic CUDA: Asynchronous Programming on GPUs 639 views
Evaluation of FPGA-based high performance computing platforms 638 views
Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform 638 views
Exploiting dynamic sparse matrices for performance portable linear algebra operations 637 views
Fault Injection techniques for GPU Reliability Evaluation 636 views
Enabling Data Movement and Computation Pipelining in Deep Learning Compiler 636 views
SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets 635 views
pyGSL: A Graph Structure Learning Toolkit 634 views
Benchmarking GPU and TPU Performance with Graph Neural Networks 632 views
Compute units in OpenMP: Extensions for heterogeneous parallel programming 632 views
Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models 631 views
A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration 631 views
Implementation Techniques for SPMD Kernels on CPUs 630 views
A Programming Model for GPU Load Balancing 628 views
APPy: Annotated Parallelism for Python on GPUs 628 views
TorchOpt: An Efficient Library for Differentiable Optimization 627 views
CuPBoP: CUDA for Parallelized and Broad-range Processors 626 views
Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs 626 views
Increased reliability on Intel GPUs via software diverse redundancy 625 views
Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views 625 views
Efficiently Processing Large Relational Joins on GPUs 625 views
Low-Overhead Trace Collection and Profiling on GPU Compute Kernels 625 views
eGPU: A 750 MHz Class Soft GPGPU for FPGA 625 views
Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale 622 views
Improving the scalability of modern applications by parallel multi-core and many-core programming 621 views
Towards Performance Portable Programming for Distributed Heterogeneous Systems 620 views
iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud 619 views
Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis 618 views
Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs 617 views
Experiences Migrating CUDA to SYCL: A Molecular Docking Case Study 616 views
SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance 612 views
Kernel Tuning Toolkit 609 views
FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL 608 views
Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark 608 views
Minuet: Accelerating 3D Sparse Convolutions on GPUs 608 views
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey 607 views
Design and Implementation of ShenWei Universal C/C++ 607 views
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels 606 views
Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach 605 views
Porting OpenACC to OpenMP on heterogeneous systems 605 views
MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems 604 views
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models 602 views
Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview 602 views
Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications 599 views
Myths and Legends in High-Performance Computing 598 views
Optimization of Ported CFD Kernels on Intel Data Center GPU Max 1550 using oneAPI ESIMD 597 views
Towards energy efficiency and productivity for decision making in mobile robot navigation 597 views
Efficient Quantized Sparse Matrix Operations on Tensor Cores 597 views
Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper 596 views
Titles: 100
Total views: 63824