Views of posts on hgpu.org

AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference  681 views

GPU Offloading in ExaHyPE Through C++ Standard Algorithms  681 views

Theseus: A Library for Differentiable Nonlinear Optimization  679 views

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU  679 views

The OpenMP Cluster Programming Model  678 views

SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems  677 views

Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang  677 views

End-to-end Mapping in Heterogeneous Systems Using Graph Representation Learning  676 views

GC3: An Optimizing Compiler for GPU Collective Communication  676 views

SCALSALE: Scalable SALE Benchmark Framework for Supercomputers  675 views

Optimizing a Hardware Network Stack to Realize an In-Network ML Inference Application  675 views

CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices  675 views

FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems  674 views

COX: Exposing CUDA Warp-Level Functions to CPUs  674 views

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU  674 views

Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments  673 views

PMT: Power Measurement Toolkit  672 views

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation  670 views

PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework  669 views

User-Driven Online Kernel Fusion for SYCL  663 views

CitiusSynapse: A Deep Learning Framework for Embedded Systems  662 views

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads  662 views

OpenMP Kernel Language Extensions for Performance Portable GPU Codes  661 views

Distributed, combined CPU and GPU profiling within HPX using APEX  660 views

FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data  658 views

Bayesian Optimization for auto-tuning GPU kernels  658 views

Can We Run in Parallel? Automating Loop Parallelization for TornadoVM  657 views

Analytical Performance Estimation during Code Generation on Modern GPUs  657 views

An approach to performance portability through generic programming  656 views

A Framework to Generate High-Performance Time-stepped Agent-based Simulations on Heterogeneous Hardware  656 views

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation  655 views

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad  655 views

Multi-line AI-assisted Code Authoring  653 views

An experimental study of group-by and aggregation on CPU-GPU processors  652 views

A systematic performance study of the parallel programming framework SkePU 3 using HPC-benchmarks  652 views

Kernel-as-a-Service: A Serverless Interface to GPUs  651 views

Code Generation for a Variety of Accelerators for a Graph DSL  650 views

Performance Models for Heterogeneous Iterative Programs  648 views

Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame  647 views

Towards Understanding and Mitigating Memory-Access Challenges in Computing Systems  647 views

Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors  647 views

Training DNN Models over Heterogeneous Clusters with Optimal Performance  646 views

Hybrid CPU/GPU/APU accelerated query, insert, update and erase operations in hash tables with string keys  646 views

GPU-Acceleration of Tensor Renormalization with PyTorch using CUDA  642 views

Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS  641 views

Harmonic CUDA: Asynchronous Programming on GPUs  639 views

Evaluation of FPGA-based high performance computing platforms  638 views

Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform  638 views

Exploiting dynamic sparse matrices for performance portable linear algebra operations  637 views

Fault Injection techniques for GPU Reliability Evaluation  636 views

Enabling Data Movement and Computation Pipelining in Deep Learning Compiler  636 views

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets  635 views

pyGSL: A Graph Structure Learning Toolkit  634 views

Benchmarking GPU and TPU Performance with Graph Neural Networks  632 views

Compute units in OpenMP: Extensions for heterogeneous parallel programming  632 views

Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models  631 views

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration  631 views

gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs  630 views

Implementation Techniques for SPMD Kernels on CPUs  630 views

A Programming Model for GPU Load Balancing  628 views

APPy: Annotated Parallelism for Python on GPUs  628 views

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs  627 views

TorchOpt: An Efficient Library for Differentiable Optimization  627 views

CuPBoP: CUDA for Parallelized and Broad-range Processors  626 views

Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs  626 views

Increased reliability on Intel GPUs via software diverse redundancy  625 views

Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views  625 views

Efficiently Processing Large Relational Joins on GPUs  625 views

Low-Overhead Trace Collection and Profiling on GPU Compute Kernels  625 views

eGPU: A 750 MHz Class Soft GPGPU for FPGA  625 views

Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale  622 views

Improving the scalability of modern applications by parallel multi-core and many-core programming  621 views

Towards Performance Portable Programming for Distributed Heterogeneous Systems  620 views

OpenRAND: A Performance Portable, Reproducible Random Number Generation Library for Parallel Computations  620 views

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud  619 views

Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis  618 views

Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs  617 views

Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems  616 views

Experiences Migrating CUDA to SYCL: A Molecular Docking Case Study  616 views

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance  612 views

Kernel Tuning Toolkit  609 views

FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL  608 views

Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark  608 views

Minuet: Accelerating 3D Sparse Convolutions on GPUs  608 views

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey  607 views

Design and Implementation of ShenWei Universal C/C++  607 views

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels  606 views

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach  605 views

Porting OpenACC to OpenMP on heterogeneous systems  605 views

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems  604 views

Experience Migrating OpenCL to SYCL: A Case Study on Searches for Potential Off-Target Sites of Cas9 RNA-Guided Endonucleases on AMD GPUs  603 views

Improving Performance of Hardware Accelerators by Optimizing Data Movement: A Bioinformatics Case Study  602 views

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models  602 views

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview  602 views

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications  599 views

Myths and Legends in High-Performance Computing  598 views

Optimization of Ported CFD Kernels on Intel Data Center GPU Max 1550 using oneAPI ESIMD  597 views

Towards energy efficiency and productivity for decision making in mobile robot navigation  597 views

Efficient Quantized Sparse Matrix Operations on Tensor Cores  597 views

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper  596 views

Brief statistics for this page

Titles: 100

Total views: 63824

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
