Views of posts on hgpu.org

Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation  780 views

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation  779 views

Forecasting time series with constraints  778 views

A Comprehensive Deep Learning Library Benchmark and Optimal Library Selection  778 views

CPPJoules: An Energy Measurement Tool for C++  777 views

FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran  776 views

Communication-minimizing Asynchronous Tensor Parallelism  774 views

Exploring data flow design and vectorization with oneAPI for streaming applications on CPU+GPU  773 views

GPU Performance Portability needs Autotuning  764 views

Bridging Control-Centric and Data-Centric Optimization  762 views

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework  762 views

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems  761 views

GPU Auto-tuning Framework for Optimal Performance and Power Consumption  759 views

Anatomizing Deep Learning Inference in Web Browsers  758 views

Deep Learning Model Security: Threats and Defenses  756 views

Fastrack: Fast IO for Secure ML using GPU TEEs  755 views

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels  748 views

SUperman: Efficient Permanent Computation on GPUs  747 views

Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs  745 views

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks  745 views

GPUVM: GPU-driven Unified Virtual Memory  742 views

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)  739 views

Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR  736 views

Acceleration for the many, not the few  730 views

LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs  727 views

LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations  725 views

Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL  724 views

Modernization and Optimization of MPI Codes  723 views

Accelerating Sparse Graph Neural Networks with Tensor Core Optimization  722 views

Efficient Configuration of Heterogeneous Resources and Task Scheduling Strategies in Deep Learning Auto-Tuning Systems  718 views

A Survey of General-purpose Polyhedral Compilers  715 views

No More Shading Languages: Compiling C++ to Vulkan Shaders  713 views

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms  711 views

Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code Translation  706 views

Exploration of Cryptocurrency Mining-Specific GPUs in AI Applications: A Case Study of CMP 170HX  705 views

CPU-GPU co-execution through the exploitation of hybrid technologies via SYCL  701 views

Kokkidio: Fast, expressive, portable code, based on Kokkos and Eigen  699 views

Mìmir: A real-time interactive visualization library for CUDA programs  699 views

A User’s Guide to KSig: GPU-Accelerated Computation of the Signature Kernel  691 views

Scheduling Languages: A Past, Present, and Future Taxonomy  690 views

MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction  687 views

Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach  686 views

Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks  682 views

Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data  680 views

Efficient allocation of image recognition and LLM tasks on multi-GPU system  680 views

Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems  675 views

DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL  667 views

Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs  667 views

Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams  658 views

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation  654 views

Predicting GPUDirect Benefits for HPC Workloads  653 views

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations  650 views

Adaptive Optimization Techniques for High-Performance Computing  649 views

Optimal Workload Placement on Multi-Instance GPUs  642 views

On the Partitioning of GPU Power among Multi-Instances  641 views

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs  640 views

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication  634 views

Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU  633 views

WiLLM: An Open Wireless LLM Communication System  630 views

Leveraging the potential of task-based programming with OpenMP task graphs  630 views

Enhancing Code Portability, Problem Scale, and Storage Efficiency in Exascale Applications  628 views

Performance Portable Gradient Computations Using Source Transformation  626 views

Debunking the CUDA Myth Towards GPU-based AI Systems  598 views

Guardian: Safe GPU Sharing in Multi-Tenant Environments  594 views

FLASH: Fast All-to-All Communication in GPU Clusters  594 views

Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics  594 views

Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement  589 views

Specx: a C++ task-based runtime system for heterogeneous distributed architectures  582 views

Optimizing the optimizer: increasing performance efficiency of modern compilers  579 views

Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization  577 views

Data Parallel Visualization and Rendering on the RAMSES Supercomputer with ANARI  571 views

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs  567 views

Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability  567 views

Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure?  553 views

RTCUDB: Building Databases with RT Processors  541 views

Development of a new framework for high performance volunteer computing  535 views

Keras Sig: Efficient Path Signature Computation on GPU in Keras 3  531 views

Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs  517 views

Scaling On-Device GPU Inference for Large Generative Models  514 views

Ilargi: a GPU Compatible Factorized ML Model Training Framework  511 views

LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters  505 views

The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations  505 views

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks  503 views

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling  489 views

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration  484 views

Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler  479 views

ConTraPh: Contrastive Learning for Parallelization and Performance Optimization  477 views

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning  461 views

Validation of GPU Computation in Decentralized, Trustless Networks  445 views

GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning  436 views

Understanding the Landscape of Ampere GPU Memory Errors  418 views

Pre-Training LLMs on a budget: A comparison of three optimizers  415 views

Kevin: Multi-Turn RL for Generating CUDA Kernels  411 views

Survey of HPC in US Research Institutions  398 views

Enabling Profile Guided Optimizations (PGO) for Graphics  396 views

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs  381 views

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency  378 views

GPU Acceleration of SQL Analytics on Compressed Data  369 views

NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers  341 views

A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline  335 views

 

Brief statistics for this page

Titles: 100

Total views: 62725

 


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org