Views of posts on hgpu.org
Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation 780 views
Forecasting time series with constraints 778 views
A Comprehensive Deep Learning Library Benchmark and Optimal Library Selection 778 views
CPPJoules: An Energy Measurement Tool for C++ 777 views
FortranX: Harnessing Code Generation, Portability, and Heterogeneity in Fortran 776 views
Communication-minimizing Asynchronous Tensor Parallelism 774 views
Exploring data flow design and vectorization with oneAPI for streaming applications on CPU+GPU 773 views
GPU Performance Portability needs Autotuning 764 views
Bridging Control-Centric and Data-Centric Optimization 762 views
TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework 762 views
Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems 761 views
GPU Auto-tuning Framework for Optimal Performance and Power Consumption 759 views
Anatomizing Deep Learning Inference in Web Browsers 758 views
Deep Learning Model Security: Threats and Defenses 756 views
Fastrack: Fast IO for Secure ML using GPU TEEs 755 views
CUDA-LLM: LLMs Can Write Efficient CUDA Kernels 748 views
SUperman: Efficient Permanent Computation on GPUs 747 views
Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs 745 views
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks 745 views
GPUVM: GPU-driven Unified Virtual Memory 742 views
Can Tensor Cores Benefit Memory-Bound Kernels? (No!) 739 views
Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR 736 views
Acceleration for the many, not the few 730 views
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs 727 views
Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL 724 views
Modernization and Optimization of MPI Codes 723 views
Accelerating Sparse Graph Neural Networks with Tensor Core Optimization 722 views
A Survey of General-purpose Polyhedral Compilers 715 views
No More Shading Languages: Compiling C++ to Vulkan Shaders 713 views
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms 711 views
Exploration of Cryptocurrency Mining-Specific GPUs in AI Applications: A Case Study of CMP 170HX 705 views
CPU-GPU co-execution through the exploitation of hybrid technologies via SYCL 701 views
Kokkidio: Fast, expressive, portable code, based on Kokkos and Eigen 699 views
Mìmir: A real-time interactive visualization library for CUDA programs 699 views
A User’s Guide to KSig: GPU-Accelerated Computation of the Signature Kernel 691 views
Scheduling Languages: A Past, Present, and Future Taxonomy 690 views
MambaCPU: Enhanced Correlation Mining with State Space Models for CPU Performance Prediction 687 views
Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach 686 views
Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks 682 views
Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data 680 views
Efficient allocation of image recognition and LLM tasks on multi-GPU system 680 views
Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems 675 views
DNA sequence alignment: An assignment for OpenMP, MPI, and CUDA/OpenCL 667 views
Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs 667 views
Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams 658 views
Mutual-Supervised Learning for Sequential-to-Parallel Code Translation 654 views
Predicting GPUDirect Benefits for HPC Workloads 653 views
Adaptive Optimization Techniques for High-Performance Computing 649 views
Optimal Workload Placement on Multi-Instance GPUs 642 views
On the Partitioning of GPU Power among Multi-Instances 641 views
Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs 640 views
Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication 634 views
Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU 633 views
WiLLM: An Open Wireless LLM Communication System 630 views
Leveraging the potential of task-based programming with OpenMP task graphs 630 views
Performance Portable Gradient Computations Using Source Transformation 626 views
Debunking the CUDA Myth Towards GPU-based AI Systems 598 views
Guardian: Safe GPU Sharing in Multi-Tenant Environments 594 views
FLASH: Fast All-to-All Communication in GPU Clusters 594 views
Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics 594 views
Specx: a C++ task-based runtime system for heterogeneous distributed architectures 582 views
Optimizing the optimizer: increasing performance efficiency of modern compilers 579 views
Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization 577 views
Data Parallel Visualization and Rendering on the RAMSES Supercomputer with ANARI 571 views
Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability 567 views
Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure? 553 views
RTCUDB: Building Databases with RT Processors 541 views
Development of a new framework for high performance volunteer computing 535 views
Keras Sig: Efficient Path Signature Computation on GPU in Keras 3 531 views
Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs 517 views
Scaling On-Device GPU Inference for Large Generative Models 514 views
Ilargi: a GPU Compatible Factorized ML Model Training Framework 511 views
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters 505 views
The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations 505 views
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks 503 views
KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling 489 views
Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler 479 views
ConTraPh: Contrastive Learning for Parallelization and Performance Optimization 477 views
MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning 461 views
Validation of GPU Computation in Decentralized, Trustless Networks 445 views
Understanding the Landscape of Ampere GPU Memory Errors 418 views
Pre-Training LLMs on a budget: A comparison of three optimizers 415 views
Kevin: Multi-Turn RL for Generating CUDA Kernels 411 views
Survey of HPC in US Research Institutions 398 views
Enabling Profile Guided Optimizations (PGO) for Graphics 396 views
A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs 381 views
GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency 378 views
GPU Acceleration of SQL Analytics on Compressed Data 369 views
NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers 341 views
A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline 335 views
Titles: 100
Total views: 62725
- Programming - 186,234 views
- Login - 172,438 views
- User dashboard - 98,712 views
- Paper titles list - 93,300 views
- Add new event - 69,269 views
- Add new post - 62,914 views
- Register - 53,201 views
- Statistics - 44,357 views
- Modification of self-organizing migration algorithm for OpenCL framework - 34,525 views
- Books on OpenCL and CUDA - 31,241 views