Views of posts on hgpu.org
Benchmarking optimization algorithms for auto-tuning GPU kernels 752 views
Domain-Specific On-Device Object Detection Method 752 views
Machine Learning for CUDA+MPI Design Rules 751 views
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training 750 views
Training a Vision Transformer from scratch in less than 24 hours with 1 GPU 749 views
Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml 747 views
Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure 747 views
FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators 747 views
BenchPress: A Deep Active Benchmark Generator 746 views
Fully Concurrent GPU Data Structures 745 views
cuPSO: GPU Parallelization for Particle Swarm Optimization Algorithms 743 views
Benchmarking and Dissecting the Nvidia Hopper GPU Architecture 742 views
Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure 741 views
Performance portability evaluation of blocked stencil computations on GPUs 741 views
Deep Learning Models on CPUs: A Methodology for Efficient Training 739 views
Source-to-Source Automatic Differentiation of OpenMP Parallel Loops 739 views
Parallel programming in mobile devices with FancyJCL 738 views
Performance study on GPU offloading techniques using the Gauss matrix inverse algorithm 738 views
FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis 738 views
Studying the Potential of Automatic Optimizations in the Intel FPGA SDK for OpenCL 737 views
Monitoring Collective Communication Among GPUs 736 views
Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning 736 views
Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs 733 views
Fast Arbitrary Precision Floating Point on FPGA 732 views
A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code 732 views
Principles for Automated and Reproducible Benchmarking 731 views
Collage: Automated Integration of Deep Learning Backends 731 views
A Ray Tracing Implementation Performance Comparison between the CPU and the GPU 731 views
A Variant of Concurrent Constraint Programming on GPU 731 views
Parallel and Heterogeneous Timing Analysis: Partition, Algorithm, and System 730 views
OpenCL-HPX Integration 730 views
Fast convolution kernels on pascal GPU with high memory efficiency 730 views
Towards making the most of NLP-based device mapping optimization for OpenCL kernels 729 views
Autotuning CUDA: Applying NLP Techniques to LS-CAT 728 views
Fast GPU bounding boxes on tree-structured scenes 728 views
ALPINIST: An Annotation-Aware GPU Program Optimizer 728 views
N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks 728 views
Concurrency Mapping to FPGAs with OpenCL: A Case Study with a Shallow Water Kernel 726 views
Capturing the Memory Topology of GPUs 725 views
Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis 724 views
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism 724 views
GGArray: A Dynamically Growable GPU Array 720 views
Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads 720 views
EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models 719 views
The Application of AI Technology in GPU Scheduling Algorithm Optimization 718 views
MGARD: A multigrid framework for high-performance, error-controlled data compression and refactoring 718 views
IgNet. A Super-precise Convolutional Neural Network 718 views
Testing and Mutation Testing for GPU Kernels 715 views
Compiler Technologies in Deep Learning Co-Design: A Survey 715 views
Demystifying Dependency Bugs in Deep Learning Stack 711 views
OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials 710 views
Simulation Methodologies for Mobile GPUs 708 views
Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations 708 views
Providing performance portable numerics for Intel GPUs 706 views
An Open-source FPGA Library for Data Sorting 706 views
Deductive verification for SYCL 704 views
Extending MAGMA Portability with OneAPI 703 views
Three Contributions to the Theory and Practice of Optimizing Compilers 701 views
Multi-GPU thermal lattice Boltzmann simulations using OpenACC and MPI 700 views
SciAI4Industry – Solving PDEs for industry-scale problems with deep learning 698 views
CPU-GPU Layer-Switched Low Latency CNN Inference 698 views
Optimizing Deep Learning Models For Raspberry Pi 697 views
OpenMP Advisor 696 views
Safe and Practical GPU Acceleration in TrustZone 696 views
Accelerating bioinformatics applications on CUDA-enabled multi-GPU systems 696 views
Evaluation of Rust for GPGPU high-performance computing 696 views
GT4Py: High Performance Stencils for Weather and Climate Applications using Python 696 views
Using AI libraries for Incompressible Computational Fluid Dynamics 694 views
Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading 693 views
Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era 693 views
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs 693 views
Julia as a unifying end-to-end workflow language on the Frontier exascale system 693 views
OpenMP offload at the Exascale using Intel GPU Max 1550: evaluation of STREAmS compressible solver 693 views
High Performance Privacy Preserving AI 692 views
Performance Evaluation of Heterogeneous GPU Programming Frameworks for Hemodynamic Simulations 691 views
Manas: Mining Software Repositories to Assist AutoML 691 views
A tool set for random number generation on GPUs in R 691 views
Challenges and Opportunities in C/C++ Source-To-Source Compilation 690 views
Strega: An HTTP Server for FPGAs 689 views
Evaluating the Wide Area Classroom After 24,000 HPC Students 687 views
Cramming: Training a Language Model on a Single GPU in One Day 687 views
An OpenCL-Based FPGA Accelerator for Faster R-CNN 685 views
Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU 685 views
Real-Time High-Performance Computing for Embedded Control Systems 685 views
Evaluation of Pseudo-Random Number Generation on GPU Cards 685 views
QArray: a GPU-accelerated constant capacitance model simulator for large quantum dot arrays 685 views
Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks 684 views
The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science 682 views
Challenges and Techniques for Transparent Acceleration of Unmodified Big Data Applications 681 views
Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance 680 views
AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference 680 views
Behavioral graph fraud detection in E-commerce 679 views
Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment 679 views
Titles: 100
Total views: 71,512
- Programming - 186,132 views
- Login - 164,479 views
- User dashboard - 90,942 views
- Paper titles list - 70,445 views
- Add new event - 64,634 views
- Add new post - 59,447 views
- Register - 49,282 views
- Statistics - 36,810 views
- Modification of self-organizing migration algorithm for OpenCL framework - 34,168 views
- Books on OpenCL and CUDA - 28,861 views