Views of posts on hgpu.org
Generating GPU Compiler Heuristics using Reinforcement Learning 1,028 views
Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations 1,025 views
FLOWER: A Comprehensive Dataflow Compiler for High-Level Synthesis 1,025 views
Lightning: Scaling the GPU Programming Model Beyond a Single GPU 1,024 views
Optimization of Heterogeneous Parallel Computing Systems using Machine Learning 1,023 views
From English To Foreign Languages: Transferring Pre-trained Language Models 1,022 views
Improving Performance and Energy Efficiency of Heterogeneous Systems with rCUDA 1,019 views
Improving the Performance, Portability, and Productivity of Hardware Accelerators 1,013 views
Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library 1,008 views
Simulating flows of incompressible and weakly compressible fluids on multicore hybrid computer systems 1,008 views
StreamBlocks: A compiler for heterogeneous dataflow computing 1,007 views
Mixed precision in Graphics Processing Unit 1,003 views
Apple Silicon Performance in Scientific Computing 1,002 views
A Compiler Framework for Optimizing Dynamic Parallelism on GPUs 1,001 views
Thermal Safety and Real-Time Predictability on Heterogeneous Embedded SoC Platforms 999 views
ScaleHLS: Scalable High-Level Synthesis through MLIR 998 views
DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors 997 views
Novel Computing Architectures 996 views
The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product 994 views
LS-CAT: A Large-Scale CUDA AutoTuning Dataset 992 views
CUDA implementation of Wagener’s 2D convex hull PRAM algorithm 991 views
Improving performance for emergent environments parameter tuning and simulation in games using GPU 988 views
Optimal program variant generation for hybrid manycore systems 987 views
GPU-based JSON data processing using structural indexes 983 views
Character-level Transformer-based Neural Machine Translation 982 views
Advanced Joins on GPUs 978 views
GPTPU: Accelerating Applications using Edge Tensor Processing Units 978 views
Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems 976 views
INSTA-YOLO: Real-Time Instance Segmentation 974 views
It’s all about data movement: Optimising FPGA data access to boost performance 973 views
Productivity, Portability, Performance: Data-Centric Python 972 views
TorchBench: Benchmarking PyTorch with High API Surface Coverage 970 views
A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks 962 views
Integrating Accelerators in Heterogeneous Systems 961 views
Enabling Energy-Efficient DNN Training on Hybrid GPU-FPGA Accelerators 959 views
Performance prediction of deep learning applications training in GPU as a service systems 956 views
Using Intel oneAPI for Multi-hybrid Acceleration Programming with GPU and FPGA Coupling 956 views
Block Conjugate Gradient Solver in OpenCL 954 views
Parallel Approaches for SWAMP Sequence Alignment 951 views
General purpose lattice QCD code set Bridge++ 2.0 for high performance computing 949 views
Implementation of Parallel Simplified Swarm Optimization in CUDA 945 views
OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems 944 views
Open SYCL on heterogeneous GPU systems: A case of study 943 views
Artificial Intelligence in Electric Machine Drives: Advances and Trends 942 views
Measurement and Analysis of GPU-accelerated Applications with HPCToolkit 940 views
Onesweep: A Faster Least Significant Digit Radix Sort for GPUs 940 views
One-shot tuner for deep learning compilers 938 views
On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors 936 views
Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA 933 views
Heuristic Adaptability to Input Dynamics for SpMM on GPUs 928 views
APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores 925 views
Joint Forces: From Multithreaded Programming to GPU Computing 920 views
Dynamic Adaptation Techniques and Opportunities to Improve HPC Runtimes 920 views
NNP/MM: Fast molecular dynamics simulations with machine learning potentials and molecular mechanics 919 views
Better GPU Hash Tables 918 views
Fancier: A Unified Framework for Java, C, and OpenCL Integration 917 views
A method for decompilation of AMD GCN kernels to OpenCL 916 views
94% on CIFAR-10 in 3.29 Seconds on a Single GPU 913 views
Extending SYCL’s Programming Paradigm with Tensor-based SIMD Abstractions 908 views
EXA2PRO: A Framework for High Development Productivity on Heterogeneous Computing Systems 905 views
Exploring the acceleration of Nekbone on reconfigurable architectures 905 views
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale 904 views
Managing Extreme Heterogeneity in Next Generation HPC Systems 902 views
BAT: A Benchmark suite for AutoTuners 899 views
A ML-based resource utilization OpenCL GPU-kernel fusion model 895 views
NetKet 3: Machine Learning Toolbox for Many-Body Quantum Systems 894 views
An Auto-Programming Approach to Vulkan 891 views
CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research 891 views
Optimization of GPU workloads using natural language processing based on deep learning techniques 890 views
Efficient heterogeneous matrix profile on a CPU + High Performance FPGA with integrated HBM 887 views
AnySeq/GPU: A Novel Approach for Faster Sequence Alignment on GPUs 882 views
A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning 882 views
User’s needs influencing HPC technologies 877 views
Research and Development of Porting SYCL on QNX Operating System for High Parallelism 876 views
Data-Oriented Language Implementation of Lattice-Boltzmann Method for Dense and Sparse Geometries 875 views
Predictive Data Race Detection for GPUs 873 views
Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment 873 views
Concurrent CPU-GPU Task Programming using Modern C++ 869 views
PROGRAML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations 865 views
Programming Heterogeneous Systems with General and Domain-Specific Frameworks 865 views
Towards a Benchmarking Suite for Kernel Tuners 862 views
Enabling On-Device Smartphone GPU based Training: Lessons Learned 861 views
Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures 859 views
TCUDB: Accelerating Database with Tensor Processors 858 views
Data transfer optimizations for heterogeneous managed runtime systems 858 views
Performance assessment of CUDA and OpenACC in large scale combustion simulations 855 views
LeXInt: GPU-accelerated Exponential Integrators package 854 views
Titles: 100
Total views: 93,768
- Programming - 186,126 views
- Login - 164,273 views
- User dashboard - 90,364 views
- Paper titles list - 69,694 views
- Add new event - 64,535 views
- Add new post - 59,107 views
- Register - 49,130 views
- Statistics - 36,253 views
- Modification of self-organizing migration algorithm for OpenCL framework - 34,160 views
- Books on OpenCL and CUDA - 28,757 views