Views of posts on hgpu.org
High-Performance Interactive Scientific Visualization With Datoviz via the Vulkan Low-Level GPU API 861 views
A Study on Neural-based Code Summarization in Low-resource Settings 859 views
LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs 858 views
Persistent Kernels for Iterative Memory-bound GPU Applications 857 views
Analysis of High Level implementations for Recursive Methods on GPUs 857 views
Design and Implementation of CNN-FPGA accelerator based on Open Computing Language 855 views
perf4sight: A toolflow to model CNN training performance on Edge GPUs 853 views
SYCL in the Edge: Performance Evaluation for Heterogeneous Acceleration 853 views
neoSYCL: a SYCL implementation for SX-Aurora TSUBASA 852 views
Principles towards Real-Time Simulation of Material Point Method on Modern GPUs 848 views
Performance portability analysis of SYCL with a classical CG on CPU, GPU, and FPGA 847 views
Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors 847 views
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours 847 views
GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations 847 views
SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs 844 views
Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications 843 views
HALF: Holistic Auto Machine Learning for FPGAs 843 views
Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism 843 views
Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs 841 views
DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence 840 views
Building a Performance Model for Deep Learning Recommendation Model Training on GPUs 838 views
Comparison of different n-body algorithms on various hardware platforms using SYCL 837 views
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark 835 views
torchode: A Parallel ODE Solver for PyTorch 835 views
Evaluation of Intel’s DPC++ Compatibility Tool in heterogeneous computing 833 views
Lossless Acceleration for Seq2seq Generation with Aggressive Decoding 832 views
COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs 831 views
MapReduce for Counting Word Frequencies with MPI and GPUs 831 views
Query Processing on Tensor Computation Runtimes 830 views
A Review of the Parallelization Strategies for Iterative Algorithms 830 views
Integrating SkePU’s algorithmic skeletons with GPI on a cluster 825 views
Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library 824 views
Accelerating LBM on a Tightly-Coupled Field Programmable Gate Array 820 views
Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow 818 views
SnuHPL: high performance LINPACK for heterogeneous GPUs 818 views
Homomorphic-Encrypted Volume Rendering 818 views
Performance analysis of matrix-free conjugate gradient kernels using SYCL 815 views
Experience of Migrating a Parallel Graph Coloring Program from CUDA to SYCL 815 views
KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications 809 views
Flashlight: Enabling Innovation in Tools for Machine Learning 809 views
Impacts of Parallel Programming on Limited-Resource Hardware 809 views
Accelerating JPEG Decompression on GPUs 808 views
GPU-accelerated Faster Mean Shift with euclidean distance metrics 807 views
Application Performance Profiling on Intel GPUs with Oneprof and Onetrace 806 views
GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration 805 views
BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU 804 views
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training 804 views
NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge 803 views
tntorch: Tensor Network Learning with PyTorch 802 views
GPU Algorithms for Efficient Exascale Discretizations 800 views
GPU Ray Tracing with Monte Carlo Methods 800 views
Reusing Auto-Schedules for Efficient DNN Compilation 798 views
Deep Learning Workload Scheduling in GPU Datacenters: A Survey 795 views
Improving Performance and Energy Efficiency of GPUs through Locality Analysis 791 views
Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml 790 views
OpenMP Offloading in the Jetson Nano Platform 790 views
Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations 789 views
HLS Portability from Intel to Xilinx: A Case Study 789 views
Decompiling x86 Deep Neural Network Executables 788 views
Lina: a fast design optimisation tool for software-based FPGA programming 787 views
A Variant RSA Acceleration with Parallelization 786 views
High performance computing on Android devices – a case study 783 views
Pattern-based Programming Abstractions for Heterogeneous Parallel Computing 783 views
GCN Inference Acceleration using High-Level Synthesis 782 views
DeepAxe: A Framework for Exploration of Approximation and Reliability Trade-offs in DNN Accelerators 781 views
SYCL in the edge: performance and energy evaluation for heterogeneous acceleration 781 views
A Survey on Hardware Accelerators for Large Language Models 779 views
QGTC: Accelerating Quantized GNN via GPU Tensor Core 779 views
Understanding the Power of Evolutionary Computation for GPU Code Optimization 778 views
WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU 777 views
Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use 775 views
A readahead prefetcher for GPU file system layer 774 views
Intel oneAPI DPC++ FPGA Optimization Guide 773 views
A Systematic Literature Survey of Sparse Matrix-Vector Multiplication 772 views
Modeling GPU Dynamic Parallelism for Self Similar Density Workloads 771 views
Compiler-centric across-stack deep learning acceleration 769 views
Enabling Quantum Computer Simulations on AMD GPUs: a HIP Backend for Google’s qsim 768 views
Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing 763 views
Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs 763 views
Gaiwan: a Size-Polymorphic Typesystem for GPU Programs 761 views
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters 760 views
Retargeting and Respecializing GPU Workloads for Performance Portability 760 views
Python-Based Quantum Chemistry Calculations with GPU Acceleration 760 views
PM4Py-GPU: a High-Performance General-Purpose Library for Process Mining 759 views
Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks 757 views
DISTAL: The Distributed Tensor Algebra Compiler 757 views
On the Compilation Performance of Current SYCL Implementations 756 views
Explicit caching HYB: a new high-performance SpMV framework on GPGPU 756 views
Optimization of Compiler-generated OpenCL CNN Kernels and Runtime for FPGAs 754 views
A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud 754 views
Securing GPU via Region-based Bounds Checking 753 views
RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs 753 views
Titles: 100
Total views: 80,601
- Programming - 186,132 views
- Login - 164,480 views
- User dashboard - 90,944 views
- Paper titles list - 70,450 views
- Add new event - 64,635 views
- Add new post - 59,450 views
- Register - 49,283 views
- Statistics - 36,813 views
- Modification of self-organizing migration algorithm for OpenCL framework - 34,168 views
- Books on OpenCL and CUDA - 28,862 views