Views of posts on hgpu.org

High-Performance Interactive Scientific Visualization With Datoviz via the Vulkan Low-Level GPU API  861 views

A Study on Neural-based Code Summarization in Low-resource Settings  859 views

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs  858 views

Persistent Kernels for Iterative Memory-bound GPU Applications  857 views

Analysis of High Level implementations for Recursive Methods on GPUs  857 views

Design and Implementation of CNN-FPGA accelerator based on Open Computing Language  855 views

cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs  854 views

perf4sight: A toolflow to model CNN training performance on Edge GPUs  853 views

SYCL in the Edge: Performance Evaluation for Heterogeneous Acceleration  853 views

neoSYCL: a SYCL implementation for SX-Aurora TSUBASA  852 views

Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability  851 views

Principles towards Real-Time Simulation of Material Point Method on Modern GPUs  848 views

Performance portability analysis of SYCL with a classical CG on CPU, GPU, and FPGA  847 views

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors  847 views

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours  847 views

GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations  847 views

SYCL-Bench 2020: Benchmarking SYCL 2020 on AMD, Intel, and NVIDIA GPUs  844 views

Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications  843 views

HALF: Holistic Auto Machine Learning for FPGAs  843 views

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels  843 views

Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism  843 views

Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs  841 views

DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence  840 views

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs  838 views

Comparison of different n-body algorithms on various hardware platforms using SYCL  837 views

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark  835 views

torchode: A Parallel ODE Solver for PyTorch  835 views

Evaluation of Intel’s DPC++ Compatibility Tool in heterogeneous computing  833 views

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding  832 views

Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes  832 views

COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs  831 views

MapReduce for Counting Word Frequencies with MPI and GPUs  831 views

Query Processing on Tensor Computation Runtimes  830 views

A Review of the Parallelization Strategies for Iterative Algorithms  830 views

Integrating SkePU’s algorithmic skeletons with GPI on a cluster  825 views

Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library  824 views

Accelerating LBM on a Tightly-Coupled Field Programmable Gate Array  820 views

Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow  818 views

SnuHPL: high performance LINPACK for heterogeneous GPUs  818 views

Homomorphic-Encrypted Volume Rendering  818 views

Performance analysis of matrix-free conjugate gradient kernels using SYCL  815 views

Experience of Migrating a Parallel Graph Coloring Program from CUDA to SYCL  815 views

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments  815 views

KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications  809 views

Flashlight: Enabling Innovation in Tools for Machine Learning  809 views

Impacts of Parallel Programming on Limited-Resource Hardware  809 views

Accelerating JPEG Decompression on GPUs  808 views

GPU-accelerated Faster Mean Shift with euclidean distance metrics  807 views

Application Performance Profiling on Intel GPUs with Oneprof and Onetrace  806 views

GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration  805 views

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU  804 views

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training  804 views

NEPTUNE: Network- and GPU-aware Management of Serverless Functions at the Edge  803 views

tntorch: Tensor Network Learning with PyTorch  802 views

GPU Algorithms for Efficient Exascale Discretizations  800 views

Performance Comparison of Different OpenCL Implementations of LBM Simulation on Commodity Computer Hardware  800 views

GPU Ray Tracing with Monte Carlo Methods  800 views

Reusing Auto-Schedules for Efficient DNN Compilation  798 views

Deep Learning Workload Scheduling in GPU Datacenters: A Survey  795 views

Seamless GPU acceleration for C++ based physics with the Metal Shading Language on Apple’s M series unified chips  792 views

Improving Performance and Energy Efficiency of GPUs through Locality Analysis  791 views

Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml  790 views

OpenMP Offloading in the Jetson Nano Platform  790 views

Comparative Performance and Scalability Analysis of GPU-accelerated Database Operations  789 views

HLS Portability from Intel to Xilinx: A Case Study  789 views

Decompiling x86 Deep Neural Network Executables  788 views

Lina: a fast design optimisation tool for software-based FPGA programming  787 views

A Variant RSA Acceleration with Parallelization  786 views

High performance computing on Android devices – a case study  783 views

Pattern-based Programming Abstractions for Heterogeneous Parallel Computing  783 views

GCN Inference Acceleration using High-Level Synthesis  782 views

DeepAxe: A Framework for Exploration of Approximation and Reliability Trade-offs in DNN Accelerators  781 views

SYCL in the edge: performance and energy evaluation for heterogeneous acceleration  781 views

A Survey on Hardware Accelerators for Large Language Models  779 views

QGTC: Accelerating Quantized GNN via GPU Tensor Core  779 views

Understanding the Power of Evolutionary Computation for GPU Code Optimization  778 views

WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU  777 views

Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use  775 views

A readahead prefetcher for GPU file system layer  774 views

Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA  774 views

Intel oneAPI DPC++ FPGA Optimization Guide  773 views

A Systematic Literature Survey of Sparse Matrix-Vector Multiplication  772 views

Modeling GPU Dynamic Parallelism for Self Similar Density Workloads  771 views

Compiler-centric across-stack deep learning acceleration  769 views

Enabling Quantum Computer Simulations on AMD GPUs: a HIP Backend for Google’s qsim  768 views

Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing  763 views

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs  763 views

Gaiwan: a Size-Polymorphic Typesystem for GPU Programs  761 views

A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters  760 views

Retargeting and Respecializing GPU Workloads for Performance Portability  760 views

Python-Based Quantum Chemistry Calculations with GPU Acceleration  760 views

PM4Py-GPU: a High-Performance General-Purpose Library for Process Mining  759 views

Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks  757 views

DISTAL: The Distributed Tensor Algebra Compiler  757 views

On the Compilation Performance of Current SYCL Implementations  756 views

Explicit caching HYB: a new high-performance SpMV framework on GPGPU  756 views

Optimization of Compiler-generated OpenCL CNN Kernels and Runtime for FPGAs  754 views

A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud  754 views

Securing GPU via Region-based Bounds Checking  753 views

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs  753 views

 

Brief statistics for this page

Titles: 100

Total views: 80601

 

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
