high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

[Serbian] The Methods and Procedures for Accelerating Operations and Queries in Large Database Systems and Data Warehouse (Big Data Systems) 10,168 views

Computing Treewidth on the GPU 10,077 views

Mixed Precision Solver Scalable to 16000 MPI Processes for Lattice Quantum Chromodynamics Simulations on the Oakforest-PACS System 10,047 views

OpenCL Actors – Adding Data Parallelism to Actor-based Programming with CAF 9,905 views

Energy efficiency of finite difference algorithms on multicore CPUs, GPUs, and Intel Xeon Phi processors 9,897 views

An Efficient Load Balancing Method for Tree Algorithms 9,769 views

GMM based Fisher vector calculation on GPGPU 9,746 views

Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices 9,662 views

GALARIO: a GPU Accelerated Library for Analysing Radio Interferometer Observations 9,507 views

Breaking DVB-CSA 9,249 views

End-to-end Deep Learning of Optimization Heuristics 9,044 views

Torch7: A Matlab-like Environment for Machine Learning 9,006 views

Experiences Building an MLIR-based SYCL Compiler 8,689 views

On Optimizing Complex Stencils on GPUs 8,508 views

Accelerating Radio Astronomy with Auto-Tuning 8,368 views

4kUHD H264 wireless live video streaming using CUDA 8,295 views

GPU implementation of a deep learning network for image recognition tasks 8,267 views

Monte Carlo methods for massively parallel computers 8,175 views

CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs 8,036 views

PySPH: A Python framework for SPH 8,030 views

Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds 8,018 views

GPU Octrees and Optimized Search 7,978 views

Automated Testing of Graphics Shader Compilers 7,893 views

Distributed wideband software-defined radio receiver for heterogeneous systems 7,825 views

Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures 7,799 views

Fast Parallel Sorting Algorithms on GPUs 7,736 views

Compoundly weighted Voronoi: a sequential and parallel implementation 7,692 views

IBM Deep Learning Service 7,647 views

Meta Networks for Neural Style Transfer 7,642 views

Empower Sequence Labeling with Task-Aware Neural Language Model 7,622 views

Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts 7,603 views

NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems 7,586 views

A code-based analytical approach for using separate device coprocessors in computing systems 7,429 views

An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture 7,311 views

A Common GPU n-Dimensional Array for Python and C 6,981 views

GPU-Accelerated Parallel Finite-Difference Time-Domain Method for Electromagnetic Waves Propagation in Unmagnetized Plasma Media 6,977 views

Quasi-real-time analysis of dynamic near field scattering data using a graphics processing unit 6,957 views

Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams 6,832 views

gSLIC: a real-time implementation of SLIC superpixel segmentation 6,714 views

A Comparative Study of 2D Numerical Methods with GPU Computing 6,704 views

The CUDA Handbook: A Comprehensive Guide to GPU Programming 6,671 views

SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes 6,631 views

Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units 6,558 views

SYCL Code Generation for Multigrid Methods 6,552 views

Interactive Soft Tissue for Surgical Simulation 6,539 views

Performance Comparison of GPU, DSP and FPGA implementations of image processing and computer vision algorithms in embedded systems 6,521 views

An octree-based proxy for collision detection in large-scale particle systems 6,510 views

GPU-PIV 6,402 views

Deep learning for galaxy surface brightness profile fitting 6,350 views

libWater: Heterogeneous Distributed Computing Made Easy 6,322 views

Fast in-place sorting with CUDA based on bitonic sort 6,319 views

Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation 6,318 views

Advanced 2D Rasterization on Modern CPUs 6,272 views

Usage of GPU in LS-DYNA 6,226 views

Accelerating Genomics Research with OpenCL and FPGAs 6,219 views

Sorting with GPUs: A Survey 6,173 views

Optimization of the Brillouin operator on the KNL architecture 6,151 views

DTAM: Dense tracking and mapping in real-time 6,140 views

Accelerating HPC codes on Intel(R) Omni-Path Architecture networks: From particle physics to Machine Learning 6,130 views

Report: Performance comparison between C2075 and P100 GPU cards using cosmological correlation functions 6,114 views

Language Modeling with Gated Convolutional Networks 6,090 views

GPU sample sort 6,085 views

OpenCL Programming Guide 6,085 views

Scalable Streaming Tools for Analyzing N-body Simulations: Finding Halos and Investigating Excursion Sets in One Pass 6,071 views

Build and Travel KD-Tree with CUDA 6,032 views

Industrial Robot Collision Handling in Harsh Environments 6,018 views

Best Practice Guide – GPGPU 6,015 views

BIDMach: Large-scale Learning with Zero Memory Allocation 6,011 views

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism 5,977 views

vCUDA Framework Development for GPU Virtualization 5,909 views

Hydra: a C++11 framework for data analysis in massively parallel platforms 5,880 views

Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures 5,856 views

Efficient Algorithms for Sorting on GPUs 5,823 views

Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II 5,816 views

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort 5,769 views

A novel sorting algorithm for many-core architectures based on adaptive bitonic sort 5,753 views

Launch-time Optimization of OpenCL Kernels 5,745 views

Parallel Medical Image Reconstruction: From Graphics Processors to Grids 5,741 views

Radeon PRO Solid State Graphics (SSG) API User Manual 5,740 views

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs 5,724 views

Acceleration of tensor-product operations for high-order finite element methods 5,699 views

Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms 5,679 views

Simulation of Biological Tissue using Mass-Spring-Damper Models 5,673 views

OpenCL in Action: How to Accelerate Graphics and Computations 5,646 views

Collision Detection Based on Fuzzy Scene Subdivision 5,615 views

Implementing Neural Networks Efficiently 5,587 views

Unified Deep Learning with CPU, GPU, and FPGA Technologies 5,579 views

A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models 5,572 views

A Dynamic Hash Table for the GPU 5,571 views

Brief statistics for this page

Titles: 100

Total views: 784,758

