high performance computing on graphics processing units: hgpu.org

Views of posts on hgpu.org

Automatically Harnessing Sparse Acceleration 1,116 views

A Newcomer In The PGAS World – UPC++ vs UPC: A Comparative Study 1,115 views

Transparent Checkpointing for OpenGL Applications on GPUs 1,115 views

GPU-Based Hierarchical Computations for View Independent Visibility 1,114 views

CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems 1,112 views

Supporting CUDA for an extended RISC-V GPU architecture 1,112 views

Exploiting SPMD Horizontal Locality to Improve Memory Efficiency 1,112 views

Scalable and Parallel Implementation of a Financial Application on a GPU: With Focus on Out-of-Core Case 1,109 views

An Investigation of Atomic Synchronization for Sort-Based Group-By Aggregation on GPUs 1,109 views

Efficient Video Compression via Content-Adaptive Super-Resolution 1,108 views

Scalable instruction set simulator for thousand-core architectures running on GPGPUs 1,106 views

Instruments of Productivity for High Performance Computing 1,106 views

Implementing a GPU-Enhanced Cluster for Large-Scale Simulations 1,105 views

Challenging cloning related problems with GPU-based algorithms 1,103 views

Exploiting GPU On-chip Shared Memory for Accelerating Schedulability Analysis 1,102 views

NVIDIA CUDA software and gpu parallel computing architecture 1,101 views

Streaming architectures and technology trends 1,101 views

Migrating real-time depth image-based rendering from traditional to next-gen GPGPU 1,101 views

Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach 1,101 views

LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure 1,100 views

Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs 1,100 views

Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs 1,099 views

TopicBERT for Energy Efficient Document Classification 1,097 views

A Survey of Machine Learning for Computer Architecture and Systems 1,097 views

Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors 1,094 views

LoopBench: An Evaluation of Loop Acceleration in Heterogeneous Systems 1,094 views

Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach 1,094 views

Parallel Arbitrary-precision Integer Arithmetic 1,092 views

Dynamic GPU Energy Optimization for Machine Learning Training Workloads 1,091 views

Performance study of mapping irregular computations on GPUs 1,089 views

Performance Optimisations for Heterogeneous Managed Runtime Systems 1,088 views

End-to-end Optimization of Machine Learning Prediction Queries 1,088 views

Multicore performance optimization using partner cores 1,086 views

Compiler-Based Tools to Aid in Data Transfer Optimization and On-Chip Debug of Heterogeneous Compute Systems 1,086 views

Accurate Energy and Performance Prediction for Frequency-Scaled GPU Kernels 1,086 views

Optimisation and GPU code generation of Stencils for Futhark 1,086 views

Performance Analysis of a High-level Abstractions-based Hydrocode on Future Computing Systems 1,085 views

Performance Evaluation of Optimized Implementations of Finite Difference Method for Wave Propagation Problems on GPU Architecture 1,085 views

Modeling Parallel Programs using Large Language Models 1,084 views

Exploring Applications in CUDA 1,084 views

Optimization of tele-immersion codes 1,083 views

Level-of-Detail Triangle Strips for Deforming Meshes 1,082 views

GPGPU Task Scheduling Technique for Reducing the Performance Deviation of Multiple GPGPU Tasks in RPC-Based GPU Virtualization Environments 1,082 views

Accelerating Concurrent Heap on GPUs 1,080 views

Visualization of level-of-detail meshes on the GPU 1,080 views

Study for measurement method for coal volume on base of GPU 1,079 views

Heuristic Optimization Methods for Improving Performance of Recursive General Purpose Applications on GPUs 1,079 views

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark 1,079 views

Taking the graphics processor beyond graphics 1,078 views

Direct Self-Consistent Field Computations on GPU Clusters 1,077 views

KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks 1,076 views

Using a GPU to accelerate die and mold fabrication 1,076 views

Implicit Feature-Based Alignment System for Radiotherapy 1,076 views

Employ Bump Mapping to Enrich the 3D NPR Image 1,076 views

Simulation Studies of Viral Advertisement Diffusion on Multi-GPU 1,074 views

FSpGEMM: An OpenCL-based HPC Framework for Accelerating General Sparse Matrix-Matrix Multiplication on FPGAs 1,073 views

Real-time Geometric Calibration on graphics processing unit with CUDA 1,072 views

CuPBoP-AMD: Extending CUDA to AMD Platforms 1,072 views

A load balance multi-scheduling model for OpenCL kernel tasks in an integrated cluster 1,071 views

Performance analysis and optimization of highly diverging algorithms on GPUs 1,071 views

Embedded Software Synthesis using Heterogeneous Dataflow Models 1,071 views

Fast Turnaround HLS Debugging using Dependency Analysis and Debug Overlays 1,068 views

TorchAudio: Building Blocks for Audio and Speech Processing 1,068 views

Where is the data? Why you cannot debate CPU vs. GPU performance without the answer 1,067 views

iGUARD: In-GPU Advanced Race Detection 1,066 views

Simulation Modelling and Visualisation: Toolkits for Building Artificial Worlds 1,066 views

AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries 1,065 views

Model-based optimization of MPDATA on Intel Xeon Phi through load imbalancing 1,065 views

Analysis and Comparison of Performance and Power Consumption of Neural Networks on CPU, GPU, TPU and FPGA 1,065 views

Efficient code generation for hardware accelerators by refining partially specified implementation 1,064 views

Productive Performance Engineering for Weather and Climate Modeling with Python 1,064 views

Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks? 1,062 views

BASEMENT v3: a modular freeware for river process modelling over multiple computational backends 1,060 views

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs 1,060 views

The Ecological Footprint of Neural Machine Translation Systems 1,057 views

A Highly Parameterizable Framework for Conditional Restricted Boltzmann Machine Based Workloads Accelerated With FPGAs and OpenCL 1,056 views

hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices 1,055 views

Reducing IO bandwidth for GPU based moment invariant classifier systems 1,053 views

94% on CIFAR-10 in 3.29 Seconds on a Single GPU 1,052 views

Case Study: GPU-based implementation of sequence pair based floorplanning using CUDA 1,051 views

The Celerity High-level API: C++20 for Accelerator Clusters 1,051 views

An Accelerated IHS Transform Fusion of Remote Sensing Image Data Based on GPU 1,051 views

PeriPy – A High Performance OpenCL Peridynamics Package 1,050 views

GPGPU flow 1,050 views

Software Testing – Test Suite Compilation and Execution Optimizations 1,049 views

Acceleration of the Method of Moments Calculations by Using Graphics Processing Units 1,049 views

Enhancing Performance of Simulations using GPGPU 1,043 views

Effective GPU Sharing Under Compiler Guidance 1,042 views

How to Render FDTD Computations More Effective Using a Graphics Accelerator 1,041 views

BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs 1,041 views

Deep Graph Learning for Program Analysis and System Optimization 1,040 views

Custom Code Generation for a Graph DSL 1,040 views

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis 1,040 views

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results 1,040 views

AZP: Automatic Specialization for Zero Values in Gaming Applications 1,040 views

Migrating CUDA to oneAPI: A Smith-Waterman Case Study 1,039 views

Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures 1,038 views

Fast CUDA-Aware MPI Datatypes without Platform Support 1,036 views

Multi-level parallelization for hybrid ACO 1,036 views

Parallel computing with CUDA 1,034 views

Brief statistics for this page

Titles: 100

Total views: 107,574

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

94% on CIFAR-10 in 3.29 Seconds on a Single GPU

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

OpenMC Monte Carlo Code

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

Polygeist: C/C++ frontend for MLIR

Retargeting and Respecializing GPU Workloads for Performance Portability

Parallel Gaussian process with kernel approximation in CUDA

Recent source codes

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

Most viewed papers (last 30 days)