high performance computing on graphics processing units: hgpu.org

hgpu.org » Memory level parallelism

Scalable Kernel Fusion for Memory-Bound GPU Applications

Mohamed Wahib and Naoya Maruyama

Tags: CUDA, Memory level parallelism, nVidia GeForce GTX 750 Ti, Tesla K20, Tesla K40

September 1, 2014 by wahibium

Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

George C. Caragea, Alexandros Tzannes, Fuat Keceli, Rajeev Barua, Uzi Vishkin

Tags: Algorithms, ASIC, Benchmarking, Computer science, Memory level parallelism, nVidia, Package, Prefetch

December 29, 2011 by hgpu

Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model

Sunpyo Hong, Hyesoon Kim

Tags: Analytical model, Benchmarking, Computer science, CUDA, Memory level parallelism, nVidia, nVidia Quadro FX 5600, Warp level parallelism

December 15, 2011 by hgpu

Accelerating Parameter Sweep Applications Using CUDA

Masaya Motokubota, Fumihiko Ino, Kenichi Hagihara

Tags: Computer science, CUDA, Memory level parallelism, nVidia, Performance

June 14, 2011 by hgpu

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Sunpyo Hong, Hyesoon Kim

Tags: Analytical model, Computer science, CUDA, Memory level parallelism, nVidia, nVidia GeForce 8800 GT, nVidia GeForce 8800 GTX, nVidia Quadro FX 5600, Warp level parallelism

October 29, 2010 by hgpu
