high performance computing on graphics processing units: hgpu.org

A Hardware Multithreaded SpMV Kernel for the Convey HC-2ex

Robert Halstead, Walid Najjar

Computer Science & Engineering, UC Riverside, Riverside, CA 92521

UC Riverside, Technical Report UCR-CSE-2013-02011, 2013

@techreport{halstead2013hardware,
  title={A Hardware Multithreaded SpMV Kernel for the Convey HC-2ex},
  author={Halstead, Robert and Najjar, Walid},
  institution={UC Riverside},
  number={UCR-CSE-2013-02011},
  year={2013}
}


Applications that exhibit irregular behavior through poor memory locality have been a constant challenge for high-performance computing. Architectures supporting hardware multithreading (e.g., Tera MTA and Cray XMT) have been shown to deliver superior performance on such applications by masking memory latency. FPGAs have outperformed traditional architectures on applications that exhibit very high spatial locality and where the data can be streamed through a pre-configured hardware accelerator customized for that application. However, hardware multithreading can also be implemented on FPGAs, provided the memory system supports multiple outstanding memory requests. CHAT (Custom Hardware Accelerated Threads) is a compiler effort targeting the generation of multithreaded hardware on FPGAs for irregular applications. In this paper we explore the multithreaded implementation of SpMV (sparse matrix-vector) multiplication on the Convey HC-2ex. Our design uses multiple Computation Engines (CEs) that are supplied with workloads by a single management unit. Each job corresponds to an individual row of the matrix and is assigned dynamically as engines become available. This approach copes efficiently with matrices whose row sizes have either high or low variance. The CEs issue multiple outstanding memory requests to mask the long memory latencies, and each CE can handle multiple jobs in parallel to keep enough memory requests in flight. Experimental evaluation on the HC-2ex shows that our approach sustains 80% of the peak memory throughput and scales linearly up to three of the four FPGAs, after which memory bottlenecks reduce the sustained throughput to 75% of the peak.
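The row-per-job scheduling described in the abstract can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: it assumes the matrix is stored in CSR (compressed sparse row) form, uses Python threads as stand-ins for the Computation Engines, and uses a shared queue as a stand-in for the management unit. The function name `spmv_csr_dynamic` and the parameter `num_engines` are hypothetical, and the sketch does not model outstanding memory requests or latency masking.

```python
import queue
import threading


def spmv_csr_dynamic(row_ptr, col_idx, values, x, num_engines=4):
    """Compute y = A @ x for a CSR matrix, one job per row.

    Rows are handed out dynamically: each worker ("engine") pulls the
    next row from a shared queue as soon as it finishes its current one,
    so long and short rows balance across engines automatically.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows

    # Stand-in for the management unit: a queue holding one job per row.
    jobs = queue.Queue()
    for row in range(n_rows):
        jobs.put(row)

    def engine():
        while True:
            try:
                row = jobs.get_nowait()
            except queue.Empty:
                return  # no rows left, this engine is done
            acc = 0.0
            # Non-zeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
            for k in range(row_ptr[row], row_ptr[row + 1]):
                acc += values[k] * x[col_idx[k]]  # irregular gather from x
            y[row] = acc  # each row is written by exactly one engine

    threads = [threading.Thread(target=engine) for _ in range(num_engines)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y


# Example: A = [[1, 0, 2], [0, 0, 0], [3, 4, 5]], x = [1, 2, 3]
row_ptr = [0, 2, 2, 5]
col_idx = [0, 2, 0, 1, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
print(spmv_csr_dynamic(row_ptr, col_idx, values, x))  # [7.0, 0.0, 26.0]
```

The example matrix has rows of length 2, 0, and 3, which shows how dynamic assignment handles uneven row sizes: an engine that draws the empty row returns to the queue immediately instead of idling.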

Tags: Computer science, CUDA, FPGA, nVidia, nVidia GeForce GTX 280, Sparse matrix

April 8, 2013 by hgpu
