A Hardware Multithreaded SpMV Kernel for the Convey HC-2ex

Robert Halstead, Walid Najjar
Computer Science & Engineering, UC Riverside, Riverside, CA 92521
UC Riverside, Technical Report UCR-CSE-2013-02011, 2013

@article{halstead2013hardware,
   title={A Hardware Multithreaded SpMV Kernel for the Convey HC-2ex},
   author={Halstead, Robert and Najjar, Walid},
   year={2013}
}


Applications that exhibit irregular behavior through poor memory locality have been a constant challenge for high-performance computing. Architectures supporting hardware multithreading (e.g., Tera MTA and Cray XMT) have been shown to deliver superior performance on such applications by masking memory latency. FPGAs have outperformed traditional architectures on applications that exhibit very large spatial locality and where the data can be streamed through a pre-configured hardware accelerator customized for that application. Hardware multithreading can, however, also be implemented on FPGAs when the memory system supports multiple outstanding memory requests. CHAT (Custom Hardware Accelerated Threads) is a compiler effort targeting the generation of multithreaded hardware on FPGAs for irregular applications. In this paper we explore the multithreaded implementation of SpMV (Sparse Matrix Vector) multiplication on the Convey HC-2ex. Our design uses multiple Computation Engines (CEs) that are supplied with workloads from a single management unit. Each job corresponds to an individual row of the matrix and is dynamically assigned as engines become available. This approach copes efficiently with matrices exhibiting both high and low row-size variance. The CEs use multiple outstanding memory requests to mask the long memory latencies, and they can handle multiple jobs in parallel to ensure a sufficient number of memory requests. Experimental evaluation on the HC-2ex shows that our approach sustains 80% of the peak memory throughput and scales linearly up to three of the four FPGAs, after which memory bottlenecks reduce the sustained throughput to 75% of the peak.
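The row-per-job scheme described in the abstract can be illustrated with a small software model. This is only a sketch and not the authors' hardware design: it assumes the matrix is stored in CSR format (the abstract does not name a format), it approximates the cost of a job by the number of nonzeros in its row, and the names `spmv_csr_row`, `spmv_dynamic`, and `num_engines` are hypothetical. It also does not model outstanding memory requests or multiple jobs in flight per engine.

```python
import heapq


def spmv_csr_row(row_ptr, col_idx, vals, x, row):
    """Dot product of one CSR row with x: the unit of work for a single job."""
    acc = 0.0
    for k in range(row_ptr[row], row_ptr[row + 1]):
        acc += vals[k] * x[col_idx[k]]
    return acc


def spmv_dynamic(row_ptr, col_idx, vals, x, num_engines=4):
    """Compute y = A @ x, handing each row to whichever engine frees up first.

    Returns (y, work), where work[e] is the number of nonzeros processed by
    engine e under the assumed cost model (one time unit per nonzero).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    work = [0] * num_engines

    # Min-heap of (time at which the engine becomes free, engine id).
    engines = [(0, e) for e in range(num_engines)]
    heapq.heapify(engines)

    for row in range(n_rows):
        free_at, e = heapq.heappop(engines)      # next available engine
        nnz = row_ptr[row + 1] - row_ptr[row]    # assumed job cost
        y[row] = spmv_csr_row(row_ptr, col_idx, vals, x, row)
        work[e] += nnz
        heapq.heappush(engines, (free_at + nnz, e))

    return y, work


# 3x3 example matrix [[1, 0, 2], [0, 3, 0], [4, 5, 6]] in CSR form.
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1, 2, 3, 4, 5, 6]
x = [1, 2, 3]

y, work = spmv_dynamic(row_ptr, col_idx, vals, x, num_engines=2)
print(y, work)
```

Because rows are handed out as engines become free rather than split statically, a long row occupies one engine while the others keep draining short rows, which is the intuition behind the claim that the approach copes with both high and low row-size variance.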