high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable Logic

Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable Logic

Ahmad Khayyat

Queen’s University, Kingston, Ontario, Canada

Queen’s University, 2013

@article{khayyat2013analysis,

title={Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable Logic},

author={Khayyat, Ahmad},

year={2013},

publisher={Canadian theses}

}

Download (PDF)

View

Source

1661

views

The objective of this research is to design an efficient and flexible implementation of parallel matrix multiplication for FPGA devices by analyzing the computation and studying its design space. In order to adapt to the FPGA platform, the design employs blocking and parallelization. Blocked matrix multiplication enables processing arbitrarily large matrices using limited memory capacity, and reduces the bandwidth requirements across the device boundaries by reusing available elements. Exploiting the inherent parallelism in the matrix multiplication computation improves the performance and utilizes the available reconfigurable FPGA resources. The design is constructed by identifying the main design decisions and evaluating the alternatives for each one. The considered design decisions include the scheduling of block transfers, the scheduling of arithmetic operations in a block multiplication, the extent to which the parallelism is exploited, determining the block sizes and shapes, and the use of double buffers for storing matrix blocks. The choices offered by each decision are evaluated analytically in terms of their performance and utilization of FPGA resources. Based on this analysis, a detailed, flexible design that accommodates various alternative design choices is described. The design is optimized for matrices of floating-point elements, and for the FPGA target platform. Prior work is analyzed based on the considered design choices in order to identify the similarities and the differences. The proposed design is implemented using the VHDL hardware description language. The implementation is used to verify the correctness of the design and to confirm the analysis of the design decisions. Correctness is verified both by simulation using the ModelSim logic simulator, and in hardware through compiling the implementation using the Altera Quartus II CAD software and testing it on the Altera DE4 board, featuring a Stratix IV EP4SGX530C2 FPGA device. The implementation supports a range of parameters to facilitate the experimental evaluation of design choices. Experimental results show that the design scales linearly with respect to the consumed resources. Although increasing the system size reduces the maximum operating frequency, it also increases the parallelism, resulting in a higher performance. For instance, with 8 floating-point arithmetic units, the system runs at 320 MHz, which corresponds to a performance of 4 GFLOPS, whereas with 64 arithmetic units, it runs at 160 MHz, which corresponds to a performance of 16 GFLOPS. It is also shown that using a transfer schedule based on inner products reduces the transfer time by up to 50% compared to other schedules. Although using square blocks minimizes the number of required block multiplications, other non-square blocks minimize the transfer time, resulting in better total times.

Tags: Computer science, FPGA, Matrix multiplication, Thesis

August 24, 2013 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable Logic

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)

Analysis-Driven Design of Parallel Floating-Point Matrix Multiplication for Implementation in Reconfigurable Logic

Share this:

Recent source codes

Most viewed papers (last 30 days)