## Effect of GPU Communication-Hiding for SpMV Using OpenACC

Department of Systems Innovation, School of Engineering, The University of Tokyo, 7-3-1 Hongo Bunkyo-ku, Tokyo 113-8656, Japan

The 5th International Conference on Computational Methods (ICCM2014), 2014

@{,

}

In the finite element method simulation we often deal with large sparse matrices. Sparse matrix-vector multiplication (SpMV) is of high importance for iterative solvers. During the solver stage, most of the time is in fact spent in the SpMV routine. The SpMV routine is highly memory-bound; the processor spends much time waiting for the needed data. In this study, we discuss overlapping possibilities of SpMV in cases where the sparse matrix data does not fit into the memory of the discrete GPU, by using OpenACC. With GPUs one can take advantage of their relatively high memory bandwidth capabilities. However, data needs to be transferred over the relatively slow PCI express (PCIe) bus. This transfer time can to a certain degree be hidden. We concurrently perform computation on one set of data while another set of data is being transferred. Parameters such as the size of each subdivision being transferred – the number of matrix subdivisions, and the whole matrix size, are adjustable. We generate matrices modeling one, three and six degrees of freedom. It is observed how these parameters affect performance. We analyze the improved performance as a result of communication-hiding with OpenACC, and a profiler is used to provide us with additional insight. This is of direct relevance for a block Krylov solver, for instance a block Cg solver. Here, one can benefit from streaming of data with SpMV and overlap while doing so. Each streamed subdivision is used several times with different vectors. When using a discrete GPU with an ordinary (non-block) Krylov solver, one has to run SpMV once over the whole matrix (or subdivision) for each solver iteration, so there will be no benefit if the matrix does not fit the GPU memory. This is due to the fact that streaming the matrix over the PCIe bus for each of the solver iterations incurs a too big overhead. For instance, in the case of three degrees of freedom and modeling 2,097,152 nodes, we observe a just above 40% performance increase by applying communication-hiding in our benchmarking routine. This gives us close to 33 GFLOP/s on the AMD Tahiti GPU architecture, in double precision. When modeling the same amount of nodes with a "synthetic" six degrees of freedom, up to ~65.7% is observed in increased performance when hiding parts of the data transfer time. This underlines the importance of applying such techniques in simulations, when it is suitable with the algorithmic structure of the problem in relation to the underlying computer architecture

August 15, 2014 by hgpu