high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Modeling and Optimization of Parallel Matrix-based Computations on GPU

Modeling and Optimization of Parallel Matrix-based Computations on GPU

Andrew White

Auburn University

Auburn University, 2013

BibTeX

Download (PDF)

View

Source

1881

views

As graphics processing units (GPUs) are continually being utilized as coprocessors, the demand for optimally utilizing them for various applications continues to grow. This work narrows the gap between programmers and minimum execution time for matrix-based computations on a GPU. To minimize execution time, computation and communication time must be considered. For computation, the placement of data in GPU memory significantly affects computation time and therefore is considered. Various matrix-based computation patterns are examined with respect to the layout of GPU memory. A computation pattern refers to the order in which each GPU thread computes a result. From examination of computation patterns, an optimized computation pattern, a pattern which reduces computation time, is derived. After the optimized computation pattern is derived, the access pattern to GPU memory, or order in which data is accessed, is considered. From the optimized access pattern, fine-tuning is performed to the GPU code such as minimization of index calculations and loop unrolling to further reduce computation time and resource allocation. After fine-tuning, input parameters which yield the minimum computation time for a matrix-based computation on the GPU are derived. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Execution metrics are formulated, as functions of the input parameters, to represent the executional behavior of the GPU. The execution metrics are utilized to derive the optimal input parameters which are input parameters that yield the minimum computation time. The matrix-based computations considered are matrix-vector multiplication (Mv), matrix-matrix multiplication (MM), convolution, and the conjugate gradient method. A parallelization procedure is developed to minimize execution time by deriving the optimal placement of data, optimized computation pattern, optimized access pattern, and optimal input parameters. Each step of the procedure is developed through analysis of Mv and MM. The procedure includes an accurate communication model and estimates for CPU and GPU data transfers. With accurate estimates for communication and computation times, partitioning computation for CPU and GPU execution is performed to minimize execution time. Convolution and conjugate gradient are utilized to verify the validity of the procedure. Therefore, the overall goal of this work is to develop a parallelization procedure which minimizes execution time of matrix-based computations executing on the GPU.

Tags: Computer science, CUDA, Matrix multiplication, nVidia, Optimization, Tesla S1070, Thesis

April 27, 2013 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Modeling and Optimization of Parallel Matrix-based Computations on GPU

Your response

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

Most viewed papers (last 30 days)

Modeling and Optimization of Parallel Matrix-based Computations on GPU

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)