high performance computing on graphics processing units: hgpu.org

An Optimized Large-Scale Hybrid DGEMM Design for CPUs and ATI GPUs

Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, Ninghui Sun

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

26th ACM International Conference on Supercomputing (ICS), 2012

@article{li2012optimized,
  title={An Optimized Large-Scale Hybrid DGEMM Design for CPUs and ATI GPUs},
  author={Li, Jiajia and Li, Xingjian and Tan, Guangming and Chen, Mingyu and Sun, Ninghui},
  year={2012}
}

Package: HDGEMM

In heterogeneous systems that include CPUs and GPUs, data transfers between these components play a critical role in determining application performance. Software pipelining is a common approach to mitigating the overhead of those transfers. In this paper we investigate advanced software-pipelining optimizations for the double-precision general matrix multiplication (DGEMM) algorithm running on a heterogeneous system that includes ATI GPUs. Our approach decomposes the DGEMM workload at a finer granularity and hides the latency of CPU-GPU data transfers to a higher degree than previous approaches in the literature. We implement our approach in a five-stage software-pipelined DGEMM and analyze its performance on a platform with x86 multi-core CPUs and an ATI Radeon™ HD5970 GPU that has two Cypress GPU chips on board. Our implementation delivers 758 GFLOPS (82% floating-point efficiency) when it uses only the GPU, and 844 GFLOPS (80% efficiency) when it distributes the workload across both CPU and GPU. We analyze the performance of our optimized DGEMM as the number of GPU chips employed grows from one to two, and the results show that resource contention on the PCIe bus and in host memory are the limiting factors.
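The general idea of software pipelining a tiled matrix multiplication can be illustrated with a minimal sketch. This is not the paper's five-stage design or the HDGEMM package: it assumes a simplified three-stage pipeline (load, compute, store) in which Python threads stand in for the transfer and compute engines, and bounded queues stand in for double buffers. The names `pipelined_dgemm`, `matmul_tile`, and the `tile` parameter are hypothetical and chosen for illustration only.

```python
import queue
import threading


def matmul_tile(a_rows, b_cols):
    """Multiply a row panel of A by a panel of B columns (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a_rows]


def pipelined_dgemm(A, B, tile=2):
    """Compute C = A * B tile by tile with a load -> compute -> store pipeline.

    Assumption: three stages for illustration; the paper describes five.
    """
    n, m = len(A), len(B[0])
    Bt = [list(col) for col in zip(*B)]   # columns of B, so tiles are slices
    C = [[0.0] * m for _ in range(n)]
    loaded = queue.Queue(maxsize=2)       # bounded queue acts as a double buffer
    computed = queue.Queue(maxsize=2)
    done = object()                       # sentinel that shuts the pipeline down

    def load():
        # Stage 1: stage input tiles (stands in for a host-to-device transfer).
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                loaded.put((i, j, A[i:i + tile], Bt[j:j + tile]))
        loaded.put(done)

    def compute():
        # Stage 2: multiply one tile while the next one is being loaded.
        while True:
            item = loaded.get()
            if item is done:
                computed.put(done)
                return
            i, j, a_rows, b_cols = item
            computed.put((i, j, matmul_tile(a_rows, b_cols)))

    def store():
        # Stage 3: write finished tiles into C (stands in for device-to-host).
        while True:
            item = computed.get()
            if item is done:
                return
            i, j, block = item
            for di, row in enumerate(block):
                C[i + di][j:j + len(row)] = row

    threads = [threading.Thread(target=stage) for stage in (load, compute, store)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C
```

Because each stage hands work to the next through a bounded queue, the load of tile t+1 can proceed while tile t is being multiplied and tile t-1 is being written back. In CPython this shows only the structure of the overlap, not a real speedup; on actual hardware the stages would map to asynchronous transfers and GPU kernels.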

Tags: Algorithms, ATI, ATI CAL, ATI Radeon HD 5970, ATI Stream, Computer science, Heterogeneous systems, Matrix multiplication, Optimization, Package

June 29, 2012 by hgpu
