high performance computing on graphics processing units: hgpu.org


An Optimized Large-Scale Hybrid DGEMM Design for CPUs and ATI GPUs

Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, Ninghui Sun

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

26th ACM International Conference on Supercomputing (ICS), 2012

@article{li2012optimized,
  title={An Optimized Large-Scale Hybrid DGEMM Design for CPUs and ATI GPUs},
  author={Li, Jiajia and Li, Xingjian and Tan, Guangming and Chen, Mingyu and Sun, Ninghui},
  year={2012}
}


Source code package: HDGEMM


In heterogeneous systems that combine CPUs and GPUs, data transfers between these components play a critical role in determining application performance. Software pipelining is a common way to mitigate the overhead of those transfers. In this paper we investigate advanced software-pipelining optimizations for the double-precision general matrix multiplication (DGEMM) algorithm running on a heterogeneous system that includes ATI GPUs. Our approach decomposes the DGEMM workload at a finer granularity and hides the latency of CPU-GPU data transfers to a higher degree than previous approaches in the literature. We implement our approach in a five-stage software-pipelined DGEMM and analyze its performance on a platform with x86 multi-core CPUs and an ATI Radeon™ HD 5970 GPU, which has two Cypress GPU chips on board. Our implementation delivers 758 GFLOPS (82% floating-point efficiency) when it uses only the GPU, and 844 GFLOPS (80% efficiency) when it distributes the workload across both CPU and GPU. We analyze the performance of our optimized DGEMM as the number of GPU chips employed grows from one to two, and the results show that resource contention on the PCIe bus and in host memory are limiting factors.
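The general idea of software pipelining mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's five-stage design and does not use the HDGEMM package: it is a generic three-stage pipeline (load, multiply, store) in plain Python, where threads and bounded queues stand in for CPU-GPU transfers and device buffers. The names `split_rows`, `matmul`, `pipelined_matmul`, `block`, and `depth` are all hypothetical and chosen only for illustration.

```python
import queue
import threading


def split_rows(a, block):
    """Cut matrix `a` (a list of rows) into consecutive panels of `block` rows."""
    return [a[i:i + block] for i in range(0, len(a), block)]


def matmul(a, b):
    """Plain product of two list-of-lists matrices (stand-in for a GPU kernel)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def pipelined_matmul(a, b, block=2, depth=2):
    """Compute a @ b panel by panel with three overlapping stages.

    Assumption: `depth` models a fixed number of in-flight buffers per stage.
    """
    to_compute = queue.Queue(maxsize=depth)
    to_store = queue.Queue(maxsize=depth)
    finished = []

    def load():
        # Stage 1: stand-in for copying one panel of A from host to device.
        for index, panel in enumerate(split_rows(a, block)):
            to_compute.put((index, [row[:] for row in panel]))
        to_compute.put(None)  # sentinel: no more panels

    def compute():
        # Stage 2: multiply a panel that has already been loaded.
        while (item := to_compute.get()) is not None:
            index, panel = item
            to_store.put((index, matmul(panel, b)))
        to_store.put(None)  # forward the sentinel to the last stage

    def store():
        # Stage 3: stand-in for copying a finished panel of C back to the host.
        while (item := to_store.get()) is not None:
            finished.append(item)

    threads = [threading.Thread(target=stage) for stage in (load, compute, store)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reassemble the panels of C in their original row order.
    finished.sort(key=lambda item: item[0])
    return [row for _, panel in finished for row in panel]
```

Because each stage works on a different panel at the same time, the load of panel k+1 can overlap with the multiplication of panel k. Python threads do not run this CPU-bound work in parallel, so the sketch shows only the structure of the overlap, not any speedup.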

Tags: Algorithms, ATI, ATI CAL, ATI Radeon HD 5970, ATI Stream, Computer science, Heterogeneous systems, Matrix multiplication, Optimization, Package

June 29, 2012 by hgpu
