high performance computing on graphics processing units: hgpu.org

Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach

George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Piotr Luszczek, Jack J. Dongarra

Scalable Computing and Communications: Theory and Practice, 2012

@article{bosilca2012dense,
  title={Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach},
  author={Bosilca, G. and Bouteiller, A. and Danalis, A. and Herault, T. and Luszczek, P. and Dongarra, J.J.},
  year={2012}
}

Among the various factors that drive the momentous changes occurring in the design of microprocessors and high-end systems [1], three stand out as especially notable:

1. the number of transistors per chip will continue the current trend, i.e. double roughly every 18 months, while the speed of processor clocks will cease to increase;
2. the physical limit on the number and bandwidth of CPU pins is becoming a near-term reality;
3. a strong drift toward hybrid/heterogeneous designs for petascale (and larger) systems is taking place.

While the first two involve fundamental physical limitations that current technology trends are unlikely to overcome in the near term, the third is an obvious consequence of the first two, combined with the economic necessity of using many thousands of computational units to scale up to petascale and larger systems.

More transistors and stagnant clock rates require multicore designs and increased parallelism. The mainstays of traditional processor design (increasing transistor density, raising the clock rate, lowering the voltage) have now been stopped by a set of physical barriers: excess heat produced, too much power consumed, too much energy leaked, and useful signal overcome by noise. Multicore designs are a natural evolutionary response to this situation. By putting multiple processor cores on a single die, architects can overcome the previous limitations and continue to increase the number of gates per chip without increasing the power density. However, since excess heat production means that frequencies cannot be increased further, deep-and-narrow pipeline models will tend to recede as shallow-and-wide pipeline designs become the norm.

Moreover, despite obvious similarities, multicore processors are not equivalent to multiple CPUs or to SMPs. Multiple cores on the same chip can share various caches (including the TLB, or Translation Lookaside Buffer) while competing for memory bandwidth. Extracting performance from such configurations of resources means that programmers must exploit increased thread-level parallelism (TLP) and use efficient mechanisms for inter-processor communication and synchronization to manage resources effectively.

The complexity of fine-grained parallel processing will no longer be hidden in hardware by a combination of increased instruction-level parallelism (ILP) and pipelining techniques, as it was with superscalar designs. It will have to be addressed at a higher level, in software, either directly in the context of the applications or in the programming environment. As code and performance portability remain essential, the programming environment has to change drastically.
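To make the idea of handling parallelism in software concrete, the following is a minimal sketch of a dependency-driven task executor: work is described as a DAG of tasks, and any task whose inputs are ready is handed to a pool of threads. It is a generic illustration, not the runtime described in the paper. The function name `run_task_dag`, the `tasks`/`deps` dictionaries, and the toy arithmetic tasks are all hypothetical, and Python threads are used only to show the scheduling logic, not to demonstrate real speedup.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_task_dag(tasks, deps, workers=4):
    """Run tasks in dependency order; independent tasks may run concurrently.

    tasks: dict mapping a task name to a callable taking one argument,
           a dict {dependency name: dependency result}.
    deps:  dict mapping a task name to the names it depends on.
    (Hypothetical interface, chosen for illustration only.)
    """
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    successors = {name: [] for name in tasks}
    for name, needed in remaining.items():
        for dep in needed:
            if dep not in tasks:
                raise ValueError(f"unknown dependency {dep!r} of {name!r}")
            successors[dep].append(name)

    results = {}
    in_flight = {}  # future -> task name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        def submit(name):
            # All dependencies are finished, so their results are available.
            inputs = {dep: results[dep] for dep in deps.get(name, ())}
            in_flight[pool.submit(tasks[name], inputs)] = name

        # Start with the tasks that have no dependencies at all.
        for name, needed in remaining.items():
            if not needed:
                submit(name)

        # Each completion may release successors whose inputs are now ready.
        while in_flight:
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                name = in_flight.pop(future)
                results[name] = future.result()
                for succ in successors[name]:
                    remaining[succ].discard(name)
                    if not remaining[succ]:
                        submit(succ)

    if len(results) != len(tasks):
        raise ValueError("dependency cycle: some tasks could never start")
    return results


# Toy diamond-shaped DAG: "square" and "double" are independent of each
# other and can run concurrently once "load" has finished.
tasks = {
    "load": lambda inputs: 3,
    "square": lambda inputs: inputs["load"] ** 2,
    "double": lambda inputs: inputs["load"] * 2,
    "sum": lambda inputs: inputs["square"] + inputs["double"],
}
deps = {"square": ["load"], "double": ["load"], "sum": ["square", "double"]}
results = run_task_dag(tasks, deps)
```

The scheduler never inspects what a task computes; it only tracks which dependencies have completed. That separation is what allows the same task description to be mapped onto different numbers of workers without touching the application code.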

Tags: Computer science, CUDA, Heterogeneous systems, Linear Algebra, nVidia, Tesla C2070

January 30, 2012 by hgpu
