Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach
Scalable Computing and Communications: Theory and Practice, 2012
@article{bosilca2012dense,
  title={Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach},
  author={Bosilca, G. and Bouteiller, A. and Danalis, A. and Herault, T. and Luszczek, P. and Dongarra, J.J.},
  journal={Scalable Computing and Communications: Theory and Practice},
  year={2012}
}
Among the various factors driving the momentous changes occurring in the design of microprocessors and high-end systems [1], three stand out as especially notable:

1. The number of transistors per chip will continue its current trend, doubling roughly every 18 months, while processor clock speeds will cease to increase.
2. The physical limit on the number and bandwidth of a CPU's pins is becoming a near-term reality.
3. A strong drift toward hybrid/heterogeneous designs for petascale (and larger) systems is taking place.

While the first two involve fundamental physical limitations that current technology trends are unlikely to overcome in the near term, the third is an obvious consequence of the first two, combined with the economic necessity of using many thousands of computational units to scale up to petascale and larger systems. More transistors and slower clocks require multicore designs and increased parallelism.

The traditional levers of processor design (increasing transistor density, speeding up the clock rate, lowering the voltage) have now been stopped by a set of physical barriers: excess heat produced, too much power consumed, too much energy leaked, and useful signal overcome by noise. Multicore designs are a natural evolutionary response to this situation. By putting multiple processor cores on a single die, architects can overcome these limitations and continue to increase the number of gates per chip without increasing power densities. However, since excess heat production means that frequencies cannot be increased further, deep-and-narrow pipeline models will tend to recede as shallow-and-wide pipeline designs become the norm.

Moreover, despite obvious similarities, multicore processors are not equivalent to multiple CPUs or to SMPs. Multiple cores on the same chip can share various caches (including the TLB, or Translation Look-aside Buffer) while competing for memory bandwidth. Extracting performance from such configurations of resources means that programmers must exploit increased thread-level parallelism (TLP) and efficient mechanisms for inter-processor communication and synchronization to manage resources effectively. The complexity of fine-grain parallel processing will no longer be hidden in hardware by a combination of increased instruction-level parallelism (ILP) and pipelining techniques, as it was with superscalar designs. It will have to be addressed at a higher level, in software, either directly in the context of the applications or in the programming environment. As code and performance portability remain essential, the programming environment has to change drastically.
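To make the shift from hardware-managed ILP to software-managed TLP concrete, here is a minimal sketch (not taken from the paper) of explicit thread-level parallelism in C with POSIX threads: each thread reduces its own slice of an array privately and takes a mutex only once, to combine results. The names NTHREADS and slice_sum, and the problem sizes, are illustrative assumptions.

/* Minimal sketch of explicit thread-level parallelism (illustrative,
 * not from the paper): each thread sums a slice of an array; a mutex
 * protects the single shared accumulation step. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct slice { int begin, end; };

static void *slice_sum(void *arg)
{
    struct slice *s = (struct slice *)arg;
    double local = 0.0;                 /* accumulate privately ... */
    for (int i = s->begin; i < s->end; i++)
        local += data[i];
    pthread_mutex_lock(&lock);          /* ... then synchronize once */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct slice sl[NTHREADS];

    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    for (int t = 0; t < NTHREADS; t++) {
        sl[t].begin = t * (N / NTHREADS);
        sl[t].end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, slice_sum, &sl[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("sum = %f\n", total);        /* expect 1000000.0 */
    return 0;
}

Accumulating locally and synchronizing once per thread keeps lock contention low; this explicit partitioning of work and managing of shared state is exactly the kind of resource management that superscalar hardware used to hide, and that the abstract argues must now be handled in software.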
January 30, 2012 by hgpu