Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques
Dept. of Electrical, Electronic and Information Engineering (DEI), University of Bologna
University of Bologna, 2015
@phdthesis{pinto2015many,
  title={Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques},
  author={Pinto, Christian},
  school={University of Bologna},
  year={2015}
}
During the last few decades, unprecedented technological growth has been the driving force of embedded systems design, with Moore's Law as the leading factor of this trend. Today an ever-increasing number of cores can be integrated on the same die, marking the transition from state-of-the-art multi-core chips to the new many-core design paradigm. The aim of such many-core chips is twofold: to provide high computing performance, and to increase the energy efficiency of the hardware in terms of OPS/Watt. Despite their extraordinarily high computing power, the complexity of many-core chips opens the door to several challenges.

First of all, as a result of the increased silicon density of modern Systems-on-a-Chip (SoC), the design space that must be explored to find the best design has exploded. Hardware designers face a huge design space, with an extremely high number of possibilities to evaluate for each architectural choice. This is exacerbated by the extremely competitive silicon market, which forces each actor to continually shrink the time-to-market of products to stay ahead of competitors. Virtual Platforms have long been used to enable hardware-software co-design, but today they must cope with the huge complexity of both hardware and software systems. In this thesis two research works on Virtual Platforms are presented: the first is intended for hardware developers, to enable complex cycle-accurate simulations of many-core SoCs with little effort; the second exploits the parallel computing power of off-the-shelf General Purpose Graphics Processing Units (GPGPUs) to increase simulation speed.

In the context of many-core systems, the term virtualization refers not only to the aforementioned hardware emulation tools (Virtual Platforms), but also to parallel-programming aid tools and to the higher-level virtualization techniques used today to create software instances of computing systems [21]. Virtualization in fact serves two further purposes: 1) helping the programmer achieve the maximum possible performance of an application, by hiding the complexity of the underlying hardware; 2) efficiently exploiting the highly parallel hardware of many-core chips in environments with multiple active virtual machines, where the accelerator may have to sustain execution requests from several virtual machines at once. In this last context, besides sharing the accelerator, isolation between virtual machines is required. This thesis focuses on virtualization techniques that mitigate, and where possible overcome, some of the challenges introduced by the many-core design paradigm.

Besides the design challenge, many-core chips themselves challenge programmers who want to effectively exploit their theoretical computing power. The most important and most performance-critical of these challenges is the memory-bandwidth bottleneck: as a result of several design choices, most many-core chips are composed of multi-core computing clusters replicated over the design. This pattern reduces design effort, since the architecture of a single cluster is defined once and several clusters are then deployed on the same chip. For the sake of area/power efficiency, the processing elements in a cluster are often not equipped with data caches; instead, they share an on-chip data scratch-pad memory.
On-chip memories are fast but available only in limited amounts, and the data set of an application cannot always fit into them. For this reason data is usually allocated in the much larger, but much slower, external memory. To mitigate the external-memory access latency, and in the absence of a data cache, programmers are forced to apply copy-in/copy-out schemes that move chunks of data from the external memory to the on-chip memory (and vice versa). Such schemes usually exploit a Direct Memory Access (DMA) engine to overlap the computation of one chunk of data with the copy of the next (a pattern sketched in the first code example below). In this thesis a memory virtualization infrastructure is presented that automatically handles external-memory-to-scratch-pad transfers. The framework treats the on-chip scratch-pad of a computing cluster as if it were a cache (a software cache), moving data back and forth from external memory without programmer intervention. The software cache also handles multiple concurrent accesses from the processing elements of each cluster.

The last aspect investigated is virtualization at its highest level of abstraction, used in the server/cloud-computing domain to create sandboxed instances of operating systems (virtual machines) that physically share the same hardware (hardware consolidation). This type of virtualization has recently become available in the embedded systems domain as well, thanks to the advent of hardware-assisted virtualization in ARM-based processors [15]. In a virtualized system every hardware peripheral needs a virtual counterpart, to give each virtual machine the illusion of a dedicated device. Since many-core chips are used as co-processors (accelerators) alongside general-purpose multi-core processors (hosts), they too need to be virtualized and made available to all the virtual machines running on the system. However, modern many-core-based systems are still under constant refinement, and current virtualization techniques cannot overcome some of their architectural limitations.

One of these limitations is memory sharing between host and accelerator. A general-purpose processor usually handles every memory region under virtual memory, providing a flexible and contiguous view of physical memory even when data is not contiguously allocated; this is achieved through a Memory Management Unit (MMU). Many-core chips, not being equipped with an MMU, can only access physically contiguous memory, which makes it impossible for the co-processor to directly access a data buffer created by the host system. The memory-sharing problem is even more severe in a virtualized environment, where the accelerator may be sharing data with several different virtual machines. This thesis addresses the challenge with a virtualization layer that transparently enables host-accelerator memory sharing, and with a resource-sharing mechanism that allows the many-core accelerator to be used concurrently by several virtual machines.
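The copy-in/copy-out pattern described earlier is typically coded as a double-buffering loop: while one scratch-pad buffer is being processed, the DMA engine fills the other. The C sketch below illustrates the idea; the dma_start/dma_wait primitives, the compute kernel and the buffer sizes are hypothetical placeholders, not the API of any particular chip.

    /* Double-buffered copy-in sketch; DMA primitives are assumed. */
    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK 1024                        /* elements per on-chip chunk */

    typedef int dma_handle_t;
    /* Assumed non-blocking DMA copy: external memory -> scratch-pad. */
    extern dma_handle_t dma_start(void *dst, const void *src, size_t bytes);
    extern void dma_wait(dma_handle_t h);     /* block until transfer ends  */
    extern void compute(int32_t *buf, size_t n); /* per-chunk kernel        */

    /* Two scratch-pad buffers: one is processed while the other fills. */
    static int32_t spm_buf[2][CHUNK];

    void process(const int32_t *ext, size_t n_chunks)
    {
        dma_handle_t h = dma_start(spm_buf[0], ext, sizeof spm_buf[0]);
        for (size_t i = 0; i < n_chunks; i++) {
            int cur = i & 1;
            dma_wait(h);                      /* chunk i is now on chip     */
            if (i + 1 < n_chunks)             /* prefetch chunk i+1         */
                h = dma_start(spm_buf[cur ^ 1], ext + (i + 1) * CHUNK,
                              sizeof spm_buf[0]);
            compute(spm_buf[cur], CHUNK);     /* overlaps with DMA above    */
        }
    }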
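The software cache automates exactly this traffic. As a rough illustration of the concept only, a direct-mapped read path might look as follows; the line size, tag table and blocking dma_copy call are assumptions, and the concurrent-access and write-back support of the thesis framework are omitted.

    /* Minimal direct-mapped software cache over the scratch-pad (sketch). */
    #include <stddef.h>
    #include <stdint.h>

    #define LINE_WORDS 64                 /* 32-bit words per cache line    */
    #define N_LINES    128                /* lines resident in scratch-pad  */

    extern void dma_copy(void *dst, const void *src, size_t bytes); /* blocking */

    static int32_t   spm_lines[N_LINES][LINE_WORDS]; /* cached data         */
    static uintptr_t tags[N_LINES]; /* external line ids; must be set to an
                                       invalid value at boot (init omitted) */

    /* Return the on-chip copy of the word at external address ext_addr. */
    int32_t sw_cache_read(const int32_t *ext_addr)
    {
        uintptr_t addr = (uintptr_t)ext_addr;
        uintptr_t line = addr / (LINE_WORDS * sizeof(int32_t)); /* line id  */
        size_t    set  = line % N_LINES;                        /* direct map */

        if (tags[set] != line) {          /* miss: refill the line via DMA  */
            const void *src = (const void *)(line * LINE_WORDS * sizeof(int32_t));
            dma_copy(spm_lines[set], src, sizeof spm_lines[set]);
            tags[set] = line;
        }
        size_t off = (addr / sizeof(int32_t)) % LINE_WORDS;
        return spm_lines[set][off];       /* hit path: plain scratch-pad read */
    }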
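Finally, the host-accelerator memory-sharing limitation can be made concrete with a short sketch: a buffer obtained from malloc is contiguous only in virtual memory, so an MMU-less accelerator must instead be handed a physically contiguous region. The alloc_contig and phys_addr_of helpers below stand for driver services such as those built on a contiguous-memory allocator (e.g. Linux CMA); they are hypothetical names, not the mechanism the thesis actually proposes.

    /* Staging data into a physically contiguous region for an
     * MMU-less accelerator; driver helpers are assumptions.    */
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    extern void *alloc_contig(size_t bytes);        /* physically contiguous */
    extern uintptr_t phys_addr_of(const void *va);  /* virtual -> physical   */
    extern void accel_run(uintptr_t phys, size_t bytes); /* start offload    */

    void offload(const void *host_data, size_t bytes)
    {
        /* host_data may come from malloc: contiguous in virtual memory
         * only, so the accelerator cannot follow its scattered pages.  */
        void *shared = alloc_contig(bytes);  /* contiguous also physically */
        memcpy(shared, host_data, bytes);    /* stage data for the device  */
        accel_run(phys_addr_of(shared), bytes);
    }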
July 17, 2015 by hgpu