Heterogeneous CPU/(GP)GPU Memory Hierarchy Analysis and Optimization

Josué Vladimir Quiroga Esparza
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya
Universitat Politècnica de Catalunya, 2015


@mastersthesis{quiroga2015heterogeneous,
   title={Heterogeneous CPU/GPU Memory Hierarchy Analysis and Optimization},
   author={Quiroga Esparza, Josu{\'e} Vladimir},
   year={2015},
   publisher={Universitat Polit{\`e}cnica de Catalunya}
}





Heterogeneous systems, and CPU–GPGPU platforms in particular, have attracted a great deal of attention because of the excellent speedups GPUs achieve at low energy cost. Not everything is a success story, however: the complex programming models needed to fully exploit the devices, and the overhead of moving data between them, are among the main obstacles to reaping the benefits of GPGPU computing. Meanwhile, architects at major processor manufacturers such as Intel and AMD have used the transistor budgets afforded by Moore's Law to integrate CPUs and GPUs on the same chip, but the logical integration has not been as simple as placing them side by side on the same die. Fusing these two kinds of cores means fusing two different memory hierarchies: the GPU's, tuned for high memory bandwidth to sustain its throughput, and the CPU's, with multi-level, higher-capacity caches whose coherence protocols give the programmer a strong consistency model at the cost of scalability-limiting coherency traffic. To address this, the Heterogeneous System Architecture (HSA) was developed by the HSA Foundation, founded by ARM, AMD, Qualcomm and many other companies, to reduce the latency of device-to-device communication and to simplify programming (in CUDA or OpenCL) by eliminating copies between disjoint memories, resulting in a unified virtual memory. Building on this, AMD created heterogeneous Uniform Memory Access (hUMA) so that both devices share the system's virtual address space: the GPU can read and write CPU memory addresses directly, the two share page tables, and devices can exchange data simply by passing pointers. On-chip integration brings great improvements, but the memory wall remains, and it is a severe constraint for devices with memory-bandwidth demands as high as the GPU's.

Memory controllers play the leading role in coordinating and scheduling all processor requests to off-chip main memory, accounting for technology latencies, refreshes, and so on. The constraints and scheduling possibilities are so numerous that no single formula can schedule a processor's requests to main memory, so the policies vary from processor to processor. In this master thesis, we propose a scheduling re-ordering scheme based on a hysteresis detector that improves fairness and speedup for the threads issuing memory requests by exploiting the bank-level parallelism of the memory system organization. We first review the evolution of CPUs and GPUs up to their integration into systems and processors that use the GPU for general-purpose computing. We then take a closer look at the memory controller, presenting its general structure and functional elements along with state-of-the-art memory controllers for multicore processors. Next we present our proposed re-ordering system, with its hysteresis detection and re-ordering logic, followed by the methodology: the simulation infrastructure and the benchmarks used. We analyze three configurations: a baseline processor without memory unification, a fused processor with virtual memory unification, and the same fused processor with the proposed bank-parallelism-aware scheduling. Finally, we state the conclusions derived from this analysis and outline future work.
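The combination of hysteresis-driven fairness and bank-level-parallelism-aware re-ordering can be illustrated with a minimal software model. The sketch below is not the thesis's implementation: all names, the request representation, and the threshold value are hypothetical. A hysteresis counter tracks which client (CPU or GPU) has been dominating service and, once a threshold is crossed, flips priority to the starved client; within the favored client, the scheduler prefers a request that targets a different bank than the previous issue, so consecutive accesses can overlap in distinct banks.

```python
from collections import deque

class HysteresisScheduler:
    """Toy model of hysteresis-based memory-request re-ordering.
    All names and thresholds are illustrative, not from the thesis."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.counter = 0        # >0: CPU has been served more; <0: GPU
        self.favored = "CPU"    # client currently given priority

    def _update_hysteresis(self, served_client):
        # Count services per client; flip priority only after the
        # imbalance exceeds the threshold (the hysteresis behavior).
        self.counter += 1 if served_client == "CPU" else -1
        if self.counter >= self.threshold:
            self.favored = "GPU"    # CPU dominated; favor GPU next
            self.counter = 0
        elif self.counter <= -self.threshold:
            self.favored = "CPU"    # GPU dominated; favor CPU next
            self.counter = 0

    def schedule(self, requests):
        """requests: list of (client, bank) tuples in arrival order.
        Returns the issue order after re-ordering."""
        order = []
        pending = deque(requests)
        last_bank = None
        while pending:
            # Prefer the favored client AND a bank different from the
            # last one issued, to exploit bank-level parallelism.
            pick = None
            for req in pending:
                client, bank = req
                if client == self.favored and bank != last_bank:
                    pick = req
                    break
            if pick is None:        # fall back to the oldest request
                pick = pending[0]
            pending.remove(pick)
            order.append(pick)
            last_bank = pick[1]
            self._update_hysteresis(pick[0])
        return order

# Example: two CPU bursts to bank 0 get interleaved with other banks,
# and the GPU is promoted once the CPU has been served twice in a row.
sched = HysteresisScheduler(threshold=2)
reqs = [("CPU", 0), ("CPU", 0), ("GPU", 1), ("CPU", 1), ("GPU", 0)]
print(sched.schedule(reqs))
# → [('CPU', 0), ('CPU', 1), ('GPU', 0), ('GPU', 1), ('CPU', 0)]
```

In this sketch the re-ordering never drops or duplicates requests; it only changes the issue order, alternating banks where possible and bounding how long either client can monopolize the controller.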

