Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems

Lu Li
Department of Computer and Information Science, Linköping University, SE-581 83 Linköping, Sweden
Linköping University, 2018


   title={Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems},

   author={Li, Lu},


   school={Link{\"o}ping University Electronic Press}


CPU/GPU heterogeneous systems have shown remarkable advantages in performance and energy consumption over homogeneous ones such as standard multi-core systems. Such heterogeneity represents one of the most promising trends for the near-future evolution of high-performance computing hardware. However, heterogeneity is a double-edged sword: it also brings significant programming complexities that prevent easy and efficient use of such systems. In this thesis, we address four fundamental kinds of complexity associated with these heterogeneous systems: measurement complexity (the effort required to measure a metric, e.g., energy), CPU-GPU selection complexity, platform complexity, and data management complexity. We explore new low-cost programming abstractions to hide these complexities, and propose new optimization techniques that can be performed under the hood.

For the measurement complexity, although measuring time is trivial with native library support, measuring energy consumption, especially for systems with GPUs, is complex because of the low-level details involved, such as choosing the right measurement method, handling the trade-off between sampling rate and accuracy, and switching between measurement metrics. We propose a clean interface, together with its implementation, that not only hides the complexity of energy measurement but also unifies different kinds of measurements. The unification bridges the gap between time measurement and energy measurement: provided that a time optimization technique makes no metric-specific assumptions, it can be reused unchanged for energy optimization.
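The idea of a unified measurement interface can be illustrated with a minimal Python sketch. This is not the thesis's actual interface: the names `Meter`, `TimeMeter`, `ConstantPowerEnergyMeter`, and `pick_best` are hypothetical, and the energy meter is only a stand-in that assumes a fixed power draw, whereas a real implementation would read hardware power sensors. The point is that tuning code written against the common interface does not need to know which metric it is minimizing.

```python
import time
from abc import ABC, abstractmethod


class Meter(ABC):
    """Hypothetical common interface: start, stop, then read one value."""

    unit = ""

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def stop(self): ...

    @abstractmethod
    def value(self): ...


class TimeMeter(Meter):
    """Wall-clock time in seconds."""

    unit = "s"

    def start(self):
        self._t0 = time.perf_counter()

    def stop(self):
        self._elapsed = time.perf_counter() - self._t0

    def value(self):
        return self._elapsed


class ConstantPowerEnergyMeter(Meter):
    """Stand-in energy meter (assumption): energy = fixed watts * elapsed time."""

    unit = "J"

    def __init__(self, assumed_watts):
        self._watts = assumed_watts
        self._timer = TimeMeter()

    def start(self):
        self._timer.start()

    def stop(self):
        self._timer.stop()

    def value(self):
        return self._watts * self._timer.value()


def pick_best(variants, meter):
    """Metric-agnostic tuning: return the name of the variant with the lowest reading."""
    readings = {}
    for name, fn in variants.items():
        meter.start()
        fn()
        meter.stop()
        readings[name] = meter.value()
    return min(readings, key=readings.get)
```

Calling `pick_best(variants, TimeMeter())` and `pick_best(variants, ConstantPowerEnergyMeter(100.0))` runs the same selection logic for two different metrics, which is the kind of reuse the paragraph above describes.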
For the CPU-GPU selection complexity, which relates to efficient utilization of heterogeneous hardware, we propose a new mechanism, based on adaptive sampling, for constructing the predictors used in such selections. The mechanism adapts to different hardware platforms automatically and shows non-trivial advantages over random sampling.

For the platform complexity, we propose a new modular platform modeling language and its implementation to formally and systematically describe a computer system, enabling zero-overhead platform information queries for high-level software tool chains and for programmers as a basis for making software adaptive.

For the data management complexity, we propose a new mechanism to enable a unified memory view on heterogeneous systems that have separate memory spaces. This mechanism enables programmers to write significantly less code, which runs as fast as expert-written code and outperforms the current commercially available solution, Nvidia's Unified Memory. We further propose two data movement optimization techniques, lazy allocation and transfer fusion. Both are based on adaptively merging messages to reduce data transfer latency. We show that these techniques are potentially beneficial, and we prove that our greedy fusion algorithm is optimal.

Finally, we show that our approaches to handling the different complexities can be combined so that programmers can use them simultaneously. This research has been partly funded by two EU FP7 projects (PEPPHER and EXCESS) and SeRC.
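As a rough illustration of adaptive sampling for CPU-GPU selection, the Python sketch below spends its measurement budget near the problem size where the faster device changes, instead of sampling sizes at random. It is not the algorithm from the thesis: `build_predictor`, `faster_device`, and the two synthetic cost models are assumptions made up for this example, and a real system would time actual CPU and GPU runs.

```python
import bisect


def cpu_time(n):
    # Synthetic model (assumption): no launch overhead, slower per element.
    return 1e-6 * n


def gpu_time(n):
    # Synthetic model (assumption): fixed launch/transfer overhead, faster per element.
    return 5e-4 + 1e-7 * n


def faster_device(n):
    """Stand-in for a real measurement: which device wins at problem size n."""
    return "cpu" if cpu_time(n) <= gpu_time(n) else "gpu"


def build_predictor(measure, lo, hi, budget):
    """Sample at most `budget` sizes in [lo, hi], refining where the winner changes."""
    samples = {lo: measure(lo), hi: measure(hi)}
    for _ in range(budget - 2):
        sizes = sorted(samples)
        # Candidate intervals: adjacent samples with different winners that can still be split.
        gaps = [(b - a, a, b) for a, b in zip(sizes, sizes[1:])
                if samples[a] != samples[b] and b - a > 1]
        if not gaps:
            break
        _, a, b = max(gaps)  # refine the widest undecided interval first
        mid = (a + b) // 2
        samples[mid] = measure(mid)
    sizes = sorted(samples)

    def predict(n):
        # Use the winner recorded at the closest sampled size at or below n.
        i = bisect.bisect_right(sizes, n) - 1
        return samples[sizes[max(i, 0)]]

    return predict, sizes


predict, sizes = build_predictor(faster_device, 1, 1_000_000, budget=32)
```

With these synthetic models the sampled sizes cluster around the single crossover point, so `predict` can be queried for any size in the range without further measurements. If both endpoints report the same winner, this sketch stops early and always predicts that device, which is one of its limitations.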