high performance computing on graphics processing units: hgpu.org


Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems

Lu Li

Department of Computer and Information Science, Linköping University, SE-581 83 Linköping, Sweden

Linköping University, 2018

DOI:10.3384/diss.diva-145304

Package:

MeterPU: Generic Portable Measurement Framework for Multicore CPU and Multi-GPU Systems


CPU/GPU heterogeneous systems have shown remarkable advantages in performance and energy consumption over homogeneous ones such as standard multi-core systems. Such heterogeneity represents one of the most promising trends for the near-future evolution of high performance computing hardware. However, heterogeneity is a double-edged sword: it also brings significant programming complexities that hinder the easy and efficient use of such systems. In this thesis, we are interested in four fundamental kinds of complexity associated with these heterogeneous systems: measurement complexity (the effort required to measure a metric, e.g., energy), CPU-GPU selection complexity, platform complexity and data management complexity. We explore new low-cost programming abstractions to hide these complexities, and propose new optimization techniques that can be performed under the hood.

For the measurement complexity, although measuring time is trivial with native library support, measuring energy consumption, especially for systems with GPUs, is complex because of the low-level details involved, such as choosing the right measurement method, handling the trade-off between sampling rate and accuracy, and switching between measurement metrics. We propose a clean interface, together with its implementation, that not only hides the complexity of energy measurement but also unifies different kinds of measurements. The unification bridges the gap between time measurement and energy measurement: provided that a time optimization technique makes no metric-specific assumptions, it can be reused unchanged for energy optimization.
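The idea of one interface for several metrics can be illustrated with a minimal C++ sketch. Everything named here (`WallTime`, `FakeEnergy`, `Meter`, `cheaper`) is hypothetical and chosen for illustration only; it is not the MeterPU API, and the energy metric is a simulated counter rather than a real sensor. The point is that `cheaper` is written once and works for any metric that provides `sample` and `diff`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical metric: wall-clock time in seconds.
struct WallTime {
    using Reading = std::chrono::steady_clock::time_point;
    static Reading sample() { return std::chrono::steady_clock::now(); }
    static double diff(Reading a, Reading b) {
        return std::chrono::duration<double>(b - a).count();
    }
};

// Hypothetical metric: a simulated cumulative energy counter in joules.
// A real implementation would read a hardware or driver counter instead.
struct FakeEnergy {
    using Reading = double;
    static double& counter() { static double joules = 0.0; return joules; }
    static Reading sample() { return counter(); }
    static double diff(Reading a, Reading b) { return b - a; }
};

// One meter type for every metric: start, stop, read the difference.
template <typename Metric>
class Meter {
public:
    void start() { begin_ = Metric::sample(); running_ = true; }
    void stop() {
        assert(running_ && "stop() called without start()");
        end_ = Metric::sample();
        running_ = false;
    }
    double value() const { return Metric::diff(begin_, end_); }
private:
    typename Metric::Reading begin_{};
    typename Metric::Reading end_{};
    bool running_ = false;
};

// Metric-agnostic selection: returns 0 if f is cheaper (or equal), 1 if g is.
// Nothing in this function assumes that the metric is time.
template <typename Metric, typename F, typename G>
int cheaper(F&& f, G&& g) {
    Meter<Metric> m;
    m.start(); f(); m.stop();
    const double cost_f = m.value();
    m.start(); g(); m.stop();
    const double cost_g = m.value();
    return cost_f <= cost_g ? 0 : 1;
}
```

Calling `cheaper<WallTime>(a, b)` and `cheaper<FakeEnergy>(a, b)` runs the same selection logic under two different metrics, which is the kind of reuse the unified interface is meant to allow.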
For the CPU-GPU selection complexity, which relates to efficient utilization of heterogeneous hardware, we propose a new adaptive-sampling-based mechanism for constructing predictors for such selections. The mechanism adapts to different hardware platforms automatically and shows non-trivial advantages over random sampling.

For the platform complexity, we propose a new modular platform modeling language and its implementation to formally and systematically describe a computer system, enabling zero-overhead platform information queries for high-level software tool chains and for programmers as a basis for making software adaptive.

For the data management complexity, we propose a new mechanism to enable a unified memory view on heterogeneous systems that have separate memory spaces. This mechanism enables programmers to write significantly less code, which runs as fast as expert-written code and outperforms the current commercially available solution, Nvidia's Unified Memory. We further propose two data movement optimization techniques, lazy allocation and transfer fusion. Both techniques are based on adaptively merging messages to reduce data transfer latency. We show that these techniques can be beneficial, and we prove that our greedy fusion algorithm is optimal.

Finally, we show that our approaches to handling the different complexities can be combined so that programmers can use them simultaneously. This research has been partly funded by two EU FP7 projects (PEPPHER and EXCESS) and SeRC.
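To make the idea of merging transfers concrete, here is a minimal C++ sketch of a greedy fusion pass. It is not the algorithm from the thesis: the types `Transfer` and `CostModel`, the function `fuse`, and the linear cost model (a fixed latency per transfer plus a per-byte cost) are all assumptions made for illustration. Two neighbouring byte ranges are merged whenever copying the gap between them costs less than paying the fixed latency a second time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pending copy: byte range [offset, offset + size) of one buffer.
struct Transfer {
    std::size_t offset;
    std::size_t size;
};

// Assumed linear cost model: fixed latency per transfer plus a per-byte cost.
struct CostModel {
    double latency;   // fixed cost paid once per transfer
    double per_byte;  // cost per byte moved
    double cost(std::size_t bytes) const {
        return latency + per_byte * static_cast<double>(bytes);
    }
};

// Greedy pass over transfers sorted by offset: merge a transfer into the
// previous one when moving the gap bytes is cheaper than one extra latency.
std::vector<Transfer> fuse(std::vector<Transfer> pending, const CostModel& model) {
    assert(model.latency >= 0.0 && model.per_byte >= 0.0);
    std::sort(pending.begin(), pending.end(),
              [](const Transfer& a, const Transfer& b) { return a.offset < b.offset; });
    std::vector<Transfer> fused;
    for (const Transfer& t : pending) {
        if (!fused.empty()) {
            Transfer& last = fused.back();
            const std::size_t last_end = last.offset + last.size;
            const std::size_t gap = t.offset > last_end ? t.offset - last_end : 0;
            if (model.per_byte * static_cast<double>(gap) < model.latency) {
                // Extend the previous transfer to cover the gap and this range.
                const std::size_t new_end = std::max(last_end, t.offset + t.size);
                last.size = new_end - last.offset;
                continue;
            }
        }
        fused.push_back(t);
    }
    return fused;
}

// Total cost of a list of transfers under the assumed model.
double total_cost(const std::vector<Transfer>& transfers, const CostModel& model) {
    double sum = 0.0;
    for (const Transfer& t : transfers) sum += model.cost(t.size);
    return sum;
}
```

Under this simplified model, and for non-overlapping ranges, each gap contributes either one extra latency or the cost of its own bytes, so the per-gap decision is independent of the others. Real systems add constraints (separate allocations, alignment, asynchronous streams) that this sketch ignores.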

Tags: Algorithms, Computer science, CUDA, Heterogeneous systems, nVidia, Package, Tesla C1060, Tesla C2050, Tesla K20, Tesla M2050, Thesis

March 10, 2018 by hgpu
