Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation
Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China
In Languages and Compilers for Parallel Computing, Vol. 6548 (2011), pp. 151-165.
@article{chen2011unified,
  title={Unified parallel C for GPU clusters: language extensions and compiler implementation},
  author={Chen, L. and Liu, L. and Tang, S. and Huang, L. and Jing, Z. and Xu, S. and Zhang, D. and Shou, B.},
  journal={Languages and Compilers for Parallel Computing},
  volume={6548},
  pages={151--165},
  year={2011},
  publisher={Springer}
}
Unified Parallel C (UPC), a parallel extension of ANSI C, is designed for high-performance computing on large-scale parallel machines. With general-purpose graphics processing units (GPUs) becoming an increasingly important high-performance computing platform, we propose new language extensions to UPC to take advantage of GPU clusters. We extend UPC with hierarchical data distribution, revise its execution model to combine SPMD with a fork-join model, and modify the semantics of upc_forall to reflect data-thread affinity on a thread hierarchy. We implement the compilation system, including affinity-aware loop tiling, GPU code generation, and several memory optimizations targeting NVIDIA CUDA. We also propose unified data management for each UPC thread to optimize data transfer and memory layout across the separate memory modules of CPUs and GPUs. Experimental results show that the extended UPC offers better programmability than the mixed MPI/CUDA approach, and that the integrated compile-time and runtime optimizations are effective in achieving good performance on GPU clusters.
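For readers unfamiliar with the construct the paper extends, the following is a minimal sketch of standard UPC's upc_forall affinity semantics, the data-thread mapping that the proposed extensions generalize to a CPU/GPU thread hierarchy. The block size BLOCK and the vector-add kernel are illustrative assumptions, not taken from the paper:

#include <upc.h>
#include <stdio.h>

#define BLOCK 256

/* Block-distributed shared arrays: each UPC thread owns one
   contiguous block of BLOCK elements. */
shared [BLOCK] double a[BLOCK*THREADS];
shared [BLOCK] double b[BLOCK*THREADS];
shared [BLOCK] double c[BLOCK*THREADS];

int main(void) {
    int i;

    /* The fourth clause of upc_forall is the affinity expression:
       iteration i executes on the thread that has affinity to &a[i],
       so each thread initializes only its locally owned elements. */
    upc_forall (i = 0; i < BLOCK*THREADS; i++; &a[i]) {
        a[i] = (double)i;
        b[i] = 2.0 * i;
    }
    upc_barrier;

    /* The same affinity-driven distribution applies to the computation. */
    upc_forall (i = 0; i < BLOCK*THREADS; i++; &c[i])
        c[i] = a[i] + b[i];

    upc_barrier;
    if (MYTHREAD == 0)
        printf("c[1] = %f\n", c[1]);  /* expect 3.0 */
    return 0;
}

In standard UPC this affinity expression maps iterations onto a flat set of SPMD threads; the paper's extension reinterprets it over a thread hierarchy so that affinity can steer work onto GPU threads as well.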
July 1, 2011 by hgpu