A dataflow-like programming model for future hybrid clusters

Jens Breitbart
Research Group Programming Languages / Methodologies, University of Kassel, Wilhelmshöher Allee 73, 34121 Kassel, Germany
International Journal of Networking and Computing, Vol 3, No 1, 2013










It is expected that the first exascale supercomputer will be deployed within the next 10 years; however, neither its CPU architecture nor its programming model is known yet. Multicore CPUs are not expected to scale to the required number of cores per node, but hybrid multicore CPUs consisting of different kinds of processing elements are expected to solve this issue. They come at the cost of increased software development complexity, e.g., missing cache coherency and on-chip NUMA effects. It is unclear whether MPI and OpenMP will scale to exascale systems and support easy development of scalable and efficient programs. One of the programming models considered as an alternative is the so-called partitioned global address space (PGAS) model, which is targeted at easy development by providing one common memory address space across all cluster nodes. In this paper, we first outline current and possible future hardware and introduce a new abstract hardware model able to describe hybrid clusters. We discuss how current shared memory, GPU, and PGAS programming models can deal with the upcoming hardware challenges and describe how synchronization can generate unneeded inter- and intra-node transfers if the memory consistency model is not optimal. As a major contribution, we introduce our variation of the PGAS model, which allows implicit fine-grained pair-wise synchronization among the nodes and the different kinds of processors. We furthermore offer easy deployment of RDMA transfers and provide communication algorithms commonly used in MPI collective operations, but lift the requirement that the operations be collective. Our model is based on single-assignment variables and uses a dataflow-like synchronization mechanism. Reading an uninitialized variable blocks the reading thread until the data are made available by another thread. That way, synchronization is done implicitly when data are read.
Explicit tiling is used to reduce synchronization overhead and to increase cache and network utilization. Broadcast, scatter, and gather are modeled based on data distribution among the nodes, whereas reduction and scan follow a combining PRAM approach of having multiple threads write to the same memory location. We discuss the Gauss-Seidel stencil, bitonic sort, FFT, and a manual scan implementation in our model. We implemented a proof-of-concept library showing the usability and scalability of the model. With this library, the Gauss-Seidel stencil scaled well in initial experiments on an 8-node machine, and we show that it is easy to keep two GPUs and multiple cores busy when computing a scan.
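
The blocking-read idea from the abstract can be illustrated with a small single-process sketch. This is not the paper's library or its API: the class `SingleAssignmentTile` and the functions `producer`, `consumer`, and `run_demo` are hypothetical names chosen for illustration, and Python threads stand in for the nodes and processors of a hybrid cluster. Each tile may be written exactly once, and a read blocks until the write has happened, so the reader needs no explicit barrier. Grouping elements into tiles means one synchronization per tile rather than one per element.

```python
import threading


class SingleAssignmentTile:
    """Hypothetical tile of values: written exactly once, reads block until then."""

    def __init__(self):
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._data = None

    def write(self, values):
        # Single assignment: a second write is treated as a programming error.
        with self._lock:
            if self._ready.is_set():
                raise RuntimeError("tile was already assigned")
            self._data = tuple(values)
            self._ready.set()

    def read(self, timeout=None):
        # Implicit synchronization: wait until some thread has written the tile.
        if not self._ready.wait(timeout):
            raise TimeoutError("tile was never assigned")
        return self._data


def producer(tiles, tile_size):
    # Fill each tile with consecutive integers; one write (and thus one
    # synchronization event) per tile, not per element.
    for index, tile in enumerate(tiles):
        start = index * tile_size
        tile.write(range(start, start + tile_size))


def consumer(tiles, result):
    # Each read waits for its own tile only, so the consumer can run ahead
    # of or behind the producer without any explicit barrier.
    result.append(sum(sum(tile.read(timeout=5.0)) for tile in tiles))


def run_demo(num_tiles=4, tile_size=8):
    tiles = [SingleAssignmentTile() for _ in range(num_tiles)]
    result = []
    reader = threading.Thread(target=consumer, args=(tiles, result))
    writer = threading.Thread(target=producer, args=(tiles, tile_size))
    reader.start()  # started first on purpose: it blocks on the first tile
    writer.start()
    reader.join()
    writer.join()
    return result[0]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the reader thread is started before the writer, yet it still sees every tile fully written, because the only ordering constraint is the per-tile data dependency. A real PGAS implementation would additionally have to move tiles between address spaces; that part is outside what this example shows.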