Scheduling Dataflow Execution Across Multiple Accelerators

Jon Currey, Adam Eversole, Christopher J. Rossbach
Microsoft Research
The 4th Workshop on Systems for Future Multicore Architectures (SFMA’14), 2014

Dataflow execution engines such as MapReduce, DryadLINQ and PTask have enjoyed success because they simplify development for a class of important parallel applications. Expressing the computation as a dataflow graph allows the runtime, rather than the programmer, to own problems such as synchronization, data movement and scheduling, leveraging dynamic information to inform strategy and policy in a way that is impossible for a programmer who must work only with a static view. While this vision has considerable intuitive appeal, the degree to which dataflow engines can implement performance-profitable policies in the general case remains under-evaluated. We consider the problem of scheduling in a dataflow engine on a platform with multiple GPUs. In this setting, the cost of moving data from one accelerator to another must be weighed against the benefit of the parallelism exposed by performing computations on multiple accelerators. An ideal runtime would automatically discover an optimal or near-optimal partitioning of computation across the available resources with little or no user input. The wealth of dynamic and static information available to a scheduler in this context makes the policy space for scheduling dauntingly large. We present and evaluate a number of approaches to scheduling and partitioning dataflow graphs across the available accelerators. We show that simple throughput- or locality-preserving policies do not always do well. For the workloads we consider, an optimal static partitioner operating on a global view of the dataflow graph is an attractive solution, either matching or out-performing a hand-optimized schedule. Application-level knowledge of the graph can be leveraged to achieve best-case performance.
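To make the core trade-off concrete, the sketch below brute-forces a static partition of a small dataflow graph across devices, charging each cross-device edge a transfer cost and approximating makespan as the busiest device's compute load plus all transfer time. This is an illustrative toy under our own assumed cost model, not the partitioner evaluated in the paper; all names (`schedule_cost`, `best_static_partition`) are our own.

```python
from itertools import product

def schedule_cost(nodes, edges, assignment, transfer_cost):
    """Estimate schedule cost for one device assignment.

    nodes: {node_name: compute_cost}
    edges: {(src, dst): data_size}
    assignment: {node_name: device_index}
    """
    load = {}
    for node, cost in nodes.items():
        dev = assignment[node]
        load[dev] = load.get(dev, 0.0) + cost
    # Every edge whose endpoints land on different devices pays
    # a data-movement penalty proportional to the data size.
    movement = sum(size * transfer_cost
                   for (src, dst), size in edges.items()
                   if assignment[src] != assignment[dst])
    # Crude makespan proxy: busiest device plus all transfer time.
    return max(load.values()) + movement

def best_static_partition(nodes, edges, n_devices, transfer_cost=1.0):
    """Exhaustively search all assignments (feasible only for tiny graphs)."""
    names = list(nodes)
    best_cost, best_assign = float("inf"), None
    for combo in product(range(n_devices), repeat=len(names)):
        assignment = dict(zip(names, combo))
        cost = schedule_cost(nodes, edges, assignment, transfer_cost)
        if cost < best_cost:
            best_cost, best_assign = cost, assignment
    return best_assign, best_cost

# Example: a diamond graph a -> {b, c} -> d on two GPUs.
nodes = {"a": 4.0, "b": 4.0, "c": 4.0, "d": 1.0}
edges = {("a", "b"): 1.0, ("a", "c"): 1.0,
         ("b", "d"): 1.0, ("c", "d"): 1.0}
assign, cost = best_static_partition(nodes, edges, 2, transfer_cost=0.5)
```

With cheap transfers the optimum splits the independent branches `b` and `c` across the two devices; raising `transfer_cost` eventually makes the all-on-one-device schedule win, which is exactly the tension a multi-accelerator scheduler must resolve.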

* * *

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL applications at hgpu.org. We provide 1 minute of compute time per run on two nodes, equipped with AMD and nVidia graphics processing units. There are no restrictions on the number of runs.

The platforms are:

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 1055T @ 2.8GHz
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.3
  • SDK: AMD APP SDK 3.0

A completed OpenCL project should be uploaded via the User dashboard (see the instructions and example there); compilation and execution terminal output logs will be provided to the user.

The information sent to hgpu.org will be treated according to our Privacy Policy.

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors

Contact us: