Parallel heterogeneous Branch and Bound algorithms for multi-core and multi-GPU environments

Imen Chakroun
Laboratoire d’Informatique Fondamentale de Lille (UMR CNRS 8022), Centre de Recherche INRIA Lille Nord Europe
tel-00841965, (8 July 2013)

   title={Algorithmes Branch and Bound parall{‘e}les h{‘e}t{‘e}rog{‘e}nes pour environnements multi-coeurs et multi-GPU},

   author={Chakroun, Imen},


   school={Universit{‘e} des Sciences et Technologie de Lille-Lille I}


Download Download (PDF)   View View   Source Source   



Branch and Bound (B&B) algorithms are attractive for solving to optimality combinatorial optimization problems (COPs) by exploring a tree-based search space. Nevertheless, they are highly time-intensive when dealing with large problem instances (e.g. Taillard’s FSP benchmarks) even using grid computing [Mezmaz et al., IEEE IPDPS’2007]. Massively parallel computing supplied through today’s heterogeneous (GPU-enhanced multicore) platforms [TOP500 ] is required to tackle more efficiently those instances. The challenge is therefore to exploit all the underlying levels of parallelism and thus to rethink accordingly the parallel models of B&B. In this thesis, we revisit the design and implementation of B&B for solving large COPs on (large) multi-core and multi-GPU platforms. The Flow-Shop scheduling problem (FSP) is considered as a case study. A preliminary experimental study on some large FSP instances has revealed that the search tree is highly irregular (in shape and size) and very large (billions of billions of nodes), and the bounding operator is time-exorbitant (about 97% of B&B). Therefore, our first contribution is to propose a (single CPU core) GPU-accelerated approach (GB&B) in which only the bounding operator is performed on the GPU device. The approach deals with two issues: thread divergence [Chakroun et al., Concurrency and Computation: Practice and Experience 2012] and device hierarchical memory optimization [Melab et al., IEEE Cluster 2012]. Compared to a single CPU core-based implementation, speed-ups up to (x100) are obtained on Nvidia Tesla C2050. Although these good speed-ups, the performance analysis has shown that the overhead induced by the data transfer between CPU and GPU is high. Therefore, the aim of the second contribution [Chakroun et al., ICCS 2013] is to extend the approach (LL-GB&B) in order to minimize the CPU-GPU communication latency. Such objective is achieved through a GPU-based fine-grained parallelization of the branching and pruning operators in addition to the bounding one. The major and particularly challenging issue addressed here is thread divergence due to the strongly irregular nature of the explored tree mentioned above. Compared to a single CPU-based execution, LL-GB&B allows accelerations up to (x160) for large problem instances. The third contribution [Chakroun et al., Journal of Parallel and Distributed Computing, 2013] consists in investigating the combination of GPU with multi-core processing. Two scenarios have been explored leading to two approaches: a concurrent (RLL-GB&B) and a cooperative one (PLL-GB&B). In the first one, the exploration process is performed concurrently by the GPU and the CPU cores. In the cooperative approach, the CPU cores prepare and off-load to GPU pools of subproblems using data streaming while the GPU performs the exploration. When combining multi-core and GPU, we figure out that using RLL-GB&B is not beneficial while PLL-GB&B enables an improvement up to (36%) compared to LL-GB&B. Recently computational grids such as Grid5000 (on some sites) have been enhanced with GPU accelerators, therefore the fourth contribution of this thesis is to address the combination of GPU and multi-core computing with large scale distributed computing. To do that, the different revisited algorithms have been put together in a heterogeneous meta-algorithm which automatically selects the one to be deployed according to the target hardware configuration. The meta-algorithm is coupled with the B&B@Grid approach proposed in [Mezmaz et al., IEEE IPDPS’2007]. B&B@Grid distributes the work units (search subspaces coded by intervals) among the grid nodes while the metaalgorithm selects and applies locally a parallel B&B algorithm on the received intervals. The combined approach allowed us to solve to optimality and efficiently some Taillard’s FSP instances (20 jobs on 20 machines).
VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)

* * *

* * *

Follow us on Twitter

HGPU group

1578 peoples are following HGPU @twitter

Like us on Facebook

HGPU group

294 people like HGPU on Facebook

* * *

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

The platforms are

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.3
  • SDK: AMD APP SDK 3.0

Completed OpenCL project should be uploaded via User dashboard (see instructions and example there), compilation and execution terminal output logs will be provided to the user.

The information send to hgpu.org will be treated according to our Privacy Policy

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors

Contact us: