Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads

Mark Silberstein, Assaf Schuster, John D. Owens
Computer Science, Technion – Israel Institute of Technology
Chapter in GPU Computing Gems, Jade Edition, edited by Wen-mei W. Hwu, 2011








In this talk we address two problems frequently encountered by GPU developers: optimizing memory access for kernels with complex, input-dependent access patterns, and mapping computations to a GPU or a CPU in composite applications with multiple dependent kernels. Both require dynamic adaptation and tuning of execution policies to achieve high performance across a wide range of inputs. We first describe our methodology for solving the memory optimization problem via software-managed caching, which efficiently exploits the fast scratchpad memory. This technique outperforms the cache-less and texture-memory-based approaches on pre-Fermi GPU architectures, as well as an approach that relies on the Fermi hardware cache alone. We then present a static scheduling algorithm for minimizing the total running time of a complete application comprising multiple kernels with tree-structured data dependencies. Either a GPU or a CPU can execute each kernel, but the performance varies greatly across inputs. The algorithm takes a graph-based approach to optimizing the running time: it evaluates the performance of all assignments jointly, including the communication overhead due to the data dependencies between the kernels. The algorithm can also be applied to minimize energy consumption at the expense of higher runtimes, in which case it provides a provably optimal solution. We demonstrate these techniques by applying them to a real application for computing the probability of evidence in probabilistic networks. The combination of memory optimization and dynamic assignment yields up to a three-fold runtime reduction over the non-optimized version on real inputs, and up to a five-fold reduction over a highly optimized parallel version running on Intel’s latest dual quad-core 16-thread Nehalem machine.
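To make the assignment problem concrete, the sketch below shows one simple way to choose a device for every kernel in a dependency tree when the objective is an additive cost, such as total energy or the runtime of a fully sequential execution. It is a textbook dynamic program over the tree, not the chapter's algorithm, and every name in it (`assign_devices`, `run_cost`, `transfer_cost`) as well as all the cost values are assumptions made up for illustration. Each kernel has a cost per device, and a transfer cost is paid whenever a kernel and its parent run on different devices.

```python
DEVICES = ("cpu", "gpu")


def assign_devices(children, run_cost, transfer_cost, root):
    """Pick a device per kernel so that the summed cost over the tree is minimal.

    children[k]      -> list of kernels whose output kernel k consumes
    run_cost[k][d]   -> hypothetical cost of running kernel k on device d
    transfer_cost[k] -> hypothetical cost of moving k's output across devices
    Returns (minimal total cost, {kernel: device}).
    """
    # Visit parents before children, then process the list in reverse so that
    # every child is solved before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    best = {}    # best[k][d]: minimal cost of k's subtree if k runs on d
    choice = {}  # choice[k][d]: device chosen for each child in that case
    for node in reversed(order):
        best[node], choice[node] = {}, {}
        for dev in DEVICES:
            total, picks = run_cost[node][dev], {}
            for child in children.get(node, []):
                # A child on a different device pays its transfer cost.
                cost, child_dev = min(
                    (best[child][d] + (transfer_cost[child] if d != dev else 0.0), d)
                    for d in DEVICES
                )
                total += cost
                picks[child] = child_dev
            best[node][dev], choice[node][dev] = total, picks

    # Walk back down from the root to recover the chosen devices.
    root_dev = min(DEVICES, key=lambda d: best[root][d])
    assignment, stack = {}, [(root, root_dev)]
    while stack:
        node, dev = stack.pop()
        assignment[node] = dev
        stack.extend(choice[node][dev].items())
    return best[root][root_dev], assignment


# Made-up example: kernel "root" consumes the outputs of kernels "a" and "b".
children = {"root": ["a", "b"]}
run_cost = {
    "root": {"cpu": 4.0, "gpu": 1.0},
    "a": {"cpu": 1.0, "gpu": 5.0},
    "b": {"cpu": 6.0, "gpu": 2.0},
}
transfer_cost = {"a": 0.5, "b": 0.5}
cost, plan = assign_devices(children, run_cost, transfer_cost, "root")
# cost is 4.5: "root" and "b" run on the GPU, "a" runs on the CPU.
```

In this example kernel "a" stays on the CPU even though its parent runs on the GPU, because the transfer cost of 0.5 is smaller than the penalty of running "a" on the GPU. The sketch covers only an additive objective; minimizing wall-clock time when the CPU and GPU run kernels concurrently is a different problem that it does not attempt to solve.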
