Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads


Mark Silberstein, Assaf Schuster, John D. Owens

Computer Science, Technion – Israel Institute of Technology

Chapter in GPU Computing Gems, Jade Edition, edited by Wen-mei W. Hwu, 2011

@article{silberstein2011applying,
  title={Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads},
  author={Silberstein, M. and Schuster, A. and Owens, J.D.},
  year={2011}
}


Source codes

Package:

Sum-product GPU kernel


In this talk we address two problems frequently encountered by GPU developers: optimizing memory access for kernels with complex input-dependent access patterns, and mapping the computations to a GPU or a CPU in composite applications with multiple dependent kernels. Both require dynamic adaptation and tuning of execution policies to allow high performance for a wide range of inputs. We first describe our methodology for solving the memory optimization problem via software-managed caching that efficiently exploits the fast scratchpad memory. This technique outperforms the cache-less and texture-memory-based approaches on pre-Fermi GPU architectures, as well as the approach that relies on the Fermi hardware cache alone. We then present a static scheduling algorithm for minimizing the total running time of a complete application comprising multiple kernels with tree data dependencies. Both a GPU and a CPU can be used to execute the kernels, but the performance varies greatly for different inputs. The algorithm takes a graph-based approach to optimizing the running time, evaluating the performance of all the assignments jointly, including the communication overhead due to the data dependencies between the kernels. The algorithm can also be applied to minimizing energy consumption at the expense of higher runtimes, in which case it provides a provably optimal solution. We demonstrate these techniques by applying them to a real application for computing the probability of evidence in probabilistic networks. The combination of memory optimization and dynamic assignment results in an up to three-fold runtime reduction over the non-optimized version on real inputs, and up to five-fold over a highly optimized parallel version running on Intel’s latest dual quad-core 16-thread Nehalem machine.
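To make the joint-assignment idea concrete, here is a minimal sketch of a CPU/GPU assignment over a kernel tree. It is not the authors' algorithm. It assumes an additive objective (for example energy, or running time when kernels execute one after another), a per-kernel cost on each device, and a fixed transfer cost charged whenever a kernel and its parent run on different devices. The function name `assign_devices`, the data layout, and all cost numbers are hypothetical. Under these assumptions a bottom-up dynamic program over the tree finds the minimum-cost assignment.

```python
from typing import Dict, List, Tuple

DEVICES = ("cpu", "gpu")


def assign_devices(
    children: Dict[str, List[str]],
    cost: Dict[str, Dict[str, float]],
    transfer: Dict[str, float],
    root: str,
) -> Tuple[float, Dict[str, str]]:
    """Minimize the sum of kernel costs plus cross-device transfer costs.

    children[k] -- kernels whose output kernel k consumes (hypothetical layout)
    cost[k][d]  -- assumed cost of running kernel k on device d
    transfer[k] -- assumed cost of moving k's output to the other device
    """
    # Visit parents before children, then process in reverse (children first).
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    best: Dict[str, Dict[str, float]] = {}  # best[k][d]: cheapest subtree cost
    picks: Dict[Tuple[str, str], List[Tuple[str, str]]] = {}

    for node in reversed(order):
        best[node] = {}
        for dev in DEVICES:
            total = cost[node][dev]
            chosen = []
            for child in children.get(node, []):
                # Cost of the child's subtree, plus a transfer if devices differ.
                def edge(d: str) -> float:
                    return best[child][d] + (transfer[child] if d != dev else 0.0)

                child_dev = min(DEVICES, key=edge)
                total += edge(child_dev)
                chosen.append((child, child_dev))
            best[node][dev] = total
            picks[(node, dev)] = chosen

    # Pick the cheaper device for the root and walk back down the choices.
    root_dev = min(DEVICES, key=lambda d: best[root][d])
    assignment: Dict[str, str] = {}
    stack2 = [(root, root_dev)]
    while stack2:
        node, dev = stack2.pop()
        assignment[node] = dev
        stack2.extend(picks[(node, dev)])
    return best[root][root_dev], assignment


# Made-up example: kernel "r" consumes the outputs of kernels "a" and "b".
children = {"r": ["a", "b"]}
cost = {
    "r": {"cpu": 4.0, "gpu": 1.0},
    "a": {"cpu": 1.0, "gpu": 5.0},
    "b": {"cpu": 6.0, "gpu": 2.0},
}
transfer = {"a": 3.0, "b": 1.0}
total, plan = assign_devices(children, cost, transfer, "r")
print(total, plan)
```

In this toy input the cheapest plan keeps `a` on the CPU even though its parent runs on the GPU: the transfer cost is smaller than the penalty of running `a` on the GPU. A greedy per-kernel choice that ignores communication can miss such trade-offs. Minimizing the actual running time when the CPU and GPU execute concurrently is a different, harder objective that this sketch does not cover.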

Tags: Algorithms, Computer science, CUDA, nVidia, nVidia GeForce 8800 GTX, nVidia GeForce GTX 285, Optimization, Package, Performance, Probability, Task scheduling, Tesla C2050

October 5, 2011 by hgpu
