On Performance of GPU and DSP Architectures for Computationally Intensive Applications
University of Rhode Island
University of Rhode Island, 2013
@article{faella2013performance,
title={On Performance of GPU and DSP Architectures for Computationally Intensive Applications},
author={Faella, John},
year={2013}
}
This thesis focuses on implementations of a support vector machine (SVM) algorithm on a digital signal processor (DSP), a graphics processing unit (GPU), and a common Intel i7 core architecture. The purpose of this work is to identify which of the three is most suitable for SVM implementation. Performance is measured as the time each architecture requires per prediction. This work also provides an analysis of possible alternatives to existing implementations of computationally intensive algorithms, such as SVM. Several performance-improving methods were proposed and examined for the given DSP and GPU architectures. Both 4-class and 7-class implementations of the SVM algorithm were examined. On the system with an Intel i7-2720QM CPU at 2.2GHz, the execution times per prediction were 364ms for the 4-class implementation and 410ms for the 7-class implementation. On the Spectrum Digital TMS320C6713 DSP development board at 225MHz, the 4-class SVM implementation takes 125ms and the 7-class version takes 165ms. After careful examination of the DSP architecture, the following changes were implemented to improve performance: (1) the number of memory accesses is greatly reduced through programming techniques, (2) the L2 cache is better utilized, and (3) the number of branch statements is reduced. As a result, the run time improved from 125ms to only 9ms for the 4-class SVM, and from 165ms to 11ms for the 7-class implementation. On the Nvidia GeForce GT 540M graphics card at 1334MHz, the 4-class SVM needs 798ms and the 7-class implementation requires 845ms. Again, the GPU's architecture was investigated and the following were used to improve performance: (1) eliminating excessive memory accesses, (2) taking advantage of memory coalescing, and (3) using the reduction method. These improvements decreased the execution time from 798ms to 175ms for the 4-class SVM implementation and from 845ms to 200ms for the 7-class implementation.
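The "reduction method" named among the GPU optimizations is commonly understood as a pairwise (tree) reduction. The following is a minimal sketch of that access pattern in plain C, not the thesis's code. It assumes the reduced quantity is a sum of per-support-vector terms, which the abstract does not state, and the function name tree_reduce_sum is hypothetical. The loops run sequentially here; on a GPU, each inner-loop iteration would typically be handled by a separate thread, with a barrier between passes.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n floats by pairwise (tree) reduction, in place.
 * Sequential emulation of the GPU pattern: every iteration of the
 * inner loop is independent of the others in the same pass, so a GPU
 * could run them in parallel. The input buffer is overwritten. */
float tree_reduce_sum(float *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (n == 0) {
        return 0.0f;
    }
    /* The stride doubles on each pass: 1, 2, 4, ... */
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* Each surviving element absorbs its neighbour one stride away. */
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            buf[i] += buf[i + stride];
        }
    }
    return buf[0];
}
```

For n elements this takes about log2(n) passes rather than n sequential additions on one accumulator, which is why the pattern suits hardware with many threads.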
Because the three architectures studied here are incorporated in three very different systems running at different clock speeds, a direct comparison of the run times is not possible. The DSP system runs at a clock speed roughly 10 times slower than that of the Intel i7 core system, and achieved more than 20 times slower run times. We cannot directly extrapolate this result; however, we observed that the DSP does have drawbacks when implementing the SVM algorithm. The DSP processor was designed specifically to support computationally intensive DSP algorithms. However, the SVM algorithm differs somewhat from traditional DSP algorithms, and thus some DSP architectural features are not applicable. From the experimental results, we may observe that the GPU is the most suitable for the SVM algorithm. Even though it runs at a lower clock speed, about 60% of that of the Intel i7 core, the GPU outperforms its i7 counterpart once the performance improvement techniques are applied. This may be attributed to the GPU's architectural support for parallel computations and its flexibility to adapt to various computationally intensive algorithms.
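The clock-speed and improvement ratios quoted in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the helper names clock_ratio and speedup are hypothetical, and the inputs are the figures reported in the abstract.

```c
#include <assert.h>

/* Ratio of two clock speeds, both given in MHz. */
double clock_ratio(double a_mhz, double b_mhz)
{
    assert(b_mhz > 0.0);
    return a_mhz / b_mhz;
}

/* Speedup factor: run time before optimization over run time after,
 * both given in ms per prediction. */
double speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}
```

With the reported clocks, clock_ratio(2200, 225) is about 9.8, matching "roughly 10 times", and clock_ratio(1334, 2200) is about 0.61, matching "about 60%". With the reported times, speedup(125, 9) is about 13.9 and speedup(165, 11) is 15 for the DSP, while speedup(798, 175) is about 4.6 and speedup(845, 200) is about 4.2 for the GPU.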
March 2, 2013 by hgpu