high performance computing on graphics processing units: hgpu.org

On Performance of GPU and DSP Architectures for Computationally Intensive Applications

John Faella

University of Rhode Island

University of Rhode Island, 2013

@article{faella2013performance,
  title={On Performance of GPU and DSP Architectures for Computationally Intensive Applications},
  author={Faella, John},
  year={2013}
}

This thesis focuses on implementations of a support vector machine (SVM) algorithm on a digital signal processor (DSP), a graphics processing unit (GPU), and a common Intel i7 core architecture. The purpose of this work is to identify which of the three is most suitable for SVM implementation. Performance is measured as the time each architecture requires per prediction. This work also provides an analysis of possible alternatives to existing implementations of computationally intensive algorithms such as SVM. Several performance-improving methods are proposed and examined for the given DSP and GPU architectures, and both 4-class and 7-class implementations of the SVM algorithm are examined.

On the system with an Intel i7-2720QM CPU at 2.2GHz, the execution times per prediction were 364ms for the 4-class implementation and 410ms for the 7-class implementation.

On the Spectrum Digital TMS320C6713 DSP development board at 225MHz, the 4-class SVM implementation takes 125ms and the 7-class version takes 165ms. After careful examination of the DSP architecture, the following were implemented to improve performance: (1) the number of memory accesses is greatly reduced through programming techniques, (2) the L2 cache is better utilized, and (3) the number of branch statements is reduced. As a result, the run time improved from 125ms to only 9ms for the 4-class SVM and from 165ms to 11ms for the 7-class implementation.

On the Nvidia GeForce GT 540M graphics card at 1334MHz, the 4-class SVM needs 798ms and the 7-class implementation requires 845ms. Again, the GPU’s architecture was investigated and the following were used to improve performance: (1) eliminating excessive memory accesses, (2) taking advantage of memory coalescing, and (3) using the reduction method. These improvements decreased the execution time from 798ms to 175ms for the 4-class SVM implementation and from 845ms to 200ms for the 7-class implementation.

Because the three architectures studied here are incorporated in three very different systems running at different clock speeds, a direct comparison of run times is not possible. The DSP system runs at roughly one tenth the clock speed of the Intel i7 system and achieved run times more than 20 times slower. We cannot directly extrapolate this result; however, we observed that the DSP does have drawbacks when implementing the SVM algorithm. The DSP was designed specifically to support computationally intensive DSP algorithms, but the SVM algorithm differs somewhat from traditional DSP algorithms, so some DSP architectural features are not applicable. From the experimental results, we may observe that the GPU is the most suitable for the SVM algorithm. Even though it runs at a lower clock speed, about 60% of that of the Intel i7 core, with the performance improvement techniques the GPU outperforms its i7 counterpart. This may be attributed to the GPU’s architectural support for parallel computation and its flexibility to adapt to various computationally intensive algorithms.
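
The abstract names its optimizations without showing code, so the following is only a rough C sketch of what a few of them might look like, not the thesis's implementation. It assumes a linear kernel, a one-vs-rest multi-class scheme, and row-major weight storage, none of which the abstract specifies. The names `dot`, `svm_predict`, and `tree_reduce_sum` are hypothetical. Memory coalescing is a property of GPU thread access patterns and is not captured here; `tree_reduce_sum` only mirrors, sequentially, the halving pattern that a parallel reduction typically follows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch. Assumed (not stated in the abstract): linear kernel,
 * one-vs-rest classification, one contiguous weight row per class. */

/* Dot product accumulated in a local variable, so the running sum can stay
 * in a register instead of being stored to memory on every iteration. */
static float dot(const float *w, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += w[i] * x[i];
    return acc;
}

/* Returns the index of the class with the largest score w_c . x + b_c.
 * weights holds n_classes rows of n_features values, read sequentially. */
int svm_predict(const float *weights, const float *bias, const float *x,
                size_t n_classes, size_t n_features)
{
    assert(n_classes > 0);
    int best = 0;
    float best_score = dot(weights, x, n_features) + bias[0];
    for (size_t c = 1; c < n_classes; ++c) {
        float s = dot(weights + c * n_features, x, n_features) + bias[c];
        /* Conditional expressions rather than an if-block; a compiler may
         * lower these to select or predicated instructions. */
        int better = s > best_score;
        best = better ? (int)c : best;
        best_score = better ? s : best_score;
    }
    return best;
}

/* Tree-style sum: each pass halves the active range. On a GPU, every value
 * of i in the inner loop would correspond to one thread of a block.
 * Overwrites buf; n must be a power of two. */
float tree_reduce_sum(float *buf, size_t n)
{
    for (size_t stride = n / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; ++i)
            buf[i] += buf[i + stride];
    return buf[0];
}
```

With a 4-class weight matrix `W` of three features per class, a call such as `svm_predict(W, b, x, 4, 3)` returns the winning class index. Whether the branch-reduced form actually helps depends on the compiler and target, so it would need to be measured on the hardware in question.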

Tags: Algorithms, Computer science, CUDA, DSP, nVidia, nVidia GeForce GT 540 M, Programming techniques, Thesis

March 2, 2013 by hgpu
