high performance computing on graphics processing units: hgpu.org

Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!

Rajesh Bordawekar, Uday Bondhugula, Ravi Rao

IBM T. J. Watson Research Center, Hawthorne, NY, USA

Proceedings of the 19th international conference on Parallel architectures and compilation techniques, 2010, p.537-538

DOI:10.1145/1854273.1854340

@conference{bordawekar2010believe,
  title={Believe it or not!: multi-core CPUs can match GPU performance for a FLOP-intensive application!},
  author={Bordawekar, R. and Bondhugula, U. and Rao, R.},
  booktitle={Proceedings of the 19th international conference on Parallel architectures and compilation techniques},
  pages={537--538},
  year={2010},
  organization={ACM}
}

In this work, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. The algorithm processes individual images represented as 2-dimensional matrices of single-precision floating-point values using O(n^4) operations involving dot-products and additions. We implement this algorithm on an nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. Pthreads and OpenMP with SSE and VSX vector intrinsics are used for the manually parallelized version, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The performance of this algorithm on the nVidia GPU suffers from: (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. On commodity multi-core processors, the application dataset is small enough to fit in caches, and when parallelized using a combination of task and short-vector data parallelism (via SSE/VSX) or through fully automatic optimization from the compiler, the application matches or beats the performance of the GPU version. The primary reasons for better multi-core performance include larger and faster caches, higher clock frequency, higher on-chip memory bandwidth, and better compiler optimization and support for parallelization. The best-performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
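
As a rough illustration of where the O(n^4) cost in the abstract comes from, the sketch below computes a brute-force 2-D cross-correlation of two n x n matrices: there are O(n^2) candidate shifts, and each one needs a dot product over up to n^2 overlapping elements. The abstract does not give the paper's exact formulation, so the shifted dot-product form, the function names `cross_correlate` and `best_shift`, and the choice of plain Python instead of CUDA or C with SSE/VSX intrinsics are all assumptions made for readability, not the authors' implementation.

```python
def cross_correlate(image, reference):
    """Brute-force 2-D cross-correlation of two equally sized n x n matrices.

    Entry [dy + n - 1][dx + n - 1] of the result is the dot product of
    `image` with `reference` shifted by (dy, dx), taken over the overlap.
    (Assumed formulation; the paper's exact kernel is not in the abstract.)
    """
    n = len(image)
    size = 2 * n - 1
    scores = [[0.0] * size for _ in range(size)]
    # The two outer loops enumerate O(n^2) independent shifts. Each shift
    # owns its own accumulator, so these iterations could in principle be
    # handed to separate threads without shared updates.
    for dy in range(-(n - 1), n):
        for dx in range(-(n - 1), n):
            acc = 0.0
            # The two inner loops are a dot product over the overlapping
            # region: up to n^2 multiply-adds, the natural SIMD candidate.
            for y in range(max(0, dy), min(n, n + dy)):
                img_row = image[y]
                ref_row = reference[y - dy]
                for x in range(max(0, dx), min(n, n + dx)):
                    acc += img_row[x] * ref_row[x - dx]
            scores[dy + n - 1][dx + n - 1] = acc
    return scores


def best_shift(scores):
    """Return the (dy, dx) shift with the highest correlation score."""
    n = (len(scores) + 1) // 2
    best = max(
        (scores[i][j], i - (n - 1), j - (n - 1))
        for i in range(len(scores))
        for j in range(len(scores))
    )
    return best[1], best[2]
```

In this form the outer shift loops correspond loosely to the task parallelism mentioned in the abstract and the innermost loop to the short-vector data parallelism; how the authors actually partitioned the work is not stated here.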

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 285, Performance, Programming techniques

February 26, 2011 by hgpu
