Extending the Scalability of Single Chip Stream Processors with On-chip Caches

Ali Bakhoda, Tor M. Aamodt
University of British Columbia, Vancouver, BC, Canada
2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, CMP-MSI 2008, 2008

@inproceedings{bakhodaextending,
  title={Extending the Scalability of Single Chip Stream Processors with On-chip Caches},
  author={Bakhoda, A. and Aamodt, T.M.},
  booktitle={2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects (CMP-MSI)},
  year={2008}
}

As semiconductor scaling continues, more transistors can be put onto the same chip despite growing challenges in clock frequency scaling. Stream processor architectures can make effective use of these additional resources for appropriate applications. However, it is important that programmer effort be amortized across future generations of stream processor architectures. Current industry projections suggest a single chip may be able to integrate several thousand 64-bit floating-point ALUs within the next decade. Future designs will require significantly larger, scalable on-chip interconnection networks, which will likely increase memory access latency. While the capacity of the explicitly managed local store of current stream processor architectures could be enlarged to tolerate the added latency, existing stream processing software may require significant programmer effort to leverage such modifications. In this paper we propose a scalable stream processing architecture that addresses this issue. In our design, each stream processor has an explicitly managed local-store model backed by an on-chip cache hierarchy. We evaluate our design using several parallel benchmarks to show the trade-offs of various cache and DRAM configurations. We show that the addition of a 256KB L2 cache per memory controller increases the performance of our 16-, 64-, and 121-node stream processor designs (containing 128, 896, and 1760 ALUs, respectively) by an average of 14.5%, 54.9%, and 82.3%, respectively. We find that even those applications in our study that utilize the local store benefit significantly from the addition of L2 caches.
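As background, the "explicitly managed local store" programming model the paper builds on can be sketched as follows. This is a minimal, hypothetical C illustration (the names `scale_stream`, `local`, and `TILE` are our own, not from the paper): software explicitly stages a tile of global data into a small local buffer, computes on it, and writes results back, rather than relying on a hardware cache to capture locality.

```c
#include <string.h>

#define TILE 4  /* illustrative local-store tile size (assumed, not from the paper) */

/* Sketch of a software-managed local store: data movement between the
 * large global memory and the small on-chip buffer is explicit in the
 * program, mimicking the DMA-style staging used by stream processors. */
static void scale_stream(const int *global_in, int *global_out, int n)
{
    int local[TILE];                       /* stand-in for the local store */
    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? (n - base) : TILE;
        memcpy(local, global_in + base, len * sizeof(int));   /* explicit load  */
        for (int i = 0; i < len; i++)
            local[i] *= 2;                                    /* compute on tile */
        memcpy(global_out + base, local, len * sizeof(int));  /* explicit store */
    }
}
```

The paper's proposal keeps this explicit model visible to the programmer while backing it with an on-chip L2 cache hierarchy, so existing code written against the local store still benefits from the caches.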
