Extending the Scalability of Single Chip Stream Processors with On-chip Caches
University of British Columbia, Vancouver, BC, Canada
2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, CMP-MSI 2008, 2008
@article{bakhodaextending,
  title={Extending the Scalability of Single Chip Stream Processors with On-chip Caches},
  author={Bakhoda, A. and Aamodt, T.M.},
  year={2008},
  publisher={Citeseer}
}
As semiconductor scaling continues, more transistors can be put onto the same chip despite growing challenges in clock frequency scaling. Stream processor architectures can make effective use of these additional resources for appropriate applications. However, it is important that programmer effort be amortized across future generations of stream processor architectures. Current industry projections suggest a single chip may be able to integrate several thousand 64-bit floating-point ALUs within the next decade. Future designs will require significantly larger, scalable on-chip interconnection networks, which will likely increase memory access latency. While the capacity of the explicitly managed local store of current stream processor architectures could be enlarged to tolerate the added latency, existing stream processing software may require significant programmer effort to leverage such modifications. In this paper we propose a scalable stream processing architecture that addresses this issue. In our design, each stream processor has an explicitly managed local store model backed by an on-chip cache hierarchy. We evaluate our design using several parallel benchmarks to show the trade-offs of various cache and DRAM configurations. We show that the addition of a 256KB L2 cache per memory controller increases the performance of our 16-, 64-, and 121-node stream processor designs (containing 128, 896, and 1760 ALUs, respectively) by 14.5%, 54.9%, and 82.3% on average. We find that even those applications in our study that utilize the local store benefit significantly from the addition of L2 caches.
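For readers unfamiliar with the "explicitly managed local store" model the abstract refers to, the minimal CUDA-style sketch below (not taken from the paper; the kernel name, tile size, and array names are illustrative assumptions) shows the programming pattern in question: the programmer explicitly stages data from off-chip DRAM into a per-core scratchpad before computing on it. The paper's proposal keeps this software-managed model but backs it with a hardware on-chip cache hierarchy, so that rising interconnect and memory latencies in larger designs are tolerated without re-tuning code like this.

// Illustrative sketch only: a tiled vector-scale kernel that stages data
// through the explicitly managed local store (CUDA "shared memory") before
// computing on it. Names and sizes are hypothetical, not from the paper.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // elements staged into the local store per thread block

__global__ void scale_tiled(const float *in, float *out, int n, float alpha)
{
    __shared__ float tile[TILE];            // software-managed local store
    int idx = blockIdx.x * TILE + threadIdx.x;

    if (idx < n)
        tile[threadIdx.x] = in[idx];        // explicit copy: DRAM -> local store
    __syncthreads();                        // wait until the whole tile is resident

    if (idx < n)
        out[idx] = alpha * tile[threadIdx.x];  // compute out of the local store
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    scale_tiled<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);      // expect 84.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}

In this pattern the tile size is tuned to the local-store capacity; enlarging the local store to hide added latency would force such tuning to be redone, which is the programmer-effort problem the paper's cache-backed design aims to avoid.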
April 19, 2011 by hgpu