
Throughput-Effective On-Chip Networks for Manycore Accelerators

Ali Bakhoda, John Kim, Tor M. Aamodt
ECE Dept., Univ. of British Columbia, Vancouver, BC, Canada
43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010

@conference{bakhoda2010throughput,
   title={Throughput-Effective On-Chip Networks for Manycore Accelerators},
   author={Bakhoda, A. and Kim, J. and Aamodt, T.M.},
   booktitle={2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture},
   pages={421--432},
   issn={1072-4451},
   year={2010},
   organization={IEEE}
}



As the number of cores and threads in manycore compute accelerators such as graphics processing units (GPUs) increases, so does the importance of on-chip interconnection network design. This paper explores throughput-effective networks-on-chip (NoCs) for future manycore accelerators that employ bulk-synchronous parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is “throughput-effective” if it improves parallel application-level performance per unit chip area. We evaluate the performance of future-looking workloads using detailed closed-loop simulations that model the compute nodes, the NoC, and the DRAM memory system. We start from a mesh design whose bisection bandwidth is balanced with off-chip demand. Accelerator workloads tend to demand high off-chip memory bandwidth, which results in a many-to-few traffic pattern when coupled with the expected technology constraint of slow growth in pins per chip. Leveraging these observations, we reduce NoC area by proposing a “checkerboard” NoC that alternates between conventional full-routers and half-routers with limited connectivity. Checkerboard employs a new oblivious routing algorithm that maintains a minimum hop count for architectures that place L2 cache banks at the half-router nodes. Next, we show that increasing network injection bandwidth for the large amount of read-reply traffic at the nodes connected to DRAM controllers alleviates a significant fraction of the remaining imbalance resulting from the many-to-few traffic pattern. Combined with an improved placement of memory controllers in the mesh and with channel slicing, these optimizations improve application throughput per unit area by 25.4%.
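Two ideas in the abstract can be illustrated with a minimal Python sketch: the alternating full-router/half-router pattern of a checkerboard mesh, and the "throughput-effective" metric of performance per unit chip area. This is not the authors' implementation. The function names, the choice of which parity holds the half-routers, and the numbers in the usage lines are hypothetical placeholders, not results from the paper.

```python
def checkerboard_layout(rows, cols, half_parity=1):
    """Label each mesh node 'F' (full-router) or 'H' (half-router).

    Assumption: nodes whose (row + col) parity equals `half_parity`
    get half-routers; the abstract only says the two types alternate.
    """
    return [
        ["H" if (r + c) % 2 == half_parity else "F" for c in range(cols)]
        for r in range(rows)
    ]


def throughput_per_area(throughput, area):
    """Throughput-effectiveness: application throughput per unit chip area."""
    if area <= 0:
        raise ValueError("area must be positive")
    return throughput / area


def relative_gain(base_tp, base_area, new_tp, new_area):
    """Fractional change in throughput per unit area versus a baseline."""
    base = throughput_per_area(base_tp, base_area)
    new = throughput_per_area(new_tp, new_area)
    return new / base - 1.0


if __name__ == "__main__":
    # Print a small 4x4 checkerboard mesh.
    for row in checkerboard_layout(4, 4):
        print(" ".join(row))
    # Hypothetical numbers: same throughput on 20% less area.
    print(relative_gain(100.0, 10.0, 100.0, 8.0))
```

In this layout, every mesh neighbor of a half-router is a full-router and vice versa. The metric also shows that an optimization which only reduces area, with throughput unchanged, still counts as throughput-effective.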
