Revisiting the Case of ARM SoCs in High-Performance Computing Clusters


Tyler Fox

Brown University

Brown University, 2017

@phdthesis{fox2017revisiting,
  title={Revisiting the Case of ARM SoCs in High-Performance Computing Clusters},
  author={Fox, Tyler},
  year={2017},
  school={School of Engineering, Brown University}
}


Over the past decade, the explosive popularity of embedded devices such as smartphones and tablets has given rise to ARM SoCs, whose characteristically low power consumption has made them ideal for these types of devices. Recent maturation in the ARM SoC market, which has seen the advent of more powerful 64-bit ARM SoCs, has enabled the development of server-class machines that make use of ARM CPUs. These servers typically use several heavy-duty ARM CPU cores, as opposed to the heterogeneous setup of mobile-class SoCs, which typically integrate low-power CPU cores accompanied by GPU cores and hardware accelerators. In this thesis, we aim to characterize high-performance computing (HPC) clusters built using mobile-class ARM SoCs and determine whether they offer performance and efficiency benefits when compared to dedicated ARM servers and traditional x86 servers. To this end, we developed a custom HPC cluster of mobile-class NVIDIA Jetson TX1 developer boards with 10Gb network connectivity. We measure and analyze the performance, power consumption, and energy efficiency of the different setups using a variety of HPC and big data applications, ranging from classical numerical benchmarks to emerging deep learning applications. We also evaluate the benefits of using 10Gb network connectivity between TX1 nodes over the more traditional 1Gb connectivity. We show which HPC applications are best suited for evaluation using these types of clusters and give insights on how best to utilize the ARM architecture when creating clusters for a variety of applications; namely, good cache design and efficient use of silicon area, by including a GPGPU as opposed to additional CPU cores, are key to creating efficient ARM systems. Finally, we explore the differences between our cluster's distributed GPUs and a traditional system with a centralized, discrete GPU to determine which types of workloads are best suited for the distributed environment.

Tags: Benchmarking, Computer science, CUDA, Deep learning, Heterogeneous systems, nVidia, nVidia GeForce GTX 960, nVidia Tegra TX1, SoC, Thesis

October 21, 2017 by hgpu

