{"id":17121,"date":"2017-04-07T00:49:17","date_gmt":"2017-04-06T21:49:17","guid":{"rendered":"https:\/\/hgpu.org\/?p=17121"},"modified":"2017-04-07T00:49:17","modified_gmt":"2017-04-06T21:49:17","slug":"in-datacenter-performance-analysis-of-a-tensor-processing-unit","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=17121","title":{"rendered":"In-Datacenter Performance Analysis of a Tensor Processing Unit"},"content":{"rendered":"<p>Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU) &#8211; deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps\/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU&#8217;s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, &#8230;) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters&#8217; NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X &#8211; 30X faster than its contemporary GPU or CPU, with TOPS\/Watt about 30X &#8211; 80X higher. Moreover, using the GPU&#8217;s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS\/Watt to nearly 70X the GPU and 200X the CPU.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU) &#8211; deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). 
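To make the "8-bit MAC" phrase concrete, below is a minimal NumPy sketch of the arithmetic style of quantized inference: 8-bit activations and weights are multiplied and accumulated into wider integers, then rescaled back to floating point. This is an illustration of the general technique, not the TPU's actual datapath; the matrix shapes and quantization scales are hypothetical example values.

import numpy as np

def quantize(x, scale):
    """Map float32 values to int8 with a simple symmetric per-tensor scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Hypothetical activation and weight matrices (float32 reference).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 256)).astype(np.float32)   # activations
W = rng.standard_normal((256, 8)).astype(np.float32)   # weights

a_scale, w_scale = 0.05, 0.02   # assumed example scales
A_q = quantize(A, a_scale)
W_q = quantize(W, w_scale)

# 8-bit multiplies accumulated in int32 so partial sums do not overflow;
# this multiply-accumulate (MAC) pattern is what the matrix unit performs.
acc = A_q.astype(np.int32) @ W_q.astype(np.int32)

# Rescale the integer result back to float for comparison with the reference.
approx = acc.astype(np.float32) * (a_scale * w_scale)
reference = A @ W
print("max abs error:", np.abs(approx - reference).max())

Accumulating the int8 products in a wider integer type is what lets a low-precision multiplier array still produce accurate inference results: each output element sums hundreds of partial products, which would quickly overflow an 8-bit register.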