{"id":18872,"date":"2019-05-05T09:26:27","date_gmt":"2019-05-05T06:26:27","guid":{"rendered":"https:\/\/hgpu.org\/?p=18872"},"modified":"2019-05-05T09:26:27","modified_gmt":"2019-05-05T06:26:27","slug":"an-architectural-journey-into-risc-architectures-for-hpc-workloads","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=18872","title":{"rendered":"An Architectural Journey into RISC Architectures for HPC Workloads"},"content":{"rendered":"<p>The race to exascale (i.e., 10^18 floating-point operations per second) and the slowdown of Moore&#8217;s law are posing unprecedented challenges to the whole High-Performance Computing (HPC) community. Computer architects, system integrators and software engineers studying programming models for handling parallelism are especially called to the rescue in a moment like the one in which we are living. While studying the current HPC market, a careful observer can notice that i) the dominance of a single architecture (x86) is fading; ii) as a consequence, new CPU architectures and accelerators are gaining relevance (e.g., RISC CPUs and GP-GPUs); and iii) new workloads from Industry 4.0 and the automotive sector (e.g., machine learning) demand ever more computational resources, thus driving the development of next-generation computational systems. This thesis explores the intersection of these three observations by evaluating the current state of the art of emerging RISC architectures in HPC (Arm and RISC-V). It studies the performance, the instantaneous power consumption, and the total energy spent to reach the solution of a scientific problem on heterogeneous Systems-on-Chip (SoCs). Four platforms were tested: two heterogeneous Arm platforms (CPU+GPU and CPU+FPGA), one RISC-V platform, and one open-source RISC-V core running on an FPGA. The thesis adds value in two ways: A.
The platforms were evaluated using a machine-learning test case based on the k-means clustering algorithm. The test case targets predictive maintenance and failure detection and was provided by an industrial partner. While preparing this master&#8217;s thesis, I was involved in the research activities of the collaboration between the Barcelona Supercomputing Center (BSC) and Aingura IIoT. B. Testing the k-means algorithm on the RISC-V core required implementing a System-on-Chip that allows interaction with the core. Even though the Ariane core itself is freely available online, adding peripherals for minimal I\/O operations and performance counters required careful FPGA work in a hardware description language (SystemVerilog). As expected, the more mature Arm Cortex-A57 processor outperformed the other platforms, and the best RISC-V platform performed as well as the Arm Cortex-A9. Among the heterogeneous platforms, the CPU+GPU system achieved the best performance, but the CPU+FPGA system used less energy when only the active power of the execution is considered. The document places special emphasis on the reproducibility of the experiments. It explains step by step how to set up an FPGA-based research platform using an open-source RISC-V core and how to use the hardware performance counters defined in RISC-V to measure performance.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The race to exascale (i.e., 10^18 floating-point operations per second) and the slowdown of Moore&#8217;s law are posing unprecedented challenges to the whole High-Performance Computing (HPC) community. 
Computer architects, system integrators and software engineers studying programming models for handling parallelism are especially called to the rescue in a moment like the one [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[11,3],"tags":[1238,1782,377,1682,390],"class_list":["post-18872","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-paper","tag-arm","tag-computer-science","tag-fpga","tag-hpc","tag-thesis"],"views":1953,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/18872","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=18872"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/18872\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=18872"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=18872"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2
%2Ftags&post=18872"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}