
Principles, Techniques, and Tools for Explicit and Automatic Parallelization

Nico Reissmann
Norwegian University of Science and Technology, Faculty of Information Technology and Electrical Engineering, Department of Computer Science
Norwegian University of Science and Technology, 2019

@phdthesis{reissmann2019principles,
   title={Principles, Techniques, and Tools for Explicit and Automatic Parallelization},
   author={Reissmann, Nico},
   school={Norwegian University of Science and Technology},
   year={2019}
}


The end of Dennard scaling also brought an end to frequency scaling as a means to improve performance. Chip manufacturers had to abandon frequency and superscalar scaling as processors became increasingly power constrained. An architecture's power budget became the limiting factor for performance gains, and computations had to be performed more energy-efficiently. Designers turned to chip multiprocessors (CMPs), and developers began to employ specialized architectures, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to further improve performance while staying within the power envelope. The exploitation of parallelism in an energy-efficient manner became the primary way forward.

Until the end of Dennard scaling, programs experienced transparent performance gains with every new processor generation. However, CMPs, GPUs, and FPGAs rely on the static extraction of parallelism to improve performance, and programs need to be modified to take advantage of these architectures. Thus, performance gains are no longer achieved transparently, and developers and tools are forced to face new, as well as long-neglected, challenges in program parallelization. These challenges include the detection and encoding of potential parallelism in automatic approaches, application portability issues on GPUs, and performance portability issues on CMPs. It is essential to address these challenges, as the continuous increase in computer performance now relies solely on the exploitation of parallelism. This thesis consists of three parts, each addressing one of these challenges.

The first part addresses the detection and encoding of potential parallelism in automatic approaches. It presents the Regionalized Value State Dependence Graph (RVSDG) as an alternative intermediate representation (IR) for optimizing and parallelizing compilers. The RVSDG exposes the hierarchical structure of programs and explicitly models the dependencies between computations, permitting the explicit encoding of concurrent operations and program structures such as conditionals, loops, and functions. This helps to expose the inherent parallelism in programs and their structures by employing well-known methods for the extraction of instruction-level parallelism.

The second part addresses application portability issues on GPUs. A GPU's specialized architecture is optimized for highly regular data-parallel applications, but compromises program performance for workloads with irregular control flow, potentially leading to redundant code execution. We propose a control flow restructuring method to eliminate repeated code execution on GPUs and thereby potentially improve performance.

The third part addresses performance portability on CMPs. This issue arises when developers overfit their application to a specific architecture, which results in suboptimal performance for different program inputs or different architectures. We improve performance analysis for OpenMP programs by addressing the scalability challenges of the grain graph visualization method. We present an aggregation method for grain graphs that hierarchically groups related nodes into a single node. The aggregated graph can then be navigated by progressively uncovering nodes with performance issues while hiding unrelated regions of the graph. This enhances productivity by enabling developers to understand performance problems of highly parallel OpenMP programs more easily.
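As an illustration of how such an IR can make structure and dependencies explicit, the following is a minimal C++ sketch of RVSDG-like data structures. The type names (Node, Region, GammaNode, ThetaNode, LambdaNode) follow the terminology of the RVSDG literature, but the layout is a simplified assumption for illustration, not the thesis' actual implementation:

    #include <memory>
    #include <string>
    #include <vector>

    // A node is an operation whose value dependencies are explicit edges.
    struct Node {
        std::string op;               // e.g. "add", "load"
        std::vector<Node*> operands;  // explicit dependencies on other nodes
        virtual ~Node() = default;
    };

    // A region is an acyclic graph of nodes; structural nodes nest regions,
    // exposing the program's hierarchical structure.
    struct Region {
        std::vector<std::unique_ptr<Node>> nodes;
    };

    // Gamma node: models a conditional, with one subregion per alternative.
    struct GammaNode : Node {
        std::vector<Region> subregions;
    };

    // Theta node: models a tail-controlled loop as a single nested region.
    struct ThetaNode : Node {
        Region body;  // loop results feed back into the next iteration
    };

    // Lambda node: models a function whose body is again a region.
    struct LambdaNode : Node {
        Region body;
    };

Two nodes in the same region with no dependency path between them can, by construction, execute concurrently; their independence is immediately visible from the absence of operand edges.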
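The redundant execution targeted by the control flow restructuring can be illustrated with a small example. The sketch below uses plain C++ in place of GPU kernel code, and f, g, and common are hypothetical placeholders; the rewrite only illustrates the effect of removing repeated code from divergent paths, not the thesis' actual method:

    #include <cmath>

    float f(float x)      { return x * 2.0f; }      // branch-specific work
    float g(float x)      { return x + 1.0f; }      // branch-specific work
    float common(float x) { return std::sqrt(x); }  // work shared by both paths

    // Before: threads of a warp that disagree on p execute the two branches
    // serially, so the shared call to common() runs once per divergent path.
    float before(bool p, float x) {
        if (p) { x = f(x); x = common(x); }
        else   { x = g(x); x = common(x); }
        return x;
    }

    // After restructuring: the branch reconverges before the shared tail,
    // so common() executes only once regardless of divergence.
    float after(bool p, float x) {
        x = p ? f(x) : g(x);
        return common(x);
    }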
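Similarly, the idea behind grain graph aggregation and progressive disclosure can be sketched in a few lines. The GrainNode type, its issue flag, and the traversal below are illustrative assumptions, not the thesis' data model:

    #include <cstdio>
    #include <string>
    #include <vector>

    // A node is either a single grain (e.g. a task or loop chunk) or a
    // group that summarizes its children.
    struct GrainNode {
        std::string name;
        double overhead = 0.0;            // some aggregated performance metric
        bool has_issue = false;           // flagged by a performance criterion
        std::vector<GrainNode> children;  // empty for a leaf grain
    };

    bool subtree_has_issue(const GrainNode& n) {
        if (n.has_issue) return true;
        for (const auto& c : n.children)
            if (subtree_has_issue(c)) return true;
        return false;
    }

    // Progressive disclosure: expand only groups whose subtree contains a
    // flagged grain; healthy regions of the graph stay collapsed.
    void show(const GrainNode& n, int depth = 0) {
        std::printf("%*s%s (overhead %.2f)\n",
                    depth * 2, "", n.name.c_str(), n.overhead);
        if (!subtree_has_issue(n)) return;  // collapsed: hide children
        for (const auto& c : n.children) show(c, depth + 1);
    }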
The insights and techniques developed by addressing these three challenges may result in improved methods and tools for the exploitation of parallelism. The RVSDG is a promising IR for parallelizing compilers, as it permits the encoding of concurrent computations. The grain graph offers developers a familiar structural view of a program together with its performance issues. In the future, these ideas must be cast into mature tools to make them applicable in practice and to foster further research.