high performance computing on graphics processing units: hgpu.org

Run-time support for multi-level disjoint memory address spaces

Javier Bueno Hedo

Departament d’Arquitectura de Computadors, Universitat Politecnica de Catalunya

Universitat Politecnica de Catalunya, 2015

High Performance Computing (HPC) systems have become widely used tools in many industries and research fields. Research to produce more powerful and efficient systems has grown in step with their popularity. As a consequence, the complexity of modern HPC architectures has increased in order to deliver the highest levels of performance. This increased complexity has also affected the way HPC systems are programmed. HPC users have to deal with new devices, languages and tools, and this can be a significant barrier to entry for people who do not have a deep knowledge of computer science.

Alongside the evolution of HPC systems, programming models have also evolved to ease the task of developing applications for these machines. Two well-known examples are OpenMP and MPI. The former can be used in shared memory systems and is praised for offering an easy methodology of software development. The latter is more popular because it targets distributed environments, but it is considered burdensome to use. Besides these two, many programming models have emerged to propose new methodologies or to handle new hardware devices. One of these models is OmpSs, a programming model for modern HPC systems that is based on OpenMP and StarSs. Developed by the Programming Models group at the Barcelona Supercomputing Center, it targets the latest generation of HPC systems while benefiting from the ease of use of OpenMP. OmpSs offers asynchronous parallelism through the concept of tasks with data dependencies. Tasks specify sections of code that can be executed in parallel, while the dependencies specify the restrictions on the order in which the tasks can be executed. With this, OmpSs programs can adapt to many different system configurations while fundamentally still being sequential programs with annotations.

This thesis explores the benefits of providing OmpSs with the capability to target architectures with complex memory hierarchies. An example of such systems is the new generation of clusters that use accelerators to boost their computing capabilities. The memory hierarchy of these machines is composed of a first level of distributed memory, formed by the memory of each individual node, and a second level formed by the private memory of each accelerator device. We propose a reference implementation that enables OmpSs programs to run on a cluster with or without accelerators while also providing competitive performance when compared with other programming models. We also discuss the enhancement of the OmpSs programming model with support for non-contiguous regions of data. Offering this feature allows applications with complex data accesses to be easily annotated with OmpSs, which is important for widening the spectrum of applications that the programming model can handle. We present an implementation and an evaluation of the performance and programmability impact of supporting non-contiguous memory regions.
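To make the idea of tasks with data dependencies concrete, the following is a minimal sketch in C, not code taken from the thesis. The function names (`fill`, `axpy`, `sum`, `pipeline`) are hypothetical, and the `in`, `out` and `inout` clauses follow OmpSs task syntax as commonly documented; treat the exact clause spelling as an assumption to check against the OmpSs specification. A plain C compiler without OpenMP or OmpSs support is expected to ignore the pragmas, so the same source also runs as an ordinary sequential program, which is the "sequential program with annotations" property described above.

```c
#include <stddef.h>

#define N 8

/* Writes the same value into every element of v. */
static void fill(double *v, size_t n, double value)
{
    for (size_t i = 0; i < n; i++)
        v[i] = value;
}

/* Updates y in place: y[i] = y[i] + alpha * x[i]. */
static void axpy(const double *x, double *y, size_t n, double alpha)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Stores the sum of all elements of v in *result. */
static void sum(const double *v, size_t n, double *result)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    *result = acc;
}

/* Hypothetical driver: each annotated call is a task, and the clauses
 * describe which data the task reads (in), writes (out) or both (inout).
 * The clause syntax is an assumption based on OmpSs conventions. */
double pipeline(void)
{
    double x[N], y[N], total = 0.0;

    /* The two fills touch different arrays, so they may run in parallel. */
    #pragma omp task out(x)
    fill(x, N, 1.0);

    #pragma omp task out(y)
    fill(y, N, 2.0);

    /* Reads x and updates y, so it must wait for both fills. */
    #pragma omp task in(x) inout(y)
    axpy(x, y, N, 3.0);

    /* Reads y, so it must wait for axpy. */
    #pragma omp task in(y) out(total)
    sum(y, N, &total);

    /* Wait for all tasks before the local arrays go out of scope. */
    #pragma omp taskwait
    return total;
}
```

Calling `pipeline()` from a `main` function yields the sum of eight elements, each equal to 2 + 3 * 1. The program order of the calls never changes; only the annotations tell a task-aware runtime which calls are independent and which must be ordered.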

Tags: Computer science, CUDA, MPI, nVidia, nVidia GeForce GTX 480, OmpSs, OpenMP, Thesis

December 15, 2015 by hgpu
