high performance computing on graphics processing units: hgpu.org


Improving Synchronization and Data Access in Parallel Programming Models

Ettore Speziale

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano

Politecnico di Milano, 2013

Today, parallel architectures are the main vector for exploiting available die area. The shift from architectures tuned for sequential programming models to ones optimized for parallel processing follows from the inability to further enhance sequential performance because of the power and memory walls. On the other hand, efficiently exploiting parallel computing units is a hard task. Indeed, obtaining performance improvements requires careful tuning of applications, as proven by years of High Performance Computing using MPI. To lower the burden of parallel programming, parallel programming models expose a simplified view of the hardware by relying on abstract parallel constructs, such as parallel loops or tasks. These constructs are mapped onto parallel processing units by a mix of optimizing compilers and run-time techniques. However, given the huge number of very different parallel architectures available, hiding low-level details often prevents performance from being comparable with that of hand-tuned code. This dissertation aims at analyzing inefficiencies related to the usage of parallel computing units and at optimizing them from the runtime perspective. In particular, we analyze the optimization of reduction computations when performed together with barrier synchronizations. Moreover, we show how runtime techniques can exploit affinity between data and computations to limit as much as possible the performance penalty hidden in NUMA architectures, in both the OpenMP and MapReduce settings. We then observe how a lightweight JIT compilation approach could enable better exploitation of parallel architectures, and lastly we analyze the resilience of synchronization primitives, a basic building block of all parallel programs, to fault induction.
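
As a rough illustration of the pattern the abstract refers to, a reduction performed together with a barrier synchronization, the sketch below lets each thread publish one partial value and then combines the partials at the moment the barrier trips. It is not taken from the dissertation and does not reproduce its optimization; the function name `barrier_reduce` and the choice of a sum as the reduction operator are assumptions made for the example, which uses only Python's standard `threading.Barrier`.

```python
import threading


def barrier_reduce(values):
    """Sum one value per thread, combining the partials when the barrier trips.

    Hypothetical helper for illustration only; not the dissertation's method.
    """
    n = len(values)
    partials = [0] * n   # one slot per thread, so writes never collide
    seen = [None] * n    # what each thread observes after the barrier
    result = {}

    def combine():
        # Called by exactly one thread once all n threads have arrived,
        # before any of them is released from the barrier.
        result["sum"] = sum(partials)

    barrier = threading.Barrier(n, action=combine)

    def worker(i):
        partials[i] = values[i]   # publish this thread's contribution
        barrier.wait()            # synchronize; the action does the reduction
        seen[i] = result["sum"]   # every thread reads the same reduced value

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["sum"], seen


if __name__ == "__main__":
    total, seen = barrier_reduce([1, 2, 3, 4])
    print(total, seen)
```

Folding the reduction into the barrier's action means the threads synchronize once rather than once for the barrier and again for the reduction, which is the kind of overlap the abstract alludes to.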

Tags: Computer science, CUDA, MapReduce, MPI, nVidia, OpenCL, OpenMP, Thesis

May 13, 2013 by hgpu
