high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors

Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors

Linchuan Chen

Ohio State University, Computer Science and Engineering

The Ohio State University, 2015

BibTeX

Download (PDF)

View

Source

2133

views

Because of the bottleneck in the increase of clock frequency, multi-cores emerged as a way of improving the overall performance of CPUs. In the recent decade, many-cores begin to play a more and more important role in scientific computing. The highly cost-effective nature of many-cores makes them extremely suitable for data-intensive computations. Specifically, many-cores are in the forms of GPUs (e.g., NVIDIA or AMD GPUs) and more recently, coprocessers (Intel MIC). Even though these highly parallel architectures offer significant amount of computation power, it is very hard to program them, and harder to fully exploit the computation power of them. Combing the power of multi-cores and many-cores, i.e., making use of the heterogeneous cores is extremely complicated. Our efforts have been made on performing optimizations to important sets of applications on such parallel systems. We address this issue from the perspective of communication patterns. Scientific applications can be classified based on the properties (communication patterns), which have been specified in the Berkeley Dwarfs many years ago. By investigating the characteristics of each class, we are able to derive efficient execution strategies, across different levels of the parallelism. We design a high-level programming API, as well as implement an efficient runtime system with pattern-specific optimizations, considering the characteristics of the hardware platform. Thus, instead of providing a general programming model, we provide separate APIs for each communication pattern. We have worked on a selected subset of the communication patterns, including MapReduce, generalized reductions, irregular reductions, stencil computations and graph processing. Our targeted platforms are single GPUs, coupled CPU-GPUs, heterogeneous clusters, and Intel Xeon Phis. Our work not only focuses on efficiently executing a communication pattern on a single multi-core or many-core, but also considers inter-device and inter-node task scheduling. While implementing a specific communication pattern, we consider aspects including lock-reducing, data locality, and load balancing. Our work starts with the optimization of the MapReduce on a single GPU, specifically aiming to efficiently utilize the shared memory. We design a reduction based approach, which is able to keep the memory consumption low by avoiding the storage of intermediate key-value pairs. To support such an approach, we design a general data structure, referred to as the reduction object, which is placed in the memory hierarchy of the GPU. The limited memory requirement of the reduction object allows us to extensively utilize the small but fast shared memory. Our approach performs well for a popular set of MapReduce applications, especially the reduction intensive ones. The comparison with former state-of-art accelerator based approaches shows that our approach is much more efficient at utilizing the shared memory. Even though MapReduce significantly reduces the complexity of parallel programming, it is not easy to achieve efficient execution, for complicated applications, on heterogeneous clusters with multi-core and multiple GPUs within each node. In view of this, we design a programming framework, which aims to reduce the programming difficulty, as well as provide automatic optimizations to applications. Our approach is to classify applications based on communication patterns. The patterns we study include Generalized Reductions, Irregular Reductions and Stencil Computations, which are important ones that are frequently used in scientific and data intensive computations. For each pattern, we design a simple API, as well as a runtime with pattern-specific optimizations at different parallelism levels. Besides, we also investigate graph applications. We design a graph processing system over the Intel Xeon Phi and CPU. We design a vertex-centric programming API, and a novel condensed static message buffer that supports less memory consumption and SIMD message reduction. We also use a pipelining scheme to avoid frequent locking. The hybrid graph partitioning is able to achieve load balance between CPU and Xeon Phi, as well as to reduce the communication overhead. Executing irregular applications on SIMD architectures is always challenging. The irregularity leads to problems including poor data access locality, data dependency, as well as inefficient utilization of SIMD lanes. We propose a general optimization methodology for irregular applications, including irregular reductions, graph algorithms and sparse matrix matrix multiplications. The key observation of our approach is that the major data structures accessed by irregular applications can be treated as sparse matrices. The steps of our methodology include: matrix tiling, data access pattern identification, and conflict removal. As a consequence, our approach is able to efficiently utilize both SIMD and MIMD parallelism on the Intel Xeon Phi.

Tags: Computer science, CUDA, Heterogeneous systems, MapReduce, Matrix multiplication, MPI, nVidia, Sparse matrix, Task scheduling, Tesla M2070, Thesis

October 13, 2015 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors

Share this:

Recent source codes

Most viewed papers (last 30 days)