high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators

Simon Garcia De Gonzalo, Wen-Mei Hwu, Simon D. Hammond, Christian R. Trott

Department of Computer Science, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois

International Parallel and Distributed Processing Symposium, 2017

BibTeX

Download (PDF)

View

Source

1928

views

Kokkos [1], [2] is a C++ programming model that offers the ability to write portable code that targets a wide degree of parallelism found in current HPC systems. It works by providing abstractions for parallel execution and data layouts that are mapped to different hardware resources during compilation. Some parameters, such as the size of thread teams and vector width, are available for the application scientist to tune their code to a particular hardware platform. For many applications, choosing the right parameters can be highly dependent on the data input or application configuration, that the device characteristics themselves. Sparse-Matrix Vector products (SpMV) are highly irregular computational kernels that can be found in a diverse collection of high-performance science applications. Performance for this important kernel is often highly correlated with the associated matrix sparsity as this ultimately governs the granularity, and therefore, efficiency of the memory system being used. In this paper, we propose to extend the current set of Kokkos profiling tools with an autotuner that can iterate over possible choices for thread-team size and vector width, taking advantage of runtime information to choose the optimal parameters for a particular input. This approach allows an iterative application that calls the same kernel multiple times to continue to progress towards a solution while, at the same time, alleviating the burden from the application programmer of knowing details of the underlying hardware and accounting for variable inputs. We compare an approach using the autotuner against a fixed approach, that attempts to use all the hardware resources all the time, and show that the optimal choice made by the autotuner is significantly different among the two latest classes of accelerator architectures. After 100 iterations we identify which subset of the matrices benefit from improved performance, while others are near the break-even point, where the overhead of the tool has been completely hidden. We highlight the properties of sparse matrices that can help determine when autotuning will be of benefit. Finally, we connect the overhead of the autotuner to specific sparsity patterns and hardware resources.

Tags: Computer science, Intel Xeon Phi, nVidia, Performance, Sparse matrix, Tesla P100

July 15, 2018 by hgpu

Rating: 2.0/5. From 1 vote.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators

Share this:

Recent source codes

Most viewed papers (last 30 days)