high performance computing on graphics processing units: hgpu.org


Optimization of Hierarchical Matrix Computation on GPU

Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota

Kyushu University, Fukuoka, Japan

Supercomputing Frontiers. Lecture Notes in Computer Science, vol 10776. Springer, 2018

DOI:10.1007/978-3-319-69953-0_16


The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, computation with H-matrices is more complex than that with dense and sparse matrices; thus, accelerating H-matrix computation is required. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare their execution times with those of OpenMP implementations on various processors (Broadwell-EP, Skylake-SP, and Knights Landing). The results show that, although HMVM can be computed as many small GEMV kernels, merging such kernels into a single GPU kernel was the most effective implementation. Moreover, the performance of BATCHED BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel.
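
To make the structure of HMVM concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The block layout (a flat list of dictionaries with hypothetical keys `kind`, `row`, `col`, `a`, `u`, `v`) and the helper names `gemv`, `gemv_t`, and `hmvm` are assumptions made for illustration. Dense blocks contribute one small GEMV each, while low-rank blocks, stored as factors U and V with the block approximated by U V^T, contribute two. The single loop over all blocks loosely mirrors the idea of one merged kernel that walks every block instead of launching one small GEMV at a time.

```python
# Illustrative sketch only: block layout and names are assumptions,
# not taken from the paper.

def gemv(a, x):
    """Dense matrix-vector product; `a` is a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def gemv_t(a, x):
    """Product of the transpose of `a` with `x`."""
    ncols = len(a[0])
    return [sum(a[i][j] * x[i] for i in range(len(a))) for j in range(ncols)]

def hmvm(blocks, x, n_rows):
    """Accumulate y = H x by visiting every block of H in one pass."""
    y = [0.0] * n_rows
    for blk in blocks:
        r0, c0 = blk["row"], blk["col"]
        if blk["kind"] == "dense":
            a = blk["a"]
            part = gemv(a, x[c0:c0 + len(a[0])])
        else:
            # Low-rank block: U is m-by-k, V is n-by-k, block ~ U V^T.
            u, v = blk["u"], blk["v"]
            t = gemv_t(v, x[c0:c0 + len(v)])  # V^T x, length k
            part = gemv(u, t)                 # U t, length m
        for i, p in enumerate(part):
            y[r0 + i] += p
    return y

# A 4x4 example: dense diagonal blocks, rank-1 off-diagonal blocks.
blocks = [
    {"kind": "dense", "row": 0, "col": 0, "a": [[2, 1], [1, 2]]},
    {"kind": "dense", "row": 2, "col": 2, "a": [[3, 0], [0, 3]]},
    {"kind": "lowrank", "row": 0, "col": 2, "u": [[1], [2]], "v": [[1], [1]]},
    {"kind": "lowrank", "row": 2, "col": 0, "u": [[1], [0]], "v": [[0], [1]]},
]
x = [1, 2, 3, 4]
y = hmvm(blocks, x, 4)  # [11.0, 19.0, 11.0, 12.0]
```

On a GPU, each iteration of the block loop would more plausibly map to a thread block or a warp, with the accumulation into `y` needing atomic updates or a reduction. Those details are left out of this sketch.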

Tags: Computer science, CUDA, Intel Xeon Phi, Linear Algebra, nVidia, OpenMP, Tesla P100

March 25, 2018 by hgpu

