high performance computing on graphics processing units: hgpu.org

A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures

Ali Charara, David Keyes, Hatem Ltaief

Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah 23955, Saudi Arabia

King Abdullah University of Science and Technology, 2016

@article{charara2016framework,
  title={A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures},
  author={Charara, Ali and Keyes, David E and Ltaief, Hatem},
  journal={Submitted to CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE},
  year={2016}
}

Source code package: KBLAS

We present a new high-performance framework for dense triangular BLAS kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This work extends previous work by the same authors on a single GPU (Charara et al., Euro-Par, 2016). In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion tree. In addition, a multi-GPU implementation of TRMM and TRSM is proposed and shows almost linear performance scaling as the number of GPUs increases. Finally, the recursive formulation of these triangular BLAS kernels is oblivious to the targeted hardware architecture. We therefore port these recursive kernels to homogeneous x86 hardware architectures by relying on vendor-optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement over state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library.
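The recursive formulation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common recursive splitting for a left-sided, lower-triangular solve L X = B: solve with the top-left block, update the lower part of B with a matrix-matrix multiplication, then solve with the bottom-right block. This is not the KBLAS code; the function names (`trsm_recursive`, `trsm_base`, `gemm_sub`) and the `base` cutoff are hypothetical, and plain Python lists stand in for the customized CUDA kernels or vendor BLAS calls that a real implementation would use at the bottom of the recursion tree.

```python
def gemm_sub(C, A, B):
    """In-place update C -= A * B on nested lists (stand-in for a GEMM call)."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            C[i][j] -= sum(A[i][k] * B[k][j] for k in range(len(B)))


def trsm_base(L, B):
    """Forward substitution: overwrite B with X such that L X = B.

    Stand-in for the small kernel called at the bottom of the recursion.
    """
    for i in range(len(L)):
        for j in range(len(B[0])):
            s = B[i][j] - sum(L[i][k] * B[k][j] for k in range(i))
            B[i][j] = s / L[i][i]


def trsm_recursive(L, B, base=2):
    """Solve L X = B for lower-triangular L, overwriting B with X.

    Assumed splitting: L = [[L11, 0], [L21, L22]], B = [B1; B2].
    """
    n = len(L)
    if n <= base:
        trsm_base(L, B)
        return B
    h = n // 2
    L11 = [row[:h] for row in L[:h]]
    L21 = [row[:h] for row in L[h:]]
    L22 = [row[h:] for row in L[h:]]
    # Slicing the outer list shares the row objects, so updates land in B.
    B1, B2 = B[:h], B[h:]
    trsm_recursive(L11, B1, base)   # X1 = L11^{-1} B1
    gemm_sub(B2, L21, B1)           # B2 -= L21 X1
    trsm_recursive(L22, B2, base)   # X2 = L22^{-1} B2
    return B


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0, 0.0],
         [1.0, 3.0, 0.0, 0.0],
         [4.0, 5.0, 6.0, 0.0],
         [7.0, 8.0, 9.0, 10.0]]
    X_true = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    # Build the right-hand side B = L * X_true, then recover X_true.
    B = [[sum(L[i][k] * X_true[k][j] for k in range(4)) for j in range(2)]
         for i in range(4)]
    X = trsm_recursive(L, B, base=1)
    print(all(abs(X[i][j] - X_true[i][j]) < 1e-9
              for i in range(4) for j in range(2)))
```

In this sketch most of the arithmetic for large matrices ends up in the `gemm_sub` step, and nothing in the recursion itself depends on the target hardware; only the base-case solve and the multiplication would need to be swapped for device-specific routines.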

Tags: BLAS, Computer science, CUDA, Intel Xeon Phi, Linear Algebra, Matrix multiplication, nVidia, Package, Tesla K40

January 8, 2017 by hgpu
