high performance computing on graphics processing units: hgpu.org


Automatic library generation for BLAS3 on GPUs

Huimin Cui, Lei Wang, Jingling Xue, Yang Yang, Xiaobing Feng

Institute of Computing Technology, Chinese Academy of Sciences, China

IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2011

DOI:10.1109/IPDPS.2011.33


High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to manually tune libraries to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone work of manual tuning, by enabling them to leverage their expertise and reuse past optimization experiences. We demonstrate the improved performance and productivity obtained by using our framework to tune BLAS3 routines on three GPU platforms: speedups over CUBLAS of up to 5.4x on NVIDIA GeForce 9800, 2.8x on GTX285, and 3.4x on Fermi Tesla C2050. Our results highlight the potential benefits of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).

Tags: Algorithms, Computer science, CUBLAS, Linear Algebra, nVidia, nVidia GeForce 9800 GTX, nVidia GeForce GTX 285, Optimization, Tesla C2050

December 13, 2011 by hgpu

