high performance computing on graphics processing units: hgpu.org

A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs

Rahul Garg

School of Computer Science, McGill University, Montreal

McGill University, 2015

@phdthesis{garg2015toolkit,
  title={A TOOLKIT FOR BUILDING DYNAMIC COMPILERS FOR ARRAY-BASED LANGUAGES TARGETING CPUS AND GPUS},
  author={Garg, Rahul},
  year={2015},
  school={McGill University, Montr{\'e}al}
}

Source code package: RaijinCL

Array-based languages such as MATLAB and Python (with NumPy) have become very popular for scientific computing. However, implementations of these languages often perform poorly; some, for example, are interpreted. Further, these languages were not designed with multi-core CPUs and GPUs in mind and thus do not take full advantage of modern hardware. Developing just-in-time (JIT) compilers that allow scientific programmers to efficiently target both CPUs and GPUs from these languages is therefore of increasing interest. However, building such compilers requires considerable effort. Prior to this thesis, there were no reusable compiler toolkits for array-based languages, even though many of the compilation challenges are similar across languages.

This thesis presents two novel, reusable tools, Velociraptor and RaijinCL, that simplify the work of building JIT compilers for array-based languages targeting both CPUs and GPUs. Velociraptor is a reusable, embeddable dynamic compiler toolkit, while RaijinCL is an auto-tuning high-performance matrix operations library. Velociraptor provides a new high-level intermediate representation (IR) called VRIR, which has been designed specifically for numeric computations, with rich support for arrays plus support for high-level parallel and GPU constructs. A compiler developer uses Velociraptor by generating VRIR for key parts of an input program. Velociraptor provides an optimizing compiler toolkit for generating CPU and GPU code, and also provides a smart runtime system to manage the GPU.

An important contribution of the thesis is a new dynamic compilation technique called region specialization, designed specifically for numerical programs. Region specialization first performs region detection analysis, a novel compiler analysis that identifies regions of code that may be interesting for the compiler to analyze, such as loops and library calls involving arrays. Region detection analysis also identifies parameters, such as shapes and values of variables, that are critical for inferring the properties of the region. Region specialization then dynamically generates specialized code for the region based on the runtime values of the parameters identified by region detection analysis.

To demonstrate that Velociraptor is not tied to a single language or compiler toolchain, we present two case studies of using Velociraptor: a proof-of-concept Python compiler targeting CPUs and GPUs, and a GPU extension for a MATLAB JIT. We evaluated both compilers against production-level static and dynamic compilers on a variety of benchmarks. We demonstrate that the CPU code generated by our toolkit is either competitive with, or outperforms, the available tools, and that our parallel CPU and GPU code generation can provide further speedups.

RaijinCL was implemented to support matrix library operations found in array-based languages, such as matrix multiplication, on the GPU. RaijinCL is an auto-tuning high-performance matrix operations library that runs across GPUs from many vendors. We present detailed experimental results from many different GPU architectures to show that it is competitive with vendor-tuned libraries. Finally, while much of the literature about matrix libraries for GPUs focuses on discrete GPUs, we also demonstrate a prototype extension to RaijinCL that maximizes system performance on single-chip CPU/GPU systems by using both the CPU and GPU.
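The idea behind region specialization, as the abstract describes it, can be illustrated with a small Python sketch. This is not Velociraptor's API or the thesis's implementation: `RegionSpecializer`, `generate_matmul`, and `shape_of` are hypothetical names, array shapes are assumed to be the only specialization parameters, and a Python closure stands in for dynamically generated machine code. The sketch only shows the caching pattern: one specialized version of a region is produced per distinct set of runtime parameter values and reused afterwards.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
Shape = Tuple[int, int]


def shape_of(m: Matrix) -> Shape:
    """Return (rows, cols) of a list-of-lists matrix."""
    return (len(m), len(m[0]) if m else 0)


def generate_matmul(shape_a: Shape, shape_b: Shape) -> Callable[[Matrix, Matrix], Matrix]:
    """Hypothetical 'code generator': builds a matmul specialized to fixed shapes.

    The shape check runs once here, at specialization time, so the
    returned closure does not need to repeat it on every call.
    """
    n, k = shape_a
    k2, m = shape_b
    if k != k2:
        raise ValueError("inner dimensions do not match")

    def specialized(a: Matrix, b: Matrix) -> Matrix:
        out = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                acc = 0.0
                for p in range(k):
                    acc += a[i][p] * b[p][j]
                out[i][j] = acc
        return out

    return specialized


class RegionSpecializer:
    """Caches one specialized version of a region per runtime parameter key."""

    def __init__(self, generate: Callable[[Shape, Shape], Callable[[Matrix, Matrix], Matrix]]):
        self._generate = generate
        self._cache: Dict[Tuple[Shape, Shape], Callable[[Matrix, Matrix], Matrix]] = {}
        self.compile_count = 0  # number of specializations generated so far

    def __call__(self, a: Matrix, b: Matrix) -> Matrix:
        key = (shape_of(a), shape_of(b))  # assumed specialization parameters
        fn = self._cache.get(key)
        if fn is None:
            fn = self._generate(*key)
            self._cache[key] = fn
            self.compile_count += 1
        return fn(a, b)


matmul = RegionSpecializer(generate_matmul)
```

Calling `matmul` repeatedly with matrices of the same shapes reuses the cached version; a call with new shapes triggers one more specialization. A real dynamic compiler would also have to decide which parameters are worth specializing on, which is the role the abstract assigns to region detection analysis.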

Tags: AMD Radeon HD 8650, ATI, ATI Radeon HD 7970, Benchmarking, Code generation, Computer science, Intel HD Graphics 4000, Matlab, Matrix multiplication, nVidia, nVidia GeForce GT 650 M, OpenCL, Package, Python, Tesla C2050, Thesis

October 6, 2015 by hgpu
