high performance computing on graphics processing units: hgpu.org


Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

Yunsup Lee

Electrical Engineering and Computer Sciences, University of California at Berkeley

University of California at Berkeley, Technical Report No. UCB/EECS-2016-117, 2016

@article{lee2016decoupled,
  title={Decoupled Vector-Fetch Architecture with a Scalarizing Compiler},
  author={Lee, Yunsup},
  year={2016}
}


As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet a favorable compiler target that maintains the same level of programmability as the others. In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM’s stock OpenCL compiler.

Tags: Computer science, CUDA, Electronic design automation, Heterogeneous systems, LLVM, OpenCL, Performance, PTX, SoC, Thesis

June 9, 2016 by hgpu
