high performance computing on graphics processing units: hgpu.org

Hardware thread reordering to boost OpenCL throughput on FPGAs

Amir Momeni, Hamed Tabkhi, Gunar Schirner, David Kaeli

ECE Department, Northeastern University, Boston, MA

International Conference on Computer Design (ICCD), 2016

@article{momenihardware,
  title={Hardware thread reordering to boost OpenCL throughput on FPGAs},
  author={Momeni, Amir and Tabkhi, Hamed and Schirner, Gunar and Kaeli, David},
  year={2016}
}

The availability of OpenCL for FPGAs has raised new questions about the efficiency of massive thread-level parallelism on FPGAs. The general trend is toward deep pipelining and in-order execution of many OpenCL threads across a shared data-path. While this approach can be very effective for regular kernels, its efficiency diminishes significantly for irregular kernels with runtime-dependent control flow, so new approaches are needed to improve the execution efficiency of FPGAs when targeting irregular OpenCL kernels. This paper proposes a novel solution, called Hardware Thread Reordering (HTR), to boost the throughput of FPGAs when executing irregular kernels with non-deterministic runtime control flow. The key insight of HTR is out-of-order execution of OpenCL threads over a shared data-path to achieve significantly higher throughput. Thread reordering is performed at basic-block granularity. The synthesized basic blocks are extended with independent pipeline control signals and context registers to bypass the live values of reordered threads. We demonstrate the efficiency of the proposed solution on three parallel irregular kernels. For the experiments, we use the LegUp tool to compare the baseline (in-order) data-path with the HTR-enhanced data-path. Our RTL simulation results demonstrate that the HTR-enhanced data-path achieves up to an 11X increase in kernel throughput at a very low overhead (less than a 2X increase in FPGA resources).
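To make the in-order versus reordered contrast concrete, the following is a toy cycle-count model written for illustration only. It is not the paper's microarchitecture, not LegUp's behavior, and not a source for the reported numbers. It assumes a single pipelined loop basic block of a hypothetical `depth`, one thread issued per cycle, and a per-thread trip count known only at run time. In the in-order variant, a finished thread keeps circulating as a bubble until every older thread has left; in the reordered variant, it leaves as soon as it finishes. The names `simulate`, `trips`, and `depth` are all hypothetical.

```python
from collections import deque


def simulate(trips, depth, reorder):
    """Count cycles for all threads to leave one pipelined loop block.

    trips[i] is the number of passes thread i needs (assumed >= 1).
    depth is the pipeline depth of the block (assumed >= 1).
    reorder=False: threads must leave in issue order (toy in-order model).
    reorder=True:  a thread leaves as soon as it is finished (toy reordering).
    """
    waiting = deque(range(len(trips)))   # threads not yet issued
    loopback = deque()                   # threads that must pass through again
    remaining = list(trips)              # passes still needed per thread
    pipe = deque([None] * (depth - 1))   # stages between issue and exit
    next_to_leave = 0                    # oldest thread still in flight
    retired = 0
    cycles = 0

    while retired < len(trips):
        cycles += 1
        # Looping threads get priority over new ones at the block entry.
        if loopback:
            tid = loopback.popleft()
        elif waiting:
            tid = waiting.popleft()
        else:
            tid = None                   # pipeline bubble
        pipe.appendleft(tid)
        done = pipe.pop()                # thread reaching the last stage
        if done is None:
            continue
        remaining[done] = max(remaining[done] - 1, 0)
        finished = remaining[done] == 0
        may_leave = reorder or done == next_to_leave
        if finished and may_leave:
            retired += 1
            next_to_leave += 1           # only meaningful when reorder=False
        else:
            loopback.append(done)        # real work, or an in-order bubble
    return cycles


# Hypothetical workload: one long-running thread ahead of seven short ones.
irregular = [8, 1, 1, 1, 1, 1, 1, 1]
regular = [1] * 8

for name, trips in (("regular", regular), ("irregular", irregular)):
    in_order = simulate(trips, depth=4, reorder=False)
    reordered = simulate(trips, depth=4, reorder=True)
    print(name, "in-order:", in_order, "reordered:", reordered)
```

Under these assumptions the two variants take the same number of cycles on the regular workload, and the reordered variant finishes sooner on the irregular one because short threads no longer wait behind the long one. The size of the gap depends entirely on the assumed trip counts and depth, so it should not be read as an estimate of the speedups reported in the abstract.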

Tags: Computer science, FPGA, Hardware, OpenCL, Performance

November 30, 2016 by hgpu
