high performance computing on graphics processing units: hgpu.org


Exploring the acceleration of Nekbone on reconfigurable architectures

Nick Brown

EPCC at the University of Edinburgh, The Bayes Centre, 47 Potterrow, Edinburgh

arXiv:2011.04981 [cs.DC], (10 Nov 2020)

DOI:10.1109/H2RC51942.2020.00008

@article{brown2020exploring,
  title={Exploring the acceleration of Nekbone on reconfigurable architectures},
  author={Brown, Nick},
  journal={arXiv preprint arXiv:2011.04981},
  year={2020}
}


Advances in hardware technology are struggling to keep pace with scientific ambition, and a key question is how we can use the transistors that we already have more effectively. This is especially true for HPC, where the tendency is often to throw computation at a problem even though the codes themselves are commonly bound, at least to some extent, by other factors. Redesigning an algorithm to move from a Von Neumann to a dataflow style potentially offers more opportunity to address these bottlenecks on reconfigurable architectures than on more general-purpose architectures. In this paper we explore porting the AX kernel of Nekbone, a widely popular HPC mini-app, to FPGAs using High Level Synthesis via Vitis. Whilst computation is an important part of this code, it is also memory bound on CPUs, and a key question is whether one can ameliorate this by leveraging FPGAs. We first explore optimisation strategies for obtaining good performance, with a runtime difference of over 4000 times between the first and final versions of our kernel on FPGAs. Subsequently, the performance and power efficiency of our approach on an Alveo U280 are compared against a 24-core Xeon Platinum CPU and an NVIDIA V100 GPU, with the FPGA outperforming the CPU by around four times, achieving almost three quarters of the GPU performance, and being significantly more power efficient than both. The result of this work is a comparison and a set of techniques that apply to Nekbone on FPGAs specifically and are also of wider interest for accelerating HPC codes on reconfigurable architectures.
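The contrast the abstract draws between a Von Neumann style and a dataflow style can be illustrated with a toy sketch. This is not the paper's AX kernel or its Vitis HLS code: the function names (`load`, `scale`, `accumulate`, `staged_style`, `dataflow_style`) and the scale-then-sum computation are hypothetical stand-ins, and Python generators only mimic the idea of concurrently running stages connected by streams. The staged version materialises each intermediate result as a whole array, whereas the streaming version passes one element at a time between stages, which is roughly the restructuring that lets an FPGA pipeline avoid round trips to external memory.

```python
from typing import Iterable, Iterator


def load(values: Iterable[float]) -> Iterator[float]:
    """Stage 1: emit input elements one at a time (stands in for a memory read)."""
    for v in values:
        yield v


def scale(stream: Iterator[float], factor: float) -> Iterator[float]:
    """Stage 2: transform each element as soon as it arrives."""
    for v in stream:
        yield v * factor


def accumulate(stream: Iterator[float]) -> float:
    """Stage 3: consume the stream and reduce it to a single value."""
    total = 0.0
    for v in stream:
        total += v
    return total


def staged_style(values: Iterable[float], factor: float) -> float:
    """Each step finishes completely and stores a full intermediate array."""
    loaded = list(values)                  # whole input held in memory
    scaled = [v * factor for v in loaded]  # whole intermediate held in memory
    return accumulate(iter(scaled))


def dataflow_style(values: Iterable[float], factor: float) -> float:
    """Stages are chained by streams; no intermediate array is materialised."""
    return accumulate(scale(load(values), factor))


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(staged_style(data, 2.0))
    print(dataflow_style(data, 2.0))
```

Both functions compute the same result; only the movement of data between steps differs. On real reconfigurable hardware the stages would run concurrently, which this sequential sketch does not capture.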

Tags: Computer science, FPGA, HLS, nVidia, OpenCL, Tesla V100

November 15, 2020 by hgpu

