high performance computing on graphics processing units: hgpu.org


Optimization of Compiler-generated OpenCL CNN Kernels and Runtime for FPGAs

Seung-Hun Chung

University of Toronto

University of Toronto, 2021

This work explores the viability of end-to-end convolutional neural network inference using OpenCL HLS kernels generated from TVM on Intel FPGAs. We explore layer-pipelined execution for small networks and time-multiplexed kernels for larger CNNs. Naively generated kernels do not produce efficient hardware. We propose a set of optimizations to increase parallelism, improve resource utilization, and use memory bandwidth more efficiently. They include loop unrolling, tiling, fusion, invariant code motion, cached writes, CL channels, autorun kernels, concurrent execution, and parameterized kernels. These optimizations improve performance by up to a factor of 1150x over the naive baseline implementation generated by TVM. Compared to Keras/TensorFlow on a 56-core Xeon 8280, we observe speedups of up to 4.57x on LeNet and 1.4x on MobileNet, but a slowdown to 0.43x on ResNet18/34.
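Two of the loop transformations named in the abstract, tiling and accumulating in a local variable before a single write, can be illustrated with a minimal sketch. It is not the thesis's code and not TVM output: the function names `conv2d_naive` and `conv2d_tiled`, the `tile` parameter, and the use of plain Python lists are assumptions made for illustration. It only shows that the restructured loop nest computes the same result as the direct one; on an FPGA the inner kernel loops would additionally be candidates for unrolling.

```python
def conv2d_naive(image, kernel):
    """Direct 'valid' 2D convolution (cross-correlation, as used in CNN layers).

    Every multiply-accumulate updates the output array in place, which
    resembles a kernel that reads and writes external memory in its inner loop.
    """
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            for i in range(kh):
                for j in range(kw):
                    out[y][x] += image[y + i][x + j] * kernel[i][j]
    return out


def conv2d_tiled(image, kernel, tile=2):
    """Same computation, restructured (hypothetical illustration).

    - The output is visited in tile x tile blocks (loop tiling).
    - Each output value is accumulated in a local variable and written
      once, one plausible reading of a "cached write".
    """
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for ty in range(0, oh, tile):              # tile loop over output rows
        for tx in range(0, ow, tile):          # tile loop over output columns
            for y in range(ty, min(ty + tile, oh)):
                for x in range(tx, min(tx + tile, ow)):
                    acc = 0                    # local accumulator
                    for i in range(kh):        # small fixed-size loops:
                        for j in range(kw):    # candidates for full unrolling
                            acc += image[y + i][x + j] * kernel[i][j]
                    out[y][x] = acc            # single write per output
    return out
```

Integer inputs keep the comparison between the two versions exact; with floating-point data the results would still match here because the accumulation order per output is unchanged.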

Tags: Computer science, FPGA, Neural networks, nVidia, nVidia GeForce GTX 1060, OpenCL, Performance, Thesis

December 19, 2021 by hgpu
