
Hardware Acceleration for Neural Networks: A Comprehensive Survey

Bin Xu, Ayan Banerjee, Sandeep Gupta
School of Electrical, Computer and Energy Engineering, Arizona State University, USA
arXiv:2512.23914 [eess.SY] (30 Dec 2025)

@misc{xu2025hardwareaccelerationneuralnetworks,
   title={Hardware Acceleration for Neural Networks: A Comprehensive Survey},
   author={Bin Xu and Ayan Banerjee and Sandeep Gupta},
   year={2025},
   eprint={2512.23914},
   archivePrefix={arXiv},
   primaryClass={eess.SY},
   url={https://arxiv.org/abs/2512.23914}
}


Neural networks have become a dominant computational workload across cloud and edge platforms, but rapid growth in model size and deployment diversity has exposed hardware bottlenecks increasingly dominated by memory movement, communication, and irregular operators rather than peak arithmetic throughput. This survey reviews the technology landscape for hardware acceleration of deep learning, spanning GPUs and tensor-core architectures; domain-specific accelerators (e.g., TPUs/NPUs); FPGA-based designs; ASIC inference engines; and emerging LLM-serving accelerators such as LPUs (language processing units), alongside in-/near-memory computing and neuromorphic/analog approaches. We organize the space using a unified taxonomy across (i) workloads (CNNs, RNNs, GNNs, and Transformers/LLMs), (ii) execution settings (training vs. inference; datacenter vs. edge), and (iii) optimization levers (reduced precision, sparsity and pruning, operator fusion, compilation and scheduling, and memory-system/interconnect design). We synthesize key architectural ideas including systolic arrays, vector and SIMD engines, specialized attention and softmax kernels, quantization-aware datapaths, and high-bandwidth memory, and we discuss how software stacks and compilers bridge model semantics to hardware. Finally, we highlight open challenges — including efficient long-context LLM inference (KV-cache management), robust support for dynamic and sparse workloads, energy- and security-aware deployment, and fair benchmarking — and point to promising directions for the next generation of neural acceleration.
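To make the KV-cache challenge concrete, below is a minimal NumPy sketch of single-head autoregressive decoding; it is our illustration, not material from the survey, and names such as KVCache and decode_step, along with the shapes used, are assumptions chosen for clarity. Each generated token appends one key/value row to a cache so later steps attend over the prefix without recomputing it.

# A minimal sketch (illustrative, not from the paper) of KV-cache reuse during
# autoregressive decoding with single-head scaled dot-product attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Grows by one (key, value) row per generated token, so each decode step
    computes attention only for the newest query instead of re-running the prefix."""
    def __init__(self, d_model):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(q, k, v, cache):
    # Cache the new token's key/value, then attend over the whole cached prefix.
    cache.append(k, v)
    scores = cache.keys @ q / np.sqrt(q.shape[-1])   # (t,)
    weights = softmax(scores)
    return weights @ cache.values                     # (d_model,)

# Usage: four decode steps with random projections standing in for a real model.
rng = np.random.default_rng(0)
d = 8
cache = KVCache(d)
for _ in range(4):
    q, k, v = rng.standard_normal((3, d))
    out = decode_step(q, k, v, cache)
print(out.shape, cache.keys.shape)  # (8,) (4, 8)

Because the cache grows linearly with context length, long-context serving shifts the bottleneck from arithmetic to memory capacity and bandwidth, which is the kind of pressure the surveyed memory systems and LLM-serving accelerators are designed to relieve.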
