high performance computing on graphics processing units: hgpu.org


Persistent RNNs: Stashing Recurrent Weights On-Chip

Gregory Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, Sanjeev Satheesh

Baidu Silicon Valley AI Lab, 1195 Bordeaux Drive, Sunnyvale, CA 94089, United States

The 33rd International Conference on Machine Learning, 2016

@inproceedings{diamos2016persistent,
  title={Persistent RNNs: Stashing Recurrent Weights On-Chip},
  author={Diamos, Greg and Sengupta, Shubho and Catanzaro, Bryan and Chrzanowski, Mike and Coates, Adam and Elsen, Erich and Engel, Jesse and Hannun, Awni and Satheesh, Sanjeev},
  booktitle={Proceedings of The 33rd International Conference on Machine Learning},
  pages={2024--2033},
  year={2016}
}


Package: persistent-rnn: Fast Recurrent Networks Library


This paper introduces a new technique for mapping deep Recurrent Neural Networks (RNNs) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA Titan X GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
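To illustrate why reusing weights across timesteps matters at small mini-batch sizes, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It compares the arithmetic intensity (FLOPs per byte of recurrent weights read from off-chip memory) of a matrix-multiply approach that reloads the weights every timestep against a persistent approach that loads them once. The function names, the 32-bit weight size, the hidden size of 1024, and the 256 timesteps are all assumptions chosen for illustration; activation traffic is ignored.

```python
def recurrent_flops(hidden, batch, timesteps):
    """FLOPs for the recurrent product U @ h_{t-1} summed over all timesteps.

    U is hidden x hidden and h_{t-1} is hidden x batch, so each timestep
    costs 2 * hidden * hidden * batch FLOPs (one multiply and one add).
    """
    return 2 * hidden * hidden * batch * timesteps


def weight_bytes_loaded(hidden, timesteps, persistent, bytes_per_weight=4):
    """Bytes of recurrent weights read from off-chip memory.

    Assumption: a non-persistent kernel rereads U at every timestep, while a
    persistent kernel reads U once and keeps it on-chip afterwards.
    """
    loads = 1 if persistent else timesteps
    return hidden * hidden * bytes_per_weight * loads


def arithmetic_intensity(hidden, batch, timesteps, persistent):
    """FLOPs per byte of weight traffic (activation traffic is ignored)."""
    flops = recurrent_flops(hidden, batch, timesteps)
    traffic = weight_bytes_loaded(hidden, timesteps, persistent)
    return flops / traffic


if __name__ == "__main__":
    # Hypothetical layer: hidden size 1024, mini-batch 4, 256 timesteps.
    for persistent in (False, True):
        ai = arithmetic_intensity(1024, 4, 256, persistent)
        label = "persistent" if persistent else "reload each step"
        print(f"{label}: {ai} FLOP/byte")
```

Under these assumptions the reload-each-step case yields 2.0 FLOP/byte regardless of sequence length, whereas the persistent case yields 512.0 FLOP/byte and grows linearly with the number of timesteps. This is consistent with the abstract's claim that persistent kernels raise throughput at low mini-batch sizes, although the real gain depends on hardware details this sketch does not model.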

Tags: Computer science, CUDA, Deep learning, Matrix multiplication, Neural networks, nVidia, nVidia GeForce GTX Titan X, Package, RNN, Speech recognition

June 30, 2016 by hgpu

