high performance computing on graphics processing units: hgpu.org

Single stream parallelization of generalized LSTM-like RNNs on a GPU

Kyuyeon Hwang, Wonyong Sung

Department of Electrical and Computer Engineering, Seoul National University, Seoul 151-744, South Korea

arXiv:1503.02852 [cs.NE], (10 Mar 2015)

Recurrent neural networks (RNNs) have shown outstanding performance on processing sequence data. However, they suffer from long training times, which demands parallel implementations of the training procedure. Parallelizing the training algorithms for RNNs is very challenging because internal recurrent paths form dependencies between different time frames. In this paper, we first propose a generalized graph-based RNN structure that covers the most popular long short-term memory (LSTM) network. Then, we present a parallelization approach that automatically explores the parallelism of arbitrary RNNs by analyzing the graph structure. The experimental results show that the proposed approach achieves a large speed-up even with a single training stream, and further accelerates training when combined with multiple parallel training streams.
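
The idea of finding parallelism by analyzing the graph structure can be illustrated with a small sketch. This is not the authors' algorithm, only one plausible reading of the abstract: the network is modeled as a directed graph whose edges carry a time delay (0 for same-frame connections, 1 for recurrent ones), and layers that do not lie on any cycle are marked as computable for all time frames at once, while layers on a recurrent cycle are marked for frame-by-frame processing. The node names, the edge format, and the functions `strongly_connected_components` and `schedule` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical LSTM-like graph: (source, destination, delay in time frames).
# The delay is kept for documentation only; this sketch does not validate it.
NODES = ["input", "gates", "cell", "hidden", "output"]
EDGES = [
    ("input", "gates", 0),
    ("hidden", "gates", 1),   # recurrent path from the previous frame
    ("gates", "cell", 0),
    ("cell", "cell", 1),      # cell state carried across frames
    ("cell", "hidden", 0),
    ("hidden", "output", 0),
]


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm. Components are returned in topological
    order of the condensed graph (producers before consumers)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for src, dst, _delay in edges:
        fwd[src].append(dst)
        rev[dst].append(src)

    # First pass: record nodes in order of DFS completion.
    order, seen = [], set()

    def visit(n):
        seen.add(n)
        for m in fwd[n]:
            if m not in seen:
                visit(m)
        order.append(n)

    for n in nodes:
        if n not in seen:
            visit(n)

    # Second pass: walk the reversed graph in decreasing finish order.
    comps, assigned = [], set()

    def collect(n, comp):
        assigned.add(n)
        comp.append(n)
        for m in rev[n]:
            if m not in assigned:
                collect(m, comp)

    for n in reversed(order):
        if n not in assigned:
            comp = []
            collect(n, comp)
            comps.append(comp)
    return comps


def schedule(nodes, edges):
    """Label each group of layers as parallel over time or sequential."""
    self_loops = {src for src, dst, _ in edges if src == dst}
    plan = []
    for comp in strongly_connected_components(nodes, edges):
        # A group is recurrent if it contains a cycle of any length.
        recurrent = len(comp) > 1 or comp[0] in self_loops
        mode = "sequential" if recurrent else "parallel_over_time"
        plan.append((mode, sorted(comp)))
    return plan


for mode, layers in schedule(NODES, EDGES):
    print(mode, layers)
```

For this example graph, the input and output layers are marked `parallel_over_time`, while the gates, cell, and hidden layers form one `sequential` group because the recurrent edges tie them into a cycle. A real implementation would additionally have to map each group onto GPU kernels, which this sketch does not attempt.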

Tags: Algorithms, Computer science, CUDA, Machine learning, Neural and Evolutionary Computing, Neural networks, nVidia, Tesla K40

March 22, 2015 by hgpu
