Efficient Incremental Text-to-Speech on GPUs
NVIDIA Corporation
arXiv:2211.13939 [cs.SD] (25 Nov 2022)
@misc{https://doi.org/10.48550/arxiv.2211.13939,
doi={10.48550/ARXIV.2211.13939},
url={https://arxiv.org/abs/2211.13939},
author={Du, Muyang and Liu, Chuan and Qi, Jiaxing and Lai, Junjie},
keywords={Sound (cs.SD), Machine Learning (cs.LG), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
title={Efficient Incremental Text-to-Speech on GPUs},
publisher={arXiv},
year={2022},
copyright={arXiv.org perpetual, non-exclusive license}
}
Incremental text-to-speech, also known as streaming TTS, has been increasingly applied in online speech applications that require ultra-low response latency to provide an optimal user experience. However, most existing speech synthesis pipelines deployed on GPUs are still non-incremental, which exposes limitations in high-concurrency scenarios, especially when the pipeline is built from end-to-end neural network models. To address this issue, we present a highly efficient approach to performing real-time incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching. Experimental results demonstrate that the proposed method can produce high-quality speech with a first-chunk latency below 80 ms under 100 QPS on a single NVIDIA A10 GPU, and that it significantly outperforms its non-incremental counterpart in both concurrency and latency. Our work demonstrates the effectiveness of high-performance incremental TTS on GPUs.
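The two ideas named in the abstract can be illustrated with a toy scheduler. This is a hypothetical sketch, not the authors' implementation: requests enter a shared pool at any time ("Instant Request Pooling"), and on each step every pipeline module runs once over a batch of all requests currently waiting at that module ("Module-wise Dynamic Batching"), so the first audio chunk can be streamed out before later chunks are synthesized. The `Request`, `acoustic_model`, and `vocoder` names are placeholders standing in for batched neural components.

```python
from collections import deque

class Request:
    """One TTS request, split into text chunks for incremental synthesis."""
    def __init__(self, text_chunks):
        self.pending = deque(text_chunks)  # text still to be synthesized
        self.audio = []                    # audio chunks produced so far

def acoustic_model(batch):
    # stand-in for a batched neural acoustic model (text chunk -> mel chunk)
    return [f"mel({chunk})" for chunk in batch]

def vocoder(batch):
    # stand-in for a batched neural vocoder (mel chunk -> waveform chunk)
    return [f"wav({mel})" for mel in batch]

def serve(pool):
    """Run all pooled requests to completion, batching per module per step."""
    while any(r.pending for r in pool):
        # gather one chunk from every active request -> one dynamic batch
        active = [r for r in pool if r.pending]
        texts = [r.pending.popleft() for r in active]
        mels = acoustic_model(texts)   # module 1, batched across requests
        wavs = vocoder(mels)           # module 2, batched across requests
        for req, wav in zip(active, wavs):
            req.audio.append(wav)      # first chunk is streamable here
        # a new request appended to `pool` at this point would join the
        # very next batch, without waiting for earlier requests to finish
    return pool

pool = [Request(["hello", "world"]), Request(["hi"])]
serve(pool)
```

The key property the sketch shows is that batch size varies per step: once the shorter request finishes, later batches shrink to the remaining requests rather than padding to a fixed batch.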
December 4, 2022 by hgpu