Efficient Incremental Text-to-Speech on GPUs

Muyang Du, Chuan Liu, Jiaxing Qi, Junjie Lai
NVIDIA Corporation
arXiv:2211.13939 [cs.SD] (25 Nov 2022)




Keywords: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering

License: arXiv.org perpetual, non-exclusive license





Incremental text-to-speech (TTS), also known as streaming TTS, is increasingly used in online speech applications that require ultra-low response latency to provide an optimal user experience. However, most existing speech synthesis pipelines deployed on GPUs are still non-incremental, which exposes limitations in high-concurrency scenarios, especially when the pipeline is built from end-to-end neural network models. To address this issue, we present a highly efficient approach to real-time incremental TTS on GPUs based on Instant Request Pooling and Module-wise Dynamic Batching. Experimental results show that the proposed method produces high-quality speech with a first-chunk latency below 80 ms under 100 QPS on a single NVIDIA A10 GPU and significantly outperforms its non-incremental counterpart in both concurrency and latency. Our work demonstrates the effectiveness of high-performance incremental TTS on GPUs.
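
The abstract names its techniques without defining them, so the following is only a toy illustration of why admitting requests into a shared pool and emitting audio chunk by chunk can lower first-chunk latency under concurrency. It is not the authors' implementation. The `Request` class, both scheduler functions, the step-based clock, and the `max_batch` limit are all hypothetical, and the whole pipeline is modeled as a single batched step, so module-wise batching is not represented.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical streaming request (not from the paper)."""
    name: str
    arrival_step: int   # scheduler step at which the request arrives
    chunks_needed: int  # number of audio chunks in the full utterance
    chunks_done: int = 0


def run_incremental(requests, max_batch=8):
    """Pooled, chunk-at-a-time scheduling (assumed behavior).

    Returns {name: step at which the first audio chunk is emitted}.
    """
    pending = deque(sorted(requests, key=lambda r: r.arrival_step))
    pool = []
    first_chunk = {}
    step = 0
    while pending or pool:
        # Admit arrivals right away instead of waiting for in-flight
        # requests to finish.
        while pending and pending[0].arrival_step <= step and len(pool) < max_batch:
            pool.append(pending.popleft())
        # One batched step: every pooled request advances by one chunk.
        for r in pool:
            r.chunks_done += 1
            first_chunk.setdefault(r.name, step)
        # Finished requests leave the pool, freeing batch slots.
        pool = [r for r in pool if r.chunks_done < r.chunks_needed]
        step += 1
    return first_chunk


def run_non_incremental(requests, max_batch=8):
    """Simplified baseline: a batch runs to completion before any audio
    is returned, and late arrivals wait for the next batch.

    Returns {name: step at which audio first becomes available}.
    """
    pending = deque(sorted(requests, key=lambda r: r.arrival_step))
    first_audio = {}
    step = 0
    while pending:
        # Idle until the next request has arrived.
        step = max(step, pending[0].arrival_step)
        batch = []
        while pending and pending[0].arrival_step <= step and len(batch) < max_batch:
            batch.append(pending.popleft())
        # The batch occupies the device for its longest utterance.
        step += max(r.chunks_needed for r in batch)
        for r in batch:
            first_audio[r.name] = step - 1
    return first_audio


def make_requests():
    # Two made-up requests; B arrives one step after A.
    return [Request("A", 0, 5), Request("B", 1, 3)]


incremental = run_incremental(make_requests())
baseline = run_non_incremental(make_requests())
print("incremental:", incremental)
print("non-incremental:", baseline)
```

In this toy setup request B receives its first chunk in the step it arrives, whereas in the baseline it waits for A's whole utterance and then its own. The step counts are artifacts of the simulation and say nothing about the latencies reported in the paper.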