


Multi-Speaker Indonesian Speech Synthesis Based on Global Style Embedding

DOI: 10.12677/CSA.2023.131013, pp. 126-135

Keywords: Speech Synthesis, Multi-Speaker, End-to-End, Style Transfer, Low-Resource, Indonesian


Abstract:

Due to the scarcity of high-quality Indonesian speech corpora, the performance of multi-speaker speech synthesis systems for the language still needs improvement. To mitigate the impact of low-resource conditions on multi-speaker synthesis, an end-to-end Indonesian speech synthesis system based on the GST-Tacotron2 framework is studied and implemented. A system trained on 8.5 hours of single-speaker Indonesian data achieves a mean opinion score (MOS) of 4.11 for synthesized speech. On this basis, a multi-speaker Indonesian synthesis system is designed, focusing on how a speaker-encoding method affects the naturalness of multi-speaker synthesis when only small amounts of speech data from other Indonesian speakers are mixed into training. Experimental results show that with a model trained on a total of 14.5 hours of multi-speaker data, the MOS of the primary speaker's synthesized speech reaches 4.12, and its mel-cepstral distortion (MCD) is 7.2% lower than that of the best single-speaker model. The MOS of each of the other speakers' synthesized speech exceeds 3.60, verifying the effectiveness of the proposed method.
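The global-style-token mechanism the abstract builds on can be sketched as a single attention step: a reference-encoder summary vector queries a small bank of learned style tokens, and the resulting convex combination of tokens is the global style embedding, which is broadcast to every text-encoder timestep alongside a per-speaker embedding. The sketch below illustrates only the tensor flow; all sizes, names, and the random stand-in vectors are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# a bank of 10 learned style tokens of dimension 64 (sizes are illustrative)
n_tokens, d = 10, 64
style_tokens = rng.normal(size=(n_tokens, d))

def global_style_embedding(ref_enc_out: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention: the reference-encoder
    summary vector scores each style token; the output is a convex
    combination of tokens, i.e. the global style embedding."""
    scores = style_tokens @ ref_enc_out / np.sqrt(d)   # shape (n_tokens,)
    weights = softmax(scores)                          # sum to 1
    return weights @ style_tokens                      # shape (d,)

ref = rng.normal(size=(d,))        # stands in for the reference-encoder output
style = global_style_embedding(ref)

# the style embedding is broadcast-added to every text-encoder timestep,
# together with a speaker embedding looked up from a speaker table,
# before the Tacotron 2 attention/decoder runs
text_enc = rng.normal(size=(50, d))    # 50 encoded phoneme frames
speaker_emb = rng.normal(size=(d,))    # per-speaker lookup vector
decoder_in = text_enc + style + speaker_emb
```

In a trained model the style tokens and speaker table are learned end-to-end; adding the speaker embedding at the encoder output is one common design choice for conditioning a shared decoder on speaker identity under low-resource mixing.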

References

[1]  Wang, Y., Skerry-Ryan, R.J., Stanton, D., et al. (2017) Tacotron: Towards End-to-End Speech Synthesis. Proceedings of INTERSPEECH, 2017, 4006-4010.
https://doi.org/10.21437/Interspeech.2017-1452
[2]  Shen, J., Pang, R., Weiss, R.J., et al. (2018) Natural TTS Synthesis by Conditioning WaveNet on Mel-Spectrogram Predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 15-20 April 2018, 4779-4783.
https://doi.org/10.1109/ICASSP.2018.8461368
[3]  Ren, Y., Ruan, Y., Tan, X., et al. (2019) FastSpeech: Fast, Robust and Controllable Text to Speech. Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, 8-14 December 2019, 1171-1179.
[4]  Ren, Y., Hu, C., Tan, X., et al. (2020) FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. International Conference on Learning Representations, Addis Ababa, 26-30 April 2020, 1-15.
https://openreview.net/forum?id=piLPYqxtWuA
[5]  Arık, S.Ö., Chrzanowski, M., Coates, A., et al. (2017) Deep Voice: Real-Time Neural Text-to-Speech. International Conference on Machine Learning, Sydney, 6-11 August 2017, 195-204.
[6]  Gibiansky, A., Arık, S., Diamos, G., et al. (2017) Deep Voice 2: Multi-Speaker Neural Text-to-Speech. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 2966-2974.
[7]  Ping, W., Peng, K., Gibiansky, A., et al. (2018) Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. International Conference on Learning Representations, Vancouver, 30 April-3 May 2018, 214-217.
[8]  van den Oord, A., Dieleman, S., Zen, H., et al. (2016) WaveNet: A Generative Model for Raw Audio. 9th ISCA Speech Synthesis Workshop, Sunnyvale, 13-15 September 2016, 125.
[9]  Debnath, A., Patil, S.S., Nadiger, G., et al. (2020) Low-Resource End-to-End Sanskrit TTS Using Tacotron2, WaveGlow and Transfer Learning. 2020 IEEE 17th India Council International Conference (INDICON), New Delhi, 10-13 December 2020, 1-5.
https://doi.org/10.1109/INDICON49873.2020.9342071
[10]  Wang, Y., Stanton, D., Zhang, Y., et al. (2018) Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. International Conference on Machine Learning, Stockholm, 10-15 July 2018, 5180-5189.
[11]  Guo, H., Soong, F.K., He, L., et al. (2019) A New GAN-Based End-to-End TTS Training Algorithm. Proceedings of Interspeech, 2019, 1288-1292.
https://doi.org/10.21437/Interspeech.2019-2176
[12]  Prenger, R., Valle, R. and Catanzaro, B. (2019) WaveGlow: A Flow-Based Generative Network for Speech Synthesis. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 3617-3621.
https://doi.org/10.1109/ICASSP.2019.8683143
[13]  Ito, K. and Johnson, L. (2017) The LJ Speech Dataset.
https://keithito.com/LJ-Speech-Dataset
