
OALib Journal (ISSN: 2333-9721)


A Speech Synthesis Model with Mood Based on Variational Autoencoder

DOI: 10.12677/CSA.2020.1012227, PP. 2159-2167

Keywords: Mood, Variational Autoencoders, Speech Synthesis, WORLD Vocoder


Abstract:

Mood, as an important carrier of emotional information, plays a key role in conveying a speaker's meaning. Current speech synthesis systems lack good support for mood, and their synthetic speech consequently sounds flat and monotonous. To address this problem and improve the naturalness of synthesized speech, we combine Statistical Parametric Speech Synthesis (SPSS) with a Variational Autoencoder (VAE), whose strong representation-learning ability lets it learn the speaker's latent mood information in an unsupervised manner; a classifier is then added to improve the accuracy of the learned mood representation. We propose a framework for speech synthesis with mood that is divided into three parts: an acoustic model, a mood model, and a synthesis model. From the target text and the target mood, the acoustic model and the mood model respectively reconstruct acoustic features, including the fundamental frequency F0. Finally, these acoustic features are fed into the WORLD vocoder to synthesize a speech signal carrying the target mood. We use Blizzard Challenge 2018 as the training corpus, and the experimental results show that the proposed model achieves good mood-generation performance.
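The abstract does not give the training objective, but a VAE with an auxiliary classifier of the kind described is typically trained on an ELBO-style loss: a reconstruction term, a KL term that regularizes the latent mood distribution toward a standard Gaussian, and a classification term on the latent code. The sketch below illustrates those terms for a diagonal-Gaussian posterior; the function names and the weights `beta`/`gamma` are hypothetical, not taken from the paper.

```python
import math
import random

def kl_gauss_std(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension.
    This is the standard VAE regularizer; it is 0 exactly when mu=0, sigma=1."""
    return 0.5 * (mu ** 2 + math.exp(log_var) - 1.0 - log_var)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1),
    so the sampling step stays differentiable w.r.t. mu and log_var."""
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def vae_classifier_loss(recon_err, mu, log_var, class_nll, beta=1.0, gamma=1.0):
    """Hypothetical total objective: reconstruction error on the acoustic
    features, plus beta-weighted KL over all latent dimensions, plus a
    gamma-weighted classifier negative log-likelihood on the latent code."""
    kl = sum(kl_gauss_std(m, lv) for m, lv in zip(mu, log_var))
    return recon_err + beta * kl + gamma * class_nll
```

For example, with a unit-Gaussian posterior (`mu=0`, `log_var=0`) the KL term vanishes and the loss reduces to the reconstruction error plus the classifier term.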


