Mood Speech Synthesis Model Based on Variational Autoencoder
References:
[1] | Dutoit, T. (2001) An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, Dordrecht. |
[2] | Gonzalvo, X., Tazari, S., Chan, C.A., et al. (2016) Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer. Interspeech 2016, San Francisco, 8-12 September 2016, 2238-2242. https://doi.org/10.21437/Interspeech.2016-264 |
[3] | Zen, H., Agiomyrgiannakis, Y., Egberts, N., et al. (2016) Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices. Interspeech 2016, San Francisco, 8-12 September 2016, 2273-2277. https://doi.org/10.21437/Interspeech.2016-522 |
[4] | Li, N., Liu, S., Liu, Y., et al. (2018) Close to Human Quality TTS with Transformer. |
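[5] | Wang, F.H. (2005) A Contrastive Study of the Chinese and English Mood Systems. Doctoral Dissertation, Fudan University Press, Shanghai. (In Chinese) |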
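[6] | Zhang, Y.Q. (2019) Emotional Speech Synthesis Based on Transfer Learning and Self-Learned Emotion Representations. Master's Thesis, Beijing University of Posts and Telecommunications, Beijing. (In Chinese) |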
[7] | Sun, G., Zhang, Y., Weiss, R.J., et al. (2020) Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, 4-8 May 2020, 6264-6268. https://doi.org/10.1109/ICASSP40776.2020.9053520 |
[8] | Zhang, Y.J., Pan, S., He, L., et al. (2019) Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 6945-6949. https://doi.org/10.1109/ICASSP.2019.8683623 |
[9] | Kingma, D.P. and Welling, M. (2014) Auto-Encoding Variational Bayes. 2nd International Conference on Learning Representations, Banff, 14-16 April 2014. |
[10] | Wittrock, M.C. (2010) Learning as a Generative Process. Educational Psychologist, 45, 40-45. https://doi.org/10.1080/00461520903433554 |
[11] | Bishop, C.M. (2006) Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, New York. |
[12] | Bengio, Y., Courville, A. and Vincent, P. (2013) Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1798-1828. https://doi.org/10.1109/TPAMI.2013.50 |
[13] | Khurana, S., Joty, S.R., Ali, A., et al. (2019) A Factorial Deep Markov Model for Unsupervised Disentangled Representation Learning from Speech. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, 12-17 May 2019, 6540-6544. https://doi.org/10.1109/ICASSP.2019.8683131 |
[14] | Wainwright, M.J. and Jordan, M.I. (2008) Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning, 1, 1-305. https://doi.org/10.1561/2200000001 |
[15] | Joyce, J.M. (2011) Kullback-Leibler Divergence. In: Lovric, M., Ed., International Encyclopedia of Statistical Science, Springer, Berlin, 720-722. https://doi.org/10.1007/978-3-642-04898-2_327 |
[16] | Hodari, Z., Lai, C. and King, S. (2020) Perception of Prosodic Variation for Speech Synthesis Using an Unsupervised Discrete Representation of F0. 10th International Conference on Speech Prosody, Tokyo, 25-28 May 2020, 965. https://doi.org/10.21437/SpeechProsody.2020-197 |
[17] | He, M., Deng, Y. and He, L. (2019) Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS. Interspeech 2019, Graz, 15-19 September 2019, 1293-1297. https://doi.org/10.21437/Interspeech.2019-1972 |
[18] | Xue, S. and Yan, Z. (2017) Improving Latency-Controlled BLSTM Acoustic Models for Online Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, 5-9 March 2017, 5340-5344. https://doi.org/10.1109/ICASSP.2017.7953176 |
[19] | Morise, M., Yokomori, F. and Ozawa, K. (2016) WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, 99, 1877-1884. https://doi.org/10.1587/transinf.2015EDP7457 |
[20] | King, S., Crumlish, J., Martin, A. and Wihlborg, L. (2018) The Blizzard Challenge 2018. Proc. Blizzard Challenge Workshop, Hyderabad. |
[21] | Tokuda, K., Yoshimura, T., Masuko, T., et al. (2002) Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis. IEEE International Conference on Acoustics, Orlando, 13-17 May 2002, 1315-1318. |