
Transformer Based Multilingual Grapheme-to-Phoneme Conversion

DOI: 10.12677/CSA.2023.133050, pp. 510-517

Keywords: Grapheme-to-Phoneme Conversion, Transformer, Multilingual, Cross-Mixing


Abstract:

Grapheme-to-Phoneme (G2P) conversion is an important component of the speech-synthesis front end and directly affects the quality of synthesized speech. Most existing G2P research targets a single language, yet in practical applications monolingual synthesis is far less useful than multilingual synthesis. This paper therefore uses the Transformer architecture to study multilingual (English, Japanese, and Korean) G2P conversion under a text cross-mixing condition, with Phoneme Error Rate (PER) and Word Error Rate (WER) as evaluation metrics. English is evaluated on the American-English CMUDict dataset; for Korean and Japanese, the Korean and Japanese datasets from the SIGMORPHON 2021 G2P task are first augmented, and evaluation is performed on the augmented datasets. Experimental results show that, under the text cross-mixing condition, the Transformer-based multilingual English-Japanese-Korean G2P model achieves substantially lower PER and WER than the corresponding monolingual Transformer models for each of the three languages.
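
The paper does not include an implementation, but the text cross-mixing setup it describes can be illustrated with a short sketch: each grapheme sequence is prefixed with a language-tag token and the three lexicons are shuffled into a single joint training set. The tag tokens (<en>, <ja>, <ko>), the toy entries, and the function name below are assumptions made for illustration, not the authors' actual preprocessing.

import random

# Hypothetical per-language lexicons mapping a written word to its phoneme
# sequence. Real entries would come from CMUDict and the (augmented)
# SIGMORPHON 2021 Korean and Japanese lexicons; these are toy placeholders.
lexicons = {
    "<en>": {"cat": ["K", "AE1", "T"]},
    "<ja>": {"ねこ": ["n", "e", "k", "o"]},
    "<ko>": {"고양이": ["k", "o", "j", "a", "ng", "i"]},
}

def build_cross_mixed_corpus(lexicons, seed=0):
    # Tag each entry with its language token, then shuffle all three
    # languages together into one joint training list.
    corpus = []
    for lang_tag, lexicon in lexicons.items():
        for word, phonemes in lexicon.items():
            src = [lang_tag] + list(word)  # e.g. ['<en>', 'c', 'a', 't']
            corpus.append((src, phonemes))
    random.Random(seed).shuffle(corpus)
    return corpus

for src, tgt in build_cross_mixed_corpus(lexicons):
    print(src, "->", tgt)

A single Transformer encoder-decoder can then be trained on this mixed list, with the language tag telling the model which pronunciation system to apply.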
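Likewise, a minimal sketch of the two evaluation metrics, assuming their usual definitions: PER as the Levenshtein edit distance [17] between predicted and reference phoneme sequences divided by the total number of reference phonemes, and WER as the fraction of words whose predicted pronunciation is not an exact match. The paper does not specify its scoring code; this is a standard reconstruction.

def levenshtein(a, b):
    # Edit distance between two sequences; insertions, deletions,
    # and substitutions all cost 1.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def per_wer(predictions, references):
    # PER: total edit distance over total reference phoneme count.
    # WER: fraction of words whose prediction is not an exact match.
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total = sum(len(r) for r in references)
    wrong = sum(p != r for p, r in zip(predictions, references))
    return edits / total, wrong / len(references)

# Toy check: one of two words wrong by a single substitution.
preds = [["K", "AE1", "T"], ["D", "AO1", "K"]]
refs  = [["K", "AE1", "T"], ["D", "AO1", "G"]]
print(per_wer(preds, refs))  # -> (0.1666..., 0.5)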

References

[1]  Rao, K., Peng, F., Sak, H., et al. (2015) Grapheme-to-Phoneme Conversion Using Long Short-Term Memory Recurrent Neural Networks. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, 19-24 April 2015, 4225-4229.
https://doi.org/10.1109/ICASSP.2015.7178767
[2]  Hu, W.X., Xu, B. and Huang, T.Y. (2001) Research on the Detection and Recognition of Prosodic Boundaries in Chinese Speech. Proceedings of the 6th National Conference on Man-Machine Speech Communication, Beijing, Chinese Information Processing Society of China, 39-42. (In Chinese)
[3]  Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention Is All You Need. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, 4-9 December 2017, 6000-6010.
[4]  Mnih, V., Heess, N. and Graves, A. (2014) Recurrent Models of Visual Attention. Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, 2204-2212.
[5]  Yolchuyeva, S., Németh, G. and Gyires-Tóth, B. (2019) Transformer Based Grapheme-to-Phoneme Conversion. 20th Annual Conference of the International Speech Communication Association, Graz, 15-19 September 2019, 2095-2099.
https://doi.org/10.21437/Interspeech.2019-1954
[6]  Sutskever, I., Vinyals, O. and Le, Q.V. (2014) Sequence to Sequence Learning with Neural Networks. Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, 3104-3112.
[7]  LeCun, Y., Boser, B., Denker, J.S., et al. (1989) Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1, 541-551.
https://doi.org/10.1162/neco.1989.1.4.541
[8]  Mikolov, T., Karafiát, M., Burget, L., et al. (2010) Recurrent Neural Network Based Language Model. Proceedings of INTERSPEECH 2010, Makuhari, 1045-1048.
https://doi.org/10.21437/Interspeech.2010-343
[9]  Hochreiter, S. and Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9, 1735-1780.
https://doi.org/10.1162/neco.1997.9.8.1735
[10]  The CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[11]  Sejnowski, T.J. (1988) The NetTalk Corpus: Phonetic Transcription of 20008 English Words.
[12]  Kingsbury, P., Strassel, S., McLemore, C., et al. (1997) CALLHOME American English Lexicon (PRONLEX). Linguistic Data Consortium, Philadelphia.
[13]  Bisani, M. and Ney, H. (2008) Joint-Sequence Models for Grapheme-to-Phoneme Conversion. Speech Communication, 50, 434-451.
https://doi.org/10.1016/j.specom.2008.01.002
[14]  Yao, K. and Zweig, G. (2015) Sequence-to-Sequence Neural Net Models for Grapheme-to-Phoneme Conversion. INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, 6-10 September 2015, 3330-3334.
https://doi.org/10.21437/Interspeech.2015-134
[15]  Ashby, L.F.E., Bartley, T.M., Clematide, S., et al. (2021) Results of the Second SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion. Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, August 2021, 115-125.
https://doi.org/10.18653/v1/2021.sigmorphon-1.13
[16]  El Saadany, O. and Suter, B. (2020) Grapheme-to-Phoneme Conversion with a Multilingual Transformer Model. Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, July 2020, 85-89.
https://doi.org/10.18653/v1/2020.sigmorphon-1.7
[17]  Levenshtein, V.I. (1966) Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady, 10, 707-710.
[18]  Galescu, L. and Allen, J.F. (2002) Pronunciation of Proper Names with a Joint n-Gram Model for Bi-Directional Grapheme-to-Phoneme Conversion. 7th International Conference on Spoken Language Processing, Denver, 16-20 September 2002, 109-112.
https://doi.org/10.21437/ICSLP.2002-79
[19]  Yolchuyeva, S., Németh, G. and Gyires-Tóth, B. (2019) Grapheme-to-Phoneme Conversion with Convolutional Neural Networks. Applied Sciences, 9, 1143.
https://doi.org/10.3390/app9061143
