Data Augmentation for Chinese-Japanese Neural Machine Translation Based on Simplified-Traditional Chinese Character Conversion
Abstract:
This paper proposes a data augmentation method for neural machine translation (NMT) based on conversion between Simplified and Traditional Chinese characters. Using a Simplified-Traditional character conversion table, the method replaces source-side characters with their target-side counterparts, thereby incorporating character conversion information and improving translation quality. The method is applied to the Chinese-Japanese machine translation task, and the experimental results show that it is an effective data augmentation method that significantly improves Chinese-Japanese translation quality.
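The character replacement step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the four-entry table `TOY_TABLE`, the function names `convert_source` and `augment`, and the choice to append converted pairs to the original corpus are all assumptions made for the example. The abstract does not say how the converted sentences are combined with the original data.

```python
# Toy table mapping Simplified Chinese characters to their Japanese kanji forms.
# Assumption: these four entries are illustrative only; they stand in for the
# much larger conversion table used in the paper.
TOY_TABLE = {"机": "機", "译": "訳", "汉": "漢", "语": "語"}


def convert_source(sentence, table):
    """Replace every character that has a table entry; keep all others."""
    return "".join(table.get(ch, ch) for ch in sentence)


def augment(pairs, table):
    """Return the original pairs plus converted copies whose source changed.

    Assumption: augmentation here means appending (converted source, target)
    pairs to the original corpus.
    """
    augmented = list(pairs)
    for src, tgt in pairs:
        converted = convert_source(src, table)
        if converted != src:  # skip sentences the table does not affect
            augmented.append((converted, tgt))
    return augmented


# Hypothetical two-sentence Chinese-Japanese corpus.
corpus = [("机器翻译", "機械翻訳"), ("你好", "こんにちは")]
augmented_corpus = augment(corpus, TOY_TABLE)
for src, tgt in augmented_corpus:
    print(src, "->", tgt)
```

In this sketch the converted source shares more characters with the Japanese target than the original source does, which is the kind of overlap the method aims to exploit.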