Modern Linguistics 2021
Research on an Automatic Verification Method for Chinese-Turkish Sentence Alignment
Abstract:
Sentence-aligned parallel corpora collected from the Internet often contain misaligned sentence pairs or poor-quality translations. To address this problem, this paper proposes an automatic verification method for Chinese-Turkish parallel corpora based on back-translation. The method obtains back-translations from an online machine translation system, treats the language of the translated text as the intermediate language, and builds a bag-of-words model over it to represent sentence similarity as a vector. A binary classification model trained with machine learning on these vectors then judges whether each sentence pair is aligned. Experimental results show that the system verifies sentence alignment reasonably well when either Chinese or Turkish is used as the intermediate language.
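The pipeline outlined in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the back-translation has already been retrieved (no machine translation service is called), uses plain whitespace tokenization, picks two similarity features (cosine and Jaccard overlap of the bags of words) purely for illustration, and substitutes a from-scratch logistic regression for the binary classifier, which the abstract does not specify. All function names and the toy Turkish sentences are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Token counts from lowercased whitespace tokens (assumed tokenizer)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def features(sentence, back_translation):
    """Similarity vector for one pair: [cosine, Jaccard overlap]."""
    x, y = bag_of_words(sentence), bag_of_words(back_translation)
    union = x.keys() | y.keys()
    jaccard = len(x.keys() & y.keys()) / len(union) if union else 0.0
    return [cosine(x, y), jaccard]


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Logistic regression fitted by plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob - label
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def is_aligned(model, sentence, back_translation):
    """Return 1 if the pair is classified as aligned, else 0."""
    w, b = model
    x = features(sentence, back_translation)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Toy training set: (Turkish sentence, back-translation of the Chinese side
# into Turkish, label). Label 1 = aligned, 0 = misaligned.
TRAIN = [
    ("bugün hava çok güzel", "bugün hava güzel", 1),
    ("kitap masanın üstünde", "kitap masanın üstünde duruyor", 1),
    ("yarın okula gidiyorum", "yarın okula gidiyorum", 1),
    ("bugün hava çok güzel", "kitap masanın üstünde", 0),
    ("yarın okula gidiyorum", "bugün hava güzel", 0),
    ("bugün hava çok güzel", "bugün okula gidiyorum", 0),
    ("kitap masanın üstünde", "yarın hava güzel olacak", 0),
]

model = train_logistic(
    [features(s, bt) for s, bt, _ in TRAIN],
    [label for _, _, label in TRAIN],
)

print(is_aligned(model, "bugün hava çok güzel", "bugün hava çok güzel oldu"))
print(is_aligned(model, "bugün hava çok güzel", "kitap masada"))
```

Here Turkish plays the role of the intermediate language; the same sketch applies with Chinese as the intermediate language if the tokenizer is replaced by a Chinese word segmenter, since whitespace splitting does not work for Chinese text.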