
- 2017

Corpus Selection for Uyghur-Chinese Machine Translation Based on Bilingual Sentence-Pair Coverage

DOI: 10.3969/j.issn.0253-2778.2017.04.001

朱少林,杨雅婷,米成刚,李晓,王磊

Keywords: statistical machine translation, bilingual sentence pairs, corpus selection


Abstract:

When selecting a training corpus, the redundant information in the data includes redundancy at both the lexical level and the sentence level. Existing methods mainly select data by lexical-level coverage. These methods effectively reduce the redundancy of words and phrases, but they do not take sentence-level coverage into account. To select a smaller training corpus from a large-scale bilingual corpus while obtaining a translation system that is as good as, or better than, one trained on the full data, this paper selects parallel data according to bilingual sentence-pair coverage and proposes a selection method that combines unseen n-grams with edit distance. The experimental results show that the proposed method uses less training data yet achieves translation performance nearly equivalent to that of a system trained on the original corpus.
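The abstract names the two ingredients (unseen n-grams and edit distance) but does not say how they are combined. The following is a minimal sketch of one plausible combination, not the authors' algorithm. It assumes whitespace-tokenized sentences, checks coverage on the source side only, and uses hypothetical names and thresholds (`select_pairs`, `min_new_ngrams`, `min_norm_dist`) chosen purely for illustration. A pair is kept if it contributes n-grams not yet covered (lexical-level coverage) and is not a near-duplicate of an already selected sentence under length-normalized edit distance (sentence-level coverage).

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], max_n: int) -> Set[Tuple[str, ...]]:
    """Return all n-grams of the token sequence for n = 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def select_pairs(pairs: List[Tuple[str, str]],
                 max_n: int = 2,
                 min_new_ngrams: int = 1,
                 min_norm_dist: float = 0.3) -> List[Tuple[str, str]]:
    """Greedy selection sketch; thresholds are illustrative assumptions."""
    seen: Set[Tuple[str, ...]] = set()
    selected: List[Tuple[str, str]] = []
    selected_tokens: List[List[str]] = []
    for src, tgt in pairs:
        tokens = src.split()
        grams = ngrams(tokens, max_n)
        # Lexical-level filter: skip pairs that add too few unseen n-grams.
        if len(grams - seen) < min_new_ngrams:
            continue
        # Sentence-level filter: skip near-duplicates of selected sentences.
        if any(edit_distance(tokens, other) / max(len(tokens), len(other), 1)
               < min_norm_dist for other in selected_tokens):
            continue
        selected.append((src, tgt))
        selected_tokens.append(tokens)
        seen |= grams
    return selected


if __name__ == "__main__":
    demo = [("a b c d", "x"), ("a b c d", "y"),
            ("a b c e", "z"), ("f g h i", "w")]
    print(select_pairs(demo))
```

In the demo, the second pair is dropped because it adds no unseen n-grams, and the third is dropped because it differs from the first by a single token (normalized distance 0.25, below the assumed 0.3 threshold). Comparing each candidate against every selected sentence is quadratic in corpus size, so a real system over a large bilingual corpus would need indexing or blocking that this sketch does not attempt.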
