
- 2017

Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation

DOI: 10.6040/j.issn.1671-9352.0.2017.095

YUAN Wei, YI Mian-zhu

Keywords: comparable corpora, Russian, Wikipedia


Abstract:

Russian and Chinese corpus research urgently needs new breakthroughs in data sources, research angles, and applications. Comparable corpora, thanks to their inherent advantages and wide range of uses, have become one of the research hotspots in corpus linguistics and natural language processing, yet no study of Russian-Chinese comparable corpora has so far been reported in China. This paper reviews existing work in China and abroad, designs a method for constructing a Russian-Chinese comparable corpus from Wikipedia, develops a system for automatically acquiring comparable texts, and builds a document-aligned Russian-Chinese comparable corpus containing more than a million words (characters). Finally, the comparability of the corpus is evaluated with a cross-language similarity calculation method. The results show that the method can effectively obtain Russian-Chinese texts with high comparability, and that the resulting corpus can be used in Russian-Chinese translation, discourse analysis, and computational linguistics research.
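The abstract does not say which cross-language similarity measure was used. As an illustration only, the sketch below shows one common dictionary-based approach: Russian tokens are mapped into Chinese through a bilingual lexicon, and the two bags of words are compared by cosine similarity. The function names, the toy lexicon `RU_ZH_DICT`, and the pre-tokenized inputs are all assumptions made for this example, not the authors' system.

```python
import math
from collections import Counter

# Hypothetical toy Russian -> Chinese lexicon (an assumption for illustration).
RU_ZH_DICT = {
    "москва": ["莫斯科"],
    "столица": ["首都"],
    "россия": ["俄罗斯"],
}

def translate_bag(tokens, bilingual_dict):
    """Map source tokens to target-language tokens; unknown tokens are dropped."""
    bag = Counter()
    for tok in tokens:
        for translation in bilingual_dict.get(tok, ()):
            bag[translation] += 1
    return bag

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def comparability(ru_tokens, zh_tokens, bilingual_dict):
    """Score a Russian/Chinese document pair in [0, 1]."""
    return cosine(translate_bag(ru_tokens, bilingual_dict), Counter(zh_tokens))

# Pre-tokenized toy documents (assumed; real text needs a tokenizer/segmenter).
ru_doc = ["москва", "столица", "россия"]
zh_doc = ["莫斯科", "是", "俄罗斯", "首都"]
score = comparability(ru_doc, zh_doc, RU_ZH_DICT)
print(round(score, 3))
```

In practice the score depends heavily on lexicon coverage and on Chinese word segmentation, so a threshold for "high comparability" would have to be tuned on the data at hand.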

