
- 2019

Detecting new Chinese words from massive domain texts with word embedding

DOI: 10.1177/0165551518786676

Baojun Ma, Hua Yuan, Qiongwei Ye, Xiongwen Deng, Yang Du, Yu Qian

Keywords: natural language processing, new word detection, similarity measurement, textual information retrieval, word embedding


Abstract:

Textual information retrieval (TIR) is based on the relationships between word units. Traditional word segmentation techniques attempt to identify the word units in a text accurately; however, they cannot appropriately and efficiently identify all new words. Identifying new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlation information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining to discover new words in domain corpora. First, we map all word units in a domain corpus to a high-dimensional word vector space. Second, we use a frequent n-gram word string mining method to identify a set of new-word candidates. Third, we design a pruning strategy based on the word vectors to quantify the possibility that a word string is a new word, which allows candidates to be evaluated by the similarity of the word units within the same string. In a comparative study, our experimental results show that WEBM has a clear advantage in detecting new words in massive Chinese corpora.
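
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy vectors, the function names (`frequent_ngrams`, `string_score`, `detect_new_words`), the thresholds `min_count` and `min_sim`, and the scoring rule (mean cosine similarity of adjacent word units, with higher values taken to mean a more likely new word) are all assumptions made for illustration. In practice the vectors would come from an embedding model trained on the domain corpus.

```python
from collections import Counter
from math import sqrt

# Toy segmented corpus and hand-made 2-d vectors (hypothetical data).
SENTENCES = [
    ["区块", "链", "的", "技术"],
    ["区块", "链", "的", "应用"],
    ["技术", "的", "应用"],
]
VECTORS = {
    "区块": [0.9, 0.1],
    "链": [0.8, 0.2],
    "的": [0.1, 0.9],
    "技术": [0.5, 0.5],
    "应用": [0.6, 0.4],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frequent_ngrams(sentences, n=2, min_count=2):
    """Count every n-gram of word units and keep those seen at least min_count times."""
    counts = Counter()
    for units in sentences:
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def string_score(gram, vectors):
    """Assumed score: mean cosine similarity between adjacent units of the string."""
    sims = [cosine(vectors[a], vectors[b]) for a, b in zip(gram, gram[1:])]
    return sum(sims) / len(sims)


def detect_new_words(sentences, vectors, n=2, min_count=2, min_sim=0.8):
    """Mine frequent n-grams, then prune candidates whose units are dissimilar."""
    kept = []
    for gram in frequent_ngrams(sentences, n, min_count):
        if all(unit in vectors for unit in gram):
            if string_score(gram, vectors) >= min_sim:
                kept.append("".join(gram))
    return sorted(kept)


if __name__ == "__main__":
    print(detect_new_words(SENTENCES, VECTORS))
```

On this toy data three bigrams pass the frequency filter, and only the string whose units have close vectors survives the similarity pruning. The threshold of 0.8 is arbitrary and would need tuning on a real corpus.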
