%0 Journal Article
%T Detecting new Chinese words from massive domain texts with word embedding
%A Baojun Ma
%A Hua Yuan
%A Qiongwei Ye
%A Xiongwen Deng
%A Yang Du
%A Yu Qian
%J Journal of Information Science
%@ 1741-6485
%D 2019
%R 10.1177/0165551518786676
%X Textual information retrieval (TIR) is based on the relationships between word units. Traditional word segmentation techniques attempt to discern the word units accurately from texts; however, they are unable to appropriately and efficiently identify all new words. Identification of new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlation information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining to discover new words in domain corpora. First, we mapped all word units in a domain corpus to a high-dimensional word vector space. Second, we used a frequent n-gram word string mining method to identify a set of candidates for new words. Third, we designed a pruning strategy based on the word vectors to quantify the possibility of a word string being a new word, thereby allowing candidates to be evaluated by the similarity of the word units within the same string. In a comparative study, our experimental results showed that WEBM had a clear advantage in detecting new words from massive Chinese corpora.
%K natural language processing
%K new word detection
%K similarity measurement
%K textual information retrieval
%K word embedding
%U https://journals.sagepub.com/doi/full/10.1177/0165551518786676