%0 Journal Article
%T Detecting new Chinese words from massive domain texts with word embedding
%A Baojun Ma
%A Hua Yuan
%A Qiongwei Ye
%A Xiongwen Deng
%A Yang Du
%A Yu Qian
%J Journal of Information Science
%@ 1741-6485
%D 2019
%R 10.1177/0165551518786676
%X Textual information retrieval (TIR) is based on the relationship between word units. Traditional word segmentation techniques attempt to discern the word units accurately from texts; however, they are unable to appropriately and efficiently identify all new words. Identification of new words, especially in languages such as Chinese, remains a challenge. In recent years, word embedding methods have used numerical word vectors to retain the semantic and correlated information between words in a corpus. In this article, we propose the word-embedding-based method (WEBM), a novel method that combines word embedding and frequent n-gram string mining for discovering new words from domain corpora. First, we mapped all word units in a domain corpus to a high-dimension word vector space. Second, we used a frequent n-gram word string mining method to identify a set of candidates for new words. We designed a pruning strategy based on the word vectors to quantify the possibility of a word string being a new word, thereby allowing the evaluation of candidates based on the similarity of word units in the same string. In a comparative study, our experimental results revealed that WEBM had a great advantage in detecting new words from massive Chinese corpora.
%K Natural language processing
%K new word detection
%K similarity measurement
%K textual information retrieval
%K word embedding
%U https://journals.sagepub.com/doi/full/10.1177/0165551518786676