Word Embeddings and Semantic Spaces in Natural Language Processing

DOI: 10.4236/ijis.2023.131001, pp. 1-21

Keywords: Natural Language Processing, Vector Space Models, Semantic Spaces, Word Embeddings, Representation Learning, Text Vectorization, Machine Learning, Deep Learning

Abstract:

One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) over the last two decades has been the development of text representation techniques that address the so-called curse of dimensionality, a problem which plagues NLP in general because the feature set for learning begins as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. Consequently, much of the research and development in NLP over this period has gone into finding and optimizing solutions to this problem, that is, into effective feature selection for NLP. This paper traces the development of these techniques, which leverage a variety of statistical methods resting on linguistic theories advanced in the middle of the last century, most notably the distributional hypothesis, which holds that words found in similar contexts generally have similar meanings. In this survey we examine some of the most popular of these techniques from both a mathematical and a data-structure perspective, from Latent Semantic Analysis and Vector Space Models to their more modern variants, which are typically referred to as word embeddings. In reviewing algorithms such as Word2Vec, GloVe, ELMo and BERT, we also explore the idea of semantic spaces more generally, beyond their applicability to NLP.
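To make the progression the abstract describes more concrete, the short Python sketch below builds a term-document matrix, reduces it with truncated SVD (the core step of Latent Semantic Analysis), and compares words by cosine similarity in the resulting low-dimensional semantic space. The toy corpus, parameter choices, and scikit-learn usage are illustrative assumptions of this summary, not code or data from the paper.

    # Minimal LSA / vector-space sketch (illustrative only, not from the paper):
    # weight terms per document, factor the term-document matrix, and measure
    # word similarity in the reduced semantic space.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "stocks fell as markets reacted to the report",
        "investors sold shares after the earnings report",
    ]

    # Sparse document-term matrix with TF-IDF weights (docs x terms).
    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(corpus)

    # LSA: keep only k latent dimensions, collapsing the feature space from
    # |vocabulary| down to k and easing the curse of dimensionality.
    svd = TruncatedSVD(n_components=2, random_state=0)
    word_vectors = svd.fit_transform(doc_term.T)   # shape: (terms, k)

    # Words appearing in similar contexts land close together in this space,
    # which is the distributional hypothesis in action.
    vocab = vectorizer.get_feature_names_out()
    idx = {word: i for i, word in enumerate(vocab)}
    sim = cosine_similarity(word_vectors)
    print(sim[idx["cat"], idx["dog"]])       # expected to be relatively high
    print(sim[idx["cat"], idx["markets"]])   # expected to be relatively low

The same distance-based view of meaning carries over to the embedding models surveyed in the paper (Word2Vec, GloVe, ELMo, BERT), which learn the word vectors directly rather than deriving them from a factored count matrix.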

References

[1]  Hinton, G. and Roweis, S. (2002) Stochastic Neighbor Embedding. In: Becker, S., Thrun, S. and Obermayer, K., Eds., Advances in Neural Information Processing Systems 15 (NIPS 2002), The MIT Press, Cambridge.
https://cs.nyu.edu/~roweis/papers/sne_final.pdf
[2]  Hu, J. (2020) An Overview of Text Representations in NLP. Towards Data Science.
https://towardsdatascience.com/an-overview-for-text-representations-in-nlp-311253730af1
[3]  Salton, G. (1971) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Hoboken.
[4]  Salton, G., Wong, A. and Yang, C.S. (1975) A Vector Space Model for Automatic Indexing. Communications of the ACM, 18, 613-620.
https://doi.org/10.1145/361219.361220
[5]  Sparck Jones, K. (1972) A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28, 11-21.
https://doi.org/10.1108/eb026526
[6]  Landauer, T.K. and Dumais, S.T. (1997) A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104, 211-240.
https://doi.org/10.1037/0033-295X.104.2.211
[7]  Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W. and Harshman, R.A. (1990) Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science (JASIS), 41, 391-407.
https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[8]  Turney, P. and Pantel, P. (2010) From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141-188.
https://doi.org/10.1613/jair.2934
[9]  Firth, J.R. (1957) Studies in Linguistic Analysis. Blackwell, Oxford.
[10]  Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013) Efficient Estimation of Word Representations in Vector Space. ArXiv: 1301.3781.
[11]  Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013) Distributed Representations of Words and Phrases and Their Compositionality. In: Burges, C.J., Bottou, L., Welling, M., Ghahramani, Z. and Weinberger, K.Q., Eds., Advances in Neural Information Processing Systems 26, Curran Associates, Inc., Red Hook, 3111-3119.
[12]  Pennington, J., Socher, R. and Manning, C. (2014) GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, 25-29 October 2014, 1532-1543.
[13]  Sarkar, D. (2018) A Hands-on Intuitive Approach to Deep Learning Methods for Text Data—Word2Vec, GloVe and FastText. Towards Data Science.
https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa
[14]  Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L. (2018) Deep Contextualized Word Representations. In: Walker, M., Ji, H. and Stent, A., Eds., Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, 2227-2237.
https://doi.org/10.18653/v1/N18-1202
[15]  Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019) BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In: Burstein, J., Doran, C. and Solorio, T., Eds., Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, 4171-4186.
[16]  Wang, Y., Hou, Y., Che, W. and Liu, T. (2020) From Static to Dynamic Word Representations: A Survey. International Journal of Machine Learning and Cybernetics, 11, 1611-1630.
https://doi.org/10.1007/s13042-020-01069-8
[17]  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. and Polosukhin, I. (2017) Attention Is All You Need. ArXiv: 1706.03762.
[18]  Galassi, A., Lippi, M. and Torroni, P. (2021) Attention in Natural Language Processing. IEEE Transactions on Neural Networks and Learning Systems, 32, 4291-4308.
https://doi.org/10.1109/TNNLS.2020.3019893
[19]  Koroteev, M.V. (2021) BERT: A Review of Applications in Natural Language Processing and Understanding. ArXiv: 2103.11943.
