Modeling topics in short texts presents significant challenges due to feature sparsity, particularly when analyzing user-generated content produced at scale online. This sparsity can substantially impair the accuracy of semantic capture. We propose a novel approach that incorporates pre-clustered knowledge into the BERTopic model while reducing the L2 norm of low-frequency words. Our method effectively mitigates feature sparsity during cluster mapping. Empirical evaluation on the StackOverflow dataset demonstrates that our approach outperforms baseline models, achieving superior Macro-F1 scores. These results validate the effectiveness of our proposed feature-sparsity reduction technique for short-text topic modeling.
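The authors' implementation is not reproduced here; the following is a minimal sketch, under stated assumptions, of one way to reduce the L2 norm of low-frequency words before clustering with BERTopic. The function name, the `word_vectors` mapping, and the `min_count` and `scale` parameters are illustrative placeholders rather than the paper's code, and the pre-clustered knowledge component is omitted.

```python
# Minimal sketch (not the authors' released code): shrink the L2 norm of
# low-frequency word embeddings, pool them into document vectors, and feed
# the result to a standard BERTopic pipeline as precomputed embeddings.
from collections import Counter

import numpy as np
from bertopic import BERTopic


def pool_with_rare_word_downweighting(docs, word_vectors, min_count=5, scale=0.5):
    """Average word vectors per document, scaling down (i.e. reducing the
    L2 norm of) vectors of words seen fewer than `min_count` times.
    `word_vectors` is assumed to map tokens to pretrained embedding arrays."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    dim = len(next(iter(word_vectors.values())))
    doc_vecs = np.zeros((len(docs), dim), dtype=np.float32)
    for i, doc in enumerate(docs):
        vecs = []
        for tok in doc.lower().split():
            vec = word_vectors.get(tok)
            if vec is None:
                continue
            if counts[tok] < min_count:
                vec = vec * scale  # lower the L2 norm of low-frequency words
            vecs.append(vec)
        if vecs:
            doc_vecs[i] = np.mean(vecs, axis=0)
    return doc_vecs


# Usage: pass the adjusted document vectors to BERTopic as precomputed
# embeddings so its UMAP + HDBSCAN stages cluster the sparsity-reduced space.
# topic_model = BERTopic()
# topics, probs = topic_model.fit_transform(docs, embeddings=doc_vecs)
```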