Topic Identification and Evolution Based on an Improved LDA Model: A Case Study of the Open Source Software Domain
Abstract:
Purpose: LDA-based methods for topic identification and evolution analysis have two known limitations: the number of topics is hard to choose, and time windows are partitioned subjectively. This paper proposes improvements that address both. Methods: Topic vectors are computed by combining the TF-IDF algorithm with Word2Vec word embeddings, which reduces the influence of common words on topic generation and gives the topic vectors a semantic representation. For evolution analysis, time windows are partitioned according to changes in the semantic distance between topics, and the evolution of topic intensity and topic content in the target domain is tracked. An empirical study is carried out on the research literature on open source software. Results: The proposed method effectively identifies the research topics and hot topics of the field, traces the paths along which topics evolve over time, and presents them visually. Conclusions: Open source software research has six key topics, of which "open source governance" and "market competition" are the hot topics. In terms of topic content, the research focus is shifting from the self-governing motivations of individuals who participate spontaneously to participation at the organizational level by enterprises, governments, and other organizations.
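The two methodological steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: a topic vector is taken to be the TF-IDF-weighted average of the Word2Vec vectors of a topic's top words, semantic distance is taken to be cosine distance, and a new time window starts whenever the distance between consecutive years exceeds a threshold. The function names (`topic_vector`, `cosine_distance`, `split_windows`), the toy embeddings, and the threshold value are all hypothetical.

```python
import numpy as np


def topic_vector(top_words, tfidf_weights, embeddings):
    """TF-IDF-weighted average of word vectors (assumed form, not the paper's formula)."""
    vecs, weights = [], []
    for word in top_words:
        # Skip words that lack either an embedding or a TF-IDF weight.
        if word in embeddings and word in tfidf_weights:
            vecs.append(embeddings[word])
            weights.append(tfidf_weights[word])
    if not vecs:
        raise ValueError("no top word has both an embedding and a TF-IDF weight")
    return np.average(np.array(vecs, dtype=float), axis=0, weights=np.array(weights))


def cosine_distance(a, b):
    """1 - cosine similarity; assumed here as the semantic distance measure."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def split_windows(years, yearly_vectors, threshold):
    """Open a new window when consecutive years differ by more than `threshold`."""
    windows = [[years[0]]]
    for i in range(1, len(years)):
        if cosine_distance(yearly_vectors[i - 1], yearly_vectors[i]) > threshold:
            windows.append([years[i]])
        else:
            windows[-1].append(years[i])
    return windows


# Toy 2-D "embeddings" and TF-IDF weights, invented for this example only.
embeddings = {"license": [1.0, 0.0], "community": [0.9, 0.1], "firm": [0.0, 1.0]}
tfidf_weights = {"license": 2.0, "community": 1.0, "firm": 1.0}
vec = topic_vector(["license", "community"], tfidf_weights, embeddings)

# One hypothetical topic vector per year; the jump between 2002 and 2003 is large.
years = [2001, 2002, 2003, 2004]
yearly_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
windows = split_windows(years, yearly_vectors, threshold=0.5)
print(vec, windows)
```

In practice the embeddings would come from a trained Word2Vec model and the weights from a TF-IDF computation over the corpus. Splitting on distance rather than on fixed-length intervals is what removes the manual choice of window boundaries, although the threshold itself still has to be chosen.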