Deep Embedded Clustering and Its Application to Complaint Text Analysis
Abstract:
The Internet carries a huge volume of user-generated ultra-short texts involving power complaints. To identify the topics of these texts, this paper proposes a clustering model based on deep embedding. First, word embedding is performed with an improved algorithm to enrich the semantics of the text features and reduce the dimensionality of the data set. Then, on the basis of the word embeddings, Sentence-BERT is used to compute sentence similarity, which in turn drives the clustering of the short texts. Finally, the proposed method is applied to a self-crawled data set of Internet user messages involving power complaints to cluster the complaint texts by topic, and its results are compared with those of several existing topic recognition algorithms on the same data set, demonstrating the effectiveness of the proposed model.
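The similarity-then-clustering step described above can be illustrated with a minimal sketch. It is not the model proposed in the paper: the toy vectors stand in for sentence embeddings that a model such as Sentence-BERT would produce, and the greedy threshold clustering, the function names, and the `threshold` value are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_by_similarity(embeddings, threshold=0.8):
    """Greedy single-pass clustering (an illustrative stand-in, not the paper's algorithm).

    Each vector joins the existing cluster whose representative (its first
    member) is most similar, provided that similarity reaches `threshold`;
    otherwise the vector starts a new cluster.
    """
    representatives = []
    labels = []
    for vec in embeddings:
        best_label, best_sim = -1, threshold
        for label, rep in enumerate(representatives):
            sim = cosine_similarity(vec, rep)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label == -1:
            representatives.append(vec)
            best_label = len(representatives) - 1
        labels.append(best_label)
    return labels


# Hypothetical 3-dimensional "sentence embeddings" for four complaint texts.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 1.0],
]
labels = cluster_by_similarity(toy_embeddings, threshold=0.8)
print(labels)
```

In this toy input the first two vectors point in nearly the same direction, as do the last two, so they fall into two groups. With real embeddings the threshold would need tuning, and the result of a single-pass scheme like this depends on input order.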