The Application of LSH Technology in Similar Question Detection
(Chinese title: 基于LSH技术的试题相似度检测方法)

DOI: 10.12677/CSA.2020.104077, PP. 741-748

Keywords: Examination Checking, LSH Algorithm, Jaccard Similarity, K-Shingle


Abstract:

The repetition rate of test questions is an important indicator of the quality of a question bank and of the test papers drawn from it. To find similar questions in a question bank quickly, this paper studies a detection method that applies K-shingle-based Jaccard similarity, MinHash, and LSH (locality-sensitive hashing). The method first segments the question stem into Chinese words and, after appropriate preprocessing, converts it into a K-shingle set. It then computes a MinHash signature for each set, uses LSH to quickly identify candidate pairs of similar questions, and computes the Jaccard similarity of each candidate pair. A pair whose similarity exceeds a given threshold is reported as similar. Use of the method in a question bank system confirmed its feasibility and gave good results.
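
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the seeded BLAKE2b hashing, and the parameter values (`k`, `num_hashes`, `bands`, `threshold`) are all assumptions chosen for illustration, and the Chinese word segmentation step is assumed to have been done already, so each question stem arrives as whitespace-separated tokens.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(tokens, k=2):
    """Return the set of k-token shingles of an already segmented question stem."""
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed (assumed scheme)."""
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=16):
    """Split signatures into bands; items that agree on any whole band are candidates."""
    if not signatures:
        return set()
    # Rows per band; trailing rows are ignored if the length is not divisible.
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(key)
        for keys in buckets.values():
            candidates.update(combinations(sorted(keys), 2))
    return candidates


def find_similar(questions, k=2, threshold=0.5, num_hashes=32, bands=16):
    """Return {(id1, id2): similarity} for candidate pairs above the threshold."""
    sets = {qid: shingles(text.split(), k) for qid, text in questions.items()}
    sigs = {qid: minhash_signature(s, num_hashes) for qid, s in sets.items()}
    result = {}
    for a, b in lsh_candidate_pairs(sigs, bands):
        sim = jaccard(sets[a], sets[b])  # verify each candidate with exact Jaccard
        if sim > threshold:
            result[(a, b)] = sim
    return result


if __name__ == "__main__":
    # Hypothetical, pre-segmented question stems used only for illustration.
    bank = {
        "q1": "what is the time complexity of binary search",
        "q2": "what is the time complexity of binary search",
        "q3": "name the capital city of france",
    }
    print(find_similar(bank))
```

Only the pairs that LSH places in a shared bucket are compared exactly, which is what avoids computing the Jaccard similarity for every pair in the bank. Using more bands with fewer rows each makes the candidate step more permissive, at the cost of more exact comparisons.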

