With the rise of Web 2.0, Community Question Answering (CQA) services such as Yahoo! Answers (http://answers.yahoo.com), WikiAnswers (http://wiki.answers.com), and Baidu Zhidao (http://zhidao.baidu.com) have emerged as alternatives for knowledge and information acquisition. Over time, a large number of high-quality question and answer (Q&A) pairs contributed by users have accumulated into a comprehensive knowledge base. Unlike search engines, which return long lists of results, search in CQA services can obtain correct answers to question queries by automatically finding similar questions that other users have already answered. This greatly improves the efficiency of online information retrieval. However, given a question query, finding similar and well-answered questions is a non-trivial task. The main challenge is the word mismatch between the question query (query) and the candidate question for retrieval (question). To address this problem, we capture the word semantic similarity between query and question using a topic modeling approach. We then propose an unsupervised machine-learning approach to finding similar questions in CQA Q&A archives. Experimental results show that the proposed approach significantly outperforms state-of-the-art methods.
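To make the word-mismatch idea concrete, the following is a minimal sketch of how topic distributions could bridge a query and a candidate question that share no words. It is not the method proposed in this study: the `WORD_TOPICS` table, its probabilities, and the helper names `topic_vector`, `topic_similarity`, and `rank` are all hypothetical and chosen only for illustration. In practice the per-word topic distributions would be learned from the Q&A archive by a topic model rather than written by hand.

```python
import math

# Hypothetical P(topic | word) over two topics. The values are invented for
# illustration; a real system would estimate them with a topic model.
WORD_TOPICS = {
    "laptop":   [0.90, 0.10],
    "notebook": [0.80, 0.20],
    "repair":   [0.70, 0.30],
    "fix":      [0.70, 0.30],
    "recipe":   [0.10, 0.90],
    "cake":     [0.05, 0.95],
}


def topic_vector(text):
    """Average the topic distributions of the known words in `text`.

    Returns None when no word of `text` appears in WORD_TOPICS.
    """
    vectors = [WORD_TOPICS[w] for w in text.lower().split() if w in WORD_TOPICS]
    if not vectors:
        return None
    num_topics = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(num_topics)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def topic_similarity(query, question):
    """Similarity of query and question in topic space (0.0 if either is unknown)."""
    query_vec = topic_vector(query)
    question_vec = topic_vector(question)
    if query_vec is None or question_vec is None:
        return 0.0
    return cosine(query_vec, question_vec)


def rank(query, questions):
    """Order candidate questions by decreasing topic similarity to the query."""
    return sorted(questions, key=lambda q: topic_similarity(query, q), reverse=True)


if __name__ == "__main__":
    candidates = ["cake recipe", "repair notebook"]
    for question in rank("fix laptop", candidates):
        print(question, round(topic_similarity("fix laptop", question), 3))
```

Here the query "fix laptop" and the candidate "repair notebook" have no word in common, so a purely lexical matcher would score them as unrelated, yet their averaged topic vectors are close and the candidate ranks first. Averaging word-level distributions and comparing them with cosine similarity is one simple design choice among many; it is used here only because it keeps the sketch short.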