We propose using side information to inform anomaly detection algorithms of the semantic context of the text data they analyze, so that they consider both divergence from the statistical patterns seen in a particular dataset and divergence from more general semantic expectations. Computational experiments show that our algorithm performs as expected on data that reflect real-world events with contextual ambiguity, while replicating conventional clustering on data that are either too specialized or too generic for contextual information to be actionable. These results suggest that our algorithm could reduce false positive rates in existing anomaly detection systems.