In this paper we demonstrate a family of metrics for estimating the quality of a text summary relative to one or more human-generated summaries. The metrics are based on features computed automatically from the summaries to measure both content and linguistic quality. The features are combined using one of three methods: robust regression, non-negative least squares, or canonical correlation (an eigenvalue method). The new metrics significantly outperform ROUGE, the previous standard for automatic text summarization evaluation.
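As a concrete illustration of the feature-combination step, the sketch below fits non-negative least squares weights (one of the three combination methods named above) to map automatic summary features onto human quality scores. It is a minimal example, not the paper's implementation: the feature values, score values, and feature names are invented, and scipy's nnls routine stands in for whatever solver was actually used.

import numpy as np
from scipy.optimize import nnls

# Illustrative feature matrix: one row per summary, one column per
# automatic feature (e.g., two content scores and a linguistic-quality
# score). All values here are made up for the sketch.
X = np.array([
    [0.12, 0.18, 3.1],
    [0.09, 0.15, 2.4],
    [0.15, 0.21, 3.8],
    [0.07, 0.11, 2.0],
])

# Hypothetical human quality judgments for the same four summaries.
y = np.array([3.4, 2.8, 4.1, 2.2])

# Non-negative least squares: find w >= 0 minimizing ||Xw - y||_2.
w, rnorm = nnls(X, y)

# The learned metric scores a summary as a non-negatively weighted
# sum of its features.
print("weights:", w)
print("residual norm:", rnorm)
print("fitted scores:", X @ w)

Constraining the weights to be non-negative keeps the combined metric interpretable: each feature can only raise a summary's predicted score, never lower it.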
References
[1]
Luhn, H.P. The Automatic Creation of Literature Abstracts. In Advances in Automatic Text Summarization; Mani, I., Maybury, M.T., Eds.; The MIT Press: Cambridge, MA, USA, 1999; pp. 58–63.
[2]
McKeown, K.; Radev, D.R. Generating Summaries of Multiple News Articles. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’95); ACM: New York, NY, USA, 1995; pp. 74–82.
[3]
Text Analysis Conference, NIST, 2011. Available online: http://www.nist.gov/tac (accessed on 19 September 2012).
[4]
Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the ACL-04 Workshop: Text Summarization Branches Out, Barcelona, Spain, 22–24 July 2004; pp. 74–81.
[5]
Conroy, J.M.; Dang, H.T. Mind the Gap: Dangers of Divorcing Evaluations of Summary Content from Linguistic Quality. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, 18–22 August 2008; pp. 145–152.
[6]
Conroy, J.M.; Schlesinger, J.D.; O’Leary, D.P. Nouveau-ROUGE: A novelty metric for update summarization. Comput. Linguist. 2011, 37, 1–8, doi:10.1162/coli_a_00033.
[7]
De Oliveira, P.C.F.; Torrens, E.W.; Cidral, A.; Schossland, S.; Bittencourt, E. Evaluating Summaries Automatically: A System Proposal. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08); European Language Resources Association (ELRA), Marrakech, Morocco, 28–30 May 2008. Available online: http://www.lrec-conf.org/proceedings/lrec2008/ (accessed on 19 September 2012).
[9]
Giannakopoulos, G.; Vouros, G.A.; Karkaletsis, V. MUDOS-NG: Multi-Document Summaries Using N-gram Graphs. Technical Report arXiv:1012.2042, 2010. Available online: http://arxiv.org/abs/1012.2042 (accessed on 19 September 2012).
[10]
Giannakopoulos, G.; Karkaletsis, V. AutoSummENG and MeMoG in Evaluating Guided Summaries. In Proceedings of the Text Analysis Conference (TAC 2011); NIST, Gaithersburg, MD, USA, 14–15 November 2011.
[11]
Pitler, E.; Louis, A.; Nenkova, A. Automatic Evaluation of Linguistic Quality in Multi-Document Summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 544–554.
[12]
Kumar, N.; Srinathan, K.; Varma, V. Using graph based mapping of co-occurring words and closeness centrality score for summarization evaluation. Comput. Linguist. Intell. Text Process. 2012, 7182, 353–365, doi:10.1007/978-3-642-28601-8_30.
[13]
Steinberger, J.; Ježek, K. Evaluation measures for text summarization. Comput. Inf. 2009, 28, 251–275.
[14]
Saggion, H.; Torres-Moreno, J.; Cunha, I.; SanJuan, E. Multilingual Summarization Evaluation Without Human Models. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters; Association for Computational Linguistics, Beijing, China, 23–27 August 2010; pp. 1059–1067.
[15]
Louis, A.; Nenkova, A. Automatically Evaluating Content Selection in Summarization without Human Models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, Singapore, 6–7 August 2009; pp. 306–314.
[16]
Document Understanding Conference, NIST, 2004. Available online: http://duc.nist.gov (accessed on 19 September 2012).
[17]
Over, P. Introduction to DUC-2001: An Intrinsic Evaluation of Generic News Text Summarization Systems. Technical Report, Retrieval Group, Information Access Division, National Institute of Standards and Technology: Gaithersburg, MD, USA, 2001.
[18]
Nenkova, A.; Passonneau, R.; McKeown, K. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Trans. Speech Lang. Process. 2007, 4, 1–4.
[19]
Conroy, J.M.; Schlesinger, J.D.; Rankel, P.A.; O’Leary, D.P. Guiding CLASSY Toward More Responsive Summaries. In Proceedings of the TAC 2010 Workshop, Gaithersburg, MD, USA, 15–16 November 2010. Available online: http://www.nist.gov/tac/publications/index.html (accessed on 19 September 2012).
[20]
Seber, G.A.F. Multivariate Observations (Wiley Series in Probability and Statistics); Wiley-Interscience: Hoboken, NJ, USA, 2004.
[21]
Tavernier, J.; Bellot, P. Combining Relevance and Readability for INEX 2011 Question-Answering Track. In Pre-Proceedings of INEX 2011; IR Publications: Amsterdam, The Netherlands, 2011; pp. 185–195.