Extractive Summarization Using Structural Syntax, Term Expansion and Refinement

DOI: 10.4236/ijis.2017.73004, PP. 55-71

Keywords: Data Extractive Summarization, Syntactical Structures, Sentence Similarity, Longest Common Subsequence, Term Expansion, WordNet, Local Thesaurus

Abstract:

This paper presents a procedure, and reports on experiments performed to study its utility, that combines a structural property of a text's sentences with term expansion using WordNet [1] and a local thesaurus [2] to select the most appropriate extractive summary for a particular document. Sentences were tagged and normalized, then subjected to the Longest Common Subsequence (LCS) algorithm [3] [4] to select the most similar subset of sentences. Similarity was calculated from the LCS of each pair of sentences in the document. A normalized score was calculated and used to rank the sentences. The top-ranked subset of the most similar sentences was then tokenized to produce a set of important keywords or terms. These terms were further expanded into two subsets using 1) WordNet; and 2) a local electronic dictionary/thesaurus. The three sets obtained (the original and the two expanded sets) were then recycled to further refine and expand the list of sentences selected from the original document. The process was repeated a number of times in order to find the best representative set of sentences. A final set of the top-ranked sentences was selected as candidate sentences for summarization. To verify the utility of the procedure, a number of experiments were conducted using an email corpus. The results were compared to those produced by human annotators as well as to results produced using a basic sentence similarity calculation method. The results were very encouraging and compared well to those of human annotators and to Jaccard sentence similarity.
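The pairwise LCS scoring and ranking step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the normalization formula, so dividing the LCS length by the longer sentence's length is an assumption, as is scoring each sentence by its mean similarity to all others. Lowercased whitespace tokens stand in for the tagged and normalized sentence form, and the function names (`lcs_length`, `lcs_similarity`, `jaccard_similarity`, `rank_sentences`) are hypothetical. A Jaccard token-overlap function is included only as the kind of baseline the abstract mentions.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer sequence length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def jaccard_similarity(a, b):
    """Baseline: size of token-set intersection over size of token-set union."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def rank_sentences(sentences, similarity=lcs_similarity):
    """Score each sentence by its mean similarity to every other sentence.

    Returns (order, scores), where `order` lists sentence indices from the
    highest score to the lowest.
    """
    # Assumption: plain lowercased tokens replace the tagged/normalized form.
    tokens = [s.lower().split() for s in sentences]
    scores = [0.0] * len(sentences)
    for i, j in combinations(range(len(sentences)), 2):
        sim = similarity(tokens[i], tokens[j])
        scores[i] += sim
        scores[j] += sim
    others = max(len(sentences) - 1, 1)
    scores = [s / others for s in scores]
    order = sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True)
    return order, scores


if __name__ == "__main__":
    email = [
        "the meeting is on monday",
        "the meeting is moved to monday",
        "lunch was great",
    ]
    order, scores = rank_sentences(email)
    for idx in order:
        print(f"{scores[idx]:.3f}  {email[idx]}")
```

Passing `similarity=jaccard_similarity` to `rank_sentences` produces the baseline ranking for the same input, which makes the two measures easy to compare side by side. The term expansion and iterative refinement stages are not sketched here, since they depend on WordNet and the local thesaurus.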

References

[1]  Kilgarriff, A. (2000) Wordnet: An Electronic Lexical Database. MIT Press, Cambridge.
[2]  http://www.mobysaurus.com
[3]  Conesa, B.A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szczesniak, M.W., Gaffney, D.J., Elo, L.L., Zhang, X. and Mortazavi, A. (2016) A Survey of Best Practices for RNA-seq Data Analysis. Genome Biology, 17, 13.
https://doi.org/10.1186/s13059-016-0881-8
[4]  Elhadi, M. and Al-Tobi, A. (2010) Refinements of Longest Common Subsequence Algorithm. 2010 IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), Washington DC, 16-19 May 2010, 1-5.
https://doi.org/10.1109/AICCSA.2010.5586959
[5]  Loza, V., Lahiri, S., Mihalcea, R. and Lai, P.H. (2014) Building a Dataset for Summarization and Keyword Extraction from Emails. In LREC, 2441-2446.
[6]  Talib, R., Hanif, M.K., Ayesha, S. and Fatima, F. (2016) Text Mining: Techniques, Applications and Issues. International Journal of Advanced Computer Science & Applications, 1, 414-418.
https://doi.org/10.14569/IJACSA.2016.071153
[7]  Pal, A.R., Maiti, P.K. and Saha, D. (2013) An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm and Wordnet. International Journal of Control Theory and Computer Modeling, 3, 15-23.
https://doi.org/10.5121/ijctcm.2013.3502
[8]  Hovy, E., Lin, C.Y., Zhou, L. and Fukumoto, J. (2006) Automated Summarization Evaluation with Basic Elements. Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC 2006), Genoa, 22-28 May 2006, 604-611.
[9]  André, P., Kittur, A. and Dow, S.P. (2014) Crowd Synthesis: Extracting Categories and Clusters from Complex Data. Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, Maryland, 15-19 February 2014, 989-998.
https://doi.org/10.1145/2531602.2531653
[10]  Mihalcea, R. (2004) Graph-Based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization. Proceedings of the 2004 ACL on Interactive Poster and Demonstration Sessions, Barcelona, 21-26 July 2004.
[11]  Luhn, H.P. (1958) The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2, 159-165.
https://doi.org/10.1147/rd.22.0159
[12]  Aggarwal, C.C., Ed. (2014) Data Classification: Algorithms and Applications. CRC Press, Boca Raton, Florida.
[13]  Klimt, B. and Yang, Y. (2004) The Enron Corpus: A New Dataset for Email Classification Research. In: Boulicaut, J.F., Esposito, F., Giannotti, F. and Pedreschi, D., Eds., Machine Learning: ECML 2004. Lecture Notes in Computer Science, Vol. 3201, Springer, Berlin, Heidelberg, 217-226.
[14]  Ramshaw, L.A. and Marcus, M.P. (1999) Text Chunking Using Transformation-Based Learning. In: Armstrong, S., et al., Eds., Natural Language Processing Using Very Large Corpora, Springer Netherlands, 157-176.
https://doi.org/10.1007/978-94-017-2390-9_10
[15]  Gupta, V. and Lehal, G.S. (2010) A Survey of Text Summarization Extractive Techniques. Journal of Emerging Technologies in Web Intelligence, 2, 258-268.
https://doi.org/10.4304/jetwi.2.3.258-268
[16]  Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B. and Kochut, K. (2017) Text Summarization Techniques: A Brief Survey. arXiv:1707.02268
[17]  Nenkova, A. and McKeown, K. (2011) Automatic Summarization. Foundations and Trends in Information Retrieval, 5, 103-233.
https://doi.org/10.1561/1500000015
[18]  Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A.Y., Foufou, S. and Bouras, A. (2014) A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. IEEE Transactions on Emerging Topics in Computing, 2, 267-279.
https://doi.org/10.1109/TETC.2014.2330519
[19]  Aggarwal, C.C. and Reddy, C.K., Eds. (2013) Data Clustering: Algorithms and Applications. CRC Press, Boca Raton, Florida.
[20]  Aggarwal, C.C. and Zhai, C. (2012) A Survey of Text Clustering Algorithms. In: Mining Text Data, Springer US, 77-128.
https://doi.org/10.1007/978-1-4614-3223-4_4
[21]  Yang, L. and Xi, J. (2015) Human Behavior Recognition: Semantics-Based Text Copy Detection Method. 2015 First International Conference on Computational Intelligence Theory, Systems and Applications (CCITSA), Yilan, 10-12 December 2015, 158-162.
https://doi.org/10.1109/CCITSA.2015.28
[22]  Elhadi, M. and Al-Tobi, A. (2010) Detection of Duplication in Documents and WebPages Based Documents Syntactical Structures through an Improved Longest Common Subsequence. IJIPM, 1, 138-147.
https://doi.org/10.4156/ijipm.vol1.issue1.16
[23]  Potthast, M., Barrón-Cedeño, A., Stein, B. and Rosso, P. (2011) Cross-Language Plagiarism Detection. Language Resources and Evaluation, 45, 45-62.
https://doi.org/10.1007/s10579-009-9114-z
[24]  Bin-Habtoor, A.S. and Zaher, M.A. (2012) A Survey on Plagiarism Detection Systems. International Journal of Computer Theory and Engineering, 4, 185.
https://doi.org/10.7763/IJCTE.2012.V4.447
[25]  Osman, A.H., Salim, N. and Abuobieda, A. (2012) Survey of Text Plagiarism Detection. Computer Engineering and Applications Journal (ComEngApp), 1, 37-45.
[26]  Poinçot, P., Lesteven, S. and Murtagh, F. (1998) Comparison of Two "Document Similarity Search Engines". ASP Conference Series, 153, 85.
[27]  Elhadi, M. and Al-Tobi, A. (2009) Webpage Duplicate Detection Using Combined POS and Sequence Alignment Algorithm. 2009 WRI World Congress on Computer Science and Information Engineering, Los Angeles, 31 March-2 April 2009, 630-634.
https://doi.org/10.1109/CSIE.2009.771
[28]  Grossman, D.A. and Frieder, O. (2012) Information Retrieval: Algorithms and Heuristics. Springer Science & Business Media, Berlin.
[29]  Pabinger, S., Dander, A., Fischer, M., Snajder, R., Sperk, M., Efremova, M., Krabichler, B., Speicher, M.R., Zschocke, J. and Trajanoski, Z. (2014) A Survey of Tools for Variant Analysis of Next-Generation Genome Sequencing Data. Briefings in Bioinformatics, 15, 256-278.
[30]  Baral, C. (2004) Local Alignment: Smith-Waterman Algorithm, CSE 591: Computational Molecular Biology Course, Arizona State University.
[31]  Li, H. and Homer, N. (2010) A Survey of Sequence Alignment Algorithms for Next-Generation Sequencing. Briefings in Bioinformatics, 11, 473-483.
https://doi.org/10.1093/bib/bbq015
[32]  Schmid, H. (2013) Probabilistic Part-of-Speech Tagging Using Decision Trees.
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger1.pdf
[33]  Schmid, H. (1995) TreeTagger—A Language Independent Part-of-Speech Tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Vol. 43, 28.
[34]  Dalianis, H. (2000) SweSum—A Text Summarizer for Swedish No. TRITA-NA-P0015. NADA, KTH, Stockholm.
[35]  Hassel, M. (2007) Resource Lean and Portable Automatic Text Summarization [Internet].
http://www.csc.kth.se/~xmartin/papers/Nodalida03final.pdf
[36]  Jones, K.S. (2007) Automatic Summarising: The State of the Art. Information Processing & Management, 43, 1449-1481.
https://doi.org/10.1016/j.ipm.2007.03.009
[37]  Barzilay, R. and Elhadad, M. (1999) Using Lexical Chains for Text Summarization. In: Mani, I. and Maybury, M.T., Eds., Advances in Automatic Text Summarization, The MIT Press, Cambridge, 111-121.
[38]  Hassel, M. (2003) Exploitation of Named Entities in Automatic Text Summarization for Swedish. 14th Nordic Conference on Computational Linguistics, Reykjavik, 30-31 May 2003.
[39]  Nobata, C., Sekine, S., Isahara, H. and Grishman, R. (2002) Summarization System Integrated with Named Entity Tagging and IE Pattern Discovery. Proceedings of the LREC-2002 Conference, Canaria, May 2002, 1742-1745.
[40]  Mani, I. and Maybury, M.T., Eds. (1999) Advances in Automatic Text Summarization. MIT Press, Cambridge.
[41]  Mani, I. (2001) Automatic Summarization [Internet]. Vol. 3. John Benjamins Publishing, Amsterdam.
[42]  Mani, I. (2001) Recent Developments in Text Summarization. Proceedings of the Tenth International Conference on Information and Knowledge Management, Atlanta, 5-10 October 2001, 529-531.
https://doi.org/10.1145/502585.502677
[43]  Dalianis, H. and Åström, E. (2001) SweNam—A Swedish Named Entity Recognizer. Technical Report, TRITANA-P0113, IPLab-189, KTH NADA.
[44]  Bijalwan, V., Kumari, P., Pascual, J. and Semwal, V.B. (2014) Machine Learning Approach for Text and Document Mining. arXiv:1406.1580
[45]  Edmundson, H.P. (1969) New Methods in Automatic Extracting. Journal of the ACM (JACM), 16, 264-285.
https://doi.org/10.1145/321510.321519
[46]  Lin, C.Y. and Hovy, E. (1997) Identify Topics by Position. Proceedings of the 5th Conference on Applied Natural Language Processing, Washington DC, 31 March-3 April 1997, 283-290.
https://doi.org/10.3115/974557.974599
[47]  Hovy, E. and Lin, C.Y. (1998) Automated Text Summarization and the SUMMARIST System. Proceedings of a Workshop, Baltimore, 13-15 October 1998, 197-214.
[48]  Chuang, W.T. and Yang, J. (2000) Extracting Sentence Segments for Text Summarization: A Machine Learning Approach. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, 24-28 July 2000, 152-159.
[49]  Lin, C.Y. and Hovy, E. (2000) The Automated Acquisition of Topic Signatures for Text Summarization. Proceedings of the 18th Conference on Computational Linguistics, 1, 495-501.
https://doi.org/10.3115/990820.990892
[50]  Mihalcea, R. and Tarau, P. (2004) TextRank: Bringing Order into Text. EMNLP, 4, 404-411.
[51]  Page, L., Brin, S., Motwani, R. and Winograd, T. (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab, Stanford.
[52]  Lin, H. and Bilmes, J. (2011) A Class of Submodular Functions for Document Summarization. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 1, 510-520.
[53]  Murray, G., Renals, S. and Carletta, J. (2005) Extractive Summarization of Meeting Recordings. Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, 4-8 September 2005, 593-596.
[54]  Gambhir, M. and Gupta, V. (2017) Recent Automatic Text Summarization Techniques: A Survey. Artificial Intelligence Review, 47, 1-66.
https://doi.org/10.1007/s10462-016-9475-9
[55]  Ajmal, E.B. and Haroon, R.P. (2016) Maximal Marginal Relevance Based Malayalam Text Summarization with Successive Thresholds. International Journal on Cybernetics & Informatics, 5, 349-356.
https://doi.org/10.5121/ijci.2016.5237
[56]  Maguitman, A.G., Menczer, F., Roinestad, H. and Vespignani, A. (2005) Algorithmic Detection of Semantic Similarity. Proceedings of the 14th International Conference on World Wide Web, 10-14 May 2005, 107-116.
[57]  Erkan, G. and Radev, D.R. (2004) Lexrank: Graph-Based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 22, 457-479.
[58]  Litvak, M. and Last, M. (2008) Graph-Based Keyword Extraction for Single-Document Summarization. Proceedings of the Workshop on Multi-Source Multilingual Information Extraction and Summarization, Manchester, 23 August 2008, 17-24.
https://doi.org/10.3115/1613172.1613178
[59]  Palshikar, G.K. (2007) Keyword Extraction from a Single Document Using Centrality Measures. International Conference on Pattern Recognition and Machine Intelligence, Kolkata, 18 December 2007, 503-510.
https://doi.org/10.1007/978-3-540-77046-6_62
[60]  Lin, C.Y. (2004) Rouge: A Package for Automatic Evaluation of Summaries. Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL, Barcelona, 25 July 2004.
[61]  Hassel, M. (2004) Evaluation of Automatic Text Summarization: A Practical Implementation. Doctoral Dissertation, Numerisk analys och datalogi.
[62]  Jing, H., Barzilay, R., McKeown, K. and Elhadad, M. (1998) Summarization Evaluation Methods: Experiments and Analysis. AAAI Symposium on Intelligent Summarization, Stanford, 23 March 1998, 51-59.
[63]  Muresan, S., Tzoukermann, E. and Klavans, J.L. (2001) Combining Linguistic and Machine Learning Techniques for Email Summarization. Proceedings of the 2001 Workshop on Computational Natural Language Learning, 7, 19.
https://doi.org/10.3115/1117822.1117837
[64]  Rambow, O., Shrestha, L., Chen, J. and Lauridsen, C. (2004) Summarizing Email Threads. HLT-NAACL-Short 04 Proceedings of HLT-NAACL, Boston, 2-7 May 2004, 105-108.
https://doi.org/10.3115/1613984.1614011
[65]  Nenkova, A. and Bagga, A. (2003) Email Classification for Contact Centers. Proceedings of the 2003 ACM Symposium on Applied Computing, Melbourne, FL, 9 March 2003, 789-792.
https://doi.org/10.1145/952532.952689
[66]  Shlens, J. (2014) A Tutorial on Principal Component Analysis. arXiv:1404.1100
[67]  Zajic, D., Dorr, B.J., Lin, J. and Schwartz, R. (2007) Multi-Candidate Reduction: Sentence Compression as a Tool for Document Summarization Tasks. Information Processing & Management, 43, 1549-1570.
https://doi.org/10.1016/j.ipm.2007.01.016
[68]  Corston-Oliver, S., Ringger, E., Gamon, M. and Campbell, R. (2004) Task-Focused Summarization of Email. ACL-04 Workshop: Text Summarization Branches Out, 43-50.
[69]  Carenini, G., Ng, R.T. and Zhou, X. (2007) Summarizing Email Conversations with Clue Words. Proceedings of the 16th International Conference on World Wide Web, Banff, 8 May 2007, 91-100.
https://doi.org/10.1145/1242572.1242586
[70]  Murray, G., Carenini, G. and Ng, R. (2010) Generating and Validating Abstracts of Meeting Conversations: A User Study. Proceedings of the 6th International Natural Language Generation Conference, Meath, 7 July 2010, 105-113.
[71]  Carenini, G., Ng, R.T. and Zhou, X. (2008) Summarizing Emails with Conversational Cohesion and Subjectivity. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 8, 353-361.
[72]  Monostori, K., Finkel, R., Zaslavsky, A., Hodász, G. and Pataki, M. (2002) Comparison of Overlap Detection Techniques. Computational Science—ICCS 2002, Amsterdam, 21-24 April 2002, 51-60.
https://doi.org/10.1007/3-540-46043-8_4
[73]  Elhadi, M.T. (2012) Text Similarity Calculations Using Text and Syntactical Structures. 2012 7th International Conference on Computing and Convergence Technology (ICCCT), Seoul, 3 December 2012, 715-719.
[74]  Elhadi, M. and Al-Tobi, A. (2009) Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures. Fourth International Conference on Computer Sciences and Convergence Information Technology, Seoul, 24-26 November 2009, 679-684.
https://doi.org/10.1109/ICCIT.2009.235
[75]  Elhadi, M. and Al-Tobi, A. (2008) Use of Text Syntactical Structures in Detection of Document Duplicates. Third International Conference on Digital Information Management, London, 13-16 November 2008, 520-525.
https://doi.org/10.1109/ICDIM.2008.4746719
[76]  Elhadi, M. and Al-Tobi, A. (2009) Part of Speech (POS) Tag Sets Reduction and Analysis Using Rough Set Techniques. Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Delhi, 15-18 December 2009, 223-230.
https://doi.org/10.1007/978-3-642-10646-0_27
[77]  Elhadi, M.T. (2016) Arabic Text Copy Detection Using Full, Reduced and Unique Syntactical Structures. International Journal of Computer Applications, 154, 13-17.
https://doi.org/10.5120/ijca2016912088
