Readability indices measure how easy or difficult a text is to read and comprehend. In this paper, we examine the relation between readability indices and web documents from two perspectives. First, we analyse how to measure the readability of web documents reliably by applying content extraction techniques and incorporating a bias correction. Second, we investigate how web-based corpus statistics can be used to measure readability in a novel, language-independent way.