Script Identification from Printed Indian Document Images and Performance Evaluation Using Different Classifiers

DOI: 10.1155/2014/896128


Abstract:

Identification of script from document images is an active area of research in document image processing for a multilingual, multiscript country like India. In this paper, the real-life problem of printed script identification from official Indian document images is considered, and the performance of different well-known classifiers is evaluated. Two important evaluation parameters, namely AAR (average accuracy rate) and MBT (model building time), are computed for this performance analysis. The experiment was carried out on 459 printed document images with 5-fold cross-validation. The Simple Logistic model shows the highest AAR of 98.9% among all classifiers. The BayesNet and Random Forest models have average accuracy rates of 96.7% and 98.2%, respectively, with the lowest MBT of 0.09 s.

1. Introduction

Automatic script identification is an active area of research in document image processing. The work is particularly relevant for a multiscript country like India, which officially has 22 languages written in 13 scripts [1]; counting English, the number of languages becomes 23. Automatic document processing helps convert physical real-world documents into digital text form, which is useful for further processing such as storage, retrieval, and indexing of large volumes of data. In our country, many languages use the same script for writing. For example, Devanagari is a well-known script in India used to write languages such as Hindi, Marathi, and Sanskrit, whereas Bangla is another popular script used to write languages such as Bangla, Assamese, and Manipuri. Multilingual documents are very common in daily life; examples include postal documents and pre-printed application forms. An optical character recognizer (OCR) built for a specific language will not work on such multilingual documents.
Therefore, to build a successful multilingual OCR, script identification is essential before running an individual OCR for a specific language. In this context, the problem of script identification is addressed here. Script identification techniques for printed documents can be divided into four major groups: (i) document-level, (ii) block-level, (iii) line-level, and (iv) word-level script identification. Document-level script identification is much faster than the other categories because the whole document is fed to the script identification system without fine segmentation into blocks, lines, or words. Ghosh et al. [2]
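The evaluation protocol named in the abstract (5-fold cross-validation, AAR, and MBT) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the feature vectors are synthetic, the `NearestCentroid` class is a toy stand-in rather than any of the classifiers evaluated in the paper, AAR is taken to be the mean of the per-fold accuracies, and MBT is taken to be the mean training time per fold. The paper's own definitions and tooling may differ.

```python
import random
import time
from statistics import mean


class NearestCentroid:
    """Toy stand-in classifier (hypothetical; not one of the paper's models)."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # One centroid (mean feature vector) per script label.
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def closest(features):
            return min(
                self.centroids,
                key=lambda label: sum(
                    (a - b) ** 2 for a, b in zip(features, self.centroids[label])
                ),
            )

        return [closest(features) for features in X]


def cross_validate(make_model, X, y, k=5, seed=0):
    """Return (AAR in percent, mean model building time in seconds) over k folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint held-out sets
    accuracies, build_times = [], []
    for held_out in folds:
        test_set = set(held_out)
        train = [i for i in indices if i not in test_set]
        # Time only the model building step, not prediction.
        start = time.perf_counter()
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        build_times.append(time.perf_counter() - start)
        predictions = model.predict([X[i] for i in held_out])
        correct = sum(p == y[i] for p, i in zip(predictions, held_out))
        accuracies.append(100.0 * correct / len(held_out))
    return mean(accuracies), mean(build_times)


def make_synthetic_dataset(n_samples=459, n_scripts=3, seed=1):
    """Hypothetical 2-D feature vectors, one well-separated cluster per script."""
    rng = random.Random(seed)
    X, y = [], []
    for n in range(n_samples):
        label = n % n_scripts
        X.append([rng.gauss(10.0 * label, 1.0), rng.gauss(-10.0 * label, 1.0)])
        y.append(label)
    return X, y


X, y = make_synthetic_dataset()
aar, mbt = cross_validate(NearestCentroid, X, y, k=5)
print(f"AAR = {aar:.1f}%, MBT = {mbt:.4f} s")
```

Swapping a different classifier into `cross_validate` only requires an object with `fit` and `predict` methods, which is how several models could be compared on the same folds.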

References

[1]  S. M. Obaidullah, S. K. Das, and K. Roy, “A system for handwritten script identification from Indian document,” Journal of Pattern Recognition Research, vol. 8, no. 1, pp. 1–12, 2013.
[2]  D. Ghosh, T. Dube, and A. Shivaprasad, “Script Recognition—a review,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 32, no. 12, pp. 2142–2161, 2010.
[3]  A. L. Spitz, “Determination of the script and language content of document images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 235–245, 1997.
[4]  L. Lam, J. Ding, and C. Y. Suen, “Differentiating between oriental and European scripts by statistical features,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 12, no. 1, pp. 63–79, 1998.
[5]  J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, “Automatic script identification from document images using cluster-based templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 177–181, 1997.
[6]  L. Zhou, Y. Lu, and C. L. Tan, “Bangla/English script identification based on analysis of connected component profiles,” in Proceedings of the 7th International Conference on Document Analysis Systems (DAS '06), vol. 3872 of Lecture Notes in Computer Science, pp. 243–254, 2006.
[7]  J. R. Prasad, U. V. Kulkarni, and R. S. Prasad, “Template matching algorithm for Gujrati character recognition,” in Proceedings of the 2nd International Conference on Emerging Trends in Engineering and Technology (ICETET '09), pp. 263–268, Nagpur, India, December 2009.
[8]  B. Patil and N. V. Subbareddy, “Neural network based system for script identification in Indian documents,” Sadhana, vol. 27, part 1, pp. 83–97, 2002.
[9]  A. M. Elgammal and M. A. Ismail, “Techniques for language identification for hybrid Arabic-English document images,” in Proceedings of the IEEE 6th International Conference on Document Analysis and Recognition, pp. 1100–1104, 2001.
[10]  B. V. Dhandra, P. Nagabhushan, M. Hangarge, R. Hegadi, and V. S. Malemath, “Script identification based on morphological reconstruction in document images,” in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 2, pp. 950–953, Hong Kong, August 2006.
[11]  B. B. Chaudhuri and U. Pal, “An OCR system to read two Indian language scripts: Bangla and Devanagari (Hindi),” in Proceedings of the 4th International Conference on Document Analysis and Recognition (ICDAR '97), vol. 2, pp. 1011–1015, Ulm, Germany, August 1997.
[12]  C. L. Tan, P. Y. Leong, and S. He, “Language Identification in Multilingual Documents,” 2003.
[13]  S. Chaudhury and R. Sheth, “Trainable script identification strategies for Indian languages,” in Proceedings of the 5th International Conference on Document Analysis and Recognition (ICDAR '99), pp. 657–660, 1999.
[14]  M. C. Padma and P. A. Vijaya, “Wavelet packet based texture features for automatic script identification,” International Journal of Image Processing, vol. 4, no. 1, 2010.
[15]  G. D. Joshi, S. Garg, and J. Sivaswamy, “Script identification from Indian documents,” in Proceedings of the 7th International Workshop on Document Analysis Systems (DAS '06), vol. 3872 of Lecture Notes in Computer Science, pp. 255–267, Nelson, New Zealand, 2006.
[16]  D. Dhanya, A. G. Ramakrishnan, and P. B. Pati, “Script identification in printed bilingual documents,” Sadhana, vol. 27, part 1, pp. 73–82, 2002.
[17]  http://commons.wikimedia.org/wiki/File:States_of_South_Asia.png.
[18]  K. Roy, U. Pal, and A. Banerjee, “A system for word-wise handwritten script identification for Indian postal automation,” in Proceedings of the 1st IEEE INDICON India Annual Conference, pp. 266–271, December 2004.
[19]  A. Kaehler and G. R. Bradski, Learning OpenCV, O'Reilly Media, 2008.
[20]  V. Singhal, N. Navin, and D. Ghosh, “Script-based classification of hand-written text documents in a multilingual environment,” in Proceedings of the 13th International Workshop on Research Issues in Data Engineering: Multi-lingual Information Management, Research Issues in Data Engineering, pp. 47–54, 2003.
[21]  J. Hochberg, K. Bowers, M. Cannon, and P. Kelly, “Script and language identification for handwritten document images,” The International Journal on Document Analysis and Recognition, vol. 2, no. 2-3, pp. 45–52, 1999.
[22]  K. Roy, S. Kundu Das, and S. M. Obaidullah, “Script identification from handwritten document,” in Proceedings of the 3rd National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG '11), pp. 66–69, Karnataka, Hubli, India, December 2011.
[23]  S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, and D. Kumar Basu, “A novel framework for automatic sorting of postal documents with multi-script address blocks,” Pattern Recognition, vol. 43, no. 10, pp. 3507–3521, 2010.
[24]  V. Singhal, N. Navin, and D. Ghosh, “Script-based classification of hand-written text documents in a multilingual environment,” in Proceedings of the 13th International Workshop on Research Issues in Data Engineering: Multi-Lingual Information Management (RIDE-MLIM '03), pp. 47–54, March 2003.
[25]  S. B. Moussa, A. Zahour, A. Benabdelhafid, and A. M. Alimi, “Fractal-based system for Arabic/Latin, printed/handwritten script identification,” in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), pp. 1–4, IEEE, December 2008.
[26]  M. Hangarge, K. C. Santosh, and R. Pardeshi, “Directional discrete cosine transform for handwritten script identification,” in Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR '13), pp. 344–348, Washington, DC, USA, August 2013.
[27]  R. Rani, R. Dhir, and G. S. Lehal, “Script identification of pre-segmented multi-font characters and digits,” in Proceedings of the 12th International Conference on Document Analysis and Recognition, pp. 1150–1154, August 2013.
[28]  M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” SIGKDD Explorations, vol. 11, pp. 10–18, 2009.
[29]  N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[30]  R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: a library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[31]  M. D. Buhmann, Radial Basis Functions: Theory and Implementations, Cambridge Monographs on Applied and Computational Mathematics (12), Cambridge University Press, Cambridge, UK, 2003.
[32]  S. V. Chakravarthy and J. Ghosh, “Scale-based clustering using the radial basis function network,” IEEE Transactions on Neural Networks, vol. 7, no. 5, pp. 1250–1261, 1996.
[33]  A. J. Howell and H. Buxton, “RBF network methods for face detection and attentional frames,” Neural Processing Letters, vol. 15, no. 3, pp. 197–211, 2002.
[34]  J. Hühn and E. Hüllermeier, “FURIA: an algorithm for unordered fuzzy rule induction,” Data Mining and Knowledge Discovery, vol. 19, no. 3, pp. 293–319, 2009.
[35]  L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
