%0 Journal Article
%T Script Identification from Printed Indian Document Images and Performance Evaluation Using Different Classifiers
%A Sk Md Obaidullah
%A Anamika Mondal
%A Nibaran Das
%A Kaushik Roy
%J Applied Computational Intelligence and Soft Computing
%D 2014
%I Hindawi Publishing Corporation
%R 10.1155/2014/896128
%X Identification of script from document images is an active area of research under document image processing for a multilingual/multiscript country like India. In this paper, the real-life problem of printed script identification from official Indian document images is considered, and the performance of different well-known classifiers is evaluated. Two important evaluation parameters, namely, AAR (average accuracy rate) and MBT (model building time), are computed for this performance analysis. The experiment was carried out on 459 printed document images with 5-fold cross-validation. The Simple Logistic model shows the highest AAR, 98.9%. The BayesNet and Random Forest models have average accuracy rates of 96.7% and 98.2%, respectively, with the lowest MBT of 0.09 s. 1. Introduction Automatic script identification is an active area of research under document image processing. The work is particularly relevant for a multiscript country like India. At present there are 22 official languages, and 13 scripts [1] are used to write them. Including English, the number of languages becomes 23. Automatic document processing helps convert physical real-world documents into digital text, which is very useful for further processing such as storage, retrieval, and indexing of large volumes of data. In our country, many languages use the same script for writing. For example, Devanagari is a well-known script in India that is used to write languages such as Hindi, Marathi, and Sanskrit, whereas Bangla is another popular script, used to write languages such as Bangla, Assamese, and Manipuri. Multilingual documents, such as postal documents and preprinted application forms, are very common in daily life. An optical character recognizer (OCR) built for a specific language will not work for such multilingual documents.
Therefore, to build a successful multilingual OCR, script identification is essential before running an individual OCR for a specific language. In this context, the problem of script identification is addressed here. Script identification techniques for printed documents can be divided into four major groups, namely, (i) document level script identification, (ii) block level script identification, (iii) line level script identification, and (iv) word level script identification. Document level script identification is much faster than the other categories because the whole document is fed to the script identification system without fine segmentation into blocks, lines, or words. Ghosh et al. [2]
%U http://www.hindawi.com/journals/acisc/2014/896128/