%0 Journal Article
%T Printed Persian Subword Recognition Using Wavelet Packet Descriptors
%A Samira Nasrollahi
%A Afshin Ebrahimi
%J Journal of Engineering
%D 2013
%I Hindawi Publishing Corporation
%R 10.1155/2013/465469
%X In this paper, we present a new approach to offline OCR (optical character recognition) for printed Persian subwords using the wavelet packet transform. The proposed algorithm is used to extract font-invariant and size-invariant features from 87804 subwords in 4 fonts and 3 sizes. The feature vectors are compressed using PCA. The resulting feature vectors yield a pictorial dictionary in which each entry is the mean of a group consisting of the same subword in 4 fonts and 3 sizes. These features are then combined with the dot features for the recognition of printed Persian subwords. To evaluate the feature extraction results, the algorithm was tested on a set of 2000 subwords in printed Persian text documents. An encouraging recognition rate of 97.9% is achieved at the subword level.
1. Introduction
Optical character recognition (OCR) is one of the oldest subfields of pattern recognition, with a rich contribution to the recognition of printed documents. The ultimate goal of OCR is to imitate the human ability to read, at a much faster rate, by associating symbolic identities with images of characters [1]. In recent years, many OCR studies have extracted subword features. In a number of these studies, holistic shape information is extracted from subwords in order to model them [2]. In this work, we aim to extract holistic shape features of printed Persian subwords using the wavelet packet transform to build a pictorial dictionary. Feature extraction is a vital step for pattern recognition and optical character recognition systems [3], especially for printed Persian OCR, since character shapes vary with font and size. One of the main concerns in designing any OCR system is to make it robust to font and size variations [4]. It is clear that OCR of multifont documents is more difficult than OCR of single-font documents.
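The dictionary-building steps named in the abstract (PCA compression of the feature vectors, then one dictionary entry per subword taken as the mean over its 4 fonts and 3 sizes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the wavelet packet and dot features are not computed here and are replaced by synthetic vectors, the nearest-mean matching rule is an assumption, and all function names and sizes (fit_pca, build_dictionary, recognize, 5 subwords, 64 dimensions, 4 components) are hypothetical.

```python
import numpy as np


def fit_pca(features, n_components):
    """Fit PCA with an SVD of the centered data; return (mean, components)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Compress feature vectors onto the retained principal components."""
    return (features - mean) @ components.T


def build_dictionary(compressed, labels):
    """One entry per subword: the mean of its compressed vectors."""
    return {lab: compressed[labels == lab].mean(axis=0)
            for lab in np.unique(labels)}


def recognize(vector, dictionary):
    """Assumed matching rule: pick the entry with the smallest Euclidean distance."""
    return min(dictionary,
               key=lambda lab: np.linalg.norm(vector - dictionary[lab]))


# Synthetic stand-in for real features (assumption): 5 subwords,
# each seen in 12 variants (4 fonts x 3 sizes), 64-dimensional vectors.
rng = np.random.default_rng(0)
n_subwords, n_variants, dim = 5, 12, 64
centers = rng.normal(size=(n_subwords, dim)) * 5.0
labels = np.repeat(np.arange(n_subwords), n_variants)
features = centers[labels] + rng.normal(scale=0.1, size=(labels.size, dim))

mean, components = fit_pca(features, n_components=4)
dictionary = build_dictionary(project(features, mean, components), labels)

# Classify an unseen variant of subword 3.
query = centers[3] + rng.normal(scale=0.1, size=dim)
predicted = recognize(project(query[None, :], mean, components)[0], dictionary)
print(predicted)
```

Averaging over the font and size variants of a subword is what lets a single dictionary entry stand for all of its renderings; real features would need to be close to invariant across those variants for the mean to be representative.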
Designing an OCR engine that recognizes subwords independently of font type and size is not impossible, but it is certainly very difficult and inefficient, because subwords take different shapes in different fonts [5]. There have been considerable efforts to produce omnifont OCR systems for the Persian and Arabic languages, but the overall performance of such systems is still far from perfect. Written Persian, which uses a modified Arabic alphabet, is cursive, and this intrinsic feature makes automatic recognition difficult [6]. Many feature extraction methods have been reported, such as various moment features, gradient- and distance-based features,
%U http://www.hindawi.com/journals/je/2013/465469/