Show simple item record

dc.contributor.advisorRamakrishnan, A G
dc.contributor.authorPati, Peeta Basa
dc.date.accessioned2008-01-17T10:33:19Z
dc.date.accessioned2018-07-31T04:56:34Z
dc.date.available2008-01-17T10:33:19Z
dc.date.available2018-07-31T04:56:34Z
dc.date.issued2008-01-17T10:33:19Z
dc.date.submitted2006
dc.identifier.urihttps://etd.iisc.ac.in/handle/2005/346
dc.description.abstractA document image, beside text, may contain pictures, graphs, signatures, logos, barcodes, hand-drawn sketches and/or seals. Further, the text blocks in an image may be in Manhattan or any complex layout. Document Layout Analysis is an important preprocessing step before subjecting any such image to OCR. Here, the image with complex layout and content is segmented into its constituent components. For many present day applications, separating the text from the non-text blocks is sufficient. This enables the conversion of the text elements present in the image to their corresponding editable form. In this work, an effort has been made to separate the text areas from the various kinds of possible non-text elements. The document images may have been obtained from a scanner or camera. If the source is a scanner, there is control on the scanning resolution, and lighting of the paper surface. Moreover, during the scanning process, the paper surface remains parallel to the sensor surface. However, when an image is obtained through a camera, these advantages are no longer available. Here, an algorithm is proposed to separate the text present in an image from the clutter, irrespective of the imaging technology used. This is achieved by using both the structural and textural information of the text present in the gray image. A bank of Gabor filters characterizes the statistical distribution of the text elements in the document. A connected component based technique removes certain types of non-text elements from the image. When a camera is used to acquire document images, generally, along with the structural and textural information of the text, color information is also obtained. It can be assumed that text present in an image has a certain amount of color homogeneity. So, a graph-theoretical color clustering scheme is employed to segment the iso-color components of the image. Each iso-color image is then analyzed separately for its structural and textural properties. The results of such analyses are merged with the information obtained from the gray component of the image. This helps to separate the colored text areas from the non-text elements. The proposed scheme is computationally intensive, because the separation of the text from non-text entities is performed at the pixel level Since any entity is represented by a connected set of pixels, it makes more sense to carry out the separation only at specific points, selected as representatives of their neighborhood. Harris' operator evaluates an edge-measure at each pixel and selects pixels, which are locally rich on this measure. These points are then employed for separating text from non-text elements. Many government documents and forms in India are bi-lingual or tri-lingual in nature. Further, in school text books, it is common to find English words interspersed within sentences in the main Indian language of the book. In such documents, successive words in a line of text may be of different scripts (languages). Hence, for OCR of these documents, the script must be recognized at the level of words, rather than lines or paragraphs. A database of about 20,000 words each from 11 Indian scripts1 is created. This is so far the largest database of Indian words collected and deployed for script recognition purpose. Here again, a bank of 36 Gabor filters is used to extract the feature vector which represents the script of the word. The effectiveness of Gabor features is compared with that of DCT and it is found that Gabor features marginally outperform the DOT. Simple, linear and non-linear classifiers are employed to classify the word in the feature space. It is assumed that a scheme developed to recognize the script of the words would work equally fine for sentences and paragraphs. This assumption has been verified with supporting results. A systematic study has been conducted to evaluate and compare the accuracy of various feature-classifier combinations for word script recognition. We have considered the cases of bi-script and tri-script documents, which are largely available. Average recognition accuracies for bi-script and tri-script cases are 98.4% and 98.2%, respectively. A hierarchical blind script recognizer, involving all eleven scripts has been developed and evaluated, which yields an average accuracy of 94.1%. The major contributions of the thesis are: • A graph theoretic color clustering scheme is used to segment colored text. • A scheme is proposed to separate text from the non-text content of documents with complex layout and content, captured by scanner or camera. • Computational complexity is reduced by performing the separation task on a selected set of locally edge-rich points. • Script identification at word level is carried out using different feature classifier combinations. Gabor features with SVM classifier outperforms any other feature-classifier combinations. A hierarchical blind script recognition algorithm, involving the recognition of 11 Indian scripts, is developed. This structure employs the most efficient feature-classifier combination at each individual nodal point of the tree to maximize the system performance. A sequential forward feature selection algorithm is employed to. select the most discriminating features, in a case by case basis, for script-recognition. The 11 scripts are Bengali, Devanagari, Gujarati, Kannada, Malayalam, Odiya, Puniabi, Roman. Tamil, Telugu and Urdu.en_US
dc.language.isoen_USen_US
dc.rightsI grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subjectImage Processingen_US
dc.subjectPhonetics - Image Processingen_US
dc.subjectScript Identificationen_US
dc.subjectScript Recognition Algorithmen_US
dc.subjectPage Layout Analysis (PLA)en_US
dc.subjectText Extraction Methodsen_US
dc.subjectGray Imagesen_US
dc.subjectColor Imagesen_US
dc.subjectDocument Imagesen_US
dc.subjectMulti-script Documentsen_US
dc.subjectGabor Filteren_US
dc.subjectMulti-script Identificationen_US
dc.subjectBilingual Documentsen_US
dc.subject.classificationApplied Opticsen_US
dc.titleAnalysis Of Multi-lingual Documents With Complex Layout And Contenten_US
dc.typeThesisen_US
dc.degree.namePhDen_US
dc.degree.levelDoctoralen_US
dc.degree.grantorIndian Institute of Science
dc.degree.disciplineFaculty of Engineeringen_US


Files in this item

This item appears in the following Collection(s)

Show simple item record