Attention-Feedback and Representations in OCR
Abstract
A Kannada OCR, named Lipi Gnani, has been designed and developed from scratch to convert
printed text or poetry in Kannada script, without any restriction on vocabulary.
The training and test sets have been collected from over 35 books published between 1970
and 2002, including books written in Halegannada and pages containing Sanskrit slokas written
in Kannada script. The coverage of the OCR is nearly complete in the sense that it recognizes all
the punctuation marks, special symbols, Indo-Arabic and Kannada numerals and also the interspersed
English words. Several minor and major original contributions have been made in developing this OCR
at the different processing stages such as binarization, line and character segmentation, recognition and
Unicode mapping. The resulting Kannada OCR performs as well as, and in some cases better
than, Google’s Tesseract OCR, as shown by the results. To the knowledge of the authors, this is the
first report of a complete Kannada OCR, handling all the issues involved. Currently, there is no
dictionary based postprocessing, and the obtained results are due solely to the recognition process. Four
benchmark test datasets containing scanned pages from books in Kannada, Sanskrit, Konkani and Tulu
languages, but all of them printed in Kannada script, have been created, along with the ground truth in
Unicode. The word level recognition accuracy of Lipi Gnani is 5.3% higher on the Kannada dataset than
that of Google’s Tesseract OCR, 8.5% higher on the Sanskrit dataset, and 23.4% higher on the datasets
of Konkani and Tulu.
Inspired by the rich feedback in the visual neural pathway that is active during the
recognition process, we have proposed the use of feedback from the later modules in the OCR workflow,
such as recognition and Unicode generation, to the earlier stages such as binarization and segmentation,
to improve the overall performance of the OCR on old documents. The system
looks for singularities and inconsistencies in the sequence of recognition labels for each word image, and
their recognition scores output by the classifier, and based on these indicators, suspects merged or split
characters or interspersed English words. A nonlinear, locally adaptive, enhancement method is then
applied on the original, segmented gray level image of the word, and implemented in a slightly different
manner for handling merged and split characters. Multiple images of the word, enhanced to different
extents, are binarized and segmented into symbols and the best enhanced image is chosen based on
the best overall recognition score for the word. If the anomaly still persists, the system suspects the
word image to be of an interspersed English word, in an otherwise Kannada document. The segmented
components of the word are now rerecognized as English characters by a different classifier, trained on
the Latin script.
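The selection step described above can be sketched roughly as follows. This is a minimal illustration and not the Lipi Gnani implementation: the names FeedbackSelectionSketch, Result, recognizeWithFeedback and acceptThreshold are hypothetical, the persistence of an anomaly is modeled simply as the best word score staying below a threshold, and keeping the higher-scoring of the Kannada and Latin results is an assumption.

```java
import java.util.List;
import java.util.function.DoubleFunction;
import java.util.function.Supplier;

public class FeedbackSelectionSketch {

    /** Hypothetical recognition result: output text plus an overall word score. */
    public static final class Result {
        public final String text;
        public final double score;

        public Result(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /**
     * Tries several enhancement levels and keeps the best-scoring recognition.
     * kannadaRecognizer stands in for "enhance at this level, binarize, segment
     * and classify"; latinRecognizer stands in for the Latin-script classifier.
     */
    public static Result recognizeWithFeedback(
            List<Double> enhancementLevels,
            DoubleFunction<Result> kannadaRecognizer,
            Supplier<Result> latinRecognizer,
            double acceptThreshold) {

        Result best = null;
        for (double level : enhancementLevels) {
            Result candidate = kannadaRecognizer.apply(level);
            if (best == null || candidate.score > best.score) {
                best = candidate;
            }
        }

        // Assumption: a good enough score means the anomaly is resolved.
        if (best != null && best.score >= acceptThreshold) {
            return best;
        }

        // Otherwise suspect an interspersed English word and re-recognize it.
        Result latin = latinRecognizer.get();
        return (best == null || latin.score > best.score) ? latin : best;
    }
}
```

The Latin classifier is invoked only when no enhanced version of the word reaches the threshold, mirroring the order of the steps in the description.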
The effectiveness of the proposed attention-feedback processing has been thoroughly tested on a
challenging dataset of 250 pages of Kannada, which also includes some Halegannada pages. It has also
been tested on three other datasets containing 40+ pages of Tulu, Konkani and Sanskrit text, printed in
Kannada script. The overall attention-feedback processing results in an improvement in the word level
recognition accuracy of 4.56% on the Kannada dataset, 2.4% on Tulu and Konkani datasets and 6.3%
on the Sanskrit dataset.
We have also proposed an elegant and unique algorithm for the segmentation of text-lines from
printed and handwritten documents, using a red-black tree and a bipartite graph representation. We
first represent each connected component (CC) in a document page as a row interval and then exploit
the properties of the red-black tree (RBT) data structure in collecting the appropriate intervals (CCs) in
the different nodes (text-lines) of the tree. We initially construct an RBT by inserting the row-intervals
of all the mid-sized connected components into the tree. While inserting an interval, we recursively
merge all the intervals that have significant overlap into a single enclosing interval. Tall CCs, which may
arise due to the touching of components from adjacent lines, are inserted into the tree after cutting if
needed. Non-overlapping short components, which may include diacritical marks, are considered last,
and inserted into the closest intervals. Once all the CCs of the document page are inserted, the RBT
has one node for each segmented text-line and we do in-order tree traversal to get the lines in the sorted
order. The algorithm is computationally efficient, since each CC is processed only once in creating the
tree and the time complexity of RBT search/edit operations is of the order of the logarithm of the
number of lines.
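The interval-merging step for mid-sized components can be sketched as follows, using java.util.TreeMap, which is documented as a red-black tree based map. This is a minimal sketch under stated assumptions, not the algorithm as implemented by the authors: the class name LineIntervalSketch, the MIN_OVERLAP threshold and the rule "overlap measured against the shorter interval" are hypothetical, the handling of tall and short components and the bipartite graph stage are omitted, and the search for overlapping intervals is a plain scan rather than a logarithmic-time lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LineIntervalSketch {

    // Hypothetical threshold: fraction of the shorter interval that must overlap.
    static final double MIN_OVERLAP = 0.5;

    // One entry per text-line: key = top row, value = bottom row (inclusive).
    private final TreeMap<Integer, Integer> lines = new TreeMap<>();

    /** Inserts the row interval of one connected component, merging as needed. */
    public void insert(int top, int bottom) {
        Integer hit;
        // Keep absorbing stored intervals until none overlaps significantly.
        while ((hit = findOverlapping(top, bottom)) != null) {
            int hitBottom = lines.remove(hit);
            top = Math.min(top, hit);
            bottom = Math.max(bottom, hitBottom);
        }
        lines.put(top, bottom);
    }

    /** Returns the key of a stored interval with significant overlap, or null. */
    private Integer findOverlapping(int top, int bottom) {
        // Only intervals that start at or above the bottom row can overlap.
        for (Map.Entry<Integer, Integer> e : lines.headMap(bottom, true).entrySet()) {
            int overlap = Math.min(bottom, e.getValue()) - Math.max(top, e.getKey()) + 1;
            int shorter = Math.min(bottom - top + 1, e.getValue() - e.getKey() + 1);
            if (overlap > 0 && overlap >= MIN_OVERLAP * shorter) {
                return e.getKey();
            }
        }
        return null;
    }

    /** In-order traversal: text-line intervals sorted from top to bottom. */
    public List<int[]> linesInOrder() {
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : lines.entrySet()) {
            out.add(new int[] {e.getKey(), e.getValue()});
        }
        return out;
    }
}
```

For example, inserting the intervals (10, 30), (50, 70), (15, 35) and (52, 60) in that order leaves two nodes in the tree, and linesInOrder() returns them sorted by top row.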
We have thoroughly tested our Red-Black Tree and Bipartite Graph based line segmentation algorithm
on many standard datasets. Results on the ICDAR-2013 Handwriting-Segmentation-Contest
dataset (English, Greek, Bangla) show that our approach marginally outperforms the state-of-the-art
text-line segmentation methods reported on this dataset. Results on the ICDAR-2009 and PBOK datasets
(French, German, Kannada, Oriya) show that it also scales to these Indic and European languages.
We have also developed an intuitive, user-friendly GUI for OCR, called PrintToBraille. The
PrintToBraille GUI can recognize individual scanned pages or all the pages of an entire book. The
latter facility was specifically added to help NGOs create Braille versions of school texts for the
use of blind children. Accordingly, the output text of the OCR can be saved in .rtf, .xml or Braille format. It
can also save the recognized Unicode text, along with the line and word boundaries, in the
industry-standard METS/ALTO XML format. The Lipi Gnani Kannada OCR and the PrintToBraille GUI, both
developed in Java, can be run on Windows, Linux and Mac operating systems. A setup/installer program
has also been made available for Windows users to simplify installation and use.