Please use this identifier to cite or link to this item: http://repository.iitr.ac.in/handle/123456789/5652
Title: Word spotting in historical documents using primitive codebook and dynamic programming
Authors: Pratim Roy, Partha
Rayar F.
Ramel J.-Y.
Published in: Image and Vision Computing
Abstract: Word searching and indexing in historical document collections is a challenging problem because text characters are often touching or broken due to degradation or aging effects. In this paper, we present a novel approach to word spotting that decomposes text lines into character primitives and applies string matching. The text lines are first separated by a segmentation process. Each text line is then described as a sequence of primitive labels, each corresponding to a single character or a part of a character. These representative primitives are drawn from a codebook of shapes generated from training pages taken from the collection. During indexation, the text lines are transcribed into strings of primitives in an off-line stage and stored in files. For this purpose, an efficient multi-label indexation strategy combines a two-level analysis of the primitives: coarse and fine. During retrieval, the query word image is encoded into strings of coarse and fine primitives chosen according to the codebook. Finally, a dynamic programming method based on approximate string matching finds similar primitive sequences in the text lines of the collection at runtime. We present an experimental evaluation on datasets of real-life document images gathered from historical books in different scripts. Experimental results show that the method is robust in searching text in noisy documents. © 2015 Elsevier B.V.
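The retrieval step described in the abstract can be illustrated with a minimal sketch of approximate substring matching by dynamic programming. This is not the authors' implementation: the function name `spot_query`, the string labels such as "p3", the unit edit costs, and the single-level labels (no coarse/fine distinction) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def spot_query(
    query: Sequence[str], line: Sequence[str], max_cost: int
) -> List[Tuple[int, int]]:
    """Return (end_position, cost) for each position in `line` where some
    substring ending there matches `query` within `max_cost` edits.

    Positions are 1-based. Costs are unit insert/delete/substitute costs,
    which is an assumption, not the cost model of the paper.
    """
    m = len(query)
    # prev[i]: cheapest cost of aligning query[:i] with a substring of
    # `line` that ends at the previous position.
    prev = list(range(m + 1))
    hits: List[Tuple[int, int]] = []
    for j, label in enumerate(line, start=1):
        # curr[0] stays 0 so that a match may start anywhere in the line.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            substitute = prev[i - 1] + (0 if query[i - 1] == label else 1)
            skip_line_label = prev[i] + 1
            skip_query_label = curr[i - 1] + 1
            curr[i] = min(substitute, skip_line_label, skip_query_label)
        if curr[m] <= max_cost:
            hits.append((j, curr[m]))
        prev = curr
    return hits


# Hypothetical primitive sequences: a text line and a query word.
line_primitives = ["p9", "p3", "p7", "p1", "p4"]
query_primitives = ["p3", "p7", "p1"]
exact_hits = spot_query(query_primitives, line_primitives, max_cost=0)
```

Raising `max_cost` tolerates primitives that were split, merged, or mislabeled by degradation, at the price of more candidate positions to rank.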
Citation: Image and Vision Computing (2015), 44: 15-28
URI: https://doi.org/10.1016/j.imavis.2015.09.006
http://repository.iitr.ac.in/handle/123456789/5652
Issue Date: 2015
Publisher: Elsevier Ltd
Keywords: Approximate string matching
Coarse-to-fine
Document indexing
Word spotting
ISSN: 0262-8856
Author Scopus IDs: 56880478500
55247398900
8293131700
Author Affiliations: Roy, P.P., Laboratoire d'Informatique, Université François Rabelais, Tours, France
Rayar, F., Laboratoire d'Informatique, Université François Rabelais, Tours, France
Ramel, J.-Y., Laboratoire d'Informatique, Université François Rabelais, Tours, France
Funding Details: This work has been supported by the AAP program of Université François Rabelais, Tours, France (2010–2011) ( AAP-UFRT-2010-06 ) and by the Google Digital Humanities Research Awards (2010) given to the Computer Science Laboratory of Tours (RFAI team). Thanks to CESR for providing datasets and valuable discussions which helped us to improve our system.
Corresponding Author: Roy, P.P.; Laboratoire d'Informatique, Université François Rabelais, France; email: partha.roy@univ-tours.fr
Appears in Collections: Journal Publications [CS]



Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.