Title: A Search Engine for Historical Manuscript Images
Authors: T. Rath; R. Manmatha; Victor Lavrenko
Date: Jul 2004
Publication Title: Proceedings of the 27th ACM Conference on Information Retrieval (SIGIR) 2004
Publication Type: Conference Paper
Publication Status: Published
Page Nos: 369-376
DOI: 10.1145/1008992.1009056
ISBN/ISSN: 1-58113-881-4
Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We show experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. Experiments show that the precision at 20 documents is about 0.4 to 0.5 depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts using text queries, without manual transcription of the original corpus.
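The abstract reports results as precision at 20 documents. As a minimal sketch of how that metric is commonly computed (not the authors' evaluation code), the function below takes a ranked list of retrieved item identifiers and a set of relevant identifiers. The names `precision_at_k`, `ranked`, and `relevant`, and the page identifiers, are hypothetical and chosen only for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Return the fraction of the top-k ranked items that are relevant.

    The denominator is always k, so a ranking shorter than k is
    penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for item_id in top_k if item_id in relevant_ids)
    return hits / k


# Hypothetical ranking of 40 pages for one text query.
ranked = [f"page_{i}" for i in range(1, 41)]
# Hypothetical relevance judgments: every odd-numbered page is relevant.
relevant = {f"page_{i}" for i in range(1, 41, 2)}

score = precision_at_k(ranked, relevant, k=20)
print(score)  # 10 relevant pages among the top 20, so this prints 0.5
```

In this toy ranking, half of the top 20 pages are relevant, which gives a precision of 0.5, the upper end of the range quoted in the abstract. The toy data is made up and does not reproduce any result from the paper.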
