Informatics Report Series
Title: A Search Engine for Historical Manuscript Images
Authors: T. Rath; R. Manmatha; Victor Lavrenko
Date: Jul 2004
Publication Title: Proceedings of the 27th ACM Conference on Information Retrieval (SIGIR) 2004
Publication Type: Conference Paper
Publication Status: Published
Page Nos: 369-376
DOI: 10.1145/1008992.1009056
ISBN/ISSN: 1-58113-881-4
- Abstract:
- Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We show experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. Experiments show that the precision at 20 documents is about 0.4 to 0.5 depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts using text queries, without manual transcription of the original corpus.
- Bibtex format
@InProceedings{EDI-INF-RR-1195,
  author = {T. Rath and R. Manmatha and Victor Lavrenko},
  title = {A Search Engine for Historical Manuscript Images},
  booktitle = {Proceedings of the 27th ACM Conference on Information Retrieval (SIGIR) 2004},
  year = 2004,
  month = {Jul},
  pages = {369--376},
  doi = {10.1145/1008992.1009056},
}