Informatics Report Series
Title: Randomised Language Modelling for Statistical Machine Translation
Authors: David Talbot; Miles Osborne
Date: 2007
Publication Title: ACL 07
Publisher: Association for Computational Linguistics
Publication Type: Conference Paper
Publication Status: Published
- Abstract:
- A Bloom filter (BF) is a randomized data structure for set membership queries. Its space requirements are significantly below lossless information-theoretic lower bounds but it produces false positives with some constant probability. Here we explore the use of BFs for language modelling in statistical machine translation. We investigate how a BF containing n-grams extracted from a large corpus can complement a standard n-gram LM within an SMT system and consider (i) how to include approximate frequency information efficiently and (ii) how to reduce the effective error rate by first checking for lower-order subsequences in candidate n-grams. Our solutions in both cases retain the one-sided error guarantees of the standard BF while taking advantage of the particular characteristics of natural language statistics to reduce the space requirements.
- Links To Paper
- 1st Link
- Bibtex format
- @InProceedings{EDI-INF-RR-1020,
  author = {David Talbot and Miles Osborne},
  title = {Randomised Language Modelling for Statistical Machine Translation},
  booktitle = {ACL 07},
  publisher = {Association for Computational Linguistics},
  year = 2007,
  url = {http://acl.ldc.upenn.edu/P/P07/P07-1065.pdf},
  }
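
The abstract describes a Bloom filter holding corpus n-grams, with lower-order subsequences checked before a candidate n-gram is accepted. The following is a minimal illustrative sketch of that idea, not the authors' implementation. The class name `BloomFilter`, the helpers `ngrams` and `contains_with_subsequence_check`, the SHA-256 double-hashing scheme, and the filter sizes are all assumptions made for this example. The sketch also assumes that every lower-order n-gram of a stored n-gram is itself stored, which is what keeps the error one-sided.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing (an assumption of this sketch): derive all k bit
        # positions from two 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contains_with_subsequence_check(bf, tokens):
    """Accept an n-gram only if all of its lower-order n-grams are present.

    If the n-gram really occurred in the corpus, so did every subsequence,
    so this extra check can only reject false positives.
    """
    for order in range(1, len(tokens)):
        if any(g not in bf for g in ngrams(tokens, order)):
            return False
    return " ".join(tokens) in bf


# Toy corpus: store every unigram, bigram, and trigram.
bf = BloomFilter(num_bits=1 << 16, num_hashes=4)
corpus = "the cat sat on the mat".split()
for n in (1, 2, 3):
    for gram in ngrams(corpus, n):
        bf.add(gram)

print("the cat sat" in bf)
print(contains_with_subsequence_check(bf, ["the", "cat", "sat"]))
```

Every n-gram that was inserted is always reported as present. The subsequence check adds more chances to reject an n-gram that was never inserted, at the cost of extra lookups per query.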