Informatics Report Series

Report EDI-INF-RR-0469

Title: Web-based Models for Natural Language Processing
Authors: Mirella Lapata; Frank Keller
Date: 2005
Publication Title: ACM Transactions on Speech and Language Processing
Publisher: ACM
Publication Type: Journal Article
Publication Status: Published
Volume No: 2(1)
DOI: 10.1145/1075389.1075392
ISSN: 1550-4875
Abstract:
Previous work demonstrated that web counts can be used to approximate bigram counts, thus suggesting that web-based frequencies should be useful for a wide variety of NLP tasks. However, only a limited number of tasks have so far been tested using web-scale data sets. The present paper overcomes this limitation by systematically investigating the performance of web-based models for several NLP tasks, covering both syntax and semantics, both generation and analysis, and a wider range of n-grams and parts of speech than have been previously explored. For the majority of our tasks, we find that simple, unsupervised models perform better when n-gram counts are obtained from the web rather than from a large corpus. In some cases, performance can be improved further by using backoff or interpolation techniques that combine web counts and corpus counts. However, unsupervised web-based models generally fail to outperform supervised state-of-the-art models trained on smaller corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard supervised models.
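The abstract mentions backoff and interpolation as ways of combining web counts with corpus counts but does not spell out either scheme. The following is only a minimal sketch of the two general ideas, not the authors' method: the function names, the mixing weight `lam`, and the count `threshold` are all hypothetical choices made for illustration.

```python
def interpolate(p_web: float, p_corpus: float, lam: float = 0.5) -> float:
    """Linearly mix a web-based and a corpus-based probability estimate.

    `lam` is an assumed mixing weight in [0, 1]; in practice it would
    have to be tuned on held-out data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_web + (1.0 - lam) * p_corpus


def backoff(p_corpus: float, corpus_count: int, p_web: float,
            threshold: int = 0) -> float:
    """Trust the corpus estimate when the n-gram was seen often enough
    in the corpus; otherwise fall back to the web estimate.

    `threshold` is an assumed cutoff on the raw corpus count.
    """
    return p_corpus if corpus_count > threshold else p_web


# Hypothetical usage: an n-gram unseen in the corpus falls back to the web.
estimate = backoff(p_corpus=0.0, corpus_count=0, p_web=0.02)
```

Backoff uses one source at a time, whereas interpolation always blends both; which works better for a given task is an empirical question.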
Bibtex format
@Article{EDI-INF-RR-0469,
author = {Mirella Lapata and Frank Keller},
title = {Web-based Models for Natural Language Processing},
journal = {ACM Transactions on Speech and Language Processing},
publisher = {ACM},
year = 2005,
volume = {2},
number = {1},
doi = {10.1145/1075389.1075392},
url = {http://homepages.inf.ed.ac.uk/mlap/Papers/tslp05.html},
}



Please mail <reports@inf.ed.ac.uk> with any changes or corrections.
Unless explicitly stated otherwise, all material is copyright The University of Edinburgh