MXPOST is a Java (JDK 1.1) implementation of the part-of-speech tagger described in:
Adwait Ratnaparkhi. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, May 17-18, 1996, University of Pennsylvania. The paper is no longer available at its original location, http://www.cis.upenn.edu/~adwait/papers/tagger.ps; I found a copy at http://citeseer.ist.psu.edu/152720.html.
Adwait's thesis and two FAQs
USERS MUST ABIDE BY THE LICENSE INCLUDED WITH THIS DISTRIBUTION.
MXPOST is copyright (c) 1997 Adwait Ratnaparkhi
To use:
mxpost projectdir < wordfile
where projectdir contains the files constituting
the model and wordfile contains one sentence per line.
An example of a "project directory" is
/group/contrib/nlp-speech/src/mxpost/tagger.project,
which contains a model trained
from sections 0 through 18 of the Penn Treebank Wall St. Journal
corpus.
The sentences in wordfile must be tokenized according to
Penn Treebank conventions,
e.g., "The stock didn't rise $5." should be "The stock did n't rise $ 5 ."
You may want to use the script Treebank_tokenization.sed for that;
it is at the same location as mxpost.
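To illustrate what the tokenization step does, here is a minimal Python sketch that handles only the three cases in the example sentence above (the "n't" contraction, the dollar sign, and sentence-final punctuation). It is not a port of Treebank_tokenization.sed, and the function name simple_ptb_tokenize is made up for this illustration; use the sed script for real input.

```python
import re


def simple_ptb_tokenize(sentence: str) -> str:
    """Rough approximation of a few Penn Treebank tokenization rules.

    Hypothetical helper for illustration only; it covers far fewer
    cases than Treebank_tokenization.sed.
    """
    text = sentence
    # Split the contraction "n't" from its verb: "didn't" -> "did n't".
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Separate dollar signs from the amount: "$5" -> "$ 5".
    text = re.sub(r"\$", " $ ", text)
    # Separate sentence-final punctuation: "5." -> "5 .".
    text = re.sub(r"([.?!])\s*$", r" \1", text)
    # Collapse any runs of whitespace introduced above.
    return " ".join(text.split())


if __name__ == "__main__":
    print(simple_ptb_tokenize("The stock didn't rise $5."))
```

The output, one tokenized sentence per line, is the form that wordfile is expected to have.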
To train a new model:
trainmxpost projectdir traindata
where projectdir is the newly created project directory
and traindata contains one sentence per line, with
each sentence in the format:
word1_tag1 word2_tag2 word3_tag3 ... wordN_tagN
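The training format can be sketched in Python as follows. The helper names format_training_line and parse_training_line are hypothetical, the tags in the usage example are ordinary Penn Treebank tags chosen for illustration, and splitting each token on its last underscore is an assumption about how words that themselves contain "_" should be read, not something stated above.

```python
from typing import List, Tuple


def format_training_line(tagged: List[Tuple[str, str]]) -> str:
    """Join (word, tag) pairs into one 'word_tag word_tag ...' line."""
    return " ".join(f"{word}_{tag}" for word, tag in tagged)


def parse_training_line(line: str) -> List[Tuple[str, str]]:
    """Split a training line back into (word, tag) pairs.

    Assumption: the tag follows the LAST underscore in each token.
    """
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    # Illustrative tags only; a real traindata file comes from an annotated corpus.
    sentence = [("The", "DT"), ("stock", "NN"), ("did", "VBD"), ("n't", "RB"),
                ("rise", "VB"), ("$", "$"), ("5", "CD"), (".", ".")]
    print(format_training_line(sentence))
```

Writing one such line per sentence produces a file of the shape described for traindata.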