Probabilistic Parsing for New Languages and Annotation Schemes

Dr Frank Keller

Joint work with Amit Dubey, Saarbruecken.

Parsing, the task of assigning a syntactic structure to an utterance, is central to many natural language processing applications. Parsing has been the subject of intensive research over the past few years, resulting in probabilistic models that achieve both broad coverage and high accuracy. However, most existing parsing models have been developed for English and trained on a single corpus, the Penn Treebank. This raises the question of whether these models generalize to other languages, and to annotation schemes that differ from the Penn Treebank markup.

We address this question by proposing a probabilistic parsing model trained on Negra, a syntactically annotated corpus for German. German has a number of syntactic properties that set it apart from English, and the Negra annotation scheme differs in important respects from the Penn Treebank markup. We observe that existing lexicalized parsing models using head-head dependencies, while successful for English, fail to outperform an unlexicalized baseline model for German. Learning curves show that this effect is not due to a lack of training data. We propose an alternative model that uses sister-head dependencies instead of head-head dependencies. This model outperforms the baseline, achieving a labeled precision and recall of around 74%.
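
For readers unfamiliar with the metric, labeled precision and recall are commonly computed by comparing the labeled constituent spans in the parser output against those in the gold-standard tree. The following is a minimal sketch in Python, not the evaluation code used in this work; the function name, the (label, start, end) span encoding and the toy trees are assumptions made for illustration.

```python
from collections import Counter


def labeled_precision_recall(gold, predicted):
    """Compare two collections of (label, start, end) constituents.

    A predicted constituent counts as correct only if a gold constituent
    with the same label and the same span exists.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: each gold constituent is matched at most once.
    matched = sum((gold_counts & pred_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    return precision, recall


# Toy example (hypothetical spans over a five-word sentence).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
predicted = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
precision, recall = labeled_precision_recall(gold, predicted)
```

In the toy example, three of the four predicted constituents match the gold tree, so precision and recall are both 0.75. The last constituent has the right span but the wrong label, which is why it does not count under the labeled variant of the metric.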

We use this result to argue that sister-head dependencies are more appropriate for parsing languages with a relatively free word order (such as German) and annotation schemes with very flat structures (such as Negra).


