Informatics Report Series




Title: Tools to Address the Interdependence between Tokenisation and Standoff Annotation
Authors: Claire Grover ; Michael Matthews ; Richard Tobin
Date: Apr 2006
Publication Title: Proceedings of NLPXML 2006 (Workshop on NLP and XML with theme Multi-dimensional Markup in Natural Language Processing)
Publication Type: Conference Paper
Publication Status: Published
Page Nos: 19-26
In this paper we discuss technical issues arising from the interdependence between tokenisation and XML-based annotation tools, in particular those which use standoff annotation in the form of pointers to word tokens. It is common practice for an XML-based annotation tool to use word tokens as the target units for annotating such things as named entities, because this provides the optimum granularity for standoff annotation. It also provides units which annotators can easily select, sweep out or snap to, and certain classes of annotation mistake can be prevented by building a tool that does not permit selection of a substring which does not entirely span one or more XML elements. There is a downside to this method of annotation, however: it assumes that for any given data set, in whatever domain, the optimal tokenisation is known before any annotation is performed. If mistakes are made in the initial tokenisation and the word boundaries conflict with the annotators' desired actions, then either the annotators will be forced to be inaccurate or expensive retokenisation and reannotation will be required. Here we describe the tools and methods we have developed to address this problem.
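The conflict described in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling: the `Token` class, the whitespace `tokenise` function, the `snap_to_tokens` helper and the example string are all hypothetical, chosen only to show how a selection that does not align with token boundaries forces a token-based standoff annotation to cover more text than the annotator intended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    id: str     # hypothetical token identifier that a standoff pointer would target
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def tokenise(text):
    """Naive whitespace tokeniser (an assumption; stands in for a real one)."""
    tokens = []
    pos = 0
    for i, word in enumerate(text.split()):
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(f"w{i + 1}", start, end))
        pos = end
    return tokens


def snap_to_tokens(tokens, sel_start, sel_end):
    """Return the tokens overlapping a character selection, plus a flag
    saying whether the selection coincides exactly with token boundaries."""
    covered = [t for t in tokens if t.start < sel_end and t.end > sel_start]
    if not covered:
        return [], False
    exact = covered[0].start == sel_start and covered[-1].end == sel_end
    return covered, exact


# Hypothetical example text: the tokeniser keeps "IL-2/IL-4" as one token.
text = "the IL-2/IL-4 receptor"
tokens = tokenise(text)

# The annotator wants to mark only "IL-2" (offsets 4 to 8).
covered, exact = snap_to_tokens(tokens, 4, 8)
print([t.id for t in covered], exact)
```

In this sketch the selection snaps to a single token that spans the whole string "IL-2/IL-4", and the flag reports that the boundaries do not match. A token-based tool would then either record the wider span or require the text to be retokenised, which is the trade-off the abstract describes.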
2006 EACL
BibTeX format
author = {Claire Grover and Michael Matthews and Richard Tobin},
title = {Tools to Address the Interdependence between Tokenisation and Standoff Annotation},
booktitle = {Proceedings of NLPXML 2006 (Workshop on NLP and XML with theme Multi-dimensional Markup in Natural Language Processing)},
year = 2006,
month = {Apr},
pages = {19--26},


Unless explicitly stated otherwise, all material is copyright The University of Edinburgh