- Abstract:
In this paper we discuss technical issues arising from the interdependence between tokenisation and XML-based annotation tools, in particular those which use standoff annotation in the form of pointers to word tokens. It is common practice for an XML-based annotation tool to use word tokens as the target units for annotating such things as named entities because this provides the optimum granularity for standoff annotation. Furthermore, it provides units which can be easily selected, swept out or snapped to by the annotators, and certain classes of annotation mistakes can be prevented by building a tool that does not permit selection of a substring which does not entirely span one or more XML elements. There is a downside to this method of annotation, however, in that it assumes that for any given data set, in whatever domain, the optimal tokenisation is known before any annotation is performed. If mistakes are made in the initial tokenisation and the word boundaries conflict with the annotators' desired actions, then either the annotators will be forced to be inaccurate or expensive retokenisation and reannotation will be required. Here we describe the tools and methods we have developed to address this problem.
- Copyright:
- 2006 EACL
- BibTeX format
@InProceedings{EDI-INF-RR-0802,
  author = {Claire Grover and Michael Matthews and Richard Tobin},
  title = {Tools to Address the Interdependence between Tokenisation and Standoff Annotation},
  booktitle = {Proceedings of NLPXML 2006 (Workshop on NLP and XML with theme Multi-dimensional Markup in Natural Language Processing)},
  year = 2006,
  month = {Apr},
  pages = {19--26},
}