Informatics Report Series




Title: CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank
Authors: Julia Hockenmaier; Mark Steedman
Date: Sep 2007
Publication Title: Computational Linguistics
Publisher: MIT Press
Publication Type: Journal Article
Publication Status: Pre-print
Volume No: 33
Page Nos: 355-396
DOI: 10.1162/coli.2007.33.3.355
ISBN/ISSN: 0891-2017
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word-word dependencies. The resulting corpus, CCGbank, includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium, and has been used to train wide-coverage statistical parsers that obtain state-of-the-art rates of dependency recovery. In order to obtain linguistically adequate CCG analyses, and to eliminate noise and inconsistencies in the original annotation, an extensive analysis of the constructions and annotations in the Penn Treebank was called for, and a substantial number of changes to the Treebank were necessary. We discuss the implications of our findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.
Bibtex format
author = { Julia Hockenmaier and Mark Steedman },
title = {CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank},
journal = {Computational Linguistics},
publisher = {MIT Press},
year = 2007,
month = {Sep},
volume = {33},
pages = {355-396},
doi = {10.1162/coli.2007.33.3.355},
url = {},


Unless explicitly stated otherwise, all material is copyright The University of Edinburgh