Corpora and other Language and Speech Data under DICE

Information about NLP and speech software can be found here.
Conference and workshop proceedings can be found here.

General Information

All corpora (and other language and speech data) reside under /group/corpora, which has three subdirectories. Which subdirectory a corpus is in depends on its licensing agreement and size:

/group/corpora/public  corpora with Informatics-wide or University-wide licenses (NFS filesystem)
/group/corpora/large  very large corpora with Informatics-wide or University-wide licenses (AFS filesystem - actual location is /afs/inf.ed.ac.uk/group/corpora/large/)
/group/corpora/restricted  corpora with more restrictive licenses (AFS filesystem - actual location is /afs/inf.ed.ac.uk/group/corpora/restricted/)
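
The mapping above can be sketched in a few lines of Python. This is only an illustration: the function name describe_location is hypothetical, and the sketch assumes that the AFS location mirrors the /group/corpora layout exactly as the two "actual location" notes suggest.

```python
from pathlib import PurePosixPath

CORPUS_ROOT = PurePosixPath("/group/corpora")
AFS_ROOT = PurePosixPath("/afs/inf.ed.ac.uk/group/corpora")

# Filesystem behind each top-level subdirectory, as listed above.
FILESYSTEMS = {"public": "NFS", "large": "AFS", "restricted": "AFS"}


def describe_location(path):
    """Return (subdirectory, filesystem, actual_path) for a corpus path.

    Hypothetical helper: it only manipulates path strings and never
    touches the filesystem.
    """
    p = PurePosixPath(path)
    rel = p.relative_to(CORPUS_ROOT)  # ValueError if outside the corpus space
    if not rel.parts or rel.parts[0] not in FILESYSTEMS:
        raise ValueError(f"{path} is not inside a known corpus subdirectory")
    subdirectory = rel.parts[0]
    filesystem = FILESYSTEMS[subdirectory]
    # Assumption: AFS-backed paths live at the same relative path under AFS_ROOT.
    actual = AFS_ROOT / rel if filesystem == "AFS" else p
    return subdirectory, filesystem, str(actual)


if __name__ == "__main__":
    print(describe_location("/group/corpora/large/english_gigaword/5.0"))
```

A path under public is returned unchanged, while paths under large or restricted are rewritten to their AFS location.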

For AFS filesystems, you need to be authenticated in order to access the space. Before reporting an access problem, you should check that you have a valid AFS token.
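
As a rough first diagnostic, a script can distinguish a directory that does not exist from one that exists but cannot be read. The sketch below is an assumption-laden illustration, not an official tool: check_access is a hypothetical name, and it assumes that a missing or expired AFS token shows up as an ordinary "permission denied" error when listing the directory.

```python
import os


def check_access(path):
    """Classify a directory as 'ok', 'missing', or 'denied'.

    Hypothetical helper. 'denied' on an AFS-backed path is a hint to
    check your AFS token before reporting an access problem.
    """
    try:
        os.listdir(path)
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "denied"
    return "ok"


if __name__ == "__main__":
    for directory in ("/group/corpora/public", "/group/corpora/large"):
        print(directory, check_access(directory))
```

A result of 'denied' for a corpus under large or restricted does not by itself say whether the token or the group membership is the cause, so both are worth checking.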

In general, there should be symlinks from /group/corpora/public/... to /group/corpora/large/... for the few corpora that are in that location, to make browsing the filesystem easier. The reason for dividing the space is simply a limit on the size of single disk partitions.

For corpora with restrictive licenses, read access is limited to certain groups of users. The group names are specified in the list of restricted corpora below. If you need access to any of these corpora, please email corpus-admin@inf.ed.ac.uk.

Directory Structure

Corpora that exist in more than one version live in a single directory, whose name is all lowercase and identifies the corpus by name or acronym. Subdirectories identify the different versions of a corpus, including annotated (or otherwise processed) versions. Where processed versions exist, the unmodified version sits under `original'.

Examples:

/group/corpora/public/bnc/1.0 BNC, Version 1.0, unmodified
/group/corpora/public/bnc/2.0 BNC, Version 2.0, unmodified
/group/corpora/public/bnc/parsed_ims BNC, parsed with IMS parser
/group/corpora/public/bnc/parsed_minipar BNC, parsed with Minipar
/group/corpora/public/bllip/original BLLIP, unmodified
/group/corpora/public/bllip/parsed_minipar BLLIP, parsed with Minipar
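
Given this layout, the installed versions of a corpus are simply the subdirectories of its directory. A minimal sketch, assuming the convention above holds for the corpus in question (list_versions is a hypothetical name):

```python
import os


def list_versions(corpus_dir):
    """Return the sorted names of the version subdirectories of corpus_dir.

    Hypothetical helper: plain files are ignored, and symlinks that
    point to directories are counted as versions.
    """
    with os.scandir(corpus_dir) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())
```

For example, list_versions("/group/corpora/public/bnc") would be expected to include names such as 1.0, 2.0 and parsed_ims on a machine where the corpus space is mounted.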

Ordering Corpora

At the end of this page, you will find a list of the corpora that are installed in the DICE corpus space. We have licenses for some other corpora that are currently not installed (often due to space or licensing restrictions).

If you would like to find out whether we have a corpus that you need for your work, or to order new corpora, please email corpus-admin@inf.ed.ac.uk.

LDC Corpora

The University was a member of the Linguistic Data Consortium (LDC) in the following years: 1995, 1996, and 1998-2005. This means that we are entitled to a reduced rate for all corpora released by the LDC during these years. Please see the LDC web site for a list of available corpora.

As of 2005, the University has a subscription membership of the LDC. This means that we automatically receive two copies of all new corpora released by the LDC in 2005 and subsequent years. Note, however, that not all LDC corpora are installed automatically in the corpus space (due to constraints on disk space). If you want a new LDC corpus to be installed, please email corpus-admin@inf.ed.ac.uk.

If you are a corpus administrator, and you have an LDC membership account, please follow this link to find out more details about our LDC membership.

ELRA Corpora

The University is a member of ELRA for 2013. This means that we are entitled to a discounted rate for all corpora (whenever they were released), but only if they are ordered during 2013. In general, we do not join ELRA every year, due to very limited demand for their products. Please contact corpus-admin@inf.ed.ac.uk before ordering any items from ELRA, to find out whether renewing our membership would be cost-effective.

Conference Proceedings

We also maintain an archive of conference and workshop proceedings. These can be found at /group/corpora/public/proceedings in the corpus space. There is also a list of proceedings, with a web interface for easy access, at this link.

List of Corpora

All paths are relative to /group/corpora.
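
The entries in this list follow a simple "key: value" layout, with blank lines between entries. If you keep a copy of the list as plain text, it can be read into dictionaries and the directory fields turned into absolute paths. The following is a sketch only: parse_entries and absolute_path are hypothetical names, and the sample text is copied from two entries in the list.

```python
from pathlib import PurePosixPath

CORPUS_ROOT = PurePosixPath("/group/corpora")


def parse_entries(text):
    """Parse blank-line-separated 'key: value' records into a list of dicts."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def absolute_path(entry):
    """Join an entry's relative directory onto /group/corpora."""
    return str(CORPUS_ROOT / entry["directory"])


SAMPLE = """\
name:      BLLIP Corpus
directory: public/bllip/original
type:      text
size:      172 MB

name:      English Gigaword, fifth edition
directory: large/english_gigaword/5.0
type:      text
size:      9328 MB
"""

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["name"], "->", absolute_path(entry))
```

Splitting on the first colon only keeps names such as "ILE: Italian LExicon" intact.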

Public Corpora

These are corpora that are licensed to the University of Edinburgh, or to the School of Informatics, or are in the public domain. Access is open to all Informatics users.
	
name:      AMI Meeting Corpus
directory: public/ami
type:      speech
size:      1026 MB
licenser:  Idiap
licensee:  UoE
webpage:   here

name:      ABI - Accents of the British Isles
directory: public/abi
type:      speech
size:      18 GB
licenser:  The Speech Ark / The University of Birmingham
licensee:  UoE
webpage:   here


name:      Abstract Meaning Representation (AMR) Annotation Release 1.0
directory: public/amr/1.0
type:      text
size:      24 MB
licenser:  LDC
licensee:  UoE
webpage:   here


name:      ACL/DCI Corpus (includes the original Wall Street Journal corpus)
directory: public/acl_dci/original
type:      text
size:      629 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Asian Elephant Vocalizations
directory: public/elephant
type:      speech
size:      22265 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACL/DCI Corpus, processed version
directory: public/acl_dci/processed
type:      text
size:      17 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE 2004, Multilingual Training Corpus
directory: public/ace/ace_mtc/2004
type:      text
size:      34 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE 2004, Time Normalization English Training Data
directory: public/ace/ace_tern
type:      text
size:      6.9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE 2005, English SpatialML Annotations
directory: public/ace/ace_spatial
type:      text
size:      23 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE 2005, Multilingual Training Corpus
directory: public/ace/ace_mtc/2005
type:      text
size:      1617 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ACE-2, Version 1.0
directory: public/ace/ace2
type:      text
size:      34 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Datasets for Generic Relation Extraction (reACE)
directory: public/ace/reace
type:      text
size:      69 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      AQUAINT-2 Information Retrieval Text Research Collection
directory: public/acquaint
type:      text
size:      1069 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ATIS3 (Air Travel Information Service), NIST Speech Discs 17-1.1 - 17-3.1
directory: public/atis3
type:      speech
size:      1300 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      An English Dictionary of the Tamil Verb
directory: public/tamil_dictionary
type:      text
size:      0.52 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Gigaword, fourth edition
directory: public/arabic_gigaword/4.0
type:      text
size:      2588 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Gigaword, fifth edition
directory: public/arabic_gigaword/5.0
type:      text
size:      3.2 GB
licenser:  LDC
licensee:  UoE
webpage:   here


name:      Standard Arabic Morphological Analyzer
directory: public/arabic_morphology
type:      text
size:      5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Broadcast News Transcripts
directory: public/arabic_news_transcripts
type:      text
size:      3.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Translation Corpus, Part 1
directory: public/arabic_translation/part1
type:      text
size:      2.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Translation Corpus, Part 2
directory: public/arabic_translation/part2
type:      text
size:      3.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Newswire English Translation Collection
directory: public/arabic_translation/newswire
type:      text
size:      13 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Treebank, Part 1, Version 2.0
directory: public/arabic_treebank/part1v2.0
type:      text
size:      266 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Treebank, Part 1, Version 2.0, English Translation
directory: public/arabic_treebank/part1v2.0/translation
type:      text
size:      0.27 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Arabic Treebank, Part 3, Version 2.0
directory: public/arabic_treebank/part3v2.0
type:      text
size:      891 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Aurora Noisy TI Digits Database, Version 2.0
directory: public/aurora
type:      speech
size:      2629 MB
licenser:  ELDA
licensee:  UoE
webpage:   here

name:      Bible Corpus, 56 languages 
directory: public/bible
type:      text
size:      246 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      BBN IE/NE-tagged HUB-4 Training Transcripts
directory: public/bbn_ie_ne_tagged
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BBN Pronoun Coreference and Entity Type Corpus
directory: public/bbn_pronoun_coref
type:      text
size:      22 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BLLIP Corpus
directory: public/bllip/original
type:      text
size:      172 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BLLIP Corpus, parsed with Minipar
directory: public/bllip/parsed_minipar
type:      text
size:      293 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BLLIP Corpus, parsed in KAF format
directory: public/bllip/parsed_kaf
type:      text
size:      1382 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      BLLIP Corpus, text extracted
directory: public/bllip/text
type:      text
size:      290 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Basic Electricity and Electronics Corpus
directory: public/bee
type:      text
size:      2 MB
licenser:  University of Pittsburgh
licensee:  freely available
webpage:   here

name:      Biomedical Information Extraction Corpus
directory: public/biomedical_ie
type:      text
size:      320 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Blog 06 Test Collection
directory: public/blogs_collection
type:      text
size:      25000 MB
licenser:  University of Glasgow
licensee:  ILCC/HCRC
webpage:   here

name:      Boston University Radio Speech Corpus
directory: public/bu_radio
type:      speech
size:      2424 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      British National Corpus, Version 1.0
directory: public/bnc/1.0
type:      text
size:      2866 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, marked up in XML
directory: public/bnc/xml
type:      text
size:      815 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with Charniak parser
directory: public/bnc/parsed_charniak
type:      text
size:      419 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with IMS parser
directory: public/bnc/parsed_ims
type:      text
size:      2088 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with Minipar
directory: public/bnc/parsed_minipar
type:      text
size:      448 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with RASP parser
directory: public/bnc/parsed_rasp
type:      text
size:      3520 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, raw text without any markup
directory: public/bnc/text
type:      text
size:      579 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, various LTG data
directory: public/bnc/data
type:      text
size:      7 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 2.0 (World Edition)
directory: public/bnc/2.0
type:      text
size:      1779 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 2.0 (World Edition), indexed for IMS Corpus Workbench
directory: public/bnc/corpus_workbench
type:      text
size:      967 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 3.0 (XML Edition)
directory: public/bnc/3.0
type:      text
size:      4619 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      Buckwalter Arabic Morphological Analyzer Version 2.0
directory: public/buckwalter
type:      lexicon
size:      4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Corpus of Spoken Dutch, CGN
directory: public/cgn
type:      speech
size:      4554 MB
licenser:  Dutch Language Union
licensee:  ILCC/HCRC
webpage:   here

name:      CCGbank, Version 1.1
directory: public/ccgbank
type:      text
size:      387 MB
licenser:  LDC
licensee:  UoE
webpage:   CCG group home page, LDC catalog entry

name:      CELEX Lexical Database, Version 2.0
directory: public/celex/2.0
type:      lexicon
size:      288 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      CoNLL 2006 Shared Task Data
directory: public/conll/2006
type:      text
size:      138 MB
licenser:  LDC
licensee:  UoE
webpage:   here, here, and  here

name:      CoNLL 2008 Shared Task Data
directory: public/conll/2008
type:      text
size:      84 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      CoNLL 2009 Shared Task Part 2
directory: public/conll/2009
type:      text
size:      389 MB
licenser:  LDC
licensee:  UoE
webpage:   here


name:      COMLEX English Syntax Corpus
directory: public/comlex/corpus
type:      text
size:      98 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      COMLEX English Syntax Lexicon
directory: public/comlex/lexicon
type:      lexicon
size:      18 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      CSTR TIMIT Sentence Data
directory: public/timit/cstr
type:      speech
size:      478 MB
licenser:  LDC
licensee:  UoE
webpage:   none

name:      CSTR Weather Database for Speech Synthesis
directory: public/synthesis/cstr/weather
type:      text
size:      255 MB
licenser:  CSTR
licensee:  UoE
webpage:   none

name:      CallFriend American English Speech Corpus Non-Southern Dialect
directory: public/callfriend/cf_ameng_n
type:      speech
size:      1.5 GB
licenser:  LDC
licensee:  UoE
webpage:   here


name:      Callhome American English, Speech
directory: public/callhome/english/speech
type:      speech
size:      1830 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Callhome Mandarin Chinese Transcripts - XML version
directory: public/callhome/chinese
type:      speech
size:      9.5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Callhome Spanish Dialog Act Annotation
directory: public/callhome/spanish/dialog_act
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Callhome Spanish Lexicon
directory: public/callhome/spanish/lexicon
type:      text
size:      3.1 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Callhome Spanish Transcripts
directory: public/callhome/spanish/transcripts
type:      text
size:      1.9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Canadian Hansard
directory: public/canadian_hansard
type:      text
size:      685 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      CHAINS - CHAracterizing INdividual Speakers
directory: public/chains
type:      speech
size:      3.3 GB
licenser:  University College Dublin
licensee:  UoE
webpage:   here

name:      Childes Child Language Database
directory: public/childes
type:      text
size:      1266 MB
licenser:  Carnegie Mellon University
licensee:  GPL
webpage:   here

name:      Chinese English Name Entity Lists, Version 1.0
directory: public/chinese_english_ne
type:      text
size:      97 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Gigaword, second edition
directory: public/chinese_gigaword/2.0
type:      text
size:      1700 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Gigaword, fourth edition
directory: public/chinese_gigaword/4.0
type:      text
size:      2990 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese News Translation Corpus, Part 1
directory: public/chinese_translation
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Proposition Bank 1.0
directory: public/chinese_propbank/1.0
type:      text
size:      21 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Proposition Bank 2.0
directory: public/chinese_propbank/2.0
type:      text
size:      112 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 2.0
directory: public/chinese_treebank/2.0
type:      text
size:      4.3 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 2.0, English Translation
directory: public/chinese_treebank/2.0/translation
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here


name:      Chinese Treebank, Version 3.0
directory: public/chinese_treebank/3.0
type:      text
size:      14.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 3.0, English Translation
directory: public/chinese_treebank/3.0/translation
type:      text
size:      1.7 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 5.0
directory: public/chinese_treebank/5.0
type:      text
size:      31 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese Treebank, Version 6.0
directory: public/chinese_treebank/6.0
type:      text
size:      115 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Chinese-English Translation Lexicon, Version 3.0
directory: public/chinese_english_lexicon
type:      lexicon
size:      1.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Christine Corpus of Spoken British English
directory: public/christine
type:      text
size:      4.3 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Conference Proceedings from CDROMs
directory: public/proceedings
type:      text
size:      18000 MB and growing
licenser:  various
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (CSR-III Speech)
directory: public/csr/csr3/speech
type:      speech
size:      1952 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (CSR-III Text)
directory: public/csr/csr3/text
type:      text
size:      1791 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (HUB-4 Language Model)
directory: public/csr/hub4
type:      text
size:      845 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Corpus of IMDB Movie Summaries, indexed for IMS Corpus Workbench
directory: public/imdb
type:      text
size:      168 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Cytology Corpus (Alvey Project)
directory: public/cytol
type:      speech
size:      372 MB
licenser:  CSTR
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2000 Dialogue Act Tagged
directory: public/darpa_communicator/2000/tagged
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2000 Evaluation
directory: public/darpa_communicator/2000/evaluation
type:      speech
size:      4384 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2001 Dialogue Act Tagged
directory: public/darpa_communicator/2001/tagged
type:      text
size:      88 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2001 Evaluation
directory: public/darpa_communicator/2001/evaluation
type:      speech
size:      3804 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Resource Management Continuous Speech Database (RM1)
directory: public/resource_management/rm1
type:      speech
size:      387 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DARPA Resource Management Continuous Speech Database (RM2)
directory: public/resource_management/rm2
type:      speech
size:      688 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DCIEM Sleep Deprivation Corpus
directory: public/dciem
type:      speech
size:      7448 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      DSO Corpus of Sense-Tagged English
directory: public/dso
type:      text
size:      37 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Document Understanding Conference (DUC) data, 2001-2007
directory: public/duc
type:      text
size:      108 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      Dickens Corpus, indexed for IMS Corpus Workbench
directory: public/dickens
type:      text
size:      65 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      PARC 700 Dependency Bank
directory: public/depbank
type:      text
size:      3 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Diphone Voices for Festival
directory: public/synthesis/diphone_voices
type:      speech
size:      4477 MB
licenser:  CSTR
licensee:  UoE
webpage:   here

name:      Discourse Graphbank
directory: public/discourse_graphbank
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Dundee Corpus of English and French Eye-movement Data
directory: public/dundee_eyemovement
type:      speech
size:      207 MB
licenser:  Department of Psychology, University of Dundee
licensee:  UoE
webpage:   none

name:      ESP Game 100k Corpus
directory: public/esp_game
type:      text and images
size:      2783 MB
licenser:  Luis von Ahn
licensee:  publicly available
webpage:   here

name:      EMILLE/CIIL Corpus
directory: public/emille
type:      text
size:      1645 MB
licenser:  ELDA
licensee:  ILCC/HCRC
webpage:   here

name:      Electromagnetic Articulograph (EMA) Data
directory: public/ema/other
type:      speech and EMA
size:      2394 MB
licenser:  QMUC/CSTR
licensee:  UoE
webpage:   here

name:      Emotional Prosody Speech and Transcripts
directory: public/emotional_speech
type:      speech
size:      2845 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Web Treebank
directory: public/english_web_treebank
type:      text
size:      104 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, first edition
directory: large/english_gigaword/1.0/original
type:      text
size:      3960 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, first edition, parsed with Minipar
directory: large/english_gigaword/1.0/parsed_minipar
type:      text
size:      17157 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, first edition, tokenized and tagged
directory: large/english_gigaword/1.0/tagged
type:      text
size:      22169 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, fourth edition
directory: large/english_gigaword/4.0
type:      text
size:      8223 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, fifth edition
directory: large/english_gigaword/5.0
type:      text
size:      9328 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Gigaword, annotated
directory: large/annotated_english_gigaword
type:      text
size:      172 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English Intonation in the British Isles Corpus
directory: public/ivie
type:      text
size:      2471 MB
licenser:  University of Oxford
licensee:  freely available
webpage:   here

name:      English-Arabic Parallel Treebank
directory: public/english_arabic_treebank
type:      text
size:      18 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      English-Chinese Translation Treebank 1.0
directory: public/chinese_translation_treebank
type:      text
size:      9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Enron Email Dataset
directory: public/enron/original
type:      text
size:      1646 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Enron Email Dataset, prepared for Rainbow
directory: public/enron/rainbow
type:      text
size:      290 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Enron Email Dataset, with Topic Annotations
directory: public/enron/annotations
type:      text
size:      0.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      European Corpus Initiative Multilingual Corpus
directory: public/eci
type:      text
size:      685 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      European News Corpus
directory: public/european_news
type:      text
size:      715 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      European Parliament Interpretation Corpus (EPIC)
directory: public/ELRA/ELRA-S0323-EPIC-European-Parliament-Interpretation-Corpus
type:      text
size:      3.8 GB
licenser:  ELRA
licensee:  The School of Informatics
webpage:   here

name:      European Parliament Proceedings Parallel Corpus, Version 2.0
directory: public/europarl
type:      text
size:      3809 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Extended VerbNet
directory: public/verbnet
type:      lexicon
size:      2.5 MB
licenser:  University of Colorado
licensee:  freely available
webpage:   here

name:      Extended WordNet Lexical Database, WordNet Version 2.0, Extension Version 1.1
directory: public/wordnet/xwn
type:      lexicon
size:      154 MB
licenser:  University of Texas at Dallas
licensee:  freely available
webpage:   here

name:      Fisher English Training Speech, Part 1, Speech
directory: large/fisher/speech
type:      speech
size:      28 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Fisher English Training Speech, Part 1, Transcripts
directory: large/fisher/transcripts
type:      text
size:      275 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Factbank
directory: public/factbank
type:      text
size:      22 MB
licenser:  LDC
licensee:  UoE
webpage:   here


name:      Fisher English Training Speech, Part 2, Speech
directory: large/fisher/speech
type:      speech
size:      29 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Fisher English Training Speech, Part 2, Transcripts
directory: large/fisher/transcripts
type:      text
size:      279 MB
licenser:  LDC
licensee:  UoE
webpage:   here


name:      FrameNet 1.1
directory: public/framenet/1.1
type:      text
size:      1024 MB
licenser:  University of California at Berkeley
licensee:  freely available
webpage:   here

name:      FrameNet 1.3
directory: public/framenet/1.3
type:      text
size:      783 MB
licenser:  University of California at Berkeley
licensee:  freely available
webpage:   here

name:      Frankfurter Rundschau corpus (part of ECI), tokenized and tagged
directory: public/rundschau
type:      text
size:      1074 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      French Treebank, Version 1.4
directory: public/french_treebank
type:      text
size:      147 MB
licenser:  LLF, Universite Paris 7
licensee:  UoE
webpage:   here

name:      French Gigaword, second edition
directory: public/french_gigaword/2.0
type:      text
size:      1790 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      French Gigaword, third edition
directory: public/french_gigaword/3
type:      text
size:      2 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
directory: public/gale/gale_ch_nw_wb_word_align_p3/
type:      text
size:      29 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Arabic Blog Parallel Text
directory: public/gale/galep1_ara_bl_ptxt/
type:      text
size:      5.7 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
directory: public/gale/ara_bn_ptext
type:      text
size:      36 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
directory: public/gale/ar_bn_ptxt_p2/
type:      text
size:      2.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Blog Parallel Text
directory: public/gale/gale_p1_ch_blog
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
directory: public/gale/ch_bn_ptxt
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
directory: public/gale/ch_bn_ptxt_p2
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
directory: public/gale/ch_bn_ptxt_p3
type:      text
size:      4.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 1 Distillation Training
directory: public/gale/galep1_distill_tr
type:      text
size:      31 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - MTPlus Pilot
directory: restricted/gale/GALE-P3-MTPlus_Pilot
group:     smt
type:      text
size:      0.86 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - Transcripts
directory: restricted/gale/GALE-P3R2/transcription
group:     smt
type:      text
size:      117 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - Translations
directory: restricted/gale/GALE-P3R2/translationGALE-P3R1
group:     smt
type:      text
size:      11.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      German Law Corpus, indexed for IMS Corpus Workbench
directory: public/german_law
type:      text
size:      40 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      GlobalPhone
directory: public/global_phone
type:      speech
size:      18000 MB
licenser:  ELDA
licensee:  UoE
webpage:   GlobalPhone

name:      Google Book Corpus
directory: public/google_books
type:      text
size:      8119 MB
licenser:  LDC
licensee:  ILCC/HCRC
webpage:   n/a

name:      Google n-Gram Corpus
directory: public/google_ngrams
type:      text
size:      25000 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Google Syntactic N-grams (derived from the Google English Books collection)
directory: large/google-syntactic-ngrams
type:      dependency tree fragments
size:      318 GB
licenser:  Google (Creative Commons Attribution-Non Commercial ShareAlike 3.0 Unported License)
licensee:  UoE
webpage:   here

name:      Gulf Arabic Conversational Telephone Speech, Transcripts
directory: public/arabic_telephone
type:      text
size:      11 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      HUB-5 English Evaluation 1997
directory: public/hub5/1997
type:      speech
size:      593 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      HUB-5 English Evaluation 1998
directory: public/hub5/1998
type:      speech + text
size:      607 MB
licenser:  LDC
licensee:  UoE
webpage:   here and here

name:      HUB-5 English Evaluation 2000
directory: public/hub5/2000
type:      speech + text
size:      617 MB
licenser:  LDC
licensee:  UoE
webpage:   here and here

name:      HUB-5 English Evaluation 2001
directory: public/hub5/2001
type:      speech
size:      2.1 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      HARD 2004 Topics and Annotations
directory: public/hard
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hebrew Treebank
directory: public/hebrew_treebank
type:      text
size:      20 MB
licenser:  Technion
licensee:  public domain
webpage:   here

name:      Hindi Wordnet
directory: public/hindi_wordnet
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong Hansard Parallel Text, Alignments
directory: public/hong_kong_hansard/alignments
type:      text
size:      91 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong Hansard Parallel Text, Text
directory: public/hong_kong_hansard/text
type:      text
size:      110 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong Laws Parallel Text
directory: public/hong_kong_laws
type:      text
size:      75 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong News Parallel Text, Alignments
directory: public/hong_kong_news/alignments
type:      text
size:      107 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Hong Kong News Parallel Text, Text
directory: public/hong_kong_news/text
type:      text
size:      81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ILE: Italian LExicon
directory: public/italian_lexicon
type:      text
size:      20 MB
licenser:  ELRA
licensee:  UoE
webpage:   here

name:      ICSI Meeting Speech
directory: public/icsi_meeting/speech
type:      speech
size:      33400 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ICSI Meeting Transcripts
directory: public/icsi_meeting/transcripts
type:      text
size:      3.51 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      IMS Corpus Workbench (corpus registry files only)
directory: public/corpus_workbench
type:      text/speech
size:      0 MB
licenser:  IMS Stuttgart
licensee:  ILCC/HCRC
webpage:   here

name:      ISI Chinese-English Automatically Extracted Parallel Text
directory: public/isi_chi_eng_par_txt
type:      text
size:      206 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ISL Meeting Speech
directory: public/isl_meeting/speech
type:      speech
size:      5975 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      ISL Meeting Transcripts
directory: public/isl_meeting/transcripts
type:      text
size:      1.81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Instruction-based Learning for Mobile Robots Corpus
directory: public/ibl
type:      speech
size:      123 MB
licenser:  University of Edinburgh/University of Plymouth
licensee:  freely available
webpage:   here

name:      KAIST Korean Speech Database
directory: public/kaist
type:      speech
size:      3711 MB
licenser:  Korea Advanced Institute of Science and Technology (KAIST)
licensee:  UoE
webpage:   none

name:      Korean Broadcast News Transcripts 
directory: public/korean_news_transcripts
type:      text
size:      1.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Korean Propbank
directory: public/korean_propbank
type:      text
size:      24 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Korean Treebank, Version 1.0
directory: public/korean_treebank/1.0
type:      text
size:      6.9 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Korean Treebank, Version 2.0
directory: public/korean_treebank/2.0
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Lancaster Corpus of Mandarin Chinese (LCMC)
directory: public/lcmc
type:      text
size:      46 MB
licenser:  ELDA
licensee:  UoE
webpage:   here

name:      Levantine Arabic QT Training Data Set 5, Transcripts
directory: public/arabic_qt_data
type:      text
size:      27 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Lucy Corpus of Written British English
directory: public/lucy
type:      text
size:      4.7 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Rich Transcription (RT) Evaluation Project datasets
directory: public/rich_transcription
type:      speech + text
size:      various - see separate entry per corpus
licenser:  LDC
licensee:  UoE
webpage:   NIST page describing RT tasks from 2002 onwards

name:      MDE RT-02 Rich Transcription Broadcast News and Conversational Telephone Speech 2002
directory: public/rich_transcription/rt-02/train/speech
type:      speech
size:      815 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-03 Training Data, Speech
directory: public/rich_transcription/rt-03/train/speech
type:      speech
size:      5256 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-03 Training Data, Text and Annotations
directory: public/rich_transcription/rt-03/train/text
type:      text
size:      723 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      2003 NIST Rich Transcription Evaluation Data
directory: public/rich_transcription/rt-03/eval/speech
type:      speech
size:      2.16 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-04 Training Data, Speech
directory: public/rich_transcription/rt-04/train/speech
type:      speech
size:      4829 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MDE RT-04 Training Data, Text and Annotations
directory: public/rich_transcription/rt-04/train/text
type:      text
size:      567 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE)
directory: public/mandarin_transcripts/hub4-ne
type:      text
size:      2.35 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MMA_2012 Multiple Microphone Array corpus
directory: large/2012_MMA
type:      speech
size:      115 GB
licenser:  UoE
licensee:  UoE
webpage:   here

name:      MOCHA Electromagnetic Articulograph (EMA) Corpus
directory: public/ema/mocha
type:      speech
size:      2221 MB
licenser:  QMUC/CSTR
licensee:  UoE
webpage:   here

name:      MRC Psycholinguistic Database
directory: public/mrc
type:      lexicon
size:      11 MB
licenser:  MRC
licensee:  freely available
webpage:   here

name:      Machine-readable Spoken English Corpus
directory: public/marsec
type:      speech
size:      2 MB
licenser:  Reading University
licensee:  UoE
webpage:   here

name:      Macrophone: American English Segment of the Polyphone Corpus
directory: public/macrophone
type:      speech
size:      3809 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Mandarin Transcripts (HUB-5, 2001)
directory: public/mandarin_transcripts/hub5
type:      text
size:      0.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Mandarin Transcripts, HKUST Telephone Data, Part 1
directory: public/mandarin_transcripts/hkust
type:      text
size:      11 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Maptask Corpus
directory: public/maptask
type:      speech
size:      13665 MB
licenser:  LDC and UoE/LDC
licensee:  UoE
webpage:   Maptask home page, LDC catalog entry

name:      Mawukakan Lexicon
directory: public/mawukakan_lexicon
type:      lexicon
size:      4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MC-WSJ-AV - multichannel Wall Street Journal audiovisual
directory: public/MC-WSJ-AV
type:      speech
size:      11 GB
licenser:  UoE
licensee:  UoE
webpage:   here

name:      Message Understanding Conference (MUC) 6
directory: public/muc/muc6
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Message Understanding Conference (MUC) 6, Additional News Text 
directory: public/muc/muc6/additional_text
type:      text
size:      0.67 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Message Understanding Conference (MUC) 7
directory: public/muc/muc7
type:      text
size:      45.7 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      MT08 - NIST Open Machine Translation 2008 Evaluation data
directory: public/mt08_shared_data
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multext East, Version 3.0
directory: public/multext/east/3.0
type:      text
size:      295 MB
licenser:  Jozef Stefan Institute, Ljubljana
licensee:  ILCC/HCRC
webpage:   here

name:      Multext East, Version 4.0
directory: public/multext/east/4.0
type:      text
size:      341 MB
licenser:  Jozef Stefan Institute, Ljubljana
licensee:  ILCC/HCRC
webpage:   here

name:      Multext JOC
directory: public/multext/joc
type:      text
size:      122 MB
licenser:  ELDA
licensee:  ILCC/HCRC
webpage:   here

name:      Multilingual Corpora for Cooperation
directory: public/mlcc
type:      text
size:      1223 MB
licenser:  internal
licensee:  internal
webpage:   here

name:      Multilingual Semcor, Version 1.1
directory: public/semcor/multisemcor
type:      text
size:      142 MB
licenser:  ITC/IRST
licensee:  University of Edinburgh
webpage:   here

name:      Multiple-Translation Arabic Corpus, Part 1
directory: public/mt_arabic/part1
type:      text
size:      5.0 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Arabic Corpus, Part 2
directory: public/mt_arabic/part2
type:      text
size:      2.5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 1, Version 1.0
directory: public/mt_chinese/part1/1.0
type:      text
size:      4.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 1, Version 2.0
directory: public/mt_chinese/part1/2.0
type:      text
size:      2.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 2
directory: public/mt_chinese/part2
type:      text
size:      3.5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 3
directory: public/mt_chinese/part3
type:      text
size:      1.1 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Multiple-Translation Chinese Corpus, Part 4
directory: public/mt_chinese/part4
type:      text
size:      5.2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      News Spike Corpus
directory: public/news_spike
type:      text
size:      800 MB
licenser:  University of Washington
licensee:  freely available
webpage:   here

name:      NPS Internet Chatroom Conversations, Release 1.0
directory: public/nps_chat
type:      text
size:      7 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NIST Meeting Pilot Corpus Transcripts and Metadata
directory: public/nist_meeting_pilot
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NIST Speaker Recognition Evaluation 2002
directory: public/nist_speaker_rec
type:      speech
size:      4724 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NIST TI Digits
directory: public/tidigits
type:      speech
size:      786 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NTIMIT Acoustic-Phonetic Continuous Speech Corpus
directory: public/timit/ntimit
type:      speech
size:      1146 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      New York Times Annotated Corpus
directory: public/nyt_annotated
type:      text
size:      3202 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      NYNEX Phonebook
directory: public/phonebook
type:      speech
size:      1400 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Newsgroup Corpus
directory: public/newsgroups
type:      text
size:      55 MB
licenser:  public domain
licensee:  none
webpage:   various newsgroups

name:      NomBank
directory: public/nombank
type:      text
size:      56 MB
licenser:  NYU
licensee:  none
webpage:   here

name:      North American News Text Corpus
directory: public/american_news/original
type:      text
size:      2342 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      North American News Text Corpus, parsed with Minipar
directory: public/american_news/parsed_minipar
type:      text
size:      3392 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OntoNotes Release 1.0
directory: public/ontonotes/1.0
type:      text
size:      750 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OntoNotes Release 2.0
directory: public/ontonotes/2.0
type:      text
size:      1299 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OntoNotes Release 3.0
directory: public/ontonotes/3.0
type:      text
size:      444 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OntoNotes Release 4.0
directory: public/ontonotes/4.0
type:      text
size:      764 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      OHSUMED Corpus (also used for the TREC 9 Filtering Track)
directory: public/ohsumed
type:      text
size:      1176 MB
licenser:  NIST
licensee:  freely available
webpage:   here

name:      PASCAL Syntax Induction Challenge Training and Development  
directory: public/pascal_syntax_challenge
type:      text
size:      102 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   here

name:      Penn Discourse Treebank, Version 1.0
directory: public/penn_discourse_treebank/1.0
type:      text
size:      10 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   here

name:      Penn Discourse Treebank, Version 2.0
directory: public/penn_discourse_treebank/2.0
type:      text
size:      38 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   here

name:      Penn Treebank, Version 2.0
directory: public/penn_treebank/2.0
type:      text
size:      655 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Penn Treebank, Version 3.0
directory: public/penn_treebank/3.0
type:      text
size:      256 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Prague Czech-English Dependency Treebank, Version 1.0
directory: public/prague_treebank/
type:      text
size:      587 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Proposition Bank, Version 1.0
directory: public/propbank
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      RST Discourse Treebank
directory: public/rst_treebank
type:      text
size:      26 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Research Cyc
directory: public/cyc
type:      text
size:      4118 MB
licenser:  Cycorp
licensee:  UoE
webpage:   here

name:      Reuters Text Categorization Corpus 21578
directory: public/reuters/21578
type:      text
size:      28 MB
licenser:  Reuters
licensee:  freely available
webpage:   here

name:      Roget's Thesaurus from 1911
directory: public/roget
type:      text
size:      12 MB
licenser:  public domain
licensee:  freely available
webpage:   here

name:      Russian-English Computer Security Parallel Text
directory: public/rus_eng_compsec_para
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Santa Barbara Corpus of Spoken American English - Parts I to IV
directory: public/santa_barbara
type:      speech
size:      6.7 GB
licenser:  University of California
licensee:  UoE
webpage:   here

name:      SALSA Corpus
directory: public/salsa
type:      text
size:      251 MB
licenser:  Saarland University
licensee:  UoE
webpage:   here

name:      SAID (Syntactically Annotated Idiom Dataset)
directory: public/said
type:      text
size:      3 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      SOLE Project Corpus
directory: public/synthesis/cstr/sole
type:      speech
size:      895 MB
licenser:  HCRC
licensee:  UoE
webpage:   here

name:      Search Engine Logs (Alltheweb, Excite, Altavista)
directory: public/searchengine_logs
type:      text
size:      440 MB
licenser:  Jim Jansen, Penn State University
licensee:  ILCC/HCRC
webpage:   ?

name:      Semcor Semantically Annotated Corpus, Version 1.6
directory: public/semcor/1.6
type:      text
size:      39 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Semcor Semantically Annotated Corpus, Version 2.0
directory: public/semcor/2.0
type:      text
size:      34 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Sinorama Chinese English Parallel Text
directory: public/sinorama
type:      text
size:      64 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Broadcast News
directory: public/spanish_broadcast_news
type:      speech
size:      5200 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Gigaword, First Edition
directory: public/spanish_gigaword/1.0
type:      text
size:      1775 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Gigaword, Second Edition
directory: public/spanish_gigaword/2.0
type:      text
size:      2679 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Newswire, vols 1 and 2
directory: public/spanish_newswire
type:      text
size:      556 MB + 624 MB
licenser:  LDC
licensee:  UoE
webpage:   here, here

name:      Spanish Treebank
directory: public/spanish_treebank
type:      text
size:      8 MB
licenser:  University of Barcelona
licensee:  freely available
webpage:   here

name:      Susanne Corpus of Written American English, Version 1.0
directory: public/susanne/1.0
type:      text
size:      5 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Susanne Corpus of Written American English, Version 5.0
directory: public/susanne/5.0
type:      text
size:      6 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Switchboard Corpus, NXT Annotations
directory: large/switchboard/nxt
type:      speech
size:      1230 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Switchboard 1 Telephone Speech Corpus, Release 2
directory: large/switchboard/switchboard1
type:      speech
size:      1485 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Switchboard 2 Telephone Speech Corpus, Phases 1-3
directory: large/switchboard/switchboard2
type:      speech
size:      50643 MB
licenser:  LDC
licensee:  UoE
webpage:   here and here and here

name:      Switchboard Cellular Telephone Speech Corpus, Part 1, Audio
directory: large/switchboard/cellular/part1/audio
type:      speech
size:      1401 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Switchboard Cellular Telephone Speech Corpus, Part 1, Transcripts
directory: large/switchboard/cellular/part1/transcripts
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Switchboard Cellular Telephone Speech Corpus, Part 2, Audio
directory: large/switchboard/cellular/part2/audio
type:      speech
size:      11364 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TORGO Database of Dysarthric Articulation
directory: public/torgo
type:      speech
size:      15234 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TAC 2008 Data
directory: public/tac/2008/
type:      text
size:      34 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      TAC 2009 Data
directory: public/tac/2009/
type:      text
size:      226 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      TAC 2010 Data
directory: public/tac/2010/
type:      text
size:      12 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      TC-STAR (a large collection of several sub-corpora)
directory: public/ELRA/TC-STAR
type:      text and speech
size:      83 GB (current size)
licenser:  ELRA
licensee:  The School of Informatics
webpage:   here

name:      TDT Pilot Corpus
directory: public/tdt/tdt2_pilot
type:      text
size:      53 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT2 Careful Transcription Text
directory: public/tdt/tdt2_careful_text
type:      text
size:      2.4 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT2 Careful Transcription Audio
directory: public/tdt/tdt2_careful_audio
type:      speech
size:      1077 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT2 Multilanguage Text Corpus, Version 4.0
directory: public/tdt/tdt2_multilanguage
type:      text
size:      623 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT3 Multilanguage Text Corpus, Version 2.0
directory: public/tdt/tdt3_multilanguage
type:      text
size:      367 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT4 Multilingual Broadcast News Speech Corpus
directory: public/tdt/tdt4_multilingual
type:      speech
size:      xxx GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT5 Multilingual Text
directory: public/tdt/tdt5_multilingual_text
type:      text
size:      1200 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TDT5 Topics and Annotations
directory: public/tdt/tdt5_topics_and_annotations
type:      text
size:      80 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TIMIT Acoustic-Phonetic Continuous Speech Corpus
directory: public/timit/original
type:      speech
size:      668 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TRECVID 2003 Keyframes & Transcripts
directory: public/trec/trecvid/2003
type:      video
size:      3570 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TRECVID 2005 Keyframes & Transcripts
directory: public/trec/trecvid/2005
type:      video
size:      3283 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Tageszeitung (TAZ) Corpus
directory: public/taz
type:      text
size:      1439 MB
licenser:  Contrapress Media GmbH
licensee:  ILCC/HCRC
webpage:   here

name:      Talbanken05 Swedish Treebank
directory: public/talbanken
type:      text
size:      144 MB
licenser:  University of Växjö and University of Lund 
licensee:  freely available
webpage:   here

name:      Timebank 1.2
directory: public/timebank
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      TITML - Tokyo Institute of Technology Multilingual Speech Corpus (currently we have: Indonesian, Icelandic)
directory: public/tokyo_multilingual
type:      speech
size:      2.5 GB
licenser:  Tokyo Institute of Technology
licensee:  UoE
webpage:   here

name:      ToBI Guidelines and Examples
directory: public/tobi_course
type:      text
size:      19 MB
licenser:  Ohio State University 
licensee:  UoE
webpage:   here

name:      Translanguage English Database (TED), Speech
directory: public/ted/speech
type:      speech
size:      2903 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Translanguage English Database (TED), Transcripts
directory: public/ted/transcripts
type:      text
size:      1.3 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      UASPEECH
directory: public/UASPEECH
type:      speech
size:      15 GB
licenser:  University of Illinois
licensee:  UoE
webpage:   here

name:      Ummah Arabic English Parallel News Text
directory: public/ummah
type:      text
size:      6.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Underspecified Rhetorical Markup Language (URML) Corpus, also known as the Potsdam Commentary Corpus
directory: public/urml
type:      text
size:      1.7 MB
licenser:  University of Potsdam
licensee:  HCRC/ILCC
webpage:   here

name:      UKWAC (Web as Corpus) English, parsed version
directory: public/wac/ukwac
type:      text
size:      31000 MB
licenser:  University of Bologna
licensee:  freely available
webpage:   here

name:      UKWAC (Web as Corpus) English, dependency-parsed version
directory: public/wac/ukwac_dep
type:      text
size:      16000 MB
licenser:  University of Bologna
licensee:  freely available
webpage:   here

name:      Unified Linguistic Annotation Text Collection
directory: public/unified_annotation
type:      text
size:      363 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Wackypedia EN 1.0
directory: public/wac/wackypedia_en
type:      text
size:      6000 MB
licenser:  University of Bologna
licensee:  freely available
webpage:   here

name:      WSJ0 complete (also known as CSR-I)
directory: public/wsj/wsj0
type:      speech
size:      8.9 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      WSJ1 complete (also known as CSR-II)
directory: public/wsj/wsj1
type:      speech
size:      18 GB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      WSJCAM0 Cambridge Read News
directory: public/wsjcam0/original
type:      speech
size:      3848 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      WSJCAM0 Cambridge Read News, processed data
directory: public/wsjcam0/data
type:      speech
size:      13571 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Wikipedia Corpus (raw data, dated 2009-06-18)
directory: large/wikipedia/raw
type:      text
size:      22797 MB
licenser:  various
licensee:  freely available
webpage:   here

name:      Wikipedia Corpus (INEX 2006 Corpus)
directory: large/wikipedia/inex
type:      text
size:      4959 MB
licenser:  various
licensee:  ILCC/HCRC
webpage:   here

name:      Wikipedia Corpus (INEX 2006 Corpus), Question Answering Version
directory: large/wikipedia/inex_qa
type:      text
size:      5143 MB
licenser:  various
licensee:  freely available
webpage:   here

name:      Wikipedia Corpus, Tagged and Cleaned
directory: large/wikipedia/tagged_cleaned
type:      text
size:      51866 MB
licenser:  various
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 1.6
directory: public/wordnet/1.6
type:      lexicon
size:      40 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 1.7.1
directory: public/wordnet/1.7.1
type:      lexicon
size:      40 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 2.0
directory: public/wordnet/2.0
type:      lexicon
size:      41 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 2.1
directory: public/wordnet/2.1
type:      lexicon
size:      38 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 3.0, with sense maps and standoff annotation
directory: public/wordnet/3.0
type:      lexicon
size:      92 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Wordlists for various languages
directory: public/wordlists
type:      lexicon
size:      2 MB
licenser:  n/a
licensee:  freely available
webpage:   n/a

name:      Global Yoruba Lexical Database 1.0
directory: public/yoruba
type:      text
size:      183 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Xinhua Chinese English Parallel News Text, Version 1.0 beta
directory: public/xinhua
type:      text
size:      40 MB
licenser:  LDC
licensee:  UoE
webpage:   here

Restricted Corpora

These are corpora that are licensed to a particular institute, project, or group of individuals. Access to each corpus is limited to a specific Unix group containing the authorised users; the group is named in the `group' field of each entry below.
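
As a rough illustration of the group-based restriction, the following Python sketch checks whether the current account belongs to the Unix group listed for a corpus. The function names are hypothetical and not part of any DICE tooling, and the check only covers local group membership: a valid AFS token is still needed for the AFS-hosted directories, which this sketch does not test.

```python
import grp
import os


def current_group_names():
    """Return the names of the Unix groups the current process belongs to."""
    names = set()
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The gid has no entry in the group database; skip it.
            continue
    return names


def has_corpus_group(group, group_names=None):
    """Check whether `group` (a record's "group:" value) is among group_names.

    If group_names is omitted, the groups of the current process are used.
    """
    if group_names is None:
        group_names = current_group_names()
    return group in group_names


if __name__ == "__main__":
    # "smt" is the group listed for the GALE corpora in this catalogue.
    print("member of smt:", has_corpus_group("smt"))
```

If the check reports that you are not a member of the relevant group, request access from corpus-admin@inf.ed.ac.uk as described above.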

name:      Prague Czech-English Dependency Treebank, 2.0 alpha
directory: restricted/pcedt/v2alpha
group:     -
type:      text
size:      939 MB
licenser:  Charles University, Prague
licensee:  Bonnie Webber
webpage:   ?

name:      Reuters Corpus, various supporting data
directory: restricted/reuters/data
group:     reuters01
type:      text
size:      2 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      AQUAINT-2 Information-Retrieval Text Research Collection
directory: restricted/aquaint
group:     trec
type:      text
size:      2498 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (HUB-4)
directory: restricted/csr
group:     corpman
type:      speech
size:      908 MB
licenser:  LDC
licensee:  UoE
webpage:   here, here, here, here
note:      contains a mixture of LDC and proprietary data (including lattices supplied by Cambridge), hence restricted

name:      DMM German Morphological Database
directory: restricted/dmm
group:     dmm
type:      lexicon
size:      21 MB
licenser:  University of Erlangen-Nuernberg
licensee:  ILCC/HCRC
webpage:   here

name:      GALE Kickoff
directory: restricted/gale/kickoff
group:     smt
type:      text
size:      106 MB
licenser:  LDC
licensee:  UoE
webpage:   here and here

name:      GALE Phase 2 Releases 1, 2 and 3
directory: restricted/gale/GALE-P2*
group:     smt
type:      text
size:      1500 MB
licenser:  LDC
licensee:  UoE
webpage:   release 1: here and here; release 2: here and here; release 3: here and here

name:      GALE Phase 3 DevTest - Source Text, Transcripts and Translations
directory: restricted/gale/GALE-P3-DevTest-V1_0
group:     smt
type:      text
size:      5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Distillation
directory: restricted/gale/GALE-Phase3-Distillation-TrainingData-V1_0
group:     smt
type:      text
size:      1.34 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - English Translation Treebank
directory: restricted/gale/GALE-P3R1-EBNTT-Sep07
group:     smt
type:      text
size:      3.76 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Found Parallel Text
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      222.13 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Transcripts
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      70.57 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Translations
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      7.91 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 - IBM Arabic-English Word Alignment Corpus
directory: restricted/gale/Y1
group:     smt
type:      text
size:      25 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 Q3
directory: restricted/gale/GALE-Y1Q3
group:     smt
type:      text
size:      14 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 Q4
directory: restricted/gale/GALE-Y1Q4
group:     smt
type:      text
size:      81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 4 Release 1 - Transcripts V1.0
directory: restricted/gale/GALE-P4R1
group:     smt
type:      text
size:      72 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 4 Release 1 - Translations V1.0
directory: restricted/gale/P4R1
group:     smt
type:      text
size:      72 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GermaNet (German WordNet) 4.0
directory: restricted/germanet/4.0
group:     dmm
type:      lexicon
size:      11 MB
licenser:  University of Tuebingen
licensee:  ILCC/HCRC
webpage:   here

name:      GermaNet (German WordNet) 8.0
directory: restricted/germanet/8.0
group:     dmm
type:      lexicon
size:      68 MB
licenser:  University of Tuebingen
licensee:  University of Edinburgh
webpage:   here

name:      Lancaster-Oslo-Bergen Corpus of British English
directory: restricted/lob
group:     corpman
type:      text
size:      8 MB
licenser:  ?
licensee:  LTG?
webpage:   here

name:      London-Lund Corpus of Spoken English
directory: restricted/london_lund
group:     corpman
type:      text
size:      10 MB
licenser:  ?
licensee:  LTG?
webpage:   here

name:      Maptask corpora for different languages and situations
directory: restricted/maptask
group:     corpman
type:      text
size:      622 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      Medline Corpus
directory: restricted/medline
group:     umls
type:      text
size:      7602 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      NTCIR Corpora 
directory: restricted/ntcir
group:     -
type:      text
size:      48453 MB
licenser:  National Institute of Informatics, Tokyo
licensee:  Bonnie Webber
webpage:   here, here, and here

name:      NEGRA Parsed Corpus of German
directory: restricted/negra
group:     negra
type:      text
size:      55 MB
licenser:  Saarland University
licensee:  ILCC/HCRC
webpage:   here

name:      Reuters Corpus Volume 1 (English), Release 2000-11-03
directory: restricted/reuters/english
group:     reuters01
type:      text
size:      1012 MB
licenser:  NIST/Reuters
licensee:  Informatics/CSTR
webpage:   here

name:      Reuters Corpus Volume 2 (Multilingual), Release 2000-05-31
directory: restricted/reuters/multilingual
group:     reuters01
type:      text
size:      622 MB
licenser:  NIST/Reuters
licensee:  Informatics/CSTR
webpage:   here

name:      Search Engine Logs (AOL)
directory: restricted/searchengine_logs/aol
group:     querylogs
type:      text
size:      449 MB
licenser:  AOL
licensee:  freely available [but privacy concerns, hence restricted]
webpage:   ?

name:      Search Engine Logs (Excite)
directory: restricted/searchengine_logs/excite
group:     querylogs
type:      text
size:      52 MB
licenser:  Excite
licensee:  freely available [but privacy concerns, hence restricted]
webpage:   ?

name:      TIGER Parsed Corpus of German
directory: restricted/tiger
group:     negra
type:      text
size:      140 MB
licenser:  IMS Stuttgart
licensee:  ILCC/HCRC
webpage:   here

name:      TREC-9 Question Answering Track Corpus
directory: restricted/trec/trec9/question_answering
group:     trec
type:      text
size:      62 MB
licenser:  NIST
licensee:  LTG?
webpage:   here

name:      TREC Text Research Collection Vol. 4
directory: restricted/trec/text_collection4
group:     trec
type:      text
size:      443 MB
licenser:  NIST
licensee:  HCRC (but individual registration required; see Avril)
webpage:   here

name:      TREC Text Research Collection Vol. 5
directory: restricted/trec/text_collection5
group:     trec
type:      text
size:      394 MB
licenser:  NIST
licensee:  HCRC (but individual registration required; see Avril)
webpage:   here

name:      Tuebingen Partially Parsed Corpus of German, Newspaper (TüPP-D/Z) [based on TAZ corpus]
directory: restricted/tuebingen/tueppdz
group:     negra
type:      text
size:      7651 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Newspaper (TüBa-D/Z), Release 2 [based on TAZ corpus]
directory: restricted/tuebingen/tuebadz/2.0
group:     negra
type:      text
size:      185 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Newspaper (TüBa-D/Z), Release 8 [based on TAZ corpus]
directory: restricted/tuebingen/tuebadz/8.0
group:     negra
type:      text
size:      185 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Speech (TüBa-D/S), Release 1 [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebads/1.0
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Speech (TüBa-D/S), Release 2 [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebads/2.0
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of English, Speech (TüBa-E/S) [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebaes
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of Japanese, Speech (TüBa-J/S) [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebajs
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      UMLS Metathesaurus 2005AC
directory: restricted/umls
group:     umls
type:      text
size:      7780 MB
licenser:  National Library of Medicine
licensee:  ILCC/HCRC
webpage:   here


Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk
Please contact our webadmin with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh