Corpora and other Language and Speech Data under DICE

Information about NLP and speech software can be found here.
Conference and workshop proceedings can be found here.

General Information

All corpora (and other language and speech data) reside under /group/corpora, which has four subdirectories. Which subdirectory a corpus is in depends on its licensing agreement and size:

/group/corpora/public  corpora with Informatics-wide or University-wide licenses (NFS filesystem)
/group/corpora/large  very large corpora with Informatics-wide or University-wide licenses (AFS filesystem - actual location is /afs/inf.ed.ac.uk/group/corpora/large/)
/group/corpora/large2  very large corpora with Informatics-wide or University-wide licenses (AFS filesystem - actual location is /afs/inf.ed.ac.uk/group/corpora/large2/)
/group/corpora/restricted  corpora with more restrictive licenses (AFS filesystem - actual location is /afs/inf.ed.ac.uk/group/corpora/restricted/)
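
The mapping between the /group/corpora names and the AFS locations listed above can be sketched in Python. This is only an illustration, not a DICE tool: the function name afs_location and the AFS_AREAS set are hypothetical, and the sketch rewrites path strings without touching the filesystem or following symlinks.

```python
from pathlib import PurePosixPath

CORPUS_ROOT = PurePosixPath("/group/corpora")
AFS_ROOT = PurePosixPath("/afs/inf.ed.ac.uk/group/corpora")
# Subdirectories that the list above describes as living on AFS.
AFS_AREAS = {"large", "large2", "restricted"}


def afs_location(path):
    """Return the AFS location for a /group/corpora path.

    Returns None when the path is in an area listed as NFS (e.g. public).
    Raises ValueError if the path is not under /group/corpora.
    """
    relative = PurePosixPath(path).relative_to(CORPUS_ROOT)
    if not relative.parts or relative.parts[0] not in AFS_AREAS:
        return None
    return str(AFS_ROOT / relative)


print(afs_location("/group/corpora/large/english_gigaword/5.0"))
print(afs_location("/group/corpora/public/bnc/2.0"))
```

A non-None result indicates that the path is on AFS, so a valid AFS token is needed to read it.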

For AFS filesystems, you need to be authenticated in order to access the space. Before reporting an access problem, you should check that you have a valid AFS token.

In general, there should be symlinks from /group/corpora/public/... to /group/corpora/large/... for the few corpora that are in that location, to make browsing the filesystem easier. The reason for dividing the space is simply a limit on the size of single disk partitions.

For corpora with restrictive licenses, read access is limited to certain groups of users. The group names are specified in the list of restricted corpora below. If you need access to any of these corpora, please email corpus-admin@inf.ed.ac.uk.

Directory Structure

Corpora that exist in more than one version live in the same directory (the name of which is in all lowercase and identifies the corpus by name or acronym). Subdirectories identify different versions of a corpus, including annotated (or otherwise processed) versions. In this case, the unmodified version sits under `original'.

Examples:

/group/corpora/public/bnc/1.0 BNC, Version 1.0, unmodified
/group/corpora/public/bnc/2.0 BNC, Version 2.0, unmodified
/group/corpora/public/bnc/parsed_ims BNC, parsed with IMS parser
/group/corpora/public/bnc/parsed_minipar BNC, parsed with Minipar
/group/corpora/public/bllip/original BLLIP, unmodified
/group/corpora/public/bllip/parsed_minipar BLLIP, parsed with Minipar
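
The layout illustrated by these examples (area, then corpus directory, then version subdirectory) can be taken apart programmatically. The following Python sketch is an assumption-laden illustration: split_corpus_path and is_unmodified are hypothetical helper names, and the rule that a bare version number or "original" marks an unmodified copy is inferred from the examples above, not a documented guarantee.

```python
from pathlib import PurePosixPath

CORPUS_ROOT = PurePosixPath("/group/corpora")


def split_corpus_path(path):
    """Split a corpus path into (area, corpus, version).

    version is None when the path stops at the corpus directory.
    """
    parts = PurePosixPath(path).relative_to(CORPUS_ROOT).parts
    if len(parts) < 2:
        raise ValueError(f"not a corpus directory: {path}")
    area, corpus = parts[0], parts[1]
    version = "/".join(parts[2:]) or None
    return area, corpus, version


def is_unmodified(version):
    """Heuristic (assumption): 'original' or a bare number like '1.0'."""
    return version == "original" or (
        version is not None and version.replace(".", "").isdigit()
    )


area, corpus, version = split_corpus_path("/group/corpora/public/bnc/parsed_minipar")
print(area, corpus, version, is_unmodified(version))
```

Directory names that follow other conventions (for example `text` or `xml`) are treated as processed versions by this heuristic.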

Ordering Corpora

At the end of this page, you will find a list of the corpora that are installed in the DICE corpus space. We have licenses for some other corpora that are currently not installed (often due to space or licensing restrictions).

If you would like to find out whether we have a corpus that you need for your work, or to order new corpora, please email corpus-admin@inf.ed.ac.uk.

LDC Corpora

The University was a member of the Linguistic Data Consortium (LDC) in the following years: 1995, 1996, and 1998-2005. This means that we are entitled to a reduced rate for all corpora released by the LDC during these years. Please have a look at the LDC web site for a list of available corpora.

As of 2005, the University has a subscription membership for the LDC. This means that we automatically get two copies of all new corpora released by the LDC in 2005 and subsequent years. Note, however, that not all LDC corpora are installed automatically in the corpus space (due to constraints on disk space). If you want a new LDC corpus to be installed, please email corpus-admin@inf.ed.ac.uk.

If you are a corpus administrator, and you have an LDC membership account, please follow this link to find out more details about our LDC membership.

ELRA Corpora

The University is a member of ELRA for 2013. This means that we are entitled to a discounted rate for all corpora (regardless of when they were released), but only if they are ordered during 2013. In general, we do not join ELRA every year, due to very limited demand for their products. Please contact corpus-admin@inf.ed.ac.uk before ordering any items from ELRA, to find out whether renewing our membership would be cost-effective.

Conference Proceedings

We also maintain an archive of conference and workshop proceedings. These can be found at /group/corpora/public/proceedings in the corpus space. There is also a list of proceedings with a web interface for easy access at this link.

List of Corpora

All paths are relative to /group/corpora.
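
Because every entry in the list uses the same "field: value" layout and a directory relative to /group/corpora, the entries are easy to process by script. The Python sketch below is illustrative only: parse_records, size_in_mb and full_path are hypothetical helper names, the sample text copies fields from two entries in the list, and the conversion assumes 1 GB = 1024 MB.

```python
import re
from pathlib import PurePosixPath

CORPUS_ROOT = PurePosixPath("/group/corpora")
UNIT_TO_MB = {"MB": 1.0, "GB": 1024.0}  # assumption: 1 GB = 1024 MB


def parse_records(text):
    """Parse blank-line-separated 'field: value' blocks into dictionaries."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        record = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def size_in_mb(size):
    """Convert a size such as '18 GB' or '1026 MB' to megabytes, else None."""
    match = re.match(r"([\d.]+)\s*(MB|GB)\b", size)
    if match is None:
        return None
    return float(match.group(1)) * UNIT_TO_MB[match.group(2)]


def full_path(record):
    """Resolve a record's relative directory against /group/corpora."""
    return str(CORPUS_ROOT / record["directory"])


sample = """\
name:      AMI Meeting Corpus
directory: public/ami
type:      speech
size:      1026 MB

name:      ABI - Accents of the British Isles
directory: public/abi
type:      speech
size:      18 GB
"""

for rec in parse_records(sample):
    print(full_path(rec), size_in_mb(rec["size"]))
```

Normalising sizes to one unit makes it possible to sort or total entries even though the list mixes MB and GB.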

Public Corpora

These are corpora that are licensed to the University of Edinburgh, or to the School of Informatics, or are in the public domain. Access is open to all Informatics users.
	
name:      AMI Meeting Corpus
directory: public/ami
type:      speech
size:      1026 MB
licenser:  Idiap
licensee:  UoE
webpage:   here

name:      ABI - Accents of the British Isles
directory: public/abi
type:      speech
size:      18 GB
licenser:  The Speech Ark / The University of Birmingham
licensee:  UoE
webpage:   here


name:      Abstract Meaning Representation (AMR) Annotation Release 1.0
directory: public/amr/1.0
type:      text
size:      24 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2014T12


name:      ACL/DCI Corpus (includes the original Wall Street Journal corpus)
directory: public/acl_dci/original
type:      text
size:      629 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC93T1

name:      Asian Elephant Vocalizations
directory: public/elephant
type:      speech
size:      22265 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2010S05

name:      ACL/DCI Corpus, processed version
directory: public/acl_dci/processed
type:      text
size:      17 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC93T1

name:      ACE 2004, Multilingual Training Corpus
directory: public/ace/ace_mtc/2004
type:      text
size:      34 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T09

name:      ACE 2004, Time Normalization English Training Data
directory: public/ace/ace_tern
type:      text
size:      6.9 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T07

name:      ACE 2005, English SpatialML Annotations
directory: public/ace/ace_spatial
type:      text
size:      23 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T03

name:      ACE 2005, Multilingual Training Corpus
directory: public/ace/ace_mtc/2005
type:      text
size:      1617 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T06

name:      ACE-2, Version 1.0
directory: public/ace/ace2
type:      text
size:      34 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T11

name:      Datasets for Generic Relation Extraction (reACE)
directory: public/ace/reace
type:      text
size:      69 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011T08

name:      AQUAINT-2 Information Retrieval Text Research Collection
directory: public/acquaint
type:      text
size:      1069 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002T31

name:      ATIS3 (Air Travel Information Service), NIST Speech Discs 17-1.1 - 17-3.1
directory: public/atis3
type:      speech
size:      1300 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC94S19

name:      An English Dictionary of the Tamil Verb
directory: public/tamil_dictionary
type:      text
size:      0.52 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008L01

name:      Arabic Gigaword, fourth edition
directory: public/arabic_gigaword/4.0
type:      text
size:      2588 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T30

name:      Arabic Gigaword, fifth edition
directory: public/arabic_gigaword/5.0
type:      text
size:      3.2 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011T11


name:      Standard Arabic Morphological Analyzer
directory: public/arabic_morphology
type:      text
size:      5 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2010L01

name:      Arabic Broadcast News Transcripts
directory: public/arabic_news_transcripts
type:      text
size:      3.6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T20

name:      Arabic Translation Corpus, Part 1
directory: public/arabic_translation/part1
type:      text
size:      2.6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003E05

name:      Arabic Translation Corpus, Part 2
directory: public/arabic_translation/part2
type:      text
size:      3.2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003E09

name:      Arabic Newswire English Translation Collection
directory: public/arabic_translation/newswire
type:      text
size:      13 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T22

name:      Arabic Treebank, Part 1, Version 2.0
directory: public/arabic_treebank/part1v2.0
type:      text
size:      266 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T06

name:      Arabic Treebank, Part 1, Version 2.0, English Translation
directory: public/arabic_treebank/part1v2.0/translation
type:      text
size:      0.27 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T07

name:      Arabic Treebank, Part 3, Version 2.0
directory: public/arabic_treebank/part3v2.0
type:      text
size:      891 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T20

name:      Aurora Noisy TI Digits Database, Version 2.0
directory: public/aurora
type:      speech
size:      2629 MB
licenser:  ELDA
licensee:  UoE
webpage:   here

name:      Bible Corpus, 56 languages 
directory: public/bible
type:      text
size:      246 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      BBN IE/NE-tagged HUB-4 Training Transcripts
directory: public/bbn_ie_ne_tagged
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC98E11

name:      BBN Pronoun Coreference and Entity Type Corpus
directory: public/bbn_pronoun_coref
type:      text
size:      22 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T33

name:      BLLIP Corpus
directory: public/bllip/original
type:      text
size:      172 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T43

name:      BLLIP Corpus, parsed with Minipar
directory: public/bllip/parsed_minipar
type:      text
size:      293 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T43

name:      BLLIP Corpus, parsed in KAF format
directory: public/bllip/parsed_kaf
type:      text
size:      1382 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T43

name:      BLLIP Corpus, text extracted
directory: public/bllip/text
type:      text
size:      290 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T43

name:      Basic Electricity and Electronics Corpus
directory: public/bee
type:      text
size:      2 MB
licenser:  University of Pittsburgh
licensee:  freely available
webpage:   here

name:      Biomedical Information Extraction Corpus
directory: public/biomedical_ie
type:      text
size:      320 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Blog 06 Test Collection
directory: public/blogs_collection
type:      text
size:      25000 MB
licenser:  University of Glasgow
licensee:  ILCC/HCRC
webpage:   here

name:      Boston University Radio Speech Corpus
directory: public/bu_radio
type:      speech
size:      2424 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC96S36

name:      British National Corpus, Version 1.0
directory: public/bnc/1.0
type:      text
size:      2866 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, marked up in XML
directory: public/bnc/xml
type:      text
size:      815 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with Charniak parser
directory: public/bnc/parsed_charniak
type:      text
size:      419 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with IMS parser
directory: public/bnc/parsed_ims
type:      text
size:      2088 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with Minipar
directory: public/bnc/parsed_minipar
type:      text
size:      448 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, parsed with RASP parser
directory: public/bnc/parsed_rasp
type:      text
size:      3520 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, raw text without any markup
directory: public/bnc/text
type:      text
size:      579 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 1.0, various LTG data
directory: public/bnc/data
type:      text
size:      7 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 2.0 (World Edition)
directory: public/bnc/2.0
type:      text
size:      1779 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 2.0 (World Edition), indexed for IMS Corpus Workbench
directory: public/bnc/corpus_workbench
type:      text
size:      967 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      British National Corpus, Version 3.0 (XML Edition)
directory: public/bnc/3.0
type:      text
size:      4619 MB
licenser:  BNC Consortium
licensee:  ILCC/HCRC
webpage:   here

name:      Buckwalter Arabic Morphological Analyzer Version 2.0
directory: public/buckwalter
type:      lexicon
size:      4 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004L02

name:      Corpus of Spoken Dutch, CGN
directory: public/cgn
type:      speech
size:      4554 MB
licenser:  Dutch Language Union
licensee:  ILCC/HCRC
webpage:   here

name:      CCGbank, Version 1.1
directory: public/ccgbank
type:      text
size:      387 MB
licenser:  LDC
licensee:  UoE
webpage:   CCG group home page, LDC catalog entry

name:      CELEX Lexical Database, Version 2.0
directory: public/celex/2.0
type:      lexicon
size:      288 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC96L14

name:      CoNLL 2006 Shared Task Data
directory: public/conll/2006
type:      text
size:      138 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006E01, LDC2006E02, and LDC2006E02P1

name:      CoNLL 2008 Shared Task Data
directory: public/conll/2008
type:      text
size:      84 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T12

name:      CoNLL 2009 Shared Task Part 2
directory: public/conll/2009
type:      text
size:      389 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2012T04


name:      COMLEX English Syntax Corpus
directory: public/comlex/corpus
type:      text
size:      98 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC96T11

name:      COMLEX English Syntax Lexicon
directory: public/comlex/lexicon
type:      lexicon
size:      18 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC98L21

name:      CSTR TIMIT Sentence Data
directory: public/timit/cstr
type:      speech
size:      478 MB
licenser:  LDC
licensee:  UoE
webpage:   none

name:      CSTR Weather Database for Speech Synthesis
directory: public/synthesis/cstr/weather
type:      text
size:      255 MB
licenser:  CSTR
licensee:  UoE
webpage:   none

name:      CallFriend American English Speech Corpus Non-Southern Dialect
directory: public/callfriend/cf_ameng_n
type:      speech
size:      1.5 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC96S46


name:      Callhome American English, Speech
directory: public/callhome/english/speech
type:      speech
size:      1830 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC97S42

name:      Callhome Mandarin Chinese Transcripts - XML version
directory: public/callhome/chinese
type:      text
size:      9.5 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T17

name:      Callhome Spanish Dialog Act Annotation
directory: public/callhome/spanish/dialog_act
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2001T61

name:      Callhome Spanish Lexicon
directory: public/callhome/spanish/lexicon
type:      text
size:      3.1 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC96L16

name:      Callhome Spanish Transcripts
directory: public/callhome/spanish/transcripts
type:      text
size:      1.9 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC96T17

name:      Canadian Hansard
directory: public/canadian_hansard
type:      text
size:      685 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95T20

name:      CHAINS - CHAracterizing INdividual Speakers
directory: public/chains
type:      speech
size:      3.3 GB
licenser:  University College Dublin
licensee:  UoE
webpage:   here

name:      Childes Child Language Database
directory: public/childes
type:      text
size:      1266 MB
licenser:  Carnegie Mellon University
licensee:  GPL
webpage:   here

name:      Chinese English Name Entity Lists, Version 1.0
directory: public/chinese_english_ne
type:      text
size:      97 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T34

name:      Chinese Gigaword, second edition
directory: public/chinese_gigaword/2.0
type:      text
size:      1700 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T14

name:      Chinese Gigaword, fourth edition
directory: public/chinese_gigaword/4.0
type:      text
size:      2990 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T27

name:      Chinese News Translation Corpus, Part 1
directory: public/chinese_translation
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003E08

name:      Chinese Proposition Bank 1.0
directory: public/chinese_propbank/1.0
type:      text
size:      21 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T23

name:      Chinese Proposition Bank 2.0
directory: public/chinese_propbank/2.0
type:      text
size:      112 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T07

name:      Chinese Treebank, Version 2.0
directory: public/chinese_treebank/2.0
type:      text
size:      4.3 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2001T11

name:      Chinese Treebank, Version 2.0, English Translation
directory: public/chinese_treebank/2.0/translation
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002E17


name:      Chinese Treebank, Version 3.0
directory: public/chinese_treebank/3.0
type:      text
size:      14.4 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003E06

name:      Chinese Treebank, Version 3.0, English Translation
directory: public/chinese_treebank/3.0/translation
type:      text
size:      1.7 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003E07

name:      Chinese Treebank, Version 5.0
directory: public/chinese_treebank/5.0
type:      text
size:      31 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T01

name:      Chinese Treebank, Version 6.0
directory: public/chinese_treebank/6.0
type:      text
size:      115 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T36

name:      Chinese-English Translation Lexicon, Version 3.0
directory: public/chinese_english_lexicon
type:      lexicon
size:      1.4 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002L27

name:      Christine Corpus of Spoken British English
directory: public/christine
type:      text
size:      4.3 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      CMU Kids
directory: public/cmu_kids
type:      speech
size:      1.1 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC97S63

name:      Conference Proceedings from CDROMs
directory: public/proceedings
type:      text
size:      18000 MB and growing
licenser:  various
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (CSR-III Speech)
directory: public/csr/csr3/speech
type:      speech
size:      1952 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95S23

name:      Continuous Speech Recognition Corpus (CSR-III Text)
directory: public/csr/csr3/text
type:      text
size:      1791 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95T6

name:      Continuous Speech Recognition Corpus (HUB-4 Language Model)
directory: public/csr/hub4
type:      text
size:      845 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC98T31

name:      Corpus of IMDB Movie Summaries, indexed for IMS Corpus Workbench
directory: public/imdb
type:      text
size:      168 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      CSLU Kid's Speech
directory: public/cslu_kids
type:      speech
size:      12 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007S18

name:      Cytology Corpus (Alvey Project)
directory: public/cytol
type:      speech
size:      372 MB
licenser:  CSTR
licensee:  UoE
webpage:   here

name:      DARPA Communicator 2000 Dialogue Act Tagged
directory: public/darpa_communicator/2000/tagged
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T15

name:      DARPA Communicator 2000 Evaluation
directory: public/darpa_communicator/2000/evaluation
type:      speech
size:      4384 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002S56

name:      DARPA Communicator 2001 Dialogue Act Tagged
directory: public/darpa_communicator/2001/tagged
type:      text
size:      88 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T16

name:      DARPA Communicator 2001 Evaluation
directory: public/darpa_communicator/2001/evaluation
type:      speech
size:      3804 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003S01

name:      DARPA Resource Management Continuous Speech Database (RM1)
directory: public/resource_management/rm1
type:      speech
size:      387 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC93S3B

name:      DARPA Resource Management Continuous Speech Database (RM2)
directory: public/resource_management/rm2
type:      speech
size:      688 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC93S3C

name:      DCIEM Sleep Deprivation Corpus
directory: public/dciem
type:      speech
size:      7448 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC96S38

name:      DSO Corpus of Sense-Tagged English
directory: public/dso
type:      text
size:      37 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC97T12

name:      Document Understanding Conference (DUC) data, 2001-2007
directory: public/duc
type:      text
size:      108 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      Dickens Corpus, indexed for IMS Corpus Workbench
directory: public/dickens
type:      text
size:      65 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      PARC 700 Dependency Bank
directory: public/depbank
type:      text
size:      3 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Diphone Voices for Festival
directory: public/synthesis/diphone_voices
type:      speech
size:      4477 MB
licenser:  CSTR
licensee:  UoE
webpage:   here

name:      Discourse Graphbank
directory: public/discourse_graphbank
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T08

name:      Dundee Corpus of English and French Eye-movement Data
directory: public/dundee_eyemovement
type:      speech
size:      207 MB
licenser:  Department of Psychology, University of Dundee
licensee:  UoE
webpage:   none

name:      ESP Game 100k Corpus
directory: public/esp_game
type:      text and images
size:      2783 MB
licenser:  Luis von Ahn
licensee:  publicly available
webpage:   here

name:      EMILLE/CIIL Corpus
directory: public/emille
type:      text
size:      1645 MB
licenser:  ELDA
licensee:  ILCC/HCRC
webpage:   here

name:      Electromagnetic Articulograph (EMA) Data
directory: public/ema/other
type:      speech and EMA
size:      2394 MB
licenser:  QMUC/CSTR
licensee:  UoE
webpage:   here

name:      Emotional Prosody Speech and Transcripts
directory: public/emotional_speech
type:      speech
size:      2845 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002S28

name:      English Web Treebank
directory: public/english_web_treebank
type:      text
size:      104 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2012T13

name:      English Gigaword, first edition
directory: large/english_gigaword/1.0/original
type:      text
size:      3960 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T05

name:      English Gigaword, first edition, parsed with Minipar
directory: large/english_gigaword/1.0/parsed_minipar
type:      text
size:      17157 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T05

name:      English Gigaword, first edition, tokenized and tagged
directory: large/english_gigaword/1.0/tagged
type:      text
size:      22169 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T05

name:      English Gigaword, fourth edition
directory: large/english_gigaword/4.0
type:      text
size:      8223 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T13

name:      English Gigaword, fifth edition
directory: large/english_gigaword/5.0
type:      text
size:      9328 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011T07

name:      English Gigaword, annotated
directory: large/annotated_english_gigaword
type:      text
size:      172 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2012T21

name:      English Intonation in the British Isles Corpus
directory: public/ivie
type:      text
size:      2471 MB
licenser:  University of Oxford
licensee:  freely available
webpage:   here

name:      English-Arabic Parallel Treebank
directory: public/english_arabic_treebank
type:      text
size:      18 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T10

name:      English-Chinese Translation Treebank 1.0
directory: public/chinese_translation_treebank
type:      text
size:      9 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T02

name:      Enron Email Dataset
directory: public/enron/original
type:      text
size:      1646 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Enron Email Dataset, prepared for Rainbow
directory: public/enron/rainbow
type:      text
size:      290 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Enron Email Dataset, with Topic Annotations
directory: public/enron/annotations
type:      text
size:      0.2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T22

name:      Edinburgh Speech Production Facility - DoubleTalk
directory: public/ESPF/DoubleTalk
type:      speech, EMA
size:      5.5 GB
licenser:  UoE
licensee:  UoE (under CC-BY-SA 3.0)
webpage:   here

name:      European Corpus Initiative Multilingual Corpus
directory: public/eci
type:      text
size:      685 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC94T5

name:      European News Corpus
directory: public/european_news
type:      text
size:      715 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95T11

name:      European Parliament Interpretation Corpus (EPIC)
directory: public/ELRA/ELRA-S0323-EPIC-European-Parliament-Interpretation-Corpus
type:      text
size:      3.8 GB
licenser:  ELRA
licensee:  The School of Informatics
webpage:   here

name:      European Parliament Proceedings Parallel Corpus, Version 2.0
directory: public/europarl
type:      text
size:      3809 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      Extended VerbNet
directory: public/verbnet
type:      lexicon
size:      2.5 MB
licenser:  University of Colorado
licensee:  freely available
webpage:   here

name:      Extended WordNet Lexical Database, WordNet Version 2.0, Extension Version 1.1
directory: public/wordnet/xwn
type:      lexicon
size:      154 MB
licenser:  University of Texas at Dallas
licensee:  freely available
webpage:   here




name:      Factbank
directory: public/factbank
type:      text
size:      22 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T23



name:      Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
directory: large/fisher/arabic
type:      text
size:      6.9 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T04


name:      Fisher English Training Speech, Part 1, Speech
directory: large/fisher/speech
type:      speech
size:      28 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004S13

name:      Fisher English Training Speech, Part 1, Transcripts
directory: large/fisher/transcripts
type:      text
size:      275 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T19

name:      Fisher English Training Speech, Part 2, Speech
directory: large/fisher/speech
type:      speech
size:      29 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005S13

name:      Fisher English Training Speech, Part 2, Transcripts
directory: large/fisher/transcripts
type:      text
size:      279 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T19


name:      FrameNet 1.1
directory: public/framenet/1.1
type:      text
size:      1024 MB
licenser:  University of California at Berkeley
licensee:  freely available
webpage:   here

name:      FrameNet 1.3
directory: public/framenet/1.3
type:      text
size:      783 MB
licenser:  University of California at Berkeley
licensee:  freely available
webpage:   here

name:      Frankfurter Rundschau corpus (part of ECI), tokenized and tagged
directory: public/rundschau
type:      text
size:      1074 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC94T5

name:      French Treebank, Version 1.4
directory: public/french_treebank
type:      text
size:      147 MB
licenser:  LLF, Universite Paris 7
licensee:  UoE
webpage:   here

name:      French Gigaword, second edition
directory: public/french_gigaword/2.0
type:      text
size:      1790 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T28

name:      French Gigaword, third edition
directory: public/french_gigaword/3
type:      text
size:      2 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011T10

name:      GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
directory: public/gale/gale_ch_nw_wb_word_align_p3/
type:      text
size:      29 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2012T24

name:      GALE Phase 1 Arabic Blog Parallel Text
directory: public/gale/galep1_ara_bl_ptxt/
type:      text
size:      5.7 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T02

name:      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
directory: public/gale/ara_bn_ptext
type:      text
size:      36 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T24

name:      GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
directory: public/gale/ar_bn_ptxt_p2/
type:      text
size:      2.8 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T09

name:      GALE Phase 1 Chinese Blog Parallel Text
directory: public/gale/gale_p1_ch_blog
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T06

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
directory: public/gale/ch_bn_ptxt
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T23

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
directory: public/gale/ch_bn_ptxt_p2
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T08

name:      GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
directory: public/gale/ch_bn_ptxt_p3
type:      text
size:      4.4 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T18

name:      GALE Phase 1 Distillation Training
directory: public/gale/galep1_distill_tr
type:      text
size:      31 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T20

name:      German Law Corpus, indexed for IMS Corpus Workbench
directory: public/german_law
type:      text
size:      40 MB
licenser:  public domain
licensee:  none
webpage:   here

name:      GlobalPhone
directory: public/global_phone
type:      speech
size:      18000 MB
licenser:  ELDA
licensee:  UoE
webpage:   GlobalPhone

name:      Google Book Corpus
directory: public/google_books
type:      text
size:      8119 MB
licenser:  LDC
licensee:  ILCC/HCRC
webpage:   n/a

name:      Google n-Gram Corpus
directory: public/google_ngrams
type:      text
size:      25000 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T13

name:      Google Syntactic N-grams (derived from the Google English Books collection)
directory: large/google-syntactic-ngrams
type:      dependency tree fragments
size:      318 GB
licenser:  Google (Creative Commons Attribution-Non Commercial ShareAlike 3.0 Unported License)
licensee:  UoE
webpage:   here

name:      Gulf Arabic Conversational Telephone Speech, Transcripts
directory: public/arabic_telephone
type:      text
size:      11 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T15

name:      HUB-5 English Evaluation 1997
directory: public/hub5/1997
type:      speech
size:      593 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002S23

name:      HUB-5 English Evaluation 1998
directory: public/hub5/1998
type:      speech + text
size:      607 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002S10 and LDC2003T02

name:      HUB-5 English Evaluation 2000
directory: public/hub5/2000
type:      speech + text
size:      617 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002S09 and LDC2002T43

name:      HUB-5 English Evaluation 2001
directory: public/hub5/2001
type:      speech
size:      2.1 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002S13

name:      HARD 2004 Topics and Annotations
directory: public/hard
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T29

name:      Hebrew Treebank
directory: public/hebrew_treebank
type:      text
size:      20 MB
licenser:  Technion
licensee:  public domain
webpage:   here

name:      Hindi Wordnet
directory: public/hindi_wordnet
type:      text
size:      19 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008L02

name:      Hong Kong Hansard Parallel Text, Alignments
directory: public/hong_kong_hansard/alignments
type:      text
size:      91 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002E19

name:      Hong Kong Hansard Parallel Text, Text
directory: public/hong_kong_hansard/text
type:      text
size:      110 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T50

name:      Hong Kong Laws Parallel Text
directory: public/hong_kong_laws
type:      text
size:      75 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T47

name:      Hong Kong News Parallel Text, Alignments
directory: public/hong_kong_news/alignments
type:      text
size:      107 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002E16

name:      Hong Kong News Parallel Text, Text
directory: public/hong_kong_news/text
type:      text
size:      81 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T46

name:      HyTER Networks of Selected OpenMT08/09 Sentences
directory: public/hong_kong_news/text
type:      text
size:      369 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2014T09

name:      ImageClef 2015 Image Annotation Track
directory: public/imageclef
type:      text and images
size:      47326 MB
licenser:  University of Valencia
licensee:  Informatics
webpage:   here

name:      ILE: Italian LExicon
directory: public/italian_lexicon
type:      text
size:      20 MB
licenser:  ELRA
licensee:  UoE
webpage:   here

name:      ICSI Meeting Speech
directory: public/icsi_meeting/speech
type:      speech
size:      33400 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004S02

name:      ICSI Meeting Transcripts
directory: public/icsi_meeting/transcripts
type:      text
size:      3.51 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T04

name:      IMS Corpus Workbench (corpus registry files only)
directory: public/corpus_workbench
type:      text/speech
size:      0 MB
licenser:  IMS Stuttgart
licensee:  ILCC/HCRC
webpage:   here

name:      ISI Chinese-English Automatically Extracted Parallel Text
directory: public/isi_chi_eng_par_txt
type:      text
size:      206 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T09

name:      ISL Meeting Speech
directory: public/isl_meeting/speech
type:      speech
size:      5975 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004S05

name:      ISL Meeting Transcripts
directory: public/isl_meeting/transcripts
type:      text
size:      1.81 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T10

name:      Instruction-based Learning for Mobile Robots Corpus
directory: public/ibl
type:      speech
size:      123 MB
licenser:  University of Edinburgh/University of Plymouth
licensee:  freely available
webpage:   here

name:      KAIST Korean Speech Database
directory: public/kaist
type:      speech
size:      3711 MB
licenser:  The Korean Advanced Institute of Science and Technology
licensee:  UoE
webpage:   none

name:      Korean Broadcast News Transcripts 
directory: public/korean_news_transcripts
type:      text
size:      1.4 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T14

name:      Korean Propbank
directory: public/korean_propbank
type:      text
size:      24 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T03

name:      Korean Treebank, Version 1.0
directory: public/korean_treebank/1.0
type:      text
size:      6.9 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002T26

name:      Korean Treebank, Version 2.0
directory: public/korean_treebank/2.0
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T09

name:      Lancaster Corpus of Mandarin (LCMC)
directory: public/lcmc
type:      text
size:      46 MB
licenser:  ELDA
licensee:  UoE
webpage:   here

name:      Levantine Arabic QT Training Data Set 5, Transcripts
directory: public/arabic_qt_data
type:      text
size:      27 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T07

name:      Lucy Corpus of Written British English
directory: public/lucy
type:      text
size:      4.7 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Rich Transcription (RT) Evaluation Project datasets
directory: public/rich_transcription
type:      speech + text
size:      various - see separate entry per corpus
licenser:  LDC
licensee:  UoE
webpage:   NIST page describing RT tasks from 2002 onwards

name:      MDE RT-02 Rich Transcription Broadcast News and Conversational Telephone Speech 2002
directory: rich_transcription/rt-02/train/speech
type:      speech
size:      815 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004S11

name:      MDE RT-03 Training Data, Speech
directory: rich_transcription/rt-03/train/speech
type:      speech
size:      5256 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004S08

name:      MDE RT-03 Training Data, Text and Annotations
directory: rich_transcription/rt-03/train/text
type:      text
size:      723 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T12

name:      2003 NIST Rich Transcription Evaluation Data
directory: rich_transcription/rt-03/eval/speech
type:      speech
size:      2.16 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007S10

name:      MDE RT-04 Training Data, Speech
directory: rich_transcription/rt-04/train/speech
type:      speech
size:      4829 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005S16

name:      MDE RT-04 Training Data, Text and Annotations
directory: rich_transcription/rt-04/train/text
type:      text
size:      567 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T24

name:      MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE)
directory: public/mandarin_transcripts/hub4-ne
type:      text
size:      2.35 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T19

name:      MMA_2012 Multiple Microphone Array corpus
directory: large/2012_MMA
type:      speech
size:      115 GB
licenser:  UoE
licensee:  UoE
webpage:   here

name:      MOCHA Electromagnetic Articulograph (EMA) Corpus
directory: public/ema/mocha
type:      speech
size:      2221 MB
licenser:  QMUC/CSTR
licensee:  UoE
webpage:   here

name:      MRC Psycholinguistic Database
directory: public/mrc
type:      lexicon
size:      11 MB
licenser:  MRC
licensee:  freely available
webpage:   here

name:      Machine-readable Spoken English Corpus
directory: public/marsec
type:      speech
size:      2 MB
licenser:  Reading University
licensee:  UoE
webpage:   here

name:      Macrophone: American English Segment of the Polyphone Corpus
directory: public/macrophone
type:      speech
size:      3809 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC94S21

name:      Mandarin Transcripts (HUB-5, 2001)
directory: public/mandarin_transcripts/hub5
type:      text
size:      0.2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T01

name:      Mandarin Transcripts, HKUST Telephone Data, Part 1
directory: public/mandarin_transcripts/hkust
type:      text
size:      11 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T32

name:      Maptask Corpus
directory: public/maptask
type:      speech
size:      13665 MB
licenser:  LDC and UoE/LDC
licensee:  UoE
webpage:   Maptask home page, LDC catalog entry

name:      Mawukakan Lexicon
directory: public/mawukakan_lexicon
type:      lexicon
size:      4 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005L01

name:      MC-WSJ-AV - multichannel Wall Street Journal audiovisual
directory: public/MC-WSJ-AV
type:      speech
size:      11 GB
licenser:  UoE
licensee:  UoE
webpage:   here

name:      Message Understanding Conference (MUC) 6
directory: public/muc/muc6
type:      text
size:      10 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T13

name:      Message Understanding Conference (MUC) 6, Additional News Text 
directory: public/muc/muc6/additional_text
type:      text
size:      0.67 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC96T10

name:      Message Understanding Conference (MUC) 7
directory: public/muc/muc7
type:      text
size:      45.7 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC97E4

name:      MT08 - NIST Open Machine Translation 2008 Evaluation data
directory: public/mt08_shared_data
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2010T01

name:      Multext East
directory: public/multext/east/3.0
type:      text
size:      295 MB
licenser:  Jozef Stefan Institute, Ljubljana
licensee:  ILCC/HCRC
webpage:   here

name:      Multext East
directory: public/multext/east/4.0
type:      text
size:      341 MB
licenser:  Jozef Stefan Institute, Ljubljana
licensee:  ILCC/HCRC
webpage:   here

name:      Multext JOC
directory: public/multext/joc
type:      text
size:      122 MB
licenser:  ELDA
licensee:  ILCC/HCRC
webpage:   here

name:      Multilingual Corpora for Cooperation
directory: public/mlcc
type:      text
size:      1223 MB
licenser:  internal
licensee:  internal
webpage:   here

name:      Multilingual Semcor, Version 1.1
directory: public/semcor/multisemcor
type:      text
size:      142 MB
licenser:  ITC/IRST
licensee:  University of Edinburgh
webpage:   here

name:      Multiple-Translation Arabic Corpus, Part 1
directory: public/mt_arabic/part1
type:      text
size:      5.0 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002E54

name:      Multiple-Translation Arabic Corpus, Part 2
directory: public/mt_arabic/part2
type:      text
size:      2.5 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005T05

name:      Multiple-Translation Chinese Corpus, Part 1, Version 1.0
directory: public/mt_chinese/part1/1.0
type:      text
size:      4.8 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002T01

name:      Multiple-Translation Chinese Corpus, Part 1, Version 2.0
directory: public/mt_chinese/part1/2.0
type:      text
size:      2.8 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002E53

name:      Multiple-Translation Chinese Corpus, Part 2
directory: public/mt_chinese/part2
type:      text
size:      3.5 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T17

name:      Multiple-Translation Chinese Corpus, Part 3
directory: public/mt_chinese/part3
type:      text
size:      1.1 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003E04

name:      Multiple-Translation Chinese Corpus, Part 4
directory: public/mt_chinese/part4
type:      text
size:      5.2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T04

name:      News Spike Corpus
directory: public/news_spike
type:      text
size:      800 MB
licenser:  University of Washington
licensee:  freely available
webpage:   here

name:      NPS Internet Chatroom Conversations, Release 1.0
directory: public/nps_chat
type:      text
size:      7 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2010T05

name:      NIST Meeting Pilot Corpus Transcripts and Metadata
directory: public/nist_meeting_pilot
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T13

name:      NIST Speaker Recognition Evaluation 2002
directory: large2/nist_speaker_recognition/2002
type:      speech
size:      7.1 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004S04

name:      NIST Speaker Recognition Evaluation 2004
directory: large2/nist_speaker_recognition/2004
type:      speech
size:      23 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006S44

name:      NIST Speaker Recognition Evaluation 2005 - training data
directory: large2/nist_speaker_recognition/2005
type:      speech
size:      22 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011S01

name:      NIST Speaker Recognition Evaluation 2005 - test data
directory: large2/nist_speaker_recognition/2005
type:      speech
size:      29 GB 
licenser:  LDC
licensee:  UoE
webpage:   LDC2011S04

name:      NIST Speaker Recognition Evaluation 2006 - training data
directory: large2/nist_speaker_recognition/2006
type:      speech
size:      29 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011S09

name:      NIST Speaker Recognition Evaluation (SRE) 2008 - training set part 1
directory: large2/nist_speaker_recognition/2008
type:      speech
size:      36 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011S05

name:      NIST Speaker Recognition Evaluation (SRE) 2008 - test set
directory: large2/nist_speaker_recognition/2008
type:      speech
size:      39 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011S08

name:      NIST TI Digits
directory: public/tidigits
type:      speech
size:      786 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC93S10

name:      NTIMIT Acoustic-Phonetic Continuous Speech Corpus
directory: public/timit/ntimit
type:      speech
size:      1146 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC93S2

name:      New York Times Annotated Corpus
directory: public/nyt_annotated
type:      text
size:      3202 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T19

name:      NYNEX Phonebook
directory: public/phonebook
type:      speech
size:      1400 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Newsgroup Corpus
directory: public/newsgroups
type:      text
size:      55 MB
licenser:  public domain
licensee:  none
webpage:   various newsgroups

name:      NomBank
directory: public/nombank
type:      text
size:      56 MB
licenser:  NYU
licensee:  none
webpage:   here

name:      North American News Text Corpus
directory: public/american_news/original
type:      text
size:      2342 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95T21

name:      North American Newstext Corpus, parsed with Minipar
directory: public/american_news/parsed_minipar
type:      text
size:      3392 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95T21

name:      OntoNotes Release 1.0
directory: public/ontonotes/1.0
type:      text
size:      750 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007T21

name:      OntoNotes Release 2.0
directory: public/ontonotes/2.0
type:      text
size:      1299 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008T04

name:      OntoNotes Release 3.0
directory: public/ontonotes/3.0
type:      text
size:      444 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T24

name:      OntoNotes Release 4.0
directory: public/ontonotes/4.0
type:      text
size:      764 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011T03

name:      OntoNotes Release 5.0
directory: public/ontonotes/5.0
type:      text
size:      2366 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2013T19

name:      OHSUMED Corpus (also used for the TREC 9 Filtering Track)
directory: public/ohsumed
type:      text
size:      1176 MB
licenser:  NIST
licensee:  freely available
webpage:   here

name:      PASCAL Syntax Induction Challenge Training and Development  
directory: public/pascal_syntax_challenge
type:      text
size:      102 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   LDC2012E41

name:      Penn Discourse Treebank, Version 1.0
directory: public/penn_discourse_treebank/1.0
type:      text
size:      10 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   here

name:      Penn Discourse Treebank, Version 2.0
directory: public/penn_discourse_treebank/2.0
type:      text
size:      38 MB
licenser:  University of Pennsylvania
licensee:  UoE
webpage:   here

name:      Penn Treebank, Version 2.0
directory: public/penn_treebank/2.0
type:      text
size:      655 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95T7

name:      Penn Treebank, Version 3.0
directory: public/penn_treebank/3.0
type:      text
size:      256 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC99T42

name:      PF-STAR British English Children's Speech Corpus, Version 1.1
directory: public/pfstar/pfstar_british_english_children
type:      speech
size:      4.2 GB
licenser:  The Speech Ark
licensee:  UoE
webpage:   here

name:      PF-STAR British English Children's EMOTIONAL Speech Corpus, Version 1.0
directory: public/pfstar/pfstar_british_english_children_emotional
type:      speech
size:      5.1 GB
licenser:  The Speech Ark
licensee:  UoE
webpage:   here

name:      Prague Czech-English Dependency Treebank, Version 1.0
directory: public/prague_treebank/
type:      text
size:      587 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T25

name:      Proposition Bank, Version 1.0
directory: public/propbank
type:      text
size:      20 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004T14

name:      RST Discourse Treebank
directory: public/rst_treebank
type:      text
size:      26 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002T07

name:      Research Cyc
directory: public/cyc
type:      text
size:      4118 MB
licenser:  Cycorp
licensee:  UoE
webpage:   here

name:      Reuters Text Categorization Corpus 21578
directory: public/reuters/21578
type:      text
size:      28 MB
licenser:  Reuters
licensee:  freely available
webpage:   here

name:      Roget's Thesaurus from 1911
directory: public/roget
type:      text
size:      12 MB
licenser:  public domain
licensee:  freely available
webpage:   here

name:      Russian-English Computer Security Parallel Text
directory: public/rus_eng_compsec_para
type:      text
size:      1.6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2012T23

name:      Santa Barbara Corpus of Spoken American English - Parts I to IV
directory: public/santa_barbara
type:      speech
size:      6.7 GB
licenser:  University of California
licensee:  UoE
webpage:   here

name:      SALSA Corpus
directory: public/salsa
type:      text
size:      251 MB
licenser:  Saarland University
licensee:  UoE
webpage:   here

name:      SAID (Syntactically Annotated Idiom Dataset)
directory: public/said
type:      text
size:      3 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2003T10

name:      SOLE Project Corpus
directory: public/synthesis/cstr/sole
type:      speech
size:      895 MB
licenser:  HCRC
licensee:  UoE
webpage:   here

name:      Search Engine Logs (Alltheweb, Excite, Altavista)
directory: public/searchengine_logs
type:      text
size:      440 MB
licenser:  Jim Jansen, Penn State University
licensee:  ILCC/HCRC
webpage:   ?

name:      Semcor Semantically Annotated Corpus, Version 1.6
directory: public/semcor/1.6
type:      text
size:      39 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Semcor Semantically Annotated Corpus, Version 2.0
directory: public/semcor/2.0
type:      text
size:      34 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Sinorama Chinese English Parallel Text
directory: public/sinorama
type:      text
size:      64 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Spanish Broadcast News
directory: public/spanish_broadcast_news
type:      speech
size:      5200 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC98S74

name:      Spanish Gigaword, First Edition
directory: public/spanish_gigaword/1.0
type:      text
size:      1775 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T12

name:      Spanish Gigaword, Second Edition
directory: public/spanish_gigaword/2.0
type:      text
size:      2679 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2011T12

name:      Spanish Newswire, vols 1 and 2
directory: public/spanish_newswire
type:      text
size:      556 MB + 624 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95T9, LDC99T41

name:      Spanish Treebank
directory: public/spanish_treebank
type:      text
size:      8 MB
licenser:  University of Barcelona
licensee:  freely available
webpage:   here

name:      Susanne Corpus of Written American English, Version 1.0
directory: public/susanne/1.0
type:      text
size:      5 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Susanne Corpus of Written American English, Version 5.0
directory: public/susanne/5.0
type:      text
size:      6 MB
licenser:  University of Sussex
licensee:  freely available
webpage:   here

name:      Switchboard Corpus, NXT Annotations
directory: large/switchboard/nxt
type:      speech
size:      1230 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T26

name:      Switchboard 1 Telephone Speech Corpus, Release 2
directory: large/switchboard/switchboard1
type:      speech
size:      1485 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC97S62

name:      Switchboard 2 Telephone Speech Corpus, Phases 1-3
directory: large/switchboard/switchboard2
type:      speech
size:      50643 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC98S75 and LDC99S79 and LDC2002S06

name:      Switchboard Cellular Part 1, Speech Files for Speaker Identification
directory: large/switchboard/cellular/part1/audio
type:      speech
size:      7.1 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2001S13

name:      Switchboard Cellular Telephone Speech Corpus, Part 1, Audio
directory: large/switchboard/cellular/part1/audio
type:      speech
size:      1.4 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2001S15

name:      Switchboard Cellular Telephone Speech Corpus, Part 1, Transcripts
directory: large/switchboard/cellular/part1/transcripts
type:      text
size:      2 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2001T14

name:      Switchboard Cellular Telephone Speech Corpus, Part 2, Audio
directory: large/switchboard/cellular/part2/audio
type:      speech
size:      11364 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2004S07

name:      TORGO Database of Dysarthric Articulation
directory: public/torgo
type:      speech
size:      15234 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2012S02

name:      TAC 2008 Data
directory: public/tac/2008/
type:      text
size:      34 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      TAC 2009 Data
directory: public/tac/2009/
type:      text
size:      226 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      TAC 2010 Data
directory: public/tac/2010/
type:      text
size:      12 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      TC-STAR (a large collection of several sub-corpora)
directory: public/ELRA/TC-STAR
type:      text and speech
size:      83 GB (current size)
licenser:  ELRA
licensee:  The School of Informatics
webpage:   here

name:      TDT Pilot Corpus
directory: public/tdt/tdt2_pilot
type:      text
size:      53 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC98T25

name:      TDT2 Careful Transcription Text
directory: public/tdt/tdt2_careful_text
type:      text
size:      2.4 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000T44

name:      TDT2 Careful Transcription Audio
directory: public/tdt/tdt2_careful_audio
type:      speech
size:      1077 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2000S92

name:      TDT2 Multilanguage Text Corpus, Version 4.0
directory: public/tdt/tdt2_multilanguage
type:      text
size:      623 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2001T57

name:      TDT3 Multilanguage Text Corpus, Version 2.0
directory: public/tdt/tdt2_multilanguage
type:      text
size:      367 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2001T58

name:      TDT4 Multilingual Broadcast News Speech Corpus
directory: public/tdt/tdt4_multilingual
type:      speech
size:      xxx GB
licenser:  LDC
licensee:  UoE
webpage:   LDC2005S11

name:      TDT5 Multilingual Text
directory: public/tdt/tdt5_multilingual_text
type:      text
size:      1200 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T18

name:      TDT5 Topics and Annotations
directory: public/tdt/tdt5_topics_and_annotations
type:      text
size:      80 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T19

name:      TIMIT Acoustic-Phonetic Continuous Speech Corpus
directory: public/timit/original
type:      speech
size:      668 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC93S1

name:      TRECVID 2003 Keyframes & Transcripts
directory: public/trec/trecvid/2003
type:      video
size:      3570 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007V02

name:      TRECVID 2005 Keyframes & Transcripts
directory: public/trec/trecvid/2005
type:      video
size:      3283 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2007V01

name:      Tageszeitung (TAZ) Corpus
directory: public/taz
type:      text
size:      1439 MB
licenser:  Contrapress Media GmbH
licensee:  ILCC/HCRC
webpage:   here

name:      Talbanken05 Swedish Treebank
directory: public/talbanken
type:      text
size:      144 MB
licenser:  University of Växjö and University of Lund 
licensee:  freely available
webpage:   here

name:      Timebank 1.2
directory: public/timebank
type:      text
size:      6 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006T08

name:      TITML - Tokyo Institute of Technology Multilingual Speech Corpus (currently we have: Indonesian, Icelandic)
directory: public/tokyo_multilingual
type:      speech
size:      2.5 GB
licenser:  Tokyo Institute of Technology
licensee:  UoE
webpage:   here

name:      ToBI Guidelines and Examples
directory: public/tobi_course
type:      text
size:      19 MB
licenser:  Ohio State University 
licensee:  UoE
webpage:   here

name:      Translanguage English Database (TED), Speech
directory: public/ted/speech
type:      speech
size:      2903 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002S04

name:      Translanguage English Database (TED), Transcripts
directory: public/ted/transcripts
type:      text
size:      1.3 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2002T03

name:      UASPEECH
directory: public/UASPEECH
type:      speech
size:      15 GB
licenser:  University of Illinois
licensee:  UoE
webpage:   here

name:      Ummah Arabic English Parallel News Text
directory: public/ummah
type:      text
size:      6.6 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Underspecified Rhetorical Markup Language (URML) Corpus, also known as the Potsdam Commentary Corpus
directory: public/urml
type:      text
size:      1.7 MB
licenser:  University of Potsdam
licensee:  HCRC/ILCC
webpage:   here

name:      UKWAC (Web as Corpus) English, parsed version
directory: public/wac/ukwac
type:      text
size:      31000 MB
licenser:  University of Bologna
licensee:  freely available
webpage:   here

name:      UKWAC (Web as Corpus) English, dependency-parsed version
directory: public/wac/ukwac_dep
type:      text
size:      16000 MB
licenser:  University of Bologna
licensee:  freely available
webpage:   here

name:      Unified Linguistic Annotation Text Collection
directory: public/unified_annotation
type:      text
size:      363 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2009T07

name:      Wackypedia EN 1.0
directory: public/wac/wackypedia_en
type:      text
size:      6000 MB
licenser:  University of Bologna
licensee:  freely available
webpage:   here

name:      WSJ0 complete (also known as CSR-I)
directory: public/wsj/wsj0
type:      speech
size:      8.9 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC93S6A

name:      WSJ1 complete (also known as CSR-II)
directory: public/wsj/wsj1
type:      speech
size:      18 GB
licenser:  LDC
licensee:  UoE
webpage:   LDC94S13A

name:      WSJCAM0 Cambridge Read News
directory: public/wsjcam0/original
type:      speech
size:      3848 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95S24

name:      WSJCAM0 Cambridge Read News, processed data
directory: public/wsjcam0/data
type:      speech
size:      13571 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC95S24

name:      Wikipedia Corpus (raw data, dated 2009-06-18)
directory: large/wikipedia/raw
type:      text
size:      22797 MB
licenser:  various
licensee:  freely available
webpage:   here

name:      Wikipedia Corpus (INEX 2006 Corpus)
directory: large/wikipedia/inex
type:      text
size:      4959 MB
licenser:  various
licensee:  ILCC/HCRC
webpage:   here

name:      Wikipedia Corpus (INEX 2006 Corpus), Question Answering Version
directory: large/wikipedia/inex_qa
type:      text
size:      5143 MB
licenser:  various
licensee:  freely available
webpage:   here

name:      Wikipedia Corpus, Tagged and Cleaned
directory: large/wikipedia/tagged_cleaned
type:      text
size:      51866 MB
licenser:  various
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 1.6
directory: public/wordnet/1.6
type:      lexicon
size:      40 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 1.7.1
directory: public/wordnet/1.7.1
type:      lexicon
size:      40 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 2.0
directory: public/wordnet/2.0
type:      lexicon
size:      41 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 2.1
directory: public/wordnet/2.1
type:      lexicon
size:      38 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      WordNet Lexical Database, Version 3.0, with sense maps and standoff annotation
directory: public/wordnet/2.1
type:      lexicon
size:      92 MB
licenser:  Princeton University
licensee:  freely available
webpage:   here

name:      Wordlists for various languages
directory: public/wordlists
type:      lexicon
size:      2 MB
licenser:  n/a
licensee:  freely available
webpage:   n/a

name:      Global Yoruba Lexical Database 1.0
directory: public/yoruba
type:      text
size:      183 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2008L03

name:      Xinhua Chinese English Parallel News Text, Version 1.0 beta
directory: public/xinhua
type:      text
size:      40 MB
licenser:  LDC
licensee:  UoE
webpage:   here

Restricted Corpora

These are corpora which are licensed to a particular institute, project, or group of individuals. Access is limited to specific Unix groups consisting of the appropriate set of users.
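
As a rough first check before requesting access, a user could compare their own Unix groups with the group named in a corpus entry. The following is a minimal sketch only: the function names are hypothetical, and it assumes the group names in the list (for example smt or trec) are visible as ordinary Unix groups on the machine in use. It cannot confirm access by itself, since the file server makes the final decision and AFS space also requires a valid token.

```python
import grp
import os


def groups_of_current_user():
    """Return the names of all Unix groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # Numeric group ID without a name entry; nothing to compare against.
            continue
    return names


def is_member_of_corpus_group(group, member_groups):
    """Check membership in the group listed for a restricted corpus.

    A group of "-" means no group is recorded in the list, so membership
    cannot be confirmed this way.
    """
    return group != "-" and group in member_groups


if __name__ == "__main__":
    mine = groups_of_current_user()
    for wanted in ("smt", "trec", "-"):
        print(wanted, is_member_of_corpus_group(wanted, mine))
```

If the check fails for a corpus you need, the route to access remains an email to corpus-admin@inf.ed.ac.uk.
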

name:      Prague Czech-English Dependency Treebank, 2.0 alpha
directory: restricted/pcedt/v2alpha
group:     -
type:      text
size:      939 MB
licenser:  Charles University, Prague
licensee:  Bonnie Webber
webpage:   ?

name:      Reuters Corpus, various supporting data
directory: restricted/reuters/data
group:     reuters01
type:      text
size:      2 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      AQUAINT-2 Information-Retrieval Text Research Collection
directory: restricted/aquaint
group:     trec
type:      text
size:      2498 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      Continuous Speech Recognition Corpus (HUB-4)
directory: restricted/csr
group:     corpman
type:      speech
size:      908 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC97T22, LDC98E10, LDC98T28, LDC2000S86
note:      contains a mixture of LDC and proprietary data (including lattices supplied by Cambridge), thus restricted

name:      DMM German Morphological Database
directory: restricted/dmm
group:     dmm
type:      lexicon
size:      21 MB
licenser:  University of Erlangen-Nuernberg
licensee:  ILCC/HCRC
webpage:   here

name:      GALE Kickoff
directory: restricted/gale/kickoff
group:     smt
type:      text
size:      106 MB
licenser:  LDC
licensee:  UoE
webpage:   LDC2006G01 and LDC2006G02

name:      GALE Phase 2 Releases 1, 2 and 3
directory: restricted/gale/GALE-P2*
group:     smt
type:      text
size:      1500 MB
licenser:  LDC
licensee:  UoE
webpage:   release 1: LDC2007E86 and LDC2007E87; release 2: LDC2007E86 and LDC2007E87; release 3: LDC2007E86 and LDC2007E87

name:      GALE Phase 3 DevTest - Source Text, Transcripts and Translations
directory: restricted/gale/GALE-P3-DevTest-V1_0
group:     smt
type:      text
size:      5 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Distillation
directory: restricted/gale/GALE-Phase3-Distillation-TrainingData-V1_0
group:     smt
type:      text
size:      1.34 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - English Translation Treebank
directory: restricted/gale/GALE-P3R1-EBNTT-Sep07
group:     smt
type:      text
size:      3.76 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Found Parallel Text
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      222.13 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Transcripts
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      70.57 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 1 - Translations
directory: restricted/gale/GALE-P3R1
group:     smt
type:      text
size:      7.91 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 - IBM Arabic-English Word Alignment Corpus
directory: restricted/gale/Y1
group:     smt
type:      text
size:      25 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 Q3
directory: restricted/gale/GALE-Y1Q3
group:     smt
type:      text
size:      14 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Y1 Q4
directory: restricted/gale/GALE-Y1Q4
group:     smt
type:      text
size:      81 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 4 Release 1 - Transcripts V1.0
directory: restricted/gale/GALE-P4R1
group:     smt
type:      text
size:      72 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 4 Release 1 - Translations V1.0
directory: restricted/gale/P4R1
group:     smt
type:      text
size:      72 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - MTPlus Pilot
directory: restricted/gale/GALE-P3-MTPlus_Pilot
group:     smt
type:      text
size:      0.86 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - Transcripts
directory: restricted/gale/GALE-P3R2/transcription
group:     smt
type:      text
size:      117 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GALE Phase 3 Release 2 - Translations
directory: restricted/gale/GALE-P3R2/translation
group:     smt
type:      text
size:      11.8 MB
licenser:  LDC
licensee:  UoE
webpage:   here

name:      GermaNet (German WordNet) 4.0
directory: restricted/germanet/4.0
group:     dmm
type:      lexicon
size:      11 MB
licenser:  University of Tuebingen
licensee:  ILCC/HCRC
webpage:   here

name:      GermaNet (German WordNet) 8.0
directory: restricted/germanet/8.0
group:     dmm
type:      lexicon
size:      68 MB
licenser:  University of Tuebingen
licensee:  University of Edinburgh
webpage:   here

name:      Lancaster-Oslo-Bergen Corpus of British English
directory: restricted/lob
group:     corpman
type:      text
size:      8 MB
licenser:  ?
licensee:  LTG?
webpage:   here

name:      London-Lund Corpus of Spoken English
directory: restricted/london_lund
group:     corpman
type:      text
size:      10 MB
licenser:  ?
licensee:  LTG?
webpage:   here

name:      Maptask corpora for different languages and situations
directory: restricted/maptask
group:     corpman
type:      text
size:      622 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      Medline Corpus
directory: restricted/medline
group:     umls
type:      text
size:      7602 MB
licenser:  ?
licensee:  LTG?
webpage:   ?

name:      NTCIR Corpora 
directory: restricted/ntcir
group:     -
type:      text
size:      48453 MB
licenser:  National Institute of Informatics, Tokyo
licensee:  Bonnie Webber
webpage:   here, LDC2008E48, and LDC2006E108

name:      NEGRA Parsed Corpus of German
directory: restricted/negra
group:     negra
type:      text
size:      55 MB
licenser:  Saarland University
licensee:  ILCC/HCRC
webpage:   here

name:      Reuters Corpus Volume 1 (English), Release 2000-11-03
directory: restricted/reuters/english
group:     reuters01
type:      text
size:      1012 MB
licenser:  NIST/Reuters
licensee:  Informatics/CSTR
webpage:   here

name:      Reuters Corpus Volume 2 (Multilingual), Release 2000-05-31
directory: restricted/reuters/multilingual
group:     reuters01
type:      text
size:      622 MB
licenser:  NIST/Reuters
licensee:  Informatics/CSTR
webpage:   here

name:      Search Engine Logs (AOL)
directory: restricted/searchengine_logs/aol
group:     querylogs
type:      text
size:      449 MB
licenser:  AOL
licensee:  freely available [but privacy concerns, hence restricted]
webpage:   ?

name:      Search Engine Logs (Excite)
directory: restricted/searchengine_logs/excite
group:     querylogs
type:      text
size:      52 MB
licenser:  Excite
licensee:  freely available [but privacy concerns, hence restricted]
webpage:   ?

name:      TIGER Parsed Corpus of German
directory: restricted/tiger
group:     negra
type:      text
size:      140 MB
licenser:  IMS Stuttgart
licensee:  ILCC/HCRC
webpage:   here

name:      TREC-9 Question Answering Track Corpus
directory: restricted/trec/trec9/question_answering
group:     trec
type:      text
size:      62 MB
licenser:  NIST
licensee:  LTG?
webpage:   here

name:      TREC Text Research Collection Vol. 4
directory: restricted/trec/text_collection4
group:     trec
type:      text
size:      443 MB
licenser:  NIST
licensee:  HCRC (but individual registration required; see Avril)
webpage:   here

name:      TREC Text Research Collection Vol. 5
directory: restricted/trec/text_collection5
group:     trec
type:      text
size:      394 MB
licenser:  NIST
licensee:  HCRC (but individual registration required; see Avril)
webpage:   here

name:      Tuebingen Partially Parsed Corpus of German, Newspaper (TüPP-D/Z) [based on TAZ corpus]
directory: restricted/tuebingen/tueppdz
group:     negra
type:      text
size:      7651 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Newspaper (TüBa-D/Z), Release 2 [based on TAZ corpus]
directory: restricted/tuebingen/tuebadz/2.0
group:     negra
type:      text
size:      185 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Newspaper (TüBa-D/Z), Release 8 [based on TAZ corpus]
directory: restricted/tuebingen/tuebadz/8.0
group:     negra
type:      text
size:      185 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Speech (TüBa-D/S), Release 1 [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebads/1.0
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of German, Speech (TüBa-D/S), Release 2 [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebads/2.0
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of English, Speech (TüBa-E/S) [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebaes
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tuebingen Treebank of Japanese, Speech (TüBa-J/S) [based on Verbmobil corpus]
directory: restricted/tuebingen/tuebajs
group:     negra
type:      speech
size:      107 MB
licenser:  Tuebingen University
licensee:  ILCC/HCRC
webpage:   here

name:      Tweets 2011
directory: restricted/tweets2011
group:     tweets2011
type:      text
size:      1302 MB
licenser:  NIST
licensee:  ILCC/HCRC
webpage:   here

name:      UMLS Metathesaurus 2005AC
directory: restricted/umls
group:     umls
type:      text
size:      7780 MB
licenser:  American Medical Association
licensee:  ILCC/HCRC
webpage:   here
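
The entries in this list share a simple "key: value" layout, with a blank line between entries and directories given relative to /group/corpora. As an illustration only, the sketch below parses such entries into dictionaries, resolves the full path, and reads the size. The function names and the SAMPLE text (two entries copied from the list above, shortened to four fields) are assumptions made for the example, and continuation lines without a key are simply ignored.

```python
import posixpath

CORPORA_ROOT = "/group/corpora"  # top-level location given under General Information

SAMPLE = """\
name:      AQUAINT-2 Information-Retrieval Text Research Collection
directory: restricted/aquaint
group:     trec
size:      2498 MB

name:      GALE Kickoff
directory: restricted/gale/kickoff
group:     smt
size:      106 MB
"""


def parse_records(text):
    """Split blank-line-separated 'key: value' entries into dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current entry.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip().isalpha():
            current[key.strip().lower()] = value.strip()
        # Lines without a leading key are skipped in this sketch.
    if current:
        records.append(current)
    return records


def full_path(record):
    """Join the relative 'directory' field onto the corpora root."""
    return posixpath.join(CORPORA_ROOT, record["directory"])


def size_in_mb(record):
    """Parse a size such as '2498 MB' or '1.34 MB' into a float."""
    value, unit = record["size"].split()
    if unit != "MB":
        raise ValueError(f"unexpected unit: {unit}")
    return float(value)


if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        print(rec["group"], full_path(rec), size_in_mb(rec))
```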

Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh