| /group/corpora/public | corpora with Informatics-wide or University-wide licenses (NFS filesystem) |
| /group/corpora/large | very large corpora with Informatics-wide or University-wide licenses (AFS filesystem - actual location is /afs/inf.ed.ac.uk/group/corpora/large/) |
| /group/corpora/restricted | corpora with more restrictive licenses (AFS filesystem - actual location is /afs/inf.ed.ac.uk/group/corpora/restricted/) |
For AFS filesystems, you need to be authenticated in order to access the space. Before reporting an access problem, you should check that you have a valid AFS token.
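As a quick first check before reporting a problem, the sketch below classifies a corpus directory as readable, missing, or permission-denied. It assumes that a missing or expired AFS token shows up as an ordinary "permission denied" error when listing the directory; the function name `check_access` is hypothetical. On a typical OpenAFS client the `tokens` command lists your current tokens, but whether that is the right tool on your machine is an assumption here.

```python
import os


def check_access(path):
    """Classify a directory as 'ok', 'missing', or 'denied'."""
    try:
        os.listdir(path)  # listing forces a real access check
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        # On AFS this is the likely symptom of a missing or expired token.
        return "denied"
    return "ok"


if __name__ == "__main__":
    for area in ("public", "large", "restricted"):
        print(area, check_access(f"/group/corpora/{area}"))
```

A result of "denied" for `large` or `restricted` but "ok" for `public` would point to authentication rather than to a missing corpus.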
In general, there should be symlinks from /group/corpora/public/... to /group/corpora/large/... for the few corpora stored under large, to make browsing the filesystem easier. The space is divided simply because of a limit on the size of a single disk partition.
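To see where a corpus directory actually lives, you can resolve its symlink. The following is a minimal sketch using only the standard library; the helper name `describe_location` is hypothetical and the path in the usage example is only an illustration of the layout described above.

```python
import os


def describe_location(path):
    """Return (is_symlink, resolved_path) for a corpus directory."""
    return os.path.islink(path), os.path.realpath(path)


if __name__ == "__main__":
    # Illustrative path only; substitute the corpus you are interested in.
    is_link, target = describe_location("/group/corpora/public/bnc")
    print("symlink" if is_link else "regular path", "->", target)
```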
For corpora with restrictive licenses, read access is limited to certain groups of users. The group names are specified in the list of restricted corpora below. If you need access to any of these corpora, please email corpus-admin@inf.ed.ac.uk.
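Before emailing, you may want to check whether your account is already in the relevant group. The sketch below assumes that the group names in the catalogue correspond to ordinary Unix groups visible on the machine you are logged in to, which may not hold everywhere; `user_in_group` is a hypothetical helper, and `smt` is used only because it appears as a group name in the listing.

```python
import grp
import os


def user_in_group(group_name):
    """Return True if the current process belongs to the named Unix group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group is not known on this machine.
        return False
    # Supplementary groups plus the primary group of the process.
    return gid in set(os.getgroups()) | {os.getgid()}


if __name__ == "__main__":
    print("member of smt:", user_in_group("smt"))
```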
Examples:
| /group/corpora/public/bnc/1.0 | BNC, Version 1.0, unmodified |
| /group/corpora/public/bnc/2.0 | BNC, Version 2.0, unmodified |
| /group/corpora/public/bnc/parsed_ims | BNC, parsed with IMS parser |
| /group/corpora/public/bnc/parsed_minipar | BNC, parsed with Minipar |
| /group/corpora/public/bllip/original | BLLIP, unmodified |
| /group/corpora/public/bllip/parsed_minipar | BLLIP, parsed with Minipar |
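The examples above suggest a layout of the form `/group/corpora/<area>/<corpus>/<variant>`, where the variant is either a version number or a processing step. A small sketch of building such paths follows; the helper `corpus_path` and the set of area names are assumptions drawn from the tables on this page, not an official interface.

```python
from pathlib import PurePosixPath

ROOT = PurePosixPath("/group/corpora")
AREAS = {"public", "large", "restricted"}  # areas named on this page


def corpus_path(area, corpus, variant):
    """Build a path such as /group/corpora/public/bnc/parsed_minipar."""
    if area not in AREAS:
        raise ValueError(f"unknown area: {area!r}")
    return str(ROOT / area / corpus / variant)


if __name__ == "__main__":
    print(corpus_path("public", "bnc", "2.0"))
    print(corpus_path("public", "bllip", "parsed_minipar"))
```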
If you would like to find out whether we have a corpus that you need for your work, or to order new corpora, please email corpus-admin@inf.ed.ac.uk.
As of 2005, the University has a subscription membership for the LDC. This means that we automatically receive two copies of all new corpora released by the LDC in 2005 and subsequent years. Note, however, that not all LDC corpora are installed automatically in the corpus space, because disk space is limited. If you want a new LDC corpus to be installed, please email corpus-admin@inf.ed.ac.uk.
If you are a corpus administrator, and you have an LDC membership account, please follow this link to find out more details about our LDC membership.
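The catalogue entries on this page are runs of labelled fields (`name:`, `directory:`, an optional `group:`, `type:`, `size:`, `licenser:`, `licensee:`, `webpage:`). If you want to search or sort them, a sketch like the following can split the text into one dictionary per corpus. The function names are hypothetical, the parser assumes that no field value itself contains one of the labels followed by a colon, and the GB-to-MB conversion factor of 1024 is an assumption, since the page does not say which convention its sizes use.

```python
import re

FIELDS = ("name", "directory", "group", "type", "size",
          "licenser", "licensee", "webpage")
_LABEL = re.compile(r"\b(" + "|".join(FIELDS) + r"):\s*")
_SIZE = re.compile(r"([\d.]+)\s*(MB|GB)")


def parse_records(text):
    """Split flattened catalogue text into one dict per corpus."""
    parts = _LABEL.split(text)
    records = []
    # parts[0] is any text before the first label; after that,
    # labels and values alternate.
    for label, value in zip(parts[1::2], parts[2::2]):
        if label == "name":
            records.append({})  # each 'name:' starts a new record
        if records:
            records[-1][label] = value.strip()
    return records


def size_in_mb(size):
    """Convert a size such as '18 GB' or '629 MB' to megabytes."""
    match = _SIZE.search(size)
    if match is None:
        return None
    number, unit = float(match.group(1)), match.group(2)
    return number * 1024 if unit == "GB" else number  # assumed 1 GB = 1024 MB


if __name__ == "__main__":
    sample = ("name: ABI - Accents of the British Isles directory: public/abi "
              "type: speech size: 18 GB licenser: The Speech Ark / The University "
              "of Birmingham licensee: UoE webpage: here")
    for record in parse_records(sample):
        print(record["directory"], size_in_mb(record["size"]))
```

Starting a new record at every `name:` label means an entry with a missing field still parses; the absent key is simply not present in its dictionary.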
name: ABI - Accents of the British Isles directory: public/abi type: speech size: 18 GB licenser: The Speech Ark / The University of Birmingham licensee: UoE webpage: here name: ACL/DCI Corpus (includes the original Wall Street Journal corpus) directory: public/acl_dci/original type: text size: 629 MB licenser: LDC licensee: UoE webpage: here name: Asian Elephant Vocalizations directory: public/elephant type: speech size: 22265 MB licenser: LDC licensee: UoE webpage: here name: ACL/DCI Corpus, processed version directory: public/acl_dci/processed type: text size: 17 MB licenser: LDC licensee: UoE webpage: here name: ACE 2004, Multilingual Training Corpus directory: public/ace/ace_mtc/2004 type: text size: 34 MB licenser: LDC licensee: UoE webpage: here name: ACE 2004, Time Normalization English Training Data directory: public/ace/ace_tern type: text size: 6.9 MB licenser: LDC licensee: UoE webpage: here name: ACE 2005, English SpatialML Annotations directory: public/ace/ace_spatial type: text size: 23 MB licenser: LDC licensee: UoE webpage: here name: ACE 2005, Multilingual Training Corpus directory: public/ace/ace_mtc/2005 type: text size: 1617 MB licenser: LDC licensee: UoE webpage: here name: ACE-2, Version 1.0 directory: public/ace/ace2 type: text size: 34 MB licenser: LDC licensee: UoE webpage: here name: Datasets for Generic Relation Extraction (reACE) directory: public/ace/reace type: text size: 69 MB licenser: LDC licensee: UoE webpage: here name: AQUAINT-2 Information Retrieval Text Research Collection directory: public/acquaint type: text size: 1069 MB licenser: LDC licensee: UoE webpage: here name: ATIS3 (Air Travel Information Service), NIST Speech Discs 17-1.1 - 17-3.1 directory: public/atis3 type: speech size: 1300 MB licenser: LDC licensee: UoE webpage: here name: An English Dictionary of the Tamil Verb directory: public/tamil_dictionary type: text size: 0.52 MB licenser: LDC licensee: UoE webpage: here name: Arabic Gigaword, fourth edition 
directory: public/arabic_gigaword/4.0 type: text size: 2588 MB licenser: LDC licensee: UoE webpage: here name: Standard Arabic Morphological Analyzer directory: public/arabic_morphology type: text size: 5 MB licenser: LDC licensee: UoE webpage: here name: Arabic Broadcast News Transcripts directory: public/arabic_news_transcripts type: text size: 3.6 MB licenser: LDC licensee: UoE webpage: here name: Arabic Translation Corpus, Part 1 directory: public/arabic_translation/part1 type: text size: 2.6 MB licenser: LDC licensee: UoE webpage: here name: Arabic Translation Corpus, Part 2 directory: public/arabic_translation/part2 type: text size: 3.2 MB licenser: LDC licensee: UoE webpage: here name: Arabic Newswire English Translation Collection directory: public/arabic_translation/newswire type: text size: 13 MB licenser: LDC licensee: UoE webpage: here name: Arabic Treebank, Part 1, Version 2.0 directory: public/arabic_treebank/part1v2.0 type: text size: 266 MB licenser: LDC licensee: UoE webpage: here name: Arabic Treebank, Part 1, Version 2.0, English Translation directory: public/arabic_treebank/part1v2.0/translation type: text size: 0.27 MB licenser: LDC licensee: UoE webpage: here name: Arabic Treebank, Part 3, Version 2.0 directory: public/arabic_treebank/part3v2.0 type: text size: 891 MB licenser: LDC licensee: UoE webpage: here name: Aurora Noisy TI Digits Database, Version 2.0 directory: public/aurora type: speech size: 2629 MB licenser: ELDA licensee: UoE webpage: here name: Bible Corpus, 56 languages directory: public/bible type: text size: 246 MB licenser: public domain licensee: none webpage: here name: BBN IE/NE-tagged HUB-4 Training Transcripts directory: public/bbn_ie_ne_tagged type: text size: 10 MB licenser: LDC licensee: UoE webpage: here name: BBN Pronoun Coreference and Entity Type Corpus directory: public/bbn_pronoun_coref type: text size: 22 MB licenser: LDC licensee: UoE webpage: here name: BLLIP Corpus directory: public/bllip/original type: text 
size: 172 MB licenser: LDC licensee: UoE webpage: here name: BLLIP Corpus, parsed with Minipar directory: public/bllip/parsed_minipar type: text size: 293 MB licenser: LDC licensee: UoE webpage: here name: BLLIP Corpus, parsed in KAF format directory: public/bllip/parsed_kaf type: text size: 1382 MB licenser: LDC licensee: UoE webpage: here name: BLLIP Corpus, text extracted directory: public/bllip/text type: text size: 290 MB licenser: LDC licensee: UoE webpage: here name: Basic Electricity and Electronics Corpus directory: public/bee type: text size: 2 MB licenser: University of Pittsburgh licensee: freely available webpage: here name: Biomedical Information Extraction Corpus directory: public/biomedical_ie type: text size: 320 MB licenser: LDC licensee: UoE webpage: here name: Blog 06 Test Collection directory: public/blogs_collection type: text size: 25000 MB licenser: University of Glasgow licensee: ILCC/HCRC webpage: here name: Boston University Radio Speech Corpus directory: public/bu_radio type: speech size: 2424 MB licenser: LDC licensee: UoE webpage: here name: British National Corpus, Version 1.0 directory: public/bnc/1.0 type: text size: 2866 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 1.0, marked up in XML directory: public/bnc/xml type: text size: 815 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 1.0, parsed with Charniak parser directory: public/bnc/parsed_charniak type: text size: 419 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 1.0, parsed with IMS parser directory: public/bnc/parsed_ims type: text size: 2088 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 1.0, parsed with Minipar directory: public/bnc/parsed_minipar type: text size: 448 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, 
Version 1.0, parsed with RASP parser directory: public/bnc/parsed_rasp type: text size: 3520 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 1.0, raw text without any markup directory: public/bnc/text type: text size: 579 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 1.0, various LTG data directory: public/bnc/data type: text size: 7 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 2.0 (World Edition) directory: public/bnc/2.0 type: text size: 1779 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 2.0 (World Edition), indexed for IMS Corpus Workbench directory: public/bnc/corpus_workbench type: text size: 967 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: British National Corpus, Version 3.0 (XML Edition) directory: public/bnc/3.0 type: text size: 4619 MB licenser: BNC Consortium licensee: ILCC/HCRC webpage: here name: Buckwalter Arabic Morphological Analyzer Version 2.0 directory: public/buckwalter type: lexicon size: 4 MB licenser: LDC licensee: UoE webpage: here name: Corpus of Spoken Dutch, CGN directory: public/cgn type: speech size: 4554 MB licenser: Dutch Language Union licensee: ILCC/HCRC webpage: here name: CCGbank, Version 1.1 directory: public/ccgbank type: text size: 387 MB licenser: LDC licensee: UoE webpage: CCG group home page, LDC catalog entry name: CELEX Lexical Database, Version 2.0 directory: public/celex/2.0 type: lexicon size: 288 MB licenser: LDC licensee: UoE webpage: here name: CoNLL 2006 Shared Task Data directory: public/conll/2006 type: text size: 138 MB licenser: LDC licensee: UoE webpage: here, here, and here name: CoNLL 2008 Shared Task Data directory: public/conll/2008 type: text size: 84 MB licenser: LDC licensee: UoE webpage: here name: COMLEX English Syntax Corpus directory: public/comlex/corpus type: 
text size: 98 MB licenser: LDC licensee: UoE webpage: here name: COMLEX English Syntax Lexicon directory: public/comlex/lexicon type: lexicon size: 18 MB licenser: LDC licensee: UoE webpage: here name: CSTR TIMIT Sentence Data directory: public/timit/cstr type: speech size: 478 MB licenser: LDC licensee: UoE webpage: none name: CSTR Weather Database for Speech Synthesis directory: public/synthesis/cstr/weather type: text size: 255 MB licenser: CSTR licensee: UoE webpage: none name: Callhome American English, Speech directory: public/callhome/english/speech type: speech size: 1830 MB licenser: LDC licensee: UoE webpage: here name: Callhome Mandarin Chinese Transcripts - XML version directory: public/callhome/chinese type: speech size: 9.5 MB licenser: LDC licensee: UoE webpage: here name: Callhome Spanish Dialog Act Annotation directory: public/callhome/spanish/dialog_act type: text size: 10 MB licenser: LDC licensee: UoE webpage: here name: Callhome Spanish Lexicon directory: public/callhome/spanish/lexicon type: text size: 3.1 MB licenser: LDC licensee: UoE webpage: here name: Callhome Spanish Transcripts directory: public/callhome/spanish/transcripts type: text size: 1.9 MB licenser: LDC licensee: UoE webpage: here name: Canadian Hansard directory: public/canadian_hansard type: text size: 685 MB licenser: LDC licensee: UoE webpage: here name: CHAINS - CHAracterizing INdividual Speakers directory: public/chains type: speech size: 3.3 GB licenser: University College Dublin licensee: UoE webpage: here name: Childes Child Language Database directory: public/childes type: text size: 1266 MB licenser: Carnegie Mellon University licensee: GPL webpage: here name: Chinese English Name Entity Lists, Version 1.0 directory: public/chinese_english_ne type: text size: 97 MB licenser: LDC licensee: UoE webpage: here name: Chinese Gigaword, second edition directory: public/chinese_gigaword/2.0 type: text size: 1700 MB licenser: LDC licensee: UoE webpage: here name: Chinese 
Gigaword, fourth edition directory: public/chinese_gigaword/4.0 type: text size: 2990 MB licenser: LDC licensee: UoE webpage: here name: Chinese News Translation Corpus, Part 1 directory: public/chinese_translation type: text size: 1.6 MB licenser: LDC licensee: UoE webpage: here name: Chinese Proposition Bank 1.0 directory: public/chinese_propbank/1.0 type: text size: 21 MB licenser: LDC licensee: UoE webpage: here name: Chinese Proposition Bank 2.0 directory: public/chinese_propbank/2.0 type: text size: 112 MB licenser: LDC licensee: UoE webpage: here name: Chinese Treebank, Version 2.0 directory: public/chinese_treebank/2.0 type: text size: 4.3 MB licenser: LDC licensee: UoE webpage: here name: Chinese Treebank, Version 2.0, English Translation directory: public/chinese_treebank/2.0/translation type: text size: 1.6 MB licenser: LDC licensee: UoE webpage: here name: Chinese Treebank, Version 3.0 directory: public/chinese_treebank/3.0 type: text size: 14.4 MB licenser: LDC licensee: UoE webpage: here name: Chinese Treebank, Version 3.0, English Translation directory: public/chinese_treebank/3.0/translation type: text size: 1.7 MB licenser: LDC licensee: UoE webpage: here name: Chinese Treebank, Version 5.0 directory: public/chinese_treebank/5.0 type: text size: 31 MB licenser: LDC licensee: UoE webpage: here name: Chinese Treebank, Version 6.0 directory: public/chinese_treebank/6.0 type: text size: 115 MB licenser: LDC licensee: UoE webpage: here name: Chinese-English Translation Lexicon, Version 3.0 directory: public/chinese_english_lexicon type: lexicon size: 1.4 MB licenser: LDC licensee: UoE webpage: here name: Christine Corpus of Spoken British English directory: public/christine type: text size: 4.3 MB licenser: University of Sussex licensee: freely available webpage: here name: Conference Proceedings from CDROMs directory: public/proceedings type: text size: 18000 MB and growing licenser: various licensee: UoE webpage: here name: Continuous Speech 
Recognition Corpus (CSR-III Speech) directory: public/csr/csr3/speech type: speech size: 1952 MB licenser: LDC licensee: UoE webpage: here name: Continuous Speech Recognition Corpus (CSR-III Text) directory: public/csr/csr3/text type: speech size: 1791 MB licenser: LDC licensee: UoE webpage: here name: Continuous Speech Recognition Corpus (HUB-4 Language Model) directory: public/csr/hub4 type: text size: 845 MB licenser: LDC licensee: UoE webpage: here name: Corpus of IMDB Movie Summaries, indexed for IMS Corpus Workbench directory: public/imdb type: text size: 168 MB licenser: public domain licensee: none webpage: here name: Cytology Corpus (Alvey Project) directory: public/cytol type: speech size: 372 MB licenser: CSTR licensee: UoE webpage: here name: DARPA Communicator 2000 Dialogue Act Tagged directory: public/darpa_communicator/2000/tagged type: text size: 19 MB licenser: LDC licensee: UoE webpage: here name: DARPA Communicator 2000 Evaluation directory: public/darpa_communicator/2000/evaluation type: speech size: 4384 MB licenser: LDC licensee: UoE webpage: here name: DARPA Communicator 2001 Dialogue Act Tagged directory: public/darpa_communicator/2001/tagged type: text size: 88 MB licenser: LDC licensee: UoE webpage: here name: DARPA Communicator 2001 Evaluation directory: public/darpa_communicator/2001/evaluation type: speech size: 3804 MB licenser: LDC licensee: UoE webpage: here name: DARPA Resource Management Continuous Speech Database (RM1) directory: public/resource_management/rm1 type: speech size: 387 MB licenser: LDC licensee: UoE webpage: here name: DARPA Resource Management Continuous Speech Database (RM2) directory: public/resource_management/rm2 type: speech size: 688 MB licenser: LDC licensee: UoE webpage: here name: DCIEM Sleep Deprivation Corpus directory: public/dciem type: speech size: 7448 MB licenser: LDC licensee: UoE webpage: here name: DSO Corpus of Sense-Tagged English directory: public/dso type: text size: 37 MB licenser: LDC 
licensee: UoE webpage: here name: Document Understanding Conference (DUC) data, 2001-2007 directory: public/duc type: text size: 108 MB licenser: NIST licensee: ILCC/HCRC webpage: here name: Dickens Corpus, indexed for IMS Corpus Workbench directory: public/dickens type: text size: 65 MB licenser: public domain licensee: none webpage: here name: PARC 700 Dependency Bank directory: public/depbank type: text size: 3 MB licenser: public domain licensee: none webpage: here name: Diphone Voices for Festival directory: public/synthesis/diphone_voices type: speech size: 4477 MB licenser: CSTR licensee: UoE webpage: here name: Discourse Graphbank directory: public/discourse_graphbank type: text size: 2 MB licenser: LDC licensee: UoE webpage: here name: Dundee Corpus of English and French Eye-movement Data directory: public/dundee_eyemovement type: speech size: 207 MB licenser: Department of Psychology, University of Dundee licensee: UoE webpage: none name: ESP Game 100k Corpus directory: public/esp_game type: text and images size: 2783 MB licenser: Luis von Ahn licensee: publicly available webpage: here name: EMILLE/CIIL Corpus directory: public/emille type: text size: 1645 MB licenser: ELDA licensee: ILCC/HCRC webpage: here name: Electromagnetic Articulograph (EMA) Data directory: public/ema/other type: speech and EMA size: 2394 MB licenser: QMUC/CSTR licensee: UoE webpage: here name: Emotional Prosody Speech and Transcripts directory: public/emotional_speech type: speech size: 2845 MB licenser: LDC licensee: UoE webpage: here name: English Gigaword, first edition directory: large/english_gigaword/1.0/original type: text size: 3960 MB licenser: LDC licensee: UoE webpage: here name: English Gigaword, first edition, parsed with Minipar directory: large/english_gigaword/1.0/parsed_minipar type: text size: 17157 MB licenser: LDC licensee: UoE webpage: here name: English Gigaword, first edition, tokenized and tagged directory: large/english_gigaword/1.0/tagged type: text size: 
22169 MB licenser: LDC licensee: UoE webpage: here name: English Gigaword, fourth edition directory: large/english_gigaword/4.0 type: text size: 8223 MB licenser: LDC licensee: UoE webpage: here name: English Gigaword, fifth edition directory: large/english_gigaword/5.0 type: text size: 9328 MB licenser: LDC licensee: UoE webpage: here name: English Gigaword, annotated directory: large/annotated_english_gigaword type: text size: 172 GB licenser: LDC licensee: UoE webpage: here name: English Intonation in the British Isles Corpus directory: public/ivie type: text size: 2471 MB licenser: University of Oxford licensee: freely available webpage: here name: English-Arabic Parallel Treebank directory: public/english_arabic_treebank type: text size: 18 MB licenser: LDC licensee: UoE webpage: here name: English-Chinese Translation Treebank 1.0 directory: public/chinese_translation_treebank type: speech size: 9 MB licenser: LDC licensee: UoE webpage: here name: Enron Email Dataset directory: public/enron/original type: text size: 1646 MB licenser: public domain licensee: none webpage: here name: Enron Email Dataset, prepared for Rainbow directory: public/enron/rainbow type: text size: 290 MB licenser: public domain licensee: none webpage: here name: Enron Email Dataset, with Topic Annotations directory: public/enron/annotations type: text size: 0.2 MB licenser: LDC licensee: UoE webpage: here name: European Corpus Initiative Multilingual Corpus directory: public/eci type: text size: 685 MB licenser: LDC licensee: UoE webpage: here name: European News Corpus directory: public/european_news type: text size: 715 MB licenser: LDC licensee: UoE webpage: here name: European Parliament Interpretation Corpus (EPIC) directory: public/ELRA/ELRA-S0323-EPIC-European-Parliament-Interpretation-Corpus type: text size: 3.8 GB licenser: ELRA licensee: The School of Informatics webpage: here name: European Parliament Proceedings Parallel Corpus, Version 2.0 directory: public/europarl type: 
text size: 3809 MB licenser: public domain licensee: none webpage: here name: Extended VerbNet directory: public/verbnet type: lexicon size: 2.5 MB licenser: University of Colorado licensee: freely available webpage: here name: Extended WordNet Lexical Database, WordNet Version 2.0, Extension Version 1.1 directory: public/wordnet/xwn type: lexicon size: 154 MB licenser: University of Texas at Dallas licensee: freely available webpage: here name: Fisher English Training Speech, Part 1, Speech directory: large/fisher/speech type: speech size: 28 GB licenser: LDC licensee: UoE webpage: here name: Fisher English Training Speech, Part 1, Transcripts directory: large/fisher/transcripts type: text size: 275 MB licenser: LDC licensee: UoE webpage: here name: Factbank directory: public/factbank type: text size: 22 MB licenser: LDC licensee: UoE webpage: here name: Fisher English Training Speech, Part 2, Speech directory: large/fisher/speech type: speech size: 29 GB licenser: LDC licensee: UoE webpage: here name: Fisher English Training Speech, Part 2, Transcripts directory: large/fisher/transcripts type: text size: 279 MB licenser: LDC licensee: UoE webpage: here name: FrameNet 1.1 directory: public/framenet/1.1 type: text size: 1024 MB licenser: University of California at Berkeley licensee: freely available webpage: here name: FrameNet 1.3 directory: public/framenet/1.3 type: text size: 783 MB licenser: University of California at Berkeley licensee: freely available webpage: here name: Frankfurter Rundschau corpus (part of ECI), tokenized and tagged directory: public/rundschau type: text size: 1074 MB licenser: LDC licensee: UoE webpage: here name: French Treebank, Version 1.4 directory: public/french_treebank type: text size: 147 MB licenser: LLF, Universite Paris 7 licensee: UoE webpage: here name: French Gigaword, second edition directory: public/french_gigaword/2.0 type: text size: 1790 MB licenser: LDC licensee: UoE webpage: here name: French Gigaword, third edition 
directory: public/french_gigaword/3 type: text size: 2 GB licenser: LDC licensee: UoE webpage: here name: GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web directory: public/gale/gale_ch_nw_wb_word_align_p3/ type: text size: 29 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Arabic Blog Parallel Text directory: public/gale/galep1_ara_bl_ptxt/ type: text size: 5.7 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 directory: public/gale/ara_bn_ptext type: text size: 36 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 directory: public/gale/ar_bn_ptxt_p2/ type: text size: 2.8 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Chinese Blog Parallel Text directory: public/gale/gale_p1_ch_blog type: text size: 2 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 directory: public/gale/ch_bn_ptxt type: text size: 6 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 directory: public/gale/ch_bn_ptxt_p2 type: text size: 6 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 directory: public/gale/ch_bn_ptxt_p3 type: text size: 4.4 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 1 Distillation Training directory: public/gale/galep1_distill_tr type: text size: 31 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 3 Release 2 - MTPlus Pilot directory: restricted/gale/GALE-P3-MTPlus_Pilot group: smt type: text size: 0.86 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 3 Release 2 - Transcripts directory: restricted/gale/GALE-P3R2/transcription group: smt type: text size: 117 MB licenser: LDC licensee: UoE webpage: here name: GALE Phase 3 Release 2 - Translations directory: 
restricted/gale/GALE-P3R2/translationGALE-P3R1 group: smt type: text size: 11.8 MB licenser: LDC licensee: UoE webpage: here name: German Law Corpus, indexed for IMS Corpus Workbench directory: public/german_law type: text size: 40 MB licenser: public domain licensee: none webpage: here name: GlobalPhone directory: public/global_phone type: speech size: 18000 MB licenser: UoE licensee: ELDA webpage: GlobalPhone name: Google Book Corpus directory: public/google_books type: text size: 8119 MB licenser: ILCC/HCRC licensee: LDC webpage: n/a name: Google n-Gram Corpus directory: public/google_ngrams type: text size: 25000 MB licenser: UoE licensee: LDC webpage: here name: Gulf Arabic Conversational Telephone Speech, Transcripts directory: public/arabic_telephone type: text size: 11 MB licenser: LDC licensee: UoE webpage: here name: HUB-5 English Evaluation 1997 directory: public/hub5/1997 type: speech size: 593 MB licenser: LDC licensee: UoE webpage: here name: HUB-5 English Evaluation 1998 directory: public/hub5/1998 type: speech + text size: 607 MB licenser: LDC licensee: UoE webpage: here and here name: HUB-5 English Evaluation 2000 directory: public/hub5/2000 type: speech + text size: 617 MB licenser: LDC licensee: UoE webpage: here and here name: HUB-5 English Evaluation 2001 directory: public/hub5/2001 type: speech size: 2.1 GB licenser: LDC licensee: UoE webpage: here name: HARD 2004 Topics and Annotations directory: public/hard type: text size: 19 MB licenser: LDC licensee: UoE webpage: here name: Hebrew Treebank directory: public/hebrew_treebank type: text size: 20 MB licenser: Technion licensee: public domain webpage: here name: Hindi Wordnet directory: public/hindi_wordnet type: text size: 19 MB licenser: LDC licensee: UoE webpage: here name: Hong Kong Hansard Parallel Text, Alignments directory: public/hong_kong_hansard/alignments type: text size: 91 MB licenser: LDC licensee: UoE webpage: here name: Hong Kong Hansard Parallel Text, Text directory: 
public/hong_kong_hansard/text type: text size: 110 MB licenser: LDC licensee: UoE webpage: here name: Hong Kong Laws Parallel Text directory: public/hong_kong_laws type: text size: 75 MB licenser: LDC licensee: UoE webpage: here name: Hong Kong News Parallel Text, Alignments directory: public/hong_kong_news/alignments type: text size: 107 MB licenser: LDC licensee: UoE webpage: here name: Hong Kong News Parallel Text, Text directory: public/hong_kong_news/text type: text size: 81 MB licenser: LDC licensee: UoE webpage: here name: ICSI Meeting Speech directory: public/icsi_meeting/speech type: text size: 33400 MB licenser: LDC licensee: UoE webpage: here name: ICSI Meeting Transcripts directory: public/icsi_meeting/transcripts type: text size: 3.51 MB licenser: LDC licensee: UoE webpage: here name: IMS Corpus Workbench (corpus registry files only) directory: public/corpus_workbench type: text/speech size: 0 MB licenser: IMS Stuttgart licensee: ILCC/HCRC webpage: here name: ISL Meeting Speech directory: public/isl_meeting/speech type: speech size: 5975 MB licenser: LDC licensee: UoE webpage: here name: ISL Meeting Transcripts directory: public/isl_meeting/transcripts type: text size: 1.81 MB licenser: LDC licensee: UoE webpage: here name: Instruction-based Learning for Mobile Robots Corpus directory: public/ibl type: speech size: 123 MB licenser: University of Edinburgh/University of Plymouth licensee: freely available webpage: here name: KAIST Korean Speech Database directory: public/kaist type: speech size: 3711 MB licenser: The Korean Advanced Institute of Science and Technology licensee: UoE webpage: none name: Korean Broadcast News Transcripts directory: public/korean_news_transcripts type: text size: 1.4 MB licenser: LDC licensee: UoE webpage: here name: Korean Propbank directory: public/korean_propbank type: text size: 24 MB licenser: LDC licensee: UoE webpage: here name: Korean Treebank, Version 1.0 directory: public/korean_treebank/1.0 type: text size: 6.9 
MB licenser: LDC licensee: UoE webpage: here
name: Korean Treebank, Version 2.0 directory: public/korean_treebank/2.0 type: text size: 20 MB licenser: LDC licensee: UoE webpage: here
name: Lancaster Corpus of Mandarin (LCMC) directory: public/lcmc type: text size: 46 MB licenser: ELDA licensee: UoE webpage: here
name: Levantine Arabic QT Training Data Set 5, Transcripts directory: public/arabic_qt_data type: text size: 27 MB licenser: LDC licensee: UoE webpage: here
name: Lucy Corpus of Written British English directory: public/lucy type: text size: 4.7 MB licenser: University of Sussex licensee: freely available webpage: here
name: MDE RT-02 Rich Transcription Broadcast News and Conversational Telephone Speech 2002 directory: public/mde/rt-02 type: speech size: 815 MB licenser: LDC licensee: UoE webpage: here
name: MDE RT-03 Training Data, Speech directory: public/mde/rt-03/speech type: speech size: 5256 MB licenser: LDC licensee: UoE webpage: here
name: MDE RT-03 Training Data, Text and Annotations directory: public/mde/rt-03/text type: text size: 723 MB licenser: LDC licensee: UoE webpage: here
name: MDE RT-04 Training Data, Speech directory: public/mde/rt-04/speech type: speech size: 4829 MB licenser: LDC licensee: UoE webpage: here
name: MDE RT-04 Training Data, Text and Annotations directory: public/mde/rt-04/text type: text size: 567 MB licenser: LDC licensee: UoE webpage: here
name: MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) directory: public/mandarin_transcripts/hub4-ne type: text size: 2.35 MB licenser: LDC licensee: UoE webpage: here
name: MMA_2012 Multiple Microphone Array corpus directory: large/2012_MMA type: speech size: 115 GB licenser: UoE licensee: UoE webpage: here
name: MOCHA Electromagnetic Articulograph (EMA) Corpus directory: public/ema/mocha type: speech size: 2221 MB licenser: QMUC/CSTR licensee: UoE webpage: here
name: MRC Psycholinguistic Database directory: public/mrc type: lexicon size: 11 MB licenser: MRC licensee: freely available webpage: here
name: Machine-readable Spoken English Corpus directory: public/marsec type: speech size: 2 MB licenser: Reading University licensee: UoE webpage: here
name: Macrophone: American English Segment of the Polyphone Corpus directory: public/macrophone type: speech size: 3809 MB licenser: LDC licensee: UoE webpage: here
name: Mandarin Transcripts (HUB-5, 2001) directory: public/mandarin_transcripts/hub5 type: text size: 0.2 MB licenser: LDC licensee: UoE webpage: here
name: Mandarin Transcripts, HKUST Telephone Data, Part 1 directory: public/mandarin_transcripts/hkust type: text size: 11 MB licenser: LDC licensee: UoE webpage: here
name: Maptask Corpus directory: public/maptask type: speech size: 13665 MB licenser: LDC and UoE/LDC licensee: UoE webpage: Maptask home page, LDC catalog entry
name: Mawukakan Lexicon directory: public/mawukakan_lexicon type: lexicon size: 4 MB licenser: LDC licensee: UoE webpage: here
name: MC-WSJ-AV - multichannel Wall Street Journal audiovisual directory: public/MC-WSJ-AV type: speech size: 11 GB licenser: UoE licensee: UoE webpage: here
name: Message Understanding Conference (MUC) 6 directory: public/muc/muc6 type: text size: 10 MB licenser: LDC licensee: UoE webpage: here
name: Message Understanding Conference (MUC) 6, Additional News Text directory: public/muc/muc6/additional_text type: text size: 0.67 MB licenser: LDC licensee: UoE webpage: here
name: Message Understanding Conference (MUC) 7 directory: public/muc/muc7 type: text size: 45.7 MB licenser: LDC licensee: UoE webpage: here
name: MT08 - NIST Open Machine Translation 2008 Evaluation data directory: public/mt08_shared_data type: text size: 20 MB licenser: LDC licensee: UoE webpage: here
name: Multext East directory: public/multext/east/3.0 type: text size: 295 MB licenser: ILCC/HCRC licensee: Jozef Stefan Institute, Ljubljana webpage: here
name: Multext East directory: public/multext/east/4.0 type: text size: 341 MB licenser: ILCC/HCRC licensee: Jozef Stefan Institute, Ljubljana webpage: here
name: Multext JOC directory: public/multext/joc type: text size: 122 MB licenser: ILCC/HCRC licensee: ELDA webpage: here
name: Multilingual Corpora for Cooperation directory: public/mlcc type: text size: 1223 MB licenser: internal licensee: internal webpage: here
name: Multilingual Semcor, Version 1.1 directory: public/semcor/multisemcor type: text size: 142 MB licenser: ITC/IRST licensee: University of Edinburgh webpage: here
name: Multiple-Translation Arabic Corpus, Part 1 directory: public/mt_arabic/part1 type: text size: 5.0 MB licenser: LDC licensee: UoE webpage: here
name: Multiple-Translation Arabic Corpus, Part 2 directory: public/mt_arabic/part2 type: text size: 2.5 MB licenser: LDC licensee: UoE webpage: here
name: Multiple-Translation Chinese Corpus, Part 1, Version 1.0 directory: public/mt_chinese/part1/1.0 type: text size: 4.8 MB licenser: LDC licensee: UoE webpage: here
name: Multiple-Translation Chinese Corpus, Part 1, Version 2.0 directory: public/mt_chinese/part1/2.0 type: text size: 2.8 MB licenser: LDC licensee: UoE webpage: here
name: Multiple-Translation Chinese Corpus, Part 2 directory: public/mt_chinese/part2 type: text size: 3.5 MB licenser: LDC licensee: UoE webpage: here
name: Multiple-Translation Chinese Corpus, Part 3 directory: public/mt_chinese/part3 type: text size: 1.1 MB licenser: LDC licensee: UoE webpage: here
name: Multiple-Translation Chinese Corpus, Part 4 directory: public/mt_chinese/part4 type: text size: 5.2 MB licenser: LDC licensee: UoE webpage: here
name: NPS Internet Chatroom Conversations, Release 1.0 directory: public/nps_chat type: text size: 7 MB licenser: LDC licensee: UoE webpage: here
name: NIST Meeting Pilot Corpus Transcripts and Metadata directory: public/nist_meeting_pilot type: text size: 2 MB licenser: LDC licensee: UoE webpage: here
name: NIST Speaker Recognition Evaluation 2002 directory: public/nist_speaker_rec type: speech size: 4724 MB licenser: LDC licensee: UoE webpage: here
name: NIST TI Digits directory: public/tidigits type: speech size: 786 MB licenser: LDC licensee: UoE webpage: here
name: NTIMIT Acoustic-Phonetic Continuous Speech Corpus directory: public/timit/ntimit type: speech size: 1146 MB licenser: LDC licensee: UoE webpage: here
name: New York Times Annotated Corpus directory: public/nyt_annotated type: text size: 3202 MB licenser: LDC licensee: UoE webpage: here
name: NYNEX Phonebook directory: public/phonebook type: speech size: 1400 MB licenser: LDC licensee: UoE webpage: here
name: Newsgroup Corpus directory: public/newsgroups type: text size: 55 MB licenser: public domain licensee: none webpage: various newsgroups
name: NomBank directory: public/nombank type: speech size: 56 MB licenser: NYU licensee: none webpage: here
name: North American News Text Corpus directory: public/american_news/original type: text size: 2342 MB licenser: LDC licensee: UoE webpage: here
name: North American Newstext Corpus, parsed with Minipar directory: public/american_news/parsed_minipar type: text size: 3392 MB licenser: LDC licensee: UoE webpage: here
name: OntoNotes Release 1.0 directory: public/ontonotes/1.0 type: text size: 750 MB licenser: LDC licensee: UoE webpage: here
name: OntoNotes Release 2.0 directory: public/ontonotes/2.0 type: text size: 1299 MB licenser: LDC licensee: UoE webpage: here
name: OntoNotes Release 3.0 directory: public/ontonotes/3.0 type: text size: 444 MB licenser: LDC licensee: UoE webpage: here
name: OntoNotes Release 4.0 directory: public/ontonotes/4.0 type: text size: 764 MB licenser: LDC licensee: UoE webpage: here
name: OHSUMED Corpus (also used for the TREC 9 Filtering Track) directory: public/ohsumed type: text size: 1176 MB licenser: NIST licensee: freely available webpage: here
name: PASCAL Syntax Induction Challenge Training and Development directory: public/pascal_syntax_challenge type: text size: 102 MB licenser: University of Pennsylvania licensee: UoE webpage: here
name: Penn Discourse Treebank, Version 1.0 directory: public/penn_discourse_treebank/1.0 type: text size: 10 MB licenser: University of Pennsylvania licensee: UoE webpage: here
name: Penn Discourse Treebank, Version 2.0 directory: public/penn_discourse_treebank/2.0 type: text size: 38 MB licenser: University of Pennsylvania licensee: UoE webpage: here
name: Penn Treebank, Version 2.0 directory: public/penn_treebank/2.0 type: text size: 655 MB licenser: LDC licensee: UoE webpage: here
name: Penn Treebank, Version 3.0 directory: public/penn_treebank/3.0 type: text size: 256 MB licenser: LDC licensee: UoE webpage: here
name: Prague Czech-English Dependency Treebank, Version 1.0 directory: public/prague_treebank/ type: text size: 587 MB licenser: LDC licensee: UoE webpage: here
name: Proposition Bank, Version 1.0 directory: public/propbank type: text size: 20 MB licenser: LDC licensee: UoE webpage: here
name: RST Discourse Treebank directory: public/rst_treebank type: text size: 26 MB licenser: LDC licensee: UoE webpage: here
name: Research Cyc directory: public/cyc type: text size: 4118 MB licenser: Cycorp licensee: UoE webpage: here
name: Reuters Text Categorization Corpus 21578 directory: public/reuters/21578 type: text size: 28 MB licenser: Reuters licensee: freely available webpage: here
name: Roget's Thesaurus from 1911 directory: public/roget type: text size: 12 MB licenser: public domain licensee: freely available webpage: here
name: Russian-English Computer Security Parallel Text directory: public/rus_eng_compsec_para type: text size: 1.6 MB licenser: LDC licensee: UoE webpage: here
name: Santa Barbara Corpus of Spoken American English - Parts I to IV directory: public/santa_barbara type: speech size: 6.7 GB licenser: University of California licensee: UoE webpage: here
name: SALSA Corpus directory: public/salsa type: text size: 251 MB licenser: Saarland University licensee: UoE webpage: here
name: SAID (Syntactically Annotated Idiom Dataset) directory: public/said type: text size: 3 MB licenser: LDC licensee: UoE webpage: here
name: SOLE Project Corpus directory: public/synthesis/cstr/sole type: speech size: 895 MB licenser: HCRC licensee: UoE webpage: here
name: Search Engine Logs (Alltheweb, Excite, Altavista) directory: public/searchengine_logs type: text size: 440 MB licenser: Jim Jansen, Penn State University licensee: ILCC/HCRC webpage: ?
name: Semcor Semantically Annotated Corpus, Version 1.6 directory: public/semcor/1.6 type: text size: 39 MB licenser: Princeton University licensee: freely available webpage: here
name: Semcor Semantically Annotated Corpus, Version 2.0 directory: public/semcor/2.0 type: text size: 34 MB licenser: Princeton University licensee: freely available webpage: here
name: Sinorama Chinese English Parallel Text directory: public/sinorama type: text size: 64 MB licenser: LDC licensee: UoE webpage: here
name: Spanish Broadcast News directory: public/spanish_broadcast_news type: speech size: 5200 MB licenser: LDC licensee: UoE webpage: here
name: Spanish Gigaword, First Edition directory: public/spanish_gigaword/1.0 type: text size: 1775 MB licenser: LDC licensee: UoE webpage: here
name: Spanish Gigaword, Second Edition directory: public/spanish_gigaword/2.0 type: text size: 2679 MB licenser: LDC licensee: UoE webpage: here
name: Spanish Newswire, vols 1 and 2 directory: public/spanish_newswire type: text size: 556 MB + 624 MB licenser: LDC licensee: UoE webpage: here, here
name: Spanish Treebank directory: public/spanish_treebank type: text size: 8 MB licenser: University of Barcelona licensee: freely available webpage: here
name: Susanne Corpus of Written American English, Version 1.0 directory: public/susanne/1.0 type: text size: 5 MB licenser: University of Sussex licensee: freely available webpage: here
name: Susanne Corpus of Written American English, Version 5.0 directory: public/susanne/5.0 type: text size: 6 MB licenser: University of Sussex licensee: freely available webpage: here
name: Switchboard Corpus, NXT Annotations directory: large/switchboard/nxt type: speech size: 1230 MB licenser: LDC licensee: UoE webpage: here
name: Switchboard 1 Telephone Speech Corpus, Release 2 directory: large/switchboard/switchboard1 type: speech size: 1485 MB licenser: LDC licensee: UoE webpage: here
name: Switchboard 2 Telephone Speech Corpus, Phases 1-3 directory: large/switchboard/switchboard2 type: speech size: 50643 MB licenser: LDC licensee: UoE webpage: here and here and here
name: Switchboard Cellular Telephone Speech Corpus, Part 1, Audio directory: large/switchboard/cellular/part1/audio type: speech size: 1401 MB licenser: LDC licensee: UoE webpage: here
name: Switchboard Cellular Telephone Speech Corpus, Part 1, Transcripts directory: large/switchboard/cellular/part1/transcripts type: text size: 2 MB licenser: LDC licensee: UoE webpage: here
name: Switchboard Cellular Telephone Speech Corpus, Part 2, Audio directory: large/switchboard/cellular/part2/audio type: speech size: 11364 MB licenser: LDC licensee: UoE webpage: here
name: TORGO Database of Dysarthric Articulation directory: public/torgo type: speech size: 15234 MB licenser: LDC licensee: UoE webpage: here
name: TAC 2008 Data directory: public/tac/2008/ type: text size: 34 MB licenser: NIST licensee: ILCC/HCRC webpage: here
name: TAC 2009 Data directory: public/tac/2009/ type: text size: 226 MB licenser: NIST licensee: ILCC/HCRC webpage: here
name: TAC 2010 Data directory: public/tac/2010/ type: text size: 12 MB licenser: NIST licensee: ILCC/HCRC webpage: here
name: TC-STAR (a large collection of several sub-corpora) directory: public/ELRA/TC-STAR type: text and speech size: 83 GB (current size) licenser: ELRA licensee: The School of Informatics webpage: here
name: TDT Pilot Corpus directory: public/tdt/tdt2_pilot type: text size: 53 MB licenser: LDC licensee: UoE webpage: here
name: TDT2 Careful Transcription Text directory: public/tdt/tdt2_careful_text type: text size: 2.4 MB licenser: LDC licensee: UoE webpage: here
name: TDT2 Careful Transcription Audio directory: public/tdt/tdt2_careful_audio type: speech size: 1077 MB licenser: LDC licensee: UoE webpage: here
name: TDT2 Multilanguage Text Corpus, Version 4.0 directory: public/tdt/tdt2_multilanguage type: text size: 623 MB licenser: LDC licensee: UoE webpage: here
name: TDT3 Multilanguage Text Corpus, Version 2.0 directory: public/tdt/tdt2_multilanguage type: text size: 367 MB licenser: LDC licensee: UoE webpage: here
name: TDT5 Multilingual Text directory: public/tdt/tdt5_multilingual_text type: text size: 1200 MB licenser: LDC licensee: UoE webpage: here
name: TDT5 Topics and Annotations directory: public/tdt/tdt5_topics_and_annotations type: text size: 80 MB licenser: LDC licensee: UoE webpage: here
name: TIMIT Acoustic-Phonetic Continuous Speech Corpus directory: public/timit/original type: speech size: 668 MB licenser: LDC licensee: UoE webpage: here
name: TRECVID 2003 Keyframes & Transcripts directory: public/trec/trecvid/2003 type: video size: 3570 MB licenser: LDC licensee: UoE webpage: here
name: TRECVID 2005 Keyframes & Transcripts directory: public/trec/trecvid/2005 type: video size: 3283 MB licenser: LDC licensee: UoE webpage: here
name: Tageszeitung (TAZ) Corpus directory: public/taz type: text size: 1439 MB licenser: ILCC/HCRC licensee: Contrapress Media GmbH webpage: here
name: Talbanken05 Swedish Treebank directory: public/talbanken type: speech size: 144 MB licenser: University of Växjö and University of Lund licensee: freely available webpage: here
name: Timebank 1.2 directory: public/timebank type: text size: 6 MB licenser: LDC licensee: UoE webpage: here
name: ToBI Guidelines and Examples directory: public/tobi_course type: text size: 19 MB licenser: Ohio State University licensee: UoE webpage: here
name: Translanguage English Database (TED), Speech directory: public/ted/speech type: speech size: 2903 MB licenser: LDC licensee: UoE webpage: here
name: Translanguage English Database (TED), Transcripts directory: public/ted/transcripts type: text size: 1.3 MB licenser: LDC licensee: UoE webpage: here
name: UASPEECH directory: public/UASPEECH type: speech size: 15 GB licenser: University of Illinois licensee: UoE webpage: here
name: Ummah Arabic English Parallel News Text directory: public/ummah type: text size: 6.6 MB licenser: LDC licensee: UoE webpage: here
name: Underspecified Rhetorical Markup Language (URML) Corpus aka Potsdam Commentary Corpus directory: public/urml type: text size: 1.7 MB licenser: University of Potsdam licensee: HCRC/ILCC webpage: here
name: UKWAC (Web as Corpus) English, parsed version directory: public/wac/ukwac type: text size: 31000 MB licenser: University of Bologna licensee: freely available webpage: here
name: UKWAC (Web as Corpus) English, dependency-parsed version directory: public/wac/ukwac_dep type: text size: 16000 MB licenser: University of Bologna licensee: freely available webpage: here
name: Unified Linguistic Annotation Text Collection directory: public/unified_annotation type: text size: 363 MB licenser: LDC licensee: UoE webpage: here
name: Wackypedia EN 1.0 directory: public/wac/wackypedia_en type: text size: 6000 MB licenser: University of Bologna licensee: freely available webpage: here
name: WSJ0 complete (also known as CSR-I) directory: public/wsj/wsj0 type: speech size: 8.9 GB licenser: LDC licensee: UoE webpage: here
name: WSJ1 complete (also known as CSR-II) directory: public/wsj/wsj1 type: speech size: 18 GB licenser: LDC licensee: UoE webpage: here
name: WSJCAM0 Cambridge Read News directory: public/wsjcam0/original type: speech size: 3848 MB licenser: LDC licensee: UoE webpage: here
name: WSJCAM0 Cambridge Read News, processed data directory: public/wsjcam0/data type: speech size: 13571 MB licenser: LDC licensee: UoE webpage: here
name: Wikipedia Corpus (raw data, dated 2009-06-18) directory: large/wikipedia/raw type: text size: 22797 MB licenser: various licensee: freely available webpage: here
name: Wikipedia Corpus (INEX 2006 Corpus) directory: large/wikipedia/inex type: text size: 4959 MB licenser: various licensee: ILCC/HCRC webpage: here
name: Wikipedia Corpus (INEX 2006 Corpus), Question Answering Version directory: large/wikipedia/inex_qa type: text size: 5143 MB licenser: various licensee: freely available webpage: here
name: Wikipedia Corpus, Tagged and Cleaned directory: large/wikipedia/tagged_cleaned type: text size: 51866 MB licenser: various licensee: freely available webpage: here
name: WordNet Lexical Database, Version 1.6 directory: public/wordnet/1.6 type: lexicon size: 40 MB licenser: Princeton University licensee: freely available webpage: here
name: WordNet Lexical Database, Version 1.7.1 directory: public/wordnet/1.7.1 type: lexicon size: 40 MB licenser: Princeton University licensee: freely available webpage: here
name: WordNet Lexical Database, Version 2.0 directory: public/wordnet/2.0 type: lexicon size: 41 MB licenser: Princeton University licensee: freely available webpage: here
name: WordNet Lexical Database, Version 2.1 directory: public/wordnet/2.1 type: lexicon size: 38 MB licenser: Princeton University licensee: freely available webpage: here
name: WordNet Lexical Database, Version 3.0, with sense maps and standoff annotation directory: public/wordnet/2.1 type: lexicon size: 92 MB licenser: Princeton University licensee: freely available webpage: here
name: Wordlists for various languages directory: public/wordlists type: lexicon size: 2 MB licenser: n/a licensee: freely available webpage: n/a
name: Global Yoruba Lexical Database 1.0 directory: public/yoruba type: text size: 183 MB licenser: LDC licensee: UoE webpage: here
name: Xinhua Chinese English Parallel News Text, Version 1.0 beta directory: public/xinhua type: text size: 40 MB licenser: LDC licensee: UoE webpage: here
name: Prague Czech-English Dependency Treebank, 2.0 alpha directory: restricted/pcedt/v2alpha group: - type: text size: 939 MB licenser: Charles University, Prague licensee: Bonnie Webber webpage: ?
name: Reuters Corpus, various supporting data directory: restricted/reuters/data group: reuters01 type: text size: 2 MB licenser: ? licensee: LTG? webpage: ?
name: AQUAINT-2 Information-Retrieval Text Research Collection directory: restricted/aquaint group: trec type: text size: 2498 MB licenser: LDC licensee: UoE webpage: here
name: Continuous Speech Recognition Corpus (HUB-4) directory: restricted/csr group: corpman type: speech size: 908 MB licenser: LDC licensee: UoE webpage: here, here, here note: contains a mixture of LDC and proprietary data, thus restricted
name: DMM German Morphological Database directory: restricted/dmm group: dmm type: lexicon size: 21 MB licenser: University of Erlangen-Nuernberg licensee: ILCC/HCRC webpage: here
name: GALE Kickoff directory: restricted/gale/kickoff group: smt type: text size: 106 MB licenser: LDC licensee: UoE webpage: here and here
name: GALE Phase 2 Releases 1, 2 and 3 directory: restricted/gale/GALE-P2* group: smt type: text size: 1500 MB licenser: LDC licensee: UoE webpage: release 1: here and here; release 2: here and here; release 3: here and here
name: GALE Phase 3 DevTest - Source Text, Transcripts and Translations directory: restricted/gale/GALE-P3-DevTest-V1_0 group: smt type: text size: 5 MB licenser: LDC licensee: UoE webpage: here
name: GALE Phase 3 Release 1 - Distillation directory: restricted/gale/GALE-Phase3-Distillation-TrainingData-V1_0 group: smt type: text size: 1.34 MB licenser: LDC licensee: UoE webpage: here
name: GALE Phase 3 Release 1 - English Translation Treebank directory: restricted/gale/GALE-P3R1-EBNTT-Sep07 group: smt type: text size: 3.76 MB licenser: LDC licensee: UoE webpage: here
name: GALE Phase 3 Release 1 - Found Parallel Text directory: restricted/gale/GALE-P3R1 group: smt type: text size: 222.13 MB licenser: LDC licensee: UoE webpage: here
name: GALE Phase 3 Release 1 - Transcripts directory: restricted/gale/GALE-P3R1 group: smt type: text size: 70.57 MB licenser: LDC licensee: UoE webpage: here
name: GALE Phase 3 Release 1 - Translations directory: restricted/gale/GALE-P3R1 group: smt type: text size: 7.91 MB licenser: LDC licensee: UoE webpage: here
name: GALE Y1 - IBM Arabic-English Word Alignment Corpus directory: restricted/gale/Y1 group: smt type: text size: 25 MB licenser: LDC licensee: UoE webpage: here
name: GALE Y1 Q3 directory: restricted/gale/GALE-Y1Q3 group: smt type: text size: 14 MB licenser: LDC licensee: UoE webpage: here
name: GALE Y1 Q4 directory: restricted/gale/GALE-Y1Q4 group: smt type: text size: 81 MB licenser: LDC licensee: UoE webpage: here
name: GALE Phase 4 Release 1 - Transcripts V1.0 directory: restricted/gale/GALE-P4R1 group: smt type: text size: 72 MB licenser: LDC licensee: UoE webpage: here
name: GALE Phase 4 Release 1 - Translations V1.0 directory: restricted/gale/P4R1 group: smt type: text size: 72 MB licenser: LDC licensee: UoE webpage: here
name: GermaNet (German WordNet) 4.0 directory: restricted/germanet/4.0 group: dmm type: lexicon size: 11 MB licenser: University of Tuebingen licensee: ILCC/HCRC webpage: here
name: GermaNet (German WordNet) 8.0 directory: restricted/germanet/8.0 group: dmm type: lexicon size: 68 MB licenser: University of Tuebingen licensee: University of Edinburgh webpage: here
name: Lancaster-Oslo-Bergen Corpus of British English directory: restricted/lob group: corpman type: text size: 8 MB licenser: ? licensee: LTG? webpage: here
name: London-Lund Corpus of Spoken English directory: restricted/london_lund group: corpman type: text size: 10 MB licenser: ? licensee: LTG? webpage: here
name: Maptask corpora for different languages and situations directory: restricted/maptask group: corpman type: text size: 622 MB licenser: ? licensee: LTG? webpage: ?
name: Medline Corpus directory: restricted/medline group: umls type: text size: 7602 MB licenser: ? licensee: LTG? webpage: ?
name: NTCIR Corpora directory: restricted/ntcir group: - type: text size: 48453 MB licenser: National Institute of Informatics, Tokyo licensee: Bonnie Webber webpage: here, here, and here
name: NEGRA Parsed Corpus of German directory: restricted/negra group: negra type: text size: 55 MB licenser: Saarland University licensee: ILCC/HCRC webpage: here
name: Reuters Corpus Volume 1 (English), Release 2000-11-03 directory: restricted/reuters/english group: reuters01 type: text size: 1012 MB licenser: NIST/Reuters licensee: Informatics/CSTR webpage: here
name: Reuters Corpus Volume 2 (Multilingual), Release 2000-05-31 directory: restricted/reuters/multilingual group: reuters01 type: text size: 622 MB licenser: NIST/Reuters licensee: Informatics/CSTR webpage: here
name: Search Engine Logs (AOL) directory: restricted/searchengine_logs/aol group: querylogs type: text size: 449 MB licenser: AOL licensee: freely available [but privacy concerns, hence restricted] webpage: ?
name: Search Engine Logs (Excite) directory: restricted/searchengine_logs/excite group: querylogs type: text size: 52 MB licenser: Excite licensee: freely available [but privacy concerns, hence restricted] webpage: ?
name: TIGER Parsed Corpus of German directory: restricted/tiger group: negra type: text size: 140 MB licenser: IMS Stuttgart licensee: ILCC/HCRC webpage: here
name: TREC-9 Question Answering Track Corpus directory: restricted/trec/trec9/question_answering group: trec type: text size: 62 MB licenser: NIST licensee: LTG? webpage: here
name: TREC Text Research Collection Vol. 4 directory: restricted/trec/text_collection4 group: trec type: text size: 443 MB licenser: NIST licensee: HCRC (but individual registration required; see Avril) webpage: here
name: TREC Text Research Collection Vol. 5 directory: restricted/trec/text_collection5 group: trec type: text size: 394 MB licenser: NIST licensee: HCRC (but individual registration required; see Avril) webpage: here
name: Tuebingen Partially Parsed Corpus of German, Newspaper (TüPP-D/Z) [based on TAZ corpus] directory: restricted/tuebingen/tueppdz group: negra type: text size: 7651 MB licenser: Tuebingen University licensee: ILCC/HCRC webpage: here
name: Tuebingen Treebank of German, Newspaper (TüBa-D/Z), Release 2 [based on TAZ corpus] directory: restricted/tuebingen/tuebadz/2.0 group: negra type: text size: 185 MB licenser: Tuebingen University licensee: ILCC/HCRC webpage: here
name: Tuebingen Treebank of German, Newspaper (TüBa-D/Z), Release 8 [based on TAZ corpus] directory: restricted/tuebingen/tuebadz/8.0 group: negra type: text size: 185 MB licenser: Tuebingen University licensee: ILCC/HCRC webpage: here
name: Tuebingen Treebank of German, Speech (TüBa-D/S), Release 1 [based on Verbmobil corpus] directory: restricted/tuebingen/tuebads/1.0 group: negra type: speech size: 107 MB licenser: Tuebingen University licensee: ILCC/HCRC webpage: here
name: Tuebingen Treebank of German, Speech (TüBa-D/S), Release 2 [based on Verbmobil corpus] directory: restricted/tuebingen/tuebads/2.0 group: negra type: speech size: 107 MB licenser: Tuebingen University licensee: ILCC/HCRC webpage: here
name: Tuebingen Treebank of English, Speech (TüBa-E/S) [based on Verbmobil corpus] directory: restricted/tuebingen/tuebaes group: negra type: speech size: 107 MB licenser: Tuebingen University licensee: ILCC/HCRC webpage: here
name: Tuebingen Treebank of Japanese, Speech (TüBa-J/S) [based on Verbmobil corpus] directory: restricted/tuebingen/tuebajs group: negra type: speech size: 107 MB licenser: Tuebingen University licensee: ILCC/HCRC webpage: here
name: UMLS Metathesaurus 2005AC directory: restricted/umls group: umls type: text size: 7780 MB licenser: American Medical Association licensee: ILCC/HCRC webpage: here
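
The `directory:` values in the lists above are relative to the corpus root `/group/corpora` described at the top of this page. As a minimal sketch, the following Python snippet turns a `directory:` value into a full path and checks whether the current user can read it. The helper names `corpus_path` and `is_readable` are made up for this illustration and are not part of any local tooling. The sketch also assumes a plain filesystem permission check is a fair proxy for access; on the AFS-backed `large` and `restricted` areas the result may additionally depend on holding a valid AFS token.

```python
import os
import posixpath

# Corpus root as given in the directory table at the top of this page.
CORPUS_ROOT = "/group/corpora"


def corpus_path(directory_field):
    """Join a catalogue 'directory:' value onto the corpus root (hypothetical helper)."""
    # Drop surrounding whitespace and stray slashes, e.g. "public/tac/2008/".
    relative = directory_field.strip().strip("/")
    return posixpath.join(CORPUS_ROOT, relative)


def is_readable(directory_field):
    """Return True if the current user can list and enter the corpus directory."""
    return os.access(corpus_path(directory_field), os.R_OK | os.X_OK)


if __name__ == "__main__":
    for entry in ("public/penn_treebank/3.0", "restricted/negra"):
        print(corpus_path(entry), "readable" if is_readable(entry) else "not readable")
```

Entries whose `directory:` value contains a wildcard, such as `restricted/gale/GALE-P2*`, are not expanded by this sketch. If a restricted directory is reported as not readable, check the `group:` value listed for that corpus before emailing corpus-admin@inf.ed.ac.uk.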