This page pertains to UD version 2.

Current spoken treebanks

treebank_name type available since link genre contact sentences tokens is spoken part clearly identifiable? corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Abaza ATB only spoken 2.11 https://github.com/UniversalDependencies/UD_Abaza-ATB spoken alexeykochevoy@gmail.com 98 652 yes     text_rus text_transcription;text_orth   text_name                      
Alemannic DIVITAL mixed 2.17 https://github.com/UniversalDependencies/UD_Alemannic-DIVITAL fiction nonfiction legal spoken wiki bible dbernhard@unistra.fr 977 19334       language variety ; language_glottolog text_orig; annotation author newdoc id ; title ; domain; factuality ; sample_type; origin channel; form; discourse_type;genre; audience ; translator; transcriber; source                    
Beja Autogramm only spoken 2.8 https://github.com/UniversalDependencies/UD_Beja-Autogramm spoken martine.vanhove@cnrs.fr; sylvain@kahane.fr 763 11951 yes   sound_url text_en phonetic_text speaker_id     sent_timecode                  
Bokota ChibErgIS only spoken 2.16 https://github.com/UniversalDependencies/UD_Bokota-ChibErgIS spoken marie.benzerrak@laposte.net 406 2713 yes   sound_url text_en text_ortho; morphemic_text speaker_id sent_timecode                      
Bororo BDT mixed 2.12 https://github.com/UniversalDependencies/UD_Bororo-BDT grammar-examples spoken nonfiction bible fabricio.gerardi@uni-tuebingen.de 21384 160356   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Cantonese HK only spoken 2.1 https://github.com/UniversalDependencies/UD_Cantonese-HK spoken tswong-c@my.cityu.edu.hk; jsylee@cityu.edu.hk 1004 13918 yes           _filename   parallel_id                  
Central_Romani Selice only spoken 2.16 https://github.com/UniversalDependencies/UD_Central_Romani-Selice spoken zeman@ufal.mff.cuni.cz     yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Chinese HK only spoken 2.1 https://github.com/UniversalDependencies/UD_Chinese-HK spoken tswong-c@my.cityu.edu.hk; jsylee@cityu.edu.hk 1004 9874 yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Chukchi HSE only spoken 2.7 https://github.com/UniversalDependencies/UD_Chukchi-HSE spoken ftyers@iu.edu 1004 5389 yes     text[eng];text[eng’];text[rus] text[phon]       timestamp comment;labels                
Classical_Nahuatl FloCo mixed 2.12 https://github.com/UniversalDependencies/UD_Classical_Nahuatl-FloCo spoken fiction grammar-examples nonfiction pughrob@iu.edu       corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Czech PDTC mixed 1.0 https://github.com/UniversalDependencies/UD_Czech-PDTC news reviews nonfiction academic spoken social zeman@ufal.mff.cuni.cz 213897 3432078   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Danish DDT mixed 1.1 https://github.com/UniversalDependencies/UD_Danish-DDT news fiction spoken nonfiction zeman@ufal.mff.cuni.cz 5512 100733   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Dargwa Mehweb only spoken 2.1 https://github.com/UniversalDependencies/UD_Dargwa-Mehweb spoken sasha.kozhukhar@gmail.com, olesar@yandex.ru     yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
English CHILDES only spoken 2.16 https://github.com/UniversalDependencies/UD_English-CHILDES spoken xy236@georgetown.edu 48183 289817 yes corpus_name     gold_annotation;childes_toks speaker_role;child_name;child_gender;chi l d;child_age     type;original_sent_id;s_24_sent_id                  
English ESLSpok only spoken 2.12 https://github.com/UniversalDependencies/UD_English-ESLSpok spoken kkyle2@uoregon.edu 2320 21312 yes ———————————— ———————————— ———————————— ———————————— ———————————— ———————————— ———————————— ————————————                  
English GENTLE mixed 2.12 https://github.com/UniversalDependencies/UD_English-GENTLE academic grammar-examples legal medical nonfiction poetry social spoken amir.zeldes@georgetown.edu 1334 17619     meta::sourceURL   meta::salientEntities;transition speaker newdoc id;newpar_block;meta::speaker_count;meta::summary3;meta::summary2;meta::summary1;meta::author;meta::summary4;meta::summary5;meta::dateCollected;meta::dateModified;meta::dateCreated;meta::title meta::genre;addressee s_prominence;s_type global.Entity                
English GUM mixed 2.2 https://github.com/UniversalDependencies/UD_English-GUM academic blog email fiction government legal news nonfiction social spoken web wiki amir.zeldes@georgetown.edu 13263 229851     meta::sourceURL   meta::salientEntities;transition speaker newdoc id;newpar_block;meta::speaker_count;meta::summary3;meta::summary2;meta::summary1;meta::author;meta::summary4;meta::summary5;meta::dateCollected;meta::dateModified;meta::dateCreated;meta::title meta::genre;addressee s_prominence;s_type global.Entity                
French ParisStories only spoken 2.9 https://github.com/UniversalDependencies/UD_French-ParisStories spoken gerdes@lisn.fr 2776 42257 yes   sound_url   macrosyntax speaker                        
French Rhapsodie only spoken 2.2 https://github.com/UniversalDependencies/UD_French-Rhapsodie spoken kim@gerdes.fr 3209 43699 yes corpus metadata sound_url languages and translation(s) macrosyntax;prosodic_annotation speaker_id;speaker;speaker_education;speaker_family_social_role;speaker_fullname;speaker_age;speaker_role;speaker_sex;genre;event_structure;subgenre;involvement;modalities;interactivity;planning_type;subject;social_context task type;channel sent metadata old_id                
Frisian_Dutch Fame only spoken 2.8 https://github.com/UniversalDependencies/UD_Frisian_Dutch-Fame spoken a.r.y.braggaar@student.rug.nl 400 3729 yes       text_switch speaker newdoc id                      
Gheg GPS only spoken 2.11 https://github.com/UniversalDependencies/UD_Gheg-GPS spoken christiangeorg.ebert@uzh.ch, barbara.sonnenhauser@uzh.ch, paul.widmer@uzh.ch 966 15990 yes ———————————— ———————————— ———————————— ———————————— ———————————— ———————————— ———————————— ————————————                  
Greek GDT mixed 1.1 https://github.com/UniversalDependencies/UD_Greek-GDT news wiki spoken prokopis@ilsp.gr 2521 61773             newdoc id                      
Greek Lesbian mixed 2.16 https://github.com/UniversalDependencies/UD_Greek-Lesbian grammar-examples spoken fiction s.bompolas@athenarc.gr 540 5733   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Hausa NorthernAutogramm only spoken 2.14 https://github.com/UniversalDependencies/UD_Hausa-NorthernAutogramm spoken bernard.l.caron@gmail.com 423 4116 yes   sound_url text_en phonetic_text speaker_id     sent_timecode                  
Hausa SouthernAutogramm only spoken 2.14 https://github.com/UniversalDependencies/UD_Hausa-SouthernAutogramm spoken bernard.l.caron@gmail.com 1927 14401 yes   sound_url text_en text_ortho speaker_id     sent_timecode                  
Hausa WesternAutogramm mixed 2.17 https://github.com/UniversalDependencies/UD_Hausa-WesternAutogramm fiction nonfiction spoken bernard.l.caron@gmail.com 775 13862   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Hebrew IAHLTknesset mixed 2.15 https://github.com/UniversalDependencies/UD_Hebrew-IAHLTknesset government spoken amir.zeldes@georgetown.edu 2883 50499           speaker;is_valid_speaker;is_chairman newdoc id;newpar id   turnnumber                  
Highland_Puebla_Nahuatl ITML mixed 2.13 https://github.com/UniversalDependencies/UD_Highland_Puebla_Nahuatl-ITML spoken grammar-examples nonfiction pughrob@iu.edu 1260 10018       text[spa];text[orig] text[gloss]         labels;text[a139];text[a21];text[a70];text[a86];textl[a140];text[a94];text[a27];text[a113];text[a77];text[a205];text[a52];text[a127];text[a89];text[a110];text[a125];text[a82];text[a44];text[a19];text[a118];text[a59];text[a57];text[a174];text[a151];text[a14];text[a211];text[a48];text[a88];text[a105];text[a135];text[a150];text[a229];text[a5];text[a54];text[a114];text[a25];text[a26];text[a16];text[a69];text[a134];text[a230];text[a212];text[a58];text[a76];text[a23];text[a49];text[a91];text[a7];text[a10];text[a102];text[a112];text[a29];text[a6];text[a43];text[a115];text[a50];text[a24];text[a108];text[a47];text[a227];text[a111];text[a170];text[a203];text[a136];text[a101];text[a131];text[a40];text[a81];text[a128];text[a87];text[a85];text[a158];text[a80];text[a95];text[a15];text[a22];text[a53];text[a74];text[a51];text[a116];text[a96];text[a18];text[a75];text[a119];text[a157];text[a9];text[a28];text[a17];text[a45];text[a90];text[a130];text[a210];text[a46];text[a31];text[a97];text[2];text[a84];text[a30];text[a104];text[a39];text[a99];text[a4];text[rus];text[a11];text[a232];text[a138]                
Ika ChibErgIS only spoken 2.16 https://github.com/UniversalDependencies/UD_Ika-ChibErgIS spoken jana.bajorat@hu-berlin.de 628 5307 yes   sound_url text_en morphemic_text;text_phrase-gls-es;text_phrase-gls-tl speaker_id     sent_timecode tags                
Italian KIParlaForest only spoken 2.17 https://github.com/UniversalDependencies/UD_Italian-KIParlaForest spoken ellepannitto@gmail.com 1007 9135 yes       jefferson_text speaker_id conversation_id                      
Japanese JDD only spoken 2.15 https://github.com/UniversalDependencies/UD_Japanese-JDD spoken masayu-a@ninjal.ac.jp     yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Khoekhoe KDT mixed 2.16 https://github.com/UniversalDependencies/UD_Khoekhoe-KDT fiction grammar-examples spoken kira.tulchynska@mail.huji.ac.il, witzlack@gmail.com 3589 27611   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Khunsari AHA mixed 2.7 https://github.com/UniversalDependencies/UD_Khunsari-AHA grammar-examples spoken amojiry@gmail.com 10 74       text_en;text_fa                            
Komi_Zyrian IKDP only spoken 2.2 https://github.com/UniversalDependencies/UD_Komi_Zyrian-IKDP spoken nikotapiopartanen@gmail.com 214 2304 yes corpus_version   text_en;text_ru;text_end           comment;label                
Latvian LVTB mixed 1.3 https://github.com/UniversalDependencies/UD_Latvian-LVTB news fiction legal spoken academic lauma@ailab.lv, normunds@ailab.lv 19580 330318   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Ligurian GLT mixed 2.9 https://github.com/UniversalDependencies/UD_Ligurian-GLT nonfiction fiction news wiki bible spoken grammar-examples stefano.lusito@uibk.ac.at 316 6568   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Naija NSC only spoken 2.2 https://github.com/UniversalDependencies/UD_Naija-NSC spoken kim@gerdes.fr 9241 140837 yes   sound_url text_en text_ortho speaker_id;speaker_education;speaker_age;speaker_sex;speaker_resindence;speaker_naija_competency;speaker_birthplace;speaker_primary_other_language                        
Nayini AHA mixed 2.7 https://github.com/UniversalDependencies/UD_Nayini-AHA grammar-examples spoken amojiry@gmail.com 10 78       text_en;text_fa                            
Nenets Tundra only spoken 2.16 https://github.com/UniversalDependencies/UD_Nenets-Tundra spoken mus.nikolett@gmail.com 170 1272 yes corpus metadata sound_url;media text_en;text_ru;language_variety text_p;translit;p_text speaker metadata doc_title_ modality metadata sent metadata varia                
Nheengatu CompLin mixed 2.11 https://github.com/UniversalDependencies/UD_Nheengatu-CompLin spoken bible fiction nonfiction grammar-examples leonel.de.alencar@ufc.br 2742 25645   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Northwest_Gbaya Autogramm only spoken 2.15 https://github.com/UniversalDependencies/UD_Northwest_Gbaya-Autogramm spoken pauletteroulon@gmail.com 403 2692 yes   sound_url text_fr phonetic_text speaker_id     sent_timecode                  
Norwegian NynorskLIA only spoken 2.1 https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA spoken liljao@ifi.uio.no     yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Persian Seraji mixed 1.1 https://github.com/UniversalDependencies/UD_Persian-Seraji news fiction medical legal social spoken nonfiction mojgan.seraji96@gmail.com 5997 151627   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Pesh ChibErgIS only spoken 2.15 https://github.com/UniversalDependencies/UD_Pesh-ChibErgIS spoken natalia.caceres.arandia@cnrs.fr 524 4275 yes corpus metadata sound_url text_en morphemic_text;text_phrase-gls-es;text_phrase-gls-tl;text_phrase-gls-de;text_phrase-gls-pro;text_phrase-gls-wg;text_phrase-gls-it speaker_id doc (and paragraphs) metadata modality metadata sent_timecode tags                
Polish LFG mixed 2.2 https://github.com/UniversalDependencies/UD_Polish-LFG fiction nonfiction news spoken social aep@ipipan.waw.pl, adamp@ipipan.waw.pl 17246 130967   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Scottish_Gaelic ARCOSG mixed 2.5 https://github.com/UniversalDependencies/UD_Scottish_Gaelic-ARCOSG nonfiction fiction news spoken colin.r.batchelor@googlemail.com 4748 86139           speaker newdoc id     comment;revision                
Skolt_Sami Giellagas mixed 2.5 https://github.com/UniversalDependencies/UD_Skolt_Sami-Giellagas nonfiction news spoken rueter.jack@gmail.com 250 2961   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Slovenian SST only spoken 1.3 https://github.com/UniversalDependencies/UD_Slovenian-SST spoken kaja.dobrovoljc@ff.uni-lj.si 6121 98393 yes   sound_url     speaker_id newdoc id                      
Soi AHA mixed 2.7 https://github.com/UniversalDependencies/UD_Soi-AHA grammar-examples spoken amojiry@gmail.com 8 55       text_en;text_fa                            
South_Levantine_Arabic MADAR mixed 2.7 https://github.com/UniversalDependencies/UD_South_Levantine_Arabic-MADAR spoken social shorouqjzahra@gmail.com 100 789   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Spanish COSER only spoken 2.14 https://github.com/UniversalDependencies/UD_Spanish-COSER spoken johnatan.bonillahuerfano@ugent.be 539 7987 yes                                  
Swedish_Sign_Language SSLC only spoken 1.4 https://github.com/UniversalDependencies/UD_Swedish_Sign_Language-SSLC spoken robert@ling.su.se 203 1610 yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Telugu_English TECT only spoken 2.14 https://github.com/UniversalDependencies/UD_Telugu_English-TECT spoken anishka18v@gmail.com 97 456 yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Turkish_English BUTR only spoken 2.16 https://github.com/UniversalDependencies/UD_Turkish_English-BUTR spoken furkanakkurt7242@icloud.com 51 393 yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Turkish_German SAGT only spoken 2.7 https://github.com/UniversalDependencies/UD_Turkish_German-SAGT spoken ozlem@ims.uni-stuttgart.de 2184 36934 yes ——————-   ——————-   ——————-   ——————-   ——————-   ——————-   ——————-   ——————-   ——————-
Ukrainian ParlaMint mixed 2.15 https://github.com/UniversalDependencies/UD_Ukrainian-ParlaMint government legal spoken corpus.textiv@gmail.com 7142 109166   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Vietnamese TueCL only spoken 2.14 https://github.com/UniversalDependencies/UD_Vietnamese-TueCL spoken hoa.do@student.uni-tuebingen.de,cagri.coeltekin@uni-tuebingen.de 100 1888 yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Western_Armenian ArmTDP mixed 2.8 https://github.com/UniversalDependencies/UD_Western_Armenian-ArmTDP blog fiction news nonfiction reviews social spoken web wiki marat.yavrumyan@ysu.am 6644 121432   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Western_Sierra_Puebla_Nahuatl MesoTree mixed 2.11 https://github.com/UniversalDependencies/UD_Western_Sierra_Puebla_Nahuatl-MesoTree spoken fiction grammar-examples nonfiction pughrob@iu.edu           text[spa];text[orig]           labels                
Yiddish YiTB mixed 2.17 https://github.com/UniversalDependencies/UD_Yiddish-YiTB grammar-examples learner-essays bible wiki fiction nonfiction spoken web m.kirkandrews@gmail.com 3054 27488   corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                
Zazaki ZSD only spoken 2.17 https://github.com/UniversalDependencies/UD_Zazaki-ZSD spoken luigi.talamo@uni-saarland.de     yes corpus metadata media data languages and translation(s) transcription and annotation levels available speaker metadata doc (and paragraphs) metadata modality metadata sent metadata varia                

Harmonization Pipeline:

  1. Is the spoken part clearly identifiable?
    1. yes > add the sentence-level comment # modality = spoken to the relevant sentences
    2. no > open an ISSUE with text XXXX
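
The "yes" branch could be scripted roughly as follows. This is a minimal sketch, not part of any official UD tooling: the function name add_spoken_modality and the is_spoken callback are hypothetical, and it assumes the treebank is in CoNLL-U format, with sentences separated by blank lines and sentence-level comments starting with #. By default every sentence is treated as spoken, which would only fit an "only spoken" treebank; for a mixed treebank the caller has to supply its own selection rule.

```python
import re
from typing import Callable, List

MODALITY_COMMENT = "# modality = spoken"


def add_spoken_modality(
    conllu_text: str,
    is_spoken: Callable[[List[str]], bool] = lambda comments: True,
) -> str:
    """Return conllu_text with '# modality = spoken' added to selected sentences.

    is_spoken receives the comment lines of one sentence and decides whether
    that sentence belongs to the spoken part (hypothetical hook; the default
    marks every sentence).
    """
    if not conllu_text.strip():
        return ""
    out_blocks = []
    # Sentences are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", conllu_text.strip()):
        lines = block.split("\n")
        comments = [line for line in lines if line.startswith("#")]
        tokens = [line for line in lines if not line.startswith("#")]
        # Skip sentences that already carry a modality comment.
        has_modality = any(
            c.lstrip("#").split("=", 1)[0].strip() == "modality" for c in comments
        )
        if not has_modality and is_spoken(comments):
            comments.append(MODALITY_COMMENT)
        # Comment lines precede the token lines of a sentence.
        out_blocks.append("\n".join(comments + tokens))
    # Each sentence, including the last one, ends with a blank line.
    return "\n\n".join(out_blocks) + "\n\n"


sample = (
    "# sent_id = s1\n"
    "# text = hello\n"
    "1\thello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = s2\n"
    "# modality = spoken\n"
    "# text = yes\n"
    "1\tyes\tyes\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
result = add_spoken_modality(sample)
```

In a mixed treebank, the callback might select sentences by an existing comment, for example is_spoken=lambda comments: "# genre = spoken" in comments, where the genre comment is again only an assumed example of how the spoken part could be marked.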

Workflow