Disclaimer: This page presents the output of the UniDive WG1 T1.5 group. It is not yet meant to be read as official guidelines; it will evolve into UD guidelines in the future.
Guidelines for Spoken Language UD Treebanks
- Basic principles
- Maximal unit segmentation
- Tokenization and word segmentation
- pauses (filled vs. silent vs. long), non-verbal behaviours, punctuation, anonymization/pseudonymization, incomprehensible speech signal
- repetitions, false starts, reformulations, and other reparanda
- Morphology
- interrupted words, lemmas
- what is INTJ
- Syntax
- specific syntax
- co-constructed syntax and handling of overlap (in Syntax)
- question answering
- Speech specific metadata
- CoNLL-U format and treebank structure
- Documentation of tags, features and relations
- POS tags:
- Syntactic relations:
- conj:reform
- [Discourse relation]
- discourse:backchannel
- discourse:filledpause
- discourse:filler
- [Parataxis]
- parataxis:insert
- parataxis:parent
- reparandum
- MISC attributes
- Current UD Spoken treebanks
To do:
- Maximal unit segmentation -> to refine
- Tokenization -> to discuss together in toto, comment on doc and add examples
- Morphology -> to discuss together in toto, comment on doc and add examples
- Syntax -> to discuss together (except co-construction)
- Metadata -> to refine, group pull requests
- Discourse relation -> to discuss together