
This page pertains to UD version 2.

Metadata harmonisation

Background

Speech-related metadata in UD treebanks remains heterogeneous and inconsistently encoded in both spoken-only and mixed-modality resources. This makes it difficult to identify spoken material reliably or retrieve data by speech type (e.g. spontaneous vs. prepared, monologue vs. dialogue) or speaker characteristics (e.g. age, gender, education).

This variation (both in the metadata recorded and in its representation in CoNLL-U) was initially documented by Dobrovoljc (2022), who recommends retaining all available speech-specific metadata in line with prevailing treebank practices and the initial recommendations of Kahane et al. (2021). The UniDive survey (see summary of results) and the Grew metadata inventory confirm that this heterogeneity persists: not all mixed-modality treebanks explicitly mark spoken material, and many spoken-only treebanks provide no additional speech-specific metadata; where such metadata is available, it varies in type, granularity, and encoding.

Related UD discussions in issues #1135 and #1146 further underline the need for clearer and more consistent metadata practices.

Core metadata categories and naming

When preparing a spoken UD treebank, two core principles should guide the treatment of metadata: (1) preserve all available metadata associated with the recordings rather than discarding it during conversion to .conllu, and (2) adopt shared naming conventions to avoid reinventing feature names that have already been used in existing treebanks.

Below, we list the speech-related metadata categories that recur most often in existing treebanks and propose standardized names for them, organized by the level at which they apply. The list is not exhaustive: additional metadata may be encoded as needed, but before introducing a new feature, check whether a suitable convention is already used in existing spoken treebanks (see the Grew metadata inventory).

Document-level

| Feature | Description | Examples |
|---|---|---|
| modality | Data modality in mixed-modality treebanks | # modality = spoken, # modality = written, # modality = signed |
| newdoc id | Unique identifier of the speech event | # newdoc id = doc01 |
| sound_url | Link to the audio recording | # sound_url = link-to-audio.mp3 |
| video_url | Link to the video recording | # video_url = link-to-video.mp4 |
| genre | Descriptive label of the speech event (more here) | # genre = interview, # genre = conversation, # genre = lecture |

Speaker-level

| Feature | Description | Examples |
|---|---|---|
| speaker_id | Speaker producing the turn | # speaker_id = Cf-stra-07534 |
| speaker_role | Role in the interaction | # speaker_role = interviewer |
| speaker_age | Age or age range of the speaker | # speaker_age = 18 to 35 |
| speaker_gender | Gender of the speaker, if available | # speaker_gender = female |
| speaker_education | Highest completed education level | # speaker_education = high-school |
| speaker_residence | Place of residence of the speaker | # speaker_residence = south-west |

Sentence-level

| Feature | Description | Examples |
|---|---|---|
| sent_id | Unique identifier of the utterance | # sent_id = doc01.s144 |
| sound_alignment_begin | Start timestamp in the recording (ms) | # sound_alignment_begin = 12340 |
| sound_alignment_end | End timestamp in the recording (ms) | # sound_alignment_end = 14560 |
| duration | Duration of the sentence (ms) | # duration = 2220 |
| text_[type] | Transcription of a given type | # text_orthographic = qu'est-ce que tu fais (other types: text_phonetic, text_morphemic, text_transliteration, text_conversationanalysis, text_macrosyntax) |
| text_[ISO] | Translation into another language (ISO code) | # text_en = what are you doing |
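As a rough illustration of how the sentence-level comment lines above could be consumed, the following Python sketch collects `# key = value` lines into a dictionary and checks that the timestamps agree with the stated duration. The helper name `parse_comments` is hypothetical, and the check assumes that duration is meant to equal sound_alignment_end minus sound_alignment_begin, which holds for the example values but is not stated as a rule here.

```python
def parse_comments(lines):
    """Collect '# key = value' comment lines of one sentence into a dict."""
    meta = {}
    for line in lines:
        if not line.startswith("#"):
            continue  # not a comment line
        key, sep, value = line[1:].partition("=")
        if sep:  # ignore comments that contain no '='
            meta[key.strip()] = value.strip()
    return meta


# Comment lines taken from the examples in the sentence-level table.
sentence = [
    "# sent_id = doc01.s144",
    "# sound_alignment_begin = 12340",
    "# sound_alignment_end = 14560",
    "# duration = 2220",
    "# text_orthographic = qu'est-ce que tu fais",
    "# text_en = what are you doing",
]

meta = parse_comments(sentence)
begin = int(meta["sound_alignment_begin"])
end = int(meta["sound_alignment_end"])

# Assumption: duration is expected to equal end - begin (both in ms).
duration_ok = int(meta["duration"]) == end - begin
print(meta["sent_id"], duration_ok)  # doc01.s144 True
```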

Token-level

| Feature | Description | Examples |
|---|---|---|
| Lang | Language identifier for code-switched tokens | Lang=en |
| OrigLang | Original language of borrowed or inserted tokens | OrigLang=en |
| WordAlignmentBegin | Start timestamp of the token (ms) | WordAlignmentBegin=14120 |
| WordAlignmentEnd | End timestamp of the token (ms) | WordAlignmentEnd=14560 |
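Token-level features of this kind sit in the MISC column (the tenth column of a CoNLL-U token line) as `|`-separated `Key=Value` pairs. The sketch below shows one way to read them in Python; the helper `parse_misc` is hypothetical and the token line is invented for illustration, reusing only the timestamp values from the table.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC value such as 'A=1|B=2' into a dict."""
    if misc == "_":  # '_' marks an empty MISC column
        return {}
    feats = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        feats[key] = value if sep else ""
    return feats


# Invented token line (10 tab-separated columns), for illustration only.
token_line = (
    "4\tfais\tfaire\tVERB\t_\t_\t0\troot\t_\t"
    "WordAlignmentBegin=14120|WordAlignmentEnd=14560"
)

columns = token_line.split("\t")
misc = parse_misc(columns[9])

# Token duration in milliseconds, derived from the two alignment features.
token_ms = int(misc["WordAlignmentEnd"]) - int(misc["WordAlignmentBegin"])
print(columns[1], token_ms)  # fais 440
```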

Taxonomy for describing speech events

Spoken treebanks often describe speech events by the type of interaction recorded. To make such descriptions more comparable across treebanks, we distinguish two complementary approaches: an open, descriptive genre label that treebanks define flexibly, and an optional, fixed set of interaction parameters capturing the main dimensions along which speech events vary.

Genre

Genre, encoded as genre, is an open descriptive label that each treebank defines flexibly to best capture its data. It is the most common way of describing speech events. The most frequently reported genres in the survey are interview, conversation, lecture, speech, narrative, and monologue; others include radio show, TV show, exam, court, vlog, podcast, and commentary. See also the broader UD genre discussion here.

Interaction parameters (optional add-on)

In addition to the open genre label, speech events may be described with a fixed set of interaction parameters. This optional, more fine-grained layer captures the key dimensions of variation in spoken communication. Each parameter takes its value from a controlled set, so that descriptions stay comparable across treebanks:

| Parameter | Values | Meaning (in order) |
|---|---|---|
| degree_of_spontaneity | unplanned, planned, elicited | no prior preparation; prepared or scripted in advance; produced in response to a task or prompt |
| number_of_participants | monologic, dialogic, multi-party | one speaker; two alternating; three or more |
| context | public, private, professional | open/general audience; closed personal setting; institutional or workplace setting |
| setting | face-to-face, telephone, broadcast, online | co-present in one space; audio telephony; mass transmission (radio, TV); internet-mediated |
| channels | phonic-auditory, gestural-visual, graphic-visual | spoken/heard; signed/seen; written/read (may combine) |
| symmetry | symmetric, asymmetric | comparable participant roles; differing roles (e.g. interviewer/interviewee) |

For example, a spontaneous face-to-face conversation between two friends would be described as:

# genre = conversation
# degree_of_spontaneity = unplanned
# number_of_participants = dialogic
# context = private
# setting = face-to-face
# channels = phonic-auditory; gestural-visual
# symmetry = symmetric
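
Because each interaction parameter has a fixed value set, a description like the one above can be checked mechanically. The following Python sketch is one possible check, not an official validator: the names `ALLOWED` and `invalid_values` are hypothetical, and it assumes that only channels may hold several values, separated by `;` as in the example.

```python
# Fixed value sets, copied from the interaction-parameter table.
ALLOWED = {
    "degree_of_spontaneity": {"unplanned", "planned", "elicited"},
    "number_of_participants": {"monologic", "dialogic", "multi-party"},
    "context": {"public", "private", "professional"},
    "setting": {"face-to-face", "telephone", "broadcast", "online"},
    "channels": {"phonic-auditory", "gestural-visual", "graphic-visual"},
    "symmetry": {"symmetric", "asymmetric"},
}


def invalid_values(meta):
    """Return {parameter: [bad values]} for values outside the fixed sets."""
    problems = {}
    for key, allowed in ALLOWED.items():
        if key not in meta:
            continue  # interaction parameters are optional
        if key == "channels":
            # Assumption: combined channels are separated by ';'.
            values = [v.strip() for v in meta[key].split(";")]
        else:
            values = [meta[key].strip()]
        bad = [v for v in values if v not in allowed]
        if bad:
            problems[key] = bad
    return problems


example = {
    "genre": "conversation",  # open label, not checked
    "degree_of_spontaneity": "unplanned",
    "number_of_participants": "dialogic",
    "context": "private",
    "setting": "face-to-face",
    "channels": "phonic-auditory; gestural-visual",
    "symmetry": "symmetric",
}

print(invalid_values(example))                   # {}
print(invalid_values({"setting": "in-person"}))  # {'setting': ['in-person']}
```

The open genre label is deliberately left unchecked, since treebanks define it freely.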

Metadata placement and encoding

This metadata can be encoded directly in the CoNLL-U files following standard practice: token-level features go in the MISC column, and sentence-, speaker-, and document-level information is given as # key = value comment lines. To avoid repeating shared document- and speaker-level metadata on every sentence, there is also a proposal (described in more detail here) to store such metadata in an external metadata.json file, referenced from the CoNLL-U files via stable identifiers such as document_id and speaker_id.
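
To make the external-file idea concrete, here is a minimal Python sketch of how sentence-level comments could be merged with shared metadata looked up by identifier. The JSON layout (top-level "documents" and "speakers" objects keyed by identifier) and the function `resolve` are assumptions made purely for illustration; this page does not specify the schema of metadata.json, and the linked proposal may differ.

```python
import json

# Hypothetical metadata.json content; the real schema is not given here.
METADATA_JSON = """
{
  "documents": {
    "doc01": {"genre": "interview", "modality": "spoken"}
  },
  "speakers": {
    "Cf-stra-07534": {"speaker_role": "interviewer", "speaker_gender": "female"}
  }
}
"""


def resolve(sentence_meta, external):
    """Merge document- and speaker-level metadata into one sentence's metadata."""
    merged = {}
    doc = external["documents"].get(sentence_meta.get("document_id"), {})
    speaker = external["speakers"].get(sentence_meta.get("speaker_id"), {})
    merged.update(doc)
    merged.update(speaker)
    merged.update(sentence_meta)  # values stated on the sentence take precedence
    return merged


external = json.loads(METADATA_JSON)
sentence_meta = {
    "sent_id": "doc01.s144",
    "document_id": "doc01",
    "speaker_id": "Cf-stra-07534",
}

full = resolve(sentence_meta, external)
print(full["genre"], full["speaker_role"])  # interview interviewer
```

Letting sentence-level values override the shared ones is a design choice of this sketch, not part of the proposal.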