
This page pertains to UD version 2.

Maximal Unit Segmentation

Currently available spoken treebanks do not all take the same approach to maximal unit segmentation. More specifically, there is a tension between a prosodic and a syntactic view of sentence completion. The prosodic approach favors a more minimal segmentation, placing sentence boundaries at prosodic termination. The drawback is that these terminations can be hard to identify (they depend on the annotator’s subjective interpretation, and it is difficult to come up with a clear test), and the resulting unit doesn’t always correspond to what we intuitively think of as a sentence. The syntactic approach can lead to a more “maximal” segmentation. The drawback is that it may produce arbitrarily long sentences, because the boundary between discourse functions and syntactic functions is blurred.

Examples from current works:

The general principle that we want to enforce is “if you can link, link”.

This doesn’t necessarily need to be applied strictly. For example, a discourse marker that is a connective could technically link back to the preceding unit, but shouldn’t be linked if it follows a very long pause. Language-specific hints may help us identify the presence of a link.

TODO: explicit reference to rectional unit (Kahane&Pietrandrea) @sylvainkahane

This means that segmentation has to be performed on the basis of various criteria: syntactic, prosodic and semantic at once.

Speaker view vs. Dependency view

Proposal: provide both speaker-based and dependency-based information, and develop a conversion tool to switch between the two “views” so that the data are available in both versions.

More specifically, in the speaker-based view each speaker utterance is a new tree, and the speaker ID is applied to the tree as a whole (# speaker_id metadata). In the dependency-based view, a tree may be the outcome of concatenating utterances from multiple speakers. Each token therefore has a Speaker_id attribute in MISC, as arbitrarily many speakers may contribute to a single tree.
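As an illustration only, here is a minimal Python sketch of one small step such a conversion tool might include: copying the tree-level `# speaker_id` comment onto every token as a `Speaker_id` entry in MISC. The function name `speaker_to_misc`, the `# speaker_id = X` comment syntax, and the sample sentences are assumptions made for this sketch, not an agreed format. The sketch does not merge trees across speakers, since that depends on dependency links that have to be annotated.

```python
def speaker_to_misc(conllu: str) -> str:
    """Copy each tree-level `# speaker_id = X` comment into the MISC column
    (10th column) of every token line in that tree.

    Hypothetical helper: the comment syntax and the `Speaker_id` key follow
    the naming used on this page but are assumptions, not a fixed standard.
    """
    out = []
    speaker = None
    for line in conllu.split("\n"):
        if not line.strip():
            # A blank line ends the current tree, so forget its speaker.
            speaker = None
            out.append(line)
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            if key.strip() == "speaker_id":
                speaker = value.strip()
            out.append(line)
        else:
            cols = line.split("\t")
            if speaker is not None and len(cols) == 10:
                # "_" marks an empty MISC column in CoNLL-U.
                misc = [] if cols[9] == "_" else cols[9].split("|")
                # Leave an existing token-level Speaker_id untouched.
                if not any(m.startswith("Speaker_id=") for m in misc):
                    misc.append(f"Speaker_id={speaker}")
                cols[9] = "|".join(misc)
            out.append("\t".join(cols))
    return "\n".join(out)


# Two invented one-token utterances in the speaker-based view.
example = (
    "# speaker_id = A\n"
    "1\tyes\tyes\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# speaker_id = B\n"
    "1\tokay\tokay\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
)

converted = speaker_to_misc(example)
```

Once every token carries its own `Speaker_id`, the tree-level comment is no longer the only record of who spoke, so trees can later be concatenated without losing speaker information. The sketch keeps the comment in place so that the speaker-based view remains recoverable.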

TODO: @bguil

What to do

In cases of: