CoNLL-U format
The most common way to store dependency structures is the CoNLL format. Several extensions have been proposed; we describe here the one used by Grew, the CoNLL-U format defined by the Universal Dependencies project.
For each sentence, metadata are given in lines beginning with `#`.
The remaining lines describe the tokens of the structure.
Token lines contain 10 fields, separated by tabs.
The file n01118003.conllu is an example of CoNLL-U data taken from the corpus UD_English-PUD (version 2.6).
# newdoc id = n01118
# sent_id = n01118003
# text = Drop the mic.
1 Drop drop VERB VB VerbForm=Inf 0 root 0:root _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
3 mic mic NOUN NN Number=Sing 1 obj 1:obj SpaceAfter=No
4 . . PUNCT . _ 1 punct 1:punct _
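To make the layout concrete, here is a minimal Python sketch (not part of Grew) that splits such a block into metadata and token rows. The function name `parse_sentence` and the `FIELDS` list are illustrative assumptions; the sample rebuilds the sentence above with explicit tab characters.

```python
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_sentence(block):
    """Split one CoNLL-U sentence into (metadata dict, list of token dicts)."""
    meta, tokens = {}, []
    for line in block.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Metadata lines look like "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected 10 tab-separated fields, got {len(values)}")
        tokens.append(dict(zip(FIELDS, values)))
    return meta, tokens

# The example sentence, with tabs written explicitly.
SAMPLE = "\n".join([
    "# newdoc id = n01118",
    "# sent_id = n01118003",
    "# text = Drop the mic.",
    "\t".join(["1", "Drop", "drop", "VERB", "VB", "VerbForm=Inf", "0", "root", "0:root", "_"]),
    "\t".join(["2", "the", "the", "DET", "DT", "Definite=Def|PronType=Art", "3", "det", "3:det", "_"]),
    "\t".join(["3", "mic", "mic", "NOUN", "NN", "Number=Sing", "1", "obj", "1:obj", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", ".", "_", "1", "punct", "1:punct", "_"]),
])

meta, tokens = parse_sentence(SAMPLE)
print(meta["sent_id"], [t["FORM"] for t in tokens])
```

All values are kept as strings, since this sketch only illustrates the line layout and does not interpret the fields.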
We explain here how Grew deals with the 10 fields of CoNLL-U files:
- ID. This field is a number used as an identifier for the corresponding lexical unit (LU).
- FORM. The phonological form of the LU. In Grew, the value of this field is available through a feature named `form` (for backward compatibility, the keyword `phon` can also be used instead of `form`).
- LEMMA. The lemma of the LU. In Grew, this corresponds to the feature `lemma`.
- UPOS. The universal part-of-speech tag. In Grew, this corresponds to the feature `upos` (for backward compatibility, `cat` can also be used to refer to this field).
- XPOS. The language-specific part-of-speech tag. In Grew, this corresponds to the feature `xpos` (for backward compatibility, `pos` can also be used to refer to this field).
- FEATS. List of morphological features.
- HEAD. Head of the current word, which is either a value of ID or `0` for the root node.
- DEPREL. Dependency relation to the HEAD (`root` iff HEAD = `0`).
- DEPS. (UD only) Enhanced dependency graph in the form of a list of head-deprel pairs. In Grew, these relations are encoded with the feature `enhanced=yes`.
- MISC. Any other annotation. In Grew, the annotations in this field are accessible in the same way as the morphological features of the FEATS column.
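The FEATS, MISC and DEPS columns pack several values into one field. The sketch below shows one way to unpack them in Python, following the standard CoNLL-U conventions (`|` between items, `=` inside FEATS/MISC items, `:` between head and relation in DEPS). The helper names `parse_feats` and `parse_deps` are hypothetical, and this is not a description of how Grew stores these values internally.

```python
def parse_feats(value):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means no features."""
    if value == "_":
        return {}
    return dict(item.split("=", 1) for item in value.split("|"))

def parse_deps(value):
    """Turn '1:obj|4:conj' into [('1', 'obj'), ('4', 'conj')]; '_' means empty.

    Heads are kept as strings because enhanced graphs may refer to
    empty nodes, whose identifiers are not plain integers.
    """
    if value == "_":
        return []
    # Split on the first ':' only, so subtyped relations such as
    # 'nmod:poss' stay intact.
    return [tuple(item.split(":", 1)) for item in value.split("|")]

print(parse_feats("Definite=Def|PronType=Art"))
print(parse_deps("1:obj"))
```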
Note that the same format is very often used to describe dependency syntax corpora.
In these cases, a set of sentences is described in the same file using the same convention as above and a blank line as separator between sentences.
It is also required that the `sent_id` metadata be unique for each sentence in the file.
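As an illustration of these two corpus-level conventions, the following Python sketch splits a file's content on blank lines and reports any repeated `sent_id`. The function names are hypothetical, and the choice to skip sentences without a `sent_id` line is an assumption of this sketch, not documented Grew behaviour.

```python
import re

def split_sentences(text):
    """Split CoNLL-U text into sentence blocks separated by blank lines."""
    return [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]

def sent_id_of(block):
    """Return the sent_id declared in a sentence block, or None."""
    for line in block.splitlines():
        m = re.match(r"#\s*sent_id\s*=\s*(.+)", line)
        if m:
            return m.group(1).strip()
    return None

def duplicated_sent_ids(text):
    """List every sent_id that occurs more than once, in order of detection."""
    seen, dups = set(), []
    for block in split_sentences(text):
        sid = sent_id_of(block)
        if sid is None:
            continue  # assumption: sentences without sent_id are ignored here
        if sid in seen and sid not in dups:
            dups.append(sid)
        seen.add(sid)
    return dups

# Made-up two-column rows, only to exercise the splitting logic.
corpus = "# sent_id = s1\n1\tHi\n\n# sent_id = s2\n1\tYo\n\n# sent_id = s1\n1\tHi\n"
print(len(split_sentences(corpus)), duplicated_sent_ids(corpus))
```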
In practice, it may be useful to deal explicitly with the `root` relation (for instance, if some rewriting rule is designed to change the root of the structure).
To allow this, when reading the CoNLL-U format, Grew also creates a node at position `0` and links it with the `root` relation to the linguistic root node of the sentence.
The example above thus produces a graph with 5 nodes: the 4 tokens plus the added node at position `0`.
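The effect of the extra node can be sketched in a few lines of Python. This is only an illustration of the idea, not Grew's data structure: the `build_graph` function, the dictionary layout, and the choice to give node `0` no features are all assumptions.

```python
def build_graph(rows):
    """rows: (ID, FORM, HEAD, DEPREL) tuples taken from token lines.

    Returns (nodes, edges) where edges are (head, relation, dependent).
    """
    nodes = {"0": {}}  # extra node at position 0; empty features is an assumption
    edges = []
    for tid, form, head, deprel in rows:
        nodes[tid] = {"form": form}
        # HEAD = 0 yields an ordinary edge from node 0, labelled root.
        edges.append((head, deprel, tid))
    return nodes, edges

rows = [
    ("1", "Drop", "0", "root"),
    ("2", "the", "3", "det"),
    ("3", "mic", "1", "obj"),
    ("4", ".", "1", "punct"),
]
nodes, edges = build_graph(rows)
print(len(nodes), edges[0])
```

With node `0` present, the `root` relation is an edge like any other, which is what lets a rule match or rewrite it.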
About CoNLL-U field names
In Grew nodes, fields 2, 3, 4 and 5 of the CoNLL-U structure are treated as features with the following names.
CoNLL-U field | 2 | 3 | 4 | 5 |
---|---|---|---|---|
Name | `form` | `lemma` | `upos` | `xpos` |
For instance:
- matching the word "is" → `pattern { N [form="is"] }`
- matching the lemma "be" → `pattern { N [lemma="be"] }`
Note about backward compatibility
In older versions of Grew (before the definition of the CoNLL-U format), fields 2, 4 and 5 were accessible with the names `phon`, `cat` and `pos` respectively.
To ensure backward compatibility and uniform handling of these fields, the names `phon`, `cat` and `pos` are replaced at parsing time by `form`, `upos` and `xpos`.
As a consequence, it is impossible to use both `phon` and `form` in the same system.
We highly recommend using only the `form` feature in this setting. Of course, the same observation applies to `cat` and `upos` (`upos` should be preferred) and to `pos` and `xpos` (`xpos` should be preferred).
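The renaming described above can be pictured with the following Python sketch. It is not Grew's implementation: the `LEGACY_NAMES` table and `normalize_features` function are hypothetical, and raising an error on conflicting values is a design choice of this sketch that simply makes the clash between `phon` and `form` visible.

```python
LEGACY_NAMES = {"phon": "form", "cat": "upos", "pos": "xpos"}

def normalize_features(features):
    """Rename legacy feature names to their CoNLL-U counterparts."""
    out = {}
    for name, value in features.items():
        new_name = LEGACY_NAMES.get(name, name)
        if new_name in out and out[new_name] != value:
            # After renaming, both names denote the same feature,
            # so they cannot carry two different values.
            raise ValueError(f"conflicting values for {new_name!r}")
        out[new_name] = value
    return out

print(normalize_features({"phon": "is", "cat": "AUX"}))
```

Because both spellings end up as a single feature, a system mixing `phon` and `form` cannot tell them apart, which is why only the new names are recommended.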
Additional features `textform` and `wordform`
To handle the cases where the text of the original sentence and the corresponding linguistic unit differ, a systematic use of the two features `textform` and `wordform` was proposed in #683.
The two fields are built from CoNLL-U data in the following way:
- If a multiword token `i-j` is declared:
  - the `textform` of the first token is the `FORM` field of the multiword token
  - the `textform` of each other token is `_`
- If the token is an empty node (exists only in EUD): `textform=_` and `wordform=__EMPTY__`
- For each token without `textform` feature, the `textform` is set to the `FORM` field value
- For each token without `wordform` feature, the `wordform` is set to the `FORM` field value
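The rules above can be read as a small procedure. The Python sketch below is one possible rendering under stated assumptions, not Grew's code: rows are dictionaries with `ID` and `FORM` plus optional explicit `textform`/`wordform` values, explicit values already present in the data are never overwritten, empty nodes are recognised by a `.` in their ID, and the sample rows are made up for illustration.

```python
def add_text_and_word_forms(rows):
    """Return token nodes with textform/wordform filled in.

    Multiword-token lines (ID of the form 'i-j') are consumed and
    do not appear in the result.
    """
    tokens = []
    pending = {}  # token ID -> textform imposed by a multiword token
    for row in rows:
        tid = row["ID"]
        if "-" in tid:
            first, last = (int(x) for x in tid.split("-"))
            pending[str(first)] = row["FORM"]   # first token carries the text
            for k in range(first + 1, last + 1):
                pending[str(k)] = "_"           # the others carry no text
            continue
        tok = dict(row)
        if "." in tid:                          # empty node
            tok.setdefault("textform", "_")
            tok.setdefault("wordform", "__EMPTY__")
        if tid in pending:
            tok.setdefault("textform", pending[tid])
        # Defaults for everything still missing.
        tok.setdefault("textform", tok["FORM"])
        tok.setdefault("wordform", tok["FORM"])
        tokens.append(tok)
    return tokens

# Made-up rows: a multiword token 1-2, a plain token, and an empty node.
rows = [
    {"ID": "1-2", "FORM": "du"},
    {"ID": "1", "FORM": "de"},
    {"ID": "2", "FORM": "le"},
    {"ID": "3", "FORM": "pain"},
    {"ID": "3.1", "FORM": "_"},
]
for tok in add_text_and_word_forms(rows):
    print(tok["ID"], tok["textform"], tok["wordform"])
```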
⚠️ In places where `wordform` should be different from the `FORM` field, this should be expressed in the data with an explicit `wordform` feature.
This includes:
- lowercased form of initial word or potentially other words in the sentence
- typographical or orthographical errors
- tokens linked by a `goeswith` relation
See a few examples in SUD_French-GSD.