CoNLL-U format

NB: The documentation given here corresponds to Grew version 1.16 (linked to conll version 1.18.1). You can check your versions with opam list | grep grew and opam list | grep conll.

The most common way to store dependency structures is the CoNLL format. Several extensions have been proposed; we describe here the one used by Grew, the CoNLL-U format defined in the UD (Universal Dependencies) project. Grew also handles the CoNLL-U plus format; see the CoNLL-U plus page.

For a sentence, metadata are given in lines beginning with #. The remaining lines describe the tokens of the structure. Token lines contain 10 fields, separated by tabs.

The file n01118003.conllu is an example of CoNLL-U data taken from the corpus UD_English-PUD (version 2.14).

# newdoc id = n01118
# sent_id = n01118003
# text = Drop the mic.
1	Drop	drop	VERB	VB	VerbForm=Inf	0	root	0:root	_
2	the	the	DET	DT	Definite=Def|PronType=Art	3	det	3:det	_
3	mic	mic	NOUN	NN	Number=Sing	1	obj	1:obj	SpaceAfter=No
4	.	.	PUNCT	.	_	1	punct	1:punct	_
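
As a rough illustration of the layout described above (not Grew's own parser), the following Python sketch separates the metadata lines from the token lines of one sentence and checks that each token line has 10 tab-separated fields. The function name read_sentence and the choice to store metadata in a dictionary are assumptions made for this example.

```python
def read_sentence(block):
    """Split one CoNLL-U sentence into (metadata, token_rows)."""
    metadata = {}
    token_rows = []
    for line in block.splitlines():
        if not line.strip():
            continue  # ignore blank lines around the sentence
        if line.startswith("#"):
            # Metadata lines have the shape "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key.strip()] = value.strip()
        else:
            fields = line.split("\t")
            if len(fields) != 10:
                raise ValueError(f"expected 10 tab-separated fields: {line!r}")
            token_rows.append(fields)
    return metadata, token_rows


# The example sentence above, with tabs written as \t.
SAMPLE = "\n".join([
    "# newdoc id = n01118",
    "# sent_id = n01118003",
    "# text = Drop the mic.",
    "1\tDrop\tdrop\tVERB\tVB\tVerbForm=Inf\t0\troot\t0:root\t_",
    "2\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t3\tdet\t3:det\t_",
    "3\tmic\tmic\tNOUN\tNN\tNumber=Sing\t1\tobj\t1:obj\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t.\t_\t1\tpunct\t1:punct\t_",
])

metadata, token_rows = read_sentence(SAMPLE)
print(metadata["sent_id"], len(token_rows))
```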

We explain here how Grew deals with the 10 fields of CoNLL-U files:

  1. ID: This field is a number used as an internal identifier for the corresponding lexical unit (LU); it cannot be accessed directly from Grew.
  2. FORM: The phonological form of the LU; in Grew, the value of this field is available through a feature named form.
  3. LEMMA: The lemma of the LU; in Grew, this corresponds to the feature named lemma.
  4. UPOS: The universal POS; in Grew, it is encoded as a feature named upos.
  5. XPOS: A language-specific part-of-speech tag; in Grew, it is encoded as a feature named xpos.
  6. FEATS: List of morphological features; each feature is turned into a Grew node feature.
  7. HEAD: Head of the current word, which is either a value of ID or 0 for the root node.
  8. DEPREL: Dependency relation to the HEAD (root iff HEAD = 0).
  9. DEPS: (UD only) Enhanced dependency graph in the form of a list of head-deprel pairs. In Grew, these relations are encoded with the edge feature enhanced=yes.
  10. MISC: Any other annotation. See below for the way Grew parses this field.
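
The mapping in the list above can be sketched in Python as follows. This is only an illustration under assumptions, not Grew's implementation: the function name token_to_node and the dictionary shapes used for features and edges are hypothetical, the MISC field is left aside, and no special treatment of _ values in LEMMA, UPOS or XPOS is attempted.

```python
def token_to_node(fields):
    """Turn the 10 fields of a token line into (features, edges)."""
    id_, form, lemma, upos, xpos, feats, head, deprel, deps, misc = fields

    # Fields 2 to 5 become the features form, lemma, upos and xpos.
    features = {"form": form, "lemma": lemma, "upos": upos, "xpos": xpos}

    # Field 6: each morphological feature becomes a node feature.
    if feats != "_":
        for item in feats.split("|"):
            name, _, value = item.partition("=")
            features[name] = value

    # Fields 7 and 8: the basic dependency relation.
    edges = []
    if head != "_":
        edges.append({"head": head, "label": deprel})

    # Field 9: enhanced relations carry the edge feature enhanced=yes.
    if deps != "_":
        for pair in deps.split("|"):
            enhanced_head, _, label = pair.partition(":")
            edges.append({"head": enhanced_head, "label": label, "enhanced": "yes"})

    return features, edges


fields = "3\tmic\tmic\tNOUN\tNN\tNumber=Sing\t1\tobj\t1:obj\tSpaceAfter=No".split("\t")
features, edges = token_to_node(fields)
print(features)
print(edges)
```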

A few examples of usage in Grew requests:

Note that the CoNLL-U format is very often used to describe dependency syntax corpora. In these cases, a set of sentences is described in the same file using the same convention as above, with a blank line as separator between sentences. It is also required that each sentence is given a sent_id metadata value that is unique in the corpus.
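
A minimal Python sketch of these two corpus-level conventions (blank-line separation and unique sent_id) is given below. The helper names split_sentences, get_sent_id and check_sent_ids are assumptions made for this example; they are not part of Grew or of the conll library.

```python
import re


def split_sentences(corpus_text):
    """Split a corpus into sentence blocks separated by blank lines."""
    blocks = re.split(r"\n\s*\n", corpus_text.strip())
    return [block for block in blocks if block.strip()]


def get_sent_id(block):
    """Return the sent_id metadata of a sentence block, or None."""
    for line in block.splitlines():
        match = re.match(r"#\s*sent_id\s*=\s*(.+)", line)
        if match:
            return match.group(1).strip()
    return None


def check_sent_ids(corpus_text):
    """List the problems found: missing or duplicated sent_id values."""
    seen = set()
    problems = []
    for index, block in enumerate(split_sentences(corpus_text), start=1):
        sid = get_sent_id(block)
        if sid is None:
            problems.append(f"sentence {index}: missing sent_id")
        elif sid in seen:
            problems.append(f"sentence {index}: duplicate sent_id {sid}")
        else:
            seen.add(sid)
    return problems


good = "# sent_id = a\n1\tHi\n\n# sent_id = b\n1\tBye\n"
bad = "# sent_id = a\n1\tHi\n\n# sent_id = a\n1\tBye\n"
print(check_sent_ids(good))
print(check_sent_ids(bad))
```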

The anchor node at position 0

In order to be able to request or to manipulate the root relation (for instance, if some rewriting rule is designed to change the root of the structure), we need to add a special node at position 0 (called the “anchor” node) which is the source of the root relation.

Hence, the 4-token example above produces the 5-node graph below:

Dependency structure

This special node has only the form feature, set to __0__, and no other feature. In a Grew request, to prevent the special node from being matched, one can add a upos constraint. For instance, with the request pattern { X [] } all 5 nodes of the above graph can be matched, whereas with the request pattern { X [upos] } only the 4 nodes associated with real tokens can be matched.
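
The effect of the anchor node on matching can be imitated with a small Python sketch. It does not use Grew: build_nodes and nodes_with are hypothetical helpers that only mimic the two requests pattern { X [] } and pattern { X [upos] } on a dictionary of nodes.

```python
ANCHOR_ID = "0"


def build_nodes(token_rows):
    """Map node ids to feature dicts, adding the anchor node at position 0."""
    nodes = {ANCHOR_ID: {"form": "__0__"}}  # the anchor has only a form
    for fields in token_rows:
        id_, form, lemma, upos, xpos = fields[:5]
        nodes[id_] = {"form": form, "lemma": lemma, "upos": upos, "xpos": xpos}
    return nodes


def nodes_with(nodes, feature=None):
    """Ids of the nodes matching X [] (feature=None) or X [feature]."""
    return [
        node_id
        for node_id, feats in nodes.items()
        if feature is None or feature in feats
    ]


rows = [
    ["1", "Drop", "drop", "VERB", "VB"],
    ["2", "the", "the", "DET", "DT"],
    ["3", "mic", "mic", "NOUN", "NN"],
    ["4", ".", ".", "PUNCT", "."],
]
nodes = build_nodes(rows)
print(nodes_with(nodes))          # imitates pattern { X [] }
print(nodes_with(nodes, "upos"))  # imitates pattern { X [upos] }
```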

Layered features

Universal Dependencies proposes a notion of layered features for cases where the same feature can be marked more than once. For instance, the French word votre is a possessive determiner, introducing a singular entity but referring to a plural possessor. In the CoNLL-U FEATS field, this is encoded as Number=Sing|Number[psor]=Plur.

Unfortunately, the bracket notation in the feature name conflicts with other uses of brackets in Grew syntax. In Grew, the bracket notation is therefore replaced by an alternative one with a double underscore: the (S)UD feature name Number[psor] is written Number__psor. For instance:
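
The renaming of layered features can be sketched in Python as below. This is an illustration of the naming convention only; the function names to_grew_name and parse_feats and the regular expression are assumptions for this example, not code taken from Grew.

```python
import re

# A layered UD feature name looks like Name[layer].
_LAYERED = re.compile(r"^(\w+)\[(\w+)\]$")


def to_grew_name(ud_name):
    """Rewrite Name[layer] as Name__layer; leave other names unchanged."""
    match = _LAYERED.match(ud_name)
    if match:
        return f"{match.group(1)}__{match.group(2)}"
    return ud_name


def parse_feats(feats):
    """Parse a FEATS field into a dict keyed by Grew-style feature names."""
    if feats == "_":
        return {}
    result = {}
    for item in feats.split("|"):
        name, _, value = item.partition("=")
        result[to_grew_name(name)] = value
    return result


print(parse_feats("Number=Sing|Number[psor]=Plur"))
```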

How is the MISC field handled by Grew?

There are two main problems in dealing with the MISC field in the existing (S)UD data.

  1. The content of the MISC field is not fully specified and, in the UD data, it is used in many different ways; our objective is both:
  2. When a Grew node contains a feature like Case=Gen, there is no canonical way to decide whether it must be output in the FEATS field or in the MISC field.

To deal with the first problem, at parsing time, Grew tries to split the MISC field into a set of (feature, value) pairs. If this is not possible, the raw content is kept in a special feature named __RAW_MISC__. This way, the MISC field can be kept unchanged during rewriting.
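
The fallback behaviour can be sketched in Python as follows. The exact conditions under which Grew considers the split impossible are not stated here, so the rule used in this sketch (every |-separated item must have the shape name=value) is an assumption, as is the function name parse_misc.

```python
def parse_misc(misc):
    """Split MISC into (feature, value) pairs, or keep it raw."""
    if misc == "_":
        return {}  # empty MISC field
    pairs = {}
    for item in misc.split("|"):
        name, sep, value = item.partition("=")
        if not sep or not name or not value:
            # Assumed failure case: the item is not of the form name=value,
            # so the whole field is kept untouched.
            return {"__RAW_MISC__": misc}
        pairs[name] = value
    return pairs


print(parse_misc("SpaceAfter=No"))
print(parse_misc("free text annotation"))
```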

For the second problem, the handling of the MISC features depends on the config used (option -config on Grew CLI).

Unfortunately, in practice, the same feature may be used in both the FEATS and MISC fields. For instance, in the sentence test-12 from UD_Polish-LFG (below), the feature Case appears in FEATS in tokens 2, 5 and 6, and in MISC in token 4! In order to output each feature in the right field, Grew adds the prefix __MISC__ to a feature that comes from the MISC field if its name is in the list given at the end of this page.

# sent_id = test-12
# text = A inni przychodzą do telewizji pijani.
# converted_from_file = NKJP1M_6203010001843_morph_2-p_morph_2.27-s-dis@1.xml
# genre = social
1	A	a	CCONJ	conj	_	3	cc	3:cc	_
2	inni	inny	ADJ	adj:pl:nom:m1:pos	Case=Nom|Degree=Pos|Gender=Masc|Number=Plur|SubGender=Masc1	3	nsubj	3:nsubj	_
3	przychodzą	przychodzić	VERB	fin:pl:ter:imperf	Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	0:root	_
4	do	do	ADP	prep:gen	AdpType=Prep	5	case	5:case	Case=Gen
5	telewizji	telewizja	NOUN	subst:sg:gen:f	Case=Gen|Gender=Fem|Number=Sing	3	obl	3:obl:do	_
6	pijani	pijany	ADJ	adj:pl:nom:m1:pos	Case=Nom|Degree=Pos|Gender=Masc|Number=Plur|SubGender=Masc1	2	acl	2:acl	SpaceAfter=No
7	.	.	PUNCT	interp	PunctType=Peri	3	punct	3:punct	_

Requests for Case in FEATS: pattern { X [Case] } and for Case in MISC: pattern { X [__MISC__Case] }.
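
The prefixing mechanism can be sketched in Python as a round trip between the two CoNLL-U fields and one set of node features. This is an illustration under assumptions: FEATS_NAMES is a hypothetical four-name excerpt standing in for the real list defined in the conll library, the helper names are invented for this example, and features such as form or lemma are left out.

```python
# Hypothetical excerpt; the real list is the one defined in the conll library.
FEATS_NAMES = {"AdpType", "Case", "Gender", "Number"}
MISC_PREFIX = "__MISC__"


def parse_pairs(field):
    """Parse "A=x|B=y" into a dict; "_" stands for an empty field."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|"))


def read_features(feats, misc):
    """Merge FEATS and MISC, prefixing MISC features whose name is in the list."""
    features = parse_pairs(feats)
    for name, value in parse_pairs(misc).items():
        key = MISC_PREFIX + name if name in FEATS_NAMES else name
        features[key] = value
    return features


def write_fields(features):
    """Send each feature back to FEATS or MISC, dropping the prefix."""
    feats, misc = {}, {}
    for name, value in features.items():
        if name.startswith(MISC_PREFIX):
            misc[name[len(MISC_PREFIX):]] = value
        elif name in FEATS_NAMES:
            feats[name] = value
        else:
            misc[name] = value

    def fmt(pairs):
        return "|".join(f"{k}={v}" for k, v in sorted(pairs.items())) or "_"

    return fmt(feats), fmt(misc)


# Token 4 ("do") of the sentence above: Case appears in MISC.
token4 = read_features("AdpType=Prep", "Case=Gen")
print(token4)
print(write_fields(token4))
```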

Additional features textform and wordform

In order to deal with the several places where the text data present in the original sentence and the corresponding linguistic unit differ, a systematic use of the two features textform and wordform was proposed in #683.

The two features are built from CoNLL-U data in the following way:

  1. If a multiword token i-j is declared:
    • the textform of the first token is the FORM field of the multiword token
    • the textform of each other token is _
  2. If the token is an empty node (exists only in EUD):
    • textform=_ and wordform=__EMPTY__
  3. For each token without textform feature, the textform is set to the FORM field value
  4. For each token without wordform feature, the wordform is set to the FORM field value
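
The four rules above can be sketched in Python as follows. This is not Grew's code: the input shape (a list of (ID, FORM, explicit features) triples), the function name add_text_and_word_forms, and the choice to keep any textform or wordform value already present in the data are assumptions made for this example.

```python
def add_text_and_word_forms(rows):
    """rows: list of (id, form, explicit_features) triples from CoNLL-U lines."""
    nodes = {}
    multiword = []
    for id_, form, explicit in rows:
        if "-" in id_:
            multiword.append((id_, form))  # a multiword token line i-j
        else:
            nodes[id_] = {"form": form, **explicit}

    # Rule 1: textform of the tokens covered by a multiword token.
    for span, mwt_form in multiword:
        first, last = (int(part) for part in span.split("-"))
        for position in range(first, last + 1):
            value = mwt_form if position == first else "_"
            nodes[str(position)].setdefault("textform", value)

    # Rule 2: empty nodes have ids such as 3.1.
    for id_, node in nodes.items():
        if "." in id_:
            node.setdefault("textform", "_")
            node.setdefault("wordform", "__EMPTY__")

    # Rules 3 and 4: default to the FORM field.
    for node in nodes.values():
        node.setdefault("textform", node["form"])
        node.setdefault("wordform", node["form"])
    return nodes


# French "du" is a multiword token covering "de" and "le".
rows = [
    ("1-2", "du", {}),
    ("1", "de", {}),
    ("2", "le", {}),
    ("3", "pain", {}),
]
result = add_text_and_word_forms(rows)
for id_, node in result.items():
    print(id_, node["textform"], node["wordform"])
```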

⚠️ In places where wordform should be different from FORM field, this should be expressed in the data with an explicit wordform feature. This includes:

See a few examples in SUD_French-GSD.

Naming of CoNLL columns FORM, UPOS and XPOS in older Grew versions

In older versions of Grew (before the definition of the CoNLL-U format), the fields 2 (FORM), 4 (UPOS) and 5 (XPOS) were accessible with the names phon, cat and pos respectively. Since 1.6, these names cannot be used anymore. If you used these feature names, you have to update your old GRS with the following correspondence: phon becomes form, cat becomes upos and pos becomes xpos.

Note that this applies to the examples given in the book “Application of Graph Rewriting to Natural Language Processing”.
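
As a starting point for updating an old GRS, the renaming of phon, cat and pos can be sketched as a plain text substitution in Python. This is a naive sketch, not a Grew tool: the function name rename_old_features is hypothetical, and because it replaces every standalone occurrence of the three words (including in comments or identifiers), the result should be reviewed by hand.

```python
import re

OLD_TO_NEW = {"phon": "form", "cat": "upos", "pos": "xpos"}

# Match the old names only as whole words, so that "xpos" is left alone.
_OLD_NAME = re.compile(r"\b(phon|cat|pos)\b")


def rename_old_features(grs_text):
    """Replace the old feature names by form, upos and xpos."""
    return _OLD_NAME.sub(lambda match: OLD_TO_NEW[match.group(1)], grs_text)


print(rename_old_features('X [cat=V, phon="le", pos=VB]'))
```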

List of features put in the FEATS field

This list is defined in the conll library (version 1.18.1).

If the config is ud or sud, the following list of features is used to decide which features should be written into the FEATS field. The list is based on the data available in UD 2.14 (plus the Shared feature specific to SUD):