Grew Tutorial • Change tokenization
We reuse here the example of conversion of Sequoia POS tagging to the SUD POS tagging.
We recall the two formats:
Format | frwiki_50.1000_00907 |
---|---|
Sequoia | (annotated graph, image not reproduced) |
SUD | (annotated graph, image not reproduced) |
Notice that, in addition to a different POS tagset, the SUD format also uses a different tokenisation.
In Sequoia, the word du of the input sentence is a single token du with POS P+D, but it is in fact an amalgam of two lexical units: a preposition and a determiner (this is exactly what the tag P+D means).
In SUD, such combined tags are not allowed, so the word du is annotated as two tokens, de and le.
The command add_node
We therefore need a rule that produces this new tokenisation.
The commented rule below performs the transformation (file: amalgam1.grs):
rule amalgam {
pattern { X [form = "du", upos = "P+D"] }
commands {
add_node D :> X; % Create a new node called `D` and place it just after `X`
X.form = "de"; % Change the form of `X` to `de`
X.upos = ADP; % Set the `ADP` tag for the preposition "de"
D.form = "le"; % Set the form feature of `D` to `le`
D.upos = DET; % Set the `DET` tag for the determiner "le"
}
}
This is our first rule in this tutorial with more than one command. In general, the transformation is described by a sequence of commands which are applied successively to the current graph.
The application of this rule to our input graph builds:
Good, we have the tokenisation we expected, but the new node for “le” is not linked to the rest of the graph.
We could connect it later with another rule, but this is risky: if an input sentence contains several occurrences of the word “du”, applying Onf (amalgam) builds a graph with several isolated “le” nodes, and it becomes hard to pair each determiner with the right noun afterwards.
In practice, it is safer to avoid building disconnected graphs.
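The problem can be illustrated with a small plain-Python model of the first rule. This is only a sketch: it does not use Grew or its API, the graph representation (a list of nodes plus a list of labelled edges) is an assumption made for the illustration, and the toy sentence, its edge labels, and the node identifiers are hypothetical.

```python
# Plain-Python model of the first `amalgam` rule; this is NOT Grew's API.
# A graph is modelled as an ordered list of nodes plus a list of edges.

def apply_amalgam(nodes, edges):
    """Split every "du"/"P+D" node into "de" (ADP) + "le" (DET), adding no edge."""
    out = []
    for node in nodes:
        if node["form"] == "du" and node["upos"] == "P+D":
            out.append({**node, "form": "de", "upos": "ADP"})
            # Counterpart of `add_node D :> X`: the new node goes right after X.
            out.append({"id": node["id"] + "_le", "form": "le", "upos": "DET"})
        else:
            out.append(node)
    return out, edges  # edges are left untouched, as in the first rule


def isolated(nodes, edges):
    """Return the ids of nodes that no edge touches."""
    linked = {end for src, _label, tgt in edges for end in (src, tgt)}
    return [n["id"] for n in nodes if n["id"] not in linked]


# Hypothetical toy input with two occurrences of "du" (not the tutorial sentence).
nodes = [
    {"id": "1", "form": "parle", "upos": "V"},
    {"id": "2", "form": "du", "upos": "P+D"},
    {"id": "3", "form": "chat", "upos": "NC"},
    {"id": "4", "form": "du", "upos": "P+D"},
    {"id": "5", "form": "voisin", "upos": "NC"},
]
edges = [("1", "mod", "2"), ("2", "obj.p", "3"), ("3", "dep", "4"), ("4", "obj.p", "5")]

new_nodes, new_edges = apply_amalgam(nodes, edges)
print([n["form"] for n in new_nodes])
print(isolated(new_nodes, new_edges))  # one isolated "le" node per occurrence of "du"
```

In this model, each occurrence of “du” leaves behind one “le” node that no edge touches, and nothing in the resulting graph records which noun each of them belongs to.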
The command add_edge
In our example, the rule should itself connect the new node to the relevant noun.
This can be done with a command add_edge Y -[det]-> D, where Y is the node for the word doigt.
However, to use the node Y in the commands part, it must first be declared in the pattern part.
The new rule is then (file: amalgam2.grs):
rule amalgam {
pattern {
X [form = "du", upos = "P+D"]; % match the amalgam word "du";
X -[obj.p]-> Y; % match the node linked to "du" with the `obj.p` relation
}
commands {
add_node D :> X; % Create a new node called `D` and place it just after `X`
X.form = "de"; % Change the form of `X` to `de`
X.upos = ADP; % Set the `ADP` tag for the preposition "de"
D.form = "le"; % Set the form feature of `D` to `le`
D.upos = DET; % Set the `DET` tag for the determiner "le"
add_edge Y -[det]-> D; % Attach the new node `D` to `Y` with a `det` dependency
}
}
The application of this rule to our input graph builds:
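To see why declaring the `obj.p` dependent in the pattern helps, here is a plain-Python sketch of the second rule. It is not Grew's API: the graph representation, the toy sentence, its edge labels, and the node identifiers are assumptions made for the illustration only.

```python
# Plain-Python model of the second `amalgam` rule; this is NOT Grew's API.

def apply_amalgam_with_edge(nodes, edges):
    """Split "du" and attach the new "le" to the obj.p dependent of the same match."""
    out_nodes, out_edges = [], list(edges)
    for node in nodes:
        # Pattern part: X must be "du"/"P+D" AND have an outgoing obj.p edge to some Y.
        y = next((tgt for src, label, tgt in edges
                  if src == node["id"] and label == "obj.p"), None)
        if node["form"] == "du" and node["upos"] == "P+D" and y is not None:
            d_id = node["id"] + "_le"
            out_nodes.append({**node, "form": "de", "upos": "ADP"})
            out_nodes.append({"id": d_id, "form": "le", "upos": "DET"})
            # Counterpart of `add_edge Y -[det]-> D`.
            out_edges.append((y, "det", d_id))
        else:
            out_nodes.append(node)  # no match: the node is left unchanged
    return out_nodes, out_edges


# Hypothetical toy input with two occurrences of "du" (not the tutorial sentence).
nodes = [
    {"id": "1", "form": "parle", "upos": "V"},
    {"id": "2", "form": "du", "upos": "P+D"},
    {"id": "3", "form": "chat", "upos": "NC"},
    {"id": "4", "form": "du", "upos": "P+D"},
    {"id": "5", "form": "voisin", "upos": "NC"},
]
edges = [("1", "mod", "2"), ("2", "obj.p", "3"), ("3", "dep", "4"), ("4", "obj.p", "5")]

new_nodes, new_edges = apply_amalgam_with_edge(nodes, edges)
det_edges = [e for e in new_edges if e[1] == "det"]
print(det_edges)
```

Because the noun is bound in the same match as the amalgam, each new determiner is attached to its own noun even when the sentence contains several occurrences of “du”.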
Advanced topic
When an amalgam (called a multiword token in UD) is split, additional information should be added to the CoNLL-U file in order to keep the link with the raw text. A special line indicates that the two tokens de and le appear together as du in the raw text. The expected encoding of the graph above should therefore contain:
7-8 du _ _ _ _ _ _ _ _
7 de de ADP _ _ 6 mod _ s=def
8 le _ DET _ _ 9 det _ _
These special lines (with an index like 7-8) are encoded in the Grew version of graphs by means of textform features (see the CoNLL-U page).
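As a rough illustration of this encoding, the Python sketch below rebuilds the range lines from textform features. It is a simplified model rather than Grew's actual export code: it assumes that a multiword token is marked by a token whose textform holds the raw text, followed by one or more tokens whose textform is "_", and it fills only the ID and FORM columns. The function name and the toy tokens are hypothetical.

```python
def to_conllu_lines(tokens):
    """Emit simplified CoNLL-U lines (ID and FORM filled, other columns "_")."""
    lines = []
    for i, tok in enumerate(tokens):
        # Count the tokens with textform "_" that directly follow token i.
        j = i + 1
        while j < len(tokens) and tokens[j].get("textform") == "_":
            j += 1
        if tok.get("textform", "_") != "_" and j > i + 1:
            # Token i opens a multiword token covering IDs i+1 .. j.
            lines.append("\t".join([f"{i + 1}-{j}", tok["textform"]] + ["_"] * 8))
        lines.append("\t".join([str(i + 1), tok["form"]] + ["_"] * 8))
    return lines


# Hypothetical toy tokens (not the tutorial sentence).
tokens = [
    {"form": "parle", "textform": "parle"},
    {"form": "de", "textform": "du"},
    {"form": "le", "textform": "_"},
    {"form": "chat", "textform": "chat"},
]

for line in to_conllu_lines(tokens):
    print(line)
```

With these toy tokens, a range line with index 2-3 and form du is printed just before the lines for de and le, mirroring the 7-8 line shown above.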
The full rule which produces the expected output is:
rule amalgam {
pattern {
X [form = "du", upos = "P+D"]; % match the amalgam word "du";
X -[obj.p]-> Y; % match the node linked to "du" with the `obj.p` relation
}
commands {
add_node D :> X; % Create a new node called `D` and place it just after `X`
X.form = "de"; % Change the form of `X` to `de`
X.upos = ADP; % Set the `ADP` tag for the preposition "de"
D.form = "le"; % Set the form feature of `D` to `le`
D.upos = DET; % Set the `DET` tag for the determiner "le"
add_edge Y -[det]-> D; % Attach the new node `D` to `Y` with a `det` dependency
X.wordform="de"; % The wordform of `X` is identical to the form
X.textform="du"; % [1/2] textform "du" and "_" on two consecutive tokens encode MWT lines
D.textform="_"; % [2/2] textform "du" and "_" on two consecutive tokens encode MWT lines
}
}
Reverse transformation
In the GRS file rm_mwt.grs, there is a rewriting system for the reverse operation. It contains some lexical information specific to French but it shows a way to deal with this kind of transformation.