Grew Tutorial • Lesson 6 • More commands
Let us continue our conversion of the Sequoia POS tagging to the SUD POS tagging.
Recall the two formats:
Format | frwiki_50.1000_00907 |
---|---|
Sequoia | [figure: Sequoia annotation] |
SUD | [figure: SUD annotation] |
You can observe that, in addition to a different POS tagset, the SUD format also uses a different tokenisation.
In Sequoia, the word “du” of the input sentence is a single token “du” with POS `P+D`, but it is in fact an amalgam of two lexical units: a preposition and a determiner (this is exactly what the tag `P+D` means).
In SUD, such combined tags are not allowed, so the sentence is annotated with two tokens, “de” and “le”, for the word “du”.
The command add_node
So we have to design a rule that produces this new tokenisation.
The commented rule below computes this transformation (file: `amalgam1.grs`):
rule amalgam {
  pattern { N [form = "du", upos = "P+D"] }
  commands {
    add_node D :> N;   % Create a new node called `D` and place it just after `N`
    N.form = "de";     % Change the form of `N` to `de`
    N.upos = ADP;      % Set the `ADP` tag for the preposition "de"
    D.form = "le";     % Set the form feature of `D` to `le`
    D.upos = DET;      % Set the `DET` tag for the determiner "le"
  }
}
This is our first rule in this tutorial with more than one command. In general, the transformation is described by a sequence of commands which are applied successively to the current graph.
Applying this rule to our input graph builds a new graph in which “du” is split into the two tokens “de” and “le”.
Good, we have the final tokenisation we expected, but the new node for “le” is not linked to the graph.
We could connect it later with another rule, but this is risky: if an input sentence contains several occurrences of the word “du”, applying `Onf (amalgam)` builds a graph with several isolated “le” nodes, and it becomes hard to pair each determiner with the right noun later.
In practice, it is safer to avoid building disconnected graphs.
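To make the `Onf (amalgam)` scenario concrete, here is a minimal sketch of a GRS file that pairs the rule from `amalgam1.grs` with a strategy. The strategy name `main` is an assumption chosen for illustration; the rule body is unchanged.

```grew
rule amalgam {
  pattern { N [form = "du", upos = "P+D"] }
  commands {
    add_node D :> N;
    N.form = "de";
    N.upos = ADP;
    D.form = "le";
    D.upos = DET;
  }
}

% `Onf` keeps applying the rule until it no longer matches.
% Once rewritten, a node has upos `ADP` and cannot match again,
% so each "du" tagged `P+D` is split exactly once.
strat main { Onf (amalgam) }
```

With this version of the rule, a sentence containing two “du” tokens would end up with two “le” nodes, neither of which has any incoming or outgoing edge.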
The command add_edge
In our example above, the rule itself should connect the new node to the relevant noun.
This can be done with the command `add_edge M -[det]-> D`, where `M` is the node for the word “doigt”.
But to be able to use this node `M` in the `commands` part, it must be declared in the `pattern` part.
The new rule is then (file: `amalgam2.grs`):
rule amalgam {
  pattern {
    N [form = "du", upos = "P+D"];   % Match the amalgam word "du"
    N -[obj.p]-> M;                  % Match the node linked to "du" with the `obj.p` relation
  }
  commands {
    add_node D :> N;         % Create a new node called `D` and place it just after `N`
    N.form = "de";           % Change the form of `N` to `de`
    N.upos = ADP;            % Set the `ADP` tag for the preposition "de"
    D.form = "le";           % Set the form feature of `D` to `le`
    D.upos = DET;            % Set the `DET` tag for the determiner "le"
    add_edge M -[det]-> D;   % Add the dependency link to the new node `D`
  }
}
Applying this rule to our input graph builds a graph in which the new node “le” is attached to “doigt” by a `det` dependency.
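The same combination of `add_node` and `add_edge` could plausibly be reused for other amalgams. The sketch below adapts the rule from `amalgam2.grs` to the word “au”. It assumes that “au” is also tagged `P+D` in the input and governs its noun through `obj.p`; the rule name `amalgam_au` is hypothetical, and none of this is taken from the tutorial files.

```grew
rule amalgam_au {
  pattern {
    N [form = "au", upos = "P+D"];   % Assumption: "au" carries the same `P+D` tag
    N -[obj.p]-> M;                  % Assumption: same `obj.p` relation to the noun
  }
  commands {
    add_node D :> N;         % New node `D` placed just after `N`
    N.form = "à";            % `N` becomes the preposition "à"
    N.upos = ADP;
    D.form = "le";           % `D` is the determiner "le"
    D.upos = DET;
    add_edge M -[det]-> D;   % Attach the determiner to the noun
  }
}
```

Like the original rule, this sketch matches only the exact lowercase form, so a capitalised “Au” at the start of a sentence would not be rewritten.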
Advanced topic
TODO: dealing with the special encoding of Multi-Word Tokens in (S)UD with `wordform` and `textform`.