Rule Gallery
This page gives examples of rules that can be used either with ArboratorGrew or when writing more complex GRSs.
Tokenisation
⚠️ NOTE: the rules proposed in this section only modify the sequence of tokens; the dependency links are not taken into account. To handle them as well, you can:
- Add specific commands to move or change these relations
- Or, in ArboratorGrew, you can post edit the graph after the Apply rules operation
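For instance, Grew provides shift commands to redirect dependency relations from one node to another. The rule below is a minimal sketch of how they could be used (the pattern and rule name are placeholders, not part of any rule on this page; in practice the pattern should be restricted to the nodes concerned):

```grew
% Minimal sketch: redirect dependencies between two adjacent nodes X and Y.
% shift_in moves edges arriving on X, shift_out moves edges leaving X.
rule move_relations {
  pattern { X []; Y []; X < Y }
  commands {
    shift_in X ==> Y;   % edges that pointed to X now point to Y
    shift_out X ==> Y;  % edges that started from X now start from Y
  }
}
```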
Split one token into two tokens
Suppose that your data contains a token didn’t that should be split into two separate tokens did and n’t. The rule below implements this operation:
rule split_1to2 {
pattern {
X [ form = "didn't" ]
}
commands {
add_node Y :> X; % Add a new node (called Y in following commands) directly after X
X.form = "did"; % Set the new form on the first sub-token
Y.form = "n't"; % Set the new form on the second sub-token
X.wordform = X.form; % This line is needed to have a clean CoNLL output
% You can then use additional commands for other features
% Some examples below:
X.upos = AUX;
X.lemma = "do";
Y.upos = PART;
Y.lemma = "not";
% If a feature should move to the new node:
Y.Polarity = X.Polarity;
del_feat X.Polarity;
}
}
Note again that the new token n’t is isolated: something more is needed to connect it!
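One way to connect it can be sketched as follows, assuming a UD-style analysis where did governs n’t with an advmod relation (the relation label and the choice of head are assumptions to adapt to your annotation scheme):

```grew
% Hypothetical follow-up rule: attach the isolated negation token to its host.
rule attach_neg {
  pattern { X [upos=AUX, lemma="do"]; Y [upos=PART, lemma="not"]; X < Y }
  without { * -> Y }                      % Y has no governor yet
  commands { add_edge X -[advmod]-> Y; }  % the label advmod is an assumption
}
```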
Splitting a token into sub-tokens and adding a Multi-Word Token
UD proposes a mechanism for dealing with difficult tokenisation cases (see the UD guidelines), called a Multi-Word Token (MWT). The rule below shows how to split a token into sub-tokens while adding an MWT to the structure. The example corresponds to the French contraction au, which should be split into à and le. Note that the rule does not handle the dependency relations, so its output should be edited for a full annotation.
rule mwt {
pattern {
X [ form = "au" ]
}
commands {
add_node Y :> X; % add a new node (called Y in following commands) directly after X
X.form = "à"; X.lemma="à"; % Set the new form and lemma on sub-token 1
X.upos = ADP; % Set the new POS on sub-token 1
Y.form = "le"; Y.lemma="le"; % Set the new form and lemma on sub-token 2
Y.upos = DET; % Set the POS on sub-token 2
% Next lines are dedicated to the multiword tokens.
% See https://grew.fr/doc/conllu/#additional-features-textform-and-wordform for explanation
X.wordform = "au"; Y.wordform = "_";
}
}
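For reference, the intended CoNLL-U output for au looks roughly as follows (a sketch showing only the first columns; the remaining fields depend on the rest of the annotation):

```
1-2   au   _   _
1     à    à   ADP
2     le   le  DET
```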
Splitting a punctuation mark at the end of a token
The example below splits off punctuation marks that were kept attached to the previous word: for example, it splits Go! into Go and !.
rule tok_punct {
pattern {
% The rule applies to any form that ends with one of the punctuation marks: comma, period, exclamation mark, question mark.
X [ form = re".+[,.!?]" ]
% RegExp explanation:
% ".+" ➔ a non-empty prefix ("." is any character, "+" means repeated a non-zero number of times)
% "[,.!?]" ➔ the last character is one of the characters we want to split.
}
without {
% A token that consists only of punctuation marks must be left untouched (e.g. "!!!" or "...").
X.form = re"[,.!?]+";
}
without {
% Other forms may need to be retained: to be completed
X[form = "M." | "Dr." | "Jr."];
}
commands {
add_node Y :> X; % Add a new node (called Y in the following commands) directly after X
Y.form = X.form[-1:]; % Python-like slicing, see http://grew.fr/doc/commands/#add-or-update-a-node-feature
Y.lemma = Y.form; % Copy the form created above into the lemma annotation
Y.upos = PUNCT; % Set Y's upos
X.form = X.form[:-1]; % Remove the last character from X.form
X.SpaceAfter = "No";
}
}
Note that the order of the commands is important: if the command X.form = X.form[:-1] appeared before Y.form = X.form[-1:], the result would be wrong.
Merge 2 tokens into 1
The rule below performs the reverse operation of the one above: it turns the two consecutive tokens did and n’t into the single token didn’t.
rule r {
pattern { X1 [form="did"]; X2 [form="n't"]; X1 < X2 }
commands {
X1.form = "didn't";
% For a clean Conllu output
X1.wordform = X1.form;
% More lines can be added for other features
del_node X2; % Finally, remove the unneeded node
}
}
Note that, depending on the dependency relations and the features on the initial tokens X1 and X2, it may be simpler to put the new form on X2 and to remove X1.
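That symmetric variant can be sketched as follows (same pattern as the rule above; only the commands change, and the rule name is a placeholder):

```grew
rule r_keep_second {
  pattern { X1 [form="did"]; X2 [form="n't"]; X1 < X2 }
  commands {
    X2.form = "didn't";
    X2.wordform = X2.form;   % For a clean CoNLL output
    del_node X1;             % Remove the unneeded first token
  }
}
```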
Larger change in tokenisation
The next example is taken from mSUD_Bokota, where some data tokenised as the 3 tokens ho+ked+-a should be changed into the 2-token sequence hoke+da.
rule r {
pattern { X1 [form="ho"]; X2 [form="ked"]; X3 [form="-a"]; X1 < X2; X2 < X3 }
commands {
X1.form = "hoke";
X2.form = "da";
% If data is aligned with a sound file:
X2.AlignEnd = X3.AlignEnd;
% For a clean Conllu output
X1.wordform = X1.form;
X2.wordform = X2.form;
% More lines can be added for other features
del_node X3; % Finally, remove the unneeded node
}
}