Rule Gallery

This page gives examples of rules that can be used either with ArboratorGrew or when writing more complex GRSs.

Tokenisation

⚠️ NOTE: the rules proposed in this section only modify the sequence of tokens. The dependency links are not taken into account, so after splitting or merging you will need to edit or add the relevant dependency relations separately.


Split one token into two tokens

Suppose that your data contains a token didn’t that should be split into two separate tokens did and n’t. The rule below implements this operation:

rule split_1to2 {
	pattern {
		X [ form = "didn't" ]
	}
	commands {
		add_node Y :> X;         % Add a new node (called Y in following commands) directly after X
		X.form = "did";          % Set the new form on the first sub-token
		Y.form = "n't";          % Set the new form on the second sub-token

		X.wordform = X.form;     % This line is needed to have a clean CoNLL-U output

		% You can then use additional commands for other features
		% Some examples below:
		X.upos = AUX;
		X.lemma = "do";

		Y.upos = PART;
		Y.lemma = "not";

		% If a feature should move to the new node:
		Y.Polarity = X.Polarity;
		del_feat X.Polarity;
	}
}

Note again that the new token (form n’t, lemma not) is left unattached; something more is needed to connect it to the structure!


Splitting a token in sub-tokens and adding a Multi-Word token

UD proposes a mechanism for dealing with difficult tokenisation cases (see the UD guidelines): the Multi-Word Token (MWT). The rule below shows how to split a token into sub-tokens while adding an MWT to the structure. The example corresponds to the French contraction au, which should be split into à and le. Note that the rule does not handle the dependency relations, so its output should be edited for a full annotation.

rule mwt {
	pattern {
		X [ form = "au" ]
	}
	commands {
		add_node Y :> X;             % add a new node (called Y in following commands) directly after X
		X.form = "à"; X.lemma="à";   % Set the new form and lemma on sub-token 1
		X.upos = ADP;                % Set the new POS on sub-token 1
		Y.form = "le"; Y.lemma="le"; % Set the new form and lemma on sub-token 2
		Y.upos = DET;                % Set the POS on sub-token 2
		% Next lines are dedicated to the multiword tokens.
		% See https://grew.fr/doc/conllu/#additional-features-textform-and-wordform for explanation
		X.wordform = "au"; Y.wordform = "_";
	}
}
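For reference, here is a minimal sketch of what the result looks like in CoNLL-U (token ids 1 and 2 are assumed): the contracted form appears on a range line before its two sub-tokens.

```python
# Hypothetical sketch of the CoNLL-U fragment expected after applying the rule:
# the MWT appears as a range line ("1-2") before its sub-tokens.
mwt_output = "\n".join([
    "1-2\tau\t_\t_\t_\t_\t_\t_\t_\t_",    # Multi-Word Token line
    "1\tà\tà\tADP\t_\t_\t_\t_\t_\t_",     # sub-token 1
    "2\tle\tle\tDET\t_\t_\t_\t_\t_\t_",   # sub-token 2
])
print(mwt_output)
```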

Splitting a punctuation mark at the end of a token

The rule below splits off a punctuation mark that was kept attached to the previous word. For example, it splits Go! into Go and !.

rule tok_punct {
	pattern {
		% The rule applies to any form that ends with one of the punctuation marks: comma, period, exclamation mark, question mark.
		X [ form = re".+[,.!?]" ]
		% RegExp explanation: 
		%  ".+" ➔ a non-empty prefix ("." is any character, "+" means repeated one or more times)
		%  "[,.!?]" ➔ the last character is one of the characters we want to split.
	}
	without {
		% A token that contains only punctuation marks must be spared (e.g. "!!!" or "...").
		X.form = re"[,.!?]+";
	}
	without {
		% Other forms may need to be retained: to be completed
		X[form = "M." | "Dr." | "Jr."];
	}
	commands {
		add_node Y :> X;      % Add a new node (called Y in the following commands) directly after X
		Y.form = X.form[-1:]; % Python-like slicing, see http://grew.fr/doc/commands/#add-or-update-a-node-feature
		Y.lemma = Y.form;     % Copy the form created above into the lemma annotation
		Y.upos = PUNCT;       % Set Y's upos
		X.form = X.form[:-1]; % Remove the last character from X.form
		X.SpaceAfter = "No";
	}
}
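The interplay of the pattern and the two without clauses can be checked in Python with a small hypothetical helper, assuming that Grew anchors re"…" patterns to the whole form (which corresponds to Python's re.fullmatch):

```python
import re

# Mirrors of the regexes used in the Grew rule (assumed to be full-string matches)
pattern = re.compile(r".+[,.!?]")    # form ends with a punctuation mark
only_punct = re.compile(r"[,.!?]+")  # form made of punctuation marks only

def rule_applies(form):
    """Hypothetical check: True if the Grew rule would match this form."""
    if not pattern.fullmatch(form):
        return False
    if only_punct.fullmatch(form):      # first "without" clause
        return False
    if form in {"M.", "Dr.", "Jr."}:    # second "without" clause
        return False
    return True

print(rule_applies("Go!"))   # True
print(rule_applies("!!!"))   # False: punctuation only
print(rule_applies("Dr."))   # False: listed exception
```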

Note that the order of the commands matters: if the command X.form = X.form[:-1] appeared before Y.form = X.form[-1:], the result would be wrong.
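Since the slicing syntax is Python-like, the pitfall can be reproduced directly in Python:

```python
form = "Go!"

# Correct order: read the last character before truncating the form
y_form = form[-1:]        # "!"  (the detached punctuation mark)
x_form = form[:-1]        # "Go"

# Wrong order: once the form is truncated, the punctuation mark is gone
truncated = form[:-1]     # "Go"
wrong_y = truncated[-1:]  # "o"  -- not the punctuation mark

print(x_form, y_form, wrong_y)
```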


Merge 2 tokens into 1

The rule below performs the reverse operation of the one above: it turns two consecutive tokens did and n’t into the single token didn’t.

rule r {
  pattern { X1 [form="did"]; X2 [form="n't"]; X1 < X2 }

  commands {
  	X1.form = "didn't";

    % For a clean CoNLL-U output
    X1.wordform = X1.form;

    % More lines can be added for other features

    del_node X2;            % Finally, remove the unneeded node
  }
}

Note that, depending on the dependency relations and the features of the initial tokens X1 and X2, it may be simpler to put the new form on X2 and to remove X1.


Larger change in tokenisation

The next example is taken from mSUD_Bokota, where data tokenised as the 3-token sequence ho+ked+-a should be changed into the 2-token sequence hoke+da.

rule r {
  pattern { X1 [form="ho"]; X2 [form="ked"]; X3 [form="-a"]; X1 < X2; X2 < X3 }

  commands {
  	X1.form = "hoke";
    X2.form = "da";

    % If data is aligned with a sound file:
    X2.AlignEnd = X3.AlignEnd;

    % For a clean CoNLL-U output
    X1.wordform = X1.form;
    X2.wordform = X2.form;

    % More lines can be added for other features

    del_node X3;            % Finally, remove the unneeded node
  }
}