There is a general clustering mechanism in Grew that can be used in various contexts to divide a set of matchings produced by a request on a corpus into a number of subsets (a partition) according to some criteria.

Where can it be used?

Clustering with a key

Clustering on a node feature

With the clustering key N.f, the matchings are clustered according to the value of the feature named f on the node N present in the (matching part of the) main request. If the feature is not defined for some matchings, a cluster with the value __undefined__ is added.
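The partition described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grew's implementation: matchings are represented here as plain dictionaries mapping node names to feature dictionaries, and the helper name cluster_by_feature is hypothetical.

```python
from collections import defaultdict

UNDEFINED = "__undefined__"

def cluster_by_feature(matchings, node, feature):
    """Partition matchings by the value of `feature` on `node`.

    `matchings` is a list of dicts {node_name: {feature_name: value}};
    this data layout is an assumption made for the sketch.
    """
    clusters = defaultdict(list)
    for matching in matchings:
        # Matchings where the feature is missing go to the __undefined__ cluster.
        value = matching.get(node, {}).get(feature, UNDEFINED)
        clusters[value].append(matching)
    return dict(clusters)

matchings = [
    {"N": {"upos": "NOUN", "Gender": "Fem"}},
    {"N": {"upos": "NOUN", "Gender": "Masc"}},
    {"N": {"upos": "VERB"}},  # no Gender feature
]
sizes = {value: len(group)
         for value, group in cluster_by_feature(matchings, "N", "Gender").items()}
print(sizes)
```

Every matching lands in exactly one cluster, so the result is a partition of the input set.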


Clustering on an edge feature

With the clustering key e.f, the matchings are clustered according to the value of the feature named f on the edge e present in the (matching part of the) main request. If the feature is not defined for some matchings, a cluster with the value __undefined__ is added.


Clustering on the full label of an edge

With the clustering key e.label, the matchings are clustered according to the full label of the edge e present in the (positive part of the) main request. NB: the way the label value is reported depends on the configuration used.


Clustering on an edge length

The clustering key e.length makes clusters according to the length of edge e. A related clustering key makes clusters according to the relative positions of the governor and the dependent of edge e.


Clustering of continuous numeric features

As suggested in #28, for a continuous numeric feature it is sensible to cluster by value intervals.


There are not many examples of numerical features in the current version of UD. The following example is not linguistically meaningful, but it shows the mechanism.

On UD_Naija-NSC, with the clustering key N.AlignBegin[gap=1000, min=10000, max=20000] and the request:

pattern { N [AlignBegin] }

There are (up to) 12 clusters, named ]-∞, 10000[, [10000, 11000[, … [19000, 20000[ and [20000, +∞[.
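To make the interval naming concrete, here is a minimal Python sketch that reproduces the 12 labels for gap=1000, min=10000, max=20000. The function names are hypothetical, integer values are assumed, and the handling of a range that is not a multiple of gap (the last interval is cut at max) is an assumption, not documented Grew behaviour.

```python
def interval_label(value, gap, lo, hi):
    """Return the label of the interval containing `value`."""
    if value < lo:
        return f"]-∞, {lo}["
    if value >= hi:
        return f"[{hi}, +∞["
    # Left bound of the half-open interval [start, start + gap[ containing value.
    start = lo + ((value - lo) // gap) * gap
    return f"[{start}, {min(start + gap, hi)}["

def all_labels(gap, lo, hi):
    """List every possible cluster label, in increasing order."""
    labels = [f"]-∞, {lo}["]
    labels += [f"[{s}, {min(s + gap, hi)}[" for s in range(lo, hi, gap)]
    labels.append(f"[{hi}, +∞[")
    return labels

labels = all_labels(gap=1000, lo=10000, hi=20000)
print(len(labels))                                 # 12
print(interval_label(10500, 1000, 10000, 20000))   # [10000, 11000[
```

The count is "up to" 12 because an interval with no matching produces no cluster.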

Clustering on successive feature names

The feature name ExtPos is mainly used when the external POS is different from the regular upos. So it may be useful to be able to report the value of ExtPos if it exists and the value of upos otherwise. This is possible with the clustering key N.ExtPos/upos.
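The fallback behaviour of a key such as N.ExtPos/upos can be sketched as "take the first feature in the list that is defined". The Python below is an illustration only: the helper name and the dictionary representation of a node are hypothetical, and returning __undefined__ when none of the listed features exists is an assumption.

```python
def first_defined(features, names, undefined="__undefined__"):
    """Return the value of the first feature in `names` present in `features`."""
    for name in names:
        if name in features:
            return features[name]
    # Assumption: no listed feature is defined on this node.
    return undefined

# The part of the key after the node name, split on "/".
names = "ExtPos/upos".split("/")

with_extpos = {"upos": "ADV", "ExtPos": "ADP"}
without_extpos = {"upos": "ADP"}

print(first_defined(with_extpos, names))     # ADP (taken from ExtPos)
print(first_defined(without_extpos, names))  # ADP (taken from upos)
```

Both nodes end up in the same cluster, which is why this kind of key gives a more regular set of clusters than upos alone.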

Example with ExtPos

On UD_French-GSD, when searching for the POS of a dependent of the case relation with the request pattern { H -[case]-> D }, the clustering key D.upos reports 7 clusters (use the Count button to see all clusters), whereas the clustering key D.ExtPos/upos reports a more regular set of 2 clusters.

Example with corrected features

This kind of clustering key can also be useful with the Correct{feature} features (see UD guidelines). For instance, on UD_French-GSD, with the request:

pattern { N -[amod]-> A ; N.Gender <> A.Gender}

and the two clustering keys N.CorrectGender/Gender and A.CorrectGender/Gender, we can observe in more detail the Gender agreement between two nodes related by amod. Most of the cases are related to typos; many of the other cases are annotation errors in version 2.12.

Clustering by relative order of nodes

With a clustering key N1#N2#N3 where N1, N2 and N3 are nodes from the pattern part of the request, the occurrences are clustered according to the relative order of the nodes and clusters are identified by N1 << N2 << N3, N2 << N1 << N3… This can be used with any number of nodes.
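As an illustration of how such an order label can be derived, the following Python sketch sorts the node names of the key by their position in the sentence. The function name and the positions dictionary are hypothetical; Grew computes this internally from the matched graph.

```python
def order_label(positions, key):
    """Build a label such as 'S << V << O' from node positions.

    `positions` maps each node name of the key to its linear position
    in the sentence (an assumed representation for this sketch).
    """
    names = key.split("#")
    ordered = sorted(names, key=lambda name: positions[name])
    return " << ".join(ordered)

print(order_label({"V": 2, "S": 1, "O": 3}, "V#S#O"))  # S << V << O
print(order_label({"V": 5, "S": 1, "O": 3}, "V#S#O"))  # S << O << V
```

With n nodes in the key there are at most n! clusters, for instance 6 for three nodes.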

Example: Verb, Subject, Object ordering

On UD, with the request:

pattern {
  V -[nsubj]-> S;
  V -[obj]-> O;
}

and with the clustering key V#S#O, we can observe the occurrences of the 6 possible orders SVO, SOV… on UD_Latin-Perseus.

Example: positions of copula and adposition sharing the same head

pattern {
  H [];
  COP [upos=AUX]; H -[cop]-> COP;
  ADP [upos=ADP]; H -[case|mark]-> ADP;
}

This request can be run on UD_French-GSD with the clustering key H#COP#ADP.

Annotation of a bigram DET NOUN

With a clustering key N2 -> N1 and the pattern:

pattern { N1 [upos=DET]; N2 [upos=NOUN]; N1 < N2 }

we can observe how the bigram is annotated on UD_German-GSD.

Annotation of a bigram NOUN NOUN

With a clustering key N1 <-> N2 and the pattern:

pattern { N1 [upos=NOUN]; N2 [upos=NOUN]; N1 < N2 }

we can observe how the bigram NOUN-NOUN is annotated on UD_English-GUM or on UD_Chinese-GSD.

Clustering with a sub-request (whether)

A whether sub-request contains a list of clauses (as in pattern, without or with constructions). The set of occurrences is split into two subsets, according to whether or not each occurrence satisfies the sub-request.

Note that no curly brackets are needed in the whether text area (see examples below).
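Conceptually, clustering with a whether sub-request is a two-way split on a yes/no test. The Python sketch below shows only that idea: the predicate stands in for the sub-request, the matchings are hypothetical dictionaries, and the function name is not part of Grew.

```python
def split_whether(matchings, satisfies):
    """Split matchings into (satisfying, not satisfying) lists."""
    satisfied, not_satisfied = [], []
    for matching in matchings:
        if satisfies(matching):
            satisfied.append(matching)
        else:
            not_satisfied.append(matching)
    return satisfied, not_satisfied

# Hypothetical matchings: each records the position of nodes N1 and N2.
matchings = [
    {"N1": 1, "N2": 2},
    {"N1": 4, "N2": 3},
    {"N1": 2, "N2": 7},
]
# Stand-in for a sub-request: does N1 come before N2?
before, after = split_whether(matchings, lambda m: m["N1"] < m["N2"])
print(len(before), len(after))  # 2 1
```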