Clustering

There is a general clustering mechanism in Grew that can be used in various contexts to divide a set of matchings produced by a request on a corpus into a number of subsets (a partition) according to some criteria.

Where can it be used?

In Grew-match, below the main textarea where the request is written, one or two clustering items can be described.
On the command line, clustering can be performed in grep or count mode; details and examples are given in the documentation of the grep and count modes.
With the Python library Grewpy, in the functions Corpus.count and Corpus.search.

Clustering with a key

Clustering on a node feature

With the clustering key X.f, the matchings are clustered following the value of the feature named f for the node X present in the (matching part of the) main request. If the feature is not defined for some matchings, a cluster with the value __undefined__ is added.
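The grouping described above can be pictured with a minimal Python sketch. This is not Grew's implementation: the representation of a matching as a dictionary and the helper name `cluster_by_feature` are assumptions made only for illustration.

```python
from collections import defaultdict

UNDEFINED = "__undefined__"  # cluster name used when the feature is missing

def cluster_by_feature(matchings, node, feature):
    """Group matchings by the value of `feature` on the node bound to `node`.

    Hypothetical data model: each matching maps a node name (e.g. "X")
    to a dict of that node's features.
    """
    clusters = defaultdict(list)
    for matching in matchings:
        value = matching.get(node, {}).get(feature, UNDEFINED)
        clusters[value].append(matching)
    return dict(clusters)

# Toy matchings for a request such as: pattern { X [upos=AUX] }
matchings = [
    {"X": {"upos": "AUX", "lemma": "być"}},
    {"X": {"upos": "AUX", "lemma": "zostać"}},
    {"X": {"upos": "AUX", "lemma": "być"}},
    {"X": {"upos": "AUX"}},  # no lemma: goes to __undefined__
]

clusters = cluster_by_feature(matchings, "X", "lemma")
sizes = {value: len(items) for value, items in clusters.items()}
```

Here `sizes` gives the number of occurrences per cluster, which is the kind of table that the count mode reports.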

Examples

List lemmas of auxiliaries in UD_Polish-LFG
List VerbForm of VERB without nsubj in UD_German-GSD
Find the large number of forms associated with the lemma saada in UD_Finnish-FTB

Clustering on an edge feature

With the clustering key e.f, the matchings are clustered following the value of the feature named f for the edge e present in the (matching part of the) main request. If the feature is not defined for some matchings, a cluster with the value __undefined__ is added.

Example

List sub-relations used with acl relation in UD_Swedish-Talbanken

Clustering on the full label of an edge

With the clustering key e.label, the matchings are clustered according to the full label of the edge e present in the (positive part of the) main request. NB: the way the label value is reported depends on the configuration used.

Example

List relations used for auxiliaries in UD_Italian-ParTUT

Clustering on an edge length

The clustering key e.length makes clusters according to the length of the edge e; the clustering key e.delta makes clusters according to the relative positions of the governor and the dependent of the edge e.

Examples

Observe the length of the amod relation in UD_Korean-PUD
Observe the relative positions of nsubj related tokens in UD_Naija-NSC

Clustering on distance between nodes

🆕 in Version 1.16. Similarly to the new syntax for requests, it is possible to cluster on the distance between two nodes:

length(X,Y) absolute distance between X and Y
delta(X,Y) relative distance between X and Y

Note that this is partly redundant with the previous point (clustering on an edge length), but it can also be used on a pair of nodes that are not connected. For example, you can cluster on the relative distance between a subject and an object that depend on the same governor.
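The difference between the two keys is that length ignores direction while delta keeps a sign. The Python sketch below illustrates this on token positions. It is only an illustration: the function names mirror the clustering keys, but the sign convention chosen for `delta` (position of Y minus position of X) is an assumption, not something stated in this documentation.

```python
def length(pos_x, pos_y):
    """Absolute distance between two token positions."""
    return abs(pos_y - pos_x)

def delta(pos_x, pos_y):
    """Signed distance between two token positions.

    Assumed convention: positive when Y comes after X.
    """
    return pos_y - pos_x

# Toy positions of a subject S and an object O sharing the same governor
subject_pos, object_pos = 2, 5
same_length = length(subject_pos, object_pos) == length(object_pos, subject_pos)
opposite_delta = delta(subject_pos, object_pos) == -delta(object_pos, subject_pos)
```

Whatever the actual sign convention, swapping the two nodes leaves length unchanged and flips the sign of delta.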

Clustering of continuous numeric features

As suggested in #28, for a continuous numeric feature, it is sensible to cluster by value intervals.

X.feat[gap=g] clusters the values of X.feat into intervals of size g
X.feat[gap=g, min=a, max=b] clusters the values between a and b into intervals of size g, with two additional clusters: one for all values < a and one for all values > b.

Example

The Naija treebank has a version with prosodic information.

On UD_Naija-NSC, with the clustering key S.Duration[gap=100] and the request:

pattern { S [SylForm = "wO~", Duration] }

There are 8 clusters, named [0, 100[, [100, 200[, … [700, 800[.
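To make the interval naming concrete, here is a minimal Python sketch that maps a numeric value to a cluster name of the form shown above. It is not Grew's code. The labels used for out-of-range values, the alignment of intervals on min when it is given, and the parameter names `lo` and `hi` (standing for min and max) are all assumptions.

```python
import math

def interval_label(value, gap, lo=None, hi=None):
    """Return the name of the interval of width `gap` that contains `value`.

    Assumptions: intervals are aligned on `lo` when it is given and on 0
    otherwise; the "< lo" and "> hi" labels are hypothetical names for
    the two extra clusters.
    """
    if lo is not None and value < lo:
        return f"< {lo}"
    if hi is not None and value > hi:
        return f"> {hi}"
    base = lo if lo is not None else 0
    start = base + math.floor((value - base) / gap) * gap
    return f"[{start}, {start + gap}["

# Durations such as those of the example above, with gap=100
labels = [interval_label(d, 100) for d in (0, 99, 250, 799)]
```

With gap=100, a duration of 250 falls in the cluster named `[200, 300[`, and a duration of 799 in `[700, 800[`.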

Clustering on successive feature names

The feature name ExtPos is mainly used when the external POS is different from the regular upos. So it may be useful to be able to report the value of ExtPos if it exists and the value of upos otherwise. This is possible with the clustering key X.ExtPos/upos.
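The fallback behaviour of such a key can be sketched in a few lines of Python. The helper `first_defined` and the feature dictionaries are hypothetical, and returning `__undefined__` when none of the listed features exists is an assumption made by analogy with single-feature keys.

```python
def first_defined(features, names):
    """Return the value of the first feature in `names` that is defined."""
    for name in names:
        if name in features:
            return features[name]
    return "__undefined__"  # assumed behaviour when no feature is defined

def parse_key(key):
    """Split a key such as "X.ExtPos/upos" into ("X", ["ExtPos", "upos"])."""
    node, feature_part = key.split(".", 1)
    return node, feature_part.split("/")

node, names = parse_key("X.ExtPos/upos")

# A node with an external POS, and a node without one
with_extpos = first_defined({"upos": "ADV", "ExtPos": "ADP"}, names)
without_extpos = first_defined({"upos": "ADP"}, names)
```

Both toy nodes end up in the same cluster `ADP`, which is the regularising effect this kind of key is meant to provide.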

Example with `ExtPos`

On UD_French-GSD, when searching for POS of a dependent of the case relation with the request pattern { X -[case]-> Y }, the clustering key Y.upos reports 7 clusters (use the Count button to see all clusters) and the clustering key Y.ExtPos/upos reports the more regular set of 2 clusters.

Example with corrected features

This kind of clustering key can also be useful with features Correct{feature} (see UD guidelines). For instance on UD_French-GSD, with the request

pattern { X -[amod]-> Y ; X.Gender <> Y.Gender}

and the two clustering keys X.CorrectGender/Gender and Y.CorrectGender/Gender, we can observe in more detail the Gender agreement between two nodes related by amod.

Clustering by relative order of nodes

With a clustering key X1#X2#X3 where X1, X2 and X3 are nodes from the pattern part of the request, the occurrences are clustered according to the relative order of the nodes and clusters are identified by X1 << X2 << X3, X2 << X1 << X3… This can be used with any number of nodes.
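The cluster names can be reproduced with a short Python sketch that sorts node names by their position in the sentence. The function `order_label` and the position dictionary are illustrative assumptions, not part of Grew.

```python
def order_label(positions):
    """Build a cluster name such as "S << V << O".

    `positions` maps each node name of the clustering key to its
    linear position in the sentence (hypothetical representation).
    """
    ordered = sorted(positions, key=positions.get)
    return " << ".join(ordered)

# Key V#S#O on an occurrence where the subject comes first and the object last
label = order_label({"V": 3, "S": 1, "O": 5})
```

For n nodes there are at most n! such names, which gives the 6 possible orders for a key with three nodes.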

Example: Verb, Subject, Object ordering

On UD treebanks, with the request:

pattern {
  V[upos=VERB];
  V -[nsubj]-> S;
  V -[obj]-> O;
}

and with the clustering key V#S#O, we can observe the occurrences of the 6 possible orders SVO, SOV… on UD_Latin-Perseus.

Example: positions of copula and adposition sharing the same head

pattern {
  HEAD [];
  COP [upos=AUX]; HEAD -[cop]-> COP;
  ADP [upos=ADP]; HEAD -[case|mark]-> ADP;
}

Use this request on UD_French-GSD with the clustering key HEAD#COP#ADP to observe the relative positions of the head, the copula and the adposition.

Clustering on how two nodes are related (or not)

With the clustering key X -> Y, the occurrences are clustered according to the relation from X to Y; a cluster named __none__ collects the cases where there is no relation from X to Y. If there is more than one such relation, another cluster __multi__ is added. Note that the __multi__ cluster never appears in dependency syntax, but it may appear in other contexts such as enhanced UD or semantic graphs.
With the clustering key X <-> Y, the occurrences are clustered according to the relation between X and Y (in either direction); if the direction is from Y to X, the relation name is prefixed with a minus sign, as in -nsubj or -mark:rel. A cluster called __none__ contains the cases where there is no relation between X and Y. If there is more than one such relation, another cluster __multi__ is added.
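The naming rules for these two keys can be sketched in Python as follows. This is an illustration under assumptions: the graph is represented as a list of (source, label, target) triples, the function names are hypothetical, and for X <-> Y the sketch counts edges in both directions when deciding on `__multi__`.

```python
def _name(labels):
    """Turn the list of relations found into a cluster name."""
    if not labels:
        return "__none__"
    if len(labels) > 1:
        return "__multi__"
    return labels[0]

def directed_label(edges, x, y):
    """Cluster name for the key X -> Y."""
    return _name([label for (src, label, tgt) in edges if src == x and tgt == y])

def undirected_label(edges, x, y):
    """Cluster name for the key X <-> Y; edges from Y to X get a minus prefix."""
    forward = [label for (src, label, tgt) in edges if src == x and tgt == y]
    backward = ["-" + label for (src, label, tgt) in edges if src == y and tgt == x]
    return _name(forward + backward)

# Toy graph: a noun governs its determiner, a verb governs the noun
edges = [("noun", "det", "det_tok"), ("verb", "nsubj", "noun")]

from_noun = directed_label(edges, "noun", "det_tok")
from_det = directed_label(edges, "det_tok", "noun")
either_way = undirected_label(edges, "det_tok", "noun")
```

With the directed key, swapping the two nodes turns the cluster `det` into `__none__`, while the undirected key still reports the relation, as `-det`.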

Annotation of a bigram DET NOUN

With a clustering key Y -> X and the pattern:

pattern { X [upos=DET]; Y [upos=NOUN]; X < Y }

we can observe how the bigram is annotated on UD_German-GSD.

Annotation of a bigram NOUN NOUN

With a clustering key X <-> Y and the pattern:

pattern { X [upos=NOUN]; Y [upos=NOUN]; X < Y }

we can observe how the bigram NOUN-NOUN is annotated on UD_Chinese-GSD or on bUD_English-GUM (bUD is the version of the treebank without the enhanced dependency layer).

Clustering on the metadata of the sentences

[🆕 1.18.0] With a clustering key meta.f, the results are clustered according to the value of the metadata f of each matched graph.

Clustering with a sub-request (`whether`)

A whether sub-request contains a list of clauses (as in the pattern, without or with constructions). The set of occurrences is split into two subsets:

one tagged No corresponds to the subset of occurrences where the whether sub-request cannot be fulfilled (the whether is interpreted like a without)
one tagged Yes is the complement of the No subset and so corresponds to the occurrences where the sub-request can be matched (the whether is interpreted like a with)

Note that no curly brackets are needed in the whether text area (see examples below).
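The Yes/No split can be pictured with the following Python sketch, in which an ordinary predicate stands in for the whether sub-request. The data model (a matching as a dictionary of token positions) and the function name `cluster_whether` are assumptions made for illustration; this is not how Grew evaluates sub-requests.

```python
def cluster_whether(matchings, sub_request_holds):
    """Partition matchings into "Yes" and "No" clusters.

    `sub_request_holds` is a hypothetical predicate telling whether the
    sub-request can be matched for a given occurrence.
    """
    clusters = {"Yes": [], "No": []}
    for matching in matchings:
        key = "Yes" if sub_request_holds(matching) else "No"
        clusters[key].append(matching)
    return clusters

# Toy occurrences of a relation H -> D, with the positions of H and D
matchings = [{"H": 1, "D": 4}, {"H": 6, "D": 2}, {"H": 3, "D": 5}]

# Stand-in for a sub-request checking that the head precedes its dependent
clusters = cluster_whether(matchings, lambda m: m["H"] < m["D"])
counts = {name: len(items) for name, items in clusters.items()}
```

Every occurrence lands in exactly one of the two clusters, so the two counts always add up to the total number of occurrences.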

Examples

Is advcl left-headed in UD_Hungarian-Szeged?
In UD_English-GUM, how often does the relation expl appear with or without an nsubj relation with the same head?
In UD_French-GSD, there are 618 left-headed nsubj (or subtypes):
- How often is it in an interrogative sentence? (NB: We approximate interrogative with the presence of “?”)
- How often is it in a relative clause?
- How often is there an expletive subject?
