# Clustering

Grew provides a general clustering mechanism which can be used in several contexts to split the set of matchings produced by a request on a corpus into subsets (a partition) according to some criteria.

## Clustering with a key

### Clustering on a node feature

With the clustering key N.f, the matchings are clustered according to the value of the feature named f on the node N, which must be present in the (matching part of the) main request. If the feature is not defined for some matchings, a cluster with the value __undefined__ is added.
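
The grouping described above can be pictured with a minimal Python sketch. This is not how Grew is implemented: the representation of a matching as a dictionary from node names to feature dictionaries, and the function name `cluster_by_feature`, are assumptions made for illustration only.

```python
from collections import defaultdict

UNDEFINED = "__undefined__"

def cluster_by_feature(matchings, node, feature):
    """Group matchings by the value of `feature` on the node bound to `node`.

    Matchings where the feature is missing go to the __undefined__ cluster.
    """
    clusters = defaultdict(list)
    for matching in matchings:
        value = matching[node].get(feature, UNDEFINED)
        clusters[value].append(matching)
    return dict(clusters)

# Toy matchings (hypothetical data): node name -> features of the matched token.
matchings = [
    {"N": {"form": "is", "lemma": "be"}},
    {"N": {"form": "has", "lemma": "have"}},
    {"N": {"form": "are", "lemma": "be"}},
    {"N": {"form": "'s"}},  # no lemma feature on this token
]
clusters = cluster_by_feature(matchings, "N", "lemma")
sizes = {key: len(items) for key, items in clusters.items()}
```

With the toy data, the key N.lemma yields three clusters: two named after lemma values and one named __undefined__.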

#### Examples

• List lemmas of auxiliaries in UD_Polish-LFG
• List VerbForm of VERB without nsubj in UD_German-GSD
• Find the large number of forms associated with the lemma saada in UD_Finnish-FTB

### Clustering on an edge feature

With the clustering key e.f, the matchings are clustered according to the value of the feature named f on the edge e, which must be present in the (matching part of the) main request. If the feature is not defined for some matchings, a cluster with the value __undefined__ is added.

#### Example

• List sub-relations used with acl relation in UD_Swedish-Talbanken

### Clustering on the full label of an edge

With the clustering key e.label, the matchings are clustered according to the full label of the edge e, which must be present in the (positive part of the) main request. Note that the way the label value is reported depends on the configuration used.

#### Example

• List relations used for auxiliaries in UD_Italian-ParTUT

### Clustering on an edge length

The clustering key e.length makes clusters according to the length of edge e; the clustering key e.delta makes clusters according to the relative positions of the governor and the dependent of edge e.
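
To make the difference between the two keys concrete, here is a small Python sketch. It rests on assumptions that the text above does not spell out: positions are integer token indexes, the length is the unsigned distance between the two ends of the edge, and the delta is the dependent position minus the governor position (the sign convention in particular is an assumption). The function names are hypothetical.

```python
from collections import Counter

def edge_length(governor_pos, dependent_pos):
    """Unsigned distance between the two ends of the edge (assumed definition)."""
    return abs(dependent_pos - governor_pos)

def edge_delta(governor_pos, dependent_pos):
    """Signed offset of the dependent relative to the governor (assumed sign)."""
    return dependent_pos - governor_pos

# Hypothetical (governor position, dependent position) pairs for matched edges.
edges = [(5, 4), (2, 3), (7, 4), (3, 2)]
by_length = Counter(edge_length(g, d) for g, d in edges)
by_delta = Counter(edge_delta(g, d) for g, d in edges)
```

Under these assumptions, the length clusters merge dependents on both sides of the governor, while the delta clusters keep the two sides apart.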

#### Examples

• Observe the length of the amod relation in UD_Korean-PUD
• Observe the relative positions of nsubj related tokens in UD_Naija-NSC

### Clustering of continuous numeric features

As suggested in #28, for a continuous numeric feature it is sensible to cluster by value intervals.

• X.feat[gap=g] will cluster the values of X.feat in packs of size g
• X.feat[gap=g, min=a, max=b] will cluster the values between a and b in packs of size g, with two additional packs: one for all values below a and one for all values above b.

#### Example

There are not many examples of numerical features in the current UD version. The following example is not linguistically pertinent but shows the mechanism.

On UD_Naija-NSC, with the clustering key N.AlignBegin[gap=1000, min=10000, max=20000] and the request:

pattern { N [AlignBegin] }


There are (up to) 12 clusters, named ]-∞, 10000[, [10000, 11000[, … [19000, 20000[ and [20000, +∞[.
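
The count of 12 and the cluster names can be reproduced with a short Python sketch. It is only an illustration of the interval arithmetic, not Grew's code: the function names are hypothetical, values are assumed to be integers, max - min is assumed to be a multiple of gap, and a value equal to max is placed in the last cluster, following the name [20000, +∞[ given above.

```python
def interval_label(value, gap, low, high):
    """Return the name of the cluster that `value` falls into."""
    if value < low:
        return f"]-∞, {low}["
    if value >= high:
        return f"[{high}, +∞["
    # Start of the pack of size `gap` that contains `value`.
    start = low + ((value - low) // gap) * gap
    return f"[{start}, {start + gap}["

def all_labels(gap, low, high):
    """List every possible cluster name, from the lowest to the highest."""
    labels = [f"]-∞, {low}["]
    labels += [f"[{start}, {start + gap}[" for start in range(low, high, gap)]
    labels.append(f"[{high}, +∞[")
    return labels

labels = all_labels(gap=1000, low=10000, high=20000)
```

Here `labels` contains 10 packs of size 1000 plus the two open-ended ones, and `interval_label(10500, 1000, 10000, 20000)` gives the name of the second cluster.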

### Clustering on successive feature names

The feature name ExtPos is used mainly when the external POS differs from the regular upos. It may therefore be useful to report the value of ExtPos if it exists and the value of upos otherwise. This is possible with the clustering key N.ExtPos/upos.
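
The fallback behaviour can be sketched in Python as follows. The helper name `cluster_value` and the dictionary representation of a node are assumptions for illustration; returning __undefined__ when none of the listed features is present is also an assumption, made by analogy with the single-feature key.

```python
def cluster_value(features, key, default="__undefined__"):
    """Return the value of the first feature in `key` that the node defines.

    `key` is the feature part of the clustering key, e.g. "ExtPos/upos".
    """
    for name in key.split("/"):
        if name in features:
            return features[name]
    return default

# Hypothetical nodes: one carries ExtPos, the other only has upos.
with_extpos = {"upos": "ADV", "ExtPos": "ADP"}
plain = {"upos": "ADP"}
values = [cluster_value(with_extpos, "ExtPos/upos"),
          cluster_value(plain, "ExtPos/upos")]
```

Both toy nodes end up in the same cluster, which is the regularising effect the key is meant to provide.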

#### Example with ExtPos

On UD_French-GSD, when searching for the POS of the dependent of a case relation with the request pattern { H -[case]-> D }, the clustering key D.upos reports 6 clusters, whereas the clustering key D.ExtPos/upos reports a more regular set of 2 clusters.

#### Example with corrected features

This kind of clustering key can also be useful with the Correct{feature} features (see UD guidelines). For instance, on UD_French-GSD, with the request:

pattern { N -[amod]-> A ; N.Gender <> A.Gender}


and the two clustering keys N.CorrectGender/Gender and A.CorrectGender/Gender, we can observe in more detail the Gender agreement between two nodes related by amod: most of the cases are linked to typos, and many of the others are annotation errors in version 2.11.

### Clustering on relative order of nodes

With a clustering key N1#N2#N3 where N1, N2 and N3 are nodes from the pattern part of the request, the occurrences are clustered according to the relative order of nodes and clusters are identified by N1 << N2 << N3, N2 << N1 << N3… This can be used with any number of nodes.
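
As an illustration of how such cluster names arise, the following Python sketch sorts the matched nodes by linear position and joins their names with <<. The function name and the position dictionary are hypothetical, and distinct positions are assumed for all nodes.

```python
from itertools import permutations

def order_label(positions):
    """Build a cluster name such as 'S << V << O' from node positions.

    `positions` maps each node name of the key to its position in the sentence.
    """
    ordered = sorted(positions, key=positions.get)
    return " << ".join(ordered)

# One hypothetical matching: subject first, then verb, then object.
label = order_label({"V": 3, "S": 1, "O": 5})

# With three nodes there are 3! = 6 possible cluster names.
possible = {" << ".join(names) for names in permutations(["V", "S", "O"])}
```

More generally, a key with n nodes can produce up to n! clusters, although many orders may not occur in a given corpus.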

#### Example: Verb, Subject, Object ordering

On UD, with the request:

pattern {
  V [upos=VERB];
  V -[nsubj]-> S;
  V -[obj]-> O;
}


and with the clustering key V#S#O, we can observe the occurrences of the 6 possible orders SVO, SOV… on UD_Latin-Perseus.

#### Example: positions of copula and adposition sharing the same head

pattern {
  H [];
  COP [upos=AUX]; H -[cop]-> COP;
  ADP [upos=ADP]; H -[case|mark]-> ADP;
}


The request above can be used on UD_French-GSD with the clustering key H#COP#ADP.

### Clustering on the relation between two nodes

• With the clustering key N1 -> N2, the occurrences are clustered according to the relation from N1 to N2; a cluster named __none__ gathers the cases where there is no relation from N1 to N2. If there is more than one such relation, another cluster named __multi__ is added. Note that with dependency syntax, the cluster __multi__ will never appear, but it can appear in other contexts such as enhanced UD or semantic graphs.

• With the clustering key N1 <-> N2, the occurrences are clustered according to the relation between N1 and N2 (whatever the direction); if the direction is from N2 to N1, the relation name is prefixed with a minus sign, as in -nsubj or -mark:rel. A cluster named __none__ gathers the cases where there is no relation between N1 and N2. If there is more than one such relation, another cluster named __multi__ is added.
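
The naming rules for these two keys can be summarised in a Python sketch. It is an illustration under assumptions, not Grew's implementation: edges are represented as hypothetical (source, label, target) triples, and for the <-> key the edges in both directions are counted together when deciding on __multi__.

```python
NONE, MULTI = "__none__", "__multi__"

def _pick(labels):
    """Map the list of candidate labels to a cluster name."""
    if not labels:
        return NONE
    if len(labels) > 1:
        return MULTI
    return labels[0]

def directed_cluster(edges, n1, n2):
    """Cluster name for the key 'N1 -> N2'."""
    return _pick([label for src, label, tgt in edges if (src, tgt) == (n1, n2)])

def undirected_cluster(edges, n1, n2):
    """Cluster name for the key 'N1 <-> N2'; reverse edges get a '-' prefix."""
    labels = [label for src, label, tgt in edges if (src, tgt) == (n1, n2)]
    labels += ["-" + label for src, label, tgt in edges if (src, tgt) == (n2, n1)]
    return _pick(labels)

# Hypothetical graph fragment: the noun N2 governs the determiner N1.
edges = [("N2", "det", "N1")]
results = (
    directed_cluster(edges, "N2", "N1"),
    directed_cluster(edges, "N1", "N2"),
    undirected_cluster(edges, "N1", "N2"),
)
```

In the toy fragment, the directed key sees the relation only when queried in the right direction, whereas the undirected key reports it with a minus prefix.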

#### Annotation of a bigram DET NOUN

With a clustering key N2 -> N1 and the pattern:

pattern { N1 [upos=DET]; N2 [upos=NOUN]; N1 < N2 }


we can observe how the bigram is annotated in UD_German-GSD.

#### Annotation of a bigram NOUN NOUN

With a clustering key N1 <-> N2 and the pattern:

pattern { N1 [upos=NOUN]; N2 [upos=NOUN]; N1 < N2 }


we can observe how the bigram NOUN-NOUN is annotated in UD_English-GUM or in UD_Chinese-GSD.

## Clustering with a sub-request (whether)

A whether sub-request contains a list of clauses (as in pattern, without or with constructions). The set of occurrences is split into two subsets:

• one tagged No corresponds to the subset of occurrences where the whether sub-request cannot be fulfilled (the whether is interpreted like a without)
• one tagged Yes is the complement of the No subset and so corresponds to the occurrences where the sub-request can be matched (the whether is interpreted like a with)

Note that no curly brackets are needed in the whether text area (see examples below).
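
The Yes/No split can be pictured with the following Python sketch. It is a simplification under stated assumptions: in Grew, the whether sub-request is a list of clauses matched against the graph, whereas here a plain Python predicate (the hypothetical `can_be_matched`) stands in for that check, and matchings are reduced to hypothetical node positions.

```python
def cluster_whether(matchings, can_be_matched):
    """Split matchings into a 'Yes' and a 'No' cluster.

    `can_be_matched` stands in for the whether sub-request: it returns True
    when the sub-request can be fulfilled for the given matching.
    """
    clusters = {"Yes": [], "No": []}
    for matching in matchings:
        key = "Yes" if can_be_matched(matching) else "No"
        clusters[key].append(matching)
    return clusters

# Hypothetical matchings of a head H and a dependent D, given as positions.
matchings = [{"H": 2, "D": 7}, {"H": 9, "D": 4}, {"H": 1, "D": 3}]

# Stand-in for a whether clause asking if the head precedes the dependent.
clusters = cluster_whether(matchings, lambda m: m["H"] < m["D"])
counts = {key: len(items) for key, items in clusters.items()}
```

Every matching lands in exactly one of the two clusters, so the sizes of Yes and No always add up to the total number of occurrences.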

### Examples

• Is advcl left-headed in UD_Hungarian-Szeged?
• In UD_English-GUM, how often does the relation expl appear with or without an nsubj relation with the same head?
• In UD_French-GSD, there are 627 left-headed nsubj (or subtypes):
  • How often is it in an interrogative sentence?
  • How often is it in a relative clause?
  • How often is there an expletive subject?