Clustering
Grew has a general clustering mechanism that can be used in several contexts to split the set of matchings produced by a request on a corpus into subsets (a partition) according to some criteria.
Where can it be used?
- In Grew-match, below the main textarea where the request is written, it is possible to describe either one or two clustering items.
- On the command line, clustering can be performed in the grep and count modes. For details and examples, see the documentation of the grep mode and of the count mode.
- With the Python library Grewpy, in the functions Corpus.count and Corpus.search.
Clustering with a key
Clustering on a node feature
With the clustering key N.f, the matchings are clustered according to the value of the feature named f for the node N present in the (matching part of the) main request.
If the feature is not defined for some matchings, a cluster is added with the value __undefined__.
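The grouping described above can be pictured with a small, self-contained sketch. It does not use Grew or Grewpy: the matchings are toy dictionaries invented for the illustration, and the function name cluster_by_feature is hypothetical.

```python
from collections import Counter

UNDEFINED = "__undefined__"  # cluster name used when the feature is missing


def cluster_by_feature(matchings, node, feature):
    """Count matchings per value of `feature` on the node bound to `node`.

    `matchings` is a toy representation (an assumption of this sketch):
    each matching maps a node name to a dict of its features.
    """
    counts = Counter()
    for matching in matchings:
        features = matching.get(node, {})
        counts[features.get(feature, UNDEFINED)] += 1
    return dict(counts)


# Toy data, not taken from any treebank.
matchings = [
    {"N": {"upos": "AUX", "lemma": "be"}},
    {"N": {"upos": "AUX", "lemma": "have"}},
    {"N": {"upos": "AUX", "lemma": "be"}},
    {"N": {"upos": "AUX"}},  # no lemma: goes to __undefined__
]
clusters = cluster_by_feature(matchings, "N", "lemma")
print(clusters)
```

Each distinct feature value becomes one cluster, and matchings without the feature end up in the extra __undefined__ cluster.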
Examples
- List lemmas of auxiliaries in UD_Polish-LFG
- List VerbForm of VERB without nsubj in UD_German-GSD
- Find the huge number of form values associated with the lemma saada in UD_Finnish-FTB
Clustering on an edge feature
With the clustering key e.f, the matchings are clustered according to the value of the feature named f for the edge e present in the (matching part of the) main request.
If the feature is not defined for some matchings, a cluster is added with the value __undefined__.
Example
Clustering on the full label of an edge
With the clustering key e.label, the matchings are clustered according to the full label of the edge e present in the (positive part of the) main request. Note that the way the label value is reported depends on the configuration used.
Example
Clustering on an edge length
The clustering key e.length makes clusters according to the length of the edge e; the clustering key e.delta makes clusters according to the relative positions of the governor and the dependent of the edge e.
Examples
- Observe the length of the amod relation in UD_Korean-PUD
- Observe the relative positions of nsubj related tokens in UD_Naija-NSC
Clustering of continuous numeric features
As suggested in #28, for continuous numeric features it is sensible to cluster by value intervals.
- X.feat[gap=g] will cluster the values of X.feat by packs of size g.
- X.feat[gap=g, min=a, max=b] will cluster the values between a and b by packs of size g, with two additional packs for all values < a and > b.
Example
There are not many examples of numerical features in the current UD version. The following example is not linguistically pertinent but it shows the mechanism, on UD_Naija-NSC with the clustering key N.AlignBegin[gap=1000, min=10000, max=20000] and the request:
pattern { N [AlignBegin] }
There are (up to) 12 clusters, named ]-∞, 10000[, [10000, 11000[, … [19000, 20000[ and [20000, +∞[.
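To make the interval naming concrete, here is a minimal Python sketch of how a value could be mapped to its pack. It is an illustration only, not Grew's implementation: the function name interval_label is hypothetical, the placement of a value exactly equal to max in the last pack is inferred from the cluster names above, and aligning packs on multiples of gap starting at 0 when no min is given is an assumption.

```python
def interval_label(value, gap, min_value=None, max_value=None):
    """Return the name of the pack containing `value`.

    Assumptions of this sketch: packs are half-open [low, low + gap[,
    they start at `min_value` when it is given (at 0 otherwise), and
    `max_value - min_value` is a multiple of `gap`.
    """
    if min_value is not None and value < min_value:
        return f"]-∞, {min_value}["
    if max_value is not None and value >= max_value:
        return f"[{max_value}, +∞["
    origin = min_value if min_value is not None else 0
    low = origin + (value - origin) // gap * gap
    return f"[{low}, {low + gap}["


# Same parameters as the key N.AlignBegin[gap=1000, min=10000, max=20000].
labels = {interval_label(v, 1000, 10000, 20000) for v in range(9000, 21001, 500)}
print(len(labels))
print(interval_label(10500, 1000, 10000, 20000))
```

With these parameters the sketch yields ten packs of size 1000 between the bounds plus the two open-ended packs, which matches the count of 12 given above.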
Clustering on successive feature names
The feature name ExtPos is used mainly when the external POS differs from the regular upos.
So it may be useful to report the value of ExtPos if it exists and the value of upos otherwise.
This is possible with the clustering key N.ExtPos/upos.
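The fallback behaviour of such a key can be sketched in a few lines of Python. This is only an illustration on toy feature dictionaries: the function name first_defined is hypothetical, and returning __undefined__ when none of the listed features exists is an assumption made by analogy with single-feature keys.

```python
def first_defined(features, names, undefined="__undefined__"):
    """Return the value of the first feature in `names` present in `features`."""
    for name in names:
        if name in features:
            return features[name]
    # Assumption: no listed feature is defined, so fall back to __undefined__.
    return undefined


# The part of the key after the node name, split on "/".
names = "ExtPos/upos".split("/")

with_extpos = first_defined({"upos": "ADV", "ExtPos": "ADP"}, names)
without_extpos = first_defined({"upos": "ADP"}, names)
print(with_extpos, without_extpos)
```

Both toy nodes land in the same ADP cluster, which is the kind of regularisation this key is meant to provide.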
Example with ExtPos
On UD_French-GSD, when searching for the POS of a dependent of the case relation with the request pattern { H -[case]-> D }, the clustering key D.upos reports 7 clusters (use the Count button to see all clusters) and the clustering key D.ExtPos/upos reports the more regular set of 2 clusters.
Example with corrected features
This kind of clustering key can also be useful with features Correct{feature} (see UD guidelines).
For instance on UD_French-GSD, with the request
pattern { N -[amod]-> A ; N.Gender <> A.Gender }
and the two clustering keys N.CorrectGender/Gender and A.CorrectGender/Gender, we can observe in more detail the Gender agreement between two nodes related by amod: most of the cases are linked to typos, and many of the other cases are annotation errors in version 2.12.
Clustering on relative order of nodes
With a clustering key N1#N2#N3 where N1, N2 and N3 are nodes from the pattern part of the request, the occurrences are clustered according to the relative order of the nodes, and clusters are identified by N1 << N2 << N3, N2 << N1 << N3… This can be used with any number of nodes.
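As an illustration of how such cluster names can be derived, the following sketch sorts the node names of a key by their linear position. It is not Grew's code: the function name order_label and the positions dictionary (node name to token index) are assumptions made for the example.

```python
def order_label(key, positions):
    """Build a cluster name such as 'S << V << O' from a key such as 'V#S#O'.

    `positions` maps each node name of the key to its linear position in
    the sentence (a toy representation assumed by this sketch).
    """
    names = key.split("#")
    ordered = sorted(names, key=lambda name: positions[name])
    return " << ".join(ordered)


# Toy matching: subject at position 1, verb at 3, object at 5.
label = order_label("V#S#O", {"V": 3, "S": 1, "O": 5})
print(label)
```

With n nodes in the key there are at most n! possible cluster names, for instance 6 for three nodes.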
Example: Verb, Subject, Object ordering
On UD corpora, with the request:
pattern {
  V [upos=VERB];
  V -[nsubj]-> S;
  V -[obj]-> O;
}
and with the clustering key V#S#O, we can observe the occurrences of the 6 possible orders (SVO, SOV…) on UD_Latin-Perseus.
Example: positions of copula and adposition sharing the same head
pattern {
  H [];
  COP [upos=AUX]; H -[cop]-> COP;
  ADP [upos=ADP]; H -[case|mark]-> ADP;
}
on UD_French-GSD with the clustering key H#COP#ADP.
Clustering on how two nodes are related (or not)
- With the clustering key N1 -> N2, the occurrences are clustered according to the relation from N1 to N2; a cluster named __none__ gathers the cases where there is no relation from N1 to N2. If there is more than one such relation, another cluster __multi__ is added. Note that with dependency syntax, the cluster __multi__ will never appear, but it can appear in other contexts like enhanced UD or semantic graphs.
- With the clustering key N1 <-> N2, the occurrences are clustered according to the relation between N1 and N2 (whatever the direction); if the direction is from N2 to N1, the relation name is prefixed with a minus sign, like -nsubj or -mark:rel. A cluster named __none__ gathers the cases where there is no relation between N1 and N2. If there is more than one such relation, another cluster __multi__ is added.
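The naming rules of these two keys can be summarised with a short Python sketch. It works on a toy list of (source, label, target) triples and is not Grew's implementation; the function name relation_label and its both_directions parameter are hypothetical.

```python
def relation_label(edges, n1, n2, both_directions=False):
    """Name the cluster for the relation between nodes `n1` and `n2`.

    `edges` is a list of (source, label, target) triples (toy format).
    With both_directions=False this mimics the key N1 -> N2; with True it
    mimics N1 <-> N2, prefixing relations from n2 to n1 with a minus sign.
    """
    labels = [label for (src, label, tgt) in edges if src == n1 and tgt == n2]
    if both_directions:
        labels += ["-" + label for (src, label, tgt) in edges if src == n2 and tgt == n1]
    if not labels:
        return "__none__"
    if len(labels) > 1:
        return "__multi__"
    return labels[0]


# Toy graph: node "2" governs node "1" with a det relation.
edges = [("2", "det", "1")]
print(relation_label(edges, "2", "1"))
print(relation_label(edges, "1", "2"))
print(relation_label(edges, "1", "2", both_directions=True))
```

The undirected variant is convenient when the direction of the relation is not known in advance, since the minus prefix keeps the direction visible in the cluster name.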
Annotation of a bigram DET NOUN
With the clustering key N2 -> N1 and the pattern:
pattern { N1 [upos=DET]; N2 [upos=NOUN]; N1 < N2 }
we can observe how the bigram is annotated on UD_German-GSD.
Annotation of a bigram NOUN NOUN
With the clustering key N1 <-> N2 and the pattern:
pattern { N1 [upos=NOUN]; N2 [upos=NOUN]; N1 < N2 }
we can observe how the bigram NOUN-NOUN is annotated on UD_English-GUM or on UD_Chinese-GSD.
Clustering with a sub-request (whether)
A whether sub-request contains a list of clauses (as in pattern, without or with constructions).
The set of occurrences is split into two subsets:
- one tagged No, which corresponds to the subset of occurrences where the whether sub-request cannot be fulfilled (the whether is interpreted like a without)
- one tagged Yes, which is the complement of the No subset and so corresponds to the occurrences where the sub-request can be matched (the whether is interpreted like a with)
Note that no curly brackets are needed in the whether text area (see examples below).