Grewpy tutorial: Run requests on a corpus
NOTE: this notebook requires Grewpy version 0.7 (see upgrade info)
import grewpy
from grewpy import Corpus, Request
grewpy.set_config("sud") # ud or basic
connected to port: 48385
Import data
The Corpus constructor takes a conllu file or a directory containing conllu files.
A Corpus lets you run requests against the treebank and count their occurrences.
treebank_path = "SUD_English-PUD"
corpus = Corpus(treebank_path)
print(type(corpus))
<class 'grewpy.corpus.Corpus'>
n_sentences = len(corpus)
sent_ids = corpus.get_sent_ids()
print(f"{n_sentences = }")
print(f"{sent_ids[0] = }")
n_sentences = 1000
sent_ids[0] = 'n01001011'
Explore data
See the Grew-match tutorial to practice writing Grew requests
Count the number of subjects in the corpus
req1 = Request("pattern { X-[subj]->Y }")
corpus.count(req1)
1419
It is possible to extend an already existing request with the methods pattern, without and with_ (because with is a Python keyword).
Hence, the request req1bis below is equivalent to req1.
req1bis = Request().pattern("X-[subj]->Y")
corpus.count(req1bis)
1419
Count the number of subject relations whose dependent Y (the head word of the subject) is not a pronoun
req2 = Request().pattern("X-[subj]->Y").without("Y[upos=PRON]")
corpus.count(req2)
931
Count the number of subjects with at least one dependent
Note the usage of with_ (because with is a Python keyword)
req3 = Request().pattern("X-[subj]->Y").with_("Y->Z")
corpus.count(req3)
752
with_ and without clauses can be stacked
req4 = Request().pattern("X-[subj]->Y").with_("Y->Z").without("Y[upos=PRON]").without("X[upos=VERB]")
corpus.count(req4)
317
Building a request with the Grew syntax
It is possible to build a request directly from the concrete syntax used in Grew-match or in Grew rules.
The request req4 can be written:
req4bis = Request("""
pattern { X-[subj]->Y }
with { Y->Z }
without { Y[upos=PRON] }
without { X[upos=VERB] }
""")
corpus.count(req4bis)
317
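When many variants of a request have to be generated, it can be convenient to assemble the concrete syntax as a string first. The sketch below is plain Python and does not call grewpy; build_request_text is a hypothetical helper introduced here for illustration, not part of the library.

```python
def build_request_text(pattern, with_=(), without=()):
    """Assemble Grew concrete syntax from clause bodies (hypothetical helper)."""
    lines = [f"pattern {{ {pattern} }}"]
    lines += [f"with {{ {clause} }}" for clause in with_]
    lines += [f"without {{ {clause} }}" for clause in without]
    return "\n".join(lines)


# Rebuild the text of req4bis from its four clauses.
req4_text = build_request_text(
    "X-[subj]->Y",
    with_=["Y->Z"],
    without=["Y[upos=PRON]", "X[upos=VERB]"],
)
print(req4_text)
```

The resulting string has the same clauses as req4bis and could be passed to Request in the same way.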
More complex queries are allowed, with result clustering
See Clustering for more documentation. Below, we cluster the subject relation, according to the POS of the governor.
req5 = Request("pattern {X-[subj]->Y}")
corpus.count(req5, clustering_keys=["X.upos"])
{'VERB': 825, 'SCONJ': 1, 'PART': 5, 'NOUN': 3, 'AUX': 581, 'ADP': 3, 'ADJ': 1}
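Since the clustered count is an ordinary Python dictionary, it can be post-processed without grewpy. As a minimal sketch, the snippet below ranks the clusters and computes each one's share of the total, using the counts copied from the output above.

```python
# Counts copied from the output above (subjects clustered by X.upos).
counts = {'VERB': 825, 'SCONJ': 1, 'PART': 5, 'NOUN': 3,
          'AUX': 581, 'ADP': 3, 'ADJ': 1}

# The cluster sizes add up to the unclustered count of the same request.
total = sum(counts.values())

# Sort clusters from most to least frequent and compute their share of the total.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
shares = {upos: n / total for upos, n in ranked}

for upos, n in ranked:
    print(f"{upos:<6} {n:>5} {shares[upos]:6.1%}")
```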
Clustering results by other requests
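The clustering is done on the relative position of X and Y.
It answers the question: how many subjects are in a pre-verbal position? Since X << Y holds when the governor X precedes the subject Y, the pre-verbal subjects are those in the 'No' cluster.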
corpus.count(req5, clustering_keys=["{X << Y}"])
{'Yes': 76, 'No': 1343}
This example corresponds to the whether clustering in Grew-match.
Note that here curly braces are required around X << Y to indicate that whether clustering should be performed instead of key clustering.
Two clusterings can be applied
The behavior of this feature has changed in Grewpy version 0.7. See here for more details.
corpus.count(req5, clustering_keys=["{X << Y}","X.upos"])
{('Yes', 'VERB'): 44,
('Yes', 'SCONJ'): 1,
('Yes', 'AUX'): 30,
('Yes', 'ADP'): 1,
('No', 'VERB'): 781,
('No', 'PART'): 5,
('No', 'NOUN'): 3,
('No', 'AUX'): 551,
('No', 'ADP'): 2,
('No', 'ADJ'): 1}
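With several clustering keys, the result is keyed by tuples. As a sanity check, summing over one component of the tuple should give back the corresponding single-key clustering. The sketch below does this in plain Python on the counts copied from the output above; marginalize is a hypothetical helper, not a grewpy function.

```python
from collections import Counter

# Counts copied from the output above: keys are (X << Y, X.upos) tuples.
pair_counts = {
    ('Yes', 'VERB'): 44, ('Yes', 'SCONJ'): 1, ('Yes', 'AUX'): 30,
    ('Yes', 'ADP'): 1, ('No', 'VERB'): 781, ('No', 'PART'): 5,
    ('No', 'NOUN'): 3, ('No', 'AUX'): 551, ('No', 'ADP'): 2,
    ('No', 'ADJ'): 1,
}


def marginalize(counts, position):
    """Sum a tuple-keyed count table, keeping only the key component at `position`."""
    totals = Counter()
    for key, n in counts.items():
        totals[key[position]] += n
    return dict(totals)


by_order = marginalize(pair_counts, 0)  # collapse X.upos, keep X << Y
by_upos = marginalize(pair_counts, 1)   # collapse X << Y, keep X.upos
print(by_order)
print(by_upos)
```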
More than two clusterings are also possible
corpus.count(req5, clustering_keys=["{X << Y}","X.upos", "{X[Number=Sing]}"])
{('Yes', 'VERB', 'Yes'): 15,
('Yes', 'VERB', 'No'): 29,
('Yes', 'SCONJ', 'No'): 1,
('Yes', 'AUX', 'Yes'): 21,
('Yes', 'AUX', 'No'): 9,
('Yes', 'ADP', 'No'): 1,
('No', 'VERB', 'Yes'): 167,
('No', 'VERB', 'No'): 614,
('No', 'PART', 'No'): 5,
('No', 'NOUN', 'Yes'): 2,
('No', 'NOUN', 'No'): 1,
('No', 'AUX', 'Yes'): 255,
('No', 'AUX', 'No'): 296,
('No', 'ADP', 'No'): 2,
('No', 'ADJ', 'No'): 1}
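If step-by-step lookup is more convenient than tuple keys, the flat table can be reshaped into nested dictionaries. This is only a sketch in plain Python: nest is a hypothetical helper, and the table below contains just a few entries copied from the output above.

```python
# A few entries copied from the three-key output above.
triple_counts = {
    ('Yes', 'VERB', 'Yes'): 15,
    ('Yes', 'VERB', 'No'): 29,
    ('No', 'VERB', 'Yes'): 167,
    ('No', 'VERB', 'No'): 614,
    ('No', 'AUX', 'Yes'): 255,
    ('No', 'AUX', 'No'): 296,
}


def nest(counts):
    """Turn {(k1, k2, ...): n} into {k1: {k2: {...: n}}}."""
    nested = {}
    for key, n in counts.items():
        level = nested
        # Walk (and create) one dictionary level per key component except the last.
        for part in key[:-1]:
            level = level.setdefault(part, {})
        level[key[-1]] = n
    return nested


nested = nest(triple_counts)
print(nested['No']['VERB'])
```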
Search occurrences
Get the list of occurrences of a given request in the corpus
occurrences = corpus.search(req1)
assert len(occurrences) == corpus.count(req1)
occurrences[0]
{'sent_id': 'w05010027',
'matching': {'nodes': {'Y': '8', 'X': '10'}, 'edges': {}}}
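Note that the node identifiers in a matching are strings, so comparing them directly gives lexicographic order ('8' sorts after '10'). The sketch below converts them to numbers before comparing; it assumes the identifiers are CoNLL-U token positions, and node_position is a hypothetical helper.

```python
# Occurrence copied from the output above.
occurrence = {'sent_id': 'w05010027',
              'matching': {'nodes': {'Y': '8', 'X': '10'}, 'edges': {}}}


def node_position(occurrence, name):
    """Return the id of the matched node `name` as a number.

    Assumption: ids are CoNLL-U token positions such as '8' or '10'.
    """
    return float(occurrence['matching']['nodes'][name])


# Does the subject Y come before its governor X in the sentence?
subject_first = node_position(occurrence, 'Y') < node_position(occurrence, 'X')
print(occurrence['sent_id'], subject_first)
```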
Get occurrences including edges
The edge is named e, and the label of the dependency is reported in the output
req6 = Request().pattern("e: X->Y; X[upos=VERB]")
corpus.search(req6)[3]
{'sent_id': 'w05010027',
'matching': {'nodes': {'Y': '12', 'X': '10'},
'edges': {'e': {'source': '10',
'label': {'1': 'comp', '2': 'obj'},
'target': '12'}}}}
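In the output above, the edge label is a dictionary of numbered components rather than a single string. A minimal sketch for turning it back into a compact label, assuming that the numbered keys '1', '2', ... are the parts of a colon-separated label (any other key is ignored here); compact_label is a hypothetical helper:

```python
# Edge copied from the output above.
edge = {'source': '10',
        'label': {'1': 'comp', '2': 'obj'},
        'target': '12'}


def compact_label(label):
    """Join the numbered components of a label with ':'.

    Assumption: keys made of digits give the order of the components;
    keys that are not numbers are skipped.
    """
    numbered = sorted((key for key in label if key.isdigit()), key=int)
    return ":".join(label[key] for key in numbered)


print(edge['source'], compact_label(edge['label']), edge['target'])
```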
As with count, we can cluster the results of a search
result = corpus.search(req6, clustering_keys=["{X << Y}"])
result.keys()
dict_keys(['Yes', 'No'])
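Assuming each key of the clustered result maps to a list of occurrences with the same shape as above, cluster sizes and sentence ids can be read off with ordinary dictionary code. The dictionary below is a toy stand-in, not real output: only the one occurrence is copied from this tutorial, and the empty 'No' list is a placeholder.

```python
# Toy stand-in for a clustered search result (assumed shape: key -> list of occurrences).
result = {
    'Yes': [
        {'sent_id': 'w05010027',
         'matching': {'nodes': {'Y': '12', 'X': '10'},
                      'edges': {'e': {'source': '10',
                                      'label': {'1': 'comp', '2': 'obj'},
                                      'target': '12'}}}},
    ],
    'No': [],  # placeholder, left empty in this toy example
}

# Number of occurrences in each cluster.
sizes = {key: len(occurrences) for key, occurrences in result.items()}

# Distinct sentence ids in each cluster.
sentences = {key: sorted({occ['sent_id'] for occ in occurrences})
             for key, occurrences in result.items()}
print(sizes)
print(sentences)
```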




