Grewpy tutorial: Run requests on a corpus

Download the notebook here.

import grewpy
from grewpy import Corpus, Request

grewpy.set_config("sud") # ud or basic

connected to port: 52937

Import data

The Corpus constructor takes a CoNLL-U file or a directory containing CoNLL-U files. A Corpus lets you run queries and count occurrences.

treebank_path = "SUD_English-PUD"
corpus = Corpus(treebank_path)
print(type(corpus))

<class 'grewpy.corpus.Corpus'>

n_sentences = len(corpus)
sent_ids = corpus.get_sent_ids()

print(f"{n_sentences = }")
print(f"{sent_ids[0] = }")

n_sentences = 1000
sent_ids[0] = 'n01001011'
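As a cross-check on get_sent_ids, the sentence identifiers can also be read directly from the CoNLL-U text, where each sentence carries a `# sent_id = ...` comment line. The sketch below is plain Python and does not use grewpy; the helper name read_sent_ids and the two-sentence sample are hypothetical and are not taken from SUD_English-PUD.

```python
def read_sent_ids(conllu_text):
    """Collect the values of '# sent_id = ...' comment lines, in file order."""
    ids = []
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            # Split on the first '=' only, in case the id itself contains '='.
            _, _, value = line.partition("=")
            ids.append(value.strip())
    return ids


# Hypothetical two-sentence sample (illustration only).
sample = """# sent_id = s1
# text = Hello
1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_

# sent_id = s2
# text = Bye
1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_
"""

print(read_sent_ids(sample))
```

On a real file, the length of the returned list should match len(corpus), assuming every sentence has a sent_id comment.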

Explore data

See the Grew-match tutorial to practice writing Grew requests

Count the number of subjects in the corpus

req1 = Request("pattern { X-[subj]->Y }")
corpus.count(req1)

An existing request can be extended with the methods pattern, without and with_ (named with_ because with is a Python keyword). Hence, the request req1bis below is equivalent to req1.

req1bis = Request().pattern("X-[subj]->Y")
corpus.count(req1bis)

Count the number of subjects whose head word (the node Y) is not a pronoun

req2 = Request().pattern("X-[subj]->Y").without("Y[upos=PRON]")
corpus.count(req2)

Count the number of subjects with at least one dependant

Note the usage of with_ (because with is a Python keyword)

req3 = Request().pattern("X-[subj]->Y").with_("Y->Z")
corpus.count(req3)

`with` and `without` items can be stacked

req4 = Request().pattern("X-[subj]->Y").with_("Y->Z").without("Y[upos=PRON]").without("X[upos=VERB]")
corpus.count(req4)

Building a request with the raw Grew syntax

It is possible to build a request directly from the concrete syntax used in Grew-match or in Grew rules. The request req4 can be written as:

req4bis = Request("""
pattern { X-[subj]->Y }
with { Y->Z }
without { Y[upos=PRON] }
without { X[upos=VERB] }
""")
corpus.count(req4bis)
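To make the correspondence between the chained methods and the concrete syntax explicit, here is a small sketch that assembles the request text from its parts. The helper build_request_text is hypothetical (it is not part of grewpy) and only does string formatting.

```python
def build_request_text(pattern, with_=(), without=()):
    """Assemble Grew concrete syntax: one pattern clause, then with/without clauses."""
    lines = [f"pattern {{ {pattern} }}"]
    lines += [f"with {{ {item} }}" for item in with_]
    lines += [f"without {{ {item} }}" for item in without]
    return "\n".join(lines)


text = build_request_text(
    "X-[subj]->Y",
    with_=["Y->Z"],
    without=["Y[upos=PRON]", "X[upos=VERB]"],
)
print(text)
```

The resulting string has the same four clauses as req4bis and could be passed to Request(text) in the same way.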

More complex queries are possible, with clustering of the results

See Clustering for more documentation. Below, we cluster the subject relation, according to the POS of the governor.

req5 = Request("pattern {X-[subj]->Y}")
corpus.count(req5, clustering_parameter=["X.upos"])

{'VERB': 825, 'SCONJ': 1, 'PART': 5, 'NOUN': 3, 'AUX': 581, 'ADP': 3, 'ADJ': 1}
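Since the clustered count is an ordinary Python dictionary, it can be post-processed directly. The sketch below copies the output shown above into a variable (rather than calling grewpy) and prints each governor POS with its share of all subject relations.

```python
# Clustered counts copied from the output above.
counts = {'VERB': 825, 'SCONJ': 1, 'PART': 5, 'NOUN': 3, 'AUX': 581, 'ADP': 3, 'ADJ': 1}

total = sum(counts.values())
print(f"total = {total}")

# Most frequent governor POS first, with its proportion of the total.
for upos, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{upos:<6}{n:>5}{n / total:>8.1%}")
```

The total should equal the unclustered count of the same request.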

Clustering results by other requests

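The clustering is done on the relative position of X and Y. It answers the question: how many subjects are in a pre-verbal position? Because X << Y holds when the governor X precedes the subject Y, pre-verbal subjects are the ones counted under 'No'.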

corpus.count(req5, clustering_parameter=["{X << Y}"])

{'Yes': 76, 'No': 1343}

This example corresponds to the whether clustering in Grew-match. Note that here curly braces are required around X << Y to indicate that whether clustering should be performed instead of key clustering.

Two clusterings can be applied

corpus.count(req5, clustering_parameter=["{X << Y}","X.upos"])

{'Yes': {'VERB': 44, 'SCONJ': 1, 'AUX': 30, 'ADP': 1},
 'No': {'VERB': 781, 'PART': 5, 'NOUN': 3, 'AUX': 551, 'ADP': 2, 'ADJ': 1}}

More than two clusterings are also possible

corpus.count(req5, clustering_parameter=["{X << Y}","X.upos", "{X[Number=Sing]}"])

{'Yes': {'VERB': {'Yes': 15, 'No': 29},
  'SCONJ': {'No': 1},
  'AUX': {'Yes': 21, 'No': 9},
  'ADP': {'No': 1}},
 'No': {'VERB': {'Yes': 167, 'No': 614},
  'PART': {'No': 5},
  'NOUN': {'Yes': 2, 'No': 1},
  'AUX': {'Yes': 255, 'No': 296},
  'ADP': {'No': 2},
  'ADJ': {'No': 1}}}
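Deeply nested results become hard to read, so it can help to flatten them into one row per combination of cluster keys. The sketch below is plain Python applied to the output shown above; the helper name flatten is hypothetical.

```python
def flatten(clusters, prefix=()):
    """Yield (key_path, count) pairs from a nested dict of counts."""
    for key, value in clusters.items():
        if isinstance(value, dict):
            # Descend one clustering level, remembering the keys seen so far.
            yield from flatten(value, prefix + (key,))
        else:
            yield prefix + (key,), value


# Three-level clustering copied from the output above.
nested = {
    'Yes': {'VERB': {'Yes': 15, 'No': 29},
            'SCONJ': {'No': 1},
            'AUX': {'Yes': 21, 'No': 9},
            'ADP': {'No': 1}},
    'No': {'VERB': {'Yes': 167, 'No': 614},
           'PART': {'No': 5},
           'NOUN': {'Yes': 2, 'No': 1},
           'AUX': {'Yes': 255, 'No': 296},
           'ADP': {'No': 2},
           'ADJ': {'No': 1}},
}

rows = list(flatten(nested))
for path, count in rows:
    print(" / ".join(path), count)
```

Each row pairs a tuple of cluster keys, in the order of clustering_parameter, with its count, which makes the result easy to load into a table.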

Search occurrences

Get the list of occurrences of a given request in the corpus

occurrences = corpus.search(req1)
assert len(occurrences) == corpus.count(req1)
occurrences[0]

{'sent_id': 'w05010027',
 'matching': {'nodes': {'Y': '8', 'X': '10'}, 'edges': {}}}

Get occurrences including edges

The edge is named e, and the label of the dependency is reported in the output

req6 = Request().pattern("e: X->Y; X[upos=VERB]")
corpus.search(req6)[3]

{'sent_id': 'w05010027',
 'matching': {'nodes': {'Y': '12', 'X': '10'},
  'edges': {'e': {'source': '10',
    'label': {'1': 'comp', '2': 'obj'},
    'target': '12'}}}}
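The edge label is returned as a dictionary rather than a string. The sketch below rebuilds a compact label from it, assuming that the numbered keys '1', '2', ... hold the colon-separated parts of the relation name (so the label above would read comp:obj). The helper label_to_string is hypothetical and ignores any non-numbered keys.

```python
def label_to_string(label):
    """Join the numbered parts of an edge label dict with ':' (assumed convention)."""
    numbered = sorted((k for k in label if k.isdigit()), key=int)
    return ":".join(label[k] for k in numbered)


# Occurrence copied from the output above.
occurrence = {
    'sent_id': 'w05010027',
    'matching': {'nodes': {'Y': '12', 'X': '10'},
                 'edges': {'e': {'source': '10',
                                 'label': {'1': 'comp', '2': 'obj'},
                                 'target': '12'}}},
}

edge = occurrence['matching']['edges']['e']
print(occurrence['sent_id'], edge['source'], label_to_string(edge['label']), edge['target'])
```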

As with `count`, we can cluster the results of a `search`

result = corpus.search(req6, clustering_parameter=["{X << Y}"])
result.keys()

dict_keys(['Yes', 'No'])
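A clustered search can be summarized by the size of each cluster. The sketch below assumes that each cluster value is a list of occurrences with the same shape as in the unclustered search. The toy_result dictionary is an illustration built by hand from the single occurrence shown earlier, not real output, and cluster_sizes is a hypothetical helper.

```python
def cluster_sizes(clustered):
    """Map each cluster key to its number of occurrences (assumes list-valued clusters)."""
    return {key: len(matches) for key, matches in clustered.items()}


# Hand-built illustration, not actual grewpy output.
toy_result = {
    'Yes': [
        {'sent_id': 'w05010027',
         'matching': {'nodes': {'Y': '12', 'X': '10'},
                      'edges': {'e': {'source': '10',
                                      'label': {'1': 'comp', '2': 'obj'},
                                      'target': '12'}}}},
    ],
    'No': [],
}

print(cluster_sizes(toy_result))
```

On a real result, these sizes should agree with corpus.count called with the same request and clustering_parameter.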
