Grewpy tutorial: counting requests on a list of corpora
import grewpy
from grewpy import Corpus, Request
grewpy.set_config("ud")
connected to port: 40137
Below, we define the list of corpora to be used.
We assume that there is a local folder (or a link to one) named ud-treebanks-v2.15
containing the data of the corresponding UD release.
Each corpus is identified by its corpus_id, the name of its subfolder in that folder. Requests are defined by a list of pairs; each pair contains a request_id and the request code.
folder = "ud-treebanks-v2.15"
corpus_list = [
"UD_Arabic-PUD",
"UD_Chinese-PUD",
"UD_Czech-PUD",
"UD_English-PUD",
"UD_Finnish-PUD",
"UD_French-PUD",
"UD_German-PUD",
"UD_Hindi-PUD",
"UD_Icelandic-PUD",
"UD_Indonesian-PUD",
"UD_Italian-PUD",
"UD_Japanese-PUD",
"UD_Korean-PUD",
"UD_Polish-PUD",
"UD_Portuguese-PUD",
"UD_Russian-PUD",
"UD_Spanish-PUD",
"UD_Swedish-PUD",
"UD_Thai-PUD",
"UD_Turkish-PUD"
]
request_codes = [
("SV", "pattern { V -[nsubj]-> S; S << V }"),
("VS", "pattern { V -[nsubj]-> S; V << S }"),
]
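The two requests above share the same edge and differ only in the order constraint between the two nodes. As a minimal sketch, the same kind of pair list could be generated for another relation by plain string formatting. The helper name `order_requests` and its parameters are hypothetical and not part of grewpy; the only request syntax used is the one already shown above.

```python
def order_requests(relation, dep="D", head="V"):
    """Build (request_id, request_code) pairs for both linear orders.

    Hypothetical helper, not part of grewpy: it only reuses the request
    syntax shown above, with the relation label as a parameter.
    """
    edge = f"{head} -[{relation}]-> {dep}"
    return [
        # dependent before head, e.g. "SV"
        (f"{dep}{head}", f"pattern {{ {edge}; {dep} << {head} }}"),
        # head before dependent, e.g. "VS"
        (f"{head}{dep}", f"pattern {{ {edge}; {head} << {dep} }}"),
    ]

# Rebuilds the two subject requests defined above.
subject_requests = order_requests("nsubj", dep="S")
for request_id, code in subject_requests:
    print(request_id, code, sep="\t")
```

With `relation="nsubj"` and `dep="S"`, the result is identical to the `request_codes` list above, so it can be passed to the counting code unchanged.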
The code below prints TSV data on stdout, with one line for each corpus and one column for each request, giving the corresponding number of occurrences.
tab = '\t'
# Build a Request object for each code, keeping its identifier for the header row.
request_list = [(request_id, Request(code)) for (request_id, code) in request_codes]
request_ids = [request_id for (request_id, _) in request_codes]
print(f'Corpus{tab}{tab.join(request_ids)}')
for corpus_id in corpus_list:
    corpus = Corpus(f'{folder}/{corpus_id}')
    occurrences = [str(corpus.count(request)) for (_, request) in request_list]
    print(f'{corpus_id}{tab}{tab.join(occurrences)}')
    corpus.clean()  # free unused corpus from memory
Corpus SV VS
UD_Arabic-PUD 545 825
UD_Chinese-PUD 1767 5
UD_Czech-PUD 987 258
UD_English-PUD 1339 53
UD_Finnish-PUD 1018 86
UD_French-PUD 1354 63
UD_German-PUD 1209 273
UD_Hindi-PUD 1121 6
UD_Icelandic-PUD 1513 282
UD_Indonesian-PUD 1415 113
UD_Italian-PUD 1023 103
UD_Japanese-PUD 1446 0
UD_Korean-PUD 1545 1
UD_Polish-PUD 857 206
UD_Portuguese-PUD 1227 58
UD_Russian-PUD 1157 205
UD_Spanish-PUD 1074 116
UD_Swedish-PUD 1255 259
UD_Thai-PUD 1618 1
UD_Turkish-PUD 1233 6
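The raw counts can also be turned into proportions, which makes the two orders easier to compare across corpora. The sketch below is one possible post-processing step and is not part of the original notebook. It parses TSV text in the format printed above and computes the share of VS among all matched subject relations. The function name `vs_share` is hypothetical, and the three sample rows are copied from the output above.

```python
import csv
import io

def vs_share(tsv_text):
    """Map each corpus to VS / (SV + VS), from TSV with columns Corpus, SV, VS."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    shares = {}
    for row in reader:
        sv, vs = int(row["SV"]), int(row["VS"])
        total = sv + vs
        # Guard against a corpus where neither request matches.
        shares[row["Corpus"]] = vs / total if total else float("nan")
    return shares

# Three rows copied from the output above.
sample = (
    "Corpus\tSV\tVS\n"
    "UD_Arabic-PUD\t545\t825\n"
    "UD_English-PUD\t1339\t53\n"
    "UD_Japanese-PUD\t1446\t0\n"
)
for corpus_id, share in vs_share(sample).items():
    print(f"{corpus_id}\t{share:.3f}")
```

In the full table above, UD_Arabic-PUD is the only corpus where VS outnumbers SV (825 of 1370 matches).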