Building relation tables on your treebank
By “relation table”, we mean a table like the ones available through Grew-match: example on UD_French-PUD, version 2.14 (select a relation on the left).
The simplest way to compute this kind of table on your own corpus is to use the Python library grewpy. It is also possible to do the same with the Command Line Interface.
For this example, we suppose that we have a subfolder data which contains the file fr_pud-ud-test.conllu (version 2.15 of the corpus UD_French-PUD, which can be downloaded here).
.
├── data
│   └── fr_pud-ud-test.conllu
With the grewpy Python library
See here for the installation of grewpy.
Table for nsubj relation
The script below loads the corpus and computes the table for the nsubj relation:
from grewpy import Corpus, Request
pud_corpus = Corpus('data/fr_pud-ud-test.conllu')
nsubj_table = pud_corpus.count (Request ('pattern {G -[nsubj]-> D}'), clustering_keys=['G.upos', 'D.upos'])
print (nsubj_table)
The output is a nested Python dictionary: the top-level keys correspond to the values of G.upos and the embedded keys to the values of D.upos. For instance, nsubj_table['VERB']['NOUN'] returns 543, which is the number of occurrences of the nsubj relation from a VERB to a NOUN:
{'X': {'NOUN': 2}, 'VERB': {'X': 2, 'SYM': 3, 'PROPN': 199, 'PRON': 470, 'NUM': 3, 'NOUN': 543, 'DET': 1, 'ADV': 1, 'ADJ': 6}, 'PROPN': {'PRON': 2, 'NOUN': 2}, 'PRON': {'PROPN': 2, 'PRON': 8, 'NOUN': 2}, 'NOUN': {'PROPN': 11, 'PRON': 26, 'NUM': 1, 'NOUN': 43, 'ADJ': 2}, 'ADV': {'PRON': 1}, 'ADJ': {'X': 1, 'VERB': 1, 'SYM': 1, 'PROPN': 10, 'PRON': 20, 'NOUN': 53, 'ADJ': 1}}
Note that the sums for rows and columns are not given but it is easy to add them in the Python code.
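The row and column sums mentioned above can be sketched in plain Python as follows. The table is copied from the output shown above; the helper name margins is only illustrative and is not part of grewpy.

```python
from collections import Counter

# Table copied from the output above: nsubj_table[G.upos][D.upos] -> count.
nsubj_table = {
    'X': {'NOUN': 2},
    'VERB': {'X': 2, 'SYM': 3, 'PROPN': 199, 'PRON': 470, 'NUM': 3,
             'NOUN': 543, 'DET': 1, 'ADV': 1, 'ADJ': 6},
    'PROPN': {'PRON': 2, 'NOUN': 2},
    'PRON': {'PROPN': 2, 'PRON': 8, 'NOUN': 2},
    'NOUN': {'PROPN': 11, 'PRON': 26, 'NUM': 1, 'NOUN': 43, 'ADJ': 2},
    'ADV': {'PRON': 1},
    'ADJ': {'X': 1, 'VERB': 1, 'SYM': 1, 'PROPN': 10, 'PRON': 20,
            'NOUN': 53, 'ADJ': 1},
}


def margins(table):
    """Return (row sums, column sums, grand total) of a nested count table.

    Illustrative helper, not a grewpy function.
    """
    # One sum per governor POS (rows).
    row_sums = {gov: sum(deps.values()) for gov, deps in table.items()}
    # One sum per dependent POS (columns), accumulated over all rows.
    col_sums = Counter()
    for deps in table.values():
        col_sums.update(deps)
    return row_sums, dict(col_sums), sum(row_sums.values())


row_sums, col_sums, total = margins(nsubj_table)
print(row_sums)
print(col_sums)
print(total)
```

Here row_sums['VERB'] is the number of nsubj relations governed by a VERB, and col_sums['NOUN'] is the number of nsubj relations whose dependent is a NOUN.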
Table for nsubj relation and its possible extensions
The example above matches nsubj only, not nsubj:pass or nsubj:caus, which are also used in UD_French-PUD. To get the table for all nsubj relations, with or without extension, the request 'G -[nsubj]-> D' should be changed to 'G -[1=nsubj]-> D' (see complex edges for an explanation).
Compute tables for all relations
It is possible to get all relation tables (without looping on edge labels) by using one more clustering key.
from grewpy import Corpus, Request
pud_corpus = Corpus('data/fr_pud-ud-test.conllu')
all_tables = pud_corpus.count (Request ("pattern {e: G -> D}"), clustering_keys=['e.label', 'G.upos', 'D.upos'])
print (all_tables)
In the code above, all_tables is a dictionary mapping each dependency label (e.label) to a sub-dictionary like the one obtained above for nsubj. For instance, the output contains:
…,
'iobj': {'VERB': {'PRON': 39}, 'ADJ': {'ADP': 1}},
'goeswith': {'NUM': {'X': 1}, 'NOUN': {'X': 1}, 'ADV': {'X': 1}},
…
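As a minimal sketch of how such a three-level dictionary can be post-processed, the code below flattens it into one row per (label, governor POS, dependent POS) cell and computes a total per relation. Only the two relations visible in the excerpt above are copied here; the helper name flatten is illustrative and not part of grewpy.

```python
# Two entries copied from the excerpt above; the real all_tables has more labels.
all_tables = {
    'iobj': {'VERB': {'PRON': 39}, 'ADJ': {'ADP': 1}},
    'goeswith': {'NUM': {'X': 1}, 'NOUN': {'X': 1}, 'ADV': {'X': 1}},
}


def flatten(tables):
    """Yield one (label, G.upos, D.upos, count) tuple per cell.

    Illustrative helper, not a grewpy function.
    """
    for label, table in tables.items():
        for gov, deps in table.items():
            for dep, count in deps.items():
                yield (label, gov, dep, count)


rows = list(flatten(all_tables))

# Total number of occurrences of each relation, over all POS pairs.
totals = {}
for label, _gov, _dep, count in rows:
    totals[label] = totals.get(label, 0) + count

for row in rows:
    print(row)
print(totals)
```

The flat rows are convenient for writing a CSV file or loading the counts into a spreadsheet.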
With the Command Line Interface
The needed requests must be declared in an external file. We therefore suppose that our folder contains two more files:
nsubj_table.req
pattern { G -[nsubj]-> D }
all_tables.req
pattern { e: G -> D }
The command below builds the JSON code of the nsubj relation table.
grew count -request nsubj_table.req -key G.upos -key D.upos -i data/fr_pud-ud-test.conllu
For all tables:
grew count -request all_tables.req -key e.label -key G.upos -key D.upos -i data/fr_pud-ud-test.conllu
Remarks
- If we want to get the list of occurrences instead of just a number, the command grew count … can be replaced by grew grep …, with the same arguments.
- The JSON obtained is slightly different from the one produced by the Python library: it contains an additional outer dictionary layer, because the command can be applied to more than one request. The output of the last command is then:
{
"all_tables.req": {
"xcomp": { … },
…
"acl": { … }
}
}
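To recover the same structure as the one returned by the Python library, the outer layer can be removed after parsing the JSON. The sketch below uses a small hand-written string as a stand-in for the command output. It assumes that the inner tables have the same shape as in the Python output, and its iobj entry is copied from the excerpt shown earlier.

```python
import json

# Stand-in for the text printed by the CLI (an assumption for illustration):
# the outer key is the request file name, the inner content is a relation table.
cli_output = """
{
  "all_tables.req": {
    "iobj": {"VERB": {"PRON": 39}, "ADJ": {"ADP": 1}}
  }
}
"""

# Parse the JSON text into nested Python dictionaries.
by_request = json.loads(cli_output)

# Drop the outer layer by selecting the table of one request file.
all_tables = by_request["all_tables.req"]

print(sorted(by_request))                 # names of the request files
print(all_tables["iobj"]["VERB"]["PRON"])
```

When several request files are given to the command, the outer dictionary would hold one such entry per file.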