Building relation tables on your treebank

Here, a “relation table” is a table like the ones available through Grew-match; see the example on UD_French-PUD, version 2.16 (select a relation on the left).

The simplest way to compute this kind of table on your own corpus is to use the Python library grewpy. It is also possible to do the same with the Command Line Interface.

For this example, we suppose that we have a subfolder data which contains the file fr_pud-ud-test.conllu (version 2.16 of the corpus UD_French-PUD, which can be downloaded here).

.
├── data
│   └── fr_pud-ud-test.conllu

With the `grewpy` Python lib

See here for the installation of grewpy.

Table for `nsubj` relation

The script below loads the corpus and computes the table for the nsubj relation:

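```python
from grewpy import Corpus, Request

# Load the corpus from the CoNLL-U file
pud_corpus = Corpus('data/fr_pud-ud-test.conllu')

# Count nsubj edges, clustered by the UPOS of the governor (G) and of the dependent (D)
nsubj_table = pud_corpus.count(
    Request('pattern {G -[nsubj]-> D}'),
    clustering_keys=['G.upos', 'D.upos'],
)

print(nsubj_table)
```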

The output is a nested Python dictionary: the top-level keys correspond to G.upos and the embedded keys correspond to D.upos. For instance, nsubj_table['VERB']['NOUN'] returns 543, the number of occurrences of the nsubj relation from a VERB to a NOUN:

{'X': {'NOUN': 2}, 'VERB': {'X': 2, 'SYM': 3, 'PROPN': 199, 'PRON': 470, 'NUM': 3, 'NOUN': 543, 'DET': 1, 'ADV': 1, 'ADJ': 6}, 'PROPN': {'PRON': 2, 'NOUN': 2}, 'PRON': {'PROPN': 2, 'PRON': 8, 'NOUN': 2}, 'NOUN': {'PROPN': 11, 'PRON': 26, 'NUM': 1, 'NOUN': 43, 'ADJ': 2}, 'ADV': {'PRON': 1}, 'ADJ': {'X': 1, 'VERB': 1, 'SYM': 1, 'PROPN': 10, 'PRON': 20, 'NOUN': 53, 'ADJ': 1}}

Note that the sums for rows and columns are not given but it is easy to add them in the Python code.
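As a minimal sketch of how these sums could be added, the code below computes row totals, column totals and the grand total for any nested dictionary of this shape. It needs only the standard library; the helper name `margins` is hypothetical, and the table is a three-row excerpt of the output above so that the example stays short.

```python
from collections import Counter

# Excerpt of the nsubj table printed above (three of its seven rows).
nsubj_table = {
    'X': {'NOUN': 2},
    'PROPN': {'PRON': 2, 'NOUN': 2},
    'ADV': {'PRON': 1},
}

def margins(table):
    """Return (row_sums, col_sums, total) for a nested {row: {col: count}} dict.

    `margins` is an illustrative helper, not part of grewpy.
    """
    # One total per governor UPOS (row).
    row_sums = {row: sum(cols.values()) for row, cols in table.items()}
    # One total per dependent UPOS (column), accumulated across all rows.
    col_sums = Counter()
    for cols in table.values():
        col_sums.update(cols)
    total = sum(row_sums.values())
    return row_sums, dict(col_sums), total

row_sums, col_sums, total = margins(nsubj_table)
print(row_sums)   # totals per G.upos
print(col_sums)   # totals per D.upos
print(total)      # total number of occurrences in the excerpt
```

The same function can be applied directly to the full `nsubj_table` returned by `count`.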

Table for `nsubj` relation and its possible extension

The example above searches for nsubj only, not for nsubj:pass and nsubj:caus, which are also used in UD_French-PUD. To get the table for all nsubj relations, with and without extension, change the request 'G -[nsubj]-> D' to 'G -[1=nsubj]-> D' (see complex edges for an explanation).
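When several relations have to be handled, the two variants of the request can be generated from the label. The sketch below only builds the request strings; the helper `relation_request` and its parameter `with_extensions` are hypothetical names introduced for illustration, not part of grewpy.

```python
def relation_request(label, with_extensions=False):
    """Build the request text for one relation label.

    Hypothetical helper: with_extensions=True uses the `1=label` form,
    which also matches extended labels such as nsubj:pass.
    """
    edge = f"1={label}" if with_extensions else label
    # Doubled braces produce literal { and } in an f-string.
    return f"pattern {{G -[{edge}]-> D}}"

exact = relation_request("nsubj")
extended = relation_request("nsubj", with_extensions=True)
print(exact)
print(extended)
```

Either string can then be passed to `Request` in place of the literal used in the script above.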

Compute tables for all relations

It is possible to get all relation tables (without looping over edge labels) by using one more clustering key.

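```python
from grewpy import Corpus, Request

# Load the corpus from the CoNLL-U file
pud_corpus = Corpus('data/fr_pud-ud-test.conllu')

# Match every edge e and cluster first by its label, then by the UPOS of G and D
all_tables = pud_corpus.count(
    Request("pattern {e: G -> D}"),
    clustering_keys=['e.label', 'G.upos', 'D.upos'],
)

print(all_tables)
```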

In the code above, all_tables is a dictionary mapping each possible value of the dependency label (e.label) to a sub-dictionary like the one obtained above for nsubj. An excerpt of the output:

…,
 'iobj': {'VERB': {'PRON': 39}, 'ADJ': {'ADP': 1}}, 
 'goeswith': {'NUM': {'X': 1}, 'NOUN': {'X': 1}, 'ADV': {'X': 1}}, 
…
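To illustrate how this three-level dictionary can be used, the sketch below flattens it into one tuple per cell and computes the total number of occurrences per relation. The data is limited to the two relations shown in the excerpt above, and the helper name `flatten` is hypothetical.

```python
# Excerpt of all_tables, limited to the two relations shown above.
all_tables = {
    'iobj': {'VERB': {'PRON': 39}, 'ADJ': {'ADP': 1}},
    'goeswith': {'NUM': {'X': 1}, 'NOUN': {'X': 1}, 'ADV': {'X': 1}},
}

def flatten(tables):
    """Yield one (label, G.upos, D.upos, count) tuple per table cell."""
    for label, table in tables.items():
        for gov_upos, row in table.items():
            for dep_upos, count in row.items():
                yield label, gov_upos, dep_upos, count

# Most frequent (label, G.upos, D.upos) combinations first.
rows = sorted(flatten(all_tables), key=lambda r: r[3], reverse=True)

# Total number of occurrences of each relation.
totals = {}
for label, _, _, count in rows:
    totals[label] = totals.get(label, 0) + count

print(rows[0])
print(totals)
```

The flat list of tuples is also a convenient starting point for writing the tables to a CSV file.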

With the Command Line Interface

The needed requests must be declared in an external file, so we suppose that our folder contains two more files:

nsubj_table.req

pattern { G -[nsubj]-> D }

all_tables.req

pattern { e: G -> D }

The command below builds the JSON code of the nsubj relation table.

grew count -request nsubj_table.req -key G.upos -key D.upos -i data/fr_pud-ud-test.conllu

For all tables:

grew count -request all_tables.req -key e.label -key G.upos -key D.upos -i data/fr_pud-ud-test.conllu

Remarks

If we want the list of occurrences instead of just a number, the command grew count … can be replaced by grew grep …, with the same arguments.
The JSON obtained is slightly different from the one returned by the Python library: it contains an additional outer dictionary layer, because the command can be applied to more than one request. The output of the last command is then:

{
  "all_tables.req": {
    "xcomp": { … },
    …
    "acl": { … }
  }
}
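As a minimal sketch, this outer layer can be removed in Python to recover the same structure as the one returned by the library. The JSON string below is an illustrative stand-in for the command output: it reuses the iobj counts shown earlier and is not a real, complete output. How the output is captured (for instance by redirecting it to a file) is left out and would be an assumption.

```python
import json

# Illustrative stand-in for the CLI output, reduced to a single relation.
cli_output = '''
{
  "all_tables.req": {
    "iobj": {"VERB": {"PRON": 39}, "ADJ": {"ADP": 1}}
  }
}
'''

# The outer keys are the request file names given on the command line.
by_request = json.loads(cli_output)

# Unwrap the outer layer to get the label -> G.upos -> D.upos -> count tables.
all_tables = by_request["all_tables.req"]

print(list(by_request))
print(all_tables["iobj"]["VERB"]["PRON"])
```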