Grew • Command Line Interface
The command used to run Grew is: grew <subcommand> [<args>]
The 5 main subcommands are:
- transform: application of a rewriting system to a set of graphs
- grep: search for a request in a corpus
- count: compute stats of a set of requests in a set of corpora
- compile: compile a set of corpora
- clean: clean a set of corpora
Other subcommands:
- version: print version numbers of the Grew OCaml library and of the Grew tool
- help: print general help
- help <subcommand>: print help for the given subcommand
There are two modes of input data: Mono corpus or Multi corpora. See here for more details about input formats.
The table below shows the input modes accepted by the main subcommands.
Input mode | transform | grep | count | compile | clean |
---|---|---|---|---|---|
Mono | ✅ | ✅ | ✅ (🆕 in 1.10) | ❌ | ❌ |
Multi | ❌ | ✅ (🆕 in 1.10) | ✅ | ✅ | ✅ |
The table below shows the output modes for the 3 main subcommands that produce output (compile and clean do not have any output).
Format | CLI arg | transform | grep | count |
---|---|---|---|---|
CoNLL-U | ∅ | ✅ (default) | ❌ | ❌ |
JSON | -json | ✅ | ✅ (default) | ✅ (default) |
CoNLL-X | -cupt / -semcor / -columns … | ✅ | ❌ | ❌ |
DOT | -dot | ✅ | ❌ | ❌ |
multi JSON | -multi_json | ✅ | ❌ | ❌ |
TSV | -tsv | ❌ | ❌ | ✅ (in some cases) |
Transform
In this mode, Grew applies a Graph Rewriting System to a graph or a set of graphs.
The full command for this mode:
grew transform [<args>]
All arguments are optional:
-grs <grs_file>
: the main file which describes the Graph Rewriting System. If no GRS is given, the empty GRS is loaded: strat main {Seq ()}
-i <input_file>
: describes the input data (CoNLL file or gr file). If no input file is given, Grew reads from stdin
-o <output_file>
: the name of the output file (CoNLL file). If no output file is given, Grew writes to stdout
-strat <name>
: the strategy used in the transformation (default value: main)
-safe_commands
: flag. It makes the rewriting process fail in case of an ineffective command
-config
: See here
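Since every argument is optional, a wrapper script only needs to pass the options it actually sets. The sketch below assembles the argument list for grew transform from the flags listed above. The function name transform_argv and the file names rules.grs and input.conllu are hypothetical, and the sketch only builds the list; it does not run Grew.

```python
from typing import List, Optional


def transform_argv(grs: Optional[str] = None,
                   input_file: Optional[str] = None,
                   output_file: Optional[str] = None,
                   strat: Optional[str] = None,
                   safe_commands: bool = False,
                   config: Optional[str] = None) -> List[str]:
    """Build the argument list for `grew transform`.

    Options left to None are omitted, so Grew falls back to its defaults
    (empty GRS, stdin, stdout, strategy `main`).
    """
    argv = ["grew", "transform"]
    valued_options = (
        ("-grs", grs),
        ("-i", input_file),
        ("-o", output_file),
        ("-strat", strat),
        ("-config", config),
    )
    for flag, value in valued_options:
        if value is not None:
            argv += [flag, value]
    if safe_commands:
        # -safe_commands is a flag: it takes no value.
        argv.append("-safe_commands")
    return argv


# Hypothetical file names, for illustration only.
print(transform_argv(grs="rules.grs", input_file="input.conllu", strat="main"))
```

On a machine where Grew is installed, the resulting list could be passed to subprocess.run(argv, check=True).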
Grep
This mode corresponds to the command line version of the Grew-match tool. Clustering is also available in the grep mode.
Without clustering
The command is:
grew grep -request <request_file> -i <input>
where:
- <request_file> is a file which describes a request
- <input> describes the data on which the search is done:
  - one corpus (Mono mode); in this case, the optional -config parameter (see here) can also be used
  - a set of corpora (Multi mode)
The output is given in JSON format.
Example with Mono input
With the following files:
- The corpus UD_French-PUD version 2.12: fr_pud-ud-test.conllu
- A request file dislocated.req with the code below:
pattern { e: M -[dislocated]-> N }
NB: because the edge from M to N is given the identifier e, information about this edge is included in the output (see below).
The command:
grew grep -request dislocated.req -i fr_pud-ud-test.conllu
produces the following JSON output:
[
{
"sent_id": "n01121051",
"matching": {
"nodes": { "N": "11", "M": "2" },
"edges": {
"e": { "source": "2", "label": "dislocated", "target": "11" }
}
}
},
{
"sent_id": "n01086031",
"matching": {
"nodes": { "N": "5", "M": "1" },
"edges": {
"e": { "source": "1", "label": "dislocated", "target": "5" }
}
}
},
{
"sent_id": "n01001011",
"matching": {
"nodes": { "N": "20", "M": "29" },
"edges": {
"e": { "source": "29", "label": "dislocated", "target": "20" }
}
}
}
]
This means that the request described in the file dislocated.req was found three times in the corpus. Each item gives the sentence identifier and the positions of the nodes and edges matched by the request.
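As an illustration of how this output can be consumed, here is a minimal Python sketch that loads the JSON shown above and prints one line per match. The variable names are illustrative, and the JSON is pasted as a string here; in practice it would be read from the standard output of grew grep.

```python
import json

# JSON output of the grew grep command above, copied as a string.
grep_output = """
[
  { "sent_id": "n01121051",
    "matching": {
      "nodes": { "N": "11", "M": "2" },
      "edges": { "e": { "source": "2", "label": "dislocated", "target": "11" } } } },
  { "sent_id": "n01086031",
    "matching": {
      "nodes": { "N": "5", "M": "1" },
      "edges": { "e": { "source": "1", "label": "dislocated", "target": "5" } } } },
  { "sent_id": "n01001011",
    "matching": {
      "nodes": { "N": "20", "M": "29" },
      "edges": { "e": { "source": "29", "label": "dislocated", "target": "20" } } } }
]
"""

matches = json.loads(grep_output)

# One tuple per match: sentence id, governor M, dependent N, label of edge e.
rows = [
    (
        m["sent_id"],
        m["matching"]["nodes"]["M"],
        m["matching"]["nodes"]["N"],
        m["matching"]["edges"]["e"]["label"],
    )
    for m in matches
]

print(len(rows), "occurrences")
for sent_id, governor, dependent, label in rows:
    print(sent_id, governor, dependent, label, sep="\t")
```

The "edges" entry is available only because the request names the edge (e: M -[dislocated]-> N).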
Note that two other options exist:
-html
: produces a new html field in each JSON item with the sentence where words impacted by the request are in a special HTML span with class highlight
-dep_dir <directory>
: produces a new file in the folder directory with the representation of the sentence with the highlighted part (as in the Grew-match tool) and a new field in each JSON item with the filename; the output is in dep format (usable with Dep2pict).
Example with Multi input
With the Multi mode data described in the example file en_fr_zh.json (which must be compiled with grew compile -i en_fr_zh.json):
{ "corpora": [
{ "id": "UD_English-PUD",
"directory": "_build",
"files": ["en_pud-ud-test.conllu"]
},
{ "id": "UD_French-PUD",
"directory": "_build",
"files": ["fr_pud-ud-test.conllu"]
},
{ "id": "UD_Chinese-PUD",
"directory": "_build",
"files": ["zh_pud-ud-test.conllu"]
} ]
}
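A description file of this shape can also be generated rather than written by hand. The sketch below is one possible way to do it in Python; it uses only the three keys visible in the example (id, directory, files), and the helper name corpora_description is hypothetical.

```python
import json


def corpora_description(corpora, directory="_build"):
    """Return a Multi mode description with one entry per (id, files) pair.

    All corpora share the same directory in this sketch, as in the example.
    """
    return {
        "corpora": [
            {"id": corpus_id, "directory": directory, "files": files}
            for corpus_id, files in corpora
        ]
    }


description = corpora_description([
    ("UD_English-PUD", ["en_pud-ud-test.conllu"]),
    ("UD_French-PUD", ["fr_pud-ud-test.conllu"]),
    ("UD_Chinese-PUD", ["zh_pud-ud-test.conllu"]),
])

# Print the description; redirect or write it to en_fr_zh.json as needed.
print(json.dumps(description, indent=2))
```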
The command:
grew grep -request dislocated.req -i en_fr_zh.json
produces the following JSON output:
{
"UD_French-PUD": [
{
"sent_id": "n01121051",
"matching": {
"nodes": { "N": "11", "M": "2" },
"edges": {
"e": { "source": "2", "label": "dislocated", "target": "11" }
}
}
},
{
"sent_id": "n01086031",
"matching": {
"nodes": { "N": "5", "M": "1" },
"edges": {
"e": { "source": "1", "label": "dislocated", "target": "5" }
}
}
},
{
"sent_id": "n01001011",
"matching": {
"nodes": { "N": "20", "M": "29" },
"edges": {
"e": { "source": "29", "label": "dislocated", "target": "20" }
}
}
}
],
"UD_English-PUD": [
{
"sent_id": "n01029007",
"matching": {
"nodes": { "N": "3", "M": "6" },
"edges": {
"e": { "source": "6", "label": "dislocated", "target": "3" }
}
}
},
{
"sent_id": "n01002058",
"matching": {
"nodes": { "N": "4", "M": "17" },
"edges": {
"e": { "source": "17", "label": "dislocated", "target": "4" }
}
}
}
],
"UD_Chinese-PUD": [
{
"sent_id": "w04010029",
"matching": {
"nodes": { "N": "25", "M": "18" },
"edges": {
"e": { "source": "18", "label": "dislocated", "target": "25" }
}
}
},
{
"sent_id": "n05002017",
"matching": {
"nodes": { "N": "4", "M": "1" },
"edges": {
"e": { "source": "1", "label": "dislocated", "target": "4" }
}
}
},
{
"sent_id": "w01116100",
"matching": {
"nodes": { "N": "14", "M": "11" },
"edges": {
"e": { "source": "11", "label": "dislocated", "target": "14" }
}
}
},
{
"sent_id": "w01107013",
"matching": {
"nodes": { "N": "16", "M": "9" },
"edges": {
"e": { "source": "9", "label": "dislocated", "target": "16" }
}
}
},
{
"sent_id": "n01070017",
"matching": {
"nodes": { "N": "8", "M": "6" },
"edges": {
"e": { "source": "6", "label": "dislocated", "target": "8" }
}
}
}
]
}
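In Multi mode the output is a dictionary keyed by corpus identifier, each value being a list with the same item structure as in Mono mode. A short Python sketch of how the number of occurrences per corpus could be derived from it is given below; to keep it short, the items are abridged to their sent_id field (the "matching" part is omitted), which is a simplification of the real output.

```python
import json

# Multi mode output from above, abridged: only "sent_id" is kept in each item.
multi_output = json.loads("""
{
  "UD_French-PUD": [
    {"sent_id": "n01121051"}, {"sent_id": "n01086031"}, {"sent_id": "n01001011"}
  ],
  "UD_English-PUD": [
    {"sent_id": "n01029007"}, {"sent_id": "n01002058"}
  ],
  "UD_Chinese-PUD": [
    {"sent_id": "w04010029"}, {"sent_id": "n05002017"}, {"sent_id": "w01116100"},
    {"sent_id": "w01107013"}, {"sent_id": "n01070017"}
  ]
}
""")

# Number of occurrences of the request in each corpus.
counts = {corpus: len(items) for corpus, items in multi_output.items()}

for corpus, n in counts.items():
    print(corpus, n, sep="\t")
```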
With clustering
In both Mono and Multi modes, if the command line additionally contains one or more clustering arguments (-key … or -whether …), the set of occurrences is recursively clustered according to the given clustering items.
See the clustering documentation page for details about the different existing clustering items.
Examples
These examples use the same files as the example without clustering above.
With -key, we can cluster the results according to the upos of the node N (the dependent).
grew grep -request dislocated.req -key N.upos -i fr_pud-ud-test.conllu
{
"PRON": [
{
"sent_id": "n01086031",
"matching": {
"nodes": { "N": "5", "M": "1" },
"edges": {
"e": { "source": "1", "label": "dislocated", "target": "5" }
}
}
}
],
"NOUN": [
{
"sent_id": "n01121051",
"matching": {
"nodes": { "N": "11", "M": "2" },
"edges": {
"e": { "source": "2", "label": "dislocated", "target": "11" }
}
}
},
{
"sent_id": "n01001011",
"matching": {
"nodes": { "N": "20", "M": "29" },
"edges": {
"e": { "source": "29", "label": "dislocated", "target": "20" }
}
}
}
]
}
With -whether, we can cluster the results according to whether the relation is left-headed.
We observe that in two cases, the governor M is before N.
grew grep -request dislocated.req -whether "M << N" -i fr_pud-ud-test.conllu
{
"Yes": [
{
"sent_id": "n01121051",
"matching": {
"nodes": { "N": "11", "M": "2" },
"edges": {
"e": { "source": "2", "label": "dislocated", "target": "11" }
}
}
},
{
"sent_id": "n01086031",
"matching": {
"nodes": { "N": "5", "M": "1" },
"edges": {
"e": { "source": "1", "label": "dislocated", "target": "5" }
}
}
}
],
"No": [
{
"sent_id": "n01001011",
"matching": {
"nodes": { "N": "20", "M": "29" },
"edges": {
"e": { "source": "29", "label": "dislocated", "target": "20" }
}
}
}
]
}
Finally, several clusterings can be applied successively. For instance:
grew grep -request dislocated.req -key N.upos -whether "M << N" -i fr_pud-ud-test.conllu
{
"PRON": {
"Yes": [
{
"sent_id": "n01086031",
"matching": {
"nodes": { "N": "5", "M": "1" },
"edges": {
"e": { "source": "1", "label": "dislocated", "target": "5" }
}
}
}
]
},
"NOUN": {
"Yes": [
{
"sent_id": "n01121051",
"matching": {
"nodes": { "N": "11", "M": "2" },
"edges": {
"e": { "source": "2", "label": "dislocated", "target": "11" }
}
}
}
],
"No": [
{
"sent_id": "n01001011",
"matching": {
"nodes": { "N": "20", "M": "29" },
"edges": {
"e": { "source": "29", "label": "dislocated", "target": "20" }
}
}
}
]
}
}
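Because each clustering item adds one level of nesting, code that reads this output has to handle an arbitrary depth. The following Python sketch walks such a structure recursively and yields the path of cluster values together with the number of occurrences at each leaf. The function name flatten is illustrative, and the items are abridged to their sent_id field.

```python
def flatten(clustered, path=()):
    """Yield (path, count) pairs from a possibly nested clustering result.

    A list is a leaf holding occurrences; a dict is one clustering level.
    """
    if isinstance(clustered, list):
        yield path, len(clustered)
    else:
        for value, sub in clustered.items():
            yield from flatten(sub, path + (value,))


# Output of the two-level clustering above, abridged to "sent_id" only.
clustered = {
    "PRON": {"Yes": [{"sent_id": "n01086031"}]},
    "NOUN": {
        "Yes": [{"sent_id": "n01121051"}],
        "No": [{"sent_id": "n01001011"}],
    },
}

leaves = list(flatten(clustered))
for path, count in leaves:
    print(" / ".join(path), count, sep="\t")
```

The same function also accepts an unclustered result (a plain list), in which case it yields a single pair with an empty path.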
Remarks:
- any longer sequence of -key … or -whether … can be used
- the relative order of clustering items is relevant (try grew grep -request dislocated.req -whether "M << N" -key N.upos -i fr_pud-ud-test.conllu)
- it is possible to combine Multi mode and clustering: grew grep -request dislocated.req -key N.upos -whether "M << N" -i en_fr_zh.json
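To see why the order of clustering items matters, the recursive grouping can be emulated in a few lines of Python. This is only an emulation under stated assumptions, not Grew's implementation: each occurrence is represented by a hypothetical record whose N.upos and "M << N" values are copied from the outputs shown above, and the nesting obtained for the swapped order is what one would expect, not verified Grew output.

```python
def cluster(items, criteria):
    """Recursively group items; the first criterion gives the outermost keys."""
    if not criteria:
        return [item["sent_id"] for item in items]
    first, rest = criteria[0], criteria[1:]
    groups = {}
    for item in items:
        groups.setdefault(item[first], []).append(item)
    return {value: cluster(group, rest) for value, group in groups.items()}


# Values taken from the -key N.upos and -whether "M << N" outputs above.
occurrences = [
    {"sent_id": "n01121051", "N.upos": "NOUN", "M << N": "Yes"},
    {"sent_id": "n01086031", "N.upos": "PRON", "M << N": "Yes"},
    {"sent_id": "n01001011", "N.upos": "NOUN", "M << N": "No"},
]

by_upos_first = cluster(occurrences, ["N.upos", "M << N"])
by_order_first = cluster(occurrences, ["M << N", "N.upos"])

print(by_upos_first)
print(by_order_first)
```

Both results contain the same three occurrences; only the nesting differs, with the first clustering item always providing the top-level keys.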
Count
This mode computes corpus statistics based on Grew-match style requests.
The input data are:
- one (Mono mode) or several (Multi mode) corpora
- one or several requests
- any number of clustering items (either key or whether)
By default, it returns JSON data made of nested dictionaries that give, for each corpus and each request, the number of occurrences, clustered according to the clustering items.
If the output dimension is 2, the statistics can be printed as a TSV table. This is the case for:
- Mono mode, any number of requests, 1 clustering item (🆕 in 1.10)
- Multi mode, any number of requests, no clustering items → a TSV table is built with the number of occurrences for each request in each corpus.
- Multi mode, 1 request, 1 clustering item → a TSV table is built with the results of the clustering (with corpora in rows and values of the clustering key in columns).
The optional -config parameter (see here) can also be used.
TODO: The set of corpora is described in a JSON file and must be compiled before running grew count.
Example with Multi mode, several requests and no clustering
Each request is described in a separate file. With the two following 1-line files:
ADJ_NOUN_pre.req:
pattern { A[upos=ADJ]; N[upos=NOUN]; N -[amod]-> A; A << N }
ADJ_NOUN_post.req:
pattern { A[upos=ADJ]; N[upos=NOUN]; N -[amod]-> A; N << A }
and the Multi mode file en_fr_zh.json:
{ "corpora": [
{ "id": "UD_English-PUD",
"directory": "_build",
"files": ["en_pud-ud-test.conllu"]
},
{ "id": "UD_French-PUD",
"directory": "_build",
"files": ["fr_pud-ud-test.conllu"]
},
{ "id": "UD_Chinese-PUD",
"directory": "_build",
"files": ["zh_pud-ud-test.conllu"]
} ]
}
After compiling the corpora: grew compile -i en_fr_zh.json
The command grew count -request ADJ_NOUN_pre.req -request ADJ_NOUN_post.req -i en_fr_zh.json
outputs the JSON data:
{
"UD_French-PUD": { "ADJ_NOUN_pre.req": 423, "ADJ_NOUN_post.req": 935 },
"UD_English-PUD": { "ADJ_NOUN_pre.req": 1114, "ADJ_NOUN_post.req": 12 },
"UD_Chinese-PUD": { "ADJ_NOUN_pre.req": 364, "ADJ_NOUN_post.req": 0 }
}
And, with the -tsv option: grew count -request ADJ_NOUN_pre.req -request ADJ_NOUN_post.req -i en_fr_zh.json -tsv
Corpus ADJ_NOUN_pre ADJ_NOUN_post
UD_English-PUD 1114 12
UD_French-PUD 423 935
UD_Chinese-PUD 364 0
which corresponds to the table:
Corpus | ADJ_NOUN_pre | ADJ_NOUN_post |
---|---|---|
UD_English-PUD | 1114 | 12 |
UD_French-PUD | 423 | 935 |
UD_Chinese-PUD | 364 | 0 |
We can then observe that in the annotations of the 3 corpora in use:
- in French, there is a weak preference for adjective position after the noun (68.9%)
- in English, there is a strong preference for adjective position before the noun (98.9%)
- in Chinese, there is a very strong preference for adjective position before the noun (100%)
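The percentages quoted above can be recomputed from the JSON output of grew count. The Python sketch below does this from the counts shown earlier; the variable names are illustrative.

```python
# JSON output of the grew count command above, as a Python dictionary.
counts = {
    "UD_French-PUD": {"ADJ_NOUN_pre.req": 423, "ADJ_NOUN_post.req": 935},
    "UD_English-PUD": {"ADJ_NOUN_pre.req": 1114, "ADJ_NOUN_post.req": 12},
    "UD_Chinese-PUD": {"ADJ_NOUN_pre.req": 364, "ADJ_NOUN_post.req": 0},
}

shares = {}
for corpus, by_request in counts.items():
    pre = by_request["ADJ_NOUN_pre.req"]
    post = by_request["ADJ_NOUN_post.req"]
    total = pre + post
    # Percentage of adjectives placed before / after the noun.
    shares[corpus] = {"pre": 100 * pre / total, "post": 100 * post / total}

for corpus, share in shares.items():
    print(f"{corpus}\tpre: {share['pre']:.1f}%\tpost: {share['post']:.1f}%")
```

For French this gives 935 / (423 + 935) ≈ 68.9% of adjectives after the noun, and for English 1114 / (1114 + 12) ≈ 98.9% before the noun.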
Example with Multi mode, one request and a key clustering of the output
With the same data as in the previous example, the following command:
grew count -request ADJ_NOUN_pre.req -key N.Number -i en_fr_zh.json -tsv
produces the TSV file:
Corpus __undefined__ Plur Sing
UD_English-PUD 0 392 722
UD_French-PUD 0 178 245
UD_Chinese-PUD 364 0 0
which corresponds to the table:
Corpus | Plur | Sing | __undefined__ |
---|---|---|---|
UD_English-PUD | 392 | 722 | 0 |
UD_French-PUD | 178 | 245 | 0 |
UD_Chinese-PUD | 0 | 0 | 364 |
Example with Multi mode, one request and a whether clustering of the output
Using a whether clustering, with the request ADJ_NOUN.req:
pattern { A[upos=ADJ]; N[upos=NOUN]; N -[amod]-> A; }
and the command: grew count -request ADJ_NOUN.req -whether "A << N" -i en_fr_zh.json -tsv
we obtain the TSV file:
Corpus No Yes
UD_English-PUD 12 1114
UD_French-PUD 935 423
UD_Chinese-PUD 0 364
which corresponds to the table:
Corpus | No | Yes |
---|---|---|
UD_English-PUD | 12 | 1114 |
UD_French-PUD | 935 | 423 |
UD_Chinese-PUD | 0 | 364 |
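The TSV output is convenient for further processing. As a minimal sketch, assuming the fields are separated by tab characters and the first column is named Corpus as shown above, the table can be loaded into a dictionary with the Python standard library:

```python
import csv
import io

# TSV output from above, with tab-separated fields (assumed).
tsv_output = (
    "Corpus\tNo\tYes\n"
    "UD_English-PUD\t12\t1114\n"
    "UD_French-PUD\t935\t423\n"
    "UD_Chinese-PUD\t0\t364\n"
)

reader = csv.DictReader(io.StringIO(tsv_output), delimiter="\t")

# Map each corpus to its counts, converting the cell values to integers.
table = {
    row["Corpus"]: {key: int(value) for key, value in row.items() if key != "Corpus"}
    for row in reader
}

for corpus, cells in table.items():
    print(corpus, cells)
```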
Remarks
- Only one request is used in case of clustering.
- Request syntax can be learned here or with the online Grew-match tool, first with the tutorial and then with snippets given on the right of the text area.
- If a corpus is updated, the compilation step must be run again.
- Some requests may take a long time to be searched in corpora.
- ⚠️ In previous versions (< 1.10), the TSV table also contained a column with the size of corpora (in number of sentences). This column is no longer available in version 1.10.
Compile
For the Grew-match backend (grew_match_back) or for the command grew count, corpora must first be compiled.
For these two usages, sets of corpora are described in a JSON file.
For compilation, the command is:
grew compile -i <corpora.json>
Note that this produces, for each corpus, a new file with the .marshal extension stored in the corpus directory.
The .marshal file is computed only if the corpus has changed since the last compilation.
Clean
The command below removes the .marshal files produced by the grew compile command for the set of corpora described in the JSON file corpora.json:
grew clean -i <corpora.json>
Parameters
This section describes a few command line arguments that are shared by several commands.
-config
The config value can be: ud, sud, sequoia or basic. The default value is ud.
This parameter modifies how CoNLL-U and GRS files are interpreted. More precisely, it controls:
- How edge labels are parsed (for instance, taking the @ extension into account in SUD). See here for a detailed description.
- How features are stored in CoNLL-U (column FEATS or column MISC). See here for details.
This parameter is used in the transform, grep and count modes.