Input data
The Grew command line interface accepts two different ways of describing the input linguistic data:
- the Mono mode where one corpus is considered
- the Multi mode where a set of corpora is considered
The table below shows which modes are compatible with each subcommand.
| | transform | grep | count | compile | clean |
|---|---|---|---|---|---|
| Mono | ✅ | ✅ | ✅ (🆕 in 1.10) | ❌ | ❌ |
| Multi | ❌ | ✅ (🆕 in 1.10) | ✅ | ✅ | ✅ |
The Multi mode is also used in Grew-match to describe the set of corpora that can be queried.
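As a quick reference, the compatibility table above can be transcribed into a small lookup. This is only an illustrative sketch: the names `COMPATIBILITY`, `supports` and `subcommands_for` are hypothetical and are not part of Grew.

```python
# Compatibility of each subcommand with the two input modes,
# transcribed from the table above (True = supported).
COMPATIBILITY = {
    "transform": {"Mono": True, "Multi": False},
    "grep": {"Mono": True, "Multi": True},    # Multi support is new in 1.10
    "count": {"Mono": True, "Multi": True},   # Mono support is new in 1.10
    "compile": {"Mono": False, "Multi": True},
    "clean": {"Mono": False, "Multi": True},
}


def supports(subcommand: str, mode: str) -> bool:
    """Return True if `subcommand` accepts input described in `mode`."""
    return COMPATIBILITY[subcommand][mode]


def subcommands_for(mode: str) -> list[str]:
    """List the subcommands usable with the given mode, in table order."""
    return [cmd for cmd, modes in COMPATIBILITY.items() if modes[mode]]


if __name__ == "__main__":
    print("Mono: ", subcommands_for("Mono"))
    print("Multi:", subcommands_for("Multi"))
```

Such a lookup could be used, for example, in a wrapper script to reject an unsupported combination before calling Grew.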
The Mono mode
The Mono mode corresponds to the following arguments on the command line:
- a sequence of arguments `-i <file>.<ext>` with extension `.conll`, `.conllu`, `.cupt` or `.orfeo` → all sentences in the different files are loaded, following the CoNLL-U format or CoNLL-U Plus format.
- a sequence of arguments `-i <file>.json` with JSON files following Graph JSON encoding → all sentences in the different JSON files are loaded.
- a sequence of arguments `-i <file>.amr` or `-i <file>.txt` → all graphs in the different files are loaded, following the PENMAN notation.
- one argument `-i directory` → all files in the directory are loaded, as in one of the 3 items above.
- no `-i` argument → CoNLL-U data is read on `stdin`.
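The mapping from file extension to expected format described above can be sketched as follows. This is a minimal illustration, not Grew's actual implementation: `input_format` and `mono_inputs` are hypothetical helpers, and the sketch assumes the extension match is exact (case-sensitive), which the documentation does not specify. Directories are not handled.

```python
from pathlib import PurePath

# Extensions listed above for each family of input files.
CONLL_EXTENSIONS = {".conll", ".conllu", ".cupt", ".orfeo"}
PENMAN_EXTENSIONS = {".amr", ".txt"}


def input_format(file_name: str) -> str:
    """Hypothetical helper: name the format implied by a file extension."""
    ext = PurePath(file_name).suffix
    if ext in CONLL_EXTENSIONS:
        return "CoNLL-U or CoNLL-U Plus"
    if ext == ".json":
        return "Graph JSON"
    if ext in PENMAN_EXTENSIONS:
        return "PENMAN"
    raise ValueError(f"unsupported extension: {ext!r}")


def mono_inputs(files: list[str]) -> dict[str, str]:
    """Map each `-i` file to its expected format.

    An empty list models the case without any `-i` argument,
    where CoNLL-U data is read on stdin.
    """
    if not files:
        return {"<stdin>": "CoNLL-U"}
    return {name: input_format(name) for name in files}


if __name__ == "__main__":
    for name, fmt in mono_inputs(["a.conllu", "b.cupt"]).items():
        print(f"{name}: {fmt}")
```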
The Multi mode
The Multi mode is used when the command line contains a sequence of arguments `-i <file>.json`, where each JSON file follows the JSON description of a set of corpora given below.
JSON description of a set of corpora
Sets of corpora are used both by the Grew-match online tool and by Grew.
A JSON file is used to describe the set. Each corpus is described by:
- a unique identifier `id`
- a `directory` where the files of the corpus are stored (use absolute paths)
- a `files` field with a list of file names. This field is optional; by default, all files with extension `conll`, `conllu`, `cupt` or `orfeo` are loaded.
For instance, the file `en_fr_zh.json` describes 3 corpora from UD 2.14 (the directories must be adapted to match your local installation).
{ "corpora": [
{ "id": "UD_English-EWT@2.14",
"directory": "/Users/guillaum/resources/ud-treebanks-v2.14/UD_English-EWT/",
"files": ["en_ewt-ud-dev.conllu", "en_ewt-ud-test.conllu", "en_ewt-ud-train.conllu"]
},
{ "id": "UD_French-Sequoia@2.14",
"directory": "/Users/guillaum/resources/ud-treebanks-v2.14/UD_French-Sequoia/"
},
{ "id": "UD_Chinese-GSD@2.14",
"directory": "/Users/guillaum/resources/ud-treebanks-v2.14/UD_Chinese-GSD/"
} ]
}
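Before passing such a description to Grew, it can be useful to check it against the rules stated above (unique `id`, absolute `directory`, optional `files`). The sketch below is one possible way to do this, not something Grew provides: `check_corpora` and `corpus_files` are hypothetical helpers, and the absolute-path check assumes POSIX-style paths as in the example.

```python
from pathlib import Path, PurePosixPath

# Extensions loaded by default when the "files" field is absent.
DEFAULT_EXTENSIONS = {".conll", ".conllu", ".cupt", ".orfeo"}


def check_corpora(description: dict) -> list[str]:
    """Return a list of problems found in a set-of-corpora description."""
    problems = []
    seen_ids = set()
    for index, corpus in enumerate(description.get("corpora", [])):
        corpus_id = corpus.get("id")
        if corpus_id is None:
            problems.append(f"corpus #{index}: missing id")
        elif corpus_id in seen_ids:
            problems.append(f"duplicate id: {corpus_id}")
        else:
            seen_ids.add(corpus_id)
        directory = corpus.get("directory")
        if directory is None:
            problems.append(f"corpus #{index}: missing directory")
        elif not PurePosixPath(directory).is_absolute():
            # Assumption: POSIX-style paths, as in the example above.
            problems.append(f"directory is not absolute: {directory}")
    return problems


def corpus_files(corpus: dict) -> list[Path]:
    """List the files of one corpus, applying the default rule if needed."""
    directory = Path(corpus["directory"])
    if "files" in corpus:
        return [directory / name for name in corpus["files"]]
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix in DEFAULT_EXTENSIONS
    )
```

A typical use would be to load the file with `json.load`, call `check_corpora` on the result, and stop if the returned list is not empty.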
NB: A few other fields are used to describe corpora in Grew-match. See here for examples of the JSON files used in different instances of Grew-match.