JSON description of a set of corpora

Sets of corpora are used both by the Grew-match online tool and by the Grew count mode.

A JSON file describes the set. Each corpus is described by:

• a unique identifier id
• a directory where the files of the corpus are stored
• a files field with a list of file names. This field is optional; by default, all files with the extension conll, conllu, cupt or orfeo are loaded.
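
The defaulting rule for the files field can be sketched in Python as follows. This is only an illustration, not Grew's own code: the helper name resolve_files is hypothetical, and it assumes that only files directly inside the directory are considered and that extensions are matched case-sensitively.

```python
from pathlib import Path

# Extensions loaded by default, as listed above.
DEFAULT_EXTENSIONS = {".conll", ".conllu", ".cupt", ".orfeo"}

def resolve_files(corpus: dict) -> list[Path]:
    """Return the paths a corpus entry refers to (hypothetical helper, not part of Grew)."""
    directory = Path(corpus["directory"])
    if "files" in corpus:
        # Explicit list: keep the order given in the JSON description.
        return [directory / name for name in corpus["files"]]
    # Assumption: only direct children of the directory are considered,
    # and extension matching is case-sensitive.
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix in DEFAULT_EXTENSIONS
    )
```

With an explicit files list the directory is not scanned at all, so a missing file would only be noticed when it is opened.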

For instance, the file en_fr_zh.json describes three corpora from UD 2.6 (the directories must be adapted to match your local installation).

{ "corpora": [
    { "id": "UD_English-EWT@2.6",
      "directory": "/Users/guillaum/resources/ud-treebanks-v2.6/UD_English-EWT/",
      "files": ["en_ewt-ud-dev.conllu", "en_ewt-ud-test.conllu", "en_ewt-ud-train.conllu"]
    },
    { "id": "UD_French-Sequoia@2.6",
      "directory": "/Users/guillaum/resources/ud-treebanks-v2.6/UD_French-Sequoia/"
    },
    { "id": "UD_Chinese-GSD@2.6",
      "directory": "/Users/guillaum/resources/ud-treebanks-v2.6/UD_Chinese-GSD/"
    } ]
}
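
Before passing such a file to Grew, it may be useful to check it against the rules above. The following Python sketch does so under stated assumptions: the function names check_corpora and load_corpora are hypothetical, and directory is treated as mandatory because only files is described as optional.

```python
import json

def check_corpora(description: dict) -> list[str]:
    """Return the corpus ids, raising ValueError on a malformed description."""
    ids = []
    for corpus in description["corpora"]:
        # Assumption: "id" and "directory" are both required.
        for field in ("id", "directory"):
            if field not in corpus:
                raise ValueError(f"corpus entry without {field!r}: {corpus}")
        # Identifiers must be unique within the set.
        if corpus["id"] in ids:
            raise ValueError(f"duplicate corpus id: {corpus['id']}")
        ids.append(corpus["id"])
    return ids

def load_corpora(path: str) -> list[str]:
    """Read a JSON description from disk and check it."""
    with open(path, encoding="utf-8") as f:
        return check_corpora(json.load(f))
```

Calling load_corpora("en_fr_zh.json") on the example above would return the three identifiers in the order they appear in the file.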


NB: a few other fields are used for corpus description in Grew-match. See here for examples of the JSON files used for Grew-match.