Dependency parsing

Grew-parse-FR is a natural language parser for French. It consists of a GRS (Graph Rewriting System) that is used with the Grew software to produce dependency syntax structures from POS-tagged data. Combined with a POS-tagger (Talismane is recommended), it forms a full parser that takes sentences as input and produces dependency structures as output. The parsing GRS is described in an IWPT 2015 publication.

How to parse a sentence?

We consider the sentence: La souris a été mangée par le chat.

The parsing is done in three steps:

  1. POS-tag the sentence with Talismane
  2. Convert the Talismane output into a format usable by Grew (with a sed script)
  3. Build the dependency syntax structure by applying the Graph Rewriting System
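
The three steps can be chained in a small shell function. This is a minimal sketch, not part of the distribution: the name parse_french is hypothetical, and it assumes that the Talismane jar and configuration file, tal2grew.sed and the POStoSSQ directory are all reachable from the current directory, exactly as in the commands detailed in the steps of this page.

```shell
# Hypothetical helper (not part of Grew-parse-FR): runs the three steps on
# one text file, e.g. data.txt, and writes data.tal, data.pos.conll and
# data.surf.conll next to it.
parse_french() {
  base="${1%.txt}"

  # Step 1: tokenisation and POS-tagging with Talismane
  java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-6.0.0.jar \
    --analyse \
    --endModule=posTagger \
    --sessionId=fr \
    --encoding=UTF8 \
    --inFile="$base.txt" \
    --outFile="$base.tal" || return 1

  # Step 2: convert the Talismane output into the format expected by Grew
  sed -f tal2grew.sed "$base.tal" > "$base.pos.conll" || return 1

  # Step 3: build the dependency structure with the GRS
  grew transform -grs POStoSSQ/grs/surf_synt_main.grs \
    -i "$base.pos.conll" -o "$base.surf.conll"
}
```

Once the function is defined, parse_french data.txt would run the whole chain and stop at the first step that fails.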

Prerequisite

More info on the parsing process

Step 0: Get the text to parse

Put the input text in the file data.txt

echo "La souris a été mangée par le chat." > data.txt

Step 1: POS-tagging

The POStoSSQ parsing system expects POS-tagged input. An easy way to produce a POS-tagged French sentence is to use Talismane.

Call Talismane for tokenisation and POS-tagging with the command:

java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-6.0.0.jar \
  --analyse \
  --endModule=posTagger \
  --sessionId=fr \
  --encoding=UTF8 \
  --inFile=data.txt \
  --outFile=data.tal

This should produce the file data.tal:

1	La	la	DET	DET	n=s|g=f	
2	souris	souris	NC	NC	g=f	
3	a	avoir	V	V	n=s|t=P|p=3	
4	été	être	VPP	VPP	t=K	
5	mangée	manger	VPP	VPP	n=s|g=f|t=K	
6	par	par	P	P		
7	le	le	DET	DET	n=s|g=m	
8	chat	chat	NC	NC	n=s|g=m	
9	.	.	PONCT	PONCT		
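
To check the tagging at a glance, the form, lemma and POS tag can be pulled out of the tab-separated output. A minimal sketch, assuming the column layout of the sample above (form in column 2, lemma in column 3, POS tag in column 4); the function name show_tags and the file sample.tal are illustrative only:

```shell
# Hypothetical helper: print "form<TAB>lemma<TAB>POS" for each token line.
show_tags() {
  awk -F'\t' 'NF >= 4 { print $2 "\t" $3 "\t" $4 }' "$1"
}

# Two lines of the sample output, rebuilt with printf so that the tabs are explicit.
printf '1\tLa\tla\tDET\tDET\tn=s|g=f\t\n'   >  sample.tal
printf '2\tsouris\tsouris\tNC\tNC\tg=f\t\n' >> sample.tal

show_tags sample.tal
```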

Step 2: Convert output

Apply the sed script:

sed -f tal2grew.sed data.tal > data.pos.conll

This produces the file data.pos.conll:

1	La	la	_	DET	n=s|g=f	_	_	_	_
2	souris	souris	_	NC	g=f	_	_	_	_
3	a	avoir	_	V	n=s|m=ind|t=pst|p=3	_	_	_	_
4	été	être	_	VPP	m=part|t=past	_	_	_	_
5	mangée	manger	_	VPP	n=s|g=f|m=part|t=past	_	_	_	_
6	par	par	_	P		_	_	_	_
7	le	le	_	DET	n=s|g=m	_	_	_	_
8	chat	chat	_	NC	n=s|g=m	_	_	_	_
9	.	.	_	PONCT		_	_	_	_
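
In the converted file above, each token line has 10 tab-separated columns, and an empty feature column (as for par) still counts as one. A minimal sketch that checks this before running Grew, assuming only that layout; the helper name check_columns and the sample file name are illustrative:

```shell
# Hypothetical helper: report token lines that do not have exactly 10
# tab-separated columns; the exit status is non-zero if any is found.
check_columns() {
  awk -F'\t' '
    /^#/ || NF == 0 { next }     # skip comments and blank lines
    NF != 10 { printf "line %d: %d columns\n", NR, NF; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# The line for "par": the empty feature field still counts as a column.
printf '6\tpar\tpar\t_\tP\t\t_\t_\t_\t_\n' > sample.pos.conll
check_columns sample.pos.conll && echo "column count OK"
```

On the real file, check_columns data.pos.conll prints the offending line numbers, if any.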

Step 3: Parsing with the GRS

Given the file data.pos.conll shown above, the following command produces the parsed sentence in CoNLL-U format:

grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll -o data.surf.conll

The output file is data.surf.conll:

# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1	La	la	D	DET	g=f|n=s	2	det	_	_
2	souris	souris	N	NC	g=f|s=c	5	suj	_	_
3	a	avoir	V	V	m=ind|n=s|p=3|t=pst	5	aux.tps	_	_
4	été	être	V	VPP	m=part|t=past	5	aux.pass	_	_
5	mangée	manger	V	VPP	g=f|m=part|n=s|p=3|t=past	_	_	_	_
6	par	par	P	P	_	5	p_obj.agt	_	_
7	le	le	D	DET	g=m|n=s	8	det	_	_
8	chat	chat	N	NC	g=m|n=s|s=c	6	obj.p	_	_
9	.	.	PONCT	PONCT	_	5	ponct	_	_

which encodes the syntactic structure:

Dependency structure
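
The same structure can be read off the HEAD and DEPREL columns (columns 7 and 8) of the output file. A minimal sketch that prints one edge per line, assuming the column order given in the global.columns header; the helper name list_edges and the sample file are illustrative:

```shell
# Hypothetical helper: print one "relation(head, dependent)" line per token
# that has a head in column 7.
list_edges() {
  awk -F'\t' '
    /^#/ || NF < 8 { next }      # skip comments and short lines
    { form[$1] = $2; head[$1] = $7; rel[$1] = $8; n = $1 + 0 }
    END {
      for (i = 1; i <= n; i++)
        if ((i in form) && head[i] != "_")
          printf "%s(%s, %s)\n", rel[i], form[head[i]], form[i]
    }
  ' "$1"
}

# Three lines of the output above, rebuilt with printf so that the tabs are explicit.
printf '1\tLa\tla\tD\tDET\tg=f|n=s\t2\tdet\t_\t_\n' > sample.surf.conll
printf '2\tsouris\tsouris\tN\tNC\tg=f|s=c\t5\tsuj\t_\t_\n' >> sample.surf.conll
printf '5\tmangée\tmanger\tV\tVPP\tg=f|m=part|n=s|p=3|t=past\t_\t_\t_\t_\n' >> sample.surf.conll

list_edges sample.surf.conll
```

On these three lines the function prints det(souris, La) and suj(mangée, souris). Token 5 has no head in the output and is therefore skipped.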

In case of trouble

Conversion of Talismane output

In case of ambiguity, Talismane outputs features with a disjunction of values. The current parsing system cannot handle these disjunctions. The sed script tal2grew.sed rewrites or removes the disjunctions we have encountered so far, but the list may not be exhaustive.

If there is an error in the Grew output, you may have to adapt Step 3.1 in the sed file (please let us know if this is the case, so that we can update the sed file for other users).
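
One way to spot values that the sed script has not handled is to list the distinct feature values that remain in the converted file and look for unexpected ones. A minimal sketch, assuming the features sit in column 6 and are separated by |, as in data.pos.conll above; the helper name list_feature_values is illustrative and this check is not part of the Grew-parse-FR distribution:

```shell
# Hypothetical helper: list each distinct feature=value pair found in column 6.
list_feature_values() {
  cut -f6 "$1" | tr '|' '\n' | grep -v -e '^$' -e '^_$' | LC_ALL=C sort -u
}

# Two lines of data.pos.conll, rebuilt with printf so that the tabs are explicit.
printf '1\tLa\tla\t_\tDET\tn=s|g=f\t_\t_\t_\t_\n' > sample.feats.conll
printf '3\ta\tavoir\t_\tV\tn=s|m=ind|t=pst|p=3\t_\t_\t_\t_\n' >> sample.feats.conll

list_feature_values sample.feats.conll
```

Any value in the resulting list that does not look like a single, unambiguous value is a candidate for a new rule in tal2grew.sed.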