Dependency parsing

Grew-parse-FR is a natural language parser for French. It consists of a GRS (Graph Rewriting System) that can be used with the Grew software to produce dependency syntax structures from POS-tagged data. Combined with a POS-tagger (Talismane is recommended), it provides a full parser that takes sentences as input and produces dependency structures as output. The parsing GRS is described in an IWPT 2015 publication.

How to parse a sentence?

We consider the sentence:

La souris a été mangée par le chat.

The parsing is done in three steps:

  1. POS-tagging with Talismane
  2. Converting the Talismane output into a format usable by Grew (with a sed script)
  3. Building the dependency syntax structure by applying the Graph Rewriting System

Prerequisite

More info on the parsing process

Step 0: Get the text to parse

Put the input text in the file data.txt:

echo "La souris a été mangée par le chat." > data.txt

Step 1: POS-tagging

The parsing system POStoSSQ expects POS-tagged input. One easy way to produce a POS-tagged French sentence is to use Talismane.

Call Talismane for tokenisation and POS-tagging with the command:

java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-5.2.0.jar \
  --analyse \
  --endModule=posTagger \
  --sessionId=fr \
  --encoding=UTF8 \
  --inFile=data.txt \
  --outFile=data.tal

This should produce the file data.tal:

1	La	la	DET	DET	n=s|g=f	
2	souris	souris	NC	NC	g=f	
3	a	avoir	V	V	n=s|t=P|p=3	
4	été	être	VPP	VPP	t=K	
5	mangée	manger	VPP	VPP	n=s|g=f|t=K	
6	par	par	P	P		
7	le	le	DET	DET	n=s|g=m	
8	chat	chat	NC	NC	n=s|g=m	
9	.	.	PONCT	PONCT		
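
Before moving on to the conversion, it can help to check the tagging at a glance. The sketch below prints each token with its POS tag. It assumes the tab-separated layout shown above (form in column 2, tag in column 4); the function name show_tags is only illustrative and is not part of Talismane or Grew.

```shell
# show_tags FILE: print "form/TAG" for every token line of a Talismane output file.
# Assumes tab-separated columns as in the example above:
# column 2 = word form, column 4 = POS tag. Empty lines are skipped.
show_tags() {
  awk -F'\t' 'NF >= 4 { print $2 "/" $4 }' "$1"
}
```

Calling show_tags data.tal on the file above would start with La/DET and souris/NC.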

Step 2: Convert output

Apply the sed script:

sed -f tal2grew.sed data.tal > data.pos.conll

This produces the file data.pos.conll:

1	La	la	_	DET	n=s|g=f	_	_	_	_
2	souris	souris	_	NC	g=f	_	_	_	_
3	a	avoir	_	V	n=s|m=ind|t=pst|p=3	_	_	_	_
4	été	être	_	VPP	m=part|t=past	_	_	_	_
5	mangée	manger	_	VPP	n=s|g=f|m=part|t=past	_	_	_	_
6	par	par	_	P		_	_	_	_
7	le	le	_	DET	n=s|g=m	_	_	_	_
8	chat	chat	_	NC	n=s|g=m	_	_	_	_
9	.	.	_	PONCT		_	_	_	_
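
If Grew later complains about its input, one simple thing to rule out is a malformed line in the converted file. The following sketch reports every non-empty line that does not have exactly 10 tab-separated columns, which is the layout of the example above. The function name check_conll is hypothetical, and the check does not validate the content of the columns.

```shell
# check_conll FILE: report non-empty lines that do not have exactly 10
# tab-separated columns (the layout of data.pos.conll shown above).
# Exit status: 0 if every non-empty line has 10 columns, 1 otherwise.
check_conll() {
  awk -F'\t' '
    NF > 0 && NF != 10 { printf "line %d: %d columns\n", NR, NF; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

A typical use is to run check_conll data.pos.conll and only call Grew when it exits successfully.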

Step 3: Parsing with the GRS

With the file data.pos.conll described above, the following command produces the parsed sentence in CoNLL format:

grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll -o data.surf.conll

The output file is data.surf.conll:

1	La	la	D	DET	g=f|n=s	2	det	_	_
2	souris	souris	N	NC	g=f|s=c	5	suj	_	_
3	a	avoir	V	V	m=ind|n=s|p=3|t=pst	5	aux.tps	_	_
4	été	être	V	VPP	m=part|t=past	5	aux.pass	_	_
5	mangée	manger	V	VPP	g=f|m=part|n=s|p=3|t=past	_	_	_	_
6	par	par	P	P	_	5	p_obj.agt	_	_
7	le	le	D	DET	g=m|n=s	8	det	_	_
8	chat	chat	N	NC	g=m|n=s|s=c	6	obj.p	_	_
9	.	.	PONCT	PONCT	_	5	ponct	_	_
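
To read the result without a graphical viewer, the head and relation columns can be turned into a list of labelled edges. This is a minimal sketch, assuming the 10-column layout of the example above (id in column 1, form in column 2, head id in column 7, relation in column 8); the function name list_deps is only illustrative.

```shell
# list_deps FILE: print one line per token of a parsed CoNLL file.
# Tokens with a head are printed as "head_form -[relation]-> form".
# Tokens whose head column is "_" are printed as "form: no head".
# Assumes columns: 1 = id, 2 = form, 7 = head id, 8 = relation.
list_deps() {
  awk -F'\t' '
    NF >= 8 { form[$1] = $2; head[$1] = $7; rel[$1] = $8; ids[++n] = $1 }
    END {
      for (i = 1; i <= n; i++) {
        id = ids[i]
        if (head[id] == "_")
          print form[id] ": no head"
        else
          print form[head[id]] " -[" rel[id] "]-> " form[id]
      }
    }
  ' "$1"
}
```

For data.surf.conll above, the first line printed by list_deps data.surf.conll would be souris -[det]-> La.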

which encodes the syntactic structure:

Dependency structure

It is also possible to run a GTK interface in which you can explore the step-by-step rewriting of the input sentence:

grew gui -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll

In case of trouble

Conversion of Talismane output

Talismane outputs features with disjunctions of values in case of ambiguity. These disjunctions cannot be handled by the current parsing system. The sed script tal2grew.sed rewrites or removes the disjunctions we have discovered so far, but this list may not be exhaustive.

If there is an error in the Grew output, you may have to adapt Step 3.1 in the sed file (please let us know if this is the case, so that we can update the sed file for other users!).
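
When tracking down such a case, it may help to list the tokens of data.tal that could carry a disjunction. The sketch below rests on an assumption that is not stated on this page: that an ambiguous value is written as several codes concatenated in the p or t feature (for instance p=13). Check this against your own Talismane output before relying on it. The function name find_disjunctions is hypothetical.

```shell
# find_disjunctions FILE: print "line: form<TAB>features" for token lines whose
# p= or t= feature value is longer than one character.
# ASSUMPTION: a disjunction is encoded as concatenated codes (e.g. p=13);
# this encoding is a guess and must be verified on real Talismane output.
# Features are read from column 6, as in the data.tal example above.
find_disjunctions() {
  awk -F'\t' '
    {
      n = split($6, feats, "[|]")
      for (i = 1; i <= n; i++)
        if (feats[i] ~ /^[pt]=..+$/) { print NR ": " $2 "\t" $6; break }
    }
  ' "$1"
}
```

On the data.tal example above this prints nothing, since every p and t value there is a single character.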

Use MElt instead of Talismane

If you cannot get Talismane to work, MElt is an alternative: see Dependency parsing with MElt.