Dependency parsing
Grew-parse-FR is a natural language parser for French. It consists of a GRS (Graph Rewriting System) that is used with the Grew software to produce dependency syntax structures from POS-tagged data. Combined with a POS-tagger (Talismane is recommended), it forms a full parser that takes sentences as input and produces dependency structures as output. The parsing GRS is described in an IWPT 2015 publication.
How to parse a sentence?
We consider the sentence:
- “La souris a été mangée par le chat.” [“The mouse was eaten by the cat.”].
The parsing is done in three steps:
- POS-tagging with Talismane
- Converting the Talismane output into a format usable by Grew (with a sed script)
- Building the dependency syntax structure by applying the Graph Rewriting System
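The three steps can also be chained from a small driver script. The following is a minimal Python sketch, not an official tool: the commands, options and file names are the ones used in the rest of this page, while the function names pipeline_steps and run_pipeline are hypothetical. It assumes that java, sed and grew are available on the PATH and that the Talismane files are in the current directory.

```python
import subprocess


def pipeline_steps(conf, jar, txt="data.txt",
                   sed_script="tal2grew.sed",
                   grs="POStoSSQ/grs/surf_synt_main.grs"):
    """Return the three commands as (argv, stdout_file) pairs.

    conf and jar are the Talismane .conf and .jar files you downloaded.
    stdout_file is None unless the command output must be redirected.
    """
    tal, pos, out = "data.tal", "data.pos.conll", "data.surf.conll"
    return [
        # Step 1: tokenisation and POS-tagging with Talismane
        (["java", "-Xmx1G", f"-Dconfig.file={conf}", "-jar", jar,
          "--analyse", "--endModule=posTagger", "--sessionId=fr",
          "--encoding=UTF8", f"--inFile={txt}", f"--outFile={tal}"], None),
        # Step 2: conversion with the sed script (output redirected to a file)
        (["sed", "-f", sed_script, tal], pos),
        # Step 3: parsing with the GRS
        (["grew", "transform", "-grs", grs, "-i", pos, "-o", out], None),
    ]


def run_pipeline(steps):
    """Run the commands in order and stop at the first failure."""
    for argv, stdout_file in steps:
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_file, "w", encoding="utf-8") as handle:
                subprocess.run(argv, check=True, stdout=handle)


# Show the commands without running them.
for argv, stdout_file in pipeline_steps("talismane-fr-5.2.0.conf",
                                        "talismane-core-6.0.0.jar"):
    suffix = f" > {stdout_file}" if stdout_file else ""
    print(" ".join(argv) + suffix)
```

Calling run_pipeline(pipeline_steps(conf, jar)) would execute the three commands; pass the name of the .conf file you actually downloaded.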
Prerequisites
- Talismane:
  - Download the following 3 files from the Talismane GitHub page: talismane-distribution-6.0.0-bin.zip, frenchLanguagePack-5.2.0.zip and talismane-fr-6.0.0.conf.
  - Unzip talismane-distribution-6.0.0-bin.zip (and not the other zip file).
- Grew: see Installation page
- POStoSSQ: get it with the command:
git clone https://gitlab.inria.fr/grew/POStoSSQ.git
- Download the sed script tal2grew.sed
More info on the parsing process
Step 0: Get the text to parse
Put the input text in the file data.txt:
echo "La souris a été mangée par le chat." > data.txt
Step 1: POS-tagging
The parsing system POStoSSQ expects POS-tagged input. One easy way to produce a POS-tagged French sentence is to use Talismane.
Call Talismane for tokenisation and POS-tagging with the command:
java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-6.0.0.jar \
--analyse \
--endModule=posTagger \
--sessionId=fr \
--encoding=UTF8 \
--inFile=data.txt \
--outFile=data.tal
This should produce the file data.tal:
1 La la DET DET n=s|g=f
2 souris souris NC NC g=f
3 a avoir V V n=s|t=P|p=3
4 été être VPP VPP t=K
5 mangée manger VPP VPP n=s|g=f|t=K
6 par par P P
7 le le DET DET n=s|g=m
8 chat chat NC NC n=s|g=m
9 . . PONCT PONCT
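To work with data.tal in a script, each line can be split into its columns and the feature string into key/value pairs. The following is a minimal Python sketch, assuming the file is tab-separated with the six columns shown above; the column labels and function names are chosen for this illustration and are not part of Talismane.

```python
# Labels chosen for this sketch; they are not official Talismane names.
TAL_COLUMNS = ["id", "form", "lemma", "cpos", "pos", "feats"]


def parse_feats(feats):
    """Turn 'n=s|g=f' into {'n': 's', 'g': 'f'}; '' or '_' gives {}."""
    if feats in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def parse_tal_line(line):
    """Parse one tab-separated token line of data.tal into a dict."""
    fields = line.rstrip("\n").split("\t")
    # Pad so that a missing trailing feature column becomes an empty string.
    fields += [""] * (len(TAL_COLUMNS) - len(fields))
    token = dict(zip(TAL_COLUMNS, fields))
    token["feats"] = parse_feats(token["feats"])
    return token


sample = "3\ta\tavoir\tV\tV\tn=s|t=P|p=3"
print(parse_tal_line(sample))
```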
Step 2: Convert output
Apply the sed script:
sed -f tal2grew.sed data.tal > data.pos.conll
This produces the file data.pos.conll:
1 La la _ DET n=s|g=f _ _ _ _
2 souris souris _ NC g=f _ _ _ _
3 a avoir _ V n=s|m=ind|t=pst|p=3 _ _ _ _
4 été être _ VPP m=part|t=past _ _ _ _
5 mangée manger _ VPP n=s|g=f|m=part|t=past _ _ _ _
6 par par _ P _ _ _ _
7 le le _ DET n=s|g=m _ _ _ _
8 chat chat _ NC n=s|g=m _ _ _ _
9 . . _ PONCT _ _ _ _
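To make the conversion concrete, here is a minimal Python sketch of the column mapping visible in the two listings above. It is not the tal2grew.sed script: it only reproduces the two tense rewrites that appear in this example (t=P and t=K), assumes tab-separated input, and uses hypothetical function names. The real script handles more cases.

```python
# Only the two rewrites visible in the example above (an assumption of this
# sketch; the real tal2grew.sed script covers more cases).
TENSE_REWRITES = {
    "t=P": "m=ind|t=pst",
    "t=K": "m=part|t=past",
}


def convert_feats(feats):
    """Rewrite each known feature in place and keep the others unchanged."""
    items = feats.split("|") if feats else []
    return "|".join(TENSE_REWRITES.get(item, item) for item in items)


def tal_to_conll(line):
    """Map one 6-column Talismane line to a 10-column line for Grew."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (6 - len(fields))
    # The two POS columns are identical in the example; the fifth one is
    # kept here, which is a choice made for this sketch.
    tok_id, form, lemma, _cpos, pos, feats = fields[:6]
    columns = [tok_id, form, lemma, "_", pos, convert_feats(feats),
               "_", "_", "_", "_"]
    return "\t".join(columns)


print(tal_to_conll("5\tmangée\tmanger\tVPP\tVPP\tn=s|g=f|t=K"))
```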
Step 3: Parsing with the GRS
With the file data.pos.conll described above, the following command produces the CoNLL-U code of the parsed sentence:
grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll -o data.surf.conll
The output file is data.surf.conll:
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 La la D DET g=f|n=s 2 det _ _
2 souris souris N NC g=f|s=c 5 suj _ _
3 a avoir V V m=ind|n=s|p=3|t=pst 5 aux.tps _ _
4 été être V VPP m=part|t=past 5 aux.pass _ _
5 mangée manger V VPP g=f|m=part|n=s|p=3|t=past _ _ _ _
6 par par P P _ 5 p_obj.agt _ _
7 le le D DET g=m|n=s 8 det _ _
8 chat chat N NC g=m|n=s|s=c 6 obj.p _ _
9 . . PONCT PONCT _ 5 ponct _ _
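which encodes the following syntactic structure: mangée is the root of the sentence; souris is its subject (suj), with La as determiner (det); a and été are its tense and passive auxiliaries (aux.tps and aux.pass); par introduces the agent (p_obj.agt) and governs chat (obj.p), which has le as determiner (det); the final period is attached to mangée as ponct.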
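To inspect the result in a script, the dependency arcs can be read directly from data.surf.conll. The following is a minimal Python sketch, assuming tab-separated columns in the order given by the global.columns line; the function name read_arcs is hypothetical, and treating a HEAD of "_" or "0" as the root is an assumption based on the listing above.

```python
CONLL_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_arcs(lines):
    """Return (roots, arcs); arcs are (head form, relation, dependent form)."""
    rows = []
    forms = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(CONLL_COLUMNS, line.split("\t")))
        forms[row["ID"]] = row["FORM"]
        rows.append(row)
    roots, arcs = [], []
    for row in rows:
        if row["HEAD"] in ("_", "0"):
            roots.append(row["FORM"])
        else:
            arcs.append((forms[row["HEAD"]], row["DEPREL"], row["FORM"]))
    return roots, arcs


# Three lines of the output above, with spaces turned back into tabs.
sample = [
    "2 souris souris N NC g=f|s=c 5 suj _ _".replace(" ", "\t"),
    "4 été être V VPP m=part|t=past 5 aux.pass _ _".replace(" ", "\t"),
    "5 mangée manger V VPP g=f|m=part|n=s|p=3|t=past _ _ _ _".replace(" ", "\t"),
]
roots, arcs = read_arcs(sample)
print("roots:", roots)
for head, relation, dependent in arcs:
    print(f"{head} -[{relation}]-> {dependent}")
```

On a real file, read_arcs can be called on the open file object, for example with open("data.surf.conll", encoding="utf-8").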
In case of trouble
Conversion of Talismane output
Talismane outputs features with disjunctions of values in case of ambiguity.
The current parsing system cannot handle these disjunctions.
The sed script tal2grew.sed rewrites or removes the disjunctions we have encountered so far, but the list may not be exhaustive.
If there is an error in the Grew output, you may have to adapt Step 3.1 in the sed file (please let us know if this is the case, so that we can update the sed file for other users!).
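When looking for the feature that causes trouble, it can help to list every feature value present in data.pos.conll, so that unexpected values stand out. The following is a minimal Python sketch with a hypothetical function name; it assumes tab-separated lines with the features in the sixth column, and it makes no assumption about how Talismane writes a disjunction: it only collects the values and reports items that are not of the form key=value.

```python
from collections import defaultdict


def collect_feature_values(lines):
    """Return ({feature: set of values}, [(token id, malformed item)])."""
    values = defaultdict(set)
    malformed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        feats = fields[5] if len(fields) > 5 else ""
        if feats in ("", "_"):
            continue  # no features on this token
        for item in feats.split("|"):
            key, sep, value = item.partition("=")
            if sep and key and value:
                values[key].add(value)
            else:
                malformed.append((fields[0], item))
    return dict(values), malformed


sample = [
    "3\ta\tavoir\t_\tV\tn=s|m=ind|t=pst|p=3\t_\t_\t_\t_",
    "5\tmangée\tmanger\t_\tVPP\tn=s|g=f|m=part|t=past\t_\t_\t_\t_",
]
values, malformed = collect_feature_values(sample)
for key in sorted(values):
    print(key, sorted(values[key]))
print("malformed:", malformed)
```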