{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "title: \"Grewpy • request\"\n", "date: 2024-04-22\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`grewpy` Tutorial](../tutorial)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Grewpy tutorial: Run requests on a corpus\n", "\n", "Download the notebook [here](../request.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:05.592017Z", "iopub.status.busy": "2024-08-11T16:53:05.591674Z", "iopub.status.idle": "2024-08-11T16:53:05.841516Z", "shell.execute_reply": "2024-08-11T16:53:05.832833Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "connected to port: 55490\n" ] } ], "source": [ "import grewpy\n", "from grewpy import Corpus, Request\n", "\n", "grewpy.set_config(\"sud\") # ud or basic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import data\n", "The `Corpus` constructor takes a `conllu` file or a directory containing `conllu` files.\n", "A `Corpus` allows to make queries and to count occurrences." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:05.885453Z", "iopub.status.busy": "2024-08-11T16:53:05.884914Z", "iopub.status.idle": "2024-08-11T16:53:06.233004Z", "shell.execute_reply": "2024-08-11T16:53:06.232722Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "treebank_path = \"SUD_English-PUD\"\n", "corpus = Corpus(treebank_path)\n", "print(type(corpus))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.234790Z", "iopub.status.busy": "2024-08-11T16:53:06.234661Z", "iopub.status.idle": "2024-08-11T16:53:06.237459Z", "shell.execute_reply": "2024-08-11T16:53:06.237210Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "n_sentencens = 1000\n", "sent_ids[0] = 'n01001011'\n" ] } ], "source": [ "n_sentencens = len(corpus)\n", "sent_ids = corpus.get_sent_ids()\n", "\n", "print(f\"{n_sentencens = }\")\n", "print(f\"{sent_ids[0] = }\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore data\n", "See the [Grew-match tutorial](https://universal.grew.fr/?corpus=UD_English-ParTUT@2.14) to practice writing Grew requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of subjets in the corpus" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.239188Z", "iopub.status.busy": "2024-08-11T16:53:06.239083Z", "iopub.status.idle": "2024-08-11T16:53:06.251909Z", "shell.execute_reply": "2024-08-11T16:53:06.251642Z" } }, "outputs": [ { "data": { "text/plain": [ "1420" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req1 = Request(\"pattern { X-[subj]->Y }\")\n", "corpus.count(req1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to extend an already existing request with the methods `pattern`, `without` and `with_` (because `with` is a Python keyword).\n", "Hence, the request `req1bis` below is equivalent to `req1`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.253582Z", "iopub.status.busy": "2024-08-11T16:53:06.253479Z", "iopub.status.idle": "2024-08-11T16:53:06.263399Z", "shell.execute_reply": "2024-08-11T16:53:06.263155Z" } }, "outputs": [ { "data": { "text/plain": [ "1420" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req1bis = Request().pattern(\"X-[subj]->Y\")\n", "corpus.count(req1bis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of subjects such that the subject's head is not a pronoun" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.265159Z", "iopub.status.busy": "2024-08-11T16:53:06.265058Z", "iopub.status.idle": "2024-08-11T16:53:06.275456Z", "shell.execute_reply": "2024-08-11T16:53:06.275175Z" } }, "outputs": [ { "data": { "text/plain": [ "943" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req2 = Request().pattern(\"X-[subj]->Y\").without(\"Y[upos=PRON]\")\n", "corpus.count(req2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of subjects with at least one dependant\n", "Note the usage of `with_` (because `with` is a Python keyword)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.277052Z", "iopub.status.busy": "2024-08-11T16:53:06.276968Z", "iopub.status.idle": "2024-08-11T16:53:06.287419Z", "shell.execute_reply": "2024-08-11T16:53:06.287171Z" } }, "outputs": [ { "data": { "text/plain": [ "752" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req3 = Request().pattern(\"X-[subj]->Y\").with_(\"Y->Z\")\n", "corpus.count(req3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `with` and `without` items can be stacked \n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.289108Z", "iopub.status.busy": "2024-08-11T16:53:06.289012Z", "iopub.status.idle": "2024-08-11T16:53:06.299804Z", "shell.execute_reply": "2024-08-11T16:53:06.299520Z" } }, "outputs": [ { "data": { "text/plain": [ "320" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req4 = Request().pattern(\"X-[subj]->Y\").with_(\"Y->Z\").without(\"Y[upos=PRON]\").without(\"X[upos=VERB]\")\n", "corpus.count(req4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a request with the raw Grew syntax\n", "It is possible to build request directly from the concrete syntax used in Grew-match or in Grew rules.\n", "The `req4` can be written:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.301466Z", "iopub.status.busy": "2024-08-11T16:53:06.301382Z", "iopub.status.idle": "2024-08-11T16:53:06.313228Z", "shell.execute_reply": "2024-08-11T16:53:06.312944Z" } }, "outputs": [ { "data": { "text/plain": [ "320" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req4bis = Request(\"\"\"\n", "pattern { X-[subj]->Y }\n", "with { Y->Z }\n", "without { Y[upos=PRON] }\n", "without { X[upos=VERB] }\n", "\"\"\")\n", "corpus.count(req4bis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More complex queries are allowed, with results clustering\n", "See [Clustering](../../doc/clustering) for more documentation.\n", "Below, we cluster the subject relation, according to the POS of the governor." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.314943Z", "iopub.status.busy": "2024-08-11T16:53:06.314841Z", "iopub.status.idle": "2024-08-11T16:53:06.328955Z", "shell.execute_reply": "2024-08-11T16:53:06.328687Z" } }, "outputs": [ { "data": { "text/plain": [ "{'VERB': 826, 'SCONJ': 1, 'PART': 5, 'NOUN': 3, 'AUX': 581, 'ADP': 3, 'ADJ': 1}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req5 = Request(\"pattern {X-[subj]->Y}\")\n", "corpus.count(req5, clustering_parameter=[\"X.upos\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clustering results by other requests\n", "The clustering is done on the relative position of `X` and `Y`.\n", "It answers to the question: _How many subjects are in a pre-verbal position?_" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.330676Z", "iopub.status.busy": "2024-08-11T16:53:06.330576Z", "iopub.status.idle": "2024-08-11T16:53:06.341013Z", "shell.execute_reply": "2024-08-11T16:53:06.340754Z" } }, "outputs": [ { "data": { "text/plain": [ "{'Yes': 77, 'No': 1343}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.count(req5, clustering_parameter=[\"{X << Y}\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example corresponds to the whether clustering in Grew-match.\n", "Note that here curly braces are required around `X << Y` to indicate that whether clustering should be performed instead of key clustering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two clusterings can be applied" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.342834Z", "iopub.status.busy": "2024-08-11T16:53:06.342729Z", "iopub.status.idle": "2024-08-11T16:53:06.357088Z", "shell.execute_reply": "2024-08-11T16:53:06.356788Z" } }, "outputs": [ { "data": { "text/plain": [ "{'Yes': {'VERB': 45, 'SCONJ': 1, 'AUX': 30, 'ADP': 1},\n", " 'No': {'VERB': 781, 'PART': 5, 'NOUN': 3, 'AUX': 551, 'ADP': 2, 'ADJ': 1}}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.count(req5, clustering_parameter=[\"{X << Y}\",\"X.upos\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More than two clusterings are also possible" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.358753Z", "iopub.status.busy": "2024-08-11T16:53:06.358644Z", "iopub.status.idle": "2024-08-11T16:53:06.373490Z", "shell.execute_reply": "2024-08-11T16:53:06.373214Z" } }, "outputs": [ { "data": { "text/plain": [ "{'Yes': {'VERB': {'Yes': 16, 'No': 29},\n", " 'SCONJ': {'No': 1},\n", " 'AUX': {'Yes': 21, 'No': 9},\n", " 'ADP': {'No': 1}},\n", " 'No': {'VERB': {'Yes': 167, 'No': 614},\n", " 'PART': {'No': 5},\n", " 'NOUN': {'Yes': 2, 'No': 1},\n", " 'AUX': {'Yes': 255, 'No': 296},\n", " 'ADP': {'No': 2},\n", " 'ADJ': {'No': 1}}}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.count(req5, clustering_parameter=[\"{X << Y}\",\"X.upos\", \"{X[Number=Sing]}\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Search occurrences\n", "Get the list of occurrence of a given request in the corpus" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.375202Z", "iopub.status.busy": "2024-08-11T16:53:06.375100Z", "iopub.status.idle": "2024-08-11T16:53:06.409937Z", "shell.execute_reply": "2024-08-11T16:53:06.409667Z" } }, "outputs": [ { "data": { "text/plain": [ "{'sent_id': 'w05010027',\n", " 'matching': {'nodes': {'Y': '8', 'X': '10'}, 'edges': {}}}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "occurrences = corpus.search(req1)\n", "assert len(occurrences) == corpus.count(req1)\n", "occurrences[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get occurrences including edges\n", "The edge is named `e`, and the label of the dependency is reported in the output" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.411647Z", "iopub.status.busy": "2024-08-11T16:53:06.411537Z", "iopub.status.idle": "2024-08-11T16:53:06.450209Z", "shell.execute_reply": "2024-08-11T16:53:06.449941Z" } }, "outputs": [ { "data": { "text/plain": [ "{'sent_id': 'w05010027',\n", " 'matching': {'nodes': {'Y': '12', 'X': '10'},\n", " 'edges': {'e': {'source': '10',\n", " 'label': {'1': 'comp', '2': 'obj'},\n", " 'target': '12'}}}}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "req6 = Request().pattern(\"e: X->Y; X[upos=VERB]\")\n", "corpus.search(req6)[3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### As with `count`, we can cluster the results of a `search`" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2024-08-11T16:53:06.451912Z", "iopub.status.busy": "2024-08-11T16:53:06.451800Z", "iopub.status.idle": "2024-08-11T16:53:06.481277Z", "shell.execute_reply": "2024-08-11T16:53:06.481025Z" } }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['Yes', 'No'])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = corpus.search(req6, clustering_parameter=[\"{X << Y}\"])\n", "result.keys()" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 2 }