{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "title: \"Grewpy • request\"\n",
    "date: 2024-04-22\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[`grewpy` Tutorial](../tutorial)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Grewpy tutorial: Run requests on a corpus\n",
    "\n",
    "Download the notebook [here](../request.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:33.041070Z",
     "iopub.status.busy": "2025-07-03T15:08:33.040779Z",
     "iopub.status.idle": "2025-07-03T15:08:33.313668Z",
     "shell.execute_reply": "2025-07-03T15:08:33.312661Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "connected to port: 58561\n"
     ]
    }
   ],
   "source": [
    "import grewpy\n",
    "from grewpy import Corpus, Request\n",
    "\n",
    "grewpy.set_config(\"sud\") # ud or basic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import data\n",
    "The `Corpus` constructor takes a `conllu` file or a directory containing `conllu` files.\n",
    "A `Corpus` allows to make queries and to count occurrences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:33.359348Z",
     "iopub.status.busy": "2025-07-03T15:08:33.357724Z",
     "iopub.status.idle": "2025-07-03T15:08:34.135890Z",
     "shell.execute_reply": "2025-07-03T15:08:34.134969Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'grewpy.corpus.Corpus'>\n"
     ]
    }
   ],
   "source": [
    "treebank_path = \"SUD_English-PUD\"\n",
    "corpus = Corpus(treebank_path)\n",
    "print(type(corpus))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.138605Z",
     "iopub.status.busy": "2025-07-03T15:08:34.138007Z",
     "iopub.status.idle": "2025-07-03T15:08:34.143166Z",
     "shell.execute_reply": "2025-07-03T15:08:34.142372Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "n_sentencens = 1000\n",
      "sent_ids[0] = 'n01001011'\n"
     ]
    }
   ],
   "source": [
    "n_sentencens = len(corpus)\n",
    "sent_ids = corpus.get_sent_ids()\n",
    "\n",
    "print(f\"{n_sentencens = }\")\n",
    "print(f\"{sent_ids[0] = }\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore data\n",
    "See the [Grew-match tutorial](https://universal.grew.fr/?tutorial=yes) to practice writing Grew requests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the number of subjets in the corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.146587Z",
     "iopub.status.busy": "2025-07-03T15:08:34.146004Z",
     "iopub.status.idle": "2025-07-03T15:08:34.175704Z",
     "shell.execute_reply": "2025-07-03T15:08:34.174998Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1419"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req1 = Request(\"pattern { X-[subj]->Y }\")\n",
    "corpus.count(req1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is possible to extend an already existing request with the methods `pattern`, `without` and `with_` (because `with` is a Python keyword).\n",
    "Hence, the request `req1bis` below is equivalent to `req1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.178190Z",
     "iopub.status.busy": "2025-07-03T15:08:34.177783Z",
     "iopub.status.idle": "2025-07-03T15:08:34.200160Z",
     "shell.execute_reply": "2025-07-03T15:08:34.199439Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1419"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req1bis = Request().pattern(\"X-[subj]->Y\")\n",
    "corpus.count(req1bis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the number of subjects such that the subject's head is not a pronoun"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.202701Z",
     "iopub.status.busy": "2025-07-03T15:08:34.202235Z",
     "iopub.status.idle": "2025-07-03T15:08:34.224881Z",
     "shell.execute_reply": "2025-07-03T15:08:34.224198Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "931"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req2 = Request().pattern(\"X-[subj]->Y\").without(\"Y[upos=PRON]\")\n",
    "corpus.count(req2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the number of subjects with at least one dependant\n",
    "Note the usage of `with_` (because `with` is a Python keyword)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.227747Z",
     "iopub.status.busy": "2025-07-03T15:08:34.227302Z",
     "iopub.status.idle": "2025-07-03T15:08:34.252759Z",
     "shell.execute_reply": "2025-07-03T15:08:34.251476Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "752"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req3 = Request().pattern(\"X-[subj]->Y\").with_(\"Y->Z\")\n",
    "corpus.count(req3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `with` and `without` items can be stacked \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.257314Z",
     "iopub.status.busy": "2025-07-03T15:08:34.256280Z",
     "iopub.status.idle": "2025-07-03T15:08:34.285477Z",
     "shell.execute_reply": "2025-07-03T15:08:34.283615Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "317"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req4 = Request().pattern(\"X-[subj]->Y\").with_(\"Y->Z\").without(\"Y[upos=PRON]\").without(\"X[upos=VERB]\")\n",
    "corpus.count(req4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Building a request with the raw Grew syntax\n",
    "It is possible to build request directly from the concrete syntax used in Grew-match or in Grew rules.\n",
    "The `req4` can be written:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.289913Z",
     "iopub.status.busy": "2025-07-03T15:08:34.288865Z",
     "iopub.status.idle": "2025-07-03T15:08:34.320410Z",
     "shell.execute_reply": "2025-07-03T15:08:34.319717Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "317"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req4bis = Request(\"\"\"\n",
    "pattern { X-[subj]->Y }\n",
    "with { Y->Z }\n",
    "without { Y[upos=PRON] }\n",
    "without { X[upos=VERB] }\n",
    "\"\"\")\n",
    "corpus.count(req4bis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More complex queries are allowed, with results clustering\n",
    "See [Clustering](../../doc/clustering) for more documentation.\n",
    "Below, we cluster the subject relation, according to the POS of the governor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.322887Z",
     "iopub.status.busy": "2025-07-03T15:08:34.322586Z",
     "iopub.status.idle": "2025-07-03T15:08:34.354575Z",
     "shell.execute_reply": "2025-07-03T15:08:34.353587Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'VERB': 825, 'SCONJ': 1, 'PART': 5, 'NOUN': 3, 'AUX': 581, 'ADP': 3, 'ADJ': 1}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req5 = Request(\"pattern {X-[subj]->Y}\")\n",
    "corpus.count(req5, clustering_parameter=[\"X.upos\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clustering results by other requests\n",
    "The clustering is done on the relative position of `X` and `Y`.\n",
    "It answers to the question: _How many subjects are in a pre-verbal position?_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.359007Z",
     "iopub.status.busy": "2025-07-03T15:08:34.358711Z",
     "iopub.status.idle": "2025-07-03T15:08:34.381657Z",
     "shell.execute_reply": "2025-07-03T15:08:34.380712Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Yes': 76, 'No': 1343}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus.count(req5, clustering_parameter=[\"{X << Y}\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example corresponds to the whether clustering in Grew-match.\n",
    "Note that here curly braces are required around `X << Y` to indicate that whether clustering should be performed instead of key clustering."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Two clusterings can be applied"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.384427Z",
     "iopub.status.busy": "2025-07-03T15:08:34.384063Z",
     "iopub.status.idle": "2025-07-03T15:08:34.415443Z",
     "shell.execute_reply": "2025-07-03T15:08:34.414552Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Yes': {'VERB': 44, 'SCONJ': 1, 'AUX': 30, 'ADP': 1},\n",
       " 'No': {'VERB': 781, 'PART': 5, 'NOUN': 3, 'AUX': 551, 'ADP': 2, 'ADJ': 1}}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus.count(req5, clustering_parameter=[\"{X << Y}\",\"X.upos\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More than two clusterings are also possible"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.418317Z",
     "iopub.status.busy": "2025-07-03T15:08:34.417826Z",
     "iopub.status.idle": "2025-07-03T15:08:34.449403Z",
     "shell.execute_reply": "2025-07-03T15:08:34.448724Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Yes': {'VERB': {'Yes': 15, 'No': 29},\n",
       "  'SCONJ': {'No': 1},\n",
       "  'AUX': {'Yes': 21, 'No': 9},\n",
       "  'ADP': {'No': 1}},\n",
       " 'No': {'VERB': {'Yes': 167, 'No': 614},\n",
       "  'PART': {'No': 5},\n",
       "  'NOUN': {'Yes': 2, 'No': 1},\n",
       "  'AUX': {'Yes': 255, 'No': 296},\n",
       "  'ADP': {'No': 2},\n",
       "  'ADJ': {'No': 1}}}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus.count(req5, clustering_parameter=[\"{X << Y}\",\"X.upos\", \"{X[Number=Sing]}\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Search occurrences\n",
    "Get the list of occurrence of a given request in the corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.454553Z",
     "iopub.status.busy": "2025-07-03T15:08:34.453779Z",
     "iopub.status.idle": "2025-07-03T15:08:34.499361Z",
     "shell.execute_reply": "2025-07-03T15:08:34.498641Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sent_id': 'w05010027',\n",
       " 'matching': {'nodes': {'Y': '8', 'X': '10'}, 'edges': {}}}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "occurrences = corpus.search(req1)\n",
    "assert len(occurrences) == corpus.count(req1)\n",
    "occurrences[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get occurrences including edges\n",
    "The edge is named `e`, and the label of the dependency is reported in the output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.502685Z",
     "iopub.status.busy": "2025-07-03T15:08:34.502121Z",
     "iopub.status.idle": "2025-07-03T15:08:34.631583Z",
     "shell.execute_reply": "2025-07-03T15:08:34.630130Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sent_id': 'w05010027',\n",
       " 'matching': {'nodes': {'Y': '12', 'X': '10'},\n",
       "  'edges': {'e': {'source': '10',\n",
       "    'label': {'1': 'comp', '2': 'obj'},\n",
       "    'target': '12'}}}}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "req6 = Request().pattern(\"e: X->Y; X[upos=VERB]\")\n",
    "corpus.search(req6)[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### As with `count`, we can cluster the results of a `search`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-07-03T15:08:34.634694Z",
     "iopub.status.busy": "2025-07-03T15:08:34.634426Z",
     "iopub.status.idle": "2025-07-03T15:08:34.708511Z",
     "shell.execute_reply": "2025-07-03T15:08:34.707794Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys(['Yes', 'No'])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = corpus.search(req6, clustering_parameter=[\"{X << Y}\"])\n",
    "result.keys()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}