# **SUD_Kyrgyz-TueCL**: SUD treebank conversion for the corpus UD_Kyrgyz-TueCL version 2.17
This treebank was automatically generated from the UD treebank:
[UD_Kyrgyz-TueCL](https://github.com/UniversalDependencies/UD_Kyrgyz-TueCL/releases/tag/r2.17).

See [SUD data page](https://surfacesyntacticud.org/data/) for more details about the conversion process.

The rest of this file is a copy of the original README associated with **UD_Kyrgyz-TueCL** and therefore refers to UD.

---
---

# Summary

This is a small treebank of grammatical examples for Kyrgyz.
It is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages, designed to facilitate cross-linguistic research on these related languages.

# Introduction

The Kyrgyz-TueCL treebank is part of a parallel Universal Dependencies corpus containing 148 sentences across four Turkic languages (Turkish - [UD_Turkish-TueCL](https://github.com/UniversalDependencies/UD_Turkish-TueCL/tree/dev), Azerbaijani - [UD_Azerbaijani-TueCL](https://github.com/UniversalDependencies/UD_Azerbaijani-TueCL/tree/dev), Kyrgyz - [UD_Kyrgyz-TueCL](https://github.com/UniversalDependencies/UD_Kyrgyz-TueCL/tree/dev), and Uzbek - [UD_Uzbek-TueCL](https://github.com/UniversalDependencies/UD_Uzbek-TueCL/tree/dev)), designed to facilitate cross-linguistic research on these related languages.

* Total sentences: 173
* Total tokens: 1250
* Unique word forms (types): 464
* Unique lemmas: 287

The Kyrgyz-TueCL treebank consists of 173 carefully selected sentences compiled from multiple sources, including the [Cairo corpus](https://github.com/UniversalDependencies/cairo) (20 sentences), the [UDTW23 corpus](https://github.com/ud-turkic/udtw23) (20 sentences), and 97 additional examples illustrating specific grammatical constructions of interest. It serves as a source treebank for a parallel corpus spanning four Turkic languages from distinct branches of the family: Turkish and Azerbaijani (Oghuz), Kyrgyz (Kipchak), and Uzbek (Karluk).

The treebank includes various syntactic phenomena relevant to Turkic languages, such as pro-drop constructions, auxiliary chains, postverbal structures, and non-canonical word orders. Each sentence has been manually annotated following UD guidelines, with particular attention to morphosyntactic features that highlight both shared typological characteristics and language-specific traits.
Glossing, transliteration, and translations of all sentences are provided in Azerbaijani, Turkish, Uzbek, and English as metadata to support comparative research.
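
Sentence-level metadata in CoNLL-U is stored as `# key = value` comment lines before each sentence, so the translations and the `parallel_id` mentioned in the changelog can be read with a few lines of code. The sketch below is a hedged illustration, not part of the treebank's tooling: the function names `read_sentence_metadata` and `align_by_key`, the toy `sent_id` and `parallel_id` values, and the sample texts are all hypothetical. Check the actual key names in the `.conllu` files before relying on them.

```python
def read_sentence_metadata(text):
    """Return one dict of `# key = value` comments per sentence."""
    sentences = []
    current = {}
    has_tokens = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the sentence collected so far.
            if current or has_tokens:
                sentences.append(current)
            current, has_tokens = {}, False
        elif line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:  # keep only comments of the form "key = value"
                current[key.strip()] = value.strip()
        else:
            has_tokens = True
    if current or has_tokens:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append(current)
    return sentences


def align_by_key(text_a, text_b, key="parallel_id"):
    """Pair sentences from two CoNLL-U texts that share the same key value."""
    index_b = {m[key]: m for m in read_sentence_metadata(text_b) if key in m}
    return [
        (m, index_b[m[key]])
        for m in read_sentence_metadata(text_a)
        if m.get(key) in index_b
    ]


# Hypothetical toy data; identifiers and texts are placeholders.
SAMPLE_A = "\n".join([
    "# sent_id = a-1",
    "# parallel_id = toy/1",
    "# text = first",
    "1\tfirst\tfirst\tX\t_\t_\t0\troot\t_\t_",
    "",
    "# sent_id = a-2",
    "# parallel_id = toy/2",
    "# text = second",
    "1\tsecond\tsecond\tX\t_\t_\t0\troot\t_\t_",
    "",
])

SAMPLE_B = "\n".join([
    "# sent_id = b-7",
    "# parallel_id = toy/2",
    "# text = zweite",
    "1\tzweite\tzweite\tX\t_\t_\t0\troot\t_\t_",
    "",
    "# sent_id = b-8",
    "# parallel_id = toy/3",
    "# text = dritte",
    "1\tdritte\tdritte\tX\t_\t_\t0\troot\t_\t_",
    "",
])

for meta_a, meta_b in align_by_key(SAMPLE_A, SAMPLE_B):
    print(meta_a["sent_id"], "<->", meta_b["sent_id"])
```

Indexing one side by the shared key keeps the alignment linear in the number of sentences and tolerates sentences that exist in only one of the two treebanks.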

Dependency relations, glossing, lemmatization, morphological features, POS tagging, tokenization, and transliteration were manually annotated.

This resource is significant because it belongs to the first fully aligned set of parallel UD treebanks for these Turkic languages, enabling systematic cross-linguistic comparisons previously hindered by the lack of parallel resources. The treebank supports research in comparative Turkic syntax, cross-lingual parsing, and language education.

# Acknowledgments

This work was supported by COST Action CA21167 - Universality, diversity and idiosyncrasy in language technology ([UniDive](https://unidive.lisn.upsaclay.fr/)). We thank the [Turkic UD working group](https://github.com/ud-turkic) for fruitful discussions of linguistic issues and annotation approaches.
We extend special thanks to the Kyrgyz team — [Jonathan North Washington](https://github.com/jonorthwash), Aida Kasieva, Gulnura Dzumalieva, Aigul Tursunova, Meerim Ryspakova, and Aizat Kadyrbekova — for their consistent support, as well as their valuable weekly meetings and discussions that greatly contributed to this work.

## References

Please cite the following paper if you use the Kyrgyz-TueCL UD treebank:

<pre>
@inproceedings{akhundjanova-etal-2025-parallel,
    title = "Parallel {U}niversal {D}ependencies Treebanks for {T}urkic Languages",
    author = "Akhundjanova, Arofat  and
      Akkurt, Furkan  and
      Chontaeva, Bermet  and
      Eslami, Soudabeh  and
      {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}",
    editor = {Bouma, Gosse  and
      {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
    booktitle = "Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)",
    month = aug,
    year = "2025",
    address = "Ljubljana, Slovenia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.udw-1.14/",
    pages = "129--136",
    ISBN = "979-8-89176-292-3",
    abstract = "We introduce the first fully aligned and manually annotated parallel Universal Dependencies (UD) treebanks for four Turkic languages: Azerbaijani, Kyrgyz, Turkish, and Uzbek. These resources currently consist of 148 strategically selected sentences that illustrate typologically significant morphosyntactic phenomena across these related yet distinct languages. These parallel treebanks enable systematic comparative studies of Turkic syntax and may be instrumental in cross-lingual NLP applications. All treebanks are available as part of UD v2.16."
}
</pre>


# Changelog

* 2025-09-04 v2.16
  * add reference paper
  * add parallel corpus information to machine-readable metadata
  * add parallel data support with parallel_id metadata for cross-lingual sentence matching
* 2024-05-15 v2.14
  * Initial release in Universal Dependencies.


<pre>
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.14
License: CC BY-SA 4.0
Includes text: yes
Parallel: cairo tuecl
Genre: grammar-examples
Lemmas: manual native
UPOS: manual native
XPOS: not available
Features: manual native
Relations: manual native
Contributors: Chontaeva, Bermet; Çöltekin, Çağrı
Contributing: here
Contact: bermet.chontaeva@student.uni-tuebingen.de, cagri.coeltekin@uni-tuebingen.de
===============================================================================
</pre>
