# **SUD_Swedish-Talbanken**: SUD treebank conversion for the corpus UD_Swedish-Talbanken version 2.17
This treebank was automatically generated from the UD treebank:
[UD_Swedish-Talbanken](https://github.com/UniversalDependencies/UD_Swedish-Talbanken/releases/tag/r2.17).

See [SUD data page](https://surfacesyntacticud.org/data/) for more details about the conversion process.

The rest of this file is a copy of the original README associated with **UD_Swedish-Talbanken** and therefore refers to UD.

---
---

# Summary

The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University
in the 1970s.

# Introduction

The Swedish-Talbanken treebank is a conversion of the Prose section of Talbanken (Einarsson,
1976), originally annotated by a team led by Ulf Teleman at Lund University according
to the MAMBA annotation scheme (Teleman, 1974). It consists of roughly 6,000 sentences
and 95,000 tokens taken from a variety of informative text genres, including textbooks,
information brochures, and newspaper articles. The syntactic annotation is converted
directly from the original MAMBA annotation, while the morphological annotation is
based on the reannotation performed when incorporating Talbanken into the Swedish
Treebank (Nivre and Megyesi, 2007). Tokenization mostly follows the standard of the
Stockholm-Umeå Corpus, Version 2.0 (2006), and lemmatization is based on Saldo
(Borin et al., 2008).

# Acknowledgments

The new conversion has been performed by Joakim Nivre and Aaron Smith at Uppsala
University. A semi-automatic correction of features and lemmas was done by Victor 
Norrman and Joakim Nivre. We thank everyone who has been involved in previous 
conversion efforts at Växjö University and Uppsala University, including Bengt 
Dahlqvist, Sofia Gustafson-Capkova, Johan Hall, Anna Sågvall Hein, Beáta Megyesi, 
Jens Nilsson, and Filip Salomonsson. Special thanks also to Lars Borin and Markus 
Forsberg at Språkbanken for help with the lemmatization. Finally, we owe a huge 
debt to the team who produced the original treebank in the 1970s.

## References

* Lars Borin, Markus Forsberg, Lennart Lönngren. 2008. Saldo 1.0 (Svenskt
  associationslexikon version 2). Språkbanken, Göteborg universitet.

* Einarsson, Jan. 1976. Talbankens skriftspråkskonkordans. Lund University:
  Department of Scandinavian Languages.

* Joakim Nivre and Beáta Megyesi. 2007. Bootstrapping a Swedish treebank
  using cross-corpus harmonization and annotation projection. In Proceedings
  of the 6th International Workshop on Treebanks and Linguistic Theories,
  pages 97-102.

* Teleman, Ulf. 1974. Manual för grammatisk beskrivning av talad och skriven
  svenska. Studentlitteratur.

* The Stockholm Umeå Corpus. Version 2.0. 2006. Stockholm University:
  Department of Linguistics.

# Data Splits

The test set (sv-ud-test.conllu) is the standard test set from the Swedish
Treebank, which is a balanced sample of complete documents from different
parts of the treebank.

The rest of the treebank has been split by taking the first 90% as the
training set (sv-ud-train.conllu) and the last 10% as the development set
(sv-ud-dev.conllu).

Document and paragraph boundaries are explicitly represented by comment
lines (# newdoc id = DOC_ID, # newpar id = PAR_ID), but genre classification
is not available for documents.
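
The boundary comments can be read with a few lines of Python. The following is a minimal sketch, not part of any treebank tooling: the function name `group_paragraphs` and the sample identifiers are hypothetical, and it assumes only the comment format shown above.

```python
import re

# Matches "# newdoc id = X" and "# newpar id = Y" comment lines.
BOUNDARY = re.compile(r"^#\s*(newdoc|newpar)\s+id\s*=\s*(\S+)")


def group_paragraphs(lines):
    """Map each document id to the list of paragraph ids found under it."""
    docs = {}
    current = None
    for line in lines:
        match = BOUNDARY.match(line)
        if not match:
            continue  # token lines and other comments are ignored
        kind, ident = match.groups()
        if kind == "newdoc":
            current = ident
            docs[current] = []
        elif current is not None:
            docs[current].append(ident)
    return docs


# Invented sample for illustration only (not taken from the treebank).
sample = [
    "# newdoc id = doc-a",
    "# newpar id = doc-a-p1",
    "# sent_id = example-1",
    "# newpar id = doc-a-p2",
    "# newdoc id = doc-b",
    "# newpar id = doc-b-p1",
]
result = group_paragraphs(sample)
print(result)
```

To run this on a real file, pass an open file handle (for example the training file) instead of `sample`.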

# Tokenization

The tokenization in the Swedish-Talbanken treebank follows the principles of the
Stockholm-Umeå Corpus, Version 2.0 (SUC, 2006), which has become the de facto
standard for Swedish tokenization and part-of-speech tagging. This is a
straightforward segmentation based on whitespace and punctuation, but the
following special cases deserve to be mentioned:

- Numerical expressions (including dates) are treated as single words and not
  segmented into their components as long as they do not contain spaces.
- Abbreviations are treated as single words regardless of whether they contain
  spaces or not.

The Swedish-Talbanken treebank contains the following tokens with spaces (all abbreviations):

- Bl a
- bl a
- d v s
- e d
- f n
- fr o m
- Fr o m
- m fl
- m m
- o s v
- s k
- t ex
- t o m
- t v

The Swedish-Talbanken treebank does not contain multiword tokens.
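
Because these tokens contain spaces, code that reads the data should split token lines on tabs only, never on general whitespace. The sketch below collects every word form containing a space. It is illustrative: the function name `forms_with_spaces` is hypothetical, the sample lines are invented, and it assumes the standard ten-column, tab-separated CoNLL-U layout with FORM in the second column.

```python
def forms_with_spaces(conllu_lines):
    """Return the sorted set of word forms that contain a space."""
    found = set()
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line
        form = columns[1]  # FORM is the second CoNLL-U column
        if " " in form:
            found.add(form)
    return sorted(found)


# Invented token lines; all columns except ID and FORM are left as "_".
sample = [
    "# text = invented example",
    "1\tt ex\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tböcker\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tm m\t_\t_\t_\t_\t_\t_\t_\t_",
]
spaced = forms_with_spaces(sample)
print(spaced)
```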

# Morphology

The morphological annotation in the Swedish-Talbanken treebank follows the general
guidelines and does not add any language-specific features. The
language-specific tags (including features) follow the guidelines of the
Stockholm-Umeå Corpus.

The mapping from language-specific tags and features to universal tags and
features was done automatically. We are not aware of any remaining errors or
inconsistencies, but the mapping has not been validated manually.

Lemmas were assigned using SALDO (Borin et al., 2008) in combination with
the language-specific SUC tags. Cases of remaining ambiguity were resolved
heuristically, which may have introduced errors. For words and symbols not
covered by SALDO, lemmas were added manually.

# Syntax

The syntactic annotation in the Swedish-Talbanken treebank follows the general
guidelines but adds four language-specific relations:

- acl:relcl for relative clauses
- compound:prt for verb particles
- nmod:agent for agents of passive verbs
- nmod:poss for possessive/genitive modifiers
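
As a rough way to inspect these relations, the following sketch counts how often each of the four subtypes above occurs in the DEPREL column. It is an illustration under assumptions: the function name `count_language_specific` is hypothetical, the sample lines are invented, and the labels are those of the UD version described here (the SUD conversion uses its own label inventory).

```python
from collections import Counter

# The four language-specific subtypes listed above.
LANGUAGE_SPECIFIC = {"acl:relcl", "compound:prt", "nmod:agent", "nmod:poss"}


def count_language_specific(conllu_lines):
    """Count occurrences of the listed subtypes in the DEPREL column."""
    counts = Counter()
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 10 or not columns[0].isdigit():
            continue  # keep only regular token lines with an integer ID
        deprel = columns[7]  # DEPREL is the eighth CoNLL-U column
        if deprel in LANGUAGE_SPECIFIC:
            counts[deprel] += 1
    return counts


# Invented token lines; only ID, HEAD and DEPREL are filled in.
sample = [
    "1\t_\t_\t_\t_\t_\t2\tnmod:poss\t_\t_",
    "2\t_\t_\t_\t_\t_\t0\troot\t_\t_",
    "3\t_\t_\t_\t_\t_\t2\tcompound:prt\t_\t_",
    "4\t_\t_\t_\t_\t_\t2\tnmod:poss\t_\t_",
]
relation_counts = count_language_specific(sample)
print(relation_counts)
```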

The syntactic annotation has been automatically converted from the original
MAMBA annotation scheme in Talbanken. The following phenomena are known to
deviate from the general guidelines and will be fixed in future versions:

- The remnant analysis of ellipsis has not been fully implemented.
- Comparative modifiers are sometimes not attached to the comparative element
  itself but to its head.

# Changelog

From v1 to v1.1, an extensive (but not complete) manual validation was
carried out, resulting in a large number of conversion errors being
corrected. Specifically, all non-projective trees were validated.

From v1.1 to v1.2, complex names and multiword expressions have been
manually validated. As a result, the annotation of complex names now
conforms to the universal guidelines.

From v1.2 to v1.3, we fixed the following annotation bugs/inconsistencies:
- All conj relations are now left-headed
- No mark relations are filled by PRON
- All punct relations are filled by PUNCT (and vice versa)
- All cop relations are filled by VERB (not AUX)
- All DET and PRON have a PronType feature
- All AUX and VERB have a VerbType feature
- All NUM have a NumType feature
- No predicate has more than one subject (except expl + nsubj/csubj)
- No case relations attach to predicates

From v1.3 to v1.4, only the documentation has been updated to reflect
the fact that there are two treebanks for Swedish.

From v1.4 to v2.0, we have implemented the following changes to conform
to v2 of the guidelines:
- Rename CONJ -> CCONJ
- Retag copula verbs VERB -> AUX
- Rename dobj -> obj
- Rename nsubjpass -> nsubj:pass
- Rename csubjpass -> csubj:pass
- Rename auxpass -> aux:pass
- Rename mwe -> fixed
- Rename name -> flat:name
- Split nmod into obl and nmod
- Reattach cc and punct in coordination
- Reanalyze neg as advmod + Polarity=Neg
- Add features Abbr=Yes and Foreign=Yes
- Replace "_" by " " in words with spaces
- Add # sent_id and # text for all sentences

From v2.0 to v2.1, no changes have been made.

From v2.1 to v2.2:
- Repository renamed from UD_Swedish to UD_Swedish-Talbanken.
- Harmonization with other Swedish treebanks:
  - Possessives retagged DET -> PRON
  - Negations ("inte", "icke", "ej") retagged ADV -> PART
  - Comparative markers ("som", "än") retagged CCONJ -> SCONJ
  - Comparative with nominal complement relabeled advcl -> obl [mark -> case, SCONJ -> ADP]
  - Clefts reanalyzed as copula constructions and relabeled acl:relcl -> acl:cleft
  - Temporal subordinating conjunctions ("när", "då") retagged ADV -> SCONJ and relabeled advmod -> mark
- Fixed a small number of annotation errors
- Added enhanced dependencies

From v2.2 to v2.3:
- Fixed a small number of errors in both basic and enhanced dependencies

From v2.6 to v2.7:
- Fixed a small number of lemma errors

From v2.10 to v2.11:
- Fixed double subject error
- Fixed unknown enhanced relations (mainly foreign and abbreviations)

From v2.13 to v2.14:
- Harmonized lemmas and features for adjectives and determiners across all Swedish treebanks.

From v2.14 to v2.15:
- Revised the annotation of fixed expressions, reducing the number of fixed expressions in Swedish to 27.
- Construction annotations in the [UCxn](https://github.com/LeonieWeissweiler/UCxn) framework added to MISC
  - This release adds rule-based annotations of Interrogatives, Conditionals, Existentials, and NPN (noun-preposition-noun) constructions on the head of the respective phrase, plus construction elements.
  - The UCxn v1 notation and categories are documented [here](https://github.com/LeonieWeissweiler/UCxn/blob/main/docs/UCxn-v1.pdf).

From v2.15 to v2.16:
- Fixed errors in the annotation of "vara" (be) as AUX vs. VERB.
- Removed "behöva" (need) from the inventory of auxiliaries. 
- Harmonized lemmas, UPOS and features for participles.
- Fixed a number of miscellaneous reported errors.
- Added features to lemma "olik".
- Harmonized lemmas, UPOS and features for DET/PRON/ADJ.
- Implemented code-switched analysis for cross-lingual content.

From v2.16 to v2.17:
- Fixed a number of validation errors related to the obl/nmod distinction.
- Removed the feature Mood=Ind from participles used as adjectives.
- Resegmented sentences: sv-ud-train-1337, sv-ud-train-3727.
- Fixed Voice=Pass errors (Issue #1122).
- Fixed the annotation of "själv" as a depictive (Issue #1126).
- Changed deprel from advcl to ccomp for "NOUN går att VERB" (Issue #1128).
- Changed deprel from advcl to xcomp for "hjälpa NOUN att VERB" (Issue #1129).
- Fixed "så att" and the POS tag of "än" (Issue #1092).
- Fixed a number of annotation errors (Issue #1132).
- Changed appos to a(dv)cl:relcl for non-adnominal relative clauses (Issue #1139).
- Harmonised analysis of "när" (Issue #1148).
- Harmonised tagging of "som" (Issue #1149).
- Fixed lemmas of truncated compounds (Issue #1150).

<pre>
=== Machine readable metadata ==============
Data available since: UD v1.0
License: CC BY-SA 4.0
Includes text: yes
Parallel: no
Genre: news nonfiction
Lemmas: automatic with corrections
UPOS: converted with corrections
XPOS: manual native
Features: converted with corrections
Relations: converted with corrections
Contributors: Nivre, Joakim; Smith, Aaron; Norrman, Victor
Contributing: elsewhere
Contact: joakim.nivre@lingfil.uu.se
============================================
</pre>
