Dataset of CONTAINMENT and SUPPORT in the Uralic languages of the Volga-Kama area

dc.contributor.affiliationLudwig-Maximilians-Universität München, University of Helsinki-Erkkilä, Riku
dc.contributor.authorErkkilä, Riku
dc.date.accessioned2025-04-29T14:03:01Z
dc.date.issued2022-09-08
dc.descriptionThis open access dataset contains examples of the expressions of CONTAINMENT and SUPPORT in the Uralic languages of the Volga-Kama area. The exact languages and the sources of data are given in Table 1. The dataset contains data on the relational nouns (RN) and plain spatial cases expressing prototypical CONTAINMENT and SUPPORT in the languages. The RN included in the dataset are listed in Table 2, and the case forms in Table 3.

| language | corpora |
| Erzya (MdE) | Syatko-subcorpus of the MokshEr corpus (MokshEr 2010) |
| Moksha (MdM) | Subcorpora in Moksha of the MokshEr corpus (MokshEr 2010) |
| Meadow Mari (MaM) | Marko East (Marko [no year]), Oncyko (Oncyko 2000), Meadow Mari corpus (Arkhangelskiy 2019b), Wanca (Meadow Mari) (Helsingin yliopisto et al. 2019) |
| Hill Mari (MaH) | Marko West (Marko [no year]), Wanca (Hill Mari) (Helsingin yliopisto et al. 2019) |
| Udmurt (Udm) | Pilot version of Udmurt corpus (relational nouns; presently included in [Arkhangelskiy 2018]); Udmurt corpus (content nouns) (Arkhangelskiy 2018) |
| Komi Zyrian (KoZ) | Komi Zyrian Web Corpus (Arkhangelskiy 2019a), Коми корпус (Fu-Lab team 2021) |
| Komi Permyak (KoP) | Komi Permyak text collection from the University of Turku (Permyak 2009) |
Table 1. Languages included in the dataset and the sources of the data for each language.

| | MdE | MdM | MaM | MaH | Udm | KoZ | KoP |
| containment | pot(mo)- | potmə- | kørgø, kørgə- | kørgə̈- | puʃk- | pɨt͡ʃk- | pɨt͡ʃk- |
| support | lang- | lang- | ymba- | βə̈(l)- | vɨl- | vɨl-/vɨv- | vɨl-/vɨv- |
Table 2. RN included in the dataset.

| | MdE | MdM | MaM | MaH | Udm | KoZ | KoP |
| location | -so/-se (inessive) | -sa (inessive) | -ʃte/-ʃto/-ʃtø (inessive) | -ʃtə/-ʃtə̈ (inessive) | -ɨn (inessive) | -ɨn (inessive) | -ɨn (inessive) |
| source | -sto/-ste (elative) | -sta (elative) | gət͡ɕ (source postposition) | gə̈t͡s (source postposition) | -ɨɕ (elative) | -ɨɕ (elative) | -iɕ (elative) |
| goal | -s (illative) | -s/-t͜s (illative) | -ʃke/-ʃko/-ʃkø/-ʃ (illative) | -ʃkə/-ʃkə̈/-ʃ (illative) | -e/-ɨ (illative) | -ɘ (illative) | -ɘ (illative) |
| path | -ka/-ga/-va (prolative) | -ka/-ga/-va/-gæ (prolative) | - | - | -ti/-eti/-jeti/-ɨti (prolative) | -ɘd (prolative); -ti (transitive) | -ɘt (prolative); -ti (transitive) |
Table 3. Cases included in the dataset. Not all cases necessarily appear in every set, as there is no data for some combinations of RN and case.

The main purpose of the dataset is to enable the study of variation between a plain case and an RN inflected in case when expressing CONTAINMENT or SUPPORT. To facilitate this, each expression of a relation has been given a prototypicality score (4 = most prototypical, 1 = non-prototypical), which indicates whether the relation between landmark and trajector expressed in the sentence is typical for the entities participating in it. The prototypicality scores are based on the pre-linguistic concepts of containment and support, which are robustly attested and therefore should be independent of any single language. The scoring is based on the authors' understanding of the language-external relations, and no native consultants were used to verify the results. Therefore, some caution is in order when using the dataset.

The dataset contains files with data on CONTAINMENT RN, SUPPORT RN, and plain case for all the included languages. The files are named according to the scheme element_language (e.g. Containment_Erzya for the containment data on Erzya).
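
As an illustration of the naming scheme, the following minimal Python sketch splits a file name stem into its element and language parts. Only Containment_Erzya is attested in this description; the function name parse_file_name, the DatasetFile type, and the choice to split on the first underscore are assumptions made for this sketch, not part of the dataset.

```python
from typing import NamedTuple


class DatasetFile(NamedTuple):
    """Hypothetical container for the two parts of a file name stem."""
    element: str
    language: str


def parse_file_name(stem: str) -> DatasetFile:
    """Split a stem such as 'Containment_Erzya' into element and language.

    Assumption: the first underscore separates the two parts, so any
    further underscores stay in the language part.
    """
    element, separator, language = stem.partition("_")
    if not separator or not element or not language:
        raise ValueError(f"not in element_language form: {stem!r}")
    return DatasetFile(element=element, language=language)


if __name__ == "__main__":
    parsed = parse_file_name("Containment_Erzya")
    print(parsed.element, parsed.language)
```

How languages with two-word names are spelled in the actual file names is not stated here, so the stems should be checked against the published files before relying on this split.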

Files named element_frequencies show the number of examples divided by case and prototypicality score for each language, and files named element_summary show the total number of each prototypicality score for each language. For plain case there are also separate summary files for the scores of CONTAINMENT and SUPPORT.

The dataset is annotated for the following information:
- The case in which the content noun or RN is inflected.
- The predicate as inflected in the data.
- The content noun as given in the data.
- Translations of both (mainly in citation form, but for the predicate sometimes with some grammatical information, cf. abbreviations below).
- The prototypicality score. In the data on plain cases the prototypicality score is given only for the clauses where the relation is either CONTAINMENT or SUPPORT (i.e. the prototypicality score indicates the prototypicality of the relation as CONTAINMENT or SUPPORT according to the type of relation expressed).
- In the data on plain cases, the relation expressed by the case (CONT = CONTAINMENT, SUP = SUPPORT, N/A = some other relation).
- The original sentence context.
- Free translation. Some of the translations follow the lexical meanings and syntactic structures of the languages, so the English is unidiomatic from time to time.
- The file name with which the original sentence can be located in the corpus.

The three final columns are at the moment partly lacking for the Mari and Komi languages. The translations in the data are intended only as guidelines, and anyone using the dataset should refer to the original language data in the analysis.

The data in the columns is presented according to the following conventions: If the predicate is in square brackets, the predicate is not present in the clause with the target landmark (LM).
This can be for two reasons: 1) the predicate is given in a previous clause and is elliptically omitted, or 2) the "predicate" is a copula, which is not obligatory in the present tense in the languages studied. The following abbreviations are used to specify the meaning of the predicate when the English translation is ambiguous (note that the use is not checked, and the abbreviations might be lacking from some predicates):

CAUS    causative
CONT    continuative
CVB     converb
FRQ     frequentative
INCH    inchoative
INF     infinitive
ITR     intransitive
MOM     momentaneous
NEG     negative
NMLZ    nominalization
PASS    passive
PTCP    participle
REFL    reflexive
TRA     transitive

The authors of this dataset are Tomi Koivunen and Riku Erkkilä and it is published under CC-BY-NC-ND licence. If used in a publication, please refer to this publication as well as mention the original source(s):

This dataset has been used in the following publications:

References to used corpora:
Arkhangelskiy, Timofey. 2018. Udmurt corpus. http://udmurt.web-corpora.net/index.html.
Arkhangelskiy, Timofey. 2019a. Komi-Zyrian corpus. http://komi-zyrian.web-corpora.net/index.html.
Arkhangelskiy, Timofey. 2019b. Meadow Mari corpus. http://meadow-mari.web-corpora.net/index_en.html.
Fu-Lab team. 2021. Корпус коми языка. http://komicorpora.ru/.
Helsingin yliopisto, FIN-CLARIN, H. Jauhiainen, T. Jauhiainen & K. Lindén. 2019. Wanca 2016, Korp Version. Kielipankki. http://urn.fi/urn:nbn:fi:lb-2019052401.
Marko. (no year). MARKO - Corpus of Mari language. University of Turku.
MokshEr, V.3. 2010. Mokšan ja ersän sähköinen korpus. Turun yliopisto.
Oncyko. 2000. Oncyko corpus. University of Turku.
Permyak. 2009. Turku Komi-Permyak Corpus. University of Turku.
dc.identifierhttps://doi.org/10.5281/zenodo.7081747
dc.identifier.urihttps://datakatalogi.helsinki.fi/handle/123456789/5495
dc.rights.licensecc-by-4.0
dc.subjectrelational nouns
dc.subjectspatial cases
dc.subjectUralic languages
dc.titleDataset of CONTAINMENT and SUPPORT in the Uralic languages of the Volga-Kama area
dc.typedataset
