OcWikiAnnot: Annotated Wikipedia Corpus of Occitan
dc.contributor.affiliation | University of Helsinki | |
dc.contributor.author | Miletic, Aleksandra | |
dc.date.accessioned | 2025-04-29T14:01:10Z | |
dc.date.issued | 2023-04-20 | |
dc.description | OcWikiAnnot is a corpus of Occitan Wikipedia content that has been tokenized, PoS-tagged and lemmatized. The corpus contains 100 000 sentences, for a total of 2 037 723 tokens. It is derived from the Occitan Wikipedia corpus that is part of the Leipzig Corpora Collection. | |
dc.identifier | https://doi.org/10.5281/zenodo.7777340 | |
dc.identifier.uri | https://datakatalogi.helsinki.fi/handle/123456789/4791 | |
dc.rights.license | cc-by-4.0 | |
dc.subject | Occitan | |
dc.subject | Wikipedia | |
dc.subject | PoS-tagging | |
dc.subject | lemmatization | |
dc.title | OcWikiAnnot: Annotated Wikipedia Corpus of Occitan | |
dc.type | dataset |