Tokenization for Occitan (Gascon and Lengadocian)

dc.contributor.affiliationUniversity of Helsinki-Miletic, Aleksandra
dc.contributor.authorMiletic, Aleksandra
dc.date.accessioned2025-04-29T14:02:34Z
dc.date.issued2024-06-24
dc.descriptionA rule-based Python programme to tokenise texts in Occitan. To launch the programme, execute the following instruction: python3 tokenizer_occitan.py < input.txt > output.conllu. The script takes as input a text file with a single sentence per line; each line starts with a sentence ID, followed by a tab character, followed by the sentence itself. The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland).
dc.identifierhttps://doi.org/10.5281/zenodo.12515136
dc.identifier.urihttps://datakatalogi.helsinki.fi/handle/123456789/5211
dc.rights.licensecc-by-4.0
dc.subjecttokenization
dc.subjectoccitan
dc.titleTokenization for Occitan (Gascon and Lengadocian)
dc.typesoftware
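
The input layout given in the description (sentence ID, tab character, sentence, one sentence per line) can be produced with a short helper. The following is only a minimal sketch and is not part of the distributed tool: the function name format_tokenizer_input, the "s1", "s2", ... ID scheme and the placeholder sentences are assumptions made for illustration, since the record does not say what form the IDs should take.

```python
import sys


def format_tokenizer_input(sentences, id_prefix="s"):
    """Return text with one 'ID<TAB>sentence' line per sentence.

    Hypothetical helper: the ID scheme (prefix + running number) is an
    assumption; the record only requires an ID, a tab, then the sentence.
    """
    lines = []
    for index, sentence in enumerate(sentences, start=1):
        # Collapse tabs, newlines and repeated spaces so that each sentence
        # stays on a single line and contains no stray tab characters.
        clean = " ".join(sentence.split())
        if not clean:
            # Skip empty entries rather than emitting an ID with no text.
            continue
        lines.append(f"{id_prefix}{index}\t{clean}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Placeholder sentences; replace them with real Occitan text.
    sample = ["First placeholder sentence.", "Second placeholder sentence."]
    sys.stdout.write(format_tokenizer_input(sample))
```

Redirecting this output to input.txt gives a file in the shape that the command in the description reads from standard input.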