Tokenization for Occitan (Gascon and Lengadocian)
Restricted Availability
Date
2024-06-24
Publication Type
software
Description
A rule-based Python programme for tokenising Occitan texts.
To run the programme, execute the following command:
python3 tokenizer_occitan.py < input.txt > output.conllu
The script takes as input a text file with one sentence per line. Each line starts with a sentence ID, followed by a tab character, followed by the sentence itself.
The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland).
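The input layout described above (sentence ID, tab, sentence, one per line) can be prepared and checked with a short helper. The following is only a sketch and is not part of the tool: the function names `write_input_file` and `parse_input_line`, the `s1`, `s2`, ... ID scheme, and the two example sentences are assumptions made for illustration.

```python
import io


def write_input_file(sentences, handle, prefix="s"):
    """Write sentences as '<ID><TAB><sentence>' lines, one sentence per line.

    The ID scheme (prefix + running number) is a hypothetical choice;
    the description only requires that some sentence ID comes first.
    """
    for index, sentence in enumerate(sentences, start=1):
        # Collapse internal tabs and newlines so each sentence stays on one line.
        cleaned = " ".join(sentence.split())
        handle.write(f"{prefix}{index}\t{cleaned}\n")


def parse_input_line(line):
    """Split one input line into (sentence_id, sentence), or raise ValueError."""
    sent_id, sep, sentence = line.rstrip("\n").partition("\t")
    if not sep or not sent_id or not sentence.strip():
        raise ValueError(f"expected '<ID><TAB><sentence>', got: {line!r}")
    return sent_id, sentence


# Build a small in-memory input and read it back to check the layout.
buffer = io.StringIO()
write_input_file(["L'ostal es bèl.", "Que soi content."], buffer)
rows = [parse_input_line(line) for line in buffer.getvalue().splitlines()]
```

To produce a real input file, the same function can be called with a file opened via `open("input.txt", "w", encoding="utf-8")` instead of the in-memory buffer.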