Tokenization for Occitan (Gascon and Lengadocian)

Restricted Availability

Date

2024-06-24

Persistent identifier of the Data Catalogue metadata

Creator/contributor

Editor

Journal title

Journal volume

Publisher

Publication Type

software

Peer Review Status

Repositories

Access rights

ISBN

ISSN

Description

A rule-based Python programme for tokenising texts in Occitan. To launch the programme, run the following command:

    python3 tokenizer_occitan.py < input.txt > output.conllu

The script reads a text file containing one sentence per line; each line starts with a sentence ID, followed by a tab character, followed by the sentence itself. The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland).
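The input format described above (sentence ID, tab, sentence, one per line) can be produced with a few lines of Python. The following is a minimal sketch and not part of the tool itself: the function name `format_input_lines`, the `sent-N` ID scheme, and the sample sentences are assumptions chosen for illustration.

```python
def format_input_lines(sentences, prefix="sent"):
    """Return lines of the form '<ID><TAB><sentence>', one per sentence.

    Hypothetical helper: the ID scheme '<prefix>-<n>' is an assumption;
    the description only requires some sentence ID before the tab.
    """
    # Collapse runs of whitespace (including stray tabs and newlines) so that
    # each sentence stays on one line and the only tab is the ID separator.
    cleaned = (" ".join(sentence.split()) for sentence in sentences)
    # Drop sentences that are empty after cleaning.
    kept = [sentence for sentence in cleaned if sentence]
    return [f"{prefix}-{number}\t{sentence}"
            for number, sentence in enumerate(kept, start=1)]


if __name__ == "__main__":
    # Placeholder sentences; replace with the text to be tokenised.
    sample = ["Adiu  monde", "", "Bona\tnuèit"]
    with open("input.txt", "w", encoding="utf-8") as handle:
        for line in format_input_lines(sample):
            handle.write(line + "\n")
```

Running this file writes `input.txt` in the current directory, which can then be passed to the tokeniser with the command given in the description.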

Keyword (yso)

Publication Series

Location of the original dataset