outa2020/AmazighCorpora

AmazighCorpora

This folder contains the following resources:

  1. AMTS: the Amazigh tag set (28 tags). In the 29-tag corpus, the preposition tag is split into S and S_PP.
  2. data_lab28: an annotated corpus of about 20K words. The labeled data also includes lexical n-gram features.
  3. Amazigh_Corpus: an unlabeled Amazigh corpus.
  4. data_29tags: an annotated corpus of about 20K words. This corpus uses 29 tags: S and S_PP are separated to distinguish a preposition (S) from a preposition followed by a personal pronoun (S_PP).
  5. labeledData.29.tags.2col_WORD-POS: 21K words of labeled data in two columns, the token and its part of speech.
  6. Unlabeled data, collected from various books and websites:
  • UnlabeledData.sent: contains raw text. It includes only sentences with more than 2 tokens.
  • UnlabeledData.225K.TOK: contains 225240 tokens.
  • UnlabeledData.1col: contains 224620 tokens.
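
As a rough illustration of how the two-column WORD-POS data and the S / S_PP split could be handled, here is a minimal Python sketch. It is not one of the scripts shipped in this folder. It assumes that each line holds a token and its tag separated by whitespace and that blank or malformed lines can be skipped; the real file layout may differ. The function names and the sample tokens (`w1`, `w2`, `w3`) are hypothetical placeholders, not corpus content.

```python
from collections import Counter


def parse_word_pos(lines):
    """Parse 'token tag' lines into (token, tag) pairs.

    Assumption: columns are whitespace-separated; lines that do not
    have exactly two fields (e.g. blank lines) are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        pairs.append((parts[0], parts[1]))
    return pairs


def collapse_s_pp(pairs):
    """Map S_PP back to S, reducing the 29-tag annotation to 28 tags."""
    return [(tok, "S" if tag == "S_PP" else tag) for tok, tag in pairs]


def keep_long_sentences(sentences, min_tokens=3):
    """Keep sentences with more than 2 whitespace-separated tokens."""
    return [s for s in sentences if len(s.split()) >= min_tokens]


# Placeholder sample, not real corpus data.
sample = ["w1\tS", "w2\tS_PP", "", "w3\tS"]
pairs = parse_word_pos(sample)
merged = collapse_s_pp(pairs)
tag_counts = Counter(tag for _, tag in merged)
```

To apply this to a real file, the lines could be read with `open(path, encoding="utf-8")` and passed to `parse_word_pos`; the encoding is also an assumption.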

This folder also contains some useful Perl scripts.
