
NazimHAli/klpt

 
 


Kurdish Language Processing Toolkit


Welcome / Hûn bi xêr hatin / بە خێر بێن! 🙂

The Kurdish Language Processing Toolkit (KLPT) is a natural language processing (NLP) toolkit in Python for the Kurdish language. The current version comes with four core modules (preprocess, stem, transliterate and tokenize) and addresses basic language processing tasks such as text preprocessing, stemming, tokenization, transliteration, spell-checking and morphological analysis for the Sorani and Kurmanji dialects of Kurdish.

🧑‍💻 Install

pip install klpt

🚀 Usage

Available modules:

  1. Preprocess
  2. Tokenize
  3. Transliteration
  4. Stem

🛠️ Preprocess

Normalizes scripts and orthographies by applying the writing conventions of the selected dialect and script.

from klpt.preprocess import Preprocess

preprocessor_ckb = Preprocess("Sorani", "Arabic", numeral="Latin")
preprocessor_ckb.normalize("لە ســـاڵەکانی ١٩٥٠دا")
# 'لە ساڵەکانی 1950دا'
preprocessor_ckb.standardize("راستە لەو ووڵاتەدا")
# 'ڕاستە لەو وڵاتەدا'
preprocessor_ckb.unify_numerals("٢٠٢٠")
# '2020'
preprocessor_ckb.preprocess("راستە لە ووڵاتەی ٢٣هەمدا")
# 'ڕاستە لە وڵاتەی 23هەمدا'
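
To give a feel for the kind of mapping that numeral unification involves, here is a minimal standalone sketch that converts Arabic-Indic digits to Latin digits with `str.translate`. It is not KLPT's implementation: the function name `to_latin_digits` is hypothetical, and the real `unify_numerals` method may cover more cases.

```python
# Hypothetical illustration only; KLPT's own unify_numerals may behave differently.
LATIN_DIGITS = "0123456789"
# Arabic-Indic digits occupy U+0660..U+0669; the extended (Persian-style) set is U+06F0..U+06F9.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
EXTENDED_ARABIC_INDIC_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))

# One translation table that maps both digit sets onto Latin digits.
_DIGIT_TABLE = str.maketrans(
    ARABIC_INDIC_DIGITS + EXTENDED_ARABIC_INDIC_DIGITS,
    LATIN_DIGITS * 2,
)


def to_latin_digits(text: str) -> str:
    """Replace Arabic-Indic digits with Latin digits, leaving other characters untouched."""
    return text.translate(_DIGIT_TABLE)


print(to_latin_digits("\u0662\u0660\u0662\u0660"))  # prints 2020
```

In practice the `Preprocess` class shown above is the intended entry point; the sketch only shows why the operation is a character-level substitution.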

✂️ Tokenize

Tokenizes text in the Kurmanji and Sorani dialects of Kurdish.

from klpt.tokenize import Tokenize

tokenizer = Tokenize("Kurmanji", "Latin")
tokenizer.word_tokenize("ji bo fortê xwe avêtin")
# ['▁ji▁', 'bo', '▁▁fortê‒xwe‒avêtin▁▁']
tokenizer.mwe_tokenize("bi serokê hukûmeta herêma Kurdistanê Prof. Salih re saz kir.")
# 'bi serokê hukûmeta herêma Kurdistanê Prof . Salih re saz kir .'
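
The `word_tokenize` output above wraps tokens in marker characters. If plain strings are needed downstream, a small post-processing step can remove them. The sketch below assumes the markers are exactly the two characters visible in the example output (`▁` around tokens and `‒` between the parts of a multiword expression); `clean_tokens` is a hypothetical helper, not part of KLPT.

```python
# Marker characters copied from the example output above (an assumption, not a documented contract).
WORD_MARK = "▁"  # wraps tokens; doubled around multiword expressions
MWE_JOIN = "‒"   # joins the words inside a multiword expression


def clean_tokens(tokens):
    """Strip boundary markers and put spaces back inside multiword expressions."""
    return [token.strip(WORD_MARK).replace(MWE_JOIN, " ") for token in tokens]


print(clean_tokens(['▁ji▁', 'bo', '▁▁fortê‒xwe‒avêtin▁▁']))
# prints ['ji', 'bo', 'fortê xwe avêtin']
```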

🔄 Transliteration

Transliterates from one script of Kurdish into another.

Note

Only the Latin-based and the Arabic-based scripts of Sorani and Kurmanji are supported.

from klpt.transliterate import Transliterate
transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
transliterate.transliterate("rojhilata navîn")
# 'رۆژهلاتا ناڤین'

🌱 Stem

Handles the following tasks:

  1. spelling (error detection & correction)
  2. morphological analysis
  3. stemming
  4. lemmatization

Note

It is recommended that this module be used on tokens from the tokenize module.
Only Sorani is supported in this module.
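
The recommendation to feed tokens from the tokenize module into this module can be sketched as a small pipeline. The function below relies only on the two methods shown in this README (`word_tokenize` and `stem`) and takes the objects as parameters, so it makes no claim about their constructors. The name `stem_sentence` is hypothetical, and stripping the `▁` marker assumes the tokenizer output format shown in the Tokenize section.

```python
def stem_sentence(tokenizer, stemmer, sentence):
    """Tokenize a sentence, then stem each token.

    tokenizer: any object with word_tokenize(str) -> list of str
    stemmer:   any object with stem(str) -> list of str
    Returns a list of (token, stems) pairs in sentence order.
    """
    pairs = []
    for raw_token in tokenizer.word_tokenize(sentence):
        token = raw_token.strip("▁")  # assumed boundary marker, see lead-in
        if token:
            pairs.append((token, stemmer.stem(token)))
    return pairs
```

With KLPT installed, this would be called with objects such as `Tokenize("Sorani", "Arabic")` and `Stem("Sorani", "Arabic")`. Multiword expressions are passed to the stemmer unchanged here, which may or may not be what a given application wants.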

📝 Spelling

from klpt.stem import Stem

stemmer = Stem("Sorani", "Arabic")
stemmer.check_spelling("سوتاندبووت")
# False
stemmer.correct_spelling("سوتاندبووت")
# (False, ['ستاندبووت', 'سووتاندبووت', 'سووڕاندبووت', 'ڕووتاندبووت', 'فەوتاندبووت', 'بووژاندبووت'])
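
Judging by the example, `correct_spelling` returns a pair of a correctness flag and a list of suggestions. A minimal sketch of consuming that pair follows. `pick_correction` is a hypothetical helper, and taking the first suggestion assumes the list is ordered by preference, which the example does not confirm.

```python
def pick_correction(word, result):
    """Return the word itself if it is correct, otherwise the first suggestion.

    result: a (is_correct, suggestions) pair, the shape shown in the example above.
    """
    is_correct, suggestions = result
    if is_correct or not suggestions:
        return word
    return suggestions[0]


# ASCII stand-ins for illustration; real input would come from stemmer.correct_spelling(word).
print(pick_correction("helo", (False, ["hello", "help"])))  # prints hello
```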

🔍 Analyze

from klpt.stem import Stem

stemmer = Stem("Sorani", "Arabic")
stemmer.analyze("دیتبامن")
# [{'pos': ['verb'], 'description': 'past_stem_transitive_active', 'stem': 'دی', 'lemma': ['دیتن'], 'base': 'دیت', 'prefixes': '', 'suffixes': 'بامن'}]
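
Since `analyze` returns a list of dictionaries, one per candidate analysis, a caller often wants to pull a single field out of all of them. The sketch below collects the distinct lemmas using the `lemma` key visible in the example output; `collect_lemmas` is a hypothetical helper, and other keys may vary between analyses.

```python
def collect_lemmas(analyses):
    """Return the distinct lemmas across all analyses, in first-seen order."""
    lemmas = []
    for analysis in analyses:
        for lemma in analysis.get("lemma", []):
            if lemma not in lemmas:
                lemmas.append(lemma)
    return lemmas


# Stand-in data with the same shape as the example output above.
sample = [
    {"pos": ["verb"], "stem": "a", "lemma": ["x"]},
    {"pos": ["noun"], "stem": "b", "lemma": ["x", "y"]},
]
print(collect_lemmas(sample))  # prints ['x', 'y']
```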

🌿 Stem

from klpt.stem import Stem

stemmer = Stem("Sorani", "Arabic")
stemmer.stem("دەچینەوە")
# ['چ']
stemmer.stem("گورەکە", mark_unknown=True) # گوڵەکە in Hewlêrî dialect
# ['_گور_']
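
With `mark_unknown=True`, the example shows an unrecognized stem wrapped in underscores. A minimal sketch of separating recognized from unrecognized stems follows, assuming that underscore convention holds for every unknown stem; `split_known_unknown` is a hypothetical helper.

```python
def split_known_unknown(stems):
    """Split stems into (known, unknown), removing the underscore wrapper from unknown ones."""
    known, unknown = [], []
    for stem in stems:
        if len(stem) > 2 and stem.startswith("_") and stem.endswith("_"):
            unknown.append(stem[1:-1])
        else:
            known.append(stem)
    return known, unknown


print(split_known_unknown(["_gur_", "c"]))  # prints (['c'], ['gur'])
```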

📚 Lemmatize

from klpt.stem import Stem

stemmer = Stem("Sorani", "Arabic")
stemmer.lemmatize("گوڵەکانم")
# ['گوڵ', 'گوڵە']

📜 Documentation

View documentation.

Become a sponsor

Please consider donating to the project. Data annotation and resource creation require a tremendous amount of time and linguistic expertise, and even a small donation makes a difference. You can become a sponsor to accompany me on this journey and help the Kurdish language gain a better place among other natural languages on the Web. Depending on your support:

  • You can be an official sponsor
  • You will get a GitHub sponsor badge on your profile
  • If you have any questions, I will prioritize them
  • If you want, I will add your name or company logo on the front page of your preferred project
  • Your contribution will be acknowledged in one of my future papers in a field of your choice

Thanks to those who have already sponsored this project. More significant achievements will be made thanks to you!

Contribute

Are you interested in this project? Each task is addressed individually. Please check the following repositories to find the one you are most interested in:

In addition, our main objective is to extend the current toolkit to include more tasks, particularly part-of-speech tagging, named-entity recognition and syntactic analysis. Further instructions are provided at https://sinaahmadi.github.io/klpt/about/contributing/. You can also join us on Gitter.

Don't forget, open-source is fun! 😊

Cite this project

If you use any part of the data or the tool, please consider citing this paper:

@inproceedings{ahmadi2020klpt,
    title = "{KLPT--Kurdish Language Processing Toolkit}",
    author = "Ahmadi, Sina",
    booktitle = "Proceedings of the second Workshop for {NLP} Open Source Software ({NLP}-{OSS})",
    month = nov,
    year = "2020",
    publisher = "Association for Computational Linguistics"
}
