Kurdish Language Processing Toolkit

Welcome / Hûn bi xêr hatin / بە خێر بێن! 🙂

The Kurdish Language Processing Toolkit (KLPT) is a natural language processing (NLP) toolkit in Python for the Kurdish language. The current version comes with four core modules: preprocess, stem, transliterate and tokenize. Together they address basic language processing tasks such as text preprocessing, transliteration, stemming, tokenization, spell-checking and morphological analysis for the Sorani and Kurmanji dialects of Kurdish.

🧑‍💻 Install

pip install klpt

🚀 Usage

Available modules:

Preprocess
Tokenize
Transliteration
Stem

🛠️ Preprocess

Normalizes scripts and orthographies using the writing conventions of each dialect and script.

>>> from klpt.preprocess import Preprocess

>>> preprocessor_ckb = Preprocess("Sorani", "Arabic", numeral="Latin")
>>> preprocessor_ckb.normalize("لە ســـاڵەکانی ١٩٥٠دا")
'لە ساڵەکانی 1950دا'
>>> preprocessor_ckb.standardize("راستە لەو ووڵاتەدا")
'ڕاستە لەو وڵاتەدا'
>>> preprocessor_ckb.unify_numerals("٢٠٢٠")
'2020'
>>> preprocessor_ckb.preprocess("راستە لە ووڵاتەی ٢٣هەمدا")
'ڕاستە لە وڵاتەی 23هەمدا'
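
As a rough illustration of what two of these steps do, the sketch below uses only the standard library to drop the tatweel (kashida) elongation character and to map Arabic-Indic digits to Latin ones, which matches the kind of change visible in the normalize and unify_numerals outputs above. It is not KLPT's implementation, which applies many more dialect- and script-specific rules; the names strip_tatweel and to_latin_digits are hypothetical.

```python
# Illustrative sketch only: hypothetical helpers, not part of the klpt API.

TATWEEL = "\u0640"  # Arabic elongation character (kashida)

# Translation table covering both Arabic-Indic (U+0660-U+0669) and
# Extended Arabic-Indic (U+06F0-U+06F9) digits.
DIGIT_TABLE = {}
for value in range(10):
    DIGIT_TABLE[0x0660 + value] = str(value)
    DIGIT_TABLE[0x06F0 + value] = str(value)


def strip_tatweel(text: str) -> str:
    """Remove elongation characters, which are purely typographic."""
    return text.replace(TATWEEL, "")


def to_latin_digits(text: str) -> str:
    """Replace Arabic-Indic digits with their ASCII equivalents."""
    return text.translate(DIGIT_TABLE)


# Apply both steps to a numeral string like the one used above.
print(to_latin_digits(strip_tatweel("٢٠٢٠")))
```

For real text, prefer Preprocess itself; the sketch only shows why a string can change length and digit set after preprocessing.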

✂️ Tokenize

Tokenizes text in the Kurmanji and Sorani dialects of Kurdish.

>>> from klpt.tokenize import Tokenize

>>> tokenizer = Tokenize("Kurmanji", "Latin")
>>> tokenizer.word_tokenize("ji bo fortê xwe avêtin")
['▁ji▁', 'bo', '▁▁fortê‒xwe‒avêtin▁▁']
>>> tokenizer.mwe_tokenize("bi serokê hukûmeta herêma Kurdistanê Prof. Salih re saz kir.")
'bi serokê hukûmeta herêma Kurdistanê Prof . Salih re saz kir .'
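
The word_tokenize output above wraps some tokens in "▁" markers and joins the parts of a multi-word expression with "‒". If downstream code needs the plain words, the markers can be stripped with a small helper. This is a minimal sketch, assuming the two characters are U+2581 and U+2012 as they appear in the output above; clean_token and is_multiword are hypothetical names, not KLPT functions.

```python
from typing import List

# Assumed marker characters, taken from the example output above.
TOKEN_MARKER = "\u2581"  # "▁" around marked tokens
MWE_JOINER = "\u2012"    # "‒" between the parts of a multi-word expression


def is_multiword(token: str, joiner: str = MWE_JOINER) -> bool:
    """True if the token holds several words joined by the joiner."""
    return joiner in token


def clean_token(token: str, marker: str = TOKEN_MARKER,
                joiner: str = MWE_JOINER) -> List[str]:
    """Return the plain words contained in one marked token."""
    bare = token.strip(marker)  # drop leading and trailing markers
    return [part for part in bare.split(joiner) if part]


tokens = ["\u2581ji\u2581", "bo",
          "\u2581\u2581fort\u00ea\u2012xwe\u2012av\u00eatin\u2581\u2581"]
for token in tokens:
    print(clean_token(token), is_multiword(token))
```

If your tokenizer output uses different marker characters, pass them explicitly through the marker and joiner parameters.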

🔄 Transliteration

Transliterates from one script of Kurdish into another.

Note

Only the Latin-based and the Arabic-based scripts of Sorani and Kurmanji are supported.

>>> from klpt.transliterate import Transliterate
>>> transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
>>> transliterate.transliterate("rojhilata navîn")
'رۆژهلاتا ناڤین'
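
Because the source script has to be named when the transliterator is created, it can help to check which script a string is written in first. The following is a minimal standard-library sketch, not part of KLPT: guess_script is a hypothetical helper that counts letters by their Unicode character names and returns "Latin", "Arabic" or "Unknown".

```python
import unicodedata


def guess_script(text: str) -> str:
    """Guess whether text is mostly Latin-based or Arabic-based."""
    counts = {"Latin": 0, "Arabic": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, joiners
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["Latin"] += 1
        elif name.startswith("ARABIC"):
            counts["Arabic"] += 1
    if counts["Latin"] == counts["Arabic"]:
        return "Unknown"  # no letters found, or an even mix
    return max(counts, key=counts.get)


print(guess_script("rojhilata nav\u00een"))
```

The returned label uses the same spelling as the script argument in the example above, so it could be passed on to Transliterate after an explicit check for "Unknown".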

🌱 Stem

Handles the following tasks:

spelling (error detection & correction)
morphological analysis
stemming
lemmatization

Note

It is recommended that this module be used on tokens from the tokenize module.
Only Sorani is supported in this module.

📝 Spelling

>>> from klpt.stem import Stem

>>> stemmer = Stem("Sorani", "Arabic")
>>> stemmer.check_spelling("سوتاندبووت")
False
>>> stemmer.correct_spelling("سوتاندبووت")
(False, ['ستاندبووت', 'سووتاندبووت', 'سووڕاندبووت', 'ڕووتاندبووت', 'فەوتاندبووت', 'بووژاندبووت'])
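
The correct_spelling result above is a pair: a flag and a list of suggestions. A small helper can turn that pair into a single replacement. This is a sketch under the assumption that the result always has the (flag, suggestions) shape shown above; pick_correction is a hypothetical name, and the sketch does not assume that the suggestions are ranked.

```python
from typing import List, Optional, Tuple


def pick_correction(word: str, result: Tuple[bool, List[str]],
                    index: int = 0) -> Optional[str]:
    """Choose a replacement from a (is_correct, suggestions) pair.

    Returns the word itself if it was judged correct, the suggestion at
    `index` if one exists, and None otherwise.
    """
    is_correct, suggestions = result
    if is_correct:
        return word
    if 0 <= index < len(suggestions):
        return suggestions[index]
    return None


# Result copied from the example above.
result = (False, ['ستاندبووت', 'سووتاندبووت', 'سووڕاندبووت',
                  'ڕووتاندبووت', 'فەوتاندبووت', 'بووژاندبووت'])
print(pick_correction("سوتاندبووت", result))
print(pick_correction("سوتاندبووت", result, index=1))
```

Since the first suggestion is not necessarily the intended word, an interactive tool would usually show the whole list to the user instead of picking automatically.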

🔍 Analyze

>>> from klpt.stem import Stem

>>> stemmer = Stem("Sorani", "Arabic")
>>> stemmer.analyze("دیتبامن")
[{'pos': ['verb'], 'description': 'past_stem_transitive_active', 'stem': 'دی', 'lemma': ['دیتن'], 'base': 'دیت', 'prefixes': '', 'suffixes': 'بامن'}]
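
Each analysis is a dictionary, and a word may have more than one. The sketch below shows one way to work with such a list: filtering by part of speech and collecting the distinct lemmas. It assumes only the keys visible in the output above ('pos' and 'lemma'); filter_by_pos and collect_lemmas are hypothetical helpers, not KLPT functions.

```python
from typing import Dict, List


def filter_by_pos(analyses: List[Dict], pos: str) -> List[Dict]:
    """Keep only the analyses whose 'pos' list contains `pos`."""
    return [a for a in analyses if pos in a.get("pos", [])]


def collect_lemmas(analyses: List[Dict]) -> List[str]:
    """Return distinct lemmas across all analyses, in first-seen order."""
    lemmas: List[str] = []
    for analysis in analyses:
        for lemma in analysis.get("lemma", []):
            if lemma not in lemmas:
                lemmas.append(lemma)
    return lemmas


# Analysis copied from the example above.
analyses = [{'pos': ['verb'], 'description': 'past_stem_transitive_active',
             'stem': 'دی', 'lemma': ['دیتن'], 'base': 'دیت',
             'prefixes': '', 'suffixes': 'بامن'}]

verbs = filter_by_pos(analyses, "verb")
print(len(verbs), collect_lemmas(verbs))
```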

🌿 Stem

>>> from klpt.stem import Stem

>>> stemmer = Stem("Sorani", "Arabic")
>>> stemmer.stem("دەچینەوە")
['چ']
>>> stemmer.stem("گورەکە", mark_unknown=True) # گوڵەکە in Hewlêrî dialect
['_گور_']
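
With mark_unknown=True, the example above returns the unrecognized stem wrapped in underscores. A caller can use that to separate known from unknown stems, for instance to log out-of-vocabulary words. This is a minimal sketch, assuming the marking is exactly one leading and one trailing underscore as in the output above; split_known_unknown is a hypothetical helper.

```python
from typing import List, Tuple


def split_known_unknown(stems: List[str]) -> Tuple[List[str], List[str]]:
    """Separate stems into (known, unknown), unwrapping the '_' marks."""
    known: List[str] = []
    unknown: List[str] = []
    for stem in stems:
        # An unknown stem is assumed to look like "_stem_".
        if len(stem) > 2 and stem.startswith("_") and stem.endswith("_"):
            unknown.append(stem[1:-1])
        else:
            known.append(stem)
    return known, unknown


# Outputs copied from the two stem() calls above.
print(split_known_unknown(['چ']))
print(split_known_unknown(['_گور_']))
```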

📚 Lemmatize

>>> from klpt.stem import Stem

>>> stemmer = Stem("Sorani", "Arabic")
>>> stemmer.lemmatize("گوڵەکانم")
['گوڵ', 'گوڵە']
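
Since this module is meant to be applied to individual tokens, a typical pattern is to run the lemmatizer over a token list and keep the candidates per token. The sketch below shows that pattern with the lemmatizer passed in as a callable, so it runs without KLPT installed; lemmatize_tokens and fake_lemmatize are hypothetical, and the stand-in only knows the one word from the example above.

```python
from typing import Callable, Dict, Iterable, List


def lemmatize_tokens(tokens: Iterable[str],
                     lemmatize: Callable[[str], List[str]]
                     ) -> Dict[str, List[str]]:
    """Map each distinct token to its candidate lemmas.

    Each distinct token is lemmatized once, in first-seen order.
    """
    results: Dict[str, List[str]] = {}
    for token in tokens:
        if token not in results:
            results[token] = lemmatize(token)
    return results


# Stand-in for a real lemmatizer, using the output shown above.
KNOWN_LEMMAS = {"گوڵەکانم": ["گوڵ", "گوڵە"]}


def fake_lemmatize(token: str) -> List[str]:
    return KNOWN_LEMMAS.get(token, [])


print(lemmatize_tokens(["گوڵەکانم", "گوڵەکانم"], fake_lemmatize))
```

With KLPT installed, the stemmer.lemmatize method from the example above could be passed in place of fake_lemmatize.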

📜 Documentation

View documentation.

Become a sponsor

Please consider donating to the project. Data annotation and resource creation require a tremendous amount of time and linguistic expertise, and even a small donation makes a difference. By becoming a sponsor, you can accompany me on this journey and help the Kurdish language gain a better place among other natural languages on the Web. Depending on your level of support:

You can be an official sponsor
You will get a GitHub sponsor badge on your profile
If you have any questions, I will prioritize them
If you want, I will add your name or company logo to the front page of your preferred project
Your contribution will be acknowledged in one of my future papers in a field of your choice

Thank you to those who have already sponsored this project. More significant achievements will be made thanks to you!

Contribute

Are you interested in this project? Each task is addressed individually. Please check the following repositories to find the one you are most interested in:

In addition, our main objective is to extend the current toolkit to include more tasks, particularly part-of-speech tagging, named-entity recognition and syntactic analysis. Further instructions are provided at https://sinaahmadi.github.io/klpt/about/contributing/. You can also join us on Gitter.

Don't forget, open-source is fun! 😊

Cite this project

If you use any part of the data or the tool, please consider citing this paper (bib file):

@inproceedings{ahmadi2020klpt,
    title = "{KLPT--Kurdish Language Processing Toolkit}",
    author = "Ahmadi, Sina",
    booktitle = "Proceedings of the second Workshop for {NLP} Open Source Software ({NLP}-{OSS})",
    month = nov,
    year = "2020",
    publisher = "Association for Computational Linguistics"
}
