Resha is a fast and "less aggressive" stemmer for Turkish written in Java. It uses a stem dictionary generated by Nuve using a statistical language model based on morpheme n-grams. It therefore returns the most probable stem for a word without considering the neighboring words.
####Main Features
- Less aggressive and more accurate than the other stemmers available for Turkish, such as the one in SnowBall
- Contains more than 1.1 million word-stem pairs
- Based on HashMap, very fast but uses approximately 300 MB of memory.
- The stemmer class is a singleton, thread-safe, and lazily initialized
####Usage
//it is implemented as an enum to guarantee
//a singleton, thread safe and lazy initialized object
Stemmer stemmer = Resha.Instance;
String stem = stemmer.stem("kitapçıdaki");
System.out.println(stem); //kitapçı
//If a word contains an apostrophe,
//the part before the first apostrophe is returned as the stem.
stem = stemmer.stem("İstanbul'da");
System.out.println(stem); //İstanbul
stem = stemmer.stem("aaa'aaa'aa");
System.out.println(stem); //aaa
//If a word is not in the dictionary it remains unstemmed.
stem = stemmer.stem("xxx");
System.out.println(stem); //xxx
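The behavior shown above (singleton access, the apostrophe rule, and the fallback for unknown words) can be sketched with a plain HashMap. This is only an illustration, not Resha's actual source: the names `SketchStemmer` and `SketchStemmerDemo` and the two hard-coded entries are assumptions made for the example, standing in for the generated dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the real stemmer; names and entries are illustrative only.
enum SketchStemmer {
    // The JVM creates enum constants once, thread-safely, when the enum is first used.
    INSTANCE;

    private final Map<String, String> dictionary = new HashMap<>();

    SketchStemmer() {
        // Assumption: two hard-coded pairs stand in for the generated word-stem dictionary.
        dictionary.put("kitapçıdaki", "kitapçı");
        dictionary.put("kitapçığında", "kitapçık");
    }

    String stem(String word) {
        // Rule 1: everything before the first apostrophe is the stem.
        int apostrophe = word.indexOf('\'');
        if (apostrophe >= 0) {
            return word.substring(0, apostrophe);
        }
        // Rule 2: dictionary lookup; unknown words are returned unchanged.
        return dictionary.getOrDefault(word, word);
    }
}

class SketchStemmerDemo {
    public static void main(String[] args) {
        SketchStemmer stemmer = SketchStemmer.INSTANCE;
        System.out.println(stemmer.stem("kitapçıdaki")); // kitapçı
        System.out.println(stemmer.stem("İstanbul'da")); // İstanbul
        System.out.println(stemmer.stem("aaa'aaa'aa"));  // aaa
        System.out.println(stemmer.stem("xxx"));         // xxx
    }
}
```

Because every call is a single hash lookup, the cost per word stays constant regardless of how many suffixes the word carries; the price is the memory needed to hold the whole dictionary.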
####Maven
Add the following repository to your pom.xml file:
<repositories>
<repository>
<id>hrzafer-repo</id>
<url>https://github.com/hrzafer/mvn-repo/raw/master/releases</url>
</repository>
</repositories>
Then add the dependency:
<dependencies>
<dependency>
<groupId>com.hrzafer</groupId>
<artifactId>resha-turkish-stemmer</artifactId>
<version>1.2.1</version>
</dependency>
</dependencies>
####Jar Distribution
Download the latest jar from the link below and add it to your project: https://github.com/hrzafer/mvn-repo/tree/master/releases/com/hrzafer/resha-turkish-stemmer
####Stemming in Turkish
This part presents a brief introduction to the stemming problem in Turkish and the methodology used to solve it.
In Turkish, words are composed of three consecutive parts:
root + derivational suffix(es) + inflectional suffix(es)
Prefixes don't exist and no derivational suffix comes after an inflectional suffix. Example:
kitapçığında => kitap + çığ [CUK] + ın [(U)n] + da [DA]
word => root + d. sfx + i. sfx + i.sfx
A stemmer is expected to analyze the word and strip off its inflectional suffixes. So the expected stem for kitapçığında is kitapçık. However, such a morphological analysis is not trivial for Turkish. Let us look at the only derivational suffix shown above to understand the difficulties involved.
The derivational suffix +CUK is somewhat similar to the -let suffix in English. The word kitapçık means booklet. The +CUK suffix takes different forms according to the morphemes coming before and after it. In Turkish all the following forms are possible for the +CUK suffix:
cık, cik, cuk, cük, çık, çik, çuk, çük, cığ, ciğ, cuğ, cüğ, çığ, çiğ, çuğ, çüğ
In the word kitapçık the suffix is in the çık form. If the root was kalem (pencil) then the word would be kalemcik and the suffix would be in the cik form. When a new suffix, +(U)n, is added, the +CUK suffix changes its form from çık to çığ.
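The sixteen forms come from three independent choices: c or ç, one of four vowels, and k or ğ. The sketch below picks the surface form for a given root using simplified textbook rules (consonant voicing assimilation, four-way vowel harmony, and softening of k to ğ before a vowel). It is an illustration only, not how Nuve performs its analysis; the class and method names are hypothetical, roots are assumed to be lowercase, and exceptions such as loanwords with irregular harmony are ignored.

```java
class CukSuffixSketch {
    // Voiceless consonants of Turkish; after one of these, C surfaces as 'ç'.
    private static final String VOICELESS = "fstkçşhp";
    private static final String VOWELS = "aeıioöuü";

    // Returns the surface form of +CUK for a lowercase root.
    // vowelFollows = true means the next suffix starts with a vowel, so k softens to ğ.
    static String cukForm(String root, boolean vowelFollows) {
        char last = root.charAt(root.length() - 1);
        char c = VOICELESS.indexOf(last) >= 0 ? 'ç' : 'c';
        char u = harmonize(lastVowel(root));
        char k = vowelFollows ? 'ğ' : 'k';
        return "" + c + u + k;
    }

    // Finds the last vowel of the root, which drives vowel harmony.
    static char lastVowel(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (VOWELS.indexOf(s.charAt(i)) >= 0) {
                return s.charAt(i);
            }
        }
        throw new IllegalArgumentException("no vowel in: " + s);
    }

    // Four-way vowel harmony: maps the root's last vowel to the suffix vowel.
    static char harmonize(char vowel) {
        switch (vowel) {
            case 'a': case 'ı': return 'ı';
            case 'e': case 'i': return 'i';
            case 'o': case 'u': return 'u';
            default: return 'ü'; // 'ö' or 'ü'
        }
    }

    public static void main(String[] args) {
        System.out.println("kitap" + cukForm("kitap", false));         // kitapçık
        System.out.println("kalem" + cukForm("kalem", false));         // kalemcik
        System.out.println("kitap" + cukForm("kitap", true) + "ında"); // kitapçığında
    }
}
```

Generating a form is the easy direction; a stemmer has to run these rules in reverse, where several analyses can compete for the same surface string.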
After stripping off the inflectional suffixes from the word kitapçığındaki, the stem becomes kitapçığ. However, it should be kitapçık. Thus a stemmer/analyzer for Turkish has to handle many such character conversions.
Nuve is an NLP library that can perform the complex morphological analysis (and more) required for tasks such as stemming. However, this analysis can be expensive for applications that need to stem millions of words.
Resha is a Turkish stemmer based on a dictionary of words that have already been stemmed by Nuve. The dictionary includes more than 1.1 million word-stem pairs.
It is highly probable that the 1.1 million word-stem pair dictionary lacks stems for some words, and you may find that some stems are incorrect. In either case you can add your own word-stem pairs by editing the manual.dict file. The word-stem (key-value) pairs will be added to the dictionary, and if a word (key) already exists, its stem (value) will be overwritten.
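The override behavior described above matches ordinary `Map.putAll` semantics. A minimal sketch, assuming both sets of pairs are already loaded into maps: the on-disk format of manual.dict is not covered here, so no parsing is shown, and the class name, method name, and sample entries are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ManualOverrideSketch {
    // Returns a new map: generated pairs first, then manual pairs, which win on key clashes.
    static Map<String, String> merge(Map<String, String> generated,
                                     Map<String, String> manual) {
        Map<String, String> merged = new HashMap<>(generated);
        merged.putAll(manual);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> generated = new HashMap<>();
        generated.put("kitapçığında", "kitapçığ"); // deliberately wrong stem, for illustration

        Map<String, String> manual = new HashMap<>();
        manual.put("kitapçığında", "kitapçık");    // correction replaces the entry above
        manual.put("kalemcikte", "kalemcik");      // a new pair is simply added

        Map<String, String> merged = merge(generated, manual);
        System.out.println(merged.get("kitapçığında")); // kitapçık
        System.out.println(merged.get("kalemcikte"));   // kalemcik
    }
}
```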