layout | title | udver |
---|---|---|
base |
Tokenization |
2 |
Written Japanese sentences are not split into words or morphemes by the use of spaces or any other technique. We have several word unit standards that can be found in corpus annotation schema or the outputs of morphological analysers. Currently, we use NINJAL UniDic Short Unit Word as the UD tokenization.
This word unit standard (morphological informatino-annotated lexicon) was derived by the morphological analyser ChaSen. The morphological analyser MeCab, developed in 2001-2004, is independently developped from the lexicon; however, the default lexicon is IPADIC. NAIST-jdic is the successor of IPADIC. NAIST-jdic resolves the license issues in IPADIC. NAIST-jdic inherits the word unit definitions and PoS tagset of IPADIC.
NINJAL proposed several word unit standards for Japanese corpus linguistics such as minimum word unit, \alpha word unit, \beta word unit, M word unit and so on (小椋ほか 2010a) (小椋ほか 2010b).
Since 2002, they maintain a morporlogical information annotated lexicon UniDic, and propose three sorts of word unit standard:
-
Short Unit Word (SUW): the most fine-grained morpheme leel for corpus statistics, derived by MeCab with UniDic.
-
Middle Unit Word (MUW): based on the right-branching compound word construction and on phonological constructions such as accent phrase and/or sequential voicing (連濁), using Comainu, which is a wrapper program for MST parsing.
-
Long Unit Word (LUW): a standard composed of 'bunsetsu(文節)' units. LUW is nearly the same in content and functional words bounded by bunsetsu boundaries. The unit is derived by SVM-based chunkers such as YamCha or linear chain CRF chunkers such as CRF++.
-
UniDic : https://unidic.ninjal.ac.jp/
This word unit standard was derived by the morphological analyser JUMAN. The unit is longer than SUW in UniDic. See also 'Morphological Analyser, JUMAN'. The unit includes several compound words as single word units. See the manual.
-
MeCab: lexicon-based morphological analyser using CRFs.
-
JUMAN: lexicon-based morphological analyser using a hand-maintained cost table.
-
Comainu: compound word constructions derived by chunking (CRF or history-based SVM) and dependency parsing.
-
CRF chunker (used in Comainu): CRF++
-
History based SVM chunker (used in Comainu): YamCha
-
MSTParser (used in Comainu)