DuDuPlus @ GitHub. Created by [chenwenliang at gmail]. This is an active copy of DuDuPlus. The old site of DuDuPlus is at http://code.google.com/p/duduplus/ and the release packages can also be obtained from http://sourceforge.net/projects/duduplus/.

=DuDuPlus=

DuDuPlus is a dependency parser for English and Chinese.

==Features==

* Open-source implementation of a graph-based parser.
* Supports training a new supervised parser that performs better than MSTParser.
* Supports subtree-based and cluster-based features (Chen et al., EMNLP 2009).
* Achieves state-of-the-art performance for English and Chinese.
* Supports k-best output with scores.
* Supports part-of-speech tagging.

==Performance Report==

* PTB standard data split, semi-supervised, UAS: 93.3 (with LBFULL) vs. 91.71 (second-order MSTParser)
* PTB standard data split, supervised, UAS: 92.7 (with LBBASIC) vs. 91.71 (second-order MSTParser)
* CTB4 standard data split, UAS: 91.7 (with LBST) vs. 88.18 (second-order MSTParser)
* More details are available in our paper (Chen et al., EMNLP 2009).

==Download==

* src: source code
* model: trained models
* resource: subtrees and word clusters
* All files can be found at: http://sourceforge.net/projects/duduplus/

==Installation==

===Requirements===

* C++ compiler (gcc 4.4 or higher)

===How to make===

* %tar vxzf duduplus.src.tgz
* %tar vxzf duduplus.model.tgz
* %tar vxzf duduplus.resource.tgz
* %cp _Models/* duduplus/_Models/
* %cp _Resources/* duduplus/res/
* %cd duduplus
* %aclocal
* %autoconf
* %automake --add-missing
* %./configure
* %make

Then, DuDuPlus is ready to use.

==Usage (Dependency Parsing)==

===Training and test file formats===

The input file should be in the [http://ifarm.nl/signll/conll/ CoNLL-2006/2007] format:

# ID: Token counter, starting at 1 for each new sentence.
# FORM: Word form or punctuation symbol.
# LEMMA: Lemma or stem of the word form, or an underscore if not available.
# CPOSTAG: Coarse-grained part-of-speech tag; the tagset depends on the language.
# POSTAG: Fine-grained part-of-speech tag; the tagset depends on the language, or identical to the coarse-grained tag if not available.
# FEATS: Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
# HEAD: Head of the current token, which is either a value of ID or zero (0). Note that, depending on the original treebank annotation, there may be multiple tokens with HEAD=0.
# DEPREL: Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that, depending on the original treebank annotation, the dependency relation when HEAD=0 may be meaningful or simply ROOT.

Here is an example: [http://duduplus.googlecode.com/files/conllformat.png http://duduplus.googlecode.com/files/conllformat.png] (a short sketch after the Test subsection below also illustrates this layout).

===Training===

Use the toy data and the DuDuPlus command:

%./DuDuPlus task:Parser train train-file:toydata/en.toy.train model-name:_Models/ToyModel.en feat-file:tmp/feat.dp.en order:1 lang:en runtype:LBBASIC

A model named ToyModel.en will be written to the _Models folder.

===Test===

Use the toy data and the model trained above:

%./DuDuPlus task:Parser model-name:_Models/ToyModel.en test test-file:toydata/en.test.gold output-file:tmp/test.out order:1 lang:en runtype:LBBASIC

% perl binFiles/EvalCoNLL.pl toydata/en.test.gold tmp/test.out

If everything goes well, the UAS will be 81.73.
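Both files passed to the evaluation script above are in the CoNLL layout described earlier. As a rough illustration of what that evaluation measures, here is a minimal Python sketch (not part of DuDuPlus) that compares the HEAD columns of a gold file and a system output file. The column order is assumed to be exactly the eight fields listed above, and binFiles/EvalCoNLL.pl may handle details such as punctuation differently.

{{{
# Minimal sketch (not part of DuDuPlus): read two files in the eight-column
# CoNLL layout described above and report the unlabeled attachment score,
# i.e. the fraction of tokens whose HEAD column matches the gold file.

def read_heads(path):
    """Return one HEAD value per token (7th column, 0-based index 6)."""
    heads = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:            # blank line = sentence boundary
                continue
            cols = line.split("\t")
            heads.append(cols[6])   # ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL
    return heads

def uas(gold_file, system_file):
    gold, system = read_heads(gold_file), read_heads(system_file)
    correct = sum(g == s for g, s in zip(gold, system))
    return 100.0 * correct / len(gold)

print("UAS: %.2f" % uas("toydata/en.test.gold", "tmp/test.out"))
}}}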
==Usage (POS Tagging)==

===Training and test file formats===

For training, the input file should be in a CoNLL-like format:

# ID: Token counter, starting at 1 for each new sentence.
# FORM: Word form or punctuation symbol.
# LEMMA: Lemma or stem of the word form, or an underscore if not available.
# POSTAG: Fine-grained part-of-speech tag; the tagset depends on the language, or identical to the coarse-grained tag if not available.

===Training===

%./DuDuPlus task:Tagger train train-file:toydata/en.toy.train posmodel-name:_Models/ToyModel.POS.en feat-file:tmp/feat.pos.en lang:en

===Test===

* If the input is raw sentences:

%./DuDuPlus task:Tagger test posmodel-name:_Models/ToyModel.POS.en test-file:toydata/en.test.rawline output-file:tmp/test.out format:SENT lang:en

* If the input is in CoNLL format:

%./DuDuPlus task:Tagger test posmodel-name:_Models/ToyModel.POS.en test-file:toydata/en.test.gold output-file:tmp/test.out format:CONLL lang:en

* Evaluation:

%perl binFiles/EvalPOS.pl tmp/test.out toydata/en.toy.train

If everything goes well, the accuracy will be 91.75.

==Usage (Pipe: Tagging and Parsing)==

===Test (currently English only)===

* If the input is raw sentences:

%./DuDuPlus task:TagParser test model-name:_Models/ToyModel.en posmodel-name:_Models/ToyModel.POS.en test-file:toydata/en.test.gold output-file:tmp/test.out format:SENT lang:en order:1 runtype:LBBASIC

* If the input is in CoNLL format:

%./DuDuPlus task:TagParser test model-name:_Models/ToyModel.en posmodel-name:_Models/ToyModel.POS.en test-file:toydata/en.test.gold output-file:tmp/test.out format:CONLL lang:en order:1 runtype:LBBASIC

% perl binFiles/EvalPOSDP.pl toydata/en.test.gold tmp/test.out

If everything goes well, the accuracy for tagging will be 91.75 and the UAS for parsing will be 76.37.
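The Tagger and TagParser commands above also accept raw-sentence input via format:SENT. This README does not spell out the layout of such a file; assuming it is one whitespace-tokenized sentence per line (as the name toydata/en.test.rawline suggests), the following sketch shows one way such a file could be produced from a CoNLL-format file. Both the assumed layout and the helper itself are illustrative only, not part of DuDuPlus.

{{{
# Hypothetical helper (not part of DuDuPlus): turn a CoNLL-format file into
# one whitespace-separated sentence per line, the layout assumed here for
# format:SENT input. The FORM token is assumed to be the second column.

def conll_to_rawlines(conll_path, out_path):
    sentence = []
    with open(conll_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():                      # blank line ends a sentence
                if sentence:
                    fout.write(" ".join(sentence) + "\n")
                    sentence = []
                continue
            sentence.append(line.rstrip("\n").split("\t")[1])  # FORM column
        if sentence:                                  # flush a final unterminated sentence
            fout.write(" ".join(sentence) + "\n")

conll_to_rawlines("toydata/en.test.gold", "tmp/en.test.rawline")
}}}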
==How to Use the Trained Models==

Download the trained models from the project site.

* TagParser (English, follows the PTB definitions):

% ./DuDuPlus task:TagParser test model-name:_Models/DPModel.EN.O2LBFULL posmodel-name:_Models/POSModel.EN test-file:toydata/en.test.gold output-file:tmp/test.out format:CONLL lang:en order:2 runtype:LBFULL

* Parser (Chinese, follows the CTB definitions):

% ./DuDuPlus task:Parser model-name:_Models/DPModel.CN.O2LBST test test-file:toydata/cn.test.conll output-file:tmp/test.out.cn order:2 lang:cn runtype:LBST

Again, if everything goes well, the score for LBFULL is 94.24 (English) and the score for LBST is 91.77 (Chinese).

==Further Help==

%./DuDuPlus help

==Related Publications==

If you would like to acknowledge this tool, please cite one of the following papers:

* Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, and Kentaro Torisawa. Improving Dependency Parsing with Subtrees from Auto-Parsed Data. Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), pp. 570-579, Singapore, August 2-7, 2009.
* Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, and Kentaro Torisawa. Exploiting Subtrees in Auto-Parsed Data to Improve Dependency Parsing. Computational Intelligence Journal, Vol. 28, No. 3, pp. 426-451, August 2012. Available at http://dx.doi.org/10.1111/j.1467-8640.2012.00451.x or http://onlinelibrary.wiley.com/doi/10.1111/coin.2012.28.issue-3/issuetoc (extended version of Chen et al., EMNLP 2009).

==License==

As the original CNP is licensed under the [http://www.eclipse.org/legal/cpl-v10.html Common Public License Version 1.0 (or later)], we also distribute our materials (i.e., modifications and additions to the original) under the same license.

Various trained models, built from different text resources, are available for download, but they may require further licenses. For further help, please send an email to [chenwenliang at gmail].