Last updated: June, 2017
- N.B. Due to license restrictions, we do not provide the original Penn Treebank (PTB) in this repository.
Download the Penn Treebank into the ./data directory.
Convert the PTB into CoNLL format (e.g., with Penn2Malt).
Put the CoNLL-format files at ./data/[train|dev|test].E00 (i.e., error rate = 0%).
Add noise by running errgent. See the readme file in the ./errgent directory.
cd ./errgent
sh ./ (for generating all the files needed)
We assume the files are named ./data/[train|dev|test].[E00|E05|E10|E15|E20]. Each file should look like the following.
1 Ms. B-NP NNP _ _ 2 TITLE _ _
2 Haag I-NP NNP _ _ 3 SBJ _ _
3 plays B-VP VBZ _ _ 0 ROOT _ _
4 Elianti B-NP NNP _ _ 3 OBJ _ _
5 . O . _ _ 3 P _ _

1 The B-NP DT _ _ 4 NMOD _ _
2 luxury I-NP NN _ _ 4 NMOD _ _
3 auto I-NP NN _ _ 4 NMOD _ _
4 maker I-NP NN _ _ 7 SBJ _ _
5 last B-NP JJ _ _ 6 NMOD _ _
6 year I-NP NN _ _ 7 TMP _ _
7 sold B-VP VBD _ _ 0 ROOT _ _
8 1,214 B-NP CD _ _ 9 NMOD _ _
9 cars I-NP NNS _ _ 7 OBJ _ _
10 in B-PP IN _ _ 7 LOC _ _
11 the B-NP DT _ _ 12 NMOD _ _
12 U.S. I-NP NNP _ _ 10 PMOD _ _
...
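To sanity-check a data file before training, the sample above can be read with a few lines of Python. This is only a sketch, not part of the repository: the field names (id, form, chunk, pos, head, deprel) are our own labels for the columns as they appear in the sample, and we assume that tokens contain no whitespace and that sentences are separated by a blank line, as is usual for CoNLL files.

```python
from typing import Dict, Iterable, Iterator, List, Union

Token = Dict[str, Union[int, str]]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of token dicts per sentence.

    Column positions are read off the sample in this readme; the field
    names are illustrative. Columns shown as "_" are ignored.
    """
    sentence: List[Token] = []
    for line in lines:
        cols = line.split()
        if not cols:  # assumption: a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "chunk": cols[2],
            "pos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:  # flush the last sentence if the file lacks a trailing blank line
        yield sentence


SAMPLE = """\
1 Ms. B-NP NNP _ _ 2 TITLE _ _
2 Haag I-NP NNP _ _ 3 SBJ _ _
3 plays B-VP VBZ _ _ 0 ROOT _ _
4 Elianti B-NP NNP _ _ 3 OBJ _ _
5 . O . _ _ 3 P _ _
"""

for sent in read_sentences(SAMPLE.splitlines()):
    root = next(tok for tok in sent if tok["head"] == 0)
    print(len(sent), root["form"])  # prints: 5 plays
```

A line that does not have exactly 10 columns raises an error, which is a quick way to catch a conversion step that went wrong.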
Training a model
(e.g.,) sh E05 (trains a model on the 5% error-injected corpus)
Parsing sentences with the trained model
(e.g.,) sh dev E05 E10 (parses the 10% error-injected dev set with a model trained on the 5% error-injected corpus)
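The labels E00 through E20 encode the error rate in percent, and the data files follow the ./data/[split].[label] convention. The following sketch shows how a label maps to a rate and a file path. The helper names (error_rate, data_path) are hypothetical and are not part of the repository's scripts.

```python
import re
from pathlib import Path

SPLITS = ("train", "dev", "test")
LABELS = ("E00", "E05", "E10", "E15", "E20")


def error_rate(label: str) -> float:
    """Convert a label such as 'E05' into a fraction (0.05)."""
    match = re.fullmatch(r"E(\d{2})", label)
    if match is None:
        raise ValueError(f"not an error-rate label: {label!r}")
    return int(match.group(1)) / 100


def data_path(split: str, label: str, root: str = "./data") -> Path:
    """Build the path of a data file under the naming convention of this readme."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return Path(root) / f"{split}.{label}"


# Example: the dev set parsed in the command above (10% injected errors).
print(data_path("dev", "E10"), error_rate("E10"))

# List any expected data files that are not present yet.
missing = [p for s in SPLITS for l in LABELS if not (p := data_path(s, l)).exists()]
print(f"{len(missing)} of {len(SPLITS) * len(LABELS)} files missing")
```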
Evaluation on parsing performance
cd ./eval
wget -O
unzip
cd ./eval/srleval/trunk/align
make
Modify line 231 in ./eval/srleval/trunk/
(from) for item in alignment.align(ref_words, hyp_words, command=os.path.dirname(__file__) + "/align/align"):
(to) for item in alignment.align(ref_words, hyp_words):
Run the evaluation script:
cd ./eval
(e.g.,) sh dev E05 E10 (evaluates the 10% error-injected dev set with a model trained on the 5% error-injected corpus)
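The line 231 change can be made by hand in an editor. For a scripted setup, the same substitution can be sketched as below. The function name patch_align_call is hypothetical, and because this readme does not give the full path of the file to modify, the sketch works on source text that the caller reads and writes.

```python
OLD_CALL = (
    'alignment.align(ref_words, hyp_words, '
    'command=os.path.dirname(__file__) + "/align/align")'
)
NEW_CALL = "alignment.align(ref_words, hyp_words)"


def patch_align_call(source: str) -> str:
    """Replace the first occurrence of the old align call with the new one.

    Raises ValueError if the old call is absent, so an already-patched or
    unexpected file is not silently left as it was.
    """
    if OLD_CALL not in source:
        raise ValueError("old alignment.align(...) call not found")
    return source.replace(OLD_CALL, NEW_CALL, 1)


# Example on a single line of source text.
before = "for item in " + OLD_CALL + ":\n"
print(patch_align_call(before), end="")
```

Failing loudly when the old call is missing makes it obvious if the downloaded code differs from what this readme expects.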
Evaluation on grammaticality improvement
- Please e-mail Keisuke Sakaguchi (keisuke[at]