scripts for processing WSJ, CTB, and SPMRL
zhaoyanpeng committed Nov 8, 2020
1 parent 965f14e commit 4280407
Showing 5 changed files with 697 additions and 0 deletions.
27 changes: 27 additions & 0 deletions README.md
@@ -0,0 +1,27 @@
# XCFGs

XCFGs aims to unify all extensions of context-free grammars (XCFGs). **X** stands for weighted, (compound) probabilistic, neural, and other extensions.

## Data

The repo handles [WSJ](https://catalog.ldc.upenn.edu/LDC99T42), [CTB](https://catalog.ldc.upenn.edu/LDC2005T01), and [SPMRL](https://dokufarm.phil.hhu.de/spmrl2014/). Have a look at `treebank.py`.

If you are looking for the data used in [C-PCFGs](https://github.com/zhaoyanpeng/cpcfg), follow the instructions in `treebank.py` and put all outputs in the same folder, say `./data.punct`. That script only removes morphology features and creates data splits. To remove punctuation, use `clean_tb.py`; for example, I used `python clean_tb.py ./data.punct ./data.clean`. All the cleaned treebanks will then reside in `./data.clean`. Finally, run `./batchify.sh ./data.clean/` to produce all the data needed to reproduce the results in [C-PCFGs](https://github.com/zhaoyanpeng/cpcfg). Feel free to change the parameters in `batchify.sh` if you want a different batch size or vocabulary size.
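
To make the punctuation-removal step concrete, here is a minimal, self-contained sketch of what stripping punctuation from a PTB-style bracketed tree can look like. It is only an illustration and not the code in `clean_tb.py`: the tag list `PUNCT_TAGS` and the helper names (`tokenize`, `parse`, `strip_punct`, `to_string`, `clean_tree`) are assumptions, and the actual script may use a different tag set or handle more edge cases.

```python
import re

# Assumed set of PTB punctuation tags; clean_tb.py may use a different list.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}


def tokenize(text):
    """Split a bracketed tree string into '(', ')', and symbol tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens, i=0):
    """Parse tokens starting at an opening bracket.

    Returns ((label, children), next_index). A preterminal has the form
    (tag, [word]); an internal node has only nested nodes as children.
    """
    assert tokens[i] == "(", "expected an opening bracket"
    label = tokens[i + 1]
    i += 2
    children = []
    while tokens[i] != ")":
        if tokens[i] == "(":
            child, i = parse(tokens, i)
            children.append(child)
        else:
            children.append(tokens[i])  # a word under a preterminal
            i += 1
    return (label, children), i + 1


def strip_punct(node):
    """Drop punctuation preterminals and any constituent left empty."""
    label, children = node
    if all(isinstance(c, str) for c in children):  # preterminal
        return None if label in PUNCT_TAGS else node
    kept = [k for k in (strip_punct(c) for c in children) if k is not None]
    return (label, kept) if kept else None


def to_string(node):
    """Serialize a node back into a single-line bracketed string."""
    label, children = node
    parts = [c if isinstance(c, str) else to_string(c) for c in children]
    return "(" + label + " " + " ".join(parts) + ")"


def clean_tree(text):
    """Return the tree without punctuation, or None if nothing is left."""
    tree, _ = parse(tokenize(text))
    cleaned = strip_punct(tree)
    return None if cleaned is None else to_string(cleaned)


if __name__ == "__main__":
    print(clean_tree("(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .))"))
```

Removing constituents that become empty (rather than leaving a bare label behind) keeps the output a well-formed tree, which matters if the cleaned trees are later used as gold structures for evaluation.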

## Citing XCFGs

If you use XCFGs in your research or wish to refer to the results in [C-PCFGs](https://github.com/zhaoyanpeng/cpcfg), please use the following BibTeX entry.
```
@article{zhao2020xcfg,
author = {Zhao, Yanpeng},
title = {An Empirical Study of Compound PCFGs},
journal= {https://github.com/zhaoyanpeng/cpcfg},
url = {https://github.com/zhaoyanpeng/cpcfg},
year = {2020}
}
```
## Acknowledgements
`batchify.py` is borrowed from [C-PCFGs](https://github.com/harvardnlp/compound-pcfg).

## License
MIT