Tanja:
produce /home/tania/Dropbox/pmb2tag-frames/pmb-3.0.0-en-gold-{p31,p32}.tsv -b
mv /home/tania/Dropbox/pmb2tag-frames/pmb-3.0.0-en* /home/tania/Dropbox/pmb2tag-frames/data
rm /home/tania/Dropbox/pmb2tag-frames/data/*.toknum
rm /home/tania/Dropbox/pmb2tag-frames/data/*.const
rm /home/tania/Dropbox/pmb2tag-frames/data/*.lemma
rm /home/tania/Dropbox/pmb2tag-frames/data/*.pmbdep
rm /home/tania/Dropbox/pmb2tag-frames/data/*.roles
rm /home/tania/Dropbox/pmb2tag-frames/data/*.sem
rm /home/tania/Dropbox/pmb2tag-frames/data/*.super
rm /home/tania/Dropbox/pmb2tag-frames/data/*.wordnet
rm /home/tania/Dropbox/pmb2tag-frames/data/*.token
pmb2tsv is a collection of scripts to convert data from the Parallel Meaning Bank (PMB) into column-based files including CCG supertags, dependency structure, constituent structure, semantic tags, and semantic roles.
The primary target audience is people wanting to do semantic role labeling (SRL) experiments on the PMB.
Note: `pmb2tsv` is experimental and some of its output may be erroneous.
Please download the PMB 3.0.0 and extract the directory `pmb-3.0.0` into the root directory of this repository.
Scripts to convert the files are mostly found in this repository; however, the following software needs to be present on the system:
- Python 3: the `python3` executable should be on your `$PATH`.
- Produce: the `produce` executable should be on your `$PATH`.
- SWI-Prolog 7 or higher: the `swipl` executable should be on your `$PATH`.
- GNU Parallel: the `parallel` executable should be on your `$PATH`.
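As a quick sanity check before running any conversion, the following sketch looks up the four executables named above on `$PATH`. It is not part of pmb2tsv; the function name `missing_executables` is made up for illustration, and the check does not verify versions (for example, that SWI-Prolog is 7 or higher).

```python
import shutil

# Executable names taken from the requirements list above.
REQUIRED = ["python3", "produce", "swipl", "parallel"]


def missing_executables(names=REQUIRED):
    """Return the names that cannot be found on $PATH (hypothetical helper)."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on $PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```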
Now use the `produce` command to convert the desired portions of the PMB to TSV files. For example, to get all gold sentences from PMB parts 00 and 01, run:
produce pmb-3.0.0-{en,de,it,nl}-gold-{p00,p01}.tsv
This example will generate 8 TSV files, one per language and part. They contain the converted sentences, separated by empty lines, one token per line with the following tab-separated columns:
- Token number within sentence
- Token form
- PMB semantic tag
- Symbol (English lemma)
- Dependency head token number or 0 if root
- CCG supertag
- CCG constituent structure
For every (verbal) frame in the sentence, there is an additional column that marks each token as being the head of the predicate (in which case it contains the string `V`), as being the head of the role filler (in which case it contains a VerbNet role such as `Agent` or `Patient`), or as neither (in which case it is `O`).
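The format described above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the names `read_sentences`, `frames`, and `N_FIXED` are hypothetical, the sample rows are invented (with `_` as a placeholder for columns the example does not use), and it assumes that frame columns directly follow the seven fixed columns.

```python
# Number of fixed columns listed above (token number ... constituent structure).
N_FIXED = 7


def read_sentences(lines):
    """Yield each sentence as a list of rows; a row is a list of column strings."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def frames(sentence):
    """Return one dict per frame column with the predicate and role-filler heads."""
    n_frames = max(len(row) for row in sentence) - N_FIXED
    result = []
    for i in range(max(n_frames, 0)):
        col = N_FIXED + i
        frame = {"predicate": None, "roles": []}
        for row in sentence:
            # Treat a missing or empty cell like "O" (assumption).
            label = row[col] if col < len(row) else "O"
            if label == "V":
                frame["predicate"] = row[1]
            elif label not in ("O", ""):
                frame["roles"].append((label, row[1]))
        result.append(frame)
    return result


# Invented two-token sentence with a single frame column.
SAMPLE = (
    "1\tMary\t_\tmary\t2\t_\t_\tAgent\n"
    "2\tsleeps\t_\tsleep\t0\t_\t_\tV\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        print(frames(sent))
```

To read a generated file instead of the sample, pass an open file object to `read_sentences`, since iterating over it yields lines.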
Warning: for a small number of CCG derivations, especially some that are not fully corrected, dependency and role extraction will fail. The corresponding columns will be empty/missing. In extremely rare cases a dependency non-tree (a cyclic graph) may be extracted.
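Downstream code may want to detect these cases before training or evaluation. The sketch below is one possible check, not part of pmb2tsv: `parse_heads` and `is_acyclic` are hypothetical names, each row is assumed to be a list of column strings for one token, and token numbers are assumed to run from 1 to the sentence length.

```python
HEAD_COLUMN = 4  # zero-based index of the dependency head column


def parse_heads(sentence):
    """Return the head column as a list of ints, or None if it is empty or missing."""
    heads = []
    for row in sentence:
        if len(row) <= HEAD_COLUMN or not row[HEAD_COLUMN].isdigit():
            return None
        heads.append(int(row[HEAD_COLUMN]))
    return heads


def is_acyclic(heads):
    """Check that every token reaches the root (0) by following head links.

    heads[i] is the head of token i + 1. Returns False if a cycle is found
    or if a head points outside the sentence.
    """
    n = len(heads)
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen or not 1 <= node <= n:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True
```

A sentence for which `parse_heads` returns `None`, or for which `is_acyclic` returns `False`, can then be skipped or logged, depending on the experiment.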
For details on the conversion from CCG derivations to dependency trees, see
Kilian Evang (2020): Configurable Dependency Tree Extraction from CCG
Derivations. Proceedings of the Universal Dependencies Workshop.
To reproduce the experiments from that paper, run:
produce pmb-3.0.0-{en,de,it,nl}-gold-{p00,p01}.eval