The MUC 6 corpus can be obtained from the LDC (LDC Catalog No. LDC2003T13).
In our experiments, only the "dryrun training" portion of the MUC 6 corpus was used (more on this below).
Due to licensing restrictions we could not include the data here. Some notes on how we processed the data are given below.
Set up a new Python 2 virtualenv, then pip install:
- nltk 3.0.1 (it does not work with later versions)
- nltk-contrib 3.2.5 (which was the latest version when we obtained it).
nltk_contrib/coref won't run out of the box because of dependency issues and a couple of errors. We had to edit the files api.py and muc.py in the coref directory. To fix api.py, comment out the class HiddenMarkovModelChunkTaggerTransformI.
Copy the relevant MUC files into the nltk data directory. These are of the form (the number of xx's is arbitrary here):
xxxx.ne.xx.sgm
These files are in MUC-6/data/keys/dryrun-trng.NE-combined.key.v1.3.clean. Be sure to copy the files to <nltk data directory>/corpora/muc6.
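The copy step can also be scripted. The following is a minimal sketch and not part of the original tooling: the helper name copy_muc_files and both directory arguments are hypothetical placeholders for local paths, and the glob pattern simply mirrors the xxxx.ne.xx.sgm naming shown above.

```python
import glob
import os
import shutil


def copy_muc_files(key_dir, nltk_data_dir):
    """Copy *.ne.*.sgm files from key_dir into <nltk_data_dir>/corpora/muc6.

    Both arguments are placeholders for local paths. Returns the sorted
    list of copied file names.
    """
    target = os.path.join(nltk_data_dir, "corpora", "muc6")
    if not os.path.isdir(target):
        os.makedirs(target)
    copied = []
    for path in sorted(glob.glob(os.path.join(key_dir, "*.ne.*.sgm"))):
        shutil.copy(path, target)
        copied.append(os.path.basename(path))
    return copied
```

Files in key_dir that do not match the pattern are left alone, so it should be safe to point the helper at the whole keys directory.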
Activate the virtualenv. Then in Python run:
from nltk_contrib.coref import muc
muc.demo()
This will save the corpus in CoNLL format to the file muc6-conll-format.txt
There are a few issues with the data. We fixed the following:
A few cases of incorrect sentence segmentation.
Commas incorrectly tokenized, e.g. [course,] rather than [course] [,].
Colons (:) also incorrectly tokenized; they should be on their own.
Incorrect tokenization of abbreviations, such as [Mr] [.] [Smith], which should instead be [Mr.] [Smith]. We fixed this for the following:
- a.m, p.m
- Co, Corp, Bros, Inc, Ltd, S.A, Pty, G.m.b.H, N.V (N.V. is a Dutch public limited company), S.p.A (Italian)
- CORP
- Counting (lists), e.g. [1] [.] -> [1.], and [1] [)] -> [1)]
- U.S, U.S.A, U.N, U.K, L.A
- U.S.S.R
- Month abbreviations (Jan, Feb, Aug, Sept, Oct, Nov, Dec)
- No [as in number, e.g. "No. 1"]
- Mr, Mrs, Ms, Prof, Dr, Jr, Sr, Rep, Sen, Rev, St, Lt, Gov
- People's initials (usually first or middle name)
- US state abbreviations:
- Calif
- N.J
- N.M
- N.Y
- N.C
- N.H
- R.I
- Ky
- Mass
- [W.Va][.] -> [W.] [Va.]
- Wash
- Mich
- Conn
- DC
- Ark
- Pa
- Va
- Ind
- Ariz
- Miss
- Fla
- Del
- Nev
- Ore
- Tenn
- Mont
- Ill
- Ala
- Wis
- Ga
- La
- Mo
- Vt
- Others:
  - J.C [in: J.C . PENNEY Co.]
  - R.L [in: Tucker Anthony & R.L . Day]
  - J.P [in: J.P . Morgan]
  - J [in: J . Walter Thompson Co.]
  - A.G [in: Siemens A.G .]
  - C [in: Stanford C. Berstein]
  - Cos [in: Equitable of Iowa Cos .]
  - A.L [in: A.L . Williams]
  - E.W [in: E.W . Scripps Co.]
  - A [in: Alfred A. Knopf]
  - D [in: D. Lazzaroni & Co.]
  - J.L [in: J.L Henry]
  - W.N [in: W.N Whelen]
  - T [in: T. Rowe Price]
  - C [in: C. Itoh]
  - E [in: E. Guigal]
  - E.C [in: E.C Television]
  - A.C [in: A.C. Nielsen]
  - L.P [in: WFRR L.P]
  - H.N [in: H.N. & Frances C. Berger Foundation]
  - F.W [in: F.W Dodge Group]
  - E [in: Charles E. Simon]
  - R.P [in: R.P. Scherer]
  - F.H [in: F.H. Faulding]
  - W.R [in: W.R. Grace]
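The tokenization fixes listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the script we actually ran: the function names are hypothetical, ABBREVIATIONS is only a small subset of the list above, and the code works on plain token lists and ignores the label column.

```python
# Illustrative subset of the abbreviations listed above, not the full list.
ABBREVIATIONS = {"Mr", "Mrs", "Ms", "Dr", "Co", "Corp", "Inc", "U.S", "Calif", "N.Y"}


def split_trailing_punct(tokens, marks=(",", ":")):
    """Split a trailing comma or colon off a token: [course,] -> [course] [,]."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.endswith(marks):
            out.extend([tok[:-1], tok[-1]])
        else:
            out.append(tok)
    return out


def merge_abbreviation_periods(tokens, abbreviations=ABBREVIATIONS):
    """Merge an abbreviation with a following period: [Mr] [.] -> [Mr.]."""
    out = []
    i = 0
    while i < len(tokens):
        followed_by_period = i + 1 < len(tokens) and tokens[i + 1] == "."
        if tokens[i] in abbreviations and followed_by_period:
            out.append(tokens[i] + ".")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

In the CoNLL file each token also carries a label, so a real fix has to decide which label the merged or split token keeps. An abbreviation at the very end of a sentence is also ambiguous, since its period may be the sentence-final one; cases like that are easier to check by hand.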
There were fewer than 20 TIME entities, so these were removed.
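These notes do not record exactly how the TIME entities were removed. One simple possibility, assumed here, is to relabel the affected tokens as outside any entity. The sketch below is hypothetical: the function name drop_entity_type, the "O" outside tag, and the assumption that the tag is the last whitespace-separated column (either bare, like TIME, or prefixed, like B-TIME) should all be checked against the actual muc6-conll-format.txt.

```python
def drop_entity_type(lines, entity_type="TIME", outside_tag="O"):
    """Relabel every token tagged with entity_type as outside_tag.

    Assumes each line has no trailing newline and that the tag is the
    last whitespace-separated column. Blank lines (sentence boundaries)
    and all other lines are passed through unchanged.
    """
    out = []
    for line in lines:
        stripped = line.rstrip()
        parts = stripped.rsplit(None, 1)
        if len(parts) == 2 and parts[1].split("-")[-1] == entity_type:
            # Keep the token and the original separator, swap only the tag.
            out.append(stripped[: len(stripped) - len(parts[1])] + outside_tag)
        else:
            out.append(line)
    return out
```

For example, drop_entity_type(open("muc6-conll-format.txt").read().splitlines()) would return the relabeled lines, which can then be written back out.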
We used stratified_split.py to create a custom train/test split of the data:
TRAIN, TEST = write_new_split('MUC6', 1000, filedir, 'muc6', max_count = 2)
- "Nomura Research Institute." is labeled as an entity.
- Usually (but not always, see "Ms. Poore"), the title (Mr., Ms., etc.) is not
contained in the named entity (unlike in, say, ACE 2005).
- "70 U.K. and international banks" [891102-0075.ne.v1.3.sgm]. Here
U.K. was unlabeled.
- In the Senate, Edward Kennedy (D., Mass) [891101-0115.ne.v1.3.sgm]. Here
"D" is marked as ORG (usually it isn't in MUC 6).
- % and $ are tokenized separately from the number, each on its own line, e.g. [50] [%]. This is the correct way.