Version 0.9.6 2008-12-07
NLTK:
* new WordNet corpus reader (contributed by Steven Bethard)
* incorporated dependency parsers into NLTK (was NLTK-Contrib) (contributed by Jason Narad)
* moved nltk/cfg.py to nltk/grammar.py and incorporated dependency grammars
* improved efficiency of unification algorithm
* various enhancements to the semantics package
* added plot() and tabulate() methods to FreqDist and ConditionalFreqDist
* FreqDist.keys() and list(FreqDist) provide keys reverse-sorted by value,
to avoid the confusion caused by FreqDist.sorted()
* new downloader module to support interactive data download: nltk.download()
run using "python -m nltk.downloader all"
* fixed WordNet bug that caused min_depth() to sometimes give incorrect result
* added nltk.util.Index as a wrapper around defaultdict(list) plus
a functional-style initializer
* fixed bug in Earley chart parser that caused it to break
* added basic TnT tagger nltk.tag.tnt
* new corpus reader for CoNLL dependency format (contributed by Kepa Sarasola and Iker Manterola)
* misc other bugfixes
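The nltk.util.Index entry above describes a wrapper around defaultdict(list) with a functional-style initializer. A minimal sketch of that idea, using only the standard library, might look like the following; the class body and the sample data are illustrative assumptions, not NLTK's actual source.

```python
from collections import defaultdict


class Index(defaultdict):
    """Hypothetical stand-in: maps each key to the list of values seen with it."""

    def __init__(self, pairs=()):
        # Missing keys default to an empty list, as with defaultdict(list).
        super().__init__(list)
        # Functional-style initializer: build the index from (key, value) pairs.
        for key, value in pairs:
            self[key].append(value)


# Illustrative data: group words by their first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_initial = Index((w[0], w) for w in words)
print(by_initial["a"])
```

Passing a generator of pairs keeps index construction to a single expression, which appears to be the point of the functional-style initializer.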
Contrib (work in progress):
* TIGERSearch implementation by Torsten Marek
* extensions to hole and glue semantics modules by Dan Garrette
* new coreference package by Joseph Frazee
* MapReduce interface by Xinfan Meng
Data:
* Corpora are stored in compressed format if this will not compromise speed of access
* Swadesh Corpus of comparative wordlists in 23 languages
* Split grammar collection into separate packages
* New Basque and Spanish grammar samples (contributed by Kepa Sarasola and Iker Manterola)
* Brown Corpus sections now have meaningful names (e.g. 'a' is now 'news')
* Fixed bug that forced users to manually unzip the WordNet corpus
* New dependency-parsed version of Treebank corpus sample
* Added movie script "Monty Python and the Holy Grail" to webtext corpus
* Replaced words corpus data with a much larger list of English words
* New URL for list of available NLTK corpora
http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
Book:
* complete rewrite of first three chapters to make the book accessible
to a wider audience
* new chapter on data-intensive language processing
* extensive reworking of most chapters
* Dropped subsection numbering; moved exercises to end of chapters
Distributions:
* created Portfile to support Mac installation
Version 0.9.5 2008-08-27
NLTK:
* text module with support for concordancing, text generation, plotting
* book module
* Major reworking of the automated theorem proving modules (Dan Garrette)
* draw.dispersion now uses pylab
* draw.concordance GUI tool
* nltk.data supports reading corpora and other data files from within zipfiles
* trees can be constructed from strings with Tree(s) (cf Tree.parse(s))
Contrib (work in progress):
* many updates to student projects
- nltk_contrib.agreement (Thomas Lippincott)
- nltk_contrib.coref (Joseph Frazee)
- nltk_contrib.depparser (Jason Narad)
- nltk_contrib.fuf (Petro Verkhogliad)
- nltk_contrib.hadoop (Xinfan Meng)
* clean-ups: deleted stale files; moved some packages to misc
Data:
* Cleaned up Gutenberg text corpora
* added Moby Dick; removed redundant copy of Blake songs.
* more tagger models
* renamed to nltk_data to facilitate installation
* stored each corpus as a zip file for quicker installation
and access, and to solve a problem with the Propbank
corpus including a file with an illegal name for MSWindows
(con.xml).
Book:
* changed filenames to chNN format
* reworked opening chapters (work in progress)
Distributions:
* fixed problem with mac installer that arose when Python binary
couldn't be found
* removed dependency of NLTK on nltk_data so that NLTK code can be
installed before the data
Version 0.9.4 2008-08-01
NLTK:
- Expanded semantics package for first order logic, linear logic,
glue semantics, DRT, LFG (Dan Garrette)
- new WordSense class in wordnet.synset supporting access to synsets
from sense keys and accessing sense counts (Joel Nothman)
- interface to Mallet's linear chain CRF implementation (nltk.tag.crf)
- misc bugfixes incl Punkt, synsets, maxent
- improved support for chunkers incl flexible chunk corpus reader,
new rule type: ChunkRuleWithContext
- new GUI for pos-tagged concordancing nltk.draw.pos_concordance
- new GUI for developing regexp chunkers nltk.draw.rechunkparser
- added bio_sents() and bio_words() methods to ConllChunkCorpusReader in conll.py
to allow reading (word, tag, chunk_typ) tuples off of CoNLL-2000 corpus. Also
modified ConllChunkCorpusView to support these changes.
- feature structures support values with custom unification methods
- new flag on tagged corpus readers to use simplified tagsets
- new package for ngram language modeling with Katz backoff nltk.model
- added classes for single-parented and multi-parented trees that
automatically maintain parent pointers (nltk.tree.ParentedTree and
nltk.tree.MultiParentedTree)
- new WordNet browser GUI (Jussi Salmela, Paul Bone)
- improved support for lazy sequences
- added generate() method to probability distributions
- more flexible parser for converting bracketed strings to trees
- made fixes to docstrings to improve API documentation
Contrib (work in progress):
- new NLG package, FUF/SURGE (Petro Verkhogliad)
- new dependency parser package (Jason Narad)
- new Coreference package, incl support for
ACE-2, MUC-6 and MUC-7 corpora (Joseph Frazee)
- CCG Parser (Graeme Gange)
- first order resolution theorem prover (Dan Garrette)
Data:
- New NPS Chat Corpus and corpus reader (nltk.corpus.nps_chat)
- ConllCorpusReader can now be used to read CoNLL 2004 and 2005 corpora.
- Implemented HMM-based Treebank POS tagger and phrase chunker for
nltk_contrib.coref in api.py. Pickled versions of these objects are checked
in under data/taggers and data/chunkers.
Book:
- misc corrections in response to feedback from readers
Version 0.9.3 2008-06-03
NLTK:
- modified WordNet similarity code to use pre-built information content files
- new classifier-based tagger, BNC corpus reader
- improved unicode support for corpus readers
- improved interfaces to Weka, Prover9/Mace4
- new support for using MEGAM and SciPy to train maxent classifiers
- rewrite of Punkt sentence segmenter (Joel Nothman)
- bugfixes for WordNet information content module (Jordan Boyd-Graber)
- code clean-ups throughout
Book:
- miscellaneous fixes in response to feedback from readers
Contrib:
- implementation of incremental algorithm for generating
referring expressions (contributed by Margaret Mitchell)
- refactoring WordNet browser (Paul Bone)
Corpora:
- included WordNet information content files
Version 0.9.2 2008-03-04
NLTK:
- new theorem-prover and model-checker module nltk.inference,
including interface to Prover9/Mace4 (Dan Garrette, Ewan Klein)
- fixed bug in Reuters corpus reader that caused Python
to complain about too many open files
- VerbNet and PropBank corpus readers
Data:
- VerbNet Corpus version 2.1: hierarchical, verb lexicon linked to WordNet
- PropBank Corpus: predicate-argument structures, as stand-off annotation of Penn Treebank
Contrib:
- New work on WordNet browser, incorporating a client-server model (Jussi Salmela)
Distributions:
- Mac OS 10.5 distribution
Version 0.9.1 2008-01-24
NLTK:
- new interface for text categorization corpora
- new corpus readers: RTE, Movie Reviews, Question Classification, Brown Corpus
- bugfix in ConcatenatedCorpusView that caused iteration to fail if it didn't start from the beginning of the corpus
Data:
- Question classification data, included with permission of Li & Roth
- Reuters 21578 Corpus, ApteMod version, from CPAN
- Movie Reviews corpus (sentiment polarity), included with permission of Lillian Lee
- Corpus for Recognising Textual Entailment (RTE) Challenges 1, 2 and 3
- Brown Corpus (reverted to original file structure: ca01-cr09)
- Penn Treebank corpus sample (simplified implementation, new readers treebank_raw and treebank_chunk)
- Minor redesign of corpus readers, to use filenames instead of "items" to identify parts of a corpus
Contrib:
- theorem_prover: Prover9, tableau, MaltParser, Mace4, glue semantics, docs (Dan Garrette, Ewan Klein)
- drt: improved drawing, conversion to FOL (Dan Garrette)
- gluesemantics: GUI demonstration, abstracted LFG code, documentation (Dan Garrette)
- readability: various text readability scores (Thomas Jakobsen, Thomas Skardal)
- toolbox: code to normalize toolbox databases (Greg Aumann)
Book:
- many improvements in early chapters in response to reader feedback
- updates for revised corpus readers
- moved unicode section to chapter 3
- work on engineering.txt (not included in 0.9.1)
Distributions:
- Fixed installation for Mac OS 10.5 (Joshua Ritterman)
- Generalize doctest_driver to work with doc_contrib
Version 0.9 2007-10-12
NLTK:
- New naming of packages and modules, and more functions imported into
top-level nltk namespace, e.g. nltk.chunk.Regexp -> nltk.RegexpParser,
nltk.tokenize.Line -> nltk.LineTokenizer, nltk.stem.Porter -> nltk.PorterStemmer,
nltk.parse.ShiftReduce -> nltk.ShiftReduceParser
- processing class names changed from verbs to nouns, e.g.
StemI -> StemmerI, ParseI -> ParserI, ChunkParseI -> ChunkParserI, ClassifyI -> ClassifierI
- all tokenizers are now available as subclasses of TokenizeI,
selected tokenizers are also available as functions, e.g. wordpunct_tokenize()
- rewritten ngram tagger code, collapsed lookup tagger with unigram tagger
- improved tagger API, permitting training in the initializer
- new system for deprecating code so that users are notified of name changes.
- support for reading feature cfgs to parallel reading cfgs (parse_featcfg())
- text classifier package, maxent (GIS, IIS), naive Bayes, decision trees, weka support
- more consistent tree printing
- wordnet's morphy stemmer now accessible via stemmer package
- RSLP Portuguese stemmer (originally developed by Viviane Moreira Orengo, reimplemented by Tiago Tresoldi)
- promoted ieer_rels.py to the sem package
- improvements to WordNet package (Jussi Salmela)
- more regression tests, and support for checking coverage of tests
- miscellaneous bugfixes
- remove numpy dependency
Data:
- new corpus reader implementation, refactored syntax corpus readers
- new data package: corpora, grammars, tokenizers, stemmers, samples
- CESS-ESP Spanish Treebank and corpus reader
- CESS-CAT Catalan Treebank and corpus reader
- Alpino Dutch Treebank and corpus reader
- MacMorpho POS-tagged Brazilian Portuguese news text and corpus reader
- trained model for Portuguese sentence segmenter
- Floresta Portuguese Treebank version 7.4 and corpus reader
- TIMIT player audio support
Contrib:
- BioReader (contributed by Carlos Rodriguez)
- TnT tagger (contributed by Sam Huston)
- wordnet browser (contributed by Jussi Salmela, requires wxpython)
- lpath interpreter (contributed by Haejoong Lee)
- timex -- regular expression-based temporal expression tagger
Book:
- polishing of early chapters
- introductions to parts 1, 2, 3
- improvements in book processing software (xrefs, avm & gloss formatting, javascript clipboard)
- updates to book organization, chapter contents
- corrections throughout suggested by readers (acknowledged in preface)
- more consistent use of US spelling throughout
- all examples redone to work with single import statement: "import nltk"
- reordered chapters: 5->7->8->9->11->12->5
* language engineering in part 1 to broaden the appeal
of the earlier part of the book and to talk more about
evaluation and baselines at an earlier stage
* concentrate the partial and full parsing material in part 2,
and remove the specialized feature-grammar material into part 3
Distributions:
- streamlined mac installation (Joshua Ritterman)
- included mac distribution with ISO image
Version 0.8 2007-07-01
Code:
- changed nltk.__init__ imports to explicitly import names from top-level modules
- changed corpus.util to use the 'rb' flag for opening files, to fix problems
reading corpora under MSWindows
- updated stale examples in engineering.txt
- extended feature structure interface to permit chained features, e.g. fs['F','G']
- further misc improvements to test code plus some bugfixes
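The chained-feature access fs['F','G'] mentioned above can be illustrated with a small dict subclass that treats a tuple key as a path through nested structures. This is a sketch under that assumption; the FeatStruct class here is hypothetical and much simpler than NLTK's feature structure implementation.

```python
class FeatStruct(dict):
    """Hypothetical nested feature structure with chained lookup."""

    def __getitem__(self, key):
        # fs['F', 'G'] arrives as the tuple ('F', 'G'): follow it as a path.
        if isinstance(key, tuple):
            value = self
            for name in key:
                value = value[name]
            return value
        # A plain key behaves like an ordinary dict lookup.
        return super().__getitem__(key)


# Illustrative structure: feature F holds a nested structure with feature G.
fs = FeatStruct(F=FeatStruct(G="sg"), H="np")
print(fs["F", "G"])
```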
Tutorials:
- rewritten opening section of tagging chapter
- reorganized some exercises
Version 0.8b2 2007-06-26
Code (major):
- new corpus package, obsoleting old corpora package
- supports caching, slicing, corpus search path
- more flexible API
- global updates so all NLTK modules use new corpus package
- moved nltk/contrib to separate top-level package nltk_contrib
- changed wordpunct tokenizer to use \w instead of a-zA-Z0-9
as this will be more robust for languages other than English,
with implications for many corpus readers that use it
- known bug: certain re-entrant structures in featstruct
- known bug: when the LHS of an edge contains an ApplicationExpression,
variable values in the RHS bindings aren't copied over when the
fundamental rule applies
- known bug: HMM tagger is broken
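The wordpunct tokenizer change above (\w instead of a-zA-Z0-9) can be sketched with two regular expressions. The exact patterns are assumptions for illustration, and the behavior shown is that of Python 3, where \w matches Unicode word characters by default.

```python
import re

# Assumed patterns: runs of word characters, or runs of other non-space characters.
ASCII_PATTERN = re.compile(r"[a-zA-Z0-9]+|[^a-zA-Z0-9\s]+")
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")

text = "na\u00efve caf\u00e9"  # "naive cafe" written with accented letters

# The ASCII class splits words at every accented letter.
ascii_tokens = ASCII_PATTERN.findall(text)
# \w keeps accented letters inside the word.
word_tokens = WORD_PATTERN.findall(text)

print(ascii_tokens)
print(word_tokens)
```

With the ASCII-only class, accented letters are treated as punctuation and break words apart, which is the robustness problem for languages other than English that the entry refers to.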
Tutorials:
- global updates to NLTK and docs
- ongoing polishing
Corpora:
- treebank sample reverted to published multi-file structure
Contrib:
- DRT and Glue Semantics code (nltk_contrib.drt, nltk_contrib.gluesemantics, by Dan Garrette)
Version 0.8b1 2007-06-18
Code (major):
- changed package name to nltk
- import all top-level modules into nltk, reducing need for import statements
- reorganization of sub-package structures to simplify imports
- new featstruct module, unifying old featurelite and featurestructure modules
- FreqDist now inherits from dict, fd.count(sample) becomes fd[sample]
- FreqDist initializer permits: fd = FreqDist(len(token) for token in text)
- made numpy optional
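The two FreqDist changes above (dict inheritance with fd[sample] lookup, and an initializer that accepts any iterable of samples) can be sketched as follows. This is a simplified, hypothetical class for illustration, not NLTK's implementation.

```python
class FreqDist(dict):
    """Hypothetical frequency distribution built on dict."""

    def __init__(self, samples=()):
        super().__init__()
        # Accept any iterable, including a generator expression.
        for sample in samples:
            self[sample] += 1

    def __missing__(self, sample):
        # Unseen samples have a count of zero, so fd[sample] never raises.
        return 0


text = "the cat sat on the mat".split()
fd = FreqDist(len(token) for token in text)
print(fd[3], fd[2], fd[7])
```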
Code (minor):
- changed GrammarFile initializer to accept filename
- consistent tree display format
- fixed loading process for WordNet and TIMIT that prevented code installation if data not installed
- taken more care with unicode types
- incorporated pcfg code into cfg module
- moved cfg, tree, featstruct to top level
- new filebroker module to make handling of example grammar files more transparent
- more corpus readers (webtext, abc)
- added cfg.covers() to check that a grammar covers a sentence
- simple text-based wordnet browser
- known bug: parse/featurechart.py uses incorrect apply() function
Corpora:
- csv data file to document NLTK corpora
Contrib:
- added Glue semantics code (contrib.glue, by Dan Garrette)
- Punkt sentence segmenter port (contrib.punkt, by Willy)
- added LPath interpreter (contrib.lpath, by Haejoong Lee)
- extensive work on classifiers (contrib.classifier*, Sumukh Ghodke)
Tutorials:
- polishing on parts I, II
- more illustrations, data plots, summaries, exercises
- continuing to make prose more accessible to non-linguistic audience
- new default import that all chapters presume: from nltk.book import *
Distributions:
- updated to latest version of numpy
- removed WordNet installation instructions as WordNet is now included in corpus distribution
- added pylab (matplotlib)
Version 0.7.5 2007-05-16
Code:
- improved WordNet and WordNet-Similarity interface
- the Lancaster Stemmer (contributed by Steven Tomcavage)
Corpora:
- Web text samples
- BioCreAtIvE-PPI - a corpus for protein-protein interactions
- Switchboard Telephone Speech Corpus Sample (via Talkbank)
- CMU Problem Reports Corpus sample
- CONLL2002 POS+NER data
- Patient Information Leaflet corpus
- WordNet 3.0 data files
- English wordlists: basic English, frequent words
Tutorials:
- more improvements to text and images
Version 0.7.4 2007-05-01
Code:
- Indian POS tagged corpus reader: corpora.indian
- Sinica Treebank corpus reader: corpora.sinica_treebank
- new web corpus reader corpora.web
- tag package now supports pickling
- added function to utilities.py to guess character encoding
Corpora:
- Rotokas texts from Stuart Robinson
- POS-tagged corpora for several Indian languages (Bangla, Hindi, Marathi, Telugu) from A Kumaran
Tutorials:
- Substantial work on Part II of book on structured programming, parsing and grammar
- More bibliographic citations
- Improvements in typesetting, cross references
- Redimensioned images and tables for better use of page space
- Moved project list to wiki
Contrib:
- validation of toolbox entries using chunking
- improved classifiers
Distribution:
- updated for Python 2.5.1, Numpy 1.0.2
Version 0.7.3 2007-04-02
* Code:
- made chunk.Regexp.parse() more flexible about its input
- developed new syntax for PCFG grammars, e.g. A -> B C [0.3] | D [0.7]
- fixed CFG parser to support grammars with slash categories
- moved beta classify package from main NLTK to contrib
- Brill taggers now load correctly
- misc bugfixes
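The PCFG rule syntax shown above, A -> B C [0.3] | D [0.7], attaches a bracketed probability to each alternative. A minimal sketch of reading one such line is given below; the function name parse_pcfg_rule and the tuple layout are assumptions for illustration, not NLTK's grammar parser.

```python
import re

# An alternative is a symbol sequence followed by a bracketed probability.
_ALTERNATIVE = re.compile(r"^(.*?)\s*\[([0-9.]+)\]$")


def parse_pcfg_rule(line):
    """Return a list of (lhs, rhs_symbols, probability) tuples for one rule line."""
    lhs, alternatives = line.split("->")
    lhs = lhs.strip()
    productions = []
    for alternative in alternatives.split("|"):
        match = _ALTERNATIVE.match(alternative.strip())
        if match is None:
            raise ValueError("missing probability in: " + alternative.strip())
        symbols = tuple(match.group(1).split())
        productions.append((lhs, symbols, float(match.group(2))))
    return productions


rules = parse_pcfg_rule("A -> B C [0.3] | D [0.7]")
print(rules)
```

In a well-formed PCFG the probabilities of all expansions of the same left-hand side sum to 1, which a caller could check on the returned tuples.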
* Corpora:
- Shakespeare XML corpus sample and corpus reader
* Tutorials:
- improvements to prose, exercises, plots, images
- expanded and reorganized tutorial on structured programming
- formatting improvements for Python listings
- improved plots (using pylab)
- categorization of problems by difficulty
* Contrib:
- more work on kimmo lexicon and grammar
- more work on classifiers
Version 0.7.2 2007-03-01
* Code:
- simple feature detectors (detect module)
- fixed problem when token generators are passed to a parser (parse package)
- fixed bug in Grammar.productions() (identified by Lucas Champollion and Mitch Marcus)
- fixed import bug in category.GrammarFile.earley_parser
- added utilities.OrderedDict
- initial port of old NLTK classifier package (by Sam Huston)
- UDHR corpus reader
* Corpora:
- added UDHR corpus (Universal Declaration of Human Rights)
with 10k text samples in 300+ languages
* Tutorials:
- improved images
- improved book formatting, including new support for:
- javascript to copy program examples to clipboard in HTML version,
- bibliography, chapter cross-references, colorization, index, table-of-contents
* Contrib:
- new Kimmo system: contrib.mit.six863.kimmo (Rob Speer)
- fixes for: contrib.fsa (Rob Speer)
- demonstration of text classifiers trained on UDHR corpus for
language identification: contrib.langid (Sam Huston)
- new Lambek calculus system: contrib.lambek
- new tree implementation based on elementtree: contrib.tree
Version 0.7.1 2007-01-14
* Code:
- bugfixes (HMM, WordNet)
Version 0.7 2006-12-22
* Code:
- bugfixes, including fixed bug in Brown corpus reader
- cleaned up wordnet 2.1 interface code and similarity measures
- support for full Penn treebank format contributed by Yoav Goldberg
* Tutorials:
- expanded tutorials on advanced parsing and structured programming
- checked all doctest code
- improved images for chart parsing
Version 0.7b1 2006-12-06
* Code:
- expanded semantic interpretation package
- new high-level chunking interface, with cascaded chunking
- split chunking code into new chunk package
- updated wordnet package to support version 2.1 of Wordnet.
- prototyped basic wordnet similarity measures
(path distance, Wu + Palmer and Leacock + Chodorow, Resnik similarity measures.)
- bugfixes (tag.Window, tag.ngram)
- more doctests
* Contrib:
- toolbox language settings module
* Tutorials:
- rewrite of chunking chapter, switched from Treebank to CoNLL format as main focus,
simplified evaluation framework, added ngram chunking section
- substantial updates throughout (esp programming and semantics chapters)
* Corpora:
- Chat-80 Prolog data files provided as corpora, plus corpus reader
Version 0.7a2 2006-11-13
* Code:
- more doctests
- code to read Chat-80 data
- HMM bugfix
* Tutorials:
- continued updates and polishing
* Corpora:
- toolbox MDF sample data
Version 0.7a1 2006-10-29
* Code:
- new toolbox module (Greg Aumann)
- new semantics package (Ewan Klein)
- bugfixes
* Tutorials:
- substantial revision, especially in preface, introduction, words,
and semantics chapters.
Version 0.6.6 2006-10-06
* Code:
- bugfixes (probability, shoebox, draw)
* Contrib:
- new work on shoebox package (Stuart Robinson)
* Tutorials:
- continual expansion and revision, especially on introduction to
programming, advanced programming and the feature-based grammar chapters.
Version 0.6.5 2006-07-09
* Code:
- improvements to shoebox module (Stuart Robinson, Greg Aumann)
- incorporated feature-based parsing into core NLTK-Lite
- corpus reader for Sinica treebank sample
- new stemmer package
* Contrib:
- hole semantics implementation (Peter Wang)
- Incorporating yaml
- new work on feature structures, unification, lambda calculus
- new work on shoebox package (Stuart Robinson, Greg Aumann)
* Corpora:
- Sinica treebank sample
* Tutorials:
- expanded discussion throughout, incl: left-recursion, trees, grammars,
feature-based grammar, agreement, unification, PCFGs,
baseline performance, exercises, improved display of trees
Version 0.6.4 2006-04-20
* Code:
- corpus readers for Senseval 2 and TIMIT
- clusterer (ported from old NLTK)
- support for cascaded chunkers
- bugfix suggested by Brent Payne
- new SortedDict class for regression testing
* Contrib:
- CombinedTagger tagger and marshalling taggers, contributed by Tiago Tresoldi
* Corpora:
- new: Senseval 2, TIMIT sample
* Tutorials:
- major revisions to programming, words, tagging, chunking, and parsing tutorials
- many new exercises
- formatting improvements, including colorized program examples
- fixed problem with testing on training data, reported by Jason Baldridge
Version 0.6.3 2006-03-09
* switch to new style classes
* repair FSA model sufficiently for Kimmo module to work
* port of MIT Kimmo morphological analyzer; still needs lots of code clean-up and inline docs
* expanded support for shoebox format, developed with Stuart Robinson
* fixed bug in indexing CFG productions, for empty right-hand-sides
* efficiency improvements, suggested by Martin Ranang
* replaced classeq with isinstance, for efficiency improvement, as suggested by Martin Ranang
* bugfixes in chunk eval
* simplified call to draw_trees
* names, stopwords corpora
Version 0.6.2 2006-01-29
* Peter Spiller's concordancer
* Will Hardy's implementation of Penton's paradigm visualization system
* corpus readers for presidential speeches
* removed NLTK dependency
* generalized CFG terminals to permit full range of characters
* used fully qualified names in demo code, for portability
* bugfixes from Yoav Goldberg, Eduardo Pereira Habkost
* fixed obscure quoting bug in tree displays and conversions
* simplified demo code, fixed import bug