ChangeLog

Version 0.9.6 2008-12-07

NLTK:
* new WordNet corpus reader (contributed by Steven Bethard)
* incorporated dependency parsers into NLTK (was NLTK-Contrib) (contributed by Jason Narad)
* moved nltk/cfg.py to nltk/grammar.py and incorporated dependency grammars
* improved efficiency of unification algorithm
* various enhancements to the semantics package
* added plot() and tabulate() methods to FreqDist and ConditionalFreqDist
* FreqDist.keys() and list(FreqDist) provide keys reverse-sorted by value,
  to avoid the confusion caused by FreqDist.sorted()
* new downloader module to support interactive data download: nltk.download();
  to fetch everything from the command line, run "python -m nltk.downloader all"
* fixed WordNet bug that caused min_depth() to sometimes give incorrect result
* added nltk.util.Index as a wrapper around defaultdict(list) plus
  a functional-style initializer
* fixed bug in Earley chart parser that caused it to break
* added basic TnT tagger nltk.tag.tnt
* new corpus reader for CoNLL dependency format (contributed by Kepa Sarasola and Iker Manterola)
* misc other bugfixes
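
The nltk.util.Index entry above describes a wrapper around defaultdict(list)
with a functional-style initializer. A minimal standard-library sketch of that
idea follows. The class body and the sample word list are illustrative
assumptions, not necessarily the code shipped in this release.

```python
from collections import defaultdict


class Index(defaultdict):
    """Map each key to the list of values paired with it (illustrative sketch)."""

    def __init__(self, pairs):
        # Behave like defaultdict(list): missing keys start as empty lists.
        super().__init__(list)
        # Functional-style initializer: consume any iterable of (key, value) pairs.
        for key, value in pairs:
            self[key].append(value)


# Hypothetical sample data: index words by their length.
words = ["the", "cat", "sat", "on", "a", "mat"]
by_length = Index((len(w), w) for w in words)
```

Because the initializer accepts any iterable of pairs, an index can be built
from a generator expression in one step instead of an explicit loop.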

Contrib (work in progress):
* TIGERSearch implementation by Torsten Marek
* extensions to hole and glue semantics modules by Dan Garrette
* new coreference package by Joseph Frazee
* MapReduce interface by Xinfan Meng

Data:
* Corpora are stored in compressed format if this will not compromise speed of access
* Swadesh Corpus of comparative wordlists in 23 languages
* Split grammar collection into separate packages
* New Basque and Spanish grammar samples (contributed by Kepa Sarasola and Iker Manterola)
* Brown Corpus sections now have meaningful names (e.g. 'a' is now 'news')
* Fixed bug that forced users to manually unzip the WordNet corpus
* New dependency-parsed version of Treebank corpus sample
* Added movie script "Monty Python and the Holy Grail" to webtext corpus
* Replaced words corpus data with a much larger list of English words
* New URL for list of available NLTK corpora
  http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

Book:
* complete rewrite of first three chapters to make the book accessible
  to a wider audience
* new chapter on data-intensive language processing
* extensive reworking of most chapters
* Dropped subsection numbering; moved exercises to end of chapters

Distributions:
* created Portfile to support Mac installation


Version 0.9.5 2008-08-27

NLTK:
* text module with support for concordancing, text generation, plotting
* book module
* Major reworking of the automated theorem proving modules (Dan Garrette)
* draw.dispersion now uses pylab
* draw.concordance GUI tool
* nltk.data supports reading corpora and other data files from within zipfiles
* trees can be constructed from strings with Tree(s) (cf Tree.parse(s))

Contrib (work in progress):
* many updates to student projects
  - nltk_contrib.agreement (Thomas Lippincott)
  - nltk_contrib.coref (Joseph Frazee)
  - nltk_contrib.depparser (Jason Narad)
  - nltk_contrib.fuf (Petro Verkhogliad)
  - nltk_contrib.hadoop (Xinfan Meng)
* clean-ups: deleted stale files; moved some packages to misc

Data:
* Cleaned up Gutenberg text corpora
* added Moby Dick; removed redundant copy of Blake songs.
* more tagger models
* renamed to nltk_data to facilitate installation
* stored each corpus as a zip file for quicker installation
  and access, and to solve a problem with the Propbank
  corpus including a file with an illegal name for MSWindows
  (con.xml).

Book:
* changed filenames to chNN format
* reworked opening chapters (work in progress)

Distributions:
* fixed problem with mac installer that arose when Python binary
  couldn't be found
* removed dependency of NLTK on nltk_data so that NLTK code can be
  installed before the data

Version 0.9.4 2008-08-01
	
NLTK:
- Expanded semantics package for first order logic, linear logic,
  glue semantics, DRT, LFG (Dan Garrette)
- new WordSense class in wordnet.synset supporting access to synsets
  from sense keys and accessing sense counts (Joel Nothman)
- interface to Mallet's linear chain CRF implementation (nltk.tag.crf)
- misc bugfixes incl Punkt, synsets, maxent
- improved support for chunkers incl flexible chunk corpus reader,
  new rule type: ChunkRuleWithContext
- new GUI for pos-tagged concordancing nltk.draw.pos_concordance
- new GUI for developing regexp chunkers nltk.draw.rechunkparser 
- added bio_sents() and bio_words() methods to ConllChunkCorpusReader in conll.py
  to allow reading (word, tag, chunk_typ) tuples from the CoNLL-2000 corpus; also
  modified ConllChunkCorpusView to support these changes
- feature structures support values with custom unification methods
- new flag on tagged corpus readers to use simplified tagsets
- new package for ngram language modeling with Katz backoff nltk.model
- added classes for single-parented and multi-parented trees that 
  automatically maintain parent pointers (nltk.tree.ParentedTree and
  nltk.tree.MultiParentedTree)
- new WordNet browser GUI (Jussi Salmela, Paul Bone)
- improved support for lazy sequences
- added generate() method to probability distributions
- more flexible parser for converting bracketed strings to trees
- made fixes to docstrings to improve API documentation

Contrib (work in progress):
- new NLG package, FUF/SURGE (Petro Verkhogliad) 
- new dependency parser package (Jason Narad)
- new Coreference package, incl support for
  ACE-2, MUC-6 and MUC-7 corpora (Joseph Frazee)
- CCG Parser (Graeme Gange)
- first order resolution theorem prover (Dan Garrette)

Data:
- New NPS Chat Corpus and corpus reader (nltk.corpus.nps_chat)
- ConllCorpusReader can now be used to read CoNLL 2004 and 2005 corpora.
- Implemented HMM-based Treebank POS tagger and phrase chunker for
  nltk_contrib.coref in api.py. Pickled versions of these objects are checked
  in under data/taggers and data/chunkers.

Book:
- misc corrections in response to feedback from readers
	
Version 0.9.3 2008-06-03
	
NLTK:
- modified WordNet similarity code to use pre-built information content files
- new classifier-based tagger, BNC corpus reader
- improved unicode support for corpus readers
- improved interfaces to Weka, Prover9/Mace4
- new support for using MEGAM and SciPy to train maxent classifiers
- rewrite of Punkt sentence segmenter (Joel Nothman)
- bugfixes for WordNet information content module (Jordan Boyd-Graber)
- code clean-ups throughout

Book:
- miscellaneous fixes in response to feedback from readers

Contrib:
- implementation of incremental algorithm for generating
  referring expressions (contributed by Margaret Mitchell)
- refactoring WordNet browser (Paul Bone)
  
Corpora:
- included WordNet information content files

Version 0.9.2 2008-03-04

NLTK:
- new theorem-prover and model-checker module nltk.inference,
  including interface to Prover9/Mace4 (Dan Garrette, Ewan Klein)
- bugfix in Reuters corpus reader that caused Python
  to complain about too many open files
- VerbNet and PropBank corpus readers

Data:
- VerbNet Corpus version 2.1: hierarchical verb lexicon linked to WordNet
- PropBank Corpus: predicate-argument structures, as stand-off annotation of Penn Treebank

Contrib:
- New work on WordNet browser, incorporating a client-server model (Jussi Salmela)

Distributions:
- Mac OS 10.5 distribution
	
Version 0.9.1 2008-01-24

NLTK:
- new interface for text categorization corpora
- new corpus readers: RTE, Movie Reviews, Question Classification, Brown Corpus
- bugfix in ConcatenatedCorpusView that caused iteration to fail if it didn't start from the beginning of the corpus

Data:
- Question classification data, included with permission of Li & Roth
- Reuters 21578 Corpus, ApteMod version, from CPAN
- Movie Reviews corpus (sentiment polarity), included with permission of Lillian Lee
- Corpus for Recognising Textual Entailment (RTE) Challenges 1, 2 and 3
- Brown Corpus (reverted to original file structure: ca01-cr09)
- Penn Treebank corpus sample (simplified implementation, new readers treebank_raw and treebank_chunk)
- Minor redesign of corpus readers, to use filenames instead of "items" to identify parts of a corpus
	
Contrib:
- theorem_prover: Prover9, tableau, MaltParser, Mace4, glue semantics, docs (Dan Garrette, Ewan Klein)
- drt: improved drawing, conversion to FOL (Dan Garrette)
- gluesemantics: GUI demonstration, abstracted LFG code, documentation (Dan Garrette)
- readability: various text readability scores (Thomas Jakobsen, Thomas Skardal)
- toolbox: code to normalize toolbox databases (Greg Aumann)
	
Book:
- many improvements in early chapters in response to reader feedback
- updates for revised corpus readers
- moved unicode section to chapter 3
- work on engineering.txt (not included in 0.9.1)

Distributions:
- Fixed installation for Mac OS 10.5 (Joshua Ritterman)
- Generalized doctest_driver to work with doc_contrib

Version 0.9 2007-10-12

NLTK:
- New naming of packages and modules, and more functions imported into
  top-level nltk namespace, e.g. nltk.chunk.Regexp -> nltk.RegexpParser,
  nltk.tokenize.Line -> nltk.LineTokenizer, nltk.stem.Porter -> nltk.PorterStemmer,
  nltk.parse.ShiftReduce -> nltk.ShiftReduceParser
- processing class names changed from verbs to nouns, e.g.
  StemI -> StemmerI, ParseI -> ParserI, ChunkParseI -> ChunkParserI, ClassifyI -> ClassifierI
- all tokenizers are now available as subclasses of TokenizeI,
  selected tokenizers are also available as functions, e.g. wordpunct_tokenize()
- rewritten ngram tagger code, collapsed lookup tagger with unigram tagger
- improved tagger API, permitting training in the initializer
- new system for deprecating code so that users are notified of name changes.
- support for reading feature cfgs to parallel reading cfgs (parse_featcfg())
- text classifier package, maxent (GIS, IIS), naive Bayes, decision trees, weka support
- more consistent tree printing
- wordnet's morphy stemmer now accessible via stemmer package
- RSLP Portuguese stemmer (originally developed by Viviane Moreira Orengo, reimplemented by Tiago Tresoldi)
- promoted ieer_rels.py to the sem package
- improvements to WordNet package (Jussi Salmela)
- more regression tests, and support for checking coverage of tests
- miscellaneous bugfixes
- removed numpy dependency

Data:
- new corpus reader implementation, refactored syntax corpus readers
- new data package: corpora, grammars, tokenizers, stemmers, samples
- CESS-ESP Spanish Treebank and corpus reader
- CESS-CAT Catalan Treebank and corpus reader
- Alpino Dutch Treebank and corpus reader
- MacMorpho POS-tagged Brazilian Portuguese news text and corpus reader
- trained model for Portuguese sentence segmenter
- Floresta Portuguese Treebank version 7.4 and corpus reader
- TIMIT player audio support

Contrib:
- BioReader (contributed by Carlos Rodriguez)
- TnT tagger (contributed by Sam Huston)
- wordnet browser (contributed by Jussi Salmela, requires wxpython)
- lpath interpreter (contributed by Haejoong Lee)
- timex -- regular expression-based temporal expression tagger

Book:
- polishing of early chapters
- introductions to parts 1, 2, 3
- improvements in book processing software (xrefs, avm & gloss formatting, javascript clipboard)
- updates to book organization, chapter contents
- corrections throughout suggested by readers (acknowledged in preface)
- more consistent use of US spelling throughout
- all examples redone to work with single import statement: "import nltk"
- reordered chapters: 5->7->8->9->11->12->5
  * language engineering in part 1 to broaden the appeal
    of the earlier part of the book and to talk more about
    evaluation and baselines at an earlier stage
  * concentrate the partial and full parsing material in part 2,
    and remove the specialized feature-grammar material into part 3

Distributions:
- streamlined mac installation (Joshua Ritterman)
- included mac distribution with ISO image

Version 0.8 2007-07-01

Code:
- changed nltk.__init__ imports to explicitly import names from top-level modules
- changed corpus.util to use the 'rb' flag for opening files, to fix problems
  reading corpora under MSWindows
- updated stale examples in engineering.txt
- extended feature structure interface to permit chained features, e.g. fs['F','G']
- further misc improvements to test code plus some bugfixes
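
The chained-feature lookup mentioned above, fs['F','G'], can be pictured as
following a path through nested structures. The sketch below shows one way to
get that behaviour from a dict subclass. The class name FeaturePathDict and the
sample features are hypothetical and stand in for the real feature structure
class, whose implementation may differ.

```python
class FeaturePathDict(dict):
    """Nested mapping that accepts a tuple of names as a path (hypothetical sketch)."""

    def __getitem__(self, name_or_path):
        # A tuple such as ('F', 'G') is read as "feature G inside feature F".
        if isinstance(name_or_path, tuple):
            value = self
            for name in name_or_path:
                value = value[name]
            return value
        # A single name falls back to ordinary dictionary lookup.
        return super().__getitem__(name_or_path)


# Hypothetical sample structure with one nested feature.
fs = FeaturePathDict(
    CAT="NP",
    F=FeaturePathDict(G="sg", H="3rd"),
)

chained = fs["F", "G"]      # same value as fs["F"]["G"]
stepwise = fs["F"]["G"]
```

One consequence of this design is that tuples can no longer serve as ordinary
feature names, since every tuple key is interpreted as a path.
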
Tutorials:
- rewritten opening section of tagging chapter
- reorganized some exercises

Version 0.8b2 2007-06-26

Code (major):
- new corpus package, obsoleting old corpora package
  - supports caching, slicing, corpus search path
  - more flexible API
  - global updates so all NLTK modules use new corpus package
- moved nltk/contrib to separate top-level package nltk_contrib
- changed wordpunct tokenizer to use \w instead of a-zA-Z0-9
  as this will be more robust for languages other than English,
  with implications for many corpus readers that use it
- known bug: certain re-entrant structures in featstruct
- known bug: when the LHS of an edge contains an ApplicationExpression,
    variable values in the RHS bindings aren't copied over when the
    fundamental rule applies
- known bug: HMM tagger is broken
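
The effect of switching the wordpunct tokenizer from a-zA-Z0-9 to \w can be
seen with two regular expressions. Both patterns below are assumptions written
for illustration and are not copied from the tokenizer itself. The example
runs under Python 3, where \w matches Unicode word characters in str input by
default.

```python
import re

# ASCII-only character class, in the style of the older tokenizer.
ASCII_PATTERN = re.compile(r"[a-zA-Z0-9]+|[^a-zA-Z0-9\s]+")
# \w-based pattern; note that \w also matches the underscore.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")

# French sample text; accented letters are written as escapes for clarity.
text = "Un caf\u00e9, s'il vous pla\u00eet!"

ascii_tokens = ASCII_PATTERN.findall(text)
word_tokens = WORD_PATTERN.findall(text)
```

With the ASCII class, accented letters are treated as punctuation and split
words apart ("caf" followed by a separate token), while the \w version keeps
"caf\u00e9" and "pla\u00eet" whole. Corpus readers built on the tokenizer
inherit the same difference.
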
Tutorials:
- global updates to NLTK and docs
- ongoing polishing
Corpora:
- treebank sample reverted to published multi-file structure
Contrib:
- DRT and Glue Semantics code (nltk_contrib.drt, nltk_contrib.gluesemantics, by Dan Garrette)

Version 0.8b1 2007-06-18

Code (major):
- changed package name to nltk
- import all top-level modules into nltk, reducing need for import statements  
- reorganization of sub-package structures to simplify imports
- new featstruct module, unifying old featurelite and featurestructure modules
- FreqDist now inherits from dict, fd.count(sample) becomes fd[sample]
- FreqDist initializer permits: fd = FreqDist(len(token) for token in text)
- made numpy optional
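
The two FreqDist entries above (inheriting from dict, and accepting an iterable
in the initializer) can be sketched with a small dict subclass. The class
CountingDict and its inc() helper are hypothetical stand-ins written for this
note, not the library's implementation.

```python
class CountingDict(dict):
    """Frequency counts stored directly as dictionary entries (hypothetical sketch)."""

    def __init__(self, samples=()):
        super().__init__()
        # Accept any iterable, including a generator expression.
        for sample in samples:
            self.inc(sample)

    def inc(self, sample, count=1):
        # Add to the stored count, starting from zero for new samples.
        self[sample] = self.get(sample, 0) + count

    def __missing__(self, sample):
        # Unseen samples report a count of zero instead of raising KeyError.
        return 0


text = "the quick brown fox jumps over the lazy dog".split()
fd = CountingDict(len(token) for token in text)

count_of_three = fd[3]   # plain indexing replaces a count(sample) method call
```

Because counts are ordinary dictionary entries, the usual dict operations
(iteration, membership tests, len) work without extra methods.
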
Code (minor):
- changed GrammarFile initializer to accept filename
- consistent tree display format
- fixed loading process for WordNet and TIMIT that prevented code installation if data not installed
- taken more care with unicode types
- incorporated pcfg code into cfg module
- moved cfg, tree, featstruct to top level
- new filebroker module to make handling of example grammar files more transparent
- more corpus readers (webtext, abc)
- added cfg.covers() to check that a grammar covers a sentence
- simple text-based wordnet browser
- known bug: parse/featurechart.py uses incorrect apply() function
Corpora:
- csv data file to document NLTK corpora
Contrib:
- added Glue semantics code (contrib.glue, by Dan Garrette)
- Punkt sentence segmenter port (contrib.punkt, by Willy)
- added LPath interpreter (contrib.lpath, by Haejoong Lee)
- extensive work on classifiers (contrib.classifier*, Sumukh Ghodke)
Tutorials:
- polishing on parts I, II
- more illustrations, data plots, summaries, exercises
- continuing to make prose more accessible to non-linguistic audience
- new default import that all chapters presume: from nltk.book import *
Distributions:
- updated to latest version of numpy
- removed WordNet installation instructions as WordNet is now included in corpus distribution
- added pylab (matplotlib)

Version 0.7.5 2007-05-16

Code:
- improved WordNet and WordNet-Similarity interface
- the Lancaster Stemmer (contributed by Steven Tomcavage)
Corpora:
- Web text samples
- BioCreAtIvE-PPI - a corpus for protein-protein interactions
- Switchboard Telephone Speech Corpus Sample (via Talkbank)
- CMU Problem Reports Corpus sample
- CONLL2002 POS+NER data
- Patient Information Leaflet corpus
- WordNet 3.0 data files
- English wordlists: basic English, frequent words
Tutorials:
- more improvements to text and images

Version 0.7.4 2007-05-01

Code:
- Indian POS tagged corpus reader: corpora.indian
- Sinica Treebank corpus reader: corpora.sinica_treebank
- new web corpus reader corpora.web
- tag package now supports pickling
- added function to utilities.py to guess character encoding
Corpora:
- Rotokas texts from Stuart Robinson
- POS-tagged corpora for several Indian languages (Bangla, Hindi, Marathi, Telugu) from A Kumaran
Tutorials:
- Substantial work on Part II of book on structured programming, parsing and grammar
- More bibliographic citations
- Improvements in typesetting, cross references
- Redimensioned images and tables for better use of page space
- Moved project list to wiki
Contrib:
- validation of toolbox entries using chunking
- improved classifiers
Distribution:
- updated for Python 2.5.1, Numpy 1.0.2

Version 0.7.3 2007-04-02
	
* Code:
 - made chunk.Regexp.parse() more flexible about its input
 - developed new syntax for PCFG grammars, e.g. A -> B C [0.3] | D [0.7]
 - fixed CFG parser to support grammars with slash categories
 - moved beta classify package from main NLTK to contrib
 - Brill taggers loaded correctly
 - misc bugfixes
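
To make the PCFG syntax above concrete, the following sketch reads a single
rule line of the form A -> B C [0.3] | D [0.7] and returns one production per
alternative. The function parse_pcfg_line is a hypothetical helper written for
this note. It does not handle quoted terminals or a "|" inside a symbol, and it
is not the parser shipped with the toolkit.

```python
import re

# A bracketed probability at the end of an alternative, e.g. "[0.3]".
_PROB = re.compile(r"\[([0-9.]+)\]\s*$")


def parse_pcfg_line(line):
    """Return a list of (lhs, rhs, probability) triples for one rule line."""
    lhs, arrow, alternatives = line.partition("->")
    if not arrow:
        raise ValueError("expected '->' in %r" % line)
    lhs = lhs.strip()
    productions = []
    for alt in alternatives.split("|"):
        match = _PROB.search(alt)
        if match is None:
            raise ValueError("missing probability in %r" % alt)
        prob = float(match.group(1))
        # Everything before the bracket is the right-hand side.
        rhs = tuple(alt[:match.start()].split())
        productions.append((lhs, rhs, prob))
    return productions


rules = parse_pcfg_line("A -> B C [0.3] | D [0.7]")
# Probabilities for one left-hand side are expected to sum to 1.
total = sum(prob for _, _, prob in rules)
```
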
* Corpora:
 - Shakespeare XML corpus sample and corpus reader
* Tutorials:
 - improvements to prose, exercises, plots, images
 - expanded and reorganized tutorial on structured programming
 - formatting improvements for Python listings
 - improved plots (using pylab)
 - categorization of problems by difficulty
* Contrib:
 - more work on kimmo lexicon and grammar
 - more work on classifiers

Version 0.7.2 2007-03-01

* Code:
 - simple feature detectors (detect module)
 - fixed problem when token generators are passed to a parser (parse package)
 - fixed bug in Grammar.productions() (identified by Lucas Champollion and Mitch Marcus)
 - fixed import bug in category.GrammarFile.earley_parser
 - added utilities.OrderedDict
 - initial port of old NLTK classifier package (by Sam Huston)
 - UDHR corpus reader
* Corpora:
 - added UDHR corpus (Universal Declaration of Human Rights)
     with 10k text samples in 300+ languages
* Tutorials:
 - improved images
 - improved book formatting, including new support for:
   - javascript to copy program examples to clipboard in HTML version,
   - bibliography, chapter cross-references, colorization, index, table-of-contents

* Contrib:
  - new Kimmo system: contrib.mit.six863.kimmo (Rob Speer)
  - fixes for: contrib.fsa (Rob Speer)
  - demonstration of text classifiers trained on UDHR corpus for
      language identification: contrib.langid (Sam Huston)
  - new Lambek calculus system: contrib.lambek
  - new tree implementation based on elementtree: contrib.tree
	
Version 0.7.1 2007-01-14

* Code:
  - bugfixes (HMM, WordNet)

Version 0.7 2006-12-22

* Code:
  - bugfixes, including fixed bug in Brown corpus reader
  - cleaned up wordnet 2.1 interface code and similarity measures
  - support for full Penn treebank format contributed by Yoav Goldberg
* Tutorials:
  - expanded tutorials on advanced parsing and structured programming
  - checked all doctest code
  - improved images for chart parsing
	
Version 0.7b1 2006-12-06
	
* Code:
  - expanded semantic interpretation package
  - new high-level chunking interface, with cascaded chunking
  - split chunking code into new chunk package
  - updated wordnet package to support version 2.1 of Wordnet.
  - prototyped basic wordnet similarity measures
    (path distance, Wu + Palmer and Leacock + Chodorow, Resnik similarity measures.)
  - bugfixes (tag.Window, tag.ngram)
  - more doctests
* Contrib:	
  - toolbox language settings module
* Tutorials:
  - rewrite of chunking chapter, switched from Treebank to CoNLL format as main focus,
    simplified evaluation framework, added ngram chunking section
  - substantial updates throughout (esp programming and semantics chapters)
* Corpora:
  - Chat-80 Prolog data files provided as corpora, plus corpus reader
	
Version 0.7a2 2006-11-13

* Code:
  - more doctests
  - code to read Chat-80 data
  - HMM bugfix
* Tutorials:
  - continued updates and polishing
* Corpora:
  - toolbox MDF sample data

Version 0.7a1 2006-10-29

* Code:
  - new toolbox module (Greg Aumann)
  - new semantics package (Ewan Klein)
  - bugfixes
* Tutorials:
  - substantial revision, especially in preface, introduction, words,
    and semantics chapters.
	
Version 0.6.6 2006-10-06

* Code:
  - bugfixes (probability, shoebox, draw)
* Contrib:
  - new work on shoebox package (Stuart Robinson)
* Tutorials:
  - continual expansion and revision, especially on introduction to
    programming, advanced programming and the feature-based grammar chapters.
	
Version 0.6.5 2006-07-09

* Code:
  - improvements to shoebox module (Stuart Robinson, Greg Aumann)
  - incorporated feature-based parsing into core NLTK-Lite
  - corpus reader for Sinica treebank sample
  - new stemmer package
* Contrib:
  - hole semantics implementation (Peter Wang)
  - Incorporating yaml
  - new work on feature structures, unification, lambda calculus
  - new work on shoebox package (Stuart Robinson, Greg Aumann)
* Corpora:
  - Sinica treebank sample	
* Tutorials:
  - expanded discussion throughout, incl: left-recursion, trees, grammars,
    feature-based grammar, agreement, unification, PCFGs,
    baseline performance, exercises, improved display of trees

Version 0.6.4 2006-04-20

* Code:
  - corpus readers for Senseval 2 and TIMIT
  - clusterer (ported from old NLTK)
  - support for cascaded chunkers
  - bugfix suggested by Brent Payne
  - new SortedDict class for regression testing
* Contrib:
  - CombinedTagger tagger and marshalling taggers, contributed by Tiago Tresoldi
* Corpora:
  - new: Senseval 2, TIMIT sample
* Tutorials:
  - major revisions to programming, words, tagging, chunking, and parsing tutorials
  - many new exercises
  - formatting improvements, including colorized program examples
  - fixed problem with testing on training data, reported by Jason Baldridge
	
Version 0.6.3 2006-03-09

* switch to new style classes
* repair FSA model sufficiently for Kimmo module to work	
* port of MIT Kimmo morphological analyzer; still needs lots of code clean-up and inline docs
* expanded support for shoebox format, developed with Stuart Robinson
* fixed bug in indexing CFG productions, for empty right-hand-sides
* efficiency improvements, suggested by Martin Ranang
* replaced classeq with isinstance, for efficiency improvement, as suggested by Martin Ranang
* bugfixes in chunk eval
* simplified call to draw_trees
* names, stopwords corpora
	
Version 0.6.2 2006-01-29

* Peter Spiller's concordancer
* Will Hardy's implementation of Penton's paradigm visualization system
* corpus readers for presidential speeches
* removed NLTK dependency
* generalized CFG terminals to permit full range of characters
* used fully qualified names in demo code, for portability
* bugfixes from Yoav Goldberg, Eduardo Pereira Habkost
* fixed obscure quoting bug in tree displays and conversions
* simplified demo code, fixed import bug