Book Development

This page documents our plans for the development of the NLTK book, leading to a second edition.

Language Processing and the Natural Language Toolkit
0. Introduction
  - something to get attention
  - bring in some material from the preface (SB)
  - revisit the reviews of the old chapter 1
1. NLP systems: translation (nearly done), dialogue (done), summarisation (SB), web search (EL), question answering (logic-rich, inferencing approach EK), recommendations (similarity metrics over document collections, EL), sentiment analysis (EK). Pointers to demonstrations online (links hosted at nltk.org to avoid link rot). Motivation. Architecture. Limitations. Discussion to highlight the non-trivial NLP involved. Help readers understand the breadth and limitations of NLP.
  - description, the pieces you need to solve it (architecture diagram if necessary)
  - the fact that there's overlap between these in terms of the required subtasks
  - some very different approaches exist for the above (favour popularity, reasonableness, coverage of approaches across the whole set)
2. Sub-tasks: WSD (EK), pronoun resolution/coreference (EL), paraphrasing (EK), finding things in text (SB), language modeling (EL), collocations (SB), sentence segmentation (SB), lexicon (associating meaning with words, and learning those associations automatically), normalization (stemming, Unicode, case, Twitter-speak) (EK?), syntax (how different words in the sentence relate to one another, e.g., the agent of a verb) (EK?), named entity recognition (EL) -- (identify tasks and writers by mid-June)
  - show that these are non-trivial tasks (requires an example), but also doable (outline of how it works)
  - each block of the architecture diagrams is a candidate here
  - include some indication of what performance to expect on each task (state of the art performance?)
3. Overview of NLTK (simple things you can do, intended to engage readers with the content, linguists and non-linguists alike) (SB)
  - a little bit less on counting
  - tokenization, tagging, count more interesting things
  - parsing?
4. Overview of book
5. Summary (with forward pointers)
6. Further Reading
7. Exercises
Accessing Text Corpora and Lexical Resources
1. Accessing Text Corpora (mention NomBank, PropBank) [ewanklein]
2. Conditional Frequency Distributions
3. Lexical Resources (add FrameNet, VerbNet, SentiWordNet) [edloper]
4. WordNet (add Open Multilingual WordNet) [stevenbird]
5. Summary
6. Further Reading
7. Exercises
Processing Raw Text
1. Accessing Text from the Web and from Disk (add Twitter) [ewanklein]
2. Strings: Text Processing at the Lowest Level
3. Text Processing with Unicode (update for Python 3, including the bytes type; this will already be done) [edloper]
4. Regular Expressions for Detecting Word Patterns
5. Useful Applications of Regular Expressions
6. Normalizing Text (add Twitter) [ewanklein]
7. Regular Expressions for Tokenizing Text
8. Segmentation
9. Formatting: From Lists to Strings (update to use string.format – but this will already be done)
10. Scaling up (incl how to use the Stanford tokenizer) [stevenbird]
11. Summary
12. Further Reading
13. Exercises
Language Modeling [edloper]
1. n-gram models
2. bins: forming equivalence classes
3. independence assumptions
4. sparse data problems
5. statistical estimators (MLE, Laplace, held-out, etc.)
6. combining estimators
7. word clusters and word similarity
8. word embeddings
9. scaling up -- show how to perform a task that we've already performed, but using an external tool and a larger data set. This should be a short section, but enough to show how to use the interface to the external tool and to show the performance difference.
10. Summary
11. Further Reading
12. Exercises
Categorizing and Tagging Words
1. Using a Tagger
2. Tagged Corpora (mention MASC tagged corpus?) [stevenbird]
3. Mapping Words to Properties Using Python Dictionaries
4. Automatic Tagging
5. N-Gram Tagging
6. Transformation-Based Tagging
7. How to Determine the Category of a Word
8. Scaling Up (incl how to use the Stanford tagger) [stevenbird]
9. Summary
10. Further Reading
11. Exercises
Learning to Classify Text
1. Supervised Classification
2. Further Examples of Supervised Classification
3. Evaluation
4. Decision Trees
5. Naive Bayes Classifiers
6. Maximum Entropy Classifiers
7. Modeling Linguistic Patterns
8. Sentiment Detection (incl sentiwordnet); here or in chapter 7 [ewanklein]
9. Scaling Up [edloper]
10. Summary
11. Further Reading
12. Exercises
Extracting Information from Text
1. Information Extraction
2. Chunking (decide which interface to use: chunking, tagging, or parsing)
3. Developing and Evaluating Chunkers
4. Recursion in Linguistic Structure
5. Named Entity Recognition
6. Relation Extraction
7. Scaling Up (incl how to use the Stanford chunker and named entity recognizer) [edloper]
8. Summary
9. Further Reading
10. Exercises
Analyzing Sentence Structure
1. Some Grammatical Dilemmas
2. What's the Use of Syntax?
3. Context Free Grammar
4. Parsing With Context Free Grammar
5. Combinatory Categorial Grammar [ewanklein]
6. Dependencies and Dependency Grammar [edloper]
  - split into two sections:
  - (a) heads, arguments, and roles (mentions FrameNet, VerbNet, NomBank, PropBank)
  - (b) dependency grammar and dependency parsing
7. Grammar Development (could this work in chapter 9?)
8. Scaling Up (incl how to use the Stanford parser) [edloper]
9. Summary
10. Further Reading
11. Exercises
Building Feature Based Grammars
1. Grammatical Features
2. Processing Feature Structures
3. Extending a Feature based Grammar
4. Summary
5. Further Reading
6. Exercises
Analyzing the Meaning of Sentences
1. Natural Language Understanding
2. ~~Propositional Logic~~
3. ~~First-Order Logic~~
4. Logic-based Semantics
5. The Semantics of English Sentences
6. ~~Discourse Semantics~~
7. Learning to build logical representations [tbd, depends on new implementation]
8. Summary
9. Further Reading
10. Exercises
Managing Linguistic Data
1. Corpus Structure: a Case Study
2. The Life-Cycle of a Corpus
3. Acquiring Data
4. Working with XML
5. ~~Working with Toolbox Data~~ Working with FLEx Data [stevenbird]
6. Describing Language Resources using OLAC Metadata
7. Summary
8. Further Reading
9. Exercises
Further Topics
1. Machine Translation [stevenbird]
  - Sentence Alignment (incl Gale-Church algorithm)
  - Word Alignment (IBM model 1; mention existence of other models)
  - Aligned Corpora
  - Decoding [depends on new implementation]
  - Evaluation
2. Twitter Processing
3. Sentiment Analysis
4. Design Patterns for NLP systems??
5. Further Reading
6. Exercises
Appendix: Enough Python for this Book (incorporating material from old chapters 1 and 4)
1. Getting started with Python (from 1.1)
2. Texts as lists of words (lists, variables, strings, from 1.2)
3. Making decisions and taking control (conditionals, comprehensions, nesting, from 1.4)
4. Sequences (includes tuples, from 4.2)
5. Functions and modules (from 2.3 and 4.4)
6. Doing more with functions (from 4.5, plus module structure from 4.6)
7. Getting started with NLTK (from 1.1 and 1.3)
8. Exercises
Appendix: Useful Python libraries for NLP (from 4.8)
1. matplotlib
2. networkx
3. csv
4. numpy
5. scikit-learn
6. gensim

#Workplan#

We are working on groups of chapters as indicated in the following diagram:

[Figure: Book development plan]

#Notes#

Images use Helvetica and are exported at 100 dpi.
