Book Development
This page documents our plans for the development of the NLTK book, leading to a second edition.
-
Language Processing and the Natural Language Toolkit
- 0. Introduction: something to get attention; bring in some material from the preface (SB); revisit the reviews of the old chapter 1
- NLP systems: translation (nearly done), dialogue (done), summarisation (SB), web search (EL), question answering (logic-rich, inferencing approach EK), recommendations (similarity metrics over document collections, EL), sentiment analysis (EK). Pointers to demonstrations online (links hosted at nltk.org to avoid link rot). Motivation. Architecture. Limitations. Discussion to highlight the non-trivial NLP involved. Help readers understand the breadth and limitations of NLP.
  - description, the pieces you need to solve it (architecture diagram if necessary)
  - the fact that there's overlap between these in terms of the required subtasks
  - some very different approaches exist for the above (favour popularity, reasonableness, coverage of approaches across the whole set)
- Sub-tasks: WSD (EK), pronoun resolution/coreference (EL), paraphrasing (EK), finding things in text (SB), language modeling (EL), collocations (SB), sentence segmentation (SB), lexicon (associating meaning with words, and learning those associations automatically), normalization (stemming, Unicode, case, Twitter-speak) (EK?), syntax (how different words in the sentence relate to one another, e.g., agent of a verb) (EK?), named entity recognition (EL). (Identify tasks and writers by mid-June.)
  - show that these are non-trivial tasks (requires an example), but also do-able (outline of how it works)
  - each block of the architecture diagrams is a candidate here
  - include some indication of what performance to expect on each task (state of the art performance?)
- Overview of NLTK (simple things you can do, intended to engage readers with the content, linguists and non-linguists alike) (SB)
  - a little bit less on counting
  - tokenization, tagging, count more interesting things
  - parsing?
- Overview of book
- Summary (with forward pointers)
- Further Reading
- Exercises
-
Accessing Text Corpora and Lexical Resources
- Accessing Text Corpora (mention NomBank, PropBank) [ewanklein]
- Conditional Frequency Distributions
- Lexical Resources (add FrameNet, VerbNet, SentiWordNet) [edloper]
- WordNet (add Open Multilingual WordNet) [stevenbird]
- Summary
- Further Reading
- Exercises
-
Processing Raw Text
- Accessing Text from the Web and from Disk (add Twitter) [ewanklein]
- Strings: Text Processing at the Lowest Level
- Text Processing with Unicode (updated for Python 3, including the bytes type; this will already be done) [edloper]
- Regular Expressions for Detecting Word Patterns
- Useful Applications of Regular Expressions
- Normalizing Text (add Twitter) [ewanklein]
- Regular Expressions for Tokenizing Text
- Segmentation
- Formatting: From Lists to Strings (update to use str.format; this will already be done)
- Scaling Up (incl how to use the Stanford tokenizer) [stevenbird]
- Summary
- Further Reading
- Exercises
-
Language Modeling [edloper]
- n-gram models
- bins: forming equivalence classes
- independence assumptions
- sparse data problems
- statistical estimators (MLE, Laplace, held-out, etc.)
- combining estimators
- word clusters and word similarity
- word embeddings
- Scaling Up: show how to perform some task that we've already performed, but using an external tool and a larger data set. Should be a short section, but enough to show how to use the interface to the external tool and to show the performance difference.
- Summary
- Further Reading
- Exercises
-
Categorizing and Tagging Words
- Using a Tagger
- Tagged Corpora (mention MASC tagged corpus?) [stevenbird]
- Mapping Words to Properties Using Python Dictionaries
- Automatic Tagging
- N-Gram Tagging
- Transformation-Based Tagging
- How to Determine the Category of a Word
- Scaling Up (incl how to use the Stanford tagger) [stevenbird]
- Summary
- Further Reading
- Exercises
-
Learning to Classify Text
- Supervised Classification
- Further Examples of Supervised Classification
- Evaluation
- Decision Trees
- Naive Bayes Classifiers
- Maximum Entropy Classifiers
- Modeling Linguistic Patterns
- Sentiment Detection (incl SentiWordNet); here or in chapter 7? [ewanklein]
- Scaling Up [edloper]
- Summary
- Further Reading
- Exercises
-
Extracting Information from Text
- Information Extraction
- Chunking (decide which interface to use: chunking, tagging, or parsing)
- Developing and Evaluating Chunkers
- Recursion in Linguistic Structure
- Named Entity Recognition
- Relation Extraction
- Scaling Up (incl how to use the Stanford chunker and named entity recognizer) [edloper]
- Summary
- Further Reading
- Exercises
-
Analyzing Sentence Structure
- Some Grammatical Dilemmas
- What's the Use of Syntax?
- Context Free Grammar
- Parsing With Context Free Grammar
- Combinatory Categorial Grammar [ewanklein]
- Dependencies and Dependency Grammar [edloper]
  - split into two sections:
    - (a) heads, arguments, and roles (mentions FrameNet, VerbNet, NomBank, PropBank)
    - (b) dependency grammar and dependency parsing
- Grammar Development (could this work in chapter 9?)
- Scaling Up (incl how to use the Stanford parser) [edloper]
- Summary
- Further Reading
- Exercises
-
Building Feature Based Grammars
- Grammatical Features
- Processing Feature Structures
- Extending a Feature based Grammar
- Summary
- Further Reading
- Exercises
-
Analyzing the Meaning of Sentences
- Natural Language Understanding
- Propositional Logic
- First-Order Logic
- Logic-based Semantics
- The Semantics of English Sentences
- Discourse Semantics
- Learning to build logical representations [tbd, depends on new implementation]
- Summary
- Further Reading
- Exercises
-
Managing Linguistic Data
- Corpus Structure: a Case Study
- The Life-Cycle of a Corpus
- Acquiring Data
- Working with XML
- Working with Toolbox Data
- Working with FLEx Data [stevenbird]
- Describing Language Resources using OLAC Metadata
- Summary
- Further Reading
- Exercises
-
Further Topics
- Machine Translation [stevenbird]
  - Sentence Alignment (incl Gale-Church algorithm)
  - Word Alignment (IBM Model 1; mention existence of other models)
  - Aligned Corpora
  - Decoding [depends on new implementation]
  - Evaluation
- Twitter Processing
- Sentiment Analysis
- Design Patterns for NLP systems??
- Further Reading
- Exercises
-
Appendix: Enough Python for this Book (incorporating material from old chapters 1 and 4)
- Getting started with Python (from 1.1)
- Texts as lists of words (lists, variables, strings, from 1.2)
- Making decisions and taking control (conditionals, comprehensions, nesting, from 1.4)
- Sequences (includes tuples, from 4.2)
- Functions and modules (from 2.3 and 4.4)
- Doing more with functions (from 4.5, plus module structure from 4.6)
- Getting started with NLTK (from 1.1 and 1.3)
- Exercises
-
Appendix: Useful Python libraries for NLP (from 4.8)
- matplotlib
- networkx
- csv
- numpy
- scikit-learn
- gensim
Workplan
We are working on groups of chapters as indicated in the following diagram:
Notes
Images use Helvetica and are exported at 100 dpi.