update class 14 materials
justmarkham committed Oct 2, 2015
1 parent 52fff15 commit f4cbf36
Showing 3 changed files with 1,922 additions and 88 deletions.
13 changes: 13 additions & 0 deletions homework/14_spam_filtering.md
@@ -5,7 +5,20 @@ Read Paul Graham's [A Plan for Spam](http://www.paulgraham.com/spam.html).
Here are some questions to think about:

- Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    - Specificity, in order to minimize false positives (non-spam being incorrectly marked as spam).
- Before he tried the "statistical approach" to spam filtering, what was his approach?
    - He hand-engineered features and used those features to compute a score.
- What are the key components of his statistical filtering system? In other words, how does it work?
    - Scan the entire text (including headers) and tokenize it.
    - Count the number of occurrences of each token in the ham corpus and the spam corpus (separately).
    - Assign each token a "spam score" based on its relative frequency in the corpora.
    - For new email, only take into account the 15 most "interesting" tokens.
- What did Paul say were some of the benefits of the statistical approach?
    - It works better (almost no false positives).
    - It requires less work because it discovers features automatically.
    - The "spam score" is interpretable.
    - It can easily be tuned to the individual user.
    - It evolves with the spam.
    - It creates an implicit whitelist/blacklist of email addresses, server names, etc.
- How good was his prediction of the "spam of the future"?
    - Great!
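
The statistical filtering steps listed above (tokenize, count tokens per corpus, score each token, combine the 15 most "interesting" scores) can be sketched roughly as follows. This is a minimal illustration, not the essay's actual code. The function names `tokenize`, `train`, and `spam_probability` are hypothetical. The specific constants (doubling ham counts, clamping scores to 0.01 and 0.99, scoring unseen tokens at 0.4, and the token character set) are recalled from the essay and should be checked against it. The essay's minimum-occurrence threshold for tokens is left out so that the toy corpora stay small, and both corpora are assumed to be non-empty.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    # Assumption: letters, digits, dashes, apostrophes and dollar signs
    # count as token characters; everything else is a separator.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(ham_docs, spam_docs):
    """Return a dict mapping each token to an estimated spam score."""
    ham_counts = Counter(t for doc in ham_docs for t in tokenize(doc))
    spam_counts = Counter(t for doc in spam_docs for t in tokenize(doc))
    n_ham, n_spam = len(ham_docs), len(spam_docs)
    scores = {}
    for token in set(ham_counts) | set(spam_counts):
        # Ham counts are doubled to bias the filter against false positives.
        ham_freq = min(1.0, 2 * ham_counts[token] / n_ham)
        spam_freq = min(1.0, spam_counts[token] / n_spam)
        score = spam_freq / (ham_freq + spam_freq)
        # Clamp so that no single token is treated as conclusive evidence.
        scores[token] = min(0.99, max(0.01, score))
    return scores


def spam_probability(text, scores, n_interesting=15, unseen=0.4):
    """Combine the most 'interesting' token scores into one probability."""
    probs = [scores.get(t, unseen) for t in set(tokenize(text))]
    # 'Interesting' means farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n_interesting]
    prod_spam = math.prod(top)
    prod_ham = math.prod(1 - p for p in top)
    return prod_spam / (prod_spam + prod_ham)


# Toy corpora, invented purely for illustration.
ham = ["lunch meeting tomorrow at noon", "please review the attached report"]
spam = ["free money click now", "click here for free prizes"]
scores = train(ham, spam)
print(spam_probability("free money click here", scores))
print(spam_probability("review the report before the meeting", scores))
```

Doubling the ham counts is what ties this sketch back to the first question: it deliberately trades some sensitivity for specificity, so that legitimate email is less likely to be flagged.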