Pivoted Document Length Normalization

To Use:

results/crowdflower_results_large.png
- Note: currently segfaults on full 2d search
- With the parameters that were most accurate in training (65%), the model still has low precision (64%) despite high recall (90%). The main issue is that it predicts roughly 82% of results as positive.
- A dummy "guess class 1" strategy gives 61% accuracy. TF-IDF improves this to 64%.
- Slope doesn't seem to have a major effect, though it contributes the most accuracy in the range suggested by SBM (around 0.8).
results/crowdflower_results_small.png
- Suggests the nested 1d strategy is sensible.
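
For reference, the slope discussed above enters the pivoted normalization factor as `(1 - slope) * pivot + slope * old_norm`. The sketch below is only an illustration and is not taken from `pdln_classifier.py`: the function names are hypothetical, the old normalization is assumed to be raw document length, and the pivot is assumed to be the average length over the collection.

```python
def pivoted_norm(length, pivot, slope=0.8):
    """Pivoted normalization factor for one document.

    Assumption: `length` is the old normalization (raw document length).
    """
    return (1.0 - slope) * pivot + slope * length


def pivoted_norms(lengths, slope=0.8):
    """Apply pivoted normalization to a collection of document lengths.

    Assumption: the pivot is the mean length of the collection.
    """
    pivot = sum(lengths) / len(lengths)
    return [pivoted_norm(n, pivot, slope) for n in lengths]


if __name__ == "__main__":
    # Three hypothetical documents of 50, 100 and 150 tokens.
    print(pivoted_norms([50, 100, 150], slope=0.8))
```

With a slope below 1, documents shorter than the pivot get a factor larger than their length and longer documents get a smaller one. A slope of 1 reduces to plain length normalization, and a slope of 0 normalizes every document by the pivot.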

Crowdflower data: Right now, we measure accuracy on a binary "relevance" label (a Crowdflower "relevance score" of 4 means "relevant retrieval"; anything else means "irrelevant retrieval"). It would probably be better to learn the full score (range: 0-4) than to collapse it into a binary classification. The collapse probably causes a lot of the problems in the model.
Also, the notion that a single threshold determines "relevance" might be misapplied.
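
To make the binary collapse and the reported metrics concrete, here is a minimal sketch of how the labels and the accuracy, precision and recall figures could be computed. The helper names and the toy scores are hypothetical; only the "score == 4 means relevant" rule comes from the notes above.

```python
def binarize(scores, relevant_score=4):
    """Collapse 0-4 relevance scores into 1 (relevant) or 0 (irrelevant)."""
    return [1 if s == relevant_score else 0 for s in scores]


def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    scores = [4, 4, 3, 1, 4, 2]      # made-up relevance scores
    y_true = binarize(scores)
    dummy = [1] * len(y_true)        # the "guess class 1" baseline
    print(binary_metrics(y_true, dummy))
```

The dummy baseline always reaches a recall of 1.0, and its accuracy and precision both equal the share of relevant items, which is why a model that predicts most results as positive can show high recall while barely beating it.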

- Remove static class variables in classifier class
- Try RandomizedSearchCV instead of GridSearchCV
- See if we can get n_jobs=2 or n_jobs=-1 to work on large dataset
- SBM section 3: idf is used in query terms, not in doc weights
To optimize:
- Use PAIRWISE rankings?
  - Note: This exponentially increases the number of "pivots" because there are no longer P(ret) and P(rel) curves that cross.
- Metrics (later): Old: precision, recall, f-score. New: ROC, DCG and variants.
- Chapelle: Judgement metric for the contest was NDCG
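
As a starting point for the DCG and NDCG metrics listed above, a minimal sketch follows. It assumes the exponential-gain variant, `(2^rel - 1) / log2(rank + 1)` with ranks starting at 1, which is one common definition and may not be the exact one used in the contest. The function names are hypothetical.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevances.

    Assumption: exponential gain (2**rel - 1) and a log2 rank discount.
    """
    rels = relevances if k is None else relevances[:k]
    # enumerate() starts at 0, so rank 1 is discounted by log2(2) = 1.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))


def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Made-up 0-4 relevance scores in the order the system ranked them.
    print(ndcg([3, 4, 0, 2, 1]))
```

Unlike accuracy on a binary label, this uses the full 0-4 score and rewards placing the most relevant results first.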
TODO: Dump to .csv
