add rf bootstrapping explanation
ethen8181 committed Apr 11, 2017
1 parent 6ed6572 commit 199fea0
Showing 3 changed files with 35 additions and 48 deletions.
8 changes: 5 additions & 3 deletions dim_reduct/PCA.ipynb
@@ -2564,7 +2564,9 @@
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class PCAModel:\n",
@@ -2665,15 +2667,15 @@
"pipe1 = Pipeline([\n",
" ( 'sc', StandardScaler() ),\n",
" ( 'pca', PCA(n_components = 3) ),\n",
" ( 'dt', LogisticRegression(random_state = 1) )\n",
" ( 'logistic', LogisticRegression(random_state = 1) )\n",
"])\n",
"pipe1.fit(X_train, y_train)\n",
"y_pred1 = pipe1.predict(X_test)\n",
"\n",
"# pipeline without PCA\n",
"pipe2 = Pipeline([\n",
" ( 'sc', StandardScaler() ),\n",
" ( 'dt', LogisticRegression(random_state = 1) )\n",
" ( 'logistic', LogisticRegression(random_state = 1) )\n",
"])\n",
"pipe2.fit(X_train, y_train)\n",
"y_pred2 = pipe2.predict(X_test)\n",
15 changes: 9 additions & 6 deletions requirements.txt
@@ -1,13 +1,16 @@
h2o
tqdm
numba
gensim
seaborn
networkx
requests
watermark
python-dotenv
jupyterthemes
keras>=2.0.1
tqdm>=4.11.2
keras>=2.0.2
numba>=0.31.0
gensim>=1.0.1
numpy>=1.12.0
pandas>=0.19.2
watermark>=1.3.4
requests>=2.13.0
matplotlib>=2.0.0
tensorflow>=1.0.1
scikit-learn>=0.18
60 changes: 21 additions & 39 deletions trees/random_forest.ipynb
@@ -3,9 +3,7 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -369,9 +367,7 @@
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -429,9 +425,7 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -474,9 +468,7 @@
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -514,9 +506,13 @@
"\n",
"After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$:\n",
"\n",
"$$ {\\hat {f}}={\\frac {1}{B}}\\sum _{b=1}^{B}{\\hat {f}}_{b}(x')$$\n",
"$$ {\\hat {f}}={\\frac {1}{B}}\\sum _{b=1}^{B}{f}_{b}(x')$$\n",
"\n",
"Or by taking the majority vote in the case of classification trees. If you are wondering why bootstrapping is a good idea, the rationale is:\n",
"\n",
"We wish to ask a question of a population but we can't, so we take a sample and ask the question of it instead. How confident we should be that the sample answer is close to the population answer depends on the structure of the population. One way to learn about this is to take samples from the population again and again, ask them the question, and see how variable the sample answers tend to be. Often this isn't possible (we wouldn't relaunch the Titanic and crash it into another iceberg), so we instead use the information in the sample we actually have to learn about that variability.\n",
"\n",
"or by taking the majority vote in the case of classification trees."
"This is reasonable not only because the sample you have is the best, indeed the only, information you have about what the population actually looks like, but also because most randomly chosen samples look quite like the population they came from. In the end, sampling 'with replacement' is just a convenient way to treat the sample as if it were a population and to sample from it in a way that reflects its shape."
]
},
{
@@ -545,9 +541,7 @@
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -701,9 +695,7 @@
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -787,9 +779,7 @@
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -854,9 +844,7 @@
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -899,7 +887,6 @@
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
@@ -929,9 +916,7 @@
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -967,9 +952,7 @@
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -1026,9 +1009,7 @@
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"metadata": {},
"outputs": [
{
"data": {
@@ -1088,6 +1069,7 @@
"- [Notebook: useR machine learning tutorial Random Forest](http://nbviewer.jupyter.org/github/ledell/useR-machine-learning-tutorial/blob/master/random-forest.ipynb)\n",
"- [Blog: Selecting good features – Part III: random forests](http://blog.datadive.net/selecting-good-features-part-iii-random-forests/)\n",
"- [Blog: The Unreasonable Effectiveness of Random Forests](https://medium.com/rants-on-machine-learning/the-unreasonable-effectiveness-of-random-forests-f33c3ce28883#.pv7i5ien9)\n",
"- [StackExchange: Explaining to laypeople why bootstrapping works](http://stats.stackexchange.com/questions/26088/explaining-to-laypeople-why-bootstrapping-works/)\n",
"- [Stackoverflow: RandomForestClassifier vs ExtraTreesClassifier in scikit learn](http://stackoverflow.com/questions/22409855/randomforestclassifier-vs-extratreesclassifier-in-scikit-learn?rq=1)\n",
"- [Stackoverflow: How are feature_importances in RandomForestClassifier determined?](http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined)"
]
@@ -1096,9 +1078,9 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"display_name": "Python 3",
"language": "python",
"name": "Python [Root]"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -1133,5 +1115,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 0
"nbformat_minor": 1
}
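
The bootstrapping rationale added to trees/random_forest.ipynb can be illustrated with a minimal sketch. It is not code from the notebook: the helper names `bootstrap_sample` and `bagged_estimate`, the toy `sample` values, and the choice of the mean as the estimator are all assumptions made for illustration. In a random forest, the estimator would instead be a tree fitted on each resample and evaluated at $x'$.

```python
import random
import statistics


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (hypothetical helper)."""
    return [rng.choice(data) for _ in data]


def bagged_estimate(data, estimator, n_rounds=200, seed=0):
    """Average an estimator over n_rounds bootstrap resamples.

    Returns the averaged estimate (the 1/B * sum over b of f_b) and the
    standard deviation of the per-resample estimates, which indicates how
    variable the sample answer is.
    """
    rng = random.Random(seed)
    estimates = [estimator(bootstrap_sample(data, rng)) for _ in range(n_rounds)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


# Toy sample standing in for the data we actually observed.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
avg, spread = bagged_estimate(sample, statistics.fmean)
print(f"bagged mean: {avg:.3f}, bootstrap spread: {spread:.3f}")
```

Each resample treats the observed sample as if it were the population, so the spread of the per-resample estimates approximates how much the answer would vary across fresh samples.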