add rf bootstrapping explanation

singhsatvik · Apr 11, 2017 · 199fea0
1 parent 6ed6572
commit 199fea0

Showing 3 changed files with 35 additions and 48 deletions.
diff --git a/dim_reduct/PCA.ipynb b/dim_reduct/PCA.ipynb
@@ -2564,7 +2564,9 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": [
     "class PCAModel:\n",
@@ -2665,15 +2667,15 @@
     "pipe1 = Pipeline([\n",
     "    ( 'sc', StandardScaler() ),\n",
     "    ( 'pca', PCA(n_components = 3) ),\n",
-    "    ( 'dt', LogisticRegression(random_state = 1) )\n",
+    "    ( 'logistic', LogisticRegression(random_state = 1) )\n",
     "])\n",
     "pipe1.fit(X_train, y_train)\n",
     "y_pred1 = pipe1.predict(X_test)\n",
     "\n",
     "# pipeline without PCA\n",
     "pipe2 = Pipeline([\n",
     "    ( 'sc', StandardScaler() ),\n",
-    "    ( 'dt', LogisticRegression(random_state = 1) )\n",
+    "    ( 'logistic', LogisticRegression(random_state = 1) )\n",
     "])\n",
     "pipe2.fit(X_train, y_train)\n",
     "y_pred2 = pipe2.predict(X_test)\n",

diff --git a/requirements.txt b/requirements.txt
@@ -1,13 +1,16 @@
 h2o
-tqdm
-numba
-gensim
 seaborn
 networkx
-requests
-watermark
 python-dotenv
 jupyterthemes
-keras>=2.0.1
+tqdm>=4.11.2
+keras>=2.0.2
+numba>=0.31.0
+gensim>=1.0.1
+numpy>=1.12.0
+pandas>=0.19.2
+watermark>=1.3.4
+requests>=2.13.0
+matplotlib>=2.0.0
 tensorflow>=1.0.1
 scikit-learn>=0.18
diff --git a/trees/random_forest.ipynb b/trees/random_forest.ipynb
@@ -3,9 +3,7 @@
   {
    "cell_type": "code",
    "execution_count": 1,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -369,9 +367,7 @@
   {
    "cell_type": "code",
    "execution_count": 2,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -429,9 +425,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -474,9 +468,7 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -514,9 +506,13 @@
     "\n",
     "After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$:\n",
     "\n",
-    "$$ {\\hat {f}}={\\frac {1}{B}}\\sum _{b=1}^{B}{\\hat {f}}_{b}(x')$$\n",
+    "$$ {\\hat {f}}={\\frac {1}{B}}\\sum _{b=1}^{B}{f}_{b}(x')$$\n",
+    "\n",
+    "Or by taking the majority vote in the case of classification trees. If you are wondering why bootstrapping is a good idea, the rationale is:\n",
+    "\n",
+    "We wish to ask a question of a population but we can't. Instead, we take a sample and ask the question of it. How confident we should be that the sample answer is close to the population answer depends on the structure of the population. One way to learn about this is to take samples from the population again and again, ask them the question, and see how variable the sample answers tend to be. Often this isn't possible (we wouldn't relaunch the Titanic and crash it into another iceberg), so we use the information in the sample we actually have to learn about that variability.\n",
     "\n",
-    "or by taking the majority vote in the case of classification trees."
+    "This is a reasonable thing to do, not only because the sample you have is the best, indeed the only, information you have about what the population actually looks like, but also because most samples will, if they're randomly chosen, look quite like the population they came from. In the end, sampling 'with replacement' is just a convenient way to treat the sample like it's a population and to sample from it in a way that reflects its shape."
    ]
   },
   {
@@ -545,9 +541,7 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -701,9 +695,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -787,9 +779,7 @@
   {
    "cell_type": "code",
    "execution_count": 8,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -854,9 +844,7 @@
   {
    "cell_type": "code",
    "execution_count": 9,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -899,7 +887,6 @@
    "cell_type": "code",
    "execution_count": 10,
    "metadata": {
-    "collapsed": false,
     "scrolled": true
    },
    "outputs": [],
@@ -929,9 +916,7 @@
   {
    "cell_type": "code",
    "execution_count": 11,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -967,9 +952,7 @@
   {
    "cell_type": "code",
    "execution_count": 12,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -1026,9 +1009,7 @@
   {
    "cell_type": "code",
    "execution_count": 13,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "data": {
@@ -1088,6 +1069,7 @@
     "- [Notebook: useR machine learning tutorial Random Forest](http://nbviewer.jupyter.org/github/ledell/useR-machine-learning-tutorial/blob/master/random-forest.ipynb)\n",
     "- [Blog: Selecting good features – Part III: random forests](http://blog.datadive.net/selecting-good-features-part-iii-random-forests/)\n",
     "- [Blog: The Unreasonable Effectiveness of Random Forests](https://medium.com/rants-on-machine-learning/the-unreasonable-effectiveness-of-random-forests-f33c3ce28883#.pv7i5ien9)\n",
+    "- [StackExchange: Explaining to laypeople why bootstrapping works](http://stats.stackexchange.com/questions/26088/explaining-to-laypeople-why-bootstrapping-works/)\n",
     "- [Stackoverflow: RandomForestClassifier vs ExtraTreesClassifier in scikit learn](http://stackoverflow.com/questions/22409855/randomforestclassifier-vs-extratreesclassifier-in-scikit-learn?rq=1)\n",
     "- [Stackoverflow: How are feature_importances in RandomForestClassifier determined?](http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined)"
    ]
@@ -1096,9 +1078,9 @@
  "metadata": {
   "anaconda-cloud": {},
   "kernelspec": {
-   "display_name": "Python [Root]",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "Python [Root]"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
@@ -1133,5 +1115,5 @@
   }
  },
  "nbformat": 4,
- "nbformat_minor": 0
+ "nbformat_minor": 1
 }
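
Note on the bootstrapping paragraph added in trees/random_forest.ipynb: the idea of treating the sample as a stand-in for the population can be illustrated with a small sketch. This is not code from the notebook; the function name `bootstrap_means`, the synthetic Gaussian sample, and the choice of 1000 resamples are all assumptions made for illustration. It resamples the one sample we have with replacement and compares the spread of the resampled means against the usual analytic standard error.

```python
import random
import statistics


def bootstrap_means(sample, n_boot=1000, seed=0):
    """Return the mean of each of n_boot resamples drawn with replacement.

    Hypothetical helper for illustration; not part of the notebook.
    """
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_boot):
        # draw n indices with replacement, so some points repeat and some are left out
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return means


# the single sample we actually have (synthetic data, assumed for the example)
data_rng = random.Random(1)
sample = [data_rng.gauss(5.0, 2.0) for _ in range(200)]

boot = bootstrap_means(sample)

# spread of the bootstrap means estimates how variable the sample mean is
boot_se = statistics.stdev(boot)
# textbook standard error of the mean, for comparison
analytic_se = statistics.stdev(sample) / len(sample) ** 0.5

print('sample mean:             ', statistics.mean(sample))
print('bootstrap standard error:', boot_se)
print('analytic standard error: ', analytic_se)
```

The two standard errors should come out close to each other, which is the point of the paragraph: resampling the sample tells us roughly how much the answer would vary if we could resample the population. A random forest applies the same resampling to build each tree's training set, then averages or votes over the trees.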