forked from briandalessandro/DataScienceCourse
Commit ef72f86
Chris Ick authored and committed on Sep 27, 2018
1 parent d4dad30
Showing 7 changed files with 2,669 additions and 54 deletions.
440 changes: 440 additions & 0 deletions
ipython/Labs_Student/.ipynb_checkpoints/Lab2_NumPy_Vectorization-checkpoint.ipynb
Large diffs are not rendered by default.
383 changes: 383 additions & 0 deletions
ipython/Labs_Student/.ipynb_checkpoints/Lab2_NumPy_Vectorization_Student-checkpoint.ipynb
@@ -0,0 +1,383 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we'll generate a random matrix"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#Number of columns (features)\n",
"K = 5\n",
"\n",
"#Number of records\n",
"N = 1000\n",
"\n",
"#Generate an NxK matrix of uniform random variables\n",
"X = np.random.random([N,K])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's peek at our data to confirm it looks as we expect."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.05939443, 0.97717503, 0.70381419, 0.27202192, 0.2447514 ],\n",
"       [0.62225124, 0.93225812, 0.10888633, 0.69672273, 0.14079667],\n",
"       [0.97888812, 0.88580616, 0.83562838, 0.78810289, 0.37799006],\n",
"       ...,\n",
"       [0.0077761 , 0.68383434, 0.8977181 , 0.9624185 , 0.75589448],\n",
"       [0.88220234, 0.35992694, 0.94726334, 0.99901007, 0.7432114 ],\n",
"       [0.43206062, 0.30307061, 0.13552798, 0.19924432, 0.80671002]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Student - Put in a command to view the first 100 rows\n",
"X[:100]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1000, 5)\n"
]
}
],
"source": [
"#Student - put in a command to see the dimensions of X\n",
"print(X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This exercise is about designing a scoring function for a logistic regression. As we are not concerned with fitting a model to data, we can just make up a logistic regression. <br> <br>\n",
"\n",
"For a quick intro: a logistic regression takes the form $\\hat{Y} = f(x \\beta^T)$, where $x$ is a $1 \\times K$ vector of features and $\\beta$ is a $1 \\times K$ vector of weights. The function $f$, called a 'link' function, is the inverse logit: <br><br>\n",
"\n",
"<center>$f(a)=\\frac{1}{1+e^{-a}}$</center> <br><br>\n",
"\n",
"In this notebook we'll write a function that, given inputs of $X$ and $\\beta$, returns a value for $\\hat{Y}$.\n",
"<br><br>\n",
"First let's generate a random set of weights to represent $\\beta$.\n"
]
},
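{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the link function before we use it: $f(0)$ should be exactly $0.5$, and $f(a)$ should approach $1$ or $0$ for large positive or negative $a$. The cell below is a minimal sketch of that check; the helper name inv_logit is just an illustrative choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative sanity check of the inverse logit (helper name is ours)\n",
"def inv_logit(a):\n",
"    return 1 / (1 + np.exp(-a))\n",
"\n",
"#f(0) = 0.5 by symmetry; f(10) is near 1 and f(-10) is near 0\n",
"print(inv_logit(0), inv_logit(10), inv_logit(-10))"
]
},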
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-0.98793182, -0.19204666, -0.08507233, -0.51605049,  0.12759377])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Student - generate a K dimensional vector of uniform random variables in the interval [-1, 1]\n",
"beta = np.random.random([K])*2-1\n",
"beta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how we applied a neat NumPy trick here. The numpy.random.random() function returns an array, yet we applied what appears to be a scalar operation to the vector. This is an example of what NumPy calls vectorization (a major point of this tutorial), which offers us both a very fast way to run vector computations and a clean, concise style of coding.\n",
"\n",
"<br><br>\n",
"\n",
"<b>Question: we designed the above $\\beta$ vector such that $E[\\beta_i]=0$. How can we confirm that we did this correctly?</b>"
]
},
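{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the same trick on a simpler example, the cell below (a minimal illustration on a small hand-made array) broadcasts the scalar arithmetic element-wise, mapping $[0, 1]$ onto $[-1, 1]$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Minimal illustration: scalar arithmetic is broadcast element-wise\n",
"v = np.array([0.0, 0.5, 1.0])\n",
"print(v * 2 - 1)   #maps [0, 1] onto [-1, 1], giving [-1. 0. 1.]"
]
},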
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-0.3307015072167897\n"
]
}
],
"source": [
"#start by taking the mean of the beta we already calculated\n",
"\n",
"print(np.mean(beta))\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"#It is likely the above is not equal to zero. Let's simulate this 100k times and see what the distribution of means is\n",
"#Student input code here\n",
"means = []\n",
"for i in range(int(1e5)):\n",
"    b = np.random.random([K])*2 - 1\n",
"    means.append(np.mean(b))\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use matplotlib's hist function to plot the histogram of means here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(means)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should expect the distribution to be centered around zero. Is it? As a fun technical side note, let's dive a little deeper into what this distribution should look like. The histogram shows the distribution of the average of 5 uniformly distributed random variables, taken over many different samples. Can we compare this to a theoretical distribution?<br>\n",
"\n",
"Yes we can! We sampled each $\\beta_i$ from a uniform distribution over the interval $[-1, 1]$. The variance of a uniform random variable is given by $(1/12)(b - a)^2$, where $a$ and $b$ are the endpoints of the support interval. The standard error (the standard deviation of the mean) of a sample of size K with $Var(X) = \\sigma^2$ is $\\sigma / \\sqrt{K}$. <br>\n",
"\n",
"Given the above, we should expect our distribution of averages to be approximately normal with mean = 0 and var = $(12 \\cdot 5)^{-1} \\cdot (1 - (-1))^2 = 1/15 \\approx 0.0667$. Let's compare this normal distribution to our sample above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Compute a vector from the normal distribution specified above\n",
"from scipy.stats import norm\n",
"mu = 0\n",
"sig = np.sqrt(4 / 60.0)\n",
"xs = np.linspace(-1, 1, 1000)\n",
"ys = norm.pdf(xs, mu, sig)\n",
"\n",
"#density=True normalizes the histogram (the old normed argument is deprecated)\n",
"plt.hist(means, density = True)\n",
"plt.plot(xs, ys)\n",
"plt.show()\n"
]
},
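{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check this numerically: the sample variance of the simulated means should land close to the theoretical value $1/15 \\approx 0.0667$. A quick sketch of that check, assuming the means list from the simulation above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Quick check: empirical variance of the simulated means vs. theory (assumes means from above)\n",
"print('empirical: ', np.var(means))\n",
"print('theoretical:', 1/15.0)"
]
},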
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's write our scoring function. Let's try to use as much of NumPy's internal optimization as possible (hint: this can be done in two lines, without writing any loops). The key is that NumPy functions that would normally take in a scalar can also take in an array; the function applies the operation element-wise to the array and returns an array, e.g.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ex_array = np.array([-1, 1])\n",
"np.abs(ex_array)"
]
},
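{
"cell_type": "markdown",
"metadata": {},
"source": [
"np.exp behaves the same way, which is exactly what the link function needs (a small illustration reusing ex_array from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#np.exp is likewise applied element-wise, returning an array\n",
"np.exp(ex_array)"
]
},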
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use this feature to write a fast and clean scoring function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def score_logistic_regression(X, beta):\n",
"    '''\n",
"    This function takes in an NxK matrix X and 1xK vector beta.\n",
"    The function should apply the logistic scoring function to each record of X.\n",
"    The output should be an Nx1 vector of scores\n",
"    '''\n",
"    \n",
"    #First let's calculate X*beta - make sure to use numpy's 'dot' method\n",
"    #student - put in code here\n",
"    a = np.dot(X, beta)\n",
"    #Now let's input this into the link function\n",
"    #student - put in code here\n",
"    prob_score = 1 / (1 + np.exp(-a))\n",
"    return prob_score"
]
},
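{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, a quick spot check (a minimal sketch, assuming the X and beta from above): the output should be a length-N vector of probabilities strictly between 0 and 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Spot check: N scores, all inside the open interval (0, 1)\n",
"scores = score_logistic_regression(X, beta)\n",
"print(scores.shape)\n",
"print(scores.min(), scores.max())"
]
},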
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So how much faster is it by using NumPy? We can test this by writing the same function with no NumPy, executing via loops."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def score_logistic_regression_NoNumpy(X, beta):\n",
"    '''\n",
"    This function takes in an NxK matrix X and 1xK vector beta.\n",
"    The function should apply the logistic scoring function to each record of X.\n",
"    The output should be an Nx1 vector of scores\n",
"    '''\n",
"    #Let's calculate xbeta using loops\n",
"    xbeta = []\n",
"    for row in X:\n",
"        xb = 0\n",
"        for i, el in enumerate(row):\n",
"            #Student - compute X*Beta in the loop\n",
"            xb += el * beta[i]\n",
"        xbeta.append(xb)\n",
"    \n",
"    #Now let's apply the link function to each xbeta\n",
"    prob_score = []\n",
"    for xb in xbeta:\n",
"        #student - compute p in the loop\n",
"        p = 1 / (1 + np.e**(-xb))\n",
"        prob_score.append(p)\n",
"    \n",
"    return prob_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before doing any analysis, let's test the output of each to make sure they are equal"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Student - write a unit test that calls each function with the same inputs and checks to see they return the same values.\n",
"wnp = score_logistic_regression(X, beta)\n",
"wonp = score_logistic_regression_NoNumpy(X, beta)\n",
"\n",
"#np.allclose compares element-wise within a floating point tolerance\n",
"print(np.allclose(wnp, wonp))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If they are equal, then we can proceed with the timing analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%timeit score_logistic_regression_NoNumpy(X, beta)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%timeit score_logistic_regression(X, beta)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
} |