forked from briandalessandro/DataScienceCourse
Commit ef72f86
Chris Ick authored and committed on Sep 27, 2018
1 parent d4dad30
Showing 7 changed files with 2,669 additions and 54 deletions.
440 changes: 440 additions & 0 deletions
ipython/Labs_Student/.ipynb_checkpoints/Lab2_NumPy_Vectorization-checkpoint.ipynb
Large diffs are not rendered by default.
383 changes: 383 additions & 0 deletions
ipython/Labs_Student/.ipynb_checkpoints/Lab2_NumPy_Vectorization_Student-checkpoint.ipynb
@@ -0,0 +1,383 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we'll generate a random matrix"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#Number of columns (features)\n",
"K = 5\n",
"\n",
"#Number of records\n",
"N = 1000\n",
"\n",
"#Generate an NxK matrix of uniform random variables\n",
"X = np.random.random([N,K])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's peek at our data to confirm it looks as we expect."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.05939443, 0.97717503, 0.70381419, 0.27202192, 0.2447514 ],\n",
"       [0.62225124, 0.93225812, 0.10888633, 0.69672273, 0.14079667],\n",
"       [0.97888812, 0.88580616, 0.83562838, 0.78810289, 0.37799006],\n",
"       ...,\n",
"       [0.0077761 , 0.68383434, 0.8977181 , 0.9624185 , 0.75589448],\n",
"       [0.88220234, 0.35992694, 0.94726334, 0.99901007, 0.7432114 ],\n",
"       [0.43206062, 0.30307061, 0.13552798, 0.19924432, 0.80671002]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Student - Put in a command to view the first 100 rows\n",
"X[:100]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1000, 5)\n"
]
}
],
"source": [
"#Student - put in a command to see the dimensions of X\n",
"print(X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This exercise is about designing a scoring function for a logistic regression. As we are not concerned with fitting a model to data, we can just make up a logistic regression. <br> <br>\n",
"\n",
"For a quick intro: a logistic regression takes the form $\\hat{Y} = f(x \\beta^T)$, where $x$ is a $1 \\times K$ vector of features and $\\beta$ is a $1 \\times K$ vector of weights. The function $f$, called a 'link' function, is the inverse logit: <br><br>\n",
"\n",
"<center>$f(a)=\\frac{1}{1+e^{-a}}$</center> <br><br>\n",
"\n",
"In this notebook we'll write a function that, given inputs of $X$ and $\\beta$, returns a value for $\\hat{Y}$.\n",
"<br><br>\n",
"First let's generate a random set of weights to represent $\\beta$.\n"
]
},
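{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the link function before we use it: $f(0)$ should be exactly $0.5$, and $f(a)$ should approach $1$ or $0$ for large positive or negative $a$. The cell below is a minimal sketch of that check; the helper name inv_logit is just an illustrative choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative sanity check of the inverse logit (helper name is ours)\n",
"def inv_logit(a):\n",
"    return 1 / (1 + np.exp(-a))\n",
"\n",
"#f(0) = 0.5 by symmetry; f(10) is near 1 and f(-10) is near 0\n",
"print(inv_logit(0), inv_logit(10), inv_logit(-10))"
]
},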
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-0.98793182, -0.19204666, -0.08507233, -0.51605049,  0.12759377])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Student - generate a K dimensional vector of uniform random variables in the interval [-1, 1]\n",
"beta = np.random.random([K])*2-1\n",
"beta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how we applied a neat NumPy trick here. The numpy.random.random() function returns an array, yet we applied what appears to be a scalar operation to the vector. This is an example of what NumPy calls vectorization (a major point of this tutorial), which offers us both a very fast way to run vector computations and a clean, concise style of coding.\n",
"\n",
"<br><br>\n",
"\n",
"<b>Question: we designed the above $\\beta$ vector such that $E[\\beta_i]=0$. How can we confirm that we did this correctly?</b>"
]
},
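{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the same trick on a simpler example, the cell below (a minimal illustration on a small hand-made array) broadcasts the scalar arithmetic element-wise, mapping $[0, 1]$ onto $[-1, 1]$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Minimal illustration: scalar arithmetic is broadcast element-wise\n",
"v = np.array([0.0, 0.5, 1.0])\n",
"print(v * 2 - 1)   #maps [0, 1] onto [-1, 1], giving [-1. 0. 1.]"
]
},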
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-0.3307015072167897\n"
]
}
],
"source": [
"#start by taking the mean of the beta we already calculated\n",
"\n",
"print(np.mean(beta))\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"#It is likely the above is not equal to zero. Let's simulate this 100k times and see what the distribution of means is\n",
"#Student input code here\n",
"means = []\n",
"for i in range(int(1e5)):\n",
"    b = np.random.random([K])*2 - 1\n",
"    means.append(np.mean(b))\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use matplotlib's hist function to plot the histogram of means here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(means)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should expect the distribution to be centered around zero. Is it? As a fun technical side note, let's dive a little deeper into what this distribution should look like. The histogram shows the distribution of the average of 5 uniformly distributed random variables, taken over many different samples. Can we compare this to a theoretical distribution?<br>\n",
"\n",
"Yes we can! We sampled each $\\beta_i$ from a uniform distribution over the interval $[-1, 1]$. The variance of a uniform random variable is given by $(1/12)(b - a)^2$, where $a$ and $b$ are the endpoints of the support interval. The standard error (the standard deviation of the mean) of a sample of size K with $Var(X) = \\sigma^2$ is $\\sigma / \\sqrt{K}$. <br>\n",
"\n",
"Given the above, we should expect our distribution of averages to be approximately normal with mean = 0 and var = $(12 \\cdot 5)^{-1} \\cdot (1 - (-1))^2 = 1/15 \\approx 0.0667$. Let's compare this normal distribution to our sample above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Compute a vector from the normal distribution specified above\n",
"from scipy.stats import norm\n",
"mu = 0\n",
"sig = np.sqrt(4 / 60.0)\n",
"xs = np.linspace(-1, 1, 1000)\n",
"ys = norm.pdf(xs, mu, sig)\n",
"\n",
"#density=True normalizes the histogram (the old normed argument is deprecated)\n",
"plt.hist(means, density = True)\n",
"plt.plot(xs, ys)\n",
"plt.show()\n"
]
},
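{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check this numerically: the sample variance of the simulated means should land close to the theoretical value $1/15 \\approx 0.0667$. A quick sketch of that check, assuming the means list from the simulation above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Quick check: empirical variance of the simulated means vs. theory (assumes means from above)\n",
"print('empirical: ', np.var(means))\n",
"print('theoretical:', 1/15.0)"
]
},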
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's write our scoring function. Let's try to use as much of NumPy's internal optimization as possible (hint: this can be done in two lines, without writing any loops). The key is that NumPy functions that would normally take in a scalar can also take in an array; the function applies the operation element-wise to the array and returns an array, e.g.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ex_array = np.array([-1, 1])\n",
"np.abs(ex_array)"
]
},
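{
"cell_type": "markdown",
"metadata": {},
"source": [
"np.exp behaves the same way, which is exactly what the link function needs (a small illustration reusing ex_array from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#np.exp is likewise applied element-wise, returning an array\n",
"np.exp(ex_array)"
]
},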
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use this feature to write a fast and clean scoring function"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def score_logistic_regression(X, beta):\n",
"    '''\n",
"    This function takes in an NxK matrix X and 1xK vector beta.\n",
"    The function should apply the logistic scoring function to each record of X.\n",
"    The output should be an Nx1 vector of scores\n",
"    '''\n",
"    \n",
"    #First let's calculate X*beta - make sure to use numpy's 'dot' method\n",
"    #student - put in code here\n",
"    a = np.dot(X, beta)\n",
"    #Now let's input this into the link function\n",
"    #student - put in code here\n",
"    prob_score = 1 / (1 + np.exp(-a))\n",
"    return prob_score"
]
},
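{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, a quick spot check (a minimal sketch, assuming the X and beta from above): the output should be a length-N vector of probabilities strictly between 0 and 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Spot check: N scores, all inside the open interval (0, 1)\n",
"scores = score_logistic_regression(X, beta)\n",
"print(scores.shape)\n",
"print(scores.min(), scores.max())"
]
},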
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So how much faster is it by using NumPy? We can test this by writing the same function with no NumPy, executing via loops."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def score_logistic_regression_NoNumpy(X, beta):\n",
"    '''\n",
"    This function takes in an NxK matrix X and 1xK vector beta.\n",
"    The function should apply the logistic scoring function to each record of X.\n",
"    The output should be an Nx1 vector of scores\n",
"    '''\n",
"    #Let's calculate xbeta using loops\n",
"    xbeta = []\n",
"    for row in X:\n",
"        xb = 0\n",
"        for i, el in enumerate(row):\n",
"            #Student - compute X*Beta in the loop\n",
"            xb += el * beta[i]\n",
"        xbeta.append(xb)\n",
"    \n",
"    #Now let's apply the link function to each xbeta\n",
"    prob_score = []\n",
"    for xb in xbeta:\n",
"        #student - compute p in the loop\n",
"        p = 1 / (1 + np.e**(-xb))\n",
"        prob_score.append(p)\n",
"    \n",
"    return prob_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before doing any analysis, let's test the output of each to make sure they are equal"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Student - write a unit test that calls each function with the same inputs and checks to see they return the same values.\n",
"wnp = score_logistic_regression(X, beta)\n",
"wonp = score_logistic_regression_NoNumpy(X, beta)\n",
"\n",
"#np.allclose compares element-wise within a floating point tolerance\n",
"print(np.allclose(wnp, wonp))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If they are equal, then we can proceed with the timing analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%timeit score_logistic_regression_NoNumpy(X, beta)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%timeit score_logistic_regression(X, beta)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
} |