briandalessandro committed Aug 29, 2017
1 parent c5b68ed commit df3359b
Showing 18 changed files with 8,058 additions and 0 deletions.
@@ -0,0 +1,318 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we'll generate a random matrix"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Number of columns (features)\n",
"K = 5\n",
"\n",
"#Number of records\n",
"N = 1000\n",
"\n",
"#Generate an NxK matrix of uniform random variables\n",
"X = #Student: generate a uniform random matrix here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's peek at our data to confirm it looks as we expect."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student - Put in a command to view the first 100 rows\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student - put in a command to see the dimensions of X\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This exercise is about designing a scoring function for a logistic regression. Since we are not concerned with fitting a model to data, we can simply make up a logistic regression. <br> <br>\n",
"\n",
"As a quick intro, logistic regression takes the form $\\hat{Y} = f(x \\beta^T)$, where $x$ is the $1 \\times K$ vector of features and $\\beta$ is the $1 \\times K$ vector of weights. The function $f$ is the inverse of the logit 'link' function (the inverse logit, also known as the sigmoid): <br><br>\n",
"\n",
"<center>$f(a)=\\frac{1}{1+e^{-a}}$</center> <br><br>\n",
"\n",
"In this notebook we'll write a function that, given inputs of $X$ and $\\beta$, returns a value for $\\hat{Y}$.\n",
"<br><br>\n",
"First let's generate a random set of weights to represent $\\beta$.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student - generate a K dimensional vector of uniform random variables in the interval [-1, 1]\n",
"beta = #input command here\n",
"beta"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the neat NumPy trick applied here. The numpy.random.random() function returns an array, yet we applied what looks like a scalar operation to the whole vector. This is an example of what NumPy calls vectorization (a major point of this tutorial), which gives us both a very fast way to run vector computations and a clean, concise coding style. \n",
"\n",
"<br><br>\n",
"\n",
"<b>Question: we designed the above $\\beta$ vector such that $E[\\beta_i]=0$. How can we confirm that we did this correctly?</b>"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#start by taking the mean of the beta we already calculated\n",
"\n",
"#Student - fill in command here\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#It is likely the above is not equal to zero. Let's simulate this 100k times and see what the distribution of means is\n",
"#Student input code here\n",
"means = []\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use Matplotlib's hist function to plot the histogram of the means. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.hist(means)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should expect the distribution to be centered around zero. Is it?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's write our scoring function. Let's try to use as much of NumPy's internal optimization as possible (hint: this can be done in two lines and without writing any loops)."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def score_logistic_regression(X, beta):\n",
"    '''\n",
"    This function takes in an NxK matrix X and 1xK vector beta.\n",
"    The function should apply the logistic scoring function to each record of X.\n",
"    The output should be an Nx1 vector of scores\n",
"    '''\n",
"    \n",
"    #First let's calculate X*beta - make sure to use numpy's 'dot' method\n",
"    #student - put in code here\n",
"    \n",
"    #Now let's input this into the link function\n",
"    #student - put in code here\n",
"    \n",
"    return prob_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So how much faster is it to use NumPy? We can test this by writing the same function without NumPy's vectorized operations, executing via loops instead."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def score_logistic_regression_NoNumpy(X, beta):\n",
"    '''\n",
"    This function takes in an NxK matrix X and 1xK vector beta.\n",
"    The function should apply the logistic scoring function to each record of X.\n",
"    The output should be an Nx1 vector of scores\n",
"    '''\n",
"    #Let's calculate xbeta using loops\n",
"    xbeta = []\n",
"    for row in X:\n",
"        \n",
"        xb = 0\n",
"        for i, el in enumerate(row):\n",
"            xb += el * beta[i]\n",
"        \n",
"        xbeta.append(xb)\n",
"    \n",
"    #Now let's apply the link function to each xbeta\n",
"    prob_score = []\n",
"    for xb in xbeta:\n",
"        prob_score.append(1 / (1 + np.exp(-1 * xb)))\n",
"    \n",
"    return prob_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before doing any analysis, let's test the output of each function to make sure they are equal."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student - write a unit test that calls each function with the same inputs and checks to see they return the same values. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If they are equal, then we can proceed with the timing analysis."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%timeit score_logistic_regression_NoNumpy(X, beta)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%timeit score_logistic_regression(X, beta)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
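
The notebook above leaves its key steps as student exercises. As a rough illustration only, and not the author's reference solution, the following sketch shows one way the pieces could fit together: a uniform random matrix, weights rescaled to the interval [-1, 1), a vectorized scoring function, and a check against a loop-based version. The names `score_vectorized` and `score_loops` and the fixed seed are assumptions made for this sketch; they do not appear in the notebook.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so the sketch is repeatable

N, K = 1000, 5
X = np.random.random((N, K))        # N x K uniform values in [0, 1)
beta = 2 * np.random.random(K) - 1  # K weights rescaled to [-1, 1)


def score_vectorized(X, beta):
    """Apply the inverse logit to X . beta without Python loops."""
    xbeta = np.dot(X, beta)
    return 1.0 / (1.0 + np.exp(-xbeta))


def score_loops(X, beta):
    """Same computation, one record and one feature at a time."""
    scores = []
    for row in X:
        xb = 0.0
        for i, el in enumerate(row):
            xb += el * beta[i]
        scores.append(1.0 / (1.0 + np.exp(-xb)))
    return scores


vec_scores = score_vectorized(X, beta)
loop_scores = np.array(score_loops(X, beta))

# Compare with a tolerance, since the two versions may sum in a different order
scores_match = np.allclose(vec_scores, loop_scores)
print(X.shape, vec_scores.shape, scores_match)
```

The comparison uses `np.allclose` rather than exact equality because floating-point sums can differ in the last few bits depending on the order of accumulation.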