iCleared student lab4
Chris Ick authored and Chris Ick committed Sep 27, 2018
1 parent ef72f86 commit 08f0736
Showing 1 changed file with 357 additions and 0 deletions.
357 changes: 357 additions & 0 deletions ipython/Labs_Student/Lab4_Survey_Questions_Student.ipynb
@@ -0,0 +1,357 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by reading in the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"\n",
"#We assume data is in a parallel directory to this one called 'data'\n",
"cwd = os.getcwd()\n",
"datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'\n",
"#or you can hardcode the directory\n",
"#datadir = \n",
"\n",
"print(datadir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now read in the data called survey_responses_2017.csv into a pandas data frame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Student put in read data command here:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the column headers and use something more descriptive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student put in code to look at column names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Column names like 'profile_1-profile_7' aren't very descriptive. As a quick data maintenance task, let's rename the columns starting with 'profile'. The dictionary in the next cell maps the integer index to a descriptive text.\n",
"\n",
"Tactically, let's loop through each column name. Within the loop let's check whether the column name starts with 'profile.' If it does, let's create a new name that swaps the key with the value using profile_mapping dictionary (i.e., profile_1 -> profile_Viz). We then add the new column name to a list. If it doesn't start with 'profile' just add the old column name to the list. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"profile_mapping = {1:'Viz',\n",
" 2:'CS',\n",
" 3:'Math',\n",
" 4:'Stats',\n",
" 5:'ML',\n",
" 6:'Bus',\n",
" 7:'Com'}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student put code here to change the header names\n",
"newcols = []\n",
"\n",
"for colname in data.columns:\n",
" #finish the loop \n",
" \n",
"#Now swap the old columns with the values in newcols \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use this data to illustrate common data analytic techniques. We have one numeric variable (len_answer) and different categorical variables which may carry some signal of the 'len_answer' variable. \n",
"\n",
"'Len_answer' is the character count of the response to the following question: \"Besides the examples given in lecture 1, discuss a case where data science has created value for some company. Please explain the company's goals and how any sort of data analysis could have helped the company achieve said goals.\" As this is a subjective business question, let's hypothesize that students with more professional experience might be more likely to give longer answers. \n",
"\n",
"In more technical terms, we'll test whether the variance of len_answer can be explained away by the categorical representation of a student's experience. \n",
"\n",
"The first thing we should do is look at the distribution of len_answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student - plot a histogram here for len_answer\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like we have at least one strong outlier and a thick distribution around 0. Let's also use the Pandas describe() method to get a stronger sense of the distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data.len_answer.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's consider cleaning up the data. We'll remove the top k values as well as those with a length less than 50 (which we think is a generous minimum to communicate a reasonable answer.\n",
"\n",
"Create a new data_frame that removes these outliers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Write a function to get the kth largest value of an array\n",
"def get_kth_largest(inarray, k):\n",
"\n",
" return ...\n",
"\n",
"k = 3\n",
"kth_largest = get_kth_largest(np.array(data.len_answer.values), 3)\n",
"#Question = why did we wrap the series into an np.array() call in the above function call?\n",
"\n",
"#Student create a filtered data frame here\n",
"\n",
"#Compare the shape of both dataframes\n",
"data_clean.shape, data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have cleaned our data, let's run a pairwise t-test on each experience level to see if their difference in len_answer is statistically significant. To run a t-test, we'll need the mean, standard-deviation and count for each group. We can achieve this with a pandas groupby operation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student input code here\n",
"\n",
"#run this to look at the grouped df\n",
"data_clean_grouped"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visually, we can see a potential split between the [0, 2] year experience range and the [2+] experience range. Let's be more rigorous and run t-tests. Let's write a function that takes in the necessary statistics and returns a p-value.\n",
"\n",
"Remember, the t-stat for the difference between two means is:\n",
"\n",
"<center>$t = \\frac{\\hat{\\mu_1} - \\hat{\\mu_2}}{\\sqrt{\\frac{\\hat{\\sigma_1}^2}{n_1} + \\frac{\\hat{\\sigma_2}^2}{n_2}}}$</center>\n",
"\n",
"The p-value can be found using a t-distribution, but for simplicity, let's approximate this with the normal distribution. For the 2-tailed test, the p-value is: 2 * (1 - Norm.CDF(T))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student complete the function\n",
"from scipy.stats import norm\n",
"def pvalue_diffmeans_twotail(mu1, sig1, n1, mu2, sig2, n2):\n",
" '''\n",
" P-value calculator for the hypothesis test of mu1 != mu2.\n",
" Takes in the approprate inputs to compute the t-statistic for the difference between means\n",
" Outputs a p-value for a two-sample t-test.\n",
" '''\n",
"\n",
" \n",
" return (t, p_value)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now loop through all possible pairs in data_clean_grouped and perform a t-test."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Student put in code here:\n",
"\n",
"\n",
"get distinct values in the data frame for the experience variable\n",
"grps = \n",
"\n",
"#Now loop through each pair\n",
"for i, grp1 in enumerate(grps):\n",
" for grp2 in grps[i + 1:]:\n",
" \n",
" '''\n",
" Also, the result of groupby uses a multi-index. So be sure to index on 'len_answer' as well.\n",
" Then pull out the mean, std, and cnt from that result. \n",
" ''' \n",
"\n",
" #some code should go here\n",
" \n",
" print('Two tailed T-Test between groups: {} and {}'.format(grp1, grp2))\n",
" print('Diff = {} characters'.format(round(row1['mean'] - row2['mean'], 0)))\n",
" print('The t-stat is {} and p-value is {}'.format(round(tstat, 3), round(p_value, 3)))\n",
" print('')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are some observations you might have about the above results? Are there any with large deviances that are not statistically significant at at least a 95% level? Is there any issue with using 95% as our threshold for statistical significance? In fact there is. We are running multiple hypothesis tests at once, and doing this is known to increase the probability that we have at least one false positive (i.e., $P(False Positive) = 1 - .95^{Ntests}$). We can apply a simplye but conservative method called the <a href=\"https://en.wikipedia.org/wiki/Bonferroni_correction\">Bonferoni Correction</a>, which says that if we normally would care about an alpha level of $\\alpha$ for significance testing, and we're doing $N$ tests, then our new significance level should be $\\alpha/N$. This correction is conservative because it assumes that each test is independent. Since each group is repeatedly sampled across pairs, we know that our individual tests are not indeed independent. Nonetheless, we'll see how the results hold under this new regime. \n",
"\n",
"Also, how do the numbers change if you rerun it using the original data, and not the cleaned data. What is the effect of outliers on the results?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Rerun everything without cleaning outliers\n",
"\n",
"grps = \n",
"\n",
"#Now loop through each pair\n",
"for i, grp1 in enumerate(grps):\n",
" for grp2 in grps[i + 1:]:\n",
" \n",
" '''\n",
" Also, the result of groupby uses a multi-index. So be sure to index on 'len_answer' as well.\n",
" Then pull out the mean, std, and cnt from that result. \n",
" ''' \n",
" \n",
" print('Two tailed T-Test between groups: {} and {}'.format(grp1, grp2))\n",
" print('Diff = {} characters'.format(round(row1['mean'] - row2['mean'], 0)))\n",
" print('The t-stat is {} and p-value is {}'.format(round(tstat, 3), round(p_value, 3)))\n",
" print('')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
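
The notebook above leaves the kth-largest helper, the p-value function, and the pairwise loop for the student to fill in. As a rough illustration of how those pieces might fit together, here is a minimal sketch that uses only the standard library. It makes several assumptions: `statistics.NormalDist` stands in for the notebook's `scipy.stats.norm`, the `group_stats` values are made up for illustration (they are not the survey data), and the group labels are hypothetical. It is one possible approach, not the lab's reference solution.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def get_kth_largest(values, k):
    """Return the kth largest value (k=1 is the maximum)."""
    return sorted(values, reverse=True)[k - 1]


def pvalue_diffmeans_twotail(mu1, sig1, n1, mu2, sig2, n2):
    """Two-tailed p-value for mu1 != mu2 using the normal approximation."""
    # t-stat for the difference between two means (unequal variances)
    t = (mu1 - mu2) / sqrt(sig1 ** 2 / n1 + sig2 ** 2 / n2)
    # Use |t| so the p-value is the same whichever group comes first
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value


# Hypothetical per-group summaries (mean, std, count); NOT the survey data.
group_stats = {
    "0-2 years": (400.0, 150.0, 40),
    "2-5 years": (520.0, 160.0, 35),
    "5+ years": (540.0, 170.0, 25),
}

alpha = 0.05
pairs = list(combinations(group_stats, 2))
alpha_bonferroni = alpha / len(pairs)  # Bonferroni-adjusted threshold

results = {}
for grp1, grp2 in pairs:
    t, p = pvalue_diffmeans_twotail(*group_stats[grp1], *group_stats[grp2])
    significant = p < alpha_bonferroni
    results[(grp1, grp2)] = (t, p, significant)
    print("{} vs {}: t = {:.3f}, p = {:.4f}, significant after correction: {}".format(
        grp1, grp2, t, p, significant))
```

In the notebook itself, the per-group mean, standard deviation, and count would come from the pandas groupby result rather than a hand-written dictionary, and the same loop could be rerun on the uncleaned data to see how outliers shift the t-stats.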
