---
title: "PS7 - Data Distributions"
format: pdf
jupyter: python3
---
# Introduction
Please watch [The Shape of Data: Distributions](https://youtu.be/bPFNxD3Yg6U/) before starting the problem set.
# Setup
In addition to looking at real world data, we will also generate some of our own data using a random number generator. The random number generator we will use is the `Generator` class from the `random` module in the `numpy` package. Please see the [Generator documentation page](https://numpy.org/doc/stable/reference/random/generator.html) for more information.
```{python}
from statistics import mode #load mode function
import numpy as np # load numpy
import numpy.random as rand #load random number generation from numpy
import matplotlib.pyplot as plt #load plot module from matplotlib
import matplotlib.ticker as tck # load the ticker module of matplotlib
#create random number generator
rng = rand.default_rng(seed=3289) #set seed so that our results are replicable
#generate some random numbers
r1d6 = rng.choice(range(1, 7), size=10000) # 10,000 simulated rolls of a 6-sided die.
```
```{python}
#| echo: false
price_bins = 40 #number of bins to use when calculating mode of housing prices
def NumericMode(v, bins):
    # calculate mode of a numeric distribution (v) by calculating the midpoint of
    # the bin with the highest frequency, given (bins) bins
    h = np.histogram(v, bins=bins)  # bin data
    max_bin = np.argmax(h[0])  # find bin with max frequency
    n_mode = (h[1][max_bin] + h[1][max_bin + 1]) / 2  # midpoint of max bin
    return n_mode

def CalcStats(v, digits=2, stat_mode=False, bins=30):
    # Calculates mean, median, mode, and standard dev of vector v
    # stat_mode: bool, should statistics.mode be used to calculate mode? Otherwise
    # the previously defined NumericMode function is used.
    # Returns a dict with keys: mu, std, med, and mode
    d = {  # create dictionary with mean, median, and st dev
        'mu': np.mean(v).round(digits),
        'std': np.std(v).round(digits),
        'med': np.median(v).round(digits)
    }
    if stat_mode:  # should we calculate mode using the statistics module?
        d['mode'] = mode(v)
    else:  # otherwise calculate mode based on bin frequency
        d['mode'] = NumericMode(v, bins)
    return d
```
# Distributions
A data distribution tells you how frequently a particular value appears in a data set. This can be expressed as a count ("frequency"), or it can be expressed as a likelihood ("probability"). In the setup code, we simulated 10,000 rolls of a 6-sided die. The frequency of 4 for that particular set of rolls was 1656. The probability of getting a 4 for that set of rolls was 0.1656, or approximately 16.6%. While both frequencies and probabilities can be useful, we see probabilities used more often in real life situations because they are easier to compare between data sets. While we can choose how much data we want to simulate in a computer program, scientists often don't have that option in real life.
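As a quick check, those numbers can be recomputed directly from the simulated rolls with a minimal sketch like the one below (the exact values depend on the seed set in the setup code).
```{python}
freq_4 = np.sum(r1d6 == 4)   # frequency: how many of the 10,000 rolls were a 4
prob_4 = freq_4 / len(r1d6)  # probability: frequency divided by the total number of rolls
print(freq_4, prob_4)
```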
Figure 1 depicts the frequency distribution of rolls from our simulated data. Figure 2 shows the probability distribution.
```{python}
bins = np.arange(1, 6 + 1.5) - 0.5 #set up bins for histogram
sim_name = "Data Distribution of Rolling a 6-sided Die 10,000 times"
fig1, ax1 = plt.subplots() #create figure
# add frequency histogram to figure
ax1.hist(r1d6, bins=bins, rwidth=0.9, edgecolor='black')
ax1.set_title('Figure 1. ' + sim_name + '\n(Frequency)') # add title to figure
ax1.set_xlabel('Outcome (possible roll)') # add x axis label
ax1.set_ylabel('Frequency') # add y axis label
plt.show() #show figure
```
```{python}
bins = np.arange(1, 6 + 1.5) - 0.5 #set up bins for histogram
fig2, ax2 = plt.subplots() #create figure
# add density histogram to figure
ax2.hist(r1d6, bins=bins, rwidth=0.9, edgecolor='black', density=True)
ax2.set_ylim([0,1]) # set y axis from 0-1 for full probability scale
ax2.set_title('Figure 2. ' + sim_name + '\n(Probability)') # add title to figure
ax2.set_xlabel('Outcome (possible roll)') # add x axis label
ax2.set_ylabel('Probability') # add y axis label
plt.show() # show plot
```
# Describing Distributions
Once we have a data distribution, it is important to be able to accurately describe it to the people we want to share our data with. We can describe distributions both qualitatively and quantitatively. When describing a distribution qualitatively, we refer to 'characteristics' of the distribution. When describing a distribution quantitatively, we refer to the distribution's statistics (or parameters).
## Distribution Characteristics (Qualitative)
* Continuous or discrete
* Symmetric or skewed
* Unimodal or multi-modal
Distributions have a number of qualitative characteristics that we can use to describe them. First, they can be continuous or discrete. Continuous distributions can take on any value within a given range. Human heights are continuously distributed, as is the length of time it takes people to run 100m. Discrete distributions can only take on specific values. Dice rolls have a discrete distribution, as do the number of times people have eaten ice cream in the past week.
Distributions can also be symmetric or skewed. Symmetric distributions are distributions that look the same if you fold them in half. Our simulated die rolls have a roughly symmetric distribution.
Skewed distributions have more data on one side of the distribution than the other. A distribution is positively (or right) skewed if it has a long tail to the right. Income levels in the United States are right skewed, as are housing prices (Figure 3). A distribution is negatively (or left) skewed if it has a long tail to the left. Age of non-accidental death in the United States is negatively skewed.
```{python}
#| echo: false
import pandas as pd
#import sales data and select only houses
sales = pd.read_csv('data/raw_sales.csv')
sales = sales.loc[sales.propertyType=='house']
#Converting datestring to datetime class
sales['datesold'] = pd.to_datetime(sales['datesold'], format="%Y-%m-%d %H:%M:%S")
# Extracting year from datetime class
sales['Year'] = sales.datesold.dt.year
sales17 = sales.loc[sales.Year==2017] #only 2017 sales
# set all sales > $3 million equal to $3 million for ease of visualization
price17_2M = np.where(sales17.price>3e6, 3e6, sales17.price)
def ToThousands(x, pos):  # function to convert number to thousands
    kn = round(x / 1000)  # divide by 1000 and turn into an integer
    kn_str = "{:,}".format(kn) + 'k'  # add a k on the end to denote thousands
    if x >= 3e6:  # add a greater-than sign to the 3 million label
        kn_str = '> ' + kn_str
    return kn_str
fig3, ax3 = plt.subplots(figsize=(6,3.3)) #create figure
ax3.hist(price17_2M, bins=price_bins) #add histogram
ax3.xaxis.set_major_formatter(tck.FuncFormatter(ToThousands)) #convert to thousands of $
ax3.tick_params(axis='x', labelrotation = 45) #rotate x axis labels
ax3.set_title("Figure 3. Histogram of 2017 Home Sales Prices in New England \n(Right Skewed)") # add title
ax3.set_xlabel('Price ($)') # add x label
ax3.set_ylabel('Number of Houses') #add y label
plt.show() #display
```
The mode of a dataset is the most common value, and the mode of a distribution sits at its peak frequency or probability. Most data distributions we see on a regular basis are unimodal, meaning they only have one peak. This includes almost all the examples previously mentioned. Some distributions, however, have more than one peak. This was the case for age of death prior to modern medicine: many people died in childhood, but if you survived childhood you were fairly likely to live into your 50s or 60s. This resulted in a bimodal distribution of age of death.
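As a toy sketch (the numbers below are illustrative, not real mortality data), mixing two groups with different centers produces a bimodal distribution. A separate generator is used here so the seeded `rng` from the setup code is left untouched.
```{python}
rng_demo = rand.default_rng(seed=42)  # separate generator, illustrative only
# combine a smaller group centered near 5 with a larger group centered near 60
bimodal = np.concatenate([rng_demo.normal(5, 2, size=400),
                          rng_demo.normal(60, 15, size=600)])
figB, axB = plt.subplots()
axB.hist(bimodal, bins=40)  # two separate peaks appear
axB.set_title('A toy bimodal distribution (illustrative only)')
axB.set_xlabel('Value')
axB.set_ylabel('Frequency')
plt.show()
```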
## Distribution Statistics (Quantitative)
In addition to these characteristics, we can also describe a distribution using a number of different statistics:
* mean ($\mu$)
* standard deviation ($\sigma$)
* median
* mode
The mean and standard deviation are particularly important, so we give them their own special symbols. The mean is represented by the Greek letter mu ($\mu$), and the standard deviation is represented by the Greek letter sigma ($\sigma$). We can use Python to calculate these statistics fairly easily for our simulated dice rolls with functions we've already used.
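For reference, here is a minimal sketch of those calculations using functions already loaded in the setup code.
```{python}
print(np.mean(r1d6))    # mean (mu)
print(np.median(r1d6))  # median
print(mode(r1d6))       # mode, via statistics.mode
print(np.std(r1d6))     # standard deviation (sigma)
```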
```{python}
#| echo: false
from IPython.display import display, Markdown
#calculate descriptive statistics of simulated dice rolls
d6stats = CalcStats(r1d6, stat_mode=True)
# add calculated values to text
display(Markdown("""
For this simulation the mean value of our die rolls was {d6mean}, the median was {d6median}, the mode was {d6mode}, and the standard deviation was {d6st_dev}. For a different set of 10,000 simulated rolls, some of the numbers could change.
""".format(
d6mean = d6stats['mu'], # mean
d6median = d6stats['med'], # median
d6mode = d6stats['mode'], # mode
d6st_dev = d6stats['std'] # standard deviation
)))
```
The qualitative characteristics of a distribution can have a large effect on its statistics. In particular, the skew of a distribution has a predictable effect on the mean, median, and mode; see Figure 4 for an example. In a symmetric distribution, the mean, median, and mode are all close together, at the peak of the distribution. In a skewed distribution, the mode is still at the peak, but the mean is pulled toward the tail. For skewed distributions, the median generally falls between the mean and the mode.
```{python}
#| echo: false
#calculate descriptive stats for 2017 home prices
pstats = CalcStats(sales17.price, bins=price_bins)
#generate normal data with the same mean and st dev as the real data
norm_price = rng.normal(
loc=pstats['mu'],
scale=pstats['std'],
size=len(sales17.price)
)
#cap price at 3 million
norm_price2M = np.where(norm_price>3e6, 3e6, norm_price)
#calculate descriptive states for normally distributed housing prices
nstats = CalcStats(norm_price, bins=price_bins)
#create figure with 2 subplots (real housing prices and simulated ones)
fig4, ax4 = plt.subplots(nrows=2, sharex=True, figsize=(7, 5))
ax4[0].hist(price17_2M, bins=price_bins) #real price histogram
ax4[0].axvline(x=pstats['mu'], color = 'black', linestyle='dashed', label='Mean')
ax4[0].axvline(x=pstats['med'], color = 'black', linestyle='dotted', label='Median')
ax4[0].axvline(x=pstats['mode'], color = 'black', linestyle='solid', label='Mode')
ax4[0].xaxis.set_major_formatter(tck.FuncFormatter(ToThousands))
ax4[0].tick_params(axis='x', labelrotation = 45) #rotate x axis labels
ax4[0].set_title("")
ax4[0].set_ylabel('Actual Sales')
ax4[0].legend(loc='upper right') #create legend in upper right corner
ax4[1].hist(norm_price2M, bins=price_bins) #simulated price histogram
ax4[1].axvline(x=nstats['mu'], color = 'black', linestyle='dashed') #sim price mean
ax4[1].axvline(x=nstats['med'], color = 'black', linestyle='dotted') #sim price median
ax4[1].axvline(x=nstats['mode'], color = 'black', linestyle='solid') #sim price mode
ax4[1].xaxis.set_major_formatter(tck.FuncFormatter(ToThousands)) #formatting ticks
ax4[1].tick_params(axis='x', labelrotation = 45) #rotate x axis labels
ax4[1].set_xlabel('Price ($)') #set x label
ax4[1].set_ylabel('Symmetric Sales\n(simulated)') #set y label
fig4.suptitle('Figure 4. Mean, Median, and Mode for Skewed and Symmetric\n Distributions of 2017 Home Sales in New England', fontsize=14)
plt.subplots_adjust(hspace=0, top=0.89, bottom=0.17)
plt.show()
```
# Specific Distributions
Some distributions are so common or important that we have given them special names. We will cover two in this problem set: the uniform distribution and the normal distribution.
## Uniform Distribution
A uniform distribution is any distribution where all outcomes have the same likelihood of happening. Uniform distributions are *symmetric*, they can be *discrete* or *continuous*, and they do not have a mode, because, in a perfect world, no value would be generated any more frequently than another. Of course, in the real world, processes with uniform distributions will generate data that is not exactly equal. The outcome of rolling a single die has a discrete uniform distribution; we already simulated this using the `choice()` method of the `Generator` class in the `random` module of numpy. Continuous uniform distributions are not common in the real world, but they can be used to simulate data using the `uniform()` method of the same `Generator` class.
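As a minimal sketch (the 0 to 10 range below is just an illustrative choice, and a separate generator is used so the seeded `rng` from the setup code is left untouched):
```{python}
rng_demo = rand.default_rng(seed=42)  # separate generator, illustrative only
u_cont = rng_demo.uniform(low=0, high=10, size=1000)  # continuous uniform on [0, 10)
print(u_cont.min().round(2), u_cont.max().round(2))   # values fall anywhere in that range
```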
## Normal Distribution
The normal distribution is probably the most common distribution. It is also frequently called a 'bell-shaped curve' or a Gaussian distribution. The normal distribution is *continuous*, *symmetric*, and *unimodal*. It pops up all over the place, including, but not limited to, human heights, birth weights, blood pressure, and measurement error.
We can simulate random data from a normal distribution using the `normal()` method of the `Generator` class. This method allows us to generate data randomly drawn from a theoretical normal distribution with a specific mean and standard deviation. However, the actual (or empirical) mean and standard deviation of the data we generate will not necessarily equal the theoretical mean and standard deviation we passed to the `normal()` method.
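As a minimal sketch (the mean of 170 and standard deviation of 10 below are arbitrary illustrative values, loosely like heights in centimeters, and a separate generator is used so the seeded `rng` from the setup code is left untouched):
```{python}
rng_demo = rand.default_rng(seed=42)  # separate generator, illustrative only
sim_heights = rng_demo.normal(loc=170, scale=10, size=1000)  # theoretical mu=170, sigma=10
# the empirical statistics are close to, but not exactly, 170 and 10
print(sim_heights.mean().round(2), sim_heights.std().round(2))
```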
# Density and Binning
When plotting the probability distribution of the outcomes of a 6-sided die roll, we set `density=True` to tell Python to calculate probabilities instead of frequencies. However, density is not always the same as probability, and the value of density doesn't just depend on the data itself.
Density depends on what data you have, the size of the bins that you use to group your data, and the locations of those bins. In a density plot, the areas of all of the bars always sum to 1. This means that if the width of a bin is less than 1, the height of a bar in the histogram can be greater than 1, even though it's not possible for something to have a greater than 100% chance of happening.
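A minimal sketch of this property (the narrow normal data below is illustrative only): individual density values can exceed 1, but the bar areas still sum to 1.
```{python}
rng_demo = rand.default_rng(seed=42)  # separate generator, illustrative only
dens, edges = np.histogram(rng_demo.normal(0, 0.1, size=1000), bins=20, density=True)
widths = np.diff(edges)                # width of each bin
print(dens.max().round(2))             # a single density value can be greater than 1
print(np.sum(dens * widths).round(2))  # but the total bar area is 1
```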
If we simulate two sets of data from two different distributions, we can see how the size of the bins affects the density. In Figures 5A and 5B, the width of the bins is exactly 1, so the bin densities do correspond to probabilities. Notice that no more than one of the bins has a density greater than 0.5. However, in Figures 5C and 5D, the bin widths are much smaller, and so quite a few bins have densities greater than 0.5.
The density plotted also depends on the location of the bins, in addition to their width. Figures 5A and 5B both have the same bin width (1), but their bins are offset from each other. Figure 5A has bins that end on the integers (-5 to -4, -4 to -3, etc.), while Figure 5B has bins that end on the half steps (-5.5 to -4.5, -4.5 to -3.5, etc.). Even though the blue and orange data in Figures 5A and 5B are the same, the blue distribution in Figure 5B looks much narrower than the orange distribution.
```{python}
#Compare how the same data looks under 4 different binning schemes
#generate some random numbers from normal distributions
normA = rng.normal(0, 0.5, size=1000)  # 1,000 normal(0, 0.5) data pulls
xlab = r'$\mu=0$, $\sigma=0.5$'  # A data label
normB = rng.normal(0, 1, size=1000)  # 1,000 normal(0, 1) data pulls
ylab = r'$\mu=0$, $\sigma=1$'  # B data label
binsInt = np.arange(-5, 5) #set up bins for histogram
binsOffset = np.arange(-5, 5.5) - 0.5 #set up bins for histogram (Offset)
#creating a 8x6 figure with 4 subplots, sharing the x and y axes
fig5, ax5 = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True,
figsize=(8, 6))
#Adding A and B datasets with Integer endpoint bins, with bin width = 1
ax5[0,0].hist(normA, bins=binsInt, density=True, alpha=0.7)
ax5[0,0].hist(normB, bins=binsInt, density=True, alpha=0.7)
ax5[0,0].annotate('A. Bin width = 1', xy=(-5.4,0.75))
#Adding A and B datasets with half integer endpoint bins, with bin width = 1
ax5[0,1].hist(normA, bins=binsOffset, density=True, alpha=0.7)
ax5[0,1].hist(normB, bins=binsOffset, density=True, alpha=0.7)
ax5[0,1].annotate('B. Bin width = 1, offset', xy=(-5.4,0.75))
#A and B datasets with more bins and narrower bins
ax5[1,0].hist(normA, bins=15, density=True, alpha=0.7)
ax5[1,0].hist(normB, bins=15, density=True, alpha=0.7)
ax5[1,0].annotate('C. Bin width ~ 0.2', xy=(-5.4,0.75))
#A and B datasets with very many, very thin bins
ax5[1,1].hist(normA, bins=40, density=True, alpha=0.7, label=xlab)
ax5[1,1].hist(normB, bins=40, density=True, alpha=0.7, label=ylab)
ax5[1,1].annotate('D. Bin width ~ 0.08', xy=(-5.4,0.75))
#get info for legend
handles, labels = ax5[1,1].get_legend_handles_labels()
fig5.legend(handles, labels, loc='lower center') #create legend
#Add Title
fig5.suptitle('Figure 5. Randomly generated data with different binning schemes')
#removing space between the plots both horizontally and vertically
plt.subplots_adjust(wspace=0, hspace=0, top=0.95, bottom=0.15)
plt.show()
```
# Questions
1. Frequencies vs. Probabilities
a. What is a specific situation where you would care as much about the frequencies of the data as you care about their probability?
b. What is a specific situation where only the data probability is important?
2. Simulating Dice Rolls
a. Simulate 10,000 rolls of 2 6-sided dice, where the outcome is the sum of the dice added together.
b. Plot the frequencies of the data you just simulated.
c. Plot the probabilities of your simulated dice roll data.
3. Describing your Dice Rolls
a. Describe the distribution of the dice rolls you simulated using the terminology from the Distribution Characteristics section.
b. Calculate the mean, median, mode, and standard deviation for your dice roll data.
4. Would we expect any of the mean, median, mode, and standard deviation to change very much between different sets of simulations for rolling a single die? How about for rolling 2 dice?
5. Uniform Distributions
a. What are the arguments of `uniform()` and what do they do?
b. Simulate data from two uniform distributions with different ranges.
c. Create density plots of both of them using the same bin width. How do the densities differ?
6. Normal Distributions
a. Pick a mean ($\mu$) and a standard deviation ($\sigma$), and simulate 3 sets of data: one with 100 data points, one with 1,000, and one with 100,000 data points.
b. Create density plots of all three of them using the same bin width. How do the densities differ?
7. Numeric Mode
a. If you use the `mode()` function from the `statistics` module on a data series without any repeat values (e.g., housing prices), what number does it return?
b. Write a function called `NumericMode()` that provides a more useful output for a distribution of (non-repeating) numbers.
8. Plotting Real World Data
a. Plot the distribution of a variable in your dataset using the default values for `hist()`.
b. Plot the distribution of the same variable using 2 different numbers of bins.
c. Plot the same variable, but offset the bin cutoffs.
d. How did each of the plots in a-c change the way you thought about your data?
9. Real World Statistics
a. Estimate the mean, median, and mode visually, based on your plot from Question 8.
b. Calculate the mean, median, mode, and standard deviation of your data. Use the `NumericMode()` function from Question 7. How far off were your estimates?
c. Add vertical lines to your plot located at the mean, median, and mode, and add an appropriate legend.