| 1 | +""" |
| 2 | +Normalization Wikipedia: https://en.wikipedia.org/wiki/Normalization |
| 3 | +Normalization is the process of converting numerical data to a standard range of values. |
| 4 | +This range is typically [0, 1] or [-1, 1]. The equation for normalization to [0, 1] is |
| 5 | +x_norm = (x - x_min)/(x_max - x_min) where x_norm is the normalized value, x is the |
| 6 | +value, x_min is the minimum value within the column or list of data, and x_max is the |
| 7 | +maximum value within the column or list of data. Normalization puts all of the data |
| 8 | +on a similar scale, which can speed up model training. This is useful because large |
| 9 | +differences in the range of values across a dataset can heavily impact optimization |
| 10 | +(particularly gradient descent). |
| 11 | + |
| 12 | +Standardization Wikipedia: https://en.wikipedia.org/wiki/Standardization |
| 13 | +Standardization is the process of rescaling numerical data so that the resulting |
| 14 | +values have a mean of 0 and a standard deviation of 1. It does not change the shape |
| 15 | +of the distribution. This is also known as z-score normalization. The equation is |
| 16 | +x_std = (x - mu)/(sigma) where mu is the mean of the column or list of values and sigma |
| 17 | +is the sample standard deviation of the column or list of values. |
| 18 | + |
| 19 | +Choosing between Normalization & Standardization is more of an art than a science, but it |
| 20 | +is often recommended to run experiments with both to see which performs better. |
| 21 | +Additionally, a few rules of thumb are: |
| 22 | + 1. Gaussian (normal) distributions work better with standardization |
| 23 | + 2. Non-Gaussian (non-normal) distributions work better with normalization |
| 24 | + 3. If a column or list of values has extreme values / outliers, use standardization |
| 25 | +""" |
| 26 | +from statistics import mean, stdev |
| 27 | + |
| 28 | + |
| 29 | +def normalization(data: list, ndigits: int = 3) -> list: |
| 30 | + """ |
| 31 | + Returns a normalized list of values |
| 32 | + @params: data, a list of values to normalize; ndigits, decimal places to round to |
| 33 | + @returns: a list of normalized values (rounded to ndigits decimal places) |
| 34 | + @examples: |
| 35 | + >>> normalization([2, 7, 10, 20, 30, 50]) |
| 36 | + [0.0, 0.104, 0.167, 0.375, 0.583, 1.0] |
| 37 | + >>> normalization([5, 10, 15, 20, 25]) |
| 38 | + [0.0, 0.25, 0.5, 0.75, 1.0] |
| 39 | + """ |
| 40 | + # variables for calculation |
| 41 | + x_min = min(data) |
| 42 | + x_max = max(data) |
| 43 | + # normalize data |
| 44 | + return [round((x - x_min) / (x_max - x_min), ndigits) for x in data] |
| 45 | + |
| 46 | + |
| 47 | +def standardization(data: list, ndigits: int = 3) -> list: |
| 48 | + """ |
| 49 | + Returns a standardized list of values |
| 50 | + @params: data, a list of values to standardize; ndigits, decimal places to round to |
| 51 | + @returns: a list of standardized values (rounded to ndigits decimal places) |
| 52 | + @examples: |
| 53 | + >>> standardization([2, 7, 10, 20, 30, 50]) |
| 54 | + [-0.999, -0.719, -0.551, 0.009, 0.57, 1.69] |
| 55 | + >>> standardization([5, 10, 15, 20, 25]) |
| 56 | + [-1.265, -0.632, 0.0, 0.632, 1.265] |
| 57 | + """ |
| 58 | + # variables for calculation |
| 59 | + mu = mean(data) |
| 60 | + sigma = stdev(data) |
| 61 | + # standardize data |
| 62 | + return [round((x - mu) / sigma, ndigits) for x in data] |
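
Both functions above divide by a quantity that is zero when every input value is the same (x_max - x_min for normalization, sigma for standardization). The sketch below is not part of this change. It shows one possible way to guard those cases and compares the two formulas on a list with an outlier. The names min_max_scale and z_score are hypothetical, and returning zeros for constant input is an assumed convention rather than anything the module defines.

```python
from statistics import mean, stdev


def min_max_scale(data: list, ndigits: int = 3) -> list:
    """Hypothetical variant of normalization() with a zero-range guard."""
    x_min, x_max = min(data), max(data)
    span = x_max - x_min
    if span == 0:
        # Assumed convention: all values are identical, so return zeros
        # instead of dividing by zero.
        return [0.0 for _ in data]
    return [round((x - x_min) / span, ndigits) for x in data]


def z_score(data: list, ndigits: int = 3) -> list:
    """Hypothetical variant of standardization() with a zero-sigma guard."""
    mu = mean(data)
    sigma = stdev(data)  # sample standard deviation; needs at least two values
    if sigma == 0:
        # Assumed convention, same as above.
        return [0.0 for _ in data]
    return [round((x - mu) / sigma, ndigits) for x in data]


values = [1, 2, 3, 4, 100]  # 100 acts as an outlier
scaled = min_max_scale(values)
standardized = z_score(values)
print(scaled)
print(standardized)
```

With this input, min-max scaling yields [0.0, 0.01, 0.02, 0.03, 1.0]: the outlier sets the top of the range, so the other four values are squeezed into the bottom 3% of the interval.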