Feature/upd poetry & re-run the whole book (Yorko#736)
* re-run all jupyter-book

* fix some of the errors
Yorko authored Feb 3, 2023
1 parent 9671c80 commit 8d21af1
Showing 10 changed files with 47 additions and 53 deletions.
2 changes: 1 addition & 1 deletion mlcourse_ai_jupyter_book/_config.yml
@@ -12,7 +12,7 @@ repository:
branch: main # Which branch of the repository should be used when creating links (optional)

execute:
execute_notebooks : cache
execute_notebooks : force
timeout: -1

# exclude some content
@@ -119,7 +119,7 @@ df.head()

It would be instructive to peek into the values of our variables.

Let's convert the data into *long* format and depict the value counts of the categorical features using [`factorplot()`](https://seaborn.pydata.org/generated/seaborn.factorplot.html).
Let's convert the data into *long* format and depict the value counts of the categorical features using [`catplot()`](https://seaborn.pydata.org/generated/seaborn.catplot.html).


```{code-cell} ipython3
@@ -137,6 +137,7 @@ df_uniques = (
sns.catplot(
x="variable", y="count", hue="value", data=df_uniques, kind="bar"
)
plt.xticks(rotation='vertical');
```

We can see that the target classes are balanced. That's great!
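
As a quick numeric cross-check of this claim, one can look at the class shares directly (a minimal sketch, assuming the target column is named `cardio`, as used in the plots in this notebook):

```python
# Shares of the target classes; two values close to 0.5 confirm the balance
df["cardio"].value_counts(normalize=True)
```
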
@@ -165,6 +166,7 @@ sns.catplot(
    data=df_uniques,
    kind="bar",
)
plt.xticks(rotation='vertical');
```

You can see that the distribution of cholesterol and glucose levels greatly differs by the value of the target variable. Is this a coincidence?
@@ -118,7 +118,7 @@ df.head()

It would be instructive to peek into the values of our variables.

Let's convert the data into *long* format and depict the value counts of the categorical features using [`factorplot()`](https://seaborn.pydata.org/generated/seaborn.factorplot.html).
Let's convert the data into *long* format and depict the value counts of the categorical features using [`catplot()`](https://seaborn.pydata.org/generated/seaborn.catplot.html).


```{code-cell} ipython3
@@ -135,7 +135,8 @@ df_uniques = (
sns.catplot(
x="variable", y="count", hue="value", data=df_uniques, kind="bar"
);
)
plt.xticks(rotation='vertical');
```

We can see that the target classes are balanced. That's great!
@@ -163,7 +164,8 @@ sns.catplot(
col="cardio",
data=df_uniques,
kind="bar"
);
)
plt.xticks(rotation='vertical');
```

You can see that the target variable greatly affects the distribution of cholesterol and glucose levels. Is this a coincidence?
@@ -165,6 +165,7 @@ For the interpretation of confidence intervals, you can address [this](https://w
Now that you've grasped the idea of bootstrapping, we can move on to *bagging*.

Suppose that we have a training set $\large X$. Using bootstrapping, we generate samples $\large X_1, \dots, X_M$. Now, for each bootstrap sample, we train its own classifier $\large a_i(x)$. The final classifier will average the outputs from all these individual classifiers. In the case of classification, this technique corresponds to voting:

$$\large a(x) = \frac{1}{M}\sum_{i = 1}^M a_i(x).$$
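
A minimal sketch of this averaging scheme (an illustration only: decision trees as the base classifiers and probability averaging with a 0.5 threshold are assumptions here, not something the text prescribes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=17)
rng = np.random.RandomState(17)

M = 50  # number of bootstrap samples / base classifiers
probas = []
for _ in range(M):
    # bootstrap sample X_i: draw len(X) objects with replacement
    idx = rng.randint(0, len(X), size=len(X))
    a_i = DecisionTreeClassifier(random_state=17).fit(X[idx], y[idx])
    probas.append(a_i.predict_proba(X)[:, 1])

# a(x) = (1/M) * sum_i a_i(x): average the individual outputs, then threshold
bagged_prediction = (np.mean(probas, axis=0) > 0.5).astype(int)
```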

The picture below illustrates this algorithm:
@@ -69,6 +69,7 @@ The algorithm for constructing a random forest of $\large N$ trees goes as follows:
* For each split, we first randomly pick $\large m$ features from the $\large d$ original ones and then search for the next best split only among the subset.

The final classifier is defined by:

$$\large a(x) = \frac{1}{N}\sum_{k = 1}^N b_k(x)$$

We use the majority voting for classification and the mean for regression.
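
In scikit-learn, this whole procedure is available out of the box; a minimal usage sketch (the synthetic dataset and the parameter values are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=17)

# N trees; at every split only a random subset of m = sqrt(d) features is considered
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=17)
rf.fit(X, y)

rf.predict(X[:5])        # majority voting over the trees
rf.predict_proba(X[:5])  # averaged class probabilities, as in a(x) above
# For regression, RandomForestRegressor averages the trees' numeric predictions instead
```
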
@@ -64,9 +64,9 @@ Note that by definition ${PI}^{(t)}=0$, if variable $X_j$ isn't in tree $t$.

Now, we can give the feature importance calculation for ensembles:
* not normalized:
$${PI}\left(X_j\right)=\frac{\sum_{t=1}^N {PI}^{(t)}(X_j)}{N}$$
${PI}\left(X_j\right)=\frac{\sum_{t=1}^N {PI}^{(t)}(X_j)}{N}$
* normalized by the standard deviation of the differences:
$$z_j=\frac{{PI}\left(X_j\right)}{\frac{\hat{\sigma}}{\sqrt{N}}}$$
$z_j=\frac{{PI}\left(X_j\right)}{\frac{\hat{\sigma}}{\sqrt{N}}}$
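
A compact sketch of these two quantities for a fitted forest (an illustration only: a held-out validation set stands in for the out-of-bag samples, and the error is the plain misclassification rate; both are simplifying assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=17)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=17)
rf = RandomForestClassifier(n_estimators=100, random_state=17).fit(X_train, y_train)

j = 0  # index of the feature X_j whose importance we measure
rng = np.random.RandomState(17)
per_tree_pi = []
for tree in rf.estimators_:
    base_error = np.mean(tree.predict(X_valid) != y_valid)
    X_perm = X_valid.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # permute the j-th column only
    perm_error = np.mean(tree.predict(X_perm) != y_valid)
    per_tree_pi.append(perm_error - base_error)   # PI^(t)(X_j)

N = len(per_tree_pi)
pi_j = np.mean(per_tree_pi)                              # PI(X_j), not normalized
z_j = pi_j / (np.std(per_tree_pi, ddof=1) / np.sqrt(N))  # normalized importance
```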

## 2. Illustrating permutation importance

@@ -288,7 +288,8 @@ Calculate the Adjusted Rand Index (`sklearn.metrics`) for the resulting clustering.
**Question 6:** <br>
Select all the correct statements. <br>

** Answer options:**
**Answer options:**

- According to ARI, KMeans handled clustering worse than Agglomerative Clustering
- For ARI, it does not matter which tags are assigned to the cluster, only the partitioning of instances into clusters matters
- In case of random partitioning into clusters, ARI will be close to zero
49 changes: 18 additions & 31 deletions mlcourse_ai_jupyter_book/book/topic07/topic7_pca_clustering.md
@@ -91,44 +91,31 @@ Let's start by loading all of the essential modules and trying out the iris example


```{code-cell} ipython3
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import datasets, decomposition
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets, decomposition
# Loading the dataset
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Let's create a beautiful 3d-plot
fig = plt.figure(1, figsize=(6, 5))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, 0.95, 1], elev=48, azim=134)
plt.cla()
for name, label in [("Setosa", 0), ("Versicolour", 1), ("Virginica", 2)]:
    ax.text3D(
        X[y == label, 0].mean(),
        X[y == label, 1].mean() + 1.5,
        X[y == label, 2].mean(),
        name,
        horizontalalignment="center",
        bbox=dict(alpha=0.5, edgecolor="w", facecolor="w"),
    )
# Change the order of labels, so that they match
y_clr = np.choose(y, [1, 2, 0]).astype(np.float32)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_clr, cmap=plt.cm.nipy_spectral)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([]);
# Plot the dataset in 3D ignoring Petal Width
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2],
           c=y, cmap='viridis', alpha=0.7)
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Petal Length');
```

Now let's see how PCA will improve the results of a simple model that is not able to correctly fit all of the training data:
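
A minimal sketch of such a comparison (the shallow decision tree, the hold-out split, and the two-component PCA here are illustrative assumptions, not necessarily the notebook's exact setup):

```python
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: a shallow tree on the raw features
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)
print("Raw features:", accuracy_score(y_test, tree.predict(X_test)))

# The same model on the first two principal components
pca = PCA(n_components=2).fit(X_train)
tree_pca = DecisionTreeClassifier(max_depth=2, random_state=42).fit(
    pca.transform(X_train), y_train
)
print("After PCA:", accuracy_score(y_test, tree_pca.predict(pca.transform(X_test))))
```
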
@@ -293,7 +280,7 @@ plt.show();

The main idea behind clustering is pretty straightforward. Basically, we say to ourselves, "I have these points here, and I can see that they organize into groups. It would be nice to describe these things more concretely, and, when a new point comes in, assign it to the correct group." This general idea encourages exploration and opens up a variety of algorithms for clustering.

<figure><img align="center" src="https://habrastorage.org/getpro/habr/post_images/8b9/ae5/586/8b9ae55861f22a2809e8b3a00ef815ad.png"><figcaption>*The examples of the outcomes from different algorithms from scikit-learn*</figcaption></figure>
<figure><img align="center" src="https://habrastorage.org/getpro/habr/post_images/8b9/ae5/586/8b9ae55861f22a2809e8b3a00ef815ad.png"><figcaption><it>The examples of the outcomes from different algorithms from scikit-learn</it></figcaption></figure>

The algorithms listed below do not cover all the clustering methods out there, but they are the most commonly used ones.

@@ -402,7 +389,7 @@ from sklearn.cluster import KMeans
```{code-cell} ipython3
inertia = []
for k in range(1, 8):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(X)
    kmeans = KMeans(n_clusters=k, random_state=1, n_init='auto').fit(X)
    inertia.append(np.sqrt(kmeans.inertia_))
```
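
The natural next step is to plot how this quantity decays as the number of clusters grows and look for an "elbow" (a minimal plotting sketch):

```python
# inertia here already holds the square roots computed in the loop above
plt.plot(range(1, 8), inertia, marker="s")
plt.xlabel("$k$")
plt.ylabel("inertia")
plt.show()
```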

@@ -561,7 +548,7 @@ data = datasets.load_digits()
X, y = data.data, data.target
algorithms = []
algorithms.append(KMeans(n_clusters=10, random_state=1))
algorithms.append(KMeans(n_clusters=10, random_state=1, n_init='auto'))
algorithms.append(AffinityPropagation())
algorithms.append(
    SpectralClustering(n_clusters=10, random_state=1, affinity="nearest_neighbors")
@@ -261,7 +261,7 @@ This idea is implemented in the `OneHotEncoder` class from `sklearn.preprocessing`.


```{code-cell} ipython3
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
```
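
For instance, a minimal usage sketch on a made-up categorical column (note that `sparse_output` is the parameter name since scikit-learn 1.2; older releases call it `sparse`):

```python
import numpy as np

colors = np.array([["red"], ["green"], ["blue"], ["green"]])  # toy feature of shape (n, 1)
encoded = onehot_encoder.fit_transform(colors)

onehot_encoder.categories_  # [array(['blue', 'green', 'red'], dtype='<U5')]
encoded                     # one binary column per category, dense because sparse_output=False
```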


@@ -453,15 +453,15 @@ with open(os.path.join(PATH_TO_WRITE_DATA, "20news_test.vw"), "w") as vw_test_da
Now, we pass the created training file to Vowpal Wabbit. We solve the classification problem with a hinge loss function (linear SVM). The trained model will be saved in the `20news_model.vw` file:


```{code-cell} ipython3
```
#!vw -d $PATH_TO_WRITE_DATA/20news_train.vw \
# --loss_function hinge -f $PATH_TO_WRITE_DATA/20news_model.vw
```

VW prints a lot of interesting info while training (it can be suppressed with the `--quiet` parameter); see the [documentation](https://vowpalwabbit.org/docs/vowpal_wabbit/python/latest/tutorials/cmd_linear_regression.html#vowpal-wabbit-output) for an explanation of this diagnostic output. Note how the average loss drops while training. For loss computation, VW uses samples it has never seen before, so this measure is usually accurate. Now, we apply our trained model to the test set, saving predictions into a file with the `-p` flag:


```{code-cell} ipython3
```
#!vw -i $PATH_TO_WRITE_DATA/20news_model.vw -t -d $PATH_TO_WRITE_DATA/20news_test.vw \
# -p $PATH_TO_WRITE_DATA/20news_test_predictions.txt
```
@@ -477,13 +477,13 @@ with open(os.path.join(PATH_TO_WRITE_DATA, "20news_test_predictions.txt")) as pr
auc = roc_auc_score(test_labels, test_prediction)
roc_curve = roc_curve(test_labels, test_prediction)
with plt.xkcd():
plt.plot(roc_curve[0], roc_curve[1])
plt.plot([0, 1], [0, 1])
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("test AUC = %f" % (auc))
plt.axis([-0.05, 1.05, -0.05, 1.05]);
plt.plot(roc_curve[0], roc_curve[1])
plt.plot([0, 1], [0, 1])
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("test AUC = %f" % (auc))
plt.axis([-0.05, 1.05, -0.05, 1.05]);
```

The AUC value we get shows that we have achieved high classification quality.
@@ -524,12 +524,12 @@ We train Vowpal Wabbit in multiclass classification mode, passing the `oaa` parameter
Additionally, we can try automatic Vowpal Wabbit parameter tuning with [Hyperopt](https://github.com/hyperopt/hyperopt).


```{code-cell} ipython3
```
#!vw --oaa 20 $PATH_TO_WRITE_DATA/20news_train_mult.vw -f $PATH_TO_WRITE_DATA/ \
#20news_model_mult.vw --loss_function=hinge
```

```{code-cell} ipython3
```
#%%time
#!vw -i $PATH_TO_WRITE_DATA/20news_model_mult.vw -t -d $PATH_TO_WRITE_DATA/20news_test_mult.vw \
#-p $PATH_TO_WRITE_DATA/20news_test_predictions_mult.txt
@@ -901,7 +901,7 @@ We can fight non-stationarity using different approaches: various order differen

## Getting rid of non-stationarity and building SARIMA

Let's build an ARIMA model by walking through all the ~~circles of hell~~ stages of making a series stationary.
Let's build an ARIMA model by walking through all the *circles of hell* stages of making a series stationary.

Here is the code to render plots.

Expand Down Expand Up @@ -1555,7 +1555,7 @@ But, this victory is decieving, and it might not be the brightest idea to fit `x

# Conclusion

We discussed different time series analysis and prediction methods. Unfortunately, or maybe luckily, there is no single way to solve these kinds of problems. Methods developed in the 1960s (and some even in the beginning of the 21st century) are still popular, along with LSTMs and RNNs (not covered in this article). This is partially related to the fact that the prediction task, like any other data-related task, requires creativity in so many aspects and definitely requires research. In spite of the large number of formal quality metrics and approaches to parameter estimation, it is often necessary to try something different for each time series. Last but not least, the balance between quality and cost is important. As a good example, the SARIMA model can produce spectacular results after tuning but can require many hours of ~~tambourine dancing~~ time series manipulation, while a simple linear regression model can be built in 10 minutes and can achieve more or less comparable results.
We discussed different time series analysis and prediction methods. Unfortunately, or maybe luckily, there is no single way to solve these kinds of problems. Methods developed in the 1960s (and some even in the beginning of the 21st century) are still popular, along with LSTMs and RNNs (not covered in this article). This is partially related to the fact that the prediction task, like any other data-related task, requires creativity in so many aspects and definitely requires research. In spite of the large number of formal quality metrics and approaches to parameter estimation, it is often necessary to try something different for each time series. Last but not least, the balance between quality and cost is important. As a good example, the SARIMA model can produce spectacular results after tuning but can require many hours of *tambourine dancing* time series manipulation, while a simple linear regression model can be built in 10 minutes and can achieve more or less comparable results.

# Useful resources
