
Commit

fix typos
Yorko committed Dec 28, 2021
1 parent ea4da81 commit ad608c9
Showing 15 changed files with 54 additions and 34 deletions.
21 changes: 20 additions & 1 deletion .yaspellerrc.json
@@ -52,28 +52,40 @@
"Uvarov",
"Yuanyuan",
"Yury.*",
"accelerometer.*",
"ai",
"alco",
"autoregress.*",
"bivariate",
"boxplot",
"cardio",
"CatBoost",
"cheatsheet",
"dask",
"dataframe.*",
"DataFrame.*",
"dataset.*",
"diastolic",
"differentiable",
"distplot",
"habr.*",
"heatmap",
"hyperparam.*",
"hyperplane",
"imbalanced",
"interpretable",
"interpretability",
"gbm",
"geodata",
"GitHub",
"groupby",
"inclass",
"jointplot",
"Jupyter",
"jupyter",
"jupytext",
"Kaggle",
"leaderboard",
"LightGBM",
"MATLAB.*",
"matplotlib",
"md",
@@ -82,6 +94,7 @@
"myst",
"numpy",
"nonlinear.*",
"overfit.*",
"pairplot.*",
"pairwise",
"pandas",
@@ -91,7 +104,9 @@
"Ph. D.", "Ph.D.*",
"plotly",
"python.*",
"regressor.*",
"regularization",
"reproducibility",
"quartile.*",
"repo.*",
"vs",
@@ -100,12 +115,16 @@
"skew.*",
"sklearn.*",
"SVD",
"tokeniz.*",
"uncorrelated",
"underfit.*",
"univariate",
"unlabeled",
"unsupervised",
"variational",
"videolecture.*",
"voicemail",
"xgboost",
"yorko"
]
}
5 changes: 3 additions & 2 deletions mlcourse_ai_jupyter_book/book/index.md
@@ -35,8 +35,9 @@ mlcourse.ai is never supposed to go fully monetized (it's created in the wonderf
</div>
</div><br>

The bonus pack contains 10 assignments, in some of them you are challenged to beat a baseline in a Kaggle competition under thorough guidance (["Alice"](bonus04) and "Medium") or implement an algorithm from scratch (SGD regression model and gradient boosting).
The bonus pack contains 10 assignments, in some of them you are challenged to beat a baseline in a Kaggle competition under thorough guidance (["Alice"](bonus04) and ["Medium"](bonus06)) or implement an algorithm from scratch -- efficient stochastic gradient descent [classifier](bonus08) and [gradient boosting](bonus10).

<!--
Below you can see the course program (click to enlarge).
- **Green** stands for basic content outlined in the articles;
@@ -48,4 +49,4 @@
```{figure} /_static/program/program.svg
:name: course_program
:width: 600px
```
```-->
@@ -220,4 +220,4 @@ data.info()
# rf_importance.sort_values
```

**Make conclusions about the perdormance of the explored 3 models in this particular prediction task.**
**Make conclusions about the performance of the explored 3 models in this particular prediction task.**
@@ -271,6 +271,6 @@ print(
# rf_importance.sort_values(by="coef", ascending=False)
```

**Make conclusions about the perdormance of the explored 3 models in this particular prediction task.**
**Make conclusions about the performance of the explored 3 models in this particular prediction task.**

The depency of wine quality on other features in hand is, presumable, non-linear. So Random Forest works better in this task.
The dependency of wine quality on the other features at hand is, presumably, non-linear. So Random Forest works better in this task.
@@ -288,7 +288,7 @@ import reverse_geocoder as revgc
revgc.search(list(zip(df.latitude, df.longitude)))
```

When working with geoсoding, we must not forget that addresses may contain typos, which makes the data cleaning step necessary. Coordinates contain fewer misprints, but its position can be incorrect due to GPS noise or bad accuracy in places like tunnels, downtown areas, etc. If the data source is a mobile device, the geolocation may not be determined by GPS but by WiFi networks in the area, which leads to holes in space and teleportation. While traveling along in Manhattan, there can suddenly be a WiFi location from Chicago.
When working with geocoding, we must not forget that addresses may contain typos, which makes the data cleaning step necessary. Coordinates contain fewer misprints, but their positions can be incorrect due to GPS noise or bad accuracy in places like tunnels, downtown areas, etc. If the data source is a mobile device, the geolocation may not be determined by GPS but by WiFi networks in the area, which leads to holes in space and teleportation. While traveling around Manhattan, there can suddenly be a WiFi location from Chicago.

> WiFi location tracking is based on the combination of SSID and MAC-addresses, which may correspond to different points e.g. federal provider standardizes the firmware of routers up to MAC-address and places them in different cities. Even a company's move to another office with its routers can cause issues.
@@ -12,7 +12,7 @@ kernelspec:

(assignment07_solution)=

# Assignment #7 (demo). Unupervised learning. Solution
# Assignment #7 (demo). Unsupervised learning. Solution

<img src="https://habrastorage.org/webt/ia/m9/zk/iam9zkyzqebnf_okxipihkgjwnw.jpeg" />

@@ -417,9 +417,9 @@ We see that $J(C_k)$ decreases significantly until the number of clusters is 3 a

### Issues

Inherently, K-means is NP-hard. For $d$ dimensions, $k$ clusters, and $n$ observations, we will find a solution in $O(n^{d k+1})$ time. There are some heuristics to deal with this; an example is MiniBatch K-means, which takes portions (batches) of data instead of fitting the whole dataset and then moves centroids by taking the average of the previous steps. Compare the implementation of K-means and MiniBatch K-means in the [sckit-learn documentation](http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html).
Inherently, K-means is NP-hard. For $d$ dimensions, $k$ clusters, and $n$ observations, we will find a solution in $O(n^{d k+1})$ time. There are some heuristics to deal with this; an example is MiniBatch K-means, which takes portions (batches) of data instead of fitting the whole dataset and then moves centroids by taking the average of the previous steps. Compare the implementation of K-means and MiniBatch K-means in the [scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html).

The [implemetation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) of the algorithm using `scikit-learn` has its benefits such as the possibility to state the number of initializations with the `n_init` function parameter, which enables us to identify more robust centroids. Moreover, these runs can be done in parallel to decrease the computation time.
The [implementation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) of the algorithm using `scikit-learn` has its benefits such as the possibility to state the number of initializations with the `n_init` function parameter, which enables us to identify more robust centroids. Moreover, these runs can be done in parallel to decrease the computation time.
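A quick, self-contained sketch of both estimators and the `n_init` parameter (toy data, not the course dataset):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.random.RandomState(42).normal(size=(10_000, 2))  # toy data

# Full K-means: n_init restarts from different random centroids, best inertia wins
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# MiniBatch K-means: centroids are updated from small random batches of the data
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, random_state=42).fit(X)

print(km.inertia_, mbk.inertia_)  # MiniBatch is faster, usually at a small cost in inertia
```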

## Affinity Propagation

@@ -469,7 +469,7 @@ $d(C_i, C_j) = ||\mu_i - \mu_j||$

The 3rd one is the most effective in computation time since it does not require recomputing the distances every time the clusters are merged.

The results can be visualized as a beautiful cluster tree (dendogram) to help recognize the moment the algorithm should be stopped to get optimal results. There are plenty of Python tools to build these dendograms for agglomerative clustering.
The results can be visualized as a beautiful cluster tree (dendrogram) to help recognize the moment the algorithm should be stopped to get optimal results. There are plenty of Python tools to build these dendrograms for agglomerative clustering.

Let's consider an example with the clusters we got from K-means:

@@ -500,7 +500,7 @@ dn = hierarchy.dendrogram(Z, color_threshold=0.5)
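A minimal self-contained sketch of building the linkage matrix `Z` and the dendrogram shown in the context line above (synthetic data, not the course's clusters):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

X = np.random.RandomState(42).normal(size=(30, 2))  # toy observations

distances = pdist(X)                             # condensed pairwise distance matrix
Z = hierarchy.linkage(distances, method="ward")  # sequence of agglomerative merges
dn = hierarchy.dendrogram(Z, color_threshold=0.5)
plt.show()
```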

## Accuracy metrics

As opposed to classfication, it is difficult to assess the quality of results from clustering. Here, a metric cannot depend on the labels but only on the goodness of split. Secondly, we do not usually have true labels of the observations when we use clustering.
As opposed to classification, it is difficult to assess the quality of results from clustering. Here, a metric cannot depend on the labels but only on the goodness of split. Secondly, we do not usually have true labels of the observations when we use clustering.

There are *internal* and *external* goodness metrics. External metrics use the information about the known true split while internal metrics do not use any external information and assess the goodness of clusters based only on the initial data. The optimal number of clusters is usually defined with respect to some internal metrics.
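A short sketch of one external and one internal metric from `sklearn.metrics` (synthetic blobs, purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=500, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# External metric: compares the split against known true labels
print(adjusted_rand_score(y_true, labels))

# Internal metric: uses only the data and the predicted labels
print(silhouette_score(X, labels))
```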

@@ -36,7 +36,7 @@ This assignment is an extended version of the [demo assignment](assignment08) wh
</div>


Finally, we implement both an SGD regressor and an SGD classifier from scrath and validate both with real-world datasets.
Finally, we implement both an SGD regressor and an SGD classifier from scratch and validate both with real-world datasets.

<p float="left">
<img src="../../_static/img/assignment8_teaser_sdg_classifier.png" width="450" />
@@ -18,7 +18,7 @@ kernelspec:

**<center>[mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course** </center><br>

Author: [Yury Kashnitskiy](https://yorko.github.io). Translated and edited by [Serge Oreshkov](https://www.linkedin.com/in/sergeoreshkov/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.
Author: [Yury Kashnitsky](https://yorko.github.io). Translated and edited by [Serge Oreshkov](https://www.linkedin.com/in/sergeoreshkov/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.


This week, we'll cover two reasons for Vowpal Wabbit’s exceptional training speed, namely, online learning and hashing trick, in both theory and practice. We will try it out with news, movie reviews, and StackOverflow questions.
@@ -224,7 +224,7 @@ df.loc[1].job - df.loc[2].job
```


Does this operation make any sense? Not really. Let's try to train logisitic regression with this feature transformation.
Does this operation make any sense? Not really. Let's try to train logistic regression with this feature transformation.


```{code-cell} ipython3
@@ -336,7 +336,7 @@ A good analysis of hash collisions, their dependency on feature space and hashin

## 3. Vowpal Wabbit

[Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) (VW) is one of the most widespread machine learning libraries used in industry. It is prominent for its training speed and support of many training modes, especially for online learning with big and high-dimentional data. This is one of the major merits of the library. Also, with the hashing trick implemented, Vowpal Wabbit is a perfect choice for working with text data.
[Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit) (VW) is one of the most widespread machine learning libraries used in industry. It is prominent for its training speed and support of many training modes, especially for online learning with big and high-dimensional data. This is one of the major merits of the library. Also, with the hashing trick implemented, Vowpal Wabbit is a perfect choice for working with text data.

Shell is the main interface for VW.
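VW itself is driven from the shell, but the hashing trick it relies on can be sketched in pure Python with scikit-learn's `HashingVectorizer` (an illustration of the idea, not of VW's API):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["free money now", "meeting at noon", "free free offer"]

# Words are mapped to a fixed number of columns by a hash function,
# so no vocabulary is stored and memory does not grow with new words
vectorizer = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3, 1024) regardless of vocabulary size
```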

2 changes: 1 addition & 1 deletion mlcourse_ai_jupyter_book/book/topic08/videolecture08.md
@@ -7,6 +7,6 @@
<p align="center"><iframe width="560" height="315" style='' src="https://www.youtube.com/embed/EUSXbdzaQE8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>


2\. An overview of Vowpal Wabbit, a library that allows balzingly fast learning
2\. An overview of Vowpal Wabbit, a library that allows blazingly fast learning.

<p align="center"><iframe width="560" height="315" style='' src="https://www.youtube.com/embed/gyCjancgR9U" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
2 changes: 1 addition & 1 deletion mlcourse_ai_jupyter_book/book/topic09/topic09_intro.md
@@ -21,4 +21,4 @@ Here we discuss various approaches to working with time series: what types of da

4\. Check out the [solution](assignment09_solution) (same as a [Kaggle Notebook](https://www.kaggle.com/kashnitsky/a9-demo-time-series-analysis-solution)) to the demo assignment (optional);

5\. Complete [Bonus Assignment 9](bonus09) where you'll engineer some features and apply a m achine learning model to a time series prediction task (optional, available under Patreon ["Bonus Assignments" tier](https://www.patreon.com/ods_mlcourse)).
5\. Complete [Bonus Assignment 9](bonus09) where you'll engineer some features and apply a machine learning model to a time series prediction task (optional, available under Patreon ["Bonus Assignments" tier](https://www.patreon.com/ods_mlcourse)).
@@ -55,7 +55,7 @@ We begin with a simple [definition](https://en.wikipedia.org/wiki/Time_series) o
Therefore, the data is organized by relatively deterministic timestamps, and may, compared to random sample data, contain additional information that we can extract.

Let's import some libraries. First, we will need the [statsmodels](http://statsmodels.sourceforge.net/stable/) library, which has many statistical modeling functions, including time series. For R afficionados who had to move to Python, `statsmodels` will definitely look more familiar since it supports model definitions like 'Wage ~ Age + Education'.
Let's import some libraries. First, we will need the [statsmodels](http://statsmodels.sourceforge.net/stable/) library, which has many statistical modeling functions, including time series. For R aficionados who had to move to Python, `statsmodels` will definitely look more familiar since it supports model definitions like 'Wage ~ Age + Education'.
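A minimal sketch of such an R-style formula with `statsmodels` (hypothetical columns, just to show the interface):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with the columns used in the formula from the text
df = pd.DataFrame({
    "Wage": [30, 45, 50, 28, 60],
    "Age": [25, 40, 45, 23, 50],
    "Education": [12, 16, 18, 12, 20],
})

model = smf.ols("Wage ~ Age + Education", data=df).fit()
print(model.params)
```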


```{code-cell} ipython3
@@ -129,7 +129,7 @@ $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
sklearn.metrics.r2_score
```
---
- [Mean Absolute Error](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error): this is an interpretable metric because it has the same unit of measurment as the initial series, $[0, +\infty)$
- [Mean Absolute Error](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error): this is an interpretable metric because it has the same unit of measurement as the initial series, $[0, +\infty)$

$MAE = \frac{\sum\limits_{i=1}^{n} |y_i - \hat{y}_i|}{n}$
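A one-line check of the formula with scikit-learn (toy numbers):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Same as np.mean(np.abs(y_true - y_pred))
print(mean_absolute_error(y_true, y_pred))  # 0.75
```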

@@ -850,7 +850,7 @@ If a process is stationary, that means it does not change its statistical proper

<img src="https://habrastorage.org/files/2f6/1ee/cb2/2f61eecb20714352840748b826e38680.png"/>

So why is stationarity so important? Because it is easy to make predictions on a stationary series since we can assume that the future statistical properties will not be different from those currently observed. Most of the time-series models, in one way or the other, try to predict those properties (mean or variance, for example). Furture predictions would be wrong if the original series were not stationary. Unfortunately, most of the time series that we see outside of textbooks are non-stationary, but we can (and should) change this.
So why is stationarity so important? Because it is easy to make predictions on a stationary series since we can assume that the future statistical properties will not be different from those currently observed. Most of the time-series models, in one way or the other, try to predict those properties (mean or variance, for example). Future predictions would be wrong if the original series were not stationary. Unfortunately, most of the time series that we see outside of textbooks are non-stationary, but we can (and should) change this.

So, in order to combat non-stationarity, we have to know our enemy, so to speak. Let's see how we can detect it. We will look at white noise and random walks to learn how to get from one to another for free.
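A minimal sketch of that comparison using the (augmented) Dickey-Fuller test from `statsmodels` (synthetic series, purely illustrative):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.RandomState(42)
white_noise = rng.normal(size=1000)              # stationary by construction
random_walk = np.cumsum(rng.normal(size=1000))   # non-stationary

# The second element of the returned tuple is the p-value of the unit-root test
print(adfuller(white_noise)[1])  # tiny p-value: non-stationarity is rejected
print(adfuller(random_walk)[1])  # large p-value: cannot reject non-stationarity
```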

@@ -953,7 +953,7 @@ ads_diff = ads_diff - ads_diff.shift(1)
tsplot(ads_diff[24 + 1 :], lags=60)
```

Perfect! Our series now looks like something undescribable, oscillating around zero. The Dickey-Fuller test indicates that it is stationary, and the number of significant peaks in ACF has dropped. We can finally start modeling!
Perfect! Our series now looks like something indescribable, oscillating around zero. The Dickey-Fuller test indicates that it is stationary, and the number of significant peaks in ACF has dropped. We can finally start modeling!

## ARIMA-family Crash-Course

Expand All @@ -980,7 +980,7 @@ With this, we have three parameters: $(P, D, Q)$

- $Q$ - similar logic using the ACF plot instead.

- $D$ - order of seasonal integration. This can be equal to 1 or 0, depending on whether seasonal differeces were applied or not.
- $D$ - order of seasonal integration. This can be equal to 1 or 0, depending on whether seasonal differences were applied or not.
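Once $(p, d, q)$ and $(P, D, Q, s)$ are chosen, they map directly onto `SARIMAX` in `statsmodels`; a sketch with a made-up hourly series and an assumed 24-hour season:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical hourly series standing in for the ads data
idx = pd.date_range("2021-01-01", periods=500, freq="H")
series = pd.Series(np.random.RandomState(42).normal(size=500).cumsum(), index=idx)

# order=(p, d, q), seasonal_order=(P, D, Q, s) with s = 24 hours
results = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24)).fit(disp=False)
print(results.aic)
```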

Now that we know how to set the initial parameters, let's have a look at the final plot once again and set the parameters:

@@ -1141,7 +1141,7 @@ This approach is not backed by theory and breaks several assumptions (e.g. Gauss

## Feature extraction

The model needs features, and all we have is a 1-dimentional time series. What features can we extract?
The model needs features, and all we have is a 1-dimensional time series. What features can we extract?
* Time series lags
* Window statistics:
- Max/min value of series in a window
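A short pandas sketch of the first two feature families above, lags and window statistics, on a made-up series:

```python
import pandas as pd

# Hypothetical hourly series standing in for the course's ads data
ads = pd.Series(range(100), index=pd.date_range("2021-01-01", periods=100, freq="H"))

features = pd.DataFrame(index=ads.index)
for lag in (1, 2, 24):                       # time series lags
    features[f"lag_{lag}"] = ads.shift(lag)

window = ads.rolling(window=24)              # window statistics
features["rolling_mean_24"] = window.mean()
features["rolling_max_24"] = window.max()
features["rolling_min_24"] = window.min()

print(features.dropna().head())
```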