
Commit

Change verbs to present mode in M3 and M5 (#749)
PatriOr authored Nov 6, 2023
1 parent 26ad6d2 commit 3a5b94a
Showing 12 changed files with 96 additions and 98 deletions.
8 changes: 4 additions & 4 deletions python_scripts/parameter_tuning_ex_02.py
@@ -68,10 +68,10 @@
# %% [markdown]
# Use the previously defined model (called `model`) and using two nested `for`
# loops, make a search of the best combinations of the `learning_rate` and
-# `max_leaf_nodes` parameters. In this regard, you will need to train and test
-# the model by setting the parameters. The evaluation of the model should be
-# performed using `cross_val_score` on the training set. We will use the
-# following parameters search:
+# `max_leaf_nodes` parameters. In this regard, you have to train and test the
+# model by setting the parameters. The evaluation of the model should be
+# performed using `cross_val_score` on the training set. Use the following
+# parameters search:
# - `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
# the ability of a new tree to correct the error of the previous sequence of
# trees
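A minimal sketch of the nested-loop search this exercise describes, in case the structure helps. The names `model`, `data_train` and `target_train` are assumed to exist in the notebook, the `classifier__` prefix assumes the boosting estimator is a pipeline step named `classifier`, and the `max_leaf_nodes` grid is illustrative (its values are not visible in this excerpt):

```python
from sklearn.model_selection import cross_val_score

learning_rates = [0.01, 0.1, 1, 10]  # values given in the exercise text
max_leaf_nodes_values = [3, 10, 30]  # assumed grid, not shown in this excerpt

best_score, best_params = -float("inf"), None
for lr in learning_rates:
    for mln in max_leaf_nodes_values:
        # Set the hyperparameters on the existing pipeline, then estimate its
        # generalization performance with cross-validation on the training set.
        model.set_params(
            classifier__learning_rate=lr,
            classifier__max_leaf_nodes=mln,
        )
        scores = cross_val_score(model, data_train, target_train, cv=2)
        if scores.mean() > best_score:
            best_score, best_params = scores.mean(), (lr, mln)

print(f"Best mean CV score: {best_score:.3f} with learning_rate={best_params[0]} "
      f"and max_leaf_nodes={best_params[1]}")
```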
2 changes: 1 addition & 1 deletion python_scripts/parameter_tuning_ex_03.py
@@ -29,7 +29,7 @@
)

# %% [markdown]
-# In this exercise, we will progressively define the regression pipeline and
+# In this exercise, we progressively define the regression pipeline and
# later tune its hyperparameters.
#
# Start by defining a pipeline that:
38 changes: 19 additions & 19 deletions python_scripts/parameter_tuning_grid_search.py
@@ -9,7 +9,7 @@
# # Hyperparameter tuning by grid-search
#
# In the previous notebook, we saw that hyperparameters can affect the
-# generalization performance of a model. In this notebook, we will show how to
+# generalization performance of a model. In this notebook, we show how to
# optimize hyperparameters using a grid-search approach.

# %% [markdown]
@@ -49,8 +49,8 @@
)

# %% [markdown]
-# We will define a pipeline as seen in the first module. It will handle both
-# numerical and categorical features.
+# We define a pipeline as seen in the first module, to handle both numerical and
+# categorical features.
#
# The first step is to select all the categorical columns.

@@ -61,7 +61,7 @@
categorical_columns = categorical_columns_selector(data)

# %% [markdown]
-# Here we will use a tree-based model as a classifier (i.e.
+# Here we use a tree-based model as a classifier (i.e.
# `HistGradientBoostingClassifier`). That means:
#
# * Numerical variables don't need scaling;
@@ -119,8 +119,8 @@
# code.
#
# Let's see how to use the `GridSearchCV` estimator for doing such search. Since
-# the grid-search will be costly, we will only explore the combination
-# learning-rate and the maximum number of nodes.
+# the grid-search is costly, we only explore the combination learning-rate and
+# the maximum number of nodes.

# %%
# %%time
@@ -134,7 +134,7 @@
model_grid_search.fit(data_train, target_train)

# %% [markdown]
-# Finally, we will check the accuracy of our model using the test set.
+# Finally, we check the accuracy of our model using the test set.

# %%
accuracy = model_grid_search.score(data_test, target_test)
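A hedged sketch of the grid-search the surrounding text describes, using the names `model`, `data_train`, `target_train`, `data_test` and `target_test` that appear in this notebook; the `classifier__` prefix assumes the boosting estimator is a pipeline step named `classifier`, and the value grids are illustrative:

```python
from sklearn.model_selection import GridSearchCV

# 4 learning-rate values x 3 max_leaf_nodes values = 12 combinations,
# consistent with the "4 x 3" count mentioned in the text below.
param_grid = {
    "classifier__learning_rate": (0.01, 0.1, 1, 10),
    "classifier__max_leaf_nodes": (3, 10, 30),
}
model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)
model_grid_search.fit(data_train, target_train)

accuracy = model_grid_search.score(data_test, target_test)
print(f"Accuracy on the test set: {accuracy:.3f}")
```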
@@ -155,17 +155,17 @@

# %% [markdown]
# The `GridSearchCV` estimator takes a `param_grid` parameter which defines all
-# hyperparameters and their associated values. The grid-search will be in charge
+# hyperparameters and their associated values. The grid-search is in charge
# of creating all possible combinations and test them.
#
-# The number of combinations will be equal to the product of the number of
-# values to explore for each parameter (e.g. in our example 4 x 3 combinations).
-# Thus, adding new parameters with their associated values to be explored become
+# The number of combinations are equal to the product of the number of values to
+# explore for each parameter (e.g. in our example 4 x 3 combinations). Thus,
+# adding new parameters with their associated values to be explored become
# rapidly computationally expensive.
#
# Once the grid-search is fitted, it can be used as any other predictor by
-# calling `predict` and `predict_proba`. Internally, it will use the model with
-# the best parameters found during `fit`.
+# calling `predict` and `predict_proba`. Internally, it uses the model with the
+# best parameters found during `fit`.
#
# Get predictions for the 5 first samples using the estimator with the best
# parameters.
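A short hedged illustration of that last point, assuming the fitted `model_grid_search` and the `data_test` dataframe from above:

```python
# The grid-search exposes the best parameter combination found during `fit`
# and predicts with the model refitted on those parameters.
print(model_grid_search.best_params_)
predictions = model_grid_search.predict(data_test.iloc[0:5])
print(predictions)
```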
@@ -186,8 +186,8 @@
# parameters "by hand" through a double for loop.
#
# In addition, we can inspect all results which are stored in the attribute
-# `cv_results_` of the grid-search. We will filter some specific columns from
-# these results.
+# `cv_results_` of the grid-search. We filter some specific columns from these
+# results.

# %%
cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
@@ -220,9 +220,9 @@ def shorten_param(param_name):
# With only 2 parameters, we might want to visualize the grid-search as a
# heatmap. We need to transform our `cv_results` into a dataframe where:
#
-# - the rows will correspond to the learning-rate values;
-# - the columns will correspond to the maximum number of leaf;
-# - the content of the dataframe will be the mean test scores.
+# - the rows correspond to the learning-rate values;
+# - the columns correspond to the maximum number of leaf;
+# - the content of the dataframe is the mean test scores.

# %%
pivoted_cv_results = cv_results.pivot_table(
@@ -259,7 +259,7 @@ def shorten_param(param_name):
#
# The precise meaning of those two parameters will be explained later.
#
-# For now we will note that, in general, **there is no unique optimal parameter
+# For now we note that, in general, **there is no unique optimal parameter
# setting**: 4 models out of the 12 parameter configurations reach the maximal
# accuracy (up to small random fluctuations caused by the sampling of the
# training set).
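Since the pivoting step is truncated in this diff, here is a hedged sketch of the heatmap it leads to. It assumes `cv_results` holds the grid-search results with the parameter columns already shortened (for example to `learning_rate` and `max_leaf_nodes` by the `shorten_param` helper named in the hunk headers):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Rows: learning-rate values; columns: maximum number of leaf nodes;
# cell values: mean cross-validated test scores.
pivoted_cv_results = cv_results.pivot_table(
    values="mean_test_score",
    index=["learning_rate"],
    columns=["max_leaf_nodes"],
)
ax = sns.heatmap(pivoted_cv_results, annot=True, cmap="YlGnBu")
ax.invert_yaxis()
plt.show()
```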
12 changes: 6 additions & 6 deletions python_scripts/parameter_tuning_nested.py
@@ -12,12 +12,12 @@
# However, we did not present a proper framework to evaluate the tuned models.
# Instead, we focused on the mechanism used to find the best set of parameters.
#
-# In this notebook, we will reuse some knowledge presented in the module
-# "Selecting the best model" to show how to evaluate models where
-# hyperparameters need to be tuned.
+# In this notebook, we reuse some knowledge presented in the module "Selecting
+# the best model" to show how to evaluate models where hyperparameters need to
+# be tuned.
#
-# Thus, we will first load the dataset and create the predictive model that we
-# want to optimize and later on, evaluate.
+# Thus, we first load the dataset and create the predictive model that we want
+# to optimize and later on, evaluate.
#
# ## Loading the dataset
#
@@ -111,7 +111,7 @@
# ### With hyperparameter tuning
#
# As shown in the previous notebook, one can use a search strategy that uses
-# cross-validation to find the best set of parameters. Here, we will use a
+# cross-validation to find the best set of parameters. Here, we use a
# grid-search strategy and reproduce the steps done in the previous notebook.
#
# First, we have to embed our model into a grid-search and specify the
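One way to evaluate a tuned model along the lines described above is to wrap the grid-search itself in an outer cross-validation. This is only a sketch with assumed names (`model`, `data`, `target`, a pipeline step called `classifier`) and illustrative grids, not necessarily the exact code of this notebook:

```python
from sklearn.model_selection import GridSearchCV, cross_validate

param_grid = {
    "classifier__learning_rate": (0.05, 0.5),
    "classifier__max_leaf_nodes": (10, 30),
}
model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)

# The inner cross-validation (inside GridSearchCV) selects the hyperparameters;
# the outer one estimates the generalization performance of the whole
# tuning-plus-fitting procedure.
cv_results = cross_validate(model_grid_search, data, target, cv=3, n_jobs=2)
scores = cv_results["test_score"]
print(f"Generalization score: {scores.mean():.3f} ± {scores.std():.3f}")
```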
4 changes: 2 additions & 2 deletions python_scripts/parameter_tuning_parallel_plot.py
@@ -110,8 +110,8 @@ def shorten_param(param_name):
# spread the active ranges and improve the readability of the plot.
# ```
#
-# The parallel coordinates plot will display the values of the hyperparameters
-# on different columns while the performance metric is color coded. Thus, we are
+# The parallel coordinates plot displays the values of the hyperparameters on
+# different columns while the performance metric is color coded. Thus, we are
# able to quickly inspect if there is a range of hyperparameters which is
# working or not.
#
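As a hedged illustration of such a plot, one possibility is `plotly.express`. It assumes `cv_results` contains one numeric column per shortened hyperparameter name plus a `mean_test_score` column; the column list below is illustrative:

```python
import plotly.express as px

fig = px.parallel_coordinates(
    cv_results,
    color="mean_test_score",
    dimensions=["learning_rate", "max_leaf_nodes", "mean_test_score"],
    color_continuous_scale=px.colors.sequential.Viridis,
)
fig.show()
```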
8 changes: 4 additions & 4 deletions python_scripts/parameter_tuning_sol_02.py
@@ -62,10 +62,10 @@
# %% [markdown]
# Use the previously defined model (called `model`) and using two nested `for`
# loops, make a search of the best combinations of the `learning_rate` and
-# `max_leaf_nodes` parameters. In this regard, you will need to train and test
-# the model by setting the parameters. The evaluation of the model should be
-# performed using `cross_val_score` on the training set. We will use the
-# following parameters search:
+# `max_leaf_nodes` parameters. In this regard, you need to train and test the
+# model by setting the parameters. The evaluation of the model should be
+# performed using `cross_val_score` on the training set. Use the following
+# parameters search:
# - `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
# the ability of a new tree to correct the error of the previous sequence of
# trees
10 changes: 5 additions & 5 deletions python_scripts/parameter_tuning_sol_03.py
@@ -23,8 +23,8 @@
)

# %% [markdown]
-# In this exercise, we will progressively define the regression pipeline and
-# later tune its hyperparameters.
+# In this exercise, we progressively define the regression pipeline and later
+# tune its hyperparameters.
#
# Start by defining a pipeline that:
# * uses a `StandardScaler` to normalize the numerical data;
@@ -108,8 +108,8 @@
cv_results = pd.DataFrame(model_random_search.cv_results_)

# %% [markdown] tags=["solution"]
-# To simplify the axis of the plot, we will rename the column of the dataframe
-# and only select the mean test score and the value of the hyperparameters.
+# To simplify the axis of the plot, we rename the column of the dataframe and
+# only select the mean test score and the value of the hyperparameters.

# %% tags=["solution"]
column_name_mapping = {
@@ -170,7 +170,7 @@
# vary between 0 and 10,000 (e.g. the variable `"Population"`) and B is a
# feature that varies between 1 and 10 (e.g. the variable `"AveRooms"`), then
# distances between samples (rows of the dataframe) are mostly impacted by
-# differences in values of the column A, while values of the column B will be
+# differences in values of the column A, while values of the column B are
# comparatively ignored. If one applies StandardScaler to such a database, both
# the values of A and B will be approximately between -3 and 3 and the neighbor
# structure will be impacted more or less equivalently by both variables.
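A tiny numeric illustration of that scaling argument, with toy values rather than the actual housing data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two features with very different ranges, mimicking "Population" and "AveRooms".
df = pd.DataFrame(
    {"Population": [300.0, 1500.0, 8000.0, 9500.0], "AveRooms": [2.0, 4.5, 6.0, 9.5]}
)
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# After standardization both columns span a comparable range around 0, so
# neighbor-based distances are influenced by both of them.
print(scaled.round(2))
```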
16 changes: 8 additions & 8 deletions python_scripts/trees_dataset.py
@@ -15,7 +15,7 @@
#
# ## Classification dataset
#
-# We will use this dataset in classification setting to predict the penguins'
+# We use this dataset in classification setting to predict the penguins'
# species from anatomical information.
#
# Each penguin is from one of the three following species: Adelie, Gentoo, and
@@ -26,15 +26,15 @@
# penguins](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png)
#
# This problem is a classification problem since the target is categorical. We
-# will limit our input data to a subset of the original features to simplify our
-# explanations when presenting the decision tree algorithm. Indeed, we will use
+# limit our input data to a subset of the original features to simplify our
+# explanations when presenting the decision tree algorithm. Indeed, we use
# features based on penguins' culmen measurement. You can learn more about the
# penguins' culmen with the illustration below:
#
# ![Image of
# culmen](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/culmen_depth.png)
#
-# We will start by loading this subset of the dataset.
+# We start by loading this subset of the dataset.

# %%
import pandas as pd
@@ -73,11 +73,11 @@
#
# In a regression setting, the target is a continuous variable instead of
# categories. Here, we use two features of the dataset to make such a problem:
-# the flipper length will be used as data and the body mass will be the target.
-# In short, we want to predict the body mass using the flipper length.
+# the flipper length is used as data and the body mass as the target. In short,
+# we want to predict the body mass using the flipper length.
#
-# We will load the dataset and visualize the relationship between the flipper
-# length and the body mass of penguins.
+# We load the dataset and visualize the relationship between the flipper length
+# and the body mass of penguins.

# %%
penguins = pd.read_csv("../datasets/penguins_regression.csv")
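A hedged sketch of that loading-and-plotting step; the CSV path comes from the hunk above, while the column names are assumptions about the file's header and may need adapting:

```python
import pandas as pd
import matplotlib.pyplot as plt

penguins = pd.read_csv("../datasets/penguins_regression.csv")

feature_name, target_name = "Flipper Length (mm)", "Body Mass (g)"  # assumed names
penguins.plot.scatter(x=feature_name, y=target_name, alpha=0.5)
plt.title("Body mass as a function of flipper length")
plt.show()
```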
10 changes: 5 additions & 5 deletions python_scripts/trees_ex_02.py
@@ -20,7 +20,7 @@
# By extrapolation, we refer to values predicted by a model outside of the range
# of feature values seen during the training.
#
-# We will first load the regression data.
+# We first load the regression data.

# %%
import pandas as pd
@@ -61,10 +61,10 @@
# Write your code here.

# %% [markdown]
-# Now, we will check the extrapolation capabilities of each model. Create a
-# dataset containing a broader range of values than your previous dataset, in
-# other words, add values below and above the minimum and the maximum of the
-# flipper length seen during training.
+# Now, we check the extrapolation capabilities of each model. Create a dataset
+# containing a broader range of values than your previous dataset, in other
+# words, add values below and above the minimum and the maximum of the flipper
+# length seen during training.

# %%
# Write your code here.
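One hedged way to build such an extended range, assuming the training features live in a single-column dataframe called `data_train` whose column name is stored in `feature_name` (names not shown in this excerpt, chosen here for illustration):

```python
import numpy as np
import pandas as pd

offset = 30  # arbitrary margin below/above the observed flipper lengths (mm)
data_extrapolation = pd.DataFrame(
    np.arange(
        data_train[feature_name].min() - offset,
        data_train[feature_name].max() + offset,
    ),
    columns=[feature_name],
)
```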

