
Commit

Change verbs to present mode in M3 and M5 (#749)
PatriOr authored Nov 6, 2023
1 parent 26ad6d2 commit 3a5b94a
Showing 12 changed files with 96 additions and 98 deletions.
8 changes: 4 additions & 4 deletions python_scripts/parameter_tuning_ex_02.py
@@ -68,10 +68,10 @@
# %% [markdown]
# Use the previously defined model (called `model`) and using two nested `for`
# loops, make a search of the best combinations of the `learning_rate` and
-# `max_leaf_nodes` parameters. In this regard, you will need to train and test
-# the model by setting the parameters. The evaluation of the model should be
-# performed using `cross_val_score` on the training set. We will use the
-# following parameters search:
+# `max_leaf_nodes` parameters. In this regard, you have to train and test the
+# model by setting the parameters. The evaluation of the model should be
+# performed using `cross_val_score` on the training set. Use the following
+# parameters search:
# - `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
# the ability of a new tree to correct the error of the previous sequence of
# trees
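A minimal sketch of the nested-loop search this exercise describes, in case the structure helps. The names `model`, `data_train` and `target_train` are assumed to exist in the notebook, the `classifier__` prefix assumes the boosting estimator is a pipeline step named `classifier`, and the `max_leaf_nodes` grid is illustrative (its values are not visible in this excerpt):

```python
from sklearn.model_selection import cross_val_score

learning_rates = [0.01, 0.1, 1, 10]  # values given in the exercise text
max_leaf_nodes_values = [3, 10, 30]  # assumed grid, not shown in this excerpt

best_score, best_params = -float("inf"), None
for lr in learning_rates:
    for mln in max_leaf_nodes_values:
        # Set the hyperparameters on the existing pipeline, then estimate its
        # generalization performance with cross-validation on the training set.
        model.set_params(
            classifier__learning_rate=lr,
            classifier__max_leaf_nodes=mln,
        )
        scores = cross_val_score(model, data_train, target_train, cv=2)
        if scores.mean() > best_score:
            best_score, best_params = scores.mean(), (lr, mln)

print(f"Best mean CV score: {best_score:.3f} with learning_rate={best_params[0]} "
      f"and max_leaf_nodes={best_params[1]}")
```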
2 changes: 1 addition & 1 deletion python_scripts/parameter_tuning_ex_03.py
@@ -29,7 +29,7 @@
)

# %% [markdown]
-# In this exercise, we will progressively define the regression pipeline and
+# In this exercise, we progressively define the regression pipeline and
# later tune its hyperparameters.
#
# Start by defining a pipeline that:
38 changes: 19 additions & 19 deletions python_scripts/parameter_tuning_grid_search.py
@@ -9,7 +9,7 @@
# # Hyperparameter tuning by grid-search
#
# In the previous notebook, we saw that hyperparameters can affect the
-# generalization performance of a model. In this notebook, we will show how to
+# generalization performance of a model. In this notebook, we show how to
# optimize hyperparameters using a grid-search approach.

# %% [markdown]
@@ -49,8 +49,8 @@
)

# %% [markdown]
-# We will define a pipeline as seen in the first module. It will handle both
-# numerical and categorical features.
+# We define a pipeline as seen in the first module, to handle both numerical and
+# categorical features.
#
# The first step is to select all the categorical columns.

@@ -61,7 +61,7 @@
categorical_columns = categorical_columns_selector(data)

# %% [markdown]
-# Here we will use a tree-based model as a classifier (i.e.
+# Here we use a tree-based model as a classifier (i.e.
# `HistGradientBoostingClassifier`). That means:
#
# * Numerical variables don't need scaling;
@@ -119,8 +119,8 @@
# code.
#
# Let's see how to use the `GridSearchCV` estimator for doing such search. Since
-# the grid-search will be costly, we will only explore the combination
-# learning-rate and the maximum number of nodes.
+# the grid-search is costly, we only explore the combination learning-rate and
+# the maximum number of nodes.

# %%
# %%time
@@ -134,7 +134,7 @@
model_grid_search.fit(data_train, target_train)

# %% [markdown]
-# Finally, we will check the accuracy of our model using the test set.
+# Finally, we check the accuracy of our model using the test set.

# %%
accuracy = model_grid_search.score(data_test, target_test)
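A hedged sketch of the grid-search the surrounding text describes, using the names `model`, `data_train`, `target_train`, `data_test` and `target_test` that appear in this notebook; the `classifier__` prefix assumes the boosting estimator is a pipeline step named `classifier`, and the value grids are illustrative:

```python
from sklearn.model_selection import GridSearchCV

# 4 learning-rate values x 3 max_leaf_nodes values = 12 combinations,
# consistent with the "4 x 3" count mentioned in the text below.
param_grid = {
    "classifier__learning_rate": (0.01, 0.1, 1, 10),
    "classifier__max_leaf_nodes": (3, 10, 30),
}
model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)
model_grid_search.fit(data_train, target_train)

accuracy = model_grid_search.score(data_test, target_test)
print(f"Accuracy on the test set: {accuracy:.3f}")
```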
@@ -155,17 +155,17 @@

# %% [markdown]
# The `GridSearchCV` estimator takes a `param_grid` parameter which defines all
-# hyperparameters and their associated values. The grid-search will be in charge
+# hyperparameters and their associated values. The grid-search is in charge
# of creating all possible combinations and test them.
#
-# The number of combinations will be equal to the product of the number of
-# values to explore for each parameter (e.g. in our example 4 x 3 combinations).
-# Thus, adding new parameters with their associated values to be explored become
+# The number of combinations are equal to the product of the number of values to
+# explore for each parameter (e.g. in our example 4 x 3 combinations). Thus,
+# adding new parameters with their associated values to be explored become
# rapidly computationally expensive.
#
# Once the grid-search is fitted, it can be used as any other predictor by
-# calling `predict` and `predict_proba`. Internally, it will use the model with
-# the best parameters found during `fit`.
+# calling `predict` and `predict_proba`. Internally, it uses the model with the
+# best parameters found during `fit`.
#
# Get predictions for the 5 first samples using the estimator with the best
# parameters.
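A short hedged illustration of that last point, assuming the fitted `model_grid_search` and the `data_test` dataframe from above:

```python
# The grid-search exposes the best parameter combination found during `fit`
# and predicts with the model refitted on those parameters.
print(model_grid_search.best_params_)
predictions = model_grid_search.predict(data_test.iloc[0:5])
print(predictions)
```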
@@ -186,8 +186,8 @@
# parameters "by hand" through a double for loop.
#
# In addition, we can inspect all results which are stored in the attribute
-# `cv_results_` of the grid-search. We will filter some specific columns from
-# these results.
+# `cv_results_` of the grid-search. We filter some specific columns from these
+# results.

# %%
cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
@@ -220,9 +220,9 @@ def shorten_param(param_name):
# With only 2 parameters, we might want to visualize the grid-search as a
# heatmap. We need to transform our `cv_results` into a dataframe where:
#
-# - the rows will correspond to the learning-rate values;
-# - the columns will correspond to the maximum number of leaf;
-# - the content of the dataframe will be the mean test scores.
+# - the rows correspond to the learning-rate values;
+# - the columns correspond to the maximum number of leaf;
+# - the content of the dataframe is the mean test scores.

# %%
pivoted_cv_results = cv_results.pivot_table(
@@ -259,7 +259,7 @@ def shorten_param(param_name):
#
# The precise meaning of those two parameters will be explained later.
#
-# For now we will note that, in general, **there is no unique optimal parameter
+# For now we note that, in general, **there is no unique optimal parameter
# setting**: 4 models out of the 12 parameter configurations reach the maximal
# accuracy (up to small random fluctuations caused by the sampling of the
# training set).
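Since the pivoting step is truncated in this diff, here is a hedged sketch of the heatmap it leads to. It assumes `cv_results` holds the grid-search results with the parameter columns already shortened (for example to `learning_rate` and `max_leaf_nodes` by the `shorten_param` helper named in the hunk headers):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Rows: learning-rate values; columns: maximum number of leaf nodes;
# cell values: mean cross-validated test scores.
pivoted_cv_results = cv_results.pivot_table(
    values="mean_test_score",
    index=["learning_rate"],
    columns=["max_leaf_nodes"],
)
ax = sns.heatmap(pivoted_cv_results, annot=True, cmap="YlGnBu")
ax.invert_yaxis()
plt.show()
```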
12 changes: 6 additions & 6 deletions python_scripts/parameter_tuning_nested.py
@@ -12,12 +12,12 @@
# However, we did not present a proper framework to evaluate the tuned models.
# Instead, we focused on the mechanism used to find the best set of parameters.
#
-# In this notebook, we will reuse some knowledge presented in the module
-# "Selecting the best model" to show how to evaluate models where
-# hyperparameters need to be tuned.
+# In this notebook, we reuse some knowledge presented in the module "Selecting
+# the best model" to show how to evaluate models where hyperparameters need to
+# be tuned.
#
-# Thus, we will first load the dataset and create the predictive model that we
-# want to optimize and later on, evaluate.
+# Thus, we first load the dataset and create the predictive model that we want
+# to optimize and later on, evaluate.
#
# ## Loading the dataset
#
@@ -111,7 +111,7 @@
# ### With hyperparameter tuning
#
# As shown in the previous notebook, one can use a search strategy that uses
-# cross-validation to find the best set of parameters. Here, we will use a
+# cross-validation to find the best set of parameters. Here, we use a
# grid-search strategy and reproduce the steps done in the previous notebook.
#
# First, we have to embed our model into a grid-search and specify the
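One way to evaluate a tuned model along the lines described above is to wrap the grid-search itself in an outer cross-validation. This is only a sketch with assumed names (`model`, `data`, `target`, a pipeline step called `classifier`) and illustrative grids, not necessarily the exact code of this notebook:

```python
from sklearn.model_selection import GridSearchCV, cross_validate

param_grid = {
    "classifier__learning_rate": (0.05, 0.5),
    "classifier__max_leaf_nodes": (10, 30),
}
model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)

# The inner cross-validation (inside GridSearchCV) selects the hyperparameters;
# the outer one estimates the generalization performance of the whole
# tuning-plus-fitting procedure.
cv_results = cross_validate(model_grid_search, data, target, cv=3, n_jobs=2)
scores = cv_results["test_score"]
print(f"Generalization score: {scores.mean():.3f} ± {scores.std():.3f}")
```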
4 changes: 2 additions & 2 deletions python_scripts/parameter_tuning_parallel_plot.py
@@ -110,8 +110,8 @@ def shorten_param(param_name):
# spread the active ranges and improve the readability of the plot.
# ```
#
-# The parallel coordinates plot will display the values of the hyperparameters
-# on different columns while the performance metric is color coded. Thus, we are
+# The parallel coordinates plot displays the values of the hyperparameters on
+# different columns while the performance metric is color coded. Thus, we are
# able to quickly inspect if there is a range of hyperparameters which is
# working or not.
#
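As a hedged illustration of such a plot, one possibility is `plotly.express`. It assumes `cv_results` contains one numeric column per shortened hyperparameter name plus a `mean_test_score` column; the column list below is illustrative:

```python
import plotly.express as px

fig = px.parallel_coordinates(
    cv_results,
    color="mean_test_score",
    dimensions=["learning_rate", "max_leaf_nodes", "mean_test_score"],
    color_continuous_scale=px.colors.sequential.Viridis,
)
fig.show()
```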
8 changes: 4 additions & 4 deletions python_scripts/parameter_tuning_sol_02.py
@@ -62,10 +62,10 @@
# %% [markdown]
# Use the previously defined model (called `model`) and using two nested `for`
# loops, make a search of the best combinations of the `learning_rate` and
-# `max_leaf_nodes` parameters. In this regard, you will need to train and test
-# the model by setting the parameters. The evaluation of the model should be
-# performed using `cross_val_score` on the training set. We will use the
-# following parameters search:
+# `max_leaf_nodes` parameters. In this regard, you need to train and test the
+# model by setting the parameters. The evaluation of the model should be
+# performed using `cross_val_score` on the training set. Use the following
+# parameters search:
# - `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
# the ability of a new tree to correct the error of the previous sequence of
# trees
10 changes: 5 additions & 5 deletions python_scripts/parameter_tuning_sol_03.py
@@ -23,8 +23,8 @@
)

# %% [markdown]
-# In this exercise, we will progressively define the regression pipeline and
-# later tune its hyperparameters.
+# In this exercise, we progressively define the regression pipeline and later
+# tune its hyperparameters.
#
# Start by defining a pipeline that:
# * uses a `StandardScaler` to normalize the numerical data;
@@ -108,8 +108,8 @@
cv_results = pd.DataFrame(model_random_search.cv_results_)

# %% [markdown] tags=["solution"]
-# To simplify the axis of the plot, we will rename the column of the dataframe
-# and only select the mean test score and the value of the hyperparameters.
+# To simplify the axis of the plot, we rename the column of the dataframe and
+# only select the mean test score and the value of the hyperparameters.

# %% tags=["solution"]
column_name_mapping = {
@@ -170,7 +170,7 @@
# vary between 0 and 10,000 (e.g. the variable `"Population"`) and B is a
# feature that varies between 1 and 10 (e.g. the variable `"AveRooms"`), then
# distances between samples (rows of the dataframe) are mostly impacted by
-# differences in values of the column A, while values of the column B will be
+# differences in values of the column A, while values of the column B are
# comparatively ignored. If one applies StandardScaler to such a database, both
# the values of A and B will be approximately between -3 and 3 and the neighbor
# structure will be impacted more or less equivalently by both variables.
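A tiny numeric illustration of that scaling argument, with toy values rather than the actual housing data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two features with very different ranges, mimicking "Population" and "AveRooms".
df = pd.DataFrame(
    {"Population": [300.0, 1500.0, 8000.0, 9500.0], "AveRooms": [2.0, 4.5, 6.0, 9.5]}
)
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# After standardization both columns span a comparable range around 0, so
# neighbor-based distances are influenced by both of them.
print(scaled.round(2))
```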
16 changes: 8 additions & 8 deletions python_scripts/trees_dataset.py
@@ -15,7 +15,7 @@
#
# ## Classification dataset
#
-# We will use this dataset in classification setting to predict the penguins'
+# We use this dataset in classification setting to predict the penguins'
# species from anatomical information.
#
# Each penguin is from one of the three following species: Adelie, Gentoo, and
@@ -26,15 +26,15 @@
# penguins](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png)
#
# This problem is a classification problem since the target is categorical. We
-# will limit our input data to a subset of the original features to simplify our
-# explanations when presenting the decision tree algorithm. Indeed, we will use
+# limit our input data to a subset of the original features to simplify our
+# explanations when presenting the decision tree algorithm. Indeed, we use
# features based on penguins' culmen measurement. You can learn more about the
# penguins' culmen with the illustration below:
#
# ![Image of
# culmen](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/culmen_depth.png)
#
-# We will start by loading this subset of the dataset.
+# We start by loading this subset of the dataset.

# %%
import pandas as pd
@@ -73,11 +73,11 @@
#
# In a regression setting, the target is a continuous variable instead of
# categories. Here, we use two features of the dataset to make such a problem:
-# the flipper length will be used as data and the body mass will be the target.
-# In short, we want to predict the body mass using the flipper length.
+# the flipper length is used as data and the body mass as the target. In short,
+# we want to predict the body mass using the flipper length.
#
-# We will load the dataset and visualize the relationship between the flipper
-# length and the body mass of penguins.
+# We load the dataset and visualize the relationship between the flipper length
+# and the body mass of penguins.

# %%
penguins = pd.read_csv("../datasets/penguins_regression.csv")
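A hedged sketch of that loading-and-plotting step; the CSV path comes from the hunk above, while the column names are assumptions about the file's header and may need adapting:

```python
import pandas as pd
import matplotlib.pyplot as plt

penguins = pd.read_csv("../datasets/penguins_regression.csv")

feature_name, target_name = "Flipper Length (mm)", "Body Mass (g)"  # assumed names
penguins.plot.scatter(x=feature_name, y=target_name, alpha=0.5)
plt.title("Body mass as a function of flipper length")
plt.show()
```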
10 changes: 5 additions & 5 deletions python_scripts/trees_ex_02.py
@@ -20,7 +20,7 @@
# By extrapolation, we refer to values predicted by a model outside of the range
# of feature values seen during the training.
#
-# We will first load the regression data.
+# We first load the regression data.

# %%
import pandas as pd
@@ -61,10 +61,10 @@
# Write your code here.

# %% [markdown]
-# Now, we will check the extrapolation capabilities of each model. Create a
-# dataset containing a broader range of values than your previous dataset, in
-# other words, add values below and above the minimum and the maximum of the
-# flipper length seen during training.
+# Now, we check the extrapolation capabilities of each model. Create a dataset
+# containing a broader range of values than your previous dataset, in other
+# words, add values below and above the minimum and the maximum of the flipper
+# length seen during training.

# %%
# Write your code here.
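One hedged way to build such an extended range, assuming the training features live in a single-column dataframe called `data_train` whose column name is stored in `feature_name` (names not shown in this excerpt, chosen here for illustration):

```python
import numpy as np
import pandas as pd

offset = 30  # arbitrary margin below/above the observed flipper lengths (mm)
data_extrapolation = pd.DataFrame(
    np.arange(
        data_train[feature_name].min() - offset,
        data_train[feature_name].max() + offset,
    ),
    columns=[feature_name],
)
```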

