class: middle, center, title-slide
Lecture 1: Fundamentals of machine learning
Prof. Gilles Louppe
[email protected]
???
R: overfitting plot -> make the same with a large NN to show it does NOT overfit!! Increasing the number of parameters results in regularization -> https://arxiv.org/abs/1812.11118
A recap on statistical learning:
- Supervised learning
- Empirical risk minimization
- Under-fitting and over-fitting
- Bias-variance dilemma
class: middle
Consider an unknown joint probability distribution $P(X,Y)$.
Assume training data $(\mathbf{x}_i, y_i) \sim P(X,Y)$, with $\mathbf{x}_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, for $i=1, ..., N$.

- In most cases,
    - $\mathbf{x}_i$ is a $p$-dimensional vector of features or descriptors,
    - $y_i$ is a scalar (e.g., a category or a real value).
- The training data is generated i.i.d.
- The training data can be of any finite size $N$.
- In general, we do not have any prior information about $P(X,Y)$.
???
In most cases, x is a vector, but it could be an image, a piece of text or a sample of sound.
class: middle
Supervised learning is usually concerned with the two following inference problems:
- Classification: Given $(\mathbf{x}_i, y_i) \in \mathcal{X}\times\mathcal{Y} = \mathbb{R}^p \times \bigtriangleup^C$, for $i=1, ..., N$, we want to estimate for any new $\mathbf{x}$,
$$\arg \max_y P(Y=y|X=\mathbf{x}).$$
- Regression: Given $(\mathbf{x}_i, y_i) \in \mathcal{X}\times\mathcal{Y} = \mathbb{R}^p \times \mathbb{R}$, for $i=1, ..., N$, we want to estimate for any new $\mathbf{x}$,
$$\mathbb{E}\left[ Y|X=\mathbf{x} \right].$$
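To make these two estimands concrete, here is a minimal NumPy sketch (toy data and all names are hypothetical) that approximates $\arg \max_y P(Y=y|X=\mathbf{x})$ by conditional counting and $\mathbb{E}\left[ Y|X=\mathbf{x} \right]$ by conditional averaging, for a discrete feature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a discrete feature x in {0, 1, 2} and a binary label y.
x = rng.integers(0, 3, size=10_000)
y = (x + rng.normal(0.0, 1.0, size=10_000) > 1.5).astype(int)

x_new = 1

# Classification: empirical argmax_y P(Y=y | X=x_new).
labels, counts = np.unique(y[x == x_new], return_counts=True)
print("argmax_y P(Y=y|X=x_new):", labels[np.argmax(counts)])

# Regression: empirical E[Y | X=x_new].
print("E[Y|X=x_new]:", y[x == x_new].mean())
```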
???
class: middle
Or more generally, inference is concerned with the conditional estimation $P(Y=y|X=\mathbf{x})$ for any new $(\mathbf{x}, y)$.
class: middle, center
Classification consists in identifying
a decision boundary between objects of distinct classes.
class: middle, center
Regression aims at estimating relationships among (usually continuous) variables.
Consider a function $f : \mathcal{X} \to \mathcal{Y}$ produced by some learning algorithm. The predictions of this function can be evaluated through a loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, such that $\ell(y, f(\mathbf{x})) \geq 0$ measures how close the prediction $f(\mathbf{x})$ is from $y$.
## Examples of loss functions
.grid[ .kol-1-3[Classification:] .kol-2-3[$\ell(y,f(\mathbf{x})) = \mathbf{1}_{y \neq f(\mathbf{x})}$] ] .grid[ .kol-1-3[Regression:] .kol-2-3[$\ell(y,f(\mathbf{x})) = (y - f(\mathbf{x}))^2$] ]
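Both losses translate directly into code; a minimal sketch (function names are hypothetical):

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # 1 when the predicted class differs from the true class, 0 otherwise.
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    return (y != y_pred).astype(float)

def squared_error_loss(y, y_pred):
    # (y - f(x))^2, for regression.
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    return (y - y_pred) ** 2

print(zero_one_loss([0, 1, 1], [0, 0, 1]))            # [0. 1. 0.]
print(squared_error_loss([1.0, 2.0], [0.5, 2.5]))     # [0.25 0.25]
```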
class: middle
Let $\mathcal{F}$ denote the hypothesis space, i.e. the set of all functions $f$ that can be produced by the chosen learning algorithm.

We are looking for a function $f \in \mathcal{F}$ with a small expected risk (or generalization error)
$$R(f) = \mathbb{E}_{(\mathbf{x},y)\sim P(X,Y)}\left[ \ell(y, f(\mathbf{x})) \right].$$

This means that for a given data generating distribution $P(X,Y)$ and for a given hypothesis space $\mathcal{F}$, the optimal model is
$$f_* = \arg \min_{f \in \mathcal{F}} R(f).$$
class: middle
Unfortunately, since $P(X,Y)$ is unknown, the expected risk cannot be evaluated and the optimal model cannot be determined.

However, if we have i.i.d. training data $\mathbf{d} = \{(\mathbf{x}_i, y_i) \mid i=1, \ldots, N\}$, we can compute an estimate, the empirical risk (or training error)
$$\hat{R}(f, \mathbf{d}) = \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}} \ell(y_i, f(\mathbf{x}_i)).$$

This estimate is unbiased and can be used for finding a good enough approximation of $R(f)$. This results in the empirical risk minimization principle:
$$f_*^{\mathbf{d}} = \arg \min_{f \in \mathcal{F}} \hat{R}(f, \mathbf{d}).$$
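A minimal sketch of the empirical risk estimate (helper names are hypothetical), averaging the per-sample losses over the training data:

```python
import numpy as np

def empirical_risk(loss, f, d):
    # d is a sequence of (x, y) pairs; average the per-sample losses.
    return np.mean([loss(y, f(x)) for x, y in d])

# Example: squared error loss and a fixed affine model.
squared_error = lambda y, y_pred: (y - y_pred) ** 2
f = lambda x: 2.0 * x + 1.0
d = [(0.0, 1.2), (1.0, 2.9), (2.0, 5.1)]
print(empirical_risk(squared_error, f, d))  # ~0.02
```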
???
What does unbiased mean?
=> The expectation of the empirical risk estimate (over $\mathbf{d}$) equals the expected risk.
class: middle
Most machine learning algorithms, including neural networks, implement empirical risk minimization.
Under regularity assumptions, empirical risk minimizers converge:
$$\lim_{N \to \infty} f_*^{\mathbf{d}} = f_*.$$
???
This is why tuning the parameters of the model to make it work on the training data is a reasonable thing to do.
Consider the joint probability distribution $P(X,Y)$ induced by the data generating process
$$(x, y) \sim P(X,Y) \Leftrightarrow x \sim U([-10; 10]), \epsilon \sim \mathcal{N}(0, \sigma^2), y = g(x) + \epsilon,$$
where $x \in \mathbb{R}$, $y \in \mathbb{R}$ and $g$ is an unknown polynomial of degree 3.
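A sketch of this data generating process, with arbitrary (hypothetical) choices for the unknown polynomial $g$ and the noise level $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(42)

def g(x):
    # Stand-in for the unknown degree-3 polynomial (coefficients are arbitrary).
    return 0.05 * x**3 - 0.5 * x + 2.0

def sample(n, sigma=2.0):
    # (x, y) ~ P(X, Y): x ~ U([-10, 10]), eps ~ N(0, sigma^2), y = g(x) + eps.
    x = rng.uniform(-10.0, 10.0, size=n)
    y = g(x) + rng.normal(0.0, sigma, size=n)
    return x, y

x_train, y_train = sample(100)
```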
class: middle
Our goal is to find a function $f$ that makes good predictions on average over $P(X,Y)$.

Consider the hypothesis space $\mathcal{F}$ of polynomials of degree 3, defined through their parameters $\mathbf{w} \in \mathbb{R}^4$ such that
$$\hat{y} \triangleq f(x; \mathbf{w}) = \sum_{d=0}^3 w_d x^d.$$
class: middle
For this regression problem, we use the squared error loss
$$\ell(y, f(x;\mathbf{w})) = (y - f(x;\mathbf{w}))^2$$
to measure how wrong the predictions are.

Therefore, our goal is to find the best value $\mathbf{w}_*$ such that
$$\begin{aligned}
\mathbf{w}_* &= \arg\min_\mathbf{w} R(\mathbf{w}) \\
&= \arg\min_\mathbf{w} \mathbb{E}_{(x,y)\sim P(X,Y)}\left[ (y - f(x;\mathbf{w}))^2 \right].
\end{aligned}$$
class: middle
Given a large enough training set $\mathbf{d} = \{(x_i, y_i) \mid i=1, \ldots, N\}$, the empirical risk minimization principle tells us that a good estimate $\mathbf{w}_*^{\mathbf{d}}$ of $\mathbf{w}_*$ can be found by minimizing the empirical risk:
$$\begin{aligned}
\mathbf{w}_*^{\mathbf{d}} &= \arg\min_\mathbf{w} \hat{R}(\mathbf{w}, \mathbf{d}) \\
&= \arg\min_\mathbf{w} \frac{1}{N} \sum_{(x_i, y_i) \in \mathbf{d}} (y_i - f(x_i; \mathbf{w}))^2.
\end{aligned}$$
class: middle
This is ordinary least squares regression, for which the solution is known analytically:
$$\mathbf{w}_*^{\mathbf{d}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$
where $\mathbf{X}$ is the design matrix whose rows are $(1, x_i, x_i^2, x_i^3)$ and $\mathbf{y}$ is the vector of targets $y_i$.
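A sketch of the closed-form fit with NumPy, reusing the sample drawn above; `lstsq` solves the least squares problem in a numerically stable way rather than forming the inverse explicitly:

```python
import numpy as np

# Design matrix with columns [1, x, x^2, x^3], built from the training sample.
X = np.vander(x_train, 4, increasing=True)

# w = (X^T X)^{-1} X^T y, computed via least squares.
w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
print(w)  # close to the coefficients of g, here [2.0, -0.5, 0.0, 0.05]
```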
class: middle
The expected risk minimizer $f(x; \mathbf{w}_*)$ within our hypothesis space is $g$ itself.

Therefore, on this toy problem, we can verify that $f(x; \mathbf{w}_*^{\mathbf{d}}) \to f(x; \mathbf{w}_*) = g(x)$ as $N \to \infty$.
class: middle
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
What if we consider a hypothesis space $\mathcal{F}$ in which candidate functions $f$ are either too "simple" or too "complex" with respect to the true data generating process?
class: middle
.center[$\mathcal{F}$ = polynomials of degree 1]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 2]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 3]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 4]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 5]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 10]
class: middle, center
Degree $d$ of the polynomial vs. error.
???
Why shouldn't we pick the largest $d$?
class: middle
Let $\mathcal{Y}^{\mathcal{X}}$ be the set of all functions $f : \mathcal{X} \to \mathcal{Y}$.

We define the Bayes risk as the minimal expected risk over all possible functions,
$$R_B = \min_{f \in \mathcal{Y}^{\mathcal{X}}} R(f),$$
and call the Bayes model $f_B$ the model that achieves this minimum.

No model $f$ can perform better than $f_B$.
class: middle
The capacity of a hypothesis space induced by a learning algorithm intuitively represents the ability to find a good model $f \in \mathcal{F}$ for any function, regardless of its complexity.
In practice, capacity can be controlled through hyper-parameters of the learning algorithm. For example:
- The degree of the family of polynomials;
- The number of layers in a neural network;
- The number of training iterations;
- Regularization terms.
class: middle
- If the capacity of $\mathcal{F}$ is too low, then $f_B \notin \mathcal{F}$ and $R(f) - R_B$ is large for any $f \in \mathcal{F}$, including $f_*$ and $f_*^{\mathbf{d}}$. Such models $f$ are said to underfit the data.
- If the capacity of $\mathcal{F}$ is too high, then $f_B \in \mathcal{F}$ or $R(f_*) - R_B$ is small. However, because of the high capacity of the hypothesis space, the empirical risk minimizer $f_*^{\mathbf{d}}$ could fit the training data arbitrarily well, such that $$R(f_*^{\mathbf{d}}) \geq R_B \geq \hat{R}(f_*^{\mathbf{d}}, \mathbf{d}) \geq 0.$$ In this situation, $f_*^{\mathbf{d}}$ becomes too specialized with respect to the training data, and a large reduction of the empirical risk (often) comes at the price of an increase of the expected risk $R(f_*^{\mathbf{d}})$. In this situation, $f_*^{\mathbf{d}}$ is said to overfit the data.
class: middle
Therefore, our goal is to adjust the capacity of the hypothesis space such that the expected risk of the empirical risk minimizer gets as low as possible.
???
Comment that for deep networks, the training error may go to 0 while the generalization error does not necessarily go up!
class: middle
When overfitting,
$$R(f_*^{\mathbf{d}}) \geq R_B \geq \hat{R}(f_*^{\mathbf{d}}, \mathbf{d}) \geq 0.$$

This indicates that the empirical risk $\hat{R}(f_*^{\mathbf{d}}, \mathbf{d})$ is a poor estimate of the expected risk $R(f_*^{\mathbf{d}})$.

Nevertheless, an unbiased estimate of the expected risk can be obtained by evaluating $f_*^{\mathbf{d}}$ on data $\mathbf{d}_\text{test}$ independent from the training samples $\mathbf{d}$:
$$\hat{R}(f_*^{\mathbf{d}}, \mathbf{d}_\text{test}) = \frac{1}{N_\text{test}} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}_\text{test}} \ell(y_i, f_*^{\mathbf{d}}(\mathbf{x}_i)).$$

This test error estimate can be used to evaluate the actual performance of the model. However, it should not simultaneously be used for model selection.
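A sketch of this protocol on the running polynomial example (reusing the hypothetical `sample` helper from above): the training error keeps decreasing with the degree, while the error on the independent test set eventually increases.

```python
import numpy as np

def fit(x, y, degree):
    # Least squares fit of a polynomial of the given degree.
    return np.linalg.lstsq(np.vander(x, degree + 1, increasing=True), y, rcond=None)[0]

def risk(w, x, y):
    # Empirical risk under the squared error loss.
    return np.mean((y - np.vander(x, len(w), increasing=True) @ w) ** 2)

x_train, y_train = sample(30)
x_test, y_test = sample(1000)  # independent from the training samples

for d in (1, 2, 3, 5, 10):
    w = fit(x_train, y_train, d)
    print(d, risk(w, x_train, y_train), risk(w, x_test, y_test))
```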
class: middle, center
Degree $d$ of the polynomial vs. error.
???
What value of $d$ should you select?
But then how good is this selected model?
class: middle
There may be over-fitting, but it does not bias the final performance evaluation.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.center[This should be avoided at all costs!]
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.center[Instead, keep a separate validation set for tuning the hyper-parameters.]
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
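A sketch of this three-way protocol, reusing the hypothetical `sample`, `fit` and `risk` helpers from the previous sketch: the degree is selected on the validation set, and only the selected model is evaluated, once, on the test set.

```python
# Three-way split: train for fitting, validation for selecting the degree,
# test for the final, unbiased performance estimate.
x_train, y_train = sample(30)
x_val, y_val = sample(200)
x_test, y_test = sample(1000)

degrees = range(1, 11)
models = {d: fit(x_train, y_train, d) for d in degrees}
best_d = min(degrees, key=lambda d: risk(models[d], x_val, y_val))

print("selected degree:", best_d)
print("test error of the selected model:", risk(models[best_d], x_test, y_test))
```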
???
Comment on the comparison of algorithms from one paper to the other.
Consider a fixed point $x$ and the prediction $\hat{Y} = f_*^{\mathbf{d}}(x)$ of the empirical risk minimizer at $x$.

Then the local expected risk of $f_*^{\mathbf{d}}$ is
$$\begin{aligned}
R(f_*^{\mathbf{d}}|x) &= \mathbb{E}_{y \sim P(Y|x)}\left[ (y - f_*^{\mathbf{d}}(x))^2 \right] \\
&= \mathbb{E}_{y \sim P(Y|x)}\left[ (y - f_B(x) + f_B(x) - f_*^{\mathbf{d}}(x))^2 \right] \\
&= \mathbb{E}_{y \sim P(Y|x)}\left[ (y - f_B(x))^2 \right] + (f_B(x) - f_*^{\mathbf{d}}(x))^2 \\
&= R(f_B|x) + (f_B(x) - f_*^{\mathbf{d}}(x))^2,
\end{aligned}$$
where the cross term vanishes because $f_B(x) = \mathbb{E}_{y \sim P(Y|x)}\left[ y \right]$ under the squared error loss, and
- $R(f_B|x)$ is the local expected risk of the Bayes model. This term cannot be reduced.
- $(f_B(x) - f_*^{\mathbf{d}}(x))^2$ represents the discrepancy between $f_B$ and $f_*^{\mathbf{d}}$.
class: middle
If $\mathbf{d} \sim P(X,Y)$ is itself considered as a random variable, then $f_*^{\mathbf{d}}$ is also a random variable, along with its predictions $\hat{Y}$.
class: middle
???
What do you observe?
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle
Formally, taking the expectation of the local expected risk over the training sets $\mathbf{d}$ yields: $$\begin{aligned} &\mathbb{E}_\mathbf{d} \left[ R(f_*^{\mathbf{d}}|x) \right] \\ &= \mathbb{E}_\mathbf{d} \left[ R(f_B|x) + (f_B(x) - f_*^{\mathbf{d}}(x))^2 \right] \\ &= R(f_B|x) + \mathbb{E}_\mathbf{d} \left[ (f_B(x) - f_*^{\mathbf{d}}(x))^2 \right] \\ &= \underbrace{R(f_B|x)}_{\text{noise}(x)} + \underbrace{(f_B(x) - \mathbb{E}_\mathbf{d}\left[ f_*^\mathbf{d}(x) \right] )^2}_{\text{bias}^2(x)} + \underbrace{\mathbb{E}_\mathbf{d}\left[ ( \mathbb{E}_\mathbf{d}\left[ f_*^\mathbf{d}(x) \right] - f_*^\mathbf{d}(x))^2 \right]}_{\text{var}(x)} \end{aligned}$$
This decomposition is known as the bias-variance decomposition.
- The noise term quantifies the irreducible part of the expected risk.
- The bias term measures the discrepancy between the average model and the Bayes model.
- The variance term quantifies the variability of the predictions.
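These three terms can be estimated empirically at a fixed $x$ by refitting the model on many independently drawn training sets; a minimal sketch on the toy polynomial problem (hypothetical stand-ins; there, the Bayes model is $g$ itself and the noise is $\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: 0.05 * x**3 - 0.5 * x + 2.0  # Bayes model of the toy problem
sigma = 2.0
x0 = 5.0  # fixed query point

preds = []
for _ in range(1000):  # many independent training sets d
    x = rng.uniform(-10.0, 10.0, size=30)
    y = g(x) + rng.normal(0.0, sigma, size=30)
    w = np.linalg.lstsq(np.vander(x, 4, increasing=True), y, rcond=None)[0]
    preds.append(np.vander([x0], 4, increasing=True) @ w)

preds = np.array(preds).ravel()
print("noise:   ", sigma**2)                   # R(f_B | x), irreducible
print("bias^2:  ", (g(x0) - preds.mean())**2)  # (f_B(x) - E_d[f(x)])^2
print("variance:", preds.var())                # E_d[(E_d[f(x)] - f(x))^2]
```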
class: middle
- Reducing the capacity makes $f_*^\mathbf{d}$ fit the data less on average, which increases the bias term.
- Increasing the capacity makes $f_*^\mathbf{d}$ vary a lot with the training data, which increases the variance term.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle, center, red-slide
What about a neural network with .bold[millions] of parameters?
class: middle
class: middle
.footnote[Credits: Belkin et al, 2018.]
class: end-slide, center count: false
The end.