
Commit

re-org mlp and ml fundamentals chaps
astonzhang committed Sep 16, 2021
1 parent dd178d5 commit 6e7cb6c
Showing 15 changed files with 87 additions and 168 deletions.
54 changes: 27 additions & 27 deletions chapter_linear-networks/softmax-regression-concise.md
@@ -38,7 +38,7 @@ import tensorflow as tf

## Defining the Model

As we did in :numref:`sec_linear_concise`, we construct the fully connected layer using the built-in layer. The built-in `__call__` method then invokes `forward` whenever we need to apply the network to some inputs.

:begin_tab:`mxnet`
Even though the input `X` is a 4-D tensor, the built-in `Dense` layer will automatically convert `X` into a 2-D tensor by keeping the first dimension size unchanged.
@@ -64,7 +64,7 @@ class SoftmaxRegression(d2l.Classification):
            self.net.initialize()
        if tab.selected('pytorch'):
            self.net = nn.Sequential(nn.Flatten(),
                                     nn.LazyLinear(num_outputs))
        if tab.selected('tensorflow'):
            self.net = tf.keras.models.Sequential()
            self.net.add(tf.keras.layers.Flatten())
@@ -78,33 +78,33 @@ class SoftmaxRegression(d2l.Classification):
:label:`subsec_softmax-implementation-revisited`

In the previous section we calculated our model's output
and applied the cross-entropy loss. While this is perfectly
reasonable mathematically, it is risky computationally, due to
numerical underflow and overflow in the exponentiation.

Recall that the softmax function computes probabilities via
$\hat y_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$.
If some of the $o_k$ are very large, i.e., very positive,
then $\exp(o_k)$ might be larger than the largest number
we can have for certain data types. This is called *overflow*. Likewise,
if all arguments are very negative, we will get *underflow*.
For instance, single precision floating point numbers approximately
cover the range of $10^{-38}$ to $10^{38}$. As such, if the largest term in $\mathbf{o}$
lies outside the interval $[-90, 90]$, the result will not be stable.
A solution to this problem is to subtract $\bar{o} := \max_k o_k$ from
all entries.

$$
\hat y_j = \frac{\exp o_j}{\sum_k \exp o_k} =
\frac{\exp(o_j - \bar{o}) \exp \bar{o}}{\sum_k \exp (o_k - \bar{o}) \exp \bar{o}} =
\frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})}.
$$

By construction we know that $o_j - \bar{o} \leq 0$ for all $j$. As such, for a $q$-class
classification problem, the denominator is contained in the interval $[1, q]$. Moreover, the
numerator never exceeds $1$, thus preventing numerical overflow. Numerical underflow only
occurs when $\exp(o_j - \bar{o})$ numerically evaluates as $0$. Nonetheless, a few steps down
the road we might find ourselves in trouble when we want to compute $\log \hat{y}_j$ as $\log 0$.
In particular, in backpropagation,
we might find ourselves faced with a screenful
of the dreaded `NaN` results.
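
To see the effect concretely, here is a minimal NumPy sketch of the max-subtraction trick (an illustration of the idea only, not the book's or any framework's implementation):

```python
import numpy as np

def naive_softmax(o):
    """Literal translation of the formula; the exponentials can overflow."""
    e = np.exp(o)
    return e / e.sum()

def stable_softmax(o):
    """Subtract max(o) first; mathematically identical, numerically safe."""
    e = np.exp(o - o.max())
    return e / e.sum()

o = np.array([100.0, 200.0, 300.0], dtype=np.float32)
print(naive_softmax(o))   # [nan nan nan] -- every exp() overflows float32 to inf
print(stable_softmax(o))  # [~0, ~0, ~1] -- well-defined probabilities
```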
@@ -118,11 +118,11 @@ we can escape the numerical stability issues altogether. We have:

$$
\log \hat{y}_j =
\log \frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})} =
o_j - \bar{o} - \log \sum_k \exp (o_k - \bar{o})
$$

This avoids both overflow and underflow.
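
As an illustrative sketch of this combination (a single-example toy of ours, not the fused loss functions the frameworks actually provide), cross-entropy can be computed directly from the logits:

```python
import numpy as np

def cross_entropy_from_logits(o, y):
    """Stable -log softmax(o)[y] via log-sum-exp; o: logit vector, y: true class index."""
    o = o - o.max()                  # shift so the largest logit is 0
    log_z = np.log(np.exp(o).sum())  # log of the (now well-behaved) partition sum
    return log_z - o[y]              # equals -(o[y] - logsumexp(o))

o = np.array([100.0, 200.0, 300.0], dtype=np.float32)
print(cross_entropy_from_logits(o, 2))  # ~0.0, no overflow and no log(0)
```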
We will want to keep the conventional softmax function handy
in case we ever want to evaluate the output probabilities by our model.
But instead of passing softmax probabilities into our new loss function,
@@ -167,17 +167,17 @@

## Summary

High-level APIs are very convenient for hiding potentially dangerous aspects from their users, such as numerical stability. Moreover, they allow users to design models concisely with very few lines of code. This is both a blessing and a curse. The obvious benefit is that it makes things highly accessible, even to engineers who never took a single class of statistics in their life (in fact, they are among the target audiences of the book). But hiding the sharp edges also comes with a price: a disincentive to add new and different components on your own, since there's little muscle memory for doing it. Moreover, it makes it more difficult to *fix* things whenever the protective padding of
a framework fails to cover all the corner cases entirely. Again, this is due to lack of familiarity.

As such, we strongly urge you to review *both* the bare-bones and the elegant versions of many of the implementations that follow. While we emphasize ease of understanding, the implementations are nonetheless usually quite performant (convolutions are the big exception here). It is our intention to allow you to build on these when you invent something new that no framework can give you.


## Exercises

1. Deep learning uses many different number formats, including FP64 double precision (used extremely rarely),
FP32 single precision, BFLOAT16 (good for compressed representations), FP16 (very unstable), TF32 (a new format from NVIDIA), and INT8. Compute the smallest and largest argument of the exponential function for which the result does not lead to a numerical underflow or overflow (a small probing sketch follows this list).
1. INT8 is a very limited format with nonzero numbers from $1$ to $255$. How could you extend its dynamic range without using more bits? Do standard multiplication and addition still work?
1. Increase the number of epochs for training. Why might the test accuracy decrease after a while? How could we fix this?
1. What happens as you increase the learning rate? Compare the loss curves for several learning rates. Which one works better? When?
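
For the first exercise, one way to start probing these limits (a NumPy sketch of ours; the exact thresholds also depend on how denormals are handled) is:

```python
import numpy as np

finfo = np.finfo(np.float32)
# exp(x) overflows once x exceeds log(largest representable float32) ~ 88.72
print(np.log(finfo.max))                                   # ~88.72
# exp(x) underflows to zero (ignoring denormals) below log(smallest normal) ~ -87.34
print(np.log(finfo.tiny))                                  # ~-87.34
# Empirical check near the boundaries
print(np.exp(np.float32(88.7)), np.exp(np.float32(89.0)))  # finite, inf
print(np.exp(np.float32(-87.3)), np.exp(np.float32(-104))) # tiny, 0.0
```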

File renamed without changes.
41 changes: 41 additions & 0 deletions chapter_machine-learning-fundamentals/index.md
@@ -0,0 +1,41 @@
# Machine Learning Fundamentals
:label:`chap_ml-fundamentals`

As illustrated in :numref:`chap_introduction`,
deep learning is just one among many popular methods for solving machine learning problems.
As we saw when training
linear regressions, softmax regressions,
and multilayer perceptrons,
optimization algorithms
reduce loss function values
by iteratively updating model parameters.
However,
when we train high-capacity models,
such as deep neural networks, we run the risk of overfitting.
Thus, we will provide your first rigorous introduction
to the notions of overfitting, underfitting, and model selection.
To help you combat these problems,
we will introduce regularization techniques such as weight decay and dropout.
In view of many failed machine learning *deployments*,
it is necessary to
expose some common concerns
and stimulate the critical thinking required to detect these situations early, mitigate damage, and use machine learning responsibly.
Throughout, we aim to give you a firm grasp not just of the concepts
but also of the practice of using machine learning models.
At the end of this chapter,
we apply what we have introduced so far to a real case: house price prediction.
We punt matters relating to the computational performance,
scalability, and efficiency of our models to subsequent chapters.

```toc
:maxdepth: 2
optimization-primer
model-selection
underfit-overfit
weight-decay
dropout
environment
kaggle-house-price
```

3 changes: 3 additions & 0 deletions chapter_machine-learning-fundamentals/optimization-primer.md
@@ -0,0 +1,3 @@
# Optimization Primer

TODO
5 changes: 3 additions & 2 deletions chapter_multilayer-perceptrons/backprop.md
@@ -31,7 +31,7 @@ techniques and their implementations,
we rely on some basic mathematics and computational graphs.
To start, we focus our exposition on
a one-hidden-layer MLP
with weight decay ($\ell_2$ regularization, to be described in subsequent chapters).

## Forward Propagation

@@ -79,7 +79,8 @@ for a single data example,

$$L = l(\mathbf{o}, y).$$

According to the definition of $\ell_2$ regularization
that we will introduce later,
given the hyperparameter $\lambda$,
the regularization term is

28 changes: 9 additions & 19 deletions chapter_multilayer-perceptrons/index.md
@@ -7,32 +7,22 @@ and they consist of multiple layers of neurons
each fully connected to those in the layer below
(from which they receive input)
and those above (which they, in turn, influence).
Although automatic differentiation
significantly simplifies the implementation of deep learning algorithms,
we will dive deep into how these gradients
are calculated in deep networks.
Then we will
be ready to
discuss issues relating to numerical stability and parameter initialization
that are key to successfully training deep networks.
Throughout, we aim to give you a firm grasp not just of the concepts
but also of the practice of using deep networks.


```toc
:maxdepth: 2
mlp
mlp-implementation
backprop
numerical-stability-and-init
```

118 changes: 0 additions & 118 deletions chapter_multilayer-perceptrons/mlp-concise.md

This file was deleted.

@@ -230,14 +230,15 @@ the network's expressive power.
The hidden layer would behave
as if it had only a single unit.
Note that while minibatch stochastic gradient descent would not break this symmetry,
dropout regularization (to be introduced later) would!
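
A tiny PyTorch sketch (ours, purely for illustration) of the symmetry problem: when every weight starts at the same constant, all hidden units receive the same gradient and can never differentiate themselves:

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))
for layer in (net[0], net[2]):
    nn.init.constant_(layer.weight, 0.5)  # identical weights everywhere
    nn.init.zeros_(layer.bias)

X, y = torch.randn(8, 4), torch.randn(8, 1)
loss = nn.functional.mse_loss(net(X), y)
loss.backward()

# Every row of the hidden layer's weight gradient is identical,
# so a gradient step keeps the hidden units indistinguishable.
print(net[0].weight.grad)
print(torch.allclose(net[0].weight.grad[0], net[0].weight.grad[1]))  # True
```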


## Parameter Initialization

One way of addressing---or at least mitigating---the
issues raised above is through careful initialization.
As we will see later,
additional care during optimization
and suitable regularization can further enhance stability.


1 change: 1 addition & 0 deletions index.md
@@ -24,6 +24,7 @@ chapter_introduction/index
chapter_preliminaries/index
chapter_linear-networks/index
chapter_multilayer-perceptrons/index
chapter_machine-learning-fundamentals/index
chapter_deep-learning-computation/index
chapter_convolutional-neural-networks/index
chapter_convolutional-modern/index
