-th to ^\mathrm{th}, T to \top, till mat-mat prod
astonzhang committed Dec 9, 2019
1 parent 00e0dcb commit 2a64af8
Showing 12 changed files with 63 additions and 88 deletions.
4 changes: 2 additions & 2 deletions chapter_attention-mechanism/attention.md
@@ -69,7 +69,7 @@ $$\alpha(\mathbf q, \mathbf k) = \langle \mathbf q, \mathbf k \rangle /\sqrt{d}.$$

Assume $\mathbf Q\in\mathbb R^{m\times d}$ contains $m$ queries and $\mathbf K\in\mathbb R^{n\times d}$ has all $n$ keys. We can compute all $mn$ scores by

- $$\alpha(\mathbf Q, \mathbf K) = \mathbf Q \mathbf K^T /\sqrt{d}.$$
+ $$\alpha(\mathbf Q, \mathbf K) = \mathbf Q \mathbf K^\top /\sqrt{d}.$$

Now let's implement this layer, which supports a batch of queries and key-value pairs. In addition, it supports randomly dropping some attention weights as a form of regularization.
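
As a quick illustration of the batched scoring formula, here is a minimal NumPy sketch (illustrative only, not the book's implementation; the function names are ours):

```python
import numpy as np

def dot_product_scores(Q, K):
    """Scaled dot-product scores: alpha(Q, K) = Q K^T / sqrt(d)."""
    d = Q.shape[-1]
    return Q @ K.T / np.sqrt(d)  # (m, d) x (d, n) -> (m, n)

def row_softmax(scores):
    # Turn each row of scores into attention weights that sum to 1.
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Q = np.random.randn(2, 4)  # m = 2 queries of dimension d = 4
K = np.random.randn(3, 4)  # n = 3 keys of the same dimension
weights = row_softmax(dot_product_scores(Q, K))
print(weights.shape)       # (2, 3); each row sums to 1
```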

@@ -108,7 +108,7 @@ In multilayer perceptron attention, we first project both query and keys into $\mathbb R^{h}$.

To be more specific, assume learnable parameters $\mathbf W_k\in\mathbb R^{h\times d_k}$, $\mathbf W_q\in\mathbb R^{h\times d_q}$, and $\mathbf v\in\mathbb R^{h}$. Then the score function is defined by

- $$\alpha(\mathbf k, \mathbf q) = \mathbf v^T \text{tanh}(\mathbf W_k \mathbf k + \mathbf W_q\mathbf q). $$
+ $$\alpha(\mathbf k, \mathbf q) = \mathbf v^\top \text{tanh}(\mathbf W_k \mathbf k + \mathbf W_q\mathbf q). $$

This concatenates the key and query in the feature dimension and feeds them into a perceptron with a single hidden layer of size $h$ and an output layer of size $1$. The hidden layer activation function is tanh and no bias is applied.
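
A matching sketch of the multilayer perceptron score function, with shapes taken from the definitions above (illustrative, not the book's code):

```python
import numpy as np

def mlp_score(k, q, W_k, W_q, v):
    """MLP attention score: alpha(k, q) = v^T tanh(W_k k + W_q q)."""
    return v @ np.tanh(W_k @ k + W_q @ q)  # a scalar

h, d_k, d_q = 8, 4, 6
W_k = np.random.randn(h, d_k)  # projects the key into R^h
W_q = np.random.randn(h, d_q)  # projects the query into R^h
v = np.random.randn(h)         # output layer of size 1, no bias
k, q = np.random.randn(d_k), np.random.randn(d_q)
print(mlp_score(k, q, W_k, W_q, v))
```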

Expand Down
2 changes: 1 addition & 1 deletion chapter_attention-mechanism/transformer.md
@@ -284,7 +284,7 @@ Let's first look at how a decoder behaves during prediction. Similar to the seq2

![Predict at time step $t$ for a self-attention layer.](../img/self-attention-predict.svg)

- During training, because the output for the $t$-query could depend all $T$ key-value pairs, which results in an inconsistent behavior than prediction. We can eliminate it by specifying the valid length to be $t$ for the $t$-th query.
+ During training, the output for the $t^\textrm{th}$ query could depend on all $T$ key-value pairs, which results in behavior inconsistent with prediction. We can eliminate this by specifying the valid length to be $t$ for the $t^\textrm{th}$ query.
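
A small sketch of the masking this valid length implies (illustrative NumPy, not the book's implementation): with a lower-triangular mask, the $t^\textrm{th}$ query only sees the first $t$ key-value pairs, so training matches step-by-step prediction.

```python
import numpy as np

T = 4
scores = np.random.randn(T, T)  # entry (t, s) scores query t against key s

# Keep keys s <= t for the t-th query; push the rest toward zero weight.
mask = np.tril(np.ones((T, T), dtype=bool))
masked_scores = np.where(mask, scores, -1e9)  # softmax maps -1e9 to ~0
```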

Another difference compared to the encoder transformer block is that the decoder block has an additional multi-head attention layer that accepts the encoder outputs as keys and values.

4 changes: 2 additions & 2 deletions chapter_computational-performance/multiple-gpus.md
@@ -22,7 +22,7 @@ for training models using optimization algorithms described in
:numref:`sec_minibatch_sgd`. Now, we will demonstrate how data parallelism works using mini-batch
stochastic gradient descent as an example.

- Assume there are $k$ GPUs on a machine. Given the model to be trained, each GPU will maintain a complete set of model parameters independently. In any iteration of model training, given a random mini-batch, we divide the examples in the batch into $k$ portions and distribute one to each GPU. Then, each GPU will calculate the local gradient of the model parameters based on the mini-batch subset it was assigned and the model parameters it maintains. Next, we add together the local gradients on the $k$ GPUs to get the current mini-batch stochastic gradient. After that, each GPU uses this mini-batch stochastic gradient to update the complete set of model parameters that it maintains. Figure 8.1 depicts the mini-batch stochastic gradient calculation using data parallelism and two GPUs.
+ Assume there are $k$ GPUs on a machine. Given the model to be trained, each GPU will maintain a complete set of model parameters independently. In any iteration of model training, given a random mini-batch, we divide the examples in the batch into $k$ portions and distribute one to each GPU. Then, each GPU will calculate the local gradient of the model parameters based on the mini-batch subset it was assigned and the model parameters it maintains. Next, we add together the local gradients on the $k$ GPUs to get the current mini-batch stochastic gradient. After that, each GPU uses this mini-batch stochastic gradient to update the complete set of model parameters that it maintains. Figure 10.1 depicts the mini-batch stochastic gradient calculation using data parallelism and two GPUs.

![Calculation of mini-batch stochastic gradient using data parallelism and two GPUs. ](../img/data-parallel.svg)
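
In code, the aggregation step amounts to an allreduce: sum the $k$ local gradients and hand the sum back to every device. A minimal sketch with plain NumPy arrays standing in for per-GPU buffers (illustrative, not Gluon's implementation):

```python
import numpy as np

def allreduce(grads):
    """Sum the local gradients from k devices and broadcast the sum back."""
    total = sum(grads)
    return [total.copy() for _ in grads]

k = 2
local_grads = [np.ones(3) * (i + 1) for i in range(k)]  # grads on GPU 0, GPU 1
synced = allreduce(local_grads)
print(synced[0])  # [3. 3. 3.] -- every device now holds the same sum
```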

@@ -115,7 +115,7 @@ print('after allreduce:\n', data[0], '\n', data[1])

## Split a Data Batch into Multiple GPUs

- The `utils` module in Gluon provides a function to evenly split an array into multiple parts along the first dimension, and then copy the $i$-th part into the the $i$-th device. It's straightforward to implement, but we will use the pre-implemented version so later chapters can reuse the `split_batch` function we will define later.
+ The `utils` module in Gluon provides a function to evenly split an array into multiple parts along the first dimension, and then copy the $i^\mathrm{th}$ part to the $i^\mathrm{th}$ device. It's straightforward to implement, but we will use the pre-implemented version so that later chapters can reuse the `split_batch` function we define below.

Now, we try to divide the 4 data instances equally between 2 GPUs using the `split_and_load` function.
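
The splitting itself is easy to picture; a plain-NumPy sketch of the idea (Gluon's `split_and_load` additionally copies each part to its device):

```python
import numpy as np

def split_batch(X, k):
    """Split X into k equal parts along the first dimension."""
    assert X.shape[0] % k == 0, 'batch size must be divisible by k'
    return np.split(X, k, axis=0)

X = np.arange(4 * 5).reshape(4, 5)      # 4 data instances
parts = split_batch(X, 2)
print(parts[0].shape, parts[1].shape)   # (2, 5) (2, 5)
```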

4 changes: 2 additions & 2 deletions chapter_computer-vision/tranposed-conv.md
@@ -93,7 +93,7 @@

```{.python .input}
Y = d2l.corr2d(X, K)
Y
```

- Next, we rewrite convolution kernel $K$ as a matrix $W$. Its shape will be $(4,9)$, where the $i$-th row present applying the kernel to the input to generate the $i$-th output element.
+ Next, we rewrite the convolution kernel $K$ as a matrix $W$. Its shape will be $(4,9)$, where the $i^\mathrm{th}$ row corresponds to applying the kernel to the input to generate the $i^\mathrm{th}$ output element.

```{.python .input}
def kernel2matrix(K):
    ...
```

@@ -112,7 +112,7 @@ Then the convolution operator can be implemented by matrix multiplication with p
```{.python .input}
Y == np.dot(W, X.reshape(-1)).reshape(2, 2)
```

- We can implement transposed convolution as a matrix multiplication as well by reusing `kernel2matrix`. To reuse the generated $W$, we construct a $2\times 2$ input, so the corresponding weight matrix will have a shape $(9,4)$, which is $W^T$. Let's verify the results.
+ We can implement transposed convolution as a matrix multiplication as well by reusing `kernel2matrix`. To reuse the generated $W$, we construct a $2\times 2$ input, so the corresponding weight matrix will have a shape $(9,4)$, which is $W^\top$. Let's verify the results.
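
The whole construction fits in a few lines of NumPy; a sketch under the same shapes as above, with a hand-built $W$ rather than the book's `kernel2matrix` (illustrative only):

```python
import numpy as np

K = np.array([[1., 2.], [3., 4.]])  # 2x2 kernel on a 3x3 input -> 2x2 output
W = np.zeros((4, 9))                # one row per output element
for i in range(2):
    for j in range(2):
        window = np.zeros((3, 3))
        window[i:i + 2, j:j + 2] = K      # kernel placed at output position (i, j)
        W[i * 2 + j] = window.reshape(-1)

X = np.arange(9.).reshape(3, 3)
Y = (W @ X.reshape(-1)).reshape(2, 2)    # convolution as a matrix product
Z = (W.T @ Y.reshape(-1)).reshape(3, 3)  # transposed convolution via W^T
print(Y.shape, Z.shape)                  # (2, 2) (3, 3)
```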

```{.python .input}
X = np.array([[0,1], [2,3]])
...
```
2 changes: 1 addition & 1 deletion chapter_deep-learning-computation/use-gpu.md
@@ -28,7 +28,7 @@ Note that this might be extravagant for most desktop computers but it is easily

## Computing Devices

- MXNet can specify devices, such as CPUs and GPUs, for storage and calculation. By default, MXNet creates data in the main memory and then uses the CPU to calculate it. In MXNet, the CPU and GPU can be indicated by `cpu()` and `gpu()`. It should be noted that `cpu()` (or any integer in the parentheses) means all physical CPUs and memory. This means that MXNet's calculations will try to use all CPU cores. However, `gpu()` only represents one graphic card and the corresponding graphic memory. If there are multiple GPUs, we use `gpu(i)` to represent the $i$-th GPU ($i$ starts from 0). Also, `gpu(0)` and `gpu()` are equivalent.
+ MXNet can specify devices, such as CPUs and GPUs, for storage and calculation. By default, MXNet creates data in the main memory and then uses the CPU to calculate it. In MXNet, the CPU and GPU can be indicated by `cpu()` and `gpu()`. It should be noted that `cpu()` (or any integer in the parentheses) means all physical CPUs and memory. This means that MXNet's calculations will try to use all CPU cores. However, `gpu()` only represents one graphics card and the corresponding graphics memory. If there are multiple GPUs, we use `gpu(i)` to represent the $i^\mathrm{th}$ GPU ($i$ starts from 0). Also, `gpu(0)` and `gpu()` are equivalent.
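
A short sketch of these device objects, assuming the chapter's MXNet `np`/`npx` imports (constructing a GPU device object does not require the GPU to be present):

```python
from mxnet import np, npx
npx.set_np()

print(npx.cpu())    # cpu(0): all physical CPUs and memory
print(npx.gpu())    # gpu(0): the first graphics card
print(npx.gpu(1))   # gpu(1): the second graphics card, if present

x = np.ones((2, 3), ctx=npx.cpu())  # place an array on a chosen device
print(x.ctx)
```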

```{.python .input}
from mxnet import np, npx
...
```
2 changes: 1 addition & 1 deletion chapter_linear-networks/softmax-regression-gluon.md
@@ -39,7 +39,7 @@ numerical stability issues, a matter we've already discussed a few times
(e.g. in :numref:`sec_naive_bayes` and
in the problem set of the previous chapter). Recall that the softmax function
calculates $\hat y_j = \frac{e^{z_j}}{\sum_{i=1}^{n} e^{z_i}}$, where $\hat y_j$
- is the j-th element of ``yhat`` and $z_j$ is the j-th element of the input
+ is the $j^\mathrm{th}$ element of ``yhat`` and $z_j$ is the $j^\mathrm{th}$ element of the input
``y_linear`` variable, as computed by the softmax.

If some of the $z_i$ are very large (i.e. very positive),
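
The failure mode, and the standard fix of subtracting $\max_i z_i$ before exponentiating, in a small NumPy demonstration (illustrative, not the book's code):

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                 # overflows to inf for large z
    return e / e.sum()

def stable_softmax(z):
    z = z - z.max()               # shifts logits; leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

z = np.array([10.0, 1000.0])      # one very large logit
print(naive_softmax(z))           # [ 0. nan] -- inf / inf
print(stable_softmax(z))          # [0. 1.]
```
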
14 changes: 7 additions & 7 deletions chapter_multilayer-perceptrons/kaggle-house-price.md
@@ -62,7 +62,7 @@ Year of construction, for example, is represented with integers
roof type is a discrete categorical feature,
other features are represented with floating point numbers.
And here is where reality comes in:
- for some exampels, some data is altogether missing
+ for some examples, some data is altogether missing
with the missing value marked simply as 'na'.
The price of each house is included for the training set only
(it's a competition after all).
@@ -198,7 +198,7 @@ that we have a data processing bug.
And if things work, the linear model will serve as a baseline
giving us some intuition about how close the simple model
gets to the best reported models, giving us a sense
- of how much gain we should expect from fanicer models.
+ of how much gain we should expect from fancier models.

```{.python .input n=13}
loss = gluon.loss.L2Loss()
...
```

@@ -279,12 +279,12 @@ in the section where we discussed how to deal
with model selection (:numref:`sec_model_selection`). We will put this to good use to select the model design
and to adjust the hyperparameters.
We first need a function that returns
- the i-th fold of the data in a k-fold cros-validation procedure.
- It proceeds by slicing out the i-th segment as validation data
+ the $i^\mathrm{th}$ fold of the data in a k-fold cross-validation procedure.
+ It proceeds by slicing out the $i^\mathrm{th}$ segment as validation data
and returning the rest as training data.
Note that this is not the most efficient way of handling data
and we would definitely do something much smarter
- if our dataset wasconsiderably larger.
+ if our dataset were considerably larger.
But this added complexity might obfuscate our code unnecessarily
so we can safely omit it here owing to the simplicity of our problem.
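
A sketch of such a function, using contiguous equal-sized slices (close in spirit to what the book defines, but illustrative only):

```python
import numpy as np

def get_k_fold_data(k, i, X, y):
    """Return the i-th of k folds as validation data, the rest as training."""
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx], y[idx]
        if j == i:                       # this slice is the validation fold
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = np.concatenate([X_train, X_part], axis=0)
            y_train = np.concatenate([y_train, y_part], axis=0)
    return X_train, y_train, X_valid, y_valid

X, y = np.arange(20).reshape(10, 2), np.arange(10)
X_train, y_train, X_valid, y_valid = get_k_fold_data(5, 0, X, y)
print(X_train.shape, X_valid.shape)  # (8, 2) (2, 2)
```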

@@ -361,7 +361,7 @@ in the k-fold cross-validation have also been reduced accordingly.
Now that we know what a good choice of hyperparameters should be,
we might as well use all the data to train on it
(rather than just $1-1/k$ of the data
- that is used in the crossvalidation slices).
+ that is used in the cross-validation slices).
The model that we obtain in this way
can then be applied to the test set.
Saving the estimates in a CSV file
@@ -424,7 +424,7 @@ The steps are quite simple:
1. Can you improve your model by minimizing the log-price directly? What happens if you try to predict the log price rather than the price?
1. Is it always a good idea to replace missing values by their mean? Hint - can you construct a situation where the values are not missing at random?
1. Find a better representation to deal with missing values. Hint - What happens if you add an indicator variable?
- 1. Improve the score on Kaggle by tuning the hyperparameters through k-fold crossvalidation.
+ 1. Improve the score on Kaggle by tuning the hyperparameters through k-fold cross-validation.
1. Improve the score by improving the model (layers, regularization, dropout).
1. What happens if we do not standardize the continuous numerical features like we have done in this section?

3 changes: 1 addition & 2 deletions chapter_natural-language-processing/word2vec-data-set.md
@@ -193,7 +193,7 @@ all_negatives = get_negatives(all_contexts, corpus, 5)

We extract all central target words `all_centers`, and the context words `all_contexts` and noise words `all_negatives` of each central target word from the data set. We will read them in random mini-batches.

- In a mini-batch of data, the $i$-th example includes a central word and its corresponding $n_i$ context words and $m_i$ noise words. Since the context window size of each example may be different, the sum of context words and noise words, $n_i+m_i$, will be different. When constructing a mini-batch, we concatenate the context words and noise words of each example, and add 0s for padding until the length of the concatenations are the same, that is, the length of all concatenations is $\max_i n_i+m_i$(`max_len`). In order to avoid the effect of padding on the loss function calculation, we construct the mask variable `masks`, each element of which corresponds to an element in the concatenation of context and noise words, `contexts_negatives`. When an element in the variable `contexts_negatives` is a padding, the element in the mask variable `masks` at the same position will be 0. Otherwise, it takes the value 1. In order to distinguish between positive and negative examples, we also need to distinguish the context words from the noise words in the `contexts_negatives` variable. Based on the construction of the mask variable, we only need to create a label variable `labels` with the same shape as the `contexts_negatives` variable and set the elements corresponding to context words (positive examples) to 1, and the rest to 0.
+ In a mini-batch of data, the $i^\mathrm{th}$ example includes a central word and its corresponding $n_i$ context words and $m_i$ noise words. Since the context window size of each example may be different, the sum of context words and noise words, $n_i+m_i$, differs across examples. When constructing a mini-batch, we concatenate the context words and noise words of each example, and add 0s for padding until the concatenations all have the same length, that is, $\max_i (n_i+m_i)$ (`max_len`). In order to avoid the effect of padding on the loss function calculation, we construct the mask variable `masks`, each element of which corresponds to an element in the concatenation of context and noise words, `contexts_negatives`. When an element in the variable `contexts_negatives` is padding, the element in the mask variable `masks` at the same position will be 0. Otherwise, it takes the value 1. In order to distinguish between positive and negative examples, we also need to distinguish the context words from the noise words in the `contexts_negatives` variable. Based on the construction of the mask variable, we only need to create a label variable `labels` with the same shape as `contexts_negatives` and set the elements corresponding to context words (positive examples) to 1, and the rest to 0.

Next, we will implement the mini-batch reading function `batchify`. Its mini-batch input `data` is a list whose length is the batch size, each element of which contains central target words `center`, context words `context`, and noise words `negative`. The mini-batch data returned by this function conforms to the format we need, for example, it includes the mask variable.
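
A plain-NumPy sketch of what such a `batchify` can look like (illustrative; the actual version returns framework arrays):

```python
import numpy as np

def batchify(data):
    """Pad each example's (context + negative) list to max_len and build
    the masks and labels described above.

    data: list of (center, context, negative) triples of word indices."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers.append(center)
        contexts_negatives.append(
            context + negative + [0] * (max_len - cur_len))
        masks.append([1] * cur_len + [0] * (max_len - cur_len))  # 0 marks padding
        labels.append([1] * len(context)                         # 1 marks context words
                      + [0] * (max_len - len(context)))
    return (np.array(centers).reshape(-1, 1), np.array(contexts_negatives),
            np.array(masks), np.array(labels))

batch = [(1, [2, 2], [3, 3, 3, 3]), (1, [2, 2, 2], [3, 3])]
for name, arr in zip(['centers', 'contexts_negatives', 'masks', 'labels'],
                     batchify(batch)):
    print(name, '=', arr)
```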

@@ -269,4 +269,3 @@ for batch in data_iter:
## Scan the QR Code to [Discuss](https://discuss.mxnet.io/t/4356)

![](../img/qr_word2vec-data-set.svg)
