From c647911ecc81efc597e83f948a5560431372cc44 Mon Sep 17 00:00:00 2001
From: Aston Zhang
Date: Thu, 19 Jan 2023 07:59:39 +0000
Subject: [PATCH] Fix missing bib entries

---
 .../bahdanau-attention.md                             | 2 +-
 chapter_convolutional-modern/alexnet.md               | 2 +-
 chapter_convolutional-modern/cnn-design.md            | 2 +-
 chapter_hyperparameter-optimization/hyperopt-intro.md | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/chapter_attention-mechanisms-and-transformers/bahdanau-attention.md b/chapter_attention-mechanisms-and-transformers/bahdanau-attention.md
index a901cc1b45..60f8a871b4 100644
--- a/chapter_attention-mechanisms-and-transformers/bahdanau-attention.md
+++ b/chapter_attention-mechanisms-and-transformers/bahdanau-attention.md
@@ -21,7 +21,7 @@ Recall :numref:`fig_seq2seq_details` which we reprint below (:numref:`fig_s2s_at
 While this is quite reasonable for short sequences, it is clear that it is infeasible for long sequences, such as a book chapter or even just a very long sentence. After all, after a while there will simply not be enough "space" in the intermediate representation to store all that is important in the source sequence. Consequently the decoder will fail to translate long and complex sentences. One of the first to encounter was :citet:`Graves.2013` when they tried to design an RNN to generate handwritten text.
 Since the source text has arbitrary length they designed a differentiable attention model to align text characters with the much longer pen trace,
-where the alignment moves only in one direction. This, in turn, draws on decoding algorithms in speech recognition, e.g., hidden Markov models :cite:`RabJua93`.
+where the alignment moves only in one direction. This, in turn, draws on decoding algorithms in speech recognition, e.g., hidden Markov models :cite:`rabiner1993fundamentals`.
 Inspired by the idea of learning to align,
 :citet:`Bahdanau.Cho.Bengio.2014` proposed a differentiable attention model
diff --git a/chapter_convolutional-modern/alexnet.md b/chapter_convolutional-modern/alexnet.md
index 354566e578..4a48f9ee26 100644
--- a/chapter_convolutional-modern/alexnet.md
+++ b/chapter_convolutional-modern/alexnet.md
@@ -148,7 +148,7 @@ Challenge :cite:`russakovsky2015imagenet`,
 pushed computer vision and machine learning research forward,
 challenging researchers to identify which models performed best
 at a greater scale than academics had previously considered.
 The largest vision datasets, such as LAION-5B
-:cite:`schuhmannlaion` contain billions of images with additional metadata.
+:cite:`schuhmann2022laion` contain billions of images with additional metadata.
 
 ### Missing Ingredient: Hardware
diff --git a/chapter_convolutional-modern/cnn-design.md b/chapter_convolutional-modern/cnn-design.md
index 0301af39fe..303387594a 100644
--- a/chapter_convolutional-modern/cnn-design.md
+++ b/chapter_convolutional-modern/cnn-design.md
@@ -337,7 +337,7 @@ with d2l.try_gpu():
 With desirable inductive biases (assumptions or preferences) like locality and translation invariance (:numref:`sec_why-conv`) for vision, CNNs have been the dominant architectures in this area. This has remained the case since LeNet up until recently when Transformers (:numref:`sec_transformer`) :cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021,touvron2021training` started surpassing CNNs in terms of accuracy. While much of the recent progress in terms of vision Transformers *can* be backported into CNNs :cite:`liu2022convnet`, it is only possible at a higher computational cost.
 Just as importantly, recent hardware optimizations (NVIDIA Ampere and Hopper) have only widened the gap in favor of Transformers.
-It is worth noting that Transformers have a significantly lower degree of inductive bias towards locality and translation invariance than CNNs. It is not the least due to the availability of large image collections, such as LAION-400m and LAION-5B :cite:`schuhmannlaion` with up to 5 billion images that learned structures prevailed. Quite surprisingly, some of the more relevant work in this context even includes MLPs :cite:`tolstikhin2021mlp`.
+It is worth noting that Transformers have a significantly lower degree of inductive bias towards locality and translation invariance than CNNs. It is not the least due to the availability of large image collections, such as LAION-400m and LAION-5B :cite:`schuhmann2022laion` with up to 5 billion images that learned structures prevailed. Quite surprisingly, some of the more relevant work in this context even includes MLPs :cite:`tolstikhin2021mlp`.
 
 In sum, vision Transformers (:numref:`sec_vision-transformer`) by now lead in terms of state-of-the-art performance
 in large-scale image classification,
diff --git a/chapter_hyperparameter-optimization/hyperopt-intro.md b/chapter_hyperparameter-optimization/hyperopt-intro.md
index e432d7015a..40988de000 100644
--- a/chapter_hyperparameter-optimization/hyperopt-intro.md
+++ b/chapter_hyperparameter-optimization/hyperopt-intro.md
@@ -272,7 +272,7 @@ depends on a small subset of the hyperparameters.
     5. Apart from the sheer amount of compute and storage required, what other issues would gradient-based hyperparameter optimization run into? Hint: Re-read about vanishing and exploding gradients in :numref:`sec_numerical_stability`.
     6. *Advanced*: Read :cite:`maclaurin-icml15` for an elegant (yet still somewhat unpractical) approach to gradient-based HPO.
 3. Grid search is another HPO baseline, where we define an equi-spaced grid for each hyperparameter, then iterate over the (combinatorial) Cartesian product in order to suggest configurations.
-    1. We stated above that random search can be much more efficient than grid search for HPO on a sizable number of hyperparameters, if the criterion most strongly depends on a small subset of the hyperparameters. Why is this? Hint: Read :cite:`bergstra-nips11`.
+    1. We stated above that random search can be much more efficient than grid search for HPO on a sizable number of hyperparameters, if the criterion most strongly depends on a small subset of the hyperparameters. Why is this? Hint: Read :cite:`bergstra2011algorithms`.
 
 :begin_tab:`pytorch`