Undergraduate course on ML/AI at the University of Central Florida.
- Artificial intelligence, machine learning, and deep learning
- Supervised learning, unsupervised learning, and reinforcement learning
- Labeled/unlabeled examples, training, inference, classification, regression
- Loss, empirical risk minimization, squared error loss, mean square error loss
Let's examine what could go wrong when applying gradient descent with a poorly chosen learning rate. We could fail to find any solution because the iterates diverge, or we could get stuck in a bad local minimum. The following notebook allows us to apply gradient descent for finding minima of univariate functions. (Univariate means that the functions depend on only one variable.)
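Below is a minimal sketch (my own toy example, not the notebook's code) of gradient descent on a univariate function, showing how the learning rate determines whether the iterates converge or diverge.

```python
# Toy example: minimize f(x) = x^2 with gradient descent.
def f(x):
    return x**2

def df(x):
    return 2 * x  # derivative of f

def gradient_descent(x0, learning_rate, num_steps=20):
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * df(x)  # gradient step
    return x

print(gradient_descent(3.0, learning_rate=0.1))  # converges towards the minimum at 0
print(gradient_descent(3.0, learning_rate=1.1))  # diverges: |x| grows without bound
```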
The loss function for a deep neural network depends on millions of parameters. Such functions are called multivariate because they depend on multiple variables. It is no longer possible to easily visualize multivariate functions.
The following notebooks present two methods for visualizing bivariate functions, that is, those that depend on exactly two variables. Such functions define surfaces in 3D. Think of the surface of a mountain range.
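As an illustration, here is a generic matplotlib sketch (my own choice of function and plotting code, not necessarily the methods used in the notebooks) showing a bivariate function as a 3D surface and as a contour plot.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(w1, w2):
    return w1**2 + 2 * w2**2  # a simple bowl-shaped example

w1, w2 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
z = f(w1, w2)

fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1, projection="3d")  # surface plot
ax1.plot_surface(w1, w2, z, cmap="viridis")
ax2 = fig.add_subplot(1, 2, 2)                   # contour plot (level sets)
ax2.contour(w1, w2, z, levels=20)
plt.show()
```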
To get started, let's consider the simple case of linear regression: n=1, that is, there is only one feature and the model has only one weight (and a bias term).
In the first implementation, we consider the weight and bias separately and implement stochastic gradient descent. It is easy to see the correspondence between the code and the mathematical expression for the gradient.
In the second implementation, we combine the weight and bias into one vector. We also consider three versions of gradient descent: batch, stochastic, and mini-batch gradient descent. We use a vectorized implementation, that is, all data in a batch is processed in parallel. It is more difficult to see the correspondence between the code and the mathematical expression for the gradient.
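The following is a minimal sketch of the vectorized view (my own notation and hyperparameters, not the notebook's): the bias is absorbed into the weight vector by appending a column of ones to the data matrix, and each update uses a mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))                  # one feature
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=200)   # noisy targets

Xb = np.hstack([X, np.ones((len(X), 1))])              # append a bias column of ones
w = np.zeros(2)                                        # [weight, bias]

learning_rate, batch_size = 0.1, 32
for epoch in range(100):
    perm = rng.permutation(len(y))
    for i in range(0, len(y), batch_size):
        idx = perm[i:i + batch_size]
        error = Xb[idx] @ w - y[idx]                   # predictions minus targets
        grad = 2 * Xb[idx].T @ error / len(idx)        # gradient of the mini-batch MSE
        w -= learning_rate * grad

print(w)  # should be close to [3.0, 2.0]
```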
(TO DO: improve everything below!)
Let's see how we can solve the simplest case of linear regression in Keras.
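Here is a minimal sketch of what this looks like (the data and optimizer settings are my own choices): a single Dense layer with one unit is exactly a linear model with one weight and one bias.

```python
import numpy as np
from tensorflow import keras

X = np.random.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0

model = keras.Sequential([
    keras.layers.Dense(1, input_shape=(1,))  # one weight and one bias
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
model.fit(X, y, epochs=50, batch_size=32, verbose=0)

print(model.layers[0].get_weights())  # weight close to 3.0, bias close to 2.0
```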
There is a closed-form solution, known as the normal equation, for choosing the best weights and bias for linear regression. The optimal solution achieves the smallest squared error loss.
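In NumPy, the normal equation w = (XᵀX)⁻¹Xᵀy can be evaluated directly. A small sketch (the data is made up, and the bias is absorbed as an extra column of ones):

```python
import numpy as np

X = np.random.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0

Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)    # solve the normal equation (X^T X) w = X^T y
print(w)                                    # [3.0, 2.0] up to rounding, since there is no noise
```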
To understand the mathematics underlying the normal equation, read the following materials. I will not cover the derivation of the normal equation.
- Chapter 4 Numerical Computation, Section 4.3 Gradient-Based Optimization
- Chapter 5 Machine Learning Basics, Subsection 5.1.4 Example: Linear Regression
- Additional materials: proof of convexity of MSE and computation of gradient of MSE
We will use Keras to build (almost) all deep learning models. Roughly speaking, TensorFlow is a back end for deep learning, whereas Keras is a front end. Keras can use TensorFlow or other back ends.
Keras is now part of TensorFlow 2.0, so it is available automatically when you import tensorflow. Previously (TensorFlow 1.x) you had to import Keras separately. I may need to do some minor tweaks to the notebooks so that everything is perfectly adapted to TensorFlow 2.0.
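In other words, something like the following is all that is needed (a minimal check, nothing course-specific):

```python
import tensorflow as tf
from tensorflow import keras   # Keras ships inside TensorFlow 2.x

print(tf.__version__)          # should start with "2."
print(keras.__version__)
```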
Let's now see how we can solve more interesting problems with Keras. We consider the problems of classifying images from the MNIST digits, MNIST fashion items, and CIFAR10 datasets.
The classification problems are all multi-class, single-label classification problems. Multi-class means that there are several classes, for instance, T-shirt, pullover, or bag in the MNIST fashion items dataset. Single-label means that the classes are mutually exclusive; for instance, in the MNIST digits dataset an image is either the digit 0, or the digit 1, and so on.
In the multi-class, single-label classification problem, the activation in the last layer is softmax and the loss function is categorical cross entropy.
The examples below use the so-called relu activation function for the hidden layer.
- Notebook for loading and exploring the MNIST digits data set
- Notebook for classifying MNIST digits with dense layers and analyzing model performance
- Notebook for classifying MNIST fashion items with dense layers and analyzing model performance
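For orientation, here is a minimal sketch of such a model (layer sizes, optimizer, and number of epochs are my own choices, not necessarily those used in the notebooks above): a relu hidden layer, a softmax output layer with one neuron per class, and the categorical cross entropy loss.

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype("float32") / 255.0
y_train = keras.utils.to_categorical(y_train, 10)  # one-hot labels
y_test = keras.utils.to_categorical(y_test, 10)

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(28 * 28,)),  # hidden layer
    keras.layers.Dense(10, activation="softmax"),                        # one output per class
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```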
The goal of machine learning is to obtain models that perform well on new, unseen data. For instance, it can happen that a model performs perfectly on the training data but fails on new data. This is called overfitting. The following notes explain briefly how to deal with this important issue.
Logistic regression is used for binary classification problems. Binary means that there are only two classes. For instance, an image has to be classified as either a cat or a dog. There is only one output neuron whose activation indicates the class (say, 1=dog, 0=cat). It is best to use the binary cross entropy loss instead of the squared error loss.
Sigmoid activation functions are used in multi-class, multi-label classification problems. The number of output neurons is equal to the number of classes, and each neuron uses the sigmoid activation function. The binary cross entropy loss is used for each output neuron.
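The two output-layer setups can be contrasted in a short sketch (layer sizes and the input dimension are placeholders):

```python
from tensorflow import keras

# Binary classification (e.g. cat vs dog): a single sigmoid output neuron.
binary_model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(100,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
binary_model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

# Multi-class, multi-label classification: one sigmoid output neuron per class;
# binary cross entropy is applied to each output independently.
num_classes = 5
multilabel_model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(100,)),
    keras.layers.Dense(num_classes, activation="sigmoid"),
])
multilabel_model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```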
These notes explain how to compute the gradients for neural networks consisting of multiple dense layers. I will not go over the mathematical derivation of the backpropagation algorithm. Fortunately, the gradients are computed automatically in Keras.
- Code for creating sequential neural networks with dense layers and training them with backprop and mini-batch SGD; currently, the code is limited to (1) mean squared error loss and (2) sigmoid activations.
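To give a flavor of what the gradients look like, here is a minimal NumPy sketch (my own toy version, not the course code) of backprop for two dense layers with sigmoid activations and the mean squared error loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(4, 8)), np.zeros(8)   # layer 1: 4 -> 8
W2, b2 = 0.1 * rng.normal(size=(8, 1)), np.zeros(1)   # layer 2: 8 -> 1

X = rng.normal(size=(32, 4))     # a mini-batch of 32 examples (dummy data)
y = rng.uniform(size=(32, 1))    # dummy targets in [0, 1]
learning_rate = 0.5

for step in range(100):
    # forward pass
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    loss = np.mean((a2 - y) ** 2)

    # backward pass (chain rule, layer by layer)
    delta2 = 2 * (a2 - y) / len(X) * a2 * (1 - a2)   # dL/d(pre-activation of layer 2)
    dW2, db2 = a1.T @ delta2, delta2.sum(axis=0)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)         # dL/d(pre-activation of layer 1)
    dW1, db1 = X.T @ delta1, delta1.sum(axis=0)

    # mini-batch SGD update
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1

print(loss)  # should have decreased over the iterations
```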
TO DO: add a note on preventing overfitting with data augmentation (also, add L2/L1 regularization and dropout earlier!)
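In the meantime, here is a rough sketch of what those three remedies look like in Keras (the specific settings are placeholders, not recommendations):

```python
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Data augmentation: random transformations applied to training images on the fly.
datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28, 1)),
    keras.layers.Dense(128, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(1e-4)),  # L2 weight penalty
    keras.layers.Dropout(0.5),                                           # dropout
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(datagen.flow(x_train, y_train, batch_size=64), epochs=..., validation_data=...)
```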
Classification of MNIST digits and fashion items
Classification of cats and dogs
Based on Chapter 5, Deep Learning for Computer Vision, of the book Deep Learning with Python by F. Chollet
- Training a convnet from scratch, using data augmentation and dropout
- Using the VGG16 conv base for fast feature extraction (data augmentation not possible), using dropout
- Using the VGG16 conv base for feature extraction, using data augmentation, not using dropout (see the sketch after this list)
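The third variant can be sketched as follows (image size, dense head, and learning rate are my own choices; see Chollet's Chapter 5 for the full versions):

```python
from tensorflow import keras

conv_base = keras.applications.VGG16(weights="imagenet",
                                     include_top=False,
                                     input_shape=(150, 150, 3))
conv_base.trainable = False        # freeze the pretrained conv base

model = keras.Sequential([
    conv_base,                     # VGG16 conv base used as a feature extractor
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # cats vs dogs
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_generator, epochs=..., validation_data=validation_generator)
```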
Based on the Google ML Practicum: Image Classification
- Colab notebook for training a convolutional neural network from scratch
- Colab notebook for training a CNN from scratch with data augmentation and dropout
Visualizing what convnets learn
Based on Chapter 5, Deep Learning for Computer Vision, of the book Deep Learning with Python by F. Chollet
- Visualizing convnet filters; the convnet filter visualizations at the bottom of the notebook look pretty cool! (A small gradient-ascent sketch follows this list.)
- Visualizing heatmaps of class activations; modified version that changes softmax to linear activation in the last layer
- keras-vis: a package for producing cool-looking visualizations. I had problems using it on Colab.
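The filter visualizations are produced by gradient ascent in input space. A stripped-down sketch of the idea (layer name, filter index, and step size are arbitrary choices, not the notebook's settings):

```python
import tensorflow as tf
from tensorflow import keras

model = keras.applications.VGG16(weights="imagenet", include_top=False)
layer = model.get_layer("block3_conv1")
feature_extractor = keras.Model(inputs=model.input, outputs=layer.output)

filter_index = 0
img = tf.Variable(tf.random.uniform((1, 150, 150, 3)) * 0.25 + 0.45)  # start from gray-ish noise

for _ in range(30):
    with tf.GradientTape() as tape:
        activation = feature_extractor(img)
        loss = tf.reduce_mean(activation[:, :, :, filter_index])  # mean response of one filter
    grads = tape.gradient(loss, img)
    img.assign_add(10.0 * tf.math.l2_normalize(grads))  # gradient ascent step

# img now contains a pattern that strongly activates the chosen filter
```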
Some cool looking stuff
Based on Section 8.2 DeepDream and Section 8.3 Neural style transfer of the book Deep learning with Python by F. Chollet. I am not going to explain in detail how deep dream and neural style transfer work. I just wanted to include these notebooks to show you two cool examples of what can be done with deep neural networks.
The goal is to introduce more advanced architectures and concepts. This is based on the Keras documentation: CIFAR-10 ResNet.
The relevant research papers are:
Notebooks
I have made several changes to the code from the Keras documentation. In the above notebook, I had to change the number of epochs and the learning rate schedule because the model is trained on only 40k examples and validated on 10k, whereas the model in the Keras documentation is trained on 50k and not validated at all. I wanted to have a situation that is similar to the one in HW 2 so we can better compare the performance of the ResNet and the (normal) CNN.
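A learning rate schedule of that kind is attached to training as a callback. A minimal sketch (the breakpoints and rates below are placeholders, not the ones used in the notebook):

```python
from tensorflow import keras

def lr_schedule(epoch):
    lr = 1e-3                 # base learning rate
    if epoch > 80:
        lr *= 1e-2
    elif epoch > 60:
        lr *= 1e-1
    return lr

callbacks = [keras.callbacks.LearningRateScheduler(lr_schedule)]
# model.fit(x_train, y_train, epochs=..., callbacks=callbacks, validation_data=...)
```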
- Word embeddings
- Using 1D convnets (TO DO)
- Word embeddings (TO DO: change notebook!!!)
- Newsgroup classification with convolutional model using pretrained GloVe embeddings (TO DO)
- IMDB sentiment classification with LSTM model (TO DO)
- ...
- Arguments return_sequences and return_state for LSTM cells in Keras (see the sketch after this list)
- Character-based sequence-to-sequence model for translating French to English
- TO DO: sequence-to-sequence model with attention
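For reference, a small sketch of the two LSTM arguments mentioned above (the shapes are illustrative):

```python
from tensorflow import keras

inputs = keras.Input(shape=(20, 8))   # 20 timesteps, 8 features per timestep

# return_sequences=False (the default): only the final hidden state, shape (None, 32)
last_state = keras.layers.LSTM(32)(inputs)

# return_sequences=True: the full sequence of hidden states, shape (None, 20, 32);
# needed when stacking recurrent layers
all_states = keras.layers.LSTM(32, return_sequences=True)(inputs)

# return_state=True: also returns the final hidden state h and cell state c, which is
# how the encoder hands its state to the decoder in a sequence-to-sequence model
outputs, state_h, state_c = keras.layers.LSTM(32, return_state=True)(inputs)
```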