Undergraduate course on ML/AI at the University of Central Florida.
- Artificial intelligence, machine learning, and deep learning
- Supervised learning, unsupervised learning, and reinforcement learning
- Labeled/unlabeled examples, training, inference, classification, regression
- Loss, empirical risk minimization, squared error loss, mean square error loss
Let's examine what could go wrong when applying gradient descent with a poorly chosen learning rate. We could fail to find any solution due to divergence or we could get stuck in a bad local minimum. The following notebook allows us to apply gradient descent for finding minima of univariate functions. (Univariate means that the functions depend on only one variable.)
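To make the divergence problem concrete, here is a minimal sketch (it is not the code from the notebook). The function f(x) = x**2, the helper name `gradient_descent`, and the two learning rates are assumptions chosen only for illustration.

```python
def gradient_descent(grad, x0, learning_rate, num_steps):
    """Run plain gradient descent on a univariate function, given its derivative."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * grad(x)
    return x

# Illustrative assumption: f(x) = x**2, whose derivative is 2*x and whose minimum is at x = 0.
def grad_f(x):
    return 2.0 * x

# A small learning rate shrinks x by the factor 0.8 in every step.
x_small = gradient_descent(grad_f, x0=1.0, learning_rate=0.1, num_steps=50)

# A learning rate that is too large multiplies x by -1.2 in every step, so |x| grows.
x_large = gradient_descent(grad_f, x0=1.0, learning_rate=1.1, num_steps=50)

print(x_small, x_large)
```

With the small learning rate the iterates approach the minimum at 0. With the large one every step overshoots the minimum by more than the previous step, so the iterates move away from it.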
The loss function for a deep neural network depends on millions of parameters. Such functions are called multivariate because they depend on multiple variables. It is no longer possible to easily visualize multivariate functions.
The following notebooks present two methods for visualizing bivariate functions, that is, functions that depend on exactly two variables. Such functions define surfaces in 3D. Think of the surface of a mountain range.
In the first implementation, we consider the weight and bias separately and implement stochastic gradient descent. It is easy to see the correspondence between the code and the mathematical expression for the gradient (see section 1 of the above notes).
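A rough sketch of what such an implementation can look like is shown below. It is not the notebook's code. It assumes the per-example loss 1/2 * (w * x + b - y)**2, so the scaling may differ from the notes, and the toy data and hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free toy data generated with weight 3 and bias 2.
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0

w, b = 0.0, 0.0
learning_rate = 0.1

for epoch in range(50):
    # Stochastic gradient descent: one update per training example, in random order.
    for i in rng.permutation(len(x)):
        error = w * x[i] + b - y[i]          # prediction minus target
        w -= learning_rate * error * x[i]    # dL/dw = error * x
        b -= learning_rate * error           # dL/db = error

print(w, b)
```

The two update lines mirror the two partial derivatives of the per-example loss, which is why the correspondence with the formula is easy to see.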
In the second implementation, we combine the weight and bias into one vector. We also consider three versions of gradient descent: batch, mini-batch, and stochastic gradient descent. We use a vectorized implementation, that is, all data in a batch is processed in parallel. It is more difficult to see the correspondence between the code and the mathematical expression for the gradient (see subsection 2.2 of the above notes).
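The vectorized variant can be sketched as follows. Again, this is an illustration under my own assumptions, not the notebook's code: the function name `fit_linear_regression`, the loss scaling (half the mean squared error over the batch), the toy data, and the hyperparameters are all hypothetical. The three versions of gradient descent differ only in the batch size.

```python
import numpy as np

def fit_linear_regression(x, y, batch_size, learning_rate=0.5, num_epochs=100, seed=0):
    """Fit y ~ weight * x + bias by gradient descent on half the mean squared error."""
    rng = np.random.default_rng(seed)
    # Prepend a column of ones so that theta[0] is the bias and theta[1] is the weight.
    X = np.column_stack([np.ones_like(x), x])
    theta = np.zeros(2)
    for _ in range(num_epochs):
        order = rng.permutation(len(x))  # reshuffle the examples every epoch
        for start in range(0, len(x), batch_size):
            batch = order[start:start + batch_size]
            residuals = X[batch] @ theta - y[batch]         # whole batch at once
            gradient = X[batch].T @ residuals / len(batch)  # averaged over the batch
            theta -= learning_rate * gradient
    return theta

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 2.0  # noise-free toy data: weight 3, bias 2

theta_batch = fit_linear_regression(x, y, batch_size=len(x))  # batch gradient descent
theta_mini = fit_linear_regression(x, y, batch_size=10)       # mini-batch gradient descent
theta_sgd = fit_linear_regression(x, y, batch_size=1)         # stochastic gradient descent
```

Because the design matrix `X` can have any number of feature columns, the same loop carries over to several features once `theta` is sized accordingly.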
This vectorized implementation of gradient descent for linear regression with a single feature can be generalized to linear regression with multiple features (you have to do this for n=2 for one of the homework problems).
There is a closed-form solution for choosing the best weights and bias for linear regression. The optimal solution achieves the smallest squared error loss. I will not cover this in class. If you are interested, you can find more details in the notes Linear regression using the normal equation.
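For the curious, here is a minimal NumPy sketch of the closed-form solution. It assumes the usual normal equation X^T X theta = X^T y with a column of ones for the bias. The toy data is made up, and the notes may use different notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy toy data generated with weight 3 and bias 2.
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=100)

# Design matrix: a column of ones (for the bias) next to the feature column.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equation X^T X theta = X^T y without forming an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
bias, weight = theta

print(weight, bias)
```

Solving the linear system with `np.linalg.solve` is generally preferable to computing the inverse of X^T X explicitly.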
Keras is a high-level deep learning API that allows you to easily build, train, evaluate, and execute all sorts of neural networks. Its documentation (or specification) is available at https://keras.io. The reference implementation (https://github.com/keras-team/keras), also called Keras, was developed by Francois Chollet as part of a research project and released as an open source project in March 2015. To perform the heavy computations required by neural networks, this reference implementation relies on a computation backend. At present, you can choose from three popular open source deep learning libraries: TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano. Therefore, to avoid any confusion, we will refer to this reference implementation as multibackend Keras.
Since late 2016, other implementations have been released. You can now run Keras on Apache MXNet, Apple's Core ML, JavaScript or TypeScript (to run Keras code in a web browser), and PlaidML (which can run on all sorts of GPU devices, not just Nvidia).
TensorFlow 2 itself now comes bundled with its own Keras implementation, tf.keras. It only supports TensorFlow as the backend, but it has the advantage of offering some very useful extra features: for example, it supports TensorFlow's Data API, which makes it easy to load and preprocess data efficiently.
In this course, we will use TensorFlow 2.x and tf.keras. Always make sure that you use the correct versions of TensorFlow and Keras.
- keras.io is the documentation for the multibackend Keras implementation. You have to tweak the code examples from keras.io to use them with TensorFlow 2.x and tf.keras.
Let's see how we can solve the simplest case of linear regression in Keras.
We are going to work with some simple datasets to start learning about neural networks. The collection tf.keras.datasets contains only a few simple datasets and provides an elementary way of loading them. (Later, we will learn about TensorFlow datasets, which contains nearly 100 datasets and provides high-performance input data pipelines for loading them.)
Let's briefly describe Keras concepts such as dense / convolutional / recurrent layers, sequential models, functional API, activation functions, loss functions, optimizers, and metrics.
Before formally defining sequential neural networks with dense layers, let's look at some simple Keras models showing how to use such networks for classification. We consider the problems of classifying images from the MNIST digits dataset and the fashion items dataset. These problems are so-called multi-class / single-label classification problems.
Multi-class means that there are several classes. For instance, the fashion items dataset has classes such as T-shirt, pullover, and bag.
Single-label means that classes are mutually exclusive. For instance, an image is either the digit 0, or the digit 1, etc. in the MNIST digits dataset.
The example neural networks in the notebooks below consist of three layers: input, hidden, and output layers. They use the softmax activation function in the last (output) layer and the categorical cross entropy loss function because the problems are multi-class, single-label classification problems. They also use the relu activation function for the hidden layer.
These notebooks also show how to split datasets into training datasets and test datasets and also discuss overfitting.
- Notebook for classifying MNIST digits with dense layers and analyzing model performance
- Notebook for classifying fashion items with dense layers and analyzing model performance
The notebook below uses pandas.DataFrame to display learning curves and to visually analyze predictions.
The goal of machine learning is to obtain models that perform well on new unseen data. It can happen that a model performs perfectly on the training data, but fails on new data. This is called overfitting. The following notes explain briefly how to deal with this important issue.
Logistic regression is used for binary classification problems. Binary means that there are only two classes. For instance, a movie review has to be classified as either positive (class 1) or negative (class 0). There is only one output neuron whose activation indicates the probability of class 1. This output neuron uses the sigmoid activation function, which ensures that its activation lies inside the interval [0, 1], that is, it is a valid probability.
The squared error loss could be used, but it is much better to use the binary cross entropy loss because it speeds up training. The notes below derive the gradient for the two combinations: sigmoid activation with squared error loss and sigmoid activation with binary cross entropy loss.
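The following sketch compares the two gradients with respect to the pre-activation z of the output neuron. It assumes the squared error loss is scaled as 1/2 * (a - y)**2 with a = sigmoid(z), so the notes may use a different constant factor. The function names and the example value z = -6 are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_squared_error(z, y):
    """Derivative with respect to z of 1/2 * (sigmoid(z) - y)**2."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def grad_binary_cross_entropy(z, y):
    """Derivative with respect to z of -(y*log(a) + (1-y)*log(1-a)) with a = sigmoid(z)."""
    return sigmoid(z) - y

# A confidently wrong prediction: the true class is 1, but sigmoid(-6) is close to 0.
z, y = -6.0, 1.0
g_se = grad_squared_error(z, y)
g_bce = grad_binary_cross_entropy(z, y)

print(g_se, g_bce)
```

The extra factor a * (1 - a) in the squared error gradient is nearly zero when the neuron saturates, so a badly wrong prediction produces only a tiny update. The binary cross entropy gradient has no such factor, which is one way to see why it trains faster.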
The notebook below presents a simple elementary method for preprocessing text data so it can be input into a neural network. We will discuss more advanced methods for preprocessing text later.
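One common elementary approach is multi-hot encoding, sketched below. This may or may not be the exact method used in the notebook. The helper name `multi_hot`, the tiny vocabulary, and the two example reviews are assumptions for illustration.

```python
import numpy as np

def multi_hot(sequences, vocabulary_size):
    """Turn lists of word indices into fixed-length 0/1 vectors (word order is discarded)."""
    encoded = np.zeros((len(sequences), vocabulary_size), dtype="float32")
    for row, word_indices in enumerate(sequences):
        encoded[row, word_indices] = 1.0  # mark every word that occurs in the review
    return encoded

# Two hypothetical reviews, given as lists of word indices in a vocabulary of size 5.
reviews = [[1, 4, 4, 2], [0, 3]]
x = multi_hot(reviews, vocabulary_size=5)

print(x)
```

Every review becomes a vector of the same length, no matter how many words it contains, so it can be fed directly into a dense layer.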
This notebook also shows how we can use a validation set to monitor the performance of the model and subsequently choose a good number of epochs to prevent overfitting.
We already talked briefly about multi-class / single-label classification, softmax activation, and categorical cross entropy loss when presenting Keras examples for classifying MNIST digits and fashion items.
The notes below explain the mathematics behind softmax activation and categorical cross entropy loss and derive the gradient for this combination of activation and loss.
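As a quick numerical companion to the derivation, the sketch below checks the well-known result that the gradient of the categorical cross entropy loss with respect to the pre-activations of a softmax layer is the predicted probabilities minus the one-hot label. The example logits and label are made up.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtracting the maximum avoids overflow and does not change the result
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(p, y_one_hot):
    return -np.sum(y_one_hot * np.log(p))

z = np.array([2.0, 1.0, -1.0])  # hypothetical pre-activations for three classes
y = np.array([0.0, 1.0, 0.0])   # one-hot label: the true class is class 1

p = softmax(z)
analytic_grad = p - y  # gradient of the loss with respect to z

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric_grad = np.zeros_like(z)
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    numeric_grad[k] = (categorical_cross_entropy(softmax(z + dz), y)
                       - categorical_cross_entropy(softmax(z - dz), y)) / (2 * eps)
```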
Imagine that you receive an image of a face and that you have to decide (a) whether the person is smiling and (b) whether the person is wearing glasses. Smiling and wearing glasses are independent of each other. This is an example of multi-class / multi-label classification.
Sigmoid activation functions are used in the output layer in multi-class / multi-label classification problems. The number of output neurons is equal to the number of classes, and each neuron uses the sigmoid activation function. The binary cross entropy loss is used for each output neuron.
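A small sketch for the smiling / glasses example is shown below. The pre-activation values are made up, and averaging the per-neuron losses is one possible convention (summing them is another).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations of the two output neurons: [smiling, wearing glasses].
logits = np.array([2.0, -1.0])
labels = np.array([1.0, 1.0])  # the person is smiling and is wearing glasses

# Each neuron outputs its own probability; unlike softmax, they need not sum to 1.
probs = sigmoid(logits)

# Binary cross entropy is computed separately for each output neuron.
per_label_loss = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
loss = per_label_loss.mean()

# Each label is predicted independently by thresholding its own probability.
predicted = probs >= 0.5
```

Here the network would predict "smiling" but miss the glasses, and the second neuron accordingly contributes the larger share of the loss.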
We will look at some examples of multi-class / multi-label classification after introducing convolutional neural networks.
Underfitting, overfitting, and two simple methods for fighting overfitting: dropout and L1 / L2 regularization
- Notebook for classifying movie reviews, demonstrating underfitting and overfitting
- Notebook for classifying movie reviews, using regularization and dropout to combat overfitting (TO DO: add more details how regularization and dropout work)
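As a rough idea of how these two techniques work, here is a NumPy sketch (not the notebooks' code). It assumes "inverted" dropout, where the surviving activations are rescaled during training, and weight penalties that are simply added to the loss. The function names and the example values are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout for training: zero a fraction `rate` of the units and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True for the units that are kept
    return activations * mask / keep_prob             # rescaling keeps the expected value unchanged

def l2_penalty(weights, strength):
    """Term added to the loss that penalizes large weights quadratically."""
    return strength * np.sum(weights ** 2)

def l1_penalty(weights, strength):
    """Term added to the loss that penalizes the absolute values of the weights."""
    return strength * np.sum(np.abs(weights))

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.5, rng=rng)

weights = np.array([1.0, -2.0])
penalty_l2 = l2_penalty(weights, strength=0.01)
penalty_l1 = l1_penalty(weights, strength=0.01)
```

At inference time dropout is switched off and the activations are used unchanged. The penalty terms only affect training, where they push the weights towards small values.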
Notes on backpropagation algorithm for computing gradients in sequential neural networks with dense layers
These notes explain how to compute the gradients for neural networks consisting of multiple dense layers. I will not go over the mathematical derivation of the backpropagation algorithm. Fortunately, the gradients are computed automatically in Keras.
My notes are mostly based on chapter 2 "How the backpropagation algorithm works" of the book "Neural Networks and Deep Learning".
My code is based on the code described in chapter 5 "Getting started with neural networks" of the book "Deep Learning and the Game of Go".
- Code for creating sequential neural networks with dense layers and training them with backprop and mini-batch SGD; currently, the code is limited to (1) mean squared error loss and (2) sigmoid activations; the neural network learns rather slowly because the combination of sigmoid activation in the output layer and the mean squared error loss is suboptimal; ideally, we would use softmax activation together with categorical cross entropy loss; using CuPy on a GPU instance instead of NumPy should speed up training
You can also find implementations of neural networks from scratch in the book "Neural Networks and Deep Learning" and also in the book "Grokking Deep Learning".
TO DO: clean up everything below
TO DO: add note of preventing overfitting with data augmentation (also, add L2/L1 regularization and dropout earlier!)
Classification of MNIST digits and fashion items
Transfer learning: classification of cats and dogs
based on Chapter 5 Deep learning for computer vision of the book Deep learning with Python by F. Chollet
- training convnet from scratch, using data augmentation and dropout
- using VGG16 conv base for fast feature extraction (data augmentation not possible), using dropout
- using VGG16 conv base for feature extraction, using data augmentation, not using dropout
!!! Remove the notebooks below? Redundant ? !!!
based on Google ML Practicum: Image Classification
- Colab notebook for training a convolutional neural network from scratch
- Colab notebook for training a CNN from scratch with data augmentation and dropout
Visualizing what convnets learn
based on chapter 5 Deep learning for computer vision of the book Deep learning with Python by F. Chollet
- Visualizing convnet filters, the convnet filter visualizations at the bottom of the notebook look pretty cool!
- Visualizing heatmaps of class activations, modified version, changes softmax to linear activation in last layer
- keras-vis: This is a package for producing cool looking visualizations. I had problems using it on colab. !!! Fix it !!!
Some cool looking stuff
Based on Section 8.2 DeepDream and Section 8.3 Neural style transfer of the book Deep learning with Python by F. Chollet. I am not going to explain in detail how deep dream and neural style transfer work. I just wanted to include these notebooks to show you two cool examples of what can be done with deep neural networks.
The goal is to introduce more advanced architectures and concepts. This is based on the Keras documentation: CIFAR-10 ResNet.
The relevant research papers are:
Notebooks
I have made several changes to the code from the Keras documentation. In the above notebook, I had to change the number of epochs and the learning rate schedule because the model is only trained on 40k and validated on 10k, whereas the model in the Keras documentation is trained on 50k and not validated at all. I wanted to have a situation that is similar to the situation in HW 2 so we can better compare the performance of the ResNet and the (normal) CNN.
TensorFlow datasets is a collection of nearly 100 ready-to-use datasets that can quickly help build high-performance input data pipelines for training TensorFlow models. Instead of downloading and manipulating datasets manually and then figuring out how to read their labels, TensorFlow datasets standardizes the data format so that it's easy to swap one dataset with another, often with just a single line of code change. As you will see later on, splitting a dataset into training, validation, and test sets is also a matter of a single line of code. The high-performance input data pipelines make it possible to work on the data in parallel. For instance, while the GPU is working with a batch of data, the CPU is prefetching the next batch.
- one-shot learning
- image similarity, face-recognition
- Word embeddings
- Using 1D convnets (TO DO)
- Word embeddings (TO DO: change notebook !!!)
- Newsgroup classification with convolutional model using pretrained Glove embeddings (TO DO)
- IMDB sentiment classification with LSTM model (TO DO)
- ...
- Arguments return_sequences and return_state for LSTM cells in Keras
- Character-based sequence-to-sequence model for translating French to English
- TO DO: sequence-to-sequence model with attention