CAP 4630 Artificial Intelligence

Undergraduate course on ML/AI at the University of Central Florida.

Overview

Fundamental machine learning concepts

Python, numpy, and matplotlib

Effect of learning rate on gradient descent for finding minima of univariate functions

Let's examine what could go wrong when applying gradient descent with a poorly chosen learning rate. We could fail to find any solution due to divergence or we could get stuck in a bad local minimum. The following notebook allows us to apply gradient descent for finding minima of univariate functions. (Univariate means that the functions depend on only one variable.)
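As a minimal sketch of both failure modes (this is not the course notebook), the snippet below runs plain gradient descent on two hand-picked functions. The functions, starting points, learning rates, and the helper name `gradient_descent` are illustrative assumptions.

```python
def gradient_descent(grad, x0, learning_rate, steps=200):
    """Plain gradient descent; returns the point reached after `steps` updates."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# f(x) = x**2 has a single minimum at x = 0; its derivative is 2x.
def grad_f(x):
    return 2.0 * x


# A small learning rate shrinks x towards 0; a too-large one overshoots
# further on every step, so the iterates blow up (divergence).
converged = gradient_descent(grad_f, x0=1.0, learning_rate=0.1, steps=50)
diverged = gradient_descent(grad_f, x0=1.0, learning_rate=1.1, steps=50)


# g(x) = x**4 - 3x**2 + x has a local minimum near x = 1.13
# and its global minimum near x = -1.30.
def grad_g(x):
    return 4.0 * x**3 - 6.0 * x + 1.0


# Same learning rate, different starting points: the run started on the
# right slides into the worse (local) minimum and stays there.
stuck = gradient_descent(grad_g, x0=2.0, learning_rate=0.01)
best = gradient_descent(grad_g, x0=-2.0, learning_rate=0.01)

print(converged, diverged, stuck, best)
```

Try a few other learning rates and starting points to see where the behavior changes.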

Visualization of bivariate functions

The loss function for a deep neural network depends on millions of parameters. Such functions are called multivariate because they depend on multiple variables. Functions of that many variables cannot be visualized directly.

The following notebooks present two methods for visualizing bivariate functions, that is, functions that depend on exactly two variables. Such functions define surfaces in 3D. Think of the surface of a mountain range.
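The text does not say which two methods the notebooks use. As an assumption, the sketch below shows two common choices in matplotlib: a 3D surface plot and a contour plot. The function `f` is an illustrative example.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older matplotlib)


def f(x, y):
    """Illustrative bivariate function: an elongated bowl with minimum at (0, 0)."""
    return x**2 + 2.0 * y**2


# Evaluate f on a regular grid of (x, y) points.
xs = np.linspace(-2.0, 2.0, 50)
ys = np.linspace(-2.0, 2.0, 50)
X, Y = np.meshgrid(xs, ys)
Z = f(X, Y)

fig = plt.figure(figsize=(10, 4))

# Assumed method 1: draw the surface z = f(x, y) in 3D.
ax_surface = fig.add_subplot(1, 2, 1, projection="3d")
ax_surface.plot_surface(X, Y, Z, cmap="viridis")
ax_surface.set_title("Surface plot")

# Assumed method 2: draw level curves f(x, y) = const, like a topographic map.
ax_contour = fig.add_subplot(1, 2, 2)
ax_contour.contour(X, Y, Z, levels=15)
ax_contour.set_title("Contour plot")
```

In a notebook the figure is displayed automatically; in a script, call `plt.show()` at the end.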

Linear regression using gradient descent - numpy implementation

To get started, let's consider the simple case of linear regression: n=1, that is, there is only one feature and the model has only one weight (and a bias term).

In the first implementation, we consider the weight and bias separately and implement stochastic gradient descent. It is easy to see the correspondence between the code and the mathematical expression for the gradient.
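A minimal sketch of this idea, assuming the per-example loss 0.5 * (w*x + b - y)^2 and synthetic data generated from y = 2x + 1. The data, learning rate, and number of epochs are illustrative choices, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 2x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=100)

w, b = 0.0, 0.0          # weight and bias, kept as separate scalars
learning_rate = 0.1

for epoch in range(50):
    # Stochastic gradient descent: one update per training example,
    # visiting the examples in a fresh random order each epoch.
    for i in rng.permutation(len(x)):
        error = (w * x[i] + b) - y[i]
        # For loss_i = 0.5 * error**2:
        #   d loss_i / dw = error * x[i]
        #   d loss_i / db = error
        w -= learning_rate * error * x[i]
        b -= learning_rate * error

print(w, b)
```

The two update lines are exactly the two partial derivatives, which is what makes this version easy to match against the math.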

In the second implementation, we combine the weight and bias into one vector. We also consider three versions of gradient descent: batch, stochastic, and mini-batch gradient descent. We use a vectorized implementation, that is, all data in a batch is processed in parallel. It is more difficult to see the correspondence between the code and the mathematical expression for the gradient.
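A hedged sketch of the vectorized version: the bias is absorbed by prepending a column of ones to the data, and the batch size alone selects between the three variants. The function name `train`, the synthetic data, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200

# Synthetic data (assumption): y = 3x - 0.5 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=(m, 1))
y = 3.0 * x[:, 0] - 0.5 + 0.01 * rng.normal(size=m)

# Prepend a column of ones so that theta = [bias, weight].
X = np.hstack([np.ones((m, 1)), x])


def train(X, y, batch_size, learning_rate=0.1, epochs=300, seed=0):
    """Gradient descent on the mean squared error 0.5 * mean((X @ theta - y)**2)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Vectorized gradient: all examples in the batch at once.
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= learning_rate * grad
    return theta


theta_batch = train(X, y, batch_size=m)    # batch: one update per epoch
theta_sgd = train(X, y, batch_size=1)      # stochastic: one update per example
theta_mini = train(X, y, batch_size=32)    # mini-batch: in between

print(theta_batch, theta_sgd, theta_mini)
```

The single line computing `grad` replaces the per-example updates of the scalar version, which is why the link to the formula is harder to see.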

(TO DO: improve everything below!)

Linear regression - Keras implementation

Let's see how we can solve the simplest case of linear regression in Keras.

Linear regression using the normal equation - numpy implementation

There is a closed-form solution for choosing the best weights and bias for linear regression. The optimal solution achieves the smallest squared error loss.
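As a minimal sketch, the normal equation states that the optimal parameter vector theta satisfies (X^T X) theta = X^T y, where X is the data matrix with a leading column of ones for the bias. The synthetic data below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100

# Synthetic data (assumption): y = 2x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=(m, 1))
y = 2.0 * x[:, 0] + 1.0 + 0.01 * rng.normal(size=m)

# Column of ones for the bias, so theta = [bias, weight].
X = np.hstack([np.ones((m, 1)), x])

# Solve (X^T X) theta = X^T y directly instead of forming the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same squared error and is usually the
# numerically safer choice when X^T X is ill-conditioned.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta, theta_lstsq)
```

No learning rate or iteration count is needed here, in contrast to gradient descent.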

To understand the mathematics underlying the normal equation, read the following materials. I will not cover the derivation of the normal equation.

TensorFlow and Keras

We will use Keras to build (almost) all deep learning models. Roughly speaking, TensorFlow is a back-end for deep learning, whereas Keras is a front-end. Keras can use TensorFlow or other back-ends.

Keras is now part of TensorFlow 2.0, so it is available automatically when you import tensorflow. Previously (TensorFlow 1.x), you had to import Keras separately. I may need to make some minor tweaks to the notebooks so that everything is fully adapted to TensorFlow 2.0.

Keras examples

Let's now see how we can solve more interesting problems with Keras. We consider the problems of classifying images from the MNIST digits, MNIST fashion items, and CIFAR10 datasets.

The classification problems are all multi-class, single-label classification problems. Multi-class means that there are several classes, for instance, T-shirt, pullover, or bag in the MNIST fashion items dataset. Single-label means that the classes are mutually exclusive, for instance, an image in the MNIST digits dataset is either the digit 0, or the digit 1, etc.

In the multi-class, single-label classification problem, the activation in the last layer is softmax and the loss function is categorical cross entropy.
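To make the two terms concrete, here is a small numpy sketch of softmax and categorical cross entropy. In Keras both are built in; the function names and example numbers below are illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    # Subtracting the row maximum does not change the result
    # but avoids overflow in exp.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_cross_entropy(probs, one_hot_labels):
    """Mean over examples of -log(probability assigned to the true class)."""
    return -np.mean(np.sum(one_hot_labels * np.log(probs), axis=-1))


# Two examples, three mutually exclusive classes (one-hot labels).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

probs = softmax(logits)
loss = categorical_cross_entropy(probs, labels)
print(probs, loss)
```

Because the softmax outputs sum to 1, raising the probability of one class lowers the others, which matches the single-label assumption.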

The examples below use the ReLU (rectified linear unit) activation function for the hidden layer.

Generalization, overfitting, splitting data in train & test sets

The goal of machine learning is to obtain models that perform well on new, unseen data, that is, models that generalize. For instance, it can happen that a model performs perfectly on the training data but fails on new data. This is called overfitting. The following notes explain briefly how to deal with this important issue.
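The basic tool is to hold out part of the data and never train on it. A minimal numpy sketch, assuming a random 80/20 split; the helper name `train_test_split` and the toy data are illustrative.

```python
import numpy as np


def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the examples, then hold out a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


# Toy data: 10 examples with 2 features each.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))
```

The model is fitted on the training part only; a large gap between training and test performance is the typical sign of overfitting.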

Validation

Logistic regression, gradient for squared error loss, and gradient for binary cross entropy loss

Logistic regression is used for binary classification problems. Binary means that there are only two classes. For instance, an image has to be classified as either a cat or a dog. There is only one output neuron whose activation indicates the class (say, 1=dog, 0=cat). It is best to use the binary cross entropy loss instead of the squared error loss.
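One way to see why binary cross entropy is preferred: compare the gradients of both losses with respect to the pre-activation z of the output neuron. The sketch below assumes the squared error is defined as 0.5 * (a - y)^2 with a = sigmoid(z); the example values are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grad_squared_error(z, y):
    # L = 0.5 * (a - y)**2 with a = sigmoid(z)
    # dL/dz = (a - y) * a * (1 - a)
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)


def grad_binary_cross_entropy(z, y):
    # L = -(y * log(a) + (1 - y) * log(1 - a)) with a = sigmoid(z)
    # dL/dz = a - y
    a = sigmoid(z)
    return a - y


# A confidently wrong prediction: the true label is 1, but z is very negative.
z, y = -8.0, 1.0
g_se = grad_squared_error(z, y)
g_bce = grad_binary_cross_entropy(z, y)
print(g_se, g_bce)
```

With the squared error, the extra factor a * (1 - a) is close to zero when the sigmoid saturates, so a badly wrong neuron learns very slowly. With binary cross entropy that factor cancels and the gradient stays large.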

Sigmoid activation functions are used in multi-class, multi-label classification problems. The number of output neurons is equal to the number of classes, and each neuron uses the sigmoid activation function. The binary cross entropy loss is used for each output neuron.

Softmax, gradient for categorical cross entropy loss

Sequential neural networks with dense layers

These notes explain how to compute the gradients for neural networks consisting of multiple dense layers. I will not go over the mathematical derivation of the backpropagation algorithm. Fortunately, the gradients are computed automatically in Keras.
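To give a feeling for what Keras does automatically, here is a hedged numpy sketch of the backward pass for a network with two dense layers, followed by a finite-difference check of one gradient entry. The layer sizes, the ReLU hidden activation, and the squared error loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                      # 5 examples, 3 features
y = rng.normal(size=(5, 1))                      # 1 target per example
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)    # hidden layer, 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # linear output layer


def forward(W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)                     # ReLU
    out = a1 @ W2 + b2
    loss = 0.5 * np.mean(np.sum((out - y) ** 2, axis=1))
    return z1, a1, out, loss


z1, a1, out, loss = forward(W1, b1, W2, b2)

# Backward pass: apply the chain rule layer by layer, from output to input.
d_out = (out - y) / len(X)        # d loss / d out
dW2 = a1.T @ d_out
db2 = d_out.sum(axis=0)
d_a1 = d_out @ W2.T               # propagate through the output layer
d_z1 = d_a1 * (z1 > 0)            # ReLU passes gradient only where z1 > 0
dW1 = X.T @ d_z1
db1 = d_z1.sum(axis=0)

# Check one entry of dW1 against a central finite difference.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[3]
           - forward(W1_minus, b1, W2, b2)[3]) / (2.0 * eps)
print(dW1[0, 0], numeric)
```

The two printed numbers should agree closely; the same check can be repeated for any other weight or bias entry.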

Deep learning for computer vision (convolutional neural networks)

TO DO: add note on preventing overfitting with data augmentation (also, add L2/L1 regularization and dropout earlier!)

Classification of MNIST digits and fashion items

Classification of cats and dogs

based on Chapter 5 Deep learning for computer vision of the book Deep learning with Python by F. Chollet


based on Google ML Practicum: Image Classification


Visualizing what convnets learn

based on Chapter 5 Deep learning for computer vision of the book Deep learning with Python by F. Chollet


Some cool looking stuff

Based on Section 8.2 DeepDream and Section 8.3 Neural style transfer of the book Deep learning with Python by F. Chollet. I am not going to explain in detail how DeepDream and neural style transfer work. I just wanted to include these notebooks to show you two cool examples of what can be done with deep neural networks.


Deep learning for computer vision (residual networks)

The goal is to introduce more advanced architectures and concepts. This is based on the Keras documentation: CIFAR-10 ResNet.

The relevant research papers are:

Notebooks

I have made several changes to the code from the Keras documentation. In the above notebook, I had to change the number of epochs and the learning rate schedule because the model is trained on only 40k images and validated on 10k, whereas the model in the Keras documentation is trained on all 50k images and not validated at all. I wanted a setup similar to the one in HW 2 so that we can better compare the performance of the ResNet and the (normal) CNN.


Visualizing high-dimensional data using t-SNE


Text

Character-based

Word-based

  • Word embeddings
  • Using 1D convnets (TO DO)
  • Word embeddings (TO DO: change notebook !!!)
  • Newsgroup classification with convolutional model using pretrained Glove embeddings (TO DO)
  • IMDB sentiment classification with LSTM model (TO DO)
  • ...

One-shot learning


Variational Autoencoder


Sequence-to-sequence models


Tools, additional materials