- Visual Learning and Recognition (16-824) Spring 2018
- Created By: Rohit Girdhar
- TAs: Lerrel Pinto, Senthil Purushwalkam, Nadine Chang and Rohit Girdhar
- Please post questions, if any, on Piazza for HW1.
- Total points: 100
In this assignment, we will learn to train multi-label image classification models using the TensorFlow (TF) framework. We will classify images from the PASCAL 2007 dataset into the objects present in the image. Your task in this assignment is to fill in the parts of the code, as described in this document, perform all the experiments, and submit a report with your results and analyses. You are free to use any TensorFlow built-in high-level APIs, such as tf.contrib.slim or tf.contrib.keras, as long as you follow the code structure we define in the steps of this assignment. Feel free to google how to do certain things if you get stuck, but provide proper attribution. It is not acceptable to google "alexnet for PASCAL classification in tensorflow" and copy-paste that code, as that would probably not follow the code structure we define in the assignment.
In all the following tasks (coding and analysis), please write a short summary in the report of what you tried, what worked (or didn't), and what you learned. Write the code into the files as specified. Submit a zip file (ANDREWID.zip) with all the code files, and a single REPORT.pdf, which should include the commands the TAs can run to reproduce your results/visualizations, etc. Also mention any collaborators or other sources used for different parts of the assignment.
If you are using AWS instance setup using the provided instructions, you should already have most of the requirements installed on your machine. In any case, you would need the following python libraries installed:
- TensorFlow (1.3+)
- Numpy
- Pillow (PIL)
- sklearn (v0.19)
MNIST is a dataset containing handwritten digits from 0-9, formatted as 28x28px monochrome images. It has been widely used for debugging and experimenting with convolutional neural networks (CNNs). In this task, we will start by understanding the basic workings of the TensorFlow utilities provided for building CNNs. We will follow the official MNIST Layers tutorial. If you have trouble understanding this tutorial, or want more background, you can look at the MNIST and Deep MNIST tutorials. It is also recommended that you go through the Estimator tutorial, which covers the new TF high-level API.
For simplicity, I already provide the code from the MNIST tutorial in 00_mnist.py. Try running that code using python 00_mnist.py. It will start printing the loss per iteration. After 20000 iterations, it will run the trained model on the test data and print the classification accuracy. Go through the code and make sure you understand its different parts.
Q 0.3: Plot the training loss and validation accuracy curves. Show at least 100 points between 0 and 20000 iterations.
Hint: You will need to run mnist_classifier.train in a loop: train for a few iterations, run evaluate to get the current accuracy, and so on.
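Here is a minimal sketch of one way to do this, assuming the mnist_classifier, train_input_fn and eval_input_fn objects already defined in 00_mnist.py:

eval_every = 200  # evaluate every 200 iterations -> 100 points over 20000
loss_curve, acc_curve = [], []
for step in range(0, 20000, eval_every):
    # Train for a short burst, then measure validation accuracy.
    mnist_classifier.train(input_fn=train_input_fn, steps=eval_every)
    eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
    loss_curve.append((step + eval_every, eval_results['loss']))
    acc_curve.append((step + eval_every, eval_results['accuracy']))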
Yay! We can now read numbers from images :-) This technology, developed by LeCun and collaborators in the 1990s, became the basis of automated check and zip code processing.
Numbers are easy. Let's try to recognize some natural images. We start by modifying this code to read images from the PASCAL 2007 dataset. The following steps will guide you through the process.
We first need to download the image dataset and annotations. Use the following commands to set up the data; let's say it is stored at location $DATA_DIR.
$ # First, cd to a location where you want to store ~0.5GB of data.
$ wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
$ tar xf VOCtrainval_06-Nov-2007.tar
$ # Also download the test data
$ wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar && tar xf VOCtest_06-Nov-2007.tar
$ cd VOCdevkit/VOC2007/
$ export DATA_DIR=$(pwd)
The first step is to write a data loader to load this PASCAL data. Since there are only about 10K images in this dataset, we can simply load all the images into CPU memory, along with the labels. The important thing to note is that PASCAL can have multiple objects present in the same image. Hence, this is a multi-label classification problem, and will have to be tackled slightly differently.
We provide some starter code for this task in 01_pascal.py, created by slightly modifying our MNIST codebase. You need to fill in some of the functions, as outlined next.
Find the function definition for load_pascal. As the function docstring says, the function takes as input the $DATA_DIR and the split (train/val/trainval/test), and outputs all the images, labels and weights from the dataset. For N images in the split, the images should be an np.ndarray of shape NxHxWx3, and labels/weights should be Nx20. The labels should be 1 for each object that is present in the image, and weights should be 1 for each label in the image, except those labeled as ambiguous. All other values should be 0. For simplicity, resize all images to a canonical size (e.g., 256x256px).
Hint: The dataset contains an ImageSets/Main/ folder, with files named <class_name>_<split_name>.txt. Use those files to find the images that belong to the different splits of the data. Look at the devkit README to understand the structure and labeling.
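Below is a minimal sketch of one way load_pascal could look, assuming the devkit convention that the per-class split files mark each image with 1 (positive), -1 (negative) or 0 (ambiguous/difficult):

import os
import numpy as np
from PIL import Image

CLASS_NAMES = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
               'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
               'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
               'tvmonitor']

def load_pascal(data_dir, split='train', size=256):
    image_ids = None
    labels_per_class, weights_per_class = [], []
    for cls in CLASS_NAMES:
        split_file = os.path.join(data_dir, 'ImageSets', 'Main',
                                  '{}_{}.txt'.format(cls, split))
        ids, flags = [], []
        with open(split_file) as f:
            for line in f:
                img_id, flag = line.split()
                ids.append(img_id)
                flags.append(int(flag))
        image_ids = ids  # the same image ids appear in every per-class file
        flags = np.array(flags)
        labels_per_class.append((flags == 1).astype(np.int32))   # present
        weights_per_class.append((flags != 0).astype(np.int32))  # not ambiguous
    labels = np.stack(labels_per_class, axis=1)    # N x 20
    weights = np.stack(weights_per_class, axis=1)  # N x 20
    images = np.stack([
        np.array(Image.open(os.path.join(data_dir, 'JPEGImages',
                                         img_id + '.jpg'))
                 .convert('RGB').resize((size, size)))
        for img_id in image_ids])                  # N x size x size x 3
    return images.astype(np.float32), labels, weights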
Since the data is in numpy format, we use the tf.estimator.inputs.numpy_input_fn data loader, which takes care of constructing batches, shuffling, etc. This is already provided in 01_pascal.py.
Next we need to write the model function for PASCAL. I provide an empty function definition, but feel free to copy over from your MNIST code. We will be using the same model as for MNIST (bad idea, I know, but let's give it a shot). We need to take care of a couple of things:
- Now that our images are much larger (256px vs 28px before), you might need to change the size in the reshape layer before the fully connected (dense) layer.
- You will no longer need the tf.one_hot function, as you are already producing a one-hot-ish representation from load_pascal.
- Perhaps most importantly, we need to change the final non-linearity. A standard solution for multi-label classification problems is to consider each class as a separate binary classification problem, using tf.sigmoid as the activation and tf.losses.sigmoid_cross_entropy as the loss function. Use these to replace the softmax loss and activation we used for MNIST. A sketch of these two changes appears after this list.
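Here is a minimal sketch of the changed pieces inside cnn_model_fn (not the full function); the names logits, labels and weights are assumptions about your existing code, with labels/weights coming from load_pascal via the input_fn:

# Per-class probabilities for the PREDICT mode.
probabilities = tf.sigmoid(logits, name='sigmoid_tensor')
# Multi-label loss: one binary cross-entropy per class, with ambiguous
# labels masked out through the weights.
loss = tf.losses.sigmoid_cross_entropy(
    multi_class_labels=labels, logits=logits, weights=weights)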
With all this code in place, we should now be able to train the model. For now we will use the same training parameters as we used for MNIST. The other thing to figure out is the evaluation. A standard metric for multi-label evaluation is mean average precision (mAP). I already provide the code for evaluation; just make sure your model_fn can return an EstimatorSpec for the predict mode (it should return the probability for each class).
Q 1.3: Same as before, show the training loss and test accuracy (mAP) curves. Train for only 1000 iterations.
As you might have seen, the performance of our 2-layer model, which worked perfectly on MNIST, is pretty low for PASCAL. This is expected, as PASCAL is much more complex than MNIST, and we need a much beefier model to handle it. Copy over your code from 01_pascal.py to 02_pascal_alexnet.py, and let's implement a deep CNN.
In this task we will be constructing a variant of the AlexNet architecture, known as CaffeNet. If you are familiar with Caffe, a prototxt of the network is available here.
Here is the exact model we want to build. I use the following operator notation for the architecture:
- Convolution: A convolution with kernel size k, stride s, output channels n, and padding p is represented as conv(k, s, n, p).
- Max pooling: A max pool operation with kernel size k and stride s is represented as max_pool(k, s).
- Fully connected: A fully connected layer with n units is represented as fully_connected(n).
ARCHITECTURE:
-> image
-> conv(11, 4, 96, 'VALID')
-> relu()
-> max_pool(3, 2)
-> conv(5, 1, 256, 'SAME')
-> relu()
-> max_pool(3, 2)
-> conv(3, 1, 384, 'SAME')
-> relu()
-> conv(3, 1, 384, 'SAME')
-> relu()
-> conv(3, 1, 256, 'SAME')
-> max_pool(3, 2)
-> flatten()
-> fully_connected(4096)
-> relu()
-> dropout(0.5)
-> fully_connected(4096)
-> relu()
-> dropout(0.5)
-> fully_connected(20)
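As a reference, here is a minimal sketch of this trunk using tf.layers; input_layer (the NxHxWx3 image batch) and training (whether dropout should be active) are assumptions about how your cnn_model_fn is structured:

net = tf.layers.conv2d(input_layer, 96, 11, strides=4, padding='valid',
                       activation=tf.nn.relu)
net = tf.layers.max_pooling2d(net, 3, 2)
net = tf.layers.conv2d(net, 256, 5, strides=1, padding='same',
                       activation=tf.nn.relu)
net = tf.layers.max_pooling2d(net, 3, 2)
net = tf.layers.conv2d(net, 384, 3, strides=1, padding='same',
                       activation=tf.nn.relu)
net = tf.layers.conv2d(net, 384, 3, strides=1, padding='same',
                       activation=tf.nn.relu)
net = tf.layers.conv2d(net, 256, 3, strides=1, padding='same')
net = tf.layers.max_pooling2d(net, 3, 2)
net = tf.layers.flatten(net)  # tf.contrib.layers.flatten on older TF
net = tf.layers.dense(net, 4096, activation=tf.nn.relu)
net = tf.layers.dropout(net, rate=0.5, training=training)
net = tf.layers.dense(net, 4096, activation=tf.nn.relu)
net = tf.layers.dropout(net, rate=0.5, training=training)
logits = tf.layers.dense(net, 20)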
We would also need to modify the solver settings from what we used on MNIST.
- Change the optimizer to SGD + Momentum, with a momentum of 0.9.
- Initialize the conv and FC weights using a gaussian(0, 0.01) initializer, and biases using a zeros initializer. You may refer to the Caffe prototxt above for exact details.
- Use an exponentially decaying learning rate schedule that starts at 0.001 and decays by 0.5 every 10K iterations. You should train for at least 40K iterations at a batch size of 10, which should take about half an hour on the AWS nodes.
Q 2.2: Implement the above solver parameters. This should require small changes to our previous code.
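Here is a minimal sketch of these solver settings inside the TRAIN branch of cnn_model_fn; loss is an assumption about your existing code:

# Exponentially decaying learning rate: starts at 0.001, halved every 10K steps.
global_step = tf.train.get_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=0.001, global_step=global_step,
    decay_steps=10000, decay_rate=0.5, staircase=True)
# SGD with momentum of 0.9.
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)
train_op = optimizer.minimize(loss=loss, global_step=global_step)
# For the gaussian(0, 0.01) / zeros initialization, pass for example
# kernel_initializer=tf.random_normal_initializer(stddev=0.01) and
# bias_initializer=tf.zeros_initializer() to tf.layers.conv2d / tf.layers.dense.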
Since we are training a model from scratch on this small dataset, it is important to perform some basic data augmentation to avoid overfitting. Add random crops and left-right flips when training, and do a center crop when testing. Hint: Note that you can use ops such as tf.image.random_flip_left_right, tf.random_crop, etc. directly in your cnn_model_fn function, and do not need to perform this augmentation manually in the data loader. This is one of the strengths of TF: it allows you to specify all data pre-processing as part of the computation graph, and can optimally schedule all operations.
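For example, a minimal sketch inside cnn_model_fn could look like the following; the 224px crop size, and the input_layer and mode names, are assumptions:

if mode == tf.estimator.ModeKeys.TRAIN:
    # Random crop + random horizontal flip, applied image-by-image.
    input_layer = tf.map_fn(
        lambda img: tf.image.random_flip_left_right(
            tf.random_crop(img, [224, 224, 3])),
        input_layer)
else:
    # Deterministic center crop at test time.
    input_layer = tf.image.resize_image_with_crop_or_pad(input_layer, 224, 224)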
Hint: You may refer to a previous work, Krahenbuhl et al. (ICLR'16), for more tips on setting hyper-parameters for this task. Feel free to explore slight modifications of the architecture.
Hopefully we all got much better accuracy with the deeper model! Since 2012, many other deeper architectures have been proposed, and VGG-16 is one of the popular ones. In this task, we attempt to further improve the performance with the "very deep" VGG-16 architecture.
Q 3.1: Modify the network architecture from Task 2 to implement the VGG-16 architecture (refer to the original paper). Use the same hyperparameter settings from Task 2, and try to train the model. Add the train/test loss/accuracy curves into the report.
TensorFlow ships with an awesome visualization tool called TensorBoard. It can be used to visualize training losses, network weights and other parameters. Now that we're training a much deeper network, let's hook up TensorBoard to get a better understanding of these networks.
If you have been using the Estimator API, then your code is already storing logs for TensorBoard in the models_dir! You can visualize them by running tensorboard --logdir $MODEL_DIR --port 6006, and view the UI in a browser at <aws-public-ip>:6006. Try that out.
Q 3.2: The task in this section is to log the following entities: a) Training loss, b) Learning rate, c) Histograms of gradients, d) Training images and e) Network graph into tensorboard. Add screenshots from your tensorboard into the report.
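A minimal sketch of the extra summaries inside cnn_model_fn is shown below; the Estimator already logs the graph and the training loss, and names such as learning_rate, input_layer and optimizer are assumptions about your code:

tf.summary.scalar('learning_rate', learning_rate)
tf.summary.image('training_images', input_layer, max_outputs=3)
# Log a histogram of the gradient for every variable.
grads_and_vars = optimizer.compute_gradients(loss)
for grad, var in grads_and_vars:
    if grad is not None:
        tf.summary.histogram(var.op.name + '/gradient', grad)
train_op = optimizer.apply_gradients(
    grads_and_vars, global_step=tf.train.get_global_step())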
As we have already seen, deep networks can sometimes be hard to optimize, while at other times they heavily overfit small training sets. Many approaches have been proposed to counter this, e.g., Krahenbuhl et al. (ICLR'16) and other works we have seen in un-/self-supervised learning. However, the most effective approach remains pre-training the network on large, well-labeled datasets such as ImageNet. While training on the full ImageNet data is beyond the scope of this assignment, people have already trained many popular/standard models and released them online. In this task, we will initialize the VGG model from the previous task with pre-trained ImageNet weights, and finetune the network for PASCAL classification. You can download the pre-trained VGG-16 model for TensorFlow from here, and rename it to ./vgg_16.ckpt.
You might want to look at tf.train.SessionRunHook to load models into Estimators. Also, the pre-trained model uses the following namescope hierarchy:
vgg_16/
conv1/
conv1_1/
weights
biases
conv1_2/
... and so on
It would help if you define your VGG model using a similar naming hierarchy (look up tf.variable_scope), as that would let you load the model without converting variable names in the checkpoint to the ones in your code.
Q 4.1: Load the pre-trained weights up to the fc7 layer, and initialize the fc8 weights and biases from scratch. Then train the network as before and report the training/validation curves and the final performance on the PASCAL test set.
Use a similar hyper-parameter setup as in the scratch case; however, use 1/10th the learning rate, number of iterations, and learning rate step size.
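One possible way to restore the ImageNet weights with a SessionRunHook, assuming your variables follow the vgg_16/... naming above and the final layer lives under vgg_16/fc8 (so it is excluded and trained from scratch):

class RestorePretrainedHook(tf.train.SessionRunHook):
    def __init__(self, checkpoint_path):
        self._checkpoint_path = checkpoint_path

    def begin(self):
        # Restore everything except the freshly-initialized fc8 layer.
        variables_to_restore = [
            v for v in tf.global_variables()
            if v.name.startswith('vgg_16') and 'fc8' not in v.name]
        self._saver = tf.train.Saver(variables_to_restore)

    def after_create_session(self, session, coord):
        self._saver.restore(session, self._checkpoint_path)

# Pass hooks=[RestorePretrainedHook('./vgg_16.ckpt')] to estimator.train(...).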
By now we should have a good idea of training networks from scratch or from a pre-trained model, and of the relative performance in either scenario. Needless to say, the performance of these models is far stronger than that of the non-deep architectures we used until 2012. However, final performance is not the only metric we care about. It is important to get some intuition of what these models are really learning. Let's try some standard techniques.
Extract and compare the conv1 filters from CaffeNet in Task 2, at different stages of the training. Show at least 3 data points.
Pick 10 images from PASCAL test set from different classes, and compute the nearest neighbors of those images over the test set. You should use and compare the following feature representations for the nearest neighbors:
- pool5 features from the AlexNet (trained from scratch)
- fc7 features from the AlexNet (trained from scratch)
- pool5 features from the VGG (finetuned from ImageNet)
- fc7 features from VGG (finetuned from ImageNet)
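A minimal sketch of the nearest-neighbour lookup with scikit-learn, assuming features is an NxD array of pool5/fc7 features extracted over the test set (e.g., via estimator.predict) and query_idx indexes your 10 chosen images:

from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5).fit(features)
# indices[i] lists the test-set images closest to query image i
# (the query itself will appear first, at distance 0).
distances, indices = nn.kneighbors(features[query_idx])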
We can also visualize how the feature representations specialize for different classes. Take 1000 random images from the PASCAL test set, and extract fc7 features from those images. Compute a 2D tSNE projection of the features, and plot them with each feature color-coded by the GT class of the corresponding image. If multiple objects are present in an image, compute the color as the "mean" color of the different classes present in that image. Add a legend to the graph with the colors for each object class.
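A minimal sketch of this projection, assuming features is a 1000xD array of fc7 features, labels is the corresponding 1000x20 label matrix, and matplotlib is available for plotting:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

projected = TSNE(n_components=2).fit_transform(features)   # 1000 x 2
# One base color per class; each point gets the mean color of its active classes.
class_colors = plt.cm.hsv(np.linspace(0, 1, 21))[:20, :3]  # 20 x 3 (RGB)
point_colors = labels.dot(class_colors) / np.maximum(
    labels.sum(axis=1, keepdims=True), 1)
plt.scatter(projected[:, 0], projected[:, 1], c=point_colors)
plt.show()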
Show the per-class performance of your scratch (AlexNet) and pre-trained (VGG) models. Try to explain, by observing examples from the dataset, why some classes are harder or easier than others (consider the easiest and hardest classes). Do some classes see large gains due to pre-training? Can you explain why that might happen?
Many techniques have been proposed in the literature to improve classification performance for deep networks. In this section, we try to use a recently proposed technique called mixup. The main idea is to augment the training set with linear combinations of images and labels. Read through the paper and modify your model to implement mixup. Report your performance, along with training/test curves, and comparison with baseline in the report.
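A minimal sketch of mixup applied to a batch inside cnn_model_fn, assuming images is the NxHxWx3 batch, labels is the Nx20 label matrix, and alpha is the mixup hyper-parameter from the paper:

alpha = 0.2
# tf.contrib.distributions.Beta on older TF versions.
lam = tf.distributions.Beta(alpha, alpha).sample()   # one mixing coefficient
perm = tf.random_shuffle(tf.range(tf.shape(images)[0]))
# Mix each example with a randomly chosen partner from the same batch.
images = lam * images + (1.0 - lam) * tf.gather(images, perm)
labels = tf.cast(labels, tf.float32)
labels = lam * labels + (1.0 - lam) * tf.gather(labels, perm)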
Parts of the starter code are taken from official TensorFlow tutorials. Many thanks to the original authors!