Model Zoo

Trained models are posted here as links to GitHub Gists. Check out the model zoo documentation for details.

To acquire a model:

Download the model gist with ./scripts/download_model_from_gist.sh <gist_id> <dirname> to fetch the model metadata, architecture, solver configuration, and so on (<dirname> is optional and defaults to caffe/models).
Download the model weights with ./scripts/download_model_binary.py <model_dir>, where <model_dir> is the gist directory from the first step.
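
The two steps above can be scripted. The following is a minimal sketch, assuming it runs from the root of a Caffe checkout where both scripts are executable. The function names and the caffe_root parameter are hypothetical, and model_dir is passed in explicitly because the directory name created by the first script is not spelled out here.

```python
import subprocess
from pathlib import Path


def build_download_commands(gist_id, model_dir, dirname=None):
    """Return the two command lines from the steps above as argv lists."""
    gist_cmd = ["./scripts/download_model_from_gist.sh", gist_id]
    if dirname is not None:
        # <dirname> is optional; when omitted the script uses its default.
        gist_cmd.append(str(dirname))
    # model_dir is the gist directory produced by the first step (supplied by the caller).
    binary_cmd = ["./scripts/download_model_binary.py", str(model_dir)]
    return [gist_cmd, binary_cmd]


def download_model(gist_id, model_dir, dirname=None, caffe_root="."):
    """Run both steps in order from the Caffe checkout at caffe_root (hypothetical helper)."""
    for cmd in build_download_commands(gist_id, model_dir, dirname):
        # check=True raises CalledProcessError and stops at the first failing step.
        subprocess.run(cmd, cwd=str(Path(caffe_root)), check=True)
```

Calling build_download_commands first makes it easy to inspect the exact commands before anything is downloaded.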

Berkeley-trained models

Finetuning on Flickr Style: same as provided in models/, but listed here as an example Gist.
BVLC GoogleNet

Network in Network model

The Network in Network model is described in the following ICLR-2014 paper:

Network In Network
M. Lin, Q. Chen, S. Yan
International Conference on Learning Representations, 2014 (arXiv:1312.4400)

Please cite the paper if you use the models.

Models:

NIN-Imagenet: a small (29MB) model for ImageNet that performs slightly better than AlexNet and is fast to train.
NIN-CIFAR10: NIN model on CIFAR10, originally published in the paper Network In Network. The error rate of this model is 10.4% on CIFAR10.

Models from the BMVC-2014 paper "Return of the Devil in the Details: Delving Deep into Convolutional Nets"

The models are trained on the ILSVRC-2012 dataset. The details can be found on the project page or in the following BMVC-2014 paper:

Return of the Devil in the Details: Delving Deep into Convolutional Nets
K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman
British Machine Vision Conference, 2014 (arXiv:1405.3531)

Please cite the paper if you use the models.

Models:

VGG_CNN_S: 13.1% top-5 error on ILSVRC-2012-val
VGG_CNN_M: 13.7% top-5 error on ILSVRC-2012-val
VGG_CNN_M_2048: 13.5% top-5 error on ILSVRC-2012-val
VGG_CNN_M_1024: 13.7% top-5 error on ILSVRC-2012-val
VGG_CNN_M_128: 15.6% top-5 error on ILSVRC-2012-val
VGG_CNN_F: 16.7% top-5 error on ILSVRC-2012-val
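
For readers unfamiliar with the metric: a prediction counts as correct under top-5 error when the true label is among the five highest-scoring classes. The sketch below illustrates that definition on plain Python lists. The function name and the toy inputs are illustrative only and are not part of the released models or their evaluation code.

```python
def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the 5 top-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the five highest scores for this example.
        top5 = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:5]
        if label not in top5:
            misses += 1
    return misses / len(labels)
```

Here scores holds one list of per-class scores per image, and labels holds the matching ground-truth class indices.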

Models used by the VGG team in ILSVRC-2014

The models are the improved versions of the models used by the VGG team in the ILSVRC-2014 competition. The details can be found on the project page or in the following arXiv paper:

Very Deep Convolutional Networks for Large-Scale Image Recognition
K. Simonyan, A. Zisserman
arXiv:1409.1556

Please cite the paper if you use the models.

Models:

16-layer: 7.5% top-5 error on ILSVRC-2012-val, 7.4% top-5 error on ILSVRC-2012-test
19-layer: 7.5% top-5 error on ILSVRC-2012-val, 7.3% top-5 error on ILSVRC-2012-test

In the paper, the models are denoted as configurations D and E, trained with scale jittering. The combination of the two models achieves 7.1% top-5 error on ILSVRC-2012-val, and 7.0% top-5 error on ILSVRC-2012-test.

Places-CNN model from MIT.

Places CNN is described in the following NIPS 2014 paper:

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva
Learning Deep Features for Scene Recognition using Places Database.
Advances in Neural Information Processing Systems 27 (NIPS) spotlight, 2014.

See the project page for more information.

Models:

Places205-CNN: CNN trained on 205 scene categories of Places Database (used in NIPS'14) with ~2.5 million images. The architecture is the same as Caffe reference network.
Hybrid-CNN: CNN trained on 1183 categories (205 scene categories from Places Database and 978 object categories from the train data of ILSVRC2012 (ImageNet)) with ~3.6 million images. The architecture is the same as Caffe reference network.

GoogLeNet GPU implementation from Princeton.

We implemented GoogLeNet using a single GPU. Our main contributions are an effective way to initialize the network and a trick to overcome the GPU memory constraint by accumulating gradients over two training iterations.

Please check http://vision.princeton.edu/pvt/GoogLeNet/ for more information. Pre-trained models on ImageNet and Places, and the training code are available for download.
Make sure cls2_fc2 and cls3_fc have num_output = 1000 in the prototxt. Otherwise, the trained model will crash at test time.
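
The num_output check can be automated before running the model. The sketch below is a rough text scan rather than a real protobuf parser: it assumes each layer's name: field appears before its num_output: field in the prototxt, and the function names are hypothetical.

```python
import re


def find_num_output(prototxt_text, layer_name):
    """Return the first num_output that follows the named layer, or None.

    Plain text scan, not a protobuf parser: assumes the layer's `name:`
    field appears before its `num_output:` field.
    """
    name_match = re.search(r'name:\s*"%s"' % re.escape(layer_name), prototxt_text)
    if name_match is None:
        return None
    value_match = re.search(r"num_output:\s*(\d+)", prototxt_text[name_match.end():])
    return int(value_match.group(1)) if value_match else None


def check_classifier_outputs(prototxt_text, layer_names=("cls2_fc2", "cls3_fc"), expected=1000):
    """Map each layer name to True when its num_output equals `expected`."""
    return {name: find_num_output(prototxt_text, name) == expected
            for name in layer_names}
```

Any layer mapped to False (or missing from the file) should be fixed in the prototxt before testing.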

Fully Convolutional Semantic Segmentation Models (FCN-Xs)

These models are described in the paper:

Fully Convolutional Networks for Semantic Segmentation
Jonathan Long, Evan Shelhamer, Trevor Darrell
arXiv:1411.4038

These are pre-release models. They do not run in any current version of BVLC/caffe, as they require unmerged PRs. They should run in the preview branch provided at https://github.com/longjon/caffe/tree/future.

Models trained on PASCAL (using extra data from Hariharan et al. and finetuned from the ILSVRC-trained VGG-16 model above):

FCN-32s PASCAL: single stream, 32 pixel prediction stride version
FCN-16s PASCAL: two stream, 16 pixel prediction stride version
FCN-8s PASCAL: three stream, 8 pixel prediction stride version

Models trained on SIFT Flow (also finetuned from VGG-16):

FCN-16s SIFT Flow: two stream, 16 pixel prediction stride version

Models trained on NYUDv2 (also finetuned from VGG-16, and using HHA features from Gupta et al. https://github.com/s-gupta/rcnn-depth):

FCN-32s NYUDv2: single stream, 32 pixel prediction stride version
FCN-16s NYUDv2: two stream, 16 pixel prediction stride version

Models trained on PASCAL-Context including training model definition, solver configuration, and barebones solving script (finetuned from the ILSVRC-trained VGG-16 model):

FCN-32s PASCAL-Context: single stream, 32 pixel prediction stride version
FCN-16s PASCAL-Context: two stream, 16 pixel prediction stride version
FCN-8s PASCAL-Context: three stream, 8 pixel prediction stride version

CaffeNet fine-tuned for Oxford flowers dataset

https://gist.github.com/jgoode21/0179e52305ca768a601f

This is the reference CaffeNet (modified AlexNet) fine-tuned for the Oxford 102 category flower dataset. The number of outputs in the inner product layer has been set to 102 to reflect the number of flower categories. Hyperparameter choices reflect those in Fine-tuning CaffeNet for Style Recognition on “Flickr Style” Data. The global learning rate is reduced, while the learning rate for the final fully connected layer is increased relative to the other layers.

After 50,000 iterations, the top-1 error is about 7% on the test set of 1,020 images (accuracy = 0.9326 in the log below).

I0215 15:28:06.417726  6585 solver.cpp:246] Iteration 50000, loss = 0.000120038
I0215 15:28:06.417789  6585 solver.cpp:264] Iteration 50000, Testing net (#0)
I0215 15:28:30.834987  6585 solver.cpp:315]     Test net output #0: accuracy = 0.9326
I0215 15:28:30.835072  6585 solver.cpp:251] Optimization Done.
I0215 15:28:30.835083  6585 caffe.cpp:121] Optimization Done.
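
The error figure follows directly from the accuracy line in the log. As a small worked example, the sketch below pulls the accuracy out of that log line and converts it to a top-1 error and an approximate count of misclassified images. The helper name is hypothetical, and the regular expression assumes the "accuracy = <value>" wording shown in the log.

```python
import re

LOG_LINE = ("I0215 15:28:30.834987  6585 solver.cpp:315]     "
            "Test net output #0: accuracy = 0.9326")


def accuracy_from_log(line):
    """Extract the value after 'accuracy =' from a solver log line."""
    match = re.search(r"accuracy = ([0-9.]+)", line)
    if match is None:
        raise ValueError("no accuracy value in line")
    return float(match.group(1))


accuracy = accuracy_from_log(LOG_LINE)
top1_error = 1.0 - accuracy               # 0.0674, i.e. about 7%
misclassified = round(top1_error * 1020)  # about 69 of the 1,020 test images
```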

CNN Models for Salient Object Subitizing.

CNN models described in the following CVPR'15 paper "Salient Object Subitizing":

Salient Object Subitizing
J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mech.
CVPR, 2015.

Models:

AlexNet: CNN model finetuned on the Salient Object Subitizing dataset (~5500 images). The architecture is the same as the Caffe reference network.
VGG16: CNN model finetuned on the Salient Object Subitizing dataset (~5500 images). The architecture is the same as the VGG16 network. This model gives better performance than the AlexNet model, but is slower for training and testing.

Model from the CVPR2015 DeepVision workshop paper "Deep Learning of Binary Hash Codes for Fast Image Retrieval"

This model generates compact binary codes for fast image retrieval. The details can be found in the following CVPRW'15 paper:

Deep Learning of Binary Hash Codes for Fast Image Retrieval
K. Lin, H.-F. Yang, J.-H. Hsiao, C.-S. Chen
CVPR 2015, DeepVision workshop

Please cite the paper if you use the model.

CIFAR10-48bit: Proposed CNN model with 48 nodes latent layer on CIFAR10. The error rate of this model is 10.6% on CIFAR10.
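
To illustrate why compact binary codes make retrieval fast, here is a minimal sketch that binarizes latent-layer activations and ranks database items by Hamming distance. It assumes activations in [0, 1] and a threshold of 0.5; the threshold, the function names, and the toy 4-bit codes in the usage note are illustrative assumptions, not taken from the released model.

```python
def binarize(activations, threshold=0.5):
    """Turn latent-layer activations into a tuple of 0/1 bits (threshold is an assumption)."""
    return tuple(1 if a >= threshold else 0 for a in activations)


def hamming_distance(code_a, code_b):
    """Number of bit positions where the two codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest code."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming_distance(query_code, database_codes[i]))
```

For example, binarize([0.9, 0.1, 0.8, 0.2]) gives (1, 0, 1, 0), which can then be passed to rank_by_hamming together with the codes of the database images. With the 48-bit model the same functions apply to 48-element activation vectors.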

Places_CNDS_models on Scene Recognition

Places-CNDS-8 is an "8conv3fc layer" deep convolutional neural network model trained on the MIT Places Dataset with deep supervision.

The details of training this model are described in the following report. Please cite this work if the model is useful for you.

Training Deeper Convolutional Networks with Deep Supervision
L. Wang, C. Lee, Z. Tu, S. Lazebnik, arXiv:1505.02496, 2015
