This repository consists of:
- vision.datasets : Data loaders for popular vision datasets
- vision.models : Definitions for popular model architectures, such as AlexNet, VGG, and ResNet, and pre-trained models
- vision.transforms : Common image transformations, such as random crop, rotations, etc.
- vision.utils : Useful utilities, such as saving a tensor (3 x H x W) as an image to disk, or making a grid of images from a mini-batch, etc.
Anaconda:

```
conda install torchvision -c soumith
```

pip:

```
pip install torchvision
```

From source:

```
python setup.py install
```
The following dataset loaders are available:
- MNIST
- COCO (Captioning and Detection)
- LSUN Classification
- ImageFolder
- Imagenet-12
- CIFAR10 and CIFAR100
All datasets have the API:

- `__getitem__`
- `__len__`

They all subclass `torch.utils.data.Dataset`. Hence, they can all be loaded in parallel across multiple workers (Python multiprocessing) using a standard `torch.utils.data.DataLoader`. For example:

```python
torch.utils.data.DataLoader(coco_cap, batch_size=args.batchSize,
                            shuffle=True, num_workers=args.nThreads)
```
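A minimal iteration sketch, assuming `coco_cap` is a `CocoCaptions` dataset constructed as in the COCO section below, with hypothetical values standing in for `args.batchSize` and `args.nThreads`. Note that batching requires all samples to have the same size, e.g. via a fixed-size crop transform:

```python
import torch.utils.data

loader = torch.utils.data.DataLoader(coco_cap, batch_size=16,
                                     shuffle=True, num_workers=4)

for images, captions in loader:
    # images: a 16 x 3 x H x W FloatTensor when a ToTensor() transform
    # is used and all images share the same H x W
    break
```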
In the constructor, each dataset has a slightly different API as needed, but they all take the keyword args:

- `transform` - a function that takes in an image and returns a transformed version. Common transforms such as `ToTensor`, `RandomCrop`, etc. can be composed together with `transforms.Compose` (see the transforms section below)
- `target_transform` - a function that takes in the target and transforms it. For example, take in the caption string and return a tensor of word indices (see the sketch below)
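As a sketch of the latter, a hypothetical `target_transform` that maps a caption string to a tensor of word indices (the toy vocabulary here is illustrative and not part of torchvision):

```python
import torch

# Hypothetical vocabulary; a real one would be built from the training captions.
vocab = {'a': 0, 'plane': 1, 'flying': 2, 'over': 3, 'mountain': 4}

def caption_to_indices(caption):
    # Unknown words map to a shared out-of-vocabulary index.
    return torch.LongTensor([vocab.get(w, len(vocab))
                             for w in caption.lower().split()])
```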
```python
dset.MNIST(root, train=True, transform=None, target_transform=None, download=False)
```

- `root`: root directory of dataset where `processed/training.pt` and `processed/test.pt` exist
- `train`: `True` - use the training set, `False` - use the test set
- `transform`: transform to apply to input images
- `target_transform`: transform to apply to targets (class labels)
- `download`: whether to download the MNIST data from the internet
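A minimal usage sketch (the `data/` root directory below is hypothetical):

```python
import torchvision.datasets as dset
import torchvision.transforms as transforms

mnist = dset.MNIST(root='data/', train=True,
                   transform=transforms.ToTensor(), download=True)

print(len(mnist))      # 60000 training samples
img, label = mnist[0]  # img: a 1 x 28 x 28 FloatTensor in [0, 1]
```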
This requires the COCO API to be installed.

```python
dset.CocoCaptions(root="dir where images are", annFile="json annotation file", [transform, target_transform])
```
Example:

```python
import torchvision.datasets as dset
import torchvision.transforms as transforms

cap = dset.CocoCaptions(root='dir where images are',
                        annFile='json annotation file',
                        transform=transforms.ToTensor())

print('Number of samples: ', len(cap))
img, target = cap[3]  # load 4th sample

print("Image Size: ", img.size())
print(target)
```
Output:

```
Number of samples: 82783
Image Size: (3L, 427L, 640L)
[u'A plane emitting smoke stream flying over a mountain.',
 u'A plane darts across a bright blue sky behind a mountain covered in snow',
 u'A plane leaves a contrail above the snowy mountain top.',
 u'A mountain that has a plane flying overheard in the distance.',
 u'A mountain view with a plume of smoke in the background']
```
```python
dset.CocoDetection(root="dir where images are", annFile="json annotation file", [transform, target_transform])
```
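Usage mirrors `CocoCaptions`, but here the target is the set of annotations for the image; a sketch (paths are placeholders):

```python
import torchvision.datasets as dset

det = dset.CocoDetection(root='dir where images are',
                         annFile='json annotation file')

img, anns = det[0]  # anns: a list of annotation dicts for this image
```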
```python
dset.LSUN(db_path, classes='train', [transform, target_transform])
```

- `db_path`: root directory for the database files
- `classes`:
  - `'train'` - all categories, training set
  - `'val'` - all categories, validation set
  - `'test'` - all categories, test set
  - [`'bedroom_train'`, `'church_train'`, ...] - a list of categories to load
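A usage sketch, assuming the LSUN lmdb files have been downloaded under a hypothetical `lsun/` directory:

```python
import torchvision.datasets as dset
import torchvision.transforms as transforms

lsun = dset.LSUN(db_path='lsun/', classes=['bedroom_train'],
                 transform=transforms.ToTensor())

img, class_idx = lsun[0]  # class_idx indexes the requested categories
```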
```python
dset.CIFAR10(root, train=True, transform=None, target_transform=None, download=False)
dset.CIFAR100(root, train=True, transform=None, target_transform=None, download=False)
```

- `root`: root directory of dataset where the folder `cifar-10-batches-py` exists
- `train`: `True` = training set, `False` = test set
- `download`: `True` = downloads the dataset from the internet and puts it in the `root` directory; if the dataset is already downloaded, it does nothing
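For example (the `data/` root directory is hypothetical):

```python
import torchvision.datasets as dset
import torchvision.transforms as transforms

cifar = dset.CIFAR10(root='data/', train=True,
                     transform=transforms.ToTensor(), download=True)

img, label = cifar[0]  # img: a 3 x 32 x 32 FloatTensor, label in 0..9
```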
A generic data loader where the images are arranged in this way:

```
root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png
```

```python
dset.ImageFolder(root="root folder path", [transform, target_transform])
```

It has the members:

- `self.classes` - the class names as a list
- `self.class_to_idx` - corresponding class indices
- `self.imgs` - the list of (image path, class-index) tuples
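A minimal sketch against the layout above:

```python
import torchvision.datasets as dset
import torchvision.transforms as transforms

data = dset.ImageFolder(root='root folder path',
                        transform=transforms.ToTensor())

print(data.classes)       # e.g. ['cat', 'dog']
img, class_idx = data[0]  # class_idx indexes into data.classes
```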
This is simply implemented with an `ImageFolder` dataset. The data is preprocessed as described here.
The models subpackage contains definitions for the following model architectures:
- AlexNet: AlexNet variant from the "One weird trick" paper.
- VGG: VGG-11, VGG-13, VGG-16, VGG-19 (with and without batch normalization)
- ResNet: ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152
You can construct a model with random weights by calling its constructor:

```python
import torchvision.models as models

resnet18 = models.resnet18()
alexnet = models.alexnet()
```
We provide pre-trained models for the ResNet variants and AlexNet, using the PyTorch model zoo. These can be constructed by passing `pretrained=True`:

```python
import torchvision.models as models

resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
```
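A sketch of running a pre-trained network on a dummy input; the 224 x 224 input size is the standard one for these models, and `.eval()` disables training-specific layers such as dropout:

```python
import torch
from torch.autograd import Variable
import torchvision.models as models

resnet18 = models.resnet18(pretrained=True)
resnet18.eval()  # inference mode

x = Variable(torch.randn(1, 3, 224, 224))  # dummy mini-batch of one image
out = resnet18(x)
print(out.size())  # 1 x 1000 class scores
```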
Transforms are common image transforms. They can be chained together using `transforms.Compose`. For example:

```python
transform = transforms.Compose([
    transforms.RandomSizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
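Applying the composed transform to a PIL image then yields a normalized 3 x 224 x 224 tensor; a sketch (the image path is a placeholder):

```python
from PIL import Image

img = Image.open('path/to/image.jpg').convert('RGB')
tensor = transform(img)  # 3 x 224 x 224 normalized FloatTensor
```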
`Scale(size, interpolation=Image.BILINEAR)`

Rescales the input PIL.Image to the given `size`, where `size` is the size of the smaller edge. For example, if height > width, then the image will be rescaled to (size * height / width, size).

- size: size of the smaller edge
- interpolation: Default: PIL.Image.BILINEAR
`CenterCrop(size)`

Crops the given PIL.Image at the center to have a region of the given size. `size` can be a tuple (target_height, target_width) or an integer, in which case the target will be of a square shape (size, size).
`RandomCrop(size, padding=0)`

Crops the given PIL.Image at a random location to have a region of the given size. `size` can be a tuple (target_height, target_width) or an integer, in which case the target will be of a square shape (size, size). If `padding` is non-zero, then the image is first zero-padded on each side with `padding` pixels.
`RandomHorizontalFlip()`

Randomly horizontally flips the given PIL.Image with a probability of 0.5.
`RandomSizedCrop(size, interpolation=Image.BILINEAR)`

Random crops the given PIL.Image to a random size of (0.08 to 1.0) of the original size and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio. This is popularly used to train the Inception networks.

- size: size of the smaller edge
- interpolation: Default: PIL.Image.BILINEAR
`Pad(padding, fill=0)`

Pads the given image on each side with `padding` number of pixels, and the padding pixels are filled with pixel value `fill`. If a 5x5 image is padded with `padding=1`, then it becomes 7x7.
`Normalize(mean, std)`

Given mean: (R, G, B) and std: (R, G, B), will normalize each channel of the torch.*Tensor, i.e. channel = (channel - mean) / std.
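Concretely, with the ImageNet statistics used in the `Compose` example above, a red-channel value of 0.485 maps to exactly 0; a sketch:

```python
import torch
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

t = torch.ones(3, 4, 4) * 0.485  # constant dummy image
out = normalize(t)
print(out[0, 0, 0])              # (0.485 - 0.485) / 0.229 = 0.0
```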
- `ToTensor()` - Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0]
- `ToPILImage()` - Converts a torch.*Tensor of range [0, 1] and shape C x H x W, or a numpy ndarray of dtype=uint8, range [0, 255] and shape H x W x C, to a PIL.Image of range [0, 255]
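A round-trip sketch between the two:

```python
import torch
import torchvision.transforms as transforms

t = torch.rand(3, 64, 64)          # C x H x W FloatTensor in [0, 1]
pil = transforms.ToPILImage()(t)   # PIL.Image in [0, 255]
back = transforms.ToTensor()(pil)  # back to a FloatTensor in [0, 1]
```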
`Lambda(lambda)`

Given a Python lambda, applies it to the input `img` and returns it. For example:

```python
transforms.Lambda(lambda x: x.add(10))
```
`make_grid(tensor, nrow=8, padding=2)`

Given a 4D mini-batch Tensor of shape (B x C x H x W), makes a grid of images.
`save_image(tensor, filename, nrow=8, padding=2)`

Saves a given Tensor into an image file. If given a mini-batch tensor, saves the tensor as a grid of images.
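A sketch combining the two (the filename is a placeholder):

```python
import torch
import torchvision.utils as vutils

batch = torch.rand(16, 3, 32, 32)        # dummy mini-batch of images
grid = vutils.make_grid(batch, nrow=4)   # a single 3 x H' x W' image grid
vutils.save_image(batch, 'samples.png')  # saves the batch as a grid of images
```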