:mod:`torch.cuda` keeps track of currently selected GPU, and all CUDA tensors you allocate will be created on it. The selected device can be changed with a :any:`torch.cuda.device` context manager.
However, once a tensor is allocated, you can do operations on it irrespectively of your selected device, and the results will be always placed in on the same device as the tensor.
Cross-GPU operations are not allowed by default, with the only exception of :meth:`~torch.Tensor.copy_`. Unless you enable peer-to-peer memory accesses any attempts to launch ops on tensors spread across different devices will raise an error.
Below you can find a small example showcasing this:
x = torch.cuda.FloatTensor(1)
# x.get_device() == 0
y = torch.FloatTensor(1).cuda()
# y.get_device() == 0
with torch.cuda.device(1):
# allocates a tensor on GPU 1
a = torch.cuda.FloatTensor(1)
# transfers a tensor from CPU to GPU 1
b = torch.FloatTensor(1).cuda()
# a.get_device() == b.get_device() == 1
c = a + b
# c.get_device() == 1
z = x + y
# z.get_device() == 0
# even within a context, you can give a GPU id to the .cuda call
d = torch.randn(2).cuda(2)
# d.get_device() == 2
Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a :meth:`~torch.Tensor.pin_memory` method, that returns a copy of the object, with data put in a pinned region.
Also, once you pin a tensor or storage, you can use asynchronous GPU copies.
Just pass an additional async=True
argument to a :meth:`~torch.Tensor.cuda`
call. This can be used to overlap data transfers with computation.
You can make the :class:`~torch.utils.data.DataLoader` return batches placed in
pinned memory by passing pin_memory=True
to its constructor.
Most use cases involving batched input and multiple GPUs should default to using :class:`~torch.nn.DataParallel` to utilize more than one GPU. Even with the GIL, a single python process can saturate multiple GPUs.
As of version 0.1.9, large numbers of GPUs (8+) might not be fully utilized. However, this is a known issue that is under active development. As always, test your use case.
There are significant caveats to using CUDA models with :mod:`~torch.multiprocessing`; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.