Jan 2019
tl;dr: Use a non-local module to capture long-range dependencies (spatial, temporal, or spacetime).
The non-local module is modular and does not change the input shape, so it can be plugged into existing network designs easily. This is similar to the design of SENet. The authors show that the self-attention proposed in Google's Transformer paper is a special case (embedded Gaussian) of non-local networks.
- Both CNN and RNN blocks are local operations; they model long-range dependencies only progressively, through repeated local operations.
- The formulation of the non-local operation is as follows, where the output feature $z$ has the same shape as the input feature $x$ to ensure a modular design:

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j), \qquad z_i = W_z y_i + x_i$$

$\hat{x}_j$ can be the original or subsampled $x_j$, or some other features (as in the long-term feature bank).
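As a toy illustration of this operation (my own minimal example, not the paper's code), here is a purely functional NumPy sketch using the plain Gaussian similarity $f(x_i, x_j) = e^{x_i^\top x_j}$ and the identity for $g$:

```python
import numpy as np

def non_local(x):
    """x: (L, C) array, one row per spacetime position."""
    f = np.exp(x @ x.T)                          # pairwise similarity f(x_i, x_j), Gaussian version
    y = (f / f.sum(axis=1, keepdims=True)) @ x   # normalize by C(x), aggregate with g(x_j) = x_j
    return y                                     # same shape as x, ready for a residual sum
```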
- The non-local operation is different from a fully connected (fc) layer. Non-local is more flexible: the output size matches the input size, so it can be inserted anywhere inside a network while preserving spacetime information.
- The embedded Gaussian version of non-local is the self-attention module.
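Concretely, with embeddings $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$, the embedded-Gaussian similarity together with the normalization $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$ is just a softmax over $j$, which gives the familiar self-attention form:

$$f(x_i, x_j) = e^{\theta(x_i)^\top \phi(x_j)} \;\Longrightarrow\; y = \mathrm{softmax}\!\left(x^\top W_\theta^\top W_\phi\, x\right) g(x)$$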
- Used alone, non-local blocks + 2D convolutions outperform their 3D convolution counterparts.
- The non-local module can also improve static image recognition tasks, such as detection with Mask R-CNN on COCO.
- The instantiation of the non-local net can take many forms, but the most common/generic form is as follows.
$F(\cdot)$ can be an exponential (Gaussian) or the identity function applied to the embedded dot product $\theta(x_i)^\top \phi(x_j)$, leading to the embedded Gaussian or dot-product instantiations. The Gaussian version is therefore the dot product plus a softmax (once the normalization $\mathcal{C}(x)$ is folded in).
- There is a batch-norm layer before the residual sum with the original $x$; its scale is initialized to zero so that the block starts as an identity mapping, ensuring a smooth fine-tuning process.
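Below is a minimal PyTorch-style sketch of an embedded-Gaussian non-local block that follows the bullets above (channel halving, softmax attention, zero-initialized BN before the residual). The class and variable names are mine, and it uses linear layers on flattened positions rather than the paper's $1 \times 1 \times 1$ convolutions, so treat it as an illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block for inputs shaped (N, C, *spatial)."""

    def __init__(self, channels):
        super().__init__()
        inter = channels // 2                                 # halve the number of channels
        self.theta = nn.Linear(channels, inter, bias=False)   # embedding for x_i
        self.phi = nn.Linear(channels, inter, bias=False)     # embedding for x_j
        self.g = nn.Linear(channels, inter, bias=False)       # value transform g(x_j)
        self.w_z = nn.Linear(inter, channels, bias=False)     # restore the channel count
        self.bn = nn.BatchNorm1d(channels)                    # BN before the residual sum
        nn.init.zeros_(self.bn.weight)                        # zero gamma: block starts as identity
        nn.init.zeros_(self.bn.bias)

    def forward(self, x):
        n, c, *spatial = x.shape
        flat = x.flatten(2).transpose(1, 2)                   # (N, L, C), L = all spacetime positions
        q, k, v = self.theta(flat), self.phi(flat), self.g(flat)
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1)       # embedded Gaussian = dot product + softmax
        y = self.w_z(attn @ v).transpose(1, 2)                # (N, C, L)
        z = self.bn(y) + x.flatten(2)                         # residual sum with the original x
        return z.reshape(n, c, *spatial)
```

For video features $L = T \times H \times W$; for image features $L = H \times W$, so the same block works in both settings.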
- Input feature $x$ is first linearly embedded to reduce the number of channels by half. Pooled features $x_j$ are used to augment input features $x_i$.
- In Mask R-CNN, the non-local block can be added to the backbone (e.g., right before the last res block of res4), to the head (e.g., after every 2 layers for keypoint detection), or both.
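For instance, a hypothetical way to place one block before the last residual block of a torchvision ResNet-50's `layer3` (res4 in the paper's numbering), reusing the `NonLocalBlock` sketch above; the actual Mask R-CNN wiring is omitted:

```python
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)
blocks = list(backbone.layer3.children())
blocks.insert(len(blocks) - 1, NonLocalBlock(channels=1024))  # layer3/res4 outputs 1024 channels
backbone.layer3 = nn.Sequential(*blocks)
```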
- Use the non-local module in Mask R-CNN to boost performance.
- The paper introduced a baseline variant of convnet called C2D, which applies 2D convolution weights directly on 3D (video) data, frame by frame. Only parameter-free pooling operations are used to capture dependencies along the temporal (3rd) dimension.
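A rough sketch of that idea, under my own assumptions about shapes and layer sizes (not the paper's exact C2D architecture): fold the time axis into the batch, apply a 2D convolution per frame, then mix time only with parameter-free pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # ordinary 2D weights

def c2d_stem(video):
    """video: (N, C, T, H, W) -> features with the same layout."""
    n, c, t, h, w = video.shape
    frames = video.transpose(1, 2).reshape(n * t, c, h, w)      # fold time into the batch
    feat = conv2d(frames)                                        # 2D conv applied to each frame
    feat = feat.reshape(n, t, *feat.shape[1:]).transpose(1, 2)   # back to (N, C', T, H', W')
    return F.max_pool3d(feat, kernel_size=(3, 1, 1),
                        stride=(2, 1, 1), padding=(1, 0, 0))     # parameter-free pooling along time
```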