This paper introduces the task of dense captioning and proposes a network architecture that processes an image and produces region descriptions in a single forward pass, and can be trained end-to-end. Main contributions:
- Dense captioning
    - Generalizes object detection (each caption is a single word) and image captioning (the region is the whole image).
- Fully convolutional localization network
    - Fully differentiable, can be trained jointly with the rest of the network.
    - Consists of a region proposal network and box regression (similar to Faster R-CNN), and bilinear interpolation (as in Spatial Transformer Networks) for sampling.
- Network details
    - Convolutional features are extracted from the image.
    - For each element in the feature map, k anchor boxes of different aspect ratios are placed in the input image space.
    - For each of these, the localization layer predicts box regression offsets and a confidence score (see the anchor/regression sketch after this list).
    - The region proposals are projected onto the convolutional feature map, and a sampling grid from the output feature map back to the input is computed (bilinear sampling; see the sampling sketch after this list).
    - The sampled region features are passed through an MLP to compute a representation (code) for each region.
    - These codes are passed (in a batch) as the first input to an LSTM (as in Show and Tell), which is trained to predict each word of the caption (see the captioning sketch after this list).
- A fully differentiable 'spatial attention' mechanism (bilinear interpolation) is used in place of the RoI pooling of Faster R-CNN.
    - RoI pooling is not differentiable with respect to the input proposal coordinates.
- Fast, and impressive qualitative results.
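
For concreteness, a minimal NumPy sketch of the anchor / box-regression step, assuming a hypothetical feature stride of 16 and made-up anchor scales and aspect ratios (not the paper's actual values):

```python
import numpy as np

STRIDE = 16                       # assumed stride of the conv feature map
SCALES = [64, 128, 256]           # hypothetical anchor sizes in input-image pixels
RATIOS = [0.5, 1.0, 2.0]          # hypothetical width/height aspect ratios

def anchors_for_cell(cx, cy):
    """Return the k = len(SCALES) * len(RATIOS) anchors (cx, cy, w, h) for one cell."""
    boxes = []
    for s in SCALES:
        for r in RATIOS:
            boxes.append((cx, cy, s * np.sqrt(r), s / np.sqrt(r)))
    return np.array(boxes)

def apply_offsets(anchors, t):
    """Faster R-CNN-style box regression: t holds (tx, ty, tw, th) per anchor."""
    cx = anchors[:, 0] + t[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + t[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(t[:, 2])
    h = anchors[:, 3] * np.exp(t[:, 3])
    return np.stack([cx, cy, w, h], axis=1)

# Feature-map cell (i, j) = (5, 7) maps back to a centre location in the input image.
anchors = anchors_for_cell(cx=7 * STRIDE + STRIDE // 2, cy=5 * STRIDE + STRIDE // 2)
offsets = np.zeros((len(anchors), 4))   # in the model these come from the localization layer
proposals = apply_offsets(anchors, offsets)
```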
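
The bilinear sampling step can be sketched with PyTorch's `affine_grid`/`grid_sample` (the paper implements its own sampler; this is only a rough equivalent). The key point is that gradients reach the proposal coordinates, which RoI pooling does not allow:

```python
import torch
import torch.nn.functional as F

def crop_region(features, box, out_size=7):
    """Bilinearly sample a fixed-size grid of features inside `box`.

    features: (1, C, H, W) conv feature map.
    box:      (x1, y1, x2, y2) in feature-map pixel coordinates; if it requires
              grad, gradients flow back into the coordinates themselves.
    """
    _, C, H, W = features.shape
    x1, y1, x2, y2 = box
    # Normalized [-1, 1] centre and half-extent of the crop (align_corners=True convention).
    cx = (x1 + x2) / (W - 1) - 1
    cy = (y1 + y2) / (H - 1) - 1
    hw = (x2 - x1) / (W - 1)
    hh = (y2 - y1) / (H - 1)
    zero = torch.zeros_like(cx)
    theta = torch.stack([
        torch.stack([hw, zero, cx]),
        torch.stack([zero, hh, cy]),
    ]).unsqueeze(0)                                   # (1, 2, 3) affine transform
    grid = F.affine_grid(theta, (1, C, out_size, out_size), align_corners=True)
    return F.grid_sample(features, grid, align_corners=True)

features = torch.randn(1, 512, 38, 50)                # e.g. conv features of one image
box = torch.tensor([10.0, 5.0, 30.0, 25.0], requires_grad=True)
patch = crop_region(features, box)                    # (1, 512, 7, 7) sampled region
patch.sum().backward()
print(box.grad)                                       # box coordinates receive gradients
```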
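
And a rough sketch of the recognition MLP plus Show-and-Tell-style LSTM, with made-up dimensions; the region code is fed as the "first word" and the LSTM is trained with a per-word cross-entropy loss:

```python
import torch
import torch.nn as nn

D, V, H = 512, 10000, 512                   # hypothetical code size, vocab size, hidden size

recognition_mlp = nn.Sequential(            # flattens each sampled C x 7 x 7 region patch
    nn.Flatten(), nn.Linear(512 * 7 * 7, D), nn.ReLU(), nn.Linear(D, D), nn.ReLU())
embed = nn.Embedding(V, D)
lstm = nn.LSTM(D, H, batch_first=True)
word_logits = nn.Linear(H, V)

regions = torch.randn(32, 512, 7, 7)        # a batch of sampled region features
captions = torch.randint(0, V, (32, 12))    # padded ground-truth captions (word ids)

codes = recognition_mlp(regions)            # (32, D) region codes
# The region code is prepended as a pseudo "first word"; the LSTM then predicts
# each caption word from the previous ones (Show-and-Tell style).
inputs = torch.cat([codes.unsqueeze(1), embed(captions[:, :-1])], dim=1)
hidden, _ = lstm(inputs)                    # (32, 12, H)
logits = word_logits(hidden)                # scores over the vocabulary at every step
loss = nn.functional.cross_entropy(logits.reshape(-1, V), captions.reshape(-1))
```
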
The model is very well engineered, bringing together components from several prior works (Faster R-CNN + Spatial Transformer Networks + Show & Tell).