This is a great paper that addresses the degradation problem that arises when training very deep neural networks (plain architectures see accuracy saturate and then degrade as depth grows). The solution proposed in the paper is a deep residual learning framework that allows extremely deep CNN models to be trained for various visual recognition tasks. The architecture consists of stacked convolutional layers with identity shortcut connections that skip over every two layers. In this way, each pair of layers is trained to approximate a residual function of the underlying mapping.
The claim made in the paper is that if an underlying mapping H(x) can be asymptotically approximated by a few stacked layers, then so can the residual function F(x) = H(x) - x (with x added back via the shortcut), and the authors hypothesize that the residual form is easier to optimize. This intuition isn't very clear to me. Section 3.1 discusses it, and I was wondering if someone could help me understand it.
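To make the block structure concrete, here is a minimal PyTorch sketch of one two-layer residual block, roughly following the paper's basic building block (two 3x3 convolutions with batch norm, an identity shortcut, and ReLU after the addition). This is not the authors' code; the class name and parameters are my own, and it assumes the input and output channel counts match so the identity shortcut needs no projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two-layer residual block: output = relu(F(x) + x),
    where F(x) is conv-bn-relu-conv-bn and the shortcut is the identity."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # F(x): the residual branch, spanning two conv layers
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # Shortcut adds the input back before the final nonlinearity,
        # so the two layers only need to learn H(x) - x.
        return F.relu(residual + x)

# Usage: the output has the same shape as the input, since the shortcut is identity.
x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)
```

In this form, if the optimal mapping for a block is close to the identity, the weights only need to drive the residual branch toward zero rather than fit an identity map from scratch, which is the paper's motivating argument.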
Some further questions and observations:
- This framework doesn't have the several fully connected layers at the end that VGG/AlexNet did; it ends with global average pooling followed by a single FC layer. Dropping the large FC layers certainly cuts parameters and inference cost, but I wonder whether the authors also found that extra FC layers didn't help accuracy for ResNets. If so, why would that be the case? (See the head-comparison sketch after this list.)
- Have people explored other shortcut-connection topologies? Is there a good intuition for why a shortcut spanning every two layers works so well?
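To make the contrast in the first bullet concrete, here is a rough side-by-side sketch of the two classifier heads. The 512-channel and 7x7 feature-map sizes match the standard ImageNet configurations of ResNet-34 and VGG-16, but the exact numbers here are illustrative assumptions, not taken from either paper's released code.

```python
import torch.nn as nn

num_classes = 1000  # ImageNet; illustrative

# ResNet-style head: global average pooling, then a single FC layer (~0.5M params).
resnet_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # collapse each 512-channel feature map to 1x1
    nn.Flatten(),
    nn.Linear(512, num_classes),
)

# VGG-style head: flatten a 7x7x512 feature map into two wide hidden FC layers
# (~120M params, the bulk of VGG's parameter count).
vgg_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, num_classes),
)
```

The parameter gap between the two heads is mostly where the speed and memory savings come from; whether the extra FC capacity would buy any accuracy on top of a deep residual trunk is exactly the question above.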