This paper focused on solving the degradation problem: as plain networks get deeper, accuracy saturates and then starts to drop. The paper's explanation is that a residual mapping is easier to optimize than the original mapping, so training stays effective in deeper networks. That makes sense, but there's more to the story.

The self-referential formulation of ResNets, H(x) = F(x) + x, leads to interesting properties on closer examination. On 'unrolling' H(x) across several blocks, we can view a ResNet as an ensemble of many shallower networks (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, arXiv:1605.06431), which gives some intuition for why they keep learning well as they go deeper. This interpretation is key to understanding ResNets and, by extension, HighwayNets and DenseNets.
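
To make the unrolling concrete, here is a minimal sketch in NumPy. It assumes purely linear residual branches (F(x) = W x), which is a simplification I am making so that the decomposition is exact; real blocks are nonlinear, and there the paths share computation rather than summing cleanly. The function names are my own, not from the paper.

```python
import itertools
import numpy as np

def residual_forward(x, weights):
    """Run x through a stack of linear residual blocks: y = x + W @ x."""
    for W in weights:
        x = x + W @ x
    return x

def sum_over_paths(x, weights):
    """Sum the outputs of all 2^n paths; each path either skips or takes each block."""
    total = np.zeros_like(x)
    for mask in itertools.product([0, 1], repeat=len(weights)):
        out = x
        for take, W in zip(mask, weights):
            if take:
                out = W @ out  # this path goes through the block's branch
            # otherwise the path uses the identity shortcut and out is unchanged
        total = total + out
    return total

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(3)]  # 3 toy blocks
x = rng.normal(size=4)

paths_match = np.allclose(residual_forward(x, weights), sum_over_paths(x, weights))
num_paths = 2 ** len(weights)
```

With 3 blocks there are 2^3 = 8 paths, ranging from the pure identity (length 0) to the path through every branch (length 3), and most paths are shorter than the full depth.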

Another interesting interpretation is as gate-less LSTMs (Microsoft Wins ImageNet 2015 through Feedforward LSTM without Gates).
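
One rough way to see the gate-less view is through the general highway-layer form y = H(x) * T(x) + x * C(x), where T and C are learned gates. The sketch below is a feedforward analogy only, not a full LSTM cell; the tanh branch, the weight shapes, and the function names are assumptions of mine for illustration. Fixing both gates open (all ones) leaves exactly a residual block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_layer(x, W_h, W_t=None, W_c=None):
    """y = H(x) * T(x) + x * C(x); a gate without weights is fixed open (all ones)."""
    h = np.tanh(W_h @ x)                                          # transform branch H(x)
    t = sigmoid(W_t @ x) if W_t is not None else np.ones_like(x)  # transform gate T(x)
    c = sigmoid(W_c @ x) if W_c is not None else np.ones_like(x)  # carry gate C(x)
    return h * t + x * c

def residual_block(x, W_h):
    """y = F(x) + x, using the same toy tanh branch as F."""
    return np.tanh(W_h @ x) + x

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))
x = rng.normal(size=4)

# With no gate weights, the gated layer collapses to the residual block.
gateless_matches_residual = np.allclose(gated_layer(x, W_h), residual_block(x, W_h))
```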