Train deep architectures with a variable number of layers, with and without pre-training.
Weights are initialized by sampling uniformly from [-1/sqrt(k), 1/sqrt(k)], where k is the fan-in of the layer.
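A minimal NumPy sketch of this initialization scheme (the layer sizes and function name are illustrative, not from the source):

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    """Sample weights uniformly from [-1/sqrt(k), 1/sqrt(k)],
    where k is the fan-in, as described in the notes above."""
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

# Hypothetical example: first layer of a 784 -> 500 network
W = init_weights(784, 500)
```

Every entry of `W` then lies within ±1/sqrt(784) ≈ ±0.0357, so larger fan-in yields smaller initial weights.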
Without pre-training, the error rate rises faster with increasing depth than it does with pre-training.
Pre-training also makes the network more robust to random initializations.
At the same training cost level, the pre-trained models systematically yield a lower cost than the randomly initialized ones.
Pre-training seems to be most advantageous for smaller training sets.
Pre-training appears to have a regularizing effect: it decreases the variance across parameter configurations by restricting the set of possible final configurations for the parameter values, and it introduces a bias.
Pre-training helps for larger layers (more units per layer) and for deeper networks, but for small networks it can lower performance.
Since small networks tend to have small capacity, this supports the hypothesis that pre-training acts as a kind of regularizer.
Pre-training seems to provide better marginal conditioning of the weights, though this is not its only benefit, as it also captures more intricate dependencies.
Pre-training the lower layers is more important (and impactful) than pre-training the layers closer to the output.
The error landscape appears flatter for deep architectures and in the pre-trained case.
Learning trajectories of pre-trained and non-pre-trained models start and stay in different regions of function space. Moreover, trajectories of models of the same type initially move together, but at some point they diverge.