- Previous work in this domain
- Model Parallelism
- Data Parallelism
- Distributed Optimization Algorithms
- Beyond DistBelief - TensorFlow
Downpour SGD with Adagrad adaptive learning rates outperforms both Downpour SGD with a fixed learning rate and Sandblaster L-BFGS. Moreover, because each Adagrad learning rate depends only on the accumulated squared gradients of its own parameter, Adagrad can be implemented locally within each parameter server shard. It is surprising that Adagrad works so well with Downpour SGD, since it was not originally designed for asynchronous SGD. The paper conjectures that "Adagrad automatically stabilizes volatile parameters in the face of the flurry of asynchronous updates, and naturally adjusts learning rates to the demands of different layers in the deep network."
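A minimal NumPy sketch of the idea (the class and method names `ParameterShard` and `apply_gradient` are hypothetical, not from the paper): each parameter server shard stores the Adagrad accumulator alongside its own slice of parameters, so every asynchronous gradient push from a model replica can be scaled per parameter using only state that lives on that shard.

```python
import numpy as np

class ParameterShard:
    """One shard of the parameter server: a slice of the model's parameters
    plus the Adagrad accumulator for exactly those parameters, so the
    adaptive learning rate is computed entirely locally."""

    def __init__(self, params, base_lr=0.01, eps=1e-8):
        self.params = params                       # this shard's parameter slice
        self.grad_sq_sum = np.zeros_like(params)   # running sum of squared gradients
        self.base_lr = base_lr
        self.eps = eps

    def apply_gradient(self, grad):
        """Apply one (possibly stale) gradient pushed by a model replica.
        Adagrad scales each parameter's step by the inverse square root of
        its accumulated squared gradient."""
        self.grad_sq_sum += grad ** 2
        adaptive_lr = self.base_lr / (np.sqrt(self.grad_sq_sum) + self.eps)
        self.params -= adaptive_lr * grad
        return self.params  # replicas fetch the updated slice asynchronously
```

Because no cross-shard state is needed, this sketch illustrates why adding Adagrad to Downpour SGD requires no extra communication between parameter server shards.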