The comments on OpenReview provide a really good summary of the paper (copy-pasting the highlights and summary below)
Comments by Azalia Mirhoseini (author)
The main idea of our paper can be summarized as follows: massively increase the capacity of deep networks by employing efficient, general-purpose conditional computation. This idea seems hugely promising and hugely obvious. At first glance, it is utterly shocking that no one had successfully implemented it before us. In practice, however, there are major challenges in achieving high performance and high quality. We enumerate these challenges in the introduction of our new draft. Our paper discusses how other authors have attacked these challenges, as well as our particular solutions.
While some of our particular solutions (e.g., noisy-top-k gating, the particular batching schemes, the load-balancing loss, even the mixture-of-experts formalism) may not withstand the test of time, our main contribution, which is larger than these particulars, is to prove by example that efficient, general-purpose conditional computation in deep networks is possible and very beneficial. As such, this is likely a seminal paper in the field.
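For reference, the noisy-top-k gating and load-balancing loss mentioned above work roughly as follows. This is a minimal numpy sketch based on the paper's description; the weight names W_g and W_noise follow the paper, while the function names and the importance-style balancing penalty are simplified for illustration.

```python
import numpy as np

def noisy_top_k_gating(x, W_g, W_noise, k, rng):
    """Rough sketch of noisy top-k gating for a single example x."""
    clean_logits = x @ W_g                                   # one logit per expert
    noise_stddev = np.log1p(np.exp(x @ W_noise))             # softplus noise scale
    noisy_logits = clean_logits + rng.standard_normal(clean_logits.shape) * noise_stddev
    # Keep only the k largest logits; the rest go to -inf before the softmax,
    # so their gate values (and their experts' computation) are exactly zero.
    top_k = np.argsort(noisy_logits)[-k:]
    masked = np.full_like(noisy_logits, -np.inf)
    masked[top_k] = noisy_logits[top_k]
    gates = np.exp(masked - masked[top_k].max())
    return gates / gates.sum()

def importance_loss(gate_matrix, w_importance=0.1):
    """Sketch of a load-balancing penalty: the squared coefficient of variation
    of the per-expert gate mass over a batch, scaled by w_importance."""
    importance = gate_matrix.sum(axis=0)                     # total gate value per expert
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)
    return w_importance * cv_squared
```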
Comments by George Philipp (an interested reader)
The authors introduce a general-purpose mechanism for scaling up neural networks significantly beyond their current size using sparsity of activation, i.e., by forcing the activations of most neurons in the net to be zero for any given training example.
Firstly, I believe the sheer size of the models successfully trained in this paper warrants an 8 rating all by itself.
Secondly, we know historically that sparsity of parameters is among the most important modeling principles in machine learning, used with great success in, e.g., the Lasso via the L1 penalty, in SVMs via the hinge loss, and in ConvNets by setting connections outside the receptive field to zero. This paper, in addition to sparsity of parameters (neurons in different experts are not connected), employs sparsity of activation, where the computation path is customized for each training example. It is, as far as I can tell, the first paper to implement this in a practical, scalable and general way for neural networks. If sparsity of activation turns out to be even a small fraction as important as sparsity of parameters, this paper will have a major impact.
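To make the "customized computation path" concrete, here is a toy sketch of a mixture-of-experts forward pass in which an example only ever touches the parameters of the experts its gate selects. The names and the single-matrix ReLU "experts" are illustrative, not the paper's exact architecture.

```python
import numpy as np

def moe_forward(x, expert_weights, gates):
    """Toy MoE output for one example: only experts with a nonzero gate are
    evaluated, so the computation path differs from example to example."""
    active = np.nonzero(gates)[0]                 # indices of the selected experts
    out = np.zeros(expert_weights[0].shape[1])
    for i in active:
        out += gates[i] * np.maximum(x @ expert_weights[i], 0.0)  # tiny ReLU expert
    return out
```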
Thirdly, I love the computational efficiency of the model presented. The authors achieve extreme sparsity yet fully utilize their GPUs. In particular, they design the network so that there are very few connections between active and non-active units. In a sparsely activated fully-connected network, by contrast, most computation would be wasted on connections that start at active units and end at non-active units.
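The efficiency argument can be illustrated with a small sketch: rather than masking a large dense layer, the examples routed to each expert are gathered into a contiguous sub-batch, so every matmul that is actually executed is an ordinary dense one. The sketch below assumes top-1 routing for simplicity; the function and variable names are made up.

```python
import numpy as np

def dispatch_and_combine(X, assignments, expert_weights):
    """Gather each expert's examples into one dense sub-batch, run the expert
    once, and scatter the results back. No FLOPs are spent on unselected experts."""
    n, d_out = X.shape[0], expert_weights[0].shape[1]
    Y = np.zeros((n, d_out))
    for e, W in enumerate(expert_weights):
        idx = np.nonzero(assignments == e)[0]        # examples routed to expert e
        if idx.size:
            Y[idx] = np.maximum(X[idx] @ W, 0.0)     # one full-utilization dense matmul
    return Y
```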
Fourthly, the authors discuss and provide a practical, elegant strategy for large-scale cluster implementation, showcasing their technical sophistication. It is perhaps unfortunate that current baseline datasets may not be able to fully exploit the power of MoE or other to-be-designed networks following similar principles, but models like the one presented here are bound to become only more prominent in the future.
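A back-of-the-envelope calculation suggests why the distributed batching strategy matters: with n experts and top-k routing, a batch of b examples on a single device gives each expert only about k*b/n examples, whereas pooling the batches of d data-parallel workers (the paper's mix of data and model parallelism) gives each expert roughly k*b*d/n. The numbers below are illustrative, not taken from the paper.

```python
# Illustrative per-expert sub-batch sizes (numbers are made up, not from the paper).
b, n, k, d = 1024, 256, 2, 16                            # per-device batch, experts, top-k, devices
print("per-expert batch, one device :", k * b / n)       # 8.0 examples per expert per step
print("per-expert batch, d devices  :", k * b * d / n)   # 128.0 examples per expert per step
```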