So this is the summary of what I took away from the article. It would be great if someone could tell me if I went wrong somewhere!
- Gradient Descent w/o momentum is slow to converge, and sometimes doesn't converge at all.
- Adding momentum damps the back-and-forth oscillations we would get bouncing between the walls of our imaginary valley with plain gradient descent, while building up speed along the valley floor.
- The added z^(k+1) term [where we include the momentum coefficient beta] acts like a velocity: it accumulates past gradients, so it answers both "How fast do I go?" and "In which direction do I go?". The weight update then just takes a step along -z. In gradient descent w/o beta, the step depends only on the current gradient, so there's no memory of how fast we were already going.
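To sanity-check my intuition, here's a toy comparison on a badly conditioned quadratic (a narrow "valley"). This is just my sketch assuming the heavy-ball update I took from the article, z^(k+1) = beta * z^k + grad f(w^k), w^(k+1) = w^k - alpha * z^(k+1); the function, step size, and beta are made up for illustration.

```python
# Toy valley: f(w) = 0.5 * (1 * w0^2 + 50 * w1^2).
# The curvature mismatch (1 vs 50) is what makes plain GD slow.

def grad(w):
    # gradient of f: (1 * w0, 50 * w1)
    return [1.0 * w[0], 50.0 * w[1]]

def descend(beta, alpha=0.02, steps=200):
    w = [1.0, 1.0]   # start on the valley wall
    z = [0.0, 0.0]   # velocity starts at zero
    for _ in range(steps):
        g = grad(w)
        # z accumulates past gradients -> "how fast AND which way"
        z = [beta * z[i] + g[i] for i in range(2)]
        # the weight update just steps along -z
        w = [w[i] - alpha * z[i] for i in range(2)]
    # distance from the minimum at the origin
    return (w[0] ** 2 + w[1] ** 2) ** 0.5

plain = descend(beta=0.0)     # beta = 0 recovers plain gradient descent
momentum = descend(beta=0.9)  # typical momentum value
print(plain, momentum)
```

With beta = 0.9 the remaining error after 200 steps is orders of magnitude smaller than with beta = 0, which matches the "momentum converges faster in ravines" takeaway. (Push alpha too high and the momentum run overshoots and diverges, which matches the article's point that the gains depend on tuning alpha and beta together.)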
I didn't follow the rather dense mathematics behind most of the derivations, but I hope I got the intuition right?