Part of list:
Gradient Descent, an intuitive introduction
- Introduction and Overview
- Equations and formulas
- Intuition for Gradient Descent
- Variants of Gradient Descent
- Choosing the learning rate
Gradient Descent is one of the most popular and widely used optimization algorithm. Given a machine learning model with parameters (weights and biases) and a cost function to evaluate how good a particular model is, our learning problem reduces to that of finding a good set of weights for our model which minimizes the cost function.
Gradient descent is an iterative method. We start some value for our model parameters (weights and biases), and improve them slowly. To improve a given set of weights, we try to get a sense of the value of the cost function for weights similar to the current weights (by calculating the gradient) and move in the direction in which the cost function reduces. By repeat this step thousands of times we'll continually minimize our cost function.
Gradient descent is used to minimize a cost function J(w) parameterized by a model parameters w. The gradient (or derivative) tells us the incline or slope of the cost function. Hence, to minimize the cost function, we move in the direction opposite to the gradient.
Let G be the gradient of the cost function with respect to the parameters at a particular value w of the weight vector. That is,
Then, the gradient descent step is given by
Here, η is the learning rate which determines the size of the steps we take to reach a minimum . We need to be very careful about this parameter since high values of η may overshoot the minimum and very low value will reach minimum very slowly.
Imagine you're blind folded in a rough terrain, and your objective is to reach the lowest altitude. One of the simplest strategies you can use, is to feel the ground in every direction, and take a step in the direction where the ground is descending the fastest. If you keep repeating this process, you might end up at the lake, or even better, somewhere in the huge valley.
The rough terrain is analogous to the cost function. Minimizing the cost function is analogous to trying to reach lower altitudes. You are blind folded, since we don't have the luxury of evaluating (seeing) the value of the function for every possible set of parameters. Feeling the slope of the terrain around you is analogous to calculating the gradient, and taking a step is analogous to one iteration of update to the parameters.
There are multiple variants of gradient descent depending on how much of the data is being used to calculate the gradient. The main reason behind these variations is computational efficiency. A dataset may have millions of data points, and calculating the gradient over the entire dataset can be computational expensive.
Of these, stochastic GD and mini-batch GD are most popular. Usually mini-batch GD is used because computing infrastructure (compilers, CPUs, GPUs) is often optimized for performing vector additions and vector multiplications.
There are some challenges with aforementioned gradient descent algorithms such as choosing a proper learning rate. It's typical to choose a small value such as 0.1, 0.01 or 0.001 and adapt it based on whether the cost function is reducing very slowly (increase learning rate) or is exploding / being erratic (decrease learning rate).
Although manually choosing a learning rate is still the most common practice, several methods such as adam optimizer, AdaGrad and RMSProp have been proposed to automatically choose a suitable learning rate. Suggested further reading for going deeper into this topic: An overview of gradient descent optimization algorithms.