Gradient Descent

September 29, 2017

Introduction and Overview

Gradient Descent is one of the most popular and widely used optimization algorithms.

Let’s say we are given a machine learning model (parameterized by weights and biases) and a cost function to evaluate how good a particular model is. Then, our learning problem reduces to that of finding the values of the model parameters which minimize the cost function.

Gradient descent is an iterative method. We start with some arbitrary initial values for our model parameters (weights and biases), and improve them slowly. In each iteration, we try to get a sense of the value of the cost function for weights similar to the current weights (by calculating the gradient) and move in the direction in which the cost function reduces. By repeating this step thousands of times we’ll continually minimize our cost function.

Pseudocode for Gradient Descent

Let’s say we want to minimize a cost function J(w) parameterized by model parameters w. The pseudocode for gradient descent is as follows:

Step 1: initialize the weights $w$ randomly

Step 2: calculate the gradients

The gradient G is defined as the derivative of the cost function with respect to the model parameters. That is, for the $i$-th parameter $w_i$, we have $G_i = \frac{\partial J(w)}{\partial w_i}$.

The gradient is the direction in which the cost function increases the most. Hence, the direction in which the cost function decreases the most is given by -G.

Note: If you don’t remember much about derivatives and calculus, that’s okay. Calculating the formula for the gradient is not the focus here.

Step 3: update the weights using the formula $w \leftarrow w - \eta \cdot G$

Here, $\eta$ (pronounced eta) is the learning rate which determines the size of the steps we take to reach a minimum. We will discuss how to pick a good value for the learning rate later in the tutorial.

Notice the minus sign in the update formula. The gradient gives us the direction in which the cost function increases. But we want to move in the direction in which the cost function decreases, i.e. the direction opposite to the gradient.

Step 4: repeat till $J(w)$ stops reducing or some termination criteria is met.

Typically, we just check if the cost function reduced by at least 0.001 or some similar threshold. We may also have a limit on the maximum number of iterations allowed (say 10,000).

Gradient Descent for Linear Regression

Let us see what the formula for the gradient G looks like for the case of linear regression. We have:

Cost function: $J(w)= \frac{1}{n}\sum_{i=1}^{n}((b+w_1x^{(i)})-y^{(i)}_{true})^2$

This is the mean squared cost function that you saw in the previous tutorial.

Gradient with respect to $w_1$: $G_1 = \frac{\partial J(w)}{\partial w_1}=\frac{2}{n}\sum_{i=1}^{n}x^{(i)}((b+w_1x^{(i)})-y^{(i)}_{true})$

Gradient with respect to b: $G_b = \frac{\partial J(w)}{\partial b}=\frac{2}{n}\sum_{i=1}^{n}((b+w_1x^{(i)})-y^{(i)}_{true})$

We will not show the step-by-step derivation of the gradients here since calculating the formula for the gradient is not the focus here. However, if you’d like to try, you will need to use calculus and in particular, the chain rule of differentiation.

Hence the updates are:

$w_1 \leftarrow w_1 - \eta \cdot G_1$
$b \leftarrow b - \eta \cdot G_b$

Note: In the above formulas, $b + w_1x$ is the prediction $y_{w,b}(x)$. So the above formulas would be simpler if written in terms of $y_{w,b}(x)$, but we’re writing them in full so that it is easier to see the complete picture.

For linear regression, we can visualise gradient descent as follows:

Gradient descent with 2D cost function

Here the x and y-axis represent the changing weight w₁ and bias b. The function values in the z-axis is the cost function. As we change the weight and bias values we keep descending to lower values of cost function to arrive at the minima.

Global vs Local Minima

As we saw in the previous section, the cost function for linear regression is shaped like a bowl. Hence, there is only one minima, and gradient descent always proceeds in the correct direction.

In general, gradient descent is also used when the cost function’s shape is not that straightforward. In particular, the cost function may have more than one minima. In such a scenario, we may get stuck in local minimas, depending on the initial values of weights and biases we start our descent from.

Here’s an example where we start at two very similar values for the weights and biases, but end up at completely different minimas.

Gradient descent on a cost function with multiple local minimas

For many machine learning algorithms, there’s only one minima, so gradient descent finds the optimal solution. But in general, gradient descent often works quite well even for machine learning algorithms where there is more than one minima.

Intuition for Gradient Descent

Imagine you’re blind folded in a rough terrain, and your objective is to reach the lowest altitude. One of the simplest strategies you can use, is to feel the ground in every direction, and take a step in the direction where the ground is descending the fastest. If you keep repeating this process, you might end up at the lake, or even better, somewhere in the huge valley.

Source: Andrej Karpathy’s Stanford Course Lecture 3

The rough terrain is analogous to the cost function. Minimizing the cost function is analogous to trying to reach lower altitudes. You are blindfolded, since we don’t have the luxury of evaluating (seeing) the value of the function for every possible set of parameters. Feeling the slope of the terrain around you is analogous to calculating the gradient, and taking a step is analogous to one iteration of update to the parameters.

Choosing the learning rate

Typically, finding a good value for the learning rate requires some trial and error.

First, let’s understand how learning rate effects gradient descent.

When the learning rate is too small, each step of descent is also extremely small. This means that it will take a very very long time to reach the local minima.
On the other hand, a larger learning rate may converge faster, but if it is too large, it may overshoot the minima and fail to converge.

Gradient descent with different learning rates

Given this information, we can use the following strategy for finding a good value for the learning rate.

We start with a small value such as 0.1, 0.01 or 0.001. Then, we adapt it based on whether the cost function is reducing very slowly or exploding.

If the cost function is reducing very slowly, we increase the learning rate.
If the cost function is exploding or if it is being erratic, we decrease the learning rate.

Although manually choosing a learning rate is still the most common practice, several methods such as Adam optimizer, AdaGrad and RMSProp have been proposed to automatically choose a suitable learning rate.

If you would like to go deeper into this topic, here is a link for further reading: An overview of gradient descent optimization algorithms.

Variants of Gradient Descent

There are multiple variants of gradient descent depending on how much of the data is being used to calculate the gradient. The main reason behind these variations is computational efficiency. A dataset may have millions of data points, and calculating the gradient over the entire dataset can be computationally expensive.

Batch gradient descent computes the gradient of the cost function with respect to parameter w for the entire training data. Since we need to calculate the gradients for the whole dataset to perform one parameter update, batch gradient descent can be very slow.
Stochastic gradient descent (SGD) computes the gradient for each update using a single training data point x⁽ⁱ⁾ (chosen at random). The idea is the gradient calculated this way is a stochastic approximation to the gradient calculated using the entire training data. Each update is now much faster than in batch gradient descent, and over many updates, we will head in the same general direction.
In mini-batch gradient descent, we calculate the gradient for each small mini-batch of training data. That is, we first divide the training data into small batches (say M data points / batch). We perform one update per mini-batch. M is usually in the range 30-500, depending on the problem. Usually mini-batch GD is used because computing infrastructure - compilers, CPUs, GPUs - are often optimized for performing vector additions and vector multiplications.

Of these, SGD and mini-batch GD are most popular. In a typical scenario, we do several passes over the training data before the termination criteria is met.

Example of Gradient Descent

Let’s take a concrete example which combines everything we learnt above.

In this example, we will go through a single update step of Stochastic Gradient Descent. For this, we’ll assume we have a training data point where $x = (x_1, x_2) = (2, -3)$ and $y = 1$.

In addition, we need to choose what our model and cost function are. Let’s say we are doing linear regression. Hence, our model has weights $w_1, w_2$ and bias $b$. The prediction is given by $p = w_1x_1 + w_2x_2 + b$ and the cost function is $J(w) = (p-y)^2$.

Step 1: Initialize model weights (at random) and bias. Let’s say $(w_1, w_2) = (0.1, 0.2)$ and $b = 0$.
Step 2: For each data point in training data:
Calculate the gradients: To calculate the gradients, we compute the prediction, then the cost, and then the gradients.
Calculate the prediction: $p = w_1x_1 + w_2x_2 + b = 2w_1 - 3w_2 + b = 0.2 - 0.6 = -0.4$
Calculate the cost: $J(w) = (p-y)^2 = (-1.4)^2 = 1.96$
Calculate the gradient: (formulas for gradient provided earlier in the tutorial)
For $w_1$ we have: $\frac{\partial J(w)}{\partial w_1} = \frac{\partial J(w)}{\partial p}\frac{\partial p}{\partial w_1} = 2(p-y)x_1 = 2 * -1.4 * 2 = -5.6$
Similarly, for $w_2$ we have: $\frac{\partial J(w)}{\partial w_2} = 2(p-y)x_2 = 2 * -1.4 * -3 = 8.4$
And finally, for $b$ we have: $\frac{\partial J(w)}{\partial b} = \frac{\partial J(w)}{\partial p}\frac{\partial p}{\partial b} = 2(p-y) = 2 * -1.4 = -2.8$
Update the weights
Let’s assume the learning rate is 0.001. Then,
$w_1 = w_1 - \eta G_{w_1} = 0.1 - 0.001 * -5.6 = 0.1056$.
$w_2 = w_2 - \eta G_{w_2} = 0.2 - 0.001 * 8.4 = 0.1916$
$b = b - \eta G_{b} = 0 - 0.001 * -2.8 = 0.0028$
This gives us the new weights $(w_1, w_2) = (0.1056, 0.1916)$ and bias $b = 0.0028$.
Step 3: Check if termination criteria is met (see if mean-squared-error reduced). If yes, repeat from step 2. Else stop.

Above was a detailed walk-through of one update step using stochastic gradient descent. Typically, we repeat this step thousands of times, each time choosing a data point at random for the stochastic gradient descent.

As a sanity check, you can verify that the prediction that would be made using the new weights $(w_1, w_2) = (0.1056, 0.1916)$ and bias $b = 0.0028$ is an improvement over our original prediction.

Summary

Gradient Descent is a popular and widely used optimization algorithm.
It is an iterative method. We start with some arbitrary values for our model parameters and improve them repeatedly to move towards lower values of the cost function.
The direction in which we move is determined by calculating the gradient.
We must choose the learning rate carefully. Too small a learning rate makes the training move impractically slow, and too large a learning rate leads to divergence.