# Introduction and Overview

**Gradient Descent** is one of the most popular and widely used **optimization algorithms**. Given a machine learning model with parameters (weights and biases) and a cost function to evaluate how good a particular model is, our learning problem reduces to that of finding a good set of weights for our model which minimizes the cost function.

Gradient descent is an **iterative method**. We start with some set of values for our model parameters (weights and biases), and improve them slowly. To improve a given set of weights, we try to get a sense of the value of the cost function for weights similar to the current weights (by calculating the gradient) and move in the direction in which the cost function reduces. By repeating this step thousands of times we'll continually minimize our cost function.

# Pseudocode for Gradient Descent

Gradient descent is used to minimize a *cost function* *J*(* w*) parameterized by a

*model parameters*

*. The gradient (or derivative) tells us the incline or slope of the cost function. Hence, to minimize the cost function, we move in the direction opposite to the gradient.*

**w****initialize**the weights w randomly**calculate the gradients**G = \nabla_w J(w) of cost function w.r.t parameters. That is, G_i = \frac{\partial J(w)}{\partial w_i} (See footnotes 1, 2).**update the weights**by an amount proportional to G, i.e. w = w - \eta \cdot G- repeat till J(w) stops reducing or other pre-defined
**termination criteria**is met

In step 3, \eta is the **learning rate** which determines the size of the steps we take to reach a minimum. We need to be very careful about this hyper-parameter since high values of \eta may overshoot the minimum and very low value will reach minimum very slowly.

A popular sensible choice for the termination criteria is that the cost J(w) stops reducing on the *validation* dataset.

# Intuition for Gradient Descent

Imagine you're *blind folded *in a rough terrain, and your objective is to reach the *lowest altitude*. One of the simplest strategies you can use, is to *feel the ground in every direction, and take a step in the direction where the ground is descending* the fastest. If you keep repeating this process, you might end up at the lake, or even better, somewhere in the huge valley.

The rough terrain is analogous to the cost function. Minimizing the cost function is analogous to trying to reach lower altitudes. You are blind folded, since we don't have the luxury of evaluating (seeing) the value of the function for every possible set of parameters. Feeling the slope of the terrain around you is analogous to calculating the gradient, and taking a step is analogous to one iteration of update to the parameters.

# Variants of Gradient Descent

There are multiple variants of gradient descent depending on how much of the data is being used to calculate the gradient. The main reason behind these variations is computational efficiency. A dataset may have millions of data points, and calculating the gradient over the entire dataset can be computationally expensive.

**Batch gradient descent**computes the gradient of the cost function w.r.t to parameter w for**entire training data**. Since we need to calculate the gradients for the whole dataset to perform one parameter update, batch gradient descent can be very slow.**Stochastic gradient descent (SGD)**computes the gradient for each update using a**single training data point***x*(chosen at random). The idea is the gradient calculated this way is a_{i}*stochastic approximation*to the gradient calculated using the entire training data. Each update is now much faster than in batch gradient descent, and over many updates, we will head in the same general direction.- In
**mini-batch gradient descent**, we calculate the gradient for each small mini-batch of training data. That is, we first divide the training data into small batches (say M samples / batch). We perform one update per mini-batch. M is usually in the range 30-500, depending on the problem. Usually mini-batch GD is used because computing infrastructure - compilers, CPUs, GPUs - are often optimized for performing vector additions and vector multiplications.

Of these, SGD and mini-batch GD are most popular. In a typical scenario, we do several passes over the training data before the termination criteria is met. Each pass is called an *epoch*. Also note that since the update step is much more computationally efficient in SGD and mini-batch GD, we typically perform 100s-1000s of updates in between checks for termination criteria being met.

# Choosing the learning rate

Typically, the value of the learning rate is chosen manually. We usually start with a small value such as 0.1, 0.01 or 0.001 and adapt it based on whether the cost function is reducing very slowly (increase learning rate) or is exploding / being erratic (decrease learning rate).

Although manually choosing a learning rate is still the most common practice, several methods such as Adam optimizer, AdaGrad and RMSProp have been proposed to automatically choose a suitable learning rate. Suggested further reading for going deeper into this topic: An overview of gradient descent optimization algorithms.

# Example of (Stochastic) Gradient Descent

Let's take a concrete example which combines everything we learnt above.

In this example, we will go through a single update step of Stochastic Gradient Descent. For this, we'll assume we have a training data point where x = (x_1, x_2) = (2, -3) and y = 1.

In addition, we need to choose what our model and cost function are. Let's say we are doing linear regression. Hence, our model has weights w_1, w_2 and bias b. The prediction is given by p = w_1x_1 + w_2x_2 + b and the cost function is J(w) = (p-y)^2.

- Step 1:
**Initialize**model weights (at random) and bias. Let's say (w_1, w_2) = (0.1, 0.2) and b = 0. - Step 2: For each data point in training data:
**Calculate the gradients**: To calculate the gradients, we compute the prediction, then the cost, and then the gradients.**Calculate the prediction**: p = w_1x_1 + w_2x_2 + b = 2w_1 - 3w_2 + b = 0.2 - 0.6 = -0.4**Calculate the cost**: J(w) = (p-y)^2 = (-1.4)^2 = 1.96**Calculate the gradient**: (uses differentiation chain rule)- For w_1 we have: \frac{\partial J(w)}{\partial w_1} = \frac{\partial J(w)}{\partial p}\frac{\partial p}{\partial w_1} = 2(p-y)x_1 = 2 * -1.4 * 2 = -5.6
- Similarly, for w_2 we have: \frac{\partial J(w)}{\partial w_2} = 2(p-y)x_2 = 2 * -1.4 * -3 = 8.4
- And finally, for b we have: \frac{\partial J(w)}{\partial b} = \frac{\partial J(w)}{\partial p}\frac{\partial p}{\partial b} = 2(p-y) = 2 * -1.4 = -2.8
**Update the weights**- Let's assume the learning rate is 0.001. Then,
- w_1 = w_1 - \eta G_{w_1} = 0.1 - 0.001 * -5.6 = 0.1056.
- w_2 = w_2 - \eta G_{w_2} = 0.2 - 0.001 * 8.4 = 0.1916
- b = b - \eta G_{b} = 0 - 0.001 * -2.8 = 0.0028
- This gives us the new weights (w_1, w_2) = (0.1056, 0.1916) and bias b = 0.0028.
- Step 3: Check if
**termination criteria**is met (see if mean-squared-error on entire validation data reduced). If yes, repeat from step 2. Else stop.

Above was a detailed walk-through of one update step using stochastic gradient descent. In each epoch, we perform the update step once for each data point. And typically, we do multiple epochs.

As a sanity check, you can check that the prediction that would be made using our new weights (w_1, w_2) = (0.1056, 0.1916) and bias b = 0.0028 is an improvement over our original prediction.

# Footnotes

- The value of the gradient depends on the cost function and the inputs, and you might need to revisit the topic of differentiation if you are calculating them by hand.
- It is common for machine learning software libraries to calculate the gradients automatically once the cost function is specified. However, if one is implementing a novel machine learning algorithm and applying the gradient descent optimization technique on it, it is important to have at-least a rough idea of what the gradients look like in-order to make sure that the method works as desired and does not get stuck / behave erratically.