This post is an intuitive introduction to computational graphs and backpropagation - both important and core concepts in deep learning for training neural networks.

# Doesn't Tensorflow perform backprop automatically?

Deep learning frameworks such as Tensorflow and Torch perform backpropagation and gradient descent automatically. So, why do we need to understand it?

The answer is, because although the *calculations* are abstracted out by libraries, gradient descent on neural networks is susceptible to various issues such as *vanishing gradients*, *dead neurons*, etc. When we are trying to train a neural network and we face these problems, we will not be able to resolve it unless we understand whats going on behind the scenes.

# Why computational graphs?

Computational graph is how backpropagation is implemented in deep learning frameworks like Tensorflow, Torch, Theano, etc. More importantly, understanding backpropagation on computational graphs unites multiple different backpropagation algorithms and its variations (such as backprop through time, backprop with shared weights). Once you convert everything to a computational graph, they are all the *same* algorithm - just backpropagation on computational graphs.

# Computational Graph, What is it?

A computational graph is a directed graph where the nodes correspond to** mathematical operations. **It is a way of *expressing* and *evaluating* a mathematical expression.

For example, here is a simple mathematical equation.

q = x + y

We can draw a computational graph of the above equation as follows.

The above computational graph has an addition node (node with "+" sign) with two input variables *x* and *y* and one output *q*.

Let's take another example, slightly more complex. We have the following equation.

f = (x + y)*z

Which is represented by the following computational graph.

# Forward Pass

Forward pass is the procedure for evaluating the value of the mathematical expression represented by computational graphs. Doing forward pass means we are passing the value from variables in forward direction (toward node's output).

Let's take an example by giving some value to all of the inputs. Suppose, the following values are given to all of the inputs.

x = 2, y = 3, z = -4

By giving those values to the inputs we could do forward pass and get the following values for the outputs on each node.

First, we use the value of x = 2 and y = 3, to get q = 5. Then we use q = 5 and z = -4 to get f = -20. We go from left to right, *forwards*.

# Objectives of Backward Pass (backpropagation)

In the backward pass, our goal is to calculate the gradients for each input with respect to the final output. These gradients are required for training the neural network using gradient descent.

For our running example, we desire the following gradients.

Desired \space gradients: \frac{\partial f}{\partial x}, \space \frac{\partial f}{\partial y}, \space \frac{\partial f}{\partial z}

# Backward pass (backpropagation)

We start the backward pass by derivating the final output with respect to the final output (itself!). Thus, it will result in the identity derivation and the value is equal to one.

\frac{\partial f}{\partial f} = 1

i.e. our computational graph now looks as follows,

Next, we will do the backward pass through the "*" operation, i.e. we'll calculate the gradients at *q* and *z.* Since *f = q*z*, we know that

\frac{\partial f}{\partial q} = z, \space \frac{\partial f}{\partial z} = q

We already know the values of *z* and *q* from the forward pass. Hence, we get

\frac{\partial f}{\partial z} = q = 3

and

\frac{\partial f}{\partial q} = z = -4

So the graph now looks as follows:

**Now comes the interesting part.** We want to calculate the gradients at *x* and *y, *i.e.

\frac{\partial f}{\partial x}, \space \frac{\partial f}{\partial y}

but we want to do this *efficiently* (although *x* and *f* are only two hops away in this graph, imagine them being really far from each other). To calculate these values efficiently, we'll use the **chain rule of differentiation. **From chain rule, we have

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} \space \space \space \space \space \space \space \space \space \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y}

But we already know the d*f*/d*q *= -4 (which is the difficult term to calculate here if the graph is large). d*q*/d*x* and d*q*/d*y* are easy since *q* directly depends on *x* and *y*. We have,

q = x + y \space \Rightarrow \space \frac{\partial q}{\partial x} = 1, \space \frac{\partial q}{\partial y} = 1

Hence, we get

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} = (-4).1 = -4

And for the input y.

\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} = (-4).1 = -4

And our final graph.

The main reason for doing this **backwards** is that when we had to calculate the gradient at *x*, we only used already computed values, and d*q*/d*x* (derivative of node output with respect to the same node's input). We used *local* information to compute a *global* value.

# Translating to neural networks

Above, you saw the computational graph for a simple function. The computational graph for a typical neural network will be much larger. Below is the graph for a single neuron with two inputs and tanh activation function,

Note that in the computational graph, **the weights are nodes too**.

The computational graph of a naive neural network plus cost function looks as follows:

# Putting it all together

We have a neural network, and its corresponding computational graph G. It takes as input *x* and gives output *c*, which represents the error or the cost.

Steps for training a neural network are as follows:

- for data point x in dataset
- do forward pass with x as input, and calculate the cost c as output
- do backward pass starting at c, and calculate gradients for all nodes in graph G. This includes nodes which represent the neural network weights.
- update the weights by doing W = W - learning rate * gradients
- repeat until stop criteria is met

# Conclusion

In this article, we explained what a computational graph is, and how to perform forward pass and backward pass on it. We explained how the computational graph relates to neural networks, and how gradients can be calculated efficiently and used to train neural networks with gradient descent.