Computational Graphs and Backpropagation

December 22, 2017

This post is an intuitive introduction to computational graphs and backpropagation - both important and core concepts in deep learning for training neural networks.

Doesn’t Tensorflow perform backprop automatically?

Deep learning frameworks such as Tensorflow and Torch perform backpropagation and gradient descent automatically. So, why do we need to understand it?

The answer is, because although the calculations are abstracted out by libraries, gradient descent on neural networks is susceptible to various issues such as vanishing gradients, dead neurons, etc. When we are trying to train a neural network and we face these problems, we will not be able to resolve it unless we understand whats going on behind the scenes.

Why computational graphs?

Computational graph is how backpropagation is implemented in deep learning frameworks like Tensorflow, Torch, Theano, etc. More importantly, understanding backpropagation on computational graphs unites multiple different backpropagation algorithms and its variations (such as backprop through time, backprop with shared weights). Once you convert everything to a computational graph, they are all the same algorithm - just backpropagation on computational graphs.

Computational Graph, What is it?

A computational graph is a directed graph where the nodes correspond to mathematical operations. It is a way of expressing and evaluating a mathematical expression.

For example, here is a simple mathematical equation.

$$ q = x + y $$

We can draw a computational graph of the above equation as follows.

The above computational graph has an addition node (node with ”+” sign) with two input variables x and y and one output q.

Let’s take another example, slightly more complex. We have the following equation.

$$ f = (x + y)*z $$

Which is represented by the following computational graph.

Forward Pass

Forward pass is the procedure for evaluating the value of the mathematical expression represented by computational graphs. Doing forward pass means we are passing the value from variables in forward direction (toward node’s output).

Let’s take an example by giving some value to all of the inputs. Suppose, the following values are given to all of the inputs.

$$ x = 2, y = 3, z = -4 $$

By giving those values to the inputs we could do forward pass and get the following values for the outputs on each node.

First, we use the value of x = 2 and y = 3, to get q = 5. Then we use q = 5 and z = -4 to get f = -20. We go from left to right, forwards.

Objectives of Backward Pass (backpropagation)

In the backward pass, our goal is to calculate the gradients for each input with respect to the final output. These gradients are required for training the neural network using gradient descent.

For our running example, we desire the following gradients.

$$ Desired \space gradients: \frac{\partial f}{\partial x}, \space \frac{\partial f}{\partial y}, \space \frac{\partial f}{\partial z} $$

Backward pass (backpropagation)

We start the backward pass by derivating the final output with respect to the final output (itself!). Thus, it will result in the identity derivation and the value is equal to one.

$$ \frac{\partial f}{\partial f} = 1 $$

i.e. our computational graph now looks as follows,

Next, we will do the backward pass through the ”” operation, i.e. we’ll calculate the gradients at q and z. Since _f = qz_, we know that

$$ \frac{\partial f}{\partial q} = z, \space \frac{\partial f}{\partial z} = q $$

We already know the values of z and q from the forward pass. Hence, we get

$$ \frac{\partial f}{\partial z} = q = 5 $$

and

$$ \frac{\partial f}{\partial q} = z = -4 $$

So the graph now looks as follows:

Now comes the interesting part. We want to calculate the gradients at x and y, i.e.

$$ \frac{\partial f}{\partial x}, \space \frac{\partial f}{\partial y} $$

but we want to do this efficiently (although x and f are only two hops away in this graph, imagine them being really far from each other). To calculate these values efficiently, we’ll use the chain rule of differentiation. From chain rule, we have

$$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} \space \space \space \space \space \space \space \space \space \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} $$

But we already know the df/dq = -4 (which is the difficult term to calculate here if the graph is large). dq/dx and dq/dy are easy since q directly depends on x and y. We have,

$$ q = x + y \space \Rightarrow \space \frac{\partial q}{\partial x} = 1, \space \frac{\partial q}{\partial y} = 1 $$

Hence, we get

$$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} = (-4).1 = -4 $$

And for the input y.

$$ \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} = (-4).1 = -4 $$

And our final graph.

The main reason for doing this backwards is that when we had to calculate the gradient at x, we only used already computed values, and dq/dx (derivative of node output with respect to the same node’s input). We used local information to compute a global value.

Translating to neural networks

So far, you’ve seen the computational graph for a simple function. The computational graph for a typical neural network will be much larger. Below is the graph for a single neuron with two inputs and tanh activation function,

Note that in the computational graph, the weights are nodes too.

The computational graph of a naive neural network plus cost function looks as follows:

Putting it all together

We have a neural network, and its corresponding computational graph G. It takes as input x and gives output c, which represents the error or the cost.

Steps for training a neural network are as follows:

for data point x in dataset
do forward pass with x as input, and calculate the cost c as output
do backward pass starting at c, and calculate gradients for all nodes in graph G. This includes nodes which represent the neural network weights.
update the weights by doing W = W - learning rate * gradients
repeat until stop criteria is met

Conclusion

In this article, we explained what a computational graph is, and how to perform forward pass and backward pass on it. We explained how the computational graph relates to neural networks, and how gradients can be calculated efficiently and used to train neural networks with gradient descent.