# A First Look at Neural Networks

December 09, 2017

# Neurons

The basic foundational unit of a neural network is the *neuron*, which is conceptually quite simple.

Schematic for a neuron in a neural net

Each neuron has a set of inputs, each of which is given a specific weight. The neuron computes some function on these weighted inputs. A linear neuron takes a linear combination of the weighted inputs. A sigmoidal neuron feeds the weighted sum of the inputs into the logistic function, which results in a value between 0 and 1.

The function of a sigmoidal neuron

When the weighted sum is very negative, the return value is very close to 0. When the weighted sum is very large and positive, the return value is very close to 1. The logistic function is important because it introduces a *non-linearity*, and this is important to enable the neural network to learn more complex models. In the absence of these non-linear functions (called *activation functions*), the entire neural network would be a linear function, and the layers would not be useful.

# Neural networks

Regardless of the choice of the activation function, the value outputted by the neuron is transmitted to other neurons as its input. A neural network comes about when we start hooking up neurons to each other, to the input data, and to the “outlets,” which correspond to the network’s answer to the learning problem. To make this structure easier to visualize, a simple example of a neural net is included below. We let w^{(k)}_{{i,j}} be the weight of the link connecting the *i*th neuron in the *k*th layer with the *j*th neuron in the (*k+1)*st layer:

An example of a neural net with 3 layers and 3 neurons per layer

Similar to how neurons are generally organized in layers in the human brain, neurons in neural nets are often organized in layers as well - neurons on the bottom layer receive signals from the inputs, neurons in the top layers have their outlets connected to the “answer,” and usually there are no connections between neurons in the same layer (although this is an optional restriction, more complex connectivities are allowed in general). We also note that in this example, there are no connections that lead from a neuron in a higher layer to a neuron in a lower layer (i.e., no directed cycles). These neural networks are called *feed-forward* neural networks as opposed to their counterparts, which are called *recurrent* neural networks. For the sake of simplicity, we focus only on feed-forward networks throughout this discussion.

# Some important notes

Further, here are some important notes to keep in mind:

- Although every layer has the same number of neurons in this example, this is not necessary.
- The inputs and outputs are vectorized representations. For example, you might imagine a neural network where the inputs are the individual pixel RGB values in an image represented as a vector. The last layer might have 2 neurons which correspond to the answer to our problem: [0,1] if the image contains a dog, [1,0] if the image contains a cat, [0,0] if it contains neither, and [1,1] if it contains both.
- The layers of neurons that lie sandwiched between the first layer of neurons (input layer) and the last layer of neurons (output layer), are called
*hidden layers*. This is where most of the magic is happening when the neural net tries to solve problems. Taking a closer look at the activities of hidden layers can tell us a lot about the features the network has learned to extract from the data.

Also, note that it is not required that a neuron has its outlet connected to the inputs of every neuron in the next layer. Different architectures of neural networks are obtained by selecting which neurons to connect to which other neurons in the next layer.

# Pseudocode for calculating output (forward-pass)

The following is the pseudocode for doing the forward-pass, i.e. calculating the output of the neural network given the input.

```
# node[] := array of nodes, topologically sorted
# (i.e. if there is an edge from a to b, then a is to the left of b
# moreover, if the NN has R inputs and S outputs,
# then first R nodes are input nodes. and last S nodes are output nodes.
# incoming[x] := list of nodes connected to node x
# weight[x] := edge weights of incoming edges to x
for each neuron x from left to right:
if x <= R: do nothing # its an input node, we already know the value
inputs[x] = [output[i] for i in incoming[x]]
weighted_sum = dot_product(weights[x], inputs[x])
output[x] = activation_function(weighted_sum)
```

# Training a neural network: Back-propagation algorithm and backward-pass

There is no closed form solution for finding the optimal values of the weights (or parameters) of a neural network. To train a neural network, we use the iterative gradient descent method. That is, we start with random initialization of the weights. And then repeatedly make predictions on some subset of the data (forward-pass), calculate the corresponding cost function C, and update each weight w by an amount proportional to dC/dw, i.e. the derivative of the cost functions w.r.t. the weight. The proportionality constant is known as the *learning rate*.

The gradients can be calculated efficiently using the back-propagation algorithm. The key observation of backprop is that because of the chain rule of differentiation, the gradient at each neuron in the neural network can be calculated using the gradient at the neurons it has outgoing edges to. Hence, we calculate the gradients *backwards*, i.e. first calculate the gradients of the output layer, then the top-most hidden layer, then the preceding hidden layer, and so on, ending at the input layer.

Typically, the back-propagation algorithm is implemented using the concept of a computational graph, where each neuron is expanded to many *nodes* in the computational graph and performs a simple mathematical operation like addition, multiplication. The computational graph doesn’t have any values (weights) on the edges, all values go at the nodes, so the weights become their own nodes. The backprop algorithm is then run on the computational graph. Once the calculation is complete, only the gradients of the *weight nodes* are required for update. The rest of the gradients can be thrown away.

# Appendix: Pseudocode for forward-pass using matrix operations

If the neural network consists of only fully-connected layers, then the calculations we saw can be written using matrices. This is common practice because it leads to higher computational efficiency, allowing the use of vectorized mathematical operations (a feature most computing hardware supports). In this case, the forward-pass pseudocode would look as follows:

```
# Layer 0 is the input. NN has k layers.
# Z[i] := output of the i-th layer.
# Z[0] := input (given).
# Z[1], Z[2], ... Z[k-1] := hidden layers.
# Z[k] := output layer.
# W = list of weight matrices. B = list of bias vectors.
# W[i] = weights connecting layer i to layer i+1.
# W[i] has shape size[i] x size[i+1], where size[j] is the number of nodes in layer [j]
Z[0] = input
for i in 1 ... k:
Z[i] = Z[i-1] * W[i] + B[i]
return Z[k]
# Yes. The above is complete. Forward calculation of a NN given
# input in matrix notation is 2 lines of code.
```