Convolutional Neural Networks (CNNs, or ConvNets) are a neural network architecture that has been very successful in computer vision applications and is also widely used in applications that process media such as audio and video. The main difference between a standard neural network and a CNN is a special type of neural network layer, called the convolutional layer.

The deep learning research paper that caused the field's revival in 2012 used a CNN for image classification. A standard CNN architecture for image classification takes an image as the input, passes it through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers, and produces an output. The output is a probability distribution over the classes that best describe the image.

We'll describe the layers of a CNN in the order in which they are stacked in a standard implementation.

# Input Layer

Let's assume that the input to a CNN is a 32 x 32 x 3 array of pixel values. This 32 x 32 x 3 array is a 32 pixel x 32 pixel color image, where the 3 comes from the RGB values of each image pixel. Each value is normalized to be between 0 and 1. For more info, see: How do computers see an image?
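As a minimal sketch of the input described above, the snippet below builds a 32 x 32 x 3 array and scales it to the range 0 to 1. The random pixel data and the assumption of 8-bit pixel values (divided by 255) are illustrative choices, not something the text specifies.

```python
import numpy as np

# Hypothetical 32 x 32 RGB image with 8-bit values (0..255),
# generated randomly here in place of a real photo.
rng = np.random.default_rng(0)
raw_image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Scale every pixel value into the [0, 1] range (assumes 8-bit input).
image = raw_image.astype(np.float32) / 255.0

print(image.shape)  # (32, 32, 3)
```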

# Convolutional Layer – Math Part

The first layer in a CNN is a **Convolutional (conv) Layer**. Imagine a flashlight that is shining over the top left of the image. Let’s say that the light this flashlight shines covers a 5 x 5 area. The dimensions of the volume over which the flashlight is shining are 5 x 5 x 3. In machine learning terms, this flashlight is called a **filter** (sometimes **kernel**) and the region that it is shining over is called the **receptive field**. The filter is itself an array of numbers, which are the weights (or parameters) of the neural network. The dimensions of the filter are the same as those of the volume it shines upon, which in this case is 5 x 5 x 3.

The top left value at the next layer is obtained by taking a dot product of the filter values and the volume of neurons it is shining upon, i.e. doing element-wise multiplications of the 5 x 5 x 3 neurons and the 5 x 5 x 3 filter weights, and then summing them up. This gives us a single number.

This process is then repeated with the flashlight sliding across all the areas of the input image (the next step would be moving the filter to the right by 1 unit, then right again by 1, and so on). Every unique location on the input volume produces a number. After sliding the filter over all the locations, we get a 28 x 28 x 1 array of numbers (28 x 28 because the filter size is 5 x 5 and we are shifting by 1 each time, so there are 32 - 5 + 1 = 28 positions along each dimension).

The entire operation described above is known as a convolution, i.e. the entire input volume was convolved with the filter, and it produced the 28 x 28 x 1 output.
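The sliding dot product described above can be sketched in a few lines of NumPy. This is an illustrative, unoptimized version: the function name `convolve_single_filter` and the random input and filter values are assumptions made for the example. Like most deep learning code, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve_single_filter(volume, kernel, stride=1):
    """Slide one K x K x D filter over an H x W x D volume (no padding)."""
    height, width, depth = volume.shape
    k = kernel.shape[0]
    assert kernel.shape == (k, k, depth), "filter depth must match input depth"
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = volume[r:r + k, c:c + k, :]   # the receptive field
            output[i, j] = np.sum(patch * kernel)  # multiply element-wise, then sum
    return output

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # stand-in for a normalized image
kernel = rng.standard_normal((5, 5, 3))  # stand-in for learned weights
activation_map = convolve_single_filter(image, kernel)
print(activation_map.shape)  # (28, 28)
```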

Source: "Neural Networks and Deep Learning" by Michael Nielsen.

Now, we'll actually have multiple filters (let's say 6 in this example). Then our output volume would be 28 x 28 x 6. By using more filters, we are able to preserve more information about the original image, but we increase the computational requirements.

Last but not least, the result of the convolutions is passed through an activation function. The most common activation to use is ReLU, but we can also use sigmoids or other activation functions.
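To make the shapes concrete, here is a rough sketch of a conv layer with 6 filters followed by ReLU, under the same assumptions as the text (32 x 32 x 3 input, 5 x 5 x 3 filters, stride 1, no padding). The function name `conv_layer` and the random weights are hypothetical, and bias terms are left out for brevity.

```python
import numpy as np

def conv_layer(volume, filters):
    """Apply N filters (N x K x K x D) to an H x W x D volume, then ReLU.

    Stride 1, no padding; biases omitted to keep the sketch short.
    """
    n_filters, k = filters.shape[0], filters.shape[1]
    out_h = volume.shape[0] - k + 1
    out_w = volume.shape[1] - k + 1
    pre_activation = np.zeros((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            patch = volume[i:i + k, j:j + k, :]
            # One dot product per filter, summing over the K x K x D axes.
            pre_activation[i, j, :] = np.tensordot(filters, patch, axes=3)
    return np.maximum(pre_activation, 0.0)  # ReLU

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
filters = rng.standard_normal((6, 5, 5, 3))
output = conv_layer(image, filters)
print(output.shape)  # (28, 28, 6)
```

Each additional filter adds one more 28 x 28 slice to the output, which is why more filters mean more computation.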

# Convolutional Layer – Intuition

Let’s understand intuitively what the convolution is doing. Each of these filters can be thought of as a **feature identifier**, where features are things such as straight edges, simple colors, and curves. These are the simplest characteristics that all images have in common with each other.

To explain this, we'll work with grayscale images, of dimensions say 32 x 32 x 1, and let's assume that our filter is 7 x 7 x 1. Our first filter is going to be a curve detector. As a curve detector, the filter will have a pixel structure in which there will be higher numerical values along the area that is a shape of a curve.

Below is the image that we have as input. Let’s put our filter at the top left corner.

The filter computes element-wise multiplications between its values and the pixel values in that region.

Notice that because there is a shape that generally resembles the curve that this filter is representing, all of the multiplications summed together will result in a large value.

If we move our filter to a location which does not resemble the curve, the resulting value will be much lower.

This is because there wasn’t anything in the image section that responded to the curve detector filter.

When we apply the convolution over the entire image, the result will be a map that shows which areas in the image resemble the curve that the filter represents. In this example, the top left value of our 26 x 26 x 1 activation map (32 - 7 + 1 = 26 positions along each side) will be 6600. This high value means that it is likely that there is some sort of curve in the input volume that caused the filter to activate. The top right value in our activation map will be 0 because there wasn’t anything in the input volume that caused the filter to activate (or more simply said, there wasn’t a curve in that region of the original image).
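The idea that a filter responds strongly only where the image resembles its pattern can be illustrated with a toy example. The filter and patches below are made up for illustration (a diagonal stroke stands in for the curve, and the weights 30 and intensities 50 are arbitrary), so the numbers differ from the 6600 in the text.

```python
import numpy as np

# Hypothetical 7 x 7 filter: high weights along a diagonal stroke, zeros elsewhere.
stroke_filter = np.zeros((7, 7))
for i in range(7):
    stroke_filter[i, 6 - i] = 30.0

# Patch containing the same stroke (pixel intensity 50 along it).
matching_patch = np.zeros((7, 7))
for i in range(7):
    matching_patch[i, 6 - i] = 50.0

# Patch containing a different shape: a vertical line in the first column.
other_patch = np.zeros((7, 7))
other_patch[:, 0] = 50.0

# Patch with nothing in it.
blank_patch = np.zeros((7, 7))

strong = np.sum(stroke_filter * matching_patch)
weak = np.sum(stroke_filter * other_patch)
none = np.sum(stroke_filter * blank_patch)

print(strong, weak, none)  # 10500.0 1500.0 0.0
```

The matching patch overlaps the filter at all 7 stroke positions, the vertical line overlaps at only one, and the blank patch at none.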

The above is just for one filter, which detects lines that curve outward and to the right. There are multiple filters, which each detect different types of features in the image - such as vertical lines, horizontal lines, lines that curve to the left, and so on. The more filters, the greater the depth of the activation map, and the more information we have about the input volume.

Examples of actual visualizations of the filters from the first conv layer of a CNN. Some of them detect horizontal, vertical and diagonal edges, and others detect various color gradients. Source: Stanford's CS 231N course taught by Andrej Karpathy and Justin Johnson.

# Filter size, Stride and Padding

There are 3 main parameters that we can change to modify the behavior of a conv layer - **filter size**, **stride** and **padding**.

Filter size refers to the dimensions of the filter (and hence of the input volume it shines upon), for example 3 x 3, 5 x 5, or 7 x 7. The depth of the filter is the same as the depth of the previous layer.

Stride controls how far apart the patches are that we convolve with the filter. Larger strides cause the output volume to shrink. In the example above, we used a stride of 1.

Below is a visualization of a 7 x 7 input volume, and 3 x 3 filter (3rd dimension removed for simplicity), and a stride of 1.

And here's what happens when the stride is 2:

The receptive field is shifting by 2 units now and hence the output volume shrinks.

If we tried to set our stride to 3, then we’d have to throw away information about the corners, since our third patch would go outside the input volume. To fix this, we apply a zero padding to the input. If we apply a zero padding of width 2 on a 32 x 32 input, the result would be 36 x 36 input volume.
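Zero padding can be sketched with NumPy's `np.pad`. The example pads only the two spatial dimensions and leaves the depth untouched; the random input is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

# Pad 2 zeros on every side of the height and width axes, none on depth.
pad = 2
padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")

print(padded.shape)  # (36, 36, 3)
```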

Given an input of dimension W x W x D, N filters of dimension K x K x D, a padding of P and stride of S, our output would be O x O x N, where

$$O = \frac{W - K + 2P}{S} + 1$$
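The output size of a conv layer is commonly computed as O = (W - K + 2P) / S + 1. The helper below is a small sketch of that relation; the function name `conv_output_size` is made up for this example, and rejecting strides that do not divide evenly is one possible design choice (some libraries round down instead).

```python
def conv_output_size(w, k, p=0, s=1):
    """Spatial output size for input width w, filter size k, padding p, stride s."""
    span = w - k + 2 * p
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        raise ValueError("stride does not fit the padded input evenly")
    return span // s + 1

print(conv_output_size(32, 5))            # 28 (the 5 x 5 filter example)
print(conv_output_size(7, 3, s=1))        # 5
print(conv_output_size(7, 3, s=2))        # 3
print(conv_output_size(7, 3, p=1, s=3))   # 3 (stride 3 only fits after padding)
print(conv_output_size(32, 5, p=2))       # 32 (padding preserves the size)
```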

# Pooling Layers

The convolutional layers often alternate with pooling layers, also referred to as downsampling layers. There are several types of pooling layers: max-pooling is the most popular, and average pooling and L2 pooling are some other options. The pooling operation takes a filter (usually of size 2 x 2) and a stride of the same length, and applies it to the input volume. For example, an input volume of dimensions 4 x 4 x D becomes an output volume of dimensions 2 x 2 x D. Note that the depth of the feature maps remains unchanged in this operation (the pooling happens separately on each depth slice). In max-pooling, the output is the maximum of each region of the input volume.

The intuitive reasoning behind this layer is that once we know that a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its location relative to the other features. This layer drastically reduces the spatial dimensions of the input volume (the length and the width change but not the depth), which reduces the computational requirements for subsequent layers.
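As a minimal sketch, 2 x 2 max-pooling with a stride of 2 can be written with a reshape, assuming the height and width are even. The function name `max_pool_2x2` and the small test input are illustrative.

```python
import numpy as np

def max_pool_2x2(volume):
    """2 x 2 max-pooling with stride 2 on an H x W x D volume (H, W even)."""
    h, w, d = volume.shape
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    # Split each spatial axis into (block index, position inside the block).
    blocks = volume.reshape(h // 2, 2, w // 2, 2, d)
    # Take the maximum inside each 2 x 2 block, separately for every depth slice.
    return blocks.max(axis=(1, 3))

x = np.arange(16).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled.shape)     # (2, 2, 1)
print(pooled[:, :, 0])  # [[ 5  7]
                        #  [13 15]]
```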

# Review

Let's review the overall structure of a CNN.

We talked about what the filters in the first conv layer are designed to detect. They detect low level features such as edges and curves.

When we go through another conv layer, the output of the first conv layer becomes the input of the 2nd conv layer. Now, this is a little bit harder to visualize. When we were talking about the first layer, the input was just the original image. However, when we’re talking about the 2nd conv layer, the input is the activation map(s) that result from the first layer. This layer is describing the locations in the original image where certain low level features appear. When we apply a set of filters on top of that (pass it through the 2nd conv layer), the output will be activations that represent features themselves composed of lower level features, i.e. slightly higher level features. Types of these features could be semicircles (combination of a curve and straight edge) or squares (combination of several straight edges) or corners.

As we go through the network and apply more conv layers, we get activation maps that represent more and more complex features. If we look at the top-most convolutional layer, we may have some filters that activate when there are high level features such as a hand, a paw, or an ear in the image. Note that as we go deeper into the network, the filters begin to have a larger and larger receptive field, which means that they are able to consider information from a larger area of the original input volume (i.e. they are responsive to larger regions of the original image).
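The growth of the receptive field with depth can be estimated with a standard recurrence: each layer adds (K - 1) times the current spacing between neighbouring outputs, and that spacing is multiplied by the layer's stride. The layer stack below (two 5 x 5 conv layers, each followed by 2 x 2 pooling) is a hypothetical example, not an architecture from the text.

```python
def receptive_field_sizes(layers):
    """Receptive field (in input pixels) after each layer.

    `layers` is a list of (filter_size, stride) pairs, input side first.
    """
    size, jump = 1, 1  # a single input pixel sees itself; spacing is 1 pixel
    sizes = []
    for filter_size, stride in layers:
        size += (filter_size - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes

# Hypothetical stack: conv 5x5 (stride 1), pool 2x2 (stride 2), repeated twice.
stack = [(5, 1), (2, 2), (5, 1), (2, 2)]
print(receptive_field_sizes(stack))  # [5, 6, 14, 16]
```

In this sketch, a unit after the second pooling layer sees a 16 x 16 region of the original image, half the side length of a 32 x 32 input.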

Lastly, we have some fully connected layers at the top of a CNN, which take as input the high level feature maps output by the convolutional layers, and map them to the classes we want to predict. These layers learn which high level features correlate with a particular class (for example, that a beak or wings is highly indicative of the image being that of a bird).
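A single fully connected layer on top of the final feature maps can be sketched as a flatten, a matrix multiplication, and a softmax that turns the scores into class probabilities. The feature map size (5 x 5 x 16), the 10 classes, the random weights, and the use of softmax are all assumptions made for this illustration.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

rng = np.random.default_rng(0)
feature_map = rng.random((5, 5, 16))  # hypothetical output of the last pooling layer
num_classes = 10                      # hypothetical number of classes

# Fully connected layer: every flattened feature connects to every class score.
features = feature_map.reshape(-1)    # 5 * 5 * 16 = 400 values
weights = rng.standard_normal((num_classes, features.size)) * 0.01
probabilities = softmax(weights @ features)

print(probabilities.shape)  # (10,)
```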

# Choosing Hyperparameters

How do we choose the number of layers to use, the filter sizes, the stride lengths and other hyperparameters of the model? Here are some tips:

- As we increase the number of layers, the neural network becomes more prone to overfitting and the computational requirements increase. So, the number of layers should be as large as possible, given the constraints of overfitting and computational resources.
- Usually, filter sizes decrease as the number of layers increases. Usually, we want the top-most conv layers to have a large receptive field (where large = a significant fraction of the input image). We want the stride lengths to be such that there is some overlap between the patches being considered.

That being said, popular ConvNets for the image classification problem range from a few layers to 150 layers.

# Training

CNNs are trained in a similar manner to standard neural networks, i.e. using gradient descent or its variations, with regularization techniques such as early stopping, dropout and data augmentation.
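Two of the regularization techniques mentioned above are easy to sketch in isolation. The snippet shows one simple form of data augmentation (a random horizontal flip) and the commonly used "inverted" variant of dropout; the function names, probabilities, and inputs are illustrative assumptions, and a real training loop would apply them inside each gradient descent step.

```python
import numpy as np

def random_horizontal_flip(image, rng, probability=0.5):
    """Data augmentation: mirror an H x W x D image left-to-right at random."""
    if rng.random() < probability:
        return image[:, ::-1, :]
    return image

def dropout(activations, rng, drop_probability=0.5):
    """Inverted dropout: zero out units at random and rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= drop_probability
    return activations * keep_mask / (1.0 - drop_probability)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = random_horizontal_flip(image, rng)  # same shape, possibly mirrored

hidden = np.ones(1000)
dropped = dropout(hidden, rng)  # roughly half the units become 0, the rest 2.0
```

Dropout is applied only during training; at test time the activations are used unchanged.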

# Visualizing ConvNet filters

Matt Zeiler and Rob Fergus have an excellent research paper discussing the topic. Jason Yosinski also has a video on YouTube that provides a great visual representation.

# Biological Connection

CNNs take biological inspiration from the structure of the visual cortex, the region of the human brain responsible for processing visual input (i.e. once light that falls on the retina is converted to neurological signals). The visual cortex is commonly divided into several areas, named V1 through V6.

It is composed of small regions of cells that are sensitive to specific regions of the visual field. This idea was expanded upon by a fascinating experiment by Hubel and Wiesel in 1962 (Video) where they showed that some individual neuronal cells in the brain responded (or fired) only in the presence of edges of a certain orientation. For example, some neurons fired when exposed to vertical edges and some when shown horizontal or diagonal edges.

There are numerous studies showing similarities between the lower layers of ConvNets and the V1 and V2 areas of the visual cortex. Understanding what's going on in the higher areas (V3 onwards) has proved more challenging since the features are tougher to interpret.
