In my last post I described the DRAW model of recurrent auto-encoders. As far as I've seen, the only implementations of DRAW floating around Github deal with the MNIST dataset. While they are helpful for reference, I wanted to have a model that could successfully generate photographs, not just black-and-white digits.
In this post, I'll go over the steps and thinking behind taking a working implementation of DRAW, and adjusting it to work on colored pictures.
The first question to ask is: what changed? In the original DRAW model, the input data consists of many 28x28 arrays of either 1 or 0. There's already a problem: images don't consist of only pure white or pure black pixels. They have a range, usually from 0 - 255.
The loss function
To account for this, we'll replace the log likelihood loss function with an L2 loss, which takes a sum of the squared diff...
A few weeks ago I made a post on Variational Autoencoders, and how they can be applied to image generation. In this post, we'll be taking a look at DRAW: a model based off of the VAE that generates images using a sequence of modifications rather than all at once.
In most image generation methods, the actual generation network is a bunch of deconvolution layers. These map from some initial latent matrix of parameters to a bigger matrix, which then maps to an even bigger matrix, and so on.
However, there were a couple of downsides to using a plain GAN.
First, the images are generated off some arbitrary noise. If you wanted to generate a picture with specific features, there's no way of determining which initial noise values would produce that picture, other than searching over the entire distribution.
Second, a generative adversarial model only discriminates between "real" and "fake" images. There's no constraints that an image of a cat has to look like a cat. This leads to results where there's no actual object in a generated image, but the style just looks like picture.
In this post, I'll go over the variational autoencoder, a type of network that solves these two problems.
There's been a lot of advances in image classification, mostly thanks to the convolutional neural network.
It turns out, these same networks can be turned around and applied to image generation as well. If we've got a bunch of images, how can we generate more like them?
A recent method, Generative Adversarial Networks, attempts to train an image generator by simultaneously training a discriminator to challenge it to improve.
To gain some intuition, think of a back-and-forth situation between a bank and a money counterfeiter. At the beginning, the fakes are easy to spot. However, as the counterfeiter keeps trying different kinds of techniques, some may get past the check. The counterfeiter then can improve his fakes towards the areas that got past the bank's security checks.
But the bank doesn't give up. It also keeps learning how to tell the fakes apart from real money. After a long period of b...
In order to do this, we first need to define and train a convolutional network. Due to lack of training power, I couldn't train on ImageNet and had to use CIFAR-10, a dataset of 32x32 images in 10 classes. The network structure was pretty standard: two convolutional layers, each with 2x2 max pooling and a reLu gate, followed by a fully-connected layer and a softmax classifier.
We're only trying to visualize the features in the convolutional layers, so we can effectively ignore the fully-connected and softmax layers.
Features in a convolutional network are simply numbe...
The paper A Neural Algorithm of Artistic Style detailed on how to extract two sets of features from a given image: the content, and and the style.
In convolutional neural networks, each layer stores information in an abstraction based on the previous layer. For example, the first layer may search for dark pixels in a line to represent an edge. The next layer may then look for two perpendicular edges to represent a corner. The last layer can then return a classification based on which of the features are present and how they are arranged.
If we pass an image through a convolutional network and record the activations of each layer, or what information was passed through, we can retrieve a general structure of the contents of an image.
This information changes based on which layer is used. Information from the first layer would contain what specific pixels were present, while higher layers might show where edges are present, but not what pattern of pixels are considered edges.