A novel regularization is proposed that encourages the model's output to be isotropically smooth around each input data point, whether or not a training label is available. This is achieved by probing the local anisotropy around each input point and smoothing the model in its most anisotropic direction.
To find the local anisotropy around a data point, they search for the direction of a small perturbation to which the model is most sensitive. Sensitivity is measured as the KL divergence between the output class probability distributions for the input and for the input plus the perturbation; the resulting perturbation is called the 'virtual adversarial perturbation'. The model is trained to minimize the supervised loss (cross-entropy, MSE, etc.) together with its sensitivity to the virtual adversarial perturbation.
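A minimal PyTorch sketch of this regularizer may help (the function and hyperparameter names `xi`, `eps`, `n_iters` are mine, not the paper's; the standard trick is a step of power iteration to approximate the most sensitive direction):

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=2.0, n_iters=1):
    """Sensitivity to the virtual adversarial perturbation:
    KL(p(y|x) || p(y|x + r_adv)), where r_adv points in the most
    anisotropic (most sensitive) direction around x."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)  # reference output distribution

    # Power iteration: refine a random direction toward the direction
    # of steepest KL growth.
    d = torch.randn_like(x)
    for _ in range(n_iters):
        d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
        d.requires_grad_(True)
        log_p_hat = F.log_softmax(model(x + d), dim=1)
        kl = F.kl_div(log_p_hat, p, reduction="batchmean")
        d = torch.autograd.grad(kl, d)[0]

    r_adv = eps * F.normalize(d.flatten(1), dim=1).view_as(x)
    log_p_hat = F.log_softmax(model(x + r_adv), dim=1)
    return F.kl_div(log_p_hat, p, reduction="batchmean")
```

The total objective would then be something like `supervised_loss + alpha * vat_loss(model, x)`; since the regularizer never touches labels, it can also be evaluated on unlabeled inputs, which is what makes the method useful for semi-supervised learning.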
One way to think about why this works is that VAT pushes the decision boundary away from both labeled and unlabeled data points, so small perturbations around an input do not change the predicted class.
[Paper Summary + Doubts] Deep Residual Learning for Image Recognition
This is a great paper that addresses the difficulty of training very deep networks, including the vanishing/exploding gradient problem. The proposed solution is a deep residual learning framework that allows extremely deep CNN models to be trained for various visual recognition tasks. The architecture consists of stacked convolutional layers with identity shortcut connections, each skipping two layers, so that every pair of layers is trained to approximate a residual function of an underlying mapping.
The claim made in the paper is that learning some underlying mapping H(x) is asymptotically equivalent to learning the residual function F(x) = H(x) - x and then adding x back, but that the latter is easier to learn with several layers of a neural network. This intuition isn't very clear to me. Section 3.1 discusses it, and I was wondering if someone could help me understand it.
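To make the formulation concrete, here is a minimal sketch of the basic two-layer block (assuming matching input and output channels, so the identity shortcut needs no projection):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic two-layer block: outputs F(x) + x, so the stacked conv
    layers only need to fit the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut
```

The way I read Section 3.1: if the optimal H(x) is close to the identity, the solver only has to push F(x) toward zero (e.g., drive the conv weights toward zero), which is presumably easier than fitting an identity map from scratch through a stack of nonlinear layers.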
Some further questions and observations:
This framework doesn't seem to have several fully connected layers at the end, as the VGG/AlexNet papers do; instead it ends with global average pooling followed by a single fully connected layer.
This paper focused on solving the degradation problem (accuracy saturating and then dropping as networks get deeper). The paper's explanation is that residual connections make backpropagation more effective in deeper networks. That makes sense, but there's more to the story.
The self-referential formulation of ResNets leads to...
[Research paper] Learning to learn by gradient descent by gradient descent
They train an LSTM (a Neural Turing Machine, NTM, is tested too) to output the update for a steepest descent iteration at each time step. Aptly titled “Learning to learn by gradient descent by gradient descent”.
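A toy sketch of the setup (all names are mine; the real model preprocesses the gradients, uses a two-layer LSTM, and drops second derivatives through `grad` for efficiency, none of which is shown here):

```python
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Learned optimizer: maps the optimizee's gradient to a parameter
    update, one coordinate at a time (coordinates share LSTM weights)."""
    def __init__(self, hidden=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, grad, state):
        h, c = self.cell(grad.reshape(-1, 1), state)
        return self.out(h).reshape(grad.shape), (h, c)

def meta_step(opt_net, theta, f, T=20):
    """Unroll T update steps of a 1-D parameter vector theta on an
    optimizee loss f; the summed losses form the meta-training signal."""
    state, meta_loss = None, 0.0
    for _ in range(T):
        grad, = torch.autograd.grad(f(theta), theta, create_graph=True)
        update, state = opt_net(grad, state)
        theta = theta + update      # theta_{t+1} = theta_t + g_t
        meta_loss = meta_loss + f(theta)
    return meta_loss                # backprop through this trains opt_net
```

Here `f` stands for a hypothetical optimizee (say, a small quadratic); training the LSTM means minimizing `meta_loss` over many sampled optimizees.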
Anybody interested in trying Learning to learn to learn by gradient descent by gradient descent by gradient descent?
One cool thing is that they found that the NTM, with access to a complete memory bank, produced behavior similar to that of a second-order optimization algorithm like L-BFGS (I think because being able to store a Jacobian matrix should allow one to estimate a Hessian matrix?).
A clockwork RNN is a recurrent neural network (RNN) architecture designed to remember information easily over long periods of time: its hidden units are partitioned into modules that update at different clock rates, so slow modules can carry long-term context. It is a follow-up to the Long Short-Term Memory architecture (Jürgen Schmidhuber is an author of both).
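Roughly, one step looks like this (a simplified sketch; class and argument names are mine, and the block-triangular recurrent connectivity of the paper is omitted):

```python
import torch
import torch.nn as nn

class ClockworkRNNCell(nn.Module):
    """One step of a clockwork RNN: the hidden state is split into
    modules with exponentially increasing periods (1, 2, 4, ...); at
    time t only modules whose period divides t are updated, the rest
    keep their previous state, so slow modules retain long-term memory."""
    def __init__(self, input_size, module_size, n_modules):
        super().__init__()
        hidden = module_size * n_modules
        self.module_size = module_size
        self.register_buffer(
            "periods", torch.tensor([2 ** i for i in range(n_modules)]))
        self.wx = nn.Linear(input_size, hidden)
        self.wh = nn.Linear(hidden, hidden, bias=False)

    def forward(self, x, h, t):
        h_new = torch.tanh(self.wx(x) + self.wh(h))
        active = (t % self.periods == 0).repeat_interleave(self.module_size)
        return torch.where(active, h_new, h)  # inactive modules keep old state
```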
In my opinion, the paper is quite interesting and deserves more attention than it has received.
In general, deep learning models are not naturally suited to structured learning problems.
In this paper, they introduce a variant of deep recurrent neural networks that learns to parse a sentence by learning the transitions of a shift-reduce parser. One of their main contributions is batching this algorithm: despite the parse structure varying between examples, they manage to devise a batched implementation.
They apply this unusual architecture to the natural language inference problem.
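For intuition, here is an unbatched sketch of the shift-reduce encoding (names are mine; the batched algorithm that handles differing transition sequences across examples, which is the paper's actual contribution, is not shown):

```python
import torch
import torch.nn as nn

class ShiftReduceEncoder(nn.Module):
    """Unbatched sketch: encode a sentence by following shift-reduce
    transitions. SHIFT pushes the next word vector onto the stack;
    REDUCE pops the top two vectors and pushes a learned composition."""
    def __init__(self, dim):
        super().__init__()
        self.compose = nn.Linear(2 * dim, dim)

    def forward(self, word_vecs, transitions):
        stack, buffer = [], list(word_vecs)  # word_vecs: 1-D vectors in order
        for op in transitions:               # e.g. ["shift", "shift", "reduce", ...]
            if op == "shift":
                stack.append(buffer.pop(0))
            else:  # "reduce"
                right, left = stack.pop(), stack.pop()
                stack.append(torch.tanh(self.compose(torch.cat([left, right]))))
        return stack[0]  # a single vector encoding the whole sentence
```

For natural language inference, the premise and hypothesis would each be encoded this way and the two vectors fed to a classifier.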