Turing, in his 1950 Mind paper, proposed an operational, behavioral alternative to the philosophical question "Can machines think?" by suggesting a simple "Turing test": machines play the "imitation game" and humans are tasked with discerning machine from human given only their responses. He believed that even partial success at this goal, with only 5 minutes of interaction, would be hard and far off.
The Turing test hasn't yet been passed (except in restricted settings, e.g. Siri or Watson), but his prediction that "one will be able to speak of machines thinking without expecting to be contradicted" has proved true: "smart" computers have become commonplace.
One of the reasons the Turing test hasn't been met yet is the failures today's intelligent systems still make. Their capabilities are limited in the types of questions they can handle, the domains they cover, and their ability to handle unexp...
This paper introduces a deep convolutional neural network (CNN) architecture that achieved record-breaking performance in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012). Notably, it brings together a number of neat ideas in a single end-to-end, trainable model. Main contributions:
Achieves state-of-the-art performance in ILSVRC-2012.
Makes available an efficient, parallelized GPU implementation of their model.
Describes in detail the features of their model that help in improving performance and reducing training time, along with extensive ablative studies.
Uses data augmentation and dropout to prevent overfitting.
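As a rough illustration of the last point, here is a minimal sketch of the two overfitting countermeasures, random crops plus horizontal flips at train time and dropout on the fully connected layers; it assumes PyTorch/torchvision (which the paper predates), and the exact layer sizes are only illustrative.

```python
# Minimal sketch (assumed PyTorch/torchvision APIs, illustrative sizes).
import torch.nn as nn
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomCrop(224),          # random 224x224 patches from the resized image
    transforms.RandomHorizontalFlip(),   # mirror images with probability 0.5
    transforms.ToTensor(),
])

classifier_head = nn.Sequential(
    nn.Dropout(p=0.5),                   # drop half the activations during training
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),               # 1000 ImageNet classes
)
```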
This paper introduces DeconvNet, a novel visualization technique for understanding the representations learnt by intermediate layers of a deep convolutional neural network. Using DeconvNet visualizations as a diagnostic tool in different settings, the authors propose changes to the model of Krizhevsky et al. that perform slightly better and generalize well to other datasets. Key contributions:
Feature activations are mapped back to input pixel space by setting other activations in the layer to zero and successively unpooling, rectifying and filtering (using the same parameters).
Unpooling is approximated using switch variables that remember the location of the highest input activation within each pooling region (hence these visualizations are image-specific).
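The switch-based unpooling above maps cleanly onto max-pooling with recorded argmax indices; a minimal sketch, assuming PyTorch (not the authors' code):

```python
# Sketch of "switch"-based unpooling: pooling records the argmax locations,
# and unpooling places values back at those locations, filling the rest with
# zeros -- which is why the visualizations are tied to a specific input image.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 56, 56)            # some feature map
pooled, switches = pool(x)                # switches = argmax locations per pooling region
reconstructed = unpool(pooled, switches)  # values restored at recorded positions, zeros elsewhere
```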
This paper proposes a modified convolutional network architecture with increased depth, smaller filters, data augmentation and a number of engineering tricks; an ensemble of these networks achieves second place in the classification task and first place in the localization task at ILSVRC-2014. Main contributions:
Experiments with architectures of different depths, from 11 to 19 weight layers.
Changes in architecture:
Smaller (3x3) convolution filters: a stack of two 3x3 layers has an effective receptive field of 5x5 (three give 7x7), with fewer parameters and more non-linearities than a single larger filter.
1x1 convolutions: a linear transformation of the input channels followed by a non-linearity; increases the discriminative capability of the decision function (see the sketch below).
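A minimal sketch of the two ideas above, assuming PyTorch and illustrative channel counts rather than the paper's exact configuration:

```python
# Two stacked 3x3 convolutions (effective 5x5 receptive field, fewer parameters,
# extra non-linearity) and a 1x1 convolution acting as a per-pixel linear map
# over channels followed by a ReLU.
import torch.nn as nn

vgg_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # two 3x3 ~ one 5x5
    nn.MaxPool2d(kernel_size=2, stride=2),
)

one_by_one = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=1),   # linear transformation of input channels
    nn.ReLU(inplace=True),                # followed by a non-linearity
)
```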
This paper presents a simple approach to mapping an input sequence to an output sequence. They use a multi-layer LSTM-based encoder-decoder architecture and show promising results on the task of neural machine translation. Their approach beats a phrase-based statistical machine translation baseline by more than 1 BLEU point and is close to state-of-the-art when used to re-rank the 1000-best predictions from the SMT system. Main contributions:
The first LSTM encodes an input sequence to a single vector, which is then decoded by a second LSTM. End of sequence is indicated by a special character.
4-layer deep LSTMs.
160k source vocabulary, 80k target vocabulary. Trained on 12M sentences. Words in the output sequence are generated by a softmax over the fixed target vocabulary.
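A minimal encoder-decoder sketch of the setup described in this list, assuming PyTorch; the embedding and hidden sizes are illustrative, not the paper's configuration:

```python
# Encoder LSTM compresses the source sequence into a fixed-size state,
# which initializes a decoder LSTM that emits target words via a softmax
# over a fixed vocabulary.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=160_000, tgt_vocab=80_000, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)              # softmax over the fixed target vocabulary

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))        # input sequence -> single encoder state
        out, _ = self.decoder(self.tgt_emb(tgt_ids), state)   # decode conditioned on that state
        return self.proj(out)                                 # logits over target words
```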
This paper presents R-CNN, an approach to object detection using CNNs pre-trained for image classification. Object proposals are extracted from the image using Selective Search, dilated by a few pixels, warped to the CNN input size and fed into the CNN to extract features (they experiment with pool5, fc6 and fc7). The extracted feature vectors are scored by SVMs, one per class. Bounding-box regression, which predicts parameters to move a proposal closer to the ground truth, further improves localization.

The authors use AlexNet, pre-trained on ImageNet and fine-tuned for detection. Object proposals with IoU overlap greater than 0.5 with a ground-truth box are treated as positive examples and the rest as negative, and a 21-way classification task (20 object categories + background) is set up to fine-tune the CNN. After fine-tuning, per-class SVMs are trained, taking only the ground-truth boxes as positives and proposals with IoU <= 0.3 as negatives.

R-CNN achieves major performance improvements on the PASCAL VOC 2007/2010 and ILSVRC2013 detection datasets. Finally, the method is extended to semantic segmentation and achieves competitive results.
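A small sketch of the IoU-based labeling rule used for fine-tuning, in plain Python; the function names are illustrative, not from the authors' code:

```python
# IoU overlap between axis-aligned boxes, and the >= 0.5 rule for labeling
# proposals as positives (class of the best-overlapping ground-truth box)
# during CNN fine-tuning; SVM training instead uses only ground-truth boxes
# as positives and proposals with IoU <= 0.3 as negatives.
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def finetune_label(proposal, gt_boxes, gt_labels):
    overlaps = [iou(proposal, gt) for gt in gt_boxes]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return gt_labels[best] if overlaps[best] >= 0.5 else 0   # 0 = background class
```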
Neural Turing Machine (NTM) consists of a neural network controller interacting with a working memory bank in a learnable manner. This is analogous to a computer: the controller is the CPU (with hidden activations as registers) and the memory matrix is the RAM. Key ideas:
Controller (modified RNN) interacts with external world via input and output vectors, and with memory via read and write "heads"
"Read" vector is a convex combination of row-vectors of M_t (memory matrix at time t) — r_t = \sum w_t(i) M_t(i) where w_t is a vector of weightings over N memory locations
"Writing" is decomposed into 1) erasing and 2) adding
The write head produces the erase vector e_t and the add vector a_t along with the vector of weightings over memory locations w_t
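A numpy sketch of the read and erase/add write steps described in this list; the function and variable names are illustrative, not from the paper:

```python
# NTM addressing: reads are convex combinations of memory rows, writes first
# erase and then add, both modulated by the location weighting w.
import numpy as np

def read(memory, w):
    """r_t = sum_i w_t(i) M_t(i): weighted combination of memory rows."""
    return w @ memory                            # (N,) x (N, M) -> (M,)

def write(memory, w, erase, add):
    """Decompose writing into an erase step followed by an add step."""
    memory = memory * (1 - np.outer(w, erase))   # erase: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t)
    memory = memory + np.outer(w, add)           # add:   M_t(i)  = M~_t(i) + w_t(i) a_t
    return memory
```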
This paper studies a very natural generalization of convolutional layers by replacing a single filter that slides over the input feature map with a "micro network" (multi-layer perceptron). The authors argue that good abstractions are highly non-linear functions of input data and instead of generating an overcomplete number of feature maps and shrinking them down in higher layers (as is the case in traditional CNNs), it would be beneficial to generate better representations on each local patch, before feeding into the next layer. Main contributions:
Replaces the convolutional filter with a multi-layer perceptron.
Instead of fully connected layers, uses global average pooling.
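A minimal sketch of these two changes, assuming PyTorch and channel sizes chosen only for illustration: an "mlpconv" block (a convolution followed by 1x1 convolutions, i.e. a small MLP slid over each local patch) and global average pooling in place of fully connected layers.

```python
# mlpconv block and global-average-pooling head (illustrative sizes).
import torch.nn as nn

mlpconv = nn.Sequential(
    nn.Conv2d(3, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(192, 160, kernel_size=1), nn.ReLU(inplace=True),   # per-patch MLP layers
    nn.Conv2d(160, 96, kernel_size=1), nn.ReLU(inplace=True),
)

head = nn.Sequential(
    nn.Conv2d(96, 10, kernel_size=1),      # one feature map per class
    nn.AdaptiveAvgPool2d(1),               # global average pooling -> per-class confidences
    nn.Flatten(),
)
```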
This paper introduces the Places dataset, a scene-centric dataset at the scale of ImageNet (which is object-centric), enabling the training of deep CNNs such as AlexNet for scene recognition, and achieves state-of-the-art results on scene benchmarks. Main contributions:
Collects a dataset at ImageNet scale for scene recognition.
Achieves state-of-the-art on scene benchmarks: SUN397, MIT Indoor67, Scene15, SUN Attribute.
Introduces measures for comparing datasets: density and diversity.
Makes a thorough comparison between ImageNet and Places, from the datasets themselves to classification results to visualizations of the learned representations.