In September 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton published a paper whose model achieved a 15.3% top-5 error rate on the ImageNet dataset, compared to 26.2% for the second-best entry. Before this paper, progress on the task had saturated: error rates had improved by less than 2 percentage points over the previous two years. [1]
The paper's main contributions included a very efficient GPU implementation of convolutional neural networks and the use of non-saturating ReLU units instead of sigmoid or tanh units for faster convergence. Dropout had been used for regularization before, but this paper established it as a default choice.
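To see why these two tricks help, here is a minimal numpy sketch (an illustration, not the paper's actual implementation): the sigmoid gradient vanishes for large inputs while the ReLU gradient stays at 1, and inverted dropout randomly zeroes units at training time while rescaling the survivors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def dropout(x, p, rng):
    # Inverted dropout: drop each unit with probability p and scale
    # survivors by 1/(1-p), so no rescaling is needed at test time.
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.array([-5.0, 0.5, 5.0])
sig_grad = sigmoid(x) * (1.0 - sigmoid(x))  # near zero at both extremes
relu_grad = (x > 0).astype(float)           # exactly 1 for positive inputs

rng = np.random.default_rng(0)
out = dropout(np.ones(8), p=0.5, rng=rng)   # each entry is 0.0 or 2.0
```

The saturating gradients are what made deep sigmoid/tanh networks slow to train; ReLU sidesteps the problem for positive activations.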
Six months after the ImageNet paper was published, Google+ image search was already using the model. If you searched for "hibiscus" on Google+ in early 2013, you would get image posts of hibiscus even when the post title, comments, and tags never mentioned the word.
Over the next few years, the top-5 error on ImageNet dropped to 11.1% in 2013, 6.7% in 2014, and 3.5% in 2015.
In the 2014 submission by the Visual Geometry Group at the University of Oxford, the central realization was that deeper networks performed better: their networks had 16-19 layers, compared to 8 layers in the 2012 ImageNet paper. [2] In the same year, GoogLeNet achieved a 6.7% error rate with a 22-layer model. [3]
In 2015, Microsoft Research achieved an error rate of 3.5% with a model that had 152 layers. Its main contribution was residual learning: shortcut connections that let each stack of layers learn a residual function on top of an identity mapping, building on related ideas like highway networks. [4]
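The shortcut idea is simple enough to sketch in a few lines of numpy. This is a toy fully-connected residual block, not the paper's convolutional architecture; the point is that the block computes f(x) + x, so with small weights it stays close to the identity and very deep stacks still propagate the signal (and gradients).

```python
import numpy as np

def residual_block(x, w1, w2):
    # y = f(x) + x: the "+ x" is the shortcut (identity) connection.
    h = np.maximum(0.0, x @ w1)  # one hidden layer with ReLU
    return h @ w2 + x

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
# Small weights make f(x) a small perturbation, so the block is
# close to the identity mapping.
w1 = rng.standard_normal((d, d)) * 0.001
w2 = rng.standard_normal((d, d)) * 0.001

y = residual_block(x, w1, w2)
```

Without the shortcut, a 152-layer stack of such transforms would have to learn the identity through every layer; with it, each block only has to learn a residual correction.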
If I missed a significant part of the image classification puzzle, feel free to reply to this post and let me know.
Links to research papers for further reading: