In this article, we'll discuss when it pays to collect more data and when it pays to use more complex machine learning models for prediction. Which helps more (data or algorithm) depends on the specific problem at hand. In the discussion, we'll also highlight caveats that might affect the desirability of pursuing each of these directions.
Collecting more data is one of the most promising ways of improving performance. In fact, you might have come across quotes like "We don’t have better algorithms. We just have more data." or the article "The Unreasonable Effectiveness of Data" co-authored by Peter Norvig (Director of Research, Google).
One of the most effective ways to improve algorithm performance is to add new features to the dataset.
Let's consider the example of Google's famous PageRank algorithm, published in 1998. Although PageRank is an elegant and brilliant algorithm, it is important to note that before PageRank, search engines ranked web pages using only their text. Google's search engine was the first to recognize that hyperlinks are an important signal of popularity, and anchor text (the text of the hyperlink) an important signal of relevance. Given these two new signals, a number of different algorithms making use of this data produce results similar to PageRank's, especially relative to not using these signals at all.
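To make the link-analysis idea concrete, here is a minimal power-iteration sketch of the PageRank computation (a toy illustration in Python, not Google's production algorithm; the three-page "web" and the damping factor of 0.85 are conventional example choices):

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """Toy power-iteration PageRank; adj[i, j] = 1 means page i links to page j."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    # Follow an outgoing link uniformly at random; pages with no outgoing
    # links jump to a random page. Transposed so columns sum to 1.
    transition = np.where(out_degree[:, None] > 0,
                          adj / np.maximum(out_degree[:, None], 1),
                          1.0 / n).T
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * transition @ rank
    return rank

# Tiny three-page web: pages 0 and 1 both link to page 2; page 2 links to page 0.
links = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [1, 0, 0]])
print(pagerank(links))  # page 2, with two incoming links, gets the highest rank
```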
For new features to provide a significant improvement in results, it is important that the information captured by the new signal be substantially different from the information already present in the current features. In the example above, analyzing a website's incoming links carries important information about what others think of the website, as opposed to looking only at the website's own content.
The second way to improve algorithm performance with data is simply to have more of it. Increasing the amount of data we have can reduce overfitting (i.e., improve the performance of models with a large number of parameters / high variance). Having more data reduces our dependence on weak correlations.
Of special importance here are non-parametric models - models where the number of parameters isn't fixed, but rather grows with the data. Examples of such models include naive Bayes, k-nearest neighbors, n-gram models, matrix factorization, and even the simple histogram. For problems such as topic classification, it is important to learn the topical association of each word or phrase. In such cases, even in a dataset of 100 million documents, it is common to have words that appear only a few times. In the movie recommendation problem, we might have very few ratings available for a specific user. For such problems, doubling the amount of data can lead to a significant improvement in results, since it helps us better model each word and each user.
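As a toy illustration of this growth, consider a unigram count model of text (the backbone of naive Bayes classifiers); the tiny corpora below are made up for the example:

```python
from collections import Counter

def unigram_counts(docs):
    """One count per distinct word: the model's 'parameter table'."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

corpus_small = ["the cat sat", "the dog ran"]
corpus_large = corpus_small + ["the quick brown fox", "a lazy dog slept"]

# The number of parameters is not fixed in advance: it grows with the data.
print(len(unigram_counts(corpus_small)))  # 5 distinct words
print(len(unigram_counts(corpus_large)))  # 11 distinct words
```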
Having more data also improves performance when our model might be under-fitting (i.e., has high bias). In this case, more data helps because it allows us to drop some of the underlying assumptions made by our model, letting the “data speak for itself”. For example, with a small dataset, we might assume that the relationship we are modeling is linear. If we collect more data, we could instead learn a piecewise linear function or a histogram.
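Here is a small sketch of that progression, assuming NumPy (the sine-shaped ground truth and the bin count are arbitrary example choices). With enough data, per-bin means (a histogram-style regressor) recover structure that a single straight line misses:

```python
import numpy as np

rng = np.random.default_rng(0)

def binned_mean_fit(x, y, n_bins=10):
    """Predict the mean of y within each x-bin (a histogram-style regressor)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bins = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    return np.array([y[bins == b].mean() for b in range(n_bins)])

# Non-linear ground truth that a straight line would under-fit.
x = rng.uniform(0, 1, 5000)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)

slope, intercept = np.polyfit(x, y, 1)  # the "linear assumption" model
print(slope, intercept)                 # a straight line that misses the curvature
print(binned_mean_fit(x, y))            # the shape of sin() emerges bin by bin
```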
Relying on more data instead of better algorithms usually pays off if our data is too sparse (causing overfitting), or if we have restricted model complexity because of limited data (high bias).
Sometimes, however, collecting more data is simply too expensive. Consider, for example, a geoscience project that needs manually labeled data or focuses on rare phenomena.
The first step towards improving a model is understanding whether it is overfitting or under-fitting. If it is under-fitting, using a more complex learner, such as a deep learning model, can help. If it is overfitting, we might want to use a simpler model, or add more constraints to the model, thereby encoding our knowledge directly into the learner as modeling assumptions. For example, we could use a convolutional neural network instead of a fully-connected one, building translational invariance into the model architecture.
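One common way to make this diagnosis is to compare training and validation scores as the training set grows. Below is a minimal sketch using scikit-learn's learning_curve; the synthetic dataset and the logistic-regression learner are just placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A large train/validation gap suggests overfitting (simplify or add data);
    # both scores low suggests under-fitting (use a more complex learner).
    print(f"n={n:5d}  train={tr:.3f}  validation={va:.3f}")
```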
Diverse complex models often fit the data differently. An example of different decision boundaries for classification problems can be seen below. Visually inspecting data and models is important for understanding which steps to take to improve them.
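The figure aside, the same point can be made in code. A quick sketch with scikit-learn (the two-moons dataset and the three learners are arbitrary example choices) fits a linear model, a kernel method, and a tree ensemble to the same data and compares their held-out accuracy:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three model families with very different decision boundaries:
# linear, kernelized, and an ensemble of trees.
for model in (LogisticRegression(),
              SVC(kernel="rbf"),
              RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(model.score(X_te, y_te), 3))
```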
The main advantage of complex models is that they can capture more complex phenomena than simpler linear models can.
However, complex models often require superlinear amounts of computation (say, quadratic or cubic in the size of the data), forcing us to choose between a complex model and using all the data we have. It is also often difficult to implement complex models in distributed setups built on Big Data technologies such as Hadoop, because they can't be expressed directly in a MapReduce structure.
Other disadvantages of complex learners include the intricate work of fine-tuning them and the inherent difficulty of understanding their internal structure.
Deep Learning deserves a separate mention in this discussion since it often represents the best of both worlds - a complex learner whose performance often keeps improving with additional data.
The performance of many traditional models (such as support vector machines, linear regression, or k-nearest neighbors) increases until the dataset reaches a certain size, beyond which no significant improvement is obtained. Because deep neural networks can represent arbitrarily complex functions, their performance often keeps improving with additional data. Below is a simple plot showing how traditional machine learning methods compare to deep learning as the amount of data increases.
Deep learning models have repeatedly shown accuracy improvements using what is known as pre-training.
As an example, for problems in computer vision, people often start with convolutional neural networks (the state-of-the-art method for image classification) pre-trained on a large public dataset such as ImageNet. Thereafter, they throw away the top layer of the model, attach a new one, and re-train it on their specific task with limited data, transferring a lot of knowledge from the larger public dataset. This avoids overfitting, since the entire neural network is not being trained on the small dataset.
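A minimal sketch of this recipe, assuming PyTorch and torchvision (ResNet-18 and the 10-class target task are arbitrary example choices, and the exact weights argument depends on the torchvision version):

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet (ResNet-18 as an example backbone).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pre-trained feature extractor...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the top (classification) layer with a fresh one for the new
# task. Only this layer's weights are then trained on the small dataset.
num_classes = 10  # hypothetical number of classes in the target task
model.fc = nn.Linear(model.fc.in_features, num_classes)
```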
Another example comes from natural language processing: people often train word vectors on the language modeling task, for which billions of documents are available on the web as training data. The pre-trained word vectors can then be re-used for the specific task at hand, or used as additional features alongside a bag-of-words model.
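A short sketch of this reuse, assuming gensim and its downloader module (the GloVe model name and the simple averaging scheme are example choices, not the only option):

```python
import numpy as np
import gensim.downloader as api

# Word vectors pre-trained on a large web corpus (GloVe, 100 dimensions).
vectors = api.load("glove-wiki-gigaword-100")

def doc_vector(text):
    """Represent a document as the average of its words' pre-trained vectors."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

features = doc_vector("the movie was surprisingly good")
print(features.shape)  # (100,) -- ready to feed into a downstream classifier
```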