Machine learning has a number of popular theorems. However, one should keep in mind that theorems are stated in a mathematically strict sense. Extrapolating from them to make decisions in practice is error-prone, and one needs to be very careful. In this article, we will discuss some examples of how theoretical guarantees might not be what they seem, and how to avoid reaching incorrect conclusions.
This article does not intend to say theory is bad. Its main point is that theory and practice often have a wide gap. When practitioners with less experience in theory come across widely applicable theorems, they often extrapolate them in undesirable ways.
Example A: The universal approximation theorem asserts that neural networks can approximate any continuous function (on a compact domain) to arbitrary accuracy. Moreover, even a NN with only one hidden layer can accomplish this feat.
Such a statement gives the impression that NN architectures are extremely powerful and can be applied to very different types of problems. Below are some caveats about the theorem that are important to keep in mind in practice:
- That a NN is capable of representing a given function says nothing about how to find such a NN in the hypothesis space. That is, it says nothing about whether it is possible to start from a NN with random weight initialization and reach the desired NN by training. Learning = Representation + Evaluation + Optimization. This theorem only speaks to representation, not optimization.
- A NN with one hidden layer might be equivalent to a deep NN in terms of representational power. However, in practice, a single-hidden-layer NN can require exponentially more neurons, and far more training data, on realistic tasks such as object classification.
- Last but not least, the theorem should NOT be read as saying NNs are better than other machine learning algorithms in any general sense. There are tasks on which other algorithms will continue to perform better than NNs for a myriad of reasons related to optimization, overfitting, computational efficiency, the amount of data available, and so on.
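The representation-vs-optimization distinction can be made concrete. Below is a minimal sketch (assuming NumPy; all names are illustrative) that hand-constructs, rather than trains, a single-hidden-layer ReLU network that interpolates f(x) = x^2 at a grid of knots. This demonstrates what the theorem actually promises, representation, while sidestepping optimization entirely.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def build_one_layer_approx(f, knots):
    """Hand-construct a single-hidden-layer ReLU net whose output is the
    piecewise-linear interpolation of f at the given knots."""
    y = f(knots)
    slopes = np.diff(y) / np.diff(knots)
    # output weight for each hidden unit ReLU(x - knot_i):
    # first unit carries the initial slope, later units carry slope changes
    a = np.concatenate(([slopes[0]], np.diff(slopes)))
    b = y[0]  # output bias: value at the leftmost knot

    def net(x):
        h = relu(x[:, None] - knots[:-1][None, :])  # hidden activations
        return h @ a + b

    return net

f = lambda x: x**2
knots = np.linspace(0.0, 1.0, 21)   # 20 hidden units
net = build_one_layer_approx(f, knots)

x = np.linspace(0.0, 1.0, 1000)
err = np.max(np.abs(net(x) - f(x)))
print(err)  # small; adding knots (hidden units) shrinks it further
```

No gradient descent was involved: the weights were written down in closed form. Finding such weights by training from a random initialization is a separate, and much harder, question.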
Example B: There are a number of theoretical bounds that, given the number of parameters in a model, state how many training samples are required for good generalization. Many of these bounds are based on Occam's razor's preference for simplicity.
The problem is that a lot of these theoretical bounds are extremely loose, i.e. one should not assume that if theory says we need a large number of examples, we actually need that many. In practice, the total amount of data required is usually far less than the theoretical bounds suggest.
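To see how loose such bounds can get, here is a sketch of a standard Occam-style bound for a consistent learner over a finite hypothesis class, m >= (ln|H| + ln(1/delta)) / epsilon, where |H| is bounded by 2^(bits x parameters). The parameter values below are purely illustrative.

```python
import math

def occam_bound(num_params, bits_per_param=32, eps=0.01, delta=0.01):
    """Sample-size bound for a consistent learner over a finite hypothesis
    class: m >= (ln|H| + ln(1/delta)) / eps, with |H| <= 2^(bits * params)."""
    ln_H = bits_per_param * num_params * math.log(2)
    return (ln_H + math.log(1.0 / delta)) / eps

# a modest model by modern standards: one million 32-bit parameters
m = occam_bound(num_params=1_000_000)
print(f"{m:.2e}")  # on the order of 10^9 samples
```

The bound demands billions of labeled examples, yet models of this size routinely generalize well after training on far smaller datasets; the bound is valid but wildly pessimistic.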
Example C: Similar theorems exist for predicting the convergence time of various algorithms. For example, for the perceptron algorithm, there are worst-case bounds in which the number of iterations required for guaranteed convergence is exponential in the dimensionality of the input space.
However, in the average case the algorithm is relatively efficient and only needs polynomial time.
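The gap between worst-case and typical behavior is easy to observe. The sketch below (assuming NumPy; a toy implementation, not a library API) runs the classic perceptron update on synthetic linearly separable data with a comfortable margin, where convergence takes only a handful of updates rather than anything close to 2^d.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceptron(X, y, max_epochs=10_000):
    """Classic perceptron: returns the weight vector and the number of
    weight updates made before all points are classified correctly."""
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # misclassified: apply the update rule
                w += yi * xi
                updates += 1
                mistakes += 1
        if mistakes == 0:
            return w, updates
    return w, updates

# separable data in 50 dimensions, with points near the boundary removed
d = 50
w_true = rng.normal(size=d)
X = rng.normal(size=(500, d))
y = np.sign(X @ w_true)
keep = np.abs(X @ w_true) > 0.5 * np.linalg.norm(w_true)
X, y = X[keep], y[keep]

w, updates = perceptron(X, y)
print(updates)  # bounded by (R / margin)^2 here -- nowhere near 2**d
```

On margin-friendly data like this, the mistake bound (R/gamma)^2 governs behavior, which is tiny compared to the exponential worst case.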
Example D: According to the no free lunch theorem, any two classifiers are equivalent when their performance is averaged across all possible problems. This means that if classifier A outperforms classifier B for some problems, then classifier B must outperform classifier A over other problems.
Here is what Wikipedia says about the no free lunch theorem:
That is, the NFL states exactly what its mathematical statements say, and nothing more. For example, it applies to situations where the algorithm is fixed first and nature can then choose a worst-case problem instance for that algorithm. Therefore, if we have a "good" problem in practice, or if we can choose a "good" learning algorithm for a given problem instance, the NFL places no limitation on that particular instance.
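The NFL averaging claim can be checked directly on a tiny scale. The sketch below (illustrative code, standard library plus NumPy) enumerates all 256 boolean target functions on 3-bit inputs, trains two deliberately different "learners" on a fixed training set, and measures their average accuracy on the held-out inputs. Averaged over all problems, both come out exactly the same.

```python
import itertools
import numpy as np

inputs = list(itertools.product([0, 1], repeat=3))  # all 8 binary inputs
train_idx, test_idx = range(4), range(4, 8)         # fixed train/test split

def learner_majority(train_labels):
    # predicts the majority label seen in training, for every input
    guess = int(sum(train_labels) * 2 >= len(train_labels))
    return lambda x: guess

def learner_parity(train_labels):
    # ignores the training data entirely; predicts the parity of the bits
    return lambda x: sum(x) % 2

def avg_offtrain_accuracy(learner):
    accs = []
    for labels in itertools.product([0, 1], repeat=8):  # all 256 problems
        predict = learner([labels[i] for i in train_idx])
        correct = [predict(inputs[i]) == labels[i] for i in test_idx]
        accs.append(np.mean(correct))
    return float(np.mean(accs))

print(avg_offtrain_accuracy(learner_majority))  # 0.5
print(avg_offtrain_accuracy(learner_parity))    # 0.5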
Moreover, what does "averaged across all possible problems" actually mean? Let's consider all classification problems that map a 10-dimensional input to a one-dimensional output. To keep things simple, let us assume all inputs and outputs take binary values. These are quite realistic constraints. For example, one can imagine a problem where we need to predict the gender of a person, male / female, based on 10 binary features.
Mathematically, there are 2^(2^10) possible problems: there are 2^10 possible inputs, and each problem is defined by the output produced for each input. Since each output has 2 possible values, we get 2^(2^10) possible problems.
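This count is easy to sanity-check with a couple of lines of arithmetic:

```python
# one boolean problem = one complete labeling of all possible inputs
n_inputs = 2 ** 10           # 1024 distinct 10-bit input vectors
n_problems = 2 ** n_inputs   # one output bit chosen per input
print(n_inputs)              # 1024
print(len(str(n_problems)))  # 309 -- a 309-digit number of problems
```

Even for this tiny toy setting the space of "all possible problems" is astronomically large, which is the space the NFL average runs over.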
Notice that for a typical problem in the set of all possible problems, each input dimension interacts with every other input dimension in a chaotic way: no input dimension by itself tells you much about the output, not dimension 1, not dimension 2, not dimension 3, and so on. On average, there is no useful correlation between individual features and the output, and no independence among the features either.
In realistic problems, dimensions usually have a semantic meaning, which means a dimension is often directly correlated with the output. Even when that is not true of an individual dimension, in realistic problems a small set of features is enough to provide evidence toward the output being 1 or 0 in isolation, i.e. regardless of the values of the other features. Almost all real-world problems have this property. Hence, it does not make much sense to average over all possible problems in the mathematical sense.
Although theoretical guarantees can help guide research directions and decisions in practice, in machine learning there is often a wide gap between theory and practice. It is important to understand this gap, and to be cautious about extrapolating a theorem's results when using it for decision making in practice.