Let's say you're working on a cool image processing project, and your goal is to build an algorithm that analyzes faces for emotions. It takes in a 256 pixel by 256 pixel grayscale image as its input and spits out an emotion as an answer. For example, if you passed in the following image, you'd expect the algorithm to label it as "happy."
Now this is all well and good, but before we're satisfied with this approach, let's take a step back and think about what this really means. A 256 by 256 grayscale image corresponds to an input vector of over 65,000 dimensions! In other words, we're trying to solve a problem in a 65,000-dimensional space. That's not a particularly easy thing to do, even for a computer! Not only are large inputs annoying to store, move around, and compute with, but they also give rise to some pretty serious tractability concerns.
Let's get a rough idea of how the difficulty of a machine learning problem increases as the dimensionality increases. According to a study by C.J. Stone in 1982, the time it takes to fit a model (specifically a nonparametric regression) to your data is at best proportional to m−p/(2p+d), where m is the number of data points, d is the dimensionality of the data, and p is a parameter that depends on the model we are using (specifically, we are assuming that the regression function is p times differentiable). In a nutshell, this relation implies that we need exponentially more training examples as we increase the dimensionality of our inputs.
We can observe this graphically by considering a simple example, borrowed from Gutierrez-Osuna. Our learning algorithm divides the feature space uniformly into bins and plots all of our training examples. We then assign each bin a label based on the predominant class that's found in that bin. Finally, for each new example that we want to classify, we just need to figure out the bin the example falls into and the label associated with that bin!
In our toy problem, we begin by choosing a single feature (one-dimensional inputs) and divide the space into 3 simple bins:
Understandably, we notice there's too much overlap between classes, so we decide to add another feature to improve separability. But we notice that if we keep the same number of samples, we get a 2D scatter plot that's really sparse, and it's really hard to ascertain any meaningful relationships without increasing the number of samples. If we move onto 3-dimensional inputs, it becomes even worse. Now we're trying to fill even more 33 = 27 bins with the same number of examples. The scatter plot is nearly empty!
Of important note is that this is NOT an issue that can be easily solved by more effective computing or improved hardware! For many machine learning tasks, collecting training examples is the most time consuming part, so this mathematical result forces us to be careful about the dimensions we choose to analyze. If we are able to express our inputs in few enough dimensions, we might be able to turn an unfeasible problem into a feasible one!
Sparsity is one out of many issues labelled as the curse of dimensionality. Another issue that arises is that it is hard to know what true distance means when you have a large number of dimensions. Having lots of dimensions implies that everything is far away from each other. This is because in a low dimensional sphere (or any other regular shape), most of the volume is close to the center. Whereas in a high dimensional sphere, most of the volume is close to the surface.
This means that when an algorithm such as the K-nearest neighbor algorithm is applied on a high dimensional dataset, you will usually find a lot of neighbors at roughly the same distance from a given data point. Hence, the concept of nearest neighbor isn't as effective. High dimensionality also makes clustering hard, since clustering algorithms inherently rely on distance between data points in-order to cluster them together.
Distance measurements become even worse when different dimensions are correlated with each other, a common property of high dimensional datasets. To give a simple example, let's say that we have a dataset with 3 dimensions, where dimensions 2 and 3 are the same. Now, anytime we measure the distance between two data points, we are effective saying that dimension 2 (or 3) is twice as important as the first dimension. For a more practical example, imagine what happens to the raw distance in the 65,000 dimensions when you simply 'shift' the image of the happy man by one pixel. We want the original image and the new image to be quite close to each other, but this simple change makes them separated by a large distance if we fail to capture the relationships between the various dimensions.
Going back to the original problem of classifying facial expressions, it's quite clear that we don't need all 65,000 dimensions to classify an image. Specifically, there are certain key features that our brain automatically uses to detect emotion quickly. For example, the curve of the person's lips, the extent his brow is furrowed, and the shape of his eyes all help us determine that the man in the picture is feeling happy. It turns out that these features can be conveniently summarized by looking at the relative positioning of various facial keypoints, and the distances between them.
Obviously, this allows us to significantly reduce the dimensionality of our input (from 65,000 to about 60), but there are some limitations! Hand selection of features can take years and years of research to optimize. For many problems, the features that matter are not easy to express. For example, it becomes very difficult to select features for generalized object recognition, where the algorithm needs to tell apart birds, from faces, from cars, etc. So how do we extract the most information-rich dimensions from our inputs? This is answered by a set of techniques within the field of machine learning called dimensionality reduction.