K-nearest neighbors (KNN) is one of the simplest machine learning algorithms. It is a supervised learning algorithm that can be used for both classification and regression.
Let us understand this algorithm with a classification problem. For simplicity of visualization, we'll assume that our input data is 2 dimensional. We have two possible classes - green and red.
Let's plot our training data in feature space.
There is no explicit training phase in KNN! In other words, for classifying new data points, we'll directly use our dataset (in some sense, the dataset is the model).
To classify a new data point, we find the k points in the training data closest to it and predict whichever class is most common among those k points (i.e. we simulate a vote). Here, "closest" is defined by a suitable distance metric, such as Euclidean distance. Other distance metrics are discussed below.
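The voting procedure above can be sketched from scratch in a few lines. This is a minimal illustration, not a production implementation; the function name `knn_predict` and the toy 2-D data are invented for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy 2-D data: the "green" class clusters near the origin,
# the "red" class near (5, 5)
X_train = np.array([[0.0, 0.5], [0.5, 0.0], [1.0, 1.0], [5.0, 5.0], [4.5, 5.5]])
y_train = np.array(["green", "green", "green", "red", "red"])

print(knn_predict(X_train, y_train, np.array([0.4, 0.6])))  # green
print(knn_predict(X_train, y_train, np.array([5.2, 4.8])))  # red
```

Note that there is no training step anywhere in this code: the entire training set is consulted at prediction time, which is exactly the "dataset is the model" point made above.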
For example, if we want to classify the blue point shown in the following figure, we look at its k nearest data points and assign the class that has the majority.
If k = 3, two of the three nearest points are green and one is red. Hence, we'll predict the green class for the new point.
Here's another example: let's change the position of the new (blue) point as shown below.
If we take k = 5, four of the five neighbors are red and one is green. Hence, the new point will be classified as red.
In the case of regression (when the target variable is a real value), we take the average of the target values of the k nearest neighbors.
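The regression variant differs only in the final step: instead of voting, we average the neighbors' target values. A minimal sketch (the function name `knn_regress` and the toy data are made up for illustration):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # find the k nearest training points, exactly as in classification
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # average their real-valued targets instead of voting
    return y_train[nearest].mean()

# toy 1-D example where the target is roughly 2 * x
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([2.0, 4.0, 6.0, 20.0])

print(knn_regress(X_train, y_train, np.array([2.5]), k=3))  # 4.0
```

For x = 2.5 the three nearest points have targets 2, 4, and 6, so the prediction is their mean, 4.0. scikit-learn provides this as `KNeighborsRegressor`.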
A small value of k means that noise will have a higher influence on the result, while a large value makes the algorithm more computationally expensive. Usually, we perform cross-validation to find the best k value (or to choose the value of k that best suits our accuracy / speed trade-off). If you don't want to try multiple values of k, a rule of thumb is to set k equal to the square root of the total number of data points. For more on choosing the best value of k, refer to this stackoverflow thread.
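One way to run the cross-validation described above is with scikit-learn's `cross_val_score`, scoring each candidate k on the iris dataset and keeping the best one. The range of candidates (1 to 20) and the 5-fold split are arbitrary choices for this sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # standardize so distances are meaningful

# evaluate each candidate k with 5-fold cross-validation
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Because ties and near-ties between k values are common, the exact winner can vary with the dataset and the fold split; the point is the procedure, not the specific number it returns here.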
There are various options available for the distance metric, such as Euclidean or Manhattan distance. The most commonly used metric is Euclidean distance.
The Minkowski distance is a generalization of the Euclidean and Manhattan distances.
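Concretely, the Minkowski distance of order p is (Σ |aᵢ − bᵢ|ᵖ)^(1/p): p = 1 recovers Manhattan distance and p = 2 recovers Euclidean distance. A small sketch (the helper `minkowski` is written here for illustration; scipy ships an equivalent as `scipy.spatial.distance.minkowski`):

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance: (sum of |a_i - b_i|^p) ^ (1/p)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, p=1))  # 7.0 -> Manhattan distance (3 + 4)
print(minkowski(a, b, p=2))  # 5.0 -> Euclidean distance (3-4-5 triangle)
```

This is also why scikit-learn's `KNeighborsClassifier` takes `metric="minkowski"` with a parameter `p`, as in the full example below.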
Note that you'll want to do some pre-processing on the input data (for example, make sure each dimension has 0 mean and unit variance) so that the distance metrics above are meaningful.
```python
## load the dataset
from sklearn.datasets import load_iris
dataset = load_iris()
X = dataset.data
y = dataset.target

## preprocessing
# standardize the data to make sure each feature contributes equally
# to the distance
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_processed = ss.fit_transform(X)

## split the dataset into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.3, random_state=42)

## fit k nearest neighbor model
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
# p=2 for Euclidean distance
model.fit(X_train, y_train)
# output:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=1, n_neighbors=5, p=2,
#                      weights='uniform')

## evaluate
model.score(X_test, y_test)
# output: 1.0
```
In a parametric model, we update a fixed number of parameters to learn a function that can classify new data points without requiring the training data at prediction time (for example, logistic regression). In a non-parametric model, the effective number of parameters grows with the size of the training data. This is what happens in KNN: the training set itself is kept around and consulted for every prediction.