K-nearest neighbors

June 18, 2017

K-nearest neighbors (KNN) is one of the simplest Machine Learning algorithms. It is a supervised learning algorithm which can be used for both classification and regression.

Understanding the classification algorithm (illustration)

Let us understand this algorithm with a classification problem. For simplicity of visualization, we’ll assume that our input data has 2 dimensions. We will also assume that it is a binary classification task, i.e. the target can take two possible labels - green and red.

Here’s what a plot of our training data looks like.

There is no explicit training phase in KNN! In other words, for classifying new data points, we’ll directly use our dataset (in some sense, the dataset is the model).

To classify a new data point, we find the K points in the training data closest to it, and make a prediction based on whichever class is most common among these K points (i.e. we simulate a vote). Here closest is defined by a suitable distance metric, such as euclidean distance. We will discuss other distance metrics later in the tutorial.

Let’s take a couple of examples.

Suppose we want to classify the blue point shown in figure below. We consider the K nearest data points and we assign it the class which is the most common among those K data points.

Let’s assume that K = 3. Then, we get 2 data points with green class and 1 data point with red class. Hence, we’ll predict green class for the new point.

Here’s another example. This time the new data point (marked in blue again) is in a different position, and we will also assume K = 5 (instead of 3).

Since K = 5, we need to look at the 5 nearest neighbors. We get 4 neighbors with red class and 1 neighbor with green class. Hence, new point will be classified as a red point.

Note: The value of K and the distance metric are both considered hyper-parameters of the K nearest neighbors model. We can choose both of them using the validation dataset, as we have done with the hyper-parameters of the other models we have seen so far.

KNN as regression algorithm

Above, we saw how to use KNN as a classification algorithm. In that case, we predicted whichever class was most common among the K closest data points.

In case of regression, when target variable is a real value, we take the average of the K nearest neighbors.

For example, let’s say K = 3 and the y value associated with the 3 nearest neighbors are 5, 8 and 6. Then, for the new data point, we will predict (5 + 8 + 6) / 3 = 6.33.

Tuning the hyper-parameter K

When K is small, the algorithm will only consider a few close neighbors of the data point we’re trying to make a prediction for. As K grows large, it will consider more and more neighbors. Initially, this will usually improve the accuracy by reducing overfitting.

However, as K gets even more large, we will start considering even far off points as neighbors of the current data point. As an extreme example, consider the case when K = N (the total number of data points). Then, all points in the training data will be considered neighbors, and we will always make the same prediction (whichever label is most common in the training data).

Hence, to pick a good value of K, we can perform hyper-parameter tuning using the validation dataset.

If performing hyper-parameter tuning is not possible, (say for example, if our dataset is large and we have very limited computational resources), a rule of thumb is to set $k = \sqrt{n}$ (square root of total number of data points).

Distance metrics

There are various options available for distance metric such as euclidean or manhattan distance. The most commonly used metric is euclidean distance.

Minkowski is the generalization of Euclidean and Manhattan distance. When q = 2, Minkowski distance becomes Euclidean distance, and when q = 1, Minkowski distance becomes Manhattan distance.

Below, you can see the geometric representation of Euclidean and Manhattan distance for points on the 2-dimensional plane.

Importance of feature scaling / data normalization

It is important to perform feature scaling / normalization on the data points before using them in K nearest neighbors algorithm. For example, we could make sure that for each dimension, the mean is 0 mean and standard deviation is 1).

Otherwise, the distance metric will give more importance to some dimension as opposed to other dimensions.

Let’s take a simplified example to understand this. Here’s the formula for the euclidean distance between the data points $x^{(1)}$ and $x^{(2)}$.

$$ \sqrt{(x^{(1)}_1-x^{(2)}_1)^2+(x^{(1)}_2-x^{(2)}_2)^2} $$

In this simple example, let’s say that $x_1$ varies between -1 to +1 and $x_2$ varies between -1000 to +1000. It is very clear that $x_2$ will be responsible for most of the total distance, whereas $x_1$ will have almost no impact.

Scikit-learn implementation

Let’s implement K-nearest neighbors using scikit-learn .

We will load iris dataset in sklearn.datasets for our KNN implementation. Let’s load independent variables to X and target variable to y.

## load the dataset
from sklearn.datasets import load_iris
dataset = load_iris()
# X - independent variables
X = dataset.data
# y - target variable
y = dataset.target

Next, we scale X using StandardScaler before implementing the distance metric based KNN algorithm.

## pre-processing
# standardize the data to make sure each feature contributes equally 
# to the distance
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_processed = ss.fit_transform(X)
# Dataset after scaling
print('Dataset after scaling')
print('X_processed (first 5 rows)')
print(X_processed[:5, :])
# Print first 5 values of y
print('y (first 5 values)')
print(y[:5])

We split the data into train and test data, to enable evaluating the model trained.

## split the dataset into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.3, random_state=42)

Now, let’s fit the KNN model. We will set n_neighbors=5 .

## fit n nearest neighbor model 
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
# p=2 for euclidean distance
# fit the model with train set
model.fit(X_train, y_train)
print('\nDisplay all arguments scikit-learn is using for the KNN model')
print(model)

Finally, let’s evaluate the model.

## evaluate
score = model.score(X_test, y_test)
print('\nScore:', score)

Parametric and non-parametric models

Parametric models (like logistic regression) assume a finite set of parameters. From the given train data, it learns the values of these parameters (during training). These parameters help predict the target value for any test data. Therefore, in parametric models, training takes time and its really quick to predict the target value for test data.

Benefits of Parametric Models are:

Simpler: They are easier to understand and interpret the results.
Less Data: They do not require much training data.

Limitations of Parametric Models are:

Constrained: By choosing a functional form these methods are highly constrained to that specified form.
Limited Complexity: The methods are more suited to simpler problems.

In non-parametric models (like KNN), the training phase is quick or negligible. There are no parameters to learn (Or we can state that there are infinite parameters to learn). When a new test data point arrives, the algorithm then compares with the train data to predict the target value. So in non-parametric models, the training is quick compared to prediction.

Benefits of Non-parametric Models are:

Flexibility: They are capable of fitting a large number of functional forms.
Power: No assumptions about the underlying function.

Limitations of Non-parametric Models:

More data: They require a lot of training data.
Overfitting: More risk of overfitting the training data and it is harder to explain why specific predictions are made.

Summary

K-nearest neighbors (KNN) is a supervised learning algorithm which can be used for both classification and regression.
KNN predict the class which is most common among the K closest data points for classification problems.
In case of regression, when target variable is a real value, KNN takes the average of the K nearest neighbors.
K is a hyper parameter which is tuned to select the best value.
Always perform feature scaling / normalization on the data points before using them in KNN.