"Note that you'll want to do some pre-processing on the input data (for example, make sure each dimension has 0 mean and unit variance) so that the distance metrics above are meaningful."

Sorry but could you explain this a bit more?

CommonLounge

Categories

Featured Contributions

NaN.

tutorial

K-nearest neighborsby Wiki

Scroll

Cards

**K-nearest neighbors (KNN)** is one of the simplest Machine Learning algorithms. It is a supervised learning algorithm which can be used for both classification and regression.

Let us understand this algorithm with a classification problem. For simplicity of visualization, we'll assume that our input data is 2 dimensional. We have two possible classes - *green* and *red*.

Lets plot out training data in feature space.

There is no explicit training phase in KNN! In other words, for classifying new data points, we'll directly use our dataset (in some sense, the dataset *is* the model).

To classify a new data point, we find the **k** points in the training data closest to it, and make a prediction based on whichever class is most common among these **k** points (i.e. we simulate a vote). Here *closest* is defined by a suitable distance metric such as euclidean distance. Other distance metrics are discussed below.

For example, if we want to classify blue point as shown in following figure, we consider **k** nearest data points and we assign the class which has the majority.

If **k = 3**, we get two data points with green class and one data point with red class. Hence, we'll predict green class for the new point.

Here's another example, let us change the position of new point (blue point) as shown below.

If we take **k = 5** then we get four neighbors with red class and one neighbor with green class. Hence, new point will be classified as red point.

In case of regression (when target variable is a real value), we take the average of the K nearest neighbors.

A small value of k means that noise will have a higher influence on the result and large value will make the algorithm computationally expensive. Usually, we perform cross-validation to find out best k value (or to choose the value of k that best suits our accuracy / speed trade-off). If you don't want to try multiple values of k, a rule of thumb is to set k equal to the square root of total number of data points. For more on choosing best value of k, refer __this stackoverflow__ thread.

There are various options available for distance metric such as euclidean or manhattan distance. The most commonly used metric is **euclidean distance**.

Minkowski is the generalization of Euclidean and Manhattan distance.

Note that you'll want to do some pre-processing on the input data (for example, make sure each dimension has 0 mean and unit variance) so that the distance metrics above are meaningful.

## load the datasetfrom sklearn.datasets import load_irisdataset = load_iris()X = dataset.datay = dataset.target## precessing# standardize the data to make sure each feature contributes equally# to the distancefrom sklearn.preprocessing import StandardScalerss = StandardScaler()X_processed = ss.fit_transform(X)## split the dataset into train and test setfrom sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.3, random_state=42)## fit n nearest neighbor modelfrom sklearn.neighbors import KNeighborsClassifiermodel = KNeighborsClassifier(n_neighbors = 5, metric="minkowski", p=2)# p=2 for euclidian distancemodel.fit(X_train, y_train)# output:# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',# metric_params=None, n_jobs=1, n_neighbors=5, p=2,# weights='uniform')## evaluatemodel.score(X_test, y_test)# output: 1.0

You can also play with this project directly in-browser via Google Colaboratory using the link above. Google Colab is a free tool that lets you run small Machine Learning experiments through your browser. You should read this 1 min tutorial if you're unfamiliar with Google Colaboratory.

In a parametric model, we continuously update a fixed number of parameters to learn a function which can classify new data point without requiring the training data (for example, logistic regression). In a non-parametric model, the number of parameters grows with the size of training data. This is what happens in KNN.

Read more…(504 words)

Category: Machine Learning

comment in this discussion

"Note that you'll want to do some pre-processing on the input data (for example, make sure each dimension has 0 mean and unit variance) so that the distance metrics above are meaningful."

Sorry but could you explain this a bit more?

Read more… (42 words)

Load More