KNN: Failure cases, Limitations, and Strategy to Pick the Right K

Deepak Jain · Published in Level Up Coding · Jul 17, 2020


In this article, we will talk about the failure cases of KNN, its limitations, and how to pick the right value of “K”.

In case you missed the first part, I strongly suggest you go through it first, here.

Failure cases of KNN:

CASE-1

Consider the below example:

Image by author

In this case, the data is spread randomly, so there is no useful structure to learn from it. Given a query point (the yellow point), the KNN algorithm will still find the k nearest neighbors, but since the labels are jumbled, those neighbors carry no class signal and the prediction is essentially a random guess.
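To see this failure concretely, here is a minimal sketch (my own illustration, not from the article; it assumes scikit-learn and uses synthetic data) in which KNN is trained on randomly labeled points and scores no better than a coin flip:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 2))    # points spread uniformly at random
y = rng.integers(0, 2, size=500)  # labels assigned completely at random

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# The neighbors exist, but their labels carry no signal:
# accuracy hovers around 0.5, i.e. a coin flip.
print(knn.score(X_test, y_test))
```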

CASE-2

Consider the below example:

Image by author

In this case, the data is grouped into clusters, but the query point lies far away from all of them. We can still use the K nearest neighbors to assign a class, however it doesn’t make much sense: the query point (yellow point) is so far from every training point that we can’t place much confidence in the classification.
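One practical safeguard, sketched below (my own illustration, assuming scikit-learn; the distance threshold is a hypothetical value you would tune per dataset), is to inspect the distances returned by kneighbors() and distrust the vote when the query is far from every training point:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

query = np.array([[20.0, 20.0]])  # far away from both clusters
dist, _ = knn.kneighbors(query)   # distances to the 3 nearest neighbors

THRESHOLD = 3.0  # hypothetical cutoff; tune it for your dataset
if dist.mean() > THRESHOLD:
    print("Query is far from all training points; prediction is unreliable.")
else:
    print("Predicted class:", knn.predict(query)[0])
```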

Limitations of KNN:

KNN is a very powerful algorithm. It is also called a “lazy learner” because it does no work at training time and defers all computation to prediction time. However, it has the following limitations:

1. Doesn’t work well with large datasets:
Since KNN is a distance-based algorithm, every prediction requires computing the distance between the query point and every training point (roughly O(n·d) work per query for n points in d dimensions), which makes prediction slow on large datasets.

2. Doesn’t work well with a high number of dimensions:
Beyond the extra distance-computation cost, high-dimensional spaces suffer from the curse of dimensionality: distances between points become nearly uniform, so the “nearest” neighbors are barely closer than the farthest ones and the majority vote loses meaning.

3. Sensitive to outliers and missing values:
KNN is sensitive to outliers and cannot handle missing values directly, so we first need to impute the missing values and get rid of the outliers before applying the KNN algorithm (see the preprocessing sketch after this list).
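Here is the preprocessing sketch referred to in point 3 (my own illustration, assuming scikit-learn; the z-score rule and its threshold are choices I made for this example, not prescriptions from the article):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=0.1, size=(20, 2))  # 20 well-behaved points
y = rng.integers(0, 2, size=20)
X[0] = [50.0, 60.0]  # inject an outlier
X[1, 1] = np.nan     # inject a missing value

# 1. Impute missing values.
X = SimpleImputer(strategy="median").fit_transform(X)

# 2. Drop outliers with a simple z-score filter (threshold is illustrative).
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
mask = (z < 2.0).all(axis=1)
X, y = X[mask], y[mask]

# 3. Scale features so no single feature dominates the distance.
X = StandardScaler().fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
```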

How to pick the right value of “K”?

This is the most critical step in this algorithm. Let me reconstruct our example from the previous article here.

Image Source

So, when K=3, we classify the query point under class B, but when K=7, we classify it under class A. This is where the confusion arises: which value of K should we pick?
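A toy sketch of that flip (my own numbers, arranged to mirror the figure; the labels are the query point’s neighbors sorted nearest-first):

```python
from collections import Counter

# Class labels of the query point's neighbors, nearest first.
labels = ["B", "B", "A", "A", "A", "A", "B"]

for k in (3, 7):
    vote = Counter(labels[:k]).most_common(1)[0][0]
    print(f"K={k} -> class {vote}")
# K=3 -> class B  (B wins 2 to 1)
# K=7 -> class A  (A wins 4 to 3)
```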

Let’s first understand a term called decision surfaces.

So, consider the below example:

Image by author

Now, you can observe that there is a blue curve separating the positive points from the negative ones. When we say we are building a model to classify points as positive or negative, we are essentially looking for a function that achieves that for us. The curve you see is nothing but that function: this curve is our model.

Now, except for a couple of points, this curve (our decision surface) separates all the positive and negative points. Further, any query point will be classified as positive or negative depending on which side of the decision surface it lies, as determined by the labels of its nearest neighbors.
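If you want to draw a decision surface yourself, the usual trick is to classify every point on a dense grid and color the regions. Here is a sketch (my own, assuming scikit-learn and matplotlib, on a synthetic dataset):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Predict a class for every point on a dense grid covering the data.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)  # colored regions = decision surface
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.show()
```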

Let's focus our attention on the green point. This is my query point, and the objective is to classify it into one of the two classes.

Let's make this a little more interesting and fun. Consider the three cases below to get a better intuition.

CASE-1

Suppose I pick K=1, meaning I have to find the single nearest neighbor of my query point, and whatever the class label of that 1-NN is, I assign it to the query point. In this case, my 1-NN is a positive point, so I assign a positive class label to the query point. But wait! My query point lies on the negative side of the decision surface. Yes, that’s true, but only from a geometric point of view. When we apply KNN with K=1, the closest neighbor is a positive point (quite possibly noise), so the algorithm classifies the query as positive. A model that chases individual points like this fits the noise in the training data; this is known as overfitting.

CASE-2

Here, I pick K=5, so I need to consider the 5 nearest neighbors. Taking the majority vote among those 5-NN, I classify the query point as negative.

CASE-3

Here, my value of K = n, where n is the total number of points in my dataset. For the sake of understanding, let’s assume I have 1000 data points, of which 600 are negative and 400 are positive, and I set K=1000. This means we’ll consider all 1000 data points as neighbors. Taking the majority vote over all of them, the query point will be classified as negative, since the negative points outnumber the positive ones.

I want to highlight one interesting fact here. Consider the query point marked in blue. This query point sits deep inside the positive cluster, but for K=1000 it will still be classified as negative: in the majority vote, 600 negative points beat 400 positive points. The point I am trying to make is that, since I am treating every data point as a neighbor and the negative class is the majority, every query point will be classified as negative regardless of where it lies. The model has stopped looking at the data altogether; this is known as underfitting.
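Here is a small sketch contrasting the two extremes from CASE-1 and CASE-3 (my own illustration, assuming scikit-learn; the 600/400 split matches the numbers above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))
y = np.array([0] * 600 + [1] * 400)  # 600 negative, 400 positive

# K=1 memorizes the training set: every point is its own nearest neighbor.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn_1.score(X, y))  # 1.0 on training data -> overfitting

# K=n ignores the query entirely and always returns the majority class.
knn_n = KNeighborsClassifier(n_neighbors=n).fit(X, y)
print(knn_n.predict([[100.0, 100.0]]))  # [0], wherever the query lies -> underfitting
```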

Now, you might wonder: all this sounds interesting, but the confusion still prevails. How do we choose the right value of “K”?

The answer to that question is that there is no step-by-step guide that guarantees the best value of “K”.

One way to do this is to train KNN for different values of K and capture a performance metric for each value of K. Say, for example, you pick accuracy as the performance measure: you then plot a graph of accuracy vs. different values of K and select the K at which accuracy is best, as sketched after the figure below.

Image source: Ogunkeye, Fiyinfoba. (2020). How can we find the optimum K in K-Nearest Neighbor?
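A sketch of that procedure (my own illustration, assuming scikit-learn and matplotlib; the dataset and the range of K are choices I made for the example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Train one model per value of K and record validation accuracy.
ks = range(1, 31)
scores = [
    KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    for k in ks
]

plt.plot(ks, scores, marker="o")
plt.xlabel("K")
plt.ylabel("Validation accuracy")
plt.show()

print("Best K:", max(ks, key=lambda k: scores[k - 1]))
```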

Conclusion:

So far, we have covered the failure cases of KNN, its limitations, and how to pick the right value of K. In the next section, we will understand cross-validation and k-fold cross-validation.

Do hit the follow button and feel free to respond in the comments section if you have any queries/suggestions.

Until then, Happy Learning!
