602: Nearest Neighbour Classification and Regression

More than just similarities

  • Classification: Predict the same class as the nearest observations
  • Regression: Predict the same value as the nearest observations

???

Remember that for classification tasks, we want to predict a class for a new observation.

What we could do is predict a class that is the same as the nearest neighbour. Simple!

For regression tasks, we need to predict a value. Again, we could use the value of the nearest neighbour! Simple again!
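To make the idea concrete, here is a minimal sketch of single nearest neighbour prediction. The toy data, the function name `nearest_neighbour` and the choice of Euclidean distance are assumptions made for illustration only.

```python
import math

def nearest_neighbour(train_points, query):
    """Return the index of the training point closest to `query` (Euclidean distance)."""
    return min(range(len(train_points)), key=lambda i: math.dist(train_points[i], query))

# Hypothetical toy data: two features per observation.
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
classes = ["a", "a", "b", "b"]       # labels for the classification task
values = [10.0, 12.0, 30.0, 33.0]    # targets for the regression task

query = (5.2, 4.8)
idx = nearest_neighbour(points, query)
predicted_class = classes[idx]   # classification: copy the neighbour's class
predicted_value = values[idx]    # regression: copy the neighbour's value
print(predicted_class, predicted_value)  # b 30.0
```

The same neighbour search serves both tasks; only what we copy from the neighbour changes.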


Nearest neighbour classification

Nearest neighbour regression

???

We’ve used the nearest neighbour in these examples to demonstrate the idea.

But in reality we don’t want to use a single neighbour. It could be noise.

Instead, we can combine the results of several neighbours. For example, in classification we would predict the class held by the majority of the nearest neighbours. For regression we might take the average of their values.

Equally, we could estimate the class probability by looking at the proportion of the nearest neighbours belonging to the predicted class.

(Beware of class probabilities on small numbers of observations!)
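A rough sketch of how several neighbours could be combined, assuming Euclidean distance and unweighted votes. The function names and toy data are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Indices of the k training points closest to `query` (Euclidean distance)."""
    order = sorted(range(len(train_points)), key=lambda i: math.dist(train_points[i], query))
    return order[:k]

def knn_classify(train_points, labels, query, k):
    """Majority class among the k neighbours, plus the proportion of votes per class."""
    neighbours = k_nearest(train_points, query, k)
    votes = Counter(labels[i] for i in neighbours)
    proportions = {label: count / len(neighbours) for label, count in votes.items()}
    return votes.most_common(1)[0][0], proportions

def knn_regress(train_points, targets, query, k):
    """Mean target value of the k neighbours."""
    neighbours = k_nearest(train_points, query, k)
    return sum(targets[i] for i in neighbours) / len(neighbours)

# Hypothetical toy data.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (5.0, 5.0), (6.0, 5.5), (3.0, 3.0)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [10.0, 12.0, 11.0, 30.0, 33.0, 20.0]

label, proportions = knn_classify(points, labels, (2.5, 2.5), k=3)
value = knn_regress(points, targets, (2.5, 2.5), k=3)
```

Here the single closest point belongs to class "b", but two of the three nearest neighbours are "a", so the vote predicts "a" with an estimated probability of 2/3. With only three votes, that probability can only ever be 0, 1/3, 2/3 or 1, which is why small neighbourhoods give crude estimates.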

Generally, when we use more than one neighbour, the algorithm is called k-nearest neighbours, shortened to k-NN, where k refers to the number of neighbours used.

In general, higher values of k perform more averaging and give “smoother” results.
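A small sketch of the smoothing effect, using made-up one-dimensional data where the target equals the feature except for one noisy observation. The function name `knn_regress_1d` is an assumption for this example.

```python
def knn_regress_1d(xs, ys, query, k):
    """Average the targets of the k training points closest to `query`."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return sum(ys[i] for i in order[:k]) / k

xs = list(range(10))
ys = [float(x) for x in xs]
ys[5] = 50.0   # one noisy observation; the underlying trend would give 5.0

# Predict at the noisy point with increasing amounts of averaging.
predictions = {k: knn_regress_1d(xs, ys, 5, k) for k in (1, 3, 5, 10)}
print(predictions)  # {1: 50.0, 3: 20.0, 5: 14.0, 10: 9.0}
```

With k = 1 the prediction reproduces the noise exactly. Larger values of k pull it back towards the trend. With k = 10, every training point is a neighbour, so the model predicts the overall mean of 9.0 for any query.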

Let’s take another look at the “iris” dataset with various values of k.


iris_knn

Bias and variance part deux

What we’re seeing here is overfitting and over-generalisation in practice.

By choosing low values of k we are highly sensitive to outliers. We overfit.

But by choosing high values of k, the model might not be complex enough to represent the data. We underfit.


How do we pick a value of k?

We saw before how to use validation to make sure we are not over- or underfitting.

We can do the same here, by varying the value of k and validating the result.
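As a sketch of what this could look like, the following tries several values of k against a held-out validation set. The synthetic data, the 80/40 split and the candidate values are all assumptions for illustration. In practice you would use your own data and probably cross-validation.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """`train` is a list of (point, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train, validation, k):
    """Fraction of validation observations whose class is predicted correctly."""
    hits = sum(knn_predict(train, point, k) == label for point, label in validation)
    return hits / len(validation)

# Hypothetical synthetic data: two overlapping Gaussian blobs, one per class.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(60)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), "b") for _ in range(60)]
rng.shuffle(data)
train, validation = data[:80], data[80:]

# Odd values only, so a two-class vote cannot be tied.
candidate_ks = [1, 3, 5, 7, 9, 11, 15, 21]
scores = {k: validation_accuracy(train, validation, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The value of k is chosen from the validation scores only. The training set supplies the neighbours, and a separate test set would still be needed to report final performance.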

But be careful with the choice of k. It should be…


  • coprime with the number of classes

E.g. with two classes and k = 6, the vote could be tied three against three.

  • greater than or equal to the number of classes plus one

To give each class the opportunity to have its say.

  • High enough to avoid spurious results
  • Low enough to avoid always picking the most common class
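The tie problem can be checked by brute force. This sketch enumerates every way k votes could be spread over the classes and reports whether the top count can be shared. The function name `tie_possible` is made up for this example.

```python
from itertools import combinations_with_replacement

def tie_possible(k, n_classes):
    """True if k votes can be spread over n_classes (>= 2) so that the top count is shared."""
    for voted in combinations_with_replacement(range(n_classes), k):
        counts = sorted((voted.count(c) for c in range(n_classes)), reverse=True)
        if counts[0] == counts[1]:
            return True
    return False

print(tie_possible(6, 2))  # True: a 3-3 split
print(tie_possible(5, 2))  # False: an odd k cannot split evenly in two
print(tie_possible(5, 3))  # True: a 2-2-1 split
```

Note that with more than two classes, a coprime k rules out a perfectly even split across all classes, but the leading classes can still tie. An implementation still needs a tie-breaking rule, for example preferring the class of the closest neighbour.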

Pros/Cons

The pros of using k-NN are obvious and were stated at the beginning.

  • The algorithm is very simple to understand.
  • It is very flexible; you can use it for similarity matching, classification or regression.
  • It is easy to tune; there’s often only a single parameter.

However, there are issues, which we discuss next.


Justification

  • How can you justify the result?

E.g.

“your mortgage application was declined because of your similarity to three people that defaulted, who were also Danish”

  • We’ve created a complex look-up table. It doesn’t improve our knowledge of the data.

???

First, and this doesn’t necessarily apply only to k-NN, how do you explain and justify the result?

Netflix justifies their k-NN recommendation by saying “you might like ‘I’m Alan Partridge’, because you liked both ‘Brass Eye’ and ‘Borat’”. And you’re probably ok with this justification.

But suppose you were declined a mortgage with the explanation “your application has been declined because your circumstances (where you live) were similar to those of people who have defaulted”. This might not be as easy to swallow.

Furthermore, it is harder to learn something from the model, because we’re not actually modelling anything. We’ve basically created a sophisticated look-up table. Stakeholders may not like this result.

