Nearest Neighbour Algorithms

Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.

Nearest neighbour algorithms are a class of algorithms that use some measure of similarity. They rely on the premise that observations which are close to each other (when comparing all of the features) are similar to each other.

Making this assumption, we can do some interesting things like:

  • Making recommendations
  • Finding similar items

More crucially, they provide insight into the character of the data.

This is really useful when you are given some data and you don’t know what the goal is.

Loading the data

The data for this section is all about whisky. The goal is to be able to recommend similar whiskies based upon one that you like. So for whisky lovers, this is your ideal workshop!

It is analogous to more serious datasets too; you could imagine that instead of features about whisky, they could be features about competitors' products or about customers.

# Usual imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
whiskey = pd.read_csv('https://s3.eu-west-2.amazonaws.com/assets.winderresearch.com/data/whiskies.csv')
display(whiskey.head())
print(whiskey.columns)

|   | Distillery | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral | Postcode | Latitude | Longitude |
| 0 | Aberfeldy  | 2 | 2 | 2 | 0 | 0 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | PH15 2EB | 286580 | 749680 |
| 1 | Aberlour   | 3 | 3 | 1 | 0 | 0 | 4 | 3 | 2 | 2 | 3 | 3 | 2 | AB38 9PJ | 326340 | 842570 |
| 2 | AnCnoc     | 1 | 3 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 3 | 2 | AB5 5LI  | 352960 | 839320 |
| 3 | Ardbeg     | 4 | 1 | 4 | 4 | 0 | 0 | 2 | 0 | 1 | 2 | 1 | 0 | PA42 7EB | 141560 | 646220 |
| 4 | Ardmore    | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | AB54 4NH | 355350 | 829140 |

Index(['Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
       'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral',
       'Postcode', ' Latitude', ' Longitude'],
      dtype='object')

We can see that there are several features here (the postcode, latitude and longitude) that we don’t want to leak into our model.

We’re interested in how similar whiskies taste, not how geographically close they are (although we suspect the two will turn out to be related!).

cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
       'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
X = whiskey[cols]
y = whiskey['Distillery']
display(X.head())
display(y.head())

|   | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral |
| 0 | 2 | 2 | 2 | 0 | 0 | 2 | 1 | 2 | 2 | 2 | 2 | 2 |
| 1 | 3 | 3 | 1 | 0 | 0 | 4 | 3 | 2 | 2 | 3 | 3 | 2 |
| 2 | 1 | 3 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 3 | 2 |
| 3 | 4 | 1 | 4 | 4 | 0 | 0 | 2 | 0 | 1 | 2 | 1 | 0 |
| 4 | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 3 | 1 | 1 |

0    Aberfeldy
1     Aberlour
2       AnCnoc
3       Ardbeg
4      Ardmore
Name: Distillery, dtype: object

Distance measures

First let’s define a distance measure.

We defined quite a few in the presentation, but it’s always best to start with the simplest thing possible. So let’s write a euclideanDistance method…

$$ d_{Euclidean}(\mathbf{x}, \mathbf{y}) = \lVert\mathbf{x} - \mathbf{y}\rVert = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} $$

import math
def euclideanDistance(instance1, instance2):
    """Return the Euclidean distance between two equal-length sequences of numbers."""
    distance = 0
    for x in range(len(instance1)):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
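
As an aside (this is our addition, not part of the original workshop), the same quantity can be computed with a single numpy call, which makes a handy cross-check on the hand-rolled loop:

# Optional: an equivalent, vectorised distance using numpy's vector norm (np was imported earlier).
def euclideanDistanceNumpy(instance1, instance2):
    return np.linalg.norm(np.asarray(instance1) - np.asarray(instance2))

print(euclideanDistanceNumpy([0, 0], [1, 1]))  # ≈ 1.414, matching euclideanDistance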

One thing that we haven’t talked much about is test driven development (TDD).

If you’re not a software engineer: TDD is the idea that all code is backed by a test that proves it does what it is intended to do.

TDD states that tests should be written first.

It’s quite handy to sprinkle tests like these throughout a notebook to make sure your custom methods do what they are supposed to do…

from unittest import TestCase, TestLoader, TextTestRunner

class TestDistance(TestCase):
    def testSimple(self):
        self.assertAlmostEqual(euclideanDistance([0], [1]), 1.0, places=1)
    def test2D(self):
        self.assertAlmostEqual(euclideanDistance([0, 0], [1, 1]), 1.4, places=1)

suite = TestLoader().loadTestsFromTestCase(TestDistance)
TextTestRunner(verbosity=2).run(suite);
test2D (__main__.TestDistance) ... ok
testSimple (__main__.TestDistance) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.006s

OK

Nearest neighbour

Now that we have a measure of distance, let’s write an algorithm to find the nearest neighbours…

def getNeighbors(trainingSet, instance, k):
    """Return the indices of the k nearest neighbours to `instance` within `trainingSet`."""
    distances = []
    for x in range(len(trainingSet)):
        dist = euclideanDistance(instance, trainingSet[x])
        distances.append(dist)
    locs = np.argsort(distances)  # indices sorted by increasing distance
    return locs[:k]

To test whether this works, let’s define a test instance that we want to compare.

We expect the test instance itself to come back as the nearest match, followed by its closest neighbours.

testInstance = X.loc[y == 'Laphroig']
display(testInstance)

|    | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral |
| 58 | 4 | 2 | 4 | 4 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |

neighbors = getNeighbors(X.to_numpy(), testInstance.to_numpy()[0], 5)
print(y[neighbors])
58     Laphroig
57    Lagavulin
3        Ardbeg
23    Clynelish
21     Caol Ila
Name: Distillery, dtype: object

Great. If you know the domain (I know a little, at least about smoky whiskies :-) ), you can see that these are pretty reasonable recommendations.
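
If you happen to have scikit-learn available, you can also cross-check the hand-rolled approach against its NearestNeighbors class — a minimal sketch (our addition, not part of the original notebook):

# Optional cross-check with scikit-learn's NearestNeighbors (assumes scikit-learn is installed).
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, metric='euclidean').fit(X.to_numpy())
_, idx = nn.kneighbors(testInstance.to_numpy())
print(y[idx[0]])  # should list Laphroig first, then the same neighbours as above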

Tasks:

  • Generate some recommendations for whiskies based upon the ones that you like!
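
To get started on that task, here is one possible helper (the recommend function is our own sketch, not part of the original workshop) that wraps getNeighbors so you can query by distillery name:

def recommend(distillery, k=5):
    """Return the k whiskies whose taste profile is closest to the named distillery."""
    instance = X.loc[y == distillery].to_numpy()[0]
    # Ask for k + 1 neighbours because the query whisky is always its own nearest neighbour.
    neighbors = getNeighbors(X.to_numpy(), instance, k + 1)
    matches = y.iloc[neighbors]
    return matches[matches != distillery]

print(recommend('Aberlour'))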
