Nearest Neighbour Algorithms

Nearest neighbour algorithms are useful for all kinds of problems. This python notebook explains what it is and how to use it.

MACHINE LEARNING WORKSHOP

Nearest Neighbour Algorithms

Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.

Nearest neighbour algorithms are a class of algorithms that use some measure of similarity. They rely on the premise that observations which are close to each other (when comparing all of the features) are similar to each other.

Making this assumption, we can do some interesting things like:

  • Recommendations
  • Find similar stuff

But more crucially, they provide an insight into the character of the data.

This is really useful when you are given some data and you don’t know what the goal is.

Loading the data

The data for this section is all about Whiskey. The goal is to be able to recommend similar whiskeys based upon one that you like. So for whiskey lovers, this is your ideal workshop!

It is analogous to serious datasets too; you could imagine that instead of features about the whiskey, they could be features about competitors products or about customers.

# Usual imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
whiskey = pd.read_csv('https://s3.eu-west-2.amazonaws.com/assets.winderresearch.com/data/whiskies.csv')
display(whiskey.head())
print(whiskey.columns)

Distillery Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty Malty Fruity Floral Postcode Latitude Longitude
0 Aberfeldy 2 2 2 0 0 2 1 2 2 2 2 2 PH15 2EB 286580 749680
1 Aberlour 3 3 1 0 0 4 3 2 2 3 3 2 AB38 9PJ 326340 842570
2 AnCnoc 1 3 2 0 0 2 0 0 2 2 3 2 AB5 5LI 352960 839320
3 Ardbeg 4 1 4 4 0 0 2 0 1 2 1 0 PA42 7EB 141560 646220
4 Ardmore 2 2 2 0 0 1 1 1 2 3 1 1 AB54 4NH 355350 829140
Index(['Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
       'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral',
       'Postcode', ' Latitude', ' Longitude'],
      dtype='object')

We can see that there are several features here that we don’t want to leak into our dataset.

We’re interested in how close the taste of whiskies are, not how geographically close (although we think that this will be the result!)

cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
       'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
X = whiskey[cols]
y = whiskey['Distillery']
display(X.head())
display(y.head())

Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty Malty Fruity Floral
0 2 2 2 0 0 2 1 2 2 2 2 2
1 3 3 1 0 0 4 3 2 2 3 3 2
2 1 3 2 0 0 2 0 0 2 2 3 2
3 4 1 4 4 0 0 2 0 1 2 1 0
4 2 2 2 0 0 1 1 1 2 3 1 1
0    Aberfeldy
1     Aberlour
2       AnCnoc
3       Ardbeg
4      Ardmore
Name: Distillery, dtype: object

Distance measures

First let’s define a distance measure.

We defined quite a few in the presentation, but it’s always best to start with the simplest thing possible. So let’s write a euclieanDistance method…

$$ d_{Euclidean}(\mathbf{x}, \mathbf{y}) = ||\mathbf{x} - \mathbf{y}||=\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + …} $$

import math
def euclideanDistance(instance1, instance2):
    distance = 0
    for x in range(len(instance1)):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

One thing that we haven’t talked much about is test driven development (TDD).

(If you’re not all software engineers) TDD is the idea that all code is backed by a test to prove that it does what it is intended to do.

TDD states that tests should be written first.

It’s quite handy to sprinkle these throughout a notebook to make sure your custom methods do what they are supposed to do…

from unittest import *
class TestDistance(TestCase):
    def testSimple(self):
        self.assertAlmostEqual(euclideanDistance([0], [1]), 1.0, places=1)
    def test2D(self):
        self.assertAlmostEqual(euclideanDistance([0, 0], [1, 1]), 1.4, places=1)

suite = TestLoader().loadTestsFromModule(TestDistance())
TextTestRunner(verbosity=2).run(suite) ;
test2D (__main__.TestDistance) ... ok
testSimple (__main__.TestDistance) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.006s

OK

Nearest neighbour

Now we have a measure of distance, let’s write an algorithm to get the nearest neighbours..

import operator
def getNeighbors(trainingSet, instance, k):
    """Return the first k locations of the nearest neighbours to an instance"""
    distances = []
    for x in range(len(trainingSet)):
        dist = euclideanDistance(instance, trainingSet[x])
        distances.append(dist)
    locs = np.argsort(distances)
    return locs[:k]

To test whether this works, let’s define a test instance that we want to compare.

We expect that the test instance is the nearest match, then some more matches.

testInstance = X.loc[y == 'Laphroig']
display(testInstance)

Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty Malty Fruity Floral
58 4 2 4 4 1 0 0 1 1 1 0 0
neighbors = getNeighbors(X.as_matrix(), testInstance.as_matrix()[0], 5)
print(y[neighbors])
58     Laphroig
57    Lagavulin
3        Ardbeg
23    Clynelish
21     Caol Ila
Name: Distillery, dtype: object

Great. If you know your domain (I know a little, at least about smokey whiskies :-) ) then we can see that these are pretty reasonable recommendations.

Tasks:

  • Generate some recommendations for whiskies based upon the ones that you like!