Nearest Neighbour Algorithms

Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.

Nearest neighbour algorithms are a class of algorithms that use some measure of similarity. They rely on the premise that observations which are close to each other (when comparing all of the features) are similar to each other.

Making this assumption, we can do some interesting things like:

  • Making recommendations
  • Finding similar items

More crucially, they provide insight into the character of the data.

This is really useful when you are given some data and you don’t know what the goal is.

Loading the data

The data for this section is all about whisky. The goal is to be able to recommend similar whiskies based upon one that you like. So for whisky lovers, this is your ideal workshop!

It is analogous to more serious datasets too; you could imagine that instead of features about whisky, they could be features about competitors' products or about customers.

# Usual imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
whiskey = pd.read_csv('https://s3.eu-west-2.amazonaws.com/assets.winderresearch.com/data/whiskies.csv')
display(whiskey.head())
print(whiskey.columns)

|   | Distillery | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral | Postcode | Latitude | Longitude |
| 0 | Aberfeldy  | 2 | 2 | 2 | 0 | 0 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | PH15 2EB | 286580 | 749680 |
| 1 | Aberlour   | 3 | 3 | 1 | 0 | 0 | 4 | 3 | 2 | 2 | 3 | 3 | 2 | AB38 9PJ | 326340 | 842570 |
| 2 | AnCnoc     | 1 | 3 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 3 | 2 | AB5 5LI  | 352960 | 839320 |
| 3 | Ardbeg     | 4 | 1 | 4 | 4 | 0 | 0 | 2 | 0 | 1 | 2 | 1 | 0 | PA42 7EB | 141560 | 646220 |
| 4 | Ardmore    | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | AB54 4NH | 355350 | 829140 |

Index(['Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
       'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral',
       'Postcode', ' Latitude', ' Longitude'],
      dtype='object')

We can see that there are several features here (the postcode, latitude and longitude) that we don’t want to leak into our model.

We’re interested in how similar whiskies taste, not how geographically close they are (although we suspect the two will turn out to be related!).

cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
       'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
X = whiskey[cols]
y = whiskey['Distillery']
display(X.head())
display(y.head())

|   | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral |
| 0 | 2 | 2 | 2 | 0 | 0 | 2 | 1 | 2 | 2 | 2 | 2 | 2 |
| 1 | 3 | 3 | 1 | 0 | 0 | 4 | 3 | 2 | 2 | 3 | 3 | 2 |
| 2 | 1 | 3 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 3 | 2 |
| 3 | 4 | 1 | 4 | 4 | 0 | 0 | 2 | 0 | 1 | 2 | 1 | 0 |
| 4 | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 3 | 1 | 1 |

0    Aberfeldy
1     Aberlour
2       AnCnoc
3       Ardbeg
4      Ardmore
Name: Distillery, dtype: object

Distance measures

First let’s define a distance measure.

We defined quite a few in the presentation, but it’s always best to start with the simplest thing possible. So let’s write a euclideanDistance method…

$$ d_{Euclidean}(\mathbf{x}, \mathbf{y}) = \lVert\mathbf{x} - \mathbf{y}\rVert = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} $$

import math
def euclideanDistance(instance1, instance2):
    """Return the Euclidean distance between two equal-length sequences of numbers."""
    distance = 0
    for x in range(len(instance1)):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
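
As an aside (this is our addition, not part of the original workshop), the same quantity can be computed with a single numpy call, which makes a handy cross-check on the hand-rolled loop:

# Optional: an equivalent, vectorised distance using numpy's vector norm (np was imported earlier).
def euclideanDistanceNumpy(instance1, instance2):
    return np.linalg.norm(np.asarray(instance1) - np.asarray(instance2))

print(euclideanDistanceNumpy([0, 0], [1, 1]))  # ≈ 1.414, matching euclideanDistance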

One thing that we haven’t talked much about is test driven development (TDD).

If you’re not a software engineer: TDD is the idea that all code is backed by a test that proves it does what it is intended to do.

TDD states that tests should be written first.

It’s quite handy to sprinkle tests like these throughout a notebook to make sure your custom methods do what they are supposed to do…

from unittest import TestCase, TestLoader, TextTestRunner

class TestDistance(TestCase):
    def testSimple(self):
        self.assertAlmostEqual(euclideanDistance([0], [1]), 1.0, places=1)
    def test2D(self):
        self.assertAlmostEqual(euclideanDistance([0, 0], [1, 1]), 1.4, places=1)

suite = TestLoader().loadTestsFromTestCase(TestDistance)
TextTestRunner(verbosity=2).run(suite);
test2D (__main__.TestDistance) ... ok
testSimple (__main__.TestDistance) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.006s

OK

Nearest neighbour

Now that we have a measure of distance, let’s write an algorithm to find the nearest neighbours…

def getNeighbors(trainingSet, instance, k):
    """Return the indices of the k nearest neighbours to `instance` within `trainingSet`."""
    distances = []
    for x in range(len(trainingSet)):
        dist = euclideanDistance(instance, trainingSet[x])
        distances.append(dist)
    locs = np.argsort(distances)  # indices sorted by increasing distance
    return locs[:k]

To test whether this works, let’s define a test instance that we want to compare.

We expect the test instance itself to come back as the nearest match, followed by its closest neighbours.

testInstance = X.loc[y == 'Laphroig']
display(testInstance)

|    | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral |
| 58 | 4 | 2 | 4 | 4 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |

neighbors = getNeighbors(X.to_numpy(), testInstance.to_numpy()[0], 5)
print(y[neighbors])
58     Laphroig
57    Lagavulin
3        Ardbeg
23    Clynelish
21     Caol Ila
Name: Distillery, dtype: object

Great. If you know the domain (I know a little, at least about smoky whiskies :-) ), you can see that these are pretty reasonable recommendations.
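
If you happen to have scikit-learn available, you can also cross-check the hand-rolled approach against its NearestNeighbors class — a minimal sketch (our addition, not part of the original notebook):

# Optional cross-check with scikit-learn's NearestNeighbors (assumes scikit-learn is installed).
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, metric='euclidean').fit(X.to_numpy())
_, idx = nn.kneighbors(testInstance.to_numpy())
print(y[idx[0]])  # should list Laphroig first, then the same neighbours as above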

Tasks:

  • Generate some recommendations for whiskies based upon the ones that you like!
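
To get started on that task, here is one possible helper (the recommend function is our own sketch, not part of the original workshop) that wraps getNeighbors so you can query by distillery name:

def recommend(distillery, k=5):
    """Return the k whiskies whose taste profile is closest to the named distillery."""
    instance = X.loc[y == distillery].to_numpy()[0]
    # Ask for k + 1 neighbours because the query whisky is always its own nearest neighbour.
    neighbors = getNeighbors(X.to_numpy(), instance, k + 1)
    matches = y.iloc[neighbors]
    return matches[matches != distillery]

print(recommend('Aberlour'))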
