Nearest Neighbour Algorithms
Nearest Neighbour Algorithms
Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.
Nearest neighbour algorithms are a class of algorithms that use some measure of similarity. They rely on the premise that observations which are close to each other (when comparing all of the features) are similar to each other.
Making this assumption, we can do some interesting things like:
- Recommendations
- Find similar stuff
But more crucially, they provide an insight into the character of the data.
This is really useful when you are given some data and you don’t know what the goal is.
Loading the data
The data for this section is all about Whiskey. The goal is to be able to recommend similar whiskeys based upon one that you like. So for whiskey lovers, this is your ideal workshop!
It is analogous to serious datasets too; you could imagine that instead of features about the whiskey, they could be features about competitors products or about customers.
# Usual imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
whiskey = pd.read_csv('https://s3.eu-west-2.amazonaws.com/assets.winderresearch.com/data/whiskies.csv')
display(whiskey.head())
print(whiskey.columns)
Distillery | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral | Postcode | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aberfeldy | 2 | 2 | 2 | 0 | 0 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | PH15 2EB | 286580 | 749680 |
1 | Aberlour | 3 | 3 | 1 | 0 | 0 | 4 | 3 | 2 | 2 | 3 | 3 | 2 | AB38 9PJ | 326340 | 842570 |
2 | AnCnoc | 1 | 3 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 3 | 2 | AB5 5LI | 352960 | 839320 |
3 | Ardbeg | 4 | 1 | 4 | 4 | 0 | 0 | 2 | 0 | 1 | 2 | 1 | 0 | PA42 7EB | 141560 | 646220 |
4 | Ardmore | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | AB54 4NH | 355350 | 829140 |
Index(['Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral',
'Postcode', ' Latitude', ' Longitude'],
dtype='object')
We can see that there are several features here that we don’t want to leak into our dataset.
We’re interested in how close the taste of whiskies are, not how geographically close (although we think that this will be the result!)
cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
X = whiskey[cols]
y = whiskey['Distillery']
display(X.head())
display(y.head())
Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 2 | 2 | 0 | 0 | 2 | 1 | 2 | 2 | 2 | 2 | 2 |
1 | 3 | 3 | 1 | 0 | 0 | 4 | 3 | 2 | 2 | 3 | 3 | 2 |
2 | 1 | 3 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 3 | 2 |
3 | 4 | 1 | 4 | 4 | 0 | 0 | 2 | 0 | 1 | 2 | 1 | 0 |
4 | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 2 | 3 | 1 | 1 |
0 Aberfeldy
1 Aberlour
2 AnCnoc
3 Ardbeg
4 Ardmore
Name: Distillery, dtype: object
Distance measures
First let’s define a distance measure.
We defined quite a few in the presentation, but it’s always best to start with the simplest thing possible. So let’s write a euclieanDistance method…
$$ d_{Euclidean}(\mathbf{x}, \mathbf{y}) = ||\mathbf{x} - \mathbf{y}||=\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + …} $$
import math
def euclideanDistance(instance1, instance2):
distance = 0
for x in range(len(instance1)):
distance += pow((instance1[x] - instance2[x]), 2)
return math.sqrt(distance)
One thing that we haven’t talked much about is test driven development (TDD).
(If you’re not all software engineers) TDD is the idea that all code is backed by a test to prove that it does what it is intended to do.
TDD states that tests should be written first.
It’s quite handy to sprinkle these throughout a notebook to make sure your custom methods do what they are supposed to do…
from unittest import *
class TestDistance(TestCase):
def testSimple(self):
self.assertAlmostEqual(euclideanDistance([0], [1]), 1.0, places=1)
def test2D(self):
self.assertAlmostEqual(euclideanDistance([0, 0], [1, 1]), 1.4, places=1)
suite = TestLoader().loadTestsFromModule(TestDistance())
TextTestRunner(verbosity=2).run(suite) ;
test2D (__main__.TestDistance) ... ok
testSimple (__main__.TestDistance) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.006s
OK
Nearest neighbour
Now we have a measure of distance, let’s write an algorithm to get the nearest neighbours..
import operator
def getNeighbors(trainingSet, instance, k):
"""Return the first k locations of the nearest neighbours to an instance"""
distances = []
for x in range(len(trainingSet)):
dist = euclideanDistance(instance, trainingSet[x])
distances.append(dist)
locs = np.argsort(distances)
return locs[:k]
To test whether this works, let’s define a test instance that we want to compare.
We expect that the test instance is the nearest match, then some more matches.
testInstance = X.loc[y == 'Laphroig']
display(testInstance)
Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | 4 | 2 | 4 | 4 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
neighbors = getNeighbors(X.as_matrix(), testInstance.as_matrix()[0], 5)
print(y[neighbors])
58 Laphroig
57 Lagavulin
3 Ardbeg
23 Clynelish
21 Caol Ila
Name: Distillery, dtype: object
Great. If you know your domain (I know a little, at least about smokey whiskies :-) ) then we can see that these are pretty reasonable recommendations.
Tasks:
- Generate some recommendations for whiskies based upon the ones that you like!