So let’s walk through a very visual, intuitive example to help describe what all data science algorithms are trying to do.

*This will seem quite complicated if you’ve never done anything like this before. That’s ok!*

I want to do this to show you that all algorithms that you’ve every heard of have some very basic assumption of what they are trying to do.

At the end of this, we will have completely derived one very important type of classifier.

Imagine that we had a classification problem. We want perform model induction so that we can
generate a predictive model to use to decide whether a new beer belongs to group `A`

or group `B`

(a
binary classification problem).

We have a dataset where the targets have been specified, so this is a supervised task.

One way to solve this is to arbitrarily split the data according to some rule.

E.g. is `feature 1 > 0.5`

Then we can test to see how well we have split the data.

We do this by measuring how mixed the resulting classes are.

To make this more concrete, consider the following data:

Alc. Volume | Colour | Class |
---|---|---|

4.2 | 92 | A |

6.4 | 102 | A |

3.5 | 3 | B |

4.7 | 10 | B |

Note:

- the features
- the differing scales
- the classes

So let’s think how to generate a model. We want to segment the data so that the classes are *clean*.

By clean, I mean pure, homogeneous, clear cut.

If we sliced the data and we ended up with two classes, all A’s in one class and all B’s in the other, then the result is as pure as it can possibly be.

Obviously we can do this visually. We can *see* a *predictive model* that would solve the problem.

But real life isn’t so easy.

We need a way of measuring the purity of a group. This is known as a *purity measure*. Thankfully
we’ve already encountered one purity measure. Entropy.

$$H=-\sum(p_i \log_2 (p_i))$$

Where `\(p_i\)`

is the probability that the observation belongs to class `\(i\)`

.

For example, if we have two classes:

$$H=-p_1 \log_2 (p_1)-p_2 \log_2 (p_2)$$

Thought experiments:

- flip a coin
- throw a die
- pick a card from a pack

How did you calculate that probability?

A bit more difficult:

- What is the probability that you will next see a woman or a man?
- Given the previous data (two A’s and two B’s) what is the probability that you would pick an A? What about a B?

Consider the previous example where we had two classes, with two instances in each class:

$$H=-p_1 \log_2 (p_1)-p_2 \log_2 (p_2)$$

$$H=-0.5 \log_2 (0.5)-0.5 \log_2 (0.5)$$

$$H=1$$

This is a very impure dataset. Given the observation, this data has high entropy and contains a maximal amount of information.

Entropy measures the amount of information in a given sample.

Looking back to the goal, we want to partition into groups so that when a new observation arrives we can guess which group that belongs to.

To choose the best partition we need to measure which split would give the best result.

We can do that by trying different splits and then measure the entropy again with the goal of making the groups more pure.

We can measure how much purer the new split becomes by computing the *information gain*

*Note: Draw a picture of this on the whiteboard.*

The information gain is defined as the parent entropy minus the summed entropy of the subgroups.

\begin{align} IG(parent, children) = & entropy(parent) - \nonumber \\ & \left(p(c_1)entropy(c_1) + p(c_2)entropy(c_2) + …\right) \end{align}

Reading the equation, this is the weighted entropy of each new group subtracted from the parent entropy.

I.e. we are calculating the reduction in entropy by creating two new classes.

In the previous example the parent’s entropy was $1$. Lets try splitting the classes in two ways.

Alc. Volume | Colour | Class |
---|---|---|

4.2 | 92 | A |

6.4 | 102 | A |

3.5 | 3 | B |

4.7 | 10 | B |

First Let’s split by an `Alc. Volume`

of 5.0. This leaves the data:

**< 5.0**

Alc. Volume | Colour | Class |
---|---|---|

4.2 | 92 | A |

3.5 | 3 | B |

4.7 | 10 | B |

So, A = 1/3, B = 2/3

**>= 5.0**

Alc. Volume | Colour | Class |
---|---|---|

6.4 | 102 | A |

and, A = 1/1, B = 0/1

So, given the entropy of the parent (`1.0`

) we can now calculate the entropy of the two children:

\begin{align} H & = -p_1 \log_2 (p_1)-p_2 \log_2 (p_2) \\ entropy(alc < 5.0) & = -0.33 \log_2 (0.33)-0.66 \log_2 (0.66) \\ entropy(alc < 5.0) & = 0.92 \\ entropy(alc > 5.0) & = -1 \log_2 (1) \\ entropy(alc > 5.0) & = 0 \\ \end{align}

And calculate the information gain as:

\begin{align} IG(parent, children) = & entropy(parent) - \\ & \left(p(c_1)entropy(c_1) + p(c_2)entropy(c_2) + …\right) \\ = & 1 - \left(\frac{3}{4} \times 0.92 + \frac{1}{4} \times 0\right) \\ = & 0.31 \\ \end{align}

We have an information gain of `0.31`

with this split. Let’s consider another split.

Let’s split by an `Colour`

of 50. This leaves the data:

**< 50**

Alc. Volume | Colour | Class |
---|---|---|

3.5 | 3 | B |

4.7 | 10 | B |

So, A = 0/2, B = 2/2

**>= 50**

Alc. Volume | Colour | Class |
---|---|---|

4.2 | 92 | A |

6.4 | 102 | A |

and, A = 2/2, B = 0/2

\begin{align} H & = -p_1 \log_2 (p_1)-p_2 \log_2 (p_2) \\ entropy(alc < 5.0) & = -1 \log_2 (1)-0 \\ entropy(alc < 5.0) & = 0 \\ entropy(alc > 5.0) & = -1 \log_2 (1) \\ entropy(alc > 5.0) & = 0 \end{align}

And calculate the information gain as:

\begin{align} IG(parent, children) = & entropy(parent) - \\ & \left(p(c_1)entropy(c_1) + p(c_2)entropy(c_2) + …\right) \\ = & 1 - \left(\frac{2}{4} \times 0 + \frac{2}{4} \times 0\right) \\ = & 1 \end{align}

An information gain of `1`

. Comparing the two splits, the first only increased the amount of
information by `0.31`

. This split increases the IG by a full `1.0`

. Clearly, we would choose the
second split to split out data.