There’s a fairly famous dataset called the “mushroom dataset”.
It describes whether mushrooms are edible or not, depending on an array of features.
The nice thing about this dataset is that the features are all categorical.
So we can go through and segment the data for each value in a feature.
This is some example data:
For each feature, we can then plot the information gain visually, which helps build intuition. On the x axis we have the proportion of the sample with each feature value; on the y axis, the calculated entropy of the label (poisonous or not) within that subgroup.
This is the information gain for the “cap-shape” feature:
And this is the information gain for the “odour” feature:
We can visually see that the second feature has a much smaller entropy × probability area, hence the information gain is much greater.
We would certainly consider using this feature over the other.
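The same comparison can be made numerically. Below is a minimal sketch (the toy rows and their values are invented for illustration; the real dataset has many more rows and features) that computes the probability-weighted entropy of each subgroup and subtracts it from the parent entropy:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, feature, label):
    """Parent entropy minus the probability-weighted entropy of each value."""
    parent = entropy([r[label] for r in rows])
    total = len(rows)
    weighted = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[label] for r in rows if r[feature] == value]
        weighted += (len(subset) / total) * entropy(subset)
    return parent - weighted

# Toy mushroom-style rows (values invented for illustration).
rows = [
    {"odour": "foul",  "cap-shape": "flat", "class": "poisonous"},
    {"odour": "foul",  "cap-shape": "bell", "class": "poisonous"},
    {"odour": "none",  "cap-shape": "flat", "class": "edible"},
    {"odour": "none",  "cap-shape": "bell", "class": "edible"},
    {"odour": "none",  "cap-shape": "flat", "class": "edible"},
    {"odour": "anise", "cap-shape": "bell", "class": "poisonous"},
]

print(information_gain(rows, "odour", "class"))      # 1.0: a perfect split
print(information_gain(rows, "cap-shape", "class"))  # ~0.08: barely informative
```

On this toy data the odour feature separates the classes completely, so its gain equals the full parent entropy.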
Segmentation and quantitative measures of information are fundamental to data science.
If we select the most informative feature then we have segmented the data. It is then quite simple to use that segmentation rule (e.g. < or > 50) to predict the class of a new observation.
But what if a single split didn’t completely separate the classes?
One simple solution is to stack segmentation rules. That is, perform the split again on each subgroup until the classes are pure.
This is known as a tree.
A tree has several components: a root node, internal decision nodes that test a feature, branches for each outcome, and leaf (terminal) nodes that hold a class.
Trees are often used as predictive models.
We can continue performing the splits and information gain calculations until all the terminal nodes are pure.
Then if we see a new observation, we can apply the same rules to it and predict its class.
When trees are used to make a decision they are called decision trees. Our first classification algorithm!
Pat yourself on the back. Pat your neighbour's back. You've just derived a very important algorithm!
They are popular because:
Given some data, we can build an optimal tree structure for our problem. This is called tree induction.
The goal of the tree is to provide supervised segmentation: given some labelled data, find rules, based upon the features, that split the data into subgroups with similar label values.
We performed this task manually when discussing information gain. We can iteratively select the best split that maximises the information gain.
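That iterative selection can be sketched recursively. This is an illustrative toy implementation of tree induction for categorical features (the function names, the rows, and their values are my own invention, not from any library):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, feature, label):
    """Parent entropy minus the probability-weighted entropy of each split."""
    parent = entropy([r[label] for r in rows])
    total = len(rows)
    weighted = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[label] for r in rows if r[feature] == value]
        weighted += (len(subset) / total) * entropy(subset)
    return parent - weighted

def build_tree(rows, features, label):
    """Recursively pick the most informative feature until nodes are pure."""
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(features, key=lambda f: information_gain(rows, f, label))
    rest = [f for f in features if f != best]
    return {(best, value): build_tree([r for r in rows if r[best] == value],
                                      rest, label)
            for value in {r[best] for r in rows}}

def classify(tree, row):
    """Follow the rules down to a leaf for a new observation."""
    while isinstance(tree, dict):
        (feature, _), _ = next(iter(tree.items()))
        tree = tree[(feature, row[feature])]
    return tree

# Toy mushroom-style rows (values invented for illustration).
rows = [
    {"odour": "foul",  "cap-shape": "flat", "class": "poisonous"},
    {"odour": "foul",  "cap-shape": "bell", "class": "poisonous"},
    {"odour": "none",  "cap-shape": "flat", "class": "edible"},
    {"odour": "none",  "cap-shape": "bell", "class": "edible"},
    {"odour": "anise", "cap-shape": "bell", "class": "poisonous"},
]

tree = build_tree(rows, ["odour", "cap-shape"], "class")
print(classify(tree, {"odour": "none", "cap-shape": "flat"}))  # edible
```

Note that each recursive call only considers the rows that reached that node, which is exactly the "stacked segmentation rules" idea above.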
Once a rule has been derived from the data, it forms a decision boundary.
If there is only one feature, then the boundary is simply a point. If there are two features, then the boundary is a line. If there are three it becomes a surface. Etc.
In general, the boundary is known as a hyperplane, which simply means a separating surface in any number of dimensions.
One of the beauties of tree-based classification algorithms is that they are purely logical.
It is easy to convert simple trees into a set of rules.
For example, our beer example could be summarised as:
if COLOUR < 50 then Class=A else Class=B
These rules can be easily encoded into software or procedures.
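As a sketch, that rule becomes a one-line function (the threshold of 50 is taken from the rule above):

```python
def classify_beer(colour):
    # Encodes the single-split rule: COLOUR < 50 -> Class A, else Class B.
    return "A" if colour < 50 else "B"

print(classify_beer(30))  # A
print(classify_beer(80))  # B
```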
This is an interesting dataset describing whether users of a mobile phone service left for another provider. This is known as “customer churn”.
The goal of this dataset is to predict whether a customer is going to leave based on the other features.
The features include:
Index(['COLLEGE', 'INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE', 'HANDSET_PRICE', 'OVER_15MINS_CALLS_PER_MONTH', 'AVERAGE_CALL_DURATION', 'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL', 'CONSIDERING_CHANGE_OF_PLAN'], dtype='object')
I won’t go into details just yet, because we’re not quite ready. But I pushed this data through an algorithm to create a decision tree, then calculated the information gain for each of the generated rules.
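For the curious, that kind of pipeline can be sketched with scikit-learn. The inline rows and the `LEAVE` label column below are invented stand-ins; the real dataset has the columns listed above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented stand-in rows; the real churn data has many more rows and columns.
df = pd.DataFrame({
    "INCOME":  [20000, 80000, 30000, 90000, 25000, 85000],
    "OVERAGE": [200, 0, 180, 10, 220, 5],
    "LEAVE":   ["LEAVE", "STAY", "LEAVE", "STAY", "LEAVE", "STAY"],
})

X, y = df[["INCOME", "OVERAGE"]], df["LEAVE"]

# A shallow tree keeps the generated rules readable.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["INCOME", "OVERAGE"]))
```

`export_text` prints the learned splits as nested if/else rules, much like the single-rule example earlier.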
The resulting rules look like…