Information and Entropy
Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.
Remember the goal of data science: to make a decision based upon some data. The quality of that decision depends on the quality of our information. If we have good, clear information then we can make well-informed decisions. If we have bad, messy data then our decisions will be poor.
Classification
In the context of classification, which is the attempt to predict which class an observation belongs to, we can be more certain about a result if our algorithm is able to separate the classes cleanly.
One measure of how clean or pure a collection of class labels is, is entropy.
In this workshop we will mathematically define entropy, which measures the average amount of information, in bits, needed to encode which class each observation belongs to.
import numpy as np # Numpy is a general purpose mathematical library for Python.
# Most higher level data science libraries use Numpy under the bonnet.
X = np.array([0, 0, 1, 1, -1, -1, 100]) # Create an array. All numpy functions expect the data in a Numpy array.
print(np.mean(X))
print(np.var(X))
14.2857142857
1225.06122449
Entropy
Remember, entropy is defined as:
$$H=-\sum(p_i \log_2 (p_i))$$
Where \(p_i\) is the probability that an observation belongs to class \(i\), i.e. \(p_i = \mathrm{count}(y == i) / n\), where \(y\) is the target, \(i\) is the class of interest and \(n\) is the total number of samples.
For example, if we have two classes:
$$H=-p_1 \log_2 (p_1)-p_2 \log_2 (p_2)$$
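To make this concrete, here are two hand-worked cases (the probabilities are illustrative, not taken from the data below). Two equally likely classes give the maximum entropy of one bit, while a skewed 90/10 split gives much less:
$$H = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1 \text{ bit}$$
$$H = -0.9 \log_2(0.9) - 0.1 \log_2(0.1) \approx 0.47 \text{ bits}$$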
Task
- Read through this code and understand what is going on.
- Try calculating the entropy of another array of values. What happens when you add more values? Change values?
X = np.array([[4.2, 92], [6.4, 102], [3.5, 3], [4.7, 10]]) # Numpy arrays are general purpose mathematical arrays
y = np.array([0, 0, 1, 1]) # They implement all kinds of useful operators, like the == operator.

def entropy(y):
    probs = [] # Probabilities of each class label
    for c in set(y): # Set gets a unique set of values. We're iterating over each value
        num_same_class = sum(y == c) # Remember that True == 1, so we can sum.
        p = num_same_class / len(y) # Probability of this class label
        probs.append(p)
    return np.sum([-p * np.log2(p) for p in probs]) # Sum the entropy contribution of each class
print(entropy(y)) # Should be 1.0
1.0
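To get a feel for the task above, here is a quick sketch using the entropy function just defined (the example arrays are illustrative choices, not part of the original data):
print(entropy(np.array([0, 0, 0, 0])))        # A pure set of labels: entropy is zero (may print as -0.0)
print(entropy(np.array([0, 0, 0, 1])))        # An imbalanced set: roughly 0.81 bits
print(entropy(np.array([0, 0, 1, 1, 2, 2])))  # Three equally likely classes: log2(3), roughly 1.58 bits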
Information gain
Imagine we had some data like that of X and y above, where X are the features and y are the class labels.
We could propose a threshold or a rule that would split the data in X to separate the classes. How would we quantify which was the best split?
What we can do is compare the entropy of the parent before the split against the weighted combination of the entropies after the split. That is, if three observations end up in the left bucket and one in the right, then the left bucket accounts for three quarters of the combined child entropy.
If we subtract the weighted child entropy from the parent entropy, we're left with a measure of improvement. This is called the information gain.
The information gain is defined as the parent entropy minus the weighted entropy of the subgroups.
$$ \begin{align} IG(parent, children) = & entropy(parent) - \nonumber \\ & \left(p(c_1)entropy(c_1) + p(c_2)entropy(c_2) + \dots\right) \end{align} $$
Tasks:
- Given the following information_gain function (understand it), pick some splits and calculate the information gain. Which is better?
def information_gain(parent, left_split, right_split):
    return (entropy(parent)
            - (len(left_split) / len(parent)) * entropy(left_split)
            - (len(right_split) / len(parent)) * entropy(right_split))
# Make a split around the first column, < 5.0:
split1 = information_gain(y, y[X[:, 0] < 5.0], y[X[:, 0] >= 5.0]) # Use >= on the right so no rows are dropped
print("%0.2f" % split1) # Should be 0.31
0.31
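To see where the 0.31 comes from (worked by hand as a check): the parent entropy is 1.0, the left bucket receives the labels [0, 1, 1] with entropy roughly 0.918, and the right bucket receives [0] with entropy 0, so:
$$IG = 1.0 - \left(\tfrac{3}{4} \times 0.918 + \tfrac{1}{4} \times 0\right) \approx 0.31$$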
# Make a split around the second column, < 50.0:
split2 = information_gain(y, y[X[:, 1] < 50], y[X[:, 1] >= 50]) # Again, >= on the right
print(split2) # Should be 1.0
print("Split %d is better" % ((split1 < split2) + 1)) # Split 2 should be better, higher information gain
1.0
Split 2 is better
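As a rough sketch of how you could automate the task above (an illustration rather than part of the original workshop; the candidate thresholds are simply the midpoints between consecutive feature values), loop over every column and threshold and keep the split with the highest information gain:
best_gain, best_column, best_threshold = 0.0, None, None
for column in range(X.shape[1]): # Try every feature column
    values = np.unique(X[:, column]) # Unique values, already sorted
    thresholds = (values[:-1] + values[1:]) / 2 # Candidate thresholds: midpoints between consecutive values
    for threshold in thresholds:
        gain = information_gain(y, y[X[:, column] < threshold], y[X[:, column] >= threshold])
        if gain > best_gain:
            best_gain, best_column, best_threshold = gain, column, threshold
print("Best split: column %d at %0.1f, information gain %0.2f" % (best_column, best_threshold, best_gain))
# Should recover the second-column split, with an information gain of 1.0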