Quantitative Model Evaluation

Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.

We need to be able to compare models for a range of tasks. The most common use case is to decide whether changes to your model improve performance. Typically we want to visualise this, and we will in another workshop, but first we need to establish some quantitative measures of performance.

from sklearn import metrics
import numpy as np


First, let’s take an in-depth look at accuracy. Given the confusion matrix below, calculate the accuracy.

I’m representing the positive class as ‘a’ (affirmative) and the negative class as ‘n’. This is because the labels are sorted alphabetically, which would otherwise make the confusion matrix confusing (i.e. if we used ‘y’ and ‘n’, ‘y’ would be at the bottom!).
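As a small illustration of that ordering, here is a sketch that uses hypothetical ‘y’/‘n’ labels (not the ‘a’/‘n’ labels used in the rest of this workshop) with the same counts. It also shows that the labels argument of confusion_matrix can set the row and column order explicitly, which is an alternative to renaming the classes.

```python
from sklearn import metrics

# Hypothetical y/n labels, same counts as the example in this section.
y_true = ['y'] * 15 + ['n'] * 35
y_pred = ['y'] * 12 + ['n'] * 3 + ['n'] * 30 + ['y'] * 5

# Default: labels are sorted, so 'n' comes first and the positives end up last.
default_cm = metrics.confusion_matrix(y_true, y_pred)

# Explicit order: positive class first.
ordered_cm = metrics.confusion_matrix(y_true, y_pred, labels=['y', 'n'])

print(default_cm)
print(ordered_cm)
```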

y_true = ['a'] * 15 + ['n'] * 35
y_pred = ['a'] * 12 + ['n'] * 3 + ['n'] * 30 + ['a'] * 5
# sklearn's layout: rows are the actual classes, columns are the predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
print(metrics.accuracy_score(y_true, y_pred))
[[12  3]
 [ 5 30]]

Note that the classes here are quite skewed: there are 15 positives and 35 negatives.
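To see why skew matters for accuracy, the sketch below computes accuracy by hand from the confusion matrix and compares it with a naive baseline that always predicts the most common class. The baseline is an addition for illustration and the variable names are assumptions, not part of the original workshop.

```python
import numpy as np

# Confusion matrix from the example above (rows = actual, columns = predicted).
cm = np.array([[12, 3],
               [5, 30]])

# Accuracy: correct predictions (the diagonal) over all instances.
accuracy = np.trace(cm) / cm.sum()

# Baseline: always predict the most common actual class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(accuracy)
print(majority_baseline)
```

The more skewed the classes, the closer this baseline gets to 1, so an accuracy figure on its own says little without knowing the class balance.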

Expected Value

Imagine we had a marketing example, like the one in the training. We want to spend some money on marketing, but we only want to target the people who are likely to respond. We were given the following information:

  • Profit from each sale: £50
  • Cost for marketing: £9

We can generate a cost/benefit matrix as follows:

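profit = 50
cost   = -9
# Rows are the decision (target, don't target); columns are the actual class (buyer, non-buyer).
cost_benefit = np.array([[profit+cost, cost],[0   , 0]])
print(cost_benefit)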
[[41 -9]
 [ 0  0]]

Given the results in the previous confusion matrix, what is the expected value?

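def expected_value(confusion_matrix, cost_benefit_matrix):
    # Weight each cell count by its cost/benefit, then average over all instances.
    return np.sum(confusion_matrix * cost_benefit_matrix) / np.sum(confusion_matrix)

# sklearn puts the actual classes on the rows, but the cost/benefit matrix has the
# decisions on the rows, so transpose the confusion matrix to line the cells up.
print(expected_value(cm.T, cost_benefit))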
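The arithmetic behind the expected value can be sketched step by step. This example assumes the counts are laid out like the cost/benefit matrix (decisions on the rows, actual classes on the columns), which is the transpose of what sklearn's confusion_matrix returns. The names decisions, rates and per_cell are hypothetical.

```python
import numpy as np

# Counts from the earlier example, laid out like the cost/benefit matrix:
# rows = decision (target, don't target), columns = actual class (buyer, non-buyer).
decisions = np.array([[12, 5],
                      [3, 30]])
cost_benefit = np.array([[41, -9],
                         [0, 0]])

# Convert counts to rates, then weight each rate by its cost/benefit.
rates = decisions / decisions.sum()
per_cell = rates * cost_benefit
expected = per_cell.sum()

print(per_cell)
print(round(expected, 2))  # expected profit per person, in pounds
```

Only the top row contributes here, because nothing is earned or spent on the people we choose not to target.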

Imbalanced Arrays

Let’s take a look at the two confusion matrices seen in the training.

model_a = np.array([[25, 30], [0, 45]])
model_b = np.array([[30, 0], [20, 50]])
print("model a:\n", model_a)
print("model b:\n", model_b)
model a:
 [[25 30]
 [ 0 45]]
model b:
 [[30  0]
 [20 50]]

Let’s calculate the expected value of these models given the previous cost/benefit matrix:

print("model a:\n", expected_value(model_a, cost_benefit))
print("model b:\n", expected_value(model_b, cost_benefit))
model a:
 7.55
model b:
 12.3

But look at how many instances of each class there are in each model’s test set (the column totals). There’s a big skew in model A’s.

print("model a sample size:\n", sum(model_a))
print("model b sample size:\n", sum(model_b))
model a sample size:
 [25 75]
model b sample size:
 [50 50]
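These class totals can be turned into class probabilities by dividing by the size of the test set. The sketch below assumes, as the totals above suggest, that the columns of each matrix are the actual classes. It uses NumPy's axis argument instead of the built-in sum, and class_priors is a hypothetical helper name.

```python
import numpy as np

model_a = np.array([[25, 30], [0, 45]])
model_b = np.array([[30, 0], [20, 50]])

def class_priors(m):
    # Column totals are the actual class counts; divide by the grand total.
    return m.sum(axis=0) / m.sum()

priors_a = class_priors(model_a)
priors_b = class_priors(model_b)
print(priors_a, priors_b)
```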

Factoring Out Sample Size

We can factor out the sample size with the probability identity:

$p(\mathbf{Y},\mathbf{n}) = p(\mathbf{n}) \cdot p(\mathbf{Y} \vert \mathbf{n})$

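This means the expected profit can be written in terms of the class probabilities $p(\mathbf{p})$ and $p(\mathbf{n})$, where $\mathbf{Y}$ and $\mathbf{N}$ are the model’s decisions, $\mathbf{p}$ and $\mathbf{n}$ are the actual classes, and $b$ is the matching entry of the cost/benefit matrix: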

\begin{align} \text{expected profit} = & p(\mathbf{p}) \cdot \left[ p(\mathbf{Y} \vert \mathbf{p}) \cdot b(\mathbf{Y},\mathbf{p}) + p(\mathbf{N} \vert \mathbf{p}) \cdot b(\mathbf{N},\mathbf{p}) \right] + \\ & p(\mathbf{n}) \cdot \left[ p(\mathbf{Y} \vert \mathbf{n}) \cdot b(\mathbf{Y},\mathbf{n}) + p(\mathbf{N} \vert \mathbf{n}) \cdot b(\mathbf{N},\mathbf{n}) \right] \end{align}
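As a worked check, here are model A’s numbers plugged into the equation. This assumes that the columns of the matrix are the actual classes (25 positives and 75 negatives) and uses the cost/benefit values of 41, -9, 0 and 0 from earlier:

\begin{align} \text{expected profit} = & 0.25 \cdot \left[ \tfrac{25}{25} \cdot 41 + \tfrac{0}{25} \cdot 0 \right] + 0.75 \cdot \left[ \tfrac{30}{75} \cdot (-9) + \tfrac{45}{75} \cdot 0 \right] \\ = & 10.25 - 2.7 = 7.55 \end{align}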

Let’s create a new function that implements the above equation. We will first use the same class skew as in the provided data above and check that it results in the same value as before (sanity check!).

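def factored_expected_value(m, cb, p_p=0.5, p_n=0.5):
    # Columns of m are the actual classes; rows are the model's decisions (Y, N).
    t_p = sum(m[:,0])  # total actual positives
    t_n = sum(m[:,1])  # total actual negatives
    # Each class probability multiplies the whole bracket for that class.
    return p_p * ((m[0,0]/t_p) * cb[0,0] + (m[1,0]/t_p) * cb[1,0]) + \
           p_n * ((m[0,1]/t_n) * cb[0,1] + (m[1,1]/t_n) * cb[1,1])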

print("Should be equal:", expected_value(model_a, cost_benefit), factored_expected_value(model_a, cost_benefit, 0.25, 0.75))
print("Should be equal:", expected_value(model_b, cost_benefit), factored_expected_value(model_b, cost_benefit))
Should be equal: 7.55 7.55
Should be equal: 12.3 12.3

Now let’s see what the expected values are when we factor out the class skew (by altering the probability of each class in the above equation):

print("Results after factoring out training sample skew")
print("Model A expected value:", factored_expected_value(model_a, cost_benefit, 0.5, 0.5))
print("Model B expected value:", factored_expected_value(model_b, cost_benefit, 0.5, 0.5))
Results after factoring out training sample skew
Model A expected value: 18.7
Model B expected value: 12.3

Note how different the expected values are after accounting for class skew!

This demonstrates that when you compare models, you must make sure they are evaluated with the same class probabilities. The classes don’t necessarily have to be balanced, but the skew must be the same for every model you compare.
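To make this concrete, the sketch below evaluates both models at several shared class probabilities. It is a vectorised variant of the factored equation, written under the assumption that matrix columns are the actual classes. The helper name expected_value_at_prior and the chosen probabilities are illustrative only.

```python
import numpy as np

model_a = np.array([[25, 30], [0, 45]])
model_b = np.array([[30, 0], [20, 50]])
cost_benefit = np.array([[41, -9], [0, 0]])

def expected_value_at_prior(m, cb, p_p):
    # Normalise each column (actual class) into rates p(decision | class).
    rates = m / m.sum(axis=0)
    # Weight each column by its class probability, then by the cost/benefit.
    priors = np.array([p_p, 1.0 - p_p])
    return float((rates * priors * cb).sum())

for p_p in (0.1, 0.25, 0.5):
    a = expected_value_at_prior(model_a, cost_benefit, p_p)
    b = expected_value_at_prior(model_b, cost_benefit, p_p)
    print(p_p, round(a, 2), round(b, 2))
```

With these matrices, model B comes out ahead when 10% of instances are positive, while model A is ahead at 25% and 50%, so the ranking of the two models depends on the skew you evaluate them at.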

Also note that we could have fixed the skew by altering the underlying data, i.e. by balancing the classes in the test data.
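One possible way to balance a test set is to randomly downsample the larger class. The sketch below assumes a hypothetical label array with 25 positives and 75 negatives and a fixed random seed. It is only one of several ways to rebalance, and it is not taken from the training material.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the subset is repeatable
y_true = np.array(['a'] * 25 + ['n'] * 75)

# Indices of each class in the (hypothetical) test set.
idx_a = np.flatnonzero(y_true == 'a')
idx_n = np.flatnonzero(y_true == 'n')

# Keep a random subset of each class, the same size as the smaller class.
n_keep = min(len(idx_a), len(idx_n))
keep = np.concatenate([
    rng.choice(idx_a, size=n_keep, replace=False),
    rng.choice(idx_n, size=n_keep, replace=False),
])
balanced = y_true[keep]
labels, counts = np.unique(balanced, return_counts=True)
print(labels, counts)
```

The same keep indices would need to be applied to the predictions, so that each prediction stays paired with its true label.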

Other Evaluation Metrics

Let’s take a quick look at some other evaluation metrics that we defined in the training. We can calculate these manually or use the methods in the sklearn.metrics module.

These are technical metrics. It is wise to choose the technical metrics that best suit your problem. For example, are you more worried about false positives? Then make sure you use a metric that takes false positives into account (like the false positive rate).

But always bear in mind that these are hard for non-data scientists to interpret, i.e. you shouldn’t present them to the business. Always try to find business-friendly summary statistics like the expected value defined above.

$$ \text{accuracy} = \frac{\text{correct predictions}}{\text{all instances}} = \frac{TP+TN}{P+N} = \frac{TP+TN}{TP+FP+TN+FN} $$

$$ \text{precision} = \frac{\text{true positives}}{\text{all predicted yes}} = \frac{TP}{TP + FP}$$

$$ \text{recall} = \frac{\text{true positives}}{\text{all positives}} = \frac{TP}{TP + FN}$$

$$ \text{false positive rate} = \frac{\text{false positives}}{\text{all negatives}} = \frac{FP}{N} = \frac{FP}{FP+TN} $$

print("model a:\n", model_a, "\n", np.array([['TP', 'FN'],['FP', 'TN']]), "\n")

# Layout assumed here (as in sklearn): rows are the actual classes, columns are the predictions.
TP = model_a[0,0]; FP = model_a[1,0]
TN = model_a[1,1]; FN = model_a[0,1]

print("Accuracy:", (TP+TN)/(TP+FP+TN+FN) )
print("Precision:", (TP)/(TP+FP) )
print("Recall:", (TP)/(TP+FN) )
print("FPR:", (FP)/(FP+TN) )
model a:
 [[25 30]
 [ 0 45]]
 [['TP' 'FN']
 ['FP' 'TN']]

Accuracy: 0.7
Precision: 1.0
Recall: 0.454545454545
FPR: 0.0
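The same metrics can also be obtained from sklearn.metrics. The sketch below reuses the first example in this workshop (15 positives and 35 negatives). Because the labels are strings, the positive class is passed explicitly via pos_label. The false positive rate is derived from the confusion matrix here, which is one way of getting it and not necessarily the only one.

```python
from sklearn import metrics

y_true = ['a'] * 15 + ['n'] * 35
y_pred = ['a'] * 12 + ['n'] * 3 + ['n'] * 30 + ['a'] * 5

# String labels need an explicit positive class.
accuracy = metrics.accuracy_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred, pos_label='a')
recall = metrics.recall_score(y_true, y_pred, pos_label='a')

# With labels=['a', 'n'] the positive class is first, so the cells
# unravel in the order TP, FN, FP, TN.
tp, fn, fp, tn = metrics.confusion_matrix(y_true, y_pred, labels=['a', 'n']).ravel()
fpr = fp / (fp + tn)

print(accuracy, precision, recall, fpr)
```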

Now try it yourself with the following data:

y_true = ['a'] * 25 + ['n'] * 75
y_pred = ['a'] * 20 + ['n'] * 30 + ['n'] * 15 + ['a'] * 35


  • Create a confusion matrix for the above data
  • Calculate the accuracy, precision, recall and FPR


If you finish early, go grab a coffee, or try these tasks:

  • Bring up the digits dataset again
  • Try generating all these metrics (accuracy, precision, recall, etc.) for that dataset.
  • Investigate what other metrics sklearn has to offer and try them.
