Mean and Standard Deviation
Welcome! This workshop is from Winder.ai.
This workshop is about two fundamental measures of data. I want you to start thinking about how best to describe or summarise data: how can we take a set of data and describe it with as few variables as possible? These measures are called summary statistics because they summarise statistical data. In other words, this is your first model!
import numpy as np
Mean
The mean, also known as the average, is a measure of the central tendency of the data. For example, if you were given some data and asked to represent it with a single value, the mean would most likely be the best choice.
The mean is calculated as:
$$\mu = \frac{\sum_{i=0}^{N-1}{ x_i }} {N}$$
That is, the sum of all observations divided by the number of observations.
x = [6, 4, 6, 9, 4, 4, 9, 7, 3, 6]
N = len(x)
# Accumulate the sum of all observations
x_sum = 0
for i in range(N):
    x_sum = x_sum + x[i]
# Divide by the number of observations
mu = x_sum / N
print("μ =", mu)
μ = 5.8
Of course, we should use libraries to reduce the amount of code we have to write. For low-level tasks such as this, the most common library is NumPy.
We can rewrite the above as:
N = len(x)
x_sum = np.sum(x)
mu = x_sum / N
print("μ =", mu)
μ = 5.8
We can take this even further and just use NumPy’s implementation of the mean:
print("μ =", np.mean(x))
μ = 5.8
Standard Deviation
To describe our data, the mean alone doesn’t provide enough information. It tells us what value we should observe on average. But the values could be +/- 1 or +/- 100 of that value. (+/- is shorthand for “plus or minus”, i.e. “could be greater than or less than this value”).
To provide this information we need a measure of “spread” around the mean. The most common measure of “spread” is the standard deviation.
Read more about the standard deviation at WinderResearch.com: Why do we use Standard Deviation and is it Right?
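To make the point about spread concrete, here is a minimal sketch. The two datasets, `tight` and `wide`, are made up for illustration and are not part of the workshop data. They share the same mean, yet their values sit very differently around it. The range (maximum minus minimum) is used here only as a crude stand-in for spread.

```python
import numpy as np

# Two hypothetical datasets (invented for this illustration) with the same mean
tight = np.array([4, 5, 5, 5, 6])
wide = np.array([-95, 5, 5, 5, 105])

print("mean of tight:", np.mean(tight))
print("mean of wide: ", np.mean(wide))

# The range (max - min) is a crude measure of how spread out the values are
tight_range = tight.max() - tight.min()
wide_range = wide.max() - wide.min()
print("range of tight:", tight_range)
print("range of wide: ", wide_range)
```

Both means are 5, but one dataset stays within +/- 1 of it and the other within +/- 100, which is exactly the information the mean alone cannot convey.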
The standard deviation of a population is:
$$\sigma = \sqrt{ \frac{\sum_{i=0}^{N-1}{ (x_i - \mu )^2 }} {N} }$$
x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])  # an array, so x - mu is element-wise
N = len(x)
mu = np.mean(x)
print("μ =", mu)
μ = 5.8
print("Deviations from the mean:", x - mu)
print("Squared deviations from the mean:", (x - mu)**2)
print("Sum of squared deviations from the mean:", ((x - mu)**2).sum() )
print("Mean of squared deviations from the mean:", ((x - mu)**2).sum() / N )
Deviations from the mean: [ 0.2 -1.8 0.2 3.2 -1.8 -1.8 3.2 1.2 -2.8 0.2]
Squared deviations from the mean: [ 0.04 3.24 0.04 10.24 3.24 3.24 10.24 1.44 7.84 0.04]
Sum of squared deviations from the mean: 39.6
Mean of squared deviations from the mean: 3.96
print("σ =", np.sqrt(((x - mu)**2).sum() / N ))
σ = 1.98997487421
Again, we don’t need to code this all up. The NumPy equivalent is:
print("σ =", np.std(x))
σ = 1.98997487421
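One detail worth knowing: by default `np.std` divides by N, which matches the population formula above. The `ddof` argument ("delta degrees of freedom") changes the divisor to N - ddof, so `ddof=1` gives the sample standard deviation. The sample version is not covered in this workshop; the sketch below is included only to show which formula the default corresponds to.

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])

# Default (ddof=0): divide the sum of squared deviations by N (population formula)
sigma_population = np.std(x)

# ddof=1: divide by N - 1 instead (sample standard deviation)
sigma_sample = np.std(x, ddof=1)

print("σ (population, ddof=0) =", sigma_population)
print("s (sample, ddof=1)     =", sigma_sample)
```

Because the divisor is smaller, the sample value is always slightly larger than the population value for the same data.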
What’s the Catch?
You knew there’d be a catch, right? ;-)
I didn’t mention it at the start, but the two previous measures, of central tendency and of spread, are specific to a very special kind of data.
If the observations are distributed in a special way, then these metrics perfectly model the underlying data. If not, then these metrics are invalid.
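As a rough illustration of how these metrics can mislead, the sketch below appends a single extreme value to the data used earlier. The value 100 is a hypothetical observation chosen for this example, not part of the workshop data.

```python
import numpy as np

x = np.array([6, 4, 6, 9, 4, 4, 9, 7, 3, 6])

# Append one hypothetical extreme observation (an assumption for illustration)
x_outlier = np.append(x, 100)

print("Original:     μ =", np.mean(x), " σ =", np.std(x))
print("With outlier: μ =", np.mean(x_outlier), " σ =", np.std(x_outlier))

# The new mean is larger than every one of the original observations,
# so it no longer describes a "typical" value in the data.
print("Largest original value:", x.max())
```

With just one unusual observation, the mean and the standard deviation both shift far away from where most of the data lies.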
You probably said “huh?” to a few of those new words, so let’s go through them.