201: Basics and Terminology
The ultimate goal
First, let's discuss what the goal is. What is the goal?
- The goal is to make a decision or a prediction
Based upon what?
- Information
How can we improve the quality of the decision or prediction?
- The quality of the decision or prediction is determined by how much certainty the information provides.
Think about this for a moment. It’s a key insight. Think about your projects. Your research. The decisions you make. They are all based upon some information. And you can make better decisions when you have more high-quality information.
Information
Claude Shannon defined exactly what information is. He devised a measure of the amount of information contained within a variable and called it entropy.
Entropy is a measure of how much information is contained within an event. A coin toss has lower entropy (less information) than the roll of a die, because a coin toss has only two possible outcomes while a die has six.
A random variable with a wide spread has more entropy than one with a narrow spread.
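To make this concrete, here is a minimal Python sketch of Shannon’s formula, H(X) = −Σ p(x) log₂ p(x), comparing the coin with the die, and a narrow spread with a wide one:

```python
import numpy as np

def entropy(probabilities):
    """Shannon entropy in bits: H(X) = -sum(p * log2(p))."""
    p = np.asarray(probabilities)
    return -np.sum(p * np.log2(p))

# A fair coin has two equally likely outcomes; a fair die has six.
coin = [1/2, 1/2]
die = [1/6] * 6
print(f"Coin entropy: {entropy(coin):.2f} bits")  # 1.00 bits
print(f"Die entropy:  {entropy(die):.2f} bits")   # 2.58 bits

# A wider spread of outcomes means more entropy: compare a
# near-certain distribution with a uniform one over the same outcomes.
narrow = [0.97, 0.01, 0.01, 0.01]
wide = [0.25, 0.25, 0.25, 0.25]
print(f"Narrow spread: {entropy(narrow):.2f} bits")  # about 0.24 bits
print(f"Wide spread:   {entropy(wide):.2f} bits")    # 2.00 bits
```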
Key Point: High entropy problems are harder to solve.
Uncertainty
Uncertainty represents the effects of entropy.
High entropy problems are highly uncertain.
I.e. we cannot be certain about a solution if the data has high entropy.
Key Point: We want to be certain about the result.
Uncertainty reduction
Hence, the whole point of any modelling process is to reduce the amount of uncertainty in a decision or estimate that we make.
If we can make good decisions we can build good products and good businesses.
Reducing uncertainty by reducing the entropy of the data is the topic of the next section.
Key Point: We need to reduce the uncertainty to improve our decision. The question of the entire course is: how?
Terminology
Before we move any further, we need to talk terminology.
Data science is notorious for having lots of different, complex words that all mean the same thing.
The reason for this is that data science has emerged from a number of different disciplines.
For example, statistics’ L1-norm is the same as machine learning’s Manhattan distance.
Also, terminology is quite personal to an individual’s experience. Some of this terminology is my own.
Observations, samples and datasets
An observation is a single measurement. It is often (even by me) referred to as an individual sample or data point. Words don’t matter so long as you and your audience understand that you mean a single instance.
A sample is a chosen set of observations. But this term isn’t generally used, because it is easily confused with an individual sample.
Instead, data scientists often use the word dataset to refer to a collection of observations.
How you choose a sample is very, very important. More about that later.
Features or attributes
Other than observations, the next most important word is feature.
A feature is one dimension of the measurement. For example, a finance dataset might have a loan_amount feature. A marketing dataset might have an age feature.
An attribute is another word for a feature.
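To make the terminology concrete, here is a tiny hypothetical dataset (the values are made up for illustration): each row is an observation and each column is a feature.

```python
import pandas as pd

# A hypothetical dataset: each row is one observation,
# each column (age, loan_amount) is one feature.
dataset = pd.DataFrame({
    "age": [34, 52, 29],
    "loan_amount": [12000, 30000, 8500],
})
print(dataset)
print(f"{len(dataset)} observations, {dataset.shape[1]} features")
```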
Labels
Labels represent the answers to the problem, if there are any.
Labels are required for supervised machine learning tasks.
For example, in a classification problem, labels represent the correct class for an observation.
Labels are also often called targets.
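Extending the same hypothetical example, a label is just another column holding the known answer; here a made-up defaulted column plays the role of the target:

```python
import pandas as pd

# The same hypothetical dataset, with a label column added.
dataset = pd.DataFrame({
    "age": [34, 52, 29],
    "loan_amount": [12000, 30000, 8500],
    "defaulted": [0, 1, 0],  # the label (target): the known answer
})
features = dataset[["age", "loan_amount"]]  # conventionally called X
labels = dataset["defaulted"]               # conventionally called y
```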
Solution wrappers
We can also generalise and abstract solutions into different types. We’ve already mentioned models but haven’t defined what a model is.
Models
Models are a simplified version of reality. We create them to be able to understand and act upon the underlying process.
Reality is messy. We use imprecise tools and equipment to sample a chaotic natural process.
The mess that is included in our measurements is called noise.
Noise masks the data that we are interested in.
Models attempt to simplify the measurement and ignore the noise.
Simpler models are easier to understand and easier to act upon.
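As a toy sketch of this idea (the underlying process and the noise level are invented for illustration), fitting a straight line to noisy measurements recovers a simple model that ignores the noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# An invented underlying process: y = 2x + 1, observed only
# through noisy measurements.
x = np.linspace(0, 10, 50)
measured_y = 2 * x + 1 + rng.normal(scale=2.0, size=x.shape)  # noise added

# A straight-line model: a simplified version of reality that
# ignores the noise and captures the underlying trend.
slope, intercept = np.polyfit(x, measured_y, deg=1)
print(f"Fitted model: y = {slope:.2f}x + {intercept:.2f}")  # close to y = 2x + 1
```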
Induction
The creation of models from data is called induction.
Contrast this to deduction, which is the process of formulating a model or theory from logical assertions.
Traditional science emphasised the importance of deductive reasoning, and it is still preferred in many of the more traditional disciplines.
And in data science it still holds an important place in sanity checking. If the results are not what you expect, then either you don’t understand what is going on, or you’ve made the wrong deductions.
Prediction
Prediction is often taken to mean “predicting the future”.
But in data science, prediction means: estimate the most probable output.
So when our algorithm decides that this instance belongs to a class, we’re making a prediction.
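A minimal sketch with made-up numbers: scikit-learn’s predict method simply returns the class that predict_proba rates as most probable.

```python
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature, two classes.
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2.5]]))  # probability of each class
print(model.predict([[2.5]]))        # the most probable class: [0]
```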
Types of learning
When talking about producing learning from data, there are two distinct forms of learning:
- Supervised
- Unsupervised
The type of learning is usually defined by whether the data has labels.
Supervised
Supervised machine learning occurs when an algorithm is provided with data that is labelled with a known outcome.
Supervised learning is then split by the type of label that is used. Some algorithms work with categorical labels, some with continuous labels, and some with both; a minimal sketch follows the examples below.
Examples of supervised questions:
- “What is likely to be next quarter’s GDP?”
- “Will we continue to retain this customer’s account for the next 6 months?”
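A minimal sketch with made-up numbers, showing the same fit-then-predict pattern for a continuous label (regression) and a categorical label (classification):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [3], [4], [5]]

# Continuous labels -> regression (e.g. next quarter's GDP).
y_continuous = [1.1, 1.9, 3.2, 3.9, 5.1]
regressor = LinearRegression().fit(X, y_continuous)
print(regressor.predict([[6]]))  # roughly 6

# Categorical labels -> classification (e.g. customer retention).
y_categorical = ["churn", "churn", "churn", "retain", "retain"]
classifier = DecisionTreeClassifier(random_state=0).fit(X, y_categorical)
print(classifier.predict([[6]]))  # ['retain']
```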
Unsupervised
Unsupervised problems arise when the data has no labels.
Unsupervised problems often require some form of grouping or clustering to find similar types of data; a minimal sketch follows the examples below.
Examples of unsupervised questions:
- “Do our customers fall into specific groups?”
- “What do similar customers like to buy?”
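A minimal sketch of clustering, using made-up customer data with no labels:

```python
from sklearn.cluster import KMeans

# Made-up customer data (age, annual spend), with no labels.
customers = [[25, 300], [27, 350], [60, 2000], [62, 2200], [24, 280]]

# Ask for two groups; the algorithm finds them from the data alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # e.g. [0 0 1 1 0]: two distinct customer groups
```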
Combination approaches
Fairly often, problems come with data that has some labels. For example, some instances have been manually labelled by experts, but the experts cannot possibly label all of the data.
In this case there is a subset of special techniques called semi-supervised algorithms.
These are often simply a combination of clustering followed by classification.
I.e. similar instances can be labelled with the same label. The difficulty is deciding where you draw the line.
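A minimal sketch of this cluster-then-label idea, with made-up data and only two expert-provided labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: only the first two observations have expert labels.
X = np.array([[1.0], [10.0], [1.2], [0.9], [9.8], [10.3]])
known_labels = {0: "low", 1: "high"}  # observation index -> expert label

# Step 1: cluster all of the data, labelled and unlabelled alike.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: spread each expert label across its whole cluster,
# so similar instances receive the same label.
cluster_to_label = {clusters[i]: label for i, label in known_labels.items()}
print([cluster_to_label[c] for c in clusters])
# ['low', 'high', 'low', 'low', 'high', 'high']
```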