502: Preventing Overfitting with Holdout

Holdout

We have been using:

  • Training data

Not representative of production.

We want to pretend that we are seeing new data:

  • Hold back some data

???

When we train the model, we do so on some data. This is called training data.

Up to now, we have been using the same training data to measure our accuracy.

If we create a lookup table, our accuracy will be 100%. But this doesn’t generalise to new examples.

So instead we want to pretend that we have new examples and use those to test our model.

In other words, we hold back some data.


How does holdout work?

Figure: separating the data into a training set and a test set.

Separate the data into a training set and a test set.

Test set size approx. 10-40%.

Minimum size depends on number of features and complexity of model.

???

When we obtain a dataset, we would separate the data into a training set and a test set.

The size of the test set is usually somewhere between 10% and 40% of the whole dataset.

Generally, the more data you have, the smaller the test set can become.

This way, we can get an accurate estimate of how the algorithm would perform if it were to see new data (assuming that the underlying data distribution is stationary!)
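As a rough sketch of how this looks in practice (assuming scikit-learn and a made-up dataset; the model choice is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up dataset: 1000 observations, 5 features, binary label
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Hold back 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# The test-set score estimates performance on genuinely new data
print("Training accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```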


Issues

However, there are issues with a simple holdout technique like this.

Put simply: think really hard about the test data.

  • Is it independent of the training data?
  • Does it represent realistic data?

Randomising data

Common structures found within data:

  • Ordering
    • Time
    • Key (e.g. when obtaining data from a database)
    • Value or label (e.g. all elements of class 0 first, then class 1, …)
  • Geography
    • Only sampled from certain geographies; doesn’t generalise to other geographies
  • Language

Thankfully the fix is simple. Always randomise data before training.

???

One issue is that the data in a set often has structure. For example, it could be ordered or collected in such a way that when we pick an observation to train or test against, it doesn’t represent the population.
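A minimal sketch of the fix, assuming plain NumPy arrays that arrive ordered by class label (scikit-learn's `train_test_split` already shuffles by default, but it pays to be explicit when writing your own splits):

```python
import numpy as np

# Made-up ordered data: all of class 0 first, then all of class 1
X = np.vstack([np.zeros((500, 3)), np.ones((500, 3))])
y = np.array([0] * 500 + [1] * 500)

# Shuffle features and labels together before splitting
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]

# A naive "last 20% is the test set" split on the original ordering
# would have trained only on class 0 and tested only on class 1.
```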


Hyperparameter tuning

Imagine trying to tune a hyperparameter…

Can you see the issue?

We’re using the test set to tune hyperparameters!

???

We saw above that a common task is to alter some parameter of the model to improve performance.

If we repeatedly alter the hyperparameter to maximise the test set score, we’re not really finding the best model. We are tuning the hyperparameters to best fit the test set.

Can you see the issue here?

We are using the test set to tune our hyperparameters! Over time we would overfit our model to the test set!

The simplest fix for this problem is to introduce another holdout set called the validation set.
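Here is a rough sketch of that three-way split, using the terminology above (hyperparameters are tuned against the test set, and the validation set is only touched once at the end); the dataset and the k-nearest-neighbours model are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)

# Carve off a validation set first, then split the rest into train/test
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

# Tune the hyperparameter (k) against the test set only
best_k, best_score = None, -1.0
for k in [1, 3, 5, 9, 15]:
    score = KNeighborsClassifier(n_neighbors=k).fit(
        X_train, y_train).score(X_test, y_test)
    if score > best_score:
        best_k, best_score = k, score

# The validation set is used once, to report the final accuracy
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("Validation accuracy:", final_model.score(X_val, y_val))
```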


Validation set

The validation dataset is a second holdout set that is to be used when computing final accuracies.

Figure: splitting the data into training, test, and validation sets.

Validation set issues

  • Significantly reduces the amount of data available for training

???

The main issue with using a validation set is obvious from the image. It significantly reduces the amount of data that can be used to train the model.

This will ultimately affect model performance, since more data usually means better performance.

The simplest fix is to retrain the best model using all of the training and test data put together…
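Continuing the sketch above, that retraining step might look something like this (the validation set is still kept out of training):

```python
# Refit the chosen model on the training and test data combined
X_fit = np.concatenate([X_train, X_test])
y_fit = np.concatenate([y_train, y_test])
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_fit, y_fit)
print("Validation accuracy:", final_model.score(X_val, y_val))
```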


Validation set

But we’re still not using the data in the test set to train the model, and that data might contain information that is important for training.

So…


Cross-validation

  • For each new training run, pick a new subset of the data to train/test against.

???

Cross-validation is a process where we repeatedly perform the fitting procedure, but each time pick a different subset of the data to act as the test set.

This way every observation is eventually used to train the model, but we are still able to pick the best model before a final validation on an independent dataset.


How does it work?

Figure: cross-validation, rotating the test set across the folds.

The number of iterations is called the number of folds. In the example above there are 2 folds.
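A small sketch of the same idea with scikit-learn's `KFold`, using 2 folds to match the figure (the data and model are again placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# 2 folds: every observation is used once for testing and once for training
kf = KFold(n_splits=2, shuffle=True, random_state=2)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    print(f"Fold {fold}: test accuracy = {model.score(X[test_idx], y[test_idx]):.3f}")
```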


Benefit of Cross Validation

  • Run through all the folds per training run

Then we have statistics about how consistent our model is over the various folds.

I.e. we can calculate the mean and standard deviation of our score.
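With scikit-learn, `cross_val_score` gives those statistics directly (a sketch only; any estimator and dataset could stand in here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# One accuracy score per fold, here 5 folds
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("Mean accuracy:", scores.mean())
print("Std of accuracy:", scores.std())
```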


Issues with cross validation

The main issue is the additional time required to repeat the training process for each fold.

This is increasingly problematic for complex models like deep learning.

Let’s talk about visualising overfitting

