Why Correlating Data is Bad and What to do About it

Correlating Data

Welcome! This workshop is from Winder.ai.

Correlations between features are bad because you are effectively telling the model that this information is twice as important as everything else. You’re feeding the model the same data twice.

Technically this is known as multicollinearity, which generalises pairwise correlation to linear dependence among any number of features.

Generally, correlated features will decrease the performance of your model, so we need to find them and remove them.

Again, let’s generate some dummy data for simplicity…

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

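# 100 samples of 5 independent, standard-normal features
X = np.random.randn(100, 5)
noise = np.random.randn(100)
# Overwrite feature 0 with a noisy linear combination of features 2 and 4
X[:, 0] = 2 * X[:, 2] + 3 * X[:, 4] + 0.5 * noise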

The easiest way to spot correlated features is to generate a scatter matrix. This is a grid of plots that shows each feature against every other feature.

# Here, we're plotting a "scatter matrix", i.e. a matrix of scatter plots of each pair of features.
# It's really useful for spotting dodgy data.

from pandas.plotting import scatter_matrix
df = pd.DataFrame(X)
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='hist')

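We can see that there is some linearity in the plots. It is clearest in the top right, where feature 0 is plotted against feature 4. Feature 0 against feature 2 shows a weaker trend.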
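A numerical check can back up what the plots suggest. The following is a minimal sketch, not part of the original workshop, that prints feature pairs whose Pearson correlation exceeds a cut-off. The fixed seed and the `threshold` value of 0.4 are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Recreate the dummy data; the seed is an assumption so the sketch is repeatable
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 0] = 2 * X[:, 2] + 3 * X[:, 4] + 0.5 * rng.standard_normal(100)
df = pd.DataFrame(X)

# Pairwise Pearson correlation between all features
corr = df.corr()

# Hypothetical cut-off: report pairs whose absolute correlation exceeds it
threshold = 0.4
pairs = [
    (i, j, corr.loc[i, j])
    for i in corr.index
    for j in corr.columns
    if i < j and abs(corr.loc[i, j]) > threshold
]

for i, j, r in pairs:
    print(f"features {i} and {j}: r = {r:.2f}")
```

A sensible cut-off depends on the data and the model, so treat the value here as a starting point rather than a rule.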

Again, the simplest thing to do at this stage is to manually remove one of the correlated features, here feature 4.

del df[4]
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='hist')

There’s still a little bit of correlation between feature 0 and feature 2, but it’s not huge. Try scoring your model with and without this feature.
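One way to run that comparison is sketched below. The workshop does not define a target or a model, so the target `y`, the helper `holdout_r2`, the 70/30 split and the seed are all hypothetical stand-ins: an ordinary least squares fit scored by R² on held-out rows.

```python
import numpy as np

# Recreate the dummy features; the seed is an assumption for repeatability
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 0] = 2 * X[:, 2] + 3 * X[:, 4] + 0.5 * rng.standard_normal(100)

# Hypothetical target: the workshop does not define one
y = X[:, 1] - 2 * X[:, 2] + X[:, 4] + 0.5 * rng.standard_normal(100)


def holdout_r2(X, y, n_train=70):
    """Fit least squares on the first n_train rows and return R^2 on the rest."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    pred = A[n_train:] @ coef
    ss_res = np.sum((y[n_train:] - pred) ** 2)
    ss_tot = np.sum((y[n_train:] - y[n_train:].mean()) ** 2)
    return 1 - ss_res / ss_tot


score_all = holdout_r2(X, y)
score_without = holdout_r2(np.delete(X, 2, axis=1), y)  # drop feature 2

print(f"R^2 with all features:  {score_all:.3f}")
print(f"R^2 without feature 2: {score_without:.3f}")
```

If the two scores are close, the dropped feature was largely redundant given the others. With a real dataset you would swap in your own model and a proper cross-validation scheme.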

We could perform this process in a more quantitative manner using eigenvectors and eigenvalues to spot the correlation, but that’s a bit too complex to consider at this point.
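For readers who want a preview anyway, here is a rough sketch of the idea under stated assumptions. An eigenvalue of the correlation matrix that is close to zero signals a near-linear dependence, and the matching eigenvector points at the features involved. The seed and the 0.3 cut-off on the eigenvector entries are arbitrary choices for illustration.

```python
import numpy as np

# Recreate the dummy features; the seed is an assumption for repeatability
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 0] = 2 * X[:, 2] + 3 * X[:, 4] + 0.5 * rng.standard_normal(100)

# Correlation matrix of the features (columns are variables)
corr = np.corrcoef(X, rowvar=False)

# eigh suits symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(corr)

smallest = eigenvalues[0]
direction = eigenvectors[:, 0]  # eigenvector paired with the smallest eigenvalue

# Hypothetical cut-off: features with a large weight take part in the dependence
involved = np.flatnonzero(np.abs(direction) > 0.3)

print("eigenvalues:", np.round(eigenvalues, 3))
print("features with large weight in the smallest direction:", involved)
```

Uncorrelated features would give eigenvalues that all sit near 1, so one that sits near zero stands out.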
