Data Cleaning Example - Loan Data

A Python Data Science Workshop providing an example of working with and cleaning loan data.


Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.

A huge amount of a data scientist's time is spent cleaning, removing and scaling data, all in an effort to squeeze a bit more performance out of the model.

The data we are using is from Kaggle, and is available in raw form from here. You will need to sign in to Kaggle if you want to download the full data. I’ve included just a small sample.

It is a loan dataset, showing which loans have suffered repayment issues. There are a lot of columns and many of them are useless; many more have missing data.

The goal I set out to achieve was to predict which loans would suffer problems, but we will see that it would take a lot more time to get right.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv("https://s3.eu-west-2.amazonaws.com/assets.winderresearch.com/data/loan_small.csv")
# These are the columns
data.columns
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose',
       'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt',
       'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med',
       'mths_since_last_major_derog', 'policy_code', 'application_type',
       'annual_inc_joint', 'dti_joint', 'verification_status_joint',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
       'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',
       'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
       'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl',
       'inq_last_12m'],
      dtype='object')
# There are some columns called "id". ID columns don't provide any predictive power
# so let's double check, then remove them.
display(data[["id", "member_id", "emp_title"]].head())
data.drop(['id', 'member_id'], axis=1, inplace=True)

id member_id emp_title
0 1077501 1296599 NaN
1 1077430 1314167 Ryder
2 1077175 1313524 NaN
3 1076863 1277178 AIR RESOURCES BOARD
4 1075358 1311748 University Medical Group

Let’s take a deeper look at the data.

I see that there is a combination of numerical, categorical and some messed-up categorical data here.

Let’s try and fix some of the columns as an example. In reality, you’d have to do a lot more to fix this data.

data.head()

loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length ... total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 5000.0 5000.0 4975.0 36 months 10.65 162.87 B B2 NaN 10+ years ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2500.0 2500.0 2500.0 60 months 15.27 59.83 C C4 Ryder < 1 year ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2400.0 2400.0 2400.0 36 months 15.96 84.33 C C5 NaN 10+ years ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 10000.0 10000.0 10000.0 36 months 13.49 339.31 C C1 AIR RESOURCES BOARD 10+ years ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 3000.0 3000.0 3000.0 60 months 12.69 67.79 B B5 University Medical Group 1 year ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 72 columns

display(set(data["emp_length"]))
data.replace('n/a', np.nan,inplace=True)
{nan,
 '9 years',
 '3 years',
 '< 1 year',
 '1 year',
 '6 years',
 '5 years',
 '4 years',
 '7 years',
 '8 years',
 '10+ years',
 '2 years'}
data["emp_length"].fillna(value=0, inplace=True)
set(data["emp_length"])
{0,
 '9 years',
 '3 years',
 '< 1 year',
 '1 year',
 '6 years',
 '5 years',
 '4 years',
 '7 years',
 '8 years',
 '10+ years',
 '2 years'}
data['emp_length'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
data['emp_length'] = data['emp_length'].astype(int)
set(data["emp_length"])
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

I see another field, called term, that can be reduced to a cleaner numeric label.

set(data['term'])
{' 36 months', ' 60 months'}
data['term'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
data['term'] = data['term'].astype(int)
set(data["term"])
{36, 60}

Now let’s try and define what a bad loan is…

set(data["loan_status"])
{'Charged Off', 'Current', 'Default', 'Fully Paid', 'Late (31-120 days)'}
# This indicates a bad loan. Something we want to predict
bad_indicator = data["loan_status"].isin(["Charged Off", "Default", "Late (31-120 days)"])
# Remove this from dataset
data.drop(["loan_status"], axis=1, inplace=True)
bad_indicator.value_counts()
False    820
True     179
Name: loan_status, dtype: int64

Note to future self: we have imbalanced classes here. This affects some algorithms.
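As a hedged aside on that imbalance: one common mitigation is to reweight the classes and stratify the cross-validation folds. The sketch below uses synthetic stand-in data whose shape mimics the 820/179 split above; it is an illustration, not part of this workshop's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data mimicking the 820 good / 179 bad split above
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(999, 5))
y_demo = np.array([False] * 820 + [True] * 179)

# class_weight='balanced' reweights samples inversely to class frequency
clf = RandomForestClassifier(max_depth=3, class_weight="balanced", random_state=0)

# StratifiedKFold preserves the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# With imbalanced classes, f1 is a more honest score than plain accuracy
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1")
print(scores.mean())
```

With random features the score itself is meaningless; the point is the stratified folds and the `class_weight` option.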

# Find columns that have all nans and remove
naughty_cols = data.columns[data.isnull().sum() == len(data)]
data.drop(naughty_cols, axis=1, inplace=True)
# Any more nans?
data.columns[data.isnull().any()].tolist()
['emp_title',
 'desc',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'last_pymnt_d',
 'next_pymnt_d']
# We could write some code to do this, but I'm going to do it manually for now
string_features = ["emp_title", "desc"]
data[string_features] = data[string_features].fillna(value='')
numeric_features = ["mths_since_last_delinq", "mths_since_last_record"]
data[numeric_features] = data[numeric_features].fillna(value=0)
# Any more nans, just ditch them?
just_ditch = data.columns[data.isnull().any()].tolist()
just_ditch
['last_pymnt_d', 'next_pymnt_d']
data.drop(just_ditch, axis=1, inplace=True)

Normally, we would continue improving the features until we were happy we couldn’t do any more.

When you do, remember that you will have to apply the same steps to any new incoming data. So remember to make the pre-processing clean and repeatable.
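One way to keep the pre-processing repeatable is to package the steps from above into a single function that can be applied to any new batch. This is only a sketch: the column names match the dataset, but the `clean_loans` helper is hypothetical, not from this workshop.

```python
import numpy as np
import pandas as pd

def clean_loans(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper repeating the cleaning steps above on fresh data."""
    df = df.copy()
    # 'n/a' strings become real NaNs, as in the workshop
    df.replace("n/a", np.nan, inplace=True)
    if "emp_length" in df:
        # Fill missing with 0, strip non-digits, cast to int
        df["emp_length"] = (
            df["emp_length"].fillna("0")
              .astype(str)
              .str.replace(r"[^0-9]+", "", regex=True)
              .replace("", "0")
              .astype(int)
        )
    if "term" in df:
        # ' 36 months' -> 36
        df["term"] = (
            df["term"].astype(str)
              .str.replace(r"[^0-9]+", "", regex=True)
              .astype(int)
        )
    return df

sample = pd.DataFrame({
    "emp_length": ["10+ years", "n/a", "< 1 year"],
    "term": [" 36 months", " 60 months", " 36 months"],
})
print(clean_loans(sample))
```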

Now, let’s try and convert all those string values into numeric categories for a tree algorithm…

from sklearn import preprocessing

selected = pd.DataFrame(data)

X = selected.apply(preprocessing.LabelEncoder().fit_transform)
X.head()

loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade emp_title emp_length ... total_rec_int total_rec_late_fee recoveries collection_recovery_fee last_pymnt_amnt last_credit_pull_d collections_12_mths_ex_med policy_code application_type acc_now_delinq
0 32 32 32 0 6 119 1 6 0 10 ... 209 0 0 0 128 19 0 0 0 0
1 12 12 12 1 13 9 2 13 582 1 ... 71 0 27 4 79 44 0 0 0 0
2 10 10 10 0 14 28 2 14 0 10 ... 132 0 0 0 525 19 0 0 0 0
3 92 99 111 0 10 361 2 10 12 10 ... 588 13 0 0 323 18 0 0 0 0
4 14 14 14 1 9 13 1 9 744 1 ... 271 0 0 0 44 19 0 0 0 0

5 rows × 48 columns

Just for giggles, let’s fit a tree classifier and view the accuracy and feature importances.

This might give us some insight into what features are important and a baseline performance.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=3)
clf = clf.fit(X, bad_indicator)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, bad_indicator, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
Accuracy: 0.95 (+/- 0.02)

Uh oh! Look how high that accuracy score is!

This should raise alarm bells.

Either the problem is super simple (and you can see the simplicity in plots) or something is not right.

Let’s look at the importances…

importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

names = X.columns

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. collection_recovery_fee (0.271639)
2. recoveries (0.227616)
3. total_rec_prncp (0.151820)
4. total_pymnt (0.100977)
5. total_pymnt_inv (0.059508)
6. loan_amnt (0.037773)
7. last_pymnt_amnt (0.033616)
8. installment (0.033328)
9. dti (0.020536)
10. grade (0.012934)
11. sub_grade (0.012478)
12. zip_code (0.011637)
13. term (0.006332)
14. earliest_cr_line (0.004536)
15. verification_status (0.003681)
16. addr_state (0.002456)
17. open_acc (0.001890)
18. total_acc (0.001872)
19. total_rec_int (0.001653)
20. mths_since_last_delinq (0.001494)
21. funded_amnt (0.001456)
22. last_credit_pull_d (0.000511)
23. url (0.000258)
24. int_rate (0.000000)
25. collections_12_mths_ex_med (0.000000)
26. issue_d (0.000000)
27. total_rec_late_fee (0.000000)
28. emp_title (0.000000)
29. policy_code (0.000000)
30. funded_amnt_inv (0.000000)
31. emp_length (0.000000)
32. home_ownership (0.000000)
33. annual_inc (0.000000)
34. out_prncp (0.000000)
35. desc (0.000000)
36. pymnt_plan (0.000000)
37. initial_list_status (0.000000)
38. purpose (0.000000)
39. title (0.000000)
40. out_prncp_inv (0.000000)
41. delinq_2yrs (0.000000)
42. application_type (0.000000)
43. inq_last_6mths (0.000000)
44. mths_since_last_record (0.000000)
45. pub_rec (0.000000)
46. revol_bal (0.000000)
47. revol_util (0.000000)
48. acc_now_delinq (0.000000)

[Figure: bar chart of feature importances with standard deviations]

Ahhhhhh. Look at the top two features:

  1. collection_recovery_fee (0.271639)
  2. recoveries (0.227616)

The recovery fee received and the amount recovered. These are directly related to loan defaults; you will only see a recovery if there is a loan default.

Clearly, we won’t have these features unless a default has already occurred, and in that case there’s certainly no point in trying to predict it!

This is a perfect example of data leakage: using data that would be impossible to obtain at prediction time, usually because it is a direct consequence of the very event you are trying to predict.
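One cheap way to screen for this kind of leakage is to check whether any single feature on its own correlates almost perfectly with the target. The data below is synthetic and the feature names are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200).astype(bool)

X = pd.DataFrame({
    # unrelated to the target
    "honest_feature": rng.normal(size=200),
    # nonzero only when y is True -- like 'recoveries' above
    "leaky_feature": y * rng.uniform(10, 100, size=200),
})

# Absolute (point-biserial) correlation with the target, per feature.
# A feature near the top with a suspiciously high value deserves scrutiny.
corrs = X.apply(lambda col: np.corrcoef(col, y)[0, 1]).abs()
print(corrs.sort_values(ascending=False))
```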

Just for further giggles, let’s plot a box plot of the recoveries data…

loan_amount = pd.DataFrame([selected["recoveries"], bad_indicator]).transpose()
loan_amount.boxplot(by="loan_status")
plt.show()

[Figure: box plot of recoveries grouped by loan status]

This is plotting the recoveries data by the loan status. Note how all of the “normal” loans have zero recoveries.

If this really were a feature, we could just threshold above 0 and call the loan “suffering”.
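That “threshold above 0” rule really is that simple. A toy illustration with made-up numbers, not the workshop data:

```python
import pandas as pd

# Synthetic values: two bad loans with recoveries, four good loans without
recoveries = pd.Series([0.0, 0.0, 117.5, 0.0, 892.3, 0.0])
is_bad = pd.Series([False, False, True, False, True, False])

# The one-line 'model': any recovery at all means the loan went bad
predicted_bad = recoveries > 0
accuracy = (predicted_bad == is_bad).mean()
print(accuracy)  # 1.0 on this toy sample
```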

Hence why we got 95% in the accuracy score!

However, this raises the question: if this was so easy, why did we get 95% and not 100%? Most likely because the reverse doesn’t hold: some bad loans (a loan that is merely late, or a charge-off where nothing was recovered) still show zero recoveries, so the rule can’t separate them.

Let’s remove those features and try again…

X.drop(["collection_recovery_fee", "recoveries"], axis=1, inplace=True)
clf = RandomForestClassifier(max_depth=3)
clf = clf.fit(X, bad_indicator)
scores = cross_val_score(clf, X, bad_indicator, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
Accuracy: 0.89 (+/- 0.01)

OK, this is starting to look a bit more feasible.

importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

names = X.columns

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
# plt.savefig('../img/feature_importances.svg', transparent=True, bbox_inches='tight', pad_inches=0)
plt.show()
Feature ranking:
1. last_pymnt_amnt (0.168353)
2. total_rec_prncp (0.151796)
3. total_pymnt_inv (0.135874)
4. total_pymnt (0.069144)
5. funded_amnt (0.052595)
6. int_rate (0.049567)
7. revol_util (0.045033)
8. earliest_cr_line (0.044411)
9. out_prncp (0.039759)
10. total_rec_late_fee (0.037326)
11. last_credit_pull_d (0.032130)
12. sub_grade (0.026536)
13. revol_bal (0.021711)
14. dti (0.020970)
15. url (0.016244)
16. zip_code (0.013791)
17. grade (0.011085)
18. emp_length (0.009951)
19. purpose (0.008554)
20. verification_status (0.008065)
21. open_acc (0.007716)
22. emp_title (0.007045)
23. total_rec_int (0.006613)
24. total_acc (0.005885)
25. loan_amnt (0.005698)
26. addr_state (0.004147)
27. installment (0.000000)
28. issue_d (0.000000)
29. term (0.000000)
30. home_ownership (0.000000)
31. annual_inc (0.000000)
32. funded_amnt_inv (0.000000)
33. acc_now_delinq (0.000000)
34. pymnt_plan (0.000000)
35. desc (0.000000)
36. title (0.000000)
37. application_type (0.000000)
38. inq_last_6mths (0.000000)
39. mths_since_last_delinq (0.000000)
40. mths_since_last_record (0.000000)
41. pub_rec (0.000000)
42. initial_list_status (0.000000)
43. out_prncp_inv (0.000000)
44. collections_12_mths_ex_med (0.000000)
45. policy_code (0.000000)
46. delinq_2yrs (0.000000)

[Figure: bar chart of feature importances after removing the leaky features]

The best features now are:

  1. last_pymnt_amnt (0.168353)
  2. total_rec_prncp (0.151796)
  3. total_pymnt_inv (0.135874)
  4. total_pymnt (0.069144)

So it seems there is some correlation between how much has been paid off and delinquency.

We should look into these correlations more.

So, what we’d do now is drop all but the top 10 or so features and see if we can improve that data (with scaling, encoding, missing-data imputation, etc.)

We would also consider combinations of features, e.g. the proportion of the loan repaid.
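As a sketch of that last idea, the proportion repaid could be derived from existing columns. The column names match the dataset; the toy values are made up:

```python
import pandas as pd

# Toy rows: fully repaid, half repaid, nothing repaid
df = pd.DataFrame({
    "total_rec_prncp": [5000.0, 1200.0, 0.0],
    "funded_amnt":     [5000.0, 2400.0, 3000.0],
})

# Derived feature: fraction of the funded principal already received
df["prop_repaid"] = df["total_rec_prncp"] / df["funded_amnt"]
print(df["prop_repaid"].tolist())  # [1.0, 0.5, 0.0]
```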