Data Cleaning Example - Loan Data

Welcome! This workshop is from Winder.ai. Sign up to receive more free workshops, training and videos.

A huge amount of time is spent cleaning, removing and scaling data, all in an effort to squeeze a bit more performance out of the model.

The data we are using is from Kaggle and is available in raw form from here. You will need to sign in to Kaggle if you want to download the full data; I’ve included just a small sample.

It is a loan dataset showing which loans have suffered repayment issues. There are a lot of columns, many of them useless, and many more with missing data.

The goal I set out to achieve was to predict which loans would suffer problems, but we will see that getting this right requires a lot more time.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv("https://s3.eu-west-2.amazonaws.com/assets.winderresearch.com/data/loan_small.csv")
# These are the columns
data.columns
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose',
       'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt',
       'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med',
       'mths_since_last_major_derog', 'policy_code', 'application_type',
       'annual_inc_joint', 'dti_joint', 'verification_status_joint',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
       'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',
       'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
       'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl',
       'inq_last_12m'],
      dtype='object')
# There are some columns called "id". ID columns don't provide any predictive power
# so let's double check, then remove them.
display(data[["id", "member_id", "emp_title"]].head())
data.drop(['id', 'member_id'], axis=1, inplace=True)

        id  member_id                 emp_title
0  1077501    1296599                       NaN
1  1077430    1314167                     Ryder
2  1077175    1313524                       NaN
3  1076863    1277178       AIR RESOURCES BOARD
4  1075358    1311748  University Medical Group
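
As an aside, you could catch identifier-like columns programmatically. Here is a minimal sketch (not something we ran above) that flags any column whose values are unique on every row:

# Sketch: flag likely identifier columns, i.e. columns where every row
# holds a distinct value and so carries no predictive signal.
id_like = [col for col in data.columns if data[col].nunique() == len(data)]
print(id_like)

This is only a heuristic; a column like url is also unique per row, and high-cardinality continuous features can come close.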

Let’s take a deeper look at the data.

I see that there is a combination of numerical, categorical and some messed-up categorical data here.

Let’s try and fix some of the columns as an example. In reality, you’d have to do a lot more to fix this data.

data.head()

   loan_amnt  funded_amnt  funded_amnt_inv       term  int_rate  installment grade sub_grade                 emp_title emp_length  ...  inq_fi  total_cu_tl  inq_last_12m
0     5000.0       5000.0           4975.0  36 months     10.65       162.87     B        B2                       NaN  10+ years  ...     NaN          NaN           NaN
1     2500.0       2500.0           2500.0  60 months     15.27        59.83     C        C4                     Ryder   < 1 year  ...     NaN          NaN           NaN
2     2400.0       2400.0           2400.0  36 months     15.96        84.33     C        C5                       NaN  10+ years  ...     NaN          NaN           NaN
3    10000.0      10000.0          10000.0  36 months     13.49       339.31     C        C1       AIR RESOURCES BOARD  10+ years  ...     NaN          NaN           NaN
4     3000.0       3000.0           3000.0  60 months     12.69        67.79     B        B5  University Medical Group     1 year  ...     NaN          NaN           NaN

5 rows × 72 columns

display(set(data["emp_length"]))
data.replace('n/a', np.nan,inplace=True)
{nan,
 '9 years',
 '3 years',
 '< 1 year',
 '1 year',
 '6 years',
 '5 years',
 '4 years',
 '7 years',
 '8 years',
 '10+ years',
 '2 years'}
data.emp_length.fillna(value=0,inplace=True)
set(data["emp_length"])
{0,
 '9 years',
 '3 years',
 '< 1 year',
 '1 year',
 '6 years',
 '5 years',
 '4 years',
 '7 years',
 '8 years',
 '10+ years',
 '2 years'}
data['emp_length'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
data['emp_length'] = data['emp_length'].astype(int)
set(data["emp_length"])
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
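
One caveat with the regex approach: '< 1 year' and '1 year' both collapse to 1. If that distinction mattered, an explicit mapping (a sketch, not what was run above) keeps control over every label:

# Sketch: an explicit mapping keeps '< 1 year' distinct from '1 year',
# which the regex above cannot do. This would replace the fillna/regex
# steps; -1 is an assumed sentinel for missing values.
emp_map = {'< 1 year': 0, '1 year': 1, '2 years': 2, '3 years': 3,
           '4 years': 4, '5 years': 5, '6 years': 6, '7 years': 7,
           '8 years': 8, '9 years': 9, '10+ years': 10}
# data['emp_length'] = data['emp_length'].map(emp_map).fillna(-1).astype(int)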

I see another field, term, that can be reduced to a cleaner numeric label.

set(data['term'])
{' 36 months', ' 60 months'}
data['term'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
data['term'] = data['term'].astype(int)
set(data["term"])
{36, 60}

Now let’s try and define what a bad loan is…

set(data["loan_status"])
{'Charged Off', 'Current', 'Default', 'Fully Paid', 'Late (31-120 days)'}
# This indicates a bad loan. Something we want to predict
bad_indicator = data["loan_status"].isin(["Charged Off", "Default", "Late (31-120 days)"])
# Remove this from dataset
data.drop(["loan_status"], axis=1, inplace=True)
bad_indicator.value_counts()
False    820
True     179
Name: loan_status, dtype: int64

Note to future self: we have imbalanced classes here, which affects some algorithms.
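
If the imbalance does become a problem, one low-effort mitigation is to weight the classes inside the classifier. A sketch, not applied in this workshop:

# Sketch: many sklearn classifiers accept class_weight, which
# reweights samples inversely to their class frequency.
from sklearn.ensemble import RandomForestClassifier
clf_balanced = RandomForestClassifier(max_depth=3, class_weight='balanced')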

# Find columns that have all nans and remove
naughty_cols = data.columns[data.isnull().sum() == len(data)]
data.drop(naughty_cols, axis=1, inplace=True)
# Any more nans?
data.columns[data.isnull().any()].tolist()
['emp_title',
 'desc',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'last_pymnt_d',
 'next_pymnt_d']
# We could write some code to do this, but I'm going to do it manually for now
string_features = ["emp_title", "desc"]
data[string_features] = data[string_features].fillna(value='')
numeric_features = ["mths_since_last_delinq", "mths_since_last_record"]
data[numeric_features] = data[numeric_features].fillna(value=0)
# Any more nans, just ditch them?
just_ditch = data.columns[data.isnull().any()].tolist()
just_ditch
['last_pymnt_d', 'next_pymnt_d']
data.drop(just_ditch, axis=1, inplace=True)
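
For the record, the automated version of that manual fill might look something like this sketch, which chooses the fill value from each column's dtype:

# Sketch: fill remaining NaNs by dtype instead of naming columns by hand.
for col in data.columns[data.isnull().any()]:
    if data[col].dtype == object:
        data[col] = data[col].fillna('')   # strings get an empty string
    else:
        data[col] = data[col].fillna(0)    # numerics get zero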

Normally, we would continue improving the features until we were happy we couldn’t do any more.

When you do, remember that you will have to apply the same steps to any new incoming data. So remember to keep the pre-processing clean and tidy.
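
One way is to bundle the steps into a single function, so any new batch gets identical treatment. A sketch built from the steps we ran above:

# Sketch: one entry point for the cleaning performed so far.
def clean_loans(df):
    df = df.drop(['id', 'member_id'], axis=1, errors='ignore')
    df = df.replace('n/a', np.nan)
    df['emp_length'] = (df['emp_length'].fillna(0)
                          .replace(to_replace='[^0-9]+', value='', regex=True)
                          .astype(int))
    df['term'] = (df['term'].replace(to_replace='[^0-9]+', value='', regex=True)
                    .astype(int))
    return df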

Now, let’s try and convert all those string values into numeric categories for a tree algorithm…

from sklearn import preprocessing

selected = pd.DataFrame(data)

X = selected.apply(preprocessing.LabelEncoder().fit_transform)
X.head()

[Output: X.head() — 5 rows × 48 columns, with every value replaced by an integer category code]
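
One caveat with apply(LabelEncoder().fit_transform): it fits a fresh encoder on each column and immediately discards it, so there is no way to encode new incoming data consistently. A sketch that keeps the fitted encoders around instead:

# Sketch: keep one fitted LabelEncoder per column so the same mapping
# can be reused (via .transform) on new data later.
encoders = {}
X_alt = pd.DataFrame(index=selected.index)
for col in selected.columns:
    encoders[col] = preprocessing.LabelEncoder()
    X_alt[col] = encoders[col].fit_transform(selected[col])
# e.g. later: encoders['grade'].transform(new_data['grade'])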

Just for giggles, let’s fit a tree classifier and view the accuracy and feature importances.

This might give us some insight into what features are important and a baseline performance.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=3)
clf = clf.fit(X, bad_indicator)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, bad_indicator, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
Accuracy: 0.95 (+/- 0.02)

Uh oh! Look how high that accuracy score is!

This should raise alarm bells.

Either the problem is super simple (and you can see the simplicity in plots) or something is not right.
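
A cheap first check is a dumb baseline. With 820 good loans out of 999, always predicting “good” already scores about 82%, which puts that 95% in context. A sketch:

# Sketch: a majority-class baseline for comparison.
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X, bad_indicator, cv=5, scoring='accuracy')
print("Baseline accuracy: %0.2f" % baseline_scores.mean())  # roughly 820/999 = 0.82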

Let’s look at the importances…

importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

names = X.columns

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
Feature ranking:
1. collection_recovery_fee (0.271639)
2. recoveries (0.227616)
3. total_rec_prncp (0.151820)
4. total_pymnt (0.100977)
5. total_pymnt_inv (0.059508)
6. loan_amnt (0.037773)
7. last_pymnt_amnt (0.033616)
8. installment (0.033328)
9. dti (0.020536)
10. grade (0.012934)
11. sub_grade (0.012478)
12. zip_code (0.011637)
13. term (0.006332)
14. earliest_cr_line (0.004536)
15. verification_status (0.003681)
16. addr_state (0.002456)
17. open_acc (0.001890)
18. total_acc (0.001872)
19. total_rec_int (0.001653)
20. mths_since_last_delinq (0.001494)
21. funded_amnt (0.001456)
22. last_credit_pull_d (0.000511)
23. url (0.000258)
24. int_rate (0.000000)
25. collections_12_mths_ex_med (0.000000)
26. issue_d (0.000000)
27. total_rec_late_fee (0.000000)
28. emp_title (0.000000)
29. policy_code (0.000000)
30. funded_amnt_inv (0.000000)
31. emp_length (0.000000)
32. home_ownership (0.000000)
33. annual_inc (0.000000)
34. out_prncp (0.000000)
35. desc (0.000000)
36. pymnt_plan (0.000000)
37. initial_list_status (0.000000)
38. purpose (0.000000)
39. title (0.000000)
40. out_prncp_inv (0.000000)
41. delinq_2yrs (0.000000)
42. application_type (0.000000)
43. inq_last_6mths (0.000000)
44. mths_since_last_record (0.000000)
45. pub_rec (0.000000)
46. revol_bal (0.000000)
47. revol_util (0.000000)
48. acc_now_delinq (0.000000)
[Figure: bar chart of feature importances, with standard deviations across the trees]

Ahhhhhh. Look at the top two features:

  1. collection_recovery_fee (0.271639)
  2. recoveries (0.227616)

The collection recovery fee and the amount recovered. These are directly related to loan defaults; you will only get a recovery if there has been a loan default.

Clearly, we won’t have these features unless a default has already occurred, and in that case there’s certainly no point in trying to predict it!

This is a perfect example of data leakage: using data that would be impossible to obtain at prediction time, usually because it is a direct consequence of the very event you are trying to predict.
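
There is no foolproof automated test for leakage, but a quick heuristic sketch is to rank features by how far apart their class means sit, in units of the feature's spread; anything that separates the classes almost perfectly deserves suspicion:

# Sketch: standardised gap between class means, per feature.
# A huge gap hints (but does not prove) that a feature leaks the target.
means = X.groupby(bad_indicator).mean()
gap = (means.loc[True] - means.loc[False]).abs() / (X.std() + 1e-9)
print(gap.sort_values(ascending=False).head())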

Just for further giggles, let’s plot a box plot of the recoveries data…

recoveries = pd.DataFrame([selected["recoveries"], bad_indicator]).transpose()
recoveries.boxplot(by="loan_status")
plt.show()
[Figure: box plot of recoveries grouped by loan status]

This is plotting the recoveries data by the loan status. Note how all of the “normal” loans have zero recoveries.

If this really was a feature, we could just threshold above 0 and say the loan was “suffering”.

That is why we got 95% on the accuracy score!

However, this raises the question: if it was so easy, why did we get 95% and not 100%?!
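
We can score that one-line rule directly with a quick sketch:

# Sketch: accuracy of the bare rule 'recoveries > 0 means a bad loan'.
rule = selected["recoveries"] > 0
print("Rule accuracy: %0.3f" % (rule == bad_indicator).mean())

The shortfall from 100% is most likely the loans that are late or in default but not yet charged off: nothing has been recovered on those yet, so the rule calls them good. That is an assumption worth checking against the original loan_status labels.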

Let’s remove those features and try again…

X.drop(["collection_recovery_fee", "recoveries"], axis=1, inplace=True)
clf = RandomForestClassifier(max_depth=3)
clf = clf.fit(X, bad_indicator)
scores = cross_val_score(clf, X, bad_indicator, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
Accuracy: 0.89 (+/- 0.01)

OK, this is starting to look a bit more feasible.

importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

names = X.columns

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1, names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
# plt.savefig('../img/feature_importances.svg', transparent=True, bbox_inches='tight', pad_inches=0)
plt.show()
Feature ranking:
1. last_pymnt_amnt (0.168353)
2. total_rec_prncp (0.151796)
3. total_pymnt_inv (0.135874)
4. total_pymnt (0.069144)
5. funded_amnt (0.052595)
6. int_rate (0.049567)
7. revol_util (0.045033)
8. earliest_cr_line (0.044411)
9. out_prncp (0.039759)
10. total_rec_late_fee (0.037326)
11. last_credit_pull_d (0.032130)
12. sub_grade (0.026536)
13. revol_bal (0.021711)
14. dti (0.020970)
15. url (0.016244)
16. zip_code (0.013791)
17. grade (0.011085)
18. emp_length (0.009951)
19. purpose (0.008554)
20. verification_status (0.008065)
21. open_acc (0.007716)
22. emp_title (0.007045)
23. total_rec_int (0.006613)
24. total_acc (0.005885)
25. loan_amnt (0.005698)
26. addr_state (0.004147)
27. installment (0.000000)
28. issue_d (0.000000)
29. term (0.000000)
30. home_ownership (0.000000)
31. annual_inc (0.000000)
32. funded_amnt_inv (0.000000)
33. acc_now_delinq (0.000000)
34. pymnt_plan (0.000000)
35. desc (0.000000)
36. title (0.000000)
37. application_type (0.000000)
38. inq_last_6mths (0.000000)
39. mths_since_last_delinq (0.000000)
40. mths_since_last_record (0.000000)
41. pub_rec (0.000000)
42. initial_list_status (0.000000)
43. out_prncp_inv (0.000000)
44. collections_12_mths_ex_med (0.000000)
45. policy_code (0.000000)
46. delinq_2yrs (0.000000)
[Figure: bar chart of feature importances after removing the leaky features]

The best features now are:

  1. last_pymnt_amnt (0.168353)
  2. total_rec_prncp (0.151796)
  3. total_pymnt_inv (0.135874)
  4. total_pymnt (0.069144)

So it seems there is some correlation between how much of the loan has been paid off and delinquency.

We should look into these correlations more.

So, what we’d do now is chop off all but the first ten or so features and see if we can improve that data (with scaling, encoding, missing-data imputation, etc.).
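
As a sketch of that pruning step, using the importance ranking we just computed:

# Sketch: keep only the ten most important features and re-score.
top10 = names[indices[:10]]
top10_scores = cross_val_score(RandomForestClassifier(max_depth=3),
                               X[top10], bad_indicator, cv=5, scoring='accuracy')
print("Top-10 accuracy: %0.2f (+/- %0.2f)" % (top10_scores.mean(), top10_scores.std()))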

We would also consider combinations of features, e.g. the proportion of the loan repaid.
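
For example, a hypothetical engineered feature for the share of the funded amount already repaid as principal, built from the raw (pre-encoding) frame:

# Sketch: fraction of the funded amount repaid so far. Assumes
# funded_amnt is always positive, as it should be for issued loans.
prop_repaid = selected["total_rec_prncp"] / selected["funded_amnt"]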
