The following was submitted as part of my graduate capstone sequence, and as such had to conform to very specific guidelines for language use, method justification, and overall presentation.
Research Question
“Can a random forest model trained on HMDA mortgage application data for the state of Delaware predict the outcomes of applications unseen by the model?”
It is typical that a loan applicant with very low income and a high-cost loan request will be rejected; it is unlikely that the bank will be repaid in such cases. Other factors bear on loan decisions as well. This data study contributes research toward the prediction of home-loan approval using these and similar predictors. It will be of interest to prospective applicants, finance professionals, and the MSDA program. The government requires mortgage lending institutions to make comprehensive data on home loans (that nonetheless protects the privacy of borrowers) available for public consideration; the stated purpose is so that financial institutions may have their practices scrutinized (Office of the Law Revision Counsel, 1975). The data is available in a year-by-year format from the Federal Financial Institutions Examination Council and represents millions of applications, with demographic and real-estate features for each as well as a flag indicating the decision of the lending institution. A supervised learning algorithm will be applied to classify applications as ‘originated’ (accepted) or not (rejected). The study is limited to the state of Delaware and the data available for 2020. Supervised learning is appropriate because we wish to learn patterns from labeled historical data and then predict on new data with the same structure (Nasteski, 2017). The programming language chosen for this project is Python, and the supervised algorithm is random forest through the sklearn.ensemble module. Tumuleru et al. published a paper on random forest classification using several typical predictors such as income, age, debt-to-income ratio, and more, with the target outcome of ‘loan approved’ or ‘loan rejected’; after evaluating several different types of ML algorithms, they concluded that random forests achieved the desired accuracy (Tumuleru, 2022). That this result was obtained for consumer loans rather than mortgages specifically suggests a need for the study outlined in this paper.
The hypotheses under consideration are presented below. They are tested by application of McNemar’s test, a form of chi-square test that determines whether a classification algorithm performs better than a naive baseline that predicts according to the proportion of the dominant class. The McNemar test was used in this way for binary classification by Cansiz (2021); a toy illustration of the test follows the hypotheses.
H0: A random forest model that attains greater predictive accuracy on unseen mortgage application outcomes, as measured by a McNemar test against that of a naive model, cannot be trained on the available data.
H1: A random forest model that attains greater predictive accuracy on unseen mortgage application outcomes, as measured by a McNemar test against that of a naive model, can be trained on the available data.
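For intuition, McNemar’s test depends only on the discordant pairs: test cases where exactly one of the two classifiers is correct. The minimal sketch below uses hypothetical counts; the statsmodels mcnemar function used later in this paper accepts the same 2x2 layout.

from statsmodels.stats.contingency_tables import mcnemar

# hypothetical paired-correctness counts on the same test cases:
#                  naive correct   naive wrong
# model correct        700             200
# model wrong           10              90
table = [[700, 200],
         [10, 90]]

# the exact test applies a binomial distribution to the 200 vs 10 discordant cells
result = mcnemar(table, exact=True)
print(result.pvalue)  # very small p-value: the two error patterns differ

With 200 cases won by the model against only 10 won by the naive baseline, the exact binomial p-value is effectively zero, so the null of equal error proportions would be rejected.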
Data Collection
The dataset is requisitioned in .csv format from the Modified Loan/Application Register, Snapshot National Loan Level Dataset (FFIEC, 2020). An advantage of the way this set is published is that data is available for each year starting with 2017, and a link to the variable schema is available from the same page. A major problem is that there is no way to filter the download by state; as a result, the file is over 10 GB in size when uncompressed. A file of this size will immediately exhaust available memory, though only a small part of it (the Delaware rows) is required. This is overcome by reading the .csv file one small chunk at a time and storing only the rows with the ‘DE’ flag in the state variable.
Data Extraction and Preparation
The Python modules used in this analysis are as follows:
import pandas as pd  # dataframe object type
import seaborn as sns  # attractive plotting
import matplotlib.pyplot as plt  # additional plotting tools
import numpy as np  # scientific computing across dataframes
from sklearn.model_selection import train_test_split  # automatic split of training and testing data
from sklearn.ensemble import RandomForestRegressor  # key supervised learning algorithm
from sklearn.tree import plot_tree  # decision tree visualization
from statsmodels.stats.contingency_tables import mcnemar  # hypothesis test for model significance
import warnings  # hide warnings for certain code blocks
import csv  # write reproducible results
These, combined with base Python 3, allow for data requisition and cleaning according to the needs of the analysis. The process begins with the extraction of data from the locally stored large file. The code below reads the file in manageable chunks and keeps the observations flagged for the state of Delaware. This code generates one warning per chunk for columns having mixed data types; the filterwarnings call suppresses that output.
warnings.filterwarnings('ignore')

# instantiate empty list
chunk_list = []

# read raw data in chunks and store Delaware rows in list
for chunk in pd.read_csv('/Volumes/EXT128/WGU/Capstone/2020_public_lar_one_year.csv', chunksize=500000):
    chunk_list.append(chunk[chunk['state_code'] == 'DE'])
The observations in chunk_list are concatenated into the key dataframe of the analysis, named de for Delaware.
# pandas dataframe for desired subset
de = pd.concat(chunk_list)

# display sample
pd.options.display.max_columns = None
de.head()
# compute and print sparsity
spar = de.isnull().sum().sum() / (len(de.axes[0]) * len(de.axes[1]))
print('Data Sparsity: ', spar)
Data Sparsity: 0.31783593423409684
This initial data is very sparse at roughly 32%. A goal of the data preparation is to reduce this to zero. Before that, a brief look at the available values for the target variable is presented using a bar chart.
# display outcome distribution
outcome = sns.countplot(x=de["action_taken"])
outcome.set(xlabel="Action Taken", ylabel="Count", title='Distribution of Outcomes')
Critical to this project is an understanding of the various codes available for loan decisions. The FFIEC schema for this variable is as follows, presented for clarity (FFIEC, 2019).
1 - Loan Originated
Use this code for an application that was originated, including an originated loan that resulted from a preapproval request.
2 - Application Approved But Not Accepted
Use this code if a credit decision approving the application was made but the applicant did not accept (did not close the loan) or rescinded after closing.
3 - Application Denied
• Use this code if the application was denied.
• Use this code if a counteroffer was not accepted.
4 - Application Withdrawn by Applicant
• Use this code only when the application is expressly withdrawn by the applicant before a credit decision is made and before the file is closed for incompleteness.
• Use this code if a conditional approval includes creditworthiness conditions and the applicant expressly withdraws before satisfying them.
5 - File Closed for Incompleteness
Use this code if the applicant was sent a written notice of incompleteness (NOI) and did not respond to the request for more information.
6 - Loan (re)purchased
Use this code if the institution repurchased a loan that it previously sold.
7 - Preapproval request denied
Use this code if a Preapproval was denied.
8 - Preapproval request approved but not accepted
Use this code if the preapproval was approved but the applicant did not accept, or the borrower indicated that they were no longer interested.
There are a number of outcomes for which it is hard to say whether they should be classed as rejected outright. This analysis is therefore reduced to a binary classification problem by removing the observations that do not fit the originated/rejected paradigm. Before that, the minority nature of rejected loans warrants investigation, as some variables may be correlated with rejection exclusively. As noted earlier, the sparsity is high. One strategy for reducing NaN content is to delete columns that are largely empty and then drop the remaining observations with NAs. If one is not careful with this method, columns that are NaN because of their association with the minority class will cause all or most minority observations to be deleted. The strategy here therefore begins by investigating the prevalence of null values within the class of rejected loans.
# show structure of rejected loans class
de.loc[de['action_taken'] == 3].info()
[Output of 99 variable descriptions hidden for presentation purposes; please run the code if you would like to see it.]
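A compact way to quantify which columns are disproportionately null for rejected applications, rather than scanning the 99 rows of info() output, is to compare per-class null rates directly. The following sketch is not part of the original pipeline but illustrates the check:

# fraction of null values per column within each outcome class
null_rej = de.loc[de['action_taken'] == 3].isnull().mean()
null_orig = de.loc[de['action_taken'] == 1].isnull().mean()

# columns whose nullness is most skewed toward the rejected class
print((null_rej - null_orig).sort_values(ascending=False).head(10))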
This is a large number of variables, many of which should be removed to improve model quality. The following code drops undesired columns by index number. The strategy is to reduce sparsity and inter-variable correlation by deleting columns whose information can be found elsewhere. For instance, the extra columns relating to ethnicity and race are mostly left empty, so they are removed in favor of the primary declaration. Another variable, derived_loan_product_type, is defined in the schema as a combination of variables involving construction and property type.
# drop undesired columns by index
de_r = de.drop(de.columns[[0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 15, 16, 17, 22, 23, 24, 26, 27, 28,
                           29, 30, 32, 33, 39, 43, 44, 46, 50, 51, 52, 53, 55, 56, 57, 58, 59,
                           60, 62, 63, 64, 65, 67, 68, 69, 70, 71, 72, 75, 76, 79, 80, 81, 84,
                           85, 86, 87, 88, 89, 90, 91]], axis=1)
Before displaying the data kept for analysis, the null values are dropped. Then the datatype of each column is enforced as either a pandas category or a floating-point number. This is important because most of the categorical variables are coded as integers per the schema’s loan-industry definitions. Additionally, the ‘Exempt’ observations of the loan_term variable are dropped because of a low observation count combined with the non-numeric value conflicting with the numeric nature of the variable. Finally, only observations with target variable outcomes 1 and 3 are kept, recoded as 1 for ‘originated’ and 0 for ‘rejected’.
# drop observations with null values from reduced frame
de_r = de_r.dropna()

# drop observations with loan term = 'Exempt'
de_r.drop(de_r[de_r['loan_term'] == 'Exempt'].index, inplace=True)

# exclude targets outside the binary outcome of interest
de_r = de_r.loc[(de_r['action_taken'] == 1) | (de_r['action_taken'] == 3)]

# recode binary outcome: 0 = rejected, 1 = originated
de_r['action_taken'] = np.where(de_r['action_taken'] == 3, 0, 1)

# establish data types as categorical or numeric
cat_cols = ['hoepa_status', 'derived_loan_product_type', 'derived_dwelling_category',
            'action_taken', 'purchaser_type', 'preapproval', 'reverse_mortgage',
            'open_end_line_of_credit', 'business_or_commercial_purpose',
            'negative_amortization', 'interest_only_payment', 'balloon_payment',
            'other_nonamortizing_features', 'occupancy_type',
            'manufactured_home_secured_property_type',
            'manufactured_home_land_property_interest', 'applicant_credit_score_type',
            'co_applicant_credit_score_type', 'applicant_ethnicity_1',
            'co_applicant_ethnicity_1', 'applicant_race_1', 'co_applicant_race_1',
            'applicant_sex', 'co_applicant_sex', 'applicant_age', 'co_applicant_age',
            'initially_payable_to_institution', 'aus_1']
num_cols = ['loan_amount', 'loan_term', 'property_value', 'income', 'tract_population',
            'tract_minority_population_percent', 'ffiec_msa_md_median_family_income',
            'tract_to_msa_income_percentage', 'tract_owner_occupied_units',
            'tract_one_to_four_family_homes', 'tract_median_age_of_housing_units']
de_r[cat_cols] = de_r[cat_cols].astype('category')
de_r[num_cols] = de_r[num_cols].astype(float)

# display info
de_r.info()
# display binary outcome distribution
binary_outcome = sns.countplot(x=de_r["action_taken"])
binary_outcome.set(xlabel="Action Taken", ylabel="Count", title='Rejected or Originated')
# compute and print reduced frame sparsity (note: dimensions of de_r, not de)
spar_r = de_r.isnull().sum().sum() / (len(de_r.axes[0]) * len(de_r.axes[1]))
print('Clean Data Sparsity: ', spar_r)
Clean Data Sparsity: 0.0
The data has been reduced to 51,950 observations with 0% sparsity, one binary target variable, and 38 explanatory variables of both categorical and numeric natures. However, data preparation is not yet done: the random forest regressor cannot utilize categorical variables that are not easily convertible to numeric format. Indeed, many of the explanatory variables come encoded as integers but are generally not ordinal; that is, we cannot assume that numerically adjacent codes represent similar classes. One method of dealing with this is one-hot encoding the categorical variables, performed below with a built-in pandas method. Note that the drop_first argument, which deletes the first column of each one-hot variable to avoid multicollinearity in classification models, is set to True.
# one-hot encode all categorical predictors (the binary target is excluded)
de_r = pd.get_dummies(de_r, drop_first=True,
                      columns=[c for c in cat_cols if c != 'action_taken'])
At this point the data are prepared and ready to be split into training and testing sets. Below, the testing proportion is set to 15%, at which point data preparation is considered complete.
# extract prediction labels
labels = np.array(de_r['action_taken'])

# extract predictor variables
features = de_r.drop(['action_taken'], axis=1)

# feature names for variable importance calculation
feature_list = list(features.columns)

# train/test split from module
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.15, random_state=43022)
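Because the split is random rather than stratified, a quick sanity check (not in the original code) can confirm that the class proportions carry over to both partitions:

# proportion of 'originated' labels in each partition
print('train:', train_labels.mean().round(3))
print('test: ', test_labels.mean().round(3))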
Justification of Preparation Methods
In summary, data is acquired and analyzed for null values. Prevalence and location of null values determined the methods for sparsity reduction.
The primary strategy for sparsity reduction employed an understanding of the set: the target outcome of ‘rejected’ was found to be highly correlated with certain fields being null. This made those fields correlated with the target variable in a way that makes them undesirable for prediction, so they were removed. The advantage of this strategy is that most of the target observations were retained by the time the function to remove null values was called. A disadvantage is that a slight bias in favor of the rejected class was introduced, and many observations that did not fit the binary outcome could not be used.
The typical strategy of one-hot encoding categorical variables was applied. This is advantageous because it allows the random forest regressor to use variables that are not numeric, and most of the categorical variables were not ordinal despite being coded as integers. It is disadvantageous in that it discards the ordering of the one variable that is genuinely ordinal: applicant_age. One-hot encoding was applied to that column, which has an NA value coded as 8888, rather than recoding it ordinally, because a large number of observations would be lost when the error observations were removed.
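For contrast, the rejected alternative would recode applicant_age ordinally, something like the sketch below. The bin labels are assumed from the HMDA schema and may not match the file exactly; codes such as 8888 fall outside the map, become NaN, and would have to be dropped, which is the observation loss described above.

# hypothetical ordinal recoding of applicant_age (not used in this analysis)
age_order = {'<25': 0, '25-34': 1, '35-44': 2, '45-54': 3,
             '55-64': 4, '65-74': 5, '>74': 6}

# unmatched codes such as 8888 would map to NaN and be dropped with the nulls
# de_r['applicant_age'] = de_r['applicant_age'].map(age_order)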
Overall, preparation left a class imbalance in favor of originated loans; see the model evaluation section for additional discussion.
Analysis
A random forest regressor of 25 trees was instantiated with default hyperparameters and fit on the training data. The tree count was chosen to stay within memory constraints, and the tuning parameters were left unaltered after favorable evaluation metrics were observed on the holdout data set.
# instantiate random forest regressor with random state for reproducibility
n_tree = 25
rf = RandomForestRegressor(n_estimators=n_tree, random_state=43022)

# fit on training features and labels
rf.fit(train_features, train_labels)
The hypothesis of this project is tested with a McNemar chi-square test. The test analyzes the performance of the model by comparing its accuracy to that of a naive classifier that predicts based only on the prevalence of the majority class. McNemar’s test is applied to 2x2 contingency tables to determine whether the row and column marginal frequencies are equal for paired samples (Cansiz, 2021). The code below extracts the predictions and computes a data frame indicating correctness on the test set for both classifiers.
# extract predictions for the test set
predictions_ = rf.predict(test_features)

# predictions are continuous scores in [0, 1]; round to the nearest class and convert to integer
predictions = predictions_.round()
predictions = [int(x) for x in predictions]

# vector of correct predictions; 1 = correct, 0 = incorrect
pred_correct = np.where(predictions == test_labels, 1, 0)

# naive classifier: binomial draws with p = majority class proportion
no_info = np.random.binomial(n=1, p=sum(de_r['action_taken']) / len(de_r), size=len(pred_correct))

# vector of naive classifier correct predictions
naive_correct = np.where(no_info == test_labels, 1, 0)

# data frame for comparison
compare = pd.DataFrame({'pred_correct': pred_correct, 'naive_correct': naive_correct})

# view result
compare.sample(5)
# compute overall naive accuracy
naive_acc = sum(compare['naive_correct'] == 1) / len(compare['naive_correct'])

# compute overall rf accuracy
rf_acc = sum(compare['pred_correct'] == 1) / len(compare['pred_correct'])

# display results
print('The naive classifier achieved', round(naive_acc * 100), '% accuracy.')
print('The random forest classifier achieved', round(rf_acc * 100), '% accuracy.')
The naive classifier achieved 74 % accuracy.
The random forest classifier achieved 99 % accuracy.
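The roughly 74% naive figure is consistent with chance agreement under the class prior: if a fraction p of labels are ‘originated’ and the naive classifier guesses 1 with probability p, its expected accuracy is p^2 + (1 - p)^2. A quick check, using the same p as in the code above:

# expected accuracy of the binomial naive classifier under the class prior
p = sum(de_r['action_taken']) / len(de_r)  # majority ('originated') proportion
expected_naive = p ** 2 + (1 - p) ** 2
print('Expected naive accuracy:', round(expected_naive * 100), '%')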
The significance of the performance difference between the two classifiers is tested using a contingency table derived from the comparison data frame and the mcnemar function available from statsmodels.
# tally paired correctness for the contingency table
cc = len(compare[(compare.pred_correct == 1) & (compare.naive_correct == 1)])
cw = len(compare[(compare.pred_correct == 1) & (compare.naive_correct == 0)])
wc = len(compare[(compare.pred_correct == 0) & (compare.naive_correct == 1)])
ww = len(compare[(compare.pred_correct == 0) & (compare.naive_correct == 0)])

# define contingency table
table = [[cc, wc],
         [cw, ww]]

# calculate McNemar test
result = mcnemar(table, exact=True)

# summarize the finding
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

# interpret the p-value
alpha = 0.05
if result.pvalue > alpha:
    print('Same proportions of errors; fail to reject H0')
    print('The performance of the naive classifier has not been exceeded.')
else:
    print('Different proportions of errors; reject H0')
    print('The performance of the naive classifier has likely been exceeded.')
statistic=55.000, p-value=0.000
Different proportions of errors; reject H0
The performance of the naive classifier has likely been exceeded.
The distribution of errors for the random forest classifier is not normally distributed, nor is it expected to be given the binary nature of the outcomes. It is nonetheless provided: a -1 indicates a prediction of 0 where 1 was correct, and a 1 indicates a prediction of 1 where 0 was correct.
# compute residuals
residuals = np.array(predictions) - test_labels

# visualize distribution of errors
sns.histplot(residuals, discrete=True).set(title='Distribution of Binary Classification Residuals')
plt.xticks([-1, 0, 1])
To aid the results summary, the variable importances of the regressor are provided.
# extract variable importances
importances = list(rf.feature_importances_)

# pair each importance with its feature name
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]

# sort the list of feature importances
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# report the top 5 features
[print('Feature: {:20} Importance: {}'.format(*pair)) for pair in feature_importances[0:5]]
Feature: hoepa_status_3 Importance: 0.56
Feature: occupancy_type_2 Importance: 0.22
Feature: occupancy_type_3 Importance: 0.07
Feature: income Importance: 0.02
Feature: purchaser_type_1 Importance: 0.02
Justification of Analysis Methods
Random forest is advantageous for the mix of categorical and numeric data present in the set. A similar study was performed on loan data (not mortgages) in which random forest achieved the desired accuracy over other classification algorithms (Tumuleru, 2022), and this informed the selection of the algorithm. Some studies have cited a disadvantage of using random forests with one-hot encoded variables: specifically, that it induces data sparsity and can make feature importance difficult to interpret (Ravi, 2019).
Data Summary and Implications
In summary, a random forest was used to regress the predictor variables toward classification of loan applications as ‘originated’ or ‘rejected’. The finding was significant: 99% accuracy, better than that of a naive classifier based upon the prevalence of the majority ‘originated’ class. The null hypothesis, that such an effective prediction algorithm cannot be trained, is rejected.
The target classes were imbalanced in favor of originated loans. This could lead to misleading accuracy metrics; however, the high accuracy combined with McNemar’s test mitigated the problem by accounting for the accuracy inherent in predicting based upon the likelihood of the majority class. The limitation on easily interpreting variable importance, a result of the one-hot encoding discussed at the end of the previous section, stands. Below, one of the decision trees used in the random forest regressor is visualized. Only the first few layers are displayed and discussed, because only the most important variables warrant discussion.
# create sample tree figure
fig = plt.figure(figsize=(15, 10))
plot_tree(rf.estimators_[0], feature_names=feature_list,
          class_names=['rejected', 'originated'], filled=True,
          impurity=True, rounded=True, max_depth=2)
fig.savefig('tree.png')
The most important variable is hoepa_status, defined in the schema as “Whether the covered loan is a high-cost mortgage” (FFIEC, 2019). Recalling the data cleaning section, this variable was kept in favor of other loan-cost variables due to its high density. The decision tree indicates that a HOEPA status other than ‘3’ was strongly associated with applications that were originated. HOEPA status 3, per the schema, indicates a loan that is exempt from this reporting requirement. Guidelines (National Credit Union Administration, 2014) indicate that HOEPA-exempt mortgages are associated with reverse mortgages, initial construction, and business-purpose properties. This kind of result suggests that those types of loan applications are more highly scrutinized.
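This reading can be checked directly with a crosstab over the pre-encoding frame (a quick verification sketch, not part of the original pipeline; it uses de because hoepa_status was one-hot encoded out of de_r):

# share of each binary outcome by HOEPA status, before one-hot encoding
# (status 3 = exempt, per the schema discussion above)
sub = de.loc[de['action_taken'].isin([1, 3])]
print(pd.crosstab(sub['hoepa_status'], sub['action_taken'], normalize='index'))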
The course of action recommended at this time is to operate under the assumption that standard mortgage applications for residential, non-business buyers are very likely to be accepted.
I recommend directing efforts toward further pruning the dataset with the goal of developing classifiers that can predict loan acceptance given a loan type. i) Performing a similar analysis with the data subset according to the HOEPA flag may lead to more granular results for nonstandard mortgages or for borrowers concerned with residential purchases; a starting point is sketched below. ii) Modeling this data with the inclusion of location variables may give insight into geographic trends in loan acceptance.
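As a starting point for recommendation (i), the subsetting could occur on the raw Delaware frame before the cleaning pipeline runs. A sketch, assuming hoepa_status still carries the schema’s integer codes at that stage:

# split the raw frame by HOEPA flag before running the preparation steps
de_standard = de[de['hoepa_status'] != 3]  # loans subject to HOEPA reporting
de_exempt = de[de['hoepa_status'] == 3]    # exempt loans (e.g., reverse mortgages, construction, business purpose)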
References
Office of the Law Revision Counsel. (December 31, 1975). Home Mortgage Disclosure (12 U.S.C., Chapter 29). Retrieved October 11, 2022, from https://uscode.house.gov/view.xhtml?path=/prelim@title12/chapter29&edition=prelim
Nasteski, V. (December 2017). An overview of the supervised machine learning methods. Retrieved October 11, 2022, from https://www.researchgate.net/publication/328146111_An_overview_of_the_supervised_machine_learning_methods
Tumuleru, P., et al. (February 23, 2022). Retrieved October 12, 2022, from https://ieeexplore.ieee.org/document/9742800
Cansiz, S. (March 7, 2021). Have you ever evaluated your model in this way? Retrieved October 13, 2022, from https://towardsdatascience.com/have-you-ever-evaluated-your-model-in-this-way-a6a599a2f89c
FFIEC. (2020). Modified Loan/Application Register, Snapshot National Loan Level Dataset. Retrieved October 13, 2022, from https://ffiec.cfpb.gov/data-publication/modified-lar/2021
FFIEC. (2019). LAR data fields. Retrieved October 21, 2022, from https://ffiec.cfpb.gov/documentation/2019/lar-data-fields/
Ravi, R. (January 11, 2019). One-hot encoding is making your tree-based ensembles worse, here's why. Retrieved October 24, 2022, from https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769
National Credit Union Administration. (January 10, 2014). Dodd-Frank Act HOEPA loans summary. Retrieved October 24, 2022, from https://www.ncua.gov/files/publications/regulation-supervision/RA2013-09-Attachment-NCUA-Dodd-Frank%20Act-HOEPA-Loans-Summary.pdf