MBS Report
INTRODUCTION
MORTGAGE-BACKED SECURITIES (MBS)
PREPAYMENT RISK
DATA
PROBLEM STATEMENT
DATA PREPROCESSING
Analyzing the data
Data Exploration
Missing value treatment
Handling outliers
Creation of target variable
Transforming some variables into appropriate format
EDA (Exploratory Data Analysis)
Univariate analysis
Bivariate analysis
Data Distribution Analysis
Correlation Analysis
Feature Importance Analysis
Data Visualization
Correlation Plot of Numerical Variables
Feature Engineering
Feature Creation
Feature Selection
Feature Extraction
Principal Components Analysis
Model selection and feature comparison
Feature selected output accuracies
Model Building
1. Binary Classifier
2. Multi-class Classifier
Types of ML Classification models
STEPS OF IMPLEMENTATION
Classification report
MBS prepayment risk analysis
Model Building
Pipeline
Process of Pipeline
Creation of Pipeline for Classification Model
Creation of Pipeline for Regression Model
Combination of Both Models using Pickle
Deployment Process
Conclusion
INTRODUCTION
MORTGAGE-BACKED SECURITIES (MBS):
Mortgage-backed securities (MBS) are financial instruments that represent a form of
asset-backed security. These securities are created by bundling a group of individual mortgage
loans, such as residential mortgages, into a single investment product. The cash flows from the
underlying mortgages, including principal and interest payments made by homeowners, are
used to generate income for investors who purchase the MBS.
The main benefit of mortgage-backed securities is that they allow financial institutions to transfer the risk associated with mortgage loans to investors. This, in turn, provides additional liquidity to mortgage lenders, allowing them to issue more mortgages and potentially lower interest rates for borrowers.
PREPAYMENT RISK:
Prepayment risk refers to the risk that the underlying loans in a mortgage-backed
security (MBS) or other asset-backed security will be paid off earlier than expected. When
borrowers repay their loans ahead of their scheduled payment dates, it can have an impact on
the cash flows received by the investors holding the MBS.
When borrowers prepay their mortgages, they effectively cut short the interest payments investors would have received if the loans had continued according to their original schedules. This can lead to a decrease in the overall interest income received by the investors.
Prepayments can also create reinvestment risk for investors. As the principal from
prepaid loans is returned to investors, they may have to reinvest that money in new securities
or assets. However, if interest rates have fallen since the original investment, investors may
have to settle for lower yields on their reinvestments.
Prepayment risk is influenced by various factors, including changes in interest rates, housing
market conditions, and borrower behaviour. When interest rates drop, borrowers are more
likely to refinance their mortgages to take advantage of lower rates, leading to higher
prepayment rates. Conversely, when interest rates rise, prepayments tend to decrease as
borrowers are less incentivized to refinance.
DATA
The data was obtained from the Freddie Mac official portal for home loans.
The home loans dataset has dimensions 291,452 x 28, i.e., 291,452 data rows and 28 columns. A selection of the columns is described below.
19. CreditRange: A credit score is a prediction of your credit behavior, such as how likely you are to pay a loan back on time, based on information from your credit reports.
20. Monthly_income: Your gross monthly income includes all sources of money that you
receive over the course of a month, including but not limited to regular wages, earnings
from side jobs and investments. Lenders consider your gross monthly income to determine
your creditworthiness and ability to repay loans.
23. MIP: Federal Housing Administration (FHA) mortgage loans are designed to help
people who might have trouble getting other types of mortgage loans to buy a home. And if a
homebuyer uses an FHA-backed loan, they're required to pay a mortgage insurance premium
(MIP).
24. OCLTV: The original combined loan-to-value ratio. Loan-to-value (LTV) is calculated simply by taking the loan amount and dividing it by the value of the asset or collateral being borrowed against; the combined ratio also counts any secondary liens at origination.
25. Debt-to-income ratio (DTI): Your debt-to-income ratio is all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow. Different loan products and lenders will have different DTI limits.
26. OrigUPB: The original unpaid principal balance (UPB) of the loan, i.e., the amount borrowed at origination. Loan origination is the process by which a borrower applies for a new loan and a lender processes that application, generally covering all the steps from taking a loan application up to disbursal of funds (or declining the application).
28. LTV_Range: The loan-to-value (LTV) ratio is an assessment of lending risk that financial institutions and other lenders examine before approving a mortgage. Typically, loan assessments with high LTV ratios are considered higher-risk loans. Therefore, if the mortgage is approved, the loan has a higher interest rate.
29. MonthsDelinquent: A delinquent status means that you are behind in your payments. The
length of time varies by lender and the type of debt, but this period generally falls anywhere
between 30 to 90 days.
30. Prefix: The designation assigned by the issuer denoting the type of the loans and the
security
31. Original Interest Rate: The interest rate of the loan as stated on the note at the time the
loan was originated or modified.
PROBLEM STATEMENT
Prepayment risk refers to the uncertainty of how likely borrowers are to pay off their loans
early. This risk affects investors in mortgage-backed securities who receive cash flows from
these loans. If borrowers repay their loans ahead of schedule, it can reduce the interest income
received by investors and create challenges in finding suitable reinvestment opportunities.
DATA PREPROCESSING:
Analyzing the data
Understanding the data in machine learning involves thoroughly examining the dataset to gain insights into its structure, distribution, and characteristics.
Data Exploration:
Check the size of the dataset: Understand the number of samples (instances) and features
(variables) present in the dataset.
Data Types: Identify the data types of each feature (e.g., numerical, categorical, text, datetime)
to determine how to handle and preprocess them.
Data exploration and visualization are valuable tools for gaining insights and guiding your
choices throughout the machine learning process.
Missing value treatment
In this dataset, missing entries are marked with the placeholder "X", so I started by replacing the "X" values with nulls and then handled the missing values. Missing values in categorical features were replaced with the mode (most frequent value) of that feature, and missing values in numerical features were replaced with the mean (or median) of the available values. This method is simple and effective.
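A minimal sketch of this imputation step, assuming the loans data is loaded in a pandas DataFrame named df and that "X" is the missing-value placeholder described above:

import numpy as np

# Replace the "X" placeholder with NaN so pandas treats it as missing.
df = df.replace("X", np.nan)

# Categorical features: impute with the mode (most frequent value).
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numerical features: impute with the mean of the available values.
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].mean())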
Handling outliers
Handling outliers is an important step in data pre-processing, as outliers can significantly affect a model's performance and accuracy. Outliers are data points that deviate significantly from the majority of the data and may result from errors in data collection or represent rare events.
Capping/Clipping: Cap or clip the extreme values by setting a predefined upper and lower
threshold. Data points above or below these thresholds are replaced with the threshold value.
We used capping method to control the outliers.
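The report does not state the exact thresholds, so the sketch below assumes the common IQR rule (Tukey fences) for the caps:

def cap_outliers(series, k=1.5):
    # Tukey's fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are clipped.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

for col in df.select_dtypes(include="number").columns:
    df[col] = cap_outliers(df[col])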
Creation of target variable
As per the problem statement, this is a classification task, so we have to create a target variable for binary classification. We chose the loan status as the target variable and transformed it into binary form with the help of the default date feature.
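A minimal sketch of this transformation; the default-date column name (DefaultDate) is a hypothetical stand-in for the actual feature:

# Loans that never defaulted have no default date; label them 0, others 1.
df["EverDelinquent"] = df["DefaultDate"].notna().astype(int)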
Transforming some variables into appropriate format
Many features in the data are not in an appropriate format, such as credit score, DTI, etc., so we converted these features into categorical columns. Later we applied encoding techniques for model preparation.
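An illustrative sketch of this binning step; the cut points and labels below are assumptions, not the report's exact ranges:

import pandas as pd

# Bin continuous scores into range categories (cut points are illustrative).
df["CreditRange"] = pd.cut(df["CreditScore"],
                           bins=[0, 650, 700, 750, 850],
                           labels=["Poor", "Fair", "Good", "Excellent"])
df["LTV_Range"] = pd.cut(df["LTV"], bins=[0, 25, 50, 75, 100],
                         labels=["Low", "Medium", "High", "Very High"])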
EDA (Exploratory Data Analysis)
EDA is a critical step in the machine learning pipeline. It involves the process of visually and
statistically examining data to understand its structure, relationships, patterns, and
characteristics. The primary goal of EDA is to gain insights into the data, identify potential
issues, and make informed decisions on how to pre-process and model the data effectively.
Depending on the number of columns we are analysing we can divide EDA into two types:
1. Univariate analysis
2. Bivariate analysis
For a thorough analysis we divided the columns into two groups: categorical columns and numerical columns.
Univariate analysis
Univariate analysis is analyzing a single feature at a time.
For every categorical column we applied univariate analysis using pie charts to see the value counts of each category. For every numerical column we applied univariate analysis using histograms, distribution plots, and boxplots to examine the distribution of values.
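A sketch of these univariate plots using matplotlib and seaborn (the plotting choices are assumptions consistent with the description above):

import matplotlib.pyplot as plt
import seaborn as sns

categorical_cols = df.select_dtypes(include="object").columns
numerical_cols = df.select_dtypes(include="number").columns

# Pie chart of value counts for each categorical column.
for col in categorical_cols:
    df[col].value_counts().plot(kind="pie", autopct="%1.1f%%", title=col)
    plt.ylabel("")
    plt.show()

# Histogram (with KDE) and boxplot side by side for each numerical column.
for col in numerical_cols:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=axes[0])
    sns.boxplot(x=df[col], ax=axes[1])
    fig.suptitle(col)
    plt.show()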
Bivariate analysis
The primary goal of bivariate analysis is to understand how the values of one variable relate to the values of another variable. This analysis is particularly useful for identifying correlations, patterns, trends, and associations between two variables.
We present visualizations that explore the relationship between the target variable and the other features.
By performing bivariate analysis, data analysts can gain insights into how two variables interact with each other and understand their joint behavior. It is a crucial step in the exploratory data analysis (EDA) process and helps guide further analysis and modeling decisions.
Data Distribution Analysis:
The distribution of each feature in the dataset was examined to detect any anomalies or outliers.
This step ensures that the data is appropriate and representative for training machine learning
models.
Correlation Analysis:
The relationships between features in the dataset were studied to identify strong correlations. Detecting multicollinearity issues is crucial, as they can adversely impact the performance of machine learning algorithms.
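A sketch of this correlation analysis as a correlation plot of the numerical variables, assuming df is the preprocessed DataFrame:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numerical variables, visualized as a heatmap.
corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.title("Correlation Plot of Numerical Variables")
plt.show()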
Feature Engineering
Feature engineering refers to the process of using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling.
Feature Creation:
Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. New features are created by combining existing features using addition, subtraction, and ratios, and these new features have great flexibility.
After the preprocessing and exploratory data analysis (EDA) tasks, the data contained 34 columns, one of which serves as the target variable; a subset of these columns is required for feature engineering.
Feature Selection:
While developing the machine learning model, only a few variables in the dataset are useful
for building the model, and the rest features are either redundant or irrelevant. If we input the
dataset with all these redundant and irrelevant features, it may negatively impact and reduce
the overall performance and accuracy of the model. Hence it is very important to identify and
select the most appropriate features from the data and remove the irrelevant or less important
features, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features."
In this task, I first split the data into training and test sets, so that I could observe feature importance.
After this, I used correlation and a heat map to understand the relationships between variables and remove multicollinear variables.
This gives the correlations of the data in visual form, which makes them easier to interpret. Then I used a correlation function.
With the following function we can select highly correlated features; it flags the first feature of each pair that is correlated with any other feature.
After executing this function, I got 7 columns/features that are highly correlated; the threshold value I set was 0.9. These are the columns/features I don't want, because their correlation exceeds the threshold I set, so I dropped these columns.
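The function itself is not reproduced in the report; the sketch below is a common implementation consistent with the description (flag the first feature of each pair whose absolute correlation exceeds the threshold), assuming X_train and X_test are the splits created above:

def correlated_features(frame, threshold=0.9):
    # For each highly correlated pair, flag the first feature encountered.
    corr_matrix = frame.corr().abs()
    to_drop = set()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                to_drop.add(corr_matrix.columns[i])
    return to_drop

drop_cols = correlated_features(X_train.select_dtypes(include="number"), 0.9)
X_train = X_train.drop(columns=drop_cols)
X_test = X_test.drop(columns=drop_cols)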
After that, I took the remaining features and applied a sklearn.feature_selection method to get the most important features of the data, which yielded the 6 most important features.
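The report does not name the exact selector, so this sketch assumes SelectKBest with the ANOVA F-statistic (f_classif) on already-encoded numeric features, with k=6 to match the report:

from sklearn.feature_selection import SelectKBest, f_classif

# Score each remaining feature against the target and keep the top 6.
selector = SelectKBest(score_func=f_classif, k=6)
X_train_selected = selector.fit_transform(X_train, y_train)
selected_features = X_train.columns[selector.get_support()]
print(selected_features)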
Feature Extraction:
Feature extraction is an automated feature engineering process that generates new variables by
extracting them from the raw data. The main aim of this step is to reduce the volume of data so
that it can be easily used and managed for data modelling. Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis (PCA).
Principal Components Analysis
I took those 6 features for feature extraction, along with the target variable (prepayment). I performed PCA using the imported libraries, which produced the principal components; I also calculated the variance explained by those components, which came to around 90 percent, so I kept those principal components.
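A sketch of this extraction step, assuming X_selected holds the 6 selected features; PCA's n_components=0.90 keeps enough components to explain about 90 percent of the variance:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the selected features, then project onto principal components.
X_scaled = StandardScaler().fit_transform(X_selected)
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.sum())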
Model Building
1. Binary Classifier:
If the classification problem has only two possible outcomes, then it is called as Binary
Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
2. Multi-class Classifier:
If a classification problem has more than two outcomes, then it is called as Multi-class
Classifier.
Example: Classifications of types of crops, Classification of types of music.
Types of ML Classification models:
RANDOM FOREST:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique and can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
"Random Forest is a classifier that contains a number of decision trees on various subsets of
the given dataset and takes the average to improve the predictive accuracy of that dataset.
Assumptions:
1. There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
2. The predictions from each tree must have very low correlations.
Working:
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.
The working process can be explained in the steps below:
1. Select random K data points from the training set.
2. Build the decision trees associated with the selected data points (subsets).
3. Choose the number N of decision trees that you want to build.
4. For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
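A minimal sketch of training such a classifier on the prepared splits (the hyperparameters are illustrative, not the report's settings):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Fit an ensemble of decision trees and evaluate on the held-out test set.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))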
LOGISTIC REGRESSION:
Logistic Regression is a statistical algorithm used for binary classification problems, where the target
variable has two possible outcomes (e.g., yes/no, true/false, 0/1). Despite its name, logistic regression
is a classification algorithm and not a regression algorithm. It predicts the probability that an instance
belongs to a particular class.
Assumptions:
1. The target variable should be binary or dichotomous, meaning it should have only two
possible outcomes.
2. Independence of observations: Each data point should be independent of others.
3. Linearity: Logistic regression assumes a linear relationship between the independent
variables and the log-odds of the target variable.
Working:
Logistic regression uses a logistic function (also called the sigmoid function) to model the
relationship between the independent variables and the probability of the target variable
belonging to a particular class. The function's output ranges between 0 and 1, representing the
probability of the event occurring.
Model Training: Estimate the parameters of the logistic regression model using
optimization techniques like Maximum Likelihood Estimation (MLE) or
Gradient Descent. The goal is to find the best-fitting line that separates the two
classes.
Sigmoid Function: The logistic regression model applies the logistic (sigmoid) function to the linear combination of the input features and their corresponding weights. The sigmoid function maps the linear output to a value between 0 and 1.
S(z) = 1 / (1 + e^(-z))
where 'z' is the linear combination of features and weights.
# Predict test-set labels with the fitted logistic regression classifier.
Y_data_pred = Classify_model.predict(X_test)
print(Y_data_pred)
Accuracy of the model with the pipeline:
Advantages of the model:
LINEAR REGRESSION:
Linear Regression is a supervised statistical algorithm used for predicting continuous numeric
values. It establishes a linear relationship between the dependent variable (target) and one or
more independent variables (features) by fitting a straight line to the data points that best
represent the overall trend.
Assumptions:
1. Linearity: Linear regression assumes a linear relationship between the independent
variables and the dependent variable. The model tries to find the best-fitting line to the
data.
2. Independence of Errors: The errors (residuals) should be independent of each other. In
other words, the error term for one data point should not be influenced by the error term
of another data point.
3. Homoscedasticity: The variance of the errors should be constant across all levels of the
independent variables. This means that the spread of the residuals should be roughly
the same across the range of predicted values.
4. No Multicollinearity: The independent variables should not be highly correlated with
each other. Multicollinearity can lead to unstable coefficient estimates.
Working:
The working process of Linear Regression can be summarized as follows:
1. Data Preparation: Prepare the dataset by cleaning the data, handling missing values,
and encoding categorical variables if necessary.
2. Model Training: Linear Regression finds the best-fitting line that minimizes the sum of
squared differences between the predicted values and the actual values. This is usually
done using a technique called Ordinary Least Squares (OLS).
3. Linear Equation: The model represents the relationship between the dependent variable
'y' and the independent variables 'x1, x2, ..., xn' as a linear equation:
y = β0 + β1 * x1 + β2 * x2 + ... + βn * xn
where 'β0' is the y-intercept, 'β1, β2, ..., βn' are the coefficients (slopes)
for each independent variable, and 'x1, x2, ..., xn' are the corresponding
feature values.
4. Model Evaluation: Evaluate the performance of the model using various metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (coefficient of determination), etc.
5. Making Predictions: Once the model is trained and the coefficients are determined, it
can be used to make predictions on new data by plugging in the feature values into the
linear equation.
Linear Regression is a widely used algorithm in various fields such as economics, finance,
social sciences, and engineering for tasks like price prediction, demand forecasting, trend
analysis, and more. It can be extended to handle multiple regression (more than one
independent variable) and polynomial regression to capture non-linear relationships between
variables.
For the regression model we created the following variables, starting from the Equated Monthly Instalment (EMI):
1. monthly_payment
2. total_payment
3. interest_amount
4. monthly_income
5. monthly_rate
6. monthly_principal_amount
7. principal_amount_remaining
8. principal_amount_paid
9. prepayment
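These variables follow from the standard EMI formula, EMI = P * r * (1 + r)^n / ((1 + r)^n - 1), where r is the monthly interest rate. A sketch of the derivation; the input column names (OrigInterestRate, OrigLoanTerm, OrigUPB) are assumptions:

def emi(principal, annual_rate_pct, n_months):
    # EMI = P * r * (1 + r)^n / ((1 + r)^n - 1), with r the monthly rate.
    r = annual_rate_pct / 12 / 100
    factor = (1 + r) ** n_months
    return principal * r * factor / (factor - 1)

# Derived payment variables (column names are illustrative assumptions).
df["monthly_rate"] = df["OrigInterestRate"] / 12 / 100
df["monthly_payment"] = emi(df["OrigUPB"], df["OrigInterestRate"], df["OrigLoanTerm"])
df["total_payment"] = df["monthly_payment"] * df["OrigLoanTerm"]
df["interest_amount"] = df["total_payment"] - df["OrigUPB"]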
Model Building:
For the regression task we selected 9 independent variables which affect the target variable, with prepayment as the dependent variable.
Independent Variables:
DTI, PPM, NumBorrowers, ServicerName, EverDelinquent, IsFirstTimeHomebuyer, CreditRange, monthly_income, …
Dependent Variable:
prepayment
Steps:
1. Filling Missing values by using Simple Imputer for both categorical features and
numerical features
2. Converting Categorical Features into numerical by using One Hot Encoding
3. Standardization of numerical features by using Standard Scaler
4. Applying Linear Regression for prediction
Model:
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training split, then predict on the test split.
model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
Results:
The overall R-squared value for Linear Regression is 91%, with a testing R-squared of 0.92, a training mean squared error of 0.08, and a testing mean squared error of 0.08.
Pipeline
The execution of the workflow proceeds in a pipe-like manner. Scikit-learn, a powerful tool for machine learning, provides a feature for handling such pipes under the sklearn.pipeline module, called Pipeline. A machine learning pipeline can be created by putting together a sequence of steps involved in training a machine learning model.
It can be used to automate a machine learning workflow. The pipeline can involve pre-
processing, feature selection, classification/regression, and post-processing.
Process of Pipeline:
Data Preparation: The code starts by loading the dataset df, which includes relevant
features (IV) and the target variable ("EverDelinquent"). The independent variables
(IV) are stored in the X matrix, and the target variable is stored in the Y matrix.
Data Splitting: The data is split into training and testing sets using the train_test_split
function from sklearn. The training set (x_train and y_train) will be used to train the
models, and the testing set (x_test and y_test) will be used to evaluate the model's
performance.
Preprocessing Transformers: Two preprocessing transformers are defined - one for
numeric features and one for categorical features. For numeric features, missing values
are imputed using the median and then scaled using StandardScaler. For categorical
features, missing values are imputed using the most frequent value, and one-hot
encoding is applied using OneHotEncoder.
ColumnTransformer: A ColumnTransformer named preprocessor is created to apply the
appropriate preprocessing transformer to each set of features. It specifies that numeric
features should be processed using numeric_transformer, and categorical features should
be processed using categorical_transformer.
Multiple Pipelines for Multi-Label Classification: A list of pipelines is created,
where each pipeline contains the preprocessor defined in the ColumnTransformer and a
LogisticRegression classifier. A separate pipeline is created for each target variable (each column in y_train_reshaped), treating it as a binary classification task.
Fitting Pipelines to Training Data: The pipelines are then fitted to the training data
(x_train and y_train_reshaped) using a loop. Each pipeline is trained separately for each target variable (column) in Y.
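A sketch of the classification pipeline described above, assuming numeric_features and categorical_features are lists of the corresponding column names:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric features: median imputation, then standard scaling.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical features: most-frequent imputation, then one-hot encoding.
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# One pipeline per target column, each treated as a binary classification task.
pipelines = [
    Pipeline([("preprocessor", preprocessor),
              ("classifier", LogisticRegression(max_iter=1000))])
    for _ in range(y_train_reshaped.shape[1])
]
for i, pipe in enumerate(pipelines):
    pipe.fit(x_train, y_train_reshaped[:, i])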
1. Saving the Trained Model: The trained classification model, which is stored in
the pipeline variable, is saved to a pickle file named "classification.pkl." This
allows the model to be easily loaded and used later without retraining.
2. User Input for Loan Application Data: The code prompts the user to input loan
application data, including features such as "MSA," "MIP," "OCLTV," "DTI,"
"OrigUPB," "PropertyState," "MonthsDelinquent," "LTV_Range, "
"CreditRange," and "RepPayRange." The user can enter multiple sets of data
(rows) to make predictions for different loan applications.
3. Model Prediction: For each set of loan application data provided by the user, a
new DataFrame data is created to store the input data. The model is then used to
make predictions on each row of data using the predict method of the pipeline.
Loading the Model: The code uses the pickle.load() method to open the "Classification.pkl" file in
binary read mode ('rb'). The loaded model is stored in the loaded_model variable.
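A sketch of the save/load round trip; the report alternates between "classification.pkl" and "Classification.pkl", so one consistent name is assumed here:

import pickle

# Save the fitted classification pipeline so it can be reused without retraining.
with open("classification.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Later (or in another process): load it back for prediction.
with open("classification.pkl", "rb") as f:
    loaded_model = pickle.load(f)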
1. Data Preparation and Splitting: The code starts by preparing the feature matrix X,
which includes relevant independent variables (IV), and the target matrix Y, which
contains the target variable ("prepayment"). The data is then split into training and
testing sets using the train_test_split function from sklearn.
2. Preprocessing Transformers: Two preprocessing transformers are defined - one for
numeric features and one for categorical features. For numeric features, missing values
are imputed using the median and then scaled using StandardScaler. For categorical
features, missing values are imputed using the most frequent value, and one-hot
encoding is applied using OneHotEncoder.
3. ColumnTransformer: A ColumnTransformer named preprocessor is created to apply the
appropriate preprocessing transformer to each set of features. It specifies that numeric
features should be processed using numeric_transformer, and categorical features should
be processed using categorical_transformer.
4. Multiple Pipelines for Multi-Output Regression: A list of pipelines is created, where
each pipeline contains the preprocessor defined in the ColumnTransformer and a
MultiOutputRegressor wrapping a LinearRegression regressor. A separate pipeline is created
for each target variable ("prepayment").
5. Data Reshaping: The target variable y_train is reshaped using values.reshape(-1,
len(Y.columns)) to have two dimensions, necessary for multi-output regression.
6. Fitting Pipelines to Training Data: The pipelines are then fitted to the training data (x_train and y_train_reshaped) using a loop. Each pipeline is trained separately for each target variable (column) in Y.
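A sketch of this regression pipeline, reusing the preprocessor defined earlier; with the single "prepayment" target, the list of pipelines reduces to one:

from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline

# Reshape the target to 2-D, as required for multi-output regression.
y_train_reshaped = y_train.values.reshape(-1, len(Y.columns))
y_test_reshaped = y_test.values.reshape(-1, len(Y.columns))

regression_pipeline = Pipeline([
    ("preprocessor", preprocessor),  # same ColumnTransformer as above
    ("regressor", MultiOutputRegressor(LinearRegression())),
])
regression_pipeline.fit(x_train, y_train_reshaped)

# R-squared on the test split (averaged over outputs).
print(regression_pipeline.score(x_test, y_test_reshaped))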
It evaluates the multi-output regression models using R-squared scores, achieving an average R-squared
score of approximately 0.9135. The R-squared score for the "prepayment" target variable is 0.9135,
indicating a good fit of the model to the data.
It saves the trained multi-output regression model to a pickle file named "Regression.pkl." This allows
the model to be easily loaded and used later without retraining. The saved model can be deployed for
predicting prepayment amounts for mortgage-backed loans.
It allows the user to input loan application data and generates a DataFrame named data
containing the entered rows. The DataFrame includes features such as "DTI," "PPM,"
"NumBorrowers," "ServicerName," "EverDelinquent," "IsFirstTimeHomebuyer, "
"CreditRange," and "monthly_income." Each row represents a new loan application.
The code successfully captures user input for loan application data and creates a DataFrame
named data to store the entered rows. The user can input multiple loan applications, and the
DataFrame is updated accordingly. However, the code does not include any specific analysis
or model prediction on the entered data. For a complete report or analysis, the data DataFrame
should be used as input for the trained regression model to predict prepayment amounts for
each loan application. Additionally, the code could be enhanced to handle data validation and
provide a more user-friendly input interface.
Loading the Model: The provided code successfully loads a trained multi-output regression model
from the pickle file "Regression.pkl" into memory. The loaded model is stored in the variable
loaded_model, which can now be used for making predictions on new data.
1. Data Collection: The user is prompted to provide input for common features such as
"DTI" (Debt-to-Income ratio) and "CreditRange." Additionally, features specific to the
classification task ("MSA," "MIP," "OCLTV," etc.) and the regression task ("PPM,"
"NumBorrowers," "ServicerName," etc.) are also collected.
2. Classification Prediction: The loaded classification pipeline, classification_pipeline, is
used to predict the classification labels (Default or Non-Default) based on the user input
data for classification.
3. Regression Prediction: The loaded regression pipeline, regression_pipeline, is utilized to
predict the prepayment amounts based on the user input data for regression.
4. Combination of Predictions: The predictions from both models are combined into a single DataFrame named combined_predictions. This DataFrame consists of two columns: "Classification_Prediction_Defaulter", holding the classification predictions, and "Regression_Prediction_prepayment", holding the regression predictions.
5. Displaying Results: The combined_predictions DataFrame is displayed, presenting the combined predictions for each loan application. The "Classification_Prediction_Defaulter" column indicates whether the applicant is classified as a defaulter (1) or not (0), and the "Regression_Prediction_prepayment" column displays the predicted prepayment amounts for each application.
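A sketch of this combination step; classification_features and regression_features are hypothetical lists of the columns each pipeline was trained on, and data is the user-input DataFrame from the previous steps:

import pandas as pd

# Run both fitted pipelines on the user input and combine their outputs,
# one row per loan application.
class_pred = classification_pipeline.predict(data[classification_features])
reg_pred = regression_pipeline.predict(data[regression_features])

combined_predictions = pd.DataFrame({
    "Classification_Prediction_Defaulter": class_pred.ravel(),
    "Regression_Prediction_prepayment": reg_pred.ravel(),
})
print(combined_predictions)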
This approach of combining both classification and regression models through pipelines
enables comprehensive insights into loan applicant risk (default classification) and potential
prepayment amounts for mortgage-backed loans. However, it is important to validate these
predictions using real-world data and perform regular model maintenance to ensure the
accuracy and reliability of the predictions over time. Additionally, continuous monitoring of
model performance and incorporating new data for retraining is recommended to keep the
models up-to-date and relevant to changing market dynamics.
Deployment Process
Web Server and Platform: Choose a web server and deployment platform. You can use cloud
platforms like AWS, Heroku, or deploy locally on your server.
Dependencies: Install required dependencies on the server, including Python, Flask, scikit-learn, pandas, and numpy.
Upload Files: Transfer the app.py, index.html, static.css, classification_.pkl, regression_.pkl, and combined_predictions.pkl files to the server.
Run the App: Run the Flask app on the server using the command python app.py. The app
should now be live.
Access the App: Access the app through the server's IP address and the port specified in the
Flask app (default is usually port 5000).
User Interaction: Users can select the prediction type (classification or regression) and input
the required features on the app's webpage (index.html).
Results: The app sends the user's input to the Flask backend, where it loads the pre-trained
classification or regression models using pickle.
Prediction: The app uses the loaded models to predict the outcomes based on user input.
Display: The app displays the predictions (combined classification and regression) back to the
user.
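A minimal sketch of such an app.py; the prediction_type form field and the input-casting caveat are assumptions beyond what the report specifies:

import pickle

import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)

# Load both pre-trained pipelines once at startup (file names follow the report).
with open("classification_.pkl", "rb") as f:
    classification_model = pickle.load(f)
with open("regression_.pkl", "rb") as f:
    regression_model = pickle.load(f)

@app.route("/", methods=["GET", "POST"])
def predict():
    prediction = None
    if request.method == "POST":
        # One-row DataFrame from the submitted form fields; a real app would
        # cast numeric fields to float and validate the input.
        data = pd.DataFrame([request.form.to_dict()])
        if request.form.get("prediction_type") == "classification":
            prediction = classification_model.predict(data)[0]
        else:
            prediction = regression_model.predict(data)[0]
    return render_template("index.html", prediction=prediction)

if __name__ == "__main__":
    app.run(port=5000)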
Feedback and Monitoring: Collect user feedback on the app's performance and user
experience. Monitor the app for errors, performance issues, and resource usage.
Results:
Classification Prediction: The classification model predicts whether a loan applicant is likely
to be a defaulter or not based on features such as MSA, MIP, OCLTV, OrigUPB, PropertyState,
MonthsDelinquent, LTV_Range, and RepPayRange.
Regression Prediction: The regression model predicts the prepayment amount of a loan based on features like PPM, NumBorrowers, ServicerName, EverDelinquent, IsFirstTimeHomebuyer, and monthly_income.
Combined Prediction: The Flask app combines the results from both models into a single
DataFrame and sends it back to the user. The user can view the predictions for both the
classification and regression tasks.
User Experience: Users can interact with the app through a web interface, eliminating the need
for local model execution. The app provides quick and convenient predictions for loan default
and prepayment amounts.
Deployment Observations: Monitor the app's performance and usage patterns to optimize
resource allocation and handle increased traffic.
Model Accuracy: Assess the accuracy and performance of the pre-trained models on real user
data. Continuously improve the models based on user feedback and data updates.
Overall, the deployed Flask app allows users to get predictions for loan default and prepayment
amounts conveniently through a user-friendly web interface. It simplifies the process of model
access and provides valuable insights for loan applicants and financial institutions. Continuous
monitoring and feedback will help improve the app's accuracy and user experience over time.
Conclusion
In conclusion, the regression and random forest models performed well and exhibited strong
predictive capabilities in the given dataset containing 291452 rows and 44 columns. Both
models were able to effectively predict the target variable(s) based on the available features.
The regression model, being a traditional statistical modelling technique, provided accurate
predictions by analysing the relationships between the target variable and the independent
variables. It considered various factors such as borrower loan characteristics, first-time borrower status, credit scores, etc. to make predictions on prepayment risk. The regression model
demonstrated its ability to capture patterns and make accurate predictions based on these
factors.
The classification model, being a traditional statistical modeling technique, excelled at making accurate predictions by analyzing the relationships between the target variable and the independent variables. It considered various factors such as borrower loan characteristics, first-time borrower status, credit scores, etc. to classify instances into different predefined categories, such as low, medium, or high prepayment risk. The classification model effectively captured patterns and made reliable predictions based on these factors, allowing it to categorize new instances into the appropriate risk classes with a high degree of accuracy.
Both models proved their effectiveness in predicting the target variable based on the available dataset. Their ability to accurately forecast outcomes showcases their potential to assist in decision-making processes related to loan applications, risk assessment, and financial planning. However, it is crucial to validate and evaluate the models on unseen data and assess their performance metrics, such as accuracy, precision, recall, or mean squared error, to ensure their reliability in real-world scenarios.
Further analysis and model evaluation, including cross-validation, testing on external datasets,
and comparison with other models, can provide additional insights and strengthen the overall
conclusion regarding the performance and reliability of the regression and random forest
models.