Group 5 DSEB64A Report
I. Introduction
1. Background
Amid the rapid development of the financial industry, especially consumer credit, credit risk
assessment plays an extremely important role. Home Credit, a global financial company, has
developed a collateral-free lending model to provide quick and accessible loans to customers.
However, one of the biggest challenges it faces is credit risk management: determining the
likelihood that customers will not repay their loans on time. Improving risk prediction not only
helps the company minimize losses but also helps it expand its reach to customers, especially
those with unclear credit histories.
2. Problem Statement
Currently, Home Credit is facing a situation where some customers are unable to repay their
loans, leading to financial losses and affecting the credit system. Building a model to accurately
predict which customers are at risk of default is an urgent problem. The challenge lies in the fact that
credit data is often complex, incomplete or contains many noisy factors. How to develop an accurate,
reliable machine learning model that can integrate data from many different sources and provide
effective forecasts is the central question we aim at.
3. Dataset
Key target: Build a machine learning model to predict the probability of customer default based on
personal information, credit history, and financial behavior.
Key steps: (1) Exploratory Data Analysis (EDA), (2) Data Preparation, (3) Feature Engineering, (4)
Model Training and Optimization, (5) Model Evaluation, (6) Exporting Results
2.2.1. bureau_balance
● Label Encoding: Encodes STATUS ordinally (0 for a closed contract up to 7 for the worst
delinquency status), so the codes clearly indicate levels of risk or severity (see the sketch after
this list).
● Handling Missing Data: Fills missing values in aggregated data with zeros.
○ Missing values in the dataset may imply no recorded information (e.g., no activity
associated with a specific loan during that period).
○ Assigning 0 can be interpreted as "no risk" or "no recorded activity."
○ Using mean or median might distort the meaning of the STATUS variable, especially
when missing values convey specific context.
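A minimal sketch of these two steps in pandas; the exact ordinal mapping is an assumption based on the STATUS codes in bureau_balance (C = closed, X = unknown, 0-5 = days-past-due buckets):

```python
import pandas as pd

bureau_balance = pd.read_csv("bureau_balance.csv")

# Assumed ordinal mapping: 0 = closed/clean through 7 = worst delinquency.
status_map = {"C": 0, "X": 1, "0": 2, "1": 3, "2": 4, "3": 5, "4": 6, "5": 7}
bureau_balance["STATUS"] = bureau_balance["STATUS"].map(status_map)

# Aggregate per loan; missing aggregates are read as "no recorded activity".
agg = bureau_balance.groupby("SK_ID_BUREAU")["STATUS"].agg(["mean", "max"]).fillna(0)
```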
2.2.2. bureau
● Filtering out anomalous loans older than 50 years: loans recorded as opened more than 50
years ago are uncommon and financially implausible, so they are removed as data errors.
● Handling erroneous DAYS_* values: Fixes and replaces incorrect or inconsistent values in
date-related columns.
● One-Hot Encoding: Converts categorical variables (e.g., CREDIT_ACTIVE) into numerical
representations for model compatibility.
● Transforming CREDIT_ACTIVE Column: credit statuses are encoded by increasing risk (see the
sketch after this list):
○ Closed: 0 (completed loans, typically safer).
○ Active: 1 (active loans, potentially carrying credit risk).
○ Sold: 2 (loans sold on, indicating certain financial risks).
○ Bad debt: 3 (bad loans, representing the highest risk).
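A minimal sketch of this cleaning step, assuming the standard bureau columns; since the report applies both an ordinal and a one-hot encoding to CREDIT_ACTIVE, the sketch keeps the ordinal code in a new column before one-hot encoding the original:

```python
import pandas as pd

bureau = pd.read_csv("bureau.csv")

# Drop loans opened more than 50 years ago (DAYS_CREDIT counts days
# before the current application, so values are negative).
bureau = bureau[bureau["DAYS_CREDIT"] > -50 * 365]

# Ordinal risk code for CREDIT_ACTIVE, kept alongside the one-hot columns.
credit_active_map = {"Closed": 0, "Active": 1, "Sold": 2, "Bad debt": 3}
bureau["CREDIT_ACTIVE_CODE"] = bureau["CREDIT_ACTIVE"].map(credit_active_map)

# One-hot encode the original categorical column.
bureau = pd.get_dummies(bureau, columns=["CREDIT_ACTIVE"])
```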
2.2.3. previous_application
2.2.4. installments_payments
Sorted Columns:
2.2.5. POS_CASH_balance
● Adjusting MONTHS_BALANCE: converts MONTHS_BALANCE to positive values for easier
manipulation.
● Sorting: sorts the dataset by SK_ID_PREV and MONTHS_BALANCE in descending order
to facilitate rolling computations such as exponential moving averages (see the sketch below).
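A minimal sketch of this preparation in pandas; the EWMA column and span are illustrative assumptions:

```python
import pandas as pd

pos = pd.read_csv("POS_CASH_balance.csv")

# Flip MONTHS_BALANCE to positive values, then order each loan's history
# so that rolling computations run over a consistent timeline.
pos["MONTHS_BALANCE"] = pos["MONTHS_BALANCE"].abs()
pos = pos.sort_values(["SK_ID_PREV", "MONTHS_BALANCE"], ascending=[True, False])

# Example rolling statistic: EWMA of the remaining-installment count
# (span=3 is an illustrative choice).
pos["CNT_INSTALMENT_FUTURE_EWMA"] = (
    pos.groupby("SK_ID_PREV")["CNT_INSTALMENT_FUTURE"]
       .transform(lambda s: s.ewm(span=3).mean())
)
```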
2.2.6. credit_card_balance
● Remove erroneous values from AMT_PAYMENT_CURRENT.
● Compute the total number of missing values per row (MISSING_VALS_TOTAL_CC).
● Make MONTHS_BALANCE positive and sort the data chronologically.
● Engineer domain-specific features and compute rolling Exponential Weighted Moving
Averages (EWMA) for certain features over time (see the sketch below).
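A minimal sketch of these steps; the outlier threshold for AMT_PAYMENT_CURRENT and the EWMA target column are assumptions, as the report does not specify them:

```python
import numpy as np
import pandas as pd

cc = pd.read_csv("credit_card_balance.csv")

# Count missing values per row before any cleaning.
cc["MISSING_VALS_TOTAL_CC"] = cc.isna().sum(axis=1)

# Treat implausibly large payments as missing (the threshold is an assumption).
cc.loc[cc["AMT_PAYMENT_CURRENT"] > 4_000_000, "AMT_PAYMENT_CURRENT"] = np.nan

# Positive months and chronological order, then an EWMA per card.
cc["MONTHS_BALANCE"] = cc["MONTHS_BALANCE"].abs()
cc = cc.sort_values(["SK_ID_PREV", "MONTHS_BALANCE"], ascending=[True, False])
cc["AMT_BALANCE_EWMA"] = (
    cc.groupby("SK_ID_PREV")["AMT_BALANCE"]
      .transform(lambda s: s.ewm(span=3).mean())
)
```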
2.2.7. application_train & application_test
● POS_NEW_IS_CREDIT_NOT_COMPLETED_ON_TIME: marks whether the loan was completed
on time (see the sketch below):
○ 1: if CNT_INSTALMENT_FUTURE_MIN == 0 and the contract status is not Completed.
○ 0: if the contract was completed on time.
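A minimal sketch of how this flag can be derived from POS_CASH_balance; the aggregate column names follow the text, but the construction itself is an assumption:

```python
import pandas as pd

pos = pd.read_csv("POS_CASH_balance.csv")
pos = pos.sort_values(["SK_ID_PREV", "MONTHS_BALANCE"])

# One row per previous loan: minimum remaining installments and final status.
pos_agg = pos.groupby("SK_ID_PREV").agg(
    CNT_INSTALMENT_FUTURE_MIN=("CNT_INSTALMENT_FUTURE", "min"),
    NAME_CONTRACT_STATUS_LAST=("NAME_CONTRACT_STATUS", "last"),
)

# 1 = no installments left but the contract never reached "Completed".
pos_agg["POS_NEW_IS_CREDIT_NOT_COMPLETED_ON_TIME"] = (
    (pos_agg["CNT_INSTALMENT_FUTURE_MIN"] == 0)
    & (pos_agg["NAME_CONTRACT_STATUS_LAST"] != "Completed")
).astype(int)
```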
4.4. Aggregate for Previous Application: update the list of aggregation operations
(agg_list_previous_application) to integrate with the previous_application dataset, as sketched below.
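A minimal sketch of such an aggregation list; the columns and operations shown are an illustrative subset, not the report's exact list:

```python
import pandas as pd

prev = pd.read_csv("previous_application.csv")

# Column -> aggregation operations (an illustrative subset, not the full list).
agg_list_previous_application = {
    "AMT_APPLICATION": ["sum", "mean", "max", "min"],
    "AMT_CREDIT": ["sum", "mean", "max", "min"],
    "DAYS_DECISION": ["mean", "min"],
}

prev_agg = prev.groupby("SK_ID_CURR").agg(agg_list_previous_application)
prev_agg.columns = ["PREV_" + "_".join(col).upper() for col in prev_agg.columns]
```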
5. previous_application
● The DAYS_DECISION variable was encoded as 1 if the decision was made within the past
year and 0 otherwise.
● The NAME_TYPE_SUITE variable was classified into two groups: alone (applying alone)
and not_alone (applying with others).
● Values in the NAME_GOODS_CATEGORY and NAME_SELLER_INDUSTRY variables
that do not belong to the main categories were grouped into an others category.
● Categorical variables were encoded using One-Hot Encoding, and aggregated using
operations like sum, mean, max, and min to create composite features.
● The LOAN_RATE feature was calculated as the ratio between the requested loan amount
(AMT_APPLICATION) and the approved loan amount (AMT_CREDIT).
● New features such as NEW_CHURN_PREV were created to indicate whether the loan is
overdue, and NEW_INSURANCE replaced the original NFLAG_INSURED_ON_APPROVAL
variable, using the difference between AMT_CREDIT and AMT_GOODS_PRICE.
● The PREVIOUS_TERM (loan term, calculated as the ratio between AMT_CREDIT and
AMT_ANNUITY) and PREVIOUS_AMT_TO_APPLICATION (ratio of approved loan to
requested loan) were created.
● Time difference features like PREVIOUS_DAYSLASTDUE1ST_DAYSFIRSTDUE_DIFF
were also introduced.
● Interest rate features (INTEREST, INTEREST_RATE) were calculated from the loan amount
and repayment schedule (see the sketch after this list).
● Combined the application_train and application_test datasets to create the main table df, which
includes basic customer information such as age, income, and loan contract type.
● Aggregated features from all the auxiliary datasets (min, max, mean, etc.).
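A minimal sketch of the ratio and interest features described above; the interest formula is illustrative, since the report does not give its exact form:

```python
import numpy as np
import pandas as pd

prev = pd.read_csv("previous_application.csv")

# Requested vs. approved amount, guarding against division by zero.
prev["LOAN_RATE"] = prev["AMT_APPLICATION"] / prev["AMT_CREDIT"].replace(0, np.nan)
prev["PREVIOUS_AMT_TO_APPLICATION"] = (
    prev["AMT_CREDIT"] / prev["AMT_APPLICATION"].replace(0, np.nan)
)

# Approximate term as the number of annuity payments.
prev["PREVIOUS_TERM"] = prev["AMT_CREDIT"] / prev["AMT_ANNUITY"]

# Total interest implied by the scheduled payments (illustrative formula).
prev["INTEREST"] = prev["CNT_PAYMENT"] * prev["AMT_ANNUITY"] - prev["AMT_CREDIT"]
prev["INTEREST_RATE"] = prev["INTEREST"] / prev["AMT_CREDIT"].replace(0, np.nan)

# Time gap between the first due date and the first version of the last due date.
prev["PREVIOUS_DAYSLASTDUE1ST_DAYSFIRSTDUE_DIFF"] = (
    prev["DAYS_LAST_DUE_1ST_VERSION"] - prev["DAYS_FIRST_DUE"]
)
```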
The tables were then merged sequentially, as sketched below. Data from installments_payments
was merged with previous_application, followed by data from POS_CASH_balance. The aggregated
features were created for each customer (SK_ID_CURR) and stored in the df_prev_ins_pos_agg
table. This table was then merged with the main df table along with data from credit_card_balance
and bureau_balance, resulting in the final comprehensive table all_data.
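A minimal sketch of this merge chain, using simple numeric means as stand-ins for the actual aggregate tables:

```python
import pandas as pd

prev = pd.read_csv("previous_application.csv")
ins = pd.read_csv("installments_payments.csv")
pos = pd.read_csv("POS_CASH_balance.csv")

# Per-previous-loan aggregates (illustrative: numeric means only).
ins_agg = (
    ins.drop(columns="SK_ID_CURR")
       .groupby("SK_ID_PREV").mean(numeric_only=True)
       .add_prefix("INS_").reset_index()
)
pos_agg = (
    pos.drop(columns="SK_ID_CURR")
       .groupby("SK_ID_PREV").mean(numeric_only=True)
       .add_prefix("POS_").reset_index()
)

# installments -> previous_application -> POS, then roll up per customer.
merged = (
    prev.merge(ins_agg, on="SK_ID_PREV", how="left")
        .merge(pos_agg, on="SK_ID_PREV", how="left")
)
df_prev_ins_pos_agg = merged.groupby("SK_ID_CURR").mean(numeric_only=True)
# df_prev_ins_pos_agg is then merged into df together with the
# credit_card_balance and bureau aggregates to form all_data.
```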
We identified duplicates using combinations of key features and assigned unique cluster IDs.
Based on these clusters, we created lag_TARGET and lead_TARGET features, along with aggregated
metrics. The main dataset, all_data, was enhanced with these features and external data sources,
ensuring a comprehensive training set.
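A minimal sketch of the lag/lead construction; the key columns used to detect duplicates are hypothetical placeholders:

```python
import pandas as pd

# all_data: the combined table described above (assumed already built).
# Hypothetical key columns used to group near-duplicate applications.
key_cols = ["AMT_CREDIT", "AMT_ANNUITY", "DAYS_BIRTH"]
all_data["cluster_id"] = all_data.groupby(key_cols).ngroup()

# Within each cluster, look at the neighbours' TARGET values.
all_data = all_data.sort_values(["cluster_id", "SK_ID_CURR"])
grouped = all_data.groupby("cluster_id")["TARGET"]
all_data["lag_TARGET"] = grouped.shift(1)    # TARGET of the previous duplicate
all_data["lead_TARGET"] = grouped.shift(-1)  # TARGET of the next duplicate
```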
V. Models
Model Overview
Logistic regression, a powerful and interpretable classification model, is used with a pipeline
structure to seamlessly integrate preprocessing and classification steps. It is trained with a maximum
of 1000 iterations and employs cross-validation to ensure generalizability and prevent overfitting.
Stratified 5-fold cross-validation ensures balanced evaluation by preserving the proportions of
default and non-default cases in each fold. The ROC-AUC metric is used to compute the Gini
coefficient, which measures model performance:
Gini = 2 × AUC − 1
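A minimal sketch of this setup in scikit-learn; the imputation and scaling steps inside the pipeline are assumptions, since the report does not list its preprocessing steps:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing and classifier in one pipeline (max_iter=1000 as in the text;
# the imputation and scaling steps are assumptions).
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# X, y: the prepared feature matrix and TARGET column from all_data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
gini = 2 * auc_scores.mean() - 1  # Gini = 2 * AUC - 1
```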
The logistic regression model is compared to alternative classifiers (e.g., decision trees) using the
ROC-AUC score. If it outperforms the others, it is selected and retrained on the entire training
dataset to produce the final predictions.
The logistic regression model provides an efficient, interpretable, and stable solution to credit
default risk, with clear feature insights, scalable preprocessing, and consistent performance via
cross-validation. It is reliable for real-world deployment and serves as a benchmark for testing
more advanced algorithms.