
DATA VISUALIZATION REPORT - GROUP 7

I. Introduction
1. Background

In the context of the rapid development of the financial industry, especially consumer credit,
credit risk assessment plays an extremely important role. Home Credit, a global financial company,
has developed a collateral-free lending model to provide quick and accessible loans to customers.
However, one of the biggest challenges they face is credit risk management: determining the
likelihood that customers will be unable to repay their loans on time. Improving the ability to predict this risk
not only helps the company minimize losses but also lets it extend credit to more customers, especially
those with limited or unclear credit histories.

2. Problem Statement

Currently, Home Credit is facing a situation where some customers are unable to repay their
loans, leading to financial losses and affecting the credit system. Building a model to accurately
predict which customers are at risk of default is therefore an urgent problem. The challenge lies in the fact that
credit data is often complex, incomplete, or noisy. How to develop an accurate,
reliable machine learning model that integrates data from many different sources and provides
effective forecasts is the central question this project aims to answer.

3. Dataset

The data provided by Home Credit consists of the main application tables (application_train, application_test) together with supporting tables on credit bureau records (bureau, bureau_balance), previous applications (previous_application), installment payments (installments_payments), POS/cash balances (POS_CASH_balance), and credit card balances (credit_card_balance). The file sizes are summarized in Table 1 in Section III.

II. Project Design & Data Preparation

2.1. Project Design

Key target: Build a machine learning model to predict the probability of customer default based on
personal information, credit history, and financial behavior.
Key steps: (1) Exploratory Data Analysis - EDA, (2) Data Preparation, (3) Feature Engineering, (4)
Train and Optimize model, (5) Evaluate model, (6) Export results

2.2. Data Preparation

2.2.1. bureau_balance

● Label Encoding: Encodes STATUS ordinally (0 for Closed up to 7 for the worst delinquency status) so that
higher codes clearly indicate higher levels of risk or severity.
● Handling Missing Data: Fills missing values in aggregated data with zeros.
○ Missing values in the dataset may imply no recorded information (e.g., no activity
associated with a specific loan during that period).
○ Assigning 0 can be interpreted as "no risk" or "no recorded activity."
○ Using mean or median might distort the meaning of the STATUS variable, especially
when missing values convey specific context.
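A minimal pandas sketch of the encoding and zero-filling described above; the exact ordinal mapping and the file path are assumptions rather than the group's exact code.

```python
import pandas as pd

# Assumed ordinal mapping: closed/unknown statuses at the low end,
# deeper delinquency buckets (days past due) at the high end.
STATUS_MAP = {"C": 0, "X": 1, "0": 2, "1": 3, "2": 4, "3": 5, "4": 6, "5": 7}

bureau_balance = pd.read_csv("bureau_balance.csv")  # path is an assumption
bureau_balance["STATUS"] = bureau_balance["STATUS"].map(STATUS_MAP)

# Aggregate per credit record; values that are missing after aggregation
# are filled with 0, read as "no recorded activity" rather than imputed.
bb_agg = (bureau_balance
          .groupby("SK_ID_BUREAU")["STATUS"]
          .agg(["mean", "max", "count"])
          .add_prefix("STATUS_")
          .fillna(0)
          .reset_index())
```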

2.2.2. bureau

● Filtering out anomalous loans older than 50 years: Loan records implying durations of more than
50 years (some approaching 100 years) are uncommon and financially implausible.
● Handling erroneous DAYS_* values: Fixes and replaces incorrect or inconsistent values in
date-related columns.
● One-Hot Encoding: Converts categorical variables (e.g., CREDIT_ACTIVE) into numerical
representations for model compatibility.
● Transforming CREDIT_ACTIVE Column: Credit statuses are encoded as follows:
○ Active: 1 (Active loans, potentially carrying credit risk).
○ Closed: 0 (Completed loans, typically safer).
○ Sold: 2 (Loans sold, indicating certain financial risks).
○ Bad debt: 3 (Bad loans, representing the highest risk).
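The following sketch illustrates the bureau cleaning and encoding steps above; the 50-year threshold placement, the file path, and the choice of columns to one-hot encode are assumptions.

```python
import numpy as np
import pandas as pd

bureau = pd.read_csv("bureau.csv")  # path is an assumption

# Drop records implying implausible loan ages (more than ~50 years old).
bureau = bureau[bureau["DAYS_CREDIT"] > -50 * 365]

# Replace clearly inconsistent DAYS_* values with NaN (threshold is illustrative).
bureau.loc[bureau["DAYS_CREDIT_ENDDATE"] < -50 * 365, "DAYS_CREDIT_ENDDATE"] = np.nan

# Ordinal risk encoding of CREDIT_ACTIVE as described above.
credit_active_map = {"Closed": 0, "Active": 1, "Sold": 2, "Bad debt": 3}
bureau["CREDIT_ACTIVE_CODE"] = bureau["CREDIT_ACTIVE"].map(credit_active_map)

# One-hot encoding of remaining categorical columns for model compatibility.
bureau = pd.get_dummies(bureau, columns=["CREDIT_TYPE"], dummy_na=True)
```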

2.2.3. previous_application

● Cleaning Erroneous Values


○ Clean DAYS Fields
■ The value 365243.0 in DAYS fields is likely a placeholder for missing or
undefined values, as it represents an unrealistic time span (e.g., ~1000 years).
Replacing it with NaN ensures that these erroneous values do not skew
analyses or derived insights.
○ SELLERPLACE_AREA Outlier
■ Replacing the outlier value 4,000,000 with NaN helps maintain data integrity and
avoids misleading aggregate metrics.
● Filling NaNs in Categorical Columns
● Categorical Encoding: Custom encoding for columns like NAME_CONTRACT_STATUS
introduces ordinal relationships based on domain logic.
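A hedged sketch of the previous_application cleaning described above; the file path and the illustrative ordinal order for NAME_CONTRACT_STATUS are assumptions.

```python
import numpy as np
import pandas as pd

prev = pd.read_csv("previous_application.csv")  # path is an assumption

# 365243 days (~1000 years) is a placeholder for missing dates; treat it as NaN.
days_cols = [c for c in prev.columns if c.startswith("DAYS_")]
prev[days_cols] = prev[days_cols].replace(365243.0, np.nan)

# Replace the implausible SELLERPLACE_AREA outlier with NaN.
prev["SELLERPLACE_AREA"] = prev["SELLERPLACE_AREA"].replace(4000000, np.nan)

# Fill missing categorical values with an explicit placeholder category.
cat_cols = prev.select_dtypes(include="object").columns
prev[cat_cols] = prev[cat_cols].fillna("XNA")

# Ordinal, domain-driven encoding of the contract status (order is illustrative).
status_order = {"Approved": 0, "Unused offer": 1, "Canceled": 2, "Refused": 3}
prev["NAME_CONTRACT_STATUS"] = prev["NAME_CONTRACT_STATUS"].map(status_order)
```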

2.2.4. installments_payments

Sorted Columns:

● SK_ID_CURR: Customer identifier.
● SK_ID_PREV: Identifier for the customer's previous loan.
● NUM_INSTALMENT_NUMBER: Sequence number of the installment.

Sorting Order: Ascending order (ascending=True) for all three columns.

Handling Missing Values: A new column, MISSING_VALS_TOTAL_INSTAL, is added to the dataset to store the total count of missing values in each row.
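A short pandas sketch of the sorting and missing-value counting steps; the file path is an assumption.

```python
import pandas as pd

ins = pd.read_csv("installments_payments.csv")  # path is an assumption

# Sort by customer, previous loan, and installment sequence (all ascending).
ins = ins.sort_values(
    by=["SK_ID_CURR", "SK_ID_PREV", "NUM_INSTALMENT_NUMBER"],
    ascending=True,
)

# Row-wise count of missing values, kept as a feature rather than imputed away.
ins["MISSING_VALS_TOTAL_INSTAL"] = ins.isna().sum(axis=1)
```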

2.2.5. POS_CASH_balance
● Adjusts MONTHS_BALANCE to positive values for easier manipulation.
● Sorting: Sorts the dataset by SK_ID_PREV and MONTHS_BALANCE in descending order
to facilitate rolling computations (e.g., exponential moving averages).

2.2.6. credit_card_balance
● Remove erroneous values for AMT_PAYMENT_CURRENT.
● Compute total missing values (MISSING_VALS_TOTAL_CC).
● Make MONTHS_BALANCE positive and sort the data chronologically.
● Engineer domain-specific features such as:
○ AMT_DRAWING_SUM: Sum of drawing amounts.
○ BALANCE_LIMIT_RATIO: Balance-to-credit-limit ratio.
○ CNT_DRAWING_SUM: Total number of drawings.
○ Ratios such as MIN_PAYMENT_RATIO, differences (PAYMENT_MIN_DIFF), and others.
● Compute rolling Exponential Weighted Moving Averages (EWMA) for selected features over time.
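A sketch of the chronological sorting and EWMA computation outlined above; the file path, the EWMA span, and the zero-division handling are assumptions.

```python
import numpy as np
import pandas as pd

cc = pd.read_csv("credit_card_balance.csv")  # path is an assumption

# Make MONTHS_BALANCE positive and sort so each card's history runs oldest-first.
cc["MONTHS_BALANCE"] = cc["MONTHS_BALANCE"].abs()
cc = cc.sort_values(["SK_ID_PREV", "MONTHS_BALANCE"], ascending=[True, False])

# Illustrative domain feature (zero limits mapped to NaN to avoid division by zero).
cc["BALANCE_LIMIT_RATIO"] = (
    cc["AMT_BALANCE"] / cc["AMT_CREDIT_LIMIT_ACTUAL"].replace(0, np.nan)
)

# Exponentially weighted moving average of the balance over each card's history;
# the span value is an assumption, not taken from the report.
cc["AMT_BALANCE_EWMA"] = (
    cc.groupby("SK_ID_PREV")["AMT_BALANCE"]
      .transform(lambda s: s.ewm(span=6, min_periods=1).mean())
)
```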
2.2.7. application_train & application_test

● Remove the FLAG_DOCUMENT columns with almost identical values.
● Convert the DAYS_BIRTH values from days to years, and convert REGION_RATING_CLIENT and
REGION_RATING_CLIENT_W_CITY to the object data type.
● Replace the erroneous value 365243, as well as values greater than 30, with NaN.
● Remove rows where the value of CODE_GENDER is XNA.
● Fill missing values in categorical columns with XNA.
● Add a new column to count the total number of NaN values in each row.
● Filter numeric columns, excluding EXT_SOURCE and SK_ID_CURR.
● Perform predictions in the order of increasing number of missing values (EXT_SOURCE_2,
EXT_SOURCE_3, EXT_SOURCE_1).
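A simplified sketch of the missing-value prediction step for the EXT_SOURCE columns; the regressor choice, feature set, and file path are assumptions, and the real notebook may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

app = pd.read_csv("application_train.csv")  # path is an assumption

# Numeric predictor columns, excluding the imputation targets and the ID.
ext_cols = ["EXT_SOURCE_2", "EXT_SOURCE_3", "EXT_SOURCE_1"]  # increasing missingness
num_cols = [c for c in app.select_dtypes(include=np.number).columns
            if c not in ext_cols + ["SK_ID_CURR", "TARGET"]]

# Impute each EXT_SOURCE column by regression, starting with the least-missing one
# so that earlier imputations can support later ones.
for col in ext_cols:
    X = app[num_cols].fillna(0)
    known = app[col].notna()
    if known.all():
        continue
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X[known], app.loc[known, col])
    app.loc[~known, col] = model.predict(X[~known])
    num_cols.append(col)  # the imputed column can help predict the next one
```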

III. Exploratory Data Analysis


The dataset file names and the number of rows and columns in each file are summarized in Table 1

Table 1 - Dataset file names, rows, and columns


After importing the data, we can investigate it to assess data quality and identify any trends. By
plotting the distribution of the target variable, we can observe from Figure 1 that the dataset is
imbalanced, with the number of samples where the loan is repaid being more than 8 times the number
of samples where the loan is defaulted.
Figure 1 - Distribution of Target Variable: Loan repaid (0) ; Loan defaulted (1)
When checking for missing data, we find in Figure 2 that the main dataset (application_train)
contains many columns with high percentages of missing values. Similar checks are applied to the other datasets.

Figure 2 - Percentage of missing data in main dataset


Figure 3 presents the distributions of age and employment duration, which reveal key insights into
loan repayment behavior. The age distribution shows that loan repayment is more common among
individuals in their 30s, while defaults are higher among those in their 40s and 50s. In contrast, the
employment duration distribution is highly skewed for both groups. While the patterns are similar for
both repaid and defaulted loans, the default group shows a slight shift towards more extreme values,
likely due to data anomalies or outliers.

Figure 3 - Distribution of A) Age by Loan Repayment Status, B) Employment Duration by Loan Repayment Status
A comprehensive analysis is available in our Jupyter notebook. This analysis provides valuable
insights into which features require correction and which may be important for predicting loan
defaults, ultimately identifying key features for model training.
IV. Feature engineering
1. application_train:
1.1. Categorical data transformation
● Label Encoding: CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY; replace
'XNA' in CODE_GENDER with 'F' (the mode value).
● Group ORGANIZATION_TYPE into large groups such as Trade, Industry, Transport.
● Map education level NAME_EDUCATION_TYPE to order from 1 to 5.
1.2. Numeric data transformation
● Apply transformations: Cube root: applied to TOTALAREA_MODE.
● Log and Exponent: Normalize columns such as YEARS_BUILD_AVG,
COMMONAREA_AVG, REGION_POPULATION_RELATIVE.
● Remove outliers: Replace 365243 in DAYS_EMPLOYED with NaN.
Limit unusually large values (>10) in AMT_REQ_CREDIT_BUREAU_QRT.
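A brief sketch of the numeric transformations and outlier handling listed above (the log transform is shown as one example of the log/exponent normalization); the file path is an assumption.

```python
import numpy as np
import pandas as pd

app = pd.read_csv("application_train.csv")  # path is an assumption

# Cube root compresses the right tail of TOTALAREA_MODE while preserving zeros.
app["TOTALAREA_MODE"] = np.cbrt(app["TOTALAREA_MODE"])

# Log transform for skewed, non-negative columns (log1p tolerates zeros).
for col in ["YEARS_BUILD_AVG", "COMMONAREA_AVG", "REGION_POPULATION_RELATIVE"]:
    app[col] = np.log1p(app[col])

# Outlier handling: the DAYS_EMPLOYED placeholder becomes NaN, and unusually
# large bureau-enquiry counts are capped at the threshold given in the report.
app["DAYS_EMPLOYED"] = app["DAYS_EMPLOYED"].replace(365243, np.nan)
app["AMT_REQ_CREDIT_BUREAU_QRT"] = app["AMT_REQ_CREDIT_BUREAU_QRT"].clip(upper=10)
```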
1.3. Create new features
● Income and credit ratios: income_ratio, NEW_INCOME_CREDIT_PERC,
NEW_PAYMENT_RATE.
● Loan ratio and asset value: CREDIT_GOODS_PRICE_RATIO1,
CREDIT_GOODS_PRICE_RATIO2.
● Credit aggregation: TOTAL_ENQUIRIES_CREDIT_BUREAU and related percentages.
● Periodicize HOUR_APPR_PROCESS_START with sine/cosine transforms (see the sketch below).
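A minimal example of the sine/cosine (cyclical) encoding of the application hour; the new column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hour of day is cyclical: 23 and 0 are adjacent, so raw values mislead linear models.
# Mapping the hour onto a circle with sine/cosine preserves this adjacency.
df = pd.DataFrame({"HOUR_APPR_PROCESS_START": [0, 6, 12, 18, 23]})
df["HOUR_SIN"] = np.sin(2 * np.pi * df["HOUR_APPR_PROCESS_START"] / 24)
df["HOUR_COS"] = np.cos(2 * np.pi * df["HOUR_APPR_PROCESS_START"] / 24)
```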
1.4. Handle missing data
● Create MISSING_GRADINGS feature to measure the missingness in key columns.
1.5. Aggregate and remove
● Aggregating columns: EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3 are merged
into EXT_SOURCE_SUM, EXT_SOURCE_MEAN.
● Drop redundant columns (FLAG_DOCUMENT_*, AMT_REQ_CREDIT_BUREAU_DAY)
2. bureau and bureau_balance
2.1. Bureau Balance Data Processing
2.1.1. One-Hot Encode the STATUS column into binary variables such as STATUS_0, STATUS_1,...
2.1.2. Create Features from Bureau Balance
Aggregating data by grouping SK_ID_BUREAU:
● Count the number of transaction months (MONTHS_BALANCE_COUNT), average and
total number of transactions by credit status: STATUS_0_MEAN, STATUS_C_SUM,...
● Create a new aggregate feature: NEW_STATUS_SCORE: Weighted aggregation of credit
statuses (STATUS_1, STATUS_2,...) with corresponding exponents.
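An illustrative sketch of the bureau_balance aggregation and the weighted NEW_STATUS_SCORE; the report does not state the exact weights, so the powers of two below are assumptions, as is the file path.

```python
import pandas as pd

bureau_balance = pd.read_csv("bureau_balance.csv")  # path is an assumption

# One-hot encode STATUS into STATUS_0 ... STATUS_5, STATUS_C, STATUS_X.
bb = pd.get_dummies(bureau_balance, columns=["STATUS"], prefix="STATUS")

# Per-loan aggregation: months observed, plus mean/sum per status flag.
agg = bb.groupby("SK_ID_BUREAU").agg(
    MONTHS_BALANCE_COUNT=("MONTHS_BALANCE", "count"),
    STATUS_0_MEAN=("STATUS_0", "mean"),
    STATUS_C_SUM=("STATUS_C", "sum"),
    STATUS_1_SUM=("STATUS_1", "sum"),
    STATUS_2_SUM=("STATUS_2", "sum"),
    STATUS_3_SUM=("STATUS_3", "sum"),
    STATUS_4_SUM=("STATUS_4", "sum"),
    STATUS_5_SUM=("STATUS_5", "sum"),
)

# Weighted score: worse statuses get exponentially larger weights (weights assumed).
weights = {"STATUS_1_SUM": 2**1, "STATUS_2_SUM": 2**2, "STATUS_3_SUM": 2**3,
           "STATUS_4_SUM": 2**4, "STATUS_5_SUM": 2**5}
agg["NEW_STATUS_SCORE"] = sum(agg[c] * w for c, w in weights.items())
```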
2.2. Bureau Data Processing
2.2.1. Create Feature
● CREDIT_ENDDATE_BINARY: Determines whether the loan has a maturity date in the
future (1) or not (0).
● CREDIT_ENDDATE_PERCENTAGE: Average percentage of loans outstanding in each
customer group (SK_ID_CURR).
● Handle missing values based on appropriate assumptions:
- For example, AMT_CREDIT_SUM_LIMIT defaults to 0 if there is no credit limit.
2.2.2. Reduce the number of categories
● CREDIT_TYPE: Groups rare credit types into the common Rare group.
● CREDIT_ACTIVE: Merges Bad debt and Sold statuses into Active.
2.2.3. Create New Feature
● NEW_MONTHS_CREDIT: Average number of months of a loan based on DAYS_CREDIT
and DAYS_CREDIT_ENDDATE.
● CREDIT_AND_DATE_DIFFERENCE: Difference between expected date and actual date of
loan completion.
● RATIO_CREDIT_DAY_OVERDUE_TO_90_DAYS: Ratio of overdue days to 90 days.
● RATIO_AMT_CREDIT_SUM_OVERDUE_TO_CNT_CREDIT_PROLONG: Ratio of total
overdue amount to number of extensions.
2.3. Combining Bureau and Bureau Balance
2.3.1. Merging data: Bureau and bureau_balance data are combined based on SK_ID_BUREAU.
2.3.2. Removing unnecessary columns: Remove columns such as SK_ID_BUREAU,
CREDIT_CURRENCY as they do not provide distinguishing information.
2.3.3. One-Hot Encoding to categorical columns (CREDIT_TYPE, CREDIT_ACTIVE)
2.4. Create aggregate features
2.4.1. Aggregate on each customer
● Apply operations such as sum, mean, max, min to create aggregate features on each customer
(SK_ID_CURR):
- For example: Credit: AMT_CREDIT_SUM_SUM, AMT_CREDIT_SUM_DEBT_MEAN.
Time: DAYS_CREDIT_MIN, DAYS_CREDIT_ENDDATE_MAX.
2.4.2. Advanced aggregate features
● BB_NEW_AMT_CREDIT_SUM_RANGE: Difference between the largest and smallest loan
that the customer has received.
● BB_DEBT_CREDIT_RATIO: Ratio of total debt to total credit.
● BB_OVERDUE_DEBT_RATIO: Ratio of overdue debt to total debt.
2.4.3. Loan Classification
● The loans are divided into 2 groups:
- Active Loans: Apply the aggregations only to active loans.
- Closed Loans: Apply the same aggregations to closed loans.
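A condensed sketch of the per-customer aggregation with the active/closed split and the derived ratio features; the aggregation dictionary, prefixes, and file path are assumptions.

```python
import pandas as pd

bureau = pd.read_csv("bureau.csv")  # path is an assumption

num_aggs = {
    "AMT_CREDIT_SUM": ["sum", "max", "min"],
    "AMT_CREDIT_SUM_DEBT": ["mean", "sum"],
    "DAYS_CREDIT": ["min"],
    "DAYS_CREDIT_ENDDATE": ["max"],
}

def aggregate(df, prefix):
    # Flatten the (column, statistic) MultiIndex into prefixed feature names.
    out = df.groupby("SK_ID_CURR").agg(num_aggs)
    out.columns = [f"{prefix}{a}_{b}".upper() for a, b in out.columns]
    return out

# Overall, active-only, and closed-only aggregations per customer.
bb_all = aggregate(bureau, "BB_")
bb_active = aggregate(bureau[bureau["CREDIT_ACTIVE"] == "Active"], "BB_ACTIVE_")
bb_closed = aggregate(bureau[bureau["CREDIT_ACTIVE"] == "Closed"], "BB_CLOSED_")

# Advanced range/ratio features from the overall aggregation.
bb_all["BB_NEW_AMT_CREDIT_SUM_RANGE"] = (
    bb_all["BB_AMT_CREDIT_SUM_MAX"] - bb_all["BB_AMT_CREDIT_SUM_MIN"]
)
bb_all["BB_DEBT_CREDIT_RATIO"] = (
    bb_all["BB_AMT_CREDIT_SUM_DEBT_SUM"] / bb_all["BB_AMT_CREDIT_SUM_SUM"]
)

bureau_agg = bb_all.join(bb_active, how="left").join(bb_closed, how="left")
```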
3. installments_payments

3.1. Initial data processing

3.1.1. Create new feature

● NEW_DAYS_PAID_EARLIER: The difference between DAYS_INSTALMENT and
DAYS_ENTRY_PAYMENT, representing the number of days the customer paid early or
late.
● Assign payment labels: 1: If payment is late (NEW_DAYS_PAID_EARLIER < 0); 0: If
payment is on time or early.
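A minimal sketch of the early/late payment features described above; the file path is an assumption.

```python
import pandas as pd

ins = pd.read_csv("installments_payments.csv")  # path is an assumption

# Positive values: paid early; negative values: paid after the due date.
ins["NEW_DAYS_PAID_EARLIER"] = ins["DAYS_INSTALMENT"] - ins["DAYS_ENTRY_PAYMENT"]

# Binary label: 1 if the installment was paid late, 0 if on time or early.
ins["NEW_NUM_PAID_LATER"] = (ins["NEW_DAYS_PAID_EARLIER"] < 0).astype(int)
```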

3.2. Aggregate data

3.2.1. Group by SK_ID_PREV

● Apply aggregate operations to capture general information about payment history:
- NUM_INSTALMENT_VERSION: Number of different payment versions (nunique).
- NUM_INSTALMENT_NUMBER: Maximum number of payment periods.
- DAYS_INSTALMENT & DAYS_ENTRY_PAYMENT: Minimum and maximum values.
- AMT_INSTALMENT & AMT_PAYMENT: Min, max, sum and average payments
- NEW_DAYS_PAID_EARLIER: Average number of early payment days.
- NEW_NUM_PAID_LATER: Total number of late payments.

3.2.2. Rename columns: INS_<Column name>_<Math> (e.g., INS_AMT_PAYMENT_SUM).

3.3. Remove unnecessary columns: INS_DAYS_INSTALMENT_MIN, INS_DAYS_INSTALMENT_MAX, INS_DAYS_ENTRY_PAYMENT_MIN, INS_DAYS_ENTRY_PAYMENT_MAX.

3.4. Create Advanced Features

● INS_NEW_PAYMENT_PERC: Payment rate (AMT_PAYMENT_SUM / AMT_INSTALMENT_SUM) to evaluate the payment level compared to the expected amount.
● INS_NEW_PAYMENT_DIFF: Difference between the expected total payment and the actual total payment.

3.5. Aggregate for Previous Application

● Create a list of aggregate operations to apply at the customer level (SK_ID_CURR): mean, min, max, sum.
4. pos_cash_balance

4.1. Initial data processing

4.1.1. Categorical variable encoding


● Apply One-Hot Encoding to the categorical column NAME_CONTRACT_STATUS,
creating binary variables such as NAME_CONTRACT_STATUS_Active,
NAME_CONTRACT_STATUS_Completed.

4.2. Create advanced features

4.2.1. Data aggregation

● Group by SK_ID_PREV and apply the following operations:


● Contract information:
- CNT_INSTALMENT: Minimum and maximum number of payment periods.
- CNT_INSTALMENT_FUTURE: Minimum and maximum number of remaining payment
periods.
● Payment delay:
- SK_DPD (number of overdue days): Maximum and average value.
- SK_DPD_DEF (number of overdue days leading to default): Maximum and average value.
● Contract status: Sum of statuses such as Active, Completed.

4.2.2. Credit Rating Characteristics

● POS_NEW_IS_CREDIT_NOT_COMPLETED_ON_TIME:
- Mark the loan as not completed on time:
+ 1: If CNT_INSTALMENT_FUTURE_MIN == 0 and the contract status is not Completed.
+ 0: If the contract is completed on time.
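A hedged sketch of the on-time completion flag; interpreting "status is not Completed" as "no Completed status was ever recorded for the loan" is an assumption, as is the file path.

```python
import pandas as pd

pos = pd.read_csv("POS_CASH_balance.csv")  # path is an assumption

# One-hot contract status and aggregate per previous loan.
pos = pd.get_dummies(pos, columns=["NAME_CONTRACT_STATUS"])
pos_agg = pos.groupby("SK_ID_PREV").agg(
    CNT_INSTALMENT_FUTURE_MIN=("CNT_INSTALMENT_FUTURE", "min"),
    COMPLETED_SUM=("NAME_CONTRACT_STATUS_Completed", "sum"),
)

# 1 if the schedule ran out (no future installments) without a Completed status,
# 0 if the contract was completed on time.
pos_agg["POS_NEW_IS_CREDIT_NOT_COMPLETED_ON_TIME"] = (
    (pos_agg["CNT_INSTALMENT_FUTURE_MIN"] == 0) & (pos_agg["COMPLETED_SUM"] == 0)
).astype(int)
```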

4.3. Remove unnecessary columns: NAME_CONTRACT_STATUS_Approved_SUM, NAME_CONTRACT_STATUS_Canceled_SUM.

4.4. Aggregate for Previous Application: Update the list of aggregate operations
(agg_list_previous_application) to integrate with the Previous Application dataset

5. credit_card_balance

5.1. Initial Data Processing

5.1.1. Categorical Variable Encoding

● Apply One-Hot Encoding to the NAME_CONTRACT_STATUS column.


● Remove unnecessary variables, such as NAME_CONTRACT_STATUS_Approved,
NAME_CONTRACT_STATUS_Refused.

5.1.2. Fill in Missing Values

● Fill in the mean for variables such as AMT_INST_MIN_REGULARITY, CNT_INSTALMENT_MATURE_CUM.
● Fill in the value 0 for transaction-related variables, such as AMT_DRAWINGS_ATM_CURRENT, CNT_DRAWINGS_ATM_CURRENT.

5.2. Create New Features

● CREDIT_UTILIZATION: Remaining credit balance (AMT_BALANCE - AMT_CREDIT_LIMIT_ACTUAL).
● MIN_PAYMENT_VS_DRAWINGS: Difference between the minimum payment and the
current transaction (AMT_INST_MIN_REGULARITY - AMT_DRAWINGS_CURRENT).
● PAYMENT_VS_TOTAL_RECEIVABLE: Difference between the payment and the total
outstanding balance (AMT_PAYMENT_TOTAL_CURRENT -
AMT_TOTAL_RECEIVABLE).
● SUM_ALL_AMT_DRAWINGS: Total transactions from ATM, POS, and other forms.
● RATIO_ALL_AMT_DRAWINGS_TO_ALL_CNT_DRAWINGS: Ratio between total
transaction amount and total number of transactions.
● CREDIT_CARD_BALANCE_RATIO: Ratio of credit limit utilization to card balance.
● PERCENTAGE_MIN_MISSED_PAYMENTS: Percentage of payments less than the
minimum payment.
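An illustrative implementation of several of the credit card features above; the helper columns SUM_ALL_CNT_DRAWINGS and MIN_PAYMENT_MISSED, the zero-division handling, and the file path are assumptions.

```python
import numpy as np
import pandas as pd

cc = pd.read_csv("credit_card_balance.csv")  # path is an assumption

# Differences between balances, limits, payments, and drawings (as described above).
cc["CREDIT_UTILIZATION"] = cc["AMT_BALANCE"] - cc["AMT_CREDIT_LIMIT_ACTUAL"]
cc["MIN_PAYMENT_VS_DRAWINGS"] = cc["AMT_INST_MIN_REGULARITY"] - cc["AMT_DRAWINGS_CURRENT"]
cc["PAYMENT_VS_TOTAL_RECEIVABLE"] = (
    cc["AMT_PAYMENT_TOTAL_CURRENT"] - cc["AMT_TOTAL_RECEIVABLE"]
)

# Totals across ATM, POS, and other drawing channels.
cc["SUM_ALL_AMT_DRAWINGS"] = (
    cc["AMT_DRAWINGS_ATM_CURRENT"].fillna(0)
    + cc["AMT_DRAWINGS_POS_CURRENT"].fillna(0)
    + cc["AMT_DRAWINGS_OTHER_CURRENT"].fillna(0)
)
cc["SUM_ALL_CNT_DRAWINGS"] = (
    cc["CNT_DRAWINGS_ATM_CURRENT"].fillna(0)
    + cc["CNT_DRAWINGS_POS_CURRENT"].fillna(0)
    + cc["CNT_DRAWINGS_OTHER_CURRENT"].fillna(0)
)
cc["RATIO_ALL_AMT_DRAWINGS_TO_ALL_CNT_DRAWINGS"] = (
    cc["SUM_ALL_AMT_DRAWINGS"] / cc["SUM_ALL_CNT_DRAWINGS"].replace(0, np.nan)
)

# Flag months where the payment fell short of the required minimum; the percentage
# of such months per customer can then be aggregated downstream.
cc["MIN_PAYMENT_MISSED"] = (
    cc["AMT_PAYMENT_CURRENT"] < cc["AMT_INST_MIN_REGULARITY"]
).astype(int)
```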

5.3. Data aggregation

5.3.1. Grouping and calculation

● Aggregating by SK_ID_CURR and applying the following calculations:


- Financial: Sum, average, maximum, minimum values for variables such as
AMT_BALANCE, AMT_PAYMENT_TOTAL_CURRENT.
- Transactions: Sum and average for CNT_DRAWINGS_ATM_CURRENT,
CNT_DRAWINGS_CURRENT.
- Loan amount: Number of loans and payment periods (NUMBER_OF_LOANS,
TOTAL_INSTALMENTS).
● Calculation
- INSTALLMENTS_PER_LOAN: Average number of payment periods per loan.
- DRAWINGS_RATIO: Ratio of total transaction amount to total number of transactions.
6. previous_application.

6.1. Categorical Variable Transformation

● The WEEKDAY_APPR_PROCESS_START variable was split into two groups: WEEK_DAY (workdays) and WEEKEND (weekends).
● The HOUR_APPR_PROCESS_START variable was divided into two groups:
working_hours (working hours from 8 AM to 5 PM) and off_hours (outside working hours).
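A small sketch of the weekday and hour groupings; the new column names WEEKDAY_GROUP and HOUR_GROUP are illustrative, and the file path is an assumption.

```python
import pandas as pd

prev = pd.read_csv("previous_application.csv")  # path is an assumption

# Collapse the weekday into workday vs. weekend groups.
weekend = {"SATURDAY", "SUNDAY"}
prev["WEEKDAY_GROUP"] = prev["WEEKDAY_APPR_PROCESS_START"].apply(
    lambda d: "WEEKEND" if d in weekend else "WEEK_DAY"
)

# Collapse the application hour into working hours (8-17) vs. off hours.
prev["HOUR_GROUP"] = prev["HOUR_APPR_PROCESS_START"].apply(
    lambda h: "working_hours" if 8 <= h <= 17 else "off_hours"
)
```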

6.2. Data Transformation

● The DAYS_DECISION variable was encoded as 1 if the value was less than or equal to 1
year, and 0 if it was greater than 1 year.
● The NAME_TYPE_SUITE variable was classified into two groups: alone (applying alone)
and not_alone (applying with others).
● Values in the NAME_GOODS_CATEGORY and NAME_SELLER_INDUSTRY variables
that do not belong to the main categories were grouped into an others category.
● Categorical variables were encoded using One-Hot Encoding, and aggregated using
operations like sum, mean, max, and min to create composite features.

6.3. Creating New Features

● The LOAN_RATE feature was calculated as the ratio between the requested loan amount
(AMT_APPLICATION) and the approved loan amount (AMT_CREDIT).
● New features such as NEW_CHURN_PREV were created to indicate whether the loan is
overdue, and NEW_INSURANCE replaced the original NFLAG_INSURED_ON_APPROVAL
variable, using the difference between AMT_CREDIT and AMT_GOODS_PRICE.
● The PREVIOUS_TERM (loan term, calculated as the ratio between AMT_CREDIT and
AMT_ANNUITY) and PREVIOUS_AMT_TO_APPLICATION (ratio of approved loan to
requested loan) were created.
● Time difference features like PREVIOUS_DAYSLASTDUE1ST_DAYSFIRSTDUE_DIFF
were also introduced.
● Interest rate features (INTEREST, INTEREST_RATE) were calculated based on the loan
amount and repayment period.
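A hedged sketch of the ratio and interest features above; using CNT_PAYMENT for the repayment period and the simple interest formula are assumptions about the report's exact calculation.

```python
import numpy as np
import pandas as pd

prev = pd.read_csv("previous_application.csv")  # path is an assumption

# Ratio of the requested amount to the approved amount.
prev["LOAN_RATE"] = prev["AMT_APPLICATION"] / prev["AMT_CREDIT"].replace(0, np.nan)

# Loan term proxy and approved-vs-requested ratio.
prev["PREVIOUS_TERM"] = prev["AMT_CREDIT"] / prev["AMT_ANNUITY"].replace(0, np.nan)
prev["PREVIOUS_AMT_TO_APPLICATION"] = (
    prev["AMT_CREDIT"] / prev["AMT_APPLICATION"].replace(0, np.nan)
)

# Simple interest proxy: total scheduled annuity payments minus the principal.
prev["INTEREST"] = prev["CNT_PAYMENT"] * prev["AMT_ANNUITY"] - prev["AMT_CREDIT"]
prev["INTEREST_RATE"] = prev["INTEREST"] / prev["AMT_CREDIT"].replace(0, np.nan)
```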

6.4. Removing Irrelevant Variables


● Some unnecessary variables, such as AMT_DOWN_PAYMENT, SELLERPLACE_AREA,
and variables related to time, were removed to reduce data noise.

6.5. Data Merging

● Combined application_train and application_test datasets to create the main table df, which
includes basic customer information such as age, income, and loan contract type.
● Aggregated features from all the datasets (min, max, average,...)

The tables then were merged sequentially. Data from installments_payments was merged with
previous_application, followed by data from pos_cash_balance. The aggregated features were created
for each customer (SK_ID_CURR) and stored in the df_prev_ins_pos_agg table. This table was then
merged with the main df table along with data from credit_card_balance and bureau_balance,
resulting in the final comprehensive table all_data.
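A sketch of the merge order described above, using tiny illustrative aggregates in place of the real per-customer tables; the *_agg names are assumptions.

```python
import pandas as pd

# Tiny illustrative aggregates (one row per SK_ID_CURR); in the real pipeline these
# come from the aggregation steps described in the previous sections.
df = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_INCOME_TOTAL": [100000, 150000]})
ins_agg = pd.DataFrame({"SK_ID_CURR": [1, 2], "INS_AMT_PAYMENT_SUM": [5000, 7000]})
prev_agg = pd.DataFrame({"SK_ID_CURR": [1, 2], "PREV_AMT_CREDIT_MEAN": [20000, 30000]})
pos_agg = pd.DataFrame({"SK_ID_CURR": [1], "POS_SK_DPD_MAX": [3]})
cc_agg = pd.DataFrame({"SK_ID_CURR": [2], "CC_AMT_BALANCE_MEAN": [1200.0]})
bureau_agg = pd.DataFrame({"SK_ID_CURR": [1, 2], "BB_DEBT_CREDIT_RATIO": [0.4, 0.1]})

def merge_on_customer(left, right):
    # Left join keeps every customer even when an auxiliary table has no rows for them.
    return left.merge(right, on="SK_ID_CURR", how="left")

# installments + previous_application + pos_cash, aggregated per customer ...
df_prev_ins_pos_agg = merge_on_customer(prev_agg, ins_agg)
df_prev_ins_pos_agg = merge_on_customer(df_prev_ins_pos_agg, pos_agg)

# ... then joined to the main table together with card and bureau features.
all_data = merge_on_customer(df, df_prev_ins_pos_agg)
all_data = merge_on_customer(all_data, cc_agg)
all_data = merge_on_customer(all_data, bureau_agg)
```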

6.6. Creating Features Based on Cluster Relationships

In two main datasets: application_train and application_test

● Selected key columns: DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH, CODE_GENDER, REGION_POPULATION_RELATIVE, and the
target variable TARGET, to keep the factors relevant to the model. For the test dataset
(application_test), we added a TARGET column with a default value of None to synchronize
the structure between the two datasets.
● Calculated DAYS_REGISTRATION and DAYS_ID_PUBLISH based on DAYS_BIRTH to
reflect differences between key events in the customer’s credit registration and issuance
process.
● Merged the two datasets, application_train and application_test, into a single dataset called
dfull2, to streamline the process and ensure consistency throughout the data.

We identified duplicates using combinations of key features and assigned unique cluster IDs.
Based on these clusters, we created lag_TARGET and lead_TARGET features, along with aggregated
metrics. The main dataset, all_data, was enhanced with these features and external data sources,
ensuring a comprehensive training set.

V. Models

Model Overview

Logistic regression, a powerful and interpretable classification model, is used with a pipeline
structure to seamlessly integrate preprocessing and classification steps. It is trained with a maximum
of 1000 iterations and employs cross-validation to ensure generalizability and prevent overfitting.

Cross-Validation and Gini Coefficient

Stratified 5-fold cross-validation ensures balanced evaluation by preserving default and non-
default proportions. The ROC-AUC metric is used to compute the Gini coefficient, which measures
model performance:

Gini Coefficient = 2 × ROC-AUC − 1
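A minimal sketch of the pipeline, stratified 5-fold cross-validation, and Gini computation described in this section, using synthetic data in place of the prepared feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix (the real pipeline uses all_data).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.92, 0.08],
                           random_state=42)

# Preprocessing and classifier combined in one pipeline, as described above.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 5-fold CV preserves the default/non-default ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

gini_scores = 2 * auc_scores - 1  # Gini = 2 * ROC-AUC - 1
print(f"Mean ROC-AUC: {auc_scores.mean():.4f}, Mean Gini: {gini_scores.mean():.4f}")
```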

Model Selection and Training

The logistic regression model is compared to alternative classifiers (e.g., decision trees) using the
ROC-AUC score. If it outperforms the others, it is selected and retrained on the entire training dataset
to produce the final predictions.

Validation and Performance Metrics


The trained model is then evaluated on the validation dataset to measure its real-world
performance. The predicted probability of default is generated for each customer, and the
Validation ROC-AUC and Validation Gini Coefficient are calculated.

Output:

● Validation ROC-AUC: 0.7794
● Validation Gini Coefficient: 0.5588

VI. Insights and Implications

The logistic regression model provides an efficient, interpretable, and stable solution for credit
default risk prediction, with clear feature insights, scalable preprocessing, and consistent performance via
cross-validation. It is reliable for real-world deployment and serves as a benchmark for testing more
advanced algorithms.
