Modelling-project notes-2
By:
E. AuroRajashri
List of Content
1) Introduction of the business problem
1.1 Defining problem statement
1.2 Need of the study/project
1.3 Understanding business/social opportunity
2) Data Report
2.1 Understanding how data was collected in terms of time, frequency
and methodology
2.2 Visual inspection of data (rows, columns, descriptive details)
2.3 Understanding of attributes (variable info, renaming if required)
3) Exploratory data analysis
3.1 Univariate analysis (distribution and spread for every continuous
attribute, distribution of data in categories for categorical ones)
3.2 Bivariate analysis (relationship between different variables,
correlations)
3.3 Removal of unwanted variables (if applicable)
3.4 Missing Value treatment (if applicable)
3.5 Outlier treatment (if required)
3.6 Variable transformation (if applicable)
3.7 Addition of new variables (if required)
6) Model Tuning
6.1 Ensemble modelling, wherever applicable
6.2 Any other model tuning measures (if applicable)
6.3 Interpretation of the most optimum model and its implication on the
business
List of Tables
2.2 Descriptive Statistics
List of Figures
3.1.1 Histogram of age
2. Data Report
2.1 Understanding how data was collected in terms of
time, frequency and methodology
The data was provided by a credit card company and describes its
customers' credit activity and default information.
The dataset contains 99,979 customer records, each described by
36 variables.
Key insights:
Concentration of Transactions: The Direct selling establishments
category dominates with the highest count, nearly 40,000, far
exceeding the other categories. This indicates a large number of
transactions or significant activity in this category.
Moderate Activity: Categories like Books & Magazines and Youthful
Shoes & Clothing have moderate counts (around 10,000–15,000),
showing significant but not overwhelming activity compared to the
leader.
Low Activity Categories: Categories like Dietary Supplements,
Prints & Photos, and Diversified electronics have much lower
counts (under 10,000). These are niche categories with fewer
transactions.
Category Variety: The top 10 categories represent a broad range of
industries, including electronics, apparel, outdoor gear, books, and
general merchandise. This indicates diverse customer interests.
3.1.5 Top 10 Merchant groups
The bar chart shows the top 10 merchant groups and the number of
transactions associated with each group. The main insights are:
Entertainment is by far the dominant category, with significantly
more counts (around 50,000) than the other categories. This
suggests that consumers engage with or spend more in this group.
Clothing & Shoes follows as the second-highest group, though it's
much lower than Entertainment.
The groups with the lowest counts are Jewelry & Accessories,
Home & Garden, Intangible Products, and Automotive Products.
The distribution shows that spending or transaction volume is
concentrated heavily in Entertainment, with other categories
having relatively smaller but still notable volumes.
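The top-10 counts behind a bar chart like this can be reproduced with a simple frequency count. The sketch below is a minimal illustration using only the Python standard library; the list `merchant_groups` and its values are made up for the example and are not taken from the project data.

```python
from collections import Counter

# Illustrative values only: one merchant group label per transaction.
merchant_groups = [
    "Entertainment", "Entertainment", "Entertainment", "Entertainment",
    "Clothing & Shoes", "Clothing & Shoes",
    "Home & Garden", "Automotive Products",
]


def top_n_counts(values, n=10):
    """Return the n most frequent values with their counts, most frequent first."""
    return Counter(values).most_common(n)


top_groups = top_n_counts(merchant_groups, n=10)
for group, count in top_groups:
    share = count / len(merchant_groups)
    # Print the label, its count, and its share of all transactions.
    print(f"{group:<22}{count:>4}  {share:.0%}")
```

Printing the share next to the count makes it easier to judge how strongly one group dominates the rest.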
3.1.6 Histogram of all numerical variables
This bar plot compares the average account amount added in the last
12-24 months for customers who defaulted (1) versus those who did not (0).
We can see that:
Customers who defaulted (1) tend to have a higher average
account amount added compared to those who didn't default (0).
This could suggest that customers who add larger amounts to their
accounts might be at a higher risk of default, possibly due to
overextending their financial capabilities.
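The comparison behind such a bar plot is a group-wise mean. The following is a small sketch of that calculation, assuming each record is a pair of (default flag, amount added); the `records` values are hypothetical and only illustrate the mechanics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (default flag, account amount added in the 12-24 month window).
records = [(0, 100.0), (0, 300.0), (0, 200.0), (1, 500.0), (1, 700.0)]


def mean_by_group(pairs):
    """Group values by their label and return the mean of each group."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {label: mean(values) for label, values in groups.items()}


avg_added = mean_by_group(records)
for label, value in sorted(avg_added.items()):
    print(f"default={label}: mean amount added = {value:.1f}")
```

Comparing the median alongside the mean would show whether a few very large amounts drive the difference between the two groups.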
This strip plot shows the distribution of the maximum paid invoice in the last 12 months for
defaulted and non-defaulted customers. Observations:
Most non-defaulted customers (0) are concentrated in the lower range, but this group
also contains a number of high-value outliers.
As a result, non-defaulted accounts (status 0) span a wider range and reach higher max
paid invoices, while defaulted accounts (status 1) have smaller invoice amounts. This
pattern could be used for risk assessment or to better understand customer payment
behaviour.
Key Insights:
1. Highly Correlated Features:
Features with a correlation coefficient close to 1 or -1
have a very strong linear relationship, either positively
or negatively correlated.
For example, if max_paid_inv_0_12m and
num_active_inv_0_12m show high positive correlation,
it implies that as the number of active invoices
increases, the maximum paid invoice also tends to
increase.
Similarly, features like acct_worst_status_12_24m
might be strongly correlated with
acct_worst_status_6_12m, indicating a consistency in
worst account status over different periods.
2. Clusters of Features:
Features that are highly correlated with each other may
form "clusters." For instance, all account status
variables or payment-related features might be grouped
together, showing that they are related aspects of
customer behavior.
Clustering often reveals related features that can be
treated similarly in model building or analysis, as they
provide overlapping information.
3. Negative Correlations:
Strong negative correlations (close to -1) indicate an
inverse relationship. For example, if default_status has
a negative correlation with max_paid_inv_0_12m, it
means that customers with higher max paid invoices
are less likely to default.
Similarly, a negative correlation between
acct_incoming_debt_vs_paid_0_24m and
acct_days_in_rem_12_24m might show that the more
days a person remains in arrears, the less they manage
to reduce their outstanding debt.
4. Redundancy:
Features that are almost perfectly correlated (near 1)
may represent redundant information. For example, if
acct_worst_status_6_12m and acct_worst_status_3_6m
are highly correlated, it may be redundant to include
both in certain analyses. One of these features can
potentially be dropped in a model without losing
valuable information.
5. Outliers in Correlation:
If there are features that stand out with unexpectedly
high or low correlations compared to others, they may
warrant deeper investigation. These outliers could
represent key insights into behavior or relationships
between variables that are not immediately obvious.
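The redundancy check described in the points above can be sketched as a scan over all feature pairs that flags those whose absolute Pearson correlation exceeds a cut-off. This is a minimal standard-library illustration, not the method used in the project: the 0.9 threshold is an assumed value, and the numbers in `columns` are invented (only the feature names come from the text).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:  # a NaN result never passes this test
            pairs.append((a, b, round(r, 3)))
    return pairs


# Invented values for illustration; only the column names appear in the report.
columns = {
    "acct_worst_status_6_12m": [1, 1, 2, 3, 4, 1],
    "acct_worst_status_3_6m": [1, 1, 2, 3, 4, 2],
    "max_paid_inv_0_12m": [900, 50, 400, 300, 100, 800],
}

pairs = correlated_pairs(columns, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r}")
```

For each flagged pair, one of the two features is a candidate for removal; which one to keep is a modelling decision the scan itself does not make.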
Columns with more than 25% missing values were dropped. The missing
values in the remaining columns are shown below.
3.4.2 Post dropping off columns with 25% threshold
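The 25% rule can be expressed as a small filter: compute the fraction of missing entries per column and keep only columns at or below the threshold. The sketch below assumes the data is a list of dictionaries with `None` marking a missing value; that representation and the sample rows are assumptions made for illustration.

```python
def missing_fraction(rows, column):
    """Fraction of rows in which the column is missing (None)."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def drop_sparse_columns(rows, threshold=0.25):
    """Drop columns whose missing fraction is greater than the threshold."""
    columns = list(rows[0].keys())  # assumes every row has the same keys
    fractions = {c: missing_fraction(rows, c) for c in columns}
    keep = [c for c in columns if fractions[c] <= threshold]
    cleaned = [{c: row.get(c) for c in keep} for row in rows]
    return cleaned, fractions


# Invented sample rows; None marks a missing value.
rows = [
    {"age": 30, "acct_worst_status_12_24m": None, "num_active_inv_0_12m": 2},
    {"age": 41, "acct_worst_status_12_24m": 1.0, "num_active_inv_0_12m": None},
    {"age": 25, "acct_worst_status_12_24m": None, "num_active_inv_0_12m": 0},
    {"age": 52, "acct_worst_status_12_24m": 2.0, "num_active_inv_0_12m": 1},
]

cleaned, fractions = drop_sparse_columns(rows, threshold=0.25)
for name, frac in fractions.items():
    status = "kept" if name in cleaned[0] else "dropped"
    print(f"{name}: {frac:.0%} missing ({status})")
```

A column with exactly 25% missing values is kept here, because the rule removes only columns with more than 25% missing.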
3. Train a model: The balanced data is then used to train the
machine learning model.
The classifier does a good job overall, with a relatively high AUC
score.
Although the classifier performs well in general, it may still fail
to correctly identify the minority class (Class 1) as shown by its
low recall and F1-score for that class.
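The gap between a good overall score and weak minority-class results can be seen by computing per-class precision, recall, and F1 directly from the predictions. The sketch below uses invented labels for 20 customers and uses accuracy as a stand-in for the overall score (it does not compute AUC); none of the numbers come from the project's model.

```python
def class_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, computed from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: 16 non-defaulters (0) and 4 defaulters (1).
y_true = [0] * 16 + [1] * 4
# The classifier raises one false alarm and catches only one defaulter.
y_pred = [0] * 15 + [1] + [1] + [0] * 3

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
metrics = class_metrics(y_true, y_pred, positive=1)

print(f"accuracy: {accuracy:.2f}")
for name, value in metrics.items():
    print(f"class 1 {name}: {value:.2f}")
```

In this toy case most predictions are correct overall, yet three of the four defaulters are missed, which is the pattern the per-class recall and F1-score expose.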