Mastering Exploratory Data Analysis (EDA): Everything You Need To Know

Sze Zhong LIM · Published in Data And Beyond · 18 min read · Apr 7, 2024

Photo by Andreas Chu on Unsplash

The journey from raw data to actionable insights is often paved with
challenges and uncertainties. At the heart of this journey lies Exploratory
Data Analysis (EDA), a foundational process that serves as a compass,
guiding data scientists through the intricate landscape of their datasets. EDA
is not merely a preliminary step; it is a profound exploration that unlocks the
hidden treasures buried within data, revealing patterns, anomalies, and
relationships that form the bedrock of informed decision-making. Let's start
with: what exactly is EDA?

Exploratory Data Analysis (EDA) is an analytical approach aimed at
uncovering the inherent characteristics of datasets, utilizing statistical and
visualization techniques. Unlike hypothesis-driven analyses driven by prior
domain knowledge, EDA is a flexible, open-ended exploration that allows data
scientists to delve into the data without preconceived notions. Essentially,
EDA serves as a preliminary step to inspire hypothesis generation by unveiling
intriguing patterns, trends, and correlations within the data. In practical
terms, EDA enables us to formulate hypotheses based on data-driven insights,
which can then be tested against hypotheses grounded in domain knowledge,
thereby enriching our understanding and validating our findings.

Exploratory Data Analysis (EDA) typically consists of several key components


or stages that guide data scientists through the process of understanding and
exploring a dataset. These components can vary depending on the specific
goals of the analysis and the characteristics of the data, but commonly
include:
1) Data Collection

2) Data Cleaning and Preprocessing


3) Descriptive Statistics
4) Univariate Analysis
5) Bivariate Analysis
6) Multivariate Analysis
7) Feature Engineering
8) Visualization

In this post, I will share my own method of EDA once I have fixed on a
dataset. There will be some differences from project to project, but the
system is there so that most of the items are covered. I will split my EDA into
3 parts:
1) EDA Level 0 — Pure Understanding of Original Data
2) EDA Level 1 — Transformation of Original Data
3) EDA Level 2 — Understanding of Transformed Data

I will be using some examples from a proper EDA I did. The purpose of this
post is just to share and log the code used, along with some examples of how EDA
can be done. There might be parts where the insights found do not make sense, as
this is just a part of a bigger EDA.

EDA Level 0 — Pure Understanding of Original Data


I did a basic check on the column datatype, null counts, and distinct values to
get a better understanding of the data. I also created a distinct values count
dictionary where I got the top 10 counts and their distinct values displayed, so
I could roughly gauge how significant the distinct values are in the dataset.

import pandas as pd

def column_summary(df):
    summary_data = []

    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()

        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts = {k: v for k, v in sorted(top_10_values_counts.items(),
                                                              key=lambda item: item[1], reverse=True)}

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })

    summary_df = pd.DataFrame(summary_data)
    return summary_df

# Example usage:
# Assuming df is your DataFrame
summary_df = column_summary(df)
display(summary_df)
Below is another code snippet that can extract more info.

# Gets additional values such as min / median / max etc.

import numpy as np
import pandas as pd

def column_summary_plus(df):
    result_df = pd.DataFrame(columns=['col_name', 'col_dtype', 'num_distinct_values',
                                      'min_value', 'max_value',
                                      'median_no_na', 'average_no_na', 'average_non_zero',
                                      'null_present', 'nulls_num', 'non_nulls_num',
                                      'distinct_values'])

    # Loop through each column in the DataFrame
    for column in df.columns:
        print(f"Start processing {column} col with {df[column].dtype} dtype")
        # Get column dtype
        col_dtype = df[column].dtype
        # Get distinct values and their counts
        value_counts = df[column].value_counts()
        distinct_values = value_counts.index.tolist()
        # Get number of distinct values
        num_distinct_values = len(distinct_values)
        # Get min and max values
        sorted_values = sorted(distinct_values)
        min_value = sorted_values[0] if sorted_values else None
        max_value = sorted_values[-1] if sorted_values else None

        # Get median value
        non_distinct_val_list = sorted(df[column].dropna().tolist())
        len_non_d_list = len(non_distinct_val_list)
        if len_non_d_list == 0:
            median = None
        else:
            median = non_distinct_val_list[len_non_d_list // 2]

        # Get average values if the column is numeric
        if np.issubdtype(df[column].dtype, np.number):
            if len_non_d_list > 0:
                average = sum(non_distinct_val_list) / len_non_d_list
                non_zero_val_list = [v for v in non_distinct_val_list if v > 0]
                average_non_zero = (sum(non_zero_val_list) / len(non_zero_val_list)
                                    if non_zero_val_list else None)
            else:
                average = None
                average_non_zero = None
        else:
            average = None
            average_non_zero = None

        # Check if null values are present
        null_present = 1 if df[column].isnull().any() else 0

        # Get number of nulls and non-nulls
        num_nulls = df[column].isnull().sum()
        num_non_nulls = df[column].notnull().sum()

        # distinct_values only takes the top 10 distinct value counts
        top_10_d_v = value_counts.head(10).index.tolist()
        top_10_c = value_counts.head(10).tolist()
        top_10_d_v_dict = dict(zip(top_10_d_v, top_10_c))

        # Append the information to the result DataFrame
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        row = pd.DataFrame([{'col_name': column, 'col_dtype': col_dtype,
                             'num_distinct_values': num_distinct_values,
                             'min_value': min_value, 'max_value': max_value,
                             'median_no_na': median, 'average_no_na': average,
                             'average_non_zero': average_non_zero,
                             'null_present': null_present, 'nulls_num': num_nulls,
                             'non_nulls_num': num_non_nulls,
                             'distinct_values': top_10_d_v_dict}])
        result_df = pd.concat([result_df, row], ignore_index=True)

    return result_df

# Example usage:
# Assuming df is your DataFrame
summary_df = column_summary_plus(df)
display(summary_df)
If there are any errors, they are most likely due to the datatypes of the pandas
DataFrame. For those who are converting from one format to another, it is
important to keep the datatype info. One example is saving a pandas
DataFrame into a CSV file and then loading it back again: it is important to
save the datatypes and reload the CSV with them.

### To Save Pandas to CSV

import json

def dtype_to_json(pdf, json_file_path: str) -> dict:
    '''
    Parameters
    ----------
    pdf : pandas.DataFrame
        pandas.DataFrame so we can extract the dtype
    json_file_path : str
        the json file path location

    Returns
    -------
    dict
        The dtype dictionary used

    To create a json file which stores the pandas dtype dictionary for
    use when converting back from csv to pandas.DataFrame.
    '''
    dtype_dict = pdf.dtypes.apply(lambda x: str(x)).to_dict()

    with open(json_file_path, 'w') as json_file:
        json.dump(dtype_dict, json_file)

    return dtype_dict

def download_csv_json(df, mainpath):
    csvpath = f"{mainpath}.csv"
    jsonfp = f"{mainpath}_dtype.json"

    dtypedict = dtype_to_json(df, jsonfp)

    df.to_csv(csvpath, index=False)

    return csvpath, jsonfp

# Example usage:
download_csv_json(df, "/home/some_dir/file_1")

### To Load CSV to Pandas

def json_to_dtype(jsonfilepath):
    with open(jsonfilepath, 'r') as json_file:
        loaded_dict = json.load(json_file)
    return loaded_dict

def csv_to_pandas(csvpath, jsonpath):
    dtypedict = json_to_dtype(jsonpath)
    pdf = pd.read_csv(csvpath, dtype=dtypedict)

    return pdf

# Example usage:
csvfp = "/home/some_dir/file_1.csv"
jsonfp = "/home/some_dir/file_1_dtype.json"
df = csv_to_pandas(csvfp, jsonfp)

By doing this check, one obvious issue that surfaced is that the C_ID column
was not a primary key, since the number of distinct values is not equal to the
number of non-nulls.
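
A quick way to confirm this kind of finding directly is sketched below; it assumes the raw identifier column is named C_ID, as referenced above.

# Minimal sketch: a column can serve as a primary key only if every non-null value is unique.
n_distinct = df['C_ID'].nunique()
n_non_null = df['C_ID'].notnull().sum()
print(f"distinct: {n_distinct}, non-null: {n_non_null}, usable as primary key: {n_distinct == n_non_null}")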

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df.head())
print(df.describe())
print(df.duplicated().sum())

I will also run the 3 lines above to:

1) Get an idea of how the dataset looks
2) Know the median / mean / rough statistical distribution
3) Check that there are no duplicated rows

I will then run a fast and dirty check on the distributions using the code
below. The code should produce histogram charts:

import matplotlib.pyplot as plt
import seaborn as sns

# Identify numerical columns
numerical_columns = df.select_dtypes(include=[np.number]).columns

# Perform univariate analysis on numerical columns
for column in numerical_columns:
    # For continuous variables
    if len(df[column].unique()) > 10:  # Assuming if unique values > 10, consider it continuous
        plt.figure(figsize=(8, 6))
        sns.histplot(df[column], kde=True)
        plt.title(f'Histogram of {column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        plt.show()
    else:  # For discrete or ordinal variables
        plt.figure(figsize=(8, 6))
        ax = sns.countplot(x=column, data=df)
        plt.title(f'Count of {column}')
        plt.xlabel(column)
        plt.ylabel('Count')

        # Annotate each bar with its count
        for p in ax.patches:
            ax.annotate(format(p.get_height(), '.0f'),
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='center',
                        xytext=(0, 5),
                        textcoords='offset points')
        plt.show()
Basically, what we want to do is get a rough idea of what the data is like, so
that we can see if there are any inconsistencies and assumptions we might
need to consider before attempting to transform the data. After our fast and
dirty check, we will proceed towards EDA Level 1.

EDA Level 1 — Transformation of Original Data


Based on the Level 0 information, I decided to transform the dataset before
doing a deeper exploration for more insights.
1) I changed the column names to all be in small letters, with spaces
changed to underscores. I also changed them to names that I feel are more
generic and categorized, for easy interpretation.
2) I filled in the empty null / NaN rows with values I feel make sense. (I will
show some examples: dr_cr_ratio, prop_pur_price, cc_out_bal_ave_mth.)
3) I changed the datatypes to be more appropriate.
4) I did data validation.
5) I mapped / binned the categorical features.

Changing Column Names


I decided to change the column names to be more readable and
standardized.

All small letters

All spaces changed to underscores

Shortforms standardized

### Rename the column names for familiarity

# This is if there is no requirement to use back the same column names.
# This is also only done if there is no pre-existing format, or if the col names are not standardized.
# Normally we will follow the feature mart / dept format to name columns for easy understanding.

df_l1 = df.copy()
df_l1.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True)
new_col_dict = {'pc': 'c_pc', 'incm_typ': 'c_incm_typ', 'gn_occ': 'c_occ',
                'num_prd': 'prod_nos', 'casatd_cnt': 'casa_td_nos', 'mthcasa': 'casa_bal_avg_mth',
                'maxcasa': 'casa_bal_max_yr', 'mincasa': 'casa_bal_min_yr',
                'mthtd': 'td_bal_avg', 'maxtd': 'td_bal_max', 'asset_value': 'asset_tot_val',
                'hl_tag': 'loan_home_tag', 'al_tag': 'loan_auto_tag', 'pur_price': 'prop_pur_price',
                'ut_ave': 'ut_avg', 'maxut': 'ut_max', 'n_funds': 'funds_nos',
                'cc_ave': 'cc_out_bal_avg_mth', 'max_mth_trn_amt': 'cc_txn_amt_max_mth',
                'avg_trn_amt': 'cc_txn_amt_avg_mth', 'ann_trn_amt': 'cc_txn_amt_yr'
                # further mappings for the debit/credit ratio, min monthly txn amount,
                # yearly txn count and credit limit columns are omitted here
                }
df_l1.rename(columns=new_col_dict, inplace=True)

I find renaming columns an important part because when I do this, I am
able to categorize columns accordingly, and when analyzing features at a
later stage, having clear column names really helps in quickly identifying the
points. E.g. when I see features with the cc prefix, I know immediately it is a
credit card feature. This allows me to identify trends along the way.

Filling Up Nulls / NANs


Deciding how to fill up the null values is one of the crucial parts of ensuring
the model predicts accurately. We can only fill up the null values if we have
domain understanding of the feature and also an understanding of the
figures within the dataset. For me, this part usually takes up the most
amount of time from my end, as it really requires trying to understand the
data beforehand.

We can use several methods to figure out how to explore the data and fill up
the null values. I will give two examples:

USING A BOXPLOT

sns.set(style="whitegrid")

# Create the boxplot
plt.figure(figsize=(10, 6))  # Set the size of the plot
sns.boxplot(x='c_incm_typ', y='casa_bal_max_yr', data=df_l1)

# Set labels and title
plt.xlabel('Income Type')
plt.ylabel('casa_bal_max_yr')
plt.title('Boxplot of casa_bal_max_yr by Income Type')
plt.yscale('log')

# Show the plot
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()       # Adjust layout to prevent clipping of labels
plt.show()

I was trying to find whether there was any relationship between casa_bal_max_yr
and Income Type, such that I could use casa_bal_max_yr to infer a
person's Income Type. Since there is no significant relationship, and the overlap
among the IQRs is too large, we will fill the nulls with 0 as a separate category
instead of trying to classify them as 1 to 8.
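
A minimal sketch of that decision is below, assuming the missing income types are simply encoded as a 0 category.

# Sketch: treat missing income type as its own category (0) rather than guessing a value from 1 to 8.
df_l1['c_incm_typ'] = df_l1['c_incm_typ'].fillna(0).astype(int)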

USING DESCRIBE

The describe method provides a clear understanding of the basic statistical
information. In the case below, I was trying to check whether all home loans
have a property purchase price value. This is because, by common sense, if
someone took up a home loan, they will have a property, and that property will
have a value.

new_df = df_l1[['prop_pur_price', 'loan_home_tag']]
null_loan_home = new_df[new_df['loan_home_tag'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['prop_pur_price']].isnull().any(axis=1)].shape[0]
print(f"Number of rows where 'loan_home_tag' is null, but 'prop_pur_price' is not null: {not_null_count}")

new_df = df_l1[['prop_pur_price', 'loan_home_tag']]
null_loan_home = new_df[new_df['prop_pur_price'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['loan_home_tag']].isnull().any(axis=1)].shape[0]
print(f"Number of rows where 'prop_pur_price' is null, but 'loan_home_tag' is not null: {not_null_count}")

new_df = df_l1[['prop_pur_price', 'loan_home_tag']]
condition = new_df['loan_home_tag'] == 1
new_df[condition].describe()

We can see that there were 5460 customers who had a property purchase
price but did not take up a loan, and that there were 2243 customers who had
taken up a loan but had no property purchase price.

We then used the describe method to see the distribution of property
purchase prices for those who have taken a home loan, to decide how to impute
the data. For this particular case, I decided to use the median to impute the
nulls in the property purchase price.
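
A hedged sketch of that imputation is below; the exact median value from the original analysis is not reproduced here.

# Sketch: impute missing property purchase prices with the median price
# observed among customers who hold a home loan.
median_price = df_l1.loc[df_l1['loan_home_tag'] == 1, 'prop_pur_price'].median()
df_l1['prop_pur_price'] = df_l1['prop_pur_price'].fillna(median_price)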

Changing the Data Type

dtype_mapping = {'c_id': str, 'c_age': int, 'c_pc': int, 'c_incm_typ': int, 'prod_nos': int,
                 'casa_td_nos': int, 'loan_home_tag': int, 'loan_auto_tag': int,
                 'funds_nos': int, 'cc_txn_nos_yr': int, 'u_id': int}
The datatypes were changed. The changes were only minor and most likely
won't affect the final result, but it was done mainly as a best practice. It is
important to understand the data and use the appropriate datatype,
especially when dealing with bigger amounts of data.
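
For completeness, a minimal sketch of applying such a mapping is shown below; it assumes the null-filling step has already run, since plain int columns cannot hold NaN values.

# Sketch: apply the datatype mapping after nulls have been filled.
df_l1 = df_l1.astype(dtype_mapping)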

Doing Data Validation


Based on the earlier finding that c_id is not a unique ID, let's do
some data validation for that column.

By right, c_id is supposed to be a unique value (representing one customer).
But it is possible that c_id is duplicated due to input from multiple
sources or snapshots in time, with the ETL process not being entirely clean.
There are several ways to handle unclean data, but we must first understand
what the unclean data is.

If the unclean data is duplicated data with just one or two column values
differing, we can use a rank method to get the row with the fewest
nulls or the one with the latest data, or even to combine the information.
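
As a hedged sketch of the rank idea, assuming c_id is the identifier column: keep, for each c_id, the row with the fewest nulls.

# Sketch: rank duplicate rows per c_id by how many nulls they contain, keep the most complete one.
df_l1['null_count'] = df_l1.isnull().sum(axis=1)
df_dedup = (df_l1.sort_values('null_count')
                 .drop_duplicates(subset='c_id', keep='first')
                 .drop(columns='null_count'))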

However, in serious cases where the c_id is wrong, the common protocol
would be to go back to the source, check what went wrong there,
and only then figure out how to solve it. Let's have a deeper look into c_id to
see what the issues are.
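
One way to pull up the rows behind the duplicated ids is sketched below.

# Sketch: show all rows whose c_id appears more than once, grouped together for inspection.
dup_rows = df_l1[df_l1.duplicated(subset='c_id', keep=False)]
display(dup_rows.sort_values('c_id').head(20))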

Based on the c_id information and a general read of the data, although it is
possible that c_id contains information from different snapshots, that is
unlikely to be the case, as some of the profiles do not conform to common sense.

For example, customer 99731: one row shows the person is 52 and has a
degree, but another row shows the person is 65 and only has A-Levels.

For this particular case, I will just treat them as different rows since the rows
seem to have no similar trend.

Mapping / Binning of Categorical Features


This part is important because the data might not be categorized according
to industry convention. For instance, in the context of an ultra-wealthy client
list, having a HDB 2 ROOM and HDB 3 ROOM might not make a difference.
They might all be categorized under HDB.

If there are many values for the features, we can either use OneHotEncoding
to split the values into multiple features of binary classification, or we can
use LabelEncoding to label different categories as numbers.

However, I think it is best to discuss with domain experts, come up with
relevant categories, and bin them manually first.
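
As an illustration only (the category values and bin labels below are hypothetical and would normally come from that domain discussion):

# Sketch: manually bin housing types into broader categories, then label-encode for modelling.
hse_bins = {'HDB 2 ROOM': 'HDB', 'HDB 3 ROOM': 'HDB', 'CONDOMINIUM': 'PRIVATE', 'UNKNOWN': 'UNKNOWN'}
df_l1['c_hse_binned'] = df_l1['c_hse'].map(hse_bins).fillna('OTHERS')
df_l1['c_hse_encoded'] = df_l1['c_hse_binned'].astype('category').cat.codes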

df_l1 = df.copy()
df_l1.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True)
new_col_dict = {'pc': 'c_pc', 'incm_typ': 'c_incm_typ', 'gn_occ': 'c_occ',
                'num_prd': 'prod_nos', 'casatd_cnt': 'casa_td_nos', 'mthcasa': 'casa_bal_avg_mth',
                'maxcasa': 'casa_bal_max_yr', 'mincasa': 'casa_bal_min_yr',
                'mthtd': 'td_bal_avg', 'maxtd': 'td_bal_max', 'asset_value': 'asset_tot_val',
                'hl_tag': 'loan_home_tag', 'al_tag': 'loan_auto_tag', 'pur_price': 'prop_pur_price',
                'ut_ave': 'ut_avg', 'maxut': 'ut_max', 'n_funds': 'funds_nos',
                'cc_ave': 'cc_out_bal_avg_mth', 'max_mth_trn_amt': 'cc_txn_amt_max_mth',
                'avg_trn_amt': 'cc_txn_amt_avg_mth', 'ann_trn_amt': 'cc_txn_amt_yr'
                # further mappings for the debit/credit ratio, min monthly txn amount,
                # yearly txn count and credit limit columns are omitted here
                }
df_l1.rename(columns=new_col_dict, inplace=True)
fill_values = {'c_edu': 'Unknown', 'c_hse': 'UNKNOWN', 'c_pc': 0, 'c_incm_typ': 0,
               'c_occ': 'UNKNOWN',
               'casa_td_nos': 0, 'casa_bal_avg_mth': 0, 'casa_bal_max_yr': 0, 'casa_bal_min_yr': 0,
               'td_bal_avg': 0, 'td_bal_max': 0,
               'loan_home_tag': 0, 'loan_auto_tag': 0,
               'ut_avg': 0, 'ut_max': 0, 'funds_nos': 0,
               'cc_txn_amt_max_mth': 0, 'cc_txn_amt_min_mth': 0, 'cc_txn_amt_avg_mth': 0,
               'cc_txn_amt_yr': 0, 'cc_txn_nos_yr': 0, 'cc_lmt': 0}
df_l1.fillna(fill_values, inplace=True)

Summary for EDA Level 1


At the end of EDA Level 1, there should be a clear table outlining how the
nulls were filled, what the new column datatypes are, and whether each column is
numerical / categorical / an identifier. This will make it much easier during the
next phase, where we will use models / statistical methods to get feature
importances and analysis.

EDA Level 2 — Understanding of Transformed Data


To recap: in EDA Level 0, we explored the raw data. In EDA Level 1, we
explored the data even deeper and decided how to transform it.
In EDA Level 2, we will understand the transformed data. We will be using
several tools such as:

Correlation Analysis

IV / WOE Values

Feature Importance from Models

Statistical Tests

Further Data Analysis on Imputed Data

Correlation Analysis
Correlation Analysis allows us to see which features are highly correlated.
Despite it not being a requirement to remove highly correlated features for
tree-based algorithms, as the algorithm will allocate feature importance, it is
still a good practice to check the correlation and remove features which are
highly correlated.

Before we started the correlation analysis, I had already noticed in Level 1 EDA
that cc_txn_amt_avg_mth is actually derived from cc_txn_amt_yr.
As such, I expect a very high correlation between these two features.

numerical_cols = ['c_age', 'prod_nos',
                  'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
                  'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
                  'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
                  'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth',
                  'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
                    'loan_home_tag', 'loan_auto_tag']

# Assuming df_l2 is your transformed DataFrame
correlation_matrix = df_l2[numerical_cols].corr()

# Create the heatmap
plt.figure(figsize=(20, 16))  # Set the size of the plot
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Set title
plt.title('Correlation Heatmap')

# Show the plot
plt.tight_layout()
plt.show()

# Find the max correlation
upper_triangular = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
max_correlation = upper_triangular.max().max()
print(f"Maximum pairwise correlation: {max_correlation:.2f}")

The output is as below.


To get the individual pairwise correlation, we can use the code snippet
below:

def corr_v(df_input, col1, col2):
    correlation_value = df_input[col1].corr(df_input[col2])
    return f"Correlation value between {col1} and {col2} is: {correlation_value}"

print(corr_v(df_l2, 'casa_bal_avg_mth', 'casa_bal_max_yr'))
print(corr_v(df_l2, 'td_bal_avg', 'td_bal_max'))
print(corr_v(df_l2, 'ut_avg', 'ut_max'))
print(corr_v(df_l2, 'cc_txn_amt_max_mth', 'cc_txn_amt_yr'))
print(corr_v(df_l2, 'cc_txn_amt_avg_mth', 'cc_txn_amt_yr'))
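
Building on the correlation matrix above, a hedged sketch for flagging strongly correlated pairs is shown below; the 0.9 threshold is an arbitrary illustration.

# Sketch: list feature pairs whose absolute correlation exceeds a threshold,
# as candidates for dropping one feature from each pair.
threshold = 0.9
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
high_pairs = upper.stack().dropna()
print(high_pairs[high_pairs.abs() > threshold])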

IV / WOE Values

Information Value (IV) quantifies the predictive power of a feature. You may
read up more about it here. The short story is: we are looking for an IV of 0.1 to 0.5.

def iv_woe(data, target, bins=10, show_woe=False):

    # Empty DataFrames
    newDF, woeDF = pd.DataFrame(), pd.DataFrame()

    # Extract column names
    cols = data.columns

    # Run WOE and IV on all the independent variables
    for ivars in cols[~cols.isin([target])]:
        print("Processing variable:", ivars)
        if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars])) > 10):
            binned_x = pd.qcut(data[ivars], bins, duplicates='drop')
            d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
        else:
            d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})

        # Calculate the number of events in each group (bin)
        d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
        d.columns = ['Cutoff', 'N', 'Events']

        # Calculate % of events in each group.
        d['% of Events'] = np.maximum(d['Events'], 0.5) / d['Events'].sum()

        # Calculate the non-events in each group.
        d['Non-Events'] = d['N'] - d['Events']
        # Calculate % of non-events in each group.
        d['% of Non-Events'] = np.maximum(d['Non-Events'], 0.5) / d['Non-Events'].sum()

        # Calculate WOE by taking the natural log of % of events over % of non-events
        d['WoE'] = np.log(d['% of Events'] / d['% of Non-Events'])
        d['IV'] = d['WoE'] * (d['% of Events'] - d['% of Non-Events'])
        d.insert(loc=0, column='Variable', value=ivars)
        print("Information value of " + ivars + " is " + str(round(d['IV'].sum(), 6)))
        temp = pd.DataFrame({"Variable": [ivars], "IV": [d['IV'].sum()]}, columns=["Variable", "IV"])
        newDF = pd.concat([newDF, temp], axis=0)
        woeDF = pd.concat([woeDF, d], axis=0)

        # Show WOE table
        if show_woe == True:
            print(d)
    return newDF, woeDF

numerical_cols = ['c_age', 'prod_nos',
                  'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
                  'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
                  'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
                  'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth',
                  'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
                    'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
all_cols = numerical_cols + categorical_cols + dependent_col

IVDF, woeDF = iv_woe(df_l2[all_cols], 'c_seg_encoded', bins=10, show_woe=True)

sorted_IVDF = IVDF.sort_values(by='IV', ascending=False)


display(sorted_IVDF)

Use the code snippet above to get the IV values. The result is as below.

By right, we are looking for an IV of 0.1 to 0.5. However, we only have two
features within this range, so the results are not satisfactory. For those
features with an IV of zero, it may indicate unusual data distributions or
wrong handling of missing data; there is a high chance it is due to the
imbalance of the data, resulting in a lack of binning. We will keep this in
mind during further analysis.

When there is no binning

When there is binning that enables different cutoffs

Undersampling the majority class might be one way to get a more accurate
study on this, but it is a case-by-case basis.
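
A hedged sketch of such an undersampling check is below, assuming df_l2, all_cols and iv_woe as used above: sample the majority (normal) class down to the size of the minority (affluent) class before re-running iv_woe. The random_state value is arbitrary.

# Sketch: undersample the majority class, then recompute IV / WOE on the balanced frame.
minority = df_l2[df_l2['c_seg_encoded'] == 1]
majority = df_l2[df_l2['c_seg_encoded'] == 0].sample(n=len(minority), random_state=88)
balanced_df = pd.concat([minority, majority])
IVDF_bal, woeDF_bal = iv_woe(balanced_df[all_cols], 'c_seg_encoded', bins=10)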

Feature Importance from Models


We will get the feature importances from various algorithms and see if we
can find any trend from the models. We used Decision Tree Classifier,
Random Forest Classifier, XGBoost, and Logistic Regression.

# Base Settings
df_l2 = df_l1.copy()
numerical_cols = ['c_age', 'prod_nos',
                  'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
                  'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
                  'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
                  'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth',
                  'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
                    'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
independent_col = numerical_cols + categorical_cols
all_cols = numerical_cols + categorical_cols + dependent_col

# Setting the Train / Test Split.
# We will not be doing a Train / Validation / Test split as this is for feature importance only.
from sklearn.model_selection import train_test_split

# Splitting into Training and Holdout Test Sets
# Ensure stratification for now. We will adjust the ratio only later if required.
X_train, X_test, y_train, y_test = train_test_split(df_l2[independent_col], df_l2[dependent_col],
                                                    stratify=df_l2[dependent_col])

# Standard Scaler for Numerical Columns (when necessary), e.g. Logistic Regression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), numerical_cols)],
    remainder='passthrough')  # Pass through categorical features unchanged

X_train_transformed = preprocessor.fit_transform(X_train)
X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=independent_col)
X_test_transformed = preprocessor.transform(X_test)  # transform (not fit_transform) on the test set
X_test_transformed_df = pd.DataFrame(X_test_transformed, columns=independent_col)
y_train_transformed = y_train.values.ravel()
y_test_transformed = y_test.values.ravel()

# Function for getting feature importance sorted.
def feature_importance_sorted(classification_model_input, X_train, y_train, feature_importance_input=None):
    if classification_model_input is not None:
        some_model = classification_model_input
        some_model.fit(X_train, y_train)
        feature_importances = some_model.feature_importances_
    else:
        feature_importances = feature_importance_input
    feature_importances_sorted = sorted(zip(X_train.columns, feature_importances),
                                        key=lambda x: x[1], reverse=True)
    df_feature_importances = pd.DataFrame(feature_importances_sorted, columns=['Feature', 'Importance'])
    for feature_name, importance in feature_importances_sorted:
        print(f"Feature {feature_name}: {importance}")

    df_feature_importances['rank'] = range(1, len(df_feature_importances) + 1)

    return df_feature_importances

# Decision Tree Classifier Feature Importance
from sklearn.tree import DecisionTreeClassifier
dtc_fi = feature_importance_sorted(DecisionTreeClassifier(), X_train, y_train)

# Random Forest Classifier Feature Importance
from sklearn.ensemble import RandomForestClassifier
rfc_fi = feature_importance_sorted(RandomForestClassifier(), X_train, y_train.values.ravel())

# XGB Feature Importance
import xgboost as xgb
xgb_fi = feature_importance_sorted(xgb.XGBClassifier(), X_train, y_train)

# Logistic Regression Feature Importance (coefficients)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train.values.ravel())
feature_importances = lr.coef_[0]  # Assuming binary classification
lr_fi = feature_importance_sorted(None, X_train, y_train.values.ravel(), feature_importances)

From the above code, we can get the individual feature importances from the
models. However, let's rank them for easier reference.

dtc_fi = dtc_fi.rename(columns={'Importance': 'imp_dtc', 'rank': 'rank_dtc'})
rfc_fi = rfc_fi.rename(columns={'Importance': 'imp_rfc', 'rank': 'rank_rfc'})
xgb_fi = xgb_fi.rename(columns={'Importance': 'imp_xgb', 'rank': 'rank_xgb'})
lr_fi = lr_fi.rename(columns={'Importance': 'imp_lr', 'rank': 'rank_lr'})

merged_df = dtc_fi.merge(rfc_fi, on='Feature', how='left')\
                  .merge(xgb_fi, on='Feature', how='left')\
                  .merge(lr_fi, on='Feature', how='left')

merged_df

After we rank it, it is much easier to interpret which features are important
overall and which are not. I colored the features which ranked 20 and beyond
for the different algorithms. We can see that there is some major overlap
between the tree-based / ensemble-based models, whereas for logistic regression
we see a very different result.

Most commonly, I see the credit card features being pointed out as less
significant.
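
One simple way to summarize the ranked table is sketched below; the rank column names follow the renames above.

# Sketch: average the per-model ranks to get a single consensus ranking.
rank_cols = ['rank_dtc', 'rank_rfc', 'rank_xgb', 'rank_lr']
merged_df['rank_avg'] = merged_df[rank_cols].mean(axis=1)
display(merged_df.sort_values('rank_avg'))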

Statistical Tests
We can run individual t-tests to check the difference in distribution of
individual features between affluent and normal customers. I used a significance
level of 0.05 and found the credit card features to be insignificant.

aff_df = df_l2[df_l2['c_seg_encoded']==1]
norm_df = df_l2[df_l2['c_seg_encoded']==0]
norm_df_2 = norm_df.sample(frac=0.2, random_state=88)
# Using a smaller sample of the norm_df, since original norm_df is 5x bigger.
# Don't anticipate much change but just trying.

from scipy.stats import ttest_ind

def individual_t_test(df_1, df_2, listoffeatures, alpha_val):
    '''
    For continuous variable individual t-tests
    '''
    newlist = []
    for feature in listoffeatures:
        fea_1 = df_1[feature]
        fea_2 = df_2[feature]

        t_stat, p_val = ttest_ind(fea_1, fea_2, equal_var=False)

        t_stat1 = f'{t_stat:.3f}'
        p_val1 = f'{p_val:.3f}'

        if p_val < alpha_val:
            sig = 'Significant'
        else:
            sig = 'Insignificant'

        newdict = {'feature': feature, 't_stat': t_stat1,
                   'p_value': p_val1, 'significance': sig}
        newlist.append(newdict)

    df_result = pd.DataFrame(newlist)
    return df_result

individual_t_test(aff_df, norm_df, numerical_cols, 0.05)

individual_t_test(aff_df, norm_df_2, numerical_cols, 0.05)


Example: for cc_txn_nos_yr, the difference between the distribution of credit
card transaction counts per year for affluent customers and the corresponding
distribution for normal customers was found to be not statistically significant.

Data Analysis on Imputed Data


For this, we again use a quick and dirty method of coming up with boxplots
to check between the two different categorical labels.

df_l2 = df_l1.copy()
numerical_cols = ['c_age', 'prod_nos',
                  'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
                  'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
                  'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
                  'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth',
                  'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
                    'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
independent_col = numerical_cols + categorical_cols
all_cols = numerical_cols + categorical_cols + dependent_col

for feature in numerical_cols:
    plt.figure(figsize=(8, 6))
    boxplot = sns.boxplot(x='c_seg_encoded', y=feature, data=df_l2)
    plt.title(f'Box Plot of {feature} by AFFLUENT / NORMAL')

    # Add condition to use log scale if values are greater than 1000
    if df_l2[feature].max() > 1000:
        boxplot.set_yscale('log')

    plt.xlabel('Customer Type')
    plt.ylabel(feature)
    plt.show()
The casa_bal features showed similar differences. asset_tot_val was also
significantly different for AFFLUENT customers, which is common sense.

Summary for EDA Level 2


We should use a confluence of all the points obtained, to come up with
insights or a deeper understanding of the features and data.

Based on the:
1) Information Value
2) Feature Importance from Multiple Algos
3) Statistical Test
4) Further Data Analysis on Imputed Values

I found that credit card related features ranked as the least influential,
whereas total assets and CASA balance were consistently ranked as important
features. Although this information is based on imputed data (which might
carry some error), the finding that assets and CASA balance determine whether
a person is affluent seems rather logical.

——————————————————————

You may find the Jupyter Notebooks as per below:

