0% found this document useful (0 votes)

19 views22 pages

Medium-Com-Data-An

Uploaded by

Proton Chaosrikul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

19 views22 pages

Medium-Com-Data-An

Uploaded by

Proton Chaosrikul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

Search Write Sign up Sign in

Mastering Exploratory Data

Analysis (EDA): Everything You
Need To Know
Sze Zhong LIM · Follow
Published in Data And Beyond · 18 min read · Apr 7, 2024

1K 9

Photo by Andreas Chu on Unsplash

The journey from raw data to actionable insights is often paved with
challenges and uncertainties. At the heart of this journey lies Exploratory
Data Analysis (EDA), a foundational process that serves as a compass,
guiding data scientists through the intricate landscape of their datasets. EDA
is not merely a preliminary step; it is a profound exploration that unlocks the
hidden treasures buried within data, revealing patterns, anomalies, and
relationships that form the bedrock of informed decision-making. Lets start
with, what exactly is EDA?

Free
Exploratory Data Analysis (EDA) is an analytical approach aimed at Membership
uncovering the inherent characteristics of datasets, utilizing statistical and
Distraction-free reading. No ads. Read member-only stories
visualization techniques.
Organize your knowledge with lists and Support writers you read most
highlights.
Earn money for your writing
Sign up to discoverUnlike
human hypothesis-driven analyses driven by prior domain knowledge, EDA
Tell your story. Find your audience.
stories that deepenisyour
a flexible, open-ended exploration that allows data scientists to delve
Listeninto
to audio narrations

understanding of the world.

the data without preconceived notions. Essentially, EDA serves as a Read offline with the Medium app
preliminary step to inspire hypothesis generation by unveiling intriguing
patterns, trends, and correlations within the data. In practical terms, EDA
Sign up for free Try for $5/month
enables us to formulate hypotheses based on data-driven insights, which can
then be tested against hypotheses grounded in domain knowledge, thereby
enriching our understanding and validating our findings.

Exploratory Data Analysis (EDA) typically consists of several key components

or stages that guide data scientists through the process of understanding and
exploring a dataset. These components can vary depending on the specific
goals of the analysis and the characteristics of the data, but commonly
include:
1) Data Collection Top highlight

2) Data Cleaning and Preprocessing

3) Descriptive Statistics
4) Univariate Analysis
5) Bivariate Analysis
6) Multivariate Analysis
7) Feature Engineering
8) Visualization

In this post, I will share my own method of EDA once I have fixed on a
dataset. There will be some differences from project to project, but the
system is there so that most of the items are covered. I will split my EDA into
3 parts:
1) EDA Level 0 — Pure Understanding of Original Data
2) EDA Level 1 — Transformation of Original Data
3) EDA Level 2 — Understanding of Transformed Data

I will be using a some examples from a proper EDA I did. The purpose of this
post is just to share and log the codes used and some of the examples of how EDA
can be done. There might be parts where the insights found do not make sense as
this just a part of a bigger EDA.

EDA Level 0 — Pure Understanding of Original Data

I did a basic check on the column datatype, null counts, distinct values, to
get a better understanding of the data. I also created a distinct values count
dictionary where I go the top 10 counts and their distinct values displayed so
I could roughly gauge how significant the distinct values are in the dataset.

def column_summary(df):
summary_data = []

for col_name in df.columns:

col_dtype = df[col_name].dtype
num_of_nulls = df[col_name].isnull().sum()
num_of_non_nulls = df[col_name].notnull().sum()
num_of_distinct_values = df[col_name].nunique()

if num_of_distinct_values <= 10:

distinct_values_counts = df[col_name].value_counts().to_dict()
else:
top_10_values_counts = df[col_name].value_counts().head(10).to_dict(
distinct_values_counts = {k: v for k, v in sorted(top_10_values_coun

summary_data.append({
'col_name': col_name,
'col_dtype': col_dtype,
'num_of_nulls': num_of_nulls,
'num_of_non_nulls': num_of_non_nulls,
'num_of_distinct_values': num_of_distinct_values,
'distinct_values_counts': distinct_values_counts
})

summary_df = pd.DataFrame(summary_data)
return summary_df

# Example usage:
# Assuming df is your DataFrame
summary_df = column_summary(df)
display(summary_df)
Below is another code snipper that can extract more info.

# Gets additional value such as min / median / max etc.

def column_summary_plus(df):
result_df = pd.DataFrame(columns=['col_name', 'col_dtype', 'num_distinct_val
'min_value', 'max_value',
'median_no_na', 'average_no_na','average_n
'null_present', 'nulls_num', 'non_nulls_nu
'distinct_values'])

# Loop through each column in the DataFrame

for column in df.columns:
print(f"Start processing {column} col with {df[column].dtype} dtype")
# Get column dtype
col_dtype = df[column].dtype
# Get distinct values and their counts
value_counts = df[column].value_counts()
distinct_values = value_counts.index.tolist()
# Get number of distinct values
num_distinct_values = len(distinct_values)
# Get min and max values
sorted_values = sorted(distinct_values)
min_value = sorted_values[0] if sorted_values else None
max_value = sorted_values[-1] if sorted_values else None

# Get median value

non_distinct_val_list = sorted(df[column].dropna().tolist())
len_non_d_list = len(non_distinct_val_list)
if len(non_distinct_val_list) == 0:
median = None
else:
median = non_distinct_val_list[len_non_d_list//2]

# Get average value if value is number

if np.issubdtype(df[column].dtype, np.number):
if len(non_distinct_val_list) > 0:
average = sum(non_distinct_val_list)/len_non_d_list
non_zero_val_list = [v for v in non_distinct_val_list if v > 0]
average_non_zero = sum(non_zero_val_list)/len_non_d_list
else:
average = None
average_non_zero = None
else:
average = None
average_non_zero = None

# Check if null values are present

null_present = 1 if df[column].isnull().any() else 0

# Get number of nulls and non-nulls

num_nulls = df[column].isnull().sum()
num_non_nulls = df[column].notnull().sum()

# Distinct_values only take top 10 distinct values count

top_10_d_v = value_counts.head(10).index.tolist()
top_10_c = value_counts.head(10).tolist()
top_10_d_v_dict = dict(zip(top_10_d_v,top_10_c))

# Append the information to the result DataFrame

result_df = result_df.append({'col_name': column, 'col_dtype': col_dtype
'min_value': min_value, 'max_value': max_v
'median_no_na': median, 'average_no_na': a
'null_present': null_present, 'nulls_num':
'distinct_values': top_10_d_v_dict}, ignor

return result_df

# Example usage:
# Assuming df is your DataFrame
summary_df = column_summary(df)
display(summary_df)
If there are any errors, it is most likely due to datatype of the pandas
dataframe. For those who are converting from one format to another, it is
important to keep the datatype info. One example is saving a pandas
dataframe into a csvfile, and then loading it back again, it is so important to
save the datatype and reload it with the datatype.

### To Save Pandas to CSV

def dtype_to_json(pdf, json_file_path: str) -> dict:
'''
Parameters
----------
pdf : pandas.DataFrame
pandas.DataFrame so we can extract the dtype
json_file_path : str
the json file path location

Returns
-------
Dict
The dtype dictionary used

To create a json file which stores the pandas dtype dictionary for
use when converting back from csv to pandas.DataFrame.
'''
dtype_dict = pdf.dtypes.apply(lambda x: str(x)).to_dict()

with open(json_file_path, 'w') as json_file:

json.dump(dtype_dict, json_file)

return dtype_dict

def download_csv_json(df, mainpath):

csvpath = f"{mainpath}.csv"
jsonfp = f"{mainpath}_dtype.json"

dtypedict = dtype_to_json(df, jsonfp)

df.to_csv(csvpath, index=False)

return csvpath, jsonfp

# Example usage:
download_csv_json(df, "/home/some_dir/file_1")

### To Load CSV to Pandas

def json_to_dtype(jsonfilepath):
with open(jsonfilepath, 'r') as json_file:
loaded_dict = json.load(json_file)
return loaded_dict

def csv_to_pandas(csvpath, jsonpath):

dtypedict = json_to_dtype(jsonpath)
pdf = pd.read_csv(csvpath,dtype=dtypedict)

return pdf

# Example usage:
csvfp = "/home/some_dir/file_1.csv"
jsonfp = "/home/some_dir/file_1_dtype.json"
df = csv_to_pandas(csvfp, jsonfp)

By doing this check, I think one of the obvious issues is that the C_ID column
was not a primary key, since the number of distinct values is not equal to the
number of non-nulls.

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df.head())
print(df.describe())
print(df.duplicated().sum())

I will also run the 3 lines above to:

1) Get an idea how the dataset looks like
2) Know the median / mean / rough statistical distribution
3) Check that there are no duplicated rows.
I will then run a fast and dirty check on the distributions using the code
below. The code should product histogram charts :

# Identify numerical columns

numerical_columns = df.select_dtypes(include=[np.number]).columns

# Perform univariate analysis on numerical columns

for column in numerical_columns:
# For continuous variables
if len(df[column].unique()) > 10: # Assuming if unique values > 10, conside
plt.figure(figsize=(8, 6))
sns.histplot(df[column], kde=True)
plt.title(f'Histogram of {column}')
plt.xlabel(column)
plt.ylabel('Frequency')
plt.show()
else: # For discrete or ordinal variables
plt.figure(figsize=(8, 6))
ax = sns.countplot(x=column, data=df)
plt.title(f'Count of {column}')
plt.xlabel(column)
plt.ylabel('Count')

# Annotate each bar with its count

for p in ax.patches:
ax.annotate(format(p.get_height(), '.0f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center',
xytext = (0, 5),
textcoords = 'offset points')
plt.show()
Basically, what we want to do is get a rough idea of what the data is like, so
that we can see if there are any inconsistencies and assumptions we might
need to consider before attempting to transform the data. After our fast and
dirty check, we will proceed towards EDA Level 1.

EDA Level 1 — Transformation of Original Data

Based on the Level 0 information. I decided to transform the dataset before
doing a deeper exploration on the dataset for more insights.
1) I changed the column names to all be in small letters and spaces to be
changed to underscore. I also changed it to names that I feel are more
generic and categorized, for easy interpretation.
2) I filled in the empty null / NaN rows with values I feel make sense. (Will
show some examples: dr_cr_ratio, prop_pur_price, cc_out_bal_ave_mth)
3) I changed the datatype to be more appropriate.
4) Do data validation
5) Mapping / Binning of Categorical Features

Changing Column Names

I decided to change the column names to be more readable and
standardized.

All small letters

All spaces changed to underscores

Shortforms standardized

### Rename the column names for familiarity

# This is if there is no requirement to use back the same column names.
# This is also only done if there is no pre-existing format, or if the col names
# Normally will follow feature mart / dept format to name columns for easy under

I find renaming columns an important part because when I do this, I am

able to categorize columns accordingly and when analyzing features at a
later stage, having clear column names really help in quickly identifying the
points. Eg. when I see features with the cc prefix, I know immediately it is a
credit card feature. This allows me to identify trends along the way.

Filling Up Nulls / NANs

Deciding how to fill up the null values is one of the crucial parts of ensuring
the model predicts accurately. We can only fill up the null values if we have
domain understanding of the feature and also an understanding of the
figures within the dataset. For me, this part usually takes up the most
amount of time from my end, as it really requires trying to understand the
data beforehand.

We can use several methods to figure out how to explore the data and fill up
the null values. I will give two examples:

USING A BOXPLOT

sns.set(style="whitegrid")

# Create the boxplot

plt.figure(figsize=(10, 6)) # Set the size of the plot
sns.boxplot(x='c_incm_typ', y='casa_bal_max_yr', data=df_l1)

# Set labels and title

plt.xlabel('Income Type')
plt.ylabel('casa_bal_max_yr')
plt.title('Boxplot of casa_bal_max_yr by Income Type')
plt.yscale('log')

# Show the plot

plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()

I was trying to find if there was any relationship between casa_bal_max_yr

with Income Type, such that I can use casa_bal_max_yr to define the
person’s Income Type. Since there is no significant result, and the overlap
among the IQR is too huge, we will use 0 as a separate category instead of
deciding to classify as 1 to 8.

USING DESCRIBE

The describe method provides clear understanding of the basic statistical

information. In the below case, I was trying to decide check if all home loans
have property purchase price value. This is because by common sense, if
they took up a home loan, they will have a property, and that property will
have a value.

new_df = df_l1[['prop_pur_price','loan_home_tag']]
null_loan_home = new_df[new_df['loan_home_tag'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['prop_pur_price']].isnull().any
print("Number of rows where 'loan_home_tag' is null, but 'prop_pur_price' is not

new_df = df_l1[['prop_pur_price','loan_home_tag']]
null_loan_home = new_df[new_df['prop_pur_price'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['loan_home_tag']].isnull().any(
print("Number of rows where 'prop_pur_price' is null, but 'loan_home_tag' is not

new_df = df_l1[['prop_pur_price','loan_home_tag']]
condition = new_df['loan_home_tag'] == 1
new_df[condition].describe()

We can see that there were 5460 customers who had a property purchase
price but did not take up a loan, and that there were 2243 customers who had
taken up a loan but had no property purchase price.

We then used the describe method to see what is the distribution of property
purchase prices for those who have taken a home loan, to see how to impute
the data. For this particular case, I decided to use the median to impute the
nulls in the property purchase price.

Changing the Data Type

dtype_mapping = {'c_id': str, 'c_age': int, 'c_pc': int, 'c_incm_typ': int, 'pro
'casa_td_nos': int, 'loan_home_tag': int, 'loan_auto_tag': int,
'funds_nos': int, 'cc_txn_nos_yr': int, 'u_id': int}
The datatypes were changed. The changes were only minor and most likely
won’t affect the final result but it was done mainly as a best practice. It is
important to understand the data and use the appropriate datatype
especially when dealing with bigger amounts of data.

Doing Data Validation

Based on the highlight of the lack of unique ID for column c_id , let’s do
some data validation for that column.

By right c_id is supposed to be a unique value (representing one customer).

But it is possible that the c_id is duplicated due to input from multiple
sources or snapshots in time, and the ETL process not being entirely clean.
There are several ways to handle unclean data. But we must first understand
what is the unclean data.

If the unclean data is duplicated data with just one or two column values
difference, we can use a rank method to get the row with the least amount of
nulls or the one with the latest data, or even to combine the information.

However, in serious cases where the c_id is wrong, the common protocol
would be to go back to the source and check what went wrong in the source
and then only figure out how to solve it. Lets have a deeper look into c_id to
see what are the issues.

Based on the c_id information and a general read into the data, despite there
being the possibility that the c_id contain information from different
snapshots, it is also unlikely that is the case as some of the profiles do not
conform to common sense.

Such as customer 99731, where one of it shows the person is 52 and has a
degree, but another row shows the person is 65 and only has A-Levels.

For this particular case, I will just take it as different rows since the rows
seems to have no similar trend.

Mapping / Binning of Categorical Features

This part is important because the data might not be categorized according
to industry convention. For instance, in the context of an ultra-wealthy client
list, having a HDB 2 ROOM and HDB 3 ROOM might not make a difference.
They might all be categorized under HDB.

If there are many values for the features, we can either use OneHotEncoding
to split the values into multiple features of binary classification, or we can
use LabelEncoding to label different categories as numbers.

However, I think it best to discuss with domain experts and come up with
relevant categories and bin it manually first.

df_l1 = df.copy()
df_l1.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True)
new_col_dict = {'pc': 'c_pc', 'incm_typ': 'c_incm_typ', 'gn_occ': 'c_occ',
'num_prd': 'prod_nos', 'casatd_cnt': 'casa_td_nos', 'mthcasa':
'maxcasa': 'casa_bal_max_yr', 'mincasa': 'casa_bal_min_yr', 'dr
'mthtd': 'td_bal_avg', 'maxtd': 'td_bal_max', 'asset_value': 'a
'hl_tag': 'loan_home_tag', 'al_tag': 'loan_auto_tag', 'pur_pric
'ut_ave': 'ut_avg', 'maxut': 'ut_max', 'n_funds': 'funds_nos',
'cc_ave': 'cc_out_bal_avg_mth', 'max_mth_trn_amt': 'cc_txn_amt_
'avg_trn_amt': 'cc_txn_amt_avg_mth', 'ann_trn_amt': 'cc_txn_amt
df_l1.rename(columns=new_col_dict, inplace=True)
fill_values = {'c_edu': 'Unknown', 'c_hse': 'UNKNOWN', 'c_pc': 0, 'c_incm_typ':
'c_occ': 'UNKNOWN',
'casa_td_nos': 0, 'casa_bal_avg_mth': 0, 'casa_bal_max_yr': 0, 'c
'td_bal_avg': 0, 'td_bal_max': 0,
'loan_home_tag':0, 'loan_auto_tag': 0,
'ut_avg': 0, 'ut_max': 0, 'funds_nos': 0,
'cc_txn_amt_max_mth': 0, 'cc_txn_amt_min_mth': 0, 'cc_txn_amt_avg
'cc_txn_amt_yr': 0, 'cc_txn_nos_yr': 0, 'cc_lmt': 0}
df_l1.fillna(fill_values, inplace=True)

Summary for EDA Level 1

At the end of EDA Level 1, there should be a clear table outlining how the
nulls were filled, what are the new column datatypes, and whether they are
numerical / categorical / identifiers. This will make it much easier during the
next phase where we will use models / statistical methods to get feature
importances and analysis.

EDA Level 2 — Understanding of Transformed Data

A recap, in EDA Level 0, we explored the raw data. In EDA Level 1, we
explored the data even deeper and came up with decided on how to
transform the data.
In EDA Level 2, we will understand the transformed data. We will be using
several tools such as:

Correlation Analysis

IV / WOE Values

Feature Importance from Models

Statistical Tests

Further Data Analysis on Imputed Data

Correlation Analysis
Correlation Analysis allows us to see which features are highly correlated.
Despite it not being a requirement to remove highly correlated features for
tree-based algorithms, as the algorithm will allocate feature importance, it is
still a good practice to check the correlation and remove features which are
highly correlated.

Before we started the correlation analysis, I had already noticed that

cc_txn_amt_avg_mth is actually derived from cc_txn_amt_yr, in Level 1 EDA.
As such, I will expect a very high correlation from these 2 features.

numerical_cols = ['c_age', 'prod_nos',

'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_ba
'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mt
'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_o
'loan_home_tag', 'loan_auto_tag']

# Assuming df is your DataFrame

correlation_matrix = df_l2[numerical_cols].corr()

# Create the heatmap

plt.figure(figsize=(20, 16)) # Set the size of the plot
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Set title
plt.title('Correlation Heatmap')

# Show the plot

plt.tight_layout()
plt.show()

# Find the max correlation

upper_triangular = correlation_matrix.where(np.triu(np.ones(correlation_matrix.s
max_correlation = upper_triangular.max().max()
print(f"Maximum pairwise correlation: {max_correlation:.2f}")

The output is as below.

To get the individual pairwise correlation, we can use the code snippet
below:

def corr_v(df_input, col1, col2):

correlation_value = df_input[col1].corr(df_input[col2])
return f"Correlation value between {col1} and {col2} is: {correlation_value}

print(corr_v(df_l2, 'casa_bal_avg_mth', 'casa_bal_max_yr'))

print(corr_v(df_l2, 'td_bal_avg', 'td_bal_max'))
print(corr_v(df_l2, 'ut_avg', 'ut_max'))
print(corr_v(df_l2, 'cc_txn_amt_max_mth', 'cc_txn_amt_yr'))
print(corr_v(df_l2, 'cc_txn_amt_avg_mth', 'cc_txn_amt_yr'))

IV / WOE Values
IV / WOE Value

Information Value (IV) quantifies the prediction power of a feature. You may
read up more about it here. Short story is, we are looking for IV of 0.1 to 0.5

def iv_woe(data, target, bins=10, show_woe=False):

#Empty Dataframe
newDF,woeDF = pd.DataFrame(), pd.DataFrame()

#Extract Column Names

cols = data.columns

#Run WOE and IV on all the independent variables

for ivars in cols[~cols.isin([target])]:
print("Processing variable:", ivars)
if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars]))>1
binned_x = pd.qcut(data[ivars], bins, duplicates='drop')
d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
else:
d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})

# Calculate the number of events in each group (bin)

d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
d.columns = ['Cutoff', 'N', 'Events']

# Calculate % of events in each group.

d['% of Events'] = np.maximum(d['Events'], 0.5) / d['Events'].sum()

# Calculate the non events in each group.

d['Non-Events'] = d['N'] - d['Events']
# Calculate % of non events in each group.
d['% of Non-Events'] = np.maximum(d['Non-Events'], 0.5) / d['Non-Events'

# Calculate WOE by taking natural log of division of % of non-events and

d['WoE'] = np.log(d['% of Events']/d['% of Non-Events'])
d['IV'] = d['WoE'] * (d['% of Events'] - d['% of Non-Events'])
d.insert(loc=0, column='Variable', value=ivars)
print("Information value of " + ivars + " is " + str(round(d['IV'].sum()
temp =pd.DataFrame({"Variable" : [ivars], "IV" : [d['IV'].sum()]}, colum
newDF=pd.concat([newDF,temp], axis=0)
woeDF=pd.concat([woeDF,d], axis=0)

#Show WOE Table

if show_woe == True:
print(d)
return newDF, woeDF

numerical_cols = ['c_age', 'prod_nos',

IVDF, woeDF = iv_woe(df_l2[all_cols], 'c_seg_encoded', bins=10, show_woe=True)

sorted_IVDF = IVDF.sort_values(by='IV', ascending=False)

display(sorted_IVDF)

Use the code snippet above to get the IV value. The result is as below:
By right we are looking for IV of 0.1 to 0.5. However, we only have two
features which are within this range. The results are not satisfactory. For
those IV which are zero,

It may indicate unusual data distributions or wrong handling of missing

data. For those with IV of 0, there is a high chance it is the way it is due to
imbalance of data, resulting in lack of binning. We will keep this in mind
during further analysis.

When there is no binning

When there is binning that enables different cutoffs

Undersampling the majority might be one of the ways to get a more accurate
study on this, but is is a case by case basis.

Feature Importance from Models

We will get the feature importances from various algorithms and see if we
can find any trend from the models. We used Decision Tree Classifier,
Random Forest Classifier, XGBoost, and Logistic Regression.

# Base Settings
df_l2 = df_l1.copy()
numerical_cols = ['c_age', 'prod_nos',
'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_ba
'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mt
'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_o
'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
independent_col = numerical_cols + categorical_cols
all_cols = numerical_cols + categorical_cols + dependent_col

# Settings Train / Test Split.

# We will not be doing Train / Validation / Test split as this is for feature im
from sklearn.model_selection import train_test_split

# Splitting into Training and Holdout Test Sets

# Ensure stratification for now. We will adjust the ratio only later if required
X_train, X_test, y_train, y_test = train_test_split(df_l2[independent_col], df_l
stratify=df_l2[dependent_col

# From Standard Scaler for Numerical Columns (when necessary) Eg. Logistic Regre
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(\
transformers=[('num', StandardScaler(), numerical_cols)],\
remainder='passthrough') # Pass through categorical features unchanged

X_train_transformed = preprocessor.fit_transform(X_train)
X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=independent_c
X_test_transformed = preprocessor.fit_transform(X_test)
X_test_transformed_df = pd.DataFrame(X_test_transformed, columns=independent_col
y_train_transformed = y_train.values.ravel()
y_test_transformed = y_test.values.ravel()

# Function for getting feature importance sorted.

def feature_importance_sorted(classification_model_input, X_train, y_train, feat
if classification_model_input is not None:
some_model = classification_model_input
some_model.fit(X_train, y_train)
feature_importances = some_model.feature_importances_
else:
feature_importances = feature_importance_input
feature_importances_sorted = sorted(zip(X_train.columns, feature_importances
df_feature_importances = pd.DataFrame(feature_importances_sorted, columns=['
for feature_name, importance in feature_importances_sorted:
print(f"Feature {feature_name}: {importance}")

df_feature_importances['rank'] = range(1, len(df_feature_importances)+1)

return df_feature_importances

# Decision Tree Classifier Feature Importance

from sklearn.tree import DecisionTreeClassifier
dtc_fi = feature_importance_sorted(DecisionTreeClassifier(), X_train, y_train)

# Random Forest Classifier Feature Importance

from sklearn.ensemble import RandomForestClassifier
rfc_fi = feature_importance_sorted(RandomForestClassifier(), X_train, y_train.va

# XGB Feature Importance

import xgboost as xgb
xgb_fi = feature_importance_sorted(xgb.XGBClassifier(), X_train, y_train)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train.values.ravel())
feature_importances = lr.coef_[0] # Assuming binary classification
lr_fi = feature_importance_sorted(None, X_train, y_train.values.ravel(), feature

From the above code, we can get the individual feature importance from the
models. However, lets rank it for easier reference.

dtc_fi = dtc_fi.rename(columns={'Importance': 'imp_dtc', 'rank': 'rank_dtc'})

rfc_fi = rfc_fi.rename(columns={'Importance': 'imp_rfc', 'rank': 'rank_rfc'})
xgb_fi = xgb_fi.rename(columns={'Importance': 'imp_xgb', 'rank': 'rank_xgb'})
lr_fi = lr_fi.rename(columns={'Importance': 'imp_lr', 'rank': 'rank_lr'})

merged_df = dtc_fi.merge(rfc_fi, on='Feature', how='left')\

.merge(xgb_fi, on='Feature', how='left')\
.merge(lr_fi, on='Feature', how='left')

merged_df

After we rank it, is is much easier to interpret the overall features which are
important and not important.
I colored the features which ranked 20 and more, for different algorithms.
We can see that there is some major overlap between tree-based /
ensembled-based models. Whereas for logistic regression we see a much
different result.

More commonly I see the credit card features being pointed out as less
significant.

Statistical Tests
We can run an individual t-test to check on the difference in distribution of
individual features for affluent and normal customers. I used a significance
of 0.05 and found that the credit card features were found to be insignificant.

aff_df = df_l2[df_l2['c_seg_encoded']==1]
norm_df = df_l2[df_l2['c_seg_encoded']==0]
norm_df_2 = norm_df.sample(frac=0.2, random_state=88)
# Using a smaller sample of the norm_df, since original norm_df is 5x bigger.
# Don't anticipate much change but just trying.

from scipy.stats import ttest_ind

def individual_t_test(df_1, df_2, listoffeatures, alpha_val):
'''
For continuous variable individual t-tests
'''
newlist = []
for feature in listoffeatures:
fea_1 = df_1[feature]
fea_2 = df_2[feature]

t_stat, p_val = ttest_ind(fea_1, fea_2, equal_var=False)

t_stat1 = f'{t_stat:.3f}'
p_val1 = f'{p_val:.3f}'

if p_val < alpha_val:

sig = 'Significant'
else:
sig = 'Insignificant'

newdict = {'feature': feature, 't_stat': t_stat1,

'p_value': p_val1, 'significance': sig}
newlist.append(newdict)

df_result = pd.DataFrame(newlist)
return df_result

individual_t_test(aff_df, norm_df, numerical_cols, 0.05)

individual_t_test(aff_df, norm_df_2, numerical_cols, 0.05)

Example: For cc_txn_nos_yr, the distribution of credit card transaction
number per year for the affluent and the distribution of credit card
transaction number per year for the normal were found to be not statistically
significant.

Data Analysis on Imputed Data

For this, we again use a quick and dirty method of coming up with boxplots
to check between the two different categorical labels.

df_l2 = df_l1.copy()
numerical_cols = ['c_age', 'prod_nos',
'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_ba
'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mt
'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_o
'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
independent_col = numerical_cols + categorical_cols
all_cols = numerical_cols + categorical_cols + dependent_col

for feature in numerical_cols:

plt.figure(figsize=(8, 6))
boxplot = sns.boxplot(x='c_seg_encoded', y=feature, data=df_l2)
plt.title(f'Box Plot of {feature} by AFFLUENT / NORMAL')

# Add condition to use log scale if values are greater than 1000
if df_l2[feature].max() > 1000:
boxplot.set_yscale('log')

plt.xlabel('Customer Type')
plt.ylabel(feature)
plt.show()
casa_bal features showed similar differences. Asset_tot_val was also
significantly different for AFFLUENT, which is common sense.

Summary for EDA Level 2

We should use a confluence of all the points obtained, to come up with
insights or a deeper understanding of the features and data.

Based on the:
1) Information Value
2) Feature Importance from Multiple Algos
3) Statistical Test
4) Further Data Analysis on Imputed Values

I found that credit card related features ranked as the least influential,
whereas total assets and casa balance were consistently ranked as important
features. Despite these information are based on imputed data (which might
have some error), the fact that assets and casa balance determining whether
a person is affluent or not, seems rather logical.

——————————————————————

You may find the Jupyter Notebooks as per below:

Eda Exploratory Data Analysis Pandas Python

Written by Sze Zhong LIM Follow

336 Followers · Writer for Data And Beyond

More from Sze Zhong LIM and Data And Beyond

Sze Zhong LIM in Data And Beyond Pavan Belagatti in Data And Beyond

Exploring Pyspark.ml for Machine Vector Databases: A Beginner’s

Learning: Crafting Optimal Featur… Guide!
What is Feature Selection? In the age of burgeoning data complexity and
high-dimensional information, traditional…

Oct 28, 2023 64 Aug 25, 2023 1.2K 6

Yennhi95zz in Data And Beyond Sze Zhong LIM in Data And Beyond

How to Use AI for Earning Passive Exploring LIME (Local

Income Interpretable Model-agnostic…
5 Realistic Ways to Make Money With AI Unveiling Interpretability in AI using LIME.
Discussion on it’s benefits, limitations,…
Jul 20 163 6 Jan 6 94

See all from Sze Zhong LIM See all from Data And Beyond

Recommended from Medium

Ritesh Shergill Emmanuel Ikogho

📊📈Creating Amazing Data Science is dying; here’s why

Visualizations with Python Why 85% of data science projects fail
A picture is worth a thousand words

Aug 17 948 10 Sep 3 1.7K 85

Lists

Coding & Development Predictive Modeling w/

11 stories · 872 saves Python
20 stories · 1620 saves

Practical Guides to Machine ChatGPT

Learning 21 stories · 851 saves
10 stories · 1977 saves

Python Fundamentals Ethan Duong

Data Visualization for Exploratory Stop aiming at “Data Analyst”

Data Analysis (EDA) in Python position
Python Data Visualization Guide You won’t be a Data Analyst and that is
ABSOLUTELY FINE!

Feb 25 581 4 Jul 15 794 26

John Vastola Zach Qui… in Pipeline: Your Data Engineering Res…

10 Must-Know Machine Learning 3 Data Science Projects That Got

Algorithms for Data Scientists Me 12 Interviews. And 1 That Got…
Machine learning is the science of getting 3 work samples that got my foot in the door,
computers to act without being explicitly… and 1 that almost got me tossed out.

Dec 7, 2022 556 7 Aug 30, 2022 5K 74

See more recommendations

Help Status About Careers Press Blog Privacy Terms Text to speech Teams

Data Transformation Techniques
No ratings yet
Data Transformation Techniques
13 pages
UNIT 1 Exploratory Data Analysis
100% (3)
UNIT 1 Exploratory Data Analysis
21 pages
Unit - 1 EDA
No ratings yet
Unit - 1 EDA
123 pages
Comptia Data+ Da0-001
No ratings yet
Comptia Data+ Da0-001
10 pages
What Is Exploratory Data Analysis (EDA)
100% (2)
What Is Exploratory Data Analysis (EDA)
13 pages
EDA Unit 1 Notes
No ratings yet
EDA Unit 1 Notes
27 pages
Spooky2 Morgellon Lyme Guide 9.2019
No ratings yet
Spooky2 Morgellon Lyme Guide 9.2019
87 pages
Eda Unit 1
No ratings yet
Eda Unit 1
57 pages
Data Cleaning and Data Transformation
No ratings yet
Data Cleaning and Data Transformation
13 pages
Ceramic Cores For Turbine Blades: A Tooling Perspective: Pradyumna R &baigmah
No ratings yet
Ceramic Cores For Turbine Blades: A Tooling Perspective: Pradyumna R &baigmah
7 pages
4.1 Advanced Data Analysis & Visualization
No ratings yet
4.1 Advanced Data Analysis & Visualization
12 pages
Unit I - Part I Notes
100% (7)
Unit I - Part I Notes
33 pages
XS Series E Appen 7 Installation PDF
No ratings yet
XS Series E Appen 7 Installation PDF
101 pages
Unit 3.2
No ratings yet
Unit 3.2
45 pages
Data Mining Questions and Answer
No ratings yet
Data Mining Questions and Answer
29 pages
UNIT 1 Exploratory Data Analysis
100% (1)
UNIT 1 Exploratory Data Analysis
8 pages
Introduction and History of Tally
No ratings yet
Introduction and History of Tally
3 pages
ADBMS Assignment 3 (C24036)
No ratings yet
ADBMS Assignment 3 (C24036)
15 pages
Unit 2 - Data Munging PDF
No ratings yet
Unit 2 - Data Munging PDF
54 pages
Lead Acid Battery
No ratings yet
Lead Acid Battery
13 pages
Data Cleaning
No ratings yet
Data Cleaning
28 pages
EDA and Cleaning
No ratings yet
EDA and Cleaning
24 pages
Introduction To Ds - 2024
No ratings yet
Introduction To Ds - 2024
25 pages
Ai CH-5 Assignment
No ratings yet
Ai CH-5 Assignment
4 pages
Devish All Unit
No ratings yet
Devish All Unit
42 pages
Slidesgo Unveiling Insights A Comprehensive Guide To Exploratory Data Analysis 20240506074423e4z0
No ratings yet
Slidesgo Unveiling Insights A Comprehensive Guide To Exploratory Data Analysis 20240506074423e4z0
10 pages
Complete Guide To Exploratory Data Analysis With Python Plotly - by Anar Abiyev - Mar, 2022 - Medium
No ratings yet
Complete Guide To Exploratory Data Analysis With Python Plotly - by Anar Abiyev - Mar, 2022 - Medium
11 pages
SyamilFakhruddin - DS - Summary - Data Analysis
No ratings yet
SyamilFakhruddin - DS - Summary - Data Analysis
17 pages
IMPDAV
No ratings yet
IMPDAV
105 pages
Data Preparation
No ratings yet
Data Preparation
21 pages
EDA - Zep
No ratings yet
EDA - Zep
33 pages
Intro
No ratings yet
Intro
26 pages
An Extensive Step by Step Guide To Exploratory Data Analysis
No ratings yet
An Extensive Step by Step Guide To Exploratory Data Analysis
26 pages
Write An 8051 Assembly Program To Generate A 2ms Waveform Wi
No ratings yet
Write An 8051 Assembly Program To Generate A 2ms Waveform Wi
18 pages
Unit - Iii - Eda
No ratings yet
Unit - Iii - Eda
25 pages
CH4 Exploratory Data Analysis
No ratings yet
CH4 Exploratory Data Analysis
12 pages
EDA DeepDive Guide
No ratings yet
EDA DeepDive Guide
3 pages
Unit 1 - Intro To EDA
No ratings yet
Unit 1 - Intro To EDA
40 pages
Week 3
No ratings yet
Week 3
23 pages
Welcome
No ratings yet
Welcome
8 pages
Bio 11 Syllabus
No ratings yet
Bio 11 Syllabus
4 pages
Data Structures
No ratings yet
Data Structures
59 pages
Introduction To Data Science: Data Science Methodology & Data Preparation DR Shuhaida Mohamed Shuhidan Jan 2025
No ratings yet
Introduction To Data Science: Data Science Methodology & Data Preparation DR Shuhaida Mohamed Shuhidan Jan 2025
34 pages
Midterm Solution
No ratings yet
Midterm Solution
8 pages
As Cfe Interop 61850 en PDF
No ratings yet
As Cfe Interop 61850 en PDF
29 pages
DAV Practical 2
No ratings yet
DAV Practical 2
6 pages
Exploratory Data
No ratings yet
Exploratory Data
47 pages
ML Exp No 1
No ratings yet
ML Exp No 1
8 pages
Lecture 2
No ratings yet
Lecture 2
14 pages
Lesson 7 Data Description and Diagnostics
No ratings yet
Lesson 7 Data Description and Diagnostics
14 pages
Document
No ratings yet
Document
21 pages
21ai402 Data Analytics Unit-3
No ratings yet
21ai402 Data Analytics Unit-3
150 pages
Unit 4
No ratings yet
Unit 4
33 pages
Intro To Data Analytics - Cleanup & Transformation
No ratings yet
Intro To Data Analytics - Cleanup & Transformation
30 pages
COS10022 - Lecture 03 - Data Preparation PDF
No ratings yet
COS10022 - Lecture 03 - Data Preparation PDF
61 pages
Datascience Unit-4
No ratings yet
Datascience Unit-4
6 pages
DM Unit 3
No ratings yet
DM Unit 3
15 pages
BI-LEc 3
No ratings yet
BI-LEc 3
24 pages
Eda Sandhya
No ratings yet
Eda Sandhya
7 pages
Dev 1
No ratings yet
Dev 1
2 pages
Unit 1
No ratings yet
Unit 1
19 pages
Data Exploration Preparation
No ratings yet
Data Exploration Preparation
12 pages
Mit - It Cost Model PDF
No ratings yet
Mit - It Cost Model PDF
139 pages
Notes - Unit 1 - Exploratory Data Analysis
No ratings yet
Notes - Unit 1 - Exploratory Data Analysis
33 pages
IOT-Domain Analyst
No ratings yet
IOT-Domain Analyst
11 pages
PWM Signal Generator ESR 1.2
No ratings yet
PWM Signal Generator ESR 1.2
4 pages
Lesson 5 Exploratory Data Analysis
No ratings yet
Lesson 5 Exploratory Data Analysis
10 pages
Dev Answer Key
No ratings yet
Dev Answer Key
21 pages
Introduction To Data Analysis
No ratings yet
Introduction To Data Analysis
94 pages
A Literature Survey On Various Approaches On Content Based Image Search
No ratings yet
A Literature Survey On Various Approaches On Content Based Image Search
6 pages
Exploratory Data Analysis - Satyajit
No ratings yet
Exploratory Data Analysis - Satyajit
35 pages
Dev Core
No ratings yet
Dev Core
7 pages
Brief History of The Relational Model
No ratings yet
Brief History of The Relational Model
5 pages
Guardianeye Monitoring System: Instruction Manual
No ratings yet
Guardianeye Monitoring System: Instruction Manual
37 pages
Mi-280 - Seafarers' Documentation - Filing Agents' Manual
No ratings yet
Mi-280 - Seafarers' Documentation - Filing Agents' Manual
27 pages
AI and Robotics
No ratings yet
AI and Robotics
22 pages
Modeling Transformer Differential Protection With Harmonic Restraint
No ratings yet
Modeling Transformer Differential Protection With Harmonic Restraint
6 pages
Antonio Bratto The Teaching Brain
No ratings yet
Antonio Bratto The Teaching Brain
6 pages
3
No ratings yet
3
45 pages
Mba 4 TH Sem Only
No ratings yet
Mba 4 TH Sem Only
29 pages
Fuchs Overdrive Supreme 50 Manual
No ratings yet
Fuchs Overdrive Supreme 50 Manual
19 pages
Imagine Videotek VSG 4mtg
No ratings yet
Imagine Videotek VSG 4mtg
8 pages
A Survey On OpenFlow-based Software Defined Networks: Security Challenges and Countermeasures
No ratings yet
A Survey On OpenFlow-based Software Defined Networks: Security Challenges and Countermeasures
14 pages
Airphoton Satellite Payloads
No ratings yet
Airphoton Satellite Payloads
18 pages
TX Ty TZ Ty TZ TX TZ TX Ty Matlab Code:: % Symbolizing The Variables
No ratings yet
TX Ty TZ Ty TZ TX TZ TX Ty Matlab Code:: % Symbolizing The Variables
3 pages
CS Project
No ratings yet
CS Project
14 pages
10alytics Data Analyst Track Welcome Kit Cohort 15-1
No ratings yet
10alytics Data Analyst Track Welcome Kit Cohort 15-1
9 pages
Items For Sale
No ratings yet
Items For Sale
1 page
The Killhouse Entry Point Wiki Fandom
No ratings yet
The Killhouse Entry Point Wiki Fandom
1 page
Illuminating Data: A hands on guide to data visualization in R
From Everand
Illuminating Data: A hands on guide to data visualization in R
Eman Ahmad
No ratings yet