The journey from raw data to actionable insights is often paved with
challenges and uncertainties. At the heart of this journey lies Exploratory
Data Analysis (EDA), a foundational process that serves as a compass,
guiding data scientists through the intricate landscape of their datasets. EDA
is not merely a preliminary step; it is a profound exploration that unlocks the
hidden treasures buried within data, revealing patterns, anomalies, and
relationships that form the bedrock of informed decision-making. Let's start with the question: what exactly is EDA?
Exploratory Data Analysis (EDA) is an analytical approach aimed at uncovering the inherent characteristics of datasets, utilizing statistical and visualization techniques. Unlike hypothesis-driven analyses guided by prior domain knowledge, EDA is a flexible, open-ended exploration that allows data scientists to delve into the data without a fixed hypothesis.
In this post, I will share my own method of EDA once I have settled on a dataset. The details differ from project to project, but having a system ensures that most of the important items are covered. I will split my EDA into 3 parts:
1) EDA Level 0 — Pure Understanding of Original Data
2) EDA Level 1 — Transformation of Original Data
3) EDA Level 2 — Understanding of Transformed Data
I will be using some examples from an actual EDA I did. The purpose of this post is to share and log the code used, along with some examples of how EDA can be done. There might be parts where the insights found do not make sense on their own, as this is just one part of a bigger EDA.
import pandas as pd

def column_summary(df):
    # Summarize each column: dtype, null counts and distinct-value counts
    summary_data = []
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        # Keep only the top 10 value counts for high-cardinality columns
        distinct_values_counts = df[col_name].value_counts().head(10).to_dict()
        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })
    summary_df = pd.DataFrame(summary_data)
    return summary_df

# Example usage:
# Assuming df is your DataFrame
summary_df = column_summary(df)
display(summary_df)
Below is another code snippet that extracts more info; it builds on column_summary by merging in basic numeric statistics (min, max, mean, median) for each column.

def column_summary_plus(df):
    # Add min / max / mean / median for the numeric columns to the summary
    result_df = column_summary(df)
    stats = df.describe().T.reset_index().rename(columns={'index': 'col_name', '50%': 'median'})
    result_df = result_df.merge(stats[['col_name', 'min', 'max', 'mean', 'median']], on='col_name', how='left')
    return result_df

# Example usage:
summary_df = column_summary_plus(df)
display(summary_df)
If there are any errors, they are most likely due to the datatypes of the pandas DataFrame. For those converting from one format to another, it is important to keep the datatype information. For example, when saving a pandas DataFrame to a csv file and loading it back again, it is important to save the datatypes and reload the file with them.
import json

def download_csv_json(pdf, fp_prefix):
    '''
    To create a json file which stores the pandas dtype dictionary for
    use when converting back from csv to pandas.DataFrame.

    Returns
    -------
    Dict
        The dtype dictionary used
    '''
    dtype_dict = pdf.dtypes.apply(lambda x: str(x)).to_dict()
    pdf.to_csv(f"{fp_prefix}.csv", index=False)
    with open(f"{fp_prefix}_dtype.json", 'w') as f:
        json.dump(dtype_dict, f)
    return dtype_dict

# Example usage:
download_csv_json(df, "/home/some_dir/file_1")

def csv_to_pandas(csvfp, jsonfp):
    # Reload the csv with the saved dtype json so the column dtypes are preserved
    with open(jsonfp, 'r') as f:
        dtype_dict = json.load(f)
    pdf = pd.read_csv(csvfp, dtype=dtype_dict)
    return pdf

# Example usage:
csvfp = "/home/some_dir/file_1.csv"
jsonfp = "/home/some_dir/file_1_dtype.json"
df = csv_to_pandas(csvfp, jsonfp)
By doing this check, I think one of the obvious issues is that the C_ID column
was not a primary key, since the number of distinct values is not equal to the
number of non-nulls.
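As a quick programmatic check, a small sketch using the summary_df produced above can flag every column whose distinct-value count does not match its non-null count:

# Columns where distinct values != non-nulls cannot serve as unique identifiers
summary_df[summary_df['num_of_distinct_values'] != summary_df['num_of_non_nulls']]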
# Widen the pandas display limits so wide DataFrames print in full
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

print(df.head())              # Preview the first few rows
print(df.describe())          # Summary statistics for the numeric columns
print(df.duplicated().sum())  # Count of fully duplicated rows
The column-name shortforms were standardized as follows:
df_l1 = df.copy()
df_l1.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True)
# Map the original shortform names to more readable ones
# (some shortform keys, e.g. 'drvcr', 'pur_price' and 'min_mth_trn_amt', are assumed here)
new_col_dict = {'pc': 'c_pc', 'incm_typ': 'c_incm_typ', 'gn_occ': 'c_occ',
                'num_prd': 'prod_nos', 'casatd_cnt': 'casa_td_nos',
                'mthcasa': 'casa_bal_avg_mth', 'maxcasa': 'casa_bal_max_yr',
                'mincasa': 'casa_bal_min_yr', 'drvcr': 'dr_cr_ratio_yr',
                'mthtd': 'td_bal_avg', 'maxtd': 'td_bal_max',
                'asset_value': 'asset_tot_val',
                'hl_tag': 'loan_home_tag', 'al_tag': 'loan_auto_tag',
                'pur_price': 'prop_pur_price',
                'ut_ave': 'ut_avg', 'maxut': 'ut_max', 'n_funds': 'funds_nos',
                'cc_ave': 'cc_out_bal_avg_mth',
                'max_mth_trn_amt': 'cc_txn_amt_max_mth',
                'min_mth_trn_amt': 'cc_txn_amt_min_mth',
                'avg_trn_amt': 'cc_txn_amt_avg_mth', 'ann_trn_amt': 'cc_txn_amt_yr'}
df_l1.rename(columns=new_col_dict, inplace=True)
We can use several methods to explore the data and decide how to fill in the null values. I will give two examples:
USING A BOXPLOT
sns.set(style="whitegrid")
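The rest of the plotting code is a minimal sketch, assuming we want to compare prop_pur_price for customers with and without a home loan tag:

import matplotlib.pyplot as plt
import seaborn as sns

# Boxplot of property purchase price split by home loan tag
plt.figure(figsize=(8, 5))
sns.boxplot(x='loan_home_tag', y='prop_pur_price', data=df_l1)
plt.title('Property Purchase Price by Home Loan Tag')
plt.show()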
USING DESCRIBE
new_df = df_l1[['prop_pur_price', 'loan_home_tag']]
null_loan_home = new_df[new_df['loan_home_tag'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['prop_pur_price']].isnull().any(axis=1)].shape[0]
print(f"Number of rows where 'loan_home_tag' is null, but 'prop_pur_price' is not null: {not_null_count}")

new_df = df_l1[['prop_pur_price', 'loan_home_tag']]
null_loan_home = new_df[new_df['prop_pur_price'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['loan_home_tag']].isnull().any(axis=1)].shape[0]
print(f"Number of rows where 'prop_pur_price' is null, but 'loan_home_tag' is not null: {not_null_count}")

new_df = df_l1[['prop_pur_price', 'loan_home_tag']]
condition = new_df['loan_home_tag'] == 1
new_df[condition].describe()
We can see that there were 5460 customers who had a property purchase
price but did not take up a loan, and that there were 2243 customers who had
taken up a loan but had no property purchase price.
We then used the describe method to see the distribution of property purchase prices for those who have taken a home loan, in order to decide how to impute the data. For this particular case, I decided to use the median to impute the nulls in the property purchase price.
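A minimal sketch of that imputation, assuming the median is taken over customers with loan_home_tag == 1:

# Median property purchase price among customers flagged with a home loan
median_price = df_l1.loc[df_l1['loan_home_tag'] == 1, 'prop_pur_price'].median()
df_l1['prop_pur_price'] = df_l1['prop_pur_price'].fillna(median_price)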
dtype_mapping = {'c_id': str, 'c_age': int, 'c_pc': int, 'c_incm_typ': int, 'prod_nos': int,
                 'casa_td_nos': int, 'loan_home_tag': int, 'loan_auto_tag': int,
                 'funds_nos': int, 'cc_txn_nos_yr': int, 'u_id': int}
# Cast the columns to the mapped datatypes
df_l1 = df_l1.astype(dtype_mapping)
The datatypes were changed. The changes were minor and most likely will not affect the final result, but they were made as a matter of best practice. It is important to understand the data and use the appropriate datatypes, especially when dealing with larger amounts of data.
If the unclean data consists of duplicated rows that differ in only one or two column values, we can use a rank method to keep the row with the fewest nulls or the one with the latest data, or even combine the information across rows, as in the sketch below.
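A minimal sketch of the rank-style deduplication, assuming we keep, for each c_id, the row with the fewest nulls:

# Count nulls per row, keep the least-null row per c_id, then drop the helper column
df_dedup = (df_l1.assign(null_count=df_l1.isnull().sum(axis=1))
                 .sort_values('null_count')
                 .drop_duplicates(subset='c_id', keep='first')
                 .drop(columns='null_count'))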
However, in serious cases where the c_id itself is wrong, the common protocol is to go back to the source, check what went wrong there, and only then figure out how to solve it. Let's have a deeper look into c_id to see what the issues are.
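One simple way to inspect this (a sketch) is to pull up all rows that share a c_id:

# All rows whose c_id appears more than once, grouped together for inspection
dup_ids = df_l1[df_l1.duplicated(subset='c_id', keep=False)].sort_values('c_id')
display(dup_ids.head(20))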
Based on the c_id information and a general read of the data, although it is possible that the duplicated c_id rows contain information from different snapshots, that is unlikely, as some of the profiles do not conform to common sense. For example, for customer 99731, one row shows the person as 52 years old with a degree, while another row shows the person as 65 with only A-Levels. For this particular case, I will treat them as rows for different customers, since the rows show no consistent trend.
If there are many values for a categorical feature, we can either use OneHotEncoding to split the values into multiple binary features, or use LabelEncoding to map the different categories to numbers. However, I think it is best to first discuss with domain experts, come up with relevant categories, and bin the values manually, as in the sketch below.
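A minimal sketch of these options for a column such as c_edu (the specific bin values are assumptions, not the actual categories used):

from sklearn.preprocessing import LabelEncoder

# Option 1: OneHotEncoding, one binary column per category
df_ohe = pd.get_dummies(df_l1, columns=['c_edu'], prefix='c_edu')

# Option 2: LabelEncoding, map each category to an integer
le = LabelEncoder()
df_l1['c_edu_encoded'] = le.fit_transform(df_l1['c_edu'].astype(str))

# Option 3: manual binning into domain-agreed buckets (mapping values are assumed)
edu_bins = {'Unknown': 0, 'A-Levels': 1, 'Diploma': 2, 'Degree': 3, 'Postgraduate': 4}
df_l1['c_edu_encoded'] = df_l1['c_edu'].map(edu_bins).fillna(0).astype(int)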
The remaining nulls in df_l1 were then filled with the following default values:

fill_values = {'c_edu': 'Unknown', 'c_hse': 'UNKNOWN', 'c_pc': 0, 'c_incm_typ': 0,
               'c_occ': 'UNKNOWN',
               'casa_td_nos': 0, 'casa_bal_avg_mth': 0, 'casa_bal_max_yr': 0, 'casa_bal_min_yr': 0,
               'td_bal_avg': 0, 'td_bal_max': 0,
               'loan_home_tag': 0, 'loan_auto_tag': 0,
               'ut_avg': 0, 'ut_max': 0, 'funds_nos': 0,
               'cc_txn_amt_max_mth': 0, 'cc_txn_amt_min_mth': 0, 'cc_txn_amt_avg_mth': 0,
               'cc_txn_amt_yr': 0, 'cc_txn_nos_yr': 0, 'cc_lmt': 0}
df_l1.fillna(fill_values, inplace=True)
To analyze the features, I used three approaches:
1) Correlation Analysis
2) IV / WOE Values
3) Statistical Tests
Correlation Analysis
Correlation analysis allows us to see which features are highly correlated with one another. Removing highly correlated features is not strictly required for tree-based algorithms, since the algorithm will allocate feature importance among them, but it is still good practice to check the correlations and remove features that are strongly correlated.
plt.figure(figsize=(12, 10))
sns.heatmap(df_l1.select_dtypes(include='number').corr(), cmap='coolwarm')
plt.title('Correlation Heatmap')  # Set title
plt.show()
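If we then want to drop one feature from each highly correlated pair, a common sketch looks like this (the 0.9 threshold is an assumption):

import numpy as np

# Absolute correlations, upper triangle only, so each pair is considered once
corr = df_l1.select_dtypes(include='number').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# Candidate columns to drop: correlated above 0.9 with some other feature
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]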
IV / WOE Values
Information Value (IV) quantifies the predictive power of a feature. You may read up more about it here. In short, we are looking for an IV of between 0.1 and 0.5.
# Empty DataFrames to collect the IV and WOE results
newDF, woeDF = pd.DataFrame(), pd.DataFrame()
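The full IV/WOE routine is not reproduced here; below is a minimal sketch of how IV can be computed per feature, assuming a binary target column (1 = affluent):

import numpy as np

def calc_iv(df, feature, target, bins=10):
    # Bin numeric features into deciles; treat everything else as categories
    d = df[[feature, target]].copy()
    if pd.api.types.is_numeric_dtype(d[feature]) and d[feature].nunique() > bins:
        d['bin'] = pd.qcut(d[feature], q=bins, duplicates='drop')
    else:
        d['bin'] = d[feature].astype(str)
    grouped = d.groupby('bin')[target].agg(['count', 'sum'])
    grouped['non_event'] = grouped['count'] - grouped['sum']
    # Small constant avoids division by zero and log(0)
    pct_event = (grouped['sum'] + 0.5) / grouped['sum'].sum()
    pct_non_event = (grouped['non_event'] + 0.5) / grouped['non_event'].sum()
    woe = np.log(pct_event / pct_non_event)
    iv = ((pct_event - pct_non_event) * woe).sum()
    return iv

# Hypothetical usage: IV of each feature against the binary target column
# iv_values = {col: calc_iv(df_l1, col, 'target_col') for col in feature_cols}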
Use the code snippet above to get the IV values. The results are as below. By right, we are looking for an IV between 0.1 and 0.5; however, only two features fall within this range, so the results are not satisfactory. For the features whose IV is zero, undersampling the majority class might be one way to get a more accurate read, but it is a case-by-case decision.
# Base Settings
df_l2 = df_l1.copy()
numerical_cols = ['c_age', 'prod_nos',
                  'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
                  'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
                  'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
                  'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth',
                  'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
                    'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
independent_col = numerical_cols + categorical_cols
all_cols = numerical_cols + categorical_cols + dependent_col
# Standard Scaler for numerical columns (when necessary), e.g. for Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# The train/test split itself was not shown; a typical stratified split is assumed here
X = df_l2[independent_col]
y = df_l2[dependent_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=88, stratify=y)

preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), numerical_cols)],
    remainder='passthrough')  # Pass through categorical features unchanged

X_train_transformed = preprocessor.fit_transform(X_train)
X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=independent_col)
# Use transform (not fit_transform) on the test set to avoid data leakage
X_test_transformed = preprocessor.transform(X_test)
X_test_transformed_df = pd.DataFrame(X_test_transformed, columns=independent_col)
y_train_transformed = y_train.values.ravel()
y_test_transformed = y_test.values.ravel()
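The model-fitting code itself is not shown here; below is a minimal sketch of how feature importances could be collected from two of the model types mentioned (a random forest and a logistic regression), ending in a merged_df like the one referenced below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Tree-based model: feature_importances_ gives one value per feature
rf = RandomForestClassifier(random_state=88)
rf.fit(X_train_transformed_df, y_train_transformed)
rf_imp = pd.Series(rf.feature_importances_, index=independent_col, name='rf_importance')

# Logistic regression: use the absolute coefficient size on the scaled features
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_transformed_df, y_train_transformed)
lr_imp = pd.Series(abs(lr.coef_[0]), index=independent_col, name='lr_importance')

# One importance column per model, indexed by feature name
merged_df = pd.concat([rf_imp, lr_imp], axis=1)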
From the above code, we can get the individual feature importances from the models. However, let's rank them for easier reference.
merged_df
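A sketch of that ranking step, assuming merged_df holds one importance column per model:

# Rank each model's importances (1 = most important) so models can be compared side by side
merged_rank_df = merged_df.rank(ascending=False, method='first').astype(int)
merged_rank_df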
After ranking, it is much easier to interpret which features are important overall and which are not. I colored the features which ranked 20th or worse for the different algorithms. We can see that there is major overlap between the tree-based / ensemble-based models, whereas logistic regression gives a noticeably different result. Most commonly, the credit card features are the ones flagged as less significant.
Statistical Tests
We can run individual t-tests to check for differences in the distribution of each feature between affluent and normal customers. Using a significance level of 0.05, I found the credit card features to be insignificant.
aff_df = df_l2[df_l2['c_seg_encoded'] == 1]
norm_df = df_l2[df_l2['c_seg_encoded'] == 0]
norm_df_2 = norm_df.sample(frac=0.2, random_state=88)
# Using a smaller sample of norm_df, since the original norm_df is about 5x bigger.
# Don't anticipate much change, but just trying.
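A minimal sketch of a helper that runs the per-feature t-tests and collects the results (Welch's t-test is assumed):

from scipy import stats

def t_test_features(df_a, df_b, cols, alpha=0.05):
    # Welch's t-test per feature between the two customer groups
    newlist = []
    for col in cols:
        res = stats.ttest_ind(df_a[col], df_b[col], equal_var=False, nan_policy='omit')
        newlist.append({'feature': col, 'p_value': res.pvalue,
                        'significant': res.pvalue < alpha})
    df_result = pd.DataFrame(newlist)
    return df_result

df_result = t_test_features(aff_df, norm_df_2, numerical_cols)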
# Boxplot of each numerical feature by customer segment (the enclosing loop is assumed)
for feature in numerical_cols:
    plt.figure(figsize=(6, 4))
    boxplot = sns.boxplot(x='c_seg_encoded', y=feature, data=df_l2)
    # Add condition to use log scale if values are greater than 1000
    if df_l2[feature].max() > 1000:
        boxplot.set_yscale('log')
    plt.xlabel('Customer Type')
    plt.ylabel(feature)
    plt.show()
The casa_bal features showed similar differences. Asset_tot_val was also significantly different for AFFLUENT customers, which makes intuitive sense.
Based on the following:
1) Information Value
2) Feature Importance from Multiple Algos
3) Statistical Test
4) Further Data Analysis on Imputed Values
I found that the credit card related features ranked as the least influential, whereas total assets and casa balance were consistently ranked as important features. Although this information is based on imputed data (which might carry some error), the conclusion that assets and casa balance determine whether a person is affluent seems rather logical.