Step-by-Step Exploratory Data Analysis (EDA) Using Python
Introduction to EDA
The main objective of this article is to cover the steps involved in data pre-processing,
feature engineering, and the different stages of Exploratory Data Analysis (EDA), which is an
essential step in any research analysis. Data pre-processing, feature engineering, and EDA are
fundamental early steps after data collection. EDA is not limited to visualizing and plotting:
the data is also summarized and manipulated, without prior assumptions, to assess its quality
and to prepare it for building models. This article will guide you through data pre-processing,
feature engineering, and EDA using Python.
Table of contents
What is Data Pre-processing and Feature Engineering?
Step 1: Import Python Libraries
Step 2: Reading Dataset
Step 3: Data Reduction
Step 4: Feature Engineering
Step 5: Creating Features
Step 6: Data Cleaning/Wrangling
Step 7: EDA Exploratory Data Analysis
Step 8: Statistics Summary
Step 9: EDA Univariate Analysis
Step 10: Data Transformation
Step 12: EDA Bivariate Analysis
Step 13: EDA Multivariate Analysis
Step 14: Impute Missing values
Conclusion
Frequently Asked Questions
https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/ 1/20
10/29/23, 9:18 AM Step-by-Step Exploratory Data Analysis (EDA) using Python -
What is Data Pre-processing and Feature Engineering?
Data pre-processing involves cleaning and preparing raw data so that it is ready for feature
engineering. Feature engineering, in turn, applies various techniques to manipulate the data.
This may include adding or removing features, handling missing data, encoding variables, and
dealing with categorical variables, among other tasks.
Feature engineering is a critical task that significantly influences the outcome of a model.
It involves crafting new features based on existing data, while pre-processing primarily
focuses on cleaning and organizing the data.
Step 1: Import Python Libraries
Import all the libraries required for our analysis, such as those for data loading, statistical
analysis, visualization, data transformation, and merges and joins.
Pandas and NumPy are used for data manipulation and numerical calculations.
Matplotlib and Seaborn are used for data visualization.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
Step 2: Reading Dataset
Most datasets are available as tabular data in CSV files, a format that is popular and easy to
access. Using the read_csv() function, a CSV file can be loaded into a pandas DataFrame.
In this article, a dataset for predicting used car prices is used as an example. We analyze the
price of used cars and use EDA to identify the factors influencing it. The data is stored in
the DataFrame data.
data = pd.read_csv("used_cars.csv")
The main goal of data understanding is to gain general insights about the data, including the
number of rows and columns, the values in the data, the data types, and the missing values in
the dataset.
shape: displays the number of observations (rows) and features (columns) in the dataset.
head() and tail() display the first and last five rows of the dataset, respectively.
data.head()
data.tail()
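As a quick illustration of these inspection calls, here is a minimal sketch on a small, hypothetical DataFrame (the rows and values below are made up and only stand in for the used-car data). Note that shape is an attribute rather than a method.

```python
import pandas as pd

# Hypothetical toy data standing in for the used-car dataset
toy = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz", "Audi A4"],
    "Year": [2010, 2015, 2011, 2013],
    "Price": [1.75, 12.5, 4.5, 17.74],
})

n_rows, n_cols = toy.shape   # shape is an attribute, not a method
print(n_rows, n_cols)
print(toy.head(2))           # first two rows
print(toy.tail(2))           # last two rows
```

Passing a number to head() or tail() changes how many rows are shown; without an argument, both return five rows.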
info() helps to understand the data type and information about the data, including the number
of records in each column, whether the data is null or not null, the data type, and the memory
usage of the dataset.
data.info()
data.info() shows that the variables Mileage, Engine, Power, Seats, New_Price, and Price have
missing values. Numeric variables such as Mileage and Power are of data type float64 or int64.
Categorical variables such as Location, Fuel_Type, Transmission, and Owner_Type are of object
data type.
data.nunique()
In our example, data.isnull().sum() is used to get the number of missing records in each
column
data.isnull().sum()
The below code helps to calculate the percentage of missing values in each column
(data.isnull().sum()/(len(data)))*100
The percentage of missing values for the columns New_Price and Price is ~86% and ~17%,
respectively.
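The count and the percentage can also be combined into one table, which is easier to scan when there are many columns. The following is a minimal sketch, not part of the original walkthrough; the helper name missing_summary and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isnull().sum()
    percent = counts / len(df) * 100
    summary = pd.DataFrame({"missing": counts, "percent": percent.round(2)})
    return summary.sort_values("percent", ascending=False)

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "New_Price": [np.nan, np.nan, np.nan, 8.6],
    "Price": [1.75, np.nan, 4.5, 6.0],
    "Year": [2010, 2015, 2011, 2012],
})
print(missing_summary(toy))
```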
Step 3: Data Reduction
In our dataset, the column S.No contains only ID values, so we assume it has no predictive
power for the dependent variable and can be dropped.
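Dropping such a column could look like the sketch below. It runs on a hypothetical toy frame, and the exact spelling of the ID column ("S.No." here) is an assumption that should be checked against data.columns.

```python
import pandas as pd

# Hypothetical toy frame; "S.No." is an assumed spelling of the ID column
toy = pd.DataFrame({
    "S.No.": [0, 1, 2],
    "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz"],
    "Price": [1.75, 12.5, 4.5],
})

# errors="ignore" keeps the call safe if the column was already removed
reduced = toy.drop(columns=["S.No."], errors="ignore")
print(reduced.columns.tolist())
```

drop() returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a variable.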
Step 4: Feature Engineering
We begin feature engineering because we need to add some columns required for the analysis.
Step 5: Creating Features
The age of a car is a contributing factor to its price, but the age is not directly available
when the data records only the year.
We therefore introduce a new column, “Car_Age”, to capture the age of the car.
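One way to derive such a column is to subtract the year from the current year. This is a minimal sketch on hypothetical values; measuring the age from today's date is an assumption, and a fixed reference year may be preferable for reproducible results.

```python
from datetime import date
import pandas as pd

toy = pd.DataFrame({"Year": [2010, 2015, 2019]})  # hypothetical values

current_year = date.today().year  # assumption: age is measured from today
toy["Car_Age"] = current_year - toy["Year"]
print(toy)
```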
Car names by themselves are unlikely to be good predictors of price in our data. However, we
can process this column to extract useful information, namely the brand and model names. Let’s
split the name and introduce the new variables “Brand” and “Model”.
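# The first word of Name is the brand
data['Brand'] = data.Name.str.split().str.get(0)
# The second and third words, joined without a space, form the model
# (NaN if the name has fewer than three words)
data['Model'] = (data.Name.str.split().str.get(1) +
                 data.Name.str.split().str.get(2))
data[['Name','Brand','Model']]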
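Step 6: Data Cleaning/Wrangling
In our example, the brand names ‘Isuzu’ and ‘ISUZU’ refer to the same brand, and ‘Mini’ and ‘Land’ look incorrect because only the first word of the name was kept. These need to be corrected.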
print(data.Brand.unique())
print(data.Brand.nunique())
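The correction itself can be done with a replacement mapping. The sketch below uses a hypothetical toy column, and the full names in the mapping ("Mini Cooper", "Land Rover") are assumptions about what the truncated values stand for.

```python
import pandas as pd

toy = pd.DataFrame({"Brand": ["Isuzu", "ISUZU", "Mini", "Land", "Maruti"]})

# Assumed corrections; adjust the mapping to the names in your own data
brand_fixes = {"ISUZU": "Isuzu", "Mini": "Mini Cooper", "Land": "Land Rover"}
toy["Brand"] = toy["Brand"].replace(brand_fixes)

print(toy["Brand"].unique())
print(toy["Brand"].nunique())
```

Values that are not keys in the mapping are left unchanged.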
We have completed the basic data understanding, feature engineering, and data clean-up. Let’s
move on to the EDA process.
Step 7: EDA Exploratory Data Analysis
EDA can be leveraged to check for outliers, patterns, and trends in the given data.
EDA helps to find meaningful patterns in data.
EDA provides in-depth insights into the data sets to solve our business problems.
EDA gives a clue to impute missing values in the dataset.
Step 8: Statistics Summary
A statistics summary gives a quick and simple description of the data.
It can include the count, mean, standard deviation, median, mode, minimum value, maximum
value, range, etc.
A statistics summary gives a high-level idea of whether the data has any outliers or data
entry errors, and of how the data is distributed, for example whether it is normally
distributed or left/right skewed.
describe(): provides a statistics summary of the columns with numerical data types such as int
and float.
data.describe().T
The summary shows the following:
Year ranges from 1996 to 2019, a wide range which shows that the used cars include both
the latest models and old models.
On average, used cars have been driven ~58k km. The range shows a huge difference
between the min and max values; the max value of 650000 km is evidence of an outlier.
This record can be removed.
The min value of Mileage is 0, but cars won’t be sold with 0 mileage. This looks like a
data entry issue.
It looks like Engine and Power have outliers, and the data is right-skewed.
The average number of seats in a car is 5. The number of seats is an important feature
contributing to price.
The max price of a used car is 160k, which is unusually high for a used car.
There may be an outlier or a data entry issue.
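The article does not show how such outliers are flagged. One common rule of thumb, used here only as an illustration and not as the author's method, is the 1.5 × IQR rule. In the sketch below, the helper name iqr_outliers and all values except the 650000 km maximum are hypothetical.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical kilometre readings plus the extreme value mentioned above
km = pd.Series([41000, 46000, 58000, 72000, 87000, 650000],
               name="Kilometers_Driven")
mask = iqr_outliers(km)
print(km[mask])    # flagged values
print(km[~mask])   # values kept after removing the flagged ones
```

Whether a flagged value is removed, capped, or kept is a judgment call that depends on domain knowledge.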
describe(include='all') provides a statistics summary of all columns, including object, category, etc.
data.describe(include='all').T
Before we do EDA, let’s separate the numerical and categorical variables for easier analysis.
cat_cols=data.select_dtypes(include=['object']).columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)
Step 9: EDA Univariate Analysis
Data visualization is essential; we must decide which charts to plot to better understand the
data. In this article, we visualize our data using the Matplotlib and Seaborn libraries.
Matplotlib is a Python 2D plotting library used to draw basic charts.
Seaborn is also a Python library, built on top of Matplotlib, that uses short lines of code to
create and style statistical plots from pandas and NumPy data.
Univariate analysis can be done for both categorical and numerical variables.
Categorical variables can be visualized using a count plot, bar chart, pie plot, etc.
Numerical variables can be visualized using a histogram, box plot, density plot, etc.
In our example, we have done a univariate analysis using a histogram and a box plot for the
continuous variables. Together they show the pattern of each variable, as some variables have
skewness and outliers.
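The plotting code for this step is not included in the text, so the following is only a sketch of one way to draw a histogram and a box plot side by side. It uses Matplotlib directly (sns.boxplot would work equally well); the function name plot_univariate and the simulated right-skewed sample are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_univariate(df, col):
    """Draw a histogram and a box plot of one numeric column side by side."""
    values = df[col].dropna()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(12, 4))
    ax_hist.hist(values.to_numpy(), bins=20)
    ax_hist.set_xlabel(col)
    ax_hist.set_ylabel("count")
    ax_box.boxplot(values.to_numpy())
    ax_box.set_ylabel(col)
    # Positive skew means a long right tail, negative skew a long left tail
    fig.suptitle(f"{col} (skew = {values.skew():.2f})")
    return fig

# Hypothetical right-skewed sample standing in for a column such as Price
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Price": rng.lognormal(mean=1.5, sigma=0.8, size=500)})
fig = plot_univariate(toy, "Price")
# plt.show()  # uncomment when running interactively
```

To cover every numerical column, the function can be called in a loop over num_cols.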
Price and Kilometers_Driven are right-skewed, so these variables will be transformed; outliers
will be handled during imputation.
Categorical variables are visualized using count plots. They reveal patterns in the factors
influencing car price.
From the count plots, we can make the following observations:
Mumbai has the highest number of cars available for purchase, followed by Hyderabad and Coimbatore.
~53% of cars have Diesel as the fuel type; this may reflect the higher performance
diesel cars provide.
~72% of cars have manual transmission.
~82% of cars are first-owner cars. This suggests that most buyers prefer to purchase
first-owner cars.
~20% of cars belong to the brand Maruti, followed by 19% belonging to Hyundai.
WagonR ranks first among all models available for purchase.
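Percentages like the ones above can be computed without a plot by normalizing the value counts. This is a minimal sketch on a hypothetical six-row sample, so the resulting shares do not match the figures quoted for the real dataset.

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more rows
toy = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "Diesel", "CNG", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual", "Manual", "Manual"],
})

shares = {}
for col in ["Fuel_Type", "Transmission"]:
    # normalize=True returns fractions; multiply by 100 for percentages
    shares[col] = toy[col].value_counts(normalize=True).mul(100).round(1)
    print(shares[col])
```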
Step 10: Data Transformation
Before we proceed to bivariate analysis, note that the univariate analysis showed that some
variables need to be transformed.
The Price and Kilometers_Driven variables are highly skewed and on a larger scale. Let’s apply
a log transformation.
A log transformation reduces skewness and compresses large values, which brings these
variables onto a scale comparable with the other variables:
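The helper log_transform used in this step is not defined in the text. A minimal sketch of what it might look like follows; the "_log" suffix for the new columns and the in-place modification are assumptions, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def log_transform(df, cols):
    """Add a '<col>_log' column holding the natural log of each listed column.

    Assumes all values are strictly positive; zeros or negatives would
    produce -inf or NaN.
    """
    for col in cols:
        df[col + "_log"] = np.log(df[col])
    return df

# Hypothetical values for illustration
toy = pd.DataFrame({"Kilometers_Driven": [1000, 58000, 650000],
                    "Price": [1.0, 9.5, 160.0]})
log_transform(toy, ["Kilometers_Driven", "Price"])
print(toy.round(2))
```

Keeping the original columns alongside the transformed ones makes it easy to compare both versions later.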
log_transform(data,['Kilometers_Driven','Price'])
Step 12: EDA Bivariate Analysis
For numerical variables, pair plots and scatter plots are widely used for bivariate
analysis.
A stacked bar chart can be used for categorical variables if the output variable is
categorical (a classification target). Bar plots can be used if the output variable is
continuous.
In our example, a pair plot is used to show the relationships between pairs of numerical
variables.
# pairplot creates its own figure, so no plt.figure() call is needed;
# only the numeric columns are plotted
sns.pairplot(data=data.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()
The variable Year has a positive correlation with Price and Mileage.
Year has a negative correlation with Kilometers_Driven.
Mileage is negatively correlated with Power: as power increases, mileage decreases.
Cars of a more recent make are priced higher. As the age of the car increases, the price
decreases.
As Engine and Power increase, the price of the car increases.
A bar plot can be used to show the relationship between categorical variables and
continuous variables.
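As an illustration of such a bar plot, the sketch below averages a continuous variable within each category and plots the result. The sample rows and prices are hypothetical, and using the mean as the aggregate is an assumption; the median is a reasonable alternative for skewed data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample; prices are made-up numbers
toy = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual", "Automatic", "Manual"],
    "Price": [4.0, 18.0, 6.0, 22.0, 5.0],
})

# Average price per category, highest first
mean_price = (toy.groupby("Transmission")["Price"]
                 .mean()
                 .sort_values(ascending=False))

ax = mean_price.plot.bar(figsize=(6, 4))
ax.set_ylabel("Mean Price")
ax.set_title("Transmission vs Price")
# plt.show()  # uncomment when running interactively
```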
Observations
Car prices are high in Coimbatore and lower in Kolkata and Jaipur.
Automatic cars are priced higher than manual cars.
Diesel and Electric cars have almost the same price, which is the highest, and LPG cars
have the lowest price.
First-owner cars are priced highest, followed by second-owner cars.
Third-owner cars are priced lower than fourth-and-above-owner cars.
The Lamborghini brand is the highest in price.
The Gallardocoupe model is the highest in price.
2-seater cars have the highest price, followed by 7-seaters.
The latest model cars are high in price.
Step 13: EDA Multivariate Analysis
A heat map shows the correlation between the variables and whether each correlation is
positive or negative.
In our example, the heat map shows the correlation between the numerical variables.
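plt.figure(figsize=(12, 7))
# Correlation is only defined for numeric columns, so select them explicitly
numeric_data = data.drop(['Kilometers_Driven','Price'],axis=1).select_dtypes(include=np.number)
sns.heatmap(numeric_data.corr(), annot = True, vmin = -1, vmax = 1)
plt.show()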
Step 14: Impute Missing values
We cannot always impute the data with a simple mean or median. We need business knowledge
or common insights about the data. Domain knowledge adds value to the imputation, and some
data can be imputed based on assumptions.
In our dataset, we have found missing values in several columns, such as Mileage, Power,
and Seats.
We observed earlier that some observations have zero Mileage. This looks like a data entry
issue. We can fix this by replacing the zero values with null and then filling the nulls with
the mean value of Mileage. Since the mean and median values are nearly the same for this
variable, we chose the mean to impute the values.
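# Treat zero mileage as missing
data.loc[data["Mileage"]==0.0,'Mileage']=np.nan
data.Mileage.isnull().sum()
# Assign the result back instead of using inplace=True on a column selection
data['Mileage']=data['Mileage'].fillna(data['Mileage'].mean())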
Similarly, we impute Seats. As mentioned earlier, we need common insights about the data.
Let’s assume that cars of the same brand and model have nearly the same features, such as
Engine, Mileage, Power, and number of seats. Let’s impute the missing values using the
existing data for the same brand and model:
data.Seats.isnull().sum()
# Fill missing values with the median of cars sharing the same Brand and Model;
# transform() returns a result aligned with the original index
for col in ['Seats','Engine','Power']:
    data[col]=data.groupby(['Brand','Model'])[col].transform(lambda x:x.fillna(x.median()))
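To see what group-wise median imputation does, here is a small sketch on hypothetical rows. It computes the per-group median with transform("median") and then fills the gaps, which is an equivalent way of expressing the same idea.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two models, each with one missing Seats value
toy = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "Toyota", "Toyota", "Toyota"],
    "Model": ["WagonR", "WagonR", "WagonR", "Innova", "Innova", "Innova"],
    "Seats": [5.0, np.nan, 5.0, 7.0, 8.0, np.nan],
})

# Median of the non-missing values within each (Brand, Model) group,
# repeated for every row of that group
group_median = toy.groupby(["Brand", "Model"])["Seats"].transform("median")
toy["Seats"] = toy["Seats"].fillna(group_median)
print(toy)
```

If every value in a group is missing, the group median is also missing, so those rows stay null and need a fallback such as the overall median.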
In general, there are no defined or perfect rules for imputing missing values in a dataset. Each
method can perform better on some datasets and worse on others. Only practice and
experimentation show which works better.
Conclusion
In this article, we analyzed the factors influencing the price of used cars.
Most customers appear to prefer 2-seat cars, as the price of 2-seat cars is higher
than that of other cars.
The price of a car decreases as its age increases.
Customers prefer to purchase first-owner cars rather than second- or third-owner cars.
Due to increased fuel prices, customers may prefer to purchase electric vehicles.
Automatic transmission is easier to drive than manual.
This way, we perform EDA on datasets to explore the data and extract all possible
insights, which can help in model building and better decision making.
However, this was only an overview of how EDA works; you can go deeper into it and
attempt the stages on larger datasets.
If the EDA process is clear and precise, our model is likely to work better and give higher accuracy!
The media shown in this article is not owned by Analytics Vidhya and is used at the
Author’s discretion.