Step-by-Step Exploratory Data Analysis (EDA) Using Python

The document describes the steps for exploratory data analysis (EDA) using Python. It discusses importing libraries, reading in a dataset on used car prices, analyzing the data to understand the number of observations and variables, and checking for missing values. Key steps include loading the data, examining the data types and dimensions, identifying unique and duplicated values, and calculating the number of missing values in each column. The goal of EDA is to better understand the data without making assumptions, which helps with data preprocessing, feature engineering, and building models.

Uploaded by

akshay rs

10/29/23, 9:18 AM Step-by-Step Exploratory Data Analysis (EDA) using Python -

Step-by-Step Exploratory Data Analysis (EDA) using Python

Malamahadevan Mahadevan — Updated On July 13th, 2023

Beginner Data Visualization Guide Python

Introduction to EDA
The main objective of this article is to cover the steps involved in data pre-processing,
feature engineering, and the different stages of Exploratory Data Analysis (EDA), which is an
essential step in any research analysis. Data pre-processing, feature engineering, and EDA are
fundamental early steps after data collection. They are not limited to simply visualizing,
plotting, and manipulating the data without any assumptions; they also help assess the quality
of the data and support building models. This article will guide you through data
pre-processing, feature engineering, and EDA using Python.

This article was published as a part of the Data Science Blogathon.

Table of contents
What is Data Pre-processing and Feature Engineering?
Step 1: Import Python Libraries
Step 2: Reading Dataset
Step 3: Data Reduction
Step 4: Feature Engineering
Step 5: Creating Features
Step 6: Data Cleaning/Wrangling
Step 7: EDA Exploratory Data Analysis
Step 8: Statistics Summary
Step 9: EDA Univariate Analysis
Step 10: Data Transformation
Step 12: EDA Bivariate Analysis
Step 13: EDA Multivariate Analysis
Step 14: Impute Missing values
Conclusion
Frequently Asked Question

https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/ 1/20
What is Data Pre-processing and Feature Engineering?

In our data-driven processes, we consider refining our raw data. Both data pre-processing
and feature engineering play pivotal roles in this endeavor. Data pre-processing
encompasses a range of activities, including data integration, analysis, cleaning,
transformation, and dimension reduction.

Data pre-processing involves cleaning and preparing raw data to facilitate feature
engineering. Meanwhile, feature engineering entails employing various techniques to
manipulate the data. This may include adding or removing relevant features, handling
missing data, encoding variables, and dealing with categorical variables, among other tasks.

Undoubtedly, feature engineering is a critical task that significantly influences the outcome
of a model. It involves crafting new features based on existing data while pre-processing
primarily focuses on cleaning and organizing the data.

Let’s look at how to perform EDA using Python!

Step 1: Import Python Libraries

The first step in ML using Python is understanding and exploring our data with the help of
libraries. Here is the link to the dataset.

Import all the libraries required for our analysis, covering data loading, statistical
analysis, visualizations, data transformations, merges and joins, etc.

Pandas and NumPy are used for data manipulation and numerical calculations.
Matplotlib and Seaborn are used for data visualizations.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

Step 2: Reading Dataset

The Pandas library offers a wide range of possibilities for loading data into a pandas
DataFrame from sources such as JSON, .csv, .xlsx, SQL, .pickle, .html, and delimited .txt files.

Much tabular data is available as CSV files. The format is popular and easy to access.
Using the read_csv() function, a CSV file can be loaded into a pandas DataFrame.

In this article, a dataset for predicting used car prices is used as an example. We analyze
the used cars’ prices and use EDA to identify the factors influencing the car price. We have
stored the data in the DataFrame data.

data = pd.read_csv("used_cars.csv")

Analyzing the Data

Before we make any inferences, we listen to our data by examining all variables in the data.

The main goal of data understanding is to gain general insights about the data, which covers
the number of rows and columns, values in the data, datatypes, and Missing values in the
dataset.

shape – shape will display the number of observations(rows) and features(columns) in the
dataset

There are 7253 observations and 14 variables in our dataset
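
As a minimal sketch of how shape is read, the snippet below uses a tiny made-up DataFrame (the name toy and its values are illustrative assumptions, not the article's dataset). On the real data the same attribute would report the 7253 rows and 14 columns quoted above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the used-car data (values are invented).
toy = pd.DataFrame({
    "Name": ["Maruti Wagon R", "Hyundai Creta 1.6", "Honda Jazz V"],
    "Year": [2010, 2015, 2011],
    "Price": [1.75, 12.5, None],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} variables")

# dtypes lists the data type pandas inferred for each column.
print(toy.dtypes)
```

Note that the missing Price is stored as NaN, which is why that column is inferred as a float.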

head() will display the top 5 observations of the dataset

data.head()

tail() will display the last 5 observations of the dataset

data.tail()

info() helps to understand the data type and information about data, including the number of
records in each column, whether the data has null or non-null values, the data type, and the
memory usage of the dataset

data.info()

data.info() shows that the variables Mileage, Engine, Power, Seats, New_Price, and Price have
missing values. Numeric variables like Mileage and Power are of datatype float64 or int64.
Categorical variables like Location, Fuel_Type, Transmission, and Owner_Type are of object
data type.

Check for Duplication

nunique() returns the number of unique values in each column. Based on these counts and the
data description, we can identify the continuous and categorical columns in the data.
Duplicated data can be handled or removed based on further analysis.

data.nunique()
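
Note that nunique() counts distinct values per column; it does not flag repeated rows. A hedged sketch of a row-level duplicate check, using pandas' duplicated() and drop_duplicates() on a made-up frame (toy is an assumption for illustration), might look like this:

```python
import pandas as pd

# Invented rows; the second row repeats the first one exactly.
toy = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Honda", "Maruti"],
    "Year": [2010, 2010, 2011, 2012],
})

# Distinct values per column (helps separate categorical from continuous columns).
unique_counts = toy.nunique()

# duplicated() marks every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print(unique_counts)
print("Duplicate rows:", n_duplicates)
```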

Missing Values Calculation

isnull() is widely used in pre-processing steps to identify null values in the data.

In our example, data.isnull().sum() is used to get the number of missing records in each
column

data.isnull().sum()


The below code helps to calculate the percentage of missing values in each column

(data.isnull().sum()/(len(data)))*100

The percentage of missing values for the columns New_Price and Price is ~86% and ~17%,
respectively.
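
The same percentages can be obtained in one step, because the mean of a boolean mask is the fraction of True values. The sketch below assumes a small illustrative frame (toy, with invented values) rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Invented data with some missing entries.
toy = pd.DataFrame({
    "Price": [1.0, np.nan, 3.0, np.nan],
    "Seats": [5, 5, np.nan, 7],
    "Year": [2010, 2012, 2015, 2018],
})

# isnull() gives True/False per cell; the column mean is the share of missing cells.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.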

Step 3: Data Reduction

Some columns or variables can be dropped if they do not add value to our analysis.

In our dataset, the column S.No. holds only ID values, so we assume it has no predictive
power for the dependent variable.

# Remove S.No. column from data

data = data.drop(['S.No.'], axis = 1)
data.info()

We start our Feature Engineering as we need to add some columns required for analysis.

Step 4: Feature Engineering

Feature engineering refers to the process of using domain knowledge to select and transform
the most relevant variables from raw data when creating a predictive model using machine
learning or statistical modeling. The main goal of Feature engineering is to create meaningful
data from raw data.

Step 5: Creating Features

We will play around with the variables Year and Name in our dataset. If we look at the sample
data, the column “Year” shows the manufacturing year of the car.

The Age of the car is a contributing factor to Car Price, but it is difficult to read the
car’s age directly from the year format.

We introduce a new column, “Car_Age”, to capture the age of the car.

from datetime import date

date.today().year
data['Car_Age']=date.today().year-data['Year']
data.head()
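
One caveat: because Car_Age is computed from date.today(), the values change depending on when the code runs. A possible alternative, sketched here on invented data, is to pin a reference year (REFERENCE_YEAR is a hypothetical constant, set to 2023 only for illustration) so results are reproducible:

```python
import pandas as pd

# Hypothetical fixed reference year; choose the year the data was collected.
REFERENCE_YEAR = 2023

# Invented manufacturing years.
toy = pd.DataFrame({"Year": [2010, 2015, 2019]})

# Age in years relative to the fixed reference, independent of the run date.
toy["Car_Age"] = REFERENCE_YEAR - toy["Year"]
print(toy)
```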

Car names as such will not be great predictors of the price in our current data. But we can
process this column to extract important information using brand and model names. Let’s
split the name and introduce new variables “Brand” and “Model”

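# Brand = first word of the name
data['Brand'] = data.Name.str.split().str.get(0)

# Model = second and third words of the name, joined without a space
data['Model'] = (data.Name.str.split().str.get(1) +
                 data.Name.str.split().str.get(2))

data[['Name','Brand','Model']]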
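
One thing to watch with this split: when a name has only two words, str.get(2) returns a missing value, so the concatenated Model becomes missing as well. A hedged variation on invented names, which fills the absent third word with an empty string, could look like this (the fillna step is an addition, not part of the article's code):

```python
import pandas as pd

# Invented car names, including one with only two words.
toy = pd.DataFrame({
    "Name": ["Maruti Wagon R LXI", "Audi A4", "Land Rover Range Rover"],
})

parts = toy["Name"].str.split()          # list of words per row
toy["Brand"] = parts.str.get(0)          # first word

# Second word plus third word (if any), joined without a space as in the article.
toy["Model"] = parts.str.get(1) + parts.str.get(2).fillna("")
print(toy)
```

The third row also shows why taking only the first word yields the truncated brand "Land".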

Step 6: Data Cleaning/Wrangling

Some variable names are not relevant or not easy to understand. Some data may have data
entry errors, and some variables may need data type conversion. We need to fix these issues
in the data.

In the example, the brand names ‘Isuzu’/‘ISUZU’, ‘Mini’, and ‘Land’ look incorrect: ‘ISUZU’ duplicates ‘Isuzu’ with different capitalization, and ‘Mini’ and ‘Land’ are truncated. These need to be corrected.

print(data.Brand.unique())
print(data.Brand.nunique())

searchfor = ['Isuzu' ,'ISUZU','Mini','Land']
data[data.Brand.str.contains('|'.join(searchfor))].head(5)

# Standardize the inconsistent and truncated brand names
data["Brand"] = data["Brand"].replace({"ISUZU": "Isuzu", "Mini": "Mini Cooper", "Land": "Land Rover"})

We have done the fundamental data analysis, feature engineering, and data clean-up. Let’s
move to the EDA process.

Voila!! Our data is ready for EDA.

Step 7: EDA Exploratory Data Analysis

Exploratory Data Analysis refers to the crucial process of performing initial investigations on
data to discover patterns and check assumptions with the help of summary statistics and
graphical representations.

EDA can be leveraged to check for outliers, patterns, and trends in the given data.
EDA helps to find meaningful patterns in data.
EDA provides in-depth insights into the data sets to solve our business problems.
EDA gives a clue to impute missing values in the dataset

Step 8: Statistics Summary

The information gives a quick and simple description of the data.

It can include count, mean, standard deviation, median, mode, minimum value, maximum
value, range, etc.

A statistics summary gives a high-level idea of whether the data has any outliers or data
entry errors, and of how the data is distributed, for example normally distributed or
left/right skewed.

In Python, this can be achieved using describe().

describe() provides a statistics summary of the columns belonging to numerical datatypes such
as int and float.

data.describe().T


From the statistics summary, we can infer the below findings :

Year ranges from 1996 to 2019, a wide range which shows that the used cars include both the
latest models and old models.
On average, the Kilometers-Driven of the used cars is ~58k KM. The range shows a huge
difference between min and max; the max value of 650000 KM is evidence of an outlier.
This record can be removed.
The min value of Mileage is 0, but cars won’t be sold with 0 mileage. This sounds like a data
entry issue.
It looks like Engine and Power have outliers, and the data is right-skewed.
The average number of seats in a car is 5. The number of seats is an important feature in
price contribution.
The max price of a used car is 160k, which is unusually high for a used car.
There may be an outlier or data entry issue.

describe(include=’all’) provides a statistics summary of all data, including object, category, etc.

data.describe(include='all').T
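
The outlier remarks above are read off the min/max columns by eye. One common rule of thumb, not used in the article, is to flag values more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch on invented Kilometers_Driven values:

```python
import pandas as pd

# Invented values with one extreme reading, loosely mimicking the 650000 KM record.
km = pd.Series([20000, 35000, 41000, 58000, 60000, 72000, 650000],
               name="Kilometers_Driven")

# Quartiles and interquartile range.
q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidate outliers to inspect, not to delete blindly.
outliers = km[(km < lower) | (km > upper)]
print(outliers)
```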

Before we do EDA, let’s separate Numerical and Categorical variables for easy analysis

cat_cols=data.select_dtypes(include=['object']).columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

Step 9: EDA Univariate Analysis

Analyzing/visualizing the dataset by taking one variable at a time:

Data visualization is essential; we must decide what charts to plot to better understand the
data. In this article, we visualize our data using Matplotlib and Seaborn libraries.

Matplotlib is a Python 2D plotting library used to draw basic charts.

Seaborn is also a Python library, built on top of Matplotlib, that uses short lines of code to
create and style statistical plots from Pandas and NumPy data.

Univariate analysis can be done for both Categorical and Numerical variables.

Categorical variables can be visualized using a Count plot, Bar Chart, Pie Plot, etc.

Numerical Variables can be visualized using Histogram, Box Plot, Density Plot, etc.

In our example, we have done a Univariate analysis using Histogram and Box Plot for
continuous Variables.

In the below fig, a histogram and box plot is used to show the pattern of the variables, as
some variables have skewness and outliers.

for col in num_cols:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()


Price and Kilometers Driven are right-skewed, so these variables will be transformed, and
outliers will be handled during imputation.

Categorical variables are visualized using a count plot. Categorical variables reveal
the pattern of factors influencing car price.

fig, axes = plt.subplots(3, 2, figsize = (18, 18))
fig.suptitle('Bar plot for all categorical variables in the dataset')
sns.countplot(ax = axes[0, 0], x = 'Fuel_Type', data = data, color = 'blue',
              order = data['Fuel_Type'].value_counts().index);
sns.countplot(ax = axes[0, 1], x = 'Transmission', data = data, color = 'blue',
              order = data['Transmission'].value_counts().index);
sns.countplot(ax = axes[1, 0], x = 'Owner_Type', data = data, color = 'blue',
              order = data['Owner_Type'].value_counts().index);
sns.countplot(ax = axes[1, 1], x = 'Location', data = data, color = 'blue',
              order = data['Location'].value_counts().index);
# Brand and Model have many levels, so plot only the 20 most frequent values
sns.countplot(ax = axes[2, 0], x = 'Brand', data = data, color = 'blue',
              order = data['Brand'].value_counts().head(20).index);
sns.countplot(ax = axes[2, 1], x = 'Model', data = data, color = 'blue',
              order = data['Model'].value_counts().head(20).index);
axes[1][1].tick_params(labelrotation=45);
axes[2][0].tick_params(labelrotation=90);
axes[2][1].tick_params(labelrotation=90);
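
The percentages behind a count plot can also be computed directly with value_counts(normalize=True). The sketch below uses an invented Fuel_Type series, so the shares are not those of the real dataset:

```python
import pandas as pd

# Invented fuel types for eight cars.
fuel = pd.Series(
    ["Diesel", "Petrol", "Diesel", "CNG", "Diesel", "Petrol", "Diesel", "Petrol"],
    name="Fuel_Type",
)

# normalize=True returns fractions instead of counts; multiply by 100 for percent.
share = fuel.value_counts(normalize=True).mul(100).round(1)
print(share)
```

value_counts() sorts by frequency, so the first index entry is the most common category.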


From the count plot, we can make the below observations:

Mumbai has the highest number of cars available for purchase, followed by Hyderabad and Coimbatore
~53% of cars have Diesel as fuel type, which may reflect a preference for the performance
of diesel cars
~72% of cars have manual transmission
~82% of cars are first-owner cars. This suggests most buyers prefer to purchase
first-owner cars
~20% of cars belong to the brand Maruti, followed by 19% belonging to Hyundai
WagonR ranks first among all models available for purchase

Step 10: Data Transformation

Before we proceed to bivariate analysis, note that the univariate analysis showed that some
variables need to be transformed.

The Price and Kilometers-Driven variables are highly skewed and on a larger scale. Let’s do a
log transformation.

A log transformation reduces skewness and compresses the scale, so these variables stay on a
scale comparable with the other variables:


# Function for log transformation of the column
def log_transform(data, col):
    for colname in col:
        # log(0) is undefined, so shift by 1 when the column contains zeros
        if (data[colname] == 0).any():
            data[colname + '_log'] = np.log(data[colname] + 1)
        else:
            data[colname + '_log'] = np.log(data[colname])
    data.info()

log_transform(data, ['Kilometers_Driven', 'Price'])

# Distribution of the log-transformed feature 'Kilometers_Driven'
# (histplot replaces the deprecated distplot)
sns.histplot(data["Kilometers_Driven_log"], kde=True);
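
To see the effect of the transformation numerically, one can compare the skewness before and after. The sketch below uses invented values and np.log1p, which computes log(1 + x) and therefore also handles zeros; it is an illustration rather than the article's exact function:

```python
import numpy as np
import pandas as pd

# Invented right-skewed values with one very large reading.
km = pd.Series([20000, 35000, 41000, 58000, 60000, 72000, 650000],
               name="Kilometers_Driven")

# log1p(x) = log(1 + x); defined at x = 0, unlike log(x).
km_log = np.log1p(km)

# Sample skewness before and after the transformation.
skew_before = km.skew()
skew_after = km_log.skew()
print("Skew before:", round(skew_before, 2))
print("Skew after :", round(skew_after, 2))
```

The log compresses the long right tail, so the skewness of the transformed values is smaller than that of the raw values.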

Step 12: EDA Bivariate Analysis

Now, let’s move ahead with bivariate analysis. Bivariate Analysis helps to understand how
variables are related to each other and the relationship between dependent and
independent variables present in the dataset.

For numerical variables, pair plots and scatter plots are widely used for bivariate
analysis.

A stacked bar chart can be used for categorical variables if the output variable is categorical
(a classification target). Bar plots can be used if the output variable is continuous.

In our example, a pair plot has been used to show the relationship between pairs of numerical
variables.

plt.figure(figsize=(13,17))
sns.pairplot(data=data.drop(['Kilometers_Driven','Price'],axis=1))
plt.show()

Pair Plot provides below insights:

The variable Year has a positive correlation with Price and Mileage
Year has a negative correlation with Kilometers-Driven
Mileage is negatively correlated with Power: as power increases, mileage decreases
Cars of recent make are higher in price. As the age of the car increases, the price decreases
As Engine and Power increase, the price of the car increases

A bar plot can be used to show the relationship between Categorical variables and
continuous variables


fig, axarr = plt.subplots(4, 2, figsize=(12, 18))

data.groupby('Location')['Price_log'].mean().sort_values(ascending=False).plot.bar(
    ax=axarr[0][0], fontsize=12)
axarr[0][0].set_title("Location Vs Price", fontsize=18)
data.groupby('Transmission')['Price_log'].mean().sort_values(ascending=False).plot.bar(
    ax=axarr[0][1], fontsize=12)
axarr[0][1].set_title("Transmission Vs Price", fontsize=18)
data.groupby('Fuel_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(
    ax=axarr[1][0], fontsize=12)
axarr[1][0].set_title("Fuel_Type Vs Price", fontsize=18)
data.groupby('Owner_Type')['Price_log'].mean().sort_values(ascending=False).plot.bar(
    ax=axarr[1][1], fontsize=12)
axarr[1][1].set_title("Owner_Type Vs Price", fontsize=18)
data.groupby('Brand')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(
    ax=axarr[2][0], fontsize=12)
axarr[2][0].set_title("Brand Vs Price", fontsize=18)
data.groupby('Model')['Price_log'].mean().sort_values(ascending=False).head(10).plot.bar(
    ax=axarr[2][1], fontsize=12)
axarr[2][1].set_title("Model Vs Price", fontsize=18)
data.groupby('Seats')['Price_log'].mean().sort_values(ascending=False).plot.bar(
    ax=axarr[3][0], fontsize=12)
axarr[3][0].set_title("Seats Vs Price", fontsize=18)
data.groupby('Car_Age')['Price_log'].mean().sort_values(ascending=False).plot.bar(
    ax=axarr[3][1], fontsize=12)
axarr[3][1].set_title("Car_Age Vs Price", fontsize=18)
plt.subplots_adjust(hspace=1.0)
plt.subplots_adjust(wspace=.5)
sns.despine()
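
Each bar above is a group mean of Price_log. Since the mean of logs is the log of the geometric mean, exponentiating a bar gives the geometric mean price of that group, which is less sensitive to a few very expensive cars than the arithmetic mean. A small sketch with invented prices:

```python
import numpy as np
import pandas as pd

# Invented prices for two transmission types.
toy = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual", "Automatic"],
    "Price": [4.0, 10.0, 9.0, 40.0],
})
toy["Price_log"] = np.log(toy["Price"])

# Mean of the log price per group, highest first (what the bar plots show).
mean_log = (toy.groupby("Transmission")["Price_log"]
               .mean()
               .sort_values(ascending=False))

# Back on the original scale: exp(mean of logs) is the geometric mean.
geometric_mean = np.exp(mean_log)

# Arithmetic mean for comparison.
arithmetic_mean = toy.groupby("Transmission")["Price"].mean()
print(geometric_mean)
print(arithmetic_mean)
```

The ordering of groups can differ between the two means when a group contains extreme prices, which is worth keeping in mind when reading the observations.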
Observations
The price of cars is high in Coimbatore and lower in Kolkata and Jaipur
Automatic cars are priced higher than manual cars
Diesel and Electric cars have almost the same price, which is the highest, and LPG cars
have the lowest price
First-owner cars are the highest in price, followed by second-owner cars
Third-owner cars are priced lower than Fourth & Above owner cars
The Lamborghini brand is the highest in price
The Gallardocoupe model is the highest in price
2-seater cars have the highest price, followed by 7-seaters
The latest model cars are high in price
Step 13: EDA Multivariate Analysis

As the name suggests, multivariate analysis looks at more than two variables. Multivariate
analysis is one of the most useful methods to determine relationships and analyze patterns
for any dataset.

A heat map is widely used for multivariate analysis.

A heat map shows the correlation between each pair of variables, and whether that correlation
is positive or negative.

In our example, the heat map shows the correlation between the variables.

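plt.figure(figsize=(12, 7))
# numeric_only=True (pandas 1.5+) limits the correlation to numeric columns;
# pandas 2.0+ raises an error if text columns are passed to corr() without it
sns.heatmap(data.drop(['Kilometers_Driven','Price'],axis=1).corr(numeric_only=True),
            annot = True, vmin = -1, vmax = 1)
plt.show()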
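
As a sketch of what the heat map is built from, the correlation matrix can be inspected directly. The frame below is invented; numeric_only=True (available in pandas 1.5 and later) skips the text column:

```python
import pandas as pd

# Invented values: bigger engines come with more power and lower mileage.
toy = pd.DataFrame({
    "Engine": [1000, 1200, 1500, 2000, 3000],
    "Power": [60, 75, 100, 150, 250],
    "Mileage": [24, 22, 18, 15, 10],
    "Fuel_Type": ["Petrol", "Petrol", "Diesel", "Diesel", "Diesel"],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, and the sign gives its direction.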

From the heat map, we can infer the following:

Engine has a strong positive correlation with Power (0.86)
Price has a positive correlation with Engine (0.69) as well as Power (0.77)
Mileage is negatively correlated with Engine, Power, and Price
Price has a moderately positive correlation with Year
Kilometers-Driven has a negative correlation with Year and not much impact on Price
Car Age has a negative correlation with Price
Car Age is positively correlated with Kilometers-Driven: as the age of the car increases,
the kilometers driven also increase
Car Age has a negative correlation with Mileage, which makes sense

Step 14: Impute Missing values
Missing data arise in almost all statistical analyses. There are many ways to impute missing
values: we can impute the missing values with their mean, median, most frequent, or zero
values, or use advanced imputation algorithms like KNN, Regularization, etc.

We cannot always impute the data with a simple mean/median. We need business knowledge
or common insights about the data. If we have domain knowledge, it will add value to the
imputation. Some data can be imputed on assumptions.

In our dataset, we have found there are missing values for many columns like Mileage, Power,
and Seats.

We observed earlier that some observations have zero Mileage. This looks like a data entry
issue. We can fix this by replacing the zeros with null values and then filling the nulls with
the mean value of Mileage. Since the mean and median values are nearly the same for this
variable, we chose the mean to impute the values.

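# Treat zero mileage as missing, then count the missing values
data.loc[data["Mileage"]==0.0,'Mileage']=np.nan
data.Mileage.isnull().sum()

# Impute the missing values with the mean of the column
data['Mileage'] = data['Mileage'].fillna(data['Mileage'].mean())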

Similarly, we impute Seats. As mentioned earlier, we need common insights about the data.

Let’s assume that cars of the same brand and model have nearly the same Engine, Mileage,
Power, and number of seats. Let’s impute the missing values from the existing data of the
same brand and model:

data.Seats.isnull().sum()
# Fill missing values with the median of the same Brand and Model;
# transform() keeps the original index, so the result aligns with data
data['Seats'] = data['Seats'].fillna(data.groupby(['Brand','Model'])['Seats'].transform('median'))
data['Engine'] = data['Engine'].fillna(data.groupby(['Brand','Model'])['Engine'].transform('median'))
data['Power'] = data['Power'].fillna(data.groupby(['Brand','Model'])['Power'].transform('median'))
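
If every row of a Brand/Model group is missing, the group median is itself missing and those rows stay empty. A hedged sketch on invented data, with a fallback to the overall median (the fallback is an assumption, not something the article does):

```python
import numpy as np
import pandas as pd

# Invented rows: one WagonR has no Seats value, and the only A4 has none either.
toy = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "Audi"],
    "Model": ["WagonR", "WagonR", "WagonR", "A4"],
    "Seats": [5.0, np.nan, 5.0, np.nan],
})

# Median of each Brand/Model group, broadcast back to the original rows.
group_median = toy.groupby(["Brand", "Model"])["Seats"].transform("median")
toy["Seats"] = toy["Seats"].fillna(group_median)

# The Audi A4 group had no observed value, so it is still missing here.
still_missing = int(toy["Seats"].isnull().sum())

# Assumed fallback: use the overall median for groups without any observed value.
toy["Seats"] = toy["Seats"].fillna(toy["Seats"].median())
print(toy)
```

Whether an overall median is an acceptable fallback depends on the variable; for some columns leaving the value missing may be the safer choice.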

In general, there are no defined or perfect rules for imputing missing values in a dataset. Each
method can perform better for some datasets and worse for others. Only practice
and experiments show which approach works better.

Conclusion
In this article, we tried to analyze the factors influencing the used car’s price.

Data analysis helped us find the basic structure of the dataset.

We dropped columns that do not add value to our analysis.
We performed feature engineering by adding some columns which contribute to our
analysis.
Data transformations were used to normalize the columns.
We used different visualizations for EDA: univariate, bivariate, and multivariate
analysis.

Through EDA, we got useful insights. Below are the factors influencing the price of the
car and a few takeaways:

Customers appear to prefer 2-seat cars, hence the price of 2-seat cars is higher
than other cars.
The price of the car decreases as the age of the car increases.
Customers prefer to purchase first-owner cars rather than second- or third-owner cars.
Due to increased fuel prices, customers prefer to purchase electric vehicles.
Automatic cars, which are easier to drive than manual cars, are priced higher.

This way, we perform EDA on the datasets to explore the data and extract all possible
insights, which can help in model building and better decision making.

However, this was only an overview of how EDA works; you can go deeper into it and
attempt the stages on larger datasets.

If the EDA process is clear and precise, our model is more likely to work well and give higher accuracy!

Frequently Asked Question

Q1. What is EDA with Python?
A. Exploratory Data Analysis (EDA) with Python involves analyzing and summarizing data to
gain insights and understand its underlying patterns, relationships, and distributions using
Python programming language.

Q2. How to make EDA in Python?

A. To perform EDA in Python, you can use libraries like Pandas, NumPy, Matplotlib, and
Seaborn. These libraries provide functions and tools for data manipulation, visualization, and
statistical analysis, which facilitate the process of exploring and understanding the data.

Q3. Which is the best EDA tool in Python?

A. The best EDA tool in Python depends on your specific requirements and preferences.
Popular options include Jupyter Notebook (with the aforementioned libraries) and Plotly.
Tableau and Power BI are also widely used for exploration, although they are standalone BI
tools rather than Python libraries. Each tool offers unique features and capabilities, so
it’s advisable to explore them and choose the one that suits your needs best.

Q4. How to perform EDA in machine learning?

A. Performing EDA in machine learning typically involves preprocessing the data by handling
missing values, outliers, and feature scaling. Then, various statistical and visual techniques
can be employed to analyze the relationships between variables, identify patterns, and
assess the relevance of features. This helps in gaining a better understanding of the data
before building a machine learning model.
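
The three preprocessing steps named above can be sketched as follows. This is one common set of choices (median imputation, the 1.5 × IQR rule, and min-max scaling), not necessarily the ones used in this article; the values are made up and `Kilometers_Driven` is an assumed column name.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; "Kilometers_Driven" is an assumed column name.
df = pd.DataFrame(
    {"Kilometers_Driven": [30_000, 45_000, np.nan, 52_000, 41_000, 650_000]}
)
col = "Kilometers_Driven"

# 1. Missing values: fill with the median, which is robust to extreme values.
df[col] = df[col].fillna(df[col].median())

# 2. Outliers: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)

# 3. Feature scaling: min-max scale the column to the range [0, 1].
df["km_scaled"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
```

Note that min-max scaling is sensitive to the extreme value flagged in step 2, so in practice outliers are often capped, transformed, or removed before scaling.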

The media shown in this article is not owned by Analytics Vidhya and is used at the
Author’s discretion.


About the Author

Malamahadevan Mahadevan
