EDA - Exploratory Data Analysis
import pandas as pd
import numpy as np
Missing values:
Missing values are one of the most common problems you
encounter when preparing data for machine learning. They can
result from human error, interruptions in the data flow, privacy
concerns, and so on. Missing values degrade the performance of
machine learning models, and most algorithms do not accept
datasets that contain them and raise an error.
df.isnull()
df.isnull().sum()
Delete: You can delete the rows with the missing values or delete
the whole column which has missing values.
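The deletion options above can be sketched as follows. The frame, its column names, and the 70% threshold are illustrative assumptions, not values from these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
    "c": ["x", "y", "z", "w"],
})

# Drop every row that has at least one missing value.
rows_dropped = df.dropna(axis=0)

# Drop every column that has at least one missing value.
cols_dropped = df.dropna(axis=1)

# Drop only the columns whose share of missing values exceeds a
# threshold (0.7 is an assumed cut-off, tune it for your data).
threshold = 0.7
sparse_cols_dropped = df.loc[:, df.isnull().mean() <= threshold]
```

Dropping rows is usually safer when only a few rows are affected. Dropping a column makes more sense when most of its values are missing.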
# Numerical imputation: fill gaps in numeric columns with each column's median
df = df.fillna(df.median(numeric_only=True))
Categorical Imputation:
Replacing missing values with the most frequent value (the mode)
of a column is a good option for handling categorical columns.
df['column_name'] = df['column_name'].fillna(
    df['column_name'].value_counts().idxmax())
Predictive filling:
Alternatively, you can fill missing values with predictions from a
model trained on the rows where the value is present, using the
other columns as inputs.
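One possible form of predictive filling is sketched below. It assumes a hypothetical frame in which `height` is fully observed and `weight` has gaps, and it uses a straight-line fit from `np.polyfit` as the predictive model. Any other regression model could take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'weight' has gaps, 'height' is fully observed.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 60.0, np.nan, 80.0, np.nan],
})

known = df["weight"].notna()

# Fit weight = slope * height + intercept on the complete rows only.
slope, intercept = np.polyfit(
    df.loc[known, "height"], df.loc[known, "weight"], deg=1
)

# Predict the missing weights from height and write them back.
df.loc[~known, "weight"] = slope * df.loc[~known, "height"] + intercept
```

The model only ever sees rows where the target is present, so the filled values are estimates and inherit whatever error the model has.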
Handling Outliers:
Binning:
Numerical Binning
Example:
Value      Bin
0-30    -> Low
31-70   -> Mid
71-100  -> High
Categorical Binning
Example:
Value      Bin
Spain   -> Europe
Italy   -> Europe
Chile   -> South America
Brazil  -> South America
Numerical Binning Example
# include_lowest=True puts a value of exactly 0 into the "Low" bin
df['bin'] = pd.cut(df['value'], bins=[0, 30, 70, 100],
                   labels=["Low", "Mid", "High"], include_lowest=True)
   value   bin
0      2   Low
1     45   Mid
2      7   Low
3     85  High
4     28   Low
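Categorical Binning Example
conditions = [
    df['Country'].str.contains('Spain'),
    df['Country'].str.contains('Italy'),
    df['Country'].str.contains('Chile'),
    df['Country'].str.contains('Brazil')]
# One choice per condition; countries matching none fall back to 'Other'
choices = ['Europe', 'Europe', 'South America', 'South America']
df['Continent'] = np.select(conditions, choices, default='Other')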
     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America
Log Transform
df['log+1'] =(df['value']+1).transform(np.log)
One-hot encoding
encoded_columns = pd.get_dummies(df['column'])
df = df.join(encoded_columns).drop('column', axis=1)
New variable creation:
Sometimes you can create a new variable by combining information
from two or more columns. This can expose a hidden relationship
between the variables and the target.
E.g., in ticket reservation data, the number-of-passengers column
and the relationship column can be combined to tell whether a
passenger is travelling with family, with friends, or alone.
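The ticket reservation example might look like the sketch below. The column names `passengers` and `relationship` and their values are hypothetical and chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical reservation data; the column names are illustrative only.
df = pd.DataFrame({
    "passengers": [1, 3, 2, 4],
    "relationship": ["none", "family", "friend", "family"],
})

# Combine two columns into one new feature describing the travel group.
df["travel_group"] = np.where(
    df["passengers"] == 1,
    "alone",
    np.where(df["relationship"] == "family", "with family", "with friends"),
)
```

The new `travel_group` column can then be encoded and checked against the target like any other categorical feature.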
Feature Split:
data.title.head()
0 1995
1 1995
2 1995
3 1995
4 1995
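A minimal sketch of a feature split follows, assuming a hypothetical `title` column in the form "Name (Year)". The titles are invented for illustration and are not taken from the dataset used in these notes.

```python
import pandas as pd

# Hypothetical titles in the form "Name (Year)".
data = pd.DataFrame({"title": ["Film A (1995)", "Film B (1995)", "Film C (1996)"]})

# Pull the four-digit year out of the trailing parentheses into its own column.
data["year"] = data["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Keep the remaining text as the cleaned name.
data["name"] = data["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)
```

Splitting one text column into several simpler columns lets a model use each part separately.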
Scaling:
Normalization
   value  normalized
0      2        0.23
1     45        0.63
2    -23        0.00
3     85        1.00
4     28        0.47
5      2        0.23
6     35        0.54
7    -12        0.10
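The normalized values in the table are consistent with min-max normalization, which rescales each value to the range 0 to 1 using (value - min) / (max - min). A short sketch using the same numbers:

```python
import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})

# Min-max normalization: rescale the column to the range [0, 1].
value_range = df["value"].max() - df["value"].min()
df["normalized"] = (df["value"] - df["value"].min()) / value_range

print(df.round(2))
```

Because the result depends on the minimum and maximum, a single extreme outlier compresses all other values into a narrow band.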
Standardization
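Standardization (z-score scaling) subtracts the mean and divides by the standard deviation, so the result has mean 0 and standard deviation 1. The sketch below reuses the values from the normalization table. Using the population standard deviation (`ddof=0`) is an assumption here; pandas defaults to the sample version (`ddof=1`).

```python
import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})

# z-score: subtract the mean, divide by the standard deviation.
# ddof=0 selects the population standard deviation (an assumed choice).
mean = df["value"].mean()
std = df["value"].std(ddof=0)
df["standardized"] = (df["value"] - mean) / std
```

Unlike min-max normalization, the standardized values are not bounded to a fixed range.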
Univariate Analysis:
Check it Out:
The first step in examining your data:
df.head()
df.info()
df.describe()
import seaborn as sns
sns.histplot(df['column'], kde=False)  # distplot is deprecated in recent seaborn
Box-plot
The second visualization tool used in univariate analysis is the
box plot. This type of graph is used for detecting outliers in data.
It shows the distribution of continuous data in a way that
facilitates comparison between variables or across the levels of a
categorical variable.
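The outliers a box plot draws as separate points can also be found numerically. This sketch assumes the common whisker convention of 1.5 times the interquartile range (IQR) and uses made-up values.

```python
import pandas as pd

values = pd.Series([2, 45, 7, 85, 28, 30, 33, 400])  # hypothetical data

# Quartiles and interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
```

Here only the value 400 falls outside the fences.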
Count Plot:
A histogram of a categorical variable: it shows the number of
observations in each category.
Bar Plot:
Represents the central tendency of a numerical variable as the
height of a solid rectangle, with an error bar on top to represent
the uncertainty around that estimate.
sns.barplot(x='categorical variable', y='numerical variable', data=df)
Bivariate Analysis:
Pair plot:
Plots pairwise relationships in a dataset. This function creates a
grid of graphs covering every combination of numerical variables;
most of the panels are scatter plots.
sns.pairplot(df, hue='target variable')
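Reg plot:
Plots the data and the best-fit line of a linear regression model
in the same graph.
Joint plot:
Plots two variables, combining the bivariate view with each
variable's univariate distribution in the same graph.
Point plot:
Shows point estimates and confidence intervals of a numerical
variable across the levels of a categorical variable.
Factor plot:
Used for comparing multiple groups (renamed catplot in seaborn 0.9).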
Swarm plot:
Plots one continuous variable against one categorical variable.
sns.swarmplot(data=df, x='cat variable', y='continuous variable')
Covariance:
\[ \mathrm{cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1} \]
Correlation:
\[ r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)} \cdot \sqrt{\mathrm{var}(y)}} \]
R2:
\[ R^2 = r^2 \]
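The three quantities above can be checked numerically. This sketch uses made-up `x` and `y` values and the sample (n - 1) versions of variance and covariance.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)

# Covariance: sum of products of deviations, divided by n - 1.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations.
corr = cov_xy / (np.sqrt(x.var(ddof=1)) * np.sqrt(y.var(ddof=1)))

# R2: the square of the correlation.
r_squared = corr ** 2
```

The results agree with `np.cov(x, y)[0, 1]` and `np.corrcoef(x, y)[0, 1]`, which compute the same quantities directly.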