EDA Notes
Exploratory data analysis (EDA) may or may not involve a statistical model; its primary purpose
is to see what the data can tell us beyond the formal modeling or hypothesis-testing task.
EDA was promoted by John Tukey to encourage statisticians to explore the data and possibly
formulate hypotheses that could lead to new data collection and experiments.
Problem
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# `cars` is assumed to be a pandas DataFrame that has already been loaded
# count the missing (null) values in each column
cars.isnull().sum()
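The missing-value count can be tried out on a small frame first. This is only a sketch: `toy` and its values are made up for illustration and are not taken from the cars data.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; it only stands in for the real `cars` data.
toy = pd.DataFrame({
    "Cylinders": [4.0, 6.0, np.nan, 8.0],
    "Horsepower": [140, 225, 200, 300],
    "Origin": ["Asia", None, "USA", "Europe"],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()
print(missing_share)
```

Looking at the fraction as well as the count helps when deciding whether a column is worth keeping.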
# dataset information
cars.info()
# remove unimportant cols: MSRP, Invoice
cars = cars.drop(['MSRP', 'Invoice'], axis=1)
# summary statistics for the numeric columns
# if std is 0, the column is constant and should be removed from the analysis
cars.describe()
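One possible way to act on the "std is 0" rule is sketched below. The frame `toy` and the column `Wheels` are hypothetical and exist only to show a constant column being detected and dropped.

```python
import pandas as pd

# Invented example data; `Wheels` is deliberately constant.
toy = pd.DataFrame({
    "EngineSize": [2.0, 3.5, 4.2, 2.4],
    "Wheels": [4, 4, 4, 4],        # same value in every row, so std is 0
    "Make": ["A", "B", "C", "D"],  # non-numeric, skipped by numeric_only=True
})

# Standard deviation of each numeric column
stds = toy.std(numeric_only=True)

# Names of the columns whose standard deviation is exactly 0
constant_cols = stds[stds == 0].index.tolist()

# Drop the constant columns and keep everything else
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
print(toy_reduced.columns.tolist())
```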
'''
iloc[] -> integer-location-based indexing / selection
cars.iloc[0] -> gives the first row (position 0)
cars.iloc[-1] -> gives the last row
cars.iloc[0:5] -> selects rows 0 through 4 (the end position 5 is excluded)
cars.iloc[:, 0] -> selects all rows of column 0
cars.iloc[:, 0:5] -> selects all rows of columns 0 through 4
cars.iloc[[0,2,4], [1,3,5]] -> selects rows 0,2,4 with columns 1,3,5
'''
# select first 10 rows in MPG_City column
x = cars.iloc[0:10, 8]
x
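The iloc patterns listed above can be checked on a small frame. The sketch below uses a made-up frame `toy` with columns a to d. It also shows `columns.get_loc`, which looks up a column's position by name so that a hard-coded number such as 8 does not silently point at the wrong column if the column order changes.

```python
import pandas as pd

# Invented 6x4 frame used only to demonstrate positional selection.
toy = pd.DataFrame({
    "a": [10, 11, 12, 13, 14, 15],
    "b": [20, 21, 22, 23, 24, 25],
    "c": [30, 31, 32, 33, 34, 35],
    "d": [40, 41, 42, 43, 44, 45],
})

first_row = toy.iloc[0]                # first row as a Series
last_row = toy.iloc[-1]                # last row as a Series
head_rows = toy.iloc[0:3]              # rows 0, 1, 2 (end position excluded)
first_col = toy.iloc[:, 0]             # all rows of column 0
picked = toy.iloc[[0, 2, 4], [1, 3]]   # rows 0, 2, 4 with columns 1, 3

# Look up the position of column "c" by name, then select by position
pos = toy.columns.get_loc("c")
by_name = toy.iloc[0:3, pos]
print(picked)
print(by_name.tolist())
```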
'''
To select only the numeric columns, either list them by name or filter by dtype:
cars[['EngineSize','Cylinders','Horsepower','MPG_City','MPG_Highway','Weight','Wheelbase','Length']]
cars.select_dtypes(include=['float64', 'int64'])
'''
cars_num = cars.select_dtypes(include=['float64', 'int64'])
cars_num
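Listing exact dtypes works when every numeric column is float64 or int64, but it skips other numeric widths such as int32. A sketch of the difference follows, assuming a made-up frame `toy`; `include="number"` is the pandas selector for all numeric dtypes.

```python
import pandas as pd

# Invented frame; Horsepower is stored as int32 on purpose.
toy = pd.DataFrame({
    "Make": ["A", "B", "C"],
    "Cylinders": pd.Series([4, 6, 8], dtype="int64"),
    "EngineSize": [2.0, 3.5, 4.2],
    "Horsepower": pd.Series([140, 225, 300], dtype="int32"),
})

# Only the dtypes named explicitly are kept, so int32 is left out
narrow = toy.select_dtypes(include=["float64", "int64"])

# "number" matches every numeric dtype
broad = toy.select_dtypes(include="number")
print(narrow.columns.tolist())
print(broad.columns.tolist())
```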
'''
compute correlations among the numeric cols. method = 'pearson' / 'kendall' / 'spearman'
the correlation of a column with itself is 1
negative value -> as one variable increases, the other tends to decrease
Ex: cars_num.corr(method='pearson') # correlations among all numeric cols
Ex: cars_num.corr(method='pearson')['MPG_City'] # correlations between the other cols and MPG_City
'''
# correlations between the other cols and MPG_City
cars_num.corr(method='pearson')['MPG_City']
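To make the meaning of the Pearson values concrete, here is a minimal sketch on invented numbers. It is not the cars data: the column names are borrowed only for readability. It reads off one column of the correlation matrix and checks one entry by hand as covariance divided by the product of the standard deviations.

```python
import pandas as pd

# Invented values: MPG_City falls exactly linearly as Weight rises,
# while Length grows with Weight but not linearly.
toy = pd.DataFrame({
    "Weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "MPG_City": [50.0, 40.0, 30.0, 20.0, 10.0],
    "Length": [1.0, 4.0, 9.0, 16.0, 25.0],
})

# Pearson correlation of every column with MPG_City
pearson = toy.corr(method="pearson")["MPG_City"]
print(pearson)

# The same number by hand: cov(x, y) / (std(x) * std(y))
r_manual = toy["Weight"].cov(toy["MPG_City"]) / (
    toy["Weight"].std() * toy["MPG_City"].std()
)

# Rank the other columns by the strength of the relationship, ignoring sign
strength = pearson.drop("MPG_City").abs().sort_values(ascending=False)
print(strength)
```

Sorting by absolute value is a design choice: a strong negative correlation is as informative as a strong positive one.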
'''
a regression plot draws a scatter plot together with a fitted regression line
the closer the points lie to the line, the stronger the linear correlation
the fitted line is what distinguishes a regression plot from a plain scatter plot
'''
# x and y must be passed as keywords in recent seaborn versions
sns.regplot(x='Length', y='MPG_City', data=cars)
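The line in a regression plot is, by default, a least-squares straight line. As a rough sketch of what that means numerically, the example below fits such a line with `np.polyfit` on invented length and mileage values (not the cars data) and relates the scatter around the line to the correlation coefficient.

```python
import numpy as np

# Invented values for illustration only.
length = np.array([150.0, 160.0, 170.0, 180.0, 190.0, 200.0])
mpg_city = np.array([30.0, 27.0, 26.0, 22.0, 21.0, 18.0])

# Least-squares straight line: mpg_city ~ slope * length + intercept
slope, intercept = np.polyfit(length, mpg_city, deg=1)

# Vertical distances of the points from the fitted line
predicted = slope * length + intercept
residuals = mpg_city - predicted

# Pearson correlation between the two variables
r = np.corrcoef(length, mpg_city)[0, 1]

# Share of the variance in mpg_city explained by the line (equals r squared)
ss_res = (residuals ** 2).sum()
ss_tot = ((mpg_city - mpg_city.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
print(slope, intercept, r, r_squared)
```

Small residuals mean the points sit close to the line, which is the same thing as a correlation close to 1 or -1.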
Task on EDA