0% found this document useful (0 votes)
69 views

Eda Notes

Exploratory data analysis (EDA) is used to analyze datasets and summarize their main characteristics using visual methods. EDA explores the data to understand what it can reveal beyond formal modeling or hypothesis testing. EDA was promoted by John Tukey to encourage statisticians to explore datasets and potentially formulate new hypotheses or experiments.

Uploaded by

Sudheer Redus
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
69 views

Eda Notes

Exploratory data analysis (EDA) is used to analyze datasets and summarize their main characteristics using visual methods. EDA explores the data to understand what it can reveal beyond formal modeling or hypothesis testing. EDA was promoted by John Tukey to encourage statisticians to explore datasets and potentially formulate new hypotheses or experiments.

Uploaded by

Sudheer Redus
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 4

EDA

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to


summarize their main characteristics, often with visual methods.

A statistical model can be used or not, but primarily EDA is for seeing what the data can tell
us beyond the formal modeling or hypothesis testing task.

Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore
the data, and possibly formulate hypotheses that could lead to new data collection and
experiments.

NOTE: To find the total no of rows which are missing in a column


cars.isnull().sum()

Problem

Analyze CARS.CSV file.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import cars dataset - this is downloaded from kaggle


# here dataframe by the name 'cars' is created
cars = pd.read_csv("f:/eda/cars.csv")

# display top 5 records


cars.head()

# display which column has how many missing values (or null values)
cars.isnull().sum()

# show how many rows and cols in the dataset


cars.shape

# dataset information
cars.info()
# remove unimportant cols: MSRP, Invoice
cars = cars.drop(['MSRP', 'Invoice'], axis=1)

# remove duplicate rows if any


# keep the first row and remove other duplicate rows of that row
cars = cars.drop_duplicates(keep='first')

# remove the rows having missing values (in Cylinders)


cars.dropna(inplace=True)

# shape of the dataset -- no. of rows, cols


cars.shape

# summary statistics
# if std is 0, that colum should be removed from analysis
cars.describe()

# remove any space in column names


cars.columns = cars.columns.str.replace(' ', '')

# sort the data w.r.t a column -- here we sort on 'MPG_City'


cars_sort = cars.sort_values(by='MPG_City', ascending=False)
cars_sort.head()

'''
iloc[] -> gives integer location based indexing / selection
cars.iloc[0] -> gives 0th row
cars.iloc[-1] -> gives last row
cars.iloc[0:5] -> select 0 to 4th rows
cars.iloc[:, 0] -> select all rows and 0th col
cars.iloc[:, 0:5] -> select all rows and cols from 0 to 4th
cars.iloc[0,2,4], [1,3,5]] -> select 0,2,4 rows with 1,3,5 cols
'''
# select first 10 rows in MPG_City column
x = cars.iloc[0:10, 8]
x

# draw histogram for MPG_City in the intervals of 10


num_bins = 10
plt.hist(cars['MPG_City'], num_bins, color='green', edgecolor='black')

# probability density function -- represents values in 0 to 1 range


# use seaborn - give a tuple to the cars array.
sns.distplot(cars[('MPG_City')], bins=10, color='green')

# draw histogram for Length in the intervals of 10


num_bins = 10
plt.hist(cars['Length'], num_bins)

# probability density function for Length


sns.distplot(cars[('Length')], bins=10)

'''
To select only numeric type of columns, we can give their names or datatypes as:
cars[['EngineSize','Cylinders','Horsepower','MPG_City','MPG_Highway','Weight','Wheelbase'
,'Length']]
cars.select_dtypes(include=['float64', 'int64'])

'''
cars_num = cars.select_dtypes(include=['float64', 'int64'])
cars_num

# display histograms for all numeric columns


cars_num.hist(bins=20)

'''
create correlations among numeric cols. method= pearson / kendall
correlation with itself is 1
negative value -> if one value increases, the other one decreases.
Ex: cars_num.corr(method ='pearson') # correlations among all numeric cols
Ex: cars_num.corr(method ='pearson')['MPG_City'] # correlations between other cols and
MPG_City
'''
cars_num.corr(method ='pearson')['MPG_City'] # correlations between other cols and
MPG_City

# draw pair plots to show correlations


sns.pairplot(cars_num, vars = ["EngineSize", "Cylinders", "Horsepower", "MPG_City",
"MPG_Highway"], hue='MPG_City')
# thin lines or dots very much closer (not spread apart) represent good correlations.
# box plots can be drawn only for categorical variables
box1 = sns.boxplot(x='Type', y='MPG_City', data=cars)
# SUV gives less mileage and Hybrid gives more mileasge
# sedan and wagon show more variation

box2 = sns.boxplot(x='Origin', y='MPG_City', data=cars)


# The origin of car is Asia gives slightly more mileage

box3 = sns.boxplot(x='DriveTrain', y='MPG_City', data=cars)


# Front drive train is giving more mileage

'''
regression plot is useful to show regression line
if many points are nearer to the line, then there is better correlation
regression line makes the difference between regression plot and correlation plot
'''
sns.regplot(cars['Length'], cars['MPG_City'])

Task on eda

Do EDA On titanic dataset

You might also like