Eda Notes

Exploratory data analysis (EDA) is used to analyze datasets and summarize their main characteristics using visual methods. EDA explores the data to understand what it can reveal beyond formal modeling or hypothesis testing. EDA was promoted by John Tukey to encourage statisticians to explore datasets and potentially formulate new hypotheses or experiments.

Uploaded by

Sudheer Redus

EDA

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to
summarize their main characteristics, often with visual methods.

A statistical model can be used or not, but primarily EDA is for seeing what the data can tell
us beyond the formal modeling or hypothesis testing task.

Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore
the data, and possibly formulate hypotheses that could lead to new data collection and
experiments.

NOTE: To count the missing (null) values in each column:

cars.isnull().sum()
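
As a minimal sketch of what this call returns, the toy DataFrame below (hypothetical values, not taken from cars.csv) has a few gaps. isnull() marks each cell as True or False, and sum() adds the True values per column. The names toy, missing_per_column, total_missing and rows_with_missing are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the cars dataset
toy = pd.DataFrame({
    "Cylinders": [4.0, np.nan, 6.0, np.nan],
    "MPG_City": [26, 24, np.nan, 20],
    "Type": ["Sedan", "SUV", "Sedan", "Wagon"],
})

# One count per column: how many cells in that column are missing
missing_per_column = toy.isnull().sum()

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())

# Number of rows that contain at least one missing cell
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_per_column)
print(total_missing, rows_with_missing)
```

Note that the per-column counts describe cells, not rows: a single row can contribute to several columns at once.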

Problem

Analyze CARS.CSV file.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import the cars dataset (downloaded from Kaggle)
# and load it into a DataFrame named 'cars'
cars = pd.read_csv("f:/eda/cars.csv")

# display top 5 records

cars.head()

# display which column has how many missing values (or null values)
cars.isnull().sum()

# show how many rows and cols in the dataset

cars.shape

# dataset information: column names, dtypes and non-null counts
cars.info()

# remove columns not needed for this analysis: MSRP, Invoice
cars = cars.drop(['MSRP', 'Invoice'], axis=1)

# remove duplicate rows if any

# keep the first row and remove other duplicate rows of that row
cars = cars.drop_duplicates(keep='first')

# remove the rows having a missing value in any column
# (in this dataset the missing values are in Cylinders)

cars = cars.dropna()
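
The two cleaning steps above can be sketched on a small hypothetical frame. The column names below are placeholders, and the subset argument is shown only as an option for when rows should be dropped because of gaps in one particular column rather than in any column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one exact duplicate row and two rows with gaps
toy = pd.DataFrame({
    "Model": ["A", "A", "B", "C"],
    "Cylinders": [4.0, 4.0, np.nan, 6.0],
    "Length": [180.0, 180.0, 175.0, np.nan],
})

# Keep the first of each set of identical rows
deduped = toy.drop_duplicates(keep="first")

# Drop rows that have a missing value in ANY column
any_missing_dropped = deduped.dropna()

# Drop rows only when the missing value is in 'Cylinders'
cyl_missing_dropped = deduped.dropna(subset=["Cylinders"])

print(len(deduped), len(any_missing_dropped), len(cyl_missing_dropped))
```

Without subset, dropna() can discard more rows than intended when other columns also contain gaps, so it is worth comparing the shape before and after.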

# shape of the dataset -- no. of rows, cols

cars.shape

# summary statistics
# if std is 0, the column is constant and can be removed from the analysis
cars.describe()

# remove any space in column names

cars.columns = cars.columns.str.replace(' ', '')

# sort the data w.r.t a column -- here we sort on 'MPG_City'

cars_sort = cars.sort_values(by='MPG_City', ascending=False)
cars_sort.head()

'''
iloc[] -> gives integer location based indexing / selection
cars.iloc[0] -> gives 0th row
cars.iloc[-1] -> gives last row
cars.iloc[0:5] -> select 0 to 4th rows
cars.iloc[:, 0] -> select all rows and 0th col
cars.iloc[:, 0:5] -> select all rows and cols from 0 to 4th
cars.iloc[[0,2,4], [1,3,5]] -> select rows 0,2,4 with cols 1,3,5
'''
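
The selections listed above can be tried on a small hypothetical frame. This is only a sketch: the column names a to d are placeholders, and columns.get_loc is shown as one way to look up a column position by name instead of hard-coding it.

```python
import pandas as pd

# Hypothetical toy frame: 6 rows, 4 columns
toy = pd.DataFrame({
    "a": [10, 11, 12, 13, 14, 15],
    "b": [20, 21, 22, 23, 24, 25],
    "c": [30, 31, 32, 33, 34, 35],
    "d": [40, 41, 42, 43, 44, 45],
})

first_row = toy.iloc[0]        # row 0 as a Series
last_row = toy.iloc[-1]        # last row as a Series
head_rows = toy.iloc[0:3]      # rows 0, 1, 2 (the end of a slice is excluded)
first_col = toy.iloc[:, 0]     # all rows, column 0

# Lists of positions pick specific rows and columns
picked = toy.iloc[[0, 2, 4], [1, 3]]

# Look up the integer position of a column by its name
c_pos = toy.columns.get_loc("c")
by_name = toy.iloc[0:3, c_pos]

print(picked)
print(list(by_name))
```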
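# select first 10 rows in MPG_City column
# (the column position is looked up by name instead of being hard-coded)
x = cars.iloc[0:10, cars.columns.get_loc('MPG_City')]
x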

# draw histogram for MPG_City using 10 equal-width bins

num_bins = 10
plt.hist(cars['MPG_City'], bins=num_bins, color='green', edgecolor='black')

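# probability density -- bar heights are scaled so that the total area is 1
# (individual heights are not limited to the 0 to 1 range)

# use seaborn: histogram scaled to a density, with a KDE curve overlaid
sns.histplot(cars['MPG_City'], bins=10, stat='density', kde=True, color='green')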
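
A density histogram is normalized so that the areas of its bars add up to 1; the bar heights themselves can be larger than 1 when the values are tightly clustered. The sketch below checks this with NumPy on randomly generated, hypothetical values (not the cars data).

```python
import numpy as np

# Hypothetical sample: tightly clustered values around 0.5
rng = np.random.default_rng(0)
values = rng.normal(loc=0.5, scale=0.05, size=1000)

# density=True rescales the counts so the histogram integrates to 1
heights, edges = np.histogram(values, bins=10, density=True)
widths = np.diff(edges)

# Area of each bar is height * width; the areas sum to 1
area = float(np.sum(heights * widths))

print(round(area, 6))
print(heights.max() > 1)   # heights are not restricted to the 0 to 1 range
```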

# draw histogram for Length using 10 equal-width bins

num_bins = 10
plt.hist(cars['Length'], bins=num_bins)

# probability density for Length

sns.histplot(cars['Length'], bins=10, stat='density', kde=True)

'''
To select only the numeric columns, we can give their names or their datatypes:
cars[['EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway', 'Weight', 'Wheelbase', 'Length']]
cars.select_dtypes(include=['float64', 'int64'])
'''
cars_num = cars.select_dtypes(include=['float64', 'int64'])
cars_num

# display histograms for all numeric columns

cars_num.hist(bins=20)

'''
compute correlations among numeric cols. method = pearson / kendall
the correlation of a column with itself is 1
negative value -> as one variable increases, the other tends to decrease.
Ex: cars_num.corr(method='pearson')              # correlations among all numeric cols
Ex: cars_num.corr(method='pearson')['MPG_City']  # correlations between MPG_City and the other cols
'''
# correlations between MPG_City and the other cols
cars_num.corr(method='pearson')['MPG_City']
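
To make the sign and the self-correlation concrete, here is a small sketch on hypothetical numbers (not values from cars.csv). MPG_City is constructed to fall linearly as Weight rises, so their Pearson correlation comes out at -1.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Weight": [2500, 3000, 3500, 4000, 4500],
    "MPG_City": [30, 26, 22, 18, 14],        # drops by 4 for every 500 of weight
    "Horsepower": [120, 150, 200, 230, 300],
})

# Full correlation matrix among the numeric columns
corr = toy.corr(method="pearson")

# Correlation of every column with MPG_City, most negative first
mpg_corr = corr["MPG_City"].sort_values()

print(corr.round(3))
print(mpg_corr.round(3))
```

Sorting the single column of correlations is a quick way to see which variables move most strongly with, or against, the column of interest.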

# draw pair plots to show correlations

sns.pairplot(cars_num, vars = ["EngineSize", "Cylinders", "Horsepower", "MPG_City",
"MPG_Highway"], hue='MPG_City')

# points that lie close to a thin line (little scatter) indicate strong correlation.

# a box plot shows the distribution of a numeric variable; here it is grouped by a categorical variable
box1 = sns.boxplot(x='Type', y='MPG_City', data=cars)
# SUVs give lower mileage and Hybrids give higher mileage
# Sedans and Wagons show more variation

box2 = sns.boxplot(x='Origin', y='MPG_City', data=cars)

# Cars originating from Asia give slightly more mileage

box3 = sns.boxplot(x='DriveTrain', y='MPG_City', data=cars)

# Front-wheel drive gives more mileage
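
The box in each box plot spans the first to third quartile, with a line at the median. As a rough sketch, the same numbers can be computed per group with groupby. The data below is hypothetical, and quantile() uses pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical toy data: mileage for two car types
toy = pd.DataFrame({
    "Type": ["SUV"] * 4 + ["Sedan"] * 4,
    "MPG_City": [14, 16, 17, 19, 20, 22, 24, 30],
})

# Quartiles of MPG_City within each Type; unstack puts the quantiles in columns
summary = toy.groupby("Type")["MPG_City"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range = height of the box
summary["IQR"] = summary[0.75] - summary[0.25]

print(summary)
```

A larger IQR corresponds to a taller box, which is what "more variation" means when reading the plots.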

'''
a regression plot is a scatter plot with a fitted regression line drawn through it
the closer the points lie to the line, the stronger the correlation
the fitted line is what distinguishes a regression plot from a plain scatter plot
'''
sns.regplot(x='Length', y='MPG_City', data=cars)
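
For intuition about the line in a regression plot, a straight line can be fitted by ordinary least squares with NumPy. This is a sketch on hypothetical numbers, assuming a simple degree-1 fit; it is not a reproduction of seaborn's internals.

```python
import numpy as np

# Hypothetical toy data: longer cars with lower city mileage
length = np.array([170.0, 180.0, 190.0, 200.0, 210.0])
mpg_city = np.array([28.0, 25.0, 23.0, 20.0, 19.0])

# Least-squares straight line: mpg_city ~ slope * length + intercept
slope, intercept = np.polyfit(length, mpg_city, deg=1)

# Vertical distances of the points from the fitted line
predicted = slope * length + intercept
residuals = mpg_city - predicted

# Pearson correlation between the two variables
r = np.corrcoef(length, mpg_city)[0, 1]

print(round(slope, 3), round(intercept, 3), round(r, 3))
```

Small residuals go together with a correlation close to -1 or +1, which is the "points near the line" reading described above.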

Task on EDA

Perform EDA on the Titanic dataset.
