EDA - Task

Exploratory Data Analysis (EDA) is a technique used to summarize and visualize datasets to understand their main characteristics, identify errors, and prepare data for further analysis. It involves data sourcing, cleaning, and applying various methods such as feature scaling and outlier treatment to improve data quality. EDA is crucial in business processes for making informed decisions and building machine learning models.

Exploratory Data Analysis

Data Analytics/Science Process

[Process flow diagram: Reality → Raw Data Collected → Data Is Processed → Clean Dataset → Exploratory Data Analysis → Models & Algorithms → Data Product / Visualize, Report, Make Decisions]
What is Exploratory Data Analysis
• Exploratory Data Analysis is an approach to analyzing datasets to summarize their main characteristics, often using visual methods.

• EDA is a data exploration technique used to understand the various aspects of the data.

• The main aim of EDA is to gain enough confidence in the data that we are ready to engage a machine learning model.

• EDA is the first step in the data analysis process, which makes it essential to analyzing the data.

• EDA gives a basic understanding of the data, helping you figure out which questions to ask and the best way to manipulate the dataset to answer them.

• Exploratory data analysis helps us find errors, discover the data, map out its structure, and detect anomalies.

• Exploratory data analysis is important for business processes because it prepares the dataset for the deep, thorough analysis that will uncover your business problem.

• EDA helps to build a quick-and-dirty model, or a baseline model, which can serve as a comparison against later models that you will build.

Visualization
Visualization is the presentation of data in graphical or visual form to understand it more clearly. Visualization makes the data easier to understand:

• Easily understand the features of the data
• Easily analyze the data and summarize it
• Helps to get meaningful insights from the data
• Helps to find trends or patterns in the data

Steps involved in EDA

• Data Sourcing
• Data Cleaning
• Numerical Analysis
• Categorical Analysis
• Derived Metrics

Data Sourcing
• Data sourcing is the process of gathering data from multiple sources, whether through external or internal data collection.
• There are two major kinds of data, classified according to the source:
1. Public data
2. Private data

Public Data: Data that can be accessed without taking any permission from the agencies that hold it is called public data. The agencies make the data public for research purposes.
• Example: government bodies, other public-sector organisations, and e-commerce sites make their data public.

Private Data: Data that is not available on a public platform, and which requires the organisation's permission to access, is called private data.
• Example: banking, telecom, and retail-sector organisations do not make their data publicly available.


Data Cleaning
• After collecting the data, the next step is data cleaning. Data cleaning means getting rid of any information that doesn't need to be there and cleaning up mistakes.

• Data cleaning is the process of cleaning the data to improve its quality for further data analysis and for building a machine learning model.

• The benefit of data cleaning is that all the incorrect and irrelevant data is gone, leaving good-quality data that helps improve the accuracy of our machine learning model.

The following are some steps involved in data cleaning:
• Handle Missing Values
• Standardization of the data
• Outlier Treatment
• Handle Invalid values

Handle Missing Values

• Delete Rows/Columns: The method most commonly used to handle missing values. Rows can be deleted if they contain an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.

• Replacing with mean/median/mode: This method can be used on independent variables with numerical values. For categorical features, we apply the mode to fill in missing values.

• Algorithm Imputation: Some machine learning algorithms can handle missing values in the dataset themselves, e.g. KNN, Naïve Bayes, Random Forest.

• Predicting the missing values: A prediction model is one of the advanced methods to handle missing values. The rows with no missing values become the training set, the rows with missing values become the test set, and the missing variable is treated as the target variable.

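The mean-imputation strategy above can be sketched in plain Python (the income figures are hypothetical; in practice a library call such as pandas' `fillna` or scikit-learn's `SimpleImputer` would do this per column):

```python
from statistics import mean

# Hypothetical income column with missing entries recorded as None
incomes = [15000, None, 12000, None, 30000]

# Mean imputation: fill each missing value with the mean of the observed values
observed = [v for v in incomes if v is not None]
fill = mean(observed)                       # (15000 + 12000 + 30000) / 3 = 19000
imputed = [v if v is not None else fill for v in incomes]
print(imputed)                              # [15000, 19000, 12000, 19000, 30000]
```

For a skewed feature, the median is usually preferred over the mean, since it is less affected by extreme values.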
Standardization/Feature Scaling


Importance of Feature Scaling

• When we are dealing with independent variables or features that differ from each other in their range of values or units, we have to normalize/standardize the data so that the difference in ranges doesn't affect the outcome of the analysis.

• Feature scaling is the method of rescaling the values present in the features. In feature scaling we convert different scales of measurement into a single scale, standardizing the whole dataset into one range.

Example
Normalization

Income Minimum = 12000; Income Maximum = 30000; (Max – Min) = 30000 – 12000 = 18000

Age | Income (£) | New value
24  | 15000      | (15000 – 12000)/18000 = 0.16667
30  | 12000      | (12000 – 12000)/18000 = 0
28  | 30000      | (30000 – 12000)/18000 = 1

Hence, we have converted the income values to values between 0 and 1. Please note, the new values have Minimum = 0 and Maximum = 1.

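The min-max calculation in the table can be reproduced directly in Python (scikit-learn's `MinMaxScaler` performs the same transformation per column):

```python
incomes = [15000, 12000, 30000]

# Min-max normalization: (x - min) / (max - min), mapping values into [0, 1]
lo, hi = min(incomes), max(incomes)         # 12000, 30000
scaled = [(v - lo) / (hi - lo) for v in incomes]
print(scaled)                               # [0.1666..., 0.0, 1.0]
```

The minimum always maps to 0 and the maximum to 1, regardless of the original units.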
Example
Standardization

Average = (15000 + 12000 + 30000)/3 = 19000
Standard deviation = 9643.65

Age | Income (£) | New value
24  | 15000      | (15000 – 19000)/9643.65 = -0.4147
30  | 12000      | (12000 – 19000)/9643.65 = -0.7258
28  | 30000      | (30000 – 19000)/9643.65 = 1.1406

Hence, we have converted the income values to lower values using the z-score method.
As a check: x = c(-0.4147, -0.7258, 1.1406); mean(x) = -0.000003 ≈ 0; var(x) = 0.999 ≈ 1.

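The same z-score computation in Python (the slide's check is written in R; `statistics.stdev` gives the sample standard deviation used here, and scikit-learn's `StandardScaler` applies the population variant):

```python
from statistics import mean, stdev, variance

incomes = [15000, 12000, 30000]

# z-score standardization: (x - mean) / standard deviation
mu = mean(incomes)                  # 19000
sd = stdev(incomes)                 # sample standard deviation, ≈ 9643.65
z = [(v - mu) / sd for v in incomes]

# The standardized column has mean ≈ 0 and variance ≈ 1
print(z)                            # ≈ [-0.4148, -0.7259, 1.1406]
```

Unlike min-max scaling, z-scores are not bounded to a fixed interval, but they are far less distorted by extreme values.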
Outlier Treatment
Outliers are the most extreme values in the data: abnormal observations that deviate from the norm and do not fit the normal behavior of the data.

Detect outliers using the following methods:
1. Boxplot
2. Histogram
3. Scatter plot
4. Z-score
5. Interquartile range (values outside 1.5 times the IQR)

Handle outliers using the following methods:
1. Remove the outliers.
2. Replace the outlier with a suitable value using the quantile method or the interquartile range.
3. Use an ML model that is not sensitive to outliers, e.g. KNN, Decision Tree, SVM, Naïve Bayes, ensemble methods.

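The 1.5 × IQR rule (method 5 above) can be sketched with the standard library; the data here is hypothetical, with one obvious extreme value:

```python
from statistics import quantiles

data = [10, 11, 12, 12, 12, 13, 13, 14, 15, 102]

# Quartiles and the 1.5 × IQR fences
q1, _, q3 = quantiles(data, n=4)       # Q1 = 11.75, Q3 = 14.25
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lower or v > upper]
print(outliers)                        # [102]
```

This is the same rule a boxplot uses to draw its whiskers, which is why boxplots and the IQR method flag the same points.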
Numerical Analysis

• We also perform various analyses over numerical data.

• For example, when dealing with a single numerical variable, we might be interested in its statistical information, such as mean, median, 25th percentile, 75th percentile, min, max, etc.

• Similarly, while analyzing multiple features, we might be interested in knowing their correlation with each other.

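A five-number-plus-mean summary like the one described above (what pandas' `describe()` reports per column) can be computed from the standard library; the ages are hypothetical:

```python
from statistics import mean, median, quantiles

ages = [22, 24, 28, 29, 30, 35, 41]     # hypothetical numeric column

q1, q2, q3 = quantiles(ages, n=4)       # 25th, 50th, 75th percentiles
summary = {
    "min": min(ages), "25%": q1, "median": median(ages),
    "75%": q3, "max": max(ages), "mean": mean(ages),
}
print(summary)
```

For multiple features, the analogous one-liner is a correlation matrix, e.g. `df.corr()` in pandas.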
Derived Metrics
Derived metrics create a new variable from existing variables, to extract insightful information from the data during analysis.

• Feature Binning
• Feature Encoding
• From Domain Knowledge
• Calculated from Data

Feature Binning

Feature binning transforms a continuous or numeric variable into a categorical value, without taking the dependent variable into consideration.

• Equal Width: separates the continuous variable into several categories, each covering the same range (width) of values.

• Equal Frequency: separates the continuous variable into several categories containing approximately the same number of values.

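Both binning schemes can be sketched in plain Python on a hypothetical, distinct-valued column (pandas offers these directly as `pd.cut` for equal width and `pd.qcut` for equal frequency):

```python
ages = [21, 25, 33, 38, 44, 52, 57, 63]    # hypothetical, all values distinct

# Equal width: 3 bins, each spanning the same range of values
lo, hi = min(ages), max(ages)
width = (hi - lo) / 3                      # (63 - 21) / 3 = 14
width_bins = [min(int((v - lo) // width), 2) for v in ages]  # clamp max into last bin
print(width_bins)                          # [0, 0, 0, 1, 1, 2, 2, 2]

# Equal frequency: 4 bins with (approximately) the same number of values
rank = {v: i for i, v in enumerate(sorted(ages))}   # assumes distinct values
freq_bins = [rank[v] // 2 for v in ages]            # 8 values -> 2 per bin
print(freq_bins)                           # [0, 0, 1, 1, 2, 2, 3, 3]
```

Note how equal width leaves bins unevenly populated when the data is skewed, while equal frequency keeps bin counts balanced at the cost of uneven bin widths.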
Feature Encoding
Feature encoding helps us transform categorical data into numeric data.

• Label encoding: a technique to transform categorical variables into numerical variables by assigning a numerical value to each category.

• One-Hot encoding: used when the independent variables are nominal. It creates k different columns, one per category, and places 1 in the matching column and 0 in the rest. Here, 0 represents the absence, and 1 the presence, of that category.

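Both encodings, sketched on a hypothetical city column (in practice, scikit-learn's `LabelEncoder`/`OneHotEncoder` or pandas' `get_dummies` are used):

```python
cities = ["Delhi", "Mumbai", "Delhi", "Chennai"]   # hypothetical nominal feature

# Label encoding: one integer code per category
categories = sorted(set(cities))                   # ['Chennai', 'Delhi', 'Mumbai']
code = {c: i for i, c in enumerate(categories)}
labels = [code[c] for c in cities]
print(labels)                                      # [1, 2, 1, 0]

# One-hot encoding: k columns, 1 marks the row's category, 0 everywhere else
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities]
print(one_hot[0])                                  # [0, 1, 0]  -> Delhi
```

Label encoding imposes an artificial order (Chennai < Delhi < Mumbai), which is why one-hot encoding is preferred for nominal variables.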
Use Cases
EDA is important in basically every business problem; it is the first crucial step in the data analysis process. Some of the use cases where we apply EDA are:

• Cancer Data Analysis: in this dataset, we have to predict who is suffering from cancer and who is not.

• Fraud Data Analysis in E-commerce Transactions: in this dataset, we have to detect fraud in e-commerce transactions.

Thank you
