EDA - Task

Exploratory Data Analysis (EDA) is a technique used to summarize and visualize datasets to understand their main characteristics, identify errors, and prepare data for further analysis. It involves data sourcing, cleaning, and applying various methods such as feature scaling and outlier treatment to improve data quality. EDA is crucial in business processes for making informed decisions and building machine learning models.

Exploratory Data Analysis

Data Analytics/Science Process

[Process flow diagram: Reality → Raw Data Collected → Data Is Processed → Clean Dataset → Exploratory Data Analysis → Models & Algorithms → Data Product / Visualize, Report, Make Decisions]
What is Exploratory Data Analysis
• Exploratory Data Analysis is an approach to analyzing datasets to summarize their main characteristics, often using visual methods.

• EDA is a data exploration technique used to understand the various aspects of the data.

• The main aim of EDA is to gain enough confidence in the data that we are ready to engage a machine learning model.

• EDA is the first step in the data analysis process, which makes it essential to analyzing the data.

• EDA gives a basic understanding of the data, helping you figure out which questions to ask and the best way to manipulate the dataset to answer them.

• Exploratory data analysis helps us find errors, discover the data, map out its structure, and detect anomalies.

• Exploratory data analysis is important for business processes because it prepares the dataset for the deep, thorough analysis that will uncover your business problem.

• EDA helps to build a quick-and-dirty model, or a baseline model, which can serve as a comparison against later models that you will build.

Visualization
Visualization is the presentation of data in graphical or visual form to understand it more clearly. Visualization makes the data easier to understand:

• Easily understand the features of the data
• Easily analyze the data and summarize it
• Helps to get meaningful insights from the data
• Helps to find trends or patterns in the data

Steps involved in EDA

• Data Sourcing
• Data Cleaning
• Numerical Analysis
• Categorical Analysis
• Derived Metrics

Data Sourcing
• Data sourcing is the process of gathering data from multiple sources, whether through external or internal data collection.
• There are two major kinds of data, classified according to the source:
1. Public data
2. Private data

Public Data: Data that can be accessed without taking any permission from the agencies that hold it is called public data. The agencies make the data public for research purposes.
• Example: government bodies, other public-sector organisations, and e-commerce sites make their data public.

Private Data: Data that is not available on a public platform, and which requires the organisation's permission to access, is called private data.
• Example: banking, telecom, and retail-sector organisations do not make their data publicly available.


Data Cleaning
• After collecting the data, the next step is data cleaning. Data cleaning means getting rid of any information that doesn't need to be there and cleaning up mistakes.

• Data cleaning is the process of cleaning the data to improve its quality for further data analysis and for building a machine learning model.

• The benefit of data cleaning is that all the incorrect and irrelevant data is gone, leaving good-quality data that helps improve the accuracy of our machine learning model.

The following are some steps involved in data cleaning:
• Handle Missing Values
• Standardization of the data
• Outlier Treatment
• Handle Invalid values

Handle Missing Values

• Delete Rows/Columns: The method most commonly used to handle missing values. Rows can be deleted if they contain an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.

• Replacing with mean/median/mode: This method can be used on independent variables with numerical values. For categorical features, we apply the mode to fill in missing values.

• Algorithm Imputation: Some machine learning algorithms can handle missing values in the dataset themselves, e.g. KNN, Naïve Bayes, Random Forest.

• Predicting the missing values: A prediction model is one of the advanced methods to handle missing values. The rows with no missing values become the training set, the rows with missing values become the test set, and the missing variable is treated as the target variable.

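The mean-imputation strategy above can be sketched in plain Python (the income figures are hypothetical; in practice a library call such as pandas' `fillna` or scikit-learn's `SimpleImputer` would do this per column):

```python
from statistics import mean

# Hypothetical income column with missing entries recorded as None
incomes = [15000, None, 12000, None, 30000]

# Mean imputation: fill each missing value with the mean of the observed values
observed = [v for v in incomes if v is not None]
fill = mean(observed)                       # (15000 + 12000 + 30000) / 3 = 19000
imputed = [v if v is not None else fill for v in incomes]
print(imputed)                              # [15000, 19000, 12000, 19000, 30000]
```

For a skewed feature, the median is usually preferred over the mean, since it is less affected by extreme values.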
Standardization/Feature Scaling


Importance of Feature Scaling

• When we are dealing with independent variables or features that differ from each other in their range of values or units, we have to normalize/standardize the data so that the difference in ranges doesn't affect the outcome of the analysis.

• Feature scaling is the method of rescaling the values present in the features. In feature scaling we convert different scales of measurement into a single scale, standardizing the whole dataset into one range.

Example
Normalization

Income Minimum = 12000; Income Maximum = 30000; (Max – Min) = 30000 – 12000 = 18000

Age | Income (£) | New value
24  | 15000      | (15000 – 12000)/18000 = 0.16667
30  | 12000      | (12000 – 12000)/18000 = 0
28  | 30000      | (30000 – 12000)/18000 = 1

Hence, we have converted the income values to values between 0 and 1. Please note, the new values have Minimum = 0 and Maximum = 1.

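The min-max calculation in the table can be reproduced directly in Python (scikit-learn's `MinMaxScaler` performs the same transformation per column):

```python
incomes = [15000, 12000, 30000]

# Min-max normalization: (x - min) / (max - min), mapping values into [0, 1]
lo, hi = min(incomes), max(incomes)         # 12000, 30000
scaled = [(v - lo) / (hi - lo) for v in incomes]
print(scaled)                               # [0.1666..., 0.0, 1.0]
```

The minimum always maps to 0 and the maximum to 1, regardless of the original units.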
Example
Standardization

Average = (15000 + 12000 + 30000)/3 = 19000
Standard deviation = 9643.65

Age | Income (£) | New value
24  | 15000      | (15000 – 19000)/9643.65 = -0.4147
30  | 12000      | (12000 – 19000)/9643.65 = -0.7258
28  | 30000      | (30000 – 19000)/9643.65 = 1.1406

Hence, we have converted the income values to lower values using the z-score method.
As a check: x = c(-0.4147, -0.7258, 1.1406); mean(x) = -0.000003 ≈ 0; var(x) = 0.999 ≈ 1.

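The same z-score computation in Python (the slide's check is written in R; `statistics.stdev` gives the sample standard deviation used here, and scikit-learn's `StandardScaler` applies the population variant):

```python
from statistics import mean, stdev, variance

incomes = [15000, 12000, 30000]

# z-score standardization: (x - mean) / standard deviation
mu = mean(incomes)                  # 19000
sd = stdev(incomes)                 # sample standard deviation, ≈ 9643.65
z = [(v - mu) / sd for v in incomes]

# The standardized column has mean ≈ 0 and variance ≈ 1
print(z)                            # ≈ [-0.4148, -0.7259, 1.1406]
```

Unlike min-max scaling, z-scores are not bounded to a fixed interval, but they are far less distorted by extreme values.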
Outlier Treatment
Outliers are the most extreme values in the data: abnormal observations that deviate from the norm and do not fit the normal behavior of the data.

Detect outliers using the following methods:
1. Boxplot
2. Histogram
3. Scatter plot
4. Z-score
5. Interquartile range (values outside 1.5 times the IQR)

Handle outliers using the following methods:
1. Remove the outliers.
2. Replace the outlier with a suitable value using the quantile method or the interquartile range.
3. Use an ML model that is not sensitive to outliers, e.g. KNN, Decision Tree, SVM, Naïve Bayes, ensemble methods.

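The 1.5 × IQR rule (method 5 above) can be sketched with the standard library; the data here is hypothetical, with one obvious extreme value:

```python
from statistics import quantiles

data = [10, 11, 12, 12, 12, 13, 13, 14, 15, 102]

# Quartiles and the 1.5 × IQR fences
q1, _, q3 = quantiles(data, n=4)       # Q1 = 11.75, Q3 = 14.25
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lower or v > upper]
print(outliers)                        # [102]
```

This is the same rule a boxplot uses to draw its whiskers, which is why boxplots and the IQR method flag the same points.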
Numerical Analysis

• We also perform various analyses over numerical data.

• For example, when dealing with a single numerical variable, we might be interested in its statistical information, such as mean, median, 25th percentile, 75th percentile, min, max, etc.

• Similarly, while analyzing multiple features, we might be interested in knowing their correlation with each other.

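A five-number-plus-mean summary like the one described above (what pandas' `describe()` reports per column) can be computed from the standard library; the ages are hypothetical:

```python
from statistics import mean, median, quantiles

ages = [22, 24, 28, 29, 30, 35, 41]     # hypothetical numeric column

q1, q2, q3 = quantiles(ages, n=4)       # 25th, 50th, 75th percentiles
summary = {
    "min": min(ages), "25%": q1, "median": median(ages),
    "75%": q3, "max": max(ages), "mean": mean(ages),
}
print(summary)
```

For multiple features, the analogous one-liner is a correlation matrix, e.g. `df.corr()` in pandas.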
Derived Metrics
Derived metrics create a new variable from existing variables, to extract insightful information from the data during analysis.

• Feature Binning
• Feature Encoding
• From Domain Knowledge
• Calculated from Data

Feature Binning

Feature binning transforms a continuous or numeric variable into a categorical value, without taking the dependent variable into consideration.

• Equal Width: separates the continuous variable into several categories, each covering the same range (width) of values.

• Equal Frequency: separates the continuous variable into several categories containing approximately the same number of values.

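Both binning schemes can be sketched in plain Python on a hypothetical, distinct-valued column (pandas offers these directly as `pd.cut` for equal width and `pd.qcut` for equal frequency):

```python
ages = [21, 25, 33, 38, 44, 52, 57, 63]    # hypothetical, all values distinct

# Equal width: 3 bins, each spanning the same range of values
lo, hi = min(ages), max(ages)
width = (hi - lo) / 3                      # (63 - 21) / 3 = 14
width_bins = [min(int((v - lo) // width), 2) for v in ages]  # clamp max into last bin
print(width_bins)                          # [0, 0, 0, 1, 1, 2, 2, 2]

# Equal frequency: 4 bins with (approximately) the same number of values
rank = {v: i for i, v in enumerate(sorted(ages))}   # assumes distinct values
freq_bins = [rank[v] // 2 for v in ages]            # 8 values -> 2 per bin
print(freq_bins)                           # [0, 0, 1, 1, 2, 2, 3, 3]
```

Note how equal width leaves bins unevenly populated when the data is skewed, while equal frequency keeps bin counts balanced at the cost of uneven bin widths.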
Feature Encoding
Feature encoding helps us transform categorical data into numeric data.

• Label encoding: a technique to transform categorical variables into numerical variables by assigning a numerical value to each category.

• One-Hot encoding: used when the independent variables are nominal. It creates k different columns, one per category, and places 1 in the matching column and 0 in the rest. Here, 0 represents the absence, and 1 the presence, of that category.

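Both encodings, sketched on a hypothetical city column (in practice, scikit-learn's `LabelEncoder`/`OneHotEncoder` or pandas' `get_dummies` are used):

```python
cities = ["Delhi", "Mumbai", "Delhi", "Chennai"]   # hypothetical nominal feature

# Label encoding: one integer code per category
categories = sorted(set(cities))                   # ['Chennai', 'Delhi', 'Mumbai']
code = {c: i for i, c in enumerate(categories)}
labels = [code[c] for c in cities]
print(labels)                                      # [1, 2, 1, 0]

# One-hot encoding: k columns, 1 marks the row's category, 0 everywhere else
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities]
print(one_hot[0])                                  # [0, 1, 0]  -> Delhi
```

Label encoding imposes an artificial order (Chennai < Delhi < Mumbai), which is why one-hot encoding is preferred for nominal variables.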
Use Cases
EDA is important in basically every business problem; it is the first crucial step in the data analysis process. Some of the use cases where we apply EDA are:

• Cancer Data Analysis: in this dataset, we have to predict who is suffering from cancer and who is not.

• Fraud Data Analysis in E-commerce Transactions: in this dataset, we have to detect fraud in e-commerce transactions.

Thank you
