EDA - Task
Data Analytics/Science Process
[Process diagram: Raw Data Collected → Data Is Processed → Clean Dataset → Exploratory Data Analysis → Models & Algorithms]
What is Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is an approach to analyzing datasets in order to summarize their main characteristics, often in the form of visual methods.
• EDA is a data exploration technique for understanding the various aspects of the data.
• The main aim of EDA is to gain enough confidence in the data that we are ready to engage a machine learning model.
• EDA is important for analyzing the data; it is the first step in the data analysis process.
• EDA gives a basic understanding of the data: it helps you make sense of it, figure out the questions you need to ask, and find the best way to manipulate the dataset to get the answers to those questions.
• Exploratory data analysis helps us find errors, discover the data, map out the data structure, and find anomalies.
• EDA helps to build a quick-and-dirty model, or a baseline model, which can serve as a comparison against later models that you will build.
Visualization
Visualization is the presentation of the data in graphical or visual form to understand the data more clearly. Visualization makes the data easier to work with:
• Easily understand the features of the data.
• Easily analyze the data and summarize it.
• Helps to get meaningful insights from the data.
• Helps to find trends or patterns in the data.
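These benefits show up even in a minimal plot. A sketch using matplotlib with purely hypothetical income values (the values and file name are illustrative, not from the slides):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical income values, for illustration only
incomes = [15000, 12000, 30000, 22000, 18000, 25000]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(incomes, bins=5)        # distribution of a single feature
ax1.set_title("Income distribution")
ax2.boxplot(incomes)             # five-number summary and potential outliers
ax2.set_title("Income box plot")
fig.tight_layout()
fig.savefig("income_eda.png")
```

A histogram shows the shape of a feature at a glance, while a box plot summarizes it and flags potential outliers.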
Steps involved in EDA
…
• Numerical Analysis
• Data Cleaning
Data Sourcing
• Data Sourcing is the process of gathering data from multiple sources, whether as external or internal data collection.
• There are two major kinds of data, classified according to the source:
1. Public data: data that is easy to access without taking any permission from the agencies is called public data. Agencies make such data public for the purpose of research.
   Example: government and other public-sector or e-commerce sites make their data public.
2. Private data: data that is not available on a public platform, and which requires the permission of the organisation to access, is called private data.
   Example: the banking, telecom, and retail sectors do not make their data publicly available.
Data Cleaning
After collecting the data, the next step is data cleaning. Data cleaning means getting rid of any information that does not need to be there and fixing anything that was recorded by mistake. The following are some steps involved in data cleaning.
Handle Missing Values
• Deletion: the method we most commonly use to handle missing values. Rows can be deleted if they contain an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.
• Imputation: can be used on independent variables when they are numerical (e.g. filling with the mean). On categorical features we apply the mode to fill the missing values.
• Algorithms that handle missing values: some machine learning algorithms, such as KNN, Naïve Bayes, and random forest, support missing values in the dataset directly.
• Prediction model: one of the advanced methods of handling missing values. The part of the dataset with no missing values becomes the training set, the part with missing values becomes the test set, and the missing value is treated as the target variable.
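The first two approaches can be sketched with pandas. A minimal illustration on a made-up dataset (the column names and values are hypothetical); the algorithm-based and prediction-model approaches are omitted because they depend on the modelling library used:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset containing missing values, for illustration only
df = pd.DataFrame({
    "age":    [24, 30, np.nan, 28, 35],
    "income": [15000, np.nan, 12000, 30000, np.nan],
    "city":   ["London", "Leeds", None, "London", "York"],
})

# 1. Deletion: drop every row that has any missing value
dropped = df.dropna()

# 2. Imputation: mean for numerical features, mode for categorical ones
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df.isna().sum().sum())  # 0 -- no missing values remain
```

Deletion is simple but loses rows; imputation keeps the rows at the cost of introducing estimated values.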
Standardization/Feature Scaling
…
Example
Normalization
Normalization rescales each value to the 0–1 range using (x - min)/(max - min). For the same incomes as in the standardization example (15000, 12000, 30000), the minimum is 12000, the maximum is 30000, and the range is 18000.

Age | Income (£) | New value
24 | 15000 | (15000 - 12000)/18000 = 0.1667
30 | 12000 | (12000 - 12000)/18000 = 0
28 | 30000 | (30000 - 12000)/18000 = 1

Please note, the new values have Minimum = 0 and Maximum = 1.
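The min–max calculation above can be reproduced in a few lines of plain Python (variable names are illustrative):

```python
incomes = [15000, 12000, 30000]  # the income values from the example

lo, hi = min(incomes), max(incomes)  # 12000 and 30000
normalised = [round((x - lo) / (hi - lo), 4) for x in incomes]

print(normalised)  # [0.1667, 0.0, 1.0] -- the new values lie between 0 and 1
```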
Example
Standardization
Average = (15000 + 12000 + 30000)/3 = 19000
Standard deviation = 9643.65

Age | Income (£) | New value
24 | 15000 | (15000 - 19000)/9643.65 = -0.4147
30 | 12000 | (12000 - 19000)/9643.65 = -0.7258
28 | 30000 | (30000 - 19000)/9643.65 = 1.1406

Hence, we have converted the income values to lower values using the z-score method. As a quick check in R, the new values have mean ≈ 0 and variance ≈ 1:
x = c(-0.4147, -0.7258, 1.1406)
mean(x) = -0.000003 ≈ 0
var(x) = 0.999 ≈ 1
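The same z-score calculation can be checked in Python as well (a sketch; stdev here is the sample standard deviation, matching the slide's 9643.65):

```python
import statistics

incomes = [15000, 12000, 30000]

mean = statistics.mean(incomes)  # 19000
sd = statistics.stdev(incomes)   # sample standard deviation, approx. 9643.65

z = [(x - mean) / sd for x in incomes]  # approx. [-0.41, -0.73, 1.14]

# As on the slide: the standardized values have mean ~0 and variance ~1
print(round(statistics.mean(z), 6), round(statistics.variance(z), 6))
```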
Outlier Treatment
Outliers are the most extreme values in the data: abnormal observations that deviate from the norm and do not fit the normal behaviour of the data. They can be detected with methods such as box plots, z-scores, and the IQR rule, and handled by methods such as removal, capping, or imputation.
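One common detection method is the IQR rule; a minimal sketch on made-up values (the 1.5 multiplier is the conventional choice, and capping is only one of several handling options):

```python
import statistics

values = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]  # hypothetical; 95 is extreme

q = statistics.quantiles(values, n=4)  # [Q1, median, Q3]
q1, q3 = q[0], q[2]
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual IQR fences

# Detection: anything outside the fences is flagged as an outlier
outliers = [v for v in values if v < lower or v > upper]

# Handling by capping (winsorising): clamp values into the fences
capped = [min(max(v, lower), upper) for v in values]

print(outliers)  # [95]
```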
Numerical Analysis
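The details of this slide are not reproduced here; as an illustration, numerical analysis typically starts with summary statistics of each feature. A sketch with Python's statistics module on hypothetical values:

```python
import statistics

incomes = [15000, 12000, 30000, 22000, 18000]  # hypothetical values

# Basic numerical summary of a single feature
summary = {
    "count": len(incomes),
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "min": min(incomes),
    "max": max(incomes),
    "stdev": round(statistics.stdev(incomes), 2),
}
print(summary)
```

With pandas, `df.describe()` produces the same kind of summary for every numerical column at once.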
Derived Metrics
Derived metrics create a new variable from the existing variables, to obtain more insightful information from the data when analyzing it. Two common techniques are:
Feature Binning
Feature Encoding
Feature Binning
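As an illustration of the idea (the cut-points and labels below are hypothetical, not from the slides), feature binning converts a continuous variable such as age into discrete groups:

```python
# Pure-Python sketch; pandas users would typically reach for pd.cut
ages = [24, 30, 28, 45, 61, 17]  # hypothetical ages

def bin_age(age):
    """Map a numeric age onto a labelled bin (illustrative cut-points)."""
    if age < 18:
        return "minor"
    elif age < 35:
        return "young adult"
    elif age < 60:
        return "middle-aged"
    return "senior"

age_groups = [bin_age(a) for a in ages]
print(age_groups)
# ['young adult', 'young adult', 'young adult', 'middle-aged', 'senior', 'minor']
```

The binned variable is a derived metric: it trades numeric precision for categories that are easier to analyze and visualize.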
Feature Encoding
Feature encoding helps us to transform categorical data into numeric data.
• Label encoding: a technique to transform categorical variables into numerical variables by assigning a numerical value to each of the categories.
• One-hot encoding: used when independent variables are nominal. It creates k different columns, one per category, and places a 1 in the column for the row's category and 0 in the rest. Here, 0 represents the absence, and 1 represents the presence, of that category.
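Both techniques can be sketched with pandas (the city column is hypothetical; with scikit-learn one would use LabelEncoder/OneHotEncoder instead):

```python
import pandas as pd

df = pd.DataFrame({"city": ["London", "Leeds", "York", "London"]})

# Label encoding: each category gets an integer code
# (pandas assigns codes in alphabetical order of the categories)
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: k columns, a 1 marking the row's category
one_hot = pd.get_dummies(df["city"], prefix="city")

print(df["city_label"].tolist())  # [1, 0, 2, 1]
print(one_hot.shape)              # (4, 3) -- one column per category
```

Label encoding is compact but imposes an artificial order on the codes, which is why one-hot encoding is preferred for nominal variables.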
Use cases
Basically, EDA is important in every business problem; it is the first crucial step in the data analysis process. Example use cases:
• Cancer prediction: in a medical dataset, predicting who is suffering from cancer and who is not.
• Fraud data analysis in e-commerce transactions.
Thank you