Unit 2 Lec4

The document discusses exploratory data analysis (EDA) techniques for understanding data. It covers univariate and multivariate EDA, including graphical and non-graphical methods like histograms, scatter plots, correlation, and cross-tabulation. The document provides examples of applying these techniques and emphasizes that EDA is important for analyzing data without assumptions in order to produce valid results.

Uploaded by

Sabyrzhan Orynbassar

Exploratory Data Analysis

In the last chapter, we learned:

• The difference between data and information.

• How to summarize the data with individual values.

• Various measures in descriptive statistics, such as measures of central tendency, dispersion, shape, and position. These measures help us describe our data efficiently.

All of that is good for an initial analysis, but we need a complete picture of our data. We need to apply exploratory data analysis (EDA) to analyze what is happening with our data effectively.
What is EDA?

It is a critical process of performing initial investigations on data to:

• Discover patterns and identify relationships between the variables.
• Locate outliers.
• Produce summary statistics and graphical representations.
• Test hypotheses.
To understand EDA, let us take an example.

Suppose you decide to watch a movie you have not heard of.

- You find yourself with a lot of questions that need to be answered before you can decide.

- Some of these questions could be:

o What type of movie is it?

o What are the ratings and reviews?

o Who is in the cast?

o What does the trailer look like?

The activities above are like the ones you will perform on a new dataset. This is EDA.
The importance of EDA

- It allows us to analyze the data before making any assumptions.

- It ensures that the results produced are valid and applicable to the desired business outcomes and goals.

We use EDA in machine learning as an initial step to understand the data: the key features in the dataset should be visualized before a machine learning algorithm is implemented.
Types of Data Analysis Techniques

• Graphical --> Exploratory Data Analysis

• Quantitative (Non-Graphical) --> Classical Statistical Methods.

EDA Classification

• Univariate EDA (Graphical and Non-Graphical)

• Multivariate EDA (Graphical and Non-Graphical)

When you talk about EDA, you come across three key terms.

Key Terms

Univariate Analysis
• In univariate analysis, there is only one variable under study: "uni" means one and "variate" means variable.
Bivariate Analysis
• In bivariate analysis, there are exactly two variables, and the analysis concerns the relationship between them.
Multivariate Analysis
• "Multi" refers to more than two variables, and the analysis concerns the relationships among these variables.
Univariate non-Graphical Analysis

What are the key steps that we would take in analyzing a single column in our dataset?

Analyzing numerical data with:

• Measure of center

• Measure of Dispersion

• Measure of position

• Measure of shape

• Discover outliers

We covered most of these topics in descriptive statistics
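
The measures listed above can be computed for a single column with pandas. The following is a minimal sketch on a small made-up `ages` series (the data and variable names are assumptions for illustration, not part of the lecture). It uses the common 1.5 × IQR rule to flag outliers, which is one convention among several.

```python
import pandas as pd

# Hypothetical sample: one numerical column with a single extreme value.
ages = pd.Series([22, 25, 25, 27, 29, 31, 33, 35, 38, 95], name="age")

# Measures of center
mean = ages.mean()
median = ages.median()
mode = ages.mode().iloc[0]

# Measures of dispersion
std = ages.std()                      # sample standard deviation (ddof=1)
value_range = ages.max() - ages.min()

# Measures of position: quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Measures of shape
skewness = ages.skew()                # > 0 suggests a longer right tail
kurt = ages.kurt()                    # excess kurtosis

# Outliers by the 1.5 * IQR rule (a convention, not the only choice)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(ages.describe())
print("outliers:", outliers.tolist())
```

Here the mean (36.0) sits well above the median (30.0) because of the single extreme value, which is the kind of discrepancy these measures are meant to expose.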

Univariate Graphical Analysis

This type of analysis helps you to look graphically at the distribution of the dataset.

Most common type of graphs

▪ Pie charts
▪ Bar charts
▪ Histograms
▪ Line graphs
▪ Box plots

Pie charts
- A pie chart is a circular statistical chart that is divided into slices to illustrate numerical proportions.

- In a pie chart, the central angle and the arc length of each slice are proportional to the quantity or percentage it represents.

Bar Graphs

- It consists of horizontal or vertical bars that are separated from each other.
- It is a graphical representation of categorical (qualitative) or discrete data. The height of each bar shows the frequency, and all bars have the same width. There is equal space between each pair of consecutive bars.

Histograms

- It consists of horizontal or vertical bars that are adjacent to each other.

- It is a graphical representation of quantitative data. The area of the rectangular bars
shows the frequency of the data and the width of the bars need not be the same.
Line Plot
- It is also known as a line graph or line chart. It is a graph that uses lines to connect individual data points.
- It is used for quantitative data and commonly displays the change in the data over a specific time interval. Line graphs are quite informative for visualizing trends in the data.

Box plots
- It is used for detecting and illustrating differences in location and variation between different groups of data.
- Box plots are very good at showing how tightly the data is grouped, its central tendency, and its skewness, as well as outliers.
But how can you identify the skewness in Box Plots?

Source: simplypsychology
Let us apply this information using Python.
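
A minimal sketch of three of the chart types above using matplotlib, on synthetic data (the exponential sample and the category counts are assumptions made purely for illustration). It also addresses the skewness question for box plots: when the upper half of the box (Q3 minus median) is longer than the lower half (median minus Q1), and the upper whisker is longer, the data is right-skewed; the mirror image indicates left skew.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed data (an assumption made for illustration only).
rng = np.random.default_rng(0)
values = rng.exponential(scale=10.0, size=500)

# Hypothetical category counts for the bar chart.
categories = ["A", "B", "C"]
counts = [40, 25, 35]

fig, (ax_hist, ax_box, ax_bar) = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: adjacent bars over a quantitative variable.
ax_hist.hist(values, bins=20, edgecolor="black")
ax_hist.set_title("Histogram")

# Box plot: median, quartiles, whiskers, and points beyond the whiskers.
ax_box.boxplot(values)
ax_box.set_title("Box plot")

# Bar chart: separated bars over categories.
ax_bar.bar(categories, counts)
ax_bar.set_title("Bar chart")

fig.tight_layout()

# Reading skewness off the box: compare the two halves of the box.
q1, median, q3 = np.percentile(values, [25, 50, 75])
right_skewed = (q3 - median) > (median - q1)
print("right-skewed:", right_skewed)
```

Call `plt.show()` (or `fig.savefig(...)`) afterwards to display or save the figure.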
Multivariate analysis EDA

- Like univariate analysis, it involves both computing summary statistics and producing visual displays.

- Generally, the type of analysis depends on the nature of your data (numerical or categorical)

In Multivariate Non-Graphical EDA:
- Cross tabulation
- Calculate correlation and covariance

In Multivariate Graphical EDA:
- Correlation using scatter plots
- Correlation using heat maps

Other analyses exist, but from a statistical point of view these are the most commonly used.
Multivariate non-Graphical EDA

Cross tabulation

▪ It is for categorical data (and numerical data with only a few distinct values).

▪ It is performed by making a two-way table with column headings that match the levels of one variable and row headings that
match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels.

Source: Datatab
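
The two-way table described above can be built with `pandas.crosstab`. This is a small sketch on made-up survey responses; the column names `gender` and `preference` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical survey responses (illustrative data only).
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "preference": ["tea", "coffee", "coffee", "coffee",
                   "tea", "tea", "tea", "coffee"],
})

# Rows are the levels of one variable, columns the levels of the other;
# each cell counts the subjects that share that pair of levels.
# margins=True adds row and column totals labeled "All".
table = pd.crosstab(df["gender"], df["preference"], margins=True)
print(table)
```

Passing `normalize="index"` instead of `margins=True` would show row proportions rather than counts, which is often easier to compare across groups of different sizes.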
Covariance and correlation

- Both are used to analyze the linear relationship between variables.

Covariance:
- Indicates whether the variables are positively or negatively related.
- Indicates the direction, but not the strength, of the linear relationship between variables.
- It has dimension.
- Formula:

$$\mathrm{cov}_{x,y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Correlation:
- Indicates the degree to which the variables are related.
- Measures both the direction and the strength of the linear relationship between two variables.
- It is dimensionless.
- Formula:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

In short, correlation is a function of covariance: it is the covariance divided by the product of the two standard deviations.
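
To make the two formulas concrete, here is a small sketch that computes both by hand with NumPy and checks them against `np.cov` and `np.corrcoef`. The arrays `x` and `y` are made-up values. Rescaling `x` by 100 (for example, metres to centimetres) changes the covariance but leaves the correlation unchanged, which illustrates why covariance has dimension and correlation does not.

```python
import numpy as np

# Hypothetical paired measurements (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx, dy = x - x.mean(), y - y.mean()

# Sample covariance: sum of cross-deviations divided by N - 1.
cov_xy = np.sum(dx * dy) / (len(x) - 1)

# Pearson correlation: same numerator, scaled by the deviation magnitudes.
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Equivalent form: covariance over the product of sample standard deviations.
r_from_cov = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Effect of changing units: covariance scales, correlation does not.
cov_scaled = np.cov(100 * x, y)[0, 1]
r_scaled = np.corrcoef(100 * x, y)[0, 1]

print(f"cov = {cov_xy:.3f}, r = {r:.3f}")
print(f"after rescaling x: cov = {cov_scaled:.1f}, r = {r_scaled:.3f}")
```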

Association
It represents the connection between two variables. This connection can be something less strictly defined. For example, a certain smell can
remind you of someone or a place, or a certain sound can remind you of a certain event that was important to you.

Confounding variable
Sometimes a third variable that is not accounted for affects the relationship between the two variables under study.

Example: suppose a researcher collects data on ice cream sales and shark attacks and finds that the two variables are highly correlated. It is unlikely that one causes the other, right? A more likely cause is the confounding variable, temperature: when the temperature is warmer, more people buy ice cream and more people go to the ocean.

Requirements for confounding variables

- It must be correlated with the independent variable (ice cream sales and temperature)

- It must have a causal relationship with the dependent variable (shark attacks and temperature)
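
The ice cream and shark attack example can be mimicked with a small simulation. This is a stylized sketch on synthetic data: the coefficients, noise levels, and variable names are assumptions, and the simulated "attacks" are a continuous stand-in rather than real counts. Temperature drives both variables and neither affects the other, yet they come out strongly correlated. Removing the linear effect of temperature from each (a partial correlation) makes the association largely disappear.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated data: temperature drives both variables; neither affects the other.
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)
shark_attacks = 0.2 * temperature + rng.normal(0, 0.5, n)

# Raw correlation between the two outcome variables.
raw_r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y using a least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial correlation: correlate what is left after removing temperature.
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation:     {raw_r:.2f}")
print(f"partial correlation: {partial_r:.2f}")
```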
Causation

- It means that changes in one variable directly cause changes in the other.

- For example: Smoking and cancer. A number of studies and collective evidence give strong indications that lung cancer is
causally related to tobacco consumption.

Causality implies association

Correlation Matrix

- It is a table showing the correlation coefficients between variables. Each cell in the table shows the correlation between
two variables.
- The value of the correlation (r) provides the strength and direction of
association as follows:

o The value of r ranges from -1 to 1

o A positive value indicates a positive association, and a negative value indicates a negative association
o The positive association is strongest when r = 1 and weakens as r decreases toward 0; likewise, the negative association is strongest when r = -1
There are three broad reasons for computing a correlation matrix:

- To summarize a large amount of data where the goal is to see patterns. In our example, the observable pattern is that the variables are highly correlated with each other.

- To input into other analyses. For example, people commonly use correlation matrices as inputs for some statistical analysis models.

- As a diagnostic when checking other analyses. For example, with linear regression, many high correlations among the predictors suggest that the linear regression estimates will be unreliable.
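
A correlation matrix can be computed directly from a DataFrame with `DataFrame.corr()`, which uses the Pearson coefficient by default. The sketch below uses synthetic columns (`height`, `weight`, and `noise` are made-up names and values) so that one pair is clearly related and the third column is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic columns (assumed for illustration only).
height = rng.normal(170, 10, n)
weight = 0.9 * height + rng.normal(0, 8, n)   # positively related to height
noise = rng.normal(0, 1, n)                   # unrelated to both

df = pd.DataFrame({"height": height, "weight": weight, "noise": noise})

# Each cell holds r for one pair of variables; the matrix is symmetric
# and its diagonal is 1 because every variable correlates perfectly with itself.
corr = df.corr()
print(corr.round(2))
```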
Multivariate Graphical EDA

Scatter plot
- A scatter plot represents individual pieces of data using dots. These plots make it easier to see whether two variables are related to each other. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between the two variables.
- It is considered a visual counterpart of the correlation matrix.

Source:CQE Academy

Source: chartio
Scatter matrix
When we have more than two variables and need to check their scatter plots, we use a scatter matrix. In the Seaborn library, this is the pair plot.

To produce such a plot, try the following:

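import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins")  # example dataset, fetched online on first use
sns.pairplot(penguins, hue="species")    # pairwise scatter plots, colored by species
plt.show()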

In addition, heat maps help us describe relationships through color coding.
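
A minimal sketch of a correlation heat map using `seaborn.heatmap`. The four columns are synthetic (an assumption for illustration): `b` is built to correlate positively with `a`, `c` negatively, and `d` not at all, so the color coding is easy to read. Fixing `vmin=-1` and `vmax=1` keeps the color scale aligned with the range of r.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n = 200

# Synthetic columns with known relationships (illustrative only).
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=n),    # positively related to a
    "c": -a + rng.normal(scale=0.5, size=n),   # negatively related to a
    "d": rng.normal(size=n),                   # unrelated
})

corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# annot=True prints r in each cell; a diverging colormap separates
# positive from negative correlations.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heat map")
fig.tight_layout()
```

Call `plt.show()` afterwards to display the figure.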
Steps in EDA

Variables Identification

Univariate Analysis

Bi/Multivariate Analysis

Missing Value Treatment

Outlier Removal
Variables Identification
- In this step, we identify every variable by discovering its type.
- According to our needs, we can change the datatype of any variable
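
In pandas, this step usually starts with inspecting `dtypes` and then converting columns that were read with an unsuitable type. The sketch below uses a small made-up table; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw table in which dates and fees arrived as text.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "signup_date": ["2023-01-05", "2023-02-11", "2023-03-20"],
    "plan": ["basic", "premium", "basic"],
    "monthly_fee": ["9.99", "19.99", "9.99"],
})

# Identify the type of every variable.
print(df.dtypes)

# Change data types according to our needs.
df["signup_date"] = pd.to_datetime(df["signup_date"])   # text -> datetime
df["plan"] = df["plan"].astype("category")              # text -> categorical
df["monthly_fee"] = df["monthly_fee"].astype(float)     # text -> numeric

print(df.dtypes)
```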

Univariate Analysis
- Here, we study the individual characteristics of every feature/variable in the dataset.
• Continuous variables (histogram, box plot, KDE, and Q-Q plot)
• Categorical variables (bar chart, pie chart, and frequency table)

Bivariate Analysis
- Here, we study the relationship between any two variables, which can be categorical – continuous, categorical – categorical, or continuous – continuous.
• continuous – continuous: scatter plot, heat map, joint plot, pair plot
• categorical – continuous: factor plot (catplot in current seaborn), swarm plot, violin plot, strip plot
• categorical – categorical: crosstab, stacked bar chart, bar chart

Extend the analysis to more than two variables for multivariate analysis.
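
The last two steps listed under Steps in EDA, missing value treatment and outlier removal, can be sketched as follows. This is one possible approach under stated assumptions, not a prescribed method: missing numbers are filled with the median, missing categories with the mode, and rows outside the 1.5 × IQR fences are dropped. The data and column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column and one extreme income.
df = pd.DataFrame({
    "income": [32.0, 35.0, np.nan, 38.0, 41.0, 36.0, 250.0, 34.0],
    "city": ["A", "B", "A", None, "B", "A", "B", "A"],
})

# Missing value treatment: count, then impute.
missing_per_column = df.isna().sum()

cleaned = df.copy()
cleaned["income"] = cleaned["income"].fillna(cleaned["income"].median())
cleaned["city"] = cleaned["city"].fillna(cleaned["city"].mode().iloc[0])

# Outlier removal with the 1.5 * IQR rule (a convention, not the only choice).
q1 = cleaned["income"].quantile(0.25)
q3 = cleaned["income"].quantile(0.75)
iqr = q3 - q1
in_range = cleaned["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = cleaned[in_range]

print(missing_per_column)
print(cleaned)
```

Whether to impute or drop, and whether an extreme value is an error or a genuine observation, depends on the data and the goal, so the choices here should be revisited for each dataset.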
