Unit 2 Lec4

The document discusses exploratory data analysis (EDA) techniques for understanding data. It covers univariate and multivariate EDA, including graphical and non-graphical methods like histograms, scatter plots, correlation, and cross-tabulation. The document provides examples of applying these techniques and emphasizes that EDA is important for analyzing data without assumptions in order to produce valid results.

Exploratory Data Analysis

In the last chapter, we learned:

• The difference between data and information.

• How to summarize the data with single summary values.

• Various measures in descriptive statistics, such as measures of central tendency, dispersion, shape, and position. These measures help us describe our data efficiently.

All of that is good for an initial analysis, but we need a complete picture of our data. We apply exploratory data analysis (EDA) to analyze effectively what is happening in our data.
What is EDA?

It is a critical process of performing initial investigations on data to:

• Discover patterns and identify the relationships between variables.
• Locate outliers.
• Produce summary statistics and graphical representations.
• Test hypotheses.
To understand EDA, let us take an example.

Suppose you decide to watch a movie you have not heard of.

- You find yourself with a lot of questions that need answers before you can decide.

- Some of these questions could be:

o What type of movie is it?

o What are the ratings and reviews?

o Who is in the cast?

o What does the trailer look like?

The activities above are like the ones you will perform on a new dataset. This is EDA.
The importance of EDA

- It allows us to analyze the data before making any assumptions.

- It ensures that the results produced are valid and applicable to the desired business outcomes and goals.

We use EDA in machine learning as an initial step to understand the data. The key features in the dataset should be visualized before implementing a machine learning algorithm.
Types of Data Analysis Techniques

• Graphical --> Exploratory Data Analysis

• Quantitative (Non-Graphical) --> Classical Statistical Methods.


EDA Classification

• Univariate EDA (Graphical and Non-Graphical)

• Multivariate EDA (Graphical and Non-Graphical)

When you talk about EDA, you come across three key terms

Key Terms

Univariate Analysis
• In univariate analysis, only one variable is analyzed at a time; "uni" means one and "variate" means variable.
Bivariate Analysis
• In bivariate analysis, there are exactly two variables, and the analysis concerns the relationship between them.
Multivariate Analysis
• "Multi" refers to more than two variables, and the analysis concerns the relationships among these variables.
Univariate non-Graphical Analysis

What are the key steps that we would take in analyzing a single column in our dataset?

Analyzing numerical data with:

• Measure of center

• Measure of Dispersion

• Measure of position

• Measure of shape

• Discover outliers

We covered most of these topics in descriptive statistics
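The measures listed above can be sketched for a single column as follows. This is a minimal illustration, assuming pandas is available; the `ages` values are made up, and the 1.5 × IQR rule for flagging outliers is one common convention rather than something the lecture prescribes.

```python
import pandas as pd

# Hypothetical single column with one obviously extreme value.
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 95], name="age")

# Measures of center
mean = ages.mean()
median = ages.median()

# Measures of dispersion
std = ages.std()                       # sample standard deviation (ddof=1)
value_range = ages.max() - ages.min()

# Measures of position: quartiles and interquartile range
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Measure of shape: positive values indicate a longer right tail
skewness = ages.skew()

# Outliers by the 1.5 * IQR rule (an assumed convention)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(f"mean={mean:.1f}, median={median:.1f}, std={std:.1f}, range={value_range}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, skewness={skewness:.2f}")
print("outliers:", outliers.tolist())
```

Note how the single extreme value pulls the mean above the median, which is one reason to look at several measures together.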


Univariate Graphical Analysis

This type of analysis helps you to look graphically at the distribution of the dataset.

Most common types of graphs:


▪ Pie charts
▪ Bar charts
▪ Histograms
▪ Line graphs
▪ Box plots

Pie charts
- It is a circular statistical chart that is divided into slices to illustrate numerical proportions.

- In a pie chart, the central angle and the arc length of each slice are proportional to the quantity or percentage it represents.

Bar Graphs

- It consists of horizontal or vertical bars that are separated from each other.
- It is a graphical representation of categorical (qualitative) or discrete data. The height of each bar shows the frequency, all bars have the same width, and there is equal space between each pair of consecutive bars.

Histograms

- It consists of horizontal or vertical bars that are adjacent to each other.


- It is a graphical representation of quantitative data. The area of the rectangular bars
shows the frequency of the data and the width of the bars need not be the same.
Line graphs
- It is also known as a line plot or line chart. It is a graph that uses lines to connect individual data points.
- It is used for quantitative data and commonly displays the change in the data over a specific time interval. Line graphs are quite informative for visualizing trends in the data.

Box plots
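- It is used for detecting and illustrating differences in location and variation between different groups of data.
- Box plots are very good at presenting information about how tightly the data is grouped, its central tendency and skewness, as well as outliers.
But how can you identify the skewness in box plots? Compare the position of the median inside the box and the lengths of the two whiskers: a median near the lower end of the box with a longer upper whisker suggests positive (right) skew, the reverse suggests negative (left) skew, and a centered median with whiskers of similar length suggests a roughly symmetric distribution.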

Source: simplypsychology
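The skewness that a box plot shows can also be checked numerically from the same three quartiles the box is drawn from. The sketch below uses Bowley's quartile skewness, (Q3 + Q1 − 2·Q2) / (Q3 − Q1); this measure is brought in here for illustration and is not part of the lecture, and the sample lists are made up.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1).

    Positive when the median sits closer to Q1 (right skew),
    negative when it sits closer to Q3 (left skew), zero when centered.
    """
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Hypothetical samples
right_skewed = [1, 2, 2, 3, 3, 4, 6, 9, 15]
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
left_skewed = [-v for v in right_skewed]

print("right skewed:", quartile_skewness(right_skewed))
print("symmetric:   ", quartile_skewness(symmetric))
print("left skewed: ", quartile_skewness(left_skewed))
```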
Let us apply this information using Python.
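One possible way to draw the five chart types above is sketched here with Matplotlib. All data (`grades`, `scores`, `monthly_sales`) are invented for illustration, and the `Agg` backend line is only there so the script runs without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=200)   # quantitative variable
grades = {"A": 12, "B": 30, "C": 25, "D": 8}      # categorical counts
monthly_sales = [5, 7, 6, 9, 12, 14]              # values over time

labels, counts = list(grades.keys()), list(grades.values())
fig, axes = plt.subplots(2, 3, figsize=(12, 7))

axes[0, 0].pie(counts, labels=labels, autopct="%1.0f%%")
axes[0, 0].set_title("Pie chart")

axes[0, 1].bar(labels, counts)                    # separated bars: categorical data
axes[0, 1].set_title("Bar chart")

axes[0, 2].hist(scores, bins=15, edgecolor="black")  # adjacent bars: quantitative data
axes[0, 2].set_title("Histogram")

axes[1, 0].plot(range(1, 7), monthly_sales, marker="o")
axes[1, 0].set_title("Line graph")

axes[1, 1].boxplot(scores)
axes[1, 1].set_title("Box plot")

axes[1, 2].axis("off")                            # unused panel
fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```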
Multivariate analysis EDA

- Like univariate analysis, it involves both computing summary statistics and producing visual displays.

- Generally, the type of analysis depends on the nature of your data (numerical or categorical).

In multivariate non-graphical EDA:
- Cross tabulation
- Calculate correlation and covariance

In multivariate graphical EDA:
- Correlation using scatter plots
- Correlation using heat maps

There are more analyses, but from the statistical point of view these are the most commonly used.
Multivariate non-Graphical EDA

Cross tabulation

▪ It is for categorical data (and numerical data with only a few distinct values).

▪ It is performed by making a two-way table with column headings that match the levels of one variable and row headings that
match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels.

Source: Datatab
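A two-way table of this kind can be built with `pandas.crosstab`. The sketch below uses a made-up survey; `margins=True` adds the row and column totals under the label `All`.

```python
import pandas as pd

# Hypothetical survey responses: two categorical variables.
survey = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "F"],
    "preference": ["tea", "coffee", "coffee", "coffee", "tea", "tea", "tea", "coffee"],
})

# Rows are the levels of one variable, columns the levels of the other;
# each cell counts the subjects that share that pair of levels.
table = pd.crosstab(survey["gender"], survey["preference"], margins=True)
print(table)
```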
Covariance and correlation

- Both are used to analyze the linear relationship between variables.

Covariance:
- Indicates whether the variables are positively or negatively related.
- Indicates the direction but not the strength of the linear relationship between variables.
- It has units (the product of the units of the two variables).

Correlation:
- Indicates the degree to which the variables are related.
- Measures both the direction and the strength of the linear relationship between two variables.
- It is dimensionless.

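$$\mathrm{cov}_{x,y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$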

In short, correlation is a function of covariance: it is the covariance divided by the product of the two standard deviations.
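A small numerical check may make the link between the two quantities concrete. The sketch below computes both from their definitions on made-up data, confirms them against NumPy's `np.cov` and `np.corrcoef`, and shows that changing the units of `x` changes the covariance but not the correlation.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

dx, dy = x - x.mean(), y - y.mean()

# Sample covariance: sum of cross-deviations divided by N - 1
cov_xy = (dx * dy).sum() / (len(x) - 1)

# Pearson correlation from its definition
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Correlation as a function of covariance: cov / (s_x * s_y)
r_from_cov = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Rescale x (for example metres to centimetres): covariance scales, r does not
cov_scaled = np.cov(100 * x, y)[0, 1]
r_scaled = np.corrcoef(100 * x, y)[0, 1]

print(f"cov={cov_xy}, r={r:.3f}, r_from_cov={r_from_cov:.3f}")
print(f"after rescaling x: cov={cov_scaled}, r={r_scaled:.3f}")
```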


Association
It represents the connection between two variables. This connection can be something less strictly defined. For example, a certain smell can
remind you of someone or a place, or a certain sound can remind you of a certain event that was important to you.

Confounding variable
Sometimes a third variable that is not accounted for affects the relationship between the two variables under study.

Example: suppose a researcher collects data on ice cream sales and shark attacks and finds that the two variables are highly correlated. Does one cause the other? That is unlikely. The more likely cause is the confounding variable, temperature: when the temperature is warmer, more people buy ice cream and more people go to the ocean.

Requirements for confounding variables


- It must be correlated with the independent variable (temperature is correlated with ice cream sales).

- It must have a causal relationship with the dependent variable (temperature affects the number of shark attacks).
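The ice cream and shark attack example can be imitated with simulated data. In the sketch below every number is invented: temperature drives both variables, and neither affects the other. The raw correlation comes out high, but once the linear effect of temperature is removed from each variable (a partial correlation, used here as one simple way to account for a confounder) the remaining correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Simulated confounder and two variables that each depend only on it.
temperature = rng.normal(25, 5, n)
ice_cream_sales = 20 * temperature + rng.normal(0, 30, n)
shark_attacks = 0.5 * temperature + rng.normal(0, 1, n)

raw_r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing the temperature effect from both variables
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation:         {raw_r:.2f}")
print(f"controlling temperature: {partial_r:.2f}")
```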
Causation

- It means that changes in one variable directly cause changes in the other.

- For example: Smoking and cancer. A number of studies and collective evidence give strong indications that lung cancer is
causally related to tobacco consumption.

Causality implies association, but association does not imply causality.


Correlation Matrix

- It is a table showing the correlation coefficients between variables. Each cell in the table shows the correlation between
two variables.
- The value of the correlation coefficient (r) gives the strength and direction of the association as follows:

o The value of r ranges from -1 to 1.

o A positive value indicates a positive association, and a negative value indicates a negative association.
o The positive association is strongest when r = 1 and weakens as r decreases toward 0; likewise, the negative association is strongest when r = -1.
There are three broad reasons for computing a correlation matrix:

- To summarize a large amount of data where the goal is to see patterns. In the example above, the observable pattern is that the variables are highly correlated with each other.

- To input into other analyses. For example, people commonly use correlation matrices as inputs for some statistical analysis models.

- As a diagnostic when checking other analyses. For example, with linear regression, high correlations among the predictor variables suggest that the linear regression estimates will be unreliable.
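A correlation matrix can be computed with `DataFrame.corr()` in pandas. The sketch below uses simulated columns (`height_cm`, `weight_kg`, `commute_min` are made-up names) and then lists the pairs whose absolute correlation exceeds 0.8, which is one way to use the matrix as a diagnostic; the 0.8 cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Simulated data: weight depends on height, commute time is unrelated.
rng = np.random.default_rng(1)
height = rng.normal(170, 10, 300)
df = pd.DataFrame({
    "height_cm": height,
    "weight_kg": 0.9 * height - 85 + rng.normal(0, 5, 300),
    "commute_min": rng.normal(30, 10, 300),
})

corr = df.corr()   # Pearson correlation by default
print(corr.round(2))

# Keep each pair once (upper triangle, diagonal excluded) and flag strong ones.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high = pairs[pairs.abs() > 0.8]   # assumed threshold
print(high)
```

The diagonal is always 1 because each variable is perfectly correlated with itself, and the matrix is symmetric, so only one triangle needs to be read.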
Multivariate Graphical EDA

Scatter plot
- A scatter plot represents individual pieces of data using dots. These plots make it easier to see whether two variables are related to each other. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between the two variables.
- It is considered a visual counterpart of the correlation matrix.

Source:CQE Academy

Source: chartio
Scatter matrix
When we have more than two variables and need to check their pairwise scatter plots, we use a scatter matrix. In the Seaborn library this is the pair plot, as shown.

To get the above plot, try the following:


import seaborn as sns

# Load seaborn's example penguins dataset (fetched online on first use).
penguins = sns.load_dataset("penguins")
# Pairwise scatter plots of the numeric columns, colored by species.
sns.pairplot(penguins, hue="species")

In addition, heat maps help us describe the relationships between variables using color coding.
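A correlation heat map can be drawn with `seaborn.heatmap`. The sketch below uses simulated columns (the names `x`, `y_pos`, `y_neg`, and `noise` are invented) so it runs without downloading a dataset; the color map and the fixed -1 to 1 color scale are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data with one positive, one negative, and one unrelated column.
rng = np.random.default_rng(7)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y_pos": 2 * x + rng.normal(scale=0.5, size=200),
    "y_neg": -x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})

corr = df.corr()

# annot=True writes each coefficient in its cell; vmin/vmax pin the color scale.
ax = sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1,
                 cmap="coolwarm", square=True)
ax.set_title("Correlation heat map")
```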
Steps in EDA

Variables Identification

Univariate Analysis

Bi/Multivariate Analysis

Missing Value Treatment

Outlier Removal
Variables Identification
- In this step, we identify every variable by discovering its type.
- According to our needs, we can change the datatype of any variable

Univariate Analysis
- Here, we study the individual characteristics of every feature/variable in the dataset.
• Continuous variable (histogram, box plot, KDE, and Q-Q plot)
• Categorical variable (bar chart, pie chart, and frequency table)

Bivariate Analysis
- Here, we study the relationship between any two variables, which can be categorical – continuous, categorical – categorical, or continuous – continuous.
• continuous – continuous: scatter plot, heat map, joint plot, pair plot
• categorical – continuous: factor plot (catplot in current Seaborn), swarm plot, violin plot, strip plot
• categorical – categorical: crosstab, stacked bar chart, bar chart

Extend the analysis to more than two variables for multivariate analysis.
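The remaining steps can be sketched end to end on a toy table. Everything here is illustrative: the DataFrame is made up, and filling missing values with the median or mode and dropping rows outside 1.5 × IQR are common choices assumed for the example, not methods the lecture specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: age stored as text, some missing values, one extreme income.
df = pd.DataFrame({
    "age": ["23", "31", "27", "45", "29", "38"],
    "income": [32000.0, 41000.0, np.nan, 39000.0, 250000.0, 36000.0],
    "city": ["Cairo", "Giza", "Cairo", None, "Giza", "Cairo"],
})

# 1. Variable identification: inspect the types and change them where needed.
print(df.dtypes)
df["age"] = df["age"].astype(int)
df["city"] = df["city"].astype("category")

# 2-3. Univariate and bivariate summaries.
print(df.describe())
print(df["city"].value_counts())

# 4. Missing value treatment (assumed strategy: median for numbers, mode for categories).
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 5. Outlier removal with the 1.5 * IQR rule (assumed convention).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(cleaned)
```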
