Unit 2 Lec4
All of that is good for an initial analysis, but we need a complete picture of our
data. We need to apply exploratory data analysis (EDA) to analyze effectively what is
happening with our data.
What is EDA?
It is a critical process of performing initial investigations on data to:
o Cast
The activity you have just done is like the activity you will perform on any new dataset. This is EDA.
The importance of EDA
- It ensures that the results produced are valid and applicable to the desired outcomes and goals of the activity (business).
- In machine learning, we use EDA as an initial step to understand the data. The key features in the dataset should be
visualized before implementing a machine learning algorithm.
Types of Data Analysis Techniques
When you talk about EDA, you come across three key terms:
Key Terms
Univariate Analysis
• In univariate analysis, only one variable is studied at a time; "uni"
means one and "variate" means variable.
Bivariate Analysis
• In bivariate analysis, there are exactly two variables, and the analysis
concerns the relationship between them.
Multivariate Analysis
• "Multi" refers to more than two variables, and the analysis concerns the
relationships among these variables.
Univariate non-Graphical Analysis
What are the key steps that we would take in analyzing a single column in our dataset?
• Measure of center
• Measure of Dispersion
• Measure of position
• Measure of shape
• Discover outliers
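The five steps above can be sketched with Python's standard `statistics` module. This is only a minimal illustration: the `values` column is made-up data, the skewness shown is the simple moment-based coefficient, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical single column (made-up values for illustration).
values = [22, 25, 25, 27, 30, 31, 33, 35, 38, 90]
n = len(values)

# Measure of center
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Measure of dispersion
data_range = max(values) - min(values)
variance = statistics.variance(values)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)      # sample standard deviation

# Measure of position: quartiles and interquartile range
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Measure of shape: moment-based skewness (positive means a longer right tail)
skewness = sum((x - mean) ** 3 for x in values) / (n * statistics.pstdev(values) ** 3)

# Discover outliers: points beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print("center:", mean, median, mode)
print("dispersion:", data_range, round(variance, 2), round(std_dev, 2))
print("position:", q1, q2, q3, "IQR =", iqr)
print("shape (skewness):", round(skewness, 2))
print("outliers:", outliers)
```

With a pandas DataFrame, `describe()` gives a similar summary for every numeric column at once.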
Univariate Graphical Analysis
This type of analysis helps you look graphically at the distribution of the data.
Pie charts
- It is a circular statistical chart that is divided into slices to
illustrate numerical proportions.
- In a pie chart, the central angle and the arc length of each slice are
proportional to the quantity or percentage it represents.
Bar Graphs
- It consists of horizontal or vertical bars that are separated from each other.
- It is a graphical representation of categorical (qualitative) or discrete data.
The height of each bar shows the frequency, and all bars have the same width.
- There is equal space between each pair of consecutive bars.
Histograms
Box plots
- It is used for detecting and illustrating changes in location and variation between
different groups of data.
- Box plots are very good at presenting information about how tightly the data is grouped,
its central tendency and skewness, as well as outliers.
But how can you identify the skewness in Box Plots?
Source: simplypsychology
Let us apply this information using Python.
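As a minimal sketch of how skewness can be read off a box plot, the code below computes the numbers a box plot draws (quartiles, whisker ends, outliers) and compares the two halves of the box. The helper names `box_plot_summary` and `skew_direction` and the sample data are made up for illustration, and comparing the median's distance to Q1 and Q3 is only a rule of thumb.

```python
import statistics

def box_plot_summary(values):
    """Return the quantities a box plot displays (1.5 * IQR whisker rule)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

def skew_direction(summary):
    """Rule of thumb: the longer half of the box points toward the skew."""
    upper_half = summary["q3"] - summary["median"]
    lower_half = summary["median"] - summary["q1"]
    if upper_half > lower_half:
        return "right-skewed (positive)"
    if upper_half < lower_half:
        return "left-skewed (negative)"
    return "roughly symmetric"

# Hypothetical groups of data
groups = {
    "A": [1, 2, 2, 3, 3, 4, 5, 8, 12],
    "B": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "C": [1, 5, 8, 9, 10, 10, 11, 11, 12],
}

for name, values in groups.items():
    summary = box_plot_summary(values)
    print(name, summary, "->", skew_direction(summary))
```

If the median sits closer to Q1 and the upper whisker is longer, the data is right-skewed; the mirror image indicates left skew.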
Multivariate analysis EDA
- Like univariate analysis, it involves both computing summary statistics and producing visual displays.
- Generally, the type of analysis depends on the nature of your data (numerical or categorical).
There are more analyses, but from a statistical point of view these are the most commonly used.
Multivariate non-Graphical EDA
Cross tabulation
▪ It is for categorical data (and numerical data with only a few distinct values).
▪ For two variables, it is performed by making a two-way table with column headings that match the levels of one variable and
row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels.
Source: Datatab
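The two-way table described above can be sketched with the standard library. The survey records below are made up for illustration; in practice a function such as `pandas.crosstab` is commonly used for the same job.

```python
from collections import Counter

# Hypothetical records: (gender, preferred drink) for each subject
records = [
    ("F", "tea"), ("M", "coffee"), ("F", "coffee"),
    ("M", "coffee"), ("F", "tea"), ("M", "tea"),
]

# Count how many subjects share each pair of levels
counts = Counter(records)

row_levels = sorted({row for row, _ in records})  # levels of the first variable
col_levels = sorted({col for _, col in records})  # levels of the second variable

# Two-way table: rows are one variable's levels, columns the other's
table = {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}

# Print the table with column headings
print(f"{'':8}" + "".join(f"{c:>8}" for c in col_levels))
for r in row_levels:
    print(f"{r:8}" + "".join(f"{table[r][c]:>8}" for c in col_levels))
```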
Covariance and correlation
Covariance
- Indicates whether the variables are positively or negatively related.
- Indicates the direction, but not the strength, of the linear relationship between variables.
- Formula:
$$\mathrm{cov}_{x,y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
Correlation
- Indicates the degree to which the variables are related.
- Measures both the direction and the strength of the linear relationship between two variables.
- Formula:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
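The two formulas can be sketched directly in Python. The `hours` and `score` lists are made-up data. Rescaling one variable shows why covariance gives only the direction while correlation also gives the strength.

```python
import math

def covariance(x, y):
    """Sample covariance: sum of products of deviations divided by N - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation coefficient r, always between -1 and +1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

# Hypothetical data: hours studied and exam score
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]

print(covariance(hours, score))             # positive: the variables move together
print(round(correlation(hours, score), 3))  # close to +1: strong linear relationship

# Changing the units of one variable changes the covariance but not r
score_x10 = [s * 10 for s in score]
print(covariance(hours, score_x10))
print(round(correlation(hours, score_x10), 3))
```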
Confounding variable
Sometimes a third variable that is not accounted for affects the relationship between the two variables under study.
Example: suppose a researcher collects data on ice cream sales and shark attacks and finds
that the two variables are highly correlated. Does one cause the other? That is unlikely.
A more likely cause is the confounding variable: temperature.
When the temperature is warmer, more people buy ice cream, and more people
go to the ocean.
- A confounding variable must have a causal relationship with the dependent variable (here, temperature and shark attacks).
Causation
- It means that changes in one variable directly cause changes in the other.
- For example: Smoking and cancer. A number of studies and collective evidence give strong indications that lung cancer is
causally related to tobacco consumption.
Correlation matrix
- It is a table showing the correlation coefficients between variables. Each cell in the table shows the correlation between
two variables.
- The value of the correlation coefficient (r) ranges from -1 to +1. Its sign gives the direction of the association and its
magnitude gives the strength.
A correlation matrix is typically used:
- To summarize a large amount of data where the goal is to see patterns. In our example above, the observable pattern is that
the entries in the diagonal are highly correlated with each other.
- As input to other analyses. For example, people commonly use correlation matrices as inputs for some statistical analysis
models.
- As a diagnostic when checking other analyses. For example, with linear regression, many high correlations among the
variables suggest that the linear regression estimates will be unreliable.
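A correlation matrix can be sketched by computing r for every pair of columns. The column names and values below are made up for illustration; with pandas, `DataFrame.corr()` is the usual shortcut.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

# Hypothetical dataset: one list per variable
columns = {
    "temperature": [20, 24, 28, 31, 35],
    "ice_cream_sales": [110, 135, 160, 190, 220],
    "umbrella_sales": [40, 33, 25, 20, 12],
}

names = list(columns)

# Each cell holds the correlation between the row variable and the column variable
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Print the matrix; the diagonal is 1 because each variable correlates perfectly with itself
print(f"{'':16}" + "".join(f"{name:>17}" for name in names))
for a in names:
    print(f"{a:16}" + "".join(f"{matrix[a][b]:>17.2f}" for b in names))
```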
Multivariate Graphical EDA
Scatter plot
- A scatter plot represents individual pieces of data using dots. These plots make it easier to see whether two variables are
related to each other. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between them.
- It can be considered a visual counterpart of the correlation matrix.
Source: CQE Academy
Source: chartio
Scatter matrix
Normally, when we have more than two variables and need to check their scatter plots, we use a scatter matrix. In the
Seaborn library this is the pair plot.
In addition, a heat map helps us describe the relationships using color coding.
Steps in EDA
Variables Identification
Univariate Analysis
Bi/Multivariate Analysis
Outlier Removal
Variables Identification
- In this step, we identify every variable by discovering its type.
- According to our needs, we can change the data type of any variable.
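A minimal sketch of this step with pandas, assuming pandas is installed. The column names and values are made up, and the target types are just one reasonable choice for this data.

```python
import pandas as pd

# Hypothetical raw data in which every column was read in as text
df = pd.DataFrame({
    "age": ["23", "35", "41"],
    "city": ["Cairo", "Giza", "Cairo"],
    "joined": ["2021-01-15", "2020-07-01", "2022-03-30"],
})

# Identify every variable by discovering its type
print(df.dtypes)

# Change the data types according to our needs
df["age"] = df["age"].astype("int64")       # numeric variable
df["city"] = df["city"].astype("category")  # categorical variable
df["joined"] = pd.to_datetime(df["joined"]) # date variable

print(df.dtypes)
```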
Univariate Analysis
- Here, we study the individual characteristics of every feature/variable in the dataset.
• Continuous variable (histogram, box plot, KDE, and Q-Q plot)
• Categorical variable (bar chart, pie chart, and frequency table)
Bivariate Analysis
- Here, we study the relationship between any two variables which can be categorical – continuous, categorical – categorical, or
continuous – continuous.
• continuous – continuous: scatter plot, heat map, joint plot, pair plot
• categorical – categorical: crosstab, stacked bar chart, bar chart
• categorical – continuous: factor plot, swarm plot, violin plot, strip plot
For multivariate analysis, extend the analysis to more than two variables.