
Section 1 Slide

Uploaded by

Lee Louise
Copyright
© © All Rights Reserved

1. Exploratory Data Analysis


Vu Dam- Data Mentor
Agenda

1. EDA Definition and Importance

2. EDA Purpose

3. EDA Process

4. Data Overview
EDA Definition and Importance

+ Developed by American mathematician John Tukey in the 1970s, EDA techniques are still widely used in the data discovery process.

+ In short, EDA is the process of investigating data sets to summarize their main characteristics and to clean the data.
EDA Purpose

1. To understand the data:
▪ Data type
▪ Data size
▪ Data meaning
▪ Feature (univariate analysis)
▪ Correlation (bivariate analysis, multivariate analysis)

2. To clean the data:
▪ Deal with missing values
▪ Deal with duplicated values
▪ Deal with outliers
Exploratory Data Analysis Process

1. Data Overview
▪ Data type
▪ Data size
▪ Data meaning

2. Statistical Measures
▪ Standard Deviation
▪ Variance
▪ Percentiles
▪ Quartile
▪ Quantiles

3. Clean Data
▪ Missing values
▪ Duplicated Values
▪ Outliers

4. Basic Analysis
▪ Univariate Analysis
▪ Bivariate Analysis
▪ Multivariate Analysis
Recap

- EDA is very important

- Purpose:
- Understand data
- Clean data

- Process:
- Data Overview
- Statistical Measures
- Clean Data
- Basic Analysis
2. Basic Statistical Measures
Vu Dam- Data Mentor
Agenda

1. Central Tendency

2. Mean, Median, and Mode

3. Data Distribution

4. Recap
Citation

- https://unitrain.edu.vn/do-lech-chuan-standard-deviation-la-gi-vi-du-ve-do-lech-chuan/

- https://www.careerlink.vn/en/careertools/economic-knowledge/standard-deviation-la-gi-cong-thuc-tinh-va-ung-dung

- https://vietnambiz.vn/qui-tac-da-kiem-chung-empirical-rule-la-gi-vi-du-ve-qui-tac-da-kiem-chung-20200107105607407.htm

- https://financebiz.vn/empirical-rule-la-gi/

- http://www.amathsdictionaryforkids.com

- https://vietnamnet.vn/nghi-van-diem-thi-bat-thuong-o-ha-giang-ho-mot-tieng-cung-da-co-bien-ban-462574.html
Central Tendency

+ A measure of central tendency describes a set of data by identifying the central position in the
data set as a single value

+ 3 most common measures: mean, median, and mode. In different situations some measures
become more appropriate to use than others.
Central Tendency – Mean and Median
Central Tendency - Mode
Data Distribution

+ Data distribution is a function that specifies all possible values for a variable and quantifies the relative frequency (probability of how often they occur).
Right Skew and Left Skew
+ Symmetric Data: Data sets whose values are evenly spread around the centre

+ Skewed Data: Data sets that are not symmetric

[Figure: example distributions labelled Left Skew, Zero Skew, and Right Skew]
Real Life Example
Recap

- Mean (average): The total sum of values divided by the total observations. The mean is
highly sensitive to the outliers.

- Median (center value): The middle value of an ordered sequence of numbers (the average of the two middle values when the count is even).
The median is far less sensitive to outliers than the mean.

- Mode (most common): The values most frequently observed. There can be more than one
modal value in the same variable.

- Symmetric Data: Data sets whose values are evenly spread around the centre

- Skewed Data: Data sets that are not symmetric


- Left (negative skewed): mean < median
- Right (positive skewed): mean > median
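
The three measures and the mean-versus-median rule in this recap can be tried on a small list. The sketch below uses only Python's standard `statistics` module; the `incomes` data and the `skew_direction` helper are made up for illustration and are not part of the original slides.

```python
from statistics import mean, median, multimode

# Hypothetical incomes; the last value is a deliberate outlier.
incomes = [30, 32, 32, 35, 38, 40, 41, 45, 300]

avg = mean(incomes)         # sum of values / number of observations
mid = median(incomes)       # middle value of the sorted data
modes = multimode(incomes)  # every value tied for most frequent


def skew_direction(values):
    """Classify skew with the mean-vs-median rule of thumb from the recap."""
    m, md = mean(values), median(values)
    if m > md:
        return "right (positive) skew"
    if m < md:
        return "left (negative) skew"
    return "no skew detected by this rule"


print(avg, mid, modes, skew_direction(incomes))
```

The single large value pulls the mean well above the median, which is why the rule reports a right skew here. Treat the rule as a quick heuristic rather than a formal skewness test.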
3. Basic Statistical Measures (cont)
Vu Dam- Data Mentor
Agenda

1. Measure of Spread and Variability- Range

2. Variance, Standard Deviation

3. Meaning and Application


Measure of Spread and Variability- Range
Variance & Standard Deviation

+ Standard Deviation: Measure of how much the individual scores of a data set differ from the
mean.

+ Variance: Square of standard deviation. Average of the squared differences from the mean.
Meaning and Application

- Detect Outliers

- Investment

- Weather Forecast
Real Life Example

- Mean City A: 94.6

- Mean City B: 86.1

- Stdv City A: 0.89

- Stdv City B: 5.7


Recap

- Range: Max - Min

- Standard deviation (concentration around the mean): The typical amount of deviation
(distance) from the mean. The std is affected by outliers. It is the square root of the
variance.
- For normally distributed data, about 99.7% of values lie within 3 standard deviations of the mean

- Variance (variability from the mean): The square of the standard deviation. It is also affected by
outliers.

- Detecting outliers: If a value lies more than 3 standard deviations away from the mean, that data point is
identified as an outlier.
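
As a minimal sketch of these definitions, the block below computes range, variance, and standard deviation with the standard library. It assumes the population formulas (`pvariance`, `pstdev`), since the slides do not say whether population or sample formulas are meant; the `temps` data and the `share_within` helper are hypothetical.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical daily temperatures.
temps = [21, 22, 20, 23, 21, 22, 20, 21, 23, 22]

value_range = max(temps) - min(temps)  # Range: Max - Min
variance = pvariance(temps)            # average squared distance from the mean
std = pstdev(temps)                    # square root of the variance


def share_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    mu, sigma = mean(values), pstdev(values)
    return sum(abs(v - mu) <= k * sigma for v in values) / len(values)


print(value_range, variance, std, share_within(temps, 1), share_within(temps, 3))
```

Comparing `share_within(values, 3)` with 0.997 is one rough way to see how far a data set departs from the normal-distribution rule of thumb.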
4. Quartiles and Interquartile Range (IQR)
Vu Dam- Data Mentor
Agenda

1. Quartiles

2. Interquartile Range (IQR)

3. Box and Whisker Plot


Quartiles

A score of 42 (Q1) represents the first quartile and is the 25th percentile.

Q1 tells us that 25% of the scores are less than 42 and 75% of the class scores are greater.

Q2 (the median) is the 50th percentile and shows that 50% of the scores are less than 62, and 50% of the scores are above 62.

Finally, Q3, the 75th percentile, reveals that 25% of the scores are greater than 68 and 75% are less than 68.
Interquartile Range (IQR)
Box and Whisker Plot

• Able to handle and present a large amount of data.

• A visually effective method of viewing a summary.

• A graphical way of showing outliers.

• Great for comparison of two or more datasets.
Real Life Example
Recap

- Quartiles: Q1, Q2 (median) , Q3 divide data into 4 equal parts

- Interquartile Range (IQR): Q3 – Q1

- Box and Whisker Plot:


- Able to handle and present a large amount of data.
- A visually effective method of viewing a summary.
- A graphical way showing outliers.
- Great for comparison of two or more datasets.
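
The quartiles, the IQR, and the five numbers a box and whisker plot is drawn from can be sketched as follows. The `scores` list is hypothetical (it is not the class data behind the slide), and the "inclusive" interpolation method is an assumption: other quartile conventions give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical class scores, not the data set shown on the slide.
scores = [35, 40, 42, 48, 55, 60, 62, 64, 66, 68, 70, 75]

# n=4 asks for the three cut points that split the data into 4 equal parts.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # Interquartile Range: Q3 - Q1

# The five values summarised by a box and whisker plot.
five_number_summary = (min(scores), q1, q2, q3, max(scores))

print(five_number_summary, iqr)
```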
5. Understand Percentiles
Vu Dam- Data Mentor
Agenda

1. Percentile

2. Percentile Ranking

3. Statistics Recap
Percentiles
Percentile Ranking
Recap

- Mean (average): The total sum of values divided by the total observations. The mean is
highly sensitive to the outliers.

- Median (center value): The middle value of an ordered sequence of numbers (the average of the two middle values when the count is even).
The median is far less sensitive to outliers than the mean.

- Mode (most common): The values most frequently observed. There can be more than one
modal value in the same variable.

- Symmetric Data: Data sets whose values are evenly spread around the centre

- Skewed Data: Data sets that are not symmetric


- Left (negative skewed): mean < median
- Right (positive skewed): mean > median
Recap

- Range: Max - Min

- Standard deviation (concentration around the mean): The typical amount of deviation
(distance) from the mean. It is affected by outliers
- For normally distributed data, about 99.7% of values lie within 3 standard deviations of the mean

- Variance (variability from the mean): The square of the standard deviation. It is also affected by
outliers.

- Detecting outliers: If a value lies more than 3 standard deviations away from the mean, that data point is
identified as an outlier.
Recap

- Quartiles: Q1, Q2 (median) , Q3 divide data into 4 equal parts

- Interquartile Range (IQR): Q3 – Q1

- Box and Whisker Plot:


- Able to handle and present a large amount of data.
- A visually effective method of viewing a summary.
- A graphical way showing outliers.
- Great for comparison of two or more datasets.

- Percentiles: Divide data into 100 equal parts

- Percentile Ranking: Percentage of scores that fall at or below a given score
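
The percentile ranking definition above translates directly into a few lines of code. This is a minimal sketch; the function name `percentile_rank` and the `class_scores` data are hypothetical.

```python
def percentile_rank(scores, score):
    """Percentage of scores that fall at or below `score`."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_or_below = sum(1 for s in scores if s <= score)
    return 100.0 * at_or_below / len(scores)


# Hypothetical class scores.
class_scores = [50, 55, 60, 62, 65, 70, 72, 80, 85, 90]
rank = percentile_rank(class_scores, 70)
print(rank)
```

Six of the ten scores are at or below 70, so a score of 70 sits at the 60th percentile of this list.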


6. Deal With Missing Values
Vu Dam- Data Mentor
Agenda

1. Definition

2. Missing Value Types

3. Reasons to Deal with Missing Values

4. Methods to Deal with Missing Values


What is Missing Value?

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an
observation. Missing data are a common occurrence and can have a significant effect on the conclusions that
can be drawn from the data.

Examples of Missing Value


Missing Value Types

▪ Structurally missing data
Structurally missing data is data that is missing for a logical reason. In other words, it is data that is missing because it should not exist.

▪ Missing completely at random
Whether a person has missing data is completely unrelated to the other information in the data.

▪ Missing at random
We can predict the value that is missing based on the other data.

▪ Missing not at random
The value of the variable that's missing is related to the reason it's missing.
Meaning

Missing data are problematic because, depending on the type, they can sometimes bias your results.
Methods to Deal with Missing Values

01 Deleting the column/row with missing data

02 Filling the missing data with the mean or median value if it’s a numerical variable.

03 Filling the missing data with the mode value if it’s a categorical variable.

Utilize Machine Learning

04 Algorithms that support missing values

05 Prediction of missing values
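
Methods 01 to 03 can be sketched with plain Python dictionaries. The `rows` data and the helper names `drop_incomplete` and `impute` are hypothetical, and `None` is assumed to mark a missing value; a real project would more likely use a data-frame library.

```python
from statistics import median, mode

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "city": "Hanoi"},
    {"age": None, "city": "Hue"},
    {"age": 31, "city": None},
    {"age": 40, "city": "Hanoi"},
]


def drop_incomplete(records):
    """Method 01: delete every row that has at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute(records, column, strategy):
    """Methods 02/03: fill missing entries of one column with a summary value."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = strategy(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


complete_only = drop_incomplete(rows)
# Median for the numerical column, mode for the categorical one.
filled = impute(impute(rows, "age", median), "city", mode)
print(complete_only)
print(filled)
```

Deleting rows is simplest but shrinks the data set; imputation keeps every row at the cost of inventing plausible values.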
Recap

- Missing Value Types:


- Structurally Missing Data
- Missing Completely at Random
- Missing at Random
- Missing not at Random

- Missing Value may lead to false insight

- Ways to deal with Missing Value:


- Delete
- Mean/ Mode/ Median Imputation
- Machine Learning
- Prediction
Citation

- Kumar, S. (2021, September 28). 7 ways to handle missing values in machine learning. Medium. Retrieved August 18, 2022, from https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e

- Bock, T. (2020, December 7). What are the different types of missing data? Displayr. Retrieved August 18, 2022, from https://www.displayr.com/different-types-of-missing-data/

- Bhandari, P. (2021, December 14). How to deal with missing data. Scribbr. Retrieved August 18, 2022, from https://www.scribbr.com/statistics/missing-data/
7. Deal With Duplicated Values
Vu Dam- Data Mentor
Agenda

1. Definition

2. Reasons to Deal with Duplicated Values

3. Method to Deal with Duplicated Values


Definition

Duplicate data is any record that inadvertently shares data with another record in a database. Duplicate data
is easy to identify, and it mostly occurs when transferring data between systems. The most common
form of duplicate data is a complete carbon copy of a record.

Examples of Duplicate Value


Reasons to Deal with Duplicated Values

- Costs and lost productivity

- Suffering brand reputation

- Disjointed customer service

- Lack of a single customer view

- Storage costs
Ways to deal with Duplicated values

- Define Level of matching


- Merge Duplicates
Recap

- Duplicated Values: 2 or more records are the same in the data set

- Ways to deal with Duplicated Value:


- Define level of matching
- Merge duplicates
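
The two steps in this recap can be sketched as follows. Here the "level of matching" is assumed to mean the set of fields that must agree, compared after trimming and lower-casing; the slides do not spell out a matching rule, so `match_key`, `merge_duplicates`, and the `customers` data are all hypothetical.

```python
def match_key(record, fields):
    """Level of matching: the chosen fields, trimmed and lower-cased."""
    return tuple(str(record.get(f, "")).strip().lower() for f in fields)


def merge_duplicates(records, fields):
    """Merge records with the same key, keeping the first and filling its gaps."""
    merged = {}
    for rec in records:
        key = match_key(rec, fields)
        if key not in merged:
            merged[key] = dict(rec)
        else:
            for name, value in rec.items():
                if merged[key].get(name) in (None, ""):
                    merged[key][name] = value
    return list(merged.values())


# Hypothetical customer records; the first two differ only in letter case and spacing.
customers = [
    {"email": "an@example.com", "name": "An", "phone": None},
    {"email": "AN@example.com ", "name": "An", "phone": "0123"},
    {"email": "binh@example.com", "name": "Binh", "phone": "0456"},
]
deduped = merge_duplicates(customers, fields=["email"])
print(deduped)
```

A stricter level of matching (for example `fields=["email", "name"]`) merges fewer records; a looser one risks merging records that are genuinely different.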
Citation

- https://blog.hubspot.com/customers/data-duplication-and-hubspot-impact-your-business
8. Deal With Outliers: Detect
Vu Dam- Data Mentor
Agenda

1. Definition

2. Causes and Reasons to Deal with Outlier

3. Methods to Detect Outlier


Definition

An outlier is a value or point that differs substantially from the rest of the data. In simple terms, an outlier is
an extremely high or extremely low data point in a data graph or dataset you're working with.

Examples of Outliers
Causes and Reasons to Deal with Outlier

Outliers can have many causes, such as:

● Measurement or input error.


● Data corruption.
● True outlier observation (ex. Michael Jordan in basketball)

Reasons to Deal with:

● Affect Statistical Results => Unreliable insights


● Affect machine learning models
Methods to Detect Outlier

- Domain Knowledge

- Statistical Indicators:
● Distance from the mean in standard deviations (default: 3)
● Distance below Q1 or above Q3 as a multiple of the interquartile range (default: 1.5)
Recap

- Outlier: a value that differs substantially from the rest of the data

- Methods to Detect Outliers:


- Domain knowledge
- Distance from the mean in standard deviations (default: 3)
- Distance from the interquartile range by a multiple of the interquartile range
(default: 1.5)
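
Both statistical indicators can be sketched with the standard library. The block assumes the population standard deviation and the "inclusive" quartile method, neither of which the slides specify; the function names and the `data` list are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, k=3.0):
    """Values lying more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > k]


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(zscore_outliers(data), iqr_outliers(data))
```

One design note: an extreme value also inflates the mean and standard deviation it is measured against, so in a sample of 10 or fewer points no value can exceed 3 population standard deviations. The IQR rule does not have this limitation, which is one reason it is often preferred for small data sets.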
Citation

- https://dataschool.com/fundamentals-of-analysis/what-is-an-outlier/

- https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
9. Deal With Outliers: Remove or Keep or Change
Vu Dam- Data Mentor
Agenda

1. Remove Method

2. Keep Method

3. Change Method
Remove Method

Conditions:

● A measurement error or data entry error, correct the error if possible. If you can’t fix it, remove that
observation because you know it’s incorrect.

● Not a part of the population you are studying (i.e., unusual properties or conditions).

● Not in the above conditions:

○ Use statistics to remove outliers


Keep Method

Conditions:

● True outlier observation (ex. Michael Jordan in basketball)

● The statistical measure you use to get insight is not affected much by these outliers
Change Method

Conditions:

● You care about the other data in the outlier's record

● You don’t want the outlier to affect the statistical measures too much

Methods:

● Quantile (range from any value to any other value) based flooring and capping

● Mean/Median imputation
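
Both change methods can be sketched as below. The percentile limits (10th and 90th), the `v > 100` outlier test, and the `salaries` data are arbitrary choices made for the example, and the helper names are hypothetical.

```python
from statistics import median, quantiles


def cap_to_quantiles(values, lower_pct=5, upper_pct=95):
    """Flooring and capping: clamp values to the chosen percentile limits."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    floor, cap = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, floor), cap) for v in values]


def replace_with_median(values, is_outlier):
    """Median imputation: swap each outlier for the median of the other values."""
    typical = median([v for v in values if not is_outlier(v)])
    return [typical if is_outlier(v) else v for v in values]


# Hypothetical salaries with one extreme value.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]
capped = cap_to_quantiles(salaries, lower_pct=10, upper_pct=90)
imputed = replace_with_median(salaries, is_outlier=lambda v: v > 100)
print(capped)
print(imputed)
```

Both approaches keep the record, so its other fields stay available, while limiting how far the extreme value can pull the mean and standard deviation.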
Recap

- Remove Method

- Keep Method

- Change Method

- Quantile based flooring and capping


- Mean/Median imputation
Clean Data Recap

- Ways to deal with Missing Value:


- Delete
- Mean/ Mode/ Median Imputation
- Machine Learning
- Prediction

- Ways to deal with Duplicated Value:


- Define level of matching
- Merge duplicates

- Ways to deal with Outliers:


- Remove
- Keep
- Quantile based flooring and capping
Citation

● https://statisticsbyjim.com/basics/remove-outliers/

● https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/

● https://www.pluralsight.com/guides/cleaning-up-data-from-outliers
10. Univariate Analysis: Types of Variables
Vu Dam- Data Mentor
Types of Variables
Recap

Numerical (quantitative) variables: numerical values for which it is sensible to add, subtract, take averages, etc.

● Continuous: can take any value within a range (account balance, height, …)

● Discrete: countable values, typically whole numbers (age, score, …)

Categorical (qualitative) variables: a limited number of distinct categories (which may be coded as numbers), for which
arithmetic operations are not sensible

● Ordinal: levels have an inherent ordering (ranking, level of education, …)

● Regular (nominal) categorical: no inherent ordering (country, gender, …)


11. Univariate Analysis: Introduction
Vu Dam- Data Mentor
Agenda

1. Definition and Meaning

2. Methods

3. Practice
Definition and Meaning

The term univariate analysis refers to the analysis of one variable

The purpose of univariate analysis is to understand the distribution of values for a single
variable. You can contrast this type of analysis with the following:

● Bivariate Analysis: The analysis of two variables.

● Multivariate Analysis: The analysis of two or more variables.


Method 1: Summary Statistics

Mean: 3.8

Median: 4

Range: 6

IQR: 2.5

Stdv: 1.87

(continuous variable)
Method 2: Frequency Distribution

continuous variable +
categorical variable
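
A frequency distribution can be built for both kinds of variable. The sketch below counts categories directly and groups a continuous variable into fixed-width bins first; the `pets` and `heights` data, the 10-unit bin width, and the function names are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Categorical variable: (value, count, relative frequency) per category."""
    counts = Counter(values)
    total = len(values)
    return [(value, n, n / total) for value, n in counts.most_common()]


def binned_frequency(values, bin_width):
    """Continuous variable: count values per bin, keyed by each bin's lower edge."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return sorted(bins.items())


# Hypothetical data.
pets = ["dog", "cat", "dog", "fish", "dog", "cat"]
heights = [151.0, 158.5, 160.2, 163.9, 167.4, 171.1, 172.8, 179.5]

print(frequency_table(pets))
print(binned_frequency(heights, 10))
```

The bin width is a judgment call: too wide hides the shape of the distribution, too narrow leaves most bins nearly empty.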
Method 3: Charts
Method 3: More Charts
Recap

● Univariate analysis : analysis of one variable

● Methods:

○ Summary Statistics

○ Frequency Distribution Table

○ Charts: box plot, density curve, pie , bar


Citation

● https://www.statology.org/univariate-analysis/
12. Bivariate Analysis: Introduction
Vu Dam- Data Mentor
Agenda

1. Definition and Meaning

2. Types and Methods

3. Numerical vs Numerical
Definition and Meaning

The term bivariate analysis refers to the analysis of any concurrent relation between two variables.

This analysis explores the relationship of two variables as well as the depth of this relationship to figure out if there are any discrepancies between two variables and any causes of this difference.
Types and Methods

● Numerical and Numerical

● Categorical and Categorical

● Numerical and Categorical


Numerical vs Numerical: Scatter Plot
Numerical vs Numerical: Correlation Coefficient

Correlation coefficient formulas are used to find how strong a relationship is between data. The formulas return a value between -1 and 1, where:

● 1 indicates a perfect positive linear relationship.
● -1 indicates a perfect negative linear relationship.
● A result of zero indicates no linear relationship at all.
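
The slide does not say which correlation formula it has in mind. As an illustration, the sketch below assumes the Pearson correlation coefficient, the most common choice for two numerical variables; the function name `pearson_r` and all data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sx * sy)


# Hypothetical study hours and exam scores.
hours = [1, 2, 3, 4, 5]
exam_scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, exam_scores)

# A U-shaped relationship: clearly related, yet the coefficient is zero.
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r, r_curve)
```

The second call shows why a coefficient of zero only rules out a linear relationship: the points follow a curve exactly, but the upward and downward halves cancel out.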
Numerical vs Numerical: Correlation Coefficient and Scatter Plot

Correlation Coefficient = 1 Correlation Coefficient = -1 Correlation Coefficient = 0


Numerical vs Numerical: More on Correlation Coefficient

Correlation Coefficient = 0.694


Numerical vs Numerical: More on Correlation Coefficient

Strong positive correlation (rule of thumb) when
correlation coefficient >= 0.6

Strong negative correlation (rule of thumb) when
correlation coefficient <= -0.6
Numerical vs Numerical: Practical Examples
Numerical vs Numerical: Linear Regression Equation
Bivariate Analysis Meaning

● Seeing relationships between variables

● Helps to eliminate unnecessary features for the model
Numerical vs Numerical: Final Note

Correlation does not mean Causation


Recap

● Bivariate analysis : analysis of relationships between two variables

○ Numerical and Numerical

○ Categorical and Categorical

○ Numerical and Categorical

● Methods for Numerical vs Numerical:

○ Scatter Plot

○ Correlation Coefficient

○ Linear Regression Equation

● Correlation does not mean Causation


Citation

● https://statisticsbyjim.com/basics/correlations/

● https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/

● https://www.statology.org/correlation-examples-in-real-life/
13. Bivariate Analysis: Introduction (continued)
Vu Dam- Data Mentor
Agenda

1. Categorical and Categorical

2. Numerical and Categorical


14. Multivariate Analysis: Introduction and Pairplot
Vu Dam- Data Mentor
Agenda

1. Definition and Advantage

2. Types + Methods

3. Real – life example


Definition and Advantage

• Multivariate Analysis is defined as a process involving multiple dependent variables resulting in one outcome.

• Advantage: the conclusion drawn is more accurate, realistic and nearer to the real-life
situation.

• Disadvantage:

• It requires rather complex computations to arrive at a satisfactory conclusion.

• It is a time-consuming process.
Types

• Dependence Method:

• One or some of the variables are dependent on others.

• Dependence looks at cause and effect; in other words, can the values of two or more
independent variables be used to explain, describe, or predict the value of another,
dependent variable?

• Independent Variable: size, age, location, neighborhood, …

• Dependent Variable: price

• In machine learning, dependence techniques are used to build predictive models


Types

• Interdependence Method:

• It is used to understand the structural makeup and underlying patterns within a dataset.

• It seeks to give meaning to a set of variables or to group them together in meaningful ways.
Types
Methods

• Many complicated ones:

• Multiple Linear Regression

• Multiple logistic regression

• Multivariate analysis of variance (MANOVA)

• Factor analysis

• Cluster analysis

• …………………
Real Life Example

In the recent event of COVID-19, a team of data scientists predicted that Delhi would have more
than 500,000 COVID-19 patients by the end of July 2020. This analysis was based on multiple
variables like government decision, public behavior, population, occupation, public transport,
healthcare services, and overall immunity of the community.

https://www.mygreatlearning.com/academy/learn-for-free/courses/multivariate-time-series-on-covid-data?gl_blog_id=17681
PairPlot
Recap

● Multivariate analysis: involves multiple dependent variables resulting in one outcome.

● Types:

○ Dependence Method

○ Interdependence Method

● Methods: complicated , used for ML

● Pair plot
Citation

● https://www.mygreatlearning.com/blog/introduction-to-multivariate-analysis/

● https://careerfoundry.com/en/blog/data-analytics/multivariate-analysis/
15. Transform Data – Feature Engineering
Vu Dam- Data Mentor
Agenda

1. Definition

2. Importance

3. Examples
Definition + Importance

• Feature engineering is a machine learning technique that leverages data to create new
variables that aren’t in the training set.

• It can produce new features for both supervised and unsupervised learning, with the goal of
simplifying and speeding up data transformations while also enhancing model accuracy.
Examples

Categorical Encoding Feature Splitting
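
The slide's own examples were images and did not survive extraction, so the sketch below is only an assumed illustration of the two techniques: one-hot encoding as one form of categorical encoding, and splitting a full name into two features. The `people` data and both helper names are hypothetical.

```python
def one_hot(records, column):
    """Categorical encoding: replace one column with a 0/1 column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            row[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(row)
    return encoded


def split_full_name(records, column="full_name"):
    """Feature splitting: turn one text column into first and last name columns."""
    out = []
    for r in records:
        first, _, last = r[column].partition(" ")
        row = {k: v for k, v in r.items() if k != column}
        row["first_name"], row["last_name"] = first, last
        out.append(row)
    return out


# Hypothetical records.
people = [
    {"full_name": "Lan Nguyen", "color": "red"},
    {"full_name": "Minh Tran", "color": "blue"},
]
features = split_full_name(one_hot(people, "color"))
print(features)
```

In both cases the original column is replaced by new columns that a model can use more directly than the raw text.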



Recap

● Feature Engineering: create new features

● Purpose:

○ Get insight

○ Speed up the ML process

● Methods: many
Citation

● https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10
