Section 1 Slide
Section 1 Slide
2. EDA Purpose
3. EDA Process
4. Data Overview
EDA Definition and Importance
+ Developed by American
mathematician John Tukey in the
1970s, EDA techniques is still
widely used in data discovery
process.
▪ Quantiles
Recap
- Purpose:
- Understand data
- Clean data
- Process:
- Data Overview
- Statistical Measures
- Clean Data
- Basic Analysis
2. Basic Statistical Measures
Vu Dam- Data Mentor
Agenda
1. Central Tendency
3. Data Distribution
4. Recap
Citation
- https://unitrain.edu.vn/do-lech-chuan-standard-deviation-la-gi-vi-du-ve-do-lech-chuan/
- https://www.careerlink.vn/en/careertools/economic-knowledge/standard-deviation-la-gi-cong-thuc-tinh-va
-ung-dung
- https://vietnambiz.vn/qui-tac-da-kiem-chung-empirical-rule-la-gi-vi-du-ve-qui-tac-da-kiem-chung-2020010
7105607407.htm
- https://financebiz.vn/empirical-rule-la-gi/
- http://www.amathsdictionaryforkids.com
- https://vietnamnet.vn/nghi-van-diem-thi-bat-thuong-o-ha-giang-ho-mot-tieng-cung-da-co-bien-ban-46
2574.html
Central Tendency
+ A measure of central tendency describes a set of data by identifying the central position in the
data set as a single value
+ 3 most common measures: mean, median, and mode. In different situations some measures
become more appropriate to use than others.
Central Tendency – Mean and Median
Central Tendency - Mode
Data Distribution
Right Skew
Right Skew and Left Skew
+ Symmetric Data: Data sets whose values are evenly spread around the centre
Zero Skew
- Mean (average): The total sum of values divided by the total observations. The mean is
highly sensitive to the outliers.
- Median (center value): The total count of an ordered sequence of numbers divided by 2.
The median is not affected by the outliers.
- Mode (most common): The values most frequently observed. There can be more than one
modal value in the same variable.
- Symmetric Data: Data sets whose values are evenly spread around the central
+ Standard Deviation: Measure of how much the individual scores of a data set differ from the
mean.
+ Variance: Square of standard deviation. Average of the squared differences from the mean.
Variance & Standard Deviation
Meaning and Application
- Detect Outliers
- Investment
- Weather Forecast
Real Life Example
- Standard deviation (concentrated around the mean): The standard amount of deviation
(distance) from the mean. The std is affected by the outliers. It is the square root of the
variance.
- 99.7 % data is within 3 standard deviations
- Variance (variability from the mean): The square of the standard deviation. It is also affected by
outliers.
- Detected Outliers: If a value is 3 standard deviations away from the mean, that data point is
identified as an outlier.
4. Quartiles and Interquartile
Range (IQR)
Vu Dam- Data Mentor
Agenda
1. Quartiles
Q1 tells us that 25% of the scores are less than 42 and 75% of the
class scores are greater.
Q2 (the median) is the 50th percentile and shows that 50% of the
scores are less than 62, and 50% of the scores are above 62.
Finally, Q3, the 75th percentile, reveals that 25% of the scores
are greater and 75% are less than 68.
Interquartile Range (IQR)
Box and Whisker Plot
1. Percentile
2. Percentile Ranking
3. Statistics Recap
Percentiles
Percentile Ranking
Percentile Ranking
Recap
- Mean (average): The total sum of values divided by the total observations. The mean is
highly sensitive to the outliers.
- Median (center value): The total count of an ordered sequence of numbers divided by 2.
The median is not affected by the outliers.
- Mode (most common): The values most frequently observed. There can be more than one
modal value in the same variable.
- Symmetric Data: Data sets whose values are evenly spread around the central
- Standard deviation (concentrated around the mean): The standard amount of deviation
(distance) from the mean. It is affected by outliers
- 99.7 % data is within 3 standard deviations
- Variance (variability from the mean): The square of the standard deviation. It is also affected by
outliers.
- Detected Outliers: If a value is 3 standard deviations away from the mean, that data point is
identified as an outlier.
Recap
▪ Quantiles
6. Deal With Missing Values
Vu Dam- Data Mentor
Agenda
1. Definition
In statistics, missing data, or missing values, occur when no data value is stored for the variable in an
observation. Missing data are a common occurrence and can have a significant effect on the conclusions that
can be drawn from the data.
Missing at random
Missing not at random C D We can predict the value that is
The value of the variable that's missing missing based on the other data.
is related to the reason it's missing.
Missing Value Types
C D
Missing Value Types
Missing at random
C D We can predict the value that is
missing based on the other data.
Missing Value Types
Missing at random
Missing not at random C D We can predict the value that is
The value of the variable that's missing missing based on the other data.
is related to the reason it's missing.
Meaning
Missing data are problematic because, depending on the type, they can sometimes bias your results.
Meaning
- Kumar, S. (2021, September 28). 7 ways to handle missing values in machine learning. Medium.
Retrieved August 18, 2022, from
https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a632
6adf79e
- Bock, T. (2020, December 7). What are the different types of missing data? Displayr. Retrieved
August 18, 2022, from https://www.displayr.com/different-types-of-missing-data/
- Bhandari, P. (2021, December 14). How to deal with missing data. Scribbr. Retrieved August 18,
2022, from https://www.scribbr.com/statistics/missing-data/
7. Deal With Duplicated Values
Vu Dam- Data Mentor
Agenda
1. Definition
Duplicate data is any record that inadvertently shares data with another record in a Database. Duplicate data
is easy to identify, and it mostly occurs when transferring data between systems. The most popular
occurrence of duplicate data is a complete carbon copy of a record.
- Storage costs
Ways to deal with Duplicated values
- Duplicated Values: 2 or more records are the same in the data set
- https://blog.hubspot.com/customers/data-duplication-and-hubspot-impact-your-business
8. Deal With Outliers: Detect
Vu Dam- Data Mentor
Agenda
1. Definition
An outlier is a value or point that differs substantially from the rest of the data. In simple terms, an outlier is
an extremely high or extremely low data point in a data graph or dataset you're working with.
Examples of Outliers
Causes and Reasons to Deal with Outlier
- Domain Knowledge
- Statistical Indicators:
● Distance from the mean in standard deviations (default: 3)
● Distance from the interquartile range by a multiple of the interquartile range (default: 1.5)
Recap
- Outlier: a value that differs substantially from the rest of the data
- https://dataschool.com/fundamentals-of-analysis/what-is-an-outlier/
- https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
9. Deal With Outliers:
Remove or Keep or Change
Vu Dam- Data Mentor
Agenda
1. Remove Method
2. Keep Method
3. Change Method
Remove Method
Conditions:
● A measurement error or data entry error, correct the error if possible. If you can’t fix it, remove that
observation because you know it’s incorrect.
● Not a part of the population you are studying (i.e., unusual properties or conditions).
Conditions:
● The statistics measure you use to get insight is not affected much by these outliers
Change Method
Conditions:
● Don’t want the outlier affect the statistic measures too much
Methods:
● Quantile (range from any value to any other value) based flooring and capping
● Mean/Median imputation
Recap
- Remove Method
- Keep Method
- Change Method
▪ Quantiles
Citation
● https://statisticsbyjim.com/basics/remove-outliers/
● https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
● https://www.pluralsight.com/guides/cleaning-up-data-from-outliers
10. Univariate Analysis:
Types of Variables
Vu Dam- Data Mentor
Types of Variables
Recap
Numerical (quantitative) variables: numerical values, sensible to add, subtract, take averages, etc
Categorical (qualitative) variables: limited number of distinct categories (can be number) , not sensible to
do arithmetic operations
2. Methods
3. Practice
Definition and Meaning
The purpose of univariate analysis is to understand the distribution of values for a single
variable. You can contrast this type of analysis with the following:
Mean: 3.8
Median: 4
Range: 6
IQR: 2.5
Stdv: 1.87
(continuous variable)
Method 2: Frequency Distribution
continuous variable +
categorical variable
Method 3: Charts
Method 3: More Charts
Recap
● Methods:
○ Summary Statistics
● https://www.statology.org/univariate-analysis/
12. Bivariate Analysis:
Introduction
Vu Dam- Data Mentor
Agenda
3. Numerical vs Numerical
Definition and Meaning
● Seeing relationships
between variables
● Help to eliminate
unnecessary features for
model
Numerical vs Numerical: Final Note
○ Scatter Plot
○ Correlation Coefficient
● https://statisticsbyjim.com/basics/correlations/
● https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/
● https://www.statology.org/correlation-examples-in-real-life/
13. Bivariate Analysis:
Introduction (continued)
Vu Dam- Data Mentor
Agenda
2. Types + Methods
• Advantage: the conclusion drawn is more accurate, realistic and nearer to the real-life
situation.
• Disadvantage:
• It is a time-consuming process.
Types
• Dependence Method:
• Dependence looks at cause and effect; in other words, can the values of two or more
independent variables be used to explain, describe, or predict the value of another,
dependent variable?
• Interdependence Method:
• Factor analysis
• Cluster analysis
• …………………
Methods
• Factor analysis
• Cluster analysis
• …………………
Real Life Example
In the recent event of COVID-19, a team of data scientists predicted that Delhi would have more
than 500,000 COVID-19 patients by the end of July 2020. This analysis was based on multiple
variables like government decision, public behavior, population, occupation, public transport,
healthcare services, and overall immunity of the community.
https://www.mygreatlearning.com/academy/learn-for-free/courses/multivariate-time-series-on
-covid-data?gl_blog_id=17681
PairPlot
Recap
● Types:
○ Dependence Method
○ Interdependence Method
● Pair plot
Citation
● https://www.mygreatlearning.com/blog/introduction-to-multivariate-analysis/
● https://careerfoundry.com/en/blog/data-analytics/multivariate-analysis/
15. Transform Data
– Feature Engineering
Vu Dam- Data Mentor
Agenda
1. Definition
2. Importance
3. Examples
Definition + Importance
• Feature engineering is a machine learning technique that leverages data to create new
variables that aren’t in the training set.
• It can produce new features for both supervised and unsupervised learning, with the goal of
simplifying and speeding up data transformations while also enhancing model accuracy.
Examples
● Purpose:
○ Get insight
○ Fasten ML process
● Methods: many
Citation
● https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machin
e-learning-2080b0269f10