
Data Preprocessing

Lecturer: Lyheang UNG


Table of Contents

01 Types of Data
02 Types of Variable
03 Data Cleaning
04 Feature Scaling
05 Feature Encoding
06 Feature Correlation


1. Types of Data

[Diagram: Data splits into Categorical (Qualitative) and Numerical (Quantitative); Categorical splits into Nominal and Ordinal, Numerical splits into Interval and Ratio.]


1. Categorical Data
Categorical Data refers to a collection of information divided into groups that represent a characteristic or quality of someone or something. It has no mathematical meaning and is classified into two groups:

● Nominal
○ It has no order (the values carry the same meaning even when reordered).
○ E.g. gender, religion, color, language
● Ordinal
○ It has an order/rank.
○ E.g. study grade, racing rank, language level, customer satisfaction
1. Categorical Data
● Graphical representation (e.g. Sex): bar chart, pie chart

1. Numerical Data
Numerical Data refers to discrete or continuous values representing the quantity or measurement of something. It has mathematical meaning and is classified into two groups:

● Interval
○ The distances between values are equal, but there is no true zero.
○ E.g. temperature (°C, °F) [can be negative], time of day (12-hour clock), IQ score
● Ratio
○ Interval data that additionally has a true zero.
○ E.g. length, height, weight, age

▪ The difference between interval and ratio scales comes down to the zero point: interval scales can dip below zero, while ratio scales have a true zero representing the absence of the quantity.
1. Numerical Data
● Graphical representation (e.g. Age): histogram, line graph, boxplot


1. Detailed Properties

Property                                      Nominal   Ordinal   Interval   Ratio
Order is known                                            ✓         ✓         ✓
Count/Frequency                                  ✓        ✓         ✓         ✓
Mode                                             ✓        ✓         ✓         ✓
Median                                                              ✓         ✓
Mean                                                                ✓         ✓
Can add and subtract values                                         ✓         ✓
Can multiply and divide values                                                ✓
Has a "true zero" representing the absence
of the property being measured                                                ✓
2. Types of Variable
● Independent variables are regarded as the inputs to a system and may take on different values freely. They are also called predictor or explanatory variables, denoted by x.
● Dependent variables are those whose values change as a consequence of changes in other values in the system. They are also called response variables, denoted by y.

[Diagram: x → h(x) → y, e.g. Weight → Height]
3. Data Cleaning
● Data Cleaning is the process of identifying corrupt or inaccurate parts of the data and correcting them by replacing, modifying, or deleting the dirty or coarse records.
● The data have to be cleaned based on the context or problem so that they can be further analyzed and processed, especially before applying any Machine Learning algorithms.
3.1. Low Variation
For variables with very low variation, we should either:
✓ Standardize all variables, or use the standard deviation to account for variables with different scales.
✓ Drop variables with zero variance (when the variable contains only one value) due to their lack of predictive power.

• var(x) = (1/n) Σᵢ₌₁ⁿ (xᵢ − mean(x))²
• std(x) = √var(x)

● Data sets with similar values are said to have little variability, while data sets whose values are spread out have high variability.
● A small std suggests values close to the mean; a high std suggests that values are spread out.
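As a quick illustration, here is a minimal Python sketch (assuming scikit-learn is available; the toy matrix is made up) that drops zero-variance features:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the middle column is constant (zero variance).
X = np.array([[1.0, 5.0, 3.0],
              [2.0, 5.0, 1.0],
              [3.0, 5.0, 2.0]])

selector = VarianceThreshold(threshold=0.0)   # remove features whose variance is <= 0
X_reduced = selector.fit_transform(X)
print(selector.get_support())                 # [ True False  True]
print(X_reduced.shape)                        # (3, 2)
```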
3.2. Missing Value
Missing values may appear as NA, n/a, an empty string, ?, -1, -99, -999, null, etc. They should be treated carefully based on the given context, because they affect the quality of the model during training.
3.2. Missing Value
Possible ways to deal with missing values:
✓ Delete those records – not efficient when the dataset is small
✓ Delete those columns
✓ Apply imputation:
○ Numerical variable: replace the missing values with the mean, median, or mode
○ Categorical variable: replace the missing values with the highest-frequency value (mode)

• mean(x) = x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
• median(x) = middle value of x (when the data are arranged in order)
• mode(x) = most common value of x
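A minimal sketch of these imputation strategies, assuming scikit-learn and pandas; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries in both column types.
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["PP", "PP", None, "SR"]})

num_imp = SimpleImputer(strategy="mean")            # or "median"
cat_imp = SimpleImputer(strategy="most_frequent")   # mode, for categoricals

df[["age"]] = num_imp.fit_transform(df[["age"]])
df[["city"]] = cat_imp.fit_transform(df[["city"]])
print(df)   # age gap filled with 32.0 (mean); city gap filled with "PP" (mode)
```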
3.2. Missing Value
● Vishal Patel suggests applying imputation only when less than 50% of a feature's values are missing.
● A feature with more than 95% of its values missing should be filtered out.
3.3. Outlier
● In statistics, an Outlier is an observation that lies an abnormal distance from the other observations in the dataset. In some cases, it is considered corrupt or incorrect data.
3.3. Outlier
● As there is no precise general definition of an outlier, we, or a domain expert, must interpret the raw observations and decide whether a value is an outlier.
● Outliers deserve special attention, or should be removed entirely for certain algorithms (e.g. Linear Regression, K-Means) because of their sensitivity to extreme values.
● Common causes of outliers:
○ Data entry errors (human errors)
○ Measurement errors (instrument errors)
○ Sampling errors (extracting or mixing data from wrong or various sources)
○ Natural (not an error; novelties in the data)
3.3. Outlier
A data point x is considered an outlier when:
● It lies outside the mean plus or minus 3 times the standard deviation of the variable, if the data follow a Gaussian/Normal distribution:

Z-score(x) < −3 or Z-score(x) > 3

● It falls more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile:

x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR
3.3. Z-score
● The Z-score measures how far a particular data point is from the mean of a dataset, in terms of standard deviations.
● In a Gaussian/Normal distribution, nearly 99.7 percent of the data fall between Z-scores of −3 and +3.

Z-score(x) = (x − mean(x)) / std(x)
3.3. Z-score
Example: Find the outliers in the following data, given mean = 10 and standard deviation = 30.

x: -120, 84, 66, -9, 200, 45, -1
3.3. Z-score

x       Z-score(x)
-120    -4.33
84       2.46
66       1.86
-9      -0.63
200      6.33
45       1.16
-1      -0.36

With mean(x) = 10 and std(x) = 30:

Z-score(x₁) = (−120 − 10) / 30 = −4.33
Z-score(x₇) = (−1 − 10) / 30 = −0.36
3.3. Z-score
According to the z-score values in the table, -120 and 200 are the outliers, as their z-scores are less than -3 or greater than 3.
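A short Python sketch (NumPy assumed) reproducing this example with the given mean and standard deviation:

```python
import numpy as np

x = np.array([-120, 84, 66, -9, 200, 45, -1])
mean, std = 10, 30            # given in the example
z = (x - mean) / std          # z-score of every point
print(x[np.abs(z) > 3])       # [-120  200]
```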
3.3. Interquartile Range (IQR)
● The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.
● Quartiles divide a rank-ordered data set into 4 equal parts. The values dividing the parts are called the first, second, and third quartiles, denoted by Q1, Q2, and Q3, respectively.
● IQR is the difference Q3 − Q1.
● The IQR covers the middle 50% of the values when ordered from lowest to highest.
3.3. Interquartile Range (IQR)
● The interquartile range is frequently used with the Box and Whisker Plot (boxplot) to visualize the basic characteristics of the distribution and to identify outliers.

[Boxplot: whiskers span from the minimum to the maximum non-outlying values of the sample; the box spans Q1 to Q3 (the IQR) with the median (Q2) inside; points beyond the whiskers are outliers.]
3.3. Interquartile Range (IQR)
Example: Find the outliers among the following data using the IQR.

x: 15, 10, 39, 20, 44, 55, 48, 109
3.3. Interquartile Range (IQR)
Step 1: Order the data in ascending order:

10, 15, 20, 39, 44, 48, 55, 109

Step 2: Find the median of the data, which is Q2:

Q2 = median(x) = (39 + 44) / 2 = 41.5

Step 3: Split the data into two halves and find Q1 and Q3:

10, 15, 20, 39 | 44, 48, 55, 109

Q1 = (15 + 20) / 2 = 17.5 and Q3 = (48 + 55) / 2 = 51.5
3.3. Interquartile Range (IQR)
Step 4: Calculate the IQR:

IQR = Q3 − Q1 = 51.5 − 17.5 = 34

Step 5: Find the acceptable range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]:

Q1 − 1.5 × IQR = 17.5 − 1.5 × 34 = −33.5
Q3 + 1.5 × IQR = 51.5 + 1.5 × 34 = 102.5

→ Boundary range = [−33.5, 102.5]
3.3. Interquartile Range (IQR)
Step 6: Compare each value of x with the boundary range [−33.5, 102.5].

109 lies outside the boundary range; thus it is the outlier.

[Number line: lower limit = −33.5, median Q2 = 41.5, upper limit = 102.5; all values from 10 to 55 fall inside the range, while 109 falls outside.]
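A short Python sketch (NumPy assumed) reproducing this worked example; Q1 and Q3 are computed with the same midpoint method as the slide, since np.percentile's default interpolation can differ slightly:

```python
import numpy as np

x = np.sort(np.array([15, 10, 39, 20, 44, 55, 48, 109]))
q1 = np.median(x[:4])                      # median of lower half: (15 + 20) / 2 = 17.5
q3 = np.median(x[4:])                      # median of upper half: (48 + 55) / 2 = 51.5
iqr = q3 - q1                              # 34.0
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # -33.5, 102.5
print(x[(x < lo) | (x > hi)])              # [109]
```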


3.4. Imbalanced Dataset
● An imbalanced dataset typically refers to a classification task where the classes are not represented equally.
● Machine learning algorithms such as K-Nearest Neighbors, Logistic Regression, etc. may overfit when trained on a highly imbalanced dataset, as they focus heavily on the majority class.

[Figure: imbalanced dataset]
3.4. Undersampling vs. Oversampling
● Undersampling (or downsampling) drops observations from the majority class by randomly sampling it down to the same size as the smaller class.
● Oversampling (or upsampling) duplicates observations from the minority class by repeatedly sampling them until it reaches the same size as the majority class.
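A minimal sketch of random oversampling with scikit-learn's resample utility; the toy DataFrame and the "label" column are hypothetical:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 8 majority (0) vs. 2 minority (1) rows.
df = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})
majority, minority = df[df.label == 0], df[df.label == 1]

# Oversample the minority class (with replacement) up to the majority size;
# undersampling would instead resample `majority` down to len(minority).
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced.label.value_counts())   # 8 rows of each class
```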
3.4. SMOTE
SMOTE stands for Synthetic Minority Oversampling Technique: it creates artificial samples of the minority class by interpolating between existing minority observations and their K-Nearest Neighbors (KNN). SMOTE is used to avoid model overfitting.

[Diagram: a new synthetic example is placed between a minority sample and one of its K nearest neighbors in the (x₁, x₂) feature space.]
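A minimal SMOTE sketch; it assumes the third-party imbalanced-learn (imblearn) package is installed and uses a synthetic dataset:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic 90/10 imbalanced dataset.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)

# SMOTE interpolates between minority samples and their k nearest neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # classes are balanced afterwards
```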
4. Feature Scaling
● Datasets often contain features that vary greatly in magnitude and units. Since learning algorithms only take in the magnitudes of features, neglecting their units, features with higher magnitudes can dominate those with lower magnitudes.
● Feature scaling is essential for algorithms that use the Euclidean distance (e.g. KNN, K-Means, PCA), as a feature with a larger range will weigh far more than the others in the distance calculation.
● Another reason feature scaling is applied is that training algorithms which use Gradient Descent tend to converge much faster on scaled features.

For points a = (a₁, a₂) and b = (b₁, b₂):

d_euc(a, b) = √((a₁ − b₁)² + (a₂ − b₂)²)

If feature x₁ ∈ {0.1, 0.01, …} while x₂ ∈ {10³, 10⁴, …}, the distance is dominated by x₂.
4. Feature Scaling
Example: After applying PCA, the classes are more separable when the features are scaled.

[Figure: PCA projection without scaling vs. with scaling]
4. Feature Scaling – Heavy-Tailed Distribution
If the distribution is heavy-tailed, consider applying the square root or, in extreme cases, the logarithm before scaling with standardization or min-max scaling.
4. Standardization
● Standardizing means subtracting a measure of location (the mean) and dividing by a measure of scale (the standard deviation). If a random variable follows a Gaussian distribution, we obtain a "Standard Normal" random variable with mean 0 and standard deviation 1.
● Standardizing replaces the values of a random variable by their z-scores:

x_new = (x − mean(x)) / std(x)
4. Standardization
Example: Apply standardization to the following two Gaussian variables.

x₁     x₂
39     192
8      229
-12    187
1      187
0      223
-38    219
-37    174

• mean(x₁) ≈ −5.57, std(x₁) ≈ 24.86
• mean(x₂) ≈ 201.57, std(x₂) ≈ 19.97
4. Standardization
● Compute the z-score of each variable:

x₁     x₂     Z-score(x₁)   Z-score(x₂)
39     192     1.79         -0.47
8      229     0.54          1.37
-12    187    -0.25         -0.72
1      187     0.26         -0.72
0      223     0.22          1.07
-38    219    -1.30          0.87
-37    174    -1.26         -1.38
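For reference, a sketch using scikit-learn's StandardScaler, which, like the table above, estimates the mean and standard deviation from the data itself (rounding may differ in the last digit):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[39, 192], [8, 229], [-12, 187], [1, 187],
              [0, 223], [-38, 219], [-37, 174]], dtype=float)

X_std = StandardScaler().fit_transform(X)   # (x - mean) / std, per column
print(X_std.round(2))                       # first row ~ [ 1.79 -0.48]
```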
4. Min-Max Normalization
● A type of normalization which rescales a variable by its minimum and range, so that all values lie between 0 and 1, thus bringing all numeric variables to a common scale.
● It is done by replacing every value of x by:

x_new = (x − x_min) / (x_max − x_min)
4. Min-Max Normalization
Example: Apply min-max normalization to the following data.

x₁     x₂
39     192
8      229
-12    187
1      187
0      223
-38    219
-37    174
4. Min-Max Normalization
● Step 1: Find the min and max value of each variable:

• min(x₁) = −38, max(x₁) = 39
• min(x₂) = 174, max(x₂) = 229
4. Min-Max Normalization
● Step 2: Compute the new values of each variable using the min-max formula:

x₁     x₂     x₁_new   x₂_new
39     192    1        0.32
8      229    0.59     1
-12    187    0.33     0.23
1      187    0.50     0.23
0      223    0.49     0.89
-38    219    0        0.81
-37    174    0.01     0
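The same result can be obtained with scikit-learn's MinMaxScaler; a minimal sketch (rounding may differ in the last digit):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[39, 192], [8, 229], [-12, 187], [1, 187],
              [0, 223], [-38, 219], [-37, 174]], dtype=float)

X_mm = MinMaxScaler().fit_transform(X)   # (x - min) / (max - min), per column
print(X_mm.round(2))                     # first row ~ [1.   0.33]
```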
5. Feature Encoding
● It is very common to see categorical features in a dataset. However, most learning algorithms can only deal with numerical values, so it is essential to encode categorical features into numerical values.
● There are many techniques for encoding a categorical variable, depending on its characteristics:
○ Ordinal/Label Encoding
○ One-hot Encoding
○ Binary Encoding
○ Hashing Encoding
○ Etc.
5. Label Encoding
● Label/Ordinal Encoding is used only with categorical data that have a natural ordered relationship between the values (e.g. 'Size' with values such as 'small', 'medium', and 'large').
● The encoding is done by assigning discrete numbers to the values based on their order, as in the sketch below.

E.g. X = ["small", "medium", "large"] → [0, 1, 2]
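A minimal sketch using scikit-learn's OrdinalEncoder; passing the category list explicitly preserves the small < medium < large order instead of the default alphabetical one:

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
X = [["small"], ["medium"], ["large"], ["small"]]
print(enc.fit_transform(X).ravel())   # [0. 1. 2. 0.]
```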


5. One-Hot Encoding
● A common approach when there is no natural ordering between the values of a categorical variable.
● For each value of the categorical variable, a dummy variable is created, which is 1 if the instance's original value matches that value and 0 otherwise.

E.g. X = ["Cambodia", "Thai", "Lao"]

Dummy variables:

Country     Country_Cambodia   Country_Thai   Country_Lao
Cambodia    1                  0              0
Thai        0                  1              0
Lao         0                  0              1
5. One-Hot Encoding
● A few algorithms drop one column by default in theory, but in practice almost no implementation drops a column automatically.
● A general rule of thumb is to drop one column to avoid the dummy variable trap (multicollinearity) before feeding the data into training.
● Dropping any one column from the one-hot encoding is fine, as long as the total number of columns equals the number of categories − 1.
E.g. X = ["Cambodia", "Thai", "Lao"], 3 categories, drop 1 → 2 columns:

Country     Country_Cambodia   Country_Thai
Cambodia    1                  0
Thai        0                  1
Lao         0                  0
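A minimal sketch with pandas' get_dummies; drop_first=True keeps (categories − 1) columns, which is fine since any single column may be dropped:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Cambodia", "Thai", "Lao"]})

# drop_first=True drops the first category alphabetically (here Cambodia),
# leaving Country_Lao and Country_Thai and avoiding the dummy variable trap.
print(pd.get_dummies(df, columns=["Country"], drop_first=True))
```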
6. Feature Correlation
● Correlation is a statistical term which, in common usage, refers to how close two variables are to having a linear relationship/dependency with each other.
● It is used in "Feature Selection" to identify which input features meaningfully contribute to a learning algorithm's prediction of the output. E.g. input features with high mutual correlation are likely to have the same effect on the output, so we can drop one of them.
● Common feature correlation methods:
○ Pearson Correlation, Spearman Correlation
○ Chi-Square Test
○ ANOVA Test
6. Pearson Correlation
● In statistics, the Pearson correlation coefficient is a measure of the linear correlation between two numerical variables x and y.
● By the Cauchy-Schwarz inequality, it has a value between +1 and −1, where 1 is a perfect positive linear correlation, 0 is no linear correlation, and −1 is a perfect negative linear correlation.
● Given n paired data points {(x₁, y₁), …, (xₙ, yₙ)}, the Pearson correlation coefficient r_xy is defined as:

r_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² )
6. Pearson Correlation
Example: Given a shop's records of daily temperature and sales below, determine the correlation between temperature and sales using the Pearson correlation coefficient.

Temp (°C)   Sales
14.2        $215
16.4        $325
11.9        $185
15.2        $332
18.5        $406
6. Pearson Correlation
Denote:
x: temperature of the day in degrees Celsius
y: amount of sales in dollars each day
6. Pearson Correlation
Step 1: Find the means of x and y:

x       y
14.2    215
16.4    325
11.9    185
15.2    332
18.5    406

x̄ = 15.24, ȳ = 292.6
6. Pearson Correlation
Step 2: Compute x − x̄ and y − ȳ:

x       y      x − x̄    y − ȳ
14.2    215    -1.04     -77.6
16.4    325     1.16      32.4
11.9    185    -3.34     -107.6
15.2    332    -0.04      39.4
18.5    406     3.26      113.4

x̄ = 15.24, ȳ = 292.6
6. Pearson Correlation
Step 3: Compute (x − x̄)(y − ȳ), (x − x̄)², and (y − ȳ)²:

x       y      x − x̄    y − ȳ     (x − x̄)(y − ȳ)   (x − x̄)²   (y − ȳ)²
14.2    215    -1.04     -77.6      80.704           1.0816     6021.76
16.4    325     1.16      32.4      37.584           1.3456     1049.76
11.9    185    -3.34     -107.6     359.384          11.1556    11577.76
15.2    332    -0.04      39.4     -1.576            0.0016     1552.36
18.5    406     3.26      113.4     369.684          10.6276    12859.56
6. Pearson Correlation
Step 4: Compute Σ(x − x̄)(y − ȳ), Σ(x − x̄)², and Σ(y − ȳ)²:

Σ(x − x̄)(y − ȳ) = 845.78
Σ(x − x̄)² = 24.212
Σ(y − ȳ)² = 33061.2
6. Pearson Correlation
Step 5: Compute the correlation coefficient:

r_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² )
     = 845.78 / √(24.212 × 33061.2)
     = 0.9453

Based on the value r_xy = 0.9453, the two variables, temperature and sales, have a strong positive linear relationship.
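A quick check of this result with NumPy (np.corrcoef computes the Pearson coefficient; scipy.stats.pearsonr would give the same value):

```python
import numpy as np

temp = np.array([14.2, 16.4, 11.9, 15.2, 18.5])
sales = np.array([215, 325, 185, 332, 406])

r = np.corrcoef(temp, sales)[0, 1]   # Pearson correlation coefficient
print(round(r, 4))                   # 0.9453
```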
Q&A
