Data Preprocessing
Data types: Categorical (Qualitative) vs. Numerical (Quantitative)
● Nominal
○ It has no order (the meaning stays the same even if the values are reordered).
○ E.g. gender, religion, color, language
● Ordinal
○ It has order/rank.
○ E.g. study grade, racing rank, language level, customer satisfaction
1. Categorical Data
● Graphical Representation: Sex
● Interval
○ The distances between values are equal, but there is no true zero.
○ E.g. temperature (°C, °F) [can be negative], time (12 hours), IQ score
● Ratio
○ Interval-like data that also has a true zero.
○ E.g. length, height, weight, age
▪ The difference between interval and ratio scales comes from their ability to dip below zero.
1. Numerical Data
● Graphical Representation: Age
[Plot: Weight vs. Height]
3. Data Cleaning
● Data Cleaning is the process of identifying corrupt or inaccurate parts of the data and correcting them by replacing, modifying or deleting the dirty or coarse records.
● The data has to be cleaned based on the context or problem so that it can be further analyzed and processed, especially before applying any Machine Learning algorithm.
3.1. Low Variation
For variables with very low variation, we should either:
✓ Standardize all variables, or use the standard deviation to account for variables with different scales.
✓ Drop variables with zero variance (when the variable contains only one value), since they have no predictive power.
• $\mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \mathrm{mean}(x)\right)^2$
• $\mathrm{std}(x) = \sqrt{\mathrm{var}(x)}$
[Figure: distributions with different variances]
● Data sets with similar values are said to have little variability, while data sets whose values are spread out have high variability.
● A small standard deviation suggests values close to the mean; a large one suggests that values are spread out.
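A minimal NumPy sketch of these quantities, using made-up example arrays: it computes the population variance and standard deviation as defined above and flags a zero-variance variable for dropping.

```python
import numpy as np

x = np.array([4.9, 5.1, 5.0, 5.2, 4.8])   # a feature with some variation
z = np.array([7.0, 7.0, 7.0, 7.0, 7.0])   # a zero-variance feature (only one value)

var_x = np.mean((x - x.mean()) ** 2)       # var(x) = (1/n) * sum((x_i - mean(x))^2)
std_x = np.sqrt(var_x)                     # std(x) = sqrt(var(x))
print(var_x, std_x)                        # same results as np.var(x), np.std(x)

# Drop a variable if it has zero variance (no predictive power)
keep_z = np.var(z) > 0
print(keep_z)                              # False -> drop z
```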
3.2. Missing Value
Missing values could appear as NA, n/a, empty string, ?, -1, -99, -999, null, etc. They should be treated carefully based on the given context, because they affect the quality of the model during training.
3.2. Missing Value
Possible ways to deal with missing values:
✓ Delete those records – not efficient when the dataset is small
✓ Delete those columns
✓ Apply imputation:
○ Numerical variable: replace the missing values with either mean, median or mode
○ Categorical variable: replace the missing values with the highest frequency value (mode)
• $\mathrm{mean}(x) = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
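A minimal pandas sketch of the options above (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],   # numerical variable with missing values
    "gender": ["F", "M", None, "M", "M"],     # categorical variable with a missing value
})

# Option 1: delete records (rows) with missing values
dropped = df.dropna()

# Option 2: impute the numerical variable with its mean (median/mode work the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# Option 3: impute the categorical variable with its mode (highest-frequency value)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)
```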
3.3. Outlier
● An outlier is a value that falls more than 1.5 times the interquartile range above the 3rd quartile or below the 1st quartile:
$x < Q_1 - 1.5 \times IQR \quad \text{or} \quad x > Q_3 + 1.5 \times IQR$
3.3. Z-score
Example: Find the outliers among the following data using the z-score.
x: -120, 84, 66, -9, 200, 45, -1
3.3. Z-score
Given $\mathrm{mean}(x) = 10$ and $\mathrm{std}(x) = 30$, compute the z-score of each value:
$z(x_i) = \dfrac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)}$
Z-score($x_1$) = (-120 - 10) / 30 = -4.33
Z-score($x_7$) = (-1 - 10) / 30 = -0.36
3.3. Z-score
x       Z-score(x)
-120    -4.33
84      2.46
66      1.86
-9      -0.63
200     6.33
45      1.16
-1      -0.36
According to the z-score values in the table, -120 and 200 are the outliers, as their z-scores are either less than -3 or greater than 3.
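A minimal NumPy sketch reproducing this example, treating the mean of 10 and standard deviation of 30 as given:

```python
import numpy as np

x = np.array([-120, 84, 66, -9, 200, 45, -1])

# mean and std as given in the example above
mean_x, std_x = 10, 30

z = (x - mean_x) / std_x                 # z-score of each value
outliers = x[np.abs(z) > 3]              # flag values with |z| > 3

print(np.round(z, 2))                    # [-4.33  2.47  1.87 -0.63  6.33  1.17 -0.36]
print(outliers)                          # [-120  200]
```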
3.3. Interquartile Range (IQR)
● The interquartile range (IQR) is a measure of variability, based on dividing a data set
into quartiles.
● Quartiles divide a rank-ordered data set into 4 equal parts. The values dividing each
part are called the first, second, and third quartiles denoted by Q1, Q2, and Q3,
respectively.
● IQR is the difference Q3 − Q1.
● IQR describes the middle 50% of values when ordered from lowest to highest
3.3. Interquartile Range (IQR)
● The Interquartile Range is frequently used with the Box and Whisker Plot (boxplot) to visualize the basic characteristics of the distribution and to identify outliers.
[Boxplot: the box spans Q1 to Q3 (the IQR) with the median (Q2) inside; the whiskers extend to the minimum and maximum values of the sample; points beyond them are marked as outliers.]
3.3. Interquartile Range (IQR)
Example: Find the outliers among the following data using IQR.
x: 15, 10, 39, 20, 44, 55, 48, 109
3.3. Interquartile Range (IQR)
Step 1: Order the data in ascending order: 10, 15, 20, 39, 44, 48, 55, 109
Step 5: Find the acceptable range of the data: $[Q_1 - 1.5 \times IQR,\ Q_3 + 1.5 \times IQR]$
It shows that 109 lies outside this boundary range; thus, it is the outlier.
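A minimal NumPy sketch of the whole procedure (np.percentile uses linear interpolation for the quartiles, so Q1 and Q3 may differ slightly from a hand calculation, but the flagged outlier is the same):

```python
import numpy as np

x = np.array([15, 10, 39, 20, 44, 55, 48, 109])

q1, q3 = np.percentile(x, [25, 75])      # first and third quartiles
iqr = q3 - q1                            # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(outliers)                          # [109]
```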
Imbalanced Dataset
3.4. Undersampling vs. Oversampling
● Undersampling (or downsampling) drops observations of the majority class by randomly down-sampling them until they are the same size as the smaller class.
● Oversampling (or upsampling) duplicates observations of the minority class by repeatedly sampling them until they reach the same size as the majority class.
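A minimal sketch of both ideas using scikit-learn's resample utility on a made-up DataFrame with a "label" column:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(10),
                   "label": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})  # 8 majority vs 2 minority rows

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Undersampling: randomly drop majority rows down to the minority size
under = pd.concat([resample(majority, replace=False, n_samples=len(minority), random_state=0),
                   minority])

# Oversampling: randomly duplicate minority rows up to the majority size
over = pd.concat([majority,
                  resample(minority, replace=True, n_samples=len(majority), random_state=0)])

print(under["label"].value_counts())
print(over["label"].value_counts())
```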
3.4. SMOTE
SMOTE stands for Synthetic Minority Oversampling Technique: it creates artificial samples of the minority class using K-Nearest Neighbors (KNN). SMOTE is used to avoid model overfitting.
[Illustration: a new synthetic example is created between a minority-class point and one of its K nearest neighbors in the (x1, x2) feature space.]
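If the third-party imbalanced-learn package is available, SMOTE can be applied in a few lines; a minimal sketch on a made-up imbalanced dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE   # requires the imbalanced-learn package

# Toy imbalanced dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# SMOTE synthesizes new minority samples by interpolating between a minority point
# and one of its k nearest minority-class neighbors
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))
```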
4. Feature Scaling
● Very often, the dataset contains features that vary greatly in magnitude and units. Since learning algorithms only take in feature magnitudes, neglecting their units, features with a higher magnitude may dominate those with a lower magnitude.
● Feature scaling is essential for algorithms that use Euclidean distance (e.g. KNN, K-Means, PCA), as a feature with a higher range will weigh far more than the others in the distance calculation.
● Another reason feature scaling is applied is that training algorithms which use Gradient Descent tend to converge much faster on scaled features.
$a = (a_{x_1}, a_{x_2}), \quad b = (b_{x_1}, b_{x_2})$
$d_{euc}(a, b) = \sqrt{(a_{x_1} - b_{x_1})^2 + (a_{x_2} - b_{x_2})^2}$
$x_1 \in \{0.1, 0.01, \dots\}, \quad x_2 \in \{10^3, 10^4, \dots\}$
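A tiny NumPy sketch of this point, with made-up values: when x2 is several orders of magnitude larger than x1, the Euclidean distance is driven almost entirely by x2.

```python
import numpy as np

a = np.array([0.10, 1000.0])   # (x1, x2) for point a
b = np.array([0.01, 4000.0])   # (x1, x2) for point b

d = np.sqrt(np.sum((a - b) ** 2))   # Euclidean distance
print(d)                            # ~3000: the x1 difference (0.09) is negligible
```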
4. Feature Scaling
Example: After applying PCA, the data are more separable when the features are scaled.
x1     x2
39     192
8      229
-12    187
1      187
0      223
-38    219
-37    174
• $\mathrm{mean}(x_1) \approx -5.57$, $\mathrm{std}(x_1) \approx 24.86$
• $\mathrm{mean}(x_2) \approx 201.57$, $\mathrm{std}(x_2) \approx 19.97$
4. Standardization
● Compute Z-score of each variable
𝑥1 𝑥2 Z-score(𝑥1 ) Z-score(𝑥2 )
39 192 1.79 -0.47
8 229 0.54 1.37
-12 187 -0.25 -0.72
1 187 0.26 -0.72
0 223 0.22 1.07
-38 219 -1.30 0.87
-37 174 -1.26 -1.38
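A minimal sketch of the same standardization with scikit-learn; StandardScaler subtracts each column's mean and divides by its standard deviation, reproducing the table above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[ 39, 192],
              [  8, 229],
              [-12, 187],
              [  1, 187],
              [  0, 223],
              [-38, 219],
              [-37, 174]], dtype=float)

# StandardScaler applies the z-score transform column by column
X_std = StandardScaler().fit_transform(X)
print(np.round(X_std, 2))
```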
4. Min-Max Normalization
● A type of normalization which scales a
variable by its minimum and range, to
make all the elements lie between 0 and
1, thus bringing all the values of numeric
variables to a common scale.
● It is done by replacing all the values of 𝑥
by:
$x_{new} = \dfrac{x - x_{min}}{x_{max} - x_{min}}$
4. Min-Max Normalization
Example: Apply Min-Max normalization with the following data.
𝑥1 𝑥2
39 192
8 229
-12 187
1 187
0 223
-38 219
-37 174
4. Min-Max Normalization
● Step 1: Find min and max value of each data
x1     x2
39     192
8      229
-12    187
1      187
0      223
-38    219
-37    174
• min(x1) = -38, max(x1) = 39
• min(x2) = 174, max(x2) = 229
4. Min-Max Normalization
● Step 2: Compute new values of each variable by using min-max formula
x1     x2     x1_new    x2_new
39     192    1         0.32
8      229    0.59      1
-12    187    0.33      0.23
1      187    0.50      0.23
0      223    0.49      0.89
-38    219    0         0.81
-37    174    0.01      0
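Likewise, a minimal sketch with scikit-learn's MinMaxScaler, which applies the min-max formula column by column to the same data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[ 39, 192],
              [  8, 229],
              [-12, 187],
              [  1, 187],
              [  0, 223],
              [-38, 219],
              [-37, 174]], dtype=float)

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min)
X_minmax = MinMaxScaler().fit_transform(X)
print(np.round(X_minmax, 2))
```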
5. Feature Encoding
● It is very common to see categorical features in a dataset. However, most learning algorithms can only deal with numerical values. Thus, it is essential to encode categorical features into numerical values.
● There are many techniques to encode categorical variable based on its characteristics:
○ Ordinal/Label Encoding
○ One-hot Encoding
○ Binary Encoding
○ Hashing Encoding
○ Etc.
5. Label Encoding
● Label/Ordinal Encoding is used only with categorical data that has a natural ordered relationship between its values (e.g. 'Size' with values such as 'small', 'medium', and 'large').
● The encoding is done by assigning discrete numbers to the values based on their order (e.g. small → 0, medium → 1, large → 2).
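A minimal sketch of label/ordinal encoding with scikit-learn's OrdinalEncoder (the toy "Size" data is made up; the category order is passed explicitly so the codes respect it):

```python
from sklearn.preprocessing import OrdinalEncoder

sizes = [["small"], ["large"], ["medium"], ["small"]]

# Pass the categories in their natural order so the assigned codes respect it
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(encoder.fit_transform(sizes))   # small -> 0, medium -> 1, large -> 2
```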
5. One-Hot Encoding
● One-hot encoding creates one binary column per category; each row gets a 1 in the column of its own category and 0 elsewhere:
Country     Country_Cambodia    Country_Thai    Country_Lao
Cambodia    1                   0               0
Thai        0                   1               0
Lao         0                   0               1
● A few algorithms drop one column by default in theory, but in practice almost no algorithm automatically drops a column.
● A general rule of thumb is to drop one column to avoid the dummy variable trap (multicollinearity) before feeding the data into training.
● Dropping any column from the one-hot encoding is fine, as long as the total number of columns = number of categories − 1.
E.g. X = [”Cambodia”, ”Thai”, ”Lao”], 3 categories, drop 1 → 2 columns:
Country Country_Cambodia Country_Thai
Cambodia 1 0
Thai 0 1
Lao 0 0
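A minimal sketch of one-hot encoding with pandas; note that get_dummies with drop_first=True may drop a different column than the table above, which is fine since any one column can be dropped:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Cambodia", "Thai", "Lao"]})

# One binary column per category; drop_first removes one column to avoid
# the dummy variable trap (which column gets dropped does not matter)
encoded = pd.get_dummies(df, columns=["Country"], drop_first=True, dtype=int)
print(encoded)
```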
6. Feature Correlation
● Correlation is a statistical term which in common usage refers to how close two
variables are to having a linear relationship/dependency with each other.
● It is used in ”Feature Selection” to identify which input features contribute usefully to a learning algorithm in predicting the output. E.g. input features that are highly correlated with each other are likely to have the same effect on the output, so we can drop one of them.
● Common feature correlation methods:
○ Pearson Correlation, Spearman Correlation
○ Chi-Square Test
○ ANOVA Test
6. Pearson Correlation
● In statistics, Pearson correlation coefficient is a measure of linear correlation between
two numerical variables 𝑥 and 𝑦.
● By the Cauchy–Schwarz inequality, it has a value between +1 and −1, where 1 is perfect positive linear correlation, 0 is no linear correlation, and −1 is perfect negative linear correlation.
● Given $n$ paired data points $\{(x_1, y_1), \dots, (x_n, y_n)\}$, the Pearson correlation coefficient $r_{xy}$ is defined as follows:
$r_{xy} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$
6. Pearson Correlation
Example: Given a shop's records of daily temperature and sales below, determine the correlation between temperature and sales using the Pearson correlation coefficient.
Temp (°C)    Sales
14.2 $215
16.4 $325
11.9 $185
15.2 $332
18.5 $406
6. Pearson Correlation
Denote:
𝑥: Temperature of the day in degrees Celsius
𝑦: Amount of sales in dollars each day
6. Pearson Correlation
Step 1: Find the mean of 𝑥 and 𝑦
𝒙 𝒚
14.2 215
16.4 325
11.9 185
15.2 332
18.5 406
$\bar{x} = 15.24$, $\bar{y} = 292.6$
6. Pearson Correlation
Step 2: Compute $x - \bar{x}$ and $y - \bar{y}$
x       y       x − x̄     y − ȳ
14.2    215     -1.04     -77.6
16.4    325     1.16      32.4
11.9    185     -3.34     -107.6
15.2    332     -0.04     39.4
18.5    406     3.26      113.4
$\bar{x} = 15.24$, $\bar{y} = 292.6$
6. Pearson Correlation
Step 3: Compute $(x - \bar{x})(y - \bar{y})$, $(x - \bar{x})^2$, and $(y - \bar{y})^2$
x       y       x − x̄     y − ȳ      (x − x̄)(y − ȳ)    (x − x̄)²    (y − ȳ)²
14.2    215     -1.04     -77.6      80.704            1.0816      6021.76
16.4    325     1.16      32.4       37.584            1.3456      1049.76
11.9    185     -3.34     -107.6     359.384           11.1556     11577.76
15.2    332     -0.04     39.4       -1.576            0.0016      1552.36
18.5    406     3.26      113.4      369.684           10.6276     12859.56
$\bar{x} = 15.24$, $\bar{y} = 292.6$
6. Pearson Correlation
Step 4: Compute $\sum(x - \bar{x})(y - \bar{y})$, $\sum(x - \bar{x})^2$, and $\sum(y - \bar{y})^2$
x       y       x − x̄     y − ȳ      (x − x̄)(y − ȳ)    (x − x̄)²    (y − ȳ)²
14.2    215     -1.04     -77.6      80.704            1.0816      6021.76
16.4    325     1.16      32.4       37.584            1.3456      1049.76
11.9    185     -3.34     -107.6     359.384           11.1556     11577.76
15.2    332     -0.04     39.4       -1.576            0.0016      1552.36
18.5    406     3.26      113.4      369.684           10.6276     12859.56
$\bar{x} = 15.24$, $\bar{y} = 292.6$; column sums: 845.78, 24.212, 33061.2
6. Pearson Correlation
Step 5: Compute the correlation coefficient
$r_{xy} = \dfrac{845.78}{\sqrt{24.212}\,\sqrt{33061.2}} = 0.9453$
Based on the value $r_{xy} = 0.9453$, the two variables, temperature and sales, have a strong positive linear relationship.
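The hand calculation can be cross-checked with NumPy (a minimal sketch):

```python
import numpy as np

temp  = np.array([14.2, 16.4, 11.9, 15.2, 18.5])   # x: daily temperature (°C)
sales = np.array([215, 325, 185, 332, 406])        # y: daily sales ($)

# The off-diagonal entry of the 2x2 correlation matrix is r_xy
r_xy = np.corrcoef(temp, sales)[0, 1]
print(round(r_xy, 4))   # 0.9453
```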
Q&A