CIVI 6731
BIG DATA ANALYTICS FOR SMART CITIES
Week 8
Classification (2)
Learning Objectives
Classification
with k-Nearest Neighbors
“Lazy Learners”
Look up the examples from the training set that best match the new example in the test set, and predict accordingly
No “learning” really happens!
Example: three training records with two attributes (WWR%, VT%) and the class attribute Level:
(85, 85) → High; (90, 80) → High; (78, 83) → Low
[Figure: the three records plotted as points in the 2-D (WWR%, VT%) attribute space]
With a third attribute (Shad.) each record becomes a 3-D vector:
(85, 85, FALSE) → High; (90, 80, TRUE) → High; (78, 83, FALSE) → Low
Adding a fourth attribute (Orient.) gives 4-D vectors (4 attributes).
[Figure: a new example (WWR% = 81, class unknown) is matched against its nearest training examples and predicted Low]
Measures of Proximity
Measures for evaluating the similarity of two records (two vectors in the n-D attribute space):
Distance
Cosine Similarity
Correlation Similarity
Simple Matching Coefficient
Jaccard Similarity
Distance
Datapoint X: $\vec{X} = (x_1, x_2, \ldots, x_n)$
Datapoint Y: $\vec{Y} = (y_1, y_2, \ldots, y_n)$
Euclidean Distance:
$d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$
Distance (cont.)
Datapoint X: $\vec{X} = (x_1, x_2, \ldots, x_n)$
Datapoint Y: $\vec{Y} = (y_1, y_2, \ldots, y_n)$
Manhattan Distance:
$d = \sum_{i=1}^{n} |x_i - y_i|$
Distance (cont.)
Datapoint X: $\vec{X} = (x_1, x_2, \ldots, x_n)$
Datapoint Y: $\vec{Y} = (y_1, y_2, \ldots, y_n)$
Minkowski Distance (order $p$):
$d = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
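As a quick illustration of these three measures, here is a minimal numpy sketch; the two records reuse the (VT%, WWR%) values from the earlier example, and p is the Minkowski order.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between two n-D attribute vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Two records from the (VT%, WWR%) example above
x, y = (85, 85), (78, 83)

print(minkowski(x, y, p=2))  # Euclidean distance, ~7.28
print(minkowski(x, y, p=1))  # Manhattan distance, 9.0
print(minkowski(x, y, p=3))  # Minkowski with p = 3, ~7.05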
Measures of Proximity
Distance (cont.)
Which distance measure should you select?
Measures of Proximity
Distance (cont.)
Issue with distance: it depends on the scale and units of the attributes!
Attributes are in different measures
Attributes are in different units
Solution: normalize all attributes
Range transformation: rescale to [0, 1]: $x' = (x - \min)/(\max - \min)$
Z-transformation: $x' = (x - \text{mean})/\text{SD}$
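A minimal numpy sketch of the two transformations; note that in practice the min/max or mean/SD must be computed on the training set and reused for new records.

```python
import numpy as np

def range_transform(x):
    """Range transformation: rescale an attribute to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_transform(x):
    """Z-transformation: (x - mean) / SD."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

attr = np.array([85.0, 80.0, 83.0])  # one attribute of the three training records above
print(range_transform(attr))         # [1.  0.  0.6]
print(z_transform(attr))             # zero mean, unit standard deviation
```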
Measures of Proximity
Distance (cont.)
For categorical data:
If ordinal: map the categories to integers, e.g., (cold, mild, warm, hot) → (0, 1, 2, 3)
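A one-line version of this mapping in plain Python (the readings are hypothetical values, only for illustration):

```python
# Ordinal categories keep their order when mapped to integers.
ordinal_map = {"cold": 0, "mild": 1, "warm": 2, "hot": 3}

readings = ["mild", "hot", "cold"]            # hypothetical values
encoded = [ordinal_map[r] for r in readings]
print(encoded)                                # [1, 3, 0]
```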
Cosine Similarity
Measure of the angle θ between two vectors:
$\cos(\vec{X}, \vec{Y}) = \dfrac{\vec{X} \cdot \vec{Y}}{\|\vec{X}\| \, \|\vec{Y}\|}$
E.g.:
$\vec{X} = (1, 2, 0, 0, 3)$, $\vec{Y} = (5, 0, 0, 6, 7)$
$\cos(\vec{X}, \vec{Y}) = \dfrac{1 \cdot 5 + 2 \cdot 0 + 0 \cdot 0 + 0 \cdot 6 + 3 \cdot 7}{\sqrt{1^2 + 2^2 + 3^2} \cdot \sqrt{5^2 + 6^2 + 7^2}} \approx 0.66$
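A minimal numpy sketch that reproduces the 0.66 result above:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cos(X, Y) = (X . Y) / (|X| |Y|)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

X = [1, 2, 0, 0, 3]
Y = [5, 0, 0, 6, 7]
print(round(cosine_similarity(X, Y), 2))  # 0.66
```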
Measures of Proximity
Correlation Similarity
Pearson correlation between $\vec{X}$ and $\vec{Y}$ is a measure of the linear relationship between their attributes:
$r(\vec{X}, \vec{Y}) = \dfrac{\mathrm{cov}(\vec{X}, \vec{Y})}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}$
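A quick numerical check of the formula with numpy, reusing the vectors from the cosine example (an arbitrary choice for illustration); note that the two measures can disagree.

```python
import numpy as np

X = [1, 2, 0, 0, 3]
Y = [5, 0, 0, 6, 7]

# Pearson correlation: cov(X, Y) / sqrt(Var(X) * Var(Y))
r = np.corrcoef(X, Y)[0, 1]
print(round(r, 2))  # ~0.25: weak linear relationship, although the cosine similarity was 0.66
```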
Measures of Proximity
Simple Matching Coefficient (SMC)
Good for binary attributes:
$SMC(\vec{X}, \vec{Y}) = \dfrac{\text{matching occurrences}}{\text{total occurrences}}$
E.g.:
$\vec{X} = (1, 1, 0, 0, 1, 1, 0)$, $\vec{Y} = (1, 0, 0, 1, 1, 0, 0)$
$SMC(\vec{X}, \vec{Y}) = \dfrac{4}{7}$
Measures of Proximity
Jaccard Similarity
Similar to SMC, but first sets the non-occurrences (0-0 matches) aside:
$J(\vec{X}, \vec{Y}) = \dfrac{\text{common occurrences}}{\text{total occurrences (0-0 matches excluded)}}$
E.g.:
$\vec{X} = (1, 1, 0, 0, 1, 1, 0)$, $\vec{Y} = (1, 0, 0, 1, 1, 0, 0)$
$J(\vec{X}, \vec{Y}) = \dfrac{2}{5}$
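Both coefficients are easy to verify in plain Python; the sketch below reproduces the 4/7 and 2/5 results from the two examples above.

```python
def smc(x, y):
    """Simple Matching Coefficient: matching positions / total positions."""
    matches = sum(xi == yi for xi, yi in zip(x, y))
    return matches / len(x)

def jaccard(x, y):
    """Jaccard similarity: 1-1 matches / positions where at least one value is 1."""
    both = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
    either = sum(xi == 1 or yi == 1 for xi, yi in zip(x, y))
    return both / either

X = [1, 1, 0, 0, 1, 1, 0]
Y = [1, 0, 0, 1, 1, 0, 0]
print(smc(X, Y))      # 4/7 ~ 0.571
print(jaccard(X, Y))  # 2/5 = 0.4
```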
k-NN – Summary
A lazy learner
Easiest to implement – can be implemented even in Excel;
Proper quality needs a significant number of training examples.
Preparation:
Numeric attributes must be normalized;
Categorical attributes must be turned into Boolean or integer values.
Application (see the sketch below):
Select a k;
Select a proximity measure;
Decide whether or not to apply distance-based weights in prediction;
Find the k nearest neighbors;
Evaluate the votes among the k nearest neighbors and predict.
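A minimal sketch of these steps using scikit-learn (an assumption; any k-NN implementation works), reusing the tiny (VT%, WWR%) example above, so the result is only illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Training records from the (VT%, WWR%) example and their class labels
X_train = [[85, 85], [90, 80], [78, 83]]
y_train = ["High", "High", "Low"]

# Preparation: normalize the numeric attributes (range transformation)
scaler = MinMaxScaler()
X_train_n = scaler.fit_transform(X_train)

# Application: choose k, a proximity measure, and distance-based weighting
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean", weights="distance")
knn.fit(X_train_n, y_train)

# Classify a new record, scaled the same way as the training data
X_new = scaler.transform([[80, 82]])
print(knn.predict(X_new))  # ['Low'] - the nearest record dominates the weighted vote
```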
Application Example –
Occupancy Prediction
Brennan et al. in 2015 showed the high accuracy of kNN in predicting
occupancy rate (count), using environmental sensory information
[Figure: prototype sensing setup – Sensing unit: temperature & humidity sensor [DHT22] and CO2 sensor [K30]; processing / data-interpretation unit [Arduino UNO]; storage & communication unit [Raspberry Pi 3B]]
Application Example – Occupancy Prediction (cont.)
Results:
[Results figure not reproduced]
Classification
with Naïve Bayes
Bayes' theorem (Thomas Bayes, 1701 – 1761):
$P(B \mid A) \cdot P(A) = P(A \mid B) \cdot P(B)$
$\Rightarrow \; P(B \mid A) = \dfrac{P(A \mid B) \cdot P(B)}{P(A)}$
$P(B \mid A)$: "Posterior Pr"
$P(B)$: "Prior Pr"
$P(A \mid B)$: Class Conditional Pr (CCP)
$B$: Outcome (in a classification task: the class)
$A$: Evidence (in a classification task: the predictive attribute(s))
I open my eyes this morning; I have a morning class in the Hall Building, and it's snowing again!
What is the chance of me being late today?!
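A worked version of this example in Python; the probabilities are made up purely for illustration (they are assumptions, not from the slides), just to show Bayes' theorem in action.

```python
# Made-up probabilities, only to illustrate Bayes' theorem on this example
p_late         = 0.10  # P(B): prior probability of being late on any morning
p_snow_if_late = 0.60  # P(A|B): class-conditional probability of snow given I was late
p_snow         = 0.25  # P(A): overall probability of a snowy morning (the evidence)

# Posterior: P(B|A) = P(A|B) * P(B) / P(A)
p_late_if_snow = p_snow_if_late * p_late / p_snow
print(round(p_late_if_snow, 2))  # 0.24 - snow raises the chance of being late from 10% to 24%
```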
Advantages of NB:
Easy to understand (particularly when understanding Bayes' theorem!)
Easy to implement
Robust against missing values (set aside attributes with missing data)
Limitations of NB:
Issue with incomplete training sets (the zero-frequency problem)
Issue with continuous attributes
Issue with attribute independence
Solution 1) Discretization (for continuous attributes)
Problem: subjectivity of the bucketing ranges
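A minimal sketch of discretization with pandas; the CO2 readings and bin edges are hypothetical, and the choice of edges is exactly the subjective part noted above.

```python
import pandas as pd

co2 = pd.Series([410, 550, 680, 905, 1200])  # hypothetical CO2 readings (ppm)

# Bucket the continuous attribute; the bin edges are a subjective choice,
# which is exactly the problem noted above.
buckets = pd.cut(co2, bins=[0, 600, 1000, float("inf")],
                 labels=["low", "medium", "high"])
print(buckets.tolist())  # ['low', 'low', 'medium', 'medium', 'high']
```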
Solution 1) Pre-processing (for the attribute-independence assumption)
Complete a correlation analysis and remove strongly correlated attributes before training the NB model (see the sketch below):
o Numeric attributes: Pearson test
o Categorical attributes: Chi-squared (χ²) test
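A sketch of such a pre-processing check, assuming pandas and scipy are available; the file name and the two categorical column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("training_data.csv")  # hypothetical training set

# Numeric attributes: Pearson correlation matrix; drop one attribute of any
# pair showing a very strong linear relationship before training NB.
print(df.select_dtypes("number").corr(method="pearson"))

# Categorical attributes: chi-squared test of independence on a contingency
# table of two (hypothetical) attributes.
table = pd.crosstab(df["orientation"], df["shading"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)  # a small p-value suggests the two attributes are not independent
```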
Week 8 Tutorial
Occupancy Detection to Enhance the Digital Twin!
As part of the Digital Twin project, Concordia Facility Management aims to detect the occupancy from environmental sensory data (similar to Brennan et al. in 2015).
You're developing a classifier (sketched after the list below) which reads the sensory information in a room and predicts whether the room currently:
is vacant;
has a LOW count of occupants (1~4);
has a MEDIUM count of occupants (5~14); or
has a HIGH count of occupants (≥15).
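One possible starting point for the tutorial, assuming scikit-learn and a hypothetical room_sensors.csv with temperature, humidity, and CO2 readings plus a labelled occupancy class; column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical file and column names: environmental readings plus the
# labelled occupancy class (VACANT / LOW / MEDIUM / HIGH).
df = pd.read_csv("room_sensors.csv")
X = df[["temperature", "humidity", "co2"]]
y = df["occupancy_class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Normalize the numeric attributes, then classify with k-NN
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # classification accuracy on held-out data
```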