L01-intro-clustering
• Additional materials
• Can submit additional code or results in a .zip file
Course project
• <= 5 people
• Topic of your choice
• Can be implementing a paper
• Extension of a homework
• Project for other courses with an additional machine learning component
• Your current research (with additional scope)
• Or work on a new application
• Must already have existing data! No data collection!
• Topics need to be pre-approved
• Details about the procedure TBA
Why study machine learning?
The machine learning trend 2015
http://www.gartner.com/newsroom/id/3114217
The machine learning trend 2016
http://www.gartner.com/newsroom/id/3412017
The machine learning trend 2017
http://www.gartner.com/smarterwithgartner/top-trends-in-the-gartner-hype-cycle-for-emerging-technologies-2017/
The machine learning trend 2018
https://www.gartner.com/smarterwithgartner/5-trends-emerge-in-gartner-hype-cycle-for-emerging-technologies-2018/
The machine learning trend 2019
https://www.gartner.com/smarterwithgartner/top-trends-on-the-gartner-hype-cycle-for-artificial-intelligence-2019/
The machine learning trend 2021
https://www.gartner.com/en/articles/the-4-trends-that-prevail-on-the-gartner-hype-cycle-for-ai-2021
The machine learning trend 2022
https://www.gartner.com/en/articles/what-s-new-in-the-2022-gartner-hype-cycle-for-emerging-technologies
The machine learning trend 2023
https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2023-gartner-hype-cycle
• “If I were to guess like what our biggest existential threat
is, it’s probably that. So we need to be very careful with
the artificial intelligence. There should be some regulatory
oversight maybe at the national and international level,
just to make sure that we don’t do something very
foolish.”
• “I think people who are naysayers and try to drum up
these doomsday scenarios — I just, I don’t understand it.
It’s really negative and in some ways I actually think it is
pretty irresponsible”
Poll
What is Pattern Recognition?
• “Pattern recognition is a branch of machine learning that
focuses on the recognition of patterns and regularities in
data, although it is in some cases considered to be nearly
synonymous with machine learning.”
Wikipedia
• What about
• AI
• Data mining
• Knowledge Discovery in Databases (KDD)
• Statistics
• Data science
What is AI?
• Classical definition
• A system that appears intelligent
• Populace definition
https://techsauce.co/pr-news/tcas-use-cloud-computing-and-ai-for-admission
Artificial General Intelligence (AGI)
• “hypothetical ability of an intelligent agent to understand
or learn any intellectual task that a human being can.”
Wikipedia
Can continue to learn new skills on its own.
http://www.deeplearningbook.org/
Different terminologies
http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf
Merging communities and fields
• With the advent of Deep learning the fields are merging
and the differences are becoming unclear
Course philosophy
• Going beyond the black box
• In this course you will
• Understand models on a deeper level
• Implement stuff from scratch
The danger zone
https://towardsdatascience.com/ocr-for-scanned-numbers-using-googles-automl-vision-29d193070c64
Types of machine learning
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
• IF A==1 && B==1 && C==1 && D==1 && E==1 && F==1 && G==0, THEN output(0)
• IF B==1 && C==1, THEN output(1)
• …
• OTHERWISE, output(“not number”)
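Rules like these can be written down by hand. A minimal sketch, assuming (purely for illustration) that A-G are seven binary segment-style features stored in a dict; the function name and input encoding are not from the slides:

```python
# Hand-written rules over hypothetical binary features A-G.
# Each input is a dict mapping feature name to 0/1 (an assumed encoding).

def rule_based_classifier(seg):
    # Rule for digit 0: segments A-F on, G off.
    if all(seg[s] == 1 for s in "ABCDEF") and seg["G"] == 0:
        return 0
    # Rule for digit 1: segments B and C on.
    if seg["B"] == 1 and seg["C"] == 1:
        return 1
    # ... more rules would go here ...
    return "not number"

print(rule_based_classifier(
    {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1, "F": 1, "G": 0}))  # → 0
```

Rule ordering matters here: the all-but-G pattern also satisfies the B-and-C rule, so the more specific rule must be checked first. This brittleness is exactly why learning F(x) from data is preferred.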
Learning from data
• Machine learning requires identifying the same ingredients
• Input, Output, Task
https://cloud.google.com/blog/products/gcp/how-a-japanese-cucumber-farmer-is-using-deep-learning-and-tensorflow
An example
• Handwritten digit recognition
• Input: x = 28 x 28 pixel image
• Output: y = digit 0 to 9
• Task: find F(x) such that y ≈ F(x)
[Diagram: real-world observations → sensors → feature extraction → feature vector x = (1, 5, 3.6, 1, 3, -1)]
How do we learn from data?
[Diagram: feature vectors such as (1, 5, 3.6, 1, 3, -1) form the training set, which is fed to the learning algorithm (training phase)]
How do we learn from data?
[Diagram: a new input X is converted to a feature vector and fed to the learned model h, which produces the predicted output y (testing phase)]
Feature extraction
• The process of extracting meaningful information related
to the goal
• A distinctive characteristic or quality
• Example features
[Figure: example feature images for data1, data2, data3]
Garbage in Garbage out
• The machine is as intelligent as the data/features we put
in
• “Garbage in, Garbage out”
• Data cleaning is often done
to reduce unwanted things
https://precisionchiroco.com/garbage-in-garbage-out/
The need for data cleaning
https://www.linkedin.com/pulse/big-data-conundrum-garbage-out-other-challenges-business-platform
Feature properties
• The quality of the feature vector is related to its ability to
discriminate samples from different classes
[Diagram: a new input X is mapped to a feature vector and classified by candidate models h1 and h2 (testing phase)]
Metrics
• Compare the output of the models
• Errors/failures, accuracy/success
• We want to quantify the error/accuracy of the models
• How would you measure the error/accuracy of the
following
Ground truths
• We usually compare the model predicted answer with the
correct answer.
• What if there is no real answer?
• How would you rate machine translation?
ไปไหน (“Where are you going?”)
http://www.ustar-consortium.com/qws/slot/u50227/research.html
Commonly used metrics
• Error rate
• Accuracy rate
• Precision
• True positive
• Recall
• False alarm
• F score
A detection problem
• Identify whether an event occurs
• A yes/no question
• A binary classifier
• Examples: smoke detector, hotdog detector
Evaluating a detection problem
• 4 possible scenarios
                 Detector: Yes                  Detector: No
Actual: Yes      True positive                  False negative (Type II error)
Actual: No       False alarm (Type I error)     True negative
Usually there’s a trade off between precision and recall. We will revisit this later
Let’s consider a case
• A “no rain” predictor has 97% accuracy
• It always says “no rain”

                    Detector: Rain    Detector: No rain
Actual: Rain        0                 1
Actual: No rain     0                 30
Note that precision and recall say nothing about the true negatives
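Plugging the counts above into the standard definitions (a sketch using the usual formulas, not code from the slides; the zero-denominator handling for precision is a choice made here):

```python
# Confusion counts from the rain example: 31 days, 1 rainy day,
# detector always says "no rain".
tp, fn, fp, tn = 0, 1, 0, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)        # 30/31, about 97%
recall = tp / (tp + fn) if (tp + fn) else 0.0     # 0.0: every rain day is missed
precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined (0/0); treated as 0 here

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

High accuracy yet zero recall: this is why accuracy alone is misleading on imbalanced detection problems.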
Evaluating models
• We talked about the training set used to learn the model - but the model should be evaluated on data it was not trained on
Deep learning
Why anything else besides deep learning?
https://medium.com/analytics-vidhya/ongoing-kaggle-survey-picks-the-topmost-data-
science-trends-7c19ec7606a1
KNN and K-means
clustering
Our first model - Unsupervised
learning
Discover the hidden structure in unlabeled data X (no y)
• Customer/product segmentation
• Data analysis for ...
• Identify number of speakers in a meeting recording
• Helps supervised learning in some task
Example - Customer analysis
[Scatter plot: brand loyalty vs. price sensitivity]
Example - Real Estate segmentation in Thailand
• What should be the input feature for this?
K-means clustering
Clustering - a task that tries to automatically discover groups within the data
• How?
[Scatter plot of the customer data]
Nearest Neighbour classification
Find the closest training data point and assign the query the same label as that point
[Scatter plot: brand loyalty vs. price sensitivity]
K-Nearest Neighbour (kNN) classification
Nearest Neighbour is susceptible to noise in the training data
Use a voting scheme instead
[Scatter plot: brand loyalty vs. price sensitivity; query point marked “Which cluster?”]
K-Nearest Neighbour (kNN) classification
• Given query data
• For every point in the training data, compute the distance to the query
• The k closest points vote on the label (figure shows k = 4)
Distance measures
• Euclidean distance
• Cosine similarity = cos(angle between the vectors)
• Many more distances: Jaccard distance, Earth mover distance
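The two most common measures, Euclidean distance and cosine similarity, can be sketched as (function names are illustrative):

```python
import math

def euclidean(a, b):
    # Straight-line distance between feature vectors a and b.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # cos(angle) between a and b: 1 = same direction, 0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(euclidean([0, 0], [3, 4]))          # → 5.0
print(cosine_similarity([1, 0], [0, 1]))  # → 0.0
```

Note the difference in kind: Euclidean is a distance (smaller = more similar), while cosine is a similarity (larger = more similar), so the two cannot be swapped without flipping the comparison.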
KNN runtime
• For every point in the training data, compute the distance with the query
• Find the K closest data points
• Assign label by voting
• Cost: O(N) per query, O(JN) for J queries. Expensive!
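The brute-force procedure above can be sketched as (data and function name are illustrative; squared distances are used since they preserve ordering):

```python
from collections import Counter

def knn_predict(train, query, k=4):
    """Brute-force kNN: train is a list of (feature_vector, label) pairs."""
    # O(N): compute the (squared) distance from the query to every training point.
    dists = sorted(
        (sum((x - y) ** 2 for x, y in zip(vec, query)), label)
        for vec, label in train
    )
    # Keep the k closest points and vote on the label.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [([0, 0], "A"), ([0, 1], "A"), ([1, 0], "A"),
         ([5, 5], "B"), ([5, 6], "B")]
print(knn_predict(train, [0.5, 0.5], k=3))  # → A
```

The full sort is O(N log N); a partial selection of the k smallest distances would keep it at O(N), but the dominant cost is still touching every training point for every query.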
Ways to make it faster
• Kernelized KNN
• Locality-Sensitive Hashing (LSH)
• Use centroids
Centroids
• Basically, the representative of the cluster
• Find the mean location of the cluster by averaging
• Can use mode or median depending on the data
• Classifying against centroids costs O(JL), where L is the number of clusters
[Scatter plot: brand loyalty vs. price sensitivity; query point marked “Which cluster?”]
K-means clustering
1. Randomly init k centroids by picking from the data points
2. Assign each data point to its nearest centroid
3. Update the centroid of each cluster
4. Repeat 2-3 until the centroids do not change
[Scatter plot: brand loyalty vs. price sensitivity]
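Steps 1-4 above can be sketched as (a minimal version on 2-D points; the data and function name are illustrative):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # 1. Randomly pick k data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # 2. Assign each point to its nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # 3. Update each centroid to the mean of its cluster
        #    (keep the old centroid if the cluster is empty).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # 4. Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

On this toy data the algorithm recovers the two obvious groups regardless of which points are sampled as initial centroids; on harder data, poor initialization can leave it stuck in a bad local minimum, which is why multiple random restarts are common in practice.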
An Illustration of K-Means Clustering
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
Characteristics of K-means
▪ The number of clusters, K, is specified in advance.
▪ Always converges to a (local) minimum.
• Poor starting centroid locations can lead to a bad local minimum.
Number of clusters, K
Selecting K - other methods
• Explained variance: plot the fraction of explained variance against K and pick the K that reaches a threshold, e.g. 95% explained variance
• Cross-validation: for each K (K = 2, 3, 4, …), fit a K-means clustering model on the training split and score it on the testing split

  K    Accuracy
  2    50%
  3    68%
  4    83%
  …

• Choose the K that maximizes a certain objective (e.g. accuracy on the testing data) - the best method
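The explained-variance criterion above can be sketched as follows. Explained variance is taken here as 1 minus the within-cluster sum of squares over the total sum of squares; the function name and the example assignment are illustrative:

```python
def explained_variance(points, labels):
    """Fraction of total variance explained by a clustering.

    points: list of coordinate tuples; labels: cluster id per point.
    """
    def mean(pts):
        return tuple(sum(d) / len(pts) for d in zip(*pts))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    # Total sum of squares: spread of all points around the global mean.
    tss = sum(sq_dist(p, overall) for p in points)
    # Within-cluster sum of squares: spread around each cluster's centroid.
    wcss = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = mean(members)
        wcss += sum(sq_dist(p, centroid) for p in members)
    return 1 - wcss / tss

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(explained_variance(points, [0, 0, 0, 1, 1, 1]))
```

As K grows, explained variance always increases (it reaches 1 when every point is its own cluster), so the threshold or "elbow" of the curve is what identifies a sensible K.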
Summary
• Other clustering methods
• K-mode, K-median
• Spectral clustering (clustering in embedding space)
• DBScan (clustering by “density”– very robust, no need for k*)
https://github.com/NSHipster/DBSCAN