CSEN240
Machine Learning
Yen-Kuang Chen, Ph.D., IEEE Fellow
[email protected]
Outline
• Class 1
• What is machine learning?
• Why do we want to learn about machine learning? (part 1)
• When can machines learn? (The conditions under which machines can learn)
• Why can machines learn? (The underlying mechanism that allows machines to learn)
• Class 2
• Why do we want to learn about machine learning? (part 2)
• How can machines learn? (Techniques/approaches)
• How can machines learn better? (Strategies to improve performance/efficiency)
Quiz
Why do we want to use machine learning?
• (A) To make complex decisions without data
• (B) To avoid processing a large amount of information
• (C) To build systems with simple programmable rules
• (D) To process a huge amount of information and handle complexity
Quiz
When can machines learn?
• (A) When there is no performance measurement to be improved
• (B) When there are no samples or observations about the relationship
• (C) When there is an underlying relationship to be learned
• (D) When machines have access to unlimited computational resources
Class 1 Summary
• Why do we want to use machine learning?
• A huge amount of information must be processed
• However, no simple programmable rules
• ML can be an alternative route to building complicated systems
• When can machines learn?
• There exists some underlying relationship to be learned
• There is a performance measurement to be improved
• There are samples (observations) about the relationship
• So, ML has some data to learn from
• Why can machines learn?
• The ability to learn comes from the models and algorithms used
Formalizing Machine Learning
• Input: 𝑥⃑ ∈ 𝑋
• Output: 𝑦⃑ ∈ 𝑌
• Unknown relationship to be learned (aka, target function): 𝑓: 𝑋 → 𝑌
• Data (aka, training examples): $D = \{(\vec{x}_1, \vec{y}_1), (\vec{x}_2, \vec{y}_2), \ldots, (\vec{x}_N, \vec{y}_N)\}$
• Skill (hopefully with good performance): 𝑔: 𝑋 → 𝑌
(Diagram) Unknown target function $f: X \to Y$ → Training examples $D = \{(\vec{x}_1, \vec{y}_1), \ldots, (\vec{x}_N, \vec{y}_N)\}$ → ML → Learned formula (hypothesis) $g \approx f$
A Simple Hypothesis: the ‘Perceptron’
• For $\vec{x} = (x_1, x_2, \ldots, x_d)$, the ‘features of the tumor’, compute a weighted score:
$$h(\vec{x}) = \begin{cases} \text{malignant} & \text{if } \sum_{i=1}^{d} w_i x_i \geq \text{threshold} \\ \text{benign} & \text{if } \sum_{i=1}^{d} w_i x_i < \text{threshold} \end{cases}$$
• $Y = \{+1 \text{ (malignant)}, -1 \text{ (benign)}\}$
• Note that $y$ is single-dimensional in this example
• Classification formula: $g(\vec{x}) = \operatorname{sign}\left(\sum_{i=1}^{d} w_i x_i - \text{threshold}\right)$
• (Ignore the $\operatorname{sign}(0)$ case for now)
Vector Form of the Perceptron Hypothesis
(For mathematical convenience)
$$g(\vec{x}) = \operatorname{sign}\left(\sum_{i=1}^{d} w_i x_i - \text{threshold}\right) = \operatorname{sign}\left(\sum_{i=1}^{d} w_i x_i + (-\text{threshold}) \times (+1)\right) = \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right) = \operatorname{sign}(\vec{w} \cdot \vec{x})$$
where $w_0 = -\text{threshold}$ and $x_0 = +1$.
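A quick NumPy check of this identity; the feature values and weights below are made up purely for illustration:

```python
import numpy as np

# Hypothetical features and weights, just to check the identity
x = np.array([0.8, 0.3])            # (x1, x2)
w = np.array([1.5, -0.5])           # (w1, w2)
threshold = 0.9

# Original form: sign(sum_i w_i x_i - threshold)
g1 = np.sign(w @ x - threshold)

# Vector form: fold the threshold into the weights as w0 = -threshold,
# and prepend a constant x0 = +1 to the input
w_aug = np.concatenate(([-threshold], w))   # (w0, w1, w2)
x_aug = np.concatenate(([1.0], x))          # (1, x1, x2)
g2 = np.sign(w_aug @ x_aug)

assert g1 == g2   # both forms give the same classification
```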
Perceptron in ℝ²
(For visualization convenience)
• $g(\vec{x}) = \operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$
• Features of tumor $\vec{x}$: points on the 2D plane (points in ℝᵈ in general)
• Labels $y$: +1 (malignant), −1 (benign)
• Hypothesis $g(\vec{x})$: a line (a hyperplane in ℝᵈ); positive on one side of the line (hyperplane), negative on the other side
(Figure: ‘+’ points on one side of a line and ‘−’ points on the other.)
After choosing the model, how can we find $\{w_i\}$ and the threshold?
→ ML algorithms
Math Pre-Requisites
• Vector inner product. Quiz: is $\vec{w}_t \cdot \vec{x}_{n(t)} > 0$ or $< 0$?
• Vector addition. Quiz: where is $\vec{w}_{t+1} = \vec{w}_t + \vec{x}_{n(t)}$?
(Figures: $\vec{w}_t$ and $\vec{x}_{n(t)}$ drawn at several relative angles.)
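A two-line NumPy check of both quiz questions; the specific vectors here are hypothetical:

```python
import numpy as np

w_t = np.array([1.0, 0.0])      # current weights
x_n = np.array([-0.5, 1.0])     # a training input

print(w_t @ x_n)   # inner product: -0.5 < 0, so the angle between them exceeds 90 degrees
print(w_t + x_n)   # vector addition: [0.5, 1.0], i.e., w rotated toward x
```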
A Simple Learning Algorithm
• Start from some random $\vec{w}_0$, and correct its mistakes on $D$
• For $t = 0, 1, \ldots$
  • Find a mistake, called $(\vec{x}_{n(t)}, y_{n(t)})$
  • Based on the mistake, “correct” $\vec{w}_t \to \vec{w}_{t+1}$
• Until no more mistakes
Perceptron Learning Algorithm (PLA)
• Start from some random $\vec{w}_0$, and correct its mistakes on $D$
• For $t = 0, 1, \ldots$
  • Find a mistake, called $(\vec{x}_{n(t)}, y_{n(t)})$: $\operatorname{sign}(\vec{w}_t \cdot \vec{x}_{n(t)}) \neq y_{n(t)}$
  • (Try to) correct the mistake by $\vec{w}_{t+1} \leftarrow \vec{w}_t + y_{n(t)}\,\vec{x}_{n(t)}$
• Until no more mistakes (see the sketch below)
(Figures: when $y_{n(t)} = +1$, the update rotates $\vec{w}$ toward $\vec{x}_{n(t)}$; when $y_{n(t)} = -1$, away from it.)
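A minimal NumPy sketch of PLA, assuming the inputs are already augmented with $x_0 = +1$ as in the vector form above (the function and parameter names are ours, not from the slides):

```python
import numpy as np

def pla(X, y, w0, max_updates=1000):
    """Perceptron Learning Algorithm.
    X : (N, d+1) array of inputs, each row (1, x1, ..., xd)
    y : (N,) array of labels in {+1, -1}
    w0: initial weights of length d+1
    Returns the weights once no mistakes remain on D."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_updates):
        mistakes = [n for n in range(len(X)) if np.sign(w @ X[n]) != y[n]]
        if not mistakes:                 # until no more mistakes
            break
        n = mistakes[0]                  # find a mistake (x_n(t), y_n(t))
        w = w + y[n] * X[n]              # correct it: w <- w + y_n(t) * x_n(t)
    return w
```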
Pictorial Example
PLA: $\vec{w}_{t+1} \leftarrow \vec{w}_t + y_{n(t)}\,\vec{x}_{n(t)}$
(Sequence of figures showing the decision boundary after each successive PLA update.)
Source: H.-T. Lin, Machine Learning Foundations
Group Exercise: Numerical Example
• $D = \{((1, 2), -1), ((2, 1), -1), ((3, 3), +1), ((4, 4), +1)\}$
• $\vec{w}_0 = (-1, 0.75, 0)$
• Quiz
  • What do the +1 and −1 in $D$’s “third” dimension represent?
  • What does the −1 in $\vec{w}_0$’s first dimension represent?
  • Given $\vec{w}_0$, how do we classify $D$?
Numerical Example (PLA: $\vec{w}_{t+1} \leftarrow \vec{w}_t + y_{n(t)}\,\vec{x}_{n(t)}$, with $\vec{x}_{n(t)} = (1, x_1, x_2)$)
• $D = \{((1, 2), -1), ((2, 1), -1), ((3, 3), +1), ((4, 4), +1)\}$
• $\vec{w}_0 = (-1, 0.75, 0)$
• Mistake: $((2, 1), -1)$
(Figure: the four points and the boundary for $\vec{w}_0$ in the $(x_1, x_2)$ plane.)
Numerical Example (PLA: $\vec{w}_{t+1} \leftarrow \vec{w}_t + y_{n(t)}\,\vec{x}_{n(t)}$, with $\vec{x}_{n(t)} = (1, x_1, x_2)$)
• $\vec{w}_1 = (-2, -1.25, -1)$
• Mistake: $((3, 3), +1)$
Numerical Example (PLA: $\vec{w}_{t+1} \leftarrow \vec{w}_t + y_{n(t)}\,\vec{x}_{n(t)}$, with $\vec{x}_{n(t)} = (1, x_1, x_2)$)
• $\vec{w}_2 = (-1, 1.75, 2)$
• Mistake: $((1, 2), -1)$
Numerical Example
• $\vec{w}_3 = (-2, 0.75, 0)$: no more mistakes on $D$, so PLA halts (see the sketch below)
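Replaying the worked example in code (a sketch; it reproduces the update sequence above):

```python
import numpy as np

# Inputs augmented with x0 = 1; labels taken from D above
X = np.array([[1, 1, 2], [1, 2, 1], [1, 3, 3], [1, 4, 4]], dtype=float)
y = np.array([-1, -1, 1, 1])
w = np.array([-1.0, 0.75, 0.0])                        # w_0

t = 0
while True:
    mistakes = [n for n in range(len(X)) if np.sign(w @ X[n]) != y[n]]
    if not mistakes:                                   # w_3 classifies all of D correctly
        break
    w = w + y[mistakes[0]] * X[mistakes[0]]            # PLA update on the first mistake found
    t += 1
    print(f"w_{t} = {w}")
# Reproduces w_1 = (-2, -1.25, -1), w_2 = (-1, 1.75, 2), w_3 = (-2, 0.75, 0)
```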
Why?
Why can machines learn?
• The ability to learn comes from the models and algorithms used, which humans design
(Diagram) Unknown target function $f: X \to Y$ → Training examples $D = \{(\vec{x}_1, \vec{y}_1), \ldots, (\vec{x}_N, \vec{y}_N)\}$ → ML → Learned formula (hypothesis) $g \approx f$
Summary
• Why do we want to use machine learning?
• A huge amount of information must be processed
• However, no simple programmable rules
• ML can be an alternative route to building complicated systems
• When can machines learn?
• There exists some underlying relationship to be learned
• There is a performance measurement to be improved
• There are samples (observations) about the relationship
• So, ML has some data to learn from
• Why can machines learn?
• The ability to learn comes from the models and algorithms used
Extended Discussion
• Why “cannot” machines learn?
Outline
• Class 1
• What is machine learning?
• Why do we want to learn about machine learning? (part 1)
• When can machines learn? (The conditions under which machines can learn)
• Why can machines learn? (The underlying mechanism that allows machines to learn)
• Class 2
• Why do we want to learn about machine learning? (part 2)
• How can machines learn? (Techniques/approaches)
• How can machines learn better? (Strategies to improve performance/efficiency)
Review of Perceptron Learning Algorithm
• What kinds of data were given?
• Labeled (often created by humans)
• What kinds of labels? → Supervised classification
• Two classes
• Assumptions?
• Linearly separable
• Learning algorithm?
• Iteratively updates its weights in response to errors in its prediction
Course Atlas
• Supervised
  • Classification: Perceptron Learning Algorithm (PLA), Support Vector Machines (SVM), Decision Trees, Linear Discriminant Analysis (LDA)
  • Regression: Linear regression
• Unsupervised
  • Clustering: K-means
  • Dimension reduction: PCA
How?
Supervised vs. Unsupervised Learning
• Supervised Learning
• The algorithm is provided with a set of input/output pairs
• The correct output is pre-“labeled”
• Goal: to learn a mapping function that predicts outputs for new/unseen inputs
• Unsupervised Learning
• No pre-labeled outputs provided to the algorithm
• Instead, the algorithm must explore the structure of the data on its own and
identify meaningful patterns or relationships among the input variables
Classification vs. Regression
• Classification
• Goal: to predict a categorical or discrete output variable
• E.g., benign or malignant, hand-written digit recognition
• Regression
• Goal: to predict a continuous numerical output
• E.g., price of a house, price of a used car
Binary vs. Multiclass Classification
• Binary
  • Patient: sick / not sick
  • Email: spam / non-spam
  • Credit: approve / disapprove
  • Answer: correct / incorrect
• Multiclass
  • Written digits → 0, 1, …, 9
  • Emails → primary, social, promotion, spam
  • Pictures → apple, orange, strawberry
• Quiz: any other examples?
Clustering and Dimension Reduction
• Clustering
• Goal: to identify natural groupings in the data without any prior knowledge of the
correct labels or categories
• E.g., segment customers into groups based on their purchasing behaviors
• Dimension reduction
• Goal: to transform the original data into a new representation that captures the most variation with the fewest components
• Applications: compression, feature extraction, data visualization
• Note that
• Data reduction refers to the process of reducing the amount of data in a dataset by removing
irrelevant or redundant information.
• Feature extraction, on the other hand, is the process of extracting useful information or
features from the raw data.
Quiz
Match each task type (1. Classification, 2. Regression, 3. Clustering, 4. Dimension reduction) to a scenario:
• Predicting the stock price of a company based on historical data.
• Identifying fraudulent transactions in a credit card dataset.
• Segmenting customers into groups based on their purchasing behaviors.
• Visualizing high-dimensional data in a lower-dimensional space.
Course Atlas
• Supervised
  • Classification: Perceptron Learning Algorithm (PLA), Support Vector Machines (SVM), Decision Trees, Linear Discriminant Analysis (LDA)
  • Regression: Linear regression
• Unsupervised
  • Clustering: K-means
  • Dimension reduction: PCA
List of Key Questions About Each Machine Learning Algorithm
• What kinds of data were given?
• Labeled?
• What kind of labels?
• Continuous or classification labels? Number of categories?
• Assumptions?
• Linearly separable?
• Learning algorithm?
• Direct optimization? Iterative optimization? Parameters?
• Inference computation?
• Linear? Polynomial? Non-linear?
• Error function?
• Class labels? Probabilities? Manual threshold?
• Overfitting or underfitting?
Summary
• How can machines learn?
  • Different approaches to enable machines to learn, e.g.,
    • Supervised
      • Classification: Perceptron Learning Algorithm (PLA), Support Vector Machines (SVM), Decision Trees, Linear Discriminant Analysis (LDA)
      • Regression: Linear regression
    • Unsupervised
      • Clustering: K-means
      • Dimension reduction: PCA
Better?
Remaining Question
• How can machines learn better?
• What are the strategies to improve performance/efficiency?
• Review of PLA
• Assumptions: Linearly separable data
• Learning algorithm: Iteratively updates its weights in response to errors in its
prediction
• However,
• Is PLA guaranteed to find the separating boundary?
• How can we find the separating boundary faster?
• Furthermore, even if the data are linearly separable, is PLA the best algorithm?
Linear Separability
Source: H.-T. Lin, Machine Learning Foundations
Will PLA Find the Boundary if the Data Are Linearly Separable?
Notation: $\vec{w}_t$: weights; $\vec{x}_n$: input; $y_n$: output.
• Linearly separable means there is a $\vec{w}_f$ such that
  • $\operatorname{sign}(\vec{w}_f \cdot \vec{x}_n) = y_n$ for all $n$
  • $y_n\,\vec{w}_f \cdot \vec{x}_n \geq \min_n y_n\,\vec{w}_f \cdot \vec{x}_n > 0$
• The inner product $\vec{w}_f \cdot \vec{w}_t$ after updating with $(\vec{x}_{n(t-1)}, y_{n(t-1)})$:
$$\begin{aligned} \vec{w}_f \cdot \vec{w}_t &= \vec{w}_f \cdot (\vec{w}_{t-1} + y_{n(t-1)}\,\vec{x}_{n(t-1)}) \\ &= \vec{w}_f \cdot \vec{w}_{t-1} + y_{n(t-1)}\,\vec{w}_f \cdot \vec{x}_{n(t-1)} \\ &\geq \vec{w}_f \cdot \vec{w}_{t-1} + \min_n y_n\,\vec{w}_f \cdot \vec{x}_n \\ &\geq \vec{w}_f \cdot \vec{w}_0 + t \cdot \min_n y_n\,\vec{w}_f \cdot \vec{x}_n \end{aligned}$$
• Two unit vectors are close to each other when their inner product is close to 1, so we compare this growth against the growth of $\|\vec{w}_t\|$.
𝑤" Does Not Grow Too Fast
" "
• 𝑤( = 𝑤(.! + 𝑦#((.!) 𝑥⃑#((.!)
" "
= 𝑤(.! + 2𝑦#((.!) 𝑤(.! S 𝑥⃑#((.!) + 𝑦#((.!) 𝑥⃑#((.!)
• When there is a mistake, sign 𝑤% ⋅ 𝑥⃑#(%) ≠ 𝑦#(%)
" "
≤ 𝑤(.! + 𝑦#((.!) 𝑥⃑#((.!)
" "
≤ 𝑤(.! + max 𝑥⃑#
#
" "
≤ 𝑤' + 𝑡 max 𝑥⃑#
#
Guarantee
• After correcting $t$ mistakes (taking $\vec{w}_0 = 0$):
$$\frac{\vec{w}_f \cdot \vec{w}_t}{\|\vec{w}_f\|\,\|\vec{w}_t\|} \geq \sqrt{t} \cdot \frac{\min_n y_n\,\vec{w}_f \cdot \vec{x}_n}{\|\vec{w}_f\| \cdot \max_n \|\vec{x}_n\|}$$
• As long as the data are linearly separable and PLA keeps correcting mistakes:
  • The inner product of $\vec{w}_f$ and $\vec{w}_t$ grows fast: $O(t)$
  • The length of $\vec{w}_t$ grows slowly: $O(\sqrt{t})$
  • PLA’s $\vec{w}_t$ becomes more and more aligned with $\vec{w}_f$
• Since the left-hand side is a cosine and cannot exceed 1, the number of mistakes $t$ is bounded, so PLA halts.
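An empirical check of this alignment on the numerical example from earlier (a sketch: we take $\vec{w}_f = (-2, 0.75, 0)$, the separating weights found there, as our known separator, though any separator would do):

```python
import numpy as np

X = np.array([[1, 1, 2], [1, 2, 1], [1, 3, 3], [1, 4, 4]], dtype=float)
y = np.array([-1, -1, 1, 1])
w_f = np.array([-2.0, 0.75, 0.0])    # a known separator for this data

w = np.zeros(3)                      # start PLA from w_0 = 0
t = 0
while True:
    mistakes = [n for n in range(len(X)) if np.sign(w @ X[n]) != y[n]]
    if not mistakes:
        break
    w = w + y[mistakes[0]] * X[mistakes[0]]   # PLA update
    t += 1
    cos = (w_f @ w) / (np.linalg.norm(w_f) * np.linalg.norm(w))
    print(f"t = {t}: cos(angle between w_f and w_t) = {cos:.3f}")
# The alignment improves overall as t grows (the bound above grows like
# sqrt(t)), though not necessarily monotonically at every step.
```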
Pros and Cons
• Pros
• Simple to implement
• Cons
• Assumes the data are linearly separable (a property unknown in advance)
• No advance guarantee on how long it will take to halt
Variations of the Simple PLA
• Allow a small number of data points to be treated as noise by minimizing the number of mistakes:
$$\vec{w} = \arg\min_{\vec{w}} \sum_n \left[\!\left[\, y_n \neq \operatorname{sign}(\vec{w} \cdot \vec{x}_n) \,\right]\!\right]$$
• Unfortunately, this is NP-hard to solve exactly
• Instead, modify the algorithm into the Pocket algorithm, which keeps the best weights found so far “in the pocket” (sketched below)
Source: H.-T. Lin, Machine Learning Foundations
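A sketch of the Pocket algorithm (the function name and parameters are ours; this is the standard best-so-far variant, not necessarily the exact formulation from the source):

```python
import numpy as np

def pocket(X, y, max_updates=100, seed=0):
    """Pocket algorithm: run PLA-style updates, but keep the best
    weights seen so far (fewest mistakes on D) 'in the pocket'.
    Usable even when the data are not linearly separable."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])

    def num_mistakes(v):
        return int(sum(np.sign(v @ X[n]) != y[n] for n in range(len(X))))

    best_w, best_err = w.copy(), num_mistakes(w)
    for _ in range(max_updates):
        mistakes = [n for n in range(len(X)) if np.sign(w @ X[n]) != y[n]]
        if not mistakes:
            break
        n = mistakes[int(rng.integers(len(mistakes)))]  # pick a random mistake
        w = w + y[n] * X[n]                             # PLA-style correction
        err = num_mistakes(w)
        if err < best_err:                              # pocket the improvement
            best_w, best_err = w.copy(), err
    return best_w
```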
Variations of the Simple PLA
• Randomize the order of the training data
  • Prevents the algorithm from getting stuck in local minima
• Use a more sophisticated learning rate
  • Learning rate: a hyperparameter that controls the size of the update steps
  • $\vec{w}_{t+1} \leftarrow \vec{w}_t + y_{n(t)}\,\vec{x}_{n(t)}$ becomes $\vec{w}_{t+1} \leftarrow \vec{w}_t + \alpha\,y_{n(t)}\,\vec{x}_{n(t)}$ (see the sketch after this list)
• Modify the algorithm
  • E.g., the perceptron with momentum
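A sketch combining the first two variations above, data shuffling and a learning rate (the names and default values are our own assumptions):

```python
import numpy as np

def pla_variant(X, y, alpha=0.5, max_epochs=100, seed=0):
    """PLA with a shuffled visiting order and a learning rate alpha
    that scales each update step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    order = np.arange(len(X))
    for _ in range(max_epochs):
        rng.shuffle(order)                   # randomize the data order each epoch
        num_mistakes = 0
        for n in order:
            if np.sign(w @ X[n]) != y[n]:
                w = w + alpha * y[n] * X[n]  # scaled update step
                num_mistakes += 1
        if num_mistakes == 0:                # a full clean pass: done
            break
    return w
```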
Extended Discussion
• Even if data are linearly separable, is PLA the best algorithm?
List of Key Questions About Each Machine Learning Algorithm
• What kinds of data were given?
• Labeled?
• What kind of labels?
• Continuous or classification labels? Number of categories?
• Assumptions?
• Linearly separable?
• Learning algorithm?
• Direct optimization? Iterative optimization? Parameters?
• Inference computation?
• Linear? Polynomial? Non-linear?
• Error function?
• Class labels? Probabilities? Manual threshold?
• Overfitting vs. underfitting?
This Course: Learn to Choose the Right Tool
• Many ML approaches
• Each tailored to different needs
• Each works best under specific circumstances
Our goal: learn which tool to use and when
Summary
• How can machines learn?
• Different approaches to enable machines to learn, e.g.,
• Supervised learning (with labeled data)
• Unsupervised learning (identify patterns in unlabeled data)
• How can machines learn better?
• Strategies to improve performance/efficiency
• In the next few weeks, we will see a number of different algorithms that can
solve the same problems (with the same assumptions)
Learning Objectives
• Demonstrate knowledge of and ability to solve problems in foundational
topics in machine learning
• Including logistic regression, linear discriminant analysis, Bayesian classification, and
support vector machines.
• Implement supervised learning algorithms
• For example, decision trees and linear regression.
• Implement unsupervised learning algorithms
• Such as clustering algorithms or principal-component analysis.
• Work with real data sets, create training and test data, and analyze the
results of learning algorithms.
• Demonstrate knowledge of neural networks
• Particularly backpropagation.