Machine Learning
Yen-Kuang Chen, Ph.D., IEEE Fellow
[email protected]
Outline
• Class 1
• What is machine learning?
• Why do we want to learn about machine learning? (part 1)
• When can machines learn? (The conditions under which machines can learn)
• Why can machines learn? (The underlying mechanisms that allow machines to learn)
• Class 2
• Why do we want to learn about machine learning? (part 2)
• How can machines learn? (Techniques/approaches)
• How can machines learn better? (Strategies to improve performance/efficiency)
Quiz
Why do we want to use machine learning?
• (A) To make complex decisions without data
• (B) To avoid processing a large amount of information
• (C) To build systems with simple programmable rules
• (D) To process a huge amount of information and handle complexity
Quiz
When can machines learn?
• (A) When there is no performance measurement to be improved
• (B) When there are no samples or observations about the relationship
• (C) When there is an underlying relationship to be learned
• (D) When machines have access to unlimited computational resources
Class 1 Summary
• Why do we want to use machine learning?
• A huge amount of information must be processed
• However, no simple programmable rules
• ML can be an alternative route to building complicated systems
• 𝑌: +1 (malignant), −1 (benign)
• Note that 𝑦 is one-dimensional in this example
• ℎ(𝑥⃑) = sign(∑ᵢ₌₁ᵈ 𝑤ᵢ𝑥ᵢ + 𝑤₀𝑥₀) = sign(𝑤 ⋅ 𝑥⃑), with 𝑥₀ = 1
Perceptron in ℝ² (for visualization convenience)
• 𝑔(𝑥⃑) = sign(𝑤₀ + 𝑤₁𝑥₁ + 𝑤₂𝑥₂)
[Figure: + and − labeled points in the plane, separated by the line 𝑤₀ + 𝑤₁𝑥₁ + 𝑤₂𝑥₂ = 0]
• Features of tumor 𝑥⃑: points on the 2D plane (ℝ²); points in ℝᵈ in general
• Labels 𝑦: +1 (malignant), −1 (benign)
• Hypothesis 𝑔(𝑥⃑): a line in ℝ²; a hyperplane in ℝᵈ
• Positive on one side of the line (hyperplane); negative on the other side
After we have the model, how can we find the weights {𝑤ᵢ} and the threshold?
→ ML algorithms
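As a preview, evaluating a perceptron hypothesis is just an inner product and a sign. A minimal sketch in Python with NumPy (the weight and feature values are illustrative; 𝑥₀ = 1 is prepended so that 𝑤₀ absorbs the threshold):

```python
import numpy as np

def g(w, x):
    """Perceptron hypothesis: g(x) = sign(w . x), returning +1 or -1."""
    return 1 if np.dot(w, x) > 0 else -1

# Illustrative values only; w0 plays the role of the (negative) threshold
w = np.array([-2.0, 0.75, 0.0])   # (w0, w1, w2)
x = np.array([1.0, 3.0, 3.0])     # (x0 = 1, x1, x2)
print(g(w, x))                    # w . x = 0.25 > 0, so prints 1 (malignant)
```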
Math Pre-Requisites
• Vector inner product
• Quiz: is 𝑤ₜ ⋅ 𝑥⃑ₙ₍ₜ₎ > 0 or < 0?
• Vector addition
• Where is 𝑤ₜ₊₁ = 𝑤ₜ + 𝑥⃑ₙ₍ₜ₎?
[Figure: vectors 𝑤ₜ and 𝑥⃑ₙ₍ₜ₎ at several relative angles, illustrating the sign of the inner product and the sum 𝑤ₜ + 𝑥⃑ₙ₍ₜ₎]
A Simple Learning Algorithm
• Start from some random 𝑤₀, and correct its mistakes on 𝐷
• For 𝑡 = 0, 1, …
• Find a mistake, denoted (𝑥⃑ₙ₍ₜ₎, 𝑦ₙ₍ₜ₎)
• Based on the mistake, “correct” 𝑤ₜ → 𝑤ₜ₊₁
• Until no more mistakes
Perceptron Learning Algorithm (PLA)
• Start from some random 𝑤₀, and correct its mistakes on 𝐷
• For 𝑡 = 0, 1, …
• Find a mistake, denoted (𝑥⃑ₙ₍ₜ₎, 𝑦ₙ₍ₜ₎): sign(𝑤ₜ ⋅ 𝑥⃑ₙ₍ₜ₎) ≠ 𝑦ₙ₍ₜ₎
• (Try to) correct the mistake by 𝑤ₜ₊₁ ← 𝑤ₜ + 𝑦ₙ₍ₜ₎ 𝑥⃑ₙ₍ₜ₎
• Until no more mistakes
[Figure: when 𝑦ₙ₍ₜ₎ = +1, the update rotates 𝑤ₜ toward 𝑥⃑ₙ₍ₜ₎; when 𝑦ₙ₍ₜ₎ = −1, it rotates 𝑤ₜ away from 𝑥⃑ₙ₍ₜ₎]
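A minimal NumPy sketch of this loop (an illustration, not the course’s reference code; it assumes each row of X already has 𝑥₀ = 1 prepended and scans 𝐷 for the first mistake on every pass):

```python
import numpy as np

def pla(X, y, max_updates=1000):
    """Perceptron Learning Algorithm: returns a weight vector w."""
    w = np.zeros(X.shape[1])           # start from w0 = 0 (any start works)
    for _ in range(max_updates):
        preds = np.where(X @ w > 0, 1, -1)
        mistakes = np.flatnonzero(preds != y)
        if mistakes.size == 0:         # no more mistakes: done
            return w
        n = mistakes[0]                # pick a misclassified (x_n, y_n)
        w = w + y[n] * X[n]            # correct the mistake
    return w                           # may not have converged

# Toy data from the numerical example below: ((2,1),-1), ((3,3),+1), ((1,2),-1)
X = np.array([[1, 2, 1], [1, 3, 3], [1, 1, 2]], dtype=float)
y = np.array([-1, 1, -1])
print(pla(X, y))
```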
Pictorial Example
• Quiz
• What do the +1 and −1 in 𝐷’s “third” dimension represent?
• What does the −1 in the first dimension of 𝑤 represent?
• Given 𝑤, how do we classify 𝐷?
Numerical Example
PLA: 𝑤ₜ₊₁ ← 𝑤ₜ + 𝑦ₙ₍ₜ₎ 𝑥⃑ₙ₍ₜ₎, with 𝑥⃑ₙ₍ₜ₎ = (1, 𝑥₁, 𝑥₂)
• Mistake: ((2, 1), −1)
[Figure: the data points and current boundary on the (𝑥₁, 𝑥₂) plane]
Numerical Example
PLA: 𝑤ₜ₊₁ ← 𝑤ₜ + 𝑦ₙ₍ₜ₎ 𝑥⃑ₙ₍ₜ₎, with 𝑥⃑ₙ₍ₜ₎ = (1, 𝑥₁, 𝑥₂)
• 𝑤₁ = (−2, −1.25, −1)
• Mistake: ((3, 3), +1)
Numerical Example
PLA: 𝑤ₜ₊₁ ← 𝑤ₜ + 𝑦ₙ₍ₜ₎ 𝑥⃑ₙ₍ₜ₎, with 𝑥⃑ₙ₍ₜ₎ = (1, 𝑥₁, 𝑥₂)
• Mistake: ((1, 2), −1)
Numerical Example
• 𝑤₃ = (−2, 0.75, 0)
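Tracing the two updates above from 𝑤₁ confirms the final weights (a worked check using only the values on these slides):

𝑤₂ = 𝑤₁ + (+1)(1, 3, 3) = (−2, −1.25, −1) + (1, 3, 3) = (−1, 1.75, 2)
𝑤₃ = 𝑤₂ + (−1)(1, 1, 2) = (−1, 1.75, 2) − (1, 1, 2) = (−2, 0.75, 0)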
Why?
Why can machines learn?
• Humans design the learning algorithms, e.g.:
• Supervised
  • Classification: Perceptron Learning Algorithm (PLA), Support Vector Machines (SVM), Decision Trees, Linear Discriminant Analysis (LDA)
  • Regression: Linear regression
• Unsupervised
  • Clustering: K-means
  • Dimension reduction: Principal Component Analysis (PCA)
How?
Supervised vs. Unsupervised Learning
• Supervised Learning
• The algorithm is provided with a set of input/output pairs
• The correct output is pre-“labeled”
• Goal: to learn a mapping function that predicts outputs for new/unseen inputs
• Unsupervised Learning
• No pre-labeled outputs provided to the algorithm
• Instead, the algorithm must explore the structure of the data on its own and
identify meaningful patterns or relationships among the input variables
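A minimal contrast in code (a sketch assuming scikit-learn is available; the tiny dataset and labels are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.cluster import KMeans

X = np.array([[2, 1], [3, 3], [1, 2], [4, 3]], dtype=float)

# Supervised: input/output pairs, with correct outputs pre-labeled as y
y = np.array([-1, 1, -1, 1])
clf = Perceptron().fit(X, y)        # learns a mapping from x to y
print(clf.predict([[2.5, 3.0]]))    # predicts the label of an unseen input

# Unsupervised: no labels; the algorithm finds structure on its own
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                   # cluster assignment for each point
```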
Classification vs. Regression
• Classification
• Goal: to predict a categorical or discrete output variable
• E.g., benign or malignant, hand-written digit recognition
• Regression
• Goal: to predict a continuous numerical output
• E.g., price of a house, price of a used car
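In code, the two differ mainly in the type of the output; a short scikit-learn sketch with made-up numbers:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Classification: discrete output (e.g., benign -1 vs. malignant +1)
y_class = np.array([-1, -1, 1, 1])
print(DecisionTreeClassifier().fit(X, y_class).predict([[2.5]]))  # -1 or +1

# Regression: continuous output (e.g., price of a house)
y_reg = np.array([100.0, 150.0, 210.0, 240.0])
print(LinearRegression().fit(X, y_reg).predict([[2.5]]))          # a real number
```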
Binary vs. Multiclass Classification
• Binary
  • Patient: sick / not sick
  • Email: spam / non-spam
  • Credit: approve / disapprove
  • Answer: correct / incorrect
• Multiclass
  • Written digits → 0, 1, …, 9
  • Emails → primary, social, promotion, spam
  • Pictures → apple, orange, strawberry
Dimension Reduction
• Goal: to transform the original data into a new representation that captures the most variation in the data with the fewest components
• Applications: compression, feature extraction, data visualization
• Note that
• Data reduction refers to the process of reducing the amount of data in a dataset by removing
irrelevant or redundant information.
• Feature extraction, on the other hand, is the process of extracting useful information or
features from the raw data.
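A short dimension-reduction sketch with PCA (assuming scikit-learn; the synthetic data are for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples in 5 dimensions, with one deliberately redundant dimension
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)

# Project onto the 2 components that capture the most variation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)              # e.g., for data visualization
print(X_2d.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured per component
```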
Quiz
Match each task type (1–4) to a scenario:
1. Classification
2. Regression
3. Clustering
4. Dimension reduction
• Predicting the stock price of a company based on historical data.
• Identifying fraudulent transactions in a credit card dataset.
• Segmenting customers into groups based on their purchasing behaviors.
• Visualizing high-dimensional data in a lower-dimensional space.
Course Atlas
• Supervised
  • Classification: Perceptron Learning Algorithm (PLA), Support Vector Machines (SVM), Decision Trees, Linear Discriminant Analysis (LDA)
  • Regression: Linear regression
• Unsupervised
  • Clustering: K-means
  • Dimension reduction: Principal Component Analysis (PCA)
List of Key Questions About Each Machine
Learning Algorithm
• What kinds of data were given?
• Labeled?
• What kind of labels?
• Continuous or classification labels? Number of categories?
• Assumptions?
• Linearly separable?
• Learning algorithm?
• Direct optimization? Iterative optimization? Parameters?
• Inference computation?
• Linear? Polynomial? Non-linear?
• Error function?
• Class labels? Probabilities? Manual threshold?
• Overfitting or underfitting?
Summary
• How can machines learn?
• Different approaches to enable machines to learn, e.g.,
• Supervised
  • Classification: Perceptron Learning Algorithm (PLA), Support Vector Machines (SVM), Decision Trees, Linear Discriminant Analysis (LDA)
  • Regression: Linear regression
• Unsupervised
  • Clustering: K-means
  • Dimension reduction: Principal Component Analysis (PCA)
Better?
Remaining Question
• How can machines learn better?
• What are the strategies to improve performance/efficiency?
• Review of PLA
• Assumption: the data are linearly separable
• Learning algorithm: iteratively updates the weights in response to errors in its predictions
• However,
• Is PLA guaranteed to find a separating boundary?
• How can we find the separating boundary faster?
• Furthermore, if the data are linearly separable, is PLA the best algorithm?
Linear Separability
• On a mistake at step 𝑡 − 1, sign(𝑤ₜ₋₁ ⋅ 𝑥⃑ₙ₍ₜ₋₁₎) ≠ 𝑦ₙ₍ₜ₋₁₎ means 𝑦ₙ₍ₜ₋₁₎ 𝑤ₜ₋₁ ⋅ 𝑥⃑ₙ₍ₜ₋₁₎ ≤ 0, so each update grows ‖𝑤ₜ‖² only slowly:
‖𝑤ₜ‖² = ‖𝑤ₜ₋₁ + 𝑦ₙ₍ₜ₋₁₎ 𝑥⃑ₙ₍ₜ₋₁₎‖²
≤ ‖𝑤ₜ₋₁‖² + ‖𝑦ₙ₍ₜ₋₁₎ 𝑥⃑ₙ₍ₜ₋₁₎‖²
≤ ‖𝑤ₜ₋₁‖² + maxₙ ‖𝑥⃑ₙ‖²
≤ ‖𝑤₀‖² + 𝑡 ⋅ maxₙ ‖𝑥⃑ₙ‖²
Guarantee
• If the data are linearly separable, there is a perfect 𝑤_f with 𝑦ₙ 𝑤_f ⋅ 𝑥⃑ₙ > 0 for every 𝑛, and after correcting 𝑡 mistakes:
(𝑤_f ⋅ 𝑤ₜ) / (‖𝑤_f‖ ‖𝑤ₜ‖) ≥ √𝑡 ⋅ (minₙ 𝑦ₙ 𝑤_f ⋅ 𝑥⃑ₙ) / (‖𝑤_f‖ ⋅ maxₙ ‖𝑥⃑ₙ‖)
• The left-hand side is a cosine and cannot exceed 1, so 𝑡 is bounded: PLA halts on linearly separable data
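An empirical illustration of this guarantee (a sketch; the data, the separator 𝑤_f, and the helper name are all hypothetical). Starting from 𝑤₀ = 0, the two bounds above imply 𝑡 ≤ (maxₙ ‖𝑥⃑ₙ‖² ‖𝑤_f‖²) / (minₙ 𝑦ₙ 𝑤_f ⋅ 𝑥⃑ₙ)², which the script checks:

```python
import numpy as np

def pla_with_count(X, y):
    """Run PLA from w = 0; return (w, number of corrections t)."""
    w, t = np.zeros(X.shape[1]), 0
    while True:
        preds = np.where(X @ w > 0, 1, -1)
        mistakes = np.flatnonzero(preds != y)
        if mistakes.size == 0:
            return w, t
        n = mistakes[0]
        w, t = w + y[n] * X[n], t + 1

# Linearly separable toy data: labels come from a known separator w_f
rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, size=(50, 2))])
w_f = np.array([0.1, 1.0, -1.0])
y = np.where(X @ w_f > 0, 1, -1)

w, t = pla_with_count(X, y)
R2 = np.max(np.sum(X**2, axis=1))      # max_n ||x_n||^2
rho = np.min(y * (X @ w_f))            # min_n y_n w_f . x_n
print(t, "corrections; bound:", R2 * np.dot(w_f, w_f) / rho**2)
```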