1AI.04b - Introduction To Machine Learning - Supervised Learning - DT PDF
Supervised learning
Artificial Intelligence Course
CONTENTS
• Regression:
o Linear Regression
• Classification:
o Naïve Bayes
o Decision trees
Supervised learning
• Given a training set of example input–output pairs:
(x₁, y₁), …, (xₙ, yₙ), where xᵢ ∈ ℝᵈ
• Classification:
o y is discrete. To simplify, y ∈ {−1, +1}
o f : ℝᵈ → {−1, +1}
f is called a binary classifier.
• Regression:
o y is a real value, y ∈ ℝ
o f : ℝᵈ → ℝ
f is called a regressor.
Regression
Linear Regression
Linear Regression
• Given a training set of example input–output pairs:
(x₁, y₁), …, (xₙ, yₙ), where xᵢ ∈ ℝᵈ and yᵢ ∈ ℝ
Linear Regression
• Univariate linear regression model (simple case with one feature):
h_w(x) = ω₁x + ω₀
o ω are the weights
o w is the vector (ω₀, ω₁)
o Learning the linear model → learning the ω
Linear Regression
A linear model is a line in ℝ² and a hyperplane in ℝ³.
• Estimation with a loss function:
o E.g. use the squared error
o Find ω₀, ω₁ that minimize the loss over all examples:
Loss(h_w) = Σⱼ (yⱼ − h_w(xⱼ))² = Σⱼ (yⱼ − (ω₁xⱼ + ω₀))²
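As a sketch, the model and its squared-error loss can be written out in Python (the function names and toy data here are illustrative, not from the slides):

```python
# Univariate linear model and its squared-error loss.

def h(w0, w1, x):
    """The model h_w(x) = w1*x + w0."""
    return w1 * x + w0

def loss(w0, w1, xs, ys):
    """Squared-error loss summed over all training examples."""
    return sum((y - h(w0, w1, x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated by y = 2x
print(loss(0.0, 2.0, xs, ys))  # 0.0 for the perfect fit y = 2x
```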
Figure: data points of price versus floor space of houses for sale, and
the loss function for various values of ω₀, ω₁; the solution is
ω₁ = 0.232, ω₀ = 246, i.e. y = 0.232x + 246.
Gradient descent
• Gradient descent is an optimization method: repeatedly update each
weight against the slope of the loss,
ωᵢ ← ωᵢ − α ∂Loss(w)/∂ωᵢ
• α is a learning rate
Gradient descent
• For univariate regression, the loss is quadratic, so the partial
derivative will be linear.
• Recall:
o Chain rule: ∂g(f(x))/∂x = g′(f(x)) · ∂f(x)/∂x
o and ∂x²/∂x = 2x
Gradient descent
• For one training example (x, y), let's work out the partial
derivatives (the slopes):
∂Loss/∂ω₀ = −2 (y − h_w(x))
∂Loss/∂ω₁ = −2 (y − h_w(x)) · x
Gradient descent
• Plugging this into the gradient descent algorithm, and folding the 2
into the unspecified learning rate α, we get the following learning rule
for the weights:
ω₀ ← ω₀ + α (y − h_w(x))
ω₁ ← ω₁ + α (y − h_w(x)) · x
Gradient descent
• For N training examples, we want to minimize the sum of the
individual losses for each example, giving the batch updates:
ω₀ ← ω₀ + α Σⱼ (yⱼ − h_w(xⱼ))
ω₁ ← ω₁ + α Σⱼ (yⱼ − h_w(xⱼ)) · xⱼ
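These batch updates can be sketched in a few lines of Python; the learning rate, step count, and toy data below are illustrative choices, not values from the slides:

```python
# Batch gradient descent for univariate linear regression, applying
# the summed update rules to all examples at each step.

def gradient_descent(xs, ys, alpha=0.01, n_steps=2000):
    w0, w1 = 0.0, 0.0                        # arbitrary starting weights
    for _ in range(n_steps):
        # errors y_j - h_w(x_j) for every example
        errs = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errs)                              # w0 update
        w1 += alpha * sum(e * x for e, x in zip(errs, xs))   # w1 update
    return w0, w1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # generated by y = 2x + 1
w0, w1 = gradient_descent(xs, ys)
print(w0, w1)                    # approaches w0 = 1, w1 = 2
```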
Example
• Programming it:
• Training: find the line (model) that lies closest to the points above.
• Find it with the gradient descent algorithm.
• Inference: predict the price of a 100 m² house using the line that was
found.
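Assuming the fitted weights from the earlier slide (y = 0.232x + 246) apply here, inference is a single evaluation of the line:

```python
# Predict the price of a 100 m^2 house from the fitted line.
w0, w1 = 246.0, 0.232
price = w1 * 100 + w0
print(price)  # 269.2
```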
Loss function
• A human: finding w0, w1 can be easy if done by eye.
• A computer: the values are chosen at random at first, then adjusted
gradually.
• Summary: we went from finding a line (model) to finding the w0, w1
that minimize the function J. Now we need an algorithm to find the
minimum of J. That is gradient descent.
Derivatives
• f'(1) = 2 · 1 < f'(2) = 2 · 2: the graph is steeper near x = 2
than near x = 1.
• The larger the absolute value of the derivative at a point, the
steeper the graph is near that point.
• If the derivative at a point is negative, the graph is decreasing
there: as x increases, y decreases.
• If the derivative at a point is positive, the graph is increasing
around that point.
Gradient descent
• Gradient descent is an algorithm that finds the minimum of a function
f(x) using its derivative. The algorithm: repeat x ← x − α f'(x) until
convergence.
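A minimal sketch of this procedure on f(x) = x², whose derivative f'(x) = 2x was used in the slides above (the learning rate and step count here are illustrative):

```python
# Gradient descent on f(x) = x^2, using its derivative f'(x) = 2x.

def minimize(f_prime, x0, alpha=0.1, n_steps=100):
    x = x0
    for _ in range(n_steps):
        x = x - alpha * f_prime(x)   # step against the slope
    return x

x_min = minimize(lambda x: 2 * x, x0=2.0)
print(x_min)  # very close to 0, the minimum of x^2
```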
• https://aicurious.io/posts/linear-regression/
Classification
Naïve Bayes
Conditional Probability
• p(A | B) = p(A ∧ B) / p(B): the probability of A given that B holds.
Bayes Rule
• Writing p(A ∧ B) in two different ways:
p(A ∧ B) = p(A | B) p(B) = p(B | A) p(A)
• Equating the two gives Bayes' rule:
p(A | B) = p(B | A) p(A) / p(B)
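A quick numerical check of Bayes' rule p(A | B) = p(B | A) p(A) / p(B); the probabilities below are made up for illustration:

```python
# Bayes' rule as a one-line computation with made-up numbers.
p_A = 0.3
p_B_given_A = 0.5
p_B = 0.25
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 0.6
```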
Bayes Rule
• This table divides the sample space into 4 mutually exclusive events.
• The probabilities in the margins are called marginals and are
calculated by summing across the rows and the columns.
• Another form (expanding p(B) with the marginals):
p(A | B) = p(B | A) p(A) / (p(B | A) p(A) + p(B | ¬A) p(¬A))
Why probabilities?
• Why bring in a Bayesian framework here?
• Recall the classification framework:
o Given training data: (x₁, y₁), …, (xₙ, yₙ), where xᵢ ∈ ℝᵈ and yᵢ ∈ Y
o Task: learn a classification function f : ℝᵈ → Y
Discriminative Algorithms
• Idea: model p(y | x), the conditional distribution of y given x.
• In discriminative algorithms: find a decision boundary that separates
positive from negative examples.
• To predict a new example, check on which side of the decision
boundary it falls.
• Model p(y | x) directly.
Generative Algorithms
• Idea: build a model of what positive examples look like, and a
different model of what negative examples look like.
• To predict a new example, match it with each of the models and see
which match is best.
• Model p(x | y) and p(y)!
• Use Bayes' rule to obtain p(y | x) = p(x | y) p(y) / p(x)
• To make a prediction: ŷ = argmax_y p(x | y) p(y)
Setting
• In the training data (xᵢ, yᵢ), xᵢ is a feature vector and yᵢ is a
discrete label.
• d features and n examples.
• E.g. consider document classification: each example is a document, and
each feature represents the presence or absence of a particular word
in the document.
• We have a training set.
• A new example has feature values x_new = (a₁, a₂, …, a_d).
• We want to predict the label y_new of the new example.
Setting
• Use Bayes' rule (with the naïve conditional-independence assumption)
to obtain:
p(y | a₁, …, a_d) ∝ p(y) ∏ⱼ p(aⱼ | y)
Algorithm
• Learning: based on the frequency counts in the dataset:
1. Estimate all p(y), ∀y ∈ Y.
2. Estimate all p(aⱼ | y), ∀y ∈ Y, ∀aⱼ.
• Classification: for a new example, use:
y_new = argmax_{y ∈ Y} p(y) ∏ⱼ p(aⱼ | y)
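The learning and classification steps above can be sketched as follows, assuming categorical features; the toy weather data is made up for illustration:

```python
# Minimal Naive Bayes classifier built from frequency counts.
from collections import Counter, defaultdict

def train(examples, labels):
    """Estimate p(y) and p(a_j | y) from frequency counts."""
    n = len(labels)
    label_counts = Counter(labels)
    p_y = {y: c / n for y, c in label_counts.items()}
    counts = defaultdict(Counter)          # (j, y) -> Counter of values
    for x, y in zip(examples, labels):
        for j, a in enumerate(x):
            counts[(j, y)][a] += 1
    p_a_given_y = {
        (j, y, a): c / label_counts[y]
        for (j, y), ctr in counts.items() for a, c in ctr.items()
    }
    return p_y, p_a_given_y

def classify(x_new, p_y, p_a_given_y):
    """Pick argmax_y p(y) * prod_j p(a_j | y)."""
    best, best_score = None, -1.0
    for y, py in p_y.items():
        score = py
        for j, a in enumerate(x_new):
            score *= p_a_given_y.get((j, y, a), 0.0)
        if score > best_score:
            best, best_score = y, score
    return best

# toy data: one feature (the weather), label = play or not
X = [("sunny",), ("sunny",), ("rainy",), ("rainy",), ("rainy",)]
Y = ["yes", "yes", "no", "no", "yes"]
model = train(X, Y)
print(classify(("sunny",), *model))  # "yes"
```

Note that with zero counts a probability estimate can be 0; in practice a smoothing term is usually added.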
Example
Solution
• Conditional probabilities:
y_new = yes
Classification
Decision trees
Tree Classifiers
• The terminology "tree" is graphic; however, a decision tree is grown
from the root downward.
• The idea is to send the examples down the tree, using the concept of
information entropy.
• General steps to build a tree:
1. Start with the root node, which holds all the examples.
2. Greedily select the next best feature to build the branches. The
splitting criterion is node purity.
3. The majority class is assigned to the leaves.
Classification
• Given a training set of example input–output pairs:
(x₁, y₁), …, (xₙ, yₙ)
where xᵢ ∈ ℝᵈ and yᵢ is discrete (categorical/qualitative), yᵢ ∈ Y.
• E.g. Y = {−1, +1}, or Y = {0, 1}
• Task: Learn a classification function:
f : ℝᵈ → Y
• In the case of Tree Classifiers:
1. No need for xᵢ ∈ ℝᵈ, so no need to turn categorical features into
numerical features.
2. The model is a tree.
Example
• At the first split, starting from the root, we choose the attribute
with the maximum information gain:
Gain(A) = H(parent) − Σ_v (n_v / n) H(child_v)
• Then we restart the same process at each of the child nodes (if the
node is not pure).
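The greedy selection can be sketched with entropy and information gain; the toy labels and attribute values below are made up for illustration:

```python
# Entropy and information gain for choosing the split attribute.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain = H(parent) - sum_v (|S_v|/|S|) H(S_v) over attribute values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]       # splits the classes perfectly
useless = ["a", "b", "a", "b"]       # tells us nothing about the class
print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

A tree builder would compute this gain for every remaining attribute and branch on the winner.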