
Artificial Intelligence Course

Chapter 4: Machine Learning

Supervised learning


CONTENTS
• Regression:
o Linear Regression
• Classification:
o Naïve Bayes
o Decision trees


Supervised learning
• Given a training set of example input–output pairs:
$(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$
• Classification:
o $y$ is discrete. To simplify, $y \in \{-1, +1\}$
o $f: \mathbb{R}^d \to \{-1, +1\}$
f is called a binary classifier.
• Regression:
o $y$ is a real value, $y \in \mathbb{R}$
o $f: \mathbb{R}^d \to \mathbb{R}$
f is called a regressor.


Regression
Linear Regression

Linear Regression
• Given a training set of example input–output pairs:
$(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$

• Task: Learn a regression function:
o $h: \mathbb{R}^d \to \mathbb{R}$
o $h(x) = y$
• A regression model is said to be linear if it is represented by a linear function.


Linear Regression
• Univariate linear regression model (simple case with one feature):
$h_w(x) = w_1 x + w_0$
• Multivariable linear regression model:
$h_w(x) = w_0 + w_1 x_1 + \cdots + w_d x_d = \sum_{j=0}^{d} w_j x_j \quad (\text{with } x_0 = 1)$
o $w$ are the weights
o $w$ is the vector $(w_0, w_1)$ in the univariate case
o Learning the linear model → learning the $w$
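In the multivariable case, prediction is just a dot product once a constant feature $x_0 = 1$ is prepended; a minimal sketch (our own illustration, reusing the weights $w_1 = 0.232$, $w_0 = 246$ from the housing example later in these slides):

```python
import numpy as np

w = np.array([246.0, 0.232])   # [w0, w1] from the housing example below
x = np.array([1.0, 100.0])     # prepend x0 = 1 so that h_w(x) = w . x
print(np.dot(w, x))            # 246 + 0.232 * 100 = 269.2
```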


Linear Regression

(Figure: a linear model is a line in $\mathbb{R}^2$ and a hyperplane in $\mathbb{R}^3$.)


Univariate Linear Regression


• Univariate linear regression model (simple case with one feature):
$h_w(x) = w_1 x + w_0$
o $w_0, w_1$ are weights
o $w$ is the vector $(w_0, w_1)$
o Learning the linear model → learning the $w$
• Estimation with a loss function:
o E.g. use the squared error.
o Find $w_0, w_1$ that minimize the loss over all examples:
$L(w_0, w_1) = \sum_{i=1}^{n} \left(y_i - (w_1 x_i + w_0)\right)^2$

Univariate Linear Regression


• We want to find: $w^* = \arg\min_{w_0, w_1} L(w_0, w_1)$
• The sum is minimized when its partial derivatives with respect to $w_0$ and $w_1$ are zero:
$\frac{\partial L}{\partial w_0} = 0 \quad \text{and} \quad \frac{\partial L}{\partial w_1} = 0$
• These equations have a unique solution:
$w_1 = \frac{n \sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad w_0 = \frac{\sum_i y_i - w_1 \sum_i x_i}{n}$
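As a sanity check, the closed-form solution can be computed directly; a minimal NumPy sketch (the first two data points are from the slides' table, the rest are made-up values for illustration):

```python
import numpy as np

# Floor space (m^2) vs. price; first two points from the slides, rest invented.
x = np.array([30.0, 32.4138, 45.0, 60.0, 80.0, 100.0])
y = np.array([448.524, 509.248, 560.0, 630.0, 720.0, 810.0])

n = len(x)
# Closed-form least-squares solution for h_w(x) = w1 * x + w0.
w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
w0 = (np.sum(y) - w1 * np.sum(x)) / n
print(f"h(x) = {w1:.3f} x + {w0:.3f}")
```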


Univariate Linear Regression

(Figures: data points of price versus floor space of houses for sale, with the fitted line; the solution is $w_1 = 0.232$, $w_0 = 246$, i.e. $y = 0.232x + 246$. Second panel: the loss function for various values of $w_0, w_1$.)


Gradient descent
• Gradient descent is an optimization method: start from some initial weights and repeatedly move them a small step in the direction that decreases the loss,
$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} L(w)$
• $\alpha$ is a learning rate.


Gradient descent
• For univariate regression, the loss is quadratic, so the partial derivative will be linear.
• Recall:
o Chain rule: $\frac{\partial g(f(x))}{\partial x} = g'(f(x)) \cdot \frac{\partial f(x)}{\partial x}$
o and $\frac{\partial}{\partial z} z^2 = 2z$


Gradient descent
• For one training example $(x, y)$, let's work out the partial derivatives—the slopes:
$\frac{\partial}{\partial w_j} L(w) = \frac{\partial}{\partial w_j} \left(y - h_w(x)\right)^2 = 2\left(y - h_w(x)\right) \cdot \frac{\partial}{\partial w_j} \left(y - (w_1 x + w_0)\right)$
• Applying this to both $w_0$ and $w_1$ we get:
$\frac{\partial L}{\partial w_0} = -2\left(y - h_w(x)\right), \qquad \frac{\partial L}{\partial w_1} = -2\left(y - h_w(x)\right) x$

Gradient descent
• Plugging this into the gradient descent algorithm, and folding the 2 into the unspecified learning rate $\alpha$, we get the following learning rule for the weights:
$w_0 \leftarrow w_0 + \alpha \left(y - h_w(x)\right), \qquad w_1 \leftarrow w_1 + \alpha \left(y - h_w(x)\right) x$
• These updates make intuitive sense: if $h_w(x) > y$ (i.e., the output is too large):
o reduce $w_0$ a bit,
o and reduce $w_1$ if $x$ was a positive input, but increase $w_1$ if $x$ was a negative input.


Gradient descent
• For $N$ training examples, we want to minimize the sum of the individual losses for each example; the updates become sums over the examples:
$w_0 \leftarrow w_0 + \alpha \sum_{j} \left(y_j - h_w(x_j)\right), \qquad w_1 \leftarrow w_1 + \alpha \sum_{j} \left(y_j - h_w(x_j)\right) x_j$
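A minimal sketch of these batch updates in Python (our own illustration; the toy data and hyperparameters are assumed, and real-scale features such as floor space in m² would need feature scaling or a much smaller $\alpha$ to keep the updates stable):

```python
import numpy as np

def fit_univariate(x, y, alpha=0.01, iters=5000):
    """Batch gradient descent for h_w(x) = w1 * x + w0,
    using the update rules above (the factor 2 is folded into alpha)."""
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        err = y - (w1 * x + w0)        # residuals y_j - h_w(x_j)
        w0 += alpha * np.sum(err)      # step along -dL/dw0
        w1 += alpha * np.sum(err * x)  # step along -dL/dw1
    return w0, w1

# Small synthetic data, roughly y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
w0, w1 = fit_univariate(x, y)
print(f"h(x) = {w1:.3f} x + {w0:.3f}")   # expect w1 close to 2, w0 close to 1
```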


Pros & Cons


+ Effective and efficient even in high dimensions.
BUT …
- Iterative (sometimes needs many iterations to converge).
- Requires choosing the learning rate $\alpha$.


Example

(Figures: fitting a line to house-price data, worked through below.)

• Programming the example:
• Training: find the line (model) that best fits the points above.
• The line is found with the gradient descent algorithm.
• Inference: predict the price of a 100 m² house from the fitted line.


Setting up the formulas

• Model: the equation of a line, of the form y = ax + b.

• Instead of the symbols a and b, to make the matrix representation in later sections more convenient, we substitute w1 = a and w0 = b:
y = w1 * x + w0
• To make the derivation easier, we label the data in the table: (x1, y1) = (30, 448.524), (x2, y2) = (32.4138, 509.248), …
• A house of area xi actually costs yi.
• The value the current model predicts is denoted $\hat{y}_i = w_1 x_i + w_0$.


Loss function
• By hand: finding w0, w1 can be simple if done by eye.
• By computer: the values are chosen randomly at first, then adjusted step by step.


• Summary: finding the line (model) reduces to finding the w0, w1 that minimize the loss function J. We now need an algorithm for finding the minimum of J: that algorithm is gradient descent.


Derivatives
• f'(1) = 2 · 1 < f'(2) = 2 · 2: the graph (here f(x) = x², so f'(x) = 2x) is steeper near x = 2 than near x = 1.
• The larger the absolute value of the derivative at a point, the steeper the graph is near that point.
• If the derivative at a point is negative, the graph is decreasing there: as x increases, y decreases.
• If the derivative at a point is positive, the graph around that point is increasing.


Gradient descent
• Gradient descent is an algorithm that finds the minimum of a function f(x) using its derivative. The algorithm:
1. Pick a random initial value of x.
2. Update x ← x − learning_rate · f'(x).
3. Repeat step 2 until f'(x) is close enough to 0, or a maximum number of iterations is reached.


• Choosing the learning_rate coefficient is extremely important (see the sketch below):

• If learning_rate is small: each update decreases the function only slightly, so step 2 must be repeated very many times for the function to reach its minimum.
• If learning_rate is reasonable: after a moderate number of repetitions of step 2, the function reaches a sufficiently small value.
• If learning_rate is too large: the updates overshoot and the minimum of the function is never reached.
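A minimal sketch in Python of all three cases on f(x) = x² (our own illustration; the rates and step count are assumed values):

```python
def gradient_descent(f_prime, x0, learning_rate, steps=20):
    """Minimize f by repeating step 2: x <- x - learning_rate * f'(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * f_prime(x)
    return x

f_prime = lambda x: 2 * x   # derivative of f(x) = x^2, minimum at x = 0

print(gradient_descent(f_prime, x0=5.0, learning_rate=0.01))  # small: still far from 0
print(gradient_descent(f_prime, x0=5.0, learning_rate=0.3))   # reasonable: close to 0
print(gradient_descent(f_prime, x0=5.0, learning_rate=1.1))   # too large: overshoots and diverges
```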


• https://aicurious.io/posts/linear-regression/


Classification
Naïve Bayes

Conditional Probability
• The conditional probability of A given B:
$p(A \mid B) = \dfrac{p(A \wedge B)}{p(B)}$, defined when $p(B) > 0$.


Bayes Rule
• Writing $p(A \wedge B)$ in two different ways:
$p(A \wedge B) = p(A \mid B)\, p(B) = p(B \mid A)\, p(A) \quad \Rightarrow \quad p(A \mid B) = \dfrac{p(B \mid A)\, p(A)}{p(B)}$
• $p(A \mid B)$ is called the posterior (the posterior distribution on A given B).

• $p(A)$ is called the prior.
• $p(B)$ is called the evidence.
• $p(B \mid A)$ is called the likelihood.


Bayes Rule

(Figure: a 2×2 joint-probability table.)
• The table divides the sample space into 4 mutually exclusive events.
• The probabilities in the margins are called marginals and are calculated by summing across the rows and the columns.
• Another form of Bayes rule, obtained by expanding the evidence $p(B)$ over the marginals:
$p(A \mid B) = \dfrac{p(B \mid A)\, p(A)}{p(B \mid A)\, p(A) + p(B \mid \neg A)\, p(\neg A)}$


Example of Using Bayes Rule

• A: the patient has cancer.
• B: the patient has a positive lab test.
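The slide's figures carry the actual numbers; as an illustration only, assume $p(A) = 0.008$, $p(B \mid A) = 0.98$, and $p(B \mid \neg A) = 0.03$ (these values are ours, not from the slides). Then

$p(A \mid B) = \dfrac{p(B \mid A)\,p(A)}{p(B \mid A)\,p(A) + p(B \mid \neg A)\,p(\neg A)} = \dfrac{0.98 \times 0.008}{0.98 \times 0.008 + 0.03 \times 0.992} \approx 0.21,$

so even a positive test leaves the posterior probability of cancer fairly low, because the prior is small.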


Why probabilities?
• Why are we bringing in a Bayesian framework here?
• Recall the classification framework:
o Given training data: $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in Y$
o Task: learn a classification function $f: \mathbb{R}^d \to Y$

• Learn a mapping from $x$ to $y$.
• We want to find this mapping $f(x) = y$ through $p(y \mid x)$!


Discriminative Algorithms
• Idea: model $p(y \mid x)$, the conditional distribution of $y$ given $x$.
• Discriminative algorithms find a decision boundary that separates the positive from the negative examples.
• To predict a new example, check on which side of the decision boundary it falls.
• Model $p(y \mid x)$ directly.


Generative Algorithms
• Idea: build a model of what positive examples look like, and a different model of what negative examples look like.
• To predict a new example, match it against each of the models and see which match is best.
• Model $p(x \mid y)$ and $p(y)$!
• Use Bayes rule to obtain:
$p(y \mid x) = \dfrac{p(x \mid y)\, p(y)}{p(x)}$
• To make a prediction:
$y = \arg\max_{y} \; p(y \mid x) = \arg\max_{y} \; p(x \mid y)\, p(y)$


Naive Bayes Classifier


• A probabilistic model.
• A highly practical method.
• Applied in domains such as natural-language text documents.
• "Naive" because of the strong independence assumption it makes (which is not realistic).
• A simple model.
• A strong method whose accuracy can be comparable to decision trees and neural networks in some cases.


Setting
• A training example is a pair $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i$ is a discrete label.
• There are $d$ features and $n$ examples.
• E.g. consider document classification: each example is a document, and each feature represents the presence or absence of a particular word in the document.
• We have a training set.
• A new example arrives with feature values $x_{new} = (a_1, a_2, \ldots, a_d)$.
• We want to predict the label $y_{new}$ of the new example.


Setting
• Use Bayes rule to obtain:
$p(y \mid a_1, \ldots, a_d) = \dfrac{p(a_1, \ldots, a_d \mid y)\, p(y)}{p(a_1, \ldots, a_d)}$
• Can we estimate these two terms from the training data?

1. $p(y)$ is easy to estimate: count the frequency with which each label $y$ occurs.
2. $p(a_1, a_2, \ldots, a_d \mid y)$ is not easy to estimate unless we have a very large sample. (We would need to see every combination of feature values many times to get reliable estimates.)

Naive Bayes Classifier


• Makes the simplifying assumption that the feature values are conditionally independent given the label.
• Given the label of the example, the probability of observing the conjunction $a_1, a_2, \ldots, a_d$ is the product of the probabilities for the individual features:
$p(a_1, a_2, \ldots, a_d \mid y) = \prod_{j=1}^{d} p(a_j \mid y)$
• Naive Bayes classifier:
$y_{new} = \arg\max_{y \in Y} \; p(y) \prod_{j=1}^{d} p(a_j \mid y)$
• Can we estimate these two terms from the training data? Yes!

Algorithm
• Learning: based on the frequency counts in the dataset:
1. Estimate all $p(y)$, $\forall y \in Y$.
2. Estimate all $p(a_j \mid y)$, $\forall y \in Y, \forall a_j$.
• Classification: for a new example, use:
$y_{new} = \arg\max_{y \in Y} \; p(y) \prod_{j=1}^{d} p(a_j \mid y)$
• Note: there is no model per se, and no hyperplane; we just count the frequencies of various data combinations within the training examples.
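A minimal sketch of this counting scheme in Python (our own illustration; the function names and the four-row toy dataset are assumptions, and a real implementation would add Laplace smoothing so unseen feature values do not zero out a whole product):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_tuple, label). Returns priors and a likelihood function."""
    n = len(examples)
    label_counts = Counter(label for _, label in examples)
    priors = {y: c / n for y, c in label_counts.items()}
    # feature_counts[(j, value, y)] = # examples with label y whose j-th feature is value
    feature_counts = defaultdict(int)
    for features, y in examples:
        for j, value in enumerate(features):
            feature_counts[(j, value, y)] += 1
    # Unsmoothed estimate of p(a_j | y); unseen combinations get probability 0.
    likelihood = lambda j, value, y: feature_counts[(j, value, y)] / label_counts[y]
    return priors, likelihood

def predict(priors, likelihood, features):
    """y_new = argmax_y p(y) * prod_j p(a_j | y)."""
    def score(y):
        s = priors[y]
        for j, value in enumerate(features):
            s *= likelihood(j, value, y)
        return s
    return max(priors, key=score)

data = [(("TATA", "SUV", "BLACK"), "YES"), (("FORD", "SEDAN", "WHITE"), "NO"),
        (("TATA", "SEDAN", "BLACK"), "YES"), (("FORD", "SUV", "WHITE"), "NO")]
priors, likelihood = train_naive_bayes(data)
print(predict(priors, likelihood, ("TATA", "SUV", "BLACK")))  # -> "YES"
```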


Example

(Figure: training-data table.)
• Can we predict the class of the new example?


Solution

• Conditional probabilities (from the frequency counts in the table):

$y_{new} = \text{yes}$


X = TATA – SUV – BLACK?


X = TATA – SUV – BLACK?


• The score for YES:
P(YES|X) ∝ P(TATA|YES) · P(SUV|YES) · P(BLACK|YES) · P(YES)
= 3/5 · 2/5 · 3/5 · 5/10 = 0.072
• The score for NO:
P(NO|X) ∝ P(TATA|NO) · P(SUV|NO) · P(BLACK|NO) · P(NO)
= 2/5 · 4/5 · 2/5 · 5/10 = 0.064
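These two scores are unnormalized: the shared evidence term P(X) has been dropped. Normalizing gives the actual posteriors, P(YES|X) = 0.072 / (0.072 + 0.064) ≈ 0.53 and P(NO|X) ≈ 0.47; since the YES score is larger, the predicted class for X is YES.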


• Single Parent, Young, Low. Buy a car?

• Computing the probabilities of the output labels, P(Y), from the data:
• P(No) = 4/10
• P(Yes) = 6/10
• For Single Parent, Young, and Low, we calculate the conditional probabilities with respect to both class labels as follows:
• P(Single Parent|Yes) = 1/6
• P(Single Parent|No) = 1/4
• P(Young|Yes) = 2/6
• P(Young|No) = 1/4
• P(Low|Yes) = 1/6
• P(Low|No) = 4/4


• Score for Yes: P(Single Parent|Yes) · P(Young|Yes) · P(Low|Yes) · P(Yes)
• Score for No: P(Single Parent|No) · P(Young|No) · P(Low|No) · P(No)
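Completing the arithmetic with the values above: the Yes score is 1/6 · 2/6 · 1/6 · 6/10 ≈ 0.0056, while the No score is 1/4 · 1/4 · 4/4 · 4/10 = 0.025. Since 0.025 > 0.0056, the predicted label is No: this customer is not predicted to buy a car.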



Classification
Decision trees

Tree Classifiers
• The terminology "tree" is graphic; note, however, that a decision tree is grown from the root downward.
• The idea is to send the examples down the tree, using the concept of information entropy.
• General steps to build a tree (a minimal code sketch follows the list):
1. Start with the root node, which holds all the examples.
2. Greedily select the next best feature to build the branches. The splitting criterion is node purity.
3. Assign the class majority to the leaves.
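A minimal sketch of these three steps in Python (our own illustration; `best_feature` is a hypothetical helper that implements the greedy splitting criterion, e.g. the information gain defined later in this section):

```python
from collections import Counter

def build_tree(examples, features, best_feature):
    """examples: non-empty list of (feature_dict, label).
    Returns a nested-dict tree, or a bare label at the leaves."""
    labels = [y for _, y in examples]
    # Pure node (or no features left to split on): leaf with the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice of the best feature to split on.
    f = best_feature(examples, features)
    tree = {f: {}}
    for value in {x[f] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[f] == value]
        remaining = [g for g in features if g != f]
        tree[f][value] = build_tree(subset, remaining, best_feature)
    return tree
```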

Classification
• Given a training set of example input–output pairs:
$(x_1, y_1), \ldots, (x_n, y_n)$
where $x_i \in \mathbb{R}^d$ and $y_i$ is discrete (categorical/qualitative), $y_i \in Y$.
• E.g. $Y = \{-1, +1\}$ or $Y = \{0, 1\}$
• Task: Learn a classification function:
$f: \mathbb{R}^d \to Y$
• In the case of tree classifiers:
1. No need for $x_i \in \mathbb{R}^d$, so no need to turn categorical features into numerical features.
2. The model is a tree.


Example

(Figures: the training-data table for the running example.)

Splitting criteria in C4.5


1. The central choice is selecting the next attribute to split on.
2. We want a criterion that measures the homogeneity or impurity of the examples in the nodes:
a. It quantifies the mix of classes at each node.
b. It is maximal if there are equal numbers of examples from each class.
c. It is minimal if the node is pure.
3. A measure with these properties, commonly used in Information Theory, is entropy; for a binary node with proportions $p_+$ and $p_-$ of positive and negative examples:
$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

Splitting criteria in C4.5

• In general, for $c$ classes:
$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
where $p_i$ is the proportion of examples of class $i$ in $S$.

Splitting criteria in C4.5


• Now each node has some entropy that measures the homogeneity in the node.
• How do we decide which attribute is best to split on, based on entropy?
• We use Information Gain, which measures the expected reduction in entropy caused by partitioning the examples according to an attribute $A$:
$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \dfrac{|S_v|}{|S|} H(S_v)$
where $S_v$ is the subset of examples of $S$ for which attribute $A$ takes value $v$.
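Both formulas translate directly into code; a minimal Python sketch (our own illustration, reusing the `(feature_dict, label)` example format and providing the `best_feature` helper assumed by the tree-building sketch above):

```python
import math
from collections import Counter

def entropy(examples):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions in S."""
    n = len(examples)
    counts = Counter(y for _, y in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, feature):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(examples)
    gain = entropy(examples)
    for value in {x[feature] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[feature] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def best_feature(examples, features):
    """Greedy choice used by build_tree: the attribute with maximum gain."""
    return max(features, key=lambda f: information_gain(examples, f))
```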



Back to the example

• At the first split, starting from the root, we choose the attribute that has the maximum gain.
• Then we restart the same process at each of the child nodes (if the node is not pure).

Pros & Cons


+ Intuitive, interpretable.
+ Can be turned into rules.
+ Well-suited for categorical data.
+ Simple to build.
BUT …
- Unstable (a change in one example may lead to a different tree).
- Univariate (splits on one attribute at a time; does not combine features).
- A choice at a node depends on the previous choices.
- Need to balance the data.
