Machine Learning
刘红岩
[email protected]
清华大学经济管理学院
2024/4/28
Basic Concepts of Classification & Prediction
Classification (分类)
Training dataset: each row has independent variables (自变量) and a dependent variable (因变量); here the dependent variable income takes continuous values.

name   age     years   income
Mike   <=30    3       8120.5
Mary   <=30    2       6208
Bill   31…40   4       3060
Jim    >40     7       7050
Dave   >40     6       10300
Anne   31…40   7       20060
…      …       …       …
Typical applications (prediction)
v Crowdfunding prediction
v Recommendation: rating prediction
v Sales prediction

CLASSIFICATION
Types of classification problem
! Let C = {c1, c2, …, cn} be the set of class labels associated with the training and testing data sets. Each instance in the dataset is denoted by (xi, yi):
! xi: the feature vector (attribute values) of the instance
! yi: the class label(s) from C associated with the instance
Classification—A Two-Step Process
vModel construction: describing a set of predetermined classes
§ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
§ The set of tuples used for model construction is the training set
§ The model is represented as classification rules, decision trees, or
mathematical formulae
Classification—A Two-Step Process
vModel usage: for classifying future or unknown objects
§ Estimate accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set must be independent of the training set; otherwise over-fitting will occur
§ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (see the sketch below)
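A minimal sketch of this two-step process in scikit-learn; the iris data, the 70/30 split, and the decision-tree learner are illustrative assumptions, not part of the slide:

# Step 1: model construction on a labeled training set
# Step 2: estimate accuracy on an independent test set, then classify new tuples
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # labeled data
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(train_X, train_y)  # model construction (training set)
test_pred = model.predict(test_X)                       # model usage on the held-out test set
print("test accuracy:", accuracy_score(test_y, test_pred))
# If the accuracy is acceptable, apply model.predict() to tuples with unknown labels.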
Illustrating Classification Task
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No
(Training Set → Learning algorithm → Model)

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
(Test Set → Apply Model)
Example of classification model: Decision Tree
[Figure: a training set with attributes Refund (categorical), Marital Status (categorical), Taxable Income (continuous) and class Cheat; a learning algorithm induces a decision tree on the splitting attributes, and the model is then applied to the test set.]
Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
  Refund?  Yes → NO;  No → MarSt
  MarSt?   Married → NO;  otherwise → TaxInc
  TaxInc?  < 80K → NO;  > 80K → YES

Following Refund = No and then Marital Status = Married, the test record reaches the leaf NO, so the model assigns Cheat = No.
Training, Validation and Testing
vProblem
§ What if all examples of a certain class are missing from the training set?
Training, Validation and Testing
vCross-validation (交叉验证)
• Every example is used for training and for testing in turn
• Folds: a partition of the data; e.g., folds = 3 gives threefold cross-validation
§ Ten-fold cross-validation (N = 10) is the standard choice, supported by experiments and theoretical evidence (a sketch follows below)
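A minimal sketch of ten-fold cross-validation; the iris data and the decision-tree learner are illustrative assumptions (the course's own cross_val_score example appears in the Python slides later):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # ten folds
scores = []
for train_idx, test_idx in kf.split(X, y):          # every example is tested exactly once
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("fold accuracies:", scores, "mean:", sum(scores) / len(scores))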
Typical techniques
K nearest Neighbors
[Figure: a query point $x_q$ among training samples labeled + and −; its class is decided by its k nearest neighbors.]

Euclidean distance:
$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$

Min-max normalization of each attribute A:
$v' = \dfrac{v - \min_A}{\max_A - \min_A}$
Example, K = 3
sepal_length sepal_width petal_length petal_width type
5.7 2.9 4.2 1.3 Iris-versicolor
6.2 2.9 4.3 1.3 Iris-versicolor
5.7 2.8 4.1 1.3 Iris-versicolor
6.3 3.3 6.0 2.5 Iris-virginica
5.8 2.7 5.1 1.9 Iris-virginica
7.1 3.0 5.9 2.1 Iris-virginica
5.1 3.8 1.6 0.2 Iris-setosa
4.6 3.2 1.4 0.2 Iris-setosa
5.3 3.7 1.5 0.2 Iris-setosa
vDecision Rule
§ Standard rule
  – Choose the majority class within the decision set (the k nearest neighbors)
§ Weighted decision rule
  – Weight the votes of the neighbors in the decision set, e.g. by distance to the query point $x_q$ (closer neighbors get larger weight)
[Figure: query point $x_q$ surrounded by + and − training samples.]
Nearest-Neighbor Classification
vDiscussion
$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$
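A minimal KNN sketch combining the pieces above (min-max normalization, Euclidean distance, K = 3, and a distance-weighted vote); the iris data, the 70/30 split, and the scikit-learn pipeline are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# min-max normalization + Euclidean distance + weighted decision rule (weight = 1/distance)
knn = make_pipeline(MinMaxScaler(),
                    KNeighborsClassifier(n_neighbors=3, metric='euclidean', weights='distance'))
knn.fit(train_X, train_y)
print("test accuracy:", knn.score(test_X, test_y))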
Output: A Decision Tree for “buys_computer”
[Figure: decision tree for "buys_computer", with leaf nodes labeled yes / no.]
Algorithm for Decision Tree Induction
vBasic algorithm (a greedy algorithm)
§ Tree is constructed in a top-down, recursive, divide-and-conquer manner
§ At start, all the training examples are at the root
§ Attributes are categorical (if continuous-valued, they are discretized in
advance)
§ Examples are partitioned recursively based on selected attributes
§ Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Algorithm for Decision Tree Induction
vConditions for stopping partitioning
§ All samples for a given node belong to the same class
§ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no

[Figure: the training tuples partitioned by age into the branches <=30, 31…40 and >40.]
Information Gain (ID3/C4.5)
n Select the attribute with the highest information gain
n S contains $f(c_i, S)$ tuples of class $c_i$, for $i = 1, \dots, m$
n Information entropy measures the information required to classify an arbitrary tuple:
  $I(S) = \sum_{i=1}^{m} -\frac{f(c_i,S)}{|S|} \log_2 \frac{f(c_i,S)}{|S|}$
n Expected information (entropy) of attribute A with values $\{a_1, a_2, \dots, a_v\}$, which partition S into subsets $S_1, \dots, S_v$:
  $E(A) = \sum_{j=1}^{v} \frac{|S_j|}{|S|} I(S_j)$
  $Gain(A) = I(S) - E(A)$

Example data:
A   B   class
a1  b1  c1
a1  b2  c1
a1  b1  c2
a2  b2  c2
a2  b3  c1
Information Gain (ID3/C4.5)
For the five-tuple example above (classes: 3 × c1, 2 × c2):

$I(S) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}$

$A = a_1$ (tuples with classes c1, c1, c2): $I(S_1) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}$
$A = a_2$ (tuples with classes c2, c1): $I(S_2) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}$

$E(A) = \frac{3}{5} I(S_1) + \frac{2}{5} I(S_2)$
$Gain(A) = I(S) - E(A)$
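A small pure-Python sketch that reproduces the computation above for the five-tuple A/B/class table; the helper names entropy and gain are mine:

from math import log2
from collections import Counter

data = [('a1','b1','c1'), ('a1','b2','c1'), ('a1','b1','c2'),
        ('a2','b2','c2'), ('a2','b3','c1')]

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

labels = [c for _, _, c in data]
I_S = entropy(labels)                               # I(S) = -3/5*log2(3/5) - 2/5*log2(2/5)

def gain(attr_index):
    E_A = 0.0
    for v in set(row[attr_index] for row in data):  # expected entropy after splitting on the attribute
        subset = [row[2] for row in data if row[attr_index] == v]
        E_A += len(subset) / len(data) * entropy(subset)
    return I_S - E_A                                # Gain(A) = I(S) - E(A)

print("Gain(A) =", round(gain(0), 3), " Gain(B) =", round(gain(1), 3))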
An Illustrative Example (3)
[Figure: the buys_computer tuples (RID, income, student, credit_rating, class) grouped by the branches of the split, used to continue the induction on each subset.]
Continuous-Value Attributes
vLet attribute A be a continuous-valued attribute
vMust determine the best split point for A
§ Sort the values of A in ascending order
§ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
  • $(a_i + a_{i+1})/2$ is the midpoint between the values $a_i$ and $a_{i+1}$
§ The point with the minimum expected information requirement for A is selected as the split-point for A
vSplit:
§ $D_1$ is the set of tuples in D satisfying $A \le$ split-point, and $D_2$ is the set of tuples in D satisfying $A >$ split-point
Continuous-Value Attributes (C4.5)
Data (sorted by Humidity):

Humidity  Windy   Class
96        FALSE   P
95        FALSE   N
90        TRUE    N
90        TRUE    P
85        FALSE   N
80        FALSE   P
80        FALSE   P
80        TRUE    N
78        FALSE   P
75        FALSE   P
70        TRUE    N
70        FALSE   P
70        TRUE    P
65        TRUE    P

vHumidity <= 92.5
§ $E(S_1) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
§ $E(S_2) = -\frac{4}{12}\log_2\frac{4}{12} - \frac{8}{12}\log_2\frac{8}{12} = 0.92$
§ $E(Humidity) = \frac{2}{14} \cdot 1 + \frac{12}{14} \cdot 0.92 = 0.93$
vHumidity <= 87.5: I = 0.92
vHumidity <= 82.5: I = 0.83
vHumidity <= 79:   I = 0.85
vHumidity <= 76.5: I = 0.89
vHumidity <= 72.5: I = 0.93
(a sketch of this split-point search follows)
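A pure-Python sketch of the midpoint search applied to the Humidity column above; the helper names are mine, and the code simply enumerates the candidate split points and picks the one with minimum expected information:

from math import log2
from collections import Counter

humidity = [96, 95, 90, 90, 85, 80, 80, 80, 78, 75, 70, 70, 70, 65]
label    = ['P','N','N','P','N','P','P','N','P','P','N','P','P','P']

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def expected_info(split):
    left  = [l for h, l in zip(humidity, label) if h <= split]
    right = [l for h, l in zip(humidity, label) if h >  split]
    n = len(label)
    return len(left)/n * entropy(left) + len(right)/n * entropy(right)

values = sorted(set(humidity))
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]  # midpoints of adjacent values
best = min(candidates, key=expected_info)                       # minimum expected information
print("best split point:", best, " E =", round(expected_info(best), 2))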
Gain Ratio for Attribute Selection (C4.5)
§ $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$
§ $GainRatio(A) = Gain(A) / SplitInfo_A(D)$
vThe attribute with the maximum gain ratio is selected as the splitting attribute
Gain Ratio for Attribute Selection (C4.5)
vEx. (training data: the buys_computer table shown above)
  $Gain(income) = 0.029$
  $Gain(student) = 0.151$
  $Gain(credit\_rating) = 0.048$

  $SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$

  $gain\_ratio(income) = 0.029 / 1.557 = 0.019$
Decision tree pruning (剪枝)
vOverfitting (过度拟合)
§ The model fits the idiosyncrasies of the training samples: accuracy on the training set is high, but generalization ability is usually low, so accuracy on unseen objects is poor
vUnderfitting (拟合不足)
§ Stopping the splitting of nodes too early also causes underfitting
Avoid overfitting (过度拟合)
[Figure: accuracy on the training set vs. on the test set as a function of tree complexity.]
Decision tree: python
vsklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
Decision tree: python
visualization
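The original slide shows a rendered tree image; a minimal way to reproduce such a figure, assuming the iris data and scikit-learn's plot_tree (export_graphviz is an alternative):

import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3).fit(iris.data, iris.target)

plt.figure(figsize=(10, 6))
tree.plot_tree(model, feature_names=iris.feature_names,
               class_names=iris.target_names, filled=True)   # draw the fitted tree
plt.show()
# Alternatively, tree.export_graphviz(model, ...) produces a DOT file for Graphviz.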
Decision tree: python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris_x, iris_y = load_iris(return_X_y=True)
model = tree.DecisionTreeClassifier()  # default criterion="gini"; no pruning
accuracy = cross_val_score(model, iris_x, iris_y, cv=10)  # default scoring='accuracy'
print("cross-validation accuracy:", accuracy, "\n avg(accuracy):", accuracy.mean())
Decision tree: python
# only for binary classification
# test_y: true labels; predict: model predictions on the test set
from sklearn import metrics
precision = metrics.precision_score(test_y, predict)
recall = metrics.recall_score(test_y, predict)
print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
Bagging, Random Forest
Bootstrap
v0.632 Bootstrap
§ A dataset with n instances is sampled n times with replacement → training set (may contain duplicate instances)
§ Instances that are never picked → test set
§ On each draw, the probability of an instance being picked: $1/n$
§ Probability of not being picked: $1 - 1/n$
§ Probability of an instance not being picked in any of the n draws:
  $\left(1 - \frac{1}{n}\right)^n \approx e^{-1} = 0.368$
Bagging
!Original learning set S
!Model generation:
§ Generate T bootstrap datasets from S: L1, L2, …, LT
§ For each training set Li, build a classification model Ci
!Prediction:
§ s: a new sample
§ Aggregation = majority vote among the T predictions Ci(s), i = 1, 2, …, T (see the sketch below)
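A minimal sketch of the bagging procedure itself; the iris data, T = 25 bootstrap replicates, and decision trees as base learners are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

T, rng = 25, np.random.default_rng(0)
models = []
for _ in range(T):                                     # T bootstrap replicates L1..LT
    idx = rng.integers(0, len(train_X), len(train_X))  # sample n instances with replacement
    models.append(DecisionTreeClassifier().fit(train_X[idx], train_y[idx]))

votes = np.array([m.predict(test_X) for m in models])  # one row of votes per model
majority = np.array([np.bincount(col).argmax() for col in votes.T])  # majority vote per test sample
print("bagged accuracy:", (majority == test_y).mean())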
Random Forest
vDiversifying by feature projection
vMajor steps:
§ Original dataset S with n instances of D features
§ Generate T bootstrap sample datasets with m instances
§ Build one tree per bootstrap sample dataset
§ Increase diversity via additional randomization: randomly pick a subset of d (d<<D)
features to split at each node
vPrediction
§ Vote among T trees
Ensemble classifiers
vBagging
vBoosting
§ AdaBoost
§ GBDT
§ LightGBM
§ XGBoost
Python: random forest
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# train_X, train_y: training features and labels prepared elsewhere
RF = RandomForestClassifier()
t_start = time.time()
rand_scores = cross_val_score(RF, train_X, train_y, cv=3, scoring='accuracy')
t_end = time.time()
t_diff = [t_end - t_start]
rand_mean = rand_scores.mean()
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

# default base_estimator is a decision tree
clf = BaggingClassifier(base_estimator=SVC(), n_estimators=10).fit(train_X, train_y)
predict_y = clf.predict(test_X)
Support Vector Machines (支持向量机)
vFeatures: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries. Used both for classification and prediction.
vApplications:
SVM—Support Vector Machines
vA new classification method for both linear and nonlinear data
vIt uses a nonlinear mapping to transform the original training data into a higher
dimension
vWith the new dimension, it searches for the linear optimal separating hyperplane (i.e.,
“decision boundary”)
vWith an appropriate nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane
vSVM finds this hyperplane using support vectors (“essential” training tuples) and
margins (defined by the support vectors)
Linear Classifiers
[Figure: two-class training data (legend: +1 and −1); several different straight lines can separate the two classes.]
Generalization
v Different separating lines classify unseen (non-training) samples differently
§ i.e., different lines have different generalization ability
SVM: linear
v Given m labeled samples $\{(\boldsymbol{x}_j, y_j)\}_{j=1}^{m}$,
v $\boldsymbol{x}_j \in R^n$ is the attribute (feature) vector of the j-th sample:
v $\boldsymbol{x}_j = (x_{j1}, x_{j2}, \dots, x_{jn})^T$
v $y_j \in \{+1, -1\}$ is the true class label of the j-th sample,
v where +1 denotes a positive example and −1 a negative example.
Linear Classifiers
v Key concepts:
• Support vector (支持向量)
• Margin (间隔)
[Figure: two-class data in the $(x_{j1}, x_{j2})$ plane with a separating line.]
Support vector
v Support vectors are the samples closest to the separating line (legend: +1 / −1)
v Support vectors are those datapoints that the margin pushes up against
Margin
v Margin: the distance from the support vectors to the separating hyperplane (here, a line) is called the margin
Maximum Margin
v It can be shown that the hyperplane with the maximum margin has the best generalization ability
v To maximize the margin to both the positive and the negative training samples, the separating hyperplane should be equidistant from the positive and the negative support vectors
Maximum Margin Linear Classifier
SVM: maximum margin linear classifier
vThe hyperplane in n-dimensional space is expressed as:
  $\boldsymbol{w}^T \boldsymbol{x} + b = 0$
vThe distance from sample $\boldsymbol{x}_j$ to the hyperplane is:
  $d_j = \dfrac{|\boldsymbol{w}^T \boldsymbol{x}_j + b|}{\|\boldsymbol{w}\|}$
  where $\|\boldsymbol{w}\|$ is the norm of the vector $\boldsymbol{w}$.
Maximum Margin
vFind a separating hyperplane $\boldsymbol{w}^T\boldsymbol{x} + b = 0$ (in 2-D: $w_1 x_1 + w_2 x_2 + b = 0$)
vthat maximizes the minimum distance of the samples to the hyperplane:
  $\max_{\boldsymbol{w},b} \; \min_{\boldsymbol{x}_j} \dfrac{|\boldsymbol{w}^T \boldsymbol{x}_j + b|}{\|\boldsymbol{w}\|}, \quad j = 1, 2, \dots, m$
Learning the Maximum Margin Classifier
vTo simplify the optimization, rescale so that for the training samples:
  $\boldsymbol{w}^T \boldsymbol{x}_j + b \ge +1$ if $y_j = +1$
  $\boldsymbol{w}^T \boldsymbol{x}_j + b \le -1$ if $y_j = -1$
  (recall $d_j = \dfrac{|\boldsymbol{w}^T\boldsymbol{x}_j + b|}{\|\boldsymbol{w}\|}$)
vMargin: $d = \dfrac{2}{\|\boldsymbol{w}\|}$
vOptimization problem:
  $\max_{\boldsymbol{w},b} \dfrac{2}{\|\boldsymbol{w}\|}$
  subject to: $y_j(\boldsymbol{w}^T \boldsymbol{x}_j + b) \ge 1$
vSolved as a constrained quadratic program (约束二次规划)
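A minimal sketch of the maximum-margin linear classifier with scikit-learn's SVC; the six toy points are an assumption. After fitting, w, b, the support vectors, and the margin 2/||w|| can be read from the model:

import numpy as np
from sklearn.svm import SVC

# two linearly separable classes in R^2 (toy data for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)        # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]             # hyperplane w^T x + b = 0

print("w =", w, " b =", b)
print("support vectors:\n", clf.support_vectors_)  # samples closest to the hyperplane
print("margin 2/||w|| =", 2 / np.linalg.norm(w))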
Support Vector Machines
vWhat if the problem is not linearly separable?

Feature Space
vData are not separable in one dimension
vNot separable if you insist on using a specific class of function

Blown Up Feature Space
vData are separable in the <x, x²> space
not linearly separable
vFor any sample with feature vector $\boldsymbol{x}$ in the original space, let $\varphi(\boldsymbol{x})$ denote its feature vector after mapping to the higher-dimensional feature space. The SVM model in the high-dimensional space is:
  $f(\boldsymbol{x}) = \boldsymbol{w}^T \varphi(\boldsymbol{x}) + b$
vThe model corresponding to the dual problem is:
  $f(\boldsymbol{x}) = \sum_{j=1}^{m} \alpha_j y_j \varphi(\boldsymbol{x}_j)^T \varphi(\boldsymbol{x}) + b$
vThe dual optimization problem is:
  $\boldsymbol{\alpha} = \arg\max_{\boldsymbol{\alpha}} \sum_{j=1}^{m} \alpha_j - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \varphi(\boldsymbol{x}_i)^T \varphi(\boldsymbol{x}_j)$
  $s.t. \; \sum_{j=1}^{m} \alpha_j y_j = 0, \; \alpha_j \ge 0, \; j = 1, 2, \dots, m$
not linearly separable
vComputing $\varphi(\boldsymbol{x}_i)^T \varphi(\boldsymbol{x}_j)$ requires first mapping $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ to the high-dimensional space and then taking their inner product there. To simplify the computation, this is replaced by a kernel function $\mathcal{K}(\cdot,\cdot)$:
  $\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \varphi(\boldsymbol{x}_i)^T \varphi(\boldsymbol{x}_j)$
vUsing kernel values computed in the original space, the SVM model becomes:
  $f(\boldsymbol{x}) = \sum_{j=1}^{m} \alpha_j y_j \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}_j) + b$
vand the dual optimization problem is:
  $\boldsymbol{\alpha} = \arg\max_{\boldsymbol{\alpha}} \sum_{j=1}^{m} \alpha_j - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)$
  $s.t. \; \sum_{j=1}^{m} \alpha_j y_j = 0, \; \alpha_j \ge 0, \; j = 1, 2, \dots, m$
Common kernel functions (核函数)
vPolynomial kernel: $\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) = (\boldsymbol{x}_i^T \boldsymbol{x}_j)^d, \; d \ge 1$
vGaussian kernel: $\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp(-\|\boldsymbol{x}_i - \boldsymbol{x}_j\|^2 / 2\sigma^2)$
vLinear kernel: $\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \boldsymbol{x}_i^T \boldsymbol{x}_j$
Python: SVM
vfrom sklearn.svm import SVC
vsvm_model = SVC()
class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0,
shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None,
verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
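A short sketch comparing the kernels listed above through the kernel parameter of SVC; the iris data, gamma='scale', and five-fold cross-validation are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for kernel in ['linear', 'poly', 'rbf']:        # linear, polynomial, Gaussian (RBF) kernels
    model = SVC(kernel=kernel, degree=3, gamma='scale')
    scores = cross_val_score(model, X, y, cv=5)
    print(kernel, "mean accuracy:", round(scores.mean(), 3))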
Confusion matrix (预测类别 × 真实类别):

Predicted \ Actual    Yes                         No
Yes                   TP (true positives) = 2     FP (false positives) = 3
No                    FN (false negatives) = 3    TN (true negatives) = 6
$TP\ rate = \dfrac{TP}{TP + FN}$        $FP\ rate = \dfrac{FP}{FP + TN}$

$precision = \dfrac{TP}{TP + FP}$       $recall = \dfrac{TP}{TP + FN}$

$F\text{-}measure = \dfrac{2 \times precision \times recall}{precision + recall} = \dfrac{2 \times TP}{2 \times TP + FP + FN}$
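Plugging the confusion matrix above (TP = 2, FP = 3, FN = 3, TN = 6) into these formulas as a quick check (plain Python arithmetic):

TP, FP, FN, TN = 2, 3, 3, 6                      # values from the confusion matrix above

tpr       = TP / (TP + FN)                       # 2/5  = 0.4   (recall / TP rate)
fpr       = FP / (FP + TN)                       # 3/9  ≈ 0.33
precision = TP / (TP + FP)                       # 2/5  = 0.4
f_measure = 2 * TP / (2 * TP + FP + FN)          # 4/10 = 0.4
print(tpr, fpr, precision, f_measure)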
ROC curve
vROC: receiver operating characteristic (接收者操作特性)
§ Y axis: TPR (true positive rate)
  • the number of true positives among the samples predicted as positive, as a percentage of the total number of positives
§ X axis: FPR (false positive rate)
  • the number of negatives among the samples predicted as positive, as a percentage of the total number of negatives
ROC curve
Samples ranked by predicted probability of the positive class:

Rank  probability  class label        Rank  probability  class label
1     0.95         Y                  11    0.77         N
2     0.93         Y                  12    0.76         Y
3     0.93         N                  13    0.73         N
4     0.88         Y                  14    0.65         N
5     0.86         Y                  15    0.63         Y
6     0.85         N                  16    0.58         N
7     0.82         Y                  17    0.56         N
8     0.80         Y                  18    0.49         N
9     0.80         N                  19    0.48         Y
10    0.79         Y                  20    0.47         N

$TP\ rate = \dfrac{TP}{TP + FN}$,  $FP\ rate = \dfrac{FP}{FP + TN}$

Points on the curve:
FPR (% negative)   TPR (% positive)
20%                4/10
40%                7/10
60%                8/10
80%                9/10
100%               10/10

[Figure: ROC curve of the model vs. the diagonal of a random classifier.]
ROC curve, AUC
vAUC: the area under the ROC curve (computed here from the same ranked table shown above).
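A minimal sketch that draws the ROC curve and computes AUC from the 20 ranked predictions above, using scikit-learn's roc_curve and roc_auc_score (labels entered by hand, Y → 1, N → 0):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# ranked scores and true labels from the 20-row example above
scores = [0.95, 0.93, 0.93, 0.88, 0.86, 0.85, 0.82, 0.80, 0.80, 0.79,
          0.77, 0.76, 0.73, 0.65, 0.63, 0.58, 0.56, 0.49, 0.48, 0.47]
labels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC =", roc_auc_score(labels, scores))

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], '--', label='random')   # diagonal = random classifier
plt.xlabel('FPR'); plt.ylabel('TPR'); plt.legend(); plt.show()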
Precision, recall
vBinary classification
# only for binary classification
precision = metrics.precision_score(test_y, predict)
recall = metrics.recall_score(test_y, predict)
print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
Homework: individual
1. For the Titanic data set, with the attribute survived as the class attribute:
(1) Split the data into 70% training and 30% test sets
(2) Use KNN, decision tree, and SVM to build classification models; report their performance (accuracy, precision, recall) on the test set in a dataframe with four columns: classifier, accuracy, precision, recall; and draw the ROC curve for each classifier.
Group homework
1. Describe the problem you want to solve based on the Ad click data. Use supervised learning models such as KNN, decision tree, and SVM to solve the problem.