
Machine Learning (智能技术之机器学习)
刘红岩
[email protected]
清华大学经济管理学院 Tsinghua University School of Economics and Management


Syllabus
v Machine learning
§ Data preprocessing (数据预处理)
§ Supervised learning (有监督学习)
• Classification (分类)
• Numeric prediction (数值预测)
§ Unsupervised learning (无监督学习)

Outline
v Basic concepts
v Algorithm design issues
v Typical techniques
v Model evaluation
v Application case
v Summary

Basic Concepts of Classification & Prediction (基本概念)
Prediction
! Predict categorical class labels (discrete or nominal): classification
! Model continuous-valued functions to predict unknown or missing values: numeric prediction
! Both are supervised learning methods
§ Training dataset
§ Test dataset
Classification (分类)
Training dataset

Customer ID  Age    Gender  Annual income (10k)  Marital status  Repays on time
1            <30    F       86                   Married         No
2            <30    M       65                   Single          No
3            <30    M       90                   Divorced        No
4            <30    F       75                   Married         No
5            30-50  F       82                   Married         Yes
6            30-50  M       91                   Married         Yes
7            30-50  F       200                  Divorced        Yes
8            30-50  F       40                   Single          No
9            30-50  M       20                   Divorced        No
10           >50    F       96                   Divorced        No
11           >50    F       80                   Single          No
12           >50    M       50                   Single          Yes
13           >50    F       80                   Divorced        No
14           >50    M       92                   Divorced        ?
Examples of Classification Task
! Classifying credit card transactions as legitimate or fraudulent
! Predicting whether a loan applicant will make the monthly repayment
! Predicting whether a user will click an advertisement
! Predicting tumor cells as benign or malignant
! Categorizing news stories as finance, weather, entertainment, sports, etc.
! Disease diagnosis: glaucoma, cataract, and/or trachoma
Numeric Prediction

Independent variables (自变量): age, years; dependent variable (因变量): income (a continuous value)

name  age    years  income
Mike  <=30   3      8120.5
Mary  <=30   2      6208
Bill  31…40  4      3060
Jim   >40    7      7050
Dave  >40    6      10300
Anne  31…40  7      20060
…     …      …      …
Typical applications
v Crowdfunding prediction
v Recommendation: rating prediction
v Sales prediction

Prediction:
CLASSIFICATION
Types of classification problem
! Let C = {c1, c2, …, cn} be the set of class labels associated with the training and
test data sets. Each instance in the dataset is denoted by (xi, yi)
! xi: feature vector
! yi: the class label(s) associated with the instance
! Binary vs. multi-class classification (single-label case, |yi| = 1):
§ If |C| = 2: binary classification
§ If |C| > 2: multi-class classification
Types of classification problem

Customer ID  Age    Gender  Annual income (10k)  Marital status  Repays on time
1            <30    F       86                   Married         No
2            <30    M       65                   Single          No
3            <30    M       90                   Divorced        No
4            <30    F       75                   Married         No
5            30-50  F       82                   Married         Yes
Classification
vEager
§ Two steps
• Model construction
• Model usage (model evaluation)
vLazy
§ No model constructed beforehand

Classification—A Two-Step Process
v Model construction: describing a set of predetermined classes
§ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
§ The set of tuples used for model construction is the training set
§ The model is represented as classification rules, decision trees, or
mathematical formulae
v Model usage: classifying future or unknown objects
§ Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result
from the model
• The accuracy rate is the percentage of test set samples that are correctly
classified by the model
• The test set must be independent of the training set, otherwise over-fitting will
occur
§ If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Illustrating Classification Task

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a learning algorithm is applied to the training set to learn a model.

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test set to predict the unknown class labels.
Example of a classification model: Decision Tree

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes -> NO
Refund = No  -> test MarSt
  MarSt = Married            -> NO
  MarSt = Single or Divorced -> test TaxInc
    TaxInc < 80K -> NO
    TaxInc > 80K -> YES
Decision Tree Classification Task
The same two-step process applies with a decision tree as the model: a tree induction
algorithm is run on the training set to learn a decision tree (induction), and the learned
tree is then applied to the test set to predict the unknown class labels (deduction).
Apply Model to Test Data

Test record:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:
Refund = No     -> go to the MarSt node
MarSt = Married -> reach the leaf NO
Assign Cheat = "No" to the test record.
Training, Validation and Testing

v Training & testing
§ Training set (2/3), test set (1/3)
§ Accuracy rate vs. error rate

v Training, validation & testing
§ Training set, validation set, test set

v Problem
§ What if all examples of a certain class were left out of the training set?
Training, Validation and Testing

v Cross-validation (交叉验证)
• Every example is used for training and for testing in turn
• Folds: partitions of the data
  e.g., folds = 3: threefold cross-validation
§ Ten-fold cross-validation
• N = 10 is the standard choice
• supported by experiments and theoretical evidence
A small fold-splitting sketch follows.
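A minimal sketch of how cross-validation partitions the data into folds, using scikit-learn's KFold on a toy dataset (the 12-example array is purely illustrative):

import numpy as np
from sklearn.model_selection import KFold

data = np.arange(12)                                   # a toy dataset of 12 examples
kf = KFold(n_splits=3, shuffle=True, random_state=0)   # threefold cross-validation

for fold, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    # each example appears in exactly one test fold and in the training folds of the others
    print('fold %d: train=%s test=%s' % (fold, data[train_idx], data[test_idx]))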
Typical techniques (典型技术)
Classification
v Eager
§ Decision tree, Naïve Bayes (NB), SVM, logistic regression, neural networks (deep learning), …
§ Ensembles
v Lazy
§ K nearest neighbors (KNN)
K nearest Neighbors

The k-Nearest Neighbor Algorithm
v All instances correspond to points in the n-dimensional space.
v The nearest neighbors are defined in terms of Euclidean distance:

  d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )

v The target function can be discrete- or real-valued.
v For a discrete-valued target, k-NN returns the most common class among the k
training examples nearest to the query point x_q.
v Attributes are first rescaled with min-max normalization:

  v' = (v − min_A) / (max_A − min_A)

(Figure: a query point x_q surrounded by + and − training points.)
Example (k = 3)
sepal_length sepal_width petal_length petal_width type
5.7 2.9 4.2 1.3 Iris-versicolor
6.2 2.9 4.3 1.3 Iris-versicolor
5.7 2.8 4.1 1.3 Iris-versicolor
6.3 3.3 6.0 2.5 Iris-virginica
5.8 2.7 5.1 1.9 Iris-virginica
7.1 3.0 5.9 2.1 Iris-virginica
5.1 3.8 1.6 0.2 Iris-setosa
4.6 3.2 1.4 0.2 Iris-setosa
5.3 3.7 1.5 0.2 Iris-setosa

5.0 3.3 1.4 0.2 Iris-setosa


5.1 2.5 3.0 1.1 Iris-versicolor
6.3 2.9 5.6 1.8 Iris-virginica
2024/4/28
Nearest-Neighbor Classification

v Decision rule
§ Standard rule
– choose the majority class within the decision set (the k nearest neighbors of x_q)
§ Weighted decision rule
– weight the votes of the decision set, e.g., by the inverse of the distance to x_q
Nearest-Neighbor Classification
v Discussion
v High classification accuracy
• in many applications
v Incremental
• the classifier can easily be adapted to new training objects
v Can also be used for numeric prediction
v Applying the classifier is expensive
• requires a k-nearest-neighbor query against the training set for every prediction
v Does not generate explicit knowledge about the classes
KNeighborsClassifier
v class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform',
  algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
• n_neighbors: int, default=5
• weights: {'uniform', 'distance'}, default='uniform'
  • 'uniform': uniform weights; all points in each neighborhood are weighted equally
  • 'distance': weight points by the inverse of their distance
• p: int, default=2; power parameter for the Minkowski metric
• metric: str or callable, default='minkowski'
Metric:
v Minkowski distance (闵可夫斯基距离):

  d(i, j) = ( |x_i1 − x_j1|^p + |x_i2 − x_j2|^p + … + |x_in − x_jn|^p )^(1/p)

v KNeighborsClassifier(n_neighbors=3, p=2, metric='minkowski')
v KNeighborsClassifier(n_neighbors=3, p=1, metric='minkowski')
v p = 2: Euclidean distance
v p = 1: Manhattan distance
Python: KNN
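The code on the original slide is a screenshot that did not survive extraction; below is a minimal sketch of what such a KNN example could look like, assuming the Iris data from scikit-learn (the variable names iris_x, train_X, test_y, etc. are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the Iris data (four numeric features, three classes)
iris = load_iris()
iris_x, iris_y = iris.data, iris.target

# Hold out 30% of the data as the test set
train_X, test_X, train_y, test_y = train_test_split(iris_x, iris_y, test_size=0.3, random_state=0)

# k = 3 neighbors, Euclidean distance (p=2 with the Minkowski metric)
knn = KNeighborsClassifier(n_neighbors=3, p=2, metric='minkowski')
knn.fit(train_X, train_y)

predict = knn.predict(test_X)
print('accuracy: %.2f%%' % (100 * metrics.accuracy_score(test_y, predict)))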
Decision Tree

Decision Tree Induction: Training
Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

2024/4/28
Output: A Decision Tree for "buys_computer"

• Internal nodes are tests on attribute values
• There is one branch for each value of the attribute
• Leaves specify the class labels

age?
  <=30   -> student?        (no -> no, yes -> yes)
  30..40 -> yes
  >40    -> credit rating?  (excellent -> no, fair -> yes)
Algorithm for Decision Tree Induction
vBasic algorithm (a greedy algorithm)
§ Tree is constructed in a top-down recursive divide-and-conquer
manner(自顶向下递归的分治方式构造)
§ At start, all the training examples are at the root
§ Attributes are categorical (if continuous-valued, they are discretized in
advance)
§ Examples are partitioned recursively based on selected attributes
§ Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)

2024/4/28
Algorithm for Decision Tree Induction
vConditions for stopping partitioning
§ All samples for a given node belong to the same class
§ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf

2024/4/28
Example: splitting on age (buys_computer data)

age? splits the training set into three partitions:

age <= 30:
RID  income  student  credit_rating  class
1    high    no       fair           no
2    high    no       excellent      no
8    medium  no       fair           no
9    low     yes      fair           yes
11   medium  yes      excellent      yes

age 31…40:
RID  income  student  credit_rating  class
3    high    no       fair           yes
7    low     yes      excellent      yes
12   medium  no       excellent      yes
13   high    yes      fair           yes

age > 40:
RID  income  student  credit_rating  class
4    medium  no       fair           yes
5    low     yes      fair           yes
6    low     yes      excellent      no
10   medium  yes      fair           yes
14   medium  no       excellent      no
Picking the Root Attribute

• Consider data with two Boolean attributes (A, B):

A  B  class  count
0  0  -      50
0  1  -      50
1  0  +      0
1  1  +      100

• What should be the first attribute we select?
• Splitting on A: we get purely labeled nodes (A=1 -> +, A=0 -> -).
• Splitting on B: we do not get purely labeled nodes.
• What if we also have 3 examples of <(A=1, B=0), ->? Then splitting on A no longer gives
purely labeled nodes either, and choosing the attribute needs a quantitative measure.
Picking the Root Attribute
• The goal is to have the resulting decision tree be as small as possible (Occam's Razor).
• The main decision in the algorithm is the selection of the next attribute to condition on.
• We want attributes that split the examples into sets that are relatively pure in one label;
this way we are closer to a leaf node.
• The most popular heuristic is based on information gain, which originated with Quinlan's ID3 system.
Information Gain (ID3/C4.5)
n Select the attribute with the highest information gain
n S contains f(ci, S) tuples of class ci, for i = 1, …, m
n Information entropy measures the information required to classify an arbitrary tuple:

  I(S) = − Σ_{i=1}^{m} ( f(ci, S) / |S| ) · log2( f(ci, S) / |S| )

n Entropy after splitting on attribute A with values {a1, a2, …, av}, where Sj is the subset of S with A = aj:

  E(A) = Σ_{j=1}^{v} ( |Sj| / |S| ) · I(Sj)

n Information gain: the information gained by branching on attribute A:

  Gain(A) = I(S) − E(A)

Example data:
A   B   class
a1  b1  c1
a1  b2  c1
a1  b1  c2
a2  b2  c2
a2  b3  c1
Information Gain (ID3/C4.5): worked example

For the example data above (3 tuples of class c1, 2 of class c2):

  I(S) = −(3/5)·log2(3/5) − (2/5)·log2(2/5)

  A = a1 (3 tuples: 2 of c1, 1 of c2):  I(S1) = −(2/3)·log2(2/3) − (1/3)·log2(1/3)
  A = a2 (2 tuples: 1 of c1, 1 of c2):  I(S2) = −(1/2)·log2(1/2) − (1/2)·log2(1/2) = 1

  E(A) = (3/5)·I(S1) + (2/5)·I(S2)

  Gain(A) = I(S) − E(A)
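A small sketch of this computation in Python (the column layout follows the example table above; entropy and gain are hypothetical helper names):

import math
from collections import Counter

def entropy(labels):
    # I(S): information entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Example table: (A, B, class)
data = [('a1', 'b1', 'c1'), ('a1', 'b2', 'c1'), ('a1', 'b1', 'c2'),
        ('a2', 'b2', 'c2'), ('a2', 'b3', 'c1')]
labels = [row[2] for row in data]

def gain(attr_index):
    # Gain(A) = I(S) - E(A) for the attribute stored in the given column
    values = set(row[attr_index] for row in data)
    e_attr = sum((len(subset) / len(data)) * entropy([r[2] for r in subset])
                 for subset in ([r for r in data if r[attr_index] == v] for v in values))
    return entropy(labels) - e_attr

print('Gain(A) = %.3f' % gain(0))
print('Gain(B) = %.3f' % gain(1))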
Entropy
(Figure: example class distributions of − and + tuples and their entropy values; entropy is 0 for a
pure set and reaches its maximum of 1 for an evenly mixed set.)
An Illustrative Example (3)

On the buys_computer training data:

  Gain(S, age)           = 0.246
  Gain(S, student)       = 0.151
  Gain(S, credit_rating) = 0.048
  Gain(S, income)        = 0.029

age has the highest information gain, so it is selected as the root attribute. Splitting on
age produces the three partitions shown earlier (age <= 30, 31…40, > 40); the 31…40 partition
is pure (all buys_computer = yes) and becomes a leaf, while the other two partitions are split further.
Extracting Classification Rules from Trees
v Represent the knowledge in the form of IF-THEN rules: easier for humans to understand
v Example (one rule for each path from the root to a leaf):

IF age = "<=30" AND student = "no"  THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
Continuous-Value Attributes
v Let attribute A be a continuous-valued attribute
v Must determine the best split point for A
§ Sort the values of A in ascending order
§ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
• (ai + ai+1)/2 is the midpoint between the values ai and ai+1
§ The point with the minimum expected information requirement for A is selected as the split point for A
v Split:
§ D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
Continuous-Value Attributes (C4.5): example

Humidity  Windy  Class
96        FALSE  P
95        FALSE  N
90        TRUE   N
90        TRUE   P
85        FALSE  N
80        FALSE  P
80        FALSE  P
80        TRUE   N
78        FALSE  P
75        FALSE  P
70        TRUE   N
70        FALSE  P
70        TRUE   P
65        TRUE   P

v Humidity <= 92.5:
§ E(S1) = −(1/2)·log2(1/2) − (1/2)·log2(1/2) = 1          (the 2 tuples with Humidity > 92.5)
§ E(S2) = −(4/12)·log2(4/12) − (8/12)·log2(8/12) = 0.92   (the 12 tuples with Humidity <= 92.5)
§ E(Humidity) = (2/14)·1 + (12/14)·0.92 = 0.93
v Humidity <= 87.5: I = 0.92
v Humidity <= 82.5: I = 0.83
v Humidity <= 79:   I = 0.85
v Humidity <= 76.5: I = 0.89
v Humidity <= 72.5: I = 0.93
The candidate with the minimum expected information (here 82.5) is chosen; a brute-force sketch follows.
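A minimal sketch of this split-point search, assuming the humidity data above (entropy is the same hypothetical helper as in the earlier information-gain sketch):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# (humidity, class) pairs from the example table
data = [(96, 'P'), (95, 'N'), (90, 'N'), (90, 'P'), (85, 'N'), (80, 'P'), (80, 'P'),
        (80, 'N'), (78, 'P'), (75, 'P'), (70, 'N'), (70, 'P'), (70, 'P'), (65, 'P')]

values = sorted(set(v for v, _ in data))
# Candidate split points: midpoints between adjacent distinct values
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

def expected_info(split):
    # Expected information for the two partitions induced by "Humidity <= split"
    left = [c for v, c in data if v <= split]
    right = [c for v, c in data if v > split]
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)

best = min(candidates, key=expected_info)
print('best split point: %.1f  (E = %.3f)' % (best, expected_info(best)))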
Gain Ratio for Attribute Selection (C4.5)

v The information gain measure is biased towards attributes with a large number of values
v C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):

  SplitInfo_A(D) = − Σ_{j=1}^{v} ( |Dj| / |D| ) · log2( |Dj| / |D| )

§ GainRatio(A) = Gain(A) / SplitInfo_A(D)
v The attribute with the maximum gain ratio is selected as the splitting attribute
Gain Ratio for Attribute Selection (C4.5): example

On the buys_computer data (income has 4 "high", 6 "medium" and 4 "low" tuples):

  Gain(income)        = 0.029
  Gain(student)       = 0.151
  Gain(credit_rating) = 0.048

  SplitInfo_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557

  gain_ratio(income) = 0.029 / 1.557 = 0.019

(A small sketch of this normalization follows.)
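A minimal sketch of the gain-ratio normalization, assuming the 4/6/4 split of income above (split_info is a hypothetical helper name):

import math

def split_info(partition_sizes):
    # SplitInfo_A(D) for the partition that attribute A induces on D
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes)

gain_income = 0.029                        # Gain(income) from the slide above
si = split_info([4, 6, 4])                 # income: 4 high, 6 medium, 4 low
print('SplitInfo(income) = %.3f' % si)     # about 1.557
print('GainRatio(income) = %.3f' % (gain_income / si))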
Pruning a decision tree (决策树的剪枝)

v Overfitting (过度拟合)
§ The tree fits the particular characteristics of the training samples too closely: accuracy on the
training set is high, but the generalization ability is usually low, so accuracy on objects of
unknown class is poor
v Underfitting (拟合不足)
§ Stopping the splitting of nodes too early also leads to underfitting
Avoid overfitting (过度拟合)

v Overfitting: an induced tree may overfit the training data
§ Too many branches, some of which may reflect anomalies due to noise or outliers
§ Poor accuracy for unseen samples

(Figure: as tree complexity grows, accuracy on the training set keeps rising while accuracy
on the test set eventually drops.)
Decision tree: python
v sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
  min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None,
  random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
  class_weight=None, presort=False)
Decision tree: python
visualization
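The visualization code on the original slide is a screenshot; a minimal sketch using sklearn's plot_tree is shown below (matplotlib and the Iris data are assumptions, not from the slide):

import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
model = tree.DecisionTreeClassifier(criterion='gini')   # default criterion, no pruning
model.fit(iris.data, iris.target)

# Draw the learned tree with feature and class names
plt.figure(figsize=(12, 8))
tree.plot_tree(model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()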
Decision tree: python

from sklearn import tree
from sklearn.model_selection import cross_val_score

model = tree.DecisionTreeClassifier()  # default criterion="gini"; no pruning
accuracy = cross_val_score(model, iris_x, iris_y, cv=10)  # default scoring='accuracy'
print("cross-validation accuracy:", accuracy, "\n avg(accuracy):", accuracy.mean())
Decision tree: python

from sklearn import metrics

# only for binary classification
precision = metrics.precision_score(test_y, predict)
recall = metrics.recall_score(test_y, predict)
print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
Bagging, Random Forest



Bagging
v Short for Bootstrap Aggregating
v Data randomness for diversity
v A technique of ensemble learning
§ to avoid over-fitting
§ to improve stability and accuracy
v Two steps
§ Bootstrap sample sets
§ Aggregation
Bootstrap
v Bootstrap (自助法)
§ Sampling with replacement: the training set is drawn with replacement, and the records
that are never drawn become part of the test set
§ Dataset S with n instances
§ Sampled m times → training set (may contain duplicate samples)
§ Instances which are not picked → test set
Bootstrap
v 0.632 bootstrap
§ A dataset with n instances is sampled n times → training set (may contain duplicate samples)
§ Instances which are not picked → test set
§ On each draw, the probability of a given instance being picked is 1/n,
§ so the probability of not being picked is 1 − 1/n,
§ and the probability of an instance never being picked in n draws is

  (1 − 1/n)^n ≈ e^(−1) = 0.368
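A quick numeric check of this limit (a sketch, not part of the slides):

import math

for n in (10, 100, 1000, 10000):
    p_never_picked = (1 - 1 / n) ** n     # probability an instance is missed in all n draws
    print(n, round(p_never_picked, 4))

print('e^-1 =', round(math.exp(-1), 4))   # about 0.368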
Bagging
!Original learning set S
!Model Generation:
§ Generate T bootstrap datasets from S: L1, L2, … LT
§ For each training set Li , build a classification model Ci
! Prediction:
§ s: a new sample
§ Aggregation = majority vote among the T predictions/votes Ci(s), i=1,
2, …, T
Random Forest
vDiversifying by feature projection
vMajor steps:
§ Original dataset S with n instances of D features
§ Generate T bootstrap sample datasets with m instances
§ Build one tree per bootstrap sample dataset
§ Increase diversity via additional randomization: randomly pick a subset of d (d<<D)
features to split at each node
vPrediction
§ Vote among T trees
Ensemble classifiers
vBagging
vBoosting
§ Adaboost
§ GBDT
§ LightGBM
§ XGBOOST
Python: random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

RF = RandomForestClassifier()
rand_scores = cross_val_score(RF, train_X, train_y, cv=3, scoring='accuracy')
rand_mean = rand_scores.mean()
# t_diff.append((t_end - t_start))  # timing code from the original notebook; t_start/t_end are not shown here
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

# the default base_estimator is a decision tree
clf = BaggingClassifier(base_estimator=SVC(), n_estimators=10).fit(train_X, train_y)
predict_y = clf.predict(test_X)
Support Vector Machines (支持向量机)
SVM—History and Applications
v Boser, Guyon and Vapnik (1992)—groundwork from Vapnik & Chervonenkis' statistical
learning theory in the 1960s
v Features: training can be slow, but accuracy is high owing to their ability to model
complex nonlinear decision boundaries. Used both for classification and prediction
v Applications:
§ handwritten digit recognition, object recognition, speaker identification, text classification
SVM—Support Vector Machines
vA new classification method for both linear and nonlinear data
vIt uses a nonlinear mapping to transform the original training data into a higher
dimension
vWith the new dimension, it searches for the linear optimal separating hyperplane (i.e.,
“decision boundary”)
vWith an appropriate nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane
vSVM finds this hyperplane using support vectors (“essential” training tuples) and
margins (defined by the support vectors)

2024/4/28
Linear Classifiers

(Figure: a 2-D dataset with two classes of points, denoted +1 and −1.)

How would you classify this data? Many different straight lines separate the two classes
perfectly. Any of these would be fine, but which is best?
Generalization
v Different separating lines classify unseen (non-training) samples differently,
§ i.e., different lines have different generalization (泛化) ability
(Figure: two candidate separating lines drawn over the training points.)
SVM: linear
v Given m training samples with known class labels: {(x_j, y_j)}, j = 1, …, m
v x_j ∈ R^n is the attribute vector of the j-th sample:
  x_j = (x_j1, x_j2, …, x_jn)^T
v y_j ∈ {+1, −1} is the true class label of the j-th sample,
v where +1 denotes a positive example and −1 a negative example.
Linear Classifiers
v Concepts:
• support vector (支持向量)
• margin (间隔)
Support vector
v Support vectors (支持向量) are the samples that lie closest to the separating line;
they are the data points that the margin pushes up against.
Margin
v Margin (间隔): the distance d from the support vectors to the separating hyperplane (here, a line).
v Define the margin of a linear classifier as the width by which the boundary could be increased
before hitting a data point.
Maximum Margin
v It has been shown that the hyperplane with the maximum margin has the best generalization ability.
v To maximize the margin to both the positive and the negative training samples, the separating
hyperplane should be at the same distance from the positive and the negative support vectors.
Maximum Margin Linear Classifier
• The maximum margin linear classifier is the linear classifier with the maximum margin.
• This is the simplest kind of SVM (called an LSVM, linear SVM).
SVM: maximum margin linear classifier
v A hyperplane in n-dimensional space is expressed as:

  wᵀx + b = 0

  where w = (w1, w2, …, wn)ᵀ is the normal vector of the hyperplane.

v For a sample x_i of unknown class, its class y_i is decided by:

  y_i = +1  if  wᵀx_i + b > 0
  y_i = −1  if  wᵀx_i + b ≤ 0
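A minimal sketch of this decision rule with NumPy (the values of w and b are illustrative, not from the slides):

import numpy as np

w = np.array([2.0, -1.0])   # normal vector of the hyperplane (illustrative)
b = -0.5                    # intercept (illustrative)

def classify(x):
    # return +1 if w^T x + b > 0, otherwise -1
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([1.0, 0.5])))   # +1
print(classify(np.array([0.0, 1.0])))   # -1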
SVM: maximum margin linear classifier
v The distance d_j from any point x_j = (x_j1, x_j2, …, x_jn)ᵀ to the hyperplane wᵀx + b = 0 is:

  d_j = |wᵀx_j + b| / ||w||

  where ||w|| is the norm of w:  ||w|| = sqrt(w1² + w2² + … + wn²)

(Figure: a point x_j at distance d_j from the line w1·x1 + w2·x2 + b = 0.)
Maximum Margin
v Find the separating hyperplane wᵀx + b = 0
v such that the minimum distance from the samples to the hyperplane is maximized:

  max_{w,b}  min_{x_j}  |wᵀx_j + b| / ||w|| ,   j = 1, 2, …, m
Learning the Maximum Margin Classifier
To make the problem easier to solve, the training samples are rescaled so that:

  wᵀx_i + b ≥ +1  if y_i = +1
  wᵀx_i + b ≤ −1  if y_i = −1

v Margin:  d = 2 / ||w||

v Optimization problem:

  max_{w,b}  2 / ||w||
  subject to:  y_j (wᵀx_j + b) ≥ 1,  j = 1, 2, …, m

This is solved as a constrained quadratic program (约束二次规划的求解).
Support Vector Machines
vWhat if the problem is not linearly separable?
Feature Space
vData are not separable in one dimension
vNot separable if you insist on using a specific class of function

2024/4/28
Blown Up Feature Space
v The data become separable in the <x, x²> space
(Figure: the same 1-D data plotted in the (x, x²) plane are linearly separable.)
Not linearly separable
v For a sample whose feature vector in the original space is x, let φ(x) denote its feature vector
after mapping to the higher-dimensional space. The SVM model in the high-dimensional space is:

  f(x) = wᵀφ(x) + b

v The model corresponding to the dual problem is:

  f(x) = Σ_{j=1}^{m} α_j y_j φ(x_j)ᵀ φ(x) + b

v The dual optimization problem is:

  α = argmax_α  Σ_{j=1}^{m} α_j − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j φ(x_i)ᵀ φ(x_j)
  s.t.  Σ_{j=1}^{m} α_j y_j = 0,  α_j ≥ 0,  j = 1, 2, …, m
Not linearly separable
v Computing φ(x_i)ᵀφ(x_j) requires mapping x_i and x_j to the high-dimensional space first and then
taking their inner product there. To simplify the computation, this step is replaced by a kernel
function K(·,·):

  K(x_i, x_j) = φ(x_i)ᵀ φ(x_j)

v Using kernel values computed in the original space, the SVM model becomes:

  f(x) = Σ_{j=1}^{m} α_j y_j K(x, x_j) + b

v and the dual optimization problem becomes:

  α = argmax_α  Σ_{j=1}^{m} α_j − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j K(x_i, x_j)
  s.t.  Σ_{j=1}^{m} α_j y_j = 0,  α_j ≥ 0,  j = 1, 2, …, m
Common kernel functions (常用的核函数)
v Polynomial kernel:  K(x_i, x_j) = (x_iᵀ x_j)^d,  d ≥ 1
v Gaussian kernel:  K(x_i, x_j) = exp( −||x_i − x_j||² / (2σ²) )
v Linear kernel:  K(x_i, x_j) = x_iᵀ x_j
Python: SVM
v from sklearn.svm import SVC
v svm_model = SVC()

class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0,
shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None,
verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)

kernel: one of 'rbf', 'linear', 'poly', 'sigmoid', 'precomputed', or a user-defined kernel
function; the default is 'rbf'.
  rbf: radial basis function, i.e. the Gaussian kernel
  linear: the linear kernel
  poly: the polynomial kernel
  sigmoid: the hyperbolic tangent (tanh) kernel
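A minimal sketch of training and evaluating an SVM, assuming the train/test split from the earlier KNN sketch (train_X, test_y, etc. are illustrative names):

from sklearn.svm import SVC
from sklearn import metrics

# RBF (Gaussian) kernel SVM; try kernel='linear' or kernel='poly' for comparison
svm_model = SVC(kernel='rbf', C=1.0)
svm_model.fit(train_X, train_y)

predict = svm_model.predict(test_X)
print('accuracy: %.2f%%' % (100 * metrics.accuracy_score(test_y, predict)))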
Model Evaluation
Purpose of Model Evaluation

v Verify whether a model built from a given training data set is biased
v Verify the performance of a model
v Every model built using a different technique has its own advantages and disadvantages.
Confusion matrix

                    Actual class (真实类)
Predicted class     Yes                        No
Yes                 true positives (TP) = 2    false positives (FP) = 3
No                  false negatives (FN) = 3   true negatives (TN) = 6
TP rate = TP / (TP + FN)        FP rate = FP / (FP + TN)

precision = TP / (TP + FP)      recall = TP / (TP + FN)

F-measure = 2 × precision × recall / (precision + recall) = 2·TP / (2·TP + FP + FN)
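A small sketch of computing these measures with scikit-learn (test_y and predict as in the earlier examples; note that sklearn's confusion matrix puts actual classes on the rows, the transpose of the table above):

from sklearn import metrics

print(metrics.confusion_matrix(test_y, predict))   # rows: actual class, columns: predicted class
print('accuracy :', metrics.accuracy_score(test_y, predict))
# average='macro' also works for multi-class problems; the default is binary
print('precision:', metrics.precision_score(test_y, predict, average='macro'))
print('recall   :', metrics.recall_score(test_y, predict, average='macro'))
print('F-measure:', metrics.f1_score(test_y, predict, average='macro'))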
ROC curve
v ROC: receiver operating characteristic (接收者操作特性)
§ Y axis: TPR (true positive rate)
• the number of true positives among the samples predicted positive, as a percentage of all positive samples
§ X axis: FPR (false positive rate)
• the number of negatives among the samples predicted positive, as a percentage of all negative samples
ROC curve example

Rank  probability  class label
1     0.95         Y
2     0.93         Y
3     0.93         N
4     0.88         Y
5     0.86         Y
6     0.85         N
7     0.82         Y
8     0.80         Y
9     0.80         N
10    0.79         Y
11    0.77         N
12    0.76         Y
13    0.73         N
14    0.65         N
15    0.63         Y
16    0.58         N
17    0.56         N
18    0.49         N
19    0.48         Y
20    0.47         N

TP rate = TP / (TP + FN),  FP rate = FP / (FP + TN).  Sweeping the decision threshold down
this ranking gives points on the curve:

FPR    TPR
20%    4/10
40%    7/10
60%    8/10
80%    9/10
100%   10/10

(Figure: the model's ROC curve plotted against the diagonal of a random classifier.)
ROC curve, AUC
(The slide shades the area under the ROC curve, the AUC, obtained from the same ranked table shown above.)
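A minimal sketch of drawing an ROC curve and computing the AUC with scikit-learn (binary labels and a vector of predicted probabilities for the positive class are assumed; variable names are illustrative):

import matplotlib.pyplot as plt
from sklearn import metrics

# scores: predicted probability of the positive class, e.g. model.predict_proba(test_X)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(test_y, scores)
auc = metrics.roc_auc_score(test_y, scores)

plt.plot(fpr, tpr, label='model (AUC = %.2f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='random')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()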
Precision, recall
v Binary classification

from sklearn import metrics

# only for binary classification
precision = metrics.precision_score(test_y, predict)
recall = metrics.recall_score(test_y, predict)
print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
Homework: individual
1. For the Titanic data set, with attribute "survived" as the class attribute:
(1) Split the data into a 70% training set and a 30% test set.
(2) Use KNN, decision tree and SVM to build classification models; report their
performance (accuracy, precision, recall) on the test set as a dataframe with four
columns (classifier, accuracy, precision, recall), and draw the ROC curves of the classifiers.

Group homework
1. Describe the problem you want to solve based on the Ad click data. Use supervised
learning models such as KNN, decision tree and SVM to solve the problem.
