
Data Mining & Machine Learning
CS 373
Purdue University

Dan Goldwasser
[email protected]
Multiclass Classification Tasks
• So far, our discussion was limited to binary predictions
– Well, almost (?)
• What happens if our decision is not over binary labels?
– Many interesting classification problems are not!
– POS tagging: noun, verb, determiner, ...
– Document classification: sports, finance, politics
– Sentiment: positive, negative, objective

How can we approach these problems?
• Can the problem be reduced to a binary classification problem?
Multiclass classification
• We will look into two approaches:
– Combining multiple binary classifiers
• One-vs-All
• All-vs-All
– Training a single classifier
• Extending SVM to the multiclass case
One-vs-All
Assumption: Each class can be separated from the rest using a binary classifier
• Learning: Decomposed into learning k independent binary classifiers, one corresponding to each class
– An example (x, y) is treated as positive for class y and negative for all other classes
– Assume m examples and k class labels (roughly m/k examples per class)
– Classifier f_i then sees m/k positive and (k-1)m/k negative examples
• Decision: Winner Takes All
– f(x) = argmax_i f_i(x) = argmax_i (v_i · x), where v_i is the weight vector of the i-th binary classifier (a sketch is given below)

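To make the decomposition concrete, here is a minimal one-vs-all sketch in Python (the helper names, the train_binary argument, and the assumption that each binary classifier is a plain weight vector v_i are illustrative, not from the slides):

import numpy as np

def train_one_vs_all(X, y, k, train_binary):
    # Train k binary classifiers; classifier i treats label i as positive.
    classifiers = []
    for i in range(k):
        y_i = np.where(y == i, 1.0, -1.0)          # label i -> +1, all other labels -> -1
        classifiers.append(train_binary(X, y_i))   # any binary learner returning a weight vector
    return classifiers

def predict_one_vs_all(classifiers, x):
    # Winner takes all: return the label whose classifier scores x highest.
    scores = [np.dot(v_i, x) for v_i in classifiers]
    return int(np.argmax(scores))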
Example: One-vs-All
Feature function notation
• For an example with label i we want: w_i^T x > w_j^T x for every j ≠ i
• Alternative notation: stack all the weight vectors into a single vector w = (w_1, …, w_k)
• Define features jointly over the input and output: φ(x, y) places x in the block corresponding to class y (and zeros elsewhere), so that w^T φ(x, y) = w_y^T x
• Then w^T φ(x, i) > w^T φ(x, j) is equivalent to w_i^T x > w_j^T x
Example
• The same pattern is encoded as different features associated with
different classes.
• The weights capture the relationship between the pattern and the
output class.

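As a concrete (hypothetical) illustration of this encoding, with k = 3 classes and a two-dimensional input x = (2, 5):
φ(x, 1) = (2, 5, 0, 0, 0, 0)
φ(x, 2) = (0, 0, 2, 5, 0, 0)
φ(x, 3) = (0, 0, 0, 0, 2, 5)
With the stacked weight vector w = (w_1, w_2, w_3), the score w^T φ(x, y) picks out exactly w_y^T x, so the same input pattern interacts with a different block of weights for each class.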
Multiclass Perceptron
(Algorithm figure from CIML.info)
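The CIML algorithm figure is not reproduced here; below is a minimal Python sketch of the standard multiclass perceptron update (function and variable names are assumptions):

import numpy as np

def multiclass_perceptron(X, y, k, epochs=10):
    # One weight vector per class; predict with argmax, update only on mistakes.
    n, d = X.shape
    W = np.zeros((k, d))
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(np.argmax(W @ x_i))   # highest-scoring label
            if y_hat != y_i:
                W[y_i] += x_i                 # promote the true label's weights
                W[y_hat] -= x_i               # demote the mistakenly predicted label's weights
    return W

In the joint feature notation, the same update is w ← w + φ(x, y) − φ(x, ŷ).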
Multiclass Logistic Regression
• Recall: logistic regression learns a probabilistic classifier, using the sigmoid function to model the conditional probability of the label.
• Training objective: find w that maximizes the conditional likelihood:
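The slide's formula is shown as an image; in standard notation, the binary model and objective are:
P(y = 1 | x; w) = σ(w^T x) = 1 / (1 + exp(−w^T x))
w* = argmax_w Σ_i log P(y_i | x_i; w)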
Multiclass Logistic Regression
• The multiclass version can be rewritten as:
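A standard way to write the multiclass model, using the joint features φ(x, y) introduced earlier (the slide's own equation is an image):
P(y | x; w) = exp(w^T φ(x, y)) / Σ_{y'} exp(w^T φ(x, y'))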
Multiclass Logistic Regression
• The training objective: find w that maximizes the conditional likelihood of the data {(x_i, y_i)}
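Written out (standard form, not reproduced from the slide image):
w* = argmax_w Σ_i log P(y_i | x_i; w) = argmax_w Σ_i [ w^T φ(x_i, y_i) − log Σ_y exp(w^T φ(x_i, y)) ]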
Multiclass Logistic Regression
• Equivalently: minimize the negative log-likelihood of the data
• The gradient balances two quantities:
– the expected feature counts given the current model
– the feature counts for the gold (observed) data
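The gradient behind these two annotations is the standard one for the negative log-likelihood (stated here since the slide's equation is an image):
∇_w NLL(w) = Σ_i [ E_{y ~ P(y | x_i; w)} φ(x_i, y) − φ(x_i, y_i) ]
At the optimum, the model's expected feature counts match the observed (gold) feature counts.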


How did we get here?
• Maximize the data likelihood → maximize the log-likelihood
• Equivalently, minimize the negative log-likelihood of the data
• The gradient compares the expected feature counts under the current model with the feature counts of the gold (observed) data


Multiclass SVM
• A single classifier optimizing a global objective
– Extend the SVM framework to the multiclass setting
• Binary SVM:
– Minimize ||w|| such that the closest points to the hyperplane have a score of +/- 1
• Multiclass SVM:
– Each label has a different weight vector
– Maximize the multiclass margin
Margin in the Multiclass case
Revise the definition for the multiclass case:
• The margin is the difference between the score of the correct label and the scores of competing labels
(Figure: data points colored by label; the margin is the gap between the correct label's score and the closest competing label's score)
• SVM objective: minimize the total norm of the weights s.t. the true label is scored at least 1 more than the second best.
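Written out (a standard formulation matching the description above):
margin(w) = min_i [ w^T φ(x_i, y_i) − max_{y ≠ y_i} w^T φ(x_i, y) ]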
Hard Multiclass SVM
• Objective: regularization term (minimize the norm of the weights)
• Constraint: the score of the true label has to be higher by at least 1 than the score of any other label
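In formula form (the standard hard-margin multiclass SVM; the slide's own equation is an image):
min_w ½ ||w||²
s.t. w^T φ(x_i, y_i) − w^T φ(x_i, y) ≥ 1   for all i and all y ≠ y_i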
Soft Multiclass SVM
• Relax the hard constraints using slack variables (one positive slack ξ_i per example)
• Objective: the regularizer plus a penalty on the slack variables
• Constraint: the score of the true label should have a margin of at least 1 − ξ_i over every other label
K. Crammer, Y. Singer: ”On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines”, JMLR, 2001
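In formula form (a standard Crammer–Singer-style soft-margin objective, consistent with the citation above; the slide's own equation is an image):
min_{w, ξ} ½ ||w||² + C Σ_i ξ_i
s.t. w^T φ(x_i, y_i) − w^T φ(x_i, y) ≥ 1 − ξ_i   for all i and all y ≠ y_i,   with ξ_i ≥ 0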
Multiclass classification so far
• Learning:
• Prediction:
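The two bullets correspond to the standard pair of formulas (filled in here since the slide shows them as images):
• Learning: find w minimizing ½ ||w||² + C Σ_i ξ_i subject to the multiclass margin constraints above
• Prediction: f(x) = argmax_y w^T φ(x, y)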
Cost-Sensitive Multiclass Classification
• Sometimes we are willing to “tolerate” some mistakes more than others
Cost-Sensitive Multiclass Classification
• We can think about the labels as a hierarchy (e.g., a tree over the classes)
• Define a distance metric:
– Δ(y, y’) = tree distance between y and y’
• We would like to incorporate this distance into our learning model


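One standard way to incorporate Δ into the learning model is margin rescaling (a sketch of the usual formulation, not necessarily the slides' exact notation):
min_{w, ξ} ½ ||w||² + C Σ_i ξ_i
s.t. w^T φ(x_i, y_i) − w^T φ(x_i, y) ≥ Δ(y_i, y) − ξ_i   for all i and all y ≠ y_i,   with ξ_i ≥ 0
Labels that are far from the gold label in the hierarchy must be separated by a larger margin.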


Cost-Sensitive Multiclass Classification
• Instead, we can have an unconstrained version:
• Question: what is the subgradient of this loss function?
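The unconstrained version is obtained by solving for the slack variables (a standard rewriting; the slide's equation is an image):
min_w ½ ||w||² + C Σ_i max_y [ Δ(y_i, y) + w^T φ(x_i, y) − w^T φ(x_i, y_i) ]
with the convention Δ(y_i, y_i) = 0, so the inner max is always ≥ 0.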
Reminder: Subgradient descent
(Slides by Sebastian Nowozin and Christoph H. Lampert, “Structured Models in Computer Vision” tutorial, CVPR 2011)
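A brief recap of the standard definitions (not taken verbatim from the cited tutorial):
• g is a subgradient of a convex function f at w if f(w') ≥ f(w) + g^T (w' − w) for all w'. Where f is differentiable, the gradient is the only subgradient; at kinks (such as the max in the hinge loss) there can be many.
• Subgradient descent repeats: pick any subgradient g of the objective at the current w and update w ← w − η_t g, typically with a decreasing step size η_t.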
Subgradient for the MC case

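A subgradient of the unconstrained objective can be written out as follows (a standard derivation, stated here since the slides' steps are images). For example i, let the loss-augmented prediction be
ŷ_i = argmax_y [ Δ(y_i, y) + w^T φ(x_i, y) ]
Then φ(x_i, ŷ_i) − φ(x_i, y_i) is a subgradient of the per-example loss (the zero vector when ŷ_i = y_i), and
g = w + C Σ_i [ φ(x_i, ŷ_i) − φ(x_i, y_i) ]
is a subgradient of the full regularized objective.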
Subgradient descent for the MC case
• Question: what is the difference between this algorithm and the perceptron variant for multiclass classification?
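A minimal stochastic subgradient descent sketch for the cost-sensitive multiclass SVM (the function names, the regularization constant lam, and the constant step size are illustrative assumptions):

import numpy as np

def phi(x, y, k):
    # Joint feature map: place x in the block of class y, zeros elsewhere.
    d = x.shape[0]
    f = np.zeros(k * d)
    f[y * d:(y + 1) * d] = x
    return f

def sgd_multiclass_svm(X, y, k, delta, lam=0.01, C=1.0, epochs=10, eta=0.1):
    # Stochastic subgradient descent on the regularized multiclass hinge loss.
    # delta(y_true, y_other) is the label-distance cost, with delta(y, y) == 0.
    n, d = X.shape
    w = np.zeros(k * d)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # loss-augmented prediction: argmax_y [ delta(y_i, y) + w . phi(x_i, y) ]
            scores = [delta(y_i, yy) + w @ phi(x_i, yy, k) for yy in range(k)]
            y_hat = int(np.argmax(scores))
            g = lam * w                                   # subgradient of the regularizer
            if y_hat != y_i:
                g = g + C * (phi(x_i, y_hat, k) - phi(x_i, y_i, k))
            w = w - eta * g
    return w

Regarding the question above: the perceptron uses the plain argmax (no Δ term), updates only on mistakes, and has no regularizer or step size, whereas the subgradient update uses the loss-augmented argmax and always shrinks w through the regularization term.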
