Classification Methods
Lecture 2: Methods
Jing Gao
SUNY Buffalo
Outline
• Basics
– Problem, goal, evaluation
• Methods
– Nearest Neighbor
– Decision Tree
– Naïve Bayes
– Rule-based Classification
– Logistic Regression
– Support Vector Machines
– Ensemble methods
– ………
• Advanced topics
– Multi-view Learning
– Semi-supervised Learning
– Transfer Learning
– ……
Bayesian Classification
Posterior Probability
• Let X be a data sample whose class label is unknown
• Let Hi be the hypothesis that X belongs to a particular
class Ci
• P(Hi|X) is the posterior probability of Hi conditioned on X
– Probability that data example X belongs to class Ci
given the attribute values of X
– e.g., given X=(age:31…40, income: medium,
student: yes, credit: fair), what is the probability X
buys computer?
Bayes Theorem
• To classify means to determine the highest P(Hi|X)
among all classes C1,…Cm
– If P(H1|X)>P(H0|X), then X buys computer
– If P(H0|X)>P(H1|X), then X does not buy computer
– Calculate P(Hi|X) using the Bayes theorem
P(Hi|X) = P(X|Hi) P(Hi) / P(X)
Age Income Student Credit Buys_computer
P1 31…40 high no fair no
P2 <=30 high no excellent no
P3 31…40 high no fair yes
P4 >40 medium no fair yes
P5 >40 low yes fair yes
P6 >40 low yes excellent no
P7 31…40 low yes excellent yes
P8 <=30 medium no fair no
P9 <=30 low yes fair yes
P10 >40 medium yes fair yes
H1: Buys_computer=yes
H0: Buys_computer=no
P(H1) = 6/10 = 0.6
P(H0) = 4/10 = 0.4
P(Hi|X) = P(X|Hi) P(Hi) / P(X)
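To make the calculation concrete, here is a minimal Python sketch (not part of the original slides; the helper names are my own) that estimates the class priors and descriptor conditionals from the ten records above and scores the slide's example X = (age: 31…40, income: medium, student: yes, credit: fair). P(X) is dropped because it is the same for every class.

```python
from collections import Counter, defaultdict

# The ten training records from the table: (age, income, student, credit, buys_computer)
records = [
    ("31...40", "high",   "no",  "fair",      "no"),
    ("<=30",    "high",   "no",  "excellent", "no"),
    ("31...40", "high",   "no",  "fair",      "yes"),
    (">40",     "medium", "no",  "fair",      "yes"),
    (">40",     "low",    "yes", "fair",      "yes"),
    (">40",     "low",    "yes", "excellent", "no"),
    ("31...40", "low",    "yes", "excellent", "yes"),
    ("<=30",    "medium", "no",  "fair",      "no"),
    ("<=30",    "low",    "yes", "fair",      "yes"),
    (">40",     "medium", "yes", "fair",      "yes"),
]

n = len(records)
class_counts = Counter(r[-1] for r in records)
priors = {c: cnt / n for c, cnt in class_counts.items()}     # P(H1)=0.6, P(H0)=0.4

# Per-class counts of each attribute value: cond[class][attribute index][value]
cond = defaultdict(lambda: defaultdict(Counter))
for *attrs, label in records:
    for j, value in enumerate(attrs):
        cond[label][j][value] += 1

def posterior_score(x, label):
    """P(Hi) * prod_j P(x_j | Hi), proportional to P(Hi | X)."""
    score = priors[label]
    for j, value in enumerate(x):
        score *= cond[label][j][value] / class_counts[label]
    return score

x = ("31...40", "medium", "yes", "fair")
print({c: posterior_score(x, c) for c in priors})   # pick the class with the larger score
```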
Descriptor Prior Probability
Age Income Student Credit Buys_computer
P1 31…40 high no fair no
P2 <=30 high no excellent no
P3 31…40 high no fair yes
P4 >40 medium no fair yes
P5 >40 low yes fair yes
P6 >40 low yes excellent no
P7 31…40 low yes excellent yes
P8 <=30 medium no fair no
P9 <=30 low yes fair yes
P10 >40 medium yes fair yes
P(Hi|X) = P(X|Hi) P(Hi) / P(X)
Avoiding the Zero-Probability Problem
• The descriptor probability P(X|Hi), and hence the posterior, becomes 0 if any single conditional probability is 0:
P(X|Hi) = ∏ j=1..d P(xj|Hi)
– A common remedy is the Laplacian (add-one) correction: add 1 to every count so that no estimated probability is exactly 0
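As one illustration of the correction (a sketch only; the function and argument names are hypothetical, and it reuses `cond` and `class_counts` from the earlier Naïve Bayes sketch), the add-one estimate keeps every conditional probability strictly positive:

```python
def smoothed_conditional(label, j, value, n_values):
    """Laplace (add-one) estimate of P(x_j = value | Hi).

    n_values is the number of distinct values attribute j can take;
    adding 1 to every count prevents any estimate from being exactly 0.
    """
    return (cond[label][j][value] + 1) / (class_counts[label] + n_values)
```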
Logistic Regression Classifier
• Input distribution
– X is n-dimensional feature vector
– Y is 0 or 1
– X|Y ~ Gaussian distribution
– Y ~ Bernoulli distribution
• Model P(Y|X)
– What does P(Y|X) look like?
– What does P(Y=0|X)/P(Y=1|X) look like?
Log ratio: log [ P(Y=0|X) / P(Y=1|X) ]
If the log ratio is positive, predict Class 0; if it is negative, predict Class 1
Logistic Function
P(Y=1|X) = 1 / (1 + exp(-(wX + b)))
Training set: for examples with Y=1 the model should give P(Y=1|X) close to 1; for examples with Y=0 it should give P(Y=1|X) close to 0
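A minimal NumPy sketch of the logistic function (illustrative only; `w` and `b` are assumed to be a learned weight vector and bias):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, b):
    """P(Y=1 | X) = 1 / (1 + exp(-(w·x + b)))."""
    return sigmoid(np.dot(w, x) + b)
```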
Maximizing Conditional Likelihood
• Training Set:
• Find W that maximizes conditional likelihood:
• The conditional log-likelihood is a concave function of W
• Solved by gradient ascent (equivalently, gradient descent on the negative log-likelihood)
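A rough sketch of the optimization (assumptions: 0/1 labels, a fixed learning rate, and plain batch gradient ascent on the conditional log-likelihood; not the exact procedure from the slides):

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize sum_i log P(y_i | x_i; w, b) by gradient ascent.
    X: (n, d) feature matrix, y: (n,) array of 0/1 labels."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # P(Y=1 | x_i) for every example
        # Gradient of the conditional log-likelihood: X^T (y - p)
        w += lr * X.T @ (y - p) / n
        b += lr * np.mean(y - p)
    return w, b
```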
Rule-Based Classifier
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Rule Coverage and Accuracy
• Coverage of a rule:
– Fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule:
– Fraction of the covered records that also satisfy the rule's consequent
• Example on the ten-record Tid / Refund / Marital Status / Taxable Income / Class table:
(Status=Single) ==> No
Coverage = 40%, Accuracy = 50%
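A small sketch of the two measures (the record format and the encoding of a rule antecedent as a predicate function are my own choices):

```python
def coverage_and_accuracy(antecedent, rule_class, records):
    """Coverage = fraction of records satisfying the antecedent;
    Accuracy = fraction of those covered records having the rule's class.
    antecedent: function mapping an attribute dict to True/False;
    records: list of (attribute_dict, class_label) pairs."""
    covered = [(attrs, label) for attrs, label in records if antecedent(attrs)]
    coverage = len(covered) / len(records)
    accuracy = sum(label == rule_class for _, label in covered) / len(covered)
    return coverage, accuracy

# e.g., the slide's rule (Status=Single) ==> No:
# coverage_and_accuracy(lambda r: r["Marital Status"] == "Single", "No", records)
```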
Characteristics of Rule-Based Classifier
• Exhaustive rules
– Classifier has exhaustive coverage if it accounts for
every possible combination of attribute values
– Each record is covered by at least one rule
From Decision Trees To Rules
Classification Rules (one per path of the Refund / Marital Status / Taxable Income decision tree):
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
• Each path in the tree forms a rule
• Rules are mutually exclusive and exhaustive
• Rule set contains as much information as the tree
Rules Can Be Simplified
[Same Refund / Marital Status / Taxable Income decision tree as above]
Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
Learn Rules from Data: Sequential Covering
• Learn rules one at a time
– Learn a rule that covers as many positive and as few negative examples as possible
– Remove the examples covered by the rule from the training set
– Repeat until no examples remain or the quality of the best new rule falls below a threshold
Example of Sequential Covering
[Figure: rule R1 is learned first; rule R2 is then learned on the examples left uncovered by R1]
How to Learn-One-Rule?
• Start with the most general rule possible: condition = empty
• Add new attributes by adopting a greedy depth-first strategy
– Pick the attribute that most improves the rule quality
• Rule-quality measures: consider both coverage and accuracy
– FOIL-gain: assesses the information gained by extending the condition
FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )
– It favors rules that have high accuracy and cover many positive tuples
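The FOIL gain formula translated directly into a small helper (a sketch; the guard for zero coverage is my own addition):

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """Gain of extending a rule's condition.

    pos, neg:          positive/negative tuples covered before the extension
    pos_new, neg_new:  positive/negative tuples covered after the extension
    """
    if pos_new == 0:
        return float("-inf")      # the extension covers no positive tuples
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))
```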
Rule Generation
• To generate a rule:
while (true):
    find the best predicate p
    if foil_gain(p) > threshold:
        add p to the current rule
    else:
        break
[Figure: the current rule grows one predicate at a time, e.g. A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, separating positive from negative examples]
Associative Classification
• Associative classification: Major steps
– Mine data to find strong associations between
frequent patterns (conjunctions of attribute-value
pairs) and class labels
– Association rules are generated in the form of
p1 ∧ p2 ∧ … ∧ pl ⇒ "class = C" (confidence, support)
– Organize the rules to form a rule-based classifier
Associative Classification
• Why effective?
– It explores highly confident associations among
multiple attributes and may overcome some
constraints introduced by decision-tree induction,
which considers only one attribute at a time
– Associative classification has been found to be often
more accurate than some traditional classification
methods, such as C4.5
Associative Classification
• Basic idea
– Mine possible association rules in the form of
• Cond-set (a set of attribute-value pairs) ⇒ class label
– Pattern-based approach
• Mine frequent patterns as candidate condition sets
• Choose a subset of frequent patterns based on
discriminativeness and redundancy
Frequent Pattern vs. Single Feature
The discriminative power of some frequent patterns is
higher than that of single features.
Two Problems
• Mine step
– combinatorial explosion
[Figure: mining the dataset produces a combinatorially large set of frequent patterns]
Two Problems
• Select step
– Issue of discriminative power
• InfoGain is evaluated against the complete dataset, NOT on a subset of examples
• Correlation among patterns is not directly evaluated on their joint predictability
Direct Mining & Selection via Model-based Search Tree
• Basic Flow
Support Vector Machines—An Example
• Find a linear hyperplane (decision boundary) that will separate the data
Example
[Figure: two candidate decision boundaries B1 and B2 with their margin hyperplanes (b11, b12) and (b21, b22); B1 has the larger margin]
Decision boundary: w · x + b = 0
Margin hyperplanes: w · x + b = 1 and w · x + b = −1
y = 1 if w · x + b ≥ 1
y = −1 if w · x + b ≤ −1
Margin = 2 / ||w||
Support Vector Machines
• We want to maximize: Margin = 2 / ||w||
– This is equivalent to minimizing: L(w) = ||w||² / 2
– But subject to the following constraints:
w · xi + b ≥ 1 if yi = 1
w · xi + b ≤ −1 if yi = −1
• This is a constrained optimization problem
– Numerical approaches to solve it (e.g., quadratic programming)
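As a hedged illustration of solving the constrained problem: the slides only say a numerical solver such as quadratic programming is used, so the sketch below delegates to scikit-learn's linear SVC (assumed available; the toy data is made up) and then checks the margin and the constraints.

```python
import numpy as np
from sklearn.svm import SVC   # assumed available; solves the QP internally

# Tiny linearly separable toy data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.5, 1.0],
              [-1.0, -1.5], [-2.0, -1.0], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2 / 2 subject to y_i (w·x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("Margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("Constraints hold:", bool(np.all(y * (X @ w + b) >= 1 - 1e-6)))
```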
Noisy Data
• What if the problem is not linearly separable?
Slack Variables
• Introduce slack variables ξi ≥ 0 so that the constraints become yi(w · xi + b) ≥ 1 − ξi, and penalize Σ ξi in the objective to tolerate noisy (misclassified) points
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be
mapped to some higher-dimensional feature space where the
training set is linearly separable:
Φ: x → φ(x)
Ensemble Learning
• Problem
– Given a data set D={x1,x2,…,xn} and their
corresponding labels L={l1,l2,…,ln}
– An ensemble approach computes:
• A set of classifiers {f1,f2,…,fk}, each of which maps data to a
class label: fj(x)=l
• A combination of classifiers f* which minimizes
generalization error: f*(x)= w1f1(x)+ w2f2(x)+…+ wkfk(x)
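A minimal sketch of the weighted combination f*(x) (assuming, for simplicity, base classifiers that output labels in {-1, +1}; the final label is the sign of the weighted sum above):

```python
def ensemble_predict(classifiers, weights, x):
    """f*(x) = w1*f1(x) + w2*f2(x) + ... + wk*fk(x);
    each fj maps x to a label in {-1, +1}, and the sign of the
    weighted sum is returned as the ensemble's label."""
    score = sum(w_j * f_j(x) for f_j, w_j in zip(classifiers, weights))
    return 1 if score >= 0 else -1
```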
Generating Base Classifiers
Bagging (1)
• Bootstrap
– Sampling with replacement
– Contains around 63.2% original records in each
sample
• Bootstrap Aggregation
– Train a classifier on each bootstrap sample
– Use majority voting to determine the class label of
ensemble classifier
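A short sketch of bagging (assumption: `train_base` is a hypothetical helper that fits one base classifier and returns a prediction function):

```python
import numpy as np

def bagging(X, y, train_base, n_classifiers, seed=0):
    """Train one base classifier per bootstrap sample (sampling with
    replacement, so each sample contains ~63.2% of the original records)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)          # bootstrap sample indices
        classifiers.append(train_base(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, x):
    """Majority vote over the base classifiers."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)
```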
Bagging (2)
Boosting (1)
• Example
– Record 4 is hard to classify
– Its weight is increased, therefore it is more likely to be chosen again in subsequent boosting rounds
Boosting (2)
• AdaBoost
– Initially, set uniform weights on all the records
– At each round
• Create a bootstrap sample based on the weights
• Train a classifier on the sample and apply it on the original training
set
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights
decreased
• If the error rate is higher than 50%, start over
– Final prediction is weighted average of all the classifiers
with weight representing the training accuracy
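A rough sketch of the procedure above (assumptions: labels in {-1, +1}, a hypothetical `train_base` helper that returns a vectorized prediction function, and the usual log-odds classifier weight; it follows the bullet list, not a specific AdaBoost variant):

```python
import numpy as np

def adaboost(X, y, train_base, n_rounds, seed=0):
    """y must be in {-1, +1}; train_base(Xs, ys) -> function mapping X -> predictions."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.full(n, 1.0 / n)                      # uniform record weights initially
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, replace=True, p=w)   # weighted bootstrap sample
        clf = train_base(X[idx], y[idx])
        pred = clf(X)                                    # apply to the original training set
        err = w[pred != y].sum()
        if err >= 0.5:                                   # error rate over 50%: start over
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)   # raise weights of misclassified records, lower the rest
        w /= w.sum()
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted vote of the base classifiers."""
    return np.sign(sum(a * clf(X) for clf, a in zip(classifiers, alphas)))
```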
Boosting (3)
[Figure: classifications (colors) and record weights (point sizes) after 1, 3, and 20 iterations of AdaBoost]
Boosting (4)
• Explanation
– Among the classifiers of the form:
f(x) = Σ i=1..K αi Ci(x)
– AdaBoost finds the one that minimizes the exponential loss on the training set:
Σ j=1..N exp( −yj f(xj) )
Random Forests (1)
• Algorithm
– Choose T—number of trees to grow
– Choose m<M (M is the number of total features) —number of
features used to calculate the best split at each node
(typically 20%)
– For each tree
• Choose a training set by choosing N times (N is the number of training
examples) with replacement from the training set
• For each node, randomly choose m features and calculate the best
split
• Fully grown and not pruned
– Use majority voting among all the trees
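As an illustration, the same recipe expressed with scikit-learn's RandomForestClassifier (the parameter names are scikit-learn's and the toy dataset is made up; this assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data for illustration only
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # T: number of trees to grow
    max_features=0.2,     # m: ~20% of the M features considered at each split
    bootstrap=True,       # each tree sees N records drawn with replacement
    random_state=0,
).fit(X, y)               # trees are fully grown (no pruning) by default

print(forest.predict(X[:5]))   # class decided by majority voting among the trees
```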
Random Forests (2)
• Discussions
– Bagging+random features
– Improve accuracy
• Incorporate more diversity and reduce variance
– Improve efficiency
• Searching among subsets of features is much faster than
searching among the complete set
Random Decision Tree (1)
• Single-model learning algorithms
– Fix the structure of the model and minimize some form of error, or maximize data likelihood (e.g., logistic regression, Naïve Bayes, etc.)
– Use some "free-form" functions to match the data given some "preference criteria" such as information gain, Gini index, and MDL (e.g., decision trees, rule-based classifiers, etc.)
• Learning as Encoding
– Make no assumption about the true model, neither parametric form nor
free form
– Do not prefer one base model over the other, just average them
Random Decision Tree (2)
• Algorithm
– At each node, an un-used feature is chosen randomly
• A discrete feature is un-used if it has never been chosen previously on
a given decision path starting from the root to the current node.
• A continuous feature can be chosen multiple times on the same
decision path, but each time a different threshold value is chosen
– We stop when one of the following happens:
• A node becomes too small (<= 3 examples).
• Or the total height of the tree exceeds some limits, such as the total
number of features.
– Prediction
• Simple averaging over multiple trees
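A simplified sketch of the algorithm for discrete features only (continuous features with random thresholds are omitted; the data layout as a list of feature dictionaries and all function names are my own):

```python
import random
from collections import Counter

def build_rdt(X, y, used=frozenset(), depth=0):
    """X: list of dicts mapping feature name -> value; y: list of class labels."""
    unused = [f for f in X[0] if f not in used]
    # stop when the node is too small, no unused features remain,
    # or the height reaches the total number of features
    if len(y) <= 3 or not unused or depth >= len(X[0]):
        return ("leaf", Counter(y))
    feat = random.choice(unused)                 # pick an unused feature at random
    children = {}
    for v in set(x[feat] for x in X):
        idx = [i for i, x in enumerate(X) if x[feat] == v]
        children[v] = build_rdt([X[i] for i in idx], [y[i] for i in idx],
                                used | {feat}, depth + 1)
    return ("node", feat, children, Counter(y))

def tree_proba(tree, x):
    """Class distribution at the leaf that x reaches."""
    while tree[0] == "node":
        _, feat, children, counts = tree
        tree = children.get(x[feat], ("leaf", counts))   # fall back to the node's counts
    counts = tree[1]
    total = sum(counts.values())
    return {c: cnt / total for c, cnt in counts.items()}

def rdt_predict_proba(trees, x):
    """Prediction = simple averaging over multiple trees."""
    classes = {c for t in trees for c in tree_proba(t, x)}
    return {c: sum(tree_proba(t, x).get(c, 0.0) for t in trees) / len(trees)
            for c in classes}
```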
Random Decision Tree (3)
[Figure: an example random decision tree; B3 is a continuous feature]
Random Decision Tree (4)
• Advantages
– Training can be very efficient, particularly for very large datasets
• No cross-validation-based estimation of parameters, as required by some parametric methods
– Natural multi-class probability estimates
– Imposes very few assumptions about the structure of the model
[Figure: the RDT decision boundary looks like the optimal boundary]
Ensemble Learning--Stories of Success
• Million-dollar prize
– Improve the baseline movie
recommendation approach of Netflix
by 10% in accuracy
– The top submissions all combine
several teams and algorithms as an
ensemble
Netflix Prize
• Supervised learning task
– Training data is a set of users and ratings (1,2,3,4,5 stars)
those users have given to movies.
– Construct a classifier that given a user and an unrated
movie, correctly classifies that movie as either 1, 2, 3, 4,
or 5 stars
– $1 million prize for a 10% improvement over Netflix’s
current movie recommender
• Competition
– At first, single-model methods were developed and performance improved
– However, the improvement slowed down
– Later, individuals and teams merged their results, and significant improvement was observed
Leaderboard
Take-away Message
• Various classification approaches
– How they work
– Their strengths and weaknesses
• Algorithms
– Decision tree
– K nearest neighbors
– Naïve Bayes
– Logistic regression
– Rule-based classifier
– SVM
– Ensemble methods