
UNIT 3

MACHINE LEARNING
TREE MODELS
• Feature Tree:
a compact way of representing a number of conjunctive
concepts in the hypothesis space.

• Tree structure:
1. Internal nodes are labelled with features.
2. Edges are labelled with literals.
3. Split: the set of literals at a node.
4. Leaf: a conjunctive concept, the conjunction of the literals
on the path from the root to that leaf.
TREE MODELS
• Generic Algorithm
Three functions are assumed:
1. Homogeneous(D) → true if all instances in D belong to a
single class.

2. Label(D) → returns the label to assign to D (e.g. its
majority class).

3. BestSplit(D, F) → returns the feature on which the data set D
is best divided into two or more subsets.
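As a sketch, the generic divide-and-conquer learner can be written around these three functions; the function and parameter names below (grow_tree and its callbacks) are illustrative, not part of the slides:

def grow_tree(D, F, homogeneous, label, best_split):
    """Generic divide-and-conquer tree learner (sketch).
    D: list of training instances, F: list of candidate features.
    homogeneous, label, best_split are the three functions above."""
    if homogeneous(D):
        return {"leaf": label(D)}            # single-class leaf
    feature, partition = best_split(D, F)    # partition: value -> subset of D
    children = {}
    for value, Di in partition.items():
        if not Di:                           # empty child: use the parent's label
            children[value] = {"leaf": label(D)}
        else:
            children[value] = grow_tree(Di, F, homogeneous, label, best_split)
    return {"split_on": feature, "children": children}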
TREE MODELS
• Divide-and-conquer algorithm:
divides the data into subsets, builds a tree for each of those,
and then combines the subtrees into a single tree.
• Greedy:
whenever there is a choice (such as choosing the best split), the
best alternative is selected on the basis of the information then
available, and this choice is never reconsidered.
• Backtracking search:
can return an optimal solution, at the expense of increased
computation time and memory requirements.
DECISION TREES
• Classification task on a data set D:
-- if Homogeneous(D), label the leaf with the single class Label(D);
-- if D is not homogeneous, split it into subsets D1, D2, ...;
-- if a subset Di = Ø (empty), label it with the majority class of
its parent D.
DECISION TREES
• A split is pure if, for example, D1+ = D+ and D1- = Ø, or
D1- = D- and D1+ = Ø.

• Impurity depends only on the relative magnitude of n+ and n-,
i.e. on the proportion p˙ = n+ / (n+ + n-), the empirical
probability of the positive class.

• Aim: we need an impurity function of p˙ that
-- returns 0 if p˙ = 0 or p˙ = 1, and
-- reaches its maximum at p˙ = 1/2.
FUNCTIONS
1. MINORITY CLASS (error rate)

2. GINI INDEX (expected error rate)

3. ENTROPY (expected information)
MINORITY CLASS
• min(p˙, 1 - p˙): the error rate when every instance in the leaf is
predicted as the majority class.

• The minority class is proportional to the number of misclassified
examples.

• Example: spam = 40 (majority class), ham = 10 (minority class,
misclassified).

• If the set of instances is pure, the minority class is empty and
there is no error.

• Written as an impurity measure: min(p˙, 1 - p˙) = 1/2 - |p˙ - 1/2|.


GINI INDEX
• The expected error rate.

• Suppose we assign labels to instances randomly, predicting positive
with probability p˙ and negative with probability 1 - p˙.

• Probability of a false positive: (1 - p˙) p˙.

• Probability of a false negative: p˙ (1 - p˙).

• Expected error rate (Gini index): 2 p˙ (1 - p˙).


ENTROPY
• The expected information.

• Formula: -p˙ log2 p˙ - (1 - p˙) log2 (1 - p˙)
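As a small illustration, the three impurity measures can be written as functions of the empirical probability p˙ (a minimal sketch; the function names are not from the slides):

import math

def minority_class(p):
    """Error rate: min(p, 1-p) = 1/2 - |p - 1/2|."""
    return min(p, 1 - p)

def gini_index(p):
    """Expected error rate of random labelling: 2p(1-p)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Expected information: -p log2 p - (1-p) log2 (1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# All three return 0 at p = 0 or p = 1 and reach their maximum at p = 1/2:
for p in (0.0, 0.25, 0.5, 1.0):
    print(p, minority_class(p), gini_index(p), entropy(p))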


[Figure slides: Decision Trees; Entropy; Gini Index.]
Decision Tree
• For more than two classes (K > 2):

• Either use a one-vs-rest decomposition, or use K-class impurity
measures directly.

• K-class entropy = -sum over i of p˙i log2 p˙i

• K-class Gini index = sum over i of p˙i (1 - p˙i) = 1 - sum over i of p˙i^2
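A matching sketch for the K-class versions, taking a list of per-class counts (the helper names and signatures are assumptions):

import math

def class_proportions(counts):
    """Turn per-class counts [n1, ..., nK] into proportions p_i."""
    n = sum(counts)
    return [c / n for c in counts]

def k_class_entropy(counts):
    """-sum_i p_i log2 p_i over the K classes."""
    return -sum(p * math.log2(p) for p in class_proportions(counts) if p > 0)

def k_class_gini(counts):
    """sum_i p_i (1 - p_i) = 1 - sum_i p_i^2."""
    return 1 - sum(p * p for p in class_proportions(counts))

print(k_class_entropy([10, 10, 10]))   # uniform 3-class leaf: log2(3), about 1.585
print(k_class_gini([30, 0, 0]))        # pure leaf: 0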


RANKING AND PROBABILITY ESTIMATION
• Grouping classifiers divide the instance space into segments; in a
decision tree the segments are the leaves.

• Such a model becomes a ranker by learning an ordering over those
segments.

• Decision trees have access to the local class distribution in each
leaf, so the leaf ordering can be constructed directly in an optimal
way.

• Using the empirical probability p˙ of each leaf it is easy to compute
this ordering: give the highest rank to the leaves with the largest
proportion of positives.

• On the training data this ordering produces a convex ROC (coverage)
curve.

• The empirical probability of the parent is a weighted average of the
empirical probabilities of its children; but this only tells us that
p˙1 ≤ p˙ ≤ p˙2 or p˙2 ≤ p˙ ≤ p˙1.
• Consider a feature tree with as yet unlabelled leaves.

• In how many ways can we label the leaves, and how well does each
labelling perform?

• Assume we know the number of positives and negatives covered by
each leaf.

• With L leaves and C classes there are C^L possible labellings.
Example: 2^4 = 16 labellings for a two-class tree with four leaves.

• The coverage plot of all labellings is symmetric: complementary
labellings such as +-+- and -+-+ occupy symmetric positions.

• The optimal labellings lie on the upper-left path through the
corners of the coverage plot, obtained by switching leaves to + in
order of decreasing empirical probability:
----, --+-, +-+-, +-++, ++++.

• With L leaves there are L! possible orderings (permutations) of the
leaves.


• A feature tree can be turned into:

-- a Ranker: order the leaves in descending order of their empirical
probability;

-- a Probability Estimator: predict the empirical probability in each
leaf, possibly smoothed with the Laplace or m-estimate correction;

-- a Classifier: choose the expected operating conditions and find
the operating point on the ROC curve that best fits those
conditions.
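A small sketch of how leaf counts could be turned into a leaf ordering and (Laplace-smoothed) probability estimates; the data layout and names are assumptions for illustration:

def leaf_probability(pos, neg, laplace=True):
    """Empirical probability of the positive class in a leaf,
    optionally with the Laplace correction (pos+1)/(pos+neg+2)."""
    if laplace:
        return (pos + 1) / (pos + neg + 2)
    return pos / (pos + neg)

# Hypothetical leaves given as (name, positives covered, negatives covered).
leaves = [("L1", 30, 5), ("L2", 10, 10), ("L3", 1, 20)]

# Ranker: order leaves by decreasing empirical probability of the positive class.
ranking = sorted(leaves, key=lambda l: leaf_probability(l[1], l[2]), reverse=True)
print([name for name, _, _ in ranking])

# Probability estimator: report the smoothed probability in each leaf.
for name, pos, neg in leaves:
    print(name, round(leaf_probability(pos, neg), 3))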
• The optimal labelling under these operating conditions is +−++.
• We use the second leaf to filter out negatives.
• In other words, the right two leaves can be merged into one:
their parent.
• The operation of merging all leaves in a subtree is called pruning
the subtree.
• The advantage of pruning is that we can simplify the tree without
affecting the chosen operating point, which is sometimes useful if
we want to communicate the tree model to somebody else.
• The disadvantage is that we lose ranking performance.
Sensitivity to Skewed Class Distribution
• Gini index of the parent: 2 (n+ / n)(n- / n).

• Weighted average Gini index of the children: a child with
n1 = n1+ + n1- instances contributes (n1 / n) * 2 (n1+ / n1)(n1- / n1).

• Using the square root of the Gini index instead, the impurity of a
child relative to its parent becomes
sqrt(n1+ * n1-) / sqrt(n+ * n-),
which is unchanged when the class distribution is skewed, e.g. when
all negative counts are multiplied by the same factor.
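A small numerical check of this claim, assuming that skewing the class distribution means multiplying all negative counts by the same factor:

import math

def relative_impurity_sqrt_gini(n1_pos, n1_neg, n_pos, n_neg):
    """Child impurity relative to the parent under sqrt(Gini):
    sqrt(n1+ * n1-) / sqrt(n+ * n-)."""
    return math.sqrt(n1_pos * n1_neg) / math.sqrt(n_pos * n_neg)

def relative_impurity_gini(n1_pos, n1_neg, n_pos, n_neg):
    """Weighted child Gini divided by parent Gini, for comparison."""
    n1, n = n1_pos + n1_neg, n_pos + n_neg
    child = (n1 / n) * 2 * (n1_pos / n1) * (n1_neg / n1)
    parent = 2 * (n_pos / n) * (n_neg / n)
    return child / parent

# Original counts vs. the same split with 10x as many negatives everywhere.
print(relative_impurity_sqrt_gini(20, 5, 50, 50),
      relative_impurity_sqrt_gini(20, 50, 50, 500))   # identical: 0.2 and 0.2
print(relative_impurity_gini(20, 5, 50, 50),
      relative_impurity_gini(20, 50, 50, 500))        # changes with the skew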


How you would train decision trees for a dataset
• Train them as good ranking estimators.
• Use a distribution-insensitive impurity measure such as the square
root of the Gini index.
• Disable pruning while growing the tree.
• Choose the operating point on the ROC curve from the operating
conditions.
• Finally, prune away subtrees whose leaves are all assigned the same
label at that operating point.
Tree Learning as Variance Reduction
• The Gini index 2p(1-p) is an expected error rate: label the
instances randomly, predicting positive with probability p.

• Compare a coin toss: if heads occurs with probability p (and tails
with probability 1-p), the outcome has variance p(1-p).

• So reducing impurity is closely related to reducing variance, which
is exactly what regression trees aim for.
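A tiny simulation sketch checking that a random 0/1 label drawn with probability p has variance p(1 - p) (names are illustrative):

import random

def bernoulli_variance(p, trials=100_000, seed=0):
    """Estimate the variance of a 0/1 label that is 1 with probability p."""
    rng = random.Random(seed)
    xs = [1 if rng.random() < p else 0 for _ in range(trials)]
    mean = sum(xs) / trials
    return sum((x - mean) ** 2 for x in xs) / trials

for p in (0.1, 0.5, 0.9):
    print(p, round(bernoulli_variance(p), 3), "vs p(1-p) =", round(p * (1 - p), 3))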
REGRESSION TREE
Regression Tree
Split on Model → groups {A100, B3, E112, M102, T202}:
• A100 [1051, 1770, 1900], mean = 1574
• B3 [4513], mean = 4513
• E112 [77], mean = 77
• M102 [870], mean = 870
• T202 [99, 270, 625], mean = 331
Calculate each group's weighted variance contribution,
(nj / n) * Var(group) = (1/n) * sum of squared deviations, with n = 9:
• A100
1/9 [sq(1574-1051) + sq(1574-1770) + sq(1574-1900)]
= 1/9 [sq(523) + sq(-196) + sq(-326)]
= 1/9 [273529 + 38416 + 106276] = 46469
• B3
1/9 sq(4513-4513) = 0
• E112
1/9 sq(77-77) = 0
• M102
1/9 sq(870-870) = 0
• T202
1/9 [sq(331-99) + sq(331-270) + sq(331-625)]
= 1/9 [sq(232) + sq(61) + sq(-294)]
= 1/9 [53824 + 3721 + 86436] = 15998
• Weighted average variance for Model:
46469 + 0 + 0 + 0 + 15998 = 62467
• Similarly for Condition (excellent, good, fair):
excellent [1770, 4513], mean = 3142
good [270, 870, 1051, 1900], mean = 1023
fair [77, 99, 625], mean = 267
Weighted variance contributions:
• excellent
1/9 [sq(3142-1770) + sq(3142-4513)]
= 1/9 [sq(1372) + sq(-1371)] = 1/9 [1882384 + 1879641] = 418003
• good
1/9 [sq(1023-270) + sq(1023-870) + sq(1023-1051) + sq(1023-1900)]
= 1/9 [sq(753) + sq(153) + sq(-28) + sq(-877)]
= 1/9 [567009 + 23409 + 784 + 769129] = 151148
• fair
1/9 [sq(267-77) + sq(267-99) + sq(267-625)]
= 1/9 [sq(190) + sq(168) + sq(-358)]
= 1/9 [36100 + 28224 + 128164] = 21388
• Weighted average variance for Condition:
418003 + 151148 + 21388 = 590539
• Similarly for Leslie (yes, no):
yes [625, 870, 1900], mean = 1132
no [77, 99, 270, 1051, 1770, 4513], mean = 1297
Weighted variance contributions:
• yes
1/9 [sq(1132-625) + sq(1132-870) + sq(1132-1900)]
= 1/9 [sq(507) + sq(262) + sq(-768)]
= 1/9 [257049 + 68644 + 589824] = 101724
• no
1/9 [sq(1297-77) + sq(1297-99) + sq(1297-270) + sq(1297-1051)
+ sq(1297-1770) + sq(1297-4513)]
= 1/9 [sq(1220) + sq(1198) + sq(1027) + sq(246) + sq(-473) + sq(-3216)]
= 1/9 [1488400 + 1435204 + 1054729 + 60516 + 223729 + 10342656]
= 1622804
• Weighted average variance for Leslie:
101724 + 1622804 = 1724528

Weighted average variances:
1. Model = 62467
2. Condition = 590539
3. Leslie = 1724528
Model gives the lowest weighted average variance, so the root is split
on Model.
• For the A100 subset the candidate splits are:
Condition [excellent, good, fair] → [1770] [1051, 1900] [] → ignored
(one group is empty)
Leslie [yes, no] → [1900] [1051, 1770] → calculate the variances
• For the T202 subset the candidate splits are:
Condition [excellent, good, fair] → [] [270] [99, 625] → ignored
(one group is empty)
Leslie [yes, no] → [625] [99, 270] → calculate the variances
Regression Tree
Clustering Trees
• A regression tree finds instance-space segments in which the target
values are tightly clustered around the mean.
• The variance of a set of target values is the average squared
Euclidean distance to their mean.

• A clustering tree can be learned using
1. a dissimilarity matrix, or
2. Euclidean distance between feature vectors.
• For A100 the three instances have numerical features
(price, reserve, bids):
(11, 8, 13)
(18, 15, 15)
(19, 19, 1)
• The mean vector is (16, 14, 9.7).
• The per-feature variances are:
1/3 [sq(16-11) + sq(16-18) + sq(16-19)] = 1/3 [25 + 4 + 9] = 12.7
1/3 [sq(14-8) + sq(14-15) + sq(14-19)] = 1/3 [36 + 1 + 25] = 20.7
1/3 [sq(9.7-13) + sq(9.7-15) + sq(9.7-1)] = 1/3 [10.9 + 28.1 + 75.7] = 38.2
• Their sum, 71.6, is the average squared Euclidean distance of the
three instances to the mean vector.
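A small sketch of this computation: the average squared Euclidean distance of a set of feature vectors to their mean, which equals the sum of the per-feature variances (the function name is illustrative):

def cluster_variance(vectors):
    """Average squared Euclidean distance of the vectors to their mean vector."""
    n, dims = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dims)]
    return sum(sum((v[d] - mean[d]) ** 2 for d in range(dims)) for v in vectors) / n

a100 = [(11, 8, 13), (18, 15, 15), (19, 19, 1)]
print(cluster_variance(a100))   # about 71.6 = 12.7 + 20.7 + 38.2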
RULE MODELS
• Logical models:
1. Tree models.
2. Rule models.

• Rule models consist of a collection of implications, or if-then
rules.

• The if-part defines a segment of the instance space, and the
then-part defines the behaviour of the model in this segment.
• Two approaches:
1. Find a combination of literals – the body of the rule, which is
called a concept – that covers a sufficiently homogeneous set of
examples, and find a label (class) to put in the head of the rule.
→ Ordered sequence of rules: a Rule List.

2. First select a class you want to learn, and then find rule bodies
that cover (large subsets of) the examples of that class.
→ Unordered collection of rules: a Rule Set.
Learning Ordered Rule Lists
• Grow the rule body by adding literals that improve the homogeneity
of the set of examples it covers.
• Contrast with decision trees: a tree split creates two children C1
and C2 and evaluates the impurity of both, whereas a rule learner
only evaluates the impurity of the examples covered by the current
rule body (a single child).
• Separate-and-conquer: learn one rule, remove the examples it covers
from the training set, and repeat on the remaining examples until
none are left.
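A minimal sketch of the separate-and-conquer loop, assuming a learn_one_rule helper that greedily grows a single rule body; the rule representation and function names are illustrative:

def covers(rule, instance):
    """A rule body is a set of (feature, value) literals; it covers an
    instance (a dict) if all literals hold."""
    return all(instance.get(f) == v for f, v in rule)

def learn_rule_list(examples, learn_one_rule):
    """Separate-and-conquer: learn a rule, remove the covered examples, repeat."""
    rule_list = []
    remaining = list(examples)
    while remaining:
        rule, label = learn_one_rule(remaining)    # e.g. grow literals greedily
        covered = [x for x in remaining if covers(rule, x)]
        if not covered:                            # safeguard: stop if no progress
            break
        rule_list.append((rule, label))
        remaining = [x for x in remaining if not covers(rule, x)]
    return rule_list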
Learning Unordered Rule Sets
• An alternative approach to rule learning.

• Rules are learned for one class at a time.

• Instead of minimizing the impurity min(p˙, 1 - p˙), we maximize p˙,
the empirical probability of the class being learned.
Descriptive Rule Learning
• Descriptive models can be learned in either a supervised or an
unsupervised way.

• Supervised: adapt the given rule learning algorithms to perform
subgroup discovery.

• Unsupervised: frequent item set and association rule discovery.
Learning from Subgroup Discovery

• The baseline is a subgroup whose proportion of positives equals that
of the overall population; quality measures score how far a subgroup
deviates from this baseline:

1. Precision: |prec - pos|

2. Average recall: |avgrec - 0.5|

3. Weighted relative accuracy:
WRA = pos * neg * (tpr - fpr)
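A small sketch of the weighted relative accuracy computation from a subgroup's counts (the function and parameter names are assumptions):

def weighted_relative_accuracy(tp, fp, total_pos, total_neg):
    """WRA = pos * neg * (tpr - fpr), where pos and neg are the overall class
    proportions and tpr, fpr are the subgroup's true/false positive rates."""
    n = total_pos + total_neg
    pos, neg = total_pos / n, total_neg / n
    tpr, fpr = tp / total_pos, fp / total_neg
    return pos * neg * (tpr - fpr)

# A subgroup covering 30 of 50 positives and 5 of 50 negatives:
print(weighted_relative_accuracy(tp=30, fp=5, total_pos=50, total_neg=50))  # 0.125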
Association Rule Mining
