Classification and Prediction

Outline

 Distance-based algorithms
   K-Nearest Neighbor
 Decision tree-based algorithms
   ID3
   C4.5
   CART
 Statistical algorithms
   Bayesian Classification
 Neural network-based algorithms
   Propagation
 Linear and non-linear regression

Tree-Structured Rules

 Supervised learning
 The type of rule discussed here can be represented by a tree.
 Trees that represent classification rules are called classification trees or decision trees, and trees that represent regression rules are called regression trees.
 Tree-structured rules are very popular since they are easy to interpret and are very accurate.
Example

 Insurance risk decision tree: the root splits on Age (<=25 vs. >25). Age > 25 leads to leaf NO; Age <= 25 splits on Car Type, where Sedan leads to leaf NO and Sports or Truck leads to leaf YES.
 Each path from the root node to a leaf node represents one classification rule.
Decision Trees
 Also called a classification tree
 Graphical representation of a set of classification rules
 Each internal node represents a predictor / splitting attribute
 Each arc / edge is labeled with a predicate or splitting criterion
 Each leaf node is labeled with a class Cj
Decision Trees
Basic step
 Build the tree
 Apply the tree to the database
Decision Trees
 A decision tree is usually constructed in two phases:
 The growth phase
 The pruning phase
 In the growth phase, an overly large tree is constructed. This tree represents the records in the input database very accurately.
 In the pruning phase, the final size of the tree is determined.
 The rules represented by the tree constructed in phase one are usually overspecialized.
 By reducing the size of the tree, we generate a smaller number of more general rules that are better than a very large number of very specialized rules.
Decision Trees
 The splitting criterion at a node is found through application of a split selection method.
 A split selection method is an algorithm that takes as input a relation and outputs the locally 'best' splitting criterion.

 Following is the decision tree induction schema:

Input: node n, partition D, split selection method S
Output: decision tree for D rooted at node n

BuildTree(Node n, Partition D, split selection method S)
    Apply S to D to find the splitting criterion
    If (a good splitting criterion is found)
        Create two children nodes n1 & n2 of n
        Partition D into D1 & D2
        BuildTree(n1, D1, S)
        BuildTree(n2, D2, S)
    endif
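The schema above can be sketched in Python. This is only a minimal illustration of the recursion, not the slides' exact algorithm: it assumes a hypothetical find_split(D) split selection method that returns an (attribute index, value) criterion or None, stores each row's class label in its last position, and represents the tree as nested dicts.

# Minimal sketch of the BuildTree schema (hypothetical helpers).
def majority_class(D):
    labels = [row[-1] for row in D]
    return max(set(labels), key=labels.count)

def build_tree(D, find_split):
    criterion = find_split(D)                 # apply S to D
    if criterion is None:                     # no good split -> leaf node
        return {"class": majority_class(D)}
    attr, value = criterion
    D1 = [row for row in D if row[attr] <= value]   # partition D into D1 & D2
    D2 = [row for row in D if row[attr] > value]
    if not D1 or not D2:                      # degenerate split -> leaf node
        return {"class": majority_class(D)}
    return {"split": criterion,
            "left": build_tree(D1, find_split),     # BuildTree(n1, D1, S)
            "right": build_tree(D2, find_split)}    # BuildTree(n2, D2, S)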
ID3 Background
 “Iterative Dichotomizer 3”.
 Invented by Ross Quinlan in 1979.
 Generates Decision Trees using Entropy.
 Information Gain is used to select the most useful
attribute for classification.
 Builds the tree in top down fashion.
 Succeeded by Quinlan’s C4.5 and C5.0 algorithms.
Gain(A) = I(p,n) – E(A)
Entropy
 Introduced by Claude Shannon in 1948
 Quantifies "randomness"
 Lower value implies less uncertainty
 Higher value implies more uncertainty
 A completely homogeneous sample has entropy of 0
 An equally divided sample has entropy of 1
 Formula for I(p,n), the entropy of the starting set (parent table):
   I(p,n) = -[p/(p+n)] log2 [p/(p+n)] - [n/(p+n)] log2 [n/(p+n)]
 Entropy of an attribute A:
   E(A) = Σi [(pi+ni)/(p+n)] I(pi,ni)

Example (TEMP attribute):
TEMP   p   n   I(p,n)
Hot    0   2   0
Mild   1   1   1
cold   1   0   0
Information Gain (IG)
 The information gain is based on the decrease in entropy after a dataset is split on an attribute.
 It identifies which attribute creates the most homogeneous branches.
 Formula: Gain(A) = I(p,n) – E(A)

Outlook    p   n   I(p,n)
Sunny      2   3   0.970
Overcast   4   0   0
Rain       3   2   0.970

E(outlook) = [(2+3)/14](0.970) + [(4+0)/14](0) + [(3+2)/14](0.970)
           = 0.692
Gain(outlook) = IG - E(outlook)
              = 0.940 – 0.692
              = 0.248
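As a quick check on the arithmetic above, the short Python sketch below recomputes I(p,n), E(outlook), and Gain(outlook) for the 14-row weather data (p = 9, n = 5); the function names are my own, not from the slides, and the printed values differ from the slide only by rounding.

import math

def info(p, n):
    """I(p,n) = -p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n)."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

# Outlook partitions of the weather data: value -> (p, n).
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
p, n = 9, 5

e_outlook = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in outlook.values())
print(round(info(p, n), 3))              # 0.940
print(round(e_outlook, 3))               # 0.694 (slides round to 0.692)
print(round(info(p, n) - e_outlook, 3))  # 0.247 (slides round to 0.248)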
ID3
 ID3 is used to build a DT based on the information theory concept.
 ID3 chooses the splitting attribute with the highest IG.
 IG compares the info needed to make a correct classification before the split vs. the info needed after the split.
 IG = entropy of the original dataset - entropy of the split dataset
 Entropy of the split dataset = weighted sum of the entropies of each subdivided dataset
 Weight of each dataset = fraction of the dataset placed in that division
 (For the Outlook attribute above: E(outlook) = 0.692, Gain(outlook) = 0.940 – 0.692 = 0.248.)
ID3
 A branch set with entropy of 0 is a leaf node.
 Otherwise, the branch needs further splitting to classify
its dataset.
 The ID3 algorithm is run recursively on the non-leaf
branches, until all data is classified.
Overfitting
 During the construction of a DT, the tree repeatedly splits the data into nodes to get successively purer subsets of data.
 If nodes are fitted to noise in the training data, the model will not generalize well.
 This occurs when the model is too complex.
 Complexity is determined by the number of nodes in the tree.
 To avoid overfitting:
 Post-pruning
 Grow the tree to max size, then prune based on a validation set.
 Computationally expensive method.
 Replace a subtree with a leaf node if the generalization error improves or does not change.
 Pre-pruning
 Stop growing the tree before it is fully grown to fit the training data.
 Stop splitting when the split is not statistically significant.
Advantages of using ID3
 Understandable prediction rules are created from the
training data.
 Builds the fastest tree.
 Only need to test enough attributes until all data is
classified.
 Finding leaf nodes enables test data to be pruned,
reducing number of tests.
 Whole dataset is searched to create tree.
Disadvantages of using ID3
 Data may be over-fitted or over-classified, if a small
sample is tested.
 Smaller decision trees should be preferred over larger
ones. This algorithm usually produces small trees, but it
does not always produce the smallest possible tree
 Only one attribute at a time is tested for making a
decision.
 Classifying continuous data may be computationally
expensive.
DT Advantages/Disadvantages
 Advantages:
 DTs are easy to use and efficient.
 Rules generated are easy to interpret and understand
 They scale well for large databases as tree size is
independent of database size.
 Trees can be constructed for data with many attributes.
 Disadvantages:
 Does not easily handle continuous data.
 May suffer from overfitting.
 Can be quite large – pruning is necessary.
 Correlations among attributes in the database are ignored in the DT process.
C4.5
 A successor of ID3
 Builds the DT using a divide-and-conquer, top-down, recursive approach
 Builds the DT based on the information theory concept
 It chooses the splitting attribute with the highest Gain Ratio
 The split entropy is the entropy of an attribute split itself (ignoring classes); the Gain Ratio is the ratio of the IG for a splitting attribute to this split entropy
 Formula: Gain Ratio(A) = Gain(A) / split entropy(A)
temp   p   n   I(p,n)
Hot    2   2   1       I(2,2) = 1
Mild   4   2   0.918   I(4,2) = -4/6 log2(4/6) - 2/6 log2(2/6) = 0.918
cool   3   1   0.811   I(3,1) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.811

E(temp) = [(2+2)/14](1) + [(4+2)/14](0.918) + [(3+1)/14](0.811)
        = 0.911
Gain(temp) = IG - E(temp)
           = 0.940 - 0.911
           = 0.029
Split info(temp) = -4/14 log2(4/14) - 6/14 log2(6/14) - 4/14 log2(4/14) = 1.557
Gain Ratio(temp) = 0.029 / 1.557 = 0.019
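The sketch below recomputes the gain ratio for temp from these class counts; entropy() and info() are my own helper names, and the output confirms the gain, split info, and gain ratio figures above.

import math

def entropy(fracs):
    """Entropy of a list of fractions that sum to 1 (0 log 0 taken as 0)."""
    return -sum(f * math.log2(f) for f in fracs if f)

def info(p, n):
    return entropy([p / (p + n), n / (p + n)])

temp = {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)}   # value -> (p, n)
p, n = 9, 5
total = p + n

e_temp = sum((pi + ni) / total * info(pi, ni) for pi, ni in temp.values())
gain = info(p, n) - e_temp                                             # ~0.029
split_info = entropy([(pi + ni) / total for pi, ni in temp.values()])  # ~1.557
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))  # 0.029 1.557 0.019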


C4.5
1. Handles both continuous and discrete attributes
 The basic idea is to divide the data into ranges based
on the attribute values for that item that are found in
the training sample
2. Handling training data with missing attribute values
 Missing attribute values are simply not used in gain
ratio and entropy calculations
 To classify a record with a missing attribute value, the
value for that item can be predicted based on what is
known about the attribute values for the other records
C4.5
3. Pruning trees after creation
 C4.5 goes back through the tree once it's been created
and attempts to remove branches that do not help by
replacing them with leaf nodes
C4.5 EXAMPLE
 To calculate the GainRatio for the gender split, we first find the entropy associated with the gender split itself (ignoring classes):
   H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292
 This gives the GainRatio value for the gender attribute as 0.09688 / 0.292 = 0.332
 The entropy for the split on height (ignoring classes) is:
   H(4/15, 7/15, 2/15, 2/15)
CART
• Classification and regression tree
• If the target variable is nominal ( categorical) then the tree is
called Classification tree.
• If the target variable is numerical (continuous) then the tree is
called Regression tree.
• CART handles missing data by ignoring them in calculating the
goodness of split on the attribute.
• CART contains pruning strategy.
CART
• Classification and regression trees (CART) is a technique that
generates a binary decision tree
• Unlike ID3, however, where a child is created for each
subcategory, only two children are created
• The splitting is performed around what is determined to be
the best split point.
• At each step, an exhaustive search is used to determine the
best split, where "best" is defined by a measure φ(s/t)
CART
 Creates a binary tree
 Formula to choose split point, s, for node t:
   Φ(s/t) = 2 PL PR Σj |P(Cj|tL) - P(Cj|tR)|
 This formula is evaluated at the current node, t, and for each possible splitting attribute and criterion, s
CART
• L = left subtree of the current node
• R = Right subtree of the current node .
• PL= probability that a tuple in the training set will be on the Left side
of the tree
• PR= probability that a tuple in the training set will be on the Right side of the
tree
This is defined as [tuples in subtree]/ [tuples in training set]

• P(Cj|tL) is the probability that a tuple is in class, Cj, and in the left subtree
• P(Cj|tR) is the probability that a tuple is in class, Cj, and in the right subtree
• This is defined as the [tuples of class j in subtree]/ [tuples at the target node ]

• At each step, only one criterion is chosen as the best over all possible criteria

Gender   short   medium   Tall   Total
F        3       6        0      9
M        1       2        3      6

Class count differences between the two branches: 3-1=2, 6-2=4, 3-0=3

Splitting on gender: the F branch holds 9/15 of the tuples (S=3, M=6, T=0) and the M branch holds 6/15 (S=1, M=2, T=3).

Φ(gender) = 2 * 9/15 * 6/15 * (2/15 + 4/15 + 3/15) = 0.224
height   Less than   Greater than or equal   Total
1.6      0           15                      15
1.7      2           13                      15
1.8      5           10                      15
1.9      9           6                       15
2        12          3                       15
Split at height 1.6:

height   S   M   T   Total
<1.6     0   0   0   0
>=1.6    4   8   3   15

Φ(1.6) = 2 * 0/15 * 15/15 * (4/15 + 8/15 + 3/15) = 0

Split at height 1.7:

height   S   M   T   Total
<1.7     2   0   0   2
>=1.7    2   8   3   13

Φ(1.7) = 2 * 2/15 * 13/15 * (0/15 + 8/15 + 3/15) = 0.169

Split at height 1.8:

height   S   M   T   Total
<1.8     4   1   0   5
>=1.8    0   7   3   10

Φ(1.8) = 2 * 5/15 * 10/15 * (4/15 + 6/15 + 3/15) = 0.385


CART Example

• At the start, there are six choices for the split point (right branch on equality):
  – Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
  – Φ(1.6) = 0
  – Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
  – Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
  – Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
  – Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8
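A short sketch that recomputes Φ for the three height splits worked through in detail above, directly from their left/right class counts; the phi() helper and the counts dictionary are my own wrapping of the formula Φ(s/t) = 2 PL PR Σj |P(Cj|tL) - P(Cj|tR)|.

def phi(left, right, total=15):
    """CART goodness of a binary split, given (S, M, T) counts per branch."""
    p_l, p_r = sum(left) / total, sum(right) / total
    diff = sum(abs(l - r) / total for l, r in zip(left, right))
    return 2 * p_l * p_r * diff

# Candidate split -> ((S, M, T) counts on the left, counts on the right).
candidates = {
    "height<1.6": ((0, 0, 0), (4, 8, 3)),
    "height<1.7": ((2, 0, 0), (2, 8, 3)),
    "height<1.8": ((4, 1, 0), (0, 7, 3)),
}
for name, (left, right) in candidates.items():
    print(name, round(phi(left, right), 3))   # 0, 0.169, 0.385

best = max(candidates, key=lambda k: phi(*candidates[k]))
print("best split:", best)                    # height<1.8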
CART Example

 Φ(1.6) = Φ(height<1.6) = 0
 Φ(1.7) = Φ(height<1.7) = 2*(2/15)*(13/15)[|2/15-2/15| + |0-8/15| + |0-5/15|]
 Φ(1.8) = Φ(height<1.8) = 2*(5/15)*(10/15)[|4/15-0| + |1/15-7/15| + |0-3/15|]
 Φ(1.9) = Φ(height<1.9) = 2*(9/15)*(6/15)[|4/15-0| + |5/15-3/15| + |0-3/15|]
 Φ(2.0) = Φ(height<2.0) = 2*(12/15)*(3/15)[|4/15-0| + |8/15-0| + |0-3/15|]
Bayesian Classification

 Bayes Rule or Bayes Theorem is

   P(h1 | xi) = P(xi | h1) P(h1) / P(xi)

 Suppose there are m different hypotheses; then

   P(xi) = Σj P(xi | hj) P(hj)

 Here P(h1|xi) is called the posterior probability, while P(h1) is the prior probability associated with hypothesis h1
 P(xi) is the probability of the occurrence of data value xi and P(xi|h1) is the conditional probability that, given a hypothesis, the tuple satisfies it.
Bayesian Classification

Car No.   Color    Type     Origin     Stolen
1         Red      Sports   Domestic   Y
2         Red      Sports   Domestic   N
3         Red      Sports   Domestic   Y
4         Yellow   Sports   Domestic   N
5         Yellow   Sports   Imported   Y
6         Yellow   SUV      Imported   N
7         Yellow   SUV      Imported   Y
8         Yellow   SUV      Domestic   N
9         Red      SUV      Imported   N
10        Red      Sports   Imported   Y

Priors: P(Y) = 5/10, P(N) = 5/10

Color:  P(red|Y) = 3/5      P(red|N) = 2/5
        P(yellow|Y) = 2/5   P(yellow|N) = 3/5
Type:   P(SUV|Y) = 1/5      P(SUV|N) = 3/5
        P(sports|Y) = 4/5   P(sports|N) = 2/5
Origin: P(dom|Y) = 2/5      P(dom|N) = 3/5
        P(imp|Y) = 3/5      P(imp|N) = 2/5

Unlabeled sample X = (red & SUV & domestic), decision = ?

P(X|Y) = P(red|Y) P(SUV|Y) P(dom|Y) = 3/5 * 1/5 * 2/5 = 0.048
P(X|N) = P(red|N) P(SUV|N) P(dom|N) = 2/5 * 3/5 * 3/5 = 0.144

P(X|N) > P(X|Y) (and the priors are equal), therefore sample X is class "N"
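The same computation can be written as a small naive Bayes sketch in Python; the table rows are taken from above, and likelihood() is my own helper (the priors are equal here, so comparing likelihoods is enough).

# Each row: (color, type, origin, stolen).
data = [
    ("red", "sports", "domestic", "Y"), ("red", "sports", "domestic", "N"),
    ("red", "sports", "domestic", "Y"), ("yellow", "sports", "domestic", "N"),
    ("yellow", "sports", "imported", "Y"), ("yellow", "suv", "imported", "N"),
    ("yellow", "suv", "imported", "Y"), ("yellow", "suv", "domestic", "N"),
    ("red", "suv", "imported", "N"), ("red", "sports", "imported", "Y"),
]

def likelihood(sample, label):
    """P(X|class) = product over attributes of P(attribute value | class)."""
    rows = [r for r in data if r[-1] == label]
    p = 1.0
    for i, value in enumerate(sample):
        p *= sum(1 for r in rows if r[i] == value) / len(rows)
    return p

x = ("red", "suv", "domestic")
p_y, p_n = likelihood(x, "Y"), likelihood(x, "N")   # 0.048, 0.144
print("class:", "Y" if p_y > p_n else "N")          # N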


Exercise:

Car No.   A1   A2   A3   Class
1         A    C    A    C1
2         C    A    C    C1
3         A    A    C    C2
4         B    C    A    C2
5         C    C    B    C2

P(C1) = 2/5, P(C2) = 3/5

For each attribute A1, A2, A3, fill in P(a|C1), P(b|C1), P(c|C1) and P(a|C2), P(b|C2), P(c|C2).

Sample X = (A1=c, A2=c, A3=a): class?
Sample Y = (A1=a, A2=c, A3=b): class?

P(X|C1) P(C1) = P(A1=c|C1) P(A2=c|C1) P(A3=a|C1) P(C1),  with P(X|C1) = 1/2 * 1/2 * 1/2 = 0.125
P(X|C2) P(C2) = P(A1=c|C2) P(A2=c|C2) P(A3=a|C2) P(C2),  with P(X|C2) = 1/3 * 2/3 * 1/3 = 0.074

P(X|C1) > P(X|C2), therefore sample X is class "C1"


Bayesian Classification
 Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a simple classification scheme called naive Bayes classification has been proposed, based on Bayes rule of conditional probability
 By analyzing the contribution of each "independent" attribute, a conditional probability is determined
 A classification is made by combining the impact that the different attributes have on the prediction to be made
 The approach is called "naive" because it assumes independence between the various attribute values
Statistical-based algorithm (Bayesian
Classification)

• When classifying a target tuple, the conditional and prior


probabilities generated from the training set are used to make
the prediction
• This is done by combining the effects of the different attribute
values from the tuple
• Suppose that tuple ti has p independent attribute values {xi1, xi2, ..., xip}. From the descriptive phase, we know P(xik|Cj) for each class Cj and attribute xik
• We then estimate P(ti|Cj) by P(ti|Cj) = ∏P(xik|Cj)
• We then have the needed prior probabilities P(Cj) for each class
and the conditional probability P(ti|Cj)
• To calculate P(ti), we can estimate the likelihood that ti is in each
class. This can be done by finding the likelihood that this tuple is
in each class and then adding all these values
Statistical-based algorithm (Bayesian
Classification)
 The probability that ti is in a class is the product of the
conditional probabilities for each attribute value
 The posterior probability P(Cj|ti) is then found for each class
 The class with the highest probability is the one chosen for the
tuple.
Statistical-based algorithm (Bayesian
Classification)
• There are 4 tuples classified as short, 8 as medium, and 3 as tall.
• The Output1 classification uses the simple divisions shown below:
  2 m ≤ Height            Tall
  1.7 m < Height < 2 m    Medium
  Height ≤ 1.7 m          Short
• The Output2 results require a much more complicated set of divisions using both height and gender attributes.
• To facilitate classification, we divide the height attribute into six ranges:
  (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Statistical-based algorithm (Bayesian
Classification)
 With these training data, we estimate the prior probabilities:
   P(short) = 4/15 = 0.267, P(medium) = 8/15 = 0.533, and P(tall) = 3/15 = 0.2
 We use these values to classify a new tuple. For example, suppose we wish to classify t = (Adam, M, 1.95 m)
 By using these values and the associated probabilities of gender and height, we obtain the following estimates:
   P(t|short) = 1/4 x 0 = 0
   P(t|medium) = 2/8 x 1/8 = 0.031
   P(t|tall) = 3/3 x 1/3 = 0.333
 Weighting these by the priors, the tall class has the highest posterior, so t is classified as tall.

Prediction
 Dependent variable , y
 The variable whose values we want to explain or
forecast
 Independent variable , x
 Variable that explains the other
 Linear regression
 assumes a linear relationship between input
variable and output variable
 Logistic regression
 Used when the dependent variable is binary
 0/1, T/F, Y/N
Linear Regression
 Objective
 To establish if there is a relationship between two
variables.
 Income & spending
 Students’ weight & exam score
 Forecast new observation
 Sales in next quarter
 If the scatter diagram indicates some relationship between two variables x and y, then the dots of the scatter diagram will be concentrated round a curve.

 The curve is called the curve of regression and the relationship is said to be expressed by means of curvilinear regression.

 In the particular case when the curve is a straight line, it is called a line of regression and the regression is said to be linear.
Regression
 For classification the output(s) is nominal
 In regression the output is continuous
 Function approximation
 Many models could be used – the simplest is linear regression
 Fit the data with the best hyper-plane which "goes through" the points

[Figure: data points with the dependent variable y (output) plotted against the independent variable x (input).]
Regression
 For classification the output(s) is nominal
 In regression the output is continuous
 Function approximation
 Many models could be used – the simplest is linear regression
 Fit the data with the best hyper-plane which "goes through" the points
 For each point, the difference between the predicted point and the actual observation is the residue
Simple Linear Regression
 For now, assume just one (input) independent variable x, and one (output) dependent variable y
 Multiple linear regression assumes an input vector x
 Multivariate linear regression assumes an output vector y
 We will "fit" the points with a line (i.e. hyper-plane)
 Which line should we use?
 Choose an objective function
 For simple linear regression we choose sum squared error (SSE)
   Σ (predictedi – actuali)2 = Σ (residuei)2
 Thus, find the line which minimizes the sum of the squared residues (e.g. least squares)
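To make the objective concrete, the sketch below (toy data and candidate lines of my own choosing, not from the slides) computes the SSE for two candidate lines over the same points; the least-squares line is simply the line with the smallest such sum.

# Sum squared error of a candidate line y = a + b*x over a set of points.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]   # toy (x, y) data

def sse(a, b):
    return sum((a + b * x - y) ** 2 for x, y in points)

print(round(sse(0.0, 2.0), 3))   # candidate line y = 2x, SSE = 0.1
print(round(sse(0.5, 1.5), 3))   # a worse candidate line has a larger SSE (3.3)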
Y = β0 + β1x

 The equation of the line of regression is y = a + bx, where y is the dependent variable and x is the independent variable.

 The line of regression always passes through the point (x̄, ȳ), with a = ȳ - bx̄ and slope b = r σy / σx.

 The line gives the best estimate of y for a given value of x.

 Example: Y = 4 + 2x; for every unit increase in x, y increases by 2.
Regression line

 y = a + bx; substitute the values of a and b:
   a = ȳ - bx̄,   b = r σy / σx
 where r = cov(x, y) / (σx σy) and Cov(X,Y) = [1/n Σ xy] – x̄ ȳ
 'a' is the intercept and 'b' is the slope of the line

 Example: Consumption = 49.13 + (0.85) Income + error
   For every unit increase in income, consumption increases by 0.85.


Examples
 Calculate the regression line of y on x for the following data. Also obtain the prediction of y corresponding, on average, to x = 6.2.

x   1   2   3   4   5   6   7   8   9
y   9   8   10  12  11  13  14  16  15

 rxy = cov(x, y) / (σx σy),   Cov(X,Y) = [1/n Σ xy] – x̄ ȳ,   σx2 = (1/n Σ x2) – x̄2
 b = r σy / σx,   a = ȳ - bx̄,   then y = a + bx (substitute the values of a and b)
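A quick worked solution of this exercise using the formulas above (b = cov / σx2, a = ȳ - bx̄); the numbers printed come straight from this computation.

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 10, 12, 11, 13, 14, 16, 15]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y   # ~6.33
var_x = sum(xi * xi for xi in x) / n - mean_x ** 2                 # ~6.67

b = cov / var_x            # = r * sigma_y / sigma_x,  ~0.95
a = mean_y - b * mean_x    # ~7.25
print(f"y = {a:.2f} + {b:.2f}x")                        # y = 7.25 + 0.95x
print("prediction at x = 6.2:", round(a + b * 6.2, 2))  # 13.14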
 Linear regression
 is not applicable for most complex problems
 does not work with non-numeric data
 assumes a linear relationship
 The straight-line values can be greater than 1 and less than 0
 so they cannot be used as the probability of occurrence of the target class
Regression

 Simple regression considers the relation between a single explanatory variable and a response variable.

 Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y.

 The intent is to look at the independent effect of each variable.
Regression Modeling
 A simple regression model (one independent variable) fits a regression line in 2-dimensional space

 A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
Simple Regression Model
Regression coefficients are estimated by minimizing
∑residuals2 (i.e., sum of the squared residuals) to
derive this model:

The standard error of the regression (sY|x) is


based on the squared residuals:
Multiple Regression Model
Again, estimates for the multiple slope coefficients
are derived by minimizing ∑residuals2 to derive this
multiple regression model:

Again, the standard error of the regression is


based on the ∑residuals2:
Multiple Regression Model
 Intercept α predicts
where the regression
plane crosses the Y
axis
 Slope for variable X1
(β1) predicts the
change in Y per unit
X1 holding X2
constant
 The slope for
variable X2 (β2)
predicts the change
in Y per unit X2
holding X1 constant
Multiple Regression Model
A multiple regression
model with k independent
variables fits a regression
“surface” in k + 1
dimensional space (cannot
be visualized)
Multiple regression

 Method for analysing a linear relationship involving more than two variables: x1, x2, x3, ..., xn
 Y = a + b1x1 + b2x2 + b3x3 + ... + bnxn

Height of mother   Height of father   Height of daughter
63                 64                 58.6
67                 65                 64.7
64                 67                 66.3
…                  …                  …

Daughter Ht = 7.5 + 0.707(mother) + 0.614(father)
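The mechanics of fitting such a model with ordinary least squares can be sketched with numpy; only the three example rows above are used here, so the coefficients it prints will not match the quoted 7.5 / 0.707 / 0.614, which come from the full dataset.

import numpy as np

mother = np.array([63.0, 67.0, 64.0])
father = np.array([64.0, 65.0, 67.0])
daughter = np.array([58.6, 64.7, 66.3])

# Design matrix with an intercept column: Y = a + b1*mother + b2*father.
X = np.column_stack([np.ones_like(mother), mother, father])
(a, b1, b2), *_ = np.linalg.lstsq(X, daughter, rcond=None)
print(f"daughter ~ {a:.2f} + {b1:.3f}*mother + {b2:.3f}*father")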


Non linear regression

 The difference between linear and


nonlinear regression models isn’t
as straightforward as it sounds.
 You’d think that linear equations produce
straight lines and nonlinear equations model
curvature.
 Unfortunately, that’s not correct
 Both types of models can fit curves to your
data—so that’s not the defining characteristic
Non linear regression
 Linear model: y = a + bx
 Multiple linear regression: y = a + bx1 + cx2
 Polynomial: y = a + bx + cx²
 In each case, if you take the derivative with respect to any parameter, the result does not depend on the parameters:
   y = a + bx: dy/da = 1, dy/db = x
   y = a + bx + cx²: dy/da = 1, dy/db = x, dy/dc = x²
 A regression model is called non-linear if the derivative of the model depends on one or more parameters.
   y = a + b²x: dy/db = 2bx, i.e. the derivative depends on 'b'
 "Non-linear" refers to the parameters, not to the independent variable.
Logistic regression
 Used when the dependent variable is binary
 0/1, T/F, Y/N
 Y= a+ f1(x1)+ …….. +fn(xn)
 f1 is the function being used to transform the
predictor
Logistic regression
 Uses a logistic curve:
   p = e^(a+bx) / (1 + e^(a+bx))
 (dp/db depends on b, thus logistic regression is a non-linear regression)
 The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership
 loge(p/(1-p)) = a + bx
 The transformed dependent variable is Y = a + bx
 p is the probability of being in the class
 (1-p) is the probability that it is not
Logistic Regression
 One commonly used algorithm is Logistic Regression
 Assumes that the dependent (output) variable is
binary which is often the case in medical and other
studies. (Does person have disease or not, survive or
not, accepted or not, etc.)
 Like Quadric, Logistic Regression does a particular
non-linear transform on the data after which it just
does linear regression on the transformed data
 Logistic regression fits the data with a
sigmoidal/logistic curve rather than a line and outputs
an approximation of the probability of the output
given the input

Logistic Regression Example
 Age (X axis, input variable) – Data is fictional
 Heart Failure (Y axis, 1 or 0, output variable)
 Could use value of regression line as a probability approximation
 Extrapolates outside 0-1 and not as good empirically
 Sigmoidal curve to the right gives empirically good probability
approximation and is bounded between 0 and 1

Logistic Regression Approach
Learning
1. Transform initial input probabilities into log odds
(logit)
2. Do a standard linear regression on the logit values
 This effectively fits a logistic curve to the data, while still
just doing a linear regression with the transformed input
(ala quadric machine, etc.)
Generalization
1. Find the value for the new input on the logit line
2. Transform that logit value back into a probability
Non-Linear Pre-Process to Logit (Log Odds)

Medication Dosage   # Cured   Total Patients   Probability: # Cured / Total Patients
20                  1         5                .20
30                  2         6                .33
40                  4         6                .67
50                  6         7                .86

[Figures: Cured / Not Cured (0/1) vs. dosage on the left; cure probability vs. dosage on the right.]
Logistic Regression Approach
 Could use linear regression with the probability points, but that would not extrapolate well
 The logistic version is better, but how do we get it?
 Similar to Quadric, we do a non-linear pre-process of the input and then do linear regression on the transformed values – do a linear regression on the log odds (logit)

[Figures: cure probability (0 to 1) vs. dosage.]
Non-Linear Pre-Process to Logit (Log Odds)

Medication   # Cured   Total      Probability:      Odds: p/(1-p) =        Logit = Log Odds:
Dosage                 Patients   # Cured / Total   # cured / # not cured  ln(Odds)
20           1         5          .20               .25                    -1.39
30           2         6          .33               .50                    -0.69
40           4         6          .67               2.0                     0.69
50           6         7          .86               6.0                     1.79

[Figures: Cured / Not Cured vs. dosage on the left; cure probability vs. dosage on the right.]
Regression of Log Odds
(Dosage / cured / odds / logit table as above; the accompanying plots show the fitted logit line and the resulting probability curve.)

• y = .11x – 3.8  – the logit regression equation
• Now we have a regression line for log odds (logit)
• To generalize, we interpolate the log odds value for the new data point
• Then we transform that log odds point to a probability: p = e^logit(x) / (1 + e^logit(x))
• For example, assume we want p for dosage = 10:
  Logit(10) = .11(10) – 3.8 = -2.7
  p(10) = e^-2.7 / (1 + e^-2.7) = .06   [note that we just work backwards from logit to p]
• These p values make up the sigmoidal regression curve (which we never have to actually plot)
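The learning and generalization steps above fit in a few lines of Python; this sketch recomputes the logit line from the dosage table with a plain least-squares fit (so its slope and intercept differ slightly from the slides' rounded .11 and –3.8) and then back-transforms a prediction for dosage = 10.

import math

dosage = [20, 30, 40, 50]
cured = [1, 2, 4, 6]
total = [5, 6, 6, 7]

# 1) Non-linear pre-process: probability -> log odds (logit).
logit = [math.log(c / (t - c)) for c, t in zip(cured, total)]   # -1.39, -0.69, 0.69, 1.79

# 2) Ordinary linear regression on the logit values.
n = len(dosage)
mx, my = sum(dosage) / n, sum(logit) / n
b = sum((x - mx) * (y - my) for x, y in zip(dosage, logit)) / sum((x - mx) ** 2 for x in dosage)
a = my - b * mx
print(f"logit(x) = {b:.3f}x + {a:.2f}")              # roughly .11x - 3.7

# 3) Generalize: interpolate on the logit line, then transform back to a probability.
z = a + b * 10
print(round(math.exp(z) / (1 + math.exp(z)), 2))     # ~0.07 (the slides round to .06)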
Heart rate     50  50  50  50  70  70  90  90  90  90  90
Heart Attack   y   n   n   n   n   y   y   y   n   y   y
