
Module 3

Fitting a Model to Data


• An alternative method for learning a predictive model from a dataset is to start by
specifying the structure of the model with certain numeric parameters left
unspecified.
• Then the data mining calculates the best parameter values given a particular set of
training data.
• As examples we will present some common techniques used for predicting
(estimating) unknown numeric values, unknown binary values (such as whether a
document or web page is relevant to a query), as well as likelihoods of events, such
as default on credit, response to an offer, fraud on an account, and so on.
Classification via Mathematical Functions
• Figure 4-1 shows the instance space broken up into regions by horizontal and vertical
decision boundaries that partition it into similar regions. Examples in each
region should have similar values for the target variable.
• A main purpose of creating homogeneous regions is so that we can predict the
target variable of a new, unseen instance by determining which segment it falls
into.
• For example, in Figure 4-1, if a new customer falls into the lower-left segment, we
can conclude that the target value is very likely to be “•”. Similarly, if it falls into
the upper-right segment, we can predict its value as “+”.
• For example, we can separate the instances almost perfectly (by class) if we are
allowed to introduce a boundary that is still a straight line, but is not perpendicular
to the axes (Figure 4-3)

Figure 4-3. The dataset of Figure 4-2 with a single linear split.

• This is called a linear classifier and is essentially a weighted sum of the values for
the various attributes.
Linear Discriminant Functions
• Our goal is going to be to fit our model to the data, and to do so it is quite helpful to
represent the model mathematically. You may recall that the equation of a line in two
dimensions is y = mx + b, where m is the slope of the line and b is the y intercept (the y
value when x = 0). The line in Figure 4-3 can be expressed in this form (with Balance in
thousands) as:
• Age = (-1.5) × Balance + 60
• We would classify an instance x as + if it is above the line, and as • if it is below the
line. Rearranging this mathematically leads to the function that is the basis of all the
techniques discussed in this chapter. For this example, the classification function is
shown in Equation 4-1.
Equation 4-1. Classification function
class(x) = + if 1.0 × Age - 1.5 × Balance + 60 > 0
class(x) = • if 1.0 × Age - 1.5 × Balance + 60 ≤ 0
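• As a minimal sketch, Equation 4-1 can be written directly as code; the example customers below are made up for illustration.

def classify(age, balance_in_thousands):
    # Linear discriminant from Equation 4-1: "+" above the line, "." on or below it.
    score = 1.0 * age - 1.5 * balance_in_thousands + 60
    return "+" if score > 0 else "."

print(classify(35, 80))   # 35 - 120 + 60 = -25  ->  "."
print(classify(50, 40))   # 50 - 60 + 60  =  50  ->  "+"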
• This is called a linear discriminant because it discriminates between the classes,
and the function of the decision boundary is a linear combination, a weighted sum,
of the attributes.
• A linear discriminant function is a numeric classification model. For example,
consider our feature vector x, with the individual component features being xi. A
linear model then can be written as follows in Equation 4-2.
• Equation 4-2. A general linear model

f (x) = w0 + w1x1 + w2x2 + ⋯

Figure 4-4. A basic instance space in two dimensions containing points of two classes.
• The concrete example from Equation 4-1 can be written in this form:
• f (x) = 60 + 1.0 × Age - 1.5 × Balance
• To use this model as a linear discriminant, for a given instance represented by a
feature vector x, we check whether f(x) is positive or negative. As discussed above,
in the two-dimensional case, this corresponds to seeing whether the instance x falls
above or below the line.
• Data mining is going to “fit” this parameterized model to a particular dataset,
meaning specifically that it will find a good set of weights on the features.

Figure 4-5. Many different possible linear boundaries can separate the two
groups of points of Figure 4-4.
Optimizing an Objective Function
• Our general procedure will be to define an objective function that represents our
goal and can be calculated for a particular set of weights and a particular set of
data. We will then find the optimal value for the weights by maximizing or
minimizing the objective function, as sketched below.
• Logistic regression doesn’t really do what we call regression, which is the
estimation of a numeric target value. Logistic regression applies linear models to
class probability estimation, which is particularly useful for many applications.
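• As a hedged illustration of this general procedure, the sketch below defines an objective function (here a sum of squared errors, just one possible choice) over made-up data and hands it to a general-purpose optimizer to find the weights; it is a sketch, not the fitting routine of any particular package.

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # feature vectors (made up)
y = np.array([1.0, 2.0, 3.0, 4.0])                              # numeric targets (made up)

def objective(w):
    # f(x) = w0 + w1*x1 + w2*x2; the objective is the summed squared error on the data
    predictions = w[0] + X @ w[1:]
    return np.sum((predictions - y) ** 2)

result = minimize(objective, x0=np.zeros(3))   # search for the best weights
print(result.x)                                # learned w0, w1, w2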

An Example of Mining a Linear Discriminant from Data


• Consider the Iris dataset, from the UCI Dataset Repository (Bache & Lichman, 2013).
This is an old and fairly simple dataset representing various types of iris, a genus of flowering plant.
The original dataset includes three species of irises represented with four attributes,
and the data mining problem is to classify each instance as belonging to one of the
three species based on the attributes.
Figure 4-6. Two parts of a flower. Width measurements of these are used in
the Iris dataset
We consider just two species, Iris Setosa and Iris Versicolor. The dataset describes a
collection of flowers of these two species, each described with two measurements: the
Petal width and the Sepal width (Figure 4-6).
The data are plotted in Figure 4-7, with these two attributes on the x and y axes, respectively. Each
instance is one flower and corresponds to one dot on the graph. The filled dots are of
the species Iris Setosa and the circles are instances of the species Iris Versicolor.
• Two different separation lines are shown in the figure, one generated by logistic
regression and the second by another linear method, a support vector machine
(which will be described shortly). Note that the data comprise two fairly distinct
clumps, with a few outliers. Logistic regression separates the two classes
completely: all the Iris Versicolor examples are to the left of its line and all the Iris
Setosa to the right.
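• A rough sketch of this example with scikit-learn's copy of the Iris data is shown below. The library choice, feature indices, and model settings are assumptions, so the fitted lines will not match Figure 4-7 exactly, but both methods learn a linear boundary of the form w0 + w1x1 + w2x2 = 0.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2                     # keep only Setosa (0) and Versicolor (1)
X = iris.data[mask][:, [1, 3]]             # sepal width and petal width
y = iris.target[mask]

logreg = LogisticRegression(max_iter=1000).fit(X, y)
svm = LinearSVC(max_iter=10000).fit(X, y)

# The two methods optimize different objective functions, so the weights differ.
print(logreg.intercept_, logreg.coef_)
print(svm.intercept_, svm.coef_)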

Linear Discriminant Functions for Scoring and Ranking Instances


• Intuitively, right near the decision boundary we are most uncertain about an
instance's class (and see the discussion below on the “margin”).
• f(x) will be relatively small when x is near the boundary, and large (and positive)
when x is far from the boundary in the + direction, so sorting instances by f(x) ranks
them by confidence, as in the sketch below.
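• As a small illustration using the made-up weights of Equation 4-1 and invented (Age, Balance) instances, sorting by f(x) orders instances from most confidently + to most confidently •.

import numpy as np

w0, w = 60.0, np.array([1.0, -1.5])                      # weights from Equation 4-1
X = np.array([[30, 62], [45, 60], [25, 30], [50, 90]])   # (Age, Balance) examples (invented)
scores = w0 + X @ w                                      # f(x) for each instance
for x, s in sorted(zip(X.tolist(), scores), key=lambda pair: -pair[1]):
    print(x, round(s, 1))                                # highest score = farthest on the + side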
Support Vector Machines, Briefly
• Support vector machines are linear discriminants. For many business users
interacting with data scientists, that will be sufficient. Nevertheless, let’s look at
SVMs a little more carefully: if we can get through some minor details, the
procedure for fitting the linear discriminant is intuitively satisfying.
• The distance between the dashed parallel lines is called the margin around the
linear discriminant, and thus the objective is to maximize the margin.

Figure 4-8. The points of Figure 4-2 and the maximal margin classifier.
• The idea of maximizing the margin is intuitively satisfying for the following
reason. The training dataset is just a sample from some population. In predictive
modeling, we are interested in predicting the target for instances that we have not
yet seen. These instances will be scattered about. Hopefully they will be distributed
similarly to the training data, but they will in fact be different points. In particular,
some of the positive examples will likely fall closer to the discriminant boundary
than any positive example we have yet seen.
• The penalty for a misclassified point is proportional to the distance from the
decision boundary, so if possible the SVM will make only “small” errors.
Technically, this error function is known as hinge loss.
Figure 4-9. Two loss functions illustrated. The x axis shows the distance from the
decision boundary. The y axis shows the loss incurred by a negative instance as a
function of its distance from the decision boundary. (The case of a positive instance is
symmetric.) If the negative instance is on the negative side of the boundary, there is
no loss. If it is on the positive (wrong) side of the boundary, the different loss
functions penalize it differently.
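• The sketch below illustrates the two loss shapes of Figure 4-9 for a negative instance, as a function of its signed distance from the boundary (positive means the wrong side). Note that standard SVM hinge loss is usually written as max(0, 1 - y·f(x)), which places the hinge at the margin; the simplified version here hinges at the boundary, following the figure's description.

def hinge_like_loss(distance):
    # zero on the correct side, growing linearly with distance on the wrong side
    return max(0.0, distance)

def zero_one_loss(distance):
    # any misclassification costs 1, no matter how far past the boundary it falls
    return 1.0 if distance > 0 else 0.0

for d in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"distance={d:+.1f}  zero-one={zero_one_loss(d):.0f}  hinge={hinge_like_loss(d):.1f}")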
Regression via Mathematical Functions
• The linear regression model structure is exactly the same as for the linear
discriminant function.

f (x) = w0 + w1x1 + w2x2 + ⋯


• The linear function estimates this numeric target value using Equation 4-2, and of
course the training data have the actual target value.
• The model that fits the data best would be the model with the minimum sum of
errors on the training data, and that is essentially what regression procedures do.
• Standard linear regression procedures instead minimize the sum or mean of the
squares of these errors, which gives the procedure its common name, “least squares”
regression.
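• A minimal sketch of least squares fitting with NumPy on made-up data: the solver returns the weights that minimize the sum of squared errors on the training data.

import numpy as np

X = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 55.0], [4.0, 70.0]])   # features (made up)
y = np.array([10.0, 14.0, 15.0, 20.0])                               # numeric target (made up)

# Prepend a column of ones so the intercept w0 is learned along with w1 and w2.
X1 = np.column_stack([np.ones(len(X)), X])
w, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
print(w)   # w0, w1, w2 minimizing the sum of squared errors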
Class Probability Estimation and Logistic “Regression”
• A linear discriminant could be used to identify accounts or transactions as likely to
have been defrauded. The director of the fraud control operation may want the
analysts to focus not simply on the cases most likely to be fraud, but on the cases
where the most money is at stake—that is, accounts where the company’s monetary
loss is expected to be the highest.
• Table 4-1. Probabilities and the corresponding odds.
Probability Corresponding odds
0.5 50:50 or 1
0.9 90:10 or 9
0.999 999:1 or 999
0.01 1:99 or 0.0101
0.001 1:999 or 0.001001
• Table 4-2. Probabilities, odds, and the corresponding log-odds.
Probability Odds Log-odds
0.5 50:50 or 1 0
0.9 90:10 or 9 2.19
0.999 999:1 or 999 6.9
0.01 1:99 or 0.0101 –4.6
0.001 1:999 or 0.001001 –6.9
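• The conversions in Tables 4-1 and 4-2 are easy to reproduce: odds = p / (1 - p), and the log-odds is the natural logarithm of that ratio.

import math

for p in (0.5, 0.9, 0.999, 0.01, 0.001):
    odds = p / (1 - p)
    print(f"p={p}  odds={odds:.6g}  log-odds={math.log(odds):.2f}")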
• For probability estimation, logistic regression uses the same linear model as do our
linear discriminants for classification and linear regression for estimating numeric
target values.
• The output of the logistic regression model is interpreted as the log-odds of class
membership.
Logistic Regression: Some Technical Details

• p+(x) to represent the model’s estimate of the probability of class membership of a data item
represented by feature vector x.
• The estimated probability of the event not occurring is therefore 1 - p+(x).
• Equation 4-3. Log-odds linear function
log(p+(x) / (1 - p+(x))) = f(x) = w0 + w1x1 + w2x2 + ⋯
• Thus, Equation 4-3 specifies that for a particular data item, described by feature-vector x, the
log-odds of the class is equal to our linear function, f(x). Since often we actually want the
estimated probability of class membership, not the log-odds, we can solve for p+(x) in
Equation 4-3. This yields the not-so-pretty quantity in Equation 4-4.
• Equation 4-4. The logistic function
p+(x) = 1 / (1 + e^(-f(x)))
Figure 4-10. Logistic regression’s estimate of class probability as a function of
f(x), (i.e., the distance from the separating boundary). This curve is called a
“sigmoid” curve because of its “S” shape, which squeezes the probabilities into
their correct range (between zero and one).
• Figure 4-10 plots the estimated probability p+(x) (vertical axis) as a function of the
distance from the decision boundary (horizontal axis). The figure shows that at the
decision boundary (at distance x = 0), the probability is 0.5 (a coin toss).
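• A minimal sketch of Equation 4-4 as code, evaluated at a few distances from the boundary; the probability is exactly 0.5 when f(x) = 0.

import math

def p_plus(fx):
    # the logistic (sigmoid) function squeezes f(x) into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-fx))

for fx in (-4, -2, 0, 2, 4):
    print(fx, round(p_plus(fx), 3))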
• Consider the following function, which computes the “likelihood” that a particular
labeled example belongs to the correct class, given a set of parameters w that
produces class probability estimates p+(x):
g(x, w) = p+(x) if x is a +
g(x, w) = 1 - p+(x) if x is a •
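• A hedged sketch of this likelihood computation follows; the data and weights are invented purely to show the calculation. In practice, logistic regression searches for the w that maximizes the product of these g values over the training data (equivalently, the sum of their logarithms).

import math

def p_plus(w, x):
    fx = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-fx))

def log_likelihood(w, data):
    total = 0.0
    for x, label in data:                                  # label is "+" or "."
        g = p_plus(w, x) if label == "+" else 1.0 - p_plus(w, x)
        total += math.log(g)
    return total

data = [([1.0, 2.0], "+"), ([2.0, 0.5], "+"), ([-1.0, -1.0], "."), ([0.0, -2.0], ".")]
print(log_likelihood([0.1, 0.8, 0.6], data))               # higher is better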
Example: Logistic Regression versus Tree Induction
• A classification tree uses decision boundaries that are perpendicular to the instance
space axes (see Figure 4-1), whereas the linear classifier can use decision
boundaries of any direction or orientation (see Figure 4-3).
• A classification tree is a “piecewise” classifier that segments the instance space
recursively when it has to, using a divide-and-conquer approach.
Figure 4-11. One of the cell images from which the Wisconsin Breast Cancer
dataset was derived. (Image courtesy of Nick Street and Bill Wolberg.)
Each example describes characteristics of a cell nucleus image, which has
been labeled as either benign or malignant (cancerous), based on an expert’s
diagnosis of the cells. A sample cell image is shown in Figure 4-11.
Table 4-3. The attributes of the Wisconsin Breast Cancer dataset.
• Attribute name Description
• RADIUS Mean of distances from center to points on the perimeter
• TEXTURE Standard deviation of grayscale values
• PERIMETER Perimeter of the mass
• AREA Area of the mass
• SMOOTHNESS Local variation in radius lengths
• COMPACTNESS Computed as: perimeter²/area – 1.0
• CONCAVITY Severity of concave portions of the contour
• CONCAVE POINTS Number of concave portions of the contour
• SYMMETRY A measure of the nucleus’s symmetry
• FRACTAL DIMENSION 'Coastline approximation' – 1.0
• DIAGNOSIS (Target) Diagnosis of cell sample: malignant or benign
• Table 4-4. Linear equation learned by logistic regression on the Wisconsin Breast
Cancer dataset (see text and Table 4-3 for a description of the attributes).
• Attribute Weight (learned parameter)
• SMOOTHNESS_worst 22.3
• CONCAVE_mean 19.47
• CONCAVE_worst 11.68
• SYMMETRY_worst 4.99
• CONCAVITY_worst 2.86
• CONCAVITY_mean 2.34
• RADIUS_worst 0.25
• TEXTURE_worst 0.13
• AREA_SE 0.06
• TEXTURE_mean 0.03
• TEXTURE_SE –0.29
• COMPACTNESS_mean –7.1
• COMPACTNESS_SE –27.87
• w0 (intercept) –17.7
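• For comparison, the sketch below fits a logistic regression to scikit-learn's copy of the Wisconsin Breast Cancer data and prints the largest weights. The learned values will not match Table 4-4 (different attribute scaling, regularization, and solver), but the form of the model, an intercept plus one weight per attribute, is the same.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

weights = model.named_steps["logisticregression"].coef_[0]
for name, w in sorted(zip(data.feature_names, weights), key=lambda pair: -abs(pair[1]))[:5]:
    print(f"{name:25s} {w:+.2f}")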
Nonlinear Functions, Support Vector Machines, and Neural Networks
• In Figure 4-12 we show that such linear functions can actually represent nonlinear
models, if we include more complex features in the functions.
• The resulting model is a curved line (a parabola) in the original feature space.
Specifically, we added a new feature, Sepal width², and we also added a single data
point to the original dataset, an Iris Versicolor example.
• The two most common families of techniques that are based on fitting the
parameters of complex, nonlinear functions are nonlinear support vector machines
and neural networks.
Figure 4-12. The Iris dataset with a nonlinear feature. In this figure, logistic
regression and a support vector machine (both linear models) are provided an
additional feature, Sepal width², which allows both the freedom to create more
complex, nonlinear models (boundaries), as shown.
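• A minimal sketch of this idea, again with scikit-learn assumed: the model remains linear in its parameters, but because one of its inputs is Sepal width squared, the boundary it draws in the original two-dimensional space is curved.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
mask = iris.target < 2                             # Setosa vs. Versicolor
X = iris.data[mask][:, [1, 3]]                     # sepal width, petal width
y = iris.target[mask]

X_nonlinear = np.column_stack([X, X[:, 0] ** 2])   # add the Sepal width^2 feature
model = LogisticRegression(max_iter=1000).fit(X_nonlinear, y)
print(model.intercept_, model.coef_)               # weights on x1, x2, and x1^2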
