Module 3.1
Figure 4-3. The dataset of Figure 4-2 with a single linear split.
• This is called a linear classifier and is essentially a weighted sum of the values for
the various attributes.
Linear Discriminant Functions
• Our goal is going to be to fit our model to the data, and to do so it is quite helpful to
represent the model mathematically. You may recall that the equation of a line in two
dimensions is y = mx + b, where m is the slope of the line and b is the y intercept (the y
value when x = 0). The line in Figure 4-3 can be expressed in this form (with Balance in
thousands) as:
• Age = ( - 1.5) × Balance + 60
• We would classify an instance x as a + if it is above the line, and as a • if it is below the
line. Rearranging this mathematically leads to the function that is the basis of all the
techniques discussed in this chapter. First, for this example, the classification function is
shown in Equation 4-1.
Equation 4-1. Classification function

class(x) = + if 1.0 × Age - 1.5 × Balance + 60 > 0
           • if 1.0 × Age - 1.5 × Balance + 60 ≤ 0
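• To make Equation 4-1 concrete, here is a minimal Python sketch of the same rule; the function name and the use of Balance in thousands are illustrative choices, not code from the original example.

```python
def classify(age, balance_in_thousands):
    """Equation 4-1: the sign of the weighted sum decides the class."""
    f = 1.0 * age - 1.5 * balance_in_thousands + 60
    return "+" if f > 0 else "•"   # "•" is the other (filled-dot) class

print(classify(40, 20))   # "+"  (40 - 30 + 60 = 70 > 0)
print(classify(40, 80))   # "•"  (40 - 120 + 60 = -20 ≤ 0)
```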
• This is called a linear discriminant because it discriminates between the classes,
and the function of the decision boundary is a linear combination (a weighted sum)
of the attributes.
• A linear discriminant function is a numeric classification model. For example,
consider our feature vector x, with the individual component features being xi. A
linear model then can be written as follows in Equation 4-2.
• Equation 4-2. A general linear model

f(x) = w0 + w1x1 + w2x2 + ⋯
Figure 4-4. A basic instance space in two dimensions containing points of two classes.
• The concrete example from Equation 4-1 can be written in this form:
• f (x) = 60 + 1.0 × Age - 1.5 × Balance
• To use this model as a linear discriminant, for a given instance represented by a
feature vector x, we check whether f(x) is positive or negative. As discussed above,
in the two-dimensional case, this corresponds to seeing whether the instance x falls
above or below the line.
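• As a sketch of the general form in Equation 4-2, the same check can be written with a weight vector and a dot product (NumPy is assumed; the numbers simply reuse the Age/Balance example above).

```python
import numpy as np

def f(x, w0, w):
    """General linear model (Equation 4-2): f(x) = w0 + w1*x1 + w2*x2 + ..."""
    return w0 + np.dot(w, x)

w0, w = 60.0, np.array([1.0, -1.5])   # intercept and weights on (Age, Balance in thousands)
x = np.array([40.0, 80.0])            # Age = 40, Balance = $80,000
print(f(x, w0, w))                    # -20.0: negative, so the "•" class
```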
• The data mining procedure will “fit” this parameterized model to a particular dataset,
meaning specifically that it will find a good set of weights on the features.
Figure 4-5. Many different possible linear boundaries can separate the two
groups of points of Figure 4-4.
Optimizing an Objective Function
• Our general procedure will be to define an objective function that represents our
goal, and can be calculated for a particular set of weights and a particular set of
data. We will then find the optimal value for the weights by maximizing or
minimizing the objective function.
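• As a minimal sketch of this idea (not the procedure any particular package uses), the objective below is squared error for a linear model, minimized with plain gradient descent; the learning rate and step count are arbitrary choices.

```python
import numpy as np

def objective(w, X, y):
    """Squared-error objective for a linear model (intercept column already in X)."""
    return np.sum((X @ w - y) ** 2)

def fit(X, y, steps=2000, lr=0.01):
    """Find weights that approximately minimize the objective by gradient descent."""
    X = np.column_stack([np.ones(len(X)), X])    # prepend a column of 1s for w0
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the mean squared error
        w -= lr * grad
    return w                                     # w[0] is the intercept
```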
• Logistic regression doesn’t really do what we call regression, which is the
estimation of a numeric target value. Logistic regression applies linear models to
class probability estimation, which is particularly useful for many applications.
Figure 4-8. The points of Figure 4-2 and the maximal margin classifier.
• The idea of maximizing the margin is intuitively satisfying for the following
reason. The training dataset is just a sample from some population. In predictive
modeling, we are interested in predicting the target for instances that we have not
yet seen. These instances will be scattered about. Hopefully they will be distributed
similarly to the training data, but they will in fact be different points. In particular,
some of the positive examples will likely fall closer to the discriminant boundary
than any positive example we have yet seen.
• The penalty for a misclassified point is proportional to the distance from the
decision boundary, so if possible the SVM will make only “small” errors.
Technically, this error function is known as hinge loss.
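• A small sketch of the two loss functions in Figure 4-9, written as functions of the signed distance from the boundary (positive means the correct side); the specific margin values are made up for illustration.

```python
import numpy as np

def zero_one_loss(margin):
    """1 for a misclassified point, 0 otherwise; ignores how far off the point is."""
    return np.where(margin > 0, 0.0, 1.0)

def hinge_loss(margin):
    """Zero once the point is on the correct side by a sufficient margin;
    grows linearly the farther the point is on the wrong side."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.array([2.0, 0.5, -1.0, -3.0])
print(zero_one_loss(margins))   # [0. 0. 1. 1.]
print(hinge_loss(margins))      # [0.  0.5 2.  4. ]
```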
Figure 4-9. Two loss functions illustrated. The x axis shows the distance from the
decision boundary. The y axis shows the loss incurred by a negative instance as a
function of its distance from the decision boundary. (The case of a positive instance is
symmetric.) If the negative instance is on the negative side of the boundary, there is
no loss. If it is on the positive (wrong) side of the boundary, the different loss
functions penalize it differently.
Regression via Mathematical Functions
• The linear regression model structure is exactly the same as for the linear
discriminant function.
• We use p+(x) to represent the model’s estimate of the probability of class membership of a
data item represented by feature vector x.
• The estimated probability of the event not occurring is therefore 1 - p+(x).
• Equation 4-3. Log-odds linear function

log( p+(x) / (1 - p+(x)) ) = f(x) = w0 + w1x1 + w2x2 + ⋯
• Thus, Equation 4-3 specifies that for a particular data item, described by feature-vector x, the
log-odds of the class is equal to our linear function, f(x). Since often we actually want the
estimated probability of class membership, not the log-odds, we can solve for p+(x) in
Equation 4-3. This yields the not-so-pretty quantity in Equation 4-4.
• Equation 4-4. The logistic function

p+(x) = 1 / (1 + e^(-f(x)))
Figure 4-10. Logistic regression’s estimate of class probability as a function of
f(x), (i.e., the distance from the separating boundary). This curve is called a
“sigmoid” curve because of its “S” shape, which squeezes the probabilities into
their correct range (between zero and one).
• Figure 4-10 plots the estimated probability p+(x) (vertical axis) as a function of the
distance from the decision boundary (horizontal axis). The figure shows that at the
decision boundary (at distance x = 0), the probability is 0.5 (a coin toss).
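• A quick sketch of Equation 4-4 confirms the coin-toss value at the boundary (NumPy assumed).

```python
import numpy as np

def p_plus(f_x):
    """Equation 4-4: the logistic (sigmoid) function mapping f(x) to a probability."""
    return 1.0 / (1.0 + np.exp(-f_x))

print(p_plus(0.0))    # 0.5, a coin toss at the decision boundary
print(p_plus(3.0))    # ~0.95, well onto the positive side
print(p_plus(-3.0))   # ~0.05, well onto the negative side
```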
• Consider the following function, which computes the “likelihood” that a particular
labeled example belongs to the correct class, given a set of parameters w that produces
class probability estimates p+(x):

g(x, w) = p+(x) if x is a +
          1 - p+(x) if x is a •
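• Summing the logs of g over a labeled training set gives the log-likelihood that logistic regression maximizes. A minimal sketch, assuming the labels y are coded 1 for the + class and 0 for the • class:

```python
import numpy as np

def log_likelihood(w0, w, X, y):
    """Sum of log g(x, w) over the data: log p+(x) for positive examples,
    log(1 - p+(x)) for the others. Training searches for w that maximizes this."""
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))          # p+(x) for every row of X
    return np.sum(np.where(y == 1, np.log(p), np.log(1.0 - p)))
```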
Example: Logistic Regression versus Tree Induction
• A classification tree uses decision boundaries that are perpendicular to the instance
space axes (see Figure 4-1), whereas the linear classifier can use decision
boundaries of any direction or orientation (see Figure 4-3).
• A classification tree is a “piecewise” classifier that segments the instance space
recursively when it has to, using a divide-and-conquer approach.
Figure 4-11. One of the cell images from which the Wisconsin Breast Cancer
dataset was derived. (Image courtesy of Nick Street and Bill Wolberg.)
Each example describes characteristics of a cell nucleus image, which has
been labeled as either benign or malignant (cancerous), based on an expert’s
diagnosis of the cells. A sample cell image is shown in Figure 4-11.
Table 4-3. The attributes of the Wisconsin Breast Cancer dataset.
• Attribute name Description
• RADIUS Mean of distances from center to points on the perimeter
• TEXTURE Standard deviation of grayscale values
• PERIMETER Perimeter of the mass
• AREA Area of the mass
• SMOOTHNESS Local variation in radius lengths
• COMPACTNESS Computed as: perimeter²/area – 1.0
• CONCAVITY Severity of concave portions of the contour
• CONCAVE POINTS Number of concave portions of the contour
• SYMMETRY A measure of the symmetry of the nucleus
• FRACTAL DIMENSION 'Coastline approximation' – 1.0
• DIAGNOSIS (Target) Diagnosis of cell sample: malignant or benign
• Table 4-4. Linear equation learned by logistic regression on the Wisconsin Breast
Cancer dataset (see text and Table 4-3 for a description of the attributes).
• Attribute Weight (learned parameter)
• SMOOTHNESS_worst 22.3
• CONCAVE_mean 19.47
• CONCAVE_worst 11.68
• SYMMETRY_worst 4.99
• CONCAVITY_worst 2.86
• CONCAVITY_mean 2.34
• RADIUS_worst 0.25
• TEXTURE_worst 0.13
• AREA_SE 0.06
• TEXTURE_mean 0.03
• TEXTURE_SE –0.29
• COMPACTNESS_mean –7.1
• COMPACTNESS_SE –27.87
• w0 (intercept) –17.7
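• As a rough illustration (not the exact run behind Table 4-4), scikit-learn ships a copy of the Wisconsin Diagnostic Breast Cancer data, so a comparable model can be fit in a few lines; the learned weights will differ from the table because of scaling and solver details.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# The learned weights play the same role as Table 4-4 (values will not match it).
weights = model.named_steps["logisticregression"].coef_[0]
top = sorted(zip(data.feature_names, weights), key=lambda nw: -abs(nw[1]))[:5]
for name, w in top:
    print(f"{name}: {w:.2f}")
```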
Nonlinear Functions, Support Vector Machines, and Neural Networks
• In Figure 4-12 we show that such linear functions can actually represent nonlinear
models, if we include more complex features in the functions.
• Here both linear models are given an additional feature, Sepal width², so the resulting
model is a curved line (a parabola) in the original feature space. We also added a single
data point to the original dataset, an Iris Versicolor example.
• The two most common families of techniques that are based on fitting the
parameters of complex, nonlinear functions are nonlinear support vector machines
and neural networks.
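• For instance, with scikit-learn (just one possible library), swapping a linear kernel for a nonlinear RBF kernel is a one-argument change; the Iris data here simply stands in for any dataset.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
linear_svm = SVC(kernel="linear").fit(X, y)          # linear decision boundaries
rbf_svm = SVC(kernel="rbf").fit(X, y)                # nonlinear boundaries via the kernel trick
print(linear_svm.score(X, y), rbf_svm.score(X, y))   # training accuracy of each
```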
Figure 4-12. The Iris dataset with a nonlinear feature. In this figure, logistic
regression and a support vector machine (both linear models) are provided an
additional feature, Sepal width², which gives both the freedom to create more
complex, nonlinear models (boundaries), as shown.
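• A minimal sketch of the same trick, assuming scikit-learn’s copy of the Iris data and treating Iris Versicolor versus the rest as the target: adding Sepal width² as an extra feature lets a linear model draw a parabolic boundary in the original two dimensions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, [0, 1]]                   # sepal length, sepal width
y = (iris.target == 1).astype(int)         # 1 = Iris Versicolor, 0 = other species

X_sq = np.column_stack([X, X[:, 1] ** 2])  # add Sepal width^2 as a third feature

flat = LogisticRegression(max_iter=1000).fit(X, y)        # straight-line boundary
curved = LogisticRegression(max_iter=1000).fit(X_sq, y)   # parabola in the original space
print(flat.score(X, y), curved.score(X_sq, y))
```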