Module 3 - Fitting a Model to Data
In certain fields of Statistics & Economics, the bare model with unspecified
parameters is called “The Model”.
The Model: a model is a convenient fiction that necessarily glosses over
some of the details of the actual thing being modeled. It is meant to capture
the structure of the data as simply as possible.
The line in Figure 4-3 can be expressed in this form (with Balance in
thousands) as:
•Linear discriminant:
For our purposes, the important thing is that we can express the model
as a weighted sum of the attribute values.
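As a sketch of that general weighted-sum form (the weights w0…wn and attributes x1…xn are generic notation, not values taken from the text):

f(x) = w0 + w1·x1 + w2·x2 + … + wn·xn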
The concrete example from Equation 4-1 can be written in this form:
The candidate lines have very different slopes and intercepts, and each represents a
different model of the data. In fact, there are infinitely many lines (models)
that classify this training set perfectly.
We will then find the optimal values for the weights by maximizing or minimizing
the objective function.
What can easily be overlooked is that these weights are “best” only if we
believe that the objective function truly represents what we want to achieve.
Unfortunately, creating an objective function that matches the true goal of the
data mining is usually impossible.
Y = X1 + X2 + X3
Y (left-hand side): Dependent Variable / Outcome Variable / Response Variable
X1, X2, X3 (right-hand side): Independent Variable / Predictor Variable / Explanatory Variable
Linear Regression
Linear regression models the relationship between the target variable and one or
more independent variables using a straight line. The linear regression equation is
Y = a + b*X + e
where a is the intercept, b is the slope, and e is the error term.
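As an illustration, the sketch below estimates a and b by least squares on a small made-up dataset; the numbers are only for the example.

```python
import numpy as np

# Hypothetical data: X is the independent variable, Y the target.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Fit Y = a + b*X by least squares; the design matrix gets a column of ones for the intercept a.
A = np.column_stack([np.ones_like(X), X])
(a, b), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")

# e, the error term, is whatever the fitted line fails to explain.
residuals = Y - (a + b * X)
print("residuals:", np.round(residuals, 3))
```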
This type of statistical model (also known as a logit model) is often used for
classification and predictive analytics.
The model works with the log odds, the natural logarithm of the odds; the logit and
the logistic function that inverts it are:
Logit(pi) = ln(pi / (1 − pi))
pi = 1 / (1 + exp(−Logit(pi)))
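A minimal numerical check of those two formulas (the probability 0.8 is an arbitrary example value):

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Logistic function: maps log odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
z = logit(p)                 # log odds of 0.8, about 1.386
print(z, sigmoid(z))         # sigmoid(logit(p)) recovers p = 0.8
```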
Logistic regression is a misnomer
• The distinction between classification and regression is whether the value for
the target variable is categorical or numeric
• This is an old and fairly simple dataset representing various types of iris, a
genus of flowering plant.
• For this illustration we’ll use just two species of irises, Iris Setosa and Iris
Versicolor.
An Example of Mining a Linear Discriminant from Data
• The dataset describes a collection of flowers of these two species, each
described with two measurements: the Petal width and the Sepal width
(Figure 4-6).
Classifying Flowers
• The flower dataset is plotted in Figure 4-7, with these two attributes on the x and y
axes, respectively.
• Each instance is one flower and corresponds to one dot on the graph.
• The filled dots are of the species Iris Setosa and the circles are instances of
the species Iris Versicolor.
Classifying Flowers
• Two different separation lines are shown in the figure, one generated by
logistic regression and the second by another linear method, a support
vector machine (which will be described shortly).
• Note that the data comprise two fairly distinct clumps, with a few outliers.
• Logistic regression separates the two classes completely: all the Iris
Versicolor examples are to the left of its line and all the Iris Setosa
examples to the right.
• The support vector machine line is almost midway between the clumps,
though it misclassifies the starred point at (3, 1).
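The sketch below reproduces this kind of comparison with scikit-learn's bundled iris data, keeping only Iris Setosa and Iris Versicolor and the two width measurements; the exact fitted boundaries will not match Figure 4-7, since the figure's settings are not given here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2            # class 0 = setosa, class 1 = versicolor
X = iris.data[mask][:, [1, 3]]    # columns 1 and 3: sepal width, petal width (cm)
y = iris.target[mask]

# Two linear discriminants fit to the same data.
logreg = LogisticRegression().fit(X, y)
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

for name, model in [("logistic regression", logreg), ("linear SVM", svm)]:
    w0, w1 = model.coef_[0]
    b = model.intercept_[0]
    print(f"{name}: {w0:.2f}*sepal_width + {w1:.2f}*petal_width + {b:.2f} = 0, "
          f"training accuracy {model.score(X, y):.2f}")
```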
Ranking:
• Tree induction
• Linear discriminant functions (e.g., linear regressions, logistic regressions, SVMs)
• Ranking is free (these models already produce a score that orders the examples)
Class Probability Estimation:
• Tree induction
• Logistic regression
The many faces of classification:
Classification / Probability Estimation / Ranking
Increasing difficulty
Ranking:
• Business context determines the number of actions (“how far down the
list”)
Probability:
• You can always rank / classify if you have probabilities!
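A tiny sketch of that point: given estimated probabilities (the customer names and numbers below are made up), sorting yields a ranking and thresholding yields a classification.

```python
import numpy as np

customers = ["A", "B", "C", "D", "E"]
p_churn = np.array([0.15, 0.80, 0.55, 0.05, 0.70])   # hypothetical probability estimates

# Ranking: order customers from most to least likely to churn.
ranking = [customers[i] for i in np.argsort(-p_churn)]
print("ranking:", ranking)                            # ['B', 'E', 'C', 'A', 'D']

# Classification: apply a threshold chosen by the business context.
threshold = 0.5
labels = {c: ("churn" if p >= threshold else "stay") for c, p in zip(customers, p_churn)}
print("labels:", labels)
```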
Ranking: Examples
• Search engines
• Whether a document is relevant to a topic / query
Class Probability Estimation: Examples
• MegaTelCo
• Ranking vs. Class Probability Estimation.
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
[Figure: candidate linear decision boundaries B1 and B2 for the same data]

Recall that the distance from a point (x0, y0) to the line Ax + By + c = 0 is
|A·x0 + B·y0 + c| / sqrt(A² + B²).

Let the decision boundary be the hyperplane H: w • x + b = 0, with the margin hyperplanes
H1: w • x + b = +1 and H2: w • x + b = −1.

The distance between H and H1 is |w • x + b| / ||w|| = 1 / ||w||,
so the distance between H1 and H2 is 2 / ||w||.

The classifier is:
f(x) = +1 if w • x + b ≥ +1
f(x) = −1 if w • x + b ≤ −1

Margin = 2 / ||w||
Support Vector Machines
B1
B2
b21
b22
margin
b11
b12
We want to maximize: Margin = 2 / ||w||
which is equivalent to minimizing: L(w) = ||w||² / 2
subject to the following constraints:
w • xi + b ≥ +1 if yi = +1
w • xi + b ≤ −1 if yi = −1
(equivalently, yi (w • xi + b) ≥ 1 for every training example i)
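As a hedged sketch (not from the text), the quantities above can be read off a fitted linear SVM: scikit-learn exposes the learned w and b, from which the margin 2/||w|| follows. The synthetic blobs below are arbitrary but linearly separable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Well-separated two-class data so the constraints can be satisfied exactly.
X, y = make_blobs(n_samples=40, centers=[[0, 0], [4, 4]], cluster_std=0.6, random_state=0)

# A large C approximates a hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", np.round(w, 3), " b =", round(float(b), 3),
      " margin =", round(2 / np.linalg.norm(w), 3))

# Sanity check: support vectors lie on the margin hyperplanes, so |w . x + b| is about 1.
print(np.round(np.abs(X[clf.support_] @ w + b), 2))
```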
Hinge Loss functions
• The hinge loss is a special type of cost function that not only
penalizes misclassified samples but also correctly classified
ones that are within a defined margin from the decision
boundary.
• Hinge loss incurs no penalty for an example that is not on the
wrong side of the margin.
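A minimal sketch of the standard hinge loss, max(0, 1 − y·f(x)) with y in {−1, +1}; the scores below are made-up values chosen to show each case.

```python
import numpy as np

def hinge_loss(y, score):
    """Standard hinge loss: zero once an example is on the correct side of the margin."""
    return np.maximum(0.0, 1.0 - y * score)

y = np.array([+1, +1, +1, -1])            # true labels
score = np.array([2.0, 0.5, -0.3, 0.4])   # f(x) = w . x + b for each example

# 2.0  -> loss 0.0 (correct and beyond the margin)
# 0.5  -> loss 0.5 (correctly classified but inside the margin)
# -0.3 -> loss 1.3 (misclassified)
# 0.4  -> loss 1.4 (misclassified: the label is -1 but the score is positive)
print(hinge_loss(y, score))
```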
• SVM can handle irrelevant and redundant attributes better than many
other techniques.
• The user needs to provide the type of kernel function and cost function.
• Difficult to handle missing values.
Simple Neural Network
Non-linear Functions
• Linear functions can actually represent nonlinear models, if we include
more complex features in the functions
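A small sketch of that idea: a model that stays linear in its weights, but is given an engineered x² feature, fits a parabola that a plain straight line cannot (the data and feature choice are illustrative).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.2, size=x.size)   # target is roughly quadratic in x

X_linear = x.reshape(-1, 1)                     # only the raw feature x
X_quad = np.column_stack([x, x**2])             # add the more complex feature x^2

print("R^2 with x only:   ", LinearRegression().fit(X_linear, y).score(X_linear, y))
print("R^2 with x and x^2:", LinearRegression().fit(X_quad, y).score(X_quad, y))
# The second model is still a weighted sum of its features, yet it represents a nonlinear curve in x.
```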