
Module 3

Fitting a Model to Data


What is a model?

In certain fields of Statistics & Economics, the bare model with unspecified
parameters is called “The Model”.

The Model: the model is a convenient fiction that necessarily glosses over
some of the details of the actual thing being modeled. It is meant to capture
the structure of the data as simply as possible.

The basic structure of a statistical model is:


data = model + error

We generally describe a model in terms of its parameters, which are values that we can change in order to modify the predictions of the model.
What is Model Fitting?
Model fitting measures how well a machine learning model generalizes to data similar to the data on which it was trained. The fitting process is generally built into models and is automatic. A well-fit model accurately approximates the output when given new data, producing more precise results.
Classification via Mathematical Functions
Decision Boundaries
A main purpose of creating homogeneous regions is that we can predict the target variable of a new, unseen instance by determining which segment it falls into.

For example, in Figure 4-1, if a new customer falls into the lower-left segment, we can conclude that the target value is very likely to be "•". Similarly, if it falls into the upper-right segment, we can predict its value as "+".
Classification via Mathematical Functions

[Figure panels: Instance Space; Linear Classifier]
Linear Discriminant Functions
Our goal is going to be to fit our model to the data, and to do so it is quite helpful to represent the model mathematically.

You may recall that the equation of a line in two dimensions is y = mx + b, where m is the slope of the line and b is the y intercept (the y value when x = 0).

The line in Figure 4-3 can be expressed in this form (with Balance in thousands) as:

Age = 1.5 × Balance − 60   (equivalently, 1.0 × Age − 1.5 × Balance + 60 = 0, as in the discriminant below)

We would classify an instance x as a + if it is above the line, and as a • if it is below the line.
For this example decision boundary, the classification solution is the linear discriminant:

class(x) = +  if 1.0 × Age − 1.5 × Balance + 60 > 0
class(x) = ●  if 1.0 × Age − 1.5 × Balance + 60 ≤ 0

We now have a parameterized model: the weights of the linear function are the parameters.

For our purposes, the important thing is that we can express the model as a weighted sum of the attribute values.

The weights are often loosely interpreted as importance indicators of the features.
Thus, this linear model is a different sort of multivariate supervised segmentation.

For example, consider our feature vector x, with the individual component features being xi. A linear model then can be written as follows in Equation 4-2.

Equation 4-2. A general linear model:

f(x) = w0 + w1·x1 + w2·x2 + ⋯

The concrete example from Equation 4-1 can be written in this form:

f(x) = 60 + 1.0 × Age − 1.5 × Balance

To use this model as a linear discriminant, for a given instance represented by a feature vector x, we check whether f(x) is positive or negative. In the two-dimensional case, this corresponds to seeing whether the instance x falls above or below the line.
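A minimal sketch in Python of using Equation 4-2 as a linear discriminant, with the Age/Balance weights from the example above; the helper names are our own.

```python
import numpy as np

def f(x, w0, w):
    """Equation 4-2: a weighted sum of the attribute values plus an intercept."""
    return w0 + np.dot(w, x)

def classify(x, w0, w):
    """Linear discriminant: '+' if f(x) is positive, '•' otherwise."""
    return "+" if f(x, w0, w) > 0 else "•"

# Weights from the concrete example above: f(x) = 60 + 1.0*Age - 1.5*Balance
# (Balance in thousands), so x = [Age, Balance].
w0, w = 60.0, np.array([1.0, -1.5])

print(classify(np.array([50.0, 40.0]), w0, w))   # f = 60 + 50 - 60 = 50   -> '+'
print(classify(np.array([25.0, 90.0]), w0, w))   # f = 60 + 25 - 135 = -50 -> '•'
```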
In Figure 4-5, there actually are many different linear discriminants that
can separate the classes perfectly.

They have very different slopes and intercepts, and each represents a
different model of the data. In fact, there are infinitely many lines (models)
that classify this training set perfectly.

Which should we pick?


Optimizing an Objective Function
What should be our goal or objective in choosing the parameters?

Our general procedure will be to define an objective function that represents our goal and can be calculated for a particular set of weights and a particular set of data. We will then find the optimal value for the weights by maximizing or minimizing the objective function.

What can easily be overlooked is that these weights are "best" only if we believe that the objective function truly represents what we want to achieve. Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible.

Several choices have been shown to be remarkably effective. One of these choices creates the so-called "support vector machine."

Linear regression, logistic regression, and support vector machines are all very similar instances of our basic fundamental technique: fitting a (linear) model to data.

The key difference is that each uses a different objective function.
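A rough sketch of that difference using scikit-learn (assumed available): on the same made-up data, three estimators all learn a weighted linear function (coef_ and intercept_), but each chooses its weights by optimizing a different objective.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                          # two made-up features
y = (X[:, 0] - 1.5 * X[:, 1] + 0.6 > 0).astype(int)    # a linearly separable label

# All three models are "a weighted sum of the attribute values";
# they differ in the objective function used to pick the weights.
models = {
    "least squares (squared error)": LinearRegression(),
    "logistic regression (log loss)": LogisticRegression(),
    "linear SVM (hinge loss / max margin)": SVC(kernel="linear"),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "weights:", np.ravel(model.coef_), "intercept:", np.ravel(model.intercept_))
```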


Regression

• Technique used for the modeling and analysis of numerical data.
• Exploits the relationship between two or more variables so that we can gain information about one of them through knowing values of the other.
• Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.

Y = X1 + X2 + X3

Names for Y              Names for X1, X2, X3
Dependent Variable       Independent Variable
Outcome Variable         Predictor Variable
Response Variable        Explanatory Variable
Linear Regression
Linear regression is the type of regression that forms a relationship between the target variable and one or more independent variables using a straight line. The following equation represents the linear regression model:

Y = a + b*X + e

where a represents the intercept, b represents the slope of the regression line, e represents the error, and X and Y represent the predictor and target variables, respectively.

If X consists of more than one variable, the model is termed multiple linear regression.

In linear regression, the best-fit line is found using the least squares method, which minimizes the total sum of the squares of the deviations from each data point to the regression line. Here, the positive and negative deviations do not cancel out because all the deviations are squared.
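A minimal sketch of a least squares fit with NumPy; the X and Y values are made up for illustration, and a is the intercept, b the slope.

```python
import numpy as np

# Made-up data for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# Least squares: choose a and b to minimize sum((Y - (a + b*X))**2).
# np.polyfit solves this in closed form; it returns coefficients in
# decreasing powers, so [slope, intercept] for a degree-1 fit.
b, a = np.polyfit(X, Y, deg=1)

residuals = Y - (a + b * X)          # the error term e for each point
print("intercept a:", a, "slope b:", b)
print("sum of squared errors:", np.sum(residuals ** 2))
```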
Linear Regression Line
A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.

Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a negative linear relationship.
Logistic Regression

This type of statistical model (also known as the logit model) is often used for classification and predictive analytics.

Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.

In logistic regression, a logit transformation is applied to the odds, that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of the odds. The logit and its inverse, the logistic (sigmoid) function, are given by the following formulas:

logit(pi) = ln(pi / (1 − pi)) = β0 + β1·x1 + ⋯ + βk·xk

pi = 1 / (1 + exp(−(β0 + β1·x1 + ⋯ + βk·xk)))
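A small sketch of the two formulas above; the function names are our own.

```python
import numpy as np

def logit(p):
    """Log odds: the natural logarithm of p / (1 - p)."""
    return np.log(p / (1 - p))

def logistic(z):
    """Logistic (sigmoid) function, the inverse of the logit; bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

p = 0.8
z = logit(p)                   # log odds of p = 0.8 is ln(4), about 1.386
print(z, logistic(z))          # applying the inverse recovers 0.8
```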
Logistic regression is a misnomer
• The distinction between classification and regression is whether the value of the target variable is categorical or numeric.
• For logistic regression, the model produces a numeric estimate.
• However, the values of the target variable in the data are categorical.
• Logistic regression estimates the probability of class membership (a numeric quantity) over a categorical class.
• Logistic regression is a class probability estimation model and not a regression model.
An Example of Mining a Linear Discriminant from Data

• To illustrate linear discriminant functions, we use an adaptation of the Iris dataset.
• This is an old and fairly simple dataset representing various types of iris, a genus of flowering plant.
• The original dataset includes three species of irises represented with four attributes, and the data mining problem is to classify each instance as belonging to one of the three species based on the attributes.
• For this illustration we'll use just two species of irises, Iris Setosa and Iris Versicolor.

An Example of Mining a Linear Discriminant from Data
• The dataset describes a collection of flowers of these two species, each described with two measurements: the Petal width and the Sepal width (Figure 4-6).
Classifying Flowers
• The flower dataset is plotted in Figure 4-7, with these two attributes on the x and y axes, respectively.
• Each instance is one flower and corresponds to one dot on the graph.
• The filled dots are of the species Iris Setosa and the circles are instances of the species Iris Versicolor.
Classifying Flowers
• Two different separation lines are shown in the figure, one generated by logistic regression and the second by another linear method, a support vector machine (which will be described shortly).
• Note that the data comprise two fairly distinct clumps, with a few outliers. Logistic regression separates the two classes completely: all the Iris Versicolor examples are to the left of its line and all the Iris Setosa to the right.
• The support vector machine line is almost midway between the clumps, though it misclassifies the starred point at (3, 1).
• Notice that the methods produce different boundaries because they're optimizing different functions.
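A rough sketch of this comparison with scikit-learn (assumed available). As in the figures, we keep only Iris Setosa and Iris Versicolor and use petal width and sepal width; the fitted coefficients will differ in detail from the lines in the figure.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2                      # keep Iris Setosa (0) and Iris Versicolor (1)
X = iris.data[mask][:, [3, 1]]              # petal width (cm), sepal width (cm)
y = iris.target[mask]

for model in (LogisticRegression(), LinearSVC(max_iter=10000)):
    model.fit(X, y)
    w, b = model.coef_[0], model.intercept_[0]
    # Each decision boundary is the line w[0]*petal_width + w[1]*sepal_width + b = 0;
    # the two methods give different lines because they optimize different objectives.
    print(type(model).__name__, "weights:", w, "intercept:", b)
```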
Ranking Instances and Probability Class Estimation
• In many applications, we don't simply want a yes or no prediction of whether an instance belongs to the class; we want some notion of which examples are more or less likely to belong to the class. For example:
• Which consumers are most likely to respond to this offer?
• Which customers are most likely to leave when their contracts expire?

• Ranking
• Tree induction
• Linear discriminant functions (e.g., linear regressions, logistic regressions,
SVMs)
• Ranking is free
• Class Probability Estimation
• Tree induction
• Logistic regression
The many faces of classification:
Classification / Probability Estimation / Ranking

Increasing difficulty: Classification → Ranking → Probability estimation

Ranking:
• Business context determines the number of actions (“how far down the
list”)

Probability:
• You can always rank / classify if you have probabilities!
Ranking: Examples
• Search engines
• Whether a document is relevant to a topic / query
Class Probability Estimation: Examples
• MegaTelCo
• Ranking vs. Class Probability Estimation.

• Identify accounts or transactions as likely to have been defrauded.
• The director of the fraud control operation may want the analysts to focus not simply on the cases most likely to be fraud, but on accounts where the expected monetary loss is higher.
• We need to estimate the actual probability of fraud.


Logistic regression (“sigmoid”) curve
Application of Logistic Regression
• The Wisconsin Breast Cancer Dataset

Wisconsin Breast Cancer dataset
• From each of these basic characteristics, three values were computed: the mean (_mean), the standard error (_SE), and the "worst" or largest value (_worst).
Support Vector Machines
In machine learning, support vector machines (SVMs, also support vector networks) are
supervised learning models with associated learning algorithms that analyze data for
classification and regression analysis.

Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines

[Figures: several possible linear decision boundaries, labeled B1 and B2, each of which separates the training data]

Which one is better, B1 or B2? How do you define "better"?
Definitions

Define the hyperplane H such that:
    xi · w + b ≥ +1 when yi = +1
    xi · w + b ≤ −1 when yi = −1

H1 and H2 are the planes:
    H1: xi · w + b = +1
    H2: xi · w + b = −1

The points on the planes H1 and H2 are the support vectors.

d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point

The margin of a separating hyperplane is d+ + d−.
Maximizing the margin
We want a classifier with as big a margin as possible.

Recall that the distance from a point (x0, y0) to the line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²).

The distance between H and H1 is: |w · x + b| / ||w|| = 1 / ||w||
The distance between H1 and H2 is: 2 / ||w||

In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
    xi · w + b ≥ +1 when yi = +1
    xi · w + b ≤ −1 when yi = −1
These can be combined into yi(xi · w + b) ≥ 1.
Optimization Problem
Support Vector Machines

For the boundary B1, the separating hyperplane is w · x + b = 0, with the margin hyperplanes w · x + b = −1 and w · x + b = +1 (labeled b11 and b12 in the figure).

f(x) = +1 if w · x + b ≥ 1
f(x) = −1 if w · x + b ≤ −1

Margin = 2 / ||w||
Support Vector Machines

[Figure: boundaries B1 and B2 with their margin hyperplanes (b11, b12 and b21, b22) and the margin of B1]

Find the hyperplane that maximizes the margin => B1 is better than B2.
Support Vector Machines

We want to maximize: Margin = 2 / ||w||

which is equivalent to minimizing: L(w) = ||w||² / 2

subject to the following constraints:
    w · xi + b ≥ +1 if yi = +1
    w · xi + b ≤ −1 if yi = −1

This is a constrained optimization problem; numerical approaches (e.g., quadratic programming) are used to solve it.
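A small sketch that solves this problem numerically with scikit-learn's linear SVM (the hard margin is approximated with a large C on made-up, separable data), then recovers w, b, and the margin 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (made-up data), labels in {-1, +1}.
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem:
# minimize ||w||^2 / 2 subject to y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w:", w, "b:", b)
print("margin 2/||w||:", 2.0 / np.linalg.norm(w))
print("support vectors (the points on H1 and H2):")
print(clf.support_vectors_)
```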
Loss Functions
• Loss functions define what a good prediction is and isn't. In short, choosing the right loss function dictates how well your estimator performs.
• Loss functions measure how far an estimated value is from its true value.
• A loss function maps decisions to their associated costs.
• Loss functions are not fixed; they change depending on the task at hand and the goal to be met.
Loss Functions
• Zero-one loss assigns a loss of zero for a correct decision and one for an incorrect decision.
• Squared error specifies a loss proportional to the square of the distance from the boundary.
• Squared error loss usually is used for numeric value prediction (regression), rather than classification.
• The squaring of the error has the effect of greatly penalizing predictions that are grossly wrong.
Hinge Loss functions
• In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification.
• Support vector machines use hinge loss.
• The hinge loss is a special type of cost function that not only penalizes misclassified samples but also correctly classified ones that are within a defined margin from the decision boundary.

Hinge Loss functions
• Hinge loss incurs no penalty for an example that is not on the wrong side of the margin.
• The hinge loss only becomes positive when an example is on the wrong side of the boundary and beyond the margin.
• Loss then increases linearly with the example's distance from the margin.
• It penalizes points more the farther they are from the separating boundary, as in the sketch below.
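A compact sketch of the three losses for a single example with true label y in {−1, +1} and raw score f(x); the function names are our own.

```python
def zero_one_loss(y, score):
    """Zero for a correct decision (same sign as y), one for an incorrect one."""
    return 0.0 if y * score > 0 else 1.0

def squared_loss(y, score):
    """Loss proportional to the squared difference between target and prediction."""
    return (y - score) ** 2

def hinge_loss(y, score):
    """No penalty beyond the margin on the correct side (y * score >= 1);
    otherwise the loss grows linearly with the distance from the margin."""
    return max(0.0, 1.0 - y * score)

y = +1
for score in (2.0, 0.5, -1.0):   # confidently correct, inside the margin, wrong side
    print(score, zero_one_loss(y, score), squared_loss(y, score), hinge_loss(y, score))
```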
Characteristics of SVM
The learning problem is formulated as a convex optimization problem
• Efficient algorithms are available to find the global minima.
• Many of the other methods use greedy approaches and find locally
optimal solutions.
• High computational complexity for building the model.
• Robust to noise.

• Overfitting is handled by maximizing the margin of the decision boundary.

• SVM can handle irrelevant and redundant attributes better than many
other techniques.

• The user needs to provide the type of kernel function and cost function.
• Difficult to handle missing values.
Simple Neural Network
Non-linear Functions
• Linear functions can actually represent nonlinear models if we include more complex features in the functions, as sketched below.