PDA Unit-3 (Full Unit)

Unit-3

Error-Based Learning
► Simple Linear Regression
• The table below shows a simple dataset recording the rental price (in Euro per month) of
Dublin city-center offices (RENTAL PRICE), along with a number of descriptive
features that are likely to be related to rental price: the SIZE of the office (in square
feet), the FLOOR in the building in which the office space is located, the
BROADBAND rate available at the office (in Mb per second), and the ENERGY
RATING of the building in which the office space is located (ratings range from A to
C, where A is the most efficient). We look at the ways in which all these descriptive
features can be used to train an error-based model to predict office rental prices.
Initially, though, we will focus on a simplified version of this task in which just
SIZE is used to predict RENTAL PRICE.
• Figure 7.1(a) shows a scatter plot of the office rentals dataset with RENTAL
PRICE on the vertical (or y) axis and SIZE on the horizontal (or x) axis. From this
plot, it is clear that there is a strong linear relationship between these two features: as
SIZE increases so does RENTAL PRICE by a similar amount. If we could capture
this relationship in a model, we would be able to do two important things. First, we
would be able to understand how office size affects office rental price. Second, we
would be able to fill in the gaps in the dataset to predict office rental prices for office
sizes that we have never actually seen in the historical data—for example, how
much would we expect a 730-square-foot office to rent for? Both of these things
would be of great use to real estate agents trying to make decisions about
the rental prices they should set for new rental properties.
Table: the office rentals dataset, listing SIZE, FLOOR, BROADBAND RATE, ENERGY RATING, and RENTAL PRICE for each office (the table itself is not reproduced here).
• There is a simple, well-known mathematical model that can capture the
relationship between two continuous features like those in our dataset. Many
readers will remember from high school geometry that the equation of a line can
be written
y = mx + b (7.1)
• where m is the slope of the line, and b is known as the y-intercept of the line (i.e.,
the position at which the line meets the vertical axis when the value of x is set to
zero). The equation of a line predicts a y value for every x value given the slope
and the y-intercept, and we can use this simple model to capture the relationship
between two features such as SIZE and RENTAL PRICE. Figure 7.1(b) shows the
same scatter plot as shown in Figure 7.1(a) with a simple linear model added to
capture the relationship between office sizes and office rental prices.
• This model is
RENTAL PRICE = 6.47 + 0.62 × SIZE (7.2)
where the slope of the line is 0.62 and the y-intercept is 6.47.
• This model tells us that for every increase of a square foot in SIZE, RENTAL PRICE
increases by 0.62 Euro. We can also use this model to determine the expected
rental price of the 730-square-foot office mentioned previously by simply plugging
this value for SIZE into the model:
RENTAL PRICE = 6.47 + 0.62 × 730
= 459.07
► Measuring Error
• The model shown in Equation (7.2) is defined by the weights w[0] = 6.47 and w[1] = 0.62.
What tells us that these weights suitably capture the relationship within the training
dataset? Figure 7.2(a) shows a scatter plot of the SIZE and RENTAL PRICE descriptive
features from the office rentals dataset and a number of different simple linear regression
models that might be used to capture this relationship. In these models the value for w[0]
is kept constant at 6.47, and the values for w[1] are set to 0.4, 0.5, 0.62, 0.7, and 0.8 from
top to bottom. Out of the candidate models shown, the third model from the top (with
w[1] set to 0.62) passes most closely through the actual dataset and is the one that most
accurately fits the relationship between office sizes and office rental prices—but how do
we measure this formally?
• In order to formally measure the fit of a linear regression model with a set of training data,
we require an error function. An error function captures the error between the predictions
made by a model and the actual values in a training dataset. There are many different
kinds of error functions, but for measuring the fit of simple linear regression models, the
most commonly used is the sum of squared errors error function, or L2. To calculate
L2 we use our candidate model Mw to make a prediction for each member of the training
dataset, D, and then calculate the error (or residual) between these predictions and the
actual target feature values in the training set.
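• For concreteness (this equation is our addition, using the common convention of halving the sum so that its derivative is tidier), the L2 error function over a training dataset D with n instances can be written as
L_2(M_w, D) = \frac{1}{2} \sum_{i=1}^{n} \left( t_i - M_w(d_i) \right)^2
where t_i is the actual target value of the i-th training instance and M_w(d_i) is the candidate model's prediction for it.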
• Figure 7.2(b) shows the office rentals dataset and the candidate model
with w[0] = 6.47 and w[1] = 0.62 and also includes error bars to highlight the
differences between the predictions made by the model and the actual
RENTAL PRICE values in the training data. Notice that the model sometimes
overestimates the office rental price, and sometimes underestimates the
office rental price. This means that some of the errors will be positive
and some will be negative. If we were to simply add these together, the
positive and negative errors would effectively cancel each other out. This is
why, rather than just using the sum of the errors, we use the sum of the
squared errors because this means all values will be positive.
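• A minimal sketch in Python (our illustration, not part of the original notes) of how the sum of squared errors could be computed for a candidate model; the sample data below are illustrative stand-ins for the office rentals table, which is not reproduced above:

```python
import numpy as np

def predict(w0, w1, size):
    """Simple linear regression prediction: RENTAL PRICE = w0 + w1 * SIZE."""
    return w0 + w1 * size

def sum_of_squared_errors(w0, w1, sizes, prices):
    """L2: half the sum of squared residuals between targets and predictions."""
    residuals = prices - predict(w0, w1, sizes)
    return 0.5 * np.sum(residuals ** 2)

# Illustrative stand-in data (office sizes in sq. ft., rents in Euro per month).
sizes = np.array([500.0, 550.0, 620.0, 630.0, 665.0, 700.0, 770.0, 880.0, 920.0, 1000.0])
prices = np.array([320.0, 380.0, 400.0, 390.0, 385.0, 410.0, 480.0, 600.0, 570.0, 620.0])

# The candidate model with w[0] = 6.47 and w[1] = 0.62 from Equation (7.2).
print(sum_of_squared_errors(6.47, 0.62, sizes, prices))
```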
► Error Surfaces
Standard Approach: Multivariable Linear Regression with Gradient Descent
► Gradient Descent
• Gradient descent is an optimization algorithm used in machine learning to minimize
the cost function by iteratively adjusting parameters in the direction of the negative
gradient, aiming to find the optimal set of parameters.
• The cost function represents the discrepancy between the predicted output of the
model and the actual output; gradient descent aims to find the parameters that
minimize this discrepancy and improve the model’s performance.
• Cost Function:
• It is a function that measures the performance of a model for any given data. Cost
Function quantifies the error between predicted values and expected values and
presents it in the form of a single real number.
• After making a hypothesis with initial parameters, we calculate the cost function.
Then, with the goal of reducing the cost function, we modify the parameters using the
gradient descent algorithm over the given data. For a hypothesis hθ evaluated over m
training examples, the usual mean squared error cost is
J(θ) = (1/2m) × Σ_{i=1..m} (hθ(x_i) − y_i)²
• The algorithm operates by calculating the gradient of the cost function, which
indicates the direction and magnitude of the steepest ascent. However, since the
objective is to minimize the cost function, gradient descent moves in the opposite
direction of the gradient, known as the negative gradient direction.
• By iteratively updating the model’s parameters in the negative gradient direction,
gradient descent gradually converges towards the optimal set of parameters that
yields the lowest cost. The learning rate, a hyperparameter, determines the step size
taken in each iteration, influencing the speed and stability of convergence.
• Gradient descent can be applied to various machine learning algorithms,
including linear regression, logistic regression, neural networks, and support vector
machines. It provides a general framework for optimizing models by iteratively
refining their parameters based on the cost function.
• Example:
• Let’s say you are playing a game in which the players are at the top of a mountain
and are asked to reach the lake at its lowest point. Additionally, they are
blindfolded. So, what approach do you think would make you reach the lake?
• Take a moment to think about this before you read on.
• The best way is to observe the ground and find where the land descends. From that
position, step in the descending direction and iterate this process until we reach the
lowest point.
(Figure: finding the lowest point in a hilly landscape.)

• Gradient descent is an iterative optimization algorithm for finding the local
minimum of a function.
• To find the local minimum of a function using gradient descent, we must take steps
proportional to the negative of the gradient (i.e., move against the gradient) of the
function at the current point. If we instead take steps proportional to the positive of
the gradient (moving along the gradient), we approach a local maximum of the
function; that procedure is called gradient ascent.
• Gradient descent was originally proposed by Cauchy in 1847. It is also known as
steepest descent.

• The goal of the gradient descent algorithm is to minimize the given function (say,
cost function). To achieve this goal, it performs two steps iteratively:
• Compute the gradient (slope), the first-order derivative of the function at that
point
• Make a step (move) in the direction opposite to the gradient, i.e., move from the
current point by alpha times the gradient, against the direction in which the slope
increases
• Alpha is called the learning rate – a tuning parameter in the optimization process. It
decides the length of the steps.
• Working of Gradient Descent
1. The algorithm’s objective is to minimize the model’s cost function.
2. The cost function measures how well the model fits the training data and defines
the difference between the predicted and actual values.
3. The cost function’s gradient is the derivative with respect to the model’s parameters
and points in the direction of the steepest ascent.
4. The algorithm starts with an initial set of parameters and updates them in small
steps to minimize the cost function.
5. In each iteration of the algorithm, it computes the gradient of the cost function with
respect to each parameter.
6. The gradient tells us the direction of the steepest ascent, and by moving in the
opposite direction, we can find the direction of the steepest descent.
7. The learning rate controls the step size, which determines how quickly the
algorithm moves towards the minimum.
8. The process is repeated until the cost function converges to a minimum, indicating
that the model has reached the optimal set of parameters.
9. Different variations of gradient descent include batch gradient descent, stochastic
gradient descent, and mini-batch gradient descent, each with advantages and
limitations.
10. Efficient implementation of gradient descent is essential for performing well in
machine learning tasks. The choice of the learning rate and the number of iterations
can significantly impact the algorithm’s performance (a minimal code sketch of the
update loop follows this list).
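• A minimal batch gradient descent sketch in Python (our illustration) for the simple SIZE → RENTAL PRICE model; it minimizes the mean of squared errors, a rescaled version of the L2 error function, and reuses the stand-in data from before:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for y ~ w0 + w1 * x, minimizing the
    mean squared error cost J = (1/2n) * sum((w0 + w1*x - y)^2)."""
    w0, w1 = 0.0, 0.0                      # initial parameters
    for _ in range(iterations):
        errors = (w0 + w1 * x) - y         # residuals of the current hypothesis
        grad_w0 = errors.mean()            # dJ/dw0
        grad_w1 = (errors * x).mean()      # dJ/dw1
        w0 -= alpha * grad_w0              # step opposite to the gradient
        w1 -= alpha * grad_w1
    return w0, w1

sizes = np.array([500.0, 550.0, 620.0, 630.0, 665.0, 700.0, 770.0, 880.0, 920.0, 1000.0])
prices = np.array([320.0, 380.0, 400.0, 390.0, 385.0, 410.0, 480.0, 600.0, 570.0, 620.0])

# Normalizing SIZE keeps the gradients well scaled (see the note on
# normalization in the section on choosing initial weights below).
x = (sizes - sizes.mean()) / sizes.std()
w0, w1 = gradient_descent(x, prices)
print(w0, w1)   # weights of the fitted model over normalized SIZE
```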
Plotting the Gradient Descent Algorithm
When we have a single parameter (theta), we can plot the dependent variable
cost on the y-axis and theta on the x-axis. If there are two parameters, we can
go with a 3-D plot, with cost on one axis and the two parameters (thetas) along
the other two axes.
• It can also be visualized using contours. A contour plot shows the 3-D surface in two
dimensions, with the parameters along the axes and the response as contour rings. The
value of the response is the same everywhere along a given ring and increases as we
move away from the center, in proportion to a point’s distance from the center (along
a given direction).

• Alpha – The Learning Rate
• We have the direction we want to move in. Now, we must decide the size of
the step we must take.
• It must be chosen carefully so that gradient descent actually reaches the minimum.
• If the learning rate is too high, we might overshoot the minima and keep
bouncing back and forth without ever reaching them.
• If the learning rate is too small, training may take too long to converge.
• Four typical cases:
• The learning rate is optimal: the model converges to the minimum.
• The learning rate is too small: it takes more time but still converges to the minimum.
• The learning rate is higher than the optimal value: it overshoots but converges
(1/C < η < 2/C).
• The learning rate is very large: it overshoots and diverges, moves away from the
minima, and learning performance degrades.
• Note: As the gradient decreases while moving towards the local minima, the size of
the step decreases. So, the learning rate (alpha) can be constant over the
optimization and need not be varied iteratively.
• Local Minima
• The cost function may have many minimum points. Depending on the initial
point (i.e., the initial parameters, theta) and the learning rate, gradient descent may
settle in any of these minima. The optimization may therefore converge to different
solutions for different starting points and learning rates.
• Advantages and Disadvantages
• Advantages:
• Easy to use: It’s like rolling the marble yourself – no fancy tools needed, you just
push it in the right direction.
• Fast updates: Each push (iteration) is quick, you don’t have to spend a lot of time
figuring out how hard to push.
• Memory efficient: You don’t need a big backpack to carry around extra
information, just the marble and your knowledge of the hill.
• Usually finds a good spot: Most of the time, the marble will end up in a pretty flat
area, even if it’s not the absolute flattest (global minimum).
• Disadvantages:
• Slow for giant hills (large datasets): If the hill is enormous, pushing the marble all
the way down each time can be super slow. There are better ways to roll for these
giants.
• Can get stuck in shallow dips (local minima): The hill might have many dips, and
the marble could get stuck in one that isn’t the absolute lowest. It depends on where
you start pushing it from.
• Finding the perfect push (learning rate): You need to figure out how hard to push
the marble (learning rate). If you push too weakly, it’ll take forever to get
anywhere. Push too hard, and it might roll right past the flat spot.
► Choosing Learning Rates and Initial Weights
• The values chosen for the learning rate and initial weights can have a significant impact
on how the gradient descent algorithm proceeds. Unfortunately, there are no theoretical
results that help in choosing the optimal values for these parameters. Instead, these
algorithm parameters must be chosen using rules of thumb gathered through experience.
• The learning rate, α, in the gradient descent algorithm determines the size of the
adjustment made to each weight at each step in the process. We can illustrate this using
the simplified version of the RENTAL PRICE prediction problem based only on office
size (SIZE). A linear regression model for the problem uses only two weights, w[0] and
w[1]. Figure 7.7 shows how different learning rates—0.002, 0.08, and 0.18—result in
very different journeys across the error surface. The changing sums of squared errors
that result from these journeys are also shown.
• Figure 7.7(a) shows the impact of a very small learning rate. Although the gradient
descent algorithm will converge to the global minimum eventually, it takes a very long
time as tiny changes are made to the weights at each iteration of the algorithm. Figure
7.7(c) shows the impact of a large learning rate. The large adjustments made to the
weights during gradient descent cause it to jump completely from one side of the error
surface to the other.
• Although the algorithm can still converge toward an area of the error surface close to the
global minimum, there is a strong chance that the global minimum itself will be missed,
and the algorithm will simply jump back and forth across it. In fact, if inappropriately
large learning rates are used, the jumps from one side of the error surface to the other can
cause the sum of squared errors to repeatedly increase rather than decrease, leading to a
process that will never converge.
• Figure 7.7(b) shows that a well-chosen learning rate strikes a good balance,
converging quickly but also ensuring that the global minimum is reached.
• Note that even though the shape of the curve in Figure 7.7(e) is similar to the shape
in Figure 7.7(d), it takes far fewer iterations to reach the global minimum.
Unfortunately, choosing learning rates is not a well-defined science. Although there
are some algorithmic approaches, most practitioners use rules of thumb and trial
and error. A typical range for learning rates is [0.00001, 10], and practitioners will
usually try out higher values and observe the resulting learning graph. If the graph
looks too much like Figure 7.7(f), a smaller value will be tested until something
approaching Figure 7.7(e) is found.
• When the gradient descent algorithm is used to find optimal weights for linear
regression models, the initial weights are chosen randomly from a predefined range
that must be specified as an input to the algorithm. The choice of the range from
which these initial weights are selected affects how quickly the gradient descent
algorithm will converge to a solution. Unfortunately, as is the case with the learning
rate, there are no well-established, proven methods for choosing initial weights.
Normalization also has a part to play here. It is much easier to select initial weights
for normalized feature values than for raw feature values, as the range in which
weights for normalized feature values might reasonably fall (particularly for the
intercept weight, w[0]) is much better defined than the corresponding range when
raw feature values are used. The best advice we can give is that, based on empirical
evidence, choosing random initial weights uniformly from the range [-0.2, 0.2]
tends to work well.
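• A small illustration (ours) of these two rules of thumb, range-normalizing the feature and drawing the initial weights uniformly from [-0.2, 0.2]:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(num_weights, low=-0.2, high=0.2):
    """Draw initial weights uniformly from [low, high], per the rule of thumb."""
    return rng.uniform(low, high, size=num_weights)

sizes = np.array([500.0, 550.0, 620.0, 630.0, 665.0, 700.0])
normalized = (sizes - sizes.min()) / (sizes.max() - sizes.min())  # range [0, 1]
w = init_weights(2)   # w[0] (intercept) and w[1] (SIZE weight)
print(normalized, w)
```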
► Extensions and Variations
► Interpreting Multivariable Linear Regression Models
• A particularly useful feature of linear regression models is that the weights used by
the model indicate the effect of each descriptive feature on the predictions returned
by the model. First, the signs of the weights indicate whether different descriptive
features have a positive or a negative impact on the prediction. Table 7.4 repeats the
final weights for the office rentals model trained in Section 7.3.4. We can see that
increasing office size leads to increasing rental prices; that lower building floors
lead to higher rental prices; and that rental prices increase with broadband rates.
Second, the magnitudes of the weights show how much the value of the target
feature changes for a unit change in the value of a particular descriptive feature. For
example, for every increase of a square foot in office size, we can expect the rental
price to go up by 0.6270 Euro per month. Similarly, for every floor we go up in an
office building, we can expect the rental price to decrease by 0.1781 Euro per
month.
• The statistical significance test we use to analyze the importance of a descriptive feature d[j]
in a linear regression model is the t-test. The null hypothesis that we adopt for this test is that
the feature does not have a significant impact on the model. The test statistic we calculate is
called the t-statistic. In order to calculate this test statistic, we first have to calculate the
standard error for the overall model and the standard error for the descriptive feature whose
importance we are investigating. The standard error for the overall model is calculated as
se = sqrt( Σ_{i=1..n} (t_i − Mw(d_i))² / (n − 2) )
• where n is the number of instances in the training dataset. A standard error calculation is then
done for a descriptive feature as follows:
se(d[j]) = se / sqrt( Σ_{i=1..n} (d_i[j] − mean(d[j]))² )
• where d[j] is some descriptive feature and mean(d[j]) is the mean value of that descriptive
feature in the training set.
• The t-statistic for this test is calculated as
t = w[j] / se(d[j])
• where w[j] is the weight associated with descriptive feature d[j].


• Using a standard t-statistic look-up table, we can then determine the p-value
associated with this test (this is a two-tailed t-test with degrees of freedom set to the
number of instances in the training set minus 2). If the p-value is less than the
required significance level, typically 0.05, we reject the null hypothesis and say that
the descriptive feature has a significant impact on the model; otherwise, we say that
it does not. We can see from Table 7.4 that only the SIZE descriptive feature has a
significant impact on the model. If a descriptive feature is found to have a significant
impact on the model, this indicates that there is a significant linear relationship
between it and the target feature.
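• A sketch in Python (our illustration, following the standard-error formulas above) of the t-test for one descriptive feature's weight; scipy's Student-t survival function supplies the two-tailed p-value:

```python
import numpy as np
from scipy import stats

def feature_t_test(predictions, targets, feature_values, weight):
    """t-test for the significance of one feature's weight in a linear
    regression model, with n - 2 degrees of freedom."""
    n = len(targets)
    # Standard error of the overall model.
    se_model = np.sqrt(np.sum((targets - predictions) ** 2) / (n - 2))
    # Standard error for this descriptive feature.
    se_feature = se_model / np.sqrt(np.sum((feature_values - feature_values.mean()) ** 2))
    t_stat = weight / se_feature
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-tailed p-value
    return t_stat, p_value
```

If the returned p-value is below the chosen significance level (typically 0.05), the null hypothesis is rejected and the feature is deemed to have a significant impact on the model.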
► Setting the Learning Rate Using Weight Decay
• Previously we illustrated the impact of a learning rate parameter on the gradient
descent algorithm. In that section we also explained that most practitioners use rules
of thumb and trial and error to set the learning rate. A more systematic approach is
to use learning rate decay, which allows the learning rate to start at a large value
and then decay over time according to a predefined schedule. Although there are
different approaches in the literature, a good approach is to use the following decay
schedule:
α_τ = α_0 × c / (c + τ)
• where α_0 is an initial learning rate (this is typically quite large, e.g., 1.0), c is a
constant that controls how quickly the learning rate decays (the value of this
parameter depends on how quickly the algorithm converges, but it is often set to
quite a large value, e.g., 100), and τ is the current iteration of the gradient descent
algorithm. Figure 7.8 shows the journey across the error surface and related plot of
the sums of squared errors for the office rentals problem—using just the SIZE
descriptive feature—when learning rate decay is used with α0 = 0.18 and c = 10 (this is a
pretty simple problem, so smaller values for these parameters are suitable). This
example shows that the algorithm converges to the global minimum more quickly
than any of the approaches shown in Figure 7.7.
• The differences between Figures 7.7(f) and 7.8(b) most clearly show the impact of
learning rate decay as the initial learning rates are the same in these two instances.
When learning rate decay is used, there is much less thrashing back and forth
across the error surface than when the large static learning rate is used. Using
learning rate decay can even address the problem of inappropriately large learning
rates causing the sum of squared errors to increase rather than decrease. Figure 7.9
shows an example of this in which learning rate decay is used with α0 = 0.25 and
c=100. The algorithm starts at the position marked 1 on the error surface, and
learning steps actually cause it to move farther and farther up the error surface.
This can be seen in the increasing sums of squared errors in Figure 7.9(b).
• As the learning rate decays, however, the direction of the journey across the error
surface moves back downward, and eventually the global minimum is reached.
Although learning rate decay almost always leads to better performance than a
fixed learning rate, it still does require that problem-dependent values are chosen
for α0 and c.
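• A minimal sketch (ours, assuming the α_τ = α_0 × c / (c + τ) schedule given above) of how a decaying learning rate plugs into the gradient descent update:

```python
def decayed_learning_rate(alpha_0, c, tau):
    """Learning rate decay schedule: alpha_tau = alpha_0 * c / (c + tau)."""
    return alpha_0 * c / (c + tau)

# In gradient descent, each iteration tau would then use:
#   w <- w - decayed_learning_rate(alpha_0, c, tau) * gradient
for tau in range(0, 50, 10):
    print(tau, decayed_learning_rate(0.18, 10, tau))
```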
► Handling Categorical Target Features: Logistic Regression
► Predicting categorical targets using linear regression
• Table 7.6 shows a sample dataset with a categorical target feature. This dataset contains
measurements of the revolutions per minute (RPM) that power station generators are running
at, the amount of vibration in the generators (VIBRATION), and an indicator to show whether
the generators proved to be working or faulty the day after these measurements were taken.
The RPM and VIBRATION measurements come from the day before the generators proved to
be operational or faulty. If power station administrators could predict upcoming generator
failures before the generators actually fail, they could improve power station safety and
save money on maintenance. Using this dataset, we would like to train a model to
distinguish between properly operating power station generators and faulty generators using
the RPM and VIBRATION measurements.
Figure 7.10(a) shows a scatter plot of this dataset in which we can see that there is a
good separation between the two types of generator. In fact, as shown in Figure 7.10(b),
we can draw a straight line across the scatter plot that perfectly separates the good generators
from the faulty ones. This line is known as a decision boundary, and because we can draw this
line, this dataset is said to be linearly separable in terms of the two descriptive features used.
As the decision boundary is a linear separator, it can be defined using the equation of a line
(remember Equation (7.1)). In Figure 7.10(b) the decision boundary is defined as
• We can solve both these problems by using a more sophisticated threshold function
that is continuous, and therefore differentiable, and that allows for the subtlety
desired: the logistic function
• The logistic function is given by
logistic(x) = 1 / (1 + e^(−x))
• where x is a numeric value and e is Euler’s number, approximately equal to
2.7183. A plot of the logistic function for values of x in the range [−10, 10] is shown
in Figure 7.12(a). We can see that the logistic function is a threshold function that
pushes values above zero to 1 and values below zero to 0. This is very similar to
the hard threshold function given in Equation (7.24), except that it has a soft
boundary.
► Logistic regression
To build a logistic regression model, we threshold the output of the basic linear
regression model using the logistic function. So, instead of the regression function
simply being the dot product of the weights and the descriptive features (as given in
Equation (7.9)), the dot product of weights and descriptive feature values is passed
through the logistic function:
Mw(d) = logistic(w · d) = 1 / (1 + e^(−(w · d)))
• To see the impact of this, we can build a multivariable logistic regression model for
the dataset in Table 7.6. After the training process (which uses a slightly
modified version of the gradient descent algorithm, which we will explain shortly),
the resulting logistic regression model is
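• The fitted weights are not reproduced above, but a minimal Python sketch (ours, with placeholder weights rather than the trained model from the notes) shows how a logistic regression model scores a query:

```python
import numpy as np

def logistic(x):
    """The logistic function: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_regression_predict(weights, query):
    """Dot product of the weights and the descriptive features (with a
    leading 1 for the intercept weight w[0]), passed through the logistic
    function."""
    d = np.concatenate(([1.0], query))
    return logistic(np.dot(weights, d))

# Placeholder weights for illustration only (not the trained generators model).
w = np.array([-1.0, 0.5, 2.0])   # w[0], weight for RPM, weight for VIBRATION
print(logistic_regression_predict(w, np.array([0.3, 0.8])))
```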
► Modeling Non-Linear Relationships
• All the simple linear regression and logistic regression models that we have looked
at so far model a linear relationship between descriptive features and a target feature.
Linear models work very well when the underlying relationships in the data are
linear. Sometimes, however, the underlying data will exhibit non-linear relationships
that we would like to capture in a model. For example, the dataset in Table 7.9 is
based on an agricultural scenario and shows rainfall (in mm per day), RAIN, and
resulting grass growth (in kilograms per acre per day), GROWTH, measured on a
number of Irish farms during July 2012. A scatter plot of these two features is shown
in Figure 7.16(a), from which the strong non-linear
relationship between rainfall and grass growth is clearly apparent—grass does not
grow well when there is very little rain or too much rain, but hits a sweet spot at
rainfall of about 2.5 mm per day. It would be useful for farmers to be able to predict
grass growth for different amounts of forecasted rainfall so that they could plan the
optimal times to harvest their grass for making hay.
• A simple linear regression model cannot handle this non-linear relationship. Figure
7.16(b) shows the best simple linear regression model that can be trained for this
prediction problem. This model is
To successfully model the relationship between grass growth and rainfall, we need to
introduce non-linear elements. A generalized way in which to do this is to introduce
basis functions that transform the raw inputs to the model into non-linear
representations but still keep the model itself linear in terms of the weights. The
advantage of this is that, except for introducing the mechanism of basis functions, we
do not need to make any other changes to the approach we have presented so far.
Furthermore, basis functions work for both simple multivariable linear regression
models that predict a continuous target feature and multivariable logistic regression
models that predict a categorical target feature
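• A sketch (our illustration, with stand-in data rather than the Table 7.9 measurements) of capturing the rainfall-growth relationship with polynomial basis functions while keeping the model linear in its weights:

```python
import numpy as np

def basis_expand(rain):
    """Transform raw RAIN values into basis functions phi_0 = 1,
    phi_1 = rain, phi_2 = rain^2; the model remains linear in the
    weights even though the representation is non-linear."""
    return np.column_stack([np.ones_like(rain), rain, rain ** 2])

# Stand-in data: growth peaks at moderate rainfall, as described above.
rain = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
growth = np.array([20.0, 45.0, 70.0, 85.0, 90.0, 82.0, 60.0, 30.0])

# Least-squares fit of the weights over the expanded features.
phi = basis_expand(rain)
w, *_ = np.linalg.lstsq(phi, growth, rcond=None)
print(w)                                    # w[0], w[1], w[2]
print(basis_expand(np.array([2.5])) @ w)    # predicted growth at 2.5 mm/day
```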
► Multinomial Logistic Regression
• The multinomial logistic regression model is an extension that handles categorical
target features with more than two levels. A good way to build multinomial logistic
regression models is to use a set of one-versus-all models. If we have r target levels,
we create r one-versus-all logistic regression models. A one-versus-all model
distinguishes between one level of the target feature and all the others. Figure 7.20
shows three one-versus-all prediction models for a prediction problem with three target
levels (these models are based on the dataset in Table 7.11).

• The prediction of each one-versus-all model is normalized as
M′_wk(d) = M_wk(d) / Σ_{l=1..r} M_wl(d)
• where M′_wk(d) is a revised, normalized prediction for the one-versus-all
model for the target level k.
• The denominator in this equation sums the predictions of each of the one-versus-all
models for the r levels of the target feature and acts as a normalization term. This
ensures that the output of all models sums to 1. The r one-versus-all logistic
regression models used are trained in parallel, and the revised model outputs,
M′_wk(d), are used in calculating the sum of squared errors for each model during
the training process. This means that the sum of squared errors function is changed
slightly to
L2(Mw, D) = 1/2 × Σ_{i=1..n} Σ_{l=1..r} (t_{i,l} − M′_wl(d_i))²
• where t_{i,l} is 1 when l is the target level of instance i and 0 otherwise.
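• A minimal sketch (ours, with illustrative untrained stand-ins for the one-versus-all models) of the normalization step:

```python
import numpy as np

def one_versus_all_predict(models, query):
    """Normalize the outputs of r one-versus-all logistic regression models
    so that the revised predictions sum to 1 across the r target levels."""
    raw = np.array([m(query) for m in models])   # one score per target level
    return raw / raw.sum()

# Illustrative stand-in models (hand-picked weights, not trained ones).
models = [
    lambda q: 1 / (1 + np.exp(-(2.0 * q[0] - 1.0 * q[1]))),
    lambda q: 1 / (1 + np.exp(-(-1.0 * q[0] + 2.0 * q[1]))),
    lambda q: 1 / (1 + np.exp(-(0.5 * q[0] + 0.5 * q[1]))),
]
probs = one_versus_all_predict(models, np.array([0.4, 0.7]))
print(probs, probs.sum())   # revised predictions; the sum is 1.0
```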
► Support Vector Machines
• Support vector machines (SVM) are another approach to predictive modeling that is
based on error-based learning. Figure 7.22(a) shows a scatter plot of a reduced
version of the generators dataset (shown in Table 7.6) with a decision boundary
drawn across it. The instance nearest the decision boundary, based on perpendicular
distance, is highlighted.
• This distance from the decision boundary to the nearest training instance is known
as the margin. The dashed lines on either side of the decision boundary show the
extent of the margin, and we refer to these as the margin extents.
• For support vector machines, we first set the negative target feature level to -1 and
the positive target feature level to +1. We then build a support vector machine
prediction model so that instances with the negative target level result in the model
outputting <= -1 and instances with the positive target level result in the model
outputting >= +1. The space between the outputs of -1 and +1 allows for the margin.
• A support vector machine model is defined as
M_{α,w0}(q) = Σ_{i=1..s} ( t_i × α[i] × (d_i · q) ) + w0
• where q is the set of descriptive features for a query instance; (d_1, t_1), …, (d_s, t_s) are the
s support vectors (instances composed of descriptive features and a target feature);
w0 is the first weight of the decision boundary; and α is a set of parameters
determined during the training process (there is a parameter α[i] for each support
vector). When the output of this equation is greater than 1, we predict the
positive target level for the query, and when the output is less than -1, we predict the
negative target level. An important feature of this equation is that the support
vectors are a component of the equation. This reflects the fact that a support vector
machine uses the support vectors to define the separating hyperplane and hence to
make the actual model predictions.
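• A minimal sketch (ours, with placeholder support vectors and parameters rather than a trained model) of how this prediction equation is evaluated:

```python
import numpy as np

def svm_output(support_vectors, targets, alphas, w0, query):
    """SVM output: sum over support vectors of t_i * alpha_i * (d_i . q),
    plus w0. Output > 1 predicts the positive level; output < -1 predicts
    the negative level."""
    output = w0
    for d_i, t_i, a_i in zip(support_vectors, targets, alphas):
        output += t_i * a_i * np.dot(d_i, query)
    return output

# Placeholder support vectors, targets, and parameters (illustration only).
support_vectors = [np.array([0.3, 0.8]), np.array([0.7, 0.2])]
targets = [+1, -1]
alphas = [1.5, 1.5]   # one alpha parameter per support vector
w0 = 0.1
print(svm_output(support_vectors, targets, alphas, w0, np.array([0.4, 0.6])))
```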
Figure 7.23 shows two different decision boundaries that satisfy these constraints. Note
that the decision boundaries in these examples are equally positioned between positive and
negative instances, which is a consequence of the fact that decision boundaries satisfy these
constraints. The support vectors are highlighted in Figure 7.23 for each of the decision
boundaries shown.
