PDA Unit-3 (Full Unit)
Error-Based Learning
Simple Linear Regression
• Table below shows a simple dataset recording the rental price (in Euro per month) of
Dublin city-center offices (RENTAL PRICE), along with a number of descriptive
features that are likely to be related to rental price: the SIZE of the office (in square
feet), the FLOOR in the building in which the office space is located, the
BROADBAND rate available at the office (in Mb per second), and the ENERGY
RATING of the building in which the office space is located (ratings range from A to
C, where A is the most efficient). We look at the ways in which all these descriptive
features can be used to train an error-based model to predict office rental prices.
Initially, though, we will focus on a simplified version of this task in which just
SIZE is used to predict RENTAL PRICE.
• Figure 7.1(a) shows a scatter plot of the office rentals dataset with RENTAL
PRICE on the vertical (or y) axis and SIZE on the horizontal (or x) axis. From this
plot, it is clear that there is a strong linear relationship between these two features: as
SIZE increases so does RENTAL PRICE by a similar amount. If we could capture
this relationship in a model, we would be able to do two important things. First, we
would be able to understand how office size affects office rental price. Second, we
would be able to fill in the gaps in the dataset to predict office rental prices for office
sizes that we have never actually seen in the historical data—for example, how
much would we expect a 730-square-foot office to rent for? Both of these things
would be of great use to real estate agents trying to make decisions about
the rental prices they should set for new rental properties.
• Table: the office rentals dataset (SIZE, FLOOR, BROADBAND rate, ENERGY RATING, and RENTAL PRICE for each office; the values are not reproduced here).
• There is a simple, well-known mathematical model that can capture the
relationship between two continuous features like those in our dataset. Many
readers will remember from high school geometry that the equation of a line can
be written
y = mx + b (7.1)
• where m is the slope of the line, and b is known as the y-intercept of the line (i.e.,
the position at which the line meets the vertical axis when the value of x is set to
zero). The equation of a line predicts a y value for every x value given the slope
and the y-intercept, and we can use this simple model to capture the relationship
between two features such as SIZE and RENTAL PRICE. Figure 7.1(b) shows the
same scatter plot as shown in Figure 7.1(a) with a simple linear model added to
capture the relationship between office sizes and office rental prices.
• This model is

RENTAL PRICE = 6.47 + 0.62 × SIZE (7.2)

where the slope of the line is 0.62 and the y-intercept is 6.47.
• This model tells us that for every increase of a square foot in SIZE, RENTAL PRICE
increases by 0.62 Euro. We can also use this model to determine the expected
rental price of the 730-square-foot office mentioned previously by simply plugging
this value for SIZE into the model

RENTAL PRICE = 6.47 + 0.62 × 730
             = 459.07
Measuring Error
• The model shown in Equation (7.2) is defined by the weights w[0] = 6.47 and w[1] = 0.62.
What tells us that these weights suitably capture the relationship within the training
dataset? Figure 7.2(a) shows a scatter plot of the SIZE and RENTAL PRICE descriptive
features from the office rentals dataset and a number of different simple linear regression
models that might be used to capture this relationship. In these models the value for w[0]
is kept constant at 6.47, and the values for w[1] are set to 0.4, 0.5, 0.62, 0.7, and 0.8 from
top to bottom. Out of the candidate models shown, the third model from the top (with
w[1] set to 0.62) passes most closely through the actual dataset and is the one that most
accurately fits the relationship between office sizes and office rental prices, but how do
we measure this formally?
• In order to formally measure the fit of a linear regression model with a set of training data,
we require an error function. An error function captures the error between the predictions
made by a model and the actual values in a training dataset. There are many different
kinds of error functions, but for measuring the fit of simple linear regression models, the
most commonly used is the sum of squared errors error function, or L2. To calculate
L2 we use our candidate model Mw to make a prediction for each member of the training
dataset, D, and then calculate the error (or residual) between these predictions and the
actual target feature values in the training set.
• Figure 7.2(b) shows the office rentals dataset and the candidate model
with w[0] = 6.47 and w[1] = 0.62 and also includes error bars to highlight the
differences between the predictions made by the model and the actual
RENTAL PRICE values in the training data. Notice that the model sometimes
overestimates the office rental price, and sometimes underestimates the
office rental price. This means that some of the errors will be positive
and some will be negative. If we were to simply add these together, the
positive and negative errors would effectively cancel each other out. This is
why, rather than just using the sum of the errors, we use the sum of the
squared errors because this means all values will be positive.
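The sum of squared errors calculation described above can be sketched in a few lines of Python. The SIZE and RENTAL PRICE values below are hypothetical stand-ins, since the table itself is not reproduced in these notes; the candidate weights 6.47 and 0.62 come from Equation (7.2), and the 1/2 factor follows the usual definition of the L2 error function.

```python
def predict(size, w0=6.47, w1=0.62):
    """Candidate model: RENTAL PRICE = w0 + w1 * SIZE."""
    return w0 + w1 * size

def sum_of_squared_errors(sizes, prices, w0=6.47, w1=0.62):
    """L2 = 1/2 * sum of squared residuals between targets and predictions.
    Squaring makes every error contribute positively, so over- and
    under-estimates cannot cancel each other out."""
    return 0.5 * sum((t - predict(d, w0, w1)) ** 2
                     for d, t in zip(sizes, prices))

sizes = [500, 550, 620, 630, 665]    # hypothetical SIZE values (sq. ft)
prices = [320, 380, 400, 390, 385]   # hypothetical RENTAL PRICE values

print(predict(730))                  # expected rent for a 730 sq. ft office
print(sum_of_squared_errors(sizes, prices))
```

Lower L2 values indicate a candidate model that passes more closely through the training data.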
Error Surfaces
Standard Approach: Multivariable Linear Regression with Gradient Descent
Gradient Descent
• Gradient descent is an optimization algorithm used in machine learning to minimize
the cost function by iteratively adjusting parameters in the direction of the negative
gradient, aiming to find the optimal set of parameters.
• Cost Function:
• It is a function that measures the performance of a model for any given data. Cost
Function quantifies the error between predicted values and expected values and
presents it in the form of a single real number.
• After making a hypothesis with initial parameters, we calculate the cost function. Then,
with the goal of reducing the cost function, we modify the parameters by using the
gradient descent algorithm over the given data. For a model M_w and a training dataset D
with n instances, the sum of squared errors cost function introduced earlier can be written as

L2(M_w, D) = 1/2 × Σ_{i=1..n} (t_i − M_w(d_i))²

where t_i is the actual target value of instance i and M_w(d_i) is the model's prediction for it.
• The cost function represents the discrepancy between the predicted output of the
model and the actual output. Gradient descent aims to find the parameters that
minimize this discrepancy and improve the model’s performance.
• The algorithm operates by calculating the gradient of the cost function, which
indicates the direction and magnitude of the steepest ascent. However, since the
objective is to minimize the cost function, gradient descent moves in the opposite
direction of the gradient, known as the negative gradient direction.
• By iteratively updating the model’s parameters in the negative gradient direction,
gradient descent gradually converges towards the optimal set of parameters that
yields the lowest cost. The learning rate, a hyperparameter, determines the step size
taken in each iteration, influencing the speed and stability of convergence.
• Gradient descent can be applied to various machine learning algorithms,
including linear regression, logistic regression, neural networks, and support vector
machines. It provides a general framework for optimizing models by iteratively
refining their parameters based on the cost function.
• Example:
• Let’s say you are playing a game in which the players are at the top of a mountain
and are asked to reach its lowest point. Additionally, they are blindfolded. So, what
approach do you think would get you to the lowest point?
• Take a moment to think about this before you read on.
• The best way is to observe the ground and find where the land descends. From that
position, step in the descending direction and iterate this process until we reach the
lowest point.
• (Figure: finding the lowest point in a hilly landscape.)
• The goal of the gradient descent algorithm is to minimize the given function (say,
cost function). To achieve this goal, it performs two steps iteratively:
• Compute the gradient (slope), the first-order derivative of the function at that
point
• Make a step (move) in the direction opposite to the gradient: update the current
point by subtracting alpha times the gradient at that point
• Alpha is called Learning rate – a tuning parameter in the optimization process. It
decides the length of the steps.
• Working of Gradient Descent
1. The algorithm optimizes to minimize the model’s cost function.
2. The cost function measures how well the model fits the training data and defines
the difference between the predicted and actual values.
3. The cost function’s gradient is the derivative with respect to the model’s parameters
and points in the direction of the steepest ascent.
4. The algorithm starts with an initial set of parameters and updates them in small
steps to minimize the cost function.
5. In each iteration of the algorithm, it computes the gradient of the cost function with
respect to each parameter.
6. The gradient tells us the direction of the steepest ascent, and by moving in the
opposite direction, we can find the direction of the steepest descent.
7. The learning rate controls the step size, which determines how quickly the
algorithm moves towards the minimum.
8. The process is repeated until the cost function converges to a minimum, indicating
that the model has reached the optimal set of parameters.
9. Different variations of gradient descent include batch gradient descent, stochastic
gradient descent, and mini-batch gradient descent, each with advantages and
limitations.
10. Efficient implementation of gradient descent is essential for performing well in
machine learning tasks. The choice of the learning rate and the number of iterations
can significantly impact the algorithm’s performance.
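The steps above can be sketched as a minimal batch gradient descent for simple linear regression. The tiny noise-free dataset (y = 1 + 2x) and the learning rate are illustrative choices, not values from the text:

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Fit y = w0 + w1*x by minimising L2 = 0.5 * sum((y - pred)^2)."""
    w0, w1 = 0.0, 0.0                                  # initial parameters
    for _ in range(iterations):
        # residuals (errors) under the current weights
        errors = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        # partial derivatives of the L2 cost w.r.t. w0 and w1
        grad_w0 = -sum(errors)
        grad_w1 = -sum(e * x for e, x in zip(errors, xs))
        # move against the gradient (direction of steepest descent),
        # with step size controlled by the learning rate alpha
        w0 -= alpha * grad_w0
        w1 -= alpha * grad_w1
    return w0, w1

# Synthetic noise-free data with a known relationship y = 1 + 2x:
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0 + 2.0 * x for x in xs]
w0, w1 = gradient_descent(xs, ys)
print(round(w0, 3), round(w1, 3))   # should approach 1.0 and 2.0
```

With a learning rate this small relative to the data scale, the weights converge smoothly; a much larger alpha would overshoot and diverge, as discussed in the learning-rate material below.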
Plotting the Gradient Descent Algorithm
When we have a single parameter (theta), we can plot the dependent variable
cost on the y-axis and theta on the x-axis. If there are two parameters, we can
go with a 3-D plot, with cost on one axis and the two parameters (thetas) along
the other two axes.
• The error surface can also be visualized using contours. A contour plot shows the
3-D surface in two dimensions, with the parameters along the axes and the cost drawn
as rings: every point on the same ring has the same cost, and the cost increases with
distance from the center (along any direction).
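A minimal sketch of how such an error surface is computed: evaluate the L2 cost over a grid of candidate (w0, w1) weights. The tiny dataset is hypothetical, chosen so the true weights (w0 = 1, w1 = 2) lie on the grid.

```python
# Evaluate the L2 cost over a grid of (w0, w1) candidates. Plotted as
# contours, equal-cost rings surround the minimum of the error surface.
xs = [0.0, 0.5, 1.0]
ys = [1.0, 2.0, 3.0]           # exact fit at w0 = 1, w1 = 2

def l2_cost(w0, w1):
    return 0.5 * sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

w0_grid = [i * 0.5 for i in range(9)]   # 0.0 .. 4.0 in steps of 0.5
w1_grid = [i * 0.5 for i in range(9)]
surface = [[l2_cost(w0, w1) for w1 in w1_grid] for w0 in w0_grid]

# The grid point with the smallest cost is the (coarse) minimum:
best = min((surface[i][j], w0_grid[i], w1_grid[j])
           for i in range(len(w0_grid)) for j in range(len(w1_grid)))
print(best)    # → (0.0, 1.0, 2.0): zero error at the true weights
```

Passing `surface` to a contour-plotting routine (e.g. matplotlib's `contour`) would produce the rings described above.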
• A common decay schedule sets the learning rate at iteration τ to a_τ = a0 × c/(c + τ),
where a0 is an initial learning rate (this is typically quite large, e.g., 1.0), c is a
constant that controls how quickly the learning rate decays (the value of this
parameter depends on how quickly the algorithm converges, but it is often set to
quite a large value, e.g., 100), and τ is the current iteration of the gradient descent
algorithm. Figure 7.8 shows the journey across the error surface and the related plot of
the sums of squared errors for the office rentals problem (using just the SIZE
descriptive feature) when learning rate decay is used with a0 = 0.18 and c = 10 (this is a
pretty simple problem, so smaller values for these parameters are suitable). This
example shows that the algorithm converges to the global minimum more quickly
than any of the approaches shown in Figure 7.7.
• The differences between Figures 7.7(f) and 7.8(b) most clearly show the impact of
learning rate decay as the initial learning rates are the same in these two instances.
When learning rate decay is used, there is much less thrashing back and forth
across the error surface than when the large static learning rate is used. Using
learning rate decay can even address the problem of an inappropriately large initial
learning rate causing the sum of squared errors to increase rather than decrease. Figure 7.9
shows an example of this in which learning rate decay is used with a0= 0.25 and
c=100. The algorithm starts at the position marked 1 on the error surface, and
learning steps actually cause it to move farther and farther up the error surface.
This can be seen in the increasing sums of squared errors in Figure 7.9(b).
• As the learning rate decays, however, the direction of the journey across the error
surface moves back downward, and eventually the global minimum is reached.
Although learning rate decay almost always leads to better performance than a
fixed learning rate, it still does require that problem-dependent values are chosen
for a0 and c.
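A sketch of one decay schedule consistent with the roles of a0, c, and the iteration counter described above; the exact rule a_τ = a0 × c/(c + τ) is an assumption here, and the parameter values match the worked example (a0 = 0.18, c = 10).

```python
# Learning-rate decay: start with a larger rate and shrink it as the
# iteration counter tau grows. The schedule a0 * c / (c + tau) is one
# common choice; a larger c makes the rate decay more slowly.
def decayed_rate(a0, c, tau):
    return a0 * c / (c + tau)

# With a0 = 0.18 and c = 10, the rate halves by iteration 10:
for tau in (0, 10, 50, 100):
    print(tau, round(decayed_rate(0.18, 10, tau), 4))
```

Early iterations take large steps across the error surface; later iterations take ever smaller ones, which is what damps the thrashing behaviour described above.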
Handling Categorical Target Features: Logistic Regression
Predicting categorical targets using linear regression
• Table 7.6 shows a sample dataset with a categorical target feature. This dataset contains
measurements of the revolutions per minute (RPM) that power station generators are running
at, the amount of vibration in the generators (VIBRATION), and an indicator to show whether
the generators proved to be working or faulty the day after these measurements were taken.
The RPM and VIBRATION measurements come from the day before the generators proved to
be operational or faulty. If power station administrators could predict upcoming generator
failures before the generators actually fail, they could improve power station safety and
save money on maintenance. Using this dataset, we would like to train a model to
distinguish between properly operating power station generators and faulty generators using
the RPM and VIBRATION measurements.
Figure 7.10(a) shows a scatter plot of this dataset in which we can see that there is a
good separation between the two types of generator. In fact, as shown in Figure 7.10(b),
we can draw a straight line across the scatter plot that perfectly separates the good generators
from the faulty ones. This line is known as a decision boundary, and because we can draw this
line, this dataset is said to be linearly separable in terms of the two descriptive features used.
As the decision boundary is a linear separator, it can be defined using the equation of the line
(remember Equation (7.1)); the specific weights of the boundary shown in Figure 7.10(b) are
not reproduced here.
• We can solve both these problems by using a more sophisticated threshold function
that is continuous, and therefore differentiable, and that allows for the subtlety
desired: the logistic function
• The logistic function is given by logistic(x) = 1 / (1 + e^(−x)), which maps any real-valued input to the range (0, 1).
• To see the impact of this, we can build a multivariable logistic regression model for
the dataset in Table 7.6[339]. After the training process (which uses a slightly
modified version of the gradient descent algorithm, which we will explain shortly),
the resulting logistic regression model applies the logistic function to a weighted
sum of the RPM and VIBRATION features (the trained weights are not reproduced here).
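A minimal sketch of how a trained logistic regression model produces a prediction for this scenario. The weights below are illustrative placeholders, not the actual trained values from the text:

```python
import math

def logistic(x):
    """The logistic function: 1 / (1 + e^-x), mapping any real x to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_faulty(rpm, vibration, w0, w1, w2):
    """Probability that a generator is faulty: the logistic function
    applied to a weighted sum of the descriptive features."""
    return logistic(w0 + w1 * rpm + w2 * vibration)

# Illustrative (not trained) weights:
p = predict_faulty(rpm=600, vibration=70, w0=-5.0, w1=0.002, w2=0.05)
print(round(p, 3))
# Outputs above 0.5 would be classified as the 'faulty' target level.
```

Because the logistic function is continuous and differentiable, the same gradient descent machinery used for linear regression can (with small modifications) train these weights.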
Modeling Non-Linear Relationships
• All the simple linear regression and logistic regression models that we have looked
at so far model a linear relationship between descriptive features and a target feature.
Linear models work very well when the underlying relationships in the data are
linear. Sometimes, however, the underlying data will exhibit non-linear relationships
that we would like to capture in a model. For example, the dataset in Table 7.9 is
based on an agricultural scenario and shows rainfall (in mm per day), RAIN, and
resulting grass growth (in kilograms per acre per day), GROWTH, measured on a
number of Irish farms during July 2012. A scatter plot of these two features is shown
in Figure 7.16(a), from which the strong non-linear
relationship between rainfall and grass growth is clearly apparent: grass does not
grow well when there is very little rain or too much rain, but hits a sweet spot at
rainfall of about 2.5 mm per day. It would be useful for farmers to be able to predict
grass growth for different amounts of forecasted rainfall so that they could plan the
optimal times to harvest their grass for making hay.
• A simple linear regression model cannot handle this non-linear relationship. Figure
7.16(b) shows the best simple linear regression model that can be trained for this
prediction problem; as the plot makes clear, a straight line cannot capture the peak in
growth at moderate rainfall (the fitted model equation is not reproduced here).
To successfully model the relationship between grass growth and rainfall, we need to
introduce non-linear elements. A generalized way in which to do this is to introduce
basis functions that transform the raw inputs to the model into non-linear
representations but still keep the model itself linear in terms of the weights. The
advantage of this is that, except for introducing the mechanism of basis functions, we
do not need to make any other changes to the approach we have presented so far.
Furthermore, basis functions work both for simple multivariable linear regression
models that predict a continuous target feature and for multivariable logistic regression
models that predict a categorical target feature.
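A sketch of basis-function expansion for the grass-growth scenario, using a quadratic basis and NumPy's least-squares solver as a stand-in for the gradient descent training described earlier. The rainfall and growth values are hypothetical:

```python
import numpy as np

# Basis-function expansion: transform the single input RAIN into the
# basis (1, rain, rain^2). The model stays linear in the weights even
# though the fitted curve is non-linear in rain.
rain   = np.array([0.5, 1.0, 2.0, 2.5, 3.0, 4.0, 5.0])   # hypothetical mm/day
growth = np.array([30., 60., 95., 100., 95., 60., 25.])  # hypothetical kg/acre

phi = np.column_stack([np.ones_like(rain), rain, rain ** 2])  # basis features
weights, *_ = np.linalg.lstsq(phi, growth, rcond=None)        # linear-in-weights fit

def predict_growth(r):
    return weights[0] + weights[1] * r + weights[2] * r ** 2

print(predict_growth(2.5))   # near the sweet spot of the fitted curve
```

The negative weight on the squared term gives the downward-curving shape that captures the sweet spot, something no straight line can represent.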
Multinomial Logistic Regression
• The multinomial logistic regression model is an extension that handles categorical
target features with more than two levels. A good way to build multinomial logistic
regression models is to use a set of one-versus-all models. If we have r target levels,
we create r one-versus-all logistic regression models. A one-versus-all model
distinguishes between one level of the target feature and all the others. Figure 7.20
shows three one-versus-all prediction models for a prediction problem with three target
levels (these models are based on the dataset in Table 7.11).
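A sketch of the one-versus-all scheme: score a query under each of the r logistic models and predict the level whose model gives the highest normalized score. The three models' weights are hypothetical placeholders:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# One logistic model per target level; each distinguishes "this level"
# from all the others. The (w0, w1, w2) weights are illustrative only.
models = {
    "level-1": (0.5, 1.2, -0.8),
    "level-2": (-0.3, -0.4, 1.1),
    "level-3": (-1.0, -0.6, -0.7),
}

def predict_level(d1, d2):
    """Score the query under each one-versus-all model, normalise the
    scores to sum to 1, and return the highest-scoring level."""
    scores = {lvl: logistic(w0 + w1 * d1 + w2 * d2)
              for lvl, (w0, w1, w2) in models.items()}
    total = sum(scores.values())
    probs = {lvl: s / total for lvl, s in scores.items()}
    return max(probs, key=probs.get), probs

level, probs = predict_level(1.0, 0.0)
print(level)
```

Normalizing the r model outputs so they sum to 1 lets them be read as a probability distribution over the target levels.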
• where q is the set of descriptive features for a query instance; (d1, t1), …, (ds, ts) are
the s support vectors (instances composed of descriptive features and a target feature);
w0 is the first weight of the decision boundary; and a set of parameters determined
during the training process is included, one for each support vector. When the output
of this equation is greater than 1, we predict the positive target level for the query,
and when the output is less than −1, we predict the negative target level. An important
feature of this equation is that the support vectors are a component of the equation.
This reflects the fact that a support vector machine uses the support vectors to define
the separating hyperplane and hence to make the actual model predictions.
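The structure of that prediction equation can be sketched with a linear kernel. The support vectors, per-support-vector parameters (commonly denoted alpha), and w0 below are hypothetical placeholders, not values learned by an actual training run:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def svm_output(q, support_vectors, targets, alphas, w0):
    """w0 plus the alpha- and target-weighted sum of dot products
    between each support vector and the query q (linear kernel)."""
    return w0 + sum(a * t * dot(d, q)
                    for d, t, a in zip(support_vectors, targets, alphas))

support_vectors = [(1.0, 1.0), (-1.0, -1.0)]   # hypothetical d_i
targets = [1.0, -1.0]                          # t_i in {+1, -1}
alphas = [1.0, 1.0]                            # hypothetical parameters
w0 = 0.0

out = svm_output((2.0, 2.0), support_vectors, targets, alphas, w0)
print(out)   # greater than 1, so predict the positive target level
```

Note that only the support vectors enter the sum: the rest of the training data plays no role at prediction time, which is the defining property of a support vector machine.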
Figure 7.23 shows two different decision boundaries that satisfy these constraints. Note
that the decision boundaries in these examples are positioned equally between the positive
and negative instances; this equal spacing is a consequence of the margin constraints. The
support vectors are highlighted in Figure 7.23 for each of the decision boundaries shown.