Linear Regression
• When we have more than one input we can use Ordinary Least Squares to
estimate the values of the coefficients
• Given a regression line through the data we calculate the distance from each
data point to the regression line, square it, and sum all of the squared errors
together
• When there are one or more inputs, you can optimize the values of the coefficients by iteratively minimizing the error of the model on the training data
• This operation is called Gradient Descent and works by starting with zero values for
each coefficient
• The sum of the squared errors is calculated for each pair of input and output values
• A learning rate is used as a scale factor and the coefficients are
updated in the direction towards minimizing the error
• We can estimate the coefficient B1 from our dataset as follows:
B1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
• Where mean() is the average value for the variable in our dataset
• We can then calculate B0 using B1 and some statistics from our dataset, as follows:
B0 = mean(y) - B1 * mean(x)
• Find the coefficients for the simple linear regression equation
• Plot these predictions as a line with the given data.
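• A minimal Python sketch of these steps, using a small illustrative (x, y) dataset (the values are hypothetical, not from the slides):

import numpy as np
import matplotlib.pyplot as plt

# Illustrative dataset of (x, y) pairs
x = np.array([1, 2, 4, 3, 5], dtype=float)
y = np.array([1, 3, 3, 2, 5], dtype=float)

# B1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# B0 = mean(y) - B1 * mean(x)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # 0.4 and 0.8 for this data

# Plot the predictions as a line with the given data
predictions = b0 + b1 * x
plt.scatter(x, y)
plt.plot(x, predictions, color="red")
plt.show()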
Estimating Error
• Calculate an error score for the predictions called the Root Mean Squared Error or RMSE:
RMSE = sqrt( sum((p_i - y_i)^2) / n )
• Where p is the predicted value, y is the actual value, and i is the index of a specific instance, because we must calculate the error across all n predicted values
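• A minimal RMSE sketch in Python, reusing the illustrative predictions from the regression example above (B0 = 0.4, B1 = 0.8):

import math

def rmse(predicted, actual):
    # Root Mean Squared Error: square root of the average squared error
    sum_sq = sum((p - y) ** 2 for p, y in zip(predicted, actual))
    return math.sqrt(sum_sq / len(actual))

predicted = [1.2, 2.0, 3.6, 2.8, 4.4]
actual = [1, 3, 3, 2, 5]
print(rmse(predicted, actual))  # about 0.693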
Shortcut
• There is a shortcut that you can use to quickly estimate the value for B1:
B1 = corr(x, y) * stdev(y) / stdev(x)
• Where corr(x, y) is the correlation between x and y and stdev() is the standard deviation of a variable
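• A hedged sketch of the shortcut, assuming the same illustrative dataset; it gives the same B1 estimate as the longer calculation:

import numpy as np

x = np.array([1, 2, 4, 3, 5], dtype=float)
y = np.array([1, 3, 3, 2, 5], dtype=float)

# B1 = corr(x, y) * stdev(y) / stdev(x)
b1 = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
print(b1)  # about 0.8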
Basics of ML
Parametric and Nonparametric Machine Learning Algorithms
• Assumptions can greatly simplify the learning process, but can also
limit what can be learned.
• Algorithms that simplify the function to a known form are called
parametric machine learning algorithms.
• “A learning model that summarizes data with a set of parameters of
fixed size (independent of the number of training examples) is called a
parametric model. No matter how much data you throw at a
parametric model, it won't change its mind about how many
parameters it needs.”
Nonparametric Machine Learning Algorithms
• Nonparametric methods are good when you have a lot of data and no
prior knowledge, and when you don't want to worry too much about
choosing just the right features
• Nonparametric methods seek to best fit the training data in constructing the mapping function, whilst maintaining some ability to generalize to unseen data
• An easy to understand nonparametric model is the k-nearest neighbors
algorithm that makes predictions based on the k most similar training
patterns for a new data instance
• The method does not assume anything about the form of the mapping function other than that patterns that are close are likely to have a similar output variable.
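• A minimal k-nearest neighbors sketch, assuming Euclidean distance, majority voting, and a tiny hypothetical set of training patterns:

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, new_point, k=3):
    # Find the k most similar training patterns and vote on their labels
    neighbors = sorted(train, key=lambda row: euclidean(row[0], new_point))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training patterns: (features, label)
train = [((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # "A" - the closest patterns share that label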
Bias-Variance Trade-Off
• In supervised machine learning an algorithm learns a model from
training data
• The goal of any supervised machine learning algorithm is to best
estimate the mapping function (f) for the output variable (Y ) given
the input data (X)
• Mapping function is often called the target function because it is the
function that a given supervised machine learning algorithm aims to
approximate
• The prediction error for any machine learning algorithm can be
broken down into three parts:
• Bias Error
• Variance Error
• Irreducible Error
• Irreducible error cannot be reduced regardless of what algorithm is
used
• It is the error introduced from the chosen framing of the problem and
may be caused by factors like unknown variables that influence the
mapping of the input variables to the output variable
Bias Error
• Bias refers to the simplifying assumptions made by a model to make the target function easier to learn
• Parametric algorithms have a high bias, making them fast to learn and easier to understand, but generally less flexible
• Variance is the amount that the estimate of the target function will change
if different training data was used
• Ideally, should not change too much from one training dataset to the next
• High Variance: Suggests large changes to the estimate of the target function with
changes to the training dataset
• In machine learning we describe the learning of the target function from training
data as inductive learning
• This allows us to make predictions in the future on data the model has
never seen
• Overfitting and underfitting are the two biggest causes for poor
performance of machine learning algorithms
Statistical Fit
• Overfitting refers to a model that models the training data too well
• Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data
• This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model
• The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize
• Overfitting is more likely with nonparametric and nonlinear models
that have more flexibility when learning a target function
• The decision tree is a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data
• This problem can be addressed by pruning a tree after it has learned
in order to remove some of the detail it has picked up
Underfitting in Machine Learning
• We should plot both the skill on the training data and the skill on a test dataset that we have held back from the training process
• Over time, as the algorithm learns, the error for the model on the
training data goes down and so does the error on the test dataset
• If we train for too long, the error on the training dataset may continue
to decrease because the model is overfitting and learning the
irrelevant detail and noise in the training dataset
• At the same time the error for the test set starts to rise again as the
model's ability to generalize decreases
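• A hedged sketch of this idea using scikit-learn (the synthetic data, model, and depths are illustrative assumptions): instead of tracking error over training time, it tracks training and test error as a decision tree is allowed to grow deeper, typically showing training error falling while test error eventually rises:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, noisy data for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, tree.predict(X_train))
    test_err = mean_squared_error(y_test, tree.predict(X_test))
    print(f"depth={depth:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")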
How To Limit Overfitting
• Use a resampling technique, such as k-fold cross-validation, to estimate model accuracy on unseen data
• Hold back a validation dataset to evaluate the model once it has been tuned
Gradient Descent
• Gradient descent is an optimization algorithm used to find the values of the coefficients of a function (f) that minimize a cost function
• Best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm
Gradient Descent Procedure
• Procedure starts off with initial values for the coefficient or coefficients for the function
coefficient = 0.0
• The cost of the coefficients is evaluated by plugging them into the
function and calculating the cost
cost = f(coefficient)
cost = evaluate(f(coefficient))
• Derivative of the cost is calculated which refers to the slope of the
function at a given point
• We need to know the slope to find the direction (sign) to move the
coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)
• We can now update the coefficient values
• A learning rate parameter (alpha) must be specified that controls how
much the coefficients can change on each update
coefficient = coefficient - (alpha x delta)
• Process is repeated until the cost of the coefficients (cost) is 0.0 or no
further improvements in cost can be achieved
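• A minimal sketch of the procedure in Python, using an illustrative one-coefficient cost function f(c) = (c - 4)^2 whose derivative is written out analytically:

def f(coefficient):
    # Illustrative cost function with its minimum at coefficient = 4.0
    return (coefficient - 4.0) ** 2

def derivative(coefficient):
    # Slope of the cost function at the current coefficient value
    return 2.0 * (coefficient - 4.0)

coefficient = 0.0   # start with a zero value for the coefficient
alpha = 0.1         # learning rate: how much the coefficient can change per update

for _ in range(100):
    cost = f(coefficient)
    delta = derivative(coefficient)
    coefficient = coefficient - alpha * delta
    if cost < 1e-9:  # stop once the cost is (near) zero
        break

print(coefficient)  # approaches 4.0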
Batch Gradient Descent
• Goal of all supervised machine learning algorithms is to best estimate a
target function (f) that maps input data (X) onto output variables (Y )
• Different algorithms have different representations and different
coefficients, but many of them require a process of optimization to find
the set of coefficients that result in the best estimate of the target
function
• Common examples of algorithms with coefficients that can be
optimized using gradient descent are Linear Regression and Logistic
Regression
• How closely a machine learning model's estimate fits the target function can be evaluated in a number of different ways, often specific to the machine learning algorithm
• Cost function involves evaluating the coefficients in the machine
learning model by calculating a prediction for each training instance in
the dataset and comparing the predictions to the actual output values
then calculating a sum or average error
• From the cost function a derivative can be calculated for each
coefficient so that it can be updated
• Cost is calculated for a machine learning algorithm over the entire
training dataset for each iteration of the gradient descent algorithm
• One iteration of the algorithm is called one batch and this form of
gradient descent is referred to as batch gradient descent
• Batch gradient descent is the most common form of gradient descent
described in machine learning
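• A hedged sketch of batch gradient descent for simple linear regression, assuming the illustrative dataset from earlier; the gradient is computed over the entire training dataset on every iteration before the coefficients are updated:

import numpy as np

x = np.array([1, 2, 4, 3, 5], dtype=float)
y = np.array([1, 3, 3, 2, 5], dtype=float)

b0, b1 = 0.0, 0.0
alpha = 0.01

for _ in range(2000):
    # One batch = one pass over the whole training dataset
    errors = (b0 + b1 * x) - y
    # Gradients of the mean squared error with respect to each coefficient
    grad_b0 = 2.0 * np.mean(errors)
    grad_b1 = 2.0 * np.mean(errors * x)
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(b0, b1)  # converges towards the least squares estimates (about 0.4 and 0.8)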
Stochastic Gradient Descent
• Gradient descent can be slow to run on very large datasets
• One iteration of the gradient descent algorithm requires a prediction for each instance in the training dataset, so it can take a long time when there are many millions of instances
• For large amounts of data, a variation of gradient descent called stochastic gradient descent can be used
• In this variation, the gradient descent procedure described above is
run but the update to the coefficients is performed for each training
instance, rather than at the end of the batch of instances
• First step of the procedure requires that the order of the training
dataset is randomized
• Because the coefficients are updated after every training instance, the
updates will be noisy, jumping all over the place, and so will the
corresponding cost function
• By mixing up the order for the updates to the coefficients, it
harnesses this random walk and avoids getting stuck
• Update procedure for the coefficients is the same as that above,
except the cost is not summed or averaged over all training patterns,
but instead calculated for one training pattern
• Learning can be much faster with stochastic gradient descent for very large training datasets, and often only a small number of passes through the dataset is needed to reach a good or good enough set of coefficients, e.g. 1-to-10 passes through the dataset
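• A minimal stochastic gradient descent sketch for the same simple linear regression, assuming the same illustrative data: the order is shuffled and the coefficients are updated after every single training instance rather than after a full batch:

import random

data = [(1, 1), (2, 3), (4, 3), (3, 2), (5, 5)]  # illustrative (x, y) pairs
b0, b1 = 0.0, 0.0
alpha = 0.01

for _ in range(20):          # often only a few passes through the data are needed
    random.shuffle(data)     # randomize the order of the training dataset
    for x, y in data:
        error = (b0 + b1 * x) - y   # cost is taken for one training pattern only
        b0 -= alpha * error         # noisy, per-instance coefficient updates
        b1 -= alpha * error * x

print(b0, b1)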
Tips for Gradient Descent
• Learning Rate
• Value is a small real value such as 0.1, 0.001 or 0.0001
• Rescale Inputs
• Algorithm will reach the minimum cost faster if the shape of the cost function
is not skewed and distorted
• This can be achieved by rescaling all of the input variables (X) to the same range, such as between 0 and 1 (a small rescaling sketch follows this list)
• Few Passes
• Stochastic gradient descent often does not need more than 1-to-10 passes
through the training dataset to converge on good or good enough
coefficients.
• Plot Mean Cost
• The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent
• Averaging the cost over a number of updates can give a better picture of the learning trend for the algorithm
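• A minimal sketch of the rescaling tip above, assuming min-max normalization of each input variable into the range 0 to 1; the input values are illustrative:

import numpy as np

X = np.array([[50.0, 0.002],
              [80.0, 0.005],
              [65.0, 0.001]])   # two input variables on very different scales

# Min-max rescaling: map each column (input variable) into [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)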
Performance Measures
Confusion Matrix for Multiple Classes
• Recall for Label A: TP_A / (TP_A + FN_A), i.e. the number of instances correctly predicted as A divided by the total number of instances that are actually A
• Precision for Label A: TP_A / (TP_A + FP_A), i.e. the number of instances correctly predicted as A divided by the total number of instances predicted as A
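• A minimal sketch that computes recall and precision for each label from a multi-class confusion matrix, assuming rows are actual labels and columns are predicted labels (that orientation, and the matrix values, are illustrative assumptions):

import numpy as np

labels = ["A", "B", "C"]
cm = np.array([[25,  3,  2],    # actual A
               [ 4, 30,  1],    # actual B
               [ 1,  2, 32]])   # actual C

for i, label in enumerate(labels):
    tp = cm[i, i]
    recall = tp / cm[i, :].sum()     # TP / (TP + FN): diagonal cell over row total
    precision = tp / cm[:, i].sum()  # TP / (TP + FP): diagonal cell over column total
    print(f"{label}: recall={recall:.2f}, precision={precision:.2f}")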