Linear Regression

Linear regression is both a statistical algorithm and a machine learning algorithm. It is used to model the relationship between input variables (x) and an output variable (y) with a linear equation. The coefficients of the linear equation are estimated from the training data using techniques like ordinary least squares and gradient descent to minimize error. Linear regression makes simplifying assumptions about data but provides an easily interpretable model.

Linear Regression for Machine Learning

• One of the most well known and well understood algorithms in statistics and machine learning
• You do not need to know any statistics or linear algebra to understand linear regression
• Why linear regression belongs to both statistics and machine learning
• The many names by which linear regression is known
• The representation and learning algorithms used to create a linear regression model
• How to best prepare your data when modeling with linear regression
Isn't Linear Regression from Statistics?

• Machine learning, more specifically the field of predictive modeling, is primarily concerned with minimizing the error of a model or making the most accurate predictions possible, at the expense of explainability
• Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables, but has been borrowed by machine learning
• It is both a statistical algorithm and a machine learning algorithm


Many Names of Linear Regression

• Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y)
• Simple linear regression
• Multiple linear regression
• Least Squares Regression

Linear Regression Model Representation

• Attractive model because the representation is so simple
• Combines a specific set of input values (x) with the learned coefficients; the solution is the predicted output (y) for that set of input values
• Both the input values (x) and the output value (y) are numeric
• The formula for a regression line is Y' = bX + A, where Y' is the predicted score, b is the slope of the line, and A is the Y intercept (both are called coefficients)
• In higher dimensions, when we have more than one input (x), the line is called a plane or a hyperplane
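
As a minimal sketch (the function and variable names here are illustrative, not from the slides), this representation amounts to a one-line prediction function in Python:

def predict(x, b, a):
    # Simple linear regression prediction: Y' = b*X + A
    return b * x + a

print(predict(4, b=0.8, a=0.4))  # 3.6, using illustrative coefficient values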
Linear Regression Learning the Model

• Learning a linear regression model means estimating the values of the coefficients used in the representation with the data
• Techniques to prepare a linear regression model:
  • Ordinary Least Squares
  • Gradient Descent (the most common technique taught from a machine learning perspective)
Simple Linear Regression

• With simple linear regression, when we have a single input, we can use statistics to estimate the coefficients
• Requires that we calculate statistical properties from the data such as means, standard deviations, correlations and covariance
• All of the data must be available to traverse and calculate statistics
Ordinary Least Squares

• When we have more than one input we can use Ordinary Least Squares to estimate the values of the coefficients
• This procedure seeks to minimize the sum of the squared residuals
• Given a regression line through the data, we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together
• This is the quantity that ordinary least squares seeks to minimize
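
A minimal NumPy sketch of this procedure (illustrative, not from the slides; it uses the tutorial dataset introduced later and solves the least-squares problem directly):

import numpy as np

# Tutorial dataset of (x, y) pairs
x = np.array([1, 2, 4, 3, 5], dtype=float)
y = np.array([1, 3, 3, 2, 5], dtype=float)

# Design matrix: a column of ones for the intercept, then the input x
X = np.column_stack([np.ones_like(x), x])

# Minimize the sum of squared residuals ||X @ coeffs - y||^2
coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
B0, B1 = coeffs
print(B0, B1)  # approximately 0.4 and 0.8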


Gradient Descent

• When there are one or more inputs, you can use a process of optimizing the values of the coefficients by iteratively minimizing the error of the model on training data
• This operation is called Gradient Descent and works by starting with zero values for each coefficient
• The sum of the squared errors is calculated for each pair of input and output values
• A learning rate is used as a scale factor and the coefficients are updated in the direction towards minimizing the error
• The process is repeated until a minimum sum squared error is achieved or no further improvement is possible
Regularized Linear Regression

• There are extensions of the training of the linear model called regularization methods
• These methods seek both to minimize the sum of the squared error of the model on the training data (using Ordinary Least Squares) and to reduce the complexity of the model (such as the number or absolute size of the sum of all coefficients in the model)
• Two popular examples of regularization procedures for linear regression are:
  • Lasso Regression: Ordinary Least Squares is modified to also minimize the absolute sum of the coefficients (called L1 regularization)
  • Ridge Regression: Ordinary Least Squares is modified to also minimize the squared absolute sum of the coefficients (called L2 regularization)
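
A brief sketch of both procedures, assuming scikit-learn is available (the alpha values are arbitrary illustrations, not from the slides):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

X = np.array([[1.0], [2.0], [4.0], [3.0], [5.0]])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Lasso: adds an L1 penalty on the absolute sum of the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge: adds an L2 penalty on the squared sum of the coefficients
ridge = Ridge(alpha=0.1).fit(X, y)

print(lasso.coef_, lasso.intercept_)
print(ridge.coef_, ridge.intercept_)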
Preparing Data For Linear Regression

• Linear Assumption
  • You may need to transform data to make the relationship linear
• Remove Noise
  • Remove outliers in the output variable (y) if possible
• Remove Collinearity
  • Remove the most correlated input variables
• Rescale Inputs
  • Rescale input variables using standardization or normalization (see the sketch below)
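
A minimal sketch of the two rescaling options (illustrative helper functions, not from the slides):

import numpy as np

def standardize(X):
    # Rescale each input column to zero mean and unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize(X):
    # Rescale each input column to the range 0 to 1
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))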
Simple Linear Regression Tutorial

Introduction
• How to calculate a simple linear regression step-by-step
• How to make predictions on new data using your model
• A shortcut that greatly simplifies the calculation

Tutorial Data Set

x    y
1    1
2    3
4    3
3    2
5    5

Draw a scatter plot of x versus y.
Simple Linear Regression

• With simple linear regression we want to model our data as follows:
  y = B0 + B1 * x
• This is a line where y is the output variable we want to predict, x is the input variable we know, and B0 and B1 are coefficients that we need to estimate that move the line around
• Technically, B0 is called the intercept because it determines where the line intercepts the y-axis
• The B1 term is called the slope because it defines the slope of the line, or how x translates into a y value before we add our bias
• The goal is to find the best estimates for the coefficients to minimize the errors in predicting y from x
• Simple regression is great because, rather than having to search for values by trial and error or calculate them analytically using more advanced linear algebra, we can estimate them directly from our data
• To estimate the value for B1:
  B1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
  where mean() is the average value for the variable in our dataset
• We can calculate B0 using B1 and some statistics from our dataset, as follows:
  B0 = mean(y) - B1 * mean(x)
• Find the coefficients for the simple linear regression equation
• Plot these predictions as a line with the given data
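
A step-by-step sketch of these calculations on the tutorial dataset (plain Python; the variable names are illustrative):

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

mean_x = sum(x) / len(x)  # 3.0
mean_y = sum(y) / len(y)  # 2.8

# B1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # 8.0
denominator = sum((xi - mean_x) ** 2 for xi in x)                       # 10.0
B1 = numerator / denominator                                            # 0.8

# B0 = mean(y) - B1 * mean(x)
B0 = mean_y - B1 * mean_x                                               # 0.4

predictions = [B0 + B1 * xi for xi in x]
print(predictions)  # [1.2, 2.0, 3.6, 2.8, 4.4]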
Estimating Error

• Calculate an error score for the predictions, called the Root Mean Squared Error or RMSE:
  RMSE = sqrt( sum( (p_i - y_i)^2 ) / n )
• p is the predicted value, y is the actual value, i is the index for a specific instance and n is the number of instances, because we must calculate the error across all predicted values
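
Continuing the sketch above (self-contained; the predictions come from the coefficients B0 = 0.4 and B1 = 0.8):

from math import sqrt

y = [1, 3, 3, 2, 5]
predictions = [1.2, 2.0, 3.6, 2.8, 4.4]

# RMSE = sqrt( sum( (p_i - y_i)^2 ) / n )
rmse = sqrt(sum((p - yi) ** 2 for p, yi in zip(predictions, y)) / len(y))
print(rmse)  # approximately 0.693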
Shortcut

• There is a shortcut that you can use to quickly estimate the value for B1:
  B1 = corr(x, y) * stdev(y) / stdev(x)
  where corr(x, y) is the correlation between x and y, and stdev() is the standard deviation
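
A quick check of the shortcut using the standard library (statistics.correlation requires Python 3.10+):

import statistics as st

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

# B1 = corr(x, y) * stdev(y) / stdev(x)
B1 = st.correlation(x, y) * st.stdev(y) / st.stdev(x)
B0 = st.mean(y) - B1 * st.mean(x)
print(B1, B0)  # 0.8 and 0.4, matching the longer calculation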
Basics of ML

Parametric and Nonparametric Machine Learning Algorithms

• Assumptions can greatly simplify the learning process, but can also limit what can be learned
• Algorithms that simplify the function to a known form are called parametric machine learning algorithms
• "A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs."
Nonparametric Machine Learning Algorithms

• Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features
• Nonparametric methods seek to best fit the training data in constructing the mapping function, whilst maintaining some ability to generalize to unseen data
• An easy to understand nonparametric model is the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns (a sketch follows below)
• The method does not assume anything about the form of the mapping function other than that patterns that are close are likely to have a similar output variable
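
A minimal k-nearest neighbors sketch for a numeric output (illustrative, not from the slides):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Average the outputs of the k training patterns closest to x_new
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [4.0], [3.0], [5.0]])
y_train = np.array([1.0, 3.0, 3.0, 2.0, 5.0])
print(knn_predict(X_train, y_train, np.array([2.5])))  # 2.0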
Bias-Variance Trade-Off

• In supervised machine learning an algorithm learns a model from training data
• The goal of any supervised machine learning algorithm is to best estimate the mapping function (f) for the output variable (Y) given the input data (X)
• The mapping function is often called the target function because it is the function that a given supervised machine learning algorithm aims to approximate
• The prediction error for any machine learning algorithm can be broken down into three parts:
  • Bias Error
  • Variance Error
  • Irreducible Error
• Irreducible error cannot be reduced regardless of what algorithm is used
• It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable
Bias Error

• Bias is the set of simplifying assumptions made by a model to make the target function easier to learn
• Parametric algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible
• In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm's bias
• Low Bias: Suggests fewer assumptions about the form of the target function
• High Bias: Suggests more assumptions about the form of the target function
• Low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines
• High-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression
Variance Error

• Variance is the amount that the estimate of the target function will change if different training data were used
• The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance
• Ideally, it should not change too much from one training dataset to the next
• Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data
• Low Variance: Suggests small changes to the estimate of the target function with changes to the training dataset
• High Variance: Suggests large changes to the estimate of the target function with changes to the training dataset
• For example, decision trees have a high variance
• Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression
• Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines
Bias-Variance Trade-Off

• The goal of any supervised machine learning algorithm is to achieve low bias and low variance
• In turn, the algorithm should achieve good prediction performance
• Linear machine learning algorithms often have a high bias but a low variance
• Nonlinear machine learning algorithms often have a low bias but a high variance
Overfitting and Underfitting

Generalization in Machine Learning

• In machine learning we describe the learning of the target function from training data as inductive learning
• Induction refers to learning general concepts from specific examples, which is exactly the problem that supervised machine learning aims to solve
• Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning
• The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain
• This allows us to make predictions in the future on data the model has never seen
• Overfitting and underfitting are the two biggest causes of poor performance of machine learning algorithms
Statistical Fit

• A fit refers to how well you approximate a target function
• In statistics, goodness of fit refers to measures used to estimate how well the approximation of the function matches the target function
Overfitting in Machine Learning

• Overfitting refers to a model that models the training data too well
• Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data
• This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model
• The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize
• Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function
• Decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data
• This problem can be addressed by pruning a tree after it has learned, in order to remove some of the detail it has picked up
Underfitting in Machine Learning

• Underfitting refers to a model that can neither model the training data nor generalize to new data
• An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data
• Underfitting is often not discussed, as it is easy to detect given a good performance metric
• The remedy is to move on and try alternate machine learning algorithms
A Good Fit in Machine Learning

• We should plot both the skill on the training data and the skill on a test dataset we have held back from the training process
• Over time, as the algorithm learns, the error for the model on the training data goes down, and so does the error on the test dataset
• If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset
• At the same time, the error for the test set starts to rise again as the model's ability to generalize decreases
How To Limit Overfitting

• Both overfitting and underfitting can lead to poor model performance
• The most common problem in applied machine learning is overfitting
• Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data
• Two important techniques that can be used when evaluating machine learning algorithms to limit overfitting (a sketch follows below):
  • Use a resampling technique to estimate model accuracy (k-fold cross-validation)
  • Hold back a validation dataset (a validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project)
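
A brief sketch of both techniques, assuming scikit-learn (the model, data and split sizes are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 100)

# Technique 1: k-fold cross-validation (here k=5) to estimate model skill
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())

# Technique 2: hold back a validation dataset until the very end
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_val, y_val))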
Gradient Descent For Machine Learning

• Optimization is a big part of machine learning
• Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimizes a cost function (cost)
• It is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm
Gradient Descent Procedure

• The procedure starts off with initial values for the coefficient or coefficients of the function:
  coefficient = 0.0
• The cost of the coefficients is evaluated by plugging them into the function and calculating the cost:
  cost = f(coefficient)
  cost = evaluate(f(coefficient))
• The derivative of the cost is calculated, which refers to the slope of the function at a given point
• We need to know the slope to find the direction (sign) to move the coefficient values in order to get a lower cost on the next iteration:
  delta = derivative(cost)
• We can now update the coefficient values
• A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update:
  coefficient = coefficient - (alpha * delta)
• The process is repeated until the cost of the coefficients (cost) is 0.0 or no further improvements in cost can be achieved
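
A runnable sketch of this procedure on the illustrative cost function f(coefficient) = coefficient^2, whose derivative is 2 * coefficient (the function, starting value and learning rate are assumptions for demonstration):

def f(coefficient):
    # Illustrative cost function with its minimum at coefficient = 0.0
    return coefficient ** 2

def derivative(coefficient):
    # Slope of f at the given point: d/dc of c^2 is 2c
    return 2 * coefficient

coefficient = 1.0  # a nonzero starting value, so there is something to minimize
alpha = 0.1        # learning rate

for _ in range(50):
    cost = f(coefficient)
    delta = derivative(coefficient)
    coefficient = coefficient - (alpha * delta)

print(coefficient, f(coefficient))  # both approach 0.0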
Batch Gradient Descent

• The goal of all supervised machine learning algorithms is to best estimate a target function (f) that maps input data (X) onto output variables (Y)
• Different algorithms have different representations and different coefficients, but many of them require a process of optimization to find the set of coefficients that result in the best estimate of the target function
• Common examples of algorithms with coefficients that can be optimized using gradient descent are Linear Regression and Logistic Regression
• The evaluation of how closely a machine learning model estimates the target function can be calculated a number of different ways, often specific to the machine learning algorithm
• The cost function involves evaluating the coefficients in the machine learning model by calculating a prediction for each training instance in the dataset, comparing the predictions to the actual output values, and then calculating a sum or average error
• From the cost function, a derivative can be calculated for each coefficient so that it can be updated
• The cost is calculated for a machine learning algorithm over the entire training dataset for each iteration of the gradient descent algorithm
• One iteration of the algorithm is called one batch, and this form of gradient descent is referred to as batch gradient descent
• Batch gradient descent is the most common form of gradient descent described in machine learning
Stochastic Gradient Descent

• Gradient descent can be slow to run on very large datasets
• One iteration of the gradient descent algorithm requires a prediction for each instance in the training dataset, and hence takes a long time when there are many millions of instances
• For large amounts of data, a variation of gradient descent can be used called stochastic gradient descent
• In this variation, the gradient descent procedure described above is run, but the update to the coefficients is performed for each training instance, rather than at the end of the batch of instances
• The first step of the procedure requires that the order of the training dataset is randomized
• Because the coefficients are updated after every training instance, the updates will be noisy, jumping all over the place, and so will the corresponding cost function
• By mixing up the order of the updates to the coefficients, it harnesses this random walk and avoids getting stuck
• The update procedure for the coefficients is the same as that above, except the cost is not summed or averaged over all training patterns, but instead calculated for one training pattern
• Learning can be much faster with stochastic gradient descent for very large training datasets, and often only a small number of passes through the dataset is needed to reach a good or good enough set of coefficients, e.g. 1-to-10 passes through the dataset
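
A minimal stochastic gradient descent sketch for the simple linear regression model y = B0 + B1 * x, using the tutorial dataset (the learning rate and number of passes are illustrative):

import random

data = [(1, 1), (2, 3), (4, 3), (3, 2), (5, 5)]
b0, b1 = 0.0, 0.0  # start with zero values for each coefficient
alpha = 0.01       # learning rate
epochs = 50        # passes through the dataset

for _ in range(epochs):
    random.shuffle(data)  # randomize the order of the training instances
    for x, y in data:
        error = (b0 + b1 * x) - y  # error for this single training instance
        # Update each coefficient immediately, in the direction that lowers the error
        b0 = b0 - alpha * error
        b1 = b1 - alpha * error * x

print(b0, b1)  # approaches B0 = 0.4 and B1 = 0.8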
Tips for Gradient Descent

• Plot Cost versus Time
  • The expectation for a well performing gradient descent run is a decrease in cost on each iteration
  • If it does not decrease, try reducing your learning rate
• Learning Rate
  • The value is a small real value such as 0.1, 0.001 or 0.0001
• Rescale Inputs
  • The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted
  • This can be achieved by rescaling all of the input variables (X) to the same range, such as between 0 and 1
• Few Passes
  • Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients
• Plot Mean Cost
  • The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent; plotting the cost averaged over a number of updates gives a clearer view
Performance Measures

Confusion Matrix for Multiple Classes

• Recall for Label A: the fraction of actual A instances that were predicted as A, i.e. TP_A / (TP_A + FN_A)
• Precision for Label A: the fraction of instances predicted as A that are actually A, i.e. TP_A / (TP_A + FP_A)
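
A minimal sketch computing per-label recall and precision from a multiclass confusion matrix (the counts are made up for illustration; rows are actual labels and columns are predicted labels):

import numpy as np

labels = ["A", "B", "C"]
cm = np.array([
    [25,  3,  2],  # actual A
    [ 4, 30,  1],  # actual B
    [ 1,  2, 32],  # actual C
])

for i, label in enumerate(labels):
    tp = cm[i, i]
    recall = tp / cm[i, :].sum()     # TP / (TP + FN): divide by the row total
    precision = tp / cm[:, i].sum()  # TP / (TP + FP): divide by the column total
    print(f"{label}: recall={recall:.2f}, precision={precision:.2f}")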
