ML-Unit I - Linear Regression

- The document discusses linear regression, starting from data collected from product reviews: each review has a rating, review text, vote count, and product quality label.
- It then uses the collected data to predict the number of votes a future review will receive, based on the word count of its review text. This is framed as a regression problem, since the goal is to predict a continuous output value.
- Examples of simple linear regression are worked through, first using only the vote data and calculating its mean and residuals. The goal of linear regression is to minimize the sum of squared residuals and so find the best-fitting line.


Machine Learning

Dr. Sunil Saumya


IIIT Dharwad
Linear Regression
Data Collection

Rating: 1 | Vote: 52 | Product Quality: Low
Review Text: "It's stops working just after 1 month .. It's a fault in it's design or circuit ..maybe it's not sweat proff.. but I would suggest don't buy it.."
Data Collection Continue..

Rating: 1 | Vote: 52 | Product Quality: Low
Review Text: "It's stops working just after 1 month .. It's a fault in it's design or circuit ..maybe it's not sweat proff.. but I would suggest don't buy it.."

Rating: 1 | Vote: 6 | Product Quality: Low
Review Text: "Not good"

Rating: 5 | Vote: 42 | Product Quality: High
Review Text: "It's a killer one... I've been using Sennheiser cx180 for a year and it's damaged, so I got this one for me. So….."
Data Collection Continue..

Rating: 1 | Vote: 52 | Product Quality: Low (0)
Review Text: "It's stops working just after 1 month .. It's a fault in it's design or circuit ..maybe it's not sweat proff.. but I would suggest don't buy it.."

Rating: 1 | Vote: 6 | Product Quality: Low (0)
Review Text: "Not good"

Rating: 5 | Vote: 42 | Product Quality: High (1)
Review Text: "It's a killer one... I've been using Sennheiser cx180 for a year and it's damaged, so I got this one for me. So….."

Rating: 1 | Vote: ?? | Product Quality: Low (0)
Review Text: "Nit a good product! Mine stopped working with in 1st month of purchase! And now iam struggling with warranty claim!"
Data Collection Continue..

Rating: 1 | Word Count: 27 | Vote: 52 | Product Quality: Low (0)
Review Text: "It's stops working just after 1 month .. It's a fault in it's design or circuit ..maybe it's not sweat proff.. but I would suggest don't buy it.."

Rating: 1 | Word Count: 2 | Vote: 6 | Product Quality: Low (0)
Review Text: "Not good"

Rating: 5 | Word Count: 100 | Vote: 42 | Product Quality: High (1)
Review Text: "It's a killer one... I've been using Sennheiser cx180 for a year and it's damaged, so I got this one for me. So….."

Rating: 1 | Word Count: 20 | Vote: ?? | Product Quality: Low (0)
Review Text: "Nit a good product! Mine stopped working with in 1st month of purchase! And now iam struggling with warranty claim!"
Problem Statement

Dataset:
Word Count | Vote
27         | 52
2          | 6
100        | 42
20         | ??

● Problem statement:
  ○ For every review (given in terms of word count in the dataset) that will be posted on the e-commerce website, predict how many votes it will receive.
● Here, we have both input and output given in the dataset, so it is a supervised problem.
● Second, looking at the vote column (the output), it contains continuous values, not categorical values.
  ○ Hence, it is a regression problem.
One-dimensional data

Dataset:
Sl. No. | Vote
1       | 5
2       | 17
3       | 11
4       | 8
5       | 14
6       | 5

● Unfortunately, while storing the data we collected only the vote and not the review word count.
● So, this is the best data we have now, and we have to find what the vote will be for the next review.
● How will you predict the vote count for a future review based only on this data?
Data Visualization

[Scatter plot of the six vote values from the dataset above.]
Best line for the given data

[The same scatter plot with a horizontal line drawn through it at Ŷ = 10.]
“Mean”: Best line for the given data

● With only one variable, and no other information, the best prediction for the next measurement is the mean itself:
      Ŷ = (5 + 17 + 11 + 8 + 14 + 5) / 6 = 10
● The variability in the vote can only be explained by the vote itself.
“Goodness of fit” for the Vote

● A residual is the difference between an observed vote and the line Ŷ = 10.
● Residuals are also known as errors.
● Residuals always add up to zero.
  ○ In this case the residuals above the line sum to +12 and those below the line sum to -12.
Squaring the residuals

Sl. No. | R  | R²
1       | -5 | 25
2       | +7 | 49
3       | +1 | 1
4       | -2 | 4
5       | +4 | 16
6       | -5 | 25

Sum of squared errors (SSE) = 25 + 49 + 1 + 4 + 16 + 25 = 120

● Why square the residuals?
  ○ It makes them positive.
  ○ It emphasizes larger deviations from the mean.
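To make the arithmetic above concrete, here is a minimal sketch in Python that reproduces the mean prediction, the zero-sum residuals, and the SSE of 120 for the vote data (the variable names are ours, not from the slides):

```python
votes = [5, 17, 11, 8, 14, 5]

# With only one variable, the best prediction is the mean itself.
y_hat = sum(votes) / len(votes)          # 10.0

residuals = [v - y_hat for v in votes]   # [-5, 7, 1, -2, 4, -5]
sse = sum(r ** 2 for r in residuals)     # 120.0

print(f"mean = {y_hat}, sum of residuals = {sum(residuals)}, SSE = {sse}")
```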
Important points

● The goal of simple linear regression is to create a linear model that minimizes the sum of squares of the residuals/errors (SSE).
● When conducting simple linear regression with two variables, we will determine how well that line “fits” the data by comparing it to this type of line, where we pretend the second variable does not exist.
● If a two-variable regression model looks like this example, the other variable does nothing to explain the dependent variable.
Important points

● Simple linear regression is really a comparison of two models.
  ○ One is where the independent variable does not even exist.
  ○ The other uses the best-fit regression line.
● If there is only one variable in the dataset, the best prediction is given by the mean of the dependent variable.
● The difference between the best-fit line and an observed value is called the residual (or error).
● The residuals are squared and summed together to give the sum of squared residuals/errors (SSE).
● Simple linear regression is designed to find the best-fitting line through the data, the one that minimizes the SSE.
Linear Regression with independent variable

● Linear regression is a statistical method for finding the relationship between independent and dependent variables.
● Why do we call them Independent and Dependent variables?
  ○ (The example dataset here has years of experience as the input and salary in 1000$ as the output.)
  ○ Our independent variable is independent because we cannot mathematically determine the years of experience.
  ○ But we can determine/predict the salary column values (the dependent variable) based on years of experience.
Linear Regression with independent variable

● If you look at the data, the dependent column values (Salary in 1000$) increase or decrease based on years of experience.
● Total Sum of Squares (SST):
  ○ The SST is the sum of all squared differences between the mean of a sample and the individual values in that sample. It is represented mathematically by the formula

      SST = Σᵢ (yᵢ − ȳ)²
Linear Regression with independent variable

● Total Sum of Squares (SST):
  ○ For the salary dataset, the total sum of squares SST comes out to 5226.19.
Ordinary Least Square (OLS) Linear Regression

● The linear regression model's objective is to make the SSE value as small as possible.
● OLS works on the slope-intercept form of a line to determine the relationship between the independent variables and the dependent variable.
● The slope-intercept equation is
      y = mx + b
  where ‘m’ is the slope, ‘x’ is the independent variable, and ‘b’ is the intercept.
Ordinary Least Square (OLS) Linear Regression

● To use the OLS method, we apply the formulas below to find the slope and intercept of the equation y = mx + b:

      m = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
      b = ȳ − m·x̄
Ordinary Least Square (OLS) Linear Regression

● Plugging the salary data into these formulas:

      m = 1037.8 / 216.19 = 4.80
      b = 45.44 − 4.80 × 7.56 = 9.15

● Hence, y = mx + b becomes
      y = 4.80x + 9.15
Ordinary Least Square (OLS) Linear Regression

● Let’s compare our OLS method result with MS-Excel.
● Yes, we can test our linear regression best-fit line in Microsoft Excel.

      Our OLS method output       → y = 4.80x + 9.15
      MS-Excel Linear Reg. output → y = 4.79x + 9.18
Ordinary Least Square (OLS) Linear Regression

● Let us calculate SSE again by using our output equation y = 4.79x + 9.18:

      SSE before OLS (predicting the mean): 5226.19
      SSE with OLS: 245.38
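The salary dataset itself appears only in the slide images, but this before/after comparison is easy to reproduce on any dataset. A minimal helper, with hypothetical usage shown in comments (all names are ours):

```python
def sse(xs, ys, predict):
    """Sum of squared errors of a prediction function over a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical usage, comparing the two models above:
#   sse(xs, ys, lambda x: y_bar)       # "before OLS": always predict the mean
#   sse(xs, ys, lambda x: m * x + b)   # "with OLS":   predict with the fitted line
```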
Linear Regression: on review Dataset

Dataset:
Word Count | Vote
27         | 52
2          | 6
100        | 42
40         | 38
14         | 30
20         | ??

● In our dataset we have only one independent variable (or input, x), “Word count”; therefore, we can use simple linear regression.
● As we know, a linear relationship is always represented by a straight line.
Linear Regression

● Now, let's come back to our main dataset and find the relationship between Word count (X) and Vote (Y).
OLS Linear Regression

● We know that a linear relationship can be obtained by drawing a straight line between Word count (X) and Vote (Y), which is given as:
      Y = mx + c
  where m = slope and c = intercept.
OLS Linear Regression

● First, compute the averages of X and Y:

      Avg (x) = (27 + 2 + 100 + 40 + 14) / 5 = 36.6
      Avg (y) = (52 + 6 + 42 + 38 + 30) / 5 = 33.6
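From these means, the OLS formulas shown earlier give the slope and intercept. A minimal sketch of the full calculation in Python (variable names are ours; the numbers in the comments are what the formulas yield for this data, not values from the slides):

```python
word_count = [27, 2, 100, 40, 14]   # X
votes      = [52, 6, 42, 38, 30]    # Y

x_bar = sum(word_count) / len(word_count)   # 36.6
y_bar = sum(votes) / len(votes)             # 33.6

# OLS closed form: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², c = ȳ − m·x̄
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(word_count, votes))
den = sum((x - x_bar) ** 2 for x in word_count)
m = num / den            # ≈ 0.24
c = y_bar - m * x_bar    # ≈ 24.77

print(f"Y = {m:.2f}x + {c:.2f}")
# The unseen review with word count 20 would then be predicted to
# receive roughly m*20 + c ≈ 29.6 votes.
```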
OLS Linear Regression

[Plot of the review dataset with the fitted straight line drawn through the points.]
OLS Linear Regression

But, is it the best line?
Linear Regression: choosing best line

● But, is it the best line?
  ○ We can get multiple lines if we change the values of m and c in the equation Y = mx + c.
● To get the best line we will use the gradient descent algorithm.
● Idea: choose m and c such that f(x) is close to y for our training examples (x, y).
● Therefore, we need to minimize the difference between f(x) and y.
Re-writing hypothesis for finding best line

● The best regression line is the one for which we get the least error.
● Objective: of all possible lines, find the one that minimizes the distance between the predicted y values (on the line) and the true y values.

      Hypothesis Function: h(x) = w0 + w1x
Cost function: Mean Squared Error

● Objective: of all possible lines, find the one that minimizes the distance between the predicted y values (on the line) and the true y values.

      Hypothesis Function: h(x) = w0 + w1x

● In other words, find the w0 and w1 that minimize the cost function J(w) over our n training examples (xᵢ, yᵢ):

      J(w0, w1) = (1/2n) Σᵢ (h(xᵢ) − yᵢ)²

  (The factor 1/2 is conventional; it cancels when taking derivatives.)
Cost function (MSE): Intuition

● As a simplification for the moment, let’s set w0 to be zero.
● This means that our line will pass through the origin.
● Our hypothesis is then h(x) = 0 + w1x = w1x.
● Our cost function is then

      J(w1) = (1/2n) Σᵢ (w1xᵢ − yᵢ)²

● Our goal is to find the w1 that minimizes J(w1).
Cost function (MSE): Intuition

● Suppose we have the following three training examples (shown as a plot on the slides), and consider w0 = 0.
● Let’s consider the cost J(w1) associated with different values of w1, e.g., w1 = 0.5, w1 = 0, and w1 = 2.
● [For each choice of w1, the slides plot the line h(x) = w1x against the training points, together with the corresponding point on the J(w1) curve.]
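The three training examples themselves appear only in the slide images. As an illustration, the sketch below assumes the classic toy set {(1, 1), (2, 2), (3, 3)} (our assumption, not from the slides) and evaluates J(w1) at the values used above:

```python
# Assumed toy training set (not from the slides): y = x exactly.
data = [(1, 1), (2, 2), (3, 3)]

def cost(w1, data):
    """Squared-error cost J(w1) for h(x) = w1 * x (with w0 fixed at 0)."""
    n = len(data)
    return sum((w1 * x - y) ** 2 for x, y in data) / (2 * n)

for w1 in [0.0, 0.5, 1.0, 2.0]:
    print(f"w1 = {w1}: J(w1) = {cost(w1, data):.3f}")
# J is 0 at w1 = 1 (the line through the points) and grows as w1 moves
# away in either direction, tracing out a bowl-shaped curve.
```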
Cost function (MSE): of two parameters (W0, W1)

[The slides plot the cost function J(w0, w1) over both parameters together.]
Gradient Descent

● We want to find the line that best fits the data, i.e., we want to find the w0 and w1 that minimize the cost J(w0, w1).

Gradient descent algorithm:

  Step 1: Start with some w0 and w1 (e.g., w0 = 0 and w1 = 0).

  Step 2: Keep changing w0 and w1 to reduce the cost J(w0, w1) until hopefully we end up at a minimum.
Gradient Descent; W0 = 0

● We want to find the line (passing through the origin) that best fits the data, i.e., we want to find the w1 that minimizes the cost J(w1).

Gradient descent algorithm:

  Step 1: Start with some w1 (e.g., w1 = 0).

  Step 2: Keep changing w1 to reduce the cost J(w1) until hopefully we end up at a minimum.
Gradient Descent: choosing learning rate

[Figures on the slides illustrate the descent steps and the effect of the learning rate α: if α is too small, convergence is slow; if α is too large, the steps can overshoot the minimum and fail to converge.]
Gradient Descent

Initialize: w0 = 0 and w1 = 0

Repeat until convergence (updating w0 and w1 simultaneously):

      wj := wj − α · ∂J(w0, w1)/∂wj      for j = 0, 1

where α is the learning rate.
Batch Gradient Descent

● With batch gradient descent, we consider all data points each time we update a weight parameter.

Initialize: w0 = 0 and w1 = 0

Repeat until convergence:

      w0 := w0 − α · (1/n) Σᵢ (h(xᵢ) − yᵢ)
      w1 := w1 − α · (1/n) Σᵢ (h(xᵢ) − yᵢ) · xᵢ
Stochastic Gradient Descent

● With stochastic gradient descent, we consider a single data point each time we update a weight parameter.

Initialize: w0 = 0 and w1 = 0

Repeat until convergence, iterating over each data point (x, y):

      w0 := w0 − α · (h(x) − y)
      w1 := w1 − α · (h(x) − y) · x
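A minimal batch-gradient-descent sketch in Python, run here on the review dataset from earlier (the learning rate, iteration count, and variable names are our choices, not from the slides):

```python
# Review dataset from earlier: word count (x) vs. votes (y).
xs = [27, 2, 100, 40, 14]
ys = [52, 6, 42, 38, 30]

def batch_gradient_descent(xs, ys, alpha=0.0003, iters=200_000):
    """Fit h(x) = w0 + w1*x by batch gradient descent on the
    squared-error cost J = (1/2n) * sum((h(x) - y)**2)."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        w0 -= alpha * grad0   # update both weights simultaneously
        w1 -= alpha * grad1
    return w0, w1

w0, w1 = batch_gradient_descent(xs, ys)
print(f"h(x) = {w0:.2f} + {w1:.2f}x")
# As iterations grow this should approach the OLS closed-form
# answer computed earlier (w1 ≈ 0.24, w0 ≈ 24.8).
```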
Multivariate Linear Regression

● Univariate linear regression uses a single input feature:
      h(x) = w0 + w1x
● Multivariate linear regression extends the hypothesis to several input features x1, …, xd:
      h(x) = w0 + w1x1 + w2x2 + … + wdxd
● The cost function and the gradient descent updates carry over, with one weight per feature.
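A sketch of a multivariate fit using numpy's least-squares solver (np.linalg.lstsq is a standard numpy function; the two-feature toy matrix is our own invention):

```python
import numpy as np

# Toy design: two features per example (hypothetical numbers).
X = np.array([[27, 1], [2, 5], [100, 3], [40, 2], [14, 4]], dtype=float)
y = np.array([52, 6, 42, 38, 30], dtype=float)

# Prepend a column of ones so w0 acts as the intercept.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)   # [w0, w1, w2]
```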
Multivariate Linear Regression: feature scaling

● Features may have very different ranges!
● Don’t forget to perform feature scaling, e.g., subtract each feature’s mean and divide by each feature’s standard deviation.
● Then the features will have the same scale.
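A small sketch of the standardization described above (function and variable names are ours):

```python
def standardize(column):
    """Scale a feature to zero mean and unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

word_count = [27, 2, 100, 40, 14]
print(standardize(word_count))   # values are now centered around 0
```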
Polynomial Regression

● When a straight line does not fit the data well, we can fit a polynomial in x instead:
      h(x) = w0 + w1x + w2x² + … + wkxᵏ
● This is still linear in the weights: treating x, x², …, xᵏ as separate features reduces polynomial regression to multivariate linear regression.
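A minimal sketch using numpy's polynomial fitting (np.polyfit and np.poly1d are standard numpy functions; the data is the review dataset, and the degree choice is ours):

```python
import numpy as np

xs = np.array([27, 2, 100, 40, 14], dtype=float)   # word counts
ys = np.array([52, 6, 42, 38, 30], dtype=float)    # votes

# Degree-2 polynomial fit: returns coefficients [w2, w1, w0].
coeffs = np.polyfit(xs, ys, deg=2)
model = np.poly1d(coeffs)

print(model(20))   # predicted votes for a 20-word review
```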
Overfitting and Underfitting

● Overfitting and underfitting are two crucial concepts in machine learning and are among the most prevalent causes of poor performance of a machine learning model.
● We will look at overfitting and underfitting for the regression problem.
Overfitting and Underfitting

● Underfitting:
  ○ When a model has not learned the patterns in the training data well and is unable to generalize to new data, it is known as underfitting.
  ○ An underfit model has poor performance on the training data and will produce unreliable predictions.
  ○ Underfitting occurs due to high bias and low variance.

      Underfit hypothesis: w1x + b
Overfitting and Underfitting

● Optimum fit:
  ○ As a second variation of the model, if you fit a quadratic function to the data (treating x and x² as two features), then when you fit the parameters w1 and w2 you can get a curve that fits the data somewhat better.
  ○ When your learning algorithm also does well on examples that are not in the training set, that is called generalization.

      Well-fit hypothesis: w1x + w2x² + b
Overfitting and Underfitting

● Overfit:
  ○ The problem is that if the model learns the training data too well, it fails to capture the true relationship between input and output, and thus gives poor validation accuracy (results on unseen data) even though it exhibits good accuracy on the training data.
  ○ This is called overfitting and is a very common problem in machine learning.

      Overfit hypothesis: w1x + w2x² + w3x³ + w4x⁴ + b
Overfitting and Underfitting
Overfitting may occur when we have
too many features and the learned
hypothesis fits the training data very
well but fails to generalize to new
examples.
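To illustrate the contrast described above, here is a small sketch (with synthetic data of our own making, not from the slides) that fits polynomials of increasing degree and compares training error with error on held-out points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy quadratic trend (our assumption).
x = np.linspace(0, 5, 12)
y = 3 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.5, x.size)

train_x, train_y = x[::2], y[::2]   # every other point for training
test_x, test_y = x[1::2], y[1::2]   # held-out points for validation

for deg in (1, 2, 4):
    model = np.poly1d(np.polyfit(train_x, train_y, deg))
    train_sse = np.sum((model(train_x) - train_y) ** 2)
    test_sse = np.sum((model(test_x) - test_y) ** 2)
    print(f"degree {deg}: train SSE = {train_sse:.2f}, test SSE = {test_sse:.2f}")
# Typically, degree 1 underfits (both errors high), degree 2 fits well,
# and higher degrees drive the train SSE toward zero while the test SSE
# tends to inflate: overfitting.
```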
