Lec 6
1 Regression
PART II: LINEAR REGRESSION: THE INDUCTIVE BIAS
1 Inductive Bias
1.1 A List
1.2 Choice of Loss Function
PART III: LINEAR REGRESSION: APPLICATIONS AND SHORTCOMINGS
Part I
REGRESSION
▶ Regression is a statistical method that estimates the strength and nature of the relationship between a dependent variable and one or more independent variables.
▶ It does so by finding a curve that minimizes the error between the actual and predicted values of the dependent variable over the training data set.
▶ For proper interpretation of regression, several assumptions (the inductive bias) about the data and the model must hold.
▶ Linear regression is one of the most common forms of this method. It establishes a linear relationship between the dependent and independent variables.
LINEAR REGRESSION
INTRODUCTION
LINEAR REGRESSION
REGRESSION LINE
The line showing the linear relationship between the dependent and independent variables is called a regression line. An example of a regression line is shown below:
[Figure: a scatter plot with a fitted regression line of negative slope.]
The regression line may be positive, wherein the dependent variable increases as the independent variable increases; or it may be negative, wherein the dependent variable decreases as the independent variable increases (as in the figure above).
LINEAR REGRESSION: THE PROBLEM
TYPES
Linear regression may be further classified into the following two types:
▶ Simple Linear Regression: Assumes a linear relationship between a single independent variable and a dependent variable.
▶ Multiple Linear Regression: Assumes a linear relationship between two or more independent variables and a dependent variable.
LINEAR REGRESSION: THE PROBLEM
MATHEMATICAL REPRESENTATION
Once a linear relationship has been determined by the algorithm, the general form of each model may be represented as follows:
▶ Simple Linear Regression
y = ax + b + u
▶ Multiple Linear Regression
y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + b + u
where:
y = dependent variable
x, x_i = independent variable(s)
a, a_i = slope(s) of the variable(s)
b = the y-intercept
u = the regression residual (error term)
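As an illustration (not part of the original slides), here is a minimal NumPy sketch that generates synthetic data from both model forms; the coefficient values a, a_1, a_2, and b are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple model: y = a*x + b + u, with assumed a = 2.0, b = 1.0
a, b = 2.0, 1.0
x = rng.uniform(0, 10, size=100)
u = rng.normal(0, 1, size=100)            # residual / error term
y = a * x + b + u

# Multiple model: y = a1*x1 + a2*x2 + b + u, with assumed slopes [2.0, -0.5]
a_vec = np.array([2.0, -0.5])
X = rng.uniform(0, 10, size=(100, 2))     # columns are x1, x2
y_multi = X @ a_vec + b + rng.normal(0, 1, size=100)
```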
LINEAR REGRESSION: THE PROBLEM
LOSS FUNCTION: MEAN SQUARED ERROR
▶ The regression line is obtained by minimizing the mean squared error (the loss function) over all points in the training set. The loss function is given as:
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2
where f(x) = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + b.
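A minimal sketch of this loss in NumPy (illustrative only; the line f(x) = 2x + 1 and the data are assumptions):

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error: (1/N) * sum((y_i - f(x_i))^2)."""
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    return np.mean((y - y_pred) ** 2)

# Example: predictions from an assumed line f(x) = 2x + 1
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.2, 4.9, 7.1])
print(mse(y, 2 * x + 1))  # small value: the line fits these points well
```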
LINEAR REGRESSION: THE SOLUTION
SOLUTION
The best-fit line may be found in the following two ways:
▶ Closed-form (exact) solution:
• It solves the problem in terms of simple functions and mathematical operators.
• The closed-form solution for linear regression (the normal equations) is:
B = (X^\top X)^{-1} X^\top Y
▶ Iterative solution: an approach such as gradient descent, which repeatedly adjusts the coefficients to reduce the loss.
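A minimal NumPy sketch of the closed-form solution (synthetic data and coefficient values are assumptions; a column of ones is appended so the intercept b is estimated together with the slopes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed model y = 2*x1 - 0.5*x2 + 1 + noise
X = rng.uniform(0, 10, size=(100, 2))
y = X @ np.array([2.0, -0.5]) + 1.0 + rng.normal(0, 0.5, size=100)

# Augment X with a column of ones so the intercept is part of B
X_aug = np.column_stack([X, np.ones(len(X))])

# Closed form B = (X'X)^{-1} X'Y; solve() is preferred over an explicit inverse
B = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(B)  # approximately [2.0, -0.5, 1.0]
```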
LINEAR REGRESSION: THE SOLUTION
QUALITY OF FIT
▶ The goodness of the fit achieved indicates how strongly the variables are linearly correlated.
▶ The goodness of fit may be measured using the Pearson correlation coefficient, which is given by:
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
▶ The closer |r| is to 1, the better the linear fit (r is negative for a negative relationship).
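A short NumPy sketch of this coefficient (the sample data are made up for illustration):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance normalised by the product of spreads."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2x
print(pearson_r(x, y))               # close to 1: strong positive linear fit
```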
Part II
INDUCTIVE BIAS
A LIST
CHOICE OF LOSS FUNCTION
Let us analyse some candidate loss functions to justify the choice of MSE as an appropriate loss function (compared numerically in the sketch after this list).
▶ L1 = (y - f(x)): This loss takes both positive and negative values, which can cancel out and give a near-zero total error on large data sets even when the fit is poor.
▶ L2 = |y - f(x)|: Although errors do not cancel out here, outliers are penalised only at the same (linear) rate as typical points.
▶ L3 = (y - f(x))^2: In this case, the errors do not cancel out, and outliers are penalised more heavily, giving a more appropriate regression line.
Hence, MSE is an appropriate choice of loss function.
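A small sketch (with made-up numbers) showing how the signed errors of L1 cancel while the absolute and squared forms do not:

```python
import numpy as np

y      = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 1.0, 4.0, 3.0])   # errors: -1, +1, -1, +1

err = y - y_pred
print(np.mean(err))           # L1: 0.0 -- cancellation hides a poor fit
print(np.mean(np.abs(err)))   # L2: 1.0 -- no cancellation, linear penalty
print(np.mean(err ** 2))      # L3: 1.0 -- no cancellation; squaring would
                              #     dominate for any large outlier error
```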
Part III
APPLICATIONS AND SHORTCOMINGS
▶ Linear regression finds applications in several fields, such as market analysis, financial analysis, environmental health, and medicine.
▶ However, it leaves something to be desired. A linear correlation does not indicate causation, i.e. a connection between two variables does not imply that one causes the other.
▶ Linear regression is sensitive to noise and prone to overfitting.
▶ It is prone to multicollinearity, i.e. the occurrence of correlation between two or more independent variables, which reduces the statistical significance of the individual independent variables (a small illustration follows below).
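A brief NumPy sketch (synthetic, assumed data) of multicollinearity: when two columns of X are nearly identical, X'X becomes ill-conditioned and the fitted coefficients become unstable.

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.uniform(0, 10, size=100)
x2 = x1 + rng.normal(0, 0.01, size=100)   # nearly a copy of x1: collinear
y = 3 * x1 + rng.normal(0, 0.5, size=100)

X = np.column_stack([x1, x2, np.ones(len(x1))])
print(np.linalg.cond(X.T @ X))            # huge condition number

B = np.linalg.solve(X.T @ X, X.T @ y)
print(B)  # the true weight 3 is split unstably between the collinear columns
```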