MATH6183 Introduction+Regression
Fall 2023/2024
Introduction to Data Analytics
Introduction
▶ Motivation
▶ Types of variables
▶ Types of learning
▶ Data Analytics Methods
▶ Simple Linear Regression
Introduction
There are several examples of how analytics has changed our lives:
▶ IBM Watson
▶ eHarmony
▶ Netflix
▶ Blue Gene
▶ Kidney Exchange
▶ Drug discovery
Introduction
https://www.youtube.com/watch?v=wdEcVj5LTGg
Introduction
Types of data:
▶ Qualitative or Categorical
▶ Nominal: letters, symbols, words, gender, postcode, birthday,
etc.
▶ Ordinal: poor, average, good, etc.
▶ Quantitative or Numerical
▶ Discrete: 0, 1, 2, ...
▶ Continuous: 4.7, 10K, 23°C, etc.
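In R, these data types map onto vector classes; the sketch below is a minimal illustration (all variable names and values are made up for the example).

# Qualitative / categorical
gender  <- factor(c("female", "male", "female"))          # nominal
quality <- factor(c("poor", "average", "good"),
                  levels = c("poor", "average", "good"),
                  ordered = TRUE)                          # ordinal

# Quantitative / numerical
n_children  <- c(0L, 1L, 2L)                               # discrete (integer)
temperature <- c(4.7, 10.0, 23.0)                          # continuous (double)

str(gender); str(quality); str(n_children); str(temperature)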
Introduction
$$y_i = \beta_0 + \beta_1 x_i + \tilde{\epsilon}_i$$
Where:
1. $y_i$ is the observed outcome and $\beta_0$ is the intercept.
2. $\beta_1$ is the regression weight or coefficient associated with the predictor variable $x_i$.
3. $\tilde{\epsilon}_i$ is the error term or residual error (typically assumed to be i.i.d. noise with $E(\tilde{\epsilon}_i) = 0$ and $\mathrm{Var}(\tilde{\epsilon}_i) = \sigma^2$).
The OLS estimates minimise the sum of squared residuals:
$$\min_{\beta_0, \beta_1} Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
Setting the partial derivatives to zero at the minimiser $(\hat{\beta}_0, \hat{\beta}_1)$ gives
$$0 = \frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr) \qquad (1)$$
$$0 = \frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\bigr) \qquad (2)$$
Ordinary Least Squares (OLS)
From equation (1):
$$\sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0$$
$$\frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_0 - \hat{\beta}_1 \frac{1}{n}\sum_{i=1}^{n} x_i = 0$$
$$\bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
What does this tell us?
Ordinary Least Squares (OLS)
From equation (2):
$$\sum_{i=1}^{n} x_i y_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$
Substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and using the definition of $\bar{x}$:
$$\sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i + \hat{\beta}_1 \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2 - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$
Using the definition of $\bar{y}$:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\bigl(\sum_{i=1}^{n} x_i\bigr)^2}$$
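As a check on these formulas, here is a minimal R sketch that computes $\hat{\beta}_1$ and $\hat{\beta}_0$ directly from the sums above on simulated data and compares the result with lm(); the data and coefficient values are purely illustrative.

set.seed(1)
n <- 30
x <- runif(n, 1, 10)                      # simulated predictor
y <- 2 + 3 * x + rnorm(n, sd = 1.5)       # simulated outcome with noise

# Closed-form OLS estimates from the derivation above
beta1_hat <- (sum(x * y) - sum(x) * sum(y) / n) /
             (sum(x^2) - sum(x)^2 / n)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))                           # should agree up to rounding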
Ordinary Least Squares (OLS)
1. Sum of Squares Total: $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
2. Sum of Squares Regression: $\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
3. Sum of Squares Error: $\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
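A minimal R sketch of these three quantities for a fitted simple regression on simulated data (data and names are illustrative); the last two lines check the identity SST = SSR + SSE, which holds for OLS with an intercept, and the ratio SSR/SST, which is the R-squared reported by summary().

set.seed(1)
x <- runif(30, 1, 10)
y <- 2 + 3 * x + rnorm(30, sd = 1.5)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)       # total variation in y
SSR <- sum((y_hat - mean(y))^2)   # variation explained by the regression
SSE <- sum((y - y_hat)^2)         # unexplained (residual) variation

c(SST = SST, SSR_plus_SSE = SSR + SSE)    # SST = SSR + SSE
c(SSR / SST, summary(fit)$r.squared)      # both give R-squared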
Demo 1.1: The salary data set consists of 30 rows and 2 columns: years of experience and salary.
In Week 3, you will learn how to fit the regression line to training
data and measure the performance of the model on training data
using R.
Linear Regression in R
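A minimal sketch of fitting the Demo 1.1 regression in R, assuming a CSV file named Salary_Data.csv with columns YearsExperience and Salary (the file name and column names are assumptions for illustration):

# Load the salary data (file and column names assumed for illustration)
salary <- read.csv("Salary_Data.csv")

# Fit the simple linear regression: Salary ~ YearsExperience
fit <- lm(Salary ~ YearsExperience, data = salary)
summary(fit)                     # coefficients, p-values, R-squared

# Plot the data with the fitted regression line
plot(salary$YearsExperience, salary$Salary,
     xlab = "Years of experience", ylab = "Salary")
abline(fit, col = "blue")

# Predict the salary for a new observation
predict(fit, newdata = data.frame(YearsExperience = 5))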
Multiple Linear Regression
With $p - 1$ predictor variables, the model is written in matrix form as
$$y = X\beta + \tilde{\epsilon}$$
with vectors $\beta := (\beta_0, \ldots, \beta_{p-1})^T$, $y := (y_1, \ldots, y_n)^T$, and $\tilde{\epsilon} := (\tilde{\epsilon}_1, \ldots, \tilde{\epsilon}_n)^T$.
The coefficient vector $\beta$ is determined so that the error $\tilde{\epsilon}_i$ between the predicted outcome $\hat{y}_i$ and the actual outcome $y_i$ is minimised, i.e. by minimising the Sum of Squared Errors (SSE):
$$Q(\beta) = \sum_{i=1}^{n} \Bigl(y_i - \bigl(\beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij}\bigr)\Bigr)^2$$
The minimiser is
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
(Recall your linear algebra here: the formula requires $X^T X$ to be invertible. What happens if this fails? The same condition is also needed for $\hat{\beta}$ to be unique.)
Multiple Linear Regression
$$\left.\frac{\partial Q}{\partial \beta_j}\right|_{\hat{\beta}} = -2\,(y - X\hat{\beta})^T X_j = 0, \qquad 0 \le j \le p - 1,$$
where $X_j$ denotes the $j$-th column of $X$. Equivalently,
$$(y - X\hat{\beta})^T X = 0$$
$$y^T X = \hat{\beta}^T X^T X$$
$$X^T y = X^T X \hat{\beta}$$
$$\hat{\beta} = \bigl(X^T X\bigr)^{-1} X^T y$$
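A minimal R sketch of this closed-form solution on simulated data with two predictors (names, data, and coefficients are illustrative), checked against lm():

set.seed(2)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with a leading column of 1s for the intercept
X <- cbind(1, x1, x2)

# Normal-equation solution: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

drop(beta_hat)
coef(lm(y ~ x1 + x2))            # should agree up to rounding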
Multiple Linear Regression
▶ Overfitting:
Adding too many independent variables to get a good model
fit on training data will typically lead to low predictive power
on unseen data.
▶ Multicollinearity:
When the explanatory variables are tightly correlated, the
model is not able to disentangle their respective influence.
Multiple Linear Regression
▶ Multicollinear variables
▶ Two independent variables are highly correlated with each other and therefore redundant; only one of them should be kept in the model.
▶ Non-contributing variables
▶ An independent variable has low or no correlation with the dependent variable; it can cause overfitting if not excluded from the model.
Multiple Linear Regression
Redundant multicollinear variables and Non-contributing variables
Multiple Linear Regression (Model 1)
▶ Two independent variables, Administration ($x_1$) and R&D Spending ($x_2$), are used to predict profit ($y$).
▶ However, Administration has a low correlation with profit and a high p-value. Do we need it in the model? (See the sketch below.)
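A minimal R sketch of this comparison, assuming a CSV file 50_Startups.csv read into a data frame with columns Profit, Administration and RD.Spend (the file name and column names are assumptions for illustration):

startups <- read.csv("50_Startups.csv")

# Model 1: both predictors
model1 <- lm(Profit ~ Administration + RD.Spend, data = startups)
summary(model1)                  # inspect the p-value on Administration

# Model 2: drop the non-contributing variable
model2 <- lm(Profit ~ RD.Spend, data = startups)
summary(model2)                  # compare adjusted R-squared with model1

# Correlations between variables can flag redundant or weak predictors
cor(startups[, c("Profit", "Administration", "RD.Spend")])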
Multiple Linear Regression (Model 2)
Discuss in pairs:
Which of these curves make sense as a model?
Which of them have more ‘reasonable tails’?
Logistic Regression
Figure: Classification
Maximum Likelihood
▶ Maximum likelihood estimation (MLE) is a technique for estimating the parameters of an assumed probability distribution from observed data.
▶ This is accomplished by choosing the parameter values under which the observed data have the highest likelihood.
▶ The point in parameter space where the likelihood function is maximised is known as the maximum likelihood estimate.
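As a concrete illustration, here is a minimal R sketch that finds the MLE of the success probability of a Bernoulli sample by maximising the log-likelihood numerically (simulated data; for this simple case the MLE coincides with the sample proportion).

set.seed(3)
y <- rbinom(40, size = 1, prob = 0.3)    # simulated 0/1 data

# Log-likelihood of a Bernoulli(p) sample
loglik <- function(p) sum(y * log(p) + (1 - y) * log(1 - p))

# Maximise numerically over p in (0, 1)
p_hat <- optimise(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum

c(numerical_mle = p_hat, sample_proportion = mean(y))   # should agree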
Logistic Regression (Maximum Likelihood Estimator)
To estimate the coefficients, we maximize the likelihood function:
$$L(\beta) = \prod_{i=1}^{n} \Pr(Y = 1 \mid x = x_i; \beta)^{\,y_i} \, \Pr(Y = 0 \mid x = x_i; \beta)^{\,1 - y_i},$$
where, for logistic regression, $\Pr(Y = 1 \mid x; \beta) = \dfrac{1}{1 + e^{-x^T \beta}}$ and $\Pr(Y = 0 \mid x; \beta) = 1 - \Pr(Y = 1 \mid x; \beta)$.
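In R this maximisation is done by glm() with a binomial family; a minimal sketch on simulated data (names, data, and coefficients are illustrative):

set.seed(4)
x <- rnorm(100)
p <- 1 / (1 + exp(-(-0.5 + 1.5 * x)))    # true logistic probabilities
y <- rbinom(100, size = 1, prob = p)     # simulated binary outcome

# Fit logistic regression by maximum likelihood
fit <- glm(y ~ x, family = binomial)
summary(fit)

# Predicted probabilities Pr(Y = 1 | x)
head(predict(fit, type = "response"))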
Confusion matrix:
Suppose we use the following rule to classify or predict an output:
1. Choose a threshold $t$.
2. For any observation $i$ with predictor variables $x_i$ and estimated coefficients $\hat{\beta}$: if $\Pr(Y = 1 \mid x_i; \hat{\beta}) \ge t$, predict 1; otherwise predict 0.
              Actual = 0            Actual = 1
Predict = 0   True Negative (TN)    False Negative (FN)
Predict = 1   False Positive (FP)   True Positive (TP)
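A minimal R sketch of applying this threshold rule and tabulating the resulting confusion matrix, continuing the simulated logistic fit sketched above (the threshold value is illustrative):

set.seed(4)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(-0.5 + 1.5 * x))))
fit <- glm(y ~ x, family = binomial)

t_threshold <- 0.5                            # chosen threshold t
p_hat  <- predict(fit, type = "response")     # Pr(Y = 1 | x_i, beta_hat)
y_pred <- ifelse(p_hat >= t_threshold, 1, 0)  # predict 1 if probability >= t

table(Predicted = y_pred, Actual = y)         # confusion matrix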
You will learn a lot more about classification in later weeks and
also in other courses!