
MATH6183: Data Mining and Analytics

Prof. Selin Damla Ahipasaoglu

Fall 2023/2024
Introduction to Data Analytics
Introduction

▶ Motivation
▶ Types of variables
▶ Types of learning
▶ Data Analytics Methods
▶ Simple Linear Regression
Introduction

Data analytics is the science of using data to build models that
lead to better decisions, adding value to individuals, companies,
and institutions.

Data Analytics can provide a competitive advantage to companies,
help governments use their budgets more efficiently, and can even
help save lives or fight against climate change!

Can you think of a "Data Analytics Application" that has recently
changed the world or could change it in the near future?
(Peer-discussion, 5 minutes)
Introduction

There are several examples of how analytics has changed our lives:
▶ IBM Watson
▶ eHarmony
▶ Netflix
▶ Blue Gene
▶ Kidney Exchange
▶ Drug discovery
Introduction

and will continue to have an impact on millions of lives:

https://www.youtube.com/watch?v=wdEcVj5LTGg
Introduction

Typical steps of Data Analytics are:

▶ Understanding the problem and identifying the questions


▶ Data mining
▶ Data preparation
▶ Data exploration
▶ Feature engineering (Selection of input variables)
▶ Predictive modelling (Training and testing)
▶ Data Visualization
▶ Presentation of the results

Discuss which step is the most crucial/hardest, and which


techniques/software we can use for each step.
(Brainstorm, 5 minutes)
Introduction

According to Google, data is defined as:


▶ facts and statistics collected together for reference or analysis
▶ the quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and
transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.

Types of data:
▶ Qualitative or Categorical
▶ Nominal: letters, symbols, words, gender, postcode, birthday,
etc.
▶ Ordinal: poor, average, good, etc.
▶ Quantitative or Numerical
▶ Discrete: 0,1,2,..
▶ Continuous: 4.7, 10K, 23C, etc.
Introduction

Major Techniques of Data Analytics/Machine Learning:


▶ Regression / Estimation: Predict Continuous Values
▶ E.g., predicting carbon emission values based on engine volume
and speed
▶ Linear regression, generalised additive models, etc.
▶ Classification: Predicting the item class/category of a case.
▶ E.g., predicting whether a customer will pay their credit card debt or not
▶ Logistic regression, SVMs, decision trees, LDA, etc.
▶ Clustering: Finding the structure of data; summarization.
▶ E.g., segmenting customers of a retailer to offer differentiated
products and offers
▶ k-means, DBSCAN, association analysis, etc.
▶ Dimension Reduction: Reducing the size of data
▶ E.g., image compression
▶ Principal Component Analysis
Discuss: which ones are supervised/unsupervised?
Simple Linear Regression
Introduction

A simple and popular technique for predicting a continuous


variable assuming a linear relationship between the outcome (yi )
and the predictor variable (xi ).

yi = β0 + β1 xi + ϵ̃i
Where:
1. yi is the observed outcome and β0 is the intercept.
2. β1 is the regression weight or coefficient associated with the
predictor variable xi .
3. ϵ̃i is the error term or residual error. (Typically assumed to be
i.i.d. noise with E (ϵ̃i ) = 0 and Var (ϵ̃i ) = σ 2 .)

How do we determine the regression line?


Ordinary Least Squares (OLS)
The coefficients (β0, β1) are chosen so that the Sum of Squared Errors (SSE)
between the predicted outcome (ŷi) and the actual outcome (yi) is minimized:

$$\min_{\beta_0,\beta_1} Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$$
Ordinary Least Squares (OLS)

$$\min_{\beta_0,\beta_1} Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$$

Note that this is a convex function of β0 , β1 .

To minimize the SSE, we set the derivatives to 0 and solve for the
solutions β̂0 and β̂1:

$$0 = \frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \big(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\big) \qquad (1)$$

$$0 = \frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \big(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\big) \qquad (2)$$
Ordinary Least Squares (OLS)

Rewrite equation (1) and solve for β̂0 :


$$\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat\beta_0 - \hat\beta_1 \sum_{i=1}^{n} x_i = 0$$

$$\sum_{i=1}^{n} y_i - n\hat\beta_0 - \hat\beta_1 \sum_{i=1}^{n} x_i = 0$$

$$\frac{1}{n}\sum_{i=1}^{n} y_i - \hat\beta_0 - \hat\beta_1 \frac{1}{n}\sum_{i=1}^{n} x_i = 0$$

$$\bar{y} - \hat\beta_0 - \hat\beta_1 \bar{x} = 0$$

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$$
What does this tell us?
Ordinary Least Squares (OLS)

Rewrite equation (2) for β̂1 :


$$\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \hat\beta_0 - \sum_{i=1}^{n} x_i^2 \hat\beta_1 = 0$$

$$\sum_{i=1}^{n} x_i y_i - \hat\beta_0 \sum_{i=1}^{n} x_i - \hat\beta_1 \sum_{i=1}^{n} x_i^2 = 0$$

Replace β̂0 from the previous slide:

$$\sum_{i=1}^{n} x_i y_i - (\bar{y} - \hat\beta_1 \bar{x}) \sum_{i=1}^{n} x_i - \hat\beta_1 \sum_{i=1}^{n} x_i^2 = 0$$
Ordinary Least Squares (OLS)

Using the definition of x̄:

$$\sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i + \hat\beta_1 \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2 - \hat\beta_1 \sum_{i=1}^{n} x_i^2 = 0$$

Using the definition of ȳ:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\big(\sum_{i=1}^{n} x_i\big)^2}$$
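
A minimal R sketch (not from the slides) of these closed-form formulas, using made-up data; the variable names are illustrative only:

# Hypothetical data for illustration
set.seed(1)
x <- runif(30, 1, 10)
y <- 2 + 3 * x + rnorm(30, sd = 2)
n <- length(x)

# Closed-form OLS estimates from the formulas above
beta1_hat <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # should agree with R's built-in fit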
Ordinary Least Squares (OLS)

1. Sum of Squares Total: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
2. Sum of Squares Regression: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
3. Sum of Squares Error: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

SST = SSR + SSE

(Can you prove this?)


$$\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{total variability}} = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{explained variability}} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{unexplained variability}}$$
Ordinary Least Squares

▶ SST, or Sum of Squares Total, measures the variation of the yi
around their mean (green line in the figure): $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \tilde{s}_{yy}$
Ordinary Least Squares

▶ SSR, or Sum of Squares explained by Regression, is the variation of
the regression model's fitted values around the sample mean (violet
line in the figure): $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
Ordinary Least Squares

▶ SSE, or Sum of Squares Error/residual (red line in the figure), is the
total unexplained error: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Ordinary Least Squares

The Goodness of Fit


▶ R-squared (R²) expresses how well the regression line (grey line)
fits the data compared to fitting a flat line at the sample mean ȳ
(blue line).
Ordinary Least Squares

The Goodness of Fit

▶ $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{SST - SSE}{SST}$

▶ For example, if SST = 50 and SSE = 15, then
$R^2 = \frac{50 - 15}{50} = \frac{35}{50} = 70\%$

This can be interpreted as

▶ "There is 70% less variation around the regression line than around
the mean line."
or
▶ "The regression explains 70% of the variation in the data."
Ordinary Least Squares

The Goodness of Fit: Adjusted R-squared (R²adj)

▶ The R² formula has a known drawback.
▶ R² always increases with every predictor added to a model, which
can be misleading: the model appears to fit better simply because
more terms have been added.
▶ Adjusted R² remedies this problem.
▶ Adjusted R² also indicates how well the terms fit the data, but
adjusts for the number of terms in the model. If you add useless
variables to a model, adjusted R-squared will decrease.
Ordinary Least Squares

The Goodness of Fit


▶ Adjusted $R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$
▶ p = number of independent regressors / variables
▶ n = number of points in the data sample
Methods to determine the best-fit regression line

Root Mean Square Error (RMSE): Measures the model prediction error.
It corresponds to the average difference between the observed values
of the outcome and the values predicted by the model. A lower RMSE is
desirable.

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{SSE}{n}}$$

Adjusted R-squared: Represents the proportion of variation in the
data explained by the model. It corresponds to the overall quality of
the model. A higher adjusted R-squared is desirable.

How should you compute these metrics?

They should be computed on a new set of data that has not been used
to train the model. With a large data set, splitting the data into
training and test sets with an 80:20 ratio is common practice.
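
As a hedged illustration (not the module's demo code), an 80:20 split and test-set metrics might look like this in R, with a made-up data frame df containing an outcome y and a predictor x:

# Hypothetical data
set.seed(42)
df <- data.frame(x = runif(100, 0, 10))
df$y <- 1 + 2 * df$x + rnorm(100)

# 80:20 train/test split
train_idx <- sample(seq_len(nrow(df)), size = 0.8 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

fit  <- lm(y ~ x, data = train)          # fit on training data only
pred <- predict(fit, newdata = test)     # predict the unseen test data

sqrt(mean((test$y - pred)^2))            # RMSE = sqrt(SSE / n) on the test set
summary(fit)$adj.r.squared               # adjusted R^2 of the fitted model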
Linear Regression in R

We will use linear regression to predict the salary of an employee
from their number of years of experience.

Demo 1.1: The salary data set consists of 30 rows and 2 columns:
years of experience and salary.

In Week 3, you will learn how to fit the regression line to training
data and measure the performance of the model on training data
using R.
Linear Regression in R
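
A minimal sketch of fitting this model with lm(); the file name and column names (YearsExperience, Salary) are assumptions, not necessarily those used in the demo:

salary <- read.csv("Salary_Data.csv")            # assumed file name
model  <- lm(Salary ~ YearsExperience, data = salary)
summary(model)                                   # coefficients, p-values, R^2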

We can also calculate the confidence intervals of these estimators:

Or make a prediction with a desired confidence level:
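
For example, continuing the hypothetical salary fit above:

confint(model, level = 0.95)                      # confidence intervals for beta0 and beta1

new_employee <- data.frame(YearsExperience = 5)   # hypothetical new input
predict(model, new_employee, interval = "confidence", level = 0.95)
predict(model, new_employee, interval = "prediction", level = 0.95)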

We will learn more about these details in computer labs and


coursework!
Multiple Linear Regression
Multiple Linear Regression

An extension of simple linear regression for predicting a
continuous outcome variable (yi) using multiple distinct
predictor variables (xi1, . . . , xi,p−1).

yi = β0 + β1 xi1 + ... + βp−1 xi,p−1 + ϵ̃i

▶ The number of predictors in this model is p − 1.


▶ Error is assumed to follow N(0, σ 2 ) as before.
▶ Coefficients are called partial regression coefficients as they
measure the effect of a unit increase of a single regressor on
the average value of the output while holding constant all
other variables.
Multiple Linear Regression

Let’s first build some intuition about our model:

yi = β0 + β1 xi1 + ... + βp−1 xi,p−1 + ϵ̃i

▶ This is a linear model. What does this mean exactly when we


have more than two variables?
▶ What happens if we have more than one variable?
▶ Go to https://www.geogebra.org/ and draw a multiple
linear equation with 2 variables to answer these questions.
(Work in pairs, 5 minutes)
Multiple Linear Regression

▶ Illustration of a multiple linear regression model:

▶ Which geometric object is this?


▶ What do you recall about this object?
(Work in pairs, 5 minutes)
Multiple Linear Regression
 
▶ Let $X := \begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{n,p-1} \end{pmatrix} \in \Re^{n \times p}$,

with column vectors $X_0 := (1, \dots, 1)^T$ and $X_j := (x_{1j}, \dots, x_{nj})^T$.

▶ Then the regression model can be expressed as

$$y = X\beta + \tilde{\epsilon}$$

with vectors $\beta := (\beta_0, \dots, \beta_{p-1})^T$, $y := (y_1, \dots, y_n)^T$, and $\tilde{\epsilon} := (\tilde{\epsilon}_1, \dots, \tilde{\epsilon}_n)^T$.
Multiple Linear Regression
The coefficient vector β is chosen so that the Sum of Squared Errors (SSE)
between the predicted outcome (ŷi) and the actual outcome (yi) is minimized:

$$Q(\beta) = \sum_{i=1}^{n} \Big(y_i - \big(\beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij}\big)\Big)^2$$

▶ This is an unconstrained convex optimisation problem and


easy to solve.
▶ When X is full rank¹, the solution is obtained from the so-called
normal equations:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

¹ You should recall your linear algebra knowledge here. What happens if this
fails? This condition is also required for β̂ to be unique.
Multiple Linear Regression

Normal equations are obtained from optimality conditions of


the convex problem:

$$\frac{\partial Q}{\partial \beta_j}\bigg|_{\hat{\beta}} = -2\,(y - X\hat{\beta})^T X_j = 0, \qquad 0 \le j \le p-1$$

Equivalent to

$$(y - X\hat{\beta})^T X = 0$$
$$y^T X = \hat{\beta}^T X^T X$$
$$X^T y = X^T X \hat{\beta}$$
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
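
A minimal sketch (made-up data) of solving the normal equations directly and checking the result against lm():

set.seed(2)
n <- 50
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))   # design matrix with an intercept column
y <- X %*% c(5, 2, -1) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # solves (X^T X) beta = X^T y
beta_hat
coef(lm(y ~ X[, -1]))                         # should match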
Multiple Linear Regression

As there is more than one regressor, we have to be careful in
selecting the correct model.

▶ Overfitting:
Adding too many independent variables to get a good model
fit on training data will typically lead to low predictive power
on unseen data.

▶ Multicollinearity:
When the explanatory variables are tightly correlated, the
model is not able to disentangle their respective influence.
Multiple Linear Regression

How do we determine overfitting and multicollinearity?

What can we do if we observe it?

(Brainstorm in pairs, 5 minutes.)


Multiple Linear Regression

▶ Check the relationship between each independent variable and


the dependent variable using scatter plots and correlations.

▶ Check the relationship among the independent variables using


scatter plots and correlations.

▶ This plot analysis can help detect:


▶ Multicollinear variables (i.e., potentially redundant variables)
▶ Non-contributing variables
Multiple Linear Regression

Demo 2.1: Scatter plots for dataset USA startups.csv

The startup data set consists of 50 rows and 5 columns:


▶ R&D Spending (x1 ),
▶ Administration (x2 ),
▶ Marketing Spending (x3 ),
▶ State (x4 , nominal),
▶ Profit (y ).
Multiple Linear Regression in R
(Discuss your observations with a peer, 5 minutes)
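
A hedged sketch of this kind of exploration; the file name and the assumption that the numeric columns are read in directly are mine, not necessarily the demo's:

startups <- read.csv("USA_startups.csv")      # assumed file name
num_cols <- sapply(startups, is.numeric)

pairs(startups[, num_cols])                   # scatter-plot matrix
cor(startups[, num_cols])                     # pairwise correlations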
Multiple Linear Regression

▶ Multicollinear variables
▶ Two independent variables are highly correlated with each
other and therefore redundant; only one should remain in use
for the model.

▶ Non-contributing variables
▶ An independent variable that has low/no correlation with the dependent
variable; it can cause overfitting if not excluded from the model.
Multiple Linear Regression
Redundant multicollinear variables and Non-contributing variables
Multiple Linear Regression (Model 1)
▶ Two independent variables, Administration (x1 ) and RD
Spending (x2 ), are used to predict profit(y ).
▶ However, Administration has low correlation with profit and
high p-value. Do we need it in the model?
Multiple Linear Regression (Model 2)

▶ Let’s drop Administration from the previous model.


Multiple Linear Regression

▶ Demo 2.2: Building regression models using the Administration
and RD Spend variables on training data (80%) and evaluating
their performance on test data.

Figure: RMSE and R² show that Model 1 has an overfitting issue


Multiple Linear Regression (Variable Selection)

▶ We need to select the variables carefully to avoid overfitting


and multicollinearity.

▶ Statistical analysis can be done through examining:


▶ Coefficients
▶ P-value or Analysis of Variance (ANOVA)
▶ R2 , Adjusted R2 (to calculate VIF, Variance Inflation Factor)
Multiple Linear Regression (Variable Selection)

Variable Selection is a hard problem as we need to check all


possible subsets of independent variables.

▶ Stepwise regression consists of iteratively adding and


removing predictors with the goal of finding the subset of
variables in the data set resulting in the best performing
model or the lowest prediction error.

▶ 3 strategies of stepwise regression:


▶ Forward selection
▶ Backward selection / backward elimination
▶ Stepwise selection
Multiple Linear Regression (Variable Selection)

▶ Forward selection starts with no predictors in the model and


iteratively adds the most contributing predictors, and stops
when the improvement is no longer statistically significant.

▶ Backward selection includes all predictors in the model, and


then iteratively removes the least contributing predictors, and
stops when all predictors are statistically significant.

▶ Stepwise selection is a combination of forward and backward


selections. It starts with no predictors, then sequentially adds
the most contributing predictors (forward selection). After
adding each new variable, it removes variables that no longer
provide an improvement in the model fit (backward selection).
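
These strategies are available in base R through step(); note that step() ranks models by AIC rather than by p-values, so it only approximates the p-value-based procedure used in the demos. A sketch, assuming the startups data frame from the earlier sketch with a Profit column:

full <- lm(Profit ~ ., data = startups)       # all predictors
null <- lm(Profit ~ 1, data = startups)       # intercept only

step(full, direction = "backward")                          # backward elimination
step(null, scope = formula(full), direction = "forward")    # forward selection
step(null, scope = formula(full), direction = "both")       # stepwise selection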
Multiple Linear Regression (Variable Selection)

Demo 2.3: Apply backward selection to USA Startup.csv


We assume a linear relationship between profit (y) and the rest of
the predictors (x1, x2, x3, x4, x5). Note that the last two variables
both correspond to the nominal State variable.

Model with all variables included:

$$y_i = \beta_0 + \sum_{j=1}^{5} \beta_j x_{ij} + \epsilon_i$$
Backward Method Multiple Linear Regression in R
▶ Initial model includes all available variables (Model 3).
▶ Analyze the coefficients, the p-values of the predictors/independent
variables, and the adjusted R².
Backward Method Multiple Linear Regression in R

▶ Let's remove the independent variable (State) with the largest
p-value (Model 4)
Backward Method Multiple Linear Regression in R
▶ Remove another variable (Administration) with large p-value.
(Model 5)
Backward Method Multiple Linear Regression in R

▶ Removing Marketing Spend, which also has a large p-value, takes us
back to Model 2 again!
Backward Method Multiple Linear Regression in R

▶ Additionally, ANOVA and VIF can be used to assess predictor
significance, as shown in the sketch below.
▶ The F-value is used to determine predictors' significance.
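
A minimal sketch; the model objects model1 and model2 are hypothetical names for the nested fits above, and vif() comes from the car package (assumed to be installed):

anova(model2, model1)    # F-test: does the extra variable significantly improve the fit?
car::vif(model1)         # variance inflation factors; values above roughly 5-10 suggest multicollinearity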
Logistic Regression
Logistic Regression

▶ Logistic regression is used for classification, i.e., predicting a


categorical variable (nominal or ordinal).

▶ Determine whether a person is obese or not based on their


weight, height, and body mass.
▶ Determine whether a loan will default or not based on the
borrower's number of loans, income, and loan amount.
▶ Determine the species of a bird based on its features.
(multinomial: more than two classes)
Logistic Regression

We would like to build a model that estimates the probability of


the output belonging to a category based on the values of the
input variables.

For example: P(y |x) = Prob(y = Obese|x = 78kg ).

▶ Why is linear regression (i.e., modeling p(y |x) = ax + b) not


a suitable approach?

▶ What would happen if we used linear regression for


determining if a person is obese or not based on their weight?
Logistic Regression

Figure: linear regression (left) and logistic regression (right)

Discuss in pairs:
Which of these curves make sense as a model?
Which of them has more 'reasonable tails'?
Logistic Regression

Figure: Classification

Consider a data set where the i-th observation is
yi ∈ {0, 1} and xi = (1, xi1, . . . , xi,p−1).
Logistic Model

The logistic function (S-shaped) provides a nice formulation to


capture this:
$$p(x, \beta) := \Pr(Y = 1 \mid x, \beta) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1}}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1}}} = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}}.$$
This number is always between 0 and 1, irrespective of the value of
the x variables.
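
A quick sketch illustrating this in R: whatever value the linear predictor beta^T x takes, the logistic function maps it into (0, 1):

logistic <- function(z) exp(z) / (1 + exp(z))   # equivalent to plogis(z)
curve(logistic(x), from = -6, to = 6)           # the S-shaped curve
logistic(c(-100, 0, 100))                       # approximately 0, 0.5, 1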
Logistic Regression

$$\text{Odds} = \frac{\Pr(Y = 1 \mid x, \beta)}{\Pr(Y = 0 \mid x, \beta)} = \frac{p(x, \beta)}{1 - p(x, \beta)} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1}} = e^{\beta^T x}.$$

▶ The odds (ratio) can take on any value in (0, ∞).


▶ Odds > 1 if Y = 1 is more likely and Odds < 1 if Y = 0 is
more likely, given a particular x.
▶ The logit or log-odds is defined as:

$$\log(\text{Odds}) = \beta^T x = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1},$$

Note that if βj > 0, then Pr(Y = 1) increases as xj increases.
Increasing xj by 1 unit (keeping all of the other xk values fixed)
changes the log-odds by βj, i.e., multiplies the odds by $e^{\beta_j}$.
Logistic Regression (Maximum Likelihood Estimator)

Maximum Likelihood
▶ Maximum likelihood estimation (MLE) is a technique for
estimating the parameters of an assumed probability
distribution using observed data.
▶ This is accomplished by maximising a likelihood function so that,
under the fitted parameters, the observed data are most probable.
▶ The point in parameter space where the likelihood function is
maximised is known as the maximum likelihood estimate.
Logistic Regression (Maximum Likelihood Estimator)
To estimate the coefficients, we maximize the likelihood function:
$$L(\beta) = \prod_{i=1}^{n} \Pr(Y = 1 \mid x = x_i; \beta)^{y_i} \, \Pr(Y = 0 \mid x = x_i; \beta)^{1 - y_i}.$$

The maximum (log-)likelihood problem is defined as:

$$\max_{\beta} \ \log L(\beta)$$

where the log-likelihood is defined as:

$$\log L(\beta) = LL(\beta) = \sum_{i=1}^{n} \big[ y_i \log p(x_i, \beta) + (1 - y_i) \log(1 - p(x_i, \beta)) \big] = \sum_{i=1}^{n} \Big[ y_i \beta^T x_i - \log\big(1 + e^{\beta^T x_i}\big) \Big].$$

The objective function is concave. The problem can be solved


efficiently.
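
In R, glm() maximises exactly this log-likelihood (via iteratively reweighted least squares). A minimal sketch with made-up data:

set.seed(3)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))       # hypothetical binary outcome

logit_fit <- glm(y ~ x, family = binomial)    # logistic regression fitted by MLE
coef(logit_fit)                               # estimates of beta0, beta1
logLik(logit_fit)                             # the maximised log-likelihood LL(beta_hat)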
Logistic Model (Goodness of Fit)

The Akaike information criterion (AIC) is a measure of fit that
penalises complicated models (similar to adjusted R-squared).
However, the AIC does not have a benchmark range like R-squared,
which lies in the interval [0, 1].
The smaller the AIC, the better the model. AIC is defined as:

$$AIC = -2\,LL(\hat{\beta}) + 2p,$$

where p − 1 coefficients and an intercept coefficient are estimated.
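
Continuing the hypothetical glm sketch above, the AIC reported by R matches this definition:

AIC(logit_fit)
-2 * as.numeric(logLik(logit_fit)) + 2 * length(coef(logit_fit))   # -2 LL(beta_hat) + 2p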


Logistic Model (Goodness of Fit)

Confusion matrix:
Suppose we use the following rule to classify or predict an output:
1. Choose a threshold t
2. For any observation i with predictor variables xi and estimated
coefficients β̂:
If Pr(Y = 1|xi ; β̂) ≥ t, then predict 1, else predict 0

              Actual = 0            Actual = 1
Predict = 0   True Negative (TN)    False Negative (FN)
Predict = 1   False Positive (FP)   True Positive (TP)

Table: Confusion matrix


Logistic Model (Goodness of Fit)

We can define the following quantities:


True negative rate: $TNR = \frac{TN}{FP + TN}$ (Specificity),
False positive rate: $FPR = \frac{FP}{FP + TN}$ (Type I error),
True positive rate: $TPR = \frac{TP}{TP + FN}$ (Sensitivity),
False negative rate: $FNR = \frac{FN}{TP + FN}$ (Type II error),
Overall Accuracy: $Accuracy = \frac{TP + TN}{FP + TN + TP + FN}$.
Note that by definition, FPR + TNR = 1 and TPR + FNR = 1.

Pair discussion: How do you determine t?
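
A minimal sketch, continuing the hypothetical glm fit above, of the confusion matrix and the derived rates at an example threshold of 0.5:

thr  <- 0.5                                              # threshold t
pred <- as.integer(predict(logit_fit, type = "response") >= thr)
cm   <- table(Predicted = pred, Actual = y)
cm

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

c(TPR = TP / (TP + FN),                # sensitivity
  TNR = TN / (FP + TN),                # specificity
  Accuracy = (TP + TN) / sum(cm))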


Logistic Regression

▶ Demo 3.1: Prediction of purchase from an advertisement based on
the audience's age

Figure: Logistic Regression Result

▶ For the Purchase data, the figure shows the estimated coefficients
of the logistic regression model that predicts the probability of
purchase using Age. A one-unit increase in Age is associated with an
increase in the log odds of purchase by 1.9913 units.

You will learn a lot more about classification in later weeks and
also in other courses!
