
BDM 2053

Big Data Algorithms and Statistics
Week 3


Weekly Course Objectives
● Inferential Statistics vs. Descriptive Statistics
● Linear regression?
○ What is it?
○ What is least squares?
● What is R2?
● What is correlation?
● What are multivariate linear models?
● What is adjusted R2?
● Do some examples in Python!
Descriptive Statistics
● We have been looking mainly at descriptive statistics, which are summaries of data either through central tendencies and measures of variability (mean, median, etc.) or graphs (histograms, boxplots, etc.).
● Descriptive statistics describe a sample of observations. We
simply take our data, then use summary statistics and graphs to
present characteristics of the data.
● There is no uncertainty with descriptive statistics because we
only look at describing what we have, and not inferring to
anything outside our data.
Inferential Statistics
● One example of inferential statistics we looked at last week was confidence intervals.
● Inferential statistics takes information from a sample and
makes inferences about the larger population (confidence
intervals, regression, etc).
● To make inferences on the population, we must have a good,
representative, sample!
● There is uncertainty with inferential statistics because we will
make inferences to the greater population based on our sample.
We must sample appropriately to reduce uncertainty in our
inferences.
● Another way to make inferences is to make predictions, and linear regression is one method for doing so!
Linear Regression
● Regression is the measure of relation between two or more
variables.
● Linear regression is therefore a linear measure of relation
between two or more variables.
○ This is done by finding the best line, i.e. the one that minimizes the distance between the observed values and the values predicted by the line.
● The distances between our observations and the corresponding points on the line (which are the predictions in this case) are called our residuals.
● This process of finding the line that reduces the distance
between our observations and the corresponding value on the
line is called least squares.
○ It’s called least squares because we take the squared
differences between our observed values and the predicted
values. In other words, we square our residuals.
Linear Regression cont.

[Figure: scatter plot of the data (red points) with a dashed fitted regression line]

● In the figure, the red points are the actual data, the dashed line is our linear regression line, and the distances from the red points to the regression line are our residuals.
● Depending on the line fitted, the squared residuals will change. How do we quantitatively find the best line?
Least Squares Method
● For the Least Squares Method, we are basically finding the minimum sum of squared residuals, aka SSR (residual sum of squares, aka RSS). Therefore we want to minimize:

SSR = Sum((yi - (b̂0 + b̂1xi))²) ,

where b̂0 is the “y-intercept” and b̂1 is the slope.

● As you might imagine, since this is a minimization problem, we need to take derivatives. Proving this would take up a good chunk of the lecture time, so I will leave the link here.
● The equation of the simple linear regression model is:
ŷ = b̂0 + b̂1x
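To make this concrete, here is a minimal Python sketch of the least squares fit. The x and y arrays are invented toy data, and the closed-form slope and intercept used below are what the derivative conditions work out to.

```python
# A minimal sketch of simple linear regression by least squares.
# The x and y values are invented purely for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # independent variable
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])   # dependent variable

# Closed-form least squares estimates (what the derivatives work out to).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x            # predictions on the fitted line
residuals = y - y_hat          # distances from observations to the line
ssr = np.sum(residuals ** 2)   # the quantity least squares minimizes

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, SSR = {ssr:.3f}")
```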
Use R2 to assess your model!
● We talked a lot about dispersion so far. The most common measure of dispersion was variance: the average of the squared distances between our observations and the mean of the observations.
○ If the variance is small, our observations are close to the average of the observations (and if it is 0, every observation equals the average).
● R2 is simply the proportion of the variation in our response variable (target variable, dependent variable) that is explained by our independent variable(s).
○ It can be expressed as:
R2 = (SSM - SSR)/SSM
● The above equation means that we can reduce the variance in our response variable when we take our independent variables into account!
Still confused about R2?
● Let’s look at an example… with weight!... But of mice!

[Figure: scatter plot of mouse size vs. mouse weight, with the average mouse size marked]

● Above, the red points are the actual data for mouse size and mouse weight.
● Since we are interested in the variance of our target variable, let’s calculate the average mouse size here.
R2 example cont.

[Figure: the mouse data shown with a horizontal line at the mean size, and with the fitted regression line]

● We calculate the variance of the mouse size by summing the squared distances from the mean (SSM) and dividing by n. Therefore:
Var(mean) = Sum((data - mean)²)/n = SSM/n
● Similarly, we can capture the variance around the fitted values by summing the squared residuals (SSR) and dividing by n. Therefore:
Var(fit) = Sum((data - fit)²)/n = Sum(residuals²)/n = SSR/n
R2 example cont.
● Therefore, R2 would be the variation in the target variable, in this case mouse size, explained by our independent variable (explanatory variable), mouse weight:

R2 = (Var(mean) - Var(fit))/Var(mean)

● Since the variances have the same denominator (n), the R2 value can be thought of as simply:

R2 = (SSM - SSR)/SSM

● If we got 100 for SSM and 40 for SSR, then R2 = (100 - 40)/100 = 0.6
○ This means that 60% of the variation in mouse size can be
explained by mouse weight.
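As a quick illustration, here is a minimal Python sketch of the same calculation. The x and y arrays are invented stand-ins for mouse weight and mouse size, and np.polyfit is just one convenient way to get the least squares line.

```python
import numpy as np

# Toy data standing in for mouse weight (x) and mouse size (y); values are invented.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Fit the line by least squares (np.polyfit returns [slope, intercept] for degree 1).
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ssm = np.sum((y - y.mean()) ** 2)   # squared distances from the mean
ssr = np.sum((y - y_hat) ** 2)      # squared residuals around the fitted line

r_squared = (ssm - ssr) / ssm
print(f"SSM = {ssm:.3f}, SSR = {ssr:.3f}, R^2 = {r_squared:.3f}")
```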
R2 Values cont.
● Your R2 can only be between 0 and 1.
○ If you have a perfect model (SSR = 0, you have a perfect line
through all the points), then R2=(SSM-0)/SSM = 1
■ In the context of the mouse data, this would mean that 100% of the variation in mouse size can be explained by mouse weight. In other words, when mouse size changes from observation to observation, all of that variation can be attributed to mouse weight.
R2 Values cont.
● Your R2 can only be between 0 and 1.
○ On the other extreme, you can have an R2 of 0.
■ This would mean that knowing mouse weight does not provide any information on mouse size. Therefore, we get a linear model whose slope on mouse weight is 0. In such a case, we get just a flat line through the data, which looks something like this:

[Figure: scatter plot where the fitted line is flat, sitting at roughly the mean mouse size]

● In the figure, light mice and heavy mice don’t differ in mouse size, so weight tells us nothing about size. The flat line sits around the mean of mouse size, since the average is the center point. Here SSR = SSM, therefore R2 = 0.
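A tiny simulation sketch of this situation (the data is randomly generated, so it stands in for “mouse weight” and “mouse size” only by analogy): when x carries no information about y, the fitted slope comes out near 0 and so does R2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)    # "mouse weight": carries no information here
y = rng.normal(5, 1, 200)      # "mouse size": independent of x by construction

b1, b0 = np.polyfit(x, y, 1)   # least squares line
y_hat = b0 + b1 * x

ssm = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)
print(f"slope = {b1:.3f}, R^2 = {(ssm - ssr) / ssm:.4f}")  # slope ~ 0, R^2 ~ 0
```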
What is correlation?
● If R2 is the measure of the variation in your dependent variable explained by your independent variable(s), then there must be a more general statistic that simply captures the strength of the relationship between 2 variables…
● Correlation is the strength of the relationship between an independent and dependent variable and is given by:

r = Sum((xi - x̄)(yi - ȳ)) / sqrt(Sum((xi - x̄)²) · Sum((yi - ȳ)²))
● You might be panicking right now, but as always, Python has a simple function to do this messy calculation for you :).
● Correlation is a value that falls between -1 and 1 where -1 means
that there is strong negative correlation, 0 means no
correlation, and 1 means strong positive correlation.
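For instance, NumPy’s np.corrcoef (or the .corr() method on a pandas DataFrame) computes this for you. A minimal sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")        # close to 1 here: strong positive correlation
print(f"r^2 = {r**2:.3f}")   # for simple (one-variable) regression, r squared equals R^2
```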
Correlation cont.

[Figure: example scatter plots of two variables at different correlation strengths]

● We can see that correlation, r, is simply a measure of the strength of the relationship between two variables.
● We can visually see that the more tightly packed and linear two variables are, the stronger their correlation.
● If the y-values tend to rise as the x-values increase, the variables are positively correlated; if the y-values tend to fall, they are negatively correlated.

[Figure: chart of suggested ranges for interpreting correlation strength]

● A very useful chart for describing the strength of a correlation when asked.
● Different textbooks have different suggestions; I generally use something similar to this chart.
Correlation does NOT imply causation.
● When two variables are correlated, it is tempting to automatically assume that “variable x causes variable y to go up or down”.
● Say we observed ice cream sales being strongly correlated with
shark attacks.
○ Does this mean that the more people eat ice cream, the
more sharks will attack? NO!
○ There is an underlying confounding variable here, the hot weather, which impacts both the independent and the dependent variable.
Multivariate linear regression
● We looked at a small, cute case of linear regression where we have 1 independent variable, but in reality we have many! So we don’t have 1 coefficient estimate but many. Therefore we get:

ŷ = b̂0 + b̂1x1 + b̂2x2 + … + b̂pxp ,

where p is the number of independent variables, the xi are the independent variables, and the b̂i are the beta coefficients (or simply the regression coefficients), chosen so that we get the least squares of the residuals.

● This does not impact how we calculate R2!


Multivariate linear regression example
[Figures: worked example fitting mouse size from mouse weight and mouse tail length]
Multivariate linear regression example cont.
● So now, our estimates of mouse size represented by ŷ, are given
by the following equation.
ŷ = b̂0 + b̂1x1 + b̂2x2
, where b̂0 is the y-intercept, b̂1 is the least squares coefficient for mouse weight, and b̂2 is the least squares coefficient for mouse tail length.
● If, say, the tail length wasn’t useful, the least squares procedure would approximate the corresponding beta coefficient to 0.
● So in an equation like the following:
ŷ = b̂0 + b̂1x1 + b̂2x2 + b̂3x3 + b̂4x4 ,
where x3 is the temperature outside and x4 is the month the mouse was born, the coefficients b̂3 and b̂4 would likely be approximated to 0, since these variables will not do a good job at explaining the variation in mouse size.
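Here is a hedged sketch of fitting such a two-variable model in Python with statsmodels (one common choice; numpy.linalg.lstsq would also work). The weight, tail, and size numbers are invented purely to make the example runnable.

```python
import numpy as np
import statsmodels.api as sm

# Invented toy data: mouse weight (x1), tail length (x2), and size (y).
weight = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0])
tail = np.array([3.0, 3.2, 3.1, 3.5, 3.4, 3.8, 3.7, 4.0])
size = np.array([5.1, 5.9, 6.8, 7.9, 8.8, 10.1, 10.9, 12.2])

# Design matrix with an intercept column, then ordinary least squares.
X = sm.add_constant(np.column_stack([weight, tail]))
model = sm.OLS(size, X).fit()

print(model.params)     # b0 (intercept), b1 (weight), b2 (tail length)
print(model.rsquared)   # R^2 is computed exactly as in the one-variable case
```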
Adjusted R2
● Statistics is weird… sometimes the models we use and how the
associations are picked up using least squares may yield
circumstances where independent variables that aren’t
correlated with our dependent variable are given non-zero
estimates.
● In such a case, we get models that might incorporate
realistically useless features and therefore reduce the SSR
leading to a very misleading R2.
○ In other words, the more parameters we add to our linear
regression model, the more opportunities we give for
random events to reduce the residuals and ultimately lead
to a better R2
● Therefore, R2 can never decrease when you add more variables!
Adjusted R2 cont.
● More formally, the equation for adjusted R2 can be given as
follows:
Adjusted R2 = 1 - ((1-R2)(n-1))/(n-k-1),
where k is the number of independent variables and n is the
number of observations.
● Unlike R2, Adjusted R2 can be negative when there is little
sample data and poor features to predict your response
variable.
● Realistically, once you have a linear model, the only thing that changes is which features (independent variables) you include.
● If you add in more and more useless variables, there’s a chance
R2 may just keep going up.
○ With the adjusted R2 we can ensure we reduce this effect.
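A small sketch of the formula in Python; the R2 values, n, and k below are invented just to show the direction of the effect when a useless variable is added.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - ((1 - R^2)(n - 1)) / (n - k - 1)."""
    return 1 - ((1 - r2) * (n - 1)) / (n - k - 1)

n = 20  # invented number of observations

# A useless extra variable nudges R^2 up slightly, but adjusted R^2 goes down.
print(adjusted_r2(r2=0.60, n=n, k=1))   # ~0.578
print(adjusted_r2(r2=0.61, n=n, k=2))   # ~0.564
```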
Thank you
The notorious p-values
● Before we keep building up our knowledge of linear regression,
we need to learn something very important called p-values.
● First and foremost, p-values are not just probabilities, but rather probabilities of observing results at least as extreme as the one we got.
