
BDM 2053

Big Data Algorithms and Statistics
Week 3


Weekly Course Objectives
● Inferential Statistics vs. Descriptive Statistics
● Linear regression?
○ What is it?
○ What is least squares?
● What is R2?
● What is correlation?
● What are multivariate linear models?
● What is adjusted R2?
● Do some examples in Python!
Descriptive Statistics
● We have been looking mainly at descriptive statistics, which are summaries of data either through central tendencies and measures of variability (mean, median, etc.) or graphs (histograms, boxplots, etc.).
● Descriptive statistics describe a sample of observations. We
simply take our data, then use summary statistics and graphs to
present characteristics of the data.
● There is no uncertainty with descriptive statistics because we
only look at describing what we have, and not inferring to
anything outside our data.
Inferential Statistics
● One example of inferential statistics we looked at last week was confidence intervals.
● Inferential statistics takes information from a sample and
makes inferences about the larger population (confidence
intervals, regression, etc).
● To make inferences on the population, we must have a good,
representative, sample!
● There is uncertainty with inferential statistics because we will
make inferences to the greater population based on our sample.
We must sample appropriately to reduce uncertainty in our
inferences.
● Another way to make inferences is to make predictions, and linear regression is one method for doing so!
Linear Regression
● Regression is the measure of relation between two or more
variables.
● Linear regression is therefore a linear measure of relation
between two or more variables.
○ This is done by finding the best line, i.e. the one that minimizes the distance between the observed values and the values predicted by the line.
● The distances between our observations and the corresponding points on the line (which are the predictions in this case) are called our residuals.
● This process of finding the line that reduces the distance
between our observations and the corresponding value on the
line is called least squares.
○ It’s called least squares because we take the squared
differences between our observed values and the predicted
values. In other words, we square our residuals.
Linear Regression cont.

[Figure: scatter plot of the data (red points) with a dashed fitted regression line]

● In the figure, the red points are the actual data, the dashed line is our linear regression line, and the distances from the red points to the regression line are our residuals.
● Depending on the line fitted, the squared residuals will change. How do we quantitatively find the best line?
Least Squares Method
● For the Least Squares Method, we are basically finding the minimum sum of squared residuals, aka SSR (residual sum of squares, aka RSS). Therefore we want to minimize:

SSR = Sum((yi - (b̂0 + b̂1xi))²) ,

where b̂0 is the “y-intercept” and b̂1 is the slope.

● As you might imagine, since this is a minimization problem, we need to take derivatives. Proving this would take up a good chunk of the lecture time, so I will leave the link here.
● The equation of the simple linear regression model is:
ŷ = b̂0 + b̂1x
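To make this concrete, here is a minimal Python sketch of the least squares fit. The x and y arrays are invented toy data, and the closed-form slope and intercept used below are what the derivative conditions work out to.

```python
# A minimal sketch of simple linear regression by least squares.
# The x and y values are invented purely for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # independent variable
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])   # dependent variable

# Closed-form least squares estimates (what the derivatives work out to).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x            # predictions on the fitted line
residuals = y - y_hat          # distances from observations to the line
ssr = np.sum(residuals ** 2)   # the quantity least squares minimizes

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, SSR = {ssr:.3f}")
```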
Use R2 to assess your model!
● We talked a lot about dispersion so far. The most common measure of dispersion was variance: the average of the squared distances between our observations and the mean of the observations.
○ If the variance is small, our observations are close to the average of the observations (and if it is 0, every observation equals the average).
● R2 is simply the proportion of the variation in our response variable (target variable, dependent variable) that is explained by our independent variable(s).
○ It can be expressed as:
R2 = (SSM - SSR)/SSM
● The above equation means that we can reduce the variance in our response variable when we take our independent variables into account!
Still confused about R2?
● Let’s look at an example… with weight!... But of mice!

[Figure: scatter plot of mouse size vs. mouse weight, with the average mouse size marked]

● Above, the red points are the actual data for mouse size and mouse weight.
● Since we are interested in the variance of our target variable, let’s calculate the average mouse size here.
R2 example cont.

[Figure: the mouse data shown with a horizontal line at the mean size, and with the fitted regression line]

● We calculate the variance of the mouse size by summing the squared distances from the mean (SSM) and dividing by n. Therefore:
Var(mean) = Sum((data - mean)²)/n = SSM/n
● Similarly, we can capture the variance around the fitted values by summing the squared residuals (SSR) and dividing by n. Therefore:
Var(fit) = Sum((data - fit)²)/n = Sum(residuals²)/n = SSR/n
R2 example cont.
● Therefore, R2 would be the variation in the target variable, in this case mouse size, explained by our independent variable (explanatory variable), mouse weight:

R2 = (Var(mean) - Var(fit))/Var(mean)

● Since the variances have the same denominator (n), the R2 value can be thought of as simply:

R2 = (SSM - SSR)/SSM

● If we got 100 for SSM and 40 for SSR, then R2 = (100 - 40)/100 = 0.6
○ This means that 60% of the variation in mouse size can be
explained by mouse weight.
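As a quick illustration, here is a minimal Python sketch of the same calculation. The x and y arrays are invented stand-ins for mouse weight and mouse size, and np.polyfit is just one convenient way to get the least squares line.

```python
import numpy as np

# Toy data standing in for mouse weight (x) and mouse size (y); values are invented.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Fit the line by least squares (np.polyfit returns [slope, intercept] for degree 1).
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ssm = np.sum((y - y.mean()) ** 2)   # squared distances from the mean
ssr = np.sum((y - y_hat) ** 2)      # squared residuals around the fitted line

r_squared = (ssm - ssr) / ssm
print(f"SSM = {ssm:.3f}, SSR = {ssr:.3f}, R^2 = {r_squared:.3f}")
```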
R2 Values cont.
● Your R2 can only be between 0 and 1.
○ If you have a perfect model (SSR = 0, you have a perfect line
through all the points), then R2=(SSM-0)/SSM = 1
■ In the context of the mouse data, this would mean that 100% of the variation in mouse size can be explained by mouse weight. In other words, when mouse size changes from observation to observation, all of that variation can be attributed to mouse weight.
R2 Values cont.
● Your R2 can only be between 0 and 1.
○ On the other extreme, you can have an R2 of 0.
■ This would mean that knowing mouse weight does not provide any information on mouse size. Therefore, we get a linear model whose slope on mouse weight is 0. In such a case, we get just a flat line through the data, which looks something like this:

[Figure: scatter plot where the fitted line is flat, sitting at roughly the mean mouse size]

● In the figure, light mice and heavy mice don’t differ in mouse size, so weight tells us nothing about size. The flat line sits around the mean of mouse size, since the average is the center point. Here SSR = SSM, therefore R2 = 0.
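A tiny simulation sketch of this situation (the data is randomly generated, so it stands in for “mouse weight” and “mouse size” only by analogy): when x carries no information about y, the fitted slope comes out near 0 and so does R2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)    # "mouse weight": carries no information here
y = rng.normal(5, 1, 200)      # "mouse size": independent of x by construction

b1, b0 = np.polyfit(x, y, 1)   # least squares line
y_hat = b0 + b1 * x

ssm = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)
print(f"slope = {b1:.3f}, R^2 = {(ssm - ssr) / ssm:.4f}")  # slope ~ 0, R^2 ~ 0
```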
What is correlation?
● If R2 is the measure of the variation in your dependent variable explained by your independent variable(s), then there must be a more general statistic that simply captures the strength of the relationship between 2 variables…
● Correlation is the strength of the relationship between an independent and dependent variable and is given by:

r = Sum((xi - x̄)(yi - ȳ)) / sqrt(Sum((xi - x̄)²) · Sum((yi - ȳ)²))
● You might be panicking right now, but as always, Python has a simple function to do this messy calculation for you :).
● Correlation is a value that falls between -1 and 1 where -1 means
that there is strong negative correlation, 0 means no
correlation, and 1 means strong positive correlation.
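For instance, NumPy’s np.corrcoef (or the .corr() method on a pandas DataFrame) computes this for you. A minimal sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")        # close to 1 here: strong positive correlation
print(f"r^2 = {r**2:.3f}")   # for simple (one-variable) regression, r squared equals R^2
```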
Correlation cont.

[Figure: example scatter plots of two variables at different correlation strengths]

● We can see that correlation, r, is simply a measure of the strength of the relationship between two variables.
● We can visually see that the more tightly packed and linear two variables are, the stronger their correlation.
● If the y-values tend to rise as the x-values increase, the variables are positively correlated; if the y-values tend to fall, they are negatively correlated.

[Figure: chart of suggested ranges for interpreting correlation strength]

● A very useful chart for describing the strength of a correlation when asked.
● Different textbooks have different suggestions; I generally use something similar to this chart.
Correlation does NOT imply causation.
● When two variables are correlated, it is tempting to automatically assume that “variable x causes variable y to go up or down”.
● Say we observed ice cream sales being strongly correlated with
shark attacks.
○ Does this mean that the more people eat ice cream, the
more sharks will attack? NO!
○ There is an underlying confounding variable here, the hot weather, which impacts both the independent and the dependent variable.
Multivariate linear regression
● We looked at a small, cute case of linear regression where we have 1 independent variable, but in reality we have many! So we don’t have 1 coefficient estimate but many. Therefore we get:

ŷ = b̂0 + b̂1x1 + b̂2x2 + … + b̂pxp ,

where p is the number of independent variables, the xi are the independent variables, and the b̂i are the beta coefficients (or simply the regression coefficients), chosen so that we get the least squares of the residuals.

● This does not impact how we calculate R2!


Multivariate linear regression example
[Figures: worked example fitting mouse size from mouse weight and mouse tail length]
Multivariate linear regression example cont.
● So now, our estimates of mouse size represented by ŷ, are given
by the following equation.
ŷ = b̂0 + b̂1x1 + b̂2x2
, where b̂0 is the y-intercept, b̂1 is the least squares coefficient for mouse weight, and b̂2 is the least squares coefficient for mouse tail length.
● If, say, the tail length wasn’t useful, the least squares procedure would approximate the corresponding beta coefficient to 0.
● So in an equation like the following:
ŷ = b̂0 + b̂1x1 + b̂2x2 + b̂3x3 + b̂4x4 ,
where x3 is the temperature outside and x4 is the month the mouse was born, the coefficients b̂3 and b̂4 would likely be approximated to 0, since these variables will not do a good job at explaining the variation in mouse size.
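Here is a hedged sketch of fitting such a two-variable model in Python with statsmodels (one common choice; numpy.linalg.lstsq would also work). The weight, tail, and size numbers are invented purely to make the example runnable.

```python
import numpy as np
import statsmodels.api as sm

# Invented toy data: mouse weight (x1), tail length (x2), and size (y).
weight = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0])
tail = np.array([3.0, 3.2, 3.1, 3.5, 3.4, 3.8, 3.7, 4.0])
size = np.array([5.1, 5.9, 6.8, 7.9, 8.8, 10.1, 10.9, 12.2])

# Design matrix with an intercept column, then ordinary least squares.
X = sm.add_constant(np.column_stack([weight, tail]))
model = sm.OLS(size, X).fit()

print(model.params)     # b0 (intercept), b1 (weight), b2 (tail length)
print(model.rsquared)   # R^2 is computed exactly as in the one-variable case
```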
Adjusted R2
● Statistics is weird… sometimes the models we use and how the
associations are picked up using least squares may yield
circumstances where independent variables that aren’t
correlated with our dependent variable are given non-zero
estimates.
● In such a case, we get models that might incorporate
realistically useless features and therefore reduce the SSR
leading to a very misleading R2.
○ In other words, the more parameters we add to our linear
regression model, the more opportunities we give for
random events to reduce the residuals and ultimately lead
to a better R2
● Therefore, R2 can never decrease when you add more variables!
Adjusted R2 cont.
● More formally, the equation for adjusted R2 can be given as
follows:
Adjusted R2 = 1 - ((1-R2)(n-1))/(n-k-1),
where k is the number of independent variables and n is the
number of observations.
● Unlike R2, Adjusted R2 can be negative when there is little
sample data and poor features to predict your response
variable.
● Realistically, once you have a linear model, the only thing that changes is which features (independent variables) you include.
● If you add in more and more useless variables, there’s a chance
R2 may just keep going up.
○ With the adjusted R2 we can ensure we reduce this effect.
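A small sketch of the formula in Python; the R2 values, n, and k below are invented just to show the direction of the effect when a useless variable is added.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - ((1 - R^2)(n - 1)) / (n - k - 1)."""
    return 1 - ((1 - r2) * (n - 1)) / (n - k - 1)

n = 20  # invented number of observations

# A useless extra variable nudges R^2 up slightly, but adjusted R^2 goes down.
print(adjusted_r2(r2=0.60, n=n, k=1))   # ~0.578
print(adjusted_r2(r2=0.61, n=n, k=2))   # ~0.564
```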
Thank you
The notorious p-values
● Before we keep building up our knowledge of linear regression,
we need to learn something very important called p-values.
● First and foremost, p-values are not just probabilities, but rather probabilities of observing results at least as extreme as the one we got.
