
Prof. Dr. Moustapha Ibrahim Salem
[email protected]
01005857099

Simple Linear Regression
Why "Regression"?
• The term was coined by Sir Francis Galton in his 1885 address to the BAAS,
• in describing his findings about the relationship between the heights of children and of their parents.
• He found that children's heights tended to be less extreme than their parents' – closer to the mean: tall parents had tall kids, but less extremely tall, while short couples had short kids, but less extremely short. (The gradient of the kids-on-parents line was 0.61.) He called this:
• "Regression Towards Mediocrity In Hereditary Stature" (Galton, F. (1886))
Galton's data
• Inverting this logic might predict that parents are more extreme than their children.
• This is not so: the gradient of the parents-on-kids line was 0.29, even less than the 1.0 expected from equality.
• Remember: the best-fit line does not just transpose when X and Y swap!
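This asymmetry can be checked numerically. The sketch below uses simulated heights (NOT Galton's actual data; the means, spreads, and the 0.6 slope are assumptions for illustration): the slope of Y-on-X times the slope of X-on-Y equals r², which is below 1 whenever there is scatter, so both regressions pull toward the mean.

```python
import numpy as np

# Simulated illustration (NOT Galton's data): heights with correlation < 1.
rng = np.random.default_rng(42)
parents = rng.normal(68.0, 1.8, 1000)                      # hypothetical mid-parent heights
children = 68.0 + 0.6 * (parents - 68.0) + rng.normal(0, 1.5, 1000)

def ls_slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

b_child_on_parent = ls_slope(parents, children)   # analogous to Galton's 0.61
b_parent_on_child = ls_slope(children, parents)   # analogous to his 0.29
r2 = np.corrcoef(parents, children)[0, 1] ** 2

# The two slopes multiply to r^2 < 1, so neither is the reciprocal of the other.
print(b_child_on_parent, b_parent_on_child, r2)
```

Both slopes come out below 1, just as Galton saw in both directions.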
Types of Regression Models
Regression Models:
• 1 Explanatory Variable → Simple regression (Linear or Non-Linear)
• 2+ Explanatory Variables → Multiple regression (Linear or Non-Linear)
Equation for a line - getting notation straight
• In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine these from data values.
Equation for a line - getting notation straight
• In linear algebra, the equation of a straight line was often written y = mx + b, where m is the slope and b is the intercept.
• In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx. Now a is the intercept, and b is the slope.
Equation for a line - getting notation straight
• Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β0 + β1x or as Y = b0 + b1X (the distinction between β0 and b0 will be made clearer in a few minutes).
• The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.
Populations and samples
• All of statistics is about detecting signals in the face of noise and estimating population parameters from samples. Regression is no different.
• First consider the population. The correct definition of the population is important as part of any study.
• Conceptually, we can think of the large set of all units of interest. On each unit there is, conceptually, both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population.
Populations and samples
• If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or density ρ = w/v. (Deterministic model)
• However, in chemical engineering, the relationship between Y and X is often much more tenuous. (Probabilistic model)
• If you could draw a scatterplot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value.
Populations and samples
• We denote this relationship as Y = β0 + β1X + ε, where now β0 and β1 are the POPULATION intercept and slope respectively.
• The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line).
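A quick simulation makes the model concrete (β0, β1, and σ here are hypothetical values chosen for illustration): every unit gets Y = β0 + β1X + ε, with ε drawn with the same standard deviation at every X.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0          # hypothetical POPULATION values
x = rng.uniform(0.0, 10.0, 100_000)
eps = rng.normal(0.0, sigma, x.size)         # constant sd everywhere: homoscedastic
y = beta0 + beta1 * x + eps

# The scatter about the line is the same in the low-X and high-X halves.
resid = y - (beta0 + beta1 * x)
print(resid[x < 5].std(), resid[x >= 5].std())
```

Both printed spreads sit near σ = 1, illustrating the constant-spread assumption.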
Populations and samples
• Of course, we can never measure all units of the population. So a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation.
Populations and samples
• It is NOT necessary to select a simple random sample from the entire population.
• The bare minimum that must be achieved is that, for any individual X value found in the sample, the units in the population that share this X value must have been selected at random.
Populations and samples
• This is quite a relaxed assumption!
• For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select X values at random from the population as a whole.
• In other words, we can for example select X values to be 3, 5, 9, and 12 ... then measure the Y values corresponding to each X.
Population & Sample Regression Models
Obtaining Estimates
• To distinguish between population parameters and sample estimates, we denote the sample intercept by b0 and the sample slope by b1. The equation fitted to a particular sample of points is expressed as Ŷ = b0 + b1X.
Obtaining Estimates
• How is the best-fitting line found when the points are scattered? Many methods have been proposed (and used) for curve fitting. Some of these methods are:
• least squares
• least absolute deviations
• least median-of-squares
• least maximum absolute deviation
• Bayesian regression
• fuzzy regression
Obtaining Estimates
• In least squares, we minimize the SQUARED deviation in the VERTICAL direction.


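The least-squares estimates have a simple closed form: b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄. A minimal sketch on made-up numbers (the X values 3, 5, 9, 12 echo the earlier slide), cross-checked against numpy's own fit:

```python
import numpy as np

# Made-up illustration data.
x = np.array([3.0, 5.0, 9.0, 12.0])
y = np.array([10.0, 15.0, 21.0, 27.0])

# Closed-form least-squares slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# numpy's least-squares polynomial fit should agree.
slope_np, intercept_np = np.polyfit(x, y, 1)
print(b0, b1)
```

Both routes give the same line, as they minimize the same sum of squared vertical deviations.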
Obtaining Estimates
• The estimated intercept (b0) is the estimated value of Y when X = 0.
• In some cases, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs year, it seems kind of silly to investigate income in year 0.
• In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.
Obtaining Estimates
• The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line increases by b1 units.
• If b1 is negative, the fitted line points downwards, and the increase in the line is negative, i.e., actually a decrease.
• As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae, but in this age of computers, these are not important.
• As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.
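Though the slides defer to software, the standard error of b1 has a simple closed form: se(b1) = s/√Sxx, where s² = SSE/(n − 2). A sketch on made-up data:

```python
import numpy as np

x = np.array([3.0, 5.0, 9.0, 12.0, 14.0, 20.0])    # made-up data
y = np.array([12.0, 14.0, 21.0, 24.0, 30.0, 41.0])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)      # residual variance on n - 2 df
se_b1 = np.sqrt(s2 / Sxx)

# Approximate 95% CI: estimate +/- 2 * se
ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)
print(b1, se_b1, ci)
```

The interval is for the population slope β1, even though it is computed from the sample quantities b1 and se(b1).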
Obtaining Estimates
• Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter, as this is typically of most interest.
• The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatterplot showing such a relationship?). More formally, the null hypothesis is H0: β1 = 0.
• Notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.
Inference about the Slope: t Test
• t test for a population slope: is there a linear relationship between x and y?
• Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic: t = b1 / se(b1), with n − 2 degrees of freedom.
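The test statistic can be computed directly from the slope and its standard error; a sketch on made-up data, using scipy for the two-sided p-value and cross-checked against scipy's own linregress:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # made-up data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])
n = x.size

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
se_b1 = np.sqrt(s2 / Sxx)

t_stat = b1 / se_b1                                # t = b1 / se(b1)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided, n - 2 df
print(t_stat, p_value)
```

Here the data lie very close to a line, so the p-value is essentially zero and H0 is rejected.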
Inference about the Slope: t Test
Regression Analysis for Description
Obtaining Estimates
• Confidence and prediction intervals
• The prediction interval predicts in what range a future individual observation will fall.
• The confidence interval, by contrast, covers the line itself: there is a 95% probability that the true best-fit line for the population lies within the confidence interval.
Assumptions
• Linearity:
• Regression analysis assumes that the relationship between Y and X is linear. Make a scatterplot of Y against X to assess this assumption.
• Correct sampling scheme:
• The Y must be a random sample from the population of Y values for every X value in the sample.
• Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen.
• However, for a given X, the values from the population must be a simple random sample.
Assumptions
• No outliers or influential points:
• All the points must belong to the relationship – there should be no unusual points.
• The scatterplot of Y vs X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit.
Assumptions
• Equal variation along the line (homoscedasticity):
• The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line.
• Independence:
• Each value of Y is independent of any other value of Y. The most common case where this fails is time-series data where X is a time measurement. In these cases, time-series analysis should be used.
Assumptions
• Normality of errors:
• The residuals (the difference between the value of Y and the point on the line) must be normally distributed.
Assumptions
• X measured without error: In regression, it can turn out that the X value may not be known exactly. This general problem is called the "error in variables" problem and has a long history in statistics.
• It turns out that there are two important cases. (1) If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates.
Assumptions
• (2) However, if the value used for X is an actual measurement of the true underlying X, then there is uncertainty in both the X and Y directions.
• In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the true population values!
• This latter case of "error in variables" is very difficult to analyze properly, and there are no universally accepted solutions.
F-Test of Model Significance
Residual plots
• After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals.
• The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as: eᵢ = yᵢ − ŷᵢ = yᵢ − (b0 + b1xᵢ).
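The residual calculation can be sketched directly (made-up data). Two useful facts fall out of the least-squares fit: the residuals sum to zero, and they are uncorrelated with X.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
y = np.array([2.2, 3.8, 6.1, 8.2, 9.7])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x          # predicted values on the fitted line
residuals = y - y_hat        # observed minus predicted: vertical distances
print(residuals)
```

Any systematic pattern remaining in a plot of these residuals against X signals a problem with the fitted model.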
Residual Plots
• Typical residual plots are illustrated below – but note that with small datasets, the patterns will not be as clear-cut.
• Therefore, with small datasets, don't over-analyze the plots – only gross deviations from the ideal plots are of interest.
Residual Plots
• Let's have another look at the major warning signs in residual plots.
Quiz ... Comment on the plot
Probability Plots
• The probability plot is a graphical technique for assessing whether or not a data set follows a given distribution, such as the normal distribution.
• The data are plotted against a theoretical normal distribution in such a way that the points should form approximately a straight line. Departures from this straight line indicate departures from the specified distribution.
• If the residuals are not normally distributed, the t-tests on regression coefficients, the F-tests, and the interval estimates are not valid. This is a critical assumption to check.
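scipy implements exactly this plot: probplot returns the ordered data against theoretical normal quantiles, plus the correlation r of the straight-line fit, assuming scipy is available. (The residuals here are simulated for illustration.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(0.0, 1.0, 200)        # simulated, well-behaved residuals

# Returns (theoretical quantiles, ordered data) and
# (slope, intercept, r) of the straight-line fit through the plot.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals)
print(r)   # close to 1 for normally distributed residuals
```

A value of r near 1 supports normality; marked curvature or stragglers pull r down.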
Probability Plots
• Stragglers at either end of the normal probability plot indicate outliers.
• Curvature at both ends of the plot indicates long or short distributional tails.
• Convex or concave curvature indicates a lack of symmetry.
• Gaps, plateaus, or segmentation indicate clustering and may require a closer examination of the data or model.
• Of course, use of this graphical tool with very small sample sizes (less than 100 points) is unwise.
Example - Yield and fertilizer
• We wish to investigate the relationship between yield (liters) and fertilizer (kg/ha) for tomato plants.
• Interest also lies in predicting the yield when 16 kg/ha are assigned. The levels of fertilizer were randomly assigned to each plot. At the end of the experiment, the yields were measured and the following data were obtained.
• In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.
Example - Yield and fertilizer
Example - Yield and fertilizer - KYPLOT Analysis
• The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset.
Example - Yield and fertilizer - KYPLOT Analysis
• Use the Statistics -> Regression Analysis -> Simple Regression platform to start the analysis. Specify the Y and X variables as needed.
Example - Yield and fertilizer - KYPLOT Analysis
• A new window will pop up, with the following regression results:
Example - Yield and fertilizer - KYPLOT Analysis
• At this stage, it would also be useful to draw a scatter plot of the data (refer to previous KYPLOT tutorials):
• The relationship looks approximately linear.
• There don't appear to be any outliers or influential points.
• The scatter appears to be roughly equal across the entire regression line.
• Residual plots will be used later to check these assumptions in more detail.
Example - Yield and fertilizer - KYPLOT Analysis
• The Fit menu item allows you to fit the least-squares line.
• The actual fitted line is drawn on the scatter plot,
• and the straight-line equation coefficients (here called A1 for the intercept and A2 for the slope) of the fitted line are printed below the fit spreadsheet.
Example - Yield and fertilizer - KYPLOT Analysis
• The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha.
Example - Yield and fertilizer - KYPLOT Analysis
Example - Yield and fertilizer - KYPLOT Analysis
• The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. A standard error can also be found for the intercept, as shown in the above table.
• We interpret this interval as "being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between 0.837 and 1.365 L/kg."
• Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.
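The interval quoted above is just estimate ± 2 × se; a two-line check of the arithmetic using the values from the KYPLOT output:

```python
b1, se_b1 = 1.10137, 0.132               # slope and its standard error from the output above
lo, hi = b1 - 2 * se_b1, b1 + 2 * se_b1  # approximate 95% CI for beta1
print(round(lo, 3), round(hi, 3))        # reproduces the quoted (0.837, 1.365) L/kg
```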
Example - Yield and fertilizer - KYPLOT Analysis
• In linear regression problems, one hypothesis of interest is whether the true slope is zero. This would correspond to no linear relationship between the response and predictor variables (why?). In many cases, a confidence interval tells the entire story.
• KYPLOT produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below:
Example - Yield and fertilizer - KYPLOT Analysis
• The hypothesis testing proceeds as follows.
1. Specify the null and alternate hypotheses:
• H0: β1 = 0 (no linear relationship)
• H1: β1 ≠ 0 (linear relationship does exist)
• Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test, as we are interested in detecting differences from zero in either direction.
Example - Yield and fertilizer - KYPLOT Analysis
• The residuals are simply the difference between the actual data point and the corresponding spot on the line, measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.
Example - Yield and fertilizer - KYPLOT Analysis
• The above normal probability plot of residuals was created by NCSS software.
Transformations
• In some cases, the plot of Y vs X is obviously non-linear, and a transformation of X or Y may be used to establish linearity.
• Often a visual inspection of a plot may identify the appropriate transformation. There is no theoretical difficulty in fitting a linear regression using transformed variables, other than understanding the implicit assumption about the error structure. The model for a fit on transformed data is of the form trans(Y) = β0 + β1 · trans(X) + ε.
Transformations
• Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to hold on the transformed scale – in particular, that the population standard deviation around the regression line is constant on the transformed scale.
• The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm transformation (often called the log10 transformation) is used.
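The claim that ln vs log10 doesn't matter can be checked: since ln(y) = ln(10) · log10(y), the two fitted slopes differ only by the constant factor ln(10) ≈ 2.3026, so all conclusions are unchanged. A sketch on made-up exponentially decaying data:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(0.0, 10.0)
# Made-up decaying series with multiplicative noise (additive on the log scale).
y = 50.0 * 0.9 ** t * np.exp(rng.normal(0, 0.05, t.size))

slope_ln, _ = np.polyfit(t, np.log(y), 1)
slope_log10, _ = np.polyfit(t, np.log10(y), 1)

# The two slopes differ only by the factor ln(10).
print(slope_ln / slope_log10)
```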

Transformations
• Some common transformations:
• After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale.
Transformations - Example: Monitoring Dioxins
• An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
• Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.
Transformations - Example: Monitoring Dioxins
• Each year, four crabs are captured from a monitoring station. The liver is excised, and the livers from all four crabs are composited together into a single sample. The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.
• Here is the raw data.
Transformations - Example: Monitoring Dioxins
• As with all analyses, start with a preliminary plot of the data, obtained using the Graph -> Create Graph platform.
Transformations - Example: Monitoring Dioxins
• The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of the dioxins degrades per year, e.g. a 10% decline per year. This can be expressed in a non-linear relationship:
Transformations - Example: Monitoring Dioxins
• TEQ = C(1 − r)^t, where C is the initial concentration, r is the rate of reduction per year, and t is the elapsed time.
• If this is plotted over time, this leads to the non-linear pattern seen in the plot.
Transformations - Example: Monitoring Dioxins
Transformations - Example: Monitoring Dioxins
• If you want a detailed regression analysis, then you should perform the regression separately from the Statistics -> Regression Analysis -> Simple Regression menu.
Transformations - Example: Monitoring Dioxins
• Have a look at the normal probability plot (npp). It looks even better than the one obtained in the fertilizer-yield example, but it looks "tailed". However, r² for the npp is > 0.98, i.e. the residuals are accepted to be normal for this sample size.
Transformations - Example: Monitoring Dioxins
• Possible remedies for the failure of these tests include using a transformation of Y such as the log or square root, correcting data-recording errors found by looking into outliers, adding additional independent variables, using robust regression, or using bootstrap methods. We will discuss these methods briefly in class.
• The fitted line is: log(TEQ) = 80.5 − 0.0394(year). The intercept (80.5) would be the log(TEQ) in year 0, which is clearly nonsensical.
• The slope (−0.0394) is the estimated log(ratio) from one year to the next. It would mean that the TEQ in one year is only 91.3% of the TEQ in the previous year, or roughly a 9% decline per year.
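The 91.3% figure follows from back-transforming the slope, assuming the common (log10) logarithm was used in the fit:

```python
ratio = 10 ** (-0.0394)   # yearly TEQ ratio implied by the fitted log10 slope
print(round(ratio, 3))    # each year's TEQ is about 91.3% of the previous year's
```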
Explained and Unexplained Variation
• SST = total sum of squares
 Measures the variation of the yᵢ values around their mean ȳ
• SSE = error sum of squares
 Variation attributable to factors other than the relationship between x and y
• SSR = regression sum of squares
 Explained variation attributable to the relationship between x and y
Explained and Unexplained Variation
• Coefficient of Determination, R²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable: R² = SSR / SST.
• The coefficient of determination is also called R-squared and is denoted R².
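The decomposition SST = SSR + SSE and the definition of R² can be verified numerically on made-up data; for simple linear regression, R² also equals the squared correlation between x and y.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # made-up data
y = np.array([2.0, 4.1, 5.9, 8.3, 9.8, 12.2])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)      # total variation
SSE = np.sum((y - y_hat) ** 2)         # unexplained variation
SSR = np.sum((y_hat - y.mean()) ** 2)  # explained variation

R2 = SSR / SST
print(SST, SSE + SSR, R2)              # SST = SSR + SSE, and R2 is near 1 here
```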
Explained and Unexplained Variation
End of Chapter 5
