
Regression Analysis (Stat3041)

Yabebal Ayalew

College of Natural & Computational Science


Statistics Department
Chapter One

Introduction
Introduction
Outline

1 Overview of Regression Analysis


2 Applications of Regression Analysis
3 Steps in Regression Analysis



Introduction
Overview of Regression Analysis

• A straight line is a simple geometric shape that has no curves and extends infinitely in both directions. It is the
shortest distance between any two points.
• In a coordinate plane (which has an x-axis and a y-axis),
a straight line can be represented by a mathematical
equation. This equation tells us how the x and y
coordinates of any point on the line are related.
The most basic form of the equation of a straight line is

y = β0 + β1 x

where y is the value on the y-axis, x is the value on the x-axis, β1 is the slope of the line, which shows
how steep the line is and β0 is the y-intercept, which is the point where the line crosses the y-axis.



Introduction
Overview of Regression Analysis

• The equation to find the slope β1 of a straight line when you know two points on the line, say (x1, y1) and (x2, y2), is given by

    β1 = (y1 − y2) / (x1 − x2) = (y2 − y1) / (x2 − x1)

• Example: If you have two points on a line, say (1, 2) and (4, 6):

    β1 = (y1 − y2) / (x1 − x2) = (2 − 6) / (1 − 4) = 4/3

  The intercept β0 is

    β0 = y1 − β1 x1 = y2 − β1 x2 = 2/3

  The line equation will be y = 2/3 + (4/3) x
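As a quick illustration, here is a minimal Python sketch of the same two-point computation (the helper name line_through and the sample points are purely illustrative):

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)      # beta_1
    intercept = y1 - slope * x1        # beta_0 = y1 - beta_1 * x1
    return intercept, slope

b0, b1 = line_through((1, 2), (4, 6))
print(b0, b1)  # 0.666..., 1.333...  ->  y = 2/3 + (4/3) x
```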



Introduction
Overview of Regression Analysis

Imagine you have a set of data points that show how the number of hours you study affects your test scores.
You plot these points on a graph, but they don’t all
lie on a perfect straight line. If you were asked to
draw a line that best represents the relationship
between study hours and test scores, how would you
decide where to place this line? What factors would
you consider?
1 Do you think this line would be the same if we had more
or different data points?
2 How would you use this line to predict test scores for
someone who studied for a different number of hours?
3 What do you think it means if a point is far from the line
you’ve drawn?



Introduction
Overview of Regression Analysis

Imagine you’re planning to buy a house and want to know how much it might cost. You notice that house
prices depend on various factors like the size of the
house, the number of bedrooms, the neighborhood,
and even the distance from the city center.
You collect data from houses in the area, including
their prices, sizes, and other features. But when you
plot this data on a graph, the points are
scattered—they don’t form a perfect line. You can’t
just use a simple equation to predict the price based on
size alone because many factors influence the price.



Introduction
Overview of Regression Analysis

• This is where regression analysis becomes powerful. It helps you take all those scattered data points and find the
best possible line (or curve) that represents the overall
trend. This line won’t pass through every point, but it
will be close to most of them.
• With this line, you can:
— Predict the price of a house based on its features,
even if you haven’t seen the exact house before.
— Understand which factors influence the price most.
— Make decisions with more confidence, knowing that
data back your predictions.



Introduction
Overview of Regression Analysis

Imagine you’re a farmer trying to maximize the yield of your crops. You know that many factors influence your
harvest, such as the amount of fertilizer you use, the
amount of water your crops receive, the type of soil,
and the amount of sunlight. You’ve been keeping track
of these factors and the resulting yields over several
years.
You decide to analyze your data, but you quickly
notice that the relationship between each factor and
the crop yield isn’t straightforward. Sometimes, more
fertilizer improves the yield, but too much fertilizer
actually reduces it. Similarly, more water generally
helps, but only up to a certain point.



Introduction
Overview of Regression Analysis

• This is where regression analysis becomes essential. It allows you to:
1 Quantify the Relationship: Regression helps you determine
exactly how much each factor (like water, fertilizer, or
sunlight) affects your crop yield. For example, it might show
that for every additional liter of water per square meter, your
yield increases by a certain amount—up to a limit.
2 Identify Optimal Levels: Regression can help you find the
optimal combination of factors. For instance, it might reveal
that using a specific amount of fertilizer combined with a
certain amount of water produces the best results, helping
you avoid overuse or underuse.
3 Predict Future Yields: Regression allows you to predict your crop yield under different conditions by analyzing past data. This is crucial for planning, resource allocation, and maximizing your profits.
What is regression analysis?
Introduction
Overview of Regression Analysis

• Regression Analysis is a statistical method used to examine the relationship between one dependent
variable (often referred to as the outcome or response variable) and one or more independent
variables (also called predictors or explanatory variables).
• Regression methods are concerned with two types of variables: the explanatory (or independent)
variables x and the dependent variables y. The collection of methods that are referred to as
regression methods have several objectives
1 Modeling of the response y given x such that the underlying structure of the influence of x on
y is found.
2 Quantification of the influence of x on y.
3 Prediction of y given an observation x.

• Regression analysis is an example of a statistical model. A statistical model is a mathematical representation of observed data. It encapsulates the underlying relationships between
variables in the form of equations, probabilities, or other statistical structures



Introduction
Overview of Regression Analysis

Historic Development of Regression


• The concept of regression analysis dates back to the late 19th century and is most closely associated
with Sir Francis Galton, a British polymath.
— Galton was interested in understanding the relationship between parents’ heights and their
children’s heights
— In 1886, he observed that children’s heights tended to regress towards the mean height of the
population, meaning that tall parents tended to have slightly shorter children and short parents
had slightly taller children, on average. This phenomenon was referred to as "regression toward
the mean."
• Karl Pearson: A colleague of Galton, Karl Pearson, further developed the statistical methods
underlying regression. He introduced the concept of the correlation coefficient, which quantifies the
strength of the relationship between two variables.
• Yule and the Multiple Regression: In the early 20th century, G. Udny Yule extended the concept
of regression to multiple variables, leading to the development of multiple regression analysis. This
allowed for studying relationships between one dependent variable and several independent variables.



Introduction
Overview of Regression Analysis

Portraits: Sir Francis Galton (1822–1911), Karl Pearson (1857–1936), and G. Udny Yule (1871–1951).



Introduction
Overview of Regression Analysis

Fisher and the Formalization of Regression: Sir Ronald A. Fisher (1890–1962), a prominent statistician, played a
critical role in formalizing regression analysis as part of the broader field of statistical modeling. In the 1920s
and 1930s, Fisher introduced the least squares method for estimating the parameters of a regression line, a
technique still fundamental in regression analysis today. He also developed the analysis of variance (ANOVA)
method, which is often used in conjunction with regression.



Introduction
Overview of Regression Analysis—Structure

• A statistical model is a mathematical representation of the relationships between different variables in the real world. It is designed to capture the essential features of these relationships while ignoring the
complexities and details that are not critical for the analysis at hand.
• The main purpose of using a model is to simplify a complex reality into a form that can be analyzed
and understood. By focusing on key variables and relationships, statistical models allow us to make
predictions, test hypotheses, and understand underlying patterns in the data.
• Regression modeling uses several structural components. In particular, it is useful to distinguish
between the random component, which usually is specified by some distributional assumption,
and the structural component, which specifies the structuring of the covariates x. More specifically, in a
structured regression, the mean µ (or any other parameter) of the dependent variable y is modeled as
a function in x in the form
µ = h(η(x))
where h is a transformation and η(x) is a structured term. A very simple form is used in classical
linear regression, where one assumes



Introduction
Overview of Regression Analysis—Structure

    µ = β0 + β1 x1 + β2 x2 + · · · + βp xp = β0 + xᵀβ

with the parameter vector βᵀ = (β1, β2, · · · , βp) and the vector of covariates xᵀ = (x1, x2, · · · , xp). Thus,
classical linear regression assumes that the mean µ is directly linked to a linear predictor η(x) = β0 + xᵀβ.
Covariates determine the mean response by a linear term, and the link h is the identity function. The
distributional part in classical linear regression follows from assuming a normal distribution for y|x.
In logistic regression, when the response takes a value of 0 or 1, the mean corresponds to the
probability P r(y = 1|x). Then, the identity link h is a questionable choice since the
probabilities are between 0 and 1. A transformation h that maps η(x) into the interval [0, 1]
typically yields more appropriate models.
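The following is a minimal, illustrative Python sketch (not from the course materials) contrasting the identity link of classical linear regression with a logistic (sigmoid) link; the coefficient values are arbitrary:

```python
import numpy as np

def linear_predictor(x, beta0, beta):
    """eta(x) = beta0 + x^T beta"""
    return beta0 + np.dot(x, beta)

def identity_link(eta):
    """Classical linear regression: mu = eta."""
    return eta

def logistic_link(eta):
    """Logistic regression: mu = P(y = 1 | x), mapped into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

x = np.array([1.5, 2.0])            # illustrative covariates
beta0, beta = 0.5, np.array([0.3, -0.2])
eta = linear_predictor(x, beta0, beta)
print(identity_link(eta), logistic_link(eta))
```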



Introduction
Applications of Regression Analysis

Regression analysis is a versatile tool that finds applications across numerous fields, allowing
for data-driven decision-making, forecasting, and understanding complex relationships
1 Economics and Finance
— Forecasting Economic Indicators: Economists use regression analysis to predict key indicators
like GDP, inflation rates, unemployment, and stock prices based on historical data and
influencing factors.
— Risk Management: In finance, regression models help assess the relationship between various
risk factors (e.g., interest rates, market indices) and the returns of financial assets, aiding in
portfolio management and risk assessment.
— Pricing Models: Regression is used in developing pricing models for options, bonds, and other
financial instruments, often involving multiple variables like interest rates and asset prices.



Introduction
Applications of Regression Analysis
2 Marketing and Sales
— Customer Behavior Analysis: Companies use regression analysis to understand the
relationship between advertising spend, price changes, and sales performance. This helps in
optimizing marketing strategies.
— Market Research: Regression models are employed to analyze survey data, identifying key
factors that influence customer satisfaction, brand loyalty, and purchasing decisions.
— Sales Forecasting: By analyzing past sales data along with factors like seasonality, economic
conditions, and marketing efforts, regression helps predict future sales.
3 Healthcare and Medicine
— Epidemiology: Regression analysis is used to study the relationship between risk factors (e.g.,
age, lifestyle, genetic predisposition) and health outcomes, such as the likelihood of developing a
disease.
— Clinical Trials: In clinical research, regression models help evaluate the effectiveness of new
treatments by analyzing patient outcomes relative to various predictors.
— Public Health: Regression is used to model the spread of diseases and the impact of public
health interventions, such as vaccination programs or policy changes.
Introduction
Applications of Regression Analysis

4 Social Sciences
— Sociology and Psychology: Researchers use regression to explore the relationship between
social factors (e.g., education, income, family background) and outcomes like crime rates,
educational attainment, or mental health.
— Education: Regression models help assess the impact of different teaching methods, class sizes,
and other factors on student performance, guiding educational policy and practice.
5 Engineering and Manufacturing
— Quality Control: In manufacturing, regression analysis is used to identify the factors that most
influence product quality, enabling companies to optimize production processes and reduce
defects.
— Reliability Engineering: Engineers use regression models to predict the lifespan of products or
components based on various stress factors, guiding maintenance and warranty decisions.
— Process Optimization: Regression helps in modeling and optimizing manufacturing processes
by analyzing the relationship between input variables (like temperature, pressure) and output
quality.



Introduction
Applications of Regression Analysis

6 Environmental Science
— Climate Change Studies: Scientists use regression analysis to model the relationship between
greenhouse gas emissions and temperature changes, helping to understand and predict the
impacts of climate change.
— Pollution Control: Regression models are applied to assess the relationship between pollutant
levels and health outcomes, guiding environmental regulations and public health interventions.
— Resource Management: In agriculture and forestry, regression is used to model crop yields,
forest growth, and resource depletion, informing sustainable management practices.
7 Agriculture
— Crop Yield Prediction: Farmers and agricultural scientists use regression to predict crop yields
based on factors such as soil quality, weather conditions, and agricultural practices.
— Livestock Management: Regression models help optimize feeding strategies and improve
livestock health by analyzing the impact of diet, environment, and genetics on animal growth
and productivity.



Introduction
Steps in Regression Analysis

1 Define the Research Question or Problem


— Objective: Clearly define what you want to achieve with the regression analysis. Are you trying
to predict a future outcome, understand the relationship between variables, or test a hypothesis?
— Example: Suppose you want to understand how study hours (independent variable) affect exam
scores (dependent variable) among students.
2 Select/Identify the Variables
— Dependent Variable (Y ): The outcome or response variable that you are trying to predict or
explain.
— Independent Variable(s) (x): The predictor(s) or explanatory variable(s) that you believe
influence the dependent variable.
— Example: In our case, the exam score is the dependent variable, and the study hours are the
independent variable.



Introduction
Steps in Regression Analysis
3 Collect Data
— Data Collection: Gather the necessary data for the dependent and independent variables. The
data should be accurate, relevant, and representative of your study population.
— Data Sources: Data can be collected from surveys, experiments, historical records, databases,
etc.
— Data Quality: Ensure the data is free from errors, missing values, and outliers that could
distort the analysis.
— Example: Collect data on how many hours each student studied and their corresponding exam
scores.
4 Explore and Prepare the Data
— Data Cleaning: Check for and handle missing values, remove or correct outliers, and ensure
that all variables are in the correct format.
— Data Transformation: If necessary, transform the data to meet the assumptions of regression
(e.g., logarithmic transformation, standardization).
— Descriptive Statistics: Calculate basic statistics like mean, median, standard deviation, and
correlation coefficients to get an initial understanding of the data.
— Example: Plot the data to visualize the relationship between study hours and exam scores.
Check if the relationship appears linear or if there’s a need for data transformation.
Introduction
Steps in Regression Analysis

5 Choose the Type of Regression Model


— Simple Linear Regression: Used when there is one independent variable and the relationship
between the variables is expected to be linear.
— Multiple Linear Regression: Used when there are two or more independent variables.
— Logistic Regression: Used when the dependent variable is categorical (e.g., pass/fail).
— Polynomial Regression: Used when the relationship between variables is non-linear and can be
modeled as a polynomial function.
— Example: If you believe that study hours linearly affect exam scores, you will choose simple
linear regression.
6 Split the Data (if applicable)
— Training and Testing Sets: If your goal is prediction, split your data into a training set (to
build the model) and a testing set (to evaluate the model’s performance).
— Example: Use 70% of your data to build the model (training set) and the remaining 30% to
test it (testing set).
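As a rough illustration of the 70/30 split described in the last step, here is a small Python sketch; the helper function and the study-hours data are made up for illustration:

```python
import numpy as np

def train_test_split(X, y, test_frac=0.3, seed=0):
    """Randomly split paired arrays into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(test_frac * len(y)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# illustrative data: study hours vs. exam scores
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([52, 55, 61, 58, 67, 70, 74, 71, 80, 85], dtype=float)
X_train, X_test, y_train, y_test = train_test_split(X, y)  # 70% train, 30% test
print(len(X_train), len(X_test))  # 7 3
```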



Introduction
Steps in Regression Analysis

7 Estimate the Regression Model


— Fit the Model: Use statistical software or methods (like the method of least squares) to fit the
regression model to your data. This involves estimating the coefficients (e.g., slope and
intercept in linear regression).
8 Evaluate the Model
— Goodness of Fit: Evaluate how well the model fits the data using metrics such as R-squared
(which indicates the proportion of variance in the dependent variable that the model explains).
— Significance Testing: Perform statistical tests (e.g., t-tests) to determine if the regression
coefficients significantly differ from zero, implying that the independent variables have a
meaningful effect on the dependent variable.
— Residual Analysis: Examine the residuals (the differences between observed and predicted
values) to check for patterns that might indicate problems like non-linearity or heteroscedasticity.



Introduction
Steps in Regression Analysis

9 Model Diagnostics and Validation


— Check Assumptions: Ensure that the model meets the key assumptions of regression (e.g.,
linearity, independence of errors, homoscedasticity, normality of errors).
— Multicollinearity: If using multiple regression, check for multicollinearity (high correlation
between independent variables), which can distort the results.
— Model Validation: If you have a testing set, use it to validate the model’s predictive power.
Compare the predicted values with actual values to assess accuracy.
10 Interpret the Results
— Coefficient Interpretation: Understand what each coefficient in the model represents in the
context of your data. For example, a positive coefficient indicates a positive relationship
between the independent and dependent variables.
— Contextualization: Relate the statistical findings back to the original research question. What
do the results tell you about the relationship between the variables?



Introduction
Steps in Regression Analysis

11 Make Predictions
— Use the regression equation to make predictions
about the dependent variable for new or unseen data.
— Calculate confidence intervals for the predictions to
understand the uncertainty around them.
12 Report and Communicate Findings
— Prepare a detailed report or presentation that
includes the methodology, regression model, results,
and interpretations.
— Use graphs (e.g., scatter plots with a regression line)
to represent the relationship and findings visually.
— Tailor the communication of your results to your
audience, whether they are statisticians,
stakeholders, or the general public.



Introduction
Review Exercise

1 Which of the following is the first step in conducting a regression analysis?


A. Collecting data
B. Evaluating the model
C. Defining the research question or problem
D. Choosing the type of regression model
E. Making predictions
2 In regression analysis, the dependent variable is also known as the:
A. Predictor variable
B. Response variable
C. Control variable
D. Explanatory variable
E. Covariate



Introduction
Review Exercise

3 Which of the following best describes an independent variable in regression analysis?


A. The variable you are trying to predict
B. The variable that is constant
C. The variable that influences the dependent variable
D. The variable that is always positive
E. The variable that equals the dependent variable
4 Why is it important to check for and handle missing values in your data before performing regression
analysis?
A. To ensure your model has enough variables
B. To avoid skewing the results and ensure the analysis is accurate
C. To reduce the number of observations
D. To increase the complexity of the model
E. To increase the degrees of freedom



Introduction
Review Exercise

5 Which of the following is NOT typically a part of data preparation in regression analysis?
A. Data cleaning
B. Model validation
C. Data transformation
D. Descriptive statistics calculation
E. Outlier correction
6 Descriptive statistics in regression analysis help to:
A. Predict future outcomes
B. Understand the basic features of the data
C. Test the significance of coefficients
D. Split the data into training and testing sets
E. Determine the goodness of fit



Introduction
Review Exercise

7 When should you use multiple linear regression instead of simple linear regression?
A. When there is one independent variable
B. When the dependent variable is categorical
C. When there are two or more independent variables
D. When the relationship between variables is non-linear
E. When the model requires polynomial terms
8 Which type of regression would be most appropriate if your dependent variable is a binary outcome
(e.g., pass/fail)?
A. Simple linear regression
B. Multiple linear regression
C. Logistic regression
D. Polynomial regression
E. Ridge regression



Introduction
Review Exercise

9 Which type of regression analysis is suitable for modeling non-linear relationships between variables?
A. Simple linear regression
B. Multiple linear regression
C. Logistic regression
D. Polynomial regression
E. Stepwise regression
10 Which of the following is NOT a recommended component of a regression analysis report?
A. Visualization of data and model results
B. Discussion of model assumptions and their validation
C. Detailed methodology description
D. Instructions for future data collection
E. Interpretation of the coefficients and findings



Introduction
Review Exercise

11 What is the primary purpose of regression analysis?


A. To classify data into different categories
B. To identify relationships between variables and predict future outcomes
C. To calculate mean and median values
D. To organize data into tables
12 A company uses regression analysis to understand the impact of advertising spend, pricing, and
seasonality on sales. Which of the following would be a potential independent variable?
A. Total sales revenue
B. Advertising spend
C. Profit margins
D. Customer satisfaction



Introduction
Review Exercise

13 In an agricultural study, a researcher uses regression analysis to predict crop yield based on factors
such as rainfall, soil quality, and fertilizer usage. What is the dependent variable in this scenario?
A. Rainfall
B. Soil quality
C. Fertilizer usage
D. Crop yield
14 In a marketing campaign, a company uses regression analysis to predict customer spending based on
demographic data and past purchase behavior. Which of the following could be a dependent
variable?
A. Age of the customer
B. Customer spending
C. Past purchase behavior
D. Gender of the customer



Chapter Two

Simple Linear Regression


Simple Linear Regression
Outline
1 Introduction
2 Estimation of the regression parameters: Least–square method
3 Desirable properties of the least-squares estimator
4 Inference about the intercept and the slope parameters
✪ Test of the significance of regression parameters
✪ Interval estimation of the regression parameters
5 Prediction
✪ Prediction of new observations
✪ Interval estimation of a predicted new observation
✪ Hypothesis testing about a predicted new observation
6 Covariance and correlation coefficient
✪ Estimation of population covariance and correlation coefficient
✪ Interval estimation of a population correlation coefficient
✪ Hypothesis testing about a population correlation coefficient
Simple Linear Regression
Introduction

• Simple linear regression assumes a linear relationship between the independent variable (X) and
the dependent variable (Y ). This means that as X changes, Y changes in a way that a straight line
can represent.
• The relationship between X and Y is described by the regression equation

Yi = β0 + β1 Xi + ϵi i = 1, 2, . . . , n (1)

where Yi is the value of the response variable in the ith trial, β0 and β1 are parameters, Xi is a
known constant, namely, the value of the predictor variable in the ith trial, ϵi is the random error
term with mean E(ϵi ) = 0 and variance σ 2 (ϵi ) = σ 2 ; ϵi and ϵj are uncorrelated so that their
covariance is zero (i.e., Cov(ϵi , ϵj ) = 0 for all i, j: i ̸= j)



Simple Linear Regression
Introduction
• Regression model (1) is simple, linear in the parameters, and linear in the predictor variable.
• It is “simple” in that there is only one predictor variable, “linear in the parameters” because no
parameter appears as an exponent or is multiplied or divided by another parameter, and “linear in the
predictor variable” because this variable appears only in the first power.
• A model that is linear in the parameters and in the predictor variable is also called a first-order
model

Important Features of Model


1 The response Yi in the ith trial is the sum of two components: (1) the constant term β0 + β1 Xi and
(2) the random term ϵi . Hence, Yi is a random variable.
2 Since E(ϵi ) = 0, E(Yi ) = µ = β0 + β1 Xi

E(Yi ) = E(β0 + β1 Xi + ϵi )
= E(β0 + β1 Xi ) + E(ϵi )
= β0 + β1 Xi



Simple Linear Regression
Introduction

3 The error terms ϵi are assumed to have constant variance σ 2 . It therefore follows that the responses
Yi have the same constant variance:

    σ²(Yi) = V ar(β0 + β1 Xi + ϵi)
           = V ar(β0 + β1 Xi) + V ar(ϵi) = 0 + σ²
           = σ²

Thus, regression model (1) assumes that the probability distributions of Y have the same variance
σ 2 , regardless of the level of the predictor variable X
4 The error terms are assumed to be uncorrelated. Since the error terms ϵi and ϵj are uncorrelated, so
are the responses Yi and Yj



Simple Linear Regression
Introduction

• The parameters β0 and β1 in regression model (1) are called regression coefficients. β1 is the slope of the
regression line. It indicates the change in the mean of the probability distribution of Y per unit increase in X.
• The parameter β0 is the Y intercept of the regression line. When the scope of the model includes X = 0, β0
gives the mean of the probability distribution of Y at X = 0. When the scope of the model does not cover
X = 0, β0 does not have any particular meaning as a separate term in the regression model.
Data structure for simple linear regression: the observed pairs (Y1, X1), (Y2, X2), . . . , (Yi, Xi), . . . , (Yn, Xn).



Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method
• In real-world situations, we do not know the true values of β0 and β1 . We only have a sample of
observed data points (Xi , Yi ), where i = 1, 2, . . . , n. These observed data points are drawn from a
larger population, and the objective is to use this sample to make informed estimates about the true
underlying parameters.

If the model coefficients are unknown, we have to estimate them from the sample data. The sample
estimators of β0 and β1 are β̂0 and β̂1, respectively. Thus, the best fitted regression line is

    Ŷi = β̂0 + β̂1 Xi



Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

The least squares principle for the simple linear regression model is to find the estimates β̂0
and β̂1 such that the sum of the squared distances between the actual responses Yi and the predicted
responses Ŷi = β̂0 + β̂1 Xi reaches the minimum among all possible choices of the regression
coefficients β0 and β1, i.e.,

    (β̂0, β̂1) = arg min_{(β0, β1)} Σ_{i=1}^{n} (Yi − Ŷi)² = arg min_{(β0, β1)} Σ_{i=1}^{n} (Yi − β0 − β1 Xi)²

The motivation behind the least squares method is to choose, as the parameter estimates, the
regression line that is closest to all the data points (Xi, Yi).



Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

Mathematically, the least squares estimates of


the simple linear regression are given by solving
the following system of equations:
    ∂/∂β̂0 Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)² = 0        (2)

    ∂/∂β̂1 Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)² = 0        (3)





Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

    ∂/∂β̂0 Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)² = −2 Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi) = 0
        ⟹ Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi) = 0
        ⟹ Σ_{i=1}^{n} Yi − nβ̂0 − β̂1 Σ_{i=1}^{n} Xi = 0

    ∂/∂β̂1 Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)² = −2 Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)Xi = 0
        ⟹ Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)Xi = 0
        ⟹ Σ_{i=1}^{n} Yi Xi − β̂0 Σ_{i=1}^{n} Xi − β̂1 Σ_{i=1}^{n} Xi² = 0
Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

Then, the two equations are

    Σ_{i=1}^{n} Yi = nβ̂0 + β̂1 Σ_{i=1}^{n} Xi                              (4)
    Σ_{i=1}^{n} Yi Xi = β̂0 Σ_{i=1}^{n} Xi + β̂1 Σ_{i=1}^{n} Xi²            (5)

By solving these equations simultaneously, we find that

    β̂1 = Σ_{i=1}^{n} (Yi − Ȳ)(Xi − X̄) / Σ_{i=1}^{n} (Xi − X̄)²            (6)
    β̂0 = Ȳ − β̂1 X̄                                                        (7)
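A minimal NumPy sketch of formulas (6) and (7); the data arrays are illustrative and not taken from the text:

```python
import numpy as np

def least_squares_fit(X, Y):
    """Return (beta0_hat, beta1_hat) using equations (6) and (7)."""
    Xbar, Ybar = X.mean(), Y.mean()
    beta1 = np.sum((Y - Ybar) * (X - Xbar)) / np.sum((X - Xbar) ** 2)  # eq. (6)
    beta0 = Ybar - beta1 * Xbar                                        # eq. (7)
    return beta0, beta1

# illustrative data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
print(least_squares_fit(X, Y))
```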



Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method
The alternative formula for β̂1:

    Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) = Σ_{i=1}^{n} (Xi Yi − Xi Ȳ − X̄ Yi + X̄ Ȳ)
        = Σ_{i=1}^{n} Xi Yi − Ȳ Σ_{i=1}^{n} Xi − X̄ Σ_{i=1}^{n} Yi + n X̄ Ȳ
        = Σ_{i=1}^{n} Xi Yi − n X̄ Ȳ − n X̄ Ȳ + n X̄ Ȳ = Σ_{i=1}^{n} Xi Yi − n X̄ Ȳ

    Σ_{i=1}^{n} (Xi − X̄)Yi = Σ_{i=1}^{n} (Xi Yi − X̄ Yi) = Σ_{i=1}^{n} Xi Yi − n X̄ Ȳ

Thus,

    β̂1 = Σ_{i=1}^{n} (Xi − X̄)Yi / Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} ki Yi,   where ki = (Xi − X̄) / Σ_{i=1}^{n} (Xi − X̄)²
Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

Attention!
• The random disturbance term, often denoted by ϵ, represents the true error in the relationship
between the dependent variable Y and the independent variable X.

ϵi = Yi − E(Yi )

• It captures the effects of all the factors that influence Y other than X. This includes unobserved
variables, measurement errors, and any inherent randomness in the relationship.
• The residual, often denoted by e, is the difference between the observed value Yi and the value
predicted by the estimated regression model Ŷi

ei = Yi − Ŷi

• The residual is an observable quantity calculated from the data and the fitted regression model.
Unlike the random disturbance term, which is theoretical and unobservable, the residual is what we
use to assess the model’s fit.



Simple Linear Regression
Estimation of the Regression Parameters: Properties of Fitted Regression Line
1 The sum of the residuals is zero: Σ_{i=1}^{n} ei = 0.

    Σ_{i=1}^{n} ei = Σ_{i=1}^{n} (Yi − Ŷi) = Σ_{i=1}^{n} (Yi − (β̂0 + β̂1 Xi))
        = Σ_{i=1}^{n} (Yi − (Ȳ − β̂1 X̄ + β̂1 Xi))
        = Σ_{i=1}^{n} ((Yi − Ȳ) − β̂1 (Xi − X̄))
        = Σ_{i=1}^{n} (Yi − Ȳ) − β̂1 Σ_{i=1}^{n} (Xi − X̄)
        = 0



Simple Linear Regression
Estimation of the Regression Parameters: Properties of Fitted Regression Line

2 The sum of the squared residuals, Σᵢ ei², is a minimum. This was the requirement to be satisfied in
deriving the least squares estimators of the regression parameters, since the sum of squared errors being
minimized equals Σᵢ ei² when the least squares estimators β̂0 and β̂1 are used for estimating β0 and β1.
3 The sum of the observed values Yi equals the sum of the fitted values Ŷi:

    Σ_{i=1}^{n} ei = Σ_{i=1}^{n} (Yi − Ŷi) = 0   ⟹   Σ_{i=1}^{n} Yi = Σ_{i=1}^{n} Ŷi



Simple Linear Regression
Estimation of the Regression Parameters: Properties of Fitted Regression Line
4 The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level
of the predictor variable in the ith trial: Σ_{i=1}^{n} Xi ei = 0.

Proof. We know that β̂0 = Ȳ − β̂1 X̄,  ei = Yi − Ŷi = (Yi − Ȳ) − β̂1 (Xi − X̄),  and

    β̂1 = Σ_{i=1}^{n} (Xi − X̄)Yi / Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} (Yi − Ȳ)Xi / Σ_{i=1}^{n} (Xi − X̄)Xi

Then

    Σ_{i=1}^{n} Xi ei = Σ_{i=1}^{n} Xi (Yi − Ŷi) = Σ_{i=1}^{n} Xi ((Yi − Ȳ) − β̂1 (Xi − X̄))
        = Σ_{i=1}^{n} Xi (Yi − Ȳ) − β̂1 Σ_{i=1}^{n} Xi (Xi − X̄)
        = Σ_{i=1}^{n} Xi (Yi − Ȳ) − Σ_{i=1}^{n} Xi (Yi − Ȳ) = 0   (using the second expression for β̂1 above)
Simple Linear Regression
Estimation of the Regression Parameters: Properties of Fitted Regression Line
5 A consequence of the properties Σᵢ ei = 0 and Σᵢ Xi ei = 0 is that the sum of the weighted residuals is
zero when the residual in the ith trial is weighted by the fitted value of the response variable for the
ith trial, i.e., Σᵢ Ŷi ei = 0.
Proof.

    Σ_{i=1}^{n} Ŷi ei = Σ_{i=1}^{n} (β̂0 + β̂1 Xi) ei
        = β̂0 Σ_{i=1}^{n} ei + β̂1 Σ_{i=1}^{n} Xi ei
        = β̂0 × 0 + β̂1 × 0
        = 0



Simple Linear Regression
Estimation of the Regression Parameters: Properties of Fitted Regression Line
6 The regression line always goes through the point (X̄, Ȳ ). When Xi = X̄, then the fitted regression
line Ŷi = β̂0 + β̂1 Xi will be
Ŷi = β̂0 + β̂1 Xi = β̂0 + β̂1 X̄
= Ȳ − β̂1 X̄ + β̂1 X̄
= Ȳ − β̂1 (X̄ − X̄)
= Ȳ

Example: A company wants to understand the relationship between its advertising budget (in
thousands of dollars) and the resulting sales (in thousands of units). They have collected data
over six months. The data is as follows:
Month 1 2 3 4 5 6
Advertising Budget (X) 10 15 20 25 30 35
Sales (Y ) 25 30 45 50 55 60
Using the data provided, fit a simple linear regression line to predict sales based on the
advertising budget.
Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method
Solution: The means of X and Y are

    X̄ = (1/n) Σ_{i=1}^{n} Xi = (10 + 15 + 20 + 25 + 30 + 35)/6 = 135/6 = 22.5
    Ȳ = (1/n) Σ_{i=1}^{n} Yi = (25 + 30 + 45 + 50 + 55 + 60)/6 = 265/6 ≈ 44.17

The slope of the regression line is calculated using the following formula:

    β̂1 = Σ_{i=1}^{6} (Xi − X̄)Yi / Σ_{i=1}^{6} (Xi − X̄)Xi
       = [(10 − 22.5)25 + (15 − 22.5)30 + · · · + (35 − 22.5)60] / [(10 − 22.5)10 + (15 − 22.5)15 + · · · + (35 − 22.5)35]
       = 637.5 / 437.5 ≈ 1.46



Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

The intercept β̂0 is calculated using the following formula:

β̂0 = Ȳ − β̂1 X̄ = 44.17 − 1.46 × 22.5 = 11.32


The regression line equation is:
Ŷi = 11.32 + 1.46Xi
This equation indicates that for every additional thousand dollars spent on advertising, the sales
are expected to increase by approximately 1.46 thousand units.
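A short NumPy check of this example (illustrative only; note that carrying full precision gives β̂0 ≈ 11.38, while the 11.32 above comes from rounding β̂1 to 1.46 before computing the intercept):

```python
import numpy as np

X = np.array([10, 15, 20, 25, 30, 35], dtype=float)   # advertising budget
Y = np.array([25, 30, 45, 50, 55, 60], dtype=float)   # sales

beta1 = np.sum((X - X.mean()) * Y) / np.sum((X - X.mean()) * X)  # 637.5 / 437.5
beta0 = Y.mean() - beta1 * X.mean()
print(round(beta1, 3), round(beta0, 2))  # 1.457, 11.38
```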

Exercise: Based on the above example, show that the properties of the fitted line hold, i.e.,
Σᵢ ei = Σᵢ Xi ei = Σᵢ Ŷi ei = 0, and that the regression line passes through the point (X̄, Ȳ).





Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

Important Results
1   Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) = Σ_{i=1}^{n} (Xi − X̄)Yi = Σ_{i=1}^{n} (Yi − Ȳ)Xi = Σ_{i=1}^{n} Xi Yi − n X̄ Ȳ

2   If the covariance of X and Y is defined as

        Cov(X, Y) = [1/(n − 1)] Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ)

    then
        Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) = (n − 1) Cov(X, Y)



Simple Linear Regression
Estimation of the Regression Parameters: Least–Square Method

4   Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} (Xi − X̄)Xi = Σ_{i=1}^{n} Xi² − n X̄²

5   If the variance of X is defined as

        V ar(X) = [1/(n − 1)] Σ_{i=1}^{n} (Xi − X̄)²

    then Σ_{i=1}^{n} (Xi − X̄)² = (n − 1) V ar(X)

6   As a result,

        β̂1 = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)² = (n − 1) Cov(X, Y) / [(n − 1) V ar(X)] = Cov(X, Y) / V ar(X)

7   Σ_{i=1}^{n} ki² = 1 / Σ_{i=1}^{n} (Xi − X̄)²   and   Σ_{i=1}^{n} Xi ki = 1



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator
Gauss-Markov Theorem: Under certain conditions, the Ordinary Least Squares (OLS)
estimator for the regression coefficients has the Best Linear Unbiased Estimator (BLUE)
properties. In other words, it is:
1 Linear: A linear function of the dependent variable Y
2 Unbiased: E(β̂) = β, meaning on average, the OLS estimator gives the true value of β
3 Best: It has the minimum variance among all linear unbiased estimators.
Assumptions of the Gauss-Markov Theorem
• Linearity of the Model: The relationship between the dependent variable Y and the independent
variables X is linear in the coefficients β.
• Random Sampling: The observations in the dataset are independently and identically distributed
(i.i.d.), ensuring no patterns that might bias the estimates.
• The error term (ϵ) must have zero mean and constant variance.

E(ϵ|X) = 0, V ar(ϵ|X) = σ 2



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator
Theorem: The least squares estimator β̂1 is an unbiased estimator of β1.
Proof.

    E(β̂1) = E[ Σ_{i=1}^{n} (Xi − X̄)Yi / Σ_{i=1}^{n} (Xi − X̄)² ] = [1 / Σ_{i=1}^{n} (Xi − X̄)²] Σ_{i=1}^{n} (Xi − X̄) E(Yi)
          = Σ_{i=1}^{n} (Xi − X̄)(β0 + β1 Xi) / Σ_{i=1}^{n} (Xi − X̄)²
          = [β0 Σ_{i=1}^{n} (Xi − X̄) + β1 Σ_{i=1}^{n} (Xi − X̄)Xi] / Σ_{i=1}^{n} (Xi − X̄)²
          = [β0 × 0 + β1 Σ_{i=1}^{n} (Xi − X̄)²] / Σ_{i=1}^{n} (Xi − X̄)²     (since Σᵢ (Xi − X̄) = 0 and Σᵢ (Xi − X̄)Xi = Σᵢ (Xi − X̄)²)
          = β1
Simple Linear Regression
Desirable Properties of the Least-Squares Estimator
Theorem: The least squares estimator β̂0 is an unbiased estimator of β0.
Proof.

    E(β̂0) = E(Ȳ − β̂1 X̄) = E(Ȳ) − X̄ E(β̂1) = E[(1/n) Σ_{i=1}^{n} Yi] − X̄ β1
          = (1/n) Σ_{i=1}^{n} E(Yi) − β1 X̄
          = (1/n) Σ_{i=1}^{n} (β0 + β1 Xi) − β1 X̄
          = β0 + β1 X̄ − β1 X̄
          = β0



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator
Theorem: The variance of the OLS estimator β̂1 is

    V ar(β̂1) = σ² / Σ_{i=1}^{n} (Xi − X̄)²

Proof.

    V ar(β̂1) = V ar(Σ_{i=1}^{n} ki Yi) = Σ_{i=1}^{n} ki² V ar(Yi)     (the Yi are uncorrelated)
             = σ² Σ_{i=1}^{n} ki²,   where Σ_{i=1}^{n} ki² = Σᵢ [(Xi − X̄) / Σᵢ (Xi − X̄)²]² = 1 / Σᵢ (Xi − X̄)²
             = σ² / Σ_{i=1}^{n} (Xi − X̄)²



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator

Theorem: The least squares estimator β̂1 and Ȳ are uncorrelated.
Proof.

    Cov(β̂1, Ȳ) = Cov( (1/n) Σ_{i=1}^{n} Yi, Σ_{i=1}^{n} ki Yi )
               = (1/n) Σ_{i=1}^{n} ki V ar(Yi)     (the Yi are uncorrelated)
               = (σ²/n) Σ_{i=1}^{n} ki
               = 0     (since Σ_{i=1}^{n} ki = 0)



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator

Theorem: The variance of the OLS estimator β̂0 is

    V ar(β̂0) = σ² [1/n + X̄² / Σ_{i=1}^{n} (Xi − X̄)²]

Proof.

    V ar(β̂0) = V ar(Ȳ − β̂1 X̄) = V ar(Ȳ) + X̄² V ar(β̂1) − 2X̄ Cov(Ȳ, β̂1)
             = σ²/n + σ² X̄² / Σ_{i=1}^{n} (Xi − X̄)² − 0     (since Cov(Ȳ, β̂1) = 0)
             = σ² [1/n + X̄² / Σ_{i=1}^{n} (Xi − X̄)²]



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator

• The population variance σ² is not known. As a result, we have to estimate it from the data that is
being used to estimate the regression coefficients.
• The unbiased estimator of σ² is

    σ̂² = [1/(n − 2)] Σ_{i=1}^{n} (Yi − Ŷi)² = SSE / (n − 2) = MSE

  where σ̂² is called the error mean square or residual mean square.
• The denominator n − 2 is called the degrees of freedom. Two degrees of freedom are lost because
β0 and β1 have to be estimated in obtaining the estimated means Ŷi.
• Thus, the estimated variances of β̂0 and β̂1 are

    V̂ar(β̂0) = MSE [1/n + X̄² / Σ_{i=1}^{n} (Xi − X̄)²],   and   V̂ar(β̂1) = MSE [1 / Σ_{i=1}^{n} (Xi − X̄)²]



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator
Example: A researcher conducts an experiment to investigate the effect of temperature on the
rate of a chemical reaction. The table below shows the temperature (in degrees Celsius) and
the time it takes for the reaction to complete (in seconds).
Temperature (X) 15 20 25 30 35
Reaction Time (Y) 52.3 47.8 43.2 39.6 36.1
Estimate the simple linear regression model parameters and also compute their variances
Solution: The summary statistics are:

    X̄ = 25,   Ȳ = 43.8,   Σ_{i=1}^{5} Xi² = 3375,   Σ_{i=1}^{5} Xi Yi = 5272

The estimator of β1 is

    β̂1 = (Σᵢ Xi Yi − n X̄ Ȳ) / (Σᵢ Xi² − n X̄²) = (5272 − 5(25)(43.8)) / (3375 − 5(25)²) = −0.812



Simple Linear Regression
Desirable Properties of the Least-Squares Estimator
The estimator of β0 is

    β̂0 = Ȳ − β̂1 X̄ = 43.8 + 0.812(25) = 64.1

The fitted regression line is Ŷi = 64.1 − 0.812 Xi. Thus, Ŷi = {51.9, 47.9, 43.8, 39.7, 35.7}.
The mean square error is

    MSE = σ̂² = [1/(n − 2)] Σ_{i=1}^{5} (Yi − Ŷi)² = (1/3)[(52.3 − 51.9)² + · · · + (36.1 − 35.7)²] = 0.235

The estimated variances of β̂1 and β̂0 are:

    V̂ar(β̂1) = MSE / (Σᵢ Xi² − n X̄²) = 0.235 / (3375 − 5(25)²) = 0.000939
    V̂ar(β̂0) = MSE [1/n + X̄² / (Σᵢ Xi² − n X̄²)] = 0.235 [1/5 + 25² / (3375 − 5(25)²)] = 0.634
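A small NumPy sketch that reproduces these estimates and variances (illustrative, not part of the original slides):

```python
import numpy as np

X = np.array([15, 20, 25, 30, 35], dtype=float)             # temperature
Y = np.array([52.3, 47.8, 43.2, 39.6, 36.1], dtype=float)   # reaction time

Sxx = np.sum((X - X.mean()) ** 2)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
beta0 = Y.mean() - beta1 * X.mean()
resid = Y - (beta0 + beta1 * X)
mse = np.sum(resid ** 2) / (len(Y) - 2)                     # unbiased estimate of sigma^2
var_b1 = mse / Sxx
var_b0 = mse * (1 / len(Y) + X.mean() ** 2 / Sxx)
print(beta1, beta0, mse, var_b1, var_b0)  # ≈ -0.812, 64.1, 0.235, 0.00094, 0.634
```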
Simple Linear Regression
Inference about the Intercept and the Slope Parameters

• The simple linear regression model is

    Yi = β0 + β1 Xi + ϵi,   i = 1, 2, . . . , n

  where ϵi is assumed to follow a normal probability distribution with mean zero and constant
  variance σ², i.e., ϵi ∼ N(0, σ²).
• The normal error term greatly simplifies the theory of regression analysis. Since the error is
  assumed to follow a normal distribution, the outcome variable Yi is also expected to follow a
  normal probability distribution, by the property of the normal distribution stated in the review
  theorem below, i.e.,

    Yi ∼ N(β0 + β1 Xi, σ²)

Review—Theorem: If X1, X2, . . . , Xn are independent random variables that are normally
distributed, with Xi ∼ N(µi, σi²), then any linear combination of these variables,

    Y = a1 X1 + a2 X2 + · · · + an Xn

is also normally distributed. Specifically, Y ∼ N( Σ_{i=1}^{n} ai µi, Σ_{i=1}^{n} ai² σi² ).



Simple Linear Regression
Inference about the Intercept and the Slope Parameters

Central Limit Theorem in Regression


Under the assumptions of the regression model (linear relationship, independence, constant variance,
and normally distributed errors or large sample size), the CLT tells us that as the sample size n grows,
the sampling distributions of β̂0 and β̂1 become approximately normal. Specifically,
 
β̂1 ∼ N β1 , Var(β̂1 )
 
β̂0 ∼ N β0 , Var(β̂0 )

where Var(β̂1 ) and Var(β̂0 ) depend on the variance of the errors σ 2 and the sample data.
The CLT enables us to use normal approximation methods for inference on regression coefficients,
allowing us to:
• Construct confidence intervals for β0 and β1 .
• Perform hypothesis tests to assess if coefficients are significantly different from zero.



Simple Linear Regression
Inference about the Intercept and the Slope Parameters
• In simple linear regression, we are often interested in testing whether the predictor X has a
statistically significant effect on the outcome Y . This is done by examining the slope coefficient β1 in
the model. To determine if X has a statistically significant effect on Y , we set up the following
hypotheses:
— Null Hypothesis:
H0 : β 1 = 0
Indicates no linear relationship between X and Y .
— Alternative Hypothesis:
H1 : β1 ̸= 0
Indicates a linear relationship between X and Y .
• Test Statistic: Under the null hypothesis, the estimated slope β̂1 follows an approximately normal
distribution (for large samples or if the errors are normally distributed). The test statistic for β1 is given by:

    t = β̂1 / s.e(β̂1) ∼ t(n − 2)

  where s.e(β̂1) is the standard error of β̂1. It is the positive square root of the variance of β̂1.
Simple Linear Regression
Inference about the Intercept and the Slope Parameters

• Decision Rule:
1 Significance Level: Set a significance level α (commonly 0.05).
2 Critical Value: Use the t-distribution with n − 2 degrees of freedom to find the critical value
tα/2 (n − 2).
3 Compare Test Statistic: If |t| > tα/2 (n − 2), reject the null hypothesis H0 ; otherwise, do not
reject H0 .
• Conclusion and Interpretation
— Reject H0 : Conclude that there is a statistically significant linear relationship between X and Y .
— Do Not Reject H0 : Conclude that there is insufficient evidence to claim a linear relationship
between X and Y .
• Hypothesis testing for β1 is essential for determining the relevance of predictors in regression models.
It helps establish whether observed relationships are likely to be due to chance or represent real
associations.



Simple Linear Regression
Inference about the Intercept and the Slope Parameters

• In many cases, we test whether β1 is equal to zero (no effect), but we may also be interested in
testing if β1 equals a specific non-zero value, such as β10 .
• To determine if the effect of X on Y differs significantly from a specified value β10 , we set up the
following hypotheses:
H0 : β1 = β10
H1 : β1 ̸= β10

• To test the hypothesis, we calculate a t-statistic that compares the observed estimate β̂1 to β10:

    t = (β̂1 − β10) / s.e(β̂1) ∼ t(n − 2)

  where β10 is the hypothesized slope value in H0.


• Note that if s.e(β̂1 ) is estimated based on the population variance σ 2 instead of σ̂ 2 = M SE, then
the test statistic t ∼ N (0, 1)



Simple Linear Regression
Inference about the Intercept and the Slope Parameters

• To determine whether β0 significantly differs from a hypothesized value β00 , we set up the following
hypotheses:
H0 : β0 = β00
H1 : β0 ̸= β00
The null hypothesis specifies the expected value of Y when X = 0. In many applications, we
test whether β0 is equal to zero, meaning there is no baseline effect when X = 0.
• Under H0 : β0 = β00, we use a t-statistic to test the intercept:

    t = (β̂0 − β00) / s.e(β̂0) ∼ t(n − 2)

• Decision rule: If |t| > tn−2,α/2 , reject H0 ; otherwise, do not reject H0 .



Simple Linear Regression
Inference about the Intercept and the Slope Parameters
Example: Consider the data about the relationship between temperature (in degrees Celsius)
and the reaction time of a chemical reaction (in seconds). Test whether the temperature has a
significant impact on the reaction time of a chemical reaction.

Solution: The standard error of β̂1 is the positive square root of V̂ar(β̂1):

    s.e(β̂1) = √V̂ar(β̂1) = √0.000939 = 0.0306

The hypotheses to test whether the temperature has a significant impact on the reaction time
of a chemical reaction are:
    H0 : β1 = 0
    H1 : β1 ≠ 0
The test statistic is

    t = β̂1 / s.e(β̂1) = −0.812 / 0.0306 = −26.5

The tabulated value is t0.05/2(5 − 2) = t0.025(3) = 3.18.
Simple Linear Regression
Inference about the Intercept and the Slope Parameters

Since |t| = 26.5 is greater than the tabulated value (3.18), we have enough evidence to reject the
null hypothesis and conclude that temperature significantly affects the reaction time at the 5%
level of significance.
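A minimal sketch of this test in Python, using scipy.stats for the critical value; the numbers are taken from the worked example above:

```python
from scipy import stats

beta1_hat, se_beta1, n = -0.812, 0.0306, 5
t_stat = beta1_hat / se_beta1                    # ≈ -26.5
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # ≈ 3.18
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, t_crit, p_value)                   # reject H0 since |t| > t_crit
```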



Simple Linear Regression
Inference about the Intercept and the Slope Parameters

• A 100(1 − α)% confidence interval for β1 is

β̂1 − tα/2 (n − 2)s.e(β̂1 ) ≤ β1 ≤ β̂1 + tα/2 (n − 2)s.e(β̂1 )

• A 100(1 − α)% confidence interval for β0 is

β̂0 − tα/2 (n − 2)s.e(β̂0 ) ≤ β0 ≤ β̂0 + tα/2 (n − 2)s.e(β̂0 )

Relationship between H1 : β1 ̸= 0 and Confidence Interval


If the confidence interval for β1 does not contain zero, it provides evidence against H0 : β1 = 0. In this
case, you would reject the null hypothesis, suggesting a statistically significant relationship between X
and Y . Conversely, if the confidence interval contains zero, it indicates that we do not have sufficient
evidence to conclude that β1 is significantly different from zero, which aligns with failing to reject H0 .
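A minimal sketch of how these confidence intervals could be computed, reusing the estimates from the temperature example and assuming scipy is available (illustrative only):

```python
from scipy import stats

beta1_hat, se_beta1 = -0.812, 0.0306
beta0_hat, se_beta0 = 64.1, 0.634 ** 0.5         # s.e. is the square root of the estimated variance
n, alpha = 5, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)
print(ci_beta1, ci_beta0)  # zero lies outside the beta1 interval -> significant slope
```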



Simple Linear Regression
Prediction—Interval Estimation and Hypothesis Testing
• A common objective in regression analysis is to estimate the mean for one or more probability
distributions of Y
• Let Xh denote the level of X for which we wish to estimate the mean response. Xh may be a value
that occurred in the sample, or it may be some other value in the range of the predictor variable
within the scope of the model
• The mean response when X = Xh is denoted by E(Yh ). The point estimator of E(Yh ) is

Ŷh = β̂0 + β̂1 Xh


• The sampling distribution of Ŷh refers to the different values of Ŷh that would be obtained if
repeated samples were selected, each holding the level of the predictor variable X constant, and
calculating Ŷh for each sample
Theorem: For a simple linear regression model with a normally distributed error term, the
sampling distribution of Ŷh is normal, with mean and variance

    E(Ŷh) = E(Yh) = β0 + β1 Xh,   and   V ar(Ŷh) = σ² [1/n + (Xh − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²]
Simple Linear Regression
Prediction—Interval Estimation and Hypothesis Testing
Proof.
The normality of the sampling distribution of Ŷh follows directly from the fact that Ŷh is a
linear combination of the observations Yi. To prove that Ŷh is an unbiased estimator of E(Yh), we
proceed as follows:

    E(Ŷh) = E(β̂0 + β̂1 Xh) = E(β̂0) + Xh E(β̂1) = β0 + β1 Xh

The variance of Ŷh is

    V ar(Ŷh) = V ar(β̂0 + β̂1 Xh) = V ar(Ȳ − β̂1 X̄ + β̂1 Xh)
             = V ar(Ȳ + β̂1 (Xh − X̄))
             = V ar(Ȳ) + (Xh − X̄)² V ar(β̂1) + 2(Xh − X̄) Cov(Ȳ, β̂1),   where Cov(Ȳ, β̂1) = 0
             = σ²/n + σ² (Xh − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²
             = σ² [1/n + (Xh − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²]
Simple Linear Regression
Prediction—Interval Estimation and Hypothesis Testing

The unbiased estimator of σ² is MSE, so

    V̂ar(Ŷh) = MSE [1/n + (Xh − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²]

Observations
— The variability of the sampling distribution of Ŷh is affected by how far Xh is from X̄. The
further Xh is from X̄, the greater the quantity (Xh − X̄)² and the larger the variance of Ŷh.
— When Xh = 0, the variance of Ŷh reduces to V ar(β̂0).
A 100(1 − α)% confidence interval for E(Yh) is

    Ŷh − tα/2(n − 2) s.e(Ŷh) ≤ E(Yh) ≤ Ŷh + tα/2(n − 2) s.e(Ŷh)



Simple Linear Regression
Prediction—Interval Estimation and Hypothesis Testing
The hypotheses for the mean response are

    H0 : E(Yh) = Yh0
    H1 : E(Yh) ≠ Yh0

The test statistic is

    t = (Ŷh − Yh0) / s.e(Ŷh) ∼ t(n − 2)

Reject the null hypothesis if |t| is greater than tα/2(n − 2).
Example: Consider the temperature and reaction time data. Construct a 95% confidence
interval for E(Yh) when the temperature Xh = 30.
Solution: The variance of Ŷh is

    V̂ar(Ŷh) = MSE [1/n + (Xh − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²] = 0.235 [1/5 + (30 − 25)² / (3375 − 5(25)²)] = 0.0704


Simple Linear Regression
Prediction—Interval Estimation and Hypothesis Testing

The standard error is then

    s.e(Ŷh) = √0.0704 = 0.265

The tabulated value is t0.025(3) = 3.18 and Ŷh = 64.1 − 0.812(30) = 39.7. Thus, the 95% confidence
interval for E(Yh) is

    Ŷh − t0.025(3) s.e(Ŷh) ≤ E(Yh) ≤ Ŷh + t0.025(3) s.e(Ŷh)
    39.7 − 3.18(0.265) ≤ E(Yh) ≤ 39.7 + 3.18(0.265)
    38.9 ≤ E(Yh) ≤ 40.5
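A short sketch that reproduces this confidence interval for the mean response (illustrative; small differences from the hand computation are due to rounding):

```python
import numpy as np
from scipy import stats

X = np.array([15, 20, 25, 30, 35], dtype=float)
mse, beta0, beta1, Xh, n = 0.235, 64.1, -0.812, 30, 5

Yh_hat = beta0 + beta1 * Xh                                   # ≈ 39.7
se_Yh = np.sqrt(mse * (1 / n + (Xh - X.mean()) ** 2 / np.sum((X - X.mean()) ** 2)))
t_crit = stats.t.ppf(0.975, df=n - 2)
print(Yh_hat - t_crit * se_Yh, Yh_hat + t_crit * se_Yh)       # ≈ (38.9, 40.6); hand rounding gives (38.9, 40.5)
```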
Exercise: What would you conclude if asked to test the following hypothesis based on the
temperature vs. reaction time data?

H0 : E(Yh ) = 40
H1 : E(Yh ) ̸= 40



Simple Linear Regression
Prediction—Interval Estimation and Hypothesis Testing

• We now consider the prediction of a new observation Y corresponding to a given level X of the
independent variable.
• When we estimate Y based on values of X that were not used in fitting the model parameters and
fall outside the observed range of X, this process is known as extrapolation.
• Let X0 be the new value for the independent variable, and the corresponding predicted value is Y0 .
However, Y0 remains unknown and independent of Ŷ0 = β̂0 + β̂1 X0 . So, the confidence interval for
E(Yh ) is inappropriate as it is an interval estimate on the mean of Y , not a probability statement
about future observations from a normal probability distribution
• Now, let’s develop the prediction interval for the future observation Y0. Let ϕ = Y0 − Ŷ0 be a
normal random variable with mean zero and variance

    V ar(ϕ) = V ar(Y0 − Ŷ0) = σ² [1 + 1/n + (X0 − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²]



Simple Linear Regression
Prediction—Interval Estimation and Hypothesis Testing
Proof.

    V ar(ϕ) = V ar(Y0 − Ŷ0) = V ar(Y0) + V ar(Ŷ0) − 2 Cov(Y0, Ŷ0),   where Cov(Y0, Ŷ0) = 0
            = V ar(Y0) + V ar(β̂0 + β̂1 X0)
            = V ar(Y0) + V ar(Ȳ + β̂1 (X0 − X̄))
            = σ² + σ² [1/n + (X0 − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²]
            = σ² [1 + 1/n + (X0 − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²]

The estimator of V ar(Y0 − Ŷ0) is

    V̂ar(Ŷ0) = MSE [1 + 1/n + (X0 − X̄)² / Σ_{i=1}^{n} (Xi − X̄)²]

A 100(1 − α)% prediction interval for Y0 is

Ŷ0 − tα/2 (n − 2)s.e(Ŷ0 ) ≤ Y0 ≤ Ŷ0 + tα/2 (n − 2)s.e(Ŷ0 )

Example: Consider the temperature and reaction time data. Construct a 95% prediction
interval for Y0 when the temperature X0 = 40.
Solution: The new value of the independent variable, X0 = 40, is outside the range of X. Thus, we are predicting a new reaction time (Y0):
\[
\hat{Y}_0 = 64.1 - 0.812(40) = 31.6
\]

The variance estimator of Ŷ0 is
\[
\widehat{Var}(\hat{Y}_0) = MSE\left[1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right]
= 0.235\left[1 + \frac{1}{5} + \frac{(40 - 25)^2}{3375 - 5(25)^2}\right] = 0.494
\]


The standard error of Ŷ0 is √0.494 = 0.703. Then the 95% prediction interval is
\[
\hat{Y}_0 - t_{\alpha/2}(n-2)\,s.e(\hat{Y}_0) \le Y_0 \le \hat{Y}_0 + t_{\alpha/2}(n-2)\,s.e(\hat{Y}_0)
\]
\[
31.6 - 3.18(0.703) \le Y_0 \le 31.6 + 3.18(0.703)
\]
\[
29.4 \le Y_0 \le 33.8
\]
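The same check can be scripted for the prediction interval; the only change from the previous snippet is the extra "1 +" term in the variance (a sketch under the same assumptions as before).

```python
import numpy as np
from scipy import stats

# Same summary quantities as in the confidence-interval snippet
n, x_bar, mse, sxx = 5, 25.0, 0.235, 250
b0, b1, x0 = 64.1, -0.812, 40.0

y0_hat = b0 + b1 * x0                                       # predicted new observation
se_pred = np.sqrt(mse * (1 + 1/n + (x0 - x_bar)**2 / sxx))  # note the extra "1 +" term
t_crit = stats.t.ppf(0.975, df=n - 2)

print(f"95% PI for Y_0: ({y0_hat - t_crit*se_pred:.2f}, {y0_hat + t_crit*se_pred:.2f})")
```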

Caution! Extrapolation Outside the Data Range


Predictions made for values of the independent variable (X) that lie outside the range of the observed
data (extrapolation) can be highly unreliable. The regression model is built on the assumption that the
relationships observed within the data range will hold, but this may not be true beyond that range,
leading to inaccurate predictions.



Simple Linear Regression
Covariance and Correlation Coefficient

• Covariance measures the degree to which two variables change together. Specifically, it quantifies
whether an increase in one variable generally corresponds to an increase (or decrease) in the other
variable.
\[
Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})
\]

• Covariance is in units obtained by multiplying the units of X and Y , which makes it scale-dependent.
This lack of standardization can make it hard to compare covariances between different variable pairs.
• Limitations
— Covariance alone doesn’t provide a measure of the strength of the relationship between variables.
— It is affected by the scale of the variables, so it’s not useful for comparing relationships across
datasets with different units or ranges.


• The correlation coefficient (often denoted r for sample data or ρ for population data) standardizes
the covariance by dividing by the product of the standard deviations of the variables. This gives a
measure of both the direction and strength of a linear relationship, with values standardized between
−1 and 1.
\[
r = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}}
  = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}
\]

• Interpretation
— r = 1: Perfect positive correlation, meaning X and Y move together in a perfectly linear way.
— r = −1: Perfect negative correlation, meaning X and Y move in exactly opposite directions in a perfectly linear way.
— r = 0: No linear correlation between X and Y . (Again, this doesn’t rule out a non-linear
relationship.)
• Correlation is unitless because it standardizes the covariance by dividing by the standard deviations.
This makes it possible to compare correlations across different datasets or variable pairs.
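A small sketch (Python/NumPy, with made-up illustrative data) shows the two measures side by side: rescaling X changes the covariance but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y, ddof=1)[0, 1]      # sample covariance (divides by n - 1)
r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
print(cov_xy, r)

print(np.cov(100 * x, y, ddof=1)[0, 1])  # covariance scales with the units of X
print(np.corrcoef(100 * x, y)[0, 1])     # correlation is unchanged
```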


Limitations
• Correlation does not imply causation; it only indicates an association.
• Outliers can significantly affect the correlation coefficient, especially with Pearson’s correlation.
• It’s sensitive to linear relationships but may miss strong non-linear associations between variables.


• To test the hypothesis that there is no linear relationship between two variables (denoted as X and
Y ) in a population, we can conduct a hypothesis test on the population correlation coefficient, ρ

H0 : ρ = 0
H1 : ρ ̸= 0

The test statistic is
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2)
\]
Reject the null hypothesis, which says there is no linear relationship between X and Y in the
population if |t| > tα/2 (n − 2). Note that if H0 is not rejected, it suggests that any observed
correlation may be due to random sampling variability, and there is no strong evidence of a linear
relationship in the population
• To build a confidence interval for the population correlation ρ, we commonly use the Fisher
z-transformation because it helps normalize the sampling distribution of r, making it closer to a
normal distribution.

The Fisher z-transformation is
\[
z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \operatorname{arctanh}(r)
\]
This transformed value z approximately follows a normal distribution with mean
\[
E(z) = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)
\]
and standard error
\[
s.e(z) = \frac{1}{\sqrt{n-3}}
\]
A 100(1 − α)% confidence interval for E(z) is
\[
z \pm Z_{\alpha/2}\, s.e(z)
\]


• Convert the confidence interval limits for z back to the correlation coefficient scale using the inverse Fisher transformation, i.e., tanh(z ± Zα/2 s.e(z))
• Thus, a 100(1 − α)% confidence interval for ρ is
\[
\tanh\left(\operatorname{arctanh}(r) - \frac{Z_{\alpha/2}}{\sqrt{n-3}}\right) \le \rho \le \tanh\left(\operatorname{arctanh}(r) + \frac{Z_{\alpha/2}}{\sqrt{n-3}}\right)
\]
• If we want to test H0 : ρ = ρ0, then the appropriate test statistic will be
\[
Z_0 = \big(\operatorname{arctanh}(r) - \operatorname{arctanh}(\rho_0)\big)\sqrt{n-3}
\]

Reject the null hypothesis if |Z0 | > Zα/2
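These calculations are easy to script. The sketch below (Python with NumPy and SciPy) builds the Fisher z interval and the test of H0 : ρ = ρ0; the values r = 0.713 and n = 100 anticipate the BMI/SBP example later in this chapter, while ρ0 = 0.5 is just an illustrative choice.

```python
import numpy as np
from scipy import stats

r, n, rho0, alpha = 0.713, 100, 0.5, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

# Confidence interval: transform, build a normal interval, transform back
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

# Test of H0: rho = rho0
z0 = (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(n - 3)

print(f"95% CI for rho: ({lo:.3f}, {hi:.3f})")
print(f"Z0 = {z0:.2f}, reject H0: {abs(z0) > z_crit}")
```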

For a simple linear regression model, the slope β1 is estimated as
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}
\]
This formula can also be rewritten in terms of the sample standard deviations of X and Y and the sample correlation coefficient r:
\[
\hat{\beta}_1 = r\,\frac{\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = r\,\frac{S_y}{S_x}
\]
where Sx and Sy are the standard deviations of X and Y, respectively


Note!
The sign of β̂1 is determined by r. If r > 0, then β̂1 > 0 and if r < 0, then β̂1 < 0. This sign indicates
the direction of the relationship between X and Y : a positive r (and hence a positive β̂1 ) suggests that
as X increases, Y tends to increase as well, and vice versa for a negative r.
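As a quick numerical check of this identity, the following sketch (Python/NumPy, made-up data) computes the slope both ways.

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b1_from_r = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)

print(b1_ols, b1_from_r)   # the two expressions agree
```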
Simple Linear Regression
Analysis of Variance
• Our aim in regression analysis is to explain much of the variation in Y by a set of explanatory variables. The total variation is denoted by SST:
\[
SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2
\]
Decomposing SST into variation explained by the regression model and variation left unexplained gives rise to the analysis of variance (ANOVA):
\[
SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}\big((Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\big)^2
\]
\[
= \sum_{i=1}^{n}\big((Y_i - \hat{Y}_i)^2 + 2(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) + (\hat{Y}_i - \bar{Y})^2\big)
\]
\[
= \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + 2\underbrace{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})}_{=0} + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2
\]


\[
SST = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2
\]
From our experience with OLS estimation, the first term, Σᵢ(Yi − Ŷi)², is called the error (or residual) sum of squares (SSE). This is the variation in Y that cannot be explained by the regression model. The second term, Σᵢ(Ŷi − Ȳ)², is the variation in Y that is explained by the regression line and is denoted SSR (regression sum of squares). Hence

SST = SSE + SSR

The proportion of variation that is explained by the regression model is defined as
\[
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
\]
where R² is called the coefficient of multiple determination


• R2 ranges from 0 to 1, where higher R2 means better fit. If R2 is close to 1, the model explains
most of the variability in Y . If R2 is close to 0, the model explains little of the variability in Y
• For a simple linear regression, R2 = r2

Goodness of Fit test


• The hypothesis test for the overall model significance in simple linear regression evaluates whether
the predictor variable X has a statistically significant relationship with the response variable Y

H0 : β1 = 0
H1 : β1 ≠ 0

• The test statistic is
\[
F = \frac{MSR}{MSE} \sim F(1, n-2)
\]
where
\[
MSR = \frac{SSR}{1} \qquad \text{and} \qquad MSE = \frac{SSE}{n-2}
\]
The ANOVA table will help us test the overall significance of the model

Source of variation   Degrees of freedom   Sum Squares   Mean Squares   F statistic
Model                 1                    SSR           MSR            F = MSR/MSE
Residual              n − 2                SSE           MSE
Total                 n − 1                SST

Reject the null hypothesis if F > Fα(1, n − 2)
Example: A public health researcher is investigating the relationship between Body Mass Index (BMI) and Systolic Blood Pressure (SBP) in adults. You are provided with a dataset of 100 adults with their BMI and SBP values. Let X be BMI and Y be SBP. The summary statistics are presented below:
\[
\bar{X} = 24.8, \quad \bar{Y} = 152, \quad \sum_i (X_i - \bar{X})X_i = 1290, \quad \sum_i (Y_i - \bar{Y})^2 = 21377
\]
\[
\sum_i (X_i - \bar{X})Y_i = 3746, \quad \sum_i (Y_i - \hat{Y}_i)^2 = 10498
\]

1 Test whether there is a significant linear relationship between BMI and SBP.
2 Construct ANOVA and test the overall goodness of fit of the model
3 Compute the coefficient of multiple determination and interpret the result
Solution:
1 The appropriate hypothesis to test is
H0 : ρ = 0
H1 : ρ ̸= 0
The test statistic is
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.713\sqrt{100-2}}{\sqrt{1-0.713^2}} = 10.1
\]
where
\[
r = \frac{\sum_i (X_i - \bar{X})Y_i}{\sqrt{\sum_i (X_i - \bar{X})^2}\;\sqrt{\sum_i (Y_i - \bar{Y})^2}} = \frac{3746}{\sqrt{(1290)(21377)}} = 0.713
\]


The tabulated value is t0.025 (98) = 1.98. The test statistic is greater than this tabulated value.
Therefore, we have enough evidence to reject the null hypothesis and conclude that there is a
significant linear relationship between BMI and SBP. This gives us the green light that SBP can be
predicted using simple linear regression by taking BMI as the independent variable
2 The total variation which is captured by SST is 21377, and the sum square error (SSE) is given to
be 10498. Thus,
SSR = SST − SSE = 21377 − 10498 = 10879
Dividing the sum squares by their corresponding degrees of freedom gives the mean squares, i.e.,
\[
MSR = \frac{SSR}{1} = 10879, \qquad MSE = \frac{SSE}{n-2} = \frac{10498}{98} = 107
\]
The F test statistic is then
\[
F = \frac{MSR}{MSE} = \frac{10879}{107} = 102
\]


The ANOVA table is

Source of variation   Degrees of freedom   Sum Squares   Mean Squares   F statistic
Model                 1                    10879         10879          F = 102
Residual              98                   10498         107
Total                 99                   21377

The tabulated value is F0.05(1, 98) = 3.94. This value is less than the test statistic, so we have sufficient evidence to reject H0 : β1 = 0, and the model captures a significant part of the variation in the response variable (SBP)
3 The coefficient of multiple determination is
\[
R^2 = r^2 = 0.713^2 = \frac{SSR}{SST} = \frac{10879}{21377} = 0.51
\]
About 51% of the variation in SBP is explained by BMI alone.
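The whole solution can be reproduced from the summary statistics alone; the sketch below (Python with NumPy and SciPy) recomputes r, the t statistic, the ANOVA quantities, and R².

```python
import numpy as np
from scipy import stats

n = 100
sxx, sxy = 1290.0, 3746.0    # sum((X - Xbar) X) = Sxx and sum((X - Xbar) Y) = Sxy
sst, sse = 21377.0, 10498.0  # total and residual sums of squares

r = sxy / np.sqrt(sxx * sst)                 # sample correlation
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # t test for rho = 0

ssr = sst - sse
msr, mse = ssr / 1, sse / (n - 2)
f = msr / mse                                # overall F test; note t**2 ≈ F
f_crit = stats.f.ppf(0.95, 1, n - 2)
r2 = ssr / sst

print(f"r = {r:.3f}, t = {t:.1f}")
print(f"F = {f:.1f} (critical value {f_crit:.2f}), R^2 = {r2:.2f}")
```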


• We have seen that the goodness-of-fit test for simple linear regression has the same null and alternative hypotheses as the test for the slope.
• This is because we have only one predictor. However, the goodness-of-fit test for multiple linear regression is different from testing the significance of the individual β's

Caution!
It is true that
\[
t^2 = \left[\frac{\hat{\beta}_1}{s.e(\hat{\beta}_1)}\right]^2 = \frac{MSR}{MSE} = F
\]

and (tα/2 (n − 2))2 = Fα (1, n − 2). So, for simple linear regression, we can conclude that testing the
significance of the slope means testing the overall goodness of the model.
Keep in mind that this coincidence happens only in simple linear regression



Simple Linear Regression
Exercise

1 Consider the simple linear regression model

Yi = β0 + β1 Xi + ϵi , i = 1, 2, . . . , n

where the intercept β0 is known.


• Find the least-squares estimator of β1 for this model.
• What is the variance of the least-squares estimator β̂1 found above?
• Construct a 100(1 − α)% confidence interval for β1
2 In a test of the alternatives H0 : β1 = 0 vs. H1 : β1 > 0, an analyst concluded H0. Does this conclusion imply that there is no linear association between X and Y ? Explain
