
Regression Analysis (Math4319)

Introduction

Instructor: Tatek Getachew (PhD)

Tatek () Math4319 1 / 17
Outline

1 Regression and Model Building

2 Data Collection

3 Uses of Regression

4 Role of the Computer


Regression and Model Building


Regression analysis is a statistical technique for investigating and
modeling the relationship between variables.
Applications of regression are numerous and occur in almost every
field, including engineering, the physical and chemical sciences,
economics, management, life and biological sciences, and the social
sciences.
Regression analysis is used extensively in data mining and is a basic
tool of data science and analytics.
Regression analysis may be the most widely used statistical technique.
It is used to answer questions such as:
Does yield (in quintals) depend on the amount of rainfall, temperature,
fertilizer use, and the number of times the land is cultivated?
Does change in cholesterol level depend on diet change, age, sex, and
amount of exercise?
Does changing class size affect the success of students?
Is a student's GPA affected by the amount of time spent studying, age, sex,
economic status of the parents, field of study, . . . ?
E.g., suppose that an industrial engineer employed by a soft drink
beverage company visits 25 randomly chosen retail outlets having
vending machines and observes, for each outlet, the in-outlet delivery
time (in minutes) and the volume of product delivered (in cases).
If we let y represent delivery time and x represent delivery volume,
then the equation of a straight line relating these two variables is
$$y = \beta_0 + \beta_1 x$$
where $\beta_0$ is the intercept and $\beta_1$ is the slope.
We use a scatter plot to display the relationship between the two
variables.
The data points do not fall exactly on a straight line, so let the
difference between the observed value of y and the straight line
($\beta_0 + \beta_1 x$) be an error $\varepsilon$.
It is convenient to think of $\varepsilon$ as a statistical error; that is, it is a
random variable that accounts for the failure of the model to fit the
data exactly.
The error may be made up of the effects of other variables on delivery
time, measurement errors, and so forth.
Thus, a more plausible model for the delivery time data is
$$y = \beta_0 + \beta_1 x + \varepsilon$$
The above equation is called a linear regression model.
x is called the independent variable and y is called the dependent
variable.
We also refer to x as the predictor or regressor variable and y as the
response variable.
Because the equation involves only one regressor variable, it is called a
simple linear regression model.
We assume x is fixed, so the random component $\varepsilon$ on the right-hand
side of the model determines the properties of y.
Suppose that the mean and variance of $\varepsilon$ are 0 and $\sigma^2$, respectively.
Then the mean response at any value of the regressor variable is

$$E(y \mid x) = \mu_{y|x} = E[\beta_0 + \beta_1 x + \varepsilon] = \beta_0 + \beta_1 x$$

The variance of y given any value of x is

$$\mathrm{Var}(y \mid x) = \sigma^2_{y|x} = \mathrm{Var}[\beta_0 + \beta_1 x + \varepsilon] = \sigma^2$$

Thus, the true regression model $\mu_{y|x} = \beta_0 + \beta_1 x$ is a line of mean
values; that is, the height of the regression line at any value of x is
just the expected value of y for that x.
The slope $\beta_1$ can be interpreted as the change in the mean of y for a
unit change in x.
Furthermore, the variability of y at a particular value of x is
determined by the variance of the error component of the model, $\sigma^2$.
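Both results follow from treating the straight-line part as a constant once x is fixed. A short worked derivation, using only the stated assumptions that the error has mean 0 and variance $\sigma^2$:

```latex
\begin{align*}
E(y \mid x) &= E[\beta_0 + \beta_1 x + \varepsilon]
             = \beta_0 + \beta_1 x + E(\varepsilon)
             = \beta_0 + \beta_1 x, \\
\mathrm{Var}(y \mid x) &= \mathrm{Var}[\beta_0 + \beta_1 x + \varepsilon]
             = \mathrm{Var}(\varepsilon)
             = \sigma^2 .
\end{align*}
```

Adding a constant shifts the mean but leaves the variance unchanged, which is why the variance of y does not depend on x.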
This implies that there is a distribution of y values at each x and that
the variance of this distribution is the same at each x.
The variance $\sigma^2$ determines the amount of variability or noise in the
observations y on delivery time.
When $\sigma^2$ is small, the observed values of delivery time will fall close
to the line, and when $\sigma^2$ is large, the observed values of delivery time
may deviate considerably from the line.
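The effect of the error variance can be seen in a small simulation. This is only a sketch: the intercept, slope, range of x, and the normal distribution for the error are all assumptions made for illustration (the slides only require mean 0 and a constant variance), and the function names are hypothetical.

```python
import numpy as np

def simulate_delivery(sigma, n=25, beta0=3.0, beta1=2.0, seed=0):
    """Draw n observations from y = beta0 + beta1*x + eps.

    eps is assumed normal with mean 0 and standard deviation sigma;
    beta0, beta1 and the range of x are made-up illustrative values.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(2, 30, size=n)         # hypothetical delivery volumes (cases)
    eps = rng.normal(0.0, sigma, size=n)   # random error term
    y = beta0 + beta1 * x + eps
    return x, y

def spread_about_line(x, y, beta0=3.0, beta1=2.0):
    """Root-mean-square deviation of the observations from the true line."""
    return float(np.sqrt(np.mean((y - (beta0 + beta1 * x)) ** 2)))

x_small, y_small = simulate_delivery(sigma=0.5)
x_large, y_large = simulate_delivery(sigma=5.0)
rms_small = spread_about_line(x_small, y_small)
rms_large = spread_about_line(x_large, y_large)
print(rms_small, rms_large)
```

With the smaller standard deviation the simulated points stay close to the line; with the larger one they scatter widely around it.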

In almost all applications, a regression equation is only an
approximation to the true functional relationship between the variables.
These functional relationships are often based on physical, chemical,
or other engineering or scientific theory, that is, knowledge of the
underlying mechanism.
Consequently, these types of models are often called mechanistic models.
Regression models, on the other hand, are thought of as empirical models.
Figure 1.3 illustrates a situation where the true relationship between y
and x is relatively complex, yet it may be approximated quite well by
a linear regression equation.
Sometimes the underlying mechanism is more complex, resulting in
the need for a more complex approximating function, as in Figure 1.4,
where a "piecewise linear" regression function is used to
approximate the true relationship between y and x.
Generally, regression equations are valid only over the region of the
regressor variables contained in the observed data.

For example, consider Figure 1.5. Suppose that data on y and x were
collected in the interval $x_1 \le x \le x_2$.
Over this interval the linear regression equation shown in Figure 1.5 is
a good approximation of the true relationship.
However, suppose this equation were used to predict values of y for
values of the regressor variable in the region $x_2 \le x \le x_3$.
Clearly the linear regression model is not going to perform well over
this range of x because of model error or equation error.
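The danger of extrapolation can be illustrated numerically. In this sketch the curved "true" relationship, the interval endpoints, and the helper names are all hypothetical choices; the straight line is fitted only to data from the first interval and then evaluated on both intervals.

```python
import numpy as np

def true_mean(x):
    # Hypothetical curved relationship, chosen only for illustration.
    return 10.0 * np.sqrt(x)

x1, x2, x3 = 1.0, 4.0, 16.0   # assumed interval endpoints

# Fit a straight line using noise-free data from [x1, x2] only.
x_fit = np.linspace(x1, x2, 50)
beta1, beta0 = np.polyfit(x_fit, true_mean(x_fit), deg=1)  # slope, intercept

def max_abs_error(lo, hi):
    """Largest gap between the true curve and the fitted line on [lo, hi]."""
    grid = np.linspace(lo, hi, 200)
    return float(np.max(np.abs(true_mean(grid) - (beta0 + beta1 * grid))))

err_inside = max_abs_error(x1, x2)    # region covered by the data
err_outside = max_abs_error(x2, x3)   # extrapolation region
print(err_inside, err_outside)
```

The line tracks the curve closely where data were collected, but the gap grows quickly once x moves beyond the observed range.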

In general, the response variable y may be related to k regressors,
$x_1, x_2, \ldots, x_k$, so that

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

This is called a multiple linear regression model because more than
one regressor is involved.
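A minimal sketch of a multiple linear regression with k = 2 regressors follows. The data are simulated, the coefficient values are made up, the error is assumed normal, and the fit uses NumPy's general least-squares solver as one possible way to estimate the parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 10, size=n)           # first hypothetical regressor
x2 = rng.uniform(0, 5, size=n)            # second hypothetical regressor
true_beta = np.array([1.0, 2.0, -3.0])    # assumed beta0, beta1, beta2

# Response generated from the multiple linear regression model.
y = (true_beta[0] + true_beta[1] * x1 + true_beta[2] * x2
     + rng.normal(0.0, 0.1, size=n))

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Each entry of `beta_hat` estimates the corresponding parameter; with little noise the estimates land close to the values used to simulate the data.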

The adjective linear is employed to indicate that the model is linear in
the parameters $\beta_0, \beta_1, \ldots, \beta_k$, not because y is a linear function of
the x's.
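To make the distinction concrete, the sketch below fits a model that is curved in x but still linear in the parameters, so ordinary linear least squares applies. The quadratic form and the coefficient values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 + 2x + 0.5x^2 (curved in x).
x = np.linspace(0, 5, 20)
y = 1.0 + 2.0 * x + 0.5 * x**2

# The model is linear in (beta0, beta1, beta2): each column of the design
# matrix is a known function of x multiplied by one unknown parameter.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Treating x and x squared as two separate regressors turns the curved relationship into an ordinary multiple linear regression.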
An important objective of regression analysis is to estimate the
unknown parameters in the regression model.
This process is also called fitting the model to the data.
We study several parameter estimation techniques in this book. One
of these techniques is the method of least squares (introduced in
Chapter 2).
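As a preview, here is a minimal sketch of least squares for the simple linear regression model, using the standard closed-form expressions (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The data are made up, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()
s_xx = np.sum((x - x_bar) ** 2)            # corrected sum of squares of x
s_xy = np.sum((x - x_bar) * (y - y_bar))   # corrected sum of cross products

beta1_hat = s_xy / s_xx                    # estimated slope
beta0_hat = y_bar - beta1_hat * x_bar      # estimated intercept

def sse(b0, b1):
    """Sum of squared vertical distances from the data to the line b0 + b1*x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

print(beta0_hat, beta1_hat, sse(beta0_hat, beta1_hat))
```

Any other choice of intercept and slope gives a larger value of `sse`, which is what "least squares" means.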
The next phase of a regression analysis is called model adequacy
checking, in which the appropriateness of the model is studied and
the quality of the fit ascertained.
Through such analyses the usefulness of the regression model may be
determined.
The outcome of adequacy checking may indicate either that the
model is reasonable or that the original fit must be modified.
Thus, regression analysis is an iterative procedure, in which data lead
to a model and a fit of the model to the data is produced.
The quality of the fit is then investigated, leading either to
modification of the model or the fit or to adoption of the model.
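One pass of this fit, check, and modify cycle might look like the sketch below. The simulated data, the normal error, and the choice of residual sum of squares as the yardstick are all assumptions for illustration; in practice adequacy checking also relies on residual plots and other diagnostics.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 60)
# Hypothetical data whose true mean is curved in x.
y = 2.0 + 1.5 * x + 0.4 * x**2 + rng.normal(0.0, 1.0, size=x.size)

def fit_and_sse(X, y):
    """Least-squares fit; returns the estimates and the residual sum of squares."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return beta_hat, float(resid @ resid)

# Step 1: initial straight-line model.
X_line = np.column_stack([np.ones_like(x), x])
_, sse_line = fit_and_sse(X_line, y)

# Step 2: after finding the fit inadequate, modify the model by adding x^2.
X_quad = np.column_stack([np.ones_like(x), x, x**2])
_, sse_quad = fit_and_sse(X_quad, y)

print(sse_line, sse_quad)
```

The large drop in the residual sum of squares suggests the modified model describes these data much better than the initial one.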
A regression model does not imply a cause-and-effect relationship
between the variables.
Finally, it is important to remember that regression analysis is part of
a broader data-analytic approach to problem solving.
That is, the regression equation itself may not be the primary
objective of the study.
It is usually more important to gain insight and understanding
concerning the system generating the data.


Data Collection

An essential aspect of regression analysis is data collection. Any
regression analysis is only as good as the data on which it is based.
Three basic methods for collecting data are as follows:
A retrospective study based on historical data
An observational study
A designed experiment
A good data collection scheme can ensure a simplified and a generally
more applicable model.
A poor data collection scheme can result in serious problems for the
analysis and its interpretation.


Retrospective Study: uses either all or a sample of the historical process
data over some period of time to determine the relationships among the
variables of interest.
In general, the primary disadvantages of retrospective studies are as follows:
Some of the relevant data often are missing.
The reliability and quality of the data are often highly questionable.
The nature of the data often may not allow us to address the problem
at hand.
The analyst often tries to use the data in ways they were never
intended to be used.
Logs, notebooks, and memories may not explain interesting phenomena
identified by the data analysis.
Using historical data always involves the risk that, for whatever
reason, some of the data were not recorded or were lost.
These errors make historical data prone to outliers, or observations
that are very different from the bulk of the data.


Observational Study: an observational study simply observes the
process or population. We interact with or disturb the process only as
much as is required to obtain relevant data.
With proper planning, these studies can ensure accurate, complete,
and reliable data.
On the other hand, these studies often provide very limited
information about specific relationships among the data.
Designed Experiment: the best data collection strategy uses a
designed experiment, in which we manipulate the regressor variables,
which we call the factors, according to a well-defined strategy
called the experimental design.
The experimental design or plan consists of a series of runs.


Uses of Regression
Regression models are used for several purposes, including the
following:
1. Data description
2. Parameter estimation
3. Prediction and estimation
4. Control
Regression analysis is helpful in developing equations to
summarize or describe a set of data.
A regression model would probably be a much more convenient and
useful summary of the data than a table or even a graph.
Sometimes parameter estimation problems can be solved by regression
methods.
Many applications of regression involve prediction of the response
variable.
For example, we may wish to predict delivery time for a specified
number of cases of soft drinks to be delivered.
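A prediction of this kind could be sketched as follows. The five observations are invented for illustration (they are not the engineer's data), and `np.polyfit` is used here as one convenient way to obtain the least-squares line.

```python
import numpy as np

# Hypothetical observations: cases delivered and delivery time in minutes.
cases = np.array([2.0, 4.0, 6.0, 8.0, 12.0])
minutes = np.array([10.0, 16.5, 22.0, 28.5, 41.0])

# Fit the straight line; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(cases, minutes, deg=1)

def predict_minutes(n_cases):
    """Predicted mean delivery time for a given number of cases."""
    return intercept + slope * n_cases

predicted = float(predict_minutes(10.0))
print(predicted)
```

The value 10 lies inside the observed range of 2 to 12 cases; predicting far outside that range would be extrapolation and should be treated with caution.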
Regression models may be used for control purposes.

Role of the Computer


Building a regression model is an iterative process.
The model-building process is illustrated in the figure below. It begins
by using any theoretical knowledge of the process that is being
studied and available data to specify an initial regression model.
A good regression computer program is a necessary tool in the
model-building process.
We must learn how to interpret what the computer is telling us and
how to incorporate that information in subsequent models.
Generally, regression computer programs are part of more general
statistics software packages, such as Minitab, SAS, JMP, and R.
