Group 2 - Chapter 3 - Multiple Regression Analysis Estimation
GROUP 2
LECTURER:
I Gusti Agung Ayu Apsari Anandari, M.S.E.
Multiple regression analysis is more amenable to ceteris paribus analysis because it allows
us to explicitly control for many other factors that simultaneously affect the dependent variable.
This is important both for testing economic theories and for evaluating policy effects when we
must rely on nonexperimental data. Because multiple regression models can accommodate many
explanatory variables that may be correlated, we can hope to infer causality in cases where simple
regression analysis would be misleading.
3.1 Example
Example 3.1 concerns a regression equation used to predict students' college grade point
average (colGPA) from their high school GPA (hsGPA) and ACT score. There are 141 students
in the sample, and both GPA variables are measured on a four-point scale.
The estimated regression equation is:
colGPA = 1.29 + .453 hsGPA + .0094 ACT
If we consider only a simple regression between colGPA and ACT, the estimated equation is:
colGPA = 2.40 + .0271 ACT
However, this simple regression does not allow us to compare two people with the same hsGPA;
it represents a different experiment. In other words, each regression equation is used to
understand how one variable (ACT or hsGPA) affects another variable (colGPA) in the context
of these data. In the multiple regression, hsGPA appears to have a much greater influence than
ACT on colGPA.
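To make the comparison concrete, here is a minimal Python sketch of how both regressions could
be run, assuming the GPA1 data set is available as a hypothetical file gpa1.csv with columns
colGPA, hsGPA, and ACT.

```python
# Sketch of Example 3.1; gpa1.csv is a hypothetical export of the GPA1 data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("gpa1.csv")  # assumed columns: colGPA, hsGPA, ACT

# Multiple regression: the ACT coefficient holds hsGPA fixed
multiple = smf.ols("colGPA ~ hsGPA + ACT", data=df).fit()

# Simple regression: the ACT coefficient does not hold hsGPA fixed
simple = smf.ols("colGPA ~ ACT", data=df).fit()

print(multiple.params)
print(simple.params)
```

The two ACT coefficients differ because only the multiple regression compares students with the
same hsGPA.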
3.2 Example
This example is a multiple regression analysis that uses data on 526 workers in WAGE1 to
explain the relationship between the dependent variable (Y), log(wage), and three independent
variables (X): educ (years of education), exper (years of labor market experience), and tenure
(years with the current employer). The estimated equation is:
log(wage) = .284 + .092 educ + .0041 exper + .022 tenure
This equation is used to estimate the impact of education, experience, and tenure on wages.
The coefficients of the equation are as follows:
⮚ The equation shows that the intercept is .284: even if all the independent variables (X) were
equal to zero, the predicted value of log(wage) would still equal the constant .284.
And the coefficients on education, experience, and tenure are .092, .0041, and .022, respectively.
The value of n is 526, which is the number of observations used in the analysis. Because the
dependent variable is in logarithms, the coefficients in the equation have a percentage
interpretation:
⮚ The coefficient on education is .092: for every one-year increase in education, holding
experience and tenure constant, log(wage) is expected to increase by .092, i.e. wages rise by
about 9.2%.
⮚ The coefficient on experience is .0041: for every one-year increase in experience, holding
education and tenure constant, log(wage) is expected to increase by .0041, or about 0.41%.
⮚ The coefficient on tenure is .022: for every one-year increase in tenure, holding education
and experience constant, log(wage) is expected to increase by .022, or about 2.2%.
In summary, this example shows how a multiple linear regression model can be used to estimate
the relationship between a dependent variable and multiple independent variables. The coefficient
on each independent variable indicates how much the dependent variable is expected to change for
a one-unit increase in that variable, holding the other independent variables constant.
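As a sketch, the same estimation could be done in Python, assuming the WAGE1 data is available
as a hypothetical file wage1.csv with columns wage, educ, exper, and tenure.

```python
# Sketch of Example 3.2; wage1.csv is a hypothetical export of the WAGE1 data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wage1.csv")  # assumed columns: wage, educ, exper, tenure

model = smf.ols("np.log(wage) ~ educ + exper + tenure", data=df).fit()
print(model.params)  # should be close to .284, .092, .0041, .022

# Because the dependent variable is in logs, 100 * coefficient approximates
# the percentage change in wage per one-unit change in a regressor.
print(f"education: about {100 * model.params['educ']:.1f}% per extra year")
```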
3.3 Example
1. Participation Rate (prate): This is the dependent variable or the target to be predicted.
2. Plan Match Rate (mrate): This is the first independent variable. It measures the extent to
which the 401(k) plan match rate (employer contribution) affects the participation rate in
the retirement program.
3. 401(k) Plan Age (age): This is the second independent variable. It measures the extent to
which the age of the 401(k) plan affects the participation rate.
In the regression equation, each independent variable has an estimated coefficient. The
estimated equation is:
prate = 80.12 + 5.52 mrate + .243 age
- Intercept (80.12): This is the expected value of the participation rate when both independent
variables, i.e. plan match rate and 401(k) plan age, are zero. In this context, it may not have any
real interpretation as both variables are unlikely to be zero in real situations.
- Coefficient for Plan Match Rate (5.52): This regression coefficient indicates how much
influence the plan match rate has on the participation rate. In this case, every one-unit increase
in the plan match rate would be expected to increase the participation rate by 5.52, provided that
other factors remain constant.
- Coefficient for 401(k) Plan Age (0.243): This is the regression coefficient that indicates how
much influence 401(k) plan age has on the participation rate. In this case, every 1 unit increase in
401(k) plan age would be expected to increase the participation rate by 0.243, holding other factors
constant.
n = 1,534 indicates the number of observations used in this regression analysis.
Thus, this regression equation makes it possible to predict the participation rate in a 401(k)
retirement plan based on the plan match rate and the age of the 401(k) plan. The plan match rate
has a sizable positive effect on the participation rate, while the 401(k) plan age has a smaller
but still positive effect. This example illustrates the importance of controlling for other
variables that might affect the results in a regression analysis. Let us further analyze what
happens in this case:
1. Estimating the Effect with the Age Control: In the multiple regression, the age of the
401(k) plan (age) is included as an independent variable. This means the age variable is
controlled for when estimating the effect of the plan match rate (mrate) on the participation
rate (prate). In other words, the estimated effect of mrate on prate in the multiple
regression already accounts for the role played by the age of the plan.
2. Estimating the Effect Without the Age Control: In a simple regression of prate on mrate,
however, the age variable is no longer controlled for. The estimate of the effect of mrate
on prate therefore becomes more "coarse", as it does not account for any possible influence
of age. This is an important example of how control variables can affect regression results.
3. Correlation Between mrate and age: The sample correlation between mrate and age is only
about .12, which indicates that, while there is a slight correlation between the two
variables, it is not strong. This weak correlation accounts for the relatively small
difference between the simple-regression and multiple-regression estimates.
However, it is important to remember that when there are other factors that might affect the
dependent variable (in this case, prate), controlling for those factors as well as possible makes
the regression results more accurate. This is a basic principle of regression analysis: consider
and control for the variables that might affect the dependent variable, so that the effect of the
variable being tested can be assessed more precisely.
In this context, although the simple regression without the age control gives different results
from the multiple regression, the difference is not very large because of the weak correlation
between mrate and age. In other situations, where the variables are more strongly correlated or
the omitted effects larger, leaving out the control variable may produce very different estimates.
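The point can be illustrated with a small simulation, using entirely made-up parameters rather
than the actual 401(k) data: when the omitted control is only weakly correlated with the
regressor of interest, dropping it changes the estimate only slightly.

```python
# Simulation sketch: weak correlation between the regressor and the omitted
# control implies a small difference between simple and multiple regression.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

age = rng.normal(0, 1, n)
mrate = 0.12 * age + rng.normal(0, 1, n)             # corr(mrate, age) ~ .12
prate = 80 + 5.5 * mrate + 0.25 * age + rng.normal(0, 5, n)

# Multiple regression: prate on a constant, mrate, and age
X_full = np.column_stack([np.ones(n), mrate, age])
b_full, *_ = np.linalg.lstsq(X_full, prate, rcond=None)

# Simple regression: prate on a constant and mrate only (age omitted)
X_short = np.column_stack([np.ones(n), mrate])
b_short, *_ = np.linalg.lstsq(X_short, prate, rcond=None)

print("mrate coefficient with age control:   ", b_full[1])
print("mrate coefficient without age control:", b_short[1])
```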
3.4 Example
The following example reports goodness-of-fit results for the regression used earlier to predict
college grade point average (colGPA) from high school GPA (hsGPA) and ACT score.
- n = 141 indicates the number of observations in this analysis, the same sample described in
Example 3.1.
- "R^2 = 0.176" is the coefficient of determination, which indicates how much variation in colGPA
can be explained by the regression model. In this case, R^2 is 0.176, which means that about 17.6%
of the variation in college GPA (colGPA) can be explained by the hsGPA and ACT variables that
have been included in the model. The remaining 82.4% cannot be explained by this model.
This R² may seem low. While hsGPA and ACT provide some insight into a student's performance in
college, many other factors influence college GPA, such as family background, personality,
quality of high school education, and interest in attending college. The low R² indicates that
this regression model leaves much of the variation in college GPA unexplained.
It is important to understand that in the real world, phenomena such as performance in college are
often complex and influenced by many factors that are difficult to measure or include in regression
models. Therefore, interpretation of regression analysis results should always be done with
caution, and regression results are not always able to explain all aspects of a phenomenon.
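For reference, here is a minimal sketch of how R² is computed from observed and fitted values;
the numbers below are made up purely for illustration.

```python
# R-squared: the share of the variation in y explained by the fitted values.
import numpy as np

y = np.array([2.8, 3.1, 3.4, 2.5, 3.9, 3.0])      # observed values (made up)
y_hat = np.array([2.9, 3.0, 3.3, 2.7, 3.6, 3.1])  # fitted values (made up)

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```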
3.5 Example
Data on arrests in 1986 can be found in CRIME1, along with other details on 2,725 men who were
born in California in either 1960 or 1961. Each man in the sample had been arrested at least
once prior to 1986.
narr86: the number of times the man was arrested during 1986
pcnv: the proportion of prior arrests that led to a conviction
avgsen: average sentence length served for prior convictions (zero for most men)
ptime86: months spent in prison during 1986
qemp86: the number of quarters during which the man was employed in 1986 (from zero to four)
The first estimated equation, which excludes avgsen, is:
narr86 = .712 - .150 pcnv - .034 ptime86 - .104 qemp86
n = 2,725, R² = .0413.
If we increase pcnv by .50 (a large increase in the probability of conviction), then, holding the
other factors fixed, Δnarr86 = -.150(.50) = -.075. This may seem unusual, because an arrest cannot
change by a fraction, but we can apply this value to a large group of men: among 100 men, the
predicted fall in arrests when pcnv increases by .50 is 7.5.
Longer prison terms also lead to lower predicted arrests: if ptime86 increases from 0 to 12 (one
year in prison), predicted arrests fall by .034(12) = .408. Legal employment lowers predicted
arrests too, by .104 for each quarter of employment, or 10.4 arrests among 100 men.
Adding the average sentence variable gives:
narr86 = .707 - .151 pcnv + .0074 avgsen - .037 ptime86 - .103 qemp86
n = 2,725, R² = .0422.
Adding avgsen increases R² only from .0413 to .0422, a small effect. The coefficient on avgsen is
positive, which, taken at face value, says that a longer average sentence length increases
criminal activity.
The four explanatory variables in the second regression explain only 4.2% of the variation in
narr86, yet they may still provide reliable estimates of the ceteris paribus effect of each
independent variable on narr86; their reliability does not depend directly on the size of R²
(the coefficient of determination). A low R² does mean, however, that predicting individual
outcomes with high accuracy is difficult.
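The ceteris paribus calculations in this example amount to simple arithmetic with the reported
coefficients, as this short sketch shows.

```python
# Predicted changes from the first estimated crime equation; the
# coefficients are the ones reported in the text.
b_pcnv, b_ptime86, b_qemp86 = -0.150, -0.034, -0.104

d_narr86 = b_pcnv * 0.50       # raise pcnv by .50, other factors fixed
print(d_narr86)                # -0.075 arrests per man
print(d_narr86 * 100)          # -7.5 arrests among 100 men

print(b_ptime86 * 12)          # one year in prison: -0.408 arrests per man
print(b_qemp86 * 100)          # one quarter of work: -10.4 per 100 men
```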
3.6 Example
This example assumes that the model log(wage) = β0 + β1educ + β2abil + u satisfies Assumptions
MLR.1 through MLR.4. With this model we try to understand the effect of education on wages; the
presumption is that the higher one's education level, the higher the wage one receives. We also
want to understand the effect of ability on wages. The goal is to understand how much education
and ability each affect wages.
The data set WAGE1 does not contain data on ability, so the ability of individuals is not
observed. Because of this data unavailability, we estimate the model by a simple regression of
the logarithm of wages, log(wage), on education (educ), obtaining:
log(wage) = .584 + .083 educ
We use the symbol "~" rather than "^" for these estimates to emphasize that they come from an
underspecified model: the coefficient on educ measures the effect of education on log(wage)
without controlling for ability.
● The value of .584 is the estimate of β₀, the intercept of the regression model. It indicates
the value of the logarithm of wage when education (educ) is zero.
● The value of .083 is the estimated coefficient on educ, which measures the effect of education
on the logarithm of wages.
● n is the number of observations in the sample or data set. In this case, 526 observations are
used in the analysis.
● The coefficient of determination, or R-squared (R²), measures how much of the variation in the
dependent variable is explained by the regression model. In this model the R-squared is .186,
which means about 18.6% of the variation in the logarithm of wages is explained by educ, the
only independent variable included in the estimated model.
Because this result comes from a single sample, we cannot say whether .083 is greater than β₁;
the actual return to education in our sample could be lower or higher than 8.3% (and we will
never know for sure). However, we do know that the average of the estimates across all random
samples would be too large, because education and the omitted ability variable are positively
correlated.
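This upward bias can be demonstrated by simulation, with entirely made-up parameters rather than
the WAGE1 data: education is generated to be positively correlated with ability, ability is
omitted from the regression, and the average estimate across many samples exceeds the true β₁.

```python
# Omitted-variable bias sketch: abil is correlated with educ but omitted.
import numpy as np

rng = np.random.default_rng(1)
estimates = []
for _ in range(1_000):                            # many random samples
    n = 526
    abil = rng.normal(0, 1, n)
    educ = 12 + 2 * abil + rng.normal(0, 2, n)    # educ correlated with abil
    log_wage = 0.5 + 0.08 * educ + 0.10 * abil + rng.normal(0, 0.3, n)

    X = np.column_stack([np.ones(n), educ])       # ability omitted
    b, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
    estimates.append(b[1])

print("true beta_1:            0.08")
print("mean of the estimates: ", np.mean(estimates))  # above 0.08 on average
```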
3.7 Example
The variable we want to explain, y = earn98, is labor market income in 1998, the year after the
job training program (which took place in 1997).
It is known that:
1. y = earn98
2. w = train
3. The control variables are earnings in 1996 (the year prior to the program), years of
schooling (educ), age, and marital status (married).
4. In the simple regression without controls, the coefficient on train is -2.05 (earnings are
measured in thousands of dollars).
5. The average earnings for those who did not participate is given by the intercept: $10,610.
Controlling for pre-program income, education level, age, and marital status produces a
substantially different estimate of the training effect. As the equation shows, workers with more
education also earn more: about $363 for each additional year. The marriage effect is roughly as
large as the training effect: ceteris paribus, married men earn, on average, about $2,480 more
than their single counterparts. The other control variables are harder to interpret, but they do
useful work in predicting earnings.
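Here is a sketch of the two regressions being compared, assuming the data is available as a
hypothetical file jtrain98.csv with the column names used in the text.

```python
# Sketch of Example 3.7; jtrain98.csv is a hypothetical export of the data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("jtrain98.csv")  # assumed columns: earn98, train, earn96,
                                  # educ, age, married

# Simple regression: no controls for pre-program differences
short = smf.ols("earn98 ~ train", data=df).fit()

# Multiple regression: controls for pre-program income, schooling, age,
# and marital status
long = smf.ols("earn98 ~ train + earn96 + educ + age + married", data=df).fit()

print(short.params["train"])  # training effect without controls
print(long.params["train"])   # training effect with controls
```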
CONCLUSION
Multiple regression analysis is a statistical technique that uses several explanatory variables to
predict the outcome of a response variable. In multiple regression analysis, the parameters are
estimated using the method of ordinary least squares. The R-squared value is used to determine how
well the model fits the data: it represents the proportion of the variance of the dependent
variable that is explained by the independent variables. However, R-squared only works as intended
in a simple linear regression model with one explanatory variable; with a multiple regression made
up of several independent variables, the R-squared must be adjusted, because it mechanically rises
as variables are added. Multiple regression analysis can also be used to assess effect modification
by estimating a multiple regression equation relating the outcome of interest to the independent
variables. The multiple regression model is based on the assumptions that there is a linear
relationship between the dependent variable and the independent variables, that the independent
variables are not too highly correlated with each other, that the observations yi are selected
independently and randomly from the population, and that the residuals are normally distributed
with mean 0 and variance σ².
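To close, here is a minimal sketch of the ordinary least squares estimator behind multiple
regression, b = (X'X)^(-1) X'y, applied to a small made-up data matrix.

```python
# OLS via the normal equations, with a constant and two regressors.
import numpy as np

X = np.array([[1, 2.0, 3.0],   # each row: constant, x1, x2 (made up)
              [1, 1.5, 2.0],
              [1, 3.0, 1.0],
              [1, 2.5, 4.0],
              [1, 0.5, 2.5]])
y = np.array([5.0, 4.0, 6.0, 7.0, 3.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) b = X'y
residuals = y - X @ beta_hat
print(beta_hat)
print(residuals.sum())  # ~0: residuals sum to zero when a constant is included
```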