1. Introduction
A dummy variable is a binary variable that takes the value 1 or 0. It is commonly used to examine group and time effects in regression. Panel data analysis estimates fixed effect and/or random effect models using dummy variables. The fixed effect model examines differences in intercepts among groups, assuming the same slopes. By contrast, the random effect model estimates error variances of groups, assuming the same intercept and slopes. An example of the random effect model is the groupwise heteroscedasticity model, which assumes that each group has a different variance (Greene 2000: 511-513). The data used here cover the top 50 information technology firms listed on page 308 of OECD Information Technology Outlook 2004 (http://thesius.sourceoecd.org/). The data set contains revenue, R&D budget, and net income in current USD millions.
Let us first consider a linear regression model, estimated by ordinary least squares (OLS), without the dummy variable. Note that $\beta_0$ is the intercept, $\beta_1$ is the slope of net income in 2000, and $\varepsilon_i$ is the error term of the regression equation.
Model 1: $research_i = \beta_0 + \beta_1 income_i + \varepsilon_i$
The estimated model has an intercept of 1,482.697 and a slope of .223. For a USD 1 million increase in net income, a firm is likely to increase its R&D budget in 2002 by USD .223 million, holding all other things constant.
Table 2. Regression without Dummy Variables (Model 1)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  1,    37) =    7.07
       Model |  15902406.5     1  15902406.5           Prob > F      =  0.0115
    Residual |  83261299.1    37  2250305.38           R-squared     =  0.1604
-------------+------------------------------           Adj R-squared =  0.1377
       Total |  99163705.6    38   2609571.2           Root MSE      =  1500.1

------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2230523   .0839066     2.66   0.012     .0530414    .3930632
       _cons |   1482.697   314.7957     4.71   0.000     844.8599    2120.533
------------------------------------------------------------------------------
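The summary statistics in the right-hand panel of Table 2 follow directly from the sums of squares in the left-hand panel. The sketch below, written in Python rather than the SAS or STATA used in this document, recomputes them from the reported figures only; the variable names are illustrative.

```python
import math

# Figures copied from the ANOVA panel of Table 2 (Model 1).
ss_model, df_model = 15902406.5, 1
ss_resid, df_resid = 83261299.1, 37
ss_total = ss_model + ss_resid          # Total SS, 38 degrees of freedom

ms_model = ss_model / df_model          # mean square of the model
ms_resid = ss_resid / df_resid          # mean square error (MSE)

r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * (df_model + df_resid) / df_resid
f_stat = ms_model / ms_resid
root_mse = math.sqrt(ms_resid)

print(round(r_squared, 4), round(adj_r_squared, 4),
      round(f_stat, 2), round(root_mse, 1))
```

The same arithmetic applies to every STATA table in this document that includes an intercept.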
Equipment and Software: Research = 2140.205 + .218*income
Telecom. and Electronics: Research = 1133.579 + .218*income
Table 3. Regression with a Dummy Variable (Model 2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4

------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
           d |   1006.626   479.3717     2.10   0.043     34.41498    1978.837
       _cons |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
3.2 Comparison between Model 1 and Model 2
Let us draw a plot to highlight the difference between Model 1 and Model 2 more clearly. Look at the middle red line first. It is the regression line of Model 1 without the dummy variable. The top green line is the regression line for equipment and software companies, while the bottom yellow line is the one for telecommunications and electronics firms in Model 2. Of course, the green and yellow lines are parallel, with a vertical distance of 1,006.626, the coefficient of the dummy variable in Table 3.
Figure 1. Comparison between Model 1 and Model 2 (Fixed Group Effect)
The intercept of equipment and software firms is computed as 2140.205 = 1006.626 + 1133.579.
This plot shows that Model 1 cancels out the group difference and thus reports a misleading intercept. The difference between the two groups of firms looks substantial. The t-test for the dummy parameter rejects the null hypothesis of no difference in intercepts at the .05 level (p<.043). Consequently, we conclude that Model 2, which considers fixed group effects, is better than the simple Model 1. You may compare goodness-of-fit statistics (e.g., F, t, R-squared, and SSE) of the two models.
3.3 Common Misunderstandings
Some people, especially those who do not know exactly how dummies work, may ask, "What if we code the dummy variable reversely?" The simplest answer is, "It gives equivalent results." Let us set d0 to 1 if d is 0 (telecommunications and electronics firms) and to zero if d is 1 (equipment and software firms), and then replace d with d0 in Model 2. The model becomes
Model 2-1: $research_i = \beta_0' + \beta_1' income_i + \delta' d0_i + \varepsilon_i$
Model 2-1 is equivalent to Model 2 in that both produce identical regression equations. The ANOVA tables of the two models are identical. The slope of the regressor remains unchanged: $\beta_1' = \beta_1$. The sign of the dummy parameter is switched: $\delta' = -\delta$. The intercept of Model 2-1 is the actual intercept of equipment and software companies, whose dummy variable is excluded in Model 2-1: $\beta_0' = \beta_0 + \delta$. That is, one implies the other, because the two models use different baseline categories (reference points). Model 2 uses telecommunications and electronics firms as the baseline, while Model 2-1 switches to equipment and software companies. They see the same thing from different views.
Table 4. Regression with a Reversely Coded Dummy (Model 2-1)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    6.06
       Model |  24987948.9     2  12493974.4           Prob > F      =  0.0054
    Residual |  74175756.7    36  2060437.69           R-squared     =  0.2520
-------------+------------------------------           Adj R-squared =  0.2104
       Total |  99163705.6    38   2609571.2           Root MSE      =  1435.4

------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
          d0 |  -1006.626   479.3717    -2.10   0.043    -1978.837   -34.41498
       _cons |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
------------------------------------------------------------------------------
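The equivalence of the two codings can be checked by recovering each group's intercept from either table. The following Python sketch uses only the estimates reported in Table 3 and Table 4; the dictionary keys and the helper `group_intercepts` are illustrative names, not part of the original analysis.

```python
# Estimates copied from Table 3 (Model 2) and Table 4 (Model 2-1).
model2 = {"slope": 0.2180066, "dummy": 1006.626, "cons": 1133.579}
model21 = {"slope": 0.2180066, "dummy": -1006.626, "cons": 2140.205}

def group_intercepts(model, dummy_is_equipment):
    """Return (equipment & software, telecom & electronics) intercepts."""
    coded_one = model["cons"] + model["dummy"]   # group whose dummy equals 1
    baseline = model["cons"]                     # group whose dummy equals 0
    if dummy_is_equipment:
        return coded_one, baseline
    return baseline, coded_one

# In Model 2 the dummy d marks equipment & software firms;
# in Model 2-1 the dummy d0 marks telecom & electronics firms.
intercepts_m2 = group_intercepts(model2, dummy_is_equipment=True)
intercepts_m21 = group_intercepts(model21, dummy_is_equipment=False)
print(intercepts_m2, intercepts_m21)
```

Both calls yield the same pair of group intercepts, which is what it means for the two models to use different baselines but describe the same regression lines.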
Some may also ask, "Then, why don't we run regression on a group-by-group basis?" Yes, we may get similar regression equations by running one regression only on equipment and software firms and another only on telecommunications and electronics companies.
If the coefficient of the dummy variable d turns out to be statistically insignificant, we can conclude that there is no group effect, that is, that all firms have the same intercept, in favor of Model 1. http://mypage.iu.edu/~kucc625
Model 1-1: $research_i = \alpha_0 + \alpha_1 income_i + \varepsilon_i$ for equipment and software firms
Model 1-2: $research_j = \lambda_0 + \lambda_1 income_j + \nu_j$ for telecom and electronics firms
What is the difference between this group-by-group regression (Models 1-1 and 1-2) and Model 2 with a dummy? The former assumes that the two groups are different species, like monkey versus lemon. The parameters $\alpha$ and $\lambda$ are not comparable in a strict statistical sense. Thus, we may not be able to examine the group differences by comparing (eyeballing) the goodness of fit of the two separate regressions. Another difference lies in the efficiency of the slope estimate, which is improved by pooling data; thus, Model 2 produces more efficient estimates than Models 1-1 and 1-2. What if you present Model 1 (pooled regression), Model 1-1, and Model 1-2 at the same time? What if you report Model 1 as well as Model 2? These attempts end in a logical fallacy because the models have contradictory assumptions. If Model 2 is true, for example, Model 1 must be false. Model 1-1 is not comparable to Models 1 and 1-2.
------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2180066   .0803248     2.71   0.010     .0551004    .3809128
           d |   2140.205   434.4846     4.93   0.000     1259.029     3021.38
          d0 |   1133.579   344.0583     3.29   0.002     435.7962    1831.361
------------------------------------------------------------------------------
You may observe several differences in statistics between Table 3 (Model 2) and Table 5 (Model 2-2). In particular, the coefficients and t statistics of the dummy variables are different, although the two models are equivalent. How do we explain these differences?
The $R^2$ and adjusted $R^2$ are not well defined (incorrect) in Model 2-2, which suppresses the intercept.
The coefficients of dummy variables in Model 2 and Model 2-2 have different meanings. In Model 2-2, the coefficient estimates of the dummies, $\delta_1$ (for d) and $\delta_0$ (for d0), are the actual intercepts of the two groups (2,140.205 and 1,133.579). Accordingly, the null hypotheses of the t-tests are that the parameters $\delta_0$ and $\delta_1$ are zero. By contrast, the coefficient of d in Model 2 estimates the difference between $\delta_1$ and $\delta_0$, where $\delta_0$ is the intercept of the baseline category, telecom and electronics firms. Accordingly, the null hypothesis is that the difference, not the actual intercepts, is zero: $\delta = \delta_1 - \delta_0 = 0$.
Consider the following two plots of regression lines. The left plot depicts a situation where both $\delta_0$ and $\delta_1$ are close to zero in Model 2-2 and their difference $\delta = \delta_1 - \delta_0$ is not substantial in Model 2. The null hypotheses of the t-tests in both models may not be rejected: there is no group effect. Thus Model 1, a pooled model, may be better than Model 2. In the right plot, $\delta_1$ may turn out to be statistically different (far away) from zero (the null hypothesis may be rejected), while $\delta_0$ is close to zero (not rejected). Accordingly, the difference $\delta = \delta_1 - \delta_0$ is also substantial in Model 2 (rejected). This indicates that there is some fixed effect between the two groups, so Model 2 is superior to Model 1.
Figure 2. Meanings of Dummy Variable Coefficients
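To make the different null hypotheses concrete, the t statistics in Table 3 and Table 5 can be reproduced as the coefficient divided by its standard error. This Python sketch uses only the reported estimates; `t_stat` and the tuple names are illustrative.

```python
# (coefficient, standard error) pairs copied from Table 3 and Table 5.
model2_d = (1006.626, 479.3717)     # Model 2:   H0 is delta_1 - delta_0 = 0
model22_d = (2140.205, 434.4846)    # Model 2-2: H0 is delta_1 = 0
model22_d0 = (1133.579, 344.0583)   # Model 2-2: H0 is delta_0 = 0

def t_stat(coef, se, null_value=0.0):
    """t statistic for H0: parameter equals null_value."""
    return (coef - null_value) / se

# The Model 2 dummy coefficient is the gap between the two Model 2-2 intercepts.
gap = model22_d[0] - model22_d0[0]

for label, pair in [("Model 2, d", model2_d),
                    ("Model 2-2, d", model22_d),
                    ("Model 2-2, d0", model22_d0)]:
    print(label, round(t_stat(*pair), 2))
print("delta_1 - delta_0 =", round(gap, 3))
```

The point estimates are tied together by simple subtraction, but each t statistic answers a different question, which is why the two tables report different values for d.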
Let us run the three regression models mentioned so far using SAS and STATA. In SAS, use the REG procedure as follows. Note that the /**/ is used for comments.
PROC REG;
   MODEL rd2002 = net2000;               /* Model 1 */
   MODEL rd2002 = net2000 d;             /* Model 2 */
   MODEL rd2002 = net2000 d d0 /NOINT;   /* Model 2-2 */
RUN;
In STATA, run the .regress command as follows. Note that the // is used for comments.
. regress rd2002 net2000                     // Model 1
. regress rd2002 net2000 d                   // Model 2
. regress rd2002 net2000 d d0, noconstant    // Model 2-2
5.2 Three Approaches to Running LSDV Regression
Now we are ready for the regression analysis called least squares dummy variable (LSDV) regression. However, here is a problem. If we include all three dummy variables and an intercept, we are caught in the so-called dummy variable trap. This problem is perfect multicollinearity; the regression equation is not solvable since the X matrix is not of full rank. There are three approaches to running regression analyses with multiple dummy variables. First look at the functional forms below. The first approach, call it LSDV1, runs OLS with all dummy variables and no intercept. The second, LSDV2, omits one of the dummy variables and includes the intercept. The final approach, LSDV3, includes all dummy variables and the intercept, but imposes the restriction that the parameters of all dummies sum to zero. Table 10 summarizes the features of the three LSDVs.
LSDV1: $research_i = \beta_1 income_i + \delta_1 d1_i + \delta_2 d2_i + \delta_3 d3_i + \varepsilon_i$ (without the intercept)
LSDV2: $research_i = \beta_0 + \beta_1 income_i + \delta_1 d1_i + \delta_2 d2_i + \varepsilon_i$ (without one of the three dummy variables)
LSDV3: $research_i = \beta_0 + \beta_1 income_i + \delta_1 d1_i + \delta_2 d2_i + \delta_3 d3_i + \varepsilon_i$ with the restriction $\delta_1 + \delta_2 + \delta_3 = 0$
The biggest difference lies in the meanings of the dummy variable parameters and their hypothesis tests. The first approach reports coefficients that are easy to interpret substantively. They are the actual intercepts of the three groups, as in the following regression equations (see Table 7).
Telecom firms: Research = 153.624 + .215*income
Electronics: Research = 1695.486 + .215*income
Equipment & S/W: Research = 2147.559 + .215*income
In the second approach, LSDV2, the intercept is the coefficient of the dropped dummy, which plays the role of baseline or reference point. The other coefficients are the differences between the corresponding actual intercepts and the baseline (see Table 8). For example, the intercept 2,147.559 in LSDV2 is the actual coefficient of d3, which is dropped. The coefficient -452.073 of d2 is computed as 1695.486 - 2147.559. Likewise, 153.624 in LSDV1 is computed as -1993.935 + 2147.559. What if we omit d2 instead of d3? We would get different parameter estimates and standard errors for the dummy variables. Note that the coefficient of net income is quite similar to those of Model 1 and Model 2 (LSDV2) in sections 2 and 3 (.223 versus .218 versus .215).
Table 7. LSDV1 without the Intercept
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  4,    35) =   28.70
       Model |   198376404     4  49594101.1           Prob > F      =  0.0000
    Residual |  60484956.6    35  1728141.62           R-squared     =  0.7663
-------------+------------------------------           Adj R-squared =  0.7396
       Total |   258861361    39  6637470.79           Root MSE      =  1314.6

------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2151104   .0735702     2.92   0.006      .065755    .3644659
          d1 |   153.6238   469.5762     0.33   0.745    -799.6665    1106.914
          d2 |   1695.486   373.0145     4.55   0.000     938.2267    2452.746
          d3 |   2147.559   397.9181     5.40   0.000     1339.742    2955.375
------------------------------------------------------------------------------
The third approach, LSDV3, produces coefficients that indicate how far the actual parameters are from the averaged group effect, which is the intercept of LSDV3 (see Table 9). For example, the intercept 1,332.223 is computed as (153.624 + 1695.486 + 2147.559)/3. The coefficient of d3, 815.33581, is 2,147.559 - 1,332.223. Note that the 6.14175E-13 in the last part of the SAS output is virtually zero; this is the restriction.
------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2151104   .0735702     2.92   0.006      .065755    .3644659
          d1 |  -1993.935   561.9429    -3.55   0.001     -3134.74   -853.1303
          d2 |  -452.0725   481.2018    -0.94   0.354    -1428.964    524.8192
       _cons |   2147.559   397.9181     5.40   0.000     1339.742    2955.375
------------------------------------------------------------------------------
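The LSDV2 coefficients can be derived from the LSDV1 intercepts by subtracting the intercept of the dropped group. The Python sketch below does this with the estimates reported in Table 7 and compares the result with Table 8; `to_lsdv2` is an illustrative helper, and the second call shows what dropping d2 instead of d3 would imply for the point estimates (standard errors cannot be derived this way).

```python
# Actual group intercepts from LSDV1 (Table 7) and reported LSDV2 output (Table 8).
lsdv1 = {"d1": 153.6238, "d2": 1695.486, "d3": 2147.559}
lsdv2_reported = {"d1": -1993.935, "d2": -452.0725, "_cons": 2147.559}

def to_lsdv2(intercepts, dropped):
    """Re-express group intercepts as deviations from the dropped group's intercept."""
    baseline = intercepts[dropped]
    coefs = {name: value - baseline
             for name, value in intercepts.items() if name != dropped}
    coefs["_cons"] = baseline
    return coefs

derived = to_lsdv2(lsdv1, dropped="d3")        # matches Table 8 up to rounding
alternative = to_lsdv2(lsdv1, dropped="d2")    # baseline switched to electronics
print(derived)
print(alternative)
```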
Source             DF    F Value    Pr > F
Model               3       7.46    0.0005
Error              35
Corrected Total    38

R-Square     0.3900
Adj R-Sq     0.3378

Parameter Estimates
                        Parameter      Standard
Variable      DF         Estimate         Error
Intercept      1       1332.22301     280.18308
net2000        1          0.21511       0.07357
d1             1      -1178.59917     333.36182
d2             1        363.26336     288.19307
d3             1        815.33581     297.13197
RESTRICT      -1      6.14175E-13             .
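The LSDV3 estimates can likewise be recovered from the LSDV1 intercepts: the LSDV3 intercept is their average, and each dummy coefficient is the deviation from that average. The sketch below, in Python with illustrative names, checks this against the SAS parameter estimates reported for LSDV3.

```python
# Actual group intercepts from LSDV1 (Table 7).
lsdv1 = {"d1": 153.6238, "d2": 1695.486, "d3": 2147.559}

# SAS estimates reported for LSDV3 (intercept and dummy coefficients).
lsdv3_reported = {"Intercept": 1332.22301, "d1": -1178.59917,
                  "d2": 363.26336, "d3": 815.33581}

def to_lsdv3(intercepts):
    """Return the averaged group effect and each group's deviation from it."""
    mean = sum(intercepts.values()) / len(intercepts)
    return mean, {name: value - mean for name, value in intercepts.items()}

grand_mean, deviations = to_lsdv3(lsdv1)
restriction = sum(deviations.values())   # zero up to floating-point error

print(round(grand_mean, 3), {k: round(v, 3) for k, v in deviations.items()})
```

The deviations sum to zero by construction, which is the restriction that LSDV3 imposes.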
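Table 10. Three Approaches to Running Dummy Variable Models (LSDVs)
                        | LSDV1                                        | LSDV2                                                                       | LSDV3
------------------------+----------------------------------------------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------
Approach                | No intercept                                 | Dropping one dummy                                                          | Imposing restriction
Dummies included        | $d_1^a, d_2^a, \dots, d_d^a$                 | $d_1^b, \dots, d_d^b$ except the dropped dummy                              | $d_1^c, \dots, d_d^c$
Intercept?              | No                                           | Yes                                                                         | Yes
All dummies?            | Yes (d)                                      | No (d-1)                                                                    | Yes (d)
Restriction?            | No                                           | No                                                                          | $\sum_i \delta_i^c = 0$ *
Meaning of coefficient  | Fixed group effect                           | How far away from the reference point (dropped)?                            | How far away from the average group effect?
Coefficients            | $\delta_1^a, \delta_2^a, \dots, \delta_d^a$  | $\delta_i^a = \beta_0^b + \delta_i^b$, $\delta_{dropped}^a = \beta_0^b$     | $\delta_i^a = \beta_0^c + \delta_i^c$, where $\beta_0^c = \frac{1}{d}\sum_j \delta_j^a$
H0 of t-test            | $\delta_i^a = 0$                             | $\delta_i^a - \delta_{dropped}^a = 0$                                       | $\delta_i^a - \frac{1}{d}\sum_j \delta_j^a = 0$ **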
5.3 Comparing Statistics of the Three LSDVs
The t-tests for dummy variable parameters should be interpreted with caution since the three approaches give the dummy coefficients different meanings (see Table 10). In LSDV1 these coefficients are easy to interpret because they are actual intercepts. Keep in mind that LSDV2 examines the difference between an actual intercept and the baseline intercept, while LSDV3 checks how far an actual intercept is from the averaged intercept. The null hypotheses of LSDV1 through LSDV3 are $\delta_i^a = 0$, $\delta_i^a - \delta_{dropped}^a = 0$, and $\delta_i^a - \frac{1}{d}\sum_j \delta_j^a = 0$, respectively. Therefore, you may not conclude, for example, that the intercept of the first group (telecommunications) is statistically significant, or that the parameter of d1 is not zero, by referring to the t-test of LSDV2 (t=-3.55 and p<.001). That t-test just tells us that the intercept of telecommunication firms is substantially different from that of equipment and software companies; it does not tell us whether the intercept is close to zero because the reference point is not zero. Instead, you need to look at the t-test in LSDV1. The small t statistic .33 and large p-value .745 in Table 7 do not allow us to reject the null hypothesis that the actual intercept of telecommunication firms is zero: $\delta_1^a = 0$.
Although LSDV1 without the intercept is easy to interpret, it has serious problems in reporting goodness-of-fit measures (see Table 11). This approach reports wrong SSM and MSM, and thus wrong $R^2$ and F test for $\delta_1 = \dots = \delta_{n-1} = 0$. However, LSDV1 reports correct SSE, MSE, $DF_{error}$, and standard errors of parameter estimates. By contrast, LSDV2 and LSDV3 report correct information at the cost of a more complicated interpretation of the dummy coefficients.
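The goodness-of-fit problem of LSDV1 is visible in the reported tables: the Total SS in Table 7 has 39 degrees of freedom and is not centered on the mean, whereas every model with an intercept reports a Total SS of 99,163,705.6 with 38 degrees of freedom. The Python sketch below, using only those reported figures and illustrative variable names, contrasts the $R^2$ printed for LSDV1 with the one obtained from the centered total.

```python
# Reported for LSDV1 without the intercept (Table 7).
ss_model_lsdv1 = 198376404.0
ss_total_lsdv1 = 258861361.0     # uncentered total, 39 degrees of freedom
ss_resid = 60484956.6            # SSE, which LSDV1 reports correctly

# Centered Total SS reported by the models that include an intercept (e.g., Table 3).
ss_total_centered = 99163705.6

r2_reported = ss_model_lsdv1 / ss_total_lsdv1      # what LSDV1 prints
r2_centered = 1 - ss_resid / ss_total_centered     # comparable with LSDV2 and LSDV3

print(round(r2_reported, 4), round(r2_centered, 4))
```

The second value agrees with the R-square of .3900 in the SAS output for LSDV3, which is why the $R^2$ of LSDV1 should not be compared with those of the other two approaches.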
* This restriction reduces the number of parameters to be estimated, making the model identified.
** In SAS, the H0 needs to be rearranged as $(d-1)\delta_i^a - \sum_{j \neq i} \delta_j^a = 0$, where $i \neq j$.
Correct Correct Correct Correct Correct N-K
5.4 Software Issues
All data analysis software supports LSDV1 and LSDV2. Only SAS and LIMDEP support linear regression with restrictions. However, LIMDEP reports slightly different parameter estimates across approaches. Although it provides various econometric models, LIMDEP is not good for working with data sets. SAS and STATA respectively have the TSCSREG procedure and the .xtreg command to run fixed/random effect models without dummies. The TSCSREG procedure works only on panel data.
Table 12. Comparing Estimation of Three LSDVs
              LSDV1                   LSDV2             LSDV3
SAS 9.1       REG w/ NOINT            REG               REG w/ RESTRICT
STATA 8.2     .regress w/ nocon       .regress
LIMDEP 8.0    Regress w/o ONE         Regress w/ ONE    Regress w/ CLS
R 2.xx        > lm() w/ -1            > lm()
SPSS 12.0     Regression w/ Origin    Regression
The following script runs LSDV3 using the RESTRICT statement of the REG procedures.
PROC REG;
   MODEL rd2002 = net2000 d1-d3;
   RESTRICT d1 + d2 + d3 = 0;
RUN;
The following STATA .xtreg command runs the fixed effect (within effect) panel data model. Note that the i(type2) option specifies the independent unit, and that type2 is recoded from type so that it takes the values 1, 2, and 3 for the three firm types (d1 through d3).
.xtreg rd2002 net2000, fe i(type2)
The K denotes the sum of the number of dummy variables, regressors, and the intercept included in the model. The N is the total number of observations used in the regression model.
Individual dummy coefficients need to be computed and their standard errors should be corrected (adjusted).
6.1 Data Structure and Estimation
A new group variable is the area of the firm's ownership. Here is another set of three dummy variables, g1, g2, and g3. The g1 is set to 1 if a firm is owned by an Asian country and 0 otherwise. Similarly, g2 and g3 are coded for European and American companies, respectively. Look at the data structure.
Table 13. Data Structure of the Two-Way LSDV
+-------------------------------------------------------------+
| firm      type            d1  d2  d3   area      g1  g2  g3 |
|-------------------------------------------------------------|
| Samsung   Electronics      0   1   0   Asia       1   0   0 |
| AT&T      Telecom          1   0   0   America    0   0   1 |
| IBM       IT Equipment     0   0   1   America    0   0   1 |
| Siemens   Electronics      0   1   0   Europe     0   1   0 |
| Verizon   Telecom          1   0   0   America    0   0   1 |
| Microsoft Service & S/W    0   0   1   America    0   0   1 |
| EDS       Service & S/W    0   0   1   America    0   0   1 |
Now our model becomes a little messy since it has six dummy variables. In order to avoid perfect multicollinearity, we have to 1) omit two dummy variables, one from each set of dummy variables, 2) omit one dummy variable for ownership areas and impose a restriction for firm type, 3) omit one dummy variable for firm type and impose a restriction for ownership areas, or 4) impose two restrictions, one for firm type and the other for ownership areas. Note that you must not omit the intercept in the two-way fixed effect model. The following is the simplest approach, which omits two dummy variables.
$research_i = \beta_0 + \beta_1 income_i + \delta_1 d1_i + \delta_2 d2_i + \gamma_1 g1_i + \gamma_2 g2_i + \varepsilon_i$
Table 14. Two-Way Fixed Effect Model (LSDV2)
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  5,    33) =    6.19
       Model |  47996204.2     5  9599240.84           Prob > F      =  0.0004
    Residual |  51167501.4    33  1550530.35           R-squared     =  0.4840
-------------+------------------------------           Adj R-squared =  0.4058
       Total |  99163705.6    38   2609571.2           Root MSE      =  1245.2

------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .3008584   .0830277     3.62   0.001     .1319374    .4697795
          d1 |  -2446.278   579.9832    -4.22   0.000    -3626.262   -1266.293
          d2 |   -923.931    503.678    -1.83   0.076    -1948.672    100.8097
          g1 |   1375.542   579.5446     2.37   0.024     196.4499    2554.635
          g2 |   907.2314   570.3879     1.59   0.121    -253.2315    2067.694
       _cons |   1440.654    474.693     3.03   0.005     474.8843    2406.424
------------------------------------------------------------------------------
Note that this model has many parameters to be estimated, compared to the number of observations available. We can draw nine regression equations depending on combinations of three firm types and three areas of ownership: 9 = 3 X 3.
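$research_i = \beta_0 + \beta_1 income_i + 0 + 0 + 0 + 0 + \varepsilon_i$ (American equipment & S/W firms)
$research_i = \beta_0 + \beta_1 income_i + \delta_1 + 0 + 0 + 0 + \varepsilon_i$ (American telecom. firms)
$research_i = \beta_0 + \beta_1 income_i + 0 + \delta_2 + \gamma_1 + 0 + \varepsilon_i$ (Asian electronics firms)
$research_i = \beta_0 + \beta_1 income_i + 0 + \delta_2 + 0 + \gamma_2 + \varepsilon_i$ (European electronics firms)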
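All nine intercepts implied by Table 14 can be tabulated by adding the relevant type and area coefficients to the constant. The Python sketch below assumes, following Table 13, that d1 marks telecom firms, d2 electronics firms, g1 Asian ownership, and g2 European ownership, with American equipment and software firms as the baseline; the dictionary names are illustrative.

```python
# Estimates copied from Table 14 (two-way fixed effect model, LSDV2).
cons = 1440.654
type_effect = {"Telecom": -2446.278, "Electronics": -923.931, "Equipment & S/W": 0.0}
area_effect = {"Asia": 1375.542, "Europe": 907.2314, "America": 0.0}

# Intercept of each (firm type, ownership area) combination.
intercepts = {
    (firm_type, area): cons + t_coef + a_coef
    for firm_type, t_coef in type_effect.items()
    for area, a_coef in area_effect.items()
}

for (firm_type, area), value in sorted(intercepts.items()):
    print(f"{area:8s} {firm_type:16s} {value:10.3f}")
```

Every combination shares the slope of .301 on net income; only the intercepts differ.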
For example, the regression equation for Asian telecommunication companies is
Research = 369.918 + .301*Income = (1,440.654 - 2,446.278 + 1,375.542) + .301*Income
6.2 Full-Model versus Restricted Model
Let us call this two-way fixed effect model the full model or unrestricted model. We have four restricted or nested models that have different subsets of independent variables. Note that the second and third models should be estimated by one of the LSDV approaches.
(1) no fixed effect at all: $research_i = \beta_0 + \beta_1 income_i + \varepsilon_i$ (Model 1)
(2) type effect only: $research_i = \beta_0 + \beta_1 income_i + \delta d_i + \varepsilon_i$ (Model 2)
(3) type effect only: $research_i = \beta_0 + \beta_1 income_i + \delta_1 d1_i + \delta_2 d2_i + \delta_3 d3_i + \varepsilon_i$
(4) area effect only: $research_i = \beta_0 + \beta_1 income_i + \gamma_1 g1_i + \gamma_2 g2_i + \gamma_3 g3_i + \varepsilon_i$
------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .2930783    .097524     3.01   0.005     .0950941    .4910626
          g1 |   788.3469   633.0243     1.25   0.221    -496.7608    2073.455
          g2 |  -29.29548   631.1481    -0.05   0.963    -1310.594    1252.003
       _cons |   996.9815   550.7665     1.81   0.079     -121.134    2115.097
------------------------------------------------------------------------------
Which one is the best model? It is not a good idea to compare F statistics and t-tests for individual parameter estimates. We may use the so-called incremental F-test to examine changes in goodness of fit between the full model (or unrestricted model) and the restricted models (Greene 2000; Fox 1997). This F-test requires the sums of squared errors (SSE), $e'e$, of the unrestricted and restricted models. The null hypothesis is that the parameters of the added regressors (dummies here) are all zero (e.g., $H_0: \delta_1 = \delta_2 = \delta_3 = 0$). The formula of the F-test is
$F(J, N-K) = \frac{(e_*'e_* - e'e)/J}{e'e/(N-K)} = \frac{(R^2 - R_*^2)/J}{(1 - R^2)/(N-K)}$
where $e_*'e_*$ and $R_*^2$ are respectively the SSE and $R^2$ of the restricted model. J is the number of dummy variables that were actually taken out of the full model (e.g., 2 for the second and third restricted models). Keep in mind that the $R^2$ of LSDV1 without the intercept is not well defined, so DO NOT plug the $R^2$ of LSDV1 into the second formula!
Let us compare the full model (Table 14) and the fixed area effect model (Table 15). The F statistic of 8.9005 is large enough to reject the null hypothesis (p<.0008), signaling the superiority of the full model. Adding the two dummy variables reduces SSE ($e'e$) substantially.
$F(J, N-K) = F(2, 33) = 8.9005$
Now consider the full model versus the fixed type effect model (Table 8). The small F statistic indicates that the full model does not improve goodness of fit significantly by including two more variables (p<.0633). Thus, we do not reject the null hypothesis, in favor of the restricted model.
$F(J, N-K) = \frac{(e_*'e_* - e'e)/J}{e'e/(N-K)} = \frac{(60,484,956.6 - 51,167,501.4)/2}{51,167,501.4/(39-6)} = 3.0046 \sim F(2, 33)$
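The incremental F statistic for the full model versus the fixed type effect model can be recomputed from the residual sums of squares in Table 7 (the type effect model) and Table 14 (the full model). This Python sketch is only a check of the arithmetic; the function name `incremental_f` is illustrative.

```python
def incremental_f(sse_restricted, sse_full, j, n, k):
    """Incremental F statistic with (j, n - k) degrees of freedom.

    j is the number of regressors dropped from the full model,
    n the number of observations, and k the number of parameters
    (including the intercept) in the full model.
    """
    return ((sse_restricted - sse_full) / j) / (sse_full / (n - k))

# Restricted: type effect only (SSE from Table 7); full: two-way model (Table 14).
f_type_vs_full = incremental_f(60484956.6, 51167501.4, j=2, n=39, k=6)
print(round(f_type_vs_full, 4))
```

The result is then compared with the F distribution with 2 and 33 degrees of freedom to obtain the p-value.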
How do we compare the fixed type effect models in Table 3 with one dummy (Model 2) and in Table 8 with two dummy variables? In this case, the model with two dummies becomes the full model. A large F statistic allows us to reject the null hypothesis in favor of the full model with two dummies (p<.0080).
$F(J, N-K) = \frac{(74,175,756.7 - 60,484,956.6)/1}{60,484,956.6/(39-4)} = 7.9223 \sim F(1, 35)$
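The same comparison can be made with the $R^2$ form of the incremental F-test. The Python sketch below uses the $R^2$ of Model 2 from Table 3 (.2520) and, for the type effect model with two dummies, the R-square of .3900 reported in the SAS output for LSDV3, which fits the same model with an intercept. It also shows what happens if the $R^2$ of LSDV1 (.7663 in Table 7) is plugged in by mistake. Function and variable names are illustrative.

```python
def f_from_sse(sse_restricted, sse_full, j, n, k):
    """Incremental F computed from residual sums of squares."""
    return ((sse_restricted - sse_full) / j) / (sse_full / (n - k))

def f_from_r2(r2_restricted, r2_full, j, n, k):
    """Incremental F computed from R-squared values (both models need an intercept)."""
    return ((r2_full - r2_restricted) / j) / ((1 - r2_full) / (n - k))

# Model 2 (Table 3) versus the type effect model with two dummies (Tables 7 and 8).
f_sse = f_from_sse(74175756.7, 60484956.6, j=1, n=39, k=4)
f_r2 = f_from_r2(0.2520, 0.3900, j=1, n=39, k=4)

# Misuse: R-squared of LSDV1 without the intercept (Table 7).
f_wrong = f_from_r2(0.2520, 0.7663, j=1, n=39, k=4)

print(round(f_sse, 4), round(f_r2, 4), round(f_wrong, 1))
```

The two valid forms agree up to the rounding of the reported $R^2$ values, whereas the LSDV1 value inflates the statistic by roughly an order of magnitude.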
academic degree. For the master's degree, for instance, only d3 is set to 1, while t1 through t3 are all coded as 1.
Table 16. Data Structure for Threshold Effect Model
+---------------------------------------------------------------+
|  income    effort  degree    d1  d2  d3  d4    t1  t2  t3  t4 |
|---------------------------------------------------------------|
|  13.242   1.44977  Diploma    1   0   0   0     1   0   0   0 |
|  32.983   1.01713  B.A.       0   1   0   0     1   1   0   0 |
|  47.962    .67178  Masters    0   0   1   0     1   1   1   0 |
|  52.048   2.11554  Ph.D.      0   0   0   1     1   1   1   1 |
|  50.528   2.55896  B.A.       0   1   0   0     1   1   0   0 |
|  17.179    .68774  Ph.D.      0   0   0   1     1   1   1   1 |
There are four regression equations depending on degrees. They share the same slope, of course. Note that the intercepts are cumulative in the sense that, from diploma holders up to Ph.D. holders, they are $\beta_0$ (actually $\delta_1$); $\delta_1 + \delta_2$; $\delta_1 + \delta_2 + \delta_3$; and $\delta_1 + \delta_2 + \delta_3 + \delta_4$, respectively.
(1) $income_i = \beta_0 + \beta_1 effort_i + \delta_2 + \delta_3 + \delta_4$ for the Ph.D. degree holders
(2) $income_i = \beta_0 + \beta_1 effort_i + \delta_2 + \delta_3 + 0$ for the master's degree holders
(3) $income_i = \beta_0 + \beta_1 effort_i + \delta_2 + 0 + 0$ for the B.A. degree holders
(4) $income_i = \beta_0 + \beta_1 effort_i + 0 + 0 + 0$ for the diploma holders
It is notable that $\delta_i$ is used to capture the marginal value of the academic degree. For example, $\delta_3$ is the marginal value of the master's degree over the B.A. degree. We may say that master's degree holders on average earn $\delta_3$ more income than B.A. degree holders, holding all other things constant.
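The two coding schemes in Table 16 can be generated mechanically from the ordered degree variable. The Python sketch below is a minimal illustration, assuming the ordering Diploma < B.A. < Masters < Ph.D. shown in the table; `code_degree` is a hypothetical helper name.

```python
# Degrees in ascending order, as in Table 16.
DEGREES = ["Diploma", "B.A.", "Masters", "Ph.D."]

def code_degree(degree):
    """Return (d1..d4, t1..t4) for one observation.

    d_k is an ordinary dummy: 1 only for the degree actually held.
    t_k is a threshold dummy: 1 for the degree held and every lower degree.
    """
    rank = DEGREES.index(degree)
    d = [1 if i == rank else 0 for i in range(len(DEGREES))]
    t = [1 if i <= rank else 0 for i in range(len(DEGREES))]
    return d, t

for degree in DEGREES:
    print(degree, *code_degree(degree))
```

Because the threshold dummies accumulate, the coefficient of each one measures the increment over the next lower degree rather than the distance from a single baseline.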
This model has two regression equations with different slopes and intercepts. You may compare them with those in section 3.
Equipment and Software: Research = 2047.062 + .255*income
Telecom. and Electronics: Research = 1181.956 + .198*income
Table 17. Data Structure for Interaction Effect Model
+----------------------------------------------------+
| firm      type           rd2002  net2000  inc_d  d |
|----------------------------------------------------|
| Samsung   Electronics     2,500    4,768      0  0 |
| AT&T      Telecom           254    4,669      0  0 |
| IBM       IT Equipment    4,750    8,093   8093  1 |
| Siemens   Electronics     5,490    6,528      0  0 |
| Verizon   Telecom             .   11,797      0  0 |
| Microsoft Service & S/W   4,307    9,421   9421  1 |
| EDS       Service & S/W       0    1,143   1143  1 |
The interaction effect turns out to be statistically insignificant at the .05 level (p<.738). Thus, we conclude that the slope of equipment and software companies is not substantially different from that of telecommunications and electronics firms. However, you may not conclude from the small t statistic of the dummy (p<.186) that the intercept of equipment and software firms is statistically insignificant (close to zero). Remember that the parameter indicates the difference between the actual intercepts of the two types of firms.
Table 18. Regression Model with Interaction Effect 1
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    35) =    3.98
       Model |  25227993.1     3  8409331.02           Prob > F      =  0.0153
    Residual |  73935712.5    35  2112448.93           R-squared     =  0.2544
-------------+------------------------------           Adj R-squared =  0.1905
       Total |  99163705.6    38   2609571.2           Root MSE      =  1453.4

------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .1975142   .1015407     1.95   0.060    -.0086244    .4036527
       net_d |   .0571731   .1696052     0.34   0.738    -.2871437    .4014899
           d |   865.1056    641.755     1.35   0.186    -437.7264    2167.938
       _cons |   1181.956   376.7763     3.14   0.003     417.0599    1946.853
------------------------------------------------------------------------------
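The two regression lines of this interaction model follow from Table 18 by switching the dummy on and off. The Python sketch below uses only the reported coefficients; `regression_line` and the variable names are illustrative.

```python
# Estimates copied from Table 18.
b_income = 0.1975142   # slope of net2000 for the baseline group
b_inter = 0.0571731    # coefficient of the interaction term net_d
b_dummy = 865.1056     # coefficient of the dummy d
cons = 1181.956        # intercept of the baseline group

def regression_line(d):
    """Return (intercept, slope) for d = 1 (equipment & software) or d = 0."""
    return cons + b_dummy * d, b_income + b_inter * d

equipment = regression_line(1)
telecom = regression_line(0)
print([round(v, 3) for v in equipment], [round(v, 3) for v in telecom])
```

The dummy shifts the intercept and the interaction term shifts the slope, so each group gets its own line from a single pooled regression.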
8.2 Regression with Different Slope and the Same Intercept
Now exclude the dummy variable so that only the regressor and the interaction term remain in the model. This model produces two regression equations with different slopes and the same intercept, which are less likely in the real world.
$research_i = (\beta_0 + \delta) + (\beta_1 + \beta_2) income_i + \varepsilon_i$.
$research_i = \beta_0 + \beta_1 income_i + \beta_2 inc\_d_i + \varepsilon_i$
Equipment and Software: Research = 1480.15 + .353*income
Telecom. and Electronics: Research = 1480.15 + .146*income
The t statistic of 1.59 for the interaction term indicates that there is no statistically significant interaction effect (p<.120). Note that the SEE, the square root of MSE, is larger than that of every other model discussed so far except Model 1 (1,469.8 versus 1,500.1).
Table 19. Regression Model with Interaction Effect 2
      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    36) =    4.95
       Model |  21389278.1     2  10694639.1           Prob > F      =  0.0126
    Residual |  77774427.5    36  2160400.76           R-squared     =  0.2157
-------------+------------------------------           Adj R-squared =  0.1721
       Total |  99163705.6    38   2609571.2           Root MSE      =  1469.8

------------------------------------------------------------------------------
      rd2002 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     net2000 |   .1463857   .0952541     1.54   0.133    -.0467987      .33957
       net_d |   .2067402   .1297268     1.59   0.120    -.0563579    .4698383
       _cons |    1480.15   308.4474     4.80   0.000     854.5894     2105.71
------------------------------------------------------------------------------
Figure 3 compares the two regression models with interaction effects. The left plot depicts regression equations with different slopes and intercepts. The regression equations on the right have different slopes but the same intercept.
Figure 3. Regression Model with Interaction Effect
8.3 Limitation and Further Direction
The regression model with interaction effects is likely to suffer from multicollinearity because the interaction term tends to be highly correlated with the dummy variable. The more interaction terms are included in the model, the more likely the multicollinearity problem is to become severe. If two groups show different disturbance variances, the pooled regression may result in a biased estimate of the disturbance variance and an incorrect estimate of the covariance matrix (Greene 2000: 323). This is the case for the model of groupwise heteroscedasticity, an example of the random group effect model for panel data.
(2) $(\alpha_1 + \gamma_1) + (\beta_1 + \delta_1) t_2^* = (\alpha_1 + \gamma_1 + \gamma_2) + (\beta_1 + \delta_1 + \delta_2) t_2^*$ at the age of 27
Note that $t_1^*$ and $t_2^*$ respectively represent the threshold values, often called knots (19 and 27 in this case).
Interaction is different from correlation in the sense that regressors may jointly affect the dependent variable regardless of whether they are correlated or not (Fox 1997).
As shown in the last equation, we have to create two new variables: one for $d_1(age_i - t_1^*)$ and the other for $d_2(age_i - t_2^*)$. Finally, run OLS to estimate the spline regression model. We may test the hypotheses on the knots: $\delta_1 = 0$, $\delta_2 = 0$, or $\delta_1 = \delta_2 = 0$. The SAS script for this spline regression will be:
PROC REG;
   MODEL income = age age19 age27;
   TEST age19=0, age27=0;   /* joint test of the two knot parameters */
RUN;
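The two spline variables used in the SAS script, age19 and age27, can be built from age and the knots before running the regression. The Python sketch below assumes that each dummy switches on once age reaches its knot, so that $d_k(age_i - t_k^*)$ equals the positive part of $age_i - t_k^*$; `spline_terms` is a hypothetical helper name.

```python
def spline_terms(age, knots=(19, 27)):
    """Return the spline regressors d_k * (age - knot_k) for each knot.

    The dummy d_k is 1 when age is at or above the knot and 0 otherwise,
    so each term is zero below its knot and grows linearly above it.
    """
    return [max(age - knot, 0) for knot in knots]

# Build (age, age19, age27) rows for a few illustrative ages.
for age in (18, 23, 30):
    age19, age27 = spline_terms(age)
    print(age, age19, age27)
```

Because each term is exactly zero at its own knot, the fitted line changes slope there without a jump, which is the continuity condition stated above.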
10. Conclusion
Using dummy variables in regression analysis is useful for capturing fixed/random effects. This technique is able to explain how group/time differences affect models. However, it must be used with caution. First, keep in mind that each LSDV approach gives the dummy parameters a different interpretation, and that the t-tests have different null hypotheses. Otherwise, you may be badly misled and end up with a wrong conclusion. Second, be parsimonious by minimizing the number of dummies, especially when you do not have many observations. Avoid the problem of many parameters with a small sample size. Try to hit the highlights, focusing on your main arguments. The third point is related to the second. Be careful not to be caught in the dummy variable trap, perfect multicollinearity. As you include more dummies, the likelihood of trouble increases sharply. Finally, do not try to compare monkey and lemon. Categories should have something in common with each other so that the comparison is meaningful from analytic and theoretical perspectives. Comparing apple and pear is better than contrasting apple and onion. By the same token, telecommunications versus electronics firms makes much more sense than telecommunications firms versus universities.
