Econometrics Computer Exercise Week 1: Introduction Stata + Simple Regression Model


Amsterdam School of Economics

Faculty of Economics and Business

Econometrics

Computer exercise week 1: Introduction Stata + simple regression model

This computer exercise covers some aspects of basic statistics and the simple regression model
(Chapters 4 and 5 of Stock & Watson). The exercise has three parts. In the first part you will learn
some basic elements of Stata, such as importing data, creating graphs and tables, and calculating
descriptive statistics. In the second and third parts you will run regressions on artificial data (a
so-called Monte Carlo analysis) and do some additional analyses.

In this course you will do your statistical and econometric analyses using Stata. Stata is available
on the university computers but can also be downloaded to your own computer (see Canvas for
instructions). Make sure you have access to Stata either way. Start Stata by clicking on the Stata icon.
If it is not on your screen, go to Start, Programs and click Stata XX (XX = version installed). After
opening Stata you will see a menu at the top of your screen and several windows. The History window
displays the most recent commands. You can give commands either via the menu or by typing them
directly in the Command window. The latter requires some knowledge of the Stata language, which is
unavoidable because not everything can be done via the menu. When you give a command via the menu,
however, the History window shows the corresponding Stata syntax, so you learn on the job. The
Results window is the primary display of all your results. You cannot edit in this window, but you
can easily transfer results to other programs such as Word by selecting and copying them (Edit, Copy)
and pasting them (Edit, Paste) into your Word document. Use the Courier font in Word and decrease the
font size so that the Stata output stays properly aligned. Alternatively, you can create an output
file using "log using filename, text" (cf. the short guide to Stata, available on Canvas).
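The log-file route can be sketched as follows. This is a minimal example: the file name ce1_week1.log is hypothetical, and the log is written to Stata's current working directory.

```stata
* Hypothetical file name; "replace" overwrites an existing log with that name
log using ce1_week1.log, text replace

* Everything shown in the Results window from here on is also written to the log
display "commands and output go here"

* Stop logging and close the file
log close
```

The resulting text file can be opened in any editor or pasted into Word.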

Be aware that Stata contains an extensive help file (see Help at the top of your screen). Furthermore,
a brief Stata manual is available on Canvas that contains all commands needed in the course
Econometrics. This manual also discusses importing data from, e.g., Excel. Stata works with its own
data files (which have the extension .dta). It is possible to read in data from other sources (see
File - Import for the complete list).

1. We are going to read the California Test Score Data Set from Stock & Watson (S&W, see
Appendix 4.1 for a short description) and reproduce some results from Chapters 4 and 5. You can
download the appropriate Stata data file (extension .dta) from the Canvas page of this course.
Right-click the file caschool.dta, choose Save Target As and choose a location to store the data.

2. In Stata choose File, Open from the menu and select the appropriate file to open. Alternatively,
type use "c:\...\filename" in the Command window, where you replace the second part by the
location of your file and your file name.

3. You now have access to the California Test Score data. We are going to analyze the variables
testscr (test score) and str (student-teacher ratio) and calculate some descriptive statistics. Click
Data, Describe data, List data in the menu and indicate that we want to see the variables
"testscr str" (write the variable names separated by a space). After clicking OK you see the
values for 420 school districts in California in 1998, one by one.

4. You will not learn much from inspecting the individual values only. Continue with: Statistics,
Summaries, tables, & tests, Summary and descriptive statistics, Summary statistics,
and type “testscr str” under Variables. This gives the sample mean, standard deviation, minimum
and maximum of both series. Read the description of the data in the book to find out the unit of
measurement of both variables. Compare your results with Table 4.1. An alternative (and perhaps
more convenient) method is to type sum testscr str in the Command window. Try this.

5. Choose the same command as above, but now choose Display additional statistics under
Options. You will get some additional features of the marginal distributions of both variables,
including percentiles. Compare these with the numbers in Table 4.1 and verify that you
understand all concepts listed in this table. Alternatively, you can type sum testscr str, d in
the Command window. Here sum is an abbreviation of summarize and d of the option detail. Many
Stata commands and options can be abbreviated; the help file of a command shows the shortest
abbreviation that is allowed.

6. We continue with Statistics, Summaries, tables, & tests, Summary and descriptive statistics,
Correlations & covariances to view sample correlations. To view sample covariances, repeat
the steps but choose the option Display covariances under Options. The correlation between
the two variables is negative (-0.23).

7. Choose Graphics, Twoway graph, Create. In the new window choose Basic plots and Scatter,
with “testscr” as Y variable and “str” as X variable. You now have a scatter diagram without
regression line. It is the same picture as Figure 4.2 of S&W. The picture does not show a
strong relation between the two variables, as already suggested by the only moderate sample
correlation coefficient (-0.23).

8. To create a scatter plot with regression line (Figure 4.3 of S&W), again choose Graphics,
Twoway graph, Create, but now choose Fit plots and Linear prediction, with “testscr” as
Y variable and “str” as X variable.

9. Choose Statistics, Linear models and related, Linear regression, select testscr and str
as dependent and independent variables and click OK. You now run the regression of testscr on a
constant (automatically included) and str. The output table contains a lot of information; please
check the document Stataoutput.pdf on Canvas for a precise description of all numbers.

10. Finally, choose the same command as above, but under SE/Robust choose the option Robust to
obtain heteroskedasticity-robust standard errors (see Section 5.4 of S&W). By default Stata
calculates homoskedasticity-only standard errors.

11. Close this session by typing clear in the Command window and pressing Enter. This removes all
current data from memory.
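For reference, the menu steps 2 to 11 correspond roughly to the commands below. This is a sketch rather than the official do-file: it assumes that caschool.dta is stored in Stata's current working directory, and it overlays the scatter plot and the fitted line in one graph to mimic Figure 4.3.

```stata
* Assumption: caschool.dta is in the current working directory;
* otherwise give the full path to the file
use "caschool.dta", clear

* Step 3: list the two variables (first 10 districts only, to keep it short)
list testscr str in 1/10

* Steps 4-5: summary statistics, then the detailed version with percentiles
summarize testscr str
summarize testscr str, detail

* Step 6: sample correlations, then sample covariances
correlate testscr str
correlate testscr str, covariance

* Step 7: scatter plot
twoway (scatter testscr str)

* Step 8: scatter plot with the fitted regression line
twoway (scatter testscr str) (lfit testscr str)

* Step 9: OLS with homoskedasticity-only standard errors (the default)
regress testscr str

* Step 10: OLS with heteroskedasticity-robust standard errors
regress testscr str, vce(robust)

* Step 11: remove the data from memory
clear
```

Comparing the two regress outputs shows that the coefficient estimates are identical and only the standard errors (and everything derived from them) change.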

Above we mainly used the menu at the top of the screen to give commands. In the rest of this
exercise we will instead use the Command window as much as possible.
This requires knowledge of the Stata language, which can be acquired via the Help menu (accessible
via the main menu or by typing help in the Command window).
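For example, the help files of the commands used so far can be opened as follows (a small illustration; any command name can follow help):

```stata
* Each line opens the help file of the named command in the Viewer
help summarize
help correlate
help regress
```

The syntax diagram at the top of each help file lists the available options and marks the shortest abbreviation of the command.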

In the second part of this exercise you will generate your own data. Hence the population
regression model (see formula 4.3, S&W) is known to us, and we will make sure it satisfies the
assumptions ASS #1 – ASS #3 (Section 4.4, S&W). In this case the accuracy of estimation methods
like OLS can be analyzed, because we can compare estimates with true values. Here you will get some
sense of the unbiasedness property of OLS. The use of artificial data to investigate the properties
of estimators is known as a Monte Carlo analysis.
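In generic notation, the model, the estimation error and the unbiasedness property can be written as below. This is a sketch: the sign convention for the estimation error (estimate minus true value) is an assumption made here, and the specific parameter values are left for the steps that follow.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_i + u_i, \qquad i = 1, \dots, n, \\
\text{estimation error of } \hat{\beta}_j &= \hat{\beta}_j - \beta_j, \qquad j = 0, 1, \\
E(\hat{\beta}_j) &= \beta_j \qquad \text{(unbiasedness)}.
\end{align*}
```
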
12. We are going to generate our own data. To make sure that your data are unique (different
from those of other students) you have to set the seed. The seed is the number with which Stata
initializes its algorithm for generating pseudo-random numbers. Use your student number as the
seed. For example, if your student number is 10600564, type set seed 10600564 in the Command
window and press Enter to execute the command. Another advantage of setting the seed is that your
results will be reproducible.

13. Type set obs 1000 and press Enter to set the number of observations in the current data set
to 1000.

14. Type generate x = 10 + 2*rnormal() and press Enter. The resulting one thousand observations
on x are independent draws from the standard normal distribution, multiplied by 2 and shifted
by 10, so x is normally distributed with mean 10 and standard deviation 2. Next, type
generate u = rnormal(). The resulting 1000 error terms are independent of each other and of the
values of x. In addition, the error terms are homoskedastic (i.e. they have a constant variance)
and normally distributed (see Sections 5.4 and 5.6 of S&W). Finally, type generate y = 10 + x + u.
Analyse the resulting sample of observations on x and y, e.g. via a table and a graph.

15. Perform the least squares regression of y on a constant and x with regress y x (a constant term
is added automatically). What are the true values of the population parameters β0 and β1
in the regression model for y? What do you find for the estimation errors of β0 and β1 (i.e.
the differences between the population and sample values)? Also make the scatter diagram with
regression line using twoway (lfit y x) (scatter y x).

16. Generate xx=10+2*rnormal(), uu=rnormal() and yy=10+xx+uu. Regress yy on a constant and
xx. We have exactly the same model, but a new sample of 1000 observations. Compare these
regression results with the earlier results. What are the estimation errors for β0 and β1 in this
sample? Why are these estimation errors different from the ones calculated in step 15?

17. Try to understand the meaning of unbiasedness of the OLS coefficient estimators. Suppose you
repeat this experiment many times. In each replication you will have a new sample and hence new
estimates. What is the average estimation error? And what will be the average of all estimates?

18. Reflect on the differences between a regression on real data and a regression on Monte Carlo
data. Is it possible to calculate the estimation errors for a regression on real data?

19. Redo the regression of y on x (and a constant). Test the hypothesis β1 = 0.95 against a two-sided
alternative at the 5% significance level. Do you reject or not? In this case you know that the null
hypothesis is not true (why?). Do you make an error? If so, is this a Type 1 or Type 2 error?

20. Test the hypothesis β1 = 1 against a two-sided alternative at the 5% significance level. Do you
reject or not? Compare the outcome of the test (reject or not reject) with your knowledge of the
population regression model. Do you make an error? If so, is this a Type 1 or Type 2 error? And
finally, given the significance level of 5%, how many students present today will make an error?

21. Again regress y on x and a constant term, but now use only the first 10 observations of the
sample. To do so, use: regress y x in 1/10. Calculate the estimation errors in the regression
coefficients again. Are estimation errors always larger with 10 (instead of 1000) observations?

22. Explain why the 95% confidence intervals for the parameters β0 and β1 are much wider for n=10
than for n=1000.
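The commands for steps 12 to 21 can be collected as in the sketch below, for instance in the do-file editor. Several things here are assumptions rather than part of the exercise: the seed is the example student number from step 12 (use your own), the variable names x_mc and y_mc, the scalars sum_b0 and sum_b1 and the choice of 200 replications are illustrative, and the loop at the end is just one simple way to mimic "repeating the experiment many times" from step 17.

```stata
* Steps 12-14: generate the artificial sample
clear
set seed 10600564
set obs 1000
generate x = 10 + 2*rnormal()
generate u = rnormal()
generate y = 10 + x + u
summarize x u y

* Step 15: OLS regression and scatter plot with fitted line
regress y x
twoway (lfit y x) (scatter y x)

* Step 16: a second, independent sample from the same model
generate xx = 10 + 2*rnormal()
generate uu = rnormal()
generate yy = 10 + xx + uu
regress yy xx

* Steps 19-20: two-sided tests on the slope after redoing the regression
regress y x
test x = 0.95
test x = 1

* The first test by hand: t statistic and two-sided 5% critical value
display "t statistic    = " (_b[x] - 0.95)/_se[x]
display "critical value = " invttail(e(df_r), 0.025)

* Step 21: use only the first 10 observations
regress y x in 1/10

* Step 17 (illustration): draw 200 new samples of 1000 observations and
* average the OLS estimates. Run this block in one go, and last, because
* it uses up random numbers and overwrites the stored regression results.
local reps 200
scalar sum_b0 = 0
scalar sum_b1 = 0
forvalues r = 1/`reps' {
    capture drop x_mc y_mc
    generate x_mc = 10 + 2*rnormal()
    generate y_mc = 10 + x_mc + rnormal()
    quietly regress y_mc x_mc
    scalar sum_b0 = sum_b0 + _b[_cons]
    scalar sum_b1 = sum_b1 + _b[x_mc]
}
drop x_mc y_mc
display "average intercept estimate: " sum_b0/`reps'
display "average slope estimate:     " sum_b1/`reps'
```

The test command reports an F statistic, which for a single restriction is the square of the t statistic computed by hand. The two averages at the end can be compared with the parameter values used in the generate command for y.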

In the third part of this exercise you will use the current population regression model to evaluate
some algebraic properties of OLS. The focus is on scaling and on several sample covariances and
correlations. Regarding the latter, see Stock & Watson, Section 3.7, for some definitions. You are
encouraged to derive all results below analytically to enhance your understanding of the OLS method.

23. Generate w=2*x. Perform the regression of y on a constant and w. What are the differences and
similarities with the first regression of y on a constant and x? Derive this analytically. Create
residuals by predict resid, residuals.

24. Use correlate y x w resid, covariance to calculate the sample covariances of the variables y,
x, w and resid. Explain why Var(w)=4Var(x). Verify that Var(resid) is equal to the residual sum
of squares of the last regression (the entry Residual in the column SS) divided by 999 (=n-1).
Verify that Var(y) is equal to the total sum of squares of the last regression (the entry Total in
the column SS) divided by 999 (=n-1).

25. Explain why Cov(w,resid)=0 and Cov(x,resid)=0. Also show that Cov(y,resid)=Var(resid).
Also compute the sample correlation coefficients with correlate y x w resid and explain why
Cor(w,x)=1, Cor(w,resid)=0 and Cor(x,resid)=0.
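Steps 23 to 25 can be carried out with the commands below. This is a sketch under a few assumptions: x and y from step 14 are still in memory, the variables w and resid do not exist yet, and the names reg_x and reg_w for the stored results are illustrative.

```stata
* Reference regression of y on x; keep the results for comparison
regress y x
estimates store reg_x

* Step 23: rescale the regressor, rerun the regression, create residuals
generate w = 2*x
regress y w
predict resid, residuals

* Step 24: the two sums of squares from the regression on w, divided by n-1
display "RSS/(n-1) = " e(rss)/(e(N) - 1)
display "TSS/(n-1) = " (e(mss) + e(rss))/(e(N) - 1)
estimates store reg_w

* Step 24: sample variances and covariances to compare with these numbers
correlate y x w resid, covariance

* Step 25: sample correlations
correlate y x w resid

* Step 23: both regressions side by side, with standard errors
estimates table reg_x reg_w, se stats(N r2 rmse)
```

The side-by-side table makes it easy to see which entries change with the scaling of the regressor and which do not, before deriving the result analytically.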

A final remark.
Stata do-files allow you to run a large number of Stata commands consecutively. For this (and
every subsequent) computer exercise a do-file is available on Canvas. You can run it by typing do
CE1.do in Stata. Stata also has a do-file editor in which you can create, change and run a
do-file.
