Lec11-Stata Regression
Relational Operators:
–Less than: <
–Greater than: >
–Less than or equal to: <=
–Greater than or equal to: >=
–Equals: ==
–Does not equal: !=
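As a small sketch of how these operators are used inside an if condition, the commands below combine them with & ("and") and | ("or"). The auto dataset loaded with sysuse is an assumption made for illustration, and the cutoff values are arbitrary.

```stata
* Load the example dataset shipped with Stata (assumed here for illustration)
sysuse auto, clear

* Cars priced at 10,000 or more
list make price if price >= 10000

* Foreign cars that also get more than 30 mpg ("and")
list make mpg if foreign == 1 & mpg > 30

* Count cars that are very light or very heavy ("or")
count if weight < 2000 | weight > 4500

* Count every car except those with exactly 25 mpg
count if mpg != 25
```

Note the double equals sign in a comparison (foreign == 1); a single = is reserved for assignment, as in generate.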
describe
If you want to learn more about the data file, you could list all or some of the observations. For
example, below we list the first five observations.
•If we want the names of the cars whose weight is between 1000 and 2000 pounds:
–list make if weight > 1000 & weight < 2000
–What if we also wanted weight listed with their name?
•If we want a list of cars and their miles per gallon (mpg) whose mpg is less than 20 or over 30:
–list make mpg if mpg < 20 | mpg > 30
list in 1/5
list make price mpg in 1/10
summarize acs_k3, detail
tabulate acs_k3
list snum dnum acs_k3 if acs_k3 < 0
The in qualifier tells Stata the range of observations to which the command applies.
list [variable name] in 5 lists the 5th observation of the variable
list [variable name] in 5/10 lists the 5th through 10th observations
list [variable name] in -3 lists the third-from-last observation
list [variable name] in 5/l lists from the 5th to the last observation (l is the letter "ell", short for last)
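The ranges above can be tried out on any dataset. A minimal sketch, assuming the auto dataset that ships with Stata:

```stata
* Example data (an assumption; any dataset in memory would do)
sysuse auto, clear

* First three observations (f is shorthand for "first")
list make price in f/3

* A single observation: the 10th
list make price in 10

* Last three observations: from third-from-last to last
list make price in -3/l
```

Because in refers to the current sort order, running sort before list changes which observations are shown.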
Let’s make a table of rep78 by foreign to look at the repair histories of the foreign and domestic
cars.
Let’s make the above table using the column and nofreq options. The column option requests column
percentages while the nofreq option suppresses cell frequencies. Note that column and nofreq come after the
comma. These are options on the tabulate command and options need to be placed after a comma.
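The two tables described above can be sketched as follows, assuming the auto dataset (the variables rep78 and foreign belong to it):

```stata
* Example data containing rep78 and foreign (assumed to be the auto dataset)
sysuse auto, clear

* Two-way table of repair record by car origin, with cell frequencies
tabulate rep78 foreign

* Same table: column percentages only, cell frequencies suppressed
tabulate rep78 foreign, column nofreq
```

Everything before the comma says what to tabulate; everything after it changes how the table is displayed.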
The use of if is not limited to the tabulate command. Here, we use it with the list command.
If we wanted to include just the valid (non-missing) observations that are greater than or equal to 4, we
can do the following to tell Stata we want only observations where rep78 >= 4 and rep78 is not missing.
summarize
summarize, detail
summarize price if inrange(rep78,3,5)
summarize price if rep78 >= 3 & !missing(rep78)
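The !missing(rep78) condition matters because Stata treats a numeric missing value (.) as larger than any number, so a condition such as rep78 >= 4 is also true for observations where rep78 is missing. A minimal sketch of the difference, assuming the auto dataset:

```stata
* Example data (assumed); rep78 has some missing values in this dataset
sysuse auto, clear

* Counts observations with rep78 of 4 or 5 AND those with rep78 missing
count if rep78 >= 4

* Counts only the valid (non-missing) observations with rep78 >= 4
count if rep78 >= 4 & !missing(rep78)

* inrange() is false for missing values; 5 is the largest rep78 value here
count if inrange(rep78, 4, 5)

* List the valid observations only
list make rep78 if rep78 >= 4 & !missing(rep78)
```

The first count is larger than the other two; the difference is the number of cars with a missing repair record.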
Scatter Plot:
predict e, residual
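No scatter command accompanies this heading, so the following is only a sketch of how a scatter plot and a residual plot might be drawn. It assumes the auto dataset and a simple regression of price on mpg; neither choice comes from the notes.

```stata
* Example data and model (both are assumptions for illustration)
sysuse auto, clear

* Basic scatter plot: y variable first, x variable second
scatter price mpg

* Scatter plot with the fitted regression line overlaid
twoway (scatter price mpg) (lfit price mpg)

* Fit the regression, save the residuals, and plot them against the predictor
regress price mpg
predict e, residual
scatter e mpg, yline(0)
```

predict must be run after regress, because it uses the most recent estimation results.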
To draw a histogram
histogram acs_k3
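graph [varname], histogram bin(#)
# is the number of intervals (bins) we want to specify (#=5 is the default)
Note that we would have obtained the same by typing
graph [varname], bin(#)
(This is the older graph syntax from Stata 7 and earlier; in current versions use histogram [varname], bin(#).)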
histogram enroll
histogram enroll, normal bin(20)
histogram enroll, normal bin(20) xlabel(0(100)1600)
ladder enroll
gladder enroll
generate lenroll = log(enroll)
hist lenroll, normal
REGRESSION ANALYSIS
regress dep_var x1 x2 x3
* Before mpg 21
regress price mpg if mpg < 21
generate lninc=log(inc)
generate d=1 if price>3000
replace d=0 if d==.
The same result could be obtained with the following single command in Stata (instead of the two commands above):
generate d=(price>3000)
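A quick way to check that the two-step and one-step approaches agree is to build both and compare them. This is a sketch assuming the auto dataset and the same condition (price > 3000) in both forms; the names d_alt and d_safe are hypothetical.

```stata
* Example data (assumed)
sysuse auto, clear

* Two-step version: set to 1 where the condition holds, then fill in 0
generate d = 1 if price > 3000
replace d = 0 if d == .

* One-step version: a true condition evaluates to 1, a false one to 0
generate d_alt = (price > 3000)

* Stops with an error if the two variables differ anywhere
assert d == d_alt

* Both versions code a missing price as 1, because missing counts as
* larger than any number; this variant leaves such cases missing instead
generate d_safe = (price > 3000) if !missing(price)

tabulate d
```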
generate mpg2 = mpg*mpg
generate lprice = log(price)
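The multiple linear regression model is estimated by OLS with the regress command. For
example,
webuse auto (or sysuse auto.dta)
regress mpg weight displacement
Estimation can be restricted to a subsample with an if condition:
regress mpg weight displacement if weight > 3000
where only cars heavier than 3000 lb are considered. The Eicker-Huber-White covariance
is reported with the option robust:
regress mpg weight displacement, robust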
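Confidence intervals are reported at the 95% level by default. A different level can be requested with the
level() option, e.g. regress mpg weight displacement, level(99).
As an alternative, you could use the set level command before regress:
. set level 99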
F-tests for one or more restrictions are calculated with the post-estimation command test.
For example
test weight displacement
tests H0: β1 = 0 and β2 = 0 against HA: β1 ≠ 0 or β2 ≠ 0.
New variables with residuals and fitted values are generated by
predict uhat if e(sample), resid
predict pricehat if e(sample)
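Put together, the estimation, test, and prediction steps might look like the sketch below. It assumes the auto dataset and the mpg regression used above; the name mpghat is a hypothetical choice that matches the dependent variable.

```stata
* Example data and model (assumed to match the regression above)
sysuse auto, clear
regress mpg weight displacement

* Joint F-test that both slope coefficients are zero
test weight displacement

* test leaves its results in r(); show the F statistic and p-value
display "F = " r(F) ",  p-value = " r(p)

* Residuals and fitted values, restricted to the estimation sample
predict uhat if e(sample), resid
predict mpghat if e(sample)

* With a constant in the model, OLS residuals average to (about) zero
summarize uhat mpghat
```

The if e(sample) restriction matters when the regression used only part of the data, for example because of an if condition or missing values; without it, predict also fills in fitted values for observations that were not used in estimation.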
VIF & Tolerances. Use the vif command to get the variance inflation factors (VIFs) and the tolerances (1/VIF).
vif is one of many post-estimation commands. You run it AFTER running a regression. It uses information Stata has stored
internally.
. vif
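A minimal sketch of the sequence, assuming the auto dataset and an arbitrary choice of predictors:

```stata
* Example data and model (assumptions for illustration)
sysuse auto, clear
regress price mpg weight displacement

* Variance inflation factors and tolerances (1/VIF) for the model just fitted
vif

* Equivalent post-estimation form in current Stata versions
estat vif
```

A common rule of thumb treats a VIF above 10 (tolerance below 0.1) as a sign of problematic collinearity, although the cutoff is a convention rather than a formal test.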
–You need to clear the data first if you are moving between different data sets
»Either use clear and then sysuse census...
»Or type sysuse census, clear to do the same thing
•generate – this command allows us to create new variables
–We want to know the share of the adult (18 years and older) population for each state
•generate adultpop = pop18p / pop
–What we did was to create the adult share by dividing the population of adults by the total population (multiply by 100 for a percentage)
–sum pop18p
Can you create a variable named child that shows the total population of children (0-17 years old)?
Let’s create a new variable named above that flags states that are above the average adult population
share. How?
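One possible answer to both exercises is sketched below. It assumes the census dataset loaded with sysuse census and computes the number of children as total population minus the adult population; other approaches are possible.

```stata
* Census example data, replacing whatever is in memory
sysuse census, clear

* Adult share of the population (a proportion between 0 and 1)
generate adultpop = pop18p / pop

* Children = everyone who is not 18 or older
generate child = pop - pop18p

* summarize stores the mean in r(mean), which the next command uses
summarize adultpop

* 1 if the state's adult share is above the average, 0 otherwise
generate above = (adultpop > r(mean)) if !missing(adultpop)

tabulate above
```

Note that r(mean) here is the unweighted average of the state shares, not the national adult share.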
------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      acs_k3 |  -2.681508   1.393991    -1.92   0.055    -5.424424    .0614073
       meals |  -3.702419   .1540256   -24.04   0.000    -4.005491   -3.399348
        full |   .1086104    .090719     1.20   0.232    -.0698947    .2871154
       _cons |   906.7392   28.26505    32.08   0.000     851.1228    962.3555
------------------------------------------------------------------------------
Let’s focus on the three predictors, whether they are statistically significant and, if so, the
direction of the relationship. The average class size (acs_k3, b=-2.68) is not statistically
significant at the 0.05 level (p=0.055), but only just so. The coefficient is negative, which would
indicate that larger class size is related to lower academic performance, which is what we
would expect. Next, the effect of meals (b=-3.70, p=.000) is significant and its coefficient is
negative, indicating that the greater the proportion of students receiving free meals, the lower the
academic performance. Please note that we are not saying that free meals are causing lower
academic performance. The meals variable is highly related to income level and functions more
as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic
performance. This result also makes sense. Finally, the percentage of teachers with full
credentials (full, b=0.11, p=.232) seems to be unrelated to academic performance. This would
seem to indicate that the percentage of teachers with full credentials is not an important factor in
predicting academic performance; this result was somewhat unexpected.
Should we take these results and write them up for publication? From these results, we would
conclude that lower class sizes are related to higher performance, that fewer students receiving
free meals is associated with higher performance, and that the percentage of teachers with full
credentials was not related to academic performance in the schools. Before we write this up for
publication, we should do a number of checks to make sure we can firmly stand behind these
results. We start by getting more familiar with the data file, doing preliminary data checking,
looking for errors in the data.
First, let’s use the describe command to learn more about this data file. We can verify how many
observations it has and see the names of the variables it contains. To do this, we simply type
describe
We will not go into all of the details of this output. Note that there are 400 observations and 21
variables. We have variables about academic performance in 2000 and 1999 and the change in
performance, api00, api99 and growth respectively. We also have various characteristics of the
schools, e.g., class size, parents' education, percent of teachers with full and emergency
credentials, and number of students. Note that when we did our original regression analysis it
said that there were 313 observations, but the describe command indicates that we have 400
observations in the data file.
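A gap like this usually means that some observations have missing values, since regress drops any observation with a missing value in the dependent variable or in any predictor. The sketch below shows how this can be checked. It uses the auto dataset as a stand-in (an assumption, chosen because rep78 has missing values there); for the school data the same commands would be run on api00, acs_k3, meals and full.

```stata
* Stand-in data: rep78 is missing for some cars (assumed for illustration)
sysuse auto, clear

* Number of observations in the file
describe, short

* Which of the model variables have missing values, and how many
misstable summarize price mpg rep78

* Observations that are complete on every model variable;
* missing() with several arguments is true if any of them is missing
count if !missing(price, mpg, rep78)

* The regression uses exactly those complete observations
regress price mpg rep78
display "Observations used in the regression: " e(N)
```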
If you want to learn more about the data file, you could list all or some of the observations. For
example, below we list the first five observations.
Another dataset:
clear
use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, beta
correlate api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
pwcorr api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, obs sig
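The three commands above differ mainly in their options and in how they handle missing values. The sketch below shows the same pattern on the auto dataset, used as a stand-in so that it runs without an internet connection; the choice of variables is an assumption.

```stata
* Stand-in data (assumed); rep78 has missing values
sysuse auto, clear

* beta: report standardized coefficients in place of confidence intervals
regress price mpg weight rep78, beta

* correlate drops every observation with a missing value in ANY listed variable
correlate price mpg weight rep78

* pwcorr uses all observations available for each pair of variables;
* obs prints the number of observations, sig the significance level
pwcorr price mpg weight rep78, obs sig
```

When variables have different numbers of missing values, correlate and pwcorr can report different correlations for the same pair, so the obs option is a useful check.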