
Lec11-Stata Regression

The document discusses multiple linear regression analysis using Stata. It uses a dataset on school performance to predict the variable api00 based on acs_k3, meals, and full. Running the regression in Stata, it finds the model to be statistically significant with an R-squared of 0.67, indicating the variables explain a large portion of the variation in api00.

Uploaded by

acegi3476

Multiple linear regression:

Steps for Running Regression:

1. Examine descriptive statistics
2. Look at relationships graphically and test correlation(s)
3. Run and interpret regression
4. Test regression assumptions

If you need help with any command name, just type:

–help (command name)

Logical Operators:
–Less than: <
–Greater than: >
–Less than or equal to: <=
–Greater than or equal to: >=
–Equals: ==
–Does not equal: !=

sysuse auto.dta (or webuse auto; either command loads the example auto data)

describe

If you want to learn more about the data file, you could list all or some of the observations. For
example, below we list the first five observations.

If we want the names of the cars whose weight is between 1000 and 2000 pounds:
–list make if weight > 1000 & weight < 2000
–What if we also wanted weight listed with their name? Add weight to the variable list:
–list make weight if weight > 1000 & weight < 2000
•If we want a list of cars and their mileage per gallon (mpg) whose mpg is less than 20 or over 30:
–list make mpg if mpg < 20 | mpg > 30
list in 1/5
list make price mpg in 1/10
summarize acs_k3, detail
tabulate acs_k3
list snum dnum acs_k3 if acs_k3 < 0
The in qualifier tells Stata the range of observations over which we want to apply the command.
list [variable name] in 5 lists the 5th observation of the variable
list [variable name] in 5/10 lists from the 5th to the 10th observation
list [variable name] in -3 lists the third-from-last observation
list [variable name] in 5/l lists from the 5th to the last observation (l is the lowercase letter L, short for last)

To label variables:

label variable [variable name] "comment"

This command allows you to document your data set by attaching a short description (at most 80 characters long) to a variable.

Using keep/drop to eliminate variables

keep make rep78 foreign mpg price

keep make price mpg

drop displ gear_ratio

Using keep if/drop if to eliminate observations


drop if missing(rep78)

keep if (rep78 <= 3)

Eliminating variables and/or observations with use


use make mpg price rep78 using auto

use auto if (rep78 <= 3)

use make mpg price rep78 using auto if (rep78 <= 3)

Let’s make a table of rep78 by foreign to look at the repair histories of the foreign and domestic
cars.

tabulate rep78 foreign


tabulate rep78 foreign if rep78 >=4

Let’s make the above table using the column and nofreq options. The column option requests column
percentages while the nofreq option suppresses cell frequencies. Note that column and nofreq come after the
comma. These are options on the tabulate command, and options need to be placed after a comma.

tabulate rep78 foreign if rep78 >=4, column nofreq

The use of if is not limited to the tabulate command. Here, we use it with the list command.

list if rep78 >= 4

If we wanted to include just the valid (non-missing) observations that are greater than or equal to 4, we
can do the following to tell Stata we want only observations where rep78 >= 4 and rep78 is not missing.

list if rep78 >= 4 & !missing(rep78)


list if rep78 >= 4 & rep78 != .
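
The extra condition is needed because Stata treats a missing numeric value (.) as larger than any number, so rep78 >= 4 is also true for cars with no repair record. A minimal sketch on the auto data that ships with Stata makes the difference visible:

```stata
* Missing values sort above every number, so ">=" picks them up.
sysuse auto, clear

* Includes the cars where rep78 is missing.
count if rep78 >= 4

* Valid (non-missing) values only.
count if rep78 >= 4 & !missing(rep78)

* The gap between the two counts is the number of missing values.
count if missing(rep78)
```

The first count should exceed the second by exactly the third.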
Additionally, we can use this code to designate a range of values. Here is a summary of price for the
values 3 through 5 in rep78.

summarize
summarize, detail
summarize price if inrange(rep78,3,5)
summarize price if rep78 >= 3 & !missing(rep78)

Correlate: Correlation and covariance between two variables:


correlate (this command computes the correlation coefficient between all the possible pairs of variables
in memory)
correlate mpg price weight, means
or correlate mpg price weight
correlate var1 var2 (this command computes the correlation coefficient between the two variables
specified)
correlate var1 var2, covariance (computes the covariance between the two variables instead of the
correlation coefficient)

Scatter Plot:

Scatter plot of two variables


plot var1 var2 draws a rough text-mode scatter plot, where var1 is the y-axis variable and var2 is the x-axis variable. The following command produces a better-quality graph:
scatter var1 var2
(In Stata 7 and earlier this was written graph var1 var2.)

scatter api00 enroll


twoway (scatter price mpg) (lfit price mpg)
twoway (scatter price mpg) (lfit price mpg)(qfit price mpg)
twoway (scatter api00 enroll, mlabel(snum)) (lfit api00 enroll)

predict e, residual

To draw a histogram
histogram acs_k3
histogram [varname], bin(#)
# is the number of intervals (bins) we want to specify.
In Stata 7 and earlier the same graph was produced by
graph [varname], histogram bin(#)
or simply graph [varname], bin(#), with #=5 as the default.

If you include normal at the end of the command

histogram [varname], bin(#) normal
a line for the normal distribution appears, so that you can compare whether your distribution looks like a
normal or is very different.

graph box acs_k3


stem acs_k3
stem full
tabulate full
tabulate dnum if full <= 1
count if dnum==401
graph matrix api00 acs_k3 meals full, half

histogram enroll
histogram enroll, normal bin(20)
histogram enroll, normal bin(20) xlabel(0(100)1600)

kdensity enroll, normal


graph box enroll
symplot enroll
qnorm api00
pnorm enroll

ladder enroll
gladder enroll
generate lenroll = log(enroll)
hist lenroll, normal

scatter api00 enroll


twoway (scatter api00 enroll) (lfit api00 enroll)
twoway (scatter api00 enroll, mlabel(snum)) (lfit api00 enroll)

REGRESSION ANALYSIS

You can do the regression analysis by

regress dep_var x1 x2 x3

reg price mpg

Regress with if:

* Before mpg 21
regress price mpg if mpg < 21

* At mpg 21 and after


regress price mpg if mpg >= 21

Generate and regression:

generate lninc=log(inc)
generate d=1 if price>3000
replace d=0 if d==.
The same result could be obtained with a single command in place of the two above:
generate d=(price>3000)
generate mpg2 = mpg*mpg
generate lprice = log(price)
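
As a quick check that the two ways of building a dummy variable agree, the sketch below creates both on the auto data and compares them. The variable names d_two_step and d_one_line are made up for illustration. Both versions would code a missing price as 1, because a missing value counts as larger than 3000; the auto data has no missing prices, so that does not matter here.

```stata
sysuse auto, clear

* Two-step version: flag expensive cars, then fill in the zeros.
generate d_two_step = 1 if price > 3000
replace d_two_step = 0 if d_two_step == .

* One-line version: a logical expression evaluates to 1 (true) or 0 (false).
generate d_one_line = (price > 3000)

* assert stops with an error if the two variables ever differ.
assert d_two_step == d_one_line
tabulate d_one_line
```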
The multiple linear regression model is estimated by OLS with the regress command. For
example,
sysuse auto.dta (or webuse auto)
regress mpg weight displacement

regresses the mileage (mpg) of a car on weight and displacement. A constant is
automatically added unless it is suppressed by the option noconstant:

regress mpg weight displacement, noconstant

Estimation based on a subsample is performed as

regress mpg weight displacement if weight > 3000

where only cars heavier than 3000 lb are considered. The Eicker-Huber-White (heteroskedasticity-robust)
covariance is reported with the option vce(robust), which can also be written robust

regress mpg weight displacement, vce(robust)

Change the confidence level:

By default Stata reports 95% confidence intervals. To change the confidence level, use the level() option:

regress mpg weight displacement, level(99)

As an alternative, you could use the set level command before regress:

. set level 99

. regress mpg weight displacement

F-tests for one or more restrictions are calculated with the post-estimation command test.
For example

test weight displacement

tests H0: β1 = 0 and β2 = 0 against HA: β1 ≠ 0 or β2 ≠ 0.

A two-sample t-test comparing mean mpg of domestic and foreign cars is a separate command, not a post-estimation test:

ttest mpg, by(foreign)

New variables with residuals and fitted values are generated by
predict uhat if e(sample), resid
predict mpghat if e(sample)
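
Putting the post-estimation commands together, a sketch on the auto data might look like the following. The names mpg_hat, u_hat and check are illustrative only. The final assert confirms that a residual is the observed value minus the fitted value, up to rounding.

```stata
sysuse auto, clear
regress mpg weight displacement

* Joint F-test that both slope coefficients are zero.
test weight displacement

* Fitted values and residuals for the estimation sample only.
predict mpg_hat if e(sample)
predict u_hat if e(sample), residuals

* Residual = observed - fitted (allowing for floating-point rounding).
generate check = mpg - mpg_hat
assert abs(check - u_hat) < 1e-4 if e(sample)
```

Because weight and displacement are the only regressors here, the F statistic reported by test is the same as the overall F in the header of the regress output.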

VIF & Tolerances. Use the vif command to get the variance inflation factors (VIFs) and the tolerances (1/VIF).

vif is one of many post-estimation commands. You run it AFTER running a regression. It uses information Stata has stored
internally.

. vif

Let’s use a different data set this time–


sysuse census
What happened when you tried this?

–You need to clear the data first if you are moving between different data sets

»Either use clear and then sysuse census...
»Or type sysuse census, clear to do the same thing

generate –command allows us to create new variables
–We want to know the share of the adult (18 years and older) population in each state

•generate adultpop = pop18p / pop
–What we did was to create the adult share (a proportion between 0 and 1) by dividing the population of adults by the total population

Quick review: what is the average adult population in our sample?

–sum pop18p

Can you create a variable named child that shows the total population of children (0-17 years old)?

–generate child = poplt5 + pop5_17

Let’s create a new variable named above that shows states that are above the average adult population
share, how?

generate above = 1 if adultpop> 0.71

How do you think we would give the remaining states a value of 0?

–generate above = 0 if above == . ("if above is missing") will not work, because above already exists

•replace –command that changes existing variables

–you can only use generate once on a variable

•So in this example, you would need to do

replace above = 0 if above == .

•replace works very similarly to generate in terms of calculations

–We want to change adultpop from a decimal to a percent (so 0.71 would read 71.0 instead),
how would we do that?
•replace adultpop = adultpop * 100
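
The census steps above can be collected into one short sketch. Instead of typing 0.71 by hand, it reads the mean that summarize leaves behind in r(mean); that substitution is a suggestion, not part of the original steps.

```stata
sysuse census, clear

* Share of the population aged 18 and over (a proportion between 0 and 1).
generate adultpop = pop18p / pop
summarize adultpop

* 1 if the state is above the mean adult share, 0 otherwise.
generate above = 1 if adultpop > r(mean)
replace above = 0 if above == .

* Rescale the proportion to a percentage.
replace adultpop = adultpop * 100
tabulate above
```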

Regression with Stata


– Simple and Multiple Regression
use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi

save elemapi (saves the data in the current directory as elemapi.dta)

use elemapi (opens elemapi.dta again)

1.1 A First Regression Analysis

Let’s dive right in and perform a regression analysis using the variables api00, acs_k3, meals and full.
These measure the academic performance of the school (api00), the average class size in kindergarten
through 3rd grade (acs_k3), the percentage of students receiving free meals (meals), which is an
indicator of poverty, and the percentage of teachers who have full teaching credentials (full). We expect
that better academic performance would be associated with lower class size, fewer students receiving free
meals, and a higher percentage of teachers having full teaching credentials. Below, we show the Stata
command for testing this regression model followed by the Stata output.

regress api00 acs_k3 meals full

      Source |       SS       df       MS              Number of obs =     313
-------------+------------------------------           F(  3,   309) =  213.41
       Model |  2634884.26     3  878294.754           Prob > F      =  0.0000
    Residual |  1271713.21   309  4115.57673           R-squared     =  0.6745
-------------+------------------------------           Adj R-squared =  0.6713
       Total |  3906597.47   312  12521.1457           Root MSE      =  64.153

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      acs_k3 |  -2.681508   1.393991    -1.92   0.055    -5.424424    .0614073
       meals |  -3.702419   .1540256   -24.04   0.000    -4.005491   -3.399348
        full |   .1086104    .090719     1.20   0.232    -.0698947    .2871154
       _cons |   906.7392   28.26505    32.08   0.000     851.1228    962.3555

Let’s focus on the three predictors, whether they are statistically significant and, if so, the
direction of the relationship. The average class size (acs_k3, b=-2.68), is not statistically
significant at the 0.05 level (p=0.055), but only just so. The coefficient is negative which would
indicate that larger class size is related to lower academic performance — which is what we
would expect. Next, the effect of meals (b=-3.70, p=.000) is significant and its coefficient is
negative, indicating that the greater the proportion of students receiving free meals, the lower the
academic performance. Please note that we are not saying that free meals are causing lower
academic performance. The meals variable is highly related to income level and functions more
as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic
performance. This result also makes sense. Finally, the percentage of teachers with full
credentials (full, b=0.11, p=.232) seems to be unrelated to academic performance. This would
seem to indicate that the percentage of teachers with full credentials is not an important factor in
predicting academic performance — this result was somewhat unexpected.
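
Several numbers in the header of the output can be rebuilt from the sums of squares, which helps when reading the table. The sketch below uses display as a calculator with values copied from the output; nothing is re-estimated.

```stata
* R-squared = Model SS / Total SS
display %6.4f 2634884.26 / 3906597.47

* F = Model MS / Residual MS
display %8.2f 878294.754 / 4115.57673

* Root MSE = square root of the Residual MS
display %8.3f sqrt(4115.57673)

* Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - k - 1), with n = 313 and k = 3
display %6.4f 1 - (1 - 2634884.26 / 3906597.47) * 312 / 309

* t statistic for acs_k3 = coefficient / standard error
display %6.2f (-2.681508) / 1.393991
```

Each line should match the corresponding entry in the output: 0.6745, 213.41, 64.153, 0.6713 and -1.92.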

Should we take these results and write them up for publication? From these results, we would
conclude that lower class sizes are related to higher performance, that fewer students receiving
free meals is associated with higher performance, and that the percentage of teachers with full
credentials was not related to academic performance in the schools. Before we write this up for
publication, we should do a number of checks to make sure we can firmly stand behind these
results. We start by getting more familiar with the data file, doing preliminary data checking,
looking for errors in the data.

1.2 Examining data

First, let’s use the describe command to learn more about this data file. We can verify how many
observations it has and see the names of the variables it contains. To do this, we simply type

describe

We will not go into all of the details of this output. Note that there are 400 observations and 21
variables. We have variables about academic performance in 2000 and 1999 and the change in
performance, api00, api99 and growth respectively. We also have various characteristics of the
schools, e.g., class size, parents' education, percent of teachers with full and emergency
credentials, and number of students. Note that when we did our original regression analysis it
said that there were 313 observations, but the describe command indicates that we have 400
observations in the data file.
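
The gap between 400 and 313 arises because regress drops any observation that has a missing value on the outcome or on any predictor. A hedged sketch of how to confirm this, assuming the elemapi data are still in memory:

```stata
* Re-run the model so that e(sample) marks the observations actually used.
regress api00 acs_k3 meals full

* Observations used in the regression, then observations in the file.
count if e(sample)
count

* The same number, computed directly from the variables:
* missing() is true if any of its arguments is missing.
count if !missing(api00, acs_k3, meals, full)

* Which variables have missing values, and how many.
misstable summarize api00 acs_k3 meals full
```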

If you want to learn more about the data file, you could list all or some of the observations. For
example, the following lists the first five observations.

list in 1/5

Another useful tool for learning about your variables is the codebook command. Let’s do codebook for the
variables we included in the regression analysis, as well as the variable yr_rnd.

codebook api00 acs_k3 meals full yr_rnd

summarize api00 acs_k3 meals full


summarize acs_k3, detail
tabulate acs_k3
list snum dnum acs_k3 if acs_k3 < 0
list dnum snum api00 acs_k3 meals full if dnum == 140

regress api00 acs_k3 meals full

Another dataset:

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2

regress api00 acs_k3 meals full


save elemapi2

* Simple linear regression
regress api00 enroll


predict e, residual

clear
use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, beta

correlate api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
pwcorr api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, obs sig
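
correlate and pwcorr handle missing values differently: correlate drops any observation that is missing on any listed variable, while pwcorr uses every observation available for each pair. A small sketch on the auto data, where rep78 has missing values but price and mpg do not:

```stata
sysuse auto, clear

* Listwise deletion: every correlation uses the same complete cases.
correlate price mpg rep78

* Pairwise deletion: obs prints the number of observations behind each pair,
* and sig prints the p-value for each correlation.
pwcorr price mpg rep78, obs sig
```

In the pwcorr table, the price-mpg pair should report more observations than the pairs involving rep78.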
