Computing New Variables Using Generate and Replace

This document provides instructions for using Stata commands to summarize, generate, recode, label, and subset variables from datasets. It demonstrates how to create new variables through generation and recoding, apply labels to variables and values, and subset datasets by keeping or dropping variables and observations. Collapsing data across observations to aggregate values is also illustrated.

Uploaded by

Red

5/30/18

Computing new variables using generate and replace

● sysuse auto, clear → loads the built-in auto data set, clearing any data in memory
● summarize length → shows summary statistics for length
● generate len_ft = length/12 → new variable that has length in feet instead of inches
● replace len_ft = length/12 → overwrites all values of the existing variable len_ft with length/12
● summarize length len_ft
● generate → works when the variable does not yet exist; error if variable exists
● replace → works when the variable already exists, error if variable does not exist
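
The difference between generate and replace can be checked directly. The sketch below uses capture to trap the errors and display the return code; no_such_var is a made-up name used only to trigger the error.

```stata
sysuse auto, clear
generate len_ft = length/12

* generate on an existing variable fails; capture stores the return code in _rc
capture generate len_ft = length/12
display "generate on existing variable: _rc = " _rc

* replace on a variable that does not exist also fails
* (no_such_var is a hypothetical name)
capture replace no_such_var = length/12
display "replace on nonexistent variable: _rc = " _rc

* replace works because len_ft exists
replace len_ft = length/12
```

A nonzero _rc means the command failed; Stata reports "already defined" in the first case and "not found" in the second.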

● Make other variables: length2, loglen, and zlength


- generate length2 = length^2
- summarize length2
- generate loglen = log(length)
- summarize loglen

● Get the mean and standard deviation of length from summarize, then use them to make z-scores of length.
- **From ECOSTAT: z = (x - mean)/standard deviation
- summarize length
- generate zlength = (length - 187.93)/22.27
- summarize zlength
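
Typing the mean and standard deviation by hand introduces rounding. As a sketch of an alternative, summarize leaves its results in r(mean) and r(sd), which can be used directly; the variable name zlength_r is chosen here only to avoid clashing with zlength.

```stata
sysuse auto, clear

* summarize stores the mean and standard deviation in r(mean) and r(sd)
quietly summarize length
generate zlength_r = (length - r(mean)) / r(sd)

* the z-scores should have mean close to 0 and standard deviation close to 1
summarize zlength_r
```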

● To break mpg down into three categories, first make a table of mpg:
- tabulate mpg
● Convert mpg into three categories to make it more readable.
- generate mpg3 = . → make a variable with missing values
- replace mpg3 = 1 if (mpg<=18)
- replace mpg3 = 2 if (mpg>=19) & (mpg<=23)
- replace mpg3 = 3 if (mpg>=24) & (mpg<.)

● Use tabulate to check that this worked correctly


- tabulate mpg mpg3
● Use data editor to check raw data
● Use mpg3 to show a crosstab of mpg3 by foreign to contrast the mileage of the foreign
and domestic cars
- tabulate mpg3 foreign, column
● HOW TO READ TABLE:
- In each cell, the top number is the count of cars and the bottom number is the column percentage (the share within domestic or within foreign cars)

● Make a copy of mpg, calling it mpg3a.


- generate mpg3a = mpg
● Use recode to convert mpg3a into three categories: min-18 into 1, 19-23 into 2, and 24-max into 3.
- recode mpg3a (min/18=1) (19/23=2) (24/max=3)
● To double check: (the table should match the one obtained for mpg3)
- tabulate mpg mpg3a
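
Rather than comparing the two tables by eye, the two approaches can be checked against each other with assert. This is a minimal sketch that rebuilds both variables from scratch.

```stata
sysuse auto, clear

* three categories built with generate and replace
generate mpg3 = .
replace mpg3 = 1 if (mpg<=18)
replace mpg3 = 2 if (mpg>=19) & (mpg<=23)
replace mpg3 = 3 if (mpg>=24) & (mpg<.)

* the same three categories built with recode
generate mpg3a = mpg
recode mpg3a (min/18=1) (19/23=2) (24/max=3)

* assert stops with an error if any observation differs
assert mpg3 == mpg3a
tabulate mpg3 mpg3a
```

If the assert passes silently, every car falls in the same category under both methods.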

Recodes with if

● Create a variable called mpgfd that assesses the mileage of the cars with respect to their origin.
- sort foreign
- by foreign: summarize mpg, detail
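● The generate and recode commands below recode mpg into mpgfd, which is 0 for cars below the cutoff for their group and 1 otherwise (cutoff of 19 for domestic cars, 25 for foreign cars)
● generate mpgfd = mpg
● recode mpgfd (min/18=0) (19/max=1) if foreign==0
● recode mpgfd (min/24=0) (25/max=1) if foreign==1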
● check using:
- by foreign: tabulate mpg mpgfd
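
As a cross-check, the same 0/1 variable can be sketched with logical expressions, which evaluate to 1 when true and 0 when false. The name mpgfd2 is introduced here only for the comparison.

```stata
sysuse auto, clear

* version from the notes: recode with if
generate mpgfd = mpg
recode mpgfd (min/18=0) (19/max=1) if foreign==0
recode mpgfd (min/24=0) (25/max=1) if foreign==1

* alternative: a logical expression returns 1 if true, 0 if false
generate mpgfd2 = (mpg >= 19) if foreign == 0
replace  mpgfd2 = (mpg >= 25) if foreign == 1

* both versions should agree for every car
assert mpgfd == mpgfd2
```

One caution with the logical form: a missing mpg would count as larger than any cutoff and be coded 1, so it is only safe here because mpg has no missing values in the auto data.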

To save files:
Select all commands
Send to do-file editor
File name: last name_first name_LBYMET V24_DATE

Create log file


File>LOG>BEGIN> type file name

When done, type: log close

Save DO file so you can open again
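
The menu steps for the log file have command equivalents. A minimal sketch, assuming a log file named session_log.txt in the current working directory (the file name is only an example):

```stata
* close any log that is still open; capture ignores the error if none is open
capture log close

* start a plain-text log, overwriting an earlier file with the same name
log using "session_log.txt", replace text

sysuse auto, clear
summarize length

* stop logging when done
log close
```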

LECTURE

sysuse auto, clear


summarize length
generate len_ft = length/12
replace len_ft = length / 12
summarize length len_ft
generate length2 = length^2
summarize length2
generate loglen = log(length)
summarize loglen
summarize length
generate zlength = (length - 187.93)/22.27
summarize zlength
tabulate mpg
generate mpg3 = .
replace mpg3 = 1 if (mpg<=18)
replace mpg3 = 2 if (mpg>=19) & (mpg<=23)
replace mpg3 = 3 if (mpg>=24) & (mpg<.)
tabulate mpg mpg3
tabulate mpg3 foreign, column
generate mpg3a = mpg
recode mpg3a (min/18=1) (19/23=2) (24/max=3)
tabulate mpg mpg3a
sort foreign
by foreign: summarize mpg, detail
generate mpgfd = mpg
recode mpgfd (min/18=0)(19/max=1) if foreign==0
recode mpgfd (min/24=0)(25/max=1) if foreign==1
by foreign: tabulate mpg mpgfd
save "C:\Users\Student\Documents\LBYEMET.dta"

EXERCISE
sysuse auto, clear
summarize weight
generate wei_kg=weight/2.2
summarize weight wei_kg
tabulate trunk
generate trunk4 = .
replace trunk4=1 if (trunk<=10)
replace trunk4= 2 if (trunk>=11)&(trunk<=15)
replace trunk4=3 if (trunk>=16)&(trunk<=20)
replace trunk4=4 if (trunk>=21)&(trunk<.)
tabulate trunk trunk4
tabulate trunk4 foreign, column
generate trunk4a=trunk
recode trunk4a (min/10=1)(11/15=2)(16/20=3)(21/max=4)
tabulate trunk trunk4a
sort foreign

by foreign: summarize trunk, detail

generate trunkfd = trunk

recode trunkfd (min/16=0)(17/max=1) if foreign==0

recode trunkfd (min/11=0)(12/max=1) if foreign==1

6/5/18

LABELING DATA
Use a file called autolab that does not have any labels
● Download data from https://stats.idre.ucla.edu/stat/stata/modules/autolab.dta
● Open in Stata

Use the describe command to verify that indeed this file does not have any labels
- describe

Use the label data command to add a label describing the data file. Can be up to 80 characters
long.
- label data "This file contains auto data for the year 1978"

The describe command shows that this label has been applied to the version that is currently in
memory.
- describe

Use the label variable command to assign labels to the variables rep78, price, mpg, and foreign
- label variable rep78 "the repair record from 1978"
- label variable price "the price of the car in 1978"
- label variable mpg "the miles per gallon for the car"
- label variable foreign "the origin of the car, foreign or domestic"

The describe command shows these labels have been applied


- describe
Make a value label called foreignl to label the values of the variable foreign.
Label define command below creates the value label called foreignl that associates 0 with
domestic car and 1 with foreign car
- label define foreignl 0 "domestic car" 1 "foreign car"
The label values command below associates the variable foreign with the label foreignl.
- label values foreign foreignl
If we use the describe command, we can see that the variable foreign has a value label called
foreignl assigned to it.
- describe
Now when we use the table foreign command, it shows the labels domestic car and foreign car
instead of just 0 and 1.
- table foreign
Value labels are used in other commands as well. For example, below we issue the ttest,
by(foreign) command, and the output labels the groups as domestic and foreign (instead of 0
and 1).
- ttest mpg, by(foreign)
To save new data set
- save auto2
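
The value-label steps can be tried without downloading autolab. The sketch below uses the built-in auto data as a stand-in and a label name, originl, that is made up for this example.

```stata
sysuse auto, clear

* define a value label and attach it to foreign
* (originl is a hypothetical label name)
label define originl 0 "domestic car" 1 "foreign car"
label values foreign originl

* show the mapping stored in the value label
label list originl

* tabulate shows the labels; the nolabel option shows the underlying 0 and 1
tabulate foreign
tabulate foreign, nolabel
```

The nolabel option is a quick way to confirm that labeling changes only the display, not the stored values.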

Part 2
Use sysuse auto, clear to clear previous labels

describe to check

You can use the keep and drop commands to subset variables
Suppose we want to just have make mpg and price, we can keep just those variables
- keep make mpg price
If we issue the describe command again, we see that those are the only variables left.
- describe

Using the drop command. Clear out the data in memory and use the auto data file.
- sysuse auto, clear
To get rid of variables displ and gear_ratio:
- drop displ gear_ratio
Use describe to check
- describe
Make change permanent
- save auto2
To replace existing file with same name
- save auto2, replace

Keeping and dropping observations


Use the auto file and clear out the data currently in memory.
- sysuse auto, clear
The variable rep78 has values 1 to 5, and also has some missing values.
- tabulate rep78, missing
To eliminate the observations which have missing values, use drop if. The portion after the drop
if specifies which observations should be eliminated.
- drop if missing(rep78)
Use the tabulate command to show that they have been eliminated.
- tabulate rep78, missing

- drop if rep78==1
- tabulate rep78, missing

Use keep if to eliminate observations. First clear:


- sysuse auto, clear
The keep if command can be used to eliminate observations, except that the part after the keep
if specifies which observations should be kept. Suppose we want to keep just the cars which
had a repair rating of 3 or less.
- keep if (rep78<=3)
Check using tabulate
- tabulate rep78, missing
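
Note that keep if (rep78<=3) also removes the observations with missing rep78, because Stata treats a missing value as larger than any number. A short sketch that checks this with count:

```stata
sysuse auto, clear

* number of cars with missing rep78 before subsetting
count if missing(rep78)

* keep cars with a repair rating of 3 or less
keep if (rep78<=3)

* missing values fail the condition rep78<=3, so none remain
count if missing(rep78)
assert r(N) == 0
```

By the same logic, drop if rep78>3 would drop the missing values too, so a condition such as (rep78>3) & (rep78<.) is needed when they should be kept.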

June 20, 2018


COLLAPSING DATA ACROSS OBSERVATIONS
You might have student data but you really want classroom data
You might have weekly data but you want monthly data.
- use https://stats.idre.ucla.edu/stat/stata/modules/kids.dta, clear (download then open)
- list
Consider the collapse command below. It collapses across all of the observations to make a
single record with the average age of the kids.
- collapse age
- list
The above collapse command was not very useful but you can combine it with the by(famid)
option, and then it creates one record for each family that contains the average age of the kids
in the family
- use "C:\Users\Student\Downloads\kids.dta", clear
- collapse age, by(famid)
- list
The following collapse command does the exact same thing as above except that the average
of age is named avgage and we have explicitly told the collapse command that we want it to
compute the mean.
- use "C:\Users\Student\Downloads\kids.dta", clear
- collapse (mean) avgage=age, by(famid)
- list
For more than one variable
- use "C:\Users\Student\Downloads\kids.dta", clear
- collapse (mean) avgage=age avgwt=wt, by(famid)
- list
This command also computes numkids which is the count of the number of kids in each family
(obtained by counting the number of observations with valid values of birth).
- use "C:\Users\Student\Downloads\kids.dta", clear
- collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
- list
(Entire lecture in https://stats.idre.ucla.edu/stata/modules/collapsing-data-across-observations/)
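
To see what collapse does without the kids file, here is a sketch on a tiny data set typed in with input. The four rows are invented for illustration and are not the contents of kids.dta.

```stata
clear

* made-up data: two families with two kids each
input famid age wt
1 9 60
1 6 40
2 8 50
2 4 30
end

* one record per family: mean age, mean weight, and number of kids
collapse (mean) avgage=age avgwt=wt (count) numkids=age, by(famid)
list
```

After collapse the data set has one observation per value of famid, and the original kid-level rows are gone from memory, which is why the notes reload the data before each new collapse.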

PART 2
(Entire lecture in https://stats.idre.ucla.edu/stata/modules/working-across-variables-using-foreach/)

June 27, 2018


(Entire lecture in https://stats.idre.ucla.edu/stata/modules/combining-data/)

*non-numeric: string variable

July 4, 2018
https://stats.idre.ucla.edu/stata/modules/graph8/intro/introduction-to-graphs-in-stata/

July 11, 2018


https://stats.idre.ucla.edu/stata/webbooks/reg/chapter1/regressionwith-statachapter-1-simple-and-multiple-regression/
Transforming Variables
- We will focus on the issue of normality. The residuals need to be normally distributed for
the t-tests to be valid. *normally distributed - bell shaped curve.
A quadratic model

Open the real estate data set


- use br,clear
Describe and summarize the data
- desc
- sum
Create a new variable sqft2 that is the square of the variable sqft
- generate sqft2=sqft^2
Regress house price on the square of house size and obtain fitted values, priceq.
- regress price sqft2
- predict priceq, xb
Plot the fitted line using new options.
- twoway (scatter price sqft) (line priceq sqft, sort lwidth(medthick))

The slope of the fitted quadratic regression function price = b1 + b2*sqft^2 is d(price)/d(sqft) = 2*b2*sqft. Compute the slope at different values of x = sqft. We access the regression coefficient using _b[varname]. Calculating the slope at sqft = 2000, 4000, and 6000, we have
- di "slope at 2000 = " 2*_b[sqft2]*2000
- di "slope at 4000 = " 2*_b[sqft2]*4000
- di "slope at 6000 = " 2*_b[sqft2]*6000

Using the same approach we can see the predicted values from the estimated regression
- di "predicted price at 2000 = " _b[_cons]+_b[sqft2]*2000^2

.
.
.
A more stylish and efficient approach is to use factor variables. We can estimate the quadratic function directly, without creating a new variable.
- regress price c.sqft#c.sqft
Obtain fitted values
- predict price2

Not only are slopes computed correctly, but we are provided a standard error and interval
estimate as well. Elasticities use the eyex(*) option.
- margins, eyex(*) at(sqft=(2000 4000 6000))
The slopes and elasticities computed above are conditional because they are computed at specific values. To compute the average marginal effects or average elasticities, use the margins command without the at option.
- margins, eyex(*)
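
The link between the hand-computed slope 2*_b*x and what margins reports can be checked on data that ships with Stata. The sketch below uses the auto data as a stand-in for the real estate file, with weight playing the role of sqft; that substitution is an assumption made only so the example runs anywhere.

```stata
sysuse auto, clear

* quadratic model estimated with factor-variable notation
regress price c.weight#c.weight

* slopes with standard errors at chosen values of weight
margins, dydx(weight) at(weight=(2000 3000 4000))

* the same slope at weight = 2000, computed by hand
display "slope at 2000 = " 2*_b[c.weight#c.weight]*2000
```

The first margins estimate and the hand-computed value should agree; margins adds the standard error and confidence interval.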

A log-linear model
Using the same data, we will estimate a log linear model. To obtain the fitted value of y the most
natural thing to do is compute the antilog.

Use the same data set. Get the detailed summary statistics and histogram of price.
- summarize price, detail
- histogram price, percent
Now generate the logarithm of price and plot its histogram.
- generate lprice = ln(price)
- histogram lprice, percent
The log-linear regression model is
- reg lprice sqft
The predicted values are obtained using
- predict lpricef, xb
- generate pricef = exp(lpricef)
The variable pricef is the predicted (or forecast) price. Plot the fitted curve.
- twoway (scatter price sqft) (line pricef sqft, sort lwidth(medthick))
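We must calculate the slope and elasticity. In the log-linear model the slope is _b[sqft]*price and the elasticity is _b[sqft]*sqft, so each is evaluated at chosen values of price or sqft.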
- di "slope at 100000 = " _b[sqft]*100000
- di "slope at 500000 = " _b[sqft]*500000
- di "elasticity at 2000 = " _b[sqft]*2000
- di "elasticity at 4000 = " _b[sqft]*4000
We can also compute average marginal effects at each fitted house price in the sample
- generate me = _b[sqft]*pricef
- summarize me
Similarly, the average elasticity is
- generate elas = _b[sqft]*sqft
- summarize elas
LINK: https://www.stata.com/data/s4poe4/chap02.do

AUGUST 8 2018
https://stats.idre.ucla.edu/stata/webbooks/reg/chapter2/stata-webbooksregressionwith-statachapter-2-regression-diagnostics/

Codes not in the link


list state _dfbeta_1 _dfbeta_2 _dfbeta_3 in 1/5
scatter _dfbeta_1 _dfbeta_2 _dfbeta_3 sid, ylabel(-1(.5)3) yline(.28 -.28)
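
The _dfbeta_ variables in the commands above are created by the dfbeta command after a regression. A minimal sketch on the built-in auto data (a stand-in, not the data set used in the linked chapter), using the conventional cutoff of 2/sqrt(n) to flag influential observations:

```stata
sysuse auto, clear
regress price mpg weight

* dfbeta creates one variable per predictor: _dfbeta_1 (mpg), _dfbeta_2 (weight)
dfbeta

* conventional rule-of-thumb cutoff: 2 divided by the square root of n
display "cutoff = " 2/sqrt(e(N))

* list cars whose DFBETA for mpg exceeds the cutoff in absolute value
list make _dfbeta_1 _dfbeta_2 if abs(_dfbeta_1) > 2/sqrt(e(N)) & !missing(_dfbeta_1)
```

With 51 observations this rule gives about .28, which matches the yline(.28 -.28) reference lines in the scatter command above.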
