6.1 Stata
Nicholas Ndiwa
DIRECTORY MANAGEMENT
STATA SYNTAX
USER-WRITTEN EXTENSIONS
DUPLICATES
OUTPUT MANAGEMENT
LABELLING VARIABLES
MANAGING DATETIME VARIABLES
THE REPLACE COMMAND
COMBINING DATASETS
MERGE
APPEND
BY COMMAND
STRING CLEANING
SUB-SETTING DATA
MACROS
LOOPS
FOREACH
DESCRIPTIVE STATISTICS
COMBINING GRAPHS
ADDING TITLES AND LEGENDS TO GRAPHS
STATISTICAL INFERENCE
NORMALITY TESTING
TEST OF ASSOCIATION
TESTS OF DIFFERENCE
WAY FORWARD
Introduction to Stata
What is Stata
Stata is a versatile statistical package used for data management, analysis, and the production of
graphical outputs. It was initially popular among economists but is nowadays used by
researchers from all fields.
Why Stata
Stata is popular because it is user-friendly, offers perpetual licensing options, has very powerful
and specialized user-written add-ons, and allows reproducible reporting. It supports both menu-driven
and command-driven interaction.
The Stata Environment
Directory management
A directory is a folder where the files you are working on are stored.
Why create a directory?
How to create a working directory
By default, Stata saves or retrieves files from a default folder specified during installation.
Users can change the default working folder to their preferred one using the change directory
(cd) command.
The cd (change directory) command can be used on its own to identify the directory you are
currently working in.
However, the command, followed by a directory name, changes the directory you work in. For
instance, let’s change our directory to “/Volumes/Transcend/Career/ILRI/Training”
cd "/Volumes/Transcend/Career/ILRI/Training"
Note that if your directory path contains embedded spaces, you will need to put the path in
double quotes.
To change the directory from the menu go to: File > Change Working Directory…
The set command is used to control the STATA operating environment. There are dozens of set
commands, but many of them are rarely used.
If you get the error message "no room to add more observations", the data file is too
big for the memory allocated to Stata. You can use the following command to increase the memory
allocated to Stata:
set mem XXm
This sets memory to XX megabytes. You cannot set memory greater than the total memory
in the computer (physical plus virtual). Setting too much memory lowers the
processing speed of the computer, so be cautious.
If the problem is in variable allocation (default is 5,000 variables), you increase it by typing, for
example:
set maxvar 10000
Another set command commonly used is set more. This command turns
the pausing of long, continuously scrolling output on and off.
set more off [, permanently]
Use set more off if you are not interested in the intermediate output, only the final result. The option
permanently makes this setting persist across Stata sessions. Use set more on if you need to be
able to read the early output.
Another SET command is the set trace command which traces the execution of programs for
debugging. set trace off turns off tracing after it has been set on.
To get help in Stata, type help followed by the command name; this provides help for built-in Stata
commands. To get help on user-written commands, type findit followed by the command name.
There are also several online materials that cover almost every Stata component.
Alternatively, Click on “help” from the Menu bar, then select “search”. Type the search string in
the dialog box that appears and click “OK”
findit summarize
Stata syntax
Stata is case sensitive: all Stata commands should be typed in lower case, while file
and variable names should be typed in the exact case (lower or upper) in which they were defined.
A Stata command has the following basic structure:
keyword argument, options
The keyword is the Stata reserved command; the argument is the parameter supplied by the user
that the keyword acts on; the options are additional instructions on how the command will be
executed. A space separates the keyword and the argument, while a comma separates the
argument and the options. For example, to
import data we use the command below:
use "example data1.dta", clear
use is the keyword, "example data1.dta" is the argument and clear is the option.
More fully, most Stata commands follow the general syntax:
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [using filename] [, options]
Where:
if - applies the command only to those observations for which the value of the expression
is true
in - restricts the command to a specific observation range
weight - indicates the weight to be attached to each observation, e.g. pweight for sampling weights
=exp - the value to be assigned to a variable; most often used with generate and replace
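As a sketch, several of these syntax elements can be combined in a single command; the example below assumes the auto example dataset, which is loaded later in this section:

```stata
sysuse auto, clear
summarize price if foreign==1      // restrict the summary to foreign cars
list make price in 1/5             // operate on the first five observations only
```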
----------------------------------------------------------------
help for summarize (manual: [R] summarize)
----------------------------------------------------------------
Summary statistics
------------------
[by varlist:] summarize [varlist] [weight] [if exp] [in range] [, {detail | meanonly} format]
Stata contains several example datasets that come pre-installed. To load a
specified Stata-format example dataset, use the command:
sysuse filename
If you are not sure of which example dataset to load, use this command to list all the names of
the datasets shipped with STATA . From the Menu go to File > Example Datasets...
sysuse dir, all
Now let’s load an example dataset called auto and run some basic STATA functions.
sysuse auto.dta
To obtain the summary statistics for all of the variables in the data file use the command
summarize or just su
list [varlist]
codebook varlist
describe
tabulate
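As a quick sketch, these inspection commands can be tried in sequence on the auto example dataset:

```stata
sysuse auto, clear
describe              // variable names, storage types and labels
codebook rep78        // detailed summary of a single variable
tabulate foreign      // frequency table of a categorical variable
```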
Re-using commands:
If you want to reuse a command all that you need to do is double click on the command in the
Review Window.
User-written extensions
There are a tremendous number of user-written programs for Stata available which, once
installed, act just like official Stata commands.
Several organizations house user-written programs, including the Boston College Statistical
Software Components (SSC) archive and the Stata Journal.
So, how do you start to use user-written commands? There are several approaches based on
whether the command is found on SSC, elsewhere online, or currently unavailable online.
SSC programs
SSC is the largest repository of Stata user-written programs. To install an SSC package named
package, type in Stata :
ssc install package
For example, a popular SSC program is estout. To install it, type on the Stata command line:
ssc install estout
Alternatively, one can search for and install a package using the findit command:
findit catplot
A list of packages matching the search is displayed. Click on the package you want, and a
window will appear; follow the instructions to install the package.
Distinguishing between Categorical data, Continuous data; Numerical, string and date/time
variables
Common terms used in STATA
Records (or cases or observations): individual observations (e.g. farm plots, households, villages,
or provinces). Usually considered to be the "rows" of the data file.
Variables: Characteristics, location, or dimensions of each observation. Considered the
“columns” of the data file.
Levels: The level of a dataset describes what each record represents.
Discrete variables (or categorical variables): Variables that have only a limited number of
different values (e.g region, sex, type of roof, and occupation).
Binary variables (or dummy variables, dichotomous variables): discrete variables that take only two
values. Examples: yes/no, male/female, have/don't have, or other variables with only two
values.
Continuous variables: Variables whose values are not limited. Examples: per capita expenditure,
farm size, number of trees. Usually expressed in some units such as shillings, kilometers, hectares,
or kilograms. Also, may take fractional values.
Variable labels: These are longer names associated with each variable to explain them in tables
and graphs. E.g: variable REGION could have a label “Region of Kenya”.
Value labels : These are longer names attached to each value of a categorical variable. For
example, if the variable REG has four values, each value is associated with a name. The value
labels for REG=1 could be “Northern Region”, REG=2 could be the “Central Region”, and so on.
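As a minimal sketch of how such value labels are attached (the variable REG and the label name are illustrative, following the example above):

```stata
* define the value labels, then attach them to the variable
label define reglbl 1 "Northern Region" 2 "Central Region"
label values REG reglbl
```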
Data storage types
Numeric types:
Stata provides five numeric types for storing variables, three of them integer types and two of
them floating point.
The floating-point types are float and double.
The integer types are byte, int, and long.
String storage type:
Text variables (such as names of respondents) are best stored as string characters
Note: Variables that are stored as string characters do not allow any statistical analysis on them
other than frequency counts
Stored Data Colours
When you browse the data, the spreadsheet is displayed and you can distinguish some of the
storage types by looking at the colors, for example:
Operators
STATA uses specific mathematical and logical operators
Special symbols
Missing values: The default STATA symbol for missing values is the “ . ” (dot), which is interpreted
as a large value by STATA . There is an option of defining your own missing value indicator.
For example, change all zero entries for hired labor to missing using the command below then
browse to see the change
mvdecode Fem_hired_work Tot_hired_lab , mv(0)
_n: This is used to refer to the record number (we will see examples of its applications)
_all: This is used to apply a command globally (we will see examples of its applications)
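Minimal sketches of _n and _all in use (variable names and the -999 code are illustrative):

```stata
gen obsnum = _n            // store each record's row number
mvdecode _all, mv(-999)    // recode -999 to missing in every variable
```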
Variable types
Two main data types used in STATA are Numeric and String. Numeric data can be stored in
different formats depending on the required precision (table below).
Storage type    Minimum                  Maximum
byte            -127                     100
int             -32,767                  32,740
long            -2,147,483,647           2,147,483,620
float           about -1.7e+38           about 1.7e+38  (~7 digits of precision)
double          about -8.9e+307          about 8.9e+307 (~16 digits of precision)
Inbuilt datasets
STATA software comes with inbuilt datasets for users to practice with, called example datasets.
To access these datasets, go to:
File > Example datasets > Datasets installed with Stata, then click "use" to open the
dataset.
Example datasets within Stata can also be accessed using the command: sysuse “file.name” – if
you know the exact name of the example dataset, e.g.
sysuse auto
ii) insheet – for data created using a spreadsheet or database program and saved as comma-separated
or tab-delimited text, e.g.
insheet using "farmers.csv", clear
The firstrow option of import excel tells Stata to treat the first row of the Excel data as variable
names. There are several other commands that can be used to import data into Stata. Run the
command "help import" to learn more.
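For instance, a spreadsheet can be imported with import excel; the file name below is hypothetical:

```stata
import excel using "farmers.xlsx", firstrow clear
```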
To import external datasets in other formats apart from Stata datasets from the menu, go to
File > Import >.. Then select the type of file you want to import
v) Other sources
Stata can import data from several other sources: such as SPSS/PASW, SAS, etc. Data from SPSS
and SAS require user written programs before data importation. The commands below show how
to install the user written programs using the command ssc install package_name and how to
use the packages to import data.
Dataset from SPSS/PASW:
ssc install usespss // installs the required package
There are special user written commands for exporting STATA files to other statistical packages
such as SPSS and SAS or saving STATA data to older versions of STATA. Install and try any of the
following user-written commands
savespss – saves Stata data in SPSS format
saveold – allows users to save Stata data in a format readable by preceding Stata versions, e.g.
data saved in Stata 14 can be opened by Stata 13, 12, 11, or 10
use13 – allows users of Stata 10-12 to open Stata 13 files
Before working with any dataset it is important to get a glimpse of its contents, in terms
of the types of variables and their contents. Stata provides several commands for getting
descriptions and contents of variables in the data.
If you had closed the example data set, open it again.
use "farmers.dta", clear
Below are the common commands used to display dataset summaries or view the data.
describe: produces a summary of the active dataset. Additional options can be included in the
command to get specific or additional information (type help describe to see the available
options). Test the following options and discuss the difference:
describe
describe, short
describe, numbers
codebook: similar to describe but provides additional summary statistics (range, frequencies) for
each variable.
codebook
browse: Displays the data in a spreadsheet format (columns/rows) but one is not able to edit or
modify the data contents
browse
edit: Displays the data in a spreadsheet format (columns/rows) and allows the user to edit or
modify the data contents
edit
For both browse and edit commands, STATA displays data in different colors depending on the
data type, for example, string variables in RED, value labels in BLUE and numeric in BLACK. One
can change the default data codes by selecting Edit->Preferences menu options.
Labelling datasets
STATA has an option of saving a dataset with a brief description of the file contents
(documentation) using the label data command.
Get description of the example data using describe command
Notice that there is no description of the data! Label the data using the command
below, then run describe again and notice the difference.
label data "This data contains the list of farmers receiving vines"
Duplicates
Datasets have a unique ID, a variable that uniquely identifies observations. Unique IDs can
distinguish respondents from each other, so that John Doe is identified by the value 1 of variable
uniqueid, Jane Smith is identified by the value 2, and so on. An observation can also be uniquely
identified by a set of variables, for example, the combination of a household ID and a respondent
ID.
One of the first things that you will need to do when cleaning your data is cleaning the ID. More
specifically, you need to check that the ID is unique. If it isn't, you’ll need to find how many and
which IDs are duplicates, and resolve them.
Unique IDs are also needed for:
- matching respondents across rounds (for example, matching baseline and endline
observations) and across datasets (for example, survey data and administrative data)
- data entry reconciliation
isid
Probably the easiest way to figure out whether a variable (or a combination of several variables)
uniquely identify observations is the isid command. isid will have no output if the variables
are a unique id together (or one variable by itself), and will give an error if they are non-unique.
To test this command, check whether hhid is a unique identifier in our dataset:
isid hhid
We find that there is a "surplus" of 13 observations.
Try browsing the dataset to figure out where we have duplicates on the hhid variable.
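Before browsing, the duplicates commands give a quick overview (a sketch):

```stata
duplicates report hhid    // how many observations are duplicated, and how many copies exist
duplicates list hhid      // list the duplicated observations themselves
```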
Before we move on to that, however, let's briefly look at the duplicates tag and duplicates drop
commands. duplicates tag essentially does the same thing as duplicates report, but also generates
a new variable (often called "tag") that shows how many duplicates each observation has.
So, if there are no duplicates (the observation is unique), the tag will be equal to 0.
If there is 1 duplicate observation (i.e. there are 2 observations that look the same, one of them
an "extra"), the tag will be equal to 1 for both observations, the "1st" and the
"extra/duplicate". If there are 2 duplicates (3 observations that look the same), the tag will be
equal to 2, and so on.
Let's try to generate a tag (called tag1) for hhid duplicates, just for practice. Try to look up how
to do it using the help file.
duplicates tag hhid, gen(tag1)
Now tabulate the tag variable and browse just the observations that are duplicated.
Finally, duplicates drop removes the duplicates. Please use it sparingly, and NEVER use it without
first checking out what the duplicates are and why they are there very carefully using the other
duplicates commands.
duplicates drop
To manage duplicate observations using the menu, go to: Data > Data utilities > Manage
duplicate observations
Sorting and ordering
sort – arranges the observations of the current data into ascending order based on the values
of the variables you list after the command. Suppose we wanted Stata to look at the data in the
order of the hhid. To accomplish this, say:
sort hhid
Now browse again. You can see the data is sorted by the household ID. What if you wanted
Stata to sort by hhid within each county? You simply list the largest category first, and then add
extra variables to the command to sort within the bigger categories. So, to sort by household ID
within counties, you'd type:
sort County hhid
Browse to see what that did. There is actually no limit to the number of variables that you can
sort by at one time, and Stata just reads them left to right and sorts in that order.
Certain commands will require you to sort the data beforehand, and sometimes it’s a great trick
to use with commands that don’t require it. However, you have to be very careful when using
sort.
To sort the data in descending or ascending order, use the gsort command with a + or - before
the variable you are sorting by. For example, to sort the data in descending order of hhid, use the
following command:
gsort -hhid
Suppose you change your mind and want the County and Subcounty variables to appear after the
hhid. This can be accomplished as follows:
order County Subcounty, after(hhid)
To order via the menu go to: Data > Data utilities > Change order of variables
Executing commands using do-files
Stata comes with an integrated text editor called the Do-file Editor, which can be used for many
tasks. To access the do file editor either type in the command line
doedit
You can write the commands, to run them select the line(s), and click on the last icon in the do-
file window
Notice the different colors of the syntax elements. You can change them under Edit>Preferences
Click on the Do button, to execute the commands. Stata executes the commands in sequence.
Stata saves the commands to a temporary file and uses do command to execute them.
Save the do file with a .do extension. After you have saved your do-file, you can execute the
commands it contains by typing do filename, where the filename is the name of your do-file
do analysiscode.do
You can continually update your do-file with additional commands. You can try your commands
interactively, and if they seem to work, cut and paste them into the do file.
If you put a * before a line in the do-file, Stata will not execute that line. This serves two
purposes:
1. you can rerun your do-file while leaving out certain commands;
2. you can annotate your do-file.
You can have Stata skip over several lines by using /* and */.
clear
* load the dataset (filename is a placeholder for your data file)
use filename
* create the log of price1
gen lprice1 = log(price1)
If there is a syntax error in your do-file, Stata will stop execution at the point of the error. You
can go back to the do-file editor, correct the syntax error, and rerun your program.
You may want to create two do-files for any project. The first manipulates the data and creates
new variables. At the end of this do-file, be sure to save the resulting data set in a new data file.
The second file uses the data set you created in the first file to perform all of your analyses.
Alternatively, you might prefer to work interactively. In this case, start analyzing your data
interactively, normally. Right click in the Review window. An intuitive menu will appear. You can
use the menu to:
- Delete commands that you do not want to keep (be sure to highlight these commands
before deleting)
- Highlight the entire contents of the Review window ("select all")
- Send the highlighted commands to the Do-file Editor ("send to do-file editor")
- Edit your do-file: use the Stata Do-file Editor or even MS Word to edit your file, correct
mistakes, and add comments (*)
- Execute your do-file: type do filename in the Stata Command window or click on the execute
do-file button in the Stata Do-file Editor.
Calling other do-files
Say that you wrote makedata.do, which infiles your data, generates a few variables, and saves
step1.dta.
Say that you wrote anlstep1.do, which performed a little analysis on step1.dta. You could then
create a third do-file, begin master.do
do makedata
do anlstep1
and so in effect combine the two do-files. Do-files may call other do-files, which, in turn, call other
do-files, and so on. Stata allows do-files to be nested 64 deep.
Writing long commands
You can change the end-of-line delimiter to ‘;’ by using #delimit,
you can comment out the line break by using /* */ comment delimiters,
or you can use the /// line-join indicator.
In the following fragment of a do-file, we temporarily change the end-of-line delimiter:
use mydata
#delimit ;
summarize weight price displ
    headroom rep78 length turn gear_ratio ;
#delimit cr
Once we change the line delimiter to semicolon, all lines, even short ones, must end in
semicolons.
Stata treats carriage returns as no different from blanks. We can change the delimiter back to
carriage return by typing #delimit cr. The #delimit command is allowed only in do-files—it is
not allowed interactively.
The other way around long lines is to comment out the carriage return by using /* */ comment
brackets or to use the /// line-join indicator.
use mydata
summarize weight price displ headroom rep78 length turn gear_ratio /*
    */ if substr(company,1,4)=="Ford" | /*
    */ substr(company,1,2)=="GM", detail
gen byte ford = substr(company,1,4)=="Ford"
OR
use mydata
summarize weight price displ headroom rep78 length turn gear_ratio ///
if substr(company,1,4)=="Ford" | ///
substr(company,1,2)=="GM", detail
Output management
The Stata Results window does not automatically keep all the output you generate, unlike SPSS. It
only stores about 300-600 lines, and when it is full, it begins to delete the oldest results as you
add new ones.
set scrollbufsize XX
This is used to change the amount of output that STATA will store in the Results window. XX is
expressed in bytes. The default is 32000 and the maximum is 500000. Type “help set” for a list of
other settings in Stata .
Stata output files are called log files (saved with a .log or .smcl extension). You can save all your
Stata output in a log file, a sort of built-in tape recorder where you can retrieve the
output of your work and keep a record of it.
Creating log files
In the command line type:
log using mylog.log OR log using mylog, text
This will create the file ‘mylog.log’ in your working directory. You can read it using any word
processor (notepad, word, etc.). To close a log file type:
log close
To add more output to an existing log file add the option append, type:
log using mylog.log, append
Note that the option replace will delete the contents of the previous version of the log.
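A complete logging session might look like this sketch (the log file name is illustrative):

```stata
log using mysession.log, text replace
sysuse auto, clear
summarize price mpg
log close
```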
Alternatively you can log using SMCL extension, a language Stata understands. Created as
before,
log using mylog.smcl
With the .smcl file, one can print to a text file or PDF file using command translate.
translate mylog.smcl mylog.log
If the mylog.log file already exists and you wish to overwrite the existing copy, you can specify
the replace option:
translate mylog.smcl mylog.log, replace
To start a log file from the menu, go to: File > Log > ... A drop-down will appear with
several options such as Begin, Append, and Translate a log file.
Stata has two commands for creating new variables: generate (gen) and egen (extended
generate). Although the two commands perform similar functions, egen is applicable when using
statistical functions such as sum, mean, etc., while generate is limited to straightforward
mathematical operations. In summary, use egen when performing more complex operations.
Example: calculate the total number of hired laborers for each record, then browse to see the results.
gen total_hired_labor= Male_hired_work+Fem_hired_work
browse Male_hired_work Fem_hired_work total_hired_labor
To generate variables with the gen command from the menu, go to: Data > Create or
change data > Create new variable, and use the dialog box that appears.
To create a new variable with the egen command, use the menu: Data > Create or
change data > Create new variable (extended).
More on Egen
egen allows you to use Stata's built-in functions to generate new variables. For example, to
create a variable holding the maximum number of vines allocated to a household we can
use:
egen maxVines = max(Numberofvinecuttingskabode)
egen is especially useful for by-groups. For example, to calculate the minimum number
of vines by county, we use:
bysort County: egen minVines = min(Numberofvinecuttingskabode)
Above, min() and max() are statistical functions. This means that you can't use them outside
egen. Remember, one of egen's chief roles is to create variables that depend on multiple
observations. There are other egen functions apart from the ones we have used. Type “help
egen” to check the list of functions.
Suppose we want to dichotomize a continuous variable. The code below assigns a value of 1 if the
car repair record rep78 is greater than 3 and 0 otherwise. Note that Stata treats missing values as
larger than any number, so missing values of rep78 would also be coded 1; the if qualifier below
keeps them missing instead.
gen rep_cat = rep78>3 if !missing(rep78)
In some cases gen can be used together with the tab command to generate dummy variables
from categorical variables. For example:
tab price_cat3, generate(price_dum)
Labelling variables
It is important to document your dataset and one way of doing this is attaching meaningful labels
to the variables. This is done using the label variable command in STATA
label variable total_hired_labor2 "total number of hired workers in the farm"
Using the menu go to: Data > Variables Manager, click on the variable of interest then label the
variable by typing in the text box under “Label”
Defining value labels
In addition to labelling variables, another important piece of dataset documentation is defining
what the values in categorical variables mean, for example, form of ownership (Ownership) in the
example dataset. Defining value labels is a two-stage process: first define the value labels,
then attach the value label to the specific variable.
label define ownershipval 1 "Sole ownership" 2 "Partnership with another person"
After defining the value labels, attach them to a specific variable. Common value labels such as
yes/no can be attached to several variables without having to redefine them each time.
label values Ownership ownershipval
Using the menu go to: Data > Variables Manager, click on the “Manage” button on the right of
the Value Label then follow the preceding dialog boxes to create a new value label.
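As a sketch of reusing one value label for several variables (the variable names are hypothetical; supplying a varlist to label values works in recent Stata versions):

```stata
label define yesno 0 "No" 1 "Yes"
label values owns_land hires_labor yesno
```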
In addition to deriving new variables, data management usually involves changing contents of
existing variables, changing the formats or reclassifying variables. Although there are several
commands or different ways of doing these changes, we will look at the common ones in this
workshop.
Recoding variables
When working with coded variables (variables with value labels), it is more efficient to use the
recode command than replace. For example, education of the manager has 7 levels (0=Illiterate,
1=Literate, 2=Primary school, 3=Intermediate, 4=Secondary school, 5=College, 6=University),
but looking at the distribution using the tabulate command, it is more informative to combine
literate with primary, and intermediate, secondary, college and university, so that there are three
levels of education. That means we recode 1 and 2 into 2, then 3, 4, 5 and 6 into 3. Also, 0 is not
a very common code, so change it to 1. Use the recode command. It is good practice NOT to
change the original variable, so we shall store the changes in a new variable called
educ_recode.
We can directly add value labels to the recoded values within the recode command
recode Educ_mgr (0=1 "Illiterate") (1/2=2 "Medium level") (3 4 5 6=3 "Higher level"), gen(educ_recode)
Using the menu go to: Data > Create or change data > Other variable-transformation commands
> Recode categorical variable
Encode
Some data analysis routines in Stata (e.g. ANOVA) work with numeric variables only, and
therefore it becomes necessary to code string variables. The encode command can be used to
code string variables.
encode owner_gender, generate(owner_gendercode)
The string variable being encoded is automatically sorted in ascending order, then codes are
sequentially assigned starting with 1 and are stored in a new variable with the original strings
kept as value labels.
From the menu, go to: Data > Create or change data > Other variable-transformation commands
> Encode value labels from string variable
Decode
decode is the reverse command for encode for converting variables with value labels to string
variables.
decode owner_gendercode, generate(owner_gender_encoded)
From the menu, go to: Data > Create or change data > Other variable-transformation commands
> Decode strings from labeled numeric variable
mvdecode
When working with variables in STATA, we can define our own missing values, e.g. -999 or -888.
We need to explicitly tell stata that -999 or -888 are missing values by running several replace
commands or using mvdecode. Note the prefix “mv” stands for “missing value”.
You can use extended missing values (.a, .b, ..., .z) to preserve the type of missingness, for example:
-999 = No answer
-888 = Not applicable
To retain this information when defining missing values, map each code to its own extended
missing value:
mvdecode Firm_Income_hideskins, mv(-999=.a \ -888=.b)
Doing that for each variable in the dataset is time consuming. With mvdecode we can change all
variables in a single command using the _all keyword:
mvdecode _all, mv(-999=.a \ -888=.b)
Destring/Tostring
To convert variables stored as string to numeric variables use destring, and use tostring to
convert variables stored as numeric to string variables. When using the two commands, either
generate() or replace must be specified as one of the options.
tostring Tot_hired_lab, gen(Tot_hired_lab_string)
In the example dataset, the variable for income from hides and skins is stored as a string variable,
with entries such as "no income" if the firm did not earn income from the business. To use this
variable in any quantitative analysis it has to be converted to a numeric variable. NOTE that, in so
doing, the text entries will be converted to missing values:
destring Firm_Income_hideskins, gen(Firm_Income_hideskins2)
The command fails because the variable contains both numbers and text entries. To convert the
text entries into missing values, use the force option:
destring Firm_Income_hideskins, gen(Firm_Income_hideskins2) force
To destring from the menu, go to: Data > Create or change data > Other variable-transformation
commands >Convert variables from string to numeric
To tostring from the menu, go to: Data > Create or change data > Other variable-transformation
commands > Convert variables from numeric to string
Suppose you have datetimes stored in a string variable mystr, an example being "2010.07.12 14:32".
To convert it to SIF (Stata internal form) datetime/c, you type:
gen double eventtime = clock(mystr, "YMDhm")
The mask "YMDhm" specifies the order of the datetime components. In this case, they are year,
month, day, hour, and minute.
Run the command “help datetime” to learn more commands used to manage date variables.
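Continuing the example above, the new datetime variable can be formatted for display and its date part extracted (a sketch):

```stata
gen double eventtime = clock(mystr, "YMDhm")
format eventtime %tc              // display as a readable datetime
gen eventdate = dofc(eventtime)   // extract the date part
format eventdate %td
```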
Missing data are common in research. Stata depicts missing data with a "." for numeric variables
and a blank ("") for strings. Stata commands that perform computations of any type handle
missing data by omitting the missing values.
However, the way that missing values are omitted is not always consistent across commands, so
let's take a look at some examples (auto dataset).
summarize
For each variable, only the non-missing values are used in the calculation.
summarize rep78
tabulation
By default, missing values are excluded and percentages are based on the number of non-missing
values. If you use the missing (m) option on the tabulate command, missing values are included
as a category:
tabulate foreign
tabulate foreign, m
As a general rule, computations involving missing values yield missing values. For example,
gen sum1 = length + headroom
egen sum2 = rowtotal(length headroom)
The rowtotal function treats missing values as 0. The rowtotal function with the missing option
will return a missing value if an observation is missing on all variables.
egen sum3 = rowtotal(length headroom) , missing
Finally, you can use the rowmiss and rownonmiss functions to determine the number of missing
and the number of non-missing values, respectively, in a list of variables. This is illustrated below.
egen miss = rowmiss(length make)
egen nomiss = rownonmiss(length make)
Using the replace command, you can fix the missing values as follows, e.g.
replace rep78=0 if rep78==.
replace rep78=0 if missing(rep78)
The replace command
tab Subcounty
Sub county
Uriri
Bungoma North
Matayos
Matoyos
Nambale
Looking at the output table, you notice that there are two subcounties that look like they are
the same but with a slight variation in spelling. For example, Matayos is the same subcounty as
Matoyos, so one of them should be changed.
Replace Matoyos with Matayos using the replace command:
replace Subcounty="Matayos" if Subcounty=="Matoyos"
From the menu, go to: Data > Create or change data > Change contents of variable
Combining datasets
Combining datasets is the process of joining data from two different datasets. Broadly, there are
two ways of joining datasets: one involves adding records to existing variables, while the other
involves adding fields/variables. Each has specific requirements and modes of joining. For
example, to append records, the two datasets must have the same variable names, while merging
requires a common variable that is used to link the two datasets. STATA uses the merge and
append commands to join datasets.
append adds more observations to the dataset, while merge adds more variables to existing
observations. To illustrate this with a simple graphic, append makes the data longer, while merge
makes it wider. The most common way in which append is used with cleaning is to put separate
batches of entered data into one dataset.
MERGE
The merge command is used to add variables to an open dataset (the master file) from an external
dataset (the using file), based on a common variable that matches the records. Within STATA the
matching can be done on one or more common variables. The match merging can be one-to-one,
one-to-many, many-to-one or many-to-many. By default STATA creates a new variable _merge
that contains codes indicating the source of each merged observation. The meanings of the codes
are displayed as part of the output from the merging process. The merge command can also be
used to fill in missing observations via the update option, that is, when existing variables in the
master file have missing values for which the using file has values. The update option affects only
observations with missing values; if the using file is generally more up to date than the master
file, add the replace option as well (update replace), which also overwrites conflicting nonmissing
values in the master file with those from the using file.
We will merge two separate files: the first dataset contains the farmers' data and the second
dataset contains the crops that were sold by the farmers. We have to join the two datasets in
order to carry out some data analysis. Use the merge 1:m command to merge them, since the
dataset in memory (the master file) has unique observations for the hhid variable while the using
dataset contains a list of crops sold by each household (here hhid is repeated several times).
Open the crops_sold dataset
use "/Volumes/Transcend/Career/ILRI/Training/crops_sold.dta", clear
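A minimal sketch of the merge itself, assuming the farmers dataset (with unique hhid values) is the master file in memory and crops_sold.dta is the using file; the farmers.dta filename follows the naming used elsewhere in these notes:

```stata
* load the master file with one record per household
use "/Volumes/Transcend/Career/ILRI/Training/farmers.dta", clear

* one master record matches many crop records via the common key hhid
merge 1:m hhid using "/Volumes/Transcend/Career/ILRI/Training/crops_sold.dta"

* inspect the match results recorded in _merge
tab _merge
```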
The output shows the codes of the _merge variable. Here is an explanation of each of the
codes.
1 master - observation appeared in master only
2 using - observation appeared in using only
3 match - observation appeared in both
4 match_update - observation appeared in both, missing values updated
5 match_conflict - observation appeared in both, conflicting nonmissing value
Codes 4 and 5 can arise only if the update option is specified. If codes of both 4 and 5 could
pertain to an observation, then 5 is used.
Discussion: What changes do you see in the newly merged dataset?
From the menu, go to: Data > Combine datasets > Merge two datasets.
APPEND
Appending is used when records in the dataset to be analyzed were entered into separate
files but with the same variables. In the example below, two datasets (farmers1 and farmers2)
with the same variables were entered separately by two different data entry clerks. We use
append to join the two datasets into one:
append using "/Volumes/Transcend/Career/ILRI/Training/farmers2.dta"
From the menu go to: Data > Combine datasets > Append datasets
Changing the shape of the data
RESHAPE
Often data come from either cross-sectional or panel studies. Cross-sectional data is data
collected from subjects at one specific time while panel data is data collected from same subjects
repeatedly over a certain period. Some cross-sectional data have hierarchy e.g. data on
household members, whereby the main subject is the household head while data on members
in the household are a second level data on the same household and therefore the household
identification is repeated. Panel data and cross-sectional data are entered either in WIDE or
LONG formats. Here is a pictorial presentation of the different data formats we are likely to
encounter.
In the data below, you notice that each household produced and sold more than one crop,
and these are spread over several columns such that the hhid values are unique. This is a wide
format. It is possible to change data between wide and long formats in STATA using the reshape
command.
To reshape data from wide to long, STATA expects the variables to start with the same stubname,
that is, the prefixes must be the same, e.g. crop1, crop2, crop3 or price2015, price2016, price2017,
where the stubnames are crop and price. By default STATA assumes that the suffix is numeric, so
if the suffix is text, this must be stated with the string option; this includes suffixes that are
padded with zeros, e.g. 01, 02.
Since the variables do not have a stubname, we will add one by renaming the variables such that
each is preceded by the prefix “yield” as shown below:
rename * yield*
This adds the prefix “yield” to all the variables in the dataset. Since we do not want the hhid and
county variables to contain the “yield” prefix, we revert to their original names using the
following code:
ren (yieldhhid yieldcounty) (hhid county)
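With the stubname in place, the reshape itself can be run. A sketch, assuming the numeric suffixes on yield1, yield2, … identify the crop (the index name crop_no is an assumption):

```stata
* stack yield1, yield2, ... into one yield variable;
* crop_no records which column each value came from
reshape long yield, i(hhid) j(crop_no)
```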
View the reshaped dataset using the browse command. The dataset looks like this. Note that
the hhid variable is repeated.
If you need to use the reshape the data using the menu, go to: Data > Create or change data >
Other variable-transformation commands > Convert data between wide and long
By command
In Stata, you often find yourself needing to repeat a command, perhaps making minor
adjustments from one call to another. For example, suppose you want to examine summary
statistics for the variable Numberofvinecuttingskabode for all the different counties. You could
try:
summarize Numberofvinecuttingskabode if County == "Busia"
summarize Numberofvinecuttingskabode if County == "Migori"
summarize Numberofvinecuttingskabode if County == "Bungoma"
This is a very inefficient approach, especially when several counties are considered. What's
the solution? We use the by command:
sort County
by County: summarize Numberofvinecuttingskabode
by repeats Stata commands on subsets of the data, called "by-groups." All observations of each
by-group share the same value of the "by-variable" (for example, County above). The by
command above first ran summarize for all observations with County == “Busia”, then all those
with County == “Migori”, and so on. Each of these was its own by-group. by is followed by a list
of by-variables. This list can include more than one variable.
You might be wondering why we needed to sort by County before using by. by requires the
dataset to be sorted by the by-variables. You can do this before by, as we did above, or at the
same time by combining by and sort into one command, named bysort:
bysort County: summarize Numberofvinecuttingskabode
String cleaning
Very often, whether due to the enumerator, the data entry operator, or just the fact that certain
places, people, and things have different names, string variables will have multiple spellings for
the same response. For example a single county variable may have “Homa Bay” and “Homabay”.
We will discuss some of the approaches that can be used to clean string variables:
For string variables that are supposed to be only one word, you may want to remove all spaces
entirely using the following commands
replace County = itrim(trim(County)) OR
replace County = subinstr(County, " ", "", .)
The trim function removes leading and trailing spaces, while the itrim function removes extra
spaces in between words. The subinstr function searches for and replaces all instances of " "
with "". There are several other string functions that can be used to clean and create new
variables, such as strupper, strlower, substr, etc.
Suppose we want to remove typos from the County variable but cannot use any of the defined
functions. First you need to decide on the standardized responses, then make replacements. For
example, suppose these are all the same county: Homa Bay, Homa bay, Homabay, Homa bey. If
you wanted to standardize County by changing all these values to "Homabay", you could code:
replace County = "Homabay" if County == "Homa Bay" | County == "Homa bay" |
County == "Homa bey"
Sub-setting data
STATA allows you to keep and drop the variables that you need in your final dataset. This can
be achieved using the keep and drop commands.
Suppose we want to just have the variables hhid and County, we can keep just those variables,
as shown below.
keep hhid County
Perhaps we are not interested in the variables Wardcode Subcountycode and Countycode. We
can get rid of them using the drop command shown below.
drop Wardcode Subcountycode Countycode
STATA also allows you to keep and drop observations based on a specific condition. For
instance, we can eliminate the observations which have missing values using drop if as shown
below.
drop if missing(Countycode)
Let's illustrate using keep if to eliminate observations. Suppose we want to keep just the
households from Migori County.
keep if County == "Migori"
To keep or drop variables/observations using the menu, go to: Data > Create or change data >
Drop or keep observations
Macros
Loops
In Stata, loops are used to repeat a command multiple times or to execute a series of commands
that are related. There are several commands that work as loops:
foreach,
while,
forvalues,
if (different if from the if qualifier you learned before),
For example: suppose you need to transform a set of variables. Instead of doing it one by one,
you can loop through the variables.
foreach
The most basic foreach syntax runs the following way:
foreach item in list_item1 list_item2 list_item3 {
    command `item'
}
Where item is the macro that is used to signify each item of the list. So, as Stata loops through
the command for each item on the list, it will substitute each list_item into the macro, performing
the following:
command list_item1
command list_item2
command list_item3
For example:
foreach letter in a b c d {
display "`letter'"
}
In this case foreach … in is the command, a b c d is the list of items that you want to loop over,
and letter is the local you use to first declare and then call the list.
Now let’s try something serious
foreach var in price mpg weight length displacement {
    summarize `var'
}
This will loop through each variable in the list, summarizing one at a time.
For starters, please check out the rep78 variable – tab it, codebook it, etc. You will see that it has
5 values (1-5). Now suppose we wanted to create a separate indicator variable for each of the five
values. Before, we would've done it the slow way, writing five gen commands, one per value:
gen rep78_1 = (rep78 == 1)
and so on up to rep78_5. With a loop we can do it all at once:
forvalues i = 1/5 {
    gen rep78_`i' = (rep78 == `i')
}
That's 3 lines instead of 5, and one of them is just a bracket! Now imagine that had been 20 values,
or a hundred!
Descriptive statistics
Descriptive statistics are analyses that help describe, show or summarize data in a meaningful
way such that, for example, patterns might emerge from the data. Descriptive statistics do not,
however, allow us to make conclusions beyond the data we have analyzed or reach conclusions
regarding any hypotheses we might have made. They are simply a way to describe our data.
Measures of central tendency: these are ways of describing the central position of a frequency
distribution for a group of data. For example, the mode, median, and mean. Use the summarize
command to accomplish this:
Measures of spread: these are ways of summarizing a group of data by describing how spread
out the scores are. A number of statistics are available to us, including the range, quartiles,
absolute deviation, variance and standard deviation. Use the summarize command with the
detail option to accomplish this:
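For example, using the auto dataset that ships with Stata (the variable names come from that dataset):

```stata
sysuse auto, clear

* central tendency: mean, plus min/max and standard deviation
summarize price mpg

* spread: adds percentiles, variance, skewness and kurtosis
summarize price, detail
```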
There are three related commands that produce frequency tables for discrete (categorical)
variables: tab, tab1 and tab2.
tab:
i. Produces a oneway frequency table with percentages
tab County
ii. Produces a two-way frequency table (without percentages); no more than two
variables are allowed by Stata. The row and col options tell Stata to output row and
column percentages.
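A sketch of the two-way table with percentages; the variable names are reused from the tab1 example in this section:

```stata
* two-way table of county by gender of household head,
* with row and column percentages
tab County Genderedhouseholdtype1Femalh, row col
```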
Using the key, we can tell that the first row shows the frequencies, the second row shows
percentages of the total frequency for each county and the third row shows percentages of the
total frequency for each level of gender of the household head. For example, we can read off the
percentage of households in Bungoma County that are female-headed.
tab1
Produces a one-way frequency table for each variable in the variable list, unlike tab which allows
only one variable at a time when producing one-way frequency tables.
tab1 Genderedhouseholdtype1Femalh County Subcounty Ward
tab2
Produces all possible two-way tables from all the variables in the list (unlike the tab command)
tab2 Genderedhouseholdtype1Femalh County Subcounty Ward
From the menu, go to: Statistics > Summaries, tables, and tests > Frequency tables > One-way
table
OR
Statistics > Summaries, tables, and tests > Frequency tables > Multiple one-way tables
Instead of using the summarize command, one can also use the tabstat command which allows
you to specify the summary statistics of interest. Here are several examples
tabstat Numberofchildren023months Numberofpregnantwomen, by(County)
This command will output the means of the 2 variables for each county.
tabstat Numberofchildren023months, s(mean median sd var count range min max)
by(County)
The following command generates the mean and standard deviation of
Numberofchildren023months within each cell of a categorical two-way table (the nofreq option
suppresses the frequencies):
tab County Genderedhouseholdtype1Femalh, sum(Numberofchildren023months)
nofreq
Often data is collected in more detail than required for analysis. For example, data on income
from livestock sales is often collected at the animal level, but analysis is usually done at the
household level; it is therefore necessary to aggregate income from livestock sales to the
household level. Similarly, rainfall and temperature data is usually collected on a daily basis but
reported monthly or annually, and thus it has to be aggregated to the unit of analysis. STATA
provides the collapse command to make data aggregation possible.
collapse creates a summarized dataset from a master dataset. It is important to save the master
file before running the collapse command because it replaces the dataset in memory.
use "farmers.dta", clear
collapse (sum) Number_children, by(County)
From the Data Menu, go to: Create or change data > Other variable-transformation commands >
Make dataset of means, medians, etc.
There are times when it is necessary to perform an operation on a set of records or fields at the
same time. STATA has functions that can be used in these circumstances. For example, in
longitudinal studies, we may want to calculate change over time for each subject in a dataset
that is in long format.
Example: We want to assess the trend in price change between grades in the example dataset,
by calculating the difference in price between grades 1 and 2, 2 and 3, and then 3 and 4, within
household and species.
Sort by hhid, species and level, and then get the price difference between grades:
sort hhid species level
duplicates report hhid species level
bysort hhid species (level): gen price_change = gradepr - gradepr[_n-1]
collapse (mean) price_change, by(level)
Graphs provide a great way to explore your data visually. Stata has excellent graphics facilities,
accessible through the graph command; see help graph for an overview. This gives a list of
commands, i.e.,
graph twoway scatterplots, line plots, etc.
graph matrix scatterplot matrices
graph bar bar charts
graph box box-and-whisker plots
graph pie pie charts
Once graphs have been generated, one can use the following commands to save a previously
drawn graph, redisplay previously saved graphs, and combine graphs
graph save save graph to disk
graph use redisplay graph stored on disk
graph display redisplay graph stored in memory
graph combine combine multiple graphs
graph export export .gph file to PostScript, pdf, png etc.
We will use these commands extensively later in the course. First, we need to know the
difference between qualitative and quantitative data and how each data type can be plotted.
Qualitative data takes one of several categories, e.g. blood group, region etc. The count of the
number of subjects in each group is commonly referred to as the frequency. The Stata
command to produce a tabulation is tabulate varname.
This kind of data can be summarized graphically using any of the following graphs.
• Bar Chart: Data represented as a series of bars, height of bar proportional to frequency
(similar to histogram).
• Pie Chart: Data represented as a circle divided into segments, area of segment
proportional to frequency.
Quantitative data can take any numerical value, e.g. weight, height, price. A histogram can be
used to summarize continuous data.
• Area of bars proportional to probability of observation being in that bar
• The y-axis can be:
frequency (heights add up to n) - frequency option;
percentage (heights add up to 100%) - percent option;
density (areas add up to 1) - density option
A histogram can also be used to plot categorical data using the hist command and the discrete
option. Stata assumes by default that the variable supplied is continuous unless one adds the
discrete option. Note that string variables have to be converted to numeric first before being
used.
encode County, gen(County1)
hist County1, discrete percent
With a categorical variable, one can always generate graphs for each level of the variable, by use
of the by option. For example, using the farmers dataset, one can create a histogram for the
distribution of the number of children for each County.
hist Numberofchildren023months, by(County1) percent
You can also create an independent kernel density plot with the kdensity command
kdensity Numberofchildren023months
You can overlay a kernel density on your histogram just by adding the kdensity option (there's
also a normal option to add a normal density).
hist Numberofchildren023months, frequency kdensity
Graphing bivariate relationships
Use the graph bar command to plot, for instance, the mean number of children by county. To
accomplish this, go to the Graphics menu, then select Bar chart.
Under the Main tab, check the radio button Graph by calculating summary statistics. Then,
under Statistics to plot, select Mean from the drop-down and the variable to which to apply the
statistic. Under the Categories tab, select County as the category variable. See the screenshot
below.
One can also run this command directly from the Command line
graph bar (mean) Numberofchildrenunder5, over(County) bar(1, fcolor(ltblue))
twoway is a family of plots, all of which fit on numeric y and x scales. Two-way graphs show the
relationship between numeric data. Suppose we want to show the relationship between price
and mileage in the auto dataset. We could graph these data as a twoway scatterplot:
sysuse auto, clear
twoway scatter price mpg
or we could graph these data as a scatterplot and put on top of that the prediction from a linear
regression of price on mpg:
twoway (scatter price mpg) (lfit price mpg)
Combining graphs
To include multiple plots in a graph, they must be separated either by putting them in
parentheses or by putting two pipe characters between them (||).
Thus to create a graph containing two scatter plots of price and mpg, one for if foreign==1 and
another for which foreign==0, you can type either:
scatter price mpg if foreign==1 || scatter price mpg if foreign==0
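The graph combine command listed earlier places separately saved graphs side by side rather than overlaying them. A sketch using the auto dataset; the graph file names sc1 and h1 are hypothetical:

```stata
sysuse auto, clear

* draw and save two graphs to disk
scatter price mpg
graph save sc1, replace
histogram price
graph save h1, replace

* combine the saved graphs into one figure
graph combine sc1.gph h1.gph
```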
Statistical inference
Oftentimes we want to study a phenomenon about a population. Since we cannot collect data
from the whole population we pick a representative sample through sampling.
Descriptive statistics provide information about the immediate group of data. Inferential
statistics on the other hand, allow us to make generalizations about the population using the
sample.
This introduces some degree of uncertainty since you are using a sample to infer what would be
measured in a population.
In statistical inference, we test hypotheses. Hypothesis testing is used to establish whether a
research hypothesis extends beyond those individuals examined in a single study. To conduct a
study, a researcher has to go through the following steps:
i. Define the research hypothesis for the study.
ii. Set out the variables to be studied and how to measure them.
iii. Set out the null and alternative hypotheses
iv. Set the significance level – mostly 5%
v. Make a one- or two-tailed prediction.
vi. Determine whether the distribution that you are studying is normal.
vii. Select an appropriate statistical test based on the variables you have defined and
whether the distribution is normal or not.
viii. Run the statistical tests on your data and interpret the output.
ix. Reject or fail to reject the null hypothesis based on the p-value of the test statistic.
The figure below shows a flow chart of the most commonly used statistical tests.
The table below shows a more detailed description of the statistical tests depending on the
type of response variable.
Continuous outcome (e.g. blood pressure, age, pain score)
Independent observations:
• T-test: compares means between two independent groups
• ANOVA: compares means between more than two independent groups
• Pearson's correlation coefficient (linear correlation): shows linear correlation between two
continuous variables
• Linear regression: multivariate regression technique when the outcome is continuous; gives
slopes
Correlated/panel observations:
• Paired t-test: compares means between two related groups (e.g., the same subjects before and
after)
• Repeated-measures ANOVA: compares changes over time in the means of two or more groups
(repeated measurements)
• Mixed models: multivariate regression techniques to compare changes over time between two
or more groups
Alternatives if the normality assumption is violated (and small n) – non-parametric statistics:
• Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
• Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
• Kruskal-Wallis test: non-parametric alternative to ANOVA
• Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation
coefficient

Categorical outcome
Independent observations:
• Logistic regression: binary outcome (yes/no response)
• Ordinal logistic regression: ordered categories, e.g. Mild/Moderate/Severe pain, or Unlikely/
Somewhat likely/Very likely to adopt a new farming method
• Multinomial logistic regression: nominal response (type of farming method adopted: zero-
grazing, range farming, extensive)
Correlated/panel observations:
• GEE models: to model changes over time
• Fixed/random/mixed models: to take into account dependence of observations
Normality testing
It is important to test whether the continuous response variable follows a normal distribution or
not so as to be able to use the appropriate statistical method.
Stata provides graphical ways of checking the normality of continuous variables. For instance,
suppose we would like to explore the distribution of a variable called score. We can accomplish
this using a variety of plots:
• stem-and-leaf plot : use the Stata command - stem score
• Dot plot : use the Stata command - dotplot score
• Box plots - use the Stata command - graph box score
• Histograms - use the Stata command - hist score
• Distributional diagnostic plots such as P-P plot and Q-Q plot. From the Statistics Menu, go
to Summaries, tables, and tests > Distributional plots and tests > Normal probability plot,
standardized for a P-P plot. Furthermore, for a Q-Q plot, Go to - Statistics > Summaries,
tables, and tests > Distributional plots and tests > Quantile-quantile plot
The methods described above are only exploratory. They do not allow us to conclude whether
the variable score is normally distributed or not. To test for normality formally, we first state the
hypotheses as follows:
• Null: the data is normally distributed
• Alternative: the data is not normally distributed
In Stata, this test can be implemented using the following tests.
• Shapiro-Wilk test - swilk score
• Shapiro-Francia test - sfrancia score
• Skewness/Kurtosis test - sktest score
Test of Association
Tests of association assess whether categorical variables are related. Here we will use the
chi-square test of association. The null and alternative hypotheses are as follows:
Null: The two variables are not related
Alt: The two variables are related
When you choose to analyze your data using a chi-square test for association, you need to make
sure that the data you want to analyze conforms to two assumptions. These two assumptions
are:
• Both variables should be categorical (measured at the nominal or ordinal level).
• The observations should be independent, with each subject contributing to only one cell of
the table.
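In Stata the test is run by adding the chi2 option to a two-way tabulation. A sketch, reusing variable names from the descriptive-statistics section:

```stata
* chi-square test of association between county and
* gender of the household head
tab County Genderedhouseholdtype1Femalh, chi2
```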
Tests of difference
Independent T-Test
The independent t-test is used to determine whether the mean of a dependent variable, e.g.
maize yield, is the same in two unrelated, independent groups (e.g., males vs females, employed
vs unemployed). Specifically, you use an independent t-test to determine whether the mean
difference between two groups is statistically significantly different from zero.
Assumptions
Dependent variable should be continuous.
Independent variable should consist of two categorical, independent (unrelated) groups.
There should be no significant outliers
Dependent variable should be approximately normally distributed for each category of
the independent variable.
There needs to be homogeneity of variances. Tested using Levene's test for homogeneity
of variances.
Example: Test whether there is a difference in calves' birth weight based on the gender of the
calf. The hypotheses for this test are:
Null: mean birth weight of male calves = mean birth weight of female calves
Alternative: mean birth weight of male calves ≠ mean birth weight of female calves
pnorm calf_bweight
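Having checked normality with pnorm, the t-test itself can be run. A sketch, assuming the grouping variable is named sex (the same name used in the ranksum example in this section):

```stata
* independent t-test of calf birth weight by calf gender
ttest calf_bweight, by(sex)
```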
The Wilcoxon-Mann-Whitney test : used when the dependent variable is assumed not to be
normally distributed.
ranksum calf_bweight, by(sex)
One-way ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of two or more independent (unrelated)
groups.
When you choose to analyze your data using a one-way ANOVA, part of the process involves
checking to make sure that the data you want to analyze can actually be analyzed using a one-
way ANOVA. This involves checking whether the following assumptions are met by your data.
Dependent variable should be continuous.
Independent variable should consist of two or more categorical, independent (unrelated) groups.
There should be no significant outliers
Dependent variable should be approximately normally distributed for each category of
the independent variable.
There needs to be homogeneity of variances, tested using Bartlett's test for
homogeneity of variances.
You should have independence of observations
Example: An agricultural research company is comparing the weights of cows from 4 farms
where each has 500 cows. The company wants to know whether the average weight of the
cows differed based on the farm they came from.
The hypothesis for this study is as shown below.
Null: the average weight is the same in all four farms (av_weight1 = av_weight2 = av_weight3 =
av_weight4)
Alternative: At least one of the means is different
use farmer_cattle_weight.dta, clear
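The one-way ANOVA itself can be run with the oneway command, which also reports Bartlett's test for equal variances. A sketch, assuming the variables weight and farm used in the kwallis and regress examples below:

```stata
* one-way ANOVA of cow weight by farm, with group summary table
oneway weight farm, tabulate
```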
One of the assumptions of ANOVA is normality of the dependent variable. The Kruskal-Wallis
test is used when this normality assumption is violated.
kwallis weight, by(farm)
If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly
different value of chi-squared.
One-way ANOVA assumes homogeneity (equality) of variances. To test this, STATA uses
Bartlett's test, and provides results for Bartlett's test for equality of variance together with
the ANOVA results.
The null hypothesis of the test states that the variances are equal. If the assumption of
homogeneity of variance is violated, recast the ANOVA as a regression with dummy variables
and use the robust option, which outputs robust standard errors:
regress weight i.farm, robust
PostHoc Tests
If the ANOVA test is significant, we know that at least one group differs but not which ones, so
pairwise comparisons using Bonferroni, Tukey or Scheffé adjustments need to be conducted:
pwmean weight, over(farm) mcompare(tukey) effects
ANOVA can also accommodate more than one independent variable – two-way ANOVA:
anova weight farm##farm_size
Paired T-Test
Objective: to determine whether the mean of a dependent variable is the same in two related
groups. The assumptions are the same as those of the independent-samples t-test, except that
the independent variable should consist of two categorical "related groups" or "matched pairs".
In Stata :
ttest FirstVariable == SecondVariable
The Wilcoxon signed-rank test is the non-parametric version of a paired samples t-test. Note
that signrank uses a single equals sign:
signrank FirstVariable = SecondVariable
id time dv
1 1 4.5
1 2 3.0
1 3 2.5
2 1 7.2
2 2 4.2
2 3 2.4
Example: The weight of 2000 cows was measured every 3 months over a period of 1 year. We
would like to investigate whether there is a difference in the average weights over the 4 time
periods.
Here,
Dependent variable: weight
Independent variable: time (4 related groups)
Check the assumptions the same way as in the case of one-way ANOVA.
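A sketch of the repeated-measures ANOVA for this design, assuming the dataset is in long format with variables id, time and weight (as in the small example table above):

```stata
* repeated-measures ANOVA: id identifies the subject (cow),
* time is the within-subject factor
anova weight id time, repeated(time)
```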
Linear regression
Linear regression is used to establish the relationship between a dependent variable and one or
more explanatory variables.
Assumptions:
Normality of residuals
Homogeneity of variances
Dependent variable should be continuous
Independence of observations
Linear relationship
Example: A cross-sectional study involving 1500 farmers in Meru County. The aim of the study is
to find determinants of maize yield (bags per ha). Using maize_yield_ols.dta, we model maize
yield on the variety of maize planted.
use maize_yield_ols.dta, clear
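To check normality of the residuals, the model must first be fitted and the residuals saved; swilk is then run on the saved residuals. A sketch (the variable names maize_yield and variety are assumptions based on the description above):

```stata
* fit the model: yield regressed on maize variety (categorical)
regress maize_yield i.variety

* save the residuals for the normality check
predict res, residuals
```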
swilk res
Homogeneity of variances
First compute the studentized residuals, then plot a scatter plot of residuals against
predicted values. To obtain predicted values (yhat) from this regression, type
predict yhat, xb
To assess homogeneity, check for any pattern in the plot: a pattern indicates that the
assumption of homogeneity of variance has been violated. The Breusch–Pagan test is used to
formally test for heteroscedasticity.
estat hettest
Now let's move on to overall measures of influence, specifically Cook's D. The lowest value
that Cook's D can assume is zero, and the higher Cook's D is, the more influential the point.
The conventional cut-off point is 4/n, where n is the number of observations. We can list any
observation above the cut-off point by doing the following.
predict d, cooksd
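The listing step can be sketched as follows. This uses _N as n, which assumes no observations were dropped from the regression sample:

```stata
* list observations whose Cook's D exceeds the 4/n cut-off
list d if d > 4/_N & !missing(d)
```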
Now let’s take a look at DFITS. The cut-off point for DFITS is 2*sqrt(k/n). DFITS can be either
positive or negative, with numbers close to zero corresponding to the points with small or zero
influence.
predict dfit, dfits
Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which
the behavior of entities is observed across time. These entities could be states, companies,
individuals, countries, etc. Panel data looks like this.
Observations are not independent, since measurements at different time points on the same
entity are correlated. Therefore we have to fit a model that corrects for the lack of independence.
One can fit a model with either fixed or random effects depending on the study objective and the
assumptions we have about the respondents.
Fixed effects model
Here all the respondents are assumed to share a common effect. The goal of the analysis is to
generalize the results to the population from which the respondents were drawn (and not to
extrapolate to other populations). FE models explore the relationship between predictor and
outcome variables within an entity (country, person, company, etc.).
When using FE we assume that something within the individual may impact or bias the predictor
or outcome variables, and we need to control for this. Fixed-effects models are designed to study
the causes of changes within a person. A time-invariant characteristic, e.g. gender, cannot cause
such a change, because it is constant for each person.
Random effects model
By contrast, when the researcher is accumulating data from respondents that differ (e.g. by
geographical region or tribe), it is unlikely that they are all functionally equivalent.
Additionally, the goal of such an analysis is usually to generalize to a range of scenarios, making
it possible to extrapolate.
If you have reason to believe that differences across entities have some influence on your
dependent variable, then you should use random effects. An advantage of random effects is that
you can include time-invariant variables (e.g. gender); in the fixed effects model these variables
are absorbed by the intercept.
Example: 500 farmers were recruited into a study in which the quantity of maize they harvested
per ha was recorded over a period of 3 years. The aim is to study how the yields
evolve over time and how this evolution depends on some demographic variables.
The variables in the dataset are:
id: unique identifier for each farmer
bags_per_ha: response of interest (maize yield)
year: year (repeated measurements)
age: age of the respondent (in years)
education: education level of the respondent (low, medium, high)
gender_hhead: gender of the household head (male, female)
manager: who manages the farm (man, woman, child)
The first step is to declare the data as panel, with farmer (id) as the panel variable and
year as the order of observations within a respondent.
xtset id year
To summarize maize yield, decomposing standard deviation into between and within
components
xtsum bags_per_ha
The model is:
y_ij = β0 + u_i + β1·t_ij + ε_ij
Where;
β0: is the overall intercept
u_i (i = 1, …, 500): are the 500 subject-specific effects
y_ij: maize yield for year j in subject i
t_ij: year j for subject i
β1: time effect
ε_ij: the error term
Estimate a fixed-effects model with robust standard errors using Stata as follows.
xtreg bags_per_ha year, fe vce(robust)
The model is:
y_ij = (β0 + u_i) + (β1 + v_i)·t_ij + β2·age_i + ε_ij
Where;
β0: is the overall intercept
u_i (i = 1, …, 500): are the 500 subject-specific random intercepts
v_i (i = 1, …, 500): are the 500 subject-specific random slopes
y_ij: maize yield for year j in subject i
t_ij: year j for subject i
β2: age effect
β1: time effect
ε_ij: the error term
We will fit a mixed effects model with both fixed and random effects.
xtreg bags_per_ha year age, re vce(robust)
Used to model binary response - the response measurement for each subject is “success” or
“failure”. The most popular model for binary data is logistic regression.
In the logit model the log odds of the outcome is modeled as a linear combination of the
predictor variables. The odds ratio (OR) is the ratio of the “odds” of success versus failure for
the two groups:
OR = [π1/(1 − π1)] / [π2/(1 − π2)]
The log odds ratio is often used to remove the restriction that the odds ratio must be positive.
More on odds ratio
• OR=1 corresponds with no difference
• When 1 < OR < ∞, the odds of success are higher in group 1 than in group 2. Thus,
subjects in the first group are more likely to have successes than subjects in group 2.
• When 0 < OR < 1, the odds of success are smaller in group 1 than in group 2. Thus,
subjects in the first group are less likely to have successes than subjects in group 2.
• Values of OR farther from 1 in a given direction represent stronger levels of association.
Suppose you are comparing males and females on their likelihood to adopt a new farming
method (yes/no).
Total number of females = 100
Total number of males = 120
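As an illustration of the arithmetic (the adoption counts below are hypothetical; only the group totals come from the text), Stata's immediate command cci computes the OR directly from a 2×2 table:

```stata
* Hypothetical counts: 40 of 100 females and 30 of 120 males adopt.
* Treating adopters as "cases" and females as "exposed":
* cci #adopt_female #adopt_male #noadopt_female #noadopt_male
cci 40 30 60 90
* OR = (40*90)/(30*60) = 2: the odds of adoption are twice as high for females
```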
The logistic regression model has linear form for the logit of the success probability π(x) when X
takes value x:
logit[π(x)] = log( π(x) / (1 − π(x)) ) = α + βx
The relationship between π(x) and x is then described by the logistic function:
π(x) = exp(α + βx) / (1 + exp(α + βx))
Alternatively
logit[P(Y = 1)] = α + βx
• Logit = link function - says how the expected value of the response relates to the linear
predictor of explanatory variables.
• The parameter β determines the rate of increase or decrease of the curve.
• When β > 0, π(x) increases as x increases.
• When β < 0, π(x) decreases as x increases.
• When β = 0, π(x) does not change as x changes (no effect)
Example: A researcher is interested in how variables such as monthly income, land size and
education level affect the adoption of a new variety of potatoes.
The response is binary (adoption: 1 = adopted, 0 = did not adopt).
There are three predictors:
• Continuous variables: income, land_size
• Categorical variable: education – takes on the values 1 through 4. A rank of 1 implies
the highest level of literacy; a rank of 4 the lowest.
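The model whose output is discussed below can be fitted as follows; a sketch in which rank 4 (lowest literacy) is assumed to be the base category, matching the interpretation of the education coefficients:

```stata
* Logistic regression of adoption on income, land size and education;
* ib4.education sets rank 4 as the reference category (an assumption here)
logit adoption income land_size ib4.education
```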
This is interpreted as follows.
• The likelihood ratio chi-square of 41.46 with a p-value of 0.0000 tells us that our model
as a whole fits significantly better than an empty model.
• Both income and land_size are statistically significant, as are the three indicator
variables for education.
• The logistic regression coefficients give the change in the log odds of the outcome for a
one unit increase in the predictor variable.
• For every one-unit change in income, the log odds of adoption (versus non-adoption)
increase by 0.0002.
• For a one unit increase in land_size, the log odds of adoption increases by 1.6.
• Having a literacy level of rank 3 versus a literacy level of rank 4 (lowest) increases
the log odds of adoption by 0.21.
Other tests
We can test for an overall effect of education using the test command.
test 1.education 2.education 3.education
We can also test additional hypotheses about the differences in the coefficients for different
levels of education. Below we test that the coefficient for rank=2 is equal to the coefficient
for rank=3.
test 2.education = 3.education
You can also exponentiate the coefficients and interpret them as odds-ratios.
logit, or
Predicted probabilities
You can also use predicted probabilities to help you understand the model. You can calculate
predicted probabilities using the margins command
Below we use the margins command to calculate the predicted probability of adoption at each
level of education, holding all other variables in the model at their means.
margins education, atmeans
We can also generate the predicted probabilities for values of income from 2000 to 8000 in
increments of 1000 (continuous variable). The average predicted probabilities will be calculated
using the sample values of the other predictor variables.
For example, to calculate the average predicted probability when income = 2000, the predicted
probability was calculated for each case, using that case’s values of education and land_size,
with income set to 2000.
margins, at(income=(2000(1000)8000))
Ordinal logistic regression is used to predict an ordinal dependent variable (one with more than
two ordered categories) given one or more independent variables.
Example: A study looks at factors that influence the decision of whether to adopt a new potato
variety. Farmers are asked if they are unlikely, somewhat likely, or very likely to adopt the new
variety; hence, our outcome variable has three categories. Data on the farmer's educational
status, whether the farmer is male or female, and current land size in ha are also collected.
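A sketch of the model fit; the variable names (adopt_likely, education, gender, land_size) are assumed from the description:

```stata
* Ordinal logistic regression on the three-category adoption outcome
ologit adopt_likely i.education i.gender land_size
```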
ologit, or
One of the assumptions underlying ordinal logistic regression is that the relationship between
each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that
the coefficients that describe the relationship between, say, the lowest versus all higher
categories of the response variable are the same as those that describe the relationship between
the next lowest category and all higher categories, etc. This is called the proportional odds
assumption or the parallel regression assumption.
First, we need to download a user-written command called omodel. The null hypothesis is that
there is no difference in the coefficients between models, so we "hope" to get a non-significant
result. Please note that the omodel command does not recognize factor variables, so the i. is
omitted.
findit omodel
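Once installed, the test might be run as below (variable names assumed from the example; the i. prefixes are dropped because omodel does not recognize factor variables):

```stata
* Test the proportional-odds assumption by comparing the ordered model
* against separate binary logits
omodel logit adopt_likely education gender land_size
```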
The brant command performs a Brant test. As the note at the bottom of the output indicates,
we also "hope" that these tests are non-significant. The brant command is part of
the spost add-on and can be obtained by typing findit spost. We will use the detail option
here, which shows the estimated coefficients for the two equations.
brant, detail
A generalized ordered logistic model using gologit2 should be fit if the assumption is
violated. You need to download gologit2 by typing findit gologit2.
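A sketch of the fallback fit, with the same assumed variable names as above:

```stata
* gologit2 with autofit relaxes the proportional-odds constraint only for
* the variables where the assumption is rejected
gologit2 adopt_likely education gender land_size, autofit
```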
Predicted probabilities
margins, at(education=(0/1)) predict(outcome(0)) atmeans
Way forward
There is a wealth of material on Stata online, and we encourage you to search for and refer to
these support materials as the need arises. Some of the materials are written by specialists in
different disciplines, so you are likely to find specialized, tailored guides and instructions.