
DATA MANAGEMENT AND ANALYSIS USING STATA

Nicholas Ndiwa

ILRI – ISTVS Training


14th – 21st April 2018
Table of Contents

INTRODUCTION TO STATA
DIRECTORY MANAGEMENT
SET COMMANDS IN STATA
OBTAINING HELP AND PERFORMING SEARCHES
STATA SYNTAX
USING BASIC STATISTICAL COMMANDS
USER-WRITTEN EXTENSIONS
DATA STRUCTURES AND TYPES OF VARIABLES
IMPORT, EXPORT, LOAD AND SAVE DATASETS
KNOW YOUR DATA
DUPLICATES
SORTING AND ORDERING
EXECUTING COMMANDS USING DO-FILES
OUTPUT MANAGEMENT
GENERATING NEW VARIABLES
    EGEN
    CREATING DUMMY VARIABLES
LABELLING VARIABLES
DEFINING VALUE LABELS
CHANGING VARIABLE CONTENT
    RECODING VARIABLES
    ENCODE
    DECODE
    MVDECODE
    DESTRING/TOSTRING
MANAGING DATETIME VARIABLES
MANAGING MISSING DATA
THE REPLACE COMMAND
COMBINING DATASETS
    MERGE
    APPEND
CHANGING THE SHAPE OF THE DATA
    RESHAPE
THE BY COMMAND
STRING CLEANING
SUB-SETTING DATA
MACROS
LOOPS
    FOREACH
DESCRIPTIVE STATISTICS
SUMMARIZING GROUPED DATA
TABLES OF SUMMARY STATISTICS
AGGREGATING/SUMMARIZING DATA SETS
BULK DATA PROCESSING
INTRODUCTION TO GRAPHING USING STATA
GRAPHING BY CATEGORICAL GROUPS
GRAPHING BIVARIATE RELATIONSHIPS
THE TWOWAY COMMAND
COMBINING GRAPHS
ADDING TITLES AND LEGENDS TO GRAPHS
STATISTICAL INFERENCE
NORMALITY TESTING
TEST OF ASSOCIATION
TESTS OF DIFFERENCE
TEST OF DIFFERENCE FOR REPEATED MEASURES DATA
REGRESSION ANALYSIS WITH STATA: CROSS-SECTIONAL DATA
INTRODUCTION TO PANEL DATA ANALYSIS
CONDUCTING PANEL DATA ANALYSIS IN STATA
INTRODUCTION TO LOGISTIC REGRESSION
FORMULATION OF THE LOGIT MODEL
FITTING A LOGISTIC REGRESSION MODEL IN STATA
ORDINAL LOGISTIC REGRESSION
WAY FORWARD

Introduction to Stata

What is Stata?
Stata is a versatile statistical package used for data management, analysis, and the production of
graphical output. It was initially popular among economists but is nowadays used by
researchers from all fields.
Why Stata?
Stata is popular because it is user-friendly, offers perpetual licensing options, has very powerful
and specialized user-written add-ons, and supports reproducible reporting. It offers both
menu-driven and command-driven interaction.
The Stata Environment

[Screenshot: the Stata interface, with the toolbar and the main windows labelled]

Directory management

A directory is a folder where the files you are working on are stored.
Why create a directory?

• Your work is kept safe in one known location
• You can access your files without having to give Stata the full path to each file
How to create a working directory
By default, Stata saves or retrieves files from a default folder specified during installation.
Users can change the default working folder to their preferred one using the change directory
(cd) command.
The cd command can be used on its own to display the directory you are currently working in,
while the command followed by a directory name changes the directory you work in. For
instance, let's change our directory to "/Volumes/Transcend/Career/ILRI/Training":
cd "/Volumes/Transcend/Career/ILRI/Training"

Note that if your directory path contains embedded spaces, you will need to put the path in
double quotes.
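To confirm the current working directory at any point, type cd with no arguments, or its synonym pwd:

pwd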
To change the directory from the menu go to: File > Change Working Directory…

Set commands in Stata

The set command is used to control the STATA operating environment. There are dozens of set
commands, but many of them are rarely used.
If you get the error message "no room to add more observations", this means the data file is too
big for the memory allocated to Stata. You can use the command below to increase the memory
allocated to Stata.
set mem XXm

This sets memory to XX megabytes. You cannot set memory greater than the total memory in the
computer (physical plus virtual), and setting too much memory lowers the processing speed of the
computer, so be cautious. Note that from Stata 12 onward, memory is managed automatically and
set mem is no longer needed.
If the problem is in variable allocation (default is 5,000 variables), you increase it by typing, for
example:
set maxvar 10000

To check the initial parameters type


query memory

Another SET command commonly used is the set more command. This command is used to turn
on and off the continuous scrolling of output.
set more off [, permanently]

Use it if you are not interested in the intermediate output, only the final result. The permanently
option keeps this instruction to Stata across sessions. Use "set more on" if you need to be
able to read the early output.
Another SET command is the set trace command which traces the execution of programs for
debugging. set trace off turns off tracing after it has been set on.
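For example, to trace a misbehaving block of commands:

set trace on
* ... run the command being debugged ...
set trace off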

Obtaining help and performing searches

To get help in Stata, type help followed by the command name; this provides help for built-in Stata
commands. To get help on user-written commands, type findit followed by the command name.
There are also several online materials that cover almost all Stata components.
Alternatively, click on "Help" from the Menu bar, then select "Search". Type the search string in
the dialog box that appears and click "OK".

For example to search for help on the command summarize, type:


help summarize

findit summarize

Stata syntax

Stata is case sensitive: all Stata commands must be typed in lower case, while file and
variable names must be typed in the exact case (lower or upper) in which they were defined.
A Stata command has the following structure:
keyword argument, options

The keyword is the Stata reserved command; the argument is the parameter, supplied by the user,
that the keyword acts on; and the options are additional instructions that control how the
command is executed. A SPACE separates the keyword from the argument, while a COMMA separates
the argument from the options. For example, to open a dataset we use the command below:
use "example data1.dta", clear

use is the keyword, "example data1.dta" is the argument, and clear is the option.

An extensive definition of the Stata syntax structure is:

cmd [varlist | namelist | anything] [if] [in] [using filename] [=exp] [weight] [, options]

where:
if - applies the command cmd only to those observations for which the value of the expression
is true
in - restricts the command to a specific observation range
weight - indicates the weight to be attached to each observation, e.g. pweight for sampling weights
=exp - the value to be assigned to a variable; most often used with generate and replace
using filename - the file the command reads from or writes to, e.g. mydata or mydata.dta


The parts enclosed in [ ] imply that they are optional. To check the syntax for a specific command
e.g summarize, type
help summarize

------------------------------------------------------------------
help for summarize (manual: [R] summarize)
------------------------------------------------------------------
Summary statistics
------------------
[by varlist:] summarize [varlist] [weight] [if exp] [in range]
    [, {detail | meanonly} format]

Using basic statistical commands

Stata contains several example datasets that come pre-installed with it. To load a
specified Stata-format dataset, use the command:
sysuse filename

If you are not sure which example dataset to load, use the command below to list the names of
the datasets shipped with Stata, or go to File > Example Datasets... from the Menu.
sysuse dir, all

Now let’s load an example dataset called auto and run some basic STATA functions.
sysuse auto.dta

To obtain the summary statistics for all of the variables in the data file use the command
summarize or just su

For specific variables: statistics just for mpg and price.


summarize mpg price

summarize with the in qualifier, specifying a range of records to be summarized:

summarize in 1/10

summarize with a simple if qualifier, specifying which records to summarize:

summarize if foreign == 1

summarize with a compound if qualifier:

summarize if foreign == 1 & mpg > 30

summarize followed by option(s).


summarize, detail

Here is a list of basic STATA commands


To view the dataset in memory, type:
edit, to view and edit

browse, to just view

list [varlist]

To view information/description of the variables in a dataset


inspect varlist

codebook varlist

describe

To perform simple analyses:


summarize

tabulate

Re-using commands:
If you want to reuse a command, all you need to do is double-click on it in the
Review window.

User-written extensions

There are a tremendous number of user-written programs for Stata available which, once
installed, act just like official Stata commands.
Several organizations host user-written programs, including the Boston College Statistical
Software Components (SSC) archive and the Stata Journal.
So, how do you start to use user-written commands? There are several approaches based on
whether the command is found on SSC, elsewhere online, or currently unavailable online.
SSC programs
SSC is the largest repository of Stata user-written programs. To install an SSC package named
package, type in Stata:
ssc install package

For example, a popular SSC program is estout. To install it, type on the Stata command line:
ssc install estout

After it’s installed, type:


help estout

to learn more about the program.


This might not work depending on the internet connection you currently have.
Other programs publicly available online
To search for a user-written package publicly available online that isn't on SSC, type in Stata:
net search package

A list of packages matching the search is displayed. Click on the package you want, and a
window will appear. Follow the instructions to install the package.
Alternatively, one can search for and install a package using the findit command:
findit catplot

Data structures and types of variables

This section distinguishes between categorical and continuous data, and between numeric, string,
and date/time variables.
Common terms used in STATA
Records (or cases or observations): Individual observations (e.g. farm plots, households, villages,
or provinces). Usually considered to be the "rows" of the data file.
Variables: Characteristics, location, or dimensions of each observation. Considered the
“columns” of the data file.
Levels: The level of a dataset describes what each record represents.
Discrete variables (or categorical variables): Variables that have only a limited number of
different values (e.g region, sex, type of roof, and occupation).
Binary (or dummy, or dichotomous) variables: Discrete variables that take only two
values, e.g. yes/no, male/female, have/don't have, or other variables with only two
values.
Continuous variables: Variables whose values are not limited. Examples: per capita expenditure,
farm size, number of trees. Usually expressed in some units such as shillings, kilometers, hectares,
or kilograms. Also, may take fractional values.
Variable labels: Longer names associated with each variable to explain them in tables
and graphs, e.g. the variable REGION could have the label "Region of Kenya".
Value labels: Longer names attached to each value of a categorical variable. For
example, if the variable REG has four values, each value is associated with a name: the value
label for REG=1 could be "Northern Region", REG=2 "Central Region", and so on.

Data storage types
Numeric types:
Stata provides five numeric types for storing variables, three of them integer types and two of
them floating point.
The floating-point types are float and double.
The integer types are byte, int, and long.
String storage type:
Text variables (such as names of respondents) are best stored as string characters
Note: Variables that are stored as string characters do not allow any statistical analysis on them
other than frequency counts
Stored Data Colours
When you browse the data, the spreadsheet is displayed and you can distinguish some of the
storage types by looking at the colors, for example:

• String (text) values appear red
• Numeric values without value labels appear black
• Numeric values with value labels appear blue

Operators
Stata uses specific mathematical and logical operators:

Operator   Meaning                     Remarks
+          Addition                    Be careful when adding columns with missing
                                       values; use the sum() function instead
-          Subtraction
*          Multiplication
/          Division
==         Equal to                    Note the difference from the single = (assignment) sign
!=         Not equal to
>          Greater than
>=         Greater than or equal to
<          Less than
<=         Less than or equal to
&          And
|          Or
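A quick illustration of the comparison and logical operators inside an if qualifier, using the auto example dataset:

sysuse auto, clear
count if foreign == 1 & mpg >= 25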

Special symbols
Missing values: The default Stata symbol for a missing value is the "." (dot), which Stata
interprets as a very large value. There is also an option of defining your own missing value indicator.
For example, change all zero entries for hired labor to missing using the command below, then
browse to see the change:
mvdecode Fem_hired_work Tot_hired_lab, mv(0)

browse Male_hired_work Fem_hired_work

_n: This is used to refer to the record number (we will see examples of its applications)
_all: This is used to apply a command globally (we will see examples of its applications)
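A minimal illustration of both (row_id is a new variable created here):

gen row_id = _n        // stores each record's observation number
* _all refers to all variables, e.g. in mvdecode _all or drop _all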
Variable types
Two main data types used in STATA are Numeric and String. Numeric data can be stored in
different formats depending on the required precision (table below).

Type     Minimum                        Maximum
byte     -127                           100
int      -32,767                        32,740
long     -2,147,483,647                 2,147,483,620
float    -1.70141173319 x 10^38         1.70141173319 x 10^38
double   -8.9884656743 x 10^307         8.9884656743 x 10^307

Import, export, load and save datasets

Inbuilt datasets
Stata comes with built-in datasets, called example datasets, for users to practice with.
To access these datasets from the menu, go to:
File > Example datasets > Datasets installed with Stata, then click "use" to open the
dataset.

Example datasets within Stata can also be accessed using the command sysuse filename if
you know the exact name of the example dataset, e.g.
sysuse auto

Import external datasets.


There are several commands for getting existing data into Stata, depending on the file format:
i) use - for Stata datasets, e.g.
use "No_farmers.dta", clear

ii) insheet - for comma-separated or tab-delimited data created using a spreadsheet or database
program, e.g.
insheet using "farmers.csv", clear

iii) insheet also reads delimited text files:

insheet using "farmers.txt", clear

(In Stata 13 and later, the import delimited command supersedes insheet.)

iv) import excel - for Excel files:

import excel using "farmers.xlsx", clear firstrow

The firstrow option tells Stata to treat the first row of the Excel data as variable names. There are
several other commands that can be used to import data into Stata. Run the command "help
import" to learn about them.
To import external datasets in other formats apart from Stata datasets from the menu, go to
File > Import >.. Then select the type of file you want to import

v) Other sources
Stata can import data from several other sources: such as SPSS/PASW, SAS, etc. Data from SPSS
and SAS require user written programs before data importation. The commands below show how
to install the user written programs using the command ssc install package_name and how to
use the packages to import data.
Dataset from SPSS/PASW:
ssc install usespss // installs the required package

usespss using "farmers.sav", clear // imports the dataset

Dataset from SAS:

ssc install usesas // installs the required package

usesas using "farmers.sas7bdat", clear

Export the dataset


Saving STATA format data
save "example data2.dta"

Exporting data to Excel


export excel using "example1", sheetmodify firstrow(variables)

Exporting data to delimited format


export delimited using "example1", replace

There are special commands for exporting Stata files to other statistical packages
such as SPSS and SAS, or for saving Stata data in formats readable by older versions of Stata.
Install and try any of the following (saveold is built into Stata; the others are user-written):
savespss - saves Stata data in SPSS format

usespss - opens SPSS data in Stata

saveold - saves Stata data in a format readable by preceding versions of Stata, e.g. data from
Stata 14 can be opened in Stata 13, 12, 11, or 10

use13 - allows users of Stata 10-12 to open Stata 13 files

Know your data

Before working with any dataset it is important to get a glimpse of its contents in terms
of the types of variables and their values. Stata provides several commands for getting
descriptions and contents of the variables in the data.

If you had closed the example data set, open it again.
use "farmers.dta", clear

Below are the common commands used to display dataset summaries or view the data.
describe: produces a summary of the active dataset. Additional options can be included in the
command to get specific or additional information (type help describe to see the available
options). Test the following options and discuss the differences:
describe

describe, short

describe, numbers

codebook: similar to describe but provides additional summary statistics (range, frequencies) for
each variable.
codebook

browse: Displays the data in a spreadsheet format (columns/rows) but one is not able to edit or
modify the data contents
browse

edit: Displays the data in a spreadsheet format (columns/rows) and allows the user to edit or
modify the data contents
edit

For both browse and edit commands, STATA displays data in different colors depending on the
data type, for example, string variables in RED, value labels in BLUE and numeric in BLACK. One
can change the default data codes by selecting Edit->Preferences menu options.
Labelling datasets
STATA has an option of saving a dataset with a brief description of the file contents
(documentation) using the label data command.
Get description of the example data using describe command

Notice that there is no description of the data! Label the data using the command
below, then run describe again and notice the difference.
label data "This data contains the list of farmers receiving vines"

Duplicates

Datasets have a unique ID, a variable that uniquely identifies observations. Unique IDs can
distinguish respondents from each other, so that John Doe is identified by the value 1 of variable
uniqueid, Jane Smith is identified by the value 2, and so on. An observation can also be uniquely
identified by a set of variables, for example, the combination of a household ID and a respondent
ID.
One of the first things that you will need to do when cleaning your data is cleaning the ID. More
specifically, you need to check that the ID is unique. If it isn't, you’ll need to find how many and
which IDs are duplicates, and resolve them.

A unique ID is important for all sorts of reasons. Some of these are:

• Matching respondents across rounds (for example, matching baseline and endline
observations) and across datasets (for example, survey data and administrative data)
• Data entry reconciliation
1. isid
Probably the easiest way to figure out whether a variable (or a combination of several variables)
uniquely identifies observations is the isid command. isid produces no output if the variable
(or set of variables) is a unique ID, and gives an error if it is not.
To test this command, check whether hhid is a unique identifier in our dataset:
isid hhid

Ok, so we’ve established that hhid by itself is not a unique identifier.


2. Other ways to check for duplicates
Another useful set of commands that goes more in depth than isid on “duplicated” variables, and
is useful for identifying unique IDs, is duplicates. Let’s use what we learned with the help files
and pull up the file for the duplicates commands. Type in:
help duplicates

Please read the help file before proceeding further.


As you can see, the set of duplicates commands are used to report, display, list, tag, or drop
duplicate observations, depending on the subcommand specified. Duplicates are observations
with identical values either on all variables if no varlist is specified or on a specified varlist. What
that last sentence means is that if you use duplicates and specify a variable, it will only
check/tag/drop observations with identical values in that variable. However, if you simply say
“duplicates report” (or drop or tag) without specifications, Stata will look at observations that are
identical across all variables.
Let’s focus on duplicates report for now. What does the command do?
The command produces a table showing observations that occur as one or more copies and
indicating how many observations are "surplus" in the sense that they are the second (third, ...)
copy of the first of each group of duplicates.
Please figure out the syntax and use the command to look at the hhid variable. How many IDs
have more than one observation assigned to them? Why is that?
duplicates report hhid

We find that there is a “surplus” of 13 observations.
Try browsing the dataset to figure out why we have duplicates on the hhid variable.
Before we move on to that, however, let’s briefly look at the duplicates tag and drop commands.
Duplicates tag essentially does the same thing as duplicates report, but also generates a new
variable (that people usually call “tag”) that shows you how many duplicates each observation of
that variable has. So, if there are no duplicates (observation is unique), the tag will be equal to 0.
If there is 1 duplicate observation—i.e. there are 2 observations that look the same, one of them
an “extra”, —the tag will be equal to 1 for both observations – the “1st” and the
“extra/duplicate”. If there are 2 duplicates (3 observations that look the same), tag will be equal
to 2… and so on.
Let's try to generate a tag (called tag1) for hhid duplicates, just for practice. Try to look up how
to do it using a help file.
duplicates tag hhid, gen(tag1)

Now tab the tag variable and browse just the observations that are duplicated.
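For example, using the tag1 variable created above:

tab tag1
browse if tag1 > 0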

Finally, duplicates drop removes the duplicates. Please use it sparingly, and NEVER use it without
first checking out what the duplicates are and why they are there very carefully using the other
duplicates commands.
duplicates drop

To manage duplicate observations using the menu, go to: Data > Data utilities > Manage
duplicate observations

Sorting and ordering

sort – arranges the observations of the current data into ascending order based on the values
of the variables you list after the command. Suppose we wanted Stata to look at the data in the
order of the hhid. To accomplish this, say:
sort hhid

Now browse again. You can see the data are sorted by household ID. What if you wanted
Stata to sort by hhid within each county? You simply list the largest category first, and then add
extra variables to the command to sort within the bigger categories. So, to sort by household ID
within counties, you'd type:
sort County hhid

Browse to see what that did. There is actually no limit to the number of variables that you can
sort by at one time, and Stata just reads them left to right and sorts in that order.
Certain commands will require you to sort the data beforehand, and sometimes it’s a great trick
to use with commands that don’t require it. However, you have to be very careful when using
sort.
To sort the data in descending or ascending order, use the gsort command with a + or - before
each variable you are sorting by. For example, to sort the data in descending order, use the
following command:
gsort -hhid

To sort via the menu go to: Data > Sort


order - order relocates varlist to a position depending on which option you specify. If no option
is specified, order relocates varlist to the beginning of the dataset in the order in which the
variables are specified.
Suppose you want to move County and Subcounty to the beginning of the dataset. Use the
option first to accomplish this:
order County Subcounty, first

Suppose you change your mind and want the County and Subcounty variables to appear after the
hhid. This can be accomplished as follows:
order County Subcounty, after(hhid)

To order via the menu go to: Data > Data utilities > Change order of variables

Executing commands using do-files

Stata comes with an integrated text editor called the Do-file Editor, which can be used for many
tasks. To open the Do-file Editor, either type doedit in the command line or click the Do-file
Editor icon on the toolbar.
doedit

To run commands you have written, select the line(s) and click on the Execute (do) icon in the
Do-file Editor window.

Once the do-file is open, type a few commands.

Notice the different colors of the syntax elements; you can change them under Edit > Preferences.
Click on the Do button to execute the commands. Stata executes the commands in sequence,
saving them to a temporary file and using the do command to run them.
Save the do file with a .do extension. After you have saved your do-file, you can execute the
commands it contains by typing do filename, where the filename is the name of your do-file
do analysiscode.do

Tips for programming do files

You can continually update your do-file with additional commands. You can try your commands
interactively, and if they seem to work, cut and paste them into the do file.
If you put a * before a line in the do-file, Stata will not execute that line. This serves two
different purposes:
1. you can rerun your do-file while leaving out certain commands.
2. you can annotate your do-file.
You can have Stata skip over several lines by using /* and */.
clear

use filename

log using filename, text replace

*THE NEXT FEW LINES GENERATE LOGS OF VARIABLES

gen lprice3= log(price3)

gen lprice1=log(price1)

/*THE NEXT LINE EXECUTES THE BOXCOX TEST

xi:boxcox sales3 pr* i.store*/

If there is a syntax error in your do-file, Stata will stop execution at the point of the error. You
can go back to the do-file editor, correct the syntax error, and rerun your program.
You may want to create two do-files for any project. The first manipulates the data and creates
new variables. At the end of this do-file, be sure to save the resulting data set in a new data file.
The second file uses the data set you created in the first file to perform all of your analyses.
Alternatively, you might prefer to work interactively. In this case, start analyzing your data
interactively, normally. Right click in the Review window. An intuitive menu will appear. You can
use the menu to:

 Delete commands that you do not want to keep (be sure to highlight these commands
before deleting)
 Highlight the entire contents of the review window (“select all”)
 Send the highlighted commands to the do-file editor (“send to do-file editor”)
Edit your do-file Use the Stata do-file editor or even MS Word to edit your file; correct mistakes,
add comments (*)
Execute your do-file Type do filename in the Stata Command window or click on the execute
do-file button from the Stata do-file editor.

Calling other do-files
Say that you wrote makedata.do, which infiles your data, generates a few variables, and saves
step1.dta.
Say that you wrote anlstep1.do, which performs a little analysis on step1.dta. You could then
create a third do-file, master.do:
do makedata

do anlstep1

and so in effect combine the two do-files. Do-files may call other do-files, which, in turn, call other
do-files, and so on. Stata allows do-files to be nested 64 deep.
Writing long commands
You can change the end-of-line delimiter to ‘;’ by using #delimit,
you can comment out the line break by using /* */ comment delimiters,
or you can use the /// line-join indicator.
In the following fragment of a do-file, we temporarily change the end-of-line delimiter:
use mydata

#delimit ;

summarize weight price displ headroom rep78 length turn gear_ratio if


substr(company,1,4)=="Ford" | substr(company,1,2)=="GM", detail ;

#delimit cr

gen byte gm = substr(company,1,2)=="GM"

Once we change the line delimiter to semicolon, all lines, even short ones, must end in
semicolons.
Stata treats carriage returns as no different from blanks. We can change the delimiter back to
carriage return by typing #delimit cr. The #delimit command is allowed only in do-files—it is
not allowed interactively.
The other way around long lines is to comment out the carriage return by using /* */ comment
brackets or to use the /// line-join indicator.
use mydata

summarize weight price displ headroom length turn gear_ratio /* */

if substr(company,1,4)=="Ford" | /* */

substr(company,1,2)=="GM", detail

gen byte ford = substr(company,1,4)=="Ford"

OR
use mydata

summarize weight price displ headroom rep78 length turn gear_ratio ///

if substr(company,1,4)=="Ford" | ///

substr(company,1,2)=="GM", detail

gen byte ford = substr(company,1,4)=="Ford"

Output management

Unlike SPSS, the Stata Results window does not automatically keep all the output you generate. It
only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add
new ones.
set scrollbufsize XX

This is used to change the amount of output that Stata will store in the Results window. XX is
expressed in bytes. The default is 32000 and the maximum is 500000. Type "help set" for a list of
other settings in Stata.
Stata output files are called log files (saved with a .log or .smcl extension). You can save all your
Stata output in a log file, a sort of built-in tape recorder where you can retrieve the
output of your work and keep a record of it.
Creating log files
In the command line type:
log using mylog.log OR log using mylog, text

This will create the file ‘mylog.log’ in your working directory. You can read it using any word
processor (notepad, word, etc.). To close a log file type:
log close

To add more output to an existing log file add the option append, type:
log using mylog.log, append

To replace a log file add the option replace, type:


log using mylog.log, replace

Note that the option replace will delete the contents of the previous version of the log.
Alternatively you can log using SMCL extension, a language Stata understands. Created as
before,

log using mylog.smcl

With the .smcl file, one can convert the log to a text or PDF file using the translate command.
translate mylog.smcl mylog.log

translate mylog.smcl mylog.pdf

If the mylog.log file already exists and you wish to overwrite the existing copy, you can specify
the replace option:
translate mylog.smcl mylog.log, replace

To start a log file from the menu, go to: File > Log > ... A drop-down appears with
several options, such as Begin, Append, and Translate a log file.

Generating new variables

Stata has two commands for creating new variables: generate (gen) and egen (extended
generate). Although the two commands perform similar functions, egen is used with statistical
functions such as sum(), mean(), etc., while generate is limited to straightforward mathematical
operations. In summary, use egen when performing more complex operations.
Example: calculate the total number of hired laborers for each record, then browse to see the results.
gen total_hired_labor= Male_hired_work+Fem_hired_work
browse Male_hired_work Fem_hired_work total_hired_labor

Do you notice anything unusual/wrong with the calculations?


Repeat the same calculation using egen:
egen total_hired_labor2= Male_hired_work+Fem_hired_work

You get the error message: unknown egen function Male_hired_work+Fem_hired_work()

egen works with functions, and therefore it assumes Male_hired_work+Fem_hired_work is a
function, hence the error. The correct call uses the rowtotal() function:
egen total_hired_labor2= rowtotal(Male_hired_work Fem_hired_work)

Browse both fields and discuss the difference


browse Male_hired_work Fem_hired_work total_hired_labor total_hired_labor2

To generate variables with the gen command from the menu, go to: Data > Create or
change data > Create new variable, and use the dialog box that appears.
To create a new variable with the egen command, go to: Data > Create or
change data > Create new variable (extended).

More on Egen

egen allows you to use Stata's built-in functions to generate new variables. For example, to
create a variable containing the maximum number of vines allocated to a household we can
use:
egen maxVines = max(Numberofvinecuttingskabode)

egen is especially useful for by-groups; for example, to calculate the minimum number
of vines by county, we use:
bysort County: egen minVines = min(Numberofvinecuttingskabode)

Above, min() and max() are statistical egen functions that summarize across observations;
creating variables that depend on multiple observations is one of egen's chief roles.
There are other egen functions apart from the ones we have used. Type "help
egen" to check the list of functions.

Creating dummy variables

Suppose we want to dichotomize a continuous variable. This code will assign a value of 1 if the
car repair record rep78 > 3 and 0 otherwise.
gen rep_cat = rep78>3

However, beware of missing values!


STATA treats a missing value as the largest possible value (e.g., positive infinity).
Therefore, we will explicitly exclude missing values to make sure they are treated properly, as
shown below.
gen rep_cat2 = rep78>3 if rep78!=.

The tabulate command with the generate() option creates a full set of dummy variables
from a categorical variable. For example:
tab price_cat3, generate(price_dum)

Labelling variables

It is important to document your dataset and one way of doing this is attaching meaningful labels
to the variables. This is done using the label variable command in STATA
label variable total_hired_labor2 "total number of hired workers in the farm"

Using the menu go to: Data > Variables Manager, click on the variable of interest then label the
variable by typing in the text box under “Label”

Defining value labels

In addition to labelling variables, another important piece of dataset documentation is defining
what the values of categorical variables mean, for example, the form of ownership (Ownership) in the
example dataset. Defining value labels is a two-stage process: first define the value labels,
then attach them to the specific variables.
label define ownershipval 1 "Sole ownership" 2 "Partnership with another
person"

After defining the value labels, attach them to a specific variable. Common value labels such as
yes/no can be attached to several variables without having to be redefined each time.
label values Ownership ownershipval
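For example, a single yes/no label can be attached to several variables at once (recent Stata versions let label values take a variable list; the variable names here are illustrative):

label define yesno 0 "No" 1 "Yes"
label values owns_phone owns_radio yesno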

Using the menu go to: Data > Variables Manager, click on the “Manage” button on the right of
the Value Label then follow the preceding dialog boxes to create a new value label.

Changing variable content

In addition to deriving new variables, data management usually involves changing contents of
existing variables, changing the formats or reclassifying variables. Although there are several
commands or different ways of doing these changes, we will look at the common ones in this
workshop.

Recoding variables

When working with coded variables (variables with value labels), it is more efficient to use the
recode command than replace. For example, education of the manager has seven levels (0=Illiterate,
1=Literate, 2=Primary school, 3=Intermediate, 4=Secondary school, 5=College, 6=University),
but looking at the distribution using the tabulate command, it is more informative to combine
the levels so that there are three levels of education: we recode 1 and 2 into 2, then 3, 4, 5 and 6
into 3. Also, 0 is not a very common code, so change it to 1. Use the recode command. It is good
practice NOT to change the original variable, so we shall store the changes in a new variable
called educ_recode.

recode Educ_mgr (0=1) (1/2=2) (3 4 5 6=3), gen(educ_recode)

We can add value labels to the recoded values directly within the recode command (a new
variable name is used here since educ_recode already exists):
recode Educ_mgr (0=1 "Illiterate") (1/2=2 "Medium level") (3 4 5 6=3 "Higher
level"), gen(educ_recode2)

Using the menu go to: Data > Create or change data > Other variable-transformation commands
> Recode categorical variable

Encode

Some data analysis routines in Stata (e.g. ANOVA) work with numeric variables only, so it
becomes necessary to code string variables numerically. The encode command can be used to
do this.
encode owner_gender, generate(owner_gendercode)

The string variable being encoded is automatically sorted in ascending order, then codes are
assigned sequentially starting with 1, and the original strings are stored in the new variable as
value labels.
From the menu, go to: Data > Create or change data > Other variable-transformation commands
> Encode value labels from string variable

Decode

decode is the reverse command for encode for converting variables with value labels to string
variables.
decode owner_gendercode, generate(owner_gender_encoded)

From the menu, go to: Data > Create or change data > Other variable-transformation commands
> Decode strings from labeled numeric variable

mvdecode

When working with variables in Stata, we can define our own missing value codes, e.g. -999 or -888.
We then need to explicitly tell Stata that -999 and -888 are missing values, by running several
replace commands or by using mvdecode. Note the prefix "mv" stands for "missing value".

Several replace commands:


replace q777 = . if q777 ==-999

You can use an extended missing value (a suffix .a, .b, ...) to preserve the type of missingness,
for example:
-999 = No answer
-888 = Not applicable
We would therefore want to retain the same information when defining missing values as
follows:

replace female = .a if female ==-999


replace female = .b if female ==-888

One would have to do that for all variables in the dataset, which is time consuming. With
mvdecode we can change all variables with a single command, using the _all keyword as follows:

mvdecode _all, mv(-999=.a \ -888=.b)

Destring/Tostring

To convert variables stored as strings to numeric variables use destring, and use tostring to
convert numeric variables to strings. When using the two commands, either
generate or replace must be specified as one of the options.
tostring Tot_hired_lab, gen(Tot_hired_lab_string)

In the example dataset, the variable for income from hides and skins is stored as a string,
with entries such as "no income" if the firm did not earn income from the business. To use this
variable in any quantitative analysis, it has to be converted to a numeric variable. Note that, in so
doing, the text entries will be converted to missing:
destring Firm_Income_hideskins, gen(Firm_Income_hideskins2)

The command fails because the variable contains both numbers and text entries; to convert the
text entries into missing values, use the force option:
destring Firm_Income_hideskins, gen(Firm_Income_hideskins2) force

To destring from the menu, go to: Data > Create or change data > Other variable-transformation
commands >Convert variables from string to numeric
To tostring from the menu, go to: Data > Create or change data > Other variable-transformation
commands > Convert variables from numeric to string

Managing datetime variables

Suppose you have datetimes stored in the string variable mystr, an example being "2010.07.12 14:32".
To convert to SIF (Stata internal form) datetime/c, you type:
gen double eventtime = clock(mystr, "YMDhm")

The mask "YMDhm" specifies the order of the datetime components. In this case, they are year,
month, day, hour, and minute.
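SIF datetime/c values are stored as milliseconds since 01jan1960 and display as large numbers until you attach a datetime display format:

format eventtime %tc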
Run the command “help datetime” to learn more commands used to manage date variables.

Managing missing data

Missing data are common in research. Stata depicts missing data with a "." for numeric variables
and a blank ("") for strings. Stata commands that perform computations of any type handle
missing data by omitting the missing values.
However, the way that missing values are omitted is not always consistent across commands, so
let's take a look at some examples (auto dataset).

summarize
For each variable, only the non-missing values are used.
summarize rep78

tabulate
By default, missing values are excluded and percentages are based on the number of non-missing
values. To include missing values, use the missing (m) option on the tab command:
tabulate foreign
tabulate foreign, m

As a general rule, computations involving missing values yield missing values. For example,
gen sum1 = length + headroom
egen sum2 = rowtotal(length headroom)

The rowtotal function treats missing values as 0. The rowtotal function with the missing option
will return a missing value if an observation is missing on all variables.
egen sum3 = rowtotal(length headroom) , missing

Finally, you can use the rowmiss and rownomiss functions to determine the number of missing
and the number of non-missing values, respectively, in a list of variables. This is illustrated below.
egen miss = rowmiss(length make)
egen nomiss = rownonmiss(length make)

Using the replace command, you can fix the missing values as follows, e.g.
replace rep78=0 if rep78==.
replace rep78=0 if missing(rep78)

The replace command
tab Subcounty

Sub county
Uriri
Bungoma North
Matayos
Matoyos
Nambale

Looking at the output table, you notice two subcounties that look like they are the same but
with a slight variation in spelling: Matayos is the same subcounty as Matoyos, so one of them
should be changed.
Replace Matoyos with Matayos using the replace command:

replace Subcounty="Matayos" if Subcounty=="Matoyos"

From the menu, go to: Data > Create or change data > Change contents of variable

Combining datasets

Combining datasets is the process of joining data from two different datasets. Broadly, there are
two ways of joining datasets: one involves adding records to existing variables, while the second
involves adding fields/variables. Each has specific requirements and modes of
joining, for example, to append records the two datasets must have the same variable names, while
merging requires a common variable that links the two datasets. Stata uses the merge
and append commands to join datasets.

append adds more observations to the dataset, while merge adds more variables to existing
observations. To illustrate this with a simple graphic, append makes the data longer, while merge
makes it wider. The most common way in which append is used with cleaning is to put separate
batches of entered data into one dataset.

MERGE

The merge command is used to add variables to an open dataset (the master file) from an external
dataset (the using file), based on one or more common variables used to match the records. The
match merging can be one-to-one, one-to-many, many-to-one or many-to-many. By default, Stata
creates a new variable _merge containing codes that indicate the source of each merged
observation; the meanings of the codes are displayed as part of the output from the merging
process. The merge command can also be used to update missing observations using the update
option, that is, when existing variables in the master file have missing values for which the using
file has values. If the using file is generally more up to date than the master file, then use the
update replace option: with update alone, only missing values in the master are filled in, while
adding replace also overwrites conflicting nonmissing values.
We will merge two separate files: the first dataset contains the farmers' data and the second
dataset contains the crops that were sold by the farmers. We have to join the two datasets in
order to carry out some analyses. We use the merge 1:m command since the
dataset in memory (the master file) has unique observations for the hhid variable, while the using
dataset contains a list of crops sold by each household (here hhid is repeated several times).
Open the crops_sold dataset:

use "/Volumes/Transcend/Career/ILRI/Training/crops_sold.dta", clear

Inspect this file and note that hhid is repeated. Then open the farmers dataset, check that hhid
is unique in it, and use the merge 1:m command to merge the two:

use "/Volumes/Transcend/Career/ILRI/Training/farmers.dta", clear

merge 1:m hhid using "crops_sold.dta"

The output shows the codes of the _merge variable. Here is an explanation of each of the
codes.
1 master - observation appeared in master only
2 using - observation appeared in using only
3 match - observation appeared in both
4 match_update - observation appeared in both, missing values updated
5 match_conflict - observation appeared in both, conflicting nonmissing value
Codes 4 and 5 can arise only if the update option is specified. If codes of both 4 and 5 could
pertain to an observation, then 5 is used.
Discussion: What changes do you see in the newly merged dataset?
From the menu, go to: Data > Combine datasets > Merge two datasets.

APPEND

Appending is used when the records of the dataset to be analyzed were entered into separate
files with the same variables. In the example below, two datasets (farmers1 and farmers2)
with the same variables were entered separately by two different data entry clerks. We use
append to join the two datasets into one:
append using "/Volumes/Transcend/Career/ILRI/Training/farmers2.dta"

From the menu go to: Data > Combine datasets > Append datasets

Changing the shape of the data

RESHAPE

Often data come from either cross-sectional or panel studies. Cross-sectional data are
collected from subjects at one specific time, while panel data are collected from the same subjects
repeatedly over a certain period. Some cross-sectional data have a hierarchy, e.g. data on
household members, whereby the main subject is the household head while data on members
of the household are a second level of data on the same household, so the household
identification is repeated. Panel data and cross-sectional data are entered in either WIDE or
LONG format. Here is a pictorial presentation of the different data formats we are likely to
encounter.

In the data below, you notice that each household produced and sold more than one crop,
and these are spread over several columns such that the hhids are unique. This is the wide format.

It is possible to convert data between long and wide formats in Stata using the reshape command.

To reshape data from wide to long, Stata expects the variables to share the same stubname, i.e.
the same prefix, e.g. crop1, crop2, crop3 or price2015, price2016, price2017, where the
stubnames are crop and price. By default Stata assumes that the suffix is numeric; if the suffix is
text, this must be stated with the string option (this includes suffixes that are padded with
leading zeros, e.g. 01, 02).
Since our variables do not have a stubname, we will add one by renaming the variables so that
each is preceded by the prefix "yield", as shown below:
rename * yield*

This adds the prefix “yield” to all the variables in the dataset. Since we do not want the hhid and
county variables to contain the “yield” prefix, we revert to their original names using the
following code:
ren (yieldhhid yieldcounty) (hhid county)

The new dataset now looks like this.

We can then reshape the data from wide to long:


reshape long yield, i(hhid) j(crop) string

Note that reshape requires i(hhid) to uniquely identify the observations, so drop any duplicate
hhids first:

duplicates drop hhid, force

View the reshaped dataset using the browse command. The dataset looks like this. Note that
the hhid variable is repeated.

Let us now reshape the data back to wide format:


reshape wide yield, i(hhid) j(crop) string

If you need to use the reshape the data using the menu, go to: Data > Create or change data >
Other variable-transformation commands > Convert data between wide and long

The by command

In Stata, you often find yourself needing to repeat a command, perhaps making minor
adjustments from one call to another. For example, suppose you want to examine summary
statistics of the variable Numberofvinecuttingskabode for each of the different counties. You could
try:
summarize Numberofvinecuttingskabode if County == "Busia"
summarize Numberofvinecuttingskabode if County == "Migori"
summarize Numberofvinecuttingskabode if County == "Bungoma"

This is a very inefficient approach, especially when several counties are considered. What's
the solution? We use the by command:
sort County
by County: summarize Numberofvinecuttingskabode

by repeats Stata commands on subsets of the data, called "by-groups." All observations of each
by-group share the same value of the "by-variable" (for example, County above). The by
command above first ran summarize for all observations with County == “Busia”, then all those
with County == “Migori”, and so on. Each of these was its own by-group. by is followed by a list
of by-variables. This list can include more than one variable.
You might be wondering why we needed to sort by County before using by. by requires the
dataset to be sorted by the by-variables. You can do this before by, as we did above, or at the
same time by combining by and sort into one command, named bysort:
bysort County: summarize Numberofvinecuttingskabode

String cleaning

Very often, whether due to the enumerator, the data entry operator, or just the fact that certain
places, people, and things have different names, string variables will have multiple spellings for
the same response. For example a single county variable may have “Homa Bay” and “Homabay”.
We will discuss some of the approaches that can be used to clean string variables:
For string variables that are supposed to be only one word, you may want to remove extra or all
spaces using the following commands:
replace County = itrim(trim(County))
replace County = subinstr(County, " ", "", .)

The trim() function removes leading and trailing spaces, while the itrim() function collapses extra
spaces between words. The subinstr() function searches for and replaces all instances of " " with
"". There are several other string functions that can be used to clean and create new variables,
such as strupper(), strlower(), substr(), etc.
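For example (strtrim(), stritrim() and strlower() are the Stata 14+ function names; county_std is a new, illustrative variable):

gen county_std = strlower(stritrim(strtrim(County)))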
Suppose we want to remove typos from the County variable but cannot use any of the built-in
functions. First you need to decide on the standardized responses, then make replacements. For
example, suppose these all refer to the same county: Homa Bay, Homa bay, Homabay, Homa bey.
If I wanted to standardize County by changing all these values to "Homabay", I could code:
replace County = "Homabay" if County == "Homa Bay" | County == "Homa bay" |
County == "Homa bey"

To be efficient, I could use the function inlist() in place of this code:


replace County = "Homabay" if inlist(County, "Homa Bay", "Homa bay", "Homa
bey")

Sub-setting data

Stata allows you to keep and drop variables so that the final dataset contains only what you
need. This is achieved using the keep and drop commands.
Suppose we want to just have the variables hhid and County, we can keep just those variables,
as shown below.
keep hhid County

Perhaps we are not interested in the variables Wardcode Subcountycode and Countycode. We
can get rid of them using the drop command shown below.
drop Wardcode Subcountycode Countycode

STATA also allows you to keep and drop observations based on a specific condition. For
instance, we can eliminate the observations which have missing values using drop if as shown
below.
drop if missing(Countycode)

Let's illustrate using keep if to eliminate observations. Suppose we want to keep just the
households from Migori County.
keep if County == "Migori"

To keep or drop variables/observations using the menu, go to: Data > Create or change data >
Drop or keep observations

Macros
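
A macro is a named placeholder for text that Stata substitutes into commands before running them. Local macros are defined with local and referenced as `name' (note the opening backtick and the closing apostrophe); global macros are defined with global and referenced as $name. Locals exist only within the do-file or loop that defines them, which is why the loops below refer to `item' and `var'. A minimal sketch (the file path is illustrative):

local myvars price mpg weight
summarize `myvars'

global datadir "/Volumes/Transcend/Career/ILRI/Training"
use "$datadir/farmers.dta", clear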

Loops

In Stata, loops are used to repeat a command multiple times or to execute a series of commands
that are related. There are several commands that work as loops:

 foreach,
 while,
 forvalues,
 if (different if from the if qualifier you learned before),
For example: suppose you need to transform a set of variables. Instead of doing it one by one,
you can loop through the variables.

foreach

The most basic foreach syntax runs the following way:
foreach item in list_item1 list_item2 list_item3 {

command `item'
}

Where item is the macro that is used to signify each item of the list. So, as Stata loops through
the command for each item on the list, it will substitute each list_item into the macro, performing
the following:
command list_item1

command list_item2

command list_item3

For example:
foreach letter in a b c d {

display "`letter'"

}

In this case foreach … in is the command, a b c d is the list of items that you want to loop over,
and letter is the local you use to first declare and then call the list.
Now let’s try something serious
foreach var in price mpg weight length displacement {

summarize `var'
}
This will loop through each variable in the list, summarizing one at a time.
For starters, please check out the rep78 variable – tab it, codebook it, etc. You will see that it has
5 values (1-5). Now suppose we wanted to create a separate variable out of each of the five types.
Before we would’ve done it the slow way. We would’ve said
gen rep78_1 = (rep78 == 1)

gen rep78_2 = (rep78 == 2)

But now we can write a loop! Observe:


foreach value in 1 2 3 4 5 {

gen rep78_`value' = (rep78 == `value')
}
That’s 3 lines instead of 5 we would’ve done. And one of them is just a bracket! Now imagine
that had been 20 values, or a hundred!

Descriptive statistics

Descriptive statistics describe, show, or summarize data in a meaningful way such that, for
example, patterns might emerge from the data. Descriptive statistics do not, however, allow us
to make conclusions beyond the data we have analyzed or reach conclusions regarding any
hypotheses we might have made. They are simply a way to describe our data.

Types of statistics that are used to summarize data:

Measures of central tendency: these are ways of describing the central position of a frequency
distribution for a group of data. For example, the mode, median, and mean. Use the summarize
command to accomplish this:
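For example, using a variable from the training dataset:

summarize Numberofvinecuttingskabode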

Measures of spread: these are ways of summarizing a group of data by describing how spread
out the scores are. A number of statistics are available to us, including the range, quartiles,
absolute deviation, variance and standard deviation. Use the summarize command with the
detail option to accomplish this:
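For example:

summarize Numberofvinecuttingskabode, detail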

Summarizing grouped data

Grouped data are summarized using tabulations. There are three kinds of tabulation:

• Simple - the data are tabulated on one characteristic.
• Double - two characteristics of the data are tabulated.
• Complex - the tabulation includes more than two characteristics.

There are three related commands that produce frequency tables for discrete (categorical)
variables: tab, tab1 and tab2.
tab:
i. Produces a one-way frequency table with percentages:
tab County

ii. Produces a two-way frequency table (by default without percentages); no more than two
variables are allowed by Stata. The row and col options tell Stata to output row and column
percentages.

tab County Genderedhouseholdtype1Femalh, row col

Using the key, we can tell that the first row shows the frequencies, the second row shows
percentages of the total frequency for each county and the third row shows percentages of the
total frequency for each level of Gender of the household head. For example, we can say that
75% of households in Bungoma County are headed by both males and females.
tab1
Produces a one-way frequency table for each variable in the variable list, unlike tab which allows
only one variable at a time when producing one-way frequency tables.
tab1 Genderedhouseholdtype1Femalh County Subcounty Ward

tab2
Produces all possible two-way tables from all the variables in the list (unlike the tab command)
tab2 Genderedhouseholdtype1Femalh County Subcounty Ward

From the menu, go to: Statistics > Summaries, tables, and tests > Frequency tables > One-way
table
OR
Statistics > Summaries, tables, and tests > Frequency tables > Multiple one-way tables

Tables of summary statistics

Instead of using the summarize command, one can also use the tabstat command which allows
you to specify the summary statistics of interest. Here are several examples
tabstat Numberofchildren023months Numberofpregnantwomen, by(County)

This command will output the means of the 2 variables for each county.
tabstat Numberofchildren023months, s(mean median sd var count range min max)
by(County)

This command generates descriptive statistics of Numberofchildren023months tabulated according to the categorical levels of County. Here one is at liberty to specify as many summary statistics as needed.
tab County Genderedhouseholdtype1Femalh, sum(Numberofchildren023months)

This command generates the mean, standard deviation and frequency of Numberofchildren023months for each cell of the categorical two-way table.
tab County Genderedhouseholdtype1Femalh, sum(Numberofchildren023months)
nofreq

This removes the frequencies from the output, leaving the other summary statistics.


To generate tables of summary statistics from the menu, go to: Statistics > Summaries, tables,
and tests > Other tables > Compact table of summary statistics

Aggregating/summarizing data sets

Often data are collected in more detail than required for analysis. For example, data on income from livestock sales are often collected at animal level while analysis is usually done at household level, so the income must be aggregated to household level. Similarly, rainfall and temperature data are usually collected on a daily basis but reported monthly or annually, and thus have to be aggregated to the unit of analysis. Stata provides the collapse command to make such aggregation possible.
collapse creates a summarized dataset from a master dataset. It is important to save the master file before running the collapse command because it replaces the dataset in memory.
use "farmers.dta", clear

collapse (sum) Number_children, by(County)

From the Data Menu, go to: Create or change data > Other variable-transformation commands >
Make dataset of means, medians, etc.
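Because collapse replaces the dataset in memory, a common pattern is to wrap it in preserve/restore so the master data can be recovered without re-reading the file. A sketch using the same farmers.dta variables:

use "farmers.dta", clear
preserve
collapse (sum) Number_children, by(County)
list
restore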

Bulk Data Processing

There are times when it is necessary to perform an operation on a set of records or fields at the same time. Stata has functions that can be used in these circumstances. For example, in longitudinal studies, we may want to calculate change over time for each subject in a dataset that is in long format.
Example: We want to assess the trend in price change between grades in the example dataset, by calculating the difference in price between grades 1 and 2, 2 and 3, and then 3 and 4 within household and species.
Sort by household, species and grade level, and then compute the price difference between consecutive grades.
sort hhid species level

duplicates report hhid species level

bysort hhid species (level) : gen price_change = gradepr - gradepr[_n-1]

collapse (mean) price_change, by(level)

Introduction to graphing using STATA

Graphs provide a great way to explore your data visually, that is, to perform graphical summaries. Stata has excellent graphics facilities, accessible through the graph command; see help graph for an overview. This gives a list of commands, for example,
 graph twoway scatterplots, line plots, etc.
 graph matrix scatterplot matrices
 graph bar bar charts
 graph box box-and-whisker plots
 graph pie pie charts
Once graphs have been generated, one can use the following commands to save a previously
drawn graph, redisplay previously saved graphs, and combine graphs
 graph save save graph to disk
 graph use redisplay graph stored on disk
 graph display redisplay graph stored in memory
 graph combine combine multiple graphs
 graph export export .gph file to PostScript, PDF, PNG, etc.
We will use these commands extensively later in the course. First, we need to know the
difference between qualitative and quantitative data and how each data type can be plotted.

Qualitative data take one of several categories, e.g. blood group, region, etc. The count of the number of subjects in each group is commonly referred to as the frequency. The Stata command to produce a tabulation is tabulate varname.
This kind of data can be summarized graphically using any of the following graphs.
• Bar Chart: Data represented as a series of bars, height of bar proportional to frequency (similar to histogram).
• Pie Chart: Data represented as a circle divided into segments, area of segment proportional to frequency.
Quantitative data can take any numerical value, e.g. weight, height, price. A histogram can be used to summarize continuous data.
• Area of bars proportional to probability of observation being in that bar
• Axis can be;
 Frequency (heights add up to n) - frequency;
 Percentage (heights add up to 100%) - percent
 Density (Areas add up to 1) - density
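A sketch of the three axis scalings, again on the auto dataset:

sysuse auto, clear
histogram price, frequency
histogram price, percent
histogram price, density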

Graphing by categorical groups

A histogram can also be used to plot categorical data using the hist command and the discrete
option. Stata assumes by default that the variable supplied is continuous unless one adds the
discrete option. Note that string variables have to be converted to numeric first before being
used.
encode County, gen(County1)
hist County1, discrete percent

From the Graphics Menu, select Histogram.

With a categorical variable, one can always generate graphs for each level of the variable, by use
of the by option. For example, using the farmers dataset, one can create a histogram for the
distribution of the number of children for each County.
hist Numberofchildren023months, by(County1) percent

You can also create an independent kernel density plot with the kdensity command
kdensity Numberofchildren023months

You can overlay a kernel density on your histogram just by adding the kdensity option (there's
also a normal option to add a normal density).
hist Numberofchildren023months, frequency kdensity

Graphing bivariate relationships

Use the graph bar command to plot, for instance, the mean number of children by county. To accomplish this, go to the Graphics menu then select Bar chart.
Under the Main tab, check the radio button Graph by calculating summary statistics. Then under Statistics to plot, select Mean from the drop-down and the variable to which to apply the statistic. Under the Categories tab, select County as the category variable.

One can also run this command directly from the Command line
graph bar (mean) Numberofchildrenunder5, over(County) bar(1, fcolor(ltblue))

The twoway Command

twoway is a family of plots, all of which fit on numeric y and x scales. Two-way graphs show the relationship between numeric data. Suppose we want to show the relationship between price and mileage in the auto dataset; we could graph these data as a twoway scatterplot:

sysuse auto, clear

twoway scatter price mpg

or we could graph these data as a twoway line plot,


twoway (line price mpg, sort)

or we could graph these data as a scatterplot and put on top of that the prediction from a linear regression of price on mpg,
twoway (scatter price mpg) (lfit price mpg)

Combining graphs

To include multiple plots in a graph, they must be separated either by putting them in
parentheses or by putting two pipe characters between them (||).
Thus to create a graph containing two scatter plots of price and mpg, one for foreign==1 and another for foreign==0, you can type either:
scatter price mpg if foreign==1 || scatter price mpg if foreign==0

twoway(scatter price mpg if foreign==1) (scatter price mpg if foreign==0)

Adding titles and legends to graphs


twoway (scatter price mpg if foreign==1) (scatter price mpg if foreign==0), title("Scatterplot of Price vs Mileage") legend(label(1 "Car type==Foreign") label(2 "Car type==Domestic"))

Statistical inference

Oftentimes we want to study a phenomenon about a population. Since we cannot collect data
from the whole population we pick a representative sample through sampling.
Descriptive statistics provide information about the immediate group of data. Inferential
statistics on the other hand, allow us to make generalizations about the population using the
sample.
This introduces some degree of uncertainty since you are using a sample to infer what would be
measured in a population.
In statistical inference, we test hypotheses. Hypothesis testing is used to establish whether a research hypothesis extends beyond those individuals examined in a single study. To conduct a study, a researcher has to go through the following steps:
i. Define the research hypothesis for the study.
ii. Set out the variables to be studied and how to measure them.

iii. Set out the null and alternative hypotheses
iv. Set the significance level – mostly 5%
v. Make a one- or two-tailed prediction.
vi. Determine whether the distribution that you are studying is normal.
vii. Select an appropriate statistical test based on the variables you have defined and
whether the distribution is normal or not.
viii. Run the statistical tests on your data and interpret the output.
ix. Reject or fail to reject the null hypothesis based on the p-value of the test statistic.

The figure below shows a flow chart of the most commonly used statistical tests.

The table below shows a more detailed description of the statistical tests depending on the
type of response variable.
Outcome variable: Continuous (e.g. blood pressure, age, pain score)
• Independent observations:
 Ttest: compares means between two independent groups
 ANOVA: compares means between more than two independent groups
 Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
 Linear regression: multivariate regression technique when the outcome is continuous; gives slopes
• Correlated/panel observations:
 Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
 Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
 Mixed models: multivariate regression techniques to compare changes over time between two or more groups
• Alternatives if the normality assumption is violated (and small n):
 Wilcoxon sign-rank test: non-parametric alternative to the paired ttest
 Wilcoxon sum-rank test (= Mann-Whitney U test): non-parametric alternative to the ttest
 Kruskal-Wallis test: non-parametric alternative to ANOVA
 Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

Outcome variable: Categorical
• Independent observations:
 Logistic regression: binary outcome (yes/no response)
 Ordinal logistic regression: ordered categories (e.g. mild/moderate/severe pain; unlikely, somewhat likely, very likely to adopt a new farming method)
 Multinomial logistic regression: nominal response (type of farming method adopted: zero-grazing, range farming, extensive)
• Correlated/panel observations:
 GEE models: to model changes over time
 Fixed/random/mixed models: to take into account dependence of observations

Normality testing

It is important to test whether the continuous response variable follows a normal distribution or
not so as to be able to use the appropriate statistical method.
Stata provides graphical ways of checking the normality of continuous variables. For instance, suppose we would like to explore the distribution of a variable called score. We can accomplish this using a variety of plots:
• stem-and-leaf plot : use the Stata command - stem score
• Dot plot : use the Stata command - dotplot score
• Box plots - use the Stata command - graph box score
• Histograms - use the Stata command - hist score
• Distributional diagnostic plots such as P-P plot and Q-Q plot. From the Statistics Menu, go
to Summaries, tables, and tests > Distributional plots and tests > Normal probability plot,
standardized for a P-P plot. Furthermore, for a Q-Q plot, Go to - Statistics > Summaries,
tables, and tests > Distributional plots and tests > Quantile-quantile plot
The methods described above are exploratory. They do not by themselves let us conclude whether the variable score is normally distributed or not. To test for normality formally, we need to state the hypotheses first as follows,
• Null: the data are normally distributed
• Alternative: the data are not normally distributed
In Stata, this test can be implemented using the following tests.
• Shapiro-Wilk test - swilk score
• Shapiro-Francia test - sfrancia score
• Skewness/Kurtosis test - sktest score
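For instance, a quick sketch on the auto dataset, substituting price for the hypothetical score variable (a small p-value rejects the null hypothesis of normality):

sysuse auto, clear
swilk price
sktest price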

Test of Association

Tests whether categorical variables are related. Here we will use the chi-square test of association. The null and alternative hypotheses are as follows:
Null: The two variables are not related
Alt: The two variables are related

When you choose to analyze your data using a chi-square test for association, you need to make
sure that the data you want to analyze conforms to two assumptions. These two assumptions
are:

 Your two variables should be measured at an ordinal or nominal level (i.e., categorical data).
 Your two variables should consist of two or more categorical, independent groups.

Using the provided farmers.dta dataset,


tab County Genderedhouseholdtype1Femalh, chi2

Another assumption: each cell has an expected frequency of five or more.


Fisher's exact test
Used when you would otherwise conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. Replace the chi2 option with exact to conduct the test.
tab County Genderedhouseholdtype1Femalh, exact

Tests of difference

Independent T-Test
The independent t-test is used to determine whether the mean of a dependent variable (e.g. maize yield) is the same in two unrelated, independent groups (e.g., males vs females, employed vs unemployed). Specifically, you use an independent t-test to determine whether the mean difference between two groups is statistically significantly different from zero.
Assumptions
 Dependent variable should be continuous.

 Independent variable should consist of two categorical, independent (unrelated) groups.
 There should be no significant outliers
 Dependent variable should be approximately normally distributed for each category of
the independent variable.
 There needs to be homogeneity of variances. Tested using Levene's test for homogeneity
of variances.
Example: Test whether there is a difference in calves' birth weight based on the gender of the calf. The hypotheses for this test are:

Null: mean birth weight (male) = mean birth weight (female)

Alternative: mean birth weight (male) ≠ mean birth weight (female)

In Stata, run the following commands


use births.dta, clear

labmask sex, values(sexalph)
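(labmask is a user-written command from the labutil package; if it is not installed, run ssc install labutil first.)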

ttest calf_bweight, by(sex)


Checking for assumptions.

 Check for outliers using the box plot


graph box calf_bweight, by(sex)

 Test for normality


swilk calf_bweight

pnorm calf_bweight

hist calf_bweight, by(sex)

 Levene’s test for equality of variances


robvar calf_bweight, by(sex)

The Wilcoxon-Mann-Whitney test : used when the dependent variable is assumed not to be
normally distributed.
ranksum calf_bweight, by(sex)

One-way ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of two or more independent (unrelated)
groups.
When you choose to analyze your data using a one-way ANOVA, part of the process involves
checking to make sure that the data you want to analyze can actually be analyzed using a one-
way ANOVA. This involves checking whether the following assumptions are met by your data.

 Dependent variable should be continuous.
 Independent variable should consist of two or more categorical, independent (unrelated) groups.
 There should be no significant outliers
 Dependent variable should be approximately normally distributed for each category of
the independent variable.
 There needs to be homogeneity of variances. Tested using Bartlett's test for homogeneity of variances.
 You should have independence of observations

Example: An agricultural research company is comparing the weights of cows from 4 farms
where each has 500 cows. The company wants to know whether the average weight of the
cows differed based on the farm they came from.
The hypotheses for this study are as shown below.
Null: av_weight_farm1 = av_weight_farm2 = av_weight_farm3 = av_weight_farm4
Alternative: At least one of the means is different
use farmer_cattle_weight.dta, clear

oneway weight farm, tabulate

One of the assumptions of ANOVA is normality of the dependent variable. The Kruskal-Wallis test is used when this normality assumption is violated.
kwallis weight, by(farm)

If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly
different value of chi-squared.
One way ANOVA assumes homogeneity (equality) of variances. To test this STATA uses the
Bartlett’s test. Stata provides results for the Bartlett’s test for equality of variance together with
the anova results.
The null hypothesis of the test states that the variances are equal. If the assumption of homogeneity of variance is violated, recast the ANOVA as a regression with dummy variables; the robust option will then output robust standard errors:
regress weight i.farm, robust

PostHoc Tests
If the ANOVA test is significant, pairwise comparisons using Bonferroni, Tukey or Scheffé corrections need to be conducted, since the overall test does not tell us which of the specific groups differed.
pwmean weight, over(farm) mcompare(tukey) effects

With ANOVA, more than one independent variable can be used – two-way ANOVA:

anova weight farm##farm_size

Test of difference for repeated measures data

Paired T-Test
Objective: to determine whether the mean of a dependent variable is the same in two related groups. Here the assumptions are the same as those of the independent samples t-test, except that the independent variable should consist of two categorical "related groups" or "matched pairs".
In Stata :
ttest FirstVariable == SecondVariable

The Wilcoxon signed rank sum test: non-parametric version of a paired samples t-test.
signrank FirstVariable = SecondVariable
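For example, with hypothetical variables weight_before and weight_after measured on the same animals:

ttest weight_before == weight_after

signrank weight_before = weight_after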

Repeated measures ANOVA


Used to determine whether 3 or more group means are different where the participants are the
same in each group. The structure of the data should look like this.

id time dv
1 1 4.5
1 2 3.0
1 3 2.5
2 1 7.2
2 2 4.2
2 3 2.4

anova dv id time, repeated(time)

Example: The weight of 2,000 cows was measured every 3 months over a period of 1 year. We would like to investigate whether there is a difference in the average weights over the 4 time periods.
Here,
 Dependent variable: weight is your dependent variable,
 Independent variable: time(4 related groups)

use cattle_weight_over_time.dta, clear

anova weight animal_id time, repeated(time)

Check for assumptions the same way as in the case of one-way ANOVA.

Regression Analysis with Stata: Cross-sectional data

Used to establish the relationship between a dependent variable and one or more explanatory variables.
Assumptions:
 Normality of residuals
 Homogeneity of variances
 Dependent variable should be continuous
 Independence of observations
 Linear relationship

Example: Cross-sectional study involving 1500 farmers in Meru County. The aim of the study is to find determinants of maize yield (bags per ha). Using maize_yield_ols.dta, we model maize yield on the variety of maize planted and household characteristics.
use maize_yield_ols.dta, clear

regress bags_per_ha i.variety age i.education i.gender_hhead

Checking model Assumptions


Use predict with the residuals option to store the residuals in a new variable res.
predict res, residuals

label var res "residual"

 Residuals Analysis - Normality of residuals


pnorm res, title("Normality of Residuals")

swilk res

kdensity res, normal

 Homogeneity of variances
First compute the studentized residuals, then plot a scatter plot of the residuals against the predicted values. To obtain predicted values (yhat) from this regression, type
predict yhat,xb

label var yhat "predicted mean yield"

predict rstud, rstudent

graph twoway (scatter rstud yhat, msymbol(d)), title("Seeds study") subtitle("Studentized Residuals v Predicted")

Or, use the following Stata command


rvfplot, yline(0)

To assess homogeneity, check for any pattern in the plot. A pattern indicates that the assumption of homogeneity of variance has been violated. The Breusch–Pagan test is used to formally test for heteroscedasticity.
estat hettest

• Outlying and influential observations


We can use the leverage versus residual squared plot
lvr2plot, mlabel(id)

Now let's move on to overall measures of influence, specifically Cook's D. The lowest value that Cook's D can assume is zero, and the higher the Cook's D, the more influential the point. The conventional cut-off point is 4/n, where n is the number of observations. We can list any observation above the cut-off point by doing the following.
predict d, cooksd

br id bags_per_ha age variety education gender_hhead manager d if d>4/1500

Now let’s take a look at DFITS. The cut-off point for DFITS is 2*sqrt(k/n). DFITS can be either
positive or negative, with numbers close to zero corresponding to the points with small or zero
influence.
predict dfit, dfits

br id bags_per_ha age variety education gender_hhead manager rstud res d dfit


if abs(dfit)>2*sqrt(3/1500)

• Checking for multicollinearity


Occurs when there is a perfect linear relationship among the predictors - the estimates for a
regression model cannot be uniquely computed.
The primary concern is that as the degree of multicollinearity increases, the regression model
estimates of the coefficients become unstable and the standard errors for the coefficients can
get wildly inflated.
We can use the estat vif command after the regression to check for multicollinearity; VIF stands for variance inflation factor. As a rule of thumb, a variable whose VIF value is greater than 10 may merit further investigation.
Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity.
A tolerance value lower than 0.1 is comparable to a VIF of 10.
estat vif

Introduction to Panel data analysis

Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which
the behavior of entities are observed across time. These entities could be states, companies,

individuals, countries, etc. Panel data, in long format, looks like the sketch below.
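For illustration, a hypothetical long-format panel with one row per farmer-year might be:

id    year    bags_per_ha
1     2014    12.5
1     2015    14.0
2     2014    9.8
2     2015    10.4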

Observations are not independent, since measurements at different time points on the same entity are correlated. Therefore we have to fit a model that corrects for the lack of independence. One can fit a model with either fixed or random individual effects depending on the study objective and the assumptions we have about the respondents.
Fixed effects model
Here all the respondents are expected to share a common effect. The goal of the analysis is to generalize the results to the population from which the respondents were drawn (and not to extrapolate to other populations). FE explores the relationship between predictor and outcome variables within an entity (country, person, company, etc.).
When using FE we assume that something within the individual may impact or bias the predictor or outcome variables, and we need to control for this. Fixed-effects models are designed to study the causes of changes within a person. A time-invariant characteristic, e.g. gender, cannot cause such a change, because it is constant for each person.
Random effects model
By contrast, when the researcher is accumulating data from respondents that are different (geographical regions, tribes), it is unlikely that they are all functionally equivalent. Additionally, the goal of such an analysis is usually to generalize to a range of scenarios, making it possible to extrapolate.
If you have reason to believe that differences across entities have some influence on your
dependent variable then you should use random effects. An advantage of random effects is that
you can include time-invariant variables (e.g. gender). In the fixed effects model these variables
are absorbed by the intercept.
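A common way to choose between the two specifications is the Hausman test, whose null hypothesis is that the random-effects estimator is consistent; a significant result favours fixed effects. A sketch using the maize-yield panel introduced below (the test expects the default, non-robust standard errors):

xtreg bags_per_ha year, fe
estimates store fe_model
xtreg bags_per_ha year, re
estimates store re_model
hausman fe_model re_model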
Example: 500 farmers were recruited into a study where the quantity of maize they harvested
per ha was recorded over a period of 3 years. The aim of the study is to study how the yields
evolve over time and how these evolutions depend on some demographic variables.
The variables in the dataset are:

 id: unique identifier for each farmer
 bags_per_ha: response of interest (maize yield)
 year : year (repeated measurements)
 age : age of the respondent (age in years)
 education : education level of the respondent (low, medium, high)
 gender_hhead : gender of the household head (male, female)
 manager : Who manages the farm (man, woman, child)

Exploring panel data


• Scatter plot matrix to check the correlation structure
reshape wide bags_per_ha, i(id) j(year)

graph matrix bags_per_ha2014 bags_per_ha2015 bags_per_ha2016, half

corr bags_per_ha2014 bags_per_ha2015 bags_per_ha2016

Conducting panel data analysis in stata

The first step is to declare the data as panel data, with farmer (id) as the panel variable and year as the time variable ordering the observations within a respondent.
xtset id year

To report panel aspects of a dataset


xtdescribe

To summarize maize yield, decomposing standard deviation into between and within
components
xtsum bags_per_ha

To plot panel data as a line plot – first six subjects


xtline bags_per_ha if id <= 6, overlay

Fitting a fixed effects model:


In this study, the total quantity harvested was regressed on year with farmer fixed effects, as shown in the equation below.

Y_ij = β0 + α_i + β1·Year_ij + ε_ij

Where;
 β0: the overall intercept
 α_i (i = 1, …, 500): the 500 subject-specific effects
 Y_ij: maize yield for year j in subject i
 Year_ij: year j for subject i
 β1: the time effect
 ε_ij: the error term
Estimate a fixed-effects model with robust standard errors using Stata as follows.
xtreg bags_per_ha year, fe vce(robust)

Fitting a random effects model:


In this study, the total quantity harvested was regressed on year with farmer random effects, as shown in the equation below, while correcting for baseline variables.

Y_ij = β0 + a_i + β1·Age_ij + (β2 + b_i)·Year_ij + ε_ij

Where;
 β0: the overall intercept
 a_i (i = 1, …, 500): the 500 subject-specific random intercepts
 b_i (i = 1, …, 500): the 500 subject-specific random slopes
 Y_ij: maize yield for year j in subject i
 Year_ij: year j for subject i
 β1: the age effect
 β2: the time effect
 ε_ij: the error term

We will fit a random-effects model with robust standard errors, including the baseline covariate age.
xtreg bags_per_ha year age, re vce(robust)
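Note that xtreg, re estimates random intercepts only. The subject-specific slopes b_i in the equation above call for the mixed command; a sketch:

mixed bags_per_ha year age || id: year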

Introduction to logistic regression

Used to model binary response - the response measurement for each subject is “success” or
“failure”. The most popular model for binary data is logistic regression.
In the logit model the log odds of the outcome is modeled as a linear combination of the
predictor variables. The odds ratio (OR) is the ratio of the “odds” of success versus failure for
the two groups:
OR = [p1/(1 − p1)] / [p2/(1 − p2)]

The log-odds ratio is often used to alleviate the restriction that the odds ratio must be positive.
More on odds ratio
• OR=1 corresponds with no difference
• When 1 < OR < ∞, the odds of success are higher in group 1 than in group 2. Thus,
subjects in the first group are more likely to have successes than subjects in group 2.

• When 0 < OR < 1, the odds of success are smaller in group 1 than in group 2. Thus,
subjects in the first group are less likely to have successes than subjects in group 2.
• Values of OR farther from 1 in a given direction represent a stronger level of association.
Suppose you are comparing males and females on their likelihood to adopt a new farming
method (yes/no).
Total number of females = 100
Total number of males = 120

The probability of adoption (proportion of “yes”) is given by:


Females (p1)= 60/100 = 0.6
Males (p2) = 50/120 = 0.42

The odds ratio is given by:


OR = (0.6/(1-0.6)) / (0.42/(1-0.42)) = 2.07
The odds ratio has the interpretation that females have about twice (2.07) the odds of adopting the new farming method (versus not adopting) compared with males. Hence females are more likely to adopt.
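You can verify the arithmetic directly in Stata:

display (0.6/(1-0.6)) / (0.42/(1-0.42))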

Formulation of the logit model

The logistic regression model has linear form for the logit of the success probability π(x) when X takes value x:

logit[π(x)] = log( π(x) / (1 − π(x)) ) = α + βx

The relationship between π(x) and x is then described by the logistic function:

π(x) = exp(α + βx) / (1 + exp(α + βx))

Alternatively:

logit[P(Y = 1)] = α + βx

Interpreting the model parameters:


• Y = random component (response)
• X = systematic component (independent variables)

• Logit = link function - says how the expected value of the response relates to the linear
predictor of explanatory variables.
• The parameter β determines the rate of increase or decrease of the curve.
• When β > 0, π(x) increases as x increases.
• When β < 0, π(x) decreases as x increases.
• When β = 0, no effect

A Logistic model makes the following assumptions:


• The data Y1, Y2, ..., Yn are independently distributed - cases are independent.
• The dependent variable Yi does NOT need to be normally distributed – assumes a
binomial distribution.
• The homogeneity of variance does NOT need to be satisfied.
• Errors need to be independent but NOT normally distributed.
• Does not assume linearity - but it does assume linear relationship between the
transformed response in terms of the link function and the explanatory variables; e.g.,
for binary logistic regression logit(π) = β0 + βX.

Example: A researcher is interested in how variables, such as monthly income, land size and
education level affect adoption of a new variety of potatoes.
The response: binary (adoption: 1 = adopt, 0 = didn't adopt)
There are three predictors:
• Continuous variables: income, land_size
• Categorical variable: education – takes on the values 1 through 4, where 1 implies the highest level of literacy and 4 the lowest.

Fitting a logistic regression model in Stata

Below we use the logit command to estimate a logistic regression model.


The i. before education indicates that education is a categorical variable. Stata by default selects the lowest category of a categorical variable as the reference category. One can change this category using the ib#. prefix as shown below.
logit adopt income land_size i.education

logit adopt income land_size ib4.education

The output of the model fitted above is interpreted as follows.
• The likelihood ratio chi-square of 41.46 with a p-value of 0.0000 tells us that our model
as a whole fits significantly better than an empty model.
• Both income and land_size are statistically significant, as are the three indicator
variables for education.
• The logistic regression coefficients give the change in the log odds of the outcome for a
one unit increase in the predictor variable.
• For every one unit change in income, the log odds of adoption(versus non-
adoption) increases by 0.0002.
• For a one unit increase in land_size, the log odds of adoption increases by 1.6.
• Having a literacy level of rank 3 versus a literacy level of rank 4 (lowest) increases
the log odds of adoption by 0.21.

Other tests
We can test for an overall effect of education using the test command.
test 1.education 2.education 3.education

We can also test additional hypotheses about the differences in the coefficients for different levels of education. Below we test that the coefficient for education=2 is equal to the coefficient for education=3.
test 2.education = 3.education

You can also exponentiate the coefficients and interpret them as odds-ratios.
logit, or

Predicted probabilities
You can also use predicted probabilities to help you understand the model. You can calculate
predicted probabilities using the margins command
Below we use the margins command to calculate the predicted probability of adoption at each
level of education, holding all other variables in the model at their means.
margins education, atmeans

We can also generate the predicted probabilities for values of income from 2000 to 8000 in
increments of 1000 (continuous variable). The average predicted probabilities will be calculated
using the sample values of the other predictor variables.
For example, to calculate the average predicted probability when income = 2000, the predicted
probability was calculated for each case, using that case’s values of education and land_size,
with income set to 2000.
margins, at(income=(2000(1000)8000))
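The resulting average predicted probabilities are easiest to read as a graph; marginsplot plots the results of the immediately preceding margins command:

marginsplot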

Ordinal logistic regression

Ordinal logistic regression is used to predict an ordinal dependent variable with more than two
categories given one or more independent variables.
Example: A study looks at factors that influence the decision of whether to adopt a new potato
variety. Farmers are asked if they are unlikely, somewhat likely, or very likely to adopt the new
variety. Hence, our outcome variable has three categories. Data on farmer’s educational
status, whether the farmer is male or female, and current land size in ha is also collected.

 adopt: response (0=unlikely,1=somewhat likely,2=very likely)


 education: (0=illiterate, 1 = literate)
 gender: (0=male, 1 =female)
 land_size: in ha

In Stata, run the following command,


ologit adopt i.education i.gender land_size

ologit, or


Testing the proportional odds assumption.

One of the assumptions underlying ordinal logistic regression is that the relationship between
each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that
the coefficients that describe the relationship between, say, the lowest versus all higher
categories of the response variable are the same as those that describe the relationship between
the next lowest category and all higher categories, etc. This is called the proportional odds
assumption or the parallel regression assumption.
First, we need to download a user-written command called omodel. The null hypothesis is that
there is no difference in the coefficients between models, so we "hope" to get a non-significant
result. Please note that the omodel command does not recognize factor variables, so the i. is
omitted.
findit omodel

omodel logit adopt education gender land_size

The brant command performs a Brant test. As the note at the bottom of the output indicates, we also "hope" that these tests are non-significant. The brant command is part of the spost add-on and can be obtained by typing findit spost. We will use the detail option here, which shows the estimated coefficients for the two equations.
brant, detail

A generalized ordered logistic model using gologit2 should be fit if the assumption is
violated. You need to download gologit2 by typing findit gologit2.
Predicted probabilities
margins, at(education=(0/1)) predict(outcome(0)) atmeans

margins, at(education=(0/1)) predict(outcome(1)) atmeans

margins, at(education=(0/1)) predict(outcome(2)) atmeans

Way forward

There are lots of materials on STATA online, and we encourage you to search and refer to these support materials as the need arises. Some of the materials are written by specialists in different disciplines, and you are likely to find specialized, tailored guides and instructions.
