Training on Economics Software
Applications: Introduction to Stata©
Tolasa Alemayehu
Economics Department
Mattu University
November 2022
Mattu, Ethiopia
A PLAN (CONTENT) OF TRAINING
1. Introduction
The Stata Interface
Exploring and Examining Datasets
Storing Commands and Outputs
2. Data management
Creating, Modifying and Defining Variables
Appending and Merging Datasets
Collapsing Data Sets
3. Describing Data
Summary Statistics
Statistical Tests
Graphics
4. Analysis of Regression Models
Steps in Empirical Analysis
Structure of Economic Data
Regression Models: Cross Section, Time Series and Panel
A PLAN (CONTENT OF TRAINING)
Exercises will be given for every section
Working in groups is advisable
It is good to arrange seats so that at least one person to
your right or left has some acquaintance with Stata
1. Introduction
What is Stata? Why Use Stata?
Types of Stata
Stata (pronounced "stah-tah"): Version 1 was born in 1985.
Stata is not an abbreviation but rather a
corruption of the word Statistics.
Stata is a general-purpose, command-driven
package (i.e. not specialized like DAD, EViews,
GAMS, SPSS, Matlab, NLOGIT, etc.)
◦ It handles cross-section, panel, and time-series
data analysis (and is especially suited to the
former two)
Why should I use Stata?
Stata is often preferred to other packages as "a very interactive
package, which makes you feel like you are talking to it
and which does exactly what you tell it to do." In particular:
• It handles and manipulates large data sets (e.g.
millions of observations!)
• It has growing capabilities for handling panel and time-
series regression analysis.
• There are continuing improvements in computing speed,
capabilities and functionality.
• It is constantly being updated and extended by users with
specific needs.
• It is fast and easy to use.
Types (sizes) of Stata
There are four different types (sizes) available for
each version of Stata:
1. Stata/MP (multiprocessor), the most powerful,
2. Stata/SE (Special Edition),
3. Stata/IC (Intercooled), and
4. Small Stata.
The main difference between these versions is the
maximum number of variables, regressors and
observations that can be handled.
It is important to know these types if one is to make a
good choice of what to buy.
For each Stata type, the maximum number of variables, regressors and observations, with remarks:

Stata/MP: 32,767 variables; 10,998 regressors; 2,147,483,647 observations.
Runs on multiple CPUs or cores (from 2 to 64, depending on the licence) but can also run on a single core. It is the fastest version of Stata.

Stata/SE: 32,767 variables; 10,998 regressors; 2,147,483,647 observations.
Runs on a single core (it can run on multi-core computers but uses only one core).

Stata/IC: 2,047 variables; 798 regressors; 2,147,483,647 observations.
Runs on a single core (it can run on multi-core computers but uses only one core).

Small Stata: 99 variables; 99 regressors; 1,200 observations.
Runs on a single core (it can run on multi-core computers but uses only one core).
[Screenshot: the Stata interface, with the menu bar, the toolbar, and the five main windows labeled: Results, Command, Variables, Review, and Properties.]
The Stata Interface: Windows, Toolbar, Menus, and Dialogs
Windows
The Stata windows give you all the key information about
the data file you are using, recent commands, and the results
of those commands.
The five main windows are the Review, Results, Command,
Variables, and Properties windows.
There are other, more specialized windows such as the
Viewer, Data Editor, Variables Manager, Do-file Editor,
Graph, and Graph Editor Windows.
Some of them open automatically when you start Stata,
while others can be opened using the Window pull-down
menu or the buttons on the toolbar.
Stata windows are:
• Stata Results To see recent commands and output
• Stata Command To enter a command
• Stata Browser To view the data file
• Stata Editor To edit the data file
• Stata Viewer To get help on how to use Stata
• Variables To see a list of variables
• Review To see recent commands
• Stata Do-file Editor: To write or edit a program
Menus
Stata displays 8 drop-down menus across the top of the outer window, from left to right:
File
Open open a Stata data file (use)
Save/Save as save the Stata data in memory to disk
Do execute a do-file
Filename copy a filename to the command line
Print print log or graph
Exit quit Stata
Edit
Copy/Paste copy text among the Command, Results, and Log windows
Copy Table copy table from Results window to another file
Table copy options what to do with table lines in Copy Table
Prefs Various options for setting preferences. For example, you can save
a particular layout of the different Stata windows or change the
colors used in Stata windows.
Data
Graphics
Statistics build and run Stata commands from menus
User menus for user-supplied Stata commands (download from Internet)
Window bring a Stata window to the front
Help Stata command syntax and keyword searches
Button bar
The buttons on the button bar are, from left to right:
Open a Stata data file: use
Save the Stata data in memory to disk: save
Print a log or graph
Open a log, or suspend/close an open log: log
Open a new Do-file Editor: doedit
Edit the data in memory: edit
Browse the data in memory: browse
Important shortcuts
Keyboard shortcuts are often quicker to use
than the buttons. The most useful ones are:
Control-O Open file
Control-S Save file
Control-C Copy
Control-X Cut
Control-V Paste
Control-Z Undo
Control-F Find
Control-H Find and Replace
1.2. Exploring and Examining Datasets
1.2.1. Exploring Data Files
Common Stata Syntax
• Stata commands follow the same syntax:
[by varlist1:] command [varlist2] [if exp] [in range]
[weight], [options]
• Items inside the square brackets are optional, and some
are not available for every command.
• This syntax applies to all Stata commands; the sketch below shows how the pieces fit together.
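As a hedged illustration (cons, hhsize, and q1a are variables from the ERHS dataset used later in this training; the cut-off of 5 is an assumption chosen for illustration), the pieces of the common syntax combine like this:
* detailed summary statistics for cons among larger households
summarize cons if hhsize > 5, detail
* the same summary repeated for each region, using the by prefix
bysort q1a: summarize cons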
Logical operators used in Stata
~ Not
== Equal
~= not equal
!= not equal
> greater than
>= greater than or equal
< less than
<= less than or equal
& And
| Or
1.2.2. Examining datasets
Using the Command window, you can open data stored in different formats:
a. Stata file (.dta): use command
b. Excel file (.xlsx): import excel command
c. CSV file (.csv): insheet command
d. SPSS file (.sav): usespss command (user-written)
Log file: Stata can save the log in one of two different formats:
a. Stata Markup and Control Language (SMCL) format
(SMCL format is recommended because SMCL files
can be translated into a variety of formats readable by
applications other than Stata)
b. Plain-text log format
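A few hedged examples of opening data in these formats (the file names are hypothetical; the usespss line assumes the user-written command has been installed, e.g. from SSC, and its exact syntax should be checked with help usespss):
use mydata.dta, clear
import excel using "mydata.xlsx", firstrow clear
insheet using "mydata.csv", clear
* usespss using "mydata.sav", clear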
• Use
– This command opens an existing Stata data file.
– The syntax is:
• use filename [, clear ] : opens new file
• use [varlist] [if exp] [in range] using filename [, clear]
opens selected parts of file
– If there is no path, Stata assumes it is in the current folder.
– You can use a path name such as: use C:\...\ERHScons1999
– If the path name has spaces, you must use double quotes:
use "d:\my data\ERHScons1999"
– You can open selected variables of a file using a variable list.
– You can open selected records of a file using if or in.
Examining dataset
Here are some examples of the use command:
• use ERHScons1999 : opens this file for analysis.
• use ERHScons1999 if q1a == 1: opens data from region 1
• use ERHScons1999 in 5/25: opens records 5 through 25
of file
• We can also combine the if and in conditions
• use q1a hhid hhsize cons using ERHScons1999:
This opens four variables (q1a, hhid, hhsize, and cons) from ERHScons1999
• use ERHScons1999, clear: clears memory before
opening the new file
Examining dataset
Clear: The clear command deletes all data, variables, and
labels from memory to get ready for a new data file
◦ You can clear memory using the clear command or by
using clear as an option of the use command. This command does
not delete any data saved to the hard drive
Exit: Differs from the clear command
◦ It closes Stata and all of its windows
If the data were entered in another format such as Excel,
importing them into Stata is simple
Example: if our data set is in Excel, then use
import excel using "C:\Users\eea\Desktop\SD\original\
teff price.xlsx", sheet(addis) firstrow clear
Examining dataset
Save
– The save command will save the dataset as a .dta file
under the name you choose. Editing the dataset changes
the data in the computer's memory; it does not change the
data that is stored on the computer's disk.
save "C:\...\ERHScons1999.dta", replace
The replace option allows you to save a changed file to the
disk, replacing the original file.
– Stata is worried that you will accidentally overwrite
your data file.
– You need to use the replace option to tell Stata that you
know that the file exists and you want to replace it.
Examining dataset
• Edit
• This command opens a window called the Data Editor,
which allows us to view all observations in
memory.
• You can change the data using the Data Editor window, but
editing data this way is not recommended
• It is better to correct errors in the data using a do-file
program that can be saved.
• Browse
• This window is exactly like the Data Editor window
except that you cannot change the data
Examining dataset
• Describe
– This command provides a brief description of the
data file.
– You can use “des” or “d” and Stata will understand.
– The output includes:
• the number of variables
• the number of observations (records)
• the size of the file
• the list of variables and their characteristics
• Storage types: String vs numeric
Examining dataset
list
◦ This command lists values of variables in data set.
◦ The syntax is:
list [varlist] [if exp] [in range]
examples:
◦ list lists entire dataset
◦ list in 1/10 lists observations 1 through 10
◦ list hhsize q1a food lists selected variables
◦ list hhsize sex in 1/20 lists observations 1-20 for
selected variables
Examining dataset
• list with “if” condition
– This command is used to select certain records in carrying
out a command
• command if exp
Examples:
– list hhid q1a food if food > 1200 lists data where food is > 1200
– list if q1a < 6 lists cases where region is 1 through 5
– browse hhid q1a food if food >= 1200 browses data where
food consumption is at least 1200
• Note that “if” statements always use ==, not a single =.
Also note that | indicates “or” while & indicates “and”
Examining dataset
list with “in”
◦ We also use in to select records based on the case number.
◦ The syntax is: command in exp
For example:
◦ list in 10 list observation number 10
◦ summarize in 10/20 summarize obs 10-20
codebook
◦ The codebook command is a great tool for getting a quick
overview of the variables in the data file.
◦ It produces a kind of electronic codebook from the data
file, displaying information about variables' names, labels
and values
◦ Examples: codebook / codebook hhid q1a food
Examining dataset
Inspect
It is a command for getting a quick overview of a data file.
◦ inspect command displays information about the values of
variables and is useful for checking data accuracy
inspect
inspect hhid q1a food
• assert
– The assert command verifies that a condition is true for every
observation and stops with an error message if it is not.
• count
– The count command shows the number of observations
that satisfy an if condition.
– If no conditions are specified, count displays the number of
observations in the data.
• count: 1452
• count if q1a==3: 466
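A hedged sketch of count and assert with the ERHS variables used above (the household-size check is an assumption added for illustration):
count                  // number of observations in memory
count if q1a == 3      // observations in region 3
assert hhsize > 0      // stops with an error if any household size is not positive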
1.3. STORING: Outputs, Commands & Data
The following topics are covered:
◦ Using the Do-file Editor
◦ log using
◦ log off
◦ log on
◦ log close
◦ set logtype: choose the log format (text logs make it easier to move tables from Stata to Word and Excel)
Using the Do-file Editor
The Do-file Editor allows you to store a program (a set of
commands),
◦ It makes it easier to check and fix errors,
◦ It allows you to run the commands later,
◦ It lets you show others how you got your result, and
◦ It allows you to collaborate with others on the analysis.
STORING: Outputs, Commands and Data
In general, any time you are running more than 10 commands
to get a result, it is easier and safer to use a Do-file.
To open the Do-file Editor, you can click on Windows/Do-file
Editor or click on the envelope on the Tool Bar.
To run the commands in a Do-file,
you can click on the Do button.
If you want to run one or just a few commands rather than the
whole file, mark the commands and click on the Do button
Note: If you would like to add notes to a do-file that Stata
should not execute, use /* */ for comments spanning more than
one line and * for a single-line comment
To put comments after a command on the same line, use //,
separated from the command by a space, and write the comments after it (see the sketch below)
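A minimal do-file sketch showing the three comment styles (the variable names follow the ERHS examples; the file is assumed to sit in the working folder):
/* This block comment can
   span several lines. */
* This comments out an entire line.
use ERHScons1999, clear    // an in-line comment after a command
summarize hhsize food cons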
STORING: Outputs, Commands and
Data
Saving the Output
◦ Stata Results window does not keep all the output you
generate.
◦ It only stores about 300-600 lines, and when it is full, it
begins to delete the old results as you add new results.
◦ Thus, we need to use log to save the output
log using
◦ This command creates a file with a copy of all the
commands and output from Stata. The syntax is:
log using filename [, append replace [ text | smcl ] ]
Append: adds the output to an existing file
Replace: replaces an existing file with the output
STORING: Outputs, Commands and Data
Here are some examples:
log using "C:\Users\eea\Desktop\SD\results.smcl”
log using "C:\Users\eea\Desktop\SD\results.smcl , replace
log using "C:\Users\eea\Desktop\SD\results.smcl, append
log off: This command temporarily turns off the logging of output,
log on: This command is used to restart the logging,
log close: is used to turn off the logging and save the file.
Storing data
save
save, replace
Examples
save "C:\Users\eea\Desktop\SD\version1.dta"
save "C:\Users\eea\Desktop\SD\version2.dta", replace
Getting help in Stata
• Help: The help command gives you information about any Stata
command or topic
• help [command]
For example,
• help tabulate: gives a description of the tabulate command
• help summarize gives a description of the summarize
• search: performs a keyword search; useful when one does not know
the Stata command
Example: search ols
hsearch: searches help files; not restricted to keywords
E.g. hsearch weak instruments
net search: searches user-written resources on the Internet (requires a connection)
◦ E.g. net search outreg2
2. Data Management in Stata
Some Organizing Tips
Adding Notes to Datasets and Variables
Creating and Modifying Variables
Defining, labeling and renaming Variables
Appending and Merging Data Sets
Collapsing Data Sets
Additional Help on Stata
Exercises
First, be organized
Be organized in your data management
Always use do files for your research project
Know the Stata version you are working with
◦ What if I do not know the Stata version? (Type version or about in the Command window.)
Save your outputs
◦ capture log close
◦ log using commands
Create a shorter way of writing your directories
◦ The global command (see the sketch below)
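A minimal sketch of using global macros as directory shortcuts, matching the $original and $final globals used later in this training (the paths themselves are hypothetical):
global original "C:\Users\eea\Desktop\SD\original"
global final "C:\Users\eea\Desktop\SD\final"
use "$original\ERHScons1999_old.dta", clear
save "$final\ERHScons1999.dta", replace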
Adding notes on your data set
You can add notes on your data set
Example
◦ note: This data contains some variables generated
by Economics staff
To read notes, type:
◦ notes
Notes can also be written for individual variables
◦ note food: Is this per capita or per week? Please
check.
To delete notes
◦ notes drop q2_area in 1
◦ notes drop _dta in 2
CREATING NEW VARIABLES
When new variables are created, they are in memory &
they will appear in the Data Browser
However, they will not be saved to the hard disk
unless you use the save command.
generate
◦ This command is used to create a new variable.
◦ It is similar to “compute” in SPSS.
The syntax is;
generate newvar = exp [if exp]
where “exp“ is an expression like
“price*quant” or “1000*kg”
CREATING NEW VARIABLES
You can use “gen“ or “g” as an abbreviation for
“generate“
If the expression is an equality or inequality, the
variable will take the values 0 if the expression is false
and 1 if it is true
If you use “if“, the new variable will have missing
values when the “if“ statement is false
For example,
use "$original\ERHScons1999_old.dta", clear
CREATING NEW VARIABLES
• generate age2= ageh*ageh
create age squared variable
• gen conspercap=food/hhsize
Creates consumption per capita
• gen consperad=food/aeu
Creates consumption per adult
• gen highcons = (rconsae > 2000)
Creates an indicator for those with consumption greater than 2,000
To know the number of these households:
CREATING NEW VARIABLES
• tab highcons
save "$final\ERHScons1999.dta", replace
replace : This command is used to change the definition
of an existing variable.
The syntax is the same:
replace oldvar = exp [if exp] [in exp]
replace cons=. if cons<0: replaces negative consumption
with missing value
tabulate … generate : This command is useful for
creating a set of dummy variables (variables with a
value of 0 or 1) depending on the value of an existing
categorical variable.
CREATING NEW VARIABLES
The syntax is:
tabulate oldvariable, generate(newvariable)
tab q1a, gen(region)
This creates 6 new variables:
region1 = 1 if q1a==1 and 0 otherwise, ...,
region6 = 1 if q1a==8 and 0 otherwise
egen : This is an extended version of “generate” [extended
generate] to create a new variable by aggregating the
existing data.
The syntax is:
egen newvar=fcn(arguments) [if exp] [in range] , by(var)
CREATING NEW VARIABLES
Functions include:
mean()   mean
median() median
max()    maximum
min()    minimum
sd()     standard deviation
std()    standardize variables
sum()    sums
egen average = mean(cons): creates variable of average
consumption over entire sample
egen median= median(cons), by(sex): creates variable of median
consumption for each sex
egen regav = mean(cons), by(region): creates variable of mean
consumption for each region
egen avecon = mean(cons), by(q1c)
gen highavecon = (cons > avecon)
CREATING NEW VARIABLES
Some operators used in Stata
Arithmetic: + addition, - subtraction, * multiplication, / division, ^ power
Relational: > greater than, >= greater than or equal, < less than, <= less than or equal, == equal, ~= or != not equal
Logical: ~ not, & and, | or
The Variables Manager is a tool for managing properties
of variables both individually and in groups.
It can be used to create variable and value labels, rename
variables, change display formats, and manage notes.
It has the ability to filter and group variables as well as to
create variable lists.
The syntax for labeling a variable is: label variable var1 "description"
The various levels of a categorical variable can be
labeled using the following two Stata commands together:
label define and label values.
Example: gender has two categories, 1 for male and 2 for
female. gender can be labeled as: label define gender 1
"male" 2 "female"
label values gender gender
MODIFYING VARIABLES
We begin with an explanation of how to label data in Stata.
Then see how to format variables.
◦ rename variable
◦ label variable
◦ Keep/ drop and order/sort
◦ label define/values
rename: This command is used to give an existing variable a new
name. The command is
rename old_variable new_variable
Example: Generate dummies for the region variable and rename
the new dummy variables (see the sketch after this slide)
label variable: this helps us give a short description of the
variable. Command: label variable yield "output per hectare"
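A hedged sketch of that exercise, reusing the region dummies and region labels from earlier slides (the new names are assumptions chosen for illustration):
tab q1a, gen(region)                        // creates region1, region2, ... dummies
rename region1 tigray                       // give the first dummy a descriptive name
label variable tigray "household lives in Tigray"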
MODIFYING VARIABLES
We can subset data by keeping or dropping variables, or by
keeping or dropping observations
◦ keep and drop variables
The keep command keeps the variables in the list and drops
all the other variables
The drop command deletes the variables in the list and keeps
all the other variables
◦ keep and drop observations
The keep if command keeps observations if a condition is
met, and drop if does the opposite.
If there are many variables to drop and only a few to keep, then
apply keep
However, if there are many variables to keep and only a few to
drop, use drop
MODIFYING VARIABLES
Examples
◦ drop pwhole_mixed pretail_mixed
◦ keep pwhole_white pretail_white pwhole_red pretail_red
Note: The two commands are the same in this case
Sort: This command arranges the observations of the current
data into ascending order based on the values of the variables
listed
Variable ordering: This command helps us to organize
variables in a way that makes sense by changing the order of
the variables
order x y z: Puts x first y second z third
sort x : Puts data in ascending order of the variable x
Appending datasets
Often we don't have all the information that we need in one
dataset, and we have to append or merge two or more
datasets into one.
There are several types of "appending" and "merging" of
datasets.
As long as the variables in the files are the same and the only
thing you need to do is to add observations, this is a vertical
combination.
For this we use the append command.
Appending datasets
Appending data files
◦ concatenates two datasets, that is, stick them together
vertically, one after another
use "$final\tprice_addis.dta", clear
append using "$final\tprice_dire.dta“
save "$final\tprice_all.dta", replace
◦ The append command does not require that the two
datasets contain the same variables.
◦ But it is highly recommended to use an identical list of
variables for the append command, to avoid missing
values coming from one of the datasets
Defining Variables
label define: This command gives a name to a set of value
labels. For example, instead of numbering the regions, we can
assign a label to each region. The syntax is:
label define lblname # "label" # "label" # “label“ [, add modify]
Where: lblname is the name given to the set of value labels
◦ # are the value numbers
◦ “label”are the value labels
◦ add means add these value labels to the existing set
◦ modify means to change these values in the existing set
Defining Variables
Note that:
You can use the abbreviation “label def“
The double quotation marks are only necessary if there are
spaces in the labels
Stata will not let you define an existing label unless you say
“modify” or “add“
label values
◦ This command attaches named set of value labels to a
categorical variable.
The syntax is:
label values varname [lblname] [, nofix]
label define reg 1"Tigray" 3"Amhara" 4"Oromia" 7"SNNP",modify
label values q1a reg
Merging and appending datasets
If the identifying variable
which appears in the files is
unique in both files, then it's
a one-to-one match.
Unique means that for each
value of this variable, there
is only one observation that
contains it.
For example, suppose country is
the identifying variable.
In both datasets, each country
has only one observation.
Merging and appending datasets
One-to-one match merging
The merge command sticks two datasets horizontally, one
next to the other. Before any merge, both datasets must be
sorted by identical merge variable
. use p2sec9a.dta, clear
. sort hhid item1234
. save consumption.dta, replace
. use p_r5, clear
. sort hhid item1234
. save comprice.dta, replace
. use consumption.dta, clear
. merge hhid item1234 using comprice
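A hedged note: in Stata 11 and newer, the same one-to-one match can be written with the explicit merge 1:1 syntax, which does not require sorting beforehand; a minimal sketch using the same file names:
use consumption.dta, clear
merge 1:1 hhid item1234 using comprice.dta
tab _merge        // shows how many observations matched in both files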
Merging and appending datasets
One-to-many
matching
◦ If the identifying
variable is unique in
one file, but not
unique in the other,
then it's a one-to-
many matching.
Collapsing data sets
Collapse
◦ Sometimes we have data files that need to be
aggregated at a higher level to be useful for us.
◦ For example, we have household data but we are really
interested in regional data.
◦ The collapse command serves this purpose by
converting the dataset in memory into a dataset of
means, sums, medians and percentiles
For instance, we would like to see the mean cons in each
q1a and sex of hh head.
collapse (mean) cons, by(q1a sex)
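A hedged extension of that example, collapsing to one record per region with several statistics at once (the new variable names are assumptions chosen for illustration):
collapse (mean) meancons=cons (median) medcons=cons (sum) totalfood=food, by(q1a)
list in 1/5        // the dataset in memory now has one observation per region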
Additional Stata Resources
Don't forget to get help for command-specific searches:
◦ help
◦ search
◦ hsearch
◦ net search
http://stataproject.blogspot.com.
http://www.stata.com/
http://www.stata.com/statalist/
Additional Stata Resources
Statalist is
◦ an email listserver
◦ hosted at the Harvard School of Public Health,
◦ where Stata users, from experts writing Stata programs to
users like us,
◦ maintain a lively dialogue about all things statistics and
Stata.
3. Data Analysis Using Stata ©
Describing Data with Summary
Statistics
Applying Some Statistical Tests in
Stata
Describing Data with Graphs
Exercises
3.1. Basic Descriptive Statistics Using Stata
• summarize
– The summarize command produces statistics on continuous
variables like age, food, cons, and hhsize.
– The syntax looks like this:
summarize [varlist] [if exp] [in range] [, [detail]]
By default, it produces the following statistics:
• Number of observations
• Average (or mean)
• Standard deviation
• Minimum
• Maximum
Basic Descriptive Statistics
Using Stata
If you specify "detail", Stata gives you additional
statistics, such as
• skewness,
• kurtosis,
• the four smallest values
• the four largest values
• various percentiles.
mean = expected value (expectation) of Y = E(Y) = μY = long-
run average value of Y over many repeated occurrences of Y
variance = E[(Y – μY)^2] = σY^2 = measure of the squared spread of the
distribution around its mean
standard deviation = square root of the variance = σY
Basic Descriptive Statistics using Stata
skewness = E[(Y – μY)^3] / σY^3
= measure of asymmetry (lack of symmetry) of a distribution
skewness = 0: distribution is symmetric
skewness > (<) 0: distribution has a long right (left) tail
Skewness mathematically describes how much a distribution
deviates from symmetry
kurtosis = E[(Y – μY)^4] / σY^4
= measure of mass in the tails = measure of the probability of large values
kurtosis = 3: normal distribution
kurtosis > 3: heavy tails ("leptokurtotic")
Basic Descriptive Statistics Using Stata
Here are some examples:
summarize: gives statistics on all variables
summarize hhsize food: gives statistics on selected variables
summarize hhsize, detail
summarize hhsize cons if q1a==3: gives statistics on two
variables for one region
by: This prefix goes before a command and asks Stata to
repeat the command for each value of a variable.
The general syntax is: by varlist: command
Note: the bysort command is most commonly used because it combines
sorting and by in one step. An example of the by prefix is:
bysort sex: sum rconsae   for each sex of household head, gives statistics on real
per capita consumption.
Basic Descriptive Statistics
Using Stata
Tabulate, tab1, tab2
◦ These are three related commands that produce frequency
tables for discrete variables.
◦ They can produce one-way frequency tables (tables with
the frequency of one variable) or two-way frequency tables
(tables with a row variable and a column variable).
tabulate / tab: produces a frequency table for one or two
variables
tab1: produces a one-way frequency table for each variable
in the variable list
tab2: produces all possible two-variable tables from the list
of variables
Basic Descriptive Statistics Using Stata
You can use several options with these commands:
• cell: gives the overall percentage for two-way tables
• column: gives column percentages for two-way tables
• row: gives row percentages for two-way tables
There are many other options, including other statistical tests.
For more information, type “help tabulate”
Some examples of the tabulate commands are:
tabulate q1a: produces table of frequency by region
tabulate q1a sexh: produces a cross-tab of frequencies by region and sex of head
tab q1a sexh
tab1 q1a sexh: produces two tables, a one-way frequency table for each variable
tab2 q1a sexh
tab2 q1a poor
tab2 q1a sexh, cell
tab2 q1a sexh, row
tab2 q1a sexh, column
Statistical Tests
ttest command
We would like to see if the mean of hhsize equals to 6 by using
single sample t-test, testing whether the sample was drawn from a
population with a mean of 6. ttest command is used for this
purpose: ttest hhsize=6
We may also want to test whether the mean of cons equals the mean of food (a paired t-test).
ttest cons=food
ttest command for independent groups with pooled (equal)
variance: ttest cons, by(sexh)
ttest command for independent groups using unequal variance:
ttest cons, by(sexh) unequal
STATISTICAL TESTS
correlate command
◦ The correlate command displays a matrix of Pearson correlations
for the variable listed. E.g correlate cons hhsize
Correlation vs Causation
Two variables can be correlated without one being the cause of the other
corr(X, Z) = cov(X, Z) / sqrt(var(X) var(Z)) = σXZ / (σX σZ) = rXZ
• –1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = –1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association
• The correlation coefficient is unitless, so it avoids the units problem of the
covariance.
• corr(X,Z) when X and Z are measured in feet is the same as when they are
measured in meters or pounds
PRESENTING DATA WITH GRAPH
The Stata graph commands begin with the word graph (in some
cases this is optional).Examples:
◦ graph twoway scatterplots, line plots,
◦ graph bar bar charts
◦ graph pie pie charts
Examples
◦ graph twoway scatter cons food
We can show the regression line predicting cons from food using
the twoway lfit plot type.
◦ twoway lfit cons food
The two plots can be overlaid like this
◦ twoway (scatter cons hhsize) (lfit cons hhsize)
PRESENTING DATA WITH GRAPH
Labeling graphs
scatter var1 var2, title("title") subtitle ("subtitle") xtitle
("xtitle") ytitle ("ytitle") note("note")
Example
scatter ageh cons , title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
Histograms and kernel density
◦ histogram cons
◦ histogram cons, normal
kernel density
◦ kdensity cons
◦ kdensity cons, normal
4. Regression Analysis Using Stata
Steps in Empirical Analysis
Structure of Economic Data
Regression Models
◦ Assumptions and their violations
Regression Analysis Using Stata
◦ Linear Models: Cross Section
◦ Linear Models: Panel Data
◦ Nonlinear Models: Cross Section
Reporting Regression Models
Steps in Empirical Analysis
Empirical Analysis
• An empirical analysis uses data to test a theory or to estimate a
relationship.
• First step in any empirical analysis is the careful formulation of
the question of interest.
• Literature review is an important step in any empirical analysis
• In some cases a formal economic model is constructed.
• An economic model consists of mathematical equations that
describe various relationships, e.g. y = f(x1, x2, ...)
• Formal economic modeling is the starting point for empirical
analysis, but it is more common to use economic theory less
formally, or even intuition
Steps in Empirical Analysis
• Then we need to turn the economic model into what we call
an econometric model: yi = β0 + β1 xi1 + β2 xi2 + ... + εi
• The form of the function must be specified before we can
undertake an econometric analysis.
• We need to deal with variables that cannot reasonably be
observed.
• We must somehow account for the many factors that we
cannot even completely list
• Unobserved factors and error in measurement can be
accounted for using error term or disturbance term
• Once an econometric model has been specified, various
hypotheses of interest can be stated in terms of the unknown
parameters
Structure of Economic Data
Data Management
• Structure of Economic data
– Economic data sets come in a variety of types
– Some econometric methods can be applied with little or no
modification to many different kinds of data sets
– The special features of some data sets must be accounted for or
should be exploited
– We next describe the most important data structures encountered
in applied work
1. Cross-section
• Consists of a sample of individuals, households, firms, cities,
states, countries, or a variety of other units, taken at a given point
in time
• In a pure cross section analysis we would ignore any minor timing
differences in collecting the data
Structure of Economic Data
• An important feature of cross-sectional data is that we can
often assume that they have been obtained by random
sampling from the underlying population, which simplifies
most of the analysis
• But there could be violations of the random sample
assumptions
– Refusal to respond by some group of the respondents
– Sampling from units that are large relative to the
population (the population is not large enough to
reasonably assume the observations are independent draws)
• cross-sectional data is closely aligned with the applied
microeconomics fields, such as labor economics, state and
local public finance, industrial organization, urban
economics, demography, and health economics
Structure of Economic Data
2. Time-series
• A time series data set consists of observations on a variable or
several variables over time. Examples of time series data include
stock prices, money supply, consumer price index, gross domestic
product, annual homicide rates
• Because past events can influence future events and lags in
behavior are prevalent in the social sciences, time is an important
dimension in a time series data set
• The chronological ordering of observations in a time series conveys
potentially important information
• What makes time series more difficult to analyze than cross-
sectional data is the fact that economic observations can rarely, if
ever, be assumed to be independent across time
• Another feature of time series data that can require special attention
is the frequency at which the data are collected
Structure of Economic Data
3. Pooled cross-sections
• Some data sets have both cross-sectional and time series features
• Pooled cross-section is a combination of several cross-section data
that are collected from the same population in different time periods
• Pooling cross sections from different years is often an effective way
of analyzing the effects of new policies
• The idea is to collect data from the years before and after a key
policy change
4. Panel or longitudinal data
• A panel data (or longitudinal data) set consists of a time series for
each cross-sectional member in the data set
• Panel data can be collected on household, firms or geographical units
• The key feature of panel data that distinguishes it from a pooled cross
section is the fact that the same cross-sectional units (individuals,
firms, or counties) are followed over a given time period
Simple Linear Regression
In this case we only have one regressor and
a constant: yi = β0 + β1 xi + εi
The Gauss–Markov
Assumptions
There are assumptions about the error term εi
and the explanatory variables xi
The so-called Gauss–Markov assumptions are
A1: E(εi) = 0, i = 1, 2, ..., n
A2: ε1, ..., εn and x1, ..., xn are independent
A3: V(εi) = σ^2, i = 1, 2, ..., n
A4: cov(εi, εj) = 0 for all i, j = 1, 2, ..., n, i ≠ j
Additional Assumptions
The relationship of interest is linear
◦ Linearity is in parameter, not in variables
Data are stationary (pertinent for time series
data)
◦ Distribution is the same over time
Weak/covariance vs strong stationarity
Data is random
Survey design
Nature of data
Properties of the OLS
Estimator
Under assumptions (A1)–(A4), the OLS
estimator b for β has the following properties:
◦ Unbiasedness: E(b) = β
◦ The OLS estimator b of β is the best estimator, i.e.
among the set of linear unbiased estimators, the OLS estimator is
the one with the least variance
◦ b is a linear function of the explanatory and the
dependent variables
◦ Hence, b is BLUE for β
Multiple Regression Analysis
Multiple regression analysis is more
amenable to ceteris paribus analysis
It allows us to explicitly control for many
other factors which simultaneously affect
the dependent variable
yi = β0 + β1 xi1 + β2 xi2 + ... + βk xik + εi
Multiple regression models can
accommodate many explanatory variables
Violations of GM
Assumptions
A1: E(εi) = 0, i = 1, 2, ..., n
A2: ε1, ..., εn and x1, ..., xn are independent
A3: V(εi) = σ^2, i = 1, 2, ..., n
A4: cov(εi, εj) = 0 for all i, j = 1, 2, ..., n, i ≠ j
Additional
The relationship of interest is linear
Data are stationary
Data is random
Violations of GM Assumptions
GM assumptions can be violated for a
variety of reasons
Example 1: The assumption that there is
zero covariance between the error term and
one or more explanatory variables can be
violated due to:
◦ Omitted variables bias
◦ Measurement error
◦ Simultaneity
Example 2: The linearity assumption may fail
◦ Most behavioral relationships are nonlinear
◦ The structure of the data may require us to use
nonlinear models
Violations of GM Assumptions
Example 3: cross-section (household) data are
usually heteroskedastic
Example 4: time series data are usually
nonstationary
Example 5: there might be a selection problem
How to amend these violations
Omitted variables
◦ Instrumental variable
◦ Proxy variable
◦ Simultaneous equations models:2SLS
◦ Panel data
Nonlinear models
◦ Use discrete choice models
◦ Corner solution outcomes
Heteroskedasticity
◦ Use weighted least squares
◦ Use heteroskedasticity-robust standard errors
E.g. robust/White standard errors
How to amend these violations
Nonstationary time series
◦ Engle–Granger error-correction (EG-EC) model
◦ Johansen approach
Non random sample
◦ Selection models
E.g. Heckman Selection model
Remember that before trying to amend these
violations, we have to test their existence in a
given data set.
Simple Linear Regression Models: Cross
Section
General Format
regress depvar indvar if/in weights, options
The regress command performs OLS
regression and reports an analysis-of-variance
table, goodness-of-fit statistics, coefficient
estimates, standard errors, t statistics, p-values,
and confidence intervals
See the examples below
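A minimal sketch of a cross-section regression using the ERHS consumption file from earlier sections (the choice of covariates is an assumption made for illustration):
use "$final\ERHScons1999.dta", clear
regress cons hhsize ageh                    // basic OLS
regress cons hhsize ageh, robust            // with heteroskedasticity-robust standard errors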
Basic Format: Linear-Cross Section
◦ The xi prefix is used to dummy code categorical
variables, and we tag these variables with an “i.”
in front of each target variable
xi: regress cons hhsize i.q1a,
robust
◦ By default, Stata selects the first category in the
categorical variable as the reference category. If
we would like to declare a certain category as
reference category
char q1a[omit] 7
xi:regress cons hhsize i.q1a,
robust
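A hedged note: in Stata 11 and newer, factor-variable notation can replace the xi prefix; ib7.q1a sets region 7 as the base, mirroring the char q1a[omit] 7 example above:
regress cons hhsize i.q1a, robust           // first category of q1a as the base
regress cons hhsize ib7.q1a, robust         // region 7 as the reference category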
Basic Format: Linear-Panel
The basic format for linear panel-data models is:
xtreg depvar indvars [if/in] [weights], options
Two things to be noted before running panel-data
regression models:
◦ The dataset should be in long form (not the wide form,
which is the default after merging two or more
datasets). Use reshape long to convert it:
reshape long stubnames, i(identifier) j(timevariable)
◦ The panel structure should be declared with xtset (e.g.
xtset panelvar timevar) so that xtreg knows the panel and
time identifiers (a sketch follows)
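A minimal panel sketch following those notes (the file, identifier, and time-variable names are hypothetical):
use panel_wide.dta, clear
reshape long cons hhsize, i(hhid) j(year)   // wide variables cons1994, cons1999, ... become one cons variable plus a year variable
xtset hhid year                             // declare the panel structure
xtreg cons hhsize, fe                       // fixed-effects regression
xtreg cons hhsize, re                       // random-effects regression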
Reporting Regression Outputs
One can present regression outputs in the
format that we see in journals, articles, etc.
To do that
◦ Regress the models and store them separately
◦ Use estimates store to do this
◦ Combine the tables using estimates table
command
◦ See Examples
◦ If one would like to report coefficients of only
selected explanatory variables, use the
keep(varlist) option
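A hedged sketch of this workflow (the model names m1 and m2 are arbitrary labels; the covariates are assumptions carried over from the earlier examples):
regress cons hhsize, robust
estimates store m1
regress cons hhsize ageh, robust
estimates store m2
estimates table m1 m2, star stats(N r2) keep(hhsize ageh)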
Many Thanks for Your
Attention and Effort