Lecture 1-2 Applied Econometrics

The document discusses various data management techniques in Stata including loading, viewing, describing, and cleaning data. Key methods covered include using do-files, loading datasets, viewing data, describing dataset structure and variables, changing variable types, labeling data and variables, sorting data, ordering and dropping variables or observations, generating and replacing variables, and creating dummy variables.


APPLIED ECONOMETRICS

Master 1 IES (2020-2021)


International Economics Studies

Fozan Fareed
Email ID: [email protected]
Key Objectives of the Course
• Familiarize students with the basic features of Stata and apply econometrics using real data
sets

• On the completion of the course, students should have learnt


➢ How to manage big data sets
➢ How to do descriptive statistics and visualize data
➢ How to run OLS regression
➢ How to do regressions with Limited Dependent Variables
➢ How to tackle the endogeneity issue
Grading of the Course

➢Group Project (Graded out of 20 Points)


➢ Deliverables: 1) Project Report and 2) Do File
➢ Groups of 2-3 people

➢ Bonus points (positive/negative) for class participation

➢ Zero if you miss more than 2 classes

➢ The Free Rider Issue


Course Outline

Data Management

Descriptive Statistics and Data Visualization

OLS Regression

Limited Dependent Variable Models

Tackling Endogeneity

Details about the Project


Chapter 1:
Basics in Stata

“Without data you are just another person with an opinion”

“We trust in God, all others must bring data”

-W. Edwards Deming
1.1 : Back to Basics: Learning by Doing
• Please open the data ‘Lecture 1 HDI’

• It has country-wise information on the Human Development Index (HDI) and the Sustainable Development Goals (SDGs)

• Good Habits: Always work with a Do File


1.2: The Do-File (.do)

• The Do-file Editor creates do-files: Write your commands in the do-file and
execute it by clicking on the Execute icon

• Using do-files is essential to reproduce your work later! It's also much more
convenient if you want to modify some commands. Always keep track of your
program in a do-file.

• Good habits: Save your work regularly!!!


1.3: What does a do-file look like?

Write comments in your do file: ***blabla***

Comments (in green) will not be executed
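
As a small illustration of commenting, here is a sketch of the comment styles Stata accepts inside a do-file. It assumes the example dataset auto.dta that ships with Stata (loaded with sysuse), not the lecture's HDI data.

```stata
* A line that starts with an asterisk is a comment
*** Extra asterisks are fine too, as in ***blabla***
// Two slashes also start a comment (in do-files)
/* A block comment
   can span several lines */
sysuse auto, clear   // load an example dataset shipped with Stata
describe             /* a comment can also follow a command */
```

None of the commented text is executed; only sysuse and describe run.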


1.3: What does a do-file look like? (Cont.)

• Run the commands of the do-file, showing all commands and their output (remark: you can run only a part of your program by selecting the lines)
• Save the current do-file to disk
• Open an existing do-file from your disk in a new tab
• Open a new do-file in the Do-file Editor
• Create your do-file to save today's commands and save it regularly (lecture1.do)
1.4: Loading Data Sets
• There are many ways to load data into Stata

• Open an existing Stata-format data file

• You can import other data formats from here
1.4: Loading Data Sets (Cont.)
• Alternatively, you can open a pre-existing STATA database by coding
/*Type in your do-file and execute:*/
clear /*we clear the memory*/
use "path to database \census.dta" /*we load the database*/

• A series of variables now appear in the Variables window

browse /*to look at your data*/


1.4: Loading Data Sets (Cont.)
• Load a database from data in Excel format, using the Data Editor (Edit):
• Open your Excel sheet and select your data.
• Copy your data.
• Open Stata:
• Point and click on "Data Editor (Edit)"
• Point and click on the first cell (on the left side)
• Paste your database (previously copied)
• A message box appears, so select what you intend to do
• Here select: Treat first row as variable names
• Close the Data Editor
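
Copy and paste cannot be replayed from a do-file, so a command-based import may be easier to reproduce. The sketch below is only an illustration: it assumes the example dataset auto.dta shipped with Stata, and the file name auto_demo.xlsx is hypothetical (it is written to the current working directory first so that the example is self-contained).

```stata
* Create a small Excel file to import (hypothetical file name)
sysuse auto, clear
export excel using "auto_demo.xlsx", firstrow(variables) replace

* Import it back, treating the first row as variable names
import excel using "auto_demo.xlsx", firstrow clear
describe
```

The firstrow option plays the same role as "Treat first row as variable names" in the paste dialog.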
1.5: Viewing the Data
• We can view the data by browsing them in the Data Editor. This can be done:

1. By clicking on the "Data Editor (Browse)" button,
or
2. By selecting Data > Data Editor > Data Editor (Browse) from the menus,
or
3. By typing the command browse (here the command is just "browse" or "br")
DATA MANAGEMENT
1.6: Details about the Data
• Step 1: Figure out the key information about the data
➢ How many observations are there in the data?
➢ How many variables? Which ones are numeric and which ones are text?
➢ What is the data set about? Does it have a title?
➢ Is it micro-level or macro-level data? (E.g. individual level, firm level,
regional level, country level, or sector level.)
➢ Is it a cross section?

• How can you do it?


1.6 : Details about the Data (Cont.)
Stata Command: describe

The output shows:
• The title/label of the data
• The number of observations
• The number of variables
• A list of variable names and their storage types

• Another way: go to the Data menu and click on Describe data
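
A minimal sketch of describe in use, assuming the example dataset auto.dta that ships with Stata rather than the lecture's HDI file:

```stata
sysuse auto, clear
describe                 // full listing: names, storage types, labels
describe, short          // header only: observations, variables, data label
* describe leaves the counts behind in r()
display "Observations: " r(N) "   Variables: " r(k)
```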
1.7: Variable Types in Data Editor
• When the Data Editor is open, you can see that the:
• Columns represent variables, whereas the rows represent observations.

• The data are displayed in multiple colors:
• values shown in black are numeric
• whereas those shown in red are text (strings); blue marks numeric values that are displayed through value labels.

• Missing values: a period (.) for numeric variables and empty quotations "" for
string variables
1.8 : Changing Variable Type
• Numeric to non-numeric: tostring [var name], gen([new var name]) [force]
• Non-numeric to numeric:
• If the text variable contains numbers stored as text: destring [var name], gen([new var name])
• If the text variable contains categories (words): encode [var name], gen([new var name])

• If you want to change the type and replace the existing variable
• destring [var name], replace
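
The conversions above can be tried on any dataset. This sketch assumes the example dataset auto.dta shipped with Stata; the new variable names (mpg_str, mpg_num, make_id) are made up for the illustration.

```stata
sysuse auto, clear
tostring mpg, generate(mpg_str)        // numeric -> string copy
destring mpg_str, generate(mpg_num)    // digits stored as text -> numeric
assert mpg_num == mpg                  // the round trip loses nothing here
encode make, generate(make_id)         // words -> numeric codes with value labels
describe mpg mpg_str mpg_num make make_id
```

destring only works when the text really is a number; encode is the tool for words, since it assigns one integer code per distinct text value.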
1.9 : Labelling
• To ensure that the database is easily readable, it's important to add some extra
information to your dataset and variables

• Labelling Data: For a data label (that adds a label to your dataset) type label data
"Human Development Index Data" in the Command window & press Enter.

• Stata Command: label data "[label]"

• Labelling Variable: For a variable label (that adds a label to a particular variable)
type label variable [the name of the variable] "[the label]"

• Stata Command: label var [variable name] "[label]": This adds information on a
specific variable
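
A short sketch of both kinds of label, assuming the example dataset auto.dta shipped with Stata; the label texts are invented for the illustration.

```stata
sysuse auto, clear
label data "Example car data (illustration only)"
label variable mpg "Fuel efficiency (miles per gallon)"
describe mpg             // the new variable label appears in the last column
```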
1.10 : Renaming a Variable
• Let's say that you want to change the name of one or more variables
• How can you do that?

• Example: the variable urban isn't precise enough, so type a new name for this variable

• Stata command: rename [old name] [new name]
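
For illustration, a sketch of renaming one variable and then several at once. It assumes the example dataset auto.dta shipped with Stata; the new names are hypothetical.

```stata
sysuse auto, clear
rename price price_usd                           // one variable
rename (mpg weight) (miles_per_gallon weight_lbs) // several variables at once
describe price_usd miles_per_gallon weight_lbs
```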


1.11 : Sorting the Data
• Let's say that you want your data to be arranged in a certain order
• You can sort observations into ascending (or descending) order based on the values
of one (or several) variables!

• Stata command: sort [variable(s) name(s)]


• Use the command gsort if you want to sort data in descending order

***sorting the data using gsort***


gsort -gdp
gsort +gdp
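
A runnable sketch of the same idea, assuming the example dataset auto.dta shipped with Stata in place of the lecture's gdp variable:

```stata
sysuse auto, clear
sort price                     // ascending only
list make price in 1/5         // five cheapest cars
gsort -price                   // minus sign: descending
list make price in 1/5         // five most expensive cars
gsort foreign -mpg             // ascending by origin, descending mpg within origin
list make foreign mpg in 1/5
```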
1.12 : Ordering Variables
• In case you want to change the position of a variable in a dataset, you can use the
command order

• To have a variable at the start of the data:
Stata command: order [variable], first
• To have a variable at the end of the data:
Stata command: order [variable], last

Note: You can also order a variable before/after some other variable
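
A sketch of the four placements, assuming the example dataset auto.dta shipped with Stata:

```stata
sysuse auto, clear
order foreign, first           // move to the first column
order make, last               // move to the last column
order price, after(mpg)        // place right after another variable
order weight, before(mpg)      // place right before another variable
describe, simple               // variable names in their new order
```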
1.13 : Dropping Observations or Variables
• Drop variable(s): drop [variable(s) name(s)]

• Drop observations (rows): drop if [condition]
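
To make the two forms of drop concrete, here is a sketch that assumes the example dataset auto.dta shipped with Stata. The keep command, which is not covered on the slide, is the mirror image of drop.

```stata
sysuse auto, clear
drop trunk turn                // drop two variables (columns)
count
drop if missing(rep78)         // drop observations (rows) with a missing repair record
count
keep if foreign == 1           // keep only rows that satisfy the condition
count
```

Dropping cannot be undone once the data are saved, so it is worth running count before and after to see how many rows were lost.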


1.14 : Generating a new Variable
• We can create some new variables from existing variables recorded in the database
• To do this, type generate [new variable name] = [formula] in the Command window
and press Enter.

• Example: For this example we will create a GDPperCapita variable that provides
information about the economic performance of the country.

**creating a new variable**


generate gdppercapita= gdp/ population
1.15 : Replacing a Variable
• Replacing a variable works similarly; the difference is that generate creates a new
variable and replace modifies a pre-existing variable
• Stata Command: replace [variable name] = [formula] [if condition]
• It's useful if you want to transform a continuous variable into a categorical variable
• Remark: it is safer to first create a new variable rather than directly replacing the
initial variable. You can also use the Stata command recode
• Example: let's recode the HDI variable
gen hdicategory = . /*start from missing so unclassified observations stay missing*/
replace hdicategory=1 if hdivalue<.25
replace hdicategory=2 if hdivalue>=.25 & hdivalue<.50
replace hdicategory=3 if hdivalue>=.50 & hdivalue<.75
replace hdicategory=4 if hdivalue>=.75 & !missing(hdivalue) /*Stata treats missing as larger than any number*/
1.16 :Generating Dummy Variables
• You can transform a categorical variable into dummy variables by typing this general
syntax:
tab [variable name], gen([variable name 2]) [missing]
• It creates one dummy variable per category of the categorical variable
• Remark: the missing option creates an additional dummy variable for missing values

tab developmentlevel, gen(developmentD)


browse
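
A sketch of dummy creation with and without the missing option, assuming the example dataset auto.dta shipped with Stata; the stub names rep_d and repm_d are hypothetical.

```stata
sysuse auto, clear
tabulate rep78, generate(rep_d)            // one dummy per observed category
tabulate rep78, generate(repm_d) missing   // plus one dummy for missing values
summarize rep_d* repm_d*
```

Without the missing option, observations with a missing category get a missing value in every dummy; with it, they get their own 0/1 indicator.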
1.17 :Label Values
• We can also give labels to the values in a variable
label define [label name] [value] "[label]" [value] "[label]" ...
label values [var name] [label name]

• Example: Let's label the values for HDI

label define labelhdi 1 "Very Low Development Level" 2 "Low Development Level" 3 "High Development Level" 4 "Very High Development Level"

label values hdicategory labelhdi


1.18 Save your work!
• When your work is finished, Save your database
• If you want to create a new database
save "path to database \hdi.dta"

• If you want to replace your database


save "path to database \hdi.dta", replace

• Save your do-file program


• You can also do this from the File menu in Stata
• Clear Stata's memory (clear command)
1.19 : Using the Collapse Command
• Making a new dataset of summary statistics (at a certain macro level)

• Stata command: collapse ([statistic]) [variables], by([category])

• Statistics include: mean, median, sum, count, percent, percentiles (p1 to p99), etc.

Example:
• collapse (sum) GDP trade TotalPopulation, by(Region)
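
Because collapse replaces the data in memory, it is often wrapped in preserve and restore. The sketch below assumes the example dataset auto.dta shipped with Stata; the new variable names (mean_price, mean_mpg, n_cars) are invented for the illustration.

```stata
sysuse auto, clear
preserve                       // keep a copy of the full dataset
collapse (mean) mean_price=price mean_mpg=mpg (count) n_cars=price, by(foreign)
list                           // one row per category of foreign
restore                        // bring the original data back
```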
1.20 : What does the Help Command do?
• Stata command: help [command]

• Tells you the right syntax for the command
• Shows how to reach it via the menus
• Describes what the command does
1.21 : Other Useful Commands
| Commands | Description | Examples |
|---|---|---|
| Concatenate | To join the characters of two or more variables | egen X = concat(VarA VarB) |
| Duplicates | It helps you to manage duplicate values | duplicates drop; duplicates tag, gen(dup) |
| Recode | It helps you recode an existing variable | recode developmentlevel (1=6) |
| Count | To count the number of certain observations in data | count if developmentlevel==2 |
| Global | To put a list of variables under one macro/name | global demographics age married sex; summ $demographics |
| Forvalues | Used to create a loop. Loops are used to perform some task repeatedly over a set of items. | forvalues lname = range { commands } |
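
A few of these commands in action, as a hedged sketch on the example dataset auto.dta shipped with Stata. The macro name carvars and the variable name make_origin are hypothetical. Run the block from a do-file, since the loop uses a local macro.

```stata
sysuse auto, clear

* Global macro: one name for a list of variables
global carvars price mpg weight
summarize $carvars

* forvalues: repeat a command for i = 1, 2, ..., 5
forvalues i = 1/5 {
    display "Cars with repair record `i':"
    count if rep78 == `i'
}

* concat: join the contents of two variables into one string
egen make_origin = concat(make foreign), punct(" - ")

* duplicates: report repeated combinations of values
duplicates report foreign rep78
```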
1.22: Merge / Append Data Sets
• Sometimes, the information that we need is contained in different databases

• We need to join observations from the database currently in memory (master database) with another database (using data), matching on one or more key variables

• Example: information about top celebrities


1.23 : Merging Data Sets
• If you want to add variables: Merge
• Example:
• 1- Celebrity List.dta (master database) provides information on the age of top
15 celebrities

• 2 merge.dta (using database) provides information on their nationality and IQ

• We want to combine these 2 databases in order to have only one database containing all the variables that we need
1.23 : Merging Data Sets (Cont.)
• Step 1: Open the using database (here, data2-merge.dta) in Stata
• In order to combine this database with the master database, we need a key variable that is
common to both. Make sure the name of the variable is the same in both files.

• Step 2: Close the using dataset (after making sure that there is a variable common to both)
and open the master database data1- Celebrities.dta

• Now, we merge this master database with the using database

Syntax: merge [type] [key variable(s)] using "[using dataset]"

In this case: merge 1:1 name using "C:\Users\Desktop\data 2- merge.dta"


1.23 : Merging Data Sets (Cont.)
• STATA has merged the 2 databases and has automatically created a new variable _merge
• 3 = observation appeared in both databases
• 1 = observation appeared in master database only (here mergeB.dta )
• 2 = observation appeared in using database only (here mergeA.dta)

• Save your new database (we name it merge.dta)
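
The following self-contained sketch reproduces a 1:1 merge on made-up data, so the three values of _merge can be seen without the lecture files. All names and values are hypothetical. Run it as one block from a do-file, because the tempfile names are local macros.

```stata
* Master data: name and age (hypothetical values)
clear
input str10 name age
"Alice" 30
"Bob"   45
"Chloe" 27
end
tempfile master
save `master'

* Using data: name and nationality (hypothetical values)
clear
input str10 name str10 nationality
"Alice" "French"
"Bob"   "Indian"
"Dana"  "Kenyan"
end
tempfile extra
save `extra'

* Merge on the key variable that both files share
use `master', clear
merge 1:1 name using `extra'
tabulate _merge      // 1 = master only, 2 = using only, 3 = matched
list
```

Here Chloe exists only in the master data and Dana only in the using data, so both unmatched codes show up alongside the matched rows.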


1.24 : Appending Data Sets
• If you want to add observations: Append
• Now, we want to add celebrities to our top 15 celebrities to obtain data on the top 20
celebrities

• merge.dta (master database, currently open in Stata)

• Data3- append.dta (using database)

General syntax: append using "[dataset]"


1.24 : Appending Data Sets (Cont.)
• Append the databases and when finished, save the new database (we call it FINAL
Celebrities.dta)
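
As an illustration of append, the sketch below stacks two small made-up datasets with the same variables. All names and values are hypothetical. Run it as one block from a do-file, because the tempfile name is a local macro.

```stata
* First batch of observations (hypothetical values)
clear
input str10 name age
"Alice" 30
"Bob"   45
end
tempfile batch1
save `batch1'

* Second batch, now in memory as the master data
clear
input str10 name age
"Chloe" 27
"Dana"  52
end

* Add the rows of the first batch below the rows in memory
append using `batch1'
count
list
```

Unlike merge, append needs no key variable: it adds rows, and variables are matched by name.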
1.25 : Final note on data management
• Remember: The quality of your analysis is only as good as the quality
of your data

• Or in simpler terms:
Chapter 2
Descriptive Statistics
“Without data you are just another person with an opinion”

“We trust in God, all others must bring data”

-W. Edwards Deming
2.1 : Summarizing the Data
• Describing the data tells us something about the structure of the data, but
it says little about the data themselves.

• To go further, we can type summarize in the Command window and press Enter.

• The result is a table containing summary statistics about all the variables in
the dataset.

Stata Command: summarize


2.1 : Summarizing the Data (Cont.)
• This command provides the following information:

• The variable name,
• The number of observations for each variable,
• The average value of each variable,
• The standard deviation of each variable,
• The minimum and the maximum values recorded in the dataset for each variable.

Try this: summarize [var(s) name(s)], detail

Note: You can also include conditions with the summarize command. E.g. summarize [varA] if [varB]==1 (a condition uses a double equals sign)
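
A sketch of the three variants, assuming the example dataset auto.dta shipped with Stata:

```stata
sysuse auto, clear
summarize                          // all variables
summarize price mpg, detail        // percentiles, skewness, kurtosis
summarize price if foreign == 1    // condition uses ==, not =
display "Mean price of foreign cars: " r(mean)
```

summarize leaves its results in r(), which is handy when a later command needs the mean or the number of observations.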
2.2 : Variable Types
• Statistics and Econometrics deal with several kinds of data: these data could be
discrete or continuous variables or categorical variables.

• Discrete variables: variables that take values from a finite or countable set,
such as the number of students or of cars in a parking lot.

• Continuous variables: variables that have a continuous distribution function,
such as temperature or income.

• Categorical variables: variables that can take on one of a limited, fixed
number of possible values, such as sex or age group.
2.3 : Frequency Tables
• If you want basic descriptive statistics on a categorical variable, use the command
tabulate, which gives one-way frequency tables

• Example: In our data, What is the proportion of countries by region?


• Stata Command: tab [variable]
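. tab Region

                 Region |      Freq.     Percent        Cum.
------------------------+-----------------------------------
                 Africa |         47       26.11       26.11
                   Asia |         52       28.89       55.00
                 Europe |         37       20.56       75.56
North & Central America |         23       12.78       88.33
                Oceania |         10        5.56       93.89
          South America |         11        6.11      100.00
------------------------+-----------------------------------
                  Total |        180      100.00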

• You can also tabulate two categorical variables together to look at bi-variate statistics
2.3 : Frequency Tables (Cont.)
• Making a two-way table using two categorical variables “Regions” & “Development Level”

• Stata Command: tab [variable 1] [variable 2]


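                      |          Development Level
               Region |      1       2       3       4 |  Total
----------------------+--------------------------------+-------
               Africa |      0       6      10      31 |     47
                 Asia |     18      14      16       4 |     52
               Europe |     32       5       0       0 |     37
North & Central Ame.. |      4      14       4       1 |     23
              Oceania |      2       5       1       2 |     10
        South America |      3       7       1       0 |     11
----------------------+--------------------------------+-------
                Total |     59      51      32      38 |    180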

• But sometimes you are more interested in percentages. How do you get the above table in
percentages?
2.3 : Frequency Tables (Cont.)
• Making a two-way table using two categorical variables
• Stata Command : tab [variable 1] [variable 2], col
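                      |          Development Level
               Region |      1       2       3       4 |  Total
----------------------+--------------------------------+-------
               Africa |      0       6      10      31 |     47
                      |   0.00   11.76   31.25   81.58 |  26.11
----------------------+--------------------------------+-------
                 Asia |     18      14      16       4 |     52
                      |  30.51   27.45   50.00   10.53 |  28.89
----------------------+--------------------------------+-------
               Europe |     32       5       0       0 |     37
                      |  54.24    9.80    0.00    0.00 |  20.56
----------------------+--------------------------------+-------
North & Central Ame.. |      4      14       4       1 |     23
                      |   6.78   27.45   12.50    2.63 |  12.78
----------------------+--------------------------------+-------
              Oceania |      2       5       1       2 |     10
                      |   3.39    9.80    3.13    5.26 |   5.56
----------------------+--------------------------------+-------
        South America |      3       7       1       0 |     11
                      |   5.08   13.73    3.13    0.00 |   6.11
----------------------+--------------------------------+-------
                Total |     59      51      32      38 |    180
                      | 100.00  100.00  100.00  100.00 | 100.00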

• For row percentages instead, use the Stata Command: tab [variable 1] [variable 2], row
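
A sketch of the one-way and two-way variants, assuming the example dataset auto.dta shipped with Stata:

```stata
sysuse auto, clear
tabulate foreign                      // one-way frequency table
tabulate foreign rep78                // two-way table of counts
tabulate foreign rep78, column        // counts plus column percentages
tabulate foreign rep78, row nofreq    // row percentages only, counts hidden
```

Column percentages answer "what share of each column falls in this row"; row percentages answer the reverse, so the choice depends on the question being asked.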


Quick Test

1- What is the average "life expectancy" and average "expected years of schooling" for
Asian countries?

2- What percentage of countries with the lowest HDI level (category 4) are in South
America? [use the variable development]

3- Generate a new variable “Poorcountries” which takes the value “1” if countries have low
development level (category 3 & 4) and “0” otherwise

4- What percentage of countries in Africa fall under the “Poor countries” category?
2.3 : Frequency Tables(Cont.)
• If we are interested in the relationship between one categorical and one
quantitative variable, we can describe the quantitative variable for the
different subgroups of the categorical variable
• bysort [categorical var]: summarize [quantitative var], detail

• Ex: Is there a difference in the GDP of countries across poor and rich countries?

bysort poorcountries: summarize gdp

bysort poorcountries: summarize gdp, detail
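
The same group-by-group description can be sketched on the example dataset auto.dta shipped with Stata. The tabstat line is an additional command, not used on the slide, that prints the groups side by side in one compact table.

```stata
sysuse auto, clear
bysort foreign: summarize price             // one summary per group
bysort foreign: summarize price, detail     // with percentiles per group
tabstat price, by(foreign) statistics(n mean median sd)
```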


2.3 : Frequency Tables(Cont.)
-> Poor = 1

                    GDP PPP Billions 2017
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .2877007       .2029027
 5%     1.182875       .2877007
10%     2.528558       .5860755       Obs                 109
25%     33.54116       .7151036       Sum of Wgt.         109

50%     163.2309                      Mean           859.9123
                        Largest       Std. Dev.      2710.746
75%     485.1162       3740.232
90%     2029.069       4944.928       Variance        7348146
95%     2951.687       17662.27       Skewness       6.209043
99%     17662.27       21223.92       Kurtosis       43.64091

-> Poor = 2

                    GDP PPP Billions 2017
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .2305998       .2305998
 5%     .8071597       .3482386
10%      3.08077       .6237518       Obs                  74
25%     11.14925       .8071597       Sum of Wgt.          74

50%     32.88739                      Mean           284.2043
                        Largest       Std. Dev.      1062.896
75%     121.8979       1019.038
90%     599.5331       1029.206       Variance        1129749
95%     1019.038       2953.732       Skewness         6.8445
99%     8606.475       8606.475       Kurtosis       52.75616

• What can we say about the difference in GDP here?
• Run the ttest to see if the difference is statistically significant
Stata Code: ttest [var], by([var2])
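
A sketch of a two-group mean comparison test, assuming the example dataset auto.dta shipped with Stata. The unequal option is an addition not mentioned on the slide; it relaxes the assumption that both groups have the same variance.

```stata
sysuse auto, clear
ttest price, by(foreign)              // assumes equal variances in both groups
ttest price, by(foreign) unequal      // allows the variances to differ
display "Two-sided p-value: " r(p)
```

The grouping variable must take exactly two values. If the two-sided p-value is below 0.05, the difference in means is statistically significant at the 5% level.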
APPLIED ECONOMETRICS
Lecture 2
Master 1 IES (2018-2019)

Fozan Fareed
Email ID: [email protected]
Class Outline

1. Data Visualization
➢ Scatter Plots, Histogram, Pie charts etc.
➢ How not to put graphs in your research!

2. OLS Regression
➢ Main Assumptions
➢ How to tackle non-linearities
➢ Interpreting Regression Results
Data Visualization
3.1 : Graphics
• Go to the Graphics tab in order to prepare:
• Scatter Plots
• Pie Charts
• Histograms
• Bar Charts
• Others…

Note: It's better to use the toolbox to create graphs rather than writing code
3.2 : Scatter Plot
Stata code: scatter varY varX (Stata commands are lowercase)

Make a plot for HDI (hdivalue) and life expectancy (sdg3lifeexpectancyatbirthyears)

What if you want to fit a line?

twoway (scatter varY varX) (lfit varY varX)

Exercise: Make a plot for life expectancy and HDI numbers for South American countries only.
3.2 : Scatter Plot (Cont.)

twoway (scatter hdivalue sdg3lifeexpectancyatbirthyears, mlabel(country)) if region=="South America", ///
    ytitle("Human Development Index") xtitle("Life expectancy (In Years)") ///
    title("HDI and Life Expectancy") subtitle("South American Countries") ///
    note("Source: United Nations Data 2016")
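
For practice without the HDI file, here is a sketch of a scatter plot with a fitted line, assuming the example dataset auto.dta shipped with Stata. The titles are invented for the illustration.

```stata
sysuse auto, clear

* Scatter plot with a linear fit overlaid
twoway (scatter mpg weight) (lfit mpg weight), ///
    ytitle("Mileage (mpg)") xtitle("Weight (lbs.)") ///
    title("Mileage and weight") note("Source: Stata example dataset auto.dta")

* Same plot for a subgroup only, with labelled points
twoway (scatter mpg weight, mlabel(make)) (lfit mpg weight) if foreign == 1, ///
    title("Foreign cars only")
```

In a do-file, the three slashes (///) continue a long command on the next line.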
3.3 : Pie Chart
Stata Code: graph pie, over([variable])

graph pie, over(region) pie(_all, explode) plabel(_all percent, format(%4.0g)) ///
    title("Regional Classification of Countries") note("Source: UN Data")
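
A comparable pie chart can be sketched on the example dataset auto.dta shipped with Stata; the title is invented for the illustration.

```stata
sysuse auto, clear
graph pie, over(rep78) plabel(_all percent, format(%4.1f)) ///
    title("Cars by repair record") note("Source: Stata example dataset auto.dta")
```

Each slice is one category of the over() variable, and plabel(_all percent) prints the share on every slice.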
3.4 : Histogram
Stata code: histogram [variable name], [options]
histogram [variable], percent

histogram sdg3lifeexpectancyatbirthyears, width(10) start(50) percent normal ///
    ytitle("Percentage of Countries") xtitle("Life Expectancy (In Years)") ///
    title("Life Expectancy at Birth (In Years)") note("Source: UN Data")
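
The same options can be tried on the example dataset auto.dta shipped with Stata, as in this sketch. The bin width and starting value are arbitrary choices for the illustration; start() must not exceed the smallest value of the variable.

```stata
sysuse auto, clear
histogram mpg, width(5) start(10) percent normal ///
    ytitle("Percentage of cars") xtitle("Mileage (mpg)") ///
    title("Distribution of mileage")
```

The normal option overlays a normal density with the same mean and standard deviation as the data, which gives a quick visual check of skewness.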
3.5 : WHAT NOT TO DO!
• Here are a few examples from last year's projects highlighting how not to do descriptive
statistics and data visualization!
3.5 : WHAT NOT TO DO! (Cont.)
- Be careful about the units on the axes
- Proper labelling is important
3.6 Important Notes on Data Visualization
Graphs:
• Graphs should be self-explanatory!
• Both axes should be labelled properly (specify the units if necessary)
• Give the graph a clear title and state the source of the data
• It's better to report percentages and not absolute values
Tables:
• Be careful with the names of the variables
• Clearly mention what the categories within a variable stand for (e.g. Marital Status: Single,
Married, Divorced, In a relationship… Don't use 1 2 3 4)
• It is a bad idea to copy and paste tables from Stata as pictures
• Use percentages and not absolute terms
• No need to summarize unordered categorical variables such as marital status or profession
3.7 : Correlation
• If we are interested in the relationship between two quantitative variables,
we can measure the correlation (pwcorr) or use scatterplots (scatter)

• Ex: what is the relationship between the GDP and HDI?

pwcorr gdp hdivalue

pwcorr gdp hdivalue, sig /*option sig: computes pvalues*/


3.7 : Correlation (Cont.)
. pwcorr GDP hdivalue, sig

             |      GDP hdivalue
-------------+------------------
         GDP |   1.0000
             |
    hdivalue |   0.1818   1.0000
             |   0.0138

H0: the correlation equals 0 in the population
H1: the correlation is not equal to 0
If P value<0.05 → we reject H0 at the 5% level

• Correlation is NOT causation!!!
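
A sketch of the same test on the example dataset auto.dta shipped with Stata:

```stata
sysuse auto, clear
pwcorr price mpg weight, sig       // p-value printed under each coefficient
correlate price mpg
display "Correlation between price and mpg: " r(rho)
```

In the pwcorr output, the second number in each cell is the p-value for H0: correlation = 0.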
3.8 : Why is Correlation not enough?
• Correlation coefficient: the "spurious correlation" issue! (Correlation in the example: 0.666)

• Sometimes we have a "spurious correlation" (i.e., the appearance of a relationship when, in fact,
there is none), because, for example, X is only a proxy for something else.
3.8 : Why is Correlation not enough? (Cont.)
• Why is a correlation coefficient not enough to analyze the impact of X on Y?

1. Sometimes the relationship is non-linear, and this non-linearity is not captured by
the correlation coefficient.

2. The sample may include some extreme values (named: outliers) that strongly
influence the value of the correlation coefficient.

3. Spurious correlation
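
The first point can be illustrated with simulated data. In this sketch (made-up data, not the lecture's), y depends perfectly on x, yet the relationship is a symmetric U-shape, so the linear correlation coefficient comes out at essentially zero.

```stata
clear
set obs 101
generate x = (_n - 51) / 10      // x runs from -5 to 5, symmetric around 0
generate y = x^2                 // exact, but non-linear, dependence on x
correlate x y                    // linear correlation is (numerically) zero
twoway scatter y x, title("Perfect dependence, no linear correlation")
```

A scatter plot reveals the pattern immediately, which is one reason to look at the data before relying on a single coefficient.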
