Python in Stata
David M. Drukker
Executive Director of Econometrics
Cass University
4 September 2019
Contents
II Back to Stata
2 Back to Stata
Part I
Why use Python in Stata?
1 Why use Python in Stata?
• Some parts of data science are methods for data management, graphical analysis, statistical estimation, and
prediction
• In this talk, I focus on statistical estimation
Do today
• Write a Stata command that estimates the mean and stores an estimate of its VCE
• Rewrite this command using Python to do the numerical computations
A mean example
. sysuse auto
(1978 Automobile Data)
. mean mpg rep78
Mean estimation Number of obs = 69
A mean example
. ereturn list
scalars:
e(df_r) = 68
e(N_over) = 1
e(N) = 69
e(k_eq) = 1
e(rank) = 2
macros:
e(cmdline) : "mean mpg rep78"
e(cmd) : "mean"
e(vce) : "analytic"
e(title) : "Mean estimation"
e(estat_cmd) : "estat_vce_only"
e(varlist) : "mpg rep78"
e(marginsnotok) : "_ALL"
e(properties) : "b V"
matrices:
e(b) : 1 x 2
e(V) : 2 x 2
e(_N) : 1 x 2
e(error) : 1 x 2
functions:
e(sample)
A mean example
The Wald test statistic for the q-dimensional hypothesis that $\beta = \beta_0$ is
$$w = (\hat{\beta} - \beta_0)\, V^{-1}\, (\hat{\beta} - \beta_0)'$$
where
• V is the VCE
. python:
python (type end to exit)
>>> print(b)
[[21.28985507246377, 3.4057971014492754]]
>>> print(V)
[[0.498764471131868, 0.033862757453327896], [0.033862757453327896, 0.0142024043
> 39177386]]
>>> end
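The Wald statistic defined above can be sketched in plain Python using the b and V values printed in this session. This is only an illustration: the null vector beta0 = (20, 3) and the helper name wald_2d are made up for the example, and the 2 x 2 inverse is written out by hand so that the block runs outside Stata without NumPy.

```python
# Estimates copied from the printed output for mpg and rep78.
b = [21.28985507246377, 3.4057971014492754]
V = [[0.498764471131868, 0.033862757453327896],
     [0.033862757453327896, 0.014202404339177386]]


def wald_2d(b, V, beta0):
    """Return (b - beta0) V^{-1} (b - beta0)' for a two-parameter problem."""
    c0 = b[0] - beta0[0]
    c1 = b[1] - beta0[1]
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    # Closed-form inverse of a 2 x 2 matrix
    vi = [[V[1][1] / det, -V[0][1] / det],
          [-V[1][0] / det, V[0][0] / det]]
    return (c0 * (vi[0][0] * c0 + vi[0][1] * c1)
            + c1 * (vi[1][0] * c0 + vi[1][1] * c1))


beta0 = [20.0, 3.0]  # hypothetical null values, not taken from the talk
w = wald_2d(b, V, beta0)
print(w)
```

Inside Stata the same quantity would normally be computed with NumPy on the arrays returned by Matrix.get, as the surrounding do-files do.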
• Python blocks also work in do-files
• The Python session is persistent between do-files: objects defined in __main__ stay available
python:
from sfi import Matrix
import numpy as np
b = Matrix.get('e(b)')
V = Matrix.get('e(V)')
b = np.array(b,dtype='float64')
V = np.array(V,dtype='float64')
print(b)
print(V)
end
numpy arrays
. do np1.do
. python:
python (type end to exit)
>>> from sfi import Matrix
>>> import numpy as np
>>> b = Matrix.get('e(b)')
>>> V = Matrix.get('e(V)')
>>> b = np.array(b,dtype='float64')
>>> V = np.array(V,dtype='float64')
>>> print(b)
[[21.28985507 3.4057971 ]]
>>> print(V)
[[0.49876447 0.03386276]
[0.03386276 0.0142024 ]]
>>> end
.
end of do-file
p1 = np.matmul(Vi,np.transpose(c))
fv = (1/2)*np.matmul(c,p1)
d1 = 2
d2 = Scalar.getValue('e(N)') - 1
p = f.sf(fv, d1, d2)
print(fv)
print(p)
end
.
end of do-file
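The fragment above rescales a Wald statistic into an F statistic (fv = w/2 with d1 = 2) and asks for an upper-tail probability; f.sf is presumably scipy.stats.f.sf, although the import is not shown. As a hedged, dependency-free sketch: when the numerator degrees of freedom equal 2, the F survival function has the closed form (1 + 2x/d2)^(-d2/2). The value w = 6.0 below is a placeholder, not a result from the talk, and the function name f_sf_q2 is made up; d2 = 69 - 1 follows e(N) in the mean example.

```python
def f_sf_q2(x, d2):
    """Upper-tail probability of an F(2, d2) variate; valid only when d1 = 2."""
    return (1.0 + 2.0 * x / d2) ** (-d2 / 2.0)


w = 6.0        # placeholder Wald statistic, not a result from the talk
d1 = 2         # number of restrictions, as in the fragment
d2 = 69 - 1    # e(N) - 1 for the mean example
fv = w / d1    # F statistic
p = f_sf_q2(fv, d2)
print(fv, p)
```

For any other number of restrictions the closed form does not apply, and a library routine such as the one used in the fragment is the practical choice.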
Part II
Back to Stata
2 Back to Stata
How to store stuff in Stata
• Store a dataset in variables
– scalar names and contents are global
• Store lists, string scalars and numeric scalars in macros
• Scope
– local: within a .do or .ado file
– global: anywhere in a current session
• See my blog post
Programming an estimation command in Stata: Where to store your stuff
https://t.co/TIJqkSScvc
for another introduction to macros
• Everywhere a punctuated macro name appears, its contents are substituted for the macro name.
– The names of local macros are punctuated by enclosing them between a single left quote (`) and a single right quote (')
– The names of global macros are punctuated by preceding them with a dollar sign ($).
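The substitution rule above can be imitated with a toy expander. This is a sketch of the idea, not of Stata's actual parser: the function name expand and the two dictionaries are invented for the illustration, nested references are resolved innermost first, and an undefined macro expands to an empty string.

```python
import re

LOCAL = re.compile(r"`([^`']*)'")   # innermost `name'
GLOBAL = re.compile(r"\$(\w+)")     # $name


def expand(text, local_macros, global_macros, max_passes=20):
    """Replace punctuated macro names with their contents until nothing changes."""
    for _ in range(max_passes):
        new = LOCAL.sub(lambda m: local_macros.get(m.group(1), ""), text)
        new = GLOBAL.sub(lambda m: global_macros.get(m.group(1), ""), new)
        if new == text:
            break
        text = new
    return text


print(expand("display `mylist'", {"mylist": "a b c"}, {}))
print(expand("regress $vlist", {}, {"vlist": "y x1 x2"}))
print(expand("summarize `x`i''", {"i": "2", "x2": "price"}, {}))
```

The last call shows why the order matters: `i' must become 2 before `x2' can be looked up.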
Examples of local macros
Levels of Stata
• The notion that there are levels of Stata can help explain the difference between global boxes and local boxes
• In the main do-file we define a global macro and then execute another do-file to do the work
• The work do-file can access the information stored in the global macro by the main do-file
*-------------------------------Begin globala.do ---------------
*! globala.do
* In this do-file we define the global macro vlist, but we
* do not use it
global vlist y x1 x2
do globalb
*-------------------------------End globala.do ---------------
*-------------------------------Begin globalb.do ---------------
*! globalb.do
* In this do-file, we use the global macro vlist, defined in globala.do
[Figure: memory for the interactive session (objects local to the interactive session) and memory for globalb.do (objects local to globalb.do)]
do localb
Local macros are local
. do locala
. *-------------------------------Begin locala.do ---------------
. *! locala.do
. local mylist "a b c"
. display "mylist contains |`mylist'|"
mylist contains |a b c|
.
. do localb
. *-------------------------------Begin localb.do ---------------
. *! localb.do
. local mylist "x y z"
. display "mylist contains |`mylist'|"
mylist contains |x y z|
. *-------------------------------End localb.do ---------------
.
end of do-file
.
. display "mylist contains |`mylist'|"
mylist contains |a b c|
. *-------------------------------End locala.do ---------------
.
.
end of do-file
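For readers who think in Python, the behavior in the log above resembles function-local variables. The analogy is loose and offered only as a sketch: each do-file is played by a function, and the function names locala and localb simply echo the do-file names.

```python
def localb():
    mylist = "x y z"  # local to localb, like the local macro in localb.do
    return f"mylist contains |{mylist}|"


def locala():
    mylist = "a b c"  # local to locala
    out = [f"mylist contains |{mylist}|"]
    out.append(localb())  # the callee's mylist does not touch ours
    out.append(f"mylist contains |{mylist}|")
    return out


print("\n".join(locala()))
```

As in the Stata log, the caller's mylist is unchanged after the nested call returns.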
[Figure: memory for the interactive session, holding objects local to the interactive session]
• For more about local versus global macros, see my blog post
Programming an estimation command in Stata: Global macros versus local macros
http://bit.ly/1ksYHrI
• Macro evaluation is recursive
forvalues
forvalues lname = #/# {
commands referring to `lname'
}
// forvalues.do
forvalues i = 1/3 {
display "i is now `i'"
}
. do forvalues
. // forvalues.do
. forvalues i = 1/3 {
2. display "i is now `i'"
3. }
i is now 1
i is now 2
i is now 3
.
end of do-file
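A Python counterpart of the forvalues loop above, shown only for comparison. The one point to watch is the range: 1/3 in Stata includes both endpoints, whereas range stops before its second argument.

```python
lines = []
for i in range(1, 3 + 1):  # forvalues i = 1/3 includes both 1 and 3
    lines.append(f"i is now {i}")

print("\n".join(lines))
```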
foreach
foreach lname in list {
commands referring to `lname'
}
foreach II
// foreach.do
local vlist y x1 x2
foreach v of local vlist {
display "v is now `v'"
}
. do foreach
. // foreach.do
. local vlist y x1 x2
. foreach v of local vlist {
2. display "v is now `v'"
3. }
v is now y
v is now x1
v is now x2
.
end of do-file
foreach III
// foreach2.do
local v "3"
display "v is now `v'"
local vlist y x1 x2
foreach v of local vlist {
display "v is now `v'"
}
display "v is now |`v'|"
. do foreach2
. // foreach2.do
. local v "3"
. display "v is now `v'"
v is now 3
. local vlist y x1 x2
. foreach v of local vlist {
2. display "v is now `v'"
3. }
v is now y
v is now x1
v is now x2
. display "v is now |`v'|"
v is now ||
.
end of do-file
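The log above shows that the local macro v is empty once foreach finishes, even though it held "3" before the loop. Python behaves differently, which matters when porting loops between the two languages. A minimal comparison, with the variable names chosen to match the do-file:

```python
vlist = "y x1 x2".split()
v = "3"

seen = []
for v in vlist:
    seen.append(f"v is now {v}")

# Unlike the Stata local macro, the Python loop variable survives the loop
# and keeps the last value it was assigned.
print(f"v is now |{v}|")
```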
if
1. command if exp
restricts the sample to those observations for which exp is true, and command works on the restricted sample
. poisson accidents traffic tickets if male==1
2. In do files and ado files,
if exp { commands }
will only execute commands if exp is true
if II
. local test 0
. if `test' < 1 {
. display "expression is true!"
expression is true!
. }
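The two meanings of if can be contrasted in a short Python sketch. The data are made up for the illustration: the if qualifier evaluates the expression once per observation and filters rows, while the if command evaluates it once and either runs the block or skips it.

```python
# Made-up observations, one dict per row
rows = [
    {"male": 1, "accidents": 2},
    {"male": 0, "accidents": 5},
    {"male": 1, "accidents": 0},
]

# Like `command if male==1`: the condition is checked for every observation
sample = [r for r in rows if r["male"] == 1]
total = sum(r["accidents"] for r in sample)

# Like `if exp { commands }`: the condition is checked once
test = 0
message = None
if test < 1:
    message = "expression is true!"

print(len(sample), total, message)
```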
. do meanb
. // version 1.0.0 09Jun2019 (This comment is ignored by Stata)
. version 15 // version #.# fixes the version of Stata
. sysuse auto
(1978 Automobile Data)
. summarize price
Variable Obs Mean Std. Dev. Min Max
r(sum_w) = 74
r(mean) = 6165.256756756757
r(Var) = 8699525.974268788
r(sd) = 2949.495884768919
r(min) = 3291
r(max) = 15906
r(sum) = 456229
.
end of do-file
• We can use summarize to compute the sample-average estimator for the mean and its standard error
. do meanc
. // version 1.0.0 09Jun2019
. version 15
. sysuse auto
(1978 Automobile Data)
. quietly summarize price
. return list
scalars:
r(N) = 74
r(sum_w) = 74
r(mean) = 6165.256756756757
r(Var) = 8699525.974268788
r(sd) = 2949.495884768919
r(min) = 3291
r(max) = 15906
r(sum) = 456229
. local sum = r(sum)
. local N = r(N)
. local mu = (1/`N')*`sum'
. generate double e2 = (price - `mu')^2
. quietly summarize e2
. local V = (1/((`N')*(`N'-1)))*r(sum)
. display "muhat = " `mu'
muhat = 6165.2568
. display "sqrt(V) = " sqrt(`V')
sqrt(V) = 342.87193
. mean price
Mean estimation Number of obs = 74
.
end of do-file
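The arithmetic in meanc.do can be written as a few lines of plain Python. This is a sketch that follows the do-file step by step; the function name mean_and_vce and the four data values are made up for the illustration.

```python
def mean_and_vce(x):
    """Sample average and the estimated variance of that average."""
    n = len(x)
    total = sum(x)                       # r(sum) after the first summarize
    mu = (1 / n) * total
    e2 = [(xi - mu) ** 2 for xi in x]    # generate double e2 = (price - mu)^2
    v = (1 / (n * (n - 1))) * sum(e2)    # r(sum) of e2, scaled by 1/(N(N-1))
    return mu, v


mu, v = mean_and_vce([1.0, 2.0, 3.0, 4.0])  # made-up data
print(mu, v ** 0.5)
```

The square root of v is the standard error that the do-file displays as sqrt(V).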
• Standard options
– noconstant
– vce(robust)
– vce(cluster clustervar)
– level(#)
• Maximize options
– iterate(#)
– from(init_spec)
– nrtolerance(#)
– constraints(numlist)
– help file
– Stata Journal article
end
An ado that always computes the same thing
File mymean2/mymean.ado
*! version 2.0.0 09Jun2019
program define mymean
version 15
syntax varlist
display "varlist contains `varlist'"
quietly summarize `varlist'
local sum = r(sum)
local N = r(N)
local mu = (1/`N')*`sum'
capture drop e2
generate double e2 = (`varlist' - `mu')^2
quietly summarize e2
local V = (1/((`N'-1)*(`N')))*r(sum)
display "muhat = " `mu'
display "sqrt(V) = " sqrt(`V')
end
. quietly cd mymean3
. program drop mymean
. mymean price
varlist contains price
muhat = 6165.2568
sqrt(V) = 342.87193
. quietly cd ..
Put results into matrices b and V
File mymean5/mymean.ado
*! version 5.0.0 09Jun2019
program define mymean
version 15
syntax varlist
matrix list b
matrix list V
end
• syntax allows you to specify extensions or restrictions on the type of varlist allowed
– You can extend the default to allow for time-series or factor-variable operators
– You can restrict the number or type of variables allowed
• In the case at hand, we want to restrict the variables to be numeric and we want only one variable specified
Restrict varlist
File mymean5a/mymean.ado
*! version 5.1.0 09Jun2019
program define mymean
version 15
matrix list b
matrix list V
end
mymean now produces
. quietly cd mymean5a
. program drop mymean
. capture noisily mymean price mpg
too many variables specified
. capture noisily mymean make
string variables not allowed in varlist;
make is a string variable
. mymean price
symmetric b[1,1]
mu
r1 6165.2568
symmetric V[1,1]
mu
mu 117561.16
. quietly cd ..
Tempnames
• Recall that variable, matrix and scalar names are global in Stata
• This implies that our current version of mymean is unsafe: it overwrites any existing matrices named b and V and drops any existing variable named e2
• The Stata command tempvar creates a list of local macros, each of which contains a name that is not used
elsewhere
File mymean6/mymean.ado
*! version 6.0.0 09Jun2019
program define mymean
version 15
tempname b V
tempvar e2
Example of mymean with tempnames
. quietly cd mymean6
. program drop mymean
. mymean price
symmetric __000000[1,1]
mu
r1 6165.2568
symmetric __000001[1,1]
mu
mu 117561.16
. quietly cd ..
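• The program is now safe: it uses only temporary names
• But the temporary objects disappear when the program ends, so we cannot access our results
• We need to store the results somewhere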
e-class commands
• e-class commands return
– e(b), the vector of parameter estimates
– e(V), the VCE of e(b)
– e(sample), a function that equals 1 if the observation is part of the estimation sample and 0 otherwise.
– e(N), the number of observations in the sample
File mymean7/mymean.ado
*! version 7.0.0 09Jun2019
program define mymean, eclass
version 15
tempname b V
tempvar e2
eclass version of mymean
. quietly cd mymean7
. program drop mymean
. mymean price
          Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
mu     6165.257   342.8719    17.98   0.000    5493.24   6837.273
. ereturn list
scalars:
e(N) = 74
macros:
e(properties) : "b V"
matrices:
e(b) : 1 x 1
e(V) : 1 x 1
. quietly cd ..
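Once e(b) and e(V) are posted, the remaining columns of the coefficient table follow from them. The sketch below recomputes the row for mu in plain Python from the values shown in the earlier matrix listings; it assumes a normal reference distribution, as the z column in the table suggests, and should come out close to the displayed row.

```python
from statistics import NormalDist

b = 6165.2568   # e(b)[1,1], copied from the matrix listing
V = 117561.16   # e(V)[1,1], copied from the matrix listing
level = 95      # default confidence level

se = V ** 0.5
z = b / se
p = 2 * (1 - NormalDist().cdf(abs(z)))                   # two-sided p-value
crit = NormalDist().inv_cdf(1 - (1 - level / 100) / 2)   # about 1.96 for 95%
lower, upper = b - crit * se, b + crit * se

print(f"{b:.3f} {se:.4f} {z:.2f} {p:.3f} {lower:.2f} {upper:.3f}")
```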
tempname b V
tempvar e2
quietly summarize `varlist' if `touse'==1
local sum = r(sum)
local N = r(N)
matrix `b' = (1/`N')*`sum'
matrix colnames `b' = mu
generate double `e2' = (`varlist' - `b'[1,1])^2 if `touse'==1
quietly summarize `e2' if `touse'==1
matrix `V' = (1/((`N')*(`N'-1)))*r(sum)
matrix colnames `V' = mu
matrix rownames `V' = mu
Part III
Python Stata programming
5 Python Stata programming
*! version 1.0.0
program define pmean, eclass
version 16.0
tempname b
matrix `b' = (1, 2, 3)
python: MyMeanWork("`b'")
matrix list `b'
end
version 16.0
python:
from sfi import Matrix
import numpy as np
def MyMeanWork(bname):
    b = Matrix.get(bname)
    b = np.array(b,dtype='float64')
    b = b*b
    Matrix.store(bname, b)
end
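The design in this first version is that Stata passes the name of a matrix, and the Python function reads, changes, and writes back the matrix stored under that name. The sketch below imitates the round trip outside Stata. The dictionary stata_matrices is a hypothetical stand-in for Stata's matrix storage, and the key "__000000" mimics a temporary name; neither is part of the sfi module.

```python
# Hypothetical stand-in for Stata's matrix storage, keyed by matrix name
stata_matrices = {"__000000": [[1.0, 2.0, 3.0]]}


def my_mean_work(bname):
    b = stata_matrices[bname]                  # plays the role of Matrix.get(bname)
    b = [[x * x for x in row] for row in b]    # elementwise square, like b*b on a NumPy array
    stata_matrices[bname] = b                  # plays the role of Matrix.store(bname, b)


my_mean_work("__000000")
print(stata_matrices["__000000"])
```

Note that b*b on NumPy arrays squares each element; it is not a matrix product.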
*! version 2.0.0
// compute mean using python
program define pmean, eclass
version 16.0
tempname b v N
python: MyMeanWork("`varlist'", "`touse'", "`b'", "`v'", "`N'")
ereturn post `b' `v', esample(`touse')
ereturn scalar N = scalar(`N')
ereturn scalar df_r = scalar(`N')-1
ereturn display
end
version 16.0
python:
import numpy as np
Matrix.create(bname, 1, p, Missing.getValue())
Matrix.create(vname, p, p, Missing.getValue())
m = np.mean(data, axis=0)
E = data - m
Ep = np.transpose(E)
E2 = np.matmul(Ep,E)
E2 = (1/n)*(1/(n-1))*E2
Matrix.store(bname, m)
Matrix.setRowNames(bname,['mean'])
Matrix.setColNames(bname,vlist)
Matrix.store(vname, E2)
Matrix.setRowNames(vname,vlist)
Matrix.setColNames(vname,vlist)
Scalar.setValue(nname,n)
end
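The numerical core of the work function (column means, deviations, and E'E scaled by 1/(n(n-1))) can be checked outside Stata. The sketch below repeats those steps with plain lists instead of NumPy so that it has no dependencies; the function name mean_vce and the three data rows are made up for the illustration.

```python
def mean_vce(data):
    """data is a list of n rows, each a list of p numbers."""
    n, p = len(data), len(data[0])
    m = [sum(row[j] for row in data) / n for j in range(p)]    # np.mean(data, axis=0)
    E = [[row[j] - m[j] for j in range(p)] for row in data]    # data - m
    # (1/n) * (1/(n-1)) * E'E
    V = [[sum(e[j] * e[k] for e in E) / (n * (n - 1)) for k in range(p)]
         for j in range(p)]
    return m, V, n


m, V, n = mean_vce([[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]])  # made-up data
print(m)
print(V)
```

The off-diagonal elements of V are the estimated covariances between the sample means, which is what a joint Wald test of several means needs.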
*! version 3.0.0
// compute mean using python
program define pmean, eclass
version 16.0
tempname b v N
python: MyMeanWork("`varlist'", "`touse'", "`b'", "`v'", "`N'")
ereturn post `b' `v', esample(`touse')
ereturn scalar N = scalar(`N')
ereturn scalar df_r = scalar(`N')-1
ereturn display
end
version 16.0
python:
if p < 1:
    SFIToolkit.errprint('Bad varlist in work function')
    SFIToolkit.exit(498)
if data.ndim != 2:
    SFIToolkit.errprint('Bad array in work function')
    SFIToolkit.exit(498)
shape = data.shape
n = shape[0]
cols = shape[1]
if cols != p:
    SFIToolkit.errprint('Wrong number of columns in work function data')
    SFIToolkit.exit(498)
Matrix.create(bname, 1, p, Missing.getValue())
Matrix.create(vname, p, p, Missing.getValue())
m = np.mean(data, axis=0)
E = data - m
Ep = np.transpose(E)
E2 = np.matmul(Ep,E)
E2 = (1/n)*(1/(n-1))*E2
Matrix.store(bname, m)
Matrix.setRowNames(bname,['mean'])
Matrix.setColNames(bname,vlist)
Matrix.store(vname, E2)
Matrix.setRowNames(vname,vlist)
Matrix.setColNames(vname,vlist)
Scalar.setValue(nname,n)
end
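The three checks added in version 3 can be exercised outside Stata with a small stand-in. This is a sketch under stated assumptions: raising ValueError replaces the SFIToolkit.errprint and SFIToolkit.exit(498) calls, a list of row lists replaces the NumPy array, and the function name check_work_inputs and the two data rows are made up.

```python
def check_work_inputs(data, p):
    """Return (n, cols) or raise ValueError, mirroring the three checks above."""
    if p < 1:
        raise ValueError("Bad varlist in work function")
    # Stand-in for data.ndim != 2: require a list of rows, each itself a list
    if not isinstance(data, list) or not all(isinstance(r, list) for r in data):
        raise ValueError("Bad array in work function")
    n = len(data)
    cols = len(data[0]) if n > 0 else 0
    if cols != p:
        raise ValueError("Wrong number of columns in work function data")
    return n, cols


n, cols = check_work_inputs([[4099.0, 22.0], [4749.0, 17.0]], 2)  # illustrative rows
print(n, cols)
```

Failing early with a clear message keeps a bad varlist from surfacing later as an obscure matrix-dimension error.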