Data Analysis With STATA - Sample Chapter
Prasad Kothari
This book is for all the professionals and students who want
to learn Stata programming and apply predictive modeling
concepts. This book is also very helpful for experienced
Stata programmers as it introduces advanced statistical
modeling concepts and shows how to apply them.
Explore the big data field and learn how to perform data analytics
and predictive modeling in Stata
organizations such as Merck, Sanofi Aventis, Freddie Mac, Fractal Analytics, and
the National Institute of Health on various analytics and big data projects. He has
published various research papers in the American Journal of Drug and Alcohol
Abuse and American Public Health Association.
Prasad is an industrial engineer from V.J.T.I. and has done his MS in management
information systems from the University of Arizona. He works closely with
different labs at MIT on digital analytics projects and research.
He has worked extensively on many statistical tools, such as R, Stata, SAS, SPSS, and
Python. His leadership and analytics skills have been pivotal in setting up analytics
practices for various organizations and helping them in growing across the globe.
Prasad set up a fraud investigation team at Freddie Mac, which is a world-renowned
team, and has been known in the fraud-detection industry as a pioneer in cutting-edge
analytical techniques. He also set up a sales forecasting team at Merck and Sanofi
Aventis and helped these pharmaceutical companies discover new groundbreaking
analytical techniques for drug discovery and clinical trials. Prasad also worked with
the US government (the healthcare department at NIH) and consulted them on various
healthcare analytics projects. He played a pivotal role in ObamaCare.
You can find out about healthcare social media management and analytics at
http://www.amazon.in/Healthcare-Social-Media-Management-Analytics-ebook/dp/B00VPZFOGE/ref=sr_1_1?s=digital-text&ie=UTF8&qid=1439376295&sr=1-1.
Preface
This book covers data management, visualization of graphs, and programming
in Stata. Starting with an introduction to Stata and data analytics, you'll move on
to Stata programming and data management. The book also takes you through
data visualization and all the important statistical tests in Stata. Linear and logistic
regression in Stata is covered as well. As you progress, you will explore a few
analyses, including survey analysis, time series analysis, and survival analysis in
Stata. You'll also discover different types of statistical modeling techniques and
learn how to implement these techniques in Stata. This book will be provided with
a code bundle, but the readers would have to build their own datasets as they
proceed with the chapters.
Chapter 4, Important Statistical Tests in Stata, discusses how statistical tests, such as
t-tests, chi square tests, ANOVA, MANOVA, and Fisher's test, are significant in
terms of the model-building exercise. The more tests you conduct on the given
data, the better an understanding you will have of the data, and you can check
how different variables interact with each other in the data.
Chapter 5, Linear Regression in Stata, teaches you linear regression methods and their
assumptions. You also get a review of all the nitty-gritty, such as multicollinearity,
homoscedasticity, and so on.
Chapter 6, Logistic Regression in Stata, covers how to build a logistic regression model
and what the best business situations in which such a model can be applied are. It
also teaches you the theory and application aspects of logistic regression.
Chapter 7, Survey Analysis in Stata, teaches you different sampling concepts and
methods. You also learn how to implement these methods in Stata and how to apply
statistical modeling concepts, such as regression to the survey data.
Chapter 8, Time Series Analysis in Stata, covers time series concepts, such as seasonality,
cyclic behavior of the data, and autoregression and moving averages methods. You
also learn how to apply these concepts in Stata and how to conduct various statistical
tests to make sure that the time series analysis that you performed is correct.
Chapter 9, Survival Analysis in Stata, teaches survival analysis and different statistical
concepts associated with it in detail.
Data visualization: After the data preparation, we need to visualize the data
for the following:
Linear regression in Stata: Once done with the hypothesis testing, there
is always a business need to predict one of the variables, such as what the
revenue of the financial organization will be in specific conditions, and so on.
These predictions about continuous variables, such as revenue, the default
amount on a credit card, and the number of items sold in a given store, come
through linear regression. Linear regression is the most basic and widely
used prediction methodology. We will go into details of linear regression in a
later chapter.
Chapter 1
Survival analysis in Stata: These days, lots of customers attrite from telecom
plans, healthcare plans, and so on, and join the competitors. When you
need to develop a churn model or attrition model to check who will attrite,
survival analysis is the best model.
The preceding diagram represents the Stata layout. Stata comes in four flavors:
multiprocessor (for two or four processors), special edition, intercooled, and
small. The multiprocessor flavor is the most efficient of these. All flavors
function in a similar fashion; what increases from one flavor to the next is the
number of variables and regressors that can be handled. At present, Stata version 11
is in demand and is being used on various computers. Stata is command-driven
software. Newer versions also offer menus and a search facility; however, typing
commands is the simplest and quickest way to learn Stata. The more you type
commands, the better your understanding becomes, and programming for analytics
becomes easier. When it is difficult to find the exact syntax of a command, it is
advisable to use the menus; you can then copy the command that the menu generates
for further use. There are three ways to enter commands, as follows:
Use the do-file program. This is a type of program in which one has to inform
the computer (through a command) that it needs to use the do-file type.
Though all the three types discussed in the preceding bullets are used, the do-file
type is the most frequently used one. The reason is that for a bigger file, it is faster as
compared to manual typing. Secondly, it can store the data and keep it in the same
format in which it was stored. Suppose you make a mistake and want to rectify it;
what would you do? In this case, the do-file is useful; one can correct it and run
the program again. Generally, an interactive command is used to find out the
problem and later on, a do-file is used to solve it. The following is an example of
an interactive command:
Matrices
Macros
Matrices should be used carefully as they consume more memory than variables,
so Stata might run low on memory before the real work has started.
Another form is macros; these are similar to variables in other programming
languages. They are named containers that can hold information of any type.
There are two flavors of macros: local (temporary) and global. Global macros
are flexible and easy to manage; once defined, they can be accessed from any
command for the rest of the Stata session. Local macros, on the other hand, are
temporary objects that exist only in the environment in which they are created
and cannot be used elsewhere. For example, a local macro defined in a do-file
exists only while that do-file is running.
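The difference between the two flavors of macros can be sketched in a few lines. This is a minimal illustration, not code from the book; the macro names year and datadir and the folder E:/Stata1 are assumptions chosen for the example. Run the block as a whole from a do-file, because a local macro defined in one run of the do-file editor is gone in the next.

```stata
* Local macro: visible only inside the do-file or program that defines it
local year 2010
display "Year: `year'"

* Global macro: visible for the rest of the Stata session
* (datadir and its value are hypothetical)
global datadir "E:/Stata1"
display "Data directory: $datadir"
```

A local macro is referenced with a left quote and an apostrophe around its name, whereas a global macro is referenced with a dollar sign.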
Dos
Linux
Unix
For example, if you need to change the directory, you can use the cd command,
as follows:
cd "C:\Stataforlder"
You can also create a new directory inside the directory you are currently
using. For example:
mkdir "newstata"
You can use the dir command to list the contents of the directory. If you need
the name of the current directory, use the pwd command (in Stata for Windows,
cd typed without arguments does the same).
Paths in Stata can be written in two ways: absolute and relative. An absolute
path contains the full address of a folder or file, starting from the drive. The
cd command in the earlier example used an absolute path. A relative path, by
contrast, gives the location relative to the current working directory; the
mkdir "newstata" example used a relative path. The following mkdir example
uses an absolute path instead:
mkdir "E:\Stata\Stata1"
Relative paths are beneficial, especially when working on different devices,
such as a PC at home, a library computer, or a server. To separate folders,
Windows and Dos use a backslash (\), whereas Linux and Unix use a slash (/).
These differences can be troublesome when working on a server where Stata is
installed. As a general rule, it is advisable to use slashes in paths, as Stata
understands a slash as a separator on every platform. The following is an
example of this:
mkdir "/Stata1/Data"
This is how you create a new folder for your Stata work.
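The directory commands discussed above can be combined into a short session. This is a sketch under the assumption that you have write access to the current working directory; the folder name newstata is the one used in the text, and capture is added only so that the block does not stop if the folder already exists.

```stata
* Show the current working directory
pwd

* Create a folder inside the current directory (relative path);
* capture suppresses the error if the folder already exists
capture mkdir "newstata"

* Move into the new folder and confirm the change
cd "newstata"
pwd

* Go back up one level and list the contents of the directory
cd ..
dir
```

Because only relative paths are used, the same lines work unchanged on Windows, Linux, and Unix.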
less India pwt 80-2010.dta, clear
The clear option at the end of the command removes any dataset currently in
memory so that Stata can open the new data file.
use country year using "t1 less India pwt 80-2010.dta", clear
Insheet
Before Excel data can be read with insheet, it has to be converted into a format
other than Excel. Save the data in one of the following formats:
Excel
You need to follow certain rules while preparing the file for Stata:
The first row of the Excel file should contain the variable names or headers
(series/code/names), and the data must start in the second row. Any title
rows above the variable names must be removed before saving the file.
Stata reads everything in the file; therefore, any additional lines below or to
the right of the data, for example, footnotes or endnotes, should be deleted
before saving it. If necessary, delete the entire bottom rows or the columns on
the right-hand side.
Variable names should not begin with a number. A problem might therefore
occur when the file is arranged with years (1980, 1985) in the top row. In such
cases, place an underscore before the numbers; this can be done by selecting
the row in the spreadsheet package and using the find-and-replace tool, so
that, for example, 1980 becomes _1980, and so on.
Most importantly, delete the commas from within the data (thousands
separators, for example); in a comma-separated file, Stata would otherwise be
unable to tell where one column ends and the next begins. You can do this with
the find-and-replace tool.
Notations such as double dots (..) or hyphens (-) used for missing values might
cause trouble; Stata reads a single dot (.) as a missing value, but reads double
dots or hyphens as text.
After saving the data in the CSV format, it can be read in Stata, as shown in the
following code snippet:
insheet using "E:\Stata1\t1 less India pwt 80-2010.txt", clear
If the working directory has already been changed with the cd command, the file
can be read using just its name, as follows:
insheet using "t1 less India pwt 80-2010.txt", clear
The insheet command accepts several options. Options modify the behavior of a
standard command; they are added at the end of the command, after a comma.
The following are some of the options used with insheet:
The clear option: This lets insheet read a new file regardless of the data
currently in memory:
insheet using "E:\Stata1\t1 less India pwt 80-2010.txt", clear
The names option: This tells Stata that the first row of the file contains the
variable names. Stata usually detects this automatically; if it does not,
specify the option explicitly. An example is as follows:
insheet using "E:\Stata1 classes\t1 less India pwt 80-2010.txt", names clear
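To try the options described above without preparing a spreadsheet first, the following sketch writes a small CSV file from the auto dataset that ships with Stata and reads it back. The file name auto_sample.csv is an assumption made for the example. Note that in Stata 13 and later, import delimited supersedes insheet, although insheet continues to work.

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* Write three variables to a comma-separated file in the working directory
* (auto_sample.csv is a hypothetical file name)
outsheet make price mpg using "auto_sample.csv", comma replace

* Read the file back; names treats the first row as variable names,
* and clear discards the data currently in memory
insheet using "auto_sample.csv", comma names clear

* Check what was read
describe
```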
Infix
Along with insheet, you can use the infix command, as shown later.
Most of the time, CSV or tab-delimited datasets are used, but the fixed-width
ASCII format is still used to store older data. Let's take the example of a survey
taken by the government. This example represents two lines from 2010:
10862226023331 06 022 3 02220155500666600777000003331
10001222228332 06 022 3 02555553006666000000000044441
A codebook or data dictionary usually comes in the PDF or text file format. It
explains the layout of the data: here, the first two digits are the row ID, the
next two digits are the survey year (2010 for the previously mentioned dataset),
and the fifth digit is the quarter of the interview (the first quarter in this
case), among other things. The infix command is required to read this type of
data; it passes the information from the codebook to Stata. The following is an
example:
infix rowtype 1-2 yr 3-4 quart 5 [] using "E:\Stata1\Survey2010.dat", clear
When many files have to be read, it is convenient to use a dictionary file, which
stores the codebook information in a separate file. The file looks as follows:
infix dictionary using Survey2010.dat
{
dta
rowtype 1-2
yr 3-4 quart 5 []
}
Save this file as Survey2010.dct and then run the infix command on it. As a
relative path is used in the dictionary file, Stata assumes that the raw data is
in the same folder as the dictionary file. This being the case, the infix command
does not need to refer to the data file. The command will look like this:
infix using "H:\ECStata\NHIS1986.dct", clear
Defining and constituting a dictionary file in a proper way is a tedious job. However,
NHIS has a dictionary that can be read through the SAS program; this can be
converted into Stata using the Stat/Transfer program.
Country   Countrycode  Yr    Pops     Cccgdp    Openss
India     IND          2010  23452.9  10897.23  23.11111
U.S.      USA          2010  22222.1  23987.23  90.42231
Pakistan  PAK          2010  11111.2  23675.21  10.22291
China     CHN          2010  98765    97654.94  30.98765
Russia    RUS          2010  19876    65745.11  43.34343
Germany   GER          2010  23467    23874.35  23.74747
After importing the data in Stata, it is always a good practice to examine the data.
It gives you an advantage in any modeling or visualization exercise.
countrycode  yr    pops
IND          2010  23452.9
USA          2010  22222.1
PAK          2010  11111.2
CHN          2010  98765
RUS          2010  19876
GER          2010  23467
In the command that produced the preceding table, the star (*) is a wildcard; it
instructs Stata to include every variable whose name begins with country.
Alternatively, we could include all variables but list only a limited number of
observations, for example, the observations from the 14th to the 19th row
(in 14/19).
The following table shows the country, country code, year, and the other
variables for observations 14/19:
Country   Countrycode  Yr    Popscon  Cccgdps   kOpenss
India     IND          2010  23452.9  10897.23  23.11111
U.S.      USA          2010  22222.1  23987.23  90.42231
Pakistan  PAK          2010  11111.2  23675.21  10.22291
China     CHN          2010  98765    97654.94  30.98765
Russia    RUS          2010  19876    65745.11  43.34343
Germany   GER          2010  23467    23874.35  23.74747
The third command lists observations from 30 till the last observation.
The if qualifier is the other way of subsetting data; its condition evaluates to
true or false for each observation, and only the observations for which it is
true are used. The following example lists the observations for the year 2010,
where the variable name is yr:
list if yr == 2010
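The two ways of subsetting, by observation range with in and by condition with if, can be tried on the auto dataset that ships with Stata. This is an illustrative sketch; the variables make, price, and foreign belong to that example dataset and not to the country data used in this chapter.

```stata
* Load the example dataset that ships with Stata (74 observations)
sysuse auto, clear

* List two variables for observations 14 to 19
list make price in 14/19

* List from observation 70 to the last one (lowercase l means last)
list make price in 70/l

* List only the observations that satisfy a condition
list make price if foreign == 1 & price > 10000
```

The in and if qualifiers can also be combined in a single command when both a range and a condition are needed.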
In order to examine the raw data, the browse window is used. However, in big
datasets, it is impractical to view every variable at once. In this case, name
the variables you want to examine after the browse command, as follows:
browse country yr popscon
It is important to note that the edit command, unlike browse, allows you to change
the dataset manually. The assert command makes Stata check a statement against
every observation. This matters because when bigger data (or big data, as it is
called in today's world) arrives, checking individual values through the browse or
edit commands becomes difficult. In this case, the assert command is helpful: it
identifies whether a statement about the data is right or wrong. For example, in
the case of the population of the country (popscon), the first of the following
commands checks that all values are positive, and the second checks that all
values are negative:
assert popscon>0
assert popscon<0
If the statement is true for every observation, assert does not give any
output. However, if the statement is false for any observation, an error
message will appear.
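The behavior of assert can be seen on the auto dataset that ships with Stata. This is a sketch, not code from the book; capture is used here only so that the failing assertion does not stop a do-file, and _rc holds the return code of the captured command.

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* True for every observation, so assert prints nothing
assert price > 0

* False for every observation; capture traps the error so the run continues
capture assert price < 0

* A nonzero return code means the assertion failed
display "Return code: " _rc
```

Placing assert statements at the top of a do-file is a common way to stop an analysis early when the data does not look as expected.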
The describe command reports fundamental information about datasets and
variables, such as the total size of the dataset, the total number of variables
and observations, and the storage types and display formats of the variables.
Typed on its own, describe reports on the data in memory. With using, it can
also be applied to a file that has not been read into Stata. An example is given
as follows:
describe using "E:\Ind-Health-sample.dta"
The codebook command gives detailed information on the variables in the
dataset; without a list of variables, it reports on all of them. An example for a
single variable is codebook country.
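The following sketch shows describe on data in memory, describe using on a file that has not been loaded, and codebook for a single variable. It relies on the auto dataset that ships with Stata; the file name auto_copy.dta is an assumption made for the example.

```stata
* Load the example dataset and describe the data in memory
sysuse auto, clear
describe

* Save a copy, empty the memory, and describe the file on disk
* (auto_copy.dta is a hypothetical file name)
save "auto_copy.dta", replace
clear
describe using "auto_copy.dta"

* Reload the copy and inspect one variable in detail
use "auto_copy.dta", clear
codebook price
```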
The summarize command delivers summary statistics: mean, standard
deviation, and so on. The following table shows its output:
Variable     Obs  Mean      Std. Dev.  Min        Max
Cntry
countrycode
Yr           97   2000      2.156      1990       2010
Popscon      97   87634.46  8374.33    29383.9    93830
ccCgdps      97   67544.23  4100.682   15890.71   98739.67
kOpenss      97   34        13         50
Chi-ppl      97   23.6      3.56       10.456     40.8796
Fdhsa        97   19.56     9.567      12.456     34.98765
Gdkliyu      97   1.987456  1.2        -3.238917  6.46896
As we can see in the preceding table, string variables such as Cntry and
countrycode do not hold numbers; this is why no summary statistics are available
for them. Yr is a numeric variable; therefore, it has summary statistics.
For more details, the detail option of summarize can be used.
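As a short illustration of the detail option, the following sketch uses the auto dataset that ships with Stata. The variable price belongs to that example dataset; r(p50) is one of the results that summarize stores when detail is specified.

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* Basic summary statistics for all variables
summarize

* Percentiles, variance, skewness, and kurtosis for one variable
summarize price, detail

* Stored results can be reused, for example the median
display "Median price: " r(p50)
```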
The wide range of graphics capabilities makes Stata a unique tool. One can easily
get help by typing the help command (for example, help graph) in Stata. A
histogram can be created through the following command:
graph twoway histogram cccgdps
Although Stata's advanced graphs have their benefits, they are slow to draw.
When you only want to visualize the data quickly, without using the graphs in
papers or presentations, it can be better to use version 7 graphics. This can be
seen as follows:
graph7 cccgdps popscon
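A histogram can be tried out on the auto dataset that ships with Stata. This is a minimal sketch; the variable price and the title text are assumptions chosen for the example.

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* Histogram with counts on the vertical axis instead of densities
histogram price, frequency title("Distribution of price")

* The same plot through the twoway family used in the text
graph twoway histogram price, frequency
```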
If a file with the same name already exists, the replace option is needed when
saving. It overwrites the previous version with the current data. If the old
version is to be kept for some reason, then save the data under a different name.
Keep in mind that the contents of the original file are changed if the revised
dataset is saved over it. If, on the other hand, you have made changes and want
to start again from the saved version, just reopen the file.
There are two ways to preserve the data while you experiment. One option is to
save the current data, revise it, and later, if you don't want to keep the
changes, reopen the saved version. Another option is to use the preserve and
restore commands; preserve takes a snapshot of the data, and the data comes back
to that state after you type restore.
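The preserve and restore pair can be sketched as follows, using the auto dataset that ships with Stata. The subset chosen here (foreign cars) is only an illustration.

```stata
* Load the example dataset that ships with Stata (74 observations)
sysuse auto, clear

* Take a snapshot of the data in memory
preserve

* Work on a subset; this drops the other observations from memory
keep if foreign == 1
summarize price

* Bring back the full dataset as it was at preserve
restore
count
```

After restore, count reports the full number of observations again, which confirms that the subset was only temporary.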
Summary
We discussed lots of basic commands, which can be leveraged while performing
Stata programming. The next chapter will discuss data management techniques and
programming in detail. This chapter is basic and will help any beginner-level Stata
programmer start working on Stata.
As you learn more about Stata, you will understand the various commands and
functions and their business applications.