Data Analysis Using STATA Software
- Prepared by
Fazlur Rahman
Lecturer
Department of Accounting and Information Systems,
Jashore University of Science and Technology,
Jashore-7408, Bangladesh.
E-mail: [email protected]
Introduction to STATA
Description of STATA interface: The STATA interface contains five different types of windows.
1. Results window: The Results window is the main and largest window of the STATA interface. All output from STATA commands is displayed here (excluding graphs, which are displayed in their own window).
2. Command window: This window is used to type and execute STATA commands. It is located below the Results window, at the bottom middle of the STATA interface.
3. Review window: All executed STATA commands are recorded in this window, which is located on the left side of the STATA interface. You can reuse a previous command by double-clicking it in the Review window or by pressing the Page Up key on your keyboard.
4. Variables window: When a dataset is loaded or imported into STATA, the variable names and their labels are displayed in the Variables window in the upper right corner of the interface. Clicking the arrow on the left side of a variable transfers that variable to the command line in the Command window.
5. Properties window: This window is located in the lower right corner of the STATA interface and displays the properties of the variable selected in the Variables window. The properties of the dataset are displayed in the lower portion of the Properties window.
[Figure: the STATA interface, with the Variables window, Review window, Properties window, Command window, and working directory labelled.]
Ways of Operating STATA
There are three modes available for operating or running STATA software:
1. Interactive mode: In this mode, commands are typed directly in the Command window and executed by pressing the Enter key.
2. Batch mode: In this mode, commands are typed in a separate file called a do-file, and all commands are executed together in one step. The do-file can be saved for later use and can be executed on other computers as well. A do-file can be created by clicking the do-file editor icon in the toolbar of the STATA interface or by executing the doedit command in the Command window. The commands in the do-file are executed by clicking the execute icon in the toolbar of the do-file editor.
3. Drop-down menu: STATA software can also be operated through its drop-down menus.
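As a rough illustration of batch mode, the commands below could be saved in a do-file and run in one step. The file name first_steps.do is hypothetical, and auto is the example dataset that ships with STATA; both are assumptions used only for illustration.

```stata
* first_steps.do (hypothetical file name)
* Load the example dataset that ships with STATA
sysuse auto, clear
* Show the number of observations and variables
describe, short
* Summarize two numeric variables
summarize price mpg
```

Typing doedit first_steps.do opens the file in the do-file editor, and do first_steps.do runs it from the Command window. The same lines can also be typed one at a time in interactive mode.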
Directory: A directory is a location for storing files on the computer. In STATA, the working directory is the folder used to store data and other associated files. Before running further commands, the working directory should be identified as the first step. It is shown at the bottom left corner of the STATA interface (below the Review window). It is often necessary to change the working directory to the folder that holds our files before importing or exporting data. If we do not change it, we must type the full path of the folder every time we import or export data or perform other file operations.
Log files: A log file works like STATA's built-in recorder: it preserves the STATA commands and their results. In other words, everything that passes through the Results window can be recorded in the log file. It is therefore important to create a log file after setting the working directory, especially when we are not using a do-file to run the commands. In short, when we want to preserve our work (commands and results), we create a log file; when we want to preserve the STATA code only, we create and save a do-file.
STATA is case sensitive; therefore, command keywords and variable names must be typed exactly as they are defined, with the same upper and lower case. In the following table, red italic text in a STATA command is a placeholder that should be changed to suit the requirements of the user or procedure, while black text must remain unchanged for the command to run correctly.
STATA code                        Explanation of the STATA code
pwd                               Shows the current working directory, i.e., the folder from which files are imported and to which they are exported.
cd "Folder path"                  Changes the working directory to Folder path, for example cd "D:\AIS 4202".
log using filename.log            Creates a log file named filename in the working directory; all results of subsequent work are stored in it automatically.
log close                         Closes the open log file; results produced after this command are not saved in the log file.
log using filename.log, append    Reopens the existing log file (filename) in the working directory after it has been closed and adds further output to it.
dir                               Lists all the files in the working directory.
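The commands in the table above might be combined in a short session like the following sketch. The log name session1 is hypothetical, the cd line is left as a comment because the folder must exist on your computer, and the auto dataset shipped with STATA is used only to produce some output to record.

```stata
* Close any log that is still open (does nothing if none is open)
capture log close
* Show the current working directory
pwd
* To change it, remove the asterisk and adjust the path to an existing folder:
* cd "D:\AIS 4202"
* Start recording; replace overwrites an older session1.log if present
log using session1.log, replace
sysuse auto, clear
summarize price
* Stop recording
log close
* Reopen the same log and add more output to it
log using session1.log, append
describe, short
log close
* List the files in the working directory, including session1.log
dir
```

The replace option is an addition to the table's syntax; without it, log using stops with an error when a file with the same name already exists.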
To perform data analysis using STATA software, the topics in the following sections must be learned through practice.
Importing data: We must store the datasets in the working directory before importing them into STATA. Suppose our STATA data file is in the folder E:\AIS 4102\STATA Data; then we need to set the working directory to this folder using the code: cd "E:\AIS 4102\STATA Data". The following table presents the main keywords of the STATA code (along with examples) for importing data from different formats.
File format            Extension        Keyword         Example of STATA code
STATA format           .dta             use             use data.name.dta, clear
Website                .dta             use             use website link or hyperlink, clear
Excel format           .xls or .xlsx    import excel    import excel using data.name.xlsx, firstrow clear
Comma separated        .csv             insheet         insheet using data.name.csv, clear
Text data (Notepad)    .txt             insheet         insheet using data.name.txt, clear
Exporting data: The following table presents the main keywords of the STATA code (along with examples) for exporting data into different formats.
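A minimal round trip, exporting a dataset and importing it back, could look like the sketch below. The file names (mydata.*) are hypothetical, the auto dataset shipped with STATA stands in for real data, and the export commands shown are one reasonable choice rather than the only one.

```stata
* Start from the example dataset shipped with STATA
sysuse auto, clear

* Export the data in three formats (hypothetical file names)
save mydata.dta, replace
export excel using mydata.xlsx, firstrow(variables) replace
outsheet using mydata.csv, comma replace

* Import each file back, replacing the data in memory each time
use mydata.dta, clear
import excel using mydata.xlsx, firstrow clear
insheet using mydata.csv, clear
```

In STATA 13 and later, insheet and outsheet still run but have been superseded by import delimited and export delimited.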
2. Data Management
Variable type: Some basic features of the imported data file should be examined before proceeding to data management. An important step is to make sure that the variables are in their expected format. STATA uses a color code for each type of variable: black for numeric, red for string (text), and blue for labeled variables. Missing values of a numeric variable are shown as a dot (.) in STATA. A labeled variable is a qualitative or categorical variable for which STATA stores numbers (a value for each category) but displays text (the label or name of each category). If the values of a categorical variable are not labeled, STATA shows them as black numbers. Sometimes a continuous variable is shown in red, meaning that STATA has read it as a string variable, and no analysis other than a frequency distribution can be done with it. That is why we first need to identify all the categorical and continuous variables of our dataset. The following table presents some standard codes along with their explanation:
STATA code              Purpose of the code
ds                      Provides the list of all variables.
describe, short         Provides the size of the data file: the number of rows (observations) and the number of columns (variables).
codebook                Provides the detailed contents and summary statistics of the data file.
codebook var.name(s)    Provides details of the selected variable(s).
labelbook               Provides information on the value labels of each categorical variable.
sum                     Summarizes all the variables by reporting the number of observations, mean, standard deviation, minimum, and maximum value. The number of observations is reported as zero for a string variable.
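To see what these inspection commands report, they can be tried on the auto example dataset shipped with STATA. The dataset and the variables picked for codebook are assumptions made for illustration.

```stata
sysuse auto, clear
* List all variable names
ds
* Number of observations and variables
describe, short
* Detailed contents of two selected variables
codebook price foreign
* Value labels defined in the dataset
labelbook
* Summary statistics for all variables (sum is the short form of summarize)
summarize
```

In the summarize output, the string variable make is listed with zero observations, which is the behavior described in the last row of the table.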
Data management operates on and manages the variables of the data file according to the requirements of the analysis. Some points should be kept in mind when naming variables: use upper and lower case consistently, use an underscore instead of a space between two words of a variable name, and prefer short variable names accompanied by variable labels that explain them. The dotted names in the tables, such as var.name, are placeholders only; actual STATA variable names cannot contain a period. The following table presents example STATA codes and their explanation. The black letters must remain unchanged, and the red italic letters can be changed based on the requirements of the user.
Example of STATA codes
    Explanation
label variable var.name "label name"
    Applies a label (the text in double quotation marks) to the variable var.name.
rename var.old.name var.new.name
    Renames the variable var.old.name to the new name var.new.name.
label define label.name num.code1 category1 num.code2 category2
label values cate.var label.name
    Two commands are used to apply a value label to a categorical variable. The first command creates a general label name (label.name) that can be applied to one or more categorical variables, here two number codes for two categories, e.g., 1 for Male and 2 for Female. The second command applies the created value label to the categorical variable named cate.var.
sort var.name
gsort -var.name
    sort sorts the full dataset in ascending order of the variable; gsort with a minus sign in front of the variable sorts in descending order.
recode old.var (old.num1 = new.num1) (old.num2 = new.num2), gen(new.var)
    Creates a new categorical variable (new.var) by recoding the number codes of the categories of the old categorical variable (old.var).
recode old.var (min/num1=1) (num2/num3=2) (num3/max=3), gen(new.var)
    Creates a new categorical variable (new.var) with three categories (for example) by recoding a continuous variable (old.var): from the minimum value to num1, from num2 to num3, and from num3 to the maximum value.
gen new.var = var1 + var2 + var3
    Creates a new variable (new.var) by adding three variables.
gen new.var = ln(old.var)
gen new.var = exp(old.var)
gen new.var = sqrt(old.var)
gen new.var = old.var^2
    Creates a new variable by transforming another variable (old.var): natural logarithm (ln), exponential (exp), square root (sqrt), or square (^2).
egen new.var = rowmean(var1 var2 var3)
    Creates a new variable (new.var) using the average of three variables.
gen new.var = 1 if var1==1 & var2==1
replace new.var = 2 if var1==2 & var2==1
    Creates a new variable using logical expressions on two categorical variables. Only the cases that satisfy the condition after if receive a number (1 and 2 in this case); all other values are missing. We can continue assigning numbers with the replace command for every case we need.
egen new.var = mean(cont.var), by(cate.var)
    Creates a new variable (new.var) using the mean value of a continuous variable (cont.var) for each category/group of a categorical variable (cate.var).
egen new.var = std(old.var)
    Creates a new variable (new.var) from the standardized values (Z-scores) of the variable old.var.
encode str.var, gen(num.var)
    Creates a new numeric variable (num.var) from the string variable (str.var).
append using data.file1.dta data.file2.dta
    Combines the rows of two data files when neither is loaded in STATA. Because the append command adds rows, the column names, i.e., the variable names, in the two data files must be the same.
append using data.file2.dta
    Adds the rows of data.file2 to the data file already loaded in STATA. The variable names of the two data files must be the same.
merge 1:1 id.var using data.file2.dta
    Merges data.file2 from the current directory with the data file loaded in STATA (datafile1) by the key or identification variable (id.var). The name and values of the key variable should be the same in both files; a 1:1 merge means that the key variable has no repeated values in either data file.
merge 1:m id.var using data.file2.dta
    A similar task, but used when the key variable of the data loaded in STATA contains unique values and the data file in the current directory (data.file2.dta) contains repeated values.
merge m:1 id.var using data.file2.dta
    Used when the key variable of the data loaded in STATA contains repeated values and data.file2.dta contains unique values.
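The sketch below applies several of the commands from the table above to the auto example dataset shipped with STATA. The dataset, the new variable names, the cut-off values, and the file name auto_price.dta are all assumptions chosen for illustration.

```stata
sysuse auto, clear

* Variable label, rename, and value label
label variable mpg "Mileage (miles per gallon)"
rename rep78 repair
gen byte highmpg = (mpg > 25) if !missing(mpg)
label define yesno 0 "No" 1 "Yes"
label values highmpg yesno

* Recode a continuous variable into three groups
* (a boundary value such as 5000 falls into the first rule that matches)
recode price (min/5000 = 1) (5000/10000 = 2) (10000/max = 3), gen(pricegrp)

* Transformations and egen functions
gen lnprice = ln(price)
egen avgsize = rowmean(length turn)
egen meanprice = mean(price), by(foreign)
egen zprice = std(price)

* gen followed by replace for logical conditions
gen combo = 1 if foreign == 0 & highmpg == 1
replace combo = 2 if foreign == 1 & highmpg == 1

* String to numeric: decode builds a string copy, encode converts it back
decode foreign, gen(origin_str)
encode origin_str, gen(origin_num)

* 1:1 merge: make two files that share the unique key variable make
keep make price
save auto_price.dta, replace
sysuse auto, clear
keep make mpg
merge 1:1 make using auto_price.dta
tabulate _merge
```

After a merge, the variable _merge reports which observations matched; here every observation should be matched because both files come from the same dataset.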
3. Exploring Data
Exploring data is the initial analysis of the data, covering summary statistics and descriptive statistics. Summary statistics present frequencies and percentages along with some graphical presentation of the data. Descriptive statistics present the measures of central tendency and the measures of dispersion. However, some experts treat the frequency distribution, the measures of central tendency and dispersion, and their graphical display all as descriptive statistics. The following table presents the classification of proper descriptives and graphical presentations of data based on the combination and types of the variables.
In this section, we estimate the descriptive statistics and construct their associated graphs. The following table presents STATA codes and their explanation for exploring study variables.
Example of STATA codes
    Explanation
table cate.var
    Estimates the frequency distribution for the categorical variable.
graph bar, over(cate.var)
    Draws a vertical bar diagram for a categorical variable.
graph hbar, over(cate.var)
    Draws a horizontal bar diagram for a categorical variable.
ssc install catplot
    Installs the catplot package, which is a user-written program.
catplot cate.var, percent() blabel(bar) recast(bar)
    Draws a bar diagram with the percentage for each category; the recast option allows a bar, hbar, or dot diagram.
graph pie, over(cate.var) plabel(_all percent)
    Draws a pie diagram for a categorical variable; the plabel option allows sum (count), percent (%), or name (category).
tabstat cont.var(s), stat(n min max mean median var sd q iqr semean cv skew kurt)
    Estimates the descriptives for a continuous variable.
hist cont.var, normal
    Draws a histogram with the normal probability curve.
graph box cont.var
    Draws a box-and-whisker plot for a continuous variable.
pnorm cont.var
    Draws a normal probability plot for a continuous variable.
qnorm cont.var
    Draws a quantile-quantile plot for a continuous variable.
tab cate.var1 cate.var2, col row
    Estimates a two-way table using two categorical variables along with their column and row percentages.
tab cate.var1 cate.var2, nofreq col
    Estimates a two-way table with column percentages only.
graph bar, over(cate.var1) over(cate.var2)
    Draws a composite vertical bar diagram.
catplot cate.var1 cate.var2, percent() blabel(bar) recast(bar)
    Draws a composite bar diagram for two categorical variables; by(cate.var3) is allowed at the end of the command to draw the composite bar for each category of cate.var3.
pwcorr cont.var1 cont.var2 . . . cont.varn
    Estimates the parametric Pearson's correlation coefficient.
spearman cont.var1 cont.var2 . . . cont.varn
    Estimates and tests the nonparametric Spearman correlation.
scatter cont.var1 cont.var2
    Draws a scatter plot for the two continuous variables.
twoway scatter cont.var1 cont.var2 || lfitci cont.var1 cont.var2
    Draws a scatter plot with a fitted line.
tabstat cont.var(s), by(cate.var) stat(n min max mean sd cv median iqr) long
    Estimates the descriptives of the continuous variable(s) for each category of the categorical variable. Adding the col(stat) option at the end produces an alternative layout of the descriptives.
graph box cont.var, over(cate.var)
    Draws a box plot for each category of the categorical variable.
graph bar cont.var(s), blabel(bar) over(cate.var)
    Draws a mean plot of the continuous variable(s) for each category of the categorical variable.
pwcorr cont.var1 cont.var2 . . . cont.var(n)
    Estimates a correlation matrix for all continuous variables.
graph matrix cont.var1 . . . cont.var(n), half
    Draws a scatter plot matrix of all continuous variables.
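A short exploratory pass over the auto example dataset shipped with STATA might look like the following. The dataset and the choice of variables are assumptions; catplot is left out because it requires a separate installation.

```stata
sysuse auto, clear

* One categorical variable
table foreign
graph bar, over(foreign)
graph pie, over(foreign) plabel(_all percent)

* One continuous variable
tabstat price, stat(n min max mean median sd iqr)
histogram price, normal
graph box price

* Two categorical variables
tabulate foreign rep78, col row

* Two or more continuous variables
pwcorr price mpg weight
spearman price mpg
twoway scatter price mpg || lfit price mpg

* Continuous variable(s) by category
tabstat price mpg, by(foreign) stat(n mean sd median iqr)
graph box price, over(foreign)
graph matrix price mpg weight, half
```

Each graph command replaces the graph currently shown in the Graph window, so it can be easier to run the graph lines one at a time.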
4. Analyzing Data (Inferential Statistics)
This part focuses on analyzing data to make informed decisions using inferential statistics suited to categorical and continuous variables. The following table presents the STATA codes and their explanation for estimating inferential statistics.
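As an illustration only, a few common inferential procedures are sketched below on the auto example dataset shipped with STATA. The selection of tests and variables is an assumption and is not taken from the original table.

```stata
sysuse auto, clear

* Two categorical variables: chi-square test of association
tabulate foreign rep78, chi2

* Continuous variable across two groups: independent-samples t test
ttest mpg, by(foreign)

* Continuous variable across several groups: one-way ANOVA
oneway price rep78

* Continuous outcome with continuous predictors: linear regression
regress price mpg weight
```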
Residual analysis can be performed when a regression model has not been fitted with robust standard errors. The following table summarizes the formal (theoretical) and informal (graphical) test procedures for each underlying assumption of the classical linear regression model:
The following table presents STATA codes and their explanation for performing the residual analysis:
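A minimal residual analysis after an ordinary (non-robust) regression could be sketched as follows. The model, the auto example dataset, and the particular set of checks are assumptions chosen for illustration, not the original table's contents.

```stata
sysuse auto, clear

* Fit the model without robust standard errors
regress price mpg weight

* Store the residuals in a new variable (resid is a hypothetical name)
predict resid, residuals

* Informal (graphical) checks
rvfplot, yline(0)
qnorm resid

* Formal tests
* Breusch-Pagan test for heteroskedasticity
estat hettest
* Variance inflation factors for multicollinearity
estat vif
* Shapiro-Wilk test for normality of the residuals
swilk resid
* Ramsey RESET test for omitted variables
estat ovtest
```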