Chapter One:
Introduction to
Software
Debark University
Department of Economics
Introduction
The dynamic nature of today's world constantly raises
questions for people in their daily lives.
To answer these questions, the collection,
organization, analysis and interpretation of data are
critical.
Data are the information that you collect to learn,
draw conclusions, and test hypotheses.
These data can be collected and stored in numerous
ways, depending on
the type of data,
source & context,
study design,
data volume & turnaround time and
data security.
The field of economic statistics and econometrics
is rapidly changing.
Increasing data availability combined with
powerful computing and advanced software
allows research to address issues of statistical
inference and analysis in innovative ways.
Statistical skills enable you to intelligently
collect, analyze and interpret data relevant to
decision-making.
Some of the software
packages for analysis and
collection of data
Introduction to
Stata
What is Stata?
Stata is a general-purpose statistical software
package that was created in 1985 by economists.
It is used for exploring, graphing, summarizing and
manipulating data files.
The word Stata is a combination of the words
`statistics' and `data.'
Stata is not an acronym and should not be written in
all capital letters.
Stata is an integrated statistical analysis
package designed for research professionals who
handle and manipulate large data sets.
It is a multi-purpose statistical package to help
you explore, summarize and analyze datasets.
Stata uses a command-line interface, so users
can type commands to perform specific tasks.
Users can also run commands in batch using a
do-file.
In addition, Stata has menus and dialog boxes
that give the user access to nearly all built-in
commands.
Stata is case-sensitive; thus, it distinguishes
between lower and upper case letters.
Most Stata built-in commands are lower case, a
convention most programmers follow.
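A minimal sketch of what case sensitivity means in practice. It assumes the auto example dataset that ships with Stata (loaded with sysuse); the variable name Price is hypothetical and is used only for illustration.

```stata
* Load an example dataset that ships with Stata
sysuse auto, clear

* Built-in commands are lower case, so this works
summarize price

* The same command with a capital letter is not recognized;
* capture noisily shows the error message without stopping a do-file
capture noisily Summarize price

* price and Price are two different variable names
generate Price = price / 1000
describe price Price
```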
Forms or ‘flavors’ of Stata
There are four flavors:
Stata/MP (multiprocessor), which is the most powerful
Stata/SE (special edition), an extended version
Stata/IC (Intercooled)
Small Stata
Most features are shared across all
flavors of Stata.
The flavors differ mainly in terms of
the number of variables handled
the speed of processing
Why Stata?
Documentation and reproducibility of data
and results
Manipulating data, carrying out statistical
analyses, and producing publication
quality graphics
Saves time and effort for advanced users
Steps in data analysis
Locate or gather data
Load data into software package
Manipulate as needed
Analyze
“Data”
A set of numbers and/or text describing
specific phenomena
Mortality, drug effectiveness, economy,
weather, traffic, pollution levels, etc.
Data are always organized in a rectangular layout:
columns contain “variables”
rows contain “observations”
Stata windows or
interface
When Stata is started, a screen opens (as shown in the figure)
containing four windows labeled:
History
Variables
Results
Command (the command-line interface)
Windows Cont’d
Each of the Stata windows can be resized
and moved around in the usual way
To bring a window forward that may be
obscured by other windows, make the
appropriate selection in the Window
menu.
Ways to use Stata
Point & click
Command line interface
Batch file (called a “do-file”)
Stata has a Graphical User Interface (GUI) that
allows almost all commands to be accessed via
point-and-click.
Simply start by clicking into the Data, Graphics,
or Statistics menus, make the relevant
selections, fill in a dialog box, and click OK.
Stata then behaves exactly as if the
corresponding command had been typed, with the
command appearing in the Stata Results and
History windows.
Datasets
Stata datasets have the .dta extension and
can be loaded into Stata in the usual way
through the File menu
Stata file types
Stata uses and creates many types of files, which are
distinguished by extensions at the end of the filename. The
extensions used by Stata include:
.ado Programs that add commands to Stata, such as the SPost commands.
.do Batch files that execute a set of Stata commands.
.dta Data files in Stata's format.
.gph Graphs saved in Stata's proprietary format.
.hlp The text displayed when you use the help command (.sthlp in recent versions). For example, fitstat.hlp has help for fitstat.
.log Output saved as plain text by the log using command.
Loading data into Stata
The dataset may be viewed as a spreadsheet by opening the
Data Browser with its toolbar button, and edited by clicking the
Data Editor button.
Stata command:
use "file path/file name.dta", clear
e.g. use "C:\Users\Malede\Desktop\data.dta", clear
A command is typed in the Stata Command window and
executed by pressing the Return (or Enter) key.
Working directory: the default folder Stata uses when using data, saving data, or logging output.
To display it, type cd in the Command window; to change it, use: cd "C:\Users\Malede\Desktop"
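The working directory and the use command can be combined as in the sketch below. It assumes that the example folder and the file data.dta from the slide above actually exist on your machine; replace them with your own path and filename.

```stata
* Show the current working directory
cd

* Change the working directory (example path from above; adjust as needed)
cd "C:\Users\Malede\Desktop"

* Files in the working directory can now be loaded without the full path
use "data.dta", clear

* Check what was loaded
describe
```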
Do-files
[Figure: double-click to open the Do-file Editor window]
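A do-file is simply a text file of Stata commands that are run in order. The sketch below shows what a short one might contain; the filename example.do is hypothetical, and the auto dataset is the example data shipped with Stata.

```stata
* example.do: a hypothetical do-file for illustration

* Start from an empty memory
clear all

* Load example data shipped with Stata
sysuse auto, clear

* Describe and summarize the data
describe
summarize price mpg

* Fit a simple regression
regress price mpg
```

Once saved, the file can be run from the Command window by typing do example.do, or from the Do-file Editor's run button.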
Log files
log allows you to make a full record of your Stata session. A
log is a file containing what you type and Stata's output.
At the beginning of a Stata session, press the Log
button, type a filename into the dialog box, and choose
Save.
By default, this produces a SMCL (Stata Markup and
Control Language, pronounced ‘smicle’) file with
extension .smcl, but an ordinary ASCII text file can be
produced by selecting the .log extension.
Log files can also be opened, viewed, and closed by
selecting Log from the File menu, followed by Begin...,
View..., or Close.
Examples:
log using mylog, replace
log using mylog2, name(mylog2)
log using firstfile, name(log1) text
log using secondfile, name(log2) smcl
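A typical logging pattern in a do-file might look like the sketch below. The log name mylog comes from the examples above; the auto dataset is an assumption used only to produce some output.

```stata
* Close any log left open from an earlier run (no error if none is open)
capture log close

* Open a plain-text log, overwriting an existing file of the same name
log using mylog, replace text

* Everything typed and displayed from here on is recorded
sysuse auto, clear
summarize price

* Stop recording and close the file
log close
```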
Getting help
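To get help on a command whose name is known, select Stata Command... from the Help menu.
To find a command, select Search..., type keywords, and press OK; the results list relevant commands and Frequently Asked Questions (FAQs).
The same can be done from the Command window:
search keywords
help commandname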
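For instance, the two commands can be used as follows; the command name and the keywords are only examples.

```stata
* Open the help file for a command whose name is known
help regress

* Look for commands and FAQs related to some keywords
search kernel density
```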
Data input and output
Stata has its own data format with default extension .dta.
Reading and saving a Stata file are straightforward.
use "file path/file name"
save "file path/file name"
There are essentially two kinds of variables in Stata: string and
numeric.
The storage types are byte, int, long, float, and double for
numeric variables and str1 to str2045 (plus strL for longer text in
recent versions) for string variables of different lengths.
Besides the storage type, variables have associated with them a
name, a label, and a format.
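The sketch below shows how storage type, name, label, and format appear for a variable. It assumes the auto example dataset shipped with Stata; the variable name price_thousands is hypothetical.

```stata
sysuse auto, clear

* describe lists each variable's name, storage type, display format, and label
describe make price gear_ratio

* Create a new numeric variable with an explicit storage type
generate double price_thousands = price / 1000

* Attach a label and a display format with two decimal places
label variable price_thousands "Price in thousands of dollars"
format price_thousands %9.2f

describe price_thousands
```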
Entering Data
insheet: Read ASCII (text) data created by a spreadsheet (.csv files only); superseded by import delimited in Stata 13 and later
infile: Read unformatted ASCII (text) data (space-delimited files)
input: Enter data from the keyboard
describe: Describe contents of data in memory or on disk
compress: Compress data in memory
save: Store the dataset currently in memory on disk in Stata data format
count: Show the number of observations
list: List values of variables
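A minimal sketch of entering data from the keyboard and checking the result. The variable names and values are invented for illustration, and the output filename mydata is hypothetical.

```stata
clear

* Type data directly; str10 declares name as a string of up to 10 characters
input id age str10 name
1 25 "Abebe"
2 31 "Sara"
3 47 "Kebede"
end

* Inspect what was entered
describe
count
list

* Reduce storage where possible, then save in Stata format
compress
save mydata, replace
```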
Exploring data
describe: Describe a dataset
list: List the contents of a dataset
codebook: Detailed contents of a dataset
log: Create a log file
summarize: Descriptive statistics
tabstat: Table of descriptive statistics
table: Create a table of statistics
stem: Stem-and-leaf plot
graph: High-resolution graphs
kdensity: Kernel density plot
sort: Sort observations in a dataset
histogram: Histogram for continuous and categorical variables
tabulate: One- and two-way frequency tables
correlate: Correlations
pwcorr: Pairwise correlations
type: Display an ASCII file
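Several of the exploring commands above can be tried on the auto example dataset that ships with Stata, as in this sketch. The choice of variables is arbitrary.

```stata
sysuse auto, clear

* Overview of the dataset and one variable in detail
describe
codebook rep78

* Descriptive statistics
summarize price mpg weight
tabstat price mpg, statistics(mean sd min max) by(foreign)

* Frequency table and correlations
tabulate foreign rep78
correlate price mpg weight
pwcorr price mpg rep78, obs

* Sort by price and show the three cheapest cars
sort price
list make price in 1/3

* Simple plots
histogram mpg
kdensity mpg
stem mpg
```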
Modifying Data
label data: Apply a label to a data set
order: Order the variables in a data set
label variable: Apply a label to a variable
label define: Define a set of labels for the levels of a categorical variable
label values: Apply value labels to a variable
list: Lists the observations
rename: Rename a variable
recode: Recode the values of a variable
notes: Apply notes to the data file
generate: Creates a new variable
replace: Changes the values of an existing variable
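The modifying commands above can be sketched on the auto example dataset shipped with Stata. The new variable names (highmpg, efficient, rep3), the label name yesno, and the label text are all hypothetical.

```stata
sysuse auto, clear

* Label the dataset and a variable
label data "Automobile data (example)"
label variable mpg "Mileage (miles per gallon)"

* Create an indicator variable, leaving it missing where mpg is missing
generate byte highmpg = mpg > 25 if !missing(mpg)

* Define value labels and attach them to the new variable
label define yesno 0 "No" 1 "Yes"
label values highmpg yesno

* Rename the variable and move it next to make
rename highmpg efficient
order make efficient

* Collapse the five repair categories into three, stored in a new variable
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* Change the values of an existing variable and record what was done
replace price = price / 1000
notes: price is now measured in thousands of dollars

list make efficient rep3 price in 1/5
```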
Managing Data
pwd: Show current directory (pwd = print working directory)
dir or ls: Show files in current directory
cd: Change directory
keep if: Keep observations if condition is met
keep: Keep variables (dropping others)
drop: Drop variables (keeping others)
append using: Append a data file to current file
merge: Merge a data file with current file
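A sketch of keep, append, and merge using the auto example dataset shipped with Stata. The temporary file names (imports, weights) are hypothetical, and because they are stored in local macros the block should be run as one do-file rather than line by line.

```stata
* Where am I, and what files are here?
pwd
dir

* Build a file of foreign cars only
sysuse auto, clear
keep make price mpg foreign
keep if foreign == 1
tempfile imports
save `imports'

* Build the domestic cars and append the foreign ones below them
sysuse auto, clear
keep make price mpg foreign
keep if foreign == 0
append using `imports'
count

* Save a second file holding another variable for the same cars
sysuse auto, clear
keep make weight
tempfile weights
save `weights'

* Merge the weight variable onto price and mpg, matching on make
sysuse auto, clear
keep make price mpg
merge 1:1 make using `weights'
```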
Analyzing Data
ttest: t test
regress: Linear regression
predict: Obtains predictions after model estimation
kdensity: Kernel density estimates and graphs
pnorm: Graphs a standardized normal probability plot
qnorm: Graphs quantiles of a variable against quantiles of the normal distribution
rvfplot: Graphs a residual versus fitted plot
rvpplot: Graphs a residual versus individual predictor plot
xi: Creates dummy variables during model estimation
test: Test linear hypotheses after model estimation
oneway: One-way analysis of variance
anova: Analysis of variance
logistic: Logistic regression
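A short sketch that strings several of the analysis commands together, assuming the auto example dataset shipped with Stata. The model specification and the new variable names (phat, res) are chosen only for illustration.

```stata
sysuse auto, clear

* Compare mean mileage of domestic and foreign cars
ttest mpg, by(foreign)

* Linear regression of price on three predictors
regress price mpg weight foreign

* Fitted values and residuals from the model just estimated
predict phat
predict res, residuals

* Joint test that the mpg and weight coefficients are both zero
test mpg weight

* Diagnostic plots after regress
rvfplot
rvpplot mpg
qnorm res

* One-way analysis of variance and a logistic regression
oneway mpg rep78
logistic foreign mpg weight
```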
Must-Know Commands
System            Data Management
clear             use
exit              sysuse
log               infile, infix
set               list
#delimit          describe
net               keep, drop
search            generate, replace, rename
help              save, outfile
Must-Know Commands
Data Analysis     Statistical Analysis
summarize         regress
correlate         predict
graph             test
twoway, scatter   dwstat
hist              hettest
Comments and Notes
Stata treats lines that begin with an asterisk * or are
located between a pair of /* and */ as comments that are
simply echoed to the output
If a command continues over two lines, we use /* at the
end of the first line and */ at the beginning of the second
line to make Stata ignore the line break.
An alternative would be to use /// at the end of the line.
Variable names are case-sensitive.
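The comment and line-continuation styles described above can be sketched as follows, assuming the auto example dataset shipped with Stata. Splitting a command across lines in this way is meant for do-files, not for typing in the Command window.

```stata
* A line starting with an asterisk is a comment
sysuse auto, clear

/* Text between these delimiters is also a comment */
summarize price mpg

* Continue a command on the next line by commenting out the line break
regress price mpg /*
    */ weight

* The same idea using /// at the end of the line
regress price mpg ///
    weight foreign
```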
Missing value
A missing value in a numeric variable is represented by a
period '.' (the system missing value), or by a period followed by
a letter, such as .a, .b, etc.
Missing values are treated as very large positive
numbers, with . < .a < .b, etc.
Note that this can lead to mistakes in logical expressions: a condition such as x > 100 is also true when x is missing.
Numerical missing value codes (such as -99) may be
converted to missing values using the
command mvdecode, and back again using mvencode.
mvdecode x, mv(-99)
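The following sketch illustrates the pitfall with logical expressions. The variable x and its values are invented for illustration.

```stata
clear
input id x
1 10
2 -99
3 55
4 -99
end

* Turn the numeric code -99 into the system missing value
mvdecode x, mv(-99)

* Missing counts as larger than any number, so this counts 3 observations:
* the value 55 plus the two missing values
count if x > 20

* Excluding missing values explicitly gives the intended answer of 1
count if x > 20 & !missing(x)
```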
Data management
Looking at your data
browse: opens a spreadsheet in which you can scroll to look at the data, but you cannot change the data.
edit: opens a spreadsheet in which you can both look at and change the data.
list: creates a list of values of specified variables and observations.
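A brief sketch of these commands, assuming the auto example dataset shipped with Stata; the variables and conditions are chosen arbitrarily.

```stata
sysuse auto, clear

* List selected variables for the first five observations
list make price mpg in 1/5

* List only observations that satisfy a condition
list make price if price > 10000

* Open the read-only spreadsheet view for a few variables
browse make price mpg
```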
Getting information about variables
describe: provides information on the size of
the dataset and the names, labels, and types of
variables.
codebook: summarizes a variable in a format
designed for printing a codebook.
summarize: provides summary statistics. By
default, summarize presents the number of
nonmissing observations, the mean, the standard
deviation, the minimum, and the
maximum. Adding the detail option includes
additional information. E.g. summarize age, detail
tabulate: creates the frequency distribution for
a variable. If you do not want the value labels