
Stata Class Notes


Uploaded by

Mido Med

1. Open Stata
2. Click on File / Change directory
3. Search for the folder with all the files required for your analysis
4. Select the data subfolder and click Open
Summary: File / Change directory / stataclassfolder / datasubfolder / Open
1. Type "dir" in the command window to list the files in the data subfolder
2. Type "pwd" (print working directory) to confirm which folder Stata is working in
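The same check can be typed as commands. This is a minimal sketch; the folder in the commented cd line is a hypothetical path, not one taken from the notes, so replace it with your own before using it.

```stata
* Hypothetical path: replace it with the folder that holds your own data
* cd "C:\stataclassfolder\data"

pwd    // print the current working directory
dir    // list the files in that directory
```

If dir does not show the .dta files you expect, the working directory is not the data subfolder yet.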
• The purpose of a log file is to save everything you do in Stata: the commands you type and the output they produce.
• It creates a file that stores a record of the work done and the changes made, which is useful for:
  - Double checking your work later on
  - Reviewing the output of statistical procedures
  - Copying and pasting results
General way of making it:
log using "filename.log"
When a log file with that name already exists and you want to overwrite it:
log using "filename.log", replace
When you want to add to the existing log file instead:
log using "filename.log", append
log close - closes the log file
Type use "path and filename.dta" to open a dataset.
use "path and filename.dta", clear opens the dataset after clearing any data already in memory.
Adding to or replacing a .dta file is done with separate commands, not with options of use:
append using "path and filename.dta" adds the observations of another .dta file to the data in memory
save "path and filename.dta", replace overwrites the existing .dta file with the data in memory
e.g. append using zsbs2009.dta
save zsbs2009.dta, replace
Note: replace overwrites the contents of the old file, so make sure that no wanted information is lost by resaving it.
use zsbs2009.dta, clear

Type clear to remove the data from memory and exit to close Stata. Your log file is not affected because the session has already been written to it.
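A minimal sketch of one logged session is shown below. It assumes the built-in auto example dataset (loaded with sysuse) instead of the class file, and session1.log is a hypothetical file name.

```stata
capture log close                  // close any log left open; ignore the error if none is
log using "session1.log", replace  // hypothetical file name; replace overwrites an old copy
sysuse auto, clear                 // load a built-in example dataset, clearing memory first
describe                           // this output is written to the log
log close                          // stop logging
```

After log close, session1.log can be opened in any text editor to review or copy the results.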
• Type describe to check all the variables available in the file
• This reports the number of observations, the number of variables, and each variable's name, storage type, label and order
• describe

• Type describe [varlist] to check specific variables of interest.


• describe q02-q08
LOOKFOR:- used to search for variables in Stata by name or label.
e.g. lookfor age or lookfor "marital status"
Useful when working with huge datasets.
BROWSE:- used to open a window showing the dataset currently in memory; it resembles an Excel spreadsheet.
COUNT:- used to assess how many observations satisfy a particular condition
e.g. count
count if q103>20 counts the respondents who are older than 20 years
codebook [varlist]
Shows the contents of the data, with more details such as:-
• Variable name/labels/order, and value labels
• Range of values, frequency count, number of missing values
• If you use the notes option, the notes attached to the data are displayed
• e.g. codebook q101
• codebook q101 q501
• codebook q101-q103
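The inspection commands above can be tried on any dataset. The sketch below assumes the built-in auto example data, so the variable names (weight, mpg, foreign, rep78) are not the ones in zsbs2009.

```stata
sysuse auto, clear          // built-in example data, not the class dataset
lookfor weight              // lists variables whose name or label contains "weight"
count                       // total number of observations
count if mpg > 20           // observations with mileage above 20
codebook foreign rep78      // ranges, frequencies and missing-value counts
```

With the class data the same pattern would be lookfor age, count if q103>20 and codebook q101.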
list [varlist] [if exp] [in range] [, nolabel]
Used to display the values of variables for observations in the data set
• Useful for double checking variables
Example
• list q103
• list q101 q501
• list q103 in 1/20
TABULATE ONE WAY CMD
tabulate varname [if]
Used to create a frequency distribution of an individual variable (a frequency is the number, or count, of observations in a category of the variable)
The nolabel option suppresses value labels in the display, such as:-
tab q101
tab q101, nol
The table has three columns:
a) Freq. - the number of observations that fall into each of the categories defined for the variable

b) Percent - the % of the total number of observations that each category accounts for

c) Cum. - the cumulative percentage

TABULATE ONE-WAY: MISSING VALUES

Stata ignores missing values when running tabulations. To view them you must add the missing option.

This is good for knowing what is missing in the data, e.g.

tab q101, missing
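The effect of the missing and nolabel options is easiest to see side by side. This sketch assumes the built-in auto example data, in which rep78 has some missing values and foreign carries value labels.

```stata
sysuse auto, clear      // built-in example data, not the class dataset
tab rep78               // missing values are left out of the table
tab rep78, missing      // adds a row for the observations with a missing value
tab foreign             // value labels shown (Domestic, Foreign)
tab foreign, nolabel    // underlying numeric codes shown (0, 1)
```

The totals of the first two tables differ by exactly the number of missing values.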


OPERATIONS, EXPRESSIONS AND OPERATORS
Y=X+Z
• An expression is a statement that requires operands and operators
• Operands can be numbers, variables, the result of a function, or some combination
• Operators can be arithmetic, logical or relational
• Expressions are what the keep and drop commands use to select observations
Arithmetic           Logical           Relational
+  addition          &  (and)          >  (greater than)
-  subtraction       |  (or)           <  (less than)
*  multiplication    ! or ~  (not)     >= (greater than or equal to)
/  division                            <= (less than or equal to)
^  power                               == (equal to)
-  negation                            != or ~= (not equal to)
=  assignment
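A few expressions built from these operators are sketched below, assuming the built-in auto example data. The variable kpl and the conversion factor 0.425 (approximate miles per gallon to km per litre) are illustrative additions, not part of the notes.

```stata
sysuse auto, clear                     // built-in example data
gen kpl = mpg * 0.425                  // arithmetic *, with = assigning the result to a new variable
count if foreign == 1 & mpg >= 25      // relational == and >= joined by logical & (and)
count if rep78 == 1 | rep78 == 2       // logical | (or)
count if !missing(rep78)               // ! negates the condition
keep if price < 10000 & foreign != 1   // an expression selecting observations for keep
```

Note the difference between = (assignment, used by gen and replace) and == (a test for equality, used after if).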
Stata runs in two ways
• Interactive mode: CMDs are typed and executed directly in the CMD window
• Batch mode: CMDs are written in a text file and executed together in one step
• A DO-FILE has the extension .do
• It contains Stata CMDs exactly the way you'd type them into the CMD window
• In Windows, the text for the do-file is written in the do-file editor, which can be opened in two ways:
1. Type doedit in the CMD window
2. Using the menu bar, click on Window and select Do-file Editor, or press Ctrl+9
To save the do file you
Adding comments on a do file
Comments at the top of a do-file usually record:
• Name of project
• Purpose of analysis
• Your name
• Date of creation
• Institution
• Any modification
Different ways to write comments
• * text of comment
  The asterisk is used at the beginning of the line
• // text of comment
  The double forward slash is used to comment out a single CMD or text line; it can be placed at the beginning or the end of a line
*name of project: DEM 3310
*purpose of analysis: sexual behaviour among teenagers
*name of Authors: Vincelaama
*Institution: University of Zambia
*Date: 31/07/2017
*stata commands
clear // this clears any dataset that is in memory
capture log close // closes any open log file; capture suppresses the error message if none is open
cd "C:\\users\Chilax\Desktop\stata class\Data" // this sets the preferred working folder
log using dem33110.log, append
use zsbs2009.dta
* commands for subsetting datasets

keep q101 q103 q701-q711

*drop command

drop q101 q107

keep if q101==1

keep if q103<=19

//data transformation in stata

tab q101, nol

*creating new variables

**generate command

tab q103, missing

gen age=q103

recode q103 (15/24=1 "15-24") (25/49=2 "25-49") (50/60=3 "50-60"), gen(agegroup) // this generates the new grouped variable called agegroup
tabulate agegroup, missing

tab agegroup, nol // nol means no labels

rename q101 sex // to rename, write the old varname, a space, then the new name of the variable
Kay Vincelaama 8/26/2017
recode q103 (15/24=1 "15-24") (25/49=2 "25-49") (50/60=3 "50-60") if q101==1, gen(mage)
• Open Stata, then type doedit or just use the Ctrl+9 shortcut
• Click on the open icon and select your do-file by locating it
• To run the whole do-file, click execute once; the commands and their output will appear in the results window
• Use the keep CMD
e.g. keep q101 q701-q711
• To run a single command, highlight it and click execute; here this keeps only the selected variables
• To run several commands at once, highlight all of them and click execute
• Using the zsbs2009 individual dataset and the age variable q103, what command could you use to identify the number of respondents who are missing a value for age? tab q103, m
• How many are there? 482
• Create a new variable called oldfolks to identify those respondents older than age 45 using the generate and replace commands. How many are there?
gen oldfolks=q103
keep if oldfolks>45
574 people older than 45 yrs
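The oldfolks exercise asks for generate and replace, and it interacts with missing values: Stata treats a missing value as larger than any number, so a condition such as q103>45 is also true for respondents whose age is missing. The sketch below shows the pattern on five made-up ages typed in with input (they are not taken from zsbs2009), so it can be run without the class data.

```stata
clear
input q103        // five illustrative ages, one of them missing
18
46
50
.
30
end

count if q103 > 45                                   // 3: the missing value is counted too
count if q103 > 45 & !missing(q103)                  // 2: only the known ages above 45

gen oldfolks = 0                                     // start everyone at 0
replace oldfolks = 1 if q103 > 45 & !missing(q103)   // flag the known ages above 45
replace oldfolks = . if missing(q103)                // keep unknown ages as missing
tab oldfolks, missing
```

Unlike keep if, this approach leaves every observation in the dataset, so later tabulations can still use the full sample.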


EXAMPLES OF CMDS TO USE
tab q107, m
drop if q107==9
tab q107
tab q107, m
tab q108
tab q108, m
drop if q108==.
tab q108
tab q101 q108
tab q101 q108, row
set more off
tab q101 q105
tab q101 q105, row
tab q101 q501, row
tab q501, m
drop if q501==9
tab q501
tab q101 q108, col
tab q101 q108, expected
tab q101 q108, row chi2
Univariate analysis is the simplest form of quantitative analysis. The analysis is
carried out by describing a single variable in terms of the
applicable unit of analysis.
It is performed when we want to explore each variable in a data set
separately.
It looks at the range of values, as well as the central tendency of the
values.
It describes the pattern of response to the variable.
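As a minimal sketch of describing one variable at a time, assuming the built-in auto example data: summarize suits continuous variables and tab suits categorical ones.

```stata
sysuse auto, clear        // built-in example data
summarize mpg             // observations, mean, standard deviation, minimum and maximum
summarize mpg, detail     // adds percentiles (including the median), skewness and kurtosis
tab rep78                 // frequency distribution for a categorical variable
```

The minimum and maximum give the range of values, while the mean and median describe the central tendency.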

• Bivariate analysis relates one variable to another, to investigate whether a relationship exists and, if it exists, to determine whether or not it is significant
• To perform a bivariate analysis, you have to cross-tabulate the variables of interest

Cross tabulation general rules

• Decide which is the independent variable and which is the dependent
• The independent variable is written first in the CMD structure
• In case you get lost, the percentages within each category of the independent variable should always add up to 100% (with the independent variable written first, use the row option)
• E.g. gender and the completion of school

• χ² (chi-squared) analysis:
• A chi-square test measures association between two categorical variables
• It calculates the probability that the observed relationship between two categorical variables is due to chance (a.k.a. random sampling error)
• It requires that you compare what you observe with what you would expect to observe if there were no pattern
• E.g. sex and education
• tab q101 q108
• tab q101 q108, expected
• tab q101 q108, row chi2
• Pr = 0.000 on the chi2 line means the p-value is below 0.05, so the relationship is statistically significant at the 95% confidence level
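The cross-tabulation steps above can be sketched on the built-in auto example data, with foreign standing in for the independent variable and rep78 for the dependent one. These variables are assumptions for illustration; the class example uses q101 and q108.

```stata
sysuse auto, clear                          // built-in example data
tab foreign rep78                           // plain cross-tabulation of counts
tab foreign rep78, row                      // row percentages: each row sums to 100%
tab foreign rep78, expected                 // expected counts if there were no association
tab foreign rep78, row chi2                 // adds Pearson chi-squared and its p-value
display "chi2 = " r(chi2) "   p = " r(p)    // results stored by the last tabulate
```

Comparing the observed counts with the expected ones shows which cells drive the chi-squared statistic. Several expected counts in this small table are low, so treat its p-value as an illustration only.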

• DESCRIPTION
• The t-test performs tests of equality of means. It has the following forms:
• Tests that varname has a mean of #
• E.g. the average age of males is 21
• Tests that varname has the same mean within the two groups defined by groupvar
• E.g. check if the scores of girls and boys are the same
• Tests that two varnames have the same mean, assuming paired (or unpaired) data
• E.g. test whether women aged 15-24 and 25-29 have the same number of children

The independent-samples t-test makes the following assumptions:
• The dependent variable should be measured at the interval or ratio level (i.e. continuous), e.g. weight, height, income
• The independent variable should consist of two categorical, independent (unrelated) groups, e.g. marital status
• Independence of observations, i.e. there must be different participants in each group, with no participant being in more than one group
• E.g. women using traditional and modern contraceptives, with CEB (children ever born) as the continuous variable
• No significant outliers, i.e. values that stick out from the ordinary, e.g. when the average age in the data is 23 and the results show someone above 35 years
• The dependent variable should be approximately normally distributed for each category of the independent variable
• E.g. when we take a population sample we assume it is normally distributed
• Homogeneity of variances, checked using Levene's test for homogeneity of variances: the variation in the two groups should be similar

• One-sample t-test: tests whether the mean of the sample is equal to a known constant under the assumption of unknown variance
• use http://www.stata-press.com/data/r13/auto
• save ttest_de3310, replace
• Example: test whether the overall average for the sample is 20 miles per gallon (the variable is mpg)
• ttest mpg==20

• Two-sample t-test: determines whether the mean of a dependent variable (e.g. weight) is the same in two unrelated, independent groups (e.g. males and females)
• Example: how can we test the effectiveness of a new treated fuel?
• Such as leaded petrol, low-sulphur diesel
• Another example: the performance of banks, measured by customers' satisfaction with services
• We can run an experiment in which 12 cars are given the new treated fuel and another 12 cars are not; we then measure how far they travel on the fuel using the "mpg" variable. We also have a variable called "treated", coded 1 if the fuel is new (treated) and 0 if not.
• This process will test the equality of means for the treated and untreated groups:
• use twosample_ttest_de3310.dta
• ttest mpg, by(treated) // how far a car can go with and without treated fuel
• In demography, the same test could compare those using contraceptives with those not using them

• Paired t-test: used to determine whether the mean of a dependent variable is the same in two related groups (e.g. the same subjects measured at two different "time points" or under two different "conditions")
• Example: compare the distance travelled, measured as "mpg1" and "mpg2", by the same cars run once on fuel with an additive and once on ordinary fuel. In this case, we conduct a paired t-test
• ttest mpg1==mpg2
• ttest mpg1==mpg2, unpaired // treats the two variables as independent samples instead

• Analysis of variance (ANOVA): this compares the means of three or more independent groups
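The one-sample and two-sample tests described above can be sketched as follows. The sketch assumes the built-in auto example data and uses foreign as the grouping variable in place of the class variable treated; robvar is Stata's command for Levene's test.

```stata
sysuse auto, clear               // built-in example data
ttest mpg == 20                  // one-sample: is the mean of mpg equal to 20?
robvar mpg, by(foreign)          // W0 is Levene's test for equal variances
ttest mpg, by(foreign)           // two-sample: same mean mpg for domestic and foreign cars?
ttest mpg, by(foreign) unequal   // same test without assuming equal variances
```

If Levene's test rejects equal variances, the version with the unequal option is the safer one to report. The paired form, ttest mpg1==mpg2, needs a dataset that holds both measurements for each car, so it is not run here.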


CORRELATION AND REGRESSION
• The relationship is developed through answering the questions over the core problem
• E.g. whether there is a relationship or a difference between rural and urban areas
• Mostly these are questions that require more research to answer, such as defining the culture of Zambia, which is so complex
• Another example is "does breastfeeding duration have an impact on IQ?" Such questions do not have yes or no answers but require analysis
• The analysis can start from an assumption such as the one below
• Breastfeeding impacts on disease burden
• Breastfeeding (independent), diarrhea (dependent): in this case we cannot use a t-test because the dependent variable is categorical. Instead we use univariate and bivariate analysis with a chi-square test, then move on to correlation
• Bivariate analysis explains if there is a relationship by producing the percentages
• Correlation shows the direction of the relationship and whether it is strong or weak
• Regression estimates how much the dependent variable changes when the independent variable changes
WHAT IS OUR WORKING OR SOUND THEORY
• We hypothesize (or ask a research question) that a baby's birth weight (m19_1) is a function of, or related to, the mother's...
1. Area of residence (v025)
2. Sex of the baby (b4_01)
3. Education status of the mother (v106)
4. Age of respondent (v012)
• We shall use the 2007 ZDHS dataset
• A scatter plot is used only when the independent and dependent variables are continuous, e.g. birth weight (m19_1) and the respondent's height (v438)
• twoway (scatter v438 m19_1) (lfit v438 m19_1)


SCATTER PLOT GRAPH
[Figure: scatter plot with fitted line. Horizontal axis: birth weight in kilograms (3 decimals), 0 to 10000. Vertical axis: respondent's height in centimeters (1 decimal), up to 10000. Legend: respondent's height, Fitted values.]

Most observations lie between 0 and 6 kg; the values beyond 6 kg are unusual outcomes, hence called outliers. These can be dropped to get a clearer view of the data:
1. drop if v438>2000
2. drop if m19_1>6000

[Figure: the same scatter plot after dropping the outliers. Horizontal axis: birth weight in kilograms (3 decimals), 1000 to 6000. Vertical axis: respondent's height in centimeters (1 decimal), 1000 to 1800. Legend: respondent's height, Fitted values.]

This is what the final outcome should show.


CORRELATION
• Run a correlation using either corr or pwcorr
• In both CMDs, the correlation shows how strong or weak the relationship is between the observed variables of interest, and whether it is positive or negative
• A value toward 1 shows a strong positive relationship, and
• A value toward -1 shows a strong negative relationship; values near 0 show a weak relationship
• E.g. 0.002 is a very weak positive relationship, while 0.4 is a much stronger one because it is closer to 1
• corr uses the listwise deletion method: an observation is dropped if any of the listed variables is missing
• pwcorr uses the pairwise deletion method: each pair of variables uses all the observations available for that pair. The two CMDs give the same results only when no values are missing
• corr m19_1 v438 b4_01 v106 v025 v012
• pwcorr m19_1 v438 b4_01 v106 v025 v012, star(0.05)
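The difference between listwise and pairwise deletion can be seen in a small sketch. It assumes the built-in auto example data, where rep78 has a few missing values; the obs and sig options of pwcorr are additions that do not appear in the notes.

```stata
sysuse auto, clear                        // built-in example data; rep78 has missing values
corr mpg weight rep78                     // listwise: only cars complete on all three variables
pwcorr mpg weight rep78, obs              // pairwise: each pair uses every car it can; obs shows n
pwcorr mpg weight rep78, sig star(0.05)   // p-values, with a star where p < 0.05
```

In the pwcorr output the mpg and weight pair is based on more observations than the pairs involving rep78, which is why the two commands can report different coefficients for the same pair.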

. corr m19_1 v438 b4_01 v106 v025 v012
(obs=6605)

             |    m19_1     v438    b4_01     v106     v025     v012
-------------+------------------------------------------------------
       m19_1 |   1.0000
        v438 |   0.1056   1.0000
       b4_01 |  -0.0808  -0.0002   1.0000
        v106 |  -0.0268   0.1444  -0.0053   1.0000
        v025 |   0.0756  -0.1229  -0.0031  -0.3265   1.0000
        v012 |   0.1177   0.1147   0.0067  -0.1717   0.0472   1.0000

. pwcorr m19_1 v438 b4_01 v106 v025 v012, star(0.05)

             |    m19_1     v438    b4_01     v106     v025     v012
-------------+------------------------------------------------------
       m19_1 |   1.0000
        v438 |   0.0277*  1.0000
       b4_01 |   0.0016   0.0037   1.0000
        v106 |  -0.2841* -0.0120   0.0006   1.0000
        v025 |   0.3028*  0.0054  -0.0115  -0.3652*  1.0000
        v012 |   0.0953*  0.0302*  0.0047  -0.2173*  0.0606*  1.0000

• A star marks a correlation that is statistically significant at the 5% level; it does not by itself mean that the relationship is strong

• A population projection is a best guess of the number of people expected to be alive at a future date, based on assumptions about population size, births, deaths, and migration.
• It is a set of calculations which show the future course of the population, based on the assumptions used about fertility, mortality and migration.
• BASE POPULATION: this is the population classified by age and sex at the start of the projection period.
• SOURCES OF PROJECTION DATA
1. National statistics office
2. National university (academics)
3. Global population bureau

• Users of projections include:
• Government, e.g. for planning purposes
• Academia
• NGOs
• Donor community
• LEVELS OF PROJECTIONS
• Projections are generated at three levels
1. Short term (about 5 years), considered the best for reviewing population changes
2. Medium term (5-15 yrs)
3. Long term (over 15 yrs)

• There are basically two main methods for making projections
1. Mathematical methods, based on growth rates
2. Cohort component methods, which take into account the contribution of fertility, mortality and migration

PURPOSE OF POPULATION PROJECTIONS

Why do we bother to generate population projections?

1. Planning for developmental sectors at national and subnational levels

2. Provide information for further research

• Education: planning to build primary and secondary schools and to provide teaching staff in 2020 will depend on the projected proportion of children of primary and secondary entry ages (age 7 and age 15 respectively)
• Health sector
• Labour
• Energy
• Agriculture
• Housing
• Transport
• Water
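The notes name the two projection methods without giving formulas. The forms below are the common textbook versions, offered as a sketch and not as the formulas used in class. The symbols are assumptions: P_0 is the base population, r the annual growth rate, t the number of years, and B, D, I and E the births, deaths, immigrants and emigrants during one projection interval.

```latex
\begin{align}
P_t &= P_0\,(1 + r)^{t} && \text{mathematical method, geometric growth} \\
P_t &= P_0\,e^{rt} && \text{mathematical method, exponential growth} \\
P_{t+1} &= P_t + B_t - D_t + (I_t - E_t) && \text{cohort component balance}
\end{align}
```

As a made-up illustration, a base population of 1,000,000 growing at 3% a year for 5 years gives about 1,159,274 with the geometric form and about 1,161,834 with the exponential form. The cohort component method applies the balance separately to each age and sex group, which is why it needs the base population classified by age and sex.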


1. Select geographic area
2. Determine data usage
3. Determine the period for the projection
4. Gather input data
5. Make assumptions
6. Enter data into software
7. Examine projection output
8. Make alternative projections
9. Publish the results

Requirements and outputs


1 DemProj [Futures Group International]
2 PAS (Population Analysis Software) [US Census Bureau]

Demographic indicators
• Total population size
• Population aged 0-4
• Population aged 5-14
• Population aged 15-64
• Population aged 65+
• Total net international migration
Fertility indicators
• CBR (crude birth rate)
• NRR (net reproduction rate)
• GRR (gross reproduction rate)
• TFR (total fertility rate)
Mortality indicators
• CDR (crude death rate)
• Annual deaths
• IMR (infant mortality rate)
• Life expectancy
• Child mortality rate