Stata Session 1 KA (Class)

The document discusses basic data management tasks in Stata including describing datasets, inspecting missing values, computing new variables, recoding and labeling variables. It also discusses importing data, opening datasets, listing variables, dropping and keeping observations, and defining value labels.

Uploaded by

jmn 06

Stata Session 1

The objective of this session is to introduce Stata’s basic data management commands, which will enable you to explore your data set. By the end of the session you will be able to describe your dataset, inspect missing values, compute new variables, and recode and label variables.

The Dirty Data Theorem states that “real world” data tend to come from bizarre and unspecifiable distributions of highly correlated variables and have unequal sample sizes, missing data points, non-independent observations and an indeterminate number of inaccurately recorded values.
-Unknown

Data management and manipulation:

Basic Stata code structure:
    command varlist, options

Use Stata help files:
    help stata_command

- Importing data sets into Stata:
The safest way is to import .csv files, as the .csv format is recognized by most statistical platforms.
    import delimited C:\filepath.csv

- Opening a regular .dta (Stata format) file:
    use C:\filepath.dta, clear

Describe all variables of the data set in memory:
    describe

Describe specific variables:
    describe [varlist]

An alternative to the command describe is codebook. codebook examines the variable names, labels, and data to produce a codebook describing the dataset.
    codebook
    codebook [varlist]

Inputting data into Stata:
    input patient_id age weight
    1 12 45
    2 18 67
    3 13 50
    4 26 77
    5 25 75
    6 13 56
    7 36 85
    8 19 52
    end

List all variables for all observations:
    list

List all variables for a specific number of observations:
    list in 1/6

List specific variables for a specific number of observations:
    list [varlist] in 1/6

Delete the first 6 observations:
    drop in 1/6

Keep the first 10 observations:
    keep in 1/10

The same logic applies to variables:
    drop [varlist]

Defining labels and values for variables
    label define age_cat 1 "less than 20" 2 "20-24" 3 "25-29" 4 "30-34" 5 "35-39" 6 "40-44" 7 "45+"

Note that the quotes must be straight quotes ("), not curly ones. label define only creates the value label; attach it to a variable with label values varname age_cat.

encode and destring functions

encode creates a new variable named newvar based on the string variable varname, creating, adding to, or just using (as necessary) the value label newvar or, if specified, name.

destring converts variables in varlist from string to numeric.

===================================
    webuse hbp2, clear
    codebook
    encode sex, gen(gender)
    codebook
===================================
    webuse destring1, clear
    destring id, replace
    replace total = "toto" in 2
    destring total, replace
===================================
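In the destring1 example, the last destring does not convert total, because the value "toto" typed into observation 2 is not a number. The lines below are a minimal sketch of one way to deal with this, assuming you are willing to turn nonnumeric entries into missing values. It uses the force option of destring and is an illustration, not part of the original session.

```stata
* Sketch only: reuses the destring1 example data from the handout
webuse destring1, clear

* Put a nonnumeric entry into the string variable total
replace total = "toto" in 2

* Without force, destring notes the nonnumeric characters and leaves total as a string
destring total, replace

* With force, nonnumeric entries become missing (.) and the rest are converted
destring total, replace force

* Check which observations ended up missing
list total if missing(total)
```

Because force silently discards whatever it cannot convert, it is worth listing the affected observations before relying on the result.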


Application 0

Open the wws data set. However, before starting to work on any Stata data set, we need to open a log file to save our output and, more importantly, use the do-file editor, which is the equivalent of the SPSS syntax file.

    log using "C:\filepath\wws.log"
    use "C:\filepath.dta", clear

Use the describe command to inspect your data set. How many variables are there? How many observations?

Check the variable collgrad. What is this variable supposed to tell us? Does it have any missing values?

Check the variable race. It should consist of only 3 levels. What is the idcode of the erroneous entry?

Inspect the variable wage, which contains the hourly wage in dollars for the previous week. Prior knowledge tells us that the hourly wage should range from $0 to $300. What can you tell? Hint: you can use the option detail.

Inspect the variables married and nevermarried. What can you note?
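One possible way to work through the Application 0 questions is sketched below. It assumes the wws data set is already in memory and that race is stored as a numeric variable coded 1 to 3; that coding is an assumption, so check it with codebook race first.

```stata
* Sketch only: assumes wws is in memory and race is numeric, coded 1 to 3
describe

* Frequency tables, including missing values
tabulate collgrad, missing
tabulate race, missing

* List the idcode of any entry outside the 3 expected levels
list idcode race if !inlist(race, 1, 2, 3) & !missing(race)

* Detailed summary: percentiles plus the smallest and largest values
summarize wage, detail

* Cross-tabulate the two marital status variables to spot contradictions
tabulate married nevermarried, missing
```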
Application 1

Open the bmi data set. However, before starting to work on any Stata data set, we need to open a log file to save our output and, more importantly, use the do-file editor, which is the equivalent of the SPSS syntax file.

    log using "C:\filepath\bmi.log"
    use "C:\filepath.dta", clear

What is the storage type of the variable “name”?

What is the number of observations? The number of variables?

Give the variable energy the label “total energy expenditure” and the variable bmi the label “body mass index”.

Are there any missing observations?

Using the inspect command, provide a quick summary of your data. Can you pinpoint any abnormal entry?

Suppose we want to give each respondent a unique id number so that they are easier to track.

    gen id=_n

Replace the bmi of the respondent with id 17 with a positive value.

    replace bmi=26 if id==17

Another way to do it:

    replace bmi=25 in 17

How do we find the average (mean), the median and the standard deviation of the variable bmi?

Hint: help summarize

Now let’s calculate the average bmi separately for males and females, but first we have to assign value labels for gender.

A faster way to do it is by using the command bysort:

    help bysort
    bysort sex: sum bmi

What do you notice? Any extreme observations?

Try to replace the extreme observations with a missing value.

Let’s recalculate the mean bmi for males and females separately.

We need to categorize bmi into 4 categories (underweight, normal, overweight and obese) using the following ranges:

<20 (underweight), 20-25 (normal), 26-30 (overweight), >30 (obese)

    gen bmi_cat=.
    replace bmi_cat=1 if bmi >=0 & bmi<20
    replace bmi_cat=2 if bmi >=20 & bmi<=25
    replace bmi_cat=3 if bmi >25 & bmi<=30
    replace bmi_cat=4 if bmi >30 & bmi!=.

Define the value label (with straight quotes) and attach it to the variable:

    label define bmi_cat 1 "underweight" 2 "normal" 3 "overweight" 4 "obese"
    label values bmi_cat bmi_cat

Let’s give the label “bmi categorized” to the variable bmi_cat.

    label var bmi_cat "bmi categorized"

Let us check the newly created variable.

    tab bmi_cat

Suppose we want to merge the 2 categories obese and overweight together.

    recode bmi_cat (4=3), gen(bmi_cat1)
    label define bmi_cat1 1 "underweight" 2 "normal" 3 "overweight"
    label values bmi_cat1 bmi_cat1

Give a label to the newly created variable.

Check the distribution of bmi_cat1.

Now recreate bmi_cat1, but let’s call it bmi_cat2, using the “generate” method.

    gen bmi_cat2=bmi_cat
    replace bmi_cat2=3 if bmi_cat2==4
    label define bmi_cat2 1 "underweight" 2 "normal" 3 "overweight"
    label values bmi_cat2 bmi_cat2

Application 2

Open dataset inmates.dta.

Start by describing and inspecting the data set:

- How many respondents are there?
- What is the number of variables used in this data set?
- Provide appropriate labels for the following variables: age gender marital_stat education height weight
- What are the values assigned for the variables gender and marital status?
- Assign the following value labels: gender 1 (males), 2 (females); marital status 1 (never married), 2 (married), 3 (divorced)
- How many missing values do we have for the following variables: age gender marital_stat education height weight?
- Categorize age into 4 groups: 14 to 29, 30 to 49, 50 to 69 and >69
- Appropriately label each category.
- Using the formula weight / (height in meters) squared, calculate the BMI of inmates. However, first we have to transform height from cm to meters.
  Hint: help gen
- Knowing that the condition “and” is denoted as “&” and the condition “or” is denoted as “|”, categorize the newly created bmi into 4 categories as follows:
  FOR FEMALE INMATES: <18.5 (underweight), 18.5-25 (normal), 26-30 (overweight), >30 (obese)
  FOR MALE INMATES: <20 (underweight), 20-25 (normal), 26-30 (overweight), >30 (obese)

Produce the mean bmi for male and female inmates separately in two different ways.

- Generate a variable summarizing whether a respondent has at least 1 chronic disease (diabetes hyperlipidemia anemia asthma migraine)
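The BMI and chronic disease steps of Application 2 could be approached along the lines of the sketch below. It rests on several assumptions that are not stated in the handout: height is recorded in cm, weight in kg, gender is coded 1 (males) and 2 (females), and each disease variable is coded 0/1. The names height_m, bmi_cat and chronic are hypothetical and chosen only for illustration.

```stata
* Sketch only: assumes height in cm, weight in kg, gender coded 1 (males)
* and 2 (females), and each disease variable coded 0/1

* Convert height from cm to meters, then compute BMI
gen height_m = height/100
gen bmi = weight/(height_m^2)

* Sex-specific categories; observations with missing bmi or gender stay missing
gen bmi_cat = .
replace bmi_cat = 1 if (gender==2 & bmi<18.5) | (gender==1 & bmi<20)
replace bmi_cat = 2 if (gender==2 & bmi>=18.5 & bmi<=25) | (gender==1 & bmi>=20 & bmi<=25)
replace bmi_cat = 3 if bmi>25 & bmi<=30 & !missing(gender)
replace bmi_cat = 4 if bmi>30 & !missing(bmi) & !missing(gender)

label define bmi_cat 1 "underweight" 2 "normal" 3 "overweight" 4 "obese"
label values bmi_cat bmi_cat

* Mean bmi by gender, two ways
bysort gender: summarize bmi
tabulate gender, summarize(bmi)

* 1 if at least one chronic disease is reported, 0 otherwise
* (a missing disease value is treated here as "not reported")
gen chronic = (diabetes==1 | hyperlipidemia==1 | anemia==1 | asthma==1 | migraine==1)
```

The !missing(bmi) condition matters because Stata treats a missing value as larger than any number, so bmi>30 alone would place inmates with missing BMI in the obese category.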
