Basic STATA Command

This document provides a summary of STATA commands for summarizing, visualizing, and analyzing quantitative data. Key commands covered include tabulate (tab) for creating frequency tables, summarize (sum) for describing characteristics of numerical variables, regress for linear regression analysis, and logistic for logistic regression analysis. The document emphasizes best practices for selecting appropriate statistical tests based on data type and distribution to obtain valid results.

### STATA command ###

>> DAY1 <<

Command

help {command} - Help on STATA command


describe +/-{var} - Shows details of the data (variable name, type of
variable, label)
summarize (sum) +/-{var} - For numerical data, describes characteristics of the data
(Mean, SD, Range) -> Numerical data only; results for categorical data have no meaning.
sum +/-{var}, detail - Describes more characteristics of the data, including
percentiles, median (= 50th percentile), SD, and variance

**TIPS Before analysis
1. Check the range (min, max) to make sure the data were collected appropriately (eg.
gender min must be 0, max must be 1)
2. Use describe and sum to check that the data are valid before performing any analysis
(see the example below).
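
**Example (a minimal sketch; age and gender are hypothetical variables, gender coded 0/1):
* check variable types, labels, and ranges before any analysis
describe age gender
summarize age gender
* gender should show min 0 and max 1; anything else suggests a data-entry problem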

disp {formula} - Calculator function; eg. disp (1+1)/(2*6), disp
sqrt(16), disp ln(10) ; in STATA, log = ln = natural logarithm
tab {var} - Create table with frequency, percentage, and
cumulative percentage -> Categorical data
tab {var1} {var2} - Create a cross-tabulation (eg. 2x2 table; row = var1, column = var2)
tab {var1} {var2}, col - Create a cross-tabulation (row = var1, column = var2) with
column percentages **Using column percentages is preferred.
tab {var1} {var2}, row - Create a cross-tabulation (row = var1, column = var2) with row
percentages
histogram {var} - Create a histogram -> Can be used to evaluate whether the
data are normally distributed
histogram {var}, by({var2}) - Create separate histograms of Var1 for each level of Var2
swilk {var} - Shapiro-Wilk W test for normal data (if not
significant -> normal)

**Tests of normality (Kolmogorov-Smirnov, Shapiro-Wilk) - cannot be used if n > 40
(they tend to be significant even when the distribution is truly normal)
**Easy ways to confirm normality (see the example below)
1. Eyeball test (histogram plot)
2. Size of the S.D. (< Mean/2 ?)
3. Mean = Median = Mode ?
**Clinical count data: 1. Mostly non-normally distributed, 2. Mostly right-skewed
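
**Example (a minimal sketch; sbp is a hypothetical continuous variable):
* eyeball the distribution, run Shapiro-Wilk, and compare mean vs median
histogram sbp
swilk sbp
sum sbp, detail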

gen {var1}={var2} - Create a new variable var1 with the value of var2
recode {var} {min1}/{max1}={cat1} … {minn}/{maxn}={catn} - Stratify continuous data
into categorical data with n strata (Note: each stratum rule is separated by a space)
recode {var} {min1}/{max1}={cat1} … {minn}/{maxn}={catn}, gen({newgroupname}) -
Combine the gen and recode commands in one line (see the example below)
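
**Example (a minimal sketch; age and the cut-points are hypothetical; recent STATA versions write each recode rule in parentheses):
* band continuous age into 3 categories stored in a new variable
recode age (18/39=1) (40/59=2) (60/99=3), gen(agegroup)
tab agegroup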
cii means {n} {mean} {SD} - Calculate the 95%
confidence interval of a Mean
cii proportions {n} {# successes} - Calculate the 95%
confidence interval of a Proportion

**TIPS Normally do not present the standard error of the mean in a manuscript; use the CI instead.
**Proportion -> STATA will use the binomial (Bernoulli) distribution for a proportion
variable -> the 95% CI is shown as 'Binomial Exact'
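
**Example (hypothetical numbers: mean 120 with SD 15 from n = 50, and 12 events out of 50 subjects; STATA 14+ syntax):
cii means 50 120 15
cii proportions 50 12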

tab {var1} {var2}, col chi2 - Create a cross-tabulation (row = var1,
column = var2) with column percentages, analysed with the Chi-square test
tab {var1} {var2}, col exact - Create a cross-tabulation (row = var1, column =
var2) with column percentages, analysed with Fisher's Exact probability test
sum {var1} if {var2 + comparator + value} - Summarize a variable with an if clause,
eg. sum a if b==1
ttest {var1}, by({var2}) - Test of means using the 2-sample t-test
with equal variances (Var1 by Var2)
sdtest {var1}, by({var2}) - Test of equal variances between the 2 groups
(checks the equal-variance assumption)
ranksum {var1}, by({var2}) - Wilcoxon rank-sum test of Var1 by Var2
(non-parametric)

**The t-test can only be used if both groups are normally distributed -> Use histograms to
evaluate first! (The t-test is a 'parametric test' -> it uses parameters, eg. Mean, SD)
**If not normally distributed -> Use a non-parametric test instead: Wilcoxon rank-sum
(= Mann-Whitney U test)
**Non-parametric statistics do not depend on the mean or SD of the data (but have lower
power compared to parametric tests)
**Using normally distributed data with a non-parametric test is OK, but not preferred
because of the lower power
**The rank-sum test is a test of rank sums, not a test of medians!
**Conservatively: always use the 2-sided p-value initially (Pr(|T| > |t|)), because we don't
actually know the direction of the difference
**If we know that the intervention can only shift the outcome in one direction -> we can
use the 1-sided p-value for the expected direction of difference (BUT this is not recommended)
**H0 = null hypothesis, Ha = alternative hypothesis
**Chi-square: use for 'LARGE' samples (how large is LARGE is not clearly defined). For a small
sample size, or if you don't want any assumption -> use Fisher's Exact test instead
**Fisher's Exact test uses a very complex calculation -> very slow for very large samples
-> using Chi-square there is accepted (results can be assumed to be equal)
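
**Example (a minimal sketch; exposure, outcome, and group are hypothetical binary variables, sbp is continuous):
* chi-square vs Fisher's exact on the same cross-tabulation
tab exposure outcome, col chi2
tab exposure outcome, col exact
* check equal variances, compare means, then the non-parametric alternative
sdtest sbp, by(group)
ttest sbp, by(group)
ranksum sbp, by(group)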

pwcorr {var1} {var2} - Pearson's pairwise correlation: - is a negative
correlation, + is positive, 0 means no linear correlation; the range can only be between
-1 and +1. A greater absolute value = greater strength of linear correlation!
pwcorr {var1} {var2}, sig - Pearson's pairwise correlation with p-value
spearman {var1} {var2} - Spearman's rho correlation

**Use Pearson's pairwise correlation in conjunction with a scatter plot (see the example below)
**Use correlation analysis for 'hypothesis-generating' purposes
**Correlation analysis -> cannot be used to determine effect size/slope; it tells only 2
things: 1. the direction, 2. how linear the relationship is
**Use Pearson's if the data are normally distributed
**Use Spearman's rho if the data are not normally distributed
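
**Example (a minimal sketch; weight and height are hypothetical continuous variables):
scatter weight height
pwcorr weight height, sig
spearman weight height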

oneway {var1} {var2} - Analysis of variance (ANOVA) of Var1 across the
multiple groups of Var2
oneway {var1} {var2}, tab - Create a summary table of Var1 by Var2, and do the
analysis of variance (ANOVA) of Var1 across the multiple groups of Var2
oneway {var1} {var2}, tab bon - Same as above, with Bonferroni-corrected pairwise
comparisons (multiple pairwise t-tests with p-value compensation)
kwallis {var1}, by({var2}) - Kruskal-Wallis rank test for multiple groups

**ANOVA is like the t-test, with the same assumptions (normally distributed, equal variances ->
this command uses Bartlett's test of equal variances; if significant -> variances are not
equal between groups)
**ANOVA is a parametric test
**We don't do the t-test 3 times instead of an analysis of multiple means ->
multiplicity; some may use the Bonferroni p-value correction (but this is not recommended)
**A multivariate analysis should also include variables that are not statistically significant but
show a difference
**Regression analysis is better than ANOVA, and thus more preferred
**The Kruskal-Wallis rank test is a non-parametric test for multiple groups; it does not depend
on the mean or SD of the data
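
**Example (a minimal sketch; sbp is continuous and group is a hypothetical 3-level variable):
oneway sbp group, tab bon
kwallis sbp, by(group)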

Regression plot (Linear plot) : Menu Graphics -> Two-way graph ->
Create -> select ‘Fit plot’ -> Linear prediction -> input X and Y variable -> Submit
Scatter plot : Menu Graphics -> Two-way graph ->
Create -> select ‘Basic plot’ -> Scatter plot -> input X and Y variable -> Submit
regress {var1} {var2} - Do the linear regression analysis using
var1 and var2 and display the constant and coefficient that form the linear formula (Y = a +
b(x), a = constant, b = coefficient)
regress {var1} i.{var2} **if var2 is 'strata' (group1, group2,…) - Do the
linear regression analysis (Y = base + 0(group0) + Coef1(group1) + Coef2(group2) +
…)
regress {var1} i.{var2}, base **if var2 is 'strata' (group1, group2,…) - Do the
linear regression analysis (Y = base + 0(group0) + Coef1(group1) + Coef2(group2) +
…), and also show the base group (group 0)
regress {var1} i.{var2} {var3} …{varn} - Do the linear regression analysis, adjusted
for Var3 to Var n (var3 to var n should be linearly associated with var1**)

**Linear regression plot - Creates the line that has the lowest cumulative (squared) distance between
the line and each data point in the scatter plot (least squares)
**Regression analysis = regressing to the mean/best-fit line
**The regress command = Gaussian regression (the Y data have a normal distribution). There are also
non-Gaussian regressions
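
**Example (a minimal sketch; sbp, age, weight are hypothetical continuous variables and agegroup is a strata variable):
regress sbp age
* baselevels (abbreviated 'base' above) also displays the reference group
regress sbp i.agegroup, baselevels
regress sbp i.agegroup weight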

>> DAY2 <<


Command

(Cohort study)

drop if {condition} - Drop observations that satisfy the if clause


cs {var y} {var x} - Cohort study -> Create 2x2 table,
calculate the risk ratio and risk difference with 95% CI and the Chi-square test result
cs {var y} {var x}, exact - Cohort study -> Create 2x2 table,
calculate the risk ratio and risk difference with 95% CI and Fisher's exact test result (use the
2-sided exact test result)
csi {value1} {value2} {value3} {value4} - Cohort study immediate command ->
Create the table using values 1-4
binreg {var1} {var2} {var3} … {var n}, rr - Do the multivariable regression
analysis between Var1 and Var2 using binomial regression (binreg), **adjusting for Var3 to Var n to
correct for confounding factors, then calculate the RR

**Using the cs command to create a 2x2 epi table reporting only the univariable risk ratio in a
cohort study is no longer acceptable (we have to use multivariable binary regression to
adjust for other confounding factors),
except for an RCT (OK because randomization leaves few confounding factors)
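
**Example (a minimal sketch of a hypothetical cohort: binary outcome death, exposure smoke, confounders age and sex):
cs death smoke
* immediate form with made-up cell counts: exposed cases, unexposed cases, exposed non-cases, unexposed non-cases
csi 10 5 90 95
* risk ratio for smoke adjusted for age and sex
binreg death smoke age sex, rr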

(Case-control study)

cc {var y} {var x} - Case-control study -> Create 2x2 table,
calculate the odds ratio with 95% CI and the Chi-square test result
cci {value1} {value2} {value3} {value4} - Case-control study immediate command
-> Create the table using values 1-4
logistic {var1} {var2} {var3} … {var n} (eg. logistic lbw smoke) - Do the multivariable regression
analysis between Var1 and Var2 using logistic regression, **adjusting for Var3 to Var n
to correct for confounding factors, then calculate the OR

**Risk-factor research (cohort study): the OR can be used in a cohort study, but it may
overestimate the risk ratio (it looks more dramatic!, and is frequently used in risk-factor
research)
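
**Example (a minimal sketch of hypothetical case-control data: case indicator case, exposure smoke, confounder age):
cc case smoke
* immediate form with made-up cell counts
cci 30 70 20 80
logistic case smoke age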

ir {var y} {var x} {follow-up time} - Create 2x2 table,
calculate the incidence rate, incidence rate ratio, incidence rate difference, and Fisher's
exact test result
poisson {var1} {var2}, exp({day}) irr - Poisson regression
analysis for a rate (univariable)
poisson {var1} {var2} {var3} … {var n}, exp({day}) irr - Poisson regression
analysis for a rate (multivariable, adjusted using Var3 to Var n) -> for an average rate
(the incidence rate must be constant at all points in time)
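
**Example (a minimal sketch; cases is a hypothetical event count, exposed a binary exposure, pyears the person-time, age a confounder):
ir cases exposed pyears
poisson cases exposed, exposure(pyears) irr
poisson cases exposed age, exposure(pyears) irr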
stset {day}, failure({var y}) - Prepare the data for
survival analysis (stset = survival-time set)
sts graph, hazard - Show the shape of the
smoothed hazard estimate -> the shape of the rate at all time points
sts graph, cumhaz - Show the shape of the
cumulative hazard curve
sts graph +/-(, surv) - Show the Kaplan-
Meier survival probability curve -> showing overall survival **The default of the sts graph
command is the KM curve, so there is no need to use 'surv'
sts graph, surv by({var}) - Show the Kaplan-
Meier survival probability curve stratified by var
sts graph, failure - Show the failure curve
(inversion of the KM curve) -> useful for showing a complication rather than a death outcome
stsum - Show the time at risk,
incidence rate, and survival time percentiles
sts graph if {condition} - Show the curve only for
observations that satisfy the specified condition
sts test {var} - Log-rank test of the
survival functions (non-parametric) -> only tells whether the survival curves differ or not
sts list, at({time1} {time2} {time3} …) surv - Show the survival
estimates at time1, time2, time3,…
stcox {var x} - Cox regression
analysis -> shows the hazard ratio compared by Var x
stcox i.{var x}, base - Cox regression
analysis if Var x is 'strata' (not a continuous value), also showing the base group

**If the incidence rate is not constant -> instantaneous rate: Poisson regression
cannot be used; use Cox regression analysis instead
**sts commands: only use after the stset command
**Median survival time = the time point at which only 50% of the study population still survives
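
**Example (a minimal sketch of hypothetical survival data: follow-up time days, event indicator died, treatment variable group):
stset days, failure(died)
sts graph, by(group)
sts test group
stcox i.group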

diagt {reference} {test} - Calculate all characteristics of a diagnostic test
(Sensitivity, Specificity, PPV, NPV), with 95% CIs included
roctab {reference} {test} - Create an ROC table -> helps display the
accuracy of a non-binary index test
roctab {reference} {test}, graph - Create an ROC curve -> helps display the
accuracy of a non-binary index test

**Accuracy = (True pos + True neg)/(All sample population)
**When the index test is not binary -> cannot directly evaluate sensitivity/specificity ->
we have to convert it into a binary test, eg. restratify groups 1,2,3,4,5 into (1,2) and (3,4,5)
**LR+ = odds of disease after a positive test / odds of disease in all patients -> use when the index
test is not binary instead of trying to create a sens/spec table (too crude!)
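
**Example (a minimal sketch; disease is a hypothetical reference standard, testpos a binary test result, score a non-binary index test; diagt is a user-written command, so it may need ssc install diagt first):
diagt disease testpos
roctab disease score, graph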

sampsi {proportion1} {proportion2}, power({value}) alpha({value}) ratio({value}) -
Sample size estimation based on the study objective (eg. comparing two proportions)
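
**Example (a minimal sketch with made-up inputs: detect a difference between proportions 0.25 and 0.40 with 80% power, alpha 0.05, 1:1 allocation; newer STATA releases also provide the power command):
sampsi 0.25 0.40, power(0.8) alpha(0.05) ratio(1)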
