STATA Command Summary

This document summarizes common STATA commands used in clinical statistics. It provides commands for data management and exploration, descriptive statistics, hypothesis testing, correlations, regression analysis, and cohort studies. Key tips are included such as checking for normality before parametric tests and using non-parametric alternatives when appropriate. Common plots like histograms, scatter plots, and regression lines are demonstrated. Parameter meanings and assumptions are explained for various tests.

### STATA command Summary ### (CMU Basic Clinical Statistic Course 2020)

>> DAY1 <<

Command

help {command} - Help on STATA command


describe +/-{var} - Show details about the data (variable name, type of variable, label)
summarize (or sum) +/-{var} - For numerical data, describe characteristics of the data (mean, SD, range). Results for categorical data have no meaning.
sum +/-{var}, detail - Describe more characteristics of the data, including percentiles, median (= 50th percentile), SD, and variance

**TIPS Before analysis


1. Check the range (min, max) to make sure that the data were collected appropriately (eg. for gender coded 0/1, min must be 0 and max must be 1)
2. Use describe and sum to check that the data are valid before performing any analysis.
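
**Example (a minimal sketch; it assumes STATA's bundled auto dataset, which is not part of the course material):

```stata
* Load an example dataset shipped with STATA (assumption: any dataset works the same way)
sysuse auto, clear
* Variable names, storage types, and labels
describe
* Mean, SD, min, max of numerical variables
summarize price mpg
* A 0/1 coded variable: min should be 0 and max should be 1
summarize foreign
* Percentiles, median, variance
summarize mpg, detail
```
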

disp {formula} - Calculator function; eg. disp (1+1)/(2*6), disp sqrt(16), disp ln(10); in STATA, log() = ln() = natural logarithm
tab {var} - Create table with frequency, percentage, and cumulative percentage ->
Categorical data
tab {var1} {var2} - Create 2x2 table (row = var1, column = var2) (Cross-tabulation)
tab {var1} {var2}, col - Create 2x2 table (row = var1, column = var2) with column percentage **Column percentage is preferred.
tab {var1} {var2}, row - Create 2x2 table (row = var1, column = var2) with row percentage
histogram {var} - Create histogram -> Can be used to evaluate whether the data are normally distributed
histogram {var}, by({var2}) - Create one histogram of Var1 for each group of Var2
swilk {var} - Shapiro-Wilk W test for normal data (if not significant -> normal)

**Tests of normality (Kolmogorov-Smirnov, Shapiro-Wilk) - should not be used if n > 40 (in large samples they tend to be significant even when the data are close to a normal distribution)
**Easy way to confirm normality
1. Eyeball test (Histogram plot)
2. Size of S.D. (< Mean/2 ?)
3. Mean = Median = Mode ?
**Clinical count data 1. Mostly non-normal distribution, 2. Mostly Right skewed
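
**Example (a sketch of the normality checks; it assumes the bundled auto dataset and the variable mpg, neither of which comes from the course):

```stata
sysuse auto, clear
* Eyeball test: histogram with a normal density curve overlaid
histogram mpg, normal
* Compare the mean with the median (50th percentile) and look at the SD
summarize mpg, detail
* Shapiro-Wilk W test: a non-significant p-value is compatible with normality
swilk mpg
```
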

gen {var1}={var2} - Create new variable var1 with value of var2


recode {var} ({min1}/{max1}={cat1}) … ({minn}/{maxn}={catn})
- Stratify continuous data into categorical data with n strata (Note: the rules are separated by spaces; without gen() the original variable is overwritten)
recode {var} ({min1}/{max1}={cat1}) … ({minn}/{maxn}={catn}), gen({newgroupname})
- Combine gen and recode commands in 1 line (the original variable is kept)
cii means {n} {mean} {SD} - Calculate 95% confidence interval of a mean from n, mean, and SD
cii proportions {n} {proportion} - Calculate 95% confidence interval from Proportion

**TIPS The standard error of the mean is normally not presented in a manuscript; use CI instead.
**Proportion -> STATA will use the binomial (Bernoulli) distribution for a proportion variable -> shows 95% CI as ‘Binomial Exact’
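
**Example (a sketch with made-up numbers; the syntax assumes STATA 14.1 or later, where cii takes the means/proportions keyword, and the third argument of cii means is the SD):

```stata
* Hypothetical summary data: n = 50, mean = 120, SD = 15
cii means 50 120 15
* Hypothetical summary data: 30 events among 100 subjects (exact binomial CI)
cii proportions 100 30
```

In older versions the same calculations are typed without the keyword (an assumption worth checking with help cii on your installation).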

tab {var1} {var2}, col chi2 - Create 2x2 table (row = var1, column = var2) with column percentage, analyse with
Chi-square
tab {var1} {var2}, col exact - Create 2x2 table (row = var1, column = var2) with column percentage, analyse with
Fisher’s Exact Probability test
sum {var1} if {var2 + comparator + value}
- Summarize variable with if clause eg. sum a if b==1
ttest {var1}, by({var2}) - Compare the mean of Var1 between 2 groups of Var2 using 2-sample t-test with equal variances
sdtest {var1}, by({var2}) - Variance-ratio test between 2 groups (checks the equal-variance assumption)
ranksum {var1}, by({var2}) - Compare 2 groups using the Wilcoxon rank-sum test (non-parametric)

**T-test can only be used if the data in both groups are normally distributed -> Use histogram to evaluate first! (T-test is a ‘Parametric test’ -> uses parameters eg. mean, SD)
**If not normally distributed -> Use a non-parametric test instead: Wilcoxon rank-sum (= Mann-Whitney U test)
**Non-parametric statistics - Do not depend on mean, SD of data (but have lower power compared to parametric tests)
**Using a non-parametric test on normally distributed data is ok, but not preferred because of lower power
**Ranksum test is a test of rank sums, not a test of medians!
**Conservatively: always use the 2-sided p-value initially (Pr(|T| > |t|)), because we don’t actually know the direction of the difference
**But if we know that the intervention can result in only 1 direction of difference -> We can use the 1-sided p-value of the expected direction (BUT not recommended)
**H0 = Null hypothesis, Ha = Alternative hypothesis
**Chi-square: Use for ‘LARGE’ samples (how much is LARGE is not clearly defined). If the sample size is small or you don’t want any assumption -> Use Fisher’s Exact test instead
**Fisher’s Exact test uses very complex calculation -> very slow if the sample size is very large -> Using Chi-square is accepted (results can be assumed to be equal)
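
**Example (a sketch of the two-group comparisons; it assumes the bundled auto dataset, with foreign as the 2-level grouping variable):

```stata
sysuse auto, clear
* Categorical vs categorical: column percentages with Chi-square and Fisher's exact test
tabulate rep78 foreign, col chi2 exact
* Check the equal-variance assumption first
sdtest mpg, by(foreign)
* 2-sample t-test assuming equal variances
ttest mpg, by(foreign)
* Same test without the equal-variance assumption
ttest mpg, by(foreign) unequal
* Non-parametric alternative: Wilcoxon rank-sum (Mann-Whitney U)
ranksum mpg, by(foreign)
```
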

pwcorr {var1} {var2} - Pearson’s pairwise correlation: - is negative correlation, + is positive, 0 is no linear correlation; the range can only be between -1 and +1. Greater absolute value = greater strength of linear correlation!
pwcorr {var1} {var2}, sig - Pearson’s pairwise correlation with p-value
spearman {var1} {var2} - Spearman’s rho correlation

**Use Pearson’s pairwise correlation in conjunction with scatter plot


**Use correlation analysis for ‘Hypothesis generating purpose’
**Correlation analysis -> cannot be used to determine effect size/slope; tells only 2 things: 1. Direction, 2. How linear the data are
**Use Pearson’s if the data is normally distributed
**Use Spearman’s rho if the data is not normally distributed
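
**Example (a sketch assuming the bundled auto dataset; mpg and weight are stand-ins for any two continuous variables):

```stata
sysuse auto, clear
* Always look at the scatter plot together with the coefficient
scatter mpg weight
* Pearson's correlation with p-value (for normally distributed data)
pwcorr mpg weight, sig
* Spearman's rho (for data that are not normally distributed)
spearman mpg weight
```
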

oneway {var1} {var2} - Analysis of variance (ANOVA) of Var1 by multiple groups of Var2
oneway {var1} {var2}, tab - Create summary table (mean, SD, frequency) of Var1 by Var2, and do the analysis of variance (ANOVA) of Var1 by multiple groups of Var2
oneway {var1} {var2}, tab bon - Create summary table of Var1 by Var2, and do the analysis of variance (ANOVA) of Var1 by multiple groups of Var2 with Bonferroni correction (multiple pairwise comparisons with p-value compensation)
kwallis {var1}, by({var2}) - Kruskal-Wallis rank test for comparing multiple groups

**ANOVA is like T-test, with the same assumptions (normal distribution, equal variance -> this command reports Bartlett’s test for equal variances; if significant -> variances are not equal between groups)
**ANOVA is a parametric test
**We don’t run T-test 3 times instead of analysing multiple means together -> Multiplicity. Some may use Bonferroni p-value correction (but not recommended)
**Multivariate analysis has to include variables that are not statistically significant but still show a difference
**Regression analysis is better than ANOVA, and thus preferred
**Kruskal-Wallis rank test is the non-parametric test for multiple groups. Does not depend on mean, SD of data
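
**Example (a sketch assuming the bundled auto dataset, with rep78 as a grouping variable that has more than 2 levels):

```stata
sysuse auto, clear
* ANOVA with summary table, Bartlett's test, and Bonferroni-adjusted pairwise comparisons
oneway mpg rep78, tabulate bonferroni
* Non-parametric alternative: Kruskal-Wallis rank test
kwallis mpg, by(rep78)
```
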

Regression plot (Linear plot) : Menu Graphics -> Two-way graph -> Create -> select ‘Fit plot’ -> Linear prediction ->
input X and Y variable -> Submit
Scatter plot : Menu Graphics -> Two-way graph -> Create -> select ‘Basic plot’ -> Scatter plot -> input
X and Y variable -> Submit
regress {var1} {var2} - Do the linear regression analysis using var1 and var2 and display constant and
coefficient to form linear formula (Y = a + b(x), a = constant, b = coefficient)
regress {var1} i.{var2} **if var2 is ‘strata’ (group1, group2,…)
- Do the linear regression analysis (Y = base + 0(group0) + Coef1(group1) +
Coef2(group2) +…)
regress {var1} i.{var2}, base **if var2 is ‘strata’ (group1, group2,…)
- Do the linear regression analysis (Y = base + 0(group0) + Coef1(group1) +
Coef2(group2) +…), and show base group (group 0)
regress {var1} i.{var2} {var3} … {varn}
- Do the linear regression analysis, adjusted for Var3 to Var n (Var3 to Var n have to be linearly associated with Var1**)
**Linear regression plot - Create the line that has the lowest total squared vertical distance between the line and each data point in the scatter plot (least error)
**Regression analysis = regress to the mean/best line
**regress command = Gaussian regression (Y data have a normal distribution). Non-Gaussian regression also exists
>> DAY2 <<

Command

(Cohort study)

drop if {condition} - Drop observations (rows) that meet the if clause


cs {var y} {var x} - Cohort study -> Create 2x2 table, calculate risk ratio, risk difference with 95% CI and
Chi-square test result
cs {var y} {var x}, exact - Cohort study -> Create 2x2 table, calculate risk ratio, risk difference with 95% CI and
Fisher’s exact test result (Use 2-sided exact test result)
csi {value1} {value2} {value3} {value4}
- Cohort study immediate command -> Create table using value 1-4
binreg {var1} {var2} {var3} … {var n}, rr
- Do the multivariable regression analysis between Var1 and Var2 using binary regression, **adjust for Var3 to Var n to correct for confounding factors, then calculate RR
**Using the cs command to create a 2x2 epitable with only a univariable risk ratio in a cohort study is not acceptable anymore (we have to use multivariable binary regression to adjust for other confounding factors)
Except for RCT (OK due to low confounding factors)
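
**Example (a sketch only; it assumes the lbw example dataset from webuse, treated here as if it were cohort data, and the csi counts are hypothetical):

```stata
webuse lbw, clear
* Univariable risk ratio and risk difference with Chi-square test
cs low smoke
* Same table with Fisher's exact test
cs low smoke, exact
* Immediate form, hypothetical counts in the order:
* exposed cases, unexposed cases, exposed noncases, unexposed noncases
csi 30 29 44 86
* Risk ratio for smoke adjusted for age (binary regression)
binreg low smoke age, rr
```
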

(Case-control study)

cc {var y} {var x} - Case-control study -> Create 2x2 table, calculate odds ratio, 95% CI and Chi-square
test result
cci {value1} {value2} {value3} {value4}
- Case-control study immediate command -> Create table using value 1-4
logistic {var1} {var2} {var3} … {var n} (eg. logistic lbw smoke)
- Do the multivariable regression analysis between Var1 and Var2 using logistic regression, **adjust for Var3 to Var n to correct for confounding factors, then calculate OR

**Risk factor research (Cohort study): OR can be used in a cohort study, but it may overestimate the risk ratio (but it looks dramatic!, and is frequently used in risk factor research)
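
**Example (a sketch assuming the lbw example dataset from webuse, in which the outcome variable is named low; the cci counts are hypothetical):

```stata
webuse lbw, clear
* Univariable odds ratio with 95% CI
cc low smoke
* Immediate form, hypothetical counts in the order:
* exposed cases, unexposed cases, exposed controls, unexposed controls
cci 30 29 44 86
* Odds ratio for smoke adjusted for age and lwt
logistic low smoke age lwt
```
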

ir {var y} {var x} {follow-up time} - Create 2x2 table, calculate Incidence rate, Incidence rate ratio,
Incidence rate difference and Fisher’s exact test result
poisson {var1} {var2}, exp(day) irr - Poisson regression analysis for rate (Univariable)
poisson {var1} {var2} {var3} … {var n}, exp(day) irr
- Poisson regression analysis for rate (Multivariable adjust using Var3 to Var n) -> for
Average rate (Incidence rate must be constant at all point of time)
stset {day}, failure({var y}) - Prepare data for survival analysis (stset = survival time set)
sts graph, hazard - Show shape of smoothed hazard estimate -> shape of rate at all time points
sts graph, cumhaz - Show shape of cumulative hazard curve
sts graph, (+/-surv) - Show Kaplan-Meier survival probability curve -> Showing overall survival **Default of sts graph command is KM curve, no need to use ‘surv’
sts graph, surv by({var}) - Show Kaplan-Meier survival probability curve stratified by var
sts graph, failure - Show failure curve (inversion of KM curve) -> For showing complications, not death outcome
stsum - Show time at risk, incidence rate, and survival time percentiles
sts graph if {condition} - Show curve for observations that comply with the specified condition
sts test {var} - Log-rank test for survival functions across groups of var (non-parametric) -> Only tells whether the survival curves are different or not
sts list, at({time1},{time2},{time3},…) surv
- Show survival list at time1, time2, time3,…
stcox {var x} - Cox-regression analysis -> show Hazard ratio compare by Var x
stcox i.{var x}, base - Cox-regression analysis if Var X is ‘strata’ (not value) and show base

**If incidence rate is not constant (instantaneous rate) -> cannot use Poisson regression; use Cox-regression analysis instead
**sts commands: Only use after the stset command
**Median survival time = Time point at which only 50% of the study population is still surviving
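
**Example (a sketch assuming the drugtr example dataset from webuse, with studytime as follow-up time, died as the event, and drug as the group; none of these come from the course data):

```stata
webuse drugtr, clear
* Declare survival-time data: follow-up time and event indicator
stset studytime, failure(died)
* Time at risk, incidence rate, and survival time percentiles by group
stsum, by(drug)
* Kaplan-Meier curves by group
sts graph, by(drug)
* Log-rank test comparing the groups
sts test drug
* Survival probabilities at chosen time points
sts list, at(10 20 30) by(drug)
* Hazard ratio for drug adjusted for age
stcox i.drug age
* Average rate ratio, valid only if the rate is constant over time
poisson died i.drug, exposure(studytime) irr
```
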

diagt {reference} {test} - Calculate all characteristics of a diagnostic test (Sens, Spec, PPV, NPV), with 95% CI included (user-written command; must be installed before use)
roctab {reference} {test} - Create ROC table -> Helps display accuracy of a non-binary index test
roctab {reference} {test}, graph - Create ROC curve -> Helps display accuracy of a non-binary index test

**Accuracy = (True pos + True neg)/(All sample population)


**When the index test is not binary -> Cannot directly evaluate sensitivity/specificity -> We have to convert it into binary eg. regroup 1,2,3,4,5 into (1,2) and (3,4,5)
**LR+ = Odds of disease given the test result / Odds of disease in all patients -> Use when the index test is not binary instead of trying to create a sens/spec table (too crude!)
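
**Example (a sketch assuming the hanley example dataset from webuse, with disease as the reference and a 5-level rating as the index test; the cutoff of 3 and the name testpos are arbitrary choices for illustration):

```stata
webuse hanley, clear
* Area under the ROC curve for the 5-level index test
roctab disease rating
* Sensitivity, specificity, and likelihood ratios at each cutoff
roctab disease rating, detail
* ROC curve
roctab disease rating, graph
* Convert to a binary test: ratings 3-5 count as positive (hypothetical cutoff)
generate byte testpos = rating >= 3 if !missing(rating)
* Column percentages: sensitivity in the diseased column, specificity in the non-diseased column
tabulate testpos disease, col
```
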

sampsi {proportion1} {proportion2}, power({value}) alpha({value}) ratio({value})
- Objective-based sample size estimation (comparison of 2 proportions)
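
**Example (a sketch with hypothetical proportions of 30% vs 50%; the second line uses the newer power command, available from STATA 13, as an assumed equivalent):

```stata
* Sample size for comparing 2 proportions, 80% power, 2-sided alpha 0.05, equal group sizes
sampsi 0.30 0.50, power(0.80) alpha(0.05) ratio(1)
* Newer command for the same design
power twoproportions 0.30 0.50, power(0.80) alpha(0.05) nratio(1)
```

The two commands may report slightly different sample sizes because their default methods differ (for example, in the use of a continuity correction).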
