Week 12 - Data Analysis
CONTENTS procedure
• The CONTENTS procedure shows the contents of a SAS data set and prints the directory of the SAS library.
PROC CONTENTS <option-1 <...option-n>>;
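As a minimal sketch, the two calls below first describe one data set and then list the directory of a library. SASHELP.CLASS is a sample data set shipped with SAS and is used here only for illustration; it is not part of the lecture data.

```sas
/* Describe one data set: variable names, types, lengths, labels */
proc contents data=sashelp.class;
run;

/* List only the directory of the library; NODS suppresses the
   details of the individual data sets */
proc contents data=sashelp._all_ nods;
run;
```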
Part 1.2 Data cleaning
• Data cleaning is the process of editing, correcting, and structuring data within a data set
so that it’s generally uniform and prepared for analysis
• Data cleaning is one of the important processes involved in data analysis
Activities in data cleaning
• Removal of unwanted observations
  o Duplicate observations
  o Irrelevant observations
• Fix data structure
• Define missing values
• Filter out data outliers
• Removal of invalid values
• Data transformation
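One of these activities, removing duplicate observations, could be sketched as follows. The data set DEMO, its variables, and the output data set names are hypothetical and only serve to make the example self-contained.

```sas
/* Hypothetical data with one duplicated patient number */
data demo;
   input patno $ gender $;
   datalines;
001 M
002 F
002 F
003 M
;
run;

/* Keep the first observation per PATNO; the discarded duplicates
   are written to DEMO_DUPS so they can be reviewed */
proc sort data=demo out=demo_nodup dupout=demo_dups nodupkey;
   by patno;
run;
```

Writing the removed rows to a separate data set keeps a record of what was dropped, which is usually safer than deleting duplicates silently.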
Demonstration using Patient.txt
PROC FREQ DATA=PATIENTS;
   TITLE "FREQUENCY COUNTS";
   TABLES GENDER AE / NOCUM NOPERCENT;
RUN;

proc freq data=patients;
   table _character_;
run;

Output for Gender
Gender   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1             1        1.02             1                    1.02
F            52       53.06            53                   54.08
M            43       43.88            96                   97.96
f             1        1.02            97                   98.98
x             1        1.02            98                  100.00
Frequency Missing = 3
2.1.3 Checking For Invalid Character Values
DATA _NULL_;
   INFILE '…\Patients.txt' PAD;
   INPUT … @38 AE $1.;
   ***CHECK GENDER;
   …
   … PUT PATNO= AE=;
RUN;
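Because parts of the slide code are not recoverable, the following is only a sketch of the idea: read each record and print the patient number whenever a character variable holds a value outside its valid set. The inline data, the list input, and the valid sets ('F'/'M' for GENDER, '0'/'1' for AE) are assumptions for illustration, not the layout of Patients.txt.

```sas
data _null_;
   input patno $ gender $ ae $;
   file print;   /* send PUT output to the output window, not the log */

   /* ASSUMED valid values: F or M for GENDER, 0 or 1 for AE */
   if gender not in ('F', 'M', ' ') then put patno= gender=;
   if ae not in ('0', '1', ' ') then put patno= ae=;
   datalines;
001 M 0
002 F 1
003 x 0
004 f A
;
```

With these four hypothetical records, patients 003 and 004 are flagged for GENDER and patient 004 is also flagged for AE.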
2.2 Checking For Invalid Numerical Values
• PROC MEANS and PROC UNIVARIATE can be useful as a first step in data cleaning for numeric variables.
• A DATA step can also be used to check numeric values.
2.2.1 Mean Procedure
PROC MEANS <option(s)> <statistic-keyword(s)>;
BY <DESCENDING> variable-1 <<DESCENDING> variable-2 …>
<NOTSORTED>;
CLASS variable(s) </ option(s)>;
FREQ variable;
ID variable(s);
OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)>
<id-group-specification(s)> <maximum-id-specification(s)>
<minimum-id-specification(s)> </ option(s)> ;
TYPES request(s);
VAR variable(s) </ WEIGHT=weight-variable>;
WAYS list;
WEIGHT variable;
Proc means statistic-keyword(s)
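A few statistic keywords that are handy for data cleaning (N, NMISS, MIN, MAX) could be requested as in the sketch below. The data set VITALS and the variables HR and SBP are hypothetical; they are not taken from the lecture data.

```sas
/* Hypothetical data with one missing and two implausible values */
data vitals;
   input patno $ hr sbp;
   datalines;
001 88 140
002 84 120
003 . 400
004 900 130
;
run;

/* Count non-missing and missing values and show the range of each
   variable so that impossible values stand out */
proc means data=vitals n nmiss min max maxdec=1;
   var hr sbp;
run;
```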
Part 2 - Categorical data analysis
Lecture Outlines
• Choosing the correct statistical test for categorical data
• Categorical data analysis - one sample
  o Binomial test
  o Chi-square goodness of fit test
• Categorical data analysis - two samples
  o Two independent samples (2x2 table)
    - Chi-square test (large sample)
    - Fisher's exact test (small sample)
    - Crude odds ratio and relative risk
  o Two matched samples
    - Paired samples: McNemar test and kappa coefficient
    - Odds ratio estimation for a matched pair case-control study
• Chi-square test for trend (2 x N table)
• Stratified tables analysis - the cohort Mantel-Haenszel statistics
Acknowledgement
• The lecture slides are developed based on several different resources including:
  o Online SAS help
  o Book: Applied Statistics and the SAS Programming Language, 5th Edition
  o Online reference from the UCLA Institute for Digital Research & Education
2.1 Choosing correct statistical tests
Types of variable
• Categorical or nominal
  o Gender (male vs. female)
• Ordinal
  o Age group (<20, 20-39, 40-59, 60+)
  o Smoking (0, 1-4, 5-9, 10+ cigarettes per day)
• Interval (also called numerical)
  o Cholesterol levels (mg/dL)
  o Weight (pounds)
Statistical tests for categorical data
Number of dependent variables | Nature of independent variables | Nature of dependent variable(s) | Test(s)
1 | 0 IVs (1 population) | categorical (2 categories) | Binomial test
1 | 0 IVs (1 population) | categorical (2+ categories) | Chi-square goodness-of-fit
1 | 1 IV with 2 levels (independent groups) | categorical (large sample size) | Chi-square test
1 | 1 IV with 2 levels (independent groups) | categorical (small sample size) | Fisher's exact test
1 | 1 IV with 2 or more levels (independent groups) | categorical | Chi-square test
Statistical tests for categorical data 2
Number of dependent variables | Nature of independent variables | Nature of dependent variable(s) | Test(s)
1 | 1 IV with 2 levels (dependent/matched groups) | categorical | McNemar test
1 | 1 IV with 2 or more levels (dependent/matched groups) | categorical (2 categories) | Conditional logistic regression
1 | 2 or more IVs (independent groups) | categorical (2+ categories) | Logistic regression
1 | 1 interval IV | categorical | Simple logistic regression
1 | 1 or more interval IVs and/or 1 or more categorical IVs | categorical | Multiple logistic regression
Variable name | Description
female | Gender of students
• For example
  o H0: P = 0.5
Example: Binomial test
• In the HSB data, test whether the proportion of females (female) differs significantly from 50%, i.e., from 0.5.
• We will use the EXACT statement to produce the exact p-values.
proc freq data = wk11.hsb2;
tables female / binomial(p=.5);
exact binomial;
run;
TABLES statement
TABLES requests </ options> ;
• The TABLES statement requests one-way to n-way frequency and crosstabulation tables and statistics for those tables.
• Options:
  o BINOMIAL <(binomial-options)> | BIN <(binomial-options)>
    requests the binomial proportion for one-way tables. When you specify this option, by default PROC FREQ provides the asymptotic standard error, asymptotic Wald and exact (Clopper-Pearson) confidence limits, and the asymptotic equality test for the binomial proportion.
EXACT statement
EXACT statistic-options </ computation-options> ;
• The EXACT statement requests exact tests and confidence limits for selected statistics. The statistic-options identify which statistics to compute, and the computation-options specify options for computing exact statistics.
• Statistic options
  o BINOMIAL | BIN
    requests an exact test for the binomial proportion (for one-way tables).
Output
• The results indicate that there is no statistically significant difference (p = .2292). In other words, the proportion of females in this sample does not significantly differ from the hypothesized value of 50%.
2.2 Chi-square goodness of fit (2+ categories)
• A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions.
Example
• Suppose we believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White people. We want to test whether the observed proportions in our sample differ significantly from these hypothesized proportions.
• chisq-options
  o TESTP=(values) | SAS-data-set
    specifies null hypothesis proportions for the one-way chi-square goodness-of-fit tests.
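The test described above could be requested roughly as follows. The variable name race, and the assumption that its levels sort in the order Hispanic, Asian, African American, White, are not stated on the slides; the TESTP values must be listed in the same order as the levels appear in the frequency table.

```sas
/* ASSUMPTION: wk11.hsb2 has a variable RACE whose levels sort as
   Hispanic, Asian, African American, White */
proc freq data=wk11.hsb2;
   /* TESTP values are given as percentages that sum to 100 */
   tables race / chisq testp=(10 10 10 70);
run;
```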
Output
• These results show that the racial composition in our sample does not differ significantly from the hypothesized values that we supplied (chi-square with three degrees of freedom = 5.0286, p = .1697).
2.3 R x C table analysis
3.1 Two or more independent sample populations
• Using the hsb2 data file, let’s see if there is a relationship between the type of school
attended (schtyp) and students’ gender (female).
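A minimal sketch of this test, assuming the same wk11.hsb2 data set used in the binomial example:

```sas
proc freq data=wk11.hsb2;
   /* CHISQ requests the Pearson chi-square test; for a 2x2 table
      PROC FREQ also reports Fisher's exact test.
      EXPECTED prints the expected cell counts, which helps decide
      whether the large-sample test is appropriate. */
   tables schtyp*female / chisq expected;
run;
```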
                 Cases   Controls
Exposure  Yes      50       20
          No      100      130
Using the code to generate the dataset
DATA ODDS;
INPUT OUTCOME $ EXPOSURE $ COUNT;
DATALINES;
CASE 1-YES 50
CASE 2-NO 100
CONTROL 1-YES 20
CONTROL 2-NO 130
;
Code for OR calculations
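The slide code is not reproduced here, so the following is only one possible way to obtain the odds ratio from the ODDS data set created above. The choice of the RELRISK option and the title text are assumptions.

```sas
proc freq data=odds;
   title "Odds Ratio for a Case-Control Study";
   /* Rows = exposure, columns = outcome. RELRISK prints the odds
      ratio with confidence limits for a 2x2 table. */
   tables exposure*outcome / chisq relrisk;
   /* Each data line is a cell count, not a single subject */
   weight count;
run;
```

As a check by hand, the crude odds ratio from the table is (50 x 130) / (20 x 100) = 3.25.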
                    Heart Attack
                    Yes    No
Cholesterol  High    20    80
             Low     15   135
Create the dataset using the table
DATA RR;
LENGTH GROUP $ 9;
INPUT GROUP $ OUTCOME $ COUNT;
DATALINES;
HC Y 20
HC N 80
LC Y 15
LC N 135
;
Code for analysis
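As with the odds ratio, the slide code is not available, so this is a hedged sketch using the RR data set created above. ORDER=DATA is an assumption made here so that the first row is HC and the first column is Y, which makes the column 1 relative risk the risk of a heart attack in the high versus the low cholesterol group.

```sas
proc freq data=rr order=data;
   title "Relative Risk for a Cohort Study";
   /* Rows = cholesterol group, columns = heart attack (Y, N) */
   tables group*outcome / relrisk;
   weight count;
run;
```

By hand, the relative risk is (20/100) / (15/150) = 2.0.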
                     Radiologist 2
                     no    yes
Radiologist 1  no    25     3
               yes    5    50
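For paired ratings like these, the McNemar test and the kappa coefficient could be obtained with the AGREE option, as sketched below. The data set name RADIOL and the variable names RAD1 and RAD2 are hypothetical.

```sas
/* Cell counts from the table above */
data radiol;
   input rad1 $ rad2 $ count;
   datalines;
no no 25
no yes 3
yes no 5
yes yes 50
;
run;

proc freq data=radiol;
   title "McNemar Test and Kappa";
   /* AGREE requests McNemar's test and the kappa coefficient
      for a 2x2 table of paired ratings */
   tables rad1*rad2 / agree;
   weight count;
run;
```

McNemar's statistic uses only the discordant cells: (3 - 5)^2 / (3 + 5) = 0.5 with 1 degree of freedom.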
3.2.2 Odds Ratio for matched pair case-control study
• A matched pair case-control study was conducted to examine the effect of alcohol consumption on liver disease. The data are listed below.
                     Cases
                     present   absent
Controls  present      15         5
          absent       20        60
Create the data using do loop
data a;
do case = 'present','absent';
do control = 'present','absent';
input count @@;
output;
end;
end;
datalines;
15 20
5 60
;
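In a matched pair design the odds ratio is estimated from the discordant pairs only, not from the usual cross-product of the 2x2 table. The sketch below computes it directly in a DATA step and uses the AGREE option for McNemar's test on data set A created above. The variable names in the DATA step and the Wald-type confidence interval on the log scale are illustrative assumptions, not the slide's method.

```sas
/* Matched-pair odds ratio = ratio of the two discordant counts */
data _null_;
   b = 20;   /* case exposed, control not exposed */
   c = 5;    /* case not exposed, control exposed */
   or_matched = b / c;
   /* approximate 95% limits: exp(log(OR) +/- 1.96*sqrt(1/b + 1/c)) */
   se_log = sqrt(1/b + 1/c);
   lcl = exp(log(or_matched) - 1.96*se_log);
   ucl = exp(log(or_matched) + 1.96*se_log);
   put or_matched= lcl= ucl=;   /* written to the SAS log */
run;

/* McNemar's test of the discordant pairs */
proc freq data=a;
   tables case*control / agree;
   weight count;
run;
```

Here the matched odds ratio is 20 / 5 = 4, whereas the cross-product (15 x 60) / (20 x 5) = 9 ignores the matching.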
Wrong code for OR
                      Group
                      A    B    C    D
Test results  Fail   10   15   14   25
              Pass   90   85   86   75
SAS code for test
DATA TREND;
INPUT RESULT $ GROUP $ COUNT @@;
DATALINES;
FAIL A 10 FAIL B 15 FAIL C 14 FAIL D 25
PASS A 90 PASS B 85 PASS C 86 PASS D 75
;
PROC FREQ DATA=TREND;
TITLE "Chi-square Test for Trend";
TABLES RESULT*GROUP / CHISQ;
WEIGHT COUNT;
RUN;
Output
• There is a significant linear trend in proportions of Fail from group A through D.
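As an alternative way to test for trend in a 2 x C table, PROC FREQ also offers the TREND option, which requests the Cochran-Armitage test. The sketch below assumes the TREND data set created above; it is an addition, not the code used on the slide.

```sas
proc freq data=trend;
   title "Cochran-Armitage Test for Trend";
   /* TREND requires a 2 x C or R x 2 table; groups A to D are
      treated as equally spaced ordered levels */
   tables result*group / trend;
   weight count;
run;
```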
3.4 Stratified tables
• Sometimes we need to stratify the study group into several subgroups based on one or
two factors. Then we examine 2*2 tables or R*C tables in each subgroup and generate a
summary across the subgroups.
Stratification-Why?
• To better control the confounding effects of a third factor, we need to examine the
association between an exposure and a disease in each stratum of this third variable.
• Objectives of this analysis include:
  o Determine whether the association between an exposure and a disease is similar across the strata of this third variable or differs significantly between them.
  o If the associations differ, the stratum-specific estimates of the effect of the exposure on the disease must be presented.
  o If they do not differ, the cohort Mantel-Haenszel statistics can provide summary statistics across all the strata.
Example
• We examine the relationship between hours of sleep and the chance of failing a test by
gender.
Boys                 TEST
                     Fail   Pass
       Sleep  Low     20    100
              High    15    150

Girls                TEST
                     Fail   Pass
       Sleep  Low     30    100
              High    25    200
Code to create the dataset
DATA ABILITY;
   INPUT GENDER $ RESULTS $ SLEEP $ COUNT;
DATALINES;
BOYS FAIL 1-LOW 20
BOYS FAIL 2-HIGH 15
BOYS PASS 1-LOW 100
BOYS PASS 2-HIGH 150
GIRLS FAIL 1-LOW 30
GIRLS FAIL 2-HIGH 25
GIRLS PASS 1-LOW 100
GIRLS PASS 2-HIGH 200
;
Code for analysis
PROC FREQ DATA=ABILITY;
TITLE "Mantel-Haenszel Chi-square Test";
TABLES GENDER*SLEEP*RESULTS/ALL;
WEIGHT COUNT;
RUN;
• ALL requests all tests and measures that are produced by the CHISQ, MEASURES, and CMH options.
Output
• The Breslow-Day test suggests that the relationship between hours of sleep and failing the test does not differ between boys and girls.
Step 2: provide the summary statistics across all the strata
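If only the stratified summary is wanted, the CMH option alone may be enough. The sketch below assumes the ABILITY data set created above; the title text is an assumption.

```sas
proc freq data=ability;
   title "Summary Statistics Across Strata";
   /* GENDER is the stratification variable. CMH requests the
      Cochran-Mantel-Haenszel statistics, the common odds ratio and
      relative risk estimates, and the Breslow-Day test for
      homogeneity of the odds ratios across strata. */
   tables gender*sleep*results / cmh;
   weight count;
run;
```

Because 1-LOW sorts before 2-HIGH and FAIL before PASS, the column 1 relative risk compares the risk of failing in the low-sleep group with that in the high-sleep group, adjusted for gender.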