
Week 12 - Data Analysis

The document outlines a lecture on data analysis in epidemiological research, focusing on evaluating and preparing datasets for analysis, data cleaning, and statistical tests for categorical data. It details the steps of data analysis, methods for checking invalid values in datasets, and various statistical tests applicable to categorical data. Additionally, it provides examples of using SAS procedures for data analysis and cleaning.

Week 12- Data Analysis

PHEB 689: SAS PROGRAMMING FOR EPIDEMIOLOGICAL RESEARCH
Xiaohui Xu, Ph.D.
Department of Epidemiology and Biostatistics
Part 1- Evaluate and prepare datasets for analysis
Lecture Outlines
• Understanding the steps of data analysis
• Check invalid values for character variables
o PROC FREQ
o Data step
• Check invalid values for numeric variables
o PROC MEANS
o PROC UNIVARIATE
o Data step
• Data cleaning and check data accuracy again
• Checking data normality and data transformation if necessary
Part 1.1 Data analysis steps
Sources of research data
• Primary sources
• The researcher or team of researchers designs, collects, and analyzes the data for the purpose of answering a research question
• Secondary data sources
• Existing data collected for another purpose, which you use to answer your research question
Steps for data analysis
• Understand the dataset;
• Data cleaning;
• Data analysis;
• Sharing results;
Understand the dataset
The information includes:
• The variables' names, types, and attributes (including formats, informats, and labels)
• Definitions, valid values and units of variables
• How many observations are in the dataset
• How many variables are in the dataset
• When the dataset was created
Two ways to access this information
• Receive the metadata or a data dictionary;
• Run a SAS procedure to get the information
CONTENTS procedure
• The CONTENTS procedure shows the contents of a SAS data set and prints the directory of the SAS library
PROC CONTENTS <option-1 <...option-n>>;
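A minimal sketch of the syntax above in use. It assumes a data set named PATIENTS already exists in the WORK library; that name is an assumption for illustration and is not confirmed at this point in the slides.

```sas
/* Assumption: WORK.PATIENTS has already been created. */
proc contents data=work.patients varnum;   /* VARNUM lists variables in creation order */
run;

/* List every data set in the WORK library without per-variable detail. */
proc contents data=work._all_ nods;
run;
```

The first call answers most of the questions on the "Understand the dataset" slide (variable names, types, formats, labels, number of observations, creation date); the second is a quick directory listing of the library.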
Part 1.2 Data cleaning
Data cleaning
• Data cleaning is the process of editing, correcting, and structuring data within a data set so that it is generally uniform and prepared for analysis
• Data cleaning is one of the important processes involved in data analysis
Activities in data cleaning
• Removal of unwanted observations
o Duplicate observations
o Irrelevant observations
• Fix data structure
• Define missing values
• Filter out data outliers
• Removal of invalid values
• Data transformation
Demonstration using Patient.txt

Note: Cody Data cleaning 101


2.1 Checking For Invalid Character Values
• A very simple approach to identifying invalid character values in this file is to use PROC FREQ to list all the unique values of these variables.
2.1.1. Syntax - FREQ Procedure
PROC FREQ < options > ;
BY variables ;
EXACT statistic-options < / computation-options > ;
OUTPUT <OUT=SAS-data-set > output-options ;
TABLES requests < / options > ;
TEST options ;
WEIGHT variable < / option > ;
2.1.2 Invalid data in Patients

PROC FREQ DATA=PATIENTS;
TITLE "FREQUENCY COUNTS";
TABLES GENDER AE / NOCUM NOPERCENT;
RUN;

proc freq data=patients;
table _character_;
run;

Gender
Gender   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1             1        1.02              1                    1.02
F            52       53.06             53                   54.08
M            43       43.88             96                   97.96
f             1        1.02             97                   98.98
x             1        1.02             98                  100.00
Frequency Missing = 3
2.1.3 Using A Data Step To Identify Invalid Character Values
DATA _NULL_;
INFILE '…\Patients.txt' PAD;
FILE PRINT;
TITLE "LISTING OF INVALID DATA";
INPUT @1  PATNO  $3.
      @11 GENDER $1.
      @31 DX     $7.
      @38 AE     $1.;
***CHECK GENDER;
IF GENDER NOT IN ('F','M',' ') THEN PUT PATNO= GENDER=;
***CHECK DX;
IF VERIFY(DX,' 0123456789') NE 0 THEN PUT PATNO= DX=;
***CHECK AE;
IF AE NOT IN ('0','1',' ') THEN PUT PATNO= AE=;
RUN;
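Once invalid values have been listed, the "data cleaning and check data accuracy again" step from the outline might look like the sketch below. It assumes PATIENTS is a SAS data set holding the character variables read above. PATIENTS_CLEAN is a hypothetical output name, and setting invalid codes to missing is only one possible rule (correcting them against the source records is often preferable).

```sas
/* Assumptions: PATIENTS holds GENDER, DX and AE as character variables;
   PATIENTS_CLEAN is a hypothetical name for the cleaned copy. */
data patients_clean;
   set patients;
   gender = upcase(gender);                            /* 'f' becomes 'F' */
   if gender not in ('F','M') then gender = ' ';       /* other codes set to missing */
   if verify(dx, ' 0123456789') ne 0 then dx = ' ';    /* non-digit characters set to missing */
   if ae not in ('0','1') then ae = ' ';
run;

/* Re-check: only valid codes should remain. */
proc freq data=patients_clean;
   tables gender ae / nocum nopercent;
run;
```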
2.2 Checking For Invalid Numerical Values
• PROC MEANS and PROC UNIVARIATE can be useful as a first step in data cleaning for numeric variables.
• Data step;
2.2.1 MEANS Procedure
PROC MEANS <option(s)> <statistic-keyword(s)>;
BY <DESCENDING> variable-1 <<DESCENDING> variable-2 …>
<NOTSORTED>;
CLASS variable(s) </ option(s)>;
FREQ variable;
ID variable(s);
OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)>
<id-group-specification(s)> <maximum-id-specification(s)>
<minimum-id-specification(s)> </ option(s)> ;
TYPES request(s);
VAR variable(s) </ WEIGHT=weight-variable>;
WAYS list;
WEIGHT variable;
Proc means statistic-keyword(s)
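A minimal sketch of statistic keywords in use for a first pass over numeric variables. The variable names HR, SBP and DBP and the 40 to 100 range for HR are assumptions for illustration; substitute the numeric variables and valid ranges from your own data dictionary.

```sas
/* Assumption: PATIENTS contains numeric variables HR, SBP and DBP
   plus the identifier PATNO. */
proc means data=patients n nmiss min max mean std maxdec=1;
   var hr sbp dbp;
run;

/* PROC UNIVARIATE adds extreme observations; ID labels them with PATNO. */
proc univariate data=patients;
   var hr sbp dbp;
   id patno;
run;

/* Data step range check; 40 to 100 is an illustrative range only. */
data _null_;
   set patients;
   file print;
   if not missing(hr) and (hr lt 40 or hr gt 100) then put patno= hr=;
run;
```

The explicit missing-value check matters because SAS treats a numeric missing value as smaller than any number, so `hr lt 40` alone would also flag missing values.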
Part 2- Categorical data analysis
Lecture Outlines
• Choosing the correct statistical test for categorical data
• Categorical data analysis - one sample
o Binomial test
o Chi-square goodness of fit test
• Categorical data analysis - two samples
o Two independent samples (2x2 table)
   - Chi-square test (large sample)
   - Fisher’s exact test (small sample)
   - Crude Odds Ratio and Relative Risk
o Two matched samples
   - Paired samples: McNemar test and kappa coefficient
   - Odds Ratio estimation for a matched pair case-control study
• Chi-square test for trend (2*N table)
• Stratified tables analysis - the Cochran-Mantel-Haenszel statistics
Acknowledgement
• The lecture slides are developed based on several different resources including:
• SAS online help;
• Book: Applied Statistics and the SAS Programming Language, 5th Edition
• Online reference from the UCLA Institute for Digital Research & Education
2.1 Choosing correct statistical tests
Types of variable
• Categorical or nominal
• Gender (male vs. female);
• Ordinal
• Age group (<20, 20-39, 40-59, 60+)
• Smoking (0, 1-4, 5-9, 10+ cigarettes per day)
• Interval (also called numerical)
• Cholesterol levels (mg/dL)
• Weight (pounds)
Statistical tests for categorical data
Number of    Nature of Independent         Nature of Dependent            Test(s)
Dependent    Variables                     Variable(s)*
Variables
1            0 IVs (1 population)          categorical (2 categories)     binomial test
                                           categorical (2+ categories)    Chi-square goodness-of-fit
             1 IV with 2 levels            categorical                    Chi-square test
             (independent groups)          (large sample size)
                                           small sample size              Fisher’s exact test
             1 IV with 2 or more levels    categorical                    Chi-square test
             (independent groups)
Statistical tests for categorical data 2
Number of    Nature of Independent         Nature of Dependent            Test(s)
Dependent    Variables                     Variable(s)*
Variables
             1 IV with 2 levels            categorical                    McNemar test
             (dependent/matched groups)
             1 IV with 2 or more levels    categorical (2 categories)     Conditional logistic regression
             (dependent/matched groups)
             2 or more IVs                 categorical (2+ categories)    logistic regression
             (independent groups)
             1 interval IV                 categorical                    simple logistic regression
             1 or more interval IVs        categorical                    multiple logistic regression
             and/or 1 or more
Lecture dataset
• HSB data file
• This data file contains 200 observations from a sample of high school students with demographic information about the students.

Variable name   Description
female          Gender of students
ses             Socioeconomic status (1=low 2=middle 3=high)
race            Ethnic background (1=hispanic 2=asian 3=african-amer 4=white)
schtyp          Type of school (1=public 2=private)
prog            Type of program (1=general 2=academic 3=vocational)
read            Reading scores on standardized tests
write           Writing scores on standardized tests
math            Mathematics scores
science         Science scores
socst           Social studies scores
2.2. Test for One Sample Population
2.1 Binomial test (two categories)
• A one sample binomial test allows us to test whether the proportion of successes on a two-level categorical dependent variable significantly differs from a hypothesized value.

• For example
• H0: P=0.5
Example: Binomial test
• In the HSB data, test whether the proportion of females (female) differs significantly from 50%, i.e., from .5.
• We will use the EXACT statement to produce the exact p-values.
proc freq data = wk11.hsb2;
tables female / binomial(p=.5);
exact binomial;
run;
TABLES Statement
• TABLES requests </ options> ;
• The TABLES statement requests one-way to n-way frequency and crosstabulation tables and statistics for those tables.
• Options:
• BINOMIAL <(binomial-options)>
BIN <(binomial-options)>
• requests the binomial proportion for one-way tables. When you specify this option, by default PROC FREQ provides the asymptotic standard error, asymptotic Wald and exact (Clopper-Pearson) confidence limits, and the asymptotic equality test for the binomial proportion.
EXACT statement
• EXACT statistic-options </ computation-options> ;
• The EXACT statement requests exact tests and confidence limits for selected statistics. The statistic-options identify which statistics to compute, and the computation-options specify options for computing exact statistics.

• Statistic options
BINOMIAL / BIN
• requests an exact test for the binomial proportion (for one-way tables).
Output
• The results indicate that there is no statistically significant difference (p = .2292). In other words, the proportion of females in this sample does not significantly differ from the hypothesized value of 50%.
2.2 Chi-square goodness of fit (2+ categories)
• A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions.
Example
• Let’s suppose we believe that the general population is 10% Hispanic, 10% Asian, 10% African American and 70% White. We want to test whether the observed proportions from our sample differ significantly from these hypothesized proportions.

proc freq data = wk11.hsb2;
tables race / chisq testp=(10 10 10 70);
run;
Options of TABLES statement
• CHISQ <(chisq-options)>
• For one-way tables, the CHISQ option provides the Pearson chi-square goodness-of-fit test. You can also request the likelihood ratio goodness-of-fit test for one-way tables by specifying the LRCHI chisq-option in parentheses after the CHISQ option.

• chisq-options
• TESTP=(values) | SAS-data-set
• specifies null hypothesis proportions for the one-way chi-square goodness-of-fit tests.
Output
• These results show that the racial composition of our sample does not differ significantly from the hypothesized values that we supplied (chi-square with three degrees of freedom = 5.0286, p = .1697).
2.3. R x C table analysis
3.1 Two + independent sample populations
• Chi-square test (large samples)
• Fisher’s exact test (small samples)
• Crude Odds Ratio and 95% CI
• Crude Relative Risk estimations and 95% CI
3.1.1 Chi-square test (large samples)
• H0:
• There is no association between the row variable and the column variable
• Methods include:
• Pearson chi-square (large sample size; based on the differences between the observed and expected frequencies)
• Continuity-adjusted chi-square (small sample size; similar to Fisher’s exact test)
• Likelihood-ratio chi-square (based on the ratio of the observed to the expected frequencies)
• Mantel-Haenszel chi-square (linear association; ordinal categorical variables)
Grouping syntax In table statement
Example
• Using the hsb2 data file, let’s see if there is a relationship between the type of school attended (schtyp) and students’ gender (female).

proc freq data = wk11.hsb2;
tables schtyp*female / chisq;
run;
CHISQ- TABLES statement option
• CHISQ <(chisq-options)>
• For two-way tables, the chi-square tests include the Pearson chi-square, likelihood ratio chi-square, and Mantel-Haenszel chi-square tests. The chi-square measures include the phi coefficient, contingency coefficient, and Cramér’s V.
Output
• These results indicate that there is no statistically significant relationship between the type of school attended and gender (chi-square with one degree of freedom = 0.0470, p = 0.8283).
3.1.2 Fisher’s exact test (Small Samples)
• Fisher’s exact test is used when you want to conduct a chi-square test, but one or more of your cells has an expected frequency of less than five.
• Fisher’s exact test does not depend on any large-sample distribution assumptions.
• It is appropriate even for small sample sizes and for sparse tables.
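To apply the "expected frequency of less than five" rule, the expected cell counts can be printed alongside the observed ones. The sketch below assumes the lecture's HSB data set is available as wk11.hsb2 and uses the EXPECTED option of the TABLES statement.

```sas
/* Assumption: wk11.hsb2 is the HSB data set used throughout the lecture. */
proc freq data=wk11.hsb2;
   /* EXPECTED prints expected counts under independence;
      NOROW, NOCOL and NOPERCENT keep each cell easy to read. */
   tables schtyp*race / expected norow nocol nopercent chisq;
run;
```

If any expected count falls below five, Fisher's exact test is the safer choice for that table.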
Example
• Using the hsb2 data file, let’s see if there is a relationship between the type of school attended (schtyp) and students’ race (race).

proc freq data = wk11.hsb2;
tables schtyp*race / fisher;
run;
Fisher- TABLES statement option
• FISHER
• Requests Fisher’s exact test for tables that are larger than 2x2
• For 2x2 tables, the CHISQ option provides Fisher’s exact test.
Output
• These results suggest that there is not a statistically significant relationship between race and type of school (p = 0.5975). Note that Fisher’s exact test does not have a “test statistic”, but computes the p-value directly.
3.1.3 Crude Odds Ratio
• An unmatched case-control study was conducted to examine the association between brain tumor and benzene exposure. The data are listed below.

                  Cases   Controls
Exposure   Yes      50       20
           No      100      130
Using the code to generate the dataset
DATA ODDS;
INPUT OUTCOME $ EXPOSURE $ COUNT;
DATALINES;
CASE 1-YES 50
CASE 2-NO 100
CONTROL 1-YES 20
CONTROL 2-NO 130
;
Code for OR calculations

PROC FREQ DATA=ODDS;
TITLE "Program to Compute an Odds Ratio";
TABLES EXPOSURE*OUTCOME / CHISQ CMH;
WEIGHT COUNT;
RUN;
Options in TABLES statement
• CMH <(cmh-options)>
• For 2*2 tables, the CMH option provides the adjusted Mantel-Haenszel and logit estimates of the odds ratio and relative risks, together with their confidence limits

• OR <(CL=type | (types))> ODDSRATIO <(CL=type | (types))>
• requests the odds ratio and confidence limits for 2*2 tables.
Output
• The OR=3.25, 95% CI is 1.8189-5.807. Thus the odds of benzene exposure in cases are 3.25 times the odds of benzene exposure in controls.

COMMON ODDS RATIO AND RELATIVE RISKS
Statistic                  Method            Value     95% Confidence Limits
Odds Ratio                 Mantel-Haenszel   3.2500    1.8189    5.8070
                           Logit             3.2500    1.8189    5.8070
Relative Risk (Column 1)   Mantel-Haenszel   1.6429    1.3331    2.0246
                           Logit             1.6429    1.3331    2.0246
Relative Risk (Column 2)   Mantel-Haenszel   0.5055    0.3432    0.7446
                           Logit             0.5055    0.3432    0.7446
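As a cross-check on the odds ratio, the crude value can be worked out by hand from the 2x2 counts. The sketch below is a hand calculation, not PROC FREQ's internal code: it uses the usual large-sample (logit) standard error of the log odds ratio, and all variable names are illustrative.

```sas
data _null_;
   a = 50;  b = 20;    /* exposed:   cases, controls */
   c = 100; d = 130;   /* unexposed: cases, controls */
   odds_ratio = (a*d) / (b*c);                 /* cross-product ratio */
   se_ln = sqrt(1/a + 1/b + 1/c + 1/d);        /* SE of log(OR) */
   z     = quantile('NORMAL', 0.975);          /* about 1.96 */
   lower = exp(log(odds_ratio) - z*se_ln);
   upper = exp(log(odds_ratio) + z*se_ln);
   put odds_ratio= 8.4 lower= 8.4 upper= 8.4;  /* written to the SAS log */
run;
```

The cross-product (50 x 130) / (20 x 100) equals 3.25, and the limits should come out close to the logit limits in the table.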


3.1.4 Crude RR estimations
• A prospective cohort study was conducted to investigate the effect of cholesterol on heart attacks (MI). The data are listed below.

                      Heart Attack
                      Yes     No
Cholesterol   High     20     80
              Low      15    135
Create the dataset using the table
DATA RR;
LENGTH GROUP $ 9;
INPUT GROUP $ OUTCOME $ COUNT;
DATALINES;
HC Y 20
HC N 80
LC Y 15
LC N 135
;
Code for analysis

proc sort data=RR;
by GROUP descending OUTCOME;
run;
PROC FREQ DATA=RR ORDER=DATA;
TITLE "Program to Compute a Relative Risk";
TABLES GROUP*OUTCOME / CMH;
WEIGHT COUNT;
RUN;
Output
The Relative Risk is 2.00, 95% CI 1.076-3.717
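The relative risk can be checked by hand in the same way. This sketch uses the standard large-sample standard error of the log relative risk; it is a hand calculation with illustrative variable names, not the procedure's own code.

```sas
data _null_;
   a = 20; n1 = 100;   /* high cholesterol: heart attacks, group total */
   c = 15; n2 = 150;   /* low cholesterol:  heart attacks, group total */
   rr    = (a/n1) / (c/n2);                    /* risk ratio, high vs. low */
   se_ln = sqrt(1/a - 1/n1 + 1/c - 1/n2);      /* SE of log(RR) */
   z     = quantile('NORMAL', 0.975);
   lower = exp(log(rr) - z*se_ln);
   upper = exp(log(rr) + z*se_ln);
   put rr= 8.4 lower= 8.4 upper= 8.4;          /* written to the SAS log */
run;
```

The risks are 20/100 = 0.20 and 15/150 = 0.10, so the ratio is 2.0; the limits should land close to the reported 1.076 and 3.717.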
3.2 Matched sample populations
• McNemar test and kappa coefficient
• Odds Ratio for matched pair case-control study
3.2.1 McNemar test and kappa coefficient
• McNemar test
• A non-parametric test used to analyze paired nominal data.
• To determine if the proportions of categories in two related groups significantly differ from each other.
• kappa coefficient
• A measure of interrater agreement
• When there is perfect agreement between the two ratings, the kappa coefficient equals +1.
• Here is one possible interpretation of Kappa.
• Poor agreement = Less than 0.20
• Fair agreement = 0.20 to 0.40
• Moderate agreement = 0.40 to 0.60
• Good agreement = 0.60 to 0.80
• Very good agreement = 0.80 to 1.00
Example
To check the change of attitude before and after an educational intervention for the same individual.
PROC FORMAT;
VALUE $OPINION 'P'='Positive'
               'N'='Negative';
RUN;
DATA MCNEMAR;
LENGTH AFTER BEFORE $ 1;
INPUT AFTER $ BEFORE $ COUNT;
FORMAT BEFORE AFTER $OPINION.;
DATALINES;
N N 32
N P 30
P N 15
P P 23
;
Code for the analysis

PROC FREQ DATA=MCNEMAR;
TITLE "McNemar's Test for Paired Samples";
TABLES BEFORE*AFTER / AGREE;
WEIGHT COUNT;
RUN;
Agree- Table statement option
• AGREE <(agree-options)>
• requests tests and measures of classification agreement for square tables.
• This option provides the simple and weighted kappa coefficients along with their
standard errors and confidence limits.
• This option provides McNemar’s test;
Output
• Attitudes changed significantly between before and after the educational intervention in the same individuals (p=0.0253).
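McNemar's statistic depends only on the two discordant cells, so the reported p-value can be reproduced by hand. The sketch below uses the uncorrected statistic (b - c)^2 / (b + c) with one degree of freedom; variable names are illustrative.

```sas
data _null_;
   b = 30;   /* before = Positive, after = Negative */
   c = 15;   /* before = Negative, after = Positive */
   chisq = (b - c)**2 / (b + c);     /* McNemar chi-square, 1 df */
   p     = 1 - probchi(chisq, 1);    /* upper-tail p-value */
   put chisq= 8.4 p= 8.4;            /* written to the SAS log */
run;
```

Here (30 - 15)^2 / 45 = 5.0, which corresponds to the p-value of about 0.0253 quoted above. The concordant cells (32 and 23) do not enter the test at all.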
Output-Kappa coefficient
Practice
• Two radiologists read the same x-rays to diagnose a disease. The results are listed in the table.

                 Radiologist 2
Radiologist 1    no    yes
no               25      3
yes               5     50
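One possible way to set up the practice problem, following the pattern of the McNemar example. The data set name XRAY and the variable names RAD1 and RAD2 are hypothetical.

```sas
/* Hypothetical names: XRAY, RAD1 (radiologist 1), RAD2 (radiologist 2). */
data xray;
   input rad1 $ rad2 $ count;
   datalines;
no  no  25
no  yes  3
yes no   5
yes yes 50
;

proc freq data=xray;
   title "Agreement Between Two Radiologists";
   tables rad1*rad2 / agree;   /* kappa coefficient and McNemar's test */
   weight count;
run;
```

The kappa coefficient in the output can then be read against the interpretation scale given earlier.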
3.2.2 Odds Ratio for matched pair case-control study
• A matched-pair case-control study was conducted to examine the effect of alcohol consumption on liver disease. The data are listed below.

                        Cases
                    present   absent
Controls   present     15        5
           absent      20       60
Create the data using do loop
data a;
do case = 'present','absent';
do control = 'present','absent';
input count @@;
output;
end;
end;
datalines;
15 20
5 60
;
Wrong code for OR

proc freq data=a order=data;
weight count;
table case * control / agree relrisk;
run;
Data transformation
data indiv;
set a;
length response $ 7;   /* keeps 'control' from being truncated to 'cont' */
retain id 0;
do id=id+1 to id+count;
factor=case; response='case'; output;
factor=control; response='control'; output;
end;
keep id factor response;
run;

proc freq data=indiv order=data;
table id*factor*response / cmh noprint;
run;
Output
• The correct estimate of the odds ratio from this matched pairs data is 4.0, which is provided by the Mantel-Haenszel estimate from the CMH option in PROC FREQ
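The value of 4.0 can be confirmed by hand, because the matched-pair odds ratio is the ratio of the two discordant pair counts. A minimal sketch with illustrative variable names:

```sas
data _null_;
   case_only    = 20;   /* pairs with case 'present' and control 'absent' */
   control_only = 5;    /* pairs with case 'absent' and control 'present' */
   odds_ratio = case_only / control_only;   /* matched-pair odds ratio */
   put odds_ratio=;                         /* written to the SAS log */
run;
```

The concordant pairs (15 and 60) carry no information about the odds ratio, which is why treating the table as an ordinary unmatched 2x2 table gives the wrong estimate.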
3.3 Chi-square test for trend
• When the group variable is an ordinal categorical factor, the chi-square test for trend is used to detect whether the proportion increases or decreases linearly across the N levels of this ordinal categorical variable.
Example
• Test whether the proportion of “fail” increases linearly from group A through group D.

                       Group
                   A     B     C     D
Test      Fail    10    15    14    25
Results   Pass    90    85    86    75
SAS code for test

DATA TREND;
INPUT RESULT $ GROUP $ COUNT @@;
DATALINES;
FAIL A 10 FAIL B 15 FAIL C 14 FAIL D 25
PASS A 90 PASS B 85 PASS C 86 PASS D 75
;
PROC FREQ DATA=TREND;
TITLE "Chi-square Test for Trend";
TABLES RESULT*GROUP / CHISQ;
WEIGHT COUNT;
RUN;
Output
• There is a significant linear trend in proportions of Fail from group A through D.
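The slide's code reads the trend from the Mantel-Haenszel chi-square produced by CHISQ. As an alternative sketch (a different option, not the one used on the slide), the TREND option of the TABLES statement requests the Cochran-Armitage test for trend, which applies when one of the two variables has exactly two levels.

```sas
/* Same data as the slide; TREND replaces CHISQ in the TABLES statement. */
data trend;
   input result $ group $ count @@;
   datalines;
FAIL A 10 FAIL B 15 FAIL C 14 FAIL D 25
PASS A 90 PASS B 85 PASS C 86 PASS D 75
;

proc freq data=trend;
   title "Cochran-Armitage Test for Trend";
   tables result*group / trend;
   weight count;
run;
```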
3.4 Stratified tables
• Sometimes we need to stratify the study group into several subgroups based on one or two factors. Then we examine 2*2 tables or R*C tables in each subgroup and generate a summary across the subgroups.
Stratification-Why?
• To better control the confounding effects of a third factor, we need to examine the association between an exposure and a disease in each stratum of this third variable.
• Objectives of this analysis include
o Determine whether the associations between an exposure and a disease in each stratum of this third variable are similar or statistically significantly different.
o If different, the stratum-specific estimates of the effect of the exposure on the disease need to be presented.
o If not different, the Cochran-Mantel-Haenszel statistics can provide the summary statistics across all the strata.
Example
• We examine the relationship between hours of sleep and the chance of failing a test by gender.

Boys                 TEST
                 Fail   Pass
Sleep   Low       20    100
        High      15    150

Girls                TEST
                 Fail   Pass
Sleep   Low       30    100
        High      25    200
Code to create the dataset
DATA ABILITY;
INPUT GENDER $ RESULTS $ SLEEP $ COUNT;
DATALINES;
BOYS FAIL 1-LOW 20
BOYS FAIL 2-HIGH 15
BOYS PASS 1-LOW 100
BOYS PASS 2-HIGH 150
GIRLS FAIL 1-LOW 30
GIRLS FAIL 2-HIGH 25
GIRLS PASS 1-LOW 100
GIRLS PASS 2-HIGH 200
;
Code for analysis
PROC FREQ DATA=ABILITY;
TITLE "Mantel-Haenszel Chi-square Test";
TABLES GENDER*SLEEP*RESULTS/ALL;
WEIGHT COUNT;
RUN;

ALL
requests all tests and measures that are produced by the CHISQ, MEASURES, and CMH options.
Output
• The Breslow-Day test suggests that the relationship between hours of sleep and failing a test does not differ between boys and girls
Step 2: provide the summary statistics across all the strata
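To see what the summary step combines, the stratum-specific odds ratios and the Mantel-Haenszel pooled odds ratio can be computed by hand from the two tables. This is a hand calculation with illustrative variable names; in practice the CMH output from PROC FREQ supplies these estimates with confidence limits.

```sas
data _null_;
   /* Boys:  low-sleep fail, low-sleep pass, high-sleep fail, high-sleep pass */
   a1 = 20; b1 = 100; c1 = 15; d1 = 150; n1 = a1 + b1 + c1 + d1;
   /* Girls: same cell layout */
   a2 = 30; b2 = 100; c2 = 25; d2 = 200; n2 = a2 + b2 + c2 + d2;

   or_boys  = (a1*d1) / (b1*c1);   /* stratum-specific odds ratios */
   or_girls = (a2*d2) / (b2*c2);

   /* Mantel-Haenszel pooled OR: sum(a*d/n) / sum(b*c/n) over strata */
   or_mh = (a1*d1/n1 + a2*d2/n2) / (b1*c1/n1 + b2*c2/n2);

   put or_boys= 8.4 or_girls= 8.4 or_mh= 8.4;   /* written to the SAS log */
run;
```

The stratum-specific odds ratios are 2.0 for boys and 2.4 for girls, and the pooled estimate works out to about 2.23, between the two.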
