Chapter 2 Slides

1. The document discusses analyzing categorical data using rates, proportions, and percentages from tables and plots. 2. It demonstrates calculating marginal, conditional, and joint rates to analyze outcomes based on two categorical variables: treatment type and outcome. 3. Overall, Treatment Y has a higher success rate (82.6%) than Treatment X (77.4%), yet Treatment X has the higher success rate within both large and small stones, which is an instance of Simpson's paradox with stone size as the confounder.

Uploaded by Milim Nava
Copyright © All Rights Reserved

CHAPTER 2: Categorical Data Analysis
Overview
1A. Rates I
1B. Rates II
2A. Association I
2B. Association II
3A. Simpson’s Paradox I
3B. Simpson’s Paradox II
4. Confounders
Unit 1A: Rates I
By the end of this unit you should be able to
do the following:
1. Identify a categorical variable.
2. Understand and interpret tables and
plots created from 1 categorical variable.
RECAP

Types of variables

Categorical variables
• Ordinal: Categories come with some natural ordering and numbers are often used to represent the ordering. E.g.: Happiness level
• Nominal: No intrinsic ordering for the variables. E.g.: Nationality

Numerical variables
• Continuous: One that can take on all possible numerical values in a given range or interval. E.g.: Time
• Discrete: One where possible values of the variable form a set of numbers with “gaps”. E.g.: Module credits
THE PROBLEM
A SNAPSHOT OF THE DATA

Size    Gender   Treatment   Outcome
Large   Male     X           Success
Large   Male     X           Success
Small   Male     Y           Success
Large   Male     Y           Failure
Small   Male     X           Success
Large   Male     Y           Success


APPLYING THE PPDAC CYCLE
In general, do treatments tend to be
successful?

What to measure:
Variable “Outcome” tells us if the
treatment was a success or not

Sort the data


Plot graphs, tables of the
“Outcome” variable
ANALYSING 1 CATEGORICAL VARIABLE - TABLE

Categories of the “Outcome” variable   Count   Rate                               Percentage
Success                                831     rate(Success) = 831/1050 = 0.791   0.791 × 100% = 79.1%
Failure                                219     rate(Failure) = 219/1050 = 0.209   0.209 × 100% = 20.9%
Total                                  1050    1050/1050 = 1                      1 × 100% = 100%
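The rate calculation in this table can be sketched in a few lines of Python. This is only an illustration; the counts come from the slide, while the variable names are assumptions made for the example.

```python
# Counts for the "Outcome" variable, taken from the slide.
counts = {"Success": 831, "Failure": 219}

total = sum(counts.values())  # 1050 patients in all

# rate(category) = count of that category / total count
rates = {category: n / total for category, n in counts.items()}

for category, rate in rates.items():
    # The percentage is simply the rate multiplied by 100%.
    print(f"rate({category}) = {counts[category]}/{total} = {rate:.3f} = {rate:.1%}")
```

The rates of all categories of one variable always add up to 1 (100%).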
Analysing 1 categorical variable - Plot
[Figure: dodged bar plot and stacked bar plot for “Outcome”. Vertical axis: Counts. Success: 831, Failure: 219.]
Analysing 1 categorical variable - Plot
[Figure: 100% stacked bar plot for “Outcome”. Vertical axis: Percentage. Success: 79.1%, Failure: 20.9%.]
Conclusion
Table and bar plots gave us the same
conclusion

79% success

21% failure

Should go for treatment


Summary
We have learned:
• Use of tables and plots to summarize a categorical variable
• Calculation of rates
Unit 1B: Rates II
By the end of this unit you should be able to
do the following:
1. Understand and interpret tables and
plots created from 2 categorical
variables.
2. Calculate marginal, conditional and joint
rates.
WHICH TREATMENT TO CHOOSE?

Size    Gender   Treatment   Outcome
Large   Male     X           Success
Large   Male     X           Success
Small   Male     Y           Success
Large   Male     Y           Failure
Small   Male     X           Success
Large   Male     Y           Success


PPDAC CYCLE – A NEW QUESTION

Which treatment is better?

Key variable of interest:


Treatment variable

Sort the data


Plot graphs, tables of the
“Treatment” and
“Outcome” variable
2 x 2 Table

               Outcome
Treatment      Success   Failure   Row Total
X              542       158       700
Y              289       61        350
Column Total   831       219       1050
Marginal rates / proportions / percentages

               Outcome
Treatment      Success   Failure   Row Total
X              542       158       700
Y              289       61        350
Column Total   831       219       1050

• What proportion of the total number of patients underwent Treatment Y?
• rate(Y) = 350/1050 = 1/3 = 33 1/3 %
• What proportion of the total number of patients had a successful treatment?
• rate(Success) = 831/1050 = 0.791 = 79.1%
• Calculations above are called marginal rates / proportions / percentages.
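As a rough sketch, marginal rates can be read off a 2 x 2 table by dividing a row total or a column total by the grand total. The nested-dictionary layout below is an assumption made for illustration; the counts are from the slide.

```python
# 2 x 2 table from the slide: table[treatment][outcome] = count.
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}

grand_total = sum(sum(row.values()) for row in table.values())  # 1050

# Marginal rate of a treatment: row total / grand total.
rate_Y = sum(table["Y"].values()) / grand_total

# Marginal rate of an outcome: column total / grand total.
rate_success = sum(row["Success"] for row in table.values()) / grand_total

print(f"rate(Y) = {rate_Y:.3f}")
print(f"rate(Success) = {rate_success:.3f}")
```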
Conditional rates / proportions / percentages

               Outcome
Treatment      Success   Failure   Row Total
X              542       158       700
Y              289       61        350
Column Total   831       219       1050

• If we focus on patients who underwent Treatment X, what proportion of them had a successful treatment?
• rate(Success given X) = 542/700 = 0.774 = 77.4%
• Calculation above is known as a conditional rate / proportion / percentage.
• An even shorter way of writing this is to use a vertical bar in place of given: rate(Success | X)
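A conditional rate restricts attention to one row of the table before dividing. The helper below is a minimal sketch; the function name conditional_rate and the dictionary layout are assumptions for this example.

```python
# 2 x 2 table from the slide: table[treatment][outcome] = count.
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}

def conditional_rate(table, outcome, treatment):
    """rate(outcome | treatment): look only at one treatment row, then divide."""
    row = table[treatment]
    return row[outcome] / sum(row.values())

rate_success_given_X = conditional_rate(table, "Success", "X")  # 542/700
print(f"rate(Success | X) = {rate_success_given_X:.3f}")
```

Note that the denominator is the row total (700), not the grand total (1050).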
Joint rates / proportions / percentages

               Outcome
Treatment      Success   Failure   Row Total
X              542       158       700
Y              289       61        350
Column Total   831       219       1050

• What is the proportion of patients who chose Treatment Y and had a failure?
• rate(Y and Failure) = 61/1050 = 0.0581 = 5.81%
• NOT a conditional rate.
• Calculation is known as a joint rate / proportion / percentage.
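To make the difference between a joint rate and a conditional rate concrete, the sketch below computes both for the same cell of the table. The variable names are illustrative assumptions; the counts are from the slide.

```python
# 2 x 2 table from the slide: table[treatment][outcome] = count.
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}
grand_total = sum(sum(row.values()) for row in table.values())  # 1050

# Joint rate: the cell count divided by the GRAND total.
joint_Y_and_failure = table["Y"]["Failure"] / grand_total          # 61/1050

# Conditional rate: the same cell count divided by the ROW total.
failure_given_Y = table["Y"]["Failure"] / sum(table["Y"].values())  # 61/350

print(f"rate(Y and Failure) = {joint_Y_and_failure:.4f}")
print(f"rate(Failure | Y)   = {failure_given_Y:.4f}")
```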
Which treatment is better?

               Outcome
Treatment      Success   Failure   Row Total
X              542       158       700
Y              289       61        350
Column Total   831       219       1050

Treatment X has 542 successful cases.
Treatment Y has 289 successful cases.
“We should recommend Treatment X!”?
More patients choosing Treatment X as compared to Y.
Making it fair!

Given that I pick some treatment, what is the rate of success? → Compare success rate of Treatments X and Y → Fair comparison → Treatment Y is better!

• rate(Success | X) = 542/700 = 0.774 = 77.4%
• rate(Success | Y) = 289/350 = 0.826 = 82.6%
• For Treatment X, roughly 77 out of 100 patients had a successful treatment.
• For Treatment Y, roughly 83 out of 100 patients had a successful treatment.
Table with row percentages

               Outcome
Treatment      Success (row %)   Failure (row %)   Row Total (row %)
X              542 (77.4%)       158 (22.6%)       700 (100%)
Y              289 (82.6%)       61 (17.4%)        350 (100%)
Column Total   831 (79.1%)       219 (20.9%)       1050 (100%)
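The row percentages in this table can be reproduced with a short loop. This is a sketch under the assumption that the table is stored as a nested dictionary; the names are illustrative only.

```python
# 2 x 2 table from the slide, plus the column totals as a third row.
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}
table["Column Total"] = {
    outcome: table["X"][outcome] + table["Y"][outcome]
    for outcome in ("Success", "Failure")
}

# Row percentage = cell count / row total x 100%.
row_pct = {}
for label, row in table.items():
    row_total = sum(row.values())
    row_pct[label] = {outcome: 100 * n / row_total for outcome, n in row.items()}

for label, pcts in row_pct.items():
    cells = ", ".join(f"{outcome} {p:.1f}%" for outcome, p in pcts.items())
    print(f"{label}: {cells}")
```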
Analysing 2 categorical variables - plot
[Figure: dodged bar plot and stacked bar plot for “Outcome” by “Treatment”. Horizontal axis: Treatment; vertical axis: Counts. X: 542 Success, 158 Failure. Y: 289 Success, 61 Failure.]
Analysing 2 categorical variables - plot
[Figure: 100% stacked bar plot for “Outcome” by “Treatment”. Horizontal axis: Treatment; vertical axis: Percentage. X: 77.4% Success, 22.6% Failure. Y: 82.6% Success, 17.4% Failure.]
Summary
We have learnt how to analyse 2 categorical variables from the perspective of:
• Tables – 2x2 table
• Plots – Bar plots / 100% stacked bar plots
Unit 2A: Association I

By the end of this unit, you should be able to do the following:
1. Understand and apply association
2. Understand and apply symmetry rule
Continuation from Unit 1
Used rates to conclude that Treatment Y is better than Treatment X.
Relationship between the type of treatment and the outcome of the treatment → Associative relationship between the 2 variables
• “Treatment Y is positively associated to the success of the treatment.” Tend to see Treatment Y and successful treatments go hand in hand.
• “Treatment X is negatively associated to the success of the treatment.” Tend to see Treatment X and unsuccessful treatments go hand in hand.
Caution: Association, not causation! Not sure if success of treatment is due to the treatment or not.


Is there an association?
Suppose we have A and B as characteristics in a population. We shall assume that some people have A, and some do not have A (labelled as NA). We assume the same about B.

Association absent
• rate(A | B) = rate(A | NB)
• Rate of A is not affected by the presence or absence of B.
• A and B are not associated.

Association present
• rate(A | B) ≠ rate(A | NB)
• rate(A | B) > rate(A | NB): Presence of A when B is present is stronger than when B is absent. Positive association between A and B.
• rate(A | B) < rate(A | NB): Presence of A when B is present is weaker than when B is absent. Negative association between A and B.
Linking back to our dataset

Checking for association between 2 variables
• Outcome of treatment
  • A: Success
  • NA: Failure
• Treatment
  • B: Treatment X
  • NB: Treatment Y

Compare
• rate(A | B) = rate(Success | X) = 0.774
• rate(A | NB) = rate(Success | Y) = 0.826

Conclusion
• rate(A | B) < rate(A | NB)
• Presence of A is weaker when B is present.
• Less successful treatments when we see Treatment X: Treatment X is negatively associated to a successful treatment.
• More successful treatments when we see Treatment Y: Treatment Y is positively associated to a successful treatment.
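The comparison used in this conclusion can be written as a small helper. The function name association and its return labels are assumptions made for this sketch; the rule itself (compare rate(A | B) with rate(A | NB)) is the one stated in the slides.

```python
def association(rate_a_given_b, rate_a_given_nb):
    """Classify the association between A and B from two conditional rates."""
    if rate_a_given_b > rate_a_given_nb:
        return "positive"
    if rate_a_given_b < rate_a_given_nb:
        return "negative"
    return "none"

rate_success_given_X = 542 / 700  # rate(A | B), with B = Treatment X
rate_success_given_Y = 289 / 350  # rate(A | NB), with NB = Treatment Y

# Success vs Treatment X
print(association(rate_success_given_X, rate_success_given_Y))
# Success vs Treatment Y: swap the roles of B and NB
print(association(rate_success_given_Y, rate_success_given_X))
```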
On Establishing Association

• Any of the following comparisons can show positive association between A and B:
  rate(A | B) > rate(A | NB)
  rate(B | A) > rate(B | NA)
  rate(NA | NB) > rate(NA | B)
  rate(NB | NA) > rate(NB | A)

• Likewise, for negative association between A and B:
  rate(A | B) < rate(A | NB)
  rate(B | A) < rate(B | NA)
  rate(NA | NB) < rate(NA | B)
  rate(NB | NA) < rate(NB | A)

• Try it – using the example in the previous slide, you can see that this relation holds true;
  E.g. "If success is positively associated with treatment Y, then …"
  • "… success is negatively associated with ???"
  • "… failure is positively associated with ???"
  • "… failure is negatively associated with ???"
2 rules that govern rates

Suppose we have A and B as characteristics in a population. We shall assume that some people have A, and some do not have A (labelled as NA). We assume the same about B.

• Symmetry rule
• Basic rule on rates (to be discussed in Unit 2B: Association II)
Symmetry Rule

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA).

rate(A | B) < rate(A | NB) ⇔ rate(B | A) < rate(B | NA).

rate(A | B) = rate(A | NB) ⇔ rate(B | A) = rate(B | NA).
rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA)

               B       Not B   Row Total
A              w       x       w+x
Not A          y       z       y+z
Column Total   w+y     x+z     w+x+y+z

rate(A | B) > rate(A | NB):
w/(w+y) > x/(x+z)
w(x+z) > x(w+y)
wx + wz > xw + xy
wz > xy

rate(B | A) > rate(B | NA):
w/(w+x) > y/(y+z)
w(y+z) > y(w+x)
wy + wz > yw + yx
wz > xy

Both inequalities reduce to the same condition, wz > xy.
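As a sanity check on the algebra, the symmetry rule can be tested by brute force over small tables. This is only a sketch: the cell range 1 to 6 is an arbitrary choice for illustration, and exact fractions are used so that equal rates compare as equal.

```python
from fractions import Fraction
from itertools import product

def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def symmetry_holds(w, x, y, z):
    """Check the symmetry rule for one table with cells w, x (row A) and y, z (row Not A)."""
    diff_a = Fraction(w, w + y) - Fraction(x, x + z)  # rate(A | B) - rate(A | NB)
    diff_b = Fraction(w, w + x) - Fraction(y, y + z)  # rate(B | A) - rate(B | NA)
    # Both differences should have the same sign as wz - xy.
    return sign(diff_a) == sign(diff_b) == sign(w * z - x * y)

# Every table whose four cells each take a value from 1 to 6.
all_ok = all(symmetry_holds(*cells) for cells in product(range(1, 7), repeat=4))
print(all_ok)
```

A check like this does not replace the proof, but it covers the ">", "<" and "=" versions of the rule at once.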
rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA)

1. rate(A | B) > rate(A | NB) → rate(B | A) > rate(B | NA)
2. rate(B | A) > rate(B | NA) → rate(A | B) > rate(A | NB)
rate(A | B) > rate(A | NB) → rate(B | A) > rate(B | NA)

• Rate of A given B is more than rate of A given NB.
• More likely to see A when B is present as compared to when B is absent.
• Positive association between A and B.
• Also more likely to see B when A is present as compared to when A is absent.
• Rate of B given A is more than rate of B given NA.
rate(B | A) > rate(B | NA) → rate(A | B) > rate(A | NB)

• Rate of B given A is more than rate of B given NA.
• More likely to see B when A is present as compared to when A is absent.
• Positive association between B and A.
• Also more likely to see A when B is present as compared to when B is absent.
• Rate of A given B is more than rate of A given NB.
1. rate(A | B) > rate(A | NB) → rate(B | A) > rate(B | NA)
2. rate(B | A) > rate(B | NA) → rate(A | B) > rate(A | NB)

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA)


Consequence of the symmetry rule
To identify if there is any association, check for either:
1. rate(A | B) ≠ rate(A | NB) OR
2. rate(B | A) ≠ rate(B | NA)
rate(Success | X) < rate(Success | Y):
Negative association between successful treatments and Treatment X

Check:
rate(X | Success) < rate(X | Failure)
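The check suggested here can be carried out directly from the 2 x 2 table by conditioning on the outcome columns instead of the treatment rows. The dictionary names below are assumptions for this sketch; the counts are from the slides.

```python
# Column counts from the 2 x 2 table: how many successes / failures came from each treatment.
successes = {"X": 542, "Y": 289}
failures = {"X": 158, "Y": 61}

# Condition on the outcome this time (column totals 831 and 219).
rate_X_given_success = successes["X"] / sum(successes.values())  # 542/831
rate_X_given_failure = failures["X"] / sum(failures.values())    # 158/219

print(f"rate(X | Success) = {rate_X_given_success:.3f}")
print(f"rate(X | Failure) = {rate_X_given_failure:.3f}")
```

The inequality points the same way as rate(Success | X) < rate(Success | Y), as the symmetry rule says it must.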
Summary
We have learned:
• How to identify association
• Symmetry rule and its consequence on identifying association
Unit 2B: Association II
By the end of this unit, you should
be able to do the following:

1. Understand and apply basic rule on rates


BASIC RULE ON RATES
The overall rate(A) will always lie between
rate(A | B) and rate(A | NB).
Consequences of the basic rule on rates
1. The closer rate(B) is to 100%, the closer rate(A) is to rate(A | B).
2. If rate(B) = 50%, then rate(A) = [rate(A | B) + rate(A | NB)] / 2.
3. If rate(A | B) = rate(A | NB), then rate(A) = rate(A | B) = rate(A | NB).
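One way to see why these consequences hold is that the overall rate is a weighted average of the two conditional rates: rate(A) = rate(B) × rate(A | B) + rate(NB) × rate(A | NB). The slides do not state this formula explicitly, so the sketch below should be read as a supporting illustration; the function name and the example values of rate(B) are assumptions.

```python
def overall_rate(rate_b, rate_a_given_b, rate_a_given_nb):
    """rate(A) as a weighted average, with weights rate(B) and rate(NB) = 1 - rate(B)."""
    return rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb

# Consequence 1: rate(B) close to 100% pulls rate(A) towards rate(A | B) = 0.9.
mostly_b = overall_rate(0.9, 0.9, 0.2)

# Consequence 2: rate(B) = 50% gives the simple average of the two rates.
half_b = overall_rate(0.5, 0.9, 0.2)

# Consequence 3: equal conditional rates give the same overall rate, whatever rate(B) is.
equal_rates = overall_rate(0.3, 0.2, 0.2)

print(mostly_b, half_b, equal_rates)
```

Because the weights are between 0 and 1 and add up to 1, the result can never fall outside the two conditional rates, which is the basic rule itself.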
1. The closer rate(B) is to 100%, the closer rate(A) is to rate(A | B).
• 2 cups of bubble tea
• Let A be the level of sweetness
  • Represented by the colour “Green” in the cup.
• Let B / NB be the cups: “Cup 1” vs. “Cup 2”

Cup 1
Size: Large cup
Sweetness: 90%

Cup 2
Size: Small cup
Sweetness: 20%

Cup 1 + Cup 2
Sweetness: In between 20% to 90%, but closer to Cup 1
1. The closer rate(B) is to 100%, the closer rate(A) is to rate(A | B).
• Sweetness in the final cup is between Sweetness | Cup 1 and Sweetness | Cup 2. → Overall rate(A) to be between rate(A | B) and rate(A | NB).
• Cup 1 takes up most of the final cup. Expect sweetness of the final cup to be nearer to the sweetness of Cup 1. → Overall rate(A) to be closer to rate(A | B) if B takes up a majority of the overall.
2. If rate(B) = 50%, then rate(A) = [rate(A | B) + rate(A | NB)] / 2

Cup 1
Size: Small cup
Sweetness: 20%

Cup 2
Size: Small cup
Sweetness: 90%

Cup 1 + Cup 2
Sweetness: Exactly in between 20% and 90% = (20% + 90%) / 2 = 55%
3. If rate(A | B) = rate(A | NB), then rate(A) = rate(A | B) = rate(A | NB).

Cup 1
Size: Small / Large cup
Sweetness: 20%

Cup 2
Size: Small / Large cup
Sweetness: 20%

2 cups added together
Sweetness: Exactly 20%
Linking back to Consequences 2 and 3
If cups are of the same size, the sweetness of the final cup will be exactly halfway between the sweetness of the two original cups.
• If rate(B) = 50%, overall rate of A will be exactly in between the rate of A given B and the rate of A given NB.
If sweetness is the same for both cups, the sweetness of the final cup will also be the same, regardless of the sizes of the original cups.
• If rate(A | B) = rate(A | NB), then rate(A) is the same as the 2 rates.
Linking back to dataset at hand

Overall rate of successful treatments
• rate(Success) = 0.79

Groups: Treatment X and Treatment Y
• rate(Success | X) = 0.774
• rate(Success | Y) = 0.826
• rate(Success) in between the conditional rates

Overall rate of success closer to rate(Success | X)
• Treatment X takes up a majority of the treatments.
• rate(X) = 700/1050 = 0.667 = 66 2/3 %
• Follows statement (1)
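These statements can be checked numerically with exact fractions. The sketch assumes the weighted-average form of the overall rate, rate(Success) = rate(X) × rate(Success | X) + rate(Y) × rate(Success | Y), which is not written out in the slides; the counts are from the 2 x 2 table.

```python
from fractions import Fraction

rate_X = Fraction(700, 1050)           # share of patients on Treatment X (2/3)
success_given_X = Fraction(542, 700)   # rate(Success | X)
success_given_Y = Fraction(289, 350)   # rate(Success | Y)

# Weighted average of the two conditional rates.
overall = rate_X * success_given_X + (1 - rate_X) * success_given_Y

lies_between = success_given_X < overall < success_given_Y
closer_to_X = abs(overall - success_given_X) < abs(overall - success_given_Y)

print(float(overall), lies_between, closer_to_X)
```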
Summary
We have learned:
• What is the basic rule on rates
• The consequences of basic rule on rates
Unit 3A:
Simpson’s Paradox I
By the end of this unit you should be able to
do the following:
1. Identify Simpson’s paradox
2. Analyse data using the slicing method
PPDAC CYCLE – A RECAP
Are the treatments helping?
Yes. In general, there is a high rate of success.
More specifically, which treatment is better?
Treatment Y is positively associated to success.
ANYTHING ELSE?
WHAT ABOUT OTHER VARIABLES?

Size    Gender   Treatment   Outcome
Large   Male     X           Success
Large   Male     X           Success
Small   Male     Y           Success
Large   Male     Y           Failure
Small   Male     X           Success
Large   Male     Y           Success
Exploring the “stone size” variable
What would be a useful visualisation?
Analysing 2 categorical variables - Plot
[Figure: 100% stacked bar plot for “Outcome” by “Treatment”. Horizontal axis: Treatment; vertical axis: Percentage. X: 77.4% Success, 22.6% Failure. Y: 82.6% Success, 17.4% Failure.]
Overall, Treatment Y is better.

All stones   Success   Failure   Total
X            542       158       700
Y            289       61        350
Total        831       219       1050
Plot across large stones only
[Figure: 100% stacked bar plot for “Outcome” by “Treatment” for large stones. Horizontal axis: Treatment; vertical axis: Percentage. X: 72.4% Success, 27.6% Failure. Y: 68.8% Success, 31.3% Failure.]
Across large stones, Treatment X is better.

Large stones   Success   Failure   Total
X              381       145       526
Y              55        25        80
Total          436       170       606

rate(Success | X) > rate(Success | Y)


Exercise
Across large stones, Treatment X is better.

Large stones   Success   Failure   Total
X              381       145       526
Y              55        25        80
Total          436       170       606

rate(Success | X) = 381/526 = 0.724
rate(Success | Y) = 55/80 = 0.688
rate(Success | X) > rate(Success | Y)

Treatment X is positively associated to success
Plot across small stones only
[Figure: 100% stacked bar plot for “Outcome” by “Treatment” for small stones. Horizontal axis: Treatment; vertical axis: Percentage. X: 92.5% Success, 7.5% Failure. Y: 86.7% Success, 13.3% Failure.]
Across small stones, Treatment X is better.

Small stones   Success   Failure   Total
X              161       13        174
Y              234       36        270
Total          395       49        444
Analysing 3 categorical variables - plot
[Figure: 100% stacked bar plot for "Outcome" by "Treatment", sliced by stone size. Vertical axis: Percentage. Large stones: X 72.4% Success, 27.6% Failure; Y 68.8% Success, 31.3% Failure. Small stones: X 92.5% Success, 7.5% Failure; Y 86.7% Success, 13.3% Failure.]
A paradox on our hands
Is X or Y better?
• Overall, Treatment Y is better.
• Across large stones, Treatment X is better.
• Across small stones, Treatment X is better.
Unit 3B:
Simpson’s Paradox II
By the end of this unit you should be able to
do the following:
1. Explain a Simpson’s paradox
A paradox on our hands
Is X or Y better?
• Overall, Treatment Y is better.
• Across large stones, Treatment X is better.
• Across small stones, Treatment X is better.
A paradox explained
• Across large stones, Treatment X is better.
• Across small stones, Treatment X is better.
Analysing 3 categorical variables - Table

Each group shows: successful treatments / total number of treatments / rate(Success) in %

     Large stones          Small stones          Total (Large + Small)
X    381 / 526 / 72.4%     161 / 174 / 92.5%     542 / 700 / 77.4%
Y    55 / 80 / 68.8%       234 / 270 / 86.7%     289 / 350 / 82.6%
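The slicing method behind this table can be sketched as follows: compute the success rate of each treatment within each stone size, then again after combining the sizes, and compare which treatment comes out ahead. The data layout and helper names are assumptions made for this illustration; the counts are from the table.

```python
# (successful treatments, total treatments) for each stone size and treatment.
data = {
    "Large": {"X": (381, 526), "Y": (55, 80)},
    "Small": {"X": (161, 174), "Y": (234, 270)},
}

def better(rates):
    """Return the treatment with the highest success rate."""
    return max(rates, key=rates.get)

# Winner within each slice (stone size).
sliced_winner = {
    size: better({t: s / n for t, (s, n) in group.items()})
    for size, group in data.items()
}

# Winner after adding the slices together.
overall = {
    t: sum(data[size][t][0] for size in data) / sum(data[size][t][1] for size in data)
    for t in ("X", "Y")
}
overall_winner = better(overall)

# The paradox appears when every slice disagrees with the combined result.
paradox = all(winner != overall_winner for winner in sliced_winner.values())
print(sliced_winner, overall_winner, paradox)
```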




ANALOGY
―― View from Association ――
• Treatment X and success: rate(success | X) < rate(success | Y). Negative association.
• Treatment X and large stones: rate(large stone | X) > rate(large stone | Y). Positive association.
• Large stones and success: rate(success | large stone) < rate(success | small stone). Negative association.

[Diagram: stone size is the confounding variable, associated with both Treatment X and the outcome.]

Simpson’s paradox ⇒ confounder
Confounder ⇏ Simpson’s paradox
Summary
We learnt how to analyse 3 categorical variables from the perspective of:
• Tables – slicing by subgroups
• Graphs – sliced bar graph
Unit 4
Confounders
By the end of this unit you should be able to
do the following:
1. Define a confounder (i.e. confounding variable)
2. Identify possible confounding variables
in a study
Introduction
[Diagram: stone size is the confounding variable, associated with both Treatment X and the outcome.]
Definition:
A confounder is a third variable that is associated to both the independent and dependent variables whose relationship we are investigating.
Stone size associated to treatment type

        Large   Small   Total
X       526     174     700
Y       80      270     350
Total   606     444     1050

rate(Large | X) = 526/700 = 0.751
rate(Large | Y) = 80/350 = 0.229
Since 0.751 > 0.229, large stones positively associated to Treatment X (positive association).
Stone size associated to success

        Success   Failure   Total
Large   436       170       606
Small   395       49        444
Total   831       219       1050

rate(Success | Large) = 436/606 = 0.719
rate(Success | Small) = 395/444 = 0.890
Since 0.719 < 0.890, large stones negatively associated to success (negative association).
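Putting the two checks together, a variable qualifies as a confounder here when it is associated with both the treatment and the outcome. The sketch below follows that definition; the variable names are assumptions, and "associated" is tested simply as the two conditional rates being different.

```python
# (count of interest, group total), taken from the two tables in the slides.
large_by_treatment = {"X": (526, 700), "Y": (80, 350)}        # large stones per treatment
success_by_size = {"Large": (436, 606), "Small": (395, 444)}  # successes per stone size

def rate(count, total):
    return count / total

rate_large_X = rate(*large_by_treatment["X"])
rate_large_Y = rate(*large_by_treatment["Y"])
rate_success_large = rate(*success_by_size["Large"])
rate_success_small = rate(*success_by_size["Small"])

# Association with the treatment: rate(Large | X) differs from rate(Large | Y).
linked_to_treatment = rate_large_X != rate_large_Y
# Association with the outcome: rate(Success | Large) differs from rate(Success | Small).
linked_to_outcome = rate_success_large != rate_success_small

is_confounder = linked_to_treatment and linked_to_outcome
print(is_confounder)
```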
―― View from Association ――
• Treatment X and success: rate(success | X) < rate(success | Y). Negative association.
• Treatment X and large stones: rate(large stone | X) > rate(large stone | Y). Positive association.
• Large stones and success: rate(success | large stone) < rate(success | small stone). Negative association.


Recall:

Size    Treatment Type   Outcome
Large   X                Success
Large   X                Success
Small   Y                Success
Large   Y                Failure
Small   X                Success
Large   Y                Success

After slicing, Treatment X is better.
DO WE STILL OBSERVE SIMPSON’S PARADOX?

Size    Treatment Type   Outcome
Large   X                Success
Large   X                Success
Small   Y                Success
Large   Y                Failure
Small   X                Success
Large   Y                Success

No, Treatment Y is better.
We have to measure a variable in order to check if it is a confounder!


THE PROBLEM
• We must measure a variable in order to check if it is a confounder.
• We need to collect data on lots of variables.
• This is not feasible (costly, difficult to analyse).
→ ???
THE PROBLEM
• We must measure a variable in order to check if it is a confounder.
• We need to collect data on lots of variables.
• This is not feasible (costly, difficult to analyse).
→ RANDOMISATION
The effect of randomly assigning stone size to treatment type
• Treatment X and success: rate(success | X) < rate(success | Y). Negative association.
• Large stones and success: rate(success | large stone) < rate(success | small stone). Negative association.
• Treatment X and large stones: the positive association, rate(large stone | X) > rate(large stone | Y), is replaced by no association.

Randomisation is not always possible
“I want Treatment X!”
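A small simulation can illustrate why random assignment removes the link between stone size and treatment. Everything about the assignment scheme here is an assumption for the sketch (a fair coin flip per patient and an arbitrary fixed seed); only the stone size counts are from the slides.

```python
import random

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable

# Stone sizes of the 1050 patients, using the counts from the slides.
sizes = ["Large"] * 606 + ["Small"] * 444

# Assumed randomisation scheme: a fair coin flip per patient, ignoring stone size.
assignment = [(size, rng.choice("XY")) for size in sizes]

def rate_large(treatment):
    """rate(Large | treatment) under the simulated random assignment."""
    group = [size for size, t in assignment if t == treatment]
    return group.count("Large") / len(group)

print(f"rate(Large | X) = {rate_large('X'):.3f}")
print(f"rate(Large | Y) = {rate_large('Y'):.3f}")
```

With random assignment the two rates should come out close to each other (both near 606/1050), in contrast to the observed 0.751 versus 0.229 when patients were not assigned at random.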
Summary

Proving association
• Main variables (X): rate(A | B) ≠ rate(A | NB) OR rate(B | A) ≠ rate(B | NA)
• Confounding variable (Stone size): prove using association

Chapter 2 end
We learnt how to analyse categorical variables from the perspective of:
• Tables
• Graphs
