3.2 - Relationships Between Categorical Variables

This document discusses analyzing relationships between categorical variables in data analysis. It uses education level as the outcome variable and age group/gender as predictor variables in sample data. Separate bar charts are created for education level by gender, showing females are slightly more likely to have college degrees. Side-by-side bar charts make the differences clearer. Separate plots also show education level differs significantly between age groups, with the youngest and oldest having far fewer college graduates. Side-by-side bars further highlight differences between age groups' educational distributions.

Uploaded by

franco668
FutureLearn 1

DATA TO INSIGHT: AN INTRODUCTION TO DATA ANALYSIS
THE UNIVERSITY OF AUCKLAND




WEEK 3
RELATIONSHIPS BETWEEN CATEGORICAL VARIABLES


Hi again. In this video, you'll learn how to plot data on two categorical variables so
that you can look for relationships between them.

In the week three introductory video, we talked about relationships in terms of a
variable of primary interest-- called an outcome variable-- and variables that might
help us predict the outcome. These we call predictor variables.

Using the enhanced 2009-2012 data, we'll investigate how age group and gender
predict educational achievement levels. Education is our outcome of interest. Age
and gender are our predictor variables.

First, using gender as a predictor, we want to see how the distribution of
educational attainment differs between females and males. Here we have two
separate plots for education -- one for females and one for males. If there was no
difference between the female and male distributions, we'd say there was no
relationship between education and gender.

The two graphs here are very similar but slightly different. For example, about 30%
of females are college graduates, compared with about 28% of males. It looks as
though the two right-hand bars for females are higher than those for males--
females slightly more likely to have college education, while the left-hand bars are
slightly shorter-- proportionately more males in the lower levels of educational
attainment.
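As a rough sketch of what each separate plot shows, the percentages can be computed within each predictor group so that every group's bars add to 100%. The records below are invented toy data, not the survey data used in the video, and the function name `outcome_percentages` is hypothetical.

```python
from collections import Counter

# Hypothetical toy records of (gender, education); NOT the data from the video.
records = (
    [("female", "HighSchool")] * 3
    + [("female", "CollegeGrad")] * 2
    + [("male", "HighSchool")] * 4
    + [("male", "CollegeGrad")] * 1
)

def outcome_percentages(records):
    """Return {group: {outcome: percent}}, with percentages summing to 100 within each group."""
    group_totals = Counter(group for group, _ in records)
    pair_counts = Counter(records)
    table = {}
    for (group, outcome), count in pair_counts.items():
        table.setdefault(group, {})[outcome] = 100.0 * count / group_totals[group]
    return table

percentages = outcome_percentages(records)
for group, dist in percentages.items():
    print(group, dist)
```

If the two printed distributions were identical, we would say there is no relationship between the outcome and the predictor in this toy data.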

We'll now rearrange the sets of bars so that corresponding bars for females and
males are placed beside one another. This is called a side-by-side bar chart. This
makes the small differences between the educational attainment outcomes much
more obvious. The colour coding tells us what predictor group we are looking at--
green for females, red for males.
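The rearrangement can be sketched in a few lines: the same percentages are regrouped so that each education category holds one bar per predictor group. The numbers are invented for illustration, and `side_by_side` is a hypothetical helper, not part of iNZight.

```python
# Hypothetical percentages, invented for illustration only.
separate = {
    "female": {"HighSchool": 60.0, "CollegeGrad": 40.0},
    "male": {"HighSchool": 80.0, "CollegeGrad": 20.0},
}

def side_by_side(separate, outcome_order):
    """Regroup bars so each outcome category holds one bar per predictor group."""
    return {
        outcome: {group: dist.get(outcome, 0.0) for group, dist in separate.items()}
        for outcome in outcome_order
    }

clusters = side_by_side(separate, ["HighSchool", "CollegeGrad"])

# Crude text rendering: one '#' per 5 percentage points.
for outcome, bars in clusters.items():
    print(outcome)
    for group, pct in bars.items():
        print(f"  {group:<7}{'#' * round(pct / 5)} {pct:.0f}%")
```

No percentage changes in the regrouping; only the order in which the bars are laid out does.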

Which plot should we use? Both! Both have their strengths, and we should use
both.

The first law of using graphics for discovery is that you should look at many types
of graphs. Often you'll spot something in one that you missed in another. A
separate set of plots of the outcome variable, one for each predictor group, is good
for revealing gross differences and overall shape. The side-by-side plot is good for
looking at detailed differences.

Age in decades is a predictor variable with more categories. Here we have a
separate plot for the outcome variable education for each category of age decade.
The separateness of the groups has been emphasised by using colour. If there
was no relationship between the outcome education and the predictor age decade,
all of these plots would be the same. But in fact, there are quite large differences.

These freehand curves emphasise how the plots differ in shape. The shapes of
the education distributions for the youngest and the oldest group are very different
from the shapes for any of the other age groups. The main difference is the
substantially lower percentages of college graduates. We'll think about possible
explanations later.

Now we'll use side-by-side bars to highlight the differences. Higher bars
correspond to larger percentages and lower bars to lower percentages.

On the right-hand side, we can see the reduction in the percentages of college
graduates with each decade from age 50, and the low percentage of college
graduates in the light blue 20-29 age group. The red 70+ group generally has less
education, shown by lower than usual bars for college graduates and by higher
than usual bars in all of the lower three categories of education. The light blue
20-29 group has unusually large percentages in the high school and some college
categories. We should expect this because many of the younger ones in particular
would not yet have finished their formal education.

You may have noticed in this plot that some bars are narrower than others. The
widths of the bars have been made proportional to the number of people in the
group. There are fewer than half as many people in the 70+ group than there are in
the first three age groups. For all their good points, a big disadvantage of the
side-by-side arrangement of bars is that people often get confused by these graphs,
particularly by getting the wrong way around which percentages are of what and
for whom.

They may look at the cluster of CollegeGrad bars and think they're being told
about the percentages of CollegeGrads who fall into each age group. But they are
not. The percentages in that cluster do not add to 100%. We have to think: what
is the outcome variable? Its percentages add to 100%. And for what groups are we
comparing those outcomes? The percentages that add to 100% are those four
bars with the same colour. They tell us about the outcome variable results for
people in that colour group.
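One quick way to check which bars belong together is to add them up. In the sketch below, the percentages are invented and simplified to three education categories (the plot in the video has four); they are not results from the survey.

```python
# Hypothetical percentages of education within each age group (invented, not survey results).
education_by_age = {
    "20-29": {"HighSchool": 50.0, "SomeCollege": 30.0, "CollegeGrad": 20.0},
    "40-49": {"HighSchool": 30.0, "SomeCollege": 30.0, "CollegeGrad": 40.0},
    "70+": {"HighSchool": 60.0, "SomeCollege": 25.0, "CollegeGrad": 15.0},
}
categories = ["HighSchool", "SomeCollege", "CollegeGrad"]

# Bars of one colour: a single age group across all education categories.
colour_totals = {age: sum(dist.values()) for age, dist in education_by_age.items()}

# Bars in one cluster: a single education category across all age groups.
cluster_totals = {
    category: sum(dist[category] for dist in education_by_age.values())
    for category in categories
}

print("same-colour totals:", colour_totals)
print("cluster totals:", cluster_totals)
```

Only the same-colour totals come to 100, which is the sign that education is the outcome and age group is the predictor.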

With iNZight graphs, the outcome variable is in the default graph title. If we see
distribution of education, we know the graph is telling us about educational
outcomes. The colour groups are the age groups. We're looking at the educational
outcomes of the various age groups. This is very different from the age group
outcomes for the different educational groups.
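To see how different the two readings are, both percentages can be worked out from the same table of counts. The counts below are invented for illustration, with education collapsed to two hypothetical categories.

```python
# Hypothetical counts of people by (age group, education); invented for illustration.
counts = {
    ("20-29", "CollegeGrad"): 20,
    ("20-29", "Other"): 80,
    ("70+", "CollegeGrad"): 10,
    ("70+", "Other"): 30,
}

# Total number of people aged 70+, and total number of college graduates.
total_70_plus = sum(n for (age, _), n in counts.items() if age == "70+")
total_grads = sum(n for (_, edu), n in counts.items() if edu == "CollegeGrad")

# Education as the outcome: of the people aged 70+, what percentage are college graduates?
grads_among_70_plus = 100 * counts[("70+", "CollegeGrad")] / total_70_plus

# Age as the outcome: of the college graduates, what percentage are aged 70+?
share_70_plus_among_grads = 100 * counts[("70+", "CollegeGrad")] / total_grads

print(f"College graduates among the 70+ group: {grads_among_70_plus:.1f}%")
print(f"70+ group among college graduates: {share_70_plus_among_grads:.1f}%")
```

The same cell of the table gives two different percentages, depending on which variable is treated as the outcome.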

To keep your bearings, it is best to look at both separate and side-by-side plots.
And bear in mind that the side-by-side plot is just a rearrangement of the separate
plots' bars.

In summary, our main tools for investigating the relationship between two
categorical variables are separate bar charts of the outcome variable for each
predictor group and side-by-side bar charts.

Separate bar charts are best for revealing gross differences and overall shape,
while side-by-side bar charts are better for highlighting detailed differences
between corresponding categories.

Finally, I'll leave you with these questions to remind you of the ideas we've just
covered.
