Lecture 4 - 9 - Association Between Categorical and Numerical Variables

The document discusses the association between categorical and numerical variables. It provides an example of examining the relationship between gender (categorical) and exam marks (numerical) using a scatter plot. It then defines the point bi-serial correlation coefficient as a method to quantify the association between a dichotomous categorical variable and a continuous numerical variable. The coefficient calculation involves finding the mean of the numerical variable for each category, the proportions in each category, and the standard deviation of the numerical variable.

Statistics for Data Science -1

Lecture 4.9: Association between categorical and numerical variables

Usha Mohan

Indian Institute of Technology Madras

Association between categorical and numerical variable

Introduction

- Understand the association between a categorical variable and a numerical variable.
- Assume the categorical variable has two categories (dichotomous).


Example 1: Gender versus marks

A teacher was interested in knowing whether female students performed
better than male students in her class. She collected data from
twenty students: their gender and the marks they obtained out of 100
in the subject.


Example 1: Gender versus marks-Data

No.  Gender  Marks
 1   F       71
 2   F       67
 3   F       65
 4   M       69
 5   M       75
 6   M       83
 7   F       91
 8   F       85
 9   F       69
10   F       75
11   M       92
12   F       79
13   M       71
14   M       94
15   F       86
16   F       75
17   F       90
18   M       84
19   F       91
20   M       90
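
As a first look at the data, the marks can be grouped by gender and summarised. The following is a minimal sketch using only the Python standard library; the name `summarize_by_group` is a hypothetical helper introduced here for illustration and is not part of the lecture.

```python
from statistics import mean

# (gender, marks) pairs transcribed from the Example 1 data table.
records = [
    ("F", 71), ("F", 67), ("F", 65), ("M", 69), ("M", 75),
    ("M", 83), ("F", 91), ("F", 85), ("F", 69), ("F", 75),
    ("M", 92), ("F", 79), ("M", 71), ("M", 94), ("F", 86),
    ("F", 75), ("F", 90), ("M", 84), ("F", 91), ("M", 90),
]


def summarize_by_group(records):
    """Return {category: (count, proportion, mean)} for (category, value) pairs."""
    groups = {}
    for category, value in records:
        groups.setdefault(category, []).append(value)
    n = len(records)
    return {c: (len(v), len(v) / n, mean(v)) for c, v in groups.items()}


summary = summarize_by_group(records)
for category, (count, proportion, group_mean) in summary.items():
    print(f"{category}: n={count}, proportion={proportion:.2f}, mean={group_mean:.2f}")
```

The group counts, proportions, and means computed here are the quantities that the point bi-serial correlation coefficient combines.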


Example 1: Scatter plot



Point Bi-serial Correlation Coefficient


- Let X be a numerical variable and Y be a categorical variable
  with two categories (a dichotomous variable).
- The following steps are used to calculate the point bi-serial
  correlation between these two variables:
  Step 1: Code the two categories of Y as 0 and 1, and split the data
          into two groups according to the value of Y.
  Step 2: Calculate the mean of X in each group. Let $\bar{Y}_0$ and
          $\bar{Y}_1$ denote the mean values of X in the groups with
          Y = 0 and Y = 1, respectively.
  Step 3: Let $p_0$ and $p_1$ be the proportions of observations in
          the groups with Y = 0 and Y = 1, respectively, and let $s_X$
          be the standard deviation of X.

The point bi-serial correlation coefficient is

$$ r_{pb} = \left( \frac{\bar{Y}_0 - \bar{Y}_1}{s_X} \right) \sqrt{p_0 \, p_1} $$
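
The three steps can be sketched in Python as follows. This is an illustrative sketch, not the lecture's own code, and it rests on two assumptions that the slides do not state: the coding F = 0 and M = 1, and the use of the population standard deviation (dividing by n) for the standard deviation of X. With the sample standard deviation the value would differ slightly. The function name `point_biserial` is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(x, y):
    """Point bi-serial correlation: (mean0 - mean1) / s_x * sqrt(p0 * p1).

    x: numerical values; y: 0/1 codes of the dichotomous variable.
    Assumption: s_x is the population standard deviation of x.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Step 1: split x into two groups by the value of y.
    group0 = [xi for xi, yi in zip(x, y) if yi == 0]
    group1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not group0 or not group1:
        raise ValueError("both categories must be present")
    # Step 2: group means.
    mean0, mean1 = mean(group0), mean(group1)
    # Step 3: group proportions and the standard deviation of x.
    n = len(x)
    p0, p1 = len(group0) / n, len(group1) / n
    s_x = pstdev(x)
    if s_x == 0:
        raise ValueError("x has zero variance")
    return (mean0 - mean1) / s_x * sqrt(p0 * p1)


# Example 1 data; the coding F = 0, M = 1 is an assumption made here.
genders = "FFFMMMFFFFMFMMFFFMFM"
marks = [71, 67, 65, 69, 75, 83, 91, 85, 69, 75,
         92, 79, 71, 94, 86, 75, 90, 84, 91, 90]
y = [0 if g == "F" else 1 for g in genders]

r_pb = point_biserial(marks, y)
print(f"r_pb = {r_pb:.3f}")
```

Under this coding the coefficient comes out negative, because the mean mark of the group coded 0 (female) is lower than that of the group coded 1 (male) in this sample. Swapping the 0/1 coding flips the sign but not the magnitude.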