0% found this document useful (0 votes)
12 views25 pages

Lecture 15 - Statistics For Data Science (Inferential Statistics)

The document covers inferential statistics, focusing on hypothesis testing, including null and alternate hypotheses, significance levels, and p-values. It explains statistical tests such as t-tests and the assumptions required for their application, as well as methods for testing normality and variance homogeneity. Examples are provided using Python code to illustrate the concepts discussed.

Uploaded by

gokulmohan4002
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views25 pages

Lecture 15 - Statistics For Data Science (Inferential Statistics)

The document covers inferential statistics, focusing on hypothesis testing, including null and alternate hypotheses, significance levels, and p-values. It explains statistical tests such as t-tests and the assumptions required for their application, as well as methods for testing normality and variance homogeneity. Examples are provided using Python code to illustrate the concepts discussed.

Uploaded by

gokulmohan4002
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Inferential Statistics

Hypothesis
A hypothesis is an assumption that is neither confirmed nor disproved.
An example of a hypothesis is: "There is no difference between wages between
men and women"
Hypothesis testing: Analyze data to either reject or accept the hypothesis

Null and Alternate hypothesis


Null hypothesis (H0): "There is no difference between wages between men and
women"
Alternate hypothesis (H1): "Wages of men and women are different"
If the sample data is consistent with the null hypothesis, then you do not reject the
null hypothesis.
If the sample data is inconsistent with the null hypothesis, then you reject the null
hypothesis and conclude that the alternative hypothesis is true

Level of significance
How strong is the evidence from the sample to reject the null hypothesis?
Typical values: 5% (0.05) or 1% (0.01)

p-value
Charachterizes the strength of the sample evidence
If the p-value is less than or equal to the significance level, you reject the null
hypothesis.

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 1/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

If the p-value is greater than the significance level, you do not reject the null
hypothesis.

In [3]: # Example
# Dataset for the fuel cost of 25 families for this year
import pandas as pd
fuel_df = pd.read_csv('data/FuelsCosts.csv')

In [4]: # Descriptive statistics


fuel_df['Fuel Cost'].describe()
# Average is 330.56

Out[4]: count 25.000000


mean 330.560000
std 154.177679
min 77.000000
25% 205.000000
50% 320.000000
75% 435.000000
max 676.000000
Name: Fuel Cost, dtype: float64

It is known that last year's average fuel cost was 260


Is this year's fuel cost greater than last year?
From the descriptive statistics it looks like it is true.
Is it sampling error?

Hypothesis definition:

Null hypothesis: The population mean equals the null hypothesis mean (260).
Alternative hypothesis: The population mean does not equal the null hypothesis
mean (260).

Fundamental problem (Uncertainity) in


hypothesis testing

Sampling distribution
(Hypothetically) If we are able to get many samples from the population (assuming
average fuel cost of population is 260)

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 2/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Null hypothesis: The population mean equals the null hypothesis mean (260).

Alternative hypothesis: The population mean does not equal the null hypothesis
mean (260).

Significance level - 0.05

Sample is statistically significant (at 0.05 level) since it falls in the critical region

P-values

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 3/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Assume null hypothesis is true, P-value is the probability that a sample will have an
effect at least as extreme as the effect observed in your sample.

Observing a sample mean at least as extreme as our sample mean

Statistical Tests
Sampling distribution cannot be obtained in practice
Rely on statistical assumption regarding the population

"Typical" statistical assumptions


Variance homogenity
Normality of population

Variance homogeneity
Equal variance in the samples

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 4/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

H0: Groups have equal variances


H1: Groups have different variances

In [3]: import pandas as pd


math = [21,23,17,11,9,27,22,12,20,4]
hist = [18,22,19,26,13,24,23,17,21,15]
psyh = [17,16,23,7,26,9,25,21,14,20]
dict = {'Math': math, 'History': hist, 'Psychology': psyh}
df = pd.DataFrame(dict)
df

Out[3]: Math History Psychology

0 21 18 17

1 23 22 16

2 17 19 23

3 11 26 7

4 9 13 26

5 27 24 9

6 22 23 25

7 12 17 21

8 20 21 14

9 4 15 20

In [4]: df.agg(['mean', 'std'],axis=0)

Out[4]: Math History Psychology

mean 16.600000 19.800000 17.800000

std 7.290786 4.131182 6.442912

In [6]: import seaborn as sns


sns.boxplot(df)

Out[6]: <Axes: >

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 5/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [9]: # Use Levene test for statistical testing


from scipy.stats import levene
w_stats, p_value = levene(df.Math, df.History, df.Psychology,center='mean
p_value

Out[9]: 0.1527571716627755

In [13]: import numpy as np


small_dose = np.array([4.2, 11.5, 7.3, 5.8, 6.4, 10, 11.2, 11.2, 5.2, 7,1
21.5, 17.6, 9.7, 14.5, 10, 8.2, 9.4, 16.5, 9.7
medium_dose = np.array([16.5, 16.5, 15.2, 17.3, 22.5, 17.3, 13.6, 14.5, 1
19.7, 23.3, 23.6, 26.4, 20, 25.2, 25.8, 21.2, 14.
large_dose = np.array([23.6, 18.5, 33.9, 25.5, 26.4, 32.5, 26.7, 21.5, 23
25.5, 26.4, 22.4, 24.5, 24.8, 30.9, 26.4, 27.3, 29
])
w_stats, p_value = levene(small_dose, medium_dose, large_dose)
display(p_value)
w_stats

0.5280694573759905
Out[13]: 0.6457341109631506

Statistical test for normal distribution


Null: The sample data follow the normal distribution.
Alternative: The sample data do not follow the normal distri-bution.

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 6/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [16]: import matplotlib.pyplot as plt


x = np.random.normal(size=300)
plt.hist(x)

Out[16]: (array([ 4., 11., 43., 69., 77., 63., 23., 8., 1., 1.]),
array([-2.83217082, -2.19848507, -1.56479933, -0.93111358, -0.29742783,
0.33625792, 0.96994366, 1.60362941, 2.23731516, 2.87100091,
3.50468665]),
<BarContainer object of 10 artists>)

In [18]: from scipy.stats import shapiro


w_stats, p_value = shapiro(x)
p_value

Out[18]: 0.3422810733318329

Disadvantage of analytical tests for normal


distribution
In [48]: np.random.seed(0)

In [60]: import matplotlib.pyplot as plt


x = np.random.normal(size=10)
w_stats, p_value = shapiro(x)
p_value

Out[60]: 0.4637315273284912

In [88]: import matplotlib.pyplot as plt


x = np.random.normal(size=100)

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 7/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

w_stats, p_value = shapiro(x)


p_value

Out[88]: 0.10476154088973999

Graphical test for normality - QQ plot


A QQ plot is a scatterplot created by plotting two sets of quantiles against one
another.
If both sets of quantiles came from the same distribution, we should see the points
forming a line that's roughly straight
For normality test: On one axis we have the normal quantiles and the other axis we
have empirical quantiles

In [92]: import numpy as np


import matplotlib.pyplot as plt
import statsmodels.api as sm
x = np.random.normal(0,1, 250) # Mean - 5, Std dev - 5
sm.qqplot(x, line='45')
plt.show()

T- test

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 8/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

One-sample t-test
Null: The population mean equals the hypothesized mean.
Alternative: The population mean does not equal the hypothesized mean.

One-sample t-test - Assumptions


You have a random sample
Your data must be continuous
Your sample data should follow a normal distribution or have more than 20
observations

In [107… import numpy as np


from scipy import stats
data = np.random.normal(loc=0, scale=1, size=100)
# perform one sample t-test
# Null hypothesis: Mean of the population is 0
t_statistic, p_value = stats.ttest_1samp(a=data, popmean=0)
display(p_value)
# Null hypothesis: Mean of the population is 0.5
t_statistic, p_value = stats.ttest_1samp(a=data, popmean=0.5)
display(p_value)

0.10310814522004697
3.3912901586226853e-10

Two sample t-test

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 9/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Two-sample t-test
Null hypothesis: The means for the two populations are equal.
Alternative hypothesis: The means for the two populations are not equal.

Two-sample t-test (Assumptions)


You have a representative, random sample
Your data must be continuous
Your sample data should follow a normal distribution or each group has more than
15 observations
The groups are independent
Equal/Non-equal variance

In [14]: import numpy as np


import pandas as pd
methodA = np.array([72.47171449,72.10054831,69.70021876,61.29469083,76.50
methodB = np.array([72.14533547,89.81136212,98.07199674,84.48697781,80.53
data = pd.DataFrame.from_dict({"methodA": methodA, "methodB": methodB})
display(data.head())
data.describe()

methodA methodB

0 72.471714 72.145335

1 72.100548 89.811362

2 69.700219 98.071997

3 61.294691 84.486978

4 76.509736 80.530738

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 10/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[14]: methodA methodB

count 15.000000 15.000000

mean 71.503618 84.742407

std 9.413837 8.311136

min 52.743031 70.828394

25% 66.485161 79.873636

50% 72.100548 84.858600

75% 76.204012 91.370261

max 89.807999 98.071997

In [12]: import seaborn as sns


sns.boxplot(data)

Out[12]: <Axes: >

In [6]: import numpy as np


methodA = np.array([72.47171449,72.10054831,69.70021876,61.29469083,76.50
methodB = np.array([72.14533547,89.81136212,98.07199674,84.48697781,80.53
# Assume variance are equal
from scipy import stats
t_statistic, p_value = stats.ttest_ind(a=methodA, b=methodB)
display(p_value)

0.00033619096524951366

In [13]: # Assume variance are unequal


from scipy import stats

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 11/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

t_statistic, p_value = stats.ttest_ind(a=methodA, b=methodB, equal_var=Fa


display(p_value)

0.00034397297724683126

Paired t-test
Typical use cases

Does drug XY help you lose weight?


Is there a difference between people with and without a degree in terms of their
health?

Paired two-sample t-test


Null hypothesis: The mean difference is 0
Alternative hypothesis: The mean difference is not 0

Paired t-test (Assumptions)


Two dependent groups or samples are available
The variables are metric scaled
The differences of the paired values is normally distributed

In [22]: sid = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])


Pretest = np.array([90.56294638,94.8157883,109.5622995,90.2216652,97.5977
Posttest = np.array([110.6419831,101.5879712,120.6071699,83.22167741,109.
data = pd.DataFrame.from_dict({"SubjectID": sid, "Pretest": Pretest, "Pos
data.head()

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 12/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[22]: SubjectID Pretest Posttest

0 1 90.562946 110.641983

1 2 94.815788 101.587971

2 3 109.562299 120.607170

3 4 90.221665 83.221677

4 5 97.597791 109.272439

In [23]: data.describe()

Out[23]: SubjectID Pretest Posttest

count 15.000000 15.000000 15.000000

mean 8.000000 97.062228 107.834628

std 4.472136 10.306151 13.245207

min 1.000000 87.871787 82.822880

25% 4.500000 90.392306 100.741889

50% 8.000000 93.793262 109.272439

75% 11.500000 97.607024 117.289836

max 15.000000 121.287283 128.609768

In [27]: data.drop('SubjectID', axis=1, inplace=True)

In [28]: sns.boxplot(data)

Out[28]: <Axes: >

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 13/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [33]: import scipy.stats as stats


t_statistic, p_value = stats.ttest_rel(a=data.Pretest, b=data.Posttest)
p_value

Out[33]: 0.002220727093721955

Statistical tests to test for differences in more


than two groups (ANOVA)

One way ANOVA


Difference between mean of three or more different groups

One-way ANOVA
Null: All group mean are equal
Alternative: Not all group mean are equal

One-way ANOVA
Random samples
Independent groups
Independent variable is categorical
Dependent variable is nominal
Your sample data should follow a normal distribution or each group has more than
15 or 20 observations
Groups should have roughly equal variances

In [53]: dataS = pd.read_csv("data/OneWayExample.csv")


dataS

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 14/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[53]: Supplier 1 Supplier 2 Supplier 3 Supplier 4

0 11.715501 10.566155 10.283346 6.903486

1 11.981569 13.455359 12.177732 8.990110

2 8.043929 7.418840 10.559808 6.971273

3 10.558160 12.031314 9.655187 9.160390

4 14.079463 7.776633 8.790275 8.678426

5 10.776867 10.748939 10.862457 11.443832

6 7.860270 10.726980 10.378184 10.780441

7 11.889672 4.477291 10.188052 5.666760

8 11.942314 6.803820 11.624520 10.776041

9 13.177454 5.371892 12.305905 9.008765

In [54]: sns.boxplot(dataS)

Out[54]: <Axes: >

In [55]: import scipy.stats as stats


t_statistic, p_value = stats.f_oneway(dataS["Supplier 1"], dataS["Supplie
p_value

Out[55]: 0.031054179322102093

In [ ]: import scipy.stats as stats


t_statistic, p_value = stats.f_oneway(dataS["Supplier 1"], dataS["Supplie
p_value

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 15/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Two-Way ANOVA
Two-way ANOVA to assess differences between group means that are defined by
two categorical factors
Example: In addition to gender (Male/Female) does the highest level of education
have an influence on salary
Questions:
Does factor 1 have an effect on dependent variable?
Does factor 2 have an effect on dependent variable?
Is there interaction between factor 1 and factor 2?

In [11]: import pandas as pd


data_anova = pd.read_csv('data/Two-WayANOVAExamples.csv')
data_anova.drop(['Unnamed: 3', 'Food', 'Condiment', 'Enjoyment'], axis=1,
data_anova.head()

Out[11]: Gender Major Income

0 Male Statistics 78504.55540

1 Male Statistics 76268.90888

2 Male Statistics 66657.85452

3 Male Statistics 78026.35568

4 Male Statistics 83485.21734

In [16]: data_anova.groupby(['Gender', "Major"]).mean().unstack()

Out[16]: Income

Major Political Science Psychology Statistics

Gender

Female 55191.757954 66694.817654 74074.69174

Male 62015.975797 69539.851907 77743.38074

How to do Two-way ANOVA


Build a linear model

Two-way ANOVA
Hypothesis (3 implicit hypothesis):

Null:

The means of observations grouped by one factor are the same


The means of observations grouped by the other factor are the same

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 16/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

There is no interaction between the two factors.

In [27]: import statsmodels.api as sm


from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.graphics.regressionplots import abline_plot

In [54]: # Example with no interaction


formula = 'Income ~ C(Gender) + C(Major) + C(Gender):C(Major)'
model = ols(formula, data_anova).fit()

In [45]: model.summary()

Out[45]: OLS Regression Results


Dep. Variable: Income R-squared: 0.719

Model: OLS Adj. R-squared: 0.706

Method: Least Squares F-statistic: 58.20

Date: Wed, 23 Oct 2024 Prob (F-statistic): 8.52e-30

Time: 18:09:38 Log-Likelihood: -1184.2

No. Observations: 120 AIC: 2380.

Df Residuals: 114 BIC: 2397.

Df Model: 5

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 5.519e+04 1072.125 51.479 0.000 5.31e+04 5.73e+04

C(Gender)[T.Male] 6824.2178 1516.213 4.501 0.000 3820.612 9827.824

C(Major)[T.Psychology] 1.15e+04 1516.213 7.587 0.000 8499.453 1.45e+04

C(Major)[T.Statistics] 1.888e+04 1516.213 12.454 0.000 1.59e+04 2.19e+04

C(Gender)
[T.Male]:C(Major) -3979.1836 2144.249 -1.856 0.066 -8226.924 268.557
[T.Psychology]

C(Gender)
[T.Male]:C(Major) -3155.5288 2144.249 -1.472 0.144 -7403.270 1092.212
[T.Statistics]

Omnibus: 2.379 Durbin-Watson: 1.891

Prob(Omnibus): 0.304 Jarque-Bera (JB): 1.673

Skew: 0.053 Prob(JB): 0.433

Kurtosis: 2.431 Cond. No. 9.77

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 17/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [46]: aov_table = anova_lm(model, typ=2)


aov_table

Out[46]: sum_sq df F PR(>F)

C(Gender) 5.930022e+08 1.0 25.795022 1.495670e-06

C(Major) 6.009141e+09 2.0 130.695897 3.147345e-30

C(Gender):C(Major) 8.823224e+07 2.0 1.919008 1.514633e-01

Residual 2.620748e+09 114.0 NaN NaN

In [62]: from statsmodels.graphics.factorplots import interaction_plot


fig=interaction_plot(x = data_anova.Major, trace=data_anova.Gender, respo

C:\Users\ANUP\anaconda3\Lib\site-packages\statsmodels\graphics\factorplot
s.py:113: FutureWarning: The provided callable <function mean at 0x0000021
B10FD4040> is currently using DataFrameGroupBy.mean. In a future version o
f pandas, the provided callable will be used directly. To keep current beh
avior pass the string "mean" instead.
plot_data = data.groupby(['trace', 'x']).aggregate(func).reset_index()

In [87]: data2_anova = pd.read_csv('data/Two-WayANOVAExamples.csv')


data2_anova.drop(['Gender', 'Major', 'Income', 'Unnamed: 3'], axis=1, inp
data2_anova.dropna(inplace=True)
data2_anova.head()

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 18/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[87]: Food Condiment Enjoyment

0 Hot Dog Mustard 81.926957

1 Hot Dog Mustard 84.939774

2 Hot Dog Mustard 90.286479

3 Hot Dog Mustard 89.561802

4 Hot Dog Mustard 97.676826

In [88]: data2_anova.groupby(['Food', "Condiment"]).mean().unstack()

Out[88]: Enjoyment

Condiment Chocolate Sauce Mustard

Food

Hot Dog 65.316612 89.605687

Ice Cream 93.048096 61.308913

In [89]: # Example with interaction


formula = 'Enjoyment ~ C(Food) + C(Condiment) + C(Food):C(Condiment)'
model = ols(formula, data2_anova).fit()

In [90]: model.summary()

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 19/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[90]: OLS Regression Results


Dep. Variable: Enjoyment R-squared: 0.893

Model: OLS Adj. R-squared: 0.889

Method: Least Squares F-statistic: 212.4

Date: Thu, 24 Oct 2024 Prob (F-statistic): 7.41e-37

Time: 10:35:23 Log-Likelihood: -240.33

No. Observations: 80 AIC: 488.7

Df Residuals: 76 BIC: 498.2

Df Model: 3

Covariance Type: nonrobust

std
coef t P>|t| [0.025 0.975]
err

Intercept 65.3166 1.120 58.343 0.000 63.087 67.546

C(Food)[T.Ice Cream] 27.7315 1.583 17.515 0.000 24.578 30.885

C(Condiment)[T.Mustard] 24.2891 1.583 15.341 0.000 21.136 27.442

C(Food)[T.Ice
-56.0283 2.239 -25.023 0.000 -60.488 -51.569
Cream]:C(Condiment)[T.Mustard]

Omnibus: 2.073 Durbin-Watson: 1.985

Prob(Omnibus): 0.355 Jarque-Bera (JB): 2.035

Skew: 0.325 Prob(JB): 0.362

Kurtosis: 2.566 Cond. No. 6.85

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [91]: aov_table = anova_lm(model, typ=2)


aov_table

Out[91]: sum_sq df F PR(>F)

C(Food) 1.597762 1.0 0.063740 8.013618e-01

C(Condiment) 277.520500 1.0 11.071120 1.353292e-03

C(Food):C(Condiment) 15695.828458 1.0 626.153372 1.953476e-38

Residual 1905.097085 76.0 NaN NaN

In [92]: data2_anova.groupby(['Food']).mean('Enjoyment')

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 20/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[92]: Enjoyment

Food

Hot Dog 77.461150

Ice Cream 77.178505

In [93]: data2_anova.groupby(['Condiment']).mean('Enjoyment')

Out[93]: Enjoyment

Condiment

Chocolate Sauce 79.182354

Mustard 75.457300

In [95]: # Interaction implies that higher enjoyment of chocolate sauce is due to


fig=interaction_plot(x = data2_anova.Condiment, trace=data2_anova.Food, r

C:\Users\ANUP\anaconda3\Lib\site-packages\statsmodels\graphics\factorplot
s.py:113: FutureWarning: The provided callable <function mean at 0x0000021
B10FD4040> is currently using DataFrameGroupBy.mean. In a future version o
f pandas, the provided callable will be used directly. To keep current beh
avior pass the string "mean" instead.
plot_data = data.groupby(['trace', 'x']).aggregate(func).reset_index()

Categorical variables

Product recommendation

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 21/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Single categorical variable


In [127… data_cat2 = sm.datasets.get_rdataset("Melanoma","MASS")
data_cat2.data

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 22/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[127… time status sex age year thickness ulcer

0 10 3 1 76 1972 6.76 1

1 30 3 1 56 1968 0.65 0

2 35 2 1 41 1977 1.34 0

3 99 3 0 71 1968 2.90 0

4 185 1 1 52 1965 12.08 1

... ... ... ... ... ... ... ...

200 4492 2 1 29 1965 7.06 1

201 4668 2 0 40 1965 6.12 0

202 4688 2 0 42 1965 0.48 0

203 4926 2 0 50 1964 2.26 0

204 5565 2 0 41 1962 2.90 0

205 rows × 7 columns

In [138… vcnt= data_cat2.data['ulcer'].value_counts()


display(vcnt)
display(data_cat2.data['ulcer'].value_counts(normalize=True))

ulcer
0 115
1 90
Name: count, dtype: int64
ulcer
0 0.560976
1 0.439024
Name: proportion, dtype: float64

Hypothesis
Null: The population propotion is same as the population mean
Alternate: The population propotion not the same as the population mean

Null: 50% of the Melanoma cancer ulcerate

In [142… from statsmodels.stats.proportion import proportions_ztest


proportions_ztest(count=vcnt[0], nobs=vcnt[0]+vcnt[1], value=0.5)

Out[142… (1.759206287871047, 0.07854247665392992)

Are two categorical variable independent?


Null: The variables are independent
Alternate: A relation exists between the variables

In [106… data_cat = sm.datasets.get_rdataset("birthwt","MASS")


data_cat.data

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 23/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[106… low age lwt race smoke ptl ht ui ftv bwt

rownames

85 0 19 182 2 0 0 0 1 0 2523

86 0 33 155 3 0 0 0 0 3 2551

87 0 20 105 1 1 0 0 0 1 2557

88 0 21 108 1 1 0 0 1 2 2594

89 0 18 107 1 1 0 0 1 0 2600

... ... ... ... ... ... ... ... ... ... ...

79 1 28 95 1 1 0 0 0 2 2466

81 1 14 100 3 0 0 0 0 2 2495

82 1 23 94 3 1 0 0 0 0 2495

83 1 17 142 2 0 0 1 0 0 2495

84 1 21 130 1 1 0 1 0 3 2495

189 rows × 10 columns

Does smoking cause low birth weight?


In [121… contingency_table = pd.crosstab(data_cat.data.smoke, data_cat.data.low)
contingency_table

Out[121… low 0 1

smoke

0 86 29

1 44 30

In [116… contingency_table_norm = pd.crosstab(data_cat.data.smoke, data_cat.data.


contingency_table_norm

Out[116… low 0 1

smoke

0 0.747826 0.252174

1 0.594595 0.405405

Chi-square test for association


Null: Variables are independent
Alternative: A relationship exists between the variables

In [124… from scipy.stats import chi2_contingency


res = chi2_contingency(contingency_table, correction=0)

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 24/25


10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

res.pvalue

Out[124… 0.026490642530502487

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 25/25

You might also like