0% found this document useful (0 votes)

12 views25 pages

Lecture 15 - Statistics For Data Science (Inferential Statistics)

The document covers inferential statistics, focusing on hypothesis testing, including null and alternate hypotheses, significance levels, and p-values. It explains statistical tests such as t-tests and the assumptions required for their application, as well as methods for testing normality and variance homogeneity. Examples are provided using Python code to illustrate the concepts discussed.

Uploaded by

gokulmohan4002

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views25 pages

Lecture 15 - Statistics For Data Science (Inferential Statistics)

Uploaded by

gokulmohan4002

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Inferential Statistics

Hypothesis
A hypothesis is an assumption that is neither confirmed nor disproved.
An example of a hypothesis is: "There is no difference between wages between
men and women"
Hypothesis testing: Analyze data to either reject or accept the hypothesis

Null and Alternate hypothesis

Null hypothesis (H0): "There is no difference between wages between men and
women"
Alternate hypothesis (H1): "Wages of men and women are different"
If the sample data is consistent with the null hypothesis, then you do not reject the
null hypothesis.
If the sample data is inconsistent with the null hypothesis, then you reject the null
hypothesis and conclude that the alternative hypothesis is true

Level of significance
How strong is the evidence from the sample to reject the null hypothesis?
Typical values: 5% (0.05) or 1% (0.01)

p-value
Charachterizes the strength of the sample evidence
If the p-value is less than or equal to the significance level, you reject the null
hypothesis.

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 1/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

If the p-value is greater than the significance level, you do not reject the null
hypothesis.

In [3]: # Example
# Dataset for the fuel cost of 25 families for this year
import pandas as pd
fuel_df = pd.read_csv('data/FuelsCosts.csv')

In [4]: # Descriptive statistics

fuel_df['Fuel Cost'].describe()
# Average is 330.56

Out[4]: count 25.000000

mean 330.560000
std 154.177679
min 77.000000
25% 205.000000
50% 320.000000
75% 435.000000
max 676.000000
Name: Fuel Cost, dtype: float64

It is known that last year's average fuel cost was 260

Is this year's fuel cost greater than last year?
From the descriptive statistics it looks like it is true.
Is it sampling error?

Hypothesis definition:

Null hypothesis: The population mean equals the null hypothesis mean (260).
Alternative hypothesis: The population mean does not equal the null hypothesis
mean (260).

Fundamental problem (Uncertainity) in

hypothesis testing

Sampling distribution
(Hypothetically) If we are able to get many samples from the population (assuming
average fuel cost of population is 260)

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 2/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Null hypothesis: The population mean equals the null hypothesis mean (260).

Alternative hypothesis: The population mean does not equal the null hypothesis
mean (260).

Significance level - 0.05

Sample is statistically significant (at 0.05 level) since it falls in the critical region

P-values

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 3/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Assume null hypothesis is true, P-value is the probability that a sample will have an
effect at least as extreme as the effect observed in your sample.

Observing a sample mean at least as extreme as our sample mean

Statistical Tests
Sampling distribution cannot be obtained in practice
Rely on statistical assumption regarding the population

"Typical" statistical assumptions

Variance homogenity
Normality of population

Variance homogeneity
Equal variance in the samples

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 4/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

H0: Groups have equal variances

H1: Groups have different variances

In [3]: import pandas as pd

math = [21,23,17,11,9,27,22,12,20,4]
hist = [18,22,19,26,13,24,23,17,21,15]
psyh = [17,16,23,7,26,9,25,21,14,20]
dict = {'Math': math, 'History': hist, 'Psychology': psyh}
df = pd.DataFrame(dict)
df

Out[3]: Math History Psychology

0 21 18 17

1 23 22 16

2 17 19 23

3 11 26 7

4 9 13 26

5 27 24 9

6 22 23 25

7 12 17 21

8 20 21 14

9 4 15 20

In [4]: df.agg(['mean', 'std'],axis=0)

Out[4]: Math History Psychology

mean 16.600000 19.800000 17.800000

std 7.290786 4.131182 6.442912

In [6]: import seaborn as sns

sns.boxplot(df)

Out[6]: <Axes: >

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 5/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [9]: # Use Levene test for statistical testing

from scipy.stats import levene
w_stats, p_value = levene(df.Math, df.History, df.Psychology,center='mean
p_value

Out[9]: 0.1527571716627755

In [13]: import numpy as np

small_dose = np.array([4.2, 11.5, 7.3, 5.8, 6.4, 10, 11.2, 11.2, 5.2, 7,1
21.5, 17.6, 9.7, 14.5, 10, 8.2, 9.4, 16.5, 9.7
medium_dose = np.array([16.5, 16.5, 15.2, 17.3, 22.5, 17.3, 13.6, 14.5, 1
19.7, 23.3, 23.6, 26.4, 20, 25.2, 25.8, 21.2, 14.
large_dose = np.array([23.6, 18.5, 33.9, 25.5, 26.4, 32.5, 26.7, 21.5, 23
25.5, 26.4, 22.4, 24.5, 24.8, 30.9, 26.4, 27.3, 29
])
w_stats, p_value = levene(small_dose, medium_dose, large_dose)
display(p_value)
w_stats

0.5280694573759905
Out[13]: 0.6457341109631506

Statistical test for normal distribution

Null: The sample data follow the normal distribution.
Alternative: The sample data do not follow the normal distri-bution.

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 6/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [16]: import matplotlib.pyplot as plt

x = np.random.normal(size=300)
plt.hist(x)

Out[16]: (array([ 4., 11., 43., 69., 77., 63., 23., 8., 1., 1.]),
array([-2.83217082, -2.19848507, -1.56479933, -0.93111358, -0.29742783,
0.33625792, 0.96994366, 1.60362941, 2.23731516, 2.87100091,
3.50468665]),
<BarContainer object of 10 artists>)

In [18]: from scipy.stats import shapiro

w_stats, p_value = shapiro(x)
p_value

Out[18]: 0.3422810733318329

Disadvantage of analytical tests for normal

distribution
In [48]: np.random.seed(0)

In [60]: import matplotlib.pyplot as plt

x = np.random.normal(size=10)
w_stats, p_value = shapiro(x)
p_value

Out[60]: 0.4637315273284912

In [88]: import matplotlib.pyplot as plt

x = np.random.normal(size=100)

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 7/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

w_stats, p_value = shapiro(x)

p_value

Out[88]: 0.10476154088973999

Graphical test for normality - QQ plot

A QQ plot is a scatterplot created by plotting two sets of quantiles against one
another.
If both sets of quantiles came from the same distribution, we should see the points
forming a line that's roughly straight
For normality test: On one axis we have the normal quantiles and the other axis we
have empirical quantiles

In [92]: import numpy as np

import matplotlib.pyplot as plt
import statsmodels.api as sm
x = np.random.normal(0,1, 250) # Mean - 5, Std dev - 5
sm.qqplot(x, line='45')
plt.show()

T- test

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 8/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

One-sample t-test
Null: The population mean equals the hypothesized mean.
Alternative: The population mean does not equal the hypothesized mean.

One-sample t-test - Assumptions

You have a random sample
Your data must be continuous
Your sample data should follow a normal distribution or have more than 20
observations

In [107… import numpy as np

from scipy import stats
data = np.random.normal(loc=0, scale=1, size=100)
# perform one sample t-test
# Null hypothesis: Mean of the population is 0
t_statistic, p_value = stats.ttest_1samp(a=data, popmean=0)
display(p_value)
# Null hypothesis: Mean of the population is 0.5
t_statistic, p_value = stats.ttest_1samp(a=data, popmean=0.5)
display(p_value)

0.10310814522004697
3.3912901586226853e-10

Two sample t-test

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 9/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Two-sample t-test
Null hypothesis: The means for the two populations are equal.
Alternative hypothesis: The means for the two populations are not equal.

Two-sample t-test (Assumptions)

You have a representative, random sample
Your data must be continuous
Your sample data should follow a normal distribution or each group has more than
15 observations
The groups are independent
Equal/Non-equal variance

In [14]: import numpy as np

import pandas as pd
methodA = np.array([72.47171449,72.10054831,69.70021876,61.29469083,76.50
methodB = np.array([72.14533547,89.81136212,98.07199674,84.48697781,80.53
data = pd.DataFrame.from_dict({"methodA": methodA, "methodB": methodB})
display(data.head())
data.describe()

methodA methodB

0 72.471714 72.145335

1 72.100548 89.811362

2 69.700219 98.071997

3 61.294691 84.486978

4 76.509736 80.530738

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 10/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[14]: methodA methodB

count 15.000000 15.000000

mean 71.503618 84.742407

std 9.413837 8.311136

min 52.743031 70.828394

25% 66.485161 79.873636

50% 72.100548 84.858600

75% 76.204012 91.370261

max 89.807999 98.071997

In [12]: import seaborn as sns

sns.boxplot(data)

Out[12]: <Axes: >

In [6]: import numpy as np

methodA = np.array([72.47171449,72.10054831,69.70021876,61.29469083,76.50
methodB = np.array([72.14533547,89.81136212,98.07199674,84.48697781,80.53
# Assume variance are equal
from scipy import stats
t_statistic, p_value = stats.ttest_ind(a=methodA, b=methodB)
display(p_value)

0.00033619096524951366

In [13]: # Assume variance are unequal

from scipy import stats

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 11/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

t_statistic, p_value = stats.ttest_ind(a=methodA, b=methodB, equal_var=Fa

display(p_value)

0.00034397297724683126

Paired t-test
Typical use cases

Does drug XY help you lose weight?

Is there a difference between people with and without a degree in terms of their
health?

Paired two-sample t-test

Null hypothesis: The mean difference is 0
Alternative hypothesis: The mean difference is not 0

Paired t-test (Assumptions)

Two dependent groups or samples are available
The variables are metric scaled
The differences of the paired values is normally distributed

In [22]: sid = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])

Pretest = np.array([90.56294638,94.8157883,109.5622995,90.2216652,97.5977
Posttest = np.array([110.6419831,101.5879712,120.6071699,83.22167741,109.
data = pd.DataFrame.from_dict({"SubjectID": sid, "Pretest": Pretest, "Pos
data.head()

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 12/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[22]: SubjectID Pretest Posttest

0 1 90.562946 110.641983

1 2 94.815788 101.587971

2 3 109.562299 120.607170

3 4 90.221665 83.221677

4 5 97.597791 109.272439

In [23]: data.describe()

Out[23]: SubjectID Pretest Posttest

count 15.000000 15.000000 15.000000

mean 8.000000 97.062228 107.834628

std 4.472136 10.306151 13.245207

min 1.000000 87.871787 82.822880

25% 4.500000 90.392306 100.741889

50% 8.000000 93.793262 109.272439

75% 11.500000 97.607024 117.289836

max 15.000000 121.287283 128.609768

In [27]: data.drop('SubjectID', axis=1, inplace=True)

In [28]: sns.boxplot(data)

Out[28]: <Axes: >

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 13/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [33]: import scipy.stats as stats

t_statistic, p_value = stats.ttest_rel(a=data.Pretest, b=data.Posttest)
p_value

Out[33]: 0.002220727093721955

Statistical tests to test for differences in more

than two groups (ANOVA)

One way ANOVA

Difference between mean of three or more different groups

One-way ANOVA
Null: All group mean are equal
Alternative: Not all group mean are equal

One-way ANOVA
Random samples
Independent groups
Independent variable is categorical
Dependent variable is nominal
Your sample data should follow a normal distribution or each group has more than
15 or 20 observations
Groups should have roughly equal variances

In [53]: dataS = pd.read_csv("data/OneWayExample.csv")

dataS

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 14/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[53]: Supplier 1 Supplier 2 Supplier 3 Supplier 4

0 11.715501 10.566155 10.283346 6.903486

1 11.981569 13.455359 12.177732 8.990110

2 8.043929 7.418840 10.559808 6.971273

3 10.558160 12.031314 9.655187 9.160390

4 14.079463 7.776633 8.790275 8.678426

5 10.776867 10.748939 10.862457 11.443832

6 7.860270 10.726980 10.378184 10.780441

7 11.889672 4.477291 10.188052 5.666760

8 11.942314 6.803820 11.624520 10.776041

9 13.177454 5.371892 12.305905 9.008765

In [54]: sns.boxplot(dataS)

Out[54]: <Axes: >

In [55]: import scipy.stats as stats

t_statistic, p_value = stats.f_oneway(dataS["Supplier 1"], dataS["Supplie
p_value

Out[55]: 0.031054179322102093

In [ ]: import scipy.stats as stats

t_statistic, p_value = stats.f_oneway(dataS["Supplier 1"], dataS["Supplie
p_value

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 15/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Two-Way ANOVA
Two-way ANOVA to assess differences between group means that are defined by
two categorical factors
Example: In addition to gender (Male/Female) does the highest level of education
have an influence on salary
Questions:
Does factor 1 have an effect on dependent variable?
Does factor 2 have an effect on dependent variable?
Is there interaction between factor 1 and factor 2?

In [11]: import pandas as pd

data_anova = pd.read_csv('data/Two-WayANOVAExamples.csv')
data_anova.drop(['Unnamed: 3', 'Food', 'Condiment', 'Enjoyment'], axis=1,
data_anova.head()

Out[11]: Gender Major Income

0 Male Statistics 78504.55540

1 Male Statistics 76268.90888

2 Male Statistics 66657.85452

3 Male Statistics 78026.35568

4 Male Statistics 83485.21734

In [16]: data_anova.groupby(['Gender', "Major"]).mean().unstack()

Out[16]: Income

Major Political Science Psychology Statistics

Gender

Female 55191.757954 66694.817654 74074.69174

Male 62015.975797 69539.851907 77743.38074

How to do Two-way ANOVA

Build a linear model

Two-way ANOVA
Hypothesis (3 implicit hypothesis):

Null:

The means of observations grouped by one factor are the same

The means of observations grouped by the other factor are the same

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 16/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

There is no interaction between the two factors.

In [27]: import statsmodels.api as sm

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.graphics.regressionplots import abline_plot

In [54]: # Example with no interaction

formula = 'Income ~ C(Gender) + C(Major) + C(Gender):C(Major)'
model = ols(formula, data_anova).fit()

In [45]: model.summary()

Out[45]: OLS Regression Results

Dep. Variable: Income R-squared: 0.719

Model: OLS Adj. R-squared: 0.706

Method: Least Squares F-statistic: 58.20

Date: Wed, 23 Oct 2024 Prob (F-statistic): 8.52e-30

Time: 18:09:38 Log-Likelihood: -1184.2

No. Observations: 120 AIC: 2380.

Df Residuals: 114 BIC: 2397.

Df Model: 5

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 5.519e+04 1072.125 51.479 0.000 5.31e+04 5.73e+04

C(Gender)[T.Male] 6824.2178 1516.213 4.501 0.000 3820.612 9827.824

C(Major)[T.Psychology] 1.15e+04 1516.213 7.587 0.000 8499.453 1.45e+04

C(Major)[T.Statistics] 1.888e+04 1516.213 12.454 0.000 1.59e+04 2.19e+04

C(Gender)
[T.Male]:C(Major) -3979.1836 2144.249 -1.856 0.066 -8226.924 268.557
[T.Psychology]

C(Gender)
[T.Male]:C(Major) -3155.5288 2144.249 -1.472 0.144 -7403.270 1092.212
[T.Statistics]

Omnibus: 2.379 Durbin-Watson: 1.891

Prob(Omnibus): 0.304 Jarque-Bera (JB): 1.673

Skew: 0.053 Prob(JB): 0.433

Kurtosis: 2.431 Cond. No. 9.77

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 17/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

In [46]: aov_table = anova_lm(model, typ=2)

aov_table

Out[46]: sum_sq df F PR(>F)

C(Gender) 5.930022e+08 1.0 25.795022 1.495670e-06

C(Major) 6.009141e+09 2.0 130.695897 3.147345e-30

C(Gender):C(Major) 8.823224e+07 2.0 1.919008 1.514633e-01

Residual 2.620748e+09 114.0 NaN NaN

In [62]: from statsmodels.graphics.factorplots import interaction_plot

fig=interaction_plot(x = data_anova.Major, trace=data_anova.Gender, respo

C:\Users\ANUP\anaconda3\Lib\site-packages\statsmodels\graphics\factorplot
s.py:113: FutureWarning: The provided callable <function mean at 0x0000021
B10FD4040> is currently using DataFrameGroupBy.mean. In a future version o
f pandas, the provided callable will be used directly. To keep current beh
avior pass the string "mean" instead.
plot_data = data.groupby(['trace', 'x']).aggregate(func).reset_index()

In [87]: data2_anova = pd.read_csv('data/Two-WayANOVAExamples.csv')

data2_anova.drop(['Gender', 'Major', 'Income', 'Unnamed: 3'], axis=1, inp
data2_anova.dropna(inplace=True)
data2_anova.head()

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 18/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[87]: Food Condiment Enjoyment

0 Hot Dog Mustard 81.926957

1 Hot Dog Mustard 84.939774

2 Hot Dog Mustard 90.286479

3 Hot Dog Mustard 89.561802

4 Hot Dog Mustard 97.676826

In [88]: data2_anova.groupby(['Food', "Condiment"]).mean().unstack()

Out[88]: Enjoyment

Condiment Chocolate Sauce Mustard

Food

Hot Dog 65.316612 89.605687

Ice Cream 93.048096 61.308913

In [89]: # Example with interaction

formula = 'Enjoyment ~ C(Food) + C(Condiment) + C(Food):C(Condiment)'
model = ols(formula, data2_anova).fit()

In [90]: model.summary()

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 19/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[90]: OLS Regression Results

Dep. Variable: Enjoyment R-squared: 0.893

Model: OLS Adj. R-squared: 0.889

Method: Least Squares F-statistic: 212.4

Date: Thu, 24 Oct 2024 Prob (F-statistic): 7.41e-37

Time: 10:35:23 Log-Likelihood: -240.33

No. Observations: 80 AIC: 488.7

Df Residuals: 76 BIC: 498.2

Df Model: 3

Covariance Type: nonrobust

std
coef t P>|t| [0.025 0.975]
err

Intercept 65.3166 1.120 58.343 0.000 63.087 67.546

C(Food)[T.Ice Cream] 27.7315 1.583 17.515 0.000 24.578 30.885

C(Condiment)[T.Mustard] 24.2891 1.583 15.341 0.000 21.136 27.442

C(Food)[T.Ice
-56.0283 2.239 -25.023 0.000 -60.488 -51.569
Cream]:C(Condiment)[T.Mustard]

Omnibus: 2.073 Durbin-Watson: 1.985

Prob(Omnibus): 0.355 Jarque-Bera (JB): 2.035

Skew: 0.325 Prob(JB): 0.362

Kurtosis: 2.566 Cond. No. 6.85

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [91]: aov_table = anova_lm(model, typ=2)

aov_table

Out[91]: sum_sq df F PR(>F)

C(Food) 1.597762 1.0 0.063740 8.013618e-01

C(Condiment) 277.520500 1.0 11.071120 1.353292e-03

C(Food):C(Condiment) 15695.828458 1.0 626.153372 1.953476e-38

Residual 1905.097085 76.0 NaN NaN

In [92]: data2_anova.groupby(['Food']).mean('Enjoyment')

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 20/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[92]: Enjoyment

Food

Hot Dog 77.461150

Ice Cream 77.178505

In [93]: data2_anova.groupby(['Condiment']).mean('Enjoyment')

Out[93]: Enjoyment

Condiment

Chocolate Sauce 79.182354

Mustard 75.457300

In [95]: # Interaction implies that higher enjoyment of chocolate sauce is due to

fig=interaction_plot(x = data2_anova.Condiment, trace=data2_anova.Food, r

Categorical variables

Product recommendation

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 21/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Single categorical variable

In [127… data_cat2 = sm.datasets.get_rdataset("Melanoma","MASS")
data_cat2.data

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 22/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[127… time status sex age year thickness ulcer

0 10 3 1 76 1972 6.76 1

1 30 3 1 56 1968 0.65 0

2 35 2 1 41 1977 1.34 0

3 99 3 0 71 1968 2.90 0

4 185 1 1 52 1965 12.08 1

... ... ... ... ... ... ... ...

200 4492 2 1 29 1965 7.06 1

201 4668 2 0 40 1965 6.12 0

202 4688 2 0 42 1965 0.48 0

203 4926 2 0 50 1964 2.26 0

204 5565 2 0 41 1962 2.90 0

205 rows × 7 columns

In [138… vcnt= data_cat2.data['ulcer'].value_counts()

display(vcnt)
display(data_cat2.data['ulcer'].value_counts(normalize=True))

ulcer
0 115
1 90
Name: count, dtype: int64
ulcer
0 0.560976
1 0.439024
Name: proportion, dtype: float64

Hypothesis
Null: The population propotion is same as the population mean
Alternate: The population propotion not the same as the population mean

Null: 50% of the Melanoma cancer ulcerate

In [142… from statsmodels.stats.proportion import proportions_ztest

proportions_ztest(count=vcnt[0], nobs=vcnt[0]+vcnt[1], value=0.5)

Out[142… (1.759206287871047, 0.07854247665392992)

Are two categorical variable independent?

Null: The variables are independent
Alternate: A relation exists between the variables

In [106… data_cat = sm.datasets.get_rdataset("birthwt","MASS")

data_cat.data

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 23/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

Out[106… low age lwt race smoke ptl ht ui ftv bwt

rownames

85 0 19 182 2 0 0 0 1 0 2523

86 0 33 155 3 0 0 0 0 3 2551

87 0 20 105 1 1 0 0 0 1 2557

88 0 21 108 1 1 0 0 1 2 2594

89 0 18 107 1 1 0 0 1 0 2600

... ... ... ... ... ... ... ... ... ... ...

79 1 28 95 1 1 0 0 0 2 2466

81 1 14 100 3 0 0 0 0 2 2495

82 1 23 94 3 1 0 0 0 0 2495

83 1 17 142 2 0 0 1 0 0 2495

84 1 21 130 1 1 0 1 0 3 2495

189 rows × 10 columns

Does smoking cause low birth weight?

In [121… contingency_table = pd.crosstab(data_cat.data.smoke, data_cat.data.low)
contingency_table

Out[121… low 0 1

smoke

0 86 29

1 44 30

In [116… contingency_table_norm = pd.crosstab(data_cat.data.smoke, data_cat.data.

contingency_table_norm

Out[116… low 0 1

smoke

0 0.747826 0.252174

1 0.594595 0.405405

Chi-square test for association

Null: Variables are independent
Alternative: A relationship exists between the variables

In [124… from scipy.stats import chi2_contingency

res = chi2_contingency(contingency_table, correction=0)

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 24/25

10/29/24, 6:14 PM Lecture 15 - Statistics for Data Science (Inferential Statistics)

res.pvalue

Out[124… 0.026490642530502487

file:///home/anupaprem/Downloads/Lecture 15 - Statistics for Data Science (Inferential Statistics).html 25/25

(Ebook PDF) Statistics For The Behavioral Sciences 10th Editioninstant Download
100% (4)
(Ebook PDF) Statistics For The Behavioral Sciences 10th Editioninstant Download
53 pages
(Ebook PDF) Statistics For The Behavioral Sciences 10Th Edition Install Download
No ratings yet
(Ebook PDF) Statistics For The Behavioral Sciences 10Th Edition Install Download
55 pages
Grade 12 Physics Exam Questions and Answers
80% (10)
Grade 12 Physics Exam Questions and Answers
3 pages
DSC 12 Inferential Statistics in Psychology
No ratings yet
DSC 12 Inferential Statistics in Psychology
96 pages
11 Statistical Tests
No ratings yet
11 Statistical Tests
24 pages
Lec 15
No ratings yet
Lec 15
43 pages
Metrics Topic3 Statistics Brief
No ratings yet
Metrics Topic3 Statistics Brief
24 pages
Inferential Stats
No ratings yet
Inferential Stats
96 pages
Hypothesis Testing
No ratings yet
Hypothesis Testing
54 pages
C22 Inferential Statistics DXB
No ratings yet
C22 Inferential Statistics DXB
66 pages
Introduction Statistics Imperial College London
50% (2)
Introduction Statistics Imperial College London
474 pages
Applied - Data - Science MODULE 2 SEM8
No ratings yet
Applied - Data - Science MODULE 2 SEM8
53 pages
Descriptive and Inferential Statistics
No ratings yet
Descriptive and Inferential Statistics
30 pages
PG Descriptive and Inferential Statistic 2024
No ratings yet
PG Descriptive and Inferential Statistic 2024
51 pages
The Statistical Tools
No ratings yet
The Statistical Tools
66 pages
Statistics
No ratings yet
Statistics
163 pages
AEB03 - Inferential Statitsitics (FE)
No ratings yet
AEB03 - Inferential Statitsitics (FE)
54 pages
Angela Verano MC MATH 15 Module 3
No ratings yet
Angela Verano MC MATH 15 Module 3
12 pages
3.1 R Programming For Statistics and Data Science - Course Notes - Hypothesis Testing
No ratings yet
3.1 R Programming For Statistics and Data Science - Course Notes - Hypothesis Testing
9 pages
(Ebook PDF) Statistics For The Behavioral Sciences 10th Edition Download
No ratings yet
(Ebook PDF) Statistics For The Behavioral Sciences 10th Edition Download
50 pages
4 - HY577 - Hypothesis Testing Basics
No ratings yet
4 - HY577 - Hypothesis Testing Basics
57 pages
Unit 4
No ratings yet
Unit 4
21 pages
Fdsa Unit 3 Aids Sem 4
No ratings yet
Fdsa Unit 3 Aids Sem 4
43 pages
The Most Compherensive Methods
No ratings yet
The Most Compherensive Methods
4 pages
(Ebook PDF) Statistics For The Behavioral Sciences 10Th Edition Download
No ratings yet
(Ebook PDF) Statistics For The Behavioral Sciences 10Th Edition Download
51 pages
02 Intrro Continued
No ratings yet
02 Intrro Continued
34 pages
Biostatistics Notes Part 1
No ratings yet
Biostatistics Notes Part 1
9 pages
03.22.2021 - L8 Statistics and Least Square
No ratings yet
03.22.2021 - L8 Statistics and Least Square
71 pages
Lecture 2 - MAT361 (21 JAN 2025)
No ratings yet
Lecture 2 - MAT361 (21 JAN 2025)
40 pages
Data Analytics Module 1 Lesson 6 Summary Notes
No ratings yet
Data Analytics Module 1 Lesson 6 Summary Notes
17 pages
Statistics Analysis
No ratings yet
Statistics Analysis
6 pages
2.1 PPT - Homogeneous and Hetero Mixtures
No ratings yet
2.1 PPT - Homogeneous and Hetero Mixtures
60 pages
Infer Ential
No ratings yet
Infer Ential
25 pages
Introduction To Data Science Exploratory Data Analysis
No ratings yet
Introduction To Data Science Exploratory Data Analysis
55 pages
ECON 361: Income & Inequality: Lecture 2: Review of Statistics
No ratings yet
ECON 361: Income & Inequality: Lecture 2: Review of Statistics
279 pages
A Complete Guide To Hypothesis Testing For Data Scientists Using Python - by Rashida Nasrin Sucky - Oct, 2020 - Towards Data Science
No ratings yet
A Complete Guide To Hypothesis Testing For Data Scientists Using Python - by Rashida Nasrin Sucky - Oct, 2020 - Towards Data Science
14 pages
Stats MCQ Notes
No ratings yet
Stats MCQ Notes
16 pages
10 Mar - AssQ
No ratings yet
10 Mar - AssQ
2 pages
Premili Definitions
No ratings yet
Premili Definitions
3 pages
Statistical Inference 1 and 2 1
No ratings yet
Statistical Inference 1 and 2 1
107 pages
Is Important Because:: TECH 6300 Introduction To Statistical Inference The Normal Distribution
100% (1)
Is Important Because:: TECH 6300 Introduction To Statistical Inference The Normal Distribution
19 pages
2statistical Analysis of Data 2
No ratings yet
2statistical Analysis of Data 2
43 pages
3141b86-6fd4-7726-D8ad-20a1516bcd Statistics Interview Cheat Sheet - Emmading - Com. All Rights Reserved.
No ratings yet
3141b86-6fd4-7726-D8ad-20a1516bcd Statistics Interview Cheat Sheet - Emmading - Com. All Rights Reserved.
10 pages
Final Exam of Business Statistics I at ADA University
No ratings yet
Final Exam of Business Statistics I at ADA University
14 pages
IB372 FA10 Lab01 Intro Statistics Presentation
100% (1)
IB372 FA10 Lab01 Intro Statistics Presentation
75 pages
Advanced Statistics
No ratings yet
Advanced Statistics
40 pages
CS194 Lec 06 EDA
No ratings yet
CS194 Lec 06 EDA
40 pages
Descriptive Statistics
No ratings yet
Descriptive Statistics
42 pages
LAB. Psychometrics Lesson 1
No ratings yet
LAB. Psychometrics Lesson 1
73 pages
1941 - National Building Code of Canada
No ratings yet
1941 - National Building Code of Canada
432 pages
Priyanka Pavithra Fundoodata
0% (1)
Priyanka Pavithra Fundoodata
170 pages
Complete Collection To Be Studied, It Contains All Subjects of Interest
No ratings yet
Complete Collection To Be Studied, It Contains All Subjects of Interest
3 pages
Lecture 4 - Data Science Statistics
No ratings yet
Lecture 4 - Data Science Statistics
21 pages
Shop Manual Komatsu pc300 7 PDF
88% (16)
Shop Manual Komatsu pc300 7 PDF
2 pages
COM 201 - Inferential Statistics - 18032022-1
No ratings yet
COM 201 - Inferential Statistics - 18032022-1
58 pages
Inferential Statistics
No ratings yet
Inferential Statistics
48 pages
Bio Statistics (Presentation)
No ratings yet
Bio Statistics (Presentation)
46 pages
Data Science Interview Questions - Statistics: Mohit Kumar Dec 12, 2018 11 Min Read
100% (1)
Data Science Interview Questions - Statistics: Mohit Kumar Dec 12, 2018 11 Min Read
14 pages
250 Lec 5 Fall 13
No ratings yet
250 Lec 5 Fall 13
42 pages
EC212: Introduction To Econometrics Multiple Regression: Inference (Wooldridge, Ch. 4)
No ratings yet
EC212: Introduction To Econometrics Multiple Regression: Inference (Wooldridge, Ch. 4)
89 pages
Introduction To Hypothesis Testing: Print Round
No ratings yet
Introduction To Hypothesis Testing: Print Round
2 pages
The Most Important Probability Distribution in Statistics
No ratings yet
The Most Important Probability Distribution in Statistics
57 pages
XS Series E Appen 7 Installation PDF
No ratings yet
XS Series E Appen 7 Installation PDF
101 pages
Hypothesis Testing in Machine Learning Using Python - by Yogesh Agrawal - 151413
No ratings yet
Hypothesis Testing in Machine Learning Using Python - by Yogesh Agrawal - 151413
15 pages
VOCALOID 6 Reference Manual ENG
No ratings yet
VOCALOID 6 Reference Manual ENG
88 pages
Lecture 5: Chapter 5 Statistical Analysis of Data Yes The "S" Word
No ratings yet
Lecture 5: Chapter 5 Statistical Analysis of Data Yes The "S" Word
42 pages
Government of West Bengal: Transport Department Islampur ARTO Form 23 Certificate of Registration
0% (1)
Government of West Bengal: Transport Department Islampur ARTO Form 23 Certificate of Registration
1 page
Anderson ButcherConroy
No ratings yet
Anderson ButcherConroy
21 pages
Hytera+VM780+4G+Body+Worn+Camera+User+Manual+ (HyTalk) +V1.0.00 Eng
No ratings yet
Hytera+VM780+4G+Body+Worn+Camera+User+Manual+ (HyTalk) +V1.0.00 Eng
50 pages
WS - 3 Class X Phy CH - 10 (Light - Refraction) - 1
No ratings yet
WS - 3 Class X Phy CH - 10 (Light - Refraction) - 1
3 pages
Activity Based Costing
No ratings yet
Activity Based Costing
34 pages
L Matching Reflection Coefficient Using Matlab
100% (1)
L Matching Reflection Coefficient Using Matlab
7 pages
Lecture 0&1 Introduction Sensors
No ratings yet
Lecture 0&1 Introduction Sensors
34 pages
Chapter One To Five Collective
No ratings yet
Chapter One To Five Collective
33 pages
Lecture 23: Outline: Yell If You Have Any Questions
No ratings yet
Lecture 23: Outline: Yell If You Have Any Questions
43 pages
Mbafm MMPC 020
No ratings yet
Mbafm MMPC 020
28 pages
Topic 2
No ratings yet
Topic 2
17 pages
Review Quiz - Attempt Review2
No ratings yet
Review Quiz - Attempt Review2
11 pages
Filling Station Case Study
No ratings yet
Filling Station Case Study
22 pages
Mechanical Tube English
No ratings yet
Mechanical Tube English
8 pages
B. Stage 1 and 2
No ratings yet
B. Stage 1 and 2
20 pages
MSA Case Studies
No ratings yet
MSA Case Studies
10 pages
Rapid Serial Visual Presentation in Dynamic Graph Visualization
No ratings yet
Rapid Serial Visual Presentation in Dynamic Graph Visualization
8 pages
Unit 4 - Week 2: Introduction To Python: Assignment 2
No ratings yet
Unit 4 - Week 2: Introduction To Python: Assignment 2
4 pages
F4 Chapter 3 (Exercise 6)
No ratings yet
F4 Chapter 3 (Exercise 6)
3 pages
Calculators List Allowed
No ratings yet
Calculators List Allowed
1 page
OD328816327605052100
No ratings yet
OD328816327605052100
1 page
DEGUZMAN KS3 LeaP G8Q3W6
No ratings yet
DEGUZMAN KS3 LeaP G8Q3W6
3 pages
A Feasibility Study of Eco Bag Made in Banana Fiber
No ratings yet
A Feasibility Study of Eco Bag Made in Banana Fiber
3 pages
Chi Squared for Beginners
From Everand
Chi Squared for Beginners
Stephanie Glen
No ratings yet