Lecture 15 - Statistics For Data Science (Inferential Statistics)
Inferential Statistics
Hypothesis
A hypothesis is an assumption that is neither confirmed nor disproved.
An example of a hypothesis is: "There is no difference in wages between men and
women."
Hypothesis testing: Analyze data to either reject or fail to reject the null hypothesis
Level of significance
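How strong must the evidence from the sample be before we reject the null hypothesis?
The significance level is the probability of rejecting the null hypothesis when it is
actually true.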
Typical values: 5% (0.05) or 1% (0.01)
p-value
Characterizes the strength of the sample evidence against the null hypothesis
If the p-value is less than or equal to the significance level, you reject the null
hypothesis.
If the p-value is greater than the significance level, you do not reject the null
hypothesis.
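The decision rule above can be written as a small helper. This is a minimal sketch; the function name `decide` is made up for illustration.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject the null when p_value <= alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "do not reject the null hypothesis"


# The same p-value can lead to different decisions at different levels.
print(decide(0.03))               # compared against the 5% level
print(decide(0.03, alpha=0.01))   # compared against the 1% level
```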
In [3]: # Example
# Dataset for the fuel cost of 25 families for this year
import pandas as pd
fuel_df = pd.read_csv('data/FuelsCosts.csv')
Hypothesis definition:
Null hypothesis: The population mean equals the null hypothesis mean (260).
Alternative hypothesis: The population mean does not equal the null hypothesis
mean (260).
Sampling distribution
Hypothetically, if we could draw many samples from the population (assuming the
average fuel cost of the population is 260), the distribution of their sample means
would be the sampling distribution.
The sample is statistically significant (at the 0.05 level) because it falls in the critical region.
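The idea of a sampling distribution and its critical region can be sketched by simulation. The null mean of 260 and the sample size of 25 come from these notes; the population standard deviation of 30 and the helper `in_critical_region` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Null mean and sample size follow the notes; pop_sd is an assumed value.
null_mean, pop_sd, n = 260, 30, 25

# Draw many samples of 25 families and record each sample mean.
sample_means = rng.normal(null_mean, pop_sd, size=(10_000, n)).mean(axis=1)

# Two-sided critical region at the 0.05 level: the outer 2.5% in each tail.
lower, upper = np.percentile(sample_means, [2.5, 97.5])
print(f"Critical region: below {lower:.1f} or above {upper:.1f}")


def in_critical_region(observed_mean: float) -> bool:
    """Return True when a sample mean falls in either tail."""
    return bool(observed_mean < lower or observed_mean > upper)
```

A sample mean that lands in either tail would be called statistically significant at the 0.05 level under these assumptions.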
P-values
Assuming the null hypothesis is true, the p-value is the probability that a sample will
have an effect at least as extreme as the effect observed in your sample.
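To make this definition concrete, the sketch below estimates a two-sided p-value by simulation. The observed sample mean of 275 and the population standard deviation of 30 are hypothetical values, not taken from the fuel cost data.

```python
import numpy as np

rng = np.random.default_rng(1)

null_mean, pop_sd, n = 260, 30, 25   # pop_sd is an assumed value
observed_mean = 275                  # hypothetical observed sample mean

# Sampling distribution of the mean when the null hypothesis is true.
sample_means = rng.normal(null_mean, pop_sd, size=(20_000, n)).mean(axis=1)

# Two-sided p-value: the share of simulated sample means that lie at least
# as far from the null mean as the observed sample mean does.
observed_effect = abs(observed_mean - null_mean)
p_value = np.mean(np.abs(sample_means - null_mean) >= observed_effect)
print(f"Simulated two-sided p-value: {p_value:.4f}")
```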
Statistical Tests
The sampling distribution cannot be obtained in practice
Statistical tests instead rely on assumptions about the population
Variance homogeneity
Equal variance in the samples
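One common way to check this assumption is Levene's test from SciPy. The notes do not say which test was used, so treat this as one possible sketch; the three groups below are randomly generated, hypothetical data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Three hypothetical groups generated with the same spread.
group_a = rng.normal(20, 5, size=30)
group_b = rng.normal(22, 5, size=30)
group_c = rng.normal(18, 5, size=30)

# Null hypothesis of Levene's test: all groups have equal variances.
stat, p_value = stats.levene(group_a, group_b, group_c)
print(f"Levene statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

A p-value above the chosen significance level gives no evidence against equal variances.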
0 21 18 17
1 23 22 16
2 17 19 23
3 11 26 7
4 9 13 26
5 27 24 9
6 22 23 25
7 12 17 21
8 20 21 14
9 4 15 20
Out[9]: 0.1527571716627755
0.5280694573759905
Out[13]: 0.6457341109631506
Out[16]: (array([ 4., 11., 43., 69., 77., 63., 23., 8., 1., 1.]),
array([-2.83217082, -2.19848507, -1.56479933, -0.93111358, -0.29742783,
0.33625792, 0.96994366, 1.60362941, 2.23731516, 2.87100091,
3.50468665]),
<BarContainer object of 10 artists>)
Out[18]: 0.3422810733318329
Out[60]: 0.4637315273284912
Out[88]: 0.10476154088973999
T-test
One-sample t-test
Null: The population mean equals the hypothesized mean.
Alternative: The population mean does not equal the hypothesized mean.
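A one-sample t-test of these hypotheses can be sketched with `scipy.stats.ttest_1samp`. The data below are randomly generated stand-ins, not the contents of FuelsCosts.csv; the hypothesized mean of 260 follows the earlier example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical fuel costs for 25 families (not the real dataset).
fuel_costs = rng.normal(loc=330, scale=50, size=25)

# Two-sided test of H0: population mean == 260.
result = stats.ttest_1samp(fuel_costs, popmean=260)
print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")
```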
0.10310814522004697
3.3912901586226853e-10
Two-sample t-test
Null hypothesis: The means for the two populations are equal.
Alternative hypothesis: The means for the two populations are not equal.
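A two-sample t-test can be sketched with `scipy.stats.ttest_ind`. The scores below are randomly generated and only loosely mimic a methodA/methodB comparison. The sketch runs both the classic test, which assumes equal variances, and Welch's variant, which does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical scores for two independent groups.
method_a = rng.normal(70, 8, size=30)
method_b = rng.normal(85, 8, size=30)

# Student's t-test (assumes equal variances in both populations).
student = stats.ttest_ind(method_a, method_b)

# Welch's t-test (does not assume equal variances).
welch = stats.ttest_ind(method_a, method_b, equal_var=False)

print(f"Student p-value: {student.pvalue:.3g}")
print(f"Welch p-value:   {welch.pvalue:.3g}")
```

When the groups have similar sizes and spreads, the two p-values are usually close.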
methodA methodB
0 72.471714 72.145335
1 72.100548 89.811362
2 69.700219 98.071997
3 61.294691 84.486978
4 76.509736 80.530738
0.00033619096524951366
0.00034397297724683126
Paired t-test
Typical use cases
0 1 90.562946 110.641983
1 2 94.815788 101.587971
2 3 109.562299 120.607170
3 4 90.221665 83.221677
4 5 97.597791 109.272439
In [23]: data.describe()
In [28]: sns.boxplot(data)
Out[33]: 0.002220727093721955
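A paired t-test compares two measurements taken on the same subjects. The sketch below uses `scipy.stats.ttest_rel` on hypothetical before/after values and shows that it matches a one-sample t-test on the pairwise differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical before/after measurements on the same 20 subjects.
before = rng.normal(95, 10, size=20)
after = before + rng.normal(8, 6, size=20)

# Paired t-test of H0: the mean difference between the pairs is zero.
paired = stats.ttest_rel(before, after)

# Equivalent formulation: one-sample t-test on the differences.
diff_test = stats.ttest_1samp(before - after, popmean=0)

print(f"paired p-value:     {paired.pvalue:.3g}")
print(f"difference p-value: {diff_test.pvalue:.3g}")
```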
One-way ANOVA
Null: All group means are equal
Alternative: Not all group means are equal
One-way ANOVA: assumptions
Random samples
Independent groups
Independent variable is categorical
Dependent variable is continuous
The sample data in each group should follow a normal distribution, or each group
should have more than 15 to 20 observations
Groups should have roughly equal variances
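A one-way ANOVA on data that meet these assumptions can be sketched with `scipy.stats.f_oneway`. The three groups are randomly generated, hypothetical data in which one group mean differs from the others.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Three independent, hypothetical groups with equal spread.
group_1 = rng.normal(10, 2, size=20)
group_2 = rng.normal(10, 2, size=20)
group_3 = rng.normal(14, 2, size=20)   # this group has a different mean

# H0: all group means are equal.
f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)
print(f"F = {f_stat:.3f}, p-value = {p_value:.3g}")
```

A small p-value only says that not all means are equal; it does not say which group differs.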
In [54]: sns.boxplot(dataS)
Out[55]: 0.031054179322102093
Two-Way ANOVA
Two-way ANOVA assesses differences between group means that are defined by
two categorical factors
Example: In addition to gender (Male/Female), does the highest level of education
have an influence on salary?
Questions:
Does factor 1 have an effect on the dependent variable?
Does factor 2 have an effect on the dependent variable?
Is there an interaction between factor 1 and factor 2?
Out[16]: Income
Gender
Two-way ANOVA
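Hypotheses (3 implicit hypotheses):
Null: Factor 1 has no effect on the dependent variable; factor 2 has no effect on the
dependent variable; there is no interaction between factor 1 and factor 2.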
In [45]: model.summary()
Df Model: 5
                                               coef    std err       t    P>|t|     [0.025     0.975]
C(Gender)[T.Male]:C(Major)[T.Psychology]  -3979.1836  2144.249  -1.856   0.066  -8226.924    268.557
C(Gender)[T.Male]:C(Major)[T.Statistics]  -3155.5288  2144.249  -1.472   0.144  -7403.270   1092.212
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[88]: Enjoyment
Food
In [90]: model.summary()
Df Model: 3
                                                   coef  std err        t   P>|t|    [0.025    0.975]
C(Food)[T.Ice Cream]:C(Condiment)[T.Mustard]   -56.0283    2.239  -25.023   0.000   -60.488   -51.569
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [92]: data2_anova.groupby('Food')[['Enjoyment']].mean()
Out[92]: Enjoyment
Food
In [93]: data2_anova.groupby('Condiment')[['Enjoyment']].mean()
Out[93]: Enjoyment
Condiment
Mustard 75.457300
Categorical variables
Product recommendation
0 10 3 1 76 1972 6.76 1
1 30 3 1 56 1968 0.65 0
2 35 2 1 41 1977 1.34 0
3 99 3 0 71 1968 2.90 0
ulcer
0 115
1 90
Name: count, dtype: int64
ulcer
0 0.560976
1 0.439024
Name: proportion, dtype: float64
Hypothesis
Null: The population proportion is the same as the hypothesized proportion
Alternative: The population proportion is not the same as the hypothesized proportion
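A test of a single proportion can be sketched with `scipy.stats.binomtest` (available in SciPy 1.7 and later). The counts come from the `ulcer` value counts shown above; the hypothesized proportion of 0.5 is an assumption, since the notes do not state which value was tested.

```python
from scipy import stats

# Counts from the value_counts output: 90 of 205 observations have ulcer == 1.
n_ulcer, n_total = 90, 205

# Assumed for illustration only; the notes do not give the hypothesized value.
hypothesized_p = 0.5

# Exact two-sided binomial test of H0: population proportion == hypothesized_p.
result = stats.binomtest(n_ulcer, n=n_total, p=hypothesized_p)
print(f"sample proportion = {n_ulcer / n_total:.4f}")
print(f"p-value = {result.pvalue:.4f}")
```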
rownames
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
87 0 20 105 1 1 0 0 0 1 2557
88 0 21 108 1 1 0 0 1 2 2594
89 0 18 107 1 1 0 0 1 0 2600
... ... ... ... ... ... ... ... ... ... ...
79 1 28 95 1 1 0 0 0 2 2466
81 1 14 100 3 0 0 0 0 2 2495
82 1 23 94 3 1 0 0 0 0 2495
83 1 17 142 2 0 0 1 0 0 2495
84 1 21 130 1 1 0 1 0 3 2495
Out[121… low 0 1
smoke
0 86 29
1 44 30
Out[116… low 0 1
smoke
0 0.747826 0.252174
1 0.594595 0.405405
res.pvalue
Out[124… 0.026490642530502487
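A p-value of this size is what a chi-square test of independence gives on the smoke-by-low contingency table shown above when Yates' continuity correction is turned off. The notes do not show the call that produced `res`, so the sketch below is one plausible reconstruction using `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the crosstab: rows are smoke = 0/1, columns are low = 0/1.
observed = np.array([[86, 29],
                     [44, 30]])

# H0: smoking status and low birth weight are independent.
# correction=False disables Yates' continuity correction for the 2x2 table.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.4f}, dof = {dof}, p-value = {p_value:.5f}")
```

With the correction left at its default (enabled), the p-value for a 2x2 table is somewhat larger.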