Common Statistics


•A one-sample z-test is used to compare a sample mean with a hypothesized population mean when the population standard deviation

is known

•A one-sample t-test is used to compare a sample mean with a hypothesized population mean when the population standard deviation is
unknown

•A two-sample independent z-test is used to compare the sample means from two independent populations when the population standard
deviations are known

•A two-sample independent t-test is used to compare the sample means from two independent populations when the population standard
deviations are unknown

•A paired t-test is used to compare the sample means from two related (dependent) populations

•An ANOVA test is used to compare the sample means from two or more independent populations

•A one-sample proportion z-test is used to compare a sample proportion with a population proportion

•A two-sample proportion z-test is used to compare the sample proportions from two independent populations

•A chi-square test for variance is used to compare a sample variance with a population variance

•A chi-square test of independence is used to check the dependence (relationship) between two categorical variables

•An F-test of equality of variances is used to compare the sample variances from two populations
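
For example, here is a minimal sketch (with made-up numbers, not taken from any dataset above) of how one of these tests, the two-sample independent t-test, is run in Python with scipy.stats:

import numpy as np
from scipy import stats

group_a = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3])  # hypothetical sample 1
group_b = np.array([11.5, 11.7, 11.4, 11.9, 11.6, 11.3])  # hypothetical sample 2

# Two-sample independent t-test of H0: the two population means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b, alternative='two-sided')
print(t_stat, p_value)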

CLT (Central Limit Theorem) – for the sampling distribution of the sample mean to be approximately normal, the sample size should generally be more than 30
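
A minimal simulation sketch (hypothetical, using numpy) illustrating the CLT: means of size-30 samples drawn from a clearly non-normal (exponential) population are approximately normally distributed around the population mean.

import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100000)   # a skewed, non-normal population
sample_means = [rng.choice(population, size=30).mean() for _ in range(5000)]
# The mean of the sample means is close to the population mean, and their spread
# is close to the population standard deviation divided by sqrt(30)
print(np.mean(sample_means), np.std(sample_means))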


Since the p-value is less than 0.05 (level of significance), there is enough statistical evidence to reject the null hypothesis. So, the
jeweler will reject the null hypothesis. 

As the p-value (0.00086) is less than the level of significance (0.05), there is enough statistical evidence to reject the null hypothesis.
Thus, there is enough statistical evidence to conclude that the mean post-weight is less than the mean pre-weight.

As the calculated p-value (0.079) is greater than the level of significance (0.05), there is not enough statistical evidence to reject the
null hypothesis. Hence, there is not enough statistical evidence to say that the proportion of orders mailed within 72 hours after
they are received is smaller than 90%.

As the p-value is greater than the level of significance (0.05), we do not have enough evidence to reject the null hypothesis. Hence,
we do not have enough statistical evidence to say that the energy expenditure of obese and lean subjects is different.

As the p-value is greater than the level of significance (0.05), we do not have enough statistical evidence to reject the null
hypothesis. Thus, there is not enough statistical evidence to say that the scores of the two groups of students are different.
The null hypothesis always contains some form of equality (=, ≤, or ≥) and the alternative hypothesis never contains
equality (≠, <, or >)

So, the valid null and alternative hypotheses are:

 If the p-value is less than the level of significance, we have enough statistical evidence to reject the null hypothesis.
 If the p-value is greater than the level of significance, we do not have enough statistical evidence to reject the null
hypothesis. Hence, we fail to reject the null hypothesis.
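
As a sketch, this decision rule can be written in code as follows (the p-value shown is hypothetical):

alpha = 0.05      # level of significance
p_value = 0.0086  # hypothetical p-value from some test
if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')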
scipy.stats.ttest_ind
scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate', permutations=None, random_state=None, alternative='two-sided', trim=0)
Calculate the T-test for the means of two independent samples of scores.

This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test
assumes that the populations have identical variances by default.

Parameters
a, b : array_like
The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default).
axis : int or None, optional
Axis along which to compute test. If None, compute over the whole arrays, a, and b.
equal_var : bool, optional
If True (default), perform a standard independent 2 sample test that assumes equal population variances [1]. If
False, perform Welch’s t-test, which does not assume equal population variance [2].
New in version 0.11.0.
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. The following options are available (default is ‘propagate’):
 ‘propagate’: returns nan
 ‘raise’: throws an error
 ‘omit’: performs the calculations ignoring nan values

The ‘omit’ option is not currently available for permutation tests or one-sided asymptotic tests.
permutations : non-negative int, np.inf, or None (default), optional
If 0 or None (default), use the t-distribution to calculate p-values. Otherwise, permutations is the number of
random permutations that will be used to estimate p-values using a permutation test. If permutations equals or
exceeds the number of distinct partitions of the pooled data, an exact test is performed instead (i.e. each distinct
partition is used exactly once). See Notes for details.
New in version 1.7.0.
random_state : {None, int, numpy.random.Generator, numpy.random.RandomState}, optional
If seed is None (or np.random), the numpy.random.RandomState singleton is used. If seed is an int, a
new RandomState instance is used, seeded with seed. If seed is already a Generator or RandomState instance
then that instance is used.
Pseudorandom number generator state used to generate permutations (used only when permutations is not
None).
New in version 1.7.0.
alternative : {‘two-sided’, ‘less’, ‘greater’}, optional
Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):
 ‘two-sided’: the means of the distributions underlying the samples are unequal.
 ‘less’: the mean of the distribution underlying the first sample is less than the mean of the distribution
underlying the second sample.
 ‘greater’: the mean of the distribution underlying the first sample is greater than the mean of the
distribution underlying the second sample.
New in version 1.6.0.
trim : float, optional
If nonzero, performs a trimmed (Yuen’s) t-test. Defines the fraction of elements to be trimmed from each end of
the input samples. If 0 (default), no elements will be trimmed from either side. The number of trimmed elements
from each tail is the floor of the trim times the number of elements. Valid range is [0, .5).
New in version 1.7.
Returns
statistic : float or array
The calculated t-statistic.
pvalue : float or array
The p-value.
Notes

Suppose we observe two independent samples, e.g. flower petal lengths, and we are considering whether the two
samples were drawn from the same population (e.g. the same species of flower or two species with similar petal
characteristics) or two different populations.

The t-test quantifies the difference between the arithmetic means of the two samples. The p-value quantifies the
probability of observing as or more extreme values assuming the null hypothesis, that the samples are drawn from
populations with the same population means, is true. A p-value larger than a chosen threshold (e.g. 5% or 1%) indicates
that our observation is not so unlikely to have occurred by chance. Therefore, we do not reject the null hypothesis of
equal population means. If the p-value is smaller than our threshold, then we have evidence against the null hypothesis
of equal population means.

By default, the p-value is determined by comparing the t-statistic of the observed data against a theoretical t-distribution.
When 1 < permutations < binom(n, k), where

 k is the number of observations in a,


 n is the total number of observations in a and b, and
 binom(n, k) is the binomial coefficient (n choose k),
the data are pooled (concatenated), randomly assigned to either group a or b, and the t-statistic is calculated. This
process is performed repeatedly (permutations times), generating a distribution of the t-statistic under the null hypothesis,
and the t-statistic of the observed data is compared to this distribution to determine the p-value.
When permutations >= binom(n, k), an exact test is performed: the data are partitioned between the groups in
each distinct way exactly once.

The permutation test can be computationally expensive and not necessarily more accurate than the analytical test, but it
does not make strong assumptions about the shape of the underlying distribution.
Use of trimming is commonly referred to as the trimmed t-test. At times called Yuen’s t-test, this is an extension of
Welch’s t-test, with the difference being the use of winsorized means in calculation of the variance and the trimmed
sample size in calculation of the statistic. Trimming is recommended if the underlying distribution is long-tailed or
contaminated with outliers [4].
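
A minimal usage sketch for ttest_ind (with hypothetical petal-length data, assuming SciPy >= 1.6 for the alternative argument):

import numpy as np
from scipy import stats

sample_1 = np.array([1.4, 1.5, 1.3, 1.6, 1.4, 1.5])
sample_2 = np.array([1.7, 1.8, 1.6, 1.9, 1.7, 1.8])

# Welch's t-test (equal population variances not assumed)
result = stats.ttest_ind(sample_1, sample_2, equal_var=False, alternative='two-sided')
print(result.statistic, result.pvalue)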
scipy.stats.ttest_rel
scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate', alternative='two-sided')
Calculate the t-test on TWO RELATED samples of scores, a and b.
This is a test for the null hypothesis that two related or repeated samples have identical average (expected)
values.

Parameters
a, b : array_like
The arrays must have the same shape.
axis : int or None, optional
Axis along which to compute test. If None, compute over the whole arrays, a, and b.
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. The following options are available (default is ‘propagate’):
 ‘propagate’: returns nan
 ‘raise’: throws an error
 ‘omit’: performs the calculations ignoring nan values

alternative : {‘two-sided’, ‘less’, ‘greater’}, optional


Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):
 ‘two-sided’: the means of the distributions underlying the samples are unequal.
 ‘less’: the mean of the distribution underlying the first sample is less than the mean of the distribution
underlying the second sample.
 ‘greater’: the mean of the distribution underlying the first sample is greater than the mean of the
distribution underlying the second sample.
New in version 1.6.0.
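
A minimal usage sketch for ttest_rel (with hypothetical pre/post weights and a one-tailed alternative that the post mean is less than the pre mean):

import numpy as np
from scipy import stats

pre = np.array([85.2, 90.1, 78.5, 88.0, 92.3])
post = np.array([83.9, 88.7, 77.8, 86.5, 91.0])

# Paired t-test of H0: mean(post) >= mean(pre) against Ha: mean(post) < mean(pre)
t_stat, p_value = stats.ttest_rel(post, pre, alternative='less')
print(t_stat, p_value)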
Key Differences Between Null and Alternative Hypothesis

The important points of differences between null and alternative hypothesis are explained as under:

1. A null hypothesis is a statement in which there is no relationship between two variables. An alternative
hypothesis is a statement that is simply the inverse of the null hypothesis, i.e. there is some statistical
significance between two measured phenomena.
2. A null hypothesis is what the researcher tries to disprove, whereas an alternative hypothesis is what the
researcher wants to prove.
3. A null hypothesis represents no observed effect, whereas an alternative hypothesis reflects some observed
effect.
4. If the null hypothesis is accepted, no changes will be made in the opinions or actions. Conversely, if the
alternative hypothesis is accepted, it will result in changes in the opinions or actions.
5. As the null hypothesis refers to a population parameter, the testing is indirect and implicit. On the other hand, the
alternative hypothesis refers to a sample statistic, wherein the testing is direct and explicit.
6. A null hypothesis is labelled as H0 (H-zero) while an alternative hypothesis is represented by H1 (H-one).
7. The mathematical formulation of a null hypothesis contains an equality sign, whereas that of an alternative
hypothesis contains an inequality (not-equal-to) sign.
8. Under the null hypothesis, the observations are the outcome of chance, whereas under the alternative
hypothesis, the observations are the outcome of a real effect.
The one-sample proportions z-test is used to compare a sample proportion with a population proportion. The following
code has been provided:

from statsmodels.stats.proportion import proportions_ztest


proportions_ztest(count, nobs, value = 0.7, alternative='two-sided')

In the above line of code,

 proportions_ztest() is a function in statsmodels.stats.proportion that performs a one-sample proportions z-test.


 count is the number of successes out of the total number of observations in the sample
 nobs is the total number of observations in the sample
 value = 0.7 is a parameter that indicates that the hypothesized population proportion is 0.7
 alternative = 'two-sided' is an argument used to specify the tail of the test. This argument depends on the formulated
alternative hypothesis for the test.
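
A complete sketch with hypothetical numbers (150 successes out of 200 observations, tested against a hypothesized population proportion of 0.7):

from statsmodels.stats.proportion import proportions_ztest

count = 150   # number of successes in the sample
nobs = 200    # total number of observations in the sample
z_stat, p_value = proportions_ztest(count, nobs, value=0.7, alternative='two-sided')
print(z_stat, p_value)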
Python writes scientific notation using 'e'. For example, 7.84 x 10^-5 is displayed as 7.84e-05. So the value 7.84e-05, i.e. 7.84 x 10^-5, is
less than 0.05.
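
This can be checked directly in Python:

p_value = 7.84e-05        # 7.84 x 10^-5
print(p_value < 0.05)     # True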
Here is a list of the important Python functions for conducting different types of hypothesis tests from the
scipy.stats library and the statsmodels library (a brief usage sketch follows each list):
Tests from scipy.stats library (Alias: stats)

1. ttest_1samp(): Test to compare a sample mean with a population mean when the population standard deviation is
unknown.
2. ttest_ind(): Test to compare two sample means from two independent populations when the population standard
deviations are unknown.
3. ttest_rel(): Test to compare two sample means from two related (dependent) populations.
4. chi2_contingency(): Test to check the dependence (relationship) between two categorical variables.
5. shapiro(): Test to determine whether a sample has been drawn from a normal population.
6. levene(): Test to determine whether several samples have been drawn from populations with equal variances.
7. f_oneway(): Test to compare the sample means from several populations.
Note: In 'scipy.stats' test functions, the 'alternative' argument is used to define the alternative hypothesis. The following
options are available (the default is 'two-sided’):
 ‘two-sided’: to perform the test for a two-tailed alternative hypothesis (containing ≠ sign)
 ‘less’: to perform the test for a one-tailed alternative hypothesis (containing < sign)
 ‘greater’: to perform the test for a one-tailed alternative hypothesis (containing > sign)
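
A minimal sketch (with hypothetical group scores) of a typical workflow combining these functions: check normality with shapiro(), check equal variances with levene(), then compare the group means with f_oneway():

import numpy as np
from scipy import stats

g1 = np.array([23, 25, 21, 24, 26])   # hypothetical scores for group 1
g2 = np.array([28, 27, 29, 30, 26])   # hypothetical scores for group 2
g3 = np.array([22, 20, 24, 23, 21])   # hypothetical scores for group 3

print(stats.shapiro(g1).pvalue, stats.shapiro(g2).pvalue, stats.shapiro(g3).pvalue)  # H0: normality
print(stats.levene(g1, g2, g3).pvalue)     # H0: equal variances
print(stats.f_oneway(g1, g2, g3).pvalue)   # H0: equal means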

 
Tests from Statsmodels library 

1. statsmodels.stats.proportion.proportions_ztest(): Test to compare proportions based on normal (z) test.


2. statsmodels.stats.multicomp.pairwise_tukeyhsd(): Test to conduct pairwise comparisons for several sample means.
Note: In the test functions of the statsmodels library, the 'alternative' argument is used to define the alternative hypothesis.
The 'alternative' argument can take any of the possible values: ‘two-sided’, ‘smaller’, ‘larger’.
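
A minimal usage sketch for pairwise_tukeyhsd (with hypothetical scores and group labels):

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([23, 25, 21, 24, 28, 27, 29, 30, 22, 20, 24, 23])
groups = np.array(['A'] * 4 + ['B'] * 4 + ['C'] * 4)

# Pairwise comparisons of the group means at a 5% family-wise error rate
print(pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05))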

Libraries Used in ANOVA Monograph


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.graphics.factorplots import interaction_plot
from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)
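
A minimal sketch (with hypothetical data, and assuming the imports above have been run) of how these libraries are typically combined for a one-way ANOVA:

df = pd.DataFrame({'group': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
                   'score': [23, 25, 21, 24, 28, 27, 29, 30, 22, 20, 24, 23]})

model = ols('score ~ C(group)', data=df).fit()   # fit the one-way ANOVA model
print(sm.stats.anova_lm(model, typ=1))           # ANOVA table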
