FHA Notes
Introduction
Healthcare analytics involves the application of data analysis and insights in the
healthcare industry to improve patient outcomes, operational efficiency, and decision-
making processes.
Fundamental aspects:
Data
The raw material of statistics is data. For our purposes we may define data as
numbers. The two kinds of numbers that we use in statistics are numbers that result from
the taking, in the usual sense of the term, of a measurement, and those that result from
the process of counting. For example, when a nurse weighs a patient or takes a
patient's temperature, a measurement is obtained.
Statistics: The meaning of statistics is implicit in the previous section. More concretely,
however, we may say that statistics is a field of study concerned with (1) the collection,
organization, summarization, and analysis of data; and (2) the drawing of inferences
about a body of data when only a part of the data is observed.
Sources of Data: Data are usually available from one or more of the following sources:
1. Routinely kept records
2. Surveys
3. Experiments
4. External sources
Variable: A variable is a characteristic that takes on different values in different persons,
places, or things. Examples include diastolic blood pressure, heart rate, the heights of adult
males, the weights of preschool children, and the ages of patients seen in a dental clinic.
Quantitative Variables: A quantitative variable is one that can be measured in the usual
sense; measurements made on quantitative variables convey information regarding amount.
Qualitative Variables: Measurements made on qualitative variables convey information
regarding an attribute or category rather than an amount.
Random Variable: When the values obtained arise as a result of chance factors, so that they
cannot be exactly predicted in advance, the variable is called a random variable. An
example of a random variable is adult height.
Discrete Random Variable: Variables may be characterized further as to whether they
are discrete or continuous. A discrete variable is characterized by gaps or interruptions
in the values that it can assume.
Continuous Random Variable : A continuous random variable does not possess the gaps
or interruptions characteristic of a discrete random variable
Population: A population of entities is the largest collection of entities for which we have
an interest at a particular time. A population may consist of people, animals, machines,
places, or cells.
Sample: A sample may be defined simply as a part of a population.
Introduction to biostatistics
Biostatistics is the branch of statistics that deals with data related to living organisms,
health, and biology. It involves the application of statistical methods to design
experiments, collect, analyze, and interpret data in fields such as medicine, public health,
genetics, ecology, and more.
Here are some key aspects:
1. Study Design: Biostatisticians play a crucial role in designing experiments and
studies. They determine the sample size, randomization methods, and data
collection techniques to ensure that results are reliable and meaningful.
2. Data Collection: They collect data through various methods, such as surveys,
clinical trials, observations, or experiments. This data may include information on
diseases, treatments, genetics, environmental factors, and more.
3. Data Analysis: Once data is collected, biostatisticians use statistical methods to
analyze it. They employ techniques like hypothesis testing, regression analysis,
survival analysis, and more to draw conclusions and make inferences from the
data.
4. Interpretation: Biostatisticians interpret the results of their analyses, often
collaborating with researchers, doctors, or policymakers to understand the
implications of their findings. This interpretation guides decision-making in
healthcare, policy formulation, and scientific research.
5. Application in Public Health: Biostatistics plays a vital role in public health by
analyzing patterns of diseases, evaluating the effectiveness of interventions, and
predicting health outcomes in populations.
Elementary Properties of Probability:
1. The probability of any event is a nonnegative number: P(Ei) ≥ 0.
2. A key concept in the statement of this property is the concept of mutually exclusive
outcomes. Two events are said to be mutually exclusive if they cannot occur
simultaneously.
The sum of the probabilities of the mutually exclusive outcomes is equal to 1. This is the
property of exhaustiveness and refers to the fact that the observer of a probabilistic
process must allow for all possible events, and when all are taken together, their total
probability is 1.
3. Consider any two mutually exclusive events, Ei and Ej. The probability of the
occurrence of either Ei or Ej is equal to the sum of their individual probabilities:
P(Ei or Ej) = P(Ei) + P(Ej).
CALCULATING THE PROBABILITY OF AN EVENT
When probabilities are calculated with a subset of the total group as the denominator,
the result is a conditional probability
Joint Probability
Sometimes we want to find the probability that a subject picked at random from a
group of subjects possesses two characteristics at the same time. Such a probability is
referred to as a joint probability.
Problem:
The primary aim of a study by Carter et al. (A-1) was to investigate the effect of the age
at onset of bipolar disorder on the course of the illness. One of the variables investigated
was family history of mood disorders. Table shows the frequency of a family history of
mood disorders in the two groups of interest (Early age at onset defined to be 18 years or
younger and Later age at onset defined to be later than 18 years). Suppose we pick a
person at random from this sample. What is the probability that this person will be 18
years old or younger?
Solution:
For purposes of illustrating the calculation of probabilities we consider this group of 318
subjects to be the largest group for which we have an interest. In other words, for this
example, we consider the 318 subjects as a population. We assume that Early and Later
are mutually exclusive categories and that the likelihood of selecting any one person is
equal to the likelihood of selecting any other person. We define the desired probability
as the number of subjects with the characteristic of interest (Early) divided by the total
number of subjects. We may write the result in probability notation as follows:
P(E) = number of Early subjects / total number of subjects = 141/318 = 0.4434
Problem:
We wish to compute the joint probability of Early age at onset (E) and a negative family
history of mood disorders (A) from a knowledge of an appropriate marginal probability
and an appropriate conditional probability.
Solution:
The probability we seek is P(E ∩ A).
P(E) = 141/318 = 0.4434,
and the conditional probability is P(A|E) = 28/141 = 0.1986.
P(E ∩ A) = P(E) · P(A|E) = (0.4434)(0.1986) = 0.0881
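The calculation above can be checked with a short Python sketch; the counts 141, 28, and 318 are taken directly from the worked example.
# Verifying the probabilities from the Carter et al. example (counts from the text)
total = 318               # all subjects
early = 141               # Early age at onset (E)
early_and_negative = 28   # Early subjects with a negative family history (E and A)

p_E = early / total                       # marginal probability P(E)
p_A_given_E = early_and_negative / early  # conditional probability P(A|E)
p_E_and_A = p_E * p_A_given_E             # joint probability P(E and A)

print(round(p_E, 4))          # 0.4434
print(round(p_A_given_E, 4))  # 0.1986
print(round(p_E_and_A, 4))    # 0.0881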
Independent Events
Suppose the probability of event A is the same whether or not event B has occurred; that is,
P(A|B) = P(A). In such cases we say that A and B are independent events. The
multiplication rule for two independent events, then, may be written as
P(A ∩ B) = P(A) P(B): if two events are independent, the probability of their joint
occurrence is equal to the product of the probabilities of their individual occurrences.
When two events with nonzero probabilities are independent, each of the following
statements is true: P(A|B) = P(A), P(B|A) = P(B), and P(A ∩ B) = P(A) P(B).
Marginal Probability
Given some variable that can be broken down into m categories designated by A1, A2, ...,
Ai, ..., Am and another jointly occurring variable that is broken down into n categories
designated by B1, B2, ..., Bj, ..., Bn, the marginal probability of Ai, P(Ai), is equal to the
sum of the joint probabilities of Ai with all the categories of B. That is,
P(Ai) = Σ P(Ai ∩ Bj), where the sum is taken over all categories Bj.
Bayes Theorem
Conditional Probability
The conditional probability of A given B is equal to the probability of A ∩ B divided by
the probability of B, provided the probability of B is not zero:
P(A|B) = P(A ∩ B) / P(B), where P(B) ≠ 0.
Problem:
Suppose we pick a subject at random from the 318 subjects and find that he is 18 years or
younger (E). What is the probability that this subject will be one who has no family
history of mood disorders (A)?
Solution: The total number of subjects is no longer of interest, since, with the selection of
an Early subject, the Later subjects are eliminated. We may define the desired probability,
then, as follows: What is the probability that a subject has no family history of mood
disorders (A), given that the selected subject is Early (E)? This is a conditional probability
and is written as P(A|E), in which the vertical line is read "given." The 141 Early
subjects become the denominator of this conditional probability, and 28, the number of
Early subjects with no family history of mood disorders, becomes the numerator. Our
desired probability, then, is P(A|E) = 28/141 = 0.1986.
The binomial coefficient, n!/(x!(n − x)!), counts the number of ways to get x heads in n
flips. For example, if x = 2 and n = 3, the binomial coefficient is calculated as 3!/(2! × 1!),
which is equal to 3; there are three distinct ways to get two heads in three flips (i.e.,
head-head-tail, head-tail-head, tail-head-head). Thus, the probability of getting two heads
in three flips if p is .50 would be .375 (3 × .50^2 × (1 − .50)^1), or 3 out of 8.
If the coin is fair, so that p = .50, and we flip it 10 times, the probability of six heads and
four tails is (10!/(6! × 4!)) × .50^6 × (1 − .50)^4, which is about .21.
If the coin is a trick coin, so that p = .75, the probability of six heads in 10 tosses is
(10!/(6! × 4!)) × .75^6 × (1 − .75)^4, which is about .15.
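These binomial probabilities can be reproduced with scipy.stats.binom.pmf, a standard SciPy routine; the values below follow the fair-coin and trick-coin examples in the text.
from scipy.stats import binom

# binom.pmf(k, n, p) gives the probability of k successes in n trials
print(binom.pmf(2, 3, 0.50))   # 0.375: two heads in three flips of a fair coin
print(binom.pmf(6, 10, 0.50))  # about 0.205: six heads in ten flips of a fair coin
print(binom.pmf(6, 10, 0.75))  # about 0.146: six heads in ten flips of the trick coin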
Likelihoods may seem overly restrictive because we have compared only two simple
statistical hypotheses in a single likelihood ratio. The likelihood ratio of any two
hypotheses, however, is simply the ratio of their heights on the likelihood function curve.
Likelihoods can be computed for many statistical distributions, including the Normal
distribution, chi-square distribution, binomial distribution, Poisson distribution, and
uniform distribution. Likelihoods are also a key component of Bayesian inference. The
Bayesian approach to statistics is fundamentally about making use of all available
information when drawing inferences in the face of uncertainty. Previous information is
quantified using what is known as a prior distribution. Mathematically, a well-known
conditional-probability theorem (Bayes' theorem) states that the procedure for obtaining
the posterior distribution of θ is as follows:
P(θ | D) = P(D | θ) × P(θ) / P(D)
In this context, K is merely a rescaling constant and is equal to 1/P(D). We often write
this theorem more simply as
P(θ | D) = K × P(D | θ) × P(θ), that is, Posterior ∝ Likelihood × Prior.
To obtain the posterior for the coin-flipping example, we take the terms of the binomial
likelihood that involve p, namely p^x (1 − p)^(n−x), and then multiply it by the formula
for the beta prior with a and b shape parameters, p^(a−1) (1 − p)^(b−1):
P(p | x) ∝ p^x (1 − p)^(n−x) × p^(a−1) (1 − p)^(b−1),
which suggests that we can interpret the information contained in the prior as adding a
certain amount of previous data (i.e., a − 1 past successes and b − 1 past failures) to the
data from our current experiment. Because we are multiplying together terms with the
same base, the exponents can be added together in a final simplification step:
P(p | x) ∝ p^(x+a−1) (1 − p)^(n−x+b−1)
This final formula looks like our original beta distribution but with new shape parameters
equal to x + a and n – x + b. In other words, we started with the prior distribution beta
(a,b) and added the successes from the data, x, to a and the failures, n – x, to b, and our
posterior distribution is a beta(x + a,n – x + b) distribution.
Consider the previous example of observing 60 heads in 100 flips of a coin. Imagine that
going into this experiment, we had some reason to believe the coin’s bias was within .20
of being fair in either direction; that is, we believed that p was likely within the range of
.30 to .70. We could choose to represent this information using the beta(25,25) distribution
shown as the dotted line. The likelihood function for the 60 flips is shown as the dot-and-
dashed line and is identical to that shown in the middle panel.
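A minimal sketch of this prior-to-posterior update, assuming the beta(25,25) prior and the 60 heads in 100 flips described above and using the x + a, n − x + b conjugacy rule derived earlier:
from scipy.stats import beta

a, b = 25, 25   # beta prior shape parameters
x, n = 60, 100  # 60 heads observed in 100 flips

# By conjugacy, the posterior is beta(x + a, n - x + b)
post_a, post_b = x + a, n - x + b
posterior = beta(post_a, post_b)

print(post_a, post_b)            # 85, 65
print(posterior.mean())          # posterior mean of p, (x + a) / (n + a + b), about 0.5667
print(posterior.interval(0.95))  # central 95% credible interval for p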
The statistical distribution using appropriate software tool –Python
Descriptive statistics describe and summarize data so that they convey meaningful
information and can be used to identify trends. Describing and summarizing a single
variable is called univariate analysis. Describing a statistical relationship between two
variables is called bivariate analysis. Describing the statistical relationship between
multiple variables is called multivariate analysis.
There are two types of descriptive statistics:
• Measures of central tendency
• Measures of variability
The measures of central tendency available in Python's statistics module include:
• Mean
• Median
• Median Low
• Median High
• Mode
Mean
It is the sum of the observations divided by the total number of observations, that is, the
average. The mean() function returns the mean or average of the data passed in its
arguments. If the passed argument is empty, StatisticsError is raised.
Example: Python code to calculate mean
# Python code to demonstrate the working of mean()
# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 3, 3, 2, 2, 2, 1]

# using mean() to calculate average of list elements
print("The average of list values is : ", end="")
print(statistics.mean(li))
Output
The average of list values is : 2
The median_low() function returns the median of data in case of odd number of
elements, but in case of even number of elements, returns the lower of two middle
elements. If the passed argument is empty, StatisticsError is raised
# Python code to demonstrate the working of median_low()
# importing the statistics module
import statistics

# simple list of a set of integers
set1 = [1, 3, 3, 4, 5, 7]

# Print median of the data-set
# Median value may or may not lie within the data-set
print("Median of the set is %s" % (statistics.median(set1)))

# Print low median of the data-set
print("Low Median of the set is %s" % (statistics.median_low(set1)))

Output:
Median of the set is 3.5
Low Median of the set is 3
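The remaining measures listed earlier (Median High and Mode) follow the same pattern; a brief sketch using the same set1 values:
# Python code to demonstrate median_high() and mode()
import statistics

# same set of integers used above
set1 = [1, 3, 3, 4, 5, 7]

# median_high() returns the larger of the two middle values
# when the number of elements is even
print("High Median of the set is %s" % (statistics.median_high(set1)))

# mode() returns the most common value in the data
print("Mode of the set is %s" % (statistics.mode(set1)))

Output:
High Median of the set is 4
Mode of the set is 3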
In Python, you can use various libraries such as NumPy, SciPy, and Matplotlib to analyze
data and determine the statistical distribution. Here's an example of how you might find
the distribution of a dataset using these libraries:
Firstly, let's generate some sample data. For demonstration purposes, we'll create a
dataset following a normal distribution.
This code snippet demonstrates:
Generating a dataset of 1000 data points following a normal distribution. Plotting a
histogram to visualize the distribution of the generated data.
Fitting a normal distribution curve to the data and plotting it over the histogram.
The stats.norm.fit() function in this example fits a normal distribution to the data using
maximum likelihood estimation, estimating the mean and standard deviation of the
distribution. You can replace 'norm' with other distribution names like 'gamma', 'expon',
etc., to fit different distributions to your data.
This is a basic example, and in practice, you might need to preprocess and analyze your
data differently based on its characteristics and the specific analysis you're conducting.
But this should give you a starting point for determining the statistical distribution of
your data using Python.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generating a dataset with a normal distribution
np.random.seed(42)  # Setting seed for reproducibility
data = np.random.normal(loc=0, scale=1, size=1000)  # Mean=0, Standard Deviation=1, 1000 data points
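A sketch of the histogram and curve-fitting steps described above; it assumes the same imports and data array as the previous snippet (repeated here so the block runs on its own):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Same data generation as above
np.random.seed(42)
data = np.random.normal(loc=0, scale=1, size=1000)

# Plotting a histogram of the data (density=True scales it to a probability density)
plt.hist(data, bins=30, density=True, alpha=0.6, label='Data histogram')

# Fitting a normal distribution to the data by maximum likelihood estimation
mu, std = stats.norm.fit(data)

# Plotting the fitted normal curve over the histogram
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, stats.norm.pdf(x, mu, std), label='Fitted normal')

plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram with fitted normal distribution')
plt.legend()
plt.show()

print("Estimated mean:", mu, "Estimated standard deviation:", std)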
Definition:
A p value is the probability that the computed value of a test statistic is at least as extreme
as a specified value of the test statistic when the null hypothesis is true. Thus, the p value
is the smallest value of α for which we can reject a null hypothesis.
Generally, the level of statistical significance is expressed as a p-value in the range
between 0 and 1. The smaller the p-value, the stronger the evidence, and hence the more
likely the result is to be statistically significant; rejection of the null hypothesis becomes
more likely as the p-value becomes smaller. A statistician wants to test the hypothesis
H0: μ = 120 using the alternative hypothesis Hα: μ > 120 and assuming that α = 0.05. For
that, he took the sample values as n = 40, σ = 32.17 and x̄ = 105.37. Determine the
conclusion for this hypothesis test.
Solution:
We know that z = (x̄ − μ) / (σ/√n) = (105.37 − 120) / (32.17/√40) ≈ −2.88. Because the
test is one-sided with Hα: μ > 120, the p-value is P(Z ≥ −2.88) ≈ 0.998, which is much
larger than α = 0.05, so we fail to reject H0.
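The same conclusion can be reached numerically; a minimal sketch using SciPy's normal survival function with the values stated in the problem:
import math
from scipy.stats import norm

x_bar, mu0, sigma, n = 105.37, 120, 32.17, 40

# z statistic for a one-sample z-test with known sigma
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# One-sided p-value for Ha: mu > 120 (upper tail)
p_value = norm.sf(z)

print(round(z, 3))        # about -2.876
print(round(p_value, 4))  # about 0.998, so H0 is not rejected at alpha = 0.05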
• One-sided p-value: You can use this method of testing if a large or unexpected
change in the data makes only a small or no difference to your data set. Typically,
this is unusual and you can use a two-sided p-value test instead.
• Two-sided p-value: You can use this method of testing if a large change in the
data would affect the outcome of the research and if the alternative hypothesis is
fairly general instead of specific. Most professionals use this method to ensure they
account for large changes in data.
Chi-square Test
The chi-square distribution is the most frequently employed statistical technique for the
analysis of count or frequency data. A statistical test that is used to compare observed
and expected results. The Chi-square statistic compares the size of any discrepancies
between the expected results and actual results.
The test statistic, X² = Σ[(O − E)²/E], will be small if the observed and expected
frequencies are close together and will be large if the differences are large. The computed
value of X² is compared with the tabulated value of X² with k − r degrees of freedom. The
decision rule, then, is: reject H0 if X² is greater than or equal to the tabulated X² for the
chosen value of α.
Types of Chi-square
• Tests of goodness-of-fit
• Test of independence
• Test of Homogeneity
Tests of goodness-of-fit
• The chi-square test for goodness-of-fit uses frequency data from a sample to test
hypotheses about the shape or proportions of a population.
• The data, called observed frequencies, simply count how many individuals from
the sample are in each category.
Problem:
From a group of persons, a random sample of 40 was classified by eye colour: blue 12,
brown 21, green 3, others 4. In the population the eye-colour proportions are brown 80%,
blue 10%, green 2%, others 8%. Is there any difference between the proportions in the
sample and those in the population? Use α = 0.05.
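A sketch of this goodness-of-fit test with scipy.stats.chisquare, where the expected counts are obtained from the stated population percentages (category order blue, brown, green, others is assumed):
from scipy.stats import chisquare

observed = [12, 21, 3, 4]                      # blue, brown, green, others in the sample of 40
population_props = [0.10, 0.80, 0.02, 0.08]    # stated population proportions
expected = [p * 40 for p in population_props]  # expected counts: [4.0, 32.0, 0.8, 3.2]

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(chi2_stat, p_value)

if p_value < 0.05:
    print("Reject H0: the sample proportions differ from the population proportions")
else:
    print("Do not reject H0")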
Problem:
A total of 1,500 workers on two operators (A and B) were classified as deaf and non-deaf
according to the following table. Is there an association between deafness and type of
operator? Let α = 0.05.
Calculate the expected frequencies:
E = (row total × column total) / grand total
Steps:
1. Data
Of the 1,500 workers, 1,000 worked on operator A, of whom 100 were deaf, while 500
worked on operator B, of whom 60 were deaf.
2. Assumption
• Sample is randomly selected from the population.
3. Hypothesis
• HO: there is no significant association between type of operator & deafness.
• HA: there is significant association between type of operator & deafness.
4. Level of significance (α = 0.05)
• 5% chance factor effect area
• 95% influencing factor effect area
• d.f. (degrees of freedom) = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1
Tabulated chi-square for d.f. = 1 at α = 0.05 is 3.841.
5. Apply a proper test of significance
6. Statistical decision
Calculated chi-square < tabulated chi-square; p > 0.05.
7. Conclusion
We accept H0 (H0 may be true): there is no significant association between type of
operator and deafness.
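The same 2×2 analysis can be reproduced with scipy.stats.chi2_contingency; passing correction=False gives the uncorrected Pearson chi-square that matches the hand calculation:
import numpy as np
from scipy.stats import chi2_contingency

# Rows: operator A, operator B; columns: deaf, not deaf
table = np.array([[100, 900],
                  [60, 440]])

chi2_stat, p_value, dof, expected = chi2_contingency(table, correction=False)

print(chi2_stat)  # about 1.4, which is less than the tabulated value 3.841
print(p_value)    # greater than 0.05, so H0 is not rejected
print(expected)   # expected frequencies, each (row total x column total) / grand total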
When a 2×2 chi-square table has a zero cell (one of the four cells is zero), we cannot apply
the chi-square test because we have what is called complete dependence. For an a×b
chi-square table with a zero cell, we likewise cannot apply the test unless we do proper
re-categorization to remove the zero cell.
Properties of Chi- square test:
1. The mean of the X2 distribution is equal to
the number of degrees of freedom
2. The variance of X2 distribution is twice the degree of freedom
3. If X2 is a chi-square variate with γ degrees of freedom, then X2/2 is a gamma variate
4. The standard X2 variate tends to the standard normal variate as n → ∞
Applications:
1. To test the hypothetical value of the population variance
2. To test the goodness of fit
3. To test the independence of attributes
4. To test the homogeneity of independent estimates
5. To combine various probabilities to give a single test of significance
Hypothesis Testing
A hypothesis may be defined simply as a statement about one or more populations.
Statistical hypotheses are hypotheses that are stated in such a way that they may be
evaluated by appropriate statistical techniques.
Hypothesis Testing Steps
1. Data. The nature of the data that form the basis of the testing procedures must be
understood, since this determines the particular test to be employed
2. Assumptions : A general procedure is modified depending on the assumptions
3. Hypothesis: There are two statistical hypotheses involved in hypothesis testing, and
these should be stated explicitly. The null hypothesis is the hypothesis to be tested; it
is designated by the symbol H0. The alternative hypothesis is a statement of what we
will believe is true if our sample data cause us to reject the null hypothesis; we
designate the alternative hypothesis by the symbol HA. Suppose, for example, we want
to know whether we can conclude that a certain population mean is not 50. The null
hypothesis is
• H0: μ = 50 and the alternative is
• HA: μ ≠ 50
• Suppose we want to know if we can conclude that the population mean is greater
than 50. Our hypotheses are
• H0: μ ≤ 50 HA: μ > 50
If we want to know if we can conclude that the population mean is less than 50, the
hypotheses are
• H0: μ ≥50 HA: μ <50
4. Test statistic. The test statistic is some statistic that may be computed from the data of
the sample. As we will see, the test statistic serves as a decision maker, since the
decision to reject or not to reject the null hypothesis depends on the magnitude of the
test statistic.
An example of a test statistic is the quantity z = (x̄ − μ0) / (σ/√n), where μ0 is the
hypothesized value of the population mean.
5. Distribution of test statistic. The distribution of the test statistic when the null
hypothesis is true must be specified, since it is the basis for locating the rejection and
nonrejection regions.
6. Decision rule. The decision rule tells us to reject the null hypothesis if the value of the
test statistic that we compute from our sample is one of the values in the rejection
region and to not reject the null hypothesis if the computed value of the test statistic is
one of the values in the nonrejection region
7. Calculation of test statistic. From the data contained in the sample we compute a value
of the test statistic and compare it with the rejection and nonrejection regions that have
already been specified.
8. Statistical decision. The statistical decision consists of rejecting or of not rejecting the
null hypothesis
It is rejected if the computed value of the test statistic falls in the rejection region, and it
is not rejected if the computed value of the test statistic falls in the nonrejection region.
9. Conclusion.
• If H0 is rejected, we conclude that HA is true.
• If H0 is not rejected, we conclude that H0 may be true.
10. p values.
• The p value is a number that tells us how unusual our sample results are, given
that the null hypothesis is true.
• A p value indicating that the sample results are not likely to have occurred, if the
null hypothesis is true, provides justification for doubting the truth of the null
hypothesis.
Purpose of Hypothesis Testing
The purpose of hypothesis testing is to assist administrators and clinicians in making
decisions. The administrative or clinical decision usually depends on the statistical
decision. If the null hypothesis is rejected, the administrative or clinical decision usually
reflects this, in that the decision is compatible with the alternative hypothesis. The reverse
is usually true if the null hypothesis is not rejected. The administrative or clinical
decision, however, may take other forms, such as a decision to gather more data.
Hypothesis Testing:
A single population mean
The testing of a hypothesis about a population mean under three different conditions: (1)
when sampling is from a normally distributed population of values with known
variance; (2) when sampling is from a normally distributed population with unknown
variance, and (3) when sampling is from a population that is not normally distributed.
When sampling is from a normally distributed population and the population variance
is known, the test statistic for testing H0: μ = μ0 is z = (x̄ − μ0) / (σ/√n).
Problems:
1. Does the evidence support the idea that the average lecture consists of 3000 words
if a random sample of the lectures of 16 professors had a mean of 3472 words,
given the population standard deviation is 500 words? Use α = 0.01. Assume that
lecture lengths are approximately normally distributed. Show all steps.
μ = 3000
σ = 500
𝐱̅ = 3472
n = 16
α = 0.01
1) Ho: μ = 3000
2) Ha : μ ≠ 3000
3) α = 0.01
4) Reject Ho if z < −2.576 or z > 2.576
5) z = (3472 − 3000) / (500/√16) = 3.78
6) Reject Ho, because 3.78 > 2.576
7) At α = 0.01, the population mean is not equal to 3000 words.
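The z value in step 5 can be reproduced with a few lines of Python; a sketch using the values stated in the problem:
import math
from scipy.stats import norm

x_bar, mu0, sigma, n, alpha = 3472, 3000, 500, 16, 0.01

# z statistic for a one-sample z-test with known sigma
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Two-sided critical value at alpha = 0.01 (about 2.576)
critical = norm.ppf(1 - alpha / 2)

print(round(z, 2), round(critical, 3))
if abs(z) > critical:
    print("Reject Ho")
else:
    print("Do not reject Ho")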
2. Suppose that scores on the Scholastic Aptitude Test form a normal distribution with μ
= 500 and σ = 100. A high school counselor has developed a special course designed to
boost SAT scores. A random sample of 16 students is selected to take the course and then
the SAT. The sample had an average score of x̄ = 544. Does the course boost SAT scores?
Test at α = 0.01. Show all steps.
μ = 500
σ = 100
𝐱̅ = 544
n = 16
α = 0.01
1) Ho: μ = 500
2) Ha : μ > 500
3) α = 0.01
4) Reject Ho if z > 2.326
5) z = (544 − 500) / (100/√16) = 1.76
6) Accept Ho, because 1.76 < 2.326
7) At α = 0.01, the population mean is equal to 500.
One-Sided Hypothesis Tests
Hypothesis test may be one-sided, in which case all the rejection region is in one or the
other tail of the distribution. Whether a one-sided or a two-sided test is used depends on
the nature of the question being asked by the researcher.
Problem
Researchers are interested in the mean age of a certain population. Let us say that they
are asking the following question: Can we conclude that the mean age of this population
is different from 30 years? Suppose, instead of asking if they could conclude that μ ≠ 30,
the researchers had asked: Can we conclude that μ < 30? To this question we would reply
that they can so conclude if they can reject the null hypothesis that μ ≥ 30.
1. Data. See the previous example.
2. Assumptions. See the previous example.
3. Hypotheses.
H0: μ ≥ 30
HA: μ < 30
The inequality in the null hypothesis implies that the null hypothesis consists
of an infinite number of hypotheses.
5. Test statistic.
z = (x̄ − μ0) / (σ/√n) = −2.12
8. Statistical decision. We are able to reject the null hypothesis since
−2.12 < −1.645.
9. Conclusion. We conclude that the mean age of this population is less than 30 years.
THE DIFFERENCE BETWEEN TWO POPULATION MEANS
Hypothesis testing involving the difference between two population means is most
frequently employed to determine whether or not it is reasonable to conclude that the
two population means are unequal.
Problem:
1. Researchers wish to know if the data they have collected provide sufficient evidence to
indicate a difference in mean serum uric acid levels between normal individuals and
individuals with Down’s syndrome. The data consist of serum uric acid readings on 12
individuals with Down's syndrome and 15 normal individuals. The means are x̄1 = 4.5
mg/100 ml for the Down's syndrome group and x̄2 = 3.4 mg/100 ml for the normal group.
We will say that the sample data do provide evidence that the population means are not
equal if we can reject the null hypothesis that the population means are equal. Let us
reach a conclusion by means of the ten-step hypothesis testing procedure.
1. Data. See problem statement.
2. Assumptions. The data constitute two independent simple random samples each
drawn from a normally distributed population with a variance equal to 1 for the
Down’s syndrome population and 1.5 for the normal population.
3. Hypotheses:
H0: μ1 − μ2 = 0, HA: μ1 − μ2 ≠ 0
An alternative way of stating the hypotheses is as follows:
H0: μ1 = μ2, HA: μ1 ≠ μ2
4. Test statistic. The test statistic is z = (x̄1 − x̄2) divided by the standard error of the
difference, √(σ1²/n1 + σ2²/n2).
5. Distribution of test statistic. When the null hypothesis is true, the test statistic follows
the standard normal distribution.
6. Decision rule. Let α = 0.05. The critical values of z are ±1.96.
Reject H0 unless −1.96 < z_computed < 1.96.
7. Calculation of test statistic. z = (4.5 − 3.4) / √(1/12 + 1.5/15) = 2.57.
8. Statistical decision. Reject H0, since 2.57 > 1.96.
9. Conclusion. Conclude that, on the basis of these data, there is an indication that the
two population means are not equal.
10. p value. For this test, p = 0.0102.
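A sketch verifying the z value and p value reported above, using the stated means, the variances 1 and 1.5, and the sample sizes 12 and 15:
import math
from scipy.stats import norm

x1, x2 = 4.5, 3.4      # sample means (Down's syndrome group, normal group)
var1, var2 = 1.0, 1.5  # known population variances
n1, n2 = 12, 15        # sample sizes

# z statistic for the difference between two means with known variances
z = (x1 - x2) / math.sqrt(var1 / n1 + var2 / n2)

# Two-sided p-value
p_value = 2 * norm.sf(abs(z))

print(round(z, 2))        # about 2.57
print(round(p_value, 4))  # about 0.0102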
Problem:
The purpose of a study by Wilkins et al. (A-28) was to measure the effectiveness of
recombinant human growth hormone (rhGH) on children with total body surface area
burns > 40 percent. In this study, 16 subjects received daily injections at home of rhGH.
At baseline, the researchers wanted to know the current levels of insulin-like growth
factor (IGF-I) prior to administration of rhGH. The sample variance of IGF-I levels (in
ng/ml) was 670.81. We wish to know if we may conclude from these data that the
population variance is not 600.
1. Data. See statement in the example.
2. Assumptions. The study sample constitutes a simple random sample from a population
of similar children. The IGF-I levels are normally distributed.
3. Hypotheses. H0: σ² = 600, HA: σ² ≠ 600. The test statistic is
X² = (n − 1)s²/σ0² = (15)(670.81)/600 = 16.77.
8. Statistical decision. Do not reject H0 since 6.262 < 16.77 < 27.488.
9. Conclusion. Based on these data we are unable to conclude that the population
variance is not 600.
10. p value. The determination of the p value for this test is complicated by the fact
that we have a two-sided test and an asymmetric sampling distribution. When we have
a two-sided test and a symmetric sampling distribution such as the standard normal
or t, we may, as we have seen, double the one-sided p value. Problems arise when we
attempt to do this with an asymmetric sampling distribution such as the chi-square
distribution
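The test statistic and critical values in step 8 can be reproduced as follows; this sketch assumes the sample variance 670.81, n = 16, and the hypothesized variance 600 given above:
from scipy.stats import chi2

n, sample_var, sigma0_sq, alpha = 16, 670.81, 600, 0.05

# Chi-square test statistic for a single population variance
chi2_stat = (n - 1) * sample_var / sigma0_sq

# Two-sided critical values with n - 1 = 15 degrees of freedom
lower = chi2.ppf(alpha / 2, n - 1)
upper = chi2.ppf(1 - alpha / 2, n - 1)

print(round(chi2_stat, 2))               # about 16.77
print(round(lower, 3), round(upper, 3))  # about 6.262 and 27.488
if lower < chi2_stat < upper:
    print("Do not reject H0")
else:
    print("Reject H0")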
Hypothesis Testing Python
Imagine a woman in her seventies who has a noticeable tummy bump. Medical
professionals could presume the bulge is a fibroid. In this instance, our first finding (the
null hypothesis) is that this woman has a fibroid, and our alternative finding is that she
does not. We use the terms null hypothesis (beginning assumption) and alternate
hypothesis (countering assumption) to conduct hypothesis testing. The next step is
gathering the data samples we can use to evaluate the null hypothesis. The following
tests are commonly used:
T-Test: When comparing the mean values of two samples that specific characteristics
may connect, a t-test is performed to see if there exists a substantial difference. It is
typically employed when data sets, such as the results of tossing a coin 100 times, would
exhibit a normal distribution and may have unknown variances. The t-test is a method for
evaluating hypotheses that allows you to assess a population-applicable assumption.
Assumptions:
o Each sample's data are independent and identically distributed (iid).
o Each sample's data follow a normal distribution.
o Every sample's data share the same variance.
T-tests are of two types: 1. one-sampled t-test and 2. two-sampled t-test.
One sample t-test: The one sample t-test ascertains whether the sample mean differs
statistically from a known or hypothesized population mean. The one sample t-test is a
parametric testing technique.
Example: You are determining if the average age of 10 people is 30 or otherwise. Check
the Python script below for the implementation.
Code
# Python program to implement T-Test on a sample of ages
# Importing the required libraries
from scipy.stats import ttest_1samp
import numpy as np

# Creating a sample of ages
ages = [45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
print(ages)

# Calculating the mean of the sample
mean = np.mean(ages)
print(mean)

# Performing the T-Test against a population mean of 30
t_test, p_val = ttest_1samp(ages, 30)
print("P-value is: ", p_val)

# taking the threshold value as 0.05 or 5%
if p_val < 0.05:
    print("We can reject the null hypothesis")
else:
    print("We can accept the null hypothesis")
Output
[45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
45.4
P-value is: 0.07179988272763554
We can accept the null hypothesis
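The two-sampled t-test mentioned earlier compares the means of two independent samples; a minimal sketch with scipy.stats.ttest_ind and two hypothetical age samples (the second sample is invented for illustration):
from scipy.stats import ttest_ind

# Two hypothetical, independent samples of ages (invented for illustration)
group_a = [45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
group_b = [30, 55, 29, 41, 22, 48, 33, 27, 38, 50]

# Two-sample (independent) t-test
t_stat, p_val = ttest_ind(group_a, group_b)
print("P-value is: ", p_val)

# taking the threshold value as 0.05 or 5%
if p_val < 0.05:
    print("We can reject the null hypothesis")
else:
    print("We can accept the null hypothesis")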
Chi-Square test
A statistical method to determine whether two categorical variables have a significant
correlation between them. Both variables should be from the same population and should
be categorical, such as Yes/No, Male/Female, Red/Green, etc. For example, we can build
a data set with observations on people's ice-cream buying patterns and try to correlate the
gender of a person with the flavour of the ice-cream they prefer. If a correlation is found,
we can plan for an appropriate stock of flavours by knowing the number of people of each
gender visiting. We use functions from the scipy and numpy libraries to carry out the
chi-square test.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(1, 1)

linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label='df = %d' % df)

plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Chi-Square Distribution')
plt.legend()
plt.show()
Z-Test for One Proportion
The test statistic is z = (P − Po) / √(Po(1 − Po)/n)
Where:
• P: Observed sample proportion
• Po: Hypothesized population proportion
• n: Sample size
In this example, we set P to 0.86, Po to 0.80, and n to 100, and use these values to
calculate the one-proportion z-test in the Python programming language.
Code:
import math

P = 0.86
Po = 0.80
n = 100

a = (P - Po)
b = Po * (1 - Po) / n
z = a / math.sqrt(b)
print(z)

Output:
1.4999999999999984