FHA Notes

Class notes

UNIT I

Introduction

Fundamentals of Healthcare Analytics- Overview

Healthcare analytics involves the application of data analysis and insights in the
healthcare industry to improve patient outcomes, operational efficiency, and decision-
making processes.

Fundamental aspects:

1. Data Collection: Gathering and managing various types of healthcare data, including patient records, clinical data, administrative information, and financial data.
2. Data Processing and Integration: Cleaning, organizing, and integrating data from disparate sources to create a unified dataset. This often involves using tools and techniques to handle structured and unstructured data.
3. Descriptive Analytics: Utilizing historical data to understand past trends and patterns in healthcare, such as patient demographics, disease prevalence, and resource utilization.
4. Predictive Analytics: Forecasting future outcomes or trends based on historical data. This involves using statistical models and machine learning algorithms to predict events like disease outbreaks, patient readmissions, or resource needs.
5. Prescriptive Analytics: Recommending actions or interventions based on predictive analytics to optimize decision-making. For instance, suggesting the most effective treatment plans or resource allocation strategies.
6. Performance Measurement: Evaluating and monitoring the effectiveness of healthcare initiatives, interventions, and programs through key performance indicators (KPIs) and metrics.
7. Privacy and Security: Ensuring compliance with regulations like HIPAA (Health Insurance Portability and Accountability Act) to safeguard patient data and maintain confidentiality.
8. Technological Tools: Using advanced technologies like artificial intelligence, machine learning, big data analytics, and data visualization tools to derive meaningful insights from healthcare data.
9. Clinical Decision Support: Providing clinicians with data-driven insights at the point of care to aid in diagnosis, treatment planning, and personalized medicine.
10. Healthcare Economics: Analyzing the financial aspects of healthcare, including cost analysis, revenue cycle management, and reimbursement models.

Data

The raw material of statistics is data. For our purposes we may define data as numbers. The two kinds of numbers that we use in statistics are numbers that result from taking a measurement, in the usual sense of the term, and those that result from the process of counting. For example, when a nurse weighs a patient or takes a patient's temperature, a measurement is obtained.

Statistics: The meaning of statistics is implicit in the previous section. More concretely,
however, we may say that statistics is a field of study concerned with (1) the collection,
organization, summarization, and analysis of data; and (2) the drawing of inferences
about a body of data when only a part of the data is observed.
Sources of Data: Data are usually available from one or more of the following sources:
1. Routinely kept records
2. Surveys
3. Experiments
4. External sources

Variable: A characteristic that takes on different values in different persons, places, or things is called a variable. Examples include diastolic blood pressure, heart rate, the heights of adult males, the weights of preschool children, and the ages of patients seen in a dental clinic.
Quantitative Variables: A quantitative variable is one that can be measured in the usual sense.
Qualitative Variables: Measurements made on qualitative variables convey information regarding an attribute or category.
Random Variable: When the values obtained arise as a result of chance factors, so that they cannot be exactly predicted in advance, the variable is called a random variable. An example of a random variable is adult height.
Discrete Random Variable: Variables may be characterized further as to whether they
are discrete or continuous. A discrete variable is characterized by gaps or interruptions
in the values that it can assume.
Continuous Random Variable : A continuous random variable does not possess the gaps
or interruptions characteristic of a discrete random variable
Population: A population of entities is the largest collection of entities for which we have an interest at a particular time. A population may consist of people, animals, machines, places, or cells.
Sample: A sample may be defined simply as a part of a population.
Introduction to biostatistics
Biostatistics is the branch of statistics that deals with data related to living organisms,
health, and biology. It involves the application of statistical methods to design
experiments, collect, analyze, and interpret data in fields such as medicine, public health,
genetics, ecology, and more.
Here are some key aspects:
1. Study Design: Biostatisticians play a crucial role in designing experiments and
studies. They determine the sample size, randomization methods, and data
collection techniques to ensure that results are reliable and meaningful.
2. Data Collection: They collect data through various methods, such as surveys,
clinical trials, observations, or experiments. This data may include information on
diseases, treatments, genetics, environmental factors, and more.
3. Data Analysis: Once data is collected, biostatisticians use statistical methods to
analyze it. They employ techniques like hypothesis testing, regression analysis,
survival analysis, and more to draw conclusions and make inferences from the
data.
4. Interpretation: Biostatisticians interpret the results of their analyses, often
collaborating with researchers, doctors, or policymakers to understand the
implications of their findings. This interpretation guides decision-making in
healthcare, policy formulation, and scientific research.
5. Application in Public Health: Biostatistics plays a vital role in public health by
analyzing patterns of diseases, evaluating the effectiveness of interventions, and
predicting health outcomes in populations.

Usage of biostatistics in health care


• Documentation of the medical history of diseases.
• Planning and conduct of clinical studies.
• Evaluating the merits of different procedures.
• Providing methods for the definition of "normal" and "abnormal".

Role of Biostatistics in patient care


• In increasing awareness regarding diagnostic, therapeutic, and prognostic uncertainties and providing rules of probability to delineate those uncertainties
• In providing methods to integrate chances with value judgments that could be most beneficial to the patient
• In providing methods such as sensitivity, specificity, and predictive values that help choose valid tests for patient assessment
• In providing tools such as scoring systems and expert systems that can help reduce epistemic uncertainties
• In carrying out a valid and reliable health situation analysis, including proper summarization and interpretation of data.

COMPUTERS AND BIOSTATISTICAL ANALYSIS


The widespread use of computers has had a tremendous impact on health sciences research in general and biostatistical analysis in particular. The necessity to perform long and tedious arithmetic computations as part of the statistical analysis of data lives only in the memory of those researchers and practitioners whose careers antedate the so-called computer revolution.
The use of computers makes it possible for investigators to devote more time to the improvement of the quality of raw data and the interpretation of the results. The current prevalence of microcomputers and the abundance of available statistical software programs have further revolutionized statistical computing. Computers currently on the market are equipped with random number generating capabilities. As an alternative to using printed tables of random numbers, investigators may use computers to generate the random numbers they need.
Actually, the "random" numbers generated by most computers are in reality pseudorandom numbers, because they are the result of a deterministic formula; nevertheless, they serve satisfactorily for many practical purposes.
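As a quick illustration of this determinism, the minimal sketch below (using Python's built-in random module) shows that reseeding a generator reproduces exactly the same "random" sequence:

# A minimal illustration of pseudorandomness: seeding Python's random
# module with the same value reproduces the same sequence, because the
# numbers come from a deterministic formula.
import random

random.seed(42)
first_run = [random.randint(1, 100) for _ in range(5)]

random.seed(42)  # reseed with the same value
second_run = [random.randint(1, 100) for _ in range(5)]

print(first_run)
print(second_run)
print(first_run == second_run)  # True: the generator is deterministic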
The usefulness of the computer in the health sciences is not limited to statistical analysis. Computers play a pivotal role in biostatistical analysis, revolutionizing the way researchers process, analyze, and interpret biological data. Here's how:
1. Data Processing: Computers efficiently handle large volumes of biological data,
such as DNA sequences, gene expressions, protein structures, and patient records.
They organize and preprocess this data for analysis.
2. Statistical Analysis: Software and algorithms perform complex statistical
analyses on biological data. This includes hypothesis testing, regression analysis,
survival analysis, and more. These analyses help researchers identify patterns,
correlations, and associations within biological datasets.
3. Machine Learning and AI: Computers utilize machine learning and artificial
intelligence techniques to identify subtle patterns within biological data that
might be challenging for humans to detect. These methods contribute to predictive
modeling, classification of diseases, drug discovery, and personalized medicine.
4. Visualization: Computers generate visual representations, such as graphs, charts,
and 3D models, to help researchers interpret and communicate their findings
effectively.
5. Database Management: Databases store vast amounts of biological data, and
computers efficiently manage, update, and retrieve this information for
researchers, facilitating cross-study comparisons and meta-analyses.
6. High-Performance Computing: Complex computational tasks, like molecular
modeling, simulation of biological systems, or analyzing large-scale genomic data,
require high-performance computing. Supercomputers and clusters of powerful
machines enable these calculations within reasonable time frames.
7. Reproducibility and Collaboration: Computers facilitate reproducibility in
research by allowing scientists to share code, algorithms, and methodologies.
Collaboration across geographical boundaries becomes easier through shared
platforms and cloud-based tools.

Introduction to probability
• A probability provides a quantitative description of the chances or likelihoods associated with various outcomes.
• It provides a bridge between descriptive and inferential statistics.
The concept of objective probability may be categorized further under the headings of (1) classical, or a priori, probability, and (2) the relative frequency, or a posteriori, concept of probability.
Classical Probability:
If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a trait E, the probability of the occurrence of E is equal to m/N. If we read P(E) as "the probability of E," we may express this definition as
P(E) = m/N
Relative Frequency Probability: The relative frequency approach to probability depends on the repeatability of some process and the ability to count the number of repetitions, as well as the number of times that some event of interest occurs.
Definition: If some process is repeated a large number of times, n, and if some resulting event with the characteristic E occurs m times, the relative frequency of occurrence of E, m/n, will be approximately equal to the probability of E: P(E) = m/n
Subjective Probability
Under this concept of probability, one may evaluate the probability of an event that can only happen once, for example, the probability that a cure for cancer will be discovered within the next 10 years.
Bayesian Methods
Probabilities based on classical or relative frequency concepts are designed to allow for decisions to be made solely on the basis of collected data, whereas Bayesian methods make use of what are known as prior probabilities and posterior probabilities.
Definition :
The prior probability of an event is a probability based on prior knowledge, prior
experience, or results derived from prior data collection activity. The posterior
probability of an event is a probability obtained by using new information to update or
revise a prior probability.
ELEMENTARY PROPERTIES OF PROBABILITY
The three properties are as follows.
1. Given some process (or experiment) with n mutually exclusive outcomes (called events), E1, E2, . . . , En, the probability of any event Ei is assigned a nonnegative number. That is, P(Ei) ≥ 0. A key concept in the statement of this property is the concept of mutually exclusive outcomes: two events are said to be mutually exclusive if they cannot occur simultaneously.
2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1: P(E1) + P(E2) + . . . + P(En) = 1. This is the property of exhaustiveness and refers to the fact that the observer of a probabilistic process must allow for all possible events, and when all are taken together, their total probability is 1.
3. Consider any two mutually exclusive events, Ei and Ej. The probability of the occurrence of either Ei or Ej is equal to the sum of their individual probabilities: P(Ei ∪ Ej) = P(Ei) + P(Ej).
CALCULATING THE PROBABILITY OF AN EVENT
When probabilities are calculated with a subset of the total group as the denominator,
the result is a conditional probability
Joint Probability
Sometimes we want to find the probability that a subject picked at random from a
group of subjects possesses two characteristics at the same time. Such a probability is
referred to as a joint probability.
Problem:
The primary aim of a study by Carter et al. (A-1) was to investigate the effect of the age
at onset of bipolar disorder on the course of the illness. One of the variables investigated
was family history of mood disorders. Table shows the frequency of a family history of
mood disorders in the two groups of interest (Early age at onset defined to be 18 years or
younger and Later age at onset defined to be later than 18 years). Suppose we pick a
person at random from this sample. What is the probability that this person will be 18
years old or younger?

Solution:
For purposes of illustrating the calculation of probabilities we consider this group of 318
subjects to be the largest group for which we have an interest. In other words, for this
example, we consider the 318 subjects as a population. We assume that Early and Later
are mutually exclusive categories and that the likelihood of selecting any one person is
equal to the likelihood of selecting any other person. We define the desired probability
as the number of subjects with the characteristic of interest (Early) divided by the total
number of subjects. We may write the result in probability notation as follows:

P(E) = number of Early subjects / total number of subjects = 141/318 = 0.4434

Problem:
What is the probability that a person picked at random from the 318 subjects will be Early
(E) and will be a person who has no family history of mood disorders (A)?
The probability we are seeking may be written in symbolic notation as P(E ∩ A), in which the symbol ∩ is read either as "intersection" or "and." The statement E ∩ A indicates the joint occurrence of conditions E and A. The number of subjects satisfying both of the desired conditions is found in Table 3.4.1 at the intersection of the column labeled E and the row labeled A and is seen to be 28. Since the selection will be made from the total set of subjects, the denominator is 318. Thus, we may write the joint probability as
P(E ∩ A) = 28/318 = 0.0881
The Multiplication Rule
A probability may be computed from other probabilities. For example, a joint probability may be computed as the product of an appropriate marginal probability and an appropriate conditional probability. This relationship is known as the multiplication rule of probability:
P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A), provided P(A) and P(B) are not zero.
The conditional probability of A given B is equal to the probability of A ∩ B divided by the probability of B, provided the probability of B is not zero.

Problem:
We wish to compute the joint probability of Early age at onset (E) and a negative family
history of mood disorders (A) from a knowledge of an appropriate marginal probability
and an appropriate conditional probability.
Solution:
The probability we seek is P(E ∩ A). We have the marginal probability
P(E) = 141/318 = 0.4434,
and the conditional probability P(A|E) = 28/141 = 0.1986, so that
P(E ∩ A) = P(E) · P(A|E) = (0.4434)(0.1986) = 0.0881
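The small Python sketch below reproduces this multiplication-rule calculation; it uses only the counts quoted in the text (318 subjects in total, 141 Early, 28 Early with no family history):

# Joint probability via the multiplication rule: P(E and A) = P(E) * P(A|E)
n_total = 318          # all subjects
n_early = 141          # Early age at onset (E)
n_early_no_hist = 28   # Early AND no family history (E and A)

p_E = n_early / n_total                    # marginal probability of E
p_A_given_E = n_early_no_hist / n_early    # conditional probability of A given E
p_joint = p_E * p_A_given_E                # multiplication rule

print(round(p_E, 4), round(p_A_given_E, 4), round(p_joint, 4))
# 0.4434 0.1986 0.0881 -- matches 28/318 computed directly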

The Addition Rule


Given two events A and B, the probability that event A, or event B, or both occur is equal to the probability that event A occurs, plus the probability that event B occurs, minus the probability that the events occur simultaneously. The addition rule may be written

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

When events A and B cannot occur simultaneously, P(A ∪ B) is sometimes called "exclusive or," and P(A ∩ B) = 0. When events A and B can occur simultaneously, P(A ∪ B) is sometimes called "inclusive or," and we use the addition rule to calculate P(A ∪ B).

Independent Events
Suppose the probability of event A is the same regardless of whether or not B occurs; that is, P(A|B) = P(A). In such cases we say that A and B are independent events. The multiplication rule for two independent events, then, may be written as
P(A ∩ B) = P(A) P(B)
That is, if two events are independent, the probability of their joint occurrence is equal to the product of the probabilities of their individual occurrences. When two events with nonzero probabilities are independent, each of the following statements is true:
P(A|B) = P(A), P(B|A) = P(B), P(A ∩ B) = P(A) P(B)
Marginal Probability
Given some variable that can be broken down into m categories designated by A1, A2, . . . , Ai, . . . , Am and another jointly occurring variable that is broken down into n categories designated by B1, B2, . . . , Bj, . . . , Bn, the marginal probability of Ai, P(Ai), is equal to the sum of the joint probabilities of Ai with all the categories of B. That is,
P(Ai) = Σj P(Ai ∩ Bj), for all values of j

Bayes Theorem

Conditional Probability
The conditional probability of A given B is equal to the probability of A ∩ B divided by the probability of B, provided the probability of B is not zero:
P(A|B) = P(A ∩ B) / P(B)

Problem:
Suppose we pick a subject at random from the 318 subjects and find that he is 18 years or
younger (E). What is the probability that this subject will be one who has no family
history of mood disorders (A)?
Solution: The total number of subjects is no longer of interest, since, with the selection of an Early subject, the Later subjects are eliminated. We may define the desired probability, then, as follows: What is the probability that a subject has no family history of mood disorders (A), given that the selected subject is Early (E)? This is a conditional probability and is written as P(A|E), in which the vertical line is read "given." The 141 Early subjects become the denominator of this conditional probability, and 28, the number of Early subjects with no family history of mood disorders, becomes the numerator. Our desired probability, then, is

P(A|E) = 28/141 = 0.1986


Problems
In an article appearing in the Journal of the American Dietetic Association, Holben et al. (A-1) looked at food security status in families in the Appalachian region of southern Ohio. The purpose of the study was to examine hunger rates of families with children in a local Head Start program in Athens, Ohio. The survey instrument included the 18-question U.S. Household Food Security Survey Module for measuring hunger and food security. In addition, participants were asked how many food assistance programs they had used in the last 12 months. The table shows the number of food assistance programs used by subjects in this sample. We wish to construct the probability distribution of the discrete variable X, where X = number of food assistance programs used by the study subjects. A sketch of this construction follows.
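The Python sketch below shows the construction. Because the original table is not reproduced in these notes, the frequency counts used here are hypothetical placeholders; substitute the actual frequencies from the Holben et al. table.

# Constructing the probability distribution of a discrete variable X
# from frequency counts. The counts below are HYPOTHETICAL.
counts = {0: 40, 1: 75, 2: 50, 3: 20, 4: 15}  # x -> number of subjects
n = sum(counts.values())

# Relative frequency: P(X = x) = frequency of x / total subjects
distribution = {x: freq / n for x, freq in counts.items()}

for x, p in distribution.items():
    print("P(X = %d) = %.4f" % (x, p))
print("Total probability:", sum(distribution.values()))  # should equal 1.0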
Likelihood & odds
The likelihood function (likelihood) represents the probability of random variable
realizations conditional on particular values of the statistical parameters. The likelihood
is the chance, the possibility of doing or achieving something, and the condition that can
ensure success. The likelihood of a hypothesis (H) given some data (D) is the probability
of obtaining D given that H is true multiplied by an arbitrary positive constant K:
L(H) = K × P(D|H)
In most cases, a hypothesis represents a value of a parameter in a statistical model, such
as the mean of a normal distribution. Because likelihood is not actually a probability, it
does not obey various rules of probability; for example, likelihoods need not sum to 1.
In the case of a conditional probability, P(D|H), the hypothesis is fixed and the data are free to vary. Likelihood, however, is the opposite: the likelihood of a hypothesis, L(H), is conditioned on the data, as if they are fixed while the hypothesis can vary. Suppose a coin is flipped n times, and we observe x heads and n − x tails. The probability of getting x heads in n flips is defined by the binomial distribution as follows:

P(x | n, p) = C(n, x) p^x (1 − p)^(n − x)

where p is the probability of heads and the binomial coefficient,

C(n, x) = n! / (x! (n − x)!)

counts the number of ways to get x heads in n flips. For example, if x = 2 and n = 3, the binomial coefficient is calculated as 3!/(2! × 1!), which is equal to 3; there are three distinct ways to get two heads in three flips (i.e., head-head-tail, head-tail-head, tail-head-head). Thus, the probability of getting two heads in three flips if p is .50 would be .375 (3 × .50² × (1 − .50)¹), or 3 out of 8.
If the coin is fair, so that p = .50, and we flip it 10 times, the probability of six heads and four tails is

P(6 | 10, .50) = C(10, 6) (.50)^6 (.50)^4 ≈ .205

If the coin is a trick coin, so that p = .75, the probability of six heads in 10 tosses is

P(6 | 10, .75) = C(10, 6) (.75)^6 (.25)^4 ≈ .146
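These binomial probabilities are easy to check in Python; the sketch below uses scipy's binom.pmf, which evaluates C(n, x) p^x (1 − p)^(n − x) directly:

# Checking the two binomial probabilities above with scipy
from scipy.stats import binom

L_fair = binom.pmf(6, 10, 0.50)   # likelihood of p = .50 given 6 heads in 10 flips
L_trick = binom.pmf(6, 10, 0.75)  # likelihood of p = .75 given the same data

print(round(L_fair, 3))            # ~0.205
print(round(L_trick, 3))           # ~0.146
print(round(L_trick / L_fair, 2))  # likelihood ratio of the two hypotheses, ~0.71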

Likelihoods may seem overly restrictive because we have compared only two simple statistical hypotheses in a single likelihood ratio. The likelihood ratio of any two hypotheses is simply the ratio of their heights on the likelihood curve; for the two hypotheses above, the ratio is .146/.205 ≈ 0.71.
Likelihood functions can be written for many common models, including the normal distribution, chi-square distribution, binomial distribution, Poisson distribution, and uniform distribution. Likelihoods are also a key component of Bayesian inference. The Bayesian approach to statistics is fundamentally about making use of all available information when drawing inferences in the face of uncertainty. Previous information is quantified using what is known as a prior distribution. Mathematically, a well-known conditional-probability theorem states that the procedure for obtaining the posterior distribution of θ is as follows:

P(θ | D) = K × P(D | θ) × P(θ)

In this context, K is merely a rescaling constant and is equal to 1/P(D). We often write this theorem more simply as

P(θ | D) ∝ P(D | θ) × P(θ)

where ∝ means "is proportional to."


Conjugate distributions are convenient in that they reduce Bayesian updating to some simple algebra. We begin with the formula for the binomial likelihood function,

L(p) ∝ p^x (1 − p)^(n − x)

and then multiply it by the formula for the beta prior with a and b shape parameters,

P(p) ∝ p^(a − 1) (1 − p)^(b − 1)

to obtain the following formula for the posterior distribution:

P(p | D) ∝ p^x (1 − p)^(n − x) × p^(a − 1) (1 − p)^(b − 1)

which suggests that we can interpret the information contained in the prior as adding a certain amount of previous data (i.e., a − 1 past successes and b − 1 past failures) to the data from our current experiment. Because we are multiplying together terms with the same base, the exponents can be added together in a final simplification step:

P(p | D) ∝ p^(x + a − 1) (1 − p)^(n − x + b − 1)
This final formula looks like our original beta distribution but with new shape parameters
equal to x + a and n – x + b. In other words, we started with the prior distribution beta
(a,b) and added the successes from the data, x, to a and the failures, n – x, to b, and our
posterior distribution is a beta(x + a,n – x + b) distribution.
Consider the previous example of observing 60 heads in 100 flips of a coin. Imagine that going into this experiment, we had some reason to believe the coin's bias was within .20 of being fair in either direction; that is, we believed that p was likely within the range of .30 to .70. We could choose to represent this information using a beta(25,25) prior distribution; combining it with the likelihood function for the 100 flips then yields a beta(60 + 25, 40 + 25) = beta(85, 65) posterior.
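The sketch below carries out this conjugate update numerically with scipy's beta distribution, using the beta(25,25) prior and the 60-heads-in-100-flips data from the example:

# Conjugate beta-binomial updating: posterior = beta(x + a, n - x + b)
from scipy.stats import beta

a, b = 25, 25    # prior shape parameters
x, n = 60, 100   # observed heads and total flips

post_a = x + a       # 85
post_b = n - x + b   # 65

posterior = beta(post_a, post_b)
print("Posterior: beta(%d, %d)" % (post_a, post_b))
print("Posterior mean of p: %.4f" % posterior.mean())  # (x + a)/(n + a + b) = 0.5667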
The statistical distribution using an appropriate software tool – Python
The data is described in such a way that it can express some meaningful information
that can also be used to find some future trends. Describing and summarizing a single
variable is called univariate analysis. Describing a statistical relationship between two
variables is called bivariate analysis. Describing the statistical relationship between
multiple variables is called multivariate analysis.
There are two types of Descriptive Statistics:
• The measure of central tendency
• Measure of variability

Measure of Central Tendency


The measure of central tendency is a single value that attempts to describe the whole set of data. The main measures of central tendency are:

• Mean
• Median
• Median Low
• Median High
• Mode
Mean
It is the sum of observations divided by the total number of observations, i.e., the average, which is the sum divided by the count. The mean() function returns the mean or average of the data passed in its arguments. If the passed argument is empty, StatisticsError is raised.
Example: Python code to calculate mean

# Python code to demonstrate the working of mean()

# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 3, 3, 2, 2, 2, 1]

# using mean() to calculate average of list elements
print("The average of list values is : ", end="")
print(statistics.mean(li))

Output
The average of list values is : 2

The median_low() function returns the median of data in case of odd number of
elements, but in case of even number of elements, returns the lower of two middle
elements. If the passed argument is empty, StatisticsError is raised
# Python code to demonstrate the working of median_low()

# importing the statistics module
import statistics

# simple list of a set of integers
set1 = [1, 3, 3, 4, 5, 7]

# Print median of the data-set
# Median value may or may not lie within the data-set
print("Median of the set is %s" % (statistics.median(set1)))

# Print low median of the data-set
print("Low Median of the set is %s" % (statistics.median_low(set1)))

Output:
Median of the set is 3.5
Low Median of the set is 3
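For completeness, a companion sketch for the remaining measures listed earlier is given below; median_high() returns the higher of the two middle values for an even-sized dataset, and mode() returns the most common value:

# Companion examples for median_high() and mode()
import statistics

set1 = [1, 3, 3, 4, 5, 7]
print("High Median of the set is %s" % (statistics.median_high(set1)))  # 4

li = [1, 2, 3, 3, 2, 2, 2, 1]
print("Mode of the list is %s" % (statistics.mode(li)))  # 2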

In Python, you can use various libraries such as NumPy, SciPy, and Matplotlib to analyze
data and determine the statistical distribution. Here's an example of how you might find
the distribution of a dataset using these libraries:
Firstly, let's generate some sample data. For demonstration purposes, we'll create a
dataset following a normal distribution.
This code snippet demonstrates:
Generating a dataset of 1000 data points following a normal distribution. Plotting a
histogram to visualize the distribution of the generated data.
Fitting a normal distribution curve to the data and plotting it over the histogram.
The stats.norm.fit() function in this example fits a normal distribution to the data using
maximum likelihood estimation, estimating the mean and standard deviation of the
distribution. You can replace 'norm' with other distribution names like 'gamma', 'expon',
etc., to fit different distributions to your data.
This is a basic example, and in practice, you might need to preprocess and analyze your
data differently based on its characteristics and the specific analysis you're conducting.
But this should give you a starting point for determining the statistical distribution of
your data using Python.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generating a dataset with a normal distribution
np.random.seed(42)  # Setting seed for reproducibility
data = np.random.normal(loc=0, scale=1, size=1000)  # Mean=0, Standard Deviation=1, 1000 data points

# Plotting a histogram to visualize the distribution
plt.hist(data, bins=30, density=True, alpha=0.5, color='blue')
plt.title('Histogram of Sample Data')
plt.xlabel('Values')
plt.ylabel('Frequency')

# Fitting a distribution to the data
# You can try different distributions like 'norm', 'gamma', 'expon', etc.
param = stats.norm.fit(data)  # Fitting a normal distribution to the data
x = np.linspace(min(data), max(data), 100)
pdf_fitted = stats.norm.pdf(x, *param)
plt.plot(x, pdf_fitted, 'r-', linewidth=2)
plt.show()

UNIT II STATISTICAL PARAMETERS

Statistical parameters p-values


The P-value is known as the probability value. It is defined as the probability of getting a result that is either the same or more extreme than the actual observations. The P-value is the level of marginal significance within hypothesis testing that represents the probability of occurrence of the given event. The P-value is used as an alternative to the rejection point to provide the smallest level of significance at which the null hypothesis would be rejected. If the P-value is small, then there is stronger evidence in favour of the alternative hypothesis.

Definition:
A p value is the probability that the computed value of a test statistic is at least as extreme as a specified value of the test statistic when the null hypothesis is true. Thus, the p value is the smallest value of α for which we can reject a null hypothesis.
Generally, the level of statistical significance is expressed as a p-value in the range between 0 and 1. The smaller the p-value, the stronger the evidence and hence the more statistically significant the result. Rejection of the null hypothesis therefore becomes more likely as the p-value becomes smaller.
Problem: A statistician wants to test the hypothesis H0: μ = 120 using the alternative hypothesis Hα: μ > 120 and assuming that α = 0.05. For that, he took the sample values as n = 40, σ = 32.17 and x̄ = 105.37. Determine the conclusion for this hypothesis.
Solution:
We know that the test statistic is

z = (x̄ − μ) / (σ/√n)

Substituting the given values, σ/√n = 32.17/√40 = 5.0865, so

z = (105.37 − 120) / 5.0865 = −2.8762

Since the alternative hypothesis is one-sided (μ > 120), the p-value is P(z > −2.8762). From the Z-score table, P(z < −2.8762) = P(z > 2.8762) = 0.003. Therefore,
p-value = P(z > −2.8762) = 1 − 0.003 = 0.997
Since p = 0.997 > 0.05, the null hypothesis is not rejected. Hence, the conclusion is "fail to reject H0."
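The worked example above can be verified with a few lines of Python; this sketch recomputes the right-tailed p-value with scipy's normal distribution:

# Right-tailed z-test of H0: mu = 120 vs Ha: mu > 120
import math
from scipy.stats import norm

x_bar, mu0, sigma, n = 105.37, 120, 32.17, 40

z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = 1 - norm.cdf(z)  # P(Z > z) for the one-sided alternative

print(round(z, 4))        # -2.8762
print(round(p_value, 3))  # ~0.998 (the rounded table values above give 0.997)
# p > 0.05, so we fail to reject H0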

There are two types of p-value you can use:

• One-sided p-value: You can use this method of testing if a large or unexpected
change in the data makes only a small or no difference to your data set. Typically,
this is unusual and you can use a two-sided p-value test instead.
• Two-sided p-value: You can use this method of testing if a large change in the
data would affect the outcome of the research and if the alternative hypothesis is
fairly general instead of specific. Most professionals use this method to ensure they
account for large changes in data.

Chi-square Test
The chi-square distribution is the most frequently employed statistical technique for the analysis of count or frequency data. The chi-square test is a statistical test used to compare observed and expected results: the chi-square statistic compares the size of any discrepancies between the expected results and the actual results.

The test statistic is

X² = Σ [(Oi − Ei)² / Ei]

which is distributed approximately as χ² with k − r degrees of freedom, where Oi is the observed frequency for the ith category of the variable of interest and Ei is the expected frequency for the ith category.

Applications of Chi-square test:


1. Goodness-of-fit
2. The 2 x 2 chi-square test (contingency table, fourfold table)
3. The a x b chi-square test (r x c chi-square test)
Steps of chi-square hypothesis testing
1. Data: counts or proportions.
2. Assumption: a random sample selected from a population.
3. H0: no significant difference in proportion / no significant association.
HA: significant difference in proportion / significant association.
4. Level of significance:
• d.f. for the 1st application = k − 1 (k is the number of groups)
• d.f. for the 2nd and 3rd applications = (columns − 1)(rows − 1)
• In the 2nd application (2 x 2 contingency table), d.f. = 1 and the tabulated chi-square is always 3.841
• The graph is one-sided (only +ve)
5. Apply the appropriate test of significance.
6. Statistical decision.
7. Conclusion:
If calculated chi < tabulated chi, P > 0.05: accept H0 (it may be true).
If calculated chi > tabulated chi, P < 0.05: reject H0 and accept HA.

The Decision Rule


The quantity Σ [(Oi − Ei)² / Ei] will be small if the observed and expected frequencies are close together and will be large if the differences are large. The computed value of X² is compared with the tabulated value of X² with k − r degrees of freedom. The decision rule, then, is: reject H0 if X² is greater than or equal to the tabulated X² for the chosen value of α.
Types of chi-square tests
• Tests of goodness-of-fit
• Test of independence
• Test of homogeneity

Tests of goodness-of-fit
• The chi-square test for goodness-of-fit uses frequency data from a sample to test
hypotheses about the shape or proportions of a population.
• The data, called observed frequencies, simply count how many individuals from
the sample are in each category.

Problem:
From a group of persons, the eye colours of a random sample of 40 are: blue 12, brown 21, green 3, others 4. In the population, eye colour is distributed as brown 80%, blue 10%, green 2%, others 8%. Is there any difference between the proportions in the sample and those in the population? Use α = 0.05.

Expected blue = (10/100) × 40 = 4
Expected brown = (80/100) × 40 = 32
Expected green = (2/100) × 40 = 0.8
Expected others = (8/100) × 40 = 3.2

Steps:
1. Data
The eye colours of 40 persons follow the distribution: brown = 21 persons, blue = 12 persons, green = 3, others = 4.
2. Assumption
The sample is randomly selected from the population.
3. Hypotheses
• Null hypothesis: there is no significant difference between the proportions of eye colours in the sample and those in the population.
• Alternative hypothesis: there is a significant difference between the proportions of eye colours in the sample and those in the population.

4. Level of significance (α = 0.05)
• 5% chance factor effect area, 95% influencing factor effect area
• d.f. (degrees of freedom) = k − 1 = 4 − 1 = 3 (k = number of subgroups)
• Tabulated chi-square for α = 0.05 with 3 d.f. = 7.81
5. Apply the proper test of significance

X² = (12 − 4)²/4 + (21 − 32)²/32 + (3 − 0.8)²/0.8 + (4 − 3.2)²/3.2
   = 64/4 + 121/32 + 4.84/0.8 + 0.64/3.2
   = 16 + 3.78 + 6.05 + 0.2
Calculated chi-square = 26.03
6. Statistical decision:
Calculated chi-square > tabulated chi-square, P < 0.05
7. Conclusion
We reject H0 and accept HA: there is a significant difference between the proportions of eye colours in the sample and those in the population.
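The same goodness-of-fit test can be run with scipy; the sketch below uses chisquare(), which takes the observed and expected counts and returns the chi-square statistic together with its p-value:

# Goodness-of-fit test for the eye-colour example
from scipy.stats import chisquare

observed = [12, 21, 3, 4]      # blue, brown, green, others (n = 40)
expected = [4, 32, 0.8, 3.2]   # 10%, 80%, 2%, 8% of 40

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 2))   # ~26.03, well above the tabulated 7.81 (df = 3)
print(p_value)          # < 0.05, so reject H0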


The Chi-Square Test for Independence


The second chi-square test, the chi-square test for independence, can be used and
interpreted in two different ways:
• Testing hypotheses about the relationship between two variables in a
population, or (2×2)
• Testing hypotheses about differences between proportions for two or more
populations.(a×b)
• The data, called observed frequencies, simply show how many individuals from
the sample are in each cell of the matrix.
• The null hypothesis for this test states that there is no relationship between the
two variables; that is, the two variables are independent.

2 x 2 chi-square (contingency table): expected value


E = (Tr × Tc) / GT
where Tr is the row total, Tc is the column total, and GT is the grand total.
d.f. = (r − 1)(c − 1) = 1; tabulated chi-square = 3.841

Problem:
A total of 1500 workers on two operators (A and B) were classified as deaf and non-deaf according to the following table. Is there an association between deafness and type of operator? Let α = 0.05.

Calculate the expected values:
E = (Tr × Tc) / GT
Steps:
1. Data
The data represent 1500 workers: 1000 on operator A, of whom 100 were deaf, and 500 on operator B, of whom 60 were deaf.
2. Assumption
• The sample is randomly selected from the population.
3. Hypotheses
• H0: there is no significant association between type of operator and deafness.
• HA: there is a significant association between type of operator and deafness.
4. Level of significance (α = 0.05)
• 5% chance factor effect area
• 95% influencing factor effect area
• d.f. (degrees of freedom) = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1
• Tabulated chi-square for 1 d.f. at α = 0.05 = 3.841
5. Apply the proper test of significance

X² = (100 − 106.7)²/106.7 + (900 − 893.3)²/893.3 + (60 − 53.3)²/53.3 + (440 − 446.7)²/446.7
   = 0.42 + 0.05 + 0.84 + 0.10
   = 1.41

6. Statistical decision
Calculated chi-square < tabulated chi-square, P > 0.05
7. Conclusion
We accept H0 (H0 may be true): there is no significant association between type of operator and deafness.

When a 2 x 2 chi-square test has a zero cell (one of the four cells is zero), we cannot apply the chi-square test, because we have what is called complete dependence. For an a x b chi-square test with a zero cell, we cannot apply the test unless we do proper re-categorization to get rid of the zero cell.
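For the 2 x 2 example above, scipy's chi2_contingency() computes the expected counts and the chi-square statistic directly from the observed table. Yates' continuity correction is turned off here so the result matches the hand calculation:

# Chi-square test of independence for the deafness example
import numpy as np
from scipy.stats import chi2_contingency

# rows: operator A, operator B; columns: deaf, not deaf
table = np.array([[100, 900],
                  [60, 440]])

stat, p_value, df, expected = chi2_contingency(table, correction=False)
print(round(stat, 2))     # ~1.4, below the tabulated 3.841 (df = 1)
print(round(p_value, 3))  # > 0.05, so do not reject H0
print(expected)           # approximately [[106.7, 893.3], [53.3, 446.7]]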
Properties of the chi-square test:
1. The mean of the X² distribution is equal to the number of degrees of freedom.
2. The variance of the X² distribution is twice the degrees of freedom.
3. If X² is a chi-square variate with γ degrees of freedom, then X²/2 is a gamma variate.
4. The standard X² variate tends to the standard normal variate as n → ∞.
Applications:
1. To test the hypothetical value of the population variance
2. To test the goodness of fit
3. To test the independence of attributes
4. To test the homogeneity of independent estimates
5. To combine various probabilities to give a single test of significance

Hypothesis Testing
A hypothesis may be defined simply as a statement about one or more populations.
Statistical hypotheses are hypotheses that are stated in such a way that they may be
evaluated by appropriate statistical techniques.
Hypothesis Testing Steps
1. Data. The nature of the data that form the basis of the testing procedures must be
understood, since this determines the particular test to be employed
2. Assumptions : A general procedure is modified depending on the assumptions
3. Hypotheses: There are two statistical hypotheses involved in hypothesis testing, and these should be stated explicitly. The null hypothesis is the hypothesis to be tested, designated by the symbol H0. The alternative hypothesis, designated by the symbol HA, is a statement of what we will believe is true if our sample data cause us to reject the null hypothesis. Suppose, for example, we want to know whether we can conclude that a certain population mean is not 50. The null hypothesis is
• H0: μ = 50 and the alternative is
• HA: μ ≠ 50
• Suppose we want to know if we can conclude that the population mean is greater than 50. Our hypotheses are
• H0: μ ≤ 50; HA: μ > 50
If we want to know if we can conclude that the population mean is less than 50, the hypotheses are
• H0: μ ≥ 50; HA: μ < 50
4. Test statistic. The test statistic is some statistic that may be computed from the data of the sample. As we will see, the test statistic serves as a decision maker, since the decision to reject or not to reject the null hypothesis depends on the magnitude of the test statistic.
An example of a test statistic is the quantity

z = (x̄ − μ0) / (σ/√n)

where μ0 is a hypothesized value of a population mean. This test statistic is the statistic z = (x̄ − μ)/(σ/√n) with μ replaced by the hypothesized value μ0.
5. Distribution of test statistic. The distribution of the test statistic must be identified so that rejection and nonrejection regions can be specified; for example, the statistic above follows the standard normal distribution when the null hypothesis is true.

6. Decision rule. The decision rule tells us to reject the null hypothesis if the value of the
test statistic that we compute from our sample is one of the values in the rejection
region and to not reject the null hypothesis if the computed value of the test statistic is
one of the values in the nonrejection region
7. Calculation of test statistic. From the data contained in the sample we compute a value
of the test statistic and compare it with the rejection and nonrejection regions that have
already been specified.
8. Statistical decision. The statistical decision consists of rejecting or of not rejecting the
null hypothesis
It is rejected if the computed value of the test statistic falls in the rejection region, and it
is not rejected if the computed value of the test statistic falls in the nonrejection region.
9. Conclusion.
• If H0 is rejected, we conclude that HA is true.
• If H0 is not rejected, we conclude that H0 may be true.
10. p values.
• The p value is a number that tells us how unusual our sample results are, given
that the null hypothesis is true.
• A p value indicating that the sample results are not likely to have occurred, if the
null hypothesis is true, provides justification for doubting the truth of the null
hypothesis.
Purpose of Hypothesis Testing
The purpose of hypothesis testing is to assist administrators and clinicians in making
decisions. The administrative or clinical decision usually depends on the statistical
decision. If the null hypothesis is rejected, the administrative or clinical decision usually
reflects this, in that the decision is compatible with the alternative hypothesis. The reverse
is usually true if the null hypothesis is not rejected. The administrative or clinical
decision, however, may take other forms, such as a decision to gather more data.

Hypothesis Testing:
A single population mean
The testing of a hypothesis about a population mean under three different conditions: (1)
when sampling is from a normally distributed population of values with known
variance; (2) when sampling is from a normally distributed population with unknown
variance, and (3) when sampling is from a population that is not normally distributed.
When sampling is from a normally distributed population and the population variance is known, the test statistic for testing H0: μ = μ0 is

z = (x̄ − μ0) / (σ/√n)

which, when H0 is true, is distributed as the standard normal.

Problems:
1. Does the evidence support the idea that the average lecture consists of 3000 words
if a random sample of the lectures of 16 professors had a mean of 3472 words,
given the population standard deviation is 500 words? Use α = 0.01. Assume that
lecture lengths are approximately normally distributed. Show all steps.

μ = 3000
σ = 500
x̄ = 3472
n = 16
α = 0.01

1) Ho: μ = 3000
2) Ha: μ ≠ 3000
3) α = 0.01
4) Reject Ho if z < −2.576 or z > 2.576
5) z = (3472 − 3000) / (500/√16) = 3.78
6) Reject Ho, because 3.78 > 2.576
7) At α = 0.01, the population mean is not equal to 3000 words.

2. Suppose that scores on the Scholastic Aptitude Test form a normal distribution with μ = 500 and σ = 100. A high school counselor has developed a special course designed to boost SAT scores. A random sample of 16 students is selected to take the course and then the SAT. The sample had an average score of x̄ = 544. Does the course boost SAT scores? Test at α = 0.01. Show all steps.

μ = 500
σ = 100
x̄ = 544
n = 16
α = 0.01
1) Ho: μ = 500
2) Ha: μ > 500
3) α = 0.01
4) Reject Ho if z > 2.326
5) z = (544 − 500) / (100/√16) = 1.76
6) Fail to reject (accept) Ho, because 1.76 < 2.326
7) At α = 0.01, we cannot conclude that the course boosts SAT scores; the population mean may be 500.
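Both problems above follow the same pattern, so a small helper function captures them; this is a sketch assuming a known population standard deviation and a normal population:

# One-sample z-tests for the two worked problems
import math
from scipy.stats import norm

def z_statistic(x_bar, mu0, sigma, n):
    # z = (x_bar - mu0) / (sigma / sqrt(n))
    return (x_bar - mu0) / (sigma / math.sqrt(n))

# Problem 1: two-sided test at alpha = 0.01 (critical values +/- 2.576)
z1 = z_statistic(3472, 3000, 500, 16)
print(round(z1, 2), abs(z1) > norm.ppf(0.995))  # 3.78 True -> reject Ho

# Problem 2: one-sided test at alpha = 0.01 (critical value 2.326)
z2 = z_statistic(544, 500, 100, 16)
print(round(z2, 2), z2 > norm.ppf(0.99))        # 1.76 False -> fail to reject Ho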
One-Sided Hypothesis Tests
A hypothesis test may be one-sided, in which case all of the rejection region is in one or the other tail of the distribution. Whether a one-sided or a two-sided test is used depends on the nature of the question being asked by the researcher.

Problem
Researchers are interested in the mean age of a certain population. Let us say that they
are asking the following question: Can we conclude that the mean age of this population
is different from 30 years? Suppose, instead of asking if they could conclude that μ ≠ 30,
the researchers had asked: Can we conclude that μ < 30? To this question we would reply
that they can so conclude if they can reject the null hypothesis that μ ≥ 30.
1. Data. See the previous example.
2. Assumptions. See the previous example.
3. Hypotheses.
H0: μ ≥ 30
HA: μ < 30
The inequality in the null hypothesis implies that the null hypothesis consists of an infinite number of hypotheses.
4. Test statistic:
z = (x̄ − μ0) / (σ/√n)
5. Distribution of test statistic. When H0 is true, the test statistic follows the standard normal distribution.
6. Decision rule. Let us again use α = 0.05. Since all of the rejection region is in the lower tail, reject H0 if the computed z is less than or equal to −1.645.
7. Calculation of test statistic: z = −2.12
8. Statistical decision. We are able to reject the null hypothesis since −2.12 < −1.645.
9. Conclusion. We conclude that the mean age of the population is less than 30 years.
THE DIFFERENCE BETWEEN TWO POPULATION MEANS
Hypothesis testing involving the difference between two population means is most
frequently employed to determine whether or not it is reasonable to conclude that the
two population means are unequal.

Sampling from Normally Distributed Populations: Population Variances Known


When each of two independent simple random samples has been drawn from a normally distributed population with a known variance, the test statistic for testing the null hypothesis of equal population means is

z = ((x̄1 − x̄2) − (μ1 − μ2)0) / √(σ1²/n1 + σ2²/n2)

Problem:
1. Researchers wish to know if the data they have collected provide sufficient evidence to indicate a difference in mean serum uric acid levels between normal individuals and individuals with Down's syndrome. The data consist of serum uric acid readings on 12 individuals with Down's syndrome and 15 normal individuals. The means are x̄1 = 4.5 mg/100 ml and x̄2 = 3.4 mg/100 ml.
We will say that the sample data do provide evidence that the population means are not equal if we can reject the null hypothesis that the population means are equal. Let us reach a conclusion by means of the ten-step hypothesis testing procedure.
1. Data. See problem statement.
2. Assumptions. The data constitute two independent simple random samples each drawn from a normally distributed population with a variance equal to 1 for the Down's syndrome population and 1.5 for the normal population.
3. Hypotheses:
H0: μ1 − μ2 = 0, HA: μ1 − μ2 ≠ 0
An alternative way of stating the hypotheses is: H0: μ1 = μ2, HA: μ1 ≠ μ2.
4. The test statistic, as given above, with the hypothesized difference (μ1 − μ2)0 equal to 0.
5. Distribution of test statistic. When the null hypothesis is true, the test statistic follows the standard normal distribution.
6. Decision rule. Let α = 0.05. The critical values of z are ±1.96. Reject H0 unless −1.96 < z(computed) < 1.96.
7. Calculation of test statistic:
z = (4.5 − 3.4) / √(1/12 + 1.5/15) = 1.1 / 0.4282 = 2.57
8. Statistical decision. Reject H0, since 2.57 > 1.96.
9. Conclusion. Conclude that, on the basis of these data, there is an indication that the two population means are not equal.
10. p value. For this test, p = 0.0102.
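A short sketch of this two-sample z-test, using the known variances stated in the assumptions:

# Two-sample z-test for the serum uric acid example
import math
from scipy.stats import norm

x1, n1, var1 = 4.5, 12, 1.0   # Down's syndrome group
x2, n2, var2 = 3.4, 15, 1.5   # normal group

z = (x1 - x2) / math.sqrt(var1 / n1 + var2 / n2)
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

print(round(z, 2))        # ~2.57 > 1.96 -> reject H0
print(round(p_value, 4))  # ~0.0102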

Hypothesis testing: a single population variance


The general principles presented earlier may be employed to test a hypothesis about a population variance. When the data available for analysis consist of a simple random sample drawn from a normally distributed population, the test statistic for testing hypotheses about a population variance is

X² = (n − 1)s² / σ0²
Problem:
The purpose of a study by Wilkins et al. (A-28) was to measure the effectiveness of
recombinant human growth hormone (rhGH) on children with total body surface area
burns > 40 percent. In this study, 16 subjects received daily injections at home of rhGH.
At baseline, the researchers wanted to know the current levels of insulin-like growth
factor (IGF-I) prior to administration of rhGH. The sample variance of IGF-I levels (in
ng/ml) was 670.81. We wish to know if we may conclude from these data that the
population variance is not 600.
1. Data. See statement in the example.
2. Assumptions. The study sample constitutes a simple random sample from a population of similar children. The IGF-I levels are normally distributed.
3. Hypotheses:
H0: σ² = 600, HA: σ² ≠ 600
4. Test statistic. The test statistic is X² = (n − 1)s²/σ0².
5. Distribution of test statistic. When the null hypothesis is true, the test statistic is distributed as χ² with n − 1 degrees of freedom.
6. Decision rule. Let α = 0.05. Critical values of χ² are 6.262 and 27.488. Reject H0 unless the computed value of the test statistic is between 6.262 and 27.488. The rejection and nonrejection regions are shown in the figure.
7. Calculation of test statistic:
X² = (16 − 1)(670.81) / 600 = 16.77
8. Statistical decision. Do not reject H0 since 6.262 < 16.77 < 27.488.
9. Conclusion. Based on these data we are unable to conclude that the population
variance is not 600.
10. p value. The determination of the p value for this test is complicated by the fact that we have a two-sided test and an asymmetric sampling distribution. When we have a two-sided test and a symmetric sampling distribution such as the standard normal or t, we may, as we have seen, double the one-sided p value. Problems arise when we attempt to do this with an asymmetric sampling distribution such as the chi-square distribution.
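Regardless of the p-value complication, the test itself is easy to reproduce; the sketch below computes X² = (n − 1)s²/σ0² and the two-sided chi-square critical values at α = 0.05:

# Chi-square test for a single population variance (IGF-I example)
from scipy.stats import chi2

n, s2, sigma0_sq = 16, 670.81, 600

x2 = (n - 1) * s2 / sigma0_sq
lower = chi2.ppf(0.025, df=n - 1)   # ~6.262
upper = chi2.ppf(0.975, df=n - 1)   # ~27.488

print(round(x2, 2))        # 16.77
print(lower < x2 < upper)  # True -> do not reject H0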
Hypothesis Testing Python
Imagine a woman in her seventies who has a noticeable tummy bump. Medical professionals could presume the bulge is a fibroid. In this instance, our first finding (or the null hypothesis) is that this woman has a fibroid, and our alternative finding is that she does not. We shall use the terms null hypothesis (beginning assumption) and alternate hypothesis (countering assumption) to conduct hypothesis testing. The next step is gathering the data samples we can use to validate the null hypothesis. Two erroneous outcomes are possible:

o Although the null hypothesis (H0) was correct, we rejected it.
o Although the null hypothesis (H0) was incorrect, we did not dismiss it.
P-value: The likelihood of observing the recorded or more extreme outcomes when the null hypothesis (H0) of a research question is true is known as the P value or computed probability; the meaning of "extreme" depends upon how the hypothesis is being tested. When your P value falls below the selected significance threshold, you reject the null hypothesis and accept that your sample contains solid evidence that the alternative hypothesis is true. This still does not suggest a "significant" or "important" change; you must determine that while evaluating the applicability of your conclusion in the actual world.

T-test: When comparing the mean values of two samples that specific characteristics may connect, a t-test is performed to see if there exists a substantial difference. It is typically employed when data sets, such as those obtained from tossing a coin 100 times and recording the results, would exhibit a normal distribution. The variances may be unknown. The t-test is a method for evaluating hypotheses that allows you to assess a population-applicable assumption.
Assumptions:
o Each sample's data is independent and identically distributed (iid).
o Each sample's data has a normal distribution.
o Each sample's data has the same variance.
T-tests are of two types: 1. one-sampled t-test and 2. two-sampled t-test.
One-sample t-test: The one-sample t-test ascertains whether the sample average differs statistically from a known or hypothesized population mean. The one-sample t-test is a parametric testing technique.
Example: You are determining if the average age of 10 people is 30 or otherwise. Check
the Python script below for the implementation.
Code
# Python program to implement a one-sample t-test on a sample of ages

# Importing the required libraries
from scipy.stats import ttest_1samp
import numpy as np

# Creating a sample of ages
ages = [45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
print(ages)

# Calculating the mean of the sample
mean = np.mean(ages)
print(mean)

# Performing the t-test against a hypothesized mean of 30
t_test, p_val = ttest_1samp(ages, 30)
print("P-value is: ", p_val)

# Taking the threshold value as 0.05 or 5%
if p_val < 0.05:
    print("We can reject the null hypothesis")
else:
    print("We can accept the null hypothesis")

Output
[45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
45.4
P-value is: 0.07179988272763554
We can accept the null hypothesis
Chi-Square test
This is a statistical method to determine if two categorical variables have a significant correlation between them. Both variables should be from the same population and they should be categorical, like Yes/No, Male/Female, Red/Green, etc. For example, we can build a data set with observations on people's ice-cream buying pattern and try to correlate the gender of a person with the flavour of the ice-cream they prefer. If a correlation is found, we can plan an appropriate stock of flavours by knowing the number of people of each gender visiting. We use functions from the scipy library to carry out the chi-square test; the snippet below plots the chi-square density for several degrees of freedom.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(1, 1)
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]

# Plot the chi-square probability density for each degree of freedom
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label='df = %d' % df)

plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Chi-Square Distribution')
plt.legend()
plt.show()

Calculating a one-proportion Z-test using the formula

z = (P − Po) / sqrt(Po(1 − Po)/n)

Where:
• P: Observed sample proportion
• Po: Hypothesized population proportion
• n: Sample size
In this example, we set P to 0.86, Po to 0.80, and n to 100, and use these values to calculate the one-proportion z-test in the Python programming language.

Code:

import math

P = 0.86   # observed sample proportion
Po = 0.80  # hypothesized population proportion
n = 100    # sample size

a = P - Po
b = Po * (1 - Po) / n
z = a / math.sqrt(b)
print(z)

Output:
1.4999999999999984
