Introduction To Statistical Analysis
Introduction:
Decision makers make better decisions when they use all available information.
Statistics provides decision makers with methods for obtaining and analyzing
information to help make those decisions. It is the science of collecting,
organizing, presenting, analyzing, and interpreting numerical data for the
purpose of assisting in making a more effective decision.
Biostatistics is the science of conducting studies on biological or health data.
The measurement scale of a variable is one of four common types: nominal,
ordinal, interval, and ratio.
2. The ordinal level of measurement classifies data into categories that can
be ranked; however, precise differences between the ranks do not exist. Examples:
attitude, grade (A, B+, B, C+, C, D+, D, F), etc.
Sources of Data:
Sampling Methods:
There are many ways to collect a sample. The most commonly used methods are:
A. Probability Sampling:
B. Nonprobability Sampling:
1. Judgement Sampling: In this case, the person taking the sample has direct or
indirect control over which items are selected for the sample.
3. Quota Sampling: In this method, the decision maker requires the sample to
contain a certain number of items with a given characteristic. Many political polls
are, in part, quota sampling.
A parameter is a characteristic or measure obtained by using all the data values
for a specific population.
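The arithmetic mean, or mean, is the average value of all the data.

The mean of a population:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \quad \text{; for ungrouped data}$$
$$\mu = \frac{\sum_{i=1}^{k} f_i X_i}{N} \quad \text{; for grouped data}$$

The mean of a sample:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{; for ungrouped data}$$
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{n} \quad \text{; for grouped data}$$

where
$\bar{X}$ = sample mean
$X_i$ = data
N = population size
n = sample size
$f_i$ = frequency in the class interval
k = number of classes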
The mode is the data value that occurs with the highest frequency (some data
sets have no mode; some have more than one mode).
The mean:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
$$\bar{X} = \frac{10 + 21 + 33 + 53 + 54}{5} = 34.2$$
For data group II: 10, 12, 34, 34, 34, 46, 55, 56, 60, 60
The median:
$$\text{Med.} = \frac{34 + 46}{2} = 40$$
For the data group III; 10, 12, 34, 34, 34, 46, 55, 56, 60, 60, 60, 65, 67
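The measures of central tendency above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the list `group_1` is an assumption taken from the five values that appear in the worked mean calculation (10, 21, 33, 53, 54), while `group_2` and `group_3` are copied from the text.

```python
from statistics import mean, median, multimode

# group_1 is assumed from the worked mean example (10, 21, 33, 53, 54).
group_1 = [10, 21, 33, 53, 54]
group_2 = [10, 12, 34, 34, 34, 46, 55, 56, 60, 60]
group_3 = [10, 12, 34, 34, 34, 46, 55, 56, 60, 60, 60, 65, 67]

mean_1 = mean(group_1)        # arithmetic mean of group I
median_2 = median(group_2)    # even count: average of the two middle values
median_3 = median(group_3)    # odd count: the single middle value
modes_2 = multimode(group_2)  # most frequent value(s) in group II
modes_3 = multimode(group_3)  # group III has two values tied for most frequent

# Note: multimode() returns every value when all frequencies are equal,
# which corresponds to "no mode" in the sense used in these notes.
print(mean_1, median_2, median_3, modes_2, modes_3)
```

With an odd number of observations, as in group III, the median is simply the middle value of the sorted data.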
Measurement of Variation
Range R
Quartile Deviation Q.D.
Mean Deviation M.D.
Standard Deviation
Range; R = Xmax - Xmin
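Quartile Deviation:
$$Q.D. = \frac{Q_3 - Q_1}{2}$$
Mean Deviation:
$$M.D. = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \quad \text{; for ungrouped data}$$
$$M.D. = \frac{\sum_{i=1}^{k} f_i\,|X_i - \bar{X}|}{n} \quad \text{; for grouped data}$$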
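Standard Deviation

For a population:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \quad \text{; ungrouped data}$$
$$\sigma = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{N}} \quad \text{; grouped data}$$

For a sample:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \quad \text{; ungrouped data}$$
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n - 1}} \quad \text{; grouped data}$$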
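For the population (ungrouped data):
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} = \sqrt{\frac{\sum_{i=1}^{5} (x_i - 34.2)^2}{5}} = 17.36$$
For the sample (ungrouped data):
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{5} (x_i - 34.2)^2}{4}} = 19.41$$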
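The two standard deviations above differ only in the divisor (N for a population, n - 1 for a sample). A minimal sketch with Python's `statistics` module illustrates this; the list `data` is assumed to be the five values of data group I (10, 21, 33, 53, 54), whose mean is 34.2.

```python
from statistics import pstdev, stdev

# Assumed data group I; its mean is 34.2 as in the worked example.
data = [10, 21, 33, 53, 54]

sigma = pstdev(data)  # population formula: divides by N
s = stdev(data)       # sample formula: divides by n - 1

print(round(sigma, 2), round(s, 2))
```

The sample value is always the larger of the two because it divides the same sum of squares by a smaller number.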
Example 68 65 12 22 63 43 32 43 42 25
49 27 27 74 38 49 30 51 42 28
36 36 27 23 28 42 31 19 32 28
50 46 79 31 38 30 27 28 21 43
22 25 16 49 23 45 24 12 24 12
69 25 27 47 44 51 23
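$\mu = \bar{X} = 36.6$

For the population (ungrouped data):
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} = \sqrt{\frac{\sum_{i=1}^{57} (x_i - 36.37)^2}{57}} = 15.45 \quad ; N = 57$$
For the sample (ungrouped data):
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{57} (x_i - 36.37)^2}{56}} = 15.58 \quad ; n = 57$$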
The coefficient of variation (C.V.) is the ratio of the standard deviation to the
mean, expressed as a percentage.
$$C.V. = \frac{S}{\bar{x}} \times 100\%$$
Example: Compare the variation in age and cholesterol between males and females.

                 Age                                Cholesterol
          Mean   S    C.V.                     Mean   S    C.V.
Male      46     13   (13/46) x 100 = 28.26%   100    25   (25/100) x 100 = 25.00%
Female    40     10   (10/40) x 100 = 25.00%   110    30   (30/110) x 100 = 27.27%
Age varies more among males than among females, but cholesterol varies more
among females than among males.
In the male group, age varies more than cholesterol.
In the female group, cholesterol varies more than age.
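The comparison above can be reproduced with a few lines of code. This is a small sketch; the helper name `coefficient_of_variation` and the dictionary layout are illustrative choices, and the means and standard deviations are the ones given in the example.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Return the coefficient of variation as a percentage."""
    return std_dev / mean_value * 100

# (mean, standard deviation) pairs taken from the example.
summary = {
    ("male", "age"): (46, 13),
    ("male", "cholesterol"): (100, 25),
    ("female", "age"): (40, 10),
    ("female", "cholesterol"): (110, 30),
}

cv = {key: coefficient_of_variation(s, m) for key, (m, s) in summary.items()}

for (group, variable), value in cv.items():
    print(f"{group:6s} {variable:12s} C.V. = {value:.2f}%")
```

Because the C.V. is unit-free, it allows age (years) and cholesterol (a concentration) to be compared directly.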
Probability
Multiplication Rule
In a sequence of n events in which the first one has $k_1$ possibilities, the
second has $k_2$, the third has $k_3$, and so forth, the total number of
possibilities of the sequence is $k_1 \cdot k_2 \cdot k_3 \cdots k_n$.
n! = 1 x 2 x 3 x . . . x n
0! = 1
The probability of any event E is the number of outcomes in E divided by the
total number of outcomes in the sample space:
$$P(E) = \frac{n(E)}{n(S)}$$
1. Addition Rule 1
When two events A and B are mutually exclusive, the probability that A or B
will occur is P(A or B) = P(A) + P(B).
2. Addition Rule 2
3. Multiplication Rule 1
4. Multiplication Rule 2
5. Conditional Probability
The conditional probability of an event B in relation to an event A is defined
as the probability that event B occurs given that event A has already occurred.
$$P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$$
The probability of the complement of A:
$$P(\bar{A}) = 1 - P(A)$$
Example 1 In a survey of 500 normal persons, classified by blood type.
P(A and B) = 0
Example 2 In the survey of school health for eye and dental health problem in
primary school among 200 students, they have 30 cases eye problem 50 cases
Example 3 The survey of smoking habit and lung cancer among 200 males.
________________________________________________________
Probability Distribution
Random variable: a rule that transforms the events in the sample space into
numbers.
Example: in a survey of families that have two children,
X = 0, 1, 2
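For a discrete random variable:
$$E(x) = \mu = \sum_{\text{all } x} x_i\, p(x_i)$$
$$\sigma^2 = V(x) = E\left[(x - E(x))^2\right] = E[x^2] - [E(x)]^2 = \sum_{\text{all } x} x_i^2\, p(x_i) - \mu^2$$
For a continuous random variable:
$$E(x) = \mu = \int_{\text{all } x} x\, p(x)\, dx$$
$$\sigma^2 = V(x) = \int_{\text{all } x} x^2\, p(x)\, dx - \mu^2$$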
Probability Distribution
1. The Binomial Distribution
A binomial experiment is a probability experiment that satisfies the
following four requirements:
1) Each trial can have only two outcomes, these outcomes can be considered
as either interesting or non-interesting.
$$P(x) = {}_{n}C_{x}\; p^{x} q^{n-x} \quad ; x = 0, 1, 2, \ldots, n$$
Example 2 In a clinical trial of a new treatment, the probability that each
treated patient has a good result is 0.4. If the trial includes 15 patients,
find the probability:
p = 0.4 ; q = 1 – p = 0.6
Binomial probability:
$$P(x) = {}_{n}C_{x}\; p^{x} q^{n-x} \quad ; x = 0, 1, 2, 3, \ldots, n$$
$$P(x) = {}_{15}C_{x}\; (0.4)^{x} (0.6)^{15-x} \quad ; x = 0, 1, 2, 3, \ldots, 15$$
the probability :
= 0.8719
$$P(x > 4) = P(x \ge 5) = 1 - P(x \le 4)$$
$$= 1 - \sum_{x=0}^{4} {}_{15}C_{x}\; (0.4)^{x} (0.6)^{15-x}$$
$$= 1 - 0.2173 = 0.7827$$
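The cumulative sum in this calculation is tedious by hand, so it is worth checking numerically. The following is a minimal sketch using only the standard library; the function names `binomial_pmf` and `binomial_cdf` are illustrative, and n = 15, p = 0.4 come from the example.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """P(X <= x), obtained by summing the probabilities from 0 to x."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

n, p = 15, 0.4
p_at_most_4 = binomial_cdf(4, n, p)
p_more_than_4 = 1 - p_at_most_4

print(round(p_at_most_4, 4), round(p_more_than_4, 4))
```

Using the complement avoids summing the eleven terms from x = 5 to x = 15.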
we cannot limit the number of outcome for random variable. The probability of x
occurrences in an interval of time, volume, area, etc. for a variable.
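$$P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!} \quad ; x = 0, 1, 2, \ldots$$
where $\lambda$ is the mean number of occurrences in the interval and e = 2.71828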
Example 3 In the hospital, at the emergency unit, there are 3 emergency cases
Poisson probability:
$$P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!} \quad ; x = 0, 1, 2, 3, \ldots$$
$$P(x) = \frac{e^{-3}\, 3^{x}}{x!} \quad ; x = 0, 1, 2, 3, \ldots$$
Find the probability that:
a. no case is admitted
P(X = 0) = 0.0498
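The Poisson probability above can be verified directly from the formula. This is a small sketch with the standard library; the function name `poisson_pmf` is illustrative, and the mean of 3 cases per interval is taken from the example.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3  # mean number of emergency cases per interval, from the example
p_no_case = poisson_pmf(0, lam)

print(round(p_no_case, 4))
```

The same function gives any other probability in the example, such as `poisson_pmf(2, lam)` for exactly two cases.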
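The data $x_i$ are normally distributed with mean ($\mu$) and standard
deviation ($\sigma$):
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \quad ; -\infty < x < \infty$$
The standard score:
$$Z_i = \frac{x_i - \mu}{\sigma}$$
The mean of the standard score: $\mu_z = 0$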
Example 4 The weights of normal persons are normally distributed with mean
50 kg and standard deviation 10 kg. What percentage of all normal persons have
a weight between 45 and 65 kg, and what percentage have a weight greater than
60 kg?
$\mu = 50$, $\sigma = 10$
$$Z_i = \frac{x_i - \mu}{\sigma}$$
The probability that a person has a weight between 45 and 65 kg:
$$P(45 < x < 65) = P\left(\frac{45 - 50}{10} < Z < \frac{65 - 50}{10}\right) = P(-0.5 < Z < 1.5)$$
$$= 1 - 0.3085 - 0.0668$$
$$= 0.6247$$
So 62.47% of normal persons have a weight between 45 and 65 kg.
The probability that a person has a weight greater than 60 kg:
$$P(x > 60) = P\left(Z > \frac{60 - 50}{10}\right) = P(Z > 1) = 0.1587$$
So 15.87% of normal persons have a weight greater than 60 kg.
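These table lookups can be cross-checked in code. The sketch below uses `statistics.NormalDist` from the Python standard library with the mean and standard deviation of the example; the variable names are illustrative.

```python
from statistics import NormalDist

weight = NormalDist(mu=50, sigma=10)  # parameters from Example 4

# P(45 < x < 65): difference of the cumulative probabilities at the two limits.
p_between = weight.cdf(65) - weight.cdf(45)

# P(x > 60): upper-tail probability.
p_above_60 = 1 - weight.cdf(60)

print(round(p_between, 4), round(p_above_60, 4))
```

Working with the cumulative distribution directly avoids the intermediate step of converting each limit to a Z score.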
Sampling distribution
We draw samples of size n from the population.
All possible sample means ($\bar{X}_i$) are normally distributed; the average of
the sample means equals the population mean, and the variance of the sample
means equals the population variance divided by the sample size.
$$\mu_{\bar{x}} = \mu$$
$$\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$$
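1. Sampling with replacement
The variance of the sample mean:
$$\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$$
2. Sampling without replacement
The variance of the sample mean:
$$\sigma^2_{\bar{x}} = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$$
In a population with large N, $\frac{N - n}{N - 1} \approx 1$, so
$\sigma^2_{\bar{x}} \approx \frac{\sigma^2}{n}$.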
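Example: A population of size 5 (N = 5) has the values 6, 8, 10, 12, 14.
The mean: $\mu = 10$
The variance: $\sigma^2 = 8$

Sampling with replacement, n = 2. Sample mean ($\bar{X}$) of every possible sample:
          6     8    10    12    14
___________________________________________________________
   6      6     7     8     9    10
   8      7     8     9    10    11
  10      8     9    10    11    12
  12      9    10    11    12    13
  14     10    11    12    13    14
___________________________________________________________
Frequency distribution of the sample means: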
xi fi
6 1
7 2
8 3
9 4
10 5
11 4
12 3
13 2
14 1
Total 25
_______________________________________________________
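$$\mu_{\bar{x}} = 10 = \mu$$
$$\sigma^2_{\bar{x}} = 4 = \frac{8}{2} = \frac{\sigma^2}{n}$$
So $\mu_{\bar{x}} = \mu$ and $\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$ ; $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$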
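Sampling without replacement, n = 2. Sample mean ($\bar{X}$) of every possible sample:
          6     8    10    12    14
___________________________________________________________
   6      -     7     8     9    10
   8      7     -     9    10    11
  10      8     9     -    11    12
  12      9    10    11     -    13
  14     10    11    12    13     -
___________________________________________________________
$$\mu_{\bar{x}} = 10 = \mu$$
$$\sigma^2_{\bar{x}} = 3 = \frac{8}{2} \cdot \frac{5 - 2}{5 - 1} = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$$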
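Because the population in this example is tiny, both results can be confirmed by listing every possible sample. The following is a sketch using `itertools` and `statistics`; it treats samples as ordered pairs (an assumption that does not change the mean or variance of the sample means), and the variable names are illustrative.

```python
from itertools import permutations, product
from statistics import mean, pvariance

population = [6, 8, 10, 12, 14]
N = len(population)
n = 2
sigma2 = pvariance(population)  # population variance

# Every ordered sample of size n, with and without replacement.
means_with = [mean(s) for s in product(population, repeat=n)]
means_without = [mean(s) for s in permutations(population, n)]

var_with = pvariance(means_with)
var_without = pvariance(means_without)

print(len(means_with), mean(means_with), var_with)
print(len(means_without), mean(means_without), var_without)
print(sigma2 / n, sigma2 / n * (N - n) / (N - 1))
```

The last line prints the two theoretical variances, which should agree with the enumerated values on the first two lines.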
If N is large, $\frac{N - n}{N - 1} \approx 1$, so
$\mu_{\bar{x}} = \mu$ and $\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$ ; $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
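All $\bar{X}_i \sim$ Normal
$$Z_i = \frac{\bar{x}_i - \mu}{\sigma / \sqrt{n}} \quad ; \sigma^2 \text{ known}$$
$$t = \frac{\bar{x}_i - \mu}{S / \sqrt{n}} \quad ; df = n - 1 ; \sigma^2 \text{ unknown}$$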
Example 5 The average height of the students in a school is 158 cm, with a
standard deviation of 15 cm. In a sample of 100 students from this school, what
is the probability that the sample mean of the height is between 155 and 160 cm?
$$Z_i = \frac{\bar{x}_i - \mu}{\sigma / \sqrt{n}} \quad ; \sigma^2 \text{ known}$$
The probability that the sample mean is between 155 and 160 cm is
$$P(155 < \bar{x} < 160) = P\left(\frac{155 - 158}{15 / \sqrt{100}} < Z < \frac{160 - 158}{15 / \sqrt{100}}\right)$$
$$= P(-2.0 < Z < 1.33)$$
$$= 1 - 0.0228 - 0.0918$$
$$= 0.8854$$
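As a cross-check, the same probability can be computed from the sampling distribution of the mean without rounding the Z scores. This is a minimal sketch with `statistics.NormalDist`; the result is about 0.886, slightly different from the table-based 0.8854 because the table calculation rounds the upper Z score to 1.33.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 158, 15, 100        # values from Example 5
standard_error = sigma / sqrt(n)   # standard deviation of the sample mean

sample_mean = NormalDist(mu=mu, sigma=standard_error)
p_between = sample_mean.cdf(160) - sample_mean.cdf(155)

z_low = (155 - mu) / standard_error
z_high = (160 - mu) / standard_error

print(standard_error, round(z_low, 2), round(z_high, 2), round(p_between, 3))
```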
Example 6 The average birthweight in a rural area is 2,500 gm. Now, with the
new public health program, a sample of 25 live births has a standard deviation
of birthweight of 1,000 gm. What is the probability that the sample mean of
birthweight is greater than 3,000 gm?
$$t = \frac{\bar{x} - \mu}{S / \sqrt{n}} \quad ; df = n - 1$$
The probability that the sample mean of birthweight is greater than 3,000 gm:
$$P(\bar{x} > 3000) = P\left(t > \frac{3000 - 2500}{1000 / \sqrt{25}}\right) = P(t > 2.5) \quad ; df = 24$$
$$= 0.01$$
and the standard deviation of all possible $\hat{p}$:
$$\sigma_{\hat{p}} = \sqrt{\frac{P(1 - P)}{n}}$$
$\hat{p} \sim$ Normal:
$$Z = \frac{\hat{p} - P}{\sqrt{\dfrac{P(1 - P)}{n}}}$$
Example 7 Among preschool children, the proportion who have a dental health
problem is 15%. This year, among 200 new preschool students, what is the
probability that 40 or more students have a dental health problem? And what is
the chance that the proportion with a dental health problem is not more than 12%?
P = 15% = 0.15
The probability that 40 or more students have a dental health problem:
$$n = 200, \quad \hat{p} = \frac{40}{200} = 0.2$$
$$Z = \frac{\hat{p} - P}{\sqrt{\dfrac{P(1 - P)}{n}}} = \frac{0.2 - 0.15}{\sqrt{\dfrac{(0.15)(0.85)}{200}}} = 1.98$$
The chance that the proportion with a dental health problem is not more than 12%:
$$\hat{p} = 0.12$$
$$Z = \frac{\hat{p} - P}{\sqrt{\dfrac{P(1 - P)}{n}}} = \frac{0.12 - 0.15}{\sqrt{\dfrac{(0.15)(0.85)}{200}}} = -1.19$$
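The two Z scores in this example, and the tail probabilities they lead to, can be sketched in code as follows. The function name `proportion_z` is illustrative; the tail probabilities are computed from the normal approximation and are not values stated in the example itself.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(p_hat, P, n):
    """Z score of a sample proportion under the normal approximation."""
    return (p_hat - P) / sqrt(P * (1 - P) / n)

P, n = 0.15, 200  # population proportion and sample size from Example 7

z_40_or_more = proportion_z(40 / n, P, n)
z_at_most_12pct = proportion_z(0.12, P, n)

std_normal = NormalDist()
p_40_or_more = 1 - std_normal.cdf(z_40_or_more)    # upper tail
p_at_most_12pct = std_normal.cdf(z_at_most_12pct)  # lower tail

print(round(z_40_or_more, 2), round(z_at_most_12pct, 2))
print(p_40_or_more, p_at_most_12pct)
```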
Estimation parameter
Type of estimation : Point Estimation
Interval Estimation
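$$\bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \quad ; \sigma \text{ known}$$
$$\bar{x} \pm t_{\alpha/2} \frac{S}{\sqrt{n}} \quad ; df = n - 1 ; \sigma \text{ unknown}$$
2. Estimating the population variance ($\sigma^2$)
$$\frac{(n - 1) S^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n - 1) S^2}{\chi^2_{1 - \alpha/2}}$$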
enzyme we have $\bar{x}$ = 20 and $\sigma^2$ = 40. To estimate the mean of the level enzyme
n = 15, $\bar{x}$ = 20, $\sigma^2$ = 40, $\sigma$ = 6.32
Estimate the mean enzyme level of normal persons with a 95% confidence
interval.
$$\mu = \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
$$= 20 \pm \frac{1.96(6.32)}{\sqrt{15}} = 20 \pm 3.20$$
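The interval for a mean with known variance can be sketched in code as follows. The function name `z_confidence_interval` is illustrative; the inputs (n = 15, mean 20, variance 40) are those of the enzyme example, and the critical value is obtained from `NormalDist.inv_cdf` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin, margin

lower, upper, margin = z_confidence_interval(x_bar=20, sigma=sqrt(40), n=15)

print(round(lower, 2), round(upper, 2), round(margin, 2))
```

Raising the confidence level widens the interval, because the critical value grows while the standard error stays the same.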
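Example 9 In a study of the serum of 16 infants, the mean is 5.96 mg% and the
standard deviation is 3.5 mg%. Estimate the population mean serum level with a
95% confidence interval.
$$\bar{x} \pm t_{\alpha/2} \frac{S}{\sqrt{n}} \quad ; df = n - 1 ; \sigma \text{ unknown}$$
n = 16, $\bar{x}$ = 5.96, S = 3.5
$$5.96 \pm 2.131 \frac{(3.5)}{\sqrt{16}} = 5.96 \pm 1.86$$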
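$$\frac{(n - 1) S^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n - 1) S^2}{\chi^2_{1 - \alpha/2}}$$
The $(1 - \alpha)100\%$ confidence interval of P:
$$\hat{p} \pm Z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$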
result have 160 patients got well in 3 days. What is the proportion of patients
who get well within 3 days of the new treatment, at the 95% confidence level?
$$n = 200 ; \quad \hat{p} = \frac{160}{200} = 0.8$$
Estimate of P:
$$\hat{p} \pm Z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Estimate at 95% CI:
$$P = 0.8 \pm 1.96 \sqrt{\frac{0.8(0.2)}{200}}$$
$$= 0.80 \pm 0.055$$
$$= 0.745 , 0.855$$
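The proportion interval above can be reproduced with a few lines of code. This is a minimal sketch; the function name `proportion_confidence_interval` is illustrative, and the counts (160 recoveries among 200 patients) come from the example.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

ci_lower, ci_upper = proportion_confidence_interval(successes=160, n=200)

print(round(ci_lower, 3), round(ci_upper, 3))
```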
Hypothesis Testing
Statistical Hypotheses
H0: P = 0.5
Ha: P ≠ 0.5
Some researchers say that a hypothesis test can have one of two outcomes: you
accept the null hypothesis or you reject the null hypothesis. Many statisticians,
however, take issue with the notion of "accepting the null hypothesis." Instead,
they say: you reject the null hypothesis or you fail to reject the null hypothesis.
Hypothesis Tests
State the hypotheses. This involves stating the null and alternative
hypotheses. The hypotheses are stated in such a way that they are mutually
exclusive. That is, if one is true, the other must be false.
Formulate an analysis plan. The analysis plan describes how to use sample
data to evaluate the null hypothesis. The evaluation often focuses around a
single test statistic.
Analyze sample data. Find the value of the test statistic (mean score,
proportion, t-score, z-score, etc.) described in the analysis plan.
Interpret results. Apply the decision rule described in the analysis plan. If the
value of the test statistic is unlikely, based on the null hypothesis, reject the
null hypothesis.
Decision Errors
Type I error. A Type I error occurs when the researcher rejects a null
hypothesis when it is true. The probability of committing a Type I error is
called the significance level. This probability is also called alpha, and is
often denoted by α.
Type II error. A Type II error occurs when the researcher fails to reject a
null hypothesis that is false. The probability of committing a Type II error is
called Beta, and is often denoted by β. The probability of not committing a
Type II error is called the Power of the test.
Null Hypothesis H 0
Statistical Decision
Decision Rules
The analysis plan includes decision rules for rejecting the null hypothesis. In
practice, statisticians describe these decision rules in two ways: with reference
to a P-value or with reference to a region of acceptance.
The set of values outside the region of acceptance is called the region of
rejection. If the test statistic falls within the region of rejection, the null
hypothesis is rejected. In such cases, we say that the hypothesis has been
rejected at the α level of significance.
These approaches are equivalent. Some statistics texts use the P-value approach;
others use the region of acceptance approach. In subsequent lessons, this tutorial
will present examples that illustrate each approach.
A test of a statistical hypothesis, where the region of rejection is on only one side
of the sampling distribution, is called a one-tailed test.
For example, suppose the null hypothesis states that the mean of the blood sugar is
less than or equal to 100. The alternative hypothesis would be that the mean of the
blood sugar is greater than 100. The region of rejection would consist of a range of
numbers located on the right side of sampling distribution; that is, a set of numbers
greater than 100.
H0: μ ≤ 100
Ha: μ > 100
For example, suppose the null hypothesis states that the mean of blood sugar is
equal to 110. The alternative hypothesis would be that the mean of blood sugar is
less than 110 or greater than 110. The region of rejection would consist of a range
of numbers located on both sides of sampling distribution; that is, the region of
rejection would consist partly of numbers that were less than 110 and partly of
numbers that were greater than 110.
H0: μ = 110
Ha: μ ≠ 110
Effect Size
To compute the power of the test, one offers an alternative view about the "true"
value of the population parameter, assuming that the null hypothesis is false.
The effect size is the difference between the true value and the value specified in
the null hypothesis.
For example, suppose the null hypothesis states that a population mean is equal to
100. A researcher might ask: What is the probability of rejecting the null
hypothesis if the true population mean is equal to 90? In this example, the effect
size would be 90 - 100, which equals -10.
Sample size (n). Other things being equal, the greater the sample size, the
greater the power of the test.
Significance level (α). The higher the significance level, the higher the
power of the test. If you increase the significance level, you reduce
the region of acceptance. As a result, you are more likely to reject the null
hypothesis. This means you are less likely to accept the null hypothesis
when it is false; i.e., less likely to make a Type II error. Hence, the power of
the test is increased.
The "true" value of the parameter being tested. The greater the difference
between the "true" value of a parameter and the value specified in the null
hypothesis, the greater the power of the test. That is, the greater the effect
size, the greater the power of the test.
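The three factors above can be explored numerically. The sketch below computes the power of a two-tailed one-sample z-test for the earlier example (null value 100, true mean 90); the standard deviation of 20 and the sample sizes are hypothetical values chosen only for illustration, and the function name `z_test_power` is likewise an assumption.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test when sigma is known."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the true mean, the test statistic is Normal(shift, 1).
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    reject_low = std_normal.cdf(-z_crit - shift)      # P(Z < -z_crit)
    reject_high = 1 - std_normal.cdf(z_crit - shift)  # P(Z > z_crit)
    return reject_low + reject_high

# Hypothetical settings: sigma = 20 is not given in the text.
base = z_test_power(mu0=100, mu_true=90, sigma=20, n=25)
larger_n = z_test_power(mu0=100, mu_true=90, sigma=20, n=100)
larger_alpha = z_test_power(mu0=100, mu_true=90, sigma=20, n=25, alpha=0.10)
larger_effect = z_test_power(mu0=100, mu_true=85, sigma=20, n=25)

print(round(base, 4), round(larger_n, 4), round(larger_alpha, 4), round(larger_effect, 4))
```

Each of the last three calls changes one factor relative to `base`, so the printed values show the direction of its effect on power.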
All hypothesis tests are conducted the same way. The researcher states a
hypothesis to be tested, formulates an analysis plan, analyzes sample data
according to the plan, and accepts or rejects the null hypothesis, based on results of
the analysis.
State the hypotheses. Every hypothesis test requires the analyst to state
a null hypothesis and an alternative hypothesis. The hypotheses are stated
in such a way that they are mutually exclusive. That is, if one is true, the
other must be false; and vice versa.
Formulate an analysis plan. The analysis plan describes how to use sample
data to accept or reject the null hypothesis. It should specify the following
elements.
to 0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
Test method. Typically, the test method involves a test statistic and
a sampling distribution. Computed from sample data, the test statistic might be a
mean score, proportion, difference between means, difference between
proportions, z-score, t-score, chi-square, etc. Given a test statistic and its sampling
distribution, a researcher can assess probabilities associated with the test statistic.
If the test statistic probability is less than the significance level, the null hypothesis
is rejected.
Analyze sample data. Using sample data, perform computations called for
in the analysis plan.
where Parameter is the value appearing in the null hypothesis, and Statistic is
the point estimate of Parameter. As part of the analysis, you may need to compute
the standard deviation or standard error of the statistic. Previously, we presented
common formulas for the standard deviation and standard error.
When the parameter in the null hypothesis involves categorical data, you may use a
chi-square statistic as the test statistic. Instructions for computing a chi-square test
statistic are presented in the lesson on the chi-square goodness of fit test.
Interpret the results. If the sample findings are unlikely, given the null
hypothesis, the researcher rejects the null hypothesis. Typically, this involves
comparing the P-value to the significance level, and rejecting the null hypothesis
when the P-value is less than the significance level.
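The four steps can be sketched end to end for a two-tailed test of a mean. The hypotheses H0: μ = 110 and Ha: μ ≠ 110 follow the blood sugar example earlier in these notes; the sample mean, standard deviation, and sample size below are hypothetical numbers used only to make the sketch runnable.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: state the hypotheses. H0: mu = 110, Ha: mu != 110.
mu0 = 110

# Step 2: analysis plan. Significance level and a one-sample z-test.
alpha = 0.05

# Step 3: analyze sample data (hypothetical values, sigma assumed known).
x_bar, sigma, n = 114, 12, 36
z = (x_bar - mu0) / (sigma / sqrt(n))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed P-value

# Step 4: interpret. Reject H0 when the P-value is below alpha.
reject_h0 = p_value < alpha

print(round(z, 2), round(p_value, 4), reject_h0)
```

For a one-tailed test, the P-value would use only the tail named in the alternative hypothesis instead of doubling.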