Research Methods and Statistics
Preface
I wrote this book because there is a large gap between the elementary statistics
course that most people take and the more advanced research methods courses
taken by graduate and upper-division students so they can carry out research
projects. These advanced courses include difficult topics such as regression,
forecasting, structural equations, survival analysis, and categorical data, often
analyzed using sophisticated likelihood-based and even Bayesian methods.
However, these advanced courses typically devote little time to helping students
understand the fundamental assumptions and machinery behind these methods.
Instead, they teach the material like witchcraft: Do this, do that, and voilà—
Statistics! Students thus have little idea what they are doing and why they are
doing it. Like trained parrots, they learn how to recite statistical jargon mindlessly.
The goal of this book is to make statistics less like witchcraft, to treat students
like intelligent humans and not like trained parrots—thus the title, Research
Methods and Statistics.
This book will cause students and researchers to think differently about things,
not only about math and statistics, but also about research, the scientific method,
and life in general. It will teach them how to do good modeling—and hence good
statistics—from a standpoint of deep knowledge rather than rote knowledge. It
will also provide them with tools to think critically about the claims they see in the
popular press, and to design their own studies to avoid common errors.
This book is not a "cookbook." Cookbooks tell you all about the what but nothing
about the why. With computers, software and the Internet readily available, it is
easier than ever for students to lose track of the why and focus on
the what instead. This book takes exactly the opposite approach. It will empower
students and researchers to use advanced statistical methods with confidence.
Contents
DESCRIPTIVE STATISTICS
Research your idea. See if there's a demand. A lot of people
have great ideas, but they don't know if there's a need for it.
You also have to research your competition.
I
THE NATURE OF DATA
Anything that can be counted or measured is called a variable. Knowledge of
the different types of variables, and the way they are measured, plays a crucial
part in the choice of coding and data collection. The measurement of variables can
be categorized as categorical (nominal or ordinal scales) or continuous (interval
or ratio scales).
A nominal scale allows for the classification of objects, individuals and responses
based on a common characteristic or shared property. A variable measured on
the nominal scale may have one, two or more sub-categories depending on the
degree of variation in the coding. Any number attached to a nominal
classification is merely a label, and no ordering is implied: social worker, nurse,
electrician, physicist, politician, teacher, plumber, etc.
An ordinal scale not only categorizes objects, individuals and responses into
sub-categories on the basis of a common characteristic; it also ranks them in
descending order of magnitude. Any number attached to an ordinal
classification is ordered, but the intervals between may not be constant: GCSE,
A-level, diploma, degree, postgraduate diploma, higher degree, and doctorate.
The interval scale has the properties of the ordinal scale and, in addition, has a
commencement and termination point, and uses a scale of equally spaced
intervals in relation to the range of the variable. The number of intervals
between the commencement and termination points is arbitrary and varies from
one scale to another. In measuring an attitude using the Likert scale, the
intervals may mean the same up and down the scale of 1 to 5 but multiplication is
not meaningful: a rating of '4' is not twice as 'favourable' as a rating of '2'.
In addition to having all the properties of the nominal, ordinal and interval scales,
the ratio scale has a zero point. The ratio scale is an absolute measure allowing
multiplication to be meaningful. The numerical values are 'real numbers' with
which you can conduct mathematical procedures: a man aged 30 years is half
the age of a woman of 60 years.
Categorical
  Name, Occupation, Location, Site
  Two-category codes:
    [1] Yes / [0] No
    [1] Good / [0] Bad
    [1] Female / [0] Male
    [1] Right / [0] Wrong
    [1] Extrovert / [0] Introvert
    [1] Psychotic / [0] Neurotic
  Ordered codes:
    Attitudes (Likert Scale): [5] strongly agree, [4] agree, [3] uncertain, [2] disagree, [1] strongly disagree
    Age: [4] Old, [3] Middle-aged, [2] Young, [1] Child
    Income: [3] High, [2] Medium, [1] Low
Continuous
  Income (£000s per annum)
  Age (in years)
  Reaction Time (in seconds)
  Absence (in days)
  Distance (in kilometres)
  Length (metres)
  Attitude (Thurstone & Cheve)
(The columns run from Qualitative to Quantitative.)
Table I
A Two-Way Classification of Variables
Methods of Data Collection
The major approaches to gathering data about a phenomenon are from primary
sources: directly from subjects by means of experiment or observation, from
informants by means of interview, or from respondents by questionnaire and
survey instruments. Data may also be obtained from secondary sources:
information that is readily available but not necessarily directly related to the
phenomenon under study. Examples of secondary sources include published
academic articles, government statistics, an organization's archival records to
collect data on activities, personnel records to obtain data on age, sex,
qualification, length of service, and absence records of workers, etc. Data
collected and analyzed from published articles, research papers and journals
may be a primary source if the material is directly relevant to your study. For
instance, primary sources for a study conducted using the Job Descriptive Index
may be Hulin and Smith (1964-68) and Jackson (1986-90), whereas a study
using an idiosyncratic study population, technique and assumptions, such as
those published by Herzberg, et alia (1954-59), would be a secondary source.
[Figure 1: tree diagram. One branch covers data primarily and directly gathered for the purpose of the study; the other covers data not primarily or directly gathered for the purposes of the study (archives, other records, previous unrelated studies). Surviving labels from the first branch: Unstructured, Non-participant, Quasi, Mailed.]
Figure 1
A Classification of Methods of Data Collection
Procedures for Coding Data
A coding frame is simply a set of instructions for transforming data into codes
and for identifying the location of all the variables measured by the test or
instrument. Primary data gathered from subjects and informants is amenable to
control during the data collection phase. The implication is that highly structured
data, usually derived from tests, questionnaires and interviews, is produced
directly by means of a calibrated instrument or is readily produced from raw
scores according to established rules and conventions. Generally, physical
characteristics such as height and weight are measured on the ratio scale,
whereas psychological attributes such as attitude and standard dimensions of
personality are often based on questions to which there is no appropriate
response. However, the sum of the responses is interpreted
according to a set of rules and provides a numerical score on the interval scale
but is often treated as though the measures relate to the ratio scale. Norms are
available for standard tests of physical and psychological attributes to establish
the meaning of individual scores in terms of those derived from the general
population. A questionnaire aimed at determining scores as a measure of a
psychological attribute is said to be pre-coded; that is, the data reflect the
coder's prior structuring of the population. The advantages of pre-coding are
that it reduces time, cost and coding error in data handling. Ideally, the pre-
coding should be sufficiently robust and discriminating as to allow data
processing by computer.
Once we have a coding frame, data relating to an individual or case can be read
off just as we would read the data in a sales catalogue or a coded matrix from a
holiday brochure:
The following statements may or may not be true for your work organization. For each item
below, please answer by ticking (/) the appropriate box below the categories at the top of the
page. If you cannot decide then tick the box which is closest to your answer.
Response categories: Definitely True | More True than False | More False than True | Definitely False
1. I feel that I am my own boss in most matters
2. The company is generally quick to use
improved work methods
3. The company has a real interest in the
welfare and happiness of those who work here
4. The company tries to improve working conditions
5. A person can make his or her own decisions
around here without checking with anybody else
6. Everyone has a specific job to do
7. The company has clear-cut reasonable goals
and objectives
8. Most people here make their own rules on the job
Table 1
Survey Instrument and Coding Frame Matrix for Research Data
Table 2
Response Matrix Representing Categories of Interest
1. Representational Approach
The response of the informant is said to express the surface meaning of
what is "out there", requiring the researcher to apply codes to reduce the
data, whilst at the same time, reflecting this meaning as faithfully as
possible. At this stage of the process, the data must be treated
independently from any views the researcher may hold about underlying
variables and meanings.
2. Anchored-in Approach
The researcher may view the responses as having additional and implicit
meanings that come from the fact that the responses are dependent on
the data-gathering context. For example, in investigating worker
involvement, we might want to conduct this with a framework comprising
types of formal and informal worker/manager interactions. As a
consequence, the words given by informants can be interpreted to
produce codes on more than one dimension relating to the context: (a)
nature of the contact: formal versus informal, intermittent versus
continuous contact, etc.; (b) initiator of contact: worker versus manager.
The coding frame using this approach takes into account "facts" as being
anchored to the situation, rather than treating the data as though they are
context-free.
3. Hypothesis-Guided Approach
Although similar to the second approach, we may view the data as having
multiple meanings according to the paradigm or theoretical perspective from
which they are approached (e.g. phenomenological or hermeneutic
approach to investigating a human or social phenomenon). The
hypothesis-guided approach recognizes that the data do not have just one
meaning which refers to some reality approachable by analysis for the
surface meaning of the words: words have multiple meanings, and "out
there" is a multiverse rather than a universe. In the hypothesis-guided
approach, the researcher might use the data, and other materials, to
create or investigate variables that are defined in terms of the theoretical
perspective and construct propositions. For example, a data set might
contain data on illness and minor complaints that informants had
experienced over a period of, say, one year. Taking the hypothesis-guided
approach, the illness data might be used as an indicator of
occupational stress or of a reaction to transformational change. Hence,
the coding frame is based on the researcher's views and hypotheses rather
than on the surface meanings of the responses. [1]
[1] In the case of anchored and hypothesis-guided coding frames, there may well be categories that
are not represented in small samples of data. Indeed, this is very likely for the hypothesis-guided
coding frame; that is, you may want the frame to be able to reflect important concepts, if only by
showing that they did not occur among the responses.
II
ANALYSIS OF INDIVIDUAL
OBSERVATIONS
In the analysis of individual observations, or ungrouped data, consideration will
be given to all levels of measurement to determine which descriptive measures
can be used, and under what conditions each is appropriate.
One of the most widely used descriptive measures is the 'average'. One speaks
of the 'average age', 'average response time', or 'average score' often without
being very specific as to precisely what this means. The use of the average is
an attempt to find a single figure to describe or represent a set of data. Since
there are several kinds of 'average', or measures of central tendency, used in
statistics, the use of precise terminology is important: each 'average' must be
clearly defined and labelled to avoid confusion and ambiguity. At least three
common kinds of 'average' can be described:
The Mode
The mode can be defined as the most frequently occurring value in a set
of data; it may be viewed as a single value that is most representative of
all the values or observations in the distribution of the variable under study.
It is the only measure of central tendency that can be appropriately used
to describe nominal data. However, a mode may not exist, and even if it
does, it may not be unique:
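As a minimal sketch of these two caveats, the Python snippet below counts how often each value occurs and returns every value tied for the highest count, or an empty list when no value repeats. The function name `modes` and the sample data are illustrative assumptions, not taken from the text.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values (sorted), or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Every value occurs exactly once, so no mode exists.
        return []
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical data sets, chosen only to show the three possible situations.
one_mode = modes([2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, 18])
no_mode = modes([3, 5, 8, 10, 12, 15, 16])
two_modes = modes([2, 3, 4, 4, 4, 5, 5, 7, 7, 7, 9])

# The mode also works for nominal data, where a mean or median is undefined.
nominal_mode = modes(["nurse", "teacher", "nurse", "plumber"])

print(one_mode, no_mode, two_modes, nominal_mode)
```

Returning a list rather than a single number keeps the "no mode" and "more than one mode" cases explicit instead of hiding them.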
The Median
In simple terms, the median splits the data into two equal parts, allowing us to state
that half of the subjects scored below the median value and half the subjects
scored above the median value. If an observed value occurs more than once, it
is listed separately each time it occurs:
Averages much more sophisticated than the mode or median can be used at the
interval and ratio level. The arithmetic mean is widely used because it is the
most commonly known, easily understood and, in statistics, the most useful
measure of central tendency. The arithmetic mean is usually given the notation
$\bar{x}$ and can be computed:
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \frac{\sum x}{N}$$
where $x_1, x_2, x_3, \dots, x_n$ are the values attached to the observations, and $N$ is the
total number of observations:
Reaction Time (msecs): 625 500 480 500 460 500 575 530 525
$$\bar{x} = \frac{\sum x}{N} = \frac{4695}{9} \approx 521.67 \text{ msecs}$$
The fact that the arithmetic mean can be readily computed does not mean that it
is meaningful or even useful. Furthermore, the arithmetic mean has the
weakness of being unduly influenced by small, or unusually large, values in a
data set. For example: five subjects are observed in an experiment and display
the following reaction times: 120, 57, 155, 210 and 2750 msecs. The arithmetic
mean is 658.4 msecs, a figure that is hardly typical of the distribution of reaction
times.
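The effect of the single extreme value can be checked with a short Python sketch that uses the standard-library `statistics` module on the five reaction times quoted above. The variable names are illustrative.

```python
from statistics import mean, median

# The five reaction times from the example above.
reaction_times = [120, 57, 155, 210, 2750]

mean_time = mean(reaction_times)      # pulled upwards by the value 2750
median_time = median(reaction_times)  # middle value of the sorted data

# Dropping the extreme value shows how strongly it affects the mean.
mean_without_extreme = mean(t for t in reaction_times if t != 2750)

print(mean_time, median_time, mean_without_extreme)
```

The median sits among the four typical observations, whereas the mean lies above every one of them.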
Measures of Dispersion
The degree to which numerical data tend to spread about an average value is
called the variation or dispersion of the data. Various measures of dispersion or
variation are available such as range, mean deviation, semi-interquartile range,
and the standard deviation.
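The mean absolute deviation (MD) of a set of $N$ values is
$$\text{MD} = \frac{\sum |x - \bar{x}|}{N}$$
where $\bar{x}$ is the arithmetic mean of the data set and $|x - \bar{x}|$ is the absolute value of
the deviation of $x$ from $\bar{x}$ (note that in mathematics, the absolute value of a
number is the number without the associated sign and is indicated by two vertical
lines: |-4| = 4, |+4| = 4).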
Compute, and comment on, the mean absolute deviation for the two
sets of data derived from sample (a) and sample (b):
(a) 2, 3, 6, 8, 11 (b) 2, 4, 7, 8, 9
Conclusion
The mean absolute deviation indicates that sample (b) shows less dispersion
than sample (a).
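A minimal Python sketch of this exercise is given below. It averages the absolute deviations from the arithmetic mean for samples (a) and (b) as listed above; the function name `mean_absolute_deviation` is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the values from their arithmetic mean."""
    n = len(values)
    centre = sum(values) / n
    return sum(abs(x - centre) for x in values) / n


sample_a = [2, 3, 6, 8, 11]
sample_b = [2, 4, 7, 8, 9]

md_a = mean_absolute_deviation(sample_a)  # deviations 4, 3, 0, 2, 5
md_b = mean_absolute_deviation(sample_b)  # deviations 4, 2, 1, 2, 3

print(md_a, md_b)
```

Both samples have a mean of 6, so the smaller figure for sample (b) reflects a tighter spread and not a different centre.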
Variance and Standard Deviation
The most useful measures of dispersion, and those with the most desirable
mathematical properties, are variance and standard deviation. The standard
deviation is particularly useful when dealing with the Normal Distribution, since
any Normal Distribution can be completely determined when the arithmetic mean
and standard deviation are known. When it is necessary to distinguish the
standard deviation of a population from the standard deviation of a sample drawn
from this population, we often use the symbol $\sigma$ for the former and $s$ for the
latter. For ungrouped data the variance and standard deviation from a sample
can be determined from the following derived formulae:
$$s^2 = \frac{\sum (x - \bar{x})^2}{N} \qquad (1)$$
$$s^2 = \frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2 \qquad (2)$$
where $s^2$ = variance; $\bar{x}$ = the arithmetic mean; $(x - \bar{x})^2$ = the squared difference
between the observed value and the arithmetic mean; and $N$ = the number of
observations. To compute the variance or standard deviation of a sample from a
population, the value for $N$ should be substituted by $[N - 1]$; however, with large
samples, say >50, the difference is so small as to be of no significance. The
standard deviation is the square root of the variance. Formula (2) above is often
easier to compute for manual operations:
Find the standard deviation of each set of data: (a) 2, 3, 6, 8, 11 and (b) 2, 4,
7, 8, 11.
The above results should be compared with those of the mean absolute
deviation computed previously. It will be noted that the standard deviation
does indicate that the data set (b) shows less dispersion than data set (a).
However, the effect is masked by the fact that extreme values affect the
standard deviation much more than the mean absolute deviation. This of
course is to be expected since the deviations are squared in computing the
standard deviation.
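The sketch below works the exercise in Python for the two data sets as printed in it, showing both divisors discussed above: `pstdev` divides by N and `stdev` divides by N - 1. Both functions come from the standard-library `statistics` module; the variable names are illustrative.

```python
from statistics import pstdev, stdev

set_a = [2, 3, 6, 8, 11]
set_b = [2, 4, 7, 8, 11]

# Divisor N: treats the values as a complete population.
sd_a_population = pstdev(set_a)
sd_b_population = pstdev(set_b)

# Divisor N - 1: treats the values as a sample drawn from a population.
sd_a_sample = stdev(set_a)
sd_b_sample = stdev(set_b)

for label, pop, smp in (("a", sd_a_population, sd_a_sample),
                        ("b", sd_b_population, sd_b_sample)):
    print(f"set ({label}): divisor N -> {pop:.3f}, divisor N-1 -> {smp:.3f}")
```

With only five observations the two divisors give visibly different answers, which is why the adjustment matters for small samples.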
For grouped data the arithmetic mean is computed as $\bar{x} = \sum fm / \sum f$ and the
variance as $s^2 = \sum f(m - \bar{x})^2 / \sum f$, where $m$ is the class midpoint and $f$ is the class frequency.
For the Normal Distribution:
1. 68.27 percent of observations (x) are included between the value of the
arithmetic mean and +/- one standard deviation;
2. 95.45 percent of observations (x) are included between the value of the
arithmetic mean and +/- two standard deviations; and,
3. 99.73 percent of observations (x) are included between the value of the
arithmetic mean and +/- three standard deviations.
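These three percentages are the standard coverage figures for a Normal Distribution and can be reproduced from the error function: the proportion within k standard deviations of the mean is erf(k / sqrt(2)). The Python sketch below uses `math.erf` from the standard library; the helper name `coverage` is an illustrative choice.

```python
import math


def coverage(k):
    """Proportion of a Normal Distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"mean +/- {k} sd: {100 * coverage(k):.2f} percent")
```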
Whilst the traditional training programme shows a lower mean score than that of
the computer-based programme, how can we compare the relative variability
bearing in mind that one shows a lower mean than the other?
The Pearson Coefficient of Variability ($V$) can assist in such instances, where $s$ is the
standard deviation and $\bar{x}$ is the arithmetic mean of the sample:
$$V = \frac{s}{\bar{x}} \times 100$$
variable in an absolute sense, but the variability of the observed values
expressed as a percentage are also greater.
The consequences for the normal distribution are that the mean deviation and
the semi-interquartile range (or median) are equal respectively to 0.7979 and
0.6745 times the standard deviation. For moderately skewed distributions we
have the empirical formulae:
Note: Since the set of data is a sample drawn from a population, the number of observations
(N) has been adjusted to N-1.
Standardized Variables and Standard Scores
The variable
$$z = \frac{x - \bar{x}}{s}$$
which measures the deviation from the mean in units of the standard deviation, is
called a standardized variable and is a dimensionless quantity, that is,
independent of the units used. If deviations from the mean are given in units of
the standard deviation, they are said to be expressed in standard scores or
standard units. These are of great value in comparison of distributions and are
often used in aptitude and educational testing.
Computation
Thus, the subject had a score 0.8 of a standard deviation above the mean for
the numerical test, but only 0.5 of a standard deviation above the mean in
verbal reasoning. Thus, the subject‘s standing was higher in numerical
aptitude.
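The comparison can be sketched in Python as follows. The raw scores, means and standard deviations below are hypothetical values chosen only so that the standard scores come out at 0.8 and 0.5, matching the conclusion above; the figures used in the original computation are not reproduced here.

```python
def standard_score(x, mean, sd):
    """Deviation of x from the mean, expressed in standard deviation units."""
    return (x - mean) / sd


# Hypothetical test results (assumed for illustration only).
z_numerical = standard_score(84, mean=76, sd=10)
z_verbal = standard_score(90, mean=82, sd=16)

print(z_numerical, z_verbal)
```

Although the raw verbal score (90) is higher than the raw numerical score (84) in this illustration, the standard scores show the stronger relative standing in the numerical test.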
Miscellaneous Averages
Weighted Arithmetic Mean
It is not always the case that values of a variable (x) are of equal importance, for
example in the assessment of a learning event. To compute the arithmetic
mean, certain weights ($w$) may be attached to the variable:
$$\bar{x} = \frac{\sum w_j x_j}{\sum w_j}$$
Geometric Mean
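For large values of $x$, the geometric mean ($G$) can be most easily computed by the
application of logarithms to base 10:
$$\log_{10} G = \frac{1}{n} \sum \log_{10} x$$
$$\log_{10} G = \frac{10.950533}{5} = 2.1901066$$
$G$ = 154.9196831; that is, each ten-year figure is about 155 percent of the figure for the preceding period (an increase of about 55 percent per period).
Alternatively, a rough and ready estimate can be derived from the following:
$$G = \left[\frac{\text{no. of subjects at end of period}}{\text{no. of subjects at beginning of period}}\right]^{1/n} = \left[\frac{1233}{138}\right]^{1/5} = 1.549589975, \text{ or } 154.9589975 \text{ percent}$$
But this increase is for each period of ten years; to obtain the annual rate of
increase over the total of fifty years:
$$G = \left[\frac{1233}{138}\right]^{1/50} - 1 = 1.044772373 - 1 = 0.044772373$$
that is, about 4.48 percent per annum.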
The nature of the geometric mean is that if any value for x is zero, then the
product will be zero. It is also meaningless if any of the values are negative.
This can be overcome by expressing the values of x with care, and selecting the
values of x to be averaged.
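The growth-rate arithmetic above can be checked with a short Python sketch. It uses the start and end counts quoted in the text (138 and 1233) and adds a small helper that computes a geometric mean through base-10 logarithms; the helper name `geometric_mean_log10` and the variable names are illustrative assumptions.

```python
import math


def geometric_mean_log10(values):
    """Geometric mean via base-10 logarithms; every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


start, end = 138, 1233          # counts at the beginning and end of the fifty years
ratio = end / start

per_decade_factor = ratio ** (1 / 5)    # five ten-year periods
annual_rate = ratio ** (1 / 50) - 1     # fifty one-year periods

print(f"factor per ten-year period: {per_decade_factor:.6f}")
print(f"annual rate of increase:    {annual_rate:.6f}")
```

The `ValueError` guard mirrors the restriction noted above: a zero or negative value makes the geometric mean zero or meaningless.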
Harmonic Mean
The harmonic mean (H) is useful for solving problems that involve variables
expressed in time, such as reaction times, number of errors per hour, etc. The
harmonic mean can be computed:
$$H = \frac{N}{\sum (1/x)}$$
The arithmetic mean would be 3 minutes to complete the task (12 minutes
divided by 4 subjects), which on average is 20 tasks per hour/subject or 80 tasks
per hour for all four subjects. However, when computed separately for each
subject, the total number of tasks completed per hour is 104. The total number
of tasks completed per hour for all four subjects must be computed some other
way:
= 2.31 minutes
Since the four subjects participated in the experiment for one hour, representing
240 minutes of running time, the average number of tasks completed by the
group of four subjects is:
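The Python sketch below reproduces the reasoning. The four completion times are hypothetical values, chosen only because they are consistent with the totals quoted in the text (12 minutes in all and 104 tasks per hour); the original table of times is not reproduced here.

```python
from statistics import harmonic_mean, mean

# Assumed completion times in minutes per task (illustrative only).
minutes_per_task = [1.5, 2, 2.5, 6]

arithmetic = mean(minutes_per_task)        # 12 minutes / 4 subjects
harmonic = harmonic_mean(minutes_per_task)

# Tasks per hour computed separately for each subject, then summed.
tasks_per_hour = sum(60 / t for t in minutes_per_task)

# The same total obtained from the harmonic mean: 4 subjects x 60 minutes.
tasks_from_harmonic = 240 / harmonic
tasks_from_arithmetic = 240 / arithmetic

print(f"arithmetic mean: {arithmetic:.2f} min, harmonic mean: {harmonic:.2f} min")
print(tasks_per_hour, tasks_from_harmonic, tasks_from_arithmetic)
```

Dividing the 240 minutes of running time by the harmonic mean recovers the 104 tasks, whereas dividing by the arithmetic mean gives only 80.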
III
Descriptive Statistics
Data
Data are very important for scientific study, and statistics is a discipline that deals
with the collection, presentation and analysis of data. In this chapter we are going
to study how we can summarize and describe a set of data. When we study a set
of data we need to identify the following important characteristics of the dataset.
Primary and secondary data. Data that we collect ourselves are called
primary data; we always have the individual values of the data. Data
collected by others are called secondary data. Sometimes the
data are grouped into a table, and are then called grouped data.
Frequency Distribution
Example 1
A random sample of 100 households in a town was selected and their
town gas consumption (in cubic metres) for the last month was recorded as follows:
55 82 83 109 78 87 95 94 85 67
80 109 83 89 91 104 90 103 67 52
107 78 86 29 72 66 92 99 60 75
88 112 97 88 49 62 70 66 88 62
72 85 81 78 77 41 105 92 94 74
78 75 87 83 71 99 56 69 78 60
119 39 104 86 67 79 98 102 82 91
46 120 73 125 132 86 48 55 112 28
42 24 130 100 46 57 31 129 137 59
102 51 135 53 105 110 107 46 108 117
(ii) Insofar as possible, equal class intervals are preferred. But the first and
last classes can be open-ended to cater for extreme values.
In example 1, the sample size is 100 and the range for the data is 113 (137 - 24).
A frequency distribution with six classes is appropriate and it is shown below.
Frequency distribution of household town gas consumption
Class limits: are the numbers that typically serve to identify the classes in a
listing of a frequency distribution. Thus, in the above frequency distribution, for
the class whose frequency is 30, its lower class limit is 80 and upper class limit is
99.
Class interval: is the width of a class. The class interval of a class is computed
by subtracting the lower limit (boundary) of the class from the lower limit
(boundary) of the next class.
Class midpoint or class mark: is the point dividing the class into equal halves
on the basis of class interval. This point can be obtained by adding the lower and
upper limits (boundaries) of a class and dividing by 2.
Relative frequency of a class: is the frequency of the class divided by the total
frequency of the distribution.
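As a sketch of how such a distribution can be tallied, the Python snippet below groups the 100 consumption figures of Example 1 into six classes. The class layout (20 - 39, 40 - 59, ..., 120 - 139) is an assumption inferred from the text, which gives the limits 80 and 99 for the class with frequency 30; the variable names are illustrative.

```python
data = [
    55, 82, 83, 109, 78, 87, 95, 94, 85, 67,
    80, 109, 83, 89, 91, 104, 90, 103, 67, 52,
    107, 78, 86, 29, 72, 66, 92, 99, 60, 75,
    88, 112, 97, 88, 49, 62, 70, 66, 88, 62,
    72, 85, 81, 78, 77, 41, 105, 92, 94, 74,
    78, 75, 87, 83, 71, 99, 56, 69, 78, 60,
    119, 39, 104, 86, 67, 79, 98, 102, 82, 91,
    46, 120, 73, 125, 132, 86, 48, 55, 112, 28,
    42, 24, 130, 100, 46, 57, 31, 129, 137, 59,
    102, 51, 135, 53, 105, 110, 107, 46, 108, 117,
]

FIRST_LOWER_LIMIT = 20   # assumed lower limit of the first class
CLASS_INTERVAL = 20      # lower limit of next class minus lower limit of this class
NUM_CLASSES = 6

frequencies = [0] * NUM_CLASSES
for x in data:
    frequencies[(x - FIRST_LOWER_LIMIT) // CLASS_INTERVAL] += 1

table = []
for i, f in enumerate(frequencies):
    lower = FIRST_LOWER_LIMIT + i * CLASS_INTERVAL
    upper = lower + CLASS_INTERVAL - 1
    midpoint = (lower + upper) / 2
    relative = f / len(data)
    table.append((lower, upper, midpoint, f, relative))
    print(f"{lower:>3} - {upper:<3}  midpoint {midpoint:6.1f}  f = {f:2d}  relative = {relative:.2f}")
```

Each row of `table` carries the quantities defined above: class limits, class midpoint, frequency and relative frequency.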
Measure of Central Tendency
A value that would describe the 'centre' of a distribution would be visually located
near the spot where most of the data seem to be concentrated. Consequently,
values that fulfil this role are called measures of central tendency.
The most common measures of the central tendency of a data set are the arithmetic
mean (or simply the mean), the median and the mode.
The mean of a set of numerical data is the sum of the set divided by the number
of observations, that is, their average.
The median of a distribution is the value which divides the distribution so that an
equal number of values lie on either side of it, i.e., half of the items have values
smaller or equal to it and half of the items have values larger or equal to it.
The mode of a set of numerical data is the value which occurs most frequently.
The following table shows the hourly wage rates of eight sampled construction
workers.
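Worker i:                    1    2    3    4    5    6    7    8
Hourly wage rate x_i ($):   35   38   46   60   65   69   72   78
$$\bar{x} = \frac{\sum_{i=1}^{8} x_i}{8} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{8} = \frac{463}{8} = 57.875 \ (\$)$$
Location of the median: $\frac{n + 1}{2} = \frac{9}{2} = 4.5$th
$$\text{Median} = \frac{x_4 + x_5}{2} = \frac{60 + 65}{2} = 62.5 \ (\$)$$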
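The same two results can be reproduced in Python with the standard-library `statistics` module, using the eight hourly wage rates from the table. The variable names are illustrative.

```python
from statistics import mean, median

hourly_wages = [35, 38, 46, 60, 65, 69, 72, 78]

wage_mean = mean(hourly_wages)

# With an even number of observations, the median lies at position (n + 1) / 2 = 4.5,
# so it is the average of the 4th and 5th ordered values.
median_position = (len(hourly_wages) + 1) / 2
wage_median = median(hourly_wages)

print(wage_mean, median_position, wage_median)
```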
The following table shows the daily wages of a random sample of construction
workers. Calculate its mean, median and mode.
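Solution
Daily Wages ($)   Number of Workers (f_i)   Class Mark (x_i)   f_i x_i     Cumulative Frequency (F_i)
200 - 399                  5                     299.5          1,497.5            5
400 - 599                 15                     499.5          7,492.5           20
600 - 799                 25                     699.5         17,487.5           45
800 - 999                 30                     899.5         26,985.0           75
1000 - 1199               18                   1,099.5         19,791.0           93
1200 - 1399                7                   1,299.5          9,096.5          100
Total                    100                                   82,350.0
$$\bar{x} = \frac{\sum_{i=1}^{6} f_i x_i}{\sum_{i=1}^{6} f_i} = \frac{82,350.0}{100} = 823.5 \ (\$)$$
$$\text{Median} = L_4 + \frac{0.5n - F_3}{f_4}(c_4)$$
where $L$ is the lower class boundary, $F$ is the cumulative frequency and $c$ is the class interval.
$$\text{Median} = 799.5 + \frac{0.5(100) - 45}{30}(200) = 832.8 \ (\$)$$
As $f_4' = 30$ is the largest relative density, the mode lies in the 4th class.
$$\text{Mode} = L_4 + \frac{f_4' - f_3'}{(f_4' - f_5') + (f_4' - f_3')}(c_4) = 799.5 + \frac{30 - 25}{(30 - 18) + (30 - 25)}(200) = 858.3 \ (\$)$$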
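The three grouped-data formulas can be sketched in Python as follows. The class limits and frequencies are those of the daily wages table; the lower class boundary is taken as the lower limit minus 0.5, which matches the value 799.5 used in the solution, and all classes are assumed to have the same width. Variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each daily-wage class
classes = [(200, 399, 5), (400, 599, 15), (600, 799, 25),
           (800, 999, 30), (1000, 1199, 18), (1200, 1399, 7)]

n = sum(f for _, _, f in classes)
width = classes[1][0] - classes[0][0]   # class interval: 400 - 200 = 200

# Mean: sum of (frequency x class mark) divided by total frequency.
grouped_mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

# Median: interpolate inside the first class whose cumulative frequency reaches n / 2.
cumulative = 0
for lo, hi, f in classes:
    if cumulative + f >= n / 2:
        grouped_median = (lo - 0.5) + (n / 2 - cumulative) / f * width
        break
    cumulative += f

# Mode: interpolate inside the class with the largest frequency.
freqs = [f for _, _, f in classes]
k = freqs.index(max(freqs))
before = freqs[k - 1] if k > 0 else 0
after = freqs[k + 1] if k < len(freqs) - 1 else 0
grouped_mode = (classes[k][0] - 0.5) + (freqs[k] - before) / (
    (freqs[k] - after) + (freqs[k] - before)) * width

print(f"mean = {grouped_mean:.1f}, median = {grouped_median:.1f}, mode = {grouped_mode:.1f}")
```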
Mean
Advantages: (i) All values in the distribution are used in its calculation, so
it can be regarded as more representative than the other
two measures.
Median
Advantage: Its result will not be affected by extreme values and open
end classes.
Mode
Advantages: (i) Its result will not be affected by extreme values and open
end classes.
(i) Always select the mean whenever there is no special reason for choosing
the other two measures.
(iii) Select the mode if an integral result is preferred, as in cases where the data are on
ordinal scales.
Example 1
Sample for Company A:   31   32   32   33   32
Sample for Company B:   28   29   32   35   36
Both samples have the same mean, 32 grams. It is obvious that company A, in
comparison with company B, bottles strawberry jam with a more consistent
content. We say that the variability of the observations is smaller for company A.
Therefore in buying strawberry jam we would feel more confident that the bottle
we select will be closer to the advertised average content if we buy from
company A.
The most important measures of variability or dispersion are the range, mean
deviation, standard deviation and variance.
(There are some other measures like quartile deviation and percentiles. We shall
not study these measures. Read our textbook if interested)
The range of a set of numbers is the difference between the largest and the
smallest number in the set.
The following table shows the hourly wage rates of eight sampled construction
workers.
Worker i:                    1    2    3    4    5    6    7    8
Hourly wage rate x_i ($):   35   38   46   60   65   69   72   78
Though range is simple and can be obtained easily, its result is unstable. This is
particularly true if the sample size is large. So whenever the sample size is over
10, we seldom choose to use range to indicate variability of the data.
Mean deviation is the average of the absolute deviation of the numerical data
from their mean.
Worker i:                        1        2        3       4       5       6        7        8
Hourly wage rate x_i ($):       35       38       46      60      65      69       72       78
|x_i - 57.875|:             22.875   19.875   11.875   2.125   7.125  11.125   14.125   20.125
$$\text{Mean deviation} = \frac{\sum_{i=1}^{8} |x_i - 57.875|}{8} = \frac{109.25}{8} = 13.656 \ (\$)$$
The mean deviation is a good measure to show the extent of variation of the data
in a distribution. However, when this measurement is used in further analysis, it
would give rise to some unnecessary tedious mathematical problem as a result
of its absolute value term. To avoid this pitfall, we can use the standard deviation
instead.
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
where $\mu$ is the population mean.
To compute the sample standard deviation ($s$) we use the above formula,
replacing $\mu$ by $\bar{x}$ and $N$ by $n - 1$.
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Worker i:                    1    2    3    4    5    6    7    8    Total
Hourly wage rate x_i ($):   35   38   46   60   65   69   72   78     463
$$s = \sqrt{\frac{\sum (x_i - 57.875)^2}{7}} = 16.226 \ (\$)$$
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
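The range, mean deviation, sample variance and sample standard deviation for the eight hourly wage rates can be checked with the Python sketch below. `variance` and `stdev` from the standard-library `statistics` module use the divisor n - 1, matching the sample formulas above; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

hourly_wages = [35, 38, 46, 60, 65, 69, 72, 78]

wage_range = max(hourly_wages) - min(hourly_wages)

centre = mean(hourly_wages)
mean_deviation = sum(abs(x - centre) for x in hourly_wages) / len(hourly_wages)

sample_variance = variance(hourly_wages)   # divisor n - 1
sample_sd = stdev(hourly_wages)            # square root of the sample variance

print(f"range = {wage_range}, mean deviation = {mean_deviation:.3f}")
print(f"s^2 = {sample_variance:.3f}, s = {sample_sd:.3f}")
```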
The following table shows the daily wages of a random sample of construction
workers. Calculate its mean deviation, variance, and standard deviation.
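Solution
Daily Wages ($)   Number of Workers (f_i)   Class Mark (x_i)   |x_i - 823.5|   f_i |x_i - 823.5|   f_i (x_i - 823.5)^2
200 - 399                  5                     299.5              524              2,620             1,372,880
400 - 599                 15                     499.5              324              4,860             1,574,640
600 - 799                 25                     699.5              124              3,100               384,400
800 - 999                 30                     899.5               76              2,280               173,280
1000 - 1199               18                   1,099.5              276              4,968             1,371,168
1200 - 1399                7                   1,299.5              476              3,332             1,586,032
Total                    100                                                        21,160             6,462,400
$$\text{Mean deviation} = \frac{\sum_{i=1}^{6} f_i |x_i - \bar{x}|}{\sum_{i=1}^{6} f_i} = \frac{21,160}{100} = 211.60 \ (\$)$$
$$\text{Variance } (s^2) = \frac{6,462,400}{99} = 65,276.77$$
$$\text{Standard deviation} = \sqrt{65,276.77} = 255.49$$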
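A minimal Python sketch of the grouped-data calculation follows, using the class marks and frequencies of the daily wages distribution. As in the solution above, the mean deviation divides by the total frequency and the variance by the total frequency minus one. Variable names are illustrative.

```python
import math

# (class mark, frequency) for each daily-wage class
grouped = [(299.5, 5), (499.5, 15), (699.5, 25),
           (899.5, 30), (1099.5, 18), (1299.5, 7)]

n = sum(f for _, f in grouped)
grouped_mean = sum(f * x for x, f in grouped) / n

sum_abs = sum(f * abs(x - grouped_mean) for x, f in grouped)
sum_sq = sum(f * (x - grouped_mean) ** 2 for x, f in grouped)

grouped_mean_deviation = sum_abs / n
grouped_variance = sum_sq / (n - 1)          # sample divisor n - 1
grouped_sd = math.sqrt(grouped_variance)

print(f"mean deviation = {grouped_mean_deviation:.2f}")
print(f"variance = {grouped_variance:.2f}, standard deviation = {grouped_sd:.2f}")
```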
The values of the standard deviations cannot be used as the bases of the
comparison because:
(a) units of measurements of the two distributions may be different,
and
(b) average values of two distributions may be widely dissimilar.
The correct measure that should be used is the coefficient of variation (CV).
$$CV = \frac{s}{\bar{x}} \times 100\%$$
Example 4
The following table shows the summary statistics for the daily wages of two types
of workers.
$$CV_I = \frac{20}{100} \times 100\% = 20\% \qquad CV_{II} = \frac{24}{150} \times 100\% = 16\%$$
Variation: I > II
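The comparison can be sketched in Python as below. The standard deviations (20 and 24) and means (100 and 150) are read from the two coefficients quoted in the example; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean


cv_type_1 = coefficient_of_variation(sd=20, mean=100)
cv_type_2 = coefficient_of_variation(sd=24, mean=150)

print(f"CV I = {cv_type_1:.0f}%, CV II = {cv_type_2:.0f}%")
```

Type II has the larger standard deviation in absolute terms, yet relative to its mean it is the less variable of the two.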
PART B: Probability
"Perhaps it was man's unquenchable thirst for gambling that led to the early
development of probability theory. In an effort to increase their winnings,
gamblers called upon the mathematicians to provide optimum strategies for
various games of chance." (from Walpole, R.E., Introduction to Statistics)
Probability is the basis upon which the discipline of statistics has been developed
and applied in many fields associated with chance occurrences such as politics,
business, weather forecasting, and scientific research. Probability may be taken
as a tool with which we may solve problems involving uncertainties. In fact
uncertainty is a basic element of human experiences. To cite some examples:
travelling time, number of customers, rainfall, temperature, share price
movement, length of our life, etc.
The third approach is very mathematical. A number of axioms have been set up
and from these some theorems of probability have been developed. This
approach is too abstract and usually used by mathematicians.
Example 1
Three items are selected at random from a manufacturing process. Each item is
inspected and classified defective (D) or non-defective (N).
Example 2
The event that the number of defectives in above example is greater than 1.
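Examples 1 and 2 can be enumerated directly. The Python sketch below lists every possible inspection result for three items and then picks out the outcomes with more than one defective; the variable names are illustrative.

```python
from itertools import product

# Each of the three items is either defective (D) or non-defective (N).
sample_space = ["".join(outcome) for outcome in product("DN", repeat=3)]

# Example 2: the event that more than one item is defective.
more_than_one_defective = [o for o in sample_space if o.count("D") > 1]

print(sample_space)
print(more_than_one_defective)
```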
Example 3
Suppose a licence plate contains two letters followed by three digits, with the
first digit not zero. How many different licence plates can be printed?
(26)(26)(9)(10)(10) = 608,400
Example 4
Find the possible permutations (the number of ways where sequence of the
letters is counted) from 3 letters A, B, C.
$${}_{n}P_{r} = \frac{(n)(n-1)\cdots(n-r+2)(n-r+1)(n-r)(n-r-1)\cdots(2)(1)}{(n-r)(n-r-1)\cdots(2)(1)} = \frac{n!}{(n-r)!}$$
Example 5
How many 7-letter words can be formed using the letters of the word
'BENZENE'?
The number of 7-letter words that can be formed is $\frac{7!}{(1!)(3!)(2!)(1!)} = 420$
$${}_{n}C_{r} = \frac{n!}{r!(n-r)!}$$
The answer: $\frac{5!}{3!(5-3)!} = 10$
Example 6
$${}_{4}C_{3} = \frac{4!}{3!(4-3)!} = 4$$
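The counting results in Examples 3 to 6 can be verified with the Python standard library. `math.perm` and `math.comb` (available from Python 3.8) compute permutations and combinations; the helper `distinct_arrangements` is an illustrative name for the repeated-letters formula.

```python
import math
from collections import Counter
from itertools import permutations


def distinct_arrangements(word):
    """Number of distinct orderings of the letters of word: n! / (n1! n2! ...)."""
    result = math.factorial(len(word))
    for count in Counter(word).values():
        result //= math.factorial(count)
    return result


licence_plates = 26 * 26 * 9 * 10 * 10          # Example 3
abc_orderings = list(permutations("ABC"))       # Example 4
benzene_words = distinct_arrangements("BENZENE")  # Example 5

print(licence_plates, len(abc_orderings), math.perm(3, 3))
print(benzene_words, math.comb(5, 3), math.comb(4, 3))
```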
Example 7
A box contains 8 eggs, 3 of which are rotten. Three eggs are picked at random.
Find the probabilities of the following events.
(a) Exactly two eggs are rotten.
(b) All eggs are rotten.
(c) No egg is rotten.
Solution:
(a) The 8 eggs can be divided into 2 groups, namely, 3 rotten eggs as the first
group and 5 good eggs as the second group.
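Thus the probability of having exactly two rotten among the 3 randomly
selected eggs is $\frac{{}_{3}C_{2} \times {}_{5}C_{1}}{{}_{8}C_{3}} = \frac{15}{56}$
(b) $\frac{{}_{3}C_{3} \times {}_{5}C_{0}}{{}_{8}C_{3}} = \frac{1}{56}$
(c) $\frac{{}_{3}C_{0} \times {}_{5}C_{3}}{{}_{8}C_{3}} = \frac{10}{56} = \frac{5}{28}$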
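All three parts of Example 7 follow the same pattern: choose k of the 3 rotten eggs and the remaining 3 - k from the 5 good eggs, out of all ways of choosing 3 from 8. The Python sketch below uses exact fractions; the function name `prob_rotten` is an illustrative choice.

```python
from fractions import Fraction
from math import comb

ROTTEN, GOOD, PICKED = 3, 5, 3


def prob_rotten(k):
    """Probability that exactly k of the picked eggs are rotten."""
    favourable = comb(ROTTEN, k) * comb(GOOD, PICKED - k)
    return Fraction(favourable, comb(ROTTEN + GOOD, PICKED))


p_two_rotten = prob_rotten(2)   # part (a)
p_all_rotten = prob_rotten(3)   # part (b)
p_none_rotten = prob_rotten(0)  # part (c)

print(p_two_rotten, p_all_rotten, p_none_rotten)
```

Summing `prob_rotten(k)` over k = 0, 1, 2, 3 gives 1, which is a quick check that no case has been missed.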
Rules of probability
Addition Rule: For any events that are not mutually exclusive
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
where $A \cup B$ is the union of two sets A and B, that is, the set of elements that
belong to A or to B or to both.
Illustrative example
180 students took examinations in English and Mathematics. Their results were
as follows:
Example 8
A card is drawn from a complete deck of playing cards. What is the probability
that the card is a heart or an ace?
Solution
Let A be the event of getting a heart, and B be the event of getting an ace.
The probability that the card is a heart or an ace is $P(A \cup B)$.
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$
What is the probability of getting a total of '7' or '11' when a pair of dice are
tossed?
Solution
Total number of possible outcomes = (6)(6) = 36
Possible outcomes of getting a total of '7' :{1,6; 2,5; 3,4; 4,3; 5,2; 6,1}
Possible outcomes of getting a total of '11' : {5,6; 6,5}
Let A be the event of getting a total of '7', and B be the event of getting a total of
'11'.
The probability of getting a total of '7' or '11' is $P(A \cup B)$.
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B)$ (A and B are mutually exclusive)
$= \frac{6}{36} + \frac{2}{36} = \frac{2}{9}$
Example 9
A coin is tossed six times in succession. What is the probability that at least one
head occurs?
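P(at least one head) $= 1 - \left(\frac{1}{2}\right)\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)\left(\frac{1}{2}\right)\left(\frac{1}{2}\right) = 1 - \frac{1}{64} = \frac{63}{64}$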
Conditional Probability
Let A and B be two events. The conditional probability of event A given that event
B has occurred, denoted by $P(A/B)$, is defined as
$P(A/B) = \frac{P(A \cap B)}{P(B)}$, provided that $P(B) > 0$.
Similarly, the conditional probability of B given that event A has occurred is
defined as $P(B/A) = \frac{P(A \cap B)}{P(A)}$, provided $P(A) > 0$.
Example 10
A hamburger chain found that 75% of all customers use mustard, 80% use
ketchup, and 65% use both, when ordering a hamburger. What are the
probabilities that:
(a) a ketchup-user uses mustard?
(b) a mustard-user uses ketchup?
Solution
Let A be the event of using mustard, and B be the event of using ketchup.
It is given that: $P(A) = 0.75$; $P(B) = 0.80$; $P(A \cap B) = 0.65$
(a) P(a ketchup-user uses mustard) $= P(A/B) = \frac{P(A \cap B)}{P(B)} = \frac{0.65}{0.80} = 0.8125$
(b) P(a mustard-user uses ketchup) $= P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{0.65}{0.75} = 0.8667$
Multiplicative Rule
$P(A \cap B) = P(A)\,P(B/A) = P(B)\,P(A/B)$
Statistical independence: two events are statistically independent if the occurrence
or non-occurrence of one event has no effect on the probability of occurrence of the other event.
Solution
Let Ai be the event of getting '7' in the i-th throw and B j be the event of getting
'11' in the j-th throw.
$\frac{6}{36} \cdot \frac{2}{36} + \frac{2}{36} \cdot \frac{6}{36} = \frac{1}{54}$
$P(A) = P(B_1)P(A/B_1) + P(B_2)P(A/B_2) + \dots + P(B_k)P(A/B_k)$
Example 12
Suppose 50% of the cars are manufactured in the United States and 15% of
these are compact; 30% of the cars are manufactured in Europe and 40% of
these are compact; and finally, 20% are manufactured in Japan and 60% of
these are compact. If a car is picked at random from the lot, find the probability
that it is a compact.
$P(A) = P(A \cap B_1) + P(A \cap B_2) + P(A \cap B_3)$
$= P(B_1)P(A/B_1) + P(B_2)P(A/B_2) + P(B_3)P(A/B_3)$
$= (0.50)(0.15) + (0.30)(0.40) + (0.20)(0.60) = 0.315$
Bayes' Theorem
If $E_1, E_2, \dots, E_k$ are mutually exclusive events such that $E_1 \cup E_2 \cup \dots \cup E_k$ contains
the event D, with $P(D) > 0$, then
$P(E_i/D) = \frac{P(E_i \cap D)}{P(D)} = \frac{P(E_i \cap D)}{\sum_{j=1}^{k} P(E_j \cap D)}$
$= \frac{P(E_i)P(D/E_i)}{P(E_1)P(D/E_1) + P(E_2)P(D/E_2) + \dots + P(E_k)P(D/E_k)}$
Example 13
Suppose a box contains 2 red balls and 1 white ball and a second box contains 2
red balls and 2 white balls. One of the boxes is selected by chance and a ball is
drawn from it. If the drawn ball is red, what is the probability that it came from the
1st box?
Solution
Let A be the event of drawing a red ball and B be the event of choosing the 1st
box.
Given: $P(B) = P(B') = \frac{1}{2}$; $P(A/B) = \frac{2}{3}$; $P(A/B') = \frac{2}{4}$
P(coming from the 1st box / the drawn ball is red) $= P(B/A)$
$= \frac{P(A \cap B)}{P(A)} = \frac{P(A \cap B)}{P(A \cap B) + P(A \cap B')}$
$= \frac{P(B)P(A/B)}{P(B)P(A/B) + P(B')P(A/B')} = \frac{(\frac{1}{2})(\frac{2}{3})}{(\frac{1}{2})(\frac{2}{3}) + (\frac{1}{2})(\frac{2}{4})} = \frac{4}{7}$
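The calculation in Example 13 follows the same pattern for any number of boxes, so it can be sketched as a small function. The name `posterior` and its argument layout are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction


def posterior(priors, likelihoods, i):
    """Bayes' theorem: P(E_i / D) from priors P(E_j) and likelihoods P(D / E_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(E_j and D)
    return joint[i] / sum(joint)                          # divide by P(D)


# Example 13: two boxes chosen with equal chance; D = "the drawn ball is red".
priors = [Fraction(1, 2), Fraction(1, 2)]
likelihoods = [Fraction(2, 3), Fraction(2, 4)]
p_box1_given_red = posterior(priors, likelihoods, 0)
print(p_box1_given_red)  # 4/7
```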
Example 1
so X = {0, 1, 2, 3}
Example 2
x         0      1      2      3      4
P(X=x)    1/16   4/16   6/16   4/16   1/16
That is, $P(X = x) = \frac{{}_4C_x}{16}$, $x = 0, 1, 2, 3, 4$
Example 3
Let random variable, X be the sum of the two dice. Then the probability
distribution of X is:
x         2     3     4     5     6     7     8     9     10    11    12
P(X=x)    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
f ( x) P( X x)
where the function is evaluated at all possible values of x.
2. $\sum_x P(X = x) = 1$.
Mathematical Expectations
$E(X)$ or $\mu_x = \sum_x x\,P(X = x)$
$Var(X)$ or $\sigma_x^2 = E[(X - \mu_x)^2]$
$= \sum_x (x - \mu_x)^2 P(X = x)$
$= \sum_x x^2 P(X = x) - \mu_x^2$
Example 4
Calculate the mean and variance of the discrete probability distribution in
example 2 and 3.
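One way to carry out Example 4 is to apply the definitions of $E(X)$ and $Var(X)$ directly to the two probability tables. The sketch below uses exact fractions; the dictionary names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def mean_and_variance(pmf):
    """Return (E(X), Var(X)) for a discrete distribution given as {x: P(X=x)}."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return mean, second_moment - mean ** 2


# Example 2: P(X = x) = 4Cx / 16 for x = 0, ..., 4
pmf_ex2 = {x: Fraction(comb(4, x), 16) for x in range(5)}
# Example 3: sum of two dice, P(X = s) = (6 - |s - 7|) / 36
pmf_ex3 = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}

print(mean_and_variance(pmf_ex2))  # mean 2, variance 1
print(mean_and_variance(pmf_ex3))  # mean 7, variance 35/6
```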
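Definition :
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$ for $-\infty < x < \infty$
where $\pi = 3.14159\ldots$
1. It is a continuous distribution.
2. The curve is symmetric and bell-shaped about a vertical axis through the
mean $\mu$.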
3. The total area under the curve and above the horizontal axis is equal to 1.
Notation : Z ~ N(0, 1)
Example 7
Given Z ~ N(0, 1)
(b) P(0 < Z < 1.73) = P(Z > 0) - P(Z > 1.73) = 0.5 - 0.0418 = 0.4582
(c) P(-2.42 < Z < 0.8) = 1 - P(Z < -2.42) - P(Z > 0.8)
= 1 - 0.00776 - 0.2119 = 0.78034
(d) P(1.8 < Z < 2.8) = P(Z > 1.8) - P(Z > 2.8) = 0.0359 - 0.00256 = 0.03334
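The table look-ups in Example 7 can be reproduced numerically. The sketch below builds the standard normal distribution function from `math.erf`; the helper names `phi` and `prob_between` are assumptions for illustration.

```python
from math import erf, sqrt


def phi(z: float) -> float:
    """Standard normal distribution function, P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_between(a: float, b: float) -> float:
    """P(a < Z < b) for Z ~ N(0, 1)."""
    return phi(b) - phi(a)


print(round(prob_between(0.0, 1.73), 4))    # part (b)
print(round(prob_between(-2.42, 0.8), 4))   # part (c)
print(round(prob_between(1.8, 2.8), 4))     # part (d)
```

The results agree with the tabulated answers to about three decimal places; small differences come from rounding in the printed tables.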
Let the corresponding z value be z1, then we have P(Z < z1) = 0.05.
From the standard normal distribution table we have P(Z < -1.64) = 0.05.
So z1 = -1.64
Let the corresponding z value be z1, then we have P(0 < Z < z1) =
0.3944.
From the standard normal distribution table we have P(0 < Z < 1.25) =
0.3944.
So z1 = 1.25
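Theorem :
If $X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$, so that
$P(x_1 < X < x_2) = P\left(\frac{x_1 - \mu}{\sigma} < Z < \frac{x_2 - \mu}{\sigma}\right)$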
Example 8
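Solution: $P(45 < X < 62) = P\left(\frac{45 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{62 - \mu}{\sigma}\right)$
$= P\left(\frac{45 - 50}{10} < Z < \frac{62 - 50}{10}\right)$
$= P(-0.5 < Z < 1.2)$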
Example 9
(a) $P(X > 125) = P\left(\frac{X - \mu}{\sigma} > \frac{125 - 80}{30}\right) = P(Z > 1.5) = 0.0668$
(b) $P(65 < X < 95) = P\left(\frac{65 - 80}{30} < Z < \frac{95 - 80}{30}\right) = P(-0.5 < Z < 0.5)$
Example 10
On an examination the average grade was 74 and the standard deviation was 7.
If 12% of the class are given A's, and the grades are curved to follow a normal
distribution, what is the lowest possible A and the highest possible B?
$P(X \ge x_1) = 0.12 \Rightarrow P\left(Z \ge \frac{x_1 - 74}{7}\right) = 0.12$
Thus $\frac{x_1 - 74}{7} = 1.175$
i.e. $x_1 = 74 + (7)(1.175) = 82.2 \approx 83$
Example 11
1. In testing 10 items as they come off an assembly line, where each test or
trial may indicate a defective or a non-defective item.
2. Five cards are drawn with replacement from an ordinary deck and each
trial is labelled a success or failure depending on whether the card is red
or black.
Definition :
Notation : X ~ b(n, p)
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, $x = 0, 1, \ldots, n$
where $p + q = 1$
Example 12
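(a) $P(X = 2) = \binom{20}{2}\left(\frac{1}{10}\right)^2\left(\frac{9}{10}\right)^{18} = 0.28517$
(b) $P(X \ge 2) = 1 - P(X = 0) - P(X = 1)$
$= 1 - \binom{20}{0}\left(\frac{1}{10}\right)^0\left(\frac{9}{10}\right)^{20} - \binom{20}{1}\left(\frac{1}{10}\right)^1\left(\frac{9}{10}\right)^{19}$
$= 1 - 0.12158 - 0.27017 = 0.60825$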
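The binomial probabilities in Example 12 can be checked with a few lines of code. The sketch assumes X ~ b(20, 0.1), which is read off the worked numbers above rather than stated in the surviving text; `binom_pmf` is an illustrative name.

```python
from math import comb


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ b(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 20, 0.1  # assumed parameters, inferred from the worked solution
p_exactly_2 = binom_pmf(2, n, p)
p_at_least_2 = 1 - binom_pmf(0, n, p) - binom_pmf(1, n, p)

print(round(p_exactly_2, 4), round(p_at_least_2, 4))
```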
Example 13
A test consists of 6 questions, and to pass the test a student has to answer at
least 4 questions correctly. Each question has three possible answers, of which
only one is correct. If a student guesses on each question, what is the probability
that the student will pass the test?
Let X be the no. of correctly answered questions among 6 questions. $X \sim b(6, \frac{1}{3})$
$P(X \ge 4) = \sum_{x=4}^{6} P(X = x) = \sum_{x=4}^{6} \binom{6}{x}\left(\frac{1}{3}\right)^x\left(\frac{2}{3}\right)^{6-x}$
Theorem
The mean and variance of the binomial distribution with parameters n and p
are
$\mu = np$ and $\sigma^2 = npq$ respectively, where $p + q = 1$.
Example 14
$P(X = x) \approx P\left(\frac{(x - 0.5) - np}{\sqrt{npq}} < Z < \frac{(x + 0.5) - np}{\sqrt{npq}}\right)$
Remark : If both np and nq are greater than 5, the approximation will be good.
Example 15
A process yields 10% defective items. If 100 items are randomly selected from
the process, what is the probability that the number of defective exceeds 13?
Let X be the no. of defective in a random sample of 100 items. X ~ b(100, 0.1)
$P(X > 13) \approx P(X' > 13.5) = P\left(Z > \frac{13.5 - 10}{3}\right) = P(Z > 1.167) = 0.121$
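To see how good the normal approximation in Example 15 is, the exact binomial tail can be summed and compared with the continuity-corrected normal value. This is a sketch only; the function names are assumptions.

```python
from math import comb, erf, sqrt


def normal_upper_tail(z: float) -> float:
    """P(Z > z) for Z ~ N(0, 1)."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))


def binom_upper_exact(k: int, n: int, p: float) -> float:
    """Exact P(X > k) for X ~ b(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1, n + 1))


def binom_upper_normal(k: int, n: int, p: float) -> float:
    """Normal approximation to P(X > k) with the 0.5 continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return normal_upper_tail((k + 0.5 - mu) / sigma)


approx = binom_upper_normal(13, 100, 0.1)
exact = binom_upper_exact(13, 100, 0.1)
print(round(approx, 4), round(exact, 4))
```

Here np = 10 and nq = 90 are both greater than 5, so the two values are expected to be close.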
Example 16
A multiple-choice quiz has 200 questions each with four possible answers of
which only one is the correct answer. What is the probability that sheer
guesswork yields from 25 to 30 correct answers for 80 of the 200 problems about
which the student has no knowledge?
Let X be the no. of correct answers for 80 with sheer guesswork. X ~ b(80, 0.25)
$P(25 \le X \le 30) \approx P\left(\frac{24.5 - 20}{\sqrt{15}} < Z < \frac{30.5 - 20}{\sqrt{15}}\right) = P(1.16 < Z < 2.71) = 0.1230 - 0.00336 = 0.1196$
Definition :
Notation : $X \sim Po(\lambda)$
$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$, $x = 0, 1, 2, \ldots$
where $e = 2.71828\ldots$
Example 17
Let X be the no. of particles entering the counter in a given millisecond. X ~ Po(4)
$P(X = 6) = \frac{e^{-4} 4^6}{6!} = 0.1042$
Example 19
Ships arrive in a harbour at a mean rate of two per hour. Suppose that this
situation can be described by a Poisson distribution. Find the probabilities for a
30-minute period that
(a) $P(X = 0) = \frac{e^{-1} 1^0}{0!} = 0.3679$
(b) $P(X = 3) = \frac{e^{-1} 1^3}{3!} = 0.0613$
Theorem :
The mean and variance of the Poisson distribution are both equal to $\lambda$.
If n is large and p is near 0 or near 1.00 in the binomial distribution, then the
binomial distribution can be approximated by the Poisson distribution with
parameter $\lambda = np$.
Example 20
If the prob. that an individual suffers a bad reaction from a certain injection is
0.001, determine the prob. that out of 2000 individuals, more than 2 individuals
will suffer a bad reaction.
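Exact (binomial) probability
$= 1 - \left[\binom{2000}{0}(0.001)^0(0.999)^{2000} + \binom{2000}{1}(0.001)^1(0.999)^{1999} + \binom{2000}{2}(0.001)^2(0.999)^{1998}\right]$
Using the Poisson approximation with $\lambda = np = (2000)(0.001) = 2$:
P(0 suffers) $= \frac{2^0 e^{-2}}{0!} = \frac{1}{e^2}$
P(1 suffers) $= \frac{2^1 e^{-2}}{1!} = \frac{2}{e^2}$
P(2 suffer) $= \frac{2^2 e^{-2}}{2!} = \frac{2}{e^2}$
Then the required probability $= 1 - \frac{5}{e^2} = 0.323$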
Generally speaking, the Poisson distribution will provide a good approximation to
the binomial when
(i) n is at least 20 and p is at most 0.05; or
(ii) n is at least 100, in which case the approximation will generally be excellent provided p <
0.1.
Example 21
Two percent of the output of a machine is defective. A lot of 300 pieces will be
produced. Determine the probability that exactly four pieces will be defective.
Let X be the no. of defective pieces among 300 pieces. X ~ b(300, 0.02)
By Poisson Approximation:
$\lambda = np = (300)(0.02) = 6$
$P(X = 4) \approx \frac{e^{-6} 6^4}{4!} = 0.1339$
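The quality of the Poisson approximation in Examples 20 and 21 can be checked against the exact binomial values. The sketch below is illustrative; the function and variable names are assumptions.

```python
from math import comb, exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for X ~ Po(lam)."""
    return exp(-lam) * lam**x / factorial(x)


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ b(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Example 21: exactly four defectives among 300 pieces, p = 0.02
approx_ex21 = poisson_pmf(4, 300 * 0.02)
exact_ex21 = binom_pmf(4, 300, 0.02)

# Example 20: more than 2 bad reactions among 2000 individuals, p = 0.001
approx_ex20 = 1 - sum(poisson_pmf(x, 2000 * 0.001) for x in range(3))
exact_ex20 = 1 - sum(binom_pmf(x, 2000, 0.001) for x in range(3))

print(round(approx_ex21, 4), round(exact_ex21, 4))
print(round(approx_ex20, 4), round(exact_ex20, 4))
```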
Definition
An illustrating example
From the above table, we can see that if we draw a sample and use the sample
mean to estimate the population mean, the accuracy of our estimate depends on
which sample we have drawn, which in turn depends on chance.
Hence the average value of the sample mean is equal to the population mean.
We call the sample mean an unbiased estimator of the population mean.
The variance of the sample mean (i.e., the average squared deviation of the
sample mean from the population mean) is: $V(\bar{y}) = \sum (\bar{y} - \bar{Y})^2 P(\bar{y})$
$= (0.5 - 1.5)^2 \cdot \frac{1}{6} + (1.0 - 1.5)^2 \cdot \frac{1}{6} + (1.5 - 1.5)^2 \cdot \frac{2}{6} + (2.0 - 1.5)^2 \cdot \frac{1}{6} + (2.5 - 1.5)^2 \cdot \frac{1}{6}$
$= 0.4167$
(i) If repeated samples of size n are drawn from any infinite population with
mean $\mu$ and variance $\sigma^2$, and n is large ($n \ge 30$), the distribution of $\bar{x}$, the
sample mean, is approximately normal, with mean $\mu$ (i.e. $E(\bar{x}) = \mu$) and
variance $\sigma^2/n$ (i.e. $V(\bar{x}) = \frac{\sigma^2}{n}$), and this approximation becomes better as n
becomes larger.
(ii) If n is small, say less than 30, the normal approximation is less
reliable. A t-distribution will be used (discussed later).
In the above example, N = 4, n = 2, and $V(\bar{x}) = \left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n} = (1 - 2/4)(1.6667/2) = 0.4167$. If
the population is big (or the sample is drawn with replacement), then
$V(\bar{x}) = \frac{\sigma^2}{n} = 1.6667/2 = 0.8333$.
n
In this course we assume a big population or sampling with replacement.
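The sampling distribution of the mean in the illustrating example can be rebuilt by listing every possible sample. The population table did not survive in the text, so the sketch below assumes the population is {0, 1, 2, 3}; this assumption reproduces the sample means 0.5 to 2.5, the population mean 1.5 and the variance 1.6667 quoted above.

```python
from itertools import combinations
from statistics import mean, variance

population = [0, 1, 2, 3]  # assumed population, inferred from the quoted figures
n = 2                      # sample size, drawn without replacement
N = len(population)

# Every possible sample of size n is equally likely.
sample_means = [mean(sample) for sample in combinations(population, n)]
expected_mean = mean(sample_means)
var_of_mean = sum((m - expected_mean) ** 2 for m in sample_means) / len(sample_means)

# Finite-population formula: (1 - n/N) * S^2 / n, with S^2 using divisor N - 1.
formula_value = (1 - n / N) * variance(population) / n

print(expected_mean, round(var_of_mean, 4), round(formula_value, 4))
```

Under this assumption the enumeration and the formula agree, which is the point of the unbiasedness and variance statements above.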
Example 1
An electrical firm manufactures light bulbs that have a length of life that is
approximately normally distributed with mean equal to 800 hours and a standard
deviation of 40 hours. Find the probability that a random sample of 16 bulbs will
have an average life of less than 775 hours.
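Let $\bar{X}$ be the average life of the 16 bulbs. $\bar{X} \sim N\left(\mu_{\bar{x}} = 800,\ \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} = \frac{40^2}{16}\right)$
$P(\bar{X} < 775) = P\left(\frac{\bar{X} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} < \frac{775 - \mu_{\bar{x}}}{\sigma_{\bar{x}}}\right) = P\left(Z < \frac{775 - 800}{40/\sqrt{16}}\right) = P(Z < -2.5)$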
Example 2
The mean IQ scores of all students attending a college is 110 with a standard
deviation of 10.
(a) If the IQ scores are normally distributed, what is the probability that the
score of any one student is greater than 112?
(b) What is the probability that the mean score in a random sample of 36
students is greater than 112?
(c) What is the probability that the mean score in a random sample of 100
students is greater than 112?
Solution
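(a) $P(X > 112) = P\left(Z > \frac{112 - 110}{10}\right) = P(Z > 0.2) = 0.4207$
(b) Let $\bar{X}_1$ be the mean score of a sample of 36 students.
$\bar{X}_1 \sim N\left(\mu_{\bar{x}} = 110,\ \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} = \frac{10^2}{36}\right)$
$P(\bar{X}_1 > 112) = P\left(Z > \frac{112 - 110}{10/\sqrt{36}}\right) = P(Z > 1.2) = 0.1151$
(c) Let $\bar{X}_2$ be the mean score of a sample of 100 students.
$\bar{X}_2 \sim N\left(\mu_{\bar{x}} = 110,\ \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} = \frac{10^2}{100}\right)$
$P(\bar{X}_2 > 112) = P\left(Z > \frac{112 - 110}{10/\sqrt{100}}\right) = P(Z > 2) = 0.0228$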
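Example 2 shows how the same threshold becomes less likely to be exceeded as the sample size grows, because the standard deviation of the sample mean is $\sigma/\sqrt{n}$. A minimal sketch using the standard library's `statistics.NormalDist` (Python 3.8 or later); the function name is an assumption:

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_exceeds(threshold: float, mu: float, sigma: float, n: int) -> float:
    """P(sample mean > threshold) when the mean of n draws is N(mu, sigma^2 / n)."""
    return 1.0 - NormalDist(mu, sigma / sqrt(n)).cdf(threshold)


for n in (1, 36, 100):
    print(n, round(prob_mean_exceeds(112, 110, 10, n), 4))
```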
Estimation
Estimation is the process of using statistics from sample data to estimate the
parameters of the population. A statistic is a random variable which depends on
which sample is drawn from a population.
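1. $\bar{x}$ is a point estimator of $\mu$
2. $s^2$ is a point estimator of $\sigma^2$
3. $\hat{p}$ is a point estimator of $P$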
There are two important properties for an estimator, namely, unbiasedness and
efficiency.
For a point estimate, both the accuracy and reliability of the estimation are
unknown. For an interval estimate, the width of the interval gives the accuracy
and the probability gives the reliability of the estimation.
Examples 3
(a) The mean and standard deviation for the quality point averages of a
random sample of 36 college seniors are calculated to be 2.6 and 0.3,
respectively. Find a 95% confidence interval for the mean of the entire
senior class.
Solution
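$\bar{x} - z_{0.025}\frac{\hat{\sigma}}{\sqrt{n}} < \mu < \bar{x} + z_{0.025}\frac{\hat{\sigma}}{\sqrt{n}}$
$2.6 - 1.96\frac{0.3}{\sqrt{36}} < \mu < 2.6 + 1.96\frac{0.3}{\sqrt{36}}$
$2.502 < \mu < 2.698$
(b) Let $n_1$ be the required sample size.
$z_{0.025}\frac{\hat{\sigma}}{\sqrt{n_1}} < 0.05 \Rightarrow 1.96\frac{0.3}{\sqrt{n_1}} < 0.05$
$n_1 > \left[\frac{(1.96)(0.3)}{0.05}\right]^2 = 138.30 \approx 139$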
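Both parts of Example 3 use the same quantity, $z\hat{\sigma}/\sqrt{n}$, once as a half-width and once solved for n. A short sketch, with illustrative function names and the table value 1.96 taken from the text:

```python
from math import ceil, sqrt

Z_0025 = 1.96  # z value for a 95% interval, as used in the text


def mean_ci(xbar: float, sd: float, n: int, z: float = Z_0025):
    """Large-sample confidence interval for the mean."""
    half_width = z * sd / sqrt(n)
    return xbar - half_width, xbar + half_width


def sample_size(sd: float, margin: float, z: float = Z_0025) -> int:
    """Smallest n for which z * sd / sqrt(n) does not exceed `margin`."""
    return ceil((z * sd / margin) ** 2)


low, high = mean_ci(2.6, 0.3, 36)
print(round(low, 3), round(high, 3))  # 2.502 2.698
print(sample_size(0.3, 0.05))         # 139
```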
A summary table for constructing a $(1 - \alpha)100\%$ confidence interval for the mean and
proportion
Example 4
The contents of seven similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8,
10.0, 10.2 and 9.6 liters. Find a 95% confidence interval for the mean of all such
containers, assuming an approximate normal distribution.
Solution
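Given: $n = 7$, $\sum x = 70$, $\sum x^2 = 700.48$
$\bar{x} = \frac{\sum x}{n} = \frac{70}{7} = 10$
$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{700.48 - \frac{70^2}{7}}{6} = 0.08$
$\bar{x} - t_{6,0.025}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{6,0.025}\frac{s}{\sqrt{n}}$
$10 - 2.447\frac{0.2828}{\sqrt{7}} < \mu < 10 + 2.447\frac{0.2828}{\sqrt{7}}$
$9.738 < \mu < 10.262$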
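The small-sample interval for the sulfuric acid containers can be reproduced from the raw data. The standard library has no t-distribution, so the sketch below simply reuses the table value $t_{6,0.025} = 2.447$ quoted in the solution; variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

contents = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]  # litres, from the example
T_6_0025 = 2.447  # t value for 6 degrees of freedom, taken from the text

xbar = mean(contents)
s = stdev(contents)  # sample standard deviation (divisor n - 1)
half_width = T_6_0025 * s / sqrt(len(contents))
interval = (xbar - half_width, xbar + half_width)

print(round(xbar, 3), round(s**2, 3))
print(round(interval[0], 3), round(interval[1], 3))
```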
Example 5
Let P be the actual proportion of families in this city with colour sets.
$\hat{p} - z_{0.025}\sqrt{\frac{\hat{p}\hat{q}}{n}} < P < \hat{p} + z_{0.025}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
$0.68 - 1.96\sqrt{\frac{(0.68)(0.32)}{500}} < P < 0.68 + 1.96\sqrt{\frac{(0.68)(0.32)}{500}}$
$0.64 < P < 0.72$
Examples 6
A standardized chemistry test was given to 50 girls and 75 boys. The girls made
an average grade of 76 with a standard deviation of 6, while the boys made an
average grade of 82 with a standard deviation of 8. Find a 96% confidence
interval for the difference $\mu_1 - \mu_2$, where $\mu_1$ is the mean score of all boys and $\mu_2$
is the mean score of all girls who might take this test.
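Given: $n_1 = 75$, $n_2 = 50$, $\bar{x}_1 = 82$, $s_1 = 8$, $\bar{x}_2 = 76$, $s_2 = 6$,
$(1 - \alpha) = 0.96 \Rightarrow \alpha = 0.04$
$(\bar{x}_1 - \bar{x}_2) - z_{0.02}\sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + z_{0.02}\sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}}$
$(82 - 76) - 2.05\sqrt{\frac{8^2}{75} + \frac{6^2}{50}} < \mu_1 - \mu_2 < (82 - 76) + 2.05\sqrt{\frac{8^2}{75} + \frac{6^2}{50}}$
($n_1 \ge 30$ and $n_2 \ge 30$, so $\hat{\sigma}_1 = s_1$ and $\hat{\sigma}_2 = s_2$)
$3.43 < \mu_1 - \mu_2 < 8.57$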
Example 7
In a batch chemical process, two catalysts are being compared for their effect on
the output of the process reaction. A sample of 12 batches was prepared using
catalyst 1 and a sample of 10 batches was obtained using catalyst 2. The 12
batches for which catalyst 1 was used gave an average yield of 85 with a sample
standard deviation of 4, while the second sample gave an
average of 81 and a sample standard deviation of 5. Find a 90% confidence
interval for the difference between the population means, assuming the
populations are approximately normally distributed with equal variances.
Solution
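Let $\mu_1$ and $\mu_2$ be the mean population yield using catalyst 1 and catalyst 2,
respectively.
Given: $n_1 = 12$, $n_2 = 10$, $\bar{x}_1 = 85$, $s_1 = 4$, $\bar{x}_2 = 81$, $s_2 = 5$,
$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{20.05} = 4.478$
$(\bar{x}_1 - \bar{x}_2) - t_{20,0.05}(s_p)\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + t_{20,0.05}(s_p)\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
$(85 - 81) - (1.725)(4.478)\sqrt{\frac{1}{12} + \frac{1}{10}} < \mu_1 - \mu_2 < (85 - 81) + (1.725)(4.478)\sqrt{\frac{1}{12} + \frac{1}{10}}$
$0.69 < \mu_1 - \mu_2 < 7.31$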
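The pooled-variance interval of Example 7 can be traced step by step in code. The t value 1.725 is taken from the worked solution (20 degrees of freedom, 0.05 in each tail); names such as `pooled_sd` are illustrative assumptions.

```python
from math import sqrt


def pooled_sd(s1: float, n1: int, s2: float, n2: int) -> float:
    """Pooled standard deviation under the equal-variance assumption."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


T_20_005 = 1.725  # table value quoted in the text

n1, xbar1, s1 = 12, 85, 4
n2, xbar2, s2 = 10, 81, 5

sp = pooled_sd(s1, n1, s2, n2)
half_width = T_20_005 * sp * sqrt(1 / n1 + 1 / n2)
interval = (xbar1 - xbar2 - half_width, xbar1 - xbar2 + half_width)

print(round(sp, 3))
print(round(interval[0], 2), round(interval[1], 2))
```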
Example 8
The weight of 10 adults selected randomly before and after a certain new diet
was introduced was recorded as follows:
Solution
$\bar{d} = \frac{\sum d_i}{n} = -1.6$
$s_d^2 = \frac{\sum (d_i - (-1.6))^2}{n - 1} = 40.7$, so $s_d = 6.38$
A 98% confidence interval is $-1.6 \pm (2.821)\frac{6.38}{\sqrt{10}}$
That is, $-7.29 < \mu_d < 4.09$
Example 9
Solution
Let P1 and P2 be the true fraction of defectives of the existing and the new
processes, respectively.
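$n_2 = 2000$, $x_2 = 80$, $\hat{p}_2 = \frac{80}{2000} = 0.04$
$(1 - \alpha) = 0.90 \Rightarrow \alpha = 0.10$
$(\hat{p}_1 - \hat{p}_2) - z_{0.05}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} < P_1 - P_2 < (\hat{p}_1 - \hat{p}_2) + z_{0.05}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$
$-0.001697 < P_1 - P_2 < 0.021697$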
Statistical Hypothesis
Now let us make a 95% confidence interval about the mean breaking strength of
the population as below:
$P\left(7.8 - 1.96\frac{0.5}{\sqrt{50}} < \mu < 7.8 + 1.96\frac{0.5}{\sqrt{50}}\right) = 0.95$;
As there is a probability of 0.95 that the mean breaking strength is between 7.66
kg and 7.94 kg, it is highly unlikely that the null hypothesis $\mu = 8$ kg is true and
hence it should be rejected.
There are four possible situations for the above decision making exercise:
                 H0 is correct       H0 is wrong
Accept H0        Correct decision    Type 2 error
Reject H0        Type 1 error        Correct decision
1. Null hypothesis, H0
2. Alternative hypothesis, H1
$H_1: \mu \ne \mu_0$ (two-tail test)
$H_1: \mu > \mu_0$ (one-tail test)
$H_1: \mu < \mu_0$ (one-tail test)
In the one-tail test we have some expectation about the direction of the
error when the null hypothesis is wrong, while in the two-tail test we don't
have such an expectation.
3. Test statistic
is the value, based on the sample, used to determine whether the null
hypothesis should be rejected or accepted.
4. Critical region
is a region in which if the test statistic falls the null hypothesis will be
rejected.
5. Types of error
4. Select the appropriate test statistic and establish the critical region.
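$\mu_1 - \mu_2 = d_0$    Paired observations
    $t = \frac{\bar{d} - d_0}{s_d/\sqrt{n}}$ with $\nu = n - 1$
$p = p_0$    Large sample
    $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$
$p_1 - p_2 = 0$    Large samples
    $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$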
Example 1
Computation:
$n = 50$, $\bar{x} = 7.8$, $\sigma = 0.5$
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{7.8 - 8}{0.5/\sqrt{50}} = -2.828$
Conclusion: As the sample z (= -2.828) falls inside the critical region, reject
the null hypothesis at the 0.01 level of significance and conclude that $\mu$ is
significantly smaller than 8 kilograms.
Example 2
The average length of time for students to register for fall classes at a certain
college has been 50 minutes with a standard deviation of 10 minutes. A new
registration procedure using modern computing machines is being tried. If a
random sample of 12 students had an average registration time of 42 minutes
with a standard deviation of 11.9 minutes under the new system, test the
hypothesis that the population mean is now less than 50, using a level of
significance of (1) 0.05, and (2) 0.01. Assume the population of times to be
normal.
Let be the population mean time for students to register in the new registration
procedure.
Critical region: (n = 12 < 30 and the new $\sigma$ is unknown, so the t-test should
be used) degrees of freedom ($\nu$) = n - 1 = 12 - 1 = 11
Computation:
$n = 12$, $\bar{x} = 42$, $s = 11.9$
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{42 - 50}{11.9/\sqrt{12}} = -2.329$
(2) Identical with those of (1) except the critical region would be replaced by:
Example 3
Critical region: (As both n1 and n 2 are smaller than 30 and their standard
deviations are unknown, so t-test has to be used.)
Computation:
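$n_1 = 12$, $\bar{x}_1 = 85$, $s_1 = 4$
$n_2 = 10$, $\bar{x}_2 = 81$, $s_2 = 5$
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{(85 - 81) - 0}{\sqrt{20.05}\sqrt{\frac{1}{12} + \frac{1}{10}}} = 2.086$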
Conclusion: As the sample t (=2.086) falls inside the critical region, so reject the
null hypothesis at 0.10 level of significance and conclude that the mean abrasive
wear of material 1 is significantly higher than that of the material 2.
Example 4
              Sample
Analysis      1      2      3      4      5
x-ray         2.0    2.0    2.3    2.1    2.4
Chemical      2.2    1.9    2.5    2.3    2.4
Assuming the populations normal, test at the 0.05 level of significance whether
the two methods of analysis give, on the average, the same result.
Let 1 and 2 be the mean iron content determined by the laboratory chemical
analysis and X-ray fluorescence analysis respectively; and
Null hypothesis: $\mu_1 = \mu_2$ or $\mu_D = 0$
Alternative hypothesis: $\mu_1 \ne \mu_2$ or $\mu_D \ne 0$
Computation:
              Sample
Analysis      1      2      3      4      5
x-ray         2.0    2.0    2.3    2.1    2.4
Chemical      2.2    1.9    2.5    2.3    2.4
d_i          -0.2    0.1   -0.2   -0.2    0
$\sum_{i=1}^{5} d_i = -0.5$, $\sum_{i=1}^{5} d_i^2 = 0.13$
$\bar{d} = \frac{\sum d_i}{5} = \frac{-0.5}{5} = -0.1$
$s_d^2 = \frac{n\sum d^2 - (\sum d)^2}{n(n - 1)} = \frac{(5)(0.13) - (-0.5)^2}{(5)(4)} = 0.02$
$t = \frac{\bar{d} - \mu_D}{s_d/\sqrt{n}} = \frac{(-0.1) - 0}{\sqrt{0.02}/\sqrt{5}} = -1.5811$
Conclusion: As the sample t (= -1.5811) falls outside the critical region, do not reject
the null hypothesis at the 0.05 level of significance and conclude that there is
no significant difference in the mean iron content determined by the above two
analyses.
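The paired test in Example 4 can be reproduced from the five pairs of measurements. The critical value $t_{4,0.025} = 2.776$ used below is the standard table value for a two-tail test with 4 degrees of freedom; it is not quoted in the surviving text, so treat it as an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

xray = [2.0, 2.0, 2.3, 2.1, 2.4]
chemical = [2.2, 1.9, 2.5, 2.3, 2.4]

# Differences d_i = x-ray minus chemical, as in the worked table.
d = [x - c for x, c in zip(xray, chemical)]
t_stat = mean(d) / (stdev(d) / sqrt(len(d)))

T_4_0025 = 2.776  # assumed two-tail critical value (4 degrees of freedom, alpha = 0.05)
reject_h0 = abs(t_stat) > T_4_0025

print(round(t_stat, 4), reject_h0)
```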
Example 5
Computation:
$n = 100$, $x = 5$, $\hat{p} = \frac{x}{n} = \frac{5}{100} = 0.05$
$z = \frac{\hat{p} - P}{\sqrt{\frac{P(1 - P)}{n}}} = \frac{0.05 - 0.1}{\sqrt{\frac{(0.1)(0.9)}{100}}} = -1.667$
Conclusion: As the sample z (=-1.667) falls inside the critical region, so reject the
null hypothesis at 0.05 level of significance and conclude that P is significantly
smaller than 0.1. That is, the production method has been improved in lowering
the proportion of defective below the current 10%.
Example 6
A vote is to be taken among the residents of a town and the surrounding county
to determine whether a proposed chemical plant should be constructed. The
construction site is within the town limits and for this reason many voters in the
county feel that the proposal will pass because of the large proportion of town
voters who favor the construction. To determine if there is a significant difference
in the proportion of town voters and county voters favoring the proposal, a poll is
taken. If 120 of 200 town voters favor the proposal and 240 of 500 county
residents favor it, would you agree that the proportion of town voters favoring the
proposal is higher than the proportion of county voters? Use a 0.025 level of
significance.
Let $P_1$ and $P_2$ be the proportions of town voters and county voters, respectively,
favouring the proposal.
Null hypothesis: $P_1 = P_2$ or $P_1 - P_2 = 0$
Alternative hypothesis: $P_1 > P_2$ or $P_1 - P_2 > 0$
Computation:
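$n_1 = 200$, $x_1 = 120$, $\hat{p}_1 = \frac{x_1}{n_1} = \frac{120}{200} = 0.6$
$n_2 = 500$, $x_2 = 240$, $\hat{p}_2 = \frac{x_2}{n_2} = \frac{240}{500} = 0.48$
Pooled estimate: $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{360}{700} = 0.514$
$z = \frac{(\hat{p}_1 - \hat{p}_2) - (P_1 - P_2)}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.6 - 0.48) - 0}{\sqrt{(0.514)(0.486)\left(\frac{1}{200} + \frac{1}{500}\right)}} = 2.870$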
Conclusion: As the sample z (= 2.870) falls inside the critical region, reject the null
hypothesis at the 0.025 level of significance and conclude that the proportion of town
voters favouring the proposal is significantly larger than that of the county voters.
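The test statistic in Example 6 uses a pooled proportion because the null hypothesis says the two proportions are equal. A minimal sketch follows; the function name is an assumption.

```python
from math import sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: P1 = P2, using the pooled estimate of the common proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


z = two_proportion_z(120, 200, 240, 500)
print(round(z, 2))  # 2.87
```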
There are two types of chi-square tests: goodness-of-fit test and tests for
independence.
Goodness-of-fit Test
$\chi^2_{test} = \sum \frac{(O_i - E_i)^2}{E_i}$
Example 1
Faces
1 2 3 4 5 6
Observed 20 22 17 18 19 24
Expected
By comparing the observed frequencies with the expected frequencies, one has
to decide whether the die is fair die or not.
Computation:
Expected value $= nP(X = i) = 120\left(\frac{1}{6}\right) = 20$
i                  1    2    3    4    5    6
Observed (O_i)     20   22   17   18   19   24
Expected (E_i)     20   20   20   20   20   20
O_i - E_i          0    2    -3   -2   -1   4
$\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i} = \frac{0 + 4 + 9 + 4 + 1 + 16}{20} = 1.7$
Conclusion: As the sample $\chi^2$ (= 1.7) falls outside the critical region, do not reject the
null hypothesis; there is no evidence that the die is unfair.
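The goodness-of-fit statistic for the die can be computed in a few lines. The critical value 11.070 below is the standard table value of $\chi^2_{5,0.05}$; the level of significance is not stated in the surviving text, so the 0.05 level is an assumption of this sketch.

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [20, 22, 17, 18, 19, 24]
expected = [sum(observed) / 6] * 6  # a fair die gives 120 * (1/6) = 20 per face

chi2 = chi_square_gof(observed, expected)
CHI2_5_005 = 11.070  # assumed critical value: 5 degrees of freedom, alpha = 0.05
reject_h0 = chi2 > CHI2_5_005

print(round(chi2, 2), reject_h0)
```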
Example 2
1.45 - 1.95 2
1.95 - 2.45 1
2.45 - 2.95 4
2.95 - 3.45 15
3.45 - 3.95 10
3.95 - 4.45 5
4.45 - 4.95 3
Chi-squared test can be applied to test whether the above frequency distribution
can be approximated by a normal distribution or not.
Computation:
For finding the expected values, the mean and standard deviation of the
frequency distribution have to be found first.
Class boundaries    O_i (f_i)    Class mark (x_i)
1.45 - 1.95         2            1.7
1.95 - 2.45         1            2.2
2.45 - 2.95         4            2.7
2.95 - 3.45         15           3.2
3.45 - 3.95         10           3.7
3.95 - 4.45         5            4.2
4.45 - 4.95         3            4.7
$n = 40$, $\sum fx = 136.5$, $\sum fx^2 = 484.75$
$\bar{x} = \frac{\sum fx}{\sum f} = \frac{136.5}{40} = 3.4125$
$s = \sqrt{\frac{\sum fx^2 - (\sum fx)^2/n}{n - 1}} = \sqrt{\frac{484.75 - (136.5)^2/40}{39}} = 0.6969$
z value: $\frac{L_i - 3.4125}{0.6969} < Z < \frac{U_i - 3.4125}{0.6969}$;
where $L_i$ and $U_i$ are the Lower and Upper Boundaries of the ith class.
p value: $P\left(\frac{L_i - 3.4125}{0.6969} < Z < \frac{U_i - 3.4125}{0.6969}\right)$
$E_i = (40)\,P\left(\frac{L_i - 3.4125}{0.6969} < Z < \frac{U_i - 3.4125}{0.6969}\right)$
In order to satisfy the rule that the expected value in each cell is larger than or
equal to 5, we have to combine the first three classes into one cell and the last
two classes into another cell. As such, the number of cells (n) is 4.
Conclusion: Since $\chi^2_{1,0.05} = 3.841$, the sample $\chi^2$ (= 2.901) falls outside the
critical region. As such, do not reject the null hypothesis and conclude that the
distribution of battery lives can be approximated by a normal distribution.
The Chi-square test procedure can also be used to test the hypothesis of
independence of two variables/attributes. The observed frequencies of two
variables are entered in a two-way classification table, or contingency table.
Remark: The expected frequency of the cell in the ith row and jth column in
the contingency table is (ith row total)(jth column total) / (grand total).
The degrees of freedom for the contingency table is equal to $(r - 1)(c - 1)$ where
r is the number of rows and c is the number of columns in the table.
Example 3
Suppose that we wish to study the relationship between grade point average and
appearance.
Computation:
Appearance      1            2             3            4             Totals
attractive      14 (9)       11 (10.33)    10 (11)      5 (9.67)      40
ordinary        10 (12.6)    16 (14.47)    16 (15.4)    14 (13.53)    56
unattractive    3 (5.4)      4 (6.2)       7 (6.6)      10 (5.8)      24
Totals          27           31            33           29            120
(Expected frequencies are shown in parentheses.)
Remark: The approach for the test of homogeneity is the same as for the
test of independence of variables/attributes.
Example 4
                                   Manager
Result                             A      B      C
Purchases show profit              63     71     55
Purchases do not show profit       37     29     45
Total                              100    100    100
Null hypothesis: the rates of stock purchases that resulted in profit were the
same for the three stock portfolio managers
Computation:
                                   Manager
Result                             A          B          C          Total
Purchases show profit              63 (63)    71 (63)    55 (63)    189
Purchases do not show profit       37 (37)    29 (37)    45 (37)    111
Total                              100        100        100        300
(Expected frequencies are shown in parentheses.)
Conclusion: As the sample $\chi^2$ (= 5.49) falls outside the critical region, do not reject
the null hypothesis; there is insufficient evidence that the rates of purchases
resulting in profit differ among the three portfolio managers.
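The expected frequencies and the test statistic for a contingency table follow mechanically from the row and column totals, so both Example 3 and Example 4 can be checked with one pair of functions. This is a sketch; the function names are illustrative.

```python
def expected_frequencies(table):
    """Expected count for each cell: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Chi-square statistic for independence or homogeneity."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


managers = [[63, 71, 55], [37, 29, 45]]  # Example 4
chi2 = chi_square(managers)
dof = (len(managers) - 1) * (len(managers[0]) - 1)
print(round(chi2, 2), dof)  # 5.49 2

appearance = [[14, 11, 10, 5], [10, 16, 16, 14], [3, 4, 7, 10]]  # Example 3
print([round(e, 2) for e in expected_frequencies(appearance)[0]])
```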
IV
Standard Probability
Distributions
Suppose an event E can happen in h ways out of a total of n possible, equally
likely ways. Then the probability of occurrence of the event is denoted by
p = h/n
and the probability of its non-occurrence by
q = 1 – p
The probability scale runs from impossibility on the one hand to absolute
certainty on the other. The two poles of this continuum are given the numerical
values of 0 and 1 respectively. Therefore, the probability of any event occurring
must lie somewhere between these two extremes. My eventual death
is absolutely certain (a probability of 1); the probability of my death next year, by
contrast, can be estimated from a study of death rates in the past:
Age Group    Deaths per 1000 per Year    Probability of Dying in Year+1
Population
How well a sample represents a given population depends on the sample frame,
the sample size and the specific sample design:
Sample Size: the planning of, and reasons for choosing, the
number of subjects in the sample.
Sample Frame
2
Random numbers can be generated from scientific calculators, or be extracted from statistical
tables.
Sample Design
Sample Size
How big should my sample be? It all depends on what you want to do with your
findings and the type of relationship you want to establish in your study. The
sample size is crucially important in a correlational research design, i.e. tests of
hypotheses and significance, or establishing an association or relationship
between two or more variables. However, there is no relationship between the
sample size and the size of the study population; sample size is determined by
the variability of the factor, element or characteristic prevalent in the study
population. Other things being equal, precision increases steadily up to sample
sizes of 50-200; after that, there is only a modest gain in precision from increasing
the sample size. Whilst increasing sample size reduces errors attributable to
sampling, managing large amounts of data may increase non-sampling errors
(e.g. field-work problems, interviewer-induced bias, clerical errors in transcribing
data, etc.). In determining the size of a sample, consideration should be given
to:
3
See Jakobowitz, K., (1999), How to Develop and Write a Business Research Proposal, pp.58-
74, ISBN 0-9538664-0-8
the extent to which the variability of the factor in the chosen study
population is known, or can be estimated.4
Computation
The standard deviation of the study population is unknown, hence the normal
distribution cannot be used but a derivative, Student's t-distribution, can be
applied:
$n_s = [t_{0.025}\, s / L]^2$
where:
$n_s$ = sample size
t = confidence limit value derived from statistical tables for v = n – 2
degrees of freedom, and $\alpha/2$ (i.e. between $t_{-0.025}$ and $t_{+0.025}$)
s = sample standard deviation obtained from the pilot-study
L = level of tolerance or error (say, 5 points on the instrument scale of
100)
Hence:
Note The normal distribution provides a sample size of $[1.96(15)/5]^2 = 35$; and, using an
estimate of the population standard deviation $[n/(n-1)]s$ provides a sample size
of $[1.96(16.667)/5]^2 = 43$
4
It is rare indeed to have a known population variance and so be able to determine the
population standard deviation. Hence, deviates of the normal distribution cannot be applied, and
a derivative such as Student's t-test should always be used for an unknown standard deviation of
the study population.
Sampling Error
The standard error of mean is the most commonly used statistic to describe
sampling error5:
$SE = [s^2/n]^{0.5}$
Where: $s^2$ is the variance derived from the sample
n is the sample size
and $[\,\cdot\,]^{0.5}$ denotes the square root of the bracketed quantity
Computation
The standard error of mean $= [s^2/n]^{0.5} = [(27.42)^2/21]^{0.5} = 5.98$
Conclusion
We can be about 68 percent certain that the true population mean lies within ± one standard
error of the sample mean, i.e. 136.81 ± 5.98, or between 130.83 and 142.79
on the survey instrument scale.
Also, we can be 95 percent certain that the true population mean lies within ± two
standard errors of the sample mean, i.e. 136.81 ± 11.96, or between 124.85
and 148.77 on the survey instrument scale.
5
For a normal, or approximately symmetrical, distribution, the 95 percent confidence limits of the
population true mean can be computed: sample mean ± $1.96\,s/(n^{0.5})$.
The mean of the sample response is: x/n = (15/50) = 0.3 (in other words, a
proportion of thirty percent of the sample claim the characteristic).
Thus, we can be about 68 percent certain (i.e. ± one standard error from the
mean) that the value of the true population characteristic lies within one
standard error of the sample mean: 0.3 ± 0.0648, or between 23.52 and 36.48
percent of the study population. Also, we can be 95 percent confident that the
true population mean lies within ± two standard errors of the sample mean: 0.3 ±
0.1296, or between 17.04 and 42.96 percent of the study population.
Out of a random sample of 1000 subjects, 500 say they have a positive attitude
towards the object or subject of the study. What conclusions can we draw about
the percentage of the total population who may hold this view?
Our sample estimate (p) is, in effect, a random selection from a normal distribution (n>30) with
mean = $\pi$ and a standard error (= standard deviation) of $[\pi(100-\pi)/n]^{0.5}$, where $\pi$ is
the population percentage. Accordingly we can state with 95 percent confidence that the true
population percentage lies between:
$\pi \pm 1.96[\pi(100-\pi)/n]^{0.5}$ = 50 percent $\pm\ 1.96[50(100-50)/1000]^{0.5}$ = 50 percent ± 3.1 percent
Conclusion: We can state with 95 percent confidence that the true proportion of people
who hold this view is within the range of 46.9 to 53.1 percent.
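A minimal Python sketch of the same calculation, working in percentages as the text does. The function name `percent_margin` is an illustrative assumption.

```python
import math

def percent_margin(pct, n, z=1.96):
    """Margin of error, in percentage points, for a sample percentage."""
    return z * math.sqrt(pct * (100 - pct) / n)

n, yes = 1000, 500            # sample size and number holding the view
pct = 100 * yes / n           # sample percentage (50 percent)
margin = percent_margin(pct, n)
low, high = pct - margin, pct + margin

print(f"margin = {margin:.1f} percent")
print(f"95 percent limits: {low:.1f} to {high:.1f} percent")
```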
2. Calculating the Sample Size for Established Accuracy from a Large Pilot Study
$n_s = 4\pi(100-\pi)/L^2$
In the survey, $n_s = 400$ and $\pi$ = (120/400) = 30 percent, which provides the computed accuracy at
the 95 percent confidence limits:
the true percentage lies between $\pi \pm 1.96[\pi(100-\pi)/n]^{0.5}$
= 30 percent $\pm\ 1.96[30(100-30)/400]^{0.5}$
To compute the sample size necessary to reduce the margin of error from ± 4.6 percent
(approximately two standard errors) to ± 2 percent, with L = 2 and $\pi$ = 30:
$n_s = 4\pi(100-\pi)/L^2$
$= 4(30)(100-30)/2^2$
$= 2100$
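The sample-size formula can be sketched in Python as follows. The function names are hypothetical, and the factor 4 is taken to be the square of two standard errors, which is an interpretation of the formula rather than something the text states.

```python
import math

def pilot_margin(pct, n, k=2):
    """Margin of k standard errors (percentage points) from a pilot study."""
    return k * math.sqrt(pct * (100 - pct) / n)

def required_sample_size(pct, limit):
    """n_s = 4 * pct * (100 - pct) / L^2, with L in percentage points."""
    return 4 * pct * (100 - pct) / limit ** 2

pct = 100 * 120 / 400                 # pilot estimate: 30 percent
current = pilot_margin(pct, 400)      # accuracy achieved by the pilot
needed = required_sample_size(pct, 2) # sample size for +/- 2 percent

print(f"pilot margin = {current:.1f} percent")
print(f"required sample size = {needed:.0f}")
```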
V
The Normal Distribution
the total area under the curve is equal to unity, i.e. 1.0000
[Figure: areas under the normal curve: 68.27 percent within ± 1 standard deviation of the mean, 95.45 percent within ± 2, and 99.73 percent within ± 3.]
Statistical tables by White, Yeats and Skipworth (1991) and Murdoch and Barnes
(1985) of the areas of the standardized normal distribution give the probability
that a random variable will be greater than a given standardized value (U), i.e.
the area in the tail. Hence, for z = 1.2 we are given a value of 0.11507, but this
represents the probability of a random sample value being greater than z = 1.2,
i.e. the probability of being in the tail of the distribution. To find the
probability of a random sample value falling in the acceptance region (z = 0 to
z = 1.2), we must subtract this value from 0.5:
0.5 - 0.11507 = 0.38493
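The tabulated tail area can be reproduced without tables. The sketch below uses the complementary error function from Python's standard library; `upper_tail` is an illustrative name.

```python
import math

def upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

tail = upper_tail(1.2)     # probability of exceeding z = 1.2
between = 0.5 - tail       # probability of falling between z = 0 and z = 1.2

print(f"tail area = {tail:.5f}")
print(f"area from z = 0 to z = 1.2 = {between:.5f}")
```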
Computation
The intelligence quotients recorded as 120 and 130 can actually have any value
from 119.5 to 130.5, assuming that the IQ is recorded to the nearest unit of
measurement. Hence we require:
Pr(119.5 < IQ < 130.5)
Conclusion
The number of British adults in the sample having an IQ of between 120 and
130 = 5000 (0.07562) = 378.1 (say, 378)
VI
Statistical Decision Theory: Tests of Hypothesis and Significance
Very often in practice we are called upon to make decisions about populations on
the basis of sample information. Such decisions are called statistical decisions.
For example, we may wish to decide on the basis of sample data whether one
psychological procedure is better than another, whether the findings from survey
data are representative of the population, whether the conclusions reached as to
an experiment are valid, etc.
What is a Hypothesis?
Example
'no'. It might equally be the case that a low level of school resources is
also to blame. The main point to note is that the scientific process never
leads to certainty in explanation, only to the rejection of existing
hypotheses and the construction of new ones.
Once the conceptual propositions have been established (i.e., the meanings in
scientific terms) we then need an operational proposition that defines the
concepts in such a way that they can be observed and measured. This may be
derived from a score achieved on a particular scale of authoritarianism; indirectly,
we may study the relationship between childhood aggression and exposure to
violent television programmes, but we still need to define both the variables
under study – aggression and television violence – in operational terms. The
former might be simply a tally of aggressive acts such as hitting, fighting,
damaging property, etc. Or it might be based on the analysis of projective test
material (Thematic Apperception Test). A panel of judges may be used to
develop an operational definition of aggression by watching a child in a free-play
situation and then rating the child's aggressiveness on a five-point scale.
Alternatively, we could observe children as they play with a selection of toys we
had previously classified as aggressive (guns, tanks, knives, etc.) and toys
classified as non-aggressive (cars, dolls, puzzles, etc.).
Defining violence may be a little more difficult to agree on. What constitutes
television violence? The problem here is partly cultural and partly the difference
in precision between what the general public will accept in defining a term and
what researchers will accept. To operationalize the concept of television violence
we could use a checklist of items, such as "Was there physical contact of an
aggressive nature?", "Has an illegal act taken place?", etc. Perhaps you can
establish a criterion that a violent television programme must have five or more
items checked 'yes' for it to be considered violent.
Problem or General Hypothesis: You expect some children to read better than
others because they come from homes in which there are positive values and
attitudes to education.
Unconfirmed Hypotheses
Statistical Hypotheses
Step 1
A statistical or Null Hypothesis is set up
This initial assumption is almost invariably an assumption about the value
of a population parameter: the probability (p) of the subject choosing the colour of
the card correctly is 0.5, hence the subject is merely guessing and the
experimental results are due to chance:
H0 : p = 0.5
Step 2
An Alternative Hypothesis is defined that is accepted if the test
permits us to reject the null hypothesis
The subject is not guessing and the experimental results are indicative of the
subject having powers of extra-sensory perception:
H1 : p > 0.5
Step 3
An appropriate level of significance (α) is established
α = 0.05
Step 4
The appropriate sampling distribution is defined and the
critical values established to identify the accept/reject regions
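If the null hypothesis (H0) is true, the mean and standard deviation of the number
of cards identified correctly are given by:
[Figure: sampling distribution with critical value Z1; the critical region (α = 0.05) lies in the upper tail. Accept H0 below Z1, reject H0 above Z1.]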
Step 5
The decision rule is established
(i) If the z score observed is greater than 1.645, the results are significant at
the 0.05 level and are taken as evidence that the subject has powers of
extra-sensory perception: reject H0.
(ii) If the z score is less than 1.645, the results can be attributed to chance,
i.e. they are not significant at the 0.05 level: accept H0.
Step 6
The position of the Sample result is computed
If the null hypothesis is valid, the sampling distribution will have a mean of $\mu = Np
= 50(0.5) = 25$, and a standard deviation of $\sigma = [Np(1-p)]^{0.5} = [50(0.5)(0.5)]^{0.5} = 3.54$
Step 7
Conclusion
Since 32 correct identifications in standard score = 1.84 (this value corresponds
to applying a continuity correction: (31.5 - 25)/3.54 = 1.84), which is greater than
1.645, decision (i) holds, i.e. we reject H0 and conclude that the results are
significant at the 0.05 level, consistent with the subject having powers of
extra-sensory perception. Had the z score been less than 1.645 we would have
accepted H0. It would not follow from this that the null hypothesis is true; it
would merely mean that there is insufficient evidence to reject it (this is
equivalent to a "not proven" verdict).
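The seven steps can be traced in a few lines of Python. This is a sketch only: the continuity correction (using 31.5 rather than 32) is an assumption adopted here because it reproduces the standard score of 1.84 quoted in the conclusion.

```python
import math

n, p = 50, 0.5        # number of cards and chance probability under H0
correct = 32          # cards identified correctly
critical = 1.645      # one-tailed critical z at the 0.05 level

mean = n * p                          # expected number correct under H0
sd = math.sqrt(n * p * (1 - p))       # standard deviation under H0

# Continuity correction (assumed): treat "32 or more" as "31.5 or more".
z = (correct - 0.5 - mean) / sd
reject_h0 = z > critical

print(f"mean = {mean:.0f}, sd = {sd:.2f}, z = {z:.2f}")
print("reject H0" if reject_h0 else "accept H0")
```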
Let m1 and m2 be the sample means obtained from large samples of sizes n1
and n2 drawn from respective populations having means μ1 and μ2 and standard
deviations σ1 and σ2. Consider the null hypothesis that there is no difference
between the population means, i.e. μ1 = μ2, or the two samples are drawn from
populations having the same mean.
Suppose the two groups come from two populations having the respective
means of μ1 and μ2. Then we have to decide between the hypotheses:
Under the null hypothesis it is assumed that both groups come from the same
population. Since the population standard deviation is not known, an estimate of
the population standard deviation is computed by pooling the two sample
standard deviations:
For a two-tailed test the results are significant at the 0.05 level if z lies outside the
range –1.96 to +1.96. Hence we conclude that at a 0.05 level there is a
significant difference in performance of the two groups and that the second group
is probably better.
Two groups, A and B, consist of 100 subjects each who have a phobia
towards spiders. A psychoanalytical technique is administered to group
A but not group B (which is called the control group); otherwise the two
groups are treated identically. It is found that in groups A and B, 75 and
65 subjects, respectively, manage the phobia effectively. Test the
hypothesis that the psychoanalytical technique is effective using a 0.05
level of significance.
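With the pooled proportion p = (75 + 65)/200 = 0.70, the standard error of the
difference between the two sample proportions is:
$[p(1-p)(1/n_1 + 1/n_2)]^{0.5} = [(0.70)(0.30)(1/100 + 1/100)]^{0.5} = 0.0648$
$z = (0.75 - 0.65)/0.0648 = +1.54$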
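The two values above can be reproduced with a pooled two-proportion z test. The sketch below is illustrative; the variable names are assumptions, and the one-tailed critical value of 1.645 is the one used in the earlier card-guessing example.

```python
import math

x_a, n_a = 75, 100    # group A: treated, number managing the phobia
x_b, n_b = 65, 100    # group B: control

p_a, p_b = x_a / n_a, x_b / n_b
pooled = (x_a + x_b) / (n_a + n_b)    # pooled proportion under H0

# Standard error of the difference between two proportions under H0.
se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se

print(f"pooled p = {pooled:.2f}, SE = {se:.4f}, z = {z:+.2f}")
print("significant at 0.05 (one-tailed)" if z > 1.645 else "not significant at 0.05 (one-tailed)")
```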
It should be noted that our conclusions depend on how much we are willing to
risk being wrong. If the results are actually due to chance and we conclude that
the psychoanalytical treatment is effective (Type I error), we might proceed to
treat large numbers of people only to find that it is ineffective. However,
concluding that the psychoanalytical technique does not help when it actually
does (Type II error) may be a dangerous conclusion, especially if people's
well-being is at stake.
$p(X) = \frac{n!}{x!(n-x)!}\, p^x (1-p)^{n-x}$
In this case n = 100 questions and the probability of successfully answering each
question is 1 in 4 = 0.25. By substitution:
$\mu = np = 100(0.25) = 25$
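A short Python sketch of the binomial formula for this case follows. The standard deviation line uses the usual binomial result [np(1 - p)]^0.5, which does not appear at this point in the text; the function name is illustrative.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 100, 0.25      # 100 questions, four choices each

mean = n * p                          # expected number correct by guessing
sd = math.sqrt(n * p * (1 - p))       # spread of the guessing score
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))

print(f"mean = {mean:.0f}, sd = {sd:.2f}")
print(f"P(exactly 25 correct) = {binomial_pmf(25, n, p):.4f}")
print(f"probabilities sum to {total:.6f}")
```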
VII
Small Sampling Theory
For samples of size n>30, called large samples, the sampling distributions of
many statistics are approximately normal. In such cases we can use the statistic
z to formulate decision rules or tests of hypotheses and significance. In each
case the results hold good for infinite populations. In a test of means in
sampling without replacement from finite populations, the z score is given by:
$z = (m - \mu) / [\sigma/(n-1)^{0.5}]$
However, for samples of size n<30, called small samples, this particular
approximation is not good and becomes worse with decreasing sample size.
Hence, exact sampling theory, with appropriate modifications, must be used. If we
consider a sample of size n drawn from a normal, or assumed normal, distribution
with a population mean of μ, using the sample mean m and sample standard
deviation s, the t statistic can be expressed:
$t = (m - \mu) / [s/(n-1)^{0.5}]$
Degrees of Freedom
Confidence Limits
1. We are dealing with a small sample of n<30 and the population standard
deviation is unknown. Hence, the sample mean of 30 is a random
selection from a t distribution with a population standard deviation =
standard error = $s/n^{0.5}$. Thus, we can state that the sample mean is no
more than $t_{0.025}$ standard errors from the population mean:
2. From tables of the t distribution the value for $t_{0.025}$ based on (n-1) = 8
degrees of freedom = ± 2.306. Hence, substituting in the formula:
To test the hypothesis that the observed sample mean (m) is equal to the
population mean (μ), or that the observed mean (m1) from one sample is equal to
the observed mean of another sample (m2):
H0 : m = μ
or, H0 : m1 = m2 ... ... two-tailed test
To test the alternative hypothesis that the observed sample mean (m) is greater
than, or less than, the population mean (μ):
H1 : m > μ
or, H1 : m < μ
or, H1 : m1 > m2
or, H1 : m1 < m2 ... ... one-tailed test
Similarly, to test the hypothesis that an observed sample mean is not equal to the
population mean, or that an observed sample mean is not equal to the observed
mean of another sample, we use a two-tailed test. In a two-tailed test, the area
representing the level of significance must be distributed between the two tails.
Thus, for a level of significance of α = 0.05 the area in each tail is equal to α/2 =
0.025:
Step 1
Establish the Null Hypothesis
The mean scores achieved in the post-test by subjects following the traditional
learning programme are equal to the mean scores achieved in the post-test by
subjects exposed to the computer-aided learning programme,
i.e. H0 : m1 = m2
Alternative Hypothesis
The mean scores achieved in the post-test by subjects following the traditional
learning programme are not equal to the mean scores achieved in the post-test
by subjects exposed to the computer-aided learning programme.
H1 : m1 > m2
or, H1 : m1 < m2
i.e. H1 : m1 is not equal to m2
Step 2
Decision Rule
If the calculated t is less than –2.000, or greater than +2.000, then reject the null
hypothesis and accept the alternative hypothesis; otherwise accept H0.
Step 3
Graphically describe the distribution showing the critical values and the region of
acceptance/rejection.
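[Figure: t distribution with critical values CV1 = -2.000 and CV2 = +2.000. The central region of area 0.95 is the acceptance region (accept H0); each tail of area α/2 = 0.025 is a rejection region (reject H0).]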
Step 4
Collect Sample Data and Compute the Result of the Test
111294 16 76
111295 15 54
111296 31 83
111297 17 70
111298 48 74
111299 18 54
111300 32 67
111301 11 57
111302 43 94
111303 30 91
111310 42 60
111311 49 84
111312 33 48
111315 17 56
111318 29 65
111321 24 63
111322 24 66
111323 14 45
111324 16 70
111325 23 69
111326 31 75
$\sigma^2 = [(40)(15.41228)^2 + (20)(11.40395)^2]/(40+20-2)$
$= 17.846^2$
Hence:
$t = (63.50 - 75.95)/[17.846^2(0.025+0.05)]^{0.5}$
$= -2.5474$
Step 5
Conclusion
Since tcalc = -2.5474 which is less than the critical value of –2.000 we must reject
the null hypothesis at the 0.05 level of significance, and accept the alternative
hypothesis. The mean score obtained by the sample of subjects exposed to the
computer-aided learning package is not equal to the mean score by the sample
of subjects following the traditional learning programme.
Indeed, one can claim that the mean score of subjects exposed to the traditional
learning programme is less than the mean score of subjects following the
computer-aided learning programme.
Using the standard procedure for significance testing we can, for example, derive
the appropriate test of a specified population mean:
1. Null Hypothesis
H0 : μ = 100
Alternative Hypothesis
H1 : μ > 100
2. Level of Significance
α = 0.05 is considered to be adequate
3. Critical Value
Since the sample size n<30 and the population standard deviation is unknown,
then t is the appropriate sampling distribution, with a mean of 100 and a standard
error of $s/n^{0.5} = 5/8^{0.5} = 1.7678$.
The reject region will be t0.05 standard errors above the population mean; from
tables for t0.05 with (n-1) = (8-1) = 7 degrees of freedom the critical value is
established as +1.895
4. Decision Rule
If the tcalc for the sample mean is less than +1.895 we must accept the null
hypothesis. However, if the tcalc for the sample mean is greater than +1.895 then
the null hypothesis must be rejected and the alternative hypothesis accepted.
Displayed graphically:
[Figure: t distribution with critical value CV = +1.895; accept H0 below the critical value, reject H0 above it.]
5. Computation of t
To find the position of the observed sample mean in the t distribution:
$t = (m - \mu)/\text{standard error}$
$= (110 - 100)/1.7678$
$= +5.657$
6. Conclusion
Since the t score of +5.657 is greater than the critical value of +1.895 we must
reject the null hypothesis at the 0.05 level of significance, and accept the
alternative hypothesis. The population mean for the proprietary test of aptitude
is not 100; or, a very rare event has occurred. Further sampling would be
advised to confirm this conclusion.
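The six steps of this one-sample test can be sketched in Python as follows. The critical value of 1.895 is taken from the t tables quoted in the text rather than computed, and the variable names are illustrative.

```python
import math

m, mu0 = 110, 100     # observed sample mean and hypothesised population mean
s, n = 5, 8           # sample standard deviation and sample size
critical = 1.895      # t(0.05) for n - 1 = 7 degrees of freedom, from tables

se = s / math.sqrt(n)         # standard error of the mean
t = (m - mu0) / se            # position of the sample mean in the t distribution
reject_h0 = t > critical      # one-tailed decision rule

print(f"SE = {se:.4f}, t = {t:+.3f}")
print("reject H0" if reject_h0 else "accept H0")
```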
m1 = 96 s1 = 4.0 n1 = 6
m2 = 92 s2 = 3.5 n2 = 5
H0 : m1 = m2
t = (m1 – m2)/SE
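Since the population standard deviation (σ) is unknown, this can be estimated
from pooling the two sample standard deviations (s1 and s2):
$\sigma = [(n_1 s_1^2 + n_2 s_2^2)/(n_1 + n_2 - 2)]^{0.5} = [(6(4.0)^2 + 5(3.5)^2)/(6 + 5 - 2)]^{0.5} = 4.18$
$t = (m_1 - m_2)/[\sigma(1/n_1 + 1/n_2)^{0.5}] = (96 - 92)/[4.18(1/6 + 1/5)^{0.5}] = 1.58$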
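The values 4.18 and 1.58 can be reproduced with the pooled estimate used earlier in this chapter, in which each sample variance is weighted by its sample size. The Python sketch below assumes that form; the names are illustrative, and the critical value of 2.262 is the tabulated one quoted in the conclusion.

```python
import math

m1, s1, n1 = 96, 4.0, 6
m2, s2, n2 = 92, 3.5, 5
critical = 2.262      # t(0.025) for n1 + n2 - 2 = 9 degrees of freedom, from tables

# Pooled estimate of the population standard deviation.
sigma = math.sqrt((n1 * s1 ** 2 + n2 * s2 ** 2) / (n1 + n2 - 2))
se = sigma * math.sqrt(1 / n1 + 1 / n2)   # standard error of the difference
t = (m1 - m2) / se

print(f"pooled sigma = {sigma:.2f}, t = {t:.2f}")
print("reject H0" if abs(t) > critical else "accept H0")
```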
4. Conclusion
Since the calculated value of t is greater than the critical value of –2.262, and is
less than the critical value of +2.262, we must accept the null hypothesis: there is
no significant difference between the two sample means at the 0.05 level of
significance.
We have seen that to be 95 percent certain that the sample mean is not more
than a specified limit (L) from the population, or true, mean we can compute:
$N = [t_{0.025}\,s/L]^2$
$= [(2.306)(8)/3]^2$
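Evaluating the expression is straightforward. The sketch below rounds the result up to the next whole subject, which is a common convention rather than something stated in the text.

```python
import math

t_crit = 2.306    # t(0.025) for 8 degrees of freedom, from tables
s = 8             # sample standard deviation
limit = 3         # largest acceptable distance L from the true mean

n_required = (t_crit * s / limit) ** 2
n_subjects = math.ceil(n_required)    # round up to a whole number of subjects

print(f"N = {n_required:.1f}, i.e. a sample of {n_subjects}")
```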
While the z and t tests are useful for examining single differences, such as
sample means, there can be problems when a set of differences is to be
examined. A significance test considers how likely it is that a given result is due
to sampling error rather than representing a real difference. By convention, we
reject the null hypothesis of 'no difference' if the probability of this is as low as
0.05. This means that if we do twenty independent experiments or tests in which
the null hypothesis is in fact true, we are likely to make at least one Type I
error: the probability is 1 - (0.95)^20, or about 0.64.
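The risk of at least one Type I error across several tests can be sketched as follows, assuming the tests are independent and every null hypothesis is true. Both are simplifying assumptions, and the function name is hypothetical.

```python
def prob_at_least_one_type_i(alpha, tests):
    """Chance of one or more false rejections across independent tests."""
    return 1 - (1 - alpha) ** tests

p_any = prob_at_least_one_type_i(0.05, 20)
print(f"P(at least one Type I error in 20 tests) = {p_any:.2f}")
```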
The F-distribution is used to compare the variability of two samples rather than
the sample mean values, for example:
n1 = 10    s1 = 4.099 ($s_1^2$ = 16.802)
n2 = 10    s2 = 2.011 ($s_2^2$ = 4.044)
1. Null Hypothesis
$H_0 : \sigma_1^2 = \sigma_2^2$
Alternative hypothesis
3. Decision Rule
The critical value for F is derived from table of the F-distribution, the degrees of
freedom being:
[Figure: F distribution with critical value CV = 3.18 at α = 0.05; accept H0 below the critical value, reject H0 above it.]
F = larger sample variance / smaller sample variance
$= (4.099)^2/(2.011)^2$
= 16.802/4.044
= 4.153
5. Conclusion
Since the calculated F value of 4.153 is greater than the critical value of 3.18, we
must reject the null hypothesis and accept the alternative hypothesis (see footnote 6).
There is a significant difference in the variability of the two tests at the 0.05 level of
significance: the scores derived from our measurement instrument are
significantly more variable than those derived from the existing measurement
instrument. However, we would accept the null hypothesis at the 0.01 level of
significance, where the calculated F value of 4.153 is less than Fcv = 5.35.
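A minimal Python sketch of the variance-ratio test follows. It assumes 4.099 and 2.011 are the two sample standard deviations, so that their squares give the variances 16.802 and 4.044 used in the F ratio. The critical values are the tabulated ones quoted in the text.

```python
s_larger, s_smaller = 4.099, 2.011    # sample standard deviations (assumed)
crit_05, crit_01 = 3.18, 5.35         # F critical values for (9, 9) d.f., from tables

# F is the ratio of the larger sample variance to the smaller.
f_ratio = s_larger ** 2 / s_smaller ** 2

print(f"F = {f_ratio:.2f}")
print("0.05 level:", "reject H0" if f_ratio > crit_05 else "accept H0")
print("0.01 level:", "reject H0" if f_ratio > crit_01 else "accept H0")
```

The ratio agrees with the hand calculation to two decimal places; the third decimal depends on how the variances were rounded.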
6 Occasionally, we may have no prior knowledge of the variability of two observations and a two-tailed
test may be called for. In this instance, the lower critical value for F is the reciprocal of the
upper critical value: in the example above, for v1 = 9, v2 = 9 at the 0.05 level of significance, the upper
critical value is 4.03 and the reciprocal is 0.2481.
As we have seen, the results obtained in samples do not always agree precisely
with theoretical results expected according to the rules of probability. With
problems of categorical data, the previous methods of applying the z and t
sampling distributions are unsatisfactory. Chi-square (χ²) is a measure of the
discrepancy existing between observed and expected frequencies. Suppose that in a
particular sample a set of events E1, E2, E3 . . . En are observed to occur with
frequencies o1, o2, o3 . . . on, called observed frequencies, and that according to
probability rules they are expected to occur with frequencies e1, e2, e3 . . . en, called
expected frequencies. Often we wish to know whether observed frequencies
differ significantly from expected frequencies.
Event                 E1   E2   E3   E4   E5   ...   En
Observed frequency    o1   o2   o3   o4   o5   ...   on
Expected frequency    e1   e2   e3   e4   e5   ...   en
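[Figure: chi-square distribution starting at 0, with the acceptance region of area 1 - α below the critical value CV and the rejection region of area α above it.]
Degrees of Freedom
v = (Number of Columns (k) in the Table - 1)(Number of Rows (h) in the Table - 1)
Thus, for a table of six columns and four rows, v = (k-1)(h-1) = (6-1)(4-1) = 15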
As with the normal and t distributions, we can set a level of significance and
define confidence intervals with tables of the χ² distribution, for example:
v      α = 0.05    α = 0.025    α = 0.01    α = 0.005
1      3.84146     5.02389      6.63490     7.87944
5      11.0705     12.8325      15.0863     16.7496
10     18.3070     20.4832      23.2093     25.1882
15     24.9958     27.4884      30.5779     32.8013
20     31.4104     34.1696      37.5662     39.9968
Contingency Tables
$\chi^2 = \sum (o_n - e_n)^2 / e_n$
              Column (k)
Row (h)       I       II      Total
A             a1      a2      NA
B             b1      b2      NB
Total         N1      N2      N
Applying the cell frequency coding above, the expected frequencies are
computed as follows:
              Column (k)
Row (h)       I         II        Total
A             N1NA/N    N2NA/N    NA
B             N1NB/N    N2NB/N    NB
Total         N1        N2        N
For example:
Given a placebo treatment 81 35
Assuming that all subjects told the truth, test the hypothesis that there is
no difference between treatment by hypnotherapy and the placebo at a
level of significance of 0.05
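[Figure: chi-square distribution starting at 0, with critical value CV = 3.84146; the region of area 1 - α = 0.95 below CV is the acceptance region (accept H0) and the region of area α = 0.05 above CV is the rejection region (reject H0).]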
If the calculated value of χ² is less than 3.84146 we must accept the null
hypothesis, i.e. that there is no difference between the hypnotherapy treatment
and the placebo. Otherwise reject.
4. Computation of χ²
5. Conclusion
Since the calculated value of 2.56643 is less than the critical value of 3.84146,
the null hypothesis cannot be rejected at the 0.05 level of significance: there is no
significant difference between the two treatments.
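The expected-frequency rule and the χ² sum for a contingency table can be sketched in Python as below. The counts in `example` are hypothetical and are not the hypnotherapy data, which are not reproduced in full here. The function names are illustrative, and no continuity correction is applied.

```python
def expected_frequencies(table):
    """Expected count for each cell: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical 2 x 2 table (rows A and B, columns I and II).
example = [[30, 20],
           [20, 30]]

chi2 = chi_square(example)
critical = 3.84146    # chi-square critical value for v = 1 at the 0.05 level

print(f"expected = {expected_frequencies(example)}")
print(f"chi-square = {chi2:.2f}")
print("reject H0" if chi2 > critical else "accept H0")
```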
VIII
Linear Regression and Correlation
If X and Y denote two variables under consideration, a scatter diagram shows the
location of the points (X, Y) on a rectangular co-ordinate system. If all the points in this
scatter diagram seem to lie near a straight line, the correlation or fit is termed
linear.
[Figure: scatter diagrams illustrating different patterns of association; panel titles include "Curvilinear" and "No Correlation or 'Fit'".]
Linear Regression
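Very often in practice a linear relationship is found to exist between two variables,
say X and Y. It is frequently desirable to express this relationship in
mathematical form by determining the regression equation that connects these
two variables. The least squares line is the one with the best goodness of fit, in that
the sum of the squared deviations (errors) is a minimum. The least squares line can
be expressed:
Y = a + bX
Where a is the intercept of the line with the vertical axis, b is the slope of the
line, and they can be computed:
$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - (\sum X)^2}$
$b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$
[Figure: the line Y = a + bX plotted against X, showing the intercept a on the vertical axis and the slope b = d/c, where d is the rise in Y over a run c in X.]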
Compute the regression equation for the following data, using X as the
independent variable:
X 1 3 4 6 8 9 11 14
Y 1 2 4 4 5 7 8 9
[Figure: scatter diagram of Y (vertical axis, 0 to 15) against X (horizontal axis, 0 to 20) for the data above.]
The scatter diagram indicates that the estimate for the value a is positive, and
lies between 0 and 5
X      Y      X²     XY     Y²
1      1      1      1      1
3      2      9      6      4
4      4      16     16     16
6      4      36     24     16
8      5      64     40     25
9      7      81     63     49
11     8      121    88     64
14     9      196    126    81
Σ 56   40     524    364    256
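The normal equations for the least squares line are:
$\sum Y = aN + b\sum X$
$\sum XY = a\sum X + b\sum X^2$
hence,
40 = 8a + 56b . . . ... ... (1)
364 = 56a + 524b . . . ... ... (2)
Multiplying (1) by 7 and subtracting the result from (2) gives 84 = 132b, so b = 0.636.
Substituting in (1):
40 = 8a + 56(0.636)
= 8a + 35.6364
a = (40 – 35.6364) / 8
= 0.545
Y = 0.545 + 0.636X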
The regression equation allows us to compute the value of Y for any value of X.
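The least squares coefficients for this example can be computed directly from the column totals. The Python sketch below uses the data from the worked example; the helper name `predict` and the trial value X = 10 are illustrative.

```python
x = [1, 3, 4, 6, 8, 9, 11, 14]
y = [1, 2, 4, 4, 5, 7, 8, 9]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_xy = sum(a * b for a, b in zip(x, y))

# Slope and intercept of the least squares line Y = a + bX.
denominator = n * sum_x2 - sum_x ** 2
b = (n * sum_xy - sum_x * sum_y) / denominator
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator

def predict(value):
    """Estimated Y for a given X on the fitted line."""
    return a + b * value

print(f"Y = {a:.3f} + {b:.3f}X")
print(f"estimated Y at X = 10: {predict(10):.2f}")
```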
Linear Correlation
The statistic r is called the coefficient of correlation and is defined by the equation
for Pearson's Product-Moment formula:
$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]\}^{0.5}}$
The equation automatically gives the sign of r as well as clearly showing the
symmetry between the variables X and Y. It should be noted that in the instance
of linear regression, the quantity r is the same, regardless of whether X or Y is
considered as the independent variable. Thus, r gives a very good measure of
the linear relationship between two variables and lies between –1 and +1.
The closer the value of r is to 0, the weaker the linear relationship between the
variables.
For the data of the regression example:
$r = 672 / [(1056)(448)]^{0.5}$
= 0.977
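The coefficient can be checked from the raw data of the regression example with a short Python sketch. The function name is illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [1, 3, 4, 6, 8, 9, 11, 14]
y = [1, 2, 4, 4, 5, 7, 8, 9]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
# Swapping the variables leaves r unchanged, as the symmetry of the formula implies.
print(f"r with X and Y swapped = {pearson_r(y, x):.3f}")
```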
Coefficient of Determination
The ratio of the explained variation to the total variation is termed the
coefficient of determination. If there is no explained variation (i.e. r = 0), this
ratio is zero. If there is no unexplained variation (i.e. r = ±1), this ratio is one.
Since the ratio is never negative, we denote the coefficient of determination as:
$r^2 = \frac{\sum (Y_{estimate} - Y_{mean})^2}{\sum (Y - Y_{mean})^2}$
However, it should be noted that neither the coefficient of correlation nor the
coefficient of determination measures a cause-and-effect relationship, but
merely the strength of association between two or more variables. For instance,
research in the 1930s showed a strong positive correlation between the
incidence of prostitution and the output of steel in Pennsylvania. It would be
nonsense to claim that the former 'causes' the latter.
The n pairs of values (X, Y) of two variables can be thought of as a sample from
a population of all possible such pairs. Hence, this bivariate population can be
assumed to have a bivariate normal distribution. The population coefficient of
correlation may be denoted ρ. Tests of significance concerning ρ require the
sampling distribution of r:
The former equation (1) is more generally applicable. Hence, for r = 0.977 and n = 8:
$t = r[(n-2)/(1-r^2)]^{0.5} = 0.977[6/(1-0.977^2)]^{0.5} = 11.22$
For a two-sided test, the critical value for t at the 0.05 level of significance and for
v = (8 – 2) = 6 degrees of freedom is ± 2.447. Since the calculated t value of
11.22 is greater than the critical value of +2.447, we must reject the null
hypothesis and conclude that the coefficient of correlation differs significantly from 0.
Hence, our observation is significant at the 5 percent level (see footnote 7: for v =
6, r must be greater than 0.7067 at the 5 percent level of significance).
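A sketch of the significance test for a correlation coefficient follows, using the usual statistic t = r[(n - 2)/(1 - r²)]^0.5 with n - 2 degrees of freedom. The critical value of 2.447 is the tabulated value quoted in the text; the function name is illustrative.

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing whether a population correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.977, 8       # correlation and sample size from the regression example
critical = 2.447      # t(0.025) for n - 2 = 6 degrees of freedom, from tables

t = t_for_correlation(r, n)
print(f"t = {t:.2f}")
print("reject H0" if abs(t) > critical else "accept H0")
```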
$r_{pbis} = \frac{X_0 - X_1}{\sigma_x}[p(1-p)]^{0.5}$
where
X0 = mean score of X for respondents scoring 0 on Y = 54.67
X1 = mean score of X for respondents scoring 1 on Y = 49.00
σx = standard deviation of all scores X = 4.820
p = proportion of respondents scoring 0 on Y = 0.6
(1-p) = proportion of respondents scoring 1 on Y = 0.4
Thus,
$r_{pbis} = (5.67/4.82)(0.24)^{0.5}$
= 0.5763
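The point-biserial computation can be sketched in Python with the values listed above. The function and argument names are illustrative.

```python
import math

def point_biserial(mean_0, mean_1, sd_all, p):
    """Point-biserial correlation from group means, overall SD and proportion p."""
    return (mean_0 - mean_1) / sd_all * math.sqrt(p * (1 - p))

r_pbis = point_biserial(mean_0=54.67, mean_1=49.00, sd_all=4.820, p=0.6)
print(f"r_pbis = {r_pbis:.4f}")
```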
7 See White, J., Yeats, A. and Skipworth, (1991), Tables for Statisticians, Table 15, p.29, Critical
values of the product-moment correlation.