SCR 314 Social Statistics Lecture Notes 2021
unemployment, population growth rate, housing, schooling, medical facilities, crime rate, living
standards, etc. in a country. In particular, statistics holds a central position in almost every field,
such as Industry, Commerce, Trade, Physics, Chemistry, Economics, Mathematics, Biology, Botany,
Psychology, Astronomy, Criminology, Social Work, etc. In essence, the application of statistics is
very wide. Statistics are important for many reasons. For example, statistics may be useful in:
1. The evaluation of the quality of services available to a particular group or organization
2. Analyzing behaviors of groups of people in their environment and special situations
3. Determining the wants or needs of people through statistical sampling
4. Providing simple yet instant information on the matter it centers on.
5. Statistical methods are useful tools in aiding research and studies in different fields such as
economics, social sciences, business, medicine and many others.
6. Provides a vivid presentation of collected and organized data through the use of figures,
charts, diagrams and graphs.
7. Helps provide more critical analyses of information to enable decision making, law making
or policy formulation.
8. Statistical techniques are used to make decisions that affect our daily lives. That is, they
affect our personal welfare.
Examples of application of statistics in various fields:
Statistics in School
• May be used to see how the students are performing collectively in their studies.
• Gives information about the school's population change for planning and allocation of resources.
• Helps in processing certain evaluations and surveys given to help improve the school's system.
• Determines the relationship of educational performance to other factors such as socioeconomic background, gender, and region.
Statistics in Social Science
• Helps in providing the government more information about its citizens for planning and allocation of resources.
• Statistical results may initiate social reforms that would help improve the standard of living.
• Aids in knowing which problems or matters to prioritize and give much attention to.
Statistics in Sports
• Gives a vivid summary of the events in a game with the help of well-tabulated scores and other parameters.
Statistics in Science
• Endangered species of different wildlife could be protected through regulations and laws developed using statistics.
• Epidemics and diseases are monitored with the aid of statistics.
• Helps in the evaluation of certain medical practices and the effectiveness of drugs.
Statistics in Criminology
Question:
Discuss the place of statistics in Criminology and Social Work
a) Criminology
• Determine the predominant crimes in society
• Establish the perpetrators of the various crimes in society
• Establish the victims of various crimes
• Formulate laws/policies based on statistics
In summary, there are at least three reasons for studying statistics:
(1) Data are everywhere,
(2) Statistical techniques are used to make many decisions that affect our lives, and
(3) No matter what your career, you will make professional decisions that involve data. An
understanding of statistical methods will help you make these decisions more effectively
TOPIC 2: PLANNING, CODING, GROUPING AND PROCESSING DATA
• Data collection plan refers to a document that defines all details concerning data
collection, including how much and what type of data is required, when and how it
should be collected.
• Some common data collection instruments used in the social sciences include
questionnaires, interview schedules, observation schedules, focus group discussions and
document analysis.
Purpose of data collection
According to Kombo, D. K. et al. (2006), data collection is done for the following reasons:
• Stimulate new ideas by identifying areas related to the research topic
• Create awareness and improvement by highlighting the situation as it appears
• Influence legislative policies and regulations
• Provide justification for an existing programme
• Evaluate the responsiveness and effectiveness of the study
• Promote decision making and resource allocation based on solid evidence
Why should we plan for data collection?
• It helps to ensure that the data gathered contain real information, useful to the
improvement effort.
• It prevents errors that commonly occur in the data collection process.
• It saves time and money that otherwise might be spent on repeated or failed attempts to
collect useful data.
Preparation for data collection
The success of any study depends on how well one plans for the actual data. Most important are
the data collectors in the field who are in charge of gathering and recording accurate and reliable
data. Therefore their preparation cannot be taken for granted. In preparation for data collection
one needs to:
• plan the data collection visits;
• prepare the data collection forms needed for field visits;
• prepare information materials and tools for data collectors; and
• arrange for regular communications.
Steps for an effective data collection plan:
• A schedule of visits to survey sites
• The contact details of the sites to be visited
• Copies of letter(s) of endorsement and introduction
• Relevant handouts or instruction sheets
• Pens (pencils should not be used to record data), a clipboard and other supplies
• A field notebook to record any significant events
• Field allowance for local expenses
• Get advice from affected people
Principles of data collection
• Explore specific needs, i.e. the vulnerability of various groups, to ensure that their benefits are considered
• Examine whether the information, event or opinion is accurate
• Consider information that may be misunderstood
• Identify neglected groups to ensure their benefits are considered
• Understand changes and trends affecting the society
• Prepare for unexpected events that may happen
• Examine the effects of the study on the overall society
• Keep in mind how the information will be used
Recruiting and training data collectors
• Good listeners
• Fast in responding to training, able to follow instructions and apply the protocol
• Researchers and their assistants should be open and feel responsible for the needs of participants
• Try to put yourself in the participant's shoes and think about what the participant would find unpleasant
• The researcher should provide concrete information prior to and during recruitment, including what exactly is expected from participants.
• The researcher should pass on any complaint from a participant immediately to the project leader.
• The researcher must justify the research through an analysis of the balance of costs and benefits.
• Researchers should obtain consent from the subjects used, and subjects should give information voluntarily.
• Researchers should be open and honest with other researchers and research subjects.
• Researchers must fully explain the research in advance and 'debrief' subjects afterwards.
• Use electronic data processing machines in studies involving a large number of cases or complex analysis procedures.
• Number of cases.
Coding data
• As much information as possible should be included at this stage to avoid losing details that might otherwise be omitted.
• Understand the coding scheme for consistency.
• Code categories in the instruments should be exhaustive (covering every possible response) and mutually exclusive (only one code assigned to each response category).
• The choice of coding procedure is made according to the level of the indicator or the type of data (numeric or categorical) you have, as illustrated in the sketch below.
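To make the idea of a coding scheme concrete, here is a minimal Python sketch (not part of the original notes) that applies a small codebook to questionnaire responses. The questions, response categories and numeric codes are hypothetical illustrations only.

```python
# Minimal sketch of a coding scheme: mapping response categories to numeric codes.
# The questions, categories and codes below are hypothetical illustrations.

# Codebook: one exhaustive, mutually exclusive set of codes per question
codebook = {
    "gender": {"male": 1, "female": 2},
    "employment": {"employed": 1, "unemployed": 2, "student": 3, "other": 9},
}

# Raw responses as they might come off a questionnaire
responses = [
    {"gender": "female", "employment": "student"},
    {"gender": "male", "employment": "employed"},
]

# Apply the codebook consistently to every case
coded = [
    {question: codebook[question][answer] for question, answer in case.items()}
    for case in responses
]

print(coded)  # [{'gender': 2, 'employment': 3}, {'gender': 1, 'employment': 1}]
```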
TOPIC 3: SAMPLING AND SAMPLING TECHNIQUES
Research involves studying a particular phenomenon to establish its position/status. The phenomenon the
researcher is interested in is called the target population.
What is a target population?
1. Are the members of a real or hypothetical set of people, events or objects which the researcher
wishes to generalize the results of the findings
2. Is a set of people or objects the researcher intends to reach or question
3. Is the population of individuals which we are interested in describing and making statistical
inferences
4. Is the collection of all individuals, families, groups, organizations, and events that we are
interested in finding out about. For example, all undergraduate students in MMUST
However, in research it is usually not possible to study the whole population as is done in a census, because of the cost, time and logistical requirements involved. Social scientists therefore mainly opt to study a portion of the target population with a view to making inferences about the target population. To do so they employ the concept of sampling.
What is sampling?
1. The act, process, or technique of selecting a representative part of a population for the purpose of
determining parameters or characteristics of the whole population
2. The process of selecting a sub-set of people, events, cases or objects from a set in order to draw
conclusions about the entire set
3. Statistical method of obtaining representative data or observations from a group (lot, batch,
population, or universe).
4. Sampling is the process of taking any portion of a population or universe as representative of that population or universe.
The end product in the sampling process is a sample. Samples are used in statistical testing when
population sizes are too large for the test to include all possible members or observations. A sample
should represent the whole population and not reflect bias toward a specific attribute.
What is a sample?
1. A subset containing the characteristics of a larger population
2. A set of individuals or items selected from a population for analysis to yield estimates of, or to
test hypotheses about, parameters of the whole population.
3. A sample is a smaller, manageable version of a larger group.
4. A portion, piece, or segment that is representative of a whole
5. A portion of the members of a set of people, events or objects which the researcher wishes to use
to generalize the results of the findings
6. A sample is some part of a larger body specially selected to represent the whole.
Types of samples
1. A biased sample
A sample in which the items are not selected as a result of probability; items are selected because they share some property.
2. A random sample
Is a sample whose selection is not biased but subject to probability where each member in the
population has a chance to be selected.
Sampling techniques
Are strategies or methods used in selecting a sample from the target population.
There are mainly two types of sampling techniques in research namely probability and non-probability
sampling.
1. Probability sampling
Is a method of sampling that utilizes some form of random selection. In order to have a random selection
method, you must set up some process or procedure that assures that the different units in your population
have equal probabilities of being chosen. Humans have long practiced various forms of random selection,
such as picking a name out of a hat, or choosing the short straw. These days, we tend to use computers as
the mechanism for generating random numbers as the basis for random selection
Probability sampling methods
a) Simple Random Sampling
Is a sampling scheme in which every possible subset of the given size from the population is equally likely to be chosen.
A way of selecting the sample is by means of a table of random numbers. SRS can be
with or without replacement.
b) Systematic Sampling/Interval Random Sampling
Where each element in the population has the same chance of being selected for the sample.
Where every Kth person, starting with a person randomly selected from among the first K
persons is selected.
This method is referred to as a systematic sample with a random start.
c) Stratified Sampling
This is where populations are classified into strata and separate samples selected from
each strata.
The ultimate function of stratification is to organize the population into homogeneous
subsets and to select the appropriate number of elements from each.
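The three probability sampling methods described above can be illustrated with a short Python sketch. It is a toy illustration only: the population of 100 numbered units, the sample sizes and the two strata are assumptions made for the example and are not taken from these notes.

```python
import random

# Hypothetical sampling frame of 100 numbered units (e.g. households)
population = list(range(1, 101))
random.seed(42)  # for a reproducible illustration

# Simple random sampling: every subset of size 10 is equally likely
srs = random.sample(population, k=10)

# Systematic sampling: every Kth unit after a random start among the first K
k = len(population) // 10            # sampling interval K = 10
start = random.randrange(k)          # random start within the first K units
systematic = population[start::k]

# Stratified sampling: split the population into strata, then sample from each stratum
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [unit for stratum in strata.values()
              for unit in random.sample(stratum, k=5)]

print(srs, systematic, stratified, sep="\n")
```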
Reasons for Stratification
TOPIC 4: CLASSIFICATION AND TABULATION OF DATA
1. Classifications of data
Data classification is the categorization of data for its most effective and efficient use.
a) According to Nature
Data can either be:
i) Quantitative data
The information obtained from numerical variables, e.g. age, bills, etc.
ii) Qualitative Data
It is the information obtained from variables in the form of categories, characteristics, names or labels or
alphanumeric variables (e.g. birthdays, gender etc.)
b) According to Source
i) Primary data
First-hand information obtained directly from the original source, e.g. an autobiography or a financial statement.
ii) Secondary data
This is second-hand information obtained from biography, weather forecast, newspapers etc.
c) According to Measurement
i) Discrete data
Are countable numerical observations which assume whole numbers only
- has an equal whole number interval
- obtained through counting (e.g. corporate stocks, etc.)
ii) Continuous data
Measurable observations that assume both whole and decimals or fractions
-obtained through measuring (e.g. bank deposits, volume of liquid etc.)
d) According to arrangement
i) Ungrouped data
Is raw data which has been obtained from the field and in its natural form with no specific arrangement
ii) Grouped Data
An organized set of data arranged in a particular form, such as a tally, a simple frequency table or a grouped frequency table.
Tabulation of data
It is cumbersome to study or interpret large data without grouping it, even if it is arranged sequentially.
For this, the data are usually organized into groups called classes and presented in a table which gives the
frequency in each group. Such a frequency table gives a better overall view of the distribution of data and
enables a person to rapidly comprehend important characteristics of the data.
The process of placing classified data into tabular form is known as tabulation. A table is a symmetric
arrangement of statistical data in rows and columns. Rows are horizontal arrangements whereas columns
are vertical arrangements. It may be simple, double or complex depending upon the type of classification.
Simple frequency tables
It is a table containing raw data arranged in ascending or descending order, indicating the frequency of each value.
Example
A survey was taken on Maple Avenue. In each of 20 homes, people were asked how many cars were
registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Arrange the data in a simple frequency table
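As an illustration (not part of the original notes), the short Python sketch below builds the simple frequency table for the car data using a counter; the resulting counts are 0 → 4, 1 → 6, 2 → 5, 3 → 3, 4 → 2, giving a total of 20 homes.

```python
from collections import Counter

# Number of cars registered per household in the 20 homes surveyed
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

# Count how often each value occurs and print the table in ascending order
frequency = Counter(cars)
print("Cars  Frequency")
for value in sorted(frequency):
    print(f"{value:>4}  {frequency[value]:>9}")
print("Total", sum(frequency.values()))   # Total 20
```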
Grouped frequency tables organize the data into class intervals, for example:
Class interval   Frequency
360–369          2
370–379          3
380–389          5
390–399          7
400–409          5
410–419          4
420–429          3
430–439          1
Total            30
6.0 Probability
Probability is a branch of mathematics that deals with calculating the likelihood of a given event's occurrence; it can be expressed as a fraction, a decimal or a percentage. The probability of an event ranges between 0 and 1. An event with a probability of 1 is considered a certainty: for example, the probability of dying. An event with a probability of 0 is considered an impossibility: for example, the probability of being God.
In general, for any event A, the minimum value is P(A) = 0 and the maximum value is P(A) = 1. Therefore the probability of any event A satisfies 0 ≤ P(A) ≤ 1.
Definition of terms
a) Sample Space
It is the set of all possible elementary events or outcomes for the experiment.
Examples
i. For the experiment of throwing a coin, S = {H, T} where "S" represents the sample space, "H"
the elementary event of getting a head and "T" the elementary event of getting a tail.
ii. For the experiment of rolling a dice, S = {1, 2, 3, 4, 5, 6}, where "S" represents the sample space,
"1" the elementary event of the number 1 appearing on the top of the dice, "2" the elementary
event of the number 2 appearing on the top of the dice, ... and "6" the elementary event of the
number 6 appearing on the top of the dice.
iii. For the experiment of tossing 3 coins, S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
iv. For the experiment of rolling 2 dice, S = {(1,1), (1,2), …, (6,6)}, which contains 36 equally likely outcomes.
v. For the experiment of throwing a dice and a coin, S = {(1,H), (2,H), …, (6,H), (1,T), …, (6,T)}, which contains 12 outcomes.
b) Event
It is any subset of the sample space or an event is one or more outcomes of an experiment.
Examples
i. In the experiment of rolling a dice, whose sample space is Ω = {1, 2, 3, 4, 5, 6}
a. Event of getting an even number (say Event "A"), would be represented as A = {2, 4, 6}
b. Event of getting a prime number (say Event "H"), would be represented as H = {2, 3, 5}
c. Event of getting a multiple of 3 (say Event "F"), would be represented as F = {3, 6}
A, H and F are all Subsets of Ω
ii. In the experiment of tossing 3 coins whose sample space is S = {HHH, HHT, HTH, HTT, THH,
THT, TTH, TTT}
a. Event of getting exactly two heads (say Event "C"), would be represented as C = {HHT, HTH, THH}
b. Event of getting all three of the same kind (say Event "G"), would be represented as G =
{HHH, TTT}. C and G are Subsets of S.
c) An outcome
An outcome is the result of a single trial of an experiment.
Example
Landing on yellow, blue, green or red when spinning a spinner
Getting a 1, 2, 3, 4, 5 or 6 when throwing a dice
Probability of an event
The probability of event A is the number of ways event A can occur divided by the total number
of possible outcomes.
P(A) = (the number of ways event A can occur) / (the total number of possible outcomes)
Example
Experiment 1: A spinner has 4 equal sectors colored yellow, blue, green and red. After spinning the spinner, what is the probability of landing on each color?
Outcomes: The possible outcomes of this experiment are yellow, blue, green, and red. Since the four sectors are equal, each outcome is equally likely, so P(yellow) = P(blue) = P(green) = P(red) = 1/4.
Experiment 2: A single 6-sided die is rolled. What is the probability of each outcome? What is
the probability of rolling an even number? or rolling an odd number?
P(2) = (# of ways to roll a 2) / (total # of sides) = 1/6, and similarly P(1) = P(2) = … = P(6) = 1/6.
P(even) = P({2, 4, 6}) = 3/6 = 1/2 and P(odd) = P({1, 3, 5}) = 3/6 = 1/2.
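The same calculation can be expressed as a short Python sketch (an illustration added to these notes) that enumerates the sample space of a single die and applies P(A) = (outcomes in A) / (total outcomes):

```python
from fractions import Fraction

# Sample space for rolling a single 6-sided die
sample_space = [1, 2, 3, 4, 5, 6]

def probability(event):
    """P(event) = (# of outcomes in the event) / (total # of possible outcomes)."""
    return Fraction(len([o for o in sample_space if o in event]), len(sample_space))

print(probability({2}))        # 1/6
print(probability({2, 4, 6}))  # P(even) = 1/2
print(probability({1, 3, 5}))  # P(odd)  = 1/2
```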
Mutually Exclusive Events
Two events are mutually exclusive if they cannot both occur at the same time.
Examples:
Landing on a 1 when rolling a dice excludes landing on a 2, 3, 4, 5, or 6.
Failing excludes passing and vice versa.
In general if two events A and B are mutually exclusive, then their probability of occurring is given by:
P (A or B) = P (A) + P (B)
Independent Events
Definition: Two events, A and B, are independent if the fact that A occurs does not affect the
probability of B occurring.
Some other examples of independent events are:
Landing on heads after tossing a coin AND rolling a 5 on a single 6-sided die.
Choosing a marble from a jar AND landing on heads after tossing a coin.
Choosing a 3 from a deck of cards, replacing it, AND then choosing an ace as the second card.
Rolling a 4 on a single 6-sided die, AND then rolling a 1 on a second roll of the die.
In general if two events A and B are independent, then their probability of occurring is given by:
P (A and B) = P (A) * P (B)
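The addition rule for mutually exclusive events and the multiplication rule for independent events can be checked by enumerating sample spaces. The short Python sketch below (an added illustration, not from the original notes) does this for one die roll and for two successive rolls:

```python
from fractions import Fraction
from itertools import product

die = range(1, 7)

# Mutually exclusive events on one die: P(1 or 2) = P(1) + P(2)
p_1_or_2 = Fraction(len([o for o in die if o in (1, 2)]), 6)
print(p_1_or_2, Fraction(1, 6) + Fraction(1, 6))   # both 1/3

# Independent events: rolling a 4 on the first roll AND a 1 on the second roll
two_rolls = list(product(die, die))                # 36 equally likely pairs
p_4_then_1 = Fraction(len([r for r in two_rolls if r == (4, 1)]), len(two_rolls))
print(p_4_then_1, Fraction(1, 6) * Fraction(1, 6)) # both 1/36
```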
WORKED EXAMPLES OF NON-PARAMETRIC TESTS
A contingency table is a table that shows the relationship between two categorical variables. The
Chi-square statistic reflects the strength of this relationship. All else equal, the greater the chi-
square statistic, the stronger the relationship. The chi-square statistic is usually reported at the
bottom of a contingency table. The probability (p-value) associated with the chi-square statistic indicates
how likely a chi-square value this large would be to arise by chance alone if there were no relationship
between these same two variables in the population from which you drew your sample.
• Third – If your probability is .05 or less, then you can generalize from a random sample to a
population, and claim the two variables are associated in the population.
Question 1
A public opinion poll surveyed a simple random sample of voters to establish whether there is a
relationship between gender and voting preference. Respondents were classified by gender (male
or female) and by voting preference (Republican, Democrat, or Independent). Results are shown
below.
Voting Preferences
Gender Republican Democrat Independent
Male 200 150 50
Female 250 300 50
a) Giving reasons, which is the best statistical test for analysing the data?
b) State and explain any three assumptions of the test statistic identified in a above
c) State the null hypothesis of the study.
d) At 0.05 significance level, does the data support the null hypothesis?
e) Report your findings.
Voting Preferences
Gender Republican Democrat Independent Total
Male 200 150 50 400
Female 250 300 50 600
Total 450 450 100 1000
20
At df = 2 and significance level = 0.05, the chi-square critical value is 5.99 from the table of critical values of the chi-square distribution.
Since the calculated chi-square (16.2037) is greater than the critical chi-square (5.99), we reject the null hypothesis that voting preference is independent of gender; the data indicate that gender and voting preference are associated.
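The worked figure of 16.2037 can be reproduced with a short Python sketch (added for illustration). It computes each expected count as (row total × column total) / grand total and then sums (Observed − Expected)² / Expected over all six cells:

```python
# Observed counts from the contingency table above
observed = {
    ("Male",   "Republican"): 200, ("Male",   "Democrat"): 150, ("Male",   "Independent"): 50,
    ("Female", "Republican"): 250, ("Female", "Democrat"): 300, ("Female", "Independent"): 50,
}

rows = ["Male", "Female"]
cols = ["Republican", "Democrat", "Independent"]
grand_total = sum(observed.values())                                   # 1000

row_totals = {r: sum(observed[(r, c)] for c in cols) for r in rows}    # 400, 600
col_totals = {c: sum(observed[(r, c)] for r in rows) for c in cols}    # 450, 450, 100

# Expected count for each cell = (row total * column total) / grand total
expected = {(r, c): row_totals[r] * col_totals[c] / grand_total for r in rows for c in cols}

# Chi-square = sum of (Observed - Expected)^2 / Expected over all cells
chi_square = sum((observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in observed)
df = (len(rows) - 1) * (len(cols) - 1)

print(round(chi_square, 4), df)   # 16.2037 with df = 2
```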
Question 2
A researcher sought to establish the relationship between stream and perception in mathematics.
The results are tabulated below.
Perception in mathematics
Stream Excellent Average Poor
3A 88.7 60.2 40.1
3B 82.6 64.2 37.3
3C 85.6 66.4 42.8
a) Giving reasons, which is the best statistical test for analysing the data?
b) State the null hypothesis to be tested.
c) At 0.05 significance level, does the data support the null hypothesis?
d) Report your findings.
Critical Values of Chi square
Level of Significance
df 0.20 0.10 0.05 0.02 0.01 0.001
1 1.642 2.706 3.841 5.412 6.635 10.828
2 3.219 4.605 5.991 7.824 9.210 13.816
3 4.642 6.251 7.815 9.837 11.345 16.266
4 5.989 7.779 9.488 11.668 13.277 18.467
5 7.289 9.236 11.070 13.388 15.086 20.515
6 8.558 10.645 12.592 15.033 16.812 22.458
7 9.803 12.017 14.067 16.622 18.475 24.322
8 11.030 13.362 15.507 18.168 20.090 26.124
9 12.242 14.684 16.919 19.679 21.666 27.877
10 13.442 15.987 18.307 21.161 23.209 29.588
11 14.631 17.275 19.675 22.618 24.725 31.264
12 15.812 18.549 21.026 24.054 26.217 32.909
13 16.985 19.812 22.362 25.472 27.688 34.528
2. SPEARMAN’S RANK CORRELATION
The Spearman rank-order correlation coefficient, also referred to as the Spearman correlation coefficient or Spearman's rho, is a nonparametric measure of the strength and direction of the association that exists between two variables measured on at least an ordinal scale. It is denoted by the symbol rs (or the Greek letter ρ, pronounced "rho"). The test is used either for ordinal variables or for continuous data that have failed the assumptions necessary for conducting Pearson's product-moment correlation. For example, you could use a Spearman's correlation to understand whether there is an association between exam performance and time spent revising; whether there is an association between depression and length of unemployment; and so forth.
Assumptions of Spearman’s Rank Correlation
1. Your two variables should be measured on an ordinal, interval or ratio scale.
2. There needs to be a monotonic relationship between the two variables. A monotonic
relationship exists when either the variables increase in value together, or as one variable
value increases, the other variable value decreases. Whilst there are a number of ways to
check whether a monotonic relationship exists between your two variables, we suggest
creating a scatterplot using SPSS Statistics, where you can plot one variable against the other,
and then visually inspect the scatterplot to check for monotonicity. The following graphs
illustrate monotonic functions:
[Figure: scatterplots illustrating monotonic and non-monotonic relationships. Not monotonic: as the x variable increases, the y variable sometimes decreases and sometimes increases.]
Spearman's correlation coefficient
Spearman's correlation coefficient is a statistical measure of the strength of a monotonic relationship between paired data. In a sample it is denoted by rs and is by design constrained as follows: −1 ≤ rs ≤ 1.
Its interpretation is similar to that of Pearson's: the closer rs is to ±1, the stronger the monotonic relationship. Correlation is an effect size, so we can verbally describe the strength of the correlation using the following guide for the absolute value of rs:
.00–.19 "very weak"; .20–.39 "weak"; .40–.59 "moderate"; .60–.79 "strong"; .80–1.0 "very strong".
Formula for calculating Spearman's correlation coefficient
rs = 1 − (6Σd²) / (n(n² − 1))
where Σd² is the sum of the squared differences between the pairs of ranks, and n is the number of pairs.
The advantages of this coefficient are that, if calculation is to be done by hand, it is easier to
calculate, and can be used for any data that can be ranked - which includes quantitative data.
5. Use the formula to calculate the Spearman's correlation coefficient.
Example
A researcher sought to establish whether the price of a bottle of water decreases as distance from the Contemporary Art Museum increases. The results are tabulated below.
Convenience Store Distance from CAM (m) Price of 50cl bottle (€)
1 50 1.80
2 175 1.20
3 270 2.00
4 375 1.00
5 425 1.00
6 580 1.20
7 710 0.80
8 790 0.60
9 890 1.00
10 980 0.85
a) Which is the best test statistic suitable to analyse the data and why?
b) State the assumptions of the identified test statistic
c) State the null hypothesis to be tested
There is no significant relationship between the price of a convenience item and distance
from the Contemporary Art Museum.
d) Does the data support the null hypothesis?
Calculate the value of Spearman's correlation coefficient using steps 1–4 above.
Convenience Store   Distance from CAM (m)   Rank (distance)   Price of 50cl bottle (€)   Rank (price)   Difference between ranks (d)   d²
1                   50                      10                1.80                       2              8                              64
2                   175                     9                 1.20                       3.5            5.5                            30.25
3                   270                     8                 2.00                       1              7                              49
4                   375                     7                 1.00                       6              1                              1
5                   425                     6                 1.00                       6              0                              0
6                   580                     5                 1.20                       3.5            1.5                            2.25
7                   710                     4                 0.80                       9              -5                             25
8                   790                     3                 0.60                       10             -7                             49
9                   890                     2                 1.00                       6              -4                             16
10                  980                     1                 0.85                       8              -7                             49
                                                                                                        Σd² =                          285.5
The answer will always be between 1.0 (a perfect positive correlation) and -1.0 (a perfect
negative correlation).
Now to put all these values into the formula.
Find the sum of all the d² values by adding up the values in the d² column. In our example this is 285.5. Multiplying this by 6 gives 1713.
Now for the bottom line of the equation. The value n is the number of sites at which you took measurements, which in our example is 10. Substituting these values into n(n² − 1) we get 10(10² − 1) = 10(100 − 1) = 10 × 99 = 990.
We now have: rs = 1 − (1713/990) = 1 − 1.7303 = −0.7303.
What does this rs value of -0.73 mean?
The closer rs is to +1 or -1, the stronger the likely correlation. A perfect positive correlation
is +1 and a perfect negative correlation is -1. The rs value of -0.7303 suggests a fairly strong
negative relationship.
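The hand calculation above can be reproduced with a short Python sketch (added for illustration). It assigns average ranks to tied values, ranking both variables from largest to smallest as in the worked table, and then applies rs = 1 − 6Σd² / (n(n² − 1)):

```python
# Distances (m) and prices (EUR) from the worked example above
distance = [50, 175, 270, 375, 425, 580, 710, 790, 890, 980]
price    = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]

def rank(values, descending=False):
    """Average ranks: tied values share the mean of the ranks they would occupy."""
    ordered = sorted(values, reverse=descending)
    return [
        sum(i + 1 for i, v in enumerate(ordered) if v == x) / ordered.count(x)
        for x in values
    ]

d_rank = rank(distance, descending=True)   # farthest store gets rank 1, as in the table
p_rank = rank(price, descending=True)      # highest price gets rank 1, as in the table

d_squared = [(a - b) ** 2 for a, b in zip(d_rank, p_rank)]
n = len(distance)
r_s = 1 - (6 * sum(d_squared)) / (n * (n ** 2 - 1))

print(sum(d_squared), round(r_s, 4))       # 285.5 and -0.7303, matching the worked example
```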
A further technique is now required to test the significance of the relationship. We do so by comparing the calculated Spearman rank correlation statistic with the critical Spearman rank correlation statistic. To get the latter, we read the table of critical values at the point where the level of significance for a two-tailed or one-tailed test (depending on how the Ho is stated) meets the degrees of freedom (df) of the Spearman rank correlation.
The degrees of freedom for the Spearman rank correlation are given by the number of pairs in your sample minus 2, i.e. df = n − 2. In the example df = 10 − 2 = 8. Our level of significance is 0.05 on a two-tailed test. In the table of critical values for Spearman's rank correlation coefficient, the value at the meeting point of the 0.05 level of significance and df = 8 is 0.738.
Critical values for Spearman's rank correlation coefficient
df (n-2)   Two-sided level of significance: .10   .05   .01
5 .900 -- --
6 .829 .886 --
7 .714 .786 .929
8 .643 .738 .881
9 .600 .700 .833
10 .564 .648 .794
11 .536 .618 .818
12 .497 .591 .780
13 .475 .566 .745
14 .457 .545 .716
15 .441 .525 .689
16 .425 .507 .666
17 .412 .490 .645
18 .399 .476 .625
19 .388 .462 .608
20 .377 .450 .591
21 .368 .438 .576
22 .359 .428 .562
23 .351 .418 .549
24 .343 .409 .537
25 .336 .400 .526
26 .329 .392 .515
27 .323 .385 .505
28 .317 .377 .496
29 .311 .370 .487
30 .305 .364 .478
Since the absolute value of our calculated Spearman's rank correlation coefficient (0.7303) is less than the critical value (0.738), we accept (fail to reject) our null hypothesis: the relationship is not statistically significant at the 0.05 level.
Example 2
Two doctors assessed the condition of eight patients suffering from particular symptoms. To do
this they ranked the patients from 1 (best) to 8 (worst): the results are tabulated.
Patient Doctor A Doctor B
1 4 5
5 1 3
3 3 1
4 2 2
5 6 6
6 5 4
7 8 7
Parametric testing is defined by making one or more assumptions about the population's
properties. The most common assumptions to make are that the population will be normally
distributed or have data based on an equal-interval scale
3. INDEPENDENT SAMPLES T-TEST
The independent samples t-test compares the means of two unrelated groups on the same continuous, dependent variable (for example, the independent variable might be "educational level", which has two groups: "undergraduates" and "postgraduates").
Assumptions of independent sample t-test
When you choose to analyse your data using an independent t-test, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using an
independent t-test. You need to do this because it is only appropriate to use an independent t-test
if your data "passes" six assumptions that are required for an independent t-test to give you a
valid result
1. Your dependent variable should be measured on a continuous scale (i.e., it is measured at
the interval or ratio level). Examples of variables that meet this criterion include revision
time (measured in hours), intelligence (measured using IQ score), exam performance
(measured from 0 to 100), weight (measured in kg), and so forth.
2. Your independent variable should consist of two categorical, independent groups.
Example independent variables that meet this criterion include gender (2 groups: male or
female), employment status (2 groups: employed or unemployed), smoker (2 groups: yes or
no), and so forth.
3. You should have independence of observations, which means that there is no relationship
between the observations in each group or between the groups themselves. For example,
there must be different participants in each group with no participant being in more than one
group.
4. There should be no significant outliers. Outliers are simply single data points within your
data that do not follow the usual pattern. The problem with outliers is that they can have a
negative effect on the independent t-test, reducing the validity of your results.
5. Your dependent variable should be approximately normally distributed for each group
of the independent variable. We talk about the independent t-test only requiring
approximately normal data because it is quite "robust" to violations of normality, meaning
that this assumption can be a little violated and still provide valid results. You can test for
normality using the Shapiro-Wilk test of normality.
6. There needs to be homogeneity of variances. You can test this assumption using Levene's test for homogeneity of variances.
The independent t-test, as we have already mentioned is used when we wish to compare the
statistical significance of a possible difference between the means of two groups on some
independent variable and the two groups are independent of one another. The formula for the
independent sample t-test is:
t = (X̄1 − X̄2) / √[ ((SS1 + SS2) / (n1 + n2 − 2)) × (1/n1 + 1/n2) ]
where the sum of squares for each group is SS = ΣX² − (ΣX)²/n.
We also need to know the degrees of freedom for the independent t-test, which is df = n1 + n2 − 2.
Example: The job satisfaction scores of 11 workers on a fixed shift and 11 workers on a rotating shift are shown below.
Fixed Shift 79 83 68 59 81 76 80 74 58 49 68
Rotating Shift 63 71 46 57 53 46 57 76 52 68 73
a) Which test statistic would be most suitable to analyse the above data and why
b) Explain three assumptions of the test above
c) State the null hypothesis to be tested
d) Using the scores above determine if there is a significant difference in job satisfaction
between the two groups of workers
X1   X2   (X1)²   (X2)²
79 63 6241 3969
83 71 6889 5041
68 46 4624 2116
59 57 3481 3249
81 53 6561 2809
76 46 5776 2116
80 57 6400 3249
74 76 5476 5776
58 52 3364 2704
49 68 2401 4624
68 73 4624 5329
Totals: ΣX1 = 775   ΣX2 = 662   Σ(X1)² = 55837   Σ(X2)² = 40982
We can use the totals from this worksheet and the number of subjects in each group to calculate
the sum of squares for group 1, the sum of squares for group 2, the mean for group 1, the mean
for group 2, and the value for the independent t.
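A short Python sketch (added for illustration) reproduces these calculations from the raw scores, using SS = ΣX² − (ΣX)²/n for each group and the pooled-variance t formula given earlier:

```python
from math import sqrt

# Job satisfaction scores from the table above
fixed    = [79, 83, 68, 59, 81, 76, 80, 74, 58, 49, 68]
rotating = [63, 71, 46, 57, 53, 46, 57, 76, 52, 68, 73]

n1, n2 = len(fixed), len(rotating)
mean1, mean2 = sum(fixed) / n1, sum(rotating) / n2

# Sum of squares for each group: SS = sum(X^2) - (sum(X))^2 / n
ss1 = sum(x ** 2 for x in fixed) - sum(fixed) ** 2 / n1
ss2 = sum(x ** 2 for x in rotating) - sum(rotating) ** 2 / n2

# Pooled-variance independent t-test
pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
t = (mean1 - mean2) / sqrt(pooled_variance * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(mean1, 2), round(mean2, 2))  # 70.45 and 60.18
print(round(t, 3), df)                   # about 2.21 with df = 20 (the notes' hand calculation gives 2.209)
```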
Therefore our t-calculated value is 2.209. We need to compare this with the t-critical from
statistical tables.
The degree of freedom for the t-test is given by: df = n1 + n2 - 2 = 11 + 11 - 2 = 20. The
significance level is 0.05
To know the critical value for critical t, we use the statistical tables for t-test with an alpha level
of 0.05 and a two-tailed test. Look for the column of the table under .05 for Level of significance
for two-tailed tests, read down the column until you are level with 20 in the df column, and you
will find the critical value of t which is 2.086.
31
Finally, compare the calculated t value (2.209) with the critical t value (2.086). Our result is significant if the calculated t value is greater than or equal to 2.086 or less than or equal to −2.086.
Since our calculated value of t (2.209) is greater than the critical value of t (2.086), we reject the
null hypothesis and accept the alternative hypothesis.
Therefore, there is a significant difference in job satisfaction between the two groups of workers as shown by the t-test, t(20) = 2.209, p < .05 (critical value 2.086).
Exercise
A researcher sought to establish whether two types of music, type-I and type-II, had different
effects upon the ability of college students to perform a series of mental tasks requiring
concentration. The researcher picked a fairly homogeneous subject pool of 30 college students,
randomly sorting them into two groups, A and B, of sizes Na=15 and Nb=15. (It is not essential
for this procedure that the two samples be of the same size.) He then had the members of each
group, one at a time, perform a series of 40 mental tasks while one or the other of the music
types is playing in the background. For the members of group A it is music of type-I, while for
those of group B it is music of type-II. The following table shows how many of the 40
components of the series each subject was able to complete.
Group A 26 21 22 26 19 22 26 25 24 21 23 23 18 29 22
music of type-I
Group B 18 23 21 20 20 29 20 16 20 26 21 25 17 18 19
music of type-II
Do two types of music, type-I and type-II, have different effects upon the ability of college
students to perform a series of mental tasks requiring concentration?
4. PAIRED SAMPLES T-TESTS
The dependent t-test (called the paired-samples t-test in SPSS) compares the means between two
related groups on the same continuous, dependent variable. For example, you could use a
dependent t-test to understand whether there was a difference in smokers' daily cigarette
consumption before and after a 6 week hypnotherapy programme (i.e., your dependent variable
would be "daily cigarette consumption", and your two related groups would be the cigarette
consumption values "before" and "after" the hypnotherapy programme).
The formula for the dependent t-test, in its direct-difference form, is
t = ΣD / √[ (nΣD² − (ΣD)²) / (n − 1) ]
where D = X2 − X1 is the difference between each pair of scores and n is the number of pairs.
Notice that we subtract the score for the first X from the paired second X, so that when we find the difference between the pre-test and the post-test we subtract the pre-test (X1) from the post-test (X2). The degrees of freedom for the dependent t-test are df = n − 1, where n is the number of pairs of subjects in the study.
Example: A researcher expected that an anger management therapy programme would reduce the depression scores of a group of 10 adolescents. Each adolescent's depression score was measured before (pre-test) and after (post-test) the programme.
a) What is the appropriate test statistic that will be used to analyse this data and why.
In this problem we are comparing pre-test and post-test scores for a group of subjects. At the
same time the dependent variable is in ratio while the independent variable is categorical (pretest
and posttest). This would be an appropriate situation for the dependent t-test.
b) What are the three basic assumptions of the test statistic in (a) above?
c) State the null hypothesis to be tested
Note: Our problem stated that the therapy would decrease the depression score. Therefore our
alternative hypothesis states that mu1 (the pre-test score) will be greater than mu2 (the post-test
score).
d) Does the anger management therapy significantly reduce the scores on the depression scale?
The pre-test and post-test scores, as well as D and D², are shown in the following table:
Pre-Test (X1)   Post-Test (X2)   D = (X2 − X1)   D²
14              0                -14              196
6               0                -6               36
4               3                -1               1
15              20               5                25
3               0                -3               9
3               0                -3               9
6               1                -5               25
5               1                -4               16
6               1                -5               25
3               0                -3               9
Totals                           ΣD = -39         ΣD² = 351
2. Set the alpha level. Note: As usual we will set our alpha level at .05; we have 5 chances in 100 of making a Type I error.
3. Calculate the value of the appropriate statistic. Also indicate the degrees of freedom for
the statistical test if necessary. t = -2.623 df = n - 1 = 10 - 1 = 9 Note: We have calculated the t-
value and will also need to know the degrees of freedom when we go to look up the critical value
of t.
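A short Python sketch (added for illustration) reproduces the calculation from the pre-test and post-test scores using the direct-difference form of the paired t-test:

```python
from math import sqrt

# Pre-test and post-test depression scores from the table above
pre  = [14, 6, 4, 15, 3, 3, 6, 5, 6, 3]
post = [0, 0, 3, 20, 0, 0, 1, 1, 1, 0]

d = [x2 - x1 for x1, x2 in zip(pre, post)]   # D = post-test minus pre-test
n = len(d)
mean_d = sum(d) / n

# Standard deviation of the differences
s_d = sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))

# Direct-difference form of the dependent (paired) t-test
t = mean_d / (s_d / sqrt(n))
df = n - 1

print(sum(d), sum(x ** 2 for x in d))  # -39 and 351, matching the table
print(round(t, 3), df)                 # about -2.623 with df = 9
```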
4. Write the decision rule for rejecting the null hypothesis. Reject H0 if t is <= -1.833 Note:
To write the decision rule we need to know the critical value for t, with an alpha level of .05 and
a one-tailed test. That means our result is significant if the calculated t value is less than or equal
to -1.833. Note: Why are we looking for a negative value of t? This is a little tricky, but we are
looking at the post-test being less than the pre-test. Now the dependent t is calculated by
subtracting the pre-test from the post-test so if the post-test is actually less than the pre-test, post-
test minus pre-test will be a negative quantity.
5. Write a summary statement based on the decision. Reject H0, p < .05, one-tailed Note:
Since our calculated value of t (-2.623) is less than or equal to -1.833, we reject the null
hypothesis and accept the alternative hypothesis.
6. Write a statement of results in Standard English. The management therapy did
significantly reduce the depression scores for the adolescents.
Exercise
Consider the following study in which standing and supine systolic blood pressures were
compared. This study was performed on twelve subjects. Their blood pressures were measured in
both positions.
Standing 132 146 135 141 139 162 128 137 145 151 131 143
Supine 136 145 140 147 142 160 137 136 149 158 120 150
Does the data suggest that there is no difference between the mean blood pressures in the two
populations?
5. THE ANALYSIS OF VARIANCE (ANOVA)
Analysis of variance, like the t-test, tests the hypothesis that the means of several groups are equal. One of the most important differences is that a t-test compares only TWO groups, whereas analysis of variance usually compares three or more groups. For example, you can use ANOVA to determine whether the means of a number of groups are equal.
Do students in Boys'-only, Girls'-only and Co-educational schools, on average, perform equally on a self-efficacy questionnaire? Or suppose you distinguish between age groups: young (less than 25 years), middle (25 to 45 years) and adult (over 45 years). Do all age groups, on average, spend an equal amount of time on Facebook? ANOVA answers these kinds of questions.
Assumptions of ANOVA
1. Interval data. ANOVA assumes an interval-level dependent variable. With Likert scales and other ordinal dependents, the nonparametric Kruskal-Wallis test is preferred.
2. Homogeneity of variances. The dependent variable should have the same variance in each
category of the independent variable. When there is more than one independent, there must
be homogeneity of variances in the cells formed by the independent categorical variables.
The reason for this assumption is that the denominator of the F-ratio is the within-group
mean square, which is the average of group variances taking group sizes into account. When
groups differ widely in variances, this average is a poor summary measure. However,
ANOVA is robust for small and even moderate departures from homogeneity of variance.
Still, a rule of thumb is that the ratio of largest to smallest group variances should be
3.0 or less. The Levene's test of homogeneity of variance is computed by SPSS to test the ANOVA assumption that each group (category) of the independent variable(s) has the same variance. If the Levene statistic is significant at the .05 level or better, the researcher rejects the null hypothesis that the groups have equal variances.
3. Random sampling. For purposes of significance testing, the subjects in each group are
randomly sampled.
4. Multivariate normality. For purposes of significance testing, variables follow multivariate
normal distributions. The dependent variable is normally distributed in each category of the
independent variable(s). ANOVA is robust even for moderate departures from multivariate
normality.
5. Equal or similar sample sizes. The groups formed by the categories of the independent(s)
should be equal or similar in sample size. The more the groups are similar in size, the more
robust ANOVA will be with respect to violations of the assumptions of normality and
homogeneity of variance.
ANOVA compares the variance between the groups with the variance within the groups. Dividing the former by the latter, you get the F-statistic. Therefore, instead of ANOVA, the term F-test is also used.
Example 1
A manager wishes to determine whether the mean times required to complete a certain task differ
for the three levels of employee training. He randomly selected 6 employees with each of the
three levels of training (Beginner, Intermediate and Advanced). The data is summarized below.
Beginner: 56, 63, 52, 61, 65, 67
Intermediate: 41, 44, 51, 43, 41, 55
Advanced: 55, 45, 45, 60, 49, 53
a) Which is the best test statistic to analyse the data and why?
One-way ANOVA, because we want to compare the mean differences of more than two groups.
b) State three assumptions of using the test statistic
• Sample must be randomly selected
• All of the standard deviations are the same -No standard deviation is more than twice any
other.
• All of the populations are normally distributed
• The dependent variable should be interval or ratio
• More than two independent categorical groups
c) State the null hypothesis
There is no statistically significant difference in the mean times required to complete a certain task for the three levels of employee training.
d) Does the data support the hypothesis?
ANOVA= F-ratio = Between Column Variance (BCV)/Within Column Variance (WCV)
Beginner Intermediate Advanced
56 41 55
63 44 45
52 51 45
61 43 60
65 41 49
67 55 53
Mean: 60.6667   45.8333   51.1667
a) Mean for each group: Beginner = 60.6667, Intermediate = 45.8333, Advanced = 51.1667; grand mean = (60.6667 + 45.8333 + 51.1667)/3 = 52.5556
b) Between column variance (BCV) = n × Σ(group mean − grand mean)²/(k − 1) = 6 × [(60.6667 − 52.5556)² + (45.8333 − 52.5556)² + (51.1667 − 52.5556)²]/(3 − 1) = 677.4444/2 = 338.7222
c) Variance for beginners = sum of squared deviations/(N−1) = 161.333/(6−1) = 32.2667
d) Variance for intermediate = sum of squared deviations/(N−1) = 168.833/(6−1) = 33.7667
e) Variance for advanced = sum of squared deviations/(N−1) = 176.8333/(6−1) = 35.3667
Within column variance(WCV) = sum of the variance of the groups/number of groups
= (32.2666 + 33.76666+ 35.36666)/3 = 101.39986/3=33.8
ANOVA= F-ratio = Between Column Variance (BCV)/Within Column Variance (WCV)
=338.7222/33.8=10.021
F critical / tabulated is established from ANOVA tables through the intersection of Degree of
freedom of the numerator and Degree of freedom of the denominator.
Degree of freedom of the numerator = n-1 = 3-1 = 2, where n is the number of groups
Degree of freedom of the denominator = (N1-1)+( N2-1)+ (N3-1) = (6-1)+(6-1)+(6-1) = 5+5+5=
15
whereN1, N2 and N3 are number of elements in group 1, 2 and 3 respectively
Critical F-ratio = 3.68; that is where numerator df = 2 and denominator df = 15 intersect on the F table.
Ho: There is no difference in the mean times required to complete a certain task by levels of training.
We now compare the calculated F-ratio with the critical F-ratio to decide whether to accept or reject this null hypothesis. Since the calculated F-ratio (10.02) is greater than the critical F-ratio (3.68), we reject the null hypothesis; the results indicate a statistically significant difference in the mean times required to complete the task across the three levels of training.
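The F-ratio of 10.02 can be reproduced with a short Python sketch (added for illustration) that follows the same steps as the hand calculation: the between-column variance from the group means, and the within-column variance as the average of the group variances (the group sizes are equal here, so this matches the usual within-group mean square):

```python
# Task-completion times for the three training levels from the example above
beginner     = [56, 63, 52, 61, 65, 67]
intermediate = [41, 44, 51, 43, 41, 55]
advanced     = [55, 45, 45, 60, 49, 53]
groups = [beginner, intermediate, advanced]

def variance(values):
    """Sample variance: sum of squared deviations divided by N - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

k = len(groups)                      # number of groups
n = len(groups[0])                   # observations per group (equal sizes here)
group_means = [sum(g) / n for g in groups]
grand_mean = sum(group_means) / k

# Between-column variance: n times the squared deviations of group means, divided by k - 1
bcv = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)

# Within-column variance: average of the group variances (valid here because group sizes are equal)
wcv = sum(variance(g) for g in groups) / k

f_ratio = bcv / wcv
print(round(bcv, 2), round(wcv, 2), round(f_ratio, 2))  # roughly 338.72, 33.8 and 10.02
```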
F Distribution critical values for P=0.05
Denominator DF \ Numerator DF:   1   2   3   4   5   7   10   15   20   30   60   120   500   1000
1 161.45 199.50 215.71 224.58 230.16 236.77 241.88 245.95 248.01 250.10 252.20 253.25 254.06 254.19
2 18.513 19.000 19.164 19.247 19.296 19.353 19.396 19.429 19.446 19.462 19.479 19.487 19.494 19.495
3 10.128 9.5522 9.2766 9.1172 9.0135 8.8867 8.7855 8.7028 8.6602 8.6165 8.5720 8.5493 8.5320 8.5292
4 7.7086 6.9443 6.5915 6.3882 6.2560 6.0942 5.9644 5.8579 5.8026 5.7458 5.6877 5.6580 5.6352 5.6317
5 6.6078 5.7862 5.4095 5.1922 5.0504 4.8759 4.7351 4.6187 4.5582 4.4958 4.4314 4.3985 4.3731 4.3691
7 5.5914 4.7375 4.3469 4.1202 3.9715 3.7871 3.6366 3.5108 3.4445 3.3758 3.3043 3.2675 3.2388 3.2344
10 4.9645 4.1028 3.7082 3.4780 3.3259 3.1354 2.9782 2.8450 2.7741 2.6996 2.6210 2.5801 2.5482 2.5430
15 4.5431 3.6823 3.2874 3.0556 2.9013 2.7066 2.5437 2.4035 2.3275 2.2467 2.1601 2.1141 2.0776 2.0718
20 4.3512 3.4928 3.0983 2.8660 2.7109 2.5140 2.3479 2.2032 2.1241 2.0391 1.9463 1.8962 1.8563 1.8498
30 4.1709 3.3159 2.9223 2.6896 2.5336 2.3343 2.1646 2.0149 1.9317 1.8408 1.7396 1.6835 1.6376 1.6300
60 4.0012 3.1505 2.7581 2.5252 2.3683 2.1666 1.9927 1.8365 1.7480 1.6492 1.5343 1.4672 1.4093 1.3994
120 3.9201 3.0718 2.6802 2.4473 2.2898 2.0868 1.9104 1.7505 1.6587 1.5544 1.4289 1.3519 1.2804 1.2674
500 3.8601 3.0137 2.6227 2.3898 2.2320 2.0278 1.8496 1.6864 1.5917 1.4820 1.3455 1.2552 1.1586 1.1378
1000 3.8508 3.0047 2.6137 2.3808 2.2230 2.0187 1.8402 1.6765 1.5811 1.4705 1.3318 1.2385 1.1342 1.1096