Data Analysis and Interpretations Chapter 8
Quantitative data in a raw form, that is, before these data have been processed and analysed,
convey very little meaning to most people. These data therefore need to be processed to make them
useful, that is, to turn them into information. Quantitative analysis techniques such as graphs,
charts and statistics allow us to do this; helping us to explore, present, describe and examine
relationships and trends within our data.
Virtually any business and management research you undertake is likely to involve some
numerical data or contain data that could usefully be quantified to help you answer your research
question(s) and to meet your objectives. Quantitative data refers to all such data and can be a
product of all research strategies. It can range from simple counts such as the frequency of
occurrences to more complex data such as test scores, prices or rental costs. To be useful these
data need to be analysed and interpreted. Quantitative analysis techniques assist you in this
process. They range from creating simple tables or diagrams that show the frequency of
occurrence and using statistics such as indices to enable comparisons, through establishing
statistical relationships between variables to complex statistical modelling.
This chapter builds on the ideas outlined in earlier chapters about data collection. This section
of the chapter introduces the wide range of graphical and statistical techniques for analysing
data. It does not attempt an in-depth discussion of the many issues that need to be considered at
the planning and analysis stages of your research project; that material is left for students as a
reading assignment from the book Research Methods for Business Students, 4th edition (pages
406-462), by Mark Saunders, Philip Lewis and Adrian Thornhill. The issues to be considered
there are concerned with:
preparing, inputting into a computer and checking your data;
choosing the most appropriate tables and diagrams to explore and present your data;
choosing the most appropriate statistics to describe your data;
choosing the most appropriate statistics to examine relationships and trends in your data.
Statistics in research
Students are often intimidated by statistics. This brief overview is intended to place statistics in
context and to provide a reference sheet for those who are trying to interpret statistics that they
read. It does not attempt to show or to explain the mathematics involved. Although it is helpful if
those who use statistics understand the math, the computer age has rendered that understanding
unnecessary for many purposes. Practically speaking, students often simply want to know
whether a particular result is significant, i.e. how likely it is that the obtained result may be
attributable to something other than chance. Computer programs can easily produce numbers that
allow such conclusions, if the student knows which tests to use and has an understanding of what
the numbers mean. This summary is intended to help achieve that understanding.
Statistics can be defined from two perspectives: the definition given in the Concise Oxford
Dictionary, and the definition used in business statistics, which can be summarised as follows:
The emphasis on this course is not on the actual collection of numerical facts (data) but
on their classification, summarisation, display and analysis. These processes are carried
out in order to help us understand the data and make the best use of them.
Any data you use can take a variety of values or belong to various categories, either numerical or
non-numerical. The 'thing' being described by the data is therefore known as a variable. The
values or descriptions you use to measure or categorise this 'thing' are the measurements. These
are of different types, each with its own appropriate scale of measurement which has to be
considered when deciding on suitable methods of graphical display or numerical analysis.
Categorical data
This is generally non-numerical data which is placed into mutually exclusive categories and then counted
rather than measured. People are often categorised by sex or occupation. Cars can be categorised
by make or colour.
Nominal data
The scale of measurement for a variable is nominal if the data describing it are simple labels or
names which cannot be ordered. This is the lowest level of measurement. Even if it is coded
numerically, or takes the form of numbers, these numbers are still only labels. For example, car
registration numbers only serve to identify cars. Numbers on athletes' vests are only nominal and
make no value statements. All nominal data are placed in a limited number of exhaustive
categories and any analysis is carried out on the frequencies within these categories. No other
arithmetic is meaningful.
Ordinal Data
If the exhaustive categories into which the set of data is divided can be placed in a meaningful
order, without any measurement being taken on each case, then it is classed as ordinal data. This
is one level up from nominal. We know that the members of one category are more, or less, than
the members of another but we do not know by how much. For example, the cars can be ordered
as: 'small', 'medium', 'large' without the aid of a tape measure. Degree classifications are only
ordinal. Athletes' results depend on their order of finishing in a race, not on 'how much' separates
their times. Questionnaires are often used to collect opinions using the categories: 'Strongly
agree', 'Agree', 'No opinion', 'Disagree' or 'Strongly disagree'. The responses may be coded as 1,
2, 3, 4 and 5 for the computer but the differences between these numbers are not claimed to be
equal.
Interval data
There are very few examples of genuine interval scales. Temperature in degrees Centigrade
provides one example with the 'zero' on this scale being arbitrary. The difference between 30°C
and 50°C is the same as the difference between 40°C and 60°C but we cannot claim that 60°C is
twice as hot as 30°C. It is therefore interval data but not ratio data. Dates are measured on this
scale as again the zero is arbitrary and not meaningful.
Ratio data
Ratio data must have a meaningful zero as its lowest possible value so, for example, the time
taken for athletes to complete a race would be measured on this scale. Suppose Bill earns £20
000, Ben earns £15 000 and Bob earns £10 000. The intervals of £5000 between Bill and Ben and
also between Ben and Bob genuinely represent equal amounts of money. Also the ratio of Bob's
earnings to Bill's earnings are genuinely in the same ratio, 1 : 2, as the numbers which represent
them since the value of £0 represents 'no money'. This data set is therefore ratio as well as
interval.
Various definitions exist for the distinction between these two types of data. Non-numerical
(nominal) data is always described as qualitative or non-metric, since it describes some quality
but is not measured. Quantitative or metric data, which describes some measurement or
quantity, is always numerical and measured on the interval or ratio scales. All definitions agree
that interval or ratio data are quantitative. Some textbooks, however, use the term qualitative to
refer to words only, whilst others also include nominal or ordinal numbers. Problems of
definition can arise with numbers, such as house numbers, which identify or rank rather than
measure.
The population is the entire group of interest. This is not confined to people, as is usual in the
non-statistical sense. Examples may include such 'things' as all the houses in a local authority
area rather than the people living in them.
It is not usually possible, or not practical, to examine every member of a population, so we use a
sample, a smaller selection taken from that population, to estimate some value or characteristic
of the whole population. Care must be taken when selecting the sample as it must be
representative of the whole population under consideration otherwise it doesn't tell us anything
relevant to that particular population.
Occasionally the whole population is investigated by a census, such as is carried out every ten
years in the British Isles to produce a complete enumeration of the whole population. The data
are gathered from the whole population. A more usual method of collecting information is by a
survey in which only a sample is selected from the population of interest and its data examined.
Examples of this are the Gallup polls produced from a sample of the electorate when attempting
to forecast the result of a general election.
Analysing a sample instead of the whole population has many advantages such as the obvious
saving of both time and money. It is often the only possibility as the collecting of data may
sometimes destroy the article of interest, e.g. the quality control of rolls of film.
The ideal method of sampling is random sampling. By this method every member of the
population has an equal chance of being selected and every selection is independent of all the
others. This ideal is often not achieved for a variety of reasons and many other methods are used.
Descriptive Statistics
If the data available to us cover the whole population of interest, we may describe them or
analyse them in their own right, i.e. we are only interested in the specific group from which the
measurements are taken. The facts and figures usually referred to as 'statistics' in the media are
very often a numerical summary, sometimes accompanied by a graphical display, of this type of
data, e.g. unemployment figures. Much of the data generated by a business will be descriptive in
nature, as will be the majority of sporting statistics. In the next few weeks you will learn how to
display data graphically and summarise it numerically.
Inferential Statistics
Alternatively, we may have available information from only a sample of the whole population of
interest. In this case the best we can do is to analyse it to produce the sample statistics from
which we can infer various values for the parent population. This branch of statistics is usually
referred to as inferential statistics. For example we use the proportion of faulty items in a
sample taken from a production line to estimate the corresponding proportion of all the items.
A descriptive measure from the sample is usually referred to as a sample statistic and the
corresponding measure estimated for the population as a population parameter.
The problem with using samples is that each sample taken would produce a different sample
statistic giving us a different estimate for the population parameter. They cannot all be correct so
a margin of error is generally quoted with any sample estimations. This is particularly important
when forecasting future statistics.
In this chapter you will learn how to draw conclusions about populations from sample statistics
and estimate future values from past data.
Measures of Central Tendency
A measure of central tendency is meant to tell us the “center” of a data set or population. What
do we mean by “center”? That’s an inherently vague question. We might mean a typical value,
or the most common value, or a value that’s in the middle… We need to be more specific.
Mean
The mean is what we usually think of as the average (although “average” can be used to refer to
other measures of central tendency as well). For a sample, the mean is the sum of all
observations divided by the number of observations. Here is the formula for the sample mean:
x̄ = (x₁ + x₂ + … + xₙ)/n = (∑ xᵢ)/n, where the sum runs from i = 1 to n.
For the mean of a population, in principle we do the same thing: add up the values of all
possible observations, and divide by the number of possible observations. But what if there is no
maximum number of observations? Consider rolling a standard 6-sided die. We could roll it an
infinite number of times, so any finite number of rolls is only a sample. What is the population
mean? You have to take each possible value for an outcome (in this case, one through six),
multiply by its frequency as a fraction of all outcomes, and add up the results. Here is the
formula for the population mean:
μ = ∑ x·f(x), where the sum runs over every possible value x in the population X.
(The expression f(x) means the frequency of the value x in the population; it is a number between
0 and 1.) We use the Greek letter mu (μ) to stand for the population mean, which we sometimes
call the "true" mean. For the throw of a 6-sided die, the population mean is 1*(1/6) + 2*(1/6) +
3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6) = 3.5.
In most cases, we don’t actually know the population mean, so we try to estimate it with the
sample mean.
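As a check on the formula, the die's population mean can be computed in a few lines of standard-library Python (a sketch, using exact fractions for the frequencies):

```python
from fractions import Fraction

# Population mean of a fair 6-sided die: mu = sum of x * f(x) over
# all possible outcomes x, where f(x) is the frequency of x in the
# population (here each face has frequency 1/6).
outcomes = [1, 2, 3, 4, 5, 6]
f = {x: Fraction(1, 6) for x in outcomes}

mu = sum(x * f[x] for x in outcomes)
print(float(mu))  # 3.5
```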
The technique we just used to find the population mean is also useful when you’re given sample
data in the form of a frequency table, with each value that occurred alongside the frequency with
which it occurred. E.g.,
Answers to “How Many Times Did You Use the Restroom Today?”
Answer # Subjects
0 1
1 3
2 4
3 8
4 5
5 3
We take each possible answer and multiply by its frequency in the sample. The sample mean
here is [0(1) + 1(3) + 2(4) + 3(8) + 4(5) + 5(3)]/24 = 2.92. (Notice that this is the same as
finding the fractional frequency of each value as # Subjects/24, multiplying by the answer value,
and then adding them up.)
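The same frequency-table arithmetic can be sketched in Python, with the values and counts taken from the restroom table above:

```python
# Sample mean from a frequency table: multiply each value by its
# count, sum, and divide by the total number of observations.
table = {0: 1, 1: 3, 2: 4, 3: 8, 4: 5, 5: 3}  # answer -> # subjects

n = sum(table.values())  # 24 subjects in total
mean = sum(value * count for value, count in table.items()) / n
print(n, round(mean, 2))  # 24 2.92
```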
Median
The median is the value of the observation exactly in the middle of the sample or population,
such that half the observations have a higher value and half have a lower value. (If there is an
even number of observations, then there is no true middle value, so the median is defined as the
mean of the two middle values.)
In the restroom example, the median is 3 (because there are eight observations higher than 3 and
eight observations lower, and the middle eight observations are all the same). What would
happen if we added three more people who went to the restroom 6 times each? The total number
of observations would be 27, so we'd be looking for the 14th highest (and 14th lowest)
observation. It's still 3!
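A short sketch in standard-library Python confirms that the median is unchanged by the extra observations:

```python
import statistics

# Expand the frequency table into raw observations, then take the
# median before and after adding three people with 6 visits each.
table = {0: 1, 1: 3, 2: 4, 3: 8, 4: 5, 5: 3}
data = [value for value, count in table.items() for _ in range(count)]

print(statistics.median(data))  # 3.0 (24 observations; both middle values are 3)
data += [6, 6, 6]               # three more people, 6 visits each
print(statistics.median(data))  # 3 (27 observations; the 14th value)
```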
Mode
The mode is the most common outcome in the sample or population. For example, in a
population with more women than men, the modal sex is female. Mode means the single
most common outcome, not necessarily the majority outcome.
The examples involve nominal data, and that’s where mode is most often useful. But it can be
used with numbers, too. In the restroom frequency table above, the mode is 3. In this example,
the possible outcomes are numerical, but they are also discrete, meaning they take on a countable
number of different values. You can’t use the bathroom one-half a time!
The mode makes less sense with characteristics that are not discrete, but are instead continuous,
meaning the variable can take on an uncountable number of different values. Consider height. If
you measure height precisely enough, it’s difficult to find anyone who is exactly any height you
specify in advance – e.g., 6’0”. Everyone you find will be just slightly above or slightly below
it. The frequency of any precisely defined height is approximately zero! So to define the mode
in cases like this, you need to establish intervals (such as inches of height, which actually
includes an interval of heights rounded off to the nearest whole inch).
Variance: a measure of dispersion in which observations are weighted by the square of their
distance from the mean, as given by the formula s² = ∑(xᵢ − x̄)²/(n − 1), where the sum runs
from i = 1 to n.
Note that this gives greater weight to an observation the further it is from the mean. For
example, suppose the mean is zero. An observation of 4 (or -4) would be weighted four times as
much as an observation of 2 (or -2), despite being only twice as far from the mean.
Standard deviation: the square root of the variance (for both the sample and the population):
s = √s² and σ = √σ².
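Both measures can be sketched in a few lines of Python; the data set here is a small hypothetical sample used only to illustrate the arithmetic:

```python
import statistics

# Sample variance weights each observation by the squared distance
# from the mean and divides by n - 1; the sample standard deviation
# is its square root.
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
sd = var ** 0.5

print(round(var, 3), round(sd, 3))
assert abs(var - statistics.variance(data)) < 1e-12  # cross-check against stdlib
```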
Quartiles, quintiles and deciles
We use these measures when we want to divide a group or population into a number of equal-
sized subgroups. Quartiles are four equal-sized subgroups; quintiles are five equal-sized
subgroups; deciles are ten equal-sized subgroups. (There are others, but these are the most
common.)
Note that “equal-sized” is with respect to the number of observations or members in each
subgroup, not the size of the group’s interval. For instance, households are often divided into
income quintiles. The top quintile includes the 20% of households that have the highest annual
income. The bottom quintile includes the 20% of households that have the lowest annual
income. These quintiles include equal numbers of households, but they will not correspond to
the same size intervals of incomes.
Percentiles or “Xiles” are used for various purposes, but most often in economics for dividing the
population into income groups. This can be useful for getting a sense of the dispersion of
incomes in the economy. But to see how they can be misleading, notice that the dividing lines
are much like the median: they can be invariant to changes on either side of them. E.g., people
at the top of the top quintile, or the bottom of the bottom quintile, could get richer or poorer
without affecting the quintile dividing lines.
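As an illustration, Python's standard library can compute quintile dividing lines directly; `statistics.quantiles` with `n=5` returns the four cut points between the five groups. The income figures below are hypothetical:

```python
import statistics

# Quintile dividing lines split the ordered observations into five
# equal-sized groups; n=5 returns the four cut points between them.
incomes = [12, 18, 25, 31, 40, 47, 55, 63, 78, 120]  # hypothetical, in $1,000s

cuts = statistics.quantiles(incomes, n=5)
print(cuts)  # four dividing lines, in increasing order
```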
Descriptive statistics: Tabular and Graphical methods
Frequency distribution is a tabular presentation of data, which shows the frequency of the
appearance of data elements in several nonoverlapping classes. The purpose of the frequency
distribution is to organize masses of data elements into smaller and more manageable groups.
The frequency distribution can present both qualitative and quantitative data. Besides, when
summarizing a large set of data it is often useful to classify the data into classes or categories and
to determine the number of individuals belonging to each class, called the class frequency. A
tabular arrangement of data by classes together with the corresponding frequencies is called a
frequency distribution or simply a frequency table. Consider the following definitions:
Class: A grouping of data elements in order to develop a frequency distribution.
Class Width: The length of the class interval. Each class has two limits. The lowest
value is referred to as the lower class limit, and the highest value is the upper class limit.
The difference between the upper and the lower class limits represents the class width.
Class Midpoint: The point in each class that is halfway between the lower and the upper
class limits.
Frequency: The number of observations in a class.
Relative Frequency Distribution: A tabular presentation of a set of data which shows
the frequency of each class as a fraction of the total frequency. The relative frequency
distribution can present both qualitative and quantitative data.
Percent Frequency Distribution: A tabular presentation of a set of data which shows
the percentage of the total number of items in each class. The percent frequency of a
class is simply the relative frequency multiplied by 100.
Cumulative Frequency Distribution: A tabular presentation of a set of quantitative data
which shows for each class the total number of data elements with values less than the
upper class limit.
Cumulative Relative Frequency Distribution: A tabular presentation of a set of
quantitative data which shows for each class the fraction of the total frequency with
values less than the upper class limit.
Cumulative Percent Frequency Distribution: A tabular presentation of a set of
quantitative data which shows for each class the percentage of the total frequency with
values less than the upper class limit.
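The definitions above can be sketched in Python; the observations and class limits below are hypothetical, chosen only to show the arithmetic:

```python
# Group raw observations into classes, then derive the frequency,
# relative, percent, and cumulative frequency distributions.
data = [63, 71, 88, 94, 102, 107, 115, 121, 130, 142]  # hypothetical values
classes = [(60, 89), (90, 119), (120, 149)]             # (lower, upper) class limits

freq = [sum(lo <= x <= hi for x in data) for lo, hi in classes]
n = len(data)
relative = [f / n for f in freq]       # fraction of the total frequency
percent = [100 * r for r in relative]  # relative frequency * 100

running = 0
cumulative = []                        # values <= each upper class limit
for f in freq:
    running += f
    cumulative.append(running)

print(freq, relative, percent, cumulative)
```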
In addition to the above points, it is important to discuss the graphs and charts used with
frequency distributions. Some of the common types of graphs and charts are described below:
1. Bar Graph: A graphical method of presenting qualitative data that have been
summarized in a frequency distribution or a relative frequency distribution.
2. Pie Chart: A graphical device for presenting qualitative data by subdividing a circle into
sectors that correspond to the relative frequency of each class.
3. Dot Plot: A graphical presentation of data, where the horizontal axis shows the range of
data values and each observation is plotted as a dot above the axis.
4. Histogram: A graphical method of presenting a frequency or a relative frequency
distribution.
5. Ogive: A graphical method of presenting a cumulative frequency distribution or a
cumulative relative frequency distribution.
6. Stem-and-Leaf Display: An exploratory data analysis technique (the use of simple
arithmetic and easy-to-draw pictures to look at data more effectively) that simultaneously
rank orders quantitative data and provides insight into the shape of the underlying
distribution.
7. Crosstabulation: A tabular presentation of data for two variables. Rows and columns
show the classes of categories for the two variables.
8. Scatter Diagram: A graphical method of presenting the relationship between two
quantitative variables. One variable is shown on the horizontal and the other on the
vertical axis.
Having discussed these tabular and graphical methods of descriptive statistics, let us now
illustrate each of them with some examples.
Example 1:
A student has completed 20 courses in the School of Accounting and Finance. Her grades in the
20 courses are shown below.
A B A B C
C C B B B
B A B B B
C B C B A
(a) Develop a frequency distribution for her grades.
Answer: To develop a frequency distribution we simply count her grades in each category.
Thus, the frequency distribution of her grades can be presented as
Grade    Frequency
A        4
B        11
C        5
Total    20
(b) Develop a relative frequency distribution for her grades.
Answer: The relative frequency distribution is a distribution that shows the fraction or
proportion of data items that fall in each category. The relative frequency of each category is
its frequency divided by the total number of observations (n = 20):

Grade    Relative Frequency
A        4/20 = 0.20
B        11/20 = 0.55
C        5/20 = 0.25

[Bar graph of the grade frequencies: grades (A, B, C) on the horizontal axis, frequency on the
vertical axis.]
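The counting in parts (a) and (b) can be sketched with Python's `collections.Counter`:

```python
from collections import Counter

# Count the grades to get the frequency distribution, then divide
# each count by the number of courses for the relative frequencies.
grades = list("ABABC" "CCBBB" "BABBB" "CBCBA")  # the 20 grades, row by row

freq = Counter(grades)
n = len(grades)  # 20 courses
relative = {g: freq[g] / n for g in "ABC"}

print(dict(freq))  # {'A': 4, 'B': 11, 'C': 5}
print(relative)    # {'A': 0.2, 'B': 0.55, 'C': 0.25}
```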
Exercise 1:
There are 800 students in the School of Business Administration at UTC. There are four majors
in the school: Accounting, Finance, Management and Marketing. The following shows the
number of students in each major:
Example 2:
In a recent campaign, many airlines reduced their summer fares in order to gain a larger share of
the market. The following data represent the prices of round-trip tickets from Atlanta to Boston
for a sample of nine airlines.
120 140 140
160 160 160
160 180 180
Construct a dot plot for the above data.
Answer: The dot plot is one of the simplest graphical presentations of data. The horizontal axis
shows the range of data values, and each observation is plotted as a dot above the axis. The
four dots shown at the value of 160 indicate that four airlines were charging $160 for the
round-trip ticket from Atlanta to Boston.

[Dot plot: fares on the horizontal axis, with one dot at 120, two at 140, four at 160 and two
at 180.]
Example 3:
A sample of 30 customer waiting times (in seconds) at First County Bank was recorded; the
lowest waiting time was 60 seconds and the highest was 359.
(a) Develop a frequency distribution for these waiting times.
Answer: The first step for developing a frequency distribution is to decide how many classes
are needed. There are no "hard" rules for determining the number of classes; but generally, using
anywhere from five to twenty classes is recommended, depending on the number of
observations. Fewer classes are used when there are fewer observations, and more classes are
used when there are numerous observations. In our case, there are only 30 observations. With
such a limited number of observations, let us use 5 classes. The second step is to determine the
width of each class, using the equation: class width = (highest value − lowest value) / number
of classes. In the above data set, the highest value is 359 and the lowest value is 60. Therefore,
class width = (359 − 60)/5 = 59.8, which we round up to a more convenient 60.
(b) What are the lower and the upper class limits for the first class of the above frequency
distribution?
Answer: The lower class limit shows the smallest value that is included in a class. Therefore,
the lower limit of the first class is 60. The upper class limit identifies the largest value included
in a class. Thus, the upper limit of the first class is 119. (Note: The difference between the
lower limits of adjacent classes provides the class width. Consider the lower class limits of the
first two classes, which are 60 and 120. We note that the class width is 120 - 60 = 60.)
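The class-width arithmetic and the resulting class limits can be sketched in Python:

```python
import math

# Class width = (highest value - lowest value) / number of classes,
# rounded up to a convenient whole number.
high, low, k = 359, 60, 5

width = math.ceil((high - low) / k)  # 59.8 rounds up to 60

# Lower limit of each class, and the matching upper limit one unit
# below the next class's lower limit.
lower_limits = [low + i * width for i in range(k)]
upper_limits = [l + width - 1 for l in lower_limits]
print(width, list(zip(lower_limits, upper_limits)))
```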
(c) Develop a relative frequency distribution and a percent frequency distribution for the
above.
Answer: The relative frequency for each class is determined by the equation:

Relative Frequency of a Class = frequency of the class / n

where n is the total number of observations. The percent frequency distribution is simply the
relative frequencies multiplied by 100. Hence, the relative frequency distribution and the
percent frequency distribution are developed as shown below.

Relative and percent frequency distributions of waiting times at First County Bank

Waiting Times                 Relative     Percent
(Seconds)      Frequency      Frequency    Frequency
60 - 119       6              0.2000       20.00
120 - 179      10             0.3333       33.33
180 - 239      8              0.2667       26.67
240 - 299      4              0.1333       13.33
300 - 359      2              0.0667       6.67
(d) Develop a cumulative frequency distribution for the above data.
Answer: The cumulative frequency distribution shows the number of data elements with values
less than or equal to the upper limit of each class. For instance, the number of people who
waited less than or equal to 179 seconds is 16 (6 + 10), and the number of people who waited
less than or equal to 239 seconds is 24 (6 + 10 + 8). Therefore, the frequency and the cumulative
frequency distributions for the above data will be as follows.
Frequency and cumulative frequency distributions for the waiting times at First County Bank

Waiting Times                 Cumulative
(Seconds)      Frequency      Frequency
60 - 119 6 6
120 - 179 10 16
180 - 239 8 24
240 - 299 4 28
300 - 359 2 30
(e) How many people waited less than or equal to 239 seconds?
Answer: The answer to this question is given in the table of the cumulative frequency. You can
see that 24 people waited less than or equal to 239 seconds.
(f) Develop a cumulative relative frequency distribution and a cumulative percent frequency
distribution.
Answer: The cumulative relative frequency distribution can be developed from the relative
frequency distribution. It is a table that shows the fraction of data elements with values less than
or equal to the upper limit of each class. Using the table of relative frequency, we can develop
the cumulative relative and the cumulative percent frequency distributions as follows:
Relative frequency, cumulative relative frequency and cumulative percent frequency
distributions of waiting times at First County Bank

Waiting Times    Relative     Cumulative Relative    Cumulative Percent
(Seconds)        Frequency    Frequency              Frequency
60 - 119 0.2000 0.2000 20.00
120 - 179 0.3333 0.5333 53.33
180 - 239 0.2667 0.8000 80.00
240 - 299 0.1333 0.9333 93.33
300 - 359 0.0667 1.0000 100.00
NOTE: To develop the cumulative relative frequency distribution, we could have used the
cumulative frequency distribution and divided all the cumulative frequencies by the total number
of observations, that is, 30.
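The note above can be sketched in Python, deriving the cumulative relative and percent frequencies directly from the running totals:

```python
from itertools import accumulate

# Cumulative relative frequencies can be obtained by dividing each
# cumulative frequency by the total number of observations, n = 30.
freq = [6, 10, 8, 4, 2]  # class frequencies for the five waiting-time classes
n = sum(freq)

cum = list(accumulate(freq))                    # [6, 16, 24, 28, 30]
cum_rel = [c / n for c in cum]                  # fractions of the total
cum_pct = [round(100 * c / n, 2) for c in cum]  # percentages

print(cum_rel)
print(cum_pct)  # [20.0, 53.33, 80.0, 93.33, 100.0]
```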
(g) Construct a histogram for the waiting times in the above example.
Answer: One of the most common graphical presentations of data sets is a histogram. We can
construct a histogram by measuring the class intervals on the horizontal axis and the frequencies
on the vertical axis. Then we can plot bars with the widths equal to the class intervals and the
height equivalent to the frequency of the class that they represent. In Figure 2.4, the histogram
of the waiting times is presented. As you note, the width of each bar is equal to the width of the
various classes (60 seconds), and the height represents the frequency of the various classes. Note
that the first class ends at 119; the next class begins at 120, and one unit exists between these two
classes (and all other classes). To eliminate these spaces, the vertical lines are drawn halfway
between the class limits. Thus, the vertical lines are drawn at 59.5, 119.5, 179.5, 239.5, 299.5,
and 359.5.
[Histogram of the waiting times at First County Bank: waiting-time classes (in seconds) on the
horizontal axis, frequency on the vertical axis.]

[Ogive for the cumulative frequency distribution of the waiting times at First County Bank:
waiting times (in seconds) on the horizontal axis, cumulative frequency on the vertical axis.]
Exercise 4:
The following data set shows the number of hours of sick leave that some of the employees of
Bastien's, Inc. have taken during the first quarter of the year (rounded to the nearest hour).
19 22 27 24 28 12
23 47 11 55 25 42
36 25 34 16 45 49
12 20 28 29 21 10
59 39 48 32 40 31
(a) Develop a frequency distribution for the above data. (Let the width of your classes be 10
units and start your first class as 10 - 19.)
(b) Develop a relative frequency distribution and a percent frequency distribution for the data.
(c) Develop a cumulative frequency distribution.
(d) How many employees have taken less than 40 hours of sick leave?
Exercise 5:
The sales record of a real estate company for the month of May shows the following house prices
(rounded to the nearest $1,000). Values are in thousands of dollars.
105 55 45 85 75
30 60 75 79 95
(a) Develop a frequency distribution and a percent frequency distribution for the house prices.
(Use 5 classes and have your first class be 20 - 39.)
(b) Develop a cumulative frequency and a cumulative percent frequency distribution for the
above data.
(c) What percentage of the houses sold at a price below $80,000?
Example 4:
The test scores of 14 individuals on their first statistics examination are shown below.
95 87 52 43 77 84 78
75 63 92 81 83 91 88
a) Construct a stem-and-leaf display for these data.
Answer: To construct a stem-and-leaf display, the first digit of each data item is arranged in an
ascending order and written to the left of a vertical line. Then, the second digit of each data item
is written to the right of the vertical line next to its corresponding first digit as follows.
4 3
5 2
6 3
7 7 8 5
8 7 4 1 3 8
9 5 2 1
Now, the second digits are rank ordered horizontally, thus leading to the following stem-and-leaf
display.
4 3
5 2
6 3
7 5 7 8
8 1 3 4 7 8
9 1 2 5
(b) Explain how the above stem-and-leaf display should be read.
Answer: Each line in the above display is called a stem, and each piece of information on a
stem is a leaf. For instance, let us consider the fourth line:
7 5 7 8
The stem indicates that there are 3 scores in the seventies. These values are
75 77 78
Similarly, we can look at line five (where the first digit is 8) and see
8 1 3 4 7 8
This stem indicates that there are 5 scores in the eighties, and they are
81 83 84 87 88
At a glance, one can see the overall distribution for the grades. There is one score in the forties
(43), one score in the fifties (52), one score in the sixties (63), three scores in the seventies (75,
77, 78), five scores in the eighties (81, 83, 84, 87, 88), and three scores in the nineties (91, 92,
95).
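The construction can be sketched in Python: grouping by the tens digit gives the stems, and sorting each group of units digits gives the ordered leaves.

```python
from collections import defaultdict

# Stem-and-leaf display for two-digit scores: the tens digit is the
# stem, the units digit the leaf; leaves are sorted on each stem.
scores = [95, 87, 52, 43, 77, 84, 78, 75, 63, 92, 81, 83, 91, 88]

stems = defaultdict(list)
for s in scores:
    stems[s // 10].append(s % 10)

display = {stem: sorted(leaves) for stem, leaves in sorted(stems.items())}
for stem, leaves in display.items():
    print(stem, *leaves)
```

Running this reproduces the rank-ordered display above, one stem per line.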
Exercise 6:
Construct a stem-and-leaf display for the following data.
22 44 36 45 49 57 38 47 51 12
18 48 32 19 43 31 26 40 37 52
Example 5:
The following is a crosstabulation of starting salaries (in $1,000s) of a sample of business school
graduates by their gender.

                      Starting Salary
Gender    Less than 20    20 up to 25    25 and more    Total
Female    12              84             24             120
Male      20              48             12             80
Total     32              132            36             200
(a) What general comments can be made about the distribution of starting salaries and the
gender of the individuals in the sample?
Answer: Using the frequency distribution at the bottom margin of the above table, it is noted
that the majority of the individuals in the sample (132) have starting salaries in the range of
$20,000 up to $25,000, followed by 36 individuals whose salaries are at least $25,000; only 32
individuals had starting salaries of under $20,000. Now considering the right-hand margin, it is
noted that the majority of the individuals in the sample (120) are female, while 80 are male.
(b) Compute row percentages and comment on the relationship between starting salaries and
gender.
Answer: To compute the row percentages we divide the values of each cell by the row total and
express the results as percentages. Let us consider the row representing females. The row
percentages (across) are computed as (12/120)(100) = 10%; (84/120)(100) = 70%;
(24/120)(100) = 20%. Continuing in the same manner and computing the row percentages for
the other row, we determine the following row percentages table:

                      Starting Salary
Gender    Less than 20    20 up to 25    25 and more    Total
Female    10.0            70.0           20.0           100.0
Male      25.0            60.0           15.0           100.0
(c) Compute column percentages and comment on the relationship between gender and starting
salaries.
Answer: Column percentages are computed by dividing the value in each cell by the column total and
expressing the result as a percentage. For instance, for the category of "Less than 20" the column
percentages are computed as (12/32)(100) = 37.5% and (20/32)(100) = 62.5%. Continuing in
the same manner, the column percentages (rounded) will be as follows.

                       Starting Salary
Gender    Less than 20    20 up to 25    25 and more
Female    37.5            63.6           66.7
Male      62.5            36.4           33.3

Considering the "Less than 20" category, it is noted that the majority (62.5%) are male. In the
next category of "20 up to 25" the majority (63.6%) are female. Finally, in the last category of
"25 and more" the majority (66.7%) are female.
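As a quick check, the row and column percentages above can be recomputed with plain Python. This is an illustrative sketch only; the dictionary layout and function names are ours, with the cell values taken from the crosstabulation in Example 5:

```python
# Recomputing the row and column percentages for the salary crosstabulation;
# the dictionary layout and function names are illustrative only.
salary = {
    "Female": [12, 84, 24],   # Less than 20, 20 up to 25, 25 and more
    "Male":   [20, 48, 12],
}

def row_percentages(table):
    """Divide each cell by its row total and express the result as a percentage."""
    return {group: [100 * v / sum(row) for v in row] for group, row in table.items()}

def column_percentages(table):
    """Divide each cell by its column total and express the result as a percentage."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {group: [100 * v / t for v, t in zip(row, col_totals)]
            for group, row in table.items()}

print(row_percentages(salary)["Female"])      # [10.0, 70.0, 20.0]
print(column_percentages(salary)["Male"][0])  # 62.5
```

The printed values match the row and column percentage tables computed by hand above.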
Exercise 7:
A survey of 400 college seniors resulted in the following crosstabulation regarding their
undergraduate major and whether or not they plan to go to graduate school.
Undergraduate Major
Yes 35 42 63 140
No 91 104 65 260
Example 6:
The following data show the number of absences (x) and the average grade (y) for a sample of
8 students.

Student    Absences (x)    Average Grade (y)
1          1               94
2          2               78
3          2               70
4          1               88
5          3               68
6          4               40
7          8               30
8          3               60
Develop a scatter diagram for the relationship between the number of absences (x) and their
average grade (y).
Answer: A scatter diagram is a graphical method of presenting the relationship between two
variables. The scatter diagram is shown in Figure 2.6. The number of absences (x) is shown on
the horizontal axis and the average grade (y) on the vertical axis. The first student has one
absence (x=1) and an average grade of 94 (y=94). Therefore, a point with coordinates of x=1
and y=94 is plotted on the scatter diagram. In a similar manner all other points for all 8 students
are plotted.
The scatter diagram shows that there is a negative relationship between the number of absences
and the average grade. That is, the higher the number of absences, the lower the average grade
appears to be.
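As a rough numerical check of this visual impression, the Pearson correlation coefficient for the same data can be computed with only the standard library. A scatter diagram itself could be drawn with matplotlib's plt.scatter(x, y); the helper function below is ours:

```python
# Checking the direction of the absences/grades relationship numerically via
# the Pearson correlation coefficient. A stdlib-only sketch; the helper is ours.
from math import sqrt

x = [1, 2, 2, 1, 3, 4, 8, 3]           # number of absences per student
y = [94, 78, 70, 88, 68, 40, 30, 60]   # average grade per student

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sqrt(sum((a - mx) ** 2 for a in xs))
    sy = sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

r = pearson_r(x, y)
print(round(r, 3))  # -0.915
```

The coefficient is strongly negative (about −0.92), consistent with the downward pattern seen in the scatter diagram.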
Exercise 8:
You are given the following ten observations on two variables, x and y.
x y
1 8
5 15
6 20
4 12
2 10
8 20
9 26
1 5
6 18
8 26
(a) Develop a scatter diagram for the relationship between x and y.
(b) What relationship, if any, appears to exist between x and y?
Inferential Statistics
We have been talking about ways to calculate and describe characteristics about data.
Descriptive statistics tell us information about the distribution of our data, how varied the data
are, and the shape of the data. Now we are also interested in information related to our data
parameters. In other words, we want to know if we have relationships, associations, or
differences within our data and whether statistical significance exists. Inferential statistics help
us make these determinations and allow us to generalize the results to a larger population.
Inferential statistics is defined as the branch of statistics that is used to make inferences about the
characteristics of populations based on sample data. We provide background about parametric
and nonparametric statistics and then show basic inferential statistics that examine associations
among variables and tests of differences between groups.
In the world of statistics, distinctions are made in the types of analyses an evaluator can use
based on distribution assumptions and the level of measurement of the data. For example,
parametric statistics are based on the assumptions of a normal distribution and randomized
sampling that result in interval or ratio data. The statistical tests usually determine the significance
of a difference or of a relationship. These parametric statistical tests commonly include t-tests, Pearson
product-moment correlations, and analyses of variance.
Nonparametric statistics are known as distribution-free tests because they are not based on the
assumptions of the normal probability curve. Nonparametric statistics do not specify conditions
about parameters of the population but assume randomization and are usually applied to nominal
and ordinal data. Several nonparametric tests do exist for interval data, however, for use when the
sample size is small and the assumption of a normal distribution would be violated. The most
common forms of nonparametric tests are chi square analysis, Mann-Whitney U test, the
Wilcoxon matched-pairs signed ranks test, Friedman test, and the Spearman rank-order
correlation coefficient. These non-parametric tests are generally less powerful tests than the
corresponding parametric tests. The table below provides the parametric and nonparametric equivalent
tests used for data analysis. The following sections will discuss these types of tests and the
appropriate parametric and nonparametric choices.
Number of Samples
Different situations require different testing procedures. The following discussion is organized
according to the number of samples that is being evaluated: one sample, two samples, and more
than two samples. In each category, both parametric and nonparametric tests are explained.
Suppose that a manufacturer of car tires claims that the mean life time of a particular type of tire
(measured in miles driven) is 40,000 miles. A consumer organization tests 40 of these tires in
real life circumstances and finds a mean life time of 37,000 miles with a standard deviation of
7500 miles in the life times of the tires in this sample. The question now is: does the test of this
consumer organization indicate that the mean life time of this particular type of tire is
significantly different from the claimed 40,000 miles, or is it not significantly different from
the 40,000 miles claimed by the manufacturer? In this case the difference between the
manufacturer’s claim and the sample result could be explained by statistical fluctuations; taking
a different sample may yield a slightly different result.
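To make the tire question concrete, the gap between the sample result and the claim can be expressed in standard errors. A minimal sketch, assuming the large-sample test statistic z = (x̄ − μ) / (s / √n); the variable names are ours:

```python
# A sketch quantifying the tire question, assuming the large-sample
# z statistic z = (x̄ - μ) / (s / √n); names are ours.
from math import sqrt

x_bar = 37_000   # sample mean life time (miles)
mu = 40_000      # manufacturer's claimed mean
s = 7_500        # sample standard deviation
n = 40           # tires tested

z = (x_bar - mu) / (s / sqrt(n))
print(round(z, 2))  # -2.53
```

The sample mean lies about 2.5 standard errors below the claim; whether that counts as "significantly different" is exactly what the hypothesis tests below are designed to decide.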
To answer questions like these we will study a branch of statistics called hypothesis testing. This
can be done in a variety of different ways, but not all set-ups work well. In this and all subsequent
sections on hypothesis testing we will use the "Classical Approach", which always works. I request
that you do not deviate from this approach and use this set-up for all hypothesis testing problems.
Also a familiar set-up will make it easier for you to work with these rather lengthy problems.
The classical approach is a four step procedure. First, we formulate a working hypothesis called
the null hypothesis. Second, we will formulate a decision rule stating when to accept the null
hypothesis. Third, we will calculate the test statistic, a formula specific for the sort of hypothesis
test we will perform, and compare the value of this test statistic to the decision rule to decide
whether or not to accept the null hypothesis. Finally we formulate an answer statement to the
question we have been given.
Two tailed: The key word for a two tailed test is the mention of the words "significantly
different". In the conclusion we then use words like "is not significantly different from".
One tailed: We have two types of one tailed tests; left tailed and right tailed tests. It pays off to
be systematic.
A left tailed test is performed in case of a problem asking us whether or not the
population mean is significantly less than a stated mean value. The null hypothesis in a
left tailed test states that the population mean is not significantly less than the stated mean
value, and the decision rule is to accept the null hypothesis when the calculated value is
greater than the negative of the table value (for instance, in a z-test: z > −z_c). In the
conclusion we use words like "is not significantly less than" or "is significantly less than".
A right tailed test is performed when the question is something like "is the population
mean significantly more than"? The null hypothesis in case of a right tailed test states that
the population mean is not significantly more than the stated mean value, and the decision
rule is to accept the null hypothesis when the calculated value is less than the table value (for
instance, in a z-test: z < z_c). In the conclusion we use words like "is not significantly more
than" or "is significantly more than".

The t-test is used for hypothesis tests involving a given population mean µ and a sample mean x̄
when the sample size n is less than 30, in symbolic form n < 30. The result of the test will be a
conclusion in which we state that the population mean is, or is not, significantly different from,
or is less than or more than, what is stated.
The test statistic for this type of hypothesis testing is

t = (x̄ − μ) / (s / √n)

and the critical value t_c is to be found from the t-distribution table.
Example:
A manufacturer of strapping tape claims that the tape has a mean breaking strength of 500
pounds per square inch (psi). A random sample of 16 specimens is drawn from a large shipment
of tape and a sample mean breaking strength of 480 psi is computed. The sample standard
deviation is 50 psi. Can we conclude from these data that the mean breaking strength for this
shipment is less than what is claimed by the manufacturer? Use the 0.01 level of significance.
Solution:
The words "less than" in the last sentence of the problem tell us to perform a left tailed test.

Step 1: The Null Hypothesis
The mean breaking strength of the tape in this shipment is not significantly less than 500 psi.

Step 2: The Decision Rule
The decision rule requires the critical value t_c. As outlined in an earlier document called
"Confidence Intervals", the t-distribution table works with the concept of "degrees of freedom",
abbreviated d.f. Degrees of freedom is defined here as the sample size minus one, in symbols
d.f. = n − 1. To find the critical value for this test we use the "One tail" line of the table with
α = 0.01 and d.f. = 16 − 1 = 15, which gives t_c = 2.602. Since we are on the left side of the
mean we have to use a negative value for the critical value.
The decision rule is thus that we accept the null hypothesis if t > −2.602.
Step 3: The Test Statistic
Using t = (x̄ − μ) / (s / √n) we find t = (480 − 500) / (50 / √16) = −1.6.
Combining the info from steps 2 and 3 in a picture with a normal curve, the acceptance region
(shaded) lies to the right of t_c = −2.602, with the calculated value t = −1.6 between t_c and
t = 0.
We see that t is in the shaded region, i.e. in the acceptance region formulated in step 2. Thus we
conclude that at the 0.01 level of significance the mean breaking strength of the tape is not
significantly less than 500 psi. This means that the manufacturer's claim is accepted as correct.
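The four steps of the tape example reduce to a short computation. A sketch in Python; the variable names are ours, and the critical value is the one taken from the t-table above:

```python
# The tape example's four steps reduce to one computation; a sketch
# (variable names are ours; t_c comes from the t-table, alpha = 0.01, d.f. = 15).
from math import sqrt

x_bar, mu, s, n = 480, 500, 50, 16
t = (x_bar - mu) / (s / sqrt(n))   # test statistic t = (x̄ - μ) / (s / √n)
t_c = -2.602                       # negative critical value for a left tailed test

accept_null = t > t_c              # decision rule: accept H0 if t > -2.602
print(t, accept_null)              # -1.6 True
```

Since t = −1.6 is greater than −2.602, the null hypothesis is accepted, matching the conclusion reached above.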
Exercise 1:
Exercise 2:
A city health department wishes to determine if the mean bacteria count per unit volume of water
at Siesta Lake Beach is higher than the safety level of 200. Researchers have collected 10 water
samples and have found the bacteria count per unit volume to be 185, 190, 215, 198, 204, 207,
211, 205, 198 and 210. At the 0.1 level of significance, do the data warrant cause for concern?
A chi-square (χ²) test for goodness of fit is performed when the question is whether or not an
observed pattern or a distribution of numbers is significantly different from an expected pattern
or a distribution of numbers.
It is used to compare observed and expected frequencies within a group in a sample, i.e. whether
the observed results differ from the expected results, with the expected results derived either
from the whole population or from theoretical expectations.
In addition to the universal assumptions, the chi-square goodness of fit test rests on the
assumptions that the categories in the cross tabulation are mutually exclusive and exhaustive,
that the dependent variable is nominal, that no expected frequency is less than 1, and that no
more than 20% of the expected frequencies are less than 5.
The chi-square statistic is looked up in a table of critical values, and the statistic must be larger
than the critical value to reject the null hypothesis. Chi-square values range from 0 into the
hundreds, and higher values indicate a larger difference between the observed and expected
frequencies.
Examples:
The expected or claimed number of cars rented out per category (like small, medium size,
large, SUV etc) versus the actual number of cars rented out per category.
The expected number of customers per two-hour time periods entering a shop vs. the
actually observed number of customers per two-hour time periods.
The nationwide number of ex-convicts arrested again after their release from prison vs.
the number of arrests of ex-convicts observed in a particular city.
We will work these three examples in this chapter. It is essential to start each of the problems for
this test with a table that lists the categories along with their observed and expected frequencies,
looking like this one:

Category    Observed frequency (O)    Expected frequency (E)
We will always have a null hypothesis which states that the observed distribution is not
significantly different from the expected distribution and of course use words relevant to that
particular problem.
The decision rule for this test will always be χ² < χ²_c, where the critical value χ²_c has to be
read from the χ² distribution table. The only two numbers needed to look up this critical value
are the level of significance α and the number of degrees of freedom. The degrees of freedom for
this test will be defined as the number of categories minus 1. This is how we find the critical
value for a particular problem: suppose that we use α = 0.05 and have 5 degrees of freedom (6
categories). Reading the table where the row for 5 degrees of freedom meets the column for
α = 0.05, we find the critical value χ²_c = 11.071.
The test statistic is

χ² = Σ (E − O)² / E

where E and O are the expected and observed frequencies per category. How to find these values
and work out the problems will hopefully become clear when working the examples below.
All submitted work concerning hypothesis testing will have to follow the usual 4 step format.
Example:
A new branch of a large car rental company is to be opened on a sunny island. The general
management of this company from previous experiences expects 25% of the car rental contracts
to be for small cars, 35% for medium size cars, 10% for large cars, 25% for SUV’s and the
remaining 5% for specialty cars such as vans, pick ups etc. The local manager on this island
decides to test whether or not this distribution is actually what they see happening in their office.
Out of 200 randomly sampled car rental contracts they note that 37 are for small cars, 81 for
medium size cars, 14 for large cars, 61 for SUV’s and 7 for specialty cars. Can it, at the 0.01
level of significance be concluded that the general management’s claim is correct?
Solution:
Before we start our usual 4-step hypothesis testing routine we have to collect all the information
in a table. What we really need in the table is a list of the 5 categories of rental cars, the 5
observed frequencies for each of the categories and the 5 expected frequencies. These expected
frequencies we have to calculate from the percentages given but that is easily done. After all, the
general management claims that 25% of all cars rented are small cars. In the sample of 200 cars
this means that 0.25 · 200 = 50 are small cars. Likewise 0.35 · 200 = 70 are medium size cars.
Continuing like this we can collect all the info in the following table:

Category         Observed (O)    Expected (E)
Small            37              50
Medium           81              70
Large            14              20
SUV              61              50
Specialty Car    7               10
Step 1: The Null Hypothesis
The observed distribution of the local manager is not significantly different from the expected
distribution given by the general management of this company.

Step 2: The Decision Rule
Since there are 5 categories of rental cars, the number of degrees of freedom is
d.f. = 5 − 1 = 4. Using the χ² distribution table as explained above, we find a critical value of
χ²_c = 13.28. Using the form of the decision rule outlined above, we state that we accept the
null hypothesis if χ² < 13.28.
Step 3: The Test Statistic
Using χ² = Σ (E − O)² / E we find

χ² = (50 − 37)²/50 + (70 − 81)²/70 + (20 − 14)²/20 + (50 − 61)²/50 + (10 − 7)²/10
   = 3.38 + 1.73 + 1.80 + 2.42 + 0.90 = 10.23
Combining the info from steps 2 and 3 in a picture with a χ² distribution, the acceptance region
(shaded) lies to the left of χ²_c = 13.28. We see that χ² = 10.23 is in the shaded region, that is,
it is in the acceptance region formulated in step 2.
Thus we conclude that at the 0.01 level of significance the observed distribution is not
significantly different from the expected distribution as stated by the general management of the
car rental company.
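The same chi-square computation can be scripted. A sketch with the observed counts and expected percentages taken from the problem statement; the variable names are ours:

```python
# The chi-square goodness-of-fit computation for the car rental example;
# a sketch with counts and percentages taken from the problem statement.
observed = {"Small": 37, "Medium": 81, "Large": 14, "SUV": 61, "Specialty": 7}
shares = {"Small": 0.25, "Medium": 0.35, "Large": 0.10, "SUV": 0.25, "Specialty": 0.05}

n = sum(observed.values())                    # 200 sampled contracts
expected = {cat: share * n for cat, share in shares.items()}

# Test statistic: chi^2 = sum over categories of (E - O)^2 / E
chi2 = sum((expected[c] - observed[c]) ** 2 / expected[c] for c in observed)

chi2_crit = 13.28  # chi-square table, alpha = 0.01, d.f. = 5 - 1 = 4
print(round(chi2, 2), chi2 < chi2_crit)  # 10.23 True -> accept the null hypothesis
```

Because 10.23 < 13.28 the null hypothesis is accepted, matching the conclusion of the worked example.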
Exercise 1:
It is claimed that there is no preference for customers to come into a shop as far as the time of
day is concerned. To test the correctness of this claim the manager decides to tally the number of
customers that enter the shop during 6 two-hour periods in a particular week and arrives at the
following information:
Time period No of customers
08 – 10 19
10 – 12 27
12 – 14 38
14 – 16 38
16 – 18 32
18 – 20 26
Judging from these data, can it at the 0.05 level of significance be concluded that customers
indeed have no preference as far as the time of the day is concerned to visit this shop?
Exercise 2:
A national study revealed that, within 5 years of their release from prison, 20% of criminals had
not been arrested again, 38% had been arrested once, and so on. The table underneath shows the
complete distribution:

Number of arrests    Percent
0                    20.0
1                    38.0
2                    18.0
3                    13.5
4 or more            10.5
A social agency in a large city has developed a guidance program for former prisoners who settle
there. Anxious to compare local results with the national figures, the director of the social agency
selected at random 200 former prisoners who were in the guidance program. His results are
summarized in the following table:

Number of arrests    Number of former prisoners
0                    58
1                    62
2                    28
3                    16
4 or more            36
At the 0.01 level of significance, how would the director of this social agency compare the local
results with the national figures?