Engineering Stats Lecture Notes
OUTCOMES
This unit deals with the role of statistics in the data analysis process. Concepts that are basic to the study of
statistics are discussed.
We live in an era where we are faced with increasing amounts of information, referred to as data. To perform
many tasks efficiently we need a basic understanding of statistical methods. The field of statistics covers a
problem-solving process that seeks answers to questions through data. To be an informed consumer of
information you must be able to:
extract information from tables and graphs
follow numerical arguments
understand the basics of how data should be gathered, summarised and analysed to draw statistical
conclusions
Example 1.1
As part of a weekly check to assess the calibration of a filling machine, the quality control manager randomly
selects 50 bottles of beer that were filled on a specific day.
4. Interpret results
1.2. DEFINITION
Statistics is the scientific discipline that provides methods to help us make sense of data by:
collecting data in a methodical way
analysing data using methods to organise and summarise data using tables, graphs and numbers
interpreting data to draw conclusions or to answer questions
The field of statistics is subdivided into descriptive and inferential statistics:
Descriptive statistics includes the collection and summarising of data to give an overview of the
information collected
Inferential statistics is the process of making an estimate, prediction or decision about a population
based on sample data
o A population is almost always very large
o A sample is drawn and data summarised using descriptive techniques
o The results are used to make decisions about the population
o Reliability of decisions/conclusions is measured
Confidence level: the proportion of times that an estimating procedure will be correct
in the long run
Significance level: how frequently the conclusion will be wrong
Quantitative variables are further classified as discrete or continuous
o Discrete variables are countable
Values are obtained through counting
E.g. the number of students in the class
o Continuous variables have an infinite number of possible values that are not countable
Values are obtained through measuring or weighing
E.g. weight, length, time taken to complete a task, age, etc.
Example 1.2
Distinguish between qualitative and quantitative variables.
1. Gender
2. Temperature
3. Postal code
Example 1.3
Distinguish between discrete and continuous variables.
2. Number of cars that arrive at the KFC drive-through between 10h00 and 12h00
3. Distances different model cars with the same tank capacity can drive in city driving conditions
4. Temperature
1.4. MEASUREMENT
Measurement is the process we use to assign a value to the observations or elements of a variable. There are
four levels or scales of measurement: nominal, ordinal, interval, ratio. The analyses depend on the scale used
to measure a variable.
1.4.4. Ratio scale
Data can be arranged in order
Both differences between data values and ratios of data values are meaningful
This scale must contain a zero value that indicates that nothing exists for the variable at the zero point
Arithmetic operations can be performed on the numeric values themselves.
E.g. money:
o The zero point is meaningful, i.e. at zero you have none
o The difference between R10 and R20 is the same as the difference between R50 and R60
o R10 is twice as much as R5
Arithmetic operations can be performed on the numbers themselves
Variables such as distance, height, weight, and time use the ratio scale
Activity 1.1
Categorise these measurements according to level.
Variable/measurements Measurement scale
1. Species of fish in the Vaaldam
2. Cost of rod and reel
3. Time of return home
4. Rating of fishing area: Poor, fair, good
5. Number of fish caught
6. Temperature of water
Activity 1.2
The student council at a university with 10000 students is interested in the proportion of students who favour
a change in the admission requirements at the university. Two hundred students are interviewed to determine
their attitude towards this proposed change. Of the 200, 64 (32%) are in favour of the change. The student
council announced that less than 35% of all students are in favour of a change.
c) Classify the variable in terms of type and measurement scale
UNIT 2: COLLECTION OF DATA
OUTCOMES
This unit deals with how and where to obtain data that can be used to make informed decisions. The quality
of the final product depends on the quality of the raw material used.
Remember the acronym GIGO - garbage in, garbage out.
2.2.2. Observations
Collecting data relies on watching or listening, and then counting or measuring events as they happen without
anything being done to the individuals or objects. Draw up an observation sheet and keep count of the
observations using a tally table. Keep count by drawing a straight line for each item counted ( | | | | ). The fifth line
is drawn across the first four lines so that we can count in multiples of five.
Example 2.2:
The police wanted to determine whether motorists using a certain road wore seatbelts. They observed if the
driver used a seatbelt as the cars passed by and counted how many wore seatbelts and how many did not.
Personal interview
Obtain data verbally and face to face
Candidates are selected at random
Popular method for conducting market research about products
Interviewers must be trained to ask questions and record responses, which makes this method more
costly and time consuming
You can obtain in-depth responses, not only by listening to the answer but also by interpreting their
body language
Difficult questions can be clarified
Show visual displays or products to the respondent to provide better communication and motivation
to participate in survey
Telephone interview
Present questionnaire by telephone
Telephone surveys are less costly than personal interviews and can be conducted over wider
geographical areas
People are more open in their opinions as there is no face-to-face contact
Some people in the sample will not have phones or will not be home when you call them
Mail questionnaire
Respondents are asked to complete and return a questionnaire, which they receive in the mail,
newspaper, magazine or attached to a product
Covers a wider geographical area than telephone or personal interviews, since it is the
least expensive method
Can cover a larger sample as it is relatively cheap
Respondents can remain anonymous and will be more open and honest in their opinion
Disadvantages include a low response rate, inappropriate answers to questions and illiteracy of some
people included in the sample
Internet-based questionnaires
E-mail with a link to the questionnaire on a secure website
Quick and inexpensive, but often less detailed
Disadvantages include a low response rate, excluding people who do not have a computer or are
computer illiterate
Activity 2.1
Rate the survey method as either 1 (most appropriate), 2 (less appropriate) or 3 (least appropriate), under the
following circumstances:
Decide where in the random table you should start. You can choose to use the first two digits, the middle two
digits or the last two digits. You can even choose which columns to use. You can make this decision by using
the “goldfish-bowl” technique or by closing your eyes and pointing to a spot in the table.
Suppose you have decided to start in the first column with the first two digits. If we reach the bottom of the
last column on the right and are still short of our desired number, we can go back to the beginning and start
reading the third and fourth digits of each number. According to the table, employee numbers 70, 23, 20, 22,
53, 39, 48, 64, 12 and 45 will be in the sample of ten.
If a number occurs more than once, you skip it. You cannot use any population ID twice because there is a
unique ID assigned to each element in the population.
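The random-number-table procedure can also be mimicked in software. The following is a minimal sketch, assuming the employees are numbered 1 to 80 (the population size is an assumption; the notes do not state it) and using Python's standard random module in place of the printed table. The fixed seed is only there to make the illustration repeatable.

```python
import random

# Assumed population: employee IDs 1..80 (hypothetical size, not given in the notes)
population_ids = list(range(1, 81))

rng = random.Random(2024)  # fixed seed so the illustration is repeatable

# sample() draws without replacement, so no ID can be selected twice,
# which mirrors the rule of skipping a number that occurs more than once
sample_ids = rng.sample(population_ids, k=10)

print(sorted(sample_ids))
```

Drawing without replacement plays the same role as skipping repeated numbers in the table.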
Activity 2.3
Each student at the university has a mailbox on campus. The mailboxes are numbered 0000 to 9000. Use the
random number table and select 10 mailbox numbers in your sample.
Question 2
Select a random sample of 25 of the 371 active telephone area codes in South Africa.
Question 3
At a party, 30 students are 21 or older and 15 are under 21. Select a sample of size 5 to measure attitudes
towards alcohol.
OUTCOMES
Describe data by summarizing and displaying it using tables and graphs so that the salient features of the
dataset are more easily understood.
Data formats
Raw data: list of the observations for each variable
o Provide little information
o Raw data must be organised and summarised to get the overall picture
Frequency distribution/table: arrange raw data in a summarised form
Graphs: visual representation of frequency tables
o Complement a table by showing the data’s general structure more clearly and revealing
relationships that might be overlooked in a table.
o Type of graph depends on the type of data, the complexity of the data and the requirements of
the user
Example 3.1
Activity 3.1
A bio-kinetics instructor wants to study the different types of rehabilitation required by her patients. She
selected a random sample of her patients and recorded the body part requiring rehabilitation. The following
results were obtained:
Construct a frequency distribution and a relative frequency distribution to describe the data. Give a short
interpretation of your results.
Total
3.1.2. Cross-tabulation
Summary table with two categorical variables (bivariate) is known as a two-way or contingency
frequency table
One variable in the rows and the other variable in the columns
Each row and column combination is called a cell
o Observed cell counts = number of times each combination occurs in the data for all cells (joint
data/information)
o Marginal totals = total observed cell counts in each row and also in each column
o Grand total = total of all the observed cell counts in the table
E.g. data collected at a university to compare students, staff and management on the basis of their
transportation to campus (taxi, bus, car, train, motorcycle, bicycle or on foot)
o 3 × 7 two-way frequency table
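The terms observed cell count, marginal total and grand total can be illustrated with a small sketch. The counts below are hypothetical (they are not taken from these notes) and the table is reduced to 2 rows and 2 columns to keep the example short.

```python
# Hypothetical observed cell counts (illustrative only, not from the notes):
# rows = group, columns = mode of transport
observed = {
    "Students": {"Car": 30, "Bus": 50},
    "Staff": {"Car": 15, "Bus": 5},
}
columns = ["Car", "Bus"]

# Marginal totals: total of the observed cell counts in each row
row_totals = {row: sum(cells.values()) for row, cells in observed.items()}

# Marginal totals: total of the observed cell counts in each column
col_totals = {col: sum(observed[row][col] for row in observed) for col in columns}

# Grand total: total of all the observed cell counts in the table
grand_total = sum(row_totals.values())

print(row_totals)   # {'Students': 80, 'Staff': 20}
print(col_totals)   # {'Car': 45, 'Bus': 55}
print(grand_total)  # 100
```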
Activity 3.2
People believe that organic foods are healthier than conventionally grown fruit and vegetables. An investigation
was carried out on a sample of 10000 food items. The following table displays the frequencies for all possible
category combinations of the two variables: food type and pesticide status. Briefly comment on these results.
Pesticides
Food type Present Not present Total
Organic 28 99 127
Conventional 9085 788 9873
Total 9113 887 10000
Dimensions:
Marginal information:
Joint information:
Activity 3.3
One hundred students, majoring in Sciences were classified according to gender and year of study. Ten were
1st year women, 20 were senior women, 40 were 1st year men and 30 were senior men. Arrange the data in a
contingency table. Briefly comment on your results.
Dimensions:
Marginal information:
Joint information:
Steps
Communicate a single variable
Identify, scale and label axes
o Variable categories on x-axis
o Frequency, relative frequency, frequency percentage on y-axis
Category axis
o Bars represent categories
o The bars must all have the same width
o Make the bars reasonably wide so that they can be clearly seen
o Gap between bars
Example 3.2
DIY
Activity 3.4
Draw a simple bar chart showing the ages of employees and draw conclusions from your results.
Age Number of employees
20 11
21 4
22 8
23 6
24 5
5) Graph:
3.1.4. Comparative bar graphs
To compare two or more data sets
Use contingency data as input
Multiple bar chart:
o Bars are grouped together in each category
o Can use relative frequency if sample sizes are different
Segmented /stacked bar chart:
o Categories are stacked
o Useful to emphasise relative proportions of components that make up the category
Use a key to distinguish between the categories
2) Draw a multiple bar chart showing the ages of male and female employees
3) Draw a percentage component bar chart showing the ages of male and female employees
Steps
Draw a circle to represent the entire data set
Keep the categories to 10 or fewer
For each category calculate the slice size, relative to the whole circle
Put any labelling outside the circle
Look for categories that form large and small proportions of the data set when interpreting the chart
Example 3.5
DIY
Activity 3.7
Draw a pie chart to portray how you might divide up your day, using the following data:
Travelling 10%
Working 30%
Eating 10%
Sleeping 28%
Social life 7%
Other 15%
3.1.6. Pictogram
Not included in curriculum
3.2. SUMMARISING QUANTITATIVE DATA IN TABLES
3.2.1. The ordered array of data
If there are not too many observations we can use the collected data in its raw form, also known as
ungrouped data
Arrange the data in an array: sort data in numerical order from small to big
o Necessary for some statistical procedures, such as the median, percentiles and quartiles
Example 3.6
Arrange the following data in an array:
Data: 4 80 50 10 5
Array: 4 5 10 50 80
Activity 3.8
Arrange the following numbers in an array:
Data: 67 23 56 45 56 41 34 33 0 18 23
Array:
Steps
Construct a horizontal axis and label it
Mark the axis with a scale to fit the smallest to largest value in the data set
For each observation, place a dot above its value on the number line
If there are two or more observations with the same value, stack the dots vertically
The number of dots above a value represents the frequency of occurrence of that value
Example 3.7
Activity 3.9
The following table lists 15 popular cereals and the amounts of sodium and sugar in a single serving of 180ml.
Cereal Sodium (mg) Sugar (g) Cereal Sodium (mg) Sugar (g)
A 290 2 I 250 10
B 200 3 J 125 14
C 230 3 K 220 3
D 125 13 L 0 7
E 260 5 M 220 12
F 200 11 N 170 3
G 210 12 O 140 10
H 140 10
Stem Leaf
7 6
Steps
Select leading digit(s) for the stem
Find the smallest and the largest numbers in the data to determine the first stem and the last stem
List all possible stems in increasing order, to the left of the line
The trailing digit(s) become the leaves
o Place the leaves with the same stem on the same row as the stem.
o Arrange the leaves in each row from lowest to highest to form a stem-and-leaf plot.
Use a label to indicate the units for stems and leaves in the display.
The count of the number of leaves per row is the frequency of each row
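The steps above can be sketched in a few lines of code. This is only an illustration, assuming non-negative whole numbers where the tens form the stem and the units form the leaf; the function name stem_and_leaf and the data values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (units)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leading digit(s) and trailing digit
        rows[stem].append(leaf)
    # List every possible stem from the smallest to the largest, even if empty
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}

# Illustrative two-digit data (hypothetical values)
data = [23, 45, 41, 18, 23, 34, 56, 33]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    # The number of leaves in a row is the frequency of that row
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}   (f = {len(leaves)})")
```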
Activity 3.10
The following is an array of the daily litres of used sunflower oil bought by a biodiesel plant.
58 63 69 69 70 71 71 72 72 72 73 73 74 75 77
79 80 82 84 84 85 88 91 91 91 94 96 97 99 100
Example 3.9
Activity 3.11
Value (x) Frequency (f)
4 1
6 2
14 1
15 4
17 1
18 1
Total
Additional information
Different formats are used to denote the class intervals
E.g. from 1 to under 10 could be represented as:
o [1, 10)
o 1 ≤ x < 10
Example 3.10
Activity 3.12
Frequency distribution of acrylamide levels in Big Mac potato fries (from Example 3.10)
Class intervals Frequency
[151, 187) 3
[187, 223) 5
[223, 259) 6
[259, 295) 6
[295, 331) 8
[331, 367) 2
Total 30
1) The number of franchises with acrylamide levels in their French fries between 259 and 295 is____
2) The frequency for the class with acrylamide levels between 187 and 223 is____
3) The upper boundary of the first class is____
4) The lower boundary of the third class is____
5) The total number of observations in the data set is____
Relative frequency distribution
When the proportion of observations in each class interval instead of the actual number of observations
is recorded, the distribution is known as a relative frequency distribution
Relative frequency distributions are useful for comparing two data sets, especially when the sample
sizes or measurement scales differ substantially
A relative frequency of a class is the observed frequency of the class divided by the total number of
observations in the data set. If the percentage is required, multiply the result by 100
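As a small illustration of this calculation, the sketch below converts the acrylamide frequency distribution from Activity 3.12 into relative frequencies and percentages. The variable names are chosen for this example only.

```python
# Frequency distribution of acrylamide levels (Activity 3.12)
classes = ["[151, 187)", "[187, 223)", "[223, 259)",
           "[259, 295)", "[295, 331)", "[331, 367)"]
frequencies = [3, 5, 6, 6, 8, 2]

n = sum(frequencies)  # total number of observations

# Relative frequency = class frequency / total number of observations
relative = [f / n for f in frequencies]

for label, f, rf in zip(classes, frequencies, relative):
    # Multiply by 100 when the percentage is required
    print(f"{label:<12} {f:>3} {rf:>8.3f} {rf * 100:>7.1f}%")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check on the arithmetic.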
1) Range:
2) Number of class intervals:
3) Width of class interval:
4) Test:
5) Lower and upper class boundaries of each interval (in table)
6) Tally (data are sorted)
7) Frequency (in table)
8) Total frequency (in table)
Class intervals   Frequency   Relative frequency   Cumulative frequency   Midpoint
Steps
1) Mark the class boundaries on the x-axis. The class intervals are equal in width; therefore the points
must be equidistant from one another
2) Use either f or %f on the y-axis. A proper scale showing the true zero must be used on the y-axis in
order not to misrepresent the character of the data
3) Draw a rectangle for each class directly above the corresponding interval. The height of each rectangle
is the frequency (or relative frequency) of the corresponding class
4) There are no gaps between the bars of the histogram
Uniform distribution:
Frequencies are evenly distributed across the scale
Example 3.12
[Figure: histogram of acrylamide levels. x-axis: Acrylamide levels (class boundaries 115 to 403 in steps of 36); y-axis: Frequency (0 to 9)]
Activity 3.15
Construct a histogram for time spent using computers (from Activity 3.13) and comment on the shape
Class intervals Frequency
[0, 3.5) 10
[3.5, 7.0) 8
[7.0, 10.5) 8
[10.5, 14.0) 13
[14.0, 17.5) 8
[17.5, 21.0) 2
[21.0, 24.5) 1
Total 50
3.3.2. The polygon and relative polygon
The polygon is a line graph that can be used to portray the shape of the distribution. A polygon that uses the
relative frequencies of the intervals is called a relative polygon. It has the same shape as the frequency polygon,
but uses a percentage scale / relative frequency on the y-axis.
Steps
1) Determine the class midpoint (x) of each class
2) Mark the class midpoints on the x-axis
3) Mark the frequencies on the y-axis using a proper scale
4) Plot each midpoint together with its corresponding frequency
5) Connect the successive dots with a straight line to form the polygon
6) Begin and end the line on the horizontal axis with a frequency of zero
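The points to plot for a frequency polygon can be computed as follows. This is a sketch using the acrylamide distribution from Activity 3.12; placing the closing zero-frequency points one class width before the first midpoint and one class width after the last midpoint is a common convention and is assumed here.

```python
# Class boundaries and frequencies for the acrylamide data (Activity 3.12)
boundaries = [151, 187, 223, 259, 295, 331, 367]
frequencies = [3, 5, 6, 6, 8, 2]

width = boundaries[1] - boundaries[0]  # the class intervals are equal in width

# Step 1: midpoint of each class = (lower boundary + upper boundary) / 2
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

# Step 6: begin and end on the horizontal axis with a frequency of zero
# (assumed convention: one class width beyond the first and last midpoints)
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

polygon_points = list(zip(xs, ys))
print(polygon_points)
```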
Activity 3.16
Construct a frequency polygon for time spent on the Internet (from activity 3.13)
Class intervals Frequency Midpoint
Steps
1) The frequency distribution must show class boundaries and cumulative frequencies
2) The frequency scale on the y-axis must extend to the total of the frequencies
3) Mark the class boundaries on the x-axis
4) For each class plot the upper boundary together with the cumulative ‘less-than’ class frequency
5) Draw a smooth curve through the points. The ‘less-than’ curve slopes upwards and to the right
6) The ‘less-than’ ogive begins on the horizontal axis at the lower class boundary of the first class
7) If the cumulative frequencies are expressed as percentages of the total, a relative ogive can be drawn
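The coordinates of a 'less-than' ogive follow directly from these steps. The sketch below uses the acrylamide distribution from Activity 3.12 and only computes the points; drawing the curve is left to the reader or to a plotting package.

```python
from itertools import accumulate

# Class boundaries and frequencies for the acrylamide data (Activity 3.12)
boundaries = [151, 187, 223, 259, 295, 331, 367]
frequencies = [3, 5, 6, 6, 8, 2]

# 'Less-than' cumulative frequency at each upper class boundary
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# The ogive begins on the x-axis at the lower boundary of the first class
ogive_points = [(boundaries[0], 0)] + list(zip(boundaries[1:], cumulative))

# Relative ogive: cumulative frequencies as percentages of the total
relative_ogive = [(x, 100 * cf / total) for x, cf in ogive_points]

print(ogive_points)
print(relative_ogive)
```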
Activity 3.17
Construct an ogive of the amount of time staff spent on computers (from Activity 3.13). Comment on the
graph
Class intervals   x-value   Cumulative frequency
[0, 3.5) 10
[3.5, 7.0) 18
[7.0, 10.5) 26
[10.5, 14.0) 39
[14.0, 17.5) 47
[17.5, 21.0) 49
[21.0, 24.5) 50
3.4 Using software
There are a number of useful software packages available for data presentation which are easy to use
Computers can help you develop your ideas about how to organize the information by using a ‘try and
refine’ approach, which would take too long to carry out manually
o For example, if you decide to break the information down in a certain way, and the results are
not what you need, it is a simple matter to create new ones and experiment again
Computer software can produce accurate and professional graphs and charts from data, but they are
only as useful as the data and instructions used to make them
UNIT 4: SUMMARISING DATA USING NUMERICAL DESCRIPTORS
In this unit we look at numerical measures that can be used to describe the characteristics of data collected in
their raw form (ungrouped data) as well as for data summarised into frequency distributions (grouped data).
Activity 4.1
A city planner working on bikeways needs information about local bicycle commuters. She designs a
questionnaire. One of the questions asks how many minutes it takes the rider to pedal from home to his or her
destination. A sample of 12 local bicycle commuters yielded the following times:
22 29 27 30 12 22 31 15 26 16 48 23
o $\bar{x} = \dfrac{\sum xf}{n}$
o Procedure for ungrouped frequency table data:
Multiply every value of the variable (x) with the corresponding frequency (f)
Sum the value xf
Divide the sum by n
o Procedure for grouped frequency table data:
Multiply the midpoint of every class (x) with the corresponding frequency (f)
Sum the value xf
Divide the sum by n
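The procedure for frequency table data can be written out as a short calculation. The sketch below uses the ungrouped frequency table from Activity 3.11; for grouped data the class midpoints would take the place of the x-values.

```python
# Ungrouped frequency table from Activity 3.11
values = [4, 6, 14, 15, 17, 18]      # x
frequencies = [1, 2, 1, 4, 1, 1]     # f

n = sum(frequencies)                 # total number of observations

# Multiply every value x by its frequency f, then sum the xf values
sum_xf = sum(x * f for x, f in zip(values, frequencies))

# Divide the sum by n
mean = sum_xf / n

print(n, sum_xf, mean)  # 10 125 12.5
```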
Extra activity
Calculate the mean number of cellphones in a household from the following ungrouped frequency table:
Number of cellphones (x) Frequency (f) xf
0 5
1 18
2 26
3 8
4 3
Total (n)
Activity 4.2
Calculate the mean number of hours of personal computer usage per week from the following grouped
frequency table:
Class Frequency (f) Midpoint (x) xf
[1.95, 3.95) 2
[3.95, 5.95) 5
[5.95, 7.95) 5
[7.95, 9.95) 3
[9.95, 11.95) 1
Total (n)
Characteristics of the arithmetic mean
1. It is the arithmetic average of all the quantitative measurements in the data set
2. Every numerical data set has only one mean
3. It is reliable because it reflects all the values in data set
4. It is sensitive to every value in the data set and can be greatly affected by the presence of even a single
extreme value (or outlier)
Note: An outlier is an unusually large or small observation in comparison with the rest of the values
5. It is useful for further inferential statistical procedures
6. It can be calculated using a pocket calculator with pre-programmed formulae
o Setup Down arrow 3:Stat 1:On
o Mode 2:Stat 1:(1-VAR)
o Enter x-values (raw data, discrete values or class midpoints from frequency table) under X
o Enter frequencies under FREQ
o STAT 4:Var 2: $\bar{x}$
4.1.2. Median
The median is the value that occupies the middle position of a data set arranged in a numerical order. This
means that there are an equal number of data values in the ordered distribution that are above it and below it.
Interpretation: at most 50% of observations are below the median value and at most 50% of
observations are above the median value
Ordered set:
Median position:
Median value:
Interpretation:
Numerical order: 275 296 299 322 323 332 333 337 347 350 353 357 358 393
Position median:
Median value:
Interpretation:
Grouped / frequency data
For ungrouped frequency data, we use the cumulative frequencies to find the median value (not in textbook).
Add cumulative frequencies to the table
Determine the median position = $\dfrac{n + 1}{2}$
Find the median value:
o If n is odd
The median is the first x-value for which the cumulative frequency is greater than or
equal to the median position
o If n is even
Round down the median position value and find the first x-value for which the
cumulative frequency is greater than or equal to the rounded position value
Round up the median position value and find the first x-value for which the cumulative
frequency is greater than or equal to the rounded position value
Take an average of the two x-values
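The cumulative-frequency procedure above can be sketched as a small function. The function name is hypothetical, the x-values are assumed to be listed in ascending order, and the data come from the frequency table in Activity 3.11.

```python
import math
from itertools import accumulate

def median_from_frequency_table(values, frequencies):
    """Median of ungrouped frequency data (values sorted ascending)."""
    cumulative = list(accumulate(frequencies))
    n = cumulative[-1]
    position = (n + 1) / 2  # median position

    def value_at(pos):
        # First x-value whose cumulative frequency is >= pos
        for x, cf in zip(values, cumulative):
            if cf >= pos:
                return x

    # For odd n the position is a whole number, so both lookups agree;
    # for even n the position is rounded down and up and the results averaged
    lower = value_at(math.floor(position))
    upper = value_at(math.ceil(position))
    return (lower + upper) / 2

values = [4, 6, 14, 15, 17, 18]
frequencies = [1, 2, 1, 4, 1, 1]
median = median_from_frequency_table(values, frequencies)
print(median)  # 15.0
```

Here n = 10, so the median position is 5.5; the 5th and 6th ordered observations are both 15.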
Extra activity
Calculate the median number of cellphones in a household from the following ungrouped frequency table:
Number of cellphones (x) Frequency (f) Cumulative frequency
0 5
1 18
2 26
3 8
4 3
Total (n) 60
With grouped frequency data we are unable to determine where the true middle value falls, but we can use a
formula or the ogive to estimate the median.
o $\text{Median} = L + \dfrac{\left(\frac{n}{2} - F\right) c}{f_m}$
L = lower boundary of median class
$f_m$ = frequency of median class
c = class width
F = cumulative frequency just before the median class
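As an illustration of the grouped median formula, the sketch below applies it to the acrylamide distribution from Activity 3.12. It assumes equal class widths and takes the median class to be the first class whose cumulative frequency reaches n/2; the function name grouped_median is hypothetical.

```python
from itertools import accumulate

def grouped_median(lower_boundaries, width, frequencies):
    """Estimate the median of grouped data: L + (n/2 - F) * c / f_m."""
    n = sum(frequencies)
    half = n / 2
    cumulative = list(accumulate(frequencies))
    for i, cf in enumerate(cumulative):
        if cf >= half:  # median class found
            L = lower_boundaries[i]                  # lower boundary of median class
            F = cumulative[i - 1] if i > 0 else 0    # cumulative frequency before it
            f_m = frequencies[i]                     # frequency of median class
            return L + (half - F) * width / f_m

lower_boundaries = [151, 187, 223, 259, 295, 331]
frequencies = [3, 5, 6, 6, 8, 2]
estimate = grouped_median(lower_boundaries, 36, frequencies)
print(estimate)  # 265.0
```

With n = 30 the median class is [259, 295), so L = 259, F = 14, f_m = 6 and c = 36, giving 259 + (15 - 14)(36)/6 = 265.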
Activity 4.4
Estimate the median number of hours of personal computer usage per week for a sample of 16 people.
Class Frequency (f) Cumulative frequency
[1.95, 3.95) 2
[3.95, 5.95) 5
[5.95, 7.95) 5
[7.95, 9.95) 3
[9.95, 11.95) 1
Total (n) 16
Median position:
Median value:
Using the ogive:
Find the median position $\dfrac{n}{2}$ on the y-axis
Draw a straight horizontal line from the y-axis to the ogive
Draw straight vertical line from the ogive to the x-axis
The corresponding value on the x-axis is the estimated median
Use the ogive of this data to estimate the median value and compare your answer to the estimate using the
formula (not in textbook)
[Figure: 'less-than' ogive of personal computer usage. x-axis: Hours; y-axis: Cumulative frequency. Plotted points: (1.95, 0), (3.95, 2), (5.95, 7), (7.95, 12), (9.95, 15), (11.95, 16)]
Activity 4.5
A telephone company conducted a study on the length of calls. Determine the modal length of calls for a
sample of 10 calls (in minutes): 1.4 15.5 2.1 8.0 15.5 1.4 17.7 7.2 9.1 15.5
Ordered array
Mode
Modality
Interpretation
Activity 4.6
Estimate the modal number of hours of personal computer usage per week from the following grouped
frequency table data:
Class Frequency (f)
[1.95, 3.95) 2
[3.95, 5.95) 5
[5.95, 7.95) 9
[7.95, 9.95) 3
[9.95, 11.95) 1
Total (n) 20
Modal class:
Modal value:
Using the histogram
Identify the longest bar on the histogram as the modal bar
Draw a line from the top right corner of the modal bar to the top right corner of the bar to its immediate
left
Draw a line from the top left corner of the modal bar to the top left corner of the bar to its immediate
right
Draw a line parallel to the y-axis through the intersection point of the previous two lines down to the
x-axis
The value on the x-axis approximates the modal value
Use the histogram of this data to estimate the modal value and compare your answer to the estimate using the
formula (not in textbook)
[Figure: histogram of personal computer usage. x-axis: Hours (classes [1.95, 3.95) to [9.95, 11.95)); y-axis: Frequency (0 to 10)]
Activity 4.7
The National Housing Department conducted a survey to estimate the average number of livable square meters
for low cost housing. The reported mean was 24.5 square meters and the median 22.2 square meters. Which
measure of central tendency is more appropriate? Explain your answer.
4.2. MEASURES OF DISPERSION
An average summarises a set of data in just one number. Two sets of data can have the same mean and yet be
very different if one is more spread out than the other. To describe this difference quantitatively, we use a
measure of dispersion / variability / spread. This is a descriptive measure that indicates the amount of variation
in a data set.
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
$\bar{x}$ = sample mean
$x - \bar{x}$ = deviation from the mean
$(x - \bar{x})^2$ = squared deviation from the mean
$\sum (x - \bar{x})^2$ = sum of squared deviations from the mean
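The sample standard deviation can be computed step by step as shown in the sketch below. It uses the five values from Example 3.6 purely as illustrative data and cross-checks the result against Python's statistics module, which also divides by n - 1.

```python
import math
import statistics

data = [4, 5, 10, 50, 80]  # values from Example 3.6, used for illustration

n = len(data)
data_range = max(data) - min(data)   # range = largest value - smallest value
mean = sum(data) / n                 # sample mean

# Squared deviation from the mean for every observation
squared_deviations = [(x - mean) ** 2 for x in data]

# Variance: sum of squared deviations divided by n - 1
variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(data_range, round(mean, 2), round(variance, 2), round(std_dev, 2))

# Cross-check with the standard library
print(math.isclose(std_dev, statistics.stdev(data)))
```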
Determine the range, standard deviation and variance of the traveling time for the riders and interpret the result
in the context of the data.
Range =
$x$   $x - \bar{x}$   $(x - \bar{x})^2$
22
29
27
30
12
22
31
15
26
16
48
23
Sum = Sum = Sum =
Mean =
Standard deviation =
Variance =
Grouped data
For both ungrouped and grouped frequency data we can calculate /estimate the standard deviation using the
following formula:
$s = \sqrt{\dfrac{\sum f (x - \bar{x})^2}{n - 1}}$
Need a frequency table with the columns: values / midpoint and frequencies
Compute the mean $\bar{x}$
Compute the squared deviation between each value / midpoint and the mean
Multiply this by the corresponding frequency
Add the results
Divide by n – 1
Take the square root to get the standard deviation
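These steps can be sketched as a function for frequency table data. The function name is hypothetical and the data are the ungrouped frequency table from Activity 3.11; for grouped data the class midpoints would be passed as the values, which gives an estimate rather than an exact result.

```python
import math
import statistics

def frequency_std_dev(values, frequencies):
    """Return (mean, variance, standard deviation) for frequency table data."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    # Squared deviation from the mean, multiplied by the frequency, then summed
    weighted_ss = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies))
    variance = weighted_ss / (n - 1)          # divide by n - 1
    return mean, variance, math.sqrt(variance)

values = [4, 6, 14, 15, 17, 18]
frequencies = [1, 2, 1, 4, 1, 1]
mean, variance, std_dev = frequency_std_dev(values, frequencies)
print(mean, round(variance, 4), round(std_dev, 4))

# Cross-check: expand the table back into raw data
raw = [x for x, f in zip(values, frequencies) for _ in range(f)]
print(math.isclose(std_dev, statistics.stdev(raw)))
```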
Range =
Mean =
Standard deviation =
Variance =
Activity 4.11
Estimate the range, standard deviation and variance of number of hours of personal computer usage per week.
Range =
Mean =
[1.95, 3.95) 2
[3.95, 5.95) 5
[5.95, 7.95) 9
[7.95, 9.95) 3
[9.95, 11.95) 1
Total (n) 20
Standard deviation =
Variance =
This is a unit-free number because the standard deviation and mean are measured using the same units. This
measure is useful to compare two or more sets of data with different means, measurement units or sample
sizes. The higher the result the more variability there is in a set of data.
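A minimal sketch of this comparison follows, assuming the measure described is the coefficient of variation, CV = (s / mean) × 100. The two data sets and their statistics are hypothetical and serve only to show how the results are compared.

```python
def coefficient_of_variation(mean, std_dev):
    """Standard deviation expressed as a percentage of the mean (unit-free)."""
    return 100 * std_dev / mean

# Hypothetical summary statistics for two data sets with different means
cv_a = coefficient_of_variation(mean=50.0, std_dev=5.0)
cv_b = coefficient_of_variation(mean=200.0, std_dev=10.0)

print(f"CV A = {cv_a:.1f}%, CV B = {cv_b:.1f}%")

# The higher the result, the more relative variability in the data set
more_variable = "A" if cv_a > cv_b else "B"
print(more_variable)
```

In this illustration set B has the larger standard deviation, but set A is more variable relative to its mean.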
Activity 4.12
Two growers of grapefruit have obtained the following statistics regarding the mass of their current crops.
Grower A: $\bar{x}$ = 300 g with s = 20 g
Grower B: $\bar{x}$ = 280 g with s = 40 g
[Figure: distribution curve with the mean, median and mode marked]
Skewed distributions
A distribution is skewed if the tail of the graph extends more to one side than the other. In a skewed distribution,
the mode stays at the peak of the distribution because outliers do not influence the mode at all. The influence
of the outliers is the highest on the arithmetic mean because the mean is affected by all values in the data set,
including the extreme ones, and tends to be located toward the tail of the skewed distribution. The median,
being dependent on the number of values in the data set rather than on the size of those values, is less sensitive
than the mean, since only the middle measurements are used for its calculation and is located somewhere
between the mode and the mean.
[Figure: skewed distribution curve with the mean, median and mode marked]
Pearson’s second coefficient of skewness
Arithmetic measure of skewness
Indicates which pattern is in the data
Calculated using the formula:
$SK = \dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
SK is measured on a scale from -3 to +3
If SK = 0: symmetrical distribution
If SK > 0 but less than +3: positively skewed
If SK < 0 but greater than -3: negatively skewed
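Pearson's second coefficient of skewness and its interpretation can be sketched as follows. The function names and the summary statistics (mean 100, median 95, standard deviation 15) are hypothetical and chosen only for illustration.

```python
def pearson_second_skewness(mean, median, std_dev):
    """SK = 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

def interpret_skewness(sk):
    """Translate the sign of SK into the pattern described above."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"

# Hypothetical summary statistics
sk = pearson_second_skewness(mean=100, median=95, std_dev=15)
print(sk, interpret_skewness(sk))  # 1.0 positively skewed
```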
Activity 4.13
In a sample showing the sodium contents (in milligrams per kilogram) of chocolate pudding made from instant
mix, the mean is 2965.20, the median is 2946 and the standard deviation is 543.52. Calculate a coefficient
of skewness for this distribution and interpret your answer.
Consider a data set with a mean of 100 and a standard deviation of 15:
Determine which values are within 1 standard deviation of the mean
o The mean minus 1 standard deviation = 100 – 15 = 85
This means that “85” is 1 standard deviation below the mean
o The mean plus 1 standard deviation = 100 + 15 = 115
This means that “115” is 1 standard deviation above the mean
o All observations that fall between “85” and “115” are within 1 standard deviation of the mean
Determine how many standard deviations the values 70 and 130 are above / below the mean
o Value 70: $z = \dfrac{70 - 100}{15} = -2$
The value “70” is 2 standard deviations below the mean
o Value 130: $z = \dfrac{130 - 100}{15} = 2$
The value “130” is 2 standard deviations above the mean
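The same calculation can be expressed as a short function, using the mean of 100 and standard deviation of 15 from the worked example. The function name z_score is chosen for this sketch.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

mean, std_dev = 100, 15

z_70 = z_score(70, mean, std_dev)
z_130 = z_score(130, mean, std_dev)
print(z_70, z_130)  # -2.0 2.0

# Interval within 1 standard deviation of the mean
interval = (mean - std_dev, mean + std_dev)
print(interval)  # (85, 115)
```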
4. The following two rules can be applied, depending on the shape of the distribution:
If the distribution is symmetrical, use the Empirical Rule to make a statement about the
proportion of data values that fall into an interval
o Approximately 68% of observations fall within 1 standard deviation from the mean
o Approximately 95% of observations fall within 2 standard deviations from the mean
o Approximately 99.7% of observations fall within 3 standard deviations from the
mean
A more general interpretation of the standard deviation is derived from Chebyshev’s
theorem, which applies to distributions of all shapes
o At least $\left(1 - \dfrac{1}{z^2}\right) \times 100\%$ of observations lie within z standard deviations from the
mean (for z > 1)
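The two rules can be compared numerically. The sketch below evaluates the Chebyshev lower bound for z = 2 and z = 3 and lists the Empirical Rule percentages next to it; the function name is hypothetical, and the bound is only informative for z greater than 1.

```python
def chebyshev_lower_bound(z):
    """Minimum percentage of observations within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem is only informative for z > 1")
    return (1 - 1 / z ** 2) * 100

# Empirical Rule percentages (symmetrical, bell-shaped distributions only)
empirical_rule = {1: 68.0, 2: 95.0, 3: 99.7}

for z in (2, 3):
    bound = chebyshev_lower_bound(z)
    print(f"z = {z}: at least {bound:.1f}% (Chebyshev), "
          f"about {empirical_rule[z]}% (Empirical Rule)")
```

Chebyshev's bound is lower because it must hold for distributions of every shape.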
Activity 4.14
Consider the time (in minutes) taken to travel to work for a sample of 25 people from Gauteng, with:
Mean = 31.7 minutes; Median = 30.88 minutes; Standard deviation = 7.94 minutes
Activity 4.15
The number of cars entering a parking area during a sample of ten-minute intervals yielded the following
summary statistics:
Mean = 19.6; Median = 22.5; Standard deviation = 8.72
Determine the range of values for 2 standard deviations around the mean
Use Chebyshev’s theorem to determine the proportion of the distribution that falls between 6.52 and 32.68
UNIT 6: SUMMARISING BIVARIATE DATA: SIMPLE REGRESSION AND
CORRELATION ANALYSIS
Outcomes
This unit deals with methods to summarize data consisting of observations on two quantitative variables
(bivariate data) presented as ordered pairs. The purpose is to understand if there is a relationship between
two variables and what you can do if a relationship exists.
Purpose
Regression and correlation analysis are statistical tools used to study the relationship between two variables
of which one is dependent and the other independent. It is used to determine:
Whether there is a relationship between the variables,
How good that relationship is
How the relationship can be used to make estimates
Examples of relationships
Sales and earnings
Cost and number of items produced
Shoe size and intelligence
Effort and results
Steps
1. Collect pairs of data (x, y). The data are paired in a way that matches each value from one data set with
a corresponding value from a second data set
2. Select which variable is the dependent (y) variable and which is the independent (x) variable. The label
y goes to the variable which we want to predict. The other variable is then labeled as x
3. Arrange the data in two columns, x and y
4. Draw a set of axes
5. The horizontal axis represents the x-variable and is scaled so that any x value can be easily located
6. The vertical axis represents the y-variable and is scaled so that any y value can be easily located
7. Each pair of observations (x, y) is plotted as a point. That is where a vertical line from the value on the
x-axis meets a horizontal line from the value on the y-axis
8. The points are not connected
9. Scatter plots can take on the following patterns:
The plot can show no relationship, because no pattern can be identified
The plot can show a positive relationship because the dots start at the bottom left and move
upwards to the top right. Although the data points do not fall exactly on a line, they appear to
cluster about a line. A positive relationship means that if the x-variable increases, the y-variable
will also increase
The plot can show a negative relationship because the dots start at the top left and move
downwards to the bottom right. A negative relationship means that if the x-variable increases,
the y-variable will decrease
If all the points fall exactly along a straight line, in a negative or positive direction, we say the
relationship is perfect
Activity 6.2
During the baking of a certain type of bread roll on very low heat, each bread roll goes through a series of heat
processes. The length of time spent under this heat treatment is related to the lifespan of the bread rolls. A
sample of 8 bread rolls that underwent different baking times were selected and the life span (in hours) of each
was recorded. Draw a scatter plot and interpret.
[Blank scatter plot grid: Life span (vertical axis, 0 to 25) against Length of time (horizontal axis, 0 to 20)]
6.3. CORRELATION ANALYSIS (r)
A correlation exists between two variables when one of them is related to, or can be influenced by, the other in
some way. The linear correlation coefficient is a numerical measure that describes the strength and
direction of the relationship between two variables. A linear relationship means that when graphed, the points
approximate a straight line pattern. We use an equation, known as Pearson’s Product Moment Correlation
Coefficient, to measure this strength. This correlation coefficient is represented by the letter r. It is calculated
using the formula:
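$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$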
We need the sample size, the sums of the values, the sums of squares and the sum of the products
If raw data are provided, you can enter the data in your calculator and find r
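As a check on a calculator result, Pearson's formula can be evaluated directly. The following is a minimal sketch using the bread roll data from Activity 6.3; the function name `pearson_r` is an illustrative choice, not something defined in the notes.

```python
from math import sqrt

# Bread roll data from Activity 6.2/6.3 in these notes
x = [18, 13, 18, 15, 10, 12, 8, 4]    # length of time under heat treatment
y = [23, 20, 18, 16, 14, 11, 10, 7]   # life span (hours)

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

r = pearson_r(x, y)
print(f"r = {r:.3f}, r squared = {r**2:.3f}")
```

Rounded to three decimals, the result agrees with the Multiple R and R Square values in the regression output listed in Section 6.4.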
Activity 6.3
Calculate the correlation coefficient and coefficient of determination for the bread roll data (Activity 6.2).
Interpret your answers.
Observation   Length of time (x)   Life span (y)   x²   y²   xy
1 18 23
2 13 20 169 400 260
3 18 18 324 324 324
4 15 16 225 256 240
5 10 14
6 12 11 144 121 132
7 8 10 64 100 80
8 4 7 16 49 28
Sum
6.4. REGRESSION ANALYSIS
If the correlation coefficient indicates that a relation exists between the two variables, the next step is
to determine the equation of the straight line that best describes the pattern of the relationship between
the two variables
o This equation, known as the regression equation, can be used to predict a dependent variable
(y) if the independent variable (x) is known
‘Best’ refers to how close the predictions of y are to the actual values of y
A linear relationship between the two variables means that the equation will result in a straight line if
plotted on a graph
$$\hat{y} = a + bx$$
Where
ŷ = estimated / predicted value of y for a given value of x
a = intercept on the y-axis
b = slope (the average / predicted change in y for every 1-unit change in x)
$$a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$
For the bread roll data: Σx = 98, Σx² = 1366, Σy = 119, Σy² = 1975, Σxy = 1618
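The regression coefficients can be computed from the sums listed above. This is a minimal sketch: the slope formula b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) is the standard least-squares expression and is assumed here, since it does not appear in this section of the notes.

```python
# Sums for the bread roll data, as listed in these notes
n = 8
sum_x, sum_y = 98, 119
sum_x2, sum_xy = 1366, 1618

# Slope: standard least-squares formula (assumed; not shown in this section)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
# Intercept: a = mean(y) - b * mean(x)
a = sum_y / n - b * sum_x / n

def predict(x_value):
    """Predicted y for a given x using y-hat = a + b*x."""
    return a + b * x_value

print(f"y-hat = {a:.3f} + {b:.3f}x")
print(f"Predicted life span at x = 16: {predict(16):.2f} hours")
```

Under these assumptions the slope is about 0.968 and the intercept about 3.014, so a value of x = 16 gives a predicted life span of roughly 18.5 hours.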
[Blank scatter plot grid: Life span (vertical axis, 0 to 25) against Length of time (horizontal axis, 0 to 20)]
If a bread roll spends 16 hours under heat treatment, how long can it be expected to remain fresh?
Interpreting regression output (not in textbook)
The following tables list the regression and correlation analysis output for the bread roll data.
SUMMARY OUTPUT
Regression Statistics
Multiple R            0.870
R Square              0.757
Adjusted R Square     0.717
Standard Error        2.878
Observations          8
ANOVA
              df    SS       MS       F       Significance F
Regression    1     155.2    155.2    18.7    0.005
Residual      6     49.7     8.3
Total         7     204.9
Steps:
o Rank x and y values separately from smallest to largest (find the average rank for ties)
o Calculate the difference between ranks of the two variables (d)
o Square the differences and compute the sum
o Substitute into the equation and interpret
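The steps above can be sketched in code. The equation itself is missing from this section, so the sketch assumes the usual simplified form r_s = 1 − 6Σd² / (n(n² − 1)); the helper names are illustrative. With tied ranks, as in this data, the simplified form is an approximation.

```python
def average_ranks(values):
    """Rank from smallest to largest; tied values share the average of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1          # first 1-based position of v
        count = ordered.count(v)              # number of tied values
        rank_of[v] = first + (count - 1) / 2  # average of the tied positions
    return [rank_of[v] for v in values]

def spearman_rs(x, y):
    """Spearman's rank correlation via the sum of squared rank differences."""
    n = len(x)
    d2 = [(rx - ry) ** 2 for rx, ry in zip(average_ranks(x), average_ranks(y))]
    return 1 - 6 * sum(d2) / (n * (n**2 - 1))

# Bread roll data from these notes
x = [18, 13, 18, 15, 10, 12, 8, 4]
y = [23, 20, 18, 16, 14, 11, 10, 7]

rs = spearman_rs(x, y)
print(f"Ranks of x: {average_ranks(x)}")
print(f"r_s = {rs:.3f}")
```

The two x-values of 18 share rank 7.5, which is why some rank differences in the table are not whole numbers.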
Activity 6.5
Determine Spearman’s rank correlation coefficient for the bread roll data and interpret your answer.
Observation   Length of time (x)   Life span (y)   Rank (x)   Rank (y)   d   d²
1 18 23
2 13 20 5 7
3 18 18 6 1.5 2.25
4 15 16 6 5 1 1
5 10 14 3 4 -1 1
6 12 11 4 3 1 1
7 8 10 2 0 0
8 4 7 1 1 0 0
Sum
UNIT 8: BASIC PROBABILITY CONCEPTS
The theory of probability grew out of the study of various games of chance using coins, dice, cards, lottery and
gambling machines. Since then probability theory has been developed to determine uncertainties in our everyday
lives as well.
We make decisions in the face of uncertainty. We know that it is possible that things could turn out in different
ways but we simply do not know how probable each possibility is. Our need to cope with this risk or the ‘chance
that it will happen’ leads us to the use of probability theory.
Inferential statistics involves using statistics obtained from a sample to make estimates and decisions concerning
the entire population. We can never be certain that our decisions are correct, but to assess how good they will be
we need to know how to measure ‘chance’ and ‘probability’. The science of measuring ‘uncertainty’ is called
probability.
Probability describes the relative possibility (chance or likelihood) that an event will occur.
8.1. LANGUAGE OF PROBABILITY
An experiment is a trial or action that generates the uncertain outcomes to which we will assign
probabilities.
A particular result of an experiment is an outcome.
A sample space of a random experiment is a list of all the possible outcomes of the random experiment.
The individual outcomes in a sample space are called simple events. An event is a collection of one or
more outcomes of an experiment.
Activity 8.1
The type of transmission - automatic (A) or manual (M) - is recorded for each of the next two cars purchased
from a certain dealer.
1. What is the random experiment?
$$P(E) = \frac{\text{number of successes}}{\text{total number of outcomes}}$$
Activity 8.2
In drawing a card from a deck of 52 cards, what is the probability that it will be an ace?
As you increase the number of times an experiment is repeated, the empirical probability of an event approaches
the classical probability of the event.
NOTE: Chance behaviour is unpredictable over the short run but has a regular and predictable pattern in the long
run.
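The long-run behaviour described in the note can be illustrated with a small simulation. This is a sketch under stated assumptions: cards are drawn with replacement, the seed is fixed only so that the run is repeatable, and the function name is hypothetical.

```python
import random

def empirical_probability_of_ace(draws, seed=1):
    """Draw one card (with replacement) `draws` times and return the share of aces."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    deck = ["ace"] * 4 + ["other"] * 48  # 52 cards, 4 of them aces
    aces = sum(rng.choice(deck) == "ace" for _ in range(draws))
    return aces / draws

classical = 4 / 52
for draws in (100, 10_000, 100_000):
    estimate = empirical_probability_of_ace(draws)
    print(f"{draws:>7} draws: empirical = {estimate:.4f}, classical = {classical:.4f}")
```

With few draws the empirical value can be far from 4/52; with many draws it settles close to it.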
Activity 8.3
In a survey, a sample of 100 students were asked if they think that cloning of humans should be allowed. Ninety-
two said it should not be allowed, five said it should be allowed and three had no opinion. Calculate the
probabilities for each event in the survey.
Activity 8.4
We all know that fruit is good for us and that we don’t eat enough. In a recent study done among a random sample
of 75 teenage boys, the following information was collected:
1. What is the probability of a teenage boy eating 3 fruit servings per day?
Example 8.3
Given a patient’s health and extent of injuries, a doctor may feel that the patient has a 90% chance of full recovery.
Additional example
What is the probability that the traffic lights will work at the intersection of University Road and Kingsway Road?
2. The sum of the probabilities for all the possible outcomes (or events) of an experiment must be equal to one:
$$\sum P(E) = 1$$
3. The complement rule: The complement of an event E is the event that E does not occur. That includes all the
outcomes in the sample space that do not belong to event E. The sum of the probabilities of all events in the
sample space is equal to 1. If the probability of event E occurring is P(E) and the probability that event E does
not occur is P(Ē), then:
$$P(E) + P(\bar{E}) = 1$$
$$P(\bar{E}) = 1 - P(E)$$
Activity 8.5
The probability that a typist will make at most five mistakes is 0.64. What is the probability that she will make
more than five mistakes?
Activity 8.6
Consider the example from activity 8.4:
1. What is the probability of a teenage boy eating 3 fruit servings per day?
2. What is the probability of a teenage boy not eating 3 fruits per day?
3. What is the probability that teenage boys eat at least 1 fruit per day?
8.4. FORMING NEW EVENTS
Once some events have been specified, there are several useful ways of manipulating them to create new events
known as compound events.
Additional example
Consider an experiment of rolling two fair 4-sided dice (one red and one blue), where the sides are numbered as 1,
2, 3 and 4:
Sample space:
Complement rule:
Union:
Intersection:
Note: Section 8.4.3 (display events graphically) is discussed before Section 8.5.1 (addition rules).
8.5. PROBABILITY RULES FOR COMPOUND EVENTS
We use various rules of probability to compute the probabilities of the more complex, related events.
1. If an event (E) is made up by two (or more) simple events (A and B), the probability P(E) can be formed either
as:
The union of two or more events – all the outcomes that make up the two events
The intersection of two events – the outcomes that fulfil the conditions for both events
2. Two or more events are mutually exclusive if the occurrence of one event means that none of the other events
can occur at the same time. An outcome can belong to event A or to event B but not to both. For example, if
you flip a coin, you can either have Heads or Tails. Both can’t happen at the same time!
3. Two events are independent if the occurrence of one is in no way affected by the occurrence of the other; that
is, they are unrelated. If you flip two coins and you obtained a Head on the one, it will have no influence on
the outcome of the second flip.
4. If there is a particular relationship between events such that the occurrence of one event affects the occurrence
of the second event, the events are dependent and the probability attached to the occurrence of such events is
known as conditional probability.
Single event
Union and intersection of two events
To avoid double counting the probability of the outcomes that fulfil the conditions for both events, i.e. the
intersection of the two events, P(A and B) is subtracted from the sum of the probability of A and B
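The double-counting argument can be verified by listing a sample space. The sketch below reuses the two 4-sided dice from Section 8.4; the events A and B are hypothetical, chosen only to illustrate P(A or B) = P(A) + P(B) − P(A and B).

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair 4-sided dice (red, blue): 16 equally likely outcomes
sample_space = list(product(range(1, 5), repeat=2))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favourable, len(sample_space))

# Hypothetical events chosen for illustration only
def event_a(outcome):
    return outcome[0] == 4          # red die shows 4

def event_b(outcome):
    return sum(outcome) >= 6        # total is at least 6

p_union_direct = prob(lambda o: event_a(o) or event_b(o))
p_intersection = prob(lambda o: event_a(o) and event_b(o))
p_union_rule = prob(event_a) + prob(event_b) - p_intersection

print(p_union_direct, p_union_rule)
```

Counting the union directly and applying the addition rule give the same probability.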
Activity 8.8
There are two secretaries in the office. The probability that the one secretary will be absent on any given day is
0.08 and the probability that the other one will be absent on any given day is 0.07. The probability that both will
be absent is 0.02. What is the probability that on a given day:
1. either or both secretaries will be absent
Two events are independent if the occurrence of one does not change the probability of the next one
occurring
Activity 8.9
The quality control manager of a company questions the reliability of the two quality control checks in the food
processor manufacturing process. A worker who manually checks the processors performs one check and a
computer monitor performs a second check. The manager knows that 5% of the time the worker is apt to miss a
defective processor and that 2% of the time the computer will malfunction and fail to detect defective processors.
What is the probability that a worker will miss a defective processor and the computer will malfunction, causing
a defective processor to stop manufacturing?
It therefore follows that the conditional probability can be expressed as the ratio of the intersection probability
to the marginal probability of the conditioning event
o $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$
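A contingency table makes the ratio concrete. The sketch below uses the counts from the IQ / gene table given later in this unit; taking A as "gene present" and B as "high IQ" is an illustrative choice, not one of the activity questions.

```python
from fractions import Fraction

# Counts from the IQ / gene table later in this unit (102 children)
counts = {
    ("High IQ", "Gene present"): 33,
    ("High IQ", "Gene not present"): 19,
    ("Normal IQ", "Gene present"): 39,
    ("Normal IQ", "Gene not present"): 11,
}
total = sum(counts.values())

# A = gene present (conditioning event), B = high IQ
gene_present = sum(v for (iq, gene), v in counts.items() if gene == "Gene present")
p_a = Fraction(gene_present, total)                                # marginal P(A)
p_a_and_b = Fraction(counts[("High IQ", "Gene present")], total)   # joint P(A and B)
p_b_given_a = p_a_and_b / p_a                                      # conditional P(B | A)

print(f"P(A) = {p_a}, P(A and B) = {p_a_and_b}, P(B | A) = {p_b_given_a}")
```

The same answer follows from reading the "Gene present" column on its own: 33 of the 72 children with the gene have a high IQ.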
Activity 8.11
A medical researcher has discovered a new test for tuberculosis. If the test indicates a person has tuberculosis, the
test is positive. Experimentation has shown that the probability of a positive test is 0.82, given that a person has
tuberculosis. The probability is 0.04 that the test registers positive and that the person does not have tuberculosis.
Assume that in the general population the probability that a person has tuberculosis is 0.20.
Activity 8.11
The following table shows the results of a study on 102 children in which a child’s IQ was examined and the
presence of a specific gene was found in the child:
Gene present Gene not present Total
High IQ 33 19 52
Normal IQ 39 11 50
Total 72 30 102
3. The gene
4. A normal IQ
Find P A B using:
A Venn diagram
A probability rule
A contingency table
8.5.4. Tree diagrams
Another useful method of calculating probabilities if there are several stages or trials in the experiment is to use
a probability tree. All the possible outcomes of the experiment are represented by the branches of the tree.
Steps
Plot a dot on the left to represent the root of the tree.
Construct a column for each trial.
Start on the left and determine the possibilities for the first trial which forms the branches of the tree in
the first column.
Branches grow from each of the original branches, representing the possibilities for the second trial. The
second stage is based on the choice made in the first stage.
The branches of the tree are weighted by probabilities; therefore show the probabilities for each event on
the branches.
List all the outcomes together with the joint probability for each combined outcome.
Add the probabilities. Because the tree represents the sample space of the experiment, the sum of the
probabilities should equal 1.
Example 8.15
A bag contains 5 red balls and 3 black balls. Two balls are drawn from the bag (one after the other). Construct a
probability tree to list all the possible outcomes together with each outcome’s probability.
NOTE: The first stage of the tree shows the marginal probabilities and the second stage shows the conditional
probabilities
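The tree for Example 8.15 can be built programmatically. This is a minimal sketch that assumes the first ball is not replaced before the second draw, which is what the conditional second-stage probabilities in the note imply; the function name is hypothetical.

```python
from fractions import Fraction

def two_draw_tree(red, black):
    """Joint probability of each path when two balls are drawn without replacement."""
    total = red + black
    first_stage = {"R": Fraction(red, total), "B": Fraction(black, total)}
    paths = {}
    for colour1, p1 in first_stage.items():
        # Second-stage (conditional) probabilities depend on the first draw
        red_left = red - (colour1 == "R")
        black_left = black - (colour1 == "B")
        second_stage = {
            "R": Fraction(red_left, total - 1),
            "B": Fraction(black_left, total - 1),
        }
        for colour2, p2 in second_stage.items():
            paths[colour1 + colour2] = p1 * p2   # multiply along the branches
    return paths

paths = two_draw_tree(red=5, black=3)
for outcome, probability in paths.items():
    print(outcome, probability)
print("Sum:", sum(paths.values()))
```

The four joint probabilities add up to 1, as the last step of the procedure requires.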
Activity 8.12
Approximately 10% of people are left-handed. Two people are selected at random. Construct a tree diagram.
NOTE: Both the first and second stages of the tree show marginal probabilities since the events are independent
If events A and B are independent: $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A)$
Additional example
A factory has 2 machines, A and B. Machine A is used 70% of the time. The probability that machine A produces
a defective item is 0.1 and the probability that machine B produces a defective item is 0.2. Draw a tree diagram
to represent the given probabilities:
Use the information from the tree diagram to construct a contingency table
An item is inspected and it is defective. What is the probability that it was produced by machine A?
8.6. COUNTING THE POSSIBILITIES
Probability is based on the number of successes and the number of possible outcomes, which make up the numerator
and denominator. A collection of rules for counting the number of outcomes that can occur for a particular
experiment can be used.
Example 8.16
A computer password is to be made up consisting of four alphabetical characters. How many different computer
passwords can be designed if repetition of letters is allowed?
Activity 8.13
If a restaurant menu had a choice of 3 salads, 6 main dishes and 6 desserts, how many different possible dinners
can be ordered?
Extra example
How many Gauteng number plates are possible?
8.6.2. Permutation rule
The permutation rule is used to determine the number of ways to arrange n distinct objects taking x at a time in a
specific order, i.e. select a subset of x items out of n and arrange them in order.
$${}_{n}P_{x} = \frac{n!}{(n-x)!}$$
Read as “n-permutation-x”
Note: n! (pronounced n factorial) is the product of the whole numbers from n downwards to 1.
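The permutation rule can be evaluated with factorials, as in this minimal sketch. The figures (3 roles filled from 10 people) are those of Example 8.17; `permutations_count` is an illustrative name, and `math.perm` is the Python standard library equivalent (Python 3.8 or later).

```python
from math import factorial, perm

def permutations_count(n, x):
    """Ordered selections of x items from n distinct items: n! / (n - x)!"""
    return factorial(n) // factorial(n - x)

print(permutations_count(10, 3))   # via the formula
print(perm(10, 3))                 # via the standard library
```

Both lines print the same count, 10 × 9 × 8.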
Example 8.17
Select a group of 3 people from 10 to fill the roles of chairperson, secretary and treasurer in a committee. The
number of possible ways to fill these roles is:
Activity 8.15
Assume there are five carriages that need to be unloaded at a dock but there is only enough time left in the day to
unload three of them. Since the goods in each of the carriages are needed by customers, the order of unloading is
important. In how many ways can three of the five carriages be unloaded in first, second and third order?
Extra example
How many pin numbers consisting of 4 digits can you create if all the digits must be different?
8.6.3. Combination rule
The combination rule is used to determine the number of ways to select x objects from a larger set of n objects
without regard to the order in which the objects are selected.
$${}_{n}C_{x} = \binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$
Read as “n-combination-x”
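The combination rule follows the same pattern. The sketch below uses the figures of Example 8.18 (a team of 5 from 7 climbers); `combinations_count` is an illustrative name, and `math.comb` is the standard library equivalent (Python 3.8 or later).

```python
from math import comb, factorial

def combinations_count(n, x):
    """Unordered selections of x items from n: n! / (x! * (n - x)!)"""
    return factorial(n) // (factorial(x) * factorial(n - x))

print(combinations_count(7, 5))   # via the formula
print(comb(7, 5))                 # via the standard library
```

Because order does not matter, the count is the permutation count divided by x!, the number of ways to order each selected group.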
Example 8.18
A group of 7 mountain climbers wants to form a mountain climbing team of 5. How many different teams could
be formed?
Activity 8.15
You are given a list of 10 books and you are to read 4 of them. How many possible combinations of 4 books are
available from the list of ten?
Extra example
A lottery game consists of selecting 3 numbers from 1 to 10. How many possible lottery numbers can be selected?
UNIT 9: PROBABILITY DISTRIBUTIONS
Outcomes
After completion of this unit, you will be able to:
define a probability distribution
distinguish between discrete and continuous random variables
find the probability for a binomial investigation
find the probability for a Poisson investigation
find the probabilities for a normally distributed variable by transforming it into a standard random variable
Definitions
In investigations involving chance it is difficult to predict the exact value of a variable and so it is known
as a random variable
A probability distribution is a listing of all the possible outcomes of a random variable together with the
probability of its occurrence
o The sum of the probabilities of all possible outcomes is 1
Probability distributions are classified as either discrete or continuous, depending on the type of random
variable:
o A random variable is discrete if it can assume a countable number of possible values such as 0, 1,
2, 3, etc.
o A continuous random variable has an infinite number of possible values and can take on any value
over a given interval of values
The mean of a probability distribution is referred to as its expected value. For a discrete probability distribution
the mean is denoted by:
$$E(X) = \sum x \, P(x)$$
where $P(x) = P(X = x)$
Example 9.1 (empirical)
A survey asked a sample of 200 people how many times they donate blood each year. The results are summarised
as a probability distribution (i.e. the empirical distribution is used as an estimate of the population distribution).
The random variable (X) represents the number of donations for one year.
x f P(x) E(x)
0 60 0.30 0.00
1 50 0.25 0.25
2 50 0.25 0.50
3 20 0.10 0.30
4 10 0.05 0.20
5 6 0.03 0.15
6 4 0.02 0.12
Total 200 1.00 1.52
Interpretation
30% of the people did not donate blood
25% of the people donated once
The expected number of times someone will donate blood in a year is 1.52
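The table in Example 9.1 can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative.

```python
# Example 9.1: number of blood donations per year (x) and frequency (f)
frequencies = {0: 60, 1: 50, 2: 50, 3: 20, 4: 10, 5: 6, 6: 4}
n = sum(frequencies.values())

# Empirical probability distribution: P(x) = f / n
distribution = {x: f / n for x, f in frequencies.items()}

# Expected value: sum of x * P(x) over all outcomes
expected_value = sum(x * p for x, p in distribution.items())

print(f"Sum of P(x) = {sum(distribution.values()):.2f}")
print(f"E(X) = {expected_value:.2f}")
```

The last column of the table holds the individual products x·P(x); their total is the expected value.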
Sample space
Probability distribution
x P(x) E(x)
Total
Expected value
P X 1
P X 2
Steps
Find the probability of a success in each trial
Find the number of trials (n)
Decide on the number of successes (x) for which you want to determine the probability
Substitute the values into the formula
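The formula itself is missing from this section; the sketch below assumes the standard binomial probability P(X = x) = nCx · p^x · (1 − p)^(n − x). It borrows n = 7 and p = 0.30 from the shoe store in Activity 9.1, while the choice x = 2 is hypothetical and only illustrates the steps.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# n and p from the shoe store activity; x = 2 is a hypothetical illustration
n, p = 7, 0.30
print(f"P(X = 2) = {binomial_pmf(2, n, p):.4f}")

# The probabilities over all possible x should add up to 1
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
print(f"Sum over all x = {total:.4f}")
```

Probabilities such as "at most 2" or "at least 1" are obtained by adding the relevant individual terms.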
Activity 9.1
A shoe store’s records show that 30% of customers making a purchase use a credit card to make payment. This
morning 7 customers purchased shoes from the store.
Activity 9.2
A tollgate operator has observed that cars arrived randomly at an average of 360 cars per hour.
Characteristics
The graph for a continuous random variable X is a smooth curve
The curve is unimodal (single mode)
The distribution is symmetrical around its mean and bell-shaped in appearance
The left and right hand tails of the distribution extend indefinitely
The x-axis represents all the possible values of the X variable
Two parameters describe the normal distribution: the mean (µ) describes where the distribution is centred
and the standard deviation (σ) describes how much the curve spreads out around the centre
o Notation: X ~ N(µ, σ²)
Probabilities for continuous variables correspond to areas under the normal curve
The total area under the normal curve is equal to 1 or 100% (The sum of all the probabilities of a
probability distribution =1). This means that the areas to the left and the right of the mean will each
comprise 50% of the total area
Additional exercise
Consider the following three different normal distributions. Discuss the similarities and differences between these
distributions in terms of their means and variances/standard deviations.
[Figure: normal curves labelled A, B and C]
1) A vs. B
2) B vs. C
3) A vs. C
Note: If the area falls to the left of µ, the z is negative; if the area falls to the right of µ, the z is positive
Alternative approach
Write the given probability in notation
Determine if the z score is on the positive or negative side of the graph (through logical reasoning)
Shade the given area (probability)
Find the z-score (often through logical reasoning)
Find the x-value by substituting all known values into $z = \frac{x - \mu}{\sigma}$ and solving for the unknown x
Example 9.4
The time it takes a randomly selected person to perform a task is normally distributed with a mean value of 120
seconds and a standard deviation of 20 seconds.
1. The probability that a randomly selected employee will complete the task between 100 and 130 seconds:
2. The probability that a randomly selected employee will complete the task between 75 and 100 seconds:
3. The probability that a randomly selected employee will complete the task within 75 seconds:
4. The probability that a randomly selected employee will complete the task in more than 75 seconds:
5. The 10% of the employees who complete the task within the shortest time are to be given advanced
training. What task times qualify individuals for such training?
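Answers obtained from the z-table can be checked with Python's standard library. The sketch below uses `statistics.NormalDist` (Python 3.8 or later) with the mean and standard deviation of Example 9.4; results may differ slightly from table values because the table rounds z to two decimals.

```python
from statistics import NormalDist

task_time = NormalDist(mu=120, sigma=20)   # Example 9.4: seconds

def z_score(x, dist):
    """Standardise x: number of standard deviations from the mean."""
    return (x - dist.mean) / dist.stdev

# Question 1: P(100 < X < 130) as a difference of two cumulative areas
p_between = task_time.cdf(130) - task_time.cdf(100)

# Question 3: P(X < 75)
p_within_75 = task_time.cdf(75)

# Question 5: the time below which the fastest 10% fall (inverse lookup)
cutoff_fastest_10 = task_time.inv_cdf(0.10)

print(f"z for 100 s = {z_score(100, task_time):.2f}, z for 130 s = {z_score(130, task_time):.2f}")
print(f"P(100 < X < 130) = {p_between:.4f}")
print(f"P(X < 75) = {p_within_75:.4f}")
print(f"Fastest 10% finish within {cutoff_fastest_10:.1f} seconds")
```

`cdf` gives the area to the left of a value, and `inv_cdf` goes the other way, from an area back to the x-value.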
Activity 9.3
The lifetimes of a certain kind of battery have a mean of 300 hours and a standard deviation of 35 hours. Assume
that the lifetimes, measured to the nearest hour, follow a normal distribution.
1. Determine the percentage of batteries that have a lifetime of more than 320 hours
2. Determine the value above which the best 30% of the batteries lie
3. Determine the proportion of batteries that have a lifetime from 250 to 350 hours
4. Determine the proportion of batteries with a lifetime between 250 and 280 hours
5. Determine the maximum lifetime below which the weakest 20% of the batteries will lie
6. Determine the minimum lifetime above which the 60% of the batteries with the longest life will fall
UNIT 10: ESTIMATION
The objective of most statistical studies is inference. Inferential methods presented in the following two units use
information contained in a sample to reach conclusions about the characteristics of the population from which the
sample was drawn. There are two general procedures for making inferences about populations: estimation and
hypothesis testing.
                              Statistic    Parameter
Mean                          x̄            µ
Standard deviation            s            σ
Variance (not in book)        s²           σ²
Proportion                    p            π
Size                          n            N
o And the standard error: $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$
o If π is unknown: $\sigma_p \approx \sqrt{\frac{p(1-p)}{n}}$
10.3. ESTIMATING POPULATION PARAMETERS
Two types of estimates can be made about a population: a point estimate or an interval estimate.
The discrepancy between a sample statistic and its population parameter is called sampling error. Measuring
sampling error forms a large part of inferential statistics. A point estimate will almost never be correct because
the sample is one point in the sample space of sample means. Each possible sample statistic will result in a
different estimate for the population parameter.
The probability associated with an interval estimate is known as the confidence level (1 − α) and is a measure of
confidence that the interval estimate will include the population parameter. A high probability, say 99%, means
more confidence and hence a wider interval.
o $p \pm z\sqrt{\frac{p(1-p)}{n}}$
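The interval for a proportion can be sketched as follows. The critical z-value is taken from `statistics.NormalDist` rather than from a table, and the figures (n = 180, p = 0.25, 99% confidence) are those of Activity 10.4; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical z-value: (1 - confidence) / 2 of the area in each tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def proportion_interval(p, n, confidence):
    """Confidence interval p +/- z * sqrt(p(1 - p) / n)."""
    margin = critical_z(confidence) * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: z = {critical_z(level):.3f}")

low, high = proportion_interval(p=0.25, n=180, confidence=0.99)
print(f"99% interval for the proportion: {low:.3f} to {high:.3f}")
```

A higher confidence level gives a larger z and therefore a wider interval.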
Activity 10.1
Identify the critical z-values associated with 90%, 95%, 98% and 99% confidence levels.
Additional
Identify the critical t-values associated with 90%, 95%, 98% and 99% confidence levels for a sample of size 15.
Activity 10.2
In 36 randomly selected seawater samples, the mean sodium chloride concentration was 23 cm3/m3 and the
standard deviation was 6.7 cm3/m3. Construct a 98% confidence interval estimate for the mean sodium chloride
concentration.
Activity 10.3
The time taken to complete the same task (in minutes) was recorded for nine participants in a training exercise:
8 7 8 9 7 7 9 10 9
Construct a 95% confidence interval for the average time taken to complete the task.
Activity 10.4
A medical researcher wished to determine the percentage of females who take vitamins. A study of 180 females
showed that 25% took vitamins. Construct a 99% confidence interval for the percentage of females who take
vitamins.
10.4. SAMPLE SIZE (n)
An important question that needs to be answered in statistical inference is how large a sample size is needed to
guarantee a certain level of confidence for a given margin of error? An appropriate sample size depends on:
Level of confidence
Variability in the population being studied
o The greater the variability in the population, the larger the sample required
Maximum allowable error (E)
o This is the maximum amount a point estimate should differ from (above or below) the parameter being
estimated, i.e. the difference between the sample statistic and the population parameter
Formula for determining the sample size when estimating the population mean:
$$n = \left(\frac{z\sigma}{E}\right)^2$$
Formula for determining the sample size when estimating the population proportion:
$$n = \frac{\pi(1-\pi)\,z^2}{E^2}$$
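Both sample size formulas can be sketched in code. The inputs are taken from Activities 10.5 and 10.6; rounding up to the next whole number is assumed, since a fraction of an observation cannot be sampled, and the function names are illustrative.

```python
from math import ceil
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided critical z-value for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_mean(sigma, error, confidence):
    """n = (z * sigma / E)^2, rounded up to the next whole observation."""
    return ceil((z_for_confidence(confidence) * sigma / error) ** 2)

def sample_size_proportion(pi, error, confidence):
    """n = pi * (1 - pi) * z^2 / E^2, rounded up to the next whole observation."""
    return ceil(pi * (1 - pi) * z_for_confidence(confidence) ** 2 / error**2)

# Activity 10.5: sigma = 2.8 g, E = 0.5 g, 95% confidence
print(sample_size_mean(sigma=2.8, error=0.5, confidence=0.95))

# Activity 10.6: pi = 0.5, E = 0.04, alpha = 0.1 (90% confidence)
print(sample_size_proportion(pi=0.5, error=0.04, confidence=0.90))
```

Halving the allowable error E roughly quadruples the required sample size, because E is squared in the denominator.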
Activity 10.5
A cheese processing company wants to estimate the mean cholesterol content of all 50g servings of cheese. The
estimate must be within 0.5g of the population mean. Determine the minimum required sample size to construct
a 95% confidence interval for the population mean. Assume the population standard deviation is 2.8g.
Activity 10.6
A daily newspaper is investigating the reading habits of the home delivery customers. If a previous survey
indicated that 50% of readers read the editorial page, what sample size should be used to estimate, within 4%, the
proportion of people who read the editorial page if α = 0.1?
UNIT 11: HYPOTHESIS TESTING
In Unit 10 you learnt how to estimate the value of the population parameter of interest. In this unit you will learn
how to test a claim or hypothesis about a population parameter.
To make this decision, it is necessary to decide whether the difference that exists between the hypothesized
population parameter and the sample result is significant and therefore not supportive of the hypothesis, or
whether the difference is a chance difference and therefore supportive of the claim.
In the sampling process, any one of the samples in the sampling distribution might be selected. Most of the time
this sample mean would not equal the population mean. Such a difference is due to the sampling process and is
known as a chance difference. The difference is not large enough to cause concern. Results are statistically
significant if the difference between the sample result and the statement made in H0 is unlikely to occur due to
chance alone. It indicates that the sample came from a population with a mean other than the hypothesised mean.
11.1 A SINGLE SAMPLE CLASSICAL HYPOTHESIS TEST
Steps
Understand the problem
Set up the null and alternative hypotheses.
Define the test procedure and decision rule
Select the significance level for the test
Determine the type of sampling distribution (z or t)
Determine the critical value(s) and corresponding rejection region
Collect and analyse the data
Collect the data, calculate the test statistic
Draw conclusions and make recommendations
Make the statistical decision by comparing the test statistic with the rejection region
Interpret the statistical decision
Some common phrases that indicate the direction of the test (note a few changes compared to the textbook)
> (right-tailed)      < (left-tailed)            ≠ (two-tailed)
Is greater than       Is less than               Is not equal to
Is above              Is below                   Is different
Is higher than        Is lower than              It has changed
Is longer than        Is shorter than
Is bigger than        Is smaller than
Is increased          Is decreased or reduced
An incline            A decline
For example, when α = 0.10, there is a 10% chance of rejecting a true H0. You can decrease the probability of
rejecting H0 when it is actually true by lowering the significance level. The significance level is the maximum
probability of making a type I error and is denoted by α. The purpose of the level of significance is to provide a
probability basis for deciding whether an observed difference between a sample statistic and a hypothesized
parameter is a chance difference or a statistically significant difference, since α is the probability that the test
statistic will fall in the rejection area. Usually tests are performed at an α-value of 0.01, 0.02, 0.05 or 0.10.
11.4.3. Determine the type of sampling distribution
This step will enable you to know whether to use the normal z-distribution or the t-distribution in determining the
rejection area.
Use the normal z-distribution:
o when the distribution is approximately normal with known σ
o when the sample size n ≥ 30 with an unknown or known σ
Use the t-distribution:
o when σ is unknown and the sample size n < 30
o If n ≥ 30 the distribution approximates the normal curve via the Central Limit Theorem and you
use the normal z-table
11.4.4. Determine the critical value(s) and identify the rejection region
The critical value represents the maximum number of standard deviations the sample mean or proportion
can differ from the hypothesized value before H0 is rejected.
The critical value separates the area under the curve into two regions - the non-rejection region and the
rejection region.
The rejection region (or decision rule) is a range of values such that, if the test statistic falls into that range,
we reject H0.
Steps
Specify the level of significance (α)
Decide whether the test is two-tailed, left-tailed or right-tailed (based on the alternative hypothesis)
Determine the critical value(s) and identify the rejection region
Sketch the normal curve. Draw a vertical line at each critical value and shade the rejection region(s).
State the rejection region in words.
Critical value
Example 11.1
A two-tailed hypothesis test at the 5% level of significance for a normal distribution places 2.5% of the area in each
tail. The area to look up in the normal table is (0.5 – 0.025) = 0.475. If you look up 0.475 the corresponding
z-value is -1.96 to the left and +1.96 to the right. These two values are your critical values.
The rejection region is given as: Reject H0 if the z-test > 1.96 or if the z-test < -1.96
A one-tailed test to the right at a 5% level of significance means all 5% will go into the right-hand tail; the area is
0.45, which will result in a z-value of +1.64. This 1.64 is your critical value.
The rejection region is given as: Reject H0 if the z-test > 1.64
A one-tailed test to the left at a 5% level of significance means all 5% will go into the left-hand tail; the area is
0.45, which will result in a z-value of -1.64. This -1.64 is your critical value.
The rejection region is given as: Reject H0 if the z-test < −1.64
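The table lookups in Example 11.1 can be reproduced with the standard library. This is a minimal sketch; `critical_values` is an illustrative name. The computed one-tailed value is about 1.645, which the notes round to 1.64 when reading from the table.

```python
from statistics import NormalDist

def critical_values(alpha, tail):
    """Critical z-value(s) for tail in {'two', 'right', 'left'}."""
    z = NormalDist()   # standard normal: mean 0, standard deviation 1
    if tail == "two":
        upper = z.inv_cdf(1 - alpha / 2)   # alpha / 2 in each tail
        return (-upper, upper)
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)     # all of alpha in the right tail
    if tail == "left":
        return (z.inv_cdf(alpha),)         # all of alpha in the left tail
    raise ValueError("tail must be 'two', 'right' or 'left'")

for tail in ("two", "right", "left"):
    values = [round(v, 3) for v in critical_values(0.05, tail)]
    print(f"{tail}-tailed, alpha = 0.05: {values}")
```

The same function can be called with alpha = 0.01, 0.02 or 0.10 to check table readings at other significance levels.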
Activity 11.1
Find the critical value(s) and rejection region for a two-tailed test, a left-tailed test and a right-tailed test for an
approximately normal distribution at a level of significance of 1%, 2% and 10%.
1% 2% 10%
Two-tailed
Right-tailed
Left-tailed
Extra activity
Find the critical value(s) and rejection region for a two-tailed test, a left-tailed test and a right-tailed test for a
t-distribution, where n = 12, at a level of significance of 1%, 2% and 10%.
1% 2% 10%
Two-tailed
Right-tailed
Left-tailed
11.4.5. Conduct the statistical test (calculating the test statistic)
By making use of the appropriate formula for the sampling distribution we calculate how many standard errors
the sample result is away from the assumed population value.
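Parameter                Distribution   Test statistic
Single mean (µ)          Z              $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Single mean (µ)          t_{n-1}        $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
Single proportion (π)    Z              $z = \dfrac{p - \pi}{\sqrt{\dfrac{\pi (1 - \pi)}{n}}}$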
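The three test statistics can be written as small functions. This is a minimal sketch; the function and parameter names are assumptions made for illustration. The usage line takes its numbers from Example 11.4 (85 of 100 in favour, claimed proportion 0.9).

```python
from math import sqrt


def z_stat_mean(x_bar, mu, sigma, n):
    """z = (x_bar - mu) / (sigma / sqrt(n)) for a single mean."""
    return (x_bar - mu) / (sigma / sqrt(n))


def t_stat_mean(x_bar, mu, s, n):
    """t = (x_bar - mu) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (x_bar - mu) / (s / sqrt(n))


def z_stat_proportion(p, pi, n):
    """z = (p - pi) / sqrt(pi * (1 - pi) / n) for a single proportion."""
    return (p - pi) / sqrt(pi * (1 - pi) / n)


# Example 11.4: sample proportion 85/100 against a claimed proportion of 0.9
print(round(z_stat_proportion(85 / 100, 0.9, 100), 2))  # -1.67
```

The result agrees with the test statistic of −1.67 quoted for Example 11.4 in the p-value section.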
Activity 11.2
A sample of 100 healthy adult males has a mean systolic blood pressure of 125 mmHg with a standard deviation of
15 mmHg. Test at a 2% level of significance whether the mean systolic blood pressure is different from the generally
accepted level of 130 mmHg.
Identify the test:
Identify the distribution:
1) Hypotheses
2) Critical value
3) Test statistic
4) Decision
5) Conclusion
Example 11.3
A process takes an average time of 35 minutes. It is thought that a certain modification would reduce this time,
and after being modified, the process is repeated 13 times, giving an average time of 33.3 minutes with a standard
deviation of 2.4 minutes. Is there any significant reduction in the time at a level of significance of 0.05?
Identify the test:
Identify the distribution:
Identify the form of the alternative hypothesis:
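One possible working for Example 11.3 is sketched below. It assumes a left-tailed test on a single mean using the t-distribution with n − 1 = 12 degrees of freedom, and a critical value of −1.782 read from a standard t-table at α = 0.05; these choices are inferred from the wording of the example rather than stated in the notes.

```latex
\begin{align*}
H_0 &: \mu = 35 \qquad H_a : \mu < 35 \\
t &= \frac{\bar{x} - \mu}{s / \sqrt{n}}
   = \frac{33.3 - 35}{2.4 / \sqrt{13}}
   \approx \frac{-1.7}{0.6656}
   \approx -2.55 \\
\text{Critical value} &: \; -t_{0.05,\,12} = -1.782 \\
-2.55 &< -1.782 \;\Rightarrow\; \text{reject } H_0
\end{align*}
```

Under these assumptions the sample gives evidence, at the 5% level, that the modification reduced the average time.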
Activity 11.3
From past records we know that the average unbroken sleep period of patients with a certain kind of insomnia is
2.8 hours. A new drug is tested on a sample of 25 patients and this yields an average of 3 hours of unbroken sleep
with a standard deviation of 0.8 hours. Is there a significant improvement in the number of hours of unbroken sleep?
Test at α = 2.5%.
Identify the test:
Identify the distribution:
1) Hypotheses
2) Critical value
3) Test statistic
4) Decision
5) Conclusion
Example 11.4
Directors of a company claim that 90% of the workforce supports a new shift pattern that they have suggested. A
random survey of 100 people in the workforce finds 85 in favour of the new scheme. Test at a 5% level if there is
a significant difference between the survey results and the directors’ claim.
Identify the test:
Identify the distribution:
Identify the form of the alternative hypothesis:
Activity 11.4
To determine if new flavours of ice cream must be introduced into the market, a random sample of 320 people
was asked to taste and choose their favourite ice cream flavour. Of the 320 people surveyed, 58 responded that
they preferred the chocolate flake flavour. Test the claim that less than 25% of people prefer chocolate flake
flavour ice cream at the α = 0.01 level of significance.
Identify the test:
Identify the distribution:
1) Hypotheses
2) Critical value
3) Test statistic
4) Decision
5) Conclusion
11.2 HYPOTHESIS TESTING USING THE p-VALUE APPROACH
The p-value of the test statistic is the probability, when the null hypothesis is true, of obtaining a sample result
equal to, or more extreme than, the one actually observed.
Steps
State the null and alternative hypotheses
Determine the sampling distribution (z or t)
Calculate the test statistic
Calculate the p-value
NOTE: for examination purposes in 2017, we will only focus on p-value calculations for hypothesis tests based
on the Z distribution.
Example 11.5
A cellphone operator’s manager believes that customer monthly cellphone bills average more than R85 per month.
To test this claim, a sample of 64 customer cellphone accounts was randomly selected. The calculated test statistic
was equal to 1.5.
Hypotheses:
H0: µ = 85
Ha: µ > 85
Test statistic = 1.5
p-value:
Decision:
Conclusion:
Decision:
Conclusion:
Example 11.4 (one sample proportion)
Directors of a company claim that 90% of the workforce supports a new shift pattern that they have suggested. A
random survey of 100 people in the workforce finds 85 in favour of the new scheme.
Hypotheses:
H0: π = 0.9
Ha: π ≠ 0.9
Test statistic = −1.67
p-value =
Decision:
Conclusion:
Test statistic =
p-value =
Decision:
Conclusion:
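For tests based on the Z distribution, the p-value can be computed from the standard normal cumulative distribution. The sketch below uses Python's standard library; the function name z_p_value and the tail labels are assumptions for illustration. The two calls use the test statistics given in Example 11.5 (right-tailed, z = 1.5) and Example 11.4 (two-tailed, z = −1.67).

```python
from statistics import NormalDist


def z_p_value(z, tail):
    """p-value of a z test statistic; tail is "right", "left" or "two"."""
    cdf = NormalDist().cdf  # standard normal cumulative distribution
    if tail == "right":
        return 1 - cdf(z)             # area to the right of z
    if tail == "left":
        return cdf(z)                 # area to the left of z
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))  # both tails beyond |z|
    raise ValueError("tail must be 'right', 'left' or 'two'")


print(round(z_p_value(1.5, "right"), 4))  # 0.0668  (Example 11.5)
print(round(z_p_value(-1.67, "two"), 4))  # 0.0949  (Example 11.4)
```

With the usual rule of rejecting H0 when the p-value is smaller than α, a p-value of about 0.0949 would not lead to rejection at α = 0.05.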
11.3 TESTING FOR DIFFERENCES AMONG MEANS AND PROPORTIONS
Not for examination purposes for 2017
Steps
State H0 and Ha: The null hypothesis states that the two variables are statistically independent. This means
that knowledge of the one variable does not help in predicting the other variable.
H0: The variables are independent (no relationship)
Ha: The variables are dependent (there is a relationship)
Select the level of significance (α)
State the decision rule by defining the rejection region
o Use the level of significance and degrees of freedom to find the critical value from the χ² table
o df = (r – 1)(k – 1), where r is the number of rows and k the number of columns in the table
o The χ² distribution is positively skewed, so the critical value will always be in the right-hand tail
o The rejection region for H0 goes from the critical value to the right, i.e. reject H0 if the test statistic is
greater than the critical value: $\chi^2 > \chi^2_{(\alpha,\, df)}$
Calculate the value of the chi-square test by substituting cell by cell the values from the fo and fe table into
the formula:
$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$
fo = observed frequencies in each cell
fe = expected frequencies in each cell
$f_e = \dfrac{\text{row total} \times \text{column total}}{\text{overall total}}$
Compare the test statistic with the rejection region
o Make a decision in terms of H0
o Reach a conclusion in terms of Ha
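The cell-by-cell calculation can be sketched in a few lines. The observed counts below are hypothetical and serve only to show the mechanics; the function names are likewise assumptions for this illustration.

```python
def expected_frequencies(observed):
    """fe = row total * column total / overall total, for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_independence(observed):
    """Return the chi-square test statistic and its degrees of freedom."""
    expected = expected_frequencies(observed)
    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(k - 1)
    return chi2, df


hypothetical_counts = [[10, 20], [30, 40]]  # made-up 2 x 2 table of fo values
chi2, df = chi_square_independence(hypothetical_counts)
print(expected_frequencies(hypothetical_counts))  # [[12.0, 18.0], [28.0, 42.0]]
print(round(chi2, 3), df)                         # 0.794 1
```

The test statistic is then compared with the critical value read from the χ² table for the chosen α and the computed df.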
Activity 11.8
A manufacturer of women’s clothing is interested to know if age is a factor in whether women would buy a
particular garment, depending on its quality. A researcher sampled three age groups and each woman was asked
to rate the garment as excellent, average or poor. Test the hypothesis, at a 5% level of significance, that rating is
not related to age group.
Rephrase: Test the hypothesis, at a 5% level of significance, if rating is related to age group
1) Hypotheses
2) Critical value
3) Test statistic
4) Decision
5) Conclusion
11.4.2. Goodness-of-fit tests
The χ² goodness-of-fit test is used to determine whether a set of sample data differs significantly from what is
expected. The expected pattern is either:
Equal distribution across categories
OR equal to some known distribution
Steps
State H0 and Ha
o Equal distribution
H0: All categories are equal (π1 = π2 =…= πk)
Ha: At least 1 category is different
o Equal to some known distribution
H0: Distribution follows a known pattern
Ha: At least 1 category is different from the known pattern
Select the level of significance (α)
State the decision rule by defining the rejection region
o Use the level of significance and degrees of freedom to find the critical value from the χ² table
o df = (r – 1), where r is the number of categories (k in the hypotheses above)
Calculate the value of the χ2 test using the observed (fo) and expected (fe) frequencies:
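$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$
fo = observed frequencies in each cell
fe = expected frequencies in each cell
$f_e = \pi_i \times n$ (πi = hypothesised proportion for category i, n = sample size)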
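The goodness-of-fit calculation can be sketched as follows, using the observed counts and labelled percentages from Activity 11.9 further down. The function name is an assumption for illustration, and the critical value of 9.348 is the standard χ² table entry for α = 0.025 with 3 degrees of freedom.

```python
def chi_square_goodness_of_fit(observed, proportions):
    """Return the chi-square statistic and degrees of freedom.

    proportions holds the hypothesised share of each category (summing to 1).
    """
    n = sum(observed)
    expected = [p * n for p in proportions]  # fe = pi_i * n
    chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    df = len(observed) - 1
    return chi2, df


observed = [92, 69, 32, 42]          # walnuts, hazelnuts, almonds, brazil nuts
labelled = [0.45, 0.20, 0.20, 0.15]  # proportions stated on the label
chi2, df = chi_square_goodness_of_fit(observed, labelled)
critical = 9.348                     # chi-square table: alpha = 0.025, df = 3
print(round(chi2, 2), df, chi2 > critical)  # 18.17 3 True
```

For the equal-distribution case, the same function can be called with every proportion set to 1/k, where k is the number of categories.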
Example 11.12
A manufacturer of soap wishes to know if consumers have a preference for bath soap fragrances. To answer their
question, a random sample of 200 adult shoppers is offered a free bar of soap. The recipients may choose from
among four fragrances.
H0: Equal distribution
Example 11.13
The distribution of car manufacturers’ shares of the national market is given. A random sample of 2000 car
owners in Pretoria revealed the following ownership pattern: Volkswagen 758, Toyota 680, Delta 300, BMW 162
and Mercedes 100. Does the ownership pattern in Pretoria differ significantly from the national pattern?
H0: Equal to some known distribution
Activity 11.9
A delivery of assorted nuts is labelled as having 45% walnuts, 20% hazelnuts, 20% almonds and 15% brazil nuts.
By randomly picking several scoops of nuts from the bag delivered, the following count was obtained:
           Walnuts   Hazelnuts   Almonds   Brazil nuts   Total
Observed   92        69          32        42            235
Could these findings be a basis for an accusation of mislabelling at a 2.5% level of significance?
1) Hypotheses
2) Critical value
3) Test statistic
4) Decision
5) Conclusion
UNIT 7: TIME SERIES
This unit discusses the general use of forecasting in business and several methods that are available for making
forecasts.
Forecasting is the science of predicting the future. It is used in the decision-making process to help business
people reach conclusions about buying, selling, producing and many other actions. Time series analysis is known
as a forecasting tool. The objective is to analyse how observed data changes over time in order to detect patterns
that will enable us to predict future values. Time series helps us cope with uncertainty about the future. Time
series data is numerical data gathered on a given characteristic over a period of time at regular intervals.
The observed time series data consists of four separate components – trend, cyclical, seasonal and irregular.
Secular trend (T) is the underlying long-term movement (increase or decrease) over time in the recorded
data values and is usually the result of long-term factors such as changes in the population size,
demographic characteristics of the population, technology and consumer preferences
Cyclical variations (C) are medium-term changes caused by circumstances which repeat in cycles and
cause upward and downward swings, not of equal length, throughout the series
o E.g. the general business cycle of prosperity, recession, depression and recovery
Random or irregular variations (I) occur over short intervals and are unpredictable with no pattern to their
behaviour
o Variations due to everyday unpredictable influences
o E.g. the weather, political unrest, crime, war, transport breakdowns
Seasonal variations (S) are short-term fluctuations that tend to repeat themselves over periods of days, weeks,
months, quarters, etc.
o E.g. ice cream sales, an increase in the number of flu cases every winter, higher sales in shops
before Christmas, etc.
The classical time series model is known as the multiplicative model and is represented by the following equation
showing the four components that make up the time series and their relation to each other:
Y = T × C × S × I
o The components are multiplied to provide the value of the observed dependent variable Y
o Trend (T) is in the same units as Y
o The other components are expressed as percentage adjustments
A value > 100 indicates an above-average effect of the component
A value < 100 indicates a below-average effect of the component
For a time series composed only of annual data, there is no seasonal component. In that case the time series model
becomes:
Y=T×C×I
Periodically reported data, monthly, quarterly, weekly, daily, etc., include the influence of all four components of
the time series:
Y=T×C×S×I
The process of division can remove or isolate any component of this model and is called decomposition.
NOTE: Analysis of cyclical and irregular influences on data is useful for describing past variations but because
of their unpredictability, their value in forecasting is very limited. Instead, a number of business indicators are
used to forecast cyclical turning points. Predicting cyclical and irregular variations requires techniques beyond the
scope of this unit.
If we ignore the C and I components, since by definition they cannot be predicted, the forecasting model will
become:
Ŷ = T × S
o Ŷ = a + bx
o If raw data are given, use your calculator to find the values of the intercept (a) and slope (b) using
regression analysis
Activity 7.2
The following table shows the number of traffic tickets issued by the Alberton Traffic Department for the first six
months of the year.
Month Time (X) Number of tickets (Y)
January 1 120
February 2 120
March 3 100
April 4 90
May 5 130
June 6 150
1. Use the method of least squares to find the trend line (different phrasing to the book)
o Add the index of time (X) from 1 to the last value in the time series
o Enter the X (independent) and Y (dependent) values in your calculator for regression analysis
o Trend line:
2. Forecast the number of tickets that the Traffic Department can expect to issue for the next three months
o July:
o August:
o September:
[Figure: number of tickets (Y) plotted against time (X), with the fitted trend line y = 4.8571x + 101.33]
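The calculator regression in Activity 7.2 can be reproduced with the least-squares formulas for the slope and intercept. This is a minimal sketch; the function name is an assumption for illustration, and time is indexed 1, 2, 3, ... as in the activity.

```python
def least_squares_trend(y):
    """Fit y = a + b*x by least squares, with x = 1, 2, ..., len(y)."""
    n = len(y)
    x = range(1, n + 1)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_mean - b * x_mean  # intercept
    return a, b


tickets = [120, 120, 100, 90, 130, 150]  # January to June
a, b = least_squares_trend(tickets)
print(round(a, 2), round(b, 4))          # 101.33 4.8571

# Forecasts for the next three months: extend the time index to 7, 8 and 9
for month, x in [("July", 7), ("August", 8), ("September", 9)]:
    print(month, round(a + b * x, 1))
```

The fitted coefficients match the trend line shown on the graph for this activity.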
Method of semi-averages
This technique involves the calculation of two averages which, when plotted on a graph as two separate points
and joined up, form a straight line.
Steps
Split the data into two equal groups
o If there is an odd number of years, simply omit the middle year
o It is important that the two groups in question have an equal number of data values
Calculate the arithmetic mean for each group
Plot the two means at the midpoints of the time intervals covered by the respective groups
Join these two points with a straight line - this is the trend line
Forecast:
Extend the straight line up to the required forecast period and use the graph to read the value off the y-axis
OR calculate the average increase per year by determining the difference between the two averages and
divide this difference by the number of years between the two averages
o Add this increment an appropriate number of times to the mean of the latter group
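The semi-average steps can be sketched as below. The notes give no data set for this method, so the sketch reuses the traffic-ticket series from Activity 7.2 purely as an illustration; the function names are assumptions.

```python
def semi_average_trend(y):
    """Return (mean of later group, its time midpoint, increase per period)."""
    n = len(y)
    half = n // 2                        # an odd-length series drops its middle value
    first, second = y[:half], y[n - half:]
    mean_first = sum(first) / half
    mean_second = sum(second) / half
    mid_first = (1 + half) / 2           # midpoint of periods 1..half
    mid_second = (n - half + 1 + n) / 2  # midpoint of the later group
    increment = (mean_second - mean_first) / (mid_second - mid_first)
    return mean_second, mid_second, increment


def semi_average_forecast(y, period):
    """Extend the semi-average trend line to the given time period."""
    mean_second, mid_second, increment = semi_average_trend(y)
    return mean_second + increment * (period - mid_second)


tickets = [120, 120, 100, 90, 130, 150]  # Activity 7.2 data, periods 1..6
print(round(semi_average_forecast(tickets, 7), 1))  # 130.0
```

Here the two group means (about 113.3 and 123.3) sit at periods 2 and 5, so the line rises by about 3.33 per period, and period 7 lies two periods beyond the second mean.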
Method of moving averages
Steps
If an odd-numbered moving average is calculated (i.e. 3, 5 or 7), there will be a middle time point opposite
which to record the answers
o E.g. a 3-year moving average
o Add the Y-values for time periods 1, 2, 3 and divide by 3
The answer corresponds to time period 2
o Move down one year and calculate the average Y-value for time periods 2, 3, 4 and divide by 3
The answer corresponds to time period 3
o Continue until you no longer have 3 Y-values to add
If an even-numbered moving average is calculated (i.e. 2, 4, 6, etc.), the resulting averages fall between two
time points. But a trend value must coincide with a particular point in time, therefore an extra step is required
to centre the average, by averaging successive moving averages
o E.g. a 4-year moving average (example not in book)
o Add the Y-values for time periods 1, 2, 3, 4 and divide by 4
The answer corresponds to time period 2.5
o Add the Y-values for time periods 2, 3, 4, 5 and divide by 4
The answer corresponds to time period 3.5
o Take an average of the answer for times 2.5 and 3.5
The final answer will correspond to time period 3
o Continue until you no longer have 4 Y-values to add
The longer the time period covered in computing the average, the smoother the resulting curve. If the
period is too long, a straight line will result and the general direction of the curve will be lost
The moving average forecast for the next year is the average of the preceding period
Example 7.3
The quarterly sales of petrol at Jack’s Garage are represented in the table below.
Time  Quarter  Sales  3-quarterly      4-quarterly moving average
                      moving average   4-quarterly average  Centred average
1     1        40
2     2        37     46
3     3        61     52               49.0                 46.0
4     4        58     45               A                    C
5     1        16     43               B                    50.1
6     2        55     51               52.8                 49.0
7     3        82     55               45.3                 48.3
8     4        28     50               51.3                 53.1
9     1        40     46               55.0                 51.0
10    2        70     53               47.0                 51.8
11    3        50     62               56.5                 58.4
12    4        66     57               60.3                 56.6
13    1        55     54               53.0                 58.0
14    2        41     62               63.0                 62.8
15    3        90     65               62.5
16    4        64
A=          B=          C=
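The moving-average columns of Example 7.3 can be checked with a short script. This is a minimal sketch; the function names are assumptions for illustration. The first 3-quarterly average belongs to time period 2 and the first centred 4-quarterly average to time period 3.

```python
def moving_average(y, k):
    """Simple k-period moving averages, one per window of k consecutive values."""
    return [sum(y[i:i + k]) / k for i in range(len(y) - k + 1)]


def centred_moving_average(y, k):
    """For even k: average each pair of successive k-period moving averages."""
    ma = moving_average(y, k)
    return [(ma[i] + ma[i + 1]) / 2 for i in range(len(ma) - 1)]


sales = [40, 37, 61, 58, 16, 55, 82, 28, 40, 70, 50, 66, 55, 41, 90, 64]

ma3 = moving_average(sales, 3)              # aligned with time periods 2..15
centred4 = centred_moving_average(sales, 4) # aligned with time periods 3..14

print(ma3[:3])       # [46.0, 52.0, 45.0]
print(centred4[0])   # 46.0
```

Printing the remaining entries of moving_average(sales, 4) and centred4 gives values that can be compared with the blanks A, B and C.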
Activity 7.4
Construct a 4-year and a 5-year moving average and graph the results.
Time  Year  Y   5-year moving    4-year moving average
                average          4-year average  Centred average
1     2005  14
2     2006  20
3     2007  40  26.4             26.0            27.8
4     2008  30  32.0             B               D
5     2009  28  38.2             C               36.4
6     2010  42  A                37.8            37.1
7     2011  51  35.6             36.5            37.0
8     2012  25                   37.5
9     2013  32
7.3.2. Seasonal variation (S)
Seasonal variations occur within a period of one year or less. Therefore periodic data (weekly, monthly, quarterly,
daily) is required. Seasonal variation is generally expressed as an index number and can be identified using the
ratio-to-moving-average method.
Ratio-to-moving-average method
Steps
1. List data in date order
2. Determine the time period to be used for the moving average, e.g. for monthly data use a 12-monthly moving
average, for quarterly data a four-quarterly moving average and for a six-day week a six-daily moving average
3. Calculate the required moving average. If the period is an even number, centre the averages by averaging
adjacent moving averages
4. Express the original time series values as percentages of the corresponding centred moving averages:
Divide the average into the original data and multiply the result by 100. These are the individual seasonal
percentages
5. Summarise the seasonal percentages in a new table by grouping together the seasonal time periods. For
example, all the first quarters together, all the second quarters together, etc. Use the modified mean
approach to compute an unadjusted seasonal index per season
o The modified mean is the arithmetic mean of the values that remain after elimination of the
smallest and largest values in the column
6. Add the unadjusted seasonal indexes
7. Determine the factor needed to adjust the index numbers to typical index numbers
Total typical quarterly index = 100 × 4 = 400
Total typical monthly index = 100 × 12 = 1200
$\text{Factor} = \dfrac{\text{total typical index}}{\text{total of unadjusted indexes}}$
8. Calculate the adjusted typical seasonal index numbers by multiplying the unadjusted index numbers by
this factor
9. Seasonal indexes can be included in short-term forecasts
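Steps 5 to 8 can be sketched in code. The seasonal percentages below are copied from the "%" column of Example 7.4 further down; the function names are assumptions for illustration, and the modified mean needs at least three percentages per season.

```python
def modified_mean(values):
    """Mean of the values left after dropping the smallest and the largest."""
    trimmed = sorted(values)[1:-1]  # requires at least three values
    return sum(trimmed) / len(trimmed)


def seasonal_indexes(percentages_by_season):
    """Adjust the unadjusted indexes so that they total 100 * number of seasons."""
    unadjusted = [modified_mean(v) for v in percentages_by_season]
    typical_total = 100 * len(percentages_by_season)
    factor = typical_total / sum(unadjusted)
    return [u * factor for u in unadjusted]


# Seasonal percentages grouped by quarter (Example 7.4, "%" column)
percentages = [
    [83.3, 82.8, 85.4],     # quarter 1: 2010, 2011, 2012
    [107.3, 105.2, 107.3],  # quarter 2: 2010, 2011, 2012
    [130.4, 126.4, 126.6],  # quarter 3: 2009, 2010, 2011
    [80.8, 85.0, 83.0],     # quarter 4: 2009, 2010, 2011
]
print([round(i, 1) for i in seasonal_indexes(percentages)])
# [83.3, 107.2, 126.5, 83.0]
```

The output agrees with the seasonal index column of Example 7.4, and the four adjusted indexes total 400.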
Seasonalised forecasting for periodically reported data
Deseasonalising data
The influence of seasonality can be removed from a time series by dividing each original value in the series by
the appropriate typical seasonal index for that period and then multiplying the result by 100. The result is known
as deseasonalised data. Deseasonalised data is used if we wish to compare data across seasons to determine if an
increase or decrease, irrespective of seasonal trend, has taken place.
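As a small illustration of deseasonalising, the sketch below divides the 2009 quarterly values of Example 7.4 by that example's seasonal indexes. The function name is an assumption, and the series is assumed to start in the first season.

```python
def deseasonalise(values, seasonal_indexes):
    """Divide each value by its period's seasonal index and multiply by 100."""
    k = len(seasonal_indexes)
    # Assumes values[0] belongs to the first season and seasons repeat in order
    return [y / seasonal_indexes[i % k] * 100 for i, y in enumerate(values)]


income_2009 = [52, 67, 85, 54]         # quarters 1 to 4 of 2009 (R'000)
indexes = [83.3, 107.2, 126.5, 83.0]   # seasonal indexes from Example 7.4
print([round(v, 1) for v in deseasonalise(income_2009, indexes)])
# [62.4, 62.5, 67.2, 65.1]
```

Once the seasonal effect is removed, the third quarter no longer stands out as sharply as it does in the raw figures.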
Example 7.4
The quarterly income (in R’000) of a soft drink company has been recorded for 4 years.
Column:  1              2    3                  4        5      6
Year     Quarter        y    4-quarter moving   Centred  %      Seasonal index
                             average
2009     1              52                                      83.3
         2              67                                      107.2
         3              85   64.5               65.2     130.4  126.5
         4              54   65.8               66.8     80.8   83.0
2010     1              57   67.8               68.4     83.3   83.3
         2              75   69.0               69.9     107.3  107.2
         3              90   70.8               71.2     126.4  126.5
         4              61   71.5               71.8     85.0   83.0
2011     1              60   72.0               72.5     82.8   83.3
         2              77   73.0               73.2     105.2  107.2
         3              94   73.5               74.2     126.6  126.5
         4              63   75.0               75.9     83.0   83.0
2012     1              66   76.8               77.3     85.4   83.3
         2              84   77.8               78.3     107.3  107.2
         3              98   78.8                               126.5
         4              67                                      83.0
Column 1: Quarterly data
Column 2: Sales data
Column 3: 4-quarterly moving average
Time period 2.5:
Quarter 1, 2013:
Quarter 2, 2013:
Quarter 3, 2013:
Quarter 4, 2013:
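One way to produce the 2013 forecasts is to combine a trend line with the seasonal indexes, following Ŷ = T × S. The sketch below is an assumption-laden illustration: it fits the trend by least squares to the raw quarterly values with time indexed 1 to 16, and divides the seasonal index by 100 because the indexes are percentages. The notes do not state which trend the worked solution uses, so the figures may differ from the book's answer.

```python
def least_squares_trend(y):
    """Fit y = a + b*x by least squares, with x = 1, 2, ..., len(y)."""
    n = len(y)
    x = range(1, n + 1)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    return y_mean - b * x_mean, b


income = [52, 67, 85, 54, 57, 75, 90, 61, 60, 77, 94, 63, 66, 84, 98, 67]
seasonal_index = [83.3, 107.2, 126.5, 83.0]  # quarters 1 to 4, Example 7.4

a, b = least_squares_trend(income)  # assumed trend: fitted to the raw data
forecasts = []
for quarter in range(1, 5):
    x = len(income) + quarter       # time periods 17 to 20 (2013)
    trend = a + b * x               # T
    forecasts.append(trend * seasonal_index[quarter - 1] / 100)  # T x S

for quarter, value in enumerate(forecasts, start=1):
    print(f"Quarter {quarter}, 2013: {value:.1f}")
```

If the trend is instead fitted to deseasonalised data, only the call to least_squares_trend changes; the multiplication by the seasonal index stays the same.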
Activity 7.5
The owner of a pizzeria recorded the number of pizzas sold during the past weeks in order to determine the
influence of the day of the week on sales. Do a seasonal forecast per day for week 5.
Time  Week  Day        Number sold  5-day moving average  %      Seasonal index
1     1     Monday     12                                        72.4
2     1     Tuesday    18                                        83.5
3     1     Wednesday  16           20.4                  78.4   84.5
4     1     Thursday   25           A                     B      124.2
5     1     Friday     31           20.0                  155    135.4
6     2     Monday     11           20.6                  53.4   72.4
7     2     Tuesday    17           20.4                  83.3   83.5
8     2     Wednesday  19           19.6                  96.9   84.5
9     2     Thursday   24           20.2                  118.8  124.2
10    2     Friday     27           20.0                  135    135.4
11    3     Monday     14           19.4                  72.2   72.4
12    3     Tuesday    16           20.2                  79.2   83.5
13    3     Wednesday  16           19.8                  80.8   84.5
14    3     Thursday   28           20.4                  137.3  124.2
15    3     Friday     25           21.4                  116.8  135.4
16    4     Monday     17           22.2                  76.6   72.4
17    4     Tuesday    21           21.4                  98.1   83.5
18    4     Wednesday  20           22.8                  87.7   84.5
19    4     Thursday   24                                        124.2
20    4     Friday     32                                        135.4
A=
B=
Week            Monday  Tuesday  Wednesday  Thursday  Friday  Total
1                                78.4                 155
2               53.4    83.3     96.9       118.8     135
3               72.2    79.2     80.8       137.3     116.8
4               76.6    98.1     87.7
Unadjusted
Typical index   500     500      500        500       500
Factor
Seasonal index
Monday, Week 5:
Tuesday, Week 5:
Wednesday, Week 5:
Thursday, Week 5:
Friday, Week 5: