ST1381 Elementary Statistics PDF
2 1 4 3 0 3 1
Quantitative Qualitative
discrete Continuous
Exercise
Read the following on attendance and grades, and
answer the questions.
A study conducted at NUL revealed that students
who attend class 95 to 100% of the time usually
received an A in the class. Students who attended
class 80 to 90% of the time usually received a B or a C
in the class. Students who attended class less than
80% of the time received a D or an F or eventually
withdrew from the class.
Based on this information, attendance and grades are
related. The more you attend class, the more likely it
is you will receive a higher grade. If you improve
your attendance, your grades will improve. Many
factors affect your grade in a course. One factor that
you have considerable control over is attendance.
You can increase your opportunities for learning by
attending class more often.
What are the variables under study?
What are the data in the study?
Are descriptive statistics, inferential statistics, or
both types used?
What is the population under study?
Was a sample collected? If so, from where?
From the information given, comment on the
relationship between the variables.
Solution
The variables are grades and attendance.
Data consists of specific grades and attendance
numbers.
These are descriptive statistics; however, if an
inference were made to all students, then that would
be inferential statistics.
Population under study is ALL students at NUL.
While not specified, we probably have data from a
sample of NUL students.
Based on the data, it appears that, in general, the
better your attendance, the higher your grade.
Exercise
Classify each of the following as nominal, ordinal,
interval or ratio-scaled data.
The time required to produce each tyre on an
assembly line.
The number of liters of milk a family drinks in a
month.
The ranking of four machines in your plant after
they have been designated as excellent, good,
satisfactory or poor.
Major in college (mathematics, biology, psychology,
etc.)
The age of each of your employees.
The sales in maloti at a local pizza house each
month.
Elevations of Lesotho National Parks, in feet
above/below sea level.
The response time of an emergency unit.
A college student’s degree (associate, bachelor’s,
master’s, etc.)
Choosing a Sample
Recall that inferential statistics consists of
methods of drawing conclusions about the
population based on information obtained
from a sample of the population.
When one collects information from the
entire population, the exercise is referred to
as a census.
Population: refers to the collection of all
individuals or items under consideration in a
statistical study.
Examples: All employed workers in Lesotho,
All registered voters in Lesotho.
Sample: refers to part of the population from
which information is collected.
Why do we use a
sample?
There are various reasons why we do not
investigate a whole population (take a census)
but rather investigate a sample from a
population:
A census is expensive: for example, it involves
millions of questionnaires, travelling costs,
temporary personnel, etc.
A census takes a long time: for example, it
involves the distribution and collection of
questionnaires, the processing of large
amounts of data, etc.
Sections of a population are inaccessible- it
is difficult to reach animals and plants on
very high mountains and access to persons in
hospitals and prisons is often forbidden, etc.
Inaccuracy of a census- for example, good
planning is necessary to take a census, there
is a large amount of administrative work,
mistakes are made by people working with
large datasets, etc.
Once the researcher has decided that
sampling is appropriate, the next question to
consider is how to select a sample.
Remember the sample results will be used to
make conclusions concerning the entire
population.
As a result, it is important for a sample to be
representative, that is, a sample should reflect
as closely as possible, the relevant
characteristics of the population under
consideration. Otherwise the sample is said
to be biased.
Examples of a biased
sample
It would not make sense to use the mean
weight of a sample of football team players to
make inferences about the mean weight of all
adult males in Lesotho.
It would be very unreasonable to try to
estimate the mean income of Roma residents
by sampling the incomes of people who work
at NUL.
Sampling Methods
Taking a sample is not simply a matter of
taking the nearest item.
If worthwhile conclusions relating to the
whole population are to be drawn from the sample,
it is essential to ensure, as far as possible,
that the sample is free from bias.
To obtain unbiased samples, i.e. samples that
give each subject in a population an equally
likely chance of being selected, statisticians use
four basic methods of probability sampling:
a) Simple random Sampling
b) Systematic Sampling
c) Stratified Sampling
d) Cluster Sampling
Simple Random
Sampling
Is a probability sampling technique whereby
each member of the population has an equal
and known chance/probability of being
selected into a sample.
There are different sampling procedures that
can be used to select a simple random
sample, namely:
i) Lottery type of sampling procedure
Assign each student name a number, write the
numbers on pieces of paper, and draw them
blindly from a box without replacement.
ii) Random number tables: in this method of selecting
a simple random sample, we start by allocating a
unique number to each member (unit) of the
population. Often the numbers 1 to N are allocated to
the N units in the population. We then open any page
of the table of random numbers, start at any point,
and read k digits at a time while moving in any
direction, where k = number of digits in N. The first
n numbers not exceeding N identify the units to be
included in the sample.
Example
A part of table of random numbers is given
02946 81881 96520 56247 17623
85697 62000 87957 07258 45054
26734 68026 52067 23123 73700
47829 31353 95944 72169 58374
76603 99339 40571 41186 04981
Use the above random numbers to select a random
sample of 8 units out of 600 units.
Solution
We first assign a unique number (1 to 600) to each of
the 600 units.
If we decide to start from the second row, second
digit (column), and read three digits at a time from
left to right, skipping 000 and any number greater
than 600, the first 8 usable numbers are:
569, 570, 054, 267, 346, 067, 231, 237.
The units bearing the numbers 569, 570, 54, 267, 346,
67, 231 and 237 will be included in the sample.
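The table-reading procedure can be sketched in Python. This is only an illustration: the function name `read_sample` and its parameters are made up here, and it assumes the reading starts at the second digit of the second row and skips 000, numbers above N, and repeats.

```python
# Rows of the random number table from the example above.
TABLE_ROWS = [
    "02946 81881 96520 56247 17623",
    "85697 62000 87957 07258 45054",
    "26734 68026 52067 23123 73700",
    "47829 31353 95944 72169 58374",
    "76603 99339 40571 41186 04981",
]

def read_sample(rows, start_row, start_digit, population_size, sample_size):
    """Read k-digit groups left to right and keep the first usable unit numbers."""
    k = len(str(population_size))  # k = number of digits in N
    # Join the rows from the starting row onward into one digit string.
    digits = "".join(row.replace(" ", "") for row in rows[start_row:])
    digits = digits[start_digit:]
    chosen = []
    for pos in range(0, len(digits) - k + 1, k):
        number = int(digits[pos:pos + k])
        # Keep only valid unit numbers that have not been drawn already.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Indices are zero-based: row 1 is the second row, digit 1 is the second digit.
sample = read_sample(TABLE_ROWS, start_row=1, start_digit=1,
                     population_size=600, sample_size=8)
print(sample)
```

Under these assumptions the sketch selects the same eight units as the worked solution.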
Systematic Sampling
Is a probability sampling technique in which a
starting number is selected at random and each
successive unit is then taken systematically
from an ordered list of the population. Every
individual still has an equal chance of being
selected in the sample.
Example
To select a systematic sample of size n = 200 from a
population of size N = 3000, calculate the sampling
interval k = N/n = 3000/200 = 15.
Thus, we select one unit from the first 15 units at
random and every 15th unit thereafter. If the first
unit selected is 13, then the units of analysis
corresponding to the elements 13, 28, 43, 58, and so on
will be included in the sample.
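A minimal Python sketch of this example follows. The function name `systematic_sample` is hypothetical, and the first unit is fixed at 13 to match the example rather than drawn at random.

```python
def systematic_sample(population_size, sample_size, first_unit):
    """Return the unit numbers of a systematic sample."""
    k = population_size // sample_size  # sampling interval k = N / n
    if not 1 <= first_unit <= k:
        raise ValueError("the first unit must lie within the first k units")
    # Take the first unit, then every k-th unit after it.
    return list(range(first_unit, population_size + 1, k))[:sample_size]

units = systematic_sample(population_size=3000, sample_size=200, first_unit=13)
print(units[:4], len(units))
```

In practice the first unit would be drawn at random from 1 to k, for example with `random.randint(1, k)`.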
Stratified Sampling
Is a probability sampling technique whereby
the population is divided into a number of
classes or strata and a sample is obtained by
combining samples drawn independently
from each stratum.
Example 1
Suppose a person conducting a customer
satisfaction survey selects random
customers from each customer type in
proportion to the number of customers of
that type in the population. If a sample of
size 40 is to be selected and 10% of the
customers are managers, 60% are users, 25%
are operators and 5% are database
administrators, then 4 managers, 24 users,
10 operators and 2 administrators would be
randomly selected.
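The proportional allocation in Example 1 can be checked with a short Python sketch. The function name `proportional_allocation` is illustrative, and simple rounding is an assumption that happens to work here because every share gives a whole number.

```python
def proportional_allocation(sample_size, shares_percent):
    """Allocate the sample to strata in proportion to their population share."""
    return {stratum: round(sample_size * pct / 100)
            for stratum, pct in shares_percent.items()}

allocation = proportional_allocation(40, {
    "managers": 10,
    "users": 60,
    "operators": 25,
    "database administrators": 5,
})
print(allocation)
```

With other percentages the rounded counts may not add up to the sample size exactly, so they should be checked and adjusted by hand.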
Example 2
Of the 130 students in year 1, 70 are boys and 60 are
girls. If we were to select a stratified random sample
based on gender, what proportion of our sample
should be boys and what should be girls?
Solution: proportion of boys $= \frac{70}{130} = \frac{7}{13} \approx 53.8\%$, so 53.8% of the
sample should be made up of boys.
Proportion of girls $= \frac{60}{130} = \frac{6}{13} \approx 46.2\%$, so 46.2% of the sample
should be made up of girls.
Cluster Sampling
Is a probability sampling technique whereby
the population is divided into separate
groups called clusters. Then a simple random
sample of clusters is selected from a
population.
Example
Let’s say we want to conduct a study
involving nurses in Lesotho. Instead of
randomly selecting 20% of all nurses in every
hospital in the country, we could randomly
select 20% of the hospitals and take all the
nurses in those hospitals to be part of our
sample.
Summary on Sampling
Methods
Random Sampling: Subjects are selected by random
numbers.
Systematic Sampling: Subjects are selected by using
every kth number after the first subject is randomly
selected from 1 through k.
Stratified Sampling: subjects are selected by dividing
the population into groups (strata), and subjects are
randomly selected within groups.
Cluster Sampling: Subjects are selected by using an
intact group that is representative of the population.
Statistical Errors
Despite efforts to obtain a good sample it is common
to make statistical errors.
There are two main types of errors, namely: Non-
Sampling and Sampling Errors.
Non-Sampling errors
These include all kinds of human errors such as
mistakes in
collecting,
reporting or
analyzing data, like making errors in calculations,
copying data incorrectly and so on.
Sampling errors
Sampling error is the difference between the results
we find by studying the complete population (a
census), and the results we find by studying only a
sample and using that sample to draw conclusions
about the population.
Sampling error can occur in basically two ways: by
chance and by sampling bias.
Class limits   Frequency   Relative frequency (%)
30 - 39        3           0.075 (7.5)
40 - 49        1           0.025 (2.5)
50 - 59        8           0.200 (20)
60 - 69        10          0.250 (25)
70 - 79        7           0.175 (17.5)
80 - 89        7           0.175 (17.5)
90 - 99        4           0.100 (10)
Total          40          1 (100)
Note that the relative frequencies must always add
up to 1(100%).
The table shows that 10% of the investments have a
maturity period of between 90 and 99 days.
Cumulative Frequency
For a given class interval, the cumulative
frequency is the sum of the frequency for
that class and the frequencies of all the
previous classes.
Class limits   Frequency   Cumulative frequency
30 - 39        3           f1
40 - 49        1           f1+f2
50 - 59        8           f1+f2+f3
60 - 69        10          f1+f2+f3+f4
70 - 79        7           f1+f2+f3+f4+f5
80 - 89        7           f1+f2+f3+f4+f5+f6
90 - 99        4           f1+f2+f3+f4+f5+f6+f7
Table 5: The cumulative frequency distribution for days before maturity
for 40 short-term investments
Class limits   Frequency   Cumulative frequency
30 - 39        3           3
40 - 49        1           4
50 - 59        8           12
60 - 69        10          22
70 - 79        7           29
80 - 89        7           36
90 - 99        4           40
Table 5: The cumulative frequency distribution for days before maturity
for 40 short-term investments
Class limits   Frequency   Cumulative   Relative    Relative Cumulative
                           frequency    Frequency   Frequency
30 - 39        3           3            0.075       0.075
40 - 49        1           4            0.025       0.100
50 - 59        8           12           0.200       0.300
60 - 69        10          22           0.250       0.550
70 - 79        7           29           0.175       0.725
80 - 89        7           36           0.175       0.900
90 - 99        4           40           0.100       1.000
Total          40                       1
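The columns of this table can be reproduced from the class frequencies alone. The following Python sketch is one way to do it; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies for the 40 short-term investments (classes 30-39 to 90-99).
frequencies = [3, 1, 8, 10, 7, 7, 4]

total = sum(frequencies)
# Running total of the frequencies gives the cumulative frequency column.
cumulative = list(accumulate(frequencies))
# Dividing by the total gives the relative versions of both columns.
relative = [f / total for f in frequencies]
relative_cumulative = [c / total for c in cumulative]

for f, c, r, rc in zip(frequencies, cumulative, relative, relative_cumulative):
    print(f, c, round(r, 3), round(rc, 3))
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the arithmetic.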
QUALITATIVE FREQUENCY
DISTRIBUTION
Remember: Qualitative data provide non-
numerical measures that categorize (or
classify) individual observations.
The construction of a frequency distribution
for qualitative data is much easier because
the nature of the data provides a
straightforward classification.
Example
A student has completed 20 courses in the
school of business administration. His grades
in the 20 courses are shown below:
A B A B C C C B B B
B A B B B C B C B A
Construct a frequency distribution.
Frequency Distribution for
grades.
Grade Frequency
A 4
B 11
C 5
TOTAL 20
Stem and Leaf Plot
A stem and leaf plot is a frequency
distribution which carries all the individual
values in the raw data.
It is constructed by breaking up every data
value into two components, a stem (usually
the entry’s leftmost digits) and a leaf (
usually the rightmost digit). For the number
173, for example, the stem would be “17” and
the leaf would be “3”.
Data are then classified according to the
values of their stems.
Example
The test scores of 14 individuals on their first
statistics examination are shown below:
95 87 52 43 77 84 78 75 63 92
81 83 91 88
Construct a stem and leaf display for these
data.
• The stems will be the digit(s) to the left and the
leaves will be the digit(s) to the right.
Stem   Leaf         Frequency
4      3            1
5      2            1
6      3            1
7      7 8 5        3
8      7 4 1 3 8    5
9      5 2 1        3
• Now the leaves are rank-ordered horizontally
within each stem, leading to the following stem and
leaf display.
Stem   Leaf         Frequency
4      3            1
5      2            1
6      3            1
7      5 7 8        3
8      1 3 4 7 8    5
9      1 2 5        3
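The ordered display can be built in a few lines of Python. This is a sketch for two-digit scores only, where the stem is the tens digit and the leaf is the units digit; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

scores = [95, 87, 52, 43, 77, 84, 78, 75, 63, 92, 81, 83, 91, 88]

def stem_and_leaf(values):
    """Group sorted two-digit values by stem (tens) with leaves (units)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

display = stem_and_leaf(scores)
for stem, leaves in display.items():
    # Print stem, ordered leaves, and the frequency of the stem.
    print(stem, " ".join(str(leaf) for leaf in leaves), len(leaves))
```

Sorting the values first is what puts the leaves in rank order within each stem.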
Example
Consider the following data for car battery life (in
years).
2.2 4.1 3.5 4.5 3.2 3.7 3.0 2.6 3.4 3.5
1.6 3.1 3.3 3.8 3.1 4.7 3.7 2.5 4.3 4.2
3.4 3.6 2.9 3.3 3.9 3.1 3.3 3.1 3.7 4.4
3.2 4.1 1.9 3.4 4.7 3.8 3.2 2.6 3.9 3.0
Construct a stem and leaf display for these data.
Stem-and-Leaf display of
Battery Life
Stem   Leaf                        Frequency
1      69                          2
2      25669                       5
3      0011112223334445567778899   25
4      11234577                    8
• Disadvantages
• Not very flexible with respect to the choice of the
number of classes.
• Cumbersome when the number of data values is
large
GRAPHICAL REPRESENTATION OF
FREQUENCY DISTRIBUTION
A graph does not replace a table, but complements it
by showing the data’s general structure more clearly.
It is more likely to attract the attention of a casual
observer and reveal trends or relationships that
might be overlooked in a table.
For example, a graph will show the relationship
between two variables, or changes in a variable over
a time period.
In this class we will concentrate on histograms, pie
charts, bar charts, frequency polygons and Ogives.
Histogram
DEF: A histogram is a graph that uses bars to
portray the frequencies or the relative frequencies
of possible outcomes for numerical data. The
horizontal scale represents classes and
the vertical scale represents frequencies.
A rectangle (bar) is drawn above each class
interval with its height corresponding to the
interval’s frequency, relative frequency, or percent
frequency.
In other words, in a histogram the base (horizontal
axis) of each bar corresponds to a class boundary of a
frequency distribution, and heights of the bars
represent the frequency, relative frequency, or
percent frequency associated with each bar.
The use of the class boundaries eliminates the spaces
between the bars to give a solid appearance.
[Figure: Histogram for the number of days before maturity for 40 short-term investments. Horizontal axis: Class Boundaries (39.5, 49.5, ..., 99.5); vertical axis: Frequency; bar heights 3, 1, 8, 10, 7, 7, 4.]
Cumulative Frequency
Curves
We can plot cumulative frequencies against their
corresponding upper class boundaries and join the
points with a smooth curve.
The curve is called the Cumulative Frequency Curve or
the Ogive
We are going to use the data for the number of days
before maturity to demonstrate this.
[Figure: Cumulative Frequency Curve for the number of d… Horizontal axis: Class Boundaries (29.5 to 99.5); vertical axis: cumulative frequency (0 to 50).]
If the relative cumulative frequencies had been used,
we would call the graph above the relative
cumulative frequency curve.
Ogives can be used to obtain certain quantities in the
frequency distribution called quantiles. These are
the median, the quartiles, the deciles and the percentiles.
Median
The median is the central value of the distribution.
The cumulative frequency at the median is 50%,
which means 50% of the data values are smaller than
or equal to the median.
For example, the median of the numbers 24, 18, 21,
17, 19, 12, 14 is 18.
The median of the numbers 24, 18, 21, 17, 14, 10 is
17.5.
Thus, to find the median of a small data set, one
has to arrange the data in increasing order.
Then the median is the middle value if there is an
odd number of observations in the data set.
If the data set consists of an even number of
observations, then the median is the average of the
two middle numbers.
How do we get the median for the number of days to
maturity?
The data consisted of 40 observations. Half of 40 is
20.
Thus, from the point y = 20 on the y-axis, draw a
horizontal line towards the curve and, from the point
where it meets the curve, drop a vertical line down
to the x-axis.
Read the x-coordinate of the point where the vertical
line meets the x-axis.
The value at this point is 67.5;
hence the median is approximately 67.5.
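The graphical reading can be imitated numerically. The Python sketch below assumes the plotted points are joined by straight line segments rather than a smooth curve, so it is an approximation of the ogive, and the function name `read_ogive` is illustrative.

```python
# Class boundaries and the cumulative frequency reached at each boundary.
boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5, 99.5]
cumulative = [0, 3, 4, 12, 22, 29, 36, 40]

def read_ogive(target, xs, cs):
    """Find the x-value at which the cumulative frequency reaches `target`."""
    points = list(zip(xs, cs))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            # Linear interpolation inside the segment that contains the target.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the range of the ogive")

median = read_ogive(20, boundaries, cumulative)  # half of the 40 observations
print(median)
```

Values read from a hand-drawn smooth curve may differ slightly from this straight-line interpolation.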
Quartiles
Quartiles are values that divide a set of
observations into 4 equal parts and they are
normally denoted by Q1, Q2 and Q3.
The lower quartile, Q1, is a value such that one
quarter of all the values lies below Q1, that is,
the relative cumulative frequency at Q1 is 25%.
Deciles
These are points on the x-axis which divide a set of
observations into 10 equal parts. These values
denoted by D1, D2, D3,…, D9 are such that 10% of
the data falls below D1, 20% falls below D2, …, and
90% falls below D9.
Percentiles
These are the points on the sample range which
divide a set of data into 100 equal parts. These
values, denoted by P1, P2, P3,…, P99 are such that
1% of the data falls below P1, 2% falls below P2, …,
and 99% falls below P99
50th percentile = Median = Q2
75th percentile = Q3
20th Percentile = 2nd Decile
Semi-Inter Quartile
Range
This is defined as
\[ \text{S.I.R} = \frac{Q_3 - Q_1}{2} \]
where the interquartile range is
\[ \text{I.R} = Q_3 - Q_1 \]
Relative Cumulative
Curve
A curve obtained by plotting the upper class
boundaries on the x-axis and the relative cumulative
frequencies on the y-axis is called a relative
cumulative frequency curve.
It is much easier to use the relative cumulative
frequency curve to obtain the quartiles than to use
the cumulative frequency curve. We illustrate this
using the number of days before maturity.
Relative Cumulative
Frequency Curve
[Figure: Relative Cumulative Ogive. Horizontal axis: Class Boundaries (29.5 to 99.5); vertical axis: relative cumulative frequency in percent (0 to 120).]
RCFC
To obtain the median from the relative cumulative
frequency curve, draw a horizontal line from the
point 50 on the y-axis towards the curve.
At the point where the line meets the curve, drop a
vertical line to the x-axis.
The point where the line meets the x-axis is the
median.
The quartiles Q1 and Q3 are obtained in a similar
manner by starting at points 25 and 75,
respectively, on the y-axis.
In our example, the median is 67.5,
Q1 = 57.5 and
Q3 = 83.5.
Then
\[ \text{S.I.R} = \frac{Q_3 - Q_1}{2} = \frac{83.5 - 57.5}{2} = 13 \]
Bar Charts
A chart with rectangular bars with lengths
proportional to the values that they represent.
Bars can be plotted vertically or horizontally.
A vertical bar chart is sometimes called a
column bar chart.
There are different types of bar charts,
namely, simple bar chart, comparative bar chart
and component bar chart.
Simple Bar Chart
In simple bar charts, the data is represented by a
series of bars, the height (length) of each bar
indicating the size of the figure represented.
Example
There are 800 students in the School of Business
Administration at National University of Lesotho.
There are four majors in the school: Accounting,
Finance, Management and Marketing. The following
shows the number of students in each major:
Major Number of Students
Accounting 240
Finance 160
Management 320
Marketing 80
Construct a bar chart for the above data.
[Figure: Bar chart for majors in the School of Business Administration. Horizontal axis: Majors (Accounting, Finance, Management, Marketing); vertical axis: Number of students (0 to 350).]
Comparative (Multiple)
Bar Chart
This type of chart shows several variables over the
same time period or a given variable over several
periods.
Two or more bars are grouped together and more
than one set of comparisons can be made. The use of
a key will help distinguish between the categories
Example
Draw a multiple bar chart to represent the imports
and exports of Lesotho (values in M) for the years
1991 to 1995
Years Imports Exports
1991 7930 4620
1992 8850 5225
1993 9780 6150
1994 11720 7340
1995 12150 8145
[Figure: Bar chart for imports and exports of Lesotho for the years 1991-1995. Horizontal axis: Years; vertical axis: Imports & Exports (0 to 14000); key: Imports, Exports.]
Pie Chart
A graphical device for presenting data by
subdividing a circle into sectors that correspond
to the relative frequency of each class.
The angular measurement of a circle (pie) is 360°. For
instance, a sector which has been allocated X units
must receive a portion of the pie of size
\[ \frac{X}{\text{Total observations}} \times 360^\circ \]
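As an illustration, the sector angles for the majors data from the bar chart example can be computed as follows. The variable names are illustrative.

```python
# Number of students per major in the School of Business Administration.
majors = {"Accounting": 240, "Finance": 160, "Management": 320, "Marketing": 80}

total = sum(majors.values())
# Each sector receives X / total of the 360 degrees.
angles = {major: count * 360 / total for major, count in majors.items()}
# The same ratio expressed as a percentage of the pie.
percentages = {major: count * 100 / total for major, count in majors.items()}

print(angles)
print(percentages)
```

The angles should always add up to 360 degrees, which gives a simple check before drawing the chart.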
[Figure: Pie chart for majors in the School of Business Administration. Sectors: Accounting 30%, Finance 20%, Management 40%, Marketing 10%.]
Frequency polygon
It is constructed by plotting class frequencies against
class marks (midpoints) and connecting the
consecutive points by straight lines.
Usually a frequency polygon is a closed figure.
Therefore additional class marks are added at both
ends of the distribution, each with zero frequency.
Example
A supermarket counted the number of items bought
by each customer and recorded the results in the
following table.
Draw a frequency polygon to illustrate these results.
Number of items bought   Number of customers
1-5                      22
6-10                     36
11-15                    52
16-20                    26
21-25                    18
26-30                    6
31-35                    10
Number of items bought   Number of customers   Mid points
1-5                      22                    3
6-10                     36                    8
11-15                    52                    13
16-20                    26                    18
21-25                    18                    23
26-30                    6                     28
31-35                    10                    33
Total                    170
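The points of the frequency polygon can be prepared with a short Python sketch. It assumes equal class widths, and the two extra class marks with zero frequency are obtained by stepping one class width beyond each end; the variable names are illustrative.

```python
# Class limits and frequencies from the supermarket table.
classes = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
customers = [22, 36, 52, 26, 18, 6, 10]

# Class mark (midpoint) = (lower limit + upper limit) / 2.
midpoints = [(low + high) / 2 for low, high in classes]
# Distance between consecutive class marks (equal class widths assumed).
width = classes[1][0] - classes[0][0]

# Close the polygon with one zero-frequency class mark at each end.
points = ([(midpoints[0] - width, 0)]
          + list(zip(midpoints, customers))
          + [(midpoints[-1] + width, 0)])
print(points)
```

Joining these points in order with straight lines gives the closed polygon.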
[Figure: Frequency polygon for items bought per customer. Horizontal axis: Number of Items bought (class marks 3, 8, ..., 33); vertical axis: Number of customers; plotted frequencies 22, 36, 52, 26, 18, 6, 10.]
MEASURES OF LOCATION
OR CENTRAL TENDENCY
We began our study of descriptive statistics by
learning how to
Organize data into tables
Summarize data using graphical displays.
We shall now learn numerical methods of
summarizing data.
We will look at some of the statistical measures
which define in some sense, the centre of a set of
data.
These are called measures of location or measures of
central tendency.
THE SIGMA
NOTATION
So far we have dealt with data in its numerical form
only, such as 21, 19, 17, 18, 20.
In other words we have dealt with specific
realizations of statistical variables of interest. But
when we need to write formulas or general
expressions involving statistical variables we need to
use variable names or symbols instead of numbers.
Thus if we are talking about the ages of five students,
we may use the symbols, X1, X2, X3, X4 and X5 when
we do not yet know their actual values.
This means that X1 represents the age of the first
student, X2 represents the age of the second
student and so on.
So for the numerical values 21, 19, 17, 18, 20
we can have x1 = 21, x2 = 19, x3 = 17, x4 = 18
and x5 = 20.
Now, supposing we have a large data set, say 1000
values. It is clearly inconvenient to always write
down all 1000 variable names.
There are many ways to write down an expression
for the 1000 variables names without writing all of
them.
One such representation is
x1, x2, x3,...,x1000 Equation (1)
The three dots represent the 996 missing symbols
written following the pattern already established by
the first three symbols.
Another way is to write down a typical symbol, say
xi, and then define the range of the subscript i
For example equation 1 can be written as
xi, i=1, 2, 3, ... , 1000 Equation (2)
Any letter can be used instead of x, there is nothing
special about it.
Now suppose we want to write down an expression
for the sum of the five variables, x1, x2, x3, x4 and x5.
We can write the sum as
x1+ x2+ x3+ x4 + x5 Equation (3)
Using the sigma (summation) notation, Equation (3) can be written as
\[ \sum_{j=1}^{5} x_j \]
Equation (4)
Thus Equations (3) and (4) are two ways of writing the
same thing. Therefore, we can equate the two
expressions as
\[ \sum_{j=1}^{5} x_j = x_1 + x_2 + x_3 + x_4 + x_5 \]
Equation (5)
In general,
\[ \sum_{j=1}^{n} x_j \]
is the sum of the quantities xj from j = 1 to j = n, and
j is called the index of summation.
It does not matter what letter you use for the index
of summation as long as the same letter is used for
the summand and the lower limit of the summation.
Thus
\[ \sum_{i=1}^{3} x_i = \sum_{j=1}^{3} x_j = \sum_{k=1}^{3} x_k = x_1 + x_2 + x_3 \]
Examples
If x1 = 2, x2 = 9, x3 = 1 and x4 = 3, then
\[ \sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 2 + 9 + 1 + 3 = 15 \]
\[ \sum_{i=2}^{3} x_i = x_2 + x_3 = 9 + 1 = 10 \]
If x1 = 1, x2 = 1, x3 = 1 and x4 = 1, then
\[ \sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 1 + 1 + 1 + 1 = 4 \]
Since all values of xi are equal to 1, we could
alternatively write the above as
\[ \sum_{i=1}^{4} 1 = 1 + 1 + 1 + 1 = 4 \]
Equation (6)
We can generalize the sum in Equation (6) by putting
as the upper limit an unknown number N in place of
the number 4. Then it is not difficult to see that
\[ \sum_{i=1}^{N} 1 = 1 + 1 + 1 + \dots + 1 = N \]
Equation (7)
Similarly
\[ \sum_{i=1}^{N} c = c + c + c + \dots + c = Nc \]
Equation (8)
But the statement
\[ \sum_{c=1}^{N} c = Nc \]
is not true. Why? It is because now c is the index of
summation and varies from 1 to N. The correct
expression is
\[ \sum_{c=1}^{N} c = 1 + 2 + 3 + 4 + \dots + N \]
Equation (9)
The index c first takes the value 1, then 2, then 3 and
finally it takes the value N.
Suppose now instead of N values we have 100
values. Then
\[ \sum_{c=1}^{100} c = 1 + 2 + 3 + 4 + \dots + 100 = \frac{100 \times 101}{2} = 5050 \]
Equation (10)
In general we have a formula for summing an index
of summation
\[ \sum_{c=1}^{N} c = 1 + 2 + 3 + 4 + \dots + N = \frac{N(N+1)}{2} \]
Equation (11)
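The sums in Equations (7) to (11) are easy to verify numerically. The Python sketch below is only a check; the function name `sum_first` is illustrative.

```python
def sum_first(n):
    """Closed form N(N + 1) / 2 for the sum 1 + 2 + ... + N."""
    return n * (n + 1) // 2

N = 100
c = 7  # an arbitrary constant chosen for this illustration

sum_of_ones = sum(1 for _ in range(1, N + 1))       # Equation (7): equals N
sum_of_constant = sum(c for _ in range(1, N + 1))   # Equation (8): equals N * c
sum_of_index = sum(range(1, N + 1))                 # Equations (9) to (11)

print(sum_of_ones, sum_of_constant, sum_of_index, sum_first(N))
```

Note that Python's `range(1, N + 1)` runs from 1 to N inclusive, matching the lower and upper limits of the summation.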
A sum can be represented in different ways if we
appropriately adjust the lower limit, the upper
limit and the index of summation. For example
\[ \sum_{i=1}^{3} x_i = \sum_{i=0}^{2} x_{i+1} = \sum_{i=2}^{4} x_{i-1} = x_1 + x_2 + x_3 \]
With i = 1, 2, 3:
\[ \sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 \]
With i = 0, 1, 2:
\[ \sum_{i=0}^{2} x_{i+1} = x_{(0+1)} + x_{(1+1)} + x_{(2+1)} = x_1 + x_2 + x_3 \]
With i = 2, 3, 4:
\[ \sum_{i=2}^{4} x_{i-1} = x_{(2-1)} + x_{(3-1)} + x_{(4-1)} = x_1 + x_2 + x_3 \]
Summation of Squares and
Cross Products
Consider measurements of two variables of interest,
say weight and height.
Let these variables be denoted by x and y
respectively.
Suppose five measurements of each variable are
taken and these are represented as x1, x2, x3, x4 and x5
for weights and y1, y2, y3, y4 and y5 for heights.
Then
\[ \sum_{j=1}^{5} x_j^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 \]
\[ \sum_{j=1}^{5} y_j^2 = y_1^2 + y_2^2 + y_3^2 + y_4^2 + y_5^2 \]
\[ \sum_{j=1}^{5} x_j y_j = x_1 y_1 + x_2 y_2 + x_3 y_3 + x_4 y_4 + x_5 y_5 \]
Example:
Let x1 = 6, x2 = 1, x3 = 2, y1 = 5, y2 = 3 and y3 = 4. Evaluate
the following sums:
\[ \sum_{i=1}^{3} x_i^2 \quad \text{and} \quad \sum_{i=1}^{3} x_i y_i \]
This is done as follows:
\[ \sum_{i=1}^{3} x_i^2 = x_1^2 + x_2^2 + x_3^2 = 6^2 + 1^2 + 2^2 = 36 + 1 + 4 = 41 \]
\[ \sum_{i=1}^{3} x_i y_i = x_1 y_1 + x_2 y_2 + x_3 y_3 = (6)(5) + (1)(3) + (2)(4) = 30 + 3 + 8 = 41 \]
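The two sums in this example can be checked with a few lines of Python. The variable names are illustrative.

```python
x = [6, 1, 2]
y = [5, 3, 4]

# Sum of squares: square each x value, then add.
sum_of_squares = sum(xi ** 2 for xi in x)
# Sum of cross products: multiply matching x and y values, then add.
sum_of_cross_products = sum(xi * yi for xi, yi in zip(x, y))

print(sum_of_squares, sum_of_cross_products)
```

That both sums equal 41 is a coincidence of these particular values. In general the sum of squares and the sum of cross products differ.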
Operational Rules for
Summation
There are three basic operational rules which help to
simplify the use of the sigma notation.
1. For any integer N
\[ \sum_{i=1}^{N} (x_i + y_i) = \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i \]
2. If c is a constant, that is, does not depend on the
index of summation i, then
\[ \sum_{i=1}^{N} c x_i = c \sum_{i=1}^{N} x_i \]
and
\[ \sum_{i=1}^{N} c = Nc \]
DEFINITION OF THE
MEASURES OF LOCATION
The graphical representation of data gives us an
idea of the shape of the distribution (population).
The mean, median and mode which we shall define
in this section provide estimates for the centre of the
distribution
The mean (or the arithmetic mean, to give its full
title) can be defined as the value which each item in
the distribution would have if all the values were
shared out equally among the items.
For instance, if three people had M2, M3 and M7
respectively, the mean amount would be M4, i.e.
M12 shared equally between the three people.
Given a set of n values x1, x2, ..., xn, the arithmetic
mean is defined as
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \]
Equation (5.1)
If the n values are a sample then the arithmetic mean
is also called the sample mean. Equation 5.1 can
also be written as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Equation (5.2)
Example
A food inspector examined a random sample of 7 cans of
a certain brand of tuna to determine the percent of
foreign impurities.
The following data were recorded: 1.8, 2.1, 1.7, 1.6, 0.9,
2.7 and 1.8. Compute the sample mean.
The sample mean is
\[ \bar{x} = \frac{1.8 + 2.1 + 1.7 + 1.6 + 0.9 + 2.7 + 1.8}{7} = 1.8\% \]
Equation (5.3)
Median
We define the median as the middle value of a set of
ungrouped data. For grouped data, the median
is the point which splits the frequency distribution in
such a way that 50% of the values are smaller than
that point.
Example
The nicotine contents for a random sample of 6
cigarettes of a certain brand are found to be 2.3, 2.7,
2.5, 2.9, 3.1 and 1.9 milligrams. Find the median
If we arrange these nicotine contents in increasing
order of magnitude, we get
1.9 2.3 2.5 2.7 2.9 3.1
and the median is then the mean of 2.5 and 2.7.
Therefore the median would be
\[ \tilde{x} = \frac{2.5 + 2.7}{2} = 2.6 \text{ milligrams} \]
Mode
We define the mode as the sample value with the
highest frequency. In other words, the mode can be
defined as the point of maximum frequency density.
Example
The number of incorrect answers on a true-false
competency test for a random sample of 15 students
were recorded as follows:
2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4 and 2.
The mode for these data is 3
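The three worked examples above (tuna impurities, nicotine contents, incorrect answers) can be checked with Python's standard `statistics` module. This is a sketch; the variable names are illustrative.

```python
import statistics

impurities = [1.8, 2.1, 1.7, 1.6, 0.9, 2.7, 1.8]              # percent
nicotine = [2.3, 2.7, 2.5, 2.9, 3.1, 1.9]                      # milligrams
incorrect = [2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4, 2]      # answers

sample_mean = statistics.mean(impurities)    # sum of values divided by n
sample_median = statistics.median(nicotine)  # average of the two middle values
sample_mode = statistics.mode(incorrect)     # most frequent value

print(round(sample_mean, 1), round(sample_median, 1), sample_mode)
```

The functions sort the data internally where needed, so the values do not have to be arranged in increasing order first.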
Bi-modal data:
In a case where we have two modes, the
data is said to be bi-modal.
Geometric Mean and
Harmonic Means
There are two other means apart from the arithmetic
mean though not used as often as the other averages.
These are the geometric mean and the harmonic mean.
The Geometric mean of a set of n observations x1, x2,
..., xn is defined as
\[ \bar{x}_G = (x_1 \cdot x_2 \cdots x_n)^{1/n} \]
For example, for the two values 6 and 8,
\[ \bar{x}_G = (6 \times 8)^{1/2} = 6.9282 \]
Harmonic Mean
The Harmonic Mean of a set of observations x1, x2, ..., xn
is defined by
\[ \bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}} \]
It is most frequently used in averaging speeds for
various distances covered, where the distance
remains constant.
Examples
The following examples illustrate the typical use of
the harmonic mean:
1. A car travels from point A to B at an average speed
of 60km/h and returns at an average speed of
40km/h. What is the average speed for the entire
journey?
2. Three Basotho athletes took part in the Comrades
marathon in May this year. Their average speeds
were recorded as 15km/h, 20km/h and 10km/h.
What was the average speed of the Lesotho athletes?
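1.
\[ \bar{x}_H = \frac{2}{\sum_{i=1}^{2} \frac{1}{x_i}} = \frac{2}{\frac{1}{60} + \frac{1}{40}} = 48 \text{ km/h} \]
Therefore, the average speed of the entire journey is
48 km/h.
2.
\[ \bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{3}{\frac{1}{15} + \frac{1}{20} + \frac{1}{10}} = 13.846 \text{ km/h} \]
Therefore, the average speed for the Lesotho athletes is
13.846 km/h.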
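Both averages are available in Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later). The sketch below reproduces the worked values, including the geometric mean of 6 and 8; the variable names are illustrative.

```python
import statistics

# Harmonic mean: n divided by the sum of the reciprocals.
journey_speed = statistics.harmonic_mean([60, 40])        # km/h, there and back
athletes_speed = statistics.harmonic_mean([15, 20, 10])   # km/h, three athletes

# Geometric mean: n-th root of the product of the n values.
geometric = statistics.geometric_mean([6, 8])

print(round(journey_speed, 3), round(athletes_speed, 3), round(geometric, 4))
```

The arithmetic mean of 60 and 40 would be 50 km/h, which overstates the average speed because more time is spent on the slower leg.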
CALCULATION OF
AVERAGES FOR GROUPED
DATA
Suppose that instead of a raw set of data we have a
grouped frequency distribution, how do we calculate
the measures of average?
Consider the following general frequency
distribution with k classes and with a total of
observations equal to n:
Table 5.1: A general Frequency Distribution
Class   Class Mark   Frequency   Product
1       x1           f1          x1f1
2       x2           f2          x2f2
3       x3           f3          x3f3
.       .            .           .
.       .            .           .
.       .            .           .
k       xk           fk          xkfk
Total                $\sum_{i=1}^{k} f_i = n$   $\sum_{i=1}^{k} x_i f_i$
Example
Class limits   Frequency   Class Marks   Product
30 - 39        3           34.5          103.5
40 - 49        1           44.5          44.5
50 - 59        8           54.5          436
60 - 69        10          64.5          645
70 - 79        7           74.5          521.5
80 - 89        7           84.5          591.5
90 - 99        4           94.5          378
Total          40                        2720
The sample mean of the above frequency
distribution is defined as
\[ \bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i} = \frac{1}{n} \sum_{i=1}^{k} x_i f_i = \frac{2720}{40} = 68 \]
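The grouped mean can be reproduced in Python from the class marks and frequencies. This is a sketch; the variable names are illustrative.

```python
# Class marks and frequencies for the days-before-maturity distribution.
class_marks = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
frequencies = [3, 1, 8, 10, 7, 7, 4]

# Product column: class mark times frequency for each class.
products = [x * f for x, f in zip(class_marks, frequencies)]
n = sum(frequencies)

# Grouped mean: total of the products divided by the total frequency.
grouped_mean = sum(products) / n
print(sum(products), n, grouped_mean)
```

The result is an approximation to the mean of the raw data, because every observation in a class is treated as if it were equal to the class mark.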
Calculation of a
weighted average
If frequencies fi, i = 1, 2, 3, ..., k are replaced by
numbers wi, i = 1, 2, 3, ..., k called weights, whose
values represent the relative importance of the
classes or variables, then the resulting mean is
called a weighted average.
It is used to find the mean of a data set whose values
are not equally represented.
It is defined as
\[ \bar{x}_w = \frac{\sum_{i=1}^{k} x_i w_i}{\sum_{i=1}^{k} w_i} \]
The mode for grouped data is calculated using the
following formula
\[ x_{\text{mode}} = l + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w \]
where l = the lower class boundary of the modal class,
Δ1 = the frequency of the modal class minus the
frequency of the class immediately before the modal
class, $(f_m - f_{m-1})$,
Δ2 = the frequency of the modal class minus the
frequency of the class immediately after the modal
class, $(f_m - f_{m+1})$, and
w is the class width.
Example
i       Class       Frequency, fi
1       1.5 - 1.9   3
2       2.0 - 2.4   10
3       2.5 - 2.9   18
4       3.0 - 3.4   10
5       3.5 - 3.9   7
6       4.0 - 4.4   2
Total               50
Continue…
We calculate the mode of the frequency distribution
given in the above table. The modal class is 2.5 – 2.9
since it is the class with the highest frequency.
Thus l = 2.45, Δ1 = 18-10 = 8,
Δ2 = 18 - 10 = 8 and w = 0.5
Hence
Continue…
$$x_{mode} = 2.45 + \frac{8}{8 + 8} \times 0.5 = 2.45 + \frac{4}{16} = 2.7$$
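The mode calculation above can be sketched in Python. The function name `grouped_mode` is illustrative. Treating a missing neighbouring class as having frequency 0 when the modal class is the first or last class is an assumption of this sketch, not something stated in these notes.

```python
def grouped_mode(lower_boundaries, frequencies, width):
    """Mode of grouped data: l + d1 / (d1 + d2) * w."""
    # Index of the modal class (the class with the highest frequency)
    m = max(range(len(frequencies)), key=lambda i: frequencies[i])
    # Assumption: a missing neighbouring class counts as frequency 0
    f_before = frequencies[m - 1] if m > 0 else 0
    f_after = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    d1 = frequencies[m] - f_before
    d2 = frequencies[m] - f_after
    return lower_boundaries[m] + d1 / (d1 + d2) * width


boundaries = [1.45, 1.95, 2.45, 2.95, 3.45, 3.95]
freqs = [3, 10, 18, 10, 7, 2]

mode = grouped_mode(boundaries, freqs, 0.5)
print(round(mode, 4))
```

For the example distribution the modal class is 2.5 - 2.9, and the result agrees with the hand calculation of 2.7.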
Median
The median for grouped data is calculated using the
following formula
$$x_{median} = l + \frac{\frac{n}{2} - F}{f} \times w$$
Where l = the lower class boundary of the median
class,
n = total frequency,
F = the cumulative frequency of the previous class,
f = the frequency of the median class and
w = the class width
Continue
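i       Class       Class Mark, xi   Frequency, fi   Cumulative Frequency
1       1.5 - 1.9   1.7              3               3
2       2.0 - 2.4   2.2              10              13
3       2.5 - 2.9   2.7              18              31
4       3.0 - 3.4   3.2              10              41
5       3.5 - 3.9   3.7              7               48
6       4.0 - 4.4   4.2              2               50
Total                                50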
Continue…
$$x_{median} = 2.45 + \frac{\frac{50}{2} - 13}{18} \times 0.5 = 2.45 + (0.6667 \times 0.5) = 2.45 + 0.3333 = 2.7833$$
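The median calculation can be sketched the same way. The function name `grouped_median` is illustrative. The sketch takes the median class to be the first class whose cumulative frequency reaches n/2, which matches the worked example (l = 2.45, F = 13, f = 18).

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: l + (n/2 - F) / f * w."""
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cumulative = 0  # F: cumulative frequency of the classes before this one
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("no median class found")


boundaries = [1.45, 1.95, 2.45, 2.95, 3.45, 3.95]
freqs = [3, 10, 18, 10, 7, 2]

median = grouped_median(boundaries, freqs, 0.5)
print(round(median, 4))
```

The result agrees with the hand calculation of about 2.7833.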
MEASURES OF DISPERSION OR VARIATION
A measure of variation is a way of indicating how
dispersed a set of observations is.
We need to know whether observations are close
together or well spread out.
It is quite possible to have two sets of observations
with the same mean or median that differ
considerably in the variability of their measurements
about the average.
Example
Consider the following measurements, in liters, for
two samples of orange juice bottled by companies A
and B.
Sample A 97 100 94 103 106
Sample B 106 101 88 91 114
Both samples have the same mean, 100. But it is quite
clear that company A bottled orange juice with more
uniform content than company B. We say that the
variability or dispersion of the observations from the
average is less for sample A than for sample B.
Continue…
In this section we are going to discuss the following
common measures of dispersion (variation):
range,
mean deviation,
variance and standard deviation, and
coefficient of variation.
The Range
The range is the simplest measure of dispersion.
The range for a set of data is defined as the difference
between the largest and the smallest value.
The range of 19, 19, 19, 19, 19 is zero since largest
value - smallest value = 19 - 19 = 0.
The range of 19, 18, 17, 19, 18, 20 is 20 - 17 = 3.
The Mean Deviation
The mean deviation is defined to be the arithmetic
mean of the absolute deviations from the mean.
More precisely, if x1, x2, ..., xn is a sample of n
values, the mean deviation is defined as
$$D = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
Continue…
Where $\bar{x}$ is the sample mean. The number $|x_i - \bar{x}|$ is
called the absolute deviation of the ith value from the
sample mean.
Example
The mean deviation of the sample 19, 19, 19, 19, 19 is
0.
The mean deviation of the sample 18, 21, 20, 17, 19 is
$$D = \frac{|18 - 19| + |21 - 19| + |20 - 19| + |17 - 19| + |19 - 19|}{5} = 1.2$$
Continue…
since $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 19$.
Thus, as expected, there is more variability in the sample 18,
21, 20, 17, 19 than in the sample 19, 19, 19, 19, 19.
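The two mean-deviation results can be checked with a short Python sketch. The function name `mean_deviation` is an illustrative choice.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the sample mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


d_constant = mean_deviation([19, 19, 19, 19, 19])
d_varied = mean_deviation([18, 21, 20, 17, 19])

print(d_constant)
print(round(d_varied, 4))
```

The first sample gives 0 and the second gives 1.2, matching the hand calculation.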
Mean deviation for grouped data
The mean deviation for grouped data with class
marks xi and frequencies fi is calculated using the
following formula:
$$D = \frac{1}{n} \sum_{i=1}^{k} f_i |x_i - \bar{x}|$$
Example
Calculate the mean deviation of the following
distribution:
Class limits   Frequency, fi
10-14          3
15-19          5
20-24          7
25-29          4
30-34          2
Total          21
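Example
Using the class marks, the mean is $\bar{x} = 447 / 21 = 21.2857$.
Class limits   Frequency, fi   xi   xi*fi   $|x_i - \bar{x}|$   $|x_i - \bar{x}| * f_i$
10-14          3               12   36      9.2857              27.8571
15-19          5               17   85      4.2857              21.4286
20-24          7               22   154     0.7143              5.0000
25-29          4               27   108     5.7143              22.8571
30-34          2               32   64      10.7143             21.4286
Total          21                   447                         98.5714
$$D = \frac{98.5714}{21} = 4.6939$$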
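The grouped mean deviation for the distribution in this example can be computed with the sketch below. The function name `grouped_mean_deviation` and the use of class-limit midpoints as class marks are assumptions made for the illustration.

```python
def grouped_mean_deviation(class_limits, frequencies):
    """Mean deviation for grouped data: (1/n) * sum of f_i * |x_i - mean|."""
    # Class mark = midpoint of the lower and upper class limits
    marks = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(marks, frequencies)) / n
    deviation = sum(f * abs(x - mean) for x, f in zip(marks, frequencies)) / n
    return mean, deviation


limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]
freqs = [3, 5, 7, 4, 2]

mean, d = grouped_mean_deviation(limits, freqs)
print(round(mean, 4))
print(round(d, 4))
```

The computed mean should match the value 21.2857 quoted in the example.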
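The Variance and Standard Deviation
The variance of a sample of n values is defined as
$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Or
$$S^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2$$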
Continue…
And the standard deviation is defined as the square
root of the variance, that is,
$$S = \sqrt{S^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Or
$$S = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2}$$
Example
Clearly the standard deviation of 19, 19, 19, 19, 19 is
zero.
To calculate the standard deviation of 18, 21, 20, 17,
19, we first find the mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{18 + 21 + 20 + 17 + 19}{5} = 19$$
Continue…
The table below helps with the other calculations.
$x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
18      -1                1
21      2                 4
20      1                 1
17      -2                4
19      0                 0
Continue…
$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{5}(1 + 4 + 1 + 4 + 0) = 2.0$$
Hence
$$S = \sqrt{2} = 1.414$$
Continue…
Equivalently we can use the other formula
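$$
\begin{aligned}
S^2 &= \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 \\
    &= \frac{1}{5}\left(18^2 + 21^2 + 20^2 + 17^2 + 19^2\right) - 19^2 \\
    &= \frac{1}{5}(324 + 441 + 400 + 289 + 361) - 361 \\
    &= \frac{1}{5}(1815) - 361 \\
    &= 363 - 361 = 2
\end{aligned}
$$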
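Both variance formulas can be compared with a short Python sketch. The function names are illustrative. Note that the divisor is n, following the definition used in these notes, rather than the n - 1 divisor that many statistics libraries use by default.

```python
import math


def variance_definition(values):
    """(1/n) * sum of squared deviations from the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_shortcut(values):
    """(1/n) * sum of squares, minus the squared mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2


sample = [18, 21, 20, 17, 19]
v1 = variance_definition(sample)
v2 = variance_shortcut(sample)
sd = math.sqrt(v1)

print(v1, v2, round(sd, 3))
```

Both functions give a variance of 2 for this sample, and the standard deviation rounds to 1.414.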
Continue….
The variance for grouped data with class marks xi and
frequencies fi is calculated using the formula
$$S^2 = \frac{1}{n} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2$$
Continue…
An alternative formula for the variance for grouped
data is:
$$S^2 = \frac{1}{n} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i=1}^{k} f_i x_i^2 - \bar{x}^2$$
Example
Find an estimate of the variance and standard
deviation of the following data for the marks
obtained in a test of 88 students.
Marks (x)   Frequency (fi)
0-9         6
10-19       16
20-29       24
30-39       25
40-49       17
Continue…
The mean is $\bar{x} = 2466 / 88 = 28.0227$.
Class     Frequency, fi   Class Mark, xi   xi*fi   $x_i - \bar{x}$   $(x_i - \bar{x})^2$   $f_i (x_i - \bar{x})^2$
0 - 9     6               4.5              27      -23.5227          553.3174              3319.9044
10 - 19   16              14.5             232     -13.5227          182.8634              2925.8144
20 - 29   24              24.5             588     -3.5227           12.4094               297.8256
30 - 39   25              34.5             862.5   6.4773            41.9554               1048.885
40 - 49   17              44.5             756.5   16.4773           271.5014              4615.5238
Total     88                               2466                                            12207.9532
$$S^2 = \frac{1}{n} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{88}(12207.9532) = 138.727$$
Continue…
$$S = \sqrt{\frac{1}{n} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2} = \sqrt{138.727} = 11.778$$
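The grouped variance and standard deviation for the test marks can be checked with the sketch below. The function name `grouped_variance` and the use of class-limit midpoints as class marks are illustrative assumptions. The divisor is n, as in the formula above.

```python
import math


def grouped_variance(class_limits, frequencies):
    """Variance of grouped data: (1/n) * sum of f_i * (x_i - mean)^2."""
    # Class mark = midpoint of the lower and upper class limits
    marks = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(marks, frequencies)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(marks, frequencies)) / n
    return mean, variance


limits = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49)]
freqs = [6, 16, 24, 25, 17]

mean, variance = grouped_variance(limits, freqs)
sd = math.sqrt(variance)

print(round(mean, 4), round(variance, 3), round(sd, 3))
```

Up to rounding, the output agrees with the hand-calculated mean 28.0227, variance 138.727 and standard deviation 11.778.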
Coefficient of Variation
The coefficient of variation (CV) or relative
standard deviation (RSD) is the sample standard
deviation expressed as a percentage of the mean.
i.e.
$$CV = \frac{s}{\bar{x}} \times 100\%$$
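As an illustration, the coefficient of variation can be computed for the two orange juice samples from the start of this section. This is a sketch only: the function name is illustrative, and the standard deviation uses the divisor n, as in the variance definition given in these notes.

```python
import math


def coefficient_of_variation(values):
    """Standard deviation (divisor n) as a percentage of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sd / mean * 100


sample_a = [97, 100, 94, 103, 106]
sample_b = [106, 101, 88, 91, 114]

cv_a = coefficient_of_variation(sample_a)
cv_b = coefficient_of_variation(sample_b)

print(round(cv_a, 2), "%")
print(round(cv_b, 2), "%")
```

Both samples have mean 100, so the smaller CV for sample A reflects its smaller spread.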
Measures of Association
In this topic, we introduce methods for investigating
relationships between two statistical variables.
Given sample values of two variables, we need
measures based on these samples that tell us
whether there is any association between the
variables.
In other words, bivariate analysis examines the way
in which the characteristics of one variable are
associated with the characteristics of another
variable.
Continue…
The following are some examples of questions open
to bivariate analysis:
Is educational attainment associated with race?
Is drug use associated with income?
Does religious affiliation vary by geographical
location?
Is crime associated with concentrated poverty?
Continue…
All these questions involve comparing two variables
to see if there is an association of one variable with
the other.
In this chapter, we will learn how to determine the
extent to which two variables are associated with one
another. We will focus only on relationships between
continuous variables, which involves calculating
correlations and a simple introduction to regression.
Correlations
Correlations are designed to measure the strength of
the relationship between two continuous variables.
Generally, when social scientists discuss correlation
they are referring to Pearson's correlation
coefficient.
Pearson's correlation coefficient lies between
-1 and +1. The closer the correlation is to
either -1 or +1, the stronger the linear relationship
between the two variables.
A correlation of 0 indicates that there is no linear
association between the two variables.
Negative and positive signs indicate the direction of
the relationship.
Continue…
POSITIVE CORRELATION: an increase in values
for one variable is associated with an increase in
the values for the other variable, for example,
as height increases so does shoe size.
NEGATIVE CORRELATION: an increase in
values for one variable is associated with a
decrease in values for the other variable, for
example, as temperature decreases, the use of
electricity for heating increases.
Continue…
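$$r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$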
Example
Here are the numbers of hours that 10 students spent
studying for a final exam (x) and their scores on that
exam (y).
Hrs 7 8 4 9 13 5 9 6 16 3
Score 70 76 57 77 91 66 82 64 96 50
Calculate the correlation coefficient r for these data.
Solution
x       y     xy     $x^2$   $y^2$
7       70    490    49      4900
8       76    608    64      5776
4       57    228    16      3249
9       77    693    81      5929
13      91    1183   169     8281
5       66    330    25      4356
9       82    738    81      6724
6       64    384    36      4096
16      96    1536   256     9216
3       50    150    9       2500
Total   80    729    6340    786     55027
Continue…
For our formula, $\sum x = 80$, $\sum y = 729$, $\sum xy = 6340$,
$\sum x^2 = 786$ and $\sum y^2 = 55027$.
Continue…
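$$
\begin{aligned}
r &= \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} \\
  &= \frac{6340 - \frac{(80)(729)}{10}}{\sqrt{\left(786 - \frac{80^2}{10}\right)\left(55027 - \frac{729^2}{10}\right)}} \\
  &= 0.969
\end{aligned}
$$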
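The correlation calculation can be reproduced with a short Python sketch that follows the same computational formula. The function name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation using the sums-based computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    )
    return numerator / denominator


hours = [7, 8, 4, 9, 13, 5, 9, 6, 16, 3]
scores = [70, 76, 57, 77, 91, 66, 82, 64, 96, 50]

r = pearson_r(hours, scores)
print(round(r, 3))
```

The result rounds to 0.969, a strong positive correlation between hours studied and exam score.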
Linear Regression
If we can determine that there is a linear correlation
between two variables, then the relationship between
those variables can be described graphically by a
straight line.
In this section we learn how to find the equation of
the line that best fits a set of data.
We will go on to use that equation to predict the
value of one of the variables for a particular value of
the other variable.
A regression line is a line that best fits a set of data.
Continue…
The general formula of a regression line is $\hat{y} = a + bx$.
Here $\hat{y}$, which is read as "y hat," is
the predicted value of y for a given value of x. The
slope of the line is b, and we calculate it first. We
then use the value of b to help calculate a, which is
the y-intercept of the line.
Continue…
Here are the formulas to calculate a and b.
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$
$$a = \bar{y} - b\bar{x}$$
Example
Here are the scores of five randomly selected
students on test 1 and 2 in a Statistics class. Find the
equation of the regression line treating the score on
test 1 as x and the score on test 2 as y.
Student   Test 1 score   Test 2 score
1         83             82
2         86             84
3         76             63
4         92             83
5         71             55
Solution
We begin by creating a scatter plot, to make sure that
the data seem to have a linear relationship.
Continue…
x       y     xy      $x^2$
83      82    6806    6889
86      84    7224    7396
76      63    4788    5776
92      83    7636    8464
71      55    3905    5041
Total   408   367     30359   33566
Continue…
So
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \frac{5(30359) - (408)(367)}{5(33566) - 408^2} = \frac{2059}{1366} = 1.5073$$
The slope of the regression line is 1.5073.
This tells us that for each one-point increase on test 1, the
predicted score on test 2 increases by about 1.5073 points.
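The slope calculation can be checked with the Python sketch below, which also evaluates the intercept from $a = \bar{y} - b\bar{x}$. The function name `least_squares_line` is illustrative, and the y values are the ones used in the working table (sum of y = 367).

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the regression line y_hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first, then the intercept from the two sample means
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


test1 = [83, 86, 76, 92, 71]
test2 = [82, 84, 63, 83, 55]

a, b = least_squares_line(test1, test2)
print(f"slope b = {b:.4f}")
print(f"intercept a = {a:.4f}")
```

The slope agrees with 2059 / 1366, and the fitted line can then be used to predict a test 2 score from a test 1 score by evaluating a + b*x.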
Continue…
Now we calculate the y-intercept a.
$$\bar{x} = \frac{\sum x}{n} = \frac{408}{5} = 81.6 \qquad \bar{y} = \frac{\sum y}{n} = \frac{367}{5} = 73.4$$