
ST1381 Elementary Statistics PDF

This document provides an overview of an introductory statistics course. It outlines that the course is meant for non-statistics majors without strong mathematical backgrounds. It then provides the course code, instructor details, timetable for different student groups, ground rules, and introduces key concepts in statistics including populations, samples, parameters, statistics, and the fields of descriptive and inferential statistics.

This is an introductory course meant for non-statistics majors.
It is for people who have a limited mathematical background
or no mathematical background at all.
ST1381

• Lecturer: Mr. M. Malunga
• Office: BTM220
TIME TABLE

• GROUP A – EDUCATION & HUMANITIES
• THURSDAY 1600HRS
• FRIDAY 1100HRS
• FRIDAY 1200HRS

• GROUP B – SOCIAL SCIENCES, HEALTH SCIENCES & LAW
• MONDAY 1300HRS
• WEDNESDAY 1500HRS
• WEDNESDAY 1600HRS
GROUND RULES

• ATTENDING LECTURES
• WRITING TESTS/EXAMS (Proof of Registration, ID)
• USE OF PENCILS
• CELLPHONES (Silent mode/Off)
• CONSULTATION HOURS
• WRITING CLEARLY
What is statistics?

The word statistics has two basic meanings:
1) Statistics as a subject.
2) The actual numbers resulting from an
analysis, such as the birth rate in a country,
rainfall, goalkeeper’s success rate, etc.
Continue…

Statistics as a subject may be defined as the
science of conducting studies to collect,
organize, summarize, analyze, and draw
conclusions from the data.
The subject of statistics, therefore, involves
planning an experiment, obtaining the
relevant data, analyzing the data obtained,
and interpreting and drawing conclusions
from the data.
Why study statistics?

Like professional people, you must be
able to read and understand the
various statistical studies performed in
your fields. To have this
understanding, you must be
knowledgeable about the vocabulary,
symbols, concepts, and statistical
procedures used in these studies.
Continue…

You may be called on to conduct research
in your field, since statistical procedures are
basic to research.
To accomplish this, you must be able to
design experiments; collect, organize, analyze
and summarize data; and possibly make
reliable predictions or forecasts for future use.
You must also be able to communicate the
results of the study in your own words.
Basic concepts in
statistics

 Population: is the complete collection of individuals
(scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it
includes all individuals to be studied.
 Parameter: is a statistical measure (such as average)
computed from the entire universe /population. The
numerical value will not change if the universe does
not change.
Examples

The average age at the time of admission for
all students who have ever attended NUL.
The proportion of students who were older
than 21 years of age when they entered NUL.
Continue…

 Sample: it is a group of subjects selected from a
population.
 Statistic: is a statistical measure (such as average)
computed from the sample.
 Example: The average height, found by using the set
of 25 heights.
Continue...

• Data: are the values (measurements or observations) that
the variables can assume. Raw data refers to
unprocessed data, that is, data in its original form.
• Statistical Data: numbers or facts from which
conclusions can be drawn.
• Data Set: is a collection of all the data values related
to a particular study.
• Each value in a data set is called a data value or datum.
This value may be a number, a word, or a symbol.
Example…

Mr. Thabo entered college at age 23.
His hair colour is black.
He weighs 183 pounds.
He is 71 inches tall
FIELDS OF STATISTICS

There are two fields of statistics, namely
descriptive and inferential statistics.
1) Descriptive Statistics: consists of methods
for organizing and summarizing information
in a clear and effective way. It includes the
construction of graphs, charts and tables and
the calculation of descriptive measures such
as averages & percentiles
Example

A study that was conducted in 2013 showed
that 48% of the students who sat for the ST1381
exam passed. The value 48% is a value in the
field of descriptive statistics.
If 300 out of 350 people living in Maseru
have the last name Pen, we can go on to say
that 86% of those 350 people have
the last name Pen. The number 86%
describes the set of data that we have.
Continue....

2) Inferential Statistics: consists of methods
for drawing conclusions about the
population based on the sample.

Statistical inference is always associated with
generalizations that are subject to
uncertainty, since we are dealing with only
partial information obtained from a subset of
the data of interest.
Example

 Academic records during the past five years at NUL
show that 80% of the entering freshmen eventually
graduate. If you are a member of the present
freshmen class and conclude from this study that
your chances of graduating are better than 80%, you
have made a statistical inference that is subject to
uncertainty.
Example

• Telkom Lesotho conducted a study to estimate the
percentage of Lesotho residents who own mobile
phones. Data were obtained from all the residents of
Maseru, Leribe and Mokhotlong. Based on the data,
it was concluded that 40% of Lesotho residents have
mobile phones.
• Identify the population and the sample
• Is the conclusion descriptive or inferential?
Solution

• Population: All Lesotho residents
• Sample: Residents of Maseru, Leribe and
Mokhotlong
• Inferential
• A further study of the data obtained above revealed
that 55% of the residents in the three districts
considered in the study have mobile phones. Is the
conclusion descriptive or inferential?
Example

• During the last week, Tony Gwynn of the San Diego
Padres recorded the following number of hits.
Day:   Sun  Mon  Tues  Wed  Thurs  Fri  Sat
Hits:    2    1     4    3      0    3    1

• Which of the following conclusions can be obtained
purely by descriptive methods, and which can only
be obtained by inferential methods?
Continue...

i. Tony will never have more than 4 hits in a
game.
ii. Tony had 0 hits on Thursday because he used a
bat that belonged to another player.
iii. During the last week, Tony averaged 2 hits per
game.
iv. Tony is a better hitter than any other baseball
player.
v. Tony had the same total number of hits in the
first 3 games as he did in the last 4 games.
Solution

i. Inferential
ii. Inferential
iii. Descriptive
iv. Inferential
v. Descriptive
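
The two descriptive conclusions (iii and v) can be checked directly from the week's data. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative only.

```python
# Hits recorded during the week, in the order given in the example.
hits = {"Sun": 2, "Mon": 1, "Tues": 4, "Wed": 3, "Thurs": 0, "Fri": 3, "Sat": 1}

values = list(hits.values())

# Statement iii: average number of hits per game over the week.
mean_hits = sum(values) / len(values)

# Statement v: total for the first 3 games versus the last 4 games.
first_three = sum(values[:3])
last_four = sum(values[3:])

print(mean_hits)               # 2.0
print(first_three, last_four)  # 7 7
```

Statements i, ii and iv cannot be checked this way, because they go beyond the seven recorded values.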
Variables

Statisticians gain information about a
particular situation by collecting data for
random variables. This section will explore in
greater detail the nature of variables and
types of data.
Variables

A variable is a characteristic or attribute that
can assume different values. For example,
the height of a person is a variable because
people have different heights, and sex is
another variable because people are not all of
the same sex. Age is a further example.
A variable can be classified as either discrete
or continuous.
Continue...

• In general, we can say discrete variables
assume values that can be counted. Discrete
variables can be assigned values such as 1, 2,
3… and are said to be countable.
• Example: Number of children in the family,
Number of students in a class room, and
Number of calls received by a switchboard
operator each day for a month.
Continue…

Continuous variables, by comparison, can
assume an infinite number of values in an
interval between two specific values. They
are obtained by measuring. They often
include decimals and fractions.
Example: Temperature is a continuous
variable, since it can assume an infinite
number of values between any two given
temperatures.
Levels of measurement for
variables

• Measurement is the process we use to assign
numbers to the observations or elements of a
variable. The term “number” does not necessarily
mean numbers that can be added, subtracted,
multiplied or divided. Instead, it means that
numbers are used as symbols to represent certain
characteristics of the object, such as age, income or height.
For example, as a student, your student number
may identify you.
Continue…

There are four levels (scales) of
measurement, each with its own
characteristics. From the weakest to the
strongest, they are:
• nominal
• ordinal
• interval
• ratio
Continue....

Nominal Scale: Data measured at this level
can be placed into categories in which no
order or ranking can be imposed on the data.
Examples:
Gender: Male/Female.
Political party: Democratic, Republican,
Independent, etc.
Continue...

• Ordinal Scale: classifies the data into categories that
can be ordered/ranked; however, precise differences
between the ranks do not exist.
• Examples:
• Highest qualification: None/School
education/Tertiary education.
• Degree of pain: None/Mild/Moderate/Severe
Continue...

Interval Scale: is like the ordinal level, with
the additional property that the difference
between two data values is meaningful.
However, data at this level do not have a
natural zero starting point (where none of the
quantity is present).
Example

Body temperatures of 98.2°F and 98.6°F are
examples of data at this level of
measurement. These values are ordered, and
we can find the difference of 0.4°F. However,
there is no natural starting point. The value
of 0°F might seem like a starting point, but it
is arbitrary and does not imply the total
absence of heat.
Continue...

Ratio Scale: is the interval level with the
additional property that there is also a
natural zero starting point (where zero
indicates that none of the quantity is present).
For values at this level differences and ratios
are both meaningful.
Examples

• 1) Distances: distances (in km) travelled by
cars (0 km represents no distance travelled
and 400 km is twice as far as 200 km).
• 2) Prices: prices of college textbooks (M0.00
does represent no cost and a M100.00 book
does cost twice as much as a M50.00 book).
Summary

• Ratio: there is a natural zero starting point and ratios
and differences are meaningful. E.g. distances.
• Interval: differences are meaningful, but there is no
natural zero starting point and ratios are meaningless.
E.g. body temperature in degrees Fahrenheit or
Celsius.
• Ordinal: Categories are ordered, but differences either
cannot be found or are meaningless.
• Nominal: Categories only. Data cannot be arranged
in an ordering scheme.
Continue...

• Nominal and ordinal variables are often referred to
as qualitative variables, whereas interval and ratio
variables are referred to as quantitative variables.
• Qualitative variable: is a variable which cannot be
measured on a numerical scale. Its values serve as names or
labels for identifying categories.
• Examples: sex of a person, race, colour, educational
level
Continue...

• Quantitative variable: is a variable that can be
measured on a numerical scale (its values are numerical
and can be ordered or ranked).
• Example:
• The variable age is numerical and people can be
ranked in order according to the value of their ages.
• Other examples include heights, weights and body
temperatures.
Summary

• The classification of variables can be summarized as
follows:

Data
  Quantitative
    Discrete
    Continuous
  Qualitative
Exercise

• Read the following on attendance and grades, and
answer the questions.
• A study conducted at NUL revealed that students
who attended class 95 to 100% of the time usually
received an A in the class. Students who attended
class 80 to 90% of the time usually received a B or a C
in the class. Students who attended class less than
80% of the time received a D or an F or eventually
withdrew from the class.
Continue…

 Based on this information, attendance and grades are
related. The more you attend class, the more likely it
is you will receive a higher grade. If you improve
your attendance, your grades will improve. Many
factors affect your grade in a course. One factor that
you have considerable control over is attendance.
You can increase your opportunities for learning by
attending class more often.
Continue…

• What are the variables under study?
• What are the data in the study?
• Are descriptive, inferential, or both types of statistics
used?
• What is the population under study?
• Was a sample collected? If so, from where?
• From the information given, comment on the
relationship between the variables.
Solution

 grades and attendance.
 Data consists of specific grades and attendance
numbers.
 These are descriptive statistics; however, if an
inference were made to all students, then that would
be inferential statistics.
 Population under study is ALL students at NUL.
 While not specified, we probably have data from a
sample of NUL students.
Continue…

 Based on the data, it appears that, in general, the
better your attendance, the higher your grade.
Exercise

 Classify each of the following as nominal, ordinal,
interval or ratio-scaled data.
 The time required to produce each tyre on an
assembly line.
 The number of liters of milk a family drinks in a
month.
 The ranking of four machines in your plant after
they have been designated as excellent, good,
satisfactory or poor.
Continue…

 Major in college (mathematics, biology, psychology,
etc.)
 The age of each of your employees.
 The sales in maloti at a local pizza house each
month.
 Elevations of Lesotho National Parks, in feet
above/below sea level.
 The response time of an emergency unit.
 A college student’s degree (associate, bachelor’s,
master’s, etc.)
Choosing a Sample

Recall that inferential statistics consists of
methods of drawing conclusions about the
population based on information obtained
from a sample of the population.
When one collects information from the
entire population, the exercise is referred to
as a census.
Continue...

Population: refers to the collection of all
individuals or items under consideration in a
statistical study.
Examples: All employed workers in Lesotho,
All registered voters in Lesotho.
Sample: refers to part of the population from
which information is collected.
Why do we use a
sample?

• There are various reasons why we do not
investigate a whole population (take a census)
but rather investigate a sample from a
population:
• A census is expensive: for example, it involves
millions of questionnaires, travelling costs,
temporary personnel, etc.
• A census takes a long time: for example, it
involves the distribution and collection of
questionnaires, the processing of large
amounts of data, etc.
Continue…

Sections of a population are inaccessible- it
is difficult to reach animals and plants on
very high mountains and access to persons in
hospitals and prisons is often forbidden, etc.
Inaccuracy of a census- for example, good
planning is necessary to take a census, there
is a large amount of administrative work,
mistakes are made by people working with
large datasets, etc.
Continue...

Once the researcher has decided that
sampling is appropriate, the next question to
consider is how to select a sample.
Remember the sample results will be used to
make conclusions concerning the entire
population.
Continue...

As a result, it is important for a sample to be
representative, that is, a sample should reflect
as closely as possible, the relevant
characteristics of the population under
consideration. Otherwise the sample is said
to be biased.
Examples of a biased
sample

It would not make sense to use the mean
weight of a sample of football team players to
make inferences about the mean weight of all
adult males in Lesotho.
It would be very unreasonable to try to
estimate the mean income of Roma residents
by sampling the incomes of people who work
at NUL.
Sampling Methods

Taking a sample is not simply a matter of
taking the nearest item.
If worthwhile conclusions relating to the
whole population are to be made from the sample,
it is essential to ensure, as far as possible,
that the sample is free from bias.
Continue…

• To obtain samples that are unbiased, that is, samples that
give each subject in a population an equally
likely chance of being selected, statisticians use
four basic methods of probability sampling:
• a) Simple Random Sampling
• b) Systematic Sampling
• c) Stratified Sampling
• d) Cluster Sampling
Simple Random
Sampling

Is a probability sampling technique whereby
each member of the population has an equal
and known chance/probability of being
selected into a sample.
There are different sampling procedures that
can be used to select a simple random
sample, namely:
Continue...

i) Lottery type of sampling procedure
Assign each student name a number. The numbers
are written on pieces of paper, placed in a box and
drawn blindly from the box without
replacement.
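
The lottery procedure can be imitated on a computer. The sketch below is an illustration only: the class list, the sample size of 5 and the seed are invented for the example and do not come from the course. `random.sample` draws without replacement, which matches pulling slips from a box.

```python
import random

# Hypothetical class list; the names are placeholders.
students = [f"Student {i}" for i in range(1, 31)]

# A fixed seed is used only so that the sketch gives the same draw each run.
rng = random.Random(2024)

# Draw 5 names without replacement, like pulling 5 slips from the box.
chosen = rng.sample(students, k=5)
print(chosen)
```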
Continue…

• ii) Random number tables could be used: in this
method of selecting a simple random sample, we
start by allocating a unique number to
each member (unit) of the population. Often the
numbers 1 to N are allocated to the N units in the
population. We then open any page of the table of
random numbers, start at any point, and read k digits
at a time, moving in any direction, where k is the number
of digits in N. The first n distinct numbers from 1 to N will
identify the units to be included in the sample.
Example

 A part of table of random numbers is given
 02946 81881 96520 56247 17623
 85697 62000 87957 07258 45054
 26734 68026 52067 23123 73700
 47829 31353 95944 72169 58374
 76603 99339 40571 41186 04981
• Use the above random numbers to select a random
sample of 8 units out of 600 units.
Solution

• We first assign a unique number (1 to 600) to each of
the 600 units.
• If we decide to start from the second row, second
column (that is, the second digit of the second row) and
read three digits at a time from left to
right, the first 8 numbers from 001 to 600 are:
• 569, 570, 054, 267, 346, 067, 231, 237.
• The units bearing the numbers 569, 570, 54, 267, 346,
67, 231, 237 will be included in the sample.
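
The reading of the table can be reproduced with a short Python sketch. It assumes that "second column" means the second digit of the second row and that reading continues onto the following rows; these are interpretations, chosen because they reproduce the numbers listed in the solution.

```python
# The part of the random number table given in the example.
rows = ["02946 81881 96520 56247 17623",
        "85697 62000 87957 07258 45054",
        "26734 68026 52067 23123 73700",
        "47829 31353 95944 72169 58374",
        "76603 99339 40571 41186 04981"]

N, n = 600, 8
k = len(str(N))  # number of digits in N, here 3

# Start at row 2, digit 2, and read left to right, continuing row by row.
digits = "".join(row.replace(" ", "") for row in rows[1:])[1:]

sample = []
for i in range(0, len(digits) - k + 1, k):
    number = int(digits[i:i + k])
    # Keep only labels from 1 to N, and skip any label already chosen.
    if 1 <= number <= N and number not in sample:
        sample.append(number)
    if len(sample) == n:
        break

print(sample)  # [569, 570, 54, 267, 346, 67, 231, 237]
```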
Systematic Sampling

Is a probability sampling technique in which a
sample is obtained by selecting a starting number at
random and then selecting each successive unit
systematically from an ordered list of the
population. Every individual still has an
equal chance of being selected in the sample.
Example

• To select a systematic sample of size n = 200 from a
population of size N = 3000:
• Calculate the sampling interval k = N/n = 3000/200 = 15.
• Thus, we select one unit from the first 15 units at
random and every 15th unit thereafter. If the first
unit selected is 13, then the units of analysis
corresponding to the elements 13, 28, 43, 58, and so on
will be included in the sample.
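
The selection in this example can be written out as a short Python sketch. The starting unit 13 is taken from the example; in practice it would be drawn at random from 1 to k, for instance with `random.randint(1, k)`.

```python
N, n = 3000, 200

# Sampling interval k = N / n.
k = N // n

# First unit selected in the example (normally chosen at random from 1..k).
start = 13

# Every k-th unit after the starting unit.
selected = list(range(start, N + 1, k))

print(k, len(selected), selected[:4])  # 15 200 [13, 28, 43, 58]
```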
Stratified Sampling

Is a probability sampling technique whereby
the population is divided into a number of
classes or strata and a sample is obtained by
combining samples drawn independently
from each stratum.
Example 1

Suppose a person conducting a customer
satisfaction survey selects random
customers from each customer type in
proportion to the number of customers of
that type in the population. Suppose also that a sample of
size 40 is to be selected and that 10% of the
customers are managers, 60% are users, 25%
are operators and 5% are database
administrators.
Continue...

Then 4 managers, 24 users, 10 operators and
2 administrators would be randomly
selected.
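
The proportional allocation in Example 1 is a one-line calculation per stratum. A minimal Python sketch, with illustrative variable names:

```python
sample_size = 40

# Percentage of customers of each type in the population (from the example).
percent = {"managers": 10, "users": 60, "operators": 25,
           "database administrators": 5}

# Each stratum receives the same share of the sample as it has of the population.
allocation = {group: round(sample_size * p / 100) for group, p in percent.items()}

print(allocation)
# {'managers': 4, 'users': 24, 'operators': 10, 'database administrators': 2}
```

When the shares do not give whole numbers, some rounding rule is needed so that the stratum sizes still add up to the intended sample size.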
Example 2

 Of the 130 students in year 1, 70 are boys and 60 are
girls. If we were to select a stratified random sample
based on gender, what proportion of our sample
should be boys and what should be girls?
• Solution: proportion of boys = $\frac{70}{130} = \frac{7}{13} \approx 53.8\%$, so about 53.8% of the
sample should be made up of boys.
• Proportion of girls = $\frac{60}{130} = \frac{6}{13} \approx 46.2\%$, so about 46.2% of the
sample should be made up of girls.
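
The same proportions can be checked in Python. This sketch uses the standard `fractions` module so that the reduced fractions 7/13 and 6/13 appear explicitly; the variable names are illustrative.

```python
from fractions import Fraction

boys, girls = 70, 60
total = boys + girls

share_boys = Fraction(boys, total)    # reduces to 7/13
share_girls = Fraction(girls, total)  # reduces to 6/13

print(share_boys, round(float(share_boys) * 100, 1))    # 7/13 53.8
print(share_girls, round(float(share_girls) * 100, 1))  # 6/13 46.2
```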
Cluster Sampling

Is a probability sampling technique whereby
the population is divided into separate
groups called clusters. Then a simple random
sample of clusters is selected from a
population.
Example

Let’s say we want to conduct a study
involving nurses in Lesotho. Instead of
randomly selecting 20% of all nurses in every
hospital in the country, we could randomly
select 20% of the hospitals and take all the
nurses in those hospitals to be part of our
sample.
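
The nurses example can be sketched in Python as follows. The hospitals, the nurse lists and the seed are invented purely for illustration; only the idea (sample 20% of the clusters, then take everyone in them) comes from the example.

```python
import random

# Hypothetical data: 10 hospitals with 5 nurses each.
hospitals = {f"Hospital {h}": [f"Nurse {h}-{i}" for i in range(1, 6)]
             for h in range(1, 11)}

rng = random.Random(7)  # fixed seed only so the sketch is repeatable

# Randomly select 20% of the hospitals (the clusters).
n_clusters = round(0.20 * len(hospitals))
chosen_hospitals = rng.sample(sorted(hospitals), k=n_clusters)

# Every nurse in a chosen hospital becomes part of the sample.
sample = [nurse for h in chosen_hospitals for nurse in hospitals[h]]

print(chosen_hospitals)
print(len(sample))
```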
Summary on Sampling
Methods

 Random Sampling: Subjects are selected by random
numbers.
 Systematic Sampling: Subjects are selected by using
every kth number after the first subject is randomly
selected from 1 through k.
 Stratified Sampling: subjects are selected by dividing
the population into groups (strata), and subjects are
randomly selected within groups.
 Cluster Sampling: Subjects are selected by using an
intact group that is representative of the population.
Statistical Errors

 Despite efforts to obtain a good sample it is common
to make statistical errors.
 There are two main types of errors, namely: Non-
Sampling and Sampling Errors.
Non-Sampling errors

 These include all kinds of human errors such as
mistakes in
 collecting,
 reporting or
 analyzing data, like making errors in calculations,
copying data incorrectly and so on.
Sampling errors

 Sampling error is the difference between the results
we find by studying the complete population (a
census), and the results we find by studying only a
sample and using that sample to draw conclusions
about the population.
Continue...

• Sampling error can occur in basically two ways: by
chance and by sampling bias.
• In a census there are no sampling errors, but
non-sampling errors are common.
Data Collection

 After we have decided on the sampling procedure
we are going to use, the next step will be to do data
collection.
Sources of Data

• Statistical data may be classified under two
categories, depending upon the source: secondary data
and primary data.
• Secondary Data: available data that were collected for
purposes other than the current investigation.
Advantages

 Can be obtained quickly and inexpensively.
 But before it can be used as the only source of
information, it must be ascertained that the data are:
 - Available
 - Relevant
 - Accurate
 - Sufficient
Primary data

 Data collected for a specific purpose in order to
obtain the exact information wanted. These are
considered to be more meaningful and reliable but
the disadvantage is that they are time consuming
and more costly to obtain than secondary data.
Methods of data
collection

• Data can be collected in a variety of ways. One of the
most common is through the use of
surveys. Three of the most common survey methods are:
• the telephone survey,
• the mailed questionnaire,
• the personal interview (face-to-face).
Continue…

• No single method is superior to another. Each needs
to be assessed in terms of:
• survey content,
• respondent characteristics,
• time line,
• available resources.
Survey content

 What types of questions will be asked?
 How complex or sensitive are the questions?
 Would people be more likely to understand and
respond to questions presented in print or orally?
Respondent Characteristics

 From whom do you want to collect information?
 What is the easiest way to reach them?
 Do they have certain characteristics that rule out one
method over another (e.g., literacy skills, no
telephone, etc.)?
Time Line

 How quickly do you need results?
Available Resources

 Who will work on the survey?
 Will you have help from outside experts in planning
the survey?
 How much money do you have to spend?
Personal Interview

• Where the sources of information are people, each
may be asked a series of questions in a face-to-face interview.
Advantages

 Immediate feedback
 Cooperation – 1/20 people refuse to be interviewed
 High response rate.
Disadvantages

• Expensive, because interviewers have to be trained
and paid.
• The interviewer may be biased in his/her selection of
respondents.
• It may be difficult to find a convenient time for
interviewing certain people.
Postal Questionnaires

 This method is used if the targeted geographical
area/number of respondents is large. A self-
addressed prepaid envelope should always be
included.
Advantages

 It can be used to cover a wider geographic area than
telephone surveys or personal interviews since
mailed questionnaires are less expensive to conduct.
 Respondents can remain anonymous as they desire.
Disadvantages

• The response rate may be very low unless there is an
incentive or a legal obligation to reply.
• There is no control over how long people take to
reply.
• The answers may not be entirely the respondent’s
own.
• Some people may have difficulty in reading or
understanding the questions.
• Respondents may give inappropriate answers to questions.
Continue....

• If a respondent has difficulty with questions, she/he
may not return the form.
• Misunderstanding of questions cannot be corrected.
• Only straightforward questions can be set.
• Postal questionnaires are difficult to design.
Telephone interviews

Where the sources of information are people,
each may be asked a series of questions in an
interview conducted over the telephone.
Advantages

They are less costly compared to personal
interviews.
People may be more candid in their opinions
since there is no face-to-face contact.
Higher response rate than mail surveys.
Disadvantages

Some people in the population will not have
phones, or will not answer when calls are
made; hence not all people have a chance to be
surveyed.
Many people now have unlisted numbers
and cell phones, so they cannot be surveyed.
The tone of the interviewer might influence
the response of the person who is being
interviewed.
Continue…

Once the data has been collected, it is then
summarized (in chart or tabular format),
analyzed and interpreted.
Summarizing and
Presentation of Statistical Data

A summary of the data generally provides a
better general impression of the population
than raw data.
The data can be presented in a tabular form
or frequency distributions.
Tabulation

Presenting data in the form of a table
Layout of the table depends on its purpose
and information to be presented

Tables can be classified as informative or
classifying tables, reference tables and text or
summary tables.
Informative/classifying
Table

 These are original tables that contain systematically
arranged data compiled for records and further use,
and not intended for presentation of comparisons or
to show relationships or significance of the figures
Reference Tables

 They contain all summarized information relevant to
the subject in question. They are usually quite long
and they are set out alphabetically for ease of
reference.
Text or Summary Tables

• These kinds of tables are usually found in reports and
reference books.
• They are kept as simple as possible and are usually
interpreted in the accompanying text.
• They show only the information relevant to the
question being discussed.
Frequency Distributions

By suitably organizing data, it is often
possible to make a rather complicated set of
data easier to understand. The most common
way of summarizing a large data set is to
partition the sample range into a number of
classes and then count the frequencies of
occurrence of data in each class.
Continue...

A frequency distribution: is a table of classes,
where each class is paired with the frequency
of occurrence of data in that class. The
frequency distribution can present both
qualitative and quantitative data.
Continue...

The following table illustrates the grouping
of data.
Example 1: Table 1 gives a hypothetical
example of the number of days to maturity
for 40 short-term investments.
Table 1: Days to Maturity for 40
short-term Investments

 70 64 99 55 64 89 87 65
 62 38 67 70 60 69 78 39
 75 56 71 51 99 68 95 86
 57 53 47 50 55 81 80 98
 51 32 63 66 85 79 83 70
Continue...

 The first step would be to decide on the number of
classes.
 One convenient way would be to group these data
by tens. The smallest piece of data is 32 and the
largest piece of data is 99, so if we group by ten we
get the following table:
Table 2: Frequency distribution for days
to maturity for 40 short-term investments

Class    Tally Marks     Frequency
30-39    |||             3
40-49    |               1
50-59    ||||| |||       8
60-69    ||||| |||||     10
70-79    ||||| ||        7
80-89    ||||| ||        7
90-99    ||||            4
Total                    40
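
The counts in Table 2 can be reproduced by tallying the 40 values of Table 1 into classes of ten. The following Python sketch is illustrative and is not part of the original slides.

```python
# Days to maturity for the 40 short-term investments (Table 1).
days = [70, 64, 99, 55, 64, 89, 87, 65,
        62, 38, 67, 70, 60, 69, 78, 39,
        75, 56, 71, 51, 99, 68, 95, 86,
        57, 53, 47, 50, 55, 81, 80, 98,
        51, 32, 63, 66, 85, 79, 83, 70]

# Classes 30-39, 40-49, ..., 90-99 given as (lower limit, upper limit).
classes = [(low, low + 9) for low in range(30, 100, 10)]

# Count how many values fall between the limits of each class.
frequency = {c: 0 for c in classes}
for d in days:
    for low, high in classes:
        if low <= d <= high:
            frequency[(low, high)] += 1
            break

for (low, high), f in frequency.items():
    print(f"{low}-{high}: {f}")
print("Total:", sum(frequency.values()))
```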
Continue...

It is important to note that each class has a
lower and upper class limit. These are the
smallest and largest possible values of a
given class.
The class limits of different classes do not
overlap; there should be a gap between the
upper class limit of one class and the lower
class limit of the next class.
Continue....

We also have the so called class width which
is the difference between the lower class limit
of one class and the lower class limit of the
next class.
Guidelines for Constructing a
Frequency Distribution

• Number of classes (K): The choice of the
number of classes depends on the sample
size.
• A guide for determining the number
of classes (k) is Sturges' formula, given
by $k = 1 + 3.322\log_{10}(n)$, where n is the number of
observations.
Continue…

Note that Sturges' rule should not be
regarded as final, but should be considered
as a guide only. The number of classes
specified by the rule should be increased or
decreased for convenience or clear
presentation.
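
As an illustration, the rule can be evaluated for the 40 investments of Table 1. This Python sketch takes the logarithm to base 10, which is what the constant 3.322 (approximately 1/log10(2)) implies; the function name is made up for the example.

```python
import math

def sturges_classes(n):
    """Suggested number of classes: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

k = sturges_classes(40)
print(round(k, 2))  # 6.32
```

A value of about 6.3 suggests 6 or 7 classes, which is in line with the 7 classes used for the investment data.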
Guidelines…# of
Classes

The larger the sample size, the more the
classes
Normally one has between 5 and 12
classes
It is necessary to avoid too many classes
or too few classes. Too few classes lead
to great loss of information while too
many classes produce unnecessary
details.
Class Width

 Class width: Once the number of classes has been
decided upon, then the class width can be estimated
as follows:
 Suppose we have decided to have seven classes.
 K= 7
Denote largest number by L
Denote smallest number by S
 So L = 99 and S = 32.
Continue…

• Then the class width (denoted by W) should be
approximately
• $W = \frac{L - S}{K} = \frac{99 - 32}{7} \approx 9.57 \approx 10$
• The class width would be 10.
 E.g. Suppose L = 99, S = 32, K = 6 what will be the
class width?
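
A minimal Python sketch of the class width calculation follows. It rounds the quotient up to the next whole number, which is an assumption (the slides only say "approximately"); rounding up guarantees that K classes of width W cover the whole range from S to L.

```python
import math

def class_width(largest, smallest, k):
    # Round up so that k classes of this width span the whole range.
    return math.ceil((largest - smallest) / k)

print(class_width(99, 32, 7))  # 10
print(class_width(99, 32, 6))  # 12
```

For K = 6 the quotient is about 11.17. A width of 11 would give 6 × 11 = 66, which is less than the range of 67, so under this assumption the width becomes 12.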
Class limits

 Class Limits: Once the largest value (L) and the
smallest value (S) and the class width (W) have been
determined, the choice of the class limits becomes a
straightforward matter.
• First you have to choose the lower class limit L1 of the
first class.
• L1 can be any number smaller than or equal to the
smallest data value S.
Lower class limits

• At the same time, we have to make sure that the
largest data value L is smaller than L1 + KW.
• In summary, L1 ≤ S and L < L1 + KW.
• L1 should also have the same number of decimal
places as the data.
• Then the other lower class limits are obtained by
adding multiples of the class width W to L1.
Continue...

• In our example, we have K = 7, W = 10, S = 32, L = 99.
• L1 must be smaller than or equal to 32.
• L1 + KW > L, that is, L1 + 7 × 10 > 99, so L1 > 29.
• Choosing L1 = 30 satisfies both conditions.
• Now we can choose our lower class limits.
Continue...

Class
30 – 39
40 – 49
50 – 59
60 – 69
70 – 79
80 – 89
90 - 99
Total
Upper Class Limits

• The upper limits follow in a straightforward way:
• The upper limit of the 1st class must be less than the
lower class limit of the next class, that is, it must be
less than 40. With whole-number data it is 39.
Class Boundaries

Class Boundaries: are numbers which
demarcate/separate neighboring class intervals.
• Remember in class limits there are gaps in between,
while there are no gaps between neighboring classes
when class boundaries are used.
• Class boundaries have one decimal place more than
the data. This is done to ensure that every
observation falls into exactly one class.
Continue...

 Class boundaries can also be defined in terms of
class limits. A class boundary is the midpoint
between the upper class limit of one class, say u, and
the lower class limit of the next class, say l.
 Class boundary = (u + l)/2
 A class width is the difference between two
consecutive class boundaries.
Continue...

Class Boundaries   Frequency
29.5 – 39.5        3
39.5 – 49.5        1
49.5 – 59.5        8
59.5 – 69.5        10
69.5 – 79.5        7
79.5 – 89.5        7
89.5 – 99.5        4
Null-Classes

 Null-Classes: Care must be taken to avoid wherever
possible classes which have no members (empty
classes).
Example

• The following data elements represent the amount of
time (rounded to the nearest second) that 30 randomly
selected customers spent in the line before being
served at a branch of Post Bank.
• 183 121 140 198 199 90 62 135 60 175
• 320 110 185 85 172 235 250 242 193 75
• 263 295 146 160 210 165 179 359 220 170
• Construct a frequency distribution with 5 classes
Continue...

 Obtain the corresponding class boundaries.
L = 359, S = 60, K = 5, so W = (359 – 60)/5 = 59.8 ≈ 60
Taking L1 = 60 gives L1 + KW = 360 > 359

Class Limits   Frequency   Class Boundaries
60 – 119       6           59.5 – 119.5
120 – 179      10          119.5 – 179.5
180 – 239      8           179.5 – 239.5
240 – 299      4           239.5 – 299.5
300 – 359      2           299.5 – 359.5
Total          30
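The counting behind this table can be sketched in Python. The helper name `frequency_distribution` and its parameters are illustrative only; the sketch assumes whole-number data, so consecutive class limits are 1 apart and boundaries sit half a unit outside the limits.

```python
# Waiting times (seconds) of the 30 Post Bank customers, copied from the example.
times = [183, 121, 140, 198, 199, 90, 62, 135, 60, 175,
         320, 110, 185, 85, 172, 235, 250, 242, 193, 75,
         263, 295, 146, 160, 210, 165, 179, 359, 220, 170]

def frequency_distribution(data, first_lower, width, k):
    """Return a list of (lower limit, upper limit, frequency) for k classes."""
    table = []
    for i in range(k):
        lower = first_lower + i * width
        upper = lower + width - 1          # whole-number data: limits are 1 apart
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
    return table

table = frequency_distribution(times, first_lower=60, width=60, k=5)
for lower, upper, freq in table:
    # Class boundaries lie half a unit outside the class limits.
    print(f"{lower} - {upper}: {freq}  (boundaries {lower - 0.5} - {upper + 0.5})")
```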
OTHER COLUMNS OF A
FREQUENCY DISTRIBUTION

 Relative Frequency:
 The percentage of the class, expressed as a decimal,
is called the relative frequency of the class.
 A table listing all classes and their relative
frequencies is called a relative frequency distribution.
 Relative frequency = Class frequency / Sample size

Table 4: Relative Frequency
Distribution for 40 Short-term
Investments

Class Frequency Relative
limits Frequency

30 – 39 3 0.075(7.5)
40 - 49 1 0.025(2.5)
50 – 59 8 0.200(20)
60 – 69 10 0.250(25)
70 – 79 7 0.175(17.5)
80 – 89 7 0.175(17.5)
90 – 99 4 0.100(10)
Total 40 1(100)
Continue...

 Note that the relative frequencies must always add
up to 1(100%).
 The table shows that 10% of the investments have
maturity period of between 90 and 99 days.
Cumulative Frequency

The cumulative frequency of a class is the
sum of the frequency for that class and the
frequencies of all the previous classes.
Continue...
Class limits   Frequency   Cumulative frequency
30 – 39        3           f1
40 – 49        1           f1+f2
50 – 59        8           f1+f2+f3
60 – 69        10          f1+f2+f3+f4
70 – 79        7           f1+f2+f3+f4+f5
80 – 89        7           f1+f2+f3+f4+f5+f6
90 – 99        4           f1+f2+f3+f4+f5+f6+f7
Table 5: The cumulative frequency
distribution for days before maturity for
40-short-term investments

Class limits   Frequency   Cumulative frequency
30 - 39 3 3
40 - 49 1 4
50 – 59 8 12
60 – 69 10 22
70 – 79 7 29
80 – 89 7 36
90 – 99 4 40
Table 5: The cumulative frequency
distribution for days before maturity for
40-short-term investments

Class limits   Frequency   Cumulative frequency   Relative Frequency   Relative Cumulative Frequency
30 – 39        3           3                      0.075                0.075
40 – 49        1           4                      0.025                0.100
50 – 59        8           12                     0.200                0.300
60 – 69        10          22                     0.250                0.550
70 – 79        7           29                     0.175                0.725
80 – 89        7           36                     0.175                0.900
90 – 99        4           40                     0.100                1.000
Total          40                                 1
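The extra columns can be recomputed from the class frequencies alone. The following is a minimal Python sketch; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90-99"]
freqs = [3, 1, 8, 10, 7, 7, 4]

n = sum(freqs)                                # sample size
cum_freqs = list(accumulate(freqs))           # running totals f1, f1+f2, ...
rel_freqs = [f / n for f in freqs]            # class frequency / sample size
rel_cum_freqs = [c / n for c in cum_freqs]    # cumulative frequency / sample size

for row in zip(classes, freqs, cum_freqs, rel_freqs, rel_cum_freqs):
    print("{:7} {:3d} {:3d} {:.3f} {:.3f}".format(*row))
```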
QUALITATIVE FREQUENCY
DISTRIBUTION

Remember: Qualitative data provide non-
numerical measures that categorize (or
classify) individual observations.
The construction of a frequency distribution
for qualitative data is much easier because
the nature of the data provides a
straightforward classification.
Example

A student has completed 20 courses in the
school of business administration. His grades
in the 20 courses are shown below:
 A B A B C C C B B B
 B A B B B C B C B A
Construct a frequency distribution.
Frequency Distribution for
grades.

Grade Frequency
A 4
B 11
C 5
TOTAL 20
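For qualitative data the counting can be done directly. A small Python sketch using the standard `collections.Counter` (variable names are illustrative):

```python
from collections import Counter

grades = "A B A B C C C B B B B A B B B C B C B A".split()
freq = Counter(grades)          # maps each grade to its frequency

for grade in sorted(freq):
    print(grade, freq[grade])
print("TOTAL", sum(freq.values()))
```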
Stem and Leaf Plot

A stem and leaf plot is a frequency
distribution which carries all the individual
values in the raw data.
It is constructed by breaking up every data
value into two components, a stem (usually
the entry’s leftmost digits) and a leaf
(usually the rightmost digit). For the number
173, for example, the stem would be “17” and
the leaf would be “3”.
Continue...

Data are then classified according to the
values of their stems.
Example

The test scores of 14 individuals on their first
statistics examination are shown below:
 95 87 52 43 77 84 78 75 63 92
81 83 91 88
Construct a stem and leaf display for these
data.
Continue...

• The stems will be the leading digit(s) on the left and the
leaves will be the last digit on the right
• Stem Leaf Frequency
4 3 1
5 2 1
6 3 1
7 7 8 5 3
8 7 4 1 3 8 5
9 5 2 1 3
Continue...

• Now, the leaves of each stem are arranged in increasing
order horizontally, thus leading to the following stem and
leaf display.
• Stem Leaf Frequency
4 3 1
5 2 1
6 3 1
7 5 7 8 3
8 1 3 4 7 8 5
9 1 2 5 3
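The ordered display can be reproduced with a short Python sketch. The function name `stem_and_leaf` is illustrative, and the sketch assumes two-digit whole numbers, so the stem is the tens digit and the leaf is the units digit.

```python
from collections import defaultdict

scores = [95, 87, 52, 43, 77, 84, 78, 75, 63, 92, 81, 83, 91, 88]

def stem_and_leaf(values):
    """Group values by tens digit (stem); leaves are the units digits, in order."""
    display = defaultdict(list)
    for v in sorted(values):            # sorting first keeps stems and leaves ordered
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)

display = stem_and_leaf(scores)
for stem, leaves in display.items():
    print(stem, "|", " ".join(map(str, leaves)), "  frequency:", len(leaves))
```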
Example

 Consider the following data for car battery life (in
years).
 2.2 4.1 3.5 4.5 3.2 3.7 3.0 2.6 3.4 3.5
 1.6 3.1 3.3 3.8 3.1 4.7 3.7 2.5 4.3 4.2
3.4 3.6 2.9 3.3 3.9 3.1 3.3 3.1 3.7 4.4
3.2 4.1 1.9 3.4 4.7 3.8 3.2 2.6 3.9 3.0
Construct a stem and leaf display for these data.
Stem-and-Leaf display of
Battery Life

 Stem   Leaf                        Frequency
1        69                          2
2        25669                       5
3        0011112223334445567778899   25
4        11234577                    8

 1|6 means 1.6


Stem and Leaf

 We could also increase the number of classes by
breaking down each stem into two.
 This would result in stems 1(a), 1(b), 2(a), 2(b),
3(a), 3(b), 4(a), 4(b).
 Then leaves with values 0, 1, 2, 3 and 4 would be
grouped under stems ending with ‘a’.
 While the leaves with values 5, 6, 7, 8 and 9 would be
grouped under stems ending with ‘b’
Stem & Leaf Plot

Stem   Leaf              Frequency
1(b)   69                2
2(a)   2                 1
2(b)   5669              4
3(a)   001111222333444   15
3(b)   5567778899        10
4(a)   11234             5
4(b)   577               3
Continue....

 The frequency distribution is called a double stem
and leaf plot.
Stem & Leaf plot

• Main Advantages
• Possible to retrieve raw data from it, i.e. no
information is lost.

• Disadvantages
• Not very flexible with respect to the choice of the
number of classes.
• Cumbersome when the number of data values is
large
GRAPHICAL REPRESENTATION OF
FREQUENCY DISTRIBUTION

 A graph does not replace a table, but complements it
by showing the data’s general structure more clearly.
It is more likely to attract the attention of a casual
observer and reveal trends or relationships that
might be overlooked in a table.
Continue...

 For example, a graph will show the relationship
between two variables, or changes in a variable over
a time period.
Continue...

 In this class we will concentrate on histograms, pie
charts, bar charts, frequency polygons and Ogives.
Histogram

 DEF: A histogram is a graph that uses bars to
portray the frequencies or the relative frequencies
of possible outcomes for numerical data. The
horizontal scale represents the classes and
the vertical scale represents the frequencies.
 A rectangle (bar) is drawn above each class
interval with its height corresponding to the
interval’s frequency, relative frequency, or percent
frequency.
Continue…

 In other words, in a histogram the base (horizontal
axis) of each bar corresponds to a class boundary of a
frequency distribution, and heights of the bars
represent the frequency, relative frequency, or
percent frequency associated with each bar.
 The use of the class boundaries eliminates the spaces
between the bars to give a solid appearance.
Continue…

[Figure: Histogram for the number of days before maturity for 40 short-term investments. Horizontal axis: Class Boundaries (29.5 to 99.5); vertical axis: Frequency. Bar heights: 3, 1, 8, 10, 7, 7, 4.]
Cumulative Frequency
Curves

 We can plot cumulative frequencies against their
corresponding upper class boundaries and join the
points with a smooth curve.
 The curve is called the Cumulative Frequency Curve or
the Ogive
 We are going to use the data for the number of days
before maturity to demonstrate this.
Continue…
[Figure: Cumulative frequency curve (the figure title is cut off in the source). Horizontal axis: Class Boundaries (29.5 to 99.5); vertical axis: cumulative frequency (0 to 50).]
Continue...

 If relative cumulative frequencies had been used instead,
we would call the graph above the relative
cumulative frequency curve.
Continue...

 Ogives can be used to obtain certain quantities of the
frequency distribution called quantiles. These include the
median, the quartiles, the deciles and the percentiles.
Median

 The median is the central value of the distribution.
 The cumulative frequency at the median is 50%,
which means 50% of the data values are smaller than
or equal to the median.
Continue...

 For example, the median of the numbers 24, 18, 21,
17, 19, 12, 14 is 18.
 The median of the numbers 24, 18, 21, 17, 14, 10 is
17.5.
 Thus to find the median of a small data set one
first has to arrange the data in increasing order.
Continue...

 Then the median is the middle value if there is an
odd number of observations in the data set.
 If there is an even number of observations, then the median
is the average of the two middle numbers.
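Both cases can be checked with Python's standard `statistics.median`, which sorts the data and applies the odd/even rule described above. The variable names are illustrative.

```python
import statistics

odd_sample = [24, 18, 21, 17, 19, 12, 14]     # 7 observations
even_sample = [24, 18, 21, 17, 14, 10]        # 6 observations

# Odd count: the middle value of the sorted data.
print(sorted(odd_sample), statistics.median(odd_sample))

# Even count: the average of the two middle values of the sorted data.
print(sorted(even_sample), statistics.median(even_sample))
```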
Continue...

 How do we get the median for the number of days to
maturity?
 The data consisted of 40 observations. Half of 40 is
20.
 Thus from the point y = 20 on the y-axis, draw a
horizontal line towards the curve and drop it down
to the x-axis.
 Read the x-coordinate of the point where the vertical
line meets the x-axis.
Continue....

 The value of this point is 67.5;
 hence the median is approximately 67.5.
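Reading a value off the ogive can be imitated numerically. The sketch below joins the plotted points with straight lines, which is an assumption (the slides draw a smooth curve, so values read from the graph may differ slightly); the function name `ogive_value` is illustrative.

```python
def ogive_value(target, boundaries, cum_freqs):
    """Return the x-value at which the piecewise-linear ogive reaches `target`."""
    points = list(zip(boundaries, cum_freqs))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            # Linear interpolation inside the class that contains the target.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the range of the cumulative frequencies")

boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5, 99.5]
cum_freqs = [0, 3, 4, 12, 22, 29, 36, 40]    # 0 at the lowest class boundary

median = ogive_value(40 / 2, boundaries, cum_freqs)
print(median)
```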
Quartiles

 Quartiles are values that divide a set of
observations into 4 equal parts and they are
normally denoted by Q1, Q2 and Q3.
 The lower quartile, Q1, is a value such that one
quarter of all the values lies below Q1, that is,
 the relative cumulative frequency at Q1 is 25%.
Deciles

 These are points on the x-axis which divide a set of
observations into 10 equal parts. These values
denoted by D1, D2, D3,…, D9 are such that 10% of
the data falls below D1, 20% falls below D2, …, and
90% falls below D9.
Percentiles

 These are the points on the sample range which
divide a set of data into 100 equal parts. These
values, denoted by P1, P2, P3,…, P99 are such that
1% of the data falls below P1, 2% falls below P2, …,
and 99% falls below P99
Continue...

 50th percentile = Median = Q2
 75th percentile = Q3
 20th Percentile = 2nd Decile
Semi-Inter Quartile
Range

This is defined as
 S.I.R = (Q3 – Q1)/2
where the inter-quartile range is I.R = Q3 – Q1
Relative Cumulative
Curve

 A curve obtained by plotting the upper class
boundaries on the x-axis and the relative cumulative
frequencies on the y-axis is called a relative
cumulative frequency curve.
 It is much easier to use the relative cumulative
frequency curve to obtain the quartiles than to use
the cumulative frequency curve. We again use the number of
days before maturity.
Relative Cumulative
Frequency Curve

[Figure: Relative Cumulative Ogive. Horizontal axis: Class Boundaries (29.5 to 99.5); vertical axis: relative cumulative frequency in percent (0 to 120).]
RCFC

 To obtain the median from the relative cumulative
frequency curve, draw a horizontal line from the
point 50 on the y-axis towards the curve.
 At the point where the line meets the curve, drop a
vertical line to the x-axis.
 The point where the line meets the x-axis is the
median.
Continue...

 The quartiles Q1 and Q3 are obtained in a similar
manner by starting at points 25 and 75,
respectively, on the y-axis.
 In our example, the median is 67.5
 Q1 = 57.5 and
 Q3 = 83.5
 Then
 S.I.R = (Q3 – Q1)/2 = (83.5 – 57.5)/2 = 13

Bar Charts

A bar chart is a chart with rectangular bars whose lengths are
proportional to the values that they represent.
Bars can be plotted vertically or horizontally.
A vertical bar chart is sometimes called a
column bar chart.
There are different types of bar charts,
namely, simple bar chart, comparative bar chart
and component bar chart.
Simple Bar Chart

 In simple bar charts, the data is represented by a
series of bars, the height (length) of each bar
indicating the size of the figure represented.
Example

 There are 800 students in the School of Business
Administration at National University of Lesotho.
There are four majors in the school: Accounting,
Finance, Management and Marketing. The following
shows the number of students in each major:
Continue....

 Major Number of Students
Accounting 240
Finance 160
Management 320
Marketing 80
construct a bar chart for the above data.
Continue...

[Figure: Bar chart for majors in the School of Business Administration. Horizontal axis: Majors (Accounting, Finance, Management, Marketing); vertical axis: Number of students (0 to 350).]
Comparative (Multiple)
Bar Chart

 This type of chart shows several variables over the
same time period or a given variable over several
periods.
 Two or more bars are grouped together and more
than one set of comparisons can be made. The use of
a key will help distinguish between the categories
Example

 Draw a multiple bar chart to represent the imports
and exports of Lesotho (values in M) for the years
1991 to 1995
 Years Imports Exports
 1991 7930 4620
 1992 8850 5225
 1993 9780 6150
 1994 11720 7340
 1995 12150 8145
Continue....

[Figure: Bar chart for imports and exports of Lesotho for the years 1991-1995. Horizontal axis: Years (1991 to 1995); vertical axis: Imports & Exports (0 to 14000); key: Imports, Exports.]
Pie Chart

 A graphical device for presenting data by
subdividing a circle into sectors that correspond
to the relative frequency of each class.
 The angular measurement of a circle (pie) is 360°. For
instance, a sector which has been allocated X units
must receive a portion of the pie of size
 (X / Total observations) * 360°
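Applied to the School of Business Administration example, the sector sizes can be computed as follows. This is a small Python sketch; the dictionary and variable names are illustrative.

```python
majors = {"Accounting": 240, "Finance": 160, "Management": 320, "Marketing": 80}
total = sum(majors.values())     # total observations (students)

# Sector angle = (X / total observations) * 360 degrees.
angles = {major: count * 360 / total for major, count in majors.items()}
# The same share expressed as a percentage of the whole pie.
shares = {major: count * 100 / total for major, count in majors.items()}

for major in majors:
    print(f"{major}: {angles[major]:.0f} degrees, {shares[major]:.0f}%")
```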
Continue…

[Figure: Pie chart for majors in the School of Business Administration. Accounting 30%, Finance 20%, Management 40%, Marketing 10%.]
Frequency polygon

 It is constructed by plotting class frequencies against
class marks (midpoints) and connecting
consecutive points by straight lines.
 Usually a frequency polygon is a closed figure.
Therefore additional class marks are added at both
ends of the distribution, each with zero frequency.
Example

 A supermarket recorded the number of items bought
by each customer. The results are shown in the following
table.
 Draw a frequency polygon to illustrate these results.
Continue...

Number of items bought   Number of customers
1-5 22
6-10 36
11-15 52
16-20 26
21-25 18
26-30 6
31-35 10
Continue....
Number of items bought   Number of customers   Mid points
1-5                      22                    3
6-10                     36                    8
11-15                    52                    13
16-20                    26                    18
21-25                    18                    23
26-30                    6                     28
31-35                    10                    33
Total                    170
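The points of the frequency polygon can be prepared in Python as sketched below. Variable names are illustrative; the two closing points are placed one class width beyond the first and last class marks, following the rule of adding extra class marks with zero frequency.

```python
classes = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
customers = [22, 36, 52, 26, 18, 6, 10]

# Class mark = midpoint of the lower and upper class limits.
marks = [(lower + upper) / 2 for lower, upper in classes]

# Close the polygon: one extra class mark on each side, with zero frequency.
width = marks[1] - marks[0]
polygon_x = [marks[0] - width] + marks + [marks[-1] + width]
polygon_y = [0] + customers + [0]

for x, y in zip(polygon_x, polygon_y):
    print(x, y)
```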
Continue...

[Figure: Frequency polygon for items bought per customer. Horizontal axis: Number of Items bought (class marks 3, 8, 13, 18, 23, 28, 33); vertical axis: Number of customers (0 to 60). Plotted frequencies: 22, 36, 52, 26, 18, 6, 10.]
MEASURES OF LOCATION
OR CENTRAL TENDENCY

 We began our study of descriptive statistics by
learning how to
 Organize data into tables
 Summarize data using graphical displays.
 We shall now learn numerical methods of
summarizing data.
Continue...

 We will look at some of the statistical measures
which define in some sense, the centre of a set of
data.
 These are called measures of location or measures of
central tendency.
THE SIGMA (Σ) NOTATION

 So far we have dealt with data in its numerical form
only, such as 21, 19, 17, 18, 20.
 In other words we have dealt with specific
realizations of statistical variables of interest. But
when we need to write formulas or general
expressions involving statistical variables we need to
use variable names or symbols instead of numbers.
Continue...

 Thus if we are talking about the ages of five students,
we may use the symbols, X1, X2, X3, X4 and X5 when
we do not yet know their actual values.
 This means that X1 represents the age of the first
student, X2 represents the age of the second
student and so on.
Continue...

So for the numerical values 21, 19,17,18,
20
we can have x1 = 21, x2 = 19, x3 = 17, x4
= 18 and x5 = 20.
Continue....

 Now, supposing we have a large data set, say 1000
values. It is clearly inconvenient to always write
down all 1000 variable names.
 There are many ways to write down an expression
for the 1000 variables names without writing all of
them.
Continue...

 One such representation is
 x1, x2, x3,...,x1000 Equation (1)
 The three dots represent the 996 missing symbols
written following the pattern already established by
the first three symbols.
 Another way is to write down a typical symbol, say
xi, and then define the range of the subscript i
Continue....

 For example equation 1 can be written as
 xi, i=1, 2, 3, ... , 1000 Equation (2)
 Any letter can be used instead of x, there is nothing
special about it.
Continue...

 Now suppose we want to write down an expression
for the sum of the five variables, x1, x2, x3, x4 and x5.
We can write the sum as
 x1+ x2+ x3+ x4 + x5 Equation (3)

 The above expression can be read as: sum (add) all
variables xj from j=1 to j=5
Continue...

 There are four essential ingredients in the above
statement:
 a) The operation sum
 b) The generic summand xj
 c) The starting value of the subscript j=1
 d) The last value j=5
Continue...

 This means we could write Equation (3) more
concisely if we use a symbol that incorporates all the
four ingredients. The most commonly used symbol is
the sigma (Σ) notation.


Continue...

 With this notation Equation (3) can be written as
$$\sum_{j=1}^{5} x_j$$
 Equation (4)
Continue...

 Thus Equations (3) and (4) are two ways of writing the
same thing. Therefore, we can equate the two
expressions as
$$\sum_{j=1}^{5} x_j = x_1 + x_2 + x_3 + x_4 + x_5$$
 Equation (5)
Continue...
 In general, $\sum_{j=1}^{n} x_j$
 is the sum of the quantities xj, from j = 1 to j = n, and
j is called the index of summation.
 It does not matter what letter you use for the index
of summation as long as the same letter is used for
the summand and the lower limit of the summation.
Continue...
 Thus
$$\sum_{i=1}^{3} x_i = \sum_{j=1}^{3} x_j = \sum_{k=1}^{3} x_k = x_1 + x_2 + x_3$$
Continue...

 Examples
If $x_1 = 2$, $x_2 = 9$, $x_3 = 1$ and $x_4 = 3$, then
$$\sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 2 + 9 + 1 + 3 = 15$$
Continue...

$$\sum_{i=2}^{3} x_i = x_2 + x_3 = 9 + 1 = 10$$
Continue...

If $x_1 = 1$, $x_2 = 1$, $x_3 = 1$ and $x_4 = 1$, then
$$\sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 1 + 1 + 1 + 1 = 4$$
Continue...

 Since all values of xi are equal to 1, we could
alternatively write the above as
$$\sum_{i=1}^{4} 1 = 1 + 1 + 1 + 1 = 4$$
 Equation (6)
Continue...

 We can generalize the sum in Equation (6) by putting
as the upper limit an unknown number N in place of
the number 4. Then it is not difficult to see that
$$\sum_{i=1}^{N} 1 = 1 + 1 + 1 + \dots + 1 = N$$
 Equation (7)
Continue...

 Similarly, for a constant c,
$$\sum_{i=1}^{N} c = c + c + c + c + \dots + c = Nc$$
 Equation (8)
 But the statement
$$\sum_{c=1}^{N} c = Nc$$
 is not true. Why? It is because now c is the index of
summation and varies from 1 to N. The correct
expression is
$$\sum_{c=1}^{N} c = 1 + 2 + 3 + 4 + \dots + N$$
 Equation (9)
 The index c first takes the value 1, then 2, then 3 and
finally it takes the value N.
Continue...

 Suppose now instead of N values we have 100
values. Then
$$\sum_{c=1}^{100} c = 1 + 2 + 3 + 4 + \dots + 100 = \frac{100 \times 101}{2} = 5050$$
 Equation (10)
Continue...

 In general we have a formula for summing an index
of summation
$$\sum_{c=1}^{N} c = 1 + 2 + 3 + 4 + \dots + N = \frac{N(N+1)}{2}$$
 Equation (11)
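The closed form can be compared with adding the terms one by one. The Python sketch below uses an illustrative function name `sum_of_index`, and the constant 7 and the count 12 in the second check are arbitrary values chosen only for demonstration.

```python
def sum_of_index(n):
    """Closed form N(N + 1) / 2 for 1 + 2 + ... + N."""
    return n * (n + 1) // 2

# Compare the term-by-term sum with the closed form for N = 100.
print(sum(range(1, 101)), sum_of_index(100))

# Summing a constant c a total of N times gives N * c.
c, count = 7, 12          # arbitrary illustrative values
print(sum(c for _ in range(count)), count * c)
```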
Continue...

 A sum can be represented in different ways if we
appropriately adjust the lower limit, the upper
limit and the index of summation. For example

$$\sum_{i=1}^{3} x_i = \sum_{i=0}^{2} x_{i+1} = \sum_{i=2}^{4} x_{i-1} = x_1 + x_2 + x_3$$
With i = 1, 2, 3:
$$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3$$
With i = 0, 1, 2:
$$\sum_{i=0}^{2} x_{i+1} = x_{(0+1)} + x_{(1+1)} + x_{(2+1)} = x_1 + x_2 + x_3$$
With i = 2, 3, 4:
$$\sum_{i=2}^{4} x_{i-1} = x_{(2-1)} + x_{(3-1)} + x_{(4-1)} = x_1 + x_2 + x_3$$
Summation of Squares and
Cross Products

 Consider measurements of two variables of interest,
say weight and height.
 Let these variables be denoted by x and y
respectively.
 Suppose five measurements of each variable are
taken and these are represented as x1, x2, x3, x4 and x5
for weights and y1, y2, y3, y4 and y5 for heights.
Continue

 Then
$$\sum_{j=1}^{5} x_j^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2$$
$$\sum_{j=1}^{5} y_j^2 = y_1^2 + y_2^2 + y_3^2 + y_4^2 + y_5^2$$
 are called the sums of squares of x and y respectively


Continue...

 The sum of the cross products is
$$\sum_{j=1}^{5} x_j y_j = x_1 y_1 + x_2 y_2 + x_3 y_3 + x_4 y_4 + x_5 y_5$$
Continue...

 Example:
 Let x1=6, x2=1, x3=2, y1=5, y2=3 and y3=4. Evaluate
the following sums
$$\sum_{i=1}^{3} x_i^2 \quad \text{and} \quad \sum_{i=1}^{3} x_i y_i$$
 This is done as follows:
Continue
$$\sum_{i=1}^{3} x_i^2 = x_1^2 + x_2^2 + x_3^2 = 6^2 + 1^2 + 2^2 = 36 + 1 + 4 = 41$$
Continue...
$$\sum_{i=1}^{3} x_i y_i = x_1 y_1 + x_2 y_2 + x_3 y_3 = (6)(5) + (1)(3) + (2)(4) = 30 + 3 + 8 = 41$$
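Both sums can be checked with a couple of lines of Python (variable names are illustrative):

```python
x = [6, 1, 2]
y = [5, 3, 4]

sum_of_squares = sum(xi ** 2 for xi in x)                      # 6^2 + 1^2 + 2^2
sum_of_cross_products = sum(xi * yi for xi, yi in zip(x, y))   # 6*5 + 1*3 + 2*4

print(sum_of_squares, sum_of_cross_products)
```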
Operational Rules for
Summation

 There are three basic operational rules which help to
simplify the use of the sigma notation.
 1. For any integer N

$$\sum_{i=1}^{N} (x_i + y_i) = \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i$$
Continue...

 2. If c is a constant, that is, does not depend on the
index of summation i, then
$$\sum_{i=1}^{N} c\,x_i = c \sum_{i=1}^{N} x_i$$
and
$$\sum_{i=1}^{N} c = Nc$$
DEFINITION OF THE
MEASURES OF LOCATION

 The graphical representation of data gives us an
idea of the shape of the distribution (population).
 The mean, median and mode, which we shall define
in this section, provide estimates for the centre of the
distribution
Continue...

 The mean (or the arithmetic mean, to give its full
title) can be defined as the value which each item in
the distribution would have if all the values were
shared out equally among the items.
 For instance, if three people had M2, M3 and M7
respectively, the mean amount would be M4, i.e.
M12 shared equally between the three people.
Continue...

 Given a set of n values x1, x2,.., xn, the arithmetic
mean is defined as
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
 Equation (5.1)
 If the n values are a sample then the arithmetic mean
is also defined as the sample mean. Equation 5.1 can
also be written as:
Continue...

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Equation (5.2)
Example

 A food inspector examined a random sample of 7 cans of
a certain brand of tuna to determine the percent of
foreign impurities.
 The following data were recorded: 1.8, 2.1, 1.7, 1.6, 0.9,
2.7 and 1.8. Compute the sample mean.
Continue...

 The sample mean is
$$\bar{x} = \frac{1.8 + 2.1 + 1.7 + 1.6 + 0.9 + 2.7 + 1.8}{7} = 1.8\%$$
 Equation (5.3)
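The same calculation in Python, as a minimal sketch with illustrative variable names:

```python
impurities = [1.8, 2.1, 1.7, 1.6, 0.9, 2.7, 1.8]   # percent foreign impurities

# Sample mean = (1/n) * sum of the observations.
sample_mean = sum(impurities) / len(impurities)
print(round(sample_mean, 2))
```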
Median

 We define the median as the middle value of a set of
ordered ungrouped data. For grouped data the median
is the point which splits the frequency distribution in
such a way that 50% of the values are smaller than
that point.
Example

The nicotine contents for a random sample of 6
cigarettes of a certain brand are found to be 2.3, 2.7,
2.5, 2.9, 3.1 and 1.9 milligrams. Find the median.
 If we arrange these nicotine contents in increasing
order of magnitude, we get
 1.9 2.3 2.5 2.7 2.9 3.1 and the
median is then the mean of 2.5 and 2.7. Therefore the
median would be
$$\tilde{x} = \frac{2.5 + 2.7}{2} = 2.6 \text{ milligrams}$$
Mode

 We define the mode as the sample value with the
highest frequency. In other words, the mode can be
defined as the point of maximum frequency density.
Example

 The number of incorrect answers on a true-false
competency test for a random sample of 15 students
were recorded as follows:
 2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4 and 2.
 The mode for these data is 3
Bi-modal data:

 In a case where we have two modes, the
data is said to be bi-modal.
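The mode of the true-false test example can be found with Python's standard `statistics.multimode` (available from Python 3.8), which returns every value attaining the highest frequency. A result with two entries would indicate bi-modal data.

```python
import statistics

wrong_answers = [2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4, 2]

# List of all values that occur with the highest frequency.
modes = statistics.multimode(wrong_answers)
print(modes)

# A small made-up list with two modes, for comparison (illustrative data only).
print(statistics.multimode([1, 1, 2, 2, 3]))
```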
Geometric Mean and
Harmonic Means

 Apart from the arithmetic mean there are two other
means, though they are not used as often as the other averages.
 These are the geometric mean and the harmonic mean.
Continue...

 The Geometric mean of a set of n observations x1, x2,
..., xn is defined as
$$\bar{x}_G = (x_1 \cdot x_2 \cdots x_n)^{1/n}$$
 For example the geometric mean of 6 and 8 is
$$\bar{x}_G = (6 \times 8)^{1/2} = 6.9282$$
Harmonic Mean

 The Harmonic Mean of a set of observations x1, x2,...xn,
is defined by

$$\bar{x}_H = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
 It is most frequently used in averaging speeds for
various distances covered, where the distance
remains constant.
Examples

 The following examples illustrate the typical use of
the harmonic mean:
 1. A car travels from point A to B at an average speed
of 60km/h and returns at an average speed of
40km/h. What is the average speed for the entire
journey?
 2. Three Basotho athletes took part in the Comrades
marathon in May this year. Their average speeds
were recorded as 15km/h, 20km/h and 10km/h.
What was the average speed of the Lesotho athletes?
Continue...
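For the first example:
$$\bar{x}_H = \frac{2}{\sum_{i=1}^{2} \frac{1}{x_i}} = \frac{2}{\frac{1}{60} + \frac{1}{40}} = 48 \text{ km/h}$$
Therefore, the average speed of the entire journey is
48km/h.
For the second example:
$$\bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{3}{\frac{1}{15} + \frac{1}{20} + \frac{1}{10}} = 13.846 \text{ km/h}$$
Therefore, the average speed for the Lesotho athletes is
13.846km/h.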
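These averages can be checked with Python's standard `statistics` module (`harmonic_mean` and, from Python 3.8, `geometric_mean`). The variable names below are illustrative.

```python
import statistics

# Example 1: the same distance out and back at 60 km/h and 40 km/h.
journey = statistics.harmonic_mean([60, 40])
print(round(journey, 3))

# Example 2: three athletes with average speeds 15, 20 and 10 km/h.
athletes = statistics.harmonic_mean([15, 20, 10])
print(round(athletes, 3))

# Geometric mean of 6 and 8, i.e. the square root of 6 * 8.
geo = statistics.geometric_mean([6, 8])
print(round(geo, 4))
```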
CALCULATION OF
AVERAGES FOR GROUPED
DATA

 Suppose that instead of a raw set of data we have a
grouped frequency distribution, how do we calculate
the measures of average?
 Consider the following general frequency
distribution with k classes and with a total of
observations equal to n:
Table 5.1: A general
Frequency Distribution

Class   Class Mark   Frequency   Product
1       x1           f1          x1f1
2       x2           f2          x2f2
3       x3           f3          x3f3
.       .            .           .
.       .            .           .
.       .            .           .
k       xk           fk          xkfk
Total                $\sum_{i=1}^{k} f_i = n$   $\sum_{i=1}^{k} x_i f_i$
Example
Class limits   Frequency   Class Marks   Product
30 - 39 3 34.5 103.5
40 - 49 1 44.5 44.5
50 – 59 8 54.5 436
60 – 69 10 64.5 645
70 – 79 7 74.5 521.5
80 – 89 7 84.5 591.5
90 – 99 4 94.5 378
Total 40 2720
Continue...

 The sample mean of the above frequency
distribution is defined as
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i} = \frac{1}{n} \sum_{i=1}^{k} x_i f_i = \frac{2720}{40} = 68$$
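The grouped mean of the 40 short-term investments can be recomputed from the class limits and frequencies. This is a Python sketch with illustrative variable names; class marks are taken as the midpoints of the class limits.

```python
class_limits = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
freqs = [3, 1, 8, 10, 7, 7, 4]

marks = [(lower + upper) / 2 for lower, upper in class_limits]   # class marks x_i
products = [x * f for x, f in zip(marks, freqs)]                 # products x_i * f_i

n = sum(freqs)
grouped_mean = sum(products) / n
print(sum(products), n, grouped_mean)
```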
Calculation of a
weighted average

 If the frequencies fi, i=1, 2, 3, ..., k are replaced by
numbers wi, i=1, 2, 3, ..., k called weights, whose
values represent the relative importance of the
classes or variables, then the resulting mean is called a weighted
average.
 It is used to find the mean of a data set whose values
are not equally represented.
Continue...

It is defined as
$$\bar{x}_w = \frac{\sum_{i=1}^{k} x_i w_i}{\sum_{i=1}^{k} w_i}$$
The important point to remember about a mean
computed from such weights is that, instead of
dividing by the number of items, one divides by the
sum of weights.
Example

 The following example illustrates how a weighted
average could be calculated
 What is the average score for a student who received
grades 85, 76 and 82 on 3 tests and 79 on the final
examination in a certain course if the final
examination counts 3 times as much as each of the
three tests?
Continue…

 xi wi xiwi
 85 1 85
 76 1 76
 82 1 82
 79 3 237
 Total 6 480
Continue...

 Thus, $\bar{x}_w = \frac{480}{6} = 80$ points
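The weighted average of the test scores can be computed as in the following Python sketch (variable names are illustrative):

```python
scores = [85, 76, 82, 79]     # three tests, then the final examination
weights = [1, 1, 1, 3]        # the final counts three times as much as one test

weighted_sum = sum(x * w for x, w in zip(scores, weights))
weighted_mean = weighted_sum / sum(weights)    # divide by the sum of the weights
print(weighted_sum, sum(weights), weighted_mean)
```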
The Mode

 For a symmetric frequency distribution, the mode is
defined as the class mark of the modal class (the class
with the highest frequency). In general the mode is
defined as:
$$x_{\text{mode}} = l + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w$$
Continue...

 Where l = the lower class boundary of the modal
class,
 Δ1 = the frequency of the modal class minus the
frequency of the class immediately before the modal
class ($f_m - f_{m-1}$),
 Δ2 = the frequency of the modal class minus the frequency of
the class immediately after the modal class
($f_m - f_{m+1}$),
 w is the class width.
Example

i       Class       Frequency, fi
1       1.5 – 1.9   3
2       2.0 – 2.4   10
3       2.5 – 2.9   18
4       3.0 – 3.4   10
5       3.5 – 3.9   7
6       4.0 – 4.4   2
Total               50
Continue…

 We calculate the mode of the frequency distribution
given in the above table. The modal class is 2.5 – 2.9
since it is the class with the highest frequency.
 Thus l = 2.45, Δ1 = 18-10 = 8,
 Δ2 = 18 - 10 = 8 and w = 0.5
 Hence
$$x_{\text{mode}} = 2.45 + \frac{8}{8 + 8} \times 0.5 = 2.45 + \frac{4}{16} = 2.7$$
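The grouped-data mode formula can be wrapped in a small Python function. The name `grouped_mode` and its parameter names are illustrative.

```python
def grouped_mode(lower_boundary, f_modal, f_before, f_after, width):
    """Mode of grouped data: l + d1 / (d1 + d2) * w."""
    d1 = f_modal - f_before      # modal class minus the class before it
    d2 = f_modal - f_after       # modal class minus the class after it
    return lower_boundary + d1 / (d1 + d2) * width

# Modal class 2.5 - 2.9 (frequency 18), neighbours with frequency 10 each.
mode = grouped_mode(lower_boundary=2.45, f_modal=18, f_before=10, f_after=10, width=0.5)
print(round(mode, 4))
```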
Median

The median for grouped data is calculated using the
following formula
$$x_{\text{median}} = l + \frac{\frac{n}{2} - F}{f} \times w$$
Where l = the lower class boundary of the median
class,
n = total frequency,
F = the cumulative frequency of the previous class,
f = the frequency of the median class and
w = the class width
Continue
i       Class       Class Mark, xi   Frequency, fi   Cumulative Frequency
1       1.5 – 1.9   1.7              3               3
2       2.0 – 2.4   2.2              10              13
3       2.5 – 2.9   2.7              18              31
4       3.0 – 3.4   3.2              10              41
5       3.5 – 3.9   3.7              7               48
6       4.0 – 4.4   4.2              2               50
Total                                50
Continue…
$$x_{\text{median}} = 2.45 + \frac{\frac{50}{2} - 13}{18} \times 0.5 = 2.45 + (0.6667 \times 0.5) = 2.45 + 0.3333 = 2.7833$$
MEASURES OF DISPERSION
OR VARIATION

 A measure of variation is a way of indicating how
dispersed a set of observations is.
 We need to know whether observations are close
together or well spread out.
 It is quite possible to have two sets of observations
with the same mean or median that differ
considerably in the variability of their measurements
about the average.
Example

 Consider the following measurements, in liters, for
two samples of orange juice bottled by companies A
and B.
 Sample A 97 100 94 103 106
 Sample B 106 101 88 91 114
 Both samples have the same mean, 100. But it is quite
clear that company A bottled orange juice with more
uniform content than company B. We say that the
variability or dispersion of the observations from the
average is less for sample A than for sample B.
Continue…

 In this section we are going to discuss the following
common measures of dispersion (variation):
 range
 mean deviation
 variance and standard deviation and
 Coefficient of variation.
The Range

 The range is the simplest measure of dispersion.
 The range for a set of data is defined as the difference
between the largest and the smallest value.
 The range of 19, 19, 19, 19, 19 is zero since the largest
value – smallest value = 19-19 = 0.
 The range of 19, 18, 17, 19, 18, 20 is 20 – 17 = 3.
The Mean Deviation

 The mean deviation is defined to be the arithmetic
mean of the absolute deviations from the mean.
More precisely, for a sample of n values x1, x2, ..., xn, the mean
deviation is defined as
$$D = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
Continue…

 Where $\bar{x}$ is the sample mean. The number $|x_i - \bar{x}|$ is
called the absolute deviation of the ith value from the
sample mean.
Example

 The mean deviation of the sample 19, 19, 19, 19, 19 is
0
 The mean deviation of the sample 18, 21, 20, 17, 19 is
$$D = \frac{|18 - 19| + |21 - 19| + |20 - 19| + |17 - 19| + |19 - 19|}{5} = \frac{6}{5} = 1.2$$
 since $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 19$
 Thus as expected, there is more variability in the sample 18,
21, 20, 17, 19 than in the sample 19, 19, 19, 19, 19.
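Both mean deviations can be verified with a short Python function (the name `mean_deviation` is illustrative):

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

print(mean_deviation([19, 19, 19, 19, 19]))
print(mean_deviation([18, 21, 20, 17, 19]))
```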
Mean deviation for
grouped data.

 The mean deviation for grouped data with class
marks xi and frequencies fi (i = 1, ..., k, where k is the number of
classes) is calculated using the following formula:
$$D = \frac{1}{n} \sum_{i=1}^{k} f_i |x_i - \bar{x}|$$
Example

 Calculate the mean deviation of the following
distribution:
Class limits Frequency f i
10-14 3
15-19 5
20-24 7
25-29 4
30-34 2
Total 21
Example (mean x̄ = 447/21 = 21.2857)

Class limits   Frequency fi   xi   xi*fi   |xi – x̄|   |xi – x̄|*fi
10-14          3              12   36      9.286       27.858
15-19          5              17   85      4.286       21.430
20-24          7              22   154     0.714       4.998
25-29          4              27   108     5.714       22.856
30-34          2              32   64      10.714      21.428
Total          21                  447                 98.57

Continue…

$$D = \frac{1}{n} \sum_{i=1}^{k} f_i |x_i - \bar{x}| = \frac{98.57}{21} = 4.694$$
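The grouped mean deviation can be recomputed without intermediate rounding. This is a Python sketch with illustrative variable names; class marks are taken as midpoints of the class limits.

```python
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]
freqs = [3, 5, 7, 4, 2]

marks = [(lo + hi) / 2 for lo, hi in class_limits]        # class marks x_i
n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, marks)) / n

# Weighted average of the absolute deviations of the class marks from the mean.
mean_dev = sum(f * abs(x - mean) for f, x in zip(freqs, marks)) / n
print(round(mean, 4), round(mean_dev, 3))
```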
The Variance and Standard
Deviation

 Like the mean, the variance and the standard
deviation are parameters of fundamental importance
in statistics. The variance for a set of observations,
x1, x2, ..., xn
 is defined as
$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
 Or
$$S^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2$$
Continue…

 And the standard deviation is defined as the square
root of the variance, that is,
$$S = \sqrt{S^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
 Or
$$S = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2}$$
Example

 Clearly the standard deviation of 19, 19, 19, 19, 19 is
zero.
 To calculate the standard deviation of 18, 21, 20, 17,
19, we find the mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{18 + 21 + 20 + 17 + 19}{5} = 19$$
Continue…

 The table below can help us with the other calculations
xi   xi – x̄   (xi – x̄)²
18 -1 1
21 2 4
20 1 1
17 -2 4
19 0 0
Continue…
$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{5}(1 + 4 + 1 + 4 + 0) = 2.0$$
 Hence
$$S = \sqrt{2} = 1.414$$
Continue…

 Equivalently we can use the other formula
$$S^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2$$
$$= \frac{1}{5}\left(18^2 + 21^2 + 20^2 + 17^2 + 19^2\right) - 19^2$$
$$= \frac{1}{5}(324 + 441 + 400 + 289 + 361) - 361$$
$$= \frac{1}{5}(1815) - 361$$
$$= 363 - 361 = 2$$
Continue….

 The variance for grouped data with class marks xi and
frequencies fi (i = 1, ..., k) is calculated using the formula
$$S^2 = \frac{1}{n} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2$$
n
Continue…

 An alternative formula for the variance for grouped
data is:

$$S^2 = \frac{1}{n} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i=1}^{k} f_i x_i^2 - \bar{x}^2$$
Example

 Find an estimate of the variance and standard
deviation of the following data for the marks
obtained in a test of 88 students.
Marks ($x$)   Frequency ($f_i$)
0-9           6
10-19         16
20-29         24
30-39         25
40-49         17
Continue…

The mean is $\bar{x} = 28.0227$.

Class     Frequency $f_i$   Class mark $x_i$   $x_i f_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$   $f_i (x_i - \bar{x})^2$
0-9       6                 4.5                27          -23.5227          553.3174              3319.9044
10-19     16                14.5               232         -13.5227          182.8634              2925.8144
20-29     24                24.5               588         -3.5227           12.4094               297.8256
30-39     25                34.5               862.5       6.4773            41.9554               1048.885
40-49     17                44.5               756.5       16.4773           271.5014              4615.5238
Total     88                                   2466                                                12207.9532


Continue…

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{5} x_i f_i = \frac{2466}{88} = 28.0227$$
Continue…

$$S^2 = \frac{1}{n}\sum_{i=1}^{5} f_i \left( x_i - \bar{x} \right)^2 = \frac{1}{88}(12207.9532) = 138.727$$
Continue…

$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{5} f_i \left( x_i - \bar{x} \right)^2} = \sqrt{138.727} = 11.778$$
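The grouped calculation for the 88 test marks can be reproduced with a few lines of Python. This is a sketch under the assumption that each class is represented by its class mark (midpoint); the function name `grouped_variance` is illustrative only.

```python
import math


def grouped_variance(class_marks, frequencies):
    """Return (mean, variance) for grouped data, using divisor n = sum of frequencies."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(class_marks, frequencies)) / n
    var = sum(f * (x - mean) ** 2 for x, f in zip(class_marks, frequencies)) / n
    return mean, var


# Class marks and frequencies from the marks example above.
marks = [4.5, 14.5, 24.5, 34.5, 44.5]
freqs = [6, 16, 24, 25, 17]

mean, var = grouped_variance(marks, freqs)
sd = math.sqrt(var)
print(f"mean = {mean:.4f}, variance = {var:.3f}, standard deviation = {sd:.3f}")
```

The sum of squared deviations computed at full precision differs from the table total in the third decimal place, because the table rounds the mean to 28.0227 first; the variance and standard deviation still agree to three decimal places.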
Coefficient of Variation

 The coefficient of variation (CV) or relative
standard deviation (RSD) is the sample standard
deviation expressed as a percentage of the mean.
 i.e.
$$CV = \frac{s}{\bar{x}} \times 100\%$$
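As an illustration, the coefficient of variation can be computed for the test marks example, taking the standard deviation 11.778 and the mean 28.0227 found earlier. The helper name `coefficient_of_variation` is a hypothetical choice for this sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        # The ratio is undefined when the mean is zero.
        raise ValueError("CV is undefined for a mean of zero")
    return std_dev / mean * 100


# Values taken from the grouped marks example.
cv_marks = coefficient_of_variation(11.778, 28.0227)
print(f"CV = {cv_marks:.2f}%")
```

Because the CV is a ratio, it has no units, which makes it convenient for comparing the spread of data sets measured on different scales.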
Measures of Association

 In this topic, we introduce methods for investigating
relationships between two statistical variables.
 Given sample values of two variables, we need
measures based on these samples that tell us
whether there is any association between the
variables.
 In other words, bivariate analysis examines the way
in which the characteristics of one variable are
associated with the characteristics of another
variable.
Continue…

 The following are some examples of questions open
to bivariate analysis:
 Is educational attainment associated with race?
 Is drug use associated with income?
 Does religious affiliation vary by geographical
location?
 Is crime associated with concentrated poverty?
Continue…

 All these questions involve comparing two variables
to see if there is an association of one variable with
the other.
 In this chapter, we will learn how to determine the
extent to which two variables are associated with one
another. We will focus only on relationships between
continuous variables, which involves computing
correlations and a simple introduction to regression.
Correlations

 They are designed to measure the strength of the
relationship between two continuous variables.
 Generally when social scientists discuss correlation
they are referring to Pearson’s correlation
coefficient.
 Pearson’s correlation coefficient lies
between -1 and +1. The closer the correlation is to
either -1 or +1, the stronger the relationship
between the two variables.
 A correlation of 0 indicates that there is no
linear association between the two variables.
 Negative and positive signs indicate the direction of the
relationship.
Continue…

 POSITIVE CORRELATION: an increase in values
for one variable is associated with an increase in
the values for the other variable, for example,
as height increases so does shoe size.
 NEGATIVE CORRELATION: an increase in
values for one variable is associated with a
decrease in values for the other variable, for
example, as temperature falls, the use of
electricity for heating increases.
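Continue…

$$r = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}$$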
Example

 Here are the number of hours that 10 students spent
studying for a final exam (x) and their score on that
exam(y).
 Hrs 7 8 4 9 13 5 9 6 16 3
 Score 70 76 57 77 91 66 82 64 96 50
 Calculate the correlation coefficient r for these data.
Solution

$x$      $y$      $xy$     $x^2$    $y^2$
7        70       490      49       4900
8        76       608      64       5776
4        57       228      16       3249
9        77       693      81       5929
13       91       1183     169      8281
5        66       330      25       4356
9        82       738      81       6724
6        64       384      36       4096
16       96       1536     256      9216
3        50       150      9        2500
Totals:
80       729      6340     786      55027
Continue…

 For our formula, $\sum x = 80$, $\sum y = 729$, $\sum xy = 6340$,
$\sum x^2 = 786$ and $\sum y^2 = 55027$, with $n = 10$.
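Continue…

$$\begin{aligned}
r &= \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}} \\
&= \frac{6340 - \dfrac{80 \times 729}{10}}{\sqrt{\left(786 - \dfrac{80^2}{10}\right)\left(55027 - \dfrac{729^2}{10}\right)}} \\
&= 0.969
\end{aligned}$$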
Linear Regression

 If we can determine that there is a linear correlation
between two variables, then the behavior of those
variables can be described graphically by a straight
line.
 In this section we learn how to find the equation of
the line that best fits a set of data.
 We will go on to use that equation to predict the
value of one of the variables for a particular value of
the other variable.
 A regression line is a line that best fits a set of data.
Continue…

 The general formula of a regression line is $\hat{y} = a + bx$.
 In this equation, $\hat{y}$, which is read as “y hat,” is
the predicted value of y for a given value of x. The
slope of the line is b, and we calculate it first. We
then use the value of b to help calculate a, which is
the y-intercept of the line.
Continue…

 Here are the formulas to calculate a and b.
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
$$a = \bar{y} - b\bar{x}$$
Example

 Here are the scores of five randomly selected
students on test 1 and 2 in a Statistics class. Find the
equation of the regression line treating the score on
test 1 as x and the score on test 2 as y.
Student   Test 1 score   Test 2 score
1         83             82
2         86             84
3         76             63
4         92             83
5         71             55
Solution

 We begin by creating a scatter plot, to make sure that
the data seem to have a linear relationship.
Continue…

$x$      $y$      $xy$     $x^2$
83       82       6806     6889
86       84       7224     7396
76       63       4788     5776
92       83       7636     8464
71       55       3905     5041
Totals:
408      367      30359    33566
Continue…

 So
$$\begin{aligned}
b &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \\
&= \frac{5(30359) - (408)(367)}{5(33566) - 408^2} \\
&= \frac{2059}{1366} \\
&= 1.5073
\end{aligned}$$
 The slope of the regression line is 1.5073.
This tells us that for each one-point increase on test 1, the
predicted score on test 2 increases by 1.5073 points.
Continue…

 Now we calculate the y-intercept a.
$$\bar{x} = \frac{\sum x}{n} = \frac{408}{5} = 81.6 \qquad \bar{y} = \frac{\sum y}{n} = \frac{367}{5} = 73.4$$
$$a = \bar{y} - b\bar{x} = 73.4 - (1.5073)(81.6) = -49.5957$$


Continue…

 The regression equation is given by
$$\hat{y} = -49.5957 + 1.5073x$$
 where x represents a student’s score on test 1 and $\hat{y}$ is
the predicted score for that student on test 2.
 We can use the regression line to predict a y-value
for a given x-value.
 Suppose that a student got a score of 95 on test 1.
What score can we expect for that student on test 2?
Continue…

 All we need to do is plug in 95 for x in the regression
equation:
$$\begin{aligned}
\hat{y} &= -49.5957 + 1.5073x \\
&= -49.5957 + 1.5073(95) \\
&= -49.5957 + 143.1935 \\
&= 93.5978
\end{aligned}$$
 The predicted score from the regression equation is
93.5978. We could round this to produce a predicted
score of 94. This seems to fit the pattern of the data.
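The whole regression example can be retraced with a short Python sketch. The function name `fit_line` is hypothetical, and the data are the five (test 1, test 2) pairs used in the solution table, where the first student's test 2 score is 82.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y_hat = a + b * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = y_bar - b * x_bar
    return a, b


test1 = [83, 86, 76, 92, 71]
test2 = [82, 84, 63, 83, 55]

a, b = fit_line(test1, test2)
predicted = a + b * 95  # predicted test 2 score for a test 1 score of 95
print(f"a = {a:.4f}, b = {b:.4f}, prediction = {predicted:.2f}")
```

The code keeps the slope at full precision, so its intercept and prediction differ from the hand calculation in the third decimal place, where b was rounded to 1.5073 before computing a. Both round to a predicted score of 94.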
