NSTA 51516 Slides

The document provides an overview of statistics, data types, sampling methods, and data collection techniques. It explains the differences between primary and secondary data, as well as various sampling strategies such as probability and non-probability sampling. Additionally, it discusses measurement scales, random variables, and methods for presenting data, including frequency distributions and stem-and-leaf displays.

Uploaded by

zulukhanyo73
Copyright
© © All Rights Reserved
• Statistics: this is the science of data. In simple terms, statistics can be defined as
the processing of raw data into summary measures that aid decision-making
by highlighting the important information.
• Data: these are raw, unprocessed facts that on their own carry little useful or usable
information.
Consider the following example:
• Data on students who study at SPU consist of their Age, Address, and
Gender.
• Statistics on students at SPU are as follows: 40% are male and 60% are female; those
aged between 19 and 25 constitute 80% and the remaining 20% are older than
25. About 70% are from NC and the remaining 30% are from the neighboring
towns.
• Statistics supports evidence-based decisions by helping managers,
policy makers or business executives to address questions or
problems with confidence. For instance,
management decision making based on statistical analysis proceeds as follows:
1. We have data on our subject of interest.
2. We perform statistical analysis on the data.
3. We obtain information from the analysis.
4. A decision is made based on that information.
Consider the following scenario:
• Suppose you are a data analyst at a certain company. The company
wishes to expand its business. What steps can you follow to assist
the company in making good decisions?
Random variable: this is a real-valued function defined on a sample space. It can also be
defined as the variable of "interest" that we collect our data on. Random
variables are mostly denoted by capital letters.
Population: this is the set of all possible outcomes of our random variable.
For instance, the population is all students at SPU.
Sample: this is the subset of the population that is used to represent the
whole population. It is often costly and time-consuming to work
with the whole population, hence the sample. An example of a sample
is the first-year students at SPU who stay off campus.
Sampling unit: this is the unit we collect data on. For instance, it can be a
student from the sample above.
There are 2 ways in which a sample can be selected from the population.
a) Probability Sampling: in probability sampling, each element of the population has a known,
non-zero probability of being selected. Because selection is random, this strategy makes it likely that our sample is
representative of the entire population.
• For instance, suppose there are 500 students at SPU. If all 500 students have an equal chance of being
participants in your study, the resulting sample is a probability sample. To put it another
way, probability sampling employs random sampling procedures to select a sample.
• In a population of 100 people, for example, each person has a one-in-a-hundred chance of being chosen on a single draw.
This sampling approach makes a representative sample of the entire population likely, although it cannot guarantee one.
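Simple random sampling: this is the method of selecting a sample from the population in such a
way that all elements of the population stand an equal chance of being
selected.
The probability that the first unit drawn belongs to a particular sample of n units is
$$\frac{n}{N}$$
The probability for the second unit is
$$\frac{n-1}{N-1}$$
The probability for the last unit is
$$\frac{1}{N-n+1}$$
Hence, the probability of selecting a particular sample of n units from N population units is
$$\frac{n}{N}\cdot\frac{n-1}{N-1}\cdots\frac{1}{N-n+1} = \frac{n!\,(N-n)!}{N!} = \frac{1}{{}^{N}C_{n}}$$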
The above method is called sampling without replacement.
The other way we can sample is with replacement and with this method,
the probability is the same throughout the sampling process. This is
because after picking one unit, it is taken back into the population before
the next pick is made.
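The without-replacement result can be checked numerically. The sketch below multiplies the draw-by-draw probabilities and compares the product with 1 divided by the number of combinations. The function name `prob_specific_sample` and the values N = 85 and n = 15 are illustrative assumptions, not part of the slides.

```python
from fractions import Fraction
from math import comb

def prob_specific_sample(N, n):
    """Multiply n/N * (n-1)/(N-1) * ... * 1/(N-n+1) using exact fractions."""
    p = Fraction(1)
    for i in range(n):
        p *= Fraction(n - i, N - i)
    return p

# Illustrative sizes (assumed): a sample of 15 from a population of 85
p = prob_specific_sample(85, 15)
print(p == Fraction(1, comb(85, 15)))
```

Using exact fractions avoids rounding error, so the equality test is meaningful.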
The selection can be done using computer-generated random numbers or a random
number table. When we use the random number table, we follow
the process below.
To choose a random sample of 15 people from a total of 85, each subject must
be numbered from 01 to 85. Then, by closing your eyes and placing your finger
on a number on the table, choose a starting number. (Although this may appear
weird, it allows us to generate a random starting number.) Assume your finger
landed on the number 12 in the second column in this scenario. (It's the sixth
number from the top on the list.) Then go down the list until you've chosen 15
distinct numbers between 01 and 85. Go to the top of the following column
when you reach the bottom of the previous column. If you land on a number
greater than 85, on 00, or on a number already chosen, skip it. We'll use
subjects 12, 27, 75, 62, 57, 13, 31, 06, 16, 49, 46, 71, 53, 41, and 02 in our
example.
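The same selection can be made with computer-generated random numbers instead of a table. The following is a minimal sketch using Python's standard `random` module. The seed value is an arbitrary assumption, used only so that a run can be repeated, and the subjects it picks will differ from the table-based example above.

```python
import random

population = list(range(1, 86))   # subjects numbered 01 to 85
rng = random.Random(2024)         # arbitrary seed (assumed), for repeatability only

# Sampling without replacement: 15 distinct subjects
sample = rng.sample(population, 15)
print(sorted(sample))
```

Because `sample` draws without replacement, duplicates and out-of-range numbers never need to be skipped by hand.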
Researchers obtain systematic samples by numbering each subject in
the population and then picking every kth individual. Let's say there were
2000 people in the population and a sample of 50 people was required.
Because 2000/50 = 40, k = 40 and every 40th subject would be chosen;
however, the first subject (numbered between 1 and 40)
would be chosen at random. Assume subject 12 was chosen first; the
sample would then consist of subjects with numbers 12, 52, 92, and so
on until 50 subjects were acquired. When using systematic sampling,
it's important to pay attention to how the population's subjects are
numbered.
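The systematic procedure described above can be sketched as follows. The function name `systematic_sample` is an illustrative assumption; the population size, sample size, and starting subject 12 come from the example in the text.

```python
import random

N, n = 2000, 50
k = N // n                      # sampling interval: 2000 / 50 = 40

def systematic_sample(start, k, n):
    """Return n subject numbers: start, start + k, start + 2k, ..."""
    return [start + i * k for i in range(n)]

start = random.randint(1, k)    # random starting subject between 1 and 40
sample = systematic_sample(start, k, n)

# With the starting subject 12 used in the text:
print(systematic_sample(12, k, n)[:3])   # [12, 52, 92]
```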
Researchers produce stratified samples by dividing the population into
groups (referred to as strata) based on a study-relevant trait and then
sampling from each group. Within each stratum, samples should be
chosen at random. Let's say the VC of SPU wants to know how students
feel about a particular topic. In addition, the VC wants to examine if first-
year students' attitudes differ from those of second-year students.
Students from each group will be chosen at random by the VC to be
included in the sample.
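A stratified draw like the VC's survey could be sketched as below. The student numbers, stratum sizes, sample size per stratum, and seed are all made-up assumptions for illustration; only the idea of sampling at random within each stratum comes from the text.

```python
import random

# Hypothetical student numbers for each stratum (made-up data)
strata = {
    "first_year": list(range(1, 301)),
    "second_year": list(range(301, 501)),
}

def stratified_sample(strata, size_per_stratum, rng):
    """Draw a simple random sample of the given size within each stratum."""
    return {name: rng.sample(units, size_per_stratum)
            for name, units in strata.items()}

rng = random.Random(1)   # arbitrary seed (assumed), for repeatability only
sample = stratified_sample(strata, 20, rng)
print({name: len(chosen) for name, chosen in sample.items()})
```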
Cluster samples are also used by researchers. The population is
separated into clusters by some factor, such as geographic area or
schools in a large school district. The researcher then chooses
some of these clusters at random and uses all members of those
clusters as sample subjects. Let's say a researcher wants to conduct a
survey of apartment inhabitants in a big city. If there are ten apartment
buildings in the city, the researcher can choose two at random from the
ten and interview all the tenants. Cluster sampling is used when the
population is very large or when its members live across a large
geographic area.
Non-probability sampling, in contrast to probability sampling, draws
the sample using non-randomized methods. The majority of non-
probability sampling methods entail judgment. Instead of randomization,
individuals are chosen based on their accessibility. Your classmates and
friends, for example, have a larger probability of being included in your
sample. Non-probability sampling is a helpful and handy method of
selecting a sample, particularly in circumstances where it is the only method
available.
Qualitative random variable: this generates categorical (non-numeric) response
data. Examples of categorical data are gender, education
levels, the province one comes from, etc.
Quantitative random variable: this generates numerical response data.
Examples are age, shoe size, height, and weight.
• Numerical data can be classified as:
• Discrete data: these are whole numbers or integers. For
example, age in years, number of cars, number of books, and number of
laptops.
• Continuous data: these are numbers that can take any value in an interval. For
instance, weight, height, and time.
Scales of Measurement
The measurement scale helps in deciding on the appropriate statistical approach and indicates how much
arithmetic manipulation is possible on the data. There are 4 measurement scales:
• Nominal: When we use numbers solely to categorize the outcomes of a variable, we use the nominal scale.
For example, a "male" could be 1 and a "female" could be 0, but this number assignment is clearly arbitrary: a
female could be 100 and a male could be 0. Another example is how we code provinces and languages.
• Ordinal: this is associated with categorical data where the ranking of the categories is meaningful. That
is, each category is either more or less than the previous one.
For instance, clothes sizes may be categorized as 1: small, 2: medium, 3: large, etc. Income categories may be 1: low
income, 2: middle income, and 3: high income.
• Interval: The difference between the numbers, as well as their relative order, is meaningful. The concept of unit
distance is used in this scale, so the difference between any two numbers can be stated as a number of units. The
interval scale has a zero point, but its location is arbitrary.
Good examples of interval scales are the Fahrenheit and Celsius temperature scales. The zero points and unit
distances are different in both. A change in scale, location, or both does not break the principle of an interval scale.
• Ratio: When the ratio between two numbers is meaningful as well as the interval size, the ratio scale is used. This
means that it is fair to remark that one number is twice as large as another.
This is not possible on an interval scale: 80°F is not twice as "hot" as 40°F. Measured on the Celsius
scale, these two temperatures are about 27°C and 4°C, respectively, and 27°C is not twice 4°C. Height, weight, and age
measurements are examples of situations where ratio scales are applicable. The majority of the statistical methods
we'll cover demand that the variable be measured on an interval scale at the very least.
• Primary data: this is data we collect for the first time for the specific task we
wish to carry out. It can come from available records, such as employee
registrations or sales invoices, or from different kinds of surveys. Primary data
has the advantage of being highly relevant since it is collected with a specific aim.
The disadvantage of this source of data is that it is expensive and takes time to
obtain.
• Secondary data: this is data that already exists because it was collected previously
for other purposes. Examples are quarterly reports, first semester marks, and last year's
rainfall report.
• Advantages of this data are that it is readily available and hence easier to access.
Also, it is less expensive to obtain.
• Disadvantages are that the data was collected with a different objective in mind,
hence it may not be specific to our current problem. It may be outdated, and its
accuracy may be difficult to assess.
Starting Salary    Frequency
 5 000              4
 6 000              1
 7 000              3
 8 000              5
 9 000              8
10 000             10
11 000              2
13 000              5
14 000              6
15 000              1
• Observation: this is the observation of the respondent or of the process in action. This may include
observing traffic, the behavior of students in class, how employees work, and pedestrian flow.
• The advantage is that the respondent is unaware of being observed and therefore acts
normally, which reduces bias in the data.
• The disadvantage is that this is a passive form of data collection, hence the respondent cannot be
questioned as to why they are doing what they are doing.
• Surveys: This is the most used method to collect data and it is done through questionnaires. The
questionnaire is administered to the respondent by directly asking them questions.
• There are different types of surveys: personal interviews, telephone interviews, and e-surveys.
• Experimentation: This is when data is collected by conducting experiments. Here some conditions are
controlled while other variables are manipulated. An example is changing advertising platforms
to push product sales.
• The advantage of this method is that it produces high-quality data that is likely to be accurate. This leads to reliable
and valid statistical results.
• The disadvantage is that it is costly and time-consuming, with some complications when certain variables are to be
controlled.
[Figure: bar chart "Starting Salaries of Graduates"; horizontal axis: Salaries (5000 to 15000), vertical axis: Frequencies]
• Suppose we have the following information. From the data on the starting salaries of
graduates, we can categorise the people by gender as follows:

Males      15
Females    30

• We can present this as a Pie Chart as follows:

[Figure: pie chart "Gender" with two slices, Males 15 and Females 30]
• A stem-and-leaf display is a shorthand notation for expressing the values in
ascending order, from lowest to highest. We may gain a sense of the typical value, or
center, of the set of data, as well as how the values are spread, from the
presentation.
• Each value is broken into two parts—a "stem" and a "leaf"—to create a stem-and-
leaf display. The first portion of the number is the stem, and the latter part of the
number is the leaf. Although the values can be split in a variety of ways depending on
the types of values you're working with, we normally use the leaf to represent the
value's last digit and the stem to represent the value's preceding digits. The value 46,
for example, has a stem of 4 and a leaf of 6. The stem is 19 and the leaf is 2 for the
value 192.
• When constructing a stem-and-leaf display, we start by locating the lowest and
highest values in the data set before generating the actual stem-and-leaf display.
This gives us the first and last stems for our display. Write down
each possible stem in a vertical column, beginning with the lowest and ending
with the highest. Then, on the row holding its stem, write each value's leaf. After
you've done this for each value, sort the leaves in each row from lowest to highest.
• Consider the following test scores:
83 86 65 94 88 51 76 75 86 64 91
47 71 48 68 45 83 76 92 82 96 82
71 56 79 90 92 76 74 98 75 69
Construct the stem-and-leaf display for the above scores.
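The construction steps for a stem-and-leaf display can be sketched in code. This is one possible approach, assuming two-digit values so that the stem is the tens digit and the leaf is the units digit; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

scores = [83, 86, 65, 94, 88, 51, 76, 75, 86, 64, 91,
          47, 71, 48, 68, 45, 83, 76, 92, 82, 96, 82,
          71, 56, 79, 90, 92, 76, 74, 98, 75, 69]

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digits)."""
    display = defaultdict(list)
    for v in sorted(values):
        display[v // 10].append(v % 10)
    # Include every stem between the lowest and highest, even if empty
    return {s: display.get(s, [])
            for s in range(min(display), max(display) + 1)}

for stem, leaves in stem_and_leaf(scores).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Sorting the values first means each row of leaves is already in ascending order.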

• The following are the ages of people who stay at an old age home around Bloemfontein.
Construct a stem and leaf display for them.

43 68 46 43 34 59 55 42 73 71 73
75 71 84 43 36 68 54 77 76 62 59
62 69 31 60 82 88 68 59 81 51 75
79 51 62 41 50 74 82 61 56 66 75
52 59 44 58 51 58 72 62 72 50
• A frequency distribution is another tool we may use to summarize and evaluate data. We
divide the data into classes and count the number of values that fall in each class.
• Consider the following frequency distribution of 32 test scores from one module.

Scores Frequency

40-50 3
50-60 2
60-70 4
70-80 9
80-90 7
90-100 7
The first class, 40-50, consists of values from 40 up to, but not including, 50. A value of
50 would be counted in the second class, 50-60. In the first class, 40 is referred to as
the lower class limit, whereas 50 is the upper class limit. The upper class limit is not
actually included in the class but is the boundary point for beginning the next class.
The class size for a given class is the distance from its lower limit to its upper limit.
$$\text{class size} = \text{upper class limit} - \text{lower class limit}$$
The class size for each class in the previous example is 10 (50 - 40 = 10, 60 - 50 = 10,
etc.).
The class mark for a given class is the midpoint of the class.
$$\text{class mark} = \frac{\text{lower class limit} + \text{upper class limit}}{2}$$
In the above example, the class mark for the first class is (40 + 50)/2 = 45.
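As a hedged illustration, the sketch below builds a frequency distribution with class marks from raw values. It assumes that the 32 test scores listed in the stem-and-leaf exercise are the data behind the table above (their counts per class match the table); the function name `frequency_distribution` is illustrative.

```python
scores = [83, 86, 65, 94, 88, 51, 76, 75, 86, 64, 91,
          47, 71, 48, 68, 45, 83, 76, 92, 82, 96, 82,
          71, 56, 79, 90, 92, 76, 74, 98, 75, 69]

def frequency_distribution(values, lower, upper, size):
    """Count values in classes [lower, lower + size), ..., ending at upper.

    Each row is (lower limit, upper limit, class mark, frequency).
    The upper limit is excluded from its class, as described in the text.
    """
    table = []
    for lo in range(lower, upper, size):
        hi = lo + size
        freq = sum(1 for v in values if lo <= v < hi)
        table.append((lo, hi, (lo + hi) / 2, freq))
    return table

for lo, hi, mark, freq in frequency_distribution(scores, 40, 100, 10):
    print(f"{lo}-{hi}  class mark {mark}  frequency {freq}")
```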
There are 2 rules to follow when creating a frequency distribution.
• The classes must, first and foremost, be collectively exhaustive. This implies that each value in the data
collection must fall into one of the classes. Consider the frequency distribution in Table 1 below. There are
some values that will not appear in the distribution table because they do not belong in any of the classes.
Suppose we had a value of 51: in which class would it belong?
• The second rule is that the classes must be mutually exclusive. This means that no two classes may
overlap. In Table 2, the classes overlap. Suppose we had a value of 63: in which class would it
be, given that it fits in both the second class and the third class?

Table 1                    Table 2
Score     Frequency        Score     Frequency
40-48                      40-55
52-65                      50-65
65-73                      60-75
76-89                      70-85
89-100                     85-100
• The ideal way to satisfy both conditions is to make sure that the lowest value belongs
in the first class, the highest value belongs in the last class, and the upper limit of one
class equals the lower limit of the next. This eliminates both overlaps and
missing values.
• Do not use too many or too few classes. Too many classes break the data up
too much, while too few classes do not break it up enough; 5 to 20 classes
are recommended.
• Use equal class sizes, which enables us to compare frequencies easily. We cannot
fairly compare the frequency of a class of size 5 to that of a class of size 10.
• Another thing to think about is avoiding open-ended classes. An open-ended class is
one that lacks one of its limits. If we were looking at household incomes,
"R100,000 and up" would be considered an open-ended class. "Below R20,000", on the
other hand, is not an open-ended class because a household's income cannot be
less than R0. As a result, R0 is the true lower limit for that class, and the class
should be written as R0-R20,000.
Use the data in example 1 to construct a
frequency distribution.
• The graph constructed using cumulative
frequencies is called an ogive.
• We will use the test score example to construct
this graph.
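The cumulative frequencies that an ogive plots can be computed as in the sketch below, which uses the test score frequency distribution given earlier. Pairing each cumulative count with the upper class limit is a common convention and is an assumption here, since the slides do not show the plotted points.

```python
from itertools import accumulate

upper_limits = [50, 60, 70, 80, 90, 100]
frequencies = [3, 2, 4, 9, 7, 7]

# Running total of the class frequencies
cumulative = list(accumulate(frequencies))

# Points that would be plotted: (upper class limit, cumulative frequency)
for limit, total in zip(upper_limits, cumulative):
    print(limit, total)
```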
• A histogram is the graph that is used to present the frequency distribution.
• The horizontal axis shows the class limits while the vertical axis shows the frequencies.
As an example, we will consider the test score data above and construct the histogram.

The relationship between two numerical variables can be depicted
using the following graphs:
• Scatter plot: it displays the data of 2 numerical random variables on an x-y graph.
For example, we can construct a scatter plot of the number of hours spent studying for an
exam against the marks obtained.
• Trendline graph: data is plotted over time. For example, consider the amount of
electricity used over 12 months.
• A measure of central tendency is a single value that gives us an idea of the center of the data, that is, where most of our data
points fall.
1. Mean for ungrouped data
• Sample mean is calculated as
$$\bar{x} = \frac{\sum x}{n}$$
• Population mean is calculated as
$$\mu = \frac{\sum x}{N}$$
Note that when calculating the population mean we divide by capital N because we are using all population units, while for the sample mean we divide by small n, the sample size.
Example 1
• A manager of a local restaurant is interested in the number of people who eat there on Fridays. Here are the totals for nine
randomly selected Fridays.
712 626 600 596 655 682 642 532 526

1. Calculate the mean of these data.


2. Is this a sample or a population mean? Why?
Here are the number of electricity units a student
accommodation landlord used monthly for the whole
of last academic year.
1470 1304 1548 1352 1115 1883 1500 1553 2053
1491
What is the mean number of units used?
Is this a population mean or sample mean?
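The mean calculation for the two data sets above can be sketched as follows. The function name `mean_of` is illustrative; the arithmetic is the same for a sample and a population, and only the interpretation (and the symbol) differs.

```python
def mean_of(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

diners = [712, 626, 600, 596, 655, 682, 642, 532, 526]
units = [1470, 1304, 1548, 1352, 1115, 1883, 1500, 1553, 2053, 1491]

print(mean_of(diners))   # 619.0
print(mean_of(units))    # 1526.9
```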
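Calculating mean for grouped data
$$\bar{x} = \frac{\sum x_i f_i}{n}$$
Where: $f_i$ is the frequency of the ith class
$x_i$ is the midpoint of the ith class
n is the total number of values (the sum of the frequencies)

Example
• Use example 1 data to find the mean of the grouped data.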
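As an illustration of the grouped mean formula, the sketch below applies it to the test score frequency distribution given earlier (not to the example 1 data, which is left as the exercise). The function name `grouped_mean` and the tuple layout are assumptions made for this sketch.

```python
# (lower limit, upper limit, frequency) for the test score distribution
classes = [(40, 50, 3), (50, 60, 2), (60, 70, 4),
           (70, 80, 9), (80, 90, 7), (90, 100, 7)]

def grouped_mean(classes):
    """Sum of (class midpoint * frequency) divided by the total frequency."""
    n = sum(f for _, _, f in classes)
    total = sum(((lo + hi) / 2) * f for lo, hi, f in classes)
    return total / n

print(grouped_mean(classes))   # 76.25
```

The grouped mean is an approximation, because every value in a class is treated as if it equalled the class midpoint.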
• After the values have been ordered from lowest to highest, the median of a set of
data is the value that divides the set into two equal groups. Where is the
median of a road? In the middle of the road.
• When looking for a median, there are two possibilities. If there is an odd number
of values in the data set, there will be one value in the center, and this
value is the median.
• If the number of values in the data set is even, however, there is no single
value in the middle. In this situation we take the mean of the two values in the
center.

Example
• Using the above data on the number of people and the units of electricity, find the
median
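The two cases for the median (odd and even number of values) can be sketched as below, applied to the restaurant and electricity data sets given earlier. The function name `median_of` is illustrative.

```python
def median_of(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # odd count: one value in the center
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count

diners = [712, 626, 600, 596, 655, 682, 642, 532, 526]
units = [1470, 1304, 1548, 1352, 1115, 1883, 1500, 1553, 2053, 1491]

print(median_of(diners))   # 626
print(median_of(units))    # 1495.5
```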
The formula used to calculate the median for grouped data is as follows:
$$M_c = O_{me} + \frac{c\left[\frac{n}{2} - f(<)\right]}{f_{me}}$$
Where:
$M_c$ is the median of the grouped data
$O_{me}$ is the lower limit of the median interval
c is the class width
n is the sample size
$f_{me}$ is the frequency count of the median interval
$f(<)$ is the cumulative frequency count of all intervals before the median interval

Example
Use the data on the test scores to calculate the mean and the median.
For the median, n/2 = 16, so the median interval is 70-80 and
$O_{me} = 70$, $c = 10$, $f_{me} = 9$, $f(<) = 3 + 2 + 4 = 9$
$$M_c = 70 + \frac{10\left(\frac{32}{2} - 9\right)}{9} = 77.78$$
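The grouped median calculation can be sketched in code using the test score frequency distribution. The function name `grouped_median` and the way the median interval is located (the first class whose cumulative frequency reaches n/2) are assumptions of this sketch.

```python
# (lower limit, upper limit, frequency) for the test score distribution
classes = [(40, 50, 3), (50, 60, 2), (60, 70, 4),
           (70, 80, 9), (80, 90, 7), (90, 100, 7)]

def grouped_median(classes):
    """Lower limit + class width * (n/2 - cumulative before) / class frequency."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                      # f(<): frequency before the current class
    for lo, hi, f in classes:
        if f > 0 and cumulative + f >= n / 2:
            return lo + (hi - lo) * (n / 2 - cumulative) / f
        cumulative += f

print(round(grouped_median(classes), 2))   # 77.78
```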
Weights    Frequency
10-20       0
20-30       7
30-40      12
40-50      23
50-60      16
60-70       2
• The mode is the value in the data set that appears the most frequently. It is also referred to as the most
typical case.
• Unimodal data is defined as a set of data having only one value that occurs with the greatest frequency.
• When two values occur with the same maximum frequency in a data collection, both values are regarded
as the mode, and the data set is called bimodal.
• A data set is considered to be multimodal if it contains more than two values that occur with the same
maximum frequency. Each value is utilized as the mode.
• The data set is said to have no mode if no data value appears more than once. There can be more than
one mode in a data collection, or none at all.
Properties and uses of the Mode
1. When the most common scenario is wanted, the mode is employed.
2. The mode is the most straightforward average to calculate.
3. When the data is nominal or categorical, such as religious preference, gender, or political affiliation, the
mode can be employed.
4. The mode isn't necessarily unique. A data set may contain more than one mode, or it may not have any at
all.
Examples
1. Find the mode for the number of branches that six banks have.
401, 344, 209, 201, 227, 353
2. The data show the number of licensed cars in Sol Plaatje for a recent 15-year period. Find the mode.
104 104 104 104 104 107 109 109 109 110 109 111 112 111 109
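Finding the mode, including the "no mode" and bimodal cases described above, can be sketched as follows for the two data sets in these exercises. The function name `modes` is illustrative, and returning an empty list when no value repeats follows the definition in the text rather than any library convention.

```python
from collections import Counter

def modes(values):
    """Return all values with the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                   # no value appears more than once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

branches = [401, 344, 209, 201, 227, 353]
cars = [104, 104, 104, 104, 104, 107, 109, 109,
        109, 110, 109, 111, 112, 111, 109]

print(modes(branches))   # []
print(modes(cars))       # [104, 109]
```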
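The formula used to calculate the mode for grouped data is as follows:
$$M_o = O_{mo} + \frac{c\,(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}}$$
Where:
$O_{mo}$ is the lower limit of the modal interval
c is the class width
$f_m$ is the frequency of the modal interval
$f_{m-1}$ is the frequency of the interval before the modal interval
$f_{m+1}$ is the frequency of the interval after the modal interval

For the test score data:
• $O_{mo} = 70$
• $c = 10$
• $f_m = 9$
• $f_{m-1} = 4$
• $f_{m+1} = 7$

$$M_o = 70 + \frac{10\,(9 - 4)}{2 \times 9 - 4 - 7} = 77.14$$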
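The grouped mode calculation can be sketched as below for the test score distribution. The function name `grouped_mode` is illustrative, the sketch assumes a single modal class, and treating a missing neighbouring class as frequency 0 is an assumption of this sketch rather than something stated in the slides.

```python
# (lower limit, upper limit, frequency) for the test score distribution
classes = [(40, 50, 3), (50, 60, 2), (60, 70, 4),
           (70, 80, 9), (80, 90, 7), (90, 100, 7)]

def grouped_mode(classes):
    """Lower limit + c * (f_m - f_prev) / (2 * f_m - f_prev - f_next)."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                 # position of the modal class
    lo, hi, f_m = classes[m]
    # Assumption: a class with no neighbour is treated as having frequency 0
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    return lo + (hi - lo) * (f_m - f_prev) / (2 * f_m - f_prev - f_next)

print(round(grouped_mode(classes), 2))   # 77.14
```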
• The median divides a set of data in half. Quartiles are the ideal
measures to use if we want to divide a set of data into quarters.
• The value that divides the first quarter of a data set from the
remainder is known as the first quartile, or $Q_1$.
• The value that divides the last quarter of a data set from the rest is
the third quartile, abbreviated $Q_3$.
• A set of data is divided into two equal groups by the median. The
"median" of the first group is the first quartile, while the "median"
of the second group is the third quartile.
Here are the maths scores for 19 randomly selected students. Find the median, as well as the first quartile and third quartile.
480 370 540 660 650 710 470 490 630 390 430 320 470 400 430 570 450 470 530
Solution
• The first step is to put the values in ascending order and find the median as before.
• Since there are 19 values, we will have two groups of nine values with one value left in the middle. That middle value is the
median.
• First Group: 320 370 390 400 430 430 450 470 470 Median: 470
• Second Group: 480 490 530 540 570 630 650 660 710
• The median is 470. Now to find the first quartile, we need to find the "median" of the first group of nine values. This first group
will be broken into two groups of four values with one value left in the middle. That middle value is the first quartile, $Q_1$.
320 370 390 400 430 430 450 470 470
• The first quartile is 430. This is the score that separates the first quarter of the values from the rest.
• To find the third quartile, $Q_3$, we repeat the same procedure with the second group of nine values.
480 490 530 540 570 630 650 660 710
• The third quartile is 570. This is the score that separates the last quarter of the data from the rest.
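The "median of each half" procedure used in this solution can be sketched as follows. The function names are illustrative. Note that statistical software often uses other quartile conventions, so library results may differ slightly from this method.

```python
def median_of(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count the middle value belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median_of(lower), median_of(ordered), median_of(upper)

scores = [480, 370, 540, 660, 650, 710, 470, 490, 630, 390,
          430, 320, 470, 400, 430, 570, 450, 470, 530]
print(quartiles(scores))   # (430, 470, 570)
```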

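The formulas used to calculate the quartiles for grouped data are as follows:
$$Q_1 = O_{Q_1} + \frac{c\left(\frac{n}{4} - f(<)\right)}{f_{Q_1}}$$
$$Q_3 = O_{Q_3} + \frac{c\left(\frac{3n}{4} - f(<)\right)}{f_{Q_3}}$$
Where:
$O_{Q_1}$ and $O_{Q_3}$ are the lower limits of the $Q_1$ and $Q_3$ intervals
$f_{Q_1}$ and $f_{Q_3}$ are the frequencies of the $Q_1$ and $Q_3$ intervals respectively
$f(<)$ is the cumulative frequency count of all intervals before the quartile interval
c is the class width and n is the sample size, as in the median formula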
• Measures of Dispersion: When trying to describe a set of data, measures of central
tendency are necessary, but they aren't enough. Two sets of data can be centered in
the same location and still be completely different. Another essential factor to
consider is the dispersion or spread of the data. The values are closely packed together in certain
sets, whereas they are widely separated in others.
• For instance, here are the five test scores of two statistics students.
Jack 85 70 55 41 99
Jill 72 68 70 65 75
Each student's average score is 70; however, the two sets of scores are not the same. Jill's scores
are very close together; none of them differs by more than 5 points from her mean. Jack's scores are much more
spread out. Two of his scores are 29 points away from his mean (99 is 29 points above it and 41 is 29 points below it). On the next test, we
should expect Jill to score closer to 70 than Jack.
The range is the first measure of dispersion we'll look at. We subtract the lowest value from the highest value to find the range for a set of
data.

𝑅𝑎𝑛𝑔𝑒 = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 − 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒


𝑅 = 𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛
Consider the following data of steps per day recorded on someone's phone. What is the range of the number of steps recorded?
1470 1304 1548 1352 1115
1883 1500 1553 2053 1491
• Answer
The highest value is 2053 steps, and the lowest is 1115 steps
Therefore,
𝑅 = 2053 − 1115 = 938

The range indicates the distance between the lowest and highest values. The range has the drawback of being sensitive to outliers. If there is
an outlier in a collection of data, the range uses it in its calculations. Another issue with range is that two sets of data can be distributed in
completely different ways while still having the same range.
• Here are the scores of two golfers from last month’s matches.
Jack 71 72 73 71 73 82 70 72 68
Greg 75 73 77 78 78 81 74 71 85
Both golfers have a 14-stroke range, yet their scores are spread out very
differently. Jack's scores are mostly in the range of 68 to 73, with an 82 as an
anomaly, and Greg's scores range from 71 to 85.
The range should only be used as a starting point when looking at the dispersion of
a set of data. Although it can give us an indication of how dispersed the values are,
it cannot provide us with a complete picture.
The interquartile range, or the distance between the first and third quartiles, is another measure
of dispersion.
Rather than showing us how far apart the two extreme values are, it shows us the width of the range in which
the middle 50% of the values can be found. It is not affected by outliers.
$$IQR = Q_3 - Q_1$$
Example:
Let's consider the quartile example with the 19 maths scores.
We found the following:
median = 470
first quartile = 430
third quartile = 570
Therefore,
$$IQR = 570 - 430 = 140$$
• To calculate the mean deviation for a set of data, first determine the distance between each value
and the mean. The mean of these distances is then calculated. This is the mean deviation of the
data.
• In a nutshell, it tells us how far the values are, on average, from the data's center. Because
distance is a nonnegative quantity, we use the absolute value of the difference between the value and
the mean to calculate each distance. The formula is as follows:
$$MD = \frac{\sum |x_i - \bar{x}|}{n}$$
• Calculating Mean Deviation
Although this calculation employs sample notation, the process for calculating the mean deviation of a
population is the same. Simply change $\bar{x}$ to $\mu$ and n to N.
The steps for this computation are as follows.
1. Determine the mean.
2. Subtract the mean from each value.
3. Calculate each difference's absolute value.
4. Add up the distances.
5. Divide the total distance by the number of values in the set.
1. An Uber owner is interested in the number of fares for his drivers on
Fridays. He randomly selects seven drivers, and then randomly selects
one Friday for each of the drivers. Here are the number of fares for
each. Find the mean deviation for these totals.
32 27 30 41 29 38 34

2. Here is a list of the JSE’s daily volumes for one week of trading, in
millions of shares. Find the mean deviation for these values.
669 754 752 771 835
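The five calculation steps for the mean deviation can be sketched as below, applied to the two exercises above. The function name `mean_deviation` is illustrative.

```python
def mean_deviation(values):
    """Average absolute distance of the values from their mean."""
    n = len(values)
    centre = sum(values) / n                          # step 1: the mean
    distances = [abs(x - centre) for x in values]     # steps 2 and 3
    return sum(distances) / n                         # steps 4 and 5

fares = [32, 27, 30, 41, 29, 38, 34]
volumes = [669, 754, 752, 771, 835]

print(mean_deviation(fares))               # 4.0
print(round(mean_deviation(volumes), 2))   # 37.44
```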
• Variance is another measure of dispersion that, like mean deviation, measures spread from the center of the data outward. With two exceptions, variance is like mean deviation.
• The first difference between these two measures is that instead of calculating the absolute value, we square the difference between each value and
the mean.
• Another distinction is that there are two formulas to use depending on whether we're looking for a sample or population variance. The two
formulas are listed below.
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{(sample variance)}$$
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{(population variance)}$$
• Calculating Variance
• It's worth noting that the two formulas differ in their denominators. The sample variance formula requires us to subtract 1 from the sample size, whereas the
population variance formula uses the population size as the denominator without subtracting 1. Why is there a distinction? When 1
is subtracted from the sample size, the sample variance becomes an unbiased estimator of the population variance.
• We must be able to distinguish whether a set of data is a sample or a population. Using the sample formula on a population gives a variance
that is too large. Using the population formula on a sample gives a variance that is too small.


• The standard deviation is the most common measure of dispersion that we will use in this
course. The square root of the variance is the standard deviation of a set of data.
$$s = \sqrt{s^2} \quad \text{(sample standard deviation)}$$
$$\sigma = \sqrt{\sigma^2} \quad \text{(population standard deviation)}$$
Examples:
1. Using the Uber owner example, find the standard deviation.
2. A student is interested in how old women are when they first get married. To estimate the
mean age, she goes into 11 randomly selected chat rooms, and asks randomly selected
women how old they were at their first marriage until she gets a response in each room. Here
are the 11 ages.
Find the standard deviation of these ages.
21 20 19 16 22 21 21 19 18 24 18
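The sample and population formulas for variance and standard deviation can be compared side by side with the sketch below, which uses the seven fares from the Uber owner example. The function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    N = len(values)
    mean = sum(values) / N
    return sum((x - mean) ** 2 for x in values) / N

fares = [32, 27, 30, 41, 29, 38, 34]

s2 = sample_variance(fares)
print(round(s2, 2), round(math.sqrt(s2), 2))   # 25.33 5.03
```

The fares are a sample of drivers, so the sample formula (dividing by n - 1) is the appropriate one here; the population function is included only to show the difference in the denominator.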
The coefficient of variation (CV) is a measure of relative variability, and its formula is as follows:
$$CV = \frac{\text{standard deviation}}{\text{mean}} \times 100\% = \frac{s}{\bar{x}} \times 100\%$$
It is expressed as a percentage because it states the variability of the random variable relative to its mean.
This assists in comparing variability across different samples.
The smaller the CV, the more concentrated the data values are around their
mean; a larger CV indicates that the data values are widely dispersed about their
mean.
Examples:
From the above 2 examples, calculate the CV.
Consider the following data on the number of children born
in Free State hospitals on Christmas Day over the last 15 years.
26 26 24 23 23 23 25 23
22 22 23 22 23 25 24

Calculate coefficient of variation and interpret it.
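The coefficient of variation can be sketched as below for the Uber fares and the Christmas births data. The function name `coefficient_of_variation` is illustrative, and using the sample standard deviation (`statistics.stdev`) is an assumption that treats both data sets as samples.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

fares = [32, 27, 30, 41, 29, 38, 34]
births = [26, 26, 24, 23, 23, 23, 25, 23,
          22, 22, 23, 22, 23, 25, 24]

print(round(coefficient_of_variation(fares), 2))    # 15.25
print(round(coefficient_of_variation(births), 2))   # 5.73
```

The births vary much less relative to their mean than the fares do, even though the two variables are measured in different units.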


• Inferential statistics frequently employ
the standard deviation as a metric of
dispersion. Here are three examples of
how it can be used.
• The first is the skewness of the data.
When a set of data isn't symmetrical, it's
said to be skewed. This is a histogram
made from a roughly symmetrical
collection of data.
• The preceding histogram's peak is positioned in the center.
• If the histogram is stretched to the left or right, as in the following two histograms, the data set is skewed.
• Negatively skewed data is data that is stretched to the left. If the values to the left of the median are more
spread out than the values to the right of the median, the data set is negatively skewed. A set of data can
be negatively skewed if there are a few low outliers. The mean will be lower than the median in this case.
(Why?)
• We call a set of data positively skewed if it is stretched to the right.

• The coefficient of skewness assesses the skewness of a set of numerical data. Its formula is
as follows,
$$Sk_p = \frac{n \sum (x_i - \bar{x})^3}{(n-1)(n-2)\, s^3}$$
We interpret this metric in the following manner:
• If $Sk_p = 0$, we have a symmetrical histogram, and this implies $\bar{x} = M_c = M_o$.
• If $Sk_p > 0$, that is a positively skewed histogram and $\bar{x} > M_c$.
• If $Sk_p < 0$, that is a negatively skewed histogram and $\bar{x} < M_c$.
• Here are the ages of ten randomly selected women at college
orientation. Find the coefficient of skewness for these data.
18 25 31 19 22 21 19 25 18 27
• Here are the ages of the 16 full-time faculty members in the
commerce department at SPU. Calculate the coefficient of skewness
for these ages.
• Is the skewness positive or negative?
• Is the skewness moderate or severe?
27 42 40 59 32 28 29 30 32 45 27 44 37 52 54 26
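The coefficient of skewness formula above can be sketched in Python as follows. The function name `skewness_coefficient` is hypothetical, and the sketch assumes that s in the formula is the sample standard deviation (divisor n − 1). It is applied here to the ten orientation ages.

```python
import math


def skewness_coefficient(data):
    """Return n * sum((x - mean)^3) / ((n - 1) * (n - 2) * s^3)."""
    n = len(data)
    if n < 3:
        raise ValueError("at least three observations are needed")
    mean = sum(data) / n
    # Assumption: s is the sample standard deviation (divisor n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    cubed_deviations = sum((x - mean) ** 3 for x in data)
    return n * cubed_deviations / ((n - 1) * (n - 2) * s ** 3)


# Ages of ten randomly selected women at college orientation.
orientation_ages = [18, 25, 31, 19, 22, 21, 19, 25, 18, 27]
skp = skewness_coefficient(orientation_ages)

direction = "positive" if skp > 0 else "negative" if skp < 0 else "none"
print(f"Sk_p = {skp:.3f} (skewness: {direction})")
```

The same function can be called on the faculty ages to answer the questions about the sign and severity of the skewness.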
• An outlier is a value that is significantly different from the bulk of values in a dataset. In a dataset
of battery lifetimes with x = 16, 12, 2, 15, 13, 11, and 12 hours, for example, x = 2 hours is an
outlier. Similarly, for the variable x = monthly household electricity usage, x = 1922 kWh is an
outlier in a dataset of x = 326, 412, 1922, 296, 314, 384, and 370 kWh.
• There are 2 methods that are used to identify outliers.
The z-score approach:
$$z = \frac{x - \bar{x}}{s} \quad \text{or} \quad z = \frac{x - \mu}{\sigma}$$

This is called the standardization of data. This is mostly done when we have data points from
different
• The z-score measures how far a data point deviates from the
data mean, in units of standard deviations.
• When a data value ($x$) has a z-score that is either below −3 or above
+3, it is considered an outlier. This rule of thumb rests on the fact that almost all
values ($x$) of a normally distributed random variable (about 99.7%) lie within 3 standard deviations
of its mean (i.e. $-3 \le z \le +3$). As a result, x-values whose z-scores fall outside
this interval are considered outliers.
• Here is a list of the Johannesburg Stock Exchange’s daily volumes for
one week of trading, in millions of shares. Convert each volume to its
z-score.
669 754 752 771 835
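A short sketch of the z-score conversion for the trading volumes is shown below. It assumes the five days are a sample, so the sample mean and sample standard deviation are used; the names are illustrative.

```python
import statistics

# Daily volumes for one week of trading, in millions of shares.
volumes = [669, 754, 752, 771, 835]

mean_volume = statistics.mean(volumes)
sd_volume = statistics.stdev(volumes)  # sample standard deviation

# z = (x - mean) / s for every data value.
z_scores = [(v - mean_volume) / sd_volume for v in volumes]

for v, z in zip(volumes, z_scores):
    label = "outlier" if abs(z) > 3 else "not an outlier"
    print(f"x = {v}: z = {z:+.3f} ({label})")
```

None of the five volumes lies more than 3 standard deviations from the mean, so the rule of thumb flags no outliers here.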
• A boxplot is another technique to graphically depict a set of data. The
lowest value, the first quartile, the median, the third quartile, and
the highest value are all required to produce a boxplot. These five
values are referred to as the five-number summary of the data set.
We draw a box from the first to the third
quartile above a horizontal axis. At the median, we draw a dashed
line in the box. Finally, from the box, we extend line segments to the
lowest and highest values.

• Find the interquartile range for each of the following sets of values.
1. 45 58 50 47 55 60 40 43 50 55 40 43 48 46 56 46
2. 260 56 65 19 63 74 63 55 105 23 30 49 13 68 31 86 101 91 70 76 15 55 50 98 35 104
57 57 17 107 98 47 49 84 98 74 33
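The interquartile range for the first set of values can be sketched as below. Quartile conventions differ between textbooks and software, so this is an assumption: the sketch uses Python's `statistics.quantiles` with its default "exclusive" method, which places the quartiles at positions (n + 1)/4 and 3(n + 1)/4 in the sorted data. If the course uses a different rule, Q1 and the IQR may differ slightly.

```python
import statistics

# First set of values from the exercise.
values = [45, 58, 50, 47, 55, 60, 40, 43, 50, 55, 40, 43, 48, 46, 56, 46]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, median, q3 = statistics.quantiles(values, n=4)

# Interquartile range: the spread of the middle 50% of the data.
iqr = q3 - q1

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
```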
• Construct a boxplot for a set of data that has the following five-number summary. Be
sure to label both the range and the interquartile range.

Lowest   Q1   Median   Q3   Highest
15       44   51       60   72
• A sample of five college females produced the following heights (in inches). Find the mean deviation for these
heights.
64 67 65 70 62
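A minimal sketch of the mean deviation for these heights follows. It assumes the usual definition, the average of the absolute deviations from the mean; the variable names are illustrative.

```python
# Heights (in inches) of five college females.
heights = [64, 67, 65, 70, 62]

n = len(heights)
mean_height = sum(heights) / n

# Mean deviation: average absolute distance of each value from the mean.
absolute_deviations = [abs(h - mean_height) for h in heights]
mean_deviation = sum(absolute_deviations) / n

print(f"mean = {mean_height:.1f} in, mean deviation = {mean_deviation:.2f} in")
```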







The probability of an event is the proportion of times that the
event occurs in many trials of the experiment.
Consider an experiment of drawing a card from a standard
deck of cards. What is the probability that the card drawn will
be an Ace?
Solution:
A standard deck of cards has 52 cards.
There are 4 Aces in a deck of cards.
So, we are looking for the probability of choosing any of those 4 cards:
$$P(\text{Ace}) = \frac{\text{total number of Aces}}{\text{total number of cards}} = \frac{4}{52} = 0.0769$$
We will look at two types of probability:
• An objective probability: this is a probability that can be proven statistically. This is the one
used in statistical analysis.

Objective probability: $P(A)$

In words: empirical probability of $A$ = (number of times $A$ occurred) / (number of trials)
In algebra: $P(A) = \frac{n_A}{n}$
• A subjective probability generally results from personal judgment. A weather
forecaster often assigns a probability to the event “precipitation.” For example,
“there is a 20% chance of rain today,” or “there is a 70% chance of snow tomorrow.”
In such cases, the only method available for assigning probabilities is personal
judgment. These probability assignments are called subjective probabilities. The
accuracy of subjective probabilities depends on the individual’s ability to correctly
assess the situation.
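The empirical (objective) probability, the number of times A occurred divided by the number of trials, can be illustrated with a small simulation of the card-drawing experiment. This is a sketch under stated assumptions: the seed, the number of trials and all names are arbitrary choices made for the illustration.

```python
import random

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

rng = random.Random(2024)  # fixed seed so the run is repeatable
n_trials = 100_000

# Count how many times the event A = "card is an Ace" occurs.
n_aces = sum(1 for _ in range(n_trials) if rng.choice(deck)[0] == "A")

empirical_p = n_aces / n_trials  # P(A) = n_A / n
theoretical_p = 4 / 52

print(f"empirical P(Ace) = {empirical_p:.4f}, theoretical = {theoretical_p:.4f}")
```

With many trials the empirical proportion settles close to 4/52, which is the sense in which an objective probability can be checked statistically.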
Probability is a number assigned to each member in the sample space. It is denoted by $P(\cdot)$.
A probability function is a rule of correspondence that associates with each event A in
the sample space S a number $P(A)$ such that
1. $0 \le P(A) \le 1$, for any event A.
2. The probabilities of all distinct elementary events sum to 1.
3. If A and B are mutually exclusive events, then
$P(A \text{ or } B) = P(A \cup B) = P(A) + P(B)$
• Sample space: The set of all possible distinct outcomes, denoted by 𝒮 (e.g., 52 cards)
• Elementary event or sample point: a member of the sample space. (e.g., the ace of hearts).
• Event (or event class): any set of elementary events. e.g., Color (Red), or Number (Ace).
Notes:
• Elementary events are equally likely.
• Denote events by roman letters (e.g., A, B, etc.).
• Denote the probability of an event as P(A).
A joint event arises when you consider two (or more) events at a time, e.g., A = heads
on coin 1, B = heads on coin 2, and the joint event is heads on both coins.
Intersection: (A ∩ B) = A and B occur at the same time.
Union: (A ∪ B) = A or B occurs, which covers three cases:
Only A occurs.
Only B occurs.
A and B occur.
The complement of an event is that the event did not occur: $\bar{A} \equiv$ not A. e.g., if A = red
card, then $\bar{A}$ is a black card (not a red card).
Mutually exclusive events are events that cannot occur at the same time.
Such events have no elementary events in common, e.g., A = heart and B = club.
Let A = number card (i.e., 2–10), B = face card (i.e., J, Q, K), and C = Ace.
Probabilities of events:
$P(A) = \frac{9(4)}{52} = \frac{36}{52} = 0.6923$
$P(B) = \frac{3(4)}{52} = \frac{12}{52} = 0.2308$
$P(C) = \frac{1(4)}{52} = \frac{4}{52} = 0.0769$
$P(A) + P(B) + P(C) = 1$
$P(A \cup B) = P(A) + P(B) = 0.6923 + 0.2308 = 0.9231 = \frac{48}{52}$
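These card probabilities can be checked with exact fractions. The sketch below uses Python's `fractions.Fraction` so that no rounding creeps in; the variable names are illustrative.

```python
from fractions import Fraction

suits = 4
deck_size = 52

# A = number card (2-10): 9 ranks; B = face card (J, Q, K): 3 ranks; C = Ace: 1 rank.
p_a = Fraction(9 * suits, deck_size)
p_b = Fraction(3 * suits, deck_size)
p_c = Fraction(1 * suits, deck_size)

# A, B and C are mutually exclusive and cover the whole deck.
total = p_a + p_b + p_c

# A and B are mutually exclusive, so P(A or B) = P(A) + P(B).
p_a_or_b = p_a + p_b

print(f"P(A) = {float(p_a):.4f}, P(B) = {float(p_b):.4f}, P(C) = {float(p_c):.4f}")
print(f"P(A) + P(B) + P(C) = {total}, P(A or B) = {float(p_a_or_b):.4f}")
```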
Rule 1: If 2 events, B & C, are mutually exclusive (i.e., no overlap) then the probability
that one or both occur is
$P(B \text{ or } C) = P(B \cup C) = P(B) + P(C)$

Rule 2: For any 2 events, A & B, the probability that one or both occur is
$P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A \cap B)$
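Rule 2 can be illustrated on the card deck with two events that do overlap. The events "red card" and "Ace" are chosen here purely as an illustration (they are not from the slides), and the helper `probability` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely elementary events


def probability(event):
    """P(event) = favourable cards / total cards, for equally likely cards."""
    return Fraction(len(event), len(deck))


red = {card for card in deck if card[1] in ("hearts", "diamonds")}
aces = {card for card in deck if card[0] == "A"}

# Direct count of the union versus Rule 2.
p_union_direct = probability(red | aces)
p_union_rule = probability(red) + probability(aces) - probability(red & aces)

print(f"P(red or Ace) = {p_union_direct} by counting, {p_union_rule} by Rule 2")
```

Subtracting P(A ∩ B) removes the two red Aces, which would otherwise be counted twice.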
From Table 1, we have:
Elementary event (or “sample point”) is a teacher.
Event is any set of teachers. (e.g., region, level, or combination).
Simple Experiment: Select 1 teacher at random, so

$P(\text{Primary}) = \frac{1106}{1991} = 0.555$

$P(\text{Not Primary}) = P(\text{Secondary}) = \frac{885}{1991} = 0.445$
Events are a Primary teacher from the South & a Primary
teacher from the far central,
$P(\text{Primary in south or far central}) = P(\text{primary, south}) + P(\text{primary, far central})$

$= \frac{240}{1991} + \frac{279}{1991} = 0.121 + 0.140 = 0.261$
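Conditional probability is the probability of an event A given that
we know that event B has occurred.
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A, B)}{P(B)}$$

Example: What is the probability that a teacher is from the South
given that he/she is a primary school teacher?
Solution:
$$P(\text{South} \mid \text{primary}) = \frac{P(\text{primary and south})}{P(\text{primary})} = \frac{240/1991}{1106/1991} = \frac{0.121}{0.555} = 0.218$$
Note that 0.218 comes from the rounded intermediate values; dividing the counts directly, 240/1106, gives 0.217.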
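The conditional probability in this example can be recomputed from the counts quoted above (240 primary teachers in the South, 1106 primary teachers, 1991 teachers in total). The sketch below is illustrative only and uses made-up variable names.

```python
# Counts taken from the worked example.
primary_and_south = 240
primary_total = 1106
all_teachers = 1991

# P(A and B) and P(B), each as a proportion of all teachers.
p_primary_and_south = primary_and_south / all_teachers
p_primary = primary_total / all_teachers

# P(South | primary) = P(primary and south) / P(primary)
p_south_given_primary = p_primary_and_south / p_primary

print(f"P(South | primary) = {p_south_given_primary:.3f}")
```

The total of 1991 cancels, so the result equals 240/1106: conditioning on "primary" simply restricts the sample space to the primary teachers.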
If the conditional and unconditional probabilities are identical, then the two events are independent.
For independent events,
$P(A \mid B) = P(A)$
$P(B \mid A) = P(B)$
$P(A \text{ and } B) = P(A \cap B) = P(A)\,P(B)$ ⇒ the “multiplicative rule”
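Examples
Toss an R1 coin and an R2 coin:
$P(R1 = \text{head} \;\&\; R2 = \text{head}) = P(R1 = \text{head})\,P(R2 = \text{head}) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$
Roll two dice:
$P(\text{die}_1 = 5 \;\&\; \text{die}_2 = 6) = P(\text{die}_1 = 5)\,P(\text{die}_2 = 6) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$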
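The multiplicative rule for the two dice can be verified by listing all 36 equally likely outcomes. This is a small sketch with illustrative names, assuming two fair six-sided dice.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (die1, die2) of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
n_outcomes = len(outcomes)

# Joint event: die1 shows 5 and die2 shows 6.
p_joint = Fraction(sum(1 for d1, d2 in outcomes if d1 == 5 and d2 == 6), n_outcomes)

# Marginal events, each counted over the same sample space.
p_die1_is_5 = Fraction(sum(1 for d1, _ in outcomes if d1 == 5), n_outcomes)
p_die2_is_6 = Fraction(sum(1 for _, d2 in outcomes if d2 == 6), n_outcomes)

print(f"P(joint) = {p_joint}, P(die1 = 5) * P(die2 = 6) = {p_die1_is_5 * p_die2_is_6}")
```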
Permutation: an ordered arrangement of r objects selected from a larger group of n objects. Here the order
of selection is important.
The number of permutations of r objects from n objects is denoted as:
$$_nP_r = \frac{n!}{(n-r)!}$$
Combinations: The number of distinct COMBINATIONS of n objects, taken k at a time, is given by
the ratio
$$_nC_k = \frac{n!}{k!\,(n-k)!} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k!}$$
This quantity is usually written as $\binom{n}{k}$, and read “n choose k”.
Suppose we wish to arrange n = 5 people {a, b, c, d, e}, standing
side by side, for a portrait. How many such distinct portraits
(“permutations”) are possible?
Solution
There are 5 possible choices for which person stands in the first
position (either a, b, c, d, or e). For each of these five
possibilities, there are 4 possible choices left for who is in the
next position. For each of these four possibilities, there are 3
possible choices left for the next position, and so on. Therefore,
there are 5 × 4 × 3 × 2 × 1 = 120 distinct permutations.
This number, 5 × 4 × 3 × 2 × 1 (or equivalently, 1 × 2 × 3 × 4 × 5),
is denoted by the symbol “5!” and read “5 factorial”, so we can
write the answer succinctly as 5! = 120.
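The portrait count can be confirmed by generating every ordering of the five people. The sketch below uses `itertools.permutations` and `math.factorial` from the Python standard library; the variable names are illustrative.

```python
import math
from itertools import permutations

people = ["a", "b", "c", "d", "e"]

# Every distinct left-to-right ordering of the five people.
portraits = list(permutations(people))
n_portraits = len(portraits)

# The same count from the factorial formula n!.
n_factorial = math.factorial(len(people))

print(f"{n_portraits} portraits listed, 5! = {n_factorial}")
print("first three:", portraits[:3])
```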
Example:
Suppose that instead of portraits (“permutations”), we wish to form committees (“combinations”) of k = 3
people from the original n = 5. How many such distinct committees are possible?
Solution:
This time the reasoning is a little subtler. From the previous calculation, we know that the number of permutations of k =
3 from n = 5 is equal to $\frac{5!}{2!} = 60$. Each committee of 3 people can be ordered in 3! = 6 ways, so every committee is counted 6 times among these permutations.
The number of combinations of k = 3 from n = 5 is therefore equal to $\frac{5!}{2!}$ divided by 3!, i.e., 60 ÷ 6 = 10.
This number, $\frac{5!}{3!\,2!}$, is given the compact notation $\binom{5}{3}$, read “5 choose 3”, and corresponds to the number of ways of selecting 3 objects
from 5 objects, regardless of their order. Hence $\binom{5}{3} = 10$.
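The committee count can be checked in the same way. This sketch relies on `math.perm` and `math.comb` (available in Python 3.8 and later) together with `itertools.combinations`; the variable names are illustrative.

```python
import math
from itertools import combinations

people = ["a", "b", "c", "d", "e"]
k = 3

# Ordered selections of 3 from 5: 5! / 2!
n_permutations = math.perm(len(people), k)

# Unordered selections: divide by the 3! orderings of each committee.
n_committees = math.comb(len(people), k)

# Listing the committees explicitly gives the same count.
committees = list(combinations(people, k))

print(f"permutations = {n_permutations}, committees = {n_committees}")
print(committees)
```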
$P(E_1) = P(E_2) = 0.15$, $P(E_3) = 0.4$ and $P(E_4) = 2P(E_5)$
$E_4$ $E_5$

$A = \{E_1, E_3, E_4\}$
$B = \{E_2, E_3\}$