AUCA Descriptive Statistics Lecture 1
DESCRIPTIVE STATISTICS
1. Introduction
Subdivisions of Statistics
Examples:
1 Based on sample information, the pollster predicted that Demosthenes would be
elected. Inferential statistics
2 The population of Rwanda in 1984 was 5 million. Descriptive statistics
3 According to the poll forecast, Demosthenes would get 54.3 percent of the
votes cast. Inferential statistics
4 An engagement approach to learning works: the 29 students who generated
summaries and inferences performed 20 percent better than those who just
memorized. Descriptive statistics
Population: The universe (the set) of all potential observations having a common
characteristic that is being studied and about which the experimenter wishes to
make some general statements or inference.
Sample: A census is practically impossible for an infinite population or for a
very large population. In such cases, the enumeration is restricted to a
limited number of individuals of the population, called a sample.
Experimental unit: Person, thing, event, or any item involved with a statistical
study.
E.g. the height of people, the duration of the exam, the distance from the school to my
house.
Here the people, the exam, and the road from the school to my house are the experimental
units.
Dr. Hategekimana Fidele DESCRIPTIVE STATISTICS
Data Variable Organization and Presentation
Summarizing Data Variable
Sampling: The methods and rules applied to select a typical (representative) sample
of the population are called sampling, also known as sampling methods or sampling
procedures. The main sampling methods are:
1 Simple random sample: Draw a sample of size n from the population of size N in
such a way that every sample of size n has the same chance of being selected. Such a
sample may be drawn with replacement or without replacement;
2 Systematic sampling: Draw n values from the population, starting with the initial
value $x_i$ at the $i$-th position and then taking the values $x_{i+kh}$ at the
$(i+kh)$-th positions, $k = 1, 2, \ldots, n-1$, where h is the sampling period;
3 Stratified random sampling: Subdivide the population into k strata according to
different criteria, and then select from each stratum a simple random sample of size
$n'$. These samples are gathered into a final sample of size $n = kn'$.
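The three sampling methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 0-based `start` index, and the population of 100 labelled units are hypothetical and do not come from the lecture.

```python
import random

def simple_random_sample(population, n, replace=False, rng=random):
    """Draw n items so that every sample of size n is equally likely."""
    if replace:
        return [rng.choice(population) for _ in range(n)]
    return rng.sample(population, n)

def systematic_sample(population, n, start, period):
    """Take the item at 0-based index `start`, then every `period`-th item (n >= 1 items)."""
    indices = [start + k * period for k in range(n)]
    if indices[-1] >= len(population):
        raise ValueError("start + (n - 1) * period exceeds the population size")
    return [population[i] for i in indices]

def stratified_sample(strata, n_per_stratum, rng=random):
    """Draw a simple random sample of size n' from each stratum and pool them."""
    sample = []
    for stratum in strata:
        sample.extend(rng.sample(stratum, n_per_stratum))
    return sample

# Hypothetical population of N = 100 units labelled 1..100.
population = list(range(1, 101))
print(systematic_sample(population, n=5, start=2, period=20))  # [3, 23, 43, 63, 83]
```

With k strata and n' units per stratum, `stratified_sample` returns k·n' units, which matches the final sample size n = kn' given above.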
Variable
Qualitative Variables:
Some characteristics are not capable of being measured in the sense that height,
weight, and age are measured. But they can be categorized or identified by a number
only. Such characteristics are called qualitative variables.
Examples:
health status of a patient
color of the t-shirt
gender of a student
nationality of the person
academic performance (grand distinction, distinction, satisfaction, and failure)
Measurements made on qualitative variables convey information regarding attributes.
Scale of measurements
Definition:
Measurement: Measurement is defined as the assignment of a numerical value to
different experimental units in conformity with a set of rules. Various
scales result from the fact that measurement may be carried out under different sets of
rules.
There exist 4 scales of measurement of variables in statistics:
Nominal scale
Ordinal scale
Interval scale
Ratio scale
The Nominal Scale: The lowest measurement scale is the nominal scale. As the
name implies it consists of “naming” observations or classifying them into various
mutually exclusive and collectively exhaustive categories.
Example:
color of the t-shirt (blue = 1, green = 2, dark = 3, red = 4)
gender of a student (male = 1, female = 2)
nationality of the person (Rwandan = 1, Ugandan = 2, Kenyan = 3, Gabonese =
4, Chadian = 5, Zambian = 6)
The Ordinal Scale: Whenever observations are not only different from category to
category but can also be ranked according to some criterion, they are said to be
measured on an ordinal scale.
Example:
Convalescing patients may be characterized as (unimproved = 0, improved = 1,
much improved = 2)
Individuals may be classified according to socioeconomic status as (low = 1,
medium = 2, or high = 3)
The intelligence of children may be (above average = 2, average = 1, or below
average = 0)
academic performance (grand distinction = 1, distinction = 2, satisfaction = 3,
and failure = 4)
Note: The function of numbers assigned to ordinal data is to order (or rank) the
observations from lowest to highest and, hence, the term ordinal.
The Interval Scale: The interval scale is a more sophisticated scale than the nominal
or ordinal in that with this scale not only is it possible to order measurements but also
the distance between any two measurements is known.
We know, say, that the difference between a measurement of 20 and a measurement of
30 is equal to the difference between measurements of 30 and 40.
The ability to do this implies choosing arbitrarily two points of reference for measuring:
a zero point; 0
a unit distance; 1
Note: The selected zero point is not necessarily a true zero in that it does not have to
indicate a total absence of the quantity being measured.
The Ratio Scale: The highest level of measurement is the ratio scale. This scale is
characterized by the fact that equality of ratios, as well as equality of intervals, may be
determined. Fundamental to the ratio scale is a true zero point. The measurement of
such familiar traits as height, weight, and length makes use of the ratio scale.
Example:
amount of money in the pocket (zero means the total absence of money,
i.e. an empty pocket)
number of students per classroom (zero students means the total absence of
students in the classroom)
Frequency distribution: a simple table of two rows listing the distinct
observed values xi with their respective frequencies fi.
E.g. the frequency distribution of a variable with 7 distinct values is the table
below:
xi 1 2 3 4 5 6 7
fi 1 2 2 3 1 1 2
This frequency distribution is appropriate only for a discrete variable whose
number of distinct values k is at most 12, i.e. k ≤ 12. Otherwise,
values are grouped into class intervals. The frequency distribution with class
intervals is called a "grouped frequency distribution".
Grouped Frequency Distribution: It is a table in two rows; the 1st row for
different class intervals in which fall observations and the 2nd exclusively for the
frequency of the corresponding class interval.
E.g. Find the grouped frequency distribution of the following data: 69 84 52
93 81 74 89 85 88 63 87 64 67 72 74 55 82 91 68 77
Note that this data set has more than 12 distinct values, so the values
have to be grouped into class intervals.
Let's group them into class intervals of length c = 10 starting from the value 50.
These 20 observed values fall into 5 class intervals as follows:
a-b 50 − 60 60 − 70 70 − 80 80 − 90 90 − 100
fi 2 5 4 7 2
Remark: A grouped frequency distribution is recommended for a continuous
data variable.
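The grouping above can be checked with a short Python sketch. It assumes each class interval a-b contains its lower limit but not its upper limit, which is the convention that reproduces the counts in the tables of this lecture; the function name is hypothetical.

```python
def grouped_frequencies(data, start, width, k):
    """Count observations in k classes [a, b) of equal width, beginning at `start`."""
    counts = [0] * k
    for x in data:
        index = int((x - start) // width)
        if not 0 <= index < k:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i]) for i in range(k)]

scores = [69, 84, 52, 93, 81, 74, 89, 85, 88, 63,
          87, 64, 67, 72, 74, 55, 82, 91, 68, 77]

for a, b, f in grouped_frequencies(scores, start=50, width=10, k=5):
    print(f"{a} - {b}: {f}")
```

Running it prints the frequencies 2, 5, 4, 7, 2 for the five class intervals.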
Some important elements should be known before constructing the grouped
frequency distribution. These are:
The minimum number k of class intervals required for a data variable of
size n. Sturges's rule, $k = 1 + 3.322 \log_{10} n$, should be used for that
purpose.
E.g. Find the minimum number k of class intervals needed to represent the
following data by a grouped frequency distribution: 69 84 52 93 81 74 89 85 88
63 87 64 67 72 74 55 82 91 68 77
Applying Sturges's rule with n = 20, we have
$k = 1 + 3.322 \log_{10} 20 = 1 + 3.322 \times 1.3010299957 = 5.322022$. We
round k up to the next natural number, i.e. k = 6. Since k ≤ 12 is also required,
any k = 6, 7, 8, ..., 12 is acceptable. We therefore have to modify the previous grouped
frequency distribution so that it has more than 5 class intervals.
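As a quick check of the calculation, Sturges's rule can be evaluated in Python. The function name is hypothetical; the constant 3.322 and the rounding up are taken from the text.

```python
import math

def sturges_classes(n):
    """Minimum number of class intervals: k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

print(sturges_classes(20))  # 6
```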
Suppose now that we take k = 6 class intervals. The length of each class
interval, called the class width c, is the difference between the largest and
smallest values divided by the number of class intervals,
i.e. $c = (\text{max value} - \text{min value})/k$, with its value adjusted upward so that
the class intervals accommodate all values of the given data variable.
Find the class width c needed to group the following data 69 84 52 93
81 74 89 85 88 63 87 64 67 72 74 55 82 91 68 77 into 6 class intervals.
Solution: $c = (93 - 52)/6 = 6.833333$. Since the smallest value 52 is even, we
round up to an even number and take c = 8.
The modified grouped frequency distribution with 6 grouped class intervals is:
a-b 52 − 60 60 − 68 68 − 76 76 − 84 84 − 92 92 − 100
fi 2 3 5 3 6 1
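The class width and the regrouped frequencies can be reproduced with the sketch below. It assumes, as the table implies, that each interval includes its lower limit and excludes its upper limit; the variable names are illustrative only.

```python
data = [69, 84, 52, 93, 81, 74, 89, 85, 88, 63,
        87, 64, 67, 72, 74, 55, 82, 91, 68, 77]

k = 6                                      # number of class intervals
raw_width = (max(data) - min(data)) / k    # (93 - 52) / 6
c = 8                                      # width chosen in the text (rounded up to an even number)
start = min(data)                          # first class starts at the smallest value, 52

counts = [0] * k
for x in data:
    counts[(x - start) // c] += 1          # index of the class [start + i*c, start + (i+1)*c)

print(round(raw_width, 6))  # 6.833333
print(counts)               # [2, 3, 5, 3, 6, 1]
```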
Note:
The cumulative frequency of the value xi is given by $cf_1 = f_1$ for the first value
x1 and $cf_i = cf_{i-1} + f_i$ for all i > 1.
Example:
Find the cumulative frequencies of the values presented by the previous ungrouped
frequency distribution (xi = 1, ..., 7 with fi = 1, 2, 2, 3, 1, 1, 2).
cf1 = f1 = 1,
cf2 = cf1 + f2 = 1 + 2 = 3,
cf3 = cf2 + f3 = 3 + 2 = 5,
cf4 = cf3 + f4 = 5 + 3 = 8,
...
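The recurrence for the cumulative frequencies is a running sum, so the remaining values can be obtained with a one-line sketch using Python's standard `itertools.accumulate`:

```python
from itertools import accumulate

frequencies = [1, 2, 2, 3, 1, 1, 2]          # f_i for x_i = 1, ..., 7
cumulative = list(accumulate(frequencies))   # cf_1 = f_1, cf_i = cf_{i-1} + f_i
print(cumulative)  # [1, 3, 5, 8, 9, 10, 12]
```

The last cumulative frequency equals the total number of observations, here 12.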
Activity II
1 Represent the following data of the ages of 62 people who live in a certain
neighborhood by an appropriate frequency distribution. Construct its
corresponding extended frequency distribution table:
2, 5, 6, 12, 14, 15, 15, 16, 18, 19, 20, 22, 23, 25, 27, 28, 30, 32, 33, 35, 36, 36,
37, 38, 39, 40, 40, 41, 42, 43, 43, 44, 44, 45, 45, 46, 47, 47, 48, 49, 50, 51, 56,
57, 58, 59, 59, 60, 62, 63, 65, 65, 67, 69, 71, 75, 78, 80, 82, 84, 90, 96
2 Octane levels for various gasoline blends are given below: 87.9 84.2 86.9 87.7 91.7
88.8 95.3 93.5 94.3 88.1 90.2 91.4 91.3 93.9
Represent these data by an appropriate extended frequency distribution table.
Explain why you made such a choice.
3 The following are data on the number of students per classroom in AUCA, Faculty
of Education. Represent them by an appropriate frequency distribution table
14 11 10 8 12 13 11 10 16 11 11 9 9 7 14 12 9 10 11 6 13 8 11 11 9 8 13 16 10
11 9 8 12 11 10
Statistical data variables are often presented by a graph or chart. The type of chart
depends on the nature of the variable it represents.
Qualitative variables are represented by either a pie chart or a bar chart.
Quantitative discrete variables are represented by a rod/spike chart, a frequency
polygon, or a cumulative frequency chart (ogive in stair form).
Quantitative continuous variables are represented by a histogram, a frequency
polygon, or an ogive in continuous curve form.
Solution:
Generate the extended frequency distribution table:
Bar Chart:
The bar chart is an alternative representation of a qualitative data variable. It
consists of a sequence of equidistant vertical rectangles, one for each value xi, with
heights proportional to the frequency fi, drawn in the XY-plane.
Example:
Frequency distribution of the enrollment of four classes in a high school is given in the
following table.
Note: The bar chart is fitted within the available space by fixing the position of the
largest frequency at a chosen distance, i.e. at x cm from the origin of the
y-axis (axis of frequencies). For example, with the largest frequency 30 fixed at 10 cm:
30 freq → 10 cm
1 freq → 10/30 ≈ 0.33 cm
The positions are indicated in the last column of the
above extended frequency distribution table.
Rod/Spike Chart:
This graphic is drawn in XY-plane where different distinct values xi are found on
X-axis, and their frequency fi on Y-axis. The graphic consists of vertical line
segments starting from X-axis at the point xi , and of height proportional to the
frequency fi .
Example:
Represent the following data variable by Rod / Spike Chart:
xi 1 2 3 4 5 6 7
fi 1 2 2 3 1 1 2
i      xi  fi  cfi  location (cm)
1      1   1   1    3
2      2   2   3    6
3      3   2   5    6
4      4   3   8    9
5      5   1   9    3
6      6   1   10   3
7      7   2   12   6
Total      12
Let's locate the largest frequency, 3, on the Y-axis (axis of frequencies) at 9 cm. The
location of the other frequencies follows from the correspondence 3 freq → 9 cm, which
gives 1 freq → 9/3 = 3 cm (one unit of frequency must be marked
at 3 cm from the origin 0 on the Y-axis).
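The "location (cm)" column follows from a single proportion, which the sketch below computes. The function name and its parameters are hypothetical; only the rule (largest frequency fixed at a chosen height) comes from the text.

```python
def axis_positions(frequencies, max_height_cm):
    """Height in cm of each frequency when the largest one is fixed at max_height_cm."""
    unit = max_height_cm / max(frequencies)   # cm per unit of frequency
    return [f * unit for f in frequencies]

print(axis_positions([1, 2, 2, 3, 1, 1, 2], max_height_cm=9))
# [3.0, 6.0, 6.0, 9.0, 3.0, 3.0, 6.0]
```

The same function applies to bar charts and histograms, since both fix the largest frequency at a chosen distance from the origin.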
Polygon Frequency:
Polygon frequency is obtained by joining every two consecutive upper points of
the rod/spike chart by a line segment. Here below is the polygon chart generated
from the previous rod/spike chart.
Histogram:
A histogram is a series of contiguous rectangles of equal breadth, drawn in the
XY-plane, whose heights are proportional to the frequency of each class interval.
Remember that to draw the histogram, it is first necessary to fix the largest frequency
at a specified distance x (units of distance) from the origin 0 of the axis of frequencies.
The next slide shows a histogram that corresponds to the lifetimes of 200 incandescent
lamps. We should fit the chart within 10 cm, i.e. fix the largest frequency, 58, at 10 cm from
the origin of the axis of frequencies.
Polygon frequency:
The polygon frequency of a grouped frequency distribution is obtained by joining the
midpoints of the upper bases of consecutive rectangles of the histogram (one midpoint
per class interval) by line segments.
Example:
Generate the stemplot corresponding to the following scores for the final exam of
descriptive statistics.
33 42 49 49 53 55 55 61 63 67 68 68 69 69 72 73 74 78 80 83 88 88 88 90 92 94 94
94 94 96 100
The following is the stem-and-leaf graphic of the above data
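A stem-and-leaf display for these scores can be generated with the sketch below. It assumes the tens digit (or tens count, for 100) is used as the stem and the units digit as the leaf; the function name and the text layout are illustrative choices, not the lecture's own.

```python
from collections import defaultdict

def stemplot(values):
    """Return stem-and-leaf lines, using tens as stems and units as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

for line in stemplot(scores):
    print(line)
```

Each printed row shows one stem followed by its leaves in ascending order, so the row for stem 6 reads `6 | 1 3 7 8 8 9 9`.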
In statistics, data variables are summarized by values computed (or determined)
from the observations of that variable. These values are often called the statistical
descriptor measures of the data variable.
The statistical descriptor measures are classified into two categories: measures of
central tendency and measures of spread. The measures of central tendency
comprise the mean, the median, the mode, and the quantiles, while the
measures of spread (the range, the variance, the standard deviation, and the
coefficient of variation) express the spread or volatility of the data variable.
For data given in raw form $x_1, \ldots, x_n$, the arithmetic mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Example:
Find the arithmetic mean of the following data of the variable x: 24 39 7 48 16 29 34
20 43 18
$$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i$$
$$\bar{x} = \frac{24 + 39 + 7 + 48 + 16 + 29 + 34 + 20 + 43 + 18}{10} = \frac{278}{10} = 27.8$$
Example:
Find the arithmetic mean of the data variable x represented by the following frequency
distribution:
xi 1 2 3 4 5 6 7
fi 1 2 2 3 1 1 2
Solution:
The problem is solved by adding a column for xi fi to the extended frequency
distribution table:
i      xi  fi  cfi  xi fi
1      1   1   1    1
2      2   2   3    4
3      3   2   5    6
4      4   3   8    12
5      5   1   9    5
6      6   1   10   6
7      7   2   12   14
Total      12       48
$$\bar{x} = \frac{1}{12}\sum_{i=1}^{7} x_i f_i = \frac{48}{12} = 4$$
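The same computation can be written as a short Python sketch: the mean of a frequency distribution is the sum of the products xi fi divided by the total frequency. The function name is hypothetical.

```python
def frequency_mean(values, freqs):
    """Arithmetic mean of a frequency distribution: sum(x_i * f_i) / sum(f_i)."""
    total = sum(freqs)
    return sum(x * f for x, f in zip(values, freqs)) / total

x = [1, 2, 3, 4, 5, 6, 7]
f = [1, 2, 2, 3, 1, 1, 2]
print(frequency_mean(x, f))  # 4.0
```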
Note:
When the distinct values xi and the frequencies fi have many digits, the products xi fi may
take time to compute when filling in the columns of the extended frequency distribution
table. To simplify the task, we use the following formula, called the short-cut
formula, for the calculation of the arithmetic mean:
$$\bar{x} = A + \frac{1}{N}\sum_{i=1}^{k} d_i f_i$$
where A is the assumed mean (one of the xi values taken from the central part of the
column of xi; for the previous example A = 4, the 4th observed value), N is the total
frequency, and $d_i$ is the deviation of the value $x_i$ from the assumed mean A, i.e. $d_i = x_i - A$.
Find the arithmetic mean of the previous example by using the short-cut formula:
i      xi  fi  cfi  di = xi − 4  di fi
1      1   1   1    −3           −3
2      2   2   3    −2           −4
3      3   2   5    −1           −2
4      4   3   8    0            0
5      5   1   9    1            1
6      6   1   10   2            2
7      7   2   12   3            6
Total      12                    0
$$\bar{x} = 4 + \frac{1}{12}\sum_{i=1}^{7} d_i f_i = 4 + \frac{0}{12} = 4$$
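A small sketch of the short-cut formula shows that the result does not depend on which assumed mean A is picked; a central value simply keeps the deviations small. The function name is hypothetical.

```python
def shortcut_mean(values, freqs, assumed_mean):
    """Short-cut formula: A + (1/N) * sum(d_i * f_i), with d_i = x_i - A."""
    n_total = sum(freqs)
    deviation_sum = sum((x - assumed_mean) * f for x, f in zip(values, freqs))
    return assumed_mean + deviation_sum / n_total

x = [1, 2, 3, 4, 5, 6, 7]
f = [1, 2, 2, 3, 1, 1, 2]
print(shortcut_mean(x, f, assumed_mean=4))  # 4.0
print(shortcut_mean(x, f, assumed_mean=3))  # 4.0
```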
For a grouped frequency distribution, the arithmetic mean is computed from the class
mid-points as
$$\bar{x} = A + \frac{c}{N}\sum_{i=1}^{k} D_i f_i$$
where
k: the number of class intervals
A: the assumed mean, which is one of the mid-points $m_i$ picked from the central
part of the column of mid-points
$D_i$: the reduced deviation of the mid-point $m_i$ from the assumed mean A, i.e.
$D_i = (m_i - A)/c$
c: the class width, i.e. the length of each class interval
Note: The above formula is called the step-deviation formula and gives an
approximate value of the arithmetic mean.
Find the arithmetic mean of the data variable represented by the following
grouped frequency distribution, whose first class interval is 0 − 10:
class 1 2 3 4 5 6 7 8
fi 5 10 25 30 20 10 5 5
Solution (taking A = 35 and c = 10):
i      a-b      mi  fi  cfi  Di = (mi − 35)/10  Di fi
1      0 − 10   5   5   5    −3                 −15
2      10 − 20  15  10  15   −2                 −20
3      20 − 30  25  25  40   −1                 −25
4      30 − 40  35  30  70   0                  0
5      40 − 50  45  20  90   1                  20
6      50 − 60  55  10  100  2                  20
7      60 − 70  65  5   105  3                  15
8      70 − 80  75  5   110  4                  20
Total               110                         15
Finally, we obtain the approximate value of the arithmetic mean:
$$\bar{x} = A + \frac{c}{N}\sum_{i=1}^{k} D_i f_i$$
$$\bar{x} = 35 + \frac{10 \times 15}{110} = 35 + \frac{15}{11} \approx 36.36$$
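The step-deviation calculation can be verified with the sketch below. It assumes equal-width classes, so that the mid-points can be derived from the lower limit of the first class; the function name and parameters are hypothetical.

```python
def step_deviation_mean(first_lower, width, freqs, assumed_mean):
    """Step-deviation formula: A + (c/N) * sum(D_i * f_i), with D_i = (m_i - A) / c."""
    n_total = sum(freqs)
    midpoints = [first_lower + width * (i + 0.5) for i in range(len(freqs))]
    reduced_sum = sum(((m - assumed_mean) / width) * f for m, f in zip(midpoints, freqs))
    return assumed_mean + width * reduced_sum / n_total

freqs = [5, 10, 25, 30, 20, 10, 5, 5]
mean = step_deviation_mean(first_lower=0, width=10, freqs=freqs, assumed_mean=35)
print(round(mean, 2))  # 36.36
```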
Apart from the arithmetic mean, the measures of central tendency include other
measures that also go under the general name of mean:
Solve the problems of the activity below using the theory of the weighted mean.
The mode $M_0$ of a grouped frequency distribution is given by
$$M_0 = L + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times c$$
where
L: the lower limit of the modal class interval (the modal class interval is the one
that has the highest frequency)
$f_0$: the frequency of the class interval that comes just before the modal class
interval
$f_1$: the frequency of the modal class interval
$f_2$: the frequency of the class interval that comes just after the modal class interval
c: the class width
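As an illustration only, the formula can be applied to the grouped distribution used earlier for the step-deviation mean (class width 10, first class 0 − 10); the lecture itself does not work this example. The sketch assumes equal-width classes, takes the first class with the highest frequency as the modal class, and treats a missing neighbouring class as having frequency 0. These are assumptions, as is the function name.

```python
def grouped_mode(first_lower, width, freqs):
    """Mode of a grouped distribution: L + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * c."""
    i = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                   # class just before (assumed 0 if none)
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0      # class just after (assumed 0 if none)
    lower_limit = first_lower + i * width
    return lower_limit + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * width

mode = grouped_mode(first_lower=0, width=10, freqs=[5, 10, 25, 30, 20, 10, 5, 5])
print(round(mode, 2))  # 33.33
```

Here the modal class is 30 − 40 with f1 = 30, f0 = 25 and f2 = 20, so the mode is 30 + (5/15) × 10. The formula is undefined when the modal class and both neighbours share the same frequency, because the denominator is then zero.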
Given a set of numerical observations, we may transform it into an array of data (order
the data in ascending order). In statistics, it is very important to understand the role
of percentiles. Percentiles are positional values: they describe the
percentage of the data less than or equal to the value at a specific position within the
whole set of values.
Example:
If the grade of a student is at the 90th percentile, this means that 90% of his or
her classmates have grades less than or equal to his or her grade.
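The definition above (percentage of the data less than or equal to a value) can be sketched as follows. The function name and the ten grades are hypothetical and serve only to illustrate the idea.

```python
def percentile_rank(data, value):
    """Percentage of observations that are less than or equal to `value`."""
    return 100 * sum(1 for x in data if x <= value) / len(data)

# Hypothetical grades of a class of 10 students.
grades = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
print(percentile_rank(grades, 90))  # 90.0
```

In this hypothetical class, a grade of 90 sits at the 90th percentile because 9 of the 10 grades are less than or equal to it.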