Chapter 1
Descriptive Statistics
03/10/2022
Contents
3 Data Description
Measures of Central Tendency
Measures of Variation
Measures of Position
Definition 1
Statistics is the science of conducting studies to collect, organize,
summarize, analyze, and draw conclusions from data.
Definition 3
Inferential statistics consists of generalizing from samples to
populations, performing estimations and hypothesis tests, determining
relationships among variables, and making predictions.
Definition 4
A population consists of all subjects (human or otherwise) that are
being studied.
Definition 5
A sample is a group of subjects selected from a population.
Example 1
Determine whether descriptive or inferential statistics were used.
a. The average jackpot for the top five lottery winners was $367.6
million.
b. A study done by the American Academy of Neurology suggests
that older people who had a high caloric diet more than doubled
their risk of memory loss.
c. Based on a survey of 9317 consumers done by the National Retail
Federation, the average amount that consumers spent on
Valentine’s Day in 2011 was $116.
d. Scientists at the University of Oxford in England found that a
good laugh significantly raises a person’s pain level tolerance.
Solution
a. Descriptive statistics were used because this is an average, and it
is based on data obtained from the top five lottery winners at this
time.
b. Inferential statistics were used since this is a generalization made
from a sample to a population.
c. Descriptive statistics were used since this is an average based on a
sample of 9317 respondents.
d. Inferential statistics were used since an inference is made from a
sample to a population.
Definition 6
A variable is a characteristic or attribute that can assume different
values. Data are the values (measurements or observations) that the
variables can assume. Variables whose values are determined by
chance are called random variables.
Variables can be classified as qualitative or quantitative.
Definition 7
Qualitative variables are variables that have distinct categories
according to some characteristic or attribute.
Quantitative variables are variables that can be counted or
measured.
Quantitative variables can be further classified into two groups:
discrete and continuous. Discrete variables assume values that can
be counted. Continuous variables can assume an infinite number of
values between any two specific values. They are obtained by
measuring. They often include fractions and decimals.
AMS (ITC) Descriptive Statistics 03/10/2022 7 / 69
The Nature of Probability and Statistics
TYPES OF VARIABLES
There are two basic types of variables: (1) qualitative and (2)
quantitative.
Example 2
Classify each variable as a discrete variable or a continuous variable.
a. The highest wind speed of a hurricane
b. The weight of baggage on an airplane
c. The number of pages in a statistics book
d. The amount of money a person spends per year for online
purchases
Solution
a. Continuous, since wind speed must be measured
b. Continuous, since weight is measured
c. Discrete, since the number of pages is countable
d. Discrete, since the smallest value that money can assume is in
cents
Example 3
What level of measurement would be used to measure each variable?
a. The ages of patients in a local hospital
b. The ratings of movies released this month
c. Colors of athletic shirts sold by Oak Park Health Club
d. Temperatures of hot tubs in local health clubs
Solution
a. Ratio
b. Ordinal
c. Nominal
d. Interval
Definition 9
A frequency distribution is the organization of raw data in table
form, using classes and frequencies.
Two types of frequency distributions that are most often used are the
categorical frequency distribution and the grouped frequency
distribution.
The categorical frequency distribution is used for data that can
be placed in specific categories, such as nominal- or ordinal-level
data.
When the range of the data is large, the data must be grouped
into classes that are more than one unit in width, in what is called
a grouped frequency distribution.
Definition 10
A cumulative frequency distribution is a distribution that shows the
number of data values less than or equal to a specific value (usually an
upper boundary). The values are found by adding the frequencies of
the classes less than or equal to the upper class boundary of a specific
class. This gives an ascending cumulative frequency.
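As a minimal sketch of the definition above, the cumulative frequencies are running totals of the class frequencies. The boundaries and frequencies below are hypothetical values chosen only for illustration; they do not come from these notes.

```python
from itertools import accumulate

# Hypothetical grouped distribution (illustrative values only, not from the notes):
# upper class boundaries and the frequency of each class.
upper_boundaries = [104.5, 109.5, 114.5, 119.5, 124.5]
frequencies = [2, 8, 18, 13, 9]

# Running totals give the number of values less than or equal to
# each upper class boundary (ascending cumulative frequency).
cumulative = list(accumulate(frequencies))

for boundary, cum in zip(upper_boundaries, cumulative):
    print(f"Less than or equal to {boundary}: {cum}")
```

The last cumulative frequency always equals the total number of data values, which is a quick check on the table.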
Definition 11
The histogram is a graph that displays the data by using contiguous
vertical bars (unless the frequency of a class is 0) of various heights to
represent the frequencies of the classes.
Definition 12
The frequency polygon is a graph that displays the data by using
lines that connect points plotted for the frequencies at the midpoints
of the classes. The frequencies are represented by the heights of the
points.
Definition 13
The ogive is a graph that represents the cumulative frequencies for
the classes in a frequency distribution.
Example 6
Construct a histogram, a frequency polygon, and an ogive to represent
the data shown for the record high temperatures for each of the 50
states (see Example above).
When the data are qualitative or categorical, bar graphs can be used
to represent the data. A bar graph can be drawn using either
horizontal or vertical bars.
Definition 14
A bar graph represents the data by using vertical or horizontal bars
whose heights or lengths represent the frequencies of the data.
Bar graphs can also be used to compare data for two or more groups.
These types of bar graphs are called compound bar graphs or
multiple bar graphs.
Example 8
Consider the following data for the number (in millions) of never
married adults in the United States.
Year    Males    Females
1960    15.3     12.3
1980    24.2     20.2
2000    32.3     27.8
2010    40.2     34.0
Construct a multiple bar graph for these data.
Example 9
The data show the percentage of U.S. adults who smoke. Draw and
analyze a time series graph for the data.
Year       1970    1980    1990    2000    2010
Percent    37      33      25      23      19
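One way to support the "analyze" part of Example 9 is to compute the change between consecutive readings before drawing the graph. The sketch below only does this arithmetic; the variable names are illustrative.

```python
years = [1970, 1980, 1990, 2000, 2010]
percent = [37, 33, 25, 23, 19]

# Change in percentage points from one decade to the next.
changes = [later - earlier for earlier, later in zip(percent, percent[1:])]

for start, end, change in zip(years, years[1:], changes):
    print(f"{start}-{end}: {change:+d} percentage points")

# Overall change across the whole period.
total_change = percent[-1] - percent[0]
print(f"1970-2010: {total_change:+d} percentage points")
```

Every change is negative, so the graph trends downward throughout, with the steepest drop between 1980 and 1990.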
Two or more data sets can be compared on the same graph, called a
compound time series graph, by using two or more lines, as shown
below:
This graph shows the percentage of elderly males and females in the
U.S. labor force from 1960 to 2010. It shows that the percentage of
elderly men decreased significantly from 1960 to 1990 and then
increased slightly after that. For the elderly females, the percentage
decreased slightly from 1960 to 1980 and then increased from 1980 to
2010.
Frequency Distributions and Graphs
The purpose of the pie graph is to show the relationship of the parts to
the whole by visually comparing the sizes of the sections. Percentages
or proportions can be used. The variable is nominal or categorical.
Definition 16
A pie graph is a circle that is divided into sections or wedges according
to the percentage of frequencies in each category of the distribution.
A dotplot uses points or dots to represent the data values. If a data
value occurs more than once, the corresponding points are plotted
above one another.
Definition 17
A dotplot is a statistical graph in which each data value is plotted as
a point (dot) above the horizontal axis.
Dotplots are used to show how the data values are distributed and to
see if there are any extremely high or low data values.
Example 11 (Named Storms)
The data show the number of named storms each year for the last 40
years. Construct and analyze a dotplot for the data.
19 15 14 7 6 11 11 9
16 8 8 11 9 8 16 12 13
14 13 12 7 15 15 19 11 4
6 13 10 15 7 12 6 10
28 12 8 7 12 9
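A dotplot is a frequency count drawn as stacked dots. The following is a rough text-only sketch for the named storms data; it prints the dots sideways (one row per value) rather than above a horizontal axis, which is a simplification made here for plain-text output.

```python
from collections import Counter

# Number of named storms per year, as listed in Example 11.
storms = [
    19, 15, 14, 7, 6, 11, 11, 9,
    16, 8, 8, 11, 9, 8, 16, 12, 13,
    14, 13, 12, 7, 15, 15, 19, 11, 4,
    6, 13, 10, 15, 7, 12, 6, 10,
    28, 12, 8, 7, 12, 9,
]

counts = Counter(storms)

# One row per possible value; each dot is one year with that many storms.
for value in range(min(storms), max(storms) + 1):
    print(f"{value:>2} | " + "." * counts[value])
```

In the printed rows, most values cluster between 6 and 16, while the single value 28 sits far from the rest, which is the kind of extremely high value a dotplot is meant to reveal.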
Data Description
Definition 19
Any measurable characteristic of a population is called a parameter.
The mean of a population is an example of a parameter.
Data Description Measures of Central Tendency
Example 12
There are 42 exits on I-75 through the state of Kentucky. Listed below
are the distances between exits (in miles).
11 4 10 4 9 3 8 10 3 14 1 10 3 5
2 2 5 6 1 2 2 3 7 1 3 7 8 10
1 4 7 5 2 2 5 1 1 3 3 1 2 1
Why is this information a population? What is the mean number of
miles between exits?
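The population mean for Example 12 can be checked with a few lines of code. This is a minimal sketch that treats the 42 listed distances as the whole population, as the example states.

```python
# Distances (in miles) between exits on I-75 in Kentucky, from Example 12.
distances = [
    11, 4, 10, 4, 9, 3, 8, 10, 3, 14, 1, 10, 3, 5,
    2, 2, 5, 6, 1, 2, 2, 3, 7, 1, 3, 7, 8, 10,
    1, 4, 7, 5, 2, 2, 5, 1, 1, 3, 3, 1, 2, 1,
]

N = len(distances)        # population size
total = sum(distances)    # sum of all values
mu = total / N            # population mean

print(f"N = {N}, sum = {total}, mean = {mu:.2f} miles")
```

Because every distance in the population is included, the result is a parameter rather than a statistic.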
Definition 21
Any measure based on sample data is called a statistic. The mean of
a sample is an example of a statistic.
Example 13
Verizon is studying the number of monthly minutes used by clients in
a particular cell phone rate plan. A random sample of 12 clients
showed the following number of minutes used last month.
90 77 94 89 119 112
91 110 92 100 113 83
What is the arithmetic mean number of minutes used last month?
Example 14
The annual incomes of a sample of middle-management employees at
Westinghouse are $62,900, $69,100, $58,300, and $76,800.
(a) Give the formula for the sample mean.
(b) Find the sample mean.
(c) Is the mean you computed in (b) a statistic or a parameter? Why?
(d) What is your best estimate of the population mean?
Example 15
The six students in Computer Science 411 are a population. Their
final course grades are 92, 96, 61, 86, 79, and 84.
(a) Give the formula for the population mean.
(b) Compute the mean course grade.
(c) Is the mean you computed in (b) a statistic or a parameter? Why?
Procedure Table
Finding the Mean for Grouped Data
1 Make a table as shown.
A        B              C               D
Class    Frequency f    Midpoint X_m    f · X_m
2 Find the midpoints of each class and place them in column C.
3 Multiply the frequency by the midpoint for each class, and place
the product in column D.
4 Find the sum of column D.
5 Divide the sum obtained in column D by the sum of the
frequencies obtained in column B.
The formula for the mean is
\[ \bar{X} = \frac{\sum f \cdot X_m}{n} \]
Example 16
For 108 randomly selected college students, this exam score frequency
distribution was obtained.
Class limits Frequency
90-98 6
99-107 22
108-116 43
117-125 28
126-134 9
Find the mean for this grouped data.
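The procedure table can be followed step by step in code. The sketch below assumes each class midpoint is the average of its lower and upper class limits, and takes n as the sum of the frequencies.

```python
# (lower limit, upper limit, frequency) for each class in Example 16.
classes = [
    (90, 98, 6),
    (99, 107, 22),
    (108, 116, 43),
    (117, 125, 28),
    (126, 134, 9),
]

n = sum(f for _, _, f in classes)   # sum of column B

total = 0.0
for lower, upper, f in classes:
    midpoint = (lower + upper) / 2  # column C
    total += f * midpoint           # column D, accumulated

mean = total / n                    # sum of column D divided by n
print(f"n = {n}, sum of f*Xm = {total}, mean = {mean}")
```

The grouped mean is an approximation of the mean of the raw scores, since every value in a class is treated as if it were equal to the class midpoint.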
Definition 22
The type of mean that considers an additional factor is called the
weighted mean, and it is used when the values are not all equally
represented.
Find the weighted mean of a variable X by multiplying each value by
its corresponding weight and dividing the sum of the products by the
sum of the weights.
\[ \bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w} \]
Example 17
A student received an A in English Composition I (3 credits), a C in
Introduction to Psychology (3 credits), a B in Biology I (4 credits),
and a D in Physical Education (2 credits). Assuming A = 4 grade
points, B = 3 grade points, C = 2 grade points, D = 1 grade point,
and F = 0 grade points, find the student’s grade point average.
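For Example 17, the grade points play the role of the values X and the credits play the role of the weights w. A minimal sketch of the weighted mean under that reading:

```python
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# (course, letter grade, credits) as given in Example 17.
courses = [
    ("English Composition I", "A", 3),
    ("Introduction to Psychology", "C", 3),
    ("Biology I", "B", 4),
    ("Physical Education", "D", 2),
]

# Numerator: sum of weight * value; denominator: sum of weights.
weighted_sum = sum(grade_points[grade] * credits for _, grade, credits in courses)
total_credits = sum(credits for _, _, credits in courses)

gpa = weighted_sum / total_credits
print(f"GPA = {weighted_sum}/{total_credits} = {gpa:.2f}")
```

An unweighted mean of the four grade points would give 2.5; the weighted mean is higher because the four-credit course carries a B.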
Example 18
Facebook is a popular social networking website. Users can add
friends, send them messages, and update their personal profiles to
notify friends about themselves and their activities. A sample of 10
adults revealed they spent the following number of hours last month
using Facebook.
3 5 7 5 9 1 3 9 17 10
Example 19
The grouped data in Table 1.20 below represent the number of
children from birth through the end of the teenage years in a large
apartment complex. Find the median.
Definition 24
The mode is the value of the observation that appears most frequently.
A data set that has only one value that occurs with the greatest
frequency is said to be unimodal.
If a data set has two values that occur with the same greatest
frequency, both values are considered to be the mode and the
data set is said to be bimodal.
If a data set has more than two values that occur with the same
greatest frequency, each value is used as the mode, and the data
set is said to be multimodal.
When no data value occurs more than once, the data set is said
to have no mode.
Example 20
Recall the data regarding the distance in miles between exits on I-75 in
Kentucky. The information is repeated below.
11 4 10 4 9 3 8 10 3 14 1 10 3 5
2 2 5 6 1 2 2 3 7 1 3 7 8 10
1 4 7 5 2 2 5 1 1 3 3 1 2 1
What is the modal distance?
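The modal distance can be found by counting how often each value occurs. The sketch below uses `statistics.multimode` from the Python standard library, which returns every value tied for the greatest frequency, so it also covers the bimodal and multimodal cases in Definition 24.

```python
from collections import Counter
from statistics import multimode

# Distances (in miles) between exits on I-75 in Kentucky, from Example 20.
distances = [
    11, 4, 10, 4, 9, 3, 8, 10, 3, 14, 1, 10, 3, 5,
    2, 2, 5, 6, 1, 2, 2, 3, 7, 1, 3, 7, 8, 10,
    1, 4, 7, 5, 2, 2, 5, 1, 1, 3, 3, 1, 2, 1,
]

counts = Counter(distances)
modes = multimode(distances)  # list of all values with the highest count

print("Three most frequent values:", counts.most_common(3))
print("Mode(s):", modes)
```

Only one value reaches the highest count, so this data set is unimodal.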
Definition 25
The mode for grouped data is the modal class. The modal class is
the class with the largest frequency.
Example 21
Find the modal class for the frequency distribution in Example 19.
Properties of Mode
1 The mode is used when the most typical case is desired.
2 The mode is the easiest average to compute.
3 The mode can be used when the data are nominal or categorical,
such as religious preference, gender, or political affiliation.
4 The mode is not always unique. A data set can have more than
one mode, or the mode may not exist for a data set.
Definition 26
The midrange is defined as the sum of the lowest and highest values in
the data set, divided by 2. The symbol MR is used for the midrange.
\[ MR = \frac{\text{lowest value} + \text{highest value}}{2} \]
Example 22
The number of bank failures for a recent five-year period is shown.
Find the midrange.
Properties of MR
1 The midrange is easy to compute.
2 The midrange gives the midpoint.
3 The midrange is affected by extremely high or low values in a
data set.
Data Description Measures of Variation
Definition 28
The variance of the population, denoted by σ², is defined by
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \]
The standard deviation of the population, denoted by σ, is the
square root of the variance, that is, σ = √(σ²).
Example 23
The number of traffic citations issued last year by month in Beaufort
County, South Carolina, is reported below.
Example 24
The Philadelphia office of PricewaterhouseCoopers hired five
accounting trainees this year. Their monthly starting salaries were
$3,536; $3,173; $3,448; $3,121; and $3,622.
(a) Compute the population mean.
(b) Compute the population variance.
(c) Compute the population standard deviation.
(d) The Pittsburgh office hired six trainees. Their mean monthly
salary was $3,550, and the standard deviation was $250. Compare
the two groups.
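Parts (a) to (c) of Example 24 follow directly from Definition 28. A minimal sketch, dividing by N because the five trainees are treated as the whole population:

```python
from math import sqrt

# Monthly starting salaries (in dollars) of the five Philadelphia trainees.
salaries = [3536, 3173, 3448, 3121, 3622]

N = len(salaries)
mu = sum(salaries) / N                                # (a) population mean

# (b) population variance: mean of the squared deviations from mu.
variance = sum((x - mu) ** 2 for x in salaries) / N

sigma = sqrt(variance)                                # (c) population standard deviation

print(f"mean = {mu}, variance = {variance:.1f}, std dev = {sigma:.2f}")
```

For part (d), these values can be set beside the Pittsburgh figures given in the example (mean $3,550, standard deviation $250) to compare both the typical salary and the spread of the two offices.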
Definition 29
The variance of the sample (or sample variance), denoted by s², is
defined by
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} \]
Example 25
Find the sample variance and standard deviation for the amount of
European auto sales for a sample of 6 years shown. The data are in
millions of dollars.
Example 26
Find the variance and the standard deviation for the frequency
distribution of the data. The data represent the number of miles that
20 runners ran during one week.
Class Frequency Midpoint
5.5-10.5 1 8
10.5-15.5 2 13
15.5-20.5 3 18
20.5-25.5 5 23
25.5-30.5 4 28
30.5-35.5 3 33
35.5-40.5 2 38
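For grouped data, one common approach is to apply the shortcut form of the sample variance with each midpoint weighted by its class frequency. The notes do not state the grouped formula explicitly, so the version below is an assumption: s² = [n Σ(f·Xm²) − (Σ f·Xm)²] / [n(n − 1)].

```python
from math import sqrt

# (midpoint, frequency) for each class in Example 26.
table = [(8, 1), (13, 2), (18, 3), (23, 5), (28, 4), (33, 3), (38, 2)]

n = sum(f for _, f in table)                   # total number of runners
sum_fx = sum(f * xm for xm, f in table)        # sum of f * Xm
sum_fx2 = sum(f * xm ** 2 for xm, f in table)  # sum of f * Xm^2

# Shortcut formula for the sample variance, weighted by frequency (assumed form).
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
std_dev = sqrt(variance)

print(f"n = {n}, sum f*Xm = {sum_fx}, sum f*Xm^2 = {sum_fx2}")
print(f"variance = {variance:.1f}, std dev = {std_dev:.1f}")
```

As with the grouped mean, the result is approximate because each runner's mileage is replaced by the midpoint of its class.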
Example 27
The mean of the number of sales of cars over a 3-month period is 87,
and the standard deviation is 5. The mean of the commissions is
$5225, and the standard deviation is $773. Compare the variations of
the two.
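The two quantities in Example 27 are in different units (cars and dollars), so their standard deviations cannot be compared directly. A usual way to compare them, assumed here because the notes do not show the definition, is the coefficient of variation: the standard deviation expressed as a percentage of the mean.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean (assumed definition)."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 87)            # number of cars sold
cv_commissions = coefficient_of_variation(773, 5225)  # commissions in dollars

print(f"Sales:       {cv_sales:.1f}%")
print(f"Commissions: {cv_commissions:.1f}%")
```

Under this measure the commissions are relatively more variable than the sales, because their coefficient of variation is larger.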
Definition 31
A z score or standard score for a value is obtained by subtracting the
mean from the value and dividing the result by the standard deviation.
The symbol for a standard score is z. The formula is
\[ z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \]
The z score represents the number of standard deviations that a data
value falls above or below the mean.
Definition 32
Percentiles are position measures used in educational and
health-related fields to indicate the position of an individual in a
group. Percentiles divide the data set into 100 equal groups. When
the data are arranged in order from lowest to highest, the percentile
corresponding to a given value X is computed by using the following
formula:
\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100 \]
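The percentile formula translates directly into a small function. The scores used below are hypothetical values chosen for illustration only; they are not the scores from any example in these notes.

```python
def percentile_rank(values, x):
    """Percentile of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

# Hypothetical test scores (illustrative only, not from the notes).
scores = [4, 7, 9, 11, 13, 14, 16, 17, 19, 20]

rank = percentile_rank(scores, 13)
print(f"A score of 13 is at about the {round(rank)}th percentile")
```

The function counts values strictly below X, so the data do not need to be sorted first; sorting only matters when the count is done by hand.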
Example 31
A teacher gives a 20-point test to 10 students. The scores are shown
here. Find the percentile rank of a score of 12 and then of 6.
Example 32
Using the scores in Example 31, find the value corresponding to the
25th percentile and the value that corresponds to the 60th percentile.
Data Description Measures of Position
Definition 33
Quartiles divide the distribution into four groups, separated by
Q1, Q2, and Q3. Note that Q1 is the same as the 25th percentile; Q2 is
the same as the 50th percentile, or the median; Q3 corresponds to the
75th percentile, as shown:
Example 33
Find Q1, Q2, Q3, and IQR for the data set 15, 13, 6, 5, 12, 50, 22, 18.
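Quartile conventions differ between textbooks and software. The sketch below assumes the median-of-halves convention: Q2 is the median of the ordered data, Q1 is the median of the values below Q2, and Q3 is the median of the values above Q2. It also assumes IQR = Q3 − Q1. Interpolating methods may give slightly different results.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (assumed)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(ordered), median(upper)

data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Slicing with `ordered[-half:]` leaves the middle value out of both halves when the number of observations is odd.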
Remark 1
An outlier can strongly affect the mean and standard deviation of a
variable. For example, suppose a researcher mistakenly recorded an
extremely high data value. This value would then make the mean and
standard deviation of the variable much larger than they really were.
Outliers can have an effect on other statistics as well.
Example 34
Check the following data set for outliers.
Definition 35
A boxplot is a graph of a data set obtained by drawing a horizontal
line from the minimum data value to Q1, drawing a horizontal line
from Q3 to the maximum data value, and drawing a box whose
vertical sides pass through Q1 and Q3 with a vertical line inside the
box passing through the median or Q2.
Example 35
The number of meteorites found in 10 states of the United States is
89, 47, 164, 296, 30, 215, 138, 78, 48, 39. Construct a boxplot for the
data.
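A boxplot is drawn from five numbers: the minimum, Q1, the median, Q3, and the maximum. The sketch below computes them for the meteorite data, again assuming the median-of-halves convention for the quartiles; the drawing itself is left to pencil and paper or a plotting tool.

```python
from statistics import median

# Number of meteorites found in 10 states, from Example 35.
meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]

ordered = sorted(meteorites)
half = len(ordered) // 2

q1 = median(ordered[:half])    # median of the lower half
q2 = median(ordered)           # median of all values
q3 = median(ordered[-half:])   # median of the upper half

five_number_summary = (ordered[0], q1, q2, q3, ordered[-1])
print("min, Q1, median, Q3, max =", five_number_summary)
```

The right-hand line (from Q3 to the maximum) is much longer than the left-hand line (from the minimum to Q1), and the median lies closer to Q1 than to Q3, which suggests a positively skewed distribution.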
SKEWNESS
Another characteristic of a distribution is the shape. There are four
shapes commonly observed: symmetric, positively skewed, negatively
skewed, and bimodal. In a symmetric distribution the mean and
median are equal and the data values are evenly spread around these
values. The shape of the distribution below the mean and median is a
mirror image of distribution above the mean and median. A
distribution of values is skewed to the right or positively skewed if
there is a single peak, but the values extend much farther to the right
of the peak than to the left of the peak. In this case, the mean is
larger than the median. In a negatively skewed distribution there is a
single peak, but the observations extend farther to the left, in the
negative direction, than to the right. In a negatively skewed
distribution, the mean is smaller than the median. Positively skewed
distributions are more common. Salaries often follow this pattern. A
bimodal distribution will have two or more peaks. This is often the
case when the values are from two or more populations.
Exploratory Data Analysis
Skewness
Example 36
Following are the earnings per share for a sample of 15 software
companies for the year 2017. The earnings per share are arranged from
smallest to largest.
$0.09 $0.13 $0.41 $0.51 $1.12 $1.20 $1.49 $3.18
$3.50 $6.36 $7.83 $8.92 $10.13 $12.99 $16.40
Compute the mean, median, and standard deviation. Find the
coefficient of skewness using Pearson’s estimate and the software
methods. What is your conclusion regarding the shape of the
distribution?
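The notes name two skewness measures without showing their formulas, so the versions below are assumptions based on the forms commonly used: Pearson's coefficient sk = 3(mean − median)/s, and the software coefficient sk = [n/((n − 1)(n − 2))] Σ((X − X̄)/s)³, where s is the sample standard deviation.

```python
from statistics import mean, median, stdev

# Earnings per share (in dollars) for the 15 companies in Example 36.
eps = [0.09, 0.13, 0.41, 0.51, 1.12, 1.20, 1.49, 3.18,
       3.50, 6.36, 7.83, 8.92, 10.13, 12.99, 16.40]

n = len(eps)
x_bar = mean(eps)
med = median(eps)
s = stdev(eps)  # sample standard deviation (divides by n - 1)

# Pearson's coefficient of skewness (assumed formula).
pearson_sk = 3 * (x_bar - med) / s

# Software coefficient of skewness (assumed formula).
software_sk = n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in eps)

print(f"mean = {x_bar:.2f}, median = {med:.2f}, s = {s:.2f}")
print(f"Pearson sk = {pearson_sk:.3f}, software sk = {software_sk:.3f}")
```

Both coefficients are positive and the mean exceeds the median, which points to a positively skewed distribution: a few companies with large earnings pull the mean to the right.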
Example 37
A sample of five data entry clerks employed in the Horry County Tax
Office revised the following number of tax records last hour: 73, 98,
60, 92, and 84.
(a) Find the mean, median, and the standard deviation.
(b) Compute the coefficient of skewness using Pearson’s method.
(c) Calculate the coefficient of skewness using the software method.
(d) What is your conclusion regarding the skewness of the data?