
Chapter 1

The document describes descriptive statistics and frequency distributions. It defines key concepts in descriptive statistics like population and sample, variables, levels of measurement, and frequency distributions. Examples are provided to illustrate categorical and grouped frequency distributions as well as how to classify variables and their levels of measurement. The document also differentiates between descriptive and inferential statistics.

Chapter 1
Descriptive Statistics

PHOK Ponna and PHAUK Sokkhey

Department of Applied Mathematics and Statistics


Institute of Technology of Cambodia

03/10/2022

AMS (ITC) Descriptive Statistics 03/10/2022 1 / 69


Contents

1 The Nature of Probability and Statistics

2 Frequency Distributions and Graphs

3 Data Description
Measures of Central Tendency
Measures of Variation
Measures of Position

4 Exploratory Data Analysis

The Nature of Probability and Statistics

Definition 1
Statistics is the science of conducting studies to collect, organize,
summarize, analyze, and draw conclusions from data.

There are two types of statistics.


Definition 2
Descriptive statistics consists of the collection, organization,
summarization, and presentation of data.

Definition 3
Inferential statistics consists of generalizing from samples to
populations, performing estimations and hypothesis tests, determining
relationships among variables, and making predictions.

Definition 4
A population consists of all subjects (human or otherwise) that are
being studied.

Definition 5
A sample is a group of subjects selected from a population.

Example 1
Determine whether descriptive or inferential statistics were used.
a. The average jackpot for the top five lottery winners was $367.6
million.
b. A study done by the American Academy of Neurology suggests
that older people who had a high caloric diet more than doubled
their risk of memory loss.
c. Based on a survey of 9317 consumers done by the National Retail
Federation, the average amount that consumers spent on
Valentine’s Day in 2011 was $116.
d. Scientists at the University of Oxford in England found that a
good laugh significantly raises a person’s pain level tolerance.

Solution
a. Descriptive statistics were used because this is an average, and it
is based on data obtained from the top five lottery winners at this
time.
b. Inferential statistics were used since this is a generalization made
from a sample to a population.
c. Descriptive statistics were used since this is an average based on a
sample of 9317 respondents.
d. Inferential statistics were used since an inference is made from a
sample to a population.

Definition 6
A variable is a characteristic or attribute that can assume different
values. Data are the values (measurements or observations) that the
variables can assume. Variables whose values are determined by
chance are called random variables.
Variables can be classified as qualitative or quantitative.
Definition 7
Qualitative variables are variables that have distinct categories
according to some characteristic or attribute.
Quantitative variables are variables that can be counted or
measured.
Quantitative variables can be further classified into two groups:
discrete and continuous. Discrete variables assume values that can
be counted. Continuous variables can assume an infinite number of
values between any two specific values. They are obtained by
measuring. They often include fractions and decimals.

TYPES OF VARIABLES
There are two basic types of variables: (1) qualitative and (2)
quantitative.

Example 2
Classify each variable as a discrete variable or a continuous variable.
a. The highest wind speed of a hurricane
b. The weight of baggage on an airplane
c. The number of pages in a statistics book
d. The amount of money a person spends per year for online
purchases

Solution
a. Continuous, since wind speed must be measured
b. Continuous, since weight is measured
c. Discrete, since the number of pages is countable
d. Discrete, since the smallest value that money can assume is in
cents

There are four levels of measurement: nominal, ordinal, interval, and
ratio.
Definition 8
The nominal level of measurement classifies data into mutually
exclusive (nonoverlapping) categories in which no order or ranking
can be imposed on the data.
The ordinal level of measurement classifies data into categories
that can be ranked; however, precise differences between the
ranks do not exist.
The interval level of measurement ranks data, and precise
differences between units of measure do exist; however, there is
no meaningful zero.
The ratio level of measurement possesses all the characteristics
of interval measurement, and there exists a true zero. In addition,
true ratios exist when the same variable is measured on two
different members of the population.

Example 3
What level of measurement would be used to measure each variable?
a. The ages of patients in a local hospital
b. The ratings of movies released this month
c. Colors of athletic shirts sold by Oak Park Health Club
d. Temperatures of hot tubs in local health clubs

Solution
a. Ratio
b. Ordinal
c. Nominal
d. Interval

Frequency Distributions and Graphs

Definition 9
A frequency distribution is the organization of raw data in table
form, using classes and frequencies.

Two types of frequency distributions that are most often used are the
categorical frequency distribution and the grouped frequency
distribution.
The categorical frequency distribution is used for data that can
be placed in specific categories, such as nominal- or ordinal-level
data.
When the range of the data is large, the data must be grouped
into classes that are more than one unit in width, in what is called
a grouped frequency distribution.

Definition 10
A cumulative frequency distribution is a distribution that shows the
number of data values less than or equal to a specific value (usually an
upper boundary). The values are found by adding the frequencies of
the classes less than or equal to the upper class boundary of a specific
class. This gives an ascending cumulative frequency.

Example 4 (Distribution of Blood Types)


Twenty-five army inductees were given a blood test to determine their
blood type. The data set is
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
Construct a frequency distribution for the data.
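
As a minimal sketch of how the categorical frequency distribution for Example 4 could be tallied, the Python snippet below counts each blood type and reports its share of the sample. The variable names are illustrative only, and the percent column is an added convenience rather than something the example asks for.

```python
from collections import Counter

# Blood types of the 25 inductees, row by row as listed in Example 4.
blood_types = (
    "A B B AB O "
    "O O B AB B "
    "B B O A O "
    "A O O O AB "
    "AB A O B A"
).split()

counts = Counter(blood_types)   # tally: one count per category
n = len(blood_types)            # total number of observations

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for blood_type in ("A", "B", "O", "AB"):
    f = counts[blood_type]
    print(f"{blood_type:<6}{f:>10}{100 * f / n:>8.0f}%")
print(f"{'Total':<6}{n:>10}")
```

The frequencies should add up to the sample size, which is a quick check that no observation was missed or counted twice.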

Procedure to Construct a Grouped Frequency Distribution


1. Determine the classes.
Find the highest and lowest values.
Select the number of classes desired, k, such that $2^k > n$.
Find the class width i, where
$i \ge \frac{\text{maximum value} - \text{minimum value}}{k}$,
rounding up to the nearest integer.
Select a starting point (usually the lowest value or any convenient
number less than the lowest value); add the width to get the lower
limits.
Find the upper class limits.
Find the boundaries.
2. Tally the data.
3. Find the numerical frequencies from the tallies, and find the
cumulative frequencies.

Example 5 (Record High Temperatures)


These data represent the record high temperatures in degrees
Fahrenheit (°F) for each of the 50 states. Construct a grouped
frequency distribution for the data, using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
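
The procedure for a grouped frequency distribution can be sketched in Python for these data. This is an illustrative sketch, not the slides' own solution: it assumes the lowest value is used as the starting point, that the data are whole numbers (so class boundaries sit 0.5 below and above the limits), and that the 7 classes of the computed width cover the maximum. All names are hypothetical.

```python
import math

# Record high temperatures (°F) for the 50 states, from Example 5.
temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]

k = 7                                   # number of classes requested
low, high = min(temps), max(temps)
width = math.ceil((high - low) / k)     # class width, rounded up to an integer

table = []
running_total = 0
for j in range(k):
    lower = low + j * width             # lower class limit
    upper = lower + width - 1           # upper class limit (integer data)
    freq = sum(lower <= t <= upper for t in temps)
    running_total += freq               # cumulative frequency
    table.append((lower, upper, lower - 0.5, upper + 0.5, freq, running_total))

# Every value must land in exactly one class.
assert running_total == len(temps)

print("limits    boundaries     f   cum f")
for lower, upper, lb, ub, freq, cum in table:
    print(f"{lower}-{upper}   {lb}-{ub}   {freq:>3}   {cum:>5}")
```

If the range happened to be an exact multiple of the class width, the last class would stop one short of the maximum; in that case a wider class or an extra class would be needed, which is why the sketch checks the running total.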

Definition 11
The histogram is a graph that displays the data by using contiguous
vertical bars (unless the frequency of a class is 0) of various heights to
represent the frequencies of the classes.

Definition 12
The frequency polygon is a graph that displays the data by using
lines that connect points plotted for the frequencies at the midpoints
of the classes. The frequencies are represented by the heights of the
points.

Definition 13
The ogive is a graph that represents the cumulative frequencies for
the classes in a frequency distribution.

Procedure for Constructing a Histogram, Frequency Polygon, and Ogive
1 Draw and label the x and y axes.
2 On the x axis, label the class boundaries of the frequency
distribution for the histogram and ogive. Label the midpoints for
the frequency polygon.
3 Plot the frequencies for each class, and draw the vertical bars for
the histogram and the lines for the frequency polygon and ogive.

Example 6
Construct a histogram, a frequency polygon, and an ogive to represent
the data shown for the record high temperatures for each of the 50
states (see Example 5).
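
Before drawing the three graphs, it can help to list the points to be plotted. The sketch below assumes the grouped distribution of the Example 5 data with lower limits starting at 100 and a class width of 5; the frequencies are the tallies for those classes. It only computes coordinates and leaves the drawing to whatever plotting tool is at hand.

```python
from itertools import accumulate

# Assumed grouped distribution for the record high temperatures:
# classes 100-104, 105-109, ..., 130-134 (width 5).
lower_limits = [100, 105, 110, 115, 120, 125, 130]
frequencies = [2, 8, 18, 13, 7, 1, 1]
width = 5

# Class boundaries: x-axis labels for the histogram and the ogive.
boundaries = [lo - 0.5 for lo in lower_limits]
boundaries.append(lower_limits[-1] + width - 0.5)

# Class midpoints: x-axis labels for the frequency polygon.
midpoints = [(boundaries[i] + boundaries[i + 1]) / 2
             for i in range(len(frequencies))]

# Cumulative frequencies: heights of the ogive at each upper boundary.
cumulative = list(accumulate(frequencies))

histogram_bars = list(zip(boundaries[:-1], boundaries[1:], frequencies))
polygon_points = list(zip(midpoints, frequencies))
ogive_points = [(boundaries[0], 0)] + list(zip(boundaries[1:], cumulative))

print("histogram bars (left, right, height):", histogram_bars)
print("frequency polygon points:", polygon_points)
print("ogive points:", ogive_points)
```

The ogive starts at height 0 on the first lower boundary and ends at the total number of values on the last upper boundary.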

When the data are qualitative or categorical, bar graphs can be used
to represent the data. A bar graph can be drawn using either
horizontal or vertical bars.
Definition 14
A bar graph represents the data by using vertical or horizontal bars
whose heights or lengths represent the frequencies of the data.

Example 7 (College Spending for First-Year Students)


The table shows the average money spent by first-year college
students. Draw a horizontal and vertical bar graph for the data.
Electronics $728
Dorm decor 344
Clothing 141
Shoes 72

Bar graphs can also be used to compare data for two or more groups.
These types of bar graphs are called compound bar graphs or
multiple bar graphs.
Example 8
Consider the following data for the number (in millions) of never
married adults in the United States.
Year Males Females
1960 15.3 12.3
1980 24.2 20.2
2000 32.3 27.8
2010 40.2 34.0
Construct a multiple bar graph for these data.

When data are collected over a period of time, they can be
represented by a time series graph.
Definition 15
A time series graph represents data that occur over a specific period
of time.

Example 9
The data show the percentage of U.S. adults who smoke. Draw and
analyze a time series graph for the data.
Year 1970 1980 1990 2000 2010
Percent 37 33 25 23 19

Two or more data sets can be compared on the same graph called a
compound time series graph if two or more lines are used, as shown
below:

This graph shows the percentage of elderly males and females in the
U.S. labor force from 1960 to 2010. It shows that the percentage of
elderly men decreased significantly from 1960 to 1990 and then
increased slightly after that. For the elderly females, the percentage
decreased slightly from 1960 to 1980 and then increased from 1980 to
2010.

The purpose of the pie graph is to show the relationship of the parts to
the whole by visually comparing the sizes of the sections. Percentages
or proportions can be used. The variable is nominal or categorical.
Definition 16
A pie graph is a circle that is divided into sections or wedges according
to the percentage of frequencies in each category of the distribution.

Example 10 (Super Bowl Snack Foods)


This frequency distribution shows the number of pounds of each snack
food eaten during the Super Bowl. Construct a pie graph for the data.
Snack Pounds (frequency)
Potato chips 11.2 million
Tortilla chips 8.2 million
Pretzels 4.3 million
Popcorn 3.8 million
Snack nuts 2.5 million
Total n = 30.0 million

A dotplot uses points or dots to represent the data values. If the data
values occur more than once, the corresponding points are plotted
above one another.
Definition 17
A dotplot is a statistical graph in which each data value is plotted as
a point (dot) above the horizontal axis.

Dotplots are used to show how the data values are distributed and to
see if there are any extremely high or low data values.
Example 11 (Named Storms)
The data show the number of named storms each year for the last 40
years. Construct and analyze a dotplot for the data.
19 15 14 7 6 11 11 9
16 8 8 11 9 8 16 12 13
14 13 12 7 15 15 19 11 4
6 13 10 15 7 12 6 10
28 12 8 7 12 9


Data Description Measures of Central Tendency

Definition 18 (The population mean)


POPULATION MEAN  $\mu = \frac{\sum x}{N}$
where:
µ represents the population mean. It is the Greek lowercase letter
“mu.”
N is the number of values in the population.
x represents any particular value.
Σ is the Greek capital letter “sigma” and indicates the operation
of adding.
$\sum x$ is the sum of the x values in the population.

Definition 19
Any measurable characteristic of a population is called a parameter.
The mean of a population is an example of a parameter.

Example 12
There are 42 exits on I-75 through the state of Kentucky. Listed below
are the distances between exits (in miles).

11 4 10 4 9 3 8 10 3 14 1 10 3 5
2 2 5 6 1 2 2 3 7 1 3 7 8 10
1 4 7 5 2 2 5 1 1 3 3 1 2 1
Why is this information a population? What is the mean number of
miles between exits?
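
A short Python sketch of the population mean formula applied to the Example 12 data follows. The list holds all 42 distances from the example; because every exit interval on that stretch of road is included, the result is treated as a parameter rather than an estimate. Variable names are illustrative.

```python
# Distances (miles) between the 42 exits on I-75 through Kentucky.
distances = [
    11, 4, 10, 4, 9, 3, 8, 10, 3, 14, 1, 10, 3, 5,
    2, 2, 5, 6, 1, 2, 2, 3, 7, 1, 3, 7, 8, 10,
    1, 4, 7, 5, 2, 2, 5, 1, 1, 3, 3, 1, 2, 1,
]

N = len(distances)        # number of values in the population
total = sum(distances)    # sum of the x values
mu = total / N            # population mean

print(f"N = {N}, sum = {total}, mu = {mu:.2f} miles")
```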

Definition 20 (The sample mean)


SAMPLE MEAN  $\bar{x} = \frac{\sum x}{n}$
where:
x̄ represents the sample mean. It is read “x bar.”
n is the number of values in the sample.
x represents any particular value.
Σ is the Greek capital letter “sigma” and indicates the operation
of adding.
$\sum x$ is the sum of the x values in the sample.

Definition 21
Any measure based on sample data is called a statistic. The mean of
a sample is an example of a statistic.

Example 13
Verizon is studying the number of monthly minutes used by clients in
a particular cell phone rate plan. A random sample of 12 clients
showed the following number of minutes used last month.
90 77 94 89 119 112
91 110 92 100 113 83
What is the arithmetic mean number of minutes used last month?

Properties of the Arithmetic Mean


1 To compute a mean, the data must be measured at the interval
or ratio level. Recall that ratio-level data include such data as
ages, incomes, and weights.
2 All the values are included in computing the mean.
3 The mean is unique. That is, there is only one mean in a set of
data.
4 The sum of the deviations of each value from the mean is zero.
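
The sample mean for Example 13 and the fourth property above can be checked together. The following Python sketch computes the mean of the 12 sampled values and then confirms that the deviations from that mean add up to zero (up to floating-point rounding). Names are illustrative.

```python
import math

# Minutes used last month by the 12 sampled clients (Example 13).
minutes = [90, 77, 94, 89, 119, 112, 91, 110, 92, 100, 113, 83]

n = len(minutes)
x_bar = sum(minutes) / n                    # sample mean: sum of x over n
deviations = [x - x_bar for x in minutes]   # each value minus the mean
deviation_sum = sum(deviations)

print(f"n = {n}, sample mean = {x_bar} minutes")
print(f"sum of deviations from the mean = {deviation_sum}")
```

Positive and negative deviations cancel exactly, which is why the mean is often described as the balance point of the data.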

Example 14
The annual incomes of a sample of middle-management employees at
Westinghouse are $62,900, $69,100, $58,300, and $76,800.
(a) Give the formula for the sample mean.
(b) Find the sample mean.
(c) Is the mean you computed in (b) a statistic or a parameter? Why?
(d) What is your best estimate of the population mean?

Example 15
The six students in Computer Science 411 are a population. Their
final course grades are 92, 96, 61, 86, 79, and 84.
(a) Give the formula for the population mean.
(b) Compute the mean course grade.
(c) Is the mean you computed in (b) a statistic or a parameter? Why?

Procedure Table
Finding the Mean for Grouped Data
1 Make a table as shown.
A B C D
Class Frequency f Midpoint Xm f .Xm
2 Find the midpoints of each class and place them in column C.
3 Multiply the frequency by the midpoint for each class, and place
the product in column D.
4 Find the sum of column D.
5 Divide the sum obtained in column D by the sum of the
frequencies obtained in column B.
The formula for the mean is
$$\bar{X} = \frac{\sum f \cdot X_m}{n}$$

Example 16
For 108 randomly selected college students, this exam score frequency
distribution was obtained.
Class limits Frequency
90-98 6
99-107 22
108-116 43
117-125 28
126-134 9
Find the mean for this grouped data.
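
The procedure table for grouped data translates directly into a few lines of Python. This sketch builds columns C (midpoint) and D (frequency times midpoint) for the Example 16 classes and divides the sum of column D by the sum of the frequencies. The tuple layout and names are illustrative choices.

```python
# (lower limit, upper limit, frequency) for each class in Example 16.
classes = [
    (90, 98, 6),
    (99, 107, 22),
    (108, 116, 43),
    (117, 125, 28),
    (126, 134, 9),
]

n = sum(f for _, _, f in classes)                       # sum of column B
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]    # column C
products = [f * xm for (_, _, f), xm in zip(classes, midpoints)]  # column D
grouped_mean = sum(products) / n

print("Class      f   Xm     f*Xm")
for (lo, hi, f), xm, fx in zip(classes, midpoints, products):
    print(f"{lo}-{hi:<5} {f:>3}  {xm:>5}  {fx:>7}")
print(f"n = {n}, sum of f*Xm = {sum(products)}, mean = {grouped_mean}")
```

Because each value in a class is represented by the class midpoint, the result approximates the mean of the raw scores rather than reproducing it exactly.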

Definition 22
The type of mean that considers an additional factor is called the
weighted mean, and it is used when the values are not all equally
represented.
Find the weighted mean of a variable X by multiplying each value by
its corresponding weight and dividing the sum of the products by the
sum of the weights.
$$\bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w}$$

Example 17
A student received an A in English Composition I (3 credits), a C in
Introduction to Psychology (3 credits), a B in Biology I (4 credits),
and a D in Physical Education (2 credits). Assuming A = 4 grade
points, B = 3 grade points, C = 2 grade points, D = 1 grade point,
and F = 0 grade points, find the student’s grade point average.
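
For Example 17, the credits act as the weights and the grade points as the values. A minimal Python sketch of the weighted mean, with illustrative names, is:

```python
# Grade points assigned to each letter grade, as stated in Example 17.
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# (course, credits, letter grade): credits are the weights w,
# grade points are the values X.
courses = [
    ("English Composition I", 3, "A"),
    ("Introduction to Psychology", 3, "C"),
    ("Biology I", 4, "B"),
    ("Physical Education", 2, "D"),
]

weighted_sum = sum(credits * grade_points[grade] for _, credits, grade in courses)
total_weight = sum(credits for _, credits, _ in courses)
gpa = weighted_sum / total_weight

print(f"sum of wX = {weighted_sum}, sum of w = {total_weight}, GPA = {gpa:.2f}")
```

An unweighted average of the four grade points would give 2.5; the weighted mean is higher because the 4-credit course carries a B.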

Definition 23 (The Median)


MEDIAN is the midpoint of the values after they have been ordered
from the minimum to the maximum values.

Steps in computing the median of a data array


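1 Arrange the data in order $X_1, X_2, \ldots, X_n$.
2 Select the middle point.
$$MD = \begin{cases} X_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\ \dfrac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases}$$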

Properties of the median


1 It is not affected by extremely large or small values. Therefore,
the median is a valuable measure of location when such values do
occur.
2 It can be computed for ordinal-level data or higher. Recall that
ordinal-level data can be ranked from low to high.

Example 18
Facebook is a popular social networking website. Users can add
friends, send them messages, and update their personal profiles to
notify friends about themselves and their activities. A sample of 10
adults revealed they spent the following number of hours last month
using Facebook.

3 5 7 5 9 1 3 9 17 10

Find the median number of hours.
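
The two-step rule for the median of a data array can be sketched as a small Python function and applied to the Example 18 data. The function name is hypothetical; the result is cross-checked against `statistics.median` from the standard library.

```python
from statistics import median

def median_of_array(values):
    """Median by the ordered-array rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)          # step 1: arrange the data in order
    n = len(ordered)
    if n % 2 == 1:
        # X_{(n+1)/2}, shifted by one because Python lists are 0-based
        return ordered[(n + 1) // 2 - 1]
    # mean of X_{n/2} and X_{n/2 + 1}
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Hours spent on Facebook last month by the 10 sampled adults.
hours = [3, 5, 7, 5, 9, 1, 3, 9, 17, 10]
md = median_of_array(hours)

print(f"ordered data: {sorted(hours)}")
print(f"median = {md} hours")
```

With n = 10 the median falls between the 5th and 6th ordered values, so the extreme value of 17 hours has no influence on it.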



The median for the grouped data


The median for the grouped data is given by
$$MD = L_m + \frac{w}{f}\,(0.5n - cf)$$
where
Lm = lower boundary of median class.
n = sum of frequencies.
cf = cumulative frequency of class immediately preceding the
median class.
f = frequency of median class.
w = width of median class.

Example 19
The grouped data in Table 1.20 below represent the number of
children from birth through the end of the teenage years in a large
apartment complex. Find the median.

Definition 24
MODE is the value of the observation that appears most frequently.

A data set that has only one value that occurs with the greatest
frequency is said to be unimodal.
If a data set has two values that occur with the same greatest
frequency, both values are considered to be the mode and the
data set is said to be bimodal.
If a data set has more than two values that occur with the same
greatest frequency, each value is used as the mode, and the data
set is said to be multimodal.
When no data value occurs more than once, the data set is said
to have no mode.

Example 20
Recall the data regarding the distance in miles between exits on I-75 in
Kentucky. The information is repeated below.
11 4 10 4 9 3 8 10 3 14 1 10 3 5
2 2 5 6 1 2 2 3 7 1 3 7 8 10
1 4 7 5 2 2 5 1 1 3 3 1 2 1
What is the modal distance?
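
One way to find the modal distance is to tally every value and pick the one with the highest count. The sketch below uses `collections.Counter` for the tally and `statistics.multimode`, which returns every value tied for the greatest frequency, so it would also reveal a bimodal or multimodal data set. Names are illustrative.

```python
from collections import Counter
from statistics import multimode

# Distances (miles) between exits on I-75 in Kentucky (Example 20).
distances = [
    11, 4, 10, 4, 9, 3, 8, 10, 3, 14, 1, 10, 3, 5,
    2, 2, 5, 6, 1, 2, 2, 3, 7, 1, 3, 7, 8, 10,
    1, 4, 7, 5, 2, 2, 5, 1, 1, 3, 3, 1, 2, 1,
]

counts = Counter(distances)     # frequency of each distinct distance
modes = multimode(distances)    # all values sharing the greatest frequency

for value, freq in sorted(counts.items()):
    print(f"{value:>2} miles: {freq}")
print(f"mode(s): {modes}")
```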

Definition 25
The mode for grouped data is the modal class. The modal class is
the class with the largest frequency.

Example 21
Find the modal class for the frequency distribution in Example 19.

Properties of Mode
1 The mode is used when the most typical case is desired.
2 The mode is the easiest average to compute.
3 The mode can be used when the data are nominal or categorical,
such as religious preference, gender, or political affiliation.
4 The mode is not always unique. A data set can have more than
one mode, or the mode may not exist for a data set.

Definition 26
The midrange is defined as the sum of the lowest and highest values in
the data set, divided by 2. The symbol MR is used for the midrange.
$$MR = \frac{\text{lowest value} + \text{highest value}}{2}$$

Example 22
The number of bank failures for a recent five-year period is shown.
Find the midrange.

3, 30, 148, 157, 71

Properties of MR
1 The midrange is easy to compute.
2 The midrange gives the midpoint.
3 The midrange is affected by extremely high or low values in a
data set.
Data Description Measures of Variation

For the spread or variability of a data set, three measures are
commonly used: range, variance, and standard deviation.
Definition 27
The range is the highest value minus the lowest value. The symbol R
is used for the range.

R = highest value − lowest value

Definition 28
The variance of the population, denoted by $\sigma^2$, is defined by
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
The standard deviation of the population, denoted by $\sigma$, is the
square root of the variance, that is, $\sigma = \sqrt{\sigma^2}$.

Example 23
The number of traffic citations issued last year by month in Beaufort
County, South Carolina, is reported below.

Determine the population variance.

Example 24
The Philadelphia office of PricewaterhouseCoopers hired five
accounting trainees this year. Their monthly starting salaries were
$3,536; $3,173; $3,448; $3,121; and $3,622.
(a) Compute the population mean.
(b) Compute the population variance.
(c) Compute the population standard deviation.
(d) The Pittsburgh office hired six trainees. Their mean monthly
salary was $3,550, and the standard deviation was $250. Compare
the two groups.
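
Parts (a) to (c) of Example 24 follow directly from the population formulas. The Python sketch below treats the five Philadelphia salaries as the whole population, so it divides by N rather than N − 1; the Pittsburgh figures are the ones quoted in part (d). Names are illustrative.

```python
import math

# Monthly starting salaries of the five Philadelphia trainees.
salaries = [3536, 3173, 3448, 3121, 3622]

N = len(salaries)
mu = sum(salaries) / N                               # (a) population mean
variance = sum((x - mu) ** 2 for x in salaries) / N  # (b) population variance
sigma = math.sqrt(variance)                          # (c) population std. dev.

print(f"mu = {mu:.2f}, variance = {variance:.1f}, sigma = {sigma:.2f}")

# (d) Pittsburgh office, as given in the example.
pittsburgh_mean, pittsburgh_sigma = 3550, 250
print("higher mean salary:", "Pittsburgh" if pittsburgh_mean > mu else "Philadelphia")
print("more dispersion:", "Pittsburgh" if pittsburgh_sigma > sigma else "Philadelphia")
```

On these figures the Pittsburgh group has both the higher mean and the larger spread.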

Definition 29
The variance of the sample (or sample variance), denoted by $s^2$,
is defined by
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}$$
The standard deviation of the sample, denoted by s, is the square
root of the variance, that is, $s = \sqrt{s^2}$.

Example 25
Find the sample variance and standard deviation for the amount of
European auto sales for a sample of 6 years shown. The data are in
millions of dollars.

11.2, 11.9, 12.0, 12.8, 13.4, 14.3
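
Definition 29 gives two equivalent expressions for the sample variance. As a hedged illustration, the Python sketch below evaluates both on the Example 25 data and checks that they agree to within floating-point rounding. Names are illustrative.

```python
import math

# European auto sales (millions of dollars) for a sample of 6 years.
sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]

n = len(sales)
x_bar = sum(sales) / n

# Definitional form: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in sales) / (n - 1)

# Shortcut form: uses only the sum of X and the sum of X squared.
sum_x = sum(sales)
sum_x2 = sum(x * x for x in sales)
s2_shortcut = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

s = math.sqrt(s2_definition)    # sample standard deviation

print(f"mean = {x_bar:.2f}")
print(f"s^2 (definition) = {s2_definition:.3f}")
print(f"s^2 (shortcut)   = {s2_shortcut:.3f}")
print(f"s = {s:.2f}")
```

The shortcut form avoids computing the mean first, which is convenient by hand but more sensitive to rounding when the values are large relative to their spread.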

The variance and standard deviation for grouped data

Example 26
Find the variance and the standard deviation for the frequency
distribution of the data. The data represent the number of miles that
20 runners ran during one week.
Class Frequency Midpoint
5.5-10.5 1 8
10.5-15.5 2 13
15.5-20.5 3 18
20.5-25.5 5 23
25.5-30.5 4 28
30.5-35.5 3 33
35.5-40.5 2 38
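
The formula for grouped data is not reproduced in this text, so the following Python sketch rests on an assumption: it uses the usual shortcut form for a sample, $s^2 = \frac{n \sum f X_m^2 - (\sum f X_m)^2}{n(n-1)}$, with the class midpoints $X_m$ standing in for the raw values. Names are illustrative.

```python
import math

# (frequency, midpoint) for each class in Example 26.
classes = [(1, 8), (2, 13), (3, 18), (5, 23), (4, 28), (3, 33), (2, 38)]

n = sum(f for f, _ in classes)                 # number of runners
sum_fx = sum(f * xm for f, xm in classes)      # sum of f * Xm
sum_fx2 = sum(f * xm ** 2 for f, xm in classes)  # sum of f * Xm^2

# Assumed shortcut formula for the sample variance of grouped data.
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
std_dev = math.sqrt(variance)

print(f"n = {n}, sum f*Xm = {sum_fx}, sum f*Xm^2 = {sum_fx2}")
print(f"s^2 = {variance:.1f}, s = {std_dev:.1f}")
```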

Uses of the Variance and Standard Deviation


1 Variances and standard deviations can be used to determine the
spread of the data. If the variance or standard deviation is large,
the data are more dispersed. This information is useful in
comparing two (or more) data sets to determine which is more
(most) variable.
2 The measures of variance and standard deviation are used to
determine the consistency of a variable. For example, in the
manufacture of fittings, such as nuts and bolts, the variation in
the diameters must be small, or else the parts will not fit together.
3 The variance and standard deviation are used to determine the
number of data values that fall within a specified interval in a
distribution. For example, Chebyshev’s theorem (explained later)
shows that, for any distribution, at least 75% of the data values
will fall within 2 standard deviations of the mean.
4 Finally, the variance and standard deviation are used quite often
in inferential statistics.

A statistic that allows you to compare standard deviations when the
units are different is called the coefficient of variation.
Definition 30
The coefficient of variation, denoted by CVar, is the standard
deviation divided by the mean. The result is expressed as a percentage.
For populations: $CVar = \frac{\sigma}{\mu} \cdot 100$
For samples: $CVar = \frac{s}{\bar{X}} \cdot 100$

Example 27
The mean of the number of sales of cars over a 3-month period is 87,
and the standard deviation is 5. The mean of the commissions is
$5225, and the standard deviation is $773. Compare the variations of
the two.
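
Because sales are counted in cars and commissions in dollars, the standard deviations in Example 27 cannot be compared directly. A minimal Python sketch of the coefficient of variation, with a hypothetical helper name, is:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cvar_sales = coefficient_of_variation(5, 87)             # number of cars sold
cvar_commissions = coefficient_of_variation(773, 5225)   # dollars

print(f"CVar, sales:       {cvar_sales:.1f}%")
print(f"CVar, commissions: {cvar_commissions:.1f}%")
print("more variable:", "commissions" if cvar_commissions > cvar_sales else "sales")
```

The larger coefficient belongs to the commissions, so they vary more relative to their mean than the sales do.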

The Range Rule of Thumb
A rough estimate of the standard deviation is $s \approx \frac{\text{range}}{4}$.

Theorem 1 (Chebyshev’s theorem)


The proportion of values from a data set that will fall within k
standard deviations of the mean will be at least $1 - \frac{1}{k^2}$,
where k is a number greater than 1 (k is not necessarily an integer).

In summary, Chebyshev’s theorem states


1 At least three-fourths, or 75%, of all data values fall within 2
standard deviations of the mean.
2 At least eight-ninths, or 89%, of all data values fall within 3
standard deviations of the mean.

Example 28 (Prices of Homes)


The mean price of houses in a certain neighborhood is $50,000, and
the standard deviation is $10,000. Find the price range for which at
least 75% of the houses will sell.

Example 29 (Travel Allowances)


A survey of local companies found that the mean amount of travel
allowance for couriers was $0.25 per mile. The standard deviation was
$0.02. Using Chebyshev’s theorem, find the minimum percentage of
the data values that will fall between $0.20 and $0.30.
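
Both examples apply Chebyshev's theorem, once from k to an interval and once from an interval back to k. The Python sketch below shows the two directions; the function name is hypothetical, and the second calculation assumes the interval from $0.20 to $0.30 is symmetric about the mean of $0.25, which it is here.

```python
import math

def chebyshev_min_proportion(k):
    """Least proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

# Example 28: at least 75% corresponds to k = 2.
mean_price, sd_price = 50_000, 10_000
k28 = 2
low_price = mean_price - k28 * sd_price
high_price = mean_price + k28 * sd_price
print(f"at least {chebyshev_min_proportion(k28):.0%} of prices lie "
      f"between ${low_price:,} and ${high_price:,}")

# Example 29: find k from the distance between the mean and an endpoint.
mean_rate, sd_rate = 0.25, 0.02
k29 = (0.30 - mean_rate) / sd_rate
p29 = chebyshev_min_proportion(k29)
print(f"k = {k29:.1f}, so at least {p29:.0%} of allowances lie "
      f"between $0.20 and $0.30")
```

The bound holds for any distribution, which is why it is weaker than the empirical rule's figures for bell-shaped data.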

The Empirical (Normal) Rule


When a distribution is bell-shaped (or what is called normal), the
following statements, which make up the empirical rule, are true.
Approximately 68% of the data values will fall within 1 standard
deviation of the mean.
Approximately 95% of the data values will fall within 2 standard
deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard
deviations of the mean.

Data Description Measures of Position

Definition 31
A z score or standard score for a value is obtained by subtracting the
mean from the value and dividing the result by the standard deviation.
The symbol for a standard score is z. The formula is
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
The z score represents the number of standard deviations that a data
value falls above or below the mean.

Example 30 (Test Scores)


A student scored 65 on a calculus test that had a mean of 50 and a
standard deviation of 10; she scored 30 on a history test with a mean
of 25 and a standard deviation of 5. Compare her relative positions on
the two tests.
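
The comparison in Example 30 comes down to two z scores. A minimal Python sketch, with a hypothetical helper name, is:

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(f"calculus z = {z_calculus}, history z = {z_history}")
better = "calculus" if z_calculus > z_history else "history"
print(f"relative position is higher on the {better} test")
```

Although the raw margin above the mean differs (15 points versus 5 points), only the z scores put the two tests on a common scale.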

Definition 32
Percentiles are position measures used in educational and
health-related fields to indicate the position of an individual in a
group. Percentiles divide the data set into 100 equal groups. When
the data are arranged in order from lowest to highest, the percentile
corresponding to a given value X is computed by using the following
formula:
$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100$$

Example 31
A teacher gives a 20-point test to 10 students. The scores are shown
here. Find the percentile rank of a score of 12 and then of 6.

18, 15, 12, 6, 8, 2, 3, 5, 20, 10
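
The percentile formula in Definition 32 can be sketched in Python and applied to the scores in Example 31. The function name is hypothetical.

```python
def percentile_rank(data, x):
    """Percentile of x: (number of values below x + 0.5) / total number of values * 100."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) * 100 / len(data)

# Scores of the 10 students on the 20-point test.
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

rank_12 = percentile_rank(scores, 12)
rank_6 = percentile_rank(scores, 6)

print(f"ordered scores: {sorted(scores)}")
print(f"a score of 12 is at the {rank_12:.0f}th percentile")
print(f"a score of 6 is at the {rank_6:.0f}th percentile")
```

Sorting is not needed for the count itself; the ordered list is printed only to make the number of values below each score easy to verify by eye.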

Example 32
Using the scores in Example 31, find the value corresponding to the
25th percentile and the value that corresponds to the 60th percentile.

Definition 33
Quartiles divide the distribution into four groups, separated by
Q1 , Q2 , Q3 . Note that Q1 is the same as the 25th percentile; Q2 is the
same as the 50th percentile, or the median; Q3 corresponds to the
75th percentile.

Quartiles can be computed by using the formula given for computing
percentiles. For Q1 use p = 25. For Q2 use p = 50. For Q3 use
p = 75. The interquartile range (IQR) is defined as the difference
Q3 − Q1 and is the range of the middle 50% of the data.

Example 33
Find Q1 , Q2 , Q3 , and IQR for the data set 15, 13, 6, 5, 12, 50, 22, 18.
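
One way to work Example 33 is sketched below. It uses the median-of-halves convention (Q1 is the median of the lower half of the ordered data, Q3 the median of the upper half), which is an assumption made here; for this data set it agrees with the percentile approach using p = 25, 50, and 75. The function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of values the middle value belongs to neither half.
    upper = ordered[half + n % 2:]
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles([15, 13, 6, 5, 12, 50, 22, 18])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For this data set the sketch gives Q1 = 9, Q2 = 14, Q3 = 20, and IQR = 11.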

A data set should be checked for extremely high or extremely low
values. These values are called outliers.

Definition 34
An outlier is an extremely high or an extremely low data value when
compared with the rest of the data values.

Remark 1
An outlier can strongly affect the mean and standard deviation of a
variable. For example, suppose a researcher mistakenly recorded an
extremely high data value. This value would then make the mean and
standard deviation of the variable much larger than they really were.
Outliers can have an effect on other statistics as well.


Example 34
Check the following data set for outliers.

5, 6, 12, 13, 15, 18, 22, 50
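
The slides do not state a numerical rule for flagging outliers. The sketch below assumes the widely used 1.5 × IQR criterion: a value is flagged if it lies below Q1 − 1.5(IQR) or above Q3 + 1.5(IQR). The quartiles are taken as medians of the lower and upper halves, and all variable names are illustrative.

```python
from statistics import median

data = sorted([5, 6, 12, 13, 15, 18, 22, 50])

half = len(data) // 2  # the data set has an even number of values
q1 = median(data[:half])
q3 = median(data[half:])
iqr = q3 - q1

# Assumed criterion: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```

With Q1 = 9 and Q3 = 20 the fences are −7.5 and 36.5, so the value 50 is the only one flagged as an outlier.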

Exploratory Data Analysis


In exploratory data analysis (EDA), data can be organized using a
stem and leaf plot. The measure of central tendency used in EDA is the
median. The measure of variation used in EDA is the interquartile
range Q3 − Q1. In EDA the data are represented graphically using a
boxplot (sometimes called a box and whisker plot). The purpose of
exploratory data analysis is to examine data to find out what
information can be discovered about the data, such as the center and
the spread. Exploratory data analysis was developed by John Tukey
and presented in his book Exploratory Data Analysis (Addison-Wesley,
1977).


The Five-Number Summary and Boxplots


A boxplot can be used to graphically represent the data set. These
plots involve five specific values:
1 The lowest value of the data set (i.e., minimum)
2 Q1
3 The median
4 Q3
5 The highest value of the data set (i.e., maximum)

Definition 35
A boxplot is a graph of a data set obtained by drawing a horizontal
line from the minimum data value to Q1 , drawing a horizontal line
from Q3 to the maximum data value, and drawing a box whose
vertical sides pass through Q1 and Q3 with a vertical line inside the
box passing through the median or Q2 .


Procedure for constructing a boxplot


1. Find the five-number summary for the data values, that is, the
maximum and minimum data values, Q1 and Q3 , and the median.
2. Draw a horizontal axis with a scale such that it includes the
maximum and minimum data values.
3. Draw a box whose vertical sides go through Q1 and Q3 , and draw
a vertical line through the median.
4. Draw a line from the minimum data value to the left side of the
box and a line from the maximum data value to the right side of
the box.

Example 35
The number of meteorites found in 10 states of the United States is
89, 47, 164, 296, 30, 215, 138, 78, 48, 39. Construct a boxplot for the
data.
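
Step 1 of the procedure, applied to Example 35, can be sketched as follows. The quartiles are computed as medians of the lower and upper halves of the ordered data, which is an assumed convention, and the name `five_number_summary` is illustrative. Drawing the plot itself is left out to keep the example dependency-free.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using median-of-halves quartiles."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + n % 2:]  # skip the middle value when n is odd
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
minimum, q1, med, q3, maximum = five_number_summary(meteorites)
print(minimum, q1, med, q3, maximum)
```

The summary is minimum = 30, Q1 = 47, median = 83.5, Q3 = 164, and maximum = 296. The box therefore runs from 47 to 164 with the median line at 83.5, and the two lines extend to 30 and 296.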


Information Obtained from a Boxplot


a. If the median is near the center of the box, the distribution is
approximately symmetric.
b. If the median falls to the left of the center of the box, the
distribution is positively skewed.
c. If the median falls to the right of the center, the distribution is
negatively skewed.


SKEWNESS
Another characteristic of a distribution is the shape. There are four
shapes commonly observed: symmetric, positively skewed, negatively
skewed, and bimodal. In a symmetric distribution the mean and
median are equal and the data values are evenly spread around these
values. The shape of the distribution below the mean and median is a
mirror image of the distribution above the mean and median. A
distribution of values is skewed to the right or positively skewed if
there is a single peak, but the values extend much farther to the right
of the peak than to the left of the peak. In this case, the mean is
larger than the median. In a negatively skewed distribution there is a
single peak, but the observations extend farther to the left, in the
negative direction, than to the right. In a negatively skewed
distribution, the mean is smaller than the median. Positively skewed
distributions are more common. Salaries often follow this pattern. A
bimodal distribution will have two or more peaks. This is often the
case when the values are from two or more populations.

Skewness

There are several formulas in the statistical literature used to calculate
skewness. The simplest, developed by Professor Karl Pearson
(1857–1936), is based on the difference between the mean and the
median.


PEARSON’S COEFFICIENT OF SKEWNESS


$$
sk = \frac{3(\bar{x} - \text{Median})}{s}
$$
Using this relationship, the coefficient of skewness can range from -3
up to 3. A value near -3, such as -2.57, indicates considerable negative
skewness. A value such as 1.63 indicates moderate positive skewness.
A value of 0, which will occur when the mean and median are equal,
indicates the distribution is symmetrical and there is no skewness
present.

SOFTWARE COEFFICIENT OF SKEWNESS

$$
sk = \frac{n}{(n - 1)(n - 2)} \left[ \sum \left( \frac{x - \bar{x}}{s} \right)^{3} \right]
$$


Example 36
Following are the earnings per share for a sample of 15 software
companies for the year 2017. The earnings per share are arranged from
smallest to largest.
$0.09  $0.13  $0.41  $0.51  $1.12   $1.20   $1.49   $3.18
$3.50  $6.36  $7.83  $8.92  $10.13  $12.99  $16.40
Compute the mean, median, and standard deviation. Find the
coefficient of skewness using Pearson’s estimate and the software
methods. What is your conclusion regarding the shape of the
distribution?
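
Both skewness formulas can be checked numerically on the data of Example 36. The sketch below relies on Python's standard `statistics` module, where `stdev` is the sample standard deviation (divisor n − 1); the function names `pearson_skewness` and `software_skewness` are illustrative.

```python
from statistics import mean, median, stdev


def pearson_skewness(data):
    """Pearson's coefficient: 3 * (mean - median) / s."""
    return 3 * (mean(data) - median(data)) / stdev(data)


def software_skewness(data):
    """Software coefficient: n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    x_bar, s = mean(data), stdev(data)
    cubed_z_sum = sum(((x - x_bar) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed_z_sum


earnings = [0.09, 0.13, 0.41, 0.51, 1.12, 1.20, 1.49, 3.18,
            3.50, 6.36, 7.83, 8.92, 10.13, 12.99, 16.40]

print(f"mean = {mean(earnings):.2f}, median = {median(earnings):.2f}, "
      f"s = {stdev(earnings):.2f}")
print(f"Pearson sk  = {pearson_skewness(earnings):.3f}")
print(f"software sk = {software_skewness(earnings):.3f}")
```

The mean (about $4.95) exceeds the median ($3.18), and both coefficients come out close to 1, which points to moderate positive skewness in the earnings per share.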


Example 37
A sample of five data entry clerks employed in the Horry County Tax
Office revised the following number of tax records last hour: 73, 98,
60, 92, and 84.
(a) Find the mean, median, and the standard deviation.
(b) Compute the coefficient of skewness using Pearson’s method.
(c) Calculate the coefficient of skewness using the software method.
(d) What is your conclusion regarding the skewness of the data?
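
A worked sketch of Example 37 follows, using the sample standard deviation (divisor n − 1) and rounding intermediate values; the sum of cubed standardized deviations, about −1.315, was computed from the five values with s ≈ 15.19.

```latex
\begin{align*}
\bar{x} &= \frac{73 + 98 + 60 + 92 + 84}{5} = \frac{407}{5} = 81.4,
  \qquad \text{Median} = 84 \\
s &= \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
   = \sqrt{\frac{923.2}{4}} \approx 15.19 \\
sk_{\text{Pearson}} &= \frac{3(\bar{x} - \text{Median})}{s}
   = \frac{3(81.4 - 84)}{15.19} \approx -0.51 \\
sk_{\text{software}} &= \frac{n}{(n - 1)(n - 2)}
   \left[ \sum \left( \frac{x - \bar{x}}{s} \right)^{3} \right]
   \approx \frac{5}{(4)(3)} (-1.315) \approx -0.55
\end{align*}
```

Both coefficients are negative and of moderate size, so the data show moderate negative skewness.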
