
Unit 2

Uploaded by

Sandee Zandueta

Unit 2: Descriptive Statistics

Overview:

Descriptive statistics helps us describe and understand the features of a specific data set
by giving short summaries of the sample and measures of the data. It consists of two basic
categories of measures: measures of central tendency and measures of variability, or spread. The
most widely recognized descriptive statistics are the measures of central tendency: the mean,
median, and mode, which are used at almost all levels of math and statistics.

Data is all around us, and everything that we do results in new data. Every electronic
message that we send or receive, every transaction such as withdrawing money from a bank, and
every website that we visit contributes to the storage of data.

Data are the facts and figures collected, analyzed, and summarized for presentation and
interpretation. A characteristic or a quantity of interest that can take on different values is known
as a variable. An observation is a set of values corresponding to a set of variables.

Practically every problem (and opportunity) that an organization (or individual) faces is
concerned with the impact of the possible values of relevant variables on the business outcome.
Thus, we are concerned with how the value of a variable can vary; variation is the difference in
a variable measured over observations (time, customers, items, etc.).

Learning Outcomes:

After successful completion of this module, you should be able to:

• Identify what frequency and relative frequency distributions are
• Create a histogram
• Analyze and compute the measures of location (mean, median, mode)

Course Materials:

Types of Data
1. Population and Sample Data
Data can be categorized in several ways based on how they are collected, and the
type collected. In many cases, it is not feasible to collect data from the population of all
elements of interest. In such instances, we collect data from a subset of the population
known as a sample. For example, suppose your population is all the CBA students. Because
collecting data from every one of them may not be feasible, you might instead collect data from
a subset, such as the Human Resource students, which then serves as your sample. It is a more
practical approach to data collection because you will not
incur tremendous costs in time, effort, and money. And in most cases, a representative
sample can be gathered by random sampling of the population data. Dealing with
populations and samples can introduce subtle differences in how we calculate and
interpret summary statistics. In almost all practical applications of business analytics, we
will be dealing with sample data.

FUNDAMENTALS OF DESCRIPTIVE ANALYTICS (BUMA 30063) 8


2. Quantitative and Categorical Data

Data are considered quantitative data if they are numeric and arithmetic operations, such
as addition, subtraction, multiplication, and division, can be performed on them. For
instance, we can sum the values for Volume in the DJI data in Table 2.1 to calculate a
total volume of all shares traded by companies included in the DJI. If arithmetic operations
cannot be performed on the data, they are considered categorical data. We can
summarize categorical data by counting the number of observations or computing the
proportions of observations in each category. For instance, the data in the industry column
in Table 2.1 are categorical. We can count the number of companies in the DJI that are in
the telecommunications industry. Table 2.1 shows two companies in the
telecommunications industry: AT&T and Verizon Communications.

3. Cross-Sectional and Time Series Data

• Cross-sectional data are collected from several entities at the same, or
approximately the same, point in time. The data in Table 2.1 are cross-sectional
because they describe the 30 companies that comprise the DJI at the same point
in time (April 2013).

• Time series data are collected over several time periods. Graphs of time series
data are frequently found in business and economic publications. Such graphs
help analysts understand what happened in the past, identify trends over time, and
project future levels for the time series. For example, the graph of the time series
in Figure 2.1 shows the DJI value from February 2002 to April 2013.

4. Sources of Data

Data necessary to analyze a business problem or opportunity can often be obtained
with an appropriate study; such statistical studies can be classified as either experimental
or observational.

a) In an experimental study, a variable of interest is first identified. Then one or more other
variables are identified and controlled so that data can be obtained about how they
influence the variable of interest.

b) Nonexperimental, or observational, studies make no attempt to control the
variables of interest. A survey is perhaps the most common type of observational
study.

Modifying Data in Excel


Projects often involve so much data that it is difficult to analyze all of the data at once. In
this section, we examine methods for summarizing and manipulating data using Excel to make
the data more manageable and to develop insights.

Sorting Data in Excel


Excel contains many useful features for sorting and filtering data so that one can more
easily identify patterns. Suppose that we want to sort these automobiles by March 2010 sales
instead of by March 2011 sales. To do this, we use Excel’s Sort function, as shown in the following
steps.



Step 1. Select cells A1:F21

Step 2. Click the DATA tab in the Ribbon


Step 3. Click Sort in the Sort & Filter group
Step 4. Select the check box for My data has headers
Step 5. In the first Sort by dropdown menu, select Sales (March 2010)
Step 6. In the Order dropdown menu, select Largest to Smallest (see Figure 2.4)
Step 7. Click OK


The result of using Excel’s Sort function for the March 2010 data is shown in Figure 2.5.
Note that while we sorted on Sales (March 2010), which is in column E, the data in all other
columns are adjusted accordingly.



Filtering Data in Excel
Now let’s suppose that we are interested only in seeing the sales of models made by
Toyota. We can do this using Excel’s Filter function:

Step 1. Select cells A1:F21

Step 2. Click the DATA tab in the Ribbon


Step 3. Click Filter in the Sort & Filter group
Step 4. Click on the Filter Arrow in column B, next to Manufacturer

Step 5. Select only the check box for Toyota. You can easily deselect all choices by
unchecking (Select All)

The result is a display of only the data for models made by Toyota (see Figure 2.6). We
now see that of the 20 top-selling models in March 2011, Toyota made three of them. We can
further filter the data by choosing the down arrows in the other columns. We can make all data
visible again by clicking on the down arrow in column B and checking (Select All) or by clicking
Filter in the Sort & Filter Group again from the DATA tab.



Conditional Formatting Data in Excel
Conditional formatting in Excel can make it easy to identify data that satisfy certain
conditions in a data set. For instance, suppose that we wanted to quickly identify the automobile
models in Table 2.2 for which sales had decreased from March 2010 to March 2011. We can
quickly highlight these models:

Step 1. Starting with the original data shown in Figure 2.3, select cells F1:F21
Step 2. Click on the HOME tab in the Ribbon
Step 3. Click Conditional Formatting in the Styles group
Step 4. Select Highlight Cells Rules, and click Less Than from the dropdown menu
Step 5. Enter 0% in the Format cells that are LESS THAN: box
Step 6. Click OK

The results are shown in Figure 2.7. Here we see that the models with decreasing sales
(Toyota Camry, Ford Focus, Chevrolet Malibu, and Nissan Versa) are now clearly visible.

Note that Excel’s Conditional Formatting function offers tremendous flexibility. Instead of
highlighting only models with decreasing sales, we could instead choose Data Bars from the
Conditional Formatting dropdown menu in the Styles Group of the HOME tab in the Ribbon. The
result of using the Blue Data Bar Gradient Fill option is shown in Figure 2.8.



Creating Distributions from Data

1. Frequency Distributions for Categorical Data


It is often useful to create a frequency distribution for a data set. A frequency distribution
is a summary of data that shows the number (frequency) of observations in each of several non-
overlapping classes, typically referred to as bins, when dealing with distributions.

In Table 2.3, taken from a sample of 50 soft drink purchases, each purchase is for one of
five popular soft drinks, which define the five bins: Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and
Sprite. To develop a frequency distribution for these data, we count the number of times
each soft drink appears in Table 2.3. Coca-Cola appears 19 times, Diet Coke appears 8 times,
Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5 times.

This frequency distribution provides a summary of how the 50 soft drink purchases are
distributed across the five soft drinks.

Relative Frequency and Percent Frequency Distributions


A frequency distribution shows the number (frequency) of items in each of several non-
overlapping bins. However, we are often interested in the proportion, or percentage, of items in
each bin. The relative frequency of a bin equals the fraction or proportion of items belonging to a
class. For a data set with n observations, the relative frequency of each bin can be determined as
follows:
\[ \text{Relative frequency of a bin} = \frac{\text{frequency of the bin}}{n} \]

A relative frequency distribution is a tabular summary of data showing the relative
frequency for each bin. A percent frequency distribution summarizes the percent frequency of the
data for each bin.

Table 2.5 shows a relative frequency distribution and a percent frequency distribution for
the soft drink data.
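
As a rough sketch of the calculation behind such a table, the following Python snippet derives the
relative and percent frequencies from the soft drink counts given above. The variable names are
illustrative and are not part of the original material.

```python
# Frequencies of the 50 soft drink purchases, as counted in the text.
frequency = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())  # total number of observations

# Relative frequency of a bin = frequency of the bin / n
relative = {drink: count / n for drink, count in frequency.items()}

# Percent frequency = relative frequency x 100
percent = {drink: 100 * rel for drink, rel in relative.items()}

for drink, count in frequency.items():
    print(f"{drink:<12}{count:>4}{relative[drink]:>8.2f}{percent[drink]:>7.0f}%")
```

The relative frequencies sum to 1 and the percent frequencies sum to 100, which is a quick check
that no observation was missed.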

2. Frequency Distributions for Quantitative Data


The three steps necessary to define the classes for a frequency distribution with
quantitative data are:

1. Determine the number of non-overlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.

Histograms
A common graphical presentation of quantitative data is a histogram. This graphical
summary can be prepared for data previously summarized in either a frequency, a relative
frequency, or a percent frequency distribution. A histogram is constructed by placing the
variable of interest on the horizontal axis and the selected frequency measure (absolute
frequency, relative frequency, or percent frequency) on the vertical axis. The frequency measure
of each class is shown by drawing a rectangle whose base is determined by the class limits on
the horizontal axis and whose height is the corresponding frequency measure.

Histogram for the audit time data

Cumulative Distributions
A variation of the frequency distribution that provides another tabular summary of
quantitative data is the cumulative frequency distribution, which uses the number of classes, class
widths, and class limits developed for the frequency distribution.

CUMULATIVE FREQUENCY, CUMULATIVE RELATIVE FREQUENCY, AND CUMULATIVE
PERCENT FREQUENCY DISTRIBUTIONS FOR THE AUDIT TIME DATA
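
The three binning steps and the cumulative frequency idea described above can be sketched in
Python. The data values below are hypothetical and are not the audit time data, and the width rule
used here (range divided by the number of bins, rounded up) is one common convention assumed for
illustration.

```python
import math
from itertools import accumulate

# Hypothetical quantitative data, for illustration only.
data = [3, 7, 8, 12, 13, 14, 18, 21, 22, 24]

num_bins = 5                                        # step 1: number of bins
low = min(data)
width = math.ceil((max(data) - low) / num_bins)     # step 2: width of each bin
lower_limits = [low + i * width for i in range(num_bins)]  # step 3: bin limits

# Count the observations falling in each bin; the largest value is
# always assigned to the last bin.
frequency = [0] * num_bins
for x in data:
    index = min((x - low) // width, num_bins - 1)
    frequency[index] += 1

# Cumulative frequency: running total of the bin frequencies.
cumulative = list(accumulate(frequency))

for lo, f, c in zip(lower_limits, frequency, cumulative):
    print(f"{lo:>3} to {lo + width - 1:>3}: frequency {f}, cumulative {c}")
```

The last cumulative frequency always equals the number of observations, since every value has
been counted by then.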

Measures of Location
1. Mean (Arithmetic Mean)
The most commonly used measure of location is the mean (arithmetic mean), or average
value, for a variable. The mean provides a measure of central location for the data. If the data are
for a sample (typically the case), the mean is denoted by x̄. The sample mean is a point estimate
of the (typically unknown) population mean for the variable of interest. If the data for the entire
population are available, the population mean is computed in the same manner, but denoted by
the Greek letter μ.


TABLE 2.9 Data on Home Sales in a Cincinnati, Ohio, Suburb

Home Sale    Selling Price ($)
1            138,000
2            254,000
3            186,000
4            257,500
5            108,000
6            254,000
7            138,000
8            298,000
9            199,500
10           208,000
11           142,000
12           456,250

The mean can be found in Excel using the AVERAGE function. Figure 2.16 shows
the Home Sales data from Table 2.9 in an Excel spreadsheet. The value for the mean in cell
E2 is calculated using the formula =AVERAGE(B2:B13).
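
For readers working outside Excel, a minimal Python sketch of the same calculation follows. It
uses the standard library's statistics.mean on the Table 2.9 prices; the variable names are
illustrative.

```python
from statistics import mean

# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# Sample mean: sum of the values divided by the number of values,
# the same calculation as =AVERAGE(B2:B13).
sample_mean = mean(prices)
print(sample_mean)  # 219937.5
```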



2. Median
The median, another measure of central location, is the value in the middle when the data
are arranged in ascending order (smallest to largest value). With an odd number of observations,
the median is the middle value. An even number of observations has no single middle value. In
this case, we follow convention and define the median as the average of the values for the middle
two observations.

Let us apply this definition to compute the median class size for a sample of five
college classes. Arranging the data in ascending order provides the following list:

32 42 46 46 54

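Because n = 5 is odd, the median is the middle value. Thus, the median class size is 46
students. Even though this data set contains two observations with values of 46, each observation
is treated separately when we arrange the data in ascending order. Suppose we also compute
the median value for the 12 home sales in Table 2.9. We first arrange the data in ascending order.

108,000 138,000 138,000 142,000 186,000 199,500

208,000 254,000 254,000 257,500 298,000 456,250

Because n = 12 is even, the median is the average of the two middle values:
(199,500 + 208,000) / 2 = 203,750.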

Although the mean is the more commonly used measure of central location, in some
situations the median is preferred. The mean is influenced by extremely small and large data
values. Notice that the median is smaller than the mean in Figure 2.16. This is because the one
large value of $456,250 in our data set inflates the mean but does not have the same effect on
the median. Notice also that the median would remain unchanged if we replaced the $456,250
with a sales price of $1.5 million. In this case, the median selling price would remain $203,750,
but the mean would increase to $306,916.67. If you were looking to buy a home in this suburb,
the median gives a better indication of the central selling price of the homes there. We can
generalize, saying that whenever a data set contains extreme values or is severely skewed, the
median is often the preferred measure of central location.
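
The following short Python sketch illustrates this robustness using the Table 2.9 prices: replacing
the largest price with $1.5 million leaves the median unchanged while the mean rises. The variable
names are illustrative.

```python
from statistics import mean, median

# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# Replace the one large value with $1.5 million.
inflated = [1_500_000 if p == 456_250 else p for p in prices]

print(median(prices), median(inflated))   # 203750.0 203750.0
print(mean(prices))                       # 219937.5
print(round(mean(inflated), 2))           # 306916.67
```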

3. Mode
A third measure of location, the mode, is the value that occurs most frequently in a data
set. To illustrate the identification of the mode, consider the sample of five class sizes.



32 42 46 46 54

The only value that occurs more than once is 46. Because this value, occurring with a
frequency of 2, has the greatest frequency, it is the mode. To find the mode for a data set with
only one most often occurring value in Excel, we use the MODE.SNGL function. Occasionally the
greatest frequency occurs at two or more different values, in which case more than one mode
exists. If data contain at least two modes, we say that they are multimodal.

A special case of multimodal data occurs when the data contain exactly two modes; in such
cases we say that the data are bimodal. In multimodal cases when there are more than two
modes, the mode is almost never reported because listing three or more modes is not particularly
helpful in describing a location for the data. Also, if no value in the data occurs more than once,
we say the data have no mode.

The Excel MODE.SNGL function will return only a single most-often-occurring value. For
multimodal distributions, we must use the MODE.MULT function in Excel to return more than
one mode.

For example, two selling prices occur twice in Table 2.9: $138,000 and $254,000. Hence,
these data are bimodal. To find both of the modes in Excel, we take these steps:
Step 1. Select cells E4 and E5
Step 2. Type the formula =MODE.MULT(B2:B13)
Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2.

Excel enters the values for both modes of this data set in cells E4 and E5: $138,000 and
$254,000.
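
A comparable check can be sketched in Python, where statistics.mode plays a role similar to
MODE.SNGL and statistics.multimode a role similar to MODE.MULT. This is an analogy for
illustration, not a claim that the functions behave identically in every case.

```python
from statistics import mode, multimode

class_sizes = [32, 42, 46, 46, 54]

# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

print(mode(class_sizes))   # 46
print(multimode(prices))   # [138000, 254000]
```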

4. Geometric Mean
The geometric mean is a measure of location that is calculated by finding the nth root of
the product of n values. The general formula for the sample geometric mean, denoted x̄g, follows.

\[ \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} \]

Measures of Variability

1. Range
The simplest measure of variability is the range. The range can be found by subtracting
the smallest value from the largest value in a data set. Let us return to the home sales data set
to demonstrate the calculation of range. Refer to the data from home sales prices in Table 2.9.
The largest home sales price is $456,250, and the smallest is $108,000. The range is $456,250
- $108,000 = $348,250.

Although the range is the easiest of the measures of variability to compute, it is seldom
used as the only measure. The reason is that the range is based on only two of the
observations and thus is highly influenced by extreme values. If, for example, we replace the
selling price of $456,250 with $1.5 million, the range would be $1,500,000 - $108,000 =
$1,392,000. This large value for the range would not be especially descriptive of the variability
in the data because 11 of the 12 home selling prices are between $108,000 and $298,000.

The range can be calculated in Excel using the MAX and MIN functions. The range value
in cell E7 of Figure 2.19 calculates the range using the formula =MAX(B2:B13) − MIN(B2:B13).
This subtracts the smallest value in the range B2:B13 from the largest value in the range B2:B13.
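
The same subtraction can be sketched in Python with the built-in max and min functions, including
the $1.5 million replacement discussed above. The variable names are illustrative.

```python
# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# Range = largest value - smallest value
price_range = max(prices) - min(prices)
print(price_range)      # 348250

# Replacing the largest price with $1.5 million changes the range sharply.
inflated = [1_500_000 if p == 456_250 else p for p in prices]
inflated_range = max(inflated) - min(inflated)
print(inflated_range)   # 1392000
```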

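2. Variance
The variance is a measure of variability based on the squared deviations of the observations
from the mean. The sample variance, denoted s², is computed as follows:

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]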


3. Standard Deviation
The standard deviation is defined to be the positive square root of the variance. We use
s to denote the sample standard deviation and σ to denote the population standard deviation.

The sample standard deviation, s, is a point estimate of the population standard
deviation, σ, and is derived from the sample variance in the following way:

\[ s = \sqrt{s^2} \]

The sample variance for the sample of class sizes in five college classes is s² = 64. Thus, the
sample standard deviation is s = √64 = 8.

Recall that the units associated with the variance are squared and that it is difficult to interpret
the meaning of squared units. Because the standard deviation is the square root of the variance,
the units of the variance, (students)² in our example, are converted to students in the standard
deviation. In other words, the standard deviation is measured in the same units as the original
data. For this reason, the standard deviation is more easily compared to the mean and other
statistics that are measured in the same units as the original data.
Figure 2.19 shows the Excel calculation for the sample standard deviation of the home sales
data, which can be calculated using Excel’s STDEV.S function. The sample standard deviation
in cell E9 is calculated using the formula =STDEV.S(B2:B13). Excel calculates the sample
standard deviation for the home sales to be $95,065.77.
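
As a hedged sketch, the sample variance and sample standard deviation can also be computed with
Python's statistics module, whose variance and stdev functions divide by n - 1 as a sample
statistic does. The class sizes and home prices are taken from the text; the variable names are
illustrative.

```python
from statistics import stdev, variance

class_sizes = [32, 42, 46, 46, 54]

# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
class_variance = variance(class_sizes)
# Sample standard deviation: positive square root of the sample variance.
class_sd = stdev(class_sizes)
print(class_variance, class_sd)

# Comparable to =STDEV.S(B2:B13) for the home sales data.
price_sd = stdev(prices)
print(round(price_sd, 2))
```

For the class sizes this gives a variance of 64 and a standard deviation of 8, and for the home
sales a standard deviation of about $95,065.77, in line with the values quoted above.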



Analyzing Distributions
In Section 2.4 we demonstrated how to create frequency, relative, and cumulative
distributions for data sets. Distributions are very useful for interpreting and analyzing data. A
distribution describes the overall variability of the observed values of a variable. In this section we
introduce additional ways of analyzing distributions.

1. Percentiles
A percentile is the value of a variable at which a specified (approximate) percentage of
observations are below that value. The pth percentile tells us about the point in the data where
approximately p percent of the observations have values less than the pth percentile; hence,
approximately (100 – p) percent of the observations have values greater than the pth percentile.
Colleges and universities frequently report admission test scores in terms of percentiles. For
instance, suppose an applicant obtains a raw score of 54 on the verbal portion of an admission
test. How this student performed in relation to other students taking the same test may not be
readily apparent. However, if the raw score of 54 corresponds to the 70th percentile, we know
that approximately 70 percent of the students scored lower than this individual, and approximately
30 percent of the students scored higher. The following procedure can be used to compute the
pth percentile:

1. Arrange the data in ascending order (smallest to largest value).

2. Compute k = (n + 1) × p, where p is expressed as a proportion (for example, 0.85 for the 85th
percentile).

3. Divide k into its integer component, i, and its decimal component, d. (For example, k = 13.25
would result in i = 13 and d = 0.25.)

a. If d = 0 (there is no decimal component for k), find the value in position k of the sorted
data. This is the pth percentile.

b. If d > 0, the percentile is between the values in positions i and i + 1 in the sorted data. To find
this percentile, we must interpolate between these two values.

i. Calculate the difference between the values in positions i and i + 1 in the sorted data
set. We define this difference between the two values as m.

ii. Multiply this difference by d:

t = m × d

iii. To find the pth percentile, add t to the value in position i of the sorted data.

As an illustration, let us determine the 85th percentile for the home sales data in Table 2.9.

1. Arrange the data in ascending order.

108,000 138,000 138,000 142,000 186,000 199,500

208,000 254,000 254,000 257,500 298,000 456,250

2. Compute k = (n + 1) × p = (12 + 1) × 0.85 = 11.05.

3. Dividing 11.05 into the integer and decimal components gives us i = 11 and d = 0.05. Because
d > 0, we must interpolate between the values in the 11th and 12th positions in our sorted data.
The value in the 11th position is 298,000, and the value in the 12th position is 456,250.

i. m = 456,250 - 298,000 = 158,250.

ii. t = m × d = 158,250 × 0.05 = 7,912.5.

iii. pth percentile = 298,000 + 7,912.5 = 305,912.5.

Therefore, $305,912.50 represents the 85th percentile of the home sales data.

The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC. Figure
2.18 shows the Excel calculation for the 85th percentile of the home sales data. The value in cell
E13 is calculated using the formula =PERCENTILE.EXC(B2:B13,0.85); B2:B13 defines the data
set for which we are calculating a percentile, and 0.85 defines the percentile of interest.
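
The step-by-step procedure above can be sketched as a small Python function. The function name
percentile and its error handling for out-of-range positions are assumptions made for this
illustration; they are not part of the original procedure.

```python
def percentile(data, p):
    """Return the pth percentile (p as a proportion) using k = (n + 1) * p."""
    values = sorted(data)            # step 1: ascending order
    n = len(values)
    k = (n + 1) * p                  # step 2: position in the sorted data
    if not 1 <= k <= n:
        raise ValueError("percentile position falls outside the data")
    i = int(k)                       # step 3: integer component
    d = k - i                        #         decimal component
    if d == 0:
        return values[i - 1]         # positions are 1-based in the text
    m = values[i] - values[i - 1]    # difference between positions i and i + 1
    t = m * d
    return values[i - 1] + t


# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

print(round(percentile(prices, 0.85), 2))  # 305912.5
```

The result is rounded before printing because 0.85 cannot be represented exactly in binary
floating point, so the unrounded value may differ from 305,912.5 in the last few digits.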

2. Quartiles
It is often desirable to divide data into four parts, with each part containing approximately
one-fourth, or 25 percent, of the observations. These division points are referred to as the quartiles
and are defined as:

Q1 = first quartile, or 25th percentile

Q2 = second quartile, or 50th percentile (also the median)

Q3 = third quartile, or 75th percentile.

To demonstrate quartiles, the home sales data are again arranged in ascending order.
108,000 138,000 138,000 142,000 186,000 199,500

208,000 254,000 254,000 257,500 298,000 456,250

We already identified Q2, the second quartile (median), as 203,750. To find Q1 and Q3, we must
find the 25th and 75th percentiles.

For Q1,
1. The data are arranged in ascending order, as previously done.

2. Compute k = (n + 1) × p = (12 + 1) × 0.25 = 3.25.

3. Dividing 3.25 into the integer and decimal components gives us i = 3 and d = 0.25. Because d
> 0, we must interpolate between the values in the 3rd and 4th positions in our sorted data. The
value in the 3rd position is 138,000, and the value in the 4th position is 142,000.

i. m = 142,000 - 138,000 = 4,000.

ii. t = m × d = 4,000 × 0.25 = 1,000.

iii. pth percentile = 138,000 + 1,000 = 139,000.

Therefore, the 25th percentile is 139,000. Similar calculations for the 75th percentile yield
256,625. The quartiles divide the home sales data into four parts, with each part
containing 25 percent of the observations.

108,000 138,000 138,000 | 142,000 186,000 199,500 | 208,000 254,000 254,000 | 257,500 298,000 456,250

Q1 = 139,000 Q2 = 203,750 Q3 = 256,625

The difference between the third and first quartiles is often referred to as the interquartile
range, or IQR. For the home sales data, IQR = Q3 - Q1 = 256,625 - 139,000 = 117,625. Because
it excludes the smallest and largest 25 percent of values in the data, the IQR is a useful measure
of variation for data that have extreme values or are badly skewed.

A quartile can be computed in Excel using the function QUARTILE.EXC. Figure 2.18
shows the calculations for first, second, and third quartiles for the home sales data. The formula
used in cell E15 is =QUARTILE.EXC(B2:B13,1). The range B2:B13 defines the data set, and 1
indicates that we want to compute the 1st quartile. Cells E16 and E17 use similar formulas to
compute the second and third quartiles.
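
A similar calculation can be sketched in Python with statistics.quantiles. Its "exclusive" method
interpolates at position (n + 1) × p, so for this data set it reproduces the quartiles computed by
hand above. The variable names are illustrative.

```python
from statistics import quantiles

# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(prices, n=4, method="exclusive")
print(q1, q2, q3)   # 139000.0 203750.0 256625.0

# Interquartile range: difference between the third and first quartiles.
iqr = q3 - q1
print(iqr)          # 117625.0
```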

3. z-scores
A z-score allows us to measure the relative location of a value in the data set. More
specifically, a z-score helps us determine how far a particular value is from the mean relative to
the data set's standard deviation. Suppose we have a sample of n observations, with the values
denoted by x1, x2, . . . , xn. In addition, assume that the sample mean, x̄, and the sample standard
deviation, s, are already computed. Associated with each value, xi, is another value called its
z-score. Equation (2.7) shows how the z-score is computed for each xi:

\[ z_i = \frac{x_i - \bar{x}}{s} \]

The z-score is often called the standardized value. The z-score, zi, can be interpreted as
the number of standard deviations that xi is from the mean. For example, z1 = 1.2 indicates that x1
is 1.2 standard deviations greater than the sample mean. Similarly, z2 = -0.5 indicates that x2 is
0.5, or 1/2, standard deviation less than the sample mean. A z-score greater than zero occurs for
observations with a value greater than the mean, and a z-score less than zero occurs for
observations with a value less than the mean. A z-score of zero indicates that the value of the
observation is equal to the mean.



The z-scores for the class size data are computed in Table 2.13. Recall the previously
computed sample mean, x̄ = 44, and sample standard deviation, s = 8. The z-score of
-1.50 for the fifth observation shows that it is farthest from the mean; it is 1.50
standard deviations below the mean.

The z-score can be calculated in Excel using the function STANDARDIZE. Figure 2.19
demonstrates the use of the STANDARDIZE function to compute z-scores for the home sales
data. To calculate the z-scores, we must provide the mean and standard deviation for the data
set in the arguments of the STANDARDIZE function. For instance, the z-score in cell C2 is
calculated with the formula =STANDARDIZE(B2, $B$15, $B$16), where cell B15 contains the
mean of the home sales data and cell B16 contains the standard deviation of the home sales
data. We can then copy and paste this formula into cells C3:C13.
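
As an illustrative sketch, the z-scores for the five class sizes can be computed in Python. The
values are listed here in the ascending order shown earlier, which may differ from the order used
in Table 2.13, so the class of 32 students appears first rather than fifth.

```python
from statistics import mean, stdev

class_sizes = [32, 42, 46, 46, 54]

x_bar = mean(class_sizes)   # sample mean (44)
s = stdev(class_sizes)      # sample standard deviation (8)

# z-score: number of standard deviations a value lies from the mean.
z_scores = [(x - x_bar) / s for x in class_sizes]
print(z_scores)             # [-1.5, -0.25, 0.25, 0.25, 1.25]
```

The class of 32 students has the z-score of -1.5 discussed above, and the z-scores sum to zero
because deviations from the mean always cancel.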



4. Empirical Rule

When the distribution of data exhibits a symmetric bell-shaped distribution, as shown in Figure
2.20, the empirical rule can be used to determine the percentage of data values that are within a
specified number of standard deviations of the mean. Many, but not all, distributions of data found
in practice exhibit a symmetric bell-shaped distribution.

The height of adult males in the United States has a bell-shaped distribution similar to that
shown in Figure 2.20 with a mean of approximately 69.5 inches and standard deviation of
approximately 3 inches. Using the empirical rule, we can draw the following conclusions.

• Approximately 68 percent of adult males in the United States have heights between 69.5
- 3 = 66.5 and 69.5 + 3 = 72.5 inches.

• Approximately 95 percent of adult males in the United States have heights between 63.5
and 75.5 inches.

• Almost all adult males in the United States have heights between 60.5 and 78.5 inches.
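
The three intervals above follow from adding and subtracting one, two, and three standard
deviations from the mean. A minimal Python sketch of that arithmetic, using the approximate mean
and standard deviation quoted in the text:

```python
mean_height = 69.5   # inches, approximate mean from the text
sd_height = 3.0      # inches, approximate standard deviation from the text

# Share of values expected within k standard deviations of the mean
# for a symmetric bell-shaped distribution.
shares = {1: "approximately 68 percent", 2: "approximately 95 percent", 3: "almost all"}

intervals = {
    k: (mean_height - k * sd_height, mean_height + k * sd_height)
    for k in shares
}

for k, (low, high) in intervals.items():
    print(f"{shares[k]}: between {low} and {high} inches")
```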

5. Identifying Outliers
Sometimes a data set will have one or more observations with unusually large or unusually
small values. These extreme values are called outliers. Experienced statisticians take steps to
identify outliers and then review each one carefully. An outlier may be a data value that has been
incorrectly recorded; if so, it can be corrected before further analysis. An outlier may also be from
an observation that doesn’t belong to the population we are studying and was incorrectly included
in the data set; if so, it can be removed. Finally, an outlier may be an
unusual data value that has been recorded correctly and is a member of the population we are
studying. In such cases, the observation should remain.

Standardized values (z-scores) can be used to identify outliers. Recall that the empirical
rule allows us to conclude that for data with a bell-shaped distribution, almost all the data values
will be within three standard deviations of the mean. Hence, in using z-scores to identify outliers,
we recommend treating any data value with a z-score less than -3 or greater than +3 as an outlier.
Such data values can then be reviewed to determine their accuracy and whether they belong in
the data set.

6. Box Plots
A box plot is a graphical summary of the distribution of data. A box plot is developed from the
quartiles for a data set. Figure 2.21 is a box plot for the home sales data. Here are the steps used
to construct the box plot:

1. A box is drawn with the ends of the box located at the first and third quartiles. For the
home sales data, Q1 = 139,000 and Q3 = 256,625. This box contains the middle 50
percent of the data.

2. A vertical line is drawn in the box at the location of the median (203,750 for the home sales
data).

3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the box
plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the home sales data, IQR = Q3
- Q1 = 256,625 - 139,000 = 117,625. Thus, the limits are 139,000 - 1.5(117,625) =
-37,437.5 and 256,625 + 1.5(117,625) = 433,062.5. Data outside these limits are
considered outliers.

4. The dashed lines in Figure 2.21 are called whiskers. The whiskers are drawn from the
ends of the box to the smallest and largest values inside the limits computed in step 3.
Thus, the whiskers end at home sales values of 108,000 and 298,000.

5. Finally, the location of each outlier is shown with an asterisk (*). In Figure 2.21, we see
one outlier, 456,250.
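
The numbers used in steps 3 to 5 can be reproduced with a short Python sketch. It takes Q1 and Q3
as given in step 1 rather than recomputing them, and the variable names are illustrative.

```python
# Selling prices ($) of the 12 home sales in Table 2.9.
prices = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
          138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

q1, q3 = 139_000, 256_625          # quartiles from step 1
iqr = q3 - q1                      # interquartile range

lower_limit = q1 - 1.5 * iqr       # step 3: limits
upper_limit = q3 + 1.5 * iqr

# Step 4: whiskers end at the smallest and largest values inside the limits.
inside = [p for p in prices if lower_limit <= p <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

# Step 5: values outside the limits are flagged as outliers.
outliers = [p for p in prices if p < lower_limit or p > upper_limit]

print(lower_limit, upper_limit)    # -37437.5 433062.5
print(whisker_low, whisker_high)   # 108000 298000
print(outliers)                    # [456250]
```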

Box plots are also very useful for comparing different data sets. For instance, if we want to
compare home sales from several different communities, we could create box plots for recent
home sales in each community. An example of such box plots is shown in Figure 2.22.



Measures of Association Between Two Variables

1. Scatter Charts
A scatter chart is a useful graph for analyzing the relationship between two variables.
Figure 2.23 shows a scatter chart for sales of bottled water versus the high temperature
experienced over 14 days. The scatter chart also suggests that a straight line could be used as
an approximation for the relationship between high temperature and sales of bottled water.



Table 2.14 DATA FOR BOTTLED WATER SALES AT QUEENSLAND AMUSEMENT PARK
FOR A SAMPLE OF 14 SUMMER DAYS

FIGURE 2.23 CHART SHOWING THE POSITIVE LINEAR RELATION BETWEEN SALES AND
HIGH TEMPERATURES

2. Covariance
Covariance is a descriptive measure of the linear association between two variables. For
a sample of size n with the observations (x1, y1), (x2, y2), and so on, the sample covariance is
defined as follows:

$$ s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \qquad (2.8) $$

To measure the strength of the linear relationship between the high temperature x and the
sales of bottled water y at Queensland, we use equation (2.8) to compute the sample covariance.
Table 2.15 shows the calculations.

Table 2.15 SAMPLE COVARIANCE CALCULATIONS FOR DAILY HIGH TEMPERATURE AND
BOTTLED WATER SALES AT QUEENSLAND AMUSEMENT PARK



The covariance calculated in Table 2.15 is s_xy = 12.8. Because the covariance is greater
than 0, it indicates a positive relationship between the high temperature and sales of bottled water.
This confirms the relationship we saw in the scatter chart in Figure 2.23: as the high temperature
for a day increases, sales of bottled water generally increase. If the covariance is near 0, then the
x and y variables are not linearly related. If the covariance is less than 0, then the x and y variables
are negatively related, which means that as x increases, y generally decreases.
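
As a rough illustration of how the sign of the covariance is read, the sketch below computes the sample covariance as the sum of the products of the deviations from the means, divided by n - 1. The function name sample_covariance and the five data pairs are made up for illustration; they are not the Queensland data from Table 2.14.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar) over n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross_products / (n - 1)


# Illustrative data only (not from Table 2.14).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(sample_covariance(x, y))                   # 1.5  -> positive relationship
print(sample_covariance(x, [-v for v in y]))     # -1.5 -> negative relationship
```

Flipping the sign of every y value flips the sign of the covariance, which matches the interpretation above: a positive value means y tends to rise with x, and a negative value means y tends to fall as x rises.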

The sample covariance can also be calculated in Excel using the COVARIANCE.S
function. Figure 2.24 shows the data from Table 2.14 entered into an Excel Worksheet. Figure
2.25 demonstrates several possible scatter charts and their associated covariance values.

FIGURE 2.24 CALCULATING COVARIANCE AND CORRELATION COEFFICIENT FOR
BOTTLED WATER SALES USING EXCEL



FIGURE 2.25 SCATTER DIAGRAMS AND ASSOCIATED COVARIANCE VALUES FOR
DIFFERENT VARIABLE RELATIONSHIPS

3. Correlation Coefficient
The correlation coefficient measures the strength of the linear relationship between two
variables and, unlike covariance, is not affected by the units of measurement for x and y.
For sample data, the correlation coefficient is defined as follows:

$$ r_{xy} = \frac{s_{xy}}{s_x s_y} \qquad (2.9) $$

The sample correlation coefficient is computed by dividing the sample covariance by the
product of the sample standard deviation of x and the sample standard deviation of y. This scales
the correlation coefficient so that it will always take values between -1 and +1.
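
The following Python sketch shows this scaling at work: changing the units of x changes the covariance but leaves the correlation coefficient unchanged. The helper names and the five data pairs are hypothetical and chosen only for illustration; they are not taken from Table 2.14.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar) over n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_std(x):
    """Sample standard deviation (square root of the sample variance)."""
    return math.sqrt(sample_covariance(x, x))


def sample_correlation(x, y):
    """Sample covariance divided by the product of the standard deviations."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))


# Illustrative data only (not from Table 2.14).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_scaled = [100 * v for v in x]   # same x values expressed in a smaller unit

r = sample_correlation(x, y)
print(sample_covariance(x, y))                      # 1.5
print(sample_covariance(x_scaled, y))               # 150.0 (depends on units)
print(round(r, 4))                                  # 0.7746
print(round(sample_correlation(x_scaled, y), 4))    # 0.7746 (unit-free)
```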



Let us now compute the sample correlation coefficient for bottled water sales at
Queensland Amusement Park. Recall that we calculated s_xy = 12.8 using equation (2.8). Using
data in Table 2.14, we can compute sample standard deviations for x and y.

The sample correlation coefficient is computed from equation (2.9) as follows:

The correlation coefficient can take only values between -1 and +1. Correlation coefficient
values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients
greater than 0 indicate a positive linear relationship between the x and y variables. The closer the
correlation coefficient is to +1, the closer the x and y values are to forming a straight line that
trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative
linear relationship between the x and y variables. The closer the correlation coefficient is to -1,
the closer the x and y values are to forming a straight line with negative slope.

FIGURE 2.26 EXAMPLE OF NONLINEAR RELATIONSHIP PRODUCING A CORRELATION
COEFFICIENT NEAR ZERO

The scatter diagram in Figure 2.26 shows the relationship between the amount spent by
a small retail store for environmental control (heating and cooling) and the daily high outside
temperature over 100 days. The sample correlation coefficient for these data is r_xy = -0.007,
which indicates that there is no linear relationship between the two variables. However, Figure
2.26 provides strong visual evidence of a nonlinear relationship.
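
A small Python sketch can make the same point with made-up numbers: when y depends on x exactly but in a symmetric, U-shaped way, the sample correlation coefficient comes out as zero. The data below are purely illustrative (y is the square of x) and are not the store data behind Figure 2.26; the function name sample_correlation is also hypothetical.

```python
import math


def sample_correlation(x, y):
    """Sample correlation: covariance over the product of standard deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


# Illustrative U-shaped data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r = sample_correlation(x, y)
print(r)    # 0.0: no linear relationship despite a perfect nonlinear one
```

This is why a scatter chart should be inspected alongside the correlation coefficient: a value near zero rules out a linear pattern only, not a relationship of some other shape.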

Read:
Chp. 2 - Camm, J., Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams. (2015). Essentials of
Business Analytics. Stamford, USA: Cengage Learning.
