Unit 2

Uploaded by

Sandee Zandueta

Unit 2: Descriptive Statistics

Overview:

Descriptive statistics help describe and understand the features of a specific data set
by giving short summaries of the sample and measures of the data. They fall into two basic
categories: measures of central tendency and measures of variability (spread). The most
widely recognized descriptive statistics are the measures of central tendency, namely the mean,
median, and mode, which are used at almost all levels of math and statistics.

Data are all around us, and almost everything we do generates new data. Every electronic
message we send or receive, every transaction such as withdrawing money from a bank, and every
website we visit adds to the store of data.

Data are the facts and figures collected, analyzed, and summarized for presentation and
interpretation. A characteristic or a quantity of interest that can take on different values is known
as a variable. An observation is a set of values corresponding to a set of variables.

Practically every problem (and opportunity) that an organization (or individual) faces is
concerned with the impact of the possible values of relevant variables on the business outcome.
Thus, we are concerned with how the value of a variable can vary; variation is the difference in
a variable measured over observations (time, customers, items, etc.).

Learning Outcomes:

After successful completion of this module, you should be able to:

• Identify frequency and relative frequency distributions
• Create a histogram
• Analyze and compute the measures of location (mean, median, mode)

Course Materials:

Type of Data
1. Population and Sample Data
Data can be categorized in several ways based on how they are collected, and the
type collected. In many cases, it is not feasible to collect data from the population of all
elements of interest. In such instances, we collect data from a subset of the population
known as a sample. For example, suppose your population is all the CBA students. If you
can collect data only from a subset of that population, such as the Human Resource
students, that subset is your sample. Sampling is a more practical approach to data collection
because it avoids the tremendous costs in time, effort, and money of reaching every element.
In most cases, a representative
sample can be gathered by random sampling of the population data. Dealing with
populations and samples can introduce subtle differences in how we calculate and
interpret summary statistics. In almost all practical applications of business analytics, we
will be dealing with sample data.
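
As a minimal sketch of random sampling, the Python snippet below draws a simple random sample without replacement. The population of 500 ID numbers, the sample size of 50, and the seed are all hypothetical values chosen only for illustration.

```python
import random

# Hypothetical population: ID numbers standing in for all CBA students.
population = list(range(1, 501))

# Fix the seed so the illustration is reproducible.
rng = random.Random(42)

# Draw a simple random sample of 50 students without replacement.
sample = rng.sample(population, k=50)

print(len(sample), "of", len(population), "students sampled")
```

Because every student has the same chance of selection, a sample drawn this way tends to be representative of the population, which is the property the text relies on.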

FUNDAMENTALS OF DESCRIPTIVE ANALYTICS (BUMA 30063) 8

2. Quantitative and Categorical Data

Data are considered quantitative if they are numeric and arithmetic operations, such
as addition, subtraction, multiplication, and division, can be performed on them. For
instance, we can sum the values for Volume in the DJI data in Table 2.1 to calculate a
total volume of all shares traded by companies included in the DJI. If arithmetic operations
cannot be performed on the data, they are considered categorical data. We can
summarize categorical data by counting the number of observations or computing the
proportions of observations in each category. For instance, the data in the industry column
in Table 2.1 are categorical. We can count the number of companies in the DJI that are in
the telecommunications industry. Table 2.1 shows two companies in the
telecommunications industry: AT&T and Verizon Communications.

3. Cross-Sectional and Time Series Data

• Cross-sectional data are collected from several entities at the same, or
approximately the same, point in time. The data in Table 2.1 are cross-sectional
because they describe the 30 companies that comprise the DJI at the same point
in time (April 2013).

• Time series data are collected over several time periods. Graphs of time series
data are frequently found in business and economic publications. Such graphs
help analysts understand what happened in the past, identify trends over time, and
project future levels for the time series. For example, the graph of the time series
in Figure 2.1 shows the DJI value from February 2002 to April 2013.

4. Sources of Data

Data necessary to analyze a business problem or opportunity can often be obtained
with an appropriate study; such statistical studies can be classified as either experimental
or observational.

a) In an experimental study, a variable of interest is first identified. Then one or more other
variables are identified and controlled so that data can be obtained about how they
influence the variable of interest.

b) Nonexperimental, or observational, studies make no attempt to control the
variables of interest. A survey is perhaps the most common type of observational
study.

Modifying Data in Excel

Projects often involve so much data that it is difficult to analyze all of the data at once. In
this section, we examine methods for summarizing and manipulating data using Excel to make
the data more manageable and to develop insights.

Sorting Data in Excel

Excel contains many useful features for sorting and filtering data so that one can more
easily identify patterns. Suppose that we want to sort these automobiles by March 2010 sales
instead of by March 2011 sales. To do this, we use Excel’s Sort function, as shown in the following
steps.

Step 1. Select cells A1:F21

Step 2. Click the DATA tab in the Ribbon

Step 3. Click Sort in the Sort & Filter group
Step 4. Select the check box for My data has headers
Step 5. In the first Sort by dropdown menu, select Sales (March 2010)
Step 6. In the Order dropdown menu, select Largest to Smallest (see Figure 2.4)
Step 7. Click OK

The result of using Excel’s Sort function for the March 2010 data is shown in Figure 2.5.
Note that while we sorted on Sales (March 2010), which is in column E, the data in all other
columns are adjusted accordingly.

Filtering Data in Excel
Now let’s suppose that we are interested only in seeing the sales of models made by
Toyota. We can do this using Excel’s Filter function:

Step 1. Select cells A1:F21

Step 2. Click the DATA tab in the Ribbon

Step 3. Click Filter in the Sort & Filter group
Step 4. Click on the Filter Arrow in column B, next to Manufacturer

Step 5. Select only the check box for Toyota. You can easily deselect all choices by
unchecking (Select All)

The result is a display of only the data for models made by Toyota (see Figure 2.6). We
now see that of the 20 top-selling models in March 2011, Toyota made three of them. We can
further filter the data by choosing the down arrows in the other columns. We can make all data
visible again by clicking on the down arrow in column B and checking (Select All) or by clicking
Filter in the Sort & Filter Group again from the DATA tab.
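
For readers who also work outside Excel, the sort and filter operations described above can be sketched in Python. The three rows below are hypothetical placeholders, not the values from Table 2.2, and the field names are assumptions made for this illustration.

```python
# Hypothetical rows (model, manufacturer, March 2010 sales); not the Table 2.2 values.
rows = [
    {"model": "Model A", "manufacturer": "Toyota", "sales_2010": 300},
    {"model": "Model B", "manufacturer": "Ford", "sales_2010": 500},
    {"model": "Model C", "manufacturer": "Toyota", "sales_2010": 400},
]

# Sort: largest to smallest on one column; each whole row moves together.
by_sales = sorted(rows, key=lambda r: r["sales_2010"], reverse=True)

# Filter: keep only the rows whose manufacturer is Toyota.
toyota_only = [r for r in rows if r["manufacturer"] == "Toyota"]

print([r["model"] for r in by_sales])
print([r["model"] for r in toyota_only])
```

As in Excel, sorting on one column rearranges entire records, and filtering hides rows without changing the underlying data.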

Conditional Formatting Data in Excel
Conditional formatting in Excel can make it easy to identify data that satisfy certain
conditions in a data set. For instance, suppose that we wanted to quickly identify the automobile
models in Table 2.2 for which sales had decreased from March 2010 to March 2011. We can
quickly highlight these models:

Step 1. Starting with the original data shown in Figure 2.3, select cells F1:F21
Step 2. Click on the HOME tab in the Ribbon
Step 3. Click Conditional Formatting in the Styles group
Step 4. Select Highlight Cells Rules, and click Less Than from the dropdown menu
Step 5. Enter 0% in the Format cells that are LESS THAN: box
Step 6. Click OK

The results are shown in Figure 2.7. Here we see that the models with decreasing sales
(Toyota Camry, Ford Focus, Chevrolet Malibu, and Nissan Versa) are now clearly visible.

Note that Excel’s Conditional Formatting function offers tremendous flexibility. Instead of
highlighting only models with decreasing sales, we could instead choose Data Bars from the
Conditional Formatting dropdown menu in the Styles Group of the HOME tab in the Ribbon. The
result of using the Blue Data Bar Gradient Fill option is shown in Figure 2.8.

Creating Distributions from Data

1. Frequency Distributions for Categorical Data

It is often useful to create a frequency distribution for a data set. A frequency distribution
is a summary of data that shows the number (frequency) of observations in each of several non-
overlapping classes, typically referred to as bins, when dealing with distributions.

In Table 2.3, taken from a sample of 50 soft drink purchases, each purchase is for one of
five popular soft drinks, which define the five bins: Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and
Sprite. To develop a frequency distribution for these data, we count the number of times
each soft drink appears in Table 2.3. Coca-Cola appears 19 times, Diet Coke appears 8 times,
Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5 times.

This frequency distribution provides a summary of how the 50 soft drink purchases are
distributed across the five soft drinks.
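
The counting step can be sketched in Python. Since the individual purchases are not listed here, the snippet rebuilds a list of 50 purchases from the counts stated above; the order of purchases is therefore an assumption and does not affect the tally.

```python
from collections import Counter

# Rebuild the 50 purchases from the counts reported in the text
# (the order of the individual purchases is not given, so it is arbitrary here).
purchases = (["Coca-Cola"] * 19 + ["Diet Coke"] * 8 + ["Dr. Pepper"] * 5
             + ["Pepsi"] * 13 + ["Sprite"] * 5)

# Counter tallies how many observations fall in each bin.
frequency = Counter(purchases)

for drink, count in frequency.items():
    print(f"{drink:<12}{count:>3}")
print("Total", sum(frequency.values()))
```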

Relative Frequency and Percent Frequency Distributions

A frequency distribution shows the number (frequency) of items in each of several non-
overlapping bins. However, we are often interested in the proportion, or percentage, of items in
each bin. The relative frequency of a bin equals the fraction or proportion of items belonging to that
bin. For a data set with n observations, the relative frequency of each bin can be determined as
follows:

$$\text{Relative frequency of a bin} = \frac{\text{frequency of the bin}}{n}$$

A relative frequency distribution is a tabular summary of data showing the relative
frequency for each bin. A percent frequency distribution summarizes the percent frequency of the
data for each bin, where the percent frequency is the relative frequency multiplied by 100.

Table 2.5 shows a relative frequency distribution and a percent frequency distribution for
the soft drink data.
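
A short Python sketch of the same calculation, using the soft drink counts given earlier in the text. The dictionary layout and variable names are choices made for this illustration.

```python
# Frequencies for the 50 soft drink purchases, as stated in the text.
frequency = {"Coca-Cola": 19, "Diet Coke": 8, "Dr. Pepper": 5, "Pepsi": 13, "Sprite": 5}

n = sum(frequency.values())  # total number of observations

# Relative frequency = frequency of the bin / n.
relative = {drink: count / n for drink, count in frequency.items()}

# Percent frequency = relative frequency * 100.
percent = {drink: 100 * share for drink, share in relative.items()}

for drink in frequency:
    print(f"{drink:<12}{relative[drink]:>6.2f}{percent[drink]:>7.0f}%")
```

The relative frequencies sum to 1 and the percent frequencies sum to 100, which is a quick check that no observation was missed.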

2. Frequency Distributions for Quantitative Data

The three steps necessary to define the classes for a frequency distribution with
quantitative data are:

1. Determine the number of non-overlapping bins.

2. Determine the width of each bin.
3. Determine the bin limits.

Histograms
A common graphical presentation of quantitative data is a histogram. This graphical
summary can be prepared for data previously summarized in either a frequency, a relative
frequency, or a percent frequency distribution. A histogram is constructed by placing the
variable of interest on the horizontal axis and the selected frequency measure (absolute
frequency, relative frequency, or percent frequency) on the vertical axis. The frequency measure
of each class is shown by drawing a rectangle whose base is determined by the class limits on
the horizontal axis and whose height is the corresponding frequency measure.

Histogram for the audit time data

Cumulative Distributions
A variation of the frequency distribution that provides another tabular summary of
quantitative data is the cumulative frequency distribution, which uses the number of classes, class
widths, and class limits developed for the frequency distribution.

CUMULATIVE FREQUENCY, CUMULATIVE RELATIVE FREQUENCY, AND CUMULATIVE
PERCENT FREQUENCY DISTRIBUTIONS FOR THE AUDIT TIME DATA

Measures of Location
1. Mean (Arithmetic Mean)
The most commonly used measure of location is the mean (arithmetic mean), or average
value, for a variable. The mean provides a measure of central location for the data. If the data are
for a sample (typically the case), the mean is denoted by x̄. The sample mean is a point estimate
of the (typically unknown) population mean for the variable of interest. If the data for the entire
population are available, the population mean is computed in the same manner, but denoted by
the Greek letter μ.

TABLE 2.9 Data on Home Sales in a Cincinnati, Ohio, Suburb

Home Sale    Selling Price ($)
1            138,000
2            254,000
3            186,000
4            257,500
5            108,000
6            254,000
7            138,000
8            298,000
9            199,500
10           208,000
11           142,000
12           456,250

The mean can be found in Excel using the AVERAGE function. Figure 2.16 shows
the Home Sales data from Table 2.9 in an Excel spreadsheet. The value for the mean in cell
E2 is calculated using the formula =AVERAGE(B2:B13).
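
As a cross-check outside Excel, the same mean can be sketched in Python with the standard library. The list below holds the 12 selling prices from Table 2.9; the variable names are illustrative.

```python
from statistics import mean

# Selling prices ($) for the 12 home sales in Table 2.9.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Sample mean: the sum of the values divided by the number of observations.
sample_mean = mean(prices)

print(sample_mean)  # 219937.5
```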

2. Median
The median, another measure of central location, is the value in the middle when the data
are arranged in ascending order (smallest to largest value). With an odd number of observations,
the median is the middle value. An even number of observations has no single middle value. In
this case, we follow convention and define the median as the average of the values for the middle
two observations.

Let us apply this definition to compute the median class size for a sample of five
college classes. Arranging the data in ascending order provides the following list:

32 42 46 46 54

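Because n = 5 is odd, the median is the middle value. Thus, the median class size is 46
students. Even though this data set contains two observations with values of 46, each observation
is treated separately when we arrange the data in ascending order. Suppose we also compute
the median value for the 12 home sales in Table 2.9. We first arrange the data in ascending order:

108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250

Because n = 12 is even, the median is the average of the two middle values, 199,500 and
208,000: (199,500 + 208,000)/2 = 203,750.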

Although the mean is the more commonly used measure of central location, in some
situations the median is preferred. The mean is influenced by extremely small and large data
values. Notice that the median is smaller than the mean in Figure 2.16. This is because the one
large value of $456,250 in our data set inflates the mean but does not have the same effect on
the median. Notice also that the median would remain unchanged if we replaced the $456,250
with a sales price of $1.5 million. In this case, the median selling price would remain $203,750,
but the mean would increase to $306,916.67. If you were looking to buy a home in this suburb,
the median gives a better indication of the central selling price of the homes there. We can
generalize, saying that whenever a data set contains extreme values or is severely skewed, the
median is often the preferred measure of central location.
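
The effect of an extreme value on the mean and the median can be sketched in Python. The prices are those of Table 2.9; replacing $456,250 with $1.5 million mirrors the scenario described above.

```python
from statistics import mean, median

# Selling prices ($) for the 12 home sales in Table 2.9.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Same data with the largest price replaced by $1.5 million.
replaced = [1_500_000 if p == 456250 else p for p in prices]

print(median(prices), mean(prices))                 # 203750.0 219937.5
print(median(replaced), round(mean(replaced), 2))   # 203750.0 306916.67
```

The median is unchanged because only the position of the largest value matters, not its size, while the mean moves with every dollar added.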

3. Mode
A third measure of location, the mode, is the value that occurs most frequently in a data
set. To illustrate the identification of the mode, consider the sample of five class sizes.

32 42 46 46 54

The only value that occurs more than once is 46. Because this value, occurring with a
frequency of 2, has the greatest frequency, it is the mode. To find the mode for a data set with
only one most often occurring value in Excel, we use the MODE.SNGL function. Occasionally the
greatest frequency occurs at two or more different values, in which case more than one mode
exists. If data contain at least two modes, we say that they are multimodal.

A special case of multimodal data occurs when the data contain exactly two modes; in such
cases we say that the data are bimodal. In multimodal cases when there are more than two
modes, the mode is almost never reported because listing three or more modes is not particularly
helpful in describing a location for the data. Also, if no value in the data occurs more than once,
we say the data have no mode.

The Excel MODE.SNGL function will return only a single most-often-occurring value. For
multimodal distributions, we must use the MODE.MULT function in Excel to return more than
one mode.

For example, two selling prices occur twice in Table 2.9: $138,000 and $254,000. Hence,
these data are bimodal. To find both of the modes in Excel, we take these steps:
Step 1. Select cells E4 and E5
Step 2. Type the formula =MODE.MULT(B2:B13)
Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2.

Excel enters the values for both modes of this data set in cells E4 and E5: $138,000 and
$254,000.
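
A rough Python counterpart to MODE.MULT is `statistics.multimode`, which returns every value tied for the highest frequency. The sketch below applies it to the class sizes and to the Table 2.9 prices.

```python
from statistics import multimode

class_sizes = [32, 42, 46, 46, 54]
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# multimode lists all most-frequent values, in the order first seen in the data.
print(multimode(class_sizes))  # [46]
print(multimode(prices))       # [138000, 254000]
```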

4. Geometric Mean
The geometric mean is a measure of location that is calculated by finding the nth root of
the product of n values. The general formula for the sample geometric mean, denoted x̄g, follows:

$$\bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = \left[(x_1)(x_2)\cdots(x_n)\right]^{1/n}$$

Measures of Variability

1. Range
The simplest measure of variability is the range. The range can be found by subtracting
the smallest value from the largest value in a data set. Let us return to the home sales data set
to demonstrate the calculation of range. Refer to the data from home sales prices in Table 2.9.
The largest home sales price is $456,250, and the smallest is $108,000. The range is $456,250
- $108,000 = $348,250.

Although the range is the easiest of the measures of variability to compute, it is seldom
used as the only measure. The reason is that the range is based on only two of the
observations and thus is highly influenced by extreme values. If, for example, we replace the
selling price of $456,250 with $1.5 million, the range would be $1,500,000 - $108,000 =
$1,392,000. This large value for the range would not be especially descriptive of the variability
in the data because 11 of the 12 home selling prices are between $108,000 and $298,000.

The range can be calculated in Excel using the MAX and MIN functions. The range value
in cell E7 of Figure 2.19 is calculated using the formula =MAX(B2:B13)-MIN(B2:B13).
This subtracts the smallest value in the range B2:B13 from the largest value in the range B2:B13.
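
The same calculation can be sketched in Python with the built-in `max` and `min` functions, again using the Table 2.9 prices.

```python
# Selling prices ($) for the 12 home sales in Table 2.9.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Range = largest value - smallest value.
price_range = max(prices) - min(prices)
print(price_range)  # 348250

# Replacing the largest price with $1.5 million shows how sensitive the range is.
replaced = [1_500_000 if p == 456250 else p for p in prices]
print(max(replaced) - min(replaced))  # 1392000
```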

2. Variance
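
As a hedged illustration, the sketch below assumes the usual definition of the sample variance: the sum of squared deviations from the sample mean, divided by n - 1. It uses the five class sizes from the median example and reproduces the value of 64 quoted in the standard deviation discussion.

```python
from statistics import mean, variance

class_sizes = [32, 42, 46, 46, 54]

x_bar = mean(class_sizes)  # sample mean, 44

# Squared deviation of each observation from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in class_sizes]

# Sample variance: divide by n - 1 rather than n (assumed definition).
s_squared = sum(squared_deviations) / (len(class_sizes) - 1)

print(squared_deviations)     # [144, 4, 4, 4, 100]
print(s_squared)              # 64.0
print(variance(class_sizes))  # library result for comparison
```

Note that the result is in squared units, here (students)², which is why the standard deviation is usually easier to interpret.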

3. Standard Deviation
The standard deviation is defined to be the positive square root of the variance. We use
s to denote the sample standard deviation and σ to denote the population standard deviation.

The sample standard deviation, s, is a point estimate of the population standard
deviation, σ, and is derived from the sample variance in the following way:

$$s = \sqrt{s^2}$$

The sample variance for the sample of class sizes in five college classes is s² = 64. Thus, the
sample standard deviation is s = √64 = 8.

Recall that the units associated with the variance are squared and that it is difficult to interpret
the meaning of squared units. Because the standard deviation is the square root of the variance,
the units of the variance, (students)² in our example, are converted to students in the standard
deviation. In other words, the standard deviation is measured in the same units as the original
data. For this reason, the standard deviation is more easily compared to the mean and other
statistics that are measured in the same units as the original data.

Figure 2.19 shows the Excel calculation for the sample standard deviation of the home sales
data, which can be calculated using Excel’s STDEV.S function. The sample standard deviation
in cell E9 is calculated using the formula =STDEV.S(B2:B13). Excel calculates the sample
standard deviation for the home sales to be $95,065.77.
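
Both standard deviations can be checked with a short Python sketch. `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is the same convention as STDEV.S.

```python
from statistics import stdev

class_sizes = [32, 42, 46, 46, 54]
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Sample standard deviation = square root of the sample variance.
s_class = stdev(class_sizes)
s_prices = stdev(prices)

print(round(s_class, 2))   # 8.0
print(round(s_prices, 2))  # 95065.77
```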

Analyzing Distributions
In Section 2.4 we demonstrated how to create frequency, relative, and cumulative
distributions for data sets. Distributions are very useful for interpreting and analyzing data. A
distribution describes the overall variability of the observed values of a variable. In this section we
introduce additional ways of analyzing distributions.

1. Percentiles
A percentile is the value of a variable at which a specified (approximate) percentage of
observations are below that value. The pth percentile tells us about the point in the data where
approximately p percent of the observations have values less than the pth percentile; hence,
approximately (100 – p) percent of the observations have values greater than the pth percentile.
Colleges and universities frequently report admission test scores in terms of percentiles. For
instance, suppose an applicant obtains a raw score of 54 on the verbal portion of an admission
test. How this student performed in relation to other students taking the same test may not be
readily apparent. However, if the raw score of 54 corresponds to the 70th percentile, we know
that approximately 70 percent of the students scored lower than this individual, and approximately
30 percent of the students scored higher. The following procedure can be used to compute the
pth percentile:

1. Arrange the data in ascending order (smallest to largest value).

2. Compute k = (n + 1) × p, where p is expressed as a decimal (for example, 0.85 for the
85th percentile).

3. Divide k into its integer component, i, and its decimal component, d. (For example, k = 13.25
would result in i = 13 and d = 0.25.)

a. If d = 0 (there is no decimal component for k), find the value in position k of the sorted data
set. This is the pth percentile.

b. If d > 0, the percentile is between the values in positions i and i + 1 in the sorted data. To find
this percentile, we must interpolate between these two values.

i. Calculate the difference between the values in positions i and i + 1 in the sorted data
set. We define this difference between the two values as m.

ii. Multiply this difference by d: t = m × d.

iii. To find the pth percentile, add t to the value in position i of the sorted data.

As an illustration, let us determine the 85th percentile for the home sales data in Table 2.9.

1. Arrange the data in ascending order.

108,000 138,000 138,000 142,000 186,000 199,500

208,000 254,000 254,000 257,500 298,000 456,250

2. Compute k = (n + 1) × p = (12 + 1) × 0.85 = 11.05.

3. Dividing 11.05 into the integer and decimal components gives us i = 11 and d = 0.05. Because
d > 0, we must interpolate between the values in the 11th and 12th positions in our sorted data.
The value in the 11th position is 298,000, and the value in the 12th position is 456,250.

i. m = 456,250 - 298,000 = 158,250.

ii. t = m × d = 158,250 × 0.05 = 7,912.5.

iii. 85th percentile = 298,000 + 7,912.5 = 305,912.5.

Therefore, $305,912.50 represents the 85th percentile of the home sales data.

The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC. Figure
2.18 shows the Excel calculation for the 85th percentile of the home sales data. The value in cell
E13 is calculated using the formula =PERCENTILE.EXC(B2:B13,0.85); B2:B13 defines the data
set for which we are calculating a percentile, and 0.85 defines the percentile of interest.
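
The manual procedure above can be sketched as a small Python function. This is an illustration of the k = (n + 1) × p steps, not Excel's implementation; the function name `percentile` and the choice to pass p as a whole-number percent (so that k stays exact) are assumptions of this sketch.

```python
from fractions import Fraction

def percentile(values, p_percent):
    """pth percentile via k = (n + 1) * p; p_percent is a whole number such as 85."""
    data = sorted(values)                    # step 1: ascending order
    n = len(data)
    k = Fraction((n + 1) * p_percent, 100)   # step 2, kept exact to avoid rounding noise
    i = k.numerator // k.denominator         # integer component of k
    d = k - i                                # decimal component of k
    if i < 1 or i > n or (i == n and d > 0):
        raise ValueError("percentile falls outside the sorted data positions")
    if d == 0:
        return float(data[i - 1])            # step 3a: value in position k
    m = data[i] - data[i - 1]                # step 3b-i: gap between positions i and i + 1
    t = m * d                                # step 3b-ii
    return float(data[i - 1] + t)            # step 3b-iii

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

print(percentile(prices, 85))  # 305912.5
```

Positions in the text are 1-based, while Python lists are 0-based, which is why position i corresponds to `data[i - 1]`.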

2. Quartiles
It is often desirable to divide data into four parts, with each part containing approximately
one-fourth, or 25 percent, of the observations. These division points are referred to as the quartiles
and are defined as:

Q1 = first quartile, or 25th percentile

Q2 = second quartile, or 50th percentile (also the median)

Q3 = third quartile, or 75th percentile.

To demonstrate quartiles, the home sales data are again arranged in ascending order.
108,000 138,000 138,000 142,000 186,000 199,500

208,000 254,000 254,000 257,500 298,000 456,250

We already identified Q2, the second quartile (median), as 203,750. To find Q1 and Q3, we must
find the 25th and 75th percentiles.

For Q1,
1. The data are arranged in ascending order, as previously done.

2. Compute k = (n + 1) × p = (12 + 1) × 0.25 = 3.25.

3. Dividing 3.25 into the integer and decimal components gives us i = 3 and d = 0.25. Because d
> 0, we must interpolate between the values in the 3rd and 4th positions in our sorted data. The
value in the 3rd position is 138,000, and the value in the 4th position is 142,000.

i. m = 142,000 - 138,000 = 4,000.

ii. t = m × d = 4,000 × 0.25 = 1,000.

iii. 25th percentile = 138,000 + 1,000 = 139,000.

Therefore, the 25th percentile is 139,000. Similar calculations for the 75th percentile give
256,625. The quartiles divide the home sales data into four parts, with each part
containing 25 percent of the observations.

108,000 138,000 138,000 | 142,000 186,000 199,500 | 208,000 254,000 254,000 | 257,500 298,000 456,250

Q1 = 139,000     Q2 = 203,750     Q3 = 256,625

The difference between the third and first quartiles is often referred to as the interquartile
range, or IQR. For the home sales data, IQR = Q3 - Q1 = 256,625 - 139,000 = 117,625. Because
it excludes the smallest and largest 25 percent of values in the data, the IQR is a useful measure
of variation for data that have extreme values or are badly skewed.

A quartile can be computed in Excel using the function QUARTILE.EXC. Figure 2.18
shows the calculations for first, second, and third quartiles for the home sales data. The formula
used in cell E15 is =QUARTILE.EXC(B2:B13,1). The range B2:B13 defines the data set, and 1
indicates that we want to compute the 1st quartile. Cells E16 and E17 use similar formulas to
compute the second and third quartiles.
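
The quartiles and the IQR can also be sketched in Python. `statistics.quantiles` with `method="exclusive"` interpolates at the same (n + 1) × p positions used in the manual procedure, so for this data set it reproduces the values above.

```python
from statistics import quantiles

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(prices, n=4, method="exclusive")

# Interquartile range: spread of the middle 50 percent of the data.
iqr = q3 - q1

print(q1, q2, q3)  # 139000.0 203750.0 256625.0
print(iqr)         # 117625.0
```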

3. z-scores
A z-score allows us to measure the relative location of a value in the data set. More
specifically, a z-score helps us determine how far a particular value is from the mean relative to
the data set’s standard deviation. Suppose we have a sample of n observations, with the values
denoted by x1, x2, . . . , xn. In addition, assume that the sample mean, x̄, and the sample standard
deviation, s, are already computed. Associated with each value, xi, is another value called its
z-score. Equation (2.7) shows how the z-score is computed for each xi:

$$z_i = \frac{x_i - \bar{x}}{s} \qquad (2.7)$$

The z-score is often called the standardized value. The z-score, zi, can be interpreted as
the number of standard deviations xi is from the mean. For example, z1 = 1.2 indicates that x1 is
1.2 standard deviations greater than the sample mean. Similarly, z2 = -0.5 indicates that x2 is 0.5,
or 1/2, standard deviation less than the sample mean. A z-score greater than zero occurs for
observations with a value greater than the mean, and a z-score less than zero occurs for
observations with a value less than the mean. A z-score of zero indicates that the value of the
observation is equal to the mean.

The z-scores for the class size data are computed in Table 2.13. Recall the previously
computed sample mean, x̄ = 44, and sample standard deviation, s = 8. The z-score of
-1.50 for the fifth observation shows that it is farthest from the mean; it is 1.50
standard deviations below the mean.
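
The z-scores for the class sizes can be sketched in Python. The class sizes are listed here in ascending order, as in the median example, so the class of 32 students appears first rather than fifth; the order of Table 2.13 is not reproduced.

```python
from statistics import mean, stdev

class_sizes = [32, 42, 46, 46, 54]

x_bar = mean(class_sizes)  # sample mean, 44
s = stdev(class_sizes)     # sample standard deviation, 8

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - x_bar) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(f"{x:>3} {z:>6.2f}")
```

The class of 32 students has the z-score of -1.50, and the z-scores always sum to zero because deviations above and below the mean cancel.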

The z-score can be calculated in Excel using the function STANDARDIZE. Figure 2.19
demonstrates the use of the STANDARDIZE function to compute z-scores for the home sales
data. To calculate the z-scores, we must provide the mean and standard deviation for the data
set in the arguments of the STANDARDIZE function. For instance, the z-score in cell C2 is
calculated with the formula =STANDARDIZE(B2, $B$15, $B$16), where cell B15 contains the
mean of the home sales data and cell B16 contains the standard deviation of the home sales
data. We can then copy and paste this formula into cells C3:C13.

4. Empirical Rule

When the distribution of data exhibits a symmetric bell-shaped distribution, as shown in Figure
2.20, the empirical rule can be used to determine the percentage of data values that are within a
specified number of standard deviations of the mean. Many, but not all, distributions of data found
in practice exhibit a symmetric bell-shaped distribution.

The height of adult males in the United States has a bell-shaped distribution similar to that
shown in Figure 2.20 with a mean of approximately 69.5 inches and standard deviation of
approximately 3 inches. Using the empirical rule, we can draw the following conclusions.

• Approximately 68 percent of adult males in the United States have heights between 69.5
- 3 = 66.5 and 69.5 + 3 = 72.5 inches.

• Approximately 95 percent of adult males in the United States have heights between 63.5
and 75.5 inches.

• Almost all adult males in the United States have heights between 60.5 and 78.5 inches.
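
The three intervals follow directly from mean ± k standard deviations for k = 1, 2, 3. A brief Python sketch, using the approximate mean and standard deviation quoted above:

```python
mean_height = 69.5  # inches, approximate mean from the text
sd_height = 3.0     # inches, approximate standard deviation from the text

# Empirical rule coverage for 1, 2, and 3 standard deviations.
coverage = {1: "about 68%", 2: "about 95%", 3: "almost all"}

intervals = {k: (mean_height - k * sd_height, mean_height + k * sd_height)
             for k in coverage}

for k, (low, high) in intervals.items():
    print(f"{coverage[k]:<10} between {low} and {high} inches")
```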

5. Identifying Outliers
Sometimes a data set will have one or more observations with unusually large or unusually
small values. These extreme values are called outliers. Experienced statisticians take steps to
identify outliers and then review each one carefully. An outlier may be a data value that has been
incorrectly recorded; if so, it can be corrected before further analysis. An outlier may also be from
an observation that doesn’t belong to the population we are studying and was incorrectly included
in the data set; if so, it can be removed. Finally, an outlier may be an
unusual data value that has been recorded correctly and is a member of the population we are
studying. In such cases, the observation should remain.

Standardized values (z-scores) can be used to identify outliers. Recall that the empirical
rule allows us to conclude that for data with a bell-shaped distribution, almost all the data values
will be within three standard deviations of the mean. Hence, in using z-scores to identify outliers,
we recommend treating any data value with a z-score less than -3 or greater than +3 as an outlier.
Such data values can then be reviewed to determine their accuracy and whether they belong in
the data set.
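
As a sketch of this screening rule, the Python snippet below computes z-scores for the Table 2.9 home sales and flags any value beyond ±3. For this small, skewed sample the largest price has a z-score of roughly 2.49, so the rule flags nothing; keep in mind that the rule rests on the bell-shaped assumption of the empirical rule.

```python
from statistics import mean, stdev

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

x_bar = mean(prices)
s = stdev(prices)

# z-score for every observation.
z_scores = [(p - x_bar) / s for p in prices]

# Flag values more than three standard deviations from the mean.
flagged = [p for p, z in zip(prices, z_scores) if z < -3 or z > 3]

print(round(max(z_scores), 2))  # z-score of the largest price
print(flagged)                  # values flagged as outliers by the z-score rule
```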

6. Box Plots
A box plot is a graphical summary of the distribution of data. A box plot is developed from the
quartiles for a data set. Figure 2.21 is a box plot for the home sales data. Here are the steps used
to construct the box plot:

1. A box is drawn with the ends of the box located at the first and third quartiles. For the
home sales data, Q1 = 139,000 and Q3 = 256,625. This box contains the middle 50
percent of the data.

2. A vertical line is drawn in the box at the location of the median (203,750 for the home sales
data).

3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the box
plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the home sales data, IQR = Q3
- Q1 = 256,625 - 139,000 = 117,625. Thus, the limits are 139,000 - 1.5(117,625) =
-37,437.5 and 256,625 + 1.5(117,625) = 433,062.5. Data outside these limits are
considered outliers.

4. The dashed lines in Figure 2.21 are called whiskers. The whiskers are drawn from the
ends of the box to the smallest and largest values inside the limits computed in step 3.
Thus, the whiskers end at home sales values of 108,000 and 298,000.

5. Finally, the location of each outlier is shown with an asterisk (*). In Figure 2.21, we see
one outlier, 456,250.
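
The numbers behind the box plot (box ends, limits, whisker ends, and outliers) can be sketched in Python. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which reproduces the Q1 and Q3 values used above for this data set; drawing the plot itself is left out.

```python
from statistics import quantiles

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Steps 1 and 2: box ends at Q1 and Q3, line at the median (Q2).
q1, q2, q3 = quantiles(prices, n=4, method="exclusive")

# Step 3: limits located 1.5 * IQR beyond the box.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Step 4: whiskers end at the most extreme values still inside the limits.
inside = [p for p in prices if lower_limit <= p <= upper_limit]
whiskers = (min(inside), max(inside))

# Step 5: anything outside the limits is plotted as an outlier.
outliers = [p for p in prices if p < lower_limit or p > upper_limit]

print(lower_limit, upper_limit)  # -37437.5 433062.5
print(whiskers)                  # (108000, 298000)
print(outliers)                  # [456250]
```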

Box plots are also very useful for comparing different data sets. For instance, if we want to
compare home sales from several different communities, we could create box plots for recent
home sales in each community. An example of such box plots is shown in Figure 2.22.

Measures of Association Between Two Variables

1. Scatter Charts
A scatter chart is a useful graph for analyzing the relationship between two variables.
Figure 2.23 shows a scatter chart for sales of bottled water versus the high temperature
experienced over 14 days. The scatter chart also suggests that a straight line could be used as
an approximation for the relationship between high temperature and sales of bottled water.

Table 2.14 DATA FOR BOTTLED WATER SALES AT QUEENSLAND AMUSEMENT PARK
FOR A SAMPLE OF 14 SUMMER DAYS

FIGURE 2.23 CHART SHOWING THE POSITIVE LINEAR RELATION BETWEEN SALES AND
HIGH TEMPERATURES

2. Covariance
Covariance is a descriptive measure of the linear association between two variables. For
a sample of size n with the observations (x1, y1), (x2, y2), and so on, the sample covariance is
defined as follows:
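
In standard notation, with x̄ and ȳ denoting the sample means of x and y (symbols assumed here), the sample covariance is usually written as shown below. This appears to be the equation (2.8) that the text refers to.

```latex
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
```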

To measure the strength of the linear relationship between the high temperature x and the
sales of bottled water y at Queensland, we use equation (2.8) to compute the sample covariance.
The calculations in Table 2.15 show the computation.

Table 2.15 SAMPLE COVARIANCE CALCULATIONS FOR DAILY HIGH TEMPERATURE AND
BOTTLED WATER SALES AT QUEENSLAND AMUSEMENT PARK

The covariance calculated in Table 2.15 is sxy = 12.8. Because the covariance is greater
than 0, it indicates a positive relationship between the high temperature and sales of bottled water.
This verifies the relationship we saw in the scatter chart in Figure 2.23: as the high temperature
for a day increases, sales of bottled water generally increase. If the covariance is near 0, then the
x and y variables are not linearly related. If the covariance is less than 0, then the x and y variables
are negatively related, which means that as x increases, y generally decreases.
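
The sign interpretation above can be checked with a short Python sketch. The data here are hypothetical values chosen only for illustration (they are not the Queensland data from Table 2.14), and the function name sample_covariance is an assumption made for this example.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations from the
    means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical illustrative data, not taken from the module.
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 5, 4, 5]    # y tends to increase with x
y_falling = [5, 4, 5, 4, 2]   # y tends to decrease with x

cov_rising = sample_covariance(x, y_rising)
cov_falling = sample_covariance(x, y_falling)

print("Covariance when y rises with x:", cov_rising)
print("Covariance when y falls with x:", cov_falling)
```

The first covariance is positive and the second is negative, matching the interpretation given for values above and below 0.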

The sample covariance can also be calculated in Excel using the COVARIANCE.S
function. Figure 2.24 shows the data from Table 2.14 entered into an Excel Worksheet. Figure
2.25 demonstrates several possible scatter charts and their associated covariance values.

FIGURE 2.24 CALCULATING COVARIANCE AND CORRELATION COEFFICIENT FOR
BOTTLED WATER SALES USING EXCEL

FIGURE 2.25 SCATTER DIAGRAMS AND ASSOCIATED COVARIANCE VALUES FOR
DIFFERENT VARIABLE RELATIONSHIPS

3. Correlation Coefficient
The correlation coefficient measures the strength of the linear relationship between two
variables and, unlike covariance, is not affected by the units of measurement for x and y.
For sample data, the correlation coefficient is defined as follows.

The sample correlation coefficient is computed by dividing the sample covariance by the
product of the sample standard deviation of x and the sample standard deviation of y. This scales
the correlation coefficient so that it will always take values between -1 and +1.
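
Written out, the definition just described takes the form below. The symbols s_x and s_y for the sample standard deviations of x and y are notation assumed here; this is presumably the equation (2.9) used in the computation that follows.

```latex
r_{xy} = \frac{s_{xy}}{s_x \, s_y}
```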

Let us now compute the sample correlation coefficient for bottled water sales at
Queensland Amusement Park. Recall that we calculated sxy = 12.8 using equation (2.8). Using
data in Table 2.14, we can compute sample standard deviations for x and y.

The sample correlation coefficient is computed from equation (2.9) as follows:

The correlation coefficient can take only values between -1 and +1. Correlation coefficient
values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients
greater than 0 indicate a positive linear relationship between the x and y variables. The closer the
correlation coefficient is to +1, the closer the x and y values are to forming a straight line that
trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative
linear relationship between the x and y variables. The closer the correlation coefficient is to -1,
the closer the x and y values are to forming a straight line with negative slope.
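
A short Python sketch can illustrate both the computation and the fact that the correlation coefficient does not depend on units. The data are hypothetical values chosen for illustration (not the Queensland data), and the function names are assumptions made for this example.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_correlation(xs, ys):
    """Covariance divided by the product of the standard deviations.
    Assumes neither sample is constant (nonzero standard deviations)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical illustrative data, not taken from the module.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Same x values expressed in a different unit (for example, cm instead of m).
x_rescaled = [100 * v for v in x]

cov_original = sample_covariance(x, y)
cov_rescaled = sample_covariance(x_rescaled, y)
r_original = sample_correlation(x, y)
r_rescaled = sample_correlation(x_rescaled, y)

print("Covariance:", cov_original, "->", cov_rescaled)
print("Correlation:", round(r_original, 4), "->", round(r_rescaled, 4))
```

Changing the unit of x multiplies the covariance by 100 but leaves the correlation coefficient unchanged, which is why the correlation coefficient is easier to compare across data sets.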

FIGURE 2.26 EXAMPLE OF NONLINEAR RELATIONSHIP PRODUCING A CORRELATION
COEFFICIENT NEAR ZERO

The scatter diagram in Figure 2.26 shows the relationship between the amount spent by
a small retail store for environmental control (heating and cooling) and the daily high outside
temperature over 100 days. The sample correlation coefficient for these data is rxy = -0.007
and indicates that there is no linear relationship between the two variables. However, Figure
2.26 provides strong visual evidence of a nonlinear relationship.
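
The same effect can be reproduced with a small Python sketch. The data below are hypothetical: y is an exact U-shaped function of x, loosely mimicking costs that rise at both low and high temperatures. They are not the 100-day retail store data, and the function name sample_correlation is an assumption made for this example.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient: covariance divided by the
    product of the sample standard deviations (n - 1 divisors)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)


# Hypothetical data: y depends on x perfectly, but through a curve.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r_curve = sample_correlation(x, y)
print("Correlation for the U-shaped data:", r_curve)
```

Even though y is completely determined by x, the correlation coefficient is essentially zero, because the coefficient measures only linear association. A scatter chart should therefore be inspected before concluding that two variables are unrelated.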

Read:
Chp. 2 - Camm, J., Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams. (2015). Essentials of
Business Analytics. Stamford, USA: Cengage Learning.
