Unit 2
Overview:
Descriptive statistics help us describe and understand the features of a specific data set
by giving short summaries of the sample and measures of the data. They fall into two basic
categories: measures of central tendency and measures of variability (or spread). The
most widely recognized descriptive statistics are the measures of central tendency: the mean,
median, and mode, which are used at almost all levels of math and statistics.
Data are all around us, and nearly everything we do generates new data. Every electronic
message we send or receive, every transaction such as withdrawing money from a bank, and every
website we visit adds to the store of data.
Data are the facts and figures collected, analyzed, and summarized for presentation and
interpretation. A characteristic or a quantity of interest that can take on different values is known
as a variable. An observation is a set of values corresponding to a set of variables.
Practically every problem (and opportunity) that an organization (or individual) faces is
concerned with the impact of the possible values of relevant variables on the business outcome.
Thus, we are concerned with how the value of a variable can vary; variation is the difference in
a variable measured over observations (time, customers, items, etc.).
Learning Outcomes:
Course Materials:
Type of Data
1. Population and Sample Data
Data can be categorized in several ways based on how they are collected, and the
type collected. In many cases, it is not feasible to collect data from the population of all
elements of interest. In such instances, we collect data from a subset of the population
known as a sample. For example, suppose your population is all CBA students. If you
collect data only from a subset of that population, such as the Human Resource students,
that subset is your sample. Sampling is a more practical approach to data collection because you will not
incur tremendous costs in time, effort, and money. And in most cases, a representative
sample can be gathered by random sampling of the population data. Dealing with
populations and samples can introduce subtle differences in how we calculate and
interpret summary statistics. In almost all practical applications of business analytics, we
will be dealing with sample data.
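The idea of drawing a random sample from a population can be sketched in a few lines of Python. This is only an illustration: the population of 500 student ID numbers and the sample size of 50 are assumptions made up for the example, not figures from this unit.

```python
import random

# Hypothetical population: ID numbers for 500 CBA students (invented for illustration).
population = list(range(1, 501))

random.seed(42)                          # fixed seed so the draw can be repeated
sample = random.sample(population, 50)   # simple random sample, without replacement

print(f"Population size: {len(population)}, sample size: {len(sample)}")
```

Because `random.sample` draws without replacement, no student can appear in the sample twice.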
Data are considered quantitative data if numeric and arithmetic operations, such
as addition, subtraction, multiplication, and division, can be performed on them. For
instance, we can sum the values for Volume in the DJI data in Table 2.1 to calculate a
total volume of all shares traded by companies included in the DJI. If arithmetic operations
cannot be performed on the data, they are considered categorical data. We can
summarize categorical data by counting the number of observations or computing the
proportions of observations in each category. For instance, the data in the industry column
in Table 2.1 are categorical. We can count the number of companies in the DJI that are in
the telecommunications industry. Table 2.1 shows two companies in the
telecommunications industry: AT&T and Verizon Communications.
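The difference between the two data types can be sketched in Python. The rows below are hypothetical: only the two telecommunications company names come from the text above, while the other company names and all volume figures are invented placeholders, not values from Table 2.1.

```python
from collections import Counter

# Hypothetical rows (company, industry, volume); volumes are invented,
# not taken from Table 2.1.
rows = [
    ("AT&T", "Telecommunications", 1_000),
    ("Verizon Communications", "Telecommunications", 2_000),
    ("Company C", "Financial", 3_000),
    ("Company D", "Financial", 4_000),
    ("Company E", "Retail", 5_000),
]

# Quantitative data: arithmetic such as summing is meaningful.
total_volume = sum(volume for _, _, volume in rows)

# Categorical data: summarize by counts and proportions per category.
industry_counts = Counter(industry for _, industry, _ in rows)
industry_shares = {name: count / len(rows) for name, count in industry_counts.items()}

print(total_volume)
print(industry_counts)
print(industry_shares)
```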
• Time series data are collected over several time periods. Graphs of time series
data are frequently found in business and economic publications. Such graphs
help analysts understand what happened in the past, identify trends over time, and
project future levels for the time series. For example, the graph of the time series
in Figure 2.1 shows the DJI value from February 2002 to April 2013.
4. Sources of Data
a) Experimental study: a variable of interest is first identified. Then one or more other
variables are identified and controlled so that data can be obtained about how they
influence the variable of interest.
The result of using Excel’s Sort function for the March 2010 data is shown in Figure 2.5.
Note that while we sorted on Sales (March 2010), which is in column E, the data in all other
columns are adjusted accordingly.
Step 5. Select only the check box for Toyota. You can easily deselect all choices by
unchecking (Select All).
The result is a display of only the data for models made by Toyota (see Figure 2.6). We
now see that of the 20 top-selling models in March 2011, Toyota made three of them. We can
further filter the data by choosing the down arrows in the other columns. We can make all data
visible again by clicking on the down arrow in column B and checking (Select All) or by clicking
Filter in the Sort & Filter Group again from the DATA tab.
Step 1. Starting with the original data shown in Figure 2.3, select cells F1:F21
Step 2. Click on the HOME tab in the Ribbon
Step 3. Click Conditional Formatting in the Styles group
Step 4. Select Highlight Cells Rules, and click Less Than from the dropdown menu
Step 5. Enter 0% in the Format cells that are LESS THAN: box
Step 6. Click OK
The results are shown in Figure 2.7. Here we see that the models with decreasing sales
(Toyota Camry, Ford Focus, Chevrolet Malibu, and Nissan Versa) are now clearly visible.
Note that Excel’s Conditional Formatting function offers tremendous flexibility. Instead of
highlighting only models with decreasing sales, we could instead choose Data Bars from the
Conditional Formatting dropdown menu in the Styles Group of the HOME tab in the Ribbon. The
result of using the Blue Data Bar Gradient Fill option is shown in Figure 2.8.
In Table 2.1, taken from a sample of 50 soft drink purchases, each purchase is for one of
five popular soft drinks, which define the five bins: Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and
Sprite. To develop a frequency distribution for these data, we count the number of times
each soft drink appears.
This frequency distribution provides a summary of how the 50 soft drink purchases are
distributed across the five soft drinks.
Table 2.5 shows a relative frequency distribution and a percent frequency distribution for
the soft drink data.
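A frequency, relative frequency, and percent frequency distribution can be sketched as follows. The list of 10 purchases is a hypothetical stand-in, since the 50 purchases from the source table are not reproduced in this unit.

```python
from collections import Counter

# Hypothetical sample of 10 purchases (not the 50 purchases from the source table).
purchases = ["Coca-Cola", "Pepsi", "Coca-Cola", "Sprite", "Diet Coke",
             "Coca-Cola", "Pepsi", "Dr. Pepper", "Coca-Cola", "Pepsi"]

frequency = Counter(purchases)          # count of purchases per soft drink (bin)
n = len(purchases)
relative_frequency = {drink: count / n for drink, count in frequency.items()}

for drink, count in frequency.most_common():
    rel = relative_frequency[drink]
    print(f"{drink:<10} {count:>3} {rel:>6.2f} {rel * 100:>5.0f}%")
```

The relative frequencies always sum to 1, and the percent frequencies to 100.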
Histograms
A common graphical presentation of quantitative data is a histogram. This graphical
summary can be prepared for data previously summarized in either a frequency, a relative
frequency, or a percent frequency distribution. A histogram is constructed by placing the
Cumulative Distributions
A variation of the frequency distribution that provides another tabular summary of
quantitative data is the cumulative frequency distribution, which uses the number of classes, class
widths, and class limits developed for the frequency distribution.
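As a rough sketch, a cumulative frequency distribution is a running total of the class frequencies. The classes and counts below are hypothetical values chosen only to show the calculation.

```python
from itertools import accumulate

# Hypothetical classes and frequencies, invented for illustration.
class_limits = ["10-14", "15-19", "20-24", "25-29", "30-34"]
frequencies = [4, 8, 5, 2, 1]

# Each cumulative frequency counts the observations in that class or any lower class.
cumulative = list(accumulate(frequencies))

for limits, freq, cum in zip(class_limits, frequencies, cumulative):
    upper = limits.split("-")[1]
    print(f"{limits:<6} frequency={freq:<3} less than or equal to {upper}: {cum}")
```

The last cumulative frequency always equals the total number of observations.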
Measures of Location
1. Mean (Arithmetic Mean)
The most commonly used measure of location is the mean (arithmetic mean), or average
value, for a variable. The mean provides a measure of central location for the data. If the data are
for a sample (typically the case), the mean is denoted by x̄. The sample mean is a point estimate
of the (typically unknown) population mean for the variable of interest. If the data for the entire
population are available, the population mean is computed in the same manner, but denoted by
the Greek letter μ.
Home Sale    Selling Price ($)
1            138,000
2            254,000
3            186,000
4            257,500
5            108,000
6            254,000
7            138,000
8            298,000
9            199,500
10           208,000
11           142,000
12           456,250
The mean can be found in Excel using the AVERAGE function. Figure 2.16 shows
the Home Sales data from Table 2.9 in an Excel spreadsheet. The value for the mean in cell
E2 is calculated using the formula =AVERAGE(B2:B13).
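The same calculation can be sketched in Python as a cross-check. The list below assumes the 12 selling prices of Table 2.9 are the ones implied by the sorted data and the mode discussion elsewhere in this unit; the variable names are illustrative.

```python
import statistics

# Assumed Table 2.9 selling prices ($) for the 12 home sales.
home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# Sample mean: the sum of the observations divided by the number of observations.
mean_price = sum(home_sales) / len(home_sales)

print(f"Mean selling price: ${mean_price:,.2f}")
print(statistics.mean(home_sales) == mean_price)  # the library function agrees
```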
Let us apply this definition to compute the median class size for a sample of five
college classes. Arranging the data in ascending order provides the following list:
32 42 46 46 54
Because n = 5 is odd, the median is the middle value. Thus, the median class size is 46
students. Even though this data set contains two observations with values of 46, each observation
is treated separately when we arrange the data in ascending order. Suppose we also compute
the median value for the 12 home sales in Table 2.9. We first arrange the data in ascending order.
Although the mean is the more commonly used measure of central location, in some
situations the median is preferred. The mean is influenced by extremely small and large data
values. Notice that the median is smaller than the mean in Figure 2.16. This is because the one
large value of $456,250 in our data set inflates the mean but does not have the same effect on
the median. Notice also that the median would remain unchanged if we replaced the $456,250
with a sales price of $1.5 million. In this case, the median selling price would remain $203,750,
but the mean would increase to $306,916.67. If you were looking to buy a home in this suburb,
the median gives a better indication of the central selling price of the homes there. We can
generalize, saying that whenever a data set contains extreme values or is severely skewed, the
median is often the preferred measure of central location.
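The sensitivity of the mean, and the stability of the median, can be checked with a short Python sketch. It assumes the same 12 home sale prices as Table 2.9; the replacement of $456,250 by $1.5 million follows the example in the text.

```python
import statistics

class_sizes = [32, 42, 46, 46, 54]
home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# Odd n: the middle value. Even n: the average of the two middle values.
median_class_size = statistics.median(class_sizes)
median_price = statistics.median(home_sales)

# Replace the largest price with $1.5 million and recompute both measures.
altered = [1_500_000 if price == 456_250 else price for price in home_sales]
altered_median = statistics.median(altered)
altered_mean = statistics.mean(altered)

print(median_class_size, median_price)
print(f"After the change: median ${altered_median:,.2f}, mean ${altered_mean:,.2f}")
```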
3. Mode
A third measure of location, the mode, is the value that occurs most frequently in a data
set. To illustrate the identification of the mode, consider the sample of five class sizes.
The only value that occurs more than once is 46. Because this value, occurring with a
frequency of 2, has the greatest frequency, it is the mode. To find the mode for a data set with
only one most often occurring value in Excel, we use the MODE.SNGL function. Occasionally the
greatest frequency occurs at two or more different values, in which case more than one mode
exists. If data contain at least two modes, we say that they are multimodal.
A special case of multimodal data occurs when the data contain exactly two modes; in such
cases we say that the data are bimodal. In multimodal cases when there are more than two
modes, the mode is almost never reported because listing three or more modes is not particularly
helpful in describing a location for the data. Also, if no value in the data occurs more than once,
we say the data have no mode.
The Excel MODE.SNGL function will return only a single most-often-occurring value. For
multimodal distributions, we must use the MODE.MULT function in Excel to return more than
one mode.
For example, two selling prices occur twice in Table 2.9: $138,000 and $254,000. Hence,
these data are bimodal. To find both of the modes in Excel, we take these steps:
Step 1. Select cells E4 and E5
Step 2. Type the formula =MODE.MULT(B2:B13)
Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2.
Excel enters the values for both modes of this data set in cells E4 and E5: $138,000 and
$254,000.
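For comparison, a minimal Python sketch of the same idea uses `statistics.mode` for a single mode and `statistics.multimode` (available in Python 3.8 and later) when several values tie for the greatest frequency. The home sale prices are assumed to be those of Table 2.9.

```python
import statistics

class_sizes = [32, 42, 46, 46, 54]
home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

single_mode = statistics.mode(class_sizes)    # one most frequent value
all_modes = statistics.multimode(home_sales)  # every value tied for the top frequency

print(single_mode)
print(sorted(all_modes))
```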
4. Geometric Mean
The geometric mean is a measure of location that is calculated by finding the nth root of
the product of n values. The general formula for the sample geometric mean, denoted x̄g, follows.
\[ \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n} \]
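A small Python sketch of the nth-root-of-the-product calculation follows. The three values are hypothetical yearly growth factors chosen for illustration; they do not come from this unit. `math.prod` and `statistics.geometric_mean` require Python 3.8 or later.

```python
import math
import statistics

# Hypothetical growth factors (for example, 1.10 means a 10% increase).
growth_factors = [1.10, 0.95, 1.20]
n = len(growth_factors)

# Geometric mean: the nth root of the product of the n values.
geo_mean = math.prod(growth_factors) ** (1 / n)

# The standard library offers the same calculation directly.
library_value = statistics.geometric_mean(growth_factors)

print(f"{geo_mean:.4f} {library_value:.4f}")
```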
1. Range
The simplest measure of variability is the range. The range can be found by subtracting
the smallest value from the largest value in a data set. Let us return to the home sales data set
to demonstrate the calculation of range. Refer to the data from home sales prices in Table 2.9.
The largest home sales price is $456,250, and the smallest is $108,000. The range is $456,250
- $108,000 = $348,250.
Although the range is the easiest of the measures of variability to compute, it is seldom
used as the only measure. The reason is that the range is based on only two of the
observations and is therefore strongly influenced by extreme values.
The range can be calculated in Excel using the MAX and MIN functions. The value
in cell E7 of Figure 2.19 calculates the range using the formula =MAX(B2:B13)-MIN(B2:B13).
This subtracts the smallest value in the range B2:B13 from the largest value in the range B2:B13.
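The equivalent calculation in Python is a one-liner. The sketch assumes the 12 home sale prices of Table 2.9.

```python
home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# Range: largest value minus smallest value.
price_range = max(home_sales) - min(home_sales)

print(f"Range: ${price_range:,}")
```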
2. Variance
Recall that the units associated with the variance are squared and that it is difficult to interpret
the meaning of squared units. Because the standard deviation is the square root of the variance,
the units of the variance, (students)² in our example, are converted to students in the standard
deviation. In other words, the standard deviation is measured in the same units as the original
data. For this reason, the standard deviation is more easily compared to the mean and other
statistics that are measured in the same units as the original data.
Figure 2.19 shows the Excel calculation for the sample standard deviation of the home sales
data, which can be calculated using Excel’s STDEV.S function. The sample standard deviation
in cell E9 is calculated using the formula =STDEV.S(B2:B13). Excel calculates the sample
standard deviation for the home sales to be $95,065.77.
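As a hedged cross-check, the sample variance and sample standard deviation can be computed in Python. The sketch assumes the Table 2.9 prices and divides the sum of squared deviations by n - 1, which is what `statistics.variance` and `statistics.stdev` also do.

```python
import math
import statistics

home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

n = len(home_sales)
mean_price = sum(home_sales) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
sample_variance = sum((x - mean_price) ** 2 for x in home_sales) / (n - 1)

# Sample standard deviation: square root of the variance, in the original units ($).
sample_sd = math.sqrt(sample_variance)

print(f"Variance: {sample_variance:,.2f} (dollars squared)")
print(f"Standard deviation: ${sample_sd:,.2f}")
print(f"Library value: ${statistics.stdev(home_sales):,.2f}")
```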
1. Percentiles
A percentile is the value of a variable at which a specified (approximate) percentage of
observations are below that value. The pth percentile tells us about the point in the data where
approximately p percent of the observations have values less than the pth percentile; hence,
approximately (100 – p) percent of the observations have values greater than the pth percentile.
Colleges and universities frequently report admission test scores in terms of percentiles. For
instance, suppose an applicant obtains a raw score of 54 on the verbal portion of an admission
test. How this student performed in relation to other students taking the same test may not be
readily apparent. However, if the raw score of 54 corresponds to the 70th percentile, we know
that approximately 70 percent of the students scored lower than this individual, and approximately
30 percent of the students scored higher. The following procedure can be used to compute the
pth percentile:
1. Arrange the data in ascending order (smallest value to largest value).
2. Compute k = (n + 1) × p, where n is the number of observations and p is expressed as a
proportion (for example, p = 0.85 for the 85th percentile).
3. Divide k into its integer component, i, and its decimal component, d. (For example, k = 13.25
would result in i = 13 and d = 0.25.)
a. If d = 0 (there is no decimal component for k), find the value in position k of the sorted data.
This is the pth percentile.
b. If d > 0, the percentile is between the values in positions i and i + 1 in the sorted data. To find
this percentile, we must interpolate between these two values.
i. Calculate the difference between the values in positions i and i + 1 in the sorted data
set. We define this difference between the two values as m.
ii. Multiply this difference by the decimal component: t = m × d.
iii. To find the pth percentile, add t to the value in position i of the sorted data.
As an illustration, let us determine the 85th percentile for the home sales data in Table 2.9.
Therefore, $305,912.50 represents the 85th percentile of the home sales data.
The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC. Figure
2.18 shows the Excel calculation for the 85th percentile of the home sales data. The value in cell
E13 is calculated using the formula =PERCENTILE.EXC(B2:B13,0.85); B2:B13 defines the data
set for which we are calculating a percentile, and 0.85 defines the percentile of interest.
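The interpolation procedure described above can be sketched as a small Python function. The function name `percentile_exclusive` is an illustrative choice, p is assumed to be given as a proportion (0.85 for the 85th percentile), and the data are assumed to be the Table 2.9 home sale prices.

```python
def percentile_exclusive(values, p):
    """Return the pth percentile (p as a proportion) using k = (n + 1) * p."""
    data = sorted(values)                 # step 1: ascending order
    n = len(data)
    k = (n + 1) * p                       # step 2: position in the sorted data
    if k < 1 or k > n:
        raise ValueError("percentile position falls outside the data")
    i = int(k)                            # step 3: integer component
    d = k - i                             #         decimal component
    if d == 0:
        return data[i - 1]                # value in position k (1-based)
    m = data[i] - data[i - 1]             # difference between positions i and i + 1
    return data[i - 1] + m * d            # interpolate: add t = m * d to position i


home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

p85 = percentile_exclusive(home_sales, 0.85)
print(f"85th percentile: ${p85:,.2f}")
```

For the home sales data, k = 13 × 0.85 = 11.05, so the result lies 5 percent of the way from the 11th value (298,000) to the 12th value (456,250).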
2. Quartiles
It is often desirable to divide data into four parts, with each part containing approximately
one-fourth, or 25 percent, of the observations. These division points are referred to as the quartiles
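and are defined as:
Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (also the median)
Q3 = third quartile, or 75th percentile
To demonstrate quartiles, the home sales data are again arranged in ascending order.
108,000 138,000 138,000 142,000 186,000 199,500
208,000 254,000 254,000 257,500 298,000 456,250
We already identified Q2, the second quartile (median), as 203,750. To find Q1 and Q3, we must
find the 25th and 75th percentiles.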
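For Q1,
1. The data are arranged in ascending order, as previously done.
2. Compute k = (n + 1) × p = (12 + 1) × 0.25 = 3.25.
3. Dividing 3.25 into the integer and decimal components gives us i = 3 and d = 0.25. Because d
> 0, we must interpolate between the values in the 3rd and 4th positions in our sorted data. The
value in the 3rd position is 138,000, and the value in the 4th position is 142,000.
i. The difference between these two values is m = 142,000 - 138,000 = 4,000.
ii. Multiplying by the decimal component gives t = m × d = 4,000 × 0.25 = 1,000.
iii. Adding t to the value in the 3rd position gives Q1 = 138,000 + 1,000 = 139,000.
Repeating the procedure with p = 0.75 gives k = 9.75 and Q3 = 254,000 + 0.75(257,500 - 254,000) = 256,625.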
The difference between the third and first quartiles is often referred to as the interquartile
range, or IQR. For the home sales data, IQR = Q3 - Q1 = 256,625 - 139,000 = 117,625. Because
it excludes the smallest and largest 25 percent of values in the data, the IQR is a useful measure
of variation for data that have extreme values or are badly skewed.
A quartile can be computed in Excel using the function QUARTILE.EXC. Figure 2.18
shows the calculations for first, second, and third quartiles for the home sales data. The formula
used in cell E15 is =QUARTILE.EXC(B2:B13,1). The range B2:B13 defines the data set, and 1
indicates that we want to compute the 1st quartile. Cells E16 and E17 use similar formulas to
compute the second and third quartiles.
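In Python, `statistics.quantiles` with `method="exclusive"` (Python 3.8 and later) interpolates at positions based on n + 1, so it should reproduce the quartiles worked out above. This is a sketch assuming the Table 2.9 prices.

```python
import statistics

home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(home_sales, n=4, method="exclusive")

# Interquartile range: the spread of the middle 50 percent of the data.
iqr = q3 - q1

print(f"Q1 = {q1:,.0f}, Q2 = {q2:,.0f}, Q3 = {q3:,.0f}, IQR = {iqr:,.0f}")
```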
3. z-scores
A z-score allows us to measure the relative location of a value in the data set. More
specifically, a z-score helps us determine how far a particular value is from the mean relative to
the data set’s standard deviation. Suppose we have a sample of n observations, with the values
denoted by x1, x2, . . . , xn. In addition, assume that the sample mean, x̄, and the sample standard
deviation, s, are already computed. Associated with each value, xi, is another value called its
z-score. Equation (2.7) shows how the z-score is computed for each xi:
\[ z_i = \frac{x_i - \bar{x}}{s} \qquad (2.7) \]
The z-score is often called the standardized value. The z-score, zi, can be interpreted as
the number of standard deviations xi is from the mean. For example, z1 = 1.2 indicates that x1 is
1.2 standard deviations greater than the sample mean. Similarly, z2 = -0.5 indicates that x2 is 0.5,
or 1/2, standard deviation less than the sample mean. A z-score greater than zero occurs for
observations with a value greater than the mean, and a z-score less than zero occurs for
observations with a value less than the mean. A z-score of zero indicates that the value of the
observation is equal to the mean.
The z-score can be calculated in Excel using the function STANDARDIZE. Figure 2.19
demonstrates the use of the STANDARDIZE function to compute z-scores for the home sales
data. To calculate the z-scores, we must provide the mean and standard deviation for the data
set in the arguments of the STANDARDIZE function. For instance, the z-score in cell C2 is
calculated with the formula =STANDARDIZE(B2, $B$15, $B$16), where cell B15 contains the
mean of the home sales data and cell B16 contains the standard deviation of the home sales
data. We can then copy and paste this formula into cells C3:C13.
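A minimal Python sketch of the same standardization follows, assuming the Table 2.9 prices. Each z-score is the observation minus the sample mean, divided by the sample standard deviation.

```python
import statistics

home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

mean_price = statistics.mean(home_sales)
sd_price = statistics.stdev(home_sales)   # sample standard deviation (n - 1)

# z_i = (x_i - mean) / s for every observation
z_scores = [(price - mean_price) / sd_price for price in home_sales]

for price, z in zip(home_sales, z_scores):
    print(f"{price:>9,} {z:6.2f}")
```

Prices above the mean get positive z-scores and prices below it get negative ones; the z-scores themselves always sum to zero.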
When the distribution of data exhibits a symmetric bell-shaped distribution, as shown in Figure
2.20, the empirical rule can be used to determine the percentage of data values that are within a
specified number of standard deviations of the mean. Many, but not all, distributions of data found
in practice exhibit a symmetric bell-shaped distribution.
The height of adult males in the United States has a bell-shaped distribution similar to that
shown in Figure 2.20 with a mean of approximately 69.5 inches and standard deviation of
approximately 3 inches. Using the empirical rule, we can draw the following conclusions.
• Approximately 68 percent of adult males in the United States have heights between
69.5 - 3 = 66.5 and 69.5 + 3 = 72.5 inches.
• Approximately 95 percent of adult males in the United States have heights between 63.5
and 75.5 inches.
• Almost all adult males in the United States have heights between 60.5 and 78.5 inches.
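The three intervals above come from moving one, two, and three standard deviations either side of the mean. A short sketch, using the approximate mean and standard deviation quoted in the text:

```python
mean_height = 69.5   # inches, approximate value from the text
sd_height = 3.0      # inches, approximate value from the text

# Empirical rule: share of bell-shaped data within k standard deviations of the mean.
coverage = {1: "about 68%", 2: "about 95%", 3: "almost all"}

intervals = {k: (mean_height - k * sd_height, mean_height + k * sd_height)
             for k in coverage}

for k, (low, high) in intervals.items():
    print(f"Within {k} standard deviation(s): {low} to {high} inches ({coverage[k]})")
```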
5. Identifying Outliers
Sometimes a data set will have one or more observations with unusually large or unusually
small values. These extreme values are called outliers. Experienced statisticians take steps to
identify outliers and then review each one carefully. An outlier may be a data value that has been
incorrectly recorded; if so, it can be corrected before further analysis. An outlier may also be from
an observation that doesn’t belong to the population we are studying and was incorrectly included
in the data set; if so, it can be removed. Finally, an outlier may be an
Standardized values (z-scores) can be used to identify outliers. Recall that the empirical
rule allows us to conclude that for data with a bell-shaped distribution, almost all the data values
will be within three standard deviations of the mean. Hence, in using z-scores to identify outliers,
we recommend treating any data value with a z-score less than -3 or greater than +3 as an outlier.
Such data values can then be reviewed to determine their accuracy and whether they belong in
the data set.
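The z-score rule can be sketched as a small helper. The function name `z_score_outliers` and its `threshold` parameter are illustrative assumptions, and the data are assumed to be the Table 2.9 home sale prices.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs((x - mean) / sd) > threshold]


home_sales = [138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
              138_000, 298_000, 199_500, 208_000, 142_000, 456_250]

flagged = z_score_outliers(home_sales)               # the rule from the text
flagged_at_2 = z_score_outliers(home_sales, 2.0)     # stricter cutoff, for comparison only

print(flagged)
print(flagged_at_2)
```

With the cutoff of 3, none of the 12 home sales is flagged, because the largest price has a z-score of roughly 2.5. Different outlier rules can therefore disagree on the same data, which is one reason each flagged value deserves a careful review.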
6. Box Plots
A box plot is a graphical summary of the distribution of data. A box plot is developed from the
quartiles for a data set. Figure 2.21 is a box plot for the home sales data. Here are the steps used
to construct the box plot:
1. A box is drawn with the ends of the box located at the first and third quartiles. For the
home sales data, Q1 = 139,000 and Q3 = 256,625. This box contains the middle 50
percent of the data.
2. A vertical line is drawn in the box at the location of the median (203,750 for the home sales
data).
3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the box
plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the home sales data, IQR = Q3
- Q1 = 256,625 - 139,000 = 117,625. Thus, the limits are 139,000 - 1.5(117,625) =
-37,437.5 and 256,625 + 1.5(117,625) = 433,062.5. Data outside these limits are
considered outliers.
4. The dashed lines in Figure 2.21 are called whiskers. The whiskers are drawn from the
ends of the box to the smallest and largest values inside the limits computed in step 3.
Thus, the whiskers end at home sales values of 108,000 and 298,000.
5. Finally, the location of each outlier is shown with an asterisk (*). In Figure 2.21, we see
one outlier, 456,250.
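The numbers behind the box plot steps can be reproduced with a short Python sketch. It assumes the Table 2.9 prices and uses `statistics.quantiles` with `method="exclusive"` for the quartiles; it computes the values only and does not draw the chart.

```python
import statistics

home_sales = sorted([138_000, 254_000, 186_000, 257_500, 108_000, 254_000,
                     138_000, 298_000, 199_500, 208_000, 142_000, 456_250])

# Steps 1 and 2: the box spans Q1 to Q3, with a line at the median.
q1, median, q3 = statistics.quantiles(home_sales, n=4, method="exclusive")

# Step 3: limits sit 1.5 IQRs below Q1 and above Q3.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Step 4: whiskers end at the smallest and largest values inside the limits.
inside = [x for x in home_sales if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

# Step 5: anything outside the limits is plotted as an outlier.
outliers = [x for x in home_sales if x < lower_limit or x > upper_limit]

print(f"Box: {q1:,.0f} to {q3:,.0f}, median {median:,.0f}")
print(f"Limits: {lower_limit:,.1f} and {upper_limit:,.1f}")
print(f"Whiskers: {whisker_low:,} and {whisker_high:,}; outliers: {outliers}")
```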
Box plots are also very useful for comparing different data sets. For instance, if we want to
compare home sales from several different communities, we could create box plots for recent
home sales in each community. An example of such box plots is shown in Figure 2.22.
1. Scatter Charts
A scatter chart is a useful graph for analyzing the relationship between two variables.
Figure 2.23 shows a scatter chart for sales of bottled water versus the high temperature
experienced over 14 days. The scatter chart also suggests that a straight line could be used as
an approximation for the relationship between high temperature and sales of bottled water.
FIGURE 2.23 CHART SHOWING THE POSITIVE LINEAR RELATION BETWEEN SALES AND
HIGH TEMPERATURES
2. Covariance
Covariance is a descriptive measure of the linear association between two variables. For
a sample of size n with the observations (x1, y1), (x2, y2), and so on, the sample covariance is
defined as follows:
\[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \qquad (2.8) \]
To measure the strength of the linear relationship between the high temperature x and the
sales of bottled water y at Queensland, we use equation (2.8) to compute the sample covariance.
The calculations in Table 2.15 show the computation.
Table 2.15 SAMPLE COVARIANCE CALCULATIONS FOR DAILY HIGH TEMPERATURE AND
BOTTLED WATER SALES AT QUEENSLAND AMUSEMENT PARK
The sample covariance can also be calculated in Excel using the COVARIANCE.S
function. Figure 2.24 shows the data from Table 2.14 entered into an Excel Worksheet. Figure
2.25 demonstrates several possible scatter charts and their associated covariance values.
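The sample covariance calculation can be sketched in Python. The five temperature and sales pairs below are hypothetical values invented for illustration; they are not the 14 days of data in Table 2.14.

```python
def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data: daily high temperature and cases of bottled water sold.
high_temps = [70, 75, 80, 85, 90]
bottle_sales = [20, 24, 27, 33, 36]

s_xy = sample_covariance(high_temps, bottle_sales)
print(f"Sample covariance: {s_xy}")
```

A positive covariance, as here, indicates that above-average temperatures tend to occur with above-average sales.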
3. Correlation Coefficient
The correlation coefficient measures the linear relationship between two variables and, unlike
covariance, is not affected by the units of measurement for x and y. For sample data, the
correlation coefficient is defined as follows.
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]
The sample correlation coefficient is computed by dividing the sample covariance by the
product of the sample standard deviation of x and the sample standard deviation of y. This scales
the correlation coefficient so that it always takes values between -1 and +1. Correlation coefficient
values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients
greater than 0 indicate a positive linear relationship between the x and y variables. The closer the
correlation coefficient is to +1, the closer the x and y values are to forming a straight line that
trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative
linear relationship between the x and y variables. The closer the correlation coefficient is to -1,
the closer the x and y values are to forming a straight line with negative slope.
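A short Python sketch shows the calculation and the point about units. The data are hypothetical (five invented temperature and sales pairs), and the conversion from Fahrenheit to Celsius is included only to show that rescaling x leaves the correlation coefficient unchanged.

```python
import math


def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical data: daily high temperature (Fahrenheit) and bottled water sales.
temps_f = [70, 75, 80, 85, 90]
sales = [20, 24, 27, 33, 36]

# The same temperatures expressed in Celsius.
temps_c = [(t - 32) * 5 / 9 for t in temps_f]

r_f = sample_correlation(temps_f, sales)
r_c = sample_correlation(temps_c, sales)

print(f"r with Fahrenheit: {r_f:.4f}")
print(f"r with Celsius:    {r_c:.4f}")
```

The covariance would shrink by a factor of 5/9 after the conversion, while the correlation coefficient stays the same.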
The scatter diagram in Figure 2.26 shows the relationship between the amount spent by
a small retail store for environmental control (heating and cooling) and the daily high outside
temperature over 100 days. The sample correlation coefficient for these data is rxy = -0.007.
Read:
Chp. 2 - Camm, J., Cochran, Fry, Ohlmann, Anderson, Sweeney, Williams. (2015). Essentials of
Business Analytics. Stamford, USA: Cengage Learning.