BA 1.3 - Descriptive Statistics

The document discusses calculating conditional means, which is the mean of a subset of data that meets a specified condition. It explains that the AVERAGEIF function in Excel can be used to calculate conditional means by specifying a range of cells to apply criteria to, the criteria or condition, and an optional range of cells containing the data to average if different from the criteria range. Calculating conditional means allows analyzing averages for particular groups within a larger data set defined by a condition.

Uploaded by

ScarfaceXXX
Copyright
© All Rights Reserved

1.3.1 Central Values for Data

As we have seen, graphs are very useful for providing insight into a data set’s patterns,
trends and outliers. However, sometimes we would like to describe the data in a more
concise way, with just one or two numbers. We call such summary measures
“descriptive statistics”—they provide a quick overview of a data set without showing
every data point. We encounter these kinds of descriptive statistics every day: we talk
about a baseball player’s performance by referring to his batting average; we measure
stock market performance using the Dow Jones Industrial Average or the Nikkei Index;
we summarize our academic performance using our grade point average. 

As a starting point, let’s determine the “central tendency” of a data set—an indication of
where the “center” of the data set lies. We usually start by calculating the mean, the
most common measurement of central tendency. The mean is the number people refer
to when they talk about the “average” of a set of numbers.

The mean, median, and mode for the data set {0.5, 0.5, 1.5, 3.0, 4.0} are shown in the
graph below. The graph uses Excel’s conventions for histograms. Remember that each
bin contains a range of values. For example, bin 2 contains all values greater than 1 and
less than or equal to 2. Thus, the value 1 on the graph is represented by the gray vertical
grid line between bins 1 and 2 (not where the bin label 1 appears). Similarly, the value 2
is represented by the gray vertical grid line between bins 2 and 3 (not where the bin label
2 appears). Thus, for example, the mean, 1.9, is represented by the red dotted line just
to the left of the gray vertical grid line that represents the value 2. The median, 1.5, is
represented by the red dotted line in the center of bin 2, which is between the gray
vertical grid lines at the values 1 and 2.

Drill Down: Distributions with Multiple Modes

The mode is the value that occurs most frequently in a data set. If a data set has more
than one value with the highest frequency, that data set has more than one mode. A
distribution is called bimodal if it has two clearly defined peaks (two points with very high
frequency). The two peaks may have equal frequency and hence be true modes, or one
peak may be a mode and the other peak may simply have a very high (but not the
highest) frequency. Distributions with multiple peaks are called multimodal.

1.3.1_01_Central Values for Data.wmv

Let's look at a couple of data sets to help us understand how the locations of the mean,
median, and mode can vary based on the distribution of the data. In the first
distribution, the mean, median, and mode are the same because the data set is symmetric
and has only one peak or mode. In the second distribution, the data set has an outlier.
The mean is pulled towards that extreme value, but the median and mode are the same as in
the first distribution.

In general, if there is an outlier or if the distribution is skewed, that is, has a tail
that extends out to one side, the extreme values will pull the mean towards them. People
tend to rely heavily on the mean to characterize a data set. But it's important to realize
that the mean is affected by outliers, and therefore, may not be the best value to
represent the distribution.
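
The pull of an outlier on the mean can be seen in a few lines of Python. This is only a
minimal sketch: both data sets are hypothetical (they are not the ones shown in the
video), and the second differs from the first only in its largest value.

```python
from statistics import mean, median, mode

# Hypothetical symmetric data set with a single peak at 3.
symmetric = [1, 2, 3, 3, 3, 4, 5]
# Same data, except that the largest value is replaced by an outlier.
with_outlier = [1, 2, 3, 3, 3, 4, 40]

for name, data in [("symmetric", symmetric), ("with outlier", with_outlier)]:
    print(name, "mean:", mean(data), "median:", median(data), "mode:", mode(data))
```

In the symmetric set all three measures coincide at 3. With the outlier the mean moves
to 8, while the median and mode stay at 3.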

MEAN

When we examine the oil consumption distribution, we see that most countries use less
than 3 million barrels per day. If this is true, why is the mean, 5.31 million barrels, so
much higher than 3 million? The highest consuming countries, the United States and China,
consume much more than the other countries. These two large consumers pull the average up
considerably. In this case, with a skewed distribution, we may draw false conclusions
about the underlying distribution if we use only the mean to represent the data.

MEDIAN

The median daily oil consumption is 3.05 million barrels, significantly less than the
mean. Notice that 3.05 is not actually a value in our dataset. This is because we have an
even number of observations. Thus, we rank the values by magnitude, and then take the
average of the two middle values. By definition, half the countries consume less than the
median and half consume more. In this case, we may consider the median to capture the
central tendency more accurately, because it is not pulled upwards by the two largest
consumers.

MODE

The mode, like the median, is not affected by outliers. Here, the mode is 2.3 million
barrels, which falls in bin three. Since most countries fall in bin three, the mode is
also a helpful measure for summarizing this dataset.
Correct!

Your answer is close! Notice that this is a skewed distribution, which pulls the mean
towards the tail. The mean of this data set is 5.31 million barrels per day.
Correct!

Your answer is close! Remember that by definition half of the data points are below
the median and half are above. The median of this data set is 3.05 million barrels per
day.
Correct!

Your answer is close! Remember that the mode is the most common value in the
data set. Look for the bin with the highest frequency. The mode of this data set is
2.30 million barrels per day.

Question 1 of 2
Question 2 of 2

In the graphs shown below, several values of the data set shown in the previous
question have been shifted to the right. Which of the following do you think accurately
displays the mean, median, and mode of this data set?

To calculate the values for the mean, median, and mode in Excel, we can use the
following formulas:

=AVERAGE(number 1, [number 2], …)

=MEDIAN(number 1, [number 2], …)

=MODE.SNGL(number 1, [number 2], …)

 number 1 is the first number, cell reference, or range of cells for which to
calculate the specified value.
 [number 2],… represents additional numbers, cell references, or ranges of cells.
The square brackets indicate that the argument is optional. 

For example, if we had the data set {0.5, 0.5, 1.5, 3.0, 4.0}, we could find the mean by
entering =AVERAGE(0.5, 0.5, 1.5, 3.0, 4.0) into a cell. This would calculate the average,
1.9. More often when using Excel, we won’t input data directly into the formula, but will
instead input a range of cells. For example if cell A1=0.5, A2=0.5, A3=1.5, A4=3.0, and
A5=4.0, we could calculate the mean of the data set by entering =AVERAGE(A1:A5),
which would return 1.9. Because the mean is the sum of all data points, divided by the
number of data points, we could also enter =SUM(A1:A5)/COUNT(A1:A5). The COUNT
function counts the number of cells that contain numerical values so in this case
=SUM(A1:A5)/COUNT(A1:A5) is equivalent to =SUM(A1:A5)/5.
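
For readers who want to check these values outside Excel, the same calculations can be
sketched with Python's standard `statistics` module. The Python functions are rough
counterparts chosen for illustration; they are not part of the course materials.

```python
from statistics import mean, median, mode

data = [0.5, 0.5, 1.5, 3.0, 4.0]  # the data set from the text

print(mean(data))             # counterpart of =AVERAGE(A1:A5)
print(median(data))           # counterpart of =MEDIAN(A1:A5)
print(mode(data))             # counterpart of =MODE.SNGL(A1:A5)
print(sum(data) / len(data))  # counterpart of =SUM(A1:A5)/COUNT(A1:A5)
```

The first and last lines compute the same quantity, the sum of the data points divided
by the number of data points.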

Drill Down: Excel Functions for Distributions with Multiple Modes

The Excel function MODE.MULT finds all of the modes in a data set, generating a
vertical array that has one row for each mode in the data set. Suppose a data set has
three modes. To find them, instead of entering the MODE.MULT formula in a single
cell, highlight at least three vertically-contiguous cells and then
input =MODE.MULT(number 1, [number 2], …) into the formula bar. Then, instead of
using ENTER to find the result, use CTRL+SHIFT+ENTER to enter the array. The
modes will appear in the first rows of the array (filling as many rows as there are
modes in the data set) and #N/A will appear in each of the other rows in the array. 

How do we know how many vertically-contiguous cells to highlight when using this
function?  To determine how many modes are in a data set, enter
=COUNT(MODE.MULT(number 1, [number 2], …)) in a single cell. When you press
ENTER, the function will return the number of modes in the data set. The result tells
you how many vertical cells you should highlight to create a MODE.MULT array.

Note that our embedded spreadsheet currently does not support the functionality
required by MODE.MULT.
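
Because MODE.MULT cannot be tried in the embedded spreadsheet, here is a small Python
sketch of the same idea. It uses `statistics.multimode` as a stand-in for MODE.MULT, and
the data set is hypothetical. The two functions are analogous rather than identical; in
particular they may behave differently when no value repeats.

```python
from statistics import multimode

# Hypothetical data set in which 2, 5 and 7 each occur twice.
data = [2, 2, 5, 5, 7, 7, 9]

modes = multimode(data)  # every value tied for the highest frequency
print(modes)
print(len(modes))        # plays the role of =COUNT(MODE.MULT(...))
```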

Related: Alternative Excel Functions

MODE.SNGL replaces the function:

=MODE(number 1, [number 2], …)

Throughout the course we will provide alternative functions that existed prior to Excel
2010 that can still be used in Excel 2010.

Question 1 of 4
Correct!

The mean is AVERAGE(A2:A11)=5.31 million barrels per day.


Question 2 of 4

Correct!

The median is MEDIAN(A2:A11)=3.05 million barrels per day.

Question 3 of 4
Correct!

The mode is MODE.SNGL(A2:A11)=2.30 million barrels per day.

Question 4 of 4
Correct!

The mean is AVERAGE(A1:A12)=10.

The median is MEDIAN(A1:A12)=12.

The mode is MODE.SNGL(A1:A12)=12.


1.3.2 Conditional Means
Sometimes it’s necessary or helpful to analyze a subset of a data set. For example, we
may be interested in the average oil consumption across countries in North America or
we may want to compare oil consumption across continents. The mean of a specific
subset of data is known as the conditional mean, because we are imposing
a condition on our data set and we only want to find the mean of the values that meet
that condition.

To calculate a conditional mean in Excel, we can use the following formula:

=AVERAGEIF(range, criteria, [average_range])

 range contains the one or more cells to which we want to apply the criteria or
condition.
 criteria is the condition that is to be applied to the range.
 [average_range] is the range of cells containing the data we wish to average. The
square brackets indicate that the argument is optional; if it is omitted, the cells in
range are averaged.

Spreadsheet: Calculating Conditional Means

Question 1 of 2

Let’s calculate the mean oil consumption for all North American countries in our data set.
Note that the appropriate continent is now listed next to each data point to make it easy
to apply a condition to our data set.

Step 1

In cell F2, enter the function =AVERAGEIF(C2:C11,E2,A2:A11).

 This function says, if the country’s continent is North America, then include that
country’s oil consumption in the calculation of the average.
 C2:C11 contains the data to which we want to apply the condition.
 E2 is the condition or criterion by which we choose the data points to include in
the calculation of the conditional mean. In the function we have specified, we
have just entered a link to cell E2, which contains “North America”. Alternatively,
we could type “North America” into the function as follows:
=AVERAGEIF(C2:C11,"North America",A2:A11). 
 A2:A11 contains the numeric data from which the conditional mean will be
calculated. Only the values that meet the criterion—in this case, data
corresponding to countries in North America—will be used in the calculation.
Thus, the mean oil consumption will be calculated for Canada and the United
States.
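
The logic of a conditional mean can also be sketched in Python. The helper `average_if`
and the rows below are hypothetical: the numbers are made up for illustration and are
not the course's oil consumption data, and the helper only matches a label exactly,
which is narrower than what AVERAGEIF accepts as criteria.

```python
# Hypothetical rows of (continent, consumption); the values are made up.
rows = [
    ("North America", 2.0),
    ("Asia", 4.0),
    ("North America", 18.0),
    ("Europe", 3.0),
    ("Asia", 10.0),
]


def average_if(rows, criterion):
    """Mean of the values whose label equals criterion (the idea behind AVERAGEIF)."""
    selected = [value for label, value in rows if label == criterion]
    if not selected:
        raise ValueError("no rows meet the condition")
    return sum(selected) / len(selected)


print(average_if(rows, "North America"))  # averages 2.0 and 18.0 only
print(average_if(rows, "Asia"))           # averages 4.0 and 10.0 only
```

Only the rows that meet the condition enter the sum and the count, so each call returns
the mean of its own subset rather than the mean of all five values.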
Question 2 of 2

Now calculate the mean oil consumption for countries located in Asia.

1.3.3 Percentiles
In addition to calculating measures of central tendency, sometimes we may want to
know the value beneath which a certain percentage of the data lie. For example, we may
want to find the 25th percentile, also known as the first quartile. The 25th percentile is the
smallest value that is greater than or equal to 25% of the data points.

Percentiles are often used to categorize test scores. For example, someone who scored
in the 95th percentile of a test scored equal to or higher than 95% of all people who took
that test. We can also say that person scored in the top 5%.

Question 1 of 3

What percentile does the mean represent?

0%
Remember that the mean’s location depends upon the distribution of the data set. Recall
how the location of the mean differs for a symmetrical distribution and a skewed
distribution.
100%
Remember that the mean’s location depends upon the distribution of the data set. Recall
how the location of the mean differs for a symmetrical distribution and a skewed
distribution.
50%
Remember that the mean’s location depends upon the distribution of the data set. Recall
how the location of the mean differs for a symmetrical distribution and a skewed
distribution.
The answer cannot be determined without further information CORRECT
Remember that the mean’s location depends upon the distribution of the data set. Recall
how the location of the mean differs for a symmetrical distribution and a skewed
distribution. Therefore, there is no way to determine the percentile of the mean without
more information about the data set.

Question 2 of 3

What percentile does the median represent?

0%
Remember that half of a distribution’s data points are less than or equal to the median.
100%
Remember that half of a distribution’s data points are less than or equal to the median.
50% CORRECT
Remember that half of a distribution’s data points are less than or equal to the median.
Therefore, the median is equal to the 50th percentile, because 50% of the data points
are equal to or below this value.
The answer cannot be determined without further information
Remember that half of a distribution’s data points are less than or equal to the median.

Question 3 of 3

What percentile does the mode represent?

0%
Remember that the mode’s location depends upon the distribution of the data set.
100%
Remember that the mode’s location depends upon the distribution of the data set.
50%
Remember that the mode’s location depends upon the distribution of the data set.
The answer cannot be determined without further information CORRECT
Remember that the mode’s location depends upon the distribution of the data set.
Therefore, there is no way to determine the percentile of the mode without more
information about the data set.

Let’s return to our oil consumption data. Use the slider to change the percentile and see
the amount of oil consumption associated with it.

There are several methods for calculating percentiles, and different methods may lead
to different values. We will use the same method as Excel, which means that the
percentile found may fall between two data points and not actually be a value in our
data set. We have seen this when we calculate the median of an even number of data
points—the median falls between the two middle values.

To find a percentile in Excel, we use the following function:

=PERCENTILE.INC(array, k)

 array is the range of data for which we want to calculate a given percentile.
 k is the percentile value. For example, if we want to know the 95th percentile, k
would be 0.95.
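
To see how a percentile can fall between two data points, here is a minimal Python
sketch of one common interpolation rule. It assumes the percentile sits at position
k × (n − 1) in the sorted data and interpolates linearly between the neighboring values;
treat that rule as an assumption about how PERCENTILE.INC behaves rather than as Excel's
documented algorithm. The function name `percentile_inc` and the data are hypothetical.

```python
import math


def percentile_inc(values, k):
    """Percentile by linear interpolation between neighboring ranks, 0 <= k <= 1."""
    if not values or not 0 <= k <= 1:
        raise ValueError("need non-empty data and 0 <= k <= 1")
    ordered = sorted(values)
    position = k * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [10, 20, 30, 40]  # hypothetical values
print(percentile_inc(data, 0.5))   # halfway between 20 and 30
print(percentile_inc(data, 0.25))  # three quarters of the way from 10 to 20
```

With an even number of data points, the 50th percentile lands midway between the two
middle values, which matches how the median is calculated.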

Related: Alternative Excel Functions

PERCENTILE.INC replaces the function:

=PERCENTILE(array, k)

Throughout the course we will provide alternative functions that existed prior to Excel
2010 that can still be used in Excel 2010.

Question 1 of 2

Suppose we want to know which countries are in the 75th percentile.

Question 2 of 2
Now calculate the 50th percentile for the oil consumption data on your own. Also,
calculate the median using the MEDIAN function. Note how the 50th percentile relates to
the median.

1.3.4 Variability
1.3.4_01_Variability.wmv

It's often critical to have a sense of how much the values in a dataset vary. The mean,
median, and mode may give us a sense of where the center of the data lies, but none of
these indicates how widely dispersed the data are around the mean. Two sets of data could
have the same or very similar means, but may be distributed completely differently. In
one set, the values may cluster around the mean, whereas in the other the values may be
widely dispersed.

Let's look at an example to see why understanding the dispersion of data is so important.
To identify good target markets, a car dealership might look at several communities, and
find the average household income of each. Suppose two towns with roughly equal
populations, Greenville and Springfield, have average household incomes of $95,000 and
$98,000 respectively. If the car dealer wants to target households with incomes above
$90,000, on which town should he focus? Based on average household income, we might be
tempted to conclude that the dealer should focus on Springfield.

We need to be careful though. The mean income doesn't tell us anything about how the
incomes are distributed. Even though Springfield has a higher average income than
Greenville, the majority of households in Springfield actually earned less than $90,000.
Springfield's mean income is higher, because its household incomes are widely dispersed,
and a few very high income families pull up the average. If we take a closer look at the
data, we see that despite having a lower average, there are actually more households in
Greenville with incomes above $90,000. Without understanding how the data are
distributed, the dealer might have chosen Springfield, which has fewer homes in the
target range.

As we've already learned, looking at the distribution of a dataset is often very helpful.
However, if we want to summarize the data numerically, we typically convey both the mean
and a measure of variability.

One of the simplest measures of variability, or spread, is the range:

Range=Maximum value–Minimum value
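
As a quick illustration, the range can be computed in Python as follows. The helper
`value_range` and both data sets are hypothetical; they are chosen so that the two sets
share the same extremes but have different shapes.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


# Two hypothetical data sets with the same extremes but different shapes.
clustered = [0, 5, 5, 5, 5, 10]
spread_out = [0, 0, 0, 10, 10, 10]

print(value_range(clustered))
print(value_range(spread_out))
```

Both calls return 10, because the range looks only at the minimum and maximum values and
ignores everything in between.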

Question 1 of 4

What is the range of this data set?

0
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
1
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
10 CORRECT
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.
11
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.

Question 2 of 4

What is the range of this data set?

0
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
1
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
10 CORRECT
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.
11
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.

Question 3 of 4

What is the range of this data set?

0
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
1
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
5
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
10 CORRECT
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.

Question 4 of 4

What is the range of this data set?

0
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
1
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
5
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0.
10 CORRECT
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.

1.3.4_02_Variability.wmv

As we can see, datasets that have the same minimum and maximum values will have the same
range, even if the shapes of their distributions are completely different. To gain more
insight into the spread of the distribution and how the data behave between the two
extremes, we can calculate the variance. Rather than simply measuring the distance
between the two extremes, the variance measures how far each point is from the mean.

Let's look at an example. First, we find the mean of the dataset, in this case 3. Then we
calculate the distance from each point to the mean and square that difference. Then for
each occurrence of a value in the dataset, we sum the squared terms. We square the
differences, because if we simply added them, the positive and negative values would
cancel each other out. Squaring the differences gives us only non-negative numbers.
Moreover, squaring the distances gives more weight to points that are further from the
mean.

To calculate the variance of a population, we divide the sum of the squared differences
by n, the number of data points, essentially giving us an average of the squared
differences. However, typically we're working with a sample of data points. So we divide
by n minus 1 to obtain a sample variance of 12.8. The technical reason we divide by n
minus 1 is beyond the scope of this course. Essentially, we do this to obtain an unbiased
estimate of the population variance.

One shortcoming of the variance is that it has inconvenient units. Since we square the
differences, we end up with a value that is also squared. For example, how would we
interpret a square dollar? To convert the variance to have the same units as the data
points, we take the square root of the variance, which gives us the standard deviation.
In this case, the sample standard deviation is about 3.6. A small standard deviation
indicates that the data points are close to the mean. A large standard deviation
indicates a broader spread.

Luckily, we can use Excel to calculate the variance and standard deviation. We'll never
have to do so by hand again.
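
The steps described in the video can be written out in a few lines of Python. This is a
sketch on a hypothetical sample, not the data set used in the video, so its results
differ from the 12.8 and 3.6 quoted there.

```python
import math

data = [1, 2, 3, 4, 5]  # hypothetical sample; not the data set from the video

n = len(data)
mean = sum(data) / n
squared_diffs = [(x - mean) ** 2 for x in data]  # squared distance from the mean

population_variance = sum(squared_diffs) / n    # divide by n
sample_variance = sum(squared_diffs) / (n - 1)  # divide by n - 1
sample_std = math.sqrt(sample_variance)         # back to the units of the data

print(mean, population_variance, sample_variance, sample_std)
```

For this sample the squared differences are 4, 1, 0, 1 and 4, which sum to 10. Dividing
by n gives 2.0, and dividing by n − 1 gives 2.5.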
Question 1 of 3

Which of the following data sets do you think has the smallest standard deviation?

Question 2 of 3

If the variance of a data set is 9, what is the standard deviation?

9
The variance and the standard deviation are not equal. Recall how the two measures
are related.
3 CORRECT
The standard deviation is equal to the square root of the variance. If the variance is 9,
then the standard deviation must be 3.
81
The standard deviation is not equal to the variance squared. Recall how the two
measures are related.
There is not enough information to determine the standard deviation
There is sufficient information to answer this question. Recall how the standard deviation
and the variance are related.
Question 3 of 3

If two data sets, A and B, have equal standard deviations, which of the following
statements are true? Select all that apply.

The range of A must equal the range of B
Data sets with equal standard deviations can have different ranges because the range
measures the difference between the minimum and maximum values whereas the
standard deviation calculation is based upon each data point’s distance from the mean.
The range of A might not equal the range of B CORRECT
Data sets with equal standard deviations do not necessarily have the same range. The
standard deviation is based upon each data point’s distance from the mean and thus
provides little information about the range, which is based only on the minimum and
maximum values in the data set. Note that another option is also correct.
The range of A must equal the variance of B
We can calculate the variance of a data set based on its standard deviation, but the
standard deviation provides little information about the range of a data set. Therefore,
we cannot say that the range of A is equal to the variance of B.
The variance of A must equal the range of B
We can calculate the variance of a data set based on its standard deviation, but the
standard deviation provides little information about the range of a data set. Therefore,
we cannot say that the variance of A is equal to the range of B.
The variance of A must equal the variance of B CORRECT
The variance is equal to the square of the standard deviation. If the standard deviations
of A and B are equal, then the variances must also be equal. Note that another option is
also correct.
The mean of A must equal the mean of B
The standard deviation considers each data point’s distance from the mean. Two data
sets can have equal standard deviations but unequal means. For example, if we added 1
to every point in a data set, the mean would increase by 1 but the standard deviation
would remain the same.

To calculate the variance or standard deviation of a sample in Excel, we can use the
following functions:

=VAR.S(number 1, [number 2], …)

=STDEV.S(number 1, [number 2], …)

 number 1 is the first number, cell reference, or range of cells for which to
calculate the specified value.
 [number 2],… represents additional numbers, cell references, or ranges of cells.
The square brackets indicate that the argument is optional.  

Note that the “S” in VAR.S and STDEV.S indicates that we are working with a sample.
We will learn more about the differences between samples and populations in the next
module.

We can also find the standard deviation using the Excel function =SQRT(number) to
take the square root of the variance. For example, =SQRT(16)=4.
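
Python's standard `statistics` module offers rough counterparts to these functions,
which can be used to check the relationships described in this section. The sample below
is hypothetical, and the mapping to the Excel functions is for illustration only.

```python
import math
from statistics import stdev, variance

data = [2, 4, 4, 5, 7, 8]        # hypothetical sample
shifted = [x + 1 for x in data]  # every point moved up by 1

print(variance(data))  # sample variance, the counterpart of =VAR.S(...)
print(stdev(data))     # sample standard deviation, the counterpart of =STDEV.S(...)

# The standard deviation is the square root of the variance.
print(math.isclose(stdev(data), math.sqrt(variance(data))))
# Adding 1 to every point changes the mean but not the standard deviation.
print(math.isclose(stdev(data), stdev(shifted)))
```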

Related: Alternative Excel Functions

VAR.S and STDEV.S replace the functions:

=VAR(number 1, [number 2], …)

=STDEV(number 1, [number 2], …)

Throughout the course we will provide alternative functions that existed prior to Excel
2010 that can still be used in Excel 2010.

Question 1 of 4
Question 2 of 4

Let’s calculate the variance and standard deviation of the oil consumption data.

Step 2

In cell E3, enter the function =STDEV.S(A2:A11).

Note

Notice that the standard deviation is equal to the square root of the variance.
Question 3 of 4
Question 4 of 4
1.3.5 Descriptive Statistics in Excel
Spreadsheet: Creating the Output
Excel has a descriptive statistics tool that provides a number of summary statistics,
including those we’ve already learned, for a set of data. Let’s walk through how to create
it.

Step 1

From the Data menu, select Data Analysis, then select Descriptive Statistics.

Step 2

Enter the appropriate Input Range:

 The Input Range is the oil consumption data in column A with its label, A1:A11.
 Make sure to include A1, the cell containing the label, when inputting your range
and check the Label in first row box, as this ensures that your output table will be
appropriately labeled.

Step 3

Enter the appropriate Output Range, in this case enter D1.

 This cell is the top left hand cell in which the output table will appear.
 Be sure to always use the specified Output Range, so that the calculated
values are placed in the blue cells for grading.

Step 4

Be sure to select Summary Statistics so that the output table is generated.


Related: Creating the Descriptive Statistics Output in Excel

In order to create the descriptive statistics output table in Excel, you must first enable
the Analysis ToolPak, which is an add-in program.
Spreadsheet
Use the descriptive statistics tool to calculate the summary statistics for the heights of
these ten Red Sox players. Make sure to enter D1 as the output range so that your
calculations are graded correctly.
1.3.6 Coefficient of Variation
The standard deviation describes how much the values in a single data set vary, but
what if we want to compare the amount of variation in two different data sets? Suppose
we want to compare the share price volatility of two stocks, Stock A and Stock B. For
both stocks, the standard deviation in share price is $4.50. Compared to its average
price, this standard deviation is relatively high for Stock A but relatively low for Stock B.
The standard deviations are the same, but Stock A is much more volatile relative to its
mean than Stock B.
To compare variation in two data sets, we calculate a value called the coefficient of
variation (CV). The coefficient of variation is the ratio of the standard deviation to the
mean. The equation for the coefficient of variation is:

Coefficient of variation=Standard deviation/Mean
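
In code, the ratio of the standard deviation to the mean can be sketched as follows. The
standard deviation of $4.50 comes from the stock example above. The two average prices
are hypothetical, chosen only to show how the same standard deviation leads to very
different coefficients of variation.

```python
def coefficient_of_variation(std_dev, mean):
    """Ratio of the standard deviation to the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a mean of 0")
    return std_dev / mean


std_dev = 4.50         # from the text: both stocks have this standard deviation
mean_price_a = 10.00   # hypothetical average price for Stock A
mean_price_b = 100.00  # hypothetical average price for Stock B

print(coefficient_of_variation(std_dev, mean_price_a))
print(coefficient_of_variation(std_dev, mean_price_b))
```

Under these assumed prices, Stock A's coefficient of variation is ten times that of
Stock B, even though the two standard deviations are identical.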
1.4.1 Scatter Plots
We use histograms to help us visualize a single variable’s distribution. To visualize the
relationship between two variables, we typically use a scatter plot. One variable is
plotted on the horizontal axis (x-axis), and the other is plotted on the vertical axis (y-
axis). Please read the instructions carefully to make sure you assign variables to the
correct axes for grading purposes. Later in the course when we learn about regression
analysis, we’ll see that the convention is to plot the so-called “dependent” variable on the
vertical axis and the “independent” variable on the horizontal axis.

Spreadsheet: Creating Scatter Plots

Let’s return to our ten Boston Red Sox players. We would like to create a scatter plot to
see if there is a relationship between the players’ heights and weights.

Step 1

From the Insert menu, select Scatter, then select Scatter With Only Markers.

Step 2

Enter the appropriate Input Y Range and Input X Range:

 The Input Y Range is the weight data in column C with its label, C1:C11.
 The Input X Range is the height data in column B with its label, B1:B11.
 Make sure to include the cells containing labels when inputting your
ranges and check the Labels in first row box, as this ensures that your
scatter plot will be appropriately labeled.

Note

 The process for creating scatter plots in this course is different from the
process in Excel.
Correct!

The Input Y Range is C1:C11 and the Input X Range is B1:B11. You must check
the Labels in first row box since we included B1 and C1 to ensure that the scatter
plot’s axes are appropriately labeled.

The scatter plot of players’ weight vs. height is shown below. In general, as height
increases, what happens to weight?

Weight increases CORRECT
The scatter plot indicates that as height increases, weight also increases.

Weight decreases, or Weight remains the same
Look at the scatter plot. What trend do you see? As you move along the x-axis from left
to right (smaller values to larger values), what typically happens to the corresponding
y-values? Do they increase or decrease?
1.4.2 Correlation
As we have seen, scatter plots are useful for visualizing the relationship between two
variables. It is also helpful to have a measure that quantifies how strong the relationship
is. We will use the correlation coefficient to measure the strength of the linear
relationship between two variables. The correlation coefficient measures the extent to
which the data points on the scatter plot fall along a line, on a scale from -1 to +1.
Even when the correlation coefficient is 0, a relationship between two variables might
exist—just not a linear one. The relationship may appear more like a curve, for example,
as shown below. Never rely solely on the correlation coefficient. Always consult a visual
representation of the data such as a scatter plot to see patterns and gain other insights
into the situation the data describe.
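The point about a zero correlation hiding a curved relationship can be checked with a small Python sketch. The function name `pearson_r` and the data are illustrative assumptions; the function implements the standard Pearson correlation formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
line = [2 * x + 1 for x in xs]   # perfectly linear in x
curve = [x ** 2 for x in xs]     # fully determined by x, but not linearly

r_line = pearson_r(xs, line)
r_curve = pearson_r(xs, curve)
print(f"line:  r = {r_line:.2f}")
print(f"curve: r = {r_curve:.2f}")
```

In this made-up data the curve's y-values are completely determined by x, yet the correlation coefficient is zero because the relationship is symmetric rather than linear.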
 

What do you estimate to be the correlation coefficient between the height and weight of
the ten Red Sox players?

0
There is clearly a linear relationship between height and weight, so the correlation is not
0.

0.8 CORRECT
The relationship between height and weight is positive and is quite strong, so 0.8 is a
good estimate.

-0.8 or -0.5
The relationship between height and weight is positive, so the correlation coefficient
must be positive. A negative correlation coefficient would indicate that as height
increases, weight typically decreases.

Shared Reflection
What does the value of the correlation coefficient tell you about the strength and nature
of the relationship between two variables?  Be sure to consider the full range of
relationships.

A correlation coefficient value of +1 indicates that there is a strong linear relationship
between two variables. It indicates that as one variable increases the other also
increases.
A correlation coefficient value of -1 indicates that there is a strong linear relationship
between two variables but that one variable decreases as the other increases.
All other values between -1 and +1 indicate relationships that fall between the two
extremes.
A correlation coefficient value of 0 indicates that the two variables have no
relationship or a nonlinear relationship. --Me

The correlation coefficient tells us how strongly the two variables are related. If the
correlation is 0, there is almost no linear relationship between the two variables. If it is
-1, there is a strong negative correlation, and if it is +1, there is a strong positive
correlation.
Example:
Positive correlation: the larger your feet, the larger the shoe size you need.
Negative correlation: the more bread you bake, the more the amount of flour in your
storage decreases.
If the correlation coefficient is 0, there is no or almost no correlation between the two
variables. For example, there is little or no correlation between the number of drinks sold
at an event and the level of education of a person.
On a scattergram you can usually see whether there is no correlation, a positive correlation,
or a negative correlation between the two variables. --Dee

A correlation coefficient gives the user a sense of whether the two variables will
move in tandem or in opposite directions. This is important for forecasting: if there is
evidence that the two variables move in tandem, then forecasting is more
accurate and reliable. However, if the points are scattered, then there is no visible pattern
to follow when forecasting. --Theng Fei

The value of the correlation coefficient allows you to see the strength of the relationship
between two variables. The correlation coefficient goes from -1 (indicating a perfect negative
correlation between the x-axis and y-axis variables) to +1 (indicating a perfect positive
correlation between the variables). The closer to -1, the stronger the negative linear
correlation, whereas the closer to +1, the stronger the positive correlation. On the other
hand, the closer to 0, the weaker the linear correlation.

It is important to notice that a correlation coefficient of 0 does NOT mean that there isn't
a relationship, but only that there is no linear relationship (the relationship may be a
curve, for example). Therefore, a graphical representation of the data such as a scatter
plot might be needed. --Carla

1.4.2_01_Correlation.wmv

It's important to realize that a correlation coefficient may give us an incomplete or even
incorrect picture of what's going on. Suppose you're a manager and you suspect that your
employees skip work to enjoy nice weather. After gathering the necessary data, you find
that the correlation between temperature and absences is 0.47. While not a strong linear
relationship, a correlation of 0.47 does indicate a positive relationship, suggesting that
the weather might indeed be the culprit.

But when we look at a scatter plot of the data, other than a few outliers, there really
isn't a clear relationship. You might then realize that the three outliers correspond to a
late-summer, three-day transportation strike that kept some workers home. In this case, if
we exclude the outliers, the relationship disappears and the correlation essentially drops
to zero, quieting any suspicion of weather influencing worker absences.

As a summary statistic of the data, the correlation coefficient is calculated by
incorporating the value of every data point. As with the mean, including outliers in our
correlation calculations can affect results and lead to false impressions. The correlation
calculation gives more weight to points that are further from the mean, so it is strongly
influenced by outliers.
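The effect of a few outliers on the correlation can be sketched in Python. All numbers below are invented for illustration; they are not the data from the video and will not reproduce its 0.47 figure. The helper names `pearson_r` and `r_for` are also assumptions of this sketch.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up (temperature in degrees F, absences) pairs for ordinary days,
# constructed so that absences do not depend on temperature.
ordinary = [(60, 2), (65, 3), (70, 2), (75, 3), (80, 2),
            (60, 3), (65, 2), (70, 3), (75, 2), (80, 3)]
# Three made-up outliers standing in for the strike days.
strike = [(84, 13), (85, 12), (86, 11)]

def r_for(points):
    """Split (x, y) pairs into two sequences and correlate them."""
    temps, absences = zip(*points)
    return pearson_r(temps, absences)

r_with = r_for(ordinary + strike)
r_without = r_for(ordinary)
print(f"with outliers:    r = {r_with:.2f}")
print(f"without outliers: r = {r_without:.2f}")
```

With the three hot, high-absence days included, the correlation is clearly positive; with them removed it drops to zero, even though the other ten points are unchanged.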

To find the correlation coefficient in Excel, we use the following function:

=CORREL(array 1, array 2)

• array 1 is a set of numerical values or cell references containing data for one
variable of interest.
• array 2 is a set of numerical values or cell references containing data for the
other variable of interest.
• Note that the number of observations in array 1 must be equal to the number in
array 2.

Spreadsheet: Calculating the Correlation Coefficient

Now let’s calculate exactly what the correlation is between height and weight for these
ten Boston Red Sox players.

Step 1
In cell F2, enter the function =CORREL(B2:B11,C2:C11).

Note

 Note that the order in which the two data sets are selected does not matter (that
is, it doesn’t matter which variable we choose as the x variable and which we
choose as the y variable), as long as the association between each data “pair” is
maintained. With height and weight, both values certainly need to refer to the
same person!
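The note above can be checked with a short Python sketch. The heights and weights are hypothetical stand-ins (the players' actual data are not reproduced here), and `correl` is an illustrative function implementing the Pearson correlation formula.

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds); position i in each
# list refers to the same player.
heights = [70, 71, 72, 72, 73, 74, 74, 75, 76, 78]
weights = [180, 185, 195, 190, 205, 210, 200, 220, 225, 240]

r_hw = correl(heights, weights)
r_wh = correl(weights, heights)   # arguments swapped, pairs unchanged

# Breaking the pairing (here by reversing one list) changes the result.
r_broken = correl(heights, list(reversed(weights)))

print(f"height, weight: r = {r_hw:.3f}")
print(f"weight, height: r = {r_wh:.3f}")
print(f"pairing broken: r = {r_broken:.3f}")
```

Swapping the two arguments gives the same coefficient, while reordering only one of the lists gives a different one because each height is no longer matched with the same player's weight.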

The correlation between height and weight for the ten Red Sox players is quite high.
However, a high correlation does not imply that one variable causes the other. Taller
people may tend to weigh more, but gaining weight won’t make you taller! Correlation
indicates a linear relationship, but it does not indicate causality.
1.4.3 Hidden Variables
1.4.3_Hidden_Variables.wmv

Suppose we find that two variables are correlated. We graph the data on a scatter plot to
make sure that the correlation calculation isn't driven solely by outliers. We see an
apparent relationship between the two variables. But can we be sure that there is a direct
relationship between them?

Let's take a look at a scatter plot of ice cream and snow shovel sales. When snow shovel
sales slump, ice cream sales jump. But is there actually a fundamental relationship
between these variables? Or is the relationship driven by something else?

Upon reflection, we realize that there is no fundamental relationship between the two
variables but that the apparent relationship is driven by a third, hidden variable: the
season. People purchase more ice cream during the summer months and more snow shovels
during the winter months. Without such reflection, we might not have considered the time
of year at all, and we would have neglected a critical driver of both products' sales.

A hidden variable is a variable that is correlated with each of two variables (such as ice
cream and snow shovel sales) that are not fundamentally related to each other.  That
is, there is no reason to think that a change in one variable will lead to a change in the
other; in fact, the correlation between the two variables may seem surprising until the
hidden variable is considered.  Although there is no direct relationship between these
two variables, they are mathematically correlated because each is correlated individually
with a third “hidden” variable. Therefore, for a variable to act as a hidden variable, there
must be three variables, all of which are mathematically correlated (either directly or
indirectly).

 In the example above, season is correlated with ice cream sales (people are more likely
to buy ice cream in the summer when the weather is hot). Season is also correlated with
snow shovel sales (people are more likely to buy snow shovels in winter when the
weather is cold and snow begins to fall). However, there is no direct connection between
ice cream sales and snow shovel sales: ice cream sales don’t go up because no one is
buying snow shovels, and people don’t purchase snow shovels because they are not
buying ice cream. Nonetheless, the two variables are correlated because both ice cream
sales and snow shovel sales are correlated with the same variable: season.
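A small Python sketch can illustrate how a hidden variable produces a correlation. The weekly sales figures are invented for this example, and `pearson_r` is an illustrative implementation of the Pearson correlation formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up weekly sales, grouped by the hidden variable (season).
# Within each season the two products vary independently of each other.
summer_ice_cream, summer_shovels = [90, 110, 90, 110], [5, 5, 15, 15]
winter_ice_cream, winter_shovels = [10, 30, 10, 30], [85, 85, 95, 95]

# Ignoring season: pool all weeks together.
r_pooled = pearson_r(summer_ice_cream + winter_ice_cream,
                     summer_shovels + winter_shovels)

# Holding season fixed: look at each season separately.
r_summer = pearson_r(summer_ice_cream, summer_shovels)
r_winter = pearson_r(winter_ice_cream, winter_shovels)

print(f"all weeks:   r = {r_pooled:.2f}")
print(f"summer only: r = {r_summer:.2f}")
print(f"winter only: r = {r_winter:.2f}")
```

Pooled across the year, the two products look strongly negatively correlated. Once season is held fixed, the correlation vanishes in this constructed data, which is what we would expect if season alone drives both series.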

Hidden variables are not the same as “mediating variables,” which are variables that
are affected by one variable and then affect another variable in turn. For example, being
worried about grades:

1. may cause a student to study harder, and thus get better grades, but we wouldn’t
consider studying to be a hidden variable linking worry and getting better grades.
Those two variables ARE fundamentally related, in that the worry is leading to the
better grades. If students are more worried, they may study harder and get even
better grades.
2. may cause a student to stress eat and gain weight, but we wouldn’t consider
eating to be a hidden variable linking worry and weight gain. Those two variables
ARE fundamentally related, in that the worry is leading to the weight gain. If
students are more worried, they may gain even more weight.

In this situation, we’d see a correlation between weight gain and grades, driven by the
hidden variable, worry. Students couldn’t just eat more food and expect their grade to
improve, nor could they make a point of doing poorly in their courses just to lose weight.
These two variables are not fundamentally related.

Question 1 of 5

A hidden variable, such as GDP, may explain variation in oil consumption across various
countries, and provide more clarity than looking solely at the number of barrels of oil
consumed.

Example of a hidden variable


Not an example of a hidden variable
Question 2 of 5

A researcher finds a positive correlation between the number of traffic lights in a town or city and
the number of crimes committed each month in that town.  The hidden variable is population. Cities
with a greater number of people have more traffic and thus need more traffic lights.  These cities
also have more people who can commit crimes (and be victims of crimes), and more crimes are
committed.

Example of a hidden variable


Not an example of a hidden variable
Question 3 of 5

Market researchers at a corporation assess the sales and revenue for the corporation’s hot dog
subsidiary, but do not pay attention to the fact that many people in their market are vegetarians. 
The researchers’ lack of understanding about the dietary habits of the market is a hidden variable.

Example of a hidden variable


Not an example of a hidden variable
Question 4 of 5

A retail store owner offers a small discount on the same-day delivery service she offers for her
store’s products. In the week following the discount offer, sales via the delivery service jumped by
50%. The hidden variable is weather; it rained throughout that week and more people opted for
delivery rather than going to the store.

Example of a hidden variable


Not an example of a hidden variable
Question 5 of 5

A student finds that there is a positive correlation between the volume of music and the prevalence
of acne.  The hidden variable is age; teenagers tend to listen to louder music and have more acne.

Example of a hidden variable


Not an example of a hidden variable
Shared Reflection
Describe a real-world example in which hidden variables may be an issue, taking care to
identify each variable and its relationship to the others.

Be creative! Your choice of variables should not be too similar to the examples provided
above. Ideally, it should be a bit surprising that the two variables are related, until you
consider the hidden variable.

The number of roads built is apparently correlated with the number of children born: the
more roads, the more children born. In reality there is a third variable: the
economy. An improvement in the economy brings about an increase in the number
of roads built and in the number of children born. --Me

According to a medical journal published in 2013, a positive correlation was found
between skin cancer and exercise. At first it is hard to accept that an increase in
exercise also increases the number of skin cancer cases. These two variables do not
seem fundamentally related. However, there is a hidden variable influencing both:
climate. People living in warmer areas tend to lead a more active outdoor lifestyle than
people living in colder areas, which means they tend to be more exposed to sunlight,
raising the risk of skin cancer. --Sahiba

There is a positive correlation between earlier bedtimes and net worth. The hidden
variable is age; in general, your net worth increases as you age and you tend not to stay
up so late as you age. --Prakash

When I go to the beach, there is a strong positive linear correlation between the distance
between my umbrella and the lifeguard stand and the number of shells I collect.
Mathematically, the farther my umbrella is from the lifeguard stand, the more shells I
collect. However, I cannot simply move my umbrella closer to the stand to collect more
shells. The hidden variable is the tide. As the tide comes in farther, the lifeguards move
their stand backwards, where I usually place my umbrella. This is also a tough time to
find good shells. When the tide goes back out, the lifeguards move back, and I can find
better shells more easily. --Jason

1.4.4 Time Series


Spreadsheet
Suppose that once we realize that the U.S. consumes more oil than any other country,
we decide to look at U.S. oil consumption over the past eleven years to gain insight into
the consumption trend. Create a scatter plot to help visualize the relationship.

The Input Y Range is B1:B12 and the Input X Range is A1:A12. You must check
the Labels in first row box since we included A1 and B1 to ensure that the scatter
plot’s axes are appropriately labeled.

The data set we just used is called a time series—a data set in which one of the
variables is time. Most of the data sets we have analyzed up until this point have been
cross-sectional data. Let’s make sure we understand the difference between these two
types of data sets.

Time Series vs. Cross-Sectional Data:

• Time Series: Time series data contain data about a given subject in temporal
order, measured at regular time intervals (e.g. minutes, months, or years). U.S.
oil consumption from 2002 through 2012 is an example of a time series.
Managers collect and analyze time series to identify trends and predict future
outcomes.
• Cross-Sectional: Cross-sectional data contain data that measure an attribute
across multiple different subjects (e.g. people, organizations, countries) at a
given moment in time or during a given time period. The average oil consumption
of ten countries in 2012 is an example of cross-sectional data. Managers use
cross-sectional data to compare metrics across multiple groups.
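The two kinds of data set can be contrasted in a short Python sketch. The figures, units, and country names are placeholders invented for illustration, not actual oil consumption data.

```python
# Time series: one subject measured at regular intervals, kept in time order.
# (year, consumption in arbitrary units) -- hypothetical values.
time_series = [(2010, 100), (2011, 97), (2012, 95)]

# Cross-sectional: many subjects measured in the same period.
# Hypothetical values for a single year.
cross_section = {"Country A": 95, "Country B": 40, "Country C": 12}

# A typical time series question: how did the value change between periods?
changes = [(year_2, value_2 - value_1)
           for (_, value_1), (year_2, value_2)
           in zip(time_series, time_series[1:])]

# A typical cross-sectional question: how do the subjects compare?
ranking = sorted(cross_section, key=cross_section.get, reverse=True)

print("year-over-year change:", changes)
print("ranking by consumption:", ranking)
```

The order of the time series entries matters, since each change is computed between consecutive years. The cross-sectional entries have no inherent order and can be sorted freely for comparison.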

Question 1 of 4

For each of the following scenarios, determine whether it would be better to analyze
cross-sectional or time series data.

We want to know the current average height and weight of citizens in each country that
belongs to the European Union.

Cross-Sectional CORRECT
Since we are interested in the average height and weight of citizens living in different
countries in the European Union at a specific point in time (“currently”), we should
analyze a cross-section of citizens.
Time Series
Question 2 of 4

We want to know if a company’s profits have increased after it started advertising more.

Cross-Sectional
Time Series CORRECT
To determine whether profits have increased during a period of time, we must compare
profits over time. Therefore, we should analyze time series data.

Question 3 of 4

We want to compare the final exam scores of students this semester.

Cross-Sectional CORRECT
Since we are interested in final exam scores for a single point in time (this semester), we
should analyze cross-sectional data of this semester’s results.
Time Series
Question 4 of 4

We want to know if rates of dementia in the U.S. have decreased.

Cross-Sectional
Time Series CORRECT
To determine whether rates of dementia have decreased, we must compare dementia
rates over time. Therefore, we should analyze time series data.
