BA 1.3 - Descriptive Statistics
As a starting point, let’s determine the “central tendency” of a data set—an indication of
where the “center” of the data set lies. We usually start by calculating the mean, the
most common measurement of central tendency. The mean is the number people refer
to when they talk about the “average” of a set of numbers.
The mean, median, and mode for the data set {0.5, 0.5, 1.5, 3.0, 4.0} are shown in the
graph below. The graph uses Excel’s conventions for histograms. Remember that each
bin contains a range of values. For example, bin 2 contains all values greater than 1 and
less than or equal to 2. Thus, the value 1 on the graph is represented by the gray vertical
grid line between bins 1 and 2 (not where the bin label 1 appears). Similarly, the value 2
is represented by the gray vertical grid line between bins 2 and 3 (not where the bin label
2 appears). Thus, for example, the mean, 1.9, is represented by the red dotted line just
to the left of the gray vertical grid line that represents the value 2. The median, 1.5, is
represented by the red dotted line in the center of bin 2, which is between the gray
vertical grid lines at the values 1 and 2.
The mode is the value that occurs most frequently in a data set. If a data set has more
than one value with the highest frequency, that data set has more than one mode. A
distribution is called bimodal if it has two clearly defined peaks (two points with very high
frequency). The two peaks may have equal frequency and hence be true modes, or one
peak may be a mode and the other peak may simply have a very high (but not the
highest) frequency. Distributions with multiple peaks are called multimodal.
Let's look at a couple of data sets to help us understand how the locations of the mean,
median, and mode can vary based on the distribution of the data. In the first distribution,
the mean, median, and mode are the same because the data set is symmetric and has only one
peak or mode. In the second distribution, the data set has an outlier. The mean is pulled
towards that extreme value, but the median and mode are the same as in the first
distribution.
In general, if there is an outlier or if the distribution is skewed, that is, has a tail
that extends out to one side, the extreme values will pull the mean towards them. People
tend to rely heavily on the mean to characterize a data set. But it's important to realize
that the mean is affected by outliers, and therefore, may not be the best value to
represent the distribution.
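The effect of an outlier on the mean can be sketched in a few lines of Python. The two data sets below are hypothetical, chosen only so that the first is symmetric with a single peak and the second swaps one value for an outlier; they are not the data sets shown in the video.

```python
from statistics import mean, median, mode

# Hypothetical symmetric data set with a single peak at 3.
symmetric = [1, 2, 3, 3, 3, 4, 5]

# Same data, but the largest value is replaced by an outlier (50).
with_outlier = [1, 2, 3, 3, 3, 4, 50]

for label, data in (("symmetric", symmetric), ("with outlier", with_outlier)):
    # The mean moves towards the outlier; the median and mode do not.
    print(label, "mean =", round(mean(data), 2),
          "median =", median(data), "mode =", mode(data))
```

In this sketch the median and mode stay at 3 in both cases, while the mean is pulled well above 3 by the single extreme value.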
Question 1 of 2
Mean: Notice that this is a skewed distribution, which pulls the mean towards the tail.
The mean of this data set is 5.31 million barrels per day.
Median: Remember that by definition half of the data points are below the median and half
are above. The median of this data set is 3.05 million barrels per day.
Mode: Remember that the mode is the most common value in the data set. Look for the bin
with the highest frequency. The mode of this data set is 2.30 million barrels per day.
Question 2 of 2
In the graphs shown below, several values of the data set shown in the previous
question have been shifted to the right. Which of the following do you think accurately
displays the mean, median, and mode of this data set?
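To calculate the values for the mean, median, and mode in Excel, we can use the
following formulas:
=AVERAGE(number 1, [number 2], …)
=MEDIAN(number 1, [number 2], …)
=MODE.SNGL(number 1, [number 2], …)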
number 1 is the first number, cell reference, or range of cells for which to
calculate the specified value.
[number 2],… represents additional numbers, cell references, or ranges of cells.
The square brackets indicate that the argument is optional.
For example, if we had the data set {0.5, 0.5, 1.5, 3.0, 4.0}, we could find the mean by
entering =AVERAGE(0.5, 0.5, 1.5, 3.0, 4.0) into a cell. This would calculate the average,
1.9. More often when using Excel, we won’t input data directly into the formula, but will
instead input a range of cells. For example if cell A1=0.5, A2=0.5, A3=1.5, A4=3.0, and
A5=4.0, we could calculate the mean of the data set by entering =AVERAGE(A1:A5),
which would return 1.9. Because the mean is the sum of all data points, divided by the
number of data points, we could also enter =SUM(A1:A5)/COUNT(A1:A5). The COUNT
function counts the number of cells that contain numerical values so in this case
=SUM(A1:A5)/COUNT(A1:A5) is equivalent to =SUM(A1:A5)/5.
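For readers who want to check the arithmetic outside Excel, here is a minimal Python sketch of the same calculation. It uses the data set from the text; the variable names are illustrative only.

```python
from statistics import mean, median, mode

data = [0.5, 0.5, 1.5, 3.0, 4.0]  # the data set from the text

# Analogue of =SUM(A1:A5)/COUNT(A1:A5): sum of the points over their count.
mean_by_hand = sum(data) / len(data)

# Library equivalents of the mean, the median, and a single mode.
print(mean_by_hand, mean(data), median(data), mode(data))
```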
The Excel function MODE.MULT finds all of the modes in a data set, generating a
vertical array that has one row for each mode in the data set. Suppose a data set has
three modes. To find them, instead of entering the MODE.MULT formula in a single
cell, highlight at least three vertically-contiguous cells and then
input =MODE.MULT(number 1, [number 2], …) into the formula bar. Then, instead of
using ENTER to find the result, use CTRL+SHIFT+ENTER to enter the array. The
modes will appear in the first rows of the array (filling as many rows as there are
modes in the data set) and #N/A will appear in each of the other rows in the array.
How do we know how many vertically-contiguous cells to highlight when using this
function? To determine how many modes are in a data set, enter
=COUNT(MODE.MULT(number 1, [number 2], …)) in a single cell. When you click
ENTER, the function will return the number of modes in the data set. The result will tell
you how many vertical cells you should highlight to create a MODE.MULT array.
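The same idea can be sketched in Python, where `statistics.multimode` returns every value tied for the highest frequency. The data set here is hypothetical and was chosen to have three modes.

```python
from statistics import multimode

# Hypothetical data set with three modes: 2, 5, and 9 each occur twice.
data = [2, 2, 5, 5, 9, 9, 1, 7]

modes = multimode(data)   # all values tied for the highest frequency
print(modes)              # roughly what the MODE.MULT array would hold
print(len(modes))         # roughly what =COUNT(MODE.MULT(...)) would return
```

Unlike the Excel array, a Python list grows to fit the result, so there is no need to know the number of modes in advance.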
Note that our embedded spreadsheet currently does not support the functionality
required by MODE.MULT.
Throughout the course we will provide alternative functions that existed prior to Excel
2010 that can still be used in Excel 2010.
Let’s calculate the mean oil consumption for all North American countries in our data set.
Note that the appropriate continent is now listed next to each data point to make it easy
to apply a condition to our data set.
Step 1
Enter the function =AVERAGEIF(C2:C11,E2,A2:A11).
This function says, if the country’s continent is North America, then include that
country’s oil consumption in the calculation of the average.
C2:C11 contains the data to which we want to apply the condition.
E2 is the condition or criterion by which we choose the data points to include in
the calculation of the conditional mean. In the function we have specified, we
have just entered a link to cell E2, which contains “North America”. Alternatively,
we could type “North America” into the function as follows:
=AVERAGEIF(C2:C11,"North America",A2:A11).
A2:A11 contains the numeric data from which the conditional mean will be
calculated. Only the values that meet the criterion—in this case, data
corresponding to countries in North America—will be used in the calculation.
Thus, the mean oil consumption will be calculated for Canada and the United
States.
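A conditional mean of this kind can be sketched in Python as a filter followed by an average. The consumption values and the helper name `average_if` below are hypothetical and stand in for the course spreadsheet, which is not reproduced here.

```python
from statistics import mean

# Hypothetical rows: (oil consumption, continent). The numbers are made up
# for illustration and are not the course's data.
rows = [
    (2.3, "North America"),
    (18.5, "North America"),
    (10.3, "Asia"),
    (4.7, "Asia"),
    (2.4, "Europe"),
]

def average_if(rows, criterion):
    """Mean of the values whose continent matches the criterion."""
    selected = [value for value, continent in rows if continent == criterion]
    return mean(selected)

print(average_if(rows, "North America"))
```

Only the rows that meet the criterion contribute to the result, which mirrors the role of the criterion range and the average range in the Excel function.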
Question 2 of 2
Now calculate the mean oil consumption for countries located in Asia.
1.3.3 Percentiles
In addition to calculating measures of central tendency, sometimes we may want to
know the value beneath which a certain percentage of the data lie. For example, we may
want to find the 25th percentile, also known as the first quartile. The 25th percentile is the
smallest value that is greater than or equal to 25% of the data points.
Percentiles are often used to categorize test scores. For example, someone who scored
in the 95th percentile of a test scored equal to or higher than 95% of all people who took
that test. We can also say that person scored in the top 5%.
Question 1 of 3
0%
100%
50%
The answer cannot be determined without further information CORRECT
Remember that the mean’s location depends upon the distribution of the data set. Recall
how the location of the mean differs for a symmetrical distribution and a skewed
distribution. Therefore, there is no way to determine the percentile of the mean without
more information about the data set.
Question 2 of 3
0%
100%
50% CORRECT
The answer cannot be determined without further information
Remember that half of a distribution’s data points are less than or equal to the median.
Therefore, the median is equal to the 50th percentile, because 50% of the data points
are equal to or below this value.
Question 3 of 3
0%
100%
50%
The answer cannot be determined without further information CORRECT
Remember that the mode’s location depends upon the distribution of the data set.
Therefore, there is no way to determine the percentile of the mode without more
information about the data set.
Let’s return to our oil consumption data. Use the slider to change the percentile and see
the amount of oil consumption associated with it.
There are several methods for calculating percentiles, and different methods may lead
to different values. We will use the same method as Excel, which means that the
percentile found may fall between two data points and not actually be a value in our
data set. We have seen this when we calculate the median of an even number of data
points—the median falls between the two middle values.
=PERCENTILE.INC(array, k)
array is the range of data for which we want to calculate a given percentile.
k is the percentile value. For example, if we want to know the 95th percentile, k
would be 0.95.
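To make the interpolation concrete, here is a minimal Python sketch of a percentile that interpolates linearly between neighboring data points. To my understanding this is the method PERCENTILE.INC uses, but treat that as an assumption; the function name `percentile_inc` is only a label for this sketch.

```python
import math

def percentile_inc(values, k):
    """Percentile with linear interpolation between closest ranks, 0 <= k <= 1."""
    if not values or not 0 <= k <= 1:
        raise ValueError("need a non-empty data set and 0 <= k <= 1")
    ordered = sorted(values)
    rank = k * (len(ordered) - 1)          # fractional position in sorted data
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    # Move the fractional distance from the lower neighbor to the upper one.
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = [0.5, 0.5, 1.5, 3.0, 4.0]
print(percentile_inc(data, 0.25), percentile_inc(data, 0.50), percentile_inc(data, 0.95))
```

With an even number of points, such as `[1, 2, 3, 4]`, the 50th percentile computed this way falls between the two middle values, just as the median does.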
Question 1 of 2
1.3.4 Variability
1.3.4_01_Variability.wmv
Question 1 of 4
0
1
10 CORRECT
11
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.
Question 2 of 4
0
1
10 CORRECT
11
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.
Question 3 of 4
0
1
5
10 CORRECT
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.
Question 4 of 4
0
1
5
10 CORRECT
The range is the difference between the maximum value and the minimum value. We
can see from the histogram that the maximum value in this data set is 10 and the
minimum value is 0, so the range equals 10–0=10.
1.3.4_02_Variability.wmv
As we can see, datasets that have the same minimum and maximum values will have the same
range, even if the shapes of their distributions are completely different. To gain more
insight into the spread of the distribution and how the data behave between the two
extremes, we can calculate the variance. Rather than simply measuring the distance between
the two extremes, the variance measures how far each point is from the mean.
Let's look at an example. First, we find the mean of the dataset, in this case 3. Then we
calculate the distance from each point to the mean and square that difference. Then for
each occurrence of a value in the dataset, we sum the squared terms. We square the
differences, because if we simply added them, the positive and negative values would
cancel each other out. Squaring the differences gives us only non-negative numbers.
Moreover, squaring the distances gives more weight to points that are further from the
mean.
To calculate the variance of a population, we divide the sum of the squared differences by
n, the number of data points, essentially giving us an average of the squared differences.
However, typically we're working with a sample of data points. So we divide by n minus 1
to obtain a sample variance of 12.8. The technical reason we divide by n minus 1 is beyond
the scope of this course. Essentially, we do this to obtain an unbiased estimate of the
population variance.
One shortcoming of the variance is that it has inconvenient units. Since we square the
differences, we end up with a value that is also squared. For example, how would we
interpret a square dollar? To convert the variance to have the same units as the data
points, we take the square root of the variance, which gives us the standard deviation.
In this case, the sample standard deviation is about 3.6. A small standard deviation
indicates that the data points are close to the mean. A large standard deviation indicates
a broader spread.
Luckily, we can use Excel to calculate the variance and standard deviation. We'll never
have to do so by hand again.
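The steps described in the video can be traced in a short Python sketch. The video's own data set is not reproduced in these notes, so the sketch reuses the earlier example {0.5, 0.5, 1.5, 3.0, 4.0}; the numbers it prints therefore differ from the 12.8 and 3.6 quoted above.

```python
import math
from statistics import mean, variance, pvariance

data = [0.5, 0.5, 1.5, 3.0, 4.0]  # earlier example data set, not the video's

m = mean(data)
# Squared distance from each point to the mean.
squared_diffs = [(x - m) ** 2 for x in data]

population_var = sum(squared_diffs) / len(data)        # divide by n
sample_var = sum(squared_diffs) / (len(data) - 1)      # divide by n - 1
sample_sd = math.sqrt(sample_var)                      # back to original units

print(population_var, sample_var, sample_sd)
```

The hand-computed values can be compared against `statistics.pvariance` and `statistics.variance`, which implement the population and sample versions respectively.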
Question 1 of 3
Which of the following data sets do you think has the smallest standard deviation?
Question 2 of 3
9
The variance and the standard deviation are not equal. Recall how the two measures
are related.
3 CORRECT
The standard deviation is equal to the square root of the variance. If the variance is 9,
then the standard deviation must be 3.
81
The standard deviation is not equal to the variance squared. Recall how the two
measures are related.
There is not enough information to determine the standard deviation
There is sufficient information to answer this question. Recall how the standard deviation
and the variance are related.
Question 3 of 3
If two data sets, A and B, have equal standard deviations, which of the following
statements is true? Select all that apply.
To calculate the variance or standard deviation of a sample in Excel, we can use the
following functions:
=VAR.S(number 1, [number 2], …)
=STDEV.S(number 1, [number 2], …)
number 1 is the first number, cell reference, or range of cells for which to
calculate the specified value.
[number 2],… represents additional numbers, cell references, or ranges of cells.
The square brackets indicate that the argument is optional.
Note that the “S” in VAR.S and STDEV.S indicates that we are working with a sample.
We will learn more about the differences between samples and populations in the next
module.
We can also find the standard deviation using the Excel function =SQRT(number) to
take the square root of the variance. For example, =SQRT(16)=4.
Let’s calculate the variance and standard deviation of the oil consumption data.
Step 2
Note
Notice that the standard deviation is equal to the square root of the variance.
1.3.5 Descriptive Statistics in Excel
Spreadsheet: Creating the Output
Excel has a descriptive statistics tool that provides a number of summary statistics,
including those we’ve already learned, for a set of data. Let’s walk through how to create
it.
Step 1
Step 2
Step 3
This cell is the top left hand cell in which the output table will appear.
Be sure to always use the specified Output Range, so that the calculated
values are placed in the blue cells for grading.
Step 4
In order to create the descriptive statistics output table in Excel, you must first load
the Analysis ToolPak, which is an Excel add-in program.
Spreadsheet
Use the descriptive statistics tool to calculate the summary statistics for the heights of
these ten Red Sox players. Make sure to enter D1 as the output range so that your
calculations are graded correctly.
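As a rough cross-check outside Excel, a comparable summary can be assembled in Python. The heights below are hypothetical, not the actual player data, and the labels are my own choice; they may not match the rows of the tool's output table.

```python
from statistics import mean, median, multimode, stdev, variance

# Hypothetical heights in inches; not the actual Red Sox data.
heights = [70, 72, 73, 73, 74, 75, 75, 75, 76, 78]

summary = {
    "Mean": mean(heights),
    "Median": median(heights),
    "Mode": multimode(heights),
    "Standard Deviation": stdev(heights),   # sample standard deviation
    "Sample Variance": variance(heights),
    "Range": max(heights) - min(heights),
    "Minimum": min(heights),
    "Maximum": max(heights),
    "Sum": sum(heights),
    "Count": len(heights),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```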
1.3.6 Coefficient of Variation
The standard deviation describes how much the values in a single data set vary, but
what if we want to compare the amount of variation in two different data sets? Suppose
we want to compare the share price volatility of two stocks, Stock A and Stock B. For
both stocks, the standard deviation in share price is $4.50. Compared to its average
price, this standard deviation is relatively high for Stock A but relatively low for Stock B.
The standard deviations are the same, but Stock A is much more volatile relative to its
mean than Stock B.
To compare variation in two data sets, we calculate a value called the coefficient of
variation (CV). The coefficient of variation is the ratio of the standard deviation to the
mean. The equation for the coefficient of variation is:
CV = standard deviation / mean
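A small Python sketch of the stock comparison follows. The standard deviation of $4.50 comes from the text; the two average prices are hypothetical, since the text does not state them.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Ratio of the standard deviation to the mean."""
    return std_dev / mean_value

std_dev = 4.50            # from the text: the same for both stocks
mean_price_a = 9.00       # hypothetical average price for Stock A
mean_price_b = 90.00      # hypothetical average price for Stock B

cv_a = coefficient_of_variation(std_dev, mean_price_a)
cv_b = coefficient_of_variation(std_dev, mean_price_b)
print(f"Stock A: {cv_a:.0%}, Stock B: {cv_b:.0%}")
```

Under these assumed prices the same dollar spread is half of Stock A's mean but only a twentieth of Stock B's, which is the sense in which Stock A is more volatile relative to its mean.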
1.4.1 Scatter Plots
We use histograms to help us visualize a single variable’s distribution. To visualize the
relationship between two variables, we typically use a scatter plot. One variable is
plotted on the horizontal axis (x-axis), and the other is plotted on the vertical axis
(y-axis). Please read the instructions carefully to make sure you assign variables to the
correct axes for grading purposes. Later in the course when we learn about regression
analysis, we’ll see that the convention is to plot the so-called “dependent” variable on the
vertical axis and the “independent” variable on the horizontal axis.
Step 1
Step 2
The Input Y Range is the weight data in column C with its label, C1:C11.
The Input X Range is the height data in column B with its label, B1:B11.
Make sure to include the cells containing labels when inputting your
ranges and check the Labels in first row box, as this ensures that your
scatter plot will be appropriately labeled.
Note
The process for creating scatter plots in this course is different from the
process in Excel.
The scatter plot of players’ weight vs. height is shown below. In general, as height
increases, what happens to weight?
Weight increases CORRECT
The scatter plot indicates that as height increases, weight also increases.
Weight decreases
Weight remains the same
Look at the scatter plot. What trend do you see? As you move along the x-axis from left
to right (smaller values to larger values), what typically happens to the corresponding
y-values? Do they increase or decrease?
1.4.2 Correlation
As we have seen, scatter plots are useful for visualizing the relationship between two
variables. It is also helpful to have a measure that quantifies how strong the relationship
is. We will use the correlation coefficient to measure the strength of the linear
relationship between two variables. The correlation coefficient measures the extent to
which the data points on the scatter plot create a line, on a scale from -1 to +1. Move the
slider to see how a scatter plot changes as the correlation changes.
Even when the correlation coefficient is 0, a relationship between two variables might
exist—just not a linear one. The relationship may appear more like a curve, for example,
as shown below. Never rely solely on the correlation coefficient. Always consult a visual
representation of the data such as a scatter plot to see patterns and gain other insights
into the situation the data describe.
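A short Python sketch can show a relationship that is perfectly predictable yet has a correlation coefficient of 0. The helper `pearson` is written out here for illustration, and the data are a made-up curve, y = x squared, over values of x that are symmetric around zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]     # y is fully determined by x, but not linearly

print(pearson(xs, ys))
```

The positive and negative halves of the curve cancel in the covariance, so the coefficient comes out at zero even though a scatter plot would reveal the pattern immediately.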
What do you estimate to be the correlation coefficient between the height and weight of
the ten Red Sox players?
0
There is clearly a linear relationship between height and weight so the correlation is not
0.
0.8 CORRECT
The relationship between height and weight is positive and is quite strong so 0.8 is a
good estimate.
-0.8
The relationship between height and weight is positive, so the correlation coefficient
must be positive. A negative correlation coefficient would indicate that as height
increases, weight typically decreases.
-0.5
The relationship between height and weight is positive, so the correlation coefficient
must be positive. A negative correlation coefficient would indicate that as height
increases, weight typically decreases.
Shared Reflection
What does the value of the correlation coefficient tell you about the strength and nature
of the relationship between two variables? Be sure to consider the full range of
relationships.
The correlation coefficient tells us how strongly the two variables are related. If the
correlation is 0, there is almost no correlation between the two variables. If it is -1, there
is a strong negative correlation and if it is +1 there is a strong positive correlation.
Example:
Positive correlation: the larger your feet, the larger the shoe size you need
Negative correlation: the more bread you bake, the more the amount of flour in your
storage decreases
If the correlation coefficient is 0 there is no or almost no correlation between the two
variables. For example there is no or little correlation between the amount of drinks sold
at an event and the level of education of a person.
On a scattergram you can usually see if there is no, a positive, or a negative correlation
of the two variables. --Dee+7/+18
A correlation coefficient is able to give the user a sense of whether the two variables will
move in tandem or in opposite directions. This is important for future forecasting, i.e., if there is
evidence that the two variables move in tandem, then future forecasting is more
accurate and reliable. However, if it's scattered, then there is no visible pattern to follow
to forecast. --Theng Fei+3/+12
The value of the correlation coefficient allows you to see the strength of the relationship
between 2 values. The correlation coefficient goes from -1 (indicating a perfect negative
correlation between x-axis and y-axis) to +1 (indicating a perfect positive correlation
between the variables). The closer to -1, the stronger the negative linear correlation
whereas the closer to +1, the stronger the positive correlation. On the other hand, the
closer to 0 means that there is less of a linear correlation.
It is important to notice that a correlation coefficient of 0 does NOT mean that there isn't
a correlation, but instead, there is just not a linear correlation (the relationship may be a
curve for example). Therefore, a graphical representation of the data such as a scatter
plot might be needed. --Carla+2/+12
1.4.2_01_Correlation.wmv
=CORREL(array 1, array 2)
array 1 is a set of numerical variables or cell references containing data for one
variable of interest.
array 2 is a set of numerical variables or cell references containing data for the
other variable of interest.
Note that the number of observations in array 1 must be equal to the number in
array 2.
Step 1
In cell F2, enter the function =CORREL(B2:B11,C2:C11).
Note
Note that the order in which the two data sets are selected does not matter (that
is, it doesn’t matter which variable we choose as the x variable and which we
choose as the y variable), as long as the association between each data “pair” is
maintained. With height and weight, both values certainly need to refer to the
same person!
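The point about ordering can be checked with a small Python sketch. The heights and weights are hypothetical stand-ins for the ten players, and `pearson` is an illustrative helper rather than Excel's implementation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; both sequences must have the same length."""
    if len(xs) != len(ys):
        raise ValueError("both arrays need the same number of observations")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired data: element i of each list refers to the same person.
heights = [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
weights = [180, 185, 190, 200, 195, 210, 215, 220, 225, 240]

r_hw = pearson(heights, weights)   # like =CORREL(B2:B11,C2:C11)
r_wh = pearson(weights, heights)   # arguments swapped
print(r_hw, r_wh)
```

Swapping the two arguments leaves the result unchanged, but shuffling one list independently of the other would break the pairing and change the coefficient.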
The correlation between height and weight for the ten Red Sox players is quite high.
However, a high correlation does not imply that one variable causes the other. Taller
people may tend to weigh more, but gaining weight won’t make you taller! Correlation
indicates a linear relationship, but it does not indicate causality.
1.4.3 Hidden Variables
1.4.3_Hidden_Variables.wmv
A hidden variable is a variable that is correlated with each of two variables (such as ice
cream and snow shovel sales) that are not fundamentally related to each other. That
is, there is no reason to think that a change in one variable will lead to a change in the
other; in fact, the correlation between the two variables may seem surprising until the
hidden variable is considered. Although there is no direct relationship between these
two variables, they are mathematically correlated because each is correlated individually
with a third “hidden” variable. Therefore, for a variable to act as a hidden variable, there
must be three variables, all of which are mathematically correlated (either directly or
indirectly).
In the example above, season is correlated with ice cream sales (people are more likely
to buy ice cream in the summer when the weather is hot). Season is also correlated with
snow shovel sales (people are more likely to buy snow shovels in winter when the
weather is cold and snow begins to fall). However, there is no direct connection between
ice cream sales and snow shovel sales: ice cream sales don’t go up because no one is
buying snow shovels, and people don’t purchase snow shovels because they are not
buying ice cream. Nonetheless, the two variables are correlated because both ice cream
sales and snow shovel sales are correlated with the same variable: season.
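The mechanism can be imitated with made-up numbers. In the Python sketch below, season is represented by a hypothetical monthly temperature, and both sales series are generated from that temperature plus small fixed offsets; none of these figures come from the course.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly temperatures standing in for "season".
temps = [-2, 0, 5, 11, 17, 22, 25, 24, 20, 13, 7, 1]
noise_ice = [3, -2, 1, 0, -1, 2, -3, 1, 0, -2, 2, -1]
noise_shovel = [-4, 2, 0, 3, -1, -2, 1, 0, 2, -3, 1, 1]

# Each series depends only on temperature, never on the other series.
ice_cream = [50 + 6 * t + e for t, e in zip(temps, noise_ice)]
shovels = [200 - 5 * t + e for t, e in zip(temps, noise_shovel)]

print(pearson(temps, ice_cream), pearson(temps, shovels), pearson(ice_cream, shovels))
```

Neither sales series appears in the formula for the other, yet the two come out strongly negatively correlated because both track the same underlying variable.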
Hidden variables are not the same as “mediating variables,” which are variables that
are affected by one variable and then affect another variable in turn. For example, being
worried about grades
1. may cause a student to study harder, and thus get better grades, but we wouldn’t
consider studying to be a hidden variable linking worry and getting better grades.
Those two variables ARE fundamentally related, in that the worry is leading to the
better grades. If students are more worried, they may study harder and get even
better grades.
2. may cause a student to stress eat and gain weight, but we wouldn’t consider
eating to be a hidden variable linking worry and weight gain. Those two variables
ARE fundamentally related, in that the worry is leading to the weight gain. If
students are more worried, they may gain even more weight.
In this situation, we’d see a correlation between weight gain and grades, driven by the
hidden variable, worry. Students couldn’t just eat more food and expect their grade to
improve, nor could they make a point of doing poorly in their courses just to lose weight.
These two variables are not fundamentally related.
Question 1 of 5
A hidden variable, such as GDP, may explain variation in oil consumption across various
countries, and provide more clarity than looking solely at the number of barrels of oil
consumed.
A researcher finds a positive correlation between the number of traffic lights in a town or city and
the number of crimes committed each month in that town. The hidden variable is population. Cities
with a greater number of people have more traffic and thus need more traffic lights. These cities
also have more people who can commit crimes (and be victims of crimes), and more crimes are
committed.
Market researchers at a corporation assess the sales and revenue for the corporation’s hot dog
subsidiary, but do not pay attention to the fact that many people in their market are vegetarians.
The researchers’ lack of understanding about the dietary habits of the market is a hidden variable.
A retail store owner offers a small discount on the same-day delivery service she offers for her
store’s products. In the week following the discount offer, sales via the delivery service jumped by
50%. The hidden variable is weather; it rained throughout that week and more people opted for
delivery rather than going to the store.
A student finds that there is a positive correlation between the volume of music and the prevalence
of acne. The hidden variable is age; teenagers tend to listen to louder music and have more acne.
Be creative! Your choice of variables should not be too similar to the examples provided
above. Ideally, it should be a bit surprising that the two variables are related – until you
consider the hidden variable.
The number of roads built is apparently correlated to the number of children born. The
more roads, the more the number of children born. In reality there is a third variable:
economy. Increase or improvement in economy brings about an increase in the number
of roads built and the number of children born. –Me+0/+2
As per the medical journal issued in 2013, there was a positive correlation found
between skin cancer and exercise. Initially it is confusing to agree that an increase in
exercise also increases the number of skin cancer cases. Both these variables do not
seem fundamentally related. However, there lies a hidden variable influencing these two
variables which is climatic condition. It is a known fact that people living in warmer areas
tend to lead a more active outdoor lifestyle than people living in colder areas which
means people tend to be more exposed to sunlight in warmer areas leading to risk of
skin cancer. –Sahiba+8/+21
There is a positive correlation between earlier bed times and net worth. The hidden
variable is age; in general, your net worth increases as you age and you tend not to stay
up so late as you age. –Prakash+6/+17
When I go to the beach, there is a strong positive linear correlation between the distance
between my umbrella and the lifeguard stand and the number of shells I collect.
Mathematically, the farther my umbrella is from the lifeguard stand, the more shells I
collect. However, I cannot simply move my umbrella closer to the stand to collect more
shells. The hidden variable is the tide. As the tide comes in farther, the lifeguards move
their stand backwards, where I usually place my umbrella. This is also a tough time to
find good shells. When the tide goes back out, the lifeguards move back, and I can find
better shells more easily. –Jason+5/+17
The Input Y Range is B1:B12 and the Input X Range is A1:A12. You must check
the Labels in first row box since we included A1 and B1 to ensure that the scatter
plot’s axes are appropriately labeled.
The data set we just used is called a time series—a data set in which one of the
variables is time. Most of the data sets we have analyzed up until this point have been
cross-sectional data. Let’s make sure we understand the difference between these two
types of data sets.
Time Series: Time series data contain data about a given subject in temporal
order, measured at regular time intervals (e.g. minutes, months, or years). U.S.
oil consumption from 2002 through 2012 is an example of a time series.
Managers collect and analyze time series to identify trends and predict future
outcomes.
Cross-Sectional: Cross-sectional data contain data that measure an attribute
across multiple different subjects (e.g. people, organizations, countries) at a
given moment in time or during a given time period. The average oil consumption
of ten countries in 2012 is an example of cross-sectional data. Managers use
cross-sectional data to compare metrics across multiple groups.
Question 1 of 4
For each of the following scenarios, determine whether it would be better to analyze
cross-sectional or time series data.
We want to know the current average height and weight of citizens in each country that
belongs to the European Union.
Cross-Sectional
Since we are interested in the average height and weight of citizens living in different
countries in the European Union at a specific point in time (“currently”), we should
analyze a cross-section of citizens.
Time Series
See correct answer for explanation.
Question 2 of 4
We want to know if a company’s profits have increased after it started advertising more.
Cross-Sectional
See correct answer for explanation.
Time Series
To determine whether profits have increased during a period of time, we must compare
profits over time. Therefore, we should analyze time series data.
Question 3 of 4
Cross-Sectional
Since we are interested in final exam scores for a single point in time (this semester), we
should analyze cross-sectional data of this year’s results.
Time series
See correct answer for explanation.
Question 4 of 4
Cross-Sectional
See correct answer for explanation.
Time series
To determine whether rates of dementia have decreased, we must compare dementia
rates over time. Therefore, we should analyze time series data.