BA 1.2 - Visualizing Data
1.2.1 Recognizing Patterns
Some people are naturally more comfortable working with numbers than others. For
most of us, visual representations of data make it easier to identify trends and patterns
that can help us make better decisions. This is especially true for large data sets; the
more data we have, the harder it typically is to distinguish trends and patterns based
solely on reviewing the data set.
Data visualizations are all around us. The nutritional information on a cereal box offers a
standardized format that helps us understand how a single bowl of cereal fits into the
context of our overall daily food intake. The dashboard on your car is a visual
representation of complex data about a car’s performance—such as its speed, fuel and
oil levels, and whether or not its headlights are on. The dashboard has been carefully
designed to help us understand how the car is operating with only a quick glance.
Managers often look at data in the form of graphs or charts. Consider this set of data
about ten Boston Red Sox players on the roster for the 2013 season—a year in which
the Red Sox won the World Series.
Reflection
What do you notice about the heights of these ten Red Sox players? Among other
topics, you may wish to consider the following in your response:
The heights of these players vary considerably. The shortest player is 68 inches tall and
the two tallest players are 76 inches tall, a difference of 8 inches.
In addition to graphing data, we can group data into categories that make it easier to
perform analyses within a category or across multiple categories. The groupings we
choose are often influenced by the question we are asking or the problem we want to
solve. An income statement, also known as a profit and loss (P&L) statement, is a good
example of a way to arrange financial data to make it easier to understand. Accountants
separate data into categories such as income and expenses so companies can analyze
their performance.
Statements with financial data can look intimidating at first, but you’ll become
comfortable using them more quickly than you might think. Over time, most managers
become as adept at understanding a balance sheet or P&L as a driver is at monitoring a
car’s performance by taking a quick glance at the dashboard.
1.2.2 Histograms
One of the most useful and commonly used graphical representations of data is a
histogram. A histogram displays the frequency, or number, of data points (often called
observations) that fall within specified bins.
Histograms allow us to quickly discern trends or patterns in a data set and are easy to
construct using programs such as Excel. The graph of the Red Sox players’ heights
shown below is an example of a histogram.
Let’s review a few key concepts before we learn how to create a histogram in Excel.
On the horizontal axis, we display a series of single values, each of which represents a
bin, or range of possible values. On the vertical axis, we display the frequency of the
observations in each bin. With a small data set, we can count and assign data points to
bins as we just did, but with large data sets, this approach would be extremely tedious.
This is where programs like Excel are helpful!
Let's consider another example, the amount of oil consumed by the ten top oil-
consuming countries in 2012, and create a histogram of that data set in Excel. Before we
begin, let’s decide which bins to use. Notice that the consumption data (shown below)
have been sorted from largest to smallest. Because the data values range from
approximately 2 to 19 million barrels per day, we’ll use bins 1, 2, …, 19.
By convention, Excel includes in the range the number represented by the bin label. So
bin 1 includes all countries with oil consumption less than or equal to 1 million barrels
per day (x≤1); bin 2 includes all countries with oil consumption greater than 1 but less
than or equal to 2 million barrels per day (1<x≤2); and bin 19 includes all countries with
oil consumption greater than 18 but less than or equal to 19 million barrels per day
(18<x≤19).
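The binning convention described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Excel's own implementation; the function name `bin_label` is hypothetical, and the sample values are consumption figures quoted elsewhere in this section.

```python
from bisect import bisect_left

def bin_label(value, bins):
    """Return the label of the bin that holds `value`.

    Each bin includes its own label: bin b holds values v with
    (previous label) < v <= b. Values above the last label go to an
    overflow bin, which Excel's histogram tool labels "More".
    """
    i = bisect_left(bins, value)  # index of the first label >= value
    return bins[i] if i < len(bins) else "More"

bins = list(range(1, 20))  # bin labels 1, 2, ..., 19

print(bin_label(3.6, bins))   # 4  (India: 3 < 3.6 <= 4)
print(bin_label(3.0, bins))   # 3  (a value equal to a label stays in that bin)
print(bin_label(18.6, bins))  # 19 (United States: 18 < 18.6 <= 19)
```

The boundary case is the one worth noticing: a value of exactly 3 belongs to bin 3, not bin 4.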
Question 1 of 2
Step 1
Label column D. Since this label will appear on the horizontal axis of the histogram,
copy the label from cell A1 into cell D1.
Step 2
To enter the bins in column D, input 1 into cell D2, then input 2 into cell D3, and
continue down the column until you have entered all 19 bin labels.
Note
You can manually input each value or you can use auto-fill, which allows you to
quickly populate values without entering them one-by-one.
To use auto-fill, enter the first two values in cells D2 and D3. Highlight those two
cells and place your cursor at the bottom right-hand corner of cell D3. The cursor
will turn into a black cross. Drag the cross down the column until you reach
cell D20. When you release the mouse, the values will auto-fill.
Correct!
Step 3
Step 4
The Input Range is the original oil consumption data in column A with its
label, A1:A11.
The Bin Range is the set of values from 1-19 that you created in column D with
its label, D1:D20.
Make sure to include the cells containing labels when inputting your ranges and
check the Labels in first row box, as this ensures that your histogram will be
appropriately labeled.
Correct!
The Input Range is A1:A11 and the Bin Range is D1:D20. You must check
the Labels in first row box since we included A1 and D1 to ensure that the
histogram’s axes are appropriately labeled. Note that Excel automatically adds a bin
called “More” to histograms. In this case “More” includes all countries with oil
consumption greater than 19 million barrels per day.
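As a rough cross-check on the Excel output, the same frequency table can be built by hand in Python. This is a sketch under the assumption that the ten consumption values are the ones quoted in the quiz feedback in this section; the helper name `histogram_table` is hypothetical.

```python
# Oil consumption in millions of barrels per day, as quoted in this section.
consumption = {
    "United States": 18.6, "China": 10.3, "Japan": 4.7, "India": 3.6,
    "Russia": 3.2, "Saudi Arabia": 2.9, "Brazil": 2.8, "Germany": 2.4,
    "Canada": 2.3, "South Korea": 2.3,
}
bins = list(range(1, 20))  # the Bin Range values 1..19

def histogram_table(values, bins):
    """Count values per bin (lower < v <= label), plus a final "More" row."""
    values = list(values)
    table = []
    lower = float("-inf")
    for label in bins:
        table.append((label, sum(1 for v in values if lower < v <= label)))
        lower = label
    # Everything above the last bin label lands in the overflow row.
    table.append(("More", sum(1 for v in values if v > bins[-1])))
    return table

table = histogram_table(consumption.values(), bins)
for label, count in table:
    if count:  # print only the non-empty bins
        print(label, count)
# 3 5
# 4 2
# 5 1
# 11 1
# 19 1
```

The "More" row has a count of zero here because no country in the data set consumes more than 19 million barrels per day.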
In order to create histograms in Excel, you may need to download the Analysis
ToolPak, which is an add-in program. Please consult the Microsoft website for more
information on how to install the Analysis ToolPak for your own version of Excel.
Note that you do NOT need to obtain the Analysis ToolPak to complete the
online spreadsheets that are part of the HBS Online Business Analytics
course. The add-in program is only necessary if you wish to use these analytical
tools in your own version of Excel.
The histogram you just created is skewed, meaning that it has a tail that extends out to
one side. The tail is the part of a graph that appears long or “flattens”, and has bins with
lower frequencies. Skewness measures the degree of a graph’s asymmetry. If the right
tail is longer, we say the distribution is skewed to the right or “right-tailed.” Likewise, if
the left tail is longer, we say the distribution is skewed to the left or “left-tailed.” The oil
consumption data set is skewed to the right.
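The direction of skew can also be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the variance raised to the power 1.5). Choosing this particular measure is an assumption, since the text does not name one, and Excel's SKEW function applies a sample-size adjustment, so its value will differ somewhat. A positive result indicates a longer right tail.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Oil consumption in millions of barrels per day, as quoted in this section.
oil = [18.6, 10.3, 4.7, 3.6, 3.2, 2.9, 2.8, 2.4, 2.3, 2.3]

g1 = skewness(oil)
print(f"skewness = {g1:.2f}")
print("right-skewed" if g1 > 0 else "left-skewed or symmetric")
```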
Question 1 of 5
United States
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. The United States consumes 18.6 million barrels
per day.
China
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. China consumes 10.3 million barrels per day.
Japan
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. Japan consumes 4.7 million barrels per day.
India CORRECT
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. India consumes 3.6 million barrels per day and
therefore falls into bin 4.
Russia CORRECT
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. Russia consumes 3.2 million barrels per day and
therefore falls into bin 4.
Saudi Arabia
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. Saudi Arabia consumes 2.9 million barrels per
day.
Brazil
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. Brazil consumes 2.8 million barrels per day.
Germany
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. Germany consumes 2.4 million barrels per day.
Canada
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. Canada consumes 2.3 million barrels per day.
South Korea
Bin 4 includes all countries that consume more than 3 million barrels and less than or
equal to 4 million barrels of oil per day. South Korea consumes 2.3 million barrels per
day.
Question 2 of 5
Greater than 1 million and less than or equal to 2 million barrels per day
The frequency of bin 2, which represents oil consumption greater than 1 million and less
than or equal to 2 million barrels per day, is zero. Look for the bin with the highest
frequency.
Greater than 2 million and less than or equal to 3 million barrels per day
The tallest bar corresponds to bin 3, which means that bin 3 has the highest frequency.
Therefore, the range containing the most countries in the data set includes all countries
that consume more than 2 million and less than or equal to 3 million barrels of oil per
day.
Greater than 3 million and less than or equal to 4 million barrels per day
Only two countries consume more than 3 million and less than or equal to 4 million
barrels per day. Look for the bin with the highest frequency.
Greater than 18 million and less than or equal to 19 million barrels per day
Only one country consumes more than 18 million and less than or equal to 19 million
barrels per day. Look for the bin with the highest frequency.
Question 3 of 5
According to the histogram, which range contains the country in the data set that
consumes the least amount of oil?
More than 0 and less than or equal to 1 million barrels per day
The frequency of bin 1, which represents oil consumption less than or equal to 1 million
barrels per day, is zero. This means that no countries consume an amount in this range.
The amount of oil consumed per day is shown on the horizontal axis. Look for the range,
or bin, with the lowest value that has a frequency of at least one.
More than 1 and less than or equal to 2 million barrels per day
The frequency of bin 2, which represents oil consumption more than 1 million and less
than or equal to 2 million barrels per day, is zero. This means that no countries consume
an amount in this range. The amount of oil consumed per day is shown on the horizontal
axis. Look for the range, or bin, with the lowest value that has a frequency of at least
one.
More than 2 and less than or equal to 3 million barrels per day
The amount of oil consumed per day is shown on the horizontal axis. Bin 3 is the lowest
range that has a frequency of at least one (in this case, the frequency is 5). Therefore,
the lowest consumer of oil consumes more than 2 million and less than or equal to 3
million barrels per day.
More than 18 and less than or equal to 19 million barrels per day
The country in bin 19 consumes more than 18 million barrels per day. Other countries
consume less. The amount of oil consumed per day is shown on the horizontal axis.
Look for the range, or bin, with the lowest value that has a frequency of at least one.
Question 4 of 5
How many countries consume more than 10 million and less than or equal to 11 million
barrels of oil per day?
0
The number of countries that consume more than 10 million and less than or equal to 11
million barrels of oil per day is indicated by the height of the bar at bin 11. Because the
height exceeds zero, we know that at least one country’s consumption is within this
range. How many countries consume more than 10 million and less than or equal to 11
million barrels?
1
The number of countries that consume more than 10 million and less than or equal to 11
million barrels of oil per day is indicated by the height of the bar at bin 11. The frequency
of that bar is one, which indicates that one country consumes more than 10 million and
less than or equal to 11 million barrels of oil per day.
8
The number of countries that consume more than 10 million and less than or equal to 11
million barrels of oil per day is indicated by the height of the bar at bin 11. Eight countries
consume less than or equal to 10 million barrels of oil per day. How many countries
consume more than 10 million and less than or equal to 11 million barrels?
9
The number of countries that consume more than 10 million and less than or equal to 11
million barrels of oil per day is indicated by the height of the bar at bin 11. Nine countries
consume less than or equal to 11 million barrels of oil per day. How many countries
consume more than 10 million and less than or equal to 11 million barrels?
Question 5 of 5
Suppose you are interested in gaining a deep understanding of the distribution of
salaries at your company.
Which histogram provides the greatest insight into the distribution of the salary data for 15
employees?
COLD CALL
Which histogram do you think best displays the 2012 revenue for the top 100 U.S.
companies? Why?
I think Option C is the most adequate graph since it uses the right amount of bins.
Using larger bins such as in Option A and B simplifies the graph but provides less detail
about the distribution. Therefore they prevent us from seeing interesting trends in the
data.
On the other hand, very small bins such as in Option D provide graphs that show such
low frequencies that it can be difficult to see any patterns in the data. –Carla
Option B. We see it is right-tailed, and the distribution of 2012 revenue in a reasonable
format. Option C and D have too much data. –Tarun
COLD CALL
How would you decide what bins to use when creating a histogram? What factors might
influence the bins you select?
Choosing bins balances showing trends versus showing noise. Ideal bins demonstrate
general trends in the data, skewing right or left, without overwhelming the visual with too
much noise. While raw numbers show granular data, histograms should be used for
more abstract shape attributes of the data. –Jason
I would look over the entire range of data to get a sense of how many data points will be
shown and how widely dispersed the data points are. –JJ
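The trade-off the students describe can be seen by re-binning one data set at several bin widths. The sketch below uses the oil consumption figures quoted earlier in this section (the revenue data from the cold call is not reproduced in these notes), and the helper `counts_by_width` is a hypothetical name. Bins follow the same convention as before: each bin includes its upper label.

```python
import math
from collections import Counter

# Oil consumption in millions of barrels per day, as quoted in this section.
oil = [18.6, 10.3, 4.7, 3.6, 3.2, 2.9, 2.8, 2.4, 2.3, 2.3]

def counts_by_width(values, width):
    """Frequency per bin for bins `width` wide; empty bins are omitted."""
    # Rounding v / width up gives the bin whose upper label is >= v.
    labels = (math.ceil(v / width) * width for v in values)
    return dict(sorted(Counter(labels).items()))

for width in (1, 5, 10):
    print(width, counts_by_width(oil, width))
# 1 {3: 5, 4: 2, 5: 1, 11: 1, 19: 1}
# 5 {5: 8, 15: 1, 20: 1}
# 10 {10: 8, 20: 2}
```

With a width of 10, the two largest consumers collapse into a single bar and the long right tail is much harder to see; with a width of 1, most bins between 5 and 19 are empty.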
1.2.3 Outliers
Question 2 of 2
If you were interested in knowing the average oil consumption of the top oil-consuming
countries, how would you handle the outliers?
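One way to explore this question is to compare summary measures with and without the most extreme value. Treating the United States figure as the outlier is an assumption made here for illustration; these notes do not record which points the course identifies as outliers. The values are the consumption figures quoted earlier in this section.

```python
from statistics import mean, median

# Oil consumption in millions of barrels per day, as quoted in this section.
oil = {
    "United States": 18.6, "China": 10.3, "Japan": 4.7, "India": 3.6,
    "Russia": 3.2, "Saudi Arabia": 2.9, "Brazil": 2.8, "Germany": 2.4,
    "Canada": 2.3, "South Korea": 2.3,
}

values = list(oil.values())
# Assumption: the United States is the value treated as an outlier.
trimmed = [v for name, v in oil.items() if name != "United States"]

print(f"mean, all ten countries:     {mean(values):.2f}")    # 5.31
print(f"mean, without United States: {mean(trimmed):.2f}")   # 3.83
print(f"median, all ten countries:   {median(values):.2f}")  # 3.05
```

Dropping a single value moves the mean by roughly 1.5 million barrels per day, while the median sits well below both means. That gap is a reason to think carefully before either keeping or discarding an outlier.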
The Input Range is B1:B31 and the Bin Range is D1:D8. You must check
the Labels in first row box since we included B1 and D1 to ensure that the
histogram’s axes are appropriately labeled.
Question 2 of 6
How would you describe the shape of the distribution shown below of the real estate
pricing data?
Uniform
A uniform distribution has constant probability across a range of possible outcomes.
Thus the bars of the histogram of a uniform distribution will have the same frequency
provided the bins over the range of possible outcomes are of equal size. Since the
frequencies of the bins in this graph vary, the distribution is not uniform.
Right-tailed
This graph has a tail that extends out the right side. As selling price increases, the
frequency of each bin above $600,000 is much less than those below $600,000.
Therefore, we infer that this distribution is skewed to the right, or right-tailed.
Left-tailed
This graph is not left-tailed. Although it has a tail, the tail extends out the right side, not
the left side. Thus we cannot infer that the distribution is left-tailed.
Symmetric
This graph is not symmetric; it has a tail that extends out to one side.
Question 3 of 6
How many houses cost more than $400 thousand and less than or equal to $800
thousand?
Approximately 2
By convention, Excel includes in a bin’s range the number represented by the bin label.
For example, the first bin (labeled $200,000) includes all houses with values less than or
equal to $200,000 and the second bin (labeled $400,000) includes all houses with
values greater than $200,000 but less than or equal to $400,000. The only bins with
frequency 2 are the fourth bin (labeled $800,000), which indicates that approximately 2
houses cost more than $600,000 and less than or equal to $800,000, and the sixth bin
(labeled $1,200,000), which indicates that approximately 2 houses cost more than
$1,000,000 and less than or equal to $1,200,000. The number of houses that cost more
than $400,000 and less than or equal to $800,000 is indicated by the height of the bars
at bins $600,000 and $800,000. How many houses cost more than $400,000 and less
than or equal to $800,000?
Approximately 11
The number of houses that cost more than $400,000 and less than or equal to $800,000
is indicated by the height of the bars at bins $600,000 and $800,000. The frequency of
the bar above bin $600,000 is approximately 9 and the frequency of the bar above bin
$800,000 is approximately 2. Therefore, approximately 9+2=11 houses cost more than
$400,000 and less than or equal to $800,000.
Approximately 15
By convention, Excel includes in a bin’s range the number represented by the bin label.
For example, the first bin (labeled $200,000) includes all houses with values less than or
equal to $200,000 and the second bin (labeled $400,000) includes all houses with
values greater than $200,000 but less than or equal to $400,000. Approximately 15
houses cost less than or equal to $400,000. The number of houses that cost more than
$400,000 and less than or equal to $800,000 is indicated by the height of the bars at
bins $600,000 and $800,000. How many houses cost more than $400,000 and less than
or equal to $800,000?
Approximately 25
By convention, Excel includes in a bin’s range the number represented by the bin label.
For example, the first bin (labeled $200,000) includes all houses with values less than or
equal to $200,000 and the second bin (labeled $400,000) includes all houses with
values greater than $200,000 but less than or equal to $400,000. Approximately 25
houses cost less than or equal to $600,000. The number of houses that cost more than
$400,000 and less than or equal to $800,000 is indicated by the height of the bars at
bins $600,000 and $800,000. How many houses cost more than $400,000 and less than
or equal to $800,000?
Question 4 of 6
The following data set contains the heights of several members of the Boston Red Sox.
Create a histogram of the data using the bins provided in column C.
Correct!
The Input Range is B1:B11 and the Bin Range is C1:C4. You must check
the Labels in first row box since we included B1 and C1 to ensure that the
histogram’s axes are appropriately labeled.
Question 5 of 6
The following data set contains the heights of several members of the Boston Red Sox.
Create a histogram of the data using the bins provided in column C.
Correct!
The Input Range is B1:B11 and the Bin Range is C1:C6. You must check
the Labels in first row box since we included B1 and C1 to ensure that the
histogram’s axes are appropriately labeled.
Question 6 of 6
Below are three histograms showing the heights of several members of the Boston Red
Sox. Which do you think is most effective in showing the distribution of player heights?