5 Summarizing Data
Before summarizing the data, it is advisable to simply explore the data to see what is
there. Play with the data as if you were in a sandbox, before you start forming opinions about
the data and prematurely drawing conclusions.
If your data is in a Table, then there are some functions that make basic exploration easy. In the
header row at the top of the Table, each variable name is followed with a down arrow. If you
click on it you can see that you can sort or filter your data. The first two sorts (smallest to largest
and largest to smallest for numeric variables and A to Z vs Z to A for text) are self-explanatory.
The third is labeled Sort by Color. This is a custom sort. You can sort at multiple levels. That is,
sort students by their Program, then within each Program you can sort by Gender, then among
each Gender-Program combination you can sort by GPA,… Note that the whole file is sorted, not
simply the variable that you selected. Normally in Excel, you have to select the data you wish to
sort.
You can also Filter your data. This could be by selecting the specific values you wish to examine,
or by using a variety of functions to select ranges of values. You can do a Custom Filter that
combines ranges by using logical operators, such as OR and AND. You can filter on several
different variables separately. This is useful when you want to explore specific sub-groups in
your data.
Suppose that we want to look just at the Business students. We can click on the down arrow by
Program and uncheck all programs except Business.
Maybe we want only the International Business students. Click on the down arrow by Home and
select International.
We can continue selecting just Females or just 1st year students. We can also use more complex
text filters. Or we can select just those that expect to earn over $50,000, or between $40,000
and $80,000…
We can also Sort the data in a similar fashion. If you select Sort by Color you can then choose
Custom Sort, which allows you to sort by several variables concurrently.
When first exploring data, doing a simple sort on each numeric variable can be insightful. For
example, sort Salary, either smallest to largest or largest to smallest.
Look at the smallest values. Someone expects to earn $0? $500? $9,450?
When you are done sorting, it is recommended that you return your data to the original sort
order. Initially, the data was likely sorted by the record ID. Resort the Table by ID, smallest to
largest. If your data set does not have an ID column, it is recommended that you create one.
Without a record ID column that is initially set as smallest to largest, you do not have any way to
restore your original sort order.
As we explore our data and come to understand it, we may start asking ourselves questions
about it.
Like most actions on “objects” in Python, these are executed by using “methods” attached to
the object. If we wanted to sort the data frame by Program2 and then by Salary,
we would give the following instruction.
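As a minimal sketch of such an instruction, the sort can be written with the sort_values method. The small data frame below is a hypothetical stand-in for df_Grad_Exp; its values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical miniature version of the survey data (values are made up)
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Science"],
    "Salary": [60000, 45000, 52000, 38000, 75000, 0],
})

# Sort by Program2 first, then by Salary within each program.
# Both sorts are ascending by default.
sorted_df = df_Grad_Exp.sort_values(by=["Program2", "Salary"])
print(sorted_df)
```

Note that sort_values returns a new, sorted data frame and leaves df_Grad_Exp itself in its original order.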
The default sort is ascending, but this can be changed by changing the ascending setting to
False. Note that we are sorting on Program2 and Salary, so the sort order will be descending
for BOTH of these variables.
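A short sketch of the descending sort follows, again on a hypothetical stand-in for df_Grad_Exp with invented values. It also shows that ascending accepts a list, one entry per sort column, if we want a different direction for each variable.

```python
import pandas as pd

# Hypothetical miniature version of the survey data (values are made up)
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Science"],
    "Salary": [60000, 45000, 52000, 38000, 75000, 0],
})

# ascending=False applies to BOTH sort columns
sorted_desc = df_Grad_Exp.sort_values(by=["Program2", "Salary"], ascending=False)
print(sorted_desc)

# A list gives one direction per column:
# Program2 A to Z, Salary largest to smallest within each program
sorted_mixed = df_Grad_Exp.sort_values(
    by=["Program2", "Salary"], ascending=[True, False]
)
print(sorted_mixed)
```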
In the examples above, only the first and last 5 rows are displayed. To display the top (head) 10
rows, add the method .head(10)
And if you want just the bottom (tail) 10 rows, add the method .tail(10)
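The two methods can be chained directly onto a sort. The sketch below uses a hypothetical 30-row data frame with invented salaries, only to show which rows come back.

```python
import pandas as pd

# Hypothetical data: 30 students with made-up, steadily increasing salaries
df_Grad_Exp = pd.DataFrame({
    "ID": range(1, 31),
    "Salary": [40000 + 1000 * i for i in range(30)],
})

# The 10 smallest salaries: sort ascending, then keep the top 10 rows
top10 = df_Grad_Exp.sort_values(by="Salary").head(10)

# The 10 largest salaries: same sort, but keep the bottom 10 rows
bottom10 = df_Grad_Exp.sort_values(by="Salary").tail(10)

print(top10)
print(bottom10)
```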
To filter cases, we must specify the condition(s) on which we wish to filter. For example, if we
want to select only observations for which Salary exceeded $100,000, we would write
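One way to write this filter is sketched below on a hypothetical stand-in for df_Grad_Exp (invented values). The second filter is our own extension: it combines conditions with & (AND), much like a Custom Filter in Excel; the | operator plays the role of OR.

```python
import pandas as pd

# Hypothetical miniature version of the survey data (values are made up)
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Science"],
    "Salary": [60000, 120000, 52000, 38000, 650000, 0],
})

# Keep only the rows where Salary exceeds $100,000
high_salary = df_Grad_Exp[df_Grad_Exp["Salary"] > 100000]
print(high_salary)

# Combine conditions: each condition needs its own parentheses
business_typical = df_Grad_Exp[
    (df_Grad_Exp["Program2"] == "Business")
    & (df_Grad_Exp["Salary"] >= 20000)
    & (df_Grad_Exp["Salary"] <= 120000)
]
print(business_typical)
```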
Typing out long expressions, such as repeatedly typing df_Grad_Exp, can be simplified by
typing the first one or two letters and then pressing the tab key. Python will respond with a
list of all variables starting with these letters.
If you want to know what methods or functions you can apply to an object, type a period
after the object and then press tab.
If you can’t remember the syntax associated with the function, type shift+tab and you will get
a help screen for the function.
A row will be added to the bottom of the table. Scroll to the last row. Select the entry in the
Salary column. A down arrow should appear. Click on the arrow.
Select Count Numbers. 752 students reported a salary value. Look at Average, Max and Min.
The average salary was $52,914.76, with a maximum of $650,000 and a minimum of $0. The
$650,000 appears unrealistic. We noted before when Sorting, that some students gave values
that we thought were suspicious.
Go to the down arrow by Salary in the header row and select Number Filters. Select Between
and then select 20000 and 120000 as your limits. You now get an average of $51,480.40. It
would appear that the extremes at each end did not seriously distort our average.
Click on cell ID for Total row 813 (cell A813). Click on the arrow and select Count.
811 students attempted the survey. This is all of the students. Click on the Total cell for Year.
802 told us the year they started. Repeat this for Home and for Home2. Why is the count 811 for
Home and 801 for Home2? When we applied VLOOKUP to Home, blanks were coded as #N/A.
This may be a problem.
What happens when you look at Gender and Gender2? 811 and 665? Why? Initially, we used
VLOOKUP to code 0 and 1 as Male and Female, but VLOOKUP treated a blank as zero = Male. We
added an IF statement to treat blanks as blanks. But, Excel coded blanks as blank text. So the
cells are no longer empty and they were counted.
Question: why did 802 tell us when they started but only 665 tell us gender?
Are salary expectations of Business students similar to those of Arts students? We know that we
have some strange salaries reported, so let us Filter on those between $20,000 and $120,000.
Although tedious, you can answer many simple questions with the Total row and Filters.
https://youtu.be/rs_H0T1sY3U
If you do not specify a column, you will get the mean value for every numeric variable.
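A minimal sketch of both cases is given below, using a hypothetical stand-in for df_Grad_Exp with invented values. In recent versions of pandas, numeric_only=True is needed when the data frame also contains text columns; otherwise the call raises an error.

```python
import pandas as pd

# Hypothetical miniature version of the survey data (values are made up);
# None marks a student who did not report a salary
df_Grad_Exp = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Program2": ["Business", "Arts", "Science", "Arts", "Business", "Science"],
    "Salary": [60000, 45000, None, 38000, 75000, 52000],
})

# Mean of a single column; missing values are skipped
mean_salary = df_Grad_Exp["Salary"].mean()
print(mean_salary)

# Mean of every numeric column (text columns such as Program2 are left out)
all_means = df_Grad_Exp.mean(numeric_only=True)
print(all_means)
```

Notice that the ID column is numeric, so it is averaged as well, even though the average of an ID has no meaning.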
We will explain the meaning of each of these measures in the next chapter.
With string (text) data, we can count the values with value_counts(). This method does not give
us a count of the total observations, but counts the number of observations taking on each
unique value.
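As an illustration, the sketch below counts the programs in a hypothetical column of invented values. By default, missing values are left out of the counts.

```python
import pandas as pd

# Hypothetical Program2 values (made up); None marks a blank response
program = pd.Series(
    ["Business", "Arts", "Science", "Arts", "Business", "Business", None],
    name="Program2",
)

# Number of observations for each unique value, most frequent first
counts = program.value_counts()
print(counts)

# Adding up the counts gives the number of non-missing observations
print(counts.sum())
```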
This last example is a good segue into our next topic of aggregating data.
Usually, frequency distributions do not have text descriptions, just the ranges of values.
We have looked at how to group data by making numeric values into categorical values. In
Excel, this was done with VLOOKUP using an approximate match. In Python, we used the cut
function in a similar manner. Both methods create categorical variables, but do not
automatically make a summary like the table above.
A histogram is a graphical depiction of a frequency distribution. There are several ways to create
a histogram in Excel. We will look at two methods in this section and a third method when we
look at pivot tables and charts.
If you wish to set up groupings of different sizes, then you will have to use the Analysis ToolPak
add-in in Excel.
If you are using Office 365 in the cloud, then installation is somewhat more complicated. The
Analysis Toolpak is not included in 365. Go to Insert and then select Add-Ins.
Select STORE from the options listed at the top of the pop-up screen. From the list of categories
at the left, select Data Analytics. Scroll down the list of suggestions until you see XLMiner.
The XLMiner add-in is free and behaves in a very similar way to the Analysis Toolpak.
If you are using a Mac, then you must go through a similar process to the above. You may need
to google to find installation instructions.
To use the Histogram function in Excel, you must define your groups. Excel calls the groups
“bins”.
Since the end of one bin is the start of the next, we only need to define the end of the bin. Note
that this differs from VLOOKUP, which required us to define the start of each group. (confusing!)
Let us use a bin width of $10,000 and define bins up to $120,000. To keep our data sheet clean,
we recommend that you put the bins on a separate sheet, such as your LookUp Table
worksheet.
In column R, put a heading “Salary bins” in the 1st row and then enter 10000, 20000, 30000, … in
the 2nd, 3rd, 4th, … rows. Note: do not use commas. Write 10,000 as 10000.
Go to the Data tab and on the far right, you should see Data Analysis. Click on it. A pop up
window appears. Select Histogram.
The histogram is a simple column chart. You can format the labels on the chart to make it more
attractive. It is common practice to remove the space between columns in a histogram.
To the right are several formatting options. Change the Gap Width to 0.
The More group on the right is deceptive. Excel makes all columns the same width. But this
group actually represents values from $120,000 to $650,000! The tail on the right should be very
very long, but that would distract us from looking at where most of the data is.
Don’t be distracted by outliers. They are important, but not the focus.
It would be nice to filter out extreme values. If you apply a filter and then construct your
histogram, Excel ignores the filter! The only way to filter data and then make a histogram is to
• Apply the filter.
• Copy the filtered data to a new sheet.
• Build the histogram using the copied data.
https://youtu.be/3ysPbcSqDXE
The most recent version of Excel has an alternate way to construct histograms.
• Select the Salary column.
• Go to Insert and select Recommended Charts.
• You may see an image of a histogram, but if not, select All Charts and select Histogram
from the list.
But, we can improve this significantly. Right click anywhere on the horizontal axis. Select
Format Axis. A dialog box appears. You will see that Automatic is checked. Let us select Bin
Width and enter 10000 (do not use a comma, as in 10,000). We would also like to limit the number of
bins, but Excel only allows you to fix one setting.
The advantage of using an Excel Chart instead of the Analysis Toolpak is that we can use the
Filtering tools in Table. Select the down arrow beside Salary and select Number Filters and then
select Less than or equal to. Enter 120000.
This now looks similar to what we obtained with the Histogram function in the Analysis
Toolpak. As a Chart, we can change the title and label the axes. In general, it is recommended
that all charts have informative titles and labelled axes.
As noted in a previous section, the value_counts() method will count the number of
observations taking a specific value of a categorical variable. For example, with Program2, we
had
Too many decimals are distracting and not informative. To round the results to 2 decimal places,
add the method round(2).
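The sketch below assumes that the results being rounded are proportions, obtained with the normalize=True option of value_counts; that option and the invented Program2 values are our assumptions, used only to show where round(2) fits.

```python
import pandas as pd

# Hypothetical Program2 values (made up)
program = pd.Series(
    ["Business", "Arts", "Science", "Arts", "Business", "Business", "Arts"],
    name="Program2",
)

# Proportion of observations in each program (many decimals)
shares = program.value_counts(normalize=True)
print(shares)

# The same proportions rounded to 2 decimal places
shares_rounded = program.value_counts(normalize=True).round(2)
print(shares_rounded)
```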
With numeric data, we saw that we can use the cut function to change numeric data to
categories.
In the example we looked at before, we defined the bins with a series of upper limits for the
bins.
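As a reminder of how this looks, here is a sketch with invented salaries and hypothetical bin limits. Note that cut needs the lowest edge as well as the upper limits, and that float("inf") can serve as an open-ended last limit. Counting the resulting categories turns them into a frequency distribution.

```python
import pandas as pd

# Hypothetical salaries (made up), including two extreme values
salary = pd.Series(
    [0, 15000, 20000, 45000, 52000, 60000, 75000, 120000, 650000],
    name="Salary",
)

# Lowest edge followed by the upper limit of each bin; the last bin is open-ended
edges = [0, 20000, 40000, 60000, 80000, 100000, float("inf")]

# include_lowest=True keeps a salary of exactly 0 in the first bin
salary_group = pd.cut(salary, bins=edges, include_lowest=True)

# sort=False keeps the bins in their natural order, including empty bins
freq = salary_group.value_counts(sort=False)
print(freq)
```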
We can also simply specify the number of bins we want and let Python create equal sized bins.
You lose control over whether the end points to the bins are “attractive” values and outliers can
seriously distort the groups. For example, if we were to split the Salary data into 20 bins, we
would get
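The effect can be sketched with a handful of invented salaries that include one extreme value. Passing an integer to bins asks cut for that many equal-width bins between the smallest and largest values.

```python
import pandas as pd

# Hypothetical salaries (made up), including one extreme value
salary = pd.Series(
    [0, 15000, 20000, 45000, 52000, 60000, 75000, 120000, 650000],
    name="Salary",
)

# 20 equal-width bins spanning the full range of the data
salary_group = pd.cut(salary, bins=20)
freq = salary_group.value_counts(sort=False)
print(freq)
```

With these values each bin is $32,500 wide, so the end points are not round numbers, and the single outlier leaves most of the 20 bins empty while nearly all observations crowd into the first few.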
To turn a frequency distribution into a histogram, we can use the plot method in pandas,
or we can load Python’s library of plotting functions.
The quick approach would be to ask for a bar chart of the frequency distribution. Below is an
example with bins with widths of $10,000 and the last bin being all values above $100,000. We
can add axis labels and a title.
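A sketch of this quick approach is shown below. The salaries are invented, and the bin labels, axis labels and title are our own choices for illustration. The plot method relies on the matplotlib library being installed.

```python
import pandas as pd

# Hypothetical salaries (made up)
salary = pd.Series(
    [0, 15000, 28000, 35000, 42000, 45000, 48000, 52000,
     55000, 58000, 65000, 72000, 90000, 120000, 650000],
    name="Salary",
)

# Bin edges every $10,000 up to $100,000, then one open-ended last bin
edges = list(range(0, 110000, 10000)) + [float("inf")]

# Label each bin by its upper limit; the last bin holds everything above $100,000
labels = [f"{upper // 1000}K" for upper in edges[1:-1]] + ["More"]

freq = pd.cut(
    salary, bins=edges, labels=labels, include_lowest=True
).value_counts(sort=False)

# Bar chart of the frequency distribution; width=1.0 removes the gaps between bars
ax = freq.plot(kind="bar", width=1.0, edgecolor="black")
ax.set_xlabel("Expected salary (upper limit of bin)")
ax.set_ylabel("Number of students")
ax.set_title("Distribution of expected salaries")
```

In a notebook the chart is displayed automatically. In a script, import matplotlib.pyplot as plt and call plt.show() to display it.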
Before leaving this topic, we must ask why we want frequency distributions or histograms.
The rationale for a frequency distribution may be very different from that for a histogram. It
goes to the question of why/how we use tables versus charts.
In a table, we are trying to summarize information while still retaining some detail. We are
interested in how the groups (bins) are defined and the percentage of values within a group.
Readability is important.
In a chart, we are visually looking for patterns. You “read” a table, but “feel” a chart. In most
instances, you do not look at the scales on the chart axes. You don’t care if the vertical axis
shows frequency or percentage. But you also don’t pay attention to the labeling of the
horizontal axis. You may look for a sense of where the middle is, and what are low and high
values, but the exact numbers are not important. You get a sense of shape – symmetry or
skewness.
Although we present frequency distributions and histograms as a single topic, they serve very
different purposes. Defining groups is very important for frequency distributions (tables), but
may not be important for histograms (charts). This distinction in how we mentally process
information will be revisited several times throughout the text.
You could then construct 3 histograms and compare them. But this comparison is hard to do
visually.
How do you compare these distributions? The Arts and Business samples are of similar size, but the
Science sample is less than half that size. If comparing frequency distributions or histograms, you need
comparable scales. Convert each frequency to percentage of the total sample size. We call these
relative frequencies.
We must take the three frequency distributions and combine them in a single table, with
columns for each of Arts, Business and Science. For each, create three new columns of relative
frequencies, by dividing the frequencies by the corresponding sample sizes.
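The same table can be sketched in Python with the crosstab function in pandas. The data and bin limits below are invented for illustration. The option normalize="columns" divides each frequency by its column total, that is, by the sample size of the program.

```python
import pandas as pd

# Hypothetical data (made up): Science has half as many students as the others
df = pd.DataFrame({
    "Program2": ["Arts"] * 4 + ["Business"] * 4 + ["Science"] * 2,
    "Salary": [30000, 45000, 50000, 70000,
               35000, 55000, 58000, 90000,
               42000, 85000],
})

# Group salaries into bins (hypothetical limits)
edges = [0, 40000, 60000, 80000, float("inf")]
df["Salary_group"] = pd.cut(df["Salary"], bins=edges, include_lowest=True)

# Frequencies: one row per salary group, one column per program
freq = pd.crosstab(df["Salary_group"], df["Program2"])
print(freq)

# Relative frequencies: each column is divided by its own total
rel_freq = pd.crosstab(df["Salary_group"], df["Program2"], normalize="columns")
print(rel_freq.round(2))
```

Each column of rel_freq adds up to 1, so the programs can be compared on the same scale even though their sample sizes differ.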
Select the relative frequency columns and Insert a Clustered Column Chart.
This chart allows you to more easily compare the patterns among the programs. We will see in
the next chapter an alternative way to summarize data and build charts that is more flexible
than using the Histogram function.
In a later chapter, we will look at a graphical technique that is better suited to comparing the
distributions of many groups. Although histograms are simple charts, they still contain too much
detail for comparing multiple groups.
Classic patterns include the Bell shape with balanced tails on left and right – most common when
summarizing averages. This curve is known as the Normal Distribution and is the basis of much
of statistical analysis.
Skewed distributions with a long tail to left or right are very common. Long right tails are
common with financial data.
Distributions that are lumpy are often ones that describe two distinct groups. For example, if we
were to construct a histogram of summer earnings (q10 in the original data set) and we had the
first group ending in zero, we would get
The first group represents those students who did not have a summer job and so earned
nothing, and the rest of the distribution is those with jobs. What causes the spikes? Most
students rounded their estimate to the nearest $1,000, but the chart has groupings in $500
increments. A different grouping would result in a smoother histogram. However, unless your
sample is extremely large, most histograms have some lumpiness due to randomness in the
observations. This is to be expected and is not a deficiency in the quality of the data.
Image Citations:
Figures 5-1 to 5-24: Images courtesy of author using Microsoft Excel
Figure 5-25: Image courtesy of Shishirdasika under CC BY-SA 3.0
Figures 5-26 to 5-27: Images courtesy of author using Microsoft Excel