2.1 Descriptive Statistics (Tabular and Graphical)
Displaying data graphically often lets us extract important information quickly. In this note, we use several
examples to demonstrate this. Experienced users of Excel or other spreadsheet software know that these
programs are good at producing graphs. Typical graphs that can be created with Excel
include:
Pie Chart
Bar Chart
Histogram
Line Chart
Scatter Plot
Area Chart
Box Plot
There are, however, some other useful graphs that cannot be created easily with Excel, such as:
Stem-and-Leaf Plot
Dot Plot
Besides these graphical methods of data presentation, it is sometimes useful to describe and summarize data in
tabular form. Excel comes with a powerful tool, called PivotTable, which can do this very efficiently.
To answer the first question, let’s first summarize the data in a neater form so that we can get a better idea
of what’s going on. To summarize data using PivotTable, place your mouse anywhere in the data range
and select Data/PivotTable Report from the menu bar. Then follow the on-screen instructions. Using
PivotTable, for example, we get the following result:
Average of Annual_Salary
Gender Total
0 52126.0
1 43645.7
Here, “0” means male and “1” means female. The average salary of male workers is 52126, while the average
salary of female workers is only 43645.7. So it appears that female workers are paid less on average.
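For readers working outside Excel, the same kind of summary (one average per group) can be sketched in a few lines of Python. This is only an illustration: the records below are hypothetical and are not taken from Beta.xls, and the helper name `average_by` is our own choice, not part of Excel or PivotTable.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, NOT taken from Beta.xls: (gender, annual_salary),
# using the same coding as the table above (0 = male, 1 = female).
records = [
    (0, 50000), (0, 62000), (0, 44000),
    (1, 41000), (1, 47000), (1, 38000),
]

def average_by(rows):
    """Return {group: average value}, similar to an 'Average of' pivot field."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: mean(values) for group, values in sorted(groups.items())}

averages = average_by(records)
for gender, avg in averages.items():
    print(gender, round(avg, 1))
```

The idea is the same as in the PivotTable: the grouping field goes in the rows, and the salary field is averaged within each group.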
But there is a slight problem here: have we taken other factors into consideration yet? It may be the case
that females are paid less because of weaker qualifications in other areas (e.g., seniority or education).
The following table lets us see whether workers, female or male, with the same educational level get about
the same average salary. (Again, it was produced with PivotTable.)
Now we can see clearly from this table that, given the same educational level, females are always paid less than
males on average. (So there may be a discrimination case to keep some lawyers busy and alive.)
If the table is not convincing enough, look at the following graphs, which were created from the above table. Don’t tell me that
you still cannot see who gets paid more or less.
You can also manipulate the PivotTable to get other information. As an example, try to see if you can answer
the following questions:
Does salary depend on seniority or education? To answer these questions, the following scatter plots were
created:
[Figure: three scatter plots of Salary (vertical axis, $0 to $120,000) against Age, Experience, and Education; the first panel is titled "Salary vs. Age"]
What can you say after seeing these scatter plots? How is “salary” related to “age”, “experience”, and
“education”?
We can also use PivotTable to generate other useful information. For example:
The age distribution of male and female employees
The salary distribution of male and female employees
The following is the age distribution of Beta’s employees, with females and males counted separately.
This table has too much detail. To make it easier to read, we can use the “Group” function in the PivotTable to
combine categories. To use the “Group” function, do the following:
Place the mouse cursor anywhere on the “Age” column of the pivot table, then either right-click the mouse
or choose Data/Group and Outline/Group.
Specify the way you want to group your data in the popup window.
After re-arrangement, we get the following: (the exact output depends on how you group the data)
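The grouping step amounts to replacing each age with the label of the interval it falls in, and then counting. A minimal Python sketch of that idea follows. The ages are hypothetical (not taken from Beta.xls), and the function name `age_group` and its `start` and `width` parameters are our own; they only loosely mirror the choices you make in the popup window.

```python
from collections import Counter

# Hypothetical (age, gender) pairs, NOT taken from Beta.xls; 0 = male, 1 = female.
employees = [(23, 1), (27, 0), (31, 1), (34, 0), (38, 0), (42, 1), (45, 0), (58, 0)]

def age_group(age, start=20, width=10):
    """Label the interval that contains `age`, e.g. 34 -> '30-39'."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

# Count employees per (age group, gender), like a grouped pivot table.
counts = Counter((age_group(age), gender) for age, gender in employees)
for (group, gender), n in sorted(counts.items()):
    print(group, gender, n)
```

Changing `width` to 5 or 20 gives finer or coarser groups, which is why the exact output depends on how you group the data.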
We can also display the salary distribution, separated by age, in table form. This requires changing the
PivotTable setup. Any time you want to change the setup of your pivot table, do the following:
Place the mouse cursor anywhere on the pivot table.
Right-click the mouse.
Select Wizard.
Make whatever changes you want.
The following table displays the salary distribution, separated by age. (The salary columns have been grouped.)
Example 2
The file “Sky.xls” contains three years (1991-1993) of sales data for the Sky’s the Limit Women’s apparel store.
Each observation represents sales during a 4-week period. Thus, the first observation is sales during the first 4
weeks of 1991, and so on.
A time series plot is particularly useful for identifying patterns that may exist in time series data. We can
easily create a time series plot with Excel: simply create a line chart based on the time series data.
[Figure: Sky's Sales Data 1991-1993; line chart of Sales (vertical axis, 0 to 300000) against the observation number (horizontal axis labeled Week, 1 to 39)]
Example 3
A question of great interest is how the distribution of family income has changed in the United States during the
last 20 years. The file “Income.xls” contains data for a sample of 499 family incomes (in real 1995 dollars).
For each family, the 1975 and 1995 incomes are listed. (Although these data are fictitious, they are consistent
with what has actually happened to U.S. family income during these years.) Based on these data, discuss as
completely as possible how the distribution of family income in the United States changed from 1975 to 1995.
A box plot is very useful when we want to graphically depict a large amount of data. This plot shows
the mean, the median, the quartiles, and some other information. When two box plots are drawn side by
side, we can also compare the distributions of two data sets.
The right and left edges of the box are at the third and first quartiles. Therefore, the length of the box equals the
interquartile range (IQR), and the box itself represents the middle 50% of the observations.
The vertical line inside the box indicates the location of the median. The point inside the box indicates the
location of the mean. (Some software does not show the mean.)
Horizontal lines are drawn from each side of the box to represent observations that are either below the first
quartile or above the third quartile.
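The quantities that define the box can be computed directly. The sketch below uses Python's standard `statistics` module on a small hypothetical sample (not taken from Income.xls). Note that software packages use slightly different quartile rules; the "inclusive" rule used here is one common choice and may not reproduce your software's numbers exactly.

```python
from statistics import mean, median, quantiles

# Hypothetical family incomes in thousands of dollars, NOT taken from Income.xls.
incomes = [12, 15, 18, 22, 25, 31, 40, 58, 95]

# Cut points that split the sorted data into four equal parts.
# method="inclusive" interpolates between sorted observations.
q1, q2, q3 = quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1  # length of the box: the middle 50% of the observations

print("Q1 =", q1, " median =", q2, " Q3 =", q3, " IQR =", iqr)
print("mean =", round(mean(incomes), 2))
```

In this made-up sample the mean lies above the median, which is what we expect when a few large values stretch the right-hand side of the distribution.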
Based on the following box-plot, what can you say about the family income change in the U.S. from 1975 to
1995?
[Figure: side-by-side box plots, "Income Distribution, 1975 vs. 1995"]
Example 4
The file “Grade.xls” contains the final exam grades of a business statistics class. Generate Stem-and-Leaf plot,
frequency table, and histogram to show the grade distribution.
To create the frequency distribution table and the histogram, we will use Excel’s PivotTable and PivotChart
Report. The results are shown below:
Grade Distribution
Grade         Count of ID
<50           2
50-59         3
60-69         9
70-79         22
80-90         16
>90           3
Grand Total   55
[Figure: bar chart of Count of ID (vertical axis, 0 to 25) by Grade, using the same six grade groups]
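The frequency table can also be reproduced with a short script. The following is a sketch in Python, not the PivotTable procedure itself. The grades are decoded from the stem-and-leaf display shown later in this example, and the function name `grade_group` is our own.

```python
from collections import Counter

# Grades decoded from the stem-and-leaf display in this example (stem unit: 10).
stems = {4: "99", 5: "379", 6: "235677899",
         7: "0012334445556667788999", 8: "0001123344666779", 9: "122"}
grades = [10 * stem + int(leaf) for stem, leaves in stems.items() for leaf in leaves]

LABELS = ["<50", "50-59", "60-69", "70-79", "80-90", ">90"]

def grade_group(grade):
    """Return the label of the grade group used in the table above."""
    if grade < 50:
        return "<50"
    if grade > 90:
        return ">90"
    if grade >= 80:
        return "80-90"  # this group also includes a grade of exactly 90
    low = (grade // 10) * 10
    return f"{low}-{low + 9}"

counts = Counter(grade_group(g) for g in grades)
for label in LABELS:
    print(f"{label:<12}{counts[label]}")
print(f"{'Grand Total':<12}{len(grades)}")
```

The heights of the bars in the histogram are simply these counts, plotted in the same order.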
To create the Stem-and-Leaf plot, we use PHStat. From the menu bar, select PHStat/Descriptive
Statistics/Stem-and-Leaf Display, then follow the on-screen instructions. The stem-and-leaf plot is shown
below:
Stem-and-Leaf Plot
Stem unit: 10
4   99
5   379
6   235677899
7   0012334445556667788999
8   0001123344666779
9   122
Statistics
Sample Size      55
Mean             75.0545
Median           76
Std. Deviation   10.0819
Minimum          49
Maximum          92
How do you interpret this stem-and-leaf plot? For example, how many students got less than 60? How many
got higher than 90?
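If PHStat is not available, a stem-and-leaf display is easy to build by hand or with a few lines of code: split each value into a stem (the tens digit) and a leaf (the units digit), then list the leaves next to their stem. The Python sketch below does this for the 55 grades, read back from the display above. The function name `stem_and_leaf` is our own, and the sketch assumes a stem unit of 10 so that every leaf is a single digit.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# The 55 grades, read back from the stem-and-leaf display above.
grades = [
    49, 49, 53, 57, 59, 62, 63, 65, 66, 67, 67, 68, 69, 69,
    70, 70, 71, 72, 73, 73, 74, 74, 74, 75, 75, 75, 76, 76, 76, 77, 77,
    78, 78, 79, 79, 79, 80, 80, 80, 81, 81, 82, 83, 83, 84, 84,
    86, 86, 86, 87, 87, 89, 91, 92, 92,
]

def stem_and_leaf(values, stem_unit=10):
    """Return {stem: string of leaves}; assumes single-digit leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // stem_unit].append(v % stem_unit)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(rows.items())}

plot = stem_and_leaf(grades)
for stem, leaves in plot.items():
    print(stem, leaves)

# Summary statistics (stdev is the sample standard deviation).
print("n =", len(grades), " mean =", round(mean(grades), 4),
      " median =", median(grades), " std dev =", round(stdev(grades), 4))
```

Unlike a histogram, the display keeps every individual grade visible, so the original data can be recovered from it.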
Using the file Beta.xls again, the summary statistics for “salary” are shown below:
Salary
Mean 47722.77
Standard Error 3351.308
Median 41081.5
Mode #N/A
Standard Deviation 24166.62
Sample Variance 5.84E+08
Kurtosis -0.05775
Skewness 0.877974
Range 94914
Minimum 14371
Maximum 109285
Sum 2481584
Count 52
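Several entries in this table are tied together by simple formulas, and it is a useful exercise to check them. The Python sketch below recomputes the mean, standard error, variance, and range from the other reported values. The variable names are our own, and small differences are expected because the table's figures are rounded.

```python
from math import sqrt

# Values copied from the summary table above.
count = 52
total = 2481584
std_dev = 24166.62
minimum, maximum = 14371, 109285

mean_salary = total / count         # Mean = Sum / Count
std_error = std_dev / sqrt(count)   # Standard Error = Standard Deviation / sqrt(Count)
variance = std_dev ** 2             # Sample Variance = Standard Deviation squared
salary_range = maximum - minimum    # Range = Maximum - Minimum

print("Mean          ", round(mean_salary, 2))
print("Standard Error", round(std_error, 2))
print("Variance      ", f"{variance:.2E}")
print("Range         ", salary_range)
```

Notice also that the mean (47722.77) is well above the median (41081.5), which agrees with the positive skewness reported in the table: a few high salaries pull the mean upward.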