Business Statistics Notes
Expectations:
Please note that there are many other Functions, Chart Wizard options, and Data Analysis Tools in Excel. We
expect that your Excel mastery will not be strictly limited to what we show you in the lectures and labs, and we
encourage you to experiment with Excel on your own.
Summary Statistics
The easiest way to start “getting to know” your data is to do some basic calculations that summarize the data.
The definitions below should help you understand what we are calculating in the lectures, lab and homework.
Average (or Mean): The average of a set of data is the sum of the data values divided by the number of
observations (n).
Median: Once the data are arranged in order, the median is the value at the center so that 50% of the
observations are smaller than this value and 50% of the observations are larger. If the number of observations
(n) is an odd number, the median is the middle observation. If the number of observations (n) is an even
number, the median is the average of the two middle observations. The median is also the 50th percentile.
Mode: The mode, if one exists, is the most frequently occurring observation. There can be more than one
mode. For example, 13 students have a GPA of 2.82. This is the most common GPA, so it is the mode.
(There is no GPA shared by 14 students).
Percentile: Percentiles separate large data sets into 100ths. The pth percentile is a number such that p
percent of the observations are at or below that number. For example, the 75th percentile is an ÖSS Score of
308.75, and it tells you that 75% of the scores are at or below 308.75.
Standard deviation: This tells us how much variation is in the data. In basic terms: how much the data points
tend to deviate from the mean.
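The definitions above can be sketched outside Excel as well. The Python snippet below is only an illustration: the score list is hypothetical (it is not taken from stats_lab.xlsx), and it assumes the sample standard deviation (dividing by n - 1), which is also what Excel's STDEV computes.

```python
import statistics

# Hypothetical sample of rounded scores (not taken from stats_lab.xlsx).
scores = [259, 276, 276, 285, 290, 301, 308, 320]

n = len(scores)
mean = sum(scores) / n                 # sum of the values divided by n
median = statistics.median(scores)     # n is even: average of the two middle values
modes = statistics.multimode(scores)   # a list, because there can be more than one mode
stdev = statistics.stdev(scores)       # sample standard deviation (divides by n - 1)

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
print("standard deviation:", round(stdev, 2))
```

Here n is 8, so the median is the average of the 4th and 5th values in sorted order (285 and 290).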
In-class/lab example:
Consider a fictitious database (stats_lab.xlsx) which contains the following information:
- Student ID
- ÖSS Points
- ÖSS Points, Rounded to the nearest integer
- ÖSS National Ranking
Let’s start by looking at just the ÖSS Points, Rounded column. To easily compute the summary statistics, we
will give this range of scores a name. Highlight C2:C251, then click the Formulas tab, then Define Name in the
Ribbon. Type a name into the Name box, such as points_round, and then click OK.
Excel has built-in functions to compute most of the summary statistics. These functions have very intuitive
names.
Average----------------AVERAGE
Median-----------------MEDIAN
Mode-------------------MODE
Minimum---------------MIN
Maximum--------------MAX
Standard deviation----STDEV
Each one of these functions takes one argument, namely a cell range or a name. For example, to find the
average of the rounded ÖSS Points, we enter =AVERAGE(points_round) into any blank cell.
The minimum and maximum are simple measures of variation. They tell us how much variety there is within
the data. However, they are very sensitive to extreme values. It may be a better idea to use the 5th and 95th
percentiles as measures of variation. Two Excel functions which are very useful in extracting
information out of a range of numbers are PERCENTILE and FREQUENCY.
Examples:
PERCENTILE(points_round,0.05) = 259.4
This implies that only 5% (0.05) of the ÖSS Scores at Özgirişim U are below 259.4
PERCENTILE(points_round,0.25) = 276
This tells us that 25% of the students had ÖSS Scores below 276. Combining this information with what we
observed above, we can conclude that many more students had ÖSS Scores between 259.4 and 276 than below
259.4. But how do we know how many? We can use the FREQUENCY function. The number of ÖSS Scores
less than or equal to 259.4 can be found using this formula:
FREQUENCY(points_round,259.4) = 5 students.
And the number of students with ÖSS Scores less than or equal to 276 is:
FREQUENCY(points_round,276) = 23 students.
So, how can we combine these two pieces of information to find the number of students who have ÖSS Scores
less than or equal to 276 but greater than 259.4?
23 - 5 = 18 students.
Check this against our percentile results. According to our percentile conclusions, about 20% (25% - 5%) of
students should fall in this range. To find the total number of ÖSS Scores, use:
COUNT(points_round) = 89
18 / 89 = 20.2%, which is fairly close to 20%!
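The same percentile-versus-frequency check can be sketched in Python. This is a minimal illustration under stated assumptions: the scores are hypothetical, and percentile_inc and count_at_or_below are helper names invented here. percentile_inc interpolates linearly between the two closest ranks, which is how PERCENTILE is understood to work in Excel.

```python
import math

def percentile_inc(values, p):
    """Percentile for 0 <= p <= 1, interpolating linearly between closest ranks."""
    data = sorted(values)
    rank = p * (len(data) - 1)      # fractional position in the sorted list
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return data[lo] + (data[hi] - data[lo]) * (rank - lo)

def count_at_or_below(values, limit):
    """Number of observations less than or equal to limit."""
    return sum(1 for v in values if v <= limit)

# Hypothetical scores (not the stats_lab.xlsx data).
scores = [250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350]

p25 = percentile_inc(scores, 0.25)
p75 = percentile_inc(scores, 0.75)

# Observations above the 25th percentile and at or below the 75th percentile.
between = count_at_or_below(scores, p75) - count_at_or_below(scores, p25)
share = between / len(scores)

print(p25, p75, between, round(share, 3))
```

With only 11 observations the share comes out at 5 / 11 rather than exactly 50%, the same kind of small gap seen in the 20.2% versus 20% comparison above.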
You can create a table with either percentile or frequency values, such as the ones below, which will give you
a nice summary of the data. These two tables are two slightly different ways of summarizing the cumulative
distribution of the ÖSS Scores.
Percentile 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Value
Histogram
We can also graph the ÖSS Scores column using a histogram in Excel. A histogram is a useful chart that
illustrates the non-cumulative distribution of the data. It shows us a picture of how the data are spread out and
allows us to quickly see any patterns. It is a graph that consists of vertical bars constructed on a horizontal
line that is marked off with intervals for the variable being displayed. If the data are normally distributed, the
graph will look bell-shaped and symmetrical. There will be one tail (the long thin part) on each side of the chart,
showing that the high values and the low values are the least common.
The first thing to do is to create a Frequency Table, using an Array Function in Excel. Suppose we want the
size of the bins to be 10.
First, create a column for the bins, and label it “bins”. Then enter the bin parameters into this column.
For this example, since we know that the min(points_round) is 193 and the max(points_round)=338,
we want our bins to be between 200 and 340. Let’s enter the following numbers into the bins column
200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, and 340.
Next, we need to create an Array Function in Excel. This means that the function will apply to an entire
range, or array, of cells. Name the column to the right of the Bins column “Frequency.”
o Next, highlight all of the cells in the frequency column – all the empty cells beside the bin labels.
o Once they are highlighted, type “ =frequency( “
o Inside the bracket, your first argument will be the points_round range. Either type “points_round”
if you named the range, or swipe the range.
o Enter a comma (or semicolon), and swipe the bins.
o Close the bracket but DO NOT PRESS ENTER YET. The formula should read:
=frequency(points_round,bins).
o Press ctrl + shift + enter and all the cells in the Frequency column will be filled in by the array
function.
o If you click inside any one of the frequency cells and look in the formula bar you can see that the
entire function is now surrounded by curly brackets {}, which indicate that it is an array function.
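To make the bin logic explicit, here is a rough Python sketch of what the FREQUENCY array function computes. The function name and the sample values are hypothetical. The sketch assumes each bin counts the values above the previous bin limit and at or below its own limit, with one extra count at the end for values above the last limit (Excel returns that extra element only if you select one more cell than there are bins).

```python
from bisect import bisect_left

def frequency(values, bins):
    """Count values per bin.

    Bin i counts values greater than the previous limit and less than or
    equal to limit i. The final extra entry counts values above the last limit.
    """
    limits = sorted(bins)
    counts = [0] * (len(limits) + 1)
    for v in values:
        # bisect_left gives the index of the first limit that is >= v
        counts[bisect_left(limits, v)] += 1
    return counts

# Hypothetical values and bin limits.
values = [193, 205, 210, 211, 338]
bins = [200, 210, 220, 340]

counts = frequency(values, bins)
for limit, count in zip(bins, counts):
    print(limit, count)
print("above last bin:", counts[-1])
```

Note that 210 is counted in the 210 bin, not the 220 bin, because each limit is an upper bound that includes the value itself.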
To create the graph, select all of the values in the frequency column.
o Click on the Chart Wizard
o Select Column chart. Click Next.
o Go to the Series tab. For Category (X) axis labels, swipe the bin labels. Click Next.
o On the Titles tab, label the chart as follows - Chart title: Distribution of ÖSS Scores; X axis: ÖSS
Scores; Y Axis: Number of Students.
o Click on Legend tab, unselect Show Legend, Click Next.
o Choose to display the chart as an object in the current worksheet and click Finish.
Now that the chart is complete, we will format it.
o Double click on any one of the columns in the chart. Select Options. Reduce the Gap Width to
10%.
o You can polish this chart in many different ways. You can change the background color, or the
color of the bars, or use fill effects. You can change the fonts (bold, italic, larger/smaller). You
can show data values on top of the columns, or you can add a data table.
Your completed Frequency Table and Histogram should look similar to the ones below. Look for the
relationship between the table and the chart.
ÖSS Points Histogram
bins frequency
200 1
210 1
220 0
230 0
240 0
250 0
260 6
270 8
280 12
290 8
300 18
310 8
320 7
330 13
340 7
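One way to see the relationship between this table and the earlier cumulative results is to add the frequencies up. The short Python sketch below uses the bins and frequencies from the table above; the variable names are chosen here for illustration only.

```python
from itertools import accumulate

# Bins and frequencies copied from the table above.
bins = list(range(200, 341, 10))   # 200, 210, ..., 340
freq = [1, 1, 0, 0, 0, 0, 6, 8, 12, 8, 18, 8, 7, 13, 7]

total = sum(freq)                    # should equal COUNT(points_round)
cumulative = list(accumulate(freq))  # number of scores at or below each bin limit

for limit, count, running in zip(bins, freq, cumulative):
    print(limit, count, running, f"{running / total:.1%}")
```

The total is 89, matching COUNT(points_round). The running totals at 270 and 280 are 16 and 28, which brackets the 23 scores at or below 276 found earlier with FREQUENCY.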
Scatter Diagram
In the second part of the example, we will explore the relationship between two columns: ‘ÖSS
National Ranking’ and the student’s ranking within the Özgirişim incoming class. First, create another column,
column E, that shows each student’s ranking within the dataset. Then let’s chart this data by creating an XY
(Scatter) Diagram. To do this, follow these steps:
Highlight the data in both columns, not including the headers. Do this quickly by selecting the top cell
in each column (D2 and E2). Then press: ctrl + shift + down arrow.
Click on the Insert tab, and then select Scatter as the chart type.
Make sure that the x axis is the Özgirişim ranking and the y axis is the National Ranking. If not, click on
Select Data under the Design tab and correct it.
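The notes do not say how the within-class ranking column is built. One plausible approach, sketched below in Python, is to rank each student's National Ranking within the dataset, with 1 for the best (smallest) national rank and tied values sharing the lowest rank. The function name and the five sample rankings are hypothetical.

```python
def rank_ascending(values):
    """Rank each value within the list: 1 for the smallest value.

    Tied values share the lowest rank, and the next rank is skipped.
    """
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Hypothetical national rankings for five students.
national = [1520, 310, 9875, 310, 4400]

class_rank = rank_ascending(national)

# (x, y) pairs for the scatter diagram: class ranking on x, national ranking on y.
points = list(zip(class_rank, national))
print(points)
```

The two students tied at 310 both receive rank 1, so no student receives rank 2.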
Questions to consider:
What does the chart look like? Does there seem to be a linear relationship between the two sets of
numbers?
Your completed Scatter chart could look similar to the one below:
Notice that this chart also includes a line through it and an equation. What do these mean? We will discuss
them below.
Correlation
Correlation Coefficient (R): The correlation coefficient is a standardized measure of the linear relationship
between two variables.
In the chart above, there is obviously a relationship between ÖSS National Ranking and students’ ranking
within the incoming Özgirişim class: the lower the class ranking, the lower the National Ranking should be. The
trendline, or regression line, tells us what that relationship looks like in mathematical terms. Using the trendline
equation, we can predict a student’s National Ranking from his/her Özgirişim Ranking or, vice
versa, predict a student’s Özgirişim Ranking from his/her National Ranking.
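A linear trendline is a least-squares fit, so the equation on the chart can be reproduced by hand. The Python sketch below is illustrative only: fit_line is a name chosen here, and the five data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: class ranking (x) and national ranking (y).
class_rank = [1, 2, 3, 4, 5]
national = [300, 700, 1250, 1600, 2150]

slope, intercept = fit_line(class_rank, national)

# Predict the national ranking of the student ranked 6th in the class.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

Swapping the two lists gives the line for predicting class ranking from national ranking. That is a separate fit, not simply the first equation solved for x.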
However, a correlation is often read as evidence of causality: does one thing cause another thing to happen?
Correlation alone cannot establish this. To understand correlation better, let’s take a look at the following
example concerning drugs. The following is a chart that shows the amount of cocaine plants that have been
destroyed per year (x-axis) vs. the average price of cocaine in the US/Europe that year (y-axis).
Logically, the more cocaine plants that are destroyed, the higher the price should be (lower supply leads to
higher prices), which should deter people from buying drugs. However, in this case, there seems to be a
negative correlation: more eradication goes with lower prices, with some variation. Is there a problem with
the war on drugs program? The next question is: how much of the variation in cocaine prices is explained by
the changes in cocaine eradication? Obviously cocaine eradication should have an impact on supply, but how
much? What about other factors like population growth, increases in global wealth, prevalence of drug
programs, etc.?
The Correlation Coefficient tells us how strongly, and in which direction, the price of cocaine moves together
with cocaine eradication. This coefficient is also called R. We can calculate it by using the CORREL function in
Excel. This coefficient can be between -1 (perfect negative correlation) and +1 (perfect positive correlation). A
coefficient of zero implies no linear correlation. The closer the coefficient is to -1 or +1, the stronger the
correlation.
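For reference, the quantity CORREL returns is the Pearson correlation coefficient, which can be sketched in a few lines of Python. The eradication and price figures below are hypothetical and are not the data behind the chart. The function is named correl here only to mirror the Excel function.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical figures: eradication per year (x) and average price that year (y).
eradication = [10, 20, 30, 40, 50]
price = [100, 90, 85, 70, 65]

r = correl(eradication, price)
r_squared = r ** 2   # comparable to the R^2 value shown on a trendline

print(round(r, 3), round(r_squared, 3))
```

In this made-up series the price falls as eradication rises, so r is negative and close to -1.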
You can check whether you computed the correlation correctly by comparing it with the R^2 value on the chart.
To square the number 4 in Excel, type =4^2. The “^” is called a “caret”. Square the R value and compare it to
the R^2 value on the chart. They should match. R^2 is a measure of the quality of the fit. It is the square of the
correlation coefficient found earlier. Assuming that the two sets of numbers really do have a linear
relationship, R^2 measures the percentage of the variation in cocaine prices explained by the variation in
cocaine eradication. In our example, R^2 is 84.5%. This means that 84.5% of the variation in cocaine prices is
explained by cocaine eradication. So it seems like the war on drugs, at least on the supply end, is having the
opposite of its intended effect.
Note: be careful about using the square root function (SQRT) of the R^2 value to determine R because you
will only get the absolute value of R. You won’t be able to tell if the correlation is positive or negative.
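One way to recover the sign, assuming a straight-line trendline with a single x variable, is to take it from the trendline's slope, written here as b (a symbol introduced for this note):

```latex
R^2 = r^2,
\qquad
\sqrt{R^2} = |r|,
\qquad
r = \operatorname{sign}(b)\,\sqrt{R^2}
```

For example, with R^2 = 0.845 and a downward-sloping trendline, r is approximately -0.919.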