Module 1
Introduction to Statistics
Statistical Analysis with Software Applications

Welcome to your first module!

This module is a combination of synchronous and asynchronous learning and will last for two weeks.

Ms. SHIELA MARIE B. MERINO
Instructor

No part of this module may be reproduced in any form without prior permission in writing from the Instructor/Author.

2 What is Statistics

Everybody relies on data in one way or another:

• corporate presidents decide company policy based on quarterly sales figures
• politicians decide on campaign strategy based on polls
• teachers decide grading curves based on a bell curve
• you and I decide whether to smoke or not based on health records of other people

Therefore, we need a comprehensive and understandable way to deal with data. The basic objective of statistics is to get information about a larger group just by looking at a small part of that group. We will define

the term population to stand for the set of all measurements of interest

A population could be (1) the set of all photographs of Mars, or (2) the set of heights of people in the US Army, or (3) the set of all measurements of water quality taken from the Hudson River, or (4) the set of all problems that can be solved using statistics. On the other hand,

the term sample will denote any subset of measurements selected from the population

For example, a sample could be (a) the pictures selected from population (1) from a specific region of Mars, or (b) the heights of people in a particular division of the US Army, or (c) the set of water measurements of the Hudson River taken on 7/24/2003, or (d) the statistical problems we are solving in this class. In addition,

the term statistical inference stands for an estimate, prediction, or other generalization about a population based on information contained in a sample.

With statistics we want to make inferences about a population based on information collected from a sample. To be more precise, we will approach a "generic statistical problem" using these four steps:

Problem definition
what is the population of interest, and what are the variables that are to be investigated
Data collection
describe and select the sample from the population
Data analysis
make some statistical inferences from the sample about the population
Analysis reporting
report the inference together with a measure of reliability for the inference

where we use the term variable to mean a characteristic or property of an individual member of the population, where the observations can vary.

Example: A tax auditor is responsible for 25,000 accounts. How many accounts are in error (resulting in a loss of revenue)?
The steps involved in trying to find a suitable answer to this question are:

Data analysis: Some statistical theory is applied to allow drawing a conclusion from the sample of 2000 accounts and applying it to all 25,000 accounts. In this case, the likely theory involves computing 84/2000 = 4.2%, but possibly other formulas are necessary as well.

Analysis reporting: Based on our data analysis we infer that approximately 4.2% of the accounts will be in error. That guess has an error of +/- 0.9%.

Note that we have not clarified how to obtain the error for our guess, but will describe that in a later module.

The analysis of the data is usually done by using a calculator, or - more frequently these days - with the help of a software package. In our course we will use Microsoft Excel to help us perform statistical analysis.

A second detail not mentioned in the above example is how the 2000 accounts were selected at random. Of course, one would like to select that sample as 'unbiased' or as 'randomly' as possible. It turns out that selecting such a random sample is not easy. In fact, it is frequently the most difficult process in applying statistics in the real world!
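The arithmetic in the tax auditor example can be sketched in a few lines of Python. The module defers the derivation of the error to a later module, so the margin computed here is only an assumption: it uses the common normal approximation at 95% confidence (z = 1.96), which happens to reproduce the +/- 0.9% quoted above. The variable names are illustrative.

```python
import math

accounts_total = 25_000   # accounts the auditor is responsible for
sample_size = 2_000       # accounts examined
errors_found = 84         # accounts in error within the sample

# Share of the sample that is in error: 84 / 2000
proportion = errors_found / sample_size

# Assumption: 95% confidence with the normal approximation (z = 1.96).
z = 1.96
margin = z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Applying the sample proportion to all 25,000 accounts
estimated_errors = proportion * accounts_total

print(f"{proportion:.1%} +/- {margin:.1%}")
print(round(estimated_errors))
```

Under these assumptions the inference reads "about 4.2% of accounts, give or take 0.9 percentage points", or roughly 1050 of the 25,000 accounts.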
Example: What's the average income of people living in NYC?

The most accurate approach apparently would be to ask everyone living in NYC about their income, add it up, and divide by the total number of people asked (which will give the precise average).

However, that is not only impractical (very time consuming and expensive), it could not even theoretically work.

Discussion Topic: Discuss the difficulties involved when trying to question everyone living in New York City about their income to determine the average income of NYC residents.

For example, when asking the last people in NYC about their income, the ones we asked first may have moved out of NYC, their income may have changed by now, or new people might have moved into NYC that were not on our original list.

Therefore, instead of finding the exact average, we can try to estimate it. So, our first problem will be to randomly select a small sample, say of size 1000, find the average income of that sample (which is perfectly within our capabilities), and then draw conclusions from that sample about the whole population.

We might try to use the following procedure to select our random sample:

1. Open the latest New York City phone book
2. Select one page "at random"
3. Select the first 1000 people starting from that page. Call them and ask them for their income. Compute the average of that group, and say that this average is representative for the average income of all people in NYC, approximately.

Discussion Topic: Discuss whether the above method of randomly selecting 1000 people in New York City will indeed yield a "random sample", using your intuitive understanding of what a "random sample" ought to be. If you find flaws in the above procedure, try to come up with a better one.

But this is not at all a "legal" procedure to obtain a "random sample": all people selected will most likely be from one borough, or all may have a name starting with "Mc" (and are likely to be of Irish ancestry, which introduces bias).

Even if we somehow managed to select people from the phone book without any bias, it's still not good enough: we will be missing people with unlisted numbers (usually high income), as well as people with no phone (usually low income), and some of the people selected may choose to not answer our questions.

Before we continue, we need to define clearly what we mean by "random sample":

A random sample of size n is a sample that is selected by a process such that any other sample of that size has the same chance of being selected.

In other words, a random sample is a sample where the selection has taken place without any bias of any sort. In the real world, selecting a 'random sample' is difficult and often impossible. However, we will next learn a procedure for doing that in special cases. In general, though, we will avoid that problem and simply assume that a random sample somehow has been selected.

Discussion Topic: Immediately after an election, TV channel A forecasts that candidate X will receive 52% of the vote, with a margin of error of 2%. At the same time channel B predicts that the same candidate will receive 47% of the vote, also with a margin of error of 2%. Discuss what's wrong with this picture, and how this could happen. Do you recall any actual occasion where the winner of a major election has been incorrectly predicted based on statistical analysis?
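To make the phone book discussion concrete, here is a small Python sketch contrasting a block of consecutive entries with a random sample. The population is entirely invented for illustration (five boroughs with made-up incomes, listed borough by borough), and the function names are hypothetical. Python's `random.sample` draws without replacement so that every subset of the requested size is equally likely, which matches the definition of a random sample given above.

```python
import random
from statistics import mean

# Hypothetical population: 5 boroughs of 2000 people each, listed borough by
# borough the way a directory groups neighbours. All figures are invented.
borough_incomes = [30_000, 40_000, 50_000, 60_000, 120_000]
population = [income for income in borough_incomes for _ in range(2000)]

def phone_book_sample(people, start, size):
    """Take `size` consecutive entries beginning at position `start`."""
    return people[start:start + size]

def random_sample(people, size, rng):
    """Draw `size` entries so that every subset of that size is equally likely."""
    return rng.sample(people, size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
block = phone_book_sample(population, start=0, size=1000)
drawn = random_sample(population, size=1000, rng=rng)

print(mean(population))  # true average of the invented population
print(mean(block))       # every entry comes from the first borough
print(mean(drawn))       # typically close to the true average
```

In this invented setting the consecutive block reflects only one borough, while the random draw mixes all of them. The sketch does not address unlisted numbers or non-response, which remain problems for any real survey.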
Here is a simple example that illustrates how a random sample can be selected. To select, for example, a random sample of size n = 5 from a population of 2000 measurements we proceed as follows:

1. Label all measurements from 0 to 1999, in any order
2. Start a computer program that can generate random numbers
3. Use that computer program to generate 5 distinct random numbers between 0 and 1999
4. Select the 5 measurements from the total population that correspond to those random numbers

This procedure will give a random sample (assuming the computer's random number generator is working correctly), but is not always applicable. The biggest problem is that of being able to label all measurements (or outcomes) in the population.

For example, if you want to find the average pollution of a certain river, you cannot label all possible measurements.

In addition, we need to know more details about how to "start a computer program that can generate random numbers"; we will learn that in later sections.

From now on, we will take a very simple approach: we will ignore that problem and assume that somehow a random sample has been selected (possibly by the above method). We will, however, learn how to use Excel to select a random sample in case we can label all outcomes of our population.

San Mateo Municipal College, College of Business and Accountancy

1.4 Variables and Distributions

When we are looking at a particular population, selecting samples to make inferences, we need to record our observations or the characteristics of the data we are studying.

A variable is the term used to record a particular characteristic of the population we are studying.

For example, if our population consists of pictures taken from Mars, we might use the following variables to capture various characteristics of our population:

• quality of a picture
• title of a picture
• latitude and longitude of the center of a picture
• date a picture was taken

and so on. It is useful to put variables into different categories, as different statistical procedures apply to different types of variables. Variables can be categorized into two broad categories, numerical and categorical:
Categorical Variables are variables that have a limited number of distinct values or categories. They are
sometimes called discrete variables.
Numeric Variables refer to characteristics that have numeric values. They are usually continuous variables, i.e.
all values in an interval are possible.
Categorical variables again split up into two groups, ordinal and nominal variables:
Ordinal variables represent categories with some intrinsic order (e.g., low, medium, high; strongly agree, agree,
disagree, strongly disagree). Ordinal variables could consist of numeric values that represent distinct categories
(e.g., 1=low, 2=medium, 3=high). To best remember this type of variable, think of "ordinal" containing the word
"order".
Nominal variables represent categories with no intrinsic order (e.g., job category or company division). Nominal
variables could also consist of numeric values that represent distinct categories (e.g., 1=Male, 2=Female).
It is usually not difficult to decide whether a variable is categorical (discrete) or numerical (continuous).
Example: An experiment is conducted to test whether a particular drug will successfully lower the blood pressure
of people. The data collected consists of the sex of each patient, the blood pressure measured, and the date the
measurement took place. The experiment is conducted three times, once before the patient was treated, once
one hour after administering the drug, and again 2 days after administering the drug. What variables comprise
this experiment?
The characteristics measured in the experiment are the patient's sex, blood pressure, and the date. The sex is a
nominal variable, the blood pressure is numeric, and the date is ordinal. The fact that the experiment is
repeated does not change the number of variables recorded: each time, 3 variables are recorded.
Note that recording these three variables only would not enable a successful data analysis. It seems likely that in
order to test the drug's effectiveness you need to correlate the measurements taken on different dates with
particular patients. In other words, you want to know what the blood pressure for each patient was each time
you took it. Thus, you should introduce at least one more variable into your experiment, such as a patient ID
(which is a nominal variable, even though the ID could be a number).
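The role of the patient ID can be sketched in Python. The record layout, field names, and all values below are invented for illustration; the point is only that an ID lets each patient's readings be lined up across measurement dates, while the ID itself is never used in arithmetic.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Measurement:
    patient_id: int        # nominal: a label, even though it is a number
    sex: str               # nominal
    blood_pressure: float  # numeric
    taken_on: date         # ordinal in this module's classification

# Hypothetical records, invented for illustration only.
records = [
    Measurement(1, "F", 150.0, date(2024, 1, 1)),
    Measurement(2, "M", 160.0, date(2024, 1, 1)),
    Measurement(1, "F", 138.0, date(2024, 1, 3)),
    Measurement(2, "M", 155.0, date(2024, 1, 3)),
]

def readings_by_patient(rows):
    """Group blood pressure readings by patient, ordered by date."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: r.taken_on):
        grouped[row.patient_id].append(row.blood_pressure)
    return dict(grouped)

print(readings_by_patient(records))
```

Without the `patient_id` field the four readings could not be paired up, so a before/after comparison per patient would not be possible.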
Example: Consider the following survey, given to a random sample of Seton Hall University students:
Q1: What is your status: [ ] Freshman [ ] Sophomore [ ] Junior [ ] Senior [ ] Graduate Student
Q4: How often do you use the following support services?
(answer columns: daily, few times per week, few times per month, few times per year, never)

• Academic Advisement
• The Career Center
• Dining Services
• Health Services
• Counseling Services
• Recreation Center
• PC Support Services
• Campus Ministry

Q5: (answer columns: 1, 2, 3, 4, 5, -1)

• Academic Advisement
• The Career Center
• Dining Services
• Health Services
• Counseling Services
• Recreation Center
• PC Support Services
• Campus Ministry

Note: 1 = Strongly Agree, 2 = Agree, 3 = Neutral, 4 = Disagree, 5 = Strongly Disagree, -1 = Not Applicable

The survey consists of a total of 19 variables, as follows: Q1 is an ordinal variable, Q2 is nominal, and Q3 is numeric. Q4 consists of 8 variables (corresponding to 8 rows of the table), all being ordinal, and Q5 again consists of 8 ordinal variables. Note in particular that the 8 variables in question 5 are not numeric. The numbers are simply codes for particular categories.

When the results of a survey or an experiment are recorded, the results usually vary. Most or all outcomes for each variable occur, and they usually occur with different frequencies. Recognizing patterns in the frequencies of outcomes is in fact one of the goals of statistics.

The distribution of a variable refers to the set of all possible values of the variable and the associated frequencies or probabilities.

Sometimes variables are distributed so that all outcomes are equally, or nearly equally likely. Other variables show results that "cluster" around one (or more) particular values.

A heterogeneous distribution is a distribution of values of a variable where all outcomes are nearly equally likely.

A homogeneous distribution is a distribution of values of a variable that cluster around one or more values, while other values are occurring with very low frequencies or probabilities.

Example: Suppose you are conducting a survey that tries to determine whether women are typically shorter than men. Thus, your survey, given to 100 randomly selected people, asks for the respondent's sex and height. Do you anticipate homogeneous or heterogeneous distributions from these variables?

Since approximately half of all people are male and half are female, and the survey was given to 100 randomly selected participants, there should be approximately the same number of men and women queried. Thus, the variable sex should have a heterogeneous distribution: both values are just about equally likely. The second variable, height, however, will likely cluster around one or two most frequent values. Or conversely, few people are really short (4 feet or less) or really tall (7 feet or more), so this variable should be homogeneously distributed.

Example: Suppose a company issues sales reports for two years, 2004 and 2005, as shown in the table and picture below. We can consider this report to consist of two variables (v_2004 and v_2005, say), each one having 4 values (for North, South, East, and West, respectively). Are the distributions of values hetero- or homogeneous?

The values for the 2004 variable (v_2004 if you like) are pretty close to each other. In the chart you can see that all blue bars are approximately of equal height. If I looked at the original figures, I would find an (about) equal amount of sales for North, South, East, and West; no region would particularly stick out. Thus, each region is equally likely in terms of number of sales - the distribution is heterogeneous (if I checked where an individual, randomly selected sale came from, each region would be approximately equally likely).

The values for the 2005 variable (v_2005 in our terminology) differ widely. In the chart the red bars are of different heights, with "East" being by far the highest. If I looked at the original figures, I would find that most sales were made in the East. Thus, a sale from the East is much more likely than from any other region - the distribution is homogeneous (if I checked where an individual, randomly selected sale came from, it would most likely come from the East).

Discussion Topic: Which type of variable (ordinal, nominal, or numeric) do you think will be most useful for statistical analysis? Which type of variable do you think is usually present in surveys given to groups of people? Look at survey results from newspapers or online and report the variables and their categories.
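The idea of a distribution, as defined in this section, can be made concrete by counting how often each value occurs. The following Python sketch uses invented answers for two categorical variables; the function name and the data are assumptions for illustration and are not taken from the survey or the sales report.

```python
from collections import Counter

def distribution(values):
    """Return each distinct value together with its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical answers, invented for illustration only.
sex = ["F", "M", "F", "M", "M", "F", "F", "M", "F", "M"]
region = ["East"] * 7 + ["North", "South", "West"]

print(distribution(sex))     # both values occur about equally often
print(distribution(region))  # most answers cluster on one value
```

In the terminology of this module, the first variable would be called heterogeneous (all outcomes nearly equally likely) and the second homogeneous (values cluster on "East").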
Module 2
Using Microsoft Excel
Statistical Analysis with Software Applications

SAN MATEO MUNICIPAL COLLEGE
Gen. Luna St. Guitnang Bayan I, San Mateo, Rizal
Tel No. (02) 997-9070
www.smmc.edu.ph

Module Duration
• Week 1 to 3 - Synchronous Meeting and Asynchronous Learning
• Messenger (direct messages or group chat) Search : SASA

This module is a combination of synchronous and asynchronous learning and will last for two weeks.

2.1 What is Excel

Microsoft Excel is a spreadsheet program that can be used to enter data in tabular form and to perform a large variety of computations on that data. In addition, Excel can be used to create a wide range of graphical charts, and can even act as a simple database program to store, search, and retrieve data.

Note: The instructions and screen shots that follow refer to Excel 2003. If you use a newer version (e.g. Excel 2007) you will find that most of the basic functionality is the same, but the menus and appearances are different. You could use the "Help" feature to find out how specific commands in the 'new' Excel relate to previous ways of doing them. For example, if you search "Help" for "Excel 2003" you will find an interactive "Excel 2003 to Excel 2007 command reference guide". You might find that guide helpful if you have any questions with the instructions below.

A file in Excel is called a "workbook", and each workbook can contain several "sheets". A "sheet" is a table with rows and columns, where the rows are numbered and the columns are labeled with letters from A to Z, then AA, AB, ... and so on.
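The column labelling scheme (A to Z, then AA, AB, and so on) can be expressed as a short Python function. This is only a sketch of the naming pattern; `column_label` is a hypothetical helper and not something Excel exposes.

```python
def column_label(index):
    """Convert a 1-based column number to its letter label (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column numbers start at 1")
    label = ""
    while index > 0:
        # Letters behave like digits in base 26, except that there is no zero.
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label

print([column_label(n) for n in (1, 2, 26, 27, 28, 52, 53)])
```

Calling the function with the number of columns your version of Excel supports gives the label of the last column.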
NOTE: If Excel does not automatically start, or you see a lot of strange characters instead, then right-click on the
above link and save the document to your desktop. Then double-click on the downloaded icon to make Excel start.
Either method will be okay for the remainder of the course.
If everything works, you will see a "live" spreadsheet in a separate window that looks similar to the video below.
When you are done, close the window that contains the spreadsheet. If you are asked whether to save your changes,
select no. Make sure you know how to return to this document in your web browser.
To start Excel, click on the "Start" button and search for an entry labeled "Microsoft Excel". You might need to
click "All Programs" to see that entry, or it might be located directly on your desktop. If you are using Windows 8,
look on the "Start" screen for the Excel icon.
If you cannot find any of these entries, double-click on the desktop icon "My Computer", then "Drive C" and then on the "Program Files" folder. If you double-click on a folder named "Microsoft Office" it should contain another folder named "Office10" or "Office 11" or "Office 12", which in turn will contain the "Excel" program. Double-click that Excel icon to start the program. Alternatively, if you are using Windows 8, swipe in from the right side of the screen to bring up the Charms menu, tap Search, and type "Excel".

If you still cannot find Excel, you may not have the program installed properly, and you should call the help desk at extension x2222.

Each cell in a table can contain text, numbers, or formulas, as well as more sophisticated types such as currencies, dates, etc. In addition, sheets can contain graphic elements such as charts that can be linked to values in specific cells.
Once Excel is open, you can enter data by clicking on a particular cell, then typing text, numbers, or formulas. If
you move to another cell by clicking on it, your changes will be entered into the current cell. You can also:
• hit TAB to enter your data and move the active cell to the right of the current one
• hit ENTER to enter your data and move the active cell to the next row, usually to the beginning of that
row
• use the ARROW keys to enter your data and move the active cell in the indicated direction
• hit END, LEFT-ARROW to move to the first cell in a row
• hit END, RIGHT-ARROW to move to the last cell in a row
Practice: What's the label of the last column in Excel? How many rows can an Excel spreadsheet contain? Move
to the cell A1 and enter the numbers 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, one number per cell, in the cells A1
to A10.
Excel tries to format your data based on what you enter. Numbers, for example, are right-aligned while text is left-aligned. To ensure that Excel treats your input the way you want:

• if you start your input with a single quote ' it will be treated as text, even if you enter a number
• if you start your input with an equal sign, it will be treated as a formula

The power of Excel comes from creating formulas that perform calculations on cells containing numbers. Every time a number involved in a formula changes, the resulting value of the formula also changes automatically.

Excel knows many formulas, and we will get to know some of them as our course progresses. For now, we will introduce the sum formula, which will compute the sum of all numeric entries in a specified range of cells.

Practice: Enter your name in cell A12. Enter the formula
=sum(A1:A10)
into cell A11, typing it exactly as shown.
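For readers who want to check the result of the practice formula outside Excel, here is a rough Python stand-in for what =sum(A1:A10) computes. The dictionary of cells and the `sum_range` helper are illustrative assumptions; they only mimic the idea of a range of cells and are not how Excel works internally.

```python
# Cells A1..A10 hold 10, 20, ..., 100, as in the earlier practice exercise.
column_a = {f"A{row}": 10 * row for row in range(1, 11)}

def sum_range(cells, column, first_row, last_row):
    """Rough stand-in for =SUM(<column><first_row>:<column><last_row>)."""
    return sum(cells[f"{column}{row}"] for row in range(first_row, last_row + 1))

print(sum_range(column_a, "A", 1, 10))  # 550

# Changing an input changes the result the next time it is computed;
# Excel performs this recalculation automatically.
column_a["A1"] = 15
print(sum_range(column_a, "A", 1, 10))  # 555
```

The difference is that Python recomputes only when the function is called again, while a spreadsheet formula updates as soon as one of its input cells changes.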
You can save the data you worked on last just as in every other Microsoft Windows application (select File | Save
or click the disk-like icon on the toolbar), and you can, of course, retrieve your data again any time. Please make
sure to remember the folder in which you saved your data.
Practice: Save the data you created (it should contain the numbers 10 through 100, their sum, and your name).
There is a lot more to know and learn about Excel. For additional details, you could take a course or workshop
about Excel through CTC or TLTC, or you could work through the extensive Microsoft Office Help documentation
available by selecting "Help | Microsoft Excel Help". If the "Office Assistant" shows up, click on "Options" and
uncheck the "Use Office Assistant" checkbox. Then select "Help | Microsoft Help" again to see a list of all available
topics.
2.3 Calculations with Spreadsheet Data

Excel can perform a wide variety of operations with the data you entered, but in this course we will only need a small subset of all available functions. This section will explain - very quickly - how to enter some of the more commonly used functions.

There are many different ways in which Excel will let you perform the operations described here; we will usually give only one of the possible methods. If you already know a more efficient way to accomplish the same goal, use your method.

Example: Suppose I want to find the sum and the average of some tabular data, for each column separately.

First open Microsoft Excel and enter the data into the spreadsheet, as follows:

To compute the sum of the numbers in column C, proceed as follows (if you follow these instructions slowly and carefully it should work just fine):

1. We assume you have entered the numbers 10 to 90, as shown in the above picture. Now click on the cell C4, which is going to contain the sum of the numbers in the C column.
2. Type:
=
to indicate that this cell will contain a formula
3. After the equal sign, continue to type:
sum(
to specify that you want to use the summation function - make sure to include the open parenthesis (
4. Now use the Up-Arrow key to move the active cell to C3. Note that C3 will automatically be placed in your function
5. Hold down the SHIFT key and press the Up-Arrow twice. Now C1:C3 will be placed in your function
6. Complete the function by typing
)
(the closing parenthesis) and hit ENTER

You should now find the sum of the cells C1, C2, and C3 in cell C4. Instead of using the arrow keys in steps 4 and 5 you can also select the cells whose sum you want to find by "dragging" the mouse over them. In other words, you could also:

1. Click on the cell C4, which should contain the sum of the numbers in the C column.
2. Type:
=
to indicate that this cell will contain a formula
3. Continue typing:
sum(
to specify that you want to use the summation function - make sure to include the open parenthesis (
4. Use the mouse to drag the cursor over the cells from C3 to C1. Now C1:C3 will be placed in your function
5. Complete the function by typing
)
(the closing parenthesis) and hit ENTER

which will accomplish the same task. Now your spreadsheet should look similar to the following:

Instead of entering the final two formulas into cells A4 and B4, we can now copy and paste the formula from cell C4. The range of the summation formula will automatically adjust to compute the sums of columns A and B, respectively.

1. Click on cell C4
2. Select "Edit | Copy" or use the Control-C keyboard shortcut
3. Click on cell B4 and select "Edit | Paste", or use the Control-V keyboard shortcut
4. Click on cell A4 and again select "Edit | Paste", or use the Control-V keyboard shortcut
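The results of the summation steps can be checked with a short Python sketch. The picture of the spreadsheet is not reproduced here, so the layout below is an assumption: the numbers 10 to 90 are taken to fill cells A1 to C3 row by row. The helper names are hypothetical.

```python
# Assumed layout (the original picture is not reproduced): the numbers
# 10 to 90 fill A1:C3 row by row, so column A holds 10, 40, 70 and so on.
sheet = {
    "A": [10, 40, 70],
    "B": [20, 50, 80],
    "C": [30, 60, 90],
}

def column_sum(cells, column):
    """What =SUM(X1:X3) would return for column X."""
    return sum(cells[column])

def column_average(cells, column):
    """Average of rows 1 to 3 of a column, leaving out the sum in row 4."""
    return sum(cells[column]) / len(cells[column])

# The same calculation is applied to each column in turn, much like a
# formula whose range shifts when it is pasted into another column.
for column in ("A", "B", "C"):
    print(column, column_sum(sheet, column), column_average(sheet, column))
```

If your data is arranged differently, the individual sums will differ, but the three column sums will still add up to 450.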
Excel offers many other functions that are interesting for statistical analysis, such as:

• QUARTILE - Returns the quartile of a data set
• RANK - Returns the rank of a number in a list of numbers
• SKEW - Returns the skewness of a distribution
• SLOPE - Returns the slope of the linear regression line
• SMALL - Returns the k-th smallest value in a data set
• STDEV - Estimates standard deviation based on a sample
• TTEST - Returns the probability associated with a Student's t-test
• VAR - Estimates variance based on a sample
• ZTEST - Returns the one-tailed P-value of a z-test

Practice: Compute the average of the numbers in columns A, B, and C (of course do not include the sums in the fourth row in your calculation). Make sure the average for each column is computed below the sum. Then format the sum and average numbers bold and italics.

2.4 Installing the Analysis ToolPak

Excel contains a variety of "add-ons" that allow you to perform additional calculations beyond the basic features built into Excel from the start. Some of these add-ons might require you to insert the Microsoft Office CD-ROM, others can be installed without that disk. In general, the more add-ons you install into Excel, the longer the program takes to start up. Therefore you only want to install those options that you are really going to use, or uninstall add-ons when you don't need them any longer.

For this class you must install the "Analysis ToolPak", which contains a variety of procedures for conducting statistical analysis. Installing an add-on is simple, but differs slightly depending on your version of Excel. Here is the procedure for the Analysis ToolPak for Excel 2007:

• Highlight the "Add-Ins" option on the list on the left. Most likely the Analysis ToolPak add-in will show as inactive, as in the picture above.
• Highlight the "Analysis ToolPak" in the list of inactive add-ins and click on the "Go ..." button at the bottom. You'll see another dialog box like this:

You should not be prompted to insert a CD-ROM. If you do have to insert the Microsoft Office CD-ROM you should contact the Help Desk at x2222 for further assistance (unless, of course, you do have the CD-ROM ...)

The functions from the Analysis ToolPak will now be available in the "Data" ribbon, on the right-most side, under "Data Analysis" (and not in the "Add-Ins" ribbon, as you might expect).

The specific functions in that add-in are the same as for most versions of Office. If you select the "Data Analysis ..." option under "Data" you will see the following entries for performing statistical analysis on data in your spreadsheet:

We will explore several of these options in the rest of this course, but you are welcome to click on "Help" now to learn more about the Analysis ToolPak.
Module 3
Graphical Data Representation
Statistical Analysis with Software Applications

SAN MATEO MUNICIPAL COLLEGE
Gen. Luna St. Guitnang Bayan I, San Mateo, Rizal
Tel No. (02) 997-9070
www.smmc.edu.ph

Module Duration
• Week 1 to 3 - Synchronous Meeting and Asynchronous Learning
• Messenger (direct messages or group chat) Search :
When data is collected, there are usually a lot of numbers, results, responses, etc. In fact, there is usually so much
data that it needs to be summarized before it can be analyzed. One approach is to summarize the data in
graphical or tabular form. Since a picture is worth more than 10,000 words, we hope to be able
to detect patterns or draw conclusions once we see data presented graphically.
In this chapter we will discuss how to use Excel to create Pie and Bar Charts for categorical variables and
Histograms for numerical variables. We will also show how charts can be used to emphasize different points of
view without modifying or falsifying data.
Pie charts are a convenient way to visualize data if the categories that divide the data are not that numerous,
like 10 or less. Pie charts apply to categorical variables (either ordinal or nominal). In most cases pie charts are
not appropriate for numerical variables since there usually would be too many different numbers for that type of
variable.
Example: Suppose a survey was done among 1000 adults about their job situations, with the following results:
Instead of using a table - which may or may not look "pretty" - we need to represent the data in a pie chart. We
proceed as follows:
• Highlight the six cells containing the titles and numbers, click on the "Insert" ribbon and hit the Pie
chart wizard:
There are a number of different pie chart types. For now, pick the first type to create a simple 2D pie
chart, which will be inserted into your spreadsheet.
Note that Excel has automatically converted the raw data into percentages of the total and rounded it
properly. In other words the figure for "one job" was converted to
536 / (122 + 536 + 342) * 100 = 536 / 1000 * 100 = 53.6 %, rounded up to 54%
Practice: If your pie chart does not have a 3D-look, what would you need to do to re-create the chart but this
time with a 3D look? Or if you did pick a 3D look originally, now pick a 'flat' design.
Note: If you move your cursor over the various slices of the pie while inside Microsoft Excel, you will see the total
number as well as the number in percent corresponding to that slice. In fact, when you double-click on the pie
you can choose the "Data Labels" tab from the "Format Data Series" dialog box to include the numbers in a variety
of formats in the graph - try it out now.
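The percentage conversion that Excel performs for the pie chart can also be checked outside Excel. The following is only a minimal Python sketch of that arithmetic. Only the "one job" label is named in the text, so the other two labels below are placeholders.

```python
# Survey counts from the example. Only the "one job" label (536) is named in the
# text; the other two labels are placeholders, not the real category names.
counts = {"category A (placeholder)": 122, "one job": 536, "category C (placeholder)": 342}

total = sum(counts.values())

# Share of each category in percent, exact and rounded to whole percent.
shares = {label: 100 * n / total for label, n in counts.items()}
rounded = {label: round(share) for label, share in shares.items()}

for label in counts:
    print(f"{label}: {counts[label]} of {total} = {shares[label]:.1f}% (shown as {rounded[label]}%)")
```

For the "one job" slice this gives 53.6 %, which is displayed as 54 % once rounded to whole percent.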
Exploding your pie chart: You can also explode your pie chart (which sounds a lot more fun than it is). Simply
click on one of the pie slices (not any text, though) and drag it outwards a little - your chart will explode! You
can either make one slice move out of the pie or all slices. This is useful to highlight one particular slice. In the
example I have also colored that slice green to accentuate it even further by right-clicking on it and selecting
"Format Data Series".
Bar charts are applicable to categorical variables, just as pie charts, but they can accommodate more categories
than pie charts.
Example: A survey was done to find the number of workers employed by major foreign investors.
This time we need to represent the data as a bar chart, with vertical bars, or columns, representing the number
of workers employed by major foreign investors (in some unit of measurement):
Nice, but not great (I don't like that the bars go horizontally, it would be nicer if they went vertically instead), so
we will want to improve on this a little:
• Click on the Change Chart Type and select the Column template and pick 3D Clustered Column
• Select the 'Series 1' label on the chart and remove it by hitting Delete
• Switch to the Layout ribbon and add a Chart Title
• Right-click on the y-axis, pick Format Axis and make sure that the range on the y-axis goes from 0 to 6800
3.4 Frequency Histogram
The previous chart types work well for categorical data since there are usually a limited number of categories.
The most important type of graphical data representation for numerical data is a Frequency Histogram, or
histogram for short. Let's consider an example:
Example: In an anonymous survey of students in a stats course (like the one you filled out at the beginning of
the class) you were asked your sex, male or female. Here are the responses received:
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1
First, as a quick review, is this a numeric, nominal, or ordinal variable - I hope you're thinking "nominal" and you
don't get fooled by the minor detail that the values are all numbers (which are merely code for the categories).
Second, a usual bar chart (or pie chart) would not work well. I am not really interested in the fact that some
responses were 1, others were 2. Instead I want to know how many 1's (men) and how many 2's (women) there
are, that is, I am interested in the frequencies of the various responses. In this case I could (relatively) easily
count the values manually to find the following frequencies:
             Frequency
Male (1)     15
Female (2)   24
Totals       39
This frequency table tells me, for example, that more women than men are taking this stats class. Also, if I meet
a person from this class completely at random on the street, there is a "15 in 39" chance it is a man and a "24 in
39" chance it's a woman (we'll do some probability theory later but this should make common sense).
In the above example we could generate our frequency table manually easily enough, and we could subsequently
use that table to generate an appropriate chart. But if we have hundreds or thousands of responses, we would
like to use Excel to generate the frequency table and associated chart. Also, it may not be completely clear in
which category responses fall, especially if the variables are numeric. As usual, Excel will provide a relatively
convenient method for us to automate our calculations.
Example: Many communities add fluoride to water to prevent tooth decay. In a 25 day period, these levels of
fluoride were measured:
75, 86, 84, 85, 97, 94, 89, 84, 83, 89, 88, 78, 77, 76, 82, 72, 92, 105, 94, 83, 81, 85, 97, 93, 79
There are too many numbers for a pie or bar chart, and in fact we are not interested in the actual numbers as
much as we are interested in the frequency with which they occur. Hence, we want to group them into
categories, and then graph the frequency counts of these categories instead of the original numbers.
As mentioned, we will use Excel to create such a frequency histogram. This time, however, we will not use the
Chart Wizard but instead our first procedure from the Analysis ToolPak.
• Start Excel and enter the above numbers, all in one column. You do not need to enter a title or anything
else, just the numbers in one column. Note that the picture only shows the first and last few numbers,
but you should, of course, enter all numbers in the first column.
• Now bring up the "Data Analysis ..." dialog (remember, it is available on the "Data" ribbon). If you do not
see this item, you must first install the "Analysis ToolPak", as described previously. Anyhow, a dialog box
similar to the following will appear
• Highlight the entry "Histogram", as in the above picture, then click on "OK".
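To make the idea of grouping the fluoride measurements into categories concrete, here is a small Python sketch that counts how many values fall into each category. Splitting the span between the smallest and largest value into five equal-width steps is an assumption made for this illustration. The counting rule (a value equal to a boundary goes into the lower category) follows the "equal to or less than the bin number" rule from Excel's description of the Bin Range.

```python
# Fluoride measurements from the example (25 days).
levels = [75, 86, 84, 85, 97, 94, 89, 84, 83, 89, 88, 78, 77, 76, 82,
          72, 92, 105, 94, 83, 81, 85, 97, 93, 79]

# Assumption for this sketch: five equal-width steps between minimum and maximum.
n_steps = 5
low, high = min(levels), max(levels)
width = (high - low) / n_steps

# Upper boundaries of the categories; everything above the last one is "More".
edges = [round(low + i * width, 1) for i in range(n_steps)]

# A value is counted in the first category whose upper boundary it does not exceed.
counts = [0] * (len(edges) + 1)
for x in levels:
    for i, edge in enumerate(edges):
        if x <= edge:
            counts[i] += 1
            break
    else:
        counts[-1] += 1  # larger than every boundary

for edge, c in zip(edges, counts):
    print(f"up to {edge}: {c}")
print(f"More: {counts[-1]}")
```

The list `counts` plays the role of the frequency table, and plotting it as adjacent bars gives the histogram.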
The "Histogram" tool is appropriate to compute frequency tables and charts for numeric variables. You could
continue reading the instructions or check out the video if you prefer:
To continue with the instructions, you need to enter the options for a (frequency) histogram next, including the
location of the data to be used and the categories that you want to use.
The various options in this dialog box need further explanation (click on "Help" inside that dialog box):
Input Range: Enter the reference for the range of data you want to analyze.
Bin Range [optional]: Enter the cell reference to a range that contains an optional set of boundary values that
define bin ranges. These values should be in ascending order. Excel counts the number of data points
between the current bin number and the adjoining higher bin, if any. A number is counted in a
particular bin if it is equal to or less than the bin number down to the last bin. All values below
the first bin value are counted together, as are the values above the last bin value. If you omit
the bin range, Excel creates a set of evenly distributed bins between the data's minimum and
maximum values.
Labels: Check this if the first row or column of your input range contains labels. Clear this check box if
your input range has no labels; Excel generates appropriate data labels for the output table.
Output Range: Enter the reference for the upper-left cell of the output table. Excel automatically determines
the size of the output area and displays a message if the output table will replace existing data.
New Worksheet Ply: Click to insert a new worksheet in the current workbook and paste the results starting at
cell A1 of the new worksheet. To name the new worksheet, type a name in the box.
New Workbook: Click to create a new workbook and paste the results on a new worksheet in the new
workbook.
Pareto (sorted histogram): Select to present data in the output table in descending order of frequency. If this
check box is cleared, Excel presents the data in ascending order and omits the three rightmost columns that
contain the sorted data.
Cumulative Percentage: Select to generate an output table column for cumulative percentages and to include a
cumulative percentage line in the histogram chart. Clear to omit the cumulative percentages.
Chart Output: Select to generate an embedded histogram chart with the output table.
There are a lot of options, perhaps confusing ones, but the only mandatory option is the input range. In our case,
we need to enter the data range (the cells containing the data) and we need to make sure that the option "Chart
Output" is selected. Excel provides an easy method to select the data range (or many other cell ranges within
dialog boxes).
In the above dialog box, you will notice three "cell selector" icons (for the "Input Range", the "Bin Range", and
the "Output Range").
In our case, you should click on the cell selector icon next to the "Input Range". The "Histogram" dialog box
will temporarily shrink and you can now use the mouse and/or cursor keys to select the appropriate cells by
highlighting them:
When you have selected the appropriate cells, click the "Return" icon in the small "Histogram" dialog to return to
the original "Histogram" dialog box. You should now see the text "$A$1:$A$25" as "Input Range".
Next, we should determine the "bins" that in turn will determine the category boundaries. However, for a "quick"
histogram, we do not need to fill in the bin range and instead Excel will compute the highest and lowest data
point and subdivide the values automatically into evenly divided categories. In other words, for this particular
example we will leave the "Bin Range" empty (we will provide a second example where we manually determine
the categories, or bins).
We can now choose to display a frequency table only, or a frequency table together with a histogram chart,
simply by selecting or de-selecting "Chart Output". We have done that (i.e. frequency table plus chart), and the
output looks similar to the following:
As usual, we can now customize our chart by double-clicking on its components to replace the various titles by
more meaningful names, and removing the "Frequency" label. Our final histogram might look like this:
We have changed the color of the frequency bars to red, the background to light brown, and we have replaced
the various titles by more appropriate names.
In this example Excel determined the categories for our numeric variable (the "bins") automatically. Excel
decided:
• category 1 goes from 0 to 72 and includes 1 measurement
• category 2 goes from 72 to 78.6 and includes 4 measurements
• category 3 goes from 78.6 to 85.2 and includes 9 measurements
• category 4 goes from 85.2 to 91.8 and includes 4 measurements
• category 5 goes from 91.8 to 98.4 and includes 6 measurements
• category 6 includes everything above 98.4 and includes 1 measurement
This means, for example, that on 9 of the 25 days measured the fluoride level was between 78.6 and 85.2.
By the way, I hope you notice that this description of the categories leaves room for interpretation. Where,
for example, would a value go that is right on the border between two bins? For example, would a measurement
of 78.6 fall into category 2 or in 3? What do you think Excel will decide in a borderline case such as this? As a
hint, look at the last two categories and generalize from there.
Practice: Open the Excel spreadsheet linked below. It shows the age of respondents to a survey. Generate a
frequency histogram and determine if the variable is homogeneous or heterogeneous. Use the default number of
categories Excel comes up with.
Excel's histogram tool works well for numeric variables, but in case the variable is not numeric another procedure
works better and we will outline that in a subsequent section. In our next example, however, we assume we do
have a numeric variable and we are interested in defining the categories (aka bins) ourselves. This is usually not
necessary but is useful, particularly for large data sets, but if you are pressed for time you may skip this portion
and perhaps mark it for review at a later time. As a reward for those not averse to a challenge, you'll get to
analyze the salaries of Major League Baseball players over the past decades at the end of this section;
interesting stuff indeed!
Please note that Excel's default categories for numeric variables usually (but not always) work fine, but it is
sometimes necessary to have a specific number of categories. The procedure to define your own categories is
perhaps a little more complicated than our previous procedure but - we will look at it as an opportunity, not a
difficulty ...
Example: A study was done that measured heights of widgets produced in a certain factory. Here are the
results:
3, 2, 5, 1, 4, 11, 3, 8, 23, 2, 6, 17, 5, 12, 35, 3, 8, 23, 6, 14, 41, 7, 16, 47, 8, 18, 53, 10, 22, 65, 9, 20, 59
Construct a frequency table with associated chart using five categories and again using eight categories.
As usual, start Excel and enter the above data, all numbers in one column.
Before we can generate the histogram using the "Data Analysis ..." menu entry we need to perform a few
calculations so that we get the desired number of categories. In the previous example Excel handled the category
selection automatically, but this time we want to specify bounds so that we get exactly 5 (or 8) categories. Here
is what we have to do:
1. Decide on the number of categories you want (usually between 5 and 10)
2. Compute the minimum (smallest) value of our data
3. Compute the maximum (largest) value of our data
4. Compute the range of our data, i.e. range = maximum - minimum
5. Compute the width of each category, i.e. width = range / (number of categories)
6. Find the separation points for each category, given by:
o minimum + 1 * width
o minimum + 2 * width
o ...
o minimum + (n-1) * width
where n is the number of categories we want.
This could be done easily by hand, but Excel provides formulas for doing this as well. In fact, a by-hand
computation (with the help of a simple calculator) is probably easier - and you should most definitely do that - but
since we are committed to Excel, I'll show how to do it with Excel formulas.
In the picture below, we show the results of these computations, together with the formulas that yield the
various numbers:
The formulas, displayed in blue on light brown background, are actually entered in place of the numbers shown in
the above picture, i.e. you will not see those formulas in the actual Excel spreadsheet. To recall how to enter
formulas like these, please refer to a previous section.
Note that there are four category breakpoints, which will result in five categories:
• numbers below 13.8, including 13.8
• numbers bigger than 13.8 and less than or equal to 26.6
• numbers bigger than 26.6 and less than or equal to 39.4
• numbers bigger than 39.4 and less than or equal to 52.2
• numbers bigger than 52.2
Now we proceed just as before, but this time we will select the cells D8 to D11 as "Bin Range" in addition to the
"Input Range" used in the previous example. In other words:
1. Select "Data Analysis ..." from the "Data" ribbon
2. Select "Histogram" and click on "OK"
3. Select as "Input Range" the cells containing the data numbers, in our example A1 to A33
4. Select as "Bin Range" the cells containing the numbers in D8 to D11
5. Select the option "Chart Output" and, by necessity (as mentioned above), also "New Workbook" for the
output
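The breakpoint computation for the widget data (minimum, maximum, range, width, then minimum + k * width) can be cross-checked with a few lines of Python. This is only a sketch of the by-hand procedure, not what Excel does internally. The helper names `breakpoints` and `frequencies` are made up for this illustration.

```python
# Widget heights from the example (33 values).
heights = [3, 2, 5, 1, 4, 11, 3, 8, 23, 2, 6, 17, 5, 12, 35, 3, 8, 23, 6, 14, 41,
           7, 16, 47, 8, 18, 53, 10, 22, 65, 9, 20, 59]

def breakpoints(data, n_categories):
    """Return the n_categories - 1 separation points: minimum + k * width."""
    low, high = min(data), max(data)
    width = (high - low) / n_categories          # width = range / number of categories
    return [round(low + k * width, 1) for k in range(1, n_categories)]

def frequencies(data, cuts):
    """Count values per category; a value equal to a cut falls in the lower category."""
    counts = [0] * (len(cuts) + 1)
    for x in data:
        index = sum(1 for c in cuts if x > c)    # number of cuts strictly below x
        counts[index] += 1
    return counts

cuts5 = breakpoints(heights, 5)
print(cuts5, frequencies(heights, cuts5))

cuts8 = breakpoints(heights, 8)
print(cuts8, frequencies(heights, cuts8))
```

For five categories the sketch yields the breakpoints 13.8, 26.6, 39.4 and 52.2. Changing the argument from 5 to 8 corresponds to changing the number of categories in the spreadsheet.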
Click on "OK" to produce a frequency chart similar to this one:
It is now very simple to create a frequency table and chart with 8 (or any other number of) categories: in the
original spreadsheet, simply change the number of categories from 5 to 8 (which will change the "Width" as well
as the current category cutoff points) and add 3 additional cutoff points (using the appropriate formulas):
Select again "Histogram" from the "Data Analysis ..." entry of the "Data" ribbon, but make sure that the "Bin
Range" now contains the new cutoff points just computed. The final result will look similar to this:
Of course we could again change the corresponding titles, colors, ranges, etc. by double-clicking on them. By the
way, do you think this is a homogeneous or heterogeneous distribution - I hope you're thinking that most values
clump around the category 9, which would indicate it is - if you had to decide - homogeneous. And note that you
would arrive at the same conclusion if you had considered the previous histogram with 5 bins.
Practice: The next Excel spreadsheet contains data for salaries of Major League Baseball players from 1988 to
2011. Open the data file and create a histogram for the salary variable. Think about whether it is actually a good
idea to create this histogram. Perhaps there is some problem with this, some information that you don't quite
capture if you create a histogram for the entire data?
3.5. Bending the Rules
Using graphical data representation provides a great opportunity to visualize data so that it conveys a particular
point of view. This is not cheating, it is simply using some visual aids to make your data appear to support one
particular point of view over another without actually changing the data.
Here is, for example, a table of how much different states spent per student, in dollars, in 1980:
If we insert a vertical bar chart as described in a previous section without picking any options, it might look
similar to the following:
It is easy to see that New Jersey spends the most per student, about twice as much as states like Arkansas or
Mississippi. The difference between New Jersey and Arkansas is pretty clear.
Now suppose we want to give a presentation in which the state of Arkansas looks reasonably good as compared to
the state of New Jersey. We could create a bar chart that minimizes the differences in state spending by using a
particularly "large" scale on the y-axis:
We are also de-emphasizing the empty space that results from choosing a large y-scale by placing the chart title into
that area. In this chart it is still clear that New Jersey spends the most per student - after all, we cannot change the
actual data - but the difference does not look quite so stark any more. As another option, we could remove the
vertical gridlines to make it harder to see exactly how much money the different states actually spend.
Now let's try the opposite: we want to give a presentation in which the state of Arkansas looks very bad as
compared to the state of New Jersey. Thus, we pick a scale on the y-axis that makes sure that the difference
between Arkansas and New Jersey appears as large as possible. In particular, we choose a y-scale that starts at
1000 and ends at 2400, instead of more standard values such as 0 to, say, 3000.
We also picked an "aggressive" color (red) for the Arkansas figure and a "calm" color (green) for New Jersey,
emphasizing the fact that we want to represent Arkansas as "bad" and New Jersey as "good". In this chart Arkansas
looks pretty bad compared to New Jersey - in fact, it seems as if New Jersey spends many times more money per
student than Arkansas - but we have not changed the actual data values.
All three charts represent the same data and they are perfectly valid. Yet visually they tell different stories.
There are many other tricks that are used frequently to represent data in such a way as to support one
particular point of view without outright misrepresenting reality.
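Why a y-axis that starts at 1000 exaggerates differences can be seen with a little arithmetic on the drawn bar heights. The Python sketch below uses two hypothetical spending figures, which are not taken from the table in the text, chosen so that one state spends exactly twice as much as the other. The two axis ranges are the ones mentioned above, and the helper name `drawn_height` is invented for this illustration.

```python
def drawn_height(value, axis_min, axis_max):
    """Fraction of the plot height a bar occupies when the y-axis runs from axis_min to axis_max."""
    return (value - axis_min) / (axis_max - axis_min)

# Hypothetical spending figures (NOT from the table in the text), chosen so that
# one state spends exactly twice as much as the other.
low_spender, high_spender = 1100, 2200

for axis_min, axis_max in [(0, 3000), (1000, 2400)]:
    low_bar = drawn_height(low_spender, axis_min, axis_max)
    high_bar = drawn_height(high_spender, axis_min, axis_max)
    print(f"y-axis {axis_min}-{axis_max}: taller bar is {high_bar / low_bar:.1f} times the shorter one")
```

With the axis starting at 0 the bars keep the true 2-to-1 proportion. With the axis starting at 1000 the same two values are drawn in a much more lopsided proportion, although the data is unchanged.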
Exercise: Suppose you have some data showing the cases of H1N1 flu infections per region as follows:
• Region 01 - Boston: 215
• Region 02 - New York: 229
• Region 03 - Philadelphia: 193
• Region 04 - Atlanta: 301
• Region 05 - Chicago: 1788
• Region 06 - Dallas: 734
• Region 07 - Kansas City: 164
• Region 08 - Denver: 175
• Region 09 - San Francisco: 1080
• Region 10 - Seattle: 420
If you were a health official in Dallas, you might want to use this data to try to get people in your region to
vaccinate against the H1N1 flu. Thus, you are trying to create a chart that emphasizes the number of cases in
Dallas versus the other regions so that your citizens are motivated to get vaccinated. Here are a few suggestions:
Which of these charts do you like the best? Create your own chart to emphasize the figure in "Region 06 - Dallas".
Try this: Check your local newspaper or online news source to find some charts. See if these charts try to
promote any particular point of view or if they are relatively neutral.
3.6. Frequency Charts for Categorical Variables
Often one would like to know the frequency of occurrence of values for a variable in percent. This is similar to a
frequency histogram we studied earlier, but a histogram only applies to numerical variables, while the procedure
outlined in this section will apply to categorical variables. Unfortunately the procedure is somewhat lengthy, but
with a little bit of practice it should not be too bad.
Example: A survey was conducted in the summer of 2004, asking several students in a statistics course a number
of questions about their background and musical taste. The data can be found by clicking on the link below.
Display a bar chart for the race of the students. In other words, compute how many of the students are white,
black, hispanic, etc. and display those figures in a bar chart.
Here is the spreadsheet that contains the results of this survey:
Student Survey
Loading this data into Excel, we see that there is one column of interest, entitled "Race". However, that column
represents a categorical variable (ordinal or nominal?) so we can not compute a frequency histogram. Also, the
values are not numeric so we can't ask Excel to automatically add up all the "hispanics" (for example).
But since that column does contain the data we want to display, we need to learn a new procedure for handling
categorical data. The procedure should automatically count the frequencies of the various races and present
those counts in a bar chart.
Before we figure out how Excel can do this automatically let's simply do it by hand. Inspecting the data we see
that there are 5 categories, White, Black, Hispanic, Pacific Islander, and Other. We type these categories into an
empty part of the Excel spreadsheet and manually count how many people in each category are contained in our
data. We add these counts, or frequencies, next to each category manually:
Now it should be easy to create the appropriate bar chart - make sure to do it, it works just as described in the
previous section on creating simple bar charts.
Our manual procedure barely worked because we did not have that much data. For large data sets we need to
figure out an automatic procedure to create a table of frequencies and the associated bar chart. Fortunately,
Excel has just such a procedure, called a Pivot Table. The Pivot tool is found as the first button of the "Insert"
ribbon. It looks slightly different depending on the version of Excel but the differences are pretty minor and you
should be able to figure it out. For detailed assistance - if you are using Excel 2007 - you could try this helpful
video.
The Pivot tool is actually a lot more flexible than we will need in our course, but it will for sure create the type
of tables that we will be interested in. We will, in fact, see that tool again in subsequent sections.
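The kind of frequency table that is counted by hand above, and that the Pivot tool produces automatically, can be sketched in a few lines of Python. The responses below are made up purely for illustration, since the actual survey spreadsheet is not reproduced here. An empty string stands in for a respondent who left the question blank.

```python
from collections import Counter

# Made-up responses for illustration only; NOT the real survey data.
# An empty string stands for a respondent who left the question blank.
race = ["White", "Black", "White", "Hispanic", "", "Other", "White",
        "Pacific Islander", "Black", "White"]

counts = Counter(race)            # frequency of every distinct response
blank = counts.pop("", 0)         # set the blank responses aside
answered = sum(counts.values())

# Categories in descending order of frequency, with their share of the answered responses.
for category, n in counts.most_common():
    print(f"{category}: {n} ({100 * n / answered:.0f}% of answered)")
print(f"(blank): {blank}")
```

The counts, with or without the blank row, are exactly what a bar chart of the categorical variable would display.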
• Load the above spreadsheet into Excel and click anywhere outside the data area, for example below the
last row of data. If you don't put the cursor into a free cell, the Pivot tool will be disabled.
• Select "Insert" -> "Pivot Table" and choose the entire data table for the Input/Range field, using the by
now familiar range selector. Make sure to pick the entire data table, including the first row containing
labels.
• You can choose to put the resulting pivot table into a new spreadsheet, but for now we can just leave it
at the bottom of the data range.
• Click "OK" to generate the pivot table (which will initially be empty)
You will see a "potential frequency/percentage table" containing labels such as "Drop Row Field Here", "Drop
Column Fields Here", etc., but no data values are yet contained in the table. There will also be a window called
"Field List" containing the available variables from your data, in our case "ID", "Sex", "Weight", "Height", "Race",
etc. You can "drag-and-drop" these variables to the various slots in the table to create a variety of useful tables
for data analysis.
• Drag the variable "Race" from the field list into the "Drop Row Fields" area of the table. Your table will
adjust, showing you all available "Race" categories but as of now no frequencies (counts) yet.
• Next, again drag the variable "Race" from the field list, but this time drag it to the "Drop Data" or "Value"
area in the middle.
You will finally see the counts of how many occurrences fell inside each race category, which of course will turn
out similar to the one we created manually before, except this time it includes the "blank" category (and the
order may be different).
For extra credit, see if you can eliminate the "blank" response row. Hint: maybe you can find a drop-down menu
somewhere where you can 'uncheck' unwanted categories. Also, when you double-click the "Count of race" label
in the table you can specify exactly what type of counts should be shown and in which way it should be
formatted. Try for example to get your counts to appear as percent of the overall total.
• You can now create the bar chart as usual, including or excluding the blank response as you see fit. But it
is even easier: if you position your cursor inside the "pivot" table, an "Analyze" ribbon will appear as one
of the "Pivot Tools" - it will contain a "Pivot Chart" button, use that to create a chart.
In subsequent sections we'll revisit the Pivot Table tool and investigate additional options and possibilities. If you
have problems with the Pivot tool so far, you might want to check out the following video:
Exercise: Please practice using the same data set and create:
1. a bar chart representing the number of males and females in the survey
2. a bar chart representing the number of vegetarians in the survey
3. For review, create a histogram for "weight". The obstacle here will be that there is one (or more) blank
data points for this variable.
4. For more review, create a histogram for "height". In this case you will notice that one guy entered 5 as
height, which does not make sense compared to all the other data points. What should you do with this
exceptional value?
You should of course use the Pivot tool, not count the frequencies manually. For your reference, the charts for
the first two questions are as follows:
The last two questions require a good old histogram, which we have covered before. The apparent problems with
this will be discussed in homework.
End of Module
Thank you!!!!!
Module 4
SAN MATEO MUNICIPAL COLLEGE
Gen. Luna St. Guitnang Bayan I, San Mateo, Rizal
Tel No. (02) 997-9070
www.smmc.edu.ph
Numerical Data Representation
Statistical Analysis with Software Applications
Module Duration
• Week 1 to 3 - Synchronous Meeting and Asynchronous Learning
This module is a combination of synchronous and asynchronous learning and will last for two weeks.
While charts are certainly very nice and often convincing, they do have at least one major draw-back: they are
not very "portable". In other words, if you conduct an experiment measuring cholesterol levels of male and
female patients it is certainly great to create appropriate histograms to illustrate the outcome of your
experiment. However, if you are asked to summarize your results, for example for a radio show or just during a
conversation, these charts will not help much.
Instead you need a simple, short, and easy-to-memorize summary of your data that - despite being short and
simple - is meaningful to others with whom you might share your results.
For example, in our study of levels of cholesterol we could condense the results by stating that the "average"
level of cholesterol for men is X, while the average for women is Y, and most people would understand. Of
course, when we condense data in this way, some level of detail is lost, but we gain the ease of summarizing the
data quickly.
This chapter will discuss some "statistics" that can be used to summarize data numerically while still trying to
capture much of the detailed structure hidden in the data. Among the descriptive statistics we will study are the
mean, mode, and median, the range, variance, and standard deviation, and more detailed descriptors such as
percentiles and skewness. Towards the end of the chapter we will learn about the "box plot" that combines many
of the numerical descriptors in one picture.
Note that this does not mean that the median is (n+1)/2 (if n is odd) but rather that the median is that number
Of course, the idea - ultimately - is to use the sample mean as an estimate for the population mean (which
which can be found at position (n+1)/n.
is usually not known). For now, we will just show examples of computing a mean, and later we will discuss in
detail how exactly the sample mean can be used to estimate the population mean.
Example: A sample of 7 scores from people taking an achievement test was taken. The numbers are:
95, 86, 78, 90, 62, 73, 89
Then the mean of that sample is:
mean = (95 + 86 + 78 + 90 + 62 + 73 + 89) / 7 = 573 / 7 = 81.9
Excel provides a simple function for computing averages, namely the
=average(RANGE)
function. Using Excel, we can compute the above mean by entering the seven data observations into a new spreadsheet, finding a convenient spot to display the average, and entering the appropriate =average(RANGE) function there, where RANGE should be replaced by the appropriate range of cells. Try it out now - the answer should of course be 81.9
Note: In Excel the =average(RANGE) function ignores cells containing no numeric data, i.e. cells that are empty or contain text do not contribute anything to the computation of the mean. Cells that contain a zero, however, do contribute to the average.
The mean applies to numerical variables, and in some situations to ordinal variables. It does not apply to nominal variables.
The median is usually easy to compute when the data is sorted and there are not too many numbers. For unsorted numbers, or for lots of numbers, computing the median becomes quite tedious, mainly because you have to sort the data first. But of course Excel has a built-in function
=median(RANGE)
that will automatically compute the median of the numbers in a given range of cells.
Note: In Excel the =median(RANGE) function ignores cells containing no numeric data, i.e. cells that are empty or contain text do not contribute anything to the computation of the median. Also, for an even number of values the median is automatically computed as the value midway between the two middle numbers.
The median applies to numerical variables, and in some situations to ordinal variables. It does not apply to nominal variables.
Discussion Topic: Discuss how to find the mean and the median of ordinal data, and why neither of these descriptive parameters makes any sense for nominal variables.
The Mode
The mode is the observation that occurs most often. It is usually not unique, and is therefore not used that often, but it has the advantage that it applies to numerical as well as categorical variables. As with the median, the mode is easy to find if the data set is small and sorted:
Example: Scores from a test were: 1, 2, 2, 4, 7, 7, 7, 8, 9. What is the mode?
The mode is 7, because that number occurs more often than any other number.
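As a cross-check outside of Excel, the mean of the seven test scores and the mode of the nine test scores from the examples above can also be computed with Python's standard statistics module. This is only a minimal sketch; the variable names are chosen for illustration.

```python
import statistics

# The seven achievement test scores from the example
scores = [95, 86, 78, 90, 62, 73, 89]
mean_score = statistics.mean(scores)      # 573 / 7
median_score = statistics.median(scores)  # middle value after sorting

# The nine test scores from the mode example
test_scores = [1, 2, 2, 4, 7, 7, 7, 8, 9]
mode_score = statistics.mode(test_scores)

print(round(mean_score, 1), median_score, mode_score)  # 81.9 86 7
```

Unlike the by-hand procedure, statistics.median sorts the data internally, so the scores can be entered in any order.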
Example: Scores from a test were: 1, 2, 2, 2, 3, 7, 7, 7, 8, 9. What is the mode?
This time the modes are 2 and 7, because both numbers occur three times, more often than the other numbers. Variables that are distributed this way are sometimes called bimodal variables.
For data that consists of lots of numbers, and/or data that is not sorted, the mode, like the median, is cumbersome to compute by hand. Of course Excel provides an appropriate formula, in this case the
=mode(RANGE)
function. However, if the cell range contains several numbers with the same highest frequency (i.e. a bimodal variable as in the second example above) then the Excel =mode(RANGE) function returns only the first of them (for sorted data, the smallest) as the mode.
If all values occur exactly once, the Excel mode function returns #N/A for "not applicable".
Mean, Median, and Mode: Pros and Cons
Since there are three measures of central tendency (mean, median, and mode) it is natural to ask which of them is most useful (and as usual the answer will be ... "it depends" -:)
The usefulness of the mode is in the fact that it applies to any variable. For example, if your experiment contains nominal variables then the mode is the only meaningful measure of central tendency (you could of course use frequency histograms to represent your data, as discussed in the previous chapter).
Mean and median usually apply in the same situations, so it is more difficult to determine which one is more useful. To understand the difference between median and mean, consider the following example:
Example: Suppose we want to know the average income of parents of students in this class. To simplify the calculations and to obtain the answer quickly, we randomly select 3 students to form a random sample. Let us consider two possible scenarios:
• Case 1: The three incomes may be, say, 25,000, 30,000, 35,000
• Case 2: The three incomes may be, say, 25,000, 30,000, 1,000,000
Compute mean and median in each case and discuss which one is more appropriate.
The actual computations are pretty simple.
• In case 1 the mean is 30,000 and the median is also 30,000.
• In case 2 the mean is about 351,667, whereas the median is still 30,000.
Clearly we were unlucky in case 2: one set of parents in this sample is very wealthy, but that is - probably - not representative of the students in the class. However, we selected a random sample, so scenario 1 is just as likely as scenario 2. Therefore it seems that the median is actually a better measure of central tendency than the mean, especially for small numbers of observations. In other words, the median is less influenced by a few exceptional values than the mean.
However, for large sample sizes the mean and the median tend to be close to each other anyway, and the mean does have two other advantages:
• the mean is easier to compute than the median since it does not require sorted observations
• the mean has nice theoretical properties that make it more useful than the median
We will use both mean and median in the remainder of this course, while the mode will be less useful for us and will usually be ignored.
Exercise: Find the mean, mode, and median of the salary of Major League Baseball players. Why are they so different? Which one best represents the measure of central tendency? Did we compute the population mean (or median) or the sample mean (or median)?
major league baseball salaries
Incidentally, the measures of central tendency computed above represent population measures, since they took all major league baseball players into account. Had I only used a subset of players to compute mean, mode, and median, the values would be sample measures.
Mean and Median for Ordinal Variables
As I mentioned, the mean and median work best for numerical values, but you can compute them, in a manner of speaking, for ordinal variables as well.
Example: Suppose you want to find out how students like a particular statistics lecture, so you ask them to fill out a survey, rating the lecture "great", "average", or "poor". The 14 students in the class rate the lecture as
"great", "great", "average", "poor", "great", "great", "average", "great", "great", "great", "average", "poor", "great", "average"
Compute the mean, the mode, and the median.
Obviously the mode is "great", since that is the most frequent response. For the other measures of central tendency I have to introduce numeric codes for the responses. I could define, for example:
"great" = 1, "average" = 2, and "poor" = 3
Then my data is equivalent to
1, 1, 2, 3, 1, 1, 2, 1, 1, 1, 2, 3, 1, 2
Now it is easy to see that the average is 22 / 14 = 1.57 and the median is 1.
Of course the actual values for these central tendencies depend on the numeric codes I am using for the original responses. I would need to justify or at least mention the codes I am using in a report so that the answers can be put in proper context. In a proper survey I would in fact list the code values together with the responses. One particular type of response that is frequently used in surveys is a Likert scale.
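The two income scenarios (Case 1 and Case 2) and the coded lecture ratings discussed in this section can be checked with a few lines of Python. The sketch below uses the standard statistics module and the same coding "great" = 1, "average" = 2, "poor" = 3 as in the text; the variable names are illustrative only.

```python
import statistics

# Parent incomes: one very large value moves the mean but not the median
case_1 = [25_000, 30_000, 35_000]
case_2 = [25_000, 30_000, 1_000_000]
for incomes in (case_1, case_2):
    print(statistics.mean(incomes), statistics.median(incomes))

# Ordinal survey responses, coded as in the text
codes = {"great": 1, "average": 2, "poor": 3}
ratings = ["great", "great", "average", "poor", "great", "great", "average",
           "great", "great", "great", "average", "poor", "great", "average"]
coded = [codes[r] for r in ratings]

coded_mean = statistics.mean(coded)      # 22 / 14
coded_median = statistics.median(coded)  # midway between 7th and 8th sorted codes
rating_mode = statistics.mode(ratings)   # the mode needs no numeric coding
print(round(coded_mean, 2), coded_median, rating_mode)
```

Changing the numeric codes changes the coded mean, which is why the codes should always be reported together with the result.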
A Likert scale is a sequence of items (responses) that are usually displayed with a visual aid, such as a horizontal bar, representing a simple scale.
We have seen how to compute mean, mode, and median for numeric data, and how to create frequency tables for categorical variables and histograms for numeric ones. As it turns out, it is possible to compute these measures of central tendency even if only the aggregate data in terms of a frequency table or histogram is available.
Example: Consider the following data set and its frequency table:
3, 2, 5, 1, 4, 11, 3, 8, 23, 2, 6, 17, 5, 12, 35, 3, 8, 23, 6, 14, 41, 7, 16, 47, 8, 18, 53, 10, 22, 65, 9, 20, 59
Category        Count
1 - 13.8        19
13.8 - 26.6     8
26.6 - 39.4     1
39.4 - 52.2     2
52.2 - 65       3
Total           33
Based solely on this table, estimate the mean and compare it with the true mean of the full data set.
• 19 data points are between 1 and 13.8, that is 19 data points are averaging (1+13.8)/2 = 7.4
• 8 data points are between 13.8 and 26.6, that is 8 data points are averaging (13.8+26.6)/2 = 20.2
• 1 data point is between 26.6 and 39.4, or 1 data point averages (26.6+39.4)/2 = 33.0
• 2 data points are between 39.4 and 52.2, so that 2 data points average (39.4+52.2)/2 = 45.8
• 3 data points are above 52.2, or between 52.2 and 65.0, so that 3 data points average (52.2+65)/2 = 58.6
Therefore the sum of all data points is approximately
19*7.4 + 8*20.2 + 1*33 + 2*45.8 + 3*58.6 = 602.6
and the average would be approximately 602.6/33 = 18.26. The true average of the original data is 17.15. Thus, our estimated average is pretty close to the true average.
Of course if you had the original data, you would not need to do this estimation - you would use that data to compute the mean. But there are cases where you only have the aggregate data in table form, in which case you could use this technique to find at least an approximate value for the mean.
Example: A study of salaries of graduates from a University shows their income as follows:
Salary Range            Count
$7,200 - $18,860        130
$18,860 - $30,520       698
$30,520 - $42,180       254
$42,180 - $53,840       16
$53,840 - $65,500       2
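The midpoint technique for estimating a mean from a frequency table can be written out as a short Python sketch. The category boundaries and counts are those of the 33-value example; the names edges, counts, and midpoints are chosen here for illustration.

```python
import statistics

# The 33 original data points from the example
raw = [3, 2, 5, 1, 4, 11, 3, 8, 23, 2, 6, 17, 5, 12, 35, 3, 8, 23,
       6, 14, 41, 7, 16, 47, 8, 18, 53, 10, 22, 65, 9, 20, 59]

# Category boundaries and counts as they would appear in the frequency table
edges = [1, 13.8, 26.6, 39.4, 52.2, 65]
counts = [19, 8, 1, 2, 3]

# Every value in a category is treated as if it sat at the category midpoint
midpoints = [(low + high) / 2 for low, high in zip(edges, edges[1:])]
estimate = sum(m * c for m, c in zip(midpoints, counts)) / sum(counts)

true_mean = statistics.mean(raw)
print(round(estimate, 2), round(true_mean, 2))
```

The estimate differs from the true mean because the values inside a category are usually not spread evenly around its midpoint.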
Estimate the average income. Hint: you may use the following table (of course together with Excel) to get organized.
Salary Range            Range midpoint    Count    Product
$7,200 - $18,860        13030             130      1693900
$18,860 - $30,520       24690             698      17233620
$30,520 - $42,180       36350             254      9232900
$42,180 - $53,840       48010             16       768160
$53,840 - $65,500       59670             2        119340
Total                                     1100     29047920
To estimate the average, we compute the midpoint and product columns in the above table. Then we divide the sum of the products by the sum of the counts to get as average 29047920/1100 = $26,407.20
There is no way to determine the actual average from this table, since you don't really know how the numbers fit into the various intervals. We would need access to the original raw data to find the true mean. It turns out, though, that the true average, using the original data, is $26,064.21, which is indeed close to our estimate. In a similar way you can compute the mean of an ordinal variable. Try some problems.
That settles finding the mean, but how do we find the median or the mode? Well, that is actually much easier than the mean:
• compute the percentages for the frequency table: the category with the largest percentage is the mode
• add a column named "cumulative percent" to the frequency table by computing the sum of the percentages of all categories up to and including the current one: the median is the first category where the cumulative percent is above 50%
Example: Find the median and the mode of the following salary table
Salary Range            Count
$7,200 - $18,860        130
$18,860 - $30,520       698
$30,520 - $42,180       254
$42,180 - $53,840       16
$53,840 - $65,500       2
We add two columns to the table: one containing the frequency as percent and the second containing the cumulative percent:
Salary Range            Count    Percent               Cumulative %
$7,200 - $18,860        130      130/1100 = 11.8%      11.8%
$18,860 - $30,520       698      698/1100 = 63.5%      63.5+11.8 = 75.3%
$30,520 - $42,180       254      254/1100 = 23.1%      75.3+23.1 = 98.4%
$42,180 - $53,840       16       16/1100 = 1.4%        98.4+1.4 = 99.8%
$53,840 - $65,500       2        2/1100 = 0.2%         99.8+0.2 = 100%
Total                   1100     100%
We can now see that the mode is the 2nd category $18,860-$30,520, since it occurs most often at 63.5%, and the median is also the 2nd category, since it is the first one where the cumulative percent is above 50%.
Note that finding the median depends on the fact that the categories are ordered, of course, which means that the variable is ordinal (or numeric in case of a histogram).
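The percent and cumulative percent columns for the salary table, and the resulting mode and median categories, can be reproduced with the following Python sketch. It follows the rule stated above (the median category is the first one whose cumulative percent is above 50%); the variable names are illustrative.

```python
# Salary categories and counts from the frequency table
ranges = ["$7,200 - $18,860", "$18,860 - $30,520", "$30,520 - $42,180",
          "$42,180 - $53,840", "$53,840 - $65,500"]
counts = [130, 698, 254, 16, 2]

total = sum(counts)
percents = [100 * c / total for c in counts]

# Running sum of the percentages, category by category
cumulative = []
running = 0.0
for p in percents:
    running += p
    cumulative.append(running)

# Mode: category with the largest count (equivalently, largest percent)
mode_index = counts.index(max(counts))
# Median: first category whose cumulative percent is above 50
median_index = next(i for i, c in enumerate(cumulative) if c > 50)

for label, c, p, cum in zip(ranges, counts, percents, cumulative):
    print(f"{label:20} {c:5} {p:6.1f}% {cum:6.1f}%")
print("mode:", ranges[mode_index], "| median:", ranges[median_index])
```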
4.3 How to select Random Samples
We have previously introduced the mean and the median. Now we want to see how to use Excel to compute these values for (reasonably) large data sets, as well as learn how to predict the population mean using a sample mean and/or median. First, we need a data set that we can analyze.
Click on the above Excel link to download an Excel spreadsheet that contains data about the salary levels of graduates from the University of Florida in the early 90's. The Excel spreadsheet should look similar to this (only the first few rows are displayed in the picture below):
First, let's find the average as well as the median of the salary level for all graduates in the survey.
1. Go to the end of column C
2. Enter =average(, then use the mouse to select all cells in column C that contain numbers
3. Type ) to close the parenthesis, then hit RETURN
4. Move one cell below the average
5. Enter the formula =median(, then use the mouse to select all cells in column C that contain numbers except, of course, the cell containing the average that was previously computed.
6. Type ) to close the parenthesis, then hit RETURN
7. Add some labels in front of the numbers just computed
Here are the answers (and the formulas used to compute them):
In other words, according to our data we would say that the average salary of all graduates from the University of Florida was approximately $26,000.
Note that the mean and the median are very close together, which is usual for a "balanced" distribution (we'll define that a little later). From the information we have about the data set, we actually do not know if the data really contains all graduates or just a representative sample.
• If the data did include all graduates, $26,064 is the population mean, and there is no statistical error involved.
• If the data did not include all graduates, but a representative sample instead, then $26,064 would be the sample mean, and we would use that as an estimate for the (unknown) population mean. In this case, we really should also provide a margin of error for our estimate - we will do that in a later module.
While Excel can compute the mean and median very quickly for this data set, it would be tedious to do so "by hand". To simplify the computation and to illustrate the difference between population mean and sample mean, we will assume that the Excel data set is the entire information for all recent graduates of the University of Florida and do the following:
• Select 10 salaries at random from that data set. These numbers form a sample of size 10.
• Compute the mean and median of this sample (which is easy to do, even with a calculator)
• Compare this sample mean and median to the actual population mean
Selecting a Random Sample from a Population
The problem in the above example is how to select the 10 numbers "at random". To remove any bias (which we would introduce if we attempted to pick randomly "by hand", say), we will use Excel's random sample selection tool:
Use the "cell selector" icons to select all cells containing salaries, and enter the sample size of 10 in the "Number of Samples:" input field.
Note that in this dialog box the sample of size 10 will go into a new worksheet. Click OK and a sample of size 10 is selected at random. In our case, the sample data is as follows (note that since the sample is selected at random, your numbers will differ from the ones below):
• Now let's use the standard =average(RANGE) and =median(RANGE) functions for this sample set to compute the sample mean and median, as in the following picture:
In other words, the sample mean we computed using 10 salaries is $23,700, and the sample median is $25,750 (again, your numbers will be different since your random sample should be different). Either number is a reasonable estimate for the actual population mean, which was about $26,000.
To complete a valid statistical analysis, we should also provide a maximal error for our estimation, but we will cover that in a later module.
Discussion Topic: If you repeated the above exercise once for really small sample sizes, and again for larger sample sizes, which would give better estimates?
Note that computing the sample mean for only 10 numbers is very easy and the result is pretty close to the actual population mean of the 1,100 salaries. We have therefore achieved a compromise: we use less effort for our computation of the mean, but our answer will be somewhat less accurate. Alternatively, we could expend a lot of effort in the computation of the mean (using the entire population) and as a benefit our result will be totally accurate.
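The same experiment (draw 10 values at random, then compare sample mean and median with the population mean) can be sketched in Python. Note the assumptions: the population below is made-up stand-in data, not the actual salary file, and the seed is fixed only so that repeated runs give the same sample.

```python
import random
import statistics

rng = random.Random(2003)  # fixed seed: an assumption made for repeatability only

# Hypothetical stand-in for the salary column (NOT the real data set)
population = [rng.randint(7_200, 65_500) for _ in range(1_100)]

# Draw 10 salaries without replacement, each equally likely to be picked
sample = rng.sample(population, 10)

print("population mean:", round(statistics.mean(population)))
print("sample mean:    ", round(statistics.mean(sample)))
print("sample median:  ", round(statistics.median(sample)))
```

Removing the seed makes every run draw a different sample, which mirrors the remark that your numbers will differ from the ones shown.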
4.4 Measures of Variability: Range, Variance, and Standard Deviation
In both cases, the mean is 10, indeed. However, the first machine seems to be the better one, since most nails are close to 10 inches. Therefore:
We must find additional numbers indicating the 'spread' of the data.
The Range
The easiest measure of the data spread is the range. It is simply the highest data value minus the lowest data value (we have seen the range before). In the above example, the range is the same for both machines, namely 14 - 6 = 8. The range, while useful, is too crude a measure of variability.
The Variance
We want to find out how much the data points are spread around the mean. To do that, we could find the difference between each data point and the mean, and average these differences. However, we want to measure the differences to the mean regardless of the sign (positive or negative difference). Therefore, we could find the absolute value of the difference between each data point and the mean, and average that. But for theoretical reasons an absolute value function is not easy to deal with, so one chooses a square function instead (which also neutralizes signs). Finally, for yet other theoretical reasons we shall not use the sample size n to compute the average, but instead n - 1.
Hence, we will use this formula to compute the data spread, or variance:
Variance = add up the squares of (data point - mean), then divide that sum by (n - 1)
There are two symbols for the variance, just as for the mean:
• σ² (for the population variance)
• s² (for the sample variance)
We had to use two formulas because one involves the population mean, the other the sample mean. Practically, however, the formula is the same. It is useful to compute the variance at least once "by hand" before we show how to use Excel to accomplish the same feat quickly and easily.
Machine A:
x      x - mean    (x - mean)²
6      -4          16
8      -2          4
8      -2          4
10     0           0
10     0           0
10     0           0
10     0           0
10     0           0
12     2           4
12     2           4
14     4           16
Therefore, the variance for machine A is: (16 + 4 + 4 + 0 + 0 + 0 + 0 + 0 + 4 + 4 + 16) / 10 = 48 / 10 = 4.8
Machine B:
x      x - mean    (x - mean)²
6      -4          16
6      -4          16
6      -4          16
8      -2          4
8      -2          4
10     0           0
12     2           4
12     2           4
14     4           16
14     4           16
14     4           16
Therefore, the variance for machine B is: (16 + 16 + 16 + 4 + 4 + 0 + 4 + 4 + 16 + 16 + 16) / 10 = 112 / 10 = 11.2
In other words, the variance, or spread around the mean, for machine A is 4.8 while machine B has a variance (spread) of 11.2. That means that machine A, as a rule, produces nails that stick pretty close to the average nail length. Machine B, on the other hand, produces nails with more variability than machine A. Therefore, machine A would be much preferred over machine B.
Note: The unit of the variance is the square of the original unit; hence, it is not the best number (considering units). Therefore, one introduces an additional number, called the standard deviation:
The Standard Deviation
The standard deviation is the square root of the variance.
As with the mean, there are two letters for variance and standard deviation:
• σ² and σ for the population variance and standard deviation
• s² and s for the sample variance and standard deviation
The formulas used so far, together with a second formula for the variance, are:
• Computing the mean: mean = sum(x) / n
• Computing the variance: s² = sum((x - mean)²) / (n - 1), or equivalently s² = (sum(x²) - (sum(x))² / n) / (n - 1)
At first this second formula looks much more complicated, but it is actually easier since it does not involve computing the mean first. In other words, using the second formula we can compute the variance (and therefore the standard deviation) without first having to compute the mean.
In our above example of machine B we would compute the variance using this shortcut as follows:
x      x²
6      36
6      36
6      36
8      64
8      64
10     100
12     144
12     144
14     196
14     196
14     196
sum(x) = 110     sum(x²) = 1212
so that the variance is (1212 - 110² / 11) / 10 = (1212 - 1100) / 10 = 11.2, as before.
Using Excel to compute Range, Variance, and Standard Deviation
Example: Use the above formulas to compute the mean, the range, the variance, and the standard deviation of the salaries of graduates for the University of Florida. The data set (in Excel format) can be obtained by using the Excel link provided.
All that is involved here is adding the appropriate formulas to the Excel worksheet. The results (including the formulas) are displayed below:
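To check the hand computations for the two machines, here is a small Python sketch that evaluates the variance both from the definition and from the shortcut formula, dividing by n - 1 in each case as in the text. The function names are made up for this illustration; statistics.variance from the standard library uses the same n - 1 convention and serves as a reference.

```python
import math
import statistics

# Nail lengths (in inches) for the two machines
machine_a = [6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14]
machine_b = [6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14]

def variance_definition(data):
    """Sum of squared differences to the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def variance_shortcut(data):
    """Shortcut: needs only sum(x) and sum(x^2), not the mean."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

for name, data in (("A", machine_a), ("B", machine_b)):
    var = variance_definition(data)
    print(name,
          "range:", max(data) - min(data),
          "variance:", round(var, 2),
          "shortcut:", round(variance_shortcut(data), 2),
          "std dev:", round(math.sqrt(var), 2))
```

Both machines have the same range (8) and mean (10), so only the variance and standard deviation reveal that machine B is less consistent.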
4.5 Quartiles and Percentiles
At this point we can describe the results of an experiment using 2 numbers (or parameters): the mean (or median) and the standard deviation (computed from the variance). That will tell us the "center" of the distribution of values (mean) and the "spread" around that center (standard deviation). For example, if we measure the height of US army soldiers we could say that the average height of US soldiers is 1.73 meters, with a standard deviation of 0.15 meters (the numbers are made-up). This gives you a reasonable idea about what a generic soldier looks like (he/she is about 1.73 m tall) and how much variation from that generic look there is. To describe the distribution in more detail we need additional descriptive measures.
The lower quartile Q1 is that number such that 25% of the values are less than it and 75% are larger, and the upper quartile Q3 is that number such that 75% of the values are less than it and 25% are larger.
Following this notation, the median should actually be called the "middle quartile" Q2, since it is that number such that 50% are less than it and 50% are larger.
NOTE: To find the quartiles, you must first sort your data (similar to finding the median).
Example: Find the quartiles of the values 1, 2, 3, 4, 5, 6, 7.
The numbers are already sorted, so that it is easy to see that the median is 4 (three numbers are less than 4 and three are bigger). In other words, 4 splits our numbers up into the set of smaller numbers {1, 2, 3} and the set of larger ones {5, 6, 7}. The quartiles, in turn, split up these sets in the middle, so that Q1 = 2 and Q3 = 6.
Note that the numbers 1, 2 are less than or equal to the lower quartile, while 2, 3, 4, 5, 6, 7 are larger than or equal to Q1. Therefore, 2 out of 7 or about 29% of values are less than or equal to Q1 and 6 out of 7 or about 86% are larger than or equal to Q1.
Example: Find the quartiles of the values 1, 2, 3, 4, 5.
Now the median is 3, leaving two sets {1, 2} and {4, 5}. To split these numbers in the middle does not work, so it is not immediately clear what the quartiles are.
• If Q1 = 1, then one value out of 5 is less than or equal to Q1, or 20%. That is less than the required 25%, so Q1 must be bigger than 1.
• If Q1 = 2, then two values out of 5 are less than or equal to Q1, or 40%. Similarly, 4 values out of 5, or 80%, are larger than or equal to Q1, so that the lower quartile is 2.
To find the lower quartile Q1:
• Sort all observations in ascending order
• Compute the position L1 = 0.25 * N, where N is the total number of observations.
• If L1 is a whole number, the lower quartile is midway between the L1-th value and the next one.
• If L1 is not a whole number, change it by rounding up to the nearest integer. The value at that position is the lower quartile.
To find the upper quartile Q3:
• Sort all observations in ascending order
• Compute the position L3 = 0.75 * N, where N is the total number of observations.
• If L3 is a whole number, the upper quartile is midway between the L3-th value and the next one.
• If L3 is not a whole number, change it by rounding up to the nearest integer. The value at that position is the upper quartile.
Examples: Find the quartiles for the values 1, 2, 3, 4, 5, 6, 7 and also for the values 1, 2, 3, 4, 5 using this new method.
For the set 1, 2, 3, 4, 5, 6, 7 we have N = 7. Thus:
• L1 = 0.25 * 7 = 1.75, which gets rounded up to 2. Thus, I take the number in the 2nd position (i.e. 2) to be the lower quartile.
• L3 = 0.75 * 7 = 5.25, which gets rounded up to 6. Thus, I take the 6th number (i.e. 6) to be the upper quartile.
For the set 1, 2, 3, 4, 5 we have N = 5. Thus:
• L1 = 0.25 * 5 = 1.25, which gets rounded up to 2. Thus, I again take the number in the 2nd position (i.e. 2) to be the lower quartile.
• L3 = 0.75 * 5 = 3.75, which gets rounded up to 4. Thus, I take the 4th number (i.e. 4) to be the upper quartile.
Percentiles
Quartiles are useful and they help to describe the distribution of values as we will see later. However, we often want to know how one particular data value compares to the rest of the data. For example, when taking standardized tests such as the SAT, I want to know not only my own score, but also how my score ranks in relation to all scores. Percentiles are perfect for this situation.
The k-th percentile is that number such that k% of all data values are less and (100 - k)% are larger than it. More precisely, at least k% of the sorted values are less than or equal to it and at least (100 - k)% of the values are greater than or equal to it.
Note: The lower quartile is the same as the 25th percentile, the median is the same as the 50th percentile, and the upper quartile is the same as the 75th percentile.
To find the k-th Percentile:
• Sort all observations in ascending order
• Compute the position L = (k/100) * N, where N is the total number of observations.
• If L is a whole number, the k-th percentile is the value midway between the L-th value and the next one.
• If L is not a whole number, change it by rounding up to the nearest integer. The value at that position is the k-th percentile.
Example: Consider the following cotinine levels of 40 smokers:
0    87   173  253  1    103  173  265  1    112
198  266  3    121  208  277  17   123  210  284
32   130  222  289  35   131  227  290  44   149
234  313  48   164  245  477  86   167  250  491
Find the quartiles and the 40th percentile.
First note that before we start our computations we must sort the data - computing percentiles for non-sorted data is the most common mistake (so please avoid it). Here is the same data again, this time sorted:
0    1    1    3    17   32   35   44   48   86
87   103  112  121  123  130  131  149  164  167
173  173  198  208  210  222  227  234  245  250
253  265  266  277  284  289  290  313  477  491
Now we can do our calculations, where N = 40 (number of values in our data set).
• Lower Quartile: 0.25 * 40 = 10, so we need to take the value midway between the 10th value, which is 86, and the 11th value, which is 87. Hence, the lower quartile is (86 + 87) / 2 = 86.5
• Upper Quartile: 0.75 * 40 = 30, so we need to take the value midway between the 30th value, which is 250, and the 31st value, which is 253. Hence, the upper quartile is (250 + 253) / 2 = 251.5
• 40th Percentile: 0.4 * 40 = 16, so the 40th percentile is midway between the 16th and 17th values: (130 + 131) / 2 = 130.5
However, for percentiles another question is usually asked: given a particular value, find the percentile that corresponds to this value. In other words, determine how many values are less and how many values are larger than the particular value.
• percentile value of x = (number of values less than or equal to x) / (total number of values) * 100
Example: Suppose you took part in the above study of cotinine levels, and your personal cotinine level was 245. What is the percentile value of 245, and how many people in the study had a higher cotinine level than you?
First note that in our sorted data the value 245 is in 29th position (I used the sorted data, of course). Therefore, according to our formula:
percentile value of 245 = 29 / 40 * 100 = 72.5
Thus, by definition of percentiles, 72.5% of values are less than or equal to 245 while (100 - 72.5) = 27.5% are larger than 245. In other words, 11 of the 40 people in the study had a higher cotinine level.
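The position rule for quartiles and percentiles (midway between two values if the position is a whole number, otherwise round the position up) can be sketched in Python as follows. The function names are hypothetical, k is assumed to be a whole number between 1 and 99, and percentile_value counts the value itself, which is how the 29th position of 245 turns into 72.5. Excel's built-in functions may interpolate differently and need not return the same numbers.

```python
def percentile(data, k):
    """k-th percentile by the position rule described in the text (integer k, 1..99)."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    values = sorted(data)
    n = len(values)
    if (k * n) % 100 == 0:
        # Position L = k/100 * n is a whole number: go midway to the next value
        position = k * n // 100
        return (values[position - 1] + values[position]) / 2
    # Otherwise round the position up and take the value found there
    position = k * n // 100 + 1
    return values[position - 1]

def percentile_value(data, x):
    """Percent of values that are less than or equal to x."""
    return 100 * sum(1 for v in data if v <= x) / len(data)

cotinine = [0, 87, 173, 253, 1, 103, 173, 265, 1, 112,
            198, 266, 3, 121, 208, 277, 17, 123, 210, 284,
            32, 130, 222, 289, 35, 131, 227, 290, 44, 149,
            234, 313, 48, 164, 245, 477, 86, 167, 250, 491]

print(percentile(cotinine, 25), percentile(cotinine, 75), percentile(cotinine, 40))
print(percentile_value(cotinine, 245))
```

Because the function sorts a copy of the data itself, the unsorted table can be passed in directly.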
Of course Excel can be used to find percentiles, and therefore upper and lower quartiles (which are just the 75th and 25th percentile, respectively).
The Excel function to compute percentiles is "=percentile(RANGE, K)", where RANGE is a range of cells and K is the percentile to compute as a decimal number between 0 and 1. The data does not have to be sorted, Excel can handle it automatically.
The Excel function to compute the rank of a value x in a data set as a percentage of the data set (in other words, the percentile value of x) is "=percentrank(RANGE, X)". The data does not have to be sorted, Excel can handle it automatically.
For example, the function "=percentile(A1:A10, 0.4)" computes the 40th percentile of the values in the cells A1 to A10, while "=percentrank(A1:A10, 0.4)" computes the percentile value of the number 0.4 within the values in those cells.
Example: The following Excel spreadsheet contains some data about life expectancy and literacy rates in about 100 countries of the world in 1995. Compute the mean, median, variance, standard deviation, and upper and lower quartile of the life expectancy and percentage of people who read. What is the percentile value for life expectancy in Japan, the USA, and in Afghanistan?
Life Expectancy Data
We use the formulas "average", "median", "var", and "stdev" as introduced previously to compute the various descriptive statistics. The new formula "percentile" is used to compute the quartiles as well as the 40th percentile. Note that the data does not have to be sorted when using these formulas, Excel will take care of that problem automatically.
To find the relative ranking (aka percentiles) for Japan, the USA, and Afghanistan we use the "percentrank" function where we substitute the life expectancy for the respective countries for x:
Since these numbers are in percent, we have:
• Afghanistan is at the 5.6th percentile in life expectancy, i.e. about 5.6% of countries have shorter, 94.4% have longer life expectancy than Afghanistan
• Japan is at the 100th percentile in life expectancy, i.e. all other countries have shorter, nobody has longer life expectancy than Japan
• USA is at the 77.3th percentile in life expectancy, i.e. about 77.3% of countries have shorter, 22.7% have longer life expectancy than the USA
Example: To practice, use the previous life expectancy data and compute the mean, mode, median, variance, and standard deviation, the max and min values, and the upper and lower quartiles.
4.6 Box Plot and Skewed Distributions
By now we have a multitude of numerical descriptive statistics that describe some feature of a data set of values: mean, median, range, variance, quartiles, percentiles, ranks, etc. There are, in fact, so many different descriptors that it is going to be convenient to collect many of them in a suitable graph called the Box Plot.
The Box Plot, sometimes also called "box and whiskers plot", combines the minimum and maximum values (and therefore the range) with the quartiles into one useful graph. It consists of a horizontal line, drawn according to scale, from the minimum to the maximum data value, and a box drawn from the lower to upper quartile with a vertical line marking the median.
It might sound pretty convoluted, so to see how it works it is best to consider an example.
Example: In an earlier example we considered the cotinine levels of 40 smokers. Draw a box plot for that data.
We already computed the lower and upper quartiles to be Q1 = 86.5 and Q3 = 251.5, respectively. It is easy to see that the minimum is 0 and the maximum is 491. A quick computation shows that the median is 170. The corresponding box plot therefore looks as follows:
For some data sets you will see some points beyond the max/min value of the whisker. Those points are outliers;
they are exceptionally small or large as compared to the rest of the data. Technically these outliers are the
max/min, but they would distort the box plot too much. The exact definition of an outlier will be provided
below.
Note that the difference between the upper and lower quartile is called the Inter Quartile Range, or IQR. It is
used to define outliers (see below).
Example: Find the IQR for the Life Expectancy data above.
We know from the box plot of the Life Expectancy data that the "lower hinge" is 63.5 and the "upper hinge" is 76. By definition, that means that the quartiles are Q1 = 63.5 and Q3 = 76. That makes the Inter Quartile Range IQR = 76 - 63.5 = 12.5
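The five numbers a box plot is drawn from, together with the range and the IQR derived from them, can be kept together in a small Python structure. This is a sketch only: the class name BoxPlotSummary is made up for illustration, and the values entered are the ones already computed in the text for the cotinine data and the Life Expectancy hinges.

```python
from dataclasses import dataclass

@dataclass
class BoxPlotSummary:
    """The five values marked in a box plot."""
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def data_range(self):
        # Length of the whole line from minimum to maximum
        return self.maximum - self.minimum

    @property
    def iqr(self):
        # Length of the box: upper quartile minus lower quartile
        return self.q3 - self.q1

# Cotinine levels of 40 smokers, using the values computed earlier
cotinine = BoxPlotSummary(minimum=0, q1=86.5, median=170, q3=251.5, maximum=491)
print(cotinine.data_range, cotinine.iqr)

# Life Expectancy data: only the hinges are given in the text
life_expectancy_iqr = 76 - 63.5
print(life_expectancy_iqr)
```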
boxplot.xls
When you open the file, Excel will show you a worksheet with a finished box plot already, and a column on the right in green where you can enter or paste your data. Simply delete the data currently in that column and replace it with your new data to create a new plot. The box plot will update automatically.
Example: Create a box plot for the Life Expectancy by country that we considered before.
We first need to open the Life Expectancy data file - click on the icon below for the data file.
When the spreadsheet opens up, mark all numeric data in column B (the Life Expectancy column), not including the column header, and copy them to the clipboard (for example, press CTRL-C). Then open the boxplot.xls spreadsheet and position your cursor on the first data value in column M. Paste the copied data values (for example, press CTRL-V) into that column and the box plot will automatically update itself so that you should see the following picture:
A histogram (distribution) is called bell-shaped or normal if it looks similar to a "bell curve". Most data points fall in the middle, there are few exceptionally small and few exceptionally large values.
A histogram (distribution) is called skewed to the right if most values are small but there are a few exceptionally large ones, so that the right tail is longer.
A histogram (distribution) is called skewed to the left
You can tell the shape of the histogram (distribution) - in many cases at least - by just looking at the box plot, and
you can also estimate whether the mean is less than or greater than the median. Recall that the mean is
impacted by especially large or small values, even if there are just a few of them, while the median is more
stable with respect to exceptional values. Therefore:
• If the distribution is normal, there are few exceptionally large or small values. The mean will be about
the same as the median, and the box plot will look symmetric.
• If the distribution is skewed to the right, most values are 'small', but there are a few exceptionally large
ones. Those large exceptional values will impact the mean and pull it to the right, so that the mean will
be greater than the median. The box plot will look as if the box was shifted to the left so that the right
tail will be longer, and the median will be closer to the left line of the box in the box plot.
• If the distribution is skewed to the left, most values are 'large', but there are a few exceptionally small
ones. Those exceptional values will impact the mean and pull it to the left, so that the mean will be less
than the median. The box plot will look as if the box was shifted to the right so that the left tail will be
longer, and the median will be closer to the right line of the box in the box plot.
In short:
• a longer tail on the left means skewed to the left, so the mean is to the left of the median (smaller)
• a longer tail on the right means skewed to the right, so the mean is to the right of the median (larger)
• tails that are equally long mean normal, so the mean is about equal to the median
Example: Here is some (fictitious) data in an Excel sheet for three variables named varA, varB, and varC.
distribution-data.xls
Create a box plot for the data from each variable and decide, based on that box plot, whether the distribution
of values is normal, skewed to the left or skewed to the right, and estimate the value of the mean in relation to
the median. Then compute the values and compare them with your conjecture.
One of the data columns results in the following box plot and interpretation based on it:
Distribution is shifted to the left, the mean should be less than the median (the exact numbers are: mean = 0.3319,
median = 0.4124).
The other data column has the following box plot and interpretation based on it:
Distribution is shifted to the right, the mean should be greater than the median (the exact numbers are: mean =
-0.3192, median = -0.4061)
The final data column has the following box plot and interpretation based on it:
Distribution is (approximately) normal, mean and median should be similar (the exact numbers are: mean = 0.013,
median = 0.041)
Unfortunately I forgot to write down which of these cases correspond to varA, varB, and varC - can you figure it
out :-)
Box Plot, Outliers, and Standard Deviation
We have seen that even though the box plot does not explicitly include the mean, it is possible to get an
approximate idea about it by comparing it against the median and the skewness of the box plot:
• if the distribution is skewed to the left, the mean is less than the median
• if the distribution is skewed to the right, the mean is bigger than the median
In a somewhat similar fashion you can estimate the standard deviation based on the box plot:
• s is approximately equal to range / 4
• s is approximately equal to 3/4 * IQR
Both estimates work best for normal distributions, i.e. distributions that are not skewed, and the first
approximation works best if there are no outliers. We will later determine additional relations involving the
standard deviation for normally distributed data. That reminds me: another useful application for the IQR is to
define outliers:
• outliers are data points that fall below Q1-1.5*IQR or above Q3+1.5*IQR
Example: Consider the above data on cotinine levels of 40 smokers. Find the IQR and use it to estimate the
standard deviation. Also, identify any outliers.
The data ranges from 0 to 491 (from min to max), while Q1 = 86.5 and Q3 = 251, so that IQR = 251 - 86.5 = 164.5.
Thus, we have two estimates for the standard deviation:
• s is approximately equal to range / 4 = 491 / 4 = 122.75
• s is approximately equal to 3/4 * IQR = 0.75*(251-86.5) = 123.375
The two estimates agree closely, and since the true standard deviation is 119.5, they are both pretty close to the
actual value. The best part of these estimates is, however, that they are so very simple to compute and thus they
give you a quick ballpark estimate for the standard deviation.
Outliers would be data values below Q1 - 1.5*IQR = 86.5 - 1.5 * 164.5 = -160.25 or above Q3 + 1.5*IQR =
251 + 1.5 * 164.5 = 497.75, but all data values lie between 0 and 491.
So there are no outliers in this case (which is one reason why the estimate of range / 4 works pretty well).
Example: Find all outliers for the life expectancy data above.
For that data set we found that IQR = 76 - 63.5 = 12.5 and therefore outliers would be data values:
• above Q3 + 1.5*IQR = 76 + 1.5 * 12.5 = 94.75
• below Q1 - 1.5*IQR = 63.5 - 1.5 * 12.5 = 44.75
Thus, the three data points for Uganda (42), Cent. Afri. R (43), and Tanzania (43) are outliers below, while there
are no outliers above. Note that since there are outliers, the range/4 estimate for the standard deviation should
not work as well as the estimate based on the IQR. Confirm that!
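For readers who want to check the box plot arithmetic of this section outside of Excel, here is a minimal Python sketch. It uses the quartiles quoted in the text for the cotinine and life expectancy examples; the function names and the short list of life expectancy values are made up for illustration and are not part of the course files.

```python
def iqr_summary(q1, q3, minimum, maximum):
    """Return the IQR, the outlier cutoffs, and two quick estimates of s."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "lower_cutoff": q1 - 1.5 * iqr,   # values below this are outliers
        "upper_cutoff": q3 + 1.5 * iqr,   # values above this are outliers
        "s_from_range": (maximum - minimum) / 4,
        "s_from_iqr": 0.75 * iqr,
    }


def find_outliers(values, q1, q3):
    """List all values outside the 1.5*IQR cutoffs."""
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Cotinine example: min 0, max 491, Q1 = 86.5, Q3 = 251 (numbers from the text)
cotinine = iqr_summary(q1=86.5, q3=251, minimum=0, maximum=491)

# Life expectancy example: Q1 = 63.5, Q3 = 76 (numbers from the text);
# the list below is only a hypothetical excerpt of the data
life_outliers = find_outliers([42, 43, 43, 50, 76, 79], q1=63.5, q3=76)

print(cotinine)
print(life_outliers)
```

The sketch takes the quartiles as given instead of computing them, since different programs use slightly different quartile rules.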
4.7 Descriptive Statistics in Excel
Excel provides a convenient tool to compute many of the most commonly used descriptive statistics such as
mean, mode, median, variance, and standard deviation all at once.
Example: The following Excel spreadsheet contains some data about life expectancy and literacy rates in about
100 countries of the world in 1995. Compute the mean, mode, median, variance, standard deviation, and range
of the two variables.
First, as usual, we need to load the data into Excel. The spreadsheet should look similar to the following:
To compute a variety of descriptive statistics all in one swoop, we proceed as follows:
• Select "Data Analysis ..." from the "Tools" menu entry and select "Descriptive Statistics":
• Enter the Input Range for the data, i.e. place the mouse over column B, click and hold the mouse button,
then drag the mouse over column C as well. Both columns B and C should now be selected. Make sure
there is a checkmark next to "Summary Statistics" in the "Output Options".
Make sure that you also check the box "Labels in First Row", then click on "OK".
• After clicking on "OK", Excel will compute a variety of descriptive statistics all at once and display them in
a new worksheet, as follows:
We can see, for example, that for "Life Expectancy" we have computed the mean to be 67.48, the
median to be 71, and the mode to be 76. The standard deviation is 9.96, the variance is 99.14 and the range is
37.
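The same summary can be sketched in a few lines of Python with the standard `statistics` module, in case you want to cross-check Excel. The data values below are hypothetical (they are not the life expectancy file), and the sketch assumes that the sample variance and sample standard deviation (dividing by n - 1) are wanted.

```python
import statistics


def describe(values):
    """Compute the descriptive statistics discussed above for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "stdev": statistics.stdev(values),        # sample standard deviation
        "range": max(values) - min(values),
    }


# Hypothetical data for illustration only
example_values = [70, 72, 76, 76, 65, 58]
summary = describe(example_values)
print(summary)
```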
These descriptive statistics computed by Excel are familiar, and Excel computes a number of additional values
such as range, minimum, etc., that are self-explanatory except for "Kurtosis" and "Skewness". We will ignore
Kurtosis, but we actually know how to interpret skewness:
• If the skewness is negative, the histogram (distribution) for the data is skewed to the left
• If the skewness is positive, the histogram (distribution) for the data is skewed to the right
• If the skewness is approximately zero, the histogram (distribution) for the data is symmetric and usually
normal
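To see how the sign of the skewness behaves, here is a small Python sketch. It uses the adjusted sample skewness formula n / ((n-1)(n-2)) * sum(((x - mean) / s)^3); as far as I know this is the formula behind Excel's skewness value, but treat that as an assumption and compare against your own Excel output. The three data lists are made up for illustration.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Adjusted sample skewness; needs at least three values."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three data values")
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


# Hypothetical data sets
right_skewed = [1, 2, 2, 3, 3, 3, 20]         # a few exceptionally large values
left_skewed = [-20, -3, -3, -3, -2, -2, -1]   # a few exceptionally small values
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_skewed))  # positive
print(sample_skewness(left_skewed))   # negative
print(sample_skewness(symmetric))     # approximately zero
```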
Example: Consider the data set
distribution-data.xls
that we analyzed in the previous section about box plots and skewed distributions, and compare the skewness
coefficient with the results of your analysis in the previous section.
Loading that data set into Excel and running the "Descriptive Statistics" for all three columns simultaneously
yields:
• You can see that "varA" has a negative skewness of -0.65. Thus, the histogram for varA should be skewed
to the left and the mean should be smaller than the median. Thus, the first box plot in our previous
analysis corresponds to varA.
• "varB" has a skewness close to zero so that its distribution should be normal and mean and median should
be similar. Thus, the third box plot from the example in the previous section corresponds to varB.
• "varC" has a positive skewness so the distribution would be skewed to the right and the mean should be
greater than the median. Therefore the second box plot from our earlier example describes varC.
End of Module
Thank you!!!!!
Module 5
Percentage Tables
and Cross Tabs
Statistical Analysis with
Software Applications
SASA
Module Duration
This module is a combination of
synchronous and asynchronous
learning and will last for two weeks.
As usual, Excel will be just the tool for this job. Load the following spreadsheet into Excel containing the results
of this survey:
Selected Employees
After loading this data we see that there is one column of interest, entitled "Salary Level". However, that column
represents a categorical variable (ordinal or nominal?) so we cannot compute a frequency histogram. So, we need
to learn a new procedure for handling categorical data.
The appropriate tool to create percentage tables for categorical data is the "Pivot Table ...". Actually, the
PivotTable is more flexible than we will need in our course, but it will certainly create the type of tables that we
will be interested in. We have, of course, already seen the Pivot table (and chart) tool in 3.6, but now we'll
explore a few more options. And, we will, in fact, see that tool again in the next section(s).
The Pivot tool is found as the first button of the "Insert" ribbon. Load the above spreadsheet into Excel, select
"PivotTable ..." from the "Insert" ribbon, and select the entire data set, all columns and rows.
First drag the variable "Salary Level" from the Pivot Fields into the "Drop Row Fields" area of the table. Your table
will adjust, showing you all available salary levels. Next, drag the variable "Salary Level" again, but this
time drag it to the "Drop Data Item" area in the middle (or to the "Values" area on the right). You will then see the
counts of how many occurrences fell inside each salary level, similar to the following:
Note that your picture might look slightly different. Make sure, though, that you see two "buttons", one called
"Salary Level" (which applies to the first column) and one called "Count of Salary Level" (which should apply to the
second column). We see, for example, that there were 33 people in the salary range of $10K to $20K, 230 in the next
salary range, and so forth. Of course raw numbers are not so useful, so we would like to convert them to
percentages.
You can adjust what exactly is shown in the "data" area of your table by double-clicking the link entitled "Count
of Salary Level" that you see above. Note that you do need to double-click "Count of Salary Level", not "Salary
Level". You will see the following dialog:
Here you can specify what computation you want to show in the data area of your table. In most cases, Count is
the preferred selection, but you do need to change the format for the count. Click the "Show Values as" tab and
select "% of Column Total", as shown:
Hit OK and you should see our complete frequency chart for the ordinal variable "salary level":
From the table we can see, for example, that 6.96% + 48.52% = 55.48% of employees earn $30,000 or less.
Example: Use the same Excel data set to find the percentage of males and females that took part in this survey,
as well as the percentage of the various job categories.
Here are the resulting tables so that you can check your own results:
• Percentages of male/females in the survey
• Percentages of job categories in the survey
Note: Once you have such a table - Excel calls it a Pivot Table - you can change the categories (variables) to be
displayed by dragging them in and out of the table. Maybe you could try to experiment to create a percentage
table relating salary level with gender (sex), i.e. work with two variables simultaneously.
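What the Pivot Table does for a single categorical variable can be sketched in Python as well: count how often each category occurs, then divide by the total. The category labels and the tiny sample below are hypothetical; only the counts 33 and 230 out of 474 employees come from the text.

```python
from collections import Counter


def percentage_table(values):
    """Map each category to (count, percent of total)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, round(100 * count / total, 2))
        for category, count in counts.items()
    }


# Hypothetical mini-survey with made-up category labels
sample = ["$10K-$20K", "$20K-$30K", "$20K-$30K", "$30K-$40K"]
table = percentage_table(sample)
print(table)

# Check of the percentages quoted in the text: 33 and 230 out of 474 employees
share_low = round(100 * 33 / 474, 2)
share_next = round(100 * 230 / 474, 2)
print(share_low, share_next, round(share_low + share_next, 2))
```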
5.2 Percentage Tables
So far we have analyzed data one variable at a time. We have seen how to compute mean, mode, median, and
variance, but each formula only applied to one variable at a time. Now we want to investigate two (or more)
variables simultaneously. Usually, a typical question about two variables is:
Is there some relation between one variable and another one, and if so, how can one use knowledge about one
variable to predict, approximately, the other?
Answers to such questions can be very useful:
• if smoking causes cancer, we should stop smoking
• if having an advanced college degree increases the chance to have a well-paying job, we should try our
best to graduate college
• if exercising and working out increases our general state of health, we should exercise and work out
regularly
• if a new drug really does have a positive impact on lowering blood pressure, we should take it if we have
high blood pressure
In most cases the "if" part is the difficult one to determine, i.e. it is not so easy to find out whether two variables
(for example smoking and cancer) are indeed related, and even if they are related, it is even harder to determine
which is cause and which is result (if smokers have higher cancer rates, does smoking cause cancer, or does
having cancer cause you to smoke). In general, correlation does not necessarily imply causation. Here are two
examples of incorrectly inferring causation from correlation (see http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation):
• The more firemen fighting a fire, the bigger the fire is observed to be. Therefore firemen cause an
increase in the size of a fire.
• Sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping
with one's shoes on causes headaches.
But let's start at the beginning. We will start our investigation about relationships between variables by taking a
closer look at representing data in tables. We won't worry about correlation or causation for now.
Example: The residents of Green Township were asked what their opinion about a new Zoning Ordinance was.
The answers were broken down by age of the people who were questioned. The result of the survey is
summarized in the following table:
                         age 50 or under   age over 50   Total
For Zoning                            92            87     179
Against or no opinion                158            75     233
Total                                250           162     412
This table can, of course, be entered into Excel directly, and using a few tools that Excel provides the data entry
and formatting is quick and effortless.
• First, let's enter the "raw" data, i.e. all data that is actually collected as opposed to computed data. Our
spreadsheet will look similar to this:
Note that some labels may not be completely visible - we will rectify that later automatically.
• Next, we will ask Excel to compute the totals for us "on the fly". Excel provides a very convenient
button for that on the "Home" ribbon: the "Auto Sum" button. Position the cursor in the cell for the
first row total and press "Auto Sum". Excel will indicate the cells that it is going to sum up, which
should be all cells to the left (of course you could also enter the "=sum()" formula and pick the range
manually, but the 'auto sum' tool is quicker in this case).
• Press ENTER to accept the choice and Excel will automatically compute the total and enter it in the
appropriate cell.
• Keep on using the "Auto Sum" button until all totals are computed. In other words, each time position the
cursor first in the cell that will contain the sum, then press "Auto Sum", then press ENTER.
• You can optionally format your table to make it look nice. Select "Format as Table" on the "Home" ribbon
and pick a format you like. Here is the final table, nicely formatted (with all labels visible):
Converting to Row Percentage
Let's convert each number into the appropriate row percentages, i.e. we will use the row total to convert a
number into percent:
First, we will copy the original table to a new location just below it:
To fill in the copied table we have two options:
• you could convert the numbers to percentages "by hand" (well, by hand would mean using a regular
calculator)
• you could convert the numbers to percentages using Excel
Excel would do the computations for you, of course, but in this case the power of Excel might actually be
overkill. So, we will first convert the figures by hand, using a normal calculator, and then - as an appendix for the
Excel enthusiast - we'll show how to use Excel if you insist.
In cell B8 the 'raw data' is 92, indicating that there were 92 people age 50 or under who were 'for' zoning. The
total in row 8 is 179, indicating that a total of 179 people were 'for' zoning. Thus, the row percentage of people
'for' zoning who are age 50 or under is 92 / 179 = 0.514, or 51.4%.
Finally, it is simple to get Excel to format the numbers in cells B8:D10 as percentages:
1. Mark the cells containing the ratios in the second table (B8 to D10)
2. Click the "%" symbol on the "Home" ribbon, or choose "Percentage" from the drop down menu in the
"Number" group.
Converting to Column Percentage
Similarly, we can create a third table, containing column percentages. The details are left as an exercise, but for
the calculations we need to convert the raw data values by dividing by the column totals instead of
the row totals. If we do everything correctly our three tables (properly formatted) should look like this:
o Question 1 asks, rephrased: out of all people who are for the zoning law, how many of them are age 50 or
under. In other words, question 1 considers all people who are for the zoning law as a total - that is a row
total, so that the answer to question 1 is the row percentage 51.4%.
o Question 2 asks, rephrased: out of all people who are 50 or under, how many of them are for the zoning
law. In other words, question 2 considers all people who are 50 or under as a total - that is a column
total, so that the answer to question 2 is the column percentage 36.8%.
From that example we see that the key to answering questions such as these is which group is considered the
"total" group for the particular question:
o if the total for that group is found in a row, use row percentages
o if the total for that group is found in a column, use column percentages
It seems that generating these percentage tables is a fair amount of work. Of course Excel provides an easier
method for generating such tables from actual data, which we will explore soon.
Appendix: Converting to Percentages using Excel
In this case it is simpler to do the proper computations using a regular calculator, but we could use Excel just as
well. I recommend you check out the video below, or follow the step-by-step instructions after that, but the
video contains a few fun and helpful Excel tricks - you don't want to miss them.
If you prefer written instructions, try these: our starting point is the original table with the raw data values,
copied to a second version as in this picture:
Now position the cursor into cell B8 (second table, containing the value 92). We will replace this "raw" value by a
computed one as follows:
If we keep going this way, we can convert all numbers into row percentages. Actually, all numbers so far will be
decimal values, but we can easily format them as percentages as outlined above.
And it should be possible, now, to convert the raw data into column percentages, using Excel - see above for the
final tables.
The advantage of this approach is that - while more work than using a calculator - the percentage table(s) will
automatically update if the raw data values change, since we use Excel formulas.
Practicing
To practice, consider the following table of raw numbers, relating sex (or gender) with pulse rate (the numbers
are from a survey for this class):
          low pulse rate   high pulse rate   Totals
Male                   1                 1        2
Female                 9                 6       15
Totals                10                 7       17
Convert the table into (a) row percentages and (b) column percentages, then use the appropriate figures to
answer the following questions:
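If you want to check your row and column percentages, here is a minimal Python sketch of both conversions. It is applied to the zoning table from this section (raw counts only, without the totals) so that the practice table stays an exercise; the function names are made up for illustration.

```python
def row_percentages(table):
    """Divide every cell by the total of its row."""
    return [[100 * cell / sum(row) for cell in row] for row in table]


def column_percentages(table):
    """Divide every cell by the total of its column."""
    column_totals = [sum(column) for column in zip(*table)]
    return [
        [100 * cell / column_totals[j] for j, cell in enumerate(row)]
        for row in table
    ]


# Zoning example: rows are "For Zoning" and "Against or no opinion",
# columns are "age 50 or under" and "age over 50"
zoning = [[92, 87], [158, 75]]

row_pct = row_percentages(zoning)
col_pct = column_percentages(zoning)

print(round(row_pct[0][0], 1))  # share of 'for zoning' who are 50 or under
print(round(col_pct[0][0], 1))  # share of '50 or under' who are for zoning
```

The two printed values correspond to Question 1 (a row percentage) and Question 2 (a column percentage) discussed in this section.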
Selected Employees
With Excel, the appropriate tool to create such crosstabs tables is - you guessed it - the "PivotTable ..." in the
"Insert" ribbon.
You will see the table (hopefully) familiar from the previous section, where you can drag variables from the
PivotTable Fields into the "Row", "Column", and/or "Values" areas. We won't make use of "Filters".
To analyze the relationship - if any - between "Gender" and "Salary Level", drag the variable "Salary Level" to the
"Row" field of the table and the variable "Gender" to the "Column" field (if you accidentally drop a variable in the
wrong spot, simply drag it back).
This table shows, for example, that 32 female employees out of 474 total employees earn between $10,000 and
$20,000, while, for example, 45 male employees earn more than $60,000.
Similarly, we can create tables to relate "Salary Level" with "Years of Education". We could start again from
scratch, but since we already have the pivot table available, we can simply drag the "Salary Level" and "Gender"
variables out of the table and drag the "Years of Education" and "Salary Level" to the respective row, column, and
data area to create our next table:
Similarly, we can create the table relating "Salary Level" with "Job Category", but I leave the details up to you. To
finish our discussion on crosstabs tables, let's use the above tables to answer specific questions:
Example: Using the same Excel data as before, answer the following questions:
1. How many female employees earn less than $40,000? How many males?
2. How many people earning more than $60,000 have 15 years of education or less? How many of those have
more than 15 years of education?
At first glance it might seem that we have already created the right tables to answer these questions. However,
this time we want the answers in percent, while our above tables contain actual numbers. And, as we have
discussed in the previous section, when we want to generate percentage tables we need to decide whether we
want row or column percentages. Therefore, we first need to discuss exactly which tables to generate to answer
these questions before worrying about how to do it in Excel.
To answer question 1, we clearly need to generate a table relating salary with gender. Let's use salary as row
variable and gender as column, just as above:
Question 1 uses as total all female employees, and in our table the "females" go along a column. Therefore, we
need to generate column percentages in the above table.
• Double-click the "Count of Salary Level" button in your table and click on "Options" in the dialog that pops
up. Then select "% of column" in the "Show Values As" field and click OK.
You should see the final table, containing column percentages, as shown below. Note that you can tell that the
table shows column percentages since the percentages in each column add up to 100%.
Now we can answer question 1 easily: Recall the question was "how many female employees earn less than
$40,000". In the female column we need to add the numbers: 14.81% + 66.67% + 13.43% since everyone in those
corresponding cells is female and earns less than $40,000. You could of course use Excel to add these numbers, or
simply add them in your head. The answer to the question is: 94.91% of females earn less than $40,000. For the
male employees, the answer is: 63.95% of males earn less than $40,000.
Actually, this seems to indicate that female employees, as a rule, earn less money than male employees. We will
learn how to answer questions such as these in the next section.
It is left as an exercise to answer the second set of questions "How many people earning more than $60,000 have
15 years of education or less? How many of those have more than 15 years of education". Here are, for your
information, the answers (recall that you first need to determine whether to use row or column percentage
tables, which of course depends on your choice of row and column variables, as well as on the particular
question). Here is a particular table we chose to create:
By the way, is that table using row or column percentages? The correct answers are:
• 2.22% of people earning more than $60,000 have 15 years or less of education
• 97.78% of people earning more than $60,000 have more than 15 years of education
Again, a more interesting question would be to determine whether more years of education generally result in
higher salaries (it does look that way) - we will answer that type of question in the next section.
5.4 Chi-Square Test for Crosstab Data
In the previous section we computed crosstab tables, relating two categorical variables with each other. A
natural question to ask now is:
Is there a relationship between the row and column variables, or are the two variables independent of each other?
The complete answer to this question would take us into the realm of hypothesis testing, which we have not yet
introduced. For example, what exactly do we mean by "independent"? But the question is intuitively easy to
understand, so we will give a brief discussion on how to answer the above question without covering all of the
mathematical details.
Example: Consider the crosstabs table we generated before, relating income with sex for a particular company,
using the data file Selected Employees. The table we generated (using the Pivot Table as described previously)
with raw numbers, not row or column percentages, looked as follows:
A natural and interesting question is: "Is there a relationship between salary and sex (the row and column
variables) or do the two variables appear to be independent of each other".
To answer this question we assume at first that the row and column variables are independent, or unrelated. If
that assumption was true we would expect that the values in the cells of the table are balanced.
To determine what we mean by balanced, let's take a simple example with two variables, sex and smoking, for
example. We are interested in figuring out whether there is a relation between sex (male/female) and smoking
(yes/no), or whether the two variables are independent of each other. We therefore conduct an experiment and
ask a randomly selected group of people for their sex and whether they smoke. Then we construct the
corresponding crosstabs table. Let's say we get a table as follows (the actual numbers are fictitious):
                Male   Female   Totals
Smoking           30        5       35
Not smoking       10       55       65
Totals            40       60      100
Of the 35 people that are smoking, 30 of them are male. Conversely, of the 65 people that are not smoking, 55 of
them are female. Such an outcome - using common sense - would suggest that there is a relation between
smoking and sex, because the vast majority of smokers is male, while the majority of non-smokers is female.
On the other hand, we might have gotten a table like this (again with fictitious numbers):
Male Female Totals
Now the smokers and non-smokers are divided pretty much evenly among men and women, suggesting perhaps
that the two variables are independent of each other (a person's sex does not seem to have an impact on their
smoking habit).
Now let's look at exactly what a balanced table should look like if we assume that two variables are indeed
independent. Suppose we are again conducting our experiment and select some random sample, but for now we
only look at totals for each variable separately (the actual numbers are once again fictitious). Suppose, for
example:
• number of smokers is 30, number of non-smokers is 70
• number of males is 40, number of females is 60
• total number of data values (subjects) is 100
With this information we could construct a crosstabs table as follows:
                Male   Female   Totals
Smoking                             30
Not smoking                         70
Totals            40       60      100
But what kind of distribution in the various cells would we expect if the two variables were independent?
• We know that 30 of 100 (30%) are smoking; there are 40 males and 60 females - if male and female had
nothing to do with smoking (the variables were independent) then we would expect that 30% of the 40
males are smoking, while 30% of the 60 females are smoking.
• We also know that 70 of 100 (70%) are not smoking; there are 40 males and 60 females - if the two
variables were independent, I would similarly expect that 70% of the 40 males were not smoking and 70%
of the 60 females were not smoking
Under the assumption of independence I would expect my table to look as follows:
                Male   Female   Totals
Smoking           12       18       30
Not smoking       28       42       70
Totals            40       60      100
In other words, if a crosstabs table with 2 rows and 2 columns has row totals r1 and r2, respectively, and column
totals c1 and c2, then if the two variables were indeed independent we would expect the complete table to look as
follows:
            X                  Y                  Totals
A           r1 * c1 / total    r1 * c2 / total    r1
B           r2 * c1 / total    r2 * c2 / total    r2
Totals      c1                 c2                 total
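The rule "row total times column total divided by overall total" is easy to try out in Python. The sketch below (function names are made up for illustration) first fills in the expected table for the totals 30/70 and 40/60 discussed above, and then compares the first fictitious smoking table with the values we would expect for its own totals.

```python
def expected_from_totals(row_totals, column_totals):
    """Expected cell counts under independence: r_i * c_j / total."""
    total = sum(row_totals)
    return [[r * c / total for c in column_totals] for r in row_totals]


def expected_values(observed):
    """Expected table computed from the totals of an observed table."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    return expected_from_totals(row_totals, column_totals)


# Totals from the text: 30 smokers, 70 non-smokers, 40 males, 60 females
balanced = expected_from_totals([30, 70], [40, 60])
print(balanced)

# First fictitious table: rows smoking / not smoking, columns male / female
observed = [[30, 5], [10, 55]]
expected = expected_values(observed)
differences = [
    [o - e for o, e in zip(observed_row, expected_row)]
    for observed_row, expected_row in zip(observed, expected)
]
print(expected)
print(differences)
```

Note that the observed table has row totals 35 and 65, so its expected values differ from the 30/70 example. The large differences in every cell match the common-sense impression that sex and smoking are related in that table.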
But now we can create an effective procedure to test whether two variables are independent:
1. Create a crosstabs table as usual, called the actual or observed values (not percentages)
2. Create a second crosstabs table where you leave the row and column totals, but replace the count in the
i-th row and j-th column by:
(total of row i) * (total of column j) / (overall total)
Fill in all cells in this way and call the resulting crosstabs table the expected values table, because these
numbers would be expected in this position of the table if the two variables under investigation were indeed
independent. Now, here is the clue:
If the actual values are very different from the expected values, the conclusion is that the variables cannot be
independent after all (because if they were independent the actual values should look similar to the expected
values).
The only question left to answer is "exactly when is this difference too large", i.e. at which point can I assume that
the difference between expected and actual values is so large that I have to conclude that the variables cannot
be independent. Before we answer that question, let's return to our original example.
Actual versus Expected Values
Recall that we wanted to determine whether sex (gender) and salary level are independent of each other based
on the particular company studied. According to our theoretical discussion above, we create a second table with
the same number of rows and columns (and labels) and name that table "Expected Values", while the original
table will be named "Actual Values". We simply copy-and-paste the original table and erase the "inside cells" so
we can recompute them (do not copy the very top line of the original table).
Note that the table with actual values must contain counts, not percentages
In the table of expected values, each entry is computed as the:
product of the row and column total for that cell, divided by the overall total
The next picture illustrates these computations:
For example, the entry in cell G5 (column G, row 5) of the Expected Values table is the product of the column G
total (cell G9) times the row 5 total (cell I5) divided by the overall total (cell I9). Similarly, the entries of the
other expected values are:
Value of cell G3 = G9 * I3 / I9      Value of cell H3 = H9 * I3 / I9
Value of cell G4 = G9 * I4 / I9      Value of cell H4 = H9 * I4 / I9
Value of cell G5 = G9 * I5 / I9      Value of cell H5 = H9 * I5 / I9
...                                  ...
If we compare the values, we see that of the people making $60K or more, fewer than expected are female (0
versus 20.51) while more than expected are male (45 versus 24.49). On the other hand, in the low-income
category of $10-$20K, more than expected are female (32 versus 15.04) while fewer than expected are male (1
versus 17.96). That seems to point towards a gender bias for salaries, i.e. women make less money than men as a
rule, or to phrase it differently:
the row and column variables do not seem independent of each other, so that there seems to be a dependence
between them
In other words:
• if the sum of differences(*) between actual and expected values is "small", the assumption of
independence is valid
• if the sum of differences(*) between actual and expected values is "large", the assumption of
independence is invalid, hence there must be a relation between the variables
The big question left is: when is the difference small enough to accept the independence assumption, and when
is the difference so large that we can no longer assume independence and must therefore accept dependence.
The answer to this question is provided by the Chi-Square test.
(*) We don't really need the sum of differences, because then negatives and positives would cancel each other out. Instead, we square all differences before
adding them up to eliminate possible negative signs. Fortunately, though, Excel will handle the details of our computation.
The Chi-Square test computes the sum of the differences between actual and expected values (or, to be precise,
the sum of the squared differences, each divided by its expected value) and assigns a probability value p to that
number depending on the size of the difference and the number of rows and columns of the crosstabs table.
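For readers who want to see what happens behind the Excel function, here is a minimal Python sketch of the computation, using only the standard library. The function names chisq_test and chi2_upper_tail and both small tables are hypothetical, not course data. The sketch assumes the usual conventions for this test: each squared difference is divided by its expected value, and the number of rows and columns enters through the degrees of freedom, (rows - 1) * (columns - 1).

```python
import math


def chi2_upper_tail(stat, df):
    """Probability that a chi-square variable with df degrees of freedom is at least stat."""
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    # Closed-form starting point: a = 1 for even df, a = 1/2 for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    # Recurrence Q(a + 1, x) = Q(a, x) + x^a * e^(-x) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chisq_test(actual, expected):
    """Return (statistic, p) for a table of actual counts and a table of expected counts."""
    stat = sum((a - e) ** 2 / e
               for a_row, e_row in zip(actual, expected)
               for a, e in zip(a_row, e_row))
    df = (len(actual) - 1) * (len(actual[0]) - 1)
    return stat, chi2_upper_tail(stat, df)


# Hypothetical tables (not course data): actual counts and their expected counts.
actual = [[30, 10], [20, 40]]
expected = [[20.0, 20.0], [30.0, 30.0]]

stat, p = chisq_test(actual, expected)
print(f"statistic = {stat:.4f}, p = {p:.2e}")
```

The two arguments deliberately mirror the two ranges passed to the Excel function: first the actual counts, then the expected values.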
• If the probability value p computed by the Chi-Square test is very small, differences between actual and
expected values are judged to be significant (large) and therefore you conclude that the assumption
of independence is invalid and there must be a relation between the variables. The error you commit by
rejecting the independence assumption is given by this value of p.
• If the probability value p computed by the Chi-Square test is large, differences between actual and
expected values are not significant (small) and you do not reject the assumption of independence, i.e. it
is likely that the variables are indeed independent.
=chisq.test(actual_range, expected_range)
Important Restriction: The Chi-Square test is not appropriate if any of the expected values are less than 5.
Excel will not check this restriction - you need to manually inspect the expected values to ensure none of them
have a value of less than 5. Let's call this the Rule of Thumb test.
Now we can finish the above example: we used the Selected Employees data to generate the crosstabs table:
First we copy this table of actual counts to a second table (IMPORTANT: DO NOT COPY THE TOP ROW OF THE
TABLE) and compute the expected values as outlined above:
Please note that all expected values are bigger than 5 (the smallest expected value is 15.04), so the Chi-Square
test is applicable. Thus, in an empty cell, enter the Chi-Square test function:
=chisq.test(B3:C8, G3:H8)
In the above case Excel computes that value to p = 1.91E-22, which is scientific notation for
0.000000000000000000000191. Thus, p is most definitely small and hence we conclude that there is a relation
between the two variables sex and salary. In other words, the salary level in this particular company does depend
on the sex of the employees. Since p is so close to zero our error is close to zero as well, so we are pretty certain
that our conclusion is correct.
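Since Excel does not check the Rule of Thumb for us, the inspection of the expected values can also be automated. The Python sketch below is one possible way to do it; the helper name passes_rule_of_thumb and the second table are hypothetical. The first table contains only the four expected values quoted in the text above, not the complete Expected Values table.

```python
def passes_rule_of_thumb(expected, minimum=5):
    """Return (ok, smallest): ok is True when no expected value is less than minimum."""
    smallest = min(min(row) for row in expected)
    return smallest >= minimum, smallest


# Excerpt: the four expected values quoted in the text (female, male),
# for the $60K-or-more and the $10-$20K categories.
quoted = [[20.51, 24.49],
          [15.04, 17.96]]

# Hypothetical table with one expected value below 5 (not course data).
hypothetical = [[12.0, 3.5],
                [8.0, 6.5]]

for name, table in [("quoted", quoted), ("hypothetical", hypothetical)]:
    ok, smallest = passes_rule_of_thumb(table)
    verdict = "test applicable" if ok else "test NOT reliable"
    print(f"{name}: smallest expected value = {smallest} -> {verdict}")
```

Running such a check before looking at p avoids drawing a conclusion from a test whose assumptions are not satisfied.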
Example: Using the data from Selected Employees, is there a relation between salary and years of schooling?
We first use the data and the Pivot Table tool to create a crosstabs table of salary versus years of schooling, as in
the previous section:
Next we copy this table (without copying the top row) and paste it below the actual table. Then we compute the
expected values, which is a lot of work. Finally, we compute the Chi-Square test value p:
Again the value of p is for all intents and purposes zero, so we can with high certainty conclude that there is a
relation between years of education and salary level (in other words, based on this data you make more money
with more years of education, just like your parents told you).
BUT we failed to check if the expected values pass our "rule-of-thumb" test! In fact, several of the expected
values are small, certainly smaller than 5. Thus, in this case the Chi-Square test is not reliable and we should not
believe its conclusions since the assumptions of the test were not satisfied! To remedy the problem, you could
re-categorize the data by using fewer groups so that hopefully the expected values in the new tables will all be
above 5.
Example: Every year there are large-scale surveys, selecting a representative sample of people in the US and
asking them a broad range of questions. One such survey is the General Social Survey (GSS) from
1996 (which contains mostly categorical data). Use the data (which is real-life data from 1996) to analyze if
there is a relation between party affiliation and people's opinion on capital punishment.
After downloading and opening the GSS96-selected.xls data file, we construct a crosstabs table for "Party Affil"
and "Capital Punishment" as described in previous sections, using the Pivot Table.
Then we copy the table (of actual values) to a second table and construct the expected values as described above
(this time the table is pretty small, so computing the expected values is not that much work).
Finally, we use the chitest Excel function (the older name of chisq.test) to compute the value of p. Here are the
resulting figures:
This time the smallest expected value is 9.02, which is above 5 so that it is valid to apply our Chi-Square test.
Again p is very close to zero, stating that there is a relation between party affiliation and opinion on capital
punishment. In fact, if you compare the actual versus expected values for the Democrats you can see that fewer
Democrats than expected favor the death penalty, while more than expected oppose it. For Republicans it is just
the other way around. For independents and people with other party affiliations there seems to be little
difference between actual and expected values.
So far in all our examples the variables were dependent (probably). Of course that is not always the case. For
example, if we set up the corresponding tables for actual and expected values for the data from the GSS96 survey
relating "Life is" with opinion on "Capital Punishment", we see that the computed p value is p = 0.045, which
means that if we did reject the assumption of independence and hence stated that the outlook on life and the
opinion on capital punishment are related, we would make an error of 4.5% - that might be more than we are
willing to accept (see figure below).
Finally, note that the Chi-Square test for crosstabs tables, as described here, checks whether two variables are
independent of each other or not. If the test results in our conclusion that two variables are related, then the
next question is: how strongly are they related? There are different statistical procedures that allow you to decide
about the strength of a relationship, but for categorical data Excel does not provide the necessary functions to
quickly perform these computations.
To analyze the strength of a possible relation between two variables we will restrict ourselves to numerical
variables and move on to the next chapter.
End of Module
Thank you!!!!!