Basics of SPSS
A Step-by-Step Manual
Fourth Edition
by
Lars E. Perner
Contents
Acknowledgements
Introduction
Reliability: Finding coefficient Alpha and other measures of reliability in a scale
t-test: Testing for differences in two population means
Advanced features: Factor, Cluster, Hiloglinear, and MANOVA
Appendices:
To the Instructor
very difficult to get out. As for typing in the data directly, rather than
through the data entry module, it is my experience that such practice helps
the user appreciate the fixed format of the data entry. Also, this method of
data entry is more efficient for large data sets since the "key puncher" is not
required to enter a carriage return after entering each variable.
The other side of the coin is that many people still feel uneasy when it
comes to dealing with computers. Many official software manuals, and
even a number of secondary texts, tend to present the material in a
relatively abstract and "sterile" manner. At the loss of a slight degree of
generality, I have instead chosen to present examples that will help the user
"fill in the blanks" on his or her own programs. That is, I frequently use
"real" variable names rather than referring to some abstract notion such as
"varlist." In addition, I have used a great deal of humor and
anthropomorphism to put the "reluctant" computer user off guard. On that
topic, I see no reason why humorous examples cannot be as informative and
educational as boring ones. The purpose of examples is to show the student
how data can be analyzed, and while "real world" projects may be less
engrossing, the skills learned in a humorous situation can be generalized to a
routine, or perhaps even boring, situation.
making a limited number of hypotheses before running the statistical
procedures. The program named "SIGNIF.EXE," which is included on the
distribution diskette for this manual, allows the user to compute the
probability of making at least one Type I error given n significance tests.
Because students tend to attempt a very large number of
analyses, I strongly recommend that the instructor stress the concept of
accumulating error levels in class.
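The accumulation of error levels can be sketched in a few lines of Python (Python is used here only for illustration; this is not the SIGNIF.EXE program and not part of SPSS/PC+). The sketch assumes that the n tests are independent and that each uses the same significance level; the function name is hypothetical.

```python
def prob_at_least_one_type1(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one Type I error across n independent tests.

    Assumption: every test is independent and run at the same alpha.
    """
    if n_tests < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("n_tests must be >= 0 and alpha must lie in [0, 1]")
    # The chance of making no error at all is (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 5, 10, 20):
    print(f"{n:2d} tests: {prob_at_least_one_type1(n):.3f}")
```

With twenty tests at the .05 level, the chance of at least one spurious "significant" result is already close to two in three, which is why a limited number of hypotheses should be stated in advance.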
With SPSS being available in so many forms, one may wonder about
the wisdom of using the personal computer version as opposed to
mainframe versions such as SPSS-X available on most campuses. After all,
SPSS/PC+ may be installed only on a few computers on campus while
mainframe terminals are readily available to students, some of whom may
even be able to dial up the university mainframe by modem from home. I
think there are several reasons why the PC version is preferable. First, many
students have already had experience on the IBM PC and thus need much
less introduction. Secondly, should the student wish to include part of the
SPSS output directly in a report, taking it from the "SPSS.LIS" output file is
much easier than downloading an output file from the mainframe. Finally,
those students who will end up using SPSS on the job are much more likely
to find SPSS/PC+ rather than the mainframe version in industry.
In that same vein, a question arises as to whether one should use the
complete SPSS/PC+ program or SPSS/PC+ Studentware. While I feel that
using the "real thing" will be a better preparation for practical industry
applications, I don't think that students who only have the Studentware
edition available will be seriously shortchanged in an introductory course.
To the Student
Much has changed since I was first exposed to the Statistical Package
for the Social Sciences (SPSS) in the early 1980s. As an industry standard,
SPSS now exists in versions for many different computers. At the time,
however, I was confined to a mainframe version which did not seem terribly
user friendly.
who feel uncomfortable with computers to feel more at ease with the
subject. You should not feel guilty about enjoying the reading, however.
While you may risk getting some dirty looks in the library if you laugh out
loud, the examples, although often far fetched, illustrate real research issues
and are just as informative as boring examples. Why shouldn't enlightened
poodle breeders commission marketing research just like the manufacturers
of laundry detergent? Those people who split their investment funds
between an inventory of poodles and a controlling interest in Procter &
Gamble will be just as concerned about increasing levels of prejudice against
small dogs as about consumer trends toward buying generic household
products.
A few final cautions. The computer has today made it very easy to
perform statistical calculations that could literally have taken a person
months to perform in past decades. With this potential, however, comes an
opportunity for serious abuse. This can take two forms. First, anyone can do
complex statistical calculations in SPSS/PC+, but the output may not be at
all meaningful. In class, you may have discussed the distinction between
nominal, ordinal, interval, and ratio scales of measurement. However, the
computer doesn't know where your data comes from and will gladly comply
with your request to include nominal data in a procedure that really
requires interval, or even ratio, level data. Therefore, be sure you
understand the assumptions behind a statistical technique before running it.
tests to come out significant by chance alone.
What Can SPSS/PC+ Do for Me?
At the most basic level, you might want to tabulate some data you
have collected in a survey or through other means. Later on in the book, we
will meet a dog breeder who is very interested in whether people own dogs
or not and what kind they prefer. Suppose he has asked you to do a survey.
After you have entered the data, you can ask SPSS/PC+ for a table that
indicates how many people gave each of the possible answers to a question:
                                                Valid      Cum
Value Label        Value  Frequency  Percent  Percent  Percent
[1] There is no need to cut and paste! Lotus graphics ("*.pic" files) can be
imported directly into WordPerfect 5.0 or 5.1.
Italian and French food; and those people preferring the Continental U.S.
would prefer Western type food such as steaks, hamburgers, and fries. You
are not quite sure about those who prefer to visit Hawaii. SPSS/PC+ allows
you to test your hypotheses: [2]
       2         1      1     11      1      2     16
  Orient       6.3    6.3   68.8    6.3   12.5   10.7
               4.2    3.0   18.6    3.4   40.0
       3         2     27      4     22            55
  Europe       3.6   49.1    7.3   40.0          36.7
               8.3   81.8    6.8   75.9
       4         1      4     39      3      2     49
  Hawaii       2.0    8.2   79.6    6.1    4.1   32.7
               4.2   12.1   66.1   10.3   40.0
  Column        24     33     59     29      5    150
   Total      16.0   22.0   39.3   19.3    3.3  100.0

Cramer's V   .59023
-------------------------------------------------------------------------------
If you are familiar with the Chi square statistic, you can see that there
is strong evidence to reject the null hypothesis that food and vacation
preference are "independent." As a matter of fact, the Cramer's V statistic
even suggests a modest relationship.
[2] Normally, you should define hypotheses more specifically before
testing them. For now, we are just testing whether the two variables in
question (food preference and favorite vacation destination) are
dependent.
Once you have done a correlation analysis, you might feel that the
proper next step is to do a multiple regression analysis to see if you can
improve your ability to predict based on the introduction of additional
variables. Unlike Lotus 1-2-3, SPSS/PC+ gives you several choices as to
which method you want to use (forward inclusion, backward deletion,
stepwise consideration, or "forced" entry). If a traditional method doesn't
suit your needs, you can introduce non-linear or log-linear models. Let's try
to "predict" a person's telephone bill from his or her expenditures on other
items and other demographic information.
[3] Actually, a correlation analysis is in practice applied many times
even when only ordinal level data is available. This is not sanctioned as
correct by most statisticians, but this approach can sometimes still
yield meaningful results. When one or both of the variables depart
seriously from the assumption of interval properties, the true relationship
between the variables may be greatly underestimated. On the other hand,
a correlation will rarely provide "false positives" or suggest a relationship
that does not exist.
Multiple R .67310
R Square .45306
Adjusted R Square .44673
Standard Error 84.09635
Analysis of Variance
                 DF      Sum of Squares       Mean Square
Regression        3       1517312.98023      505770.99341
Residual        259       1831698.85247        7072.19634
-----------------------------------------------------------------
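The summary figures in this output are all tied to the analysis of variance table. The Python sketch below (an illustration outside SPSS/PC+, with hypothetical function and key names) recomputes them from the two sums of squares and their degrees of freedom, using the standard textbook formulas; it assumes the model includes a constant, so that the total degrees of freedom equal the number of cases minus one.

```python
import math


def regression_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Recompute the summary statistics from an ANOVA table.

    Assumption: the model includes a constant term, so the total
    degrees of freedom are df_regression + df_residual = n - 1.
    """
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    r_square = ss_regression / ss_total
    return {
        "multiple_r": math.sqrt(r_square),
        "r_square": r_square,
        # Adjusted R square compares variances (mean squares), not sums.
        "adjusted_r_square": 1.0 - ms_residual / (ss_total / df_total),
        "standard_error": math.sqrt(ms_residual),
        "f_ratio": ms_regression / ms_residual,
    }


# Values copied from the output shown above.
summary = regression_summary(1517312.98023, 1831698.85247, 3, 259)
for name, value in summary.items():
    print(f"{name:18s} {value:.5f}")
```

Rounded to five decimals, the first four values agree with the Multiple R, R Square, Adjusted R Square, and Standard Error lines printed by SPSS/PC+.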
Now you can optionally calculate a confidence interval for the mean
height of the women in each city. Tentatively, it looks like Dallas might be
the best bet. (An added benefit is that by choosing this location, you will be
closer to the Oil Barons' Club).
Of course, there is always the possibility that you decided to split the
cost of doing the survey with a classmate who believes that there is more
money to be made in tall men's clothing. In that case, of course, you would
want to find out about the heights of the men in the different cities as well.
However, you would want to keep track of the heights of the men and
women separately, both because the city that has the tallest men might not
have the tallest women and because the great between-sex height
differences would greatly inflate the estimate of within-gender variability.
"Means" allows you to produce this table:
Summaries of HEIGHT Height of subject
By levels of CITY City of residence
SEX Sex of respondent
Now that you have been working with SPSS/PC+ for quite some time,
you are really getting good at marketing research, and you feel that you can
handle almost anything--even the unexpected. One day, you receive a
distressed phone call from Rudolph the Redneck Reindeer. Rudolph is
hysterical because his employer, an elderly man who likes to wear a red suit
during the winter months, has warned his long time sleigh puller that he
may have to lay him off because people are beginning to demand greater
sophistication from reindeer. You agree to do a survey for Rudolph to find
out how important that aspect really is to consumers. However, having
studied marketing research for some time, you realize that a single question
or "item" will not give you a dependable answer. You therefore decide to
create a scale of "Appreciation of
Sophistication in Reindeer," where subjects will be asked to indicate their
level of agreement or disagreement with Likert type questions on a scale
ranging from "strongly agree" (1) to "neither agree nor disagree" (4) to
"strongly disagree" (7). Now you want to test whether the average score on
the questions will be reliable enough to be meaningful. You and Rudolph
hope that people will score as low as possible on that scale, suggesting that
his employer's concern is unwarranted. After you "reverse score" item #4
(which is worded in the opposite direction of the other questions), you are
ready to generate the following estimate of internal consistency:
R E L I A B I L I T Y A N A L Y S I S - S C A L E (S O P H I S T)
RELIABILITY COEFFICIENTS
ALPHA = .9218
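For readers curious about what lies behind the ALPHA figure, here is a minimal Python sketch of the usual textbook formula for coefficient alpha. It is an illustration only: it is not the SPSS/PC+ procedure, the function name is hypothetical, and the responses are made up rather than taken from Rudolph's survey.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents' item scores.

    Each inner list holds one respondent's answers to the k items,
    with any reverse-worded item already reverse scored.
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    item_variances = [variance(item) for item in zip(*responses)]
    total_variance = variance([sum(person) for person in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up answers from five respondents to a five-item, 1-7 scale.
hypothetical = [
    [1, 2, 1, 2, 1],
    [3, 3, 4, 3, 3],
    [6, 7, 6, 6, 7],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
]
print(round(cronbach_alpha(hypothetical), 4))
```

The more the items move up and down together across respondents, the larger the variance of the total score becomes relative to the individual item variances, and the closer alpha gets to 1.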
SPSS has many features, of which you will probably only be using a
few. It is important not to lose sight of the forest for the trees (or, in more
modern terms, not to lose sight of the computer for the chips). This manual
contains descriptions of many simple procedures to run, with more
information being available in various manuals put out by SPSS, Inc. and
third party sources. (If you are unsure about particular statistical
procedures, these manuals also function very well as statistical texts since
they contain very good, real life illustrations of the statistical techniques
discussed). The aim of this book is not to teach you all the details of
SPSS/PC+, but simply to allow you to adapt sample programs to your needs.
Assuming that you have already collected your data, the statistical
analysis generally involves five stages:
4. Please rank the following foods in the order you prefer to feed them to your dog:
6. Please write next to each of the questions below the number from the following
scale which most closely matches your level of agreement or disagreement:
    1      2      3      4      5      6      7
Strongly                                  Strongly
 agree                                    disagree
001
002
003
..
-------> more data here <--------
..
300
Now we get to actually code the questions. This includes giving the
question (or variable) a one word name, assigning a number to each possible
answer on the questionnaire, [4] determining how many digits are needed for
the question, determining in which columns the data will be put and,
optionally, assigning the question and answers "labels," i.e. short phrases
describing their meaning. Although this process may sound overwhelming,
it is quite simple once we go through it.
[4] SPSS/PC+ actually allows the use of alphanumeric characters, that is,
letters of the alphabet, as data. However, the use of alphanumeric data
will often cause problems which are difficult to solve and it may be a
poor practice since many other statistical programs will not allow such
data.
The name of the variable can simply refer to the number of the
question (e.g. "Q1," "ITEM18"), or it can be descriptive of the meaning of the
variable (e.g. "AGE," "INCOME"). The rules for naming SPSS/PC+ variables
are very similar to those for naming DOS files, i.e
Q1
QUEST1
QUEST1A
QUESTA1
Let's look at question number 1. Our first task is to name it. Either
you can call it something like Q1, to keep it simple, or you can name it
something more descriptive like "DOGOWN." When you write your own
program, you have a choice; for now, we will call it DOGOWN. Next, notice
that there are three possible answers. (Your client insisted on including the
"Not sure" option since the questionnaire would be administered in the
neighborhood of a major university, making it likely that a number of absent
minded professors would be asked to respond.) We now have to assign a
number to each. Let's assign a "1" to "Yes," a "2" to "No," and a "3" to "Not
sure." Now, are these response categories enough?
Not really. Two things could happen. First, the respondent might
accidentally overlook or refuse to answer the question (a common situation
when you ask about such emotionally charged and private topics as income
or extra marital activities). The next several questions illustrate the
situation that occurs when not everybody is supposed to answer a question,
and we will discuss how to code such instances when we talk about those
questions. For this question, missing data can only arise when a person
either omits the question or provides an answer that is not usable. This could
happen if someone wrote a sarcastic comment instead of answering or
simply overlooked the question. Whenever a person fails to answer a
question that he or she should have answered, we will assign a numeric
missing-value code.
For this question we will code it as a "9." When the question is not
applicable, just leave the space blank and the computer will assign the
response as a "system missing value." (Notice that using "9" as the missing
value for this question will result in all the numbers in between four and
eight, inclusive, not being used).
The next question, which we will call DOGCOUNT, asks how many
dogs the respondent owns. We will assume that the respondents are
reasonably normal and do not own more than 98 dogs; hence, we will
reserve only two digits for that variable. Note that here we may encounter
the kind of missing data that occurs when people legitimately omit
questions since the questions in sections two through five should only be
answered by those people who own dogs. When people "legitimately" skip
questions, we will put blank spaces in the columns designated for the
variable. For those people who indicated that they own at least one dog by
answering "Yes" to the first question but failed to answer this question, we
will put in the missing value of "99."
Note that open ended questions have a great potential for missing
data. Suppose that someone misunderstood the question and thought he or
she was asked about his or her favorite pet. If he or she answered "Polar
bear," you would most likely classify the answer as missing. You should be
prepared for certain other kinds of "missing answers." Perhaps a respondent
unsympathetic to the objectives of our research might scribble in something
like "I hate all dogs!"
The next questions require very little discussion since subjects will be
responding directly with numbers. Thus, for section 4, all we have to do is to
assign one digit to each variable (i.e. each dog food that we ask the
respondent to rate) and reserve "9" as a missing value. Skipping slightly
ahead, the same holds for the Likert scale questions of Section 6.
In section 5, we will assume that no one spends more than $99.98 per
week on dog costs, and we will thus assign five columns and a missing value of
"99.99." (SPSS/PC+ allows you to designate a variable as a dollar amount
rather than as a plain number. However, not all programs have that
capability, so let's not get into that.) Also notice that we are reserving
space for the period. We could arrange to use four columns instead, but why be
so stingy with the space?
For age, we will assign two digits and a missing value of "99." For sex,
"male" will be assigned "1" and "female" "2."
For the question of annual household income, we will assume that the
figure does not exceed 999,998 [5] for any respondent since we are located in a
university community. We will thus reserve six digits since SPSS/PC+
would not appreciate the comma (unlike the period allowed in the question
on weekly spending). Incidentally, anyone earning over a million dollars a
year would, in statistical jargon, probably be considered an "outlier"--a sort
of maverick who would probably be excluded from our analysis anyway.
while a case from a person who does not own a dog, and thus was not asked
to answer some of the questions, could look like this:
[5] Remember, we need $999,999 (i.e. 999999) for missing data.
Not only do we know that each case should end in the same
column; we also know that many of the blank spaces should be in the
same places. In this case, we have reserved blank spaces between
most of the variables, but none within the variables in the Likert scale
and rank-order sections. Putting in spaces there would make the data
seem more confusing.
Now we are ready to pursue the bottom line of this text, that is,
the writing of the SPSS/PC+ program. The next chapter will discuss
how to enter the data and commands into SPSS/PC+; here, we will
just talk about what to enter. Here is a program for the questionnaire
we have been discussing. Please don't be intimidated if it looks
overwhelming at first. We will go through it line by line.
TITLE "Dog Preference Study".
DATA LIST /id 1-3 owndog 5 dogcount 7-8
   favorite 10-11 food1 to food5 13-17
   spend 19-23 (2) likert1 to likert5 25-29 age 31-32 sex 34
   income 37-42.
VARIABLE LABELS
   owndog "Ownership of dog"
   dogcount "Number of dogs owned"
   favorite "Breed of favorite dog"
   food1 "Rating of generic dry dog food"
   food2 "Rating of generic canned dog food"
   food3 "Rating of Mighty Dog"
   food4 "Rating of Lucky Dog"
   food5 "Rating of Kit 'n' Caboodle"
   spend "Weekly spending on dog food"
   likert1 "Poodles are fragile"
   likert2 "Poodles are stupid"
   likert3 "Poodles are self-centered"
   likert4 "Poodles are cute"
   likert5 "Poodles are over-priced"
   income "Annual household income".
VALUE LABELS
   owndog 1 "Yes" 2 "No" 3 "Not sure"/
   favorite 1 "Poodle" 2 "Fox Terrier" 3 "Yorkshire Terrier"
      4 "Dachshund"
      10 "German Shepherd" 11 "Collie" 12 "Saint Bernard"
      13 "Pit Bull" 14 "Malamute" 15 "Afghan" 16 "Cocker Spaniel"
      17 "Doberman" 18 "Golden Retriever" 19 "Rottweiler"/
   food1 to food5 1 "Generic dry dog food" 2 "Generic dry cat food"
      3 "Generic canned dog food" 4 "Kit 'n' Caboodle"
      5 "Mighty Dog"/
   sex 1 "Male" 2 "Female"/
   likert1 to likert5 1 "Strongly agree" 7 "Strongly disagree".
MISSING VALUE owndog likert1 to likert5 (9)/
   dogcount favorite age (99)/spend (99.99)/income (999999).
BEGIN DATA.
001 1 02 02 52314 03.50 66727 29 2  026000
002 2                   22252 31 1  032150
003 2                   45445 26 2  135000
004 1 01 03 32512 05.50 67615 21 2  018500
First, note that some lines are indented while others are not. In
general, indented lines are continuations of commands that
started at the left margin on some line above them.
SPSS/PC+ really doesn't care if you indent or not, but it will make
your program more readable. As you can see, each command
eventually ends with a period, which tells the computer to take in the
next line as a new command. If you leave out the period, the
computer will not understand your commands and will give you an
error message. Fortunately, such errors are easy to detect and correct,
so if you leave out a few periods, it only means that you will have to
do some editing after you first try to run the program. Even
experienced SPSS/PC+ users often have problems in their first
attempts at any program, but the more experience you get with the
program, the easier it gets to correct the problems.
income 37-42 .
The "data list" command tells the computer about the positions
of the variables on the data lines. If you have made a table detailing
this information, you already have all the information needed.
Otherwise, you will have to do some arithmetic now to calculate the
beginning and ending columns of each variable.
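To see what the column positions in the "data list" command mean in practice, here is a small Python sketch that cuts one data line into variables the same way. It is an illustration outside SPSS/PC+ and a simplification: the names LAYOUT and parse_case are hypothetical, a blank field is simply returned as None (standing in for a system missing value), and user-defined missing values such as 9 or 99 are left untouched. The example lines assume two blank columns before income, as columns 37-42 in the data list imply.

```python
# Column positions (1-based, inclusive) as given in the data list command.
LAYOUT = {"id": (1, 3), "owndog": (5, 5), "dogcount": (7, 8), "favorite": (10, 11)}
LAYOUT.update({f"food{i}": (12 + i, 12 + i) for i in range(1, 6)})    # 13-17
LAYOUT["spend"] = (19, 23)
LAYOUT.update({f"likert{i}": (24 + i, 24 + i) for i in range(1, 6)})  # 25-29
LAYOUT.update({"age": (31, 32), "sex": (34, 34), "income": (37, 42)})

LINE_LENGTH = 42  # last column listed in the data list


def parse_case(line):
    """Split one fixed-format data line into a dict of numbers.

    Blank fields come back as None (a stand-in for "system missing").
    """
    padded = line.rstrip("\r\n").ljust(LINE_LENGTH)
    case = {}
    for name, (start, end) in LAYOUT.items():
        field = padded[start - 1:end].strip()
        case[name] = float(field) if field else None
    return case


# A dog owner, and a non-owner who legitimately skipped columns 7-23.
owner = "001 1 02 02 52314 03.50 66727 29 2" + " " * 2 + "026000"
non_owner = "002 2" + " " * 19 + "22252 31 1" + " " * 2 + "032150"

print(parse_case(owner)["spend"], parse_case(owner)["income"])
print(parse_case(non_owner)["dogcount"], parse_case(non_owner)["likert1"])
```

If a value lands even one column to the left or right of where the layout expects it, the wrong characters are read, which is why the arithmetic on beginning and ending columns has to be exact.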
is the same as
item1a to item1e .
Technical Note
item1a to item1e
owndog to likert5
Sending Output to the Printer
By default, SPSS/PC+ displays the statistical output on the
screen. If, instead, you would like to send the output to the printer,
simply put the following two lines in your program:
The "set more off" command frees you from having to press
<RETURN> or <SPACE> at the end of each screenful of data. This
means that the output may "pass you by" before you have a chance to
read it. If this happens, you may either want to leave out this
command or wait to read the output until you have it printed on paper.
When typing in the data, be sure to check that you are "on
target" with respect to the columns. Generally, all the lines should be
equally long. Also, be sure to check that, when you have typed in a
complete entry, you are one column farther out than the last position
listed in the data list.
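The "equally long lines" check can also be done mechanically before the data ever reach SPSS/PC+. The Python sketch below is one way to do it (an illustration outside SPSS/PC+; the function name is hypothetical, and the expected length of 42 is taken from the last column in the dog survey's data list).

```python
def misaligned_lines(lines, expected_length):
    """Return (line number, actual length) for every line of the wrong length."""
    problems = []
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\r\n"))  # ignore the line ending itself
        if length != expected_length:
            problems.append((number, length))
    return problems


# Hypothetical data lines: the second one is a column short.
data = [
    "001 1 02 02 52314 03.50 66727 29 2" + " " * 2 + "026000",
    "004 1 01 03 32512 05.50 67615 21 2" + " " * 1 + "018500",
]
print(misaligned_lines(data, expected_length=42))
```

A line that is too short or too long almost always means that a space or digit was dropped or doubled somewhere earlier on that line.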
FREQUENCIES VARIABLES=all/STATISTICS=all.
There are several ways you can enter the program and data
into SPSS/PC+. Since SPSS/PC+ uses an ASCII ("plain text") file to
handle the input data, you can either use REVIEW, the editor supplied
with SPSS/PC+, or a word processor such as WordStar, WordPerfect, [6]
or Microsoft Word. When using a word processor, be sure to set the
margins so that the lines can be long enough. The default margins for
most word processors will normally allow only about sixty-five
characters on a line. [7] Be sure to save the file as an ASCII file, i.e. not
in word processing format.
[6] Use the text-in/text-out feature (<CONTROL> <F5>) to create an
ASCII file.
[7] Also notice that some word processors set margins in terms of length
rather than characters. In the newer versions of WordPerfect or Microsoft
Word, margins are by default set in terms of inches. In such programs,
you may wish to switch to a smaller font instead of adjusting the margins.
Once you are in SPSS/PC+, a logo will first flash and you will
then be presented with a menu. Press <ALT>-M, then <F3>
<RETURN>. Now specify the name you want to give the file [8] that
will contain your program and data and press <RETURN>. For
example, if you were entering the questionnaire about dogs, you
might call it "A:POODLE.SPS." (It is traditional to use the file name
extension ".SPS," but if you like to be different, this convention is not
required).
If your file already exists, you will probably be brought into the
document at the bottom. To go to the top, press <CONTROL>-
<HOME>. (As you might expect, <CONTROL>-<END> will bring you
to the bottom of the document). You can use the arrows on the
keyboard to move one space or line at a time. The <INS> key will
toggle between insert and write-over (that is, whether the computer
will type new text on top of existing text or move the old text over to
make room for the new). (The default is "on"; however, you may want
to turn it off if you are editing data and you want to overwrite some
incorrect contents).
[8] If your floppy disk is in Drive A:, the filename should start with "A:,"
e.g. "A:poodle.sps" in our case.
below the cursor. Therefore, you should always be sure to go to the top of
the file before saving it. The complete sequence to save, including this
first step, is:
<CONTROL>-<HOME>
<F9> <RETURN> filename <RETURN>
You may not have time to type in all of your program and your
entire set of data in one sitting. You can leave REVIEW at any time
and resume at a later point.
Essential REVIEW Commands
SAVE FILE
<CONTROL>-<HOME>
<F9> <RETURN> <RETURN>
GO TO THE TOP OF THE FILE
<CONTROL>-<HOME>
MOVE UP OR DOWN ONE LINE OR MOVE LEFT OR RIGHT
Use the cursor keys
INSERT A LINE ABOVE THE CURRENT LINE
<SPACE> <RETURN>
EXIT FROM SPSS/PC+
(Be sure to save first). <F10> E EXIT. <RETURN>
Exercise
You are now ready to enter data into SPSS/PC+ and analyze it.
Your first exercise is simply to type in the following program and run
it. From the menu, call up SPSS/PC+, then press <ALT>-M <F3>
<RETURN>, followed by the filename "A:exerc1.sps" (make sure you
have a formatted floppy disk in Drive A:), and press <RETURN>
again.
Analysis
So you thought that you were finally done with the program
and data entry? Well, not quite yet! Your data is probably good
enough to be published in a tabloid magazine as it is, but there is one
additional step that a conscientious researcher must take.
The program may point out some errors in your program. Such
errors are often caused by (1) omission of a period, quote, or slash, (2)
the misspelling of a command or variable name, or (3) other
"typographical" errors. The computer will beep and stop after each
screen of information has been displayed. Note down any errors and
press <RETURN> to continue.
Note that taking care of one error may fix several other
complaints that SPSS/PC+ had in the previous run. If, for example,
you leave out a period or slash, SPSS/PC+ may encounter several
subsequent "errors"--expressions that are not allowed in the given
context. In other words, if you left out some punctuation, SPSS/PC+
may expect something that is not forthcoming and will continue to
complain.
When you are satisfied that the errors have been removed from
the program, add the line
frequencies variables=all.
                                                Valid      Cum
Value Label        Value  Frequency  Percent  Percent  Percent
In this case, it is quite evident that an error has been made since
there is no such legitimate value as "4" for this question. That is, you
either own a dog, don't own a dog, don't know if you own a dog, or
refuse to answer the question. Therefore, the code "4" cannot
represent an acceptable answer. We now know that something went
wrong, and we will want to track down the error. Also note that we
really would have no way of detecting if the value of "3" had been
entered one time too many (at the expense of some other code) since
that would not show up as an illegitimate value. Be sure to note down
all unacceptable values. (In this case, there is only this one
"objectionable" value). Note that the period indicates a "system
missing value" (or a blank) and that "9" is our defined missing value to
be used when the given answer is not usable.
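The logic of this check is simple enough to sketch in Python (an illustration outside SPSS/PC+; the function name and the sample values are made up). For the dog ownership question, the legitimate codes are 1, 2, 3, the missing value 9, and a blank.

```python
from collections import Counter

# Legitimate codes for the ownership question; "" stands for a blank
# (system missing) and "9" for the defined missing value.
VALID_OWNDOG = {"1", "2", "3", "9", ""}


def illegitimate_values(values, valid):
    """Count every value that is not among the legitimate codes."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if value not in valid}


# Made-up column of answers containing one mistyped "4".
answers = ["1", "2", "2", "1", "4", "3", "", "9", "1"]
print(illegitimate_values(answers, VALID_OWNDOG))
```

Like the frequencies table, this only catches codes that cannot occur at all; a "3" typed where a "2" belonged passes through unnoticed.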
Once you have weeded out the incorrect values, you may want
to run the frequencies check again to see if you got them all or if new
ones have come about as a result of editing.
Step 5: Using Statistical Procedures and Computations
Frequencies
you will get the mean, standard deviation, median, mode, and various
other statistics associated with the distribution.
DOGS Number of dogs owned or leased
                                                Valid      Cum
Value Label        Value  Frequency  Percent  Percent  Percent
Recoding Variables
As is evident from the example above, you first state the name
of the variable. Each value range is then specified in parentheses,
followed by an equal sign, and the desired recoded value.
Reverse Scoring
When using Likert scales and other measures of opinion or attitude, it is
sometimes desirable to word questions in the opposite direction of what one is
looking for. There are two reasons for this approach. First, wording the question
one way may be clearer or more natural than wording it the other way. Secondly, it
may be desirable to reverse the polarity of the question to prevent respondents from
simply checking the same answer for each question.
In reverse scoring, we will essentially turn the scale upside down. That is,
we will convert the highest value to the lowest, the lowest value to the highest, and
so forth. In this example, since we have a seven point scale, the command would
look like this:
RECODE likert4 (7=1) (6=2) (5=3) (3=5) (2=6) (1=7).
(Notice that "(4=4)" is superfluous. That is, if the person neither agrees
nor disagrees, that fact is not going to change with the polarity of the question).
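The recode above follows a simple rule: each answer is replaced by the lowest value plus the highest value minus the answer, which for a seven point scale is 8 minus the answer. A minimal Python sketch of that rule follows (an illustration outside SPSS/PC+; the function name is hypothetical, and leaving the missing value 9 unchanged is an assumption that matches the RECODE command, which does not mention 9).

```python
def reverse_score(value, low=1, high=7, missing=9):
    """Turn a scale value upside down: low becomes high and vice versa."""
    if value is None or value == missing:
        return value  # missing answers are left exactly as they are
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the {low}-{high} scale")
    return low + high - value


print([reverse_score(v) for v in range(1, 8)])
```

The midpoint maps onto itself (8 - 4 = 4), which is why "(4=4)" can be left out of the RECODE command.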
Crosstabulation
       2         4     10     16      5     19      6     60
  Medium                                                33.3
       3        10      6     17      5     15      7     60
  Tall                                                  33.3
  Column        20     26     47     15     55     17    180
   Total      11.1   14.4   26.1    8.3   30.6    9.4  100.0
where "firstvar" and "secondv" are the two variables you want to
tabulate against each other. Notice that the optional statistics take up
a lot of room. If you only want Chi square (χ²), you can reduce the
output by substituting "STATISTICS=1" for "STATISTICS=all."
Or, you could select the statistics available from this list:
Statistics Available in Crosstabs
1 Chi square
2 Phi (for 2 x 2 tables) or Cramer's V (for larger tables)
3 Contingency coefficient
4 Lambda
5 Uncertainty coefficient
6 Kendall's Tau-b
7 Kendall's Tau-c
9 Somers' d
10 Eta
11 Pearson's r
If you want more detail, you can get row and column
percentages, i.e. the percentage of the row and column that each cell
contributes, by putting in the "/OPTIONS=3,4" parameter. Thus, if
you were tabulating "firstvar" with "secondv" and wanted these
features as well as Chi square and Cramer's V, the command would
be:
CROSSTABS TABLES=firstvar BY secondv
/STATISTICS=1,2
/OPTIONS=3,4.
Note that the period goes at the very end of all the
subcommands. You would not place a period after "firstvar BY
secondv" if you included additional subcommands as we did in this
case.
T = (n² - n)/2
To find out how many tables would result from running two
different lists against each other, multiply the number of variables in
each list by each other. For example,
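The two counting rules in this section can be checked with a few lines of Python (an illustration outside SPSS/PC+; the function names and the list sizes used below are made up and are not the manual's own example).

```python
def tables_within_list(n_variables):
    """Tables produced when every variable in one list meets every other:
    T = (n^2 - n) / 2."""
    return (n_variables ** 2 - n_variables) // 2


def tables_between_lists(n_first, n_second):
    """Tables produced when one list is run against a different list."""
    return n_first * n_second


print(tables_within_list(6))       # six variables against each other
print(tables_between_lists(3, 4))  # a list of three against a list of four
```

Because the count within a single list grows roughly with the square of the number of variables, a request that looks harmless can easily produce dozens of tables.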
       2         4     10     16      5     19      6     60
  Medium                                                33.3
       3        10      6     17      5     15      7     60
  Tall                                                  33.3
  Column        20     26     47     15     55     17    180
   Total      11.1   14.4   26.1    8.3   30.6    9.4  100.0
Cramer's V                  .13092
Contingency Coefficient     .18205
Kendall's Tau B            -.02042     .3730
Kendall's Tau C            -.02222     .3730
Pearson's R                -.03093     .3401
Gamma                      -.02806
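Cramer's V and the contingency coefficient are both rescaled versions of Chi square, so one can be recovered from the other. The Python sketch below (an illustration outside SPSS/PC+, using the standard textbook definitions and hypothetical function names) works backwards from the printed Cramer's V. It assumes that the table has three rows and six columns; only two rows survive in the listing above, so the three-row shape is an inference from the row totals of 60 out of 180 cases.

```python
import math


def chi_square_from_cramers_v(v, n_cases, n_rows, n_cols):
    """Invert V = sqrt(chi2 / (N * (min(rows, cols) - 1)))."""
    return v ** 2 * n_cases * (min(n_rows, n_cols) - 1)


def contingency_coefficient(chi_square, n_cases):
    """C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi_square / (chi_square + n_cases))


# Printed V and N from the table above; a 3 x 6 table is assumed.
chi2 = chi_square_from_cramers_v(0.13092, n_cases=180, n_rows=3, n_cols=6)
print(f"{contingency_coefficient(chi2, n_cases=180):.5f}")
```

Under that assumption the result agrees with the Contingency Coefficient line in the output, which is a reassuring sign that the two statistics are telling the same (weak) story.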
/OPTIONS=options .
Thus,
CORRELATION x60 to x65/OPTIONS=5.
would create a matrix of x60 to x65 against each other. Notice that
close to half of the correlations would be redundant (i.e. X62 against
X63 is the same as X63 against X62) and the coefficients of the
correlations on the diagonal would be all ones since we are correlating
each variable with itself. (That is, a variable perfectly "predicts"
itself).
[9] Note that, unless you change the options, this significance level is
one-tailed.
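To see exactly which cells of the x60 to x65 matrix carry new information, the non-redundant pairs can be listed with a short Python sketch (an illustration outside SPSS/PC+; the variable names are simply generated from the range in the command above).

```python
from itertools import combinations

variables = [f"x{i}" for i in range(60, 66)]   # x60 ... x65

total_cells = len(variables) ** 2              # full square matrix
diagonal = len(variables)                      # each variable with itself
unique_pairs = list(combinations(variables, 2))
redundant = total_cells - diagonal - len(unique_pairs)

print(total_cells, diagonal, len(unique_pairs), redundant)
for first, second in unique_pairs:
    print(first, "with", second)
```

Only the pairs listed need to be read; the mirror-image cells repeat them, and the diagonal is always 1.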
Please note that, when doing a regression analysis, you are not
guaranteed any result. If no predictor variables are significant at the
first step, the process will simply tell you that the "PIN limit" of .05 (or
whatever level you may have specified if you chose to override the
default) has been reached and terminate. This is a frequent outcome,
reflecting an empirical reality, and does not indicate that an error has
been made.
[10] Some more esoteric varieties (such as hierarchical regression) are also
available but will not be discussed.
Discriminant Analysis
11
Technically, it is possible to specify more than two groups. If, for
example, you specified "GROUPS=CLEVEL(2,4)," you could
involve sophomores, juniors, and seniors in your analysis.
which variables you wish to attempt to use as predictors. Thus, if you
were trying to predict purchase of your product (purchase=1) vs. non-
purchase of your product (i.e. no purchase or a purchase of your
competitor's product) (purchase=2), using income, sex, age,
education, and various other demographic variables (DEMO1 to
DEMO15), your procedure might look like this:
DSCRIMINANT GROUPS=purchase(1,2)
/VARIABLES=income,sex,age,educ,demo1 to demo15
/STATISTICS=1,2.
Again, the title sounds a little esoteric, but the basic concept is
quite simple. Sometimes, a single population mean may be deceptive
because it is heavily influenced by some underlying variable.
Suppose, for example, that we know that the average height of a
group is 69 inches (5'9"). We might find it useful to break down this
figure by men and women, realizing that a lot of the variability is due
to between group (sex) differences:
Summaries of HEIGHT
By levels of SEX
But we can do better than this. Suppose that we also know that
half of the people are basketball players (bball=1) and half are not
(bball=0). We can now do a breakdown at two levels. Notice that sex
is still the outermost criterion for distinction since this is considered
the stronger source of influence. (That is, we expect the average male
non-basketball player to be taller than the average female player).
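The two-level breakdown can be illustrated with a short Python sketch. The eight cases are made up for illustration (they are not the data behind the printout), with sex coded 1 for males and 2 for females and bball coded 1 for players and 0 for non-players.

```python
from collections import defaultdict

# Made-up cases: (sex, bball, height in inches); sex 1 = male, 2 = female.
cases = [
    (1, 1, 77), (1, 1, 75), (1, 0, 71), (1, 0, 69),
    (2, 1, 69), (2, 1, 67), (2, 0, 63), (2, 0, 61),
]

def mean(values):
    return sum(values) / len(values)

# Collect heights at both levels of the breakdown.
by_sex = defaultdict(list)
by_sex_and_bball = defaultdict(list)
for sex, bball, height in cases:
    by_sex[sex].append(height)
    by_sex_and_bball[(sex, bball)].append(height)

print("overall", mean([h for _, _, h in cases]))
for sex in sorted(by_sex):
    print("sex", sex, mean(by_sex[sex]))
    for bball in (1, 0):
        print("   bball", bball, mean(by_sex_and_bball[(sex, bball)]))
```

Because sex is the outer grouping, each sex gets its own mean first and the basketball split is nested inside it, which mirrors the order of the criteria described above.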
Summaries of HEIGHT
By levels of SEX
BBALL basketball player or not
1=construction
2=other blue collar
3=professional
4=other white collar
5=other
and we are interested in testing for income differences between the
first four groups, our statement would look as follows:
ONEWAY VARIABLES=income BY occup(1,4).
The ANOVA program is somewhat more complex and should
be attempted only by individuals well versed in statistics. Details are
in the SPSS/PC+ Base Manual.
where "var1" and "var2" are the names of the two variables you want
to plot.
[Plot output omitted: scatterplot with "Weekly chocolate" on the vertical axis (scale 0 to 9) and a horizontal axis running from 20 to 90.]
There are two kinds of t-test commonly used. The first is used to
test for differences between groups, as when we test for
differences in height between males (sex=1) and females (sex=2). The
syntax for this kind of t-test is
T-TEST GROUPS=sex(1,2)/VARIABLES=height.
GT Greater than
GE Greater than or equal
LT Less than
LE Less than or equal
NE Not equal
A.: No. There is a bug that will often "invalidate" the arrow keys
after an SPSS/PC+ program has been run. The best way to get
around this problem is to save your data and reboot the system.
A.: No. Either print it to an ASCII file (using PrintFile) or use the
IMPORT command in SPSS/PC+.
Q.: I have noticed that some people use the menu system in
SPSS/PC+ to write programs. Why isn't it covered in this text?
A.: It's true that the menus free you from typing in certain
commands in their entirety, but in the author's view, they
create more problems than they solve.
Q.: What about the Data Entry module? Doesn't that save time?
A.: Some people find it easier to use this module, but since it has to
be purchased as a separate product, it may not be available to
you in industry, and it is therefore not a good idea to become
dependent on it. Also, if you are entering a long questionnaire,
it is inefficient and frustrating to have to press <RETURN>
after entering each variable.
Q.: I just have a short survey and all I want is some averages and
perhaps some standard deviations. Do I really have to use
SPSS/PC+ for this?
A.: No. Lotus 123 will allow you to compute the mean, standard
deviation, maximum, and minimum for a data range. In
addition, you can do a regression analysis (DataRegression).
Appendix B
Frequencies
Warning!
                                             Valid      Cum
Value Label     Value  Frequency  Percent  Percent  Percent
                    7          1       .4       .4       .4
                    8          1       .4       .4       .8
                    9         15      5.4      5.7      6.5
                   10         15      5.4      5.7     12.2
                   11         16      5.7      6.1     18.3
                   12         27      9.7     10.3     28.5
                   13         39     14.0     14.8     43.3
                   14         36     12.9     13.7     57.0
                   15         36     12.9     13.7     70.7
                   16         30     10.8     11.4     82.1
                   17         18      6.5      6.8     89.0
                   18         20      7.2      7.6     96.6
                   19          5      1.8      1.9     98.5
                   20          2       .7       .8     99.2
                   21          2       .7       .8    100.0
                   99         16      5.7  MISSING
                         -------  -------  -------
                TOTAL        279    100.0    100.0
13
For this example, we will use information from a Frequencies printout,
but you could also get the needed information from the procedure Means.
14
For formulas, see an introductory statistics or research methods text.
From the above, we can conclude with 95% certainty that the
"true" population mean is somewhere between 13.64 and 14.30 years.
If we wanted to be more certain than that, the confidence interval
would be wider.
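The interval can be reproduced from the Frequencies printout with a short Python sketch. It assumes the usual large-sample formula (mean plus or minus 1.96 standard errors) and excludes the sixteen cases coded 99 as missing; the variable names are illustrative.

```python
from math import sqrt

# Valid (non-missing) values and frequencies from the Frequencies table;
# the 16 cases coded 99 are missing and are left out.
frequencies = {
    7: 1, 8: 1, 9: 15, 10: 15, 11: 16, 12: 27, 13: 39, 14: 36,
    15: 36, 16: 30, 17: 18, 18: 20, 19: 5, 20: 2, 21: 2,
}

n = sum(frequencies.values())
mean = sum(value * count for value, count in frequencies.items()) / n

# Sample variance with n - 1 in the denominator.
variance = sum(count * (value - mean) ** 2
               for value, count in frequencies.items()) / (n - 1)
std_error = sqrt(variance) / sqrt(n)

Z_95 = 1.96  # assumed critical value: normal approximation, 95% level
lower = mean - Z_95 * std_error
upper = mean + Z_95 * std_error

print("n = %d, mean = %.2f" % (n, mean))
print("95%% confidence interval: %.2f to %.2f" % (lower, upper))
```

Replacing 1.96 with a larger critical value (for example 2.58 for the 99% level) widens the interval, which is the trade-off described above.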
Let's suppose that you have been hired by Tallon Termite & Pest
Control, Inc., to assess the extent to which aardvarks constitute
serious competition to its nitrogen program. You are directed to find out
the snack preferences of the rivals by sampling one hundred aardvarks
and asking if they prefer ants or termites. The results are as follows:
Valid Cum
Value Label Value Frequency Percent Percent Percent
Crosstabs
15
q is not always the equivalent of the proportion preferring ants. In
some cases, there may be several ways in which something can be "not p"
without qualifying as q. For example, suppose that the aardvarks were
also given the choice "other." In that case, q would be equal to the
proportion preferring ants plus the proportion preferring other snacks.
Crosstabulation: HEIGHT Height of customer
By ANIMAL Species of preferred stuffed animal
 2  Medium       4    10    16     5    19     6      60
                                                    33.3
 3  Tall        10     6    17     5    15     7      60
                                                    33.3
    Column      20    26    47    15    55    17     180
    Total     11.1  14.4  26.1   8.3  30.6   9.4   100.0

Statistic                      Value    Significance
Cramer's V                    .13092
Contingency Coefficient       .18205
Kendall's Tau B              -.02042        .3730
Kendall's Tau C              -.02222        .3730
Pearson's R                  -.03093        .3401
Gamma                        -.02806
people to fall into the category of being both tall and preferring bears.
This figure compares to the actual, or observed, count of seventeen
people falling into that cell.
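The expected count for a cell is the row total times the column total, divided by the grand total. The Python sketch below applies this to the cell with the observed count of seventeen, assuming that the column with a total of 47 is the bears column; the function name is hypothetical.

```python
def expected_count(row_total, column_total, grand_total):
    """Count expected in a cell if the row and column variables were independent."""
    return row_total * column_total / grand_total

# Totals read from the crosstabulation: 60 tall customers, 47 customers in the
# column containing the observed count of 17, and 180 customers overall.
tall_total = 60
column_total = 47
grand_total = 180
observed = 17

expected = expected_count(tall_total, column_total, grand_total)
print("expected %.2f, observed %d" % (expected, observed))
```

The gap between the observed and expected counts in each cell is what the Chi square statistic accumulates over the whole table.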
3. When a large number of cells are empty, the test may not be
valid.
Pearson Correlation
WARNING!
Pearson correlation coefficients are completely
meaningless when one or more of the variables is
nominal (categorical). If we survey people for their
preferences for soft drinks and "1" corresponds to
Coke, "2" corresponds to Pepsi, and "3" corresponds to
Sprite, the fact that Sprite drinkers may be special
and "reaching for more" doesn't make them exactly
three times as much of anything as Coke drinkers. (To
use categorical data in correlation or regression
analysis, you can compute indicator variables, which
have metric properties, and use those instead).
Correlations: VACATION
INCOME .5636
( 263)
P= .000
The first figure is the sample correlation coefficient. This
ranges between -1.0, when there is a perfect negative correlation
between the two variables, and 1.0, when there is a perfect positive
correlation between the variables.
16
Note that a case will be missing in correlation even if only one of the
two variables is missing.
17
Because we are taking a finite sample from a population, we expect
our sample results to differ somewhat from those of the actual population.
[Plot output omitted: scatterplot whose vertical axis label begins "annual household" (scale 100 to 400).]
Working With Large Data Sets
Now suppose that instead of just two lines per case, we have a
data set that requires five lines per subject. Two cases, using that
system, might look like this:
001 51881042413879258528815244790246113226620946180660912
001 17360960447695181568034889642259507514245342577273341
001 90272540808655925656834500842489437576499768779375834
001 85877256735693738165767968166614884079224430157401661
001 93770572368521277158218564000020348310063127470916776
002 28192629327531949569666071329351801708701532387853052
002 46719233470503286424781463574656343764730263589509917
002 56568263273220578001682471949627692765870597258624634
002 92326111451257030599355546291915734015614726564885112
002 85184875444974819662113452795996745459258090787424135
and going on like that until five lines have been created.
For every case, then, "LINE1" would be equal to 1,
"LINE2" to 2, etc.
2. At the end of the five lines, create a data line that only contains
the one "variable" called "BOGUS." This will create a blank line
that makes it more obvious where each case ends. Note that
you must assign the columns that "BOGUS" could take up even
though the variable will be "system missing" for all cases. We
will just say that it occupies column 1 of line 6. The complete
data list from the above might look like this:
data list /id1 1-3 line1 4 q1 to q50 6-55
/id2 1-3 line2 4 q51 to q100 6-55
/id3 1-3 line3 4 q101 to q150 6-55
/id4 1-3 line4 4 q151 to q200 6-55
/id5 1-3 line5 4 q201 to q250 6-55
/bogus 1.
Two cases from this data set might look like this:
0011 51881042413879258528815244790246113226620946180660912
0012 17360960447695181568034889642259507514245342577273341
0013 90272540808655925656834500842489437576499768779375834
0014 85877256735693738165767968166614884079224430157401661
0015 93770572368521277158218564000020348310063127470916776
0021 28192629327531949569666071329351801708701532387853052
0022 46719233470503286424781463574656343764730263589509917
0023 56568263273220578001682471949627692765870597258624634
0024 92326111451257030599355546291915734015614726564885112
0025 85184875444974819662113452795996745459258090787424135
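The line-number column makes it possible to check, before running SPSS/PC+, that no record has been dropped or entered out of order. The Python sketch below is one hypothetical way to do that outside SPSS/PC+; it uses the column positions from the data list above, and the function names are illustrative.

```python
def parse_record(line):
    """Split one fixed-format record using the columns from the data list:
    id in columns 1-3, line number in column 4, answers in columns 6-55."""
    case_id = int(line[0:3])      # columns 1-3
    line_no = int(line[3:4])      # column 4
    answers = [int(ch) for ch in line[5:55]]  # columns 6-55, one digit each
    return case_id, line_no, answers

def check_case(lines):
    """Return True if five records share one id and are numbered 1 to 5."""
    parsed = [parse_record(line) for line in lines if line.strip()]
    ids = {case_id for case_id, _, _ in parsed}
    numbers = [line_no for _, line_no, _ in parsed]
    return len(ids) == 1 and numbers == [1, 2, 3, 4, 5]

record = "0011 51881042413879258528815244790246113226620946180660912"
case_id, line_no, answers = parse_record(record)
print(case_id, line_no, len(answers), answers[:5])
```

Blank lines (such as the one produced by the "BOGUS" record) are skipped by `check_case`, so the same check works whether or not the separator line is present.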
If your data set becomes so big that it takes a long time to run,
you might consider making a "system file," which will increase the
execution speed considerably. See Appendix D for more information.
Appendix D
Now you are ready to access the system file for further analysis.
To do that, create a new file (say, "FASTSURV.SPS"). Put in a "get"
statement as the very first line in this file:
GET /FILE="filename.SYS".
Advanced Tip
The following will illustrate how you can import sets of data
that consist of at most one line per case into SPSS/PC+. (If you have
more than one line of data, you will most likely have to use an
SPSS/PC+ system file. See Appendix D and the section on Import in
the SPSS/PC+ Base Manual.) Since you already have the data
entered into LOTUS, it is assumed that you are reasonably familiar
with the program.
Warning!
Lotus is not very strict about whether you put your cases
across rows or columns. That is, when you put the first variable
from a case in a cell, you have a choice between putting the next
variables to the right of this variable or below. For example, if you
put VAR1 in A1, you can either put variables 2 through 5 in A2..A5
or B1..E1.
First, get into LOTUS 123 and bring up the file that contains the
data you wish to use in SPSS/PC+. Make sure that your data is
aligned properly in the columns by issuing the "/Range Format Fixed"
sequence of commands.
Now make sure that the data does not take up more than eighty
columns or spaces. If it does, you might try to reduce the length by
first reducing excess columns assigned to any one variable. For
example, while LOTUS assigns nine digits to each column by default,
you don't need that many for a variable like age. (For reasons beyond
the scope of this book, you need one more column than the number of
digits of the greatest number.) To reduce the number of columns
reserved for age to three, get into the column containing the age and
issue the directive "/Worksheet Column Set width" and then specify
3. If your data still does not fit, you might think about only taking
some of the variables into SPSS/PC+. You can do that by moving only
those variables you want into another area of the spreadsheet.
Warning!
The program will now run just like any other program.
Appendix F
A B C D E F
+a3+b3+c3+d3 .
After you have completed the calculation for the first case, you
would have to copy the formula from E3 to E4 through E6:
/cE3 <CR> E4..E6 <CR>.
in SPSS/PC+.
18
For those interested in econometric and forecasting methodology, it
is also possible to use lagged data. Details are found in the manual.
Important Note!
First of all, both SPSS/PC+ and dBase clearly divide the data
into individual cases or "records." In dBase, the data associated with
each entity is kept within a single record. Thus, if we have a database
of student grades, a dBase record might look like this:
Record 1
ID 5551515555
MIDTERM 84
PAPER 95
FINAL 88
5551515555 084095088
Find the part of the output you would like to include in your
document and use the block function (<ALT>-<F4>) to mark the
appropriate text. To copy the text, use <CONTROL>-<F4> to copy
and press <SHIFT>-<F3> to move back to document #1. Use the
cursor keys to move where you want the text inserted, then press
<RETURN>.
19
Alternatively, you can use the "SET RESULTS" command to direct your
output to a permanent file (see manual).
20
SPSS.LIS is an ASCII file and should be edited and read in
accordingly.
If you want more text from the output, you can repeat the
process.
However, you may run into some problems when you try to
print to a laser printer or other kind of printer that prints on stacks of
single sheets of paper. This happens because, although there is room
for sixty-six lines of text on a laser printer, not all of the space on the
paper is available for the printer's use. Thus, the pages may become
misaligned; that is, the page breaks may not occur at the right places.
Statistical Significance
You will recall from your statistics classes the use of the null
and alternative (or "research") hypotheses (Ho and Ha). You assume
the null hypothesis, which states that there is no relationship, to be
true until you find overwhelming evidence that it is not--in that case,
you "accept" the alternative hypothesis, knowing that there is a
certain chance (e.g. 5%) that the relationship does not exist.
21
The theory behind Bernoulli trials and combinations is beyond the
scope of this text but is generally discussed in introductory finite
mathematics texts.
Statisticians have never satisfactorily resolved the issues and
problems that surround simultaneous significance testing. However,
there are some ways of limiting its potential dangers:
Expanded Glossary
Statistics
case: One entity or unit of data on one or more variables; usually one
subject, individual, or other unit (such as a firm). For example,
when you administer a questionnaire, each person is a case, but
if you have both a pre-test and a post-test, the two tests for the
same subject together constitute a case.
Chi square ($\chi^2$): A statistical test that determines the probability that
two nominal[22] variables, given the sample data, are
"independent," i.e. whether knowing one variable will help
"predict" the other. The formula is

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where k is the number of "cells" (i.e. different
combinations of the two variables), and $O_i$ and $E_i$ are the
observed and expected number of cases in the respective
cell. (See also crosstabulation.)
22
Chi square may be used on ordinal, interval, or ratio scaled
variables, but more efficient tests are available for such variables.
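The Chi square calculation can be sketched in Python using the height by stuffed-animal crosstabulation. The Medium and Tall rows come from the printout; the first row does not appear in it and is inferred here by subtracting those two rows from the column totals, which is an assumption. The Cramer's V and contingency coefficient formulas are the standard textbook ones.

```python
from math import sqrt

# Observed counts, height (rows) by preferred stuffed animal (columns).
# The Medium and Tall rows are taken from the printout; the first row is
# inferred by subtracting them from the column totals (an assumption).
observed = [
    [6, 10, 14, 5, 21, 4],    # inferred first row
    [4, 10, 16, 5, 19, 6],    # Medium
    [10, 6, 17, 5, 15, 7],    # Tall
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * column_totals[j] / n   # expected count for the cell
        chi_square += (o - e) ** 2 / e

smaller_dimension = min(len(observed), len(observed[0])) - 1
cramers_v = sqrt(chi_square / (n * smaller_dimension))
contingency = sqrt(chi_square / (chi_square + n))

print("chi square = %.3f" % chi_square)
print("Cramer's V = %.5f, contingency coefficient = %.5f"
      % (cramers_v, contingency))
```

Run as written, the sketch agrees with the Cramer's V (.13092) and contingency coefficient (.18205) on the printout, which supports the inferred first row.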
 2  Medium       4    10    16     5    19     6      60
                                                    33.3
 3  Tall        10     6    17     5    15     7      60
                                                    33.3
    Column      20    26    47    15    55    17     180
    Total     11.1  14.4  26.1   8.3  30.6   9.4   100.0
data set: The total collection of cases and variables included in a file.
For example, if you administer a questionnaire to one hundred
people and enter the answers to all the questions into the
computer, this would be your data set.
Valid Cum
Value Label Value Frequency Percent Percent Percent
At the "lowest" level are nominal scales, where the numbers are
arbitrary. Thus, if, for a given variable, the code "1"
corresponds to "Democrat" and "2" corresponds to "Republican,"
the scale is not meant to imply that Republicans are exactly
twice as much of anything as Democrats. In fact, we could
reverse the two codes without making any real difference.
Likert scale: A scale which asks the respondent to rate his or her
opinion on a bi-polar scale. For example, a respondent might be
asked to rate his or her level of agreement with a statement
using a scale such as:
Strongly                            Strongly
agree                               disagree
  1     2     3     4     5     6     7
mean: The sum of all the cases of a variable divided by the sample
size; i.e., for the sample mean,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
missing data: The condition that arises when data is not available or
not applicable on one or more variables for one or more cases.
Missing data can take two forms.
mode: The value that occurs most frequently on a given variable. For
example, in the set {1, 2, 3, 3, 3, 3, 4, 4}, the mode is 3.
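For readers who want to check such values by hand, Python's standard `statistics` module computes both the mode and the mean. The sketch below uses the set from the mode entry; it is an illustration outside SPSS/PC+, not part of the package.

```python
from statistics import mean, mode

values = [1, 2, 3, 3, 3, 3, 4, 4]   # the set used in the mode entry

print("mode:", mode(values))
print("mean:", mean(values))
```

The mode and the mean need not coincide: here the most frequent value is 3, while the mean is pulled slightly lower by the values 1 and 2.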
Computer Terms
ASCII file: A plain text file that can be edited. SPSS/PC+ data files
are ASCII files while system files, word processing files, and
executable files are normally not ASCII files.
cursor: The small blinking character that tells you where on the
screen you are editing.
system file: An SPSS/PC+ file in which the data, value labels, and
variable labels are compressed so that they can be more
quickly accessed by SPSS/PC+. System files are normally only
worth the effort if the data set is very large. For more
information, see Appendix D.
Appendix