SPSS Help and Tutorial - How To Use SPSS
Table of Contents
Introduction – Part 1
Data View
Variable View
Defining Variables
Data Entry
Descriptive Statistics
Frequency Analysis
Crosstabs
Data Manipulation
Select Cases
Splitting a File
Reporting
Appendix
Null Hypothesis
Statistical Tests
Tests of Significance
Correlations
Paired-Samples T Test
Independent-Samples T Test
Data Manipulation
Appendix
Simple Regression
Scatter Plot
Multiple Regression
Data Transformation
Computing
Polynomial Regression
Chart Editing
Chi-Square
For additional SPSS help, visit http://www.youtube.com/mycsula
This handout (Descriptive Statistics) introduces basic skills necessary to run PASW Statistics. It
includes how to create a data file and run descriptive statistics. It is especially tailored to answer
three research questions formulated in the sample survey questionnaire, eventually giving users
an overview of how PASW Statistics can be used for survey research. The three research
questions formulated in the sample survey are as follows:
1. What kind of computer do people prefer to own?
2. What color do people prefer for their computer?
3. Is computer color preference different between genders?
DATA VIEW
When PASW Statistics is launched, the Data Editor window opens in Data View, which looks
similar to a Microsoft Excel spreadsheet (which is just an array of rows and columns). The
difference is that the rows and columns in Data View are referred to as cases and variables,
respectively (see Table 1).
Table 1 - Elements in Data View
Element Description
Variable Each column represents a variable. Any survey questionnaire item or test
item can be a variable. Commonly defined variable types are numeric or
string. When defining variables as numeric, users need to specify decimal
places. Variable names can be up to 64 characters long and must start
with a letter. Make variable names meaningful and easily recognizable.
Case Each row represents a case. The participants in the study can be cases. For
example, if 100 participants are involved in your study, then 100 cases (or
rows) of information should be generated. Responses to the question items
should be entered consistently from left to right for each participant.
VARIABLE VIEW
Variable View is where variables are defined by assigning variable names and specifying the
attributes, such as data type (“String,” “Date,” “Numeric,” etc.), value labels, and measurement
scales (“Nominal,” “Ordinal,” or “Scale”). Users can think of Variable View as the backbone
structure for the Data View; data cannot be entered or viewed without first defining variables in
Variable View (see Table 2).
Table 2 - Elements in Variable View
Element Description
Variable Name PASW Statistics will initially give a default variable name (var00001) that
users can change. It is recommended to assign a brief and meaningful
name to variables (e.g., “Name,” “Gender,” and “GPA”).
Variable Type The variable type determines how the cases are entered. Generally, text-
based characters are of “String” type and number-based characters are of
“Numeric” type. For example, if a user has a variable called “Name,”
then its variable type should be “String.” Similarly, a variable named
“GPA” should be a “Numeric” type with (normally two) decimal places.
Value Labels Value labels allow users to describe what each coded value of a
variable stands for. For example, if a color preference is entered as the
code “5,” others may not know what it represents. To avoid
misinterpretation, value labels can attach a description (e.g., “Other”) to each code.
DEFINING VARIABLES
First, variable names based on your research questionnaire need to be assigned. If variable names
are not assigned, PASW Statistics will assign default names that may not be recognizable.
Second, the Type attribute should be specified for each variable. If necessary, assign labels to
values to help all users of the file understand the data better.
DATA ENTRY
After defining the variables, users can enter data for each case. If variables are defined as having
a “Numeric” data type, then numeric data should be entered. PASW Statistics will only accept
numeric values (the digits 0-9, with an optional sign and decimal point) for a “Numeric” data type. If variables are defined as “String” data, any
keyboard character can be entered.
To enter data:
1. Click the Data View tab at the lower left corner of the Data Editor window (see Figure
7).
2. Click in a cell and type the corresponding data. The entry will also appear in the Cell
Editor (see Figure 8).
Figure 7 - Data View Tab
Figure 8 - Cell Editor
Descriptive Statistics
After data has been entered, users may begin analyzing the data by using descriptive statistics.
Descriptive statistics are the most commonly used statistics for summarizing data frequency or
measures of central tendency (mean, median, and mode).
Research Question # 1
What kind of computer do people prefer to own?
FREQUENCY ANALYSIS
We can use frequency analysis to answer the first research question. Frequency analysis is a
descriptive statistical method that shows the number of occurrences of each response chosen by
the respondents. When using frequency analysis, PASW Statistics can also calculate the mean,
median, and mode to help users analyze the results and draw conclusions. The following
example will use a frequency analysis to answer “Research Question # 1: What kind of computer
do people prefer to own?” using the data collected from our sample survey (see Appendix).
7. Click the Statistics… button. The Frequencies: Statistics dialog box opens.
8. Select the Mean, Median, and Mode check boxes in the Central Tendency section; select
the Std. deviation check box in the Dispersion section.
9. Click the Continue button. This returns you to the Frequencies dialog box.
10. Click the OK button. An Output Viewer window opens and displays the statistics and
frequency table (see Figure 12). The columns of the table “Computer Owned” display the
“Frequency,” “Percent,” “Valid Percent,” and “Cumulative Percent” for each different
type of computer owned.
The measures of central tendency (mean, median, and mode) can be used to summarize various
types of data. Mode can be used for nominal data, such as computer type, computer color,
ethnicity, etc. Mean or median can be used for interval/ratio data, such as test scores, age, etc.
The median is preferable for data with a skewed distribution because it is less affected by extreme values.
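As a rough cross-check outside PASW Statistics, the same summary can be sketched in Python. This is only an illustration: the response values below are hypothetical (they are not the handout's survey data), and Python is not part of the original workshop.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical responses (not the handout's survey data)
computer_owned = ["IBM or Compatible", "Macintosh", "IBM or Compatible",
                  "Other", "IBM or Compatible", "Macintosh"]
ages = [19, 21, 22, 22, 25, 40]  # interval/ratio data with one high value

counts = Counter(computer_owned)
total = len(computer_owned)

# Frequency table: frequency, percent, and cumulative percent per response
cumulative = 0.0
for value, frequency in counts.most_common():
    percent = 100 * frequency / total
    cumulative += percent
    print(f"{value:<20}{frequency:>3}{percent:>8.1f}{cumulative:>8.1f}")

# Mode suits nominal data; mean and median suit interval/ratio data
print("Mode of computer type:", mode(computer_owned))
print("Mean age:", round(mean(ages), 2))
print("Median age:", median(ages))
```

In this made-up sample, the single high age pulls the mean above the median, which is why the median is the safer summary for skewed data.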
Research Question # 2
What color do people prefer for their computer?
CROSSTABS
Crosstabs are used to examine the relationship between two variables. To answer the second
research question, users will need to analyze two variables: “Computer Owned” and “Color”
(which indicates color preference). Using crosstabs will show the intersection between these two
variables and reveal the computer type and color preferred by most people.
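The idea behind a crosstab can be sketched in a few lines of Python. The records below are hypothetical and the layout is only an approximation of the PASW Statistics output table.

```python
from collections import Counter

# Hypothetical survey records: (computer owned, preferred color)
responses = [
    ("IBM or Compatible", "Black"),
    ("IBM or Compatible", "Other"),
    ("Macintosh", "White"),
    ("IBM or Compatible", "Black"),
    ("Macintosh", "Gray"),
    ("IBM or Compatible", "Black"),
]

# Each crosstab cell counts one (row value, column value) combination
cells = Counter(responses)
rows = sorted({computer for computer, _ in responses})
cols = sorted({color for _, color in responses})

# Print the table: one row per computer type, one column per color
print("".ljust(20) + "".join(color.ljust(8) for color in cols))
for computer in rows:
    print(computer.ljust(20) + "".join(str(cells[(computer, color)]).ljust(8) for color in cols))

# The cell with the highest count is the most common combination
top_cell, top_count = cells.most_common(1)[0]
print("Most common combination:", top_cell, top_count)
```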
Data Manipulation
Data files are not always ideally organized in a form to meet specific needs. For example, users
may wish to select a specific subject or split the data file into separate groups for analysis.
SELECT CASES
If you have two or more subject groups in your data and you want to analyze each group in
isolation, you can use the select cases option. For example, the data we are currently analyzing
has both male and female participants. If you wish to analyze only the female cases, you can
select cases based on the “Gender” variable and set the condition to include female cases only.
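Conceptually, selecting cases is a filter on a condition. A minimal Python sketch follows; the field names and codes are assumptions made for illustration, not the handout's variable definitions.

```python
# Hypothetical cases; field names and codes are assumptions
cases = [
    {"gender": "F", "computer": "IBM or Compatible", "color": 5},
    {"gender": "M", "computer": "Macintosh", "color": 2},
    {"gender": "F", "computer": "Macintosh", "color": 4},
    {"gender": "M", "computer": "IBM or Compatible", "color": 2},
]

# Keep only the cases that satisfy the condition (female participants)
female_cases = [case for case in cases if case["gender"] == "F"]

print(len(female_cases), "of", len(cases), "cases selected")
```

Any later analysis would then run on `female_cases` only, much as PASW Statistics ignores the unselected cases.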
In the crosstabulation in the Output Viewer window, look at the column for the
most preferred color and the row for the computer types. Since we selected only female cases,
what is the computer color most preferred by women? Ten women chose “IBM or Compatible”
with color option “5.” Thus, you may conclude that most female participants prefer the color “5”
for “IBM or Compatible” computers. However, what does “5” represent? This problem arose by
not labeling the variable value “5” as “Other.” Moreover, even if it were labeled “Other,” it does
not indicate any particular color, making it difficult to draw a conclusion. In order to avoid such
problems, it is suggested that you provide a blank space where participants can specify “Other”
color preferences besides the ones specified in the survey questionnaire.
Example:
What kind of color do you like to have for your computer?
1. Beige   2. Black   3. Gray   4. White   5. Other __________
Research Question # 3
Is computer color preference different between genders?
SPLITTING A FILE
To answer the third research question, we need to split the file. You can analyze one particular
group of subjects using the select cases option. However, if you wish to compare the response or
performance differences by groups within one variable, it is best to use the split files option.
Answer: Yes
Explanation: There is a computer color preference difference based on gender. From the
crosstabulation output, females prefer “IBM or Compatible” of “Other” color over the colors
beige, black, gray, or white. The male group prefers “IBM or Compatible” of “black” color.
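Splitting a file amounts to repeating the same analysis once per group. The sketch below uses hypothetical records, so its counts do not reproduce the handout's output; it only shows the grouping idea.

```python
from collections import Counter, defaultdict

# Hypothetical records: (gender, computer owned, preferred color)
cases = [
    ("F", "IBM or Compatible", "Other"),
    ("F", "IBM or Compatible", "Other"),
    ("F", "Macintosh", "White"),
    ("M", "IBM or Compatible", "Black"),
    ("M", "IBM or Compatible", "Black"),
    ("M", "Macintosh", "Gray"),
]

# One crosstab (a Counter of cells) per value of the split variable
groups = defaultdict(Counter)
for gender, computer, color in cases:
    groups[gender][(computer, color)] += 1

# Report the most common combination within each group
for gender in sorted(groups):
    (computer, color), count = groups[gender].most_common(1)[0]
    print(f"{gender}: {computer} / {color} ({count})")
```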
FIND AND REPLACE
In PASW Statistics, the Find and Replace function is an efficient way to locate and change values. Users can use Find
and Replace in Data View. However, only the Find function is available for users in Variable
View.
NOTE: Under the Match to section of the Find and Replace dialog box (see Figure 22), Contains
means PASW Statistics will find each instance of the word/phrase/number appearing in a cell,
whether or not it is the only information enclosed. The Entire cell option will find the
word/phrase/number that matches the entire cell as a whole. The Begins with and Ends
with options will find cells that begin or end with the characters indicated by the user.
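The four Match to options can be illustrated with ordinary string tests. This is a sketch only: the function name and mode names are hypothetical, and the case-sensitive comparison is an assumption rather than documented PASW Statistics behavior.

```python
def matches(cell, text, mode):
    """Return True if the cell matches the search text under the given mode."""
    cell, text = str(cell), str(text)
    if mode == "contains":
        return text in cell
    if mode == "entire_cell":
        return cell == text
    if mode == "begins_with":
        return cell.startswith(text)
    if mode == "ends_with":
        return cell.endswith(text)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical cell contents
cells = ["Black", "Blackberry", "Jet Black", "White"]

for mode in ("contains", "entire_cell", "begins_with", "ends_with"):
    found = [cell for cell in cells if matches(cell, "Black", mode)]
    print(mode, "->", found)
```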
Reporting
Once the statistical analysis is complete, the final step is to create a report. In the report, you may
include PASW Statistics output (e.g., graphs and tables) for supporting your analysis. Using the
Copy and Paste functions, the tables/graphs generated in PASW Statistics can be copied from the
Output Viewer window and pasted into a Microsoft Word document without having to create
new tables or graphs.
Research Questions
Survey Questions
1. What is your name? _____________________________
This handout (Test of Significance) introduces 1) several data entry and data manipulation
techniques that help you save time, 2) basic skills to perform tests of significance, such as
correlations and t tests, and 3) an introduction to multiple response sets. The step-by-step
instructions will help you understand how to interpret the output of your tests from data supplied
by your research question(s). Follow the steps carefully to get appropriate results. Please note
that a slightly different process might yield unexpected and complicated results. This is a
continuation of the PASW Statistics Descriptive Statistics handout.
Null Hypothesis
The null hypothesis (H0) represents a theory that has been presented, either because it is believed
to be true or because it is to be used as a basis for an argument. It is a statement that has not been
proven. It is also important to realize that the null hypothesis is the statement of no difference.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is
no better, on average, than the current drug (in other words, the new drug exhibits the same
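behavior as the old drug). The null hypothesis (H0) and the alternative hypothesis (H1) can be
stated as:
H0: The new drug is no better, on average, than the current drug.
H1: The new drug is better, on average, than the current drug.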
Special consideration is given to the null hypothesis. This is due to the fact that the null
hypothesis relates to the statement being tested, whereas the alternative hypothesis relates to the
statement to be accepted if and when the null is rejected.
The final conclusion, once the test has been carried out, is always given in terms of the null
hypothesis. The result is either "Reject H0 in favor of H1" or "Do not reject H0"; the conclusion is
never "Reject H1" or "Accept H1."
NOTE: The null hypothesis essentially states that the given cases or items under consideration are
statistically the same or exhibit the same behavior without any significant difference. The alternative
hypothesis states that the given cases exhibit different behavior or that they have a statistically significant
difference.
Statistical Tests
Statistics is a set of mathematical techniques used to summarize research data and determine
whether the data supports a proposed hypothesis. PASW Statistics includes tools that can be used
to analyze variables and determine the strength and nature of the relationship between two
variables and whether the means (averages) of two data sets (samples) are statistically the same
or different.
Tests of Significance
The following examples are sample research questions that can be answered using PASW
Statistics analytical methods.
CORRELATIONS
A correlation is a statistical device that measures strength or degree of a supposed linear
association between two or more variables. One of the more common measures used is the
Pearson correlation, which estimates a relationship between two interval variables.
Research Question # 1
Is there a relationship between academic performance and Internet access?
Answer: Yes
Explanation: As shown in Figure 24 above, the correlation coefficient for the relationship between “active”
and “posttest” is 0.476, which falls between 0.4 and 0.7. The correlation coefficient for the relationship between
“active” and “gpa” is 0.448, which also falls between 0.4 and 0.7. The results from these analyses indicate that
there is a moderate, positive relationship between academic performance and Internet access.
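For readers who want to see what the Pearson coefficient measures, here is a minimal Python sketch. The scores are hypothetical, and the `describe` helper encodes only the 0.4-0.7 "moderate" band quoted above; it is not a PASW Statistics feature.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: co-deviation scaled by both variables' spread."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def describe(r):
    # Only the 0.4-0.7 band is taken from the handout
    return "moderate" if 0.4 <= abs(r) <= 0.7 else "outside the moderate band"

# Hypothetical scores (not the handout's data)
active = [1, 2, 3, 4, 5]
posttest = [2, 1, 4, 3, 5]

r = pearson_r(active, posttest)
print(round(r, 3), describe(r))
print(describe(0.476), describe(0.448))  # the two coefficients reported above
```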
PAIRED-SAMPLES T TEST
A Paired-Samples T Test is used to test if an observed difference between two means is
statistically significant. To run a t test, the following assumptions should be met: the data 1) has
normal distribution, 2) is a large data set, and 3) has no outliers. If any of these assumptions are
not met, then a nonparametric test should be used.
H0: There is no influence of using the Internet on academic achievement for this class.
H1: There is an influence of using the Internet on academic achievement for this class.
The hypothesis is that Internet familiarity cannot influence the academic achievement in the
computer class. The variables that reflect academic achievement are “pretest” and “posttest.”
Answer: Yes
Explanation: The observed mean difference is -4.5172. Since the value of t is -3.820 with a Sig.
of 0.001 (which is less than 0.05), the mean difference (-4.5172) between “pretest” and “posttest”
is statistically significant and the null hypothesis is rejected. Therefore, it
can be inferred that there was an instructional effect taking place in the computer class.
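The t value in a paired-samples test comes from the per-case differences. A small Python sketch with hypothetical scores (not the class data) shows the arithmetic; the significance level would still be read from the PASW Statistics output or a t table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired scores for five students
pretest = [60, 65, 70, 72, 80]
posttest = [66, 68, 75, 74, 88]

# Difference per case (pretest minus posttest); negative means improvement
diffs = [before - after for before, after in zip(pretest, posttest)]

mean_diff = mean(diffs)
std_error = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean difference
t = mean_diff / std_error
df = len(diffs) - 1

print(f"mean difference = {mean_diff:.4f}, t = {t:.3f}, df = {df}")
```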
Two variables are required in the data set. One variable is the measured parameter. Examples
include weight, height, or frequency. The second variable divides the data set into two groups.
In the example that follows, Light and Dark are the groups whose means will be compared.
Research Question # 3
Is there a difference in the average number of seedlings grown in the light
and those grown in the dark?
In this example, 20 Petri dishes each contained 10 celery seeds. Ten of the dishes were kept in
the dark for one week; the other 10 were placed under a grow light for the same amount of time.
At the end of the week, the number of seeds that sprouted was counted in each dish.
H0: There is no difference between seedlings under the light and in the dark (μ(light) = μ(dark)).
H1: There is a significant difference between seedlings under the light and in the dark (μ(light) ≠ μ(dark)).
NOTE: The first set of hypotheses is testing the variance, while the second set is testing the mean.
The variances have to be equal before we can determine if the means are equal.
NOTE: Variance: The arithmetic mean of the squared deviations from the mean, which is essentially used
to see how far the single samples are from the mean. We need to make sure the variances are equal before
we can determine if the means are equal. If the variances are equal, users will be able to move to the T
Test. If the variances are not equal, users will have to do more testing.
Answer: Yes
Explanation: The mean difference in seedlings sprouted between the two treatments (light and
dark) was -2.900. The value of t, which is -3.179, was statistically significant (p=0.005).
Therefore, the null hypothesis is rejected.
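The equal-variances form of the independent-samples t statistic can be sketched as follows. The sprout counts are hypothetical (five dishes per group rather than the ten used in the example), so the result does not match the reported t of -3.179.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical sprout counts per dish (at most 10 seeds per dish)
light = [8, 9, 7, 10, 9]
dark = [5, 6, 7, 4, 6]

n_light, n_dark = len(light), len(dark)
mean_diff = mean(dark) - mean(light)

# Pooled variance assumes the two groups have equal variances
pooled = ((n_light - 1) * variance(light) + (n_dark - 1) * variance(dark)) / (n_light + n_dark - 2)
std_error = sqrt(pooled * (1 / n_light + 1 / n_dark))

t = mean_diff / std_error
df = n_light + n_dark - 2

print(f"mean difference = {mean_diff:.3f}, t = {t:.3f}, df = {df}")
```

If the variances were not equal, this pooled form would not apply and the "equal variances not assumed" row of the output should be used instead.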
3. Select the “American,” “TWA,” “United,” “USAir,” and “Other” airline variables and
move them to the Variables in Set: list box.
4. Make sure the Dichotomies option is selected and enter [1] in the Counted value: box.
5. Type [Airlines] in the Name: box.
6. Type [Airline frequency of response] in the Label: box.
7. Click the Add button. The set is created as “$Airlines” and listed in the Multiple
Response Sets: list box.
8. Click the Close button.
Research Question # 4
In a survey of airline passengers, which airline was selected as having been
flown most often in the previous six months?
To analyze the frequency of response for each variable in a multiple response set:
1. Click the Analyze menu, point to Multiple Response, and select Frequencies…. The
Multiple Response Frequencies dialog box opens (see Figure 32).
2. Select the multiple response set labeled “$Airlines” and move it to the Table(s) for: list
box.
3. Click the OK button. An Output Viewer window opens with the frequency analysis (see
Figure 33).
Answer: United
Explanation: As seen in the Output Viewer window, there were 18 people surveyed and 44 total
responses generated. Of the 44 total responses, United was selected most often with 12 responses
(representing 27.3% – the largest portion of the total responses).
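Counting a multiple response set defined as dichotomies means tallying, per variable, how many cases hold the counted value. The sketch below uses four hypothetical respondents, then rechecks the percentage quoted above from the handout's own totals.

```python
airlines = ["American", "TWA", "United", "USAir", "Other"]

# Hypothetical respondents; 1 means the airline was selected
respondents = [
    {"American": 1, "TWA": 0, "United": 1, "USAir": 0, "Other": 0},
    {"American": 0, "TWA": 0, "United": 1, "USAir": 1, "Other": 0},
    {"American": 0, "TWA": 1, "United": 1, "USAir": 0, "Other": 0},
    {"American": 1, "TWA": 0, "United": 0, "USAir": 0, "Other": 1},
]

counted_value = 1  # same role as the Counted value: box
counts = {name: sum(1 for r in respondents if r[name] == counted_value) for name in airlines}
total_responses = sum(counts.values())

for name in airlines:
    print(f"{name:<10}{counts[name]:>3}{100 * counts[name] / total_responses:>8.1f}%")

# Recheck of the handout's figure: 12 of 44 responses
united_share = round(100 * 12 / 44, 1)
print("United share of responses:", united_share)
```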
Research Question # 5
In a survey of airline passengers, which airline was selected most often by
those passengers who identified themselves as afraid to fly?
2. Select the “FearFactor” variable as the Row(s): variable and the “$Airlines” multiple
response set as the Column(s): variable.
3. Select the “FearFactor” variable after it is designated as the Row(s): variable. The
Define Ranges… button becomes active.
4. Click the Define Ranges… button. The Multiple Response Crosstabs: Define Variable
Ranges dialog box opens (see Figure 35).
5. Enter [0] in the Minimum: box and [1] in the Maximum: box for the “FearFactor”
variable.
6. Click the Continue button.
7. Click the Options… button. The Multiple Response Crosstabs: Options dialog box opens
(see Figure 36).
8. Select the Cases option and then click the Continue button.
9. Click the OK button. The Output Viewer window opens with the crosstab results (see
Figure 37).
Answer: USAir
Explanation: Of the 18 people surveyed, ten identified themselves as being afraid to fly. Within
that group of survey respondents, USAir was the airline selected most often (seven times).
Data Manipulation
PASW Statistics also provides tools to make data manipulation a simple task.
To insert a variable:
1. Switch to Data View.
2. Click the “posttest” variable heading to highlight the column.
3. Click the Edit menu and select Insert Variable. A new variable is inserted to the left of
the highlighted variable (“posttest”).
NOTE: The new variable is created with a default name “VAR00001” which can be changed
later.
4. To define the properties of the new variable, double-click the variable heading. The
Variable View is activated for the new variable.
5. Type [midterm] in the Name column of the new variable.
6. Change the variable type if desired.
In the same manner, it is possible to insert cases in a particular location in Data View. For
instance, assume that a case should be inserted between case “10” and “11” for a particular
student’s record. By following the instructions below, one case will be inserted after the 10th
case.
An identifier has no meaning other than to distinguish the cases from one another and to
identify the corresponding cases in the additional data files. This identifier can be a unique value,
number, or letter combination to be applied to each case.
NOTE: The variables do not have to be the same across data files.
The merging data files function can be used to satisfy this requirement.
10. Once this information has been defined in Variable View, switch by clicking the Data
View tab to enter the corresponding case information.
11. Enter [Alfred] in case 1 of the ID variable, [Bethel] in case 2 of the ID variable, down to
[Jessie] in case 10 of the ID variable. Enter the corresponding information according to
Table 5. See Figure 46 for the results.
Table 5 - Input Case Information
Case  ID         January    February   March         April
1     Alfred     Dog        Star       Pizza         Water
2     Bethel     Cat        Square     Fruit         Soda Pop
3     Chris      Cat        Triangle   Veggies       Grape Juice
4     Dante      Dog        Rectangle  Sandwich      Orange Juice
5     Erica      Tiger      Oval       Chips         Aloe Water
6     Fernando   Tarantula  Circle     Calzon        Beer
7     Grenadine  Dog        Octagon    Salad         White Wine
8     Harold     Bees       Polygon    Soup          Naked Juices
9     Isadora    Turtle     Rhombus    PandaExpress  V8 Juice
10    Jessie     Hamster    Oval       Egg Salad     Lemonade
12. Save the file by clicking the File menu and selecting Save. The Save Data As dialog box
opens.
13. Select the Desktop as the destination and type [Merge 1] in the File name: text box.
14. Click the Save button.
15. Close the Output Viewer window.
To merge data files: (First, make sure the files have the same IDs.)
1. Open the files “Merge 2” and “Merge 3” and check for consistency across all of the IDs.
2. Minimize the “Merge 2” and “Merge 3” data files.
3. Once back in the “Merge 1” file, click the Data menu, point to Merge Files, and select
Add Variables… (see Figure 47).
5. Locate and select the “Merge 2” data file and click the Open button.
6. Click the Continue button. The Add Variables from Merge 2.sav dialog box opens (see
Figure 49).
7. Select the Match cases on key variables in sorted files check box.
8. From the Excluded Variables: list box, select “ID>(+)” (see Figure 49), and using the
transfer arrow button, move it to the Key Variables: box.
9. Click the OK button. A warning message dialog box opens (see Figure 50).
10. Click the OK button to close the warning message. The finished product should look like
Figure 51.
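Matching on a key variable can be pictured as a lookup by ID. The Python sketch below is a simplification: the "May" variable and its values are hypothetical stand-ins for whatever "Merge 2" contains, and only cases present in the active file are kept, which may differ from what PASW Statistics does with unmatched cases.

```python
# A few rows from Table 5 (active file) and a hypothetical second file
merge_1 = [
    {"ID": "Alfred", "January": "Dog"},
    {"ID": "Bethel", "January": "Cat"},
    {"ID": "Chris", "January": "Cat"},
]
merge_2 = [
    {"ID": "Bethel", "May": "Red"},   # "May" is an invented variable
    {"ID": "Alfred", "May": "Blue"},
]

def add_variables(active, other, key):
    """Attach the other file's variables to active cases that share the key."""
    lookup = {row[key]: row for row in other}
    new_vars = sorted({name for row in other for name in row if name != key})
    merged = []
    # Sort by key, mirroring the "sorted files" requirement
    for row in sorted(active, key=lambda r: r[key]):
        match = lookup.get(row[key], {})
        combined = dict(row)
        for name in new_vars:
            combined[name] = match.get(name)  # None when no matching case exists
        merged.append(combined)
    return merged

merged = add_variables(merge_1, merge_2, "ID")
for row in merged:
    print(row)
```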
Background Information
1. Age: _____________________________
2. Major: ____________________________
3. G.P.A.: ___________________________
Internet Access
5. Do you have a computer at home?
1. Yes 2. No
6. Where do you surf on the Internet? (You can circle more than one option for this question.)
Questions 8 through 19 are designed to investigate the frequency and types of activities on
the Internet. These questions have a 4 point Likert-scale ranging from strongly disagree to
strongly agree. Please circle the option that best describes your activities on the Internet.
20. Are there any other Internet activities that are not included in this survey? If so, please
describe them below.
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
This handout (Regression Analysis) provides basic instructions on how to answer research
questions and test hypotheses through the use of linear regression (a technique which examines
the relationship between a dependent variable and a set of independent variables). The value of
the dependent variable (e.g., salesperson’s total annual sales) can be predicted based on its
relationship to the independent variables used in the analysis (e.g., age, education, and years of
experience). The two research questions proposed for this workshop are as follows:
1. How much will each salesperson make this year?
2. Who will qualify for a $1,000 bonus?
Simple Regression
Simple regression estimates how the value of one dependent variable (Y) can be predicted based
on the value of one independent variable (X). The linear equation for simple regression is as
follows:
Y = aX + b
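To make the roles of a (slope) and b (intercept) concrete, the least-squares estimates can be computed by hand. The following Python sketch uses hypothetical data and is an illustration of the technique, not a reproduction of the PASW Statistics procedure.

```python
def fit_line(x, y):
    """Least-squares slope a and intercept b for Y = aX + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: years of experience and last year's sales
years = [1, 2, 3, 4, 5]
sales = [2500, 4300, 6400, 8200, 10300]

a, b = fit_line(years, sales)
print(f"Y = {a:.3f} * X + {b:.3f}")
```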
Research Question # 1
How much will each salesperson make this year?
SCATTER PLOT
A scatter plot displays the nature of the relationship between two variables. It is recommended to
run a scatter plot before performing a regression analysis to determine if there is a linear
relationship between the variables. If there is no linear relationship (i.e., points on a graph are
not clustered in a straight line), there is no need to run a simple regression.
5. If necessary, select the Simple Scatter option, and then click the Define button. The
Simple Scatterplot dialog box opens.
6. Select the variable “Last year sales [lastsale]” from the list box on the left.
PASW Statistics 17 (SPSS 17), Part 3 39
7. Click the first transfer arrow button to move the variable to the Y Axis: box.
8. Select the variable “Years of experience [yearexpe]” from the list box on the left.
9. Click the second transfer arrow button to move the variable to the X Axis: box.
10. Click the OK button. The Output Viewer window opens with a scatter plot of the
variables (see Figure 55).
NOTE: A graph similar to Figure 55 will be displayed in the Output Viewer window. This scatter
plot indicates that there is a linear relationship between the variables “Last year sales” and “Years
of experience.”
The next step is to find a line that best accommodates the pattern of points in this scatter plot.
The steps on how to enhance graph appearance are included in the last section of this handout.
4. Select the variable “Years of experience [yearexpe]” from the variable list box on the
left and move it to the Independent(s): box by clicking the second transfer arrow button.
5. Click the OK button.
The following tables present the results of a simple regression. “R Square” (.918) indicates that
this model accounts for almost 92% of the total variation in the data (see Figure 58).
The slope and the y-intercept as seen in Figure 59 should be substituted in the following linear
equation to predict this year’s sales: Y = aX + b. In this case, the values of a, b, X, and Y will be
as follows:
a = 1954.658
b = 440.987
X = Years of experience (values of independent variable)
Y = Last year sales (values of dependent variable)
NOTE: The new independent variable, “yearexp2” is used instead of “yearexpe” in order to predict
this year’s sales.
3. In the Numeric Expression: box, enter the following equation by typing or selecting
from the dialog box keypad:
[1954.658 * yearexp2 + 440.987]
NOTE: It is recommended to select the variable “yearexp2” directly from the variable list box
on the left of the Compute Variable dialog box to prevent typing mistakes.
4. Click the OK button. The results will be displayed in the Simple column in Data View
(see Figure 61).
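The Compute step simply applies the fitted equation to every case. A Python sketch of the same arithmetic is shown below; the coefficients are the ones reported above, while the "yearexp2" values are hypothetical.

```python
# Coefficients reported in the regression output above
A = 1954.658  # slope
B = 440.987   # y-intercept

def predict_simple(yearexp2):
    """Predicted sales from years of experience: Y = aX + b."""
    return A * yearexp2 + B

# Hypothetical values of "yearexp2"
for years in (1, 5, 10):
    print(f"{years:>2} years -> ${predict_simple(years):,.0f}")
```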
2. Locate the variable “Simple” and click the Ellipses button under the Type column.
The Variable Type dialog box opens.
3. Select the Dollar option, and then select the $###,###,### format (12 digits width with 0
decimal places).
4. Click the OK button, and then click the Data View tab.
NOTE: The predictions of this year’s sales for each salesperson are computed under the new
variable named “Simple.”
Multiple Regression
Multiple regression estimates the coefficients of the linear equation when there is more than one
independent variable that best predicts the value of the dependent variable. For example, it is
possible to predict a salesperson’s total annual sales (the dependent variable) based on
independent variables such as age, education, and years of experience. The linear equation for
multiple regression is as follows:
Z = aX + bY + c
As indicated in the output table, the coefficient for “Years of experience” is “1874.5” and the
coefficient for “Years of education” is “609.391.”
PREDICTING THIS YEAR’S SALES WITH MULTIPLE REGRESSION MODEL
To predict this year’s sales for each salesman, the values of a, b, and c should be substituted in
the following linear equation: Z = aX + bY + c
This year sales = 1874.5 * Years of experience + 609.391 * Years of education + (-8510.838)
NOTE: The predictions of sales for each salesperson using two independent variables are listed under the
new variable named “multiple.”
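The multiple regression prediction works the same way with two predictors. The sketch below plugs the reported coefficients into Z = aX + bY + c; the input values are hypothetical.

```python
# Coefficients reported above
A = 1874.5     # years of experience
B = 609.391    # years of education
C = -8510.838  # constant

def predict_multiple(experience, education):
    """Predicted sales: Z = aX + bY + c."""
    return A * experience + B * education + C

# Hypothetical salesperson: 10 years of experience, 16 years of education
prediction = predict_multiple(10, 16)
print(f"${prediction:,.2f}")
```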
Data Transformation
Situations may arise where data transformation is useful. Most data transformations can be done
with the Compute… command. Using this command, the data file can be manipulated to fit
various statistical procedures.
Research Question # 2
Who will earn a $1,000 bonus?
COMPUTING
Since each person’s yearly sales were already predicted, those whose actual sales were at least $2,000
above the predicted values, obtained via multiple regression analysis, will receive $1,000 as a
bonus. Using the Compute… command, those salespeople who met the criteria can be easily
located by comparing the values of this year’s actual sales with the predictions from multiple
regression analysis computed in the previous lesson.
The first step in predicting who will receive a bonus is to calculate the difference between this
year’s actual sales and the prediction of this year’s sales from the multiple regression analysis.
6. Click the If… button. The Compute Variable: If Cases dialog box opens (see Figure 71).
7. Select the Include if case satisfies condition: option.
8. Enter the following expression by typing or selecting from the dialog box keypad:
[thissale - multiple >= 2000]
NOTE: It is recommended that you select the variables and the >= sign directly from the variable
list box and keypad provided in the dialog box to prevent mistakes.
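The logic of the condition entered in step 8 can be illustrated outside PASW Statistics. The Python sketch below mirrors the expression thissale - multiple >= 2000; the function name earns_bonus and the three salespeople with their sales figures are hypothetical and are not taken from the data file.

```python
BONUS_THRESHOLD = 2000  # dollars by which actual sales must exceed the prediction


def earns_bonus(thissale: float, multiple: float) -> bool:
    """Mirror the condition thissale - multiple >= 2000 (hypothetical helper)."""
    return thissale - multiple >= BONUS_THRESHOLD


# Hypothetical (actual sales, predicted sales) pairs; not from the data file.
salespeople = {
    "A": (25000.0, 22500.0),   # 2500 above the prediction
    "B": (18000.0, 17500.0),   # 500 above the prediction
    "C": (30000.0, 28000.0),   # exactly 2000 above the prediction
}

bonus_winners = [name for name, (actual, predicted) in salespeople.items()
                 if earns_bonus(actual, predicted)]
print(bonus_winners)
```

Because the expression uses >=, a salesperson who exceeds the prediction by exactly $2,000 is included.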
Polynomial Regression
This type of regression involves fitting a dependent variable (Yi) to a polynomial function of a
single independent variable (Xi). The regression model is as follows (see Table 6 for the meaning
of the variables):
Y_i = a + b_1 X_i + b_2 X_i^2 + b_3 X_i^3 + … + b_k X_i^k + e_i
Table 6 - Breakdown of the Variables
Variable    Meaning
a           Constant
b_j         The coefficient for the independent variable to the j-th power
e_i         Random error term
REGRESSION ANALYSIS
To look at the growth relationship between weight and age:
1. Open the “Growth.sav” file.
2. Click the Analyze menu, point to Regression, and select Curve Estimation…. The
Curve Estimation dialog box opens to define the parameters of the analysis (see Figure
73).
3. Transfer the “wght” variable to the Dependent(s): box and the “age” variable to the
Independent Variable: box.
NOTE: The weight (dependent) variable is what is being predicted using the age (independent)
variable.
4. Deselect the Plot models check box.
5. Select the Display ANOVA table check box.
6. Under Models, deselect the Linear check box and select the Cubic check box.
7. Click the OK button.
Multiple regression can be used to fit polynomials of higher order. If X is the independent variable,
use the Transform and Compute options of the Data Editor (as discussed earlier in this lesson)
to create new variables X2 = X*X, X3 = X*X2, X4 = X*X3, etc., then use these new variables
(X, X2, X3, X4, etc.) as a set of independent variables for a multiple regression analysis.
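The construction of the power variables can be sketched as follows. This is an illustrative Python example, not a PASW Statistics procedure; the function name power_columns and the sample ages are assumptions. Each new column is built from the previous one, exactly as in X3 = X*X2.

```python
def power_columns(x_values, max_power):
    """Return a dict with columns X, X2 = X*X, X3 = X*X2, ... up to max_power."""
    columns = {"X": list(x_values)}
    previous = columns["X"]
    for power in range(2, max_power + 1):
        # Multiply the original X by the previous power, element by element.
        current = [x * p for x, p in zip(columns["X"], previous)]
        columns[f"X{power}"] = current
        previous = current
    return columns


# Hypothetical values of the independent variable (for example, ages).
ages = [1.0, 2.0, 3.0]
columns = power_columns(ages, 4)
for name, values in columns.items():
    print(name, values)
```

The resulting columns would then be supplied together as the set of independent variables for the multiple regression.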
Chart Editing
During the final stage of research, enhancing the appearance of charts and figures can help
readers understand statistics that might otherwise seem confusing. Editing a chart in place saves the
time and effort of copying and pasting an object from one program to another to modify its
features. The following steps explain some useful methods to enhance the appearance of a chart.
10. Click the Show Grid Lines button on the Standard toolbar to show the Properties
dialog box.
11. Select the Grid Lines tab, select the Major ticks only option, click the Apply button, and
then click the Close button.
12. Click the Select the Y axis button on the Standard toolbar to manipulate the Y-axis.
The Properties dialog box opens.
13. Select the Scale tab.
Figure 82 - Before Manipulating the X-axis
Figure 83 - After Manipulating the X-axis
This handout (Chi-Square and ANOVA) introduces basic skills for performing hypothesis tests
using the Chi-Square test for Goodness-of-Fit and generalized pooled t tests, such as ANOVA.
The step-by-step instructions guide the user in performing “tests of significance” using
PASW Statistics and help the user understand how to interpret the output for research questions.
Chi-Square
The Chi-Square (χ²) test is a statistical tool used to examine differences between nominal or
categorical variables. The Chi-Square test is used in two similar but distinct circumstances:
• To estimate how closely an observed distribution matches an expected distribution, also
known as the Goodness-of-Fit test.
• To determine whether two random variables are independent.
Research Question # 1
Can the hospital schedule discharge support staff evenly throughout the week?
A large hospital schedules discharge support staff assuming that patients leave the hospital at a
fairly constant rate throughout the week. However, because of increasing complaints of staff
shortages, the hospital administration wants to determine whether the number of discharges
varies by the day of the week.
Before the Chi-Square test is run, the observed values need to be declared.
2. Select the “Day of the Week [dow]” variable and transfer it to the Test Variable List: box
(see Figure 89).
3. Click the OK button. The Output Viewer window opens (see Figure 90).
Explanation: Figure 91 indicates that the calculated χ² statistic, for six degrees of freedom, is
29.389. Additionally, it indicates that the significance value (displayed as 0.000, that is, p < 0.001) is
less than the usual threshold value of 0.05. This suggests that the null hypothesis, H0 (patients leave the hospital at a
constant rate), can be rejected in favor of the alternate hypothesis, H1 (patients leave the hospital
at different rates during the week).
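The statistic reported in the output can be understood through a short calculation. The Python sketch below is an illustration under stated assumptions: the daily discharge counts are hypothetical (they are not the values in the data file), the function name chi_square_gof is invented for this example, and 12.592 is the standard tabulated χ² critical value for six degrees of freedom at the 0.05 level.

```python
def chi_square_gof(observed):
    """Return (statistic, degrees of freedom) against equal expected counts."""
    expected = sum(observed) / len(observed)   # same expected count for every category
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, len(observed) - 1


CRITICAL_VALUE_DF6 = 12.592  # tabulated chi-square value, df = 6, alpha = 0.05

# Hypothetical discharge counts for Sunday through Saturday.
discharges = [40, 70, 70, 70, 70, 100, 70]

statistic, df = chi_square_gof(discharges)
reject_h0 = statistic > CRITICAL_VALUE_DF6
print(round(statistic, 3), df, reject_h0)
```

PASW Statistics reports a significance value instead of comparing against a critical value, but the two approaches lead to the same decision at the 0.05 level.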
Research Question # 2
The hospital requests a follow-up analysis: can staff be scheduled assuming that patients
discharged on weekdays only (Monday through Friday) leave at a constant daily rate?
H0: Patients discharged on weekdays only (Monday through Friday) leave at a constant daily
rate.
NOTE: The expected values are equal to the sum of the observed values divided by the number of
rows, while the observed values are the actual numbers of patients discharged.
Using the Chi-Square test procedure, it was determined that the rate at which patients were
discharged from the hospital was not constant over the course of an average week. This was
primarily due to a greater number of discharges on Fridays and fewer discharges on Sundays.
When the range of the test was restricted to weekdays, the discharge rates appeared to be more
uniform. Staff shortages could be corrected by adopting separate weekday and weekend staff
schedules.
Research Question # 3
Does first-class mailing provide quicker response time than bulk mail?
A manufacturer tries first-class postage for direct mailings, hoping for faster responses than with
bulk mail. Order takers record how many weeks each order takes after mailing.
H0: First-class and bulk mailings do not result in different customer response times.
Before the Chi-Square test is run, the cases must be weighted. Because this example compares
two different methods, one method must be selected to provide the expected values for the test
and the other will provide the observed values.
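One way to picture how one method supplies the expected values is sketched below in Python. This is an assumption about the mechanics, not a description of what PASW Statistics does internally: the reference method's counts are scaled so that they sum to the observed method's total. The function names and all counts are hypothetical, and only four response weeks are used to keep the example short.

```python
def expected_from_reference(reference_counts, observed_total):
    """Scale the reference distribution so that it sums to observed_total."""
    reference_total = sum(reference_counts)
    return [observed_total * count / reference_total for count in reference_counts]


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical numbers of orders received in weeks 1 to 4 after mailing.
bulk = [20, 40, 30, 10]          # reference method (supplies expected values)
first_class = [60, 80, 40, 20]   # method under test (supplies observed values)

expected = expected_from_reference(bulk, sum(first_class))
statistic = chi_square(first_class, expected)
degrees_of_freedom = len(first_class) - 1
print(expected, round(statistic, 3), degrees_of_freedom)
```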
Explanation: The manufacturer hoped that first-class mail would result in quicker customer
response. As indicated in Figure 94, the responses for the first two weeks differed between the two
methods by four and seven percentage points, respectively. The question was whether the overall differences
between the two distributions were statistically significant.
The Chi-Square statistic was calculated to be 12.249 at eleven degrees of freedom (see Figure
95). The significance value (p) associated with the data was 0.345, which was greater than the
threshold value of 0.05. Hence, H0 was not rejected because there was no significant difference
between first-class and bulk mailings. The first-class mail promotion did not result in response
times that were statistically different from standard bulk mail. Therefore, bulk postage was more
economical for direct mailings.
Research Question # 4
Which of the alloys tested would be appropriate for creating an underwater sensor array?
H0: The four alloys exhibit the same kind of behavior and are not different from one another.
2. In Data View, click the Analyze menu, point to Compare Means, and select One-Way
ANOVA…. The One-Way ANOVA dialog box opens (Figure 97).
3. Select the “pits” variable from the box on the left and transfer it to the Dependent List:
box (see Figure 97).
4. Select the “Alloy [alloy]” variable from the box on the left and transfer it to the Factor:
box (see Figure 97).
5. Click the Options… button. The One-Way ANOVA: Options dialog box opens (see
Figure 98).
6. Select the Descriptive, Homogeneity of variance test, and Means plot check boxes.
7. Click the Continue button.
8. Click the OK button. The Output Viewer window opens.
Explanation: Figure 99 lists the means, standard deviations, and individual sample sizes of each
alloy. Figure 100 provides the degrees of freedom and the significance level of the population;
“df1” is one less than the number of sample alloys (4-1=3) and “df2” is the difference between
the total sample size and the number of sample alloys (20-4=16). Figure 101 lists the sum of the
squares of the differences between means of different alloy populations and their mean square
errors. In Figure 101, the “Between Groups” variation “6026.200” is due to differences among
the sample means of the groups. If sample means are close to each other, this value is small. The
“Within Groups” variation “335.600” is due to differences within individual samples. The
“Mean Square” values are calculated by dividing each “Sum of Squares” value by its respective
degree of freedom (“df”). The table also lists the F statistic “95.768,” which is calculated by
dividing the “Between Groups Mean Square” by the “Within Groups Mean Square.” The
significance level of “0.000” is less than the threshold value of 0.05 and indicates that the null
hypothesis can be rejected, leading to the conclusion that the alloys are not all the same.
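The arithmetic behind the ANOVA table can be reproduced from the figures quoted above. The following Python sketch uses the sums of squares and degrees of freedom from the explanation; the function name anova_summary is an assumption made for illustration.

```python
def anova_summary(ss_between, df_between, ss_within, df_within):
    """Return (mean square between, mean square within, F statistic)."""
    ms_between = ss_between / df_between   # Sum of Squares divided by its df
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


NUMBER_OF_ALLOYS = 4
TOTAL_SAMPLE_SIZE = 20

ms_between, ms_within, f_statistic = anova_summary(
    6026.200, NUMBER_OF_ALLOYS - 1,                  # Between Groups, df = 3
    335.600, TOTAL_SAMPLE_SIZE - NUMBER_OF_ALLOYS,   # Within Groups, df = 16
)
print(round(ms_between, 3), round(ms_within, 3), round(f_statistic, 3))
```

Rounded to three decimal places, the ratio of the two mean squares agrees with the F statistic of 95.768 listed in the table.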
Research Question # 5
Is the mean difference between alloy sets statistically significant?
The previous null hypothesis was rejected, leading to the conclusion that the alloys do not all
exhibit the same behavior. The next part of the analysis is to determine whether the mean difference
between individual alloy sets is statistically significant.
H0: μ_0 = μ_1 = … = μ_a
H1: μ_i ≠ μ_j for at least one pair of alloys i and j
2. Click the Post Hoc… button. The One-Way ANOVA: Post Hoc Multiple Comparisons
dialog box opens (see Figure 103).
Figure 103 - One-Way ANOVA: Post Hoc Multiple Comparisons Dialog Box
Explanation: Figure 104 shows the results of comparing pairs of means between different alloy
sets. Each row indicates the difference between the two corresponding treatments. Alloys “1”
and “4” have a mean difference of “2.4” (a relatively small value). Also, the significance level of
“0.420” indicates that the null hypothesis cannot be rejected for the comparison of alloys “1” and
“4.”
There is no statistically significant difference between them. Alloy pairs “1” and “2,” “1” and
“3,” “2” and “3,” “2” and “4,” and “3” and “4” have large mean differences with significance
values of “0.000.” In these cases, the null hypothesis can be rejected, leading to the conclusion
that they are statistically different. Also, the means plot
shows that alloys “1” and “4” have mean numbers of pits very close to each other.
Because alloys “1” and “4” have the lowest mean number of corrosion pits, they are the best
candidates for the array. Depending on the relative costs of the two alloys, the one that is more
cost effective can be selected to construct the array.
Research Question # 6
Will typing ability and test method affect student test scores?
To answer the question, an essay final is given to the class. Two test methods are used – half the
students are assigned to write the final with a blue-book and the other half with notebook
computers. In addition, the students are partitioned into three groups, namely: no typing ability,
some typing ability, and highly skilled at typing. After evaluating the final, the mean score of
each group is examined.
H0: Typing ability and test method do not affect student test scores.
H1: Typing ability and test method do affect student test scores.
2. In Data View, click the Analyze menu, point to General Linear Model, and select
Univariate… (see Figure 107). The Univariate dialog box opens (see Figure 108).
3. Select the “SCORE” variable from the box on the left and transfer it to the Dependent
Variable: box.
4. Select the “ABILITY” and “METHOD” variables from the box on the left and transfer
them to the Fixed Factor(s): box (see Figure 108).
5. Click the Options… button. The Univariate: Options dialog box opens (see Figure 109).
6. Select the Descriptive statistics check box.
7. Click the Continue button.
8. Click the OK button. The Output Viewer window opens (see Figure 110 and Figure
111).
Explanation: Figure 110 lists the means and standard deviations for the three typing-ability groups
under the two test methods. Students who have “some typing ability” and use the “computer” method achieve the
highest mean score (mean=36.67). As indicated in Figure 111, because the significance value of
“Method” (0.901) is greater than the threshold value (0.05), it can be concluded that the “Method”
factor alone does not affect test scores. The significance values of “Ability” (0.033) and the
interaction between the two factors “Ability*Method” (0.047) are less than the threshold value
(0.05), leading to the conclusion that “Ability” and the combination of “Ability” and “Method”
(“Ability*Method”) do affect student test scores.
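The decision rule applied to each row of the table can be summarized in a few lines. The Python sketch below uses the three significance values quoted in the explanation; the dictionary layout and variable names are assumptions for illustration only.

```python
ALPHA = 0.05  # threshold value used throughout this handout

# Significance values quoted in the explanation of Figure 111.
significance = {
    "Ability": 0.033,
    "Method": 0.901,
    "Ability*Method": 0.047,
}

# An effect is treated as significant when its value is below the threshold.
affects_scores = {effect: p < ALPHA for effect, p in significance.items()}

for effect, is_significant in affects_scores.items():
    verdict = "affects" if is_significant else "does not affect"
    print(f"{effect} {verdict} test scores")
```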
4. Click the OK button. PASW Statistics will process and read the Excel file and convert all
first row column headings into variables using the best approximation for the variable
attributes (see Figure 114 and Figure 115).
The reverse situation may also arise, where data in a PASW Statistics file must be analyzed
using Excel. This can be accomplished by exporting the contents of the Data Editor into an
Excel spreadsheet.
Figure 121 - Options Dialog Box
Figure 122 - PASW Statistics Syntax Editor Window
8. Move the “gender” variable to the Row(s): box and the “method” variable to the
Column(s): box.
9. Click the Paste button. The Crosstabs dialog box closes and the command is pasted in
the PASW Statistics Syntax Editor window (see Figure 123). The first question in
Table 7 has been entered into the script file.
NOTE: Scripts for each of the remaining analytical techniques would be entered into the script
file by using the Paste button in each dialog box after the parameters were set.
Figure 123 - PASW Statistics Syntax Editor Window
Figure 124 - Save Syntax As Dialog Box
10. Save the script file by clicking the File menu in the PASW Statistics Syntax Editor
window and selecting Save As…. The Save Syntax As dialog box opens.
11. Enter the location and name for the file and click the Save button.
PASW Statistics provides several options when running a script file. PASW Statistics script files
have the “.sps” file extension. The Run menu of the PASW Statistics Syntax Editor contains
commands for All, Selection, Current, and To End.
Figure 125 - File Menu When Selecting Syntax
Figure 126 - Run (Syntax) Menu