10 Data Preparation
Business Research
Data Preparation
Coding
• Preparing the codebook involves deciding (and documenting) how you
will go about:
• defining and labelling each of the variables
• assigning numbers to each of the possible responses.
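A codebook can also be written down in machine-readable form. The sketch below is only an illustration: the variable names, labels, codes, and the missing-value code 9 are hypothetical and are not taken from the Restaurant Preference questionnaire.

```python
# Hypothetical codebook: names, labels and codes are illustrative only.
codebook = {
    "gender": {
        "label": "Gender of respondent",
        "values": {1: "Male", 2: "Female"},
        "missing": 9,
    },
    "satisf": {
        "label": "Overall satisfaction",
        "values": {
            1: "Very dissatisfied",
            2: "Dissatisfied",
            3: "Neutral",
            4: "Satisfied",
            5: "Very satisfied",
        },
        "missing": 9,
    },
}


def encode(variable, answer):
    """Return the numeric code for a text answer, or the missing-value code."""
    entry = codebook[variable]
    for code, text in entry["values"].items():
        if text.lower() == answer.strip().lower():
            return code
    return entry["missing"]
```

Keeping the label, the value codes, and the missing-value code together in one place mirrors what is later entered in SPSS as Variable labels and Value labels.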
Restaurant Preference.sav
Starting SPSS
Different ways
1. Double-click the SPSS icon
2. Open an existing data file
3. Import from Excel: first set up the data file in Excel to correspond to
SPSS
1. File, Open, Data, from the SPSS menu.
2. Select the type of file you want to open: Excel (*.xls, *.xlsx, *.xlsm).
3. Select file name.
4. Click 'Read variable names' if the first row of the spreadsheet contains
column headings.
5. Click Open.
http://wps.prenhall.com/bp_malhotra_mr_6/127/32612/8348719.cw/index.html
Steps Involved in Preparing Data Files
• Step 1. Check and modify, where necessary, the ‘Options’ that SPSS
uses to display the data and the output that is produced.
• Step 2. Set up the structure of the data file by ‘defining’ the variables.
• Step 3. Enter the data—that is, the values obtained from each
participant or respondent for each variable.
Step 1
Edit Options
Step 2 and 3: Excel File Conversion
In Excel sheet:
• All data should be numeric
• Variable names should be in first row
• Variable names should be compatible with SPSS, i.e. one word and
only those symbols permitted by SPSS, such as _ (spaces and hyphens are not permitted)
• No formulae or functions
• E.g. file for practice: Dell Direct
• File → Open → Data → choose the Excel file type in the drop-down menu for files
• Define Variable labels, Value labels and the type of Measure
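Before importing, the Excel headings can be checked against the naming rules. The sketch below uses a simplified and deliberately strict version of the SPSS rules as I understand them (start with a letter, then letters, digits, and the symbols _ . @ # $, at most 64 characters, no reserved keywords, no trailing period); treat the exact rule set as an assumption and check the SPSS documentation for your version. The example headings are hypothetical.

```python
import re

# Reserved keywords that cannot be used as variable names (assumed list).
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# Simplified rule: a letter first, then letters, digits, or _ . @ # $
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.@#$]*$")


def is_valid_name(name):
    """Return True if the heading looks like an acceptable variable name."""
    if len(name) > 64 or name.endswith("."):
        return False
    if name.upper() in RESERVED:
        return False
    return bool(NAME_PATTERN.match(name))


def invalid_headings(headings):
    """Return the headings that should be renamed before importing."""
    return [h for h in headings if not is_valid_name(h)]
```

For example, `invalid_headings(["id", "q2_1", "Usage 3G", "Usage-3G"])` returns the two headings containing a space or a hyphen.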
Entering Data into SPSS
Part 1: https://www.youtube.com/watch?v=Kp_js1i6xwE
Part 2: https://www.youtube.com/watch?v=I7YI-o6KWzk
Part 3: https://www.youtube.com/watch?v=WADaJOeuc8I
Part 4: https://www.youtube.com/watch?v=s5TJBxVJ4OY
Converting String Variables to Numeric
• https://www.youtube.com/watch?v=GFBsCg_idPo
Exercise: Dell Direct.sav
• Have your codebook ready
• Ensure all headings in Excel sheet are according to SPSS labeling rules
– Excel headings transfer to SPSS as Variable names
• Transfer Excel sheet ‘Dell Direct’
• Complete the information in SPSS according to the questionnaire
1. Reverse coding where needed
2. Adding up the scores from the items that make up each scale to give an
overall score
• Then drag the cell down to copy the formula and get results for all data columns
• Conditional formatting → Highlight cell rules → Greater than (enter 0)
• If a respondent's missing data is substantial, say more than 20%, it is best to
delete that respondent's data.
• Add or delete data in SPSS, if any
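The reverse coding, scale totals, and the 20% missing-data rule can be sketched as follows. This is a minimal illustration, assuming items scored 1 to 5 and missing answers stored as `None`; the item names `a1` to `a4` are hypothetical.

```python
def reverse_code(score, low=1, high=5):
    """Reverse a score on a low..high scale (1 becomes 5, 2 becomes 4, ...)."""
    return low + high - score


def scale_total(row, items, reversed_items=(), low=1, high=5):
    """Add up the item scores of one respondent, reverse coding where needed.

    Returns None if any item of the scale is missing.
    """
    total = 0
    for item in items:
        value = row[item]
        if value is None:
            return None
        if item in reversed_items:
            value = reverse_code(value, low, high)
        total += value
    return total


def share_missing(row):
    """Proportion of a respondent's answers that are missing."""
    return sum(value is None for value in row.values()) / len(row)


# Hypothetical respondents: a2 is a negatively worded (reversed) item.
complete = {"a1": 4, "a2": 2, "a3": 5}
sparse = {"a1": None, "a2": 2, "a3": 5, "a4": 1}

total = scale_total(complete, ["a1", "a2", "a3"], reversed_items={"a2"})
drop_sparse = share_missing(sparse) > 0.20  # more than 20% missing
```

Here the reversed item contributes 4 instead of 2, so the total is 13, and the second respondent has 25% missing answers and would be a candidate for deletion.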
Data Imputation
Replacing missing value with mean:
• Mean substitution is attractive because it is a conservative estimate
since the mean of the scores on the variable does not change.
• However, the variance of the variable is reduced, because each missing
value is replaced with the distribution mean rather than with the
(probably different) score that was actually missing.
• Correlations with other variables are also lowered because of the
reduced variance.
• Mean substitution is less frequently used because other more
accurate methods of imputation have been developed.
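A small numeric sketch of the two effects described above, using made-up scores: after mean substitution the mean is unchanged but the variance shrinks.

```python
from statistics import mean, variance

# Hypothetical observed scores; two further respondents have missing values.
observed = [2, 4, 4, 5, 7]
n_missing = 2

# Mean substitution: every missing value is replaced by the observed mean.
imputed = observed + [mean(observed)] * n_missing

mean_before, mean_after = mean(observed), mean(imputed)
var_before, var_after = variance(observed), variance(imputed)
```

The imputed values add nothing to the sum of squared deviations, but they do increase the sample size, so the sample variance can only go down.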
Data Imputation
Regression analysis:
• Regression analysis takes completed case values from a data set and
generates a regression equation to predict the missing values.
• This approach may be more sophisticated than mean substitution, but it
has limitations.
• If the variables used to predict the missing values are not good
predictors of the missing values, then the outcome is not optimal.
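The idea can be sketched with a single predictor and ordinary least squares. This is a simplified illustration, not the SPSS procedure: the variable names `age` and `experience` and the data are hypothetical, and missing values are stored as `None`.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def regression_impute(rows, predictor, target):
    """Fill missing target values using a line fitted on the complete cases."""
    complete = [r for r in rows
                if r[predictor] is not None and r[target] is not None]
    slope, intercept = fit_line([r[predictor] for r in complete],
                                [r[target] for r in complete])
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the original data stays untouched
        if r[target] is None and r[predictor] is not None:
            r[target] = intercept + slope * r[predictor]
        filled.append(r)
    return filled


rows = [
    {"age": 25, "experience": 2},
    {"age": 30, "experience": 7},
    {"age": 35, "experience": 12},
    {"age": 40, "experience": None},  # value to be imputed
]
filled = regression_impute(rows, "age", "experience")
```

If the predictor is only weakly related to the target, the fitted line explains little and the imputed values are correspondingly poor, which is the limitation noted above.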
Data Imputation
Expectation maximization (EM)
• Expectation maximization (EM) involves creating a distribution of partially
missing data and making inferences about missing data under the likelihood
of that created distribution
• Then drag down the cell to get results for all data columns
• A very small SD means that the respondent has given similar responses to all questions; an SD of 0
means that the respondent has chosen the same response throughout (e.g. all 3s or all 4s). Delete these cases.
• Highlight small values: Conditional formatting → Highlight cell rules → Less than
• You may enter ‘0.5’. Two rows will be deleted
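The same check can be sketched outside Excel: compute the standard deviation of each respondent's answers and flag those below the 0.5 cutoff mentioned above. The respondent ids and answers are made up for illustration.

```python
from statistics import stdev


def flag_low_variation(responses, threshold=0.5):
    """Return ids of respondents whose answers barely vary (sample SD below threshold)."""
    return [rid for rid, answers in responses.items()
            if stdev(answers) < threshold]


# Hypothetical answers to five Likert items, keyed by respondent id.
responses = {
    1: [3, 3, 3, 3, 3],   # same answer throughout, SD = 0
    2: [1, 4, 2, 5, 3],   # varied answers
    3: [4, 4, 4, 4, 3],   # almost no variation
}
flagged = flag_low_variation(responses)
```

`stdev` is the sample standard deviation, which corresponds to Excel's `STDEV` function.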
Check Outliers on Continuous Variables
• Analyze → Descriptive Statistics → Explore
• In ‘Statistics’ check ‘Outliers’. Continue
• In ‘Plots’ check ‘Histogram’ and ‘Normality plots with tests’. Continue
• ‘Age’ and ‘Experience’ are continuous variables. Add these to Dependent list. OK
• Look for extreme values to check if any odd numbers appear, such as ages below the
specified range
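The Extreme Values table lists the highest and lowest cases of each variable. A rough equivalent, plus a check against an allowed range, might look like this; the ages and the 18 to 65 range are hypothetical.

```python
def extreme_values(values, k=5):
    """Return the k lowest and the k highest values (highest first)."""
    ordered = sorted(values)
    return ordered[:k], ordered[-k:][::-1]


def out_of_range(values, low, high):
    """Return the values that fall outside the range allowed by the study."""
    return [v for v in values if v < low or v > high]


# Hypothetical ages; the study is assumed to cover respondents aged 18 to 65.
ages = [34, 29, 41, 17, 52, 38, 95, 45]

lowest, highest = extreme_values(ages, k=2)
suspicious = out_of_range(ages, low=18, high=65)
```

Values flagged this way should be checked against the original questionnaire before deciding whether they are data-entry errors.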
Check Outliers on Continuous Variables + Normality Check
• Analyze → Descriptive Statistics → Explore
• In ‘Statistics’ check ‘Outliers’. Continue
• In ‘Plots’ check ‘Histogram’ and ‘Normality plots with tests’. Continue
• ‘Age’ and ‘Experience’ are continuous variables. Add these to Dependent list. OK
• Test says distributions are not normal
• However, perform other tests for normality
• Distributions are fairly normal
[Figure panels: Age, Experience]
3. Check Outliers on Interval Variables
• Analyze → Descriptive Statistics → Frequencies
• Add all variables except id to variable list. In ‘Statistics’ check ‘Skewness’ and ‘Kurtosis’.
Continue. OK.
• The first table ‘Statistics’ shows missing values. Amos does not run with missing values, so
impute any that are found.
• Look for absolute values > 1 in Skewness and Kurtosis
• Calculate z scores for skewness and kurtosis: z = value / standard error
• Rule of thumb: absolute z values > 3.3 (3.29 to be exact) are problematic (recall ± 3 σ)
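The z calculation can be sketched as below. The skewness statistic and the standard-error formulas are the usual sample-size-based ones, which I assume match what SPSS prints; verify against your own output before relying on them.

```python
from math import sqrt
from statistics import mean, stdev


def skewness(values):
    """Bias-adjusted sample skewness."""
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def se_skewness(n):
    """Standard error of skewness; depends only on the sample size."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of kurtosis; depends only on the sample size."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def is_problematic(statistic, standard_error, cutoff=3.29):
    """Apply the rule of thumb: |statistic / standard error| > 3.29."""
    return abs(statistic / standard_error) > cutoff
```

For a sample of 100, a skewness of 1.2 gives `is_problematic(1.2, se_skewness(100))` equal to `True`, while a skewness of 0.5 does not exceed the cutoff.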
Approaches for Minimizing the Effects of Outliers
• Less desirable methods are to delete the case or variable with the
outlier(s) based on an assessment of whether the case is
representative of the population that the sample was drawn from.
• Using trimmed means by discarding 5 percent of the largest scores
and doing the same for 5 percent of the smallest scores.
• Conducting a data transformation of the original raw scores reduces
the influence of extreme scores by bringing the outliers closer to the
majority of scores in the distribution.
• Commonly used procedure.
• Raw scores are converted to z-scores or T-scores.
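The 5 percent trimmed mean from the list above can be sketched in a few lines. The scores are invented so that a single extreme value is visible.

```python
from statistics import mean


def trimmed_mean(values, proportion=0.05):
    """Mean after discarding the given proportion of scores at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of scores dropped per tail
    return mean(ordered[k:len(ordered) - k])


# Hypothetical scores: 1 to 19 plus one extreme value of 100.
scores = list(range(1, 20)) + [100]

raw_mean = mean(scores)
robust_mean = trimmed_mean(scores)
```

With 20 scores, one score is dropped from each tail, so the extreme value of 100 no longer pulls the mean upwards.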
Transforming Data to Bring Closer to Normality
• For positively skewed distributions, a square root or a log10
transformation can be used to try to normalize them.
• For negatively skewed distributions, reflecting the negative
distribution (reversing it to positive) and then using a square root or a
log transformation may normalize the distribution.
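These transformations can be sketched as follows. Reflection is implemented here as (maximum + 1) minus the score, a common convention that keeps all reflected scores at 1 or above so that a log can be taken; that choice is an assumption, not something stated above.

```python
from math import log10, sqrt


def log_transform(values):
    """log10 of each score; every score must be greater than 0."""
    return [log10(v) for v in values]


def sqrt_transform(values):
    """Square root of each score; every score must be 0 or greater."""
    return [sqrt(v) for v in values]


def reflect(values):
    """Reverse a negatively skewed variable so that its skew becomes positive."""
    top = max(values) + 1
    return [top - v for v in values]


# Hypothetical negatively skewed scores: reflect first, then transform.
scores = [1, 4, 5, 5]
reflected = reflect(scores)
transformed = sqrt_transform(reflected)
```

After reflection, high transformed values correspond to low original scores, which has to be kept in mind when interpreting results.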
Transforming Data to Bring Closer to Normality
1. Select: Transform → Compute Variable
2. Under Target Variable, type a name for the transformed variable, for example the
original name prefixed with L (for Logarithm). This command will create a new column
of transformed scores in the Data View, so the prefix makes clear what each column
of scores represents.
3. Under Function group, click on All. Then, under Functions and Special
Variables, scroll down until you find LG10, click on it, and then click on the
upward-pointing arrow button to its left. This places LG10(?) under Numeric Expression.
4. Next, go to the list of variables under Type and Label, click on the variable to be
transformed, and then click on the arrow to the right of the box of variables. It
will show up in the place under Numeric Expression where the ? was. So, the
Numeric Expression should look like LG10(variablename).
5. Then, click OK and the log10 transformed variable will be on your Data View
spreadsheet.
6. Re-run Analyze → Descriptive Statistics → Explore on the new variable to check the transformation
Checking for Outliers
https://www.youtube.com/watch?v=WSflSmcNRFI
https://www.youtube.com/watch?v=qQqF6HZo0Gc
Working with Data (Dell data)
• Sort Cases: Data → Sort Cases → q14
• Form a new variable that denotes the total number of things that people have
ever done online based on q2_1 to q2_7. Run a frequency distribution of the
new variable and interpret the results. Note the missing values for q2_1 to
q2_7 are coded as 0.
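The new variable and its frequency distribution can be sketched as below. The variable names q2_1 to q2_7 and the 0 code for missing come from the exercise; the assumption that 1 means "has done this online", and the three example respondents, are mine.

```python
from collections import Counter

ITEMS = [f"q2_{i}" for i in range(1, 8)]  # q2_1 ... q2_7


def total_online(row):
    """Count the q2 items coded 1; missing values (coded 0) add nothing."""
    return sum(1 for item in ITEMS if row.get(item, 0) == 1)


# Hypothetical respondents (1 = has done this online, 0 = missing).
respondents = [
    {"q2_1": 1, "q2_2": 1, "q2_3": 0, "q2_4": 1, "q2_5": 0, "q2_6": 0, "q2_7": 0},
    {"q2_1": 1, "q2_2": 0, "q2_3": 0, "q2_4": 0, "q2_5": 0, "q2_6": 0, "q2_7": 0},
    {"q2_1": 1, "q2_2": 1, "q2_3": 1, "q2_4": 0, "q2_5": 0, "q2_6": 0, "q2_7": 0},
]

totals = [total_online(r) for r in respondents]
frequencies = Counter(totals)  # total -> number of respondents
```

In SPSS the same variable would be created with Transform → Compute Variable, followed by Analyze → Descriptive Statistics → Frequencies on the new variable.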
Calculating Total Scores
• Make sure any reverse coding needed has been performed.
• Calculating z scores (standardized variable)
• Analyze → Descriptive Statistics → Descriptives → select ‘Save standardized
values as variables’
• Comparing standardized (Z) and unstandardized (sum/total) scores
• Analyze → Correlate → Bivariate
• If the correlation is at least .9, then both can be used interchangeably
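The comparison can be sketched as follows: standardize each item, add up the raw and the standardized items per respondent, and correlate the two totals. The two items and their scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(values):
    """Standardize scores: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson_r(xs, ys):
    """Pearson correlation between two lists of equal length."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical items measured on different scales (1 to 5 and 1 to 7).
item_a = [1, 2, 3, 4, 5]
item_b = [2, 3, 5, 6, 7]

raw_total = [a + b for a, b in zip(item_a, item_b)]
z_total = [a + b for a, b in zip(z_scores(item_a), z_scores(item_b))]

r = pearson_r(raw_total, z_total)
interchangeable = r >= 0.9
```

Standardizing matters most when the items are measured on different scales; when the two totals correlate at .9 or above, the choice makes little practical difference.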
Scale Reliability
• Part 1: https://www.youtube.com/watch?v=2gHvHm2SE5s
• Part 2: https://www.youtube.com/watch?v=9rS49o1rdnk
Improving Scale Reliability
• https://www.youtube.com/watch?v=xVl6Fg2A9GA
Graph Steering – 3D
• Graphs → Legacy Dialogs → 3-D Bar
• X will show properties of variable
• Y will show properties of graph
Example:
1. Graphs → Legacy Dialogs → 3-D Bar
2. In ‘X’ axis, select ‘Groups of cases’; in ‘Z’ axis, select ‘Groups of cases’
3. Click ‘Define’
4. In ‘X’ category, add ‘Gender’; in ‘Z’ category, add ‘Usage_3G’
5. In ‘Stack/Cluster by’, add ‘q4: Overall how satisfied are you with your Dell
computer system?’
6. Click ‘OK’