10 Data Preparation

The document outlines the process of data preparation and coding for business research, including the creation of a codebook that defines variables and their corresponding codes. It details the steps for entering data into SPSS, checking for errors, handling missing data, and transforming variables to ensure accurate analysis. Additionally, it provides guidance on identifying and dealing with outliers and unengaged responses to maintain data integrity.
Advanced and Applied Business Research
Data Preparation
Coding
• Preparing the codebook involves deciding (and documenting) how you
will go about:
• defining and labelling each of the variables
• assigning numbers to each of the possible responses.

• Code each reply option in each question. For example:
• Categorical scale: Gender: Male = 1; Female = 2
• Likert scale: Strongly disagree = 1; Disagree = 2; Neither disagree nor agree = 3; Agree = 4; Strongly agree = 5
• Semantic differential scale (to be reverse-coded):
Happy __1__ __2__ __3__ __4__ __5__ __6__ __7__ Sad
• Develop a code book. It should list all the variables in the
questionnaire, their abbreviated names used in SPSS, codes, and
coding instructions
Coding
• Develop a code book. Example:
SPSS Name | Variable | Coding Instructions | Measurement Scale
ID | Respondent identification no. | Input number assigned to each questionnaire | Nominal
Gender | Gender | Input the number circled: 1=Males; 2=Females | Categorical/Nominal
Education | What is the highest level of education you have completed? | Input the number circled: 1=None; 2=Primary School; 3=Matric; 4=Inter; 5=Undergraduate; 6=Graduate; 7=Postgraduate; 8=Professional Certification; 9=Vocational Training | Categorical/Nominal
Age | What is your age? | Input age in numbers | Ratio
LifSat1 to LifSat5 | Life satisfaction scale | Input the number circled: 1=strongly disagree to 7=strongly agree | Interval/Scale
Codebook Excerpt for Data File: Restaurant Preference.sav (Table 14.2)
Column Number | Variable Number | Variable Name | Question Number | Coding Instructions
1 | 1 | ID | | 1 to 20 as coded
2 | 2 | Preference | 1 | Input the number circled. 1=Weak Preference; 7=Strong Preference
3 | 3 | Quality | 2 | Input the number circled. 1=Poor; 7=Excellent
4 | 4 | Quantity | 3 | Input the number circled. 1=Poor; 7=Excellent
5 | 5 | Value | 4 | Input the number circled. 1=Poor; 7=Excellent
6 | 6 | Service | 5 | Input the number circled. 1=Poor; 7=Excellent
Codebook Excerpt for Data File: Restaurant Preference.sav (Table 14.2, Contd.)
Column Number | Variable Number | Variable Name | Question Number | Coding Instructions
7 | 7 | Income | 6 | Input the number circled. 1 = Less than $20,000; 2 = $20,000 to $34,999; 3 = $35,000 to $49,999; 4 = $50,000 to $74,999; 5 = $75,000 to $99,999; 6 = $100,000 or more
Restaurant Preference data file: Restaurant Preference.sav
ID PREFERENCE QUALITY QUANTITY VALUE SERVICE INCOME
1 2 2 3 1 3 6
2 6 5 6 5 7 2
3 4 4 3 4 5 3
4 1 2 1 1 2 5
5 7 6 6 5 4 1
6 5 4 4 5 4 3
7 2 2 3 2 3 5
8 3 3 4 2 3 4
9 7 6 7 6 5 2
10 2 3 2 2 2 5
11 2 3 2 1 3 6
12 6 6 6 6 7 2
13 4 4 3 3 4 3
14 1 1 3 1 2 4
15 7 7 5 5 4 2
16 5 5 4 5 5 3
17 2 3 1 2 3 4
18 4 4 3 3 3 3
19 7 5 5 7 5 5
20 3 2 2 3 3 3
[Figure: SPSS Variable View of the data of Table 14.1, Restaurant Preference.sav]
Starting SPSS
Different ways:
1. Double-click the SPSS icon
2. Open an existing data file
3. Import from Excel: first set up the data file in Excel to correspond to SPSS, then:
   1. File → Open → Data, from the SPSS menu.
   2. Select the type of file you want to open: Excel (*.xls, *.xlsx, *.xlsm).
   3. Select the file name.
   4. Check ‘Read variable names’ if the first row of the spreadsheet contains column headings.
   5. Click Open.
http://wps.prenhall.com/bp_malhotra_mr_6/127/32612/8348719.cw/index.html
Steps Involved in Preparing Data Files
• Step 1. Check and modify, where necessary, the ‘Options’ that SPSS
uses to display the data and the output that is produced.

• Step 2. Set up the structure of the data file by ‘defining’ the variables.

• Step 3. Enter the data—that is, the values obtained from each
participant or respondent for each variable.
Step 1
Edit → Options
Step 2 and 3: Excel File Conversion
In the Excel sheet:
• All data should be numeric
• Variable names should be in the first row
• Variable names should be compatible with SPSS, i.e. one word, using only symbols permitted by SPSS, such as _ (underscore); spaces and hyphens are not permitted
• No formulae or functions
• E.g. file for practice: Dell Direct
• File → Open → Data → choose the Excel file type in the drop-down menu for files
• Define Variable labels, Value labels and the type of Measure
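Before importing, the Excel headings can be screened for SPSS compatibility. The sketch below is a simplified approximation of the SPSS naming rules, not the authoritative list; the function name is_spss_compatible and the sample headings are hypothetical.

```python
import re

# Simplified approximation of SPSS variable-naming rules (an assumption for
# illustration; the SPSS documentation has the authoritative list).
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}
NAME_PATTERN = re.compile(r"[A-Za-z@#$][A-Za-z0-9._@#$]{0,63}")


def is_spss_compatible(name: str) -> bool:
    """Return True if `name` looks usable as an SPSS variable name."""
    if NAME_PATTERN.fullmatch(name) is None:
        return False          # spaces, hyphens, leading digits, too long
    if name.endswith("."):
        return False          # a trailing period is discouraged
    return name.upper() not in RESERVED


# Hypothetical Excel headings.
headers = ["ID", "Gender", "Life Sat", "q10_1", "2nd_item", "income-band"]
problems = [h for h in headers if not is_spss_compatible(h)]
print(problems)
```

Any heading listed in problems would need renaming in Excel before the file is opened in SPSS.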
Entering Data into SPSS
Part 1: https://www.youtube.com/watch?v=Kp_js1i6xwE
Part 2: https://www.youtube.com/watch?v=I7YI-o6KWzk
Part 3: https://www.youtube.com/watch?v=WADaJOeuc8I
Part 4: https://www.youtube.com/watch?v=s5TJBxVJ4OY
Converting String Variables to Numeric
• https://www.youtube.com/watch?v=GFBsCg_idPo
Exercise: Dell Direct.sav
• Have your codebook ready
• Ensure all headings in the Excel sheet follow SPSS variable-naming rules; Excel headings transfer to SPSS as Variable names
• Transfer Excel sheet ‘Dell Direct’
• Complete the variable information in SPSS according to the questionnaire
1. Reverse coding where needed
2. Adding up the scores from the items that make up each scale to give an overall score
3. Transforming skewed variables for analyses that require normally distributed scores
4. Collapsing continuous variables (e.g. age) into categorical variables (e.g. young, middle-aged and old) to do some analyses such as analysis of variance; and
5. Reducing or collapsing the number of categories of a categorical variable (e.g. collapsing the marital status into just two categories representing people ‘in a relationship’/‘not in a relationship’).
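Two of the manipulations listed above, reverse coding and collapsing a continuous variable, can be sketched as follows. This is an illustration only: the 1 to 7 scale matches the codebook example, but the age cut-offs (30 and 50) are hypothetical and not taken from the slides.

```python
def reverse_code(score: int, low: int = 1, high: int = 7) -> int:
    """Reverse a score on a low..high scale (1 <-> 7, 2 <-> 6, ...)."""
    return (low + high) - score


def age_group(age: int) -> str:
    """Collapse age into three bands; the cut-offs are hypothetical."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle-aged"
    return "old"


reversed_scores = [reverse_code(s) for s in [1, 2, 7, 4]]
groups = [age_group(a) for a in [22, 35, 64]]
print(reversed_scores)  # [7, 6, 1, 4]
print(groups)           # ['young', 'middle-aged', 'old']
```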
Data Cleaning
• Checking for errors in SPSS
• Finding errors in the data file
• Correcting errors
Checking for Errors
• To make errors easier to spot, display both values and labels in the output:
Edit → Options → Output → ‘Variable values in item labels shown as:’
Then choose ‘Values and Labels’ from the drop-down menu
• Look for values that fall outside the prescribed range
• Check minimum and maximum values (see the following slides)
Detecting Erroneous Data
• For categorical data:
Analyze → Descriptive Statistics → Frequencies
• For continuous data:
Analyze → Descriptive Statistics → Descriptives
Checking for Errors
Look for values that fall outside the prescribed range
• For categorical variables: check the minimum and maximum
Analyze → Descriptive Statistics → Frequencies
Statistics → Minimum; Maximum
Check values against the codebook for any irregularities
• For continuous variables: mean, standard deviation, minimum, maximum
Analyze → Descriptive Statistics → Descriptives
Options → Mean; Standard Deviation; Minimum; Maximum
Check against the codebook for inconsistencies
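The range check can also be expressed in code. The following is a minimal sketch, assuming the valid ranges from the Restaurant Preference codebook excerpt; the function name out_of_range and the mistyped value 44 are hypothetical.

```python
# Valid ranges taken from the Restaurant Preference codebook excerpt.
CODEBOOK_RANGES = {
    "Preference": (1, 7),
    "Quality": (1, 7),
    "Income": (1, 6),
}


def out_of_range(rows, ranges):
    """Return (ID, variable, value) for every value outside its coded range."""
    errors = []
    for row in rows:
        for var, (lowest, highest) in ranges.items():
            value = row[var]
            if value is None:
                continue  # missing values are handled separately
            if not lowest <= value <= highest:
                errors.append((row["ID"], var, value))
    return errors


rows = [
    {"ID": 1, "Preference": 2, "Quality": 2, "Income": 6},
    {"ID": 2, "Preference": 6, "Quality": 5, "Income": 2},
    {"ID": 3, "Preference": 44, "Quality": 4, "Income": 3},  # hypothetical typo
]
errors = out_of_range(rows, CODEBOOK_RANGES)
print(errors)  # [(3, 'Preference', 44)]
```

Each reported ID can then be looked up in the original questionnaire and corrected in the data file.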
Identifying and Dealing with Missing Data
• One way to determine the impact of missing data is to create dummy
variables.
• For example, a dummy variable might be created for variable A by
applying the code 0 to represent each case with no missing data on
variable A and using the code 1 to represent all cases with missing
data on variable A.
• This dummy variable has two groups of cases: (1) those cases with no
missing data, and (2) those cases with missing data.
• This dummy variable can be used as the independent (grouping) variable for variable B, C, etc., to see whether there is a significant difference on variable B (C, D, E, etc.) between the no-missing-data group and the missing-data group, using independent-samples t-tests
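A minimal sketch of the dummy-variable approach, using made-up data. It computes Welch's t statistic (the unequal-variance form of the independent-samples t-test) with the standard library only; obtaining a p-value would need a t distribution, which is not included here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical data: None marks a missing value on variable A.
var_a = [3, None, 5, 4, None, 2, 4, None]
var_b = [10, 14, 11, 9, 15, 10, 12, 13]

# Dummy variable: 0 = no missing data on A, 1 = missing data on A.
missing_a = [1 if a is None else 0 for a in var_a]

b_complete = [b for b, m in zip(var_b, missing_a) if m == 0]
b_missing = [b for b, m in zip(var_b, missing_a) if m == 1]

t_statistic = welch_t(b_missing, b_complete)
print(missing_a, round(t_statistic, 2))
```

A large absolute t value suggests that cases with missing data on A differ systematically on B, so the data may not be missing at random.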
Check for Missing Data in Excel
• One quick way to do this is to copy and paste the data into Excel
• In Data View: Ctrl+A, then Ctrl+C
• Paste in Excel: Ctrl+V
• In column AW, write the formula: =COUNTBLANK(A1:AV1)
• Then drag the cell down to copy the formula and get a count for every respondent (row)
• Conditional formatting → Highlight cell rules → Greater than (enter 0)
• If the missing data for a respondent is substantial, say more than 20%, then it is best to delete that respondent's data.
• Add or delete data in SPSS, if any
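The same check can be sketched outside Excel. This is an illustration with made-up responses, where None stands for a blank cell and the 20% threshold is the one suggested above.

```python
def missing_share(row):
    """Fraction of a respondent's answers that are missing (None)."""
    return sum(value is None for value in row) / len(row)


# Hypothetical responses, one row per respondent.
responses = [
    [4, 5, 3, 4, 5],
    [2, None, 3, None, 1],  # 40% missing
    [5, 5, None, 4, 4],     # 20% missing
]

THRESHOLD = 0.20  # delete respondents with more than 20% missing
to_drop = [i for i, row in enumerate(responses) if missing_share(row) > THRESHOLD]
print(to_drop)  # [1]
```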
Data Imputation
Replacing missing value with mean:
• Mean substitution is attractive because it is a conservative estimate
since the mean of the scores on the variable does not change.
• However, the variance of the variable is reduced, because each missing
value is replaced with the distribution mean rather than with the score
that was actually missing, which would usually differ from the mean.
• Correlations with other variables are also lowered because of the
reduced variance.
• Mean substitution is less frequently used because other more
accurate methods of imputation have been developed.
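The effect of mean substitution on the mean and the variance can be seen in a small example with made-up scores:

```python
from statistics import mean, variance

# Hypothetical scores; None marks a missing value.
scores = [2, 4, None, 6, 8, None]

observed = [s for s in scores if s is not None]
fill_value = mean(observed)
imputed = [fill_value if s is None else s for s in scores]

print(mean(observed), mean(imputed))          # the mean is unchanged
print(variance(observed), variance(imputed))  # the variance shrinks
```

Here the mean stays at 5 while the sample variance falls from about 6.67 to 4, which is the reduction described above.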
Data Imputation
Regression analysis:
• Regression analysis takes completed case values from a data set and
generates a regression equation to predict the missing values.
• This approach may be more sophisticated than mean substitution but it has
limitations.
• If the variables used to predict the missing values are not good
predictors of the missing values, then the outcome is not optimal.
Data Imputation
Expectation maximization (EM)
• Expectation maximization (EM) involves creating a distribution for the partially
missing data and making inferences about the missing values based on the likelihood
under that distribution
• Repeating analyses with and without (imputed) missing data is highly
recommended following any of the methods of handling missing data.
• You will be comparing the results for similarities and differences.
• If the results of the two analyses are similar, this gives the
researcher confidence in the interpretation of the results.
• Further data investigation is needed if the results are different.
• It is good practice to report results from both the data set with missing values and the
imputed data set.
Imputing Missing Values in SPSS
• Transform → Replace Missing Values
• Beside Method, choose ‘Linear trend at point’
• Then move the variable with missing data into the New Variable(s) box
• Click OK
• Check Data View for the new variable with imputed data.
Check for Unengaged Responses
• One quick way to do this is to copy and paste the data into Excel
• In Data View: Ctrl+A, then Ctrl+C
• Paste in Excel: Ctrl+V
• In column AW, write the formula: =STDEV.P(A1:AR1)
• Then drag the cell down to get results for every respondent (row)
• A very small SD means that the respondent has given similar responses to all questions. An SD of 0
means that the respondent has chosen the same response throughout, e.g. all 3s or all 4s: delete these cases
• Highlight small SDs: Conditional formatting → Highlight cell rules → Less than
• You may enter ‘0.5’ as the threshold. Two rows will then be flagged for deletion
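The same screening can be sketched in code. The respondent IDs and answers below are made up; pstdev is the population standard deviation, which corresponds to Excel's STDEV.P.

```python
from statistics import pstdev  # population SD, like Excel's STDEV.P

# Hypothetical Likert answers keyed by respondent ID.
responses = {
    101: [3, 3, 3, 3, 3, 3],  # identical answers, SD = 0
    102: [1, 5, 2, 4, 3, 5],
    103: [4, 4, 4, 3, 4, 4],  # almost identical answers
}

THRESHOLD = 0.5
flagged = [rid for rid, answers in responses.items()
           if pstdev(answers) < THRESHOLD]
print(flagged)  # [101, 103]
```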
Check Outliers on Continuous Variables
• Analyze → Descriptive Statistics → Explore
• In ‘Statistics’ check ‘Outliers’. Continue
• In ‘Plots’ check ‘Histogram’ and ‘Normality plots with tests’. Continue
• ‘Age’ and ‘Experience’ are continuous variables. Add these to Dependent list. OK
• Look for extreme values to check if any odd numbers appear, such as ages below the specified range
Check Outliers on Continuous Variables + Normality Check
• Analyze → Descriptive Statistics → Explore
• In ‘Statistics’ check ‘Outliers’. Continue
• In ‘Plots’ check ‘Histogram’ and ‘Normality plots with tests’. Continue
• ‘Age’ and ‘Experience’ are continuous variables. Add these to Dependent list. OK
• The normality tests say the distributions are not normal
• However, other checks of normality should also be performed
• On these, the distributions are fairly normal
[Plots: Age, Experience]
Check Outliers on Interval Variables
• Analyze → Descriptive Statistics → Frequencies
• Add all variables except id to the variable list. In ‘Statistics’ check ‘Skewness’ and ‘Kurtosis’.
Continue. OK.
• The first table, ‘Statistics’, shows missing values. Amos does not run with missing values, so
impute values if any are missing.
• Look for absolute values > 1 in Skewness and Kurtosis
• Calculate z scores for skewness and kurtosis: z = value / standard error
• Rule of thumb: absolute z values > 3.3 (3.29 to be exact) are problematic (recall ± 3σ)
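The z scores for skewness and kurtosis can be sketched as below. The formulas are the standard sample-adjusted ones with their usual standard errors, which are believed to match what SPSS reports but should be checked against its output. The data are the PREFER. column of Table 14.1, and the function name is hypothetical.

```python
from math import sqrt


def skew_kurtosis_z(values):
    """Return (skewness, z_skewness, excess kurtosis, z_kurtosis); needs n >= 4."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    # Sample-adjusted skewness and excess kurtosis.
    skew = (m3 / m2 ** 1.5) * sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * (m4 / m2 ** 2 - 3) + 6)
    # Standard errors depend only on the sample size.
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, skew / se_skew, kurt, kurt / se_kurt


# PREFER. column from the Restaurant Preference data.
preference = [2, 6, 4, 1, 7, 5, 2, 3, 7, 2, 2, 6, 4, 1, 7, 5, 2, 4, 7, 3]
skew, z_skew, kurt, z_kurt = skew_kurtosis_z(preference)
problematic = abs(z_skew) > 3.29 or abs(z_kurt) > 3.29
print(round(z_skew, 2), round(z_kurt, 2), problematic)
```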
Approaches for Minimizing the Effects of Outliers
• A less desirable method is to delete the case or variable with the
outlier(s), based on an assessment of whether the case is
representative of the population that the sample was drawn from.
• Using trimmed means: discard the largest 5 percent of scores
and the smallest 5 percent of scores.
• Conducting a data transformation of the original raw scores reduces
the influence of extreme scores by bringing the outliers closer to the
majority of scores in the distribution.
  • This is a commonly used procedure.
  • Raw scores are converted to z-scores or T-scores.
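A minimal sketch of a 5 percent trimmed mean and of z-score and T-score conversion. The ages are made up, with one extreme value, and the T-score is assumed to follow the common convention T = 50 + 10z.

```python
from statistics import mean, stdev


def trimmed_mean(values, proportion=0.05):
    """Mean after discarding `proportion` of scores from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # scores dropped from each end
    return mean(ordered[k:len(ordered) - k])


def z_scores(values):
    """Standardize scores using the sample mean and sample SD."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]


def t_scores(values):
    """T-scores, assumed here to be 50 + 10 * z."""
    return [50 + 10 * z for z in z_scores(values)]


# Hypothetical ages: 21 to 39 plus one extreme value.
ages = list(range(21, 40)) + [95]
print(mean(ages), trimmed_mean(ages))  # 33.25 versus 30.5
```

Dropping one score from each tail removes the extreme age of 95, so the trimmed mean sits closer to the bulk of the data.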
Transforming Data to Bring Closer to Normality
• For positively skewed distributions, a square root or a log10
transformation can be used to try to normalize them.
• For negatively skewed distributions, reflecting the
distribution (reversing it so that it becomes positively skewed) and then using a square root or a
log transformation may normalize the distribution.
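The transformations can be sketched as follows. The reflection shown, (maximum + 1) minus the score, is one common convention and is an assumption here; the slides do not specify the constant. The data are made up.

```python
from math import log10, sqrt


def log_transform(values):
    """log10 of each score; all scores must be greater than 0."""
    return [log10(v) for v in values]


def reflect(values):
    """Reflect scores so the largest becomes 1 (assumed convention: max + 1 - x)."""
    pivot = max(values) + 1
    return [pivot - v for v in values]


# Hypothetical positively skewed scores.
positive_skew = [1, 2, 2, 3, 4, 10, 100]
logged = log_transform(positive_skew)

# Hypothetical negatively skewed scores: reflect, then take the square root.
negative_skew = [1, 8, 9, 9, 10, 10, 10]
reflected_sqrt = [sqrt(v) for v in reflect(negative_skew)]

print([round(v, 2) for v in logged])
print([round(v, 2) for v in reflected_sqrt])
```

After reflection the direction of the scale is reversed, so high transformed scores correspond to low original scores when interpreting results.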
Transforming Data to Bring Closer to Normality
1. Select: Transform → Compute Variable
2. Under Target Variable, type a name for the new, transformed variable. This
command creates a new column of transformed scores in Data View, so
prefixing the name with L (for Logarithm) makes clear what each column of scores
represents.
3. Under Function group, click on All. Then under Functions and Special
Variables, scroll down until you find LG10, click on it, and click on the
up-arrow button to its left. This places LG10(?) under Numeric Expression.
4. Next, in the list of variables under Type & Label, click on the variable to be
transformed and then click on the arrow to the right of the list. The variable
will replace the ? under Numeric Expression. So, the
Numeric Expression should look like LG10(variablename).
5. Then click OK, and the log10-transformed variable will appear in Data View.
6. Re-run Analyze → Descriptive Statistics → Explore on the new variable to check the transformation
Checking for Outliers
https://www.youtube.com/watch?v=WSflSmcNRFI
https://www.youtube.com/watch?v=qQqF6HZo0Gc
Working with Data (Dell data)
• Sort Cases: Data → Sort Cases → q14
• This action is irreversible, so exercise caution. Before sorting, create an
ID variable, e.g. ‘Respondent_ID’, so the file can be restored to its
original order
• In Variable View, click “q1” → Insert Variable. The new variable will
appear at the top. Name it “Respondent_ID”
• In Data View, fill in values from 1 to 372 (to the end of the data) in the
Respondent_ID column
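Why the ID column matters can be seen in a small sketch with three made-up cases: once each case carries its original position, any later sort can be undone by sorting on the ID again.

```python
# Hypothetical cases in their original entry order.
cases = [{"q14": 3}, {"q14": 1}, {"q14": 2}]

# Number the cases 1..n before any sorting.
for position, case in enumerate(cases, start=1):
    case["Respondent_ID"] = position

sorted_by_q14 = sorted(cases, key=lambda c: c["q14"])
restored = sorted(sorted_by_q14, key=lambda c: c["Respondent_ID"])

print([c["Respondent_ID"] for c in sorted_by_q14])  # [2, 3, 1]
print(restored == cases)                            # True
```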
Splitting Data Files
• To run analysis on separate groups simultaneously:
Data → Split File → Compare Groups
• Then select the grouping variable on which the file is to be split
• To reverse, and run analysis on all data:
Data → Split File → Analyze all cases, do not create groups
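Conceptually, splitting a file runs the same analysis once per group. A minimal sketch with made-up cases, using sex as the grouping variable and the mean of q4 as the analysis:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases: sex is the grouping variable, q4 a satisfaction score.
cases = [
    {"sex": 1, "q4": 6},
    {"sex": 2, "q4": 4},
    {"sex": 1, "q4": 5},
    {"sex": 2, "q4": 6},
]

# Collect the scores for each group, then analyze each group separately.
scores_by_group = defaultdict(list)
for case in cases:
    scores_by_group[case["sex"]].append(case["q4"])

group_means = {sex: mean(scores) for sex, scores in scores_by_group.items()}
print(group_means)  # {1: 5.5, 2: 5}
```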
Selecting Cases
• To conduct analysis on a subset of cases:
Data → Select Cases → If condition is satisfied → If
• Type the name of the variable, or select the variable and click the arrow
• Add the condition, e.g. sex = 1
• To reverse, and run analysis on all groups:
Data → Select Cases → All cases

Recoding Data: New Variables
1. Transform → Recode into Different Variables…
2. Click on q1 and move it to the ‘Numeric Variable → Output Variable’ box.
3. Type “internet_usage” in the Name box and “2_Internet usage” in the
Label box.
4. Click Old and New Values
5. Select Range. Type ‘1’ through ‘2’
6. In New Value, type ‘1’. Click Add.
7. Select Range. Type ‘3’ through ‘6’
8. In New Value, type ‘2’. Click Add.
9. Click Continue → Click Change → Click OK.
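The recoding rule in the steps above (1 through 2 becomes 1, 3 through 6 becomes 2) can be written as a small function. This is a sketch only; the function name is hypothetical, and values outside 1 to 6 are left as missing.

```python
def recode_internet_usage(q1):
    """Recode q1: 1-2 -> 1, 3-6 -> 2, anything else -> None (missing)."""
    if q1 is None:
        return None
    if 1 <= q1 <= 2:
        return 1
    if 3 <= q1 <= 6:
        return 2
    return None


q1_values = [1, 2, 3, 6, None, 9]
internet_usage = [recode_internet_usage(v) for v in q1_values]
print(internet_usage)  # [1, 1, 2, 2, None, None]
```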
Calculating Total Scores
• Instruct SPSS to add together scores from all the items that make up the
subscale or scale. Example: Dell Data
• Transform → Compute Variable → Target Variable (e.g. Overall_MM)
• Type & Label → Label (describe, e.g. Overall Market Maven) → Continue
• Select items q10_1 to q10_4 one by one, adding a + sign between them
• Exercise: form a new variable that denotes the total number of things that people have
ever done online based on q2_1 to q2_7. Run a frequency distribution of the
new variable and interpret the results. Note the missing values for q2_1 to
q2_7 are coded as 0.
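Both totals can be sketched for a single made-up respondent. The item names follow the Dell data; it is assumed here that q2_1 to q2_7 are coded 1 when the activity was done and 0 otherwise.

```python
# Hypothetical respondent from the Dell data.
respondent = {
    "q10_1": 5, "q10_2": 6, "q10_3": 4, "q10_4": 7,
    # Assumed coding: 1 = has done the activity, 0 = not done or missing.
    "q2_1": 1, "q2_2": 0, "q2_3": 1, "q2_4": 1,
    "q2_5": 0, "q2_6": 0, "q2_7": 1,
}

# Overall Market Maven score: q10_1 + q10_2 + q10_3 + q10_4.
overall_mm = sum(respondent[f"q10_{i}"] for i in range(1, 5))

# Number of things ever done online: q2_1 + ... + q2_7.
online_total = sum(respondent[f"q2_{i}"] for i in range(1, 8))

print(overall_mm, online_total)  # 22 4
```

Because missing values on q2_1 to q2_7 are coded 0, they add nothing to the count and do not need special handling in the sum.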
Calculating Total Scores (Contd.)
• Make sure any reverse coding needed has been performed.
• Calculating z scores (standardized variables)
• Analyze → Descriptive Statistics → Descriptives → Select “Save standardized
values as variables”
• Comparing standardized (Z) and unstandardized (sum/total) scores
• Analyze → Correlate → Bivariate
• If the correlation is at least .9, then the two can be used interchangeably
Scale Reliability
• Part 1: https://www.youtube.com/watch?v=2gHvHm2SE5s
• Part 2: https://www.youtube.com/watch?v=9rS49o1rdnk
Improving Scale Reliability
• https://www.youtube.com/watch?v=xVl6Fg2A9GA
Graph Steering – 3D
• Graphs → Legacy Dialogs → 3-D Bar
• X will show properties of variable
• Y will show properties of graph
Example:
1. Graphs → Legacy Dialogs → 3-D Bar
2. For the ‘X’ axis, select ‘Groups of cases’; for the ‘Z’ axis, select ‘Groups of cases’
3. Click ‘Define’
4. In ‘X’ Category, add ‘Gender’; in ‘Z’ Category, add ‘Usage_3G’
5. In ‘Stack/Cluster by’, add ‘q4: Overall how satisfied are you with your Dell
computer system?’
6. Click ‘OK’
