Topic 3: Data Processing (BUS 221)
DATA PROCESSING
• EDITING QUESTIONNAIRE
• CODING
• CREATING A DATA FILE ON A COMPUTER
DATA ANALYSIS TECHNIQUES
• DESCRIPTIVE ANALYSIS
– UNIVARIATE ANALYSIS (ONE VARIABLE)
• BIVARIATE ANALYSIS (TWO VARIABLES)
• MULTIVARIATE ANALYSIS (MORE THAN TWO VARIABLES)
Assignment
• Download a questionnaire from the e-learning platform
• Prepare a codebook for the questionnaire
• Use the questionnaire to collect data from at least ten respondents
• Download and install SPSS on your computer
• Download the book SPSS Survival Manual
• Use the manual to learn about data processing using SPSS
Editing Data
• Checking and adjusting responses in the completed
questionnaires
• Purposes of editing:
• To ensure consistency among responses
• To ensure completeness of responses
• To facilitate the coding process (a short checking sketch in code follows below)
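As an illustration of these checks outside SPSS, here is a minimal pandas sketch; the file name and the column names (worked_before, years_worked) are hypothetical, not from the course materials.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("questionnaires.csv")

# Completeness: flag questionnaires with unanswered items.
missing_per_row = df.isna().sum(axis=1)
print(df[missing_per_row > 0])  # questionnaires needing follow-up

# Consistency: respondents who say they have never worked should not
# report any years of work experience.
inconsistent = df[(df["worked_before"] == "NO") & (df["years_worked"] > 0)]
print(inconsistent)
```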
Editing - Checking Questionnaire
A questionnaire returned from the field may be unacceptable
for several reasons.
– Parts of the questionnaire may be incomplete.
– The pattern of responses may indicate that the respondent
did not understand or follow the instructions.
– The responses show little variance.
– One or more pages are missing.
– The questionnaire is answered by someone who does not
qualify for participation.
Treatment of Unsatisfactory Results
• Returning to the Field – Questionnaires with unsatisfactory responses may be returned to the field, where the interviewers re-contact the respondents.
• Assigning Missing Values – If returning the questionnaires to the field is not feasible, the editor may assign missing values to unsatisfactory responses during data analysis.
• Discarding Unsatisfactory Respondents – In this approach, respondents with unsatisfactory responses are simply discarded.
Coding
• Coding means assigning a code, usually a number,
to each possible response to each question.
• Codebook: a summary of the instructions you will use to convert the information obtained from each subject or case into a format that a computer (SPSS or other software) can understand.
• Preparing the codebook involves:
– defining and labelling each of the variables; and
– assigning numbers to each of the possible responses
CODING – EXAMPLE
Question for respondents:
• Which Programme do you study? (SELECT ONE)
– BAF-BS
– BAF-PS
– BBA-MM
– BBA-EIM
– BPSCM
Coding:
• Variable name: Programme Type
• Coding instructions: 1 = BAF-BS; 2 = BAF-PS; 3 = BBA-MM; 4 = BBA-EIM; 5 = BPSCM
Coding
• Code the following Questions:
• Indicate your sex (MALE/FEMALE)
• Have you ever worked before? (YES/NO)
• Indicate your age…………….
• How do you rate yourself in terms of Self-
esteem (HIGH/MEDIUM/LOW)
• What is your education level? (University/College/Secondary/Primary)
Examples of Coding…
Age: 1 = 1; 2 = 2; 3 = 3; 4 = 4; 5 = 5
Sex: Male = 1; Female = 2
Political Affiliation: CCM = 1; Chadema = 2; CUF = 3
Self-esteem: Low = 1; Medium = 2; High = 3
Coding of open-ended questions
• E.g. What is the major source of your start-up capital for your business?
• You might notice most of the respondents listing their source of financing as related to:
– loans from friends/family, assistance from family, loans from SACCOs, UPATU
• Group the major responses under the variable name STARTCAPIT and assign a number to each (Loan from friends = 1, Assistance = 2, SACCOs = 3, UPATU = 4), as sketched in code below.
• You also need to add another numerical code for responses that do not fall into these listed categories (e.g. OTHERS = 5).
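A minimal sketch of this coding step in Python, assuming the free-text responses have already been grouped into the categories above; the category labels and the OTHERS code are illustrative, not part of the course data.

```python
import pandas as pd

# Hypothetical coding scheme for STARTCAPIT.
startcapit_codes = {
    "Loan from friends/family": 1,
    "Assistance from family": 2,
    "Loan from SACCOs": 3,
    "UPATU": 4,
}
OTHERS = 5  # assumed catch-all code for unlisted categories

responses = pd.Series([
    "Loan from friends/family",
    "UPATU",
    "Inheritance",  # not in the scheme, so it falls into OTHERS
])

coded = responses.map(startcapit_codes).fillna(OTHERS).astype(int)
print(coded.tolist())  # [1, 4, 5]
```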
Example of a codebook

Description of Variable | Variable name (SPSS) | Coding instructions
Identification Number | ID | Number assigned to each questionnaire
Hypothesis Testing: Types of Errors

Decision | Null hypothesis is true | Null hypothesis is false
We decide to reject the null hypothesis | Type I error (rejecting a true null hypothesis) | Correct decision
We fail to reject the null hypothesis | Correct decision | Type II error (failing to reject a false null hypothesis)
Hypothesis Testing…
• The critical concepts are these:
• There are two hypotheses, the null and the alternative hypotheses.
• The procedure begins with the assumption that the null hypothesis
is true.
• The goal is to determine whether there is enough evidence to infer
that the alternative hypothesis is true, or the null is not likely to be
true.
• There are two possible decisions:
– Conclude that there is enough evidence to support the alternative
hypothesis. Reject the null.
– Conclude that there is not enough evidence to support the alternative hypothesis. Fail to reject the null.
• Therefore, the smaller the p-value, the stronger the evidence against the null hypothesis H0.
Interpreting the P-Value
• The smaller the p-value, the more statistical evidence
exists to support the alternative hypothesis.
• If the p-value is less than 1%, there is overwhelming evidence supporting the alternative hypothesis.
• If the p-value is between 1% and 5%, there is strong evidence supporting the alternative hypothesis.
• If the p-value is between 5% and 10%, there is weak evidence supporting the alternative hypothesis.
• If the p-value exceeds 10%, there is no evidence supporting the alternative hypothesis. (See the computation sketch below.)
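To make the graded interpretation concrete, here is a hedged Python sketch using scipy; the two samples are made-up illustration data, not from the course.

```python
from scipy import stats

# Made-up scores for two independent groups, for illustration only.
group_a = [72, 68, 75, 71, 69, 74, 70, 73]
group_b = [65, 70, 66, 64, 68, 67, 69, 66]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Apply the graded interpretation from the slide above.
if p_value < 0.01:
    verdict = "overwhelming evidence for the alternative hypothesis"
elif p_value < 0.05:
    verdict = "strong evidence"
elif p_value < 0.10:
    verdict = "weak evidence"
else:
    verdict = "no evidence"
print(f"p = {p_value:.4f}: {verdict}")
```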
Correlation Analysis
• Correlation:
– determines whether and to what degree a relationship exists
between two or more quantifiable variables
– the degree of the relationship is expressed as a coefficient of
correlation, r
• Linear relationships, implying a straight-line association, are visualized with scatter plots.
• The relationship is strong when the points lie close to a straight line, and weak when they are widely scattered.
• The presence of a correlation does not indicate a cause-and-effect relationship, primarily because of the possibility that a third, unobserved variable influences both.
Scatter Plot Examples
[Figures: scatter plots illustrating linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship]
Correlation Coefficient
• The sample correlation coefficient r is an estimate of the population correlation coefficient and is used to measure the strength of the linear relationship in the sample observations.
• Correlation coefficients:
– range between −1 and +1
– the closer to −1, the stronger the negative linear relationship
– the closer to +1, the stronger the positive linear relationship
– the closer to 0, the weaker the linear relationship
(A computation sketch follows below.)
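A minimal sketch of computing r in Python; the paired data (e.g. hours studied vs. exam score) are invented for illustration.

```python
from scipy import stats

# Invented paired observations, e.g. hours studied vs. exam score.
x = [2, 4, 5, 7, 8, 10, 11, 13]
y = [50, 55, 57, 63, 66, 70, 72, 79]

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
# r near +1: strong positive linear relationship;
# near -1: strong negative; near 0: weak or no linear relationship.
```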
Strength and Direction….
• Graded interpretation:
• r below .10 = weak; r = .10–.29 = medium; r = .30–.49 = strong; r = .50–1.0 = very strong correlation
• These guidelines apply whether or not there is a negative sign in front of your r value.
• Remember, the negative sign refers only to the direction of the relationship, not the strength.
• The strength of correlation for r = .5 and r = −.5 is the same; only the direction differs.
Testing Hypotheses – Correlation Coefficient
[Figure: path diagram showing four independent variables (X1, X2, X3, X4) each linked to the dependent variable]
Multiple Regression Equation
Y = a + b1X1 + b2X2 + b3X3 + b4X4
Notation:
• Y is the dependent variable
• The Xs are independent variables
• a is the Y intercept, where the regression line crosses the Y axis
• b1 is the partial slope for X1 on Y: it indicates the change in Y for a one-unit change in X1, controlling for X2, X3, X4
• b2 is the partial slope for X2 on Y: it indicates the change in Y for a one-unit change in X2, controlling for X1, X3, X4
Assumptions of Multiple regression
• Sample size: the issue is generalisability, which requires a large sample size.
• If your results do not generalise to other samples, then they are of
little scientific value.
• Different guidelines concerning the number of cases required for multiple regression have been given:
• Stevens (1996) recommends that ‘for social science research, about
15 subjects per predictor are needed for a reliable equation’.
• Tabachnick and Fidell (2001) gave a formula for calculating sample size, taking into account the number of independent variables: n > 50 + 8m (where m = number of independent variables).
• With five independent variables, for example, you will need at least 90 cases.
• More cases are needed if the dependent variable is skewed.
Assumptions……
• Multicollinearity and singularity
• This refers to the relationship among the independent
variables.
• Multiple regression doesn’t like multicollinearity or singularity
• Multicollinearity exists when the independent variables are
highly correlated (r=.9 and above).
• Singularity occurs when one independent variable is actually a
combination of other independent variables (e.g. when both
subscale scores and the total score of a scale are included).
• These certainly don't contribute to a good regression model.
• Always check for these problems before you start the analysis.
Assumptions….
• Outliers
• Multiple regression is very sensitive to outliers (very high or very low scores).
• Check for extreme scores before you start the analysis.
• Do this for all variables, both dependent and independent.
• Outliers can either be deleted from the data set or replaced by a standardized score for that variable.
Assumptions….
• Normality & linearity
• These refer to various aspects of the distribution of scores and the nature of the underlying relationship between the variables.
• Your variables need to be normally distributed, especially the dependent variable.
• There should be a linear relationship between the independent variables and the dependent variable. (A quick normality check is sketched below.)
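One common way to check normality outside SPSS is the Shapiro–Wilk test; a minimal scipy sketch with invented data:

```python
from scipy import stats

# Invented sample of a dependent variable.
y = [4, 5, 5, 2, 4, 5, 3, 5, 2, 4, 3, 4, 5, 4, 3]

# Shapiro-Wilk test: p < .05 suggests the scores depart from normality.
w_stat, p_value = stats.shapiro(y)
print(f"W = {w_stat:.3f}, p = {p_value:.4f}")
```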
Checking for Multicollinearity
• Two diagnostics are available: Variance Inflation Factor (VIF) and Tolerance (both available in SPSS).
• Variance Inflation Factor (VIF) measures how much the variance of the regression coefficients is inflated by multicollinearity problems.
• The minimum possible VIF is 1, which indicates no correlation between the independent measures; values somewhat above 1 indicate some association between predictor variables, but generally not enough to cause problems.
• A maximum acceptable VIF value would be 5.0; anything higher would indicate a problem with multicollinearity.
• Some books recommend a cut-off point of 10 for VIF. (A computation sketch follows below.)
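A hedged sketch of computing VIF (and, as the next slide discusses, tolerance = 1/VIF) with statsmodels; the predictor names and values are invented for illustration.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Invented predictor data; column names are illustrative.
X = pd.DataFrame({
    "menu_variety": [3, 4, 5, 2, 4, 5, 3, 4],
    "food_quality": [4, 4, 5, 3, 4, 5, 3, 5],
    "food_taste":   [4, 5, 5, 3, 4, 5, 2, 5],
})
X = sm.add_constant(X)  # include the intercept when computing VIF

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```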
Checking for Multicollinearity…
• Tolerance is the amount of variance in an independent variable that is not explained by the other independent variables.
• If the other variables explain a lot of the variance of a particular independent variable, there is a problem with multicollinearity.
• Thus, small values of tolerance indicate problems of multicollinearity.
• The usual cutoff value for tolerance is .20: a tolerance value smaller than .20 indicates a problem of multicollinearity.
• (Tolerance = 1/VIF, so a tolerance of .20 corresponds to a VIF of 5.)
Multicollinearity Tests…
• You can also check the Pearson correlation coefficients between the IVs.
• Correlation coefficients between IVs should not be very high (r = .9 or more).
• If two IVs are highly correlated, you may need to remove one of them.
Example of Multiple regression
• Research on Factors influencing Repeat
Purchase in a Hotel
• Dependent Variable
– Intention to return to the hotel in future (repeat purchase)
• Independent variables
– Wide variety of Menu items
– Excellent Food quality
– Excellent Food taste
Procedure for standard multiple regression
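The SPSS point-and-click steps are not reproduced here; as an alternative, this is a minimal Python sketch of a standard multiple regression for the hotel example above, using invented ratings data.

```python
import pandas as pd
import statsmodels.api as sm

# Invented ratings for the hotel repeat-purchase example.
df = pd.DataFrame({
    "menu_variety":    [3, 4, 5, 2, 4, 5, 3, 4, 2, 5],
    "food_quality":    [4, 4, 5, 3, 4, 5, 3, 5, 2, 4],
    "food_taste":      [4, 5, 5, 3, 4, 5, 2, 5, 3, 4],
    "repeat_purchase": [4, 5, 5, 2, 4, 5, 3, 5, 2, 4],
})

X = sm.add_constant(df[["menu_variety", "food_quality", "food_taste"]])
model = sm.OLS(df["repeat_purchase"], X).fit()
print(model.summary())  # the coefficients are the partial slopes b1, b2, b3
```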
Interpretation of output from independent-samples t-test
• Checking the information about the groups:
• In the Group Statistics box, SPSS gives you the mean and standard deviation for each of your groups (in this case: male/female).
• It also gives you the number of people in each group (N).
• Always check these values first. Do they seem right?
• Are the N values for males and females correct? Or is there a lot of missing data?
• If so, find out why. Perhaps you have entered the wrong code for males and females (0 and 1, rather than 1 and 2). Check with your codebook. (A code sketch follows below.)
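The same group check and t-test can be sketched in Python; the coded data below are invented, with sex coded 1 = male, 2 = female as in the codebook examples earlier.

```python
import pandas as pd
from scipy import stats

# Invented coded data: sex (1 = male, 2 = female) and a score variable.
df = pd.DataFrame({
    "sex":   [1, 1, 1, 2, 2, 2, 1, 2, 2, 1],
    "score": [70, 68, 75, 64, 66, 69, 72, 65, 67, 71],
})

# First check the group statistics (mean, SD, N), as advised above.
print(df.groupby("sex")["score"].agg(["mean", "std", "count"]))

males = df.loc[df["sex"] == 1, "score"]
females = df.loc[df["sex"] == 2, "score"]
t_stat, p_value = stats.ttest_ind(males, females)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```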
Interpretation of output for t-test
[Figure: SPSS t-test output not reproduced]
ANOVA: Multiple Comparisons

(I) Age of respondent | (J) Age of respondent | Mean Difference (I−J) | Std. Error | Sig. | 95% CI Lower Bound | 95% CI Upper Bound
24 and younger | 25-40 | -.22139* | .09014 | .046 | -.4402 | -.0026
24 and younger | 41 and older | -.31974* | .08929 | .002 | -.5366 | -.1029
25-40 | 24 and younger | .22139* | .09014 | .046 | .0026 | .4402
25-40 | 41 and older | -.09835 | .08156 | .543 | -.2954 | .0987
41 and older | 24 and younger | .31974* | .08929 | .002 | .1029 | .5366
41 and older | 25-40 | .09835 | .08156 | .543 | -.0987 | .2954
*. The mean difference is significant at the .05 level.
Interpreting the Post Hoc Results
• Statistically significant differences (p < .05) exist in the proportion of Internet users who have made an on-line purchase for the following age-group comparisons (see the sketch below):
• 24 or younger vs. 25–40: the proportion in the 24-or-younger age group is .22139 smaller than the proportion in the 25–40 age group.
• 24 or younger vs. 41 or older: the proportion in the 24-or-younger age group is .31974 smaller than the proportion in the 41-or-older age group.
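For comparison with the SPSS output above, here is a hedged sketch of a one-way ANOVA with Tukey HSD post hoc tests in Python; the purchase data (1 = bought on-line, 0 = did not) are invented and will not reproduce the table's numbers.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Invented data: on-line purchase (1 = yes, 0 = no) by age group.
df = pd.DataFrame({
    "age_group": ["24 and younger"] * 5 + ["25-40"] * 5 + ["41 and older"] * 5,
    "purchase":  [1, 0, 1, 0, 0,  1, 1, 0, 1, 1,  1, 1, 1, 0, 1],
})

groups = [g["purchase"].values for _, g in df.groupby("age_group")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Post hoc pairwise comparisons, analogous to the SPSS table above.
print(pairwise_tukeyhsd(df["purchase"], df["age_group"]))
```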
Presenting the results
• A one-way ANOVA was conducted to determine whether the proportion of Internet users who made on-line purchases was influenced by the users' age.
• The test found a highly statistically significant difference among the age groups (p = .002).
• Post hoc analysis showed that the proportions of users making a purchase in the middle and older age groups were higher than that in the younger age group (at the .05 level).
• Therefore, there is highly significant statistical evidence to support the hypothesis that age influences on-line purchasing.
Qualitative Data Analysis
• The first difference between qualitative and
quantitative data analysis is that the data to be
analyzed are text, rather than numbers
– No hypotheses to be tested
– No variables
Quantitative vs. Qualitative

Quantitative:
• Explanation through numbers
• Objective
• Deductive reasoning
• Predefined variables and measurement
• Data collection before analysis

Qualitative:
• Explanation through words
• Subjective
• Inductive reasoning
• Creativity, extraneous variables
• Data collection and analysis intertwined
– priorities, importance
– processes, practices