Chap4 Normality (Data Analysis) FV
Normality and the multi-collinearity problem
Data Analysis
International School of Business
Study Objectives
1. Define normal distributions.
2. Present the methods used to check normality.
3. Discuss methods for improving normality.
4. Present the problem of multi-collinearity.
5. Discuss how to avoid the multi-collinearity problem.
Normal distributions
– Most observations lie around the center
– Fewer observations above and below the central values, in approximately equal proportions
– Most often the Gaussian distribution
Normal distribution
• Many characteristics are distributed through the
population in a ‘normal’ manner
– Normal curves have well-defined statistical properties
– Parametric statistics are based on the assumption that the
variables are distributed normally
• Most commonly used statistics
• This is the famous “Bell curve” where many cases fall
near the middle of the distribution and few fall very
high or very low
– I.Q.
[Figure: I.Q. distribution. Source: www.wilderdom.com/.../L2-1UnderstandingIQ.html]
Not normal distributions
• More observations fall in one part of the range.
Asymmetrical distribution
How would you describe/present your respondents if the data are numeric?
Two groups of measures:
1. Central tendency (central value, average)
2. Variance
Is a measure of central tendency enough to describe the respondents?
What measures are to be used for sample description?
If the distribution is NORMAL:
– Mean
– Variance (or standard deviation)
EMPIRICAL RULE
[Figure: normal curve with mean X̄ = 8 and SD = 2.5; the region from −2 SD to +2 SD around the mean is marked.]
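As a quick illustration, here is a minimal Python sketch (simulated data; the mean of 8 and SD of 2.5 are taken from the slide) that checks the empirical-rule proportions:

import numpy as np

# Simulated sample with the slide's mean (8) and SD (2.5); replace with real data.
rng = np.random.default_rng(42)
scores = rng.normal(loc=8, scale=2.5, size=1000)

mean, sd = scores.mean(), scores.std(ddof=1)
within_1sd = np.mean(np.abs(scores - mean) <= 1 * sd)   # expected ~68%
within_2sd = np.mean(np.abs(scores - mean) <= 2 * sd)   # expected ~95%
print(f"within 1 SD: {within_1sd:.1%}, within 2 SD: {within_2sd:.1%}")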
Normality assessment
Summary
• Comparison of measures of central tendency;
empirical rule (mean and standard deviation)
• Skewness and kurtosis
• Graphical methods (e.g. the Q-Q plot)
• Kolmogorov-Smirnov test (Shapiro-Wilk test;
Jarque-Bera test)
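As a rough companion to this list, here is a minimal Python sketch (a simulated sample x stands in for real data) of the numerical checks:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)   # replace with the variable to be checked

print("skewness:", stats.skew(x))                    # close to 0 for normal data
print("kurtosis:", stats.kurtosis(x, fisher=False))  # close to 3 for normal data

# Kolmogorov-Smirnov test against a normal with the sample's own mean and SD
# (strictly, estimating the parameters first calls for the Lilliefors correction).
print(stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))))

print(stats.shapiro(x))       # Shapiro-Wilk test
print(stats.jarque_bera(x))   # Jarque-Bera test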
Skewness
Symmetrical distribution
• Examples: I.Q., SAT
• "No skew", "zero skew": symmetrical
[Figure: symmetrical distribution; frequency vs. value]
Skewness
Asymmetrical distribution
• Example: GPA of MIT students
• "Negative skew", "left skew"
[Figure: left-skewed distribution; frequency vs. value]
Skewness
(Asymmetrical distribution)
• Examples: income, contributions to candidates, populations of countries, "residual vote" rates
• "Positive skew", "right skew"
[Figure: right-skewed distribution; frequency vs. value]
Skewed distributions
Skewness
[Figure: skewed distribution; frequency vs. value]
Kurtosis
• k > 3: leptokurtic
• k = 3: mesokurtic
• k < 3: platykurtic
[Figure: frequency vs. value for leptokurtic, mesokurtic, and platykurtic curves]
A few words about the normal curve
• Skewness = 0
• Kurtosis = 3
• Density: f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^{2}/(2\sigma^{2})}
Q-Q Plot
Kolmogorov-Smirnov Test
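A minimal sketch of how a Q-Q plot could be produced (statsmodels is one common option; the sample x is simulated here):

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=200)   # replace with the variable to be checked

# Points close to the reference line suggest the data are roughly normal.
sm.qqplot(x, line="s")
plt.show()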
How to improve normality?
• Start by identifying outliers: observations with z-scores lower than -3.29 or higher than +3.29 (z-scores with a two-tailed probability of about 0.001).
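A minimal Python sketch of this rule (the sample x is simulated, with one extreme value appended for illustration):

import numpy as np

rng = np.random.default_rng(2)
x = np.append(rng.normal(loc=50, scale=10, size=200), 120.0)   # one artificial outlier

z = (x - x.mean()) / x.std(ddof=1)   # z-scores
outliers = x[np.abs(z) > 3.29]       # |z| > 3.29, two-tailed probability ~0.001
print("flagged values:", outliers)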
Descriptive analysis: sort the variable in ascending order (click "Ascending", then OK) and inspect the lowest and highest values for potential outliers.
Boxplot
[Figure: boxplot annotated with the 75th percentile, the mean (*), the median, the 25th percentile, and outliers. Source: http://pse.cs.vt.edu/SoSci/converted/Dispersion_I/box_n_hist.gif]
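A minimal matplotlib sketch (simulated data) that draws such a boxplot, with the mean marked in addition to the median and quartiles:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = np.append(rng.normal(loc=50, scale=10, size=200), [5.0, 110.0])   # two artificial outliers

# The box spans the 25th-75th percentiles, the line inside is the median,
# showmeans adds a marker for the mean, and points beyond the whiskers are outliers.
plt.boxplot(x, showmeans=True)
plt.show()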
The Multi-collinearity Problem
[Figure: two scatter plots of Score on Exam against Hours Spent Studying]
Figure 7.2: No Correlation Between Hours Spent Studying and Exam Scores
[Figure: scatter plot of Scores on Exam against Hours Spent Studying]
Correlation
• Very high correlation: 0.80 to 1
• Moderately high correlation: 0.60 to 0.79
• Moderate correlation: 0.50 to 0.59
• Moderately weak correlation: 0.30 to 0.49
• Weak to nil correlation: 0 to 0.29
• A linear correlation coefficient indicating a strong positive or negative association, when computed from observational data, does not imply causation between the variables.
• The coefficient of correlation can be used to test for a linear relationship between two variables.
Testing the coefficient of correlation
Hypothesis test of the correlation:
- H0: ρ = 0 (the variables are statistically independent)
- Ha: ρ ≠ 0 (the variables are statistically dependent)
- Test statistic: t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
- The test statistic t follows the Student t distribution with v = n − 2 degrees of freedom, provided that the variables are bivariate normally distributed.
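A minimal Python sketch of this test (hypothetical x and y values): scipy's pearsonr returns r and a p-value, and the t statistic above can also be computed directly:

import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0, 12.0])      # hypothetical data
y = np.array([65.0, 70.0, 68.0, 77.0, 81.0, 86.0, 90.0])

r, p_value = stats.pearsonr(x, y)
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # test statistic with v = n - 2 df
print(f"r = {r:.3f}, t = {t:.3f}, p = {p_value:.4f}")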
Correlation Coefficient (r)

  i      x      y      xy     x²      y²
  1      8     78     624     64    6084
  2      2     92     184      4    8464
  3      5     90     450     25    8100
  4     12     58     696    144    3364
  5     15     43     645    225    1849
  6      9     74     666     81    5476
  7      6     81     486     36    6561
 Sum    57    516    3751    579   39898
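The slide gives the column sums but not the formula; assuming the standard computational form of Pearson's r is intended, the calculation from these sums runs as follows:

r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\bigl(n\sum x^{2} - (\sum x)^{2}\bigr)\bigl(n\sum y^{2} - (\sum y)^{2}\bigr)}}
  = \frac{7(3751) - (57)(516)}{\sqrt{\bigl(7(579) - 57^{2}\bigr)\bigl(7(39898) - 516^{2}\bigr)}}
  = \frac{-3155}{\sqrt{804 \times 13030}} \approx -0.97

a very strong negative correlation for these seven observations.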
Introduction to SPSS
Recode
1. Open the data file.
2. Choose "Transform", then "Recode", and "Into Different Variables".
3. In the "Recode" window, click the variable, then click the arrow button.
SPSS
4. Enter a new "Name" (e.g. "genrecode") and a "Label", then click "Old and New Values".
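The steps above use the SPSS menus; as a side-by-side illustration only, here is a minimal pandas sketch (hypothetical column names, reusing the slide's "genrecode") of the same "recode into a different variable" idea:

import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})   # hypothetical data

# Recode the old values into a new, separate variable; the old variable is unchanged.
df["genrecode"] = df["gender"].map({"male": 1, "female": 2})
print(df)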