
CHAPTER 4

Normality and the multi-collinearity problem

Data Analysis
International School of Business
Study Objectives
1. Define normal distributions.
2. Present the methods used to check normality.
3. Discuss the methods for improving normality.
4. Present the problem of multi-collinearity.
5. Discuss how to avoid the multi-collinearity problem.
Normal distributions
– Most observations cluster around the center
– Fewer observations above and below the central values, in approximately equal proportions
– Most often the Gaussian distribution
Normal distribution
• Many characteristics are distributed throughout the population in a "normal" manner
– Normal curves have well-defined statistical properties
– Parametric statistics, the most commonly used statistics, are based on the assumption that the variables are distributed normally
• This is the famous "bell curve," where many cases fall near the middle of the distribution and few fall very high or very low
– Example: the I.Q. distribution
[Figure: I.Q. distribution. Source: www.wilderdom.com/.../L2-1UnderstandingIQ.html]
Not normal distributions
• More observations concentrated in one part of the range

Asymmetrical distribution
[Figure: an asymmetrical frequency distribution]
How would you describe/present your respondents if the data are numeric?

Two groups of measures:
1. Central tendency (central value, average)
2. Variance
Is a measure of central tendency enough to describe respondents?
What measures are to be used for sample description?

If the distribution is NORMAL:
– Mean
– Variance (or standard deviation)

If the distribution is NOT NORMAL:
– Median
– IQR (interquartile range) or min/max

These measures are also used with numeric ordinal data.
EMPIRICAL RULE

Percentage of observations falling within 1, 2, and 2.5 standard deviations of the mean, if the distribution is normal: approximately 68%, 95%, and 99%.
Example

Mean = 8, SD = 2.5
[Figure: normal curve centered on the mean, with -2SD and +2SD marked]
By the empirical rule, about 95% of observations fall between 8 - 2(2.5) = 3 and 8 + 2(2.5) = 13.
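A quick way to reproduce these intervals, sketched in Python (scipy assumed available; the mean and SD are the slide's values):

from scipy.stats import norm

mean, sd = 8, 2.5
for k in (1, 2, 2.5):
    lo, hi = mean - k * sd, mean + k * sd
    coverage = norm.cdf(k) - norm.cdf(-k)  # share of a normal curve within +/- k SD
    print(f"within +/-{k} SD: [{lo}, {hi}]  ~ {coverage:.1%} of observations")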
Normality assessment: Summary
• Comparison of measures of central tendency; empirical rule (mean and standard deviation)
• Skewness and kurtosis
• Graphical methods (histogram, Q-Q plot)
• Kolmogorov-Smirnov test (Shapiro-Wilk test; Jarque-Bera test)

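Where the deck later uses SPSS, here is a minimal Python sketch of the same checklist; the sample x is hypothetical and the functions are from scipy.stats:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)  # hypothetical data; replace with your variable

print(np.mean(x), np.median(x))         # close together if the distribution is symmetric
print(stats.skew(x))                    # near 0 for a normal distribution
print(stats.kurtosis(x, fisher=False))  # near 3 for a normal distribution
print(stats.shapiro(x))                 # Shapiro-Wilk: p > 0.05 gives no evidence against normality
print(stats.jarque_bera(x))             # Jarque-Bera test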
Skewness: symmetrical distribution
• Examples: IQ, SAT
• "No skew," "zero skew," symmetrical
[Figure: symmetrical frequency distribution (Frequency vs. Value)]
Skewness: asymmetrical distribution
• Example: GPA of MIT students
• "Negative skew," "left skew"
[Figure: left-skewed frequency distribution (Frequency vs. Value)]
Skewness: asymmetrical distribution
• Examples: income, contributions to candidates, populations of countries, "residual vote" rates
• "Positive skew," "right skew"
[Figure: right-skewed frequency distribution (Frequency vs. Value)]
Skewed distributions
[Figure: examples of skewed distributions]

Skewness
[Figure: frequency distribution illustrating skewness (Frequency vs. Value)]
Kurtosis
• k > 3: leptokurtic (sharper peak, heavier tails)
• k = 3: mesokurtic (the normal curve)
• k < 3: platykurtic (flatter peak, lighter tails)
[Figure: leptokurtic, mesokurtic, and platykurtic curves (Frequency vs. Value)]
A few words about the normal curve
• Skewness = 0
• Kurtosis = 3

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}
Q-Q Plot
Kolmogorov-Smirnov Test
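A minimal sketch of both tools in Python (the sample x is hypothetical; note that applying K-S with parameters estimated from the sample is only approximate):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)                  # hypothetical data

stats.probplot(x, dist="norm", plot=plt)  # Q-Q plot: points close to the line suggest normality
plt.show()

z = (x - x.mean()) / x.std(ddof=1)        # standardize before comparing to the standard normal
print(stats.kstest(z, "norm"))            # small p-value -> reject normality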
How to improve normality?
• Check abnormal values (outliers)
• Monotonic transformation (e.g., log; see the sketch after this list)
• Adjustment of variables through ratios
• Testing non-linear models
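For instance, a log transformation (one monotonic option; the data here are hypothetical) can pull in a long right tail:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.8, size=500)  # hypothetical right-skewed variable

print(stats.skew(income))          # strongly positive before the transformation
print(stats.skew(np.log(income)))  # much closer to 0 after taking logs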
Outliers
• Outliers are non-typical data points. They represent a threat to the analysis.

Detection
• Convert the variable into a Z-score (standardize it).
• This conversion applies only to quantitative and ordinal variables.
• Outliers have Z-scores lower than -3.29 or higher than +3.29 (z-scores with a probability of 0.001).
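The same rule outside SPSS, as a pandas sketch (the variable name "assets" and the planted value are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"assets": rng.normal(2.0, 0.5, size=200)})
df.loc[0, "assets"] = 7.0  # plant one aberrant value

df["zassets"] = (df["assets"] - df["assets"].mean()) / df["assets"].std(ddof=1)
print(df[df["zassets"].abs() > 3.29])  # the +/-3.29 cutoff from the slide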
Descriptive analysis
• To compute the z-score in SPSS, select: 1. Descriptive Statistics, 2. Descriptives.

Step 3
• Select the variable and move it to the right box.
• Check the option "Save standardized values as variables," then click OK.
• This option keeps the variable's name and adds a z prefix: "assets" becomes "zassets."
Example (given by the SPSS manual)
• We obtain the new variable; cases are outliers if lower than -3.29 or higher than +3.29.
• Click Ascending to sort, then inspect the high values.
• Some outliers are detected.
Checking
• Check the case against the other variables and compare with means, ranges, etc.

Comparison with mean and SD
• Comparing 7 to the mean of 1.76, and taking the standard deviation (1.532) into account, we identify the aberrant value.
Automatic identification: Boxplot
Source: http://pse.cs.vt.edu/SoSci/converted/Dispersion_I/box_n_hist.gif
[Figure: boxplot annotated with the 75th percentile, mean (*), median, 25th percentile, and outliers]
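A sketch of the boxplot approach in Python; matplotlib flags points beyond 1.5 x IQR from the quartiles, and the data here are hypothetical:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
assets = np.append(rng.normal(2.0, 0.5, size=200), 7.0)  # hypothetical data with one outlier

q1, q3 = np.percentile(assets, [25, 75])
iqr = q3 - q1
print(assets[(assets < q1 - 1.5 * iqr) | (assets > q3 + 1.5 * iqr)])  # flagged values

plt.boxplot(assets, showmeans=True)  # outliers are drawn as individual points
plt.show()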
The Multi-collinearity Problem
• Nominal variables and the creation of dummy variables
• High correlation coefficients
Correlation describes strength of association
• Falls between -1 and +1, with the sign indicating the direction of association
• The larger the correlation in absolute value, the stronger the association (in terms of a straight-line trend)

Examples (positive or negative? how strong?):
• Mental impairment and life events: correlation = 0.37
• GDP and fertility: correlation = -0.56
• GDP and percent using the Internet: correlation = 0.89
Illustrations: Scatterplots
• Scattergrams (graphs) give you an idea of the relationship.
[Figure: two scatterplots of Score on Exam vs. Hours Spent Studying, one showing negative correlation and one showing positive correlation]

[Figure 7.2: No correlation between Hours Spent Studying and Exam Scores (scatterplot of Scores on Exam vs. Hours Spent Studying)]
Correlation
• Very high correlation: 0.80 to 1.00
• Moderately high correlation: 0.60 to 0.79
• Moderate correlation: 0.50 to 0.59
• Moderately weak correlation: 0.30 to 0.49
• Weak to nil correlation: 0.00 to 0.29

Scale: -1.00 = perfect negative correlation; 0.00 = no relationship; +1.00 = perfect positive correlation.
Correlation: metric × metric
A scatter diagram shows the relationship between two quantitative variables measured on the same observations. Each observation in the data set is represented by a point in the scatter diagram. The predictor variable is plotted on the horizontal axis and the response variable on the vertical axis. Do not connect the points when drawing a scatter diagram.
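A minimal matplotlib sketch following these rules (the data are hypothetical; the axis names match the earlier figure):

import matplotlib.pyplot as plt

hours = [1, 2, 4, 5, 7, 8, 10]        # predictor on the horizontal axis
score = [35, 45, 55, 60, 75, 80, 95]  # response on the vertical axis

plt.scatter(hours, score)             # points only; do not connect them
plt.xlabel("Hours Spent Studying")
plt.ylabel("Score on Exam")
plt.show()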
Example: Drawing a scatter plot
The following data are based on a study of drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So depth at which drilling begins is the predictor variable, x, and time (in minutes) to drill five feet is the response variable, y. Draw a scatter diagram of the data.
Source: Penner, R., and Watts, D.G. "Mining Information." The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.
Positively associated variables
Two variables that are linearly related are said to be positively associated when above-average values of one variable are associated with above-average values of the other variable. That is, two variables are positively associated when, as the values of the predictor variable increase, the values of the response variable also increase.
Negatively associated variables
Two variables that are linearly related are said to be negatively associated when above-average values of one variable are associated with below-average values of the other variable. That is, two variables are negatively associated when, as the values of the predictor variable increase, the values of the response variable decrease.
The linear correlation coefficient, or Pearson product moment correlation coefficient, is a measure of the strength of the linear relation between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient. We shall only present the formula for the sample correlation coefficient.
Properties of the linear correlation coefficient
1. The linear correlation coefficient is always between -1 and 1, inclusive. That is, -1 ≤ r ≤ 1.
2. If r = +1, there is a perfect positive linear relation between the two variables.
3. If r = -1, there is a perfect negative linear relation between the two variables.
4. The closer r is to +1, the stronger the evidence of positive association between the two variables.
Properties of the linear correlation coefficient (continued)
5. The closer r is to -1, the stronger the evidence of negative association between the two variables.
6. If r is close to 0, there is evidence of no linear relation between the two variables. Because the linear correlation coefficient measures the strength of linear relation, r close to 0 does not imply no relation, just no linear relation.
7. It is a unitless measure of association, so the unit of measure for x and y plays no role in the interpretation of r (see the sketch after this list).
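Property 7 is easy to demonstrate: rescaling the units of x (here, a hypothetical feet-to-meters conversion on made-up values) leaves r unchanged:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical depths in feet
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical drilling times

print(np.corrcoef(x, y)[0, 1])           # r with x in feet
print(np.corrcoef(x * 0.3048, y)[0, 1])  # identical r after converting x to meters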
Example
For the following data [data table not shown]:
(a) Draw a scatter diagram and comment on the type of relation that appears to exist between x and y.
(b) By hand, compute the linear correlation coefficient.
Example
Determine the linear correlation coefficient of the drilling data.

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
• A linear correlation coefficient that implies a strong positive or negative association, when computed from observational data, does not imply causation among the variables.
• The coefficient of correlation can be used to test for a linear relationship between two variables.
Testing the coefficient of correlation
Hypothesis test of the correlation:
- H0: ρ = 0 (the variables are statistically independent)
- Ha: ρ ≠ 0 (the variables are statistically dependent)
- Test statistic (see the sketch after this list): t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}
- The test statistic t follows the Student t distribution with v = n - 2 degrees of freedom, provided that the variables are bivariate normally distributed.
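A sketch of this test in Python: scipy.stats.pearsonr returns r together with the two-sided p-value, and the t statistic can be checked by hand (the x, y values are taken from the table on the next slide):

import numpy as np
from scipy import stats

x = np.array([8, 2, 5, 12, 15, 9, 6])
y = np.array([78, 92, 90, 58, 43, 74, 81])

r, p = stats.pearsonr(x, y)
t = r * np.sqrt(len(x) - 2) / np.sqrt(1 - r**2)  # Student t with n - 2 = 5 df
print(r, t, p)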
Correlation Coefficient (r)

 i     x      y      xy     x²      y²
 1     8     78     624     64    6084
 2     2     92     184      4    8464
 3     5     90     450     25    8100
 4    12     58     696    144    3364
 5    15     43     645    225    1849
 6     9     74     666     81    5476
 7     6     81     486     36    6561
 Σ    57    516    3751    579   39898
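Using the column totals above in the computational formula r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)], a short sketch:

import math

n = 7
sx, sy, sxy, sxx, syy = 57, 516, 3751, 579, 39898  # totals from the table

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 3))  # about -0.975: a very high negative correlation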
Introduction to SPSS: Recode
1. Open the data file.
2. Select "Transform," then "Recode," then "Into different variables."
3. In the "Recode" window, click the variable, then the arrow.
SPSS
4. Enter a new "Name" ("genrecode") and a "Label," then click "Old and New Values."
5. Enter the old and new values as follows:
1 → 5
2 → 4
3 → 3
4 → 2
5 → 1
Click "Continue."
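The same reverse-coding outside SPSS, as a pandas sketch (the names "gen" and "genrecode" follow the slide; the responses are hypothetical):

import pandas as pd

df = pd.DataFrame({"gen": [1, 3, 5, 2, 4]})  # hypothetical 5-point responses
df["genrecode"] = df["gen"].map({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
print(df)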
