
CHAPTER 4

Normality and the multi-collinearity problem

Data Analysis
International School of Business
Study Objectives
1. Define normal distributions.
2. Present the methods used to check normality.
3. Discuss the methods for improving normality.
4. Present the problem of multi-collinearity.
5. Discuss how to avoid the multi-collinearity problem.
Normal distributions
– Most observations cluster around the center
– Fewer observations above and below the central values, in approximately equal proportions
– Most often the Gaussian distribution
Normal distribution
• Many characteristics are distributed throughout the population in a "normal" manner
– Normal curves have well-defined statistical properties
– Parametric statistics, the most commonly used statistics, are based on the assumption that the variables are distributed normally
• This is the famous "bell curve," where many cases fall near the middle of the distribution and few fall very high or very low
– Example: the I.Q. distribution
[Figure: I.Q. distribution. Source: www.wilderdom.com/.../L2-1UnderstandingIQ.html]
Not normal distributions
• More observations concentrated in one part of the range

Asymmetrical distribution
[Figure: an asymmetrical frequency distribution]
How would you describe/present your respondents if the data are numeric?

Two groups of measures:
1. Central tendency (central value, average)
2. Variance
Is a measure of central tendency enough to describe respondents?
What measures are to be used for sample description?

If the distribution is NORMAL:
– Mean
– Variance (or standard deviation)

If the distribution is NOT NORMAL:
– Median
– IQR (interquartile range) or min/max

These measures are also used with numeric ordinal data.
EMPIRICAL RULE

Percentage of observations falling within 1, 2, and 2.5 standard deviations of the mean, if the distribution is normal: approximately 68%, 95%, and 99%.
Example

Mean = 8, SD = 2.5
[Figure: normal curve centered on the mean, with -2SD and +2SD marked]
By the empirical rule, about 95% of observations fall between 8 - 2(2.5) = 3 and 8 + 2(2.5) = 13.
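A quick way to reproduce these intervals, sketched in Python (scipy assumed available; the mean and SD are the slide's values):

from scipy.stats import norm

mean, sd = 8, 2.5
for k in (1, 2, 2.5):
    lo, hi = mean - k * sd, mean + k * sd
    coverage = norm.cdf(k) - norm.cdf(-k)  # share of a normal curve within +/- k SD
    print(f"within +/-{k} SD: [{lo}, {hi}]  ~ {coverage:.1%} of observations")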
Normality assessment: Summary
• Comparison of measures of central tendency; empirical rule (mean and standard deviation)
• Skewness and kurtosis
• Graphical methods (histogram, Q-Q plot)
• Kolmogorov-Smirnov test (Shapiro-Wilk test; Jarque-Bera test)

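Where the deck later uses SPSS, here is a minimal Python sketch of the same checklist; the sample x is hypothetical and the functions are from scipy.stats:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)  # hypothetical data; replace with your variable

print(np.mean(x), np.median(x))         # close together if the distribution is symmetric
print(stats.skew(x))                    # near 0 for a normal distribution
print(stats.kurtosis(x, fisher=False))  # near 3 for a normal distribution
print(stats.shapiro(x))                 # Shapiro-Wilk: p > 0.05 gives no evidence against normality
print(stats.jarque_bera(x))             # Jarque-Bera test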
Skewness: symmetrical distribution
• Examples: IQ, SAT
• "No skew," "zero skew," symmetrical
[Figure: symmetrical frequency distribution (Frequency vs. Value)]
Skewness: asymmetrical distribution
• Example: GPA of MIT students
• "Negative skew," "left skew"
[Figure: left-skewed frequency distribution (Frequency vs. Value)]
Skewness: asymmetrical distribution
• Examples: income, contributions to candidates, populations of countries, "residual vote" rates
• "Positive skew," "right skew"
[Figure: right-skewed frequency distribution (Frequency vs. Value)]
Skewed distributions
[Figure: examples of skewed distributions]

Skewness
[Figure: frequency distribution illustrating skewness (Frequency vs. Value)]
Kurtosis
• k > 3: leptokurtic (sharper peak, heavier tails)
• k = 3: mesokurtic (the normal curve)
• k < 3: platykurtic (flatter peak, lighter tails)
[Figure: leptokurtic, mesokurtic, and platykurtic curves (Frequency vs. Value)]
A few words about the normal curve
• Skewness = 0
• Kurtosis = 3

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}
Q-Q Plot
Kolmogorov-Smirnov Test
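A minimal sketch of both tools in Python (the sample x is hypothetical; note that applying K-S with parameters estimated from the sample is only approximate):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)                  # hypothetical data

stats.probplot(x, dist="norm", plot=plt)  # Q-Q plot: points close to the line suggest normality
plt.show()

z = (x - x.mean()) / x.std(ddof=1)        # standardize before comparing to the standard normal
print(stats.kstest(z, "norm"))            # small p-value -> reject normality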
How to improve normality?
• Check abnormal values (outliers)
• Monotonic transformation (e.g., log; see the sketch after this list)
• Adjustment of variables through ratios
• Testing non-linear models
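For instance, a log transformation (one monotonic option; the data here are hypothetical) can pull in a long right tail:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.8, size=500)  # hypothetical right-skewed variable

print(stats.skew(income))          # strongly positive before the transformation
print(stats.skew(np.log(income)))  # much closer to 0 after taking logs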
Outliers
• Outliers are non-typical data points. They represent a threat to the analysis.

Detection
• Convert the variable into a Z-score (standardize it).
• This conversion applies only to quantitative and ordinal variables.
• Outliers have Z-scores lower than -3.29 or higher than +3.29 (z-scores with a probability of 0.001).
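The same rule outside SPSS, as a pandas sketch (the variable name "assets" and the planted value are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"assets": rng.normal(2.0, 0.5, size=200)})
df.loc[0, "assets"] = 7.0  # plant one aberrant value

df["zassets"] = (df["assets"] - df["assets"].mean()) / df["assets"].std(ddof=1)
print(df[df["zassets"].abs() > 3.29])  # the +/-3.29 cutoff from the slide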
Descriptive analysis
• To compute the z-score in SPSS, select: 1. Descriptive Statistics, 2. Descriptives.

Step 3
• Select the variable and move it to the right box.
• Check the option "Save standardized values as variables," then click OK.
• This option keeps the variable's name and adds a z prefix: "assets" becomes "zassets."
Example (given by the SPSS manual)
• We obtain the new variable; cases are outliers if lower than -3.29 or higher than +3.29.
• Click Ascending to sort, then inspect the high values.
• Some outliers are detected.
Checking
• Check the case against the other variables and compare with means, ranges, etc.

Comparison with mean and SD
• Comparing 7 to the mean of 1.76, and taking the standard deviation (1.532) into account, we identify the aberrant value.
Automatic identification: Boxplot
Source: http://pse.cs.vt.edu/SoSci/converted/Dispersion_I/box_n_hist.gif
[Figure: boxplot annotated with the 75th percentile, mean (*), median, 25th percentile, and outliers]
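A sketch of the boxplot approach in Python; matplotlib flags points beyond 1.5 x IQR from the quartiles, and the data here are hypothetical:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
assets = np.append(rng.normal(2.0, 0.5, size=200), 7.0)  # hypothetical data with one outlier

q1, q3 = np.percentile(assets, [25, 75])
iqr = q3 - q1
print(assets[(assets < q1 - 1.5 * iqr) | (assets > q3 + 1.5 * iqr)])  # flagged values

plt.boxplot(assets, showmeans=True)  # outliers are drawn as individual points
plt.show()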
The Multi-collinearity Problem
• Nominal variables and the creation of dummy variables
• High correlation coefficients
Correlation describes strength of association
• Falls between -1 and +1, with the sign indicating the direction of association
• The larger the correlation in absolute value, the stronger the association (in terms of a straight-line trend)

Examples (positive or negative? how strong?):
• Mental impairment and life events: correlation = 0.37
• GDP and fertility: correlation = -0.56
• GDP and percent using the Internet: correlation = 0.89
Illustrations: Scatterplots
• Scattergrams (graphs) give you an idea of the relationship.
[Figure: two scatterplots of Score on Exam vs. Hours Spent Studying, one showing negative correlation and one showing positive correlation]

[Figure 7.2: No correlation between Hours Spent Studying and Exam Scores (scatterplot of Scores on Exam vs. Hours Spent Studying)]
Correlation
• Very high correlation: 0.80 to 1.00
• Moderately high correlation: 0.60 to 0.79
• Moderate correlation: 0.50 to 0.59
• Moderately weak correlation: 0.30 to 0.49
• Weak to nil correlation: 0.00 to 0.29

Scale: -1.00 = perfect negative correlation; 0.00 = no relationship; +1.00 = perfect positive correlation.
Correlation: metric × metric
A scatter diagram shows the relationship between two quantitative variables measured on the same observations. Each observation in the data set is represented by a point in the scatter diagram. The predictor variable is plotted on the horizontal axis and the response variable on the vertical axis. Do not connect the points when drawing a scatter diagram.
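A minimal matplotlib sketch following these rules (the data are hypothetical; the axis names match the earlier figure):

import matplotlib.pyplot as plt

hours = [1, 2, 4, 5, 7, 8, 10]        # predictor on the horizontal axis
score = [35, 45, 55, 60, 75, 80, 95]  # response on the vertical axis

plt.scatter(hours, score)             # points only; do not connect them
plt.xlabel("Hours Spent Studying")
plt.ylabel("Score on Exam")
plt.show()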
Example: Drawing a scatter plot
The following data are based on a study of drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So depth at which drilling begins is the predictor variable, x, and time (in minutes) to drill five feet is the response variable, y. Draw a scatter diagram of the data.
Source: Penner, R., and Watts, D.G. "Mining Information." The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.
Positively associated variables
Two variables that are linearly related are said to be positively associated when above-average values of one variable are associated with above-average values of the other variable. That is, two variables are positively associated when, as the values of the predictor variable increase, the values of the response variable also increase.
Negatively associated variables
Two variables that are linearly related are said to be negatively associated when above-average values of one variable are associated with below-average values of the other variable. That is, two variables are negatively associated when, as the values of the predictor variable increase, the values of the response variable decrease.
The linear correlation coefficient, or Pearson product moment correlation coefficient, is a measure of the strength of the linear relation between two quantitative variables. We use the Greek letter ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient. We shall only present the formula for the sample correlation coefficient.
Properties of the linear correlation coefficient
1. The linear correlation coefficient is always between -1 and 1, inclusive. That is, -1 ≤ r ≤ 1.
2. If r = +1, there is a perfect positive linear relation between the two variables.
3. If r = -1, there is a perfect negative linear relation between the two variables.
4. The closer r is to +1, the stronger the evidence of positive association between the two variables.
Properties of the linear correlation coefficient (continued)
5. The closer r is to -1, the stronger the evidence of negative association between the two variables.
6. If r is close to 0, there is evidence of no linear relation between the two variables. Because the linear correlation coefficient measures the strength of linear relation, r close to 0 does not imply no relation, just no linear relation.
7. It is a unitless measure of association, so the unit of measure for x and y plays no role in the interpretation of r (see the sketch after this list).
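Property 7 is easy to demonstrate: rescaling the units of x (here, a hypothetical feet-to-meters conversion on made-up values) leaves r unchanged:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical depths in feet
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical drilling times

print(np.corrcoef(x, y)[0, 1])           # r with x in feet
print(np.corrcoef(x * 0.3048, y)[0, 1])  # identical r after converting x to meters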
Example
For the following data [data table not shown]:
(a) Draw a scatter diagram and comment on the type of relation that appears to exist between x and y.
(b) By hand, compute the linear correlation coefficient.
Example
Determine the linear correlation coefficient of the drilling data.

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
• A linear correlation coefficient that implies a strong positive or negative association, when computed from observational data, does not imply causation among the variables.
• The coefficient of correlation can be used to test for a linear relationship between two variables.
Testing the coefficient of correlation
Hypothesis test of the correlation:
- H0: ρ = 0 (the variables are statistically independent)
- Ha: ρ ≠ 0 (the variables are statistically dependent)
- Test statistic (see the sketch after this list): t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}
- The test statistic t follows the Student t distribution with v = n - 2 degrees of freedom, provided that the variables are bivariate normally distributed.
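A sketch of this test in Python: scipy.stats.pearsonr returns r together with the two-sided p-value, and the t statistic can be checked by hand (the x, y values are taken from the table on the next slide):

import numpy as np
from scipy import stats

x = np.array([8, 2, 5, 12, 15, 9, 6])
y = np.array([78, 92, 90, 58, 43, 74, 81])

r, p = stats.pearsonr(x, y)
t = r * np.sqrt(len(x) - 2) / np.sqrt(1 - r**2)  # Student t with n - 2 = 5 df
print(r, t, p)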
Correlation Coefficient (r)

 i     x      y      xy     x²      y²
 1     8     78     624     64    6084
 2     2     92     184      4    8464
 3     5     90     450     25    8100
 4    12     58     696    144    3364
 5    15     43     645    225    1849
 6     9     74     666     81    5476
 7     6     81     486     36    6561
 Σ    57    516    3751    579   39898
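Using the column totals above in the computational formula r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)], a short sketch:

import math

n = 7
sx, sy, sxy, sxx, syy = 57, 516, 3751, 579, 39898  # totals from the table

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 3))  # about -0.975: a very high negative correlation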
Introduction to SPSS: Recode
1. Open the data file.
2. Select "Transform," then "Recode," then "Into different variables."
3. In the "Recode" window, click the variable, then the arrow.
SPSS
4. Enter a new "Name" ("genrecode") and a "Label," then click "Old and New Values."
5. Enter the old and new values as follows:
1 → 5
2 → 4
3 → 3
4 → 2
5 → 1
Click "Continue."
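The same reverse-coding outside SPSS, as a pandas sketch (the names "gen" and "genrecode" follow the slide; the responses are hypothetical):

import pandas as pd

df = pd.DataFrame({"gen": [1, 3, 5, 2, 4]})  # hypothetical 5-point responses
df["genrecode"] = df["gen"].map({1: 5, 2: 4, 3: 3, 4: 2, 5: 1})
print(df)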
