
Lecture 9:

Assessing the Assumption of Normality

Sources of Information:

SPSS Help Topics: P-P Plots, Q-Q Plots

Sokal & Rohlf Chapter 6 (sections 6.6 and 6.7)

Dytham Chapter 5 (p. 37-41); Chapter 7 (p. 61-72)


Recall: The Normal Distribution
(= Gaussian distribution) is the single most important distribution
in statistics; it is used as the basis for parametric statistics.

A continuous random variable has a Normal distribution if its
distribution is symmetric and unimodal, and fits the formula:

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))

You don't need to memorize or use the formula!


• What it shows is that any particular Normal distribution is
determined by 2 parameters: the mean (μ) and the standard
deviation (σ).
An infinite number of Normal curves can be drawn by altering the
mean and standard deviation!

(Sokal & Rohlf, Fig. 6.2)


Why Assess if Data are Normally Distributed?
For 2 main reasons, we often need to know whether a variable
measured in a sample comes from a population with an
underlying Normal distribution:

1. To assess if we can apply a Parametric statistical test.

• Because the test calculates expected frequencies (the P-value)
for a Normal curve with the same mean and SD as in our sample.

2. In order to confirm or reject a certain underlying hypothesis
about the nature of the factors affecting the phenomenon
studied.

• e.g., skewness, bimodality, etc. tell a lot about the controlling factors.
Overview of Methods to Assess Normality
Graphical Methods
• Histogram (density plot) → not very reliable!
• Normal quantile plot (Q-Q plot)
• Normal probability plot (P-P plot)

Formal Tests
• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Tests of Skewness & Kurtosis
• G-test
• Chi-square test
[G-test and Chi-square test are in Sokal & Rohlf Section 17.2]
Frequency Histograms

Frequency histograms can be extremely useful for
displaying the characteristics of a dataset.

They are easily produced in most statistical programs.

BUT, they are a poor tool to objectively assess Normality.

The problem is that the shape of a histogram is usually a
function of the number and width of the bars,
particularly in small samples.
Example:

Summary of Inter-orbital width in pigeons

Count 40
Mean 11.48
Median 11.6
MidRange 11.75
StdDev 0.69
Min 10.2
Max 13.3
Range 3.1
• These data are approximately Normally distributed. But our visual
impression depends on the number and width of the bars (the sketch
below illustrates this).
• So, in general, histograms should not be used to examine the
hypothesis of Normality for a dataset.
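The bin-width problem is easy to demonstrate. Below is a minimal Python sketch (an illustration, not part of the original example): it draws a hypothetical sample with roughly the same mean and SD as the pigeon data, then histograms the same values with two different bin counts.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample mimicking the pigeon example (n = 40, mean ~11.48, SD ~0.69)
rng = np.random.default_rng(42)
x = rng.normal(loc=11.48, scale=0.69, size=40)

# Same data, two bin counts: the apparent shape changes markedly
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, bins in zip(axes, (5, 15)):
    ax.hist(x, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()
```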
Quantile Plots and Normal Probability Plots
A quantile plot provides an excellent and reliable alternative to
histograms.

A 1-sample quantile plot compares a variable to its own quantiles.

A quantile = the value at which a given fraction of the data points
are less than or equal to it (the quantile).

e.g., the 0.25 quantile contains the smallest 25% of the data points
(= first quartile, or Q1 in a boxplot);

the 0.5 quantile contains the smallest 50% of the data points
(= median), etc.

If the data are Normally distributed, a 1-sample quantile plot
should form an S-shaped curve, called a sigmoid.
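As a concrete illustration of the quantile definition, here is a small Python sketch with made-up values:

```python
import numpy as np

# Nine made-up observations, sorted for clarity
x = np.array([10.2, 10.8, 11.1, 11.4, 11.6, 11.9, 12.3, 12.8, 13.3])

q1, median = np.quantile(x, [0.25, 0.5])
print(q1)      # 0.25 quantile: 25% of the data points lie at or below this value
print(median)  # 0.5 quantile = the median
```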
(Figs: Sokal & Rohlf)

• Left Fig. shows the cumulative frequency of a Normal distribution.

• Right Fig. shows the quantiles expressed in standard deviation units
from the mean. These are called Normal equivalent deviates (NEDs).
They are used in a type of Normal probability plot called a Normal
quantile plot (= Q-Q plot in SPSS).
Normal Probability Plots
• Provide a simple way to tell whether the values of a variable in a
sample are distributed approximately Normally.
• If the data points of a variable plotted versus the expected NEDs
(= 'Expected Normal' in SPSS graphs) fall (nearly) on a straight line,
then the distribution of the variable is nearly Normal.
• Normal quantile plots work best for fairly large samples (n > 50).

[Fig: Sokal & Rohlf Fig. 6.6 — observed value vs. expected NEDs.
You can compare these Normal quantile plots with the Q-Q plots
from SPSS (P-P plots in SPSS are the inverse).]
An Example of Skewed Data

Q-Q Probability Plots: [text from SPSS help]
Plots the observed values of a variable's distribution against the
expected quantiles of any of a number of test distributions.
Probability plots are generally used to determine whether the
distribution of a variable matches a given distribution. If the
selected variable matches the test distribution, the points cluster
around a straight line (through Q1 and Q3).

P-P Probability Plots: [text from SPSS help]
Plots a variable's observed cumulative proportions against the
expected cumulative proportions of any of a number of test
distributions.
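For readers working outside SPSS, here is a hedged Python sketch of both plot types (scipy and matplotlib assumed; the sample is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder sample; substitute your own variable
x = np.sort(stats.norm.rvs(loc=30, scale=23, size=41, random_state=0))
n = len(x)

# Q-Q plot: observed values against expected Normal quantiles (NEDs)
stats.probplot(x, dist="norm", plot=plt)
plt.show()

# P-P plot: observed vs. expected cumulative proportions
observed = (np.arange(1, n + 1) - 0.5) / n             # empirical cumulative proportions
expected = stats.norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
plt.plot(expected, observed, "o")
plt.plot([0, 1], [0, 1])                               # points near this line suggest Normality
plt.xlabel("Expected cum prop"); plt.ylabel("Observed cum prop")
plt.show()
```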
SPSS (normal probability plots):

Perhaps the best way, but you only get Q-Q plots (and with
tests of normality):
Analyze → Descriptive Statistics → Explore
• In the 'Plots' options, select 'Normality plots with tests'

Or, Analyze → Descriptive Statistics → Q-Q plots or P-P plots

[Note: this is the only way to test distributions other than Normal]
Notes on Normal Probability Plots

• If you find yourself wondering whether the data in a Normal
probability plot exhibit evidence of non-Normality, then you
probably don't have a sufficiently severe violation to worry about.

• If the violation of the Normality assumption is enough to be
worrisome, it will be readily apparent in the Normal probability
plot.

• When testing for differences between means we are only
interested in severe violations of the Normality assumption. Thus,
mild departures from Normality are of little concern.
Small Samples
• Normal probability plots work best for fairly large samples (n > 50).
• Assessing the Normality assumption in small samples is
problematic. In smaller samples, a difference of one item per class
(= bar in a histogram) would make a substantial difference in the
cumulative percentage in the tails of a distribution.
• For small samples (n < 50), the method of Rankits is preferable.
• With this method, instead of quantiles, we use the ranks of each
observation in the sample, and instead of NEDs we plot values
from a table of rankits = the average positions, in SD units, of the
ranked items in a Normally distributed sample of n items (see the
sketch after this list).
• Available in SPSS within Analyze → Descriptive Statistics → Q-Q
and P-P graphs: under the options for 'Proportion estimation
formula', change from the default ('Blom's') to 'Rankit'.
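As a sketch of the rankit idea outside SPSS (Python, scipy assumed; the sample values are hypothetical), the expected position of the i-th ranked item can be approximated with the Normal inverse CDF rather than looked up in a table:

```python
import numpy as np
from scipy import stats

# Small hypothetical sample (n < 50), sorted
x = np.sort(np.array([10.4, 10.9, 11.2, 11.5, 11.8, 12.6]))
n = len(x)
i = np.arange(1, n + 1)

rankits = stats.norm.ppf((i - 0.5) / n)             # 'Rankit' proportion estimate
bloms   = stats.norm.ppf((i - 0.375) / (n + 0.25))  # 'Blom's' (the SPSS default)

# Plotting x against rankits: a roughly straight line suggests Normality
for xi, ri in zip(x, rankits):
    print(f"{xi:5.1f}  {ri:+.3f}")
```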
Note: You can test whether your data fit distributions
other than the Normal distribution!
Formal Tests of Normality
In SPSS: Analyze → Descriptive Statistics → Explore
• In the 'Plots' options, select 'Normality plots with tests'

You get 2 statistical tests of normality in the output, which test:

Ho: the variable has a Normal distribution in the statistical population.
H1: the variable does not have a Normal distribution in the statistical
population.

1. Kolmogorov-Smirnov (Lilliefors): a modification of the Kolmogorov-
Smirnov test that tests for normality when the mean and variance are not
known but must be estimated from the data. The Kolmogorov-Smirnov test
is based on the largest absolute difference between the observed and the
expected cumulative distributions. (Reliable even for samples with n < 50.)

2. Shapiro-Wilk Test: tests the hypothesis that the sample is from a Normal
population. The Shapiro-Wilk statistic is calculated for samples with 50 or
fewer observations.
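Both tests are also available outside SPSS; a minimal Python sketch, assuming scipy and statsmodels (the sample is a placeholder):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

# Placeholder sample; substitute your own variable
x = np.random.default_rng(0).normal(loc=11.48, scale=0.69, size=40)

# Shapiro-Wilk: Ho that the sample comes from a Normal population
w_stat, p_sw = stats.shapiro(x)

# Kolmogorov-Smirnov with the Lilliefors correction
# (mean and variance estimated from the data, as in SPSS)
d_stat, p_ks = lilliefors(x, dist="norm")

print(f"Shapiro-Wilk:      W = {w_stat:.3f}, P = {p_sw:.3f}")
print(f"K-S (Lilliefors):  D = {d_stat:.3f}, P = {p_ks:.3f}")
```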
Formal Tests – Skewness and Kurtosis
We learned before that distributions can deviate from Normality due
to Skewness and Kurtosis. Thus, statistics that measure these
departures can be useful.

1. Skewness (= asymmetry): one tail of the curve is drawn out more
than the other. Skew can be to the right or to the left.

2. Kurtosis describes the proportions of observations found in the
centre and in the tails in relation to those found in the shoulders.
• A leptokurtic curve has more items in the centre and at the tails,
with fewer items in the shoulders, relative to a Normal distribution
with the same mean and variance.
• A platykurtic curve has fewer items at the centre and tails, but has
more in the shoulders. A bimodal distribution is an extreme
platykurtic distribution.

Sample statistics for measuring skewness and kurtosis: g1 and g2
(population parameters: γ1 and γ2).

Their computation is tedious and should be done with a computer.

In SPSS, you get these values together with the Descriptive Statistics.

If they are not included with the defaults, you must select them:

Choose: Analyze → Descriptive Statistics → Explore; values of
"Skewness" and "Kurtosis" are provided automatically.

Or: Analyze → Descriptive Statistics → Descriptives → Options, then
select "Skewness" and "Kurtosis".

• In a population with a Normal distribution, both γ1 and γ2 = 0.
• A -ve g1 indicates skew to the left; a +ve g1, skew to the right.
• A -ve g2 indicates platykurtosis; a +ve g2, leptokurtosis.
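For reference, a hedged Python sketch of computing g1 and g2 outside SPSS (scipy assumed; with bias=False the values approximately match the adjusted statistics SPSS reports):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample
x = np.array([8, 11, 14, 18, 22, 26, 30, 35, 48, 110], dtype=float)

g1 = stats.skew(x, bias=False)                    # > 0: skew to the right
g2 = stats.kurtosis(x, fisher=True, bias=False)   # excess kurtosis; > 0: leptokurtic

print(f"g1 = {g1:.3f}, g2 = {g2:.3f}")
```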
Examples from SPSS (skewed to the right)

Descriptives: Sulfate conc (mg/l)

                                    Statistic   Std. Error
Mean                                   30.05        3.67
95% CI for Mean     Lower Bound        22.64
                    Upper Bound        37.46
5% Trimmed Mean                        27.29
Median                                 26.00
Variance                             550.948
Std. Deviation                         23.47
Minimum                                    8
Maximum                                  110
Range                                    102
Interquartile Range                    23.00
Skewness                               1.707        .369
Kurtosis                               3.020        .724

Tests of Normality: Sulfate conc (mg/l)
(Ho: the variable has a Normal distribution in the statistical population)

Kolmogorov-Smirnov(a)   Statistic .216   df 41   Sig. 4.919E-05
Shapiro-Wilk            Statistic .811   df 41   Sig. .010**

**. This is an upper bound of the true significance.
a. Lilliefors Significance Correction
Example from SPSS (approximately Normal distribution)

Descriptives: Orbit width (mm)

                                    Statistic   Std. Error
Mean                                  11.480        .109
95% CI for Mean     Lower Bound       11.259
                    Upper Bound       11.701
5% Trimmed Mean                       11.456
Median                                11.600
Variance                                .479
Std. Deviation                          .692
Minimum                                 10.2
Maximum                                 13.3
Range                                    3.1
Interquartile Range                     .975
Skewness                                .298        .374
Kurtosis                                .051        .733

[Fig: frequency histogram of Orbit width, Count vs. 11.0–13.0 mm]

Tests of Normality: Orbit width (mm)

Kolmogorov-Smirnov(a)   Statistic .094   df 40   Sig. .200*
Shapiro-Wilk            Statistic .974   df 40   Sig. .567

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

The absolute values of g1 and g2 do not mean much on their own;
these statistics have to be tested for significance.
Testing Hypotheses about g1 and g2
We use the general test for significance of any sample statistic:

ts = (St − Stp) / SSt    e.g., ts = (g1 − γ1) / sg1

Where,
• St is the sample statistic (e.g., g1).
• Stp is the hypothesized value in the statistical population, against
which the sample statistic is to be tested (e.g., Ho: γ1 = 0).
• SSt is the estimated standard error (provided by SPSS [use Explore]).

Or, calculate the standard errors with the large-sample approximations:

sg1 ≈ √(6/n)   when n > 150
sg2 ≈ √(24/n)  when n > 150

d.f. = ∞
The Hypothesis Test

The Ho is that the distribution is not skewed – that is, that γ1 = 0.

It is a 2-tailed test because g1 can be either negative or positive, and
we wish to test whether there is any skewness. Thus,

Step 1: Ho: γ1 = 0    H1: γ1 ≠ 0

Step 2: If we want to test this using sample data with g1 = 0.18936
and n = 9456:

ts = (g1 − γ1) / sg1 = (0.18936 − 0) / √(6/9456)
   = 0.18936 / 0.02519 = 7.52

Step 3: Set alpha (say 0.05).

Step 4: We use the critical t-value with d.f. = ∞:

t*.05,∞ = 1.960    t*.01,∞ = 2.576    t*.001,∞ = 3.291

Therefore, ts = 7.52 has P << 0.001.

Thus we reject the null hypothesis and conclude that γ1 ≠ 0.

Since g1 is positive, we conclude that the data are significantly
skewed to the right.
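The worked example is easy to reproduce numerically; a minimal Python sketch (scipy assumed):

```python
import math
from scipy import stats

g1, n = 0.18936, 9456            # sample skewness and size from the example
s_g1 = math.sqrt(6 / n)          # large-sample standard error of g1 (n > 150)

t_s = (g1 - 0) / s_g1            # test statistic for Ho: gamma1 = 0
p = 2 * stats.norm.sf(abs(t_s))  # two-tailed; d.f. = infinity, so use the Normal

print(f"ts = {t_s:.2f}, P = {p:.2e}")  # ts ≈ 7.52, P << 0.001
```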
The Need for Data Transformation

1. To convert data with a skewed distribution to a Normal
distribution, so that parametric statistical tests can be
applied without risk of error.

2. Parametric tests which compare the means of two or
more independent samples assume that the variances of
the populations are similar. Transformation often
equalizes the variances.
Which Transformation to Use?
1. Continuous data

1. If data are skewed to the right: try a square-root (√x),
log (log[x + a]) or natural log (ln[x + a]) transformation.

2. If data are skewed to the left: try an exponential (e^x)
transformation.

3. If data are bimodal: the situation is hopeless! No simple
transformation can help. It usually indicates that the sample
contains data from more than one population.
Which Transformation to Use?
2. Count Data

The most common transformations are logarithmic,
square-root and arcsine.

1. Logarithmic transformation: when the variance of a
sample of count data is larger than the mean.
2. Square-root transformation: when the variance of a
sample of count data is about equal to the mean, or when
a Poisson distribution is expected.
3. Arcsine transformation: when the data are
expressed as proportions.
An example: a log(x) transformation converts a
distribution that is skewed to the right into an approximately
Normal distribution.
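These transformations are one-liners in most software; a hedged Python sketch (numpy assumed; the values are illustrative):

```python
import numpy as np

# Right-skewed measurements (cf. the sulfate example above)
x = np.array([8.0, 12.0, 20.0, 26.0, 30.0, 45.0, 110.0])

log_x  = np.log10(x + 1)   # log[x + a]; the constant a guards against log(0)
ln_x   = np.log(x + 1)     # natural-log variant
sqrt_x = np.sqrt(x)        # for counts with variance roughly equal to the mean

# Arcsine (angular) transformation for proportions
p = np.array([0.05, 0.25, 0.60, 0.95])
arcsine_p = np.arcsin(np.sqrt(p))
```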
