Test of Normality

A test of normality is a statistical test used to determine if a dataset follows a normal (Gaussian) distribution. The normal distribution, often referred to as the bell curve, is a symmetrical probability distribution characterized by a mean (average) and a standard deviation.

There are several methods and tests available to assess the normality of data. Some of the common
tests for normality include:

1. Shapiro-Wilk Test: The Shapiro-Wilk test is a widely used test for normality. It calculates a test statistic from the sample data and compares it to a critical value to determine whether the data significantly deviates from a normal distribution. Here's an example of how to perform a Shapiro-Wilk test in Python using the SciPy library:
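A minimal sketch is shown below; the data array is filled with randomly generated values purely for illustration, and any one-dimensional sample of numeric observations could be substituted.

import numpy as np
from scipy import stats

# Illustrative sample: 30 observations drawn from a normal distribution
data = np.random.normal(loc=50, scale=5, size=30)

# Perform the Shapiro-Wilk test
statistic, p_value = stats.shapiro(data)
print("Test statistic:", statistic)
print("p-value:", p_value)

# Compare the p-value to a chosen significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the data significantly deviates from a normal distribution.")
else:
    print("Fail to reject the null hypothesis: the data is consistent with a normal distribution.")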
In this example, the Shapiro-Wilk test is applied to the data array, and the test statistic and p-value are computed. You then compare the p-value to a chosen significance level (alpha) to decide whether to reject the null hypothesis. If the p-value is less than alpha, you reject the null hypothesis and conclude that the data does not follow a normal distribution. If the p-value is greater than alpha, you fail to reject the null hypothesis, indicating that the data is consistent with a normal distribution.
What is the SciPy library?
SciPy is a scientific computing library built on top of NumPy. SciPy stands for Scientific Python. It provides additional utility functions for optimization, statistics, and signal processing. Like NumPy, SciPy is open source, so we can use it freely. SciPy was created by NumPy's creator, Travis Oliphant.

What are the functions of the SciPy library?

Functions from SciPy
 cluster.hierarchy: Hierarchical clustering.
 fft: Fast Fourier transform.
 integrate: Numerical integration.
 interpolate: Interpolation and spline functions.
 linalg: Linear algebra.
 optimize: Numerical optimization.
 signal: Signal processing.
 sparse: Sparse matrix representation.
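As a rough illustration of how these submodules are used (a sketch with made-up numbers; each submodule is imported directly from the scipy package):

import numpy as np
from scipy import stats, optimize, integrate

# Descriptive statistics with scipy.stats
sample = np.array([4.2, 5.1, 6.3, 5.8, 4.9])
print(stats.describe(sample))

# Numerical integration of x**2 over [0, 1] with scipy.integrate
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 1)
print(area)

# Minimising a simple quadratic with scipy.optimize
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)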

2. Kolmogorov-Smirnov Test: The Kolmogorov-Smirnov test is another test for normality. It compares the cumulative distribution function of the sample data to the cumulative distribution function of a normal distribution with the same mean and standard deviation.
The Kolmogorov-Smirnov test, often referred to as the K-S test, is a statistical test used to assess whether a dataset follows a specific probability distribution or whether two datasets come from the same distribution. It is a non-parametric test, meaning it doesn't assume a specific distribution for the data, making it a versatile tool for distribution comparison.

There are two main variations of the Kolmogorov-Smirnov test:

1. One-sample Kolmogorov-Smirnov test: This version of the test assesses whether a sample
comes from a particular distribution. It calculates the maximum vertical difference (D-statistic)
between the empirical cumulative distribution function (ECDF) of the sample and the
cumulative distribution function (CDF) of the theoretical distribution. The null hypothesis is that
the sample follows the theoretical distribution.
2. Two-sample Kolmogorov-Smirnov test: This version is used to compare the distributions of
two independent samples. It computes the maximum vertical difference (D-statistic) between
the ECDFs of the two samples. The null hypothesis in this case is that the two samples are
drawn from the same distribution.

Here's a general outline of how you would use the Kolmogorov-Smirnov test:

1. Formulate your null and alternative hypotheses:
 For the one-sample KS test, the null hypothesis is that your sample follows a specific distribution.
 For the two-sample KS test, the null hypothesis is that the two samples come from the same distribution.
2. Calculate the test statistic (D-statistic): This statistic represents the maximum vertical
difference between the empirical and theoretical (or sample and sample) cumulative
distribution functions.
3. Determine the critical value or p-value: You can compare the calculated D-statistic to a
critical value from the Kolmogorov-Smirnov table or use software to obtain a p-value. The p-
value is used to determine the significance of the results.
4. Make a decision: If the p-value is less than your chosen significance level (e.g., 0.05), you may
reject the null hypothesis. Otherwise, you fail to reject it.

The Kolmogorov-Smirnov test is a useful tool for testing the normality of data, comparing two
datasets, or determining if data follows a specific distribution. It's important to note that it has
limitations, and it might not be the best choice in all situations, especially when dealing with small
sample sizes or discrete data. Other tests and methods may be more appropriate in those cases.
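As a brief sketch of both variations in Python with SciPy (the samples below are randomly generated for illustration; note that estimating the mean and standard deviation from the same sample makes the one-sample p-value only approximate, which is the situation the Lilliefors test in item 4 addresses):

import numpy as np
from scipy import stats

data = np.random.normal(loc=0, scale=1, size=100)

# One-sample K-S test against a normal distribution with the sample's mean and standard deviation
mean, std = data.mean(), data.std(ddof=1)
d_statistic, p_value = stats.kstest(data, 'norm', args=(mean, std))
print("One-sample K-S:", d_statistic, p_value)

# Two-sample K-S test comparing two independent samples
other = np.random.normal(loc=0.5, scale=1, size=100)
d_statistic2, p_value2 = stats.ks_2samp(data, other)
print("Two-sample K-S:", d_statistic2, p_value2)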

3. Anderson-Darling Test: The Anderson-Darling test is an extension of the Kolmogorov-Smirnov test and can be more powerful in detecting departures from normality, especially in the tails of the distribution.
4. Lilliefors Test: The Lilliefors test is a modification of the Kolmogorov-Smirnov test used when the mean and standard deviation of the hypothesized normal distribution are estimated from the sample rather than specified in advance.
5. D'Agostino and Pearson Test: This test combines skewness and kurtosis to assess normality. It calculates a test statistic based on these measures and compares it to a critical value.
6. Jarque-Bera Test: The Jarque-Bera test also uses skewness and kurtosis to test for normality. It's often used in econometrics.
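SciPy provides several of these tests directly; the sketch below uses randomly generated data for illustration (the Lilliefors test is not included in SciPy itself, but the statsmodels package provides an implementation).

import numpy as np
from scipy import stats

data = np.random.normal(loc=0, scale=1, size=200)

# Anderson-Darling test: returns a statistic and critical values rather than a p-value
ad_result = stats.anderson(data, dist='norm')
print(ad_result.statistic, ad_result.critical_values, ad_result.significance_level)

# D'Agostino and Pearson test: scipy.stats.normaltest combines skewness and kurtosis
k2, p_value = stats.normaltest(data)
print(k2, p_value)

# Jarque-Bera test
jb, p_jb = stats.jarque_bera(data)
print(jb, p_jb)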
When conducting a test of normality, you typically set a significance level (e.g., 0.05) as a threshold. If
the p-value obtained from the test is less than the significance level, you reject the null hypothesis,
indicating that the data significantly departs from a normal distribution. Conversely, if the p-value is
greater than the significance level, you fail to reject the null hypothesis, suggesting that the data is
reasonably consistent with a normal distribution.

It's important to note that no test of normality is perfect, and the results can be influenced by the
sample size. Additionally, real-world data often deviates from a perfect normal distribution, so the
decision to use parametric or non-parametric statistical tests should be based on the specific
characteristics of your data and your research objectives.

What does a statistical test do?


Statistical tests work by calculating a test statistic – a number that describes how much the
relationship between variables in your test differs from the null hypothesis of no relationship.

It then calculates a p value (probability value). The p-value estimates how likely it is that you would
see the difference described by the test statistic if the null hypothesis of no relationship were true.

If the value of the test statistic is more extreme than the critical value derived under the null hypothesis, then you can infer a statistically significant relationship between the predictor and outcome variables.

If the value of the test statistic is less extreme than that critical value, then you can infer no statistically significant relationship between the predictor and outcome variables.

When to perform a statistical test


You can perform statistical tests on data that have been collected in a statistically valid manner –
either through an experiment, or through observations made using probability sampling methods.

For a statistical test to be valid, your sample size needs to be large enough to approximate the true
distribution of the population being studied.

To determine which statistical test to use, you need to know:

 whether your data meets certain assumptions.


 the types of variables that you’re dealing with.

Statistical assumptions
Statistical tests make some common assumptions about the data they are testing:

1. Independence of observations (a.k.a. no autocorrelation): The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).
2. Homogeneity of variance: the variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test's effectiveness.
3. Normality of data: the data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.
If your data do not meet the assumptions of normality or homogeneity of variance, you may be able to
perform a nonparametric statistical test, which allows you to make comparisons without any
assumptions about the data distribution.

If your data do not meet the assumption of independence of observations, you may be able to use a
test that accounts for structure in your data (repeated-measures tests or tests that include blocking
variables).
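For example, the normality and homogeneity-of-variance assumptions can be checked informally with SciPy before choosing a test (a sketch with made-up groups; Levene's test is one common way to compare group variances):

import numpy as np
from scipy import stats

group_a = np.random.normal(loc=10, scale=2, size=40)
group_b = np.random.normal(loc=11, scale=2, size=40)

# Normality of each group (Shapiro-Wilk)
print(stats.shapiro(group_a).pvalue, stats.shapiro(group_b).pvalue)

# Homogeneity of variance across groups (Levene's test)
print(stats.levene(group_a, group_b).pvalue)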

Types of variables
The types of variables you have usually determine what type of statistical test you can use.

Quantitative variables represent amounts of things (e.g. the number of trees in a forest). Types of
quantitative variables include:

 Continuous (aka ratio variables): represent measures and can usually be divided into units smaller
than one (e.g. 0.75 grams).
 Discrete (aka integer variables): represent counts and usually can’t be divided into units smaller than
one (e.g. 1 tree).

Categorical variables represent groupings of things (e.g. the different tree species in a forest).
Types of categorical variables include:

 Ordinal: represent data with an order (e.g. rankings).


 Nominal: represent group names (e.g. brands or species names).
 Binary: represent data with a yes/no or 1/0 outcome (e.g. win or lose).

Choose the test that fits the types of predictor and outcome variables you have collected (if you are
doing an experiment, these are the independent and dependent variables). Consult the tables below
to see which test best matches your variables.

Choosing a parametric test: regression, comparison, or correlation
Parametric tests usually have stricter requirements than nonparametric tests, and are able to make
stronger inferences from the data. They can only be conducted with data that adheres to the common
assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and
correlation tests.

Regression tests
Regression tests look for cause-and-effect relationships. They can be used to estimate the effect of
one or more continuous variables on another variable.

Simple linear regression
 Predictor variable: Continuous, 1 predictor
 Outcome variable: Continuous, 1 outcome
 Research question example: What is the effect of income on longevity?

Multiple linear regression
 Predictor variable: Continuous, 2 or more predictors
 Outcome variable: Continuous, 1 outcome
 Research question example: What is the effect of income and minutes of exercise per day on longevity?

Logistic regression
 Predictor variable: Continuous
 Outcome variable: Binary
 Research question example: What is the effect of drug dosage on the survival of a test subject?
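As a rough illustration, a simple linear regression can be run with SciPy's linregress; the income and longevity values below are invented for the example (multiple and logistic regression are usually fitted with other libraries such as statsmodels or scikit-learn):

import numpy as np
from scipy import stats

income = np.array([20, 35, 50, 65, 80, 95], dtype=float)      # predictor (e.g. income in thousands)
longevity = np.array([72, 74, 75, 78, 79, 81], dtype=float)   # outcome (e.g. age at death)

result = stats.linregress(income, longevity)
print("slope:", result.slope)
print("intercept:", result.intercept)
print("p-value for the slope:", result.pvalue)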

Comparison tests
Comparison tests look for differences among group means. They can be used to test the effect of a
categorical variable on the mean value of some other characteristic.

T-tests are used when comparing the means of precisely two groups (e.g., the average heights of
men and women). ANOVA and MANOVA tests are used when comparing the means of more than
two groups (e.g., the average heights of children, teenagers, and adults).

Paired t-test
 Predictor variable: Categorical, 1 predictor
 Outcome variable: Quantitative, groups come from the same population
 Research question example: What is the effect of two different test prep programs on the average exam scores for students from the same class?

Independent t-test
 Predictor variable: Categorical, 1 predictor
 Outcome variable: Quantitative, groups come from different populations
 Research question example: What is the difference in average exam scores for students from two different schools?

ANOVA
 Predictor variable: Categorical, 1 or more predictors
 Outcome variable: Quantitative, 1 outcome
 Research question example: What is the difference in average pain levels among post-surgical patients given three different painkillers?

MANOVA
 Predictor variable: Categorical, 1 or more predictors
 Outcome variable: Quantitative, 2 or more outcomes
 Research question example: What is the effect of flower species on petal length, petal width, and stem length?
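A sketch of the t-tests and a one-way ANOVA in SciPy, using invented exam-score data:

import numpy as np
from scipy import stats

before_scores = np.array([68, 72, 75, 70, 66], dtype=float)
after_scores = np.array([71, 74, 78, 72, 69], dtype=float)
school_a = np.array([80, 85, 78, 90, 84], dtype=float)
school_b = np.array([75, 82, 70, 88, 79], dtype=float)
school_c = np.array([85, 88, 83, 91, 86], dtype=float)

# Paired t-test: the same subjects measured twice
print(stats.ttest_rel(before_scores, after_scores).pvalue)

# Independent t-test: two separate groups
print(stats.ttest_ind(school_a, school_b).pvalue)

# One-way ANOVA: three or more groups
print(stats.f_oneway(school_a, school_b, school_c).pvalue)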

Correlation tests
Correlation tests check whether variables are related without hypothesizing a cause-and-effect
relationship.

These can be used to test whether two variables you want to use in (for example) a multiple
regression test are autocorrelated.

Pearson's r
 Variables: 2 continuous variables
 Research question example: How are latitude and temperature related?
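A sketch with invented latitude and temperature values:

import numpy as np
from scipy import stats

latitude = np.array([5, 15, 25, 35, 45, 55], dtype=float)
temperature = np.array([30, 28, 24, 18, 12, 6], dtype=float)

r, p_value = stats.pearsonr(latitude, temperature)
print("Pearson's r:", r, "p-value:", p_value)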

Choosing a nonparametric test


Non-parametric tests don’t make as many assumptions about the data, and are useful when one or
more of the common statistical assumptions are violated. However, the inferences they make aren’t
as strong as with parametric tests.
Spearman's r
 Predictor variable: Quantitative
 Outcome variable: Quantitative
 Use in place of: Pearson's r

Chi square test of independence
 Predictor variable: Categorical
 Outcome variable: Categorical
 Use in place of: Pearson's r

Sign test
 Predictor variable: Categorical
 Outcome variable: Quantitative
 Use in place of: One-sample t-test

Kruskal–Wallis H
 Predictor variable: Categorical, 3 or more groups
 Outcome variable: Quantitative
 Use in place of: ANOVA

ANOSIM
 Predictor variable: Categorical, 3 or more groups
 Outcome variable: Quantitative, 2 or more outcome variables
 Use in place of: MANOVA

Wilcoxon Rank-Sum test
 Predictor variable: Categorical, 2 groups
 Outcome variable: Quantitative, groups come from different populations
 Use in place of: Independent t-test

Wilcoxon Signed-rank test
 Predictor variable: Categorical, 2 groups
 Outcome variable: Quantitative, groups come from the same population
 Use in place of: Paired t-test
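Several of these nonparametric tests are available in SciPy; the sketch below uses small invented samples (ANOSIM is not part of SciPy and is typically found in ecology-oriented packages):

import numpy as np
from scipy import stats

group_a = np.array([3, 5, 4, 6, 7, 5], dtype=float)
group_b = np.array([8, 6, 7, 9, 10, 8], dtype=float)
group_c = np.array([2, 3, 4, 3, 5, 4], dtype=float)

# Spearman's rank correlation (in place of Pearson's r)
rho, p_rho = stats.spearmanr(group_a, group_b)

# Kruskal-Wallis H test (in place of a one-way ANOVA)
h_stat, p_h = stats.kruskal(group_a, group_b, group_c)

# Wilcoxon rank-sum / Mann-Whitney U test (in place of the independent t-test)
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)

# Wilcoxon signed-rank test (in place of the paired t-test)
w_stat, p_w = stats.wilcoxon(group_a, group_b)

print(p_rho, p_h, p_u, p_w)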

Frequently asked questions about statistical tests


What are the main assumptions of statistical tests?
Statistical tests commonly assume that:

1. the data are normally distributed


2. the groups that are being compared have similar variance
3. the data are independent

If your data does not meet these assumptions, you might still be able to use a nonparametric statistical test, which has fewer requirements but also makes weaker inferences.

What is a test statistic?


A test statistic is a number calculated by a statistical test. It describes how far your observed
data is from the null hypothesis of no relationship between variables or no difference among
sample groups.

The test statistic tells you how different two or more groups are from the overall
population mean, or how different a linear slope is from the slope predicted by a null
hypothesis. Different test statistics are used in different statistical tests.

What is statistical significance?


Statistical significance is a term used by researchers to state that it is unlikely their
observations could have occurred under the null hypothesis of a statistical test. Significance is
usually denoted by a p-value, or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the
researcher. The most common threshold is p < 0.05, which means that the data is likely to
occur less than 5% of the time under the null hypothesis.

When the p-value falls below the chosen alpha value, then we say the result of the test is
statistically significant.
What is the difference between quantitative and categorical variables?
Quantitative variables are any variables where the data represent amounts (e.g. height,
weight, or age).

Categorical variables are any variables where the data represent groups. This includes
rankings (e.g. finishing places in a race), classifications (e.g. brands of cereal), and binary
outcomes (e.g. coin flips).

You need to know what type of variables you are working with to choose the right statistical
test for your data and interpret your results.

What is the difference between discrete and continuous variables?


Discrete and continuous variables are two types of quantitative variables:

 Discrete variables represent counts (e.g. the number of objects in a collection).


 Continuous variables represent measurable amounts (e.g. water volume or weight).

NULL HYPOTHESIS - In scientific research, the null hypothesis is the claim that no relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is due to chance alone, and an underlying causative relationship does not exist, hence the term "null."
