Descriptive Statistics and Normality Tests For Statistical Data
How to cite this article: Mishra P, Pandey CM, Singh U, Gupta A, Sahu C, Keshri A. Descriptive statistics and normality tests for statistical data. Ann Card Anaesth 2019;22:67-72. DOI: 10.4103/aca.ACA_157_18

studying a smaller sample.[2,4] To draw the inference from the study participants, measures of variation (variance, SD, standard error, quartile, interquartile range, percentile, range, and coefficient of variation [CV]) provide simple summaries about the sample
and the measures. A measure of frequency is usually used for categorical data, while the others are used for quantitative data.

Table 1: Distribution of mean arterial pressure (mmHg) as per sex
Patient number   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
MAP             82  84  85  88  92  93  94  95  98 100 102 107 110 116 116
Sex              M   F   F   M   M   F   F   M   M   F   M   F   M   F   M
MAP: Mean arterial pressure, M: Male, F: Female

Table 2: Descriptive statistics of the mean arterial pressure (mmHg)
Mean    SD     SE    Q1  Q2  Q3   Minimum  Maximum  Mode
97.47   11.01  2.84  88  95  107  82       116      116
SD: Standard deviation, SE: Standard error, Q1: First quartile, Q2: Second quartile, Q3: Third quartile

Measures of Frequency

Frequency statistics simply count the number of times that each value of a variable occurs, such as the number of males and females within the sample or population. Frequency analysis is an important area of statistics that deals with the number of occurrences (frequency) and the percentage. For example, according to Table 1, out of the 15 patients, the frequencies of males and females were 8 (53.3%) and 7 (46.7%), respectively.
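The frequency counts and percentages above can be reproduced in a few lines; a minimal sketch in Python using the Table 1 sex data (variable names are mine):

```python
from collections import Counter

# Sex of the 15 patients from Table 1 (M = male, F = female)
sex = ["M", "F", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M", "F", "M"]

counts = Counter(sex)                                     # absolute frequencies
n = len(sex)
percent = {k: round(100 * v / n, 1) for k, v in counts.items()}

print(counts["M"], percent["M"])   # 8 53.3
print(counts["F"], percent["F"])   # 7 46.7
```

The same idea extends to any categorical variable: `Counter` gives the frequency table, and dividing by the total gives the percentages reported in the text.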
Measures of Central Tendency

A measure of central tendency, also called a measure of central location, is used to describe a data set by a single representative value. The mean, median, and mode are three types of measures of central tendency. Measures of central tendency give us one value (mean or median) for the distribution, and this value represents the entire distribution. To make comparisons between two or more groups, the representative values of these distributions are compared. This helps in further statistical analysis, because many techniques of statistical analysis, such as measures of dispersion, skewness, correlation, the t-test, and ANOVA, are calculated using the value of a measure of central tendency. That is why measures of central tendency are also called measures of the first order. A representative value (measure of central tendency) is considered good when it is calculated using all observations and is not affected by extreme values, because these values are used in further calculations.

Computation of Measures of Central Tendency

Mean

The mean is the mathematical average of a set of data, calculated as the sum of the observations divided by the number of observations. It is the most popular measure and very easy to calculate. It is a unique value for one group, that is, there is only one answer, which is useful when comparing between groups. In the computation of the mean, all the observations are used.[2,5] One disadvantage of the mean is that it is affected by extreme values (outliers). For example, according to Table 2, the mean MAP of the patients was 97.47, indicating that the average MAP of the patients was 97.47 mmHg.

Median

The median is defined as the middle-most observation when the data are arranged in either increasing or decreasing order of magnitude. Thus, it is the observation that occupies the central place in the distribution; it is also called the positional average. Extreme values (outliers) do not affect the median. It is unique, that is, there is only one median for one data set, which is useful when comparing between groups. One disadvantage of the median compared with the mean is that it is not as popular as the mean.[6] For example, according to Table 2, the median MAP of the patients was 95 mmHg, indicating that 50% of the observations are less than or equal to 95 mmHg and the remaining 50% are greater than or equal to 95 mmHg.

Mode

The mode is the value that occurs most frequently in a set of observations, that is, the observation with the maximum frequency. A data set may have multiple modes or no mode at all. Because of the possibility of multiple modes for one data set, the mode is not used to compare between groups. For example, according to Table 2, the most frequently repeated value is 116 mmHg (occurring twice, while the rest occur only once), so the mode of the data is 116 mmHg.

Measures of Dispersion

Measures of dispersion, also called measures of variation, show how spread out the values in a data set are. They quantify the degree of variation or dispersion of values in a population or in a sample. More specifically, they show the lack of representativeness of a measure of central tendency, usually the mean or median. These are indices that give us an idea about the homogeneity or heterogeneity of the data.[2,6]

Common measures

Variance, SD, standard error, quartile, interquartile range, percentile, range, and CV.

Computation of Measures of Dispersion

Standard deviation and variance

The SD is a measure of how spread out values are from their mean value. Its symbol is σ (the Greek letter sigma) or s. It is called the standard deviation because a standard value (the mean) is taken to measure the dispersion. The variance (s²) is defined as the average of the squared differences from the mean; it is equal to the square of the SD.
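The three central-tendency measures described above can be verified on the Table 1 MAP values; a minimal sketch using Python's built-in statistics module (variable names are mine):

```python
import statistics

# MAP values (mmHg) of the 15 patients from Table 1, already sorted
map_values = [82, 84, 85, 88, 92, 93, 94, 95, 98, 100, 102, 107, 110, 116, 116]

mean = statistics.mean(map_values)      # arithmetic average of all observations
median = statistics.median(map_values)  # middle (8th of 15) observation
mode = statistics.mode(map_values)      # most frequently occurring value

print(round(mean, 2), median, mode)     # 97.47 95 116
```

The printed values agree with Table 2: mean 97.47 mmHg, median 95 mmHg, mode 116 mmHg.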
In symbols, with xi an individual value and x̄ the mean value:

s = √[Σ (xi − x̄)² / (n − 1)]        s² = Σ (xi − x̄)² / (n − 1)

where the sum runs over i = 1, …, n. If the sample size is <30, "n − 1" is used in the denominator; for a sample size ≥30, "n" is used.

For example, in the above, the SD is 11.01 mmHg (computed with "n − 1", since n <30), which shows that the approximate average deviation between the mean value and the individual values is 11.01. Similarly, the variance is 121.22 [i.e., (11.01)²], which shows that the average squared deviation between the mean value and the individual values is 121.22 [Table 2].

Standard error

The standard error is the approximate difference between the sample mean and the population mean. When many samples are drawn from the same population with the same sample size through a random sampling technique, the SD among the sample means is called the standard error. If the sample SD and sample size are given, the standard error can be calculated for the sample using the formula:

Standard error = sample SD/√(sample size)

For example, according to Table 2, the standard error is 2.84 mmHg, which shows that the average difference between the sample means and the population mean is 2.84 mmHg [Table 2].

Quartiles and interquartile range

The quartiles are the three points that divide a data set, arranged in either ascending or descending order, into four equal groups, each comprising a quarter of the data. Q1, Q2, and Q3 represent the first, second, and third quartile values.[7]

The ith quartile = [i × (n + 1)/4]th observation, where i = 1, 2, 3.

For example, in the above, the first quartile (Q1) = (n + 1)/4 = (15 + 1)/4 = 4th observation from the start = 88 mmHg (i.e., the first 25% of the observations are ≤88 and the remaining 75% are ≥88); Q2 (also called the median) = [2 × (n + 1)/4] = 8th observation = 95 mmHg, that is, the first 50% of the observations are ≤95 and the remaining 50% are ≥95; and similarly Q3 = [3 × (n + 1)/4] = 12th observation = 107 mmHg, indicating that the first 75% of the observations are ≤107 and the remaining 25% are ≥107. The interquartile range (IQR), also called the midspread or middle 50%, is a measure of statistical dispersion equal to the difference between the 75th (Q3, or third quartile) and 25th (Q1, or first quartile) percentiles. In the above example, the three quartiles Q1, Q2, and Q3 are 88, 95, and 107, respectively. As the first and third quartiles of the data are 88 and 107, the IQR of the data is 19 mmHg (which can also be written as 88–107) [Table 2].

Percentile

The percentiles are the 99 points that divide a data set, arranged in either ascending or descending order, into 100 equal groups, each comprising 1% of the data. The 25th percentile is the first quartile, the 50th percentile is the second quartile (also called the median value), and the 75th percentile is the third quartile of the data.

The ith percentile = [i × (n + 1)/100]th observation, where i = 1, 2, 3, …, 99.

Example: In the above, the 10th percentile = [10 × (n + 1)/100] = 1.6th observation from the start, which falls between the first and second observations = 1st observation + 0.6 × (difference between the second and first observations) = 83.20 mmHg, which indicates that 10% of the data are ≤83.20 and the remaining 90% of the observations are ≥83.20.

Coefficient of Variation

Interpreting the SD without considering the magnitude of the mean of the sample or population may be misleading. To overcome this problem, the CV expresses the SD as a ratio of its mean value, in percent: CV = 100 × (SD/mean). For example, in the above, the coefficient of variation is 11.3% [i.e., 100 × (11.01/97.47)], which indicates that the SD is 11.3% of its mean value [Table 2].

Range

The difference between the largest and smallest observations is called the range. If A and B are the smallest and largest observations in a data set, then the range (R) is the difference between them, that is, R = B − A. For example, in the above, the minimum and maximum observations in the data are 82 mmHg and 116 mmHg. Hence, the range of the data is 34 mmHg (which can also be written as 82–116) [Table 2].

Descriptive statistics can be calculated in the statistical software "SPSS" (analyze → descriptive statistics → frequencies or descriptives).

Normality of data and testing

The standard normal distribution is the most important continuous probability distribution. It has a bell-shaped density curve described by its mean and SD, and extreme values in the data set have no significant impact on the mean value.
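The dispersion measures worked through above (SD, standard error, quartiles, percentile, CV, and range) can be cross-checked on the Table 1 MAP values; a pure-Python sketch, with quartiles and percentiles computed by the [i × (n + 1)/k]th-observation rule used in the text (the helper `obs` is mine):

```python
import statistics
from math import sqrt

# MAP values (mmHg) of the 15 patients from Table 1
map_values = [82, 84, 85, 88, 92, 93, 94, 95, 98, 100, 102, 107, 110, 116, 116]
n = len(map_values)
mean = statistics.mean(map_values)

sd = statistics.stdev(map_values)          # sample SD, n - 1 denominator (n < 30)
se = sd / sqrt(n)                          # standard error = SD / sqrt(n)
cv = 100 * sd / mean                       # coefficient of variation, in %
rng = max(map_values) - min(map_values)    # range = largest - smallest

def obs(position, data):
    """Return the value at a (possibly fractional) 1-based observation
    position, interpolating linearly between neighbouring observations."""
    lo = int(position)
    frac = position - lo
    if frac == 0 or lo >= len(data):
        return data[lo - 1]
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])

data = sorted(map_values)
q1 = obs(1 * (n + 1) / 4, data)            # 4th observation  -> 88
q2 = obs(2 * (n + 1) / 4, data)            # 8th observation  -> 95
q3 = obs(3 * (n + 1) / 4, data)            # 12th observation -> 107
p10 = obs(10 * (n + 1) / 100, data)        # 1.6th observation -> 83.2

print(round(sd, 2), round(se, 2), round(cv, 1), rng)  # 11.01 2.84 11.3 34
print(q1, q2, q3, round(p10, 2), q3 - q1)             # 88 95 107 83.2 19
```

All printed values match the text and Table 2: SD 11.01, SE 2.84, CV 11.3%, range 34, quartiles 88/95/107, 10th percentile 83.20, and IQR 19 mmHg. Note that statistical packages offer several percentile-interpolation conventions; the rule above is the one the article uses.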
If continuous data follow a normal distribution, then 68.2%, 95.4%, and 99.7% of the observations lie between mean ± 1 SD, mean ± 2 SD, and mean ± 3 SD, respectively.[2,4]

Why to test the normality of data

Various statistical methods used for data analysis make assumptions about normality, including correlation, regression, t-tests, and analysis of variance. The central limit theorem states that when the sample size has 100 or more observations, violation of normality is not a major issue.[5,8] Nevertheless, for meaningful conclusions, the assumption of normality should be checked irrespective of the sample size. If continuous data follow a normal distribution, we present the data by their mean value, and this mean value is then used to compare between or among the groups and to calculate the significance level (P value). If the data are not normally distributed, the resulting mean is not a representative value of the data. A wrong selection of the representative value of a data set, and a significance level calculated using that value, may give a wrong interpretation.[9] That is why we first test the normality of the data and then decide whether the mean is applicable as the representative value. If it is applicable, means are compared using a parametric test; otherwise, medians are used to compare the groups using nonparametric methods.

Methods used for test of normality of data

An assessment of the normality of data is a prerequisite for many statistical tests because normality is an underlying assumption in parametric testing. There are two main methods of assessing normality: graphical and numerical (including statistical tests).[3,4] Statistical tests have the advantage of making an objective judgment of normality but the disadvantage of sometimes not being sensitive enough at low sample sizes or being overly sensitive at large sample sizes. Graphical interpretation has the advantage of allowing good judgment in situations where numerical tests might be over- or undersensitive, although normality assessment using graphical methods requires a great deal of experience to avoid wrong interpretations. Without such experience, it is best to rely on the numerical methods.[10] Among the various methods available for testing the normality of continuous data, the most popular are the Shapiro–Wilk test, the Kolmogorov–Smirnov test, skewness, kurtosis, histogram, box plot, P–P plot, Q–Q plot, and the mean with SD. The two well-known tests of normality, the Kolmogorov–Smirnov test and the Shapiro–Wilk test, are the most widely used. Normality tests can be conducted in the statistical software "SPSS" (analyze → descriptive statistics → explore → plots → normality plots with tests).

The Shapiro–Wilk test is the more appropriate method for small sample sizes (<50), although it can also handle larger sample sizes, while the Kolmogorov–Smirnov test is used for n ≥50. For both tests, the null hypothesis states that the data are taken from a normally distributed population. When P > 0.05, the null hypothesis is accepted and the data are considered normally distributed.

Skewness is a measure of symmetry, or more precisely, the lack of symmetry, of a distribution. Kurtosis is a measure of the peakedness of a distribution; the original kurtosis value is sometimes called kurtosis (proper). Most statistical packages, such as SPSS, report "excess" kurtosis (also called kurtosis [excess]), obtained by subtracting 3 from the kurtosis (proper). A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. If the mean, median, and mode of a distribution coincide, it is called a symmetric distribution, that is, skewness = 0 and kurtosis (excess) = 0. A distribution is called approximately normal if the skewness or kurtosis (excess) of the data lie between −1 and +1. However, this is a less reliable method for small-to-moderate sample sizes (n <300), because it does not adjust for the standard error (as the sample size increases, the standard error decreases). To overcome this problem, a z-test for normality is applied using skewness and kurtosis: a Z score is obtained by dividing the skewness value or the excess kurtosis value by its standard error. For small sample sizes (n <50), z values within ±1.96 are sufficient to establish normality of the data.[8] For medium-sized samples (50 ≤ n <300), an absolute z value within ±3.29 allows one to conclude that the distribution of the sample is normal.[11] For sample sizes >300, normality of the data depends on the histogram and on the absolute values of skewness and kurtosis: either an absolute skewness value ≤2 or an absolute kurtosis (excess) ≤4 may be used as reference values for determining considerable normality.[11]

A histogram is an estimate of the probability distribution of a continuous variable. If the graph is approximately bell-shaped and symmetric about the mean, we can assume normally distributed data[12,13] [Figure 1]. In statistics, a Q–Q plot is a scatterplot created by plotting two sets of quantiles (observed and expected) against one another. For normally distributed data, the observed quantiles are approximately equal to the expected quantiles, that is, they are statistically equal [Figure 2]. A P–P plot (probability–probability plot or percent–percent plot) is a graphical technique for assessing how closely two data sets (observed and expected) agree. It forms an approximately straight line when the data are normally distributed; departures from this straight line indicate departures from normality [Figure 3]. The box plot is another way to assess the normality of the data. It shows the median as a horizontal line inside the box and the IQR (the range between the first and third quartiles) as the length of the box. The whiskers (lines extending from the top and bottom of the box) represent the minimum and maximum values when they are within 1.5 times the IQR from either end of the box (i.e., Q1 − 1.5 × IQR and Q3 + 1.5 × IQR).
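The skewness/kurtosis z-test described above can be implemented directly; a pure-Python sketch on the Table 1 MAP values, using the sample skewness (G1), excess kurtosis (G2), and standard-error formulas that SPSS reports (a sketch under that assumption; variable names are mine):

```python
from math import sqrt

# MAP values (mmHg) of the 15 patients from Table 1
map_values = [82, 84, 85, 88, 92, 93, 94, 95, 98, 100, 102, 107, 110, 116, 116]
n = len(map_values)
mean = sum(map_values) / n

# Central moments of order 2, 3, and 4
m2 = sum((x - mean) ** 2 for x in map_values) / n
m3 = sum((x - mean) ** 3 for x in map_values) / n
m4 = sum((x - mean) ** 4 for x in map_values) / n

# Sample skewness (G1) and sample excess kurtosis (G2)
g1 = m3 / m2 ** 1.5
G1 = g1 * sqrt(n * (n - 1)) / (n - 2)
g2 = m4 / m2 ** 2 - 3
G2 = ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))

# Standard errors of skewness and kurtosis, and the z statistics
se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))
z_skew = G1 / se_skew
z_kurt = G2 / se_kurt

print(round(G1, 3), round(se_skew, 3), round(z_skew, 2))  # 0.398 0.58 0.69
print(round(G2, 3), round(se_kurt, 2), round(z_kurt, 2))  # -0.825 1.12 -0.74
```

These agree, to rounding, with the skewness (0.398, SE 0.580, Z 0.686) and kurtosis (−0.825, SE 1.12, Z ≈ −0.74) reported for this data set in Table 3, and both z values lie within ±1.96, consistent with normality for n <50. If SciPy and statsmodels are available, the formal tests are provided by scipy.stats.shapiro and statsmodels.stats.diagnostic.lilliefors, respectively.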
Scores more than 1.5 times and 3 times the IQR beyond the box are considered outliers and extreme outliers, respectively. A box plot that is symmetric, with the median line at approximately the center of the box and with symmetric whiskers, indicates that the data may have come from a normal distribution. If many outliers are present in the data set, either the outliers need to be removed or the data should be treated as nonnormally distributed[8,13,14] [Figure 4]. Another method of assessing normality is the relative value of the SD with respect to the mean: if the SD is less than half the mean (i.e., CV <50%), the data are considered normal.[15] This is a quick method to test normality; however, it should only be used when the sample size is at least 50.

For example, in Table 1, the MAP data of the 15 patients are given, and the normality of these data was assessed. The results showed that the data were normally distributed, as the skewness (0.398) and kurtosis (−0.825) were each within ±1, and the critical ratios (Z values) of the skewness (0.686) and kurtosis (−0.737) were within ±1.96, also evidence of a normal distribution. Similarly, the Shapiro–Wilk test (P = 0.454) and the Kolmogorov–Smirnov test (P = 0.200) were statistically insignificant, that is, the data were considered normally distributed. As the sample size is <50, we have to take the Shapiro–Wilk test result, and the Kolmogorov–Smirnov test result must be disregarded, although both methods indicated that the data were normally distributed. The SD of the MAP was also less than half the mean value (11.01 <48.73), which would suggest normally distributed data; however, because the sample size is <50, we should avoid this method, as it should be used only when the sample size is at least 50 [Tables 2 and 3].

Table 3: Skewness, kurtosis, and normality tests for mean arterial pressure (mmHg)
Variable    Skewness                Kurtosis                 P
            Value   SE     Z        Value    SE    Z         K-S test with Lilliefors correction  Shapiro-Wilk test
MAP score   0.398   0.580  0.686    −0.825   1.12  −0.737    0.200                                0.454
K-S: Kolmogorov–Smirnov, SE: Standard error

Conclusions

Descriptive statistics are statistical methods for summarizing data in a valid and meaningful way. A good and appropriate measure is important not only for the data but also for the statistical methods used for hypothesis testing. For continuous data, testing of normality is very important because the normality status determines the choice of measures of central tendency and dispersion and the selection of parametric or nonparametric tests. Although there are various methods for normality testing, for small sample sizes (n <50) the Shapiro–Wilk test should be used, as it has more power to detect nonnormality and is the most popular and widely used method.
When the sample size (n) is at least 50, any of the other methods (Kolmogorov–Smirnov test, skewness, kurtosis, z value of the skewness and kurtosis, histogram, box plot, P–P plot, Q–Q plot, and SD with respect to the mean) can be used to test the normality of continuous data.

Acknowledgment

The authors would like to express their deep and sincere gratitude to Dr. Prabhat Tiwari, Professor, Department of Anaesthesiology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, for his critical comments and useful suggestions, which were very helpful in improving the quality of this manuscript.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

References

1. Lund Research Ltd. Descriptive and Inferential Statistics. Available from: http://www.statistics.laerd.com. [Last accessed on 2018 Aug 02].
2. Sundaram KR, Dwivedi SN, Sreenivas V. Medical Statistics: Principles and Methods. 2nd ed. New Delhi: Wolters Kluwer India; 2014.
3. Bland M. An Introduction to Medical Statistics. 4th ed. Oxford: Oxford University Press; 2015.
4. Campbell MJ, Machin D, Walters SJ. Medical Statistics: A Textbook for the Health Sciences. 4th ed. Chichester: John Wiley & Sons, Ltd.; 2007.
5. Altman DG, Bland JM. Statistics notes: The normal distribution. BMJ 1995;310:298.
6. Altman DG. Practical Statistics for Medical Research. Chapman and Hall/CRC Texts in Statistical Science. London: CRC Press; 1999.
7. Indrayan A, Sarmukaddam SB. Medical Biostatistics. New York: Marcel Dekker Inc.; 2000.
8. Ghasemi A, Zahediasl S. Normality tests for statistical analysis: A guide for non-statisticians. Int J Endocrinol Metab 2012;10:486-9.
9. Indrayan A, Satyanarayana L. Essentials of biostatistics. Indian Pediatr 1999;36:1127-34.
10. Lund Research Ltd. Testing for Normality using SPSS Statistics. Available from: http://www.statistics.laerd.com. [Last accessed on 2018 Aug 02].
11. Kim HY. Statistical notes for clinical researchers: Assessing normal distribution (2) using skewness and kurtosis. Restor Dent Endod 2013;38:52-4.
12. Armitage P, Berry G. Statistical Methods in Medical Research. 2nd ed. London: Blackwell Scientific Publications; 1987.
13. Barton B, Peat J. Medical Statistics: A Guide to SPSS, Data Analysis and Critical Appraisal. 2nd ed. Sydney: Wiley Blackwell, BMJ Books; 2014.
14. Baghban AA, Younespour S, Jambarsang S, Yousefi M, Zayeri F, Jalilian FA. How to test normality distribution for a variable: A real example and a simulation study. J Paramed Sci 2013;4:73-7.
15. Jeyaseelan L. Short Training Course Materials on Fundamentals of Biostatistics, Principles of Epidemiology and SPSS. Vellore: Biostatistics Resource and Training Center (BRTC), CMC Vellore; 2007.