Descriptive Statistics: Organizing, Summarizing, Describing, and Presenting Data

This document presents essential methods of Descriptive Statistics for biomedical science students and professionals, covering techniques such as mean, median, mode, variance, standard deviation, and data visualization through graphs. It emphasizes the importance of correctly identifying variable types and provides practical examples to enhance understanding and communication of data in the biomedical field. The paper aims to empower users to effectively analyze and present data, which is crucial for research and decision-making.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/375770059

Descriptive statistics: organizing, summarizing, describing, and presenting data

Method · November 2023


DOI: 10.13140/RG.2.2.31782.91203

André Moreno Morcillo


State University of Campinas (UNICAMP)

All content following this page was uploaded by André Moreno Morcillo on 21 November 2023.



Descriptive statistics: organizing, summarizing, describing, and presenting data

André Moreno Morcillo1

Abstract: In this paper, we present essential methods of Descriptive Statistics for biomedical science
students and professionals. We explore data summary techniques such as the mean, median, and
mode, measures of dispersion such as variance and standard deviation, and position measures like
quartiles and z-scores. Furthermore, we emphasize the importance of data visualization through
graphs, including pie charts, bar charts, and box plots. We demonstrate how to calculate these
statistics practically and provide examples from the biomedical sciences. This paper aims to empower
students and professionals to understand and effectively communicate data, which is crucial for
research, diagnosis, and decision-making in the biomedical field.

The presentation of scientific research results requires the use of standard techniques or
methods, so that articles and reports can be evaluated by researchers in different countries.
This part of statistics, whose objective is to synthesize, organize, and present
data, is called Descriptive Statistics. Among other techniques, measures of central tendency, variability
(dispersion) and position can be used, as well as tables, graphs, etc.
Currently, we have excellent software for statistical analysis, and we rarely perform calculations
by hand. However, knowing how these calculations are done can enhance the understanding of the results
obtained with software. Another very important point is knowing which descriptive statistical methods
should be used for different types of variables. Considering these aspects, we present below the
calculation using elementary mathematics and interpretation of the main tools of descriptive statistics.

Working with information or data

The results of quantitative research are translated into information or data, which can express
either quantity or quality. Data that express a quantity are called quantitative data or variables,
while those that express a quality are called qualitative (categorical) data or variables. Weight, height,
body mass index, hemoglobin values are examples of quantitative variables. Classification according to
gender (male/female), family income (low/middle/high), and education level (low/middle/high) are
examples of qualitative or categorical variables. We have two types of categorical data: nominal and
ordinal. In the nominal categorical type, all categories have the same degree of importance. As an
example, we can mention gender, where male and female are categories with the same degree of
importance. On the other hand, in the ordinal categorical type, the categories have different degrees of
importance. For example, when we talk about high income, we know that these are families with higher
incomes than families with middle and low incomes. We also know that low income means lower income
than middle- and high-income groups.

1
André Moreno Morcillo, PhD, MD from the State University of Campinas, São Paulo, Brazil
ResearchGate: https://www.researchgate.net/profile/Andre-Morcillo/publications
[email protected]
Descriptive statistics: organizing, summarizing … Morcillo AM

Identifying the variable type correctly is very important, because descriptive statistics methods
and data analysis techniques are specific to each type of variable.

Descriptive statistics methods for quantitative data

When the data set is small, it is enough to present it in a simple way. There is no need to use
sophisticated techniques or resources. Consider the ages of 8 children: [7, 6, 4, 7, 7, 8, 7, 12]. A simple
way to describe them would be: the youngest is 4 years old, while the oldest is 12 years old. The most
common age is 7 years.
Try repeating the same process with a slightly larger group. Below, we present the ages (in years)
of 60 patients.
20 48 30 44 97 76 24 48 20 68
89 60 33 53 64 5 24 54 82 67
8 76 65 7 33 37 31 70 10 84
1 60 89 63 22 58 35 45 44 72
3 34 27 2 66 66 33 4 48 20
91 98 58 43 63 96 43 7 92 81

The techniques or methods that will be presented were developed to facilitate the presentation
of large sets of data, enabling their reading and interpretation in a systematic and quick way.
To present quantitative data, some numerical methods are used, with the aim of describing what
occurs in the center of the distribution and how the data is dispersed (variability). These methods, known
as summary measures, can be divided into:

• Central tendency measures:
arithmetic mean, geometric mean, median and mode.
• Variability (dispersion) measures:
variance, standard deviation, range, interquartile range, interquartile interval and coefficient of
variation.
• Position measures:
quantiles and z-scores.

Measures of central tendency

1. Arithmetic Mean

The arithmetic mean (mean) is one of the most used measures to describe central tendency. Its calculation
is very easy: we sum the measured values and then divide the result by the number of cases evaluated.

We indicate the mean of a population by $\mu$ and a sample mean by $\bar{x}$.

$$\mu = \frac{\sum X}{N}$$
$\sum X$ = sum of population values; $N$ = number of cases in the population (see footnote 2)

$$\bar{x} = \frac{\sum x}{n}$$
$\sum x$ = sum of sample values; $n$ = number of elements in the sample

Example: given the set of numbers [99, 100, 101, 102, 105], its arithmetic mean will be:

$$\bar{x} = \frac{99 + 100 + 101 + 102 + 105}{5} = 101.4$$

The arithmetic mean has a disadvantage: it is greatly influenced by extreme values (very large or
very small) in relation to the data set. In the example above, if we change the value 100 to 60 the mean
becomes:
$$\bar{x} = \frac{60 + 99 + 101 + 102 + 105}{5} = 93.4$$
Changing a single element caused a decrease of 8 units in the group mean. Thus, the arithmetic
mean is a good measure of central tendency when the data have a symmetric distribution (see footnote 3). If data are
positively or negatively skewed, the mean is not a good indicator of the center of the distribution. When
the data distribution is skewed, we should use the geometric mean or the median.
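The sensitivity of the mean to a single extreme value can be checked with a short script. This is only an illustrative sketch in Python (the paper itself does not prescribe any software); the function name `arithmetic_mean` is chosen here for the example.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of cases."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


original = [99, 100, 101, 102, 105]
with_extreme = [60, 99, 101, 102, 105]  # 100 replaced by 60, as in the text

mean_original = arithmetic_mean(original)
mean_extreme = arithmetic_mean(with_extreme)

print(round(mean_original, 1))                 # 101.4
print(round(mean_extreme, 1))                  # 93.4
print(round(mean_original - mean_extreme, 1))  # 8.0
```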

2. Geometric Mean

The geometric mean (gm) is a good parameter of central tendency for data greater than zero and
positively skewed, as occurs with the results of antibody titers, weight, body mass index, etc. Its
calculation is given by the formula:
$$gm = \sqrt[N]{x_1 \cdot x_2 \cdot x_3 \cdots x_N}$$
It can also be calculated in a much more practical way. To do this, we work with the logarithms
(logs) of the data (see footnote 4). We determine the arithmetic mean of the logarithms and then calculate the
antilogarithm of the mean of the logs. The antilogarithm of the mean of the logs is equal to the geometric
mean. Let's look at a simple example: consider the five values [10, 100, 1000, 10000, 100000]. Initially we
calculate the mean of the logarithms ($\bar{x}_{\log}$).
$$\bar{x}_{\log} = \frac{\log(10) + \log(100) + \log(1000) + \log(10000) + \log(100000)}{5}$$
$$\bar{x}_{\log} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3$$
Next, we determine the antilogarithm of the mean of logarithms ($\bar{x}_{\log}$):

$$gm = 10^{\bar{x}_{\log}} = 10^{3} = 1000$$

Footnote 2: The correct formula is $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$. We use $\sum X = \sum_{i=1}^{N} x_i$ for convenience and ease.
Footnote 3: An efficient way to assess the symmetry of a distribution is through a histogram.
Footnote 4: In this text we use logarithms in base 10 ($\log_{10} x$).
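The logarithm route to the geometric mean can be sketched in a few lines of Python. The function name `geometric_mean` is an assumption made for this illustration, and base-10 logarithms are used to match the text; any base would give the same result.

```python
import math


def geometric_mean(values):
    """Antilogarithm of the mean of the base-10 logarithms."""
    if not values:
        raise ValueError("the geometric mean of an empty data set is undefined")
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean requires values greater than zero")
    mean_of_logs = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_of_logs


values = [10, 100, 1000, 10000, 100000]
gm = geometric_mean(values)
print(round(gm, 6))  # 1000.0

# Cross-check against the definition: N-th root of the product.
product_root = math.prod(values) ** (1 / len(values))
print(math.isclose(gm, product_root))  # True
```

Working with logarithms avoids the very large intermediate product that the N-th root formula produces for long data sets.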

3. Median

If we sort the data in ascending order, the median (md) is the value of the variable observed
in the center of the distribution. The median divides the ordered data into two groups that have the same
number of cases. Half of the cases have lower values and the other half have values greater than the
median. The median is equivalent to the 50th percentile and the 2nd quartile. To determine it, the sample
must first be sorted in ascending order, and then the element that occupies the central position must
be located. The value of the variable for this element is the median. Returning to the previous example, the set of
numbers 99, 100, 101, 102, 105:

Order 1st 2nd 3rd 4th 5th

Value 99 100 101 102 105

The center of the distribution is occupied by the 3rd element, whose value is 101. The median
of this group is 101 (md = 101). Note that two elements of the distribution are smaller than the median (99
and 100) and two elements are larger than the median (102 and 105).
The most time-consuming step is to identify the element that is in the center of the distribution
of the data. Excel has a routine that automatically sorts data, which greatly simplifies the work. However,
identifying the central element is still a problem when we want to manually determine the median. We
can employ the following procedures to facilitate the work.

a) When the number of cases is odd, there is always an element in the center of the distribution, whose
position is given by:

$$\text{Position} = \frac{N + 1}{2}$$
N = number of cases

b) When the number of cases is even, we have two elements in the center of the distribution, and the
median will be the mean of them. The positions of the two elements can be determined by:
$$\text{Position of first element} = \frac{N}{2} \qquad \text{Position of second element} = \frac{N}{2} + 1$$

N = number of cases

For example, consider the 10 values: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20. Applying the formulas above we will
have:

$$\text{Position of first element} = \frac{N}{2} = \frac{10}{2} = 5 \qquad \text{Position of second element} = \frac{N}{2} + 1 = 6$$


Position 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th

Number 2 4 6 8 10 12 14 16 18 20

The median will be the arithmetic mean of the values of the 5th and 6th elements.

$$Md = \frac{10 + 12}{2} = 11$$

The value 11, which was estimated by interpolation based on the values of the two central
elements of the distribution, does not belong to the data. In this other example with 6 elements 100, 105,
101, 98, 99, 103:

1. Initially we sort the data: 98, 99, 100, 101, 103, 105
2. Next, we determine the two central elements:
$$\text{Position of first element} = \frac{N}{2} = \frac{6}{2} = 3 \qquad \text{Position of second element} = \frac{N}{2} + 1 = 4$$

Position 1st 2nd 3rd 4th 5th 6th

Value 98 99 100 101 103 105

3. Now, we can calculate the median:

$$Md = \frac{100 + 101}{2} = 100.5$$

The median is not influenced by extreme values, unlike the arithmetic mean; therefore, it can be
used with both symmetrical and asymmetrical distributions. In the example above, if the sixth element
were changed to 105,000, the median of the distribution would be the same.

Position 1st 2nd 3rd 4th 5th 6th

Value 98 99 100 101 103 105,000

$$Md = \frac{100 + 101}{2} = 100.5$$
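The position rules for odd and even numbers of cases can be put together in a small function. The sketch below is one possible Python rendering of the manual procedure described in this section; the function name `median` is chosen for the illustration.

```python
def median(values):
    """Median found by sorting and locating the central position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2          # single central element (1-based)
        return ordered[position - 1]
    first = n // 2                       # two central elements (1-based)
    second = first + 1
    return (ordered[first - 1] + ordered[second - 1]) / 2


print(median([99, 100, 101, 102, 105]))              # 101
print(median([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # 11.0
print(median([100, 105, 101, 98, 99, 103]))          # 100.5
print(median([100, 105000, 101, 98, 99, 103]))       # 100.5 (extreme value has no effect)
```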


4. Mode

The mode (mo) is the most frequent value in a data distribution. We can have data distributions
with no mode (amodal), with one mode (unimodal), with two modes (bimodal), or with more than two
modes (multimodal). In the previous example [100,105,101,98,99,103], all values occur once, therefore,
the distribution has no mode (amodal). But with a group of 15 children whose ages are [4, 5, 6, 7, 7, 7, 7,
7, 7, 7, 7, 7, 8, 8, 9], the mode is 7 because 7 is the most frequent age.
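Counting frequencies is all that is needed to find the mode. The following Python sketch returns a list so that bimodal and multimodal data are handled as well. Treating a data set in which every distinct value occurs equally often as amodal is an assumption made here that generalizes the "all values occur once" case in the text.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumption: if several distinct values all share the same frequency,
    # the distribution is treated as amodal.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([100, 105, 101, 98, 99, 103]))                        # []
print(modes([4, 5, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9]))       # [7]
```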

Measures of variability (dispersion)

1. Range
The range is the difference between the largest and smallest observed values. It is a measure of
dispersion calculated from only two values, the largest and the smallest, ignoring the others. Therefore, it is a
limited measure of the dispersion of the data set.
Considering the ages (years) of a group of 10 children [4, 5, 5, 6, 6, 6, 7, 7, 8, 8], the lowest
observed value is 4 and the highest value is 8. The range is 4 years.

$$\text{Range} = 8 - 4 = 4 \text{ years}$$

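Now, consider two datasets [10, 11, 12, 13, 14, 15, 60] and [10, 55, 56, 57, 58, 59, 60]. In both,
the range is equal to 50; however, this value does not reflect the real variability of the groups: in the first, six of the seven values lie between 10 and 15, while in the second, six lie between 55 and 60.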

2. Variance

Variance is a measure of variability (dispersion) that takes into account all values in the group.
We represent the variance of a population by $\sigma^2$ and of a sample by $s^2$.
To determine the variance, we calculate the deviation of each element from the group mean, $(x - \mu)$.
Next, we calculate the squared deviations, $(x - \mu)^2$. Finally, we divide the sum of squared deviations by
the number of elements ($N$). The formula is:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

When working with samples, we want the variance $s^2$ to be a good estimate of the population
variance $\sigma^2$. For this reason, we divide by $(n - 1)$ instead of $n$. The variance is calculated as follows:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Example: considering the ages (years) of a group of 10 children [7, 5, 6, 7, 8, 6, 6, 8, 5, 4], initially we
calculate the mean:
$$\bar{x} = \frac{7 + 5 + 6 + 7 + 8 + 6 + 6 + 8 + 5 + 4}{10} = 6.2$$
Next, we create a table with three columns to facilitate the calculations. In the first column we
put the ages. In the second, the differences between each age and the mean of the group (𝑥 − 𝑥̄ ) and, in
the third, the values in the second column squared (𝑥 − 𝑥̅ )2 .

Ages     $(x - \bar{x})$     $(x - \bar{x})^2$

7         0.8        0.64
5        -1.2        1.44
6        -0.2        0.04
7         0.8        0.64
8         1.8        3.24
6        -0.2        0.04
6        -0.2        0.04
8         1.8        3.24
5        -1.2        1.44
4        -2.2        4.84

Total               15.6

Now, we calculate the variance.


$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{15.6}{9} = 1.7 \text{ years}^2$$
With some simple algebraic transformations, we can develop the numerator of the variance
formula, $\sum (x - \bar{x})^2$, arriving at an equivalent expression that has the advantage of not using the mean.

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
Thus, we now have a practical way to calculate the variance:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
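For readers who want to see the algebra, the identity for the numerator follows from expanding the square and substituting $\bar{x} = \sum x / n$. The steps below are a sketch of that standard expansion, not a derivation taken from the paper:

```latex
\begin{aligned}
\sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
  &= \sum x^2 - 2\,\frac{(\sum x)^2}{n} + \frac{(\sum x)^2}{n} \\
  &= \sum x^2 - \frac{(\sum x)^2}{n}
\end{aligned}
```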
Returning to the previous example and applying this new formula, we have:

Ages x $x^2$

7 7 49
5 5 25
6 6 36
7 7 49
8 8 64
6 6 36
6 6 36
8 8 64
5 5 25
4 4 16

Total 62 400

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{400 - \frac{62^2}{10}}{9} = 1.7 \text{ years}^2$$
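Both ways of computing the sample variance can be compared directly. The Python sketch below uses function names invented for this illustration and the ages from the example; it shows that the two formulas agree.

```python
def variance_from_deviations(values):
    """Sample variance: sum of squared deviations divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_from_sums(values):
    """Sample variance from the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


ages = [7, 5, 6, 7, 8, 6, 6, 8, 5, 4]
print(sum(ages), sum(x * x for x in ages))      # 62 400
print(round(variance_from_deviations(ages), 1))  # 1.7
print(round(variance_from_sums(ages), 1))        # 1.7
```

The second form is convenient by hand, but in floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the deviation form is usually the safer choice in software.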

3. Standard deviation

Variance is an excellent measure of variability; however, it is rarely used in publications. As we
squared the deviations, we also squared the units of measurement. Thus, the unit of variance for weight
will be kg$^2$, for height will be cm$^2$, etc. These squared units are difficult for the reader to interpret.
For this reason, the square root of the variance is used, which is called the standard
deviation. We indicate the standard deviation of a population by $\sigma$ and of a sample by $s$.

$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad s = \sqrt{s^2}$$

The standard deviation from the previous example is: $s = \sqrt{s^2} = \sqrt{1.7} = 1.3$ years
Because the standard deviation is the square root of the variance, it has the original unit in which
the data was measured. The standard deviation represents how far, on “average”, each observation is
from the mean of the group. The closer the values are to the mean, the smaller the standard deviation
will be and the further they are from the mean, the higher it will be.
Now, we present a new group of 10 children, to calculate the standard deviation of age and
compare it with the previous example: 4, 8, 9, 5, 12, 13, 14, 6, 5, 5.
The arithmetic mean age of this group is:

$$\bar{x} = \frac{81}{10} = 8.1 \text{ years}$$
The standard deviation is:

$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{781 - \frac{81^2}{10}}{9}} = 3.7 \text{ years}$$

Note that in the first group we had a mean of 6.2 and a standard deviation of 1.3 years. In the
latter, the mean is 8.1 and the standard deviation is 3.7 years.

The variance and standard deviation are good parameters of variability when the data has a
symmetric distribution. If the data is positively or negatively skewed, the variance and standard deviation
are not good indicators of the variability of the distribution. When the data distribution is skewed, we
should use the interquartile range or interquartile interval.
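The two standard deviations compared above can be reproduced with the following Python sketch. The helper name `sample_sd` is an assumption for the illustration; the standard library function `statistics.stdev` is used only as a cross-check.

```python
import math
import statistics


def sample_sd(values):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


group_1 = [7, 5, 6, 7, 8, 6, 6, 8, 5, 4]
group_2 = [4, 8, 9, 5, 12, 13, 14, 6, 5, 5]

print(round(sample_sd(group_1), 1))  # 1.3
print(round(sample_sd(group_2), 1))  # 3.7
print(math.isclose(sample_sd(group_2), statistics.stdev(group_2)))  # True
```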

4. Coefficient of variation

The coefficient of variation (CV) is the ratio between the standard deviation and the sample
mean. The coefficient of variation, expressed as a percentage, is a measure used to compare the
dispersions of two or more groups.
$$CV = \frac{s}{\bar{x}} \cdot 100$$

Considering the two previous examples:



In the first group of children, we have $\bar{x} = 6.2$ and $s = 1.3$:

$$CV = \frac{s}{\bar{x}} \cdot 100 = \frac{1.3}{6.2} \cdot 100 = 21.0\%$$

In the second group of children, we have $\bar{x} = 8.1$ and $s = 3.7$:

$$CV = \frac{s}{\bar{x}} \cdot 100 = \frac{3.7}{8.1} \cdot 100 = 45.7\%$$

Relative to its mean, the variability (dispersion) of the second group is about 2.2 times that of the first (45.7 / 21.0 ≈ 2.2).
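The comparison of the two groups can be sketched in Python as follows. The function name is chosen for this illustration, and the inputs are the rounded means and standard deviations quoted in the text; with unrounded standard deviations the percentages come out slightly higher (about 21.2% and 46.0%).

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return sd / mean * 100


cv_1 = coefficient_of_variation(1.3, 6.2)
cv_2 = coefficient_of_variation(3.7, 8.1)

print(round(cv_1, 1))         # 21.0
print(round(cv_2, 1))         # 45.7
print(round(cv_2 / cv_1, 1))  # 2.2
```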

Measures of position

1. Quartiles
A quartile is any of the three values that divide the ordered set of data into four groups, each
containing 25% of the cases. The 1st quartile separates the 25% of cases with
the lowest values from the remaining 75%. The 2nd quartile divides the group into two subgroups with an equal number of
cases, with half of the cases having lower values and the other half having values higher than the 2nd
quartile. The 3rd quartile separates the group with the highest values, also with 25% of cases, from the
remaining 75% that have lower values.
The 1st quartile is equivalent to the 25th percentile, the 2nd is equivalent to the 50th percentile
and the median, while the 3rd quartile is equivalent to the 75th percentile.

25% 25% 25% 25%

Minimum 1st Quartile 2nd Quartile 3rd Quartile Maximum

We call the difference between the 3rd and 1st quartile the interquartile range (IQR). It expresses
the variability (dispersion) of cases that occupy the center of the distribution, excluding the smallest 25%
and the largest 25%. The interquartile interval is defined by the values of the 1st and 3rd quartiles.

$$IQR = \text{3rd Quartile} - \text{1st Quartile}$$

How to determine the quartiles?

We initially sort the data and then identify the three values that divide the group into four
subgroups, each with an equal number of cases. To find the position of the element that corresponds to
the 1st Quartile (PQ1), we use the following formula PQ1=(N+1)/4, for the 2nd Quartile use PQ2=2.(N+1)/4
and for 3rd Quartile use PQ3=3.(N+1)/4.
When the position (P) of a quartile is an integer, there is an element in this position in the
researcher's data. Therefore, locate it and check the value of the variable under study. Its value is the
quartile.


When the quartile position P is a decimal number, the quartile is determined by interpolation,
from two elements of the data set that include P. For example, if PQ1 is 8.3, we use the values of the 8th
and 9th elements in the interpolation. The decimal part, 0.3, is the weighting factor. The formula is:
Quartile = x(8th element) + 0.3.[x(9th element) - x(8th element)]
The quartile value is higher than the value of the 8th element and lower than the value of the 9th
element. For example, if the values of the 8th and 9th elements are 90 and 100, respectively, the quartile
will be: Quartile = 90 + 0.3.[100 - 90] = 93.
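The position formulas and the interpolation step can be combined in a short Python sketch. The function names `value_at_position` and `quartiles` are invented for this illustration, and clamping positions that fall outside the data (possible only in very small samples) is an assumption, not something the text specifies.

```python
import math


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    if n == 0:
        raise ValueError("empty data set")
    position = min(max(position, 1), n)   # assumption: clamp to the data
    lower = math.floor(position)
    fraction = position - lower           # weighting factor
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """1st, 2nd and 3rd quartiles using positions k * (N + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


# Position 8.3 between an 8th element of 90 and a 9th element of 100.
sample = [10, 20, 30, 40, 50, 60, 70, 90, 100, 110]
print(round(value_at_position(sample, 8.3), 6))  # 93.0

q1, q2, q3 = quartiles([4, 5, 5, 6, 6, 6, 7, 7, 8, 8])
print(q1, q2, q3, q3 - q1)  # 5.0 6.0 7.25 2.25
```

Statistical packages implement several slightly different quartile conventions, so values computed by software may differ a little from those obtained with this (N + 1) rule.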

2. Z scores
The z-score represents the relative position of the elements in a group in relation to their mean.
The z-score expresses, in standard deviation units, the distance that a given value is in relation to the
mean. To calculate the z score, we use the formula:

$$z\text{-score} = \frac{x - \bar{x}}{s}$$
x: variable value; $\bar{x}$: sample mean; s: sample standard deviation

For example, given the set of numbers [100, 101, 105.2, 99.2, 100.5], we first calculate the
mean and the standard deviation: $\bar{x} = 101.18$ and $s = 2.34$. To determine the z-score of 105.2, we do the
following:

$$z\text{-score} = \frac{x - \bar{x}}{s} = \frac{105.2 - 101.18}{2.34} = +1.72$$

The z-score of 105.2 is +1.72, which means that 105.2 is 1.72 standard deviation units above the
mean of the data group.
The z-score is commonly used in the assessment of the growth of children and adolescents, as
well as in the standardization of variables for machine learning processing.
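The z-score calculation for the five numbers in the example can be sketched in Python as follows, using the standard library `statistics` module for the sample mean and the sample standard deviation (divisor n - 1). The function name `z_score` is chosen for the illustration.

```python
import statistics


def z_score(x, values):
    """Distance of x from the sample mean, in sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return (x - mean) / sd


data = [100, 101, 105.2, 99.2, 100.5]
print(round(statistics.mean(data), 2))   # 101.18
print(round(statistics.stdev(data), 2))  # 2.34
print(round(z_score(105.2, data), 2))    # 1.72
```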

Data quality assessment


Initially, we should perform a careful assessment of the data, looking for potential problems. This
important step precedes the final analysis. For this evaluation, the most important thing is the experience
of the person who will carry out the analysis. It is essential to know the nature and distribution of each of
the variables under study, as well as to evaluate the “quality” of the data that will be analyzed.
When we talk about “quality”, we are referring to the methodological rigor used during
measurements, typing errors, outliers, etc. Once this preliminary evaluation is complete and the
distribution of the data has been examined, descriptive analysis and the application of statistical tests can begin.
Special care should be taken with outliers. These atypical data are those that are very far
from the center of the distribution. They can be genuine observations, although sometimes they result from errors
in measurement, notation, or even typing. Outliers are values that are greater than 3rd quartile + 1.5.IQR
or less than 1st quartile - 1.5.IQR, where IQR is the interquartile range.

For example, in a study on the height of school-age children, we found cases with a value of
220cm and 240cm. Most likely, there was an error at the time of the anthropometric examination, when
taking notes or even when typing, as it is impossible for there to be school-age children so tall. If these
cases are not removed from the group, there will be serious distortion in the mean and standard
deviation, compromising the statistical tests.
The box plot graph is a very useful and practical tool for conducting this preliminary analysis of
quantitative data. This graph is constructed from five points: the minimum, the first quartile, the second
quartile, the third quartile, and the maximum.
In a Cartesian coordinate system, we begin by marking the minimum and maximum. Next, we
draw a rectangle that passes through the first quartile and the third quartile. Then, we mark the median
inside the rectangle. Finally, we draw two straight-line segments with length equal to 1.5 times the
interquartile range (IQR). The first straight segment is drawn above the upper edge of the rectangle, and
the other is drawn below the lower edge. Cases whose values fall outside of the two extremes of the
straight-line segments are considered outliers and must be reevaluated before proceeding with data
analysis. The figure below shows a box plot.
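The numbers behind a box plot, and the outlier rule based on 1.5 times the IQR, can be obtained with a short Python sketch. The heights below are hypothetical values invented for this illustration (they are not data from the paper), and the function name is likewise an assumption. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which interpolates at positions based on N + 1.

```python
import statistics


def box_plot_summary(values):
    """Five-number summary, fences at 1.5 * IQR, and flagged outliers."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "minimum": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "maximum": ordered[-1],
        "fences": (lower_fence, upper_fence),
        "outliers": outliers,
    }


# Hypothetical heights (cm) of school-age children, with one typing error.
heights_cm = [118, 120, 121, 123, 124, 125, 126, 128, 130, 131, 220]
summary = box_plot_summary(heights_cm)
print(summary["q1"], summary["median"], summary["q3"])  # 121.0 125.0 130.0
print(summary["fences"])                                # (107.5, 143.5)
print(summary["outliers"])                              # [220]
```

A flagged value is a candidate for re-checking against the original records, not an automatic deletion.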

Descriptive Statistics of categorical or qualitative data

To present qualitative data, we determine frequency distributions and present them in tables
and graphs.


1. Simple frequency distribution

To obtain a frequency distribution of categorical data, we simply count how many cases there
are in each category. The frequencies of the categories can be expressed as their absolute number or as
a percentage of the total. Calculating the percentage of a given category is very simple: divide the absolute
frequency by the total and multiply by 100. In the next example, the percentage for the eutrophy group
would be:
Eutrophy (%) = 412 / 521 x 100 = 79.07869

We generally approximate to one decimal place which, in the example above, results in 79.1%.

Nutritional assessment using Gomez's criteria of 521 preschool children.

                          (N)     (%)

Eutrophy                  412    79.1
Mild Malnutrition         104    20.0
Moderate Malnutrition       5     1.0
Severe Malnutrition         0     0.0

Total                     521   100.0

Sometimes, it may be of interest to the researcher to also present the cumulative frequency, obtained from the
accumulated counts (for example, (412 + 104) / 521 x 100 = 99.0 for the second row). See
the next table.
Nutritional assessment using Gomez's criteria of 521 preschool children.

                          (N)     (%)    (%) Accumulated

Eutrophy                  412    79.1     79.1
Mild Malnutrition         104    20.0     99.0
Moderate Malnutrition       5     1.0    100.0
Severe Malnutrition         0     0.0    100.0

Total                     521   100.0
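A frequency table with percentages and accumulated percentages can be built from the raw counts with a few lines of Python. The sketch below uses a function name chosen for this illustration and computes the accumulated column from accumulated counts rather than by adding rounded percentages, which keeps the last row at exactly 100.0.

```python
def frequency_table(counts):
    """Rows of (category, n, percent, accumulated percent)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cases to tabulate")
    rows = []
    running = 0
    for category, n in counts.items():
        running += n
        percent = round(n / total * 100, 1)
        accumulated = round(running / total * 100, 1)
        rows.append((category, n, percent, accumulated))
    return rows


nutrition = {
    "Eutrophy": 412,
    "Mild Malnutrition": 104,
    "Moderate Malnutrition": 5,
    "Severe Malnutrition": 0,
}
table = frequency_table(nutrition)
for row in table:
    print(row)
# ('Eutrophy', 412, 79.1, 79.1)
# ('Mild Malnutrition', 104, 20.0, 99.0)
# ('Moderate Malnutrition', 5, 1.0, 100.0)
# ('Severe Malnutrition', 0, 0.0, 100.0)
```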

When working with quantitative variables, it becomes necessary to group the data into
categories to present them in the form of a frequency distribution. The data is grouped into class intervals,
the number of which should not be small or very large, and it is recommended that it range from 5 to 20.
There are some formulas to determine the number of classes, but logic and common sense seem to be
more useful. It is necessary to keep in mind that class intervals must be established in such a way that all
data can be included in only one of the classes. Below we have a frequency distribution of a quantitative
variable (age in months) grouped into class intervals.


Age distribution (months) of 521 preschool children.

Age (months)      (N)     (%)

36.0 –| 48.0       35     6.7
48.0 –| 60.0       70    13.4
60.0 –| 72.0      168    32.2
72.0 –| 84.0      204    39.2
84.0 –| 96.0       44     8.4
Total             521   100.0

2. Distribution in relation to two qualitative variables – contingency tables

In this case, the objective is to build a table containing information about two or more variables
of a population or sample.
Distribution of 521 preschool children by age group and gender, N (%).

Age (months)    Female        Male          Total

36.0 – 47.9      15 (42.9)     20 (57.1)     35 (100.0)
48.0 – 59.9      41 (58.6)     29 (41.4)     70 (100.0)
60.0 – 71.9      81 (48.2)     87 (51.8)    168 (100.0)
72.0 – 83.9      99 (48.5)    105 (51.5)    204 (100.0)
84.0 – 95.9      24 (54.5)     20 (45.5)     44 (100.0)

Total           260 (49.9)    261 (50.1)    521 (100.0)
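The row percentages in a contingency table such as this one can be reproduced from the counts alone. The Python sketch below uses a function name and a dictionary layout invented for this illustration; the counts are those shown in the table.

```python
def row_percentages(counts):
    """For each row, return (female %, male %, row total)."""
    result = {}
    for label, (female, male) in counts.items():
        total = female + male
        result[label] = (
            round(female / total * 100, 1),
            round(male / total * 100, 1),
            total,
        )
    return result


# (female, male) counts per age group, in months
by_age = {
    "36.0 - 47.9": (15, 20),
    "48.0 - 59.9": (41, 29),
    "60.0 - 71.9": (81, 87),
    "72.0 - 83.9": (99, 105),
    "84.0 - 95.9": (24, 20),
}
percentages = row_percentages(by_age)
print(percentages["36.0 - 47.9"])  # (42.9, 57.1, 35)

total_female = sum(f for f, _ in by_age.values())
total_male = sum(m for _, m in by_age.values())
print(total_female, total_male)    # 260 261
```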

3. Graphical presentation

a) Pie charts
Pie charts are recommended to present frequency distributions. The area of the sector assigned
to each category is proportional to its frequency. The most practical way to determine it, knowing that
the total (100%) corresponds to an angle of 360º, is: Desired angle = (% x 360)/100. For example, for a
frequency of 45% we must take an angle of 162º: Desired angle = (45 x 360)/100 = 162º.
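The angle rule is a one-line calculation; a minimal Python sketch, with a function name chosen for the illustration, is:

```python
def sector_angle(percentage):
    """Angle in degrees of the pie sector for a given percentage."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return percentage * 360 / 100


print(sector_angle(45))   # 162.0
print(sector_angle(100))  # 360.0
```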

Below we present an example of a pie chart.


b) Bar Charts
In the same way as the previous one, this type of graph is recommended for presenting frequency
distributions. In this case, the frequency is related to the height of the bar, and the bars must have the
same width. Below we present a bar graph expressing the distribution of frequencies in relation to family
per capita income.

How to select the appropriate technique for publishing the results?

Guidelines for authors of major medical journals (JAMA, NEJM, BMJ, etc.) are an excellent source
of information. Spriestersbach et al. (2009), Lang & Altman (2015), and Ou et al. (2020) provide general
guidance on the proper presentation of results in articles.
The choice of the best technique should be guided by the type of variable. Additionally, in the
case of quantitative variables, the distribution shape (symmetrical, positively skewed, or negatively
skewed) should be considered. See the examples presented below.
Amorin et al. (2021) conducted a cross-sectional study with 26 children (6-12 years old) from
Londrina, Brazil, with the aim of evaluating eosinophil counts in relation to vitamin D levels. The patients
were stratified into two groups based on the median of vitamin D. Note that for some quantitative
variables, the mean and standard deviation were used, while for others, the median and interquartile
interval were employed. The criterion for choosing the technique to be used was the shape of the
variable’s distribution (symmetrical, positively skewed, or negatively skewed).

Shakti et al. (2014) selected 543 patients with idiopathic pericarditis and pericardial effusion
registered in the Pediatric Health Information System database (PHIS) – USA, with the aim of
characterizing the patients and hospitalization data. Table 1 presents the demographic data and clinical
characteristics of the patients. Please note that the authors chose to present the results of quantitative
variables in the form of median and interquartile range.


Bibliography

Altman DG. Practical statistics for medical research. 1st ed. London: Chapman & Hall, 1991.
Amorin CLC, Oliveira JM, Rodrigues A, Furlanetto KC, Pitta F. J Bras Pneumol. 2021;47(1):e20200279.
doi.org/10.36416/1806-3756/e20200279.
Bland M. An introduction to medical statistics. 2nd ed. New York: Oxford University Press, 1995.
Daniel WW. Biostatistics – A foundation for analysis in the health sciences. 6th ed. New York: John
Wiley & Sons, Inc., 1995.
Devore JL. Probability and Statistics for Engineering and the Sciences. 8th ed. Boston: Brooks/Cole,
Cengage Learning, 2012.
Hazra A, Gogtay N. Biostatistics Series Module 1: Basics of Biostatistics. Indian J Dermatol. 2016; 61(1):
10–20.
Lang TA, Altman DG. Basic statistical reporting for articles published in Biomedical Journals: The
‘‘Statistical Analyses and Methods in the Published Literature’’ or the SAMPL Guidelines. International
Journal of Nursing Studies. 2015; 52:5–9.
Lowry L. Concepts and Applications of Inferential Statistics. URL: http://vassarstats.net/textbook/.
Accessed: 28/10/2023.
Ou F-S, Le-Rademacher JG, Ballman KV, Adjei AA, Mandrekar SJ. Guidelines for Statistical Reporting in
Medical Journals. J Thorac Oncol. 2020 Nov;15(11):1722-1726. doi: 10.1016/j.jtho.2020.08.019.
Shakti D, Hehn R, Gauvreau K, Sundel RP, Newburger JW. Idiopathic Pericarditis and Pericardial Effusion
in Children: Contemporary Epidemiology and Management. J Am Heart Assoc. 2014; 3(6): e001483.
Spriestersbach A, Röhrig B, du Prel J-B, Gerhold-Ay A, Blettner M. Descriptive statistics: the specification
of statistical measures and their presentation in tables and graphs. Part 7 of a series on evaluation of
scientific publications. Dtsch Arztebl Int. 2009; 106(36):578-83. doi: 10.3238/arztebl.2009.0578.
Tukey JW. Exploratory data analysis. London: Addison-Wesley Publishing Company, 1977.
Zar J. Biostatistical analysis. 2nd ed. Englewood Cliffs: Prentice-Hall Inc., 1984.
