Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that helps describe, show or
summarize data in a meaningful way such that, for example, patterns might emerge from the
data. Descriptive statistics do not, however, allow us to make conclusions beyond the data we
have analysed or reach conclusions regarding any hypotheses we might have made. They are
simply a way to describe our data.
Descriptive statistics are very important because if we simply presented our raw data it would be
hard to visualize what the data was showing, especially if there was a lot of it. Descriptive
statistics therefore enables us to present the data in a more meaningful way, which allows
simpler interpretation of the data. For example, if we had the results of 100 pieces of students'
coursework, we may be interested in the overall performance of those students. We would also
be interested in the distribution or spread of the marks. Descriptive statistics allow us to do this.
How to properly describe data through statistics and graphs is an important topic and is discussed in other Laerd Statistics guides. Typically, there are two general types of statistics that are used to describe data:
Measures of central tendency: these are ways of describing the central position of a frequency
distribution for a group of data. In this case, the frequency distribution is simply the distribution
and pattern of marks scored by the 100 students from the lowest to the highest. We can describe
this central position using a number of statistics, including the mode, median, and mean. You can
learn more in our guide: Measures of Central Tendency.
Measures of spread: these are ways of summarizing a group of data by describing how spread out
the scores are. For example, the mean score of our 100 students may be 65 out of 100. However,
not all students will have scored 65 marks. Rather, their scores will be spread out. Some will be
lower and others higher. Measures of spread help us to summarize how spread out these scores
are. To describe this spread, a number of statistics are available to us, including the range,
quartiles, absolute deviation, variance and standard deviation.
When we use descriptive statistics it is useful to summarize our group of data using a
combination of tabulated description (i.e., tables), graphical description (i.e., graphs and charts)
and statistical commentary (i.e., a discussion of the results).
Inferential Statistics
We have seen that descriptive statistics provide information about our immediate group of data.
For example, we could calculate the mean and standard deviation of the exam marks for the 100
students and this could provide valuable information about this group of 100 students. Any group
of data like this, which includes all the data you are interested in, is called a population. A
population can be small or large, as long as it includes all the data you are interested in. For
example, if you were only interested in the exam marks of 100 students, the 100 students would
represent your population. Descriptive statistics are applied to populations, and the properties of
populations, like the mean or standard deviation, are called parameters as they represent the
whole population (i.e., everybody you are interested in).
Often, however, you do not have access to the whole population you are interested in
investigating, but only a limited number of data instead. For example, you might be interested in
the exam marks of all students in the UK. It is not feasible to measure all exam marks of all
students in the whole of the UK so you have to measure a smaller sample of students (e.g., 100
students), which are used to represent the larger population of all UK students. Properties of
samples, such as the mean or standard deviation, are not called parameters, but statistics.
Inferential statistics are techniques that allow us to use these samples to make generalizations
about the populations from which the samples were drawn. It is, therefore, important that the
sample accurately represents the population. The process of achieving this is called sampling
(sampling strategies are discussed in detail in the section, Sampling Strategy, on our sister site).
Inferential statistics arise out of the fact that sampling naturally incurs sampling error and thus a
sample is not expected to perfectly represent the population. The methods of inferential statistics
are (1) the estimation of parameter(s) and (2) testing of statistical hypotheses.
Both descriptive and inferential statistics rely on the same set of data. Descriptive statistics rely solely on this set of data, whilst inferential statistics also rely on this data in order to make generalizations about a larger population.
What are the strengths of using descriptive statistics to examine a distribution of scores?
Beyond the clarity with which descriptive statistics can summarize large volumes of data, there is no uncertainty about the values you get (other than measurement error, etc.), because you are describing every observation rather than estimating from a sample.
Descriptive statistics are limited in that they only allow you to draw conclusions about the people or objects that you have actually measured. You cannot use the data you have collected to generalize to other people or objects (i.e., using data from a sample to infer the properties/parameters of a population). For example, if you tested a drug to beat cancer and it worked in your patients, you could not claim that it would work in other cancer patients relying on descriptive statistics alone (but inferential statistics would give you this opportunity).
CENTRAL TENDENCY
In statistics we have to deal with the mean, mode and the median. These are also called the
„Central Tendency“. These are just three different kinds of „averages” and certainly the most
popular ones.
The mean is simply the average and is considered the most reliable measure of central tendency for making inferences about a population from a single sample. Central tendency describes the tendency of the values in your data to cluster around the mean, mode, or median. The mean is computed by summing all values and dividing by the number of values.
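As a minimal sketch in Python (the exam marks here are made up for illustration):

```python
# Hypothetical exam marks for five students.
scores = [70, 55, 65, 80, 60]

# Mean: sum of all values divided by the number of values.
mean = sum(scores) / len(scores)
print(mean)  # 66.0
```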
The mode is the value or category that occurs most often within the data. A dataset therefore has no mode if no number is repeated or no category appears more than once. It is also possible for a dataset to have more than one mode, which I will cover in the "Modality" section below. The mode is the only measure of central tendency that can be used for categorical variables, since you cannot compute, for example, the average of the variable "gender". You simply report categorical variables as counts and percentages.
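A quick sketch of finding the mode of a categorical variable, using a made-up sample and the standard library's Counter:

```python
from collections import Counter

# Mode for a categorical variable, where an average is undefined.
genders = ["female", "male", "female", "female", "male"]

counts = Counter(genders)
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)  # female 3

# Categorical variables are reported as counts and percentages.
for category, n in counts.items():
    print(f"{category}: {n} ({100 * n / len(genders):.0f}%)")
```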
The median is the "middle" value or midpoint in your data and is also called the "50th percentile". Note that the median is much less affected by outliers and skewed data than the mean. I will explain this with an example: imagine you have a dataset of housing prices that range mostly from $100,000 to $300,000 but contains a few houses that are worth more than 3 million dollars. These expensive houses will heavily affect the mean, since it is the sum of all values divided by the number of values. The median will not be heavily affected by these outliers, since it is only the "middle" value of all data points. The median is therefore a much better-suited statistic to report for such data.
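A small sketch of this effect, using invented prices in the range described above plus a single $3,000,000 outlier:

```python
import statistics

# Hypothetical housing prices: mostly $100,000-$300,000, plus one outlier.
prices = [120_000, 180_000, 200_000, 250_000, 300_000, 3_000_000]

mean_price = statistics.mean(prices)      # pulled far upward by the outlier
median_price = statistics.median(prices)  # barely affected by it

print(mean_price)    # 675000
print(median_price)  # 225000.0
```

The mean lands well above every typical house in the dataset, while the median still describes a typical house.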
In a normal distribution, these measures all fall at the same midline point. This means that the
mean, mode and median are all equal.
MEASURES OF VARIABILITY
The most popular variability measures are the range, interquartile range (IQR), variance, and
standard deviation. These are used to measure the amount of spread or variability within your
data.
The range describes the difference between the largest and the smallest points in your data.
The interquartile range (IQR) is a measure of statistical dispersion between upper (75th) and
lower (25th) quartiles.
While the range measures where your data points begin and end, the interquartile range is a measure of where the majority of the values lie.
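A short sketch of both measures on a made-up dataset; note that `statistics.quantiles` defaults to the "exclusive" method, so other tools may report slightly different quartiles:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11, 12, 15, 21]  # hypothetical scores, sorted

# Range: difference between the largest and smallest data point.
data_range = max(data) - min(data)  # 19

# Quartiles: with n=4, statistics.quantiles returns Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the values
print(data_range, iqr)
```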
The difference between the standard deviation and the variance is often a little bit hard to grasp
for beginners, but I will explain it thoroughly below.
The standard deviation and the variance, like the range and IQR, measure how spread apart our data is (i.e., the dispersion). Both are derived from the mean.
The variance is computed by finding the difference between every data point and the mean,
squaring them, summing them up and then taking the average of those numbers.
The squared differences are used in the calculation because they weight outliers more heavily than points that are near the mean, and because squaring prevents differences above the mean from cancelling out differences below the mean.
The problem with Variance is that because of the squaring, it is not in the same unit of
measurement as the original data.
Let’s say you are dealing with a dataset that contains centimeter values. Your variance would be
in squared centimeters and therefore not the best measurement.
This is why the Standard Deviation is used more often because it is in the original unit. It is
simply the square root of the variance and because of that, it is returned to the original unit of
measurement.
Let’s look at an example that illustrates the difference between variance and standard deviation:
Imagine a data set that contains centimeter values between 1 and 15, which results in a mean of
8. Squaring the difference between each data point and the mean and averaging the squares
renders a variance of 18.67 (squared centimeters), while the standard deviation is 4.3
centimeters.
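The example above can be reproduced in a few lines of Python:

```python
import statistics

# The example above: centimeter values 1 through 15, mean 8.
data = list(range(1, 16))
mean = statistics.mean(data)

# Population variance: average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5  # square root brings us back to centimeters

print(round(variance, 2))  # 18.67
print(round(std_dev, 1))   # 4.3
```

The standard library's `statistics.pvariance()` and `statistics.pstdev()` return the same values without the manual loop.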
When you have a low standard deviation, your data points tend to be close to the mean. A high
standard deviation means that your data points are spread out over a wide range.
Standard deviation is best used when data is unimodal. In a normal distribution, approximately 34% of the data points lie between the mean and one standard deviation above or below the mean. Since a normal distribution is symmetrical, 68% of the data points fall between one standard deviation above and one standard deviation below the mean. Approximately 95% fall between two standard deviations below and two standard deviations above the mean, and approximately 99.7% fall between three standard deviations below and three above the mean.
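This 68-95-99.7 rule can be checked empirically, here with a simulation of 100,000 draws from a standard normal distribution (the seed and sample size are arbitrary choices for this sketch):

```python
import random
import statistics

# Simulate draws from a normal distribution with mean 0 and sd 1.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(100_000)]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)

# Fraction of points within k standard deviations of the mean.
coverage = {}
for k in (1, 2, 3):
    inside = sum(mu - k * sigma <= x <= mu + k * sigma for x in data)
    coverage[k] = inside / len(data)
    print(f"within {k} standard deviation(s): {coverage[k]:.1%}")
```

The printed fractions come out close to 68%, 95%, and 99.7%.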
MODALITY
A distribution can be unimodal, bimodal, or multimodal.
Unimodal means that the distribution has only one peak, i.e., only one frequently occurring score, clustered at the top. A bimodal distribution has two values that occur frequently (two peaks), and a multimodal distribution has more than two frequently occurring values.
SKEWNESS
We speak of a positive skew if the data is piled up on the left, which leaves the tail pointing to the right.
A negative skew occurs if the data is piled up on the right, which leaves the tail pointing to the left. Note that positive skews occur more frequently than negative ones.
A good measure of the skewness of a distribution is Pearson's skewness coefficient, which provides a quick estimate of a distribution's symmetry. In pandas you can compute the skewness simply by calling the "skew()" function.
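As a sketch using only the standard library, Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, on invented right-skewed data (pandas' `skew()` computes the adjusted Fisher-Pearson sample skewness instead, which differs numerically but agrees in sign):

```python
import statistics

# Right-skewed (positively skewed) data: a long tail to the right.
data = [1, 2, 2, 2, 3, 3, 4, 5, 8, 20]

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.pstdev(data)

# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
# Positive values indicate a right skew, negative values a left skew.
pearson_skew = 3 * (mean - median) / sd
print(round(pearson_skew, 2))  # clearly positive for this data
```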
KURTOSIS
Kurtosis describes how heavy or light the tails of a distribution are compared to a normal distribution.
A good way to measure the kurtosis of a distribution mathematically is Fisher's measure of kurtosis, also called excess kurtosis, which subtracts 3 from the fourth standardized moment so that a normal distribution scores zero.
A normal distribution is called mesokurtic and has a kurtosis of, or around, zero. A platykurtic distribution has negative kurtosis, and its tails are very thin compared to the normal distribution. Leptokurtic distributions have positive kurtosis, and their fat tails mean that the distribution produces more extreme values, so more of its variance comes from infrequent extreme deviations.
If you have already recognized that a distribution is skewed, you don't need to calculate its kurtosis, since the distribution is already not normal. In pandas you can view the kurtosis simply by calling the "kurtosis()" function.
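A minimal sketch of Fisher's excess kurtosis using only the standard library (pandas' `kurtosis()` applies a sample-size bias correction on top of this, so its numbers differ slightly):

```python
import statistics

def excess_kurtosis(data):
    """Fisher's excess kurtosis: the fourth standardized moment minus 3,
    so a normal distribution scores approximately 0."""
    n = len(data)
    mu = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum((x - mu) ** 4 for x in data) / (n * sd ** 4) - 3

# A uniform distribution has very thin tails, so it is platykurtic:
# its excess kurtosis is negative (about -1.2).
print(round(excess_kurtosis(list(range(1, 101))), 2))
```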
Summary
This post gave you a proper introduction to descriptive statistics. You learned what a Normal
Distribution looks like and why it is important. Furthermore, you gained knowledge about the
three different kinds of averages (mean, mode and median), also called the Central Tendency.
Afterwards, you learned about the range, interquartile range, variance and standard deviation.
Then we discussed the three types of modality and that you can describe how much a distribution
differs from a normal distribution in terms of Skewness. Lastly, you learned about Leptokurtic,
Mesokurtic and Platykurtic distributions.