QM Topic - Data Description & Presentation
Week 9:
Define and explain the difference between categorical, continuous and discrete data
Define and give examples of nominal, ordinal, interval and ratio data
Categorise example data into the above categories
Define descriptive statistics
Select suitable visualisations and statistical treatments to describe data of differing types*
Week 10
For each of the methods below students can explain their use & use them to manually construct
statistics for data:
Visualisation of categorical data: tables, bar charts (inc. clustered, component and percentage
component), pie charts, pictograms
Visualisation of continuous data: rank order list, frequency tables (with cumulative frequency &
relative frequency/ percentages, cumulative relative frequency and visualisation using
histograms/ ogives).
Display and present data using appropriate visualisations and calculate descriptive statistics;
learn techniques for drawing charts and for using software to display charts and calculate descriptive statistics.
Data type, Definition/Explanation and Examples

Categorical data
Definition/Explanation:
Examples:

Continuous data
Definition/Explanation:
Examples:

Discrete data
Definition/Explanation:
Examples:

Nominal data
Definition/Explanation: Nominal data is the simplest data type. It classifies (or names) data without suggesting any implied relationship between those data. For instance, countries or species of animals are both forms of nominal data.
Examples: For instance, we can’t use a regression model on nominal data, because nominal data lacks the necessary characteristics required to carry out this type of analysis (namely: no dependent and independent variables).

Ordinal data
Definition/Explanation: Ordinal data also classifies data but it introduces the concept of ranking. An example might be labeling animals, but this time by using discrete and imprecise measures of their speed (‘slow’, ‘medium’, ‘fast’).
Examples:

Interval data
Definition/Explanation: Interval data both classifies and ranks data (like ordinal data) but introduces continuous measurements. Examples might be the time of day or temperature measured on either the Celsius or Fahrenheit scale. Importantly, it always lacks a ‘true zero.’ A measurement of zero can be midway through a scale (i.e. you can have minus temperatures).
Examples:

Ratio data
Definition/Explanation: Ratio data classifies and ranks data, and uses measured, continuous intervals, just like interval data. However, unlike interval data, ratio data has a true zero. This basically means that zero is an absolute, below which there are no meaningful values. Speed, age, or weight are all excellent examples since none can have a negative value (you cannot be -10 years old or weigh -160 pounds!)
Examples:
What is nominal data and what is it used for? How is it collected and
analyzed? Learn everything you need to know in this guide.
There are many different industries and career paths that involve working
with data—including psychology, marketing, and, of course, data analytics.
If you’re working with data in any capacity, there are four main data types
(or levels of measurement) to be aware of: nominal, ordinal, interval, and
ratio. Here, we’ll focus on nominal data.
We’ll briefly introduce the four different types of data, before defining what
nominal data is and providing some examples. We’ll then look at how
nominal data can be collected and analyzed.
When we talk about the four different types of data, we’re actually referring
to different levels of measurement. Levels (or scales) of measurement
indicate how precisely a variable has been recorded. The level of
measurement determines how and to what extent you can analyze the
data.
The four levels of measurement are nominal, ordinal, interval, and ratio,
with nominal being the least complex and precise measurement, and ratio
being the most. In the hierarchy of measurement, each level builds upon
the last. So:
Interval data can be categorized and ranked just like ordinal data,
and there are equal, evenly spaced intervals between the categories
(e.g. temperature in Fahrenheit).
Ratio data is just like interval data in that it can be categorized and
ranked, and there are equal intervals between the data points.
Additionally, ratio data has a true zero. Weight in kilograms is an
example of ratio data; if something weighs zero kilograms, it truly
weighs nothing. On the other hand, a temperature of zero degrees
doesn’t mean there is “no temperature”—and that’s the difference
between interval and ratio data.
So, before you start collecting data, it’s important to think about the levels
of measurement you’ll use.
might use a numbering system to denote the different hair colors: say, 1 to
represent brown hair, 2 to represent blonde hair, 3 for black hair, 4 for
auburn hair, 5 for gray hair, and so on.
Although you are using numbers to label each category, these numbers do
not represent any kind of value or hierarchy (e.g. gray hair as represented
by the number 5 is not “greater than” or “better than” brown hair
represented by the number 1, and vice versa).
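To see why such numeric labels carry no value, here is a minimal Python sketch. The responses and both coding schemes are made up for illustration and are not taken from the text. It suggests that an "average" of the codes depends entirely on the arbitrary labelling, whereas the most frequent category does not.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses (invented for illustration).
responses = ["brown", "blonde", "brown", "black", "gray", "brown", "auburn"]

# Two equally valid, arbitrary coding schemes for the same categories.
coding_a = {"brown": 1, "blonde": 2, "black": 3, "auburn": 4, "gray": 5}
coding_b = {"gray": 1, "auburn": 2, "black": 3, "blonde": 4, "brown": 5}

# Averaging the codes gives a different answer for each scheme,
# even though the underlying responses are identical.
mean_a = mean(coding_a[r] for r in responses)
mean_b = mean(coding_b[r] for r in responses)

# The mode is defined on the categories themselves, so relabelling cannot change it.
most_common_category = Counter(responses).most_common(1)[0][0]

print(mean_a, mean_b)
print(most_common_category)  # brown
```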
For example, the variable “hair color” is nominal as it can be divided into
various categories (brown, blonde, gray, black, etc) but there is no
hierarchy to the various hair colors. The variable “education level” is ordinal
as it can be divided into categories (high school, bachelor’s degree,
master’s degree, etc.) and there is a natural order to the categories; we
know that a bachelor’s degree is a higher level of education than high
school, and that a master’s degree is a higher level of education than a
bachelor’s degree, and so on.
So, if there is no natural order to your data, you know that it’s nominal.
As you can see, nominal data is really all about describing characteristics.
With those examples in mind, let’s take a look at how nominal data is
collected and what it’s used for.
5. How is nominal data collected and
what is it used for?
Nominal data helps you to gain insight into a particular population or
sample. This is useful in many different contexts, including marketing,
psychology, healthcare, education, and business—essentially any scenario
where you might benefit from learning more about your target
demographic.
If there are lots of different possible categories, you can use open
questions where the respondent is required to write their answer. For
example, “What is your native language?” or “What is your favorite genre of
music?”
Once you’ve collected your nominal data, you can analyze it. We’ll look at
how to analyze nominal data now.
gathering descriptive statistics to summarize the data, visualizing your
data, and carrying out some statistical analysis.
So how do you analyze nominal data? Let’s take a look, starting with
descriptive statistics.
Note that, in this example dataset, the first two variables—“Preferred mode
of transport” and “Location”—are nominal, but the third variable (“Income”)
is ordinal as it follows some kind of hierarchy (high, medium, low).
At first glance, it’s not easy to see how your data are distributed. For
example, it’s not immediately clear how many respondents answered “bus”
versus “tram,” nor is it easy to see if there’s a clear winner in terms of
preferred mode of transportation.
To bring some order to your nominal data, you can create a frequency
distribution table. This allows you to see how many responses there were
for each category. A simple way to do this in Microsoft Excel is to create a
pivot table.
Here’s what a pivot table would look like for our transportation example:
You can also calculate the frequency distribution as a percentage, allowing
you to see what proportion of your respondents prefer which mode of
transport. Here’s what that would look like in our pivot table:
The mode: The value that appears most frequently within a dataset
The median: The middle value
The mean: The average value
As you can see, descriptive statistics help you to gain an overall picture of
your nominal dataset. Through your distribution tables, you can already
glean insights as to which modes of transport people prefer.
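The count and percentage tables described above can also be sketched in Python using only the standard library. The responses below are invented, and `frequency_table` is a hypothetical helper name, so treat this as an illustration rather than a reproduction of the original pivot tables.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, percentage) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]

# Invented responses to "What is your preferred mode of transport?"
transport = ["bus", "tram", "bus", "train", "bus", "tram", "train", "bus", "tram", "bus"]

table = frequency_table(transport)
for category, count, percent in table:
    print(f"{category:<6} {count:>3} {percent:>6}%")

# The mode is the category in the first (most frequent) row.
mode = table[0][0]
```

The mode is the only measure of central tendency that makes sense here, because the categories have no order and no numeric value.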
Visualizing nominal data
Data visualization is all about presenting your data in a visual format. Just
like the frequency distribution tables, visualizing your nominal data can help
you to see more easily what the data may be telling you.
Some simple yet effective ways to visualize nominal data are through bar
graphs and pie charts. You can do this in Microsoft Excel simply by clicking
“Insert” and then selecting “Chart” from the dropdown menu.
(Non-parametric) statistical tests for
nominal data
While descriptive statistics (and visualizations) merely summarize your
nominal data, inferential statistics enable you to test a hypothesis and
actually dig deeper into what the data are telling you.
Now we want to know how applicable our findings are to the whole
population of people living in London. Of course, it’s not possible to gather
data for every single person living in London; instead, we use the Chi-square
goodness of fit test to see how much, or to what extent, our
observations differ from what we expected or hypothesized.
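As a rough sketch of the arithmetic behind the test, the statistic is the sum of (observed - expected)^2 / expected over all categories. The observed counts and the null hypothesis of equal preferences below are assumptions made for illustration; 5.991 is the standard tabulated 5% critical value for 2 degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E across categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented sample: preferred transport of 90 respondents.
observed = {"bus": 45, "tram": 25, "train": 20}

# Null hypothesis (an assumption for this sketch): all modes are equally popular.
n = sum(observed.values())
expected = {category: n / len(observed) for category in observed}

chi2 = chi_square_statistic(observed.values(), expected.values())
degrees_of_freedom = len(observed) - 1

# Tabulated 5% critical value of the chi-square distribution for 2 degrees of freedom.
CRITICAL_5_PERCENT_DF2 = 5.991
reject_null = chi2 > CRITICAL_5_PERCENT_DF2

print(f"chi-square = {chi2:.2f}, df = {degrees_of_freedom}, reject H0: {reject_null}")
```

With these made-up counts the statistic exceeds the critical value, so the hypothesis of equal preferences would be rejected at the 5% level.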
Explained the difference between nominal and ordinal data: Both are
divided into categories, but with nominal data, there is no hierarchy or
order to the categories.
Shared some examples of nominal data: Hair color, nationality, blood
type, etc.
Introduced descriptive statistics for nominal data: Frequency
distribution tables and the measure of central tendency (the mode).
Looked at how to visualize nominal data using bar graphs and pie
charts.
Introduced non-parametric statistical tests for analyzing nominal data:
The Chi-square goodness of fit test (for one nominal variable) and
the Chi-square test of independence (for exploring the relationship
between two nominal variables).
What Is Ordinal Data?
https://careerfoundry.com/en/blog/data-analytics/what-is-ordinal-data/
What is ordinal data, how is it used, and how do you collect and
analyze it? Find out in this comprehensive guide.
nominal data
ordinal data
interval data
ratio data
If the concept of these data types is completely new to you, we’ll start with
a quick summary of the four different types, and then explore the various
aspects of ordinal data in a bit more detail.
Fortunately, to make this easier, all types of data fit into one of four broad
categories: nominal, ordinal, interval, and ratio data. While these are
commonly referred to as ‘data types,’ they are really different scales
or levels of measurement.
The first two types of data, known as categorical data, are nominal and
ordinal. These two scales take relatively imprecise measures.
While this makes them easier to analyze, it also means they offer less
accurate insights. The next two types of data are interval and ratio. These
are both types of numerical data, which makes them more complex. They
are more difficult to analyze but have the potential to offer much richer
insights.
Ordinal data classifies data while introducing an order, or ranking.
For instance, measuring economic status using the hierarchy:
‘wealthy’, ‘middle income’ or ‘poor.’ However, there is no clearly
defined interval between these categories.
Interval data classifies and ranks data but also introduces measured
intervals. A great example is temperature scales, in Celsius or
Fahrenheit. However, interval data has no true zero, i.e. a
measurement of ‘zero’ can still represent a quantifiable measure
(such as zero Celsius, which is simply another measure on a scale
that includes negative values).
Ratio data is the most complex level of measurement. Like interval
data, it classifies and ranks data, and uses measured intervals.
However, unlike interval data, ratio data also has a true zero. When a
variable equals zero, there is none of this variable. A good example
of ratio data is the measure of height—you cannot have a negative
measure of height.
However, it’s important to learn how to distinguish them, because the type
of data you’re working with determines the statistical techniques you can
use to analyze it. Data analysis involves using descriptive analytics (to
summarize the characteristics of a dataset) and inferential statistics (to infer
meaning from those data).
These comprise a wide range of analytical techniques, so before collecting
any data, you should decide which level of measurement is best for your
intended purposes.
While ordinal data is more complex than nominal data (which has no
inherent order) it is still relatively simplistic.
For instance, the terms ‘wealthy’, ‘middle income’, and ‘poor’ may give you
a rough idea of someone’s economic status, but they are an imprecise
measure–there is no clear interval between them. Nevertheless, ordinal
data is excellent for ‘sticking a finger in the wind’ if you’re taking broad
measures from a sample group and fine precision is not a requirement.
While ordinal data is non-numeric, it’s important to understand that it can
still contain numerical figures. However, these figures can only be used as
categorizing labels, i.e. they should have no inherent mathematical value.
For instance, if you were to measure people’s economic status you could
use number 3 as shorthand for ‘wealthy’, number 2 for ‘middle income’, and
number 1 for ‘poor.’ At a glance, this might imply numerical value, e.g. 3 =
high and 1 = low. However, the numbers are only used to denote
sequence. You could just as easily switch 3 with 1, or with ‘A’ and ‘B’ and it
would not change the value of what you’re ordering; only the labels used to
order it.
For instance, nominal data may measure the variable ‘marital status,’ with
possible outcomes ‘single’, ‘married’, ‘cohabiting’, ‘divorced’ (and so on).
However, none of these categories are ‘less’ or ‘more’ than any other.
Another example might be eye color. Meanwhile, ordinal data always has
an inherent order.
If a qualitative dataset lacks order, you know you’re dealing with nominal
data.
4. How is ordinal data collected and
what is it used for?
Ordinal data are usually collected via surveys or questionnaires. Any type
of question that ranks answers using an explicit or implicit scale can be
used to collect ordinal data. An example might be:
This commonly recognized type of ordinal question uses the Likert Scale,
which we described briefly in the previous section. Another example might
be:
It’s worth noting that the Likert Scale is sometimes used as a form of
interval data. However, this is strictly incorrect. That’s because Likert
Scales use discrete values, while interval data uses continuous
values with a precise interval between them.
This is particularly prevalent in sectors like finance, marketing, and
insurance, but it is also used by governments, e.g. the census, and is
generally common when conducting customer satisfaction surveys (in any
industry).
For now, though, let’s see what kinds of descriptive and inferential
statistics you can measure using ordinal data.
Frequency distribution
Measures of central tendency: Mode and/or median
Measures of variability: Range
Frequency distribution
Frequency distribution describes how your ordinal data are distributed.
For instance, let’s say you’ve surveyed students on what grade they’ve
received in an examination. Possible grades range from A to C. You can
summarize this information using a pivot table or frequency table, with
values represented either as a percentage or as a count. To illustrate using
a very simple example, one such table might look like this:
As you can see, the values in the sum column show how many students
received each possible grade. This allows you to see how the values are
distributed. Another option is also to visualize the data, for instance using a
bar plot.
Viewing the data visually allows us to easily see the frequency distribution.
Note the hierarchical relationship between categories. This is different from
the other type of categorical data, nominal data, which lacks any hierarchy.
Measures of central tendency: Mode
and/or median
The mode (the value which is most often repeated) and median (the central
value) are two measures of what is known as ‘central tendency.’ There is
also a third measure of central tendency: the mean. However, because
ordinal data is non-numeric, it cannot be used to obtain the mean. That’s
because identifying the mean requires mathematical operations that cannot
be meaningfully carried out using ordinal data.
In this case, we can also identify the median value. The median value is
the one that separates the top half of the dataset from the bottom half. If
you imagined all the respondents’ answers lined up end-to-end, you could
then identify the central value in the dataset. With 165 responses (as in our
grades example) the central value is the 83rd one. This falls under the
grade B.
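The grade table itself is not reproduced in this copy, so the sketch below uses invented counts that add up to the 165 responses mentioned above. It walks through the ordered categories, accumulating frequencies until the middle position is reached. `ordinal_median` is a hypothetical helper name.

```python
def ordinal_median(ordered_counts):
    """Return (category, position) for the middle response.

    ordered_counts: list of (category, count) pairs, already in rank order.
    For an even total this returns the lower of the two middle positions.
    """
    total = sum(count for _, count in ordered_counts)
    middle_position = (total + 1) // 2  # e.g. position 83 when there are 165 responses
    running_total = 0
    for category, count in ordered_counts:
        running_total += count
        if running_total >= middle_position:
            return category, middle_position
    raise ValueError("no responses")

# Invented counts, listed from the highest grade to the lowest (total = 165).
grades = [("A", 30), ("A-", 40), ("B", 50), ("B-", 25), ("C", 20)]

median_grade, position = ordinal_median(grades)
mode_grade = max(grades, key=lambda pair: pair[1])[0]

print(f"median = {median_grade} (response number {position}), mode = {mode_grade}")
```

Note that only the order of the categories is used; no arithmetic is performed on the grades themselves, which is why the median is available for ordinal data while the mean is not.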
The range describes the difference between the smallest and largest value.
To calculate this, you first need to use numeric codes to represent each
grade, i.e. A = 1, A- = 2, B = 3, etc. The range would be 5 – 1 = 4. So in this
simple example, the range is 4. This is an easy calculation to carry out. The
range is useful because it offers a basic understanding of how spread out
the values in a dataset are.
The Mann-Whitney U-test
The Mann-Whitney U test lets you compare whether two samples come
from the same population.
We can use this test to determine whether two samples have been
selected from populations with an equal distribution or if there is a
statistically significant difference.
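For a feel of how the test works, here is a minimal sketch that computes only the U statistic from pooled ranks (tied values share the average of their ranks). The scores are invented, the helper names are hypothetical, and judging significance would still require a U table or a statistics package.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b); the smaller of the two is the usual test statistic."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b

# Invented satisfaction scores (1 = very unhappy ... 5 = very happy) from two groups.
group_a = [1, 2, 2, 3, 3]
group_b = [3, 4, 4, 5, 5]

u_a, u_b = mann_whitney_u(group_a, group_b)
u_statistic = min(u_a, u_b)
print(u_a, u_b, u_statistic)
```

A U value close to zero, as here, means the two groups barely overlap once ranked; a value near half of n_a x n_b means they are thoroughly mixed.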
Spearman’s rank correlation coefficient explores possible relationships (or
correlations) between two ordinal variables.
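One common way to compute it, sketched below, is to rank both variables and then take the ordinary (Pearson) correlation of the ranks. The ordinal codes are invented and the helper names are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented ordinal codes: education level (1 to 4) and job satisfaction (1 to 5).
education = [1, 2, 2, 3, 4, 4]
satisfaction = [2, 1, 3, 3, 5, 4]

rho = spearman(education, satisfaction)
print(f"Spearman's rho = {rho:.3f}")
```

Because only the ranks are used, any order-preserving relabelling of the categories leaves the result unchanged.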
Don’t worry if these models are complex to get your head around. At this
stage, you just need to know that there are a wide range of statistical
methods at your disposal. While this means there is lots to learn, it also
offers the potential for obtaining rich insights from your data.
Explained the difference between ordinal and nominal data: Both are types of
categorical data. However, nominal data lacks hierarchy, whereas ordinal
data ranks categories using discrete values with a clear order.
Shared some examples of ordinal data: Likert scales, education level, and
military rankings.
Highlighted the descriptive statistics you can obtain using ordinal data:
Frequency distribution, measures of central tendency (the mode and median),
and variability (the range).
What Is Interval Data?
https://careerfoundry.com/en/blog/data-analytics/what-is-interval-data/
What is interval data and how is it used? What’s the best way to
collect and analyze it? Find out in this guide.
four data types are not mutually exclusive but rather belong to a hierarchy,
where each level of measurement builds on the previous one.
The simplest levels of measurement are nominal and ordinal data. These
are both types of categorical data that take useful but imprecise measures
of a variable. They are easier to work with but offer less accurate insights.
Building on these are interval data and ratio data, which are both types
of numerical data. While these are more complex, they can offer much
richer insights.
Nominal data is the simplest (and most imprecise) data type. It uses
labels to identify values, without quantifying how those values relate
to one another e.g. employment status, blood type, eye color, or
nationality.
Ordinal data also labels data but introduces the concept of ranking. A
dataset of different qualification types is an example of ordinal data
because it contains an explicit, increasing hierarchy, e.g. High School
Diploma, Bachelor’s, Master’s, Ph.D., etc.
Interval data categorizes and ranks data, and introduces precise and
continuous intervals, e.g. temperature measurements in Fahrenheit
and Celsius, or the pH scale. Interval data always lack what’s known
as a ‘true zero.’ In short, this means that interval data can contain
negative values and that a measurement of ‘zero’ can represent a
quantifiable measure of something.
Ratio data categorizes and ranks data, and uses continuous
intervals (like interval data). However, it also has a true zero, which
interval data does not. Essentially, this means that when a variable is
equal to zero, there is none of this variable. An example of ratio data
would be temperature measured on the Kelvin scale, for which there
is no measurement below absolute zero (which represents a total
absence of heat).
Why do the different levels of
measurement matter?
Distinguishing between the different levels of measurement helps you decide which
statistical technique to use for analysis. For example, data analysts commonly use
descriptive techniques (to summarize the characteristics of a dataset) and inferential
techniques (to infer broader meaning from those data). Understanding what level of
measurement you have will help narrow down the type of analysis you can carry out. That’s
because the level of measurement has implications for the type of calculations that are
possible using those data. When collecting data, then, it’s important to first decide what
types of insights you require. This will determine which level of measurement to use.
Interval data is a type of quantitative (numerical) data. It groups variables
into categories and always uses some kind of ordered scale. Furthermore,
interval values are always ordered and separated using an equal measure
of distance. A very good example is the Celsius or Fahrenheit temperature
scales: each notch on the thermometer directly follows the previous one,
and each is the same distance apart. This type of continuous data is useful
because it means you can carry out certain mathematical operations, e.g.
determining the difference between values using subtraction and
addition. This makes interval data more precise than the levels of measurement
that come below it, i.e. nominal or ordinal data, which are both non-numeric.
Of the four levels of measurement, interval data is the third level and the
second most complex. By introducing numerical values, it is considerably more
useful for carrying out statistical analyses than nominal or ordinal data.
multiplied or divided do not offer meaningful insights (this has
important implications for the type of analyses you can carry out).
Using interval data, you can calculate the following summary
statistics: frequency distribution; mode, median, and mean; and the
range, standard deviation, and variance of a dataset.
SAT scores (900, 950, 1000, 1050, 1100 etc.)
Credit ratings (20, 40, 60, 80, 100)
Dates (1740, 1840, 1940, 2040, 2140, etc.)
Note that the distance between the intervals is always equal. This is the
same as for ratio data. However, what distinguishes interval from ratio data
is the lack of a true zero: a temperature in Celsius, for example, can be
negative. This is important because it means you cannot carry out ratio
calculations, i.e. zero on the Celsius scale is an arbitrary point (the scale
goes down to -273.15 degrees), so you cannot say that +20 degrees is twice as
warm as +10 degrees.
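A quick way to convince yourself is to express the same two temperatures on different scales. In this small sketch (the conversion helpers are hypothetical names, the formulas are the standard ones), the "ratio" changes with the scale while the difference stays consistent.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

def celsius_to_kelvin(celsius):
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0

# The same two temperatures give a different "ratio" on every scale.
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences behave consistently: a 10 degree Celsius gap is an 18 degree Fahrenheit gap.
diff_fahrenheit = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cool_c)

print(ratio_celsius, round(ratio_fahrenheit, 2), round(ratio_kelvin, 3))
print(diff_fahrenheit)
```

Only the Kelvin ratio is physically meaningful, because Kelvin is the one scale of the three with a true zero.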
of automated collection is that it allows you to compare past and present
data without needing to measure it directly, which can be impractical.
In reality, because the vast majority of numeric scales have a true zero,
most types of quantitative data are ratio data, not interval data. Interval
data is generally collected and used for very specific use-cases. However,
it is still important to understand the difference.
Frequency distribution
Central tendency: Mode, median, and mean
Variability: Range, standard deviation, and variance
Let’s look at each of these now.
Frequency distribution
Frequency distribution looks at how data are distributed. Let’s say you take
temperature measurements in the city you live in every day throughout the
year. Your measurements range from -15 degrees Fahrenheit to +90
degrees Fahrenheit. You might represent this information using a table.
Using this simple example, here’s how this might look:
The important thing to note here is that the relationship between different
categories is both hierarchical and evenly spread, i.e. the number of
degrees Fahrenheit measured in the category ‘30 to 45’ is the same as the
number of degrees Fahrenheit measured in the category ‘45 to 60,’ and so
on.
It’s easy to identify the mode by looking at the pie chart or pivot table. As
we can see, throughout the year, the temperature most often falls
somewhere between 60 and 75 degrees Fahrenheit.
We can also identify the median value. This is the value at the center of
your dataset. Since measurements in our temperature dataset were taken
on 365 days of the year, we can determine that the median value is the 183rd
value. This is the 45 to 60 degrees Fahrenheit category. The center point of
this category is 52.5 degrees, so this is our median value (or the best
possible estimate, using grouped data).
Finally, we can calculate the mean temperature. For grouped data, this
involves first calculating the midpoint of each group. We can add this to our
table. Next, we must find the product of each midpoint and its
corresponding frequency, which we can also add to our table.
By dividing the sum of the midpoint x frequency products by the sum of the
frequencies themselves, we obtain our mean temperature. Doing a quick calculation
(20,437.5 divided by 365) gives us a mean temperature of 56 degrees
Fahrenheit.
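The grouped-data calculations above can be sketched in a few lines of Python. The original frequency table is not reproduced in this copy, so the frequencies below are invented; they were chosen only so that the totals agree with the figures quoted in the text (365 days and 20,437.5).

```python
# (lower bound, upper bound, number of days), in degrees Fahrenheit.
# Invented frequencies, consistent with the totals quoted in the text.
bands = [(-15, 0, 5), (0, 15, 10), (15, 30, 30), (30, 45, 45),
         (45, 60, 100), (60, 75, 110), (75, 90, 65)]

total_days = sum(freq for _, _, freq in bands)

# Grouped mean: sum of (midpoint x frequency) divided by the sum of frequencies.
weighted_sum = sum((low + high) / 2 * freq for low, high, freq in bands)
grouped_mean = weighted_sum / total_days

# Mode: the band with the highest frequency.
modal_band = max(bands, key=lambda band: band[2])[:2]

# Median: the band containing the middle (183rd) observation.
middle_position = (total_days + 1) // 2
running_total = 0
for low, high, freq in bands:
    running_total += freq
    if running_total >= middle_position:
        median_band = (low, high)
        break
median_estimate = sum(median_band) / 2  # midpoint of the median band

print(total_days, weighted_sum, round(grouped_mean))
print(modal_band, median_band, median_estimate)
```

Using the band midpoint as the median is the same simple estimate the text uses; more refined interpolation within the band is possible but not shown here.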
Inferential statistics for interval data
To analyze quantitative (rather than qualitative) datasets, it is best to use
what are known as parametric tests, i.e. tests that assume the data follow a
particular distribution (typically the normal distribution) described by
parameters such as the mean and standard deviation. You can also use
non-parametric tests (more commonly used for qualitative, non-numerical data,
i.e. nominal and ordinal data). However, these generally have less statistical
power when the parametric assumptions hold.
T-test
Analysis of variance (ANOVA)
Pearson correlation coefficient
Simple linear regression
T-test
The t-test helps to determine if there’s a statistically significant difference
between the means of two data samples that may be related to one another.
For instance, is there a difference in average credit rating between adults in
the age group 30-40 and the age group 40-50? T-tests are commonly used
for hypothesis testing. To carry out a t-test, all you need to know is the
mean difference between values of each data sample, the standard
deviation of each sample, and the number of data values in each group.
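As a minimal sketch of how those three ingredients combine, the example below computes the t statistic for two independent samples using Welch's unequal-variance form. That variant is an assumption made here for illustration (the text does not say which t-test it has in mind), and the credit ratings are invented.

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b), using sample variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = (variance(sample_a) / n_a + variance(sample_b) / n_b) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Invented credit ratings for two age groups.
ages_30_to_40 = [60, 65, 70, 75, 80]
ages_40_to_50 = [70, 75, 80, 85, 90]

t_statistic = welch_t(ages_30_to_40, ages_40_to_50)
print(f"t = {t_statistic:.2f}")
```

The statistic on its own is not a verdict: it still has to be compared against the t distribution (via a table or a statistics package) to obtain a p-value.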
Analysis of variance
Analysis of variance (ANOVA) compares the mean values across three or
more data samples. For instance, is there a difference in credit rating
between adults in the age groups 30-40, 40-50, and 50-60? In essence,
you can use ANOVA in the same way as a t-test, but for more than two
samples. However many samples you have, ANOVA will help determine
the relationship between the dependent and independent variables.
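To make the idea concrete, here is a rough sketch of the one-way ANOVA F statistic, which compares the variation between group means with the variation inside the groups. The credit ratings are invented and `one_way_anova_f` is a hypothetical helper name.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented credit ratings for three age groups.
groups = [
    [60, 65, 70],  # ages 30-40
    [70, 75, 80],  # ages 40-50
    [80, 85, 90],  # ages 50-60
]

f_statistic = one_way_anova_f(groups)
print(f"F = {f_statistic:.1f}")
```

A large F suggests the group means differ by more than the within-group scatter would explain; as with the t-test, a table or statistics package is needed to turn it into a p-value.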
Using this approach, values will always fall between 1 and -1. A value of 1
indicates a perfect positive correlation, while a value of -1 indicates a perfect
negative correlation. A value of 0 suggests no linear correlation at all
between variables.
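The three reference points can be reproduced with a short sketch of the Pearson formula (covariance divided by the product of the standard deviations). All figures are invented to hit the extreme cases exactly.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

income = [20, 30, 40, 50, 60]           # invented, in thousands

rating_rising = [30, 40, 50, 60, 70]    # rises in step with income
rating_falling = [70, 60, 50, 40, 30]   # falls in step with income
rating_unrelated = [50, 40, 60, 40, 50] # no linear trend

r_rising = pearson_r(income, rating_rising)
r_falling = pearson_r(income, rating_falling)
r_unrelated = pearson_r(income, rating_unrelated)

print(r_rising, r_falling, r_unrelated)
```

Real data almost never lands exactly on 1, -1 or 0; values in between indicate weaker or stronger linear relationships.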
For instance, can a person’s income be used to predict their credit rating?
Simple linear regression uses only two variables, but there are variations
on the model. For instance, multiple linear regression aims to
predict the dependent output variable based on two or more
independent input variables.
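For the simple two-variable case, the best-fitting line can be found with ordinary least squares. The sketch below uses invented income and credit rating figures, and `fit_line` is a hypothetical helper name.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

income = [20, 30, 40, 50, 60]         # invented, thousands per year
credit_rating = [35, 45, 50, 65, 70]  # invented

slope, intercept = fit_line(income, credit_rating)

# Use the fitted line to predict the rating for an income of 45 (thousand).
predicted_rating = intercept + slope * 45

print(f"rating = {intercept:.1f} + {slope:.2f} x income")
print(f"predicted rating at 45: {predicted_rating:.1f}")
```

Here income is the independent (input) variable and credit rating is the dependent (output) variable being predicted.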
What is Ratio Data? Definition,
Characteristics and Examples
https://careerfoundry.com/en/blog/data-analytics/what-is-ratio-data/
What is ratio data? What’s it used for? And how can we best collect
and analyze it? Find out in this guide.
First up, though, it’s important to understand that the four data types do not
stand alone; they are closely related. We’ll start by summarizing the four.
We’ll then explore the various aspects of ratio data in closer detail.
Broadly speaking, whatever data you are using, you can be certain that it
falls into one or more of four categories: nominal, ordinal, interval, and
ratio. Introduced in 1946 by
the psychologist Stanley Smith Stevens, these four categories are also
known as the levels of measurement. They are now widely used across
the sciences and within data analytics to define the degree of precision to
which a variable has been measured. As a hierarchical scale, each level
builds on the one that comes before it.
The most basic levels of measurement are nominal and ordinal data. These
are types of categorical data that take relatively simplistic measures of a
given variable. Building on these are interval and ratio data—more complex
measures. These are both types of numerical data. They can be harder to
analyze but will, in general, lead to much richer, actionable insights. Let’s
briefly look at what each level measures:
Nominal data is the simplest data type. It classifies (or names) data
without suggesting any implied relationship between those data. For
instance, countries or species of animals are both forms of nominal
data.
Ordinal data also classifies data but it introduces the concept of
ranking. An example might be labeling animals, but this time by using
discrete and imprecise measures of their speed (‘slow’, ‘medium’,
‘fast’).
Interval data both classifies and ranks data (like ordinal data) but
introduces continuous measurements. Examples might be the time of
day or temperature measured on either the Celsius or Fahrenheit
scale. Importantly, it always lacks a ‘true zero.’ A measurement of
zero can be midway through a scale (i.e. you can have minus
temperatures).
Ratio data classifies and ranks data, and uses measured,
continuous intervals, just like interval data. However, unlike interval
data, ratio data has a true zero. This basically means that zero is an
absolute, below which there are no meaningful values. Speed, age,
or weight are all excellent examples since none can have a negative
value (you cannot be -10 years old or weigh -160 pounds!)
All statistical techniques fall into two broad categories: descriptive statistics (which
summarize a dataset’s features) and inferential statistics (which help us make
predictions based on those data). Determining if you’re working with nominal,
ordinal, interval, or ratio data helps narrow down which technique to use.
Conversely, determining what kind of analysis you wish to carry out (i.e. what your
goal is) will tell you which type of data measurement you need to take.
Ratio data is a form of quantitative (numeric) data. It measures variables on
a continuous scale, with an equal distance between adjacent values. While
it shares these features with interval data (another type of quantitative
data), a distinguishing property of ratio data is that it has a ‘true zero.’ In
other words, a measure of zero on a ratio scale is absolute: ratio data can
never have a negative value. This is important because it allows us to apply
all the possible mathematical operations (addition, subtraction,
multiplication, and division) when carrying out statistical analyses.
It’s worth noting that while ratio data must have a true zero, it does not
necessarily require an endpoint. A ratio scale can have potentially infinite
values or a finite endpoint. The only important distinguisher over interval
data is the existence of a true zero.
Because ratio data have a true zero, they can be added,
subtracted, multiplied, and divided (unlike the other three types of
data).
Ratio data can be used to calculate measures including frequency
distribution; mode, median, and mean; range, standard deviation,
variance, and coefficient of variation.
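The coefficient of variation (standard deviation divided by the mean) is the one measure in that list that genuinely needs a true zero. The sketch below, using invented figures, shows that it is unchanged when ratio data switch units, but becomes scale-dependent for interval data such as Celsius temperatures.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Ratio data: the same invented weights in kilograms and in pounds.
weights_kg = [60, 65, 70, 75, 80]
weights_lb = [w * 2.20462 for w in weights_kg]

cv_kg = coefficient_of_variation(weights_kg)
cv_lb = coefficient_of_variation(weights_lb)  # same value: the units cancel out

# Interval data: the same invented temperatures in Celsius and in Kelvin.
temps_c = [10, 15, 20]
temps_k = [t + 273.15 for t in temps_c]

cv_c = coefficient_of_variation(temps_c)
cv_k = coefficient_of_variation(temps_k)  # very different: the zero point moved

print(round(cv_kg, 4), round(cv_lb, 4))
print(round(cv_c, 4), round(cv_k, 4))
```

Shifting the zero point changes the mean but not the standard deviation, which is why the Celsius figure is not a meaningful measure of relative spread.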
Height (5ft. 8in., 5ft. 9in., 5ft. 10in., 5ft. 11in., 6ft. 0in. etc.)
Price of goods ($0, $5, $10, $15, $20, $30, etc.)
Age in years (from zero to 100+)
Distance (from zero miles/km upwards)
Time intervals (might include race times or the number of hours spent
watching Netflix!)
As you can see, ratio data is all about measuring continuous variables on
equidistant scales.
It’s important to note that while values in a ratio dataset must be capable of
reaching true zero, this is not the same thing as actually having values that
go down to zero. To illustrate, if you’re measuring the heights of a group of
adults, you probably won’t obtain many measurements below 5 feet. The
existence of true zero simply means that the measurement scale you are
using has a definitive starting point of zero, i.e. you could reach zero in
theory, even if not in practice.
How many hours do you use your phone per day?
0-3 hours
3-6 hours
6-9 hours
More than 9 hours
20-25 kg
26-30 kg
31-35 kg
36-40 kg
Next, let’s see how ratio data is typically collected and used in everyday
life.
social media. Plus, if your scale lacks equal distance between measures,
you are not collecting ratio data, but ordinal data.
Like interval data, ratio data are sometimes collected through direct
observation, too. For instance, a zoologist might measure the heights of
various elephants. To drive the point home, note once again that height
measurements have a true zero, i.e. an elephant with a height of zero is an
absence of an elephant.
Where you have the choice, ratio data is generally the most flexible type
of data to work with. This is because it allows you to apply the widest
range of statistical techniques. Even in the case of summary statistics
(the most fundamental type of measurement) it allows you to scrutinize
data at a deeper level than is possible for nominal, ordinal, and interval data.
The two main types of statistical analysis are descriptive and inferential
statistics. Descriptive statistics summarize a dataset’s characteristics.
Inferential statistics allow you to test hypotheses or make predictions. Let’s
look at each more closely, in relation to ratio data.
Frequency distribution
Central tendency: Mode, median, and mean
Variability: Range, standard deviation, variance, and coefficient of
variation
Almost all of these statistics can also be measured using interval data. The
only exception is the coefficient of variation. For more detail on how you
might obtain each of these measures, check out section five of our post on
interval data, which uses more explicit examples.
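A short sketch of why the coefficient of variation is the exception: it divides the standard deviation by the mean, so it only gives a stable answer when the scale has a true zero. The values below are invented for illustration, and the helper name `coefficient_of_variation` is our own.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Ratio data: weight has a true zero, so changing units only rescales values.
weights_kg = [50.0, 60.0, 70.0, 80.0]
weights_lb = [w * 2.20462 for w in weights_kg]

# Interval data: Celsius and Fahrenheit place zero at different, arbitrary points.
temps_c = [10.0, 15.0, 20.0, 25.0]
temps_f = [c * 9 / 5 + 32 for c in temps_c]

cv_kg = coefficient_of_variation(weights_kg)
cv_lb = coefficient_of_variation(weights_lb)
cv_c = coefficient_of_variation(temps_c)
cv_f = coefficient_of_variation(temps_f)

print(cv_kg, cv_lb)  # the same up to rounding error
print(cv_c, cv_f)    # clearly different for the same temperatures
```

The weights give the same coefficient of variation in either unit, while the same four temperatures give different answers in Celsius and Fahrenheit.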
Frequency distribution
As the name suggests, frequency distribution explores how a dataset’s
values are distributed. The most common way to measure frequency
distribution is to represent your data using a pivot table or some kind of
graph. For example, the bar graph here shows the distribution of weight in
a sample of marlins.
A bar plot showing the estimated weight of marlins. The x-axis shows
weight, the y-axis shows frequency. Source: John C. Holdsworth /
ResearchGate
Remember: while you can measure frequency distribution for many types
of data, for the data to count as ratio data it must have a true zero, as a
measure of weight does (on any scale).
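As a rough sketch, a grouped frequency table with cumulative and relative frequencies can be built as follows. The weights and class boundaries are hypothetical, and each class is assumed to include its lower bound and exclude its upper bound so that no value falls into two classes.

```python
# Hypothetical weights in kg (ratio data)
weights_kg = [22, 24, 27, 28, 29, 31, 33, 34, 36, 39]

# Assumed class intervals: lower bound inclusive, upper bound exclusive
classes = [(20, 25), (25, 30), (30, 35), (35, 40)]

def frequency_table(values, classes):
    """Return one row per class with frequency, cumulative and relative columns."""
    total = len(values)
    cumulative = 0
    rows = []
    for lower, upper in classes:
        count = sum(1 for v in values if lower <= v < upper)
        cumulative += count
        rows.append({
            "class": f"{lower} to under {upper}",
            "frequency": count,
            "cumulative": cumulative,
            "relative": count / total,
            "cumulative_relative": cumulative / total,
        })
    return rows

table = frequency_table(weights_kg, classes)
for row in table:
    print(row)
```

The `frequency` column is what a histogram plots, and the `cumulative` column is what an ogive plots against the upper class boundaries.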
The mode (the value that’s repeated most often throughout the data)
The median (the central value in the dataset)
The mean (the dataset’s average value)
The measures of central tendency are useful summary statistics for judging
the relative positions and importance of different values within a dataset.
For example, we can use these measures to determine whether a value
falls below or above the mean, how far from the mean it sits, what this
implies, and so on. This is all beneficial when you are first dealing with a
new set of data since it helps determine the best way to analyze it in more
depth.
That’s because parametric techniques are uniquely suited to quantitative
data (which has clearly defined parameters). Parametric tests offer a
deeper level of insight than non-parametric tests. While you can still apply
non-parametric tests to ratio data, they will not make the most of a ratio
dataset’s full range of characteristics.
Here are some statistical tests you can use on ratio data:
T-test
T-tests help you identify whether or not a statistically significant
difference exists between the mean values of two separate data samples.
For instance, is there a difference in average heights between adults who
weigh less than 180 pounds and those who weigh more than 180 pounds? If
you want to test your hypothesis, the t-test is very useful. While there’s
a range of different versions, in general, all you need is the difference
between the sample means, the standard deviation of each sample, and the
number of values in each sample.
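To make those ingredients concrete, here is a minimal sketch of one version of the test statistic, Welch's t (which does not assume equal variances). The choice of Welch's variant and the function name are assumptions made for this illustration; the source does not say which version is meant.

```python
import math
import statistics

def welch_t_statistic(sample_a, sample_b):
    """t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    return (mean_a - mean_b) / standard_error

# Tiny made-up samples, just to show the call
t_value = welch_t_statistic([1, 2, 3], [4, 5, 6])
print(round(t_value, 4))
```

The further the statistic is from zero, the stronger the evidence of a difference in means. Turning it into a p-value requires the t distribution, which a statistics package (for example `scipy.stats.ttest_ind` with `equal_var=False`) handles for you.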
weight and the amount they spend on weekly groceries? By plotting
quantitative variables on a graph, you can determine the direction and
strength of correlation between the different variables. When calculating
Pearson’s r, values always fall between -1 and 1. A value of 1 indicates a
perfect positive linear correlation, while -1 indicates a perfect negative
linear correlation. A value of 0 shows no linear correlation between variables.
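Pearson's r can be computed directly from its definition: the sum of the products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below uses invented numbers, and the function name `pearson_r` is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

rising = pearson_r([1, 2, 3], [2, 4, 6])          # y rises exactly with x
falling = pearson_r([1, 2, 3], [6, 4, 2])         # y falls exactly with x
unrelated = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])  # no linear trend
print(rising, falling, unrelated)
```

The three calls illustrate the two extremes and the midpoint of the scale. Note that the last dataset does have a pattern (a U shape), which is why r is best read as a measure of linear association only.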
This is just a small sample of the parametric tests you can use on ratio
data. The full selection is wide and includes variations on those already
described, from alternative regression tests (like logistic regression) to other
comparative tests (such as the paired t-test or multivariate analysis of variance,
or MANOVA). While these methods require some getting used to, by now
you hopefully have an idea of the kinds of analyses you can carry out.
A Guide to Data Types in Statistics
What types of data are used in statistics? Here's a comprehensive guide.
https://builtin.com/data-science/data-types-statistics
Introduction to Data Types
You also need to know which data type you are dealing with to
choose the right visualization method. Think of data types as a
way to categorize different types of variables. We will discuss the
main types of variables and look at an example for each. We will
sometimes refer to them as measurement scales.
NOMINAL DATA
ORDINAL DATA
Note that the difference between Elementary and High School is
different from the difference between High School and College.
This is the main limitation of ordinal data: the differences
between the values are not really known. Because of that, ordinal
scales are usually used to measure non-numeric features like
happiness, customer satisfaction and so on.
Numerical or Quantitative Data Types
DISCRETE DATA
You can check whether you are dealing with discrete data by
asking two questions: can you count it, and can it be divided up
into smaller and smaller parts? Discrete data can be counted but
cannot be meaningfully subdivided.
CONTINUOUS DATA
Continuous data represent measurements, so their values can’t
be counted but they can be measured. An example would be the
height of a person, which you can describe by using intervals on
the real number line.
Interval Data
The problem with interval data is that it doesn’t have a
“true zero.” With regard to our example, that means there is
no such thing as no temperature. With interval data, we can add
and subtract, but we cannot meaningfully multiply, divide or
calculate ratios. Because there is no true zero, statistics that
depend on ratios (such as the coefficient of variation) can’t be applied.
Ratio Data
Ratio values are also ordered units that have the same
difference. Ratio values are the same as interval values, with the
difference that they do have an absolute zero. Good examples are
height, weight, length, etc.
We will now go over every data type again, but this time with
regard to which statistical methods can be applied. To
understand properly what we will now discuss, you have to
understand the basics of descriptive statistics. If you don’t know
them, you can read my blog post (9min read) about
it: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9.
When you are dealing with nominal data, you collect information through:
Frequencies: The frequency is the rate at which something occurs over a
Proportion: You can easily calculate the proportion by dividing the frequency
by the total number of events. (e.g how often something happened divided by
Visualization Methods: To visualize nominal data you can use a pie chart or a
bar chart.
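As a small illustration, frequencies, proportions and percentages for nominal data can be tallied with the standard library. The list of species below is made up for the example.

```python
from collections import Counter

# Hypothetical nominal observations: the categories have no order
species = ["cat", "dog", "dog", "bird", "dog", "cat"]

frequencies = Counter(species)     # how often each category occurs
total = sum(frequencies.values())  # total number of observations

# Proportion = frequency divided by the total number of observations
proportions = {name: count / total for name, count in frequencies.items()}
percentages = {name: 100 * p for name, p in proportions.items()}

for name in frequencies:
    print(name, frequencies[name], round(percentages[name], 1))
```

The frequencies are the bar heights in a bar chart, and the proportions are the slice sizes in a pie chart.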
When you are dealing with ordinal data, you can use the same
methods as with nominal data, but you also have access to some
additional tools. Therefore you can summarize your ordinal data
with frequencies, proportions, percentages. And you can
visualize it with pie and bar charts. Additionally, you can use
percentiles, median, mode and the interquartile range to
summarize your data.
In data science, you can use label encoding to transform
ordinal data into a numeric feature.
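A minimal sketch of label encoding, assuming the education levels used earlier as the ordered categories. The responses are invented, and the mapping is written by hand here rather than taken from any particular library.

```python
import statistics

# Ordered categories, lowest to highest
levels = ["Elementary", "High School", "College"]

# Label encoding: each level is replaced by its rank in the order
encoding = {level: rank for rank, level in enumerate(levels)}

# Hypothetical survey responses
responses = ["High School", "Elementary", "College", "High School", "High School"]
encoded = [encoding[r] for r in responses]

# The ranks support order-based summaries such as the median and mode
median_level = levels[statistics.median_low(encoded)]
mode_level = levels[statistics.mode(encoded)]
print(encoded, median_level, mode_level)
```

The encoded numbers only carry order. Because the gaps between levels are unknown, taking a mean of the ranks would not be meaningful.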
When you are dealing with continuous data, you have the widest
choice of methods to describe your data. You can summarize your
data using percentiles, median, interquartile range, mean, mode,
standard deviation, and range.
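The percentile-based measures can be sketched with Python's `statistics.quantiles`. The heights are hypothetical, and the function's default "exclusive" method is assumed; other software may use a slightly different interpolation rule and so give slightly different quartiles.

```python
import statistics

# Hypothetical continuous measurements: heights in cm
heights_cm = [158.0, 162.5, 165.0, 168.0, 171.5, 174.0, 180.0, 186.5]

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = statistics.quantiles(heights_cm, n=4)

iqr = q3 - q1                                      # interquartile range
value_range = max(heights_cm) - min(heights_cm)    # full range

print(q1, q2, q3)
print(iqr, value_range)
```

The middle cut point `q2` is the median, and the interquartile range covers the middle half of the data, which makes it less sensitive to extreme values than the full range.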
Visualization Methods:
Summary
In this post, you discovered the different data types that are used
throughout statistics. You learned the difference between
discrete & continuous data and learned what nominal, ordinal,
interval and ratio measurement scales are. Furthermore, you
now know which statistical measurements you can use with which
data type and which visualization methods are appropriate. You
also learned which methods can be used to transform categorical
variables into numeric variables. This enables you to carry out a
large part of an exploratory analysis on a given data set.