Module 3_Types of Data_Part I
Types of Data
17/05/2025
Contents
• Types of Data: Structured and Unstructured Data, Quantitative and
Qualitative Data.
• Four Levels of data (Nominal, Ordinal, Interval, Ratio Level).
1. Structured vs Unstructured
• Structured (Organized) Data: Data stored in a row/column structure.
• Every row represents a single observation and every column represents a characteristic of that observation.
• Unstructured (Unorganized) Data: Data in free form that does not follow any standard format/hierarchy.
• Eg: Text or raw audio signals that must be parsed
further to become organized.
Pros of Structured Data
• Structured data is generally considered much easier to work with and analyze.
• Most statistical and machine learning models were built with structured data in mind and cannot work directly on loosely organized, unstructured data.
• The natural row and column structure is easy to digest for both human and machine eyes.
Example of Data Pre-processing for Text Data
• Text data is generally unstructured, so it needs to be transformed into a structured form.
• A few characteristics that describe the data and assist this transformation are:
Word/phrase count
The existence of certain special characters
The relative length of text
Picking out topics
Example: A Tweet
• This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn
skies.
Topic: Astronomy
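The characteristics listed above can be computed directly from the raw tweet. The following is a minimal sketch, not a full text-processing pipeline: the function name `describe_text` and the feature names are illustrative assumptions, words are split naively on whitespace, and topic picking is left out because it needs a model or a human label.

```python
import string

tweet = (
    "This Wednesday morn, are you early to rise? Then look East. "
    "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies."
)


def describe_text(text):
    """Turn free text into one structured row of simple features.

    Hypothetical helper for illustration only.
    """
    words = text.split()  # naive whitespace tokenization
    return {
        "word_count": len(words),
        "char_length": len(text),
        "has_ampersand": "&" in text,   # existence of a special character
        "has_at_sign": "@" in text,
        "punctuation_count": sum(ch in string.punctuation for ch in text),
    }


features = describe_text(tweet)
for name, value in features.items():
    print(f"{name}: {value}")
```

Each key of the resulting dictionary plays the role of a column, and each tweet processed this way would become one row of a structured table.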
2. Qualitative/Quantitative
1. Quantitative data: Data that can be described using numbers and on which basic mathematical operations, such as addition and subtraction, can be performed.
Finding Measure of Centre
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26
• mean = 30.73
• median = 31.0
• The mean and median are quite close to each other; both are around 31 degrees.
• So, to the question "On average, how cold is the fridge?", the answer is: about 31 degrees.
• However, the vaccine comes with a warning:
• Do not keep this vaccine at a temperature under 29 degrees.
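The mean and median quoted for the fridge readings can be reproduced with Python's standard `statistics` module. This is a small sketch; the variable names are assumptions made for illustration.

```python
from statistics import mean, median

# Fridge temperature readings from the slide
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

avg = mean(temps)    # sum of the readings divided by their count
mid = median(temps)  # middle value once the readings are sorted

print(f"mean   = {avg:.2f}")
print(f"median = {mid}")
```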
• The values 28 and 26 show that the temperature has dipped below 29 degrees at least twice.
• The mean and median do not reveal these dips.
• Hence we need a measure of variation to understand how bad the fridge's condition is.
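A quick check makes the point concrete: the readings below the warning threshold can be listed directly, even though the mean and median hide them. This is a sketch; `SAFE_MIN` is a name assumed here for the 29-degree limit in the vaccine warning.

```python
# Fridge temperature readings from the slide
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

SAFE_MIN = 29  # assumed name for the limit stated in the vaccine warning

# Keep only the readings strictly under the limit
too_cold = [t for t in temps if t < SAFE_MIN]

print(f"readings under {SAFE_MIN}: {too_cold}")
print(f"number of dips: {len(too_cold)}")
```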
Measure of Variation
• It is a number that describes "how spread out the data is".
• Standard deviation is the most common measure of variation.
• In layman's terms, standard deviation can be thought of as the "average distance of a data point from the mean".
Explanation of the Formula for Standard Deviation
1. Find the mean of the data.
2. For each number in the dataset, subtract the mean from it and square the result.
3. Find the average of the squared differences.
4. Take the square root of the number obtained in step 3. This is the standard deviation.
Measure of variation
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26
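The four steps of the standard deviation formula can be applied to these readings as follows. This is a sketch that follows the steps as written, so it divides by the number of readings (the population standard deviation); a sample standard deviation would divide by one less. The cross-check uses `statistics.pstdev` from the standard library.

```python
from math import sqrt
from statistics import pstdev

# Fridge temperature readings from the slide
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Step 1: find the mean of the data
mean = sum(temps) / len(temps)

# Step 2: subtract the mean from each number and square the result
squared_diffs = [(t - mean) ** 2 for t in temps]

# Step 3: average the squared differences
variance = sum(squared_diffs) / len(squared_diffs)

# Step 4: take the square root
std_dev = sqrt(variance)

print(f"standard deviation = {std_dev:.2f}")
print(f"library check      = {pstdev(temps):.2f}")
```

A spread of about 2.5 degrees around a mean of roughly 31 suggests that readings under 29 are not unusual for this fridge.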
• Eg (ratio level): While Fahrenheit and Celsius are stuck at the interval level, the Kelvin scale has a natural zero.
• A measurement of zero Kelvin literally means the absence of heat. It is a non-arbitrary starting zero.
• We can therefore meaningfully say that 200 Kelvin is twice as hot as 100 Kelvin.
• Money in the bank is at the ratio level. You can have "no money in the bank", and it also makes sense that $200,000 is "twice as much as" $100,000.
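The difference between an arbitrary zero and a natural zero shows up as soon as two temperatures are divided. The sketch below uses two hypothetical readings (50 and 100 degrees Celsius, chosen only for illustration) and the standard conversion K = C + 273.15.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


# Hypothetical readings, chosen only for illustration
c_low, c_high = 50.0, 100.0

# On the Celsius scale the zero is arbitrary, so this ratio is misleading
ratio_celsius = c_high / c_low

# On the Kelvin scale the zero is natural, so this ratio is meaningful
ratio_kelvin = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(f"ratio in Celsius: {ratio_celsius:.3f}")
print(f"ratio in Kelvin:  {ratio_kelvin:.3f}")
```

The Celsius ratio suggests "twice as hot", while the Kelvin ratio shows the true increase is only about 15%.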
Measures of center
• The arithmetic mean still holds meaning at this level, as does a new
type of mean called the geometric mean.
• The geometric mean is the nth root of the product of all n values.
• For the refrigerator example, the geometric mean is
15th root of (31*32*32*31*28*29*31*38*32*31*30*29*30*31*26) ≈ 30.63
• In this case, the geometric mean is comparable to the mean and median.
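The geometric mean of the refrigerator readings can be checked in a few lines. This is a sketch using `math.prod` and, as a cross-check, `statistics.geometric_mean`; both are in the standard library from Python 3.8 onwards.

```python
from math import prod
from statistics import geometric_mean

# Fridge temperature readings from the slide
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# nth root of the product of all n values
geo_mean = prod(temps) ** (1 / len(temps))

print(f"geometric mean = {geo_mean:.2f}")
print(f"library check  = {geometric_mean(temps):.2f}")
```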
Problem with ratio data:
• The biggest drawback of ratio data is that negative values generally do not make sense at this level.
• Example: Suppose we allowed a debt of $50,000 in our money-in-the-bank data. Compared with a balance of $50,000, the ratio 50000/(-50000), i.e. -1, would not make sense.
• For this reason alone, many data scientists prefer the interval level to
the ratio level.