Module 3_Types of Data_Part I

The document discusses data preprocessing, focusing on the types of data: structured vs unstructured, and qualitative vs quantitative. It explains the four levels of data (nominal, ordinal, interval, ratio) and their characteristics, including examples and mathematical operations applicable to each level. Additionally, it highlights the importance of transforming unstructured data into a structured format for analysis and the significance of measures of center and variation in understanding data distribution.


Module 3: Data Preprocessing

Types of Data

17/05/2025
Contents
• Types of Data: Structured and Unstructured Data, Quantitative and
Qualitative Data.
• Four Levels of Data (Nominal, Ordinal, Interval, Ratio).
1. Structured vs Unstructured
 Structured (Organized) Data: Data stored in a row/column structure.
• Every row represents a single observation, and every column represents a characteristic of that observation.
• Unstructured (Unorganized) Data: Data in free form that does not follow any standard format or hierarchy.
• E.g., text or raw audio signals that must be parsed further to become organized.
Pros of Structured Data
 Structured data is generally thought of as being much
easier to work with and analyze.
 Most statistical and machine learning models were
built with structured data in mind and cannot work on
the loose interpretation of unstructured data.
 The natural row and column structure is easy to digest
for human and machine eyes.
Example of Data Pre-processing
for Text Data
• Text data is generally unstructured, and hence there is a need to transform it into a structured form.
• A few characteristics that describe the data and assist the transformation are:
Word/phrase count
The existence of certain special characters
The relative length of text
Picking out topics
Example: A Tweet
• This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn
skies.

• Pre-processing is necessary for this tweet because a vast majority of learning algorithms require numerical data.
• Pre-processing allows us to explore features that have been
created from the existing features.
• For example, we can extract features such as word count and
special characters from the mentioned tweet.
Example: This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.
1. Word/phrase counts:
• We may break down a tweet into its word/phrase count.
• The word ‘this’ appears in the tweet once, as does every other
word.
• We can represent this tweet in a structured format, converting
the unstructured set of words into a row/column format:

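A minimal Python sketch of this word/phrase count (the tokenization choices here, lowercasing and dropping punctuation, are illustrative assumptions, not from the slides):

```python
import re
from collections import Counter

tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# Lowercase the text and pull out word tokens, dropping punctuation.
words = re.findall(r"[a-z']+", tweet.lower())
word_counts = Counter(words)

print(word_counts["this"])  # -> 1
```

The resulting `Counter` is exactly the row/column representation described above: each word is a column and its count is the value.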
2. Presence of certain special characters


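A hedged sketch of counting special characters in the same tweet; the definition of "special" here (anything neither alphanumeric nor whitespace) is an assumption:

```python
from collections import Counter

tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# Treat anything that is neither alphanumeric nor whitespace as "special".
special = Counter(ch for ch in tweet if not ch.isalnum() and not ch.isspace())

print(special["?"], special["&"])  # -> 1 1
```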
Example: This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.
• 3. Relative length of text
• This tweet is 121 characters long.
• The average tweet, as discovered by analysts, is about 30
characters in length.
• So, we calculate a new characteristic, called relative length (the length of the tweet divided by the average length), i.e. 121/30, telling us the length of this tweet compared to the average tweet.
• This tweet is 4.03 times as long as the average tweet.
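This calculation can be sketched in Python. Note that the exact character count depends on spacing and punctuation, so the computed ratio may differ slightly from the slide's 121/30 = 4.03:

```python
tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")
AVG_TWEET_LEN = 30  # average tweet length quoted on the slide

relative_length = len(tweet) / AVG_TWEET_LEN
print(round(relative_length, 2))
```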
Example: This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.
• 4. Picking out topics
• This tweet is about astronomy, so we can add that information as a
column.
• Thus, we can convert a piece of text into structured/organized
data, ready for use in our models and exploratory analysis.

Topic: Astronomy
2. Qualitative/Quantitative
1. Quantitative data: Data that can be described using numbers, on which basic mathematical operations, including addition and subtraction, can be performed.

2. Qualitative data: Data that cannot be described using numbers and on which basic mathematics cannot be performed; it is instead described using "natural" categories and natural language.
Example of Qualitative/Quantitative
Coffee Shop Data
Observations of coffee shops in a major city were made, and the following characteristics were recorded.
1. Name of coffee shop
2. Revenue (in thousands of dollars)
3. Zip code
4. Average monthly customers
5. Country of coffee origin
Let us try to classify each characteristic as Qualitative OR
Quantitative
Example of Qualitative/Quantitative
Coffee Shop Data
1. Name of coffee shop
• Qualitative
• The name of a coffee shop is not expressed as a number
and we cannot perform math on the name of the shop.
2. Revenue
• Revenue – Quantitative
Example of Qualitative/Quantitative
Coffee Shop Data
3. Zip code
• This one is tricky!
• Zip code – Qualitative
• A zip code is always represented using numbers, but what makes it qualitative is that it does not fit the second part of the definition of quantitative: we cannot perform basic mathematical operations on a zip code.
• Adding two zip codes together is a nonsensical measurement.
4. Average monthly customers
• Average monthly customers – Quantitative
5. Country of coffee origin
• Country of coffee origin – Qualitative
Example 2: World alcohol
consumption data

• Classification of attributes as Quantitative OR Qualitative:
• country: Qualitative
• beer_servings: Quantitative
• spirit_servings: Quantitative
• wine_servings: Quantitative
• total_litres_of_pure_alcohol: Quantitative
• continent: Qualitative
Quantitative data can be broken down, one step
further, into discrete and continuous
quantities.
Continuous vs Discrete:
• Continuous: can take any value in an interval, e.g. [1 to 10], so values can be 1, 1.3, 2.46, 5.378… Continuous data is measured. Example: temperature (22.6 °C, 83.46 °F).
• Discrete: can only take specific values (no decimal values), e.g. 1, 2, 3, 4, 5… Discrete data is counted. Example: rolling a die (1, 2, 3, 4, 5, 6).
Examples: The speed of a car – Continuous
The number of cats in a house – Discrete
Your weight – Continuous
The number of students in a class – Discrete
The number of books in a shelf – Discrete
The height of a person – Continuous
Exact age - Continuous
Four Levels of Data
• It is generally understood that a specific characteristic
(feature/column) of structured data can be broken
down into one of four levels of data. The levels are:
 The nominal level
 The ordinal level
 The interval level
 The ratio level
The nominal level
• The first level of data, the nominal level, consists of
data that is described purely by name or category
with no rank order.
• Basic examples include gender, nationality, species, a student's name, hair color, etc.
• No rank order means we cannot say that one hair color is more important than another.
• They are not described by numbers and are therefore
qualitative.
Mathematical operations
allowed
• We cannot perform mathematics on the nominal level of data except basic equality and set membership operations. For example:
Being a tech entrepreneur implies being in the tech industry, but not vice versa (set membership).
Measures of center
• A measure of center is a number that describes the value that the data tends toward.
• It is sometimes referred to as the balance point of the data.
• Common examples include the mean, median, and mode.
• In order to find the center of nominal data, we generally turn to the
mode (the most common element) of the dataset.
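A small sketch of finding the mode of nominal data; the hair-color sample below is hypothetical:

```python
from collections import Counter

# Hypothetical nominal sample: hair colors observed in a group.
hair_colors = ["black", "brown", "black", "red", "brown", "black"]

# The mode is the most common element of the dataset.
mode = Counter(hair_colors).most_common(1)[0][0]
print(mode)  # -> black
```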
2. The Ordinal level
• Categorical in nature, but with an inherent order or rank, where each option has a different value.
Examples:
Income levels (low, medium, high)
Levels of agreement (disagree, neutral, agree)
Levels of satisfaction (poor, average, good, excellent)
All these options are still categorical, but they have different values (a ranking difference).
Measures of center
• In order to find the center of ordinal data, we generally turn to the
median of the dataset.

• The mean isn't chosen because computing it requires a division operation, which isn't allowed at the ordinal level.
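A sketch of taking the median of ordinal data by ranking the categories; the satisfaction scale and responses are hypothetical:

```python
# Hypothetical ordinal scale, ordered from lowest to highest.
order = ["poor", "average", "good", "excellent"]
responses = ["good", "poor", "excellent", "good", "average"]

# Replace each response by its rank, sort, and take the middle rank
# (an odd number of responses keeps the middle element well defined).
ranks = sorted(order.index(r) for r in responses)
median_rank = ranks[len(ranks) // 2]
print(order[median_rank])  # -> good
```

Only the ordering of categories is used here, never arithmetic on them, which is exactly what the ordinal level permits.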
3. The Interval Level
• What type of data is Top 5 Olympic Medalists?
• Ordinal Data, since we can order and rank the Medalists.
• One drawback is that the ranking scale does not help us determine how far apart the medalists are in terms of victory.
• To help us measure the difference between two quantities, we make use of
the Interval Level
Example of Interval level:
Temperature
• If it is 100 degrees Fahrenheit in Texas and 80 degrees Fahrenheit in Istanbul,
Turkey, then Texas is 20 degrees warmer than Istanbul.
• Thus, Data at the interval level allows meaningful subtraction between data
points.
Mathematical operations
allowed
• We can use all the operations allowed with nominal and ordinal(ordering,
comparisons, and so on), along with two other notable operations:
• Addition
• Subtraction
Measures of center
• We can use the mean, median, and mode to describe this data.
• Usually the most accurate description of the center of the data is the arithmetic mean, more commonly referred to as simply "the mean".
• At the previous levels, addition was meaningless, so the mean would have lost its value.
• It is only at the interval level and above that the arithmetic mean makes sense.
Example: Temperature of Fridge
• Suppose we look at the temperature of a fridge containing a pharmaceutical company's new vaccine. We measure the temperature every hour, with the following data points (in Fahrenheit):
• 31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26

• mean = 30.73
• median= 31.0
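These two numbers can be checked with Python's statistics module:

```python
from statistics import mean, median

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

print(round(mean(temps), 2))  # -> 30.73
print(median(temps))          # -> 31
```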
Finding Measure of Centre
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30,
31, 26
• The mean and median are quite close to each other and both are
around 31 degrees.
• The question: on average, how cold is the fridge?
• About 31 degrees.
• However the vaccine comes with a warning:
• Do not keep this vaccine at a temperature under 29 degrees.
Finding Measure of Centre
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30,
31, 26
• We observe the values 28 and 26, indicating that a dip below 29 has happened at least twice.
• But we did not pay attention to this while calculating the mean and median.
• Hence we need a measure of variation to understand how bad the fridge's condition is.
Measure of Variation
• It is a measure of "how spread out the data is".
• Standard deviation is the most common measure of variation.
• In layman's terms, standard deviation can be thought of as the "average distance a data point is from the mean".
• Thus, measure of variation (standard deviation) is a number that attempts
to describe how spread out the data is.
Explanation of formula of standard
deviation
1. Find the mean of the data.
2. For each number in the dataset, subtract the mean from it and then square the result.
3. Find the average of the squared differences.
4. Take the square root of the number obtained in step three. This is the standard deviation.
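The four steps above can be sketched directly in Python; this computes the population standard deviation, matching the "average of the squared differences" in step three:

```python
import math

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Step 1: mean of the data.
m = sum(temps) / len(temps)
# Step 2: squared difference of each point from the mean.
sq_diffs = [(t - m) ** 2 for t in temps]
# Step 3: average of the squared differences (the population variance).
variance = sum(sq_diffs) / len(sq_diffs)
# Step 4: square root gives the standard deviation.
std_dev = math.sqrt(variance)

print(round(std_dev, 1))  # -> 2.5
```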
Measure of variation
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26

• On computation, the standard deviation of the dataset is around 2.5.
• Meaning: "on average", a data point is 2.5 degrees off from the average temperature of around 31 degrees.
• Takeaway: The temperature could likely dip
below 29 degrees again in the near future.
Note:
• The reason we want the "squared difference" between each point and the mean, and not the "actual difference", is that squaring puts emphasis on outliers: data points that are abnormally far away.
Summary
• Measures of variation give us a very clear picture of how spread out
or dispersed our data is.
• This is especially important when we are concerned with ranges of
data and how data can fluctuate (think percent return on stocks).
• Drawback with Interval Data:
• Data at the interval level does not have a "natural starting point or a natural zero".
• For example, being at zero degrees Celsius does not mean that you have "no temperature".
The ratio level
• After moving through three different levels with differing levels of
allowed mathematical operations, the ratio level proves to be the
strongest of the four.
• Not only can we define order and difference, but the ratio level also allows us to multiply and divide.
• This might seem like not much to make a fuss over but it changes
almost everything about the way we view data at this level.
Examples of the ratio level

• E.g., while Fahrenheit and Celsius are stuck at the interval level, the Kelvin scale boasts a natural zero.
• A measurement of zero Kelvin literally means the absence of heat. It
is a non-arbitrary starting zero.
• We can actually scientifically say that 200 Kelvin is twice as much heat
as 100 Kelvin.
• Money in the bank is at the ratio level. You can have "no money in the
bank" and it also makes sense that $200,000 is "twice as much as"
$100,000.
Measures of center
• The arithmetic mean still holds meaning at this level, as does a new
type of mean called the geometric mean.
• Geometric mean is the nth root of the product of all the values.
• For the refrigerator example, the geometric mean is the 15th root of (31 × 32 × 32 × 31 × 28 × 29 × 31 × 38 × 32 × 31 × 30 × 29 × 30 × 31 × 26) = 30.634.
• In this case, the geometric mean is comparable to the mean and median.
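A quick check of the geometric mean in Python, computed via logarithms so the intermediate product stays small:

```python
import math

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Geometric mean: the nth root of the product of all values,
# computed as exp(mean of logs) for numerical stability.
geo_mean = math.exp(sum(math.log(t) for t in temps) / len(temps))

print(round(geo_mean, 3))
```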
Problem with ratio data:
• The biggest drawback with ratio data is that negative values generally do not make sense.
• Example: Suppose we allowed a debt of $50,000 in our money-in-the-bank example. The ratio 50,000 / (−50,000), i.e. −1, would not make sense.
• For this reason alone, many data scientists prefer the interval level to
the ratio level.
