Module 3_Types of Data_Part I

The document discusses data preprocessing, focusing on the types of data: structured vs unstructured, and qualitative vs quantitative. It explains the four levels of data (nominal, ordinal, interval, ratio) and their characteristics, including examples and mathematical operations applicable to each level. Additionally, it highlights the importance of transforming unstructured data into a structured format for analysis and the significance of measures of center and variation in understanding data distribution.


Module 3: Data Preprocessing

Types of Data

17/05/2025
Contents
• Types of Data: Structured and Unstructured Data, Quantitative and
Qualitative Data.
• Four Levels of Data (Nominal, Ordinal, Interval, Ratio).
1. Structured vs Unstructured
 Structured (Organized) Data: Data stored in a row/column structure.
• Every row represents a single observation, and every column represents a characteristic of that observation.
• Unstructured (Unorganized) Data: Data in free form that does not follow any standard format or hierarchy.
• E.g., text or raw audio signals that must be parsed further to become organized.
Pros of Structured Data
 Structured data is generally thought of as being much
easier to work with and analyze.
 Most statistical and machine learning models were
built with structured data in mind and cannot work on
the loose interpretation of unstructured data.
 The natural row and column structure is easy to digest
for human and machine eyes.
Example of Data Pre-processing
for Text Data
• Text data is generally unstructured, and hence there is a need to transform it into a structured form.
• A few characteristics that describe the data and assist the transformation are:
Word/phrase count
The existence of certain special characters
The relative length of text
Picking out topics
Example: A Tweet
• This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn
skies.

• Pre-processing is necessary for this tweet because a vast majority of learning algorithms require numerical data.
• Pre-processing allows us to explore features that have been
created from the existing features.
• For example, we can extract features such as word count and
special characters from the mentioned tweet.
Example: This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.
1. Word/phrase counts:
• We may break down a tweet into its word/phrase count.
• The word ‘this’ appears in the tweet once, as does every other
word.
• We can represent this tweet in a structured format, converting
the unstructured set of words into a row/column format:

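A minimal Python sketch of this word/phrase count (the tokenization choices here, lowercasing and dropping punctuation, are illustrative assumptions, not from the slides):

```python
import re
from collections import Counter

tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# Lowercase the text and pull out word tokens, dropping punctuation.
words = re.findall(r"[a-z']+", tweet.lower())
word_counts = Counter(words)

print(word_counts["this"])  # -> 1
```

The resulting `Counter` is exactly the row/column representation described above: each word is a column and its count is the value.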
2. Presence of certain special characters


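A hedged sketch of counting special characters in the same tweet; the definition of "special" here (anything neither alphanumeric nor whitespace) is an assumption:

```python
from collections import Counter

tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# Treat anything that is neither alphanumeric nor whitespace as "special".
special = Counter(ch for ch in tweet if not ch.isalnum() and not ch.isspace())

print(special["?"], special["&"])  # -> 1 1
```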
Example: This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.
• 3. Relative length of text
• This tweet is 121 characters long.
• The average tweet, as discovered by analysts, is about 30
characters in length.
• So, we calculate a new characteristic, called relative length (the length of the tweet divided by the average length), i.e. 121/30, telling us the length of this tweet compared to the average tweet.
• This tweet is 4.03 times as long as the average tweet.
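This calculation can be sketched in Python. Note that the exact character count depends on spacing and punctuation, so the computed ratio may differ slightly from the slide's 121/30 = 4.03:

```python
tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")
AVG_TWEET_LEN = 30  # average tweet length quoted on the slide

relative_length = len(tweet) / AVG_TWEET_LEN
print(round(relative_length, 2))
```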
Example: This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.
• 4. Picking out topics
• This tweet is about astronomy, so we can add that information as a
column.
• Thus, we can convert a piece of text into structured/organized
data, ready for use in our models and exploratory analysis.

Topic: Astronomy
2. Qualitative/Quantitative
1. Quantitative data: Data that can be described using numbers, on which basic mathematical operations, including addition and subtraction, can be performed.

2. Qualitative data: Data that cannot be described using numbers and on which basic mathematics cannot be performed; it is instead described using "natural" categories and natural language.
Example of Qualitative/Quantitative
Coffee Shop Data
Observations of coffee shops in a major city were made, and the following characteristics were recorded.
1. Name of coffee shop
2. Revenue (in thousands of dollars)
3. Zip code
4. Average monthly customers
5. Country of coffee origin
Let us try to classify each characteristic as Qualitative OR
Quantitative
Example of Qualitative/Quantitative
Coffee Shop Data
1. Name of coffee shop
• Qualitative
• The name of a coffee shop is not expressed as a number
and we cannot perform math on the name of the shop.
2. Revenue
• Revenue – Quantitative
Example of Qualitative/Quantitative
Coffee Shop Data
3. Zip code
• This one is tricky!
• Zip code – Qualitative
• A zip code is always represented using numbers, but what makes it qualitative is that it does not fit the second part of the definition of quantitative: we cannot perform basic mathematical operations on a zip code.
• Adding two zip codes together is a nonsensical measurement.
4. Average monthly customers
• Average monthly customers – Quantitative
5. Country of coffee origin
• Country of coffee origin – Qualitative
Example 2: World alcohol
consumption data

• Classification of attributes as Quantitative OR Qualitative:
• country: Qualitative
• beer_servings: Quantitative
• spirit_servings: Quantitative
• wine_servings: Quantitative
• total_litres_of_pure_alcohol: Quantitative
• continent: Qualitative
Quantitative data can be broken down, one step
further, into discrete and continuous
quantities.
Continuous vs Discrete:
• Continuous: can take any value in an interval, e.g. [1 to 10], so values can be 1, 1.3, 2.46, 5.378… Continuous data is measured. Example: temperature (22.6 °C, 83.46 °F).
• Discrete: can only take specific values (no decimal values), e.g. 1, 2, 3, 4, 5… Discrete data is counted. Example: rolling a die (1, 2, 3, 4, 5, 6).
Examples: The speed of a car – Continuous
The number of cats in a house – Discrete
Your weight – Continuous
The number of students in a class – Discrete
The number of books in a shelf – Discrete
The height of a person – Continuous
Exact age - Continuous
Four Levels of Data
• It is generally understood that a specific characteristic
(feature/column) of structured data can be broken
down into one of four levels of data. The levels are:
 The nominal level
 The ordinal level
 The interval level
 The ratio level
The nominal level
• The first level of data, the nominal level, consists of
data that is described purely by name or category
with no rank order.
• Basic examples include gender, nationality, species, a student's name, hair color, etc.
• No rank order means we cannot say that one hair color is more important than another.
• They are not described by numbers and are therefore
qualitative.
Mathematical operations
allowed
• We cannot perform mathematics on the nominal level of data except basic equality and set membership operations. For example:
Being a tech entrepreneur implies being in the tech industry, but not vice versa (set membership).
Measures of center
• A measure of center is a number that describes the value that the data tends toward.
• It is sometimes referred to as the balance point of the data.
• Common examples include the mean, median, and mode.
• In order to find the center of nominal data, we generally turn to the
mode (the most common element) of the dataset.
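A small sketch of finding the mode of nominal data; the hair-color sample below is hypothetical:

```python
from collections import Counter

# Hypothetical nominal sample: hair colors observed in a group.
hair_colors = ["black", "brown", "black", "red", "brown", "black"]

# The mode is the most common element of the dataset.
mode = Counter(hair_colors).most_common(1)[0][0]
print(mode)  # -> black
```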
2. The Ordinal level
• Categorical in nature, but with an inherent order or rank, where each option has a different value.
Examples:
Income levels (low, medium, high)
Levels of agreement (disagree, neutral, agree)
Levels of satisfaction (poor, average, good, excellent)
All these options are still categorical, but they have different values (a ranking difference).
Measures of center
• In order to find the center of ordinal data, we generally turn to the
median of the dataset.

• The mean isn't chosen because computing it requires a division operation, which isn't allowed at the ordinal level.
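A sketch of taking the median of ordinal data by ranking the categories; the satisfaction scale and responses are hypothetical:

```python
# Hypothetical ordinal scale, ordered from lowest to highest.
order = ["poor", "average", "good", "excellent"]
responses = ["good", "poor", "excellent", "good", "average"]

# Replace each response by its rank, sort, and take the middle rank
# (an odd number of responses keeps the middle element well defined).
ranks = sorted(order.index(r) for r in responses)
median_rank = ranks[len(ranks) // 2]
print(order[median_rank])  # -> good
```

Only the ordering of categories is used here, never arithmetic on them, which is exactly what the ordinal level permits.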
3. The Interval Level
• What type of data is Top 5 Olympic Medalists?
• Ordinal Data, since we can order and rank the Medalists.
• One drawback is that the ranking scale does not help us determine how far apart the medalists are in terms of victory.
• To help us measure the difference between two quantities, we make use of
the Interval Level
Example of Interval level:
Temperature
• If it is 100 degrees Fahrenheit in Texas and 80 degrees Fahrenheit in Istanbul,
Turkey, then Texas is 20 degrees warmer than Istanbul.
• Thus, Data at the interval level allows meaningful subtraction between data
points.
Mathematical operations
allowed
• We can use all the operations allowed with nominal and ordinal(ordering,
comparisons, and so on), along with two other notable operations:
• Addition
• Subtraction
Measures of center
• We can use the mean, median, and mode to describe this data.
• Usually the most accurate description of the center of the data is the arithmetic mean, more commonly referred to as simply "the mean".
• At the previous levels, addition was meaningless, so the mean would have lost its value.
• It is only at the interval level and above that the arithmetic mean makes sense.
Example: Temperature of Fridge
• Suppose we look at the temperature of a fridge containing a pharmaceutical company's new vaccine. We measure the temperature every hour, with the following data points (in Fahrenheit):
• 31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26

• mean = 30.73
• median= 31.0
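These two numbers can be checked with Python's statistics module:

```python
from statistics import mean, median

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

print(round(mean(temps), 2))  # -> 30.73
print(median(temps))          # -> 31
```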
Finding Measure of Centre
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30,
31, 26
• The mean and median are quite close to each other and both are
around 31 degrees.
• The question: on average, how cold is the fridge?
• About 31 degrees.
• However the vaccine comes with a warning:
• Do not keep this vaccine at a temperature under 29 degrees.
Finding Measure of Centre
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30,
31, 26
• We observe the values 28 and 26, indicating that a dip below 29 has happened at least twice.
• But we did not pay attention to this while calculating the mean and median.
• Hence we need a measure of variation to understand how bad the fridge's condition is.
Measure of Variation
• It is a measure of "how spread out the data is".
• Standard deviation is the most common measure of variation.
• In layman's terms, standard deviation can be thought of as the "average distance a data point is from the mean".
• Thus, measure of variation (standard deviation) is a number that attempts
to describe how spread out the data is.
Explanation of formula of standard
deviation
1. Find the mean of the data.
2. For each number in the dataset, subtract the mean from it and then square the result.
3. Find the average of the squared differences.
4. Take the square root of the number obtained in step three. This is the standard deviation.
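The four steps above can be sketched directly in Python; this computes the population standard deviation, matching the "average of the squared differences" in step three:

```python
import math

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Step 1: mean of the data.
m = sum(temps) / len(temps)
# Step 2: squared difference of each point from the mean.
sq_diffs = [(t - m) ** 2 for t in temps]
# Step 3: average of the squared differences (the population variance).
variance = sum(sq_diffs) / len(sq_diffs)
# Step 4: square root gives the standard deviation.
std_dev = math.sqrt(variance)

print(round(std_dev, 1))  # -> 2.5
```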
Measure of variation
31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26

• On computation, the standard deviation of the dataset is around 2.5.
• Meaning: "on average", a data point is 2.5 degrees off from the average temperature of around 31 degrees.
• Takeaway: The temperature could likely dip
below 29 degrees again in the near future.
Note:
• The reason we want the "squared difference" between each point and the mean, and not the "actual difference", is that squaring puts emphasis on outliers: data points that are abnormally far away.
Summary
• Measures of variation give us a very clear picture of how spread out
or dispersed our data is.
• This is especially important when we are concerned with ranges of
data and how data can fluctuate (think percent return on stocks).
• Drawback with Interval Data:
• Data at the interval level does not have a "natural starting point or a natural zero".
• For example, being at zero degrees Celsius does not mean that you have "no temperature".
The ratio level
• After moving through three different levels with differing levels of
allowed mathematical operations, the ratio level proves to be the
strongest of the four.
• Not only can we define order and difference, but the ratio level also allows us to multiply and divide.
• This might seem like not much to make a fuss over but it changes
almost everything about the way we view data at this level.
Examples of the ratio level

• E.g., while Fahrenheit and Celsius are stuck at the interval level, the Kelvin scale boasts a natural zero.
• A measurement of zero Kelvin literally means the absence of heat. It
is a non-arbitrary starting zero.
• We can actually scientifically say that 200 Kelvin is twice as much heat
as 100 Kelvin.
• Money in the bank is at the ratio level. You can have "no money in the
bank" and it also makes sense that $200,000 is "twice as much as"
$100,000.
Measures of center
• The arithmetic mean still holds meaning at this level, as does a new
type of mean called the geometric mean.
• Geometric mean is the nth root of the product of all the values.
• For the refrigerator example, the geometric mean is the 15th root of (31 × 32 × 32 × 31 × 28 × 29 × 31 × 38 × 32 × 31 × 30 × 29 × 30 × 31 × 26) = 30.634.
• In this case, the geometric mean is comparable to the mean and median.
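A quick check of the geometric mean in Python, computed via logarithms so the intermediate product stays small:

```python
import math

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

# Geometric mean: the nth root of the product of all values,
# computed as exp(mean of logs) for numerical stability.
geo_mean = math.exp(sum(math.log(t) for t in temps) / len(temps))

print(round(geo_mean, 3))
```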
Problem with ratio data:
• The biggest drawback with ratio data is that negative values generally do not make sense.
• Example: Suppose we allowed a debt of $50,000 in our money-in-the-bank example. The ratio 50,000 / (−50,000), i.e. −1, would not make sense.
• For this reason alone, many data scientists prefer the interval level to
the ratio level.
