2 - Data Scale

Intro to ML Slide 2

Uploaded by

Harsh Shah

Data Scales and representation

Prof. Asim Tewari

IIT Bombay

Asim Tewari, IIT Bombay ME 781: Statistical Machine Learning and Data Mining
Data Mining
• Data mining is the process of discovering patterns in data sets to achieve a specific objective. It involves methods at the intersection of machine learning, statistics, and database systems.

• In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a priori hypothesis.

Data Mining Skill Set

• Statistics
• Programming languages
• Data extraction and processing
• Data wrangling and exploration
• Business acumen
• Machine learning models
• Data visualization

The slide labels the earlier items as pre-processing and the later items as post-processing.

Data Mining Tasks

• Gathering business objectives
• Data acquisition
• Data processing
• Data exploration
• Data modeling
• Data visualization
• Model deployment

The slide labels the earlier tasks as pre-processing and the later tasks as post-processing.

Data Mining job profiles

Designation                                              Role
Data analyst manager                                     Manage the data mining group
Data scientist, data analyst                             Design, develop and deploy data models
Data architect, data engineer, database administrator    Provide secure and efficient access to data
Business analyst                                         Provide business objectives
Statistician                                             Provide statistical insights

Data Type
• Discrete data:
– Discrete non-ordered numbers
– Random collection of words
– Unrelated audio sounds
– Random music notes
• Sequential (temporal) data:
– Stochastic process
– Sequence of words in a sentence
– Audio speech data
– Music
• Spatial data:
– Image data
– Geo-spatial data

Sequential and spatial data are grouped on the slide as sequential / spatio-temporal data.

Other classifications include
• Categorical vs numerical
• Qualitative vs quantitative

Data Scales
• The same numerical data may have different semantic meanings
• Depending on the semantic meaning, different types of mathematical operations are appropriate

Data Scales
• Based on semantic meaning, there are four different scales, from lowest to highest: nominal, ordinal, interval, and ratio
• For each scale level, the operations and statistics of the lower scale levels are also valid

Data Scales

• Nominal scaled data
– Only tests for equality or non-equality are valid.
– Data of a nominal feature can be represented by the mode (the value that occurs most frequently).
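
As a minimal sketch of the nominal case, the snippet below uses only equality tests and counts to find the mode. The list `body_styles` is made-up illustration data, not taken from any data set in these slides.

```python
from collections import Counter

# Hypothetical nominal feature: car body styles (the labels have no order).
body_styles = ["sedan", "hatchback", "sedan", "wagon", "sedan", "hatchback"]

# Equality and non-equality are the only meaningful comparisons.
same_style = body_styles[0] == body_styles[2]       # both are "sedan"
different_style = body_styles[0] != body_styles[1]  # "sedan" vs "hatchback"

# The mode is the label with the highest count.
counts = Counter(body_styles)
mode_label, mode_count = counts.most_common(1)[0]
print(mode_label, mode_count)
```

Sorting or averaging these labels would run without error in many tools, but the result would carry no meaning on a nominal scale.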

Data Scales

• Ordinal scaled data
– In addition to equality and inequality, the operations “greater than” (>) and “less than” (<) are valid, as are the combinations “greater than or equal” (≥) and “less than or equal” (≤).
– The relation “less than or equal” (≤) defines a total order, such that for any x, y, z we have
• Antisymmetry: $(x \le y) \wedge (y \le x) \Rightarrow x = y$
• Transitivity: $(x \le y) \wedge (y \le z) \Rightarrow x \le z$
• Totality: $(x \le y) \vee (y \le x)$
– Represented by the median (the value for which (almost) as many smaller as larger values exist)
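
The following sketch checks antisymmetry, transitivity, and totality by brute force on a small ordinal feature and then takes a median. The ranking of the four education labels and the `sample` list are assumptions made for illustration. `statistics.median_low` is used so that the result is always one of the observed labels rather than an average of two.

```python
import statistics
from itertools import product

# Hypothetical ordinal feature: education levels, ranked from low to high.
levels = ["HS-grad", "Bachelors", "Masters", "Doctorate"]
rank = {label: position for position, label in enumerate(levels)}

def leq(x, y):
    """x <= y under the ordering given by `levels`."""
    return rank[x] <= rank[y]

# Check the three total-order properties exhaustively on this small set.
antisymmetric = all(
    x == y for x, y in product(levels, repeat=2) if leq(x, y) and leq(y, x)
)
transitive = all(
    leq(x, z) for x, y, z in product(levels, repeat=3) if leq(x, y) and leq(y, z)
)
total = all(leq(x, y) or leq(y, x) for x, y in product(levels, repeat=2))

# median_low returns an observed rank, so no labels need to be averaged.
sample = ["Bachelors", "HS-grad", "Masters", "Bachelors", "Doctorate", "HS-grad"]
median_rank = statistics.median_low(rank[label] for label in sample)
median_label = levels[median_rank]
print(median_label)
```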

Data Scales

• Interval scaled data
– Addition and subtraction are valid
– The zero point is arbitrary
– Represented by the (arithmetic) mean
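
Temperature in degrees Celsius is a common example of an interval scale. The sketch below, using made-up readings, suggests why the mean is meaningful while ratios are not: converting to Fahrenheit moves the zero point, the mean converts consistently, but the ratio of two readings changes.

```python
import statistics

# Hypothetical temperature readings in degrees Celsius (an interval scale).
celsius = [10.0, 20.0, 30.0]

def to_fahrenheit(c):
    """Same temperatures with a different, equally arbitrary zero point."""
    return c * 9.0 / 5.0 + 32.0

fahrenheit = [to_fahrenheit(c) for c in celsius]

# The mean commutes with the change of zero point, so it is meaningful.
mean_c = statistics.mean(celsius)
mean_f = statistics.mean(fahrenheit)
mean_is_consistent = abs(to_fahrenheit(mean_c) - mean_f) < 1e-9

# Ratios are not preserved: 20 C is not "twice as hot" as 10 C.
ratio_c = celsius[1] / celsius[0]        # 20 / 10
ratio_f = fahrenheit[1] / fahrenheit[0]  # 68 / 50
print(mean_is_consistent, ratio_c, ratio_f)
```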

Data Scales

• Ratio scaled data
– Multiplication and division are valid
– Represented by the generalized mean
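
The generalized (power) mean of positive values x_1, ..., x_n with exponent m is ((1/n) Σ x_i^m)^(1/m). It gives the arithmetic mean for m = 1, the harmonic mean for m = -1, and the geometric mean in the limit m → 0. The function name `generalized_mean` and the sample `lengths` below are illustrative choices, not part of the slides.

```python
import math

def generalized_mean(values, power):
    """Power mean of strictly positive values; power = 0 gives the geometric mean."""
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if power == 0:
        # Limit case: exponential of the mean logarithm.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** power for v in values) / n) ** (1.0 / power)

lengths = [1.0, 2.0, 4.0]  # hypothetical ratio-scaled lengths

arithmetic = generalized_mean(lengths, 1)
geometric = generalized_mean(lengths, 0)
harmonic = generalized_mean(lengths, -1)
print(arithmetic, geometric, harmonic)
```

The geometric and harmonic means rely on multiplication and division, which is why they are only appropriate on a ratio scale.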

Data Type, Data Scale, Data value
Data Type, Data Scale, and Data value are three different concepts
• Data Type:
– Discrete Type
• Order of collection does not matter
– Sequential Type
• One-directional order of collection
– Spatio-temporal Type
• Multidimensional order of collection

These types can be of any Data Scale.

• Data Scale
– Ratio -> Can be only numerical (also called quantitative)
– Interval -> Can be only numerical (also called quantitative)
– Ordinal -> Can be categorical or qualitative
– Nominal -> Can be only categorical

• Data value
– Discrete (numerical or non-numerical)
– Continuous (numerical, also called quantitative)

1985 Auto Imports Database

Abalone (sea snails) data

Census bureau database
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
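
The two sample records can be parsed as in the sketch below, which also tags a few attributes with one possible scale. Each record has 15 fields, while the attribute list above names 13. The column names `fnlwgt` (third field) and `income` (last field) are assumptions, and the `scale` mapping is a judgment call for illustration, not part of the data set.

```python
import csv
import io

# "fnlwgt" and "income" are assumed names for the two fields that appear in
# the sample rows but not in the attribute list.
columns = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week",
           "native-country", "income"]

raw = """39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
"""

# skipinitialspace drops the blank that follows each comma.
reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
records = [dict(zip(columns, row)) for row in reader]

# One possible scale assignment for a few attributes (an assumption).
scale = {
    "age": "ratio",
    "hours-per-week": "ratio",
    "education": "ordinal",
    "workclass": "nominal",
    "sex": "nominal",
}

# Only interval- and ratio-scaled attributes are converted to numbers.
numeric = {name for name, level in scale.items() if level in ("interval", "ratio")}
for record in records:
    for name in numeric:
        record[name] = float(record[name])

print(records[0]["age"], records[0]["workclass"])
```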

Variables in ML
• The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables, and are typically denoted using the symbol X
• The output variable is often called the response or dependent variable, and is typically denoted using the symbol Y

Supervised Machine Learning
• Our goal in supervised machine learning is to extract a relationship from data (ordered pairs (y, x))

The real relation is

$y = f(x) + \epsilon$

where $\epsilon$ is noise with zero mean.

What we get from learning from data is

$\hat{y} = h(x)$
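
A minimal sketch of this setup: data are generated from a known relation plus zero-mean noise, and a learned function h is fitted to the (y, x) pairs. The linear form of the true relation `f`, the noise level, and the choice of a least-squares straight line for `h` are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def f(x):
    """Hypothetical 'true' relation, chosen only for illustration."""
    return 2.0 * x + 1.0

# Observations: the true relation plus zero-mean Gaussian noise.
xs = [i / 199 for i in range(200)]
ys = [f(x) + random.gauss(0.0, 0.1) for x in xs]

# Learn h from the (y, x) pairs: least-squares fit of a straight line.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

def h(x):
    """Learned approximation of the true relation."""
    return slope * x + intercept

y_hat = [h(x) for x in xs]
print(round(slope, 2), round(intercept, 2))
```

Because of the noise, the fitted slope and intercept are close to, but not exactly, the values used in `f`.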

Regression vs Classification

$y = f(x) + \epsilon$

• If y is on an interval or ratio scale, then it is regression
• If y is on a nominal or ordinal (?) scale, then it is classification

Regression vs Classification

$y = f(x) + \epsilon$

• The task of classification differs from regression in that we assign one of a discrete number of classes (nominal or ordinal scale), instead of a continuous value (interval or ratio scale).
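
The rule can be written as a small helper. The function name `learning_task` is hypothetical, and ordinal targets are mapped to classification here even though the slides mark that case with a question mark.

```python
def learning_task(target_scale):
    """Map the scale of the output y to the type of supervised task."""
    scale = target_scale.lower()
    if scale in ("interval", "ratio"):
        return "regression"
    if scale in ("nominal", "ordinal"):
        # Ordinal targets are treated as classes here; this is a convention,
        # not a rule fixed by the slides.
        return "classification"
    raise ValueError(f"unknown scale: {target_scale!r}")

print(learning_task("ratio"), learning_task("nominal"))
```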

Data Set vs Matrix Representations
We can denote numerical feature data as a set
$X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$
with n elements, where each element is a p-dimensional real-valued feature vector, and n and p are positive integers. For p = 1 we call X a scalar data set.

Data Set and Matrix Representations
• As an alternative to the set representation, numerical feature data are also often represented as a matrix
• Each row of the data matrix corresponds to an element of the data set. It is called a feature vector or data point $x_k$, $k = 1, \ldots, n$.
• Each column of the data matrix corresponds to one component of all elements of the data set. It is called the ith feature or ith component $x^{(i)}$, $i = 1, \ldots, p$.
• A single matrix element is a component of an element of the data set. It is called a datum or value $x_k^{(i)}$, $k = 1, \ldots, n$; $i = 1, \ldots, p$.
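
The row, column, and element conventions can be sketched with a nested list. The numbers and the helper names `data_point`, `feature`, and `datum` are made up for illustration. Indices are 1-based to match the notation k = 1, ..., n and i = 1, ..., p.

```python
# Hypothetical data set with n = 3 data points and p = 2 features.
X = [
    [5.1, 3.5],  # first data point
    [4.9, 3.0],  # second data point
    [6.2, 2.9],  # third data point
]
n = len(X)     # number of rows (data points)
p = len(X[0])  # number of columns (features)

def data_point(X, k):
    """Row k (1-based): the kth feature vector."""
    return X[k - 1]

def feature(X, i):
    """Column i (1-based): the ith feature across all data points."""
    return [row[i - 1] for row in X]

def datum(X, k, i):
    """Single value: the ith component of the kth data point."""
    return X[k - 1][i - 1]

print(data_point(X, 2), feature(X, 1), datum(X, 3, 2))
```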

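Data Set and Matrix Representations
• Matrix representation of a data set, with one row per data point and one column per feature:

$$X = \begin{pmatrix} x_1^{(1)} & \cdots & x_1^{(p)} \\ \vdots & \ddots & \vdots \\ x_n^{(1)} & \cdots & x_n^{(p)} \end{pmatrix}$$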

Data Relations
• Consider a set of (abstract categorical) elements, with no feature vector representation for the objects.
• Conventional feature-based data analysis methods are then not applicable. Instead, the relation of all pairs of objects can often be quantified and written as a square n × n matrix R.

Data Relations
• Each relation value $r_{ij}$, $i, j = 1, \ldots, n$, may refer to a degree of similarity, dissimilarity, compatibility, incompatibility, proximity or distance between the pair of objects $o_i$ and $o_j$.
• R may be symmetric, so $r_{ij} = r_{ji}$ for all $i, j = 1, \ldots, n$.
• R may be manually defined or computed from features. If numerical features X are available, then R may be computed from X using an appropriate function $f : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$.
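
As a sketch of computing R from features, the code below applies a pairwise function to every pair of feature vectors. Euclidean distance is used as one possible choice of f. The data and the helper name `relation_matrix` are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def relation_matrix(X, f):
    """n x n matrix whose entry (i, j) is f applied to data points i and j."""
    return [[f(xi, xj) for xj in X] for xi in X]

# Hypothetical data set with n = 3 points and p = 2 features.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
R = relation_matrix(X, euclidean)

# A distance-based relation is symmetric with a zero diagonal.
is_symmetric = all(
    R[i][j] == R[j][i] for i in range(len(X)) for j in range(len(X))
)
print(R, is_symmetric)
```

A similarity function could be passed in place of `euclidean`, in which case larger values of R would mean closer objects.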
