Data Scales and Representation

Prof. Asim Tewari
IIT Bombay

Asim Tewari, IIT Bombay ME 781: Statistical Machine Learning and Data Mining
Data Mining
• Data mining is the process of discovering patterns in data sets to achieve a specific objective. It involves methods at the intersection of machine learning, statistics, and database systems.

• In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a priori hypothesis.

Asim Tewari, IIT Bombay ME 781: Engineering Data Mining and Applications
Data Mining Skill Set

• Statistics
• Programming languages
• Data extraction and processing
• Data wrangling and exploration
• Machine learning models
• Data visualization
Labels alongside the list on the slide: pre-processing, business acumen, post-processing

Data Mining Tasks

• Gathering business objectives
• Data acquisition
• Data processing
• Data exploration
• Data modeling
• Data visualization
• Model deployment
Labels alongside the list on the slide: pre-processing, post-processing
Data Mining job profiles
Designation | Role
Data analyst manager | Manage the data mining group
Data scientist, data analyst | Design, develop and deploy data models
Data architect, data engineer, database administrator | Provide secure and efficient access to data
Business analyst | Provide business objectives
Statistician | Provide statistical insights

Input variables
• Input variables are typically denoted by the symbol X
• A subscript is used to distinguish among different input variables (X1, X2, ..., Xp)
• The input variables go by different names, such as
– Predictors
– Independent variables
– Features
– or just variables
– Sometimes they are also called attributes (although this term has a more general meaning, describing the characteristics of a thing or a person)

Output variable
• The output variable is often called the
– response, or
– dependent variable
• It is typically denoted using the symbol Y

Data Type
• Discrete data:
– Discrete non-ordered numbers
– Random collection of words
– Unrelated audio sounds
– Random music notes
• Sequential (temporal) data:
– Stochastic process
– Sequence of words in a sentence
– Audio speech data
– Music
• Spatial data:
– Image data
– Geo-spatial data
Side label on the slide: Sequential / Spatio-temporal data

Other classifications include
• Categorical vs numerical
• Qualitative vs quantitative

Data Scales
• The same numerical data may have different semantic meanings

• Depending on the semantic meaning, different types of mathematical operations are appropriate

• Based on semantic meaning, there are four different scales: nominal, ordinal, interval and ratio

• For each scale level, the operations and statistics of the lower scale levels are also valid

• Nominal scaled data
– Only tests for equality or non-equality are valid.
– Data of a nominal feature can be represented by the mode (the value that occurs most frequently).
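
As a minimal sketch of the nominal scale, the following Python snippet uses a hypothetical list of colour labels (not taken from the slides). It applies only equality tests and summarises the feature by its mode.

```python
from collections import Counter

# Hypothetical nominal feature: colour labels have identity but no order.
colours = ["red", "blue", "red", "green", "blue", "red"]

# Equality and non-equality are the only valid comparisons on a nominal scale.
same = colours[0] == colours[2]
different = colours[0] != colours[1]

# The mode is the value that occurs most frequently.
mode_value, mode_count = Counter(colours).most_common(1)[0]
print(mode_value, mode_count)
```

Sorting or averaging these labels would not be meaningful, which is why the mode is the representative statistic here.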

• Ordinal scaled data
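– The operations “greater than” (>) and “less than” (<) are valid, in addition to equality and inequality, and the combinations “greater than or equal” (≥) and “less than or equal” (≤).
– The relation “less than or equal” (≤) defines a total order, such that for any x, y, z we have
• Antisymmetry: $(x \le y) \land (y \le x) \Rightarrow (x = y)$
• Transitivity: $(x \le y) \land (y \le z) \Rightarrow (x \le z)$
• Totality: $(x \le y) \lor (y \le x)$
– Data of an ordinal feature can be represented by the median (the value for which (almost) as many smaller as larger values exist).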
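
The ordinal scale can be sketched in Python as follows. The three levels and the sample ratings are hypothetical and only illustrate the idea: the order is encoded as ranks, the three total-order properties are checked by brute force, and the median is taken on the ranks.

```python
from statistics import median_low

# Assumed ordering of a hypothetical ordinal feature (lowest to highest).
LEVELS = ["low", "medium", "high"]
RANK = {level: i for i, level in enumerate(LEVELS)}

def leq(a, b):
    """Ordinal 'less than or equal' defined through the rank of each level."""
    return RANK[a] <= RANK[b]

# Brute-force check of the three total-order properties over all levels.
antisymmetric = all(a == b for a in LEVELS for b in LEVELS
                    if leq(a, b) and leq(b, a))
transitive = all(leq(a, c) for a in LEVELS for b in LEVELS for c in LEVELS
                 if leq(a, b) and leq(b, c))
total = all(leq(a, b) or leq(b, a) for a in LEVELS for b in LEVELS)

# The median only needs the order, not the distances between levels.
ratings = ["medium", "low", "high", "medium", "high", "low", "medium"]
median_rating = LEVELS[median_low(RANK[r] for r in ratings)]
print(median_rating)
```

`median_low` is used so that the result is always one of the observed levels, even for an even number of samples.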

• Interval scaled data
– Addition and subtraction are valid
– The zero point is arbitrary
– Data of an interval feature can be represented by the (arithmetic) mean

• Ratio scaled data
– Multiplication and division are valid
– Data of a ratio feature can be represented by the generalized mean
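
To illustrate the statistics for interval and ratio data, here is a sketch of the generalized (power) mean, assuming the common definition $m_\alpha = \left(\frac{1}{n}\sum_k x_k^\alpha\right)^{1/\alpha}$. The slides do not state the formula, so this definition, the function name and the sample values are assumptions. With α = 1 it reduces to the arithmetic mean. The limit α → 0 gives the geometric mean and α = -1 gives the harmonic mean.

```python
import math

def generalized_mean(values, alpha):
    """Power mean of positive values; alpha = 0 is treated as the geometric-mean limit."""
    n = len(values)
    if alpha == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** alpha for v in values) / n) ** (1.0 / alpha)

# Hypothetical ratio-scaled data (lengths in metres, strictly positive).
lengths_m = [1.0, 2.0, 4.0]

arithmetic = generalized_mean(lengths_m, 1)   # also valid for interval data
geometric = generalized_mean(lengths_m, 0)    # needs a true zero point
harmonic = generalized_mean(lengths_m, -1)    # needs a true zero point
print(arithmetic, geometric, harmonic)
```

The geometric and harmonic means rely on multiplication and division, so they are only meaningful for ratio-scaled data.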

Data Type, Data Scale, Data value
Data type, data scale and data value are three different concepts.
• Data Type (each type can be of any data scale):
– Discrete Type
• Order of collection does not matter
– Sequential Type
• One-directional order of collection
– Spatio-temporal Type
• Multidimensional order of collection

• Data Scale
– Ratio -> Can be only numerical (also called quantitative)
– Interval -> Can be only numerical (also called quantitative)
– Ordinal -> Can be categorical or qualitative
– Nominal -> Can be only categorical

• Data value
– Discrete (numerical or non-numerical)
– Continuous (numerical, also called quantitative)
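
One way to keep the three concepts apart is to record them as separate fields of a feature description. The sketch below is only illustrative: the `Feature` class, the `valid_operations` helper and the example feature are hypothetical names, not part of the course material. The helper accumulates the operations of all lower scale levels, following the rule stated for the four scales.

```python
from dataclasses import dataclass

# Scale levels from lowest to highest, with the operations each level adds.
SCALES = ["nominal", "ordinal", "interval", "ratio"]
NEW_OPERATIONS = {
    "nominal": ["==", "!="],
    "ordinal": ["<", ">", "<=", ">="],
    "interval": ["+", "-"],
    "ratio": ["*", "/"],
}

def valid_operations(scale):
    """Operations valid on a scale, including those of all lower scales."""
    ops = []
    for level in SCALES[: SCALES.index(scale) + 1]:
        ops.extend(NEW_OPERATIONS[level])
    return ops

@dataclass
class Feature:
    name: str
    data_type: str    # "discrete", "sequential" or "spatio-temporal"
    scale: str        # one of SCALES
    value_kind: str   # "discrete" or "continuous"

# Hypothetical example: a temperature time series in degrees Celsius.
temperature = Feature("temperature_celsius", "sequential", "interval", "continuous")
print(valid_operations(temperature.scale))
```

For this example the result contains comparison, addition and subtraction, but not multiplication or division.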

1985 Auto Imports Database

Abalone (sea snails) data

Census bureau database
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
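
The two sample records can be parsed into named fields with a short Python sketch. Each record has 15 comma-separated values, while the attribute list above names 14. The column names `fnlwgt` (third value) and `income` (last value) are assumptions borrowed from the usual naming in the UCI Adult data set and do not appear on the slide.

```python
import csv
import io

# Column names; "fnlwgt" and "income" are assumed, the rest follow the slide.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
# Attributes marked "continuous" on the slide (plus the assumed fnlwgt).
CONTINUOUS = {"age", "fnlwgt", "education-num", "capital-gain",
              "capital-loss", "hours-per-week"}

raw = """39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
"""

records = []
for row in csv.reader(io.StringIO(raw), skipinitialspace=True):
    record = dict(zip(COLUMNS, row))
    for name in CONTINUOUS:
        record[name] = int(record[name])   # numerical attributes become integers
    records.append(record)

print(records[0]["age"], records[0]["workclass"], records[0]["income"])
```

The categorical attributes stay as strings, so only equality tests (and the mode) apply to them, while the numerical attributes support arithmetic.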

Variables in ML
• The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables, and are typically denoted using the symbol X
• The output variable is often called the response or dependent variable, and is typically denoted using the symbol Y

Supervised Machine Learning
• Our goal in supervised machine learning is to extract a relationship from data, given as ordered pairs (y, x)

The real relation is

$y = f(x) + \epsilon$

where $\epsilon$ is noise with zero mean.

What we get from learning from data is

$\hat{y} = h(x)$
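
The relation between f and h can be sketched with synthetic data. In the snippet below, the true function f(x) = 2x + 1, the noise level and the choice of a straight-line model fitted by least squares are all assumptions made for illustration. The slides do not prescribe a particular f or learning method.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def f(x):
    """Assumed 'real' relation, unknown to the learner in practice."""
    return 2.0 * x + 1.0

# Generate data y = f(x) + epsilon with zero-mean Gaussian noise.
n = 200
xs = [k / (n - 1) for k in range(n)]
ys = [f(x) + random.gauss(0.0, 0.05) for x in xs]

# Learn h(x) = slope * x + intercept by ordinary least squares.
x_mean = sum(xs) / n
y_mean = sum(ys) / n
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

def h(x):
    """Learned approximation of f."""
    return slope * x + intercept

print(slope, intercept)
```

The learned h is close to f but not identical, because it is estimated from a finite, noisy sample.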

Regression vs Classification

$y = f(x) + \epsilon$

• The task of classification differs from regression in that we assign one of a discrete number of classes (nominal or ordinal scale) instead of a continuous value (interval or ratio scale).

• If y is on an interval or ratio scale, then it is regression

• If y is on a nominal or ordinal (?) scale, then it is classification
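
The rule above can be written as a small lookup. This is only a sketch with a hypothetical function name. It follows the slide in treating ordinal outputs as classification, although the slide itself marks that case with a question mark.

```python
def learning_task(output_scale):
    """Map the scale of the output variable y to the type of learning task."""
    if output_scale in ("interval", "ratio"):
        return "regression"
    if output_scale in ("nominal", "ordinal"):
        # The ordinal case is marked as uncertain on the slide.
        return "classification"
    raise ValueError(f"unknown scale: {output_scale}")

# Hypothetical outputs: an income class label versus a count of hours.
print(learning_task("nominal"))  # e.g. predicting a class label
print(learning_task("ratio"))    # e.g. predicting hours per week
```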

Data Set and Matrix Representations
• We denote numerical feature data as the set $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ with n elements, where each element is a p-dimensional real-valued feature vector, and n and p are positive integers.
• For p = 1 we call X a scalar data set.

• As an alternative to the set representation, numerical feature data are also often represented as a matrix.

• Each row of the data matrix corresponds to an element of the data set. It is called a feature vector or data point $x_k$, $k = 1, \ldots, n$.

• Each column of the data matrix corresponds to one component of all elements of the data set. It is called the i-th feature or i-th component $x^{(i)}$, $i = 1, \ldots, p$.

• A single matrix element is a component of an element of the data set. It is called a datum or value $x_k^{(i)}$, $k = 1, \ldots, n$; $i = 1, \ldots, p$.
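
The row, column and element conventions can be sketched with a plain list of lists. The numbers and helper names below are hypothetical. The helpers use the 1-based indices k and i from the text and convert them to Python's 0-based indexing.

```python
# Hypothetical data matrix with n = 4 data points (rows) and p = 3 features (columns).
X = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
]
n = len(X)      # number of elements of the data set
p = len(X[0])   # dimension of each feature vector

def data_point(X, k):
    """Feature vector x_k (row k, 1-based)."""
    return X[k - 1]

def feature(X, i):
    """i-th feature x^(i) across all data points (column i, 1-based)."""
    return [row[i - 1] for row in X]

def datum(X, k, i):
    """Single value x_k^(i): component i of data point k."""
    return X[k - 1][i - 1]

print(data_point(X, 2), feature(X, 3), datum(X, 3, 2))
```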

• Matrix representation of a data set
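
The figure for this slide is not available in the text. Under the convention that each row is a data point and each column is a feature, the data matrix would plausibly take the following form (a reconstruction, not a copy of the original figure):

```latex
X =
\begin{pmatrix}
x_1^{(1)} & \cdots & x_1^{(p)} \\
\vdots    & \ddots & \vdots    \\
x_n^{(1)} & \cdots & x_n^{(p)}
\end{pmatrix}
\in \mathbb{R}^{n \times p}
```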

Data Relations
• Consider a set of (abstract categorical) elements, with no feature vector representation for the objects.

• Conventional feature-based data analysis methods are then not applicable. Instead, the relation of all pairs of objects can often be quantified and written as a square matrix.

• Each relation value $r_{ij}$, $i, j = 1, \ldots, n$, may refer to a degree of similarity, dissimilarity, compatibility, incompatibility, proximity or distance between the pair of objects $o_i$ and $o_j$.
• R may be symmetric, so $r_{ij} = r_{ji}$ for all $i, j = 1, \ldots, n$.
• R may be manually defined or computed from features. If numerical features X are available, then R may be computed from X using an appropriate function $f: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$.
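
As a sketch of computing R from X, the snippet below uses the Euclidean distance as the function f. This is one possible choice among many and is not specified on the slide. The data points and the `relation_matrix` helper are hypothetical.

```python
import math

# Hypothetical feature data: n = 3 points in p = 2 dimensions.
X = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

def relation_matrix(X, f):
    """Square matrix R with r_ij = f(x_i, x_j) for all pairs of data points."""
    return [[f(a, b) for b in X] for a in X]

# math.dist computes the Euclidean distance between two points.
R = relation_matrix(X, math.dist)

# A distance is symmetric, so R equals its transpose.
symmetric = all(R[i][j] == R[j][i] for i in range(len(X)) for j in range(len(X)))
print(R, symmetric)
```

Passing a different f, for example a similarity measure, yields a different relation matrix from the same features.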
