03 ML Data Intro
Types of Data Sets
Record
Relational records
Data matrix
Document data: text documents

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data:
Video data:
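As a minimal sketch of how two of the record types above might be held in memory, the Python below stores the document-term counts as a list of rows and the transaction table as a mapping from TID to item set. The names `terms`, `doc_term`, `transactions` and the helper `support_count` are illustrative assumptions, and the order of the terms in `terms` is assumed rather than taken from the slide.

```python
# Document data: each row is one document, each column a term count.
# The term order below is an assumption made for illustration.
terms = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]
doc_term = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],  # Document 1
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],  # Document 2
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],  # Document 3
]

# Transaction data: each transaction ID (TID) maps to a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support_count(item):
    """Return the number of transactions that contain the given item."""
    return sum(item in items for items in transactions.values())


# Pair each term with its count in Document 1, and count one item.
print(dict(zip(terms, doc_term[0])))
print(support_count("Milk"))
```

A set per transaction fits here because a transaction records only which items are present, not how many or in what order.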
Important Characteristics of Structured Data
Dimensionality
Attributes/Characteristics/Features
Sparsity
Only presence counts
Resolution
Patterns depend on the scale/volume of data (Big Data)
Distribution
Centrality and dispersion
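The characteristics above can be measured directly on a small data set. The sketch below does so for a 3 x 10 document-term count matrix; the variable names are illustrative assumptions, and population standard deviation is just one possible choice of dispersion measure.

```python
from statistics import mean, pstdev

# Three documents described by ten term counts each.
doc_term = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

# Dimensionality: number of attributes (features) per data object.
dimensionality = len(doc_term[0])

# Sparsity: fraction of cells that are zero, i.e. the term is absent.
cells = [value for row in doc_term for value in row]
sparsity = sum(value == 0 for value in cells) / len(cells)

# Distribution: one measure of centrality and one of dispersion.
center = mean(cells)
spread = pstdev(cells)

print(dimensionality, sparsity, center, spread)
```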
Data Objects
Binary
Numeric:
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Enum
Why Enumerations?
Examples
Hair_color = {black, brown, grey, red, white}
University departments, engineering programs, occupations, zip codes
More examples
?
Can we represent values as numbers?
Why? Why Not?
Order is significant?
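One way to explore the questions above is with Python's `enum` module. In this sketch the class `HairColor`, the integer codes, and the helper `one_hot` are illustrative assumptions. The integer codes are arbitrary labels: equality between members is meaningful, but a plain `Enum` refuses ordering comparisons, which matches the idea that order is not significant for nominal values.

```python
from enum import Enum


class HairColor(Enum):
    # The integers are arbitrary labels, not quantities.
    BLACK = 1
    BROWN = 2
    GREY = 3
    RED = 4
    WHITE = 5


def one_hot(color):
    """Encode a nominal value as a 0/1 vector with no implied order."""
    return [1 if member is color else 0 for member in HairColor]


# Equality is meaningful for nominal values.
same = HairColor.RED == HairColor.RED

# Ordering is not: a plain Enum raises TypeError on '<'.
try:
    HairColor.BLACK < HairColor.RED
    ordering_supported = True
except TypeError:
    ordering_supported = False

print(one_hot(HairColor.RED), same, ordering_supported)
```

One-hot encoding is shown because it gives each category its own 0/1 column, so no category ends up numerically "larger" than another.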
Attribute Types
Binary
Why do we use Binary variables?
Nominal attribute with only 2 states (0 and 1)
Examples:
?
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
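The practical effect of the symmetric/asymmetric distinction shows up when two objects are compared. The sketch below uses two common measures that are not named on the slide and are brought in here only as an illustration: the simple matching coefficient, which counts 0-0 and 1-1 agreements alike, and the Jaccard coefficient, which ignores 0-0 agreements. The patient vectors are made-up data.

```python
def simple_matching(a, b):
    """Fraction of positions where the two binary vectors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """Agreement on 1s only; positions where both are 0 are ignored."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    # Returning 0.0 when neither vector has a 1 is a convention chosen here.
    return both / either if either else 0.0


# Hypothetical results of six medical tests (1 = positive, 0 = negative).
patient_a = [1, 0, 0, 0, 0, 0]
patient_b = [1, 1, 0, 0, 0, 0]

smc = simple_matching(patient_a, patient_b)
jac = jaccard(patient_a, patient_b)
print(smc, jac)
```

With mostly negative results, the symmetric measure reports high similarity largely because of shared negatives, while the asymmetric measure looks only at the rarer positive outcomes.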
Attribute Types
Ordinal
Values have a meaningful order (ranking), but the magnitude between successive values is not known.
Examples
Size = {small, medium, large},
CGPA or grades,
designation rankings
Other examples
?
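A minimal sketch of working with an ordinal attribute, assuming the `Size` example above: the labels are mapped to ranks so that sorting, comparison, and the median are available, while differences between ranks are deliberately not interpreted. The names `SIZE_ORDER`, `rank`, and `shirts` are illustrative assumptions.

```python
from statistics import median_low

# Ordered labels: position in the list encodes rank only.
SIZE_ORDER = ["small", "medium", "large"]
rank = {label: position for position, label in enumerate(SIZE_ORDER)}

shirts = ["large", "small", "medium", "small"]

# Sorting and comparison are meaningful for ordinal values.
sorted_shirts = sorted(shirts, key=lambda label: rank[label])
is_larger = rank["large"] > rank["medium"]

# The median is meaningful too, because it depends only on order.
median_size = SIZE_ORDER[median_low([rank[label] for label in shirts])]

# rank["large"] - rank["medium"] equals 1, but that number carries no
# information about how much larger "large" is than "medium".
print(sorted_shirts, is_larger, median_size)
```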
Numeric Attribute Types
NUMERIC / Quantity (integer or real-valued)
Interval: All normal values
Measured on a scale of equal-sized units
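Distance between successive values is equal
100 marks and 90 marks are the same distance apart as 50 marks and 40 marks
Values have order
E.g., temperature in °C or °F, calendar dates
Zero is a meaningful value on the scale (e.g., 0 °C is a temperature), not the absence of the quantity
Statistical formulas apply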
Examples
Nearly all of our ordinary numeric values
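A short sketch of what equal-sized units do and do not give you on an interval scale, using the temperature example. The helper name `c_to_f` is an assumption. Equal differences in °C stay equal after conversion to °F, but the ratio between two temperatures changes with the unit, so a statement like "twice as hot" is not supported on these scales.

```python
def c_to_f(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Equal differences stay equal after changing units.
diff_low_f = c_to_f(20.0) - c_to_f(10.0)
diff_high_f = c_to_f(40.0) - c_to_f(30.0)

# Ratios do not survive the change of units.
ratio_c = 20.0 / 10.0
ratio_f = c_to_f(20.0) / c_to_f(10.0)

print(diff_low_f, diff_high_f)
print(ratio_c, ratio_f)
```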
Numeric Attribute Types
NUMERIC / Quantity (integer or real-valued)
Ratio
Count based values: Number of ?
Frequency based or Normalized
Comparison based values
Pak Rupees vs Dollars
Inherent Zero-point (Special Definition of ZERO POINT)
EXAMPLES:
Weight, Height, HB-level, etc.
Zero means the absence of the quantity.
PH value…
Temperature: is it a ratio value?
In °F or °C, it is not a ratio value.
In kelvin (K), it is a ratio value.
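The role of the inherent zero point can be checked numerically. In this sketch the helper `k_to_c` and the constant `KG_TO_LB` (an approximate conversion factor) are assumptions. A ratio of two weights is the same in kilograms and pounds because both scales share the same zero, whereas a 2:1 ratio in kelvin disappears when the same temperatures are expressed in °C.

```python
KG_TO_LB = 2.20462  # approximate conversion factor, for illustration


def k_to_c(kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return kelvin - 273.15


# Weight: a pure rescaling keeps zero at zero, so ratios are preserved.
ratio_kg = 80.0 / 40.0
ratio_lb = (80.0 * KG_TO_LB) / (40.0 * KG_TO_LB)

# Temperature: 200 K is twice 100 K, but the ratio is lost in Celsius
# because the Celsius zero is shifted away from the true zero point.
ratio_k = 200.0 / 100.0
ratio_c = k_to_c(200.0) / k_to_c(100.0)

print(ratio_kg, ratio_lb)
print(ratio_k, ratio_c)
```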
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., the set of words in a collection of documents
Sometimes, represented as integer variables