2 1 Data
— Chapter 2 —
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix
Document data: text documents:
Term-frequency vector
Transaction data
Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data
Video data
Example of transaction data:
TID   Items
1     Bread, Coke, Milk
2     Butter, Bread
3     Butter, Coke, Cookies, Milk
4     Butter, Bread, Cookies, Milk
5     Coke, Cookies, Milk
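The term-frequency vector listed under document data can be illustrated with a minimal sketch. The vocabulary, the sample document, and the helper name `term_frequency_vector` are hypothetical, and splitting on whitespace is a simplifying assumption rather than a full tokenizer.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text`."""
    # Lowercase and split on whitespace (a simplifying assumption).
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


vocabulary = ["bread", "butter", "milk"]   # hypothetical vocabulary
document = "Bread butter bread milk"       # hypothetical document
tf = term_frequency_vector(document, vocabulary)
print(tf)  # [2, 1, 1]
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms.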
Important Characteristics of Structured Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Distribution
Centrality and dispersion
Data Objects
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts,
monetary quantities
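The difference between interval and ratio scales can be checked with a small sketch. The temperatures are illustrative values and `celsius_to_kelvin` is a hypothetical helper. Differences agree on both scales, but a ratio only has physical meaning on the scale with a true zero-point.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


c_low, c_high = 5.0, 10.0  # illustrative temperatures in °C

# Differences are meaningful on an interval scale and agree with Kelvin.
difference_c = c_high - c_low
difference_k = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratios are only meaningful on a ratio scale.
naive_ratio = c_high / c_low  # 2.0, but 10 °C is not "twice as hot" as 5 °C
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference_c, round(difference_k, 6))
print(naive_ratio, round(kelvin_ratio, 4))
```

On the Kelvin scale the two temperatures differ by a factor of only about 1.018.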
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., the set of words in a collection of documents
Sometimes, represented as integer variables
Continuous Attribute
Has real numbers as attribute values
Chapter 2: Getting to Know Your Data
Summary
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
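Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ (population)
Note: n is sample size and N is population size.
Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$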
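The arithmetic mean and the weighted arithmetic mean can be sketched as follows. The function names and the sample values and weights are illustrative assumptions.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


values = [2.0, 4.0, 6.0]    # illustrative data
weights = [1.0, 1.0, 2.0]   # illustrative weights

plain = mean(values)                      # (2 + 4 + 6) / 3 = 4.0
weighted = weighted_mean(values, weights) # (2 + 4 + 12) / 4 = 4.5
print(plain, weighted)
```

With equal weights the weighted mean reduces to the ordinary mean.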
Median:
Middle value if odd number of values, or average of the middle two values
otherwise
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
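A minimal sketch of the median and mode using Python's standard `statistics` module (`multimode` needs Python 3.8 or later). The data sets are illustrative. The two sides of the empirical relation are computed for comparison only, since the relation is an approximation for moderately skewed unimodal data and need not hold closely for a small sample.

```python
import statistics

odd_data = [1, 2, 2, 3, 4, 7, 9]   # illustrative, odd number of values
even_data = [1, 2, 3, 4]           # illustrative, even number of values
bimodal_data = [1, 1, 2, 2, 3]     # illustrative, two modes

median_odd = statistics.median(odd_data)     # middle value
median_even = statistics.median(even_data)   # average of the two middle values
modes_odd = statistics.multimode(odd_data)   # list of the most frequent values
modes_bimodal = statistics.multimode(bimodal_data)

# Both sides of the empirical relation for the unimodal sample.
mean_odd = statistics.mean(odd_data)
left_side = mean_odd - modes_odd[0]
right_side = 3 * (mean_odd - median_odd)

print(median_odd, median_even, modes_odd, modes_bimodal)
print(left_side, right_side)
```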
Symmetric vs. Skewed Data
Population variance:
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
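The population variance can be computed either from squared deviations about the mean or from the mean of squares minus the squared mean. The following sketch checks that both forms agree. The function names and the data are illustrative assumptions.

```python
def population_variance(values):
    """Mean of the squared deviations from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data with mean 5

var_definition = population_variance(data)
var_shortcut = population_variance_shortcut(data)
print(var_definition, var_shortcut)  # 4.0 4.0
```

The shortcut form needs only one pass over the data, but it can lose precision in floating point when the mean is large relative to the spread.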
Sample Variance – Unbiased estimator
Averaged over many samples, the variance calculated with the n-1 correction equals the true
population variance, meaning it is unbiased.
Khanacademy.org
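The unbiasedness of the n-1 estimator can be checked exactly on a tiny example. This sketch assumes a hypothetical population of four values and enumerates every sample of size 2 drawn with replacement. It uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product


def variance(values, ddof):
    """Sum of squared deviations divided by (n - ddof), as an exact fraction."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


population = [1, 2, 3, 4]                       # hypothetical population
true_variance = variance(population, ddof=0)    # divide by N

# All 16 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

# Average each estimator over every possible sample.
avg_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)
avg_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)

print(true_variance, avg_unbiased, avg_biased)  # 5/4 5/4 5/8
```

Dividing by n instead of n-1 underestimates the variance on average, here by the factor (n-1)/n = 1/2.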
Quartiles
A plot of the data distribution for some attribute X. The
quantiles plotted are quartiles. The three quartiles divide
the distribution into four equal-size consecutive subsets.
Boxplot Analysis
Boxplots
Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.
Exercise
Find the median, quartiles, and interquartile range for the
following 19 samples:
10, 5, 7, 23, 24, 15, 24, 19, 21, 25, 21, 22, 22, 23, 24, 23, 24, 23, 23
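One way to check an answer to this exercise is with Python's standard `statistics` module (Python 3.8 or later). Quartile conventions differ between textbooks and tools, so the choice of `method="exclusive"` here is an assumption: it places the quartiles at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 of the sorted data, which for 19 values are the 5th, 10th and 15th values. Other conventions can give slightly different Q1 and Q3.

```python
import statistics

samples = [10, 5, 7, 23, 24, 15, 24, 19, 21, 25,
           21, 22, 22, 23, 24, 23, 24, 23, 23]

# Quartile cut points under the "exclusive" convention (an assumption).
q1, median, q3 = statistics.quantiles(samples, n=4, method="exclusive")

# The interquartile range is the spread of the middle half of the data.
iqr = q3 - q1

print(sorted(samples))
print(q1, median, q3, iqr)  # 19.0 23.0 24.0 5.0
```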
Properties of Normal Distribution Curve
Outlier Detection: idea from a paper