Concepts and Techniques: - Chapter 2
— Chapter 2 —
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data:
Video data:
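As an illustration of the term-frequency vector listed under document data, the following is a minimal sketch that counts how often each vocabulary term occurs in a text. The function name `term_frequency_vector`, the whitespace tokenization, and the sample sentence are assumptions made for this example, not part of the original slides.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Return one count per vocabulary term, in vocabulary order.

    Assumption: tokens are lowercase words separated by whitespace.
    """
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball"]
document = "team play wins the coach said the team must play the ball"

print(term_frequency_vector(document, vocabulary))
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms, which is why document data can be treated as record data.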
Important Characteristics of Structured Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
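To make the sparsity characteristic concrete, here is a small sketch that measures the fraction of zero entries in a record and stores only the nonzero ones. The helper names `sparsity` and `to_sparse` are hypothetical; the dictionary layout is one possible representation among several.

```python
def sparsity(vector):
    """Fraction of entries that are zero."""
    return sum(1 for v in vector if v == 0) / len(vector)

def to_sparse(vector):
    """Keep only the nonzero entries as {column index: value}."""
    return {i: v for i, v in enumerate(vector) if v != 0}

# A ten-column term-frequency row in which most entries are zero.
row = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]

print(sparsity(row))
print(to_sparse(row))
```

When most entries are zero, storing only the present values saves space, and many similarity measures need to look only at the nonzero positions.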
Data Objects
Types:
Nominal
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal:
Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things.
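Discrete Attribute
E.g., the set of words in a collection of documents
Sometimes represented as integer variables
Note: binary attributes are a special case of discrete attributes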
Continuous Attribute
Has real numbers as attribute values
Typically represented as floating-point variables
Chapter 2: Getting to Know Your Data
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
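The dispersion characteristics named above (median, max, min, quantiles, outliers) are the ingredients of a boxplot. The sketch below computes a five-number summary and flags outliers. It rests on two assumptions not stated in the slides: quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles`, and outliers are values more than 1.5 × IQR beyond the quartiles, a common convention. The function names and data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values):
    """Values beyond 1.5 * IQR from the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

print(five_number_summary(data))
print(iqr_outliers(data))
```

Other quartile definitions give slightly different cut points on small samples, so the exact fences depend on the method chosen.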
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Note: n is sample size and N is population size.
Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of the middle two values otherwise
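The measures of central tendency listed here can be sketched in a few lines. The helpers `weighted_mean` and `trimmed_mean` are hypothetical names written for this example. The trimmed mean assumes that the same fraction of values is dropped from each end of the sorted data, which is one common reading of "chopping extreme values".

```python
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * trim_fraction)  # count dropped at each end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one extreme value.
data = [2, 4, 4, 6, 100]

print(statistics.mean(data))         # pulled upward by the extreme value
print(trimmed_mean(data, 0.2))       # drops 2 and 100 before averaging
print(statistics.median(data))       # middle value of the sorted data
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

The mean is sensitive to the single extreme value, while the trimmed mean and the median stay close to the bulk of the data.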
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
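Empirical formula (for moderately skewed unimodal data):
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$
Variance (population):
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$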
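As a numerical check, the sketch below finds the mode(s) of a small data set and computes the population variance in two ways: from squared deviations about the mean, and from the mean of squares minus the squared mean. The function names and the data are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
import statistics

def population_variance(values):
    """Mean of squared deviations from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_variance_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

# Hypothetical data: the mean is 5 and the value 4 occurs most often.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.multimode(data))          # all most-frequent values
print(population_variance(data))
print(population_variance_shortcut(data))
```

The shortcut form needs only one pass over the data, but with floating-point values far from zero it can lose precision, so the deviation form is often the safer choice.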
A histogram
Histograms Often Tell More than Boxplots