Class 2 Introduction to Data
What is Data?
• Collection of data objects and their attributes
Types of Attributes
• There are different types of attributes
– Nominal. Examples: ID numbers, eye color, zip codes
– Ordinal. Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval. Examples: calendar dates, temperatures in Celsius or
Fahrenheit
– Ratio. Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it
possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /
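The link between attribute types and these properties can be sketched as a small lookup table. The mapping follows the usual convention that each type adds one property to the previous one (nominal, ordinal, interval, ratio). The names `PROPERTIES`, `supports`, and `celsius_to_kelvin` are illustrative and do not come from the slides.

```python
# Each attribute type supports the properties of the previous type plus one more.
PROPERTIES = {
    "nominal":  {"distinctness"},
    "ordinal":  {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio":    {"distinctness", "order", "addition", "multiplication"},
}

def supports(attribute_type, prop):
    """Return True if values of this attribute type meaningfully support prop."""
    return prop in PROPERTIES[attribute_type]

def celsius_to_kelvin(c):
    """Convert a Celsius temperature (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

# Differences are meaningful on an interval scale such as Celsius ...
diff_c = 20.0 - 10.0
# ... but ratios only make sense on a ratio scale such as Kelvin:
# 20 C is not "twice as hot" as 10 C.
ratio_k = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

print(supports("interval", "multiplication"))  # False
print(supports("ratio", "multiplication"))     # True
print(round(ratio_k, 3))                       # 1.035
```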
• Nominal attribute
– Description: The values of a nominal attribute are just different
names, i.e., nominal attributes provide only enough information to
distinguish one object from another. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test
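Mode and entropy, two of the operations listed for nominal attributes, need only equality comparisons between values. A minimal sketch, assuming a made-up list of eye colors; the function names are illustrative.

```python
import math
from collections import Counter

def mode(values):
    """Most frequent value; ties are broken by first occurrence."""
    return Counter(values).most_common(1)[0][0]

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Made-up sample of a nominal attribute.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

print(mode(eye_colors))
print(round(entropy(eye_colors), 3))
```

Neither function orders or adds the values, which is why both remain valid for nominal data.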
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a
finite number of digits.
– Continuous attributes are typically represented as floating point
variables.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Structured Data
– Dimensionality
• Curse of Dimensionality
– Sparsity
• Only presence counts
– Resolution
• Patterns depend on the scale
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes
Columns: Tid, Refund, Marital Status, Taxable Income, Cheat
Document-term matrix columns: team, coach, play, ball, score, game,
win, lost, timeout, season
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitutes a transaction, while
the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
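The transaction table above maps naturally onto a dictionary of sets. The sketch below counts how often each item occurs and how many transactions contain a given itemset; `support_count` is an illustrative helper name, not something defined in the slides.

```python
from collections import Counter

# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions in which each individual item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)

print(item_counts["Milk"])                # 4
print(support_count({"Diaper", "Milk"}))  # 3
```

Using sets reflects that a transaction records only which items were bought, not their order.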
Graph Data
• Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
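HTML links like these define the edges of a graph: the page is a node and every `href` target is a neighbor. A minimal sketch using Python's standard `html.parser`; the page name `index.html` and the class name `LinkCollector` are assumptions made for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Two of the links from the slide, shortened for the example.
page = """
<a href="papers/papers.html#bbbb">Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning </a>
"""

collector = LinkCollector()
collector.feed(page)

# Adjacency list: the (hypothetical) page index.html points to each target.
graph = {"index.html": collector.links}
print(graph)
```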
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
– Each element of the sequence is a set of items/events
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
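Because position matters in sequence data, typical first steps are counting symbols and sliding a fixed-width window over the string. The sketch below is illustrative only; `gc_content` and `kmers` are common helper names, not part of the slides.

```python
from collections import Counter

# First line of the genomic sequence shown above.
sequence = "GGTTCCGCCTTCAGCCCCGCGCC"

def base_counts(seq):
    """Count how often each nucleotide occurs."""
    return Counter(seq)

def gc_content(seq):
    """Fraction of positions that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def kmers(seq, k):
    """All overlapping substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(base_counts(sequence))
print(gc_content(sequence))
print(kmers(sequence, 3)[:5])
```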
Ordered Data
• Average monthly temperature of land and ocean
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
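Detecting the "same person, multiple email addresses" case can be sketched by grouping records on a normalized key. The records below are made up, and matching on a lower-cased, whitespace-collapsed name is a deliberately naive assumption; real data usually needs fuzzier matching.

```python
from collections import defaultdict

# Made-up records for illustration.
records = [
    {"name": "Ana Silva", "email": "ana@example.com"},
    {"name": "ana  silva", "email": "a.silva@example.org"},
    {"name": "Bo Chen", "email": "bo@example.com"},
]

def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def group_duplicates(rows):
    """Map each normalized name that has more than one email to its emails."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize(row["name"])].append(row["email"])
    return {name: emails for name, emails in groups.items() if len(emails) > 1}

print(group_duplicates(records))
```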
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More stable data
• Aggregated data tends to have less variability
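Two of these purposes can be illustrated with a short sketch: rolling cities up into countries (fewer objects, coarser scale) and averaging daily values into weekly values (less variability). All figures are made up or randomly generated, and the function name `aggregate` is illustrative.

```python
import random
import statistics
from collections import defaultdict

# Change of scale and data reduction: four city rows become two country rows.
city_sales = [
    ("Lyon", "France", 120),
    ("Paris", "France", 300),
    ("Berlin", "Germany", 250),
    ("Munich", "Germany", 150),
]

def aggregate(rows):
    """Sum the amount of every city into its country."""
    totals = defaultdict(int)
    for _city, country, amount in rows:
        totals[country] += amount
    return dict(totals)

print(aggregate(city_sales))

# More stable data: weekly means vary less than the daily values they summarize.
rng = random.Random(0)  # fixed seed so the run is repeatable
daily = [rng.gauss(100, 20) for _ in range(28)]
weekly = [statistics.mean(daily[i:i + 7]) for i in range(0, 28, 7)]

print(statistics.pstdev(daily), statistics.pstdev(weekly))
```

With equal-sized groups, the spread of the group means can never exceed the spread of the underlying values, which is the "less variability" effect in its simplest form.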
Aggregation
Variation of Precipitation
Sampling
• Stratified sampling
– Split the data into several partitions; then draw random samples from
each partition
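Stratified sampling can be sketched in a few lines: partition the rows by a key, then sample from each partition separately. Drawing the same fraction from every partition (with at least one row each) is one possible policy assumed here; the function and parameter names are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows, without replacement, from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row
        sample.extend(rng.sample(members, k))
    return sample

# Made-up, imbalanced data: 90 rows of class "a" and 10 rows of class "b".
rows = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.1)

print(len(sample))
print(sum(1 for label, _ in sample if label == "b"))
```

Unlike simple random sampling, the rare class "b" is guaranteed to appear in the sample.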
Sample Size
Dimensionality Reduction
• Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and nonlinear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the
largest amount of variation in data
Dimensionality Reduction: PCA
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
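For two attributes (x1, x2) the eigenvectors of the covariance matrix have a closed form, so the idea can be sketched without a linear algebra library. This is a teaching sketch restricted to 2-D data; for real work one would typically call a library routine. The function name `pca_2d` is illustrative.

```python
import math

def pca_2d(points):
    """Return ((lam1, v1), (lam2, v2)): eigenvalues and unit eigenvectors of
    the sample covariance matrix of 2-D points, largest eigenvalue first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector belonging to the largest eigenvalue.
    if sxy != 0:
        v = (lam1 - syy, sxy)
    elif sxx >= syy:
        v = (1.0, 0.0)
    else:
        v = (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    v1 = (v[0] / norm, v[1] / norm)
    v2 = (-v1[1], v1[0])  # orthogonal direction
    return (lam1, v1), (lam2, v2)

# Points lying exactly on the line x2 = x1: all variation is along (1, 1).
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
(lam1, v1), (lam2, v2) = pca_2d(points)
print(lam1, v1)
print(lam2, v2)
```

The first eigenvector is the projection direction that captures the largest variance; the second eigenvalue is (numerically) zero here because the points have no spread perpendicular to the line.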
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
– duplicate much or all of the information contained
in one or more other attributes
– Example: purchase price of a product and the
amount of sales tax paid
• Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the
Feature Subset Selection
• Techniques:
– Brute force approach:
• Try all possible feature subsets as input to data mining
algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data
mining algorithm
– Filter approaches:
• Features are selected before data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find best
subset of attributes
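The brute-force approach can be sketched by enumerating every non-empty subset and scoring each one with a black-box function, which is also the basic shape of a wrapper approach. The scoring function below is a toy stand-in for "run the data mining algorithm and measure its quality"; the feature names and `toy_score` are hypothetical.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty subset of features (2**n - 1 of them)."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset

def best_subset(features, score):
    """Return the subset with the highest score; score is treated as a black box."""
    return max(all_subsets(features), key=score)

# Hypothetical scoring: reward useful features, lightly penalize the rest.
USEFUL = {"price", "quantity"}

def toy_score(subset):
    chosen = set(subset)
    return len(chosen & USEFUL) - 0.1 * len(chosen - USEFUL)

features = ["price", "sales_tax", "quantity"]
print(len(list(all_subsets(features))))  # 7
print(best_subset(features, toy_score))  # ('price', 'quantity')
```

The number of subsets doubles with every added feature, which is why brute force is only feasible for small feature sets.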
Feature Creation
• Create new attributes that can capture the
important information in a data set much
more efficiently than the original attributes