Lecture 3 Variables and Data Preprocessing
Chapter 2: Getting to Know Your Data
Data Visualization
Summary
Types of Data Sets
Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors
  Transaction data
Graph and network
  World Wide Web
  Social or information networks
  Molecular structures
Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential data: transaction sequences
  Genetic sequence data
Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data

Example of document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
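A minimal sketch of how a text document might be turned into a term-frequency vector. The vocabulary reuses the terms from the slide (their order here is an assumption), the sample sentence is made up, and splitting on whitespace is a deliberate simplification of real tokenization.

```python
from collections import Counter

# Fixed vocabulary: one vector position per term (order is an assumption).
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]

# Hypothetical document.
doc = "coach said the team lost the game but the team will play the next game"
vec = term_frequency_vector(doc, vocabulary)
print(vec)  # [2, 1, 1, 0, 0, 2, 0, 1, 0, 0]
```

Words outside the vocabulary (such as "the" or "said") are simply ignored, which is why every document maps to a vector of the same length.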
Data Objects
Attribute types:
Nominal
Binary
Ordinal
Numeric: quantitative
  Interval-scaled
  Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
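As a rough illustration of which operations each attribute type supports, the sketch below uses Python's standard `enum` module. The class names `HairColor` and `Size`, the integer codes, and the test results are illustrative assumptions, not part of the lecture. The point is only that nominal values support equality tests, while ordinal values additionally support ordering.

```python
from enum import Enum, IntEnum

class HairColor(Enum):      # nominal: names only, no order
    BLACK = "black"
    BROWN = "brown"
    RED = "red"

class Size(IntEnum):        # ordinal: the integer codes only encode rank
    SMALL = 1
    MEDIUM = 2
    LARGE = 3

# Nominal: only equality / inequality is meaningful.
same_color = HairColor.BLACK == HairColor.BROWN   # False

# Ordinal: ordering is meaningful ...
is_larger = Size.LARGE > Size.SMALL               # True
ranked = sorted([Size.LARGE, Size.SMALL, Size.MEDIUM])
# ... but the gap between codes is not: LARGE - SMALL == 2 says
# nothing about how much larger "large" is than "small".

# Asymmetric binary: by convention 1 marks the more important outcome.
test_results = [0, 0, 1, 0]       # hypothetical data, 1 = positive test
num_positive = sum(test_results)  # 1
```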
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of one value as being a multiple of another
(10 K is twice as high as 5 K).
E.g., temperature in Kelvin, length, counts,
monetary quantities
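The difference between interval and ratio scales can be checked with a short calculation. This is a sketch using the standard Celsius-to-Kelvin offset of 273.15; the variable names and the two example temperatures are illustrative choices.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

a_c, b_c = 10.0, 5.0

# Interval scale: differences are meaningful and survive the conversion.
difference = a_c - b_c                                   # 5.0 degrees

# Interval scale: ratios are NOT meaningful, because 0 °C is an
# arbitrary zero point. 10 °C is not "twice as hot" as 5 °C.
naive_ratio = a_c / b_c                                  # 2.0

# Ratio scale: Kelvin has a true zero, so this ratio is meaningful.
true_ratio = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

print(naive_ratio, round(true_ratio, 3))  # 2.0 1.018
```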
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., the set of words in a collection of documents
Sometimes represented as integer variables
Continuous Attribute
Has real numbers as attribute values
Typically represented as floating-point variables
The data analysis pipeline
Mining is not the only step in the analysis process
Pre-processing: real data is noisy and incomplete, so it needs cleaning and selection before mining.
Dirty work, but often the most important step of the analysis.
Post-Processing: Make the data actionable and useful to the
user
Statistical analysis of importance
Visualization.
Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
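A minimal sketch of how these three problems might be checked in a small table of records, using only the Python standard library. The `records` data, the field layout, and the outlier rule (flag values more than 3 scaled median absolute deviations from the median) are illustrative assumptions, not something prescribed by the lecture.

```python
from statistics import median

# Hypothetical records: (id, height in cm); None marks a missing value.
records = [
    (1, 172.0),
    (2, 165.0),
    (3, None),      # missing value
    (2, 165.0),     # duplicate of record 2
    (4, 180.0),
    (5, 1750.0),    # likely a unit or typing error
    (6, 169.0),
]

# Missing values: rows where the measurement is absent.
missing = [r for r in records if r[1] is None]

# Duplicate data: keep the first occurrence of each identical row.
seen = set()
deduplicated = []
for r in records:
    if r not in seen:
        seen.add(r)
        deduplicated.append(r)

# Outliers: flag values far from the median, measured in units of the
# median absolute deviation (MAD), which is itself robust to outliers.
values = [h for _, h in deduplicated if h is not None]
med = median(values)
mad = median(abs(v - med) for v in values)
outliers = [v for v in values if abs(v - med) > 3 * 1.4826 * mad]

print(missing, len(deduplicated), outliers)
```

A mean-and-standard-deviation rule would be a common alternative, but a single extreme value inflates both statistics, which is why a median-based rule is used in this sketch.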
Sampling
Sampling is the main technique employed for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Example: What is the average height of a person in Ioannina?
• We cannot measure the height of everybody.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• Example: We have 1M documents. What fraction of pairs has at least 100 words in common?
• Computing the number of common words for all pairs requires about 10^12 comparisons.
• Example: What fraction of tweets in a year contain the word “Greece”?
• At 300M tweets per day and 100 characters on average, storing a year of tweets takes about 11 TB (at one byte per character).
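The idea of answering such a question from a sample rather than from the full data set can be sketched as follows. The synthetic population, the 10% base rate, the sample size, and the fixed seed are all assumptions chosen for illustration; only the Python standard library is used.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for a large collection: True means the item
# (e.g., a tweet) contains the word of interest. The 10% rate is made up.
population = [rng.random() < 0.10 for _ in range(200_000)]

# Exact answer: requires touching every item.
true_fraction = sum(population) / len(population)

# Sampled answer: simple random sample without replacement.
sample = rng.sample(population, 5_000)
estimated_fraction = sum(sample) / len(sample)

print(f"true={true_fraction:.4f} estimate={estimated_fraction:.4f}")
```

With a sample of 5,000 items and a true rate near 0.10, the standard error of the estimate is roughly sqrt(0.1 × 0.9 / 5000) ≈ 0.004, so the sample gives a usable answer while reading only 2.5% of the data.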
Types of Sampling
Simple Random Sampling
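There is an equal probability of selecting any particular item.
In sampling with replacement, objects are not removed from the population when they are selected for the sample, so the same object can be picked more than once.
Example: in a population, the probability of picking a woman is P(W) = 0.51 and of picking a man is P(M) = 0.49. If I pick two persons, what is the probability P(W,W) that both are women?
• Sampling with replacement: P(W,W) = 0.51^2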
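To make the example concrete, the sketch below works out P(W,W) with and without replacement. It assumes a hypothetical population of 100 people of whom 51 are women (consistent with a 0.51 share of women); the population size is an assumption, needed because the without-replacement probability depends on it.

```python
from fractions import Fraction

population_size = 100   # assumed; the example only gives proportions
women = 51              # so that P(W) = 51/100 = 0.51

# With replacement: the two draws are independent.
p_with = Fraction(women, population_size) ** 2

# Without replacement: the first woman is removed before the second draw.
p_without = (Fraction(women, population_size)
             * Fraction(women - 1, population_size - 1))

print(float(p_with))     # 0.2601
print(p_without)         # 17/66, about 0.2576
```

The two answers differ only slightly here, and the gap shrinks as the population grows, which is one reason the simpler with-replacement formula is often used for analysis.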
Stratified sampling
Split the data into several groups; then draw random samples from each group.
Example: Do web pages that are linked have on average more words in common than those that are not? I have 1M pages and 1M links; what happens if I select 10K pairs of pages at random?
Most likely I will not get any links. Solution: sample 10K random pairs and 10K links.
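The arithmetic behind "most likely I will not get any links", and the stratified alternative, can be sketched as follows. The toy pair lists, the helper name `stratified_sample`, and the seed are assumptions for illustration; pairs of pages are treated as unordered.

```python
import random

# Why plain random sampling fails here: linked pairs are extremely rare.
n_pages = 1_000_000
n_links = 1_000_000
n_pairs = n_pages * (n_pages - 1) // 2        # about 5 * 10^11 unordered pairs
expected_links = 10_000 * n_links / n_pairs   # about 0.02 linked pairs in 10K draws

def stratified_sample(groups, k, rng):
    """Draw k items at random (without replacement) from each named group."""
    return {name: rng.sample(items, k) for name, items in groups.items()}

# Toy data (hypothetical): 1,000 candidate pairs, 20 of which are links.
rng = random.Random(0)
linked = [(i, i + 1) for i in range(20)]
unlinked = [(i, i + 2) for i in range(980)]
sample = stratified_sample({"linked": linked, "unlinked": unlinked}, k=10, rng=rng)

print(f"expected linked pairs in a plain sample: {expected_links:.3f}")
print({name: len(items) for name, items in sample.items()})
```

Because each group is sampled separately, the rare group (linked pairs) is guaranteed to be represented, at the cost that the combined sample no longer reflects the true proportions of the two groups.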
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled,
ratio-scaled
Many types of data sets, e.g., numerical, text, graph, Web,
image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives