Datamining-Lect1 2
LECTURE 1
Introduction
Behavioral data
• Mobile phones today record a large amount of information
about user behavior
• GPS records position
• Camera produces images
• Communication via phone and SMS
• Text via Facebook updates
• Amazon collects all the items that you browsed, placed into
your basket, read reviews about, or purchased.
Types of Attributes
• There are different types of attributes
• Categorical
• Examples: eye color, ID number, rankings (e.g., good, fair,
bad), height in {tall, medium, short}
• Nominal (no order) vs Ordinal (order)
• Nominal and Ordinal are collectively referred to as Categorical
or qualitative attributes
• Numeric (quantitative)
• Examples: dates, temperature, time, length, value.
• Interval
• Ratio
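As a small illustration of why the interval/ratio distinction matters, the sketch below shows that differences survive a shift of the zero point while ratios do not. The temperature readings and the helper `to_kelvin` are assumptions made up for this example, not values from the lecture.

```python
# Illustrative values only (not from the lecture): two temperature readings.
celsius_a, celsius_b = 20.0, 10.0

def to_kelvin(celsius: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15

kelvin_a, kelvin_b = to_kelvin(celsius_a), to_kelvin(celsius_b)

# Differences are meaningful for interval attributes: they agree in both scales.
diff_celsius = celsius_a - celsius_b
diff_kelvin = kelvin_a - kelvin_b

# Ratios are only meaningful for ratio attributes: "twice as hot" in Celsius
# disappears once the arbitrary zero point is removed.
ratio_celsius = celsius_a / celsius_b
ratio_kelvin = kelvin_a / kelvin_b

print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

The ratio in Celsius is 2, while the same two readings in Kelvin differ by only a few percent, which is why ratios should be computed only on ratio attributes.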
Record Data
• Much data mining work assumes that the data set is a
collection of records (data objects), each of which consists
of a fixed set of data fields (attributes).
Categorical Data
• Data that consists of a collection of records, each
of which consists of a fixed set of categorical
attributes
Tid   Refund   Marital Status   Taxable Income   Cheat
Document Data
• Each document becomes a 'term' vector:
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
• Bag-of-words representation – no ordering
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
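The bag-of-words idea can be sketched in a few lines. The vocabulary, the sample sentence, and the helper name `term_vector` below are assumptions chosen for illustration. Splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

def term_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary term occurs in the text (word order is ignored)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
document = "The team lost the game but the coach praised the team"

vector = term_vector(document, vocabulary)
print(vector)
```

Reordering the words of `document` would leave `vector` unchanged, which is exactly the "no ordering" property of the representation.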
Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
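One natural way to hold such data in code is a set per transaction. The sketch below stores the five transactions from the table and counts how often items occur. The helper name `support` is an assumption introduced here for illustration.

```python
from collections import Counter

# The five transactions from the table above, each stored as a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

def support(itemset: set[str]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(item_counts["Milk"])          # number of transactions containing Milk
print(support({"Diaper", "Milk"}))  # fraction containing both Diaper and Milk
```

Sets fit here because a transaction has no item order and no fixed number of fields, unlike record data.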
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Time series
• Sequence of ordered (over “time”) numeric values.
Graph Data
• Examples: Web graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
So, what is Data?
• Collection of data objects and their attributes
Tid   Refund   Marital Status   Taxable Income   Cheat
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to
one another.
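The definition can be checked numerically on a toy example. The sketch below assumes Euclidean distance as the (dis)similarity measure and uses made-up 2-D points with a given cluster assignment. It verifies the definition rather than finding clusters.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points and cluster labels, chosen only for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = [0, 0, 0, 1, 1, 1]

within, between = [], []
for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
    # Smaller Euclidean distance means the two points are more similar.
    (within if lp == lq else between).append(dist(p, q))

avg_within = sum(within) / len(within)
avg_between = sum(between) / len(between)
print(avg_within < avg_between)
```

A good clustering keeps the average within-cluster distance well below the average between-cluster distance. A clustering algorithm searches for an assignment with that property.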
Clustering: Application
• Document Clustering:
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
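The similarity step of this approach might look like the sketch below. The lecture does not name a specific measure, so cosine similarity is an assumption here. The vectors are the term-frequency rows from the Document Data table earlier in the lecture. Grouping the documents with these scores is left to a clustering algorithm.

```python
from itertools import combinations
from math import sqrt

def cosine_similarity(a: list[int], b: list[int]) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term-frequency rows taken from the Document Data table earlier in the lecture.
documents = {
    "Document 1": [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    "Document 2": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document 3": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}

# Score every pair of documents and pick the most similar pair.
similarities = {
    (a, b): cosine_similarity(documents[a], documents[b])
    for a, b in combinations(documents, 2)
}
most_similar_pair = max(similarities, key=similarities.get)
print(most_similar_pair)
```

Cosine similarity ignores document length and depends only on the relative term frequencies, which is one reason it is a common choice for text.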
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function
of the values of the other attributes.
Classification Example
Training Set
Tid   Refund   Marital Status   Taxable Income   Cheat
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Figure: Training Set → Learn Classifier → Model; a second table with the same attributes (Refund, Marital Status, Taxable Income, Cheat) is shown beside the training set.
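To make "learn a model from the training set" concrete, here is a minimal sketch using training records Tid 7 to 10 from the example. It uses a one-attribute rule learner (often called OneR), which is a stand-in chosen for brevity and not the classifier used in the lecture. The function name `learn_one_rule` and the unseen record are assumptions, and the numeric Taxable Income attribute is left out to keep the sketch short.

```python
from collections import Counter, defaultdict

# Training records Tid 7 to 10 from the example above (categorical attributes only).
training = [
    {"Refund": "Yes", "Marital Status": "Divorced", "Cheat": "No"},
    {"Refund": "No", "Marital Status": "Single", "Cheat": "Yes"},
    {"Refund": "No", "Marital Status": "Married", "Cheat": "No"},
    {"Refund": "No", "Marital Status": "Single", "Cheat": "Yes"},
]

def learn_one_rule(records, attributes, class_name):
    """Pick the single attribute whose majority-class rule makes the fewest training errors."""
    best = None
    for attribute in attributes:
        by_value = defaultdict(Counter)
        for record in records:
            by_value[record[attribute]][record[class_name]] += 1
        # For each attribute value, predict the class seen most often with it.
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(rule[r[attribute]] != r[class_name] for r in records)
        if best is None or errors < best[2]:
            best = (attribute, rule, errors)
    return best

attribute, rule, errors = learn_one_rule(training, ["Refund", "Marital Status"], "Cheat")

# Apply the learned model to a hypothetical unseen record.
new_record = {"Refund": "No", "Marital Status": "Single"}
prediction = rule.get(new_record[attribute])
print(attribute, prediction)
```

The learned `rule` is the model: a function from attribute values to the class. With only four records it fits the training data perfectly, which says little about how it would perform on new records.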
Classification: Application 1
• Ad Click Prediction
• Goal: Predict whether a user who visits a web page will click
on a displayed ad. Use the prediction to target users with high
click probability.
• Approach:
• Collect data for users over a period of time and record who
clicks and who does not. The {click, no click} information
forms the class attribute.
• Use the history of the user (web pages browsed, queries
issued) as the features.
• Learn a classifier model and test on new users.
Figure: Data Mining draws on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, and Other Disciplines.
Figure: Data → Preprocessing → Data Mining → Post-processing → Result
Sampling
• Sampling is the main technique employed for data
selection.
• It is often used for both the preliminary investigation of the data and
the final data analysis.
Sampling …
• The key principle for effective sampling is the
following:
• Using a sample will work almost as well as using the
entire data set, if the sample is representative.
Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
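Both schemes can be sketched with the standard `random` module. The function names, the fixed seed, and the toy data set below are assumptions for illustration. Drawing the same fraction from every partition is one possible design choice, since the lecture only says to draw random samples from each partition.

```python
import random
from collections import defaultdict

def simple_random_sample(items, k, seed=0):
    """Draw k items without replacement; every item is equally likely to be picked."""
    return random.Random(seed).sample(items, k)

def stratified_sample(items, key, fraction, seed=0):
    """Partition items by key(item), then draw the same fraction at random from each partition."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for item in items:
        partitions[key(item)].append(item)
    sample = []
    for group in partitions.values():
        k = max(1, round(len(group) * fraction))  # keep at least one item per partition
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data set: 90 records of class "No" and 10 of class "Yes".
records = [(i, "No") for i in range(90)] + [(i, "Yes") for i in range(90, 100)]

srs = simple_random_sample(records, 10)
strat = stratified_sample(records, key=lambda r: r[1], fraction=0.1)
print(len(srs), sum(1 for r in strat if r[1] == "Yes"))
```

With a rare class, a simple random sample of 10 may contain no "Yes" records at all, whereas the stratified sample always includes at least one from each partition.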
Sample Size