Data Mining: Class 1
Introduction
Course Outline:
• Introduction: KDD Process
• Data Preprocessing
• Classification
• Clustering
Data Mining
UNIT 1
Data: types of data; data mining functionalities; interestingness of patterns; classification of data mining systems; data mining task primitives; integration of a data mining system with a data warehouse; major issues in data mining; data preprocessing
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Data Mining: Confluence of Multiple Disciplines
• Data mining draws on database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines
Why Not Traditional Data Analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle, for example, terabytes of data
• High-dimensionality of data
– Microarray data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Data Mining Functionalities
• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet
regions
• Frequent patterns, association, correlation vs. causality
– Tea → Sugar [support = 0.5%, confidence = 75%] (Correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
– Predict some unknown or missing numerical values
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster houses to find
distribution patterns
– Maximizing intra-class similarity and minimizing inter-class similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general behavior of the data
– Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory card
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Architecture: Typical Data Mining System
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
KDD Process: Summary
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
End of Introduction
What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
• Each data object (row) is also called an instance; for example:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
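A minimal sketch (assuming pandas is installed) of the same objects-and-attributes table as a DataFrame, where rows are objects and columns are attributes:

import pandas as pd

# The sample data set above: rows are objects (instances),
# columns are attributes.
df = pd.DataFrame({
    "Tid": [1, 2, 3],
    "Refund": ["Yes", "No", "No"],
    "Marital Status": ["Single", "Married", "Single"],
    "Taxable Income": [125_000, 100_000, 70_000],
    "Cheat": ["No", "No", "No"],
})
print(df)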
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it
possesses:
– Distinctness: = and ≠
– Order: <, ≤, >, ≥
– Addition: + and -
– Multiplication: × and /
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes (e.g., the Tid / Refund / Marital Status / Taxable Income / Cheat table shown earlier)
Document Data
• Each document becomes a term vector: each term is an attribute, and the value of each attribute is the number of times the term appears in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0     2      6     0    2      0        2
Document 2    0     7     0     2     1      0     0    3      0        0
Document 3    0     1     0     0     1      2     2    0      3        0
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction, while
the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
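A minimal sketch of how these transactions might be held in plain Python, with each transaction a set of items; the item frequencies shown are an assumption of how one might inspect the data:

from collections import Counter

# The grocery-store transactions above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# How often each item occurs across all transactions.
item_counts = Counter(item for t in transactions for item in t)
print(item_counts.most_common(3))  # e.g. [('Milk', 4), ...]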
Graph Data
• Examples: Facebook graph and HTML Links
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data
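A minimal sketch (assuming pandas and NumPy) of how the three problems above can be detected; the data values here are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical records exhibiting the quality problems listed above.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Cara"],
    "age":  [34.0, 34.0, np.nan, 290.0],  # a missing value and an outlier
})

print(df.isna().sum())        # missing values per attribute
print(df.duplicated().sum())  # number of exact duplicate rows
print(df[df["age"] > 120])    # implausible values flagged as outliers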
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone connection, or “snow” on a television screen
Duplicate Data
• Example: the same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
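A minimal sketch (assuming pandas) of a change-of-scale aggregation; the city and sales values are hypothetical:

import pandas as pd

# Hypothetical city-level sales aggregated up to regions.
sales = pd.DataFrame({
    "city":   ["Delhi", "Agra", "Mumbai", "Pune"],
    "region": ["North", "North", "West", "West"],
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Aggregation reduces four city records to two region records.
by_region = sales.groupby("region", as_index=False)["amount"].sum()
print(by_region)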
Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the
final data analysis.
• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition
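A sketch of stratified sampling under the assumption that the data sits in a pandas DataFrame; the function name and column names are illustrative, not from the source:

import pandas as pd

def stratified_sample(df: pd.DataFrame, strata_col: str,
                      frac: float, seed: int = 0) -> pd.DataFrame:
    """Split the data by strata_col, then draw the same random
    fraction from each partition (stratum)."""
    return (df.groupby(strata_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# e.g. a 10% sample that preserves class proportions:
# sample = stratified_sample(df, "class", frac=0.1)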
Curse of Dimensionality
• Dimensionality-reduction techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
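A minimal PCA sketch (assuming scikit-learn and NumPy are available); the data here is random and purely illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 objects in 50 dimensions

pca = PCA(n_components=2)             # keep the 2 leading principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured per component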
Discretization
• Discretization converts a continuous attribute into a categorical (interval-based) one
[Figure: scatter plots showing similarity values ranging from -1 to 1]
End of Data Preprocessing
Data Mining
Association Rules
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of
an item based on the occurrences of other items in the transaction
Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
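A minimal sketch of the two measures the brute-force approach must compute, using the market-basket transactions above; the function names are illustrative:

# Market-basket transactions from the earlier table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of lhs -> rhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Diaper", "Beer"}))       # 0.6
print(confidence({"Diaper"}, {"Beer"}))  # 0.75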
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
[Figure: itemset lattice over items A, B, C, D, E, from the null set up to ABCDE]
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database, matching each of the N transactions against each of the M candidate itemsets, where w is the maximum transaction width
Apriori Principle
• If an itemset is infrequent, then all of its supersets must also be infrequent
[Figure: lattice in which an itemset is found to be infrequent, so all of its supersets, up to ABCDE, are pruned]
Illustrating Apriori Principle
Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets):
Itemset               Count
{Bread,Milk,Diaper}   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are
frequent
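A compact sketch of this method in plain Python. As an implementation shortcut (my assumption, not the source's), it generates (k+1)-candidates from combinations of currently frequent items rather than the usual F_k × F_k join; both respect the pruning step above:

from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining following the method above.
    transactions: list of sets; minsup: absolute support count.
    Returns {frozenset: support count} for every frequent itemset."""
    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)

    k = 1
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Generate (k+1)-candidates; prune any whose k-subsets
        # are not all frequent (the Apriori principle).
        candidates = [
            frozenset(c) for c in combinations(items, k + 1)
            if all(frozenset(sub) in frequent for sub in combinations(c, k))
        ]
        # Count support by scanning the DB; eliminate infrequent candidates.
        frequent = {}
        for cand in candidates:
            count = sum(cand <= t for t in transactions)
            if count >= minsup:
                frequent[cand] = count
        result.update(frequent)
        k += 1
    return result

# e.g., with the transactions from the earlier sketch:
# freq = apriori(transactions, minsup=3)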
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may
also increase
• Size of database
– Apriori makes multiple passes over the data, so the run time of the algorithm increases with the number of transactions
• Average transaction width
– This may increase max length of frequent itemsets and traversals of hash tree
(number of subsets in a transaction increases with its width)
Rule Generation
• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)
[Figure: lattice of candidate rules from the frequent itemset ABCD; once a rule is found to have low confidence, the rules below it, such as D → ABC, C → ABD, B → ACD, and A → BCD, are pruned]
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
– Example: joining CD → AB and BD → AC produces the candidate rule D → ABC
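A sketch of rule generation from a single frequent itemset. Note this enumerates every possible antecedent rather than using the consequent-merging optimization described above; the function name is illustrative:

from itertools import combinations

def rules_from_itemset(itemset, support_counts, minconf):
    """Enumerate every rule lhs -> rest from one frequent itemset and
    keep those meeting minconf. support_counts maps frozenset -> support
    count (e.g., the dictionary returned by the apriori sketch above)."""
    itemset = frozenset(itemset)
    kept = []
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = support_counts[itemset] / support_counts[lhs]
            if conf >= minconf:
                kept.append((set(lhs), set(itemset - lhs), conf))
    return kept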
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:
        Y     ¬Y
X       f11   f10   f1+
¬X      f01   f00   f0+
        f+1   f+0   |T|
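As one illustration, a minimal sketch of lift (also called interest factor), a common objective measure computed from these contingency counts; the example counts are made up:

def lift(f11, f10, f01, f00):
    """Lift of X -> Y from the contingency counts above."""
    n = f11 + f10 + f01 + f00      # |T|
    p_xy = f11 / n                 # P(X, Y)
    p_x = (f11 + f10) / n          # P(X) = f1+ / |T|
    p_y = (f11 + f01) / n          # P(Y) = f+1 / |T|
    return p_xy / (p_x * p_y)      # 1 = independent; > 1 positively correlated

print(lift(f11=60, f10=10, f01=10, f00=20))  # ~1.22 for these made-up counts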
• Subjective measure:
– Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
• A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)
– Patterns that agree with the user's expectations are expected patterns; patterns that contradict them are unexpected patterns
• Need to combine expectation of users with evidence from data (i.e., extracted patterns)
End of Association Rules