Data Mining: Class 1
Introduction
Course Outline:
• Introduction: KDD Process
• Data Preprocessing
• Classification
• Clustering
Data Mining
UNIT 1
Data: types of data; data mining functionalities; interestingness of patterns; classification of data mining systems; data mining task primitives; integration of a data mining system with a data warehouse; major issues in data mining; data preprocessing
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Data Mining: Confluence of Multiple Disciplines
• Data mining draws on database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines
Why Not Traditional Data Analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle, for example, terabytes of data
• High-dimensionality of data
– Microarray data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Data Mining Functionalities
• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet
regions
• Frequent patterns, association, correlation vs. causality
– Tea → Sugar [support = 0.5%, confidence = 75%] (Correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
– Predict some unknown or missing numerical values
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster houses to find
distribution patterns
– Maximizing intra-class similarity and minimizing inter-class similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general behavior of the data
– Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory card
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Architecture: Typical Data Mining System
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
KDD Process: Summary
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
End of Introduction
What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
• Each data object (row) is also called an instance; for example:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
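A minimal sketch (assuming pandas is installed) of the same objects-and-attributes table as a DataFrame, where rows are objects and columns are attributes:

import pandas as pd

# The sample data set above: rows are objects (instances),
# columns are attributes.
df = pd.DataFrame({
    "Tid": [1, 2, 3],
    "Refund": ["Yes", "No", "No"],
    "Marital Status": ["Single", "Married", "Single"],
    "Taxable Income": [125_000, 100_000, 70_000],
    "Cheat": ["No", "No", "No"],
})
print(df)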
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it
possesses:
– Distinctness: = and ≠
– Order: <, ≤, >, ≥
– Addition: + and -
– Multiplication: × and /
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes (e.g., the Tid / Refund / Marital Status / Taxable Income / Cheat table shown earlier)
Document Data
• Each document becomes a term vector: each term is an attribute, and the value of each attribute is the number of times the term appears in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0     5     0     2      6     0    2      0        2
Document 2    0     7     0     2     1      0     0    3      0        0
Document 3    0     1     0     0     1      2     2    0      3        0
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction, while
the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
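A minimal sketch of how these transactions might be held in plain Python, with each transaction a set of items; the item frequencies shown are an assumption of how one might inspect the data:

from collections import Counter

# The grocery-store transactions above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# How often each item occurs across all transactions.
item_counts = Counter(item for t in transactions for item in t)
print(item_counts.most_common(3))  # e.g. [('Milk', 4), ...]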
Graph Data
• Examples: Facebook graph and HTML Links
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data
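A minimal sketch (assuming pandas and NumPy) of how the three problems above can be detected; the data values here are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical records exhibiting the quality problems listed above.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Cara"],
    "age":  [34.0, 34.0, np.nan, 290.0],  # a missing value and an outlier
})

print(df.isna().sum())        # missing values per attribute
print(df.duplicated().sum())  # number of exact duplicate rows
print(df[df["age"] > 120])    # implausible values flagged as outliers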
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone connection, or “snow” on a television screen
Duplicate Data
• Example: the same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a single
attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
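A minimal sketch (assuming pandas) of a change-of-scale aggregation; the city and sales values are hypothetical:

import pandas as pd

# Hypothetical city-level sales aggregated up to regions.
sales = pd.DataFrame({
    "city":   ["Delhi", "Agra", "Mumbai", "Pune"],
    "region": ["North", "North", "West", "West"],
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Aggregation reduces four city records to two region records.
by_region = sales.groupby("region", as_index=False)["amount"].sum()
print(by_region)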
Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the
final data analysis.
• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition
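A sketch of stratified sampling under the assumption that the data sits in a pandas DataFrame; the function name and column names are illustrative, not from the source:

import pandas as pd

def stratified_sample(df: pd.DataFrame, strata_col: str,
                      frac: float, seed: int = 0) -> pd.DataFrame:
    """Split the data by strata_col, then draw the same random
    fraction from each partition (stratum)."""
    return (df.groupby(strata_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# e.g. a 10% sample that preserves class proportions:
# sample = stratified_sample(df, "class", frac=0.1)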
Curse of Dimensionality
• Dimensionality-reduction techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
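A minimal PCA sketch (assuming scikit-learn and NumPy are available); the data here is random and purely illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 objects in 50 dimensions

pca = PCA(n_components=2)             # keep the 2 leading principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured per component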
Discretization
• Discretization converts a continuous attribute into a categorical (interval-based) one
[Figure: scatter plots showing similarity values ranging from -1 to 1]
End of Data Preprocessing
Data Mining
Association Rules
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of
an item based on the occurrences of other items in the transaction
Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
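A minimal sketch of the two measures the brute-force approach must compute, using the market-basket transactions above; the function names are illustrative:

# Market-basket transactions from the earlier table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of lhs -> rhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Diaper", "Beer"}))       # 0.6
print(confidence({"Diaper"}, {"Beer"}))  # 0.75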
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
[Figure: itemset lattice over items A, B, C, D, E, from the null set up to ABCDE]
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database, matching each of the N transactions against each of the M candidate itemsets, where w is the maximum transaction width
Apriori Principle
• If an itemset is infrequent, then all of its supersets must also be infrequent
[Figure: lattice in which an itemset is found to be infrequent, so all of its supersets, up to ABCDE, are pruned]
Illustrating Apriori Principle
Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets):
Itemset               Count
{Bread,Milk,Diaper}   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are
frequent
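A compact sketch of this method in plain Python. As an implementation shortcut (my assumption, not the source's), it generates (k+1)-candidates from combinations of currently frequent items rather than the usual F_k × F_k join; both respect the pruning step above:

from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining following the method above.
    transactions: list of sets; minsup: absolute support count.
    Returns {frozenset: support count} for every frequent itemset."""
    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)

    k = 1
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Generate (k+1)-candidates; prune any whose k-subsets
        # are not all frequent (the Apriori principle).
        candidates = [
            frozenset(c) for c in combinations(items, k + 1)
            if all(frozenset(sub) in frequent for sub in combinations(c, k))
        ]
        # Count support by scanning the DB; eliminate infrequent candidates.
        frequent = {}
        for cand in candidates:
            count = sum(cand <= t for t in transactions)
            if count >= minsup:
                frequent[cand] = count
        result.update(frequent)
        k += 1
    return result

# e.g., with the transactions from the earlier sketch:
# freq = apriori(transactions, minsup=3)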
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may
also increase
• Size of database
– Apriori makes multiple passes over the data, so the run time of the algorithm increases with the number of transactions
• Average transaction width
– This may increase max length of frequent itemsets and traversals of hash tree
(number of subsets in a transaction increases with its width)
Rule Generation
• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)
[Figure: lattice of candidate rules from the frequent itemset ABCD; once a rule is found to have low confidence, the rules below it, such as D → ABC, C → ABD, B → ACD, and A → BCD, are pruned]
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
– Example: joining CD → AB and BD → AC produces the candidate rule D → ABC
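A sketch of rule generation from a single frequent itemset. Note this enumerates every possible antecedent rather than using the consequent-merging optimization described above; the function name is illustrative:

from itertools import combinations

def rules_from_itemset(itemset, support_counts, minconf):
    """Enumerate every rule lhs -> rest from one frequent itemset and
    keep those meeting minconf. support_counts maps frozenset -> support
    count (e.g., the dictionary returned by the apriori sketch above)."""
    itemset = frozenset(itemset)
    kept = []
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = support_counts[itemset] / support_counts[lhs]
            if conf >= minconf:
                kept.append((set(lhs), set(itemset - lhs), conf))
    return kept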
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:
        Y     ¬Y
X       f11   f10   f1+
¬X      f01   f00   f0+
        f+1   f+0   |T|
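As one illustration, a minimal sketch of lift (also called interest factor), a common objective measure computed from these contingency counts; the example counts are made up:

def lift(f11, f10, f01, f00):
    """Lift of X -> Y from the contingency counts above."""
    n = f11 + f10 + f01 + f00      # |T|
    p_xy = f11 / n                 # P(X, Y)
    p_x = (f11 + f10) / n          # P(X) = f1+ / |T|
    p_y = (f11 + f01) / n          # P(Y) = f+1 / |T|
    return p_xy / (p_x * p_y)      # 1 = independent; > 1 positively correlated

print(lift(f11=60, f10=10, f01=10, f00=20))  # ~1.22 for these made-up counts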
• Subjective measure:
– Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
• A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)
– Patterns that agree with the user's expectations are expected patterns; patterns that contradict them are unexpected patterns
• Need to combine expectation of users with evidence from data (i.e., extracted patterns)
End of Association Rules