
INTRODUCTION TO DATA MINING

AND ASSOCIATION MINING


Unit-2
Topics:
Data mining
functionalities
Major issues
Efficient and scalable frequent item set mining
methods
Mining various kinds of association rules
Association mining to correlation analysis
Constraint based association mining.
Data Mining
• Data mining is the process of extracting valuable
information from huge sets of data.
• Data mining is also called Knowledge Discovery
in Databases (KDD).
• It is one of the most useful techniques that helps
entrepreneurs, researchers, and individuals extract
valuable information from huge sets of data.
Data mining can be performed on the
following types of data:
• Relational Database
• Data warehouses
• Data Repositories
• Object-Relational Database
• Transactional Database
Data Mining Applications
Example:
Data mining in Education
• An institution can use data mining to make precise
decisions and to predict students' results.
With these results, the institution can concentrate on
what to teach and how to teach.
Data Mining in Financial Banking
• Data mining techniques can help bankers solve
business-related problems in banking and finance by
identifying trends, causalities, and correlations in
business information and market costs that are not
immediately evident to managers or executives,
because the data volume is too large or the data is
produced too rapidly to be screened by experts.
Major issues (or)Challenges of
Implementation in Data mining
• Incomplete and noisy data
• Data Distribution
• Complex Data
• Performance
• Data Privacy and Security
• Data Visualization
Data Mining Architecture
Database or Data Warehouse Server:
• The database or data warehouse server consists
of the original data that is ready to be processed.
Data Mining Engine:
• It contains several modules for operating data
mining tasks, including
association,
characterization,
classification,
clustering,
prediction,
time-series analysis, etc.
Pattern Evaluation Module:
• The pattern evaluation module is primarily responsible for
measuring how interesting a pattern is, using a threshold value.
• It collaborates with the data mining engine to focus the search on
interesting patterns.
Graphical User Interface:
• The graphical user interface (GUI) module communicates between
the data mining system and the user.
• This module helps the user to easily and efficiently use the system
without knowing the complexity of the process.
Knowledge Base:
• It helps to guide the search and to evaluate the interestingness of
the resulting patterns.
• The knowledge base may even contain user views and data from
user experiences that might be helpful in the data mining process.
• The data mining engine may receive inputs from the knowledge
base to make the result more accurate and reliable.
Data Preprocessing
• Data preprocessing is the process of
transforming raw data into an understandable
format.
• It is an important step in data mining, as
we cannot work directly with raw data.
• The data is acquired from Excel files,
databases, text files, and unorganized
data such as audio clips, images, GPRS,
and video clips.
Major Tasks in Data Preprocessing:
• Data cleaning
• Data integration
• Data reduction
• Data transformation
• Discretization and concept hierarchy
generation
Data cleaning:
• Data cleaning is the process of removing incorrect,
incomplete, and inaccurate data from the datasets;
it also replaces missing values.
There are some techniques in data cleaning:
Missing data
There are a number of ways to correct for missing data, but
the two most common are:
• Ignore the tuples: A tuple is an ordered list or sequence of
numbers or entities. If multiple values are missing within
tuples, you may simply discard the tuples with that missing
information. This is only recommended for large data sets,
when a few ignored tuples won’t harm further analysis.
• Manually fill in missing data: This can be tedious, but is
definitely necessary when working with smaller data sets.
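A minimal pandas sketch of both options, using a tiny made-up DataFrame (the column names and values are only illustrative):

import pandas as pd
import numpy as np

# Illustrative data with missing values (NaN)
df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "income": [50000, 62000, np.nan, 48000]})

# Option 1: ignore (drop) the tuples that contain missing values
df_dropped = df.dropna()

# Option 2: fill in the missing values, e.g. with each column's mean
df_filled = df.fillna({"age": df["age"].mean(), "income": df["income"].mean()})

print(df_dropped)
print(df_filled)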
Noisy data:
• Unnecessary data points, irrelevant data, and data that is more
difficult to group together.
Binning:
This method is used to smooth or handle noisy data. First the data is
sorted, and then the sorted values are separated and stored in the form of
bins. There are three methods for smoothing the data in a bin (see the
sketch below):
i) Smoothing by bin means: the values in the bin are replaced by the mean
value of the bin;
ii) Smoothing by bin medians: the values in the bin are replaced by the
median value of the bin;
iii) Smoothing by bin boundaries: the minimum and maximum values of the bin
are taken as the bin boundaries, and each value is replaced by the closest
boundary value.
Regression: This is used to smooth the data and helps to handle unnecessary
data. Regression analysis also helps to decide which variables are suitable
for the analysis.
Clustering: This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
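A short Python sketch of smoothing by bin means, assuming equal-frequency bins of size 3 on a small illustrative list of values:

# Smoothing by bin means (illustrative values, equal-frequency bins of size 3)
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 3
smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([mean] * len(bin_vals))   # replace each value by its bin mean
print(smoothed)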
If you’re working with text data, for example,
some things you should consider when
cleaning your data are:
• Remove URLs, symbols, emojis, etc., that
aren’t relevant to your analysis
• Translate all text into the language you’ll be
working in
• Remove HTML tags
• Remove unnecessary blank text between
words
• Remove duplicate data
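A small regex-based sketch covering a few of these steps (URL, HTML-tag, symbol, and blank-space removal); the patterns are simple illustrations, not a complete cleaner:

import re

def clean_text(text):
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"[^\w\s.,!?]", " ", text)    # remove symbols/emojis, keep basic punctuation
    text = re.sub(r"\s+", " ", text).strip()    # remove unnecessary blank text
    return text

print(clean_text("<p>Visit https://example.com now!!</p>"))   # -> "Visit now!!"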
Example:
• If you need to predict whether a person can
drive, information about their hair color,
height, or weight will be irrelevant.
Data Integration
• Data Integration is one of the data preprocessing steps
that are used to merge the data present in multiple
sources into a single larger data store like a data
warehouse.
We might run into some issues while adopting Data
Integration as one of the Data Preprocessing steps:
• Schema integration and object matching: the same data
can be present in different formats and under different
attribute names, which causes difficulty in data integration.
• Removing redundant attributes from all data
sources.
• Detection and resolution of data value conflicts.
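A minimal pandas sketch of integrating two hypothetical sources whose key attribute is named differently (the tables and column names are assumptions for illustration):

import pandas as pd

# Two hypothetical sources with differently named key attributes
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Mena"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [250, 90, 400]})

# Resolve the schema mismatch explicitly, then merge into a single store
integrated = customers.merge(orders, left_on="cust_id", right_on="customer_id", how="left")
print(integrated)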
Data Transformation
• Once data cleaning has been done, we need to
consolidate the quality data into alternate
forms by changing the value, structure, or
format of the data.
• Data Transformation strategies.
Generalization
Normalization
Attribute Selection
Aggregation
Generalization
• Low-level or granular data is converted into
high-level information by using concept hierarchies.
• We can transform primitive data in an address, such as the city,
into higher-level information such as the country.
For example:
• age data can be in the form of (20, 30) in a dataset.
• It can be transformed to a higher conceptual level, into categorical
values such as (young, old).
Normalization:
• It is done to scale the data values in a specified range (-1.0 to 1.0
or 0.0 to 1.0)
• Normalization can be done in multiple ways, which are highlighted
here:
Min-max normalization
Z-Score normalization
Decimal scaling normalization
Min-max normalization:
• Min-max normalization maps a value v of an attribute A to v' in
a new, smaller range [new_minA, new_maxA].
• The formula for min-max normalization is:
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
For example,
• suppose $12,000 and $98,000 are the minimum and
maximum values for the attribute income, and [0.0,
1.0] is the range to which we have to map the value
$73,600.
• The value $73,600 is transformed using
min-max normalization as follows:
v' = (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716
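A small Python sketch of the same computation (the function name is just for illustration):

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example: min = 12,000, max = 98,000, target range [0.0, 1.0]
print(round(min_max_normalize(73600, 12000, 98000), 3))   # 0.716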
Attribute Selection
• New properties of data are created from
existing attributes to help in the data mining
process.
• For example, the date-of-birth attribute can
be transformed into another property such as
is_senior_citizen for each tuple, which can
directly influence predicting diseases or
chances of survival, etc.
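A sketch of deriving such a property; the 60-year cutoff, the fixed reference date, and the attribute name are assumptions made only for this illustration:

from datetime import date

def is_senior_citizen(date_of_birth, on_date=date(2024, 1, 1), cutoff_age=60):
    # Compute age in whole years, then compare against the assumed cutoff
    age = on_date.year - date_of_birth.year - (
        (on_date.month, on_date.day) < (date_of_birth.month, date_of_birth.day))
    return age >= cutoff_age

print(is_senior_citizen(date(1950, 6, 15)))   # True
print(is_senior_citizen(date(1990, 6, 15)))   # False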
Aggregation
• It is a method of storing and presenting data
in a summary format.
• For example:
• we have a data set of sales reports of an
enterprise that has quarterly sales of each
year. We can aggregate the data to get the
enterprise's annual sales report.
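A brief pandas sketch that aggregates hypothetical quarterly sales into an annual summary (all figures are made up):

import pandas as pd

quarterly = pd.DataFrame({
    "year": [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales": [120, 150, 130, 170, 140, 160, 155, 180],
})

annual = quarterly.groupby("year")["sales"].sum()   # one summary row per year
print(annual)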
Discretization and concept hierarchy
generation
• Data discretization refers to a method of
converting a huge number of data values into
smaller ones so that the evaluation and
management of data become easy
Example:
• Suppose we have an attribute of Age with the
given values
Table before Discretization
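A small sketch that discretizes an assumed list of Age values into interval labels; the ages, cut points, and labels are illustrative only:

ages = [12, 17, 25, 33, 41, 58, 63, 70]   # assumed values

def discretize_age(age):
    if age < 18:
        return "child"
    elif age < 40:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

print([(a, discretize_age(a)) for a in ages])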
Concept hierarchy generation:
• A concept hierarchy refers to a sequence of
mappings from a set of low-level, specific concepts
to more general, higher-level concepts.
• In other words, mapping is done from low-level
concepts to high-level concepts.
• There are two types of hierarchy: top-down
mapping and the second one is bottom-up
mapping.
Top-down mapping
• Top-down mapping generally starts at the
top with some general information and ends
at the bottom with the specialized
information.
Bottom-up mapping
• Bottom-up mapping generally starts at the
bottom with some specialized information and
ends at the top with the generalized
information.
• Let's understand this concept hierarchy for
the dimension location with the help of an
example.
• A particular city can map with the belonging
country. For example, New Delhi can be
mapped to India, and India can be mapped to
Asia.
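A tiny sketch of this bottom-up mapping for the location dimension, using small illustrative dictionaries:

city_to_country = {"New Delhi": "India", "Tokyo": "Japan"}
country_to_continent = {"India": "Asia", "Japan": "Asia"}

def generalize(city):
    country = city_to_country[city]             # low-level -> higher-level
    continent = country_to_continent[country]   # higher-level -> most general
    return city, country, continent

print(generalize("New Delhi"))   # ('New Delhi', 'India', 'Asia')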
Data Reduction:
• Data reduction is a process that reduces the
volume of the original data.
• Data reduction techniques ensure the
integrity of the data while reducing it.
• When we collect data from different data warehouses
for analysis, the result is a huge amount of data,
which is difficult for a data analyst to deal with.
• Data reduction increases the efficiency of data
mining.
Data Reduction Techniques
• Dimensionality reduction
• Numerosity reduction
• Data compression.
Dimensionality reduction:
• Dimensionality reduction eliminates the
attributes from the data set.
Three methods of dimensionality reduction.
a. Wavelet Transform
b. Principal Component Analysis
c. Attribute Subset Selection
a. Wavelet Transform
• In the wavelet transform, a data vector X is transformed into a
numerically different data vector X’ such that both X and X’
vectors are of the same length.
• The data obtained from the wavelet transform can be truncated.
• The compressed data is obtained by retaining only a small
fraction of the strongest wavelet coefficients.
b. Principal Component Analysis
• Suppose the data set to be analyzed has tuples
with n attributes. Principal component analysis searches
for k n-dimensional orthogonal vectors (k ≤ n) that can best
be used to represent the data set.
c. Attribute Subset Selection
• The large data set has many attributes some of which are
irrelevant to data mining or some are redundant.
• The attribute subset selection reduces the volume of data by
eliminating the redundant and irrelevant attribute.
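A brief scikit-learn sketch of principal component analysis on a small made-up numeric data set; the choice of k = 2 components is arbitrary here:

import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 6 tuples with 3 attributes
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

pca = PCA(n_components=2)          # keep k = 2 principal components
X_reduced = pca.fit_transform(X)   # each tuple is now described by 2 values
print(X_reduced.shape)             # (6, 2)
print(pca.explained_variance_ratio_)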
Numerosity reduction :
• The numerosity reduction reduces the volume of
the original data and represents it in a much
smaller form.
This technique includes two types
• Parametric
For parametric methods a model is used to
estimate the data, so that typically only the data
parameters need to be stored, instead of the
actual data.
• Non-parametric
Non parametric methods for storing data include
histograms, clustering and sampling.
Histograms:
• A histogram is a ‘graph’ that represents a frequency
distribution, which describes how often a value
appears in the data.
• A histogram uses the binning method to
represent the data distribution of an attribute.
• It uses disjoint subsets, which we call bins or
buckets.
We have data for AllElectronics data set, which
contains prices for regularly sold items.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30.
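A short sketch that bins the AllElectronics prices above into equal-width buckets; the bucket width of 10 is an assumption for illustration:

from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
          18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
          25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
buckets = Counter((p // width) * width for p in prices)   # disjoint equal-width buckets
for start in sorted(buckets):
    print(f"{start}-{start + width - 1}: {buckets[start]} items")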
Clustering
• Clustering techniques group similar objects
from the data in such a way that the objects in a
cluster are similar to each other but
dissimilar to objects in other clusters.
• How similar the objects inside a cluster are
can be calculated by using a distance function.
• The quality of a cluster depends on the diameter of
the cluster, i.e. the maximum distance between any
two objects in the cluster.
• This technique is more effective if the present
data can be organized into distinct clusters.
Sampling:
• Sampling can be used as a data reduction
approach because it enables a huge data set
to be defined by a much smaller random
sample or a subset of the information.
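A minimal sketch of simple random sampling without replacement; the data and sample size are arbitrary:

import random

random.seed(42)                       # fixed seed so the illustration is reproducible
data = list(range(1, 101))            # stands in for a huge data set
sample = random.sample(data, k=10)    # much smaller random subset
print(sample)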
3. Data Compression
• Data compression is a technique where the data
transformation technique is applied to the original
data in order to obtain compressed data.
• If the compressed data can again be
reconstructed to form the original data without
losing any information then it is a ‘lossless’ data
reduction.
• If you are unable to reconstruct the original data
from the compressed one, then your data
reduction is ‘lossy’. Dimensionality and
numerosity reduction methods are also used for
data compression.
Efficient and scalable frequent item set mining
methods
MINING FREQUENT PATTERNS
• Frequent patterns are patterns (e.g.,itemsets,
subsequences) that appear frequently in a data
set.
• For example, a set of items, such as milk and
bread that appear frequently together in a
transaction data set is a frequent itemset.
• A subsequence, such as buying first a PC, then a
digital camera, and then a memory card, if it
occurs frequently in a shopping history database,
is a (frequent) sequential pattern.
Applications
• Market Basket Analysis
• Telecommunication
• Credit Cards/ Banking Services
• Medical Treatments
• Basketball-Game Analysis
Market Basket Analysis
• Frequent item set mining leads to discovery of
associations and correlations among items in large
transactional or relational data sets.
• For example, if a customer buys bread, he is most
likely to also buy butter, eggs, or milk, so these
products are stored on the same shelf or mostly
nearby.
• It can help in many business decision making
processes.
• This process analyzes customer buying habits
• The discovery of such associations can help
retailers develop marketing strategies by gaining
insight into which items are frequently purchased
together by customers
Association rule mining
• finds interesting associations and relationships among
large sets of data items.
• An association rule shows how frequently an itemset occurs in a
transaction. A typical example is Market Basket
Analysis.
• Association rules are "if-then" statements, that help to
show the probability of relationships between data
items,
• An association rule has two parts: an antecedent (if)
and a consequent (then).
• An antecedent is an item found within the data.
• A consequent is an item found in combination with the
antecedent.
• Association rules are created by searching data for
frequent if-then patterns and using the
criteria support and confidence to identify the most
important relationships.
• Support is an indication of how frequently the items
appear in the data:
support(X => Y) = P(X ∪ Y), the fraction of transactions containing both X and Y.
• Confidence indicates how often the if-then
statement is found to be true:
confidence(X => Y) = P(Y | X) = support(X ∪ Y) / support(X).
• Product X => Product Y. This means that you obtain a
rule that tells you that if you buy product X, you are
also likely to buy product Y.
• Lift can be used to compare the confidence with the expected
confidence, i.e. how often the if-then statement would be
expected to be found true if X and Y were independent:
lift(X => Y) = confidence(X => Y) / support(Y).
It has three possible values:
• If Lift= 1: The probability of occurrence of
antecedent and consequent is independent of
each other.
• Lift>1: It determines the degree to which the
two itemsets are dependent on each other.
• Lift<1: It tells us that one item is a substitute
for other items, which means one item has a
negative effect on another.
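A small sketch that computes support, confidence, and lift for a hypothetical rule X => Y over a toy transaction list (items and transactions are made up):

transactions = [{"bread", "butter"}, {"bread", "milk"},
                {"bread", "butter", "milk"}, {"milk"}, {"bread", "butter"}]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"bread"}, {"butter"}
sup_xy = support(X | Y)       # P(X and Y)
conf = sup_xy / support(X)    # P(Y | X)
lift = conf / support(Y)      # confidence / expected confidence

print(round(sup_xy, 2), round(conf, 2), round(lift, 2))   # 0.6 0.75 1.25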
Efficient and scalable frequent item set mining
methods
1.Apriori Algorithm
2. F-P Growth Algorithm
Apriori algorithm:
• Apriori algorithm was the first algorithm that
was proposed for frequent itemset mining
• This algorithm uses two steps “join” and
“prune” to reduce the search space.
• It is an iterative approach to discover the most
frequent itemsets.
The steps followed in the Apriori Algorithm of
data mining are:
• Join Step: This step generates candidate (k+1)-itemsets
from the frequent k-itemsets by joining the set of
frequent k-itemsets with itself.
• Prune Step: This step scans the count of each
item in the database. If the candidate item
does not meet minimum support, then it is
regarded as infrequent and thus it is removed.
This step is performed to reduce the size of
the candidate itemsets.
The Apriori Algorithm: Pseudo Code
• Ck: candidate itemset of size k
• Lk: frequent itemset of size k
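A minimal Python sketch of the algorithm using this Ck/Lk notation; min_sup is an absolute count here, and the transactions at the bottom are only illustrative:

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # First pass: count 1-itemsets and keep those meeting min_sup (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {iset for iset, c in counts.items() if c >= min_sup}
    frequent = {iset: counts[iset] for iset in Lk}
    k = 2
    while Lk:
        # Join step: build candidate k-itemsets (Ck) from the frequent (k-1)-itemsets
        Lk_list = list(Lk)
        Ck = set()
        for i in range(len(Lk_list)):
            for j in range(i + 1, len(Lk_list)):
                union = Lk_list[i] | Lk_list[j]
                # Prune step: all (k-1)-subsets must be frequent (antimonotone property)
                if len(union) == k and all(frozenset(s) in Lk for s in combinations(union, k - 1)):
                    Ck.add(union)
        # Scan the database to count candidate support
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

# Illustrative transactions
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "butter", "milk"}, {"milk", "eggs"}]
print(apriori(transactions, min_sup=2))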
Steps In Apriori
1) In the first iteration of the algorithm, each item is
taken as a 1-itemsets candidate. The algorithm will
count the occurrences of each item.
2) Let there be some minimum support, min_sup ( eg
2).
• The set of 1 – itemsets whose occurrence is
satisfying the min sup are determined.
• Only those candidates which count more than or
equal to min_sup, are taken ahead for the next
iteration and the others are pruned.
3) Next, frequent 2-itemsets with min_sup are
discovered. For this, in the join step, candidate
2-itemsets are generated by forming groups of two,
i.e. by joining the frequent 1-itemsets with themselves.
4) The 2-itemset candidates are pruned using min-sup
threshold value. Now the table will have 2 –itemsets
with min-sup only.
5) The next iteration will form 3-itemsets using the join
and prune steps. This iteration uses the
antimonotone property: the subsets of a candidate
3-itemset, that is, its 2-itemset subsets, must all
satisfy min_sup. If all 2-itemset subsets are
frequent, the candidate is kept and its support is
counted; otherwise it is pruned.
6) Next step will follow making 4-itemset by joining
3-itemset with itself and pruning if its subset does
not meet the min_sup criteria. The algorithm is
stopped when the most frequent itemset is
achieved.
H.W
FP Growth Algorithm
• An efficient and scalable method to find frequent
patterns. It allows frequent itemset discovery
without candidate itemset generation.
Following are the steps for FP Growth Algorithm
• Scan DB once, find frequent 1-itemset (single item
pattern)
• Sort frequent items in frequency descending
order, f-list
• Scan DB again, construct FP-tree
• Construct the conditional FP tree in the sequence
of reverse order of F - List - generate frequent
item set
Illustration:
• Consider the transactions below, in which B =
Bread, J = Jelly, P = Peanut Butter, M = Milk
and E = Eggs. Given that the minimum threshold
support = 40% and the minimum threshold
confidence = 80% [13].
Step-1: Scan the DB once and find the frequent 1-itemsets
(single items in an itemset).
Step-2: As the minimum threshold support = 40%,
in this step we remove all the items
that are bought in less than 40% of the
transactions, i.e. with support less than 2.
Step-3: Create an F-list in which the frequent items
are sorted in descending order of their support.
Step-4: Sort the frequent items in each transaction
based on the F-list. This is also known as FPDP.
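A short sketch of steps 1 to 4 (counting 1-itemsets, building the F-list, and reordering transactions); the transaction list below is assumed for illustration, with the first three taken from the walkthrough that follows:

from collections import Counter

# Hypothetical transactions (B=Bread, J=Jelly, P=Peanut Butter, M=Milk, E=Eggs)
transactions = [{"B", "P"}, {"B", "P"}, {"B", "P", "M"}, {"B", "J"}, {"P", "M", "E"}]
min_sup = 2   # 40% of 5 transactions

# Steps 1-2: count 1-itemsets and drop the infrequent items
counts = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in counts.items() if c >= min_sup}

# Step 3: F-list = frequent items sorted in descending order of support
f_list = sorted(frequent, key=lambda item: -frequent[item])

# Step 4: reorder each transaction according to the F-list
ordered = [[item for item in f_list if item in t] for t in transactions]
print(f_list)
print(ordered)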
Step-5: Construct the FP tree
• Read transaction 1: {B,P} -> Create two nodes B and P.
Set the path as null -> B -> P and the counts of B and
P as 1.
• Read transaction 2: {B,P} -> The path will be null ->
B -> P. As transactions 1 and 2 share the same path,
set the counts of B and P to 2.
• Read transaction 3: {B,P,M} -> The path will be
null -> B -> P -> M. As transactions 2 and 3 share
the same path up to node P, set the counts
of B and P to 3 and create node M with count 1.
Continue until all the transactions are mapped to a
path in FP-tree;
Step-6: Construct the conditional FP tree in the reverse order of the
F-list {E,M,P,B} and generate the frequent itemsets.
• The resulting table has two items {B, P} that are frequently bought together.
• For items E and M, the nodes in the conditional FP tree have a
count (support) of 1, which is less than the minimum threshold support of 2.
• Therefore no frequent itemsets are generated for them. In the case of item P,
node B in the conditional FP tree has a count (support) of 3, satisfying the
minimum threshold support.
• Hence the frequent itemset {P, B} is generated by adding the item P to B.
