For example,
• Suppose $12,000 and $98,000 are the minimum and
maximum values for the attribute income, and [0.0,
1.0] is the range onto which we have to map a value of
$73,600.
• The value $73,600 would be transformed using
min-max normalization as follows:
v' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716
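A minimal Python sketch of this transformation, using the standard textbook income bounds of $12,000 and $98,000 (the function name and default range are our own choices, not from the source):

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Income example: min = $12,000, max = $98,000, value = $73,600
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```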
Attribute Construction
• New properties of data are created from
existing attributes to help in the data mining
process.
• For example, a date-of-birth attribute can
be transformed into a derived attribute such as
is_senior_citizen for each tuple, which can
directly influence predictions of diseases,
chances of survival, etc.
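A small Python sketch of deriving such an attribute; the function name and the cutoff age of 60 are illustrative assumptions, not from the source:

```python
from datetime import date

def is_senior_citizen(birth_date, cutoff_age=60, today=None):
    """Derive a boolean attribute from date of birth (cutoff age is an assumption)."""
    today = today or date.today()
    # Subtract 1 if the birthday has not yet occurred this year.
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day))
    return age >= cutoff_age

print(is_senior_citizen(date(1950, 5, 1), today=date(2024, 1, 1)))  # True
print(is_senior_citizen(date(1990, 5, 1), today=date(2024, 1, 1)))  # False
```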
Aggregation
• It is a method of storing and presenting data
in a summary format.
• For example:
• we have a data set of sales reports of an
enterprise that has quarterly sales of each
year. We can aggregate the data to get the
enterprise's annual sales report.
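The quarterly-to-annual example can be sketched in Python; the sales figures below are illustrative, not from the source:

```python
# Quarterly sales per year -> annual totals (figures are illustrative).
quarterly_sales = [
    (2022, "Q1", 120), (2022, "Q2", 150), (2022, "Q3", 130), (2022, "Q4", 200),
    (2023, "Q1", 140), (2023, "Q2", 160), (2023, "Q3", 155), (2023, "Q4", 210),
]

# Aggregate: sum the four quarters of each year.
annual_sales = {}
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] = annual_sales.get(year, 0) + amount

print(annual_sales)  # {2022: 600, 2023: 665}
```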
Discretization and concept hierarchy
generation
• Data discretization refers to a method of
converting a large number of continuous data
values into a smaller number of intervals or
categories, so that the evaluation and
management of the data become easier.
Example:
• Suppose we have an attribute Age with many
distinct numeric values; discretization groups
them into interval labels.
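The idea can be sketched in Python; the age values and interval boundaries below are illustrative assumptions, not from the source:

```python
def discretize_age(age):
    """Map a numeric age to a coarse interval label (boundaries are illustrative)."""
    if age <= 12:
        return "child"
    elif age <= 19:
        return "teen"
    elif age <= 59:
        return "adult"
    return "senior"

ages = [5, 14, 23, 37, 45, 62, 71]
print([discretize_age(a) for a in ages])
# ['child', 'teen', 'adult', 'adult', 'adult', 'senior', 'senior']
```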
Concept hierarchy generation:
• A concept hierarchy is a sequence of
mappings from a set of low-level, specific
concepts to higher-level, more general concepts.
• In other words, mapping is done from low-level
concepts to high-level concepts.
• There are two types of mapping: top-down
mapping and bottom-up mapping.
Top-down mapping
• Top-down mapping generally starts at the
top with general concepts and works down to
specialized concepts at the bottom.
Bottom-up mapping
• Bottom-up mapping generally starts at the
bottom with specialized concepts and works up
to generalized concepts at the top.
• Let's understand this concept hierarchy for
the dimension location with the help of an
example.
• A particular city can be mapped to the
country it belongs to. For example, New Delhi can be
mapped to India, and India can be mapped to
Asia.
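The location hierarchy above can be sketched as a pair of lookup tables in Python (the dictionaries and function name are our own illustration):

```python
# Low-level -> high-level mappings for the location dimension.
city_to_country = {"New Delhi": "India", "Tokyo": "Japan"}
country_to_continent = {"India": "Asia", "Japan": "Asia"}

def generalize(city):
    """Climb the concept hierarchy: city -> country -> continent."""
    country = city_to_country[city]
    return city, country, country_to_continent[country]

print(generalize("New Delhi"))  # ('New Delhi', 'India', 'Asia')
```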
Data Reduction:
• Data reduction is a process that reduces the
volume of the original data.
• Data reduction techniques ensure the
integrity of data while reducing the data.
• When we collect data from different data
warehouses for analysis, the result is a huge
amount of data, which is difficult for a data
analyst to deal with.
• Data reduction increases the efficiency of data
mining.
Data Reduction Techniques
• Dimensionality reduction
• Numerosity reduction
• Data compression.
Dimensionality reduction:
• Dimensionality reduction eliminates
redundant or irrelevant attributes from the data set.
Three methods of dimensionality reduction:
a. Wavelet Transform
b. Principal Component Analysis
c. Attribute Subset Selection
a. Wavelet Transform
• In the wavelet transform, a data vector X is transformed into a
numerically different data vector X’ such that both X and X’
vectors are of the same length.
• The data obtained from the wavelet transform can be truncated.
• Compressed data is obtained by storing only a small
fraction of the strongest wavelet coefficients.
b. Principal Component Analysis
• Suppose the data set to be analyzed has tuples
with n attributes. Principal component analysis searches for
k n-dimensional orthogonal vectors (with k ≤ n), the principal
components, that can best represent the data.
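A minimal NumPy sketch of PCA via the eigendecomposition of the covariance matrix (the function name and random test data are our own; production code would typically use a library such as scikit-learn):

```python
import numpy as np

def pca(X, k):
    """Project n-attribute data onto its k strongest principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k strongest components
    return Xc @ top                         # reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # 100 tuples, n = 5 attributes
print(pca(X, 2).shape)  # (100, 2)
```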
c. Attribute Subset Selection
• A large data set may have many attributes, some of which are
irrelevant to the data mining task and some of which are redundant.
• Attribute subset selection reduces the volume of data by
eliminating the redundant and irrelevant attributes.
Numerosity reduction:
• The numerosity reduction reduces the volume of
the original data and represents it in a much
smaller form.
This technique includes two types
• Parametric
For parametric methods a model is used to
estimate the data, so that typically only the data
parameters need to be stored, instead of the
actual data.
• Non-parametric
Non parametric methods for storing data include
histograms, clustering and sampling.
Histograms:
• A histogram is a ‘graph’ that represents frequency
distribution which describes how often a value
appears in the data.
• A histogram uses binning to approximate the
data distribution of an attribute.
• It partitions the values into disjoint subsets,
referred to as bins or buckets.
Consider the following data from the AllElectronics
data set: a list of prices of regularly sold items.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30.
Clustering
• Clustering techniques group similar objects
in the data in such a way that the objects in a
cluster are similar to each other but
dissimilar to objects in other clusters.
• How similar the objects within a cluster are
can be measured using a distance function.
• The quality of a cluster depends on the diameter of
the cluster, i.e., the maximum distance between any
two objects in the cluster.
• This technique is more effective if the
data can be organized into distinct clusters.
Sampling:
• Sampling can be used as a data reduction
approach because it enables a huge data set
to be defined by a much smaller random
sample or a subset of the information.
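A small Python sketch of simple random sampling without replacement as a reduction step (the population size, sample size, and seed are illustrative assumptions):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible
population = list(range(1, 10_001))  # 10,000 records

# Simple random sample without replacement: keep 1% of the data.
sample = random.sample(population, k=100)
print(len(sample), len(sample) / len(population))  # 100 0.01
```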
Data Compression
• Data compression is a technique where the data
transformation technique is applied to the original
data in order to obtain compressed data.
• If the compressed data can again be
reconstructed to form the original data without
losing any information then it is a ‘lossless’ data
reduction.
• If the original data cannot be reconstructed
from the compressed data, the data
reduction is ‘lossy’.
• Dimensionality reduction and numerosity
reduction methods are also used for
data compression.
Efficient and scalable frequent item set mining
methods
MINING FREQUENT PATTERNS
• Frequent patterns are patterns (e.g., itemsets,
subsequences) that appear frequently in a data
set.
• For example, a set of items, such as milk and
bread that appear frequently together in a
transaction data set is a frequent itemset.
• A subsequence, such as buying first a PC, then a
digital camera, and then a memory card, if it
occurs frequently in a shopping history database,
is a (frequent) sequential pattern.
Applications
• Market Basket Analysis
• Telecommunication
• Credit Cards/ Banking Services
• Medical Treatments
• Basketball-Game Analysis
Market Basket Analysis
• Frequent item set mining leads to discovery of
associations and correlations among items in large
transactional or relational data sets.
• For example, a customer who buys bread is also
likely to buy butter, eggs, or milk, so these
products are stocked on the same shelf or
nearby.
• It can help in many business decision making
processes.
• This process analyzes customer buying habits
• The discovery of such associations can help
retailers develop marketing strategies by gaining
insight into which items are frequently purchased
together by customers
Association rule mining
• finds interesting associations and relationships among
large sets of data items.
• Such a rule shows how frequently an itemset occurs in a
transaction. A typical example is Market Basket
Analysis.
• Association rules are "if-then" statements that help to
show the probability of relationships between data
items.
• An association rule has two parts: an antecedent (if)
and a consequent (then).
• An antecedent is an item found within the data.
• A consequent is an item found in combination with the
antecedent.
• Association rules are created by searching data for
frequent if-then patterns and using the
criteria support and confidence to identify the most
important relationships.
• Support is an indication of how frequently the items
appear in the data.
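The two measures can be sketched in Python on a toy transaction set (the transactions, function names, and items below are illustrative, not from the source):

```python
# Toy transaction data (illustrative, not from the source).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread", "butter"}))       # 0.6  (3 of 5 transactions)
print(confidence({"bread"}, {"butter"}))  # 0.75 (0.6 / 0.8)
```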
• The above table has two items {B, P} that are bought together frequently.
• For items E and M, the nodes in their conditional FP-trees have a
count (support) of 1, which is less than the minimum support threshold of 2.
• Therefore no frequent itemsets are generated from them. For item P, node B
in the conditional FP-tree has a count (support) of 3, satisfying the minimum
support threshold.
• Hence the frequent itemset {B, P} is generated by adding item P to B.