Data Mining: Association Rules Mining
Association Rules Mining: The task of association rule mining is to find certain association relationships among a set of objects (called items) in a database. The association relationships are described in association rules. Each rule has two measurements, support and confidence. Confidence is a measure of the rule's strength, while support corresponds to statistical significance.
The task of discovering association rules was first introduced in 1993 [AIS93]. Originally, association rule mining focused on market "basket data", which stores items purchased on a per-transaction basis. A typical example of an association rule on market basket data is that 70% of customers who purchase bread also purchase butter.
Finding association rules is valuable for cross-marketing and attached mailing applications. Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns. Besides these business applications, association rule mining can also be applied to other areas, such as medical diagnosis and remotely sensed imagery.
Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. X is called the antecedent and Y the consequent of the rule.
Each rule has two measures, support and confidence: the support of X ⇒ Y is the fraction of transactions in D that contain X ∪ Y, and the confidence is the fraction of transactions containing X that also contain Y, i.e., support(X ∪ Y) / support(X).
Association rule mining was initially used for market basket analysis to find how items purchased by customers are related.
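As a concrete illustration of these two measures, here is a minimal R sketch over a made-up five-transaction basket (the item names and data are invented for illustration, not taken from the source):

```r
# Hypothetical toy basket data; item names are illustrative only.
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("butter", "milk"),
  c("bread", "butter", "jam")
)

# support(X => Y) = fraction of transactions containing X union Y
# confidence(X => Y) = support(X union Y) / support(X)
contains <- function(t, items) all(items %in% t)
support <- function(items) {
  mean(sapply(transactions, contains, items = items))
}
confidence <- function(x, y) support(c(x, y)) / support(x)

support(c("bread", "butter"))   # 0.6: 3 of the 5 baskets contain both items
confidence("bread", "butter")   # 0.75: 3 of the 4 bread baskets also contain butter
```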
Algorithms:
AIS Algorithm
In the AIS algorithm [AIS 93], candidate itemsets are generated and counted on-the-fly as the database is scanned. After reading a transaction, it is determined which of the itemsets found to be large in the previous pass are contained in this transaction. New candidate itemsets are generated by extending these large itemsets with other items in the transaction.
SETM Algorithm
This algorithm was motivated by the desire to use SQL to compute large itemsets. Like AIS, the SETM algorithm also generates candidates on-the-fly based on transactions read from the database. To use the standard SQL join operation for candidate generation, SETM separates candidate generation from counting.
Apriori Algorithm
The disadvantage of the AIS and SETM algorithms is that they unnecessarily generate and count too many candidate itemsets that turn out to be small. To improve performance, the Apriori algorithm was proposed [AS 94]. Apriori generates the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass, without considering the transactions in the database. Apriori beats AIS and SETM by more than an order of magnitude for large datasets. The key idea of Apriori lies in the downward-closure property of support, which means that if an itemset has minimum support, then all its subsets also have minimum support.
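The candidate-generation step implied by this property can be sketched in R. This is a simplified, illustrative version (it enumerates k-subsets of the items appearing in L(k-1) rather than performing the exact join of [AS 94]): Ck is formed from L(k-1) alone, and any candidate with an infrequent (k-1)-subset is pruned before the database is scanned.

```r
# Simplified sketch of Apriori candidate generation (illustrative only).
# large_prev: list of character vectors, the large (k-1)-itemsets L(k-1).
apriori_gen <- function(large_prev) {
  k <- length(large_prev[[1]]) + 1
  items <- sort(unique(unlist(large_prev)))
  if (length(items) < k) return(list())
  # Join step (simplified): form k-itemsets from items appearing in L(k-1).
  candidates <- combn(items, k, simplify = FALSE)
  is_large <- function(s) any(sapply(large_prev, setequal, y = s))
  # Prune step: keep a candidate only if every (k-1)-subset is large.
  Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(sapply(subsets, is_large))
  }, candidates)
}

L2 <- list(c("bread", "butter"), c("bread", "milk"), c("butter", "milk"))
apriori_gen(L2)   # the single candidate 3-itemset {bread, butter, milk}
```

In the full algorithm, the surviving candidates would then be counted against the database to obtain Lk.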
DHP (Direct Hashing and Pruning) Algorithm: In frequent itemset generation, the heuristic used to construct the candidate set of large itemsets is crucial to performance. The larger the candidate set, the more processing cost is required to discover the frequent itemsets. The processing in the initial iterations in fact dominates the total execution cost, which shows that initial candidate set generation, especially for the large 2-itemsets, is the key issue for improving performance.
Based on the above concern, DHP was proposed [PCY 95]. DHP is a hash-based algorithm and is especially effective for the generation of the candidate set of large 2-itemsets. DHP has two major features: one is efficient generation of large itemsets, the other is effective reduction of the transaction database size. Instead of including all k-itemsets formed from Lk-1 * Lk-1 into Ck as in Apriori, DHP adds a k-itemset into Ck only if that k-itemset passes the hash filtering, i.e., the k-itemset is hashed into a hash entry whose value is larger than or equal to the minimum support. Such hash filtering can drastically reduce the size of Ck. DHP progressively trims the transaction database size in two ways: one is to reduce the size of some transactions, the other is to remove some transactions entirely. The execution time of the first pass of DHP is slightly larger than that of Apriori due to the extra overhead required to generate the hash table. However, DHP incurs significantly smaller execution times than Apriori in later passes. The reason is that Apriori scans the full database in every pass, whereas DHP scans the full database only for the first two passes and then scans the reduced database thereafter.
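The hash-filtering idea for candidate 2-itemsets can be sketched in R. The hash function, bucket count, and data below are invented for illustration and are not the exact scheme of [PCY 95]:

```r
# Illustrative DHP-style hash filtering for candidate 2-itemsets.
transactions <- list(c("A", "B", "C"), c("A", "B"), c("A", "C"), c("B", "D"))
min_sup <- 2
n_buckets <- 7   # arbitrary small hash table size for the sketch

# Hash a pair of items to a bucket (toy hash function).
bucket <- function(pair) {
  sum(utf8ToInt(paste(sort(pair), collapse = ""))) %% n_buckets + 1
}

# Build the hash table while scanning the transactions.
counts <- integer(n_buckets)
for (t in transactions) {
  if (length(t) < 2) next
  for (p in combn(t, 2, simplify = FALSE)) {
    b <- bucket(p)
    counts[b] <- counts[b] + 1
  }
}

# A 2-itemset survives hash filtering only if its bucket count >= min_sup.
passes_filter <- function(pair) counts[bucket(pair)] >= min_sup
passes_filter(c("A", "B"))   # TRUE for this toy data: {A, B} occurs twice
passes_filter(c("C", "D"))   # FALSE here, unless it collides with a frequent bucket
```

Only pairs that survive this filter are added to C2, which is what keeps the candidate set for the second pass small.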
1. J48 (C4.5): J48 is an implementation of C4.5 [8] that builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s1, s2, ...} of already classified samples. Each sample si = (x1, x2, ...) is a vector, where x1, x2, ... represent attributes or features of the sample. Decision trees are efficient to use and display good accuracy for large amounts of data. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other.
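The entropy-based attribute selection that C4.5 inherits from ID3 can be illustrated with a short R sketch. The two-attribute training set below is invented for illustration, and the gain-ratio normalization used by C4.5 proper is omitted:

```r
# Information entropy of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the attribute `attr`.
info_gain <- function(labels, attr) {
  split_entropy <- sum(sapply(split(labels, attr), function(s) {
    length(s) / length(labels) * entropy(s)
  }))
  entropy(labels) - split_entropy
}

# Toy training set (invented for illustration).
buys    <- c("yes", "yes", "no", "no", "yes", "no")
income  <- c("high", "high", "low", "low", "high", "high")
student <- c("yes", "no", "yes", "no", "no", "no")

info_gain(buys, income)    # larger gain: income splits the classes better here
info_gain(buys, student)   # zero gain: student leaves the classes fully mixed
```

The attribute with the larger gain would be chosen for the split at that node.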
2. Naive Bayes: A naive Bayes classifier assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. Bayesian belief networks are graphical models which, unlike the naive Bayesian classifier, allow the representation of dependencies among subsets of attributes [10]. Bayesian belief networks can also be used for classification. The simplifying assumption is that attributes are conditionally independent given the class: P(X | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci).
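A minimal R sketch of classification under this conditional-independence assumption (the toy training table is invented for illustration; no smoothing or continuous attributes are handled):

```r
# Toy training data (invented for illustration).
train <- data.frame(
  income  = c("high", "high", "low", "low", "high", "low"),
  student = c("no", "yes", "yes", "no", "yes", "no"),
  buys    = c("yes", "yes", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# P(class) and P(attribute value | class), estimated by relative frequencies.
naive_bayes_score <- function(newcase, class_value) {
  subset_c <- train[train$buys == class_value, ]
  prior <- nrow(subset_c) / nrow(train)
  likelihood <- prod(sapply(names(newcase), function(a) {
    mean(subset_c[[a]] == newcase[[a]])
  }))
  prior * likelihood   # proportional to P(class | newcase)
}

newcase <- list(income = "low", student = "yes")
naive_bayes_score(newcase, "yes")
naive_bayes_score(newcase, "no")   # predict the class with the larger score
```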
3. k-Nearest Neighbor: For continuous-valued target functions, the k-NN algorithm calculates the mean value of the k nearest neighbors. The distance-weighted nearest neighbor algorithm weights the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors; the same idea applies to real-valued target functions. Averaging over the k nearest neighbors also makes the method robust to noisy data.
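A short R sketch of distance-weighted k-NN prediction for a continuous-valued target (the one-dimensional toy data and the 1/d² weighting are illustrative choices):

```r
# Distance-weighted k-NN for a continuous-valued target (illustrative sketch).
knn_predict <- function(train_x, train_y, xq, k = 3) {
  d <- sqrt(rowSums((train_x - matrix(xq, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))
  nn <- order(d)[1:k]                  # indices of the k nearest neighbors
  w <- 1 / (d[nn]^2 + 1e-9)            # closer neighbors get greater weight
  sum(w * train_y[nn]) / sum(w)        # weighted mean of the neighbors' values
}

# Toy data: y is roughly 2 * x.
train_x <- matrix(c(1, 2, 3, 4, 5), ncol = 1)
train_y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
knn_predict(train_x, train_y, xq = 3.5, k = 3)   # close to 7
```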
4. Neural Network: Neural networks have emerged as an important tool for classification. The recent vast research
activities in neural classification have established that neural networks are a promising alternative to various conventional
classification methods. The advantage of neural networks lies in the following theoretical aspects. First, neural networks
are data driven self-adaptive methods in that they can adjust themselves to the data without any explicit specification of
functional or distributional form for the underlying model.
5. Support Vector Machine: A classification method for both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension. In the new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary"). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).
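As a usage sketch, assuming the add-on package e1071 (an R interface to LIBSVM) is installed; the iris data and the radial kernel are illustrative choices, not prescribed by the source:

```r
# SVM classification sketch using the e1071 package (assumed installed).
library(e1071)

# Fit an SVM with a radial (RBF) kernel; the nonlinear kernel plays the role
# of the implicit mapping to a higher-dimensional space described above.
model <- svm(Species ~ ., data = iris, kernel = "radial")

# Support vectors are the "essential" training tuples retained by the model.
nrow(model$SV)

# Classify the training data and inspect the confusion matrix.
pred <- predict(model, iris)
table(pred, iris$Species)
```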
---------------------------------------------------------------------------------
Review of Basic Data Analytic Methods Using R
Introduction to R: R is a programming language and software framework for statistical analysis and graphics.
Available for use under the GNU General Public License, R software and installation instructions can be obtained via the Comprehensive R Archive Network (CRAN). Functions such as summary() can help analysts easily get an idea of the magnitude and range of the data, but other aspects, such as linear relationships and distributions, are more difficult to see from descriptive statistics.
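For example, using the built-in mtcars data frame purely as an illustration:

```r
# Quick numerical overview of a data frame with base R functions.
data(mtcars)          # built-in example dataset
head(mtcars)          # first few rows
summary(mtcars$mpg)   # min, quartiles, median, mean, max of one variable
summary(mtcars)       # the same overview for every column
```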
R Graphical User Interfaces: R software uses a command-line interface (CLI) that is similar to the BASH shell in Linux
or the interactive versions of scripting languages such as Python. UNIX and Linux users can enter the command R at the terminal prompt to use the CLI. For Windows installations, R comes with RGui.exe, which provides a basic graphical user interface (GUI). However, to improve the ease of writing, executing, and debugging R code, several additional GUIs have been written for R. Popular GUIs include the R Commander and RStudio.
Exploratory Data Analysis: Exploratory data analysis [9] is a data analysis approach that reveals the important characteristics of a dataset, mainly through visualization. A useful way to detect patterns and anomalies in the data is through exploratory data analysis with visualization. Visualization gives a succinct, holistic view of the data that may be difficult to grasp from the numbers and summaries alone. Variables x and y of the data frame data can instead be visualized in a scatterplot, which easily depicts the relationship between two variables. An important facet of the initial data exploration, visualization assesses data cleanliness and suggests potentially important relationships in the data prior to the model planning and building phases.
Topics: Visualization Before Analysis; Dirty Data; Visualizing a Single Variable (Dotchart and Barplot, Histogram and Density Plot).
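A few base-R commands illustrate these single-variable plots, plus the scatterplot mentioned above (the built-in mtcars data frame is used here as stand-in data):

```r
# Visualizing a single variable and a pairwise relationship with base R graphics.
data(mtcars)

hist(mtcars$mpg, breaks = 10, main = "Histogram of mpg")    # histogram
plot(density(mtcars$mpg), main = "Density of mpg")          # density plot
barplot(table(mtcars$cyl), main = "Count by cylinders")     # barplot of counts
dotchart(mtcars$mpg, labels = rownames(mtcars), cex = 0.6)  # dotchart

# Scatterplot of two variables to reveal their relationship.
plot(mtcars$wt, mtcars$mpg, xlab = "weight", ylab = "mpg")
```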
Statistical Methods for Evaluation: Visualization is useful for data exploration and presentation, but statistics is crucial because it is applied throughout the entire Data Analytics Lifecycle. Statistical techniques are used during the
initial data exploration and data preparation, model building, evaluation of the final models, and assessment of how the
new models improve the situation when deployed in the field. In particular, statistics can help answer the following
questions for data analytics:
● Model Building and Planning
● What are the best input variables for the model?
● Can the model predict the outcome given the input?
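For instance, the question of which input variables matter can be examined through a fitted linear model and its summary statistics; the mtcars data and the chosen predictors below are illustrative only:

```r
# Which input variables are useful? Inspect coefficient estimates and p-values.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(fit)   # t-statistics and p-values for each input variable
confint(fit)   # confidence intervals for the coefficients
```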
Data Cube Computation: Data cube computation is an essential task in data warehouse implementation. The
precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of online
analytical processing. However, such computation is challenging because it may require substantial computational time
and storage space.
METHODS: 1. Full Cube: I. Full materialization II. Materializing all the cells of all of the cuboids for a given data cube
III. Issues in time and space
2. Iceberg cube: I. Partial materialization II. Materializing the cells of only interesting cuboids III. Materializing only the
cells in a cuboid whose measure value is above the minimum threshold
3. Closed cube: Materializing only closed cells
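The contrast between full and iceberg materialization can be sketched in base R. The toy fact table and threshold are invented for illustration, and real cube engines use the specialized methods described next rather than this naive aggregation:

```r
# Toy fact table: three dimensions and one measure (invented data).
sales <- data.frame(
  city    = c("NY", "NY", "LA", "LA", "LA"),
  product = c("pen", "pad", "pen", "pen", "pad"),
  month   = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  amount  = c(10, 5, 7, 3, 8)
)
dims <- c("city", "product", "month")

# Full cube: aggregate the measure for every non-empty subset of the dimensions,
# plus the apex cuboid (grand total) -- 2^3 cuboids in all.
cuboids <- unlist(lapply(1:length(dims),
                         function(m) combn(dims, m, simplify = FALSE)),
                  recursive = FALSE)
full_cube <- lapply(cuboids, function(d) {
  aggregate(sales["amount"], by = sales[d], FUN = sum)
})
apex <- sum(sales$amount)   # the all-dimensions-aggregated cell

# Iceberg cube: keep only cells whose measure meets a minimum threshold.
min_amount <- 10
iceberg <- lapply(full_cube, function(cb) cb[cb$amount >= min_amount, , drop = FALSE])
```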
#Multi-Way Array Aggregation: (i) Array-based "bottom-up" approach (ii) Uses multi-dimensional chunks (iii) No direct tuple comparisons (iv) Simultaneous aggregation on multiple dimensions (v) Intermediate aggregate values are re-used for computing ancestor cuboids (vi) Full materialization
Aggregation Strategy: (i) Partitions the array into chunks (ii) Data addressing (iii) Multi-way aggregation
#Bottom-Up Computation: (I) “Top-down” approach (II) Partial materialization (iceberg cube computation) (III)
Divides dimensions into partitions and facilitates iceberg pruning (IV) No simultaneous aggregation
Iceberg Pruning Process: (I) Partitioning: (i) Sorts data values (ii) Partitions into blocks that fit in memory
(II) Apriori Pruning: For each block:
• If it does not satisfy min_sup, its descendants are pruned
• If it satisfies min_sup, it is materialized and a recursive call is made including the next dimension
#Shell Fragment Cube Computation (I) Reduces a high dimensional cube into a set of lower dimensional cubes
(II) Lossless reduction (III) Online re-construction of high-dimensional data cube
Fragmentation Strategy: (i) Observation (ii) Fragmentation (iii) Semi-Online Computation
---------------------------------------------------------------------------------------------------------------------------------------------------
Mining Frequent Patterns without Candidate Generation: Frequent pattern mining plays an essential role in mining associations, correlations, sequential patterns, episodes, multi-dimensional patterns, max-patterns, partial periodicity, emerging patterns, and many other important data mining tasks.
First, we design a novel data structure, called frequent pattern tree, or FP-tree for short, which is an extended prefix-tree
structure storing crucial, quantitative information about frequent patterns. To ensure that the tree structure is compact and
informative, only frequent length-1 items will have nodes in the tree. The tree nodes are arranged in such a way that more
frequently occurring nodes will have better chances of sharing nodes than less frequently occurring ones. Our experiments
show that such a tree is highly compact, usually orders of magnitude smaller than the original database. This offers an FP-
tree-based mining method a much smaller data set to work on.
Second, we develop an FP-tree-based pattern fragment growth mining method, which starts from a frequent length-1
pattern (as an initial suffix pattern), examines only its conditional pattern base (a "sub-database" which consists of the set
of frequent items co-occurring with the suffix pattern), constructs its (conditional) FP-tree, and performs mining
recursively with such a tree. The pattern growth is achieved via concatenation of the suffix pattern with the new ones
generated from a conditional FP-tree. Since the frequent itemset in any transaction is always encoded in the corresponding
path of the frequent pattern trees, pattern growth ensures the completeness of the result. In this context, our method is not
Apriori-like restricted generation-and-test but restricted test only. The major operations of mining are count accumulation
and prefix path count adjustment, which are usually much less costly than candidate generation and pattern matching
operations performed in most Apriori-like algorithms.
Third, the search technique employed in mining is a partitioning-based, divide-and-conquer method rather than Apriori-like bottom-up generation of frequent itemset combinations. This dramatically reduces the size of the conditional pattern base generated at each subsequent level of search, as well as the size of its corresponding conditional FP-tree. Moreover, it transforms the problem of finding long frequent patterns into looking for shorter ones and then concatenating the suffix. It employs the least frequent items as suffixes, which offers good selectivity. All these techniques contribute to a substantial reduction of search costs.
Algorithm (FP-growth: mining frequent patterns with an FP-tree by pattern fragment growth)
Input: An FP-tree constructed based on Algorithm 1, using DB and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: Call FP-growth(FP-tree, null). (The recursive procedure body is omitted here; a simplified pattern-growth sketch follows.)
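The following is a simplified pattern-growth sketch in R. It follows the divide-and-conquer idea of FP-growth but represents each conditional pattern base as a plain list of projected transactions instead of a compressed FP-tree, so it illustrates recursive pattern growth rather than the paper's exact procedure:

```r
# Simplified pattern-growth sketch (NOT the exact FP-growth procedure):
# each conditional pattern base is kept as a list of projected transactions.
pattern_growth <- function(transactions, min_sup, suffix = character(0)) {
  if (length(transactions) == 0) return(list())
  counts <- table(unlist(lapply(transactions, unique)))
  frequent <- sort(names(counts)[counts >= min_sup])
  patterns <- list()
  for (item in frequent) {
    pat <- c(suffix, item)
    patterns[[paste(pat, collapse = " ")]] <- as.integer(counts[[item]])
    # Conditional pattern base for `item`: transactions containing it,
    # restricted to items after `item` in a fixed order (so each frequent
    # itemset is generated exactly once).
    cond_db <- lapply(Filter(function(t) item %in% t, transactions),
                      function(t) t[t > item])
    cond_db <- Filter(length, cond_db)
    patterns <- c(patterns, pattern_growth(cond_db, min_sup, pat))
  }
  patterns
}

trans <- list(c("bread", "butter", "milk"), c("bread", "butter"),
              c("bread", "milk"), c("butter", "milk"))
pattern_growth(trans, min_sup = 2)   # every frequent itemset with its support count
```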
Classification: There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows:
Classification
Prediction
Classification models predict categorical class labels. The following are examples of cases where the data analysis task is classification:
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky
and which are safe.
A marketing manager at a company needs to predict whether a customer with a given profile will buy a
new computer.
In the classification step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered acceptable.
Decision Tree: A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class
label. The topmost node in the tree is the root node.
A typical decision tree for the concept buys_computer indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
Decision Tree Induction Algorithm: A machine learning researcher, J. Ross Quinlan, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a greedy approach. In these algorithms there is no backtracking; the trees are constructed in a top-down, recursive, divide-and-conquer manner.
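As a usage sketch of top-down decision tree induction in R, assuming the rpart package is installed (rpart implements CART-style recursive partitioning rather than ID3/C4.5 itself, and the iris data stands in for the buys_computer example):

```r
# Top-down recursive partitioning with rpart (CART-style, illustrative only).
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                        # the induced tree: internal tests and leaf classes

# Use the induced tree as a classifier on (here) the training tuples.
pred <- predict(fit, iris, type = "class")
table(pred, iris$Species)         # confusion matrix to judge accuracy
```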