Vertical Data Format For Frequent Pattern Mining

Frequent Pattern Mining (FPM) is a data mining technique that identifies recurring patterns within datasets, crucial for applications like market basket analysis and fraud detection. The vertical data format enhances FPM by representing transactional data in a way that allows efficient pattern mining through TID-lists, reducing computational overhead. While it offers scalability and efficiency, challenges include high memory usage and inefficiencies with sparse datasets, suggesting a need for hybrid approaches and optimizations.

1 Introduction

1.1 What is Frequent Pattern Mining?

Frequent Pattern Mining (FPM) is a technique within data mining, which is the process
of extracting meaningful patterns, trends, and insights from large datasets. Data mining
uses methods from statistics, machine learning, and database systems to uncover hidden
relationships and support decision-making across various domains.

FPM identifies recurring patterns, associations, or structures within a dataset. These
patterns often represent relationships among items, events, or attributes that appear
together frequently. FPM is fundamental for discovering useful insights in transactional
databases, time-series data, and other structured datasets. It serves as the basis for
advanced mining techniques like association rule mining, sequential pattern mining, and
graph mining.

The significance of FPM lies in its ability to extract actionable knowledge from large
datasets. It is widely used in areas such as market basket analysis, recommendation
systems, fraud detection, and bioinformatics. For example, it can identify frequently
purchased product combinations in retail, which businesses can leverage for cross-selling
strategies (Fournier-Viger et al., 2022).

1.2 What is vertical data format?

The vertical data format is a representation of transactional data where each item is
associated with a list of transaction IDs (TIDs) in which it appears. This format differs
from the horizontal format, where transactions are listed as collections of items. The ver-
tical format is especially beneficial for algorithms that rely on intersections of TID lists to
identify frequent itemsets, as it reduces database scans and computational overhead. The
vertical data format transforms transactional data into a structure that enables efficient
pattern mining through set intersections, offering significant performance gains in dense
datasets (Leung et al., 2018).

The vertical data format is widely used in frequent pattern mining due to its efficiency
and scalability. It represents each item as a Transaction ID List (TID-list), so support
counting reduces to intersecting TID-lists instead of repeatedly scanning the database.
This reduces computational overhead and is particularly beneficial for dense datasets,
where frequent items are numerous. Algorithms like ECLAT leverage the vertical format to
discover frequent itemsets efficiently, making it suitable for applications such as
market basket analysis, web usage mining, and bioinformatics. Additionally, the vertical
format supports parallelization and incremental mining, enabling scalability for large or
dynamic datasets. Its compact representation and compatibility with hybrid approaches
further enhance its utility in frequent pattern mining tasks (Dwivedi & Satti, 2015;
Meenakshi, 2015).

2 How Vertical Data Format works

Mining frequent patterns from the vertical data format, as ECLAT-style algorithms do,
involves the following steps:

2.1 Convert Dataset to Vertical Format

The horizontal transaction dataset is transformed into a vertical format. Each item is
associated with a Transaction ID List (TID-list), which contains all transactions where
the item appears. For example:

• Item A: {1, 2, 3}

• Item B: {1, 3, 4}

• Item C: {2, 3, 4}
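The conversion can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `to_vertical` and the horizontal dataset are hypothetical, with the transactions chosen so that they produce the three TID-lists listed above.

```python
from collections import defaultdict

def to_vertical(transactions):
    """Map each item to the set of transaction IDs in which it appears."""
    tid_lists = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tid_lists[item].add(tid)
    return dict(tid_lists)

# Hypothetical horizontal dataset: transaction ID -> items in that transaction.
horizontal = {
    1: {"A", "B"},
    2: {"A", "C"},
    3: {"A", "B", "C"},
    4: {"B", "C"},
}

vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, sorted(vertical[item]))
```

A single pass over the horizontal data is enough to build every TID-list, which is why later steps never need to rescan the original transactions.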

2.2 Identify Frequent Single Items

1. Calculate the support for each item by counting the number of transactions in its
TID-list.

2. Retain items with support greater than or equal to the minimum support threshold
(min-sup).
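As a small sketch of this step, the support of an item is simply the length of its TID-list, so filtering is a one-line comprehension. The helper name `frequent_items` and the extra item D are assumptions added here only to show an item being discarded.

```python
def frequent_items(tid_lists, min_sup):
    """Keep only items whose support (TID-list size) is at least min_sup."""
    return {item: tids for item, tids in tid_lists.items() if len(tids) >= min_sup}

# TID-lists from the example above, plus a hypothetical rare item D.
tid_lists = {"A": {1, 2, 3}, "B": {1, 3, 4}, "C": {2, 3, 4}, "D": {4}}

kept = frequent_items(tid_lists, min_sup=2)
print(sorted(kept))  # D has support 1 and is dropped
```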

2.2.1 How to choose minimum support threshold

Choosing the minimum support threshold means balancing two effects:

• A lower minimum support threshold may yield too many patterns, many of which
might be irrelevant.

• A higher minimum support threshold reduces computational complexity but risks
missing important patterns.
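The trade-off can be made concrete by counting how many itemsets survive at different thresholds. The sketch below uses a brute-force count over the three TID-lists from Section 2.1; the function name `count_frequent` is hypothetical, and enumerating every combination is only practical for a toy example like this one.

```python
from itertools import combinations

tid_lists = {"A": {1, 2, 3}, "B": {1, 3, 4}, "C": {2, 3, 4}}

def count_frequent(tid_lists, min_sup):
    """Brute-force count of itemsets whose support is at least min_sup."""
    items = sorted(tid_lists)
    count = 0
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            # Support of an itemset = size of the intersection of its TID-lists.
            tids = set.intersection(*(tid_lists[i] for i in combo))
            if len(tids) >= min_sup:
                count += 1
    return count

counts = {min_sup: count_frequent(tid_lists, min_sup) for min_sup in range(1, 5)}
print(counts)  # {1: 7, 2: 6, 3: 3, 4: 0}
```

Even on three items, moving the threshold from 1 to 4 takes the result from every possible itemset to none at all.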

2.3 Generate Candidate Itemsets

1. Combine pairs of frequent k-itemsets that share k − 1 items to generate candidate
(k + 1)-itemsets.

2. Compute the TID-list for each candidate by intersecting the TID-lists of the two
k-itemsets that were combined.
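One common way to realise this step is a prefix join: two k-itemsets are combined only when they agree on their first k − 1 items. The sketch below assumes itemsets are stored as sorted tuples; that convention and the name `generate_candidates` are choices made for this illustration, not something the text prescribes.

```python
def generate_candidates(frequent_k):
    """Join k-itemsets sharing a (k-1)-item prefix and intersect their TID-lists.

    frequent_k maps sorted item tuples to TID-lists (sets of transaction IDs).
    """
    candidates = {}
    itemsets = sorted(frequent_k)
    for i, left in enumerate(itemsets):
        for right in itemsets[i + 1:]:
            if left[:-1] != right[:-1]:
                continue  # prefixes differ, so these two are not joined
            candidate = left + (right[-1],)
            candidates[candidate] = frequent_k[left] & frequent_k[right]
    return candidates

frequent_1 = {("A",): {1, 2, 3}, ("B",): {1, 3, 4}, ("C",): {2, 3, 4}}
candidates_2 = generate_candidates(frequent_1)
candidates_3 = generate_candidates(candidates_2)
print(candidates_2)
print(candidates_3)
```

Each candidate's TID-list comes from a single set intersection, so its support is known as soon as the candidate is created.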

2.4 Prune Non-Frequent Itemsets

For each candidate (k + 1)-itemset, calculate its support by determining the size of its
TID-list. Retain only those candidates that meet the minimum support threshold.

2.5 Repeat Until No More Candidates Are Generated

The process continues iteratively, generating larger itemsets and pruning non-frequent
ones, until no further frequent itemsets can be identified.
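Putting the steps together, the whole loop might look like the following sketch. It assumes itemsets are kept as sorted tuples and joined by shared prefix; the name `mine_frequent_itemsets` is hypothetical, and this level-wise form is one possible arrangement rather than the definitive algorithm.

```python
def mine_frequent_itemsets(tid_lists, min_sup):
    """Level-wise mining over TID-lists; returns {itemset tuple: support}."""
    # Level 1: frequent single items.
    level = {(item,): tids for item, tids in tid_lists.items() if len(tids) >= min_sup}
    result = {}
    while level:
        result.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        itemsets = sorted(level)
        for i, left in enumerate(itemsets):
            for right in itemsets[i + 1:]:
                if left[:-1] == right[:-1]:          # same (k-1)-item prefix
                    tids = level[left] & level[right]
                    if len(tids) >= min_sup:         # prune non-frequent candidates
                        next_level[left + (right[-1],)] = tids
        level = next_level  # an empty level ends the loop
    return result

tid_lists = {"A": {1, 2, 3}, "B": {1, 3, 4}, "C": {2, 3, 4}}
patterns = mine_frequent_itemsets(tid_lists, min_sup=2)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```

With a minimum support of 2, the three single items and the three pairs are retained, while {A, B, C} is pruned because it appears only in transaction 3.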

3 Demonstration of how Vertical Data Format works

Consider an online retail store (like Amazon) that tracks what customers add to their
shopping carts. The store wants to identify frequently purchased product combinations
to improve recommendations and promotions.

Dataset (Horizontal Format)

Here’s a dataset showing what customers purchased in each transaction:

Transaction ID (TID)    Items
1                       Laptop, Mouse, Headset
2                       Laptop, Mouse
3                       Laptop, Headset
4                       Mouse, Headset
5                       Laptop, Mouse, Headset

Table 1: Example dataset in horizontal format

Step 1: Convert to Vertical Format

Transform the dataset into a vertical format where each product is associated with the
transactions it appears in.

Item       TID-List
Laptop     {1, 2, 3, 5}
Mouse      {1, 2, 4, 5}
Headset    {1, 3, 4, 5}

Table 2: Dataset converted to vertical format

Step 2: Identify Frequent Single Items

Calculate the support (number of transactions) for each item. Assume min-sup = 2.

Item       Support    Frequent?
Laptop     4          Yes
Mouse      4          Yes
Headset    4          Yes

Table 3: Frequent single items

Step 3: Generate Frequent 2-Itemsets

Combine frequent single items into pairs and calculate support by intersecting their TID-
lists.

Itemset              TID-List (Intersection)    Support    Frequent?
{Laptop, Mouse}      {1, 2, 5}                  3          Yes
{Laptop, Headset}    {1, 3, 5}                  3          Yes
{Mouse, Headset}     {1, 4, 5}                  3          Yes

Table 4: Frequent 2-itemsets
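The pairwise intersections in Table 4 can be checked with a short sketch. The variable names are illustrative; the TID-lists and the min-sup value of 2 are taken from Table 2 and Step 2.

```python
from itertools import combinations

tid_lists = {
    "Laptop": {1, 2, 3, 5},
    "Mouse": {1, 2, 4, 5},
    "Headset": {1, 3, 4, 5},
}
min_sup = 2

# Intersect the TID-lists of every pair of frequent single items.
pairs = {}
for a, b in combinations(tid_lists, 2):
    tids = tid_lists[a] & tid_lists[b]
    if len(tids) >= min_sup:
        pairs[(a, b)] = tids

for itemset, tids in pairs.items():
    print(itemset, sorted(tids), len(tids))
```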

Step 4: Generate Frequent 3-Itemsets

Combine frequent 2-itemsets into 3-itemsets and calculate support by intersecting their
TID-lists.

Itemset                     TID-List (Intersection)    Support    Frequent?
{Laptop, Mouse, Headset}    {1, 5}                     2          Yes

Table 5: Frequent 3-itemsets

Step 5: Final Frequent Patterns

The frequent patterns (with support) are:

• Single items: {Laptop}: 4, {Mouse}: 4, {Headset}: 4

• 2-itemsets: {Laptop, Mouse}: 3, {Laptop, Headset}: 3, {Mouse, Headset}: 3

• 3-itemset: {Laptop, Mouse, Headset}: 2
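These results can be reproduced end to end with a compact depth-first search over TID-lists, the traversal order ECLAT is usually described with. The steps above proceed level by level instead, so treat this as an alternative sketch under that assumption; the function name `eclat` and its signature are hypothetical.

```python
def eclat(prefix, items, min_sup, out):
    """Depth-first mining; items is a list of (item, tids) pairs that extend prefix."""
    for i, (item, tids) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Extend only with later items so that each itemset is visited once.
        extensions = []
        for other, other_tids in items[i + 1:]:
            shared = tids & other_tids
            if len(shared) >= min_sup:
                extensions.append((other, shared))
        eclat(itemset, extensions, min_sup, out)
    return out

tid_lists = {
    "Laptop": {1, 2, 3, 5},
    "Mouse": {1, 2, 4, 5},
    "Headset": {1, 3, 4, 5},
}
min_sup = 2
start = [(item, tids) for item, tids in tid_lists.items() if len(tids) >= min_sup]
patterns = eclat((), start, min_sup, {})
for itemset, support in patterns.items():
    print(itemset, support)
```

The seven itemsets it reports match the supports listed above: 4 for each single item, 3 for each pair, and 2 for the full triple.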

Step 6: How this helps

The store can now use these patterns for actionable insights:

• Product Recommendations: If a customer buys a Laptop and Mouse, recommend a
Headset.

• Bundling Discounts: Offer a discount for buying Laptop, Mouse, and Headset
together, as they frequently co-occur.

• Stock Optimization: Ensure the Mouse and Headset are available alongside Lap-
tops to meet customer demand.
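To judge how strong a recommendation such as "Laptop and Mouse, then Headset" is, one option is the confidence measure from association rule mining, which Section 1.1 names as a technique built on FPM. Confidence is not computed in the walkthrough above, so the sketch below is an added illustration, and the helper names `support` and `confidence` are hypothetical.

```python
tid_lists = {
    "Laptop": {1, 2, 3, 5},
    "Mouse": {1, 2, 4, 5},
    "Headset": {1, 3, 4, 5},
}

def support(itemset):
    """Support of an itemset = size of the intersection of its items' TID-lists."""
    return len(set.intersection(*(tid_lists[item] for item in itemset)))

def confidence(antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

conf = confidence({"Laptop", "Mouse"}, {"Headset"})
print(round(conf, 2))  # 0.67
```

Here two of the three transactions containing both a Laptop and a Mouse also contain a Headset, so the rule holds in about 67% of those cases.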

4 Conclusion

The vertical data format in frequent pattern mining offers several strengths. It enables
efficient support counting by using TID-lists and set intersections, reducing the need for
multiple database scans and improving computational efficiency, particularly for dense
datasets. It also provides a compact representation, minimizing redundancy by associating
each item only with the transactions in which it appears. This format is scalable and
well-suited for parallel processing, and because infrequent items are pruned early, it
avoids unnecessary computations on them. Additionally, it supports incremental mining,
adapting easily to changes in data.

However, it has some weaknesses. Storing TID-lists for every item can lead to high mem-
ory usage, particularly for large datasets. In sparse datasets, TID-list intersections can
become inefficient, as they involve many short or empty lists, which increases computa-
tional costs. Moreover, for long frequent itemsets, the number of candidate combinations
and TID-list intersections grows exponentially, resulting in high computational overhead.

The way forward for the vertical data format involves adopting hybrid approaches that
combine it with other formats to balance memory and efficiency, using optimization
techniques like compressed TID-lists, and using parallel computing frameworks for large
datasets. Additionally, adapting algorithms for sparse data and expanding the format to
support advanced mining tasks could enhance its applicability and performance.
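As one illustration of a more compact TID-list, each list can be stored as a bit vector, here a plain Python integer in which bit t is set when the item occurs in transaction t. This encoding is an assumption chosen for the sketch and is not necessarily what the cited works use; the helper names are hypothetical.

```python
def to_bitmask(tids):
    """Encode a TID-list as an integer with bit t set for each transaction ID t."""
    mask = 0
    for t in tids:
        mask |= 1 << t
    return mask

def from_bitmask(mask):
    """Decode a bit-vector TID-list back into a sorted list of transaction IDs."""
    tids, t = [], 0
    while mask:
        if mask & 1:
            tids.append(t)
        mask >>= 1
        t += 1
    return tids

laptop = to_bitmask({1, 2, 3, 5})
mouse = to_bitmask({1, 2, 4, 5})

both = laptop & mouse                # intersection becomes a bitwise AND
support = bin(both).count("1")       # support is the number of set bits
print(from_bitmask(both), support)   # [1, 2, 5] 3
```

Bit vectors tend to pay off when items occur in a large share of transactions; for very sparse items, an explicit list of IDs may be smaller.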

5 References

1. Leung, C. K., Zhang, H., Souza, J., & Lee, W. (2018). Scalable vertical mining for big
data analytics of frequent itemsets. In Database and Expert Systems Applications:
29th International Conference, DEXA 2018, Regensburg, Germany, September 3–6,
2018, Proceedings, Part I (pp. 3-17). Springer International Publishing.

2. Dwivedi, N., & Satti, S. R. (2015). Vertical-format Based Frequent Pattern Mining-A
Hybrid Approach. Journal of Intelligent Computing, 6(4), 119.

3. Mohsin, M., Ahmed, M. R., & Ahmed, T. (2016). Closed frequent pattern mining
using vertical data format: depth first approach. IJSSET, 2(3), 230-238.

4. Meenakshi, A. (2015). Survey of frequent pattern mining algorithms in horizontal
and vertical data layouts. Int J Adv Comput Sci Technol, 4(4).

5. Fournier-Viger, P., Gan, W., Wu, Y., Nouioua, M., Song, W., Truong, T., Duong,
H. (2022, April). Pattern mining: Current challenges and opportunities. In Inter-
national Conference on Database Systems for Advanced Applications (pp. 34-49).
Cham: Springer International Publishing.
