Vertical Data Format For Frequent Pattern Mining
Frequent Pattern Mining (FPM) is a technique within data mining, which is the process
of extracting meaningful patterns, trends, and insights from large datasets. Data mining
uses methods from statistics, machine learning, and database systems to uncover hidden
relationships and support decision-making across various domains.
The significance of FPM lies in its ability to extract actionable knowledge from large
datasets. It is widely used in areas such as market basket analysis, recommendation
systems, fraud detection, and bioinformatics. For example, it can identify frequently
purchased product combinations in retail, which businesses can leverage for cross-selling
strategies (Fournier-Viger et al., 2022).
The vertical data format is a representation of transactional data where each item is
associated with a list of transaction IDs (TIDs) in which it appears. This format differs
from the horizontal format, where transactions are listed as collections of items. The
vertical format is especially beneficial for algorithms that rely on intersections of TID-lists to
identify frequent itemsets, as it reduces database scans and computational overhead. The
vertical data format transforms transactional data into a structure that enables efficient
pattern mining through set intersections, offering significant performance gains in dense
datasets (Leung et al., 2018).
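As a small illustration of the two layouts, the sketch below stores the same toy data both ways and computes the support of one itemset from each. The transactions, the item names, and the helper functions `support_horizontal` and `support_vertical` are hypothetical, chosen only for this example.

```python
# Toy data (hypothetical): the same four transactions in both layouts.
horizontal = {
    1: {"bread", "milk"},
    2: {"bread", "butter"},
    3: {"bread", "milk", "butter"},
    4: {"milk", "butter"},
}
vertical = {
    "bread": {1, 2, 3},
    "milk": {1, 3, 4},
    "butter": {2, 3, 4},
}

def support_horizontal(itemset, transactions):
    """Count support by scanning every transaction."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support_vertical(itemset, tid_lists):
    """Count support by intersecting the TID-lists of the items."""
    tids = set.intersection(*(tid_lists[item] for item in itemset))
    return len(tids)

query = {"bread", "milk"}
print(support_horizontal(query, horizontal))  # scans all four transactions
print(support_vertical(query, vertical))      # touches only two TID-lists
```

Both calls return the same support; the vertical version reads only the TID-lists of the queried items instead of every transaction.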
The vertical data format is widely used in frequent pattern mining due to its efficiency
and scalability. It represents each item as a Transaction ID List (TID-list), simplifying
support counting by performing intersections of TID-lists instead of repeatedly scanning
the database. This format reduces computational overhead and is particularly beneficial
for dense datasets, where frequent items are numerous. Algorithms like ECLAT leverage
the vertical format to discover frequent itemsets efficiently, making it suitable for
applications such as market basket analysis, web usage mining, and bioinformatics. Additionally,
the vertical format supports parallelization and incremental mining, enabling scalability
for large or dynamic datasets. Its compact representation and compatibility with hybrid
approaches further enhance its utility in frequent pattern mining tasks (Dwivedi & Satti,
2015; Meenakshi, 2015).
The vertical data format algorithm for frequent pattern mining involves the following
steps:
The horizontal transaction dataset is transformed into a vertical format. Each item is
associated with a Transaction ID List (TID-list), which contains all transactions where
the item appears. For example:
• Item A: {1, 2, 3}
• Item B: {1, 3, 4}
• Item C: {2, 3, 4}
1. Calculate the support for each item by counting the number of transactions in its
TID-list.
2. Retain items with support greater than or equal to the minimum support threshold
(min-sup).
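The conversion and the single-item filter can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `to_vertical` and `frequent_items` are assumptions, the horizontal transactions are reconstructed from the TID-lists of items A, B, and C shown above, and item D is a hypothetical extra item added only so that the filter has something to remove.

```python
from collections import defaultdict

def to_vertical(transactions):
    """Build an item -> set-of-TIDs mapping from a TID -> items mapping."""
    tid_lists = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tid_lists[item].add(tid)
    return dict(tid_lists)

def frequent_items(tid_lists, min_sup):
    """Keep items whose TID-list holds at least min_sup transactions."""
    return {item: tids for item, tids in tid_lists.items() if len(tids) >= min_sup}

# Horizontal data reconstructed from the TID-lists of A, B and C.
# Item D is hypothetical and appears in a single transaction.
transactions = {
    1: {"A", "B"},
    2: {"A", "C"},
    3: {"A", "B", "C", "D"},
    4: {"B", "C"},
}

vertical = to_vertical(transactions)
for item, tids in sorted(frequent_items(vertical, min_sup=2).items()):
    print(item, sorted(tids), len(tids))
```

With min-sup = 2, the hypothetical item D (support 1) is dropped, while A, B, and C are kept with support 3 each.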
2.2.1 How to choose minimum support threshold
Choosing the minimum support threshold means balancing two opposing risks:
• A lower minimum support threshold may yield too many patterns, many of which
might be irrelevant.
• A higher minimum support threshold may discard patterns that are less frequent but
still relevant.
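To make the effect of the threshold concrete, the following sketch counts how many itemsets remain frequent at several values of min-sup. It reuses the TID-lists of items A, B, and C from the earlier example; the function `count_frequent` is a hypothetical helper that enumerates every itemset by brute force, which is only practical for tiny inputs like this one.

```python
from itertools import combinations

tid_lists = {"A": {1, 2, 3}, "B": {1, 3, 4}, "C": {2, 3, 4}}

def count_frequent(tid_lists, min_sup):
    """Brute-force count of itemsets whose support is at least min_sup."""
    items = sorted(tid_lists)
    count = 0
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            tids = set.intersection(*(tid_lists[i] for i in itemset))
            if len(tids) >= min_sup:
                count += 1
    return count

for min_sup in (1, 2, 3, 4):
    print(min_sup, count_frequent(tid_lists, min_sup))
```

On this data the count falls from 7 itemsets at min-sup = 1 to 6, 3, and finally 0 as the threshold rises to 4.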
1. Generate candidate (k + 1)-itemsets by combining frequent k-itemsets.
2. Compute the TID-list for each candidate by intersecting the TID-lists of its subsets.
For each candidate (k + 1)-itemset, calculate its support by determining the size of its
TID-list. Retain only those candidates that meet the minimum support threshold.
The process continues iteratively, generating larger itemsets and pruning non-frequent
ones, until no further frequent itemsets can be identified.
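The iterative procedure can be sketched as a level-wise loop over TID-lists. This is one possible reading of the steps above, not a definitive implementation: the function name `mine_vertical`, the representation of itemsets as sorted tuples, and the prefix-based join used to build candidates are all assumptions made for the example.

```python
def mine_vertical(tid_lists, min_sup):
    """Level-wise frequent itemset mining over TID-lists.

    Returns a dict mapping each frequent itemset (a sorted tuple of
    items) to its support.
    """
    # Level 1: frequent single items with their TID-lists.
    level = {
        (item,): tids
        for item, tids in sorted(tid_lists.items())
        if len(tids) >= min_sup
    }
    result = {}
    while level:
        result.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        itemsets = sorted(level)
        for i, left in enumerate(itemsets):
            for right in itemsets[i + 1:]:
                # Join two k-itemsets only if they share their first k-1 items.
                if left[:-1] != right[:-1]:
                    continue
                # The candidate's TID-list is the intersection of its parents'.
                tids = level[left] & level[right]
                if len(tids) >= min_sup:
                    next_level[left + (right[-1],)] = tids
        level = next_level
    return result

tid_lists = {"A": {1, 2, 3}, "B": {1, 3, 4}, "C": {2, 3, 4}}
for itemset, support in mine_vertical(tid_lists, min_sup=2).items():
    print(itemset, support)
```

The loop stops as soon as a level produces no frequent itemsets. Because each candidate's support is read directly from an exact TID-list intersection, the sketch needs no further pass over the original transactions.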
Let’s take an online retail store (like Amazon) tracking what customers add to their
shopping carts. The store wants to identify frequently purchased product combinations
to improve recommendations and promotions.
Transaction ID (TID) | Items
1                    | Laptop, Mouse, Headset
2                    | Laptop, Mouse
3                    | Laptop, Headset
4                    | Mouse, Headset
5                    | Laptop, Mouse, Headset
Transform the dataset into a vertical format where each product is associated with the
transactions it appears in.
Item    | TID-List
Laptop  | {1, 2, 3, 5}
Mouse   | {1, 2, 4, 5}
Headset | {1, 3, 4, 5}
Calculate the support (number of transactions) for each item. Assume min-sup = 2.
Each of the three items appears in four transactions, so each has support 4 and is frequent.
Combine frequent single items into pairs and calculate support by intersecting their
TID-lists.
Itemset           | TID-List (Intersection) | Support | Frequent?
{Laptop, Mouse}   | {1, 2, 5}               | 3       | Yes
{Laptop, Headset} | {1, 3, 5}               | 3       | Yes
{Mouse, Headset}  | {1, 4, 5}               | 3       | Yes
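Combine frequent 2-itemsets into 3-itemsets and calculate support by intersecting their
TID-lists. Here the only candidate is {Laptop, Mouse, Headset}, whose TID-list is
{1, 2, 5} ∩ {1, 3, 5} ∩ {1, 4, 5} = {1, 5}. Its support is 2, which meets min-sup = 2, so
it is frequent.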
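The pair and triple intersections for this example can be checked with a short script. The TID-lists are copied from the vertical table above; the helper name `intersect` is an assumption made for this sketch.

```python
from itertools import combinations

# TID-lists taken from the vertical table of the retail example.
tid_lists = {
    "Laptop": {1, 2, 3, 5},
    "Mouse": {1, 2, 4, 5},
    "Headset": {1, 3, 4, 5},
}
min_sup = 2

def intersect(itemset):
    """Intersect the TID-lists of every item in the itemset."""
    return set.intersection(*(tid_lists[item] for item in itemset))

for size in (2, 3):
    for itemset in combinations(tid_lists, size):
        tids = intersect(itemset)
        # Columns: itemset, TID-list, support, meets min-sup?
        print(itemset, sorted(tids), len(tids), len(tids) >= min_sup)
```

Every pair has support 3 and the single triple has support 2, so all four itemsets meet min-sup = 2.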
The store can now use these patterns for actionable insights:
• Bundling Discounts: Offer a discount for buying Laptop, Mouse, and Headset
together, as they frequently co-occur.
• Stock Optimization: Ensure the Mouse and Headset are available alongside
Laptops to meet customer demand.
4 Conclusion
The vertical data format in frequent pattern mining offers several strengths. It enables
efficient support counting by using TID-lists and set intersections, reducing the need for
multiple database scans and improving computational efficiency, particularly for dense
datasets. It also provides a compact representation, minimizing redundancy by associating
each item only with the transactions in which it appears. This format is scalable and
well-suited for parallel processing, and it avoids unnecessary computations on infrequent
items. Additionally, it supports incremental mining, adapting easily to changes in data.
However, it has some weaknesses. Storing TID-lists for every item can lead to high memory
usage, particularly for large datasets. In sparse datasets, TID-list intersections can
become inefficient, as they involve many short or empty lists, which increases computational
costs. Moreover, for long frequent itemsets, the number of candidate combinations
and TID-list intersections grows exponentially, resulting in high computational overhead.
The way forward for the vertical data format involves adopting hybrid approaches that
combine it with other formats to balance memory and efficiency, using optimization
techniques like compressed TID-lists, and using parallel computing frameworks for large
datasets. Additionally, adapting algorithms for sparse data and expanding the format to
support advanced mining tasks could enhance its applicability and performance.
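As one illustration of a more compact TID-list encoding, the sketch below stores each TID-list as a bitmap held in a Python integer. This is an assumption made for illustration rather than a technique taken from the cited works, and the helper names `to_bitmap` and `support` are hypothetical. Bitmaps tend to pay off when items occur in a large share of the transactions.

```python
def to_bitmap(tids):
    """Encode a set of TIDs as an integer with bit `tid` set for each TID."""
    bitmap = 0
    for tid in tids:
        bitmap |= 1 << tid
    return bitmap

def support(bitmap):
    """Number of set bits, i.e. the number of transactions in the list."""
    return bin(bitmap).count("1")

# TID-lists from the retail example, encoded as bitmaps.
laptop = to_bitmap({1, 2, 3, 5})
mouse = to_bitmap({1, 2, 4, 5})

# Bitwise AND plays the role of the TID-list intersection.
both = laptop & mouse
print(support(both))
```

The intersection becomes a single bitwise AND, and support counting becomes a count of set bits.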
5 References
1. Leung, C. K., Zhang, H., Souza, J., & Lee, W. (2018). Scalable vertical mining for big
data analytics of frequent itemsets. In Database and Expert Systems Applications:
29th International Conference, DEXA 2018, Regensburg, Germany, September 3–6,
2018, Proceedings, Part I (pp. 3-17). Springer International Publishing.
3. Mohsin, M., Ahmed, M. R., & Ahmed, T. (2016). Closed frequent pattern mining
using vertical data format: depth first approach. IJSSET, 2(3), 230-238.
5. Fournier-Viger, P., Gan, W., Wu, Y., Nouioua, M., Song, W., Truong, T., & Duong,
H. (2022, April). Pattern mining: Current challenges and opportunities. In International
Conference on Database Systems for Advanced Applications (pp. 34-49).
Cham: Springer International Publishing.