Apriori algorithm

Market Basket Analysis

• Market basket analysis is the process of analyzing a customer's buying
habits by finding associations between the different items in their
"shopping basket."
• The goal is to identify associations between items frequently bought
together. This is typically achieved by mining frequent itemsets and
deriving association rules from them.
Applications of Market Basket Analysis
1. Product Placement: Retailers can place frequently bought together
items, like bread and milk, closer to each other to encourage more
sales.
2. Cross-selling: E-commerce platforms can recommend items based on
frequent itemsets, like suggesting butter when a customer buys bread.
3. Discount Bundling: Stores can offer discounts on frequently
purchased item combinations to boost sales.
Apriori algorithm
• The Apriori algorithm is a fundamental algorithm in data mining,
specifically used for association rule mining. It identifies frequent
itemsets in a transactional dataset and generates association rules
that help discover patterns, relationships, or associations between
items in large datasets.
• The Apriori algorithm operates on the principle that:
• If an itemset is frequent, then all of its subsets must also be frequent.
• Conversely, if an itemset is infrequent, all of its supersets are also infrequent.
• This principle is known as the "Apriori property" and significantly
reduces the search space for finding frequent itemsets.
Apriori Algorithm Steps
1. Generate candidate itemsets: Start by identifying individual items and
then combining them to form itemsets of increasing size (from 1-itemsets
to k-itemsets).
2. Prune the candidate itemsets: Use the support threshold to eliminate
infrequent itemsets (those whose support is below the minimum threshold).
3. Repeat: Continue generating and pruning itemsets until no more frequent
itemsets can be found.
4. Generate association rules: Once frequent itemsets are identified,
generate rules using the confidence metric and prune them using a
minimum confidence threshold.
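The four steps above can be sketched in plain Python. This is a minimal, illustrative implementation rather than an optimized one, and the transaction data and minimum-support value in the example are assumptions made for illustration:

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    n = len(transactions)
    # Step 1: candidate 1-itemsets are the individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        # Step 2: prune candidates whose support is below the threshold.
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Step 3: join surviving k-itemsets to form (k+1)-itemset candidates
        # (a simplified join; a full implementation also prunes by subset).
        keys = list(level)
        current = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        k += 1
    return frequent


# Assumed example data (5 transactions) and a 60% minimum support:
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
frequent = apriori(transactions, min_support=0.6)
# {'bread'} and {'milk'} each appear 4 times; {'bread', 'milk'} appears 3 times.
```

With a 60% threshold over 5 transactions, an itemset must appear at least 3 times to survive pruning, so butter (2 occurrences) and eggs (1) are eliminated in the first pass and never extend into larger candidates.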
Example
• Consider a dataset with 5 transactions:
Step 1: Find all 1-itemsets (single items) and calculate their support.
Step 2: Generate 2-itemsets and calculate their support.
Step 3: Generate 3-itemsets and calculate their support.
Step 4: Generate association rules from the frequent itemsets.
Final Results
• The following association rules have been identified as strong based
on the support and confidence thresholds:
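The rule-derivation step can be sketched as follows. The itemset counts and the confidence threshold used here are assumed values for illustration; confidence of a rule A → B is support(A ∪ B) / support(A):

```python
from itertools import combinations


def association_rules(frequent, min_confidence):
    """Derive rules A -> B from frequent itemsets, keeping those whose
    confidence = support(A | B) / support(A) meets the threshold."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # rules need both an antecedent and a consequent
        # Try every non-empty proper subset as an antecedent.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Assumed support counts from a 5-transaction dataset:
frequent = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"bread", "milk"}): 3,
}
rules = association_rules(frequent, min_confidence=0.7)
# Both bread -> milk and milk -> bread have confidence 3/4 = 0.75.
```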
Handling large datasets in main memory

• Handling large datasets in main memory is a critical challenge in data
analytics, especially when the dataset exceeds the available memory
of the system.
• Efficient techniques and strategies are needed to overcome memory
constraints while maintaining performance. Below are common
approaches for managing large datasets in main memory:
Data Partitioning (Chunking)
• Partitioning large datasets into smaller chunks allows for processing one
chunk at a time, preventing memory overload.
• These chunks can be stored on disk, and only a manageable portion is
loaded into memory when needed, processed, and then discarded.
• Example: In Python, libraries like pandas allow processing CSVs in chunks
using the chunksize parameter. Similarly, big data frameworks such as
Apache Spark divide datasets into partitions for distributed processing.
Batch Processing
• Batch processing splits data into batches that can be processed
independently. Each batch is loaded into memory, processed, and saved
before moving to the next batch.
• This is useful for streaming or real-time processing systems where
continuous data is generated and only a certain amount of data can be
processed at a time.
• Advantages:
• Minimizes memory requirements by processing smaller data portions.
• Suitable for distributed computing.
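A minimal sketch of batching: the generator below yields fixed-size batches from any iterable, so each batch can be loaded, processed, and released before the next one arrives (the batch size and data are example values):

```python
def batches(records, batch_size):
    """Yield successive fixed-size batches from an iterable of records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


# Each batch is processed independently and then discarded.
totals = [sum(b) for b in batches(range(10), 4)]
```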
Compression Techniques
• Data compression reduces the size of the data stored in memory.
Compression algorithms (such as Gzip or Snappy) can be applied to
datasets before loading them into memory.
• Some databases and file formats, like Apache Parquet or ORC, offer built-in
compression that reduces memory overhead.
• Advantages:
• Significant reduction in memory usage, allowing larger datasets to fit in memory.
• Minimizes I/O and memory footprint.
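Python's standard library includes Gzip, so the trade-off is easy to demonstrate. The payload below is synthetic, repetitive tabular data (an assumption chosen because such data compresses well); compression is lossless, so decompressing restores the original bytes exactly:

```python
import gzip

# Synthetic, highly repetitive tabular payload (assumed example data).
payload = ("price,qty\n" + "9.99,3\n" * 10_000).encode()

# Compress before caching or holding the data in memory.
compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)

# Lossless round trip: decompression restores the exact original bytes.
restored = gzip.decompress(compressed)
```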
Distributed Computing Frameworks
• Distributed systems such as Apache Spark, Apache Hadoop, and Dask
allow processing large datasets across clusters of machines, effectively
expanding the available memory.
• These frameworks split the data into partitions and distribute them across
multiple nodes, allowing parallel processing.
• Advantages:
• Scalable and can handle very large datasets.
• Supports fault-tolerant and parallel processing.
• Disadvantages:
• Requires setting up and maintaining a distributed computing environment.
• Higher latency compared to single-machine solutions.
In-Memory Databases
• In-memory databases like Redis, Memcached, and Apache Ignite store
data directly in the system's RAM for faster access compared to traditional
disk-based databases.
• These databases use efficient data structures to manage large volumes of
data while keeping the memory overhead low.
• Advantages:
• High-performance with low-latency access to large datasets.
• Suitable for real-time data analytics.
• Disadvantages:
• Limited by available memory.
• Data persistence can be a challenge (although solutions like Redis offer persistence
options).
Sampling and Approximation Techniques
• Sampling involves working with a smaller, representative subset of the
large dataset instead of the entire dataset.
• Approximation algorithms such as sketching (HyperLogLog, Bloom filters,
etc.) or random sampling allow for approximate computations when exact
results are not necessary.
• Advantages:
• Reduces memory usage while still providing insights.
• Faster processing times for approximate solutions.
• Disadvantages:
• Accuracy might be compromised depending on the sample size or approximation
method.
• May not be suitable when exact analysis is required.
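A Bloom filter, one of the sketching structures mentioned above, can be built from the standard library alone. The sketch below is a simplified, illustrative implementation (the bit-array size and hash count are arbitrary example parameters): a "no" answer is always correct, while a "yes" may occasionally be a false positive.

```python
import hashlib


class BloomFilter:
    """Approximate set membership in fixed memory."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Memory use is fixed at `size_bits / 8` bytes no matter how many items are added, which is exactly the trade of accuracy for memory that the bullet points describe.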
Efficient Data Structures
• Using memory-efficient data structures like arrays, dictionaries
(hash maps), or tries can significantly reduce memory overhead.
• Libraries such as NumPy (for numerical data) or sparse matrices (for
data with many zeros) provide ways to efficiently store and process
data in memory.
• Advantages:
• Reduces memory consumption by storing data more compactly.
• Faster data access and processing due to better memory locality.
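The difference is easy to measure with the standard library's `array` module, which (like NumPy) stores numbers as contiguous machine values instead of full Python objects; the element count below is an arbitrary example:

```python
import sys
from array import array

n = 100_000
# A Python list stores a pointer per element plus a full int object each...
as_list = list(range(n))
# ...while array("q") stores raw 8-byte signed integers contiguously.
as_array = array("q", range(n))

list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)
```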
Parallel Processing and Multithreading
• Multithreading or multiprocessing allows splitting the dataset and
processing it in parallel across multiple CPU cores, which can effectively
utilize the available memory and improve performance.
• Data can be divided across threads or processes, and each one works on its
portion of the dataset independently.
• Advantages:
• Faster processing by utilizing multiple CPU cores.
• Helps in taking full advantage of system resources.
• Disadvantages:
• Can introduce complexity in managing thread safety and synchronization.
• Overhead in creating and managing threads or processes.
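A minimal sketch of the divide-and-reduce pattern using `concurrent.futures`: the data is split into one slice per worker and the partial results are combined. (For CPU-bound work in CPython, `ProcessPoolExecutor` sidesteps the GIL; threads are used here only to keep the example simple and portable.)

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(data, num_workers=4):
    """Split data into one slice per worker and reduce the partial sums."""
    step = -(-len(data) // num_workers)  # ceiling division
    slices = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker sums its own slice independently.
        return sum(pool.map(sum, slices))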
Memory Mapping (mmap)
• Memory-mapped files allow large files to be accessed as though they are
in memory, without loading the entire file at once. Only the required
portions of the file are loaded into memory as needed.
• In Python, the mmap module or libraries like h5py can be used to handle
memory-mapped files efficiently.
• Advantages:
• Efficient for large datasets stored on disk.
• Only loads small, required portions into memory.
• Disadvantages:
• Performance can be impacted by disk I/O.
• Complex file structures may require additional handling.
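Python's standard `mmap` module makes the idea concrete. In the sketch below, a sample file is first written to a temporary path (a placeholder for a real large file); mapping it then lets the program touch only the pages it actually reads:

```python
import mmap
import os
import tempfile

# Write a sample file to disk (the path is a temporary placeholder).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"record-" * 1_000)

# Map the file into the address space: pages are loaded on access,
# not read up front, so very large files can be navigated cheaply.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:7]   # touches only the first page
        tail = mm[-7:]    # touches only the last page
os.remove(path)
```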
Data Streaming
• Data streaming is useful for processing data on-the-fly as it is
ingested. The dataset is not stored entirely in memory; instead, the
system processes each incoming data record immediately.
• Streaming frameworks such as Apache Kafka, Apache Flink, and
Spark Streaming are designed for real-time analytics on large-scale
data streams.
• Advantages:
• Eliminates the need to store large datasets in memory.
• Real-time processing capabilities.
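The core streaming idea, processing each record on arrival and keeping only a small running state, can be sketched without any framework. The generator below maintains a running mean in constant memory (the input values are assumed example data):

```python
def stream_mean(records):
    """Maintain a running mean over a stream: each record is processed
    once and discarded, so memory use stays constant."""
    count = 0
    total = 0.0
    for value in records:
        count += 1
        total += value
        yield total / count  # current mean after each arrival


# Simulate an unbounded source with an iterator; nothing is buffered.
means = list(stream_mean(iter([10, 20, 30, 40])))
```

Frameworks like Kafka, Flink, and Spark Streaming apply this same record-at-a-time (or micro-batch) model at cluster scale, with durability and fault tolerance added on top.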
