Apriori algorithm

Market Basket Analysis

• Market basket analysis is the process of analyzing a customer's buying
habits by finding associations between the different items in their
"shopping basket."
• The goal is to identify associations between items frequently bought
together. This is typically achieved by mining frequent itemsets and
deriving association rules from them.
Applications of Market Basket Analysis
1. Product Placement: Retailers can place frequently bought together
items, like bread and milk, closer to each other to encourage more
sales.
2. Cross-selling: E-commerce platforms can recommend items based on
frequent itemsets, like suggesting butter when a customer buys bread.
3. Discount Bundling: Stores can offer discounts on frequently
purchased item combinations to boost sales.
Apriori algorithm
• The Apriori algorithm is a fundamental algorithm in data mining,
specifically used for association rule mining. It identifies frequent
itemsets in a transactional dataset and generates association rules
that help discover patterns, relationships, or associations between
items in large datasets.
• The Apriori algorithm operates on the principle that:
• If an itemset is frequent, then all of its subsets must also be frequent.
• Conversely, if an itemset is infrequent, all of its supersets are also infrequent.
• This principle is known as the "Apriori property" and significantly
reduces the search space for finding frequent itemsets.
Apriori Algorithm Steps
1. Generate candidate itemsets: Start by identifying individual items and
then combining them to form itemsets of increasing size (from 1-itemsets
to k-itemsets).
2. Prune the candidate itemsets: Use the support threshold to eliminate
infrequent itemsets (those whose support is below the minimum threshold).
3. Repeat: Continue generating and pruning itemsets until no more frequent
itemsets can be found.
4. Generate association rules: Once frequent itemsets are identified,
generate rules using the confidence metric and prune them using a
minimum confidence threshold.
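The four steps above can be sketched in plain Python. This is a minimal, illustrative implementation rather than an optimized one, and the transaction data and minimum-support value in the example are assumptions made for illustration:

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    n = len(transactions)
    # Step 1: candidate 1-itemsets are the individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        # Step 2: prune candidates whose support is below the threshold.
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Step 3: join surviving k-itemsets to form (k+1)-itemset candidates
        # (a simplified join; a full implementation also prunes by subset).
        keys = list(level)
        current = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        k += 1
    return frequent


# Assumed example data (5 transactions) and a 60% minimum support:
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
frequent = apriori(transactions, min_support=0.6)
# {'bread'} and {'milk'} each appear 4 times; {'bread', 'milk'} appears 3 times.
```

With a 60% threshold over 5 transactions, an itemset must appear at least 3 times to survive pruning, so butter (2 occurrences) and eggs (1) are eliminated in the first pass and never extend into larger candidates.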
Example
• Consider a dataset with 5 transactions:
Step 1: Find all 1-itemsets (single items) and calculate their support.
Step 2: Generate 2-itemsets and calculate their support.
Step 3: Generate 3-itemsets and calculate their support.
Step 4: Generate association rules from the frequent itemsets.
Final Results
• The following association rules have been identified as strong based
on the support and confidence thresholds:
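The rule-derivation step can be sketched as follows. The itemset counts and the confidence threshold used here are assumed values for illustration; confidence of a rule A → B is support(A ∪ B) / support(A):

```python
from itertools import combinations


def association_rules(frequent, min_confidence):
    """Derive rules A -> B from frequent itemsets, keeping those whose
    confidence = support(A | B) / support(A) meets the threshold."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # rules need both an antecedent and a consequent
        # Try every non-empty proper subset as an antecedent.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Assumed support counts from a 5-transaction dataset:
frequent = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"bread", "milk"}): 3,
}
rules = association_rules(frequent, min_confidence=0.7)
# Both bread -> milk and milk -> bread have confidence 3/4 = 0.75.
```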
Handling large datasets in main memory

• Handling large datasets in main memory is a critical challenge in data
analytics, especially when the dataset exceeds the available memory
of the system.
• Efficient techniques and strategies are needed to overcome memory
constraints while maintaining performance. Below are common
approaches for managing large datasets in main memory:
Data Partitioning (Chunking)
• Partitioning large datasets into smaller chunks allows for processing one
chunk at a time, preventing memory overload.
• These chunks can be stored on disk, and only a manageable portion is
loaded into memory when needed, processed, and then discarded.
• Example: In Python, libraries like pandas allow processing CSVs in chunks
using the chunksize parameter. Similarly, big data frameworks such as
Apache Spark divide datasets into partitions for distributed processing.
Batch Processing
• Batch processing splits data into batches that can be processed
independently. Each batch is loaded into memory, processed, and saved
before moving to the next batch.
• This is useful for streaming or real-time processing systems where
continuous data is generated and only a certain amount of data can be
processed at a time.
• Advantages:
• Minimizes memory requirements by processing smaller data portions.
• Suitable for distributed computing.
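A minimal sketch of batching: the generator below yields fixed-size batches from any iterable, so each batch can be loaded, processed, and released before the next one arrives (the batch size and data are example values):

```python
def batches(records, batch_size):
    """Yield successive fixed-size batches from an iterable of records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


# Each batch is processed independently and then discarded.
totals = [sum(b) for b in batches(range(10), 4)]
```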
Compression Techniques
• Data compression reduces the size of the data stored in memory.
Compression algorithms (such as Gzip or Snappy) can be applied to
datasets before loading them into memory.
• Some databases and file formats, like Apache Parquet or ORC, offer built-in
compression that reduces memory overhead.
• Advantages:
• Significant reduction in memory usage, allowing larger datasets to fit in memory.
• Minimizes I/O and memory footprint.
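Python's standard library includes Gzip, so the trade-off is easy to demonstrate. The payload below is synthetic, repetitive tabular data (an assumption chosen because such data compresses well); compression is lossless, so decompressing restores the original bytes exactly:

```python
import gzip

# Synthetic, highly repetitive tabular payload (assumed example data).
payload = ("price,qty\n" + "9.99,3\n" * 10_000).encode()

# Compress before caching or holding the data in memory.
compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)

# Lossless round trip: decompression restores the exact original bytes.
restored = gzip.decompress(compressed)
```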
Distributed Computing Frameworks
• Distributed systems such as Apache Spark, Apache Hadoop, and Dask
allow processing large datasets across clusters of machines, effectively
expanding the available memory.
• These frameworks split the data into partitions and distribute them across
multiple nodes, allowing parallel processing.
• Advantages:
• Scalable and can handle very large datasets.
• Supports fault-tolerant and parallel processing.
• Disadvantages:
• Requires setting up and maintaining a distributed computing environment.
• Higher latency compared to single-machine solutions.
In-Memory Databases
• In-memory databases like Redis, Memcached, and Apache Ignite store
data directly in the system's RAM for faster access compared to traditional
disk-based databases.
• These databases use efficient data structures to manage large volumes of
data while keeping the memory overhead low.
• Advantages:
• High-performance with low-latency access to large datasets.
• Suitable for real-time data analytics.
• Disadvantages:
• Limited by available memory.
• Data persistence can be a challenge (although solutions like Redis offer persistence
options).
Sampling and Approximation Techniques
• Sampling involves working with a smaller, representative subset of the
large dataset instead of the entire dataset.
• Approximation algorithms such as sketching (HyperLogLog, Bloom filters,
etc.) or random sampling allow for approximate computations when exact
results are not necessary.
• Advantages:
• Reduces memory usage while still providing insights.
• Faster processing times for approximate solutions.
• Disadvantages:
• Accuracy might be compromised depending on the sample size or approximation
method.
• May not be suitable when exact analysis is required.
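A Bloom filter, one of the sketching structures mentioned above, can be built from the standard library alone. The sketch below is a simplified, illustrative implementation (the bit-array size and hash count are arbitrary example parameters): a "no" answer is always correct, while a "yes" may occasionally be a false positive.

```python
import hashlib


class BloomFilter:
    """Approximate set membership in fixed memory."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Memory use is fixed at `size_bits / 8` bytes no matter how many items are added, which is exactly the trade of accuracy for memory that the bullet points describe.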
Efficient Data Structures
• Using memory-efficient data structures like arrays, dictionaries
(hash maps), or tries can significantly reduce memory overhead.
• Libraries such as NumPy (for numerical data) or sparse matrices (for
data with many zeros) provide ways to efficiently store and process
data in memory.
• Advantages:
• Reduces memory consumption by storing data more compactly.
• Faster data access and processing due to better memory locality.
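The difference is easy to measure with the standard library's `array` module, which (like NumPy) stores numbers as contiguous machine values instead of full Python objects; the element count below is an arbitrary example:

```python
import sys
from array import array

n = 100_000
# A Python list stores a pointer per element plus a full int object each...
as_list = list(range(n))
# ...while array("q") stores raw 8-byte signed integers contiguously.
as_array = array("q", range(n))

list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)
```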
Parallel Processing and Multithreading
• Multithreading or multiprocessing allows splitting the dataset and
processing it in parallel across multiple CPU cores, which can effectively
utilize the available memory and improve performance.
• Data can be divided across threads or processes, and each one works on its
portion of the dataset independently.
• Advantages:
• Faster processing by utilizing multiple CPU cores.
• Helps in taking full advantage of system resources.
• Disadvantages:
• Can introduce complexity in managing thread safety and synchronization.
• Overhead in creating and managing threads or processes.
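A minimal sketch of the divide-and-reduce pattern using `concurrent.futures`: the data is split into one slice per worker and the partial results are combined. (For CPU-bound work in CPython, `ProcessPoolExecutor` sidesteps the GIL; threads are used here only to keep the example simple and portable.)

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(data, num_workers=4):
    """Split data into one slice per worker and reduce the partial sums."""
    step = -(-len(data) // num_workers)  # ceiling division
    slices = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker sums its own slice independently.
        return sum(pool.map(sum, slices))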
Memory Mapping (mmap)
• Memory-mapped files allow large files to be accessed as though they are
in memory, without loading the entire file at once. Only the required
portions of the file are loaded into memory as needed.
• In Python, the mmap module or libraries like h5py can be used to handle
memory-mapped files efficiently.
• Advantages:
• Efficient for large datasets stored on disk.
• Only loads small, required portions into memory.
• Disadvantages:
• Performance can be impacted by disk I/O.
• Complex file structures may require additional handling.
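Python's standard `mmap` module makes the idea concrete. In the sketch below, a sample file is first written to a temporary path (a placeholder for a real large file); mapping it then lets the program touch only the pages it actually reads:

```python
import mmap
import os
import tempfile

# Write a sample file to disk (the path is a temporary placeholder).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"record-" * 1_000)

# Map the file into the address space: pages are loaded on access,
# not read up front, so very large files can be navigated cheaply.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:7]   # touches only the first page
        tail = mm[-7:]    # touches only the last page
os.remove(path)
```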
Data Streaming
• Data streaming is useful for processing data on-the-fly as it is
ingested. The dataset is not stored entirely in memory; instead, the
system processes each incoming data record immediately.
• Streaming frameworks such as Apache Kafka, Apache Flink, and
Spark Streaming are designed for real-time analytics on large-scale
data streams.
• Advantages:
• Eliminates the need to store large datasets in memory.
• Real-time processing capabilities.
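The core streaming idea, processing each record on arrival and keeping only a small running state, can be sketched without any framework. The generator below maintains a running mean in constant memory (the input values are assumed example data):

```python
def stream_mean(records):
    """Maintain a running mean over a stream: each record is processed
    once and discarded, so memory use stays constant."""
    count = 0
    total = 0.0
    for value in records:
        count += 1
        total += value
        yield total / count  # current mean after each arrival


# Simulate an unbounded source with an iterator; nothing is buffered.
means = list(stream_mean(iter([10, 20, 30, 40])))
```

Frameworks like Kafka, Flink, and Spark Streaming apply this same record-at-a-time (or micro-batch) model at cluster scale, with durability and fault tolerance added on top.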
