Project Report
Data mining is the science of extracting valuable patterns from large and complex datasets,
revealing significant insights that can help in decision-making processes. Often referred to as
Knowledge Discovery in Databases (KDD), data mining focuses on discovering useful and
comprehensible information from vast data repositories. Among the various data mining
techniques, Market Basket Analysis (MBA) is widely used to understand customer purchasing
behavior, enabling businesses to optimize product placement, cross-selling strategies, and
personalized recommendations.
Market Basket Analysis (MBA) is a powerful data mining technique used to uncover
relationships between items in transactional datasets. It helps businesses understand purchasing
behavior by identifying associations between products frequently bought together. By
analyzing these patterns, companies can optimize sales strategies, improve product placement,
and enhance customer experience. Market Basket Analysis is widely used in retail, e-
commerce, and inventory management to drive data-driven decision-making.
The foundation of Market Basket Analysis is based on the concept of association rule
mining, which extracts hidden patterns in large datasets. The Apriori algorithm is one of the
most widely used techniques for discovering frequent itemsets and generating association rules.
It follows a bottom-up approach by iteratively finding the most commonly occurring itemsets
and using them to form association rules. The generated rules help retailers understand product
affinities and improve marketing efforts, such as recommending complementary products or
bundling items for promotions.
In this project, we implement a Market Basket Analysis system using the Apriori algorithm.
The system processes transaction data, identifies frequent itemsets, generates meaningful
association rules, and provides insightful visualizations. The objective is to enable businesses
to make informed decisions based on customer purchasing patterns. The system is integrated
into a web application for ease of access, where users can upload transaction datasets, analyze
frequent itemsets, and receive recommendations for frequently bought-together products.
The Apriori algorithm is a widely used data mining technique for discovering frequent
itemsets and association rules in large transactional datasets. It is based on the principle that
if an itemset is frequent, then all of its subsets must also be frequent. This algorithm is
commonly used in Market Basket Analysis to identify relationships between products
frequently purchased together.
Association Rule Mining (ARM) is a data mining technique used to identify hidden
patterns, relationships, and correlations in large datasets. It helps uncover associations
between different items in a dataset, particularly in transactional and market basket analysis.
The primary goal of ARM is to find rules that predict the occurrence of an item based on
the presence of other items in a transaction.
With the rise of e-commerce and digital transactions, Market Basket Analysis has become
increasingly relevant. It allows businesses to personalize recommendations, enhance cross-
selling opportunities, and optimize inventory management. By leveraging data mining
techniques, this system aims to provide valuable insights into consumer behavior, helping
businesses improve their strategic planning and revenue generation.
One of the most popular and effective algorithms for Market Basket Analysis is the Apriori
algorithm, introduced by Agrawal et al. Apriori plays a crucial role in association rule mining,
which aims to uncover frequent patterns and correlations between item sets in transactional
datasets. Association rules help discover valuable insights by identifying item sets that often
appear together and deriving actionable rules. For example, if many customers who buy milk
also purchase bread, the association rule {Milk} → {Bread} enables businesses to make
informed decisions regarding promotions, inventory management, and product placement.
The Apriori algorithm operates in two main stages: first, it identifies frequent item sets that
meet a minimum support threshold; second, it generates association rules with high confidence
and lift to establish strong relationships between items. While the algorithm is highly effective,
its performance is often challenged by the increasing complexity and size of modern datasets.
Consequently, robust, efficient, and user-friendly tools are required to handle large-scale data
more effectively.
This work bridges the gap between theoretical studies on algorithms and practical
implementation, offering an optimized, scalable solution for Market Basket Analysis in real-
world retail scenarios.
The remainder of this report provides a comprehensive survey of the Apriori algorithm,
highlighting its applications across various domains, such as retail, healthcare, and web usage
mining. It also examines the datasets commonly used for Market Basket Analysis and
offers a comparative analysis of the algorithm’s application in different industries. By
reviewing existing research, this study aims to demonstrate the Apriori algorithm’s significance
in extracting actionable insights and to explore potential enhancements for handling large
datasets with improved efficiency and scalability.
1.1 PROJECT DESCRIPTION
Market Basket Analysis (MBA) is a widely used data mining technique in the retail and
e-commerce industries to identify patterns in customer purchasing behavior. The goal of
this project is to implement an intelligent system that analyzes large-scale transactional data
and uncovers associations between frequently bought-together products. Using the Apriori
algorithm, the system extracts frequent itemsets and generates association rules, enabling
businesses to optimize product placement, improve cross-selling strategies, and personalize
recommendations.
This project is designed as a Flask-based web application, where users can upload
transaction datasets and analyze product relationships through interactive visualizations.
The system first pre-processes the dataset by transforming it into a one-hot encoded format,
which is then used to mine frequent itemsets with a specified minimum support threshold.
The Apriori algorithm identifies item combinations that appear together frequently, and
association rules are extracted based on confidence and lift values. The results are
displayed through bar charts, scatter plots, and network graphs, helping users easily
interpret product relationships.
1.2 PROBLEM STATEMENT
1.3 EXISTING SYSTEM
In the current retail and e-commerce landscape, businesses primarily rely on manual
analysis and traditional sales reports to understand customer purchasing behavior. These
conventional methods involve reviewing historical sales data and making assumptions
based on predefined product categories. However, these approaches are often inefficient,
time-consuming, and lack real-time insights, making it difficult to identify hidden
relationships between frequently purchased items. As a result, businesses struggle to
optimize product placements, design effective marketing campaigns, and improve cross-
selling strategies.
Another major drawback of the existing system is its inability to process large
transactional datasets efficiently. Many traditional data processing tools are not
optimized for handling high-volume transaction records, leading to slow computations
and inaccurate results. Without a dedicated system for frequent itemset mining,
businesses miss the opportunity to identify valuable product associations that could be
used for better inventory management and promotional strategies. Furthermore, there is no
effective visualization mechanism in place to help businesses interpret these relationships
easily, making data analysis a complex and overwhelming process.
Due to these limitations, there is a need for an automated, scalable, and interactive
system that can efficiently analyze transaction data, discover frequent itemsets, and
generate meaningful association rules. The absence of such a system leaves businesses at a
disadvantage, as they are unable to make data-driven decisions regarding product
bundling, targeted promotions, and personalized recommendations. Addressing these
challenges requires a robust Market Basket Analysis system that applies advanced data
mining techniques to uncover valuable insights and enhance business intelligence.
1.4 PROPOSED SYSTEM
The system processes transaction data by first converting it into a one-hot encoded
format, which allows the Apriori algorithm to efficiently analyze item relationships. The
algorithm then identifies frequent itemsets based on a specified minimum support
threshold and derives association rules using confidence and lift values. These rules help
businesses understand which products are commonly purchased together, allowing
them to create targeted recommendations and promotions. The system is further enhanced
with interactive visualizations, including bar charts, scatter plots, and network graphs, to
simplify the interpretation of results.
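The one-hot encoding step described above can be illustrated with a minimal sketch in plain Python. The function name `one_hot_encode` and the three sample baskets are assumptions made for illustration and are not taken from the project's code; the actual system may rely on a library encoder instead.

```python
def one_hot_encode(transactions):
    """Return (sorted item names, list of 0/1 rows), one row per transaction."""
    # Collect every distinct item so each one gets its own column.
    items = sorted({item for basket in transactions for item in basket})
    # A cell is 1 when the transaction contains the column's item, else 0.
    rows = [[1 if item in basket else 0 for item in items]
            for basket in transactions]
    return items, rows


# Hypothetical sample baskets (illustration only).
transactions = [
    ["Milk", "Bread"],
    ["Milk", "Butter"],
    ["Bread", "Butter", "Milk"],
]

items, rows = one_hot_encode(transactions)
print(items)
for row in rows:
    print(row)
```

Each column of the resulting matrix corresponds to one product, so the support of a product is simply the mean of its column.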
1.5 LITERATURE REVIEW
S.H. Liao, P.H. Chu and P.Y. Hsiao reviewed papers published from 2000 to 2011 on
data mining techniques and their applications. These techniques have tended to become more
expertise-oriented and their application more problem-centered, leading to the development
of advanced algorithms and their application in different disciplines, such as computer
science, engineering, medicine, mathematics, earth and planetary sciences, biochemistry,
genetics and molecular biology, business, management and accounting, marketing
decisions, social science, decision sciences, multidisciplinary, environmental science,
energy, agricultural and biological sciences, nursing, material science, neurology, chemical
engineering, etc.
Market Basket Analysis is a practical rather than an academic subject, so most of
the studies on it have been carried out in actual retail stores. MBA is an old field in
data mining and is also one of the best examples of mining association rules (Gancheva,
Market basket analysis of beauty products, 2013). Rakesh Agrawal and Usama Fayyad,
pioneers in data mining, Association Rule Mining (ARM) and clustering, have developed
different algorithms to help users achieve their objectives.
Most data mining algorithms have existed for decades, but the last decade has seen
a sudden increase in the volume of data and a realization of the importance of data in every
field. S. Linoff and Michael J.A. Berry suggested in their book that, for companies in the
service sector, data and information confer competitive advantage. That is why hotel chains
record your preference for a nonsmoking room and car rental companies record your preferred
type of car. Similarly, credit card companies, airlines, retailers, etc. compete more on services
than on price. Many companies find that the information on their customers is valuable not
only to themselves but also to others. For example, information about the customers of a
credit card company is also useful to an airline, which would like to know who
is buying a lot of airline tickets. Similarly, Google knows what people are looking for on
the web and takes advantage of this knowledge by selling the information.
In 2009, E. Ngai, L. Xiu and D. Chau presented data mining in customer relationship
management as an emerging trend that helps in the identification, attraction, retention and
development of customers. Customer retention and development are important for maintaining
a long-term and pleasant relationship with customers, which is very useful in
maximizing the organization’s profit. Data mining provides many opportunities in the
market sector. For customer identification, the methods most often used are classification,
clustering and regression. For customer development, the methods usually used are
classification, regression, association discovery, pattern discovery, forecasting, etc.
Aiman Mushtaq in 2015 highlighted how data mining in marketing helps to increase return
on investment or net profit, improve customer relationship management and market analysis,
build better marketing strategies, reduce unnecessary expense, etc. With the volume of data
growing every day, various techniques are being used to mine data in the field of
marketing and to help fulfil organizational goals.
Market basket analysis is a data mining technique that originated in the field of marketing to
understand the purchase patterns of customers. It has advanced considerably since it was
first introduced in 1993 by Rakesh Agrawal. He proposed the first associative algorithm, the
Apriori algorithm, which has been used as part of techniques in association, classification and
associative classification algorithms. It was mostly used in stock management and placement
of items. Now, market basket analysis is being used to build predictive models and to obtain
interesting insights that are helpful in decision making. It has applications in several fields.
Cascio and Aguinis in 2008 noted, “there is a serious disconnect between the knowledge that
academics are producing and the knowledge the practitioners are consuming.” The use of
MBA can produce knowledge that is relevant and actionable, and market basket analysis can
help bridge the divide between science and practice.
S. Kamley, S. Jaloree and R.S. Thakur in 2014 developed an association rule mining
model for finding interesting patterns in a stock market dataset. This model is helpful in
predicting share prices, which helps stock brokers and investors invest in
the right direction by understanding market conditions. In June 2015, S.S. Umbarkar and S.S.
Nandgaonkar used association rule mining to predict the stock market from financial
news. The prediction depends on technical trading indicators and closing prices of the stock.
One of the most important uses of market basket analysis is product placement in the
supermarket. In 2012, A. A. Raorne used market basket analysis to understand the
behavior of the customer. The researchers carried out an experimental analysis employing
association rules using MBA, which improved the strategy for placing products on the shelf
and brought more profit to the seller. This research was effective in increasing profit, but it
did not account for changes in consumer behavior. To survive in a competitive market,
organizations must understand consumer behavior and how it changes over time.
In 2015, G. Kapadia conducted a study that analyses the pattern of consumer behavior for
products of a lifestyle store. It gives valuable insights into the formation of the basket. This
study helped in product assortment, managing stock for the items likely to be sold, running
promotions on those items, giving discounts to loyal customers and cross-selling.
The limitation of this study was that its scope was limited to one store in a specific region.
Solnet et al. in 2016 studied the potential of market basket analysis to grow revenue of
hotels. The researchers explored and derived the most attractive services and products which
could attract and satisfy hotel guests and encourage them to repeat their purchase. This
approach can increase revenue without increasing customer counts.
In February 2017, Roshan Gangurde conducted a study on building a predictive model using
market basket analysis. It stated that if a retail business uses market basket analysis to
come up with product bundles, it is using the past purchase behavior of customers to
predict future purchase behavior, which makes it a predictive model. He also concluded that,
with MBA, leading retailers can attract more customers, increase the value of the market
basket, drive more profitable advertising and promotion, and much more. The study also
suggested designing and developing intelligent prediction models to generate association rules
that can be adopted in a recommendation system to make its functionality more operational.
Later, by the end of 2017, they designed an optimized technique for MBA with the goal of
predicting and analyzing consumers’ buying behavior. In this study, they introduced novel
algorithms based on data cleaning, which is one of the most important challenges in every field
of data analysis. To overcome this challenge, they combined two data mining algorithms, i.e.
the Apriori algorithm and neural networks. They also highlighted that one of the biggest
challenges is that customer demand changes continuously with season and time. The output
of MBA therefore depends on time and season, so the analysis needs to be performed repeatedly.
Analyzing trends is a very useful method for understanding any business’s performance.
Debaditya and Nimalya in 2013 attempted to develop a method using association rule mining
to find the most preferred and popular genres, which can be taken to represent the movie
business’s trend. The study can predict the possible movie trend based on genres. As
viewers’ taste in movies can change from time to time, it was important to gain insight into
the changing trends. This study is very useful for production houses in driving the movie
business towards profitability.
Kaur and Kang (2016) used MBA to identify the changing trends of market data using
association rule mining. This study proposed a different approach, periodic mining, which
enhances the power of data mining techniques. The study was helpful in finding
interesting patterns in large databases and predicting future association rules, and it gives a
methodology for finding outliers. It shows an advance by not just mining static
data but also providing a new way to consider changes happening in data.
S. Tan and J. Lau (2014) tried a different approach by summarizing a real-world sales
transaction data set into time series format. Rather than applying association rule mining
(which is often used in market basket analysis), they used time series clustering to discover
commonly purchased items that are useful for pricing or formulating cross-selling strategies.
This approach uses a data set that is substantially smaller than the data used for association
analysis, which shows that certain market basket problems can be analyzed more easily using
time series clustering instead of association analysis.
G.N.V.G. Sirisha, M. Shashi & G.V. Padma Raju in 2013 presented a paper overviewing
distinct types of periodic patterns and their applications along with a discussion of the
algorithms that are used to mine these patterns. Periodic pattern mining is very useful in
constructing classification/prediction and recommender systems.
Data keeps changing with time, and the interestingness of data differs from person to person,
from time to time and from task to task. So, attempts are being made to mine the necessary
information from large amounts of transactional data on a seasonal basis. Frequent item sets
based on calendric patterns are mined to generate association rules.
A similar study was first made by Ramaswamy and Silberschatz in 1998 to discover
association rules that repeat in every cycle of a fixed time span. This information about
variations in different time periods allowed marketers to better identify the trends in
association rules and helped in making better predictions.
CHAPTER 2: SYSTEM DESIGN
System design is a critical phase in developing the Market Basket Analysis system, as it
defines the overall architecture, components, and workflow. The system follows a structured
approach that ensures efficiency, scalability, and user-friendliness. This chapter covers the
architecture diagram, functional diagram, database design, form design, and module
explanation, detailing how the system processes transactional data, applies the Apriori
algorithm, and presents the results through an interactive web-based interface.
The system is built using a three-layer architecture, consisting of the input layer,
processing layer, and output layer. The input layer allows users to upload transaction
datasets in CSV format. The processing layer performs data preprocessing, frequent itemset
mining using the Apriori algorithm, and association rule generation. The output layer
presents the results through various visualization techniques, including bar charts, scatter
plots, and network graphs, which help users easily interpret product relationships.
The functional diagram illustrates the flow of data within the system, starting from dataset
input, data pre-processing, and Apriori algorithm execution to the final presentation of
association rules. Each component plays a vital role in ensuring the accuracy and efficiency of
market basket analysis. The database design ensures proper structuring and storage of
transaction data, supporting efficient data retrieval and processing. The form design enhances
user experience by providing an intuitive web interface for uploading datasets, viewing analysis
results, and predicting frequently bought-together items.
The system is divided into multiple modules, including data pre-processing, frequent
itemset mining, association rule generation, visualization, and user input-based
predictions. Each module is designed to handle specific functionalities, ensuring modularity
and ease of maintenance. By integrating these components, the system provides a powerful
and automated Market Basket Analysis solution, allowing businesses to make data-driven
decisions for improved product recommendations, targeted promotions, and optimized
inventory management.
CHAPTER 3: SYSTEM IMPLEMENTATION
System implementation involves transforming the theoretical design into a fully functional
system. The Market Basket Analysis system using the Apriori algorithm consists of multiple
modules, each responsible for different aspects of data processing, rule mining, visualization,
and user interaction. A structured approach is followed to ensure seamless integration between
these components, enabling efficient transaction analysis and recommendation generation.
with their expected frequency under independent conditions. Only rules that meet the
confidence and lift thresholds are retained to ensure meaningful recommendations. These rules
help businesses identify product affinities and improve sales strategies.
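As a rough illustration of the filtering described above, the sketch below computes support, confidence and lift for one rule and checks it against thresholds. The helper names `rule_metrics` and `keep_rule`, the five sample baskets and the threshold values are all assumptions made for this example, not values used by the project.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= t)         # baskets with LHS
    count_c = sum(1 for t in transactions if c <= t)         # baskets with RHS
    count_ac = sum(1 for t in transactions if (a | c) <= t)  # baskets with both
    support = count_ac / n
    confidence = count_ac / count_a if count_a else 0.0
    # Lift compares the observed confidence with the RHS frequency
    # expected if LHS and RHS were independent.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


def keep_rule(metrics, min_confidence, min_lift):
    """Retain a rule only if it meets both thresholds."""
    _, confidence, lift = metrics
    return confidence >= min_confidence and lift >= min_lift


# Hypothetical baskets (illustration only).
baskets = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk"},
    {"Bread"},
    {"Butter"},
]

metrics = rule_metrics(baskets, {"Milk"}, {"Bread"})
print(metrics, keep_rule(metrics, min_confidence=0.6, min_lift=1.0))
```

Here the rule {Milk} → {Bread} has a lift slightly above 1, so under the assumed thresholds it would be retained.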
Step 7: System Deployment
After successful testing, the system is deployed for real-world use. The deployment process
includes hosting the Flask-based web application, configuring the database to support real-time
queries, and optimizing server performance through caching mechanisms. Once deployed, the
system enables businesses to explore purchasing patterns, generate strategic insights, and
improve customer experience by offering personalized product recommendations.
This part explains the algorithm that runs behind the Python libraries for
Market Basket Analysis. It helps companies understand their clients better and
analyze their data more closely and attentively. Rakesh Agrawal proposed the Apriori
algorithm, which was the first associative algorithm, and later developments in
association, classification and associative classification algorithms have used it as part of the
technique.
1. Frequent Itemset Generation: Find all frequent item-sets with support >= a
predetermined minimum support count. Frequent mining usually finds the interesting
associations and correlations between item sets in transactional and relational databases.
In short, frequent mining shows which items appear together in a transaction
or relation. The discovery of frequent item sets is accomplished in several iterations.
Counting new candidate item-sets generated from existing item sets requires scanning the
entire training data. In short, it involves only two important steps:
a. Joining
b. Pruning
2. Rule Generation: List all association rules from frequent item-sets. Calculate support
and confidence for all the rules. Prune rules which fail the minimum support and minimum
confidence thresholds.
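The frequent itemset generation phase can be sketched as a level-wise loop that alternates joining, pruning and counting. This is a minimal plain-Python sketch under stated assumptions, not the project's implementation: the function name `apriori`, the use of a support count rather than a fraction, and the nine toy transactions over items I1 to I5 are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join: unite pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count the surviving candidates with one pass over the data.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data (an assumption, not the report's dataset).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

frequent = apriori(transactions, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The loop stops as soon as a level yields no frequent itemsets, which is what keeps the number of database scans bounded by the size of the largest frequent itemset plus one.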
Frequent itemset generation scans the whole database and finds the frequent itemsets using a
threshold on support. Since it scans the whole database, it is the most computationally
expensive step. In the real world, retail transaction data can run to gigabytes or
terabytes, for which an optimized algorithm is needed to exclude item-sets that will not
help in later steps.
The Apriori algorithm is used for this purpose.
The Apriori principle states: “Any subset of a frequent itemset must also be frequent. In other
words, no superset of an infrequent itemset must be generated or tested.”
The image below is a graphical representation of the Apriori principle. It
consists of k-item-set nodes and the subset relations between them. Notice in the
figure that at the bottom is the set of all the items in the transaction data, and moving up
creates subsets until the null set is reached.
This shows that it would be difficult to generate frequent item-sets by finding the support of
each combination. Therefore, in the figure below we can see that the Apriori algorithm helps to
reduce the number of sets to be generated.
figure 4: if an item set is infrequent, we do not consider its super sets
If an item-set {a, b} is infrequent, then we do not need to consider any of its supersets.
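To give a feel for how much work this saves, the sketch below enumerates every non-empty itemset over five items and counts how many are discarded once {a, b} is known to be infrequent. The item names a to e and the helper `all_itemsets` are assumptions chosen for illustration.

```python
from itertools import combinations


def all_itemsets(items):
    """Every non-empty subset of `items`, as frozensets."""
    return [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)]


items = ["a", "b", "c", "d", "e"]
lattice = all_itemsets(items)

# Suppose {a, b} turned out to be infrequent.
infrequent = frozenset({"a", "b"})

# Every superset of {a, b} (including itself) can be skipped without counting.
pruned = [s for s in lattice if infrequent <= s]
remaining = [s for s in lattice if not infrequent <= s]

print(len(lattice), len(pruned), len(remaining))  # 31 8 23
```

With n items the full lattice has 2^n - 1 non-empty itemsets, so a single infrequent pair already removes about a quarter of the search space in this small case.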
We can also look at it in the form of a transactional data-set. The following example shows
why the Apriori algorithm is effective and how it generates stronger association rules
step by step.
Step: 1
• Create a table containing the support count of each item present in the dataset.
• Compare each support count with the minimum support count (in this case the minimum
support count is 2) and remove the items whose support count is lower. This gives us a
new set of items.
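Step 1 can be sketched with a simple counter. The nine transactions below are a hypothetical toy dataset over items I1 to I5, assumed here because the report's own table is not reproduced in the text; only the minimum support count of 2 comes from the example above.

```python
from collections import Counter

# Hypothetical toy data (an assumption, not the report's own table).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_COUNT = 2  # minimum support count used in the example

# Support count of each individual item.
item_counts = Counter(item for t in transactions for item in t)

# Keep only the items that reach the minimum support count.
frequent_items = {item: n for item, n in item_counts.items() if n >= MIN_COUNT}

for item in sorted(frequent_items):
    print(item, frequent_items[item])
```

In this assumed dataset every item reaches the threshold, so nothing is removed at this level; with a higher threshold such as 3, I4 and I5 would be dropped.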
Step: 2
• Check whether the subsets of each itemset are frequent, and if not, remove that itemset.
For example, in the case below the subsets of {I1, I2} are {I1} and {I2}, which
are both frequent. We must check each itemset in the same way.
• Now find the support count of these item-sets by searching in the dataset.
• Since we have already specified a minimum support count of 2, we
compare each support count with it and remove the item-sets whose support count is
lower. This gives us another set of item-sets, as we can see below.
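The second pass can be sketched as follows: pair up the frequent single items, count each pair, and keep those that reach the threshold. The transactions and the list of frequent items are the same hypothetical toy data assumed for illustration, not the report's own table.

```python
from itertools import combinations

# Hypothetical toy data (an assumption, not the report's own table).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_COUNT = 2
frequent_items = ["I1", "I2", "I3", "I4", "I5"]  # assumed output of Step 1

# Join: every pair of frequent single items is a candidate 2-itemset.
# Both 1-item subsets are frequent by construction, so nothing is pruned yet.
candidates = [frozenset(pair) for pair in combinations(frequent_items, 2)]

# Count each candidate with a pass over the transactions.
pair_counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}

# Keep only the pairs that reach the minimum support count.
frequent_pairs = {c: n for c, n in pair_counts.items() if n >= MIN_COUNT}

for pair, n in sorted(frequent_pairs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), n)
```

Under these assumptions, six of the ten candidate pairs survive, and pairs such as {I3, I4} are dropped.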
Step: 3
• After getting another set of item-sets we follow the same step (i.e. the join step). We cross
join the item-sets with each other. So, the item-sets generated after this step will be:
{I1, I2, I3}
{I1, I2, I4}
{I1, I2, I5}
{I1, I3, I5}
{I2, I3, I4}
{I2, I4, I5}
{I2, I3, I5}
• Check whether all subsets of these item sets are frequent, and if not, remove that item
set. For example, in this case the subsets of {I1, I2, I3} are {I1, I2}, {I1, I3} and {I2, I3},
which are frequent. But for {I2, I3, I4} one of the subsets is {I3, I4}, which is not
frequent. So, we remove it. We do the same for every itemset.
• Once we have removed all the non-frequent item sets, find the support count of the
remaining item sets by searching in the dataset.
• Compare each support count with the minimum support count and remove the item sets whose
support count is lower. This gives us another set of item sets, as we can see below.
figure 7: pruning and joining again until there are no more frequent items left.
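The join and prune checks of Step 3 can be reproduced in a few lines. The list of frequent pairs below is an assumption consistent with the subsets named in the text (for example, {I3, I4} is absent because it is stated to be infrequent); it is not copied from the report's missing table.

```python
from itertools import combinations

# Assumed frequent 2-itemsets from the previous step (illustration only).
frequent_pairs = [
    frozenset(p) for p in [
        ("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
        ("I2", "I3"), ("I2", "I4"), ("I2", "I5"),
    ]
]
known = set(frequent_pairs)

# Join: cross join the pairs and keep unions that contain exactly 3 items.
candidates = {a | b for a, b in combinations(frequent_pairs, 2)
              if len(a | b) == 3}

# Prune: a candidate survives only if all of its 2-item subsets are frequent.
survivors = {c for c in candidates
             if all(frozenset(s) in known for s in combinations(c, 2))}

print(sorted(sorted(c) for c in candidates))
print(sorted(sorted(c) for c in survivors))
```

The join yields the seven candidates listed above, and pruning leaves only {I1, I2, I3} and {I1, I2, I5} to be counted against the dataset.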
Step: 4
• We follow the same procedure again. First, we do the join step and cross join each
itemset with one another. In our example the first two elements of the item sets should
match.
• Next, check whether all subsets of these item sets are frequent. In our example the itemset
formed after the join step is {I1, I2, I3, I5}. One of the subsets of this itemset is {I1, I3,
I5}, which is not frequent. Therefore, there is no itemset left.
• We stop here because no more frequent itemsets are found.
The next step is to list all frequent item-sets and determine how strong the association
rules are. For that we calculate the confidence of each rule. To calculate confidence, we use
the following formula:
figure 8: support, confidence and lift calculation
By taking an example of any frequent itemset (we took {I1, I2, I3}) we will show how rule
generation is done:
figure 9: calculation of confidence
So, in this case, if the minimum confidence is 50% then the first 3 rules can be considered
strong association rules. For example, the rule {I1, I2} => {I3}, with confidence equal to 50%,
tells us that 50% of the people who bought I1 and I2 also bought I3.
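Rule generation for one frequent itemset can be sketched as follows. The helper names `count` and `rules_from` and the nine toy transactions are assumptions made for illustration; the data was chosen so that {I1, I2} => {I3} has 50% confidence, as in the example above.

```python
from itertools import combinations

# Hypothetical toy data (an assumption, not the report's own table).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset, transactions):
    """List (lhs, rhs, confidence) for every split of a frequent itemset."""
    itemset = frozenset(itemset)
    total = count(itemset, transactions)  # > 0 because the itemset is frequent
    rules = []
    # Larger antecedents first, so the 2-item left-hand sides are listed first.
    for k in range(len(itemset) - 1, 0, -1):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            confidence = total / count(lhs, transactions)
            rules.append((lhs, itemset - lhs, confidence))
    return rules


rules = rules_from({"I1", "I2", "I3"}, transactions)
strong = [r for r in rules if r[2] >= 0.5]  # minimum confidence of 50%

for lhs, rhs, conf in rules:
    print(f"{sorted(lhs)} => {sorted(rhs)}: {conf:.0%}")
```

Six rules are produced from the three-item set, and under these assumptions only the three with a two-item antecedent reach the 50% threshold.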
Dataset Description
The dataset used in this project consists of transactional records representing customer
purchases in a retail environment. Each transaction contains a list of items purchased together,
reflecting real-world shopping behavior. The dataset is structured in a single-column format
where each row represents a unique transaction, containing multiple items separated by
commas.
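A minimal sketch of reading such a file into transactions is shown below. It assumes there is no header row and that a row may be either a single quoted cell or several unquoted cells; the function name `read_transactions` and the sample text are hypothetical and not taken from the project.

```python
import csv
import io


def read_transactions(text):
    """Parse CSV text into a list of item sets, one per non-empty row."""
    transactions = []
    for row in csv.reader(io.StringIO(text)):
        # Split every cell on commas so quoted single-column rows and
        # unquoted multi-cell rows are treated the same way.
        items = {part.strip() for cell in row for part in cell.split(",")}
        items.discard("")  # ignore empty fragments such as trailing commas
        if items:
            transactions.append(items)
    return transactions


# Hypothetical sample content (illustration only).
sample = '"Milk, Bread, Butter"\nApples,Bananas\n\n"Bread, Milk"\n'

for basket in read_transactions(sample):
    print(sorted(basket))
```

Storing each transaction as a set removes duplicate items within a basket, which matches the presence/absence view that association rule mining uses.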
Dataset Characteristics
• Total Transactions: The dataset includes multiple transactions capturing diverse
shopping patterns.
• Items per Transaction: Each transaction contains a variable number of items, ranging
from essential groceries to household and frozen goods.
• Product Categories:
• Fresh Produce (e.g., Bananas, Apples, Tomatoes, Potatoes)
• Dairy Products (e.g., Milk, Cheese, Yogurt, Butter)
• Packaged and Processed Foods (e.g., Bread, Cereal, Pasta)
• Frozen Foods (e.g., Frozen Vegetables, Ice Cream, Frozen Pizzas)
• Beverages and Snacks (e.g., Soda, Chips, Cookies)
• Household Essentials (e.g., Toilet Paper, Paper Towels, Laundry Detergent)
• Protein Sources (e.g., Chicken, Ground Beef, Fish)
This dataset is crucial for uncovering patterns in purchasing behavior and driving data-driven
decision-making in retail.
METHODOLOGY
Methodology is the set of guidelines, or the path, for validating knowledge of the
subject matter. Different areas of science have developed very different bodies of methodology
on the basis of which to conduct their research (Little, 2012).
PROJECT PURPOSE
The ultimate purpose of every business is to find better ways to improve profit in the long
run. For this research, however, the aim is to find actual cases of dependencies among
the products chosen by customers.
Several different products may be bought in a single visit to a mega store, such as
groceries, a pillowcase, furniture or an electric toaster. However, we believe that these
choices are not coincidences. These decisions across several categories form the
customer’s shopping basket, which holds the collection of categories that the customer
purchased on a specific shopping trip (Manchanda, Ansari, & Gupta, 1999).
Fig.2. Methodology of Market basket analysis
By filtering association rules with high support, confidence, and lift values while discarding
redundant or non-actionable patterns, the junk-based methodology optimizes resource usage,
improves scalability, and enhances the overall accuracy and efficiency of frequent itemset
mining in large datasets.
STRATEGY FOR MARKET BASKET ANALYSIS
In this section we describe the entire research process. Before getting into the steps of the
analysis, we first clarify some of the concepts that we will come across.
KEY TERMS AND CONCEPTS
Association rules:
Association analysis, also known as affinity analysis or association rule mining, is a method
commonly used for market basket analysis. ARM is currently the most suitable method for
analysis of big market basket data, but when there is a large volume of sales transactions with
a high number of products, the data matrix used for association rule mining usually ends
up large and sparse, resulting in longer processing time. Association rules provide
information in the form of “IF-THEN” statements. Three indexes
are commonly used to understand the presence, nature and strength of an association rule:
lift, support and confidence (Berry & Linoff, 2004; Larose, 2005; Zhang & Zhang, 2002).
Lift is obtained first because it indicates whether an association exists and whether it is
positive or negative. If the value of lift suggests that an association exists, we then obtain
the value of support.
Support of an item or itemset is the fraction of transactions in our dataset that contain that
item or itemset. It is an important measure because a rule that has low support may occur
simply by chance. A low support rule may also be uninteresting from a business perspective
because it may not be profitable to promote items that are seldom bought together. For these
reasons, support is often used to eliminate uninteresting rules.
Confidence is defined as the conditional probability that a transaction containing the LHS
will also contain the RHS. Association analysis results should be interpreted with caution.
The inference made by an association rule does not necessarily imply causality. Instead, it
suggests a strong co-occurrence relationship between the items in the antecedent and
consequent of the rule.
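$$\mathrm{Confidence}(\mathrm{LHS} \Rightarrow \mathrm{RHS}) = P(\mathrm{RHS} \mid \mathrm{LHS}) = \frac{\mathrm{Support}(\mathrm{LHS} \cup \mathrm{RHS})}{\mathrm{Support}(\mathrm{LHS})}$$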
Confidence and support measure the strength of an association rule. Since the transactional
database is quite large, there is a higher risk of generating too many unimportant rules that
are of no interest. To avoid this, we commonly define thresholds for support and confidence
prior to the analysis, so that only useful and interesting rules are generated.
If lift is greater than 1, it suggests that the presence of the items on the LHS has increased
the probability that the items on the RHS will occur in the same transaction. If the lift is
below 1, it suggests that the presence of the items on the LHS makes it less probable that the
items on the RHS will be part of the transaction. If the lift is 1, it suggests that the items
on the LHS and RHS are independent: knowing that the items on the LHS are present makes
no difference to the probability that the items on the RHS will occur.
While performing market basket analysis, we look for rules with a lift greater than one.
Rules with high support are also preferable because they apply to a large number of
transactions, and rules with higher confidence are those where the probability of an item
appearing on the RHS is high, given the presence of the items on the LHS.
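To make the three measures concrete, the following minimal sketch computes support, confidence, and lift for a single rule directly from their definitions. The function names `support` and `rule_metrics` and the five toy transactions are assumptions made for illustration; the sketch also assumes that the LHS and RHS each occur in at least one transaction, so that no division by zero arises.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    supp_both = support(set(lhs) | set(rhs), transactions)
    supp_lhs = support(lhs, transactions)
    supp_rhs = support(rhs, transactions)
    confidence = supp_both / supp_lhs  # P(RHS | LHS)
    lift = confidence / supp_rhs       # confidence relative to the RHS base rate
    return supp_both, confidence, lift


# Hypothetical toy transactions, for illustration only
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
supp, conf, lift = rule_metrics({"bread"}, {"milk"}, transactions)
print(f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

In this toy data, bread and milk appear together in 3 of 5 transactions (support 0.6), bread appears in 4 of 5 (so confidence is 0.6 / 0.8 = 0.75), and milk appears in 4 of 5, which gives a lift of about 0.94. A lift slightly below 1 means that, here, buying bread makes milk marginally less likely than its base rate.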
CHAPTER 4: SYSTEM TESTING
System testing is a crucial phase in the development of the Market Basket Analysis system,
ensuring that all components function correctly and efficiently. The testing process is designed
to validate data pre-processing, algorithm performance, association rule generation,
visualization accuracy, and the overall functionality of the web application. Various testing
methodologies, including unit testing, integration testing, performance testing, and user
acceptance testing, are employed to identify and rectify potential issues.
Unit testing is conducted to verify the correctness of individual modules, such as data
loading, pre-processing, and rule generation. Each function is tested with different datasets to
ensure expected outputs. Errors in data handling, incorrect frequent itemset mining, and rule
generation inconsistencies are identified and resolved in this phase. Python’s unittest
framework is utilized to automate unit testing, ensuring consistent validation of core
functionalities.
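A unit test of this kind might look like the sketch below, which uses Python's unittest framework to check a small pre-processing helper. The helper `split_transaction` is hypothetical and stands in for the project's own pre-processing functions, whose exact signatures are not shown in this report.

```python
import unittest


def split_transaction(raw):
    """Hypothetical helper: turn 'milk, bread,,eggs ' into a clean item list."""
    return [item.strip() for item in raw.split(",") if item.strip()]


class SplitTransactionTest(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(split_transaction(" milk , bread "), ["milk", "bread"])

    def test_drops_empty_items(self):
        self.assertEqual(split_transaction("milk,,eggs,"), ["milk", "eggs"])

    def test_empty_string_gives_empty_list(self):
        self.assertEqual(split_transaction(""), [])


# Run the suite programmatically so the result object can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SplitTransactionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method covers one expected behavior, so a failure points directly at the case that broke.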
Integration testing focuses on the interaction between different modules within the system.
This includes testing the seamless flow of data from pre-processing to frequent itemset mining
and then to association rule generation and visualization. The database's ability to store and
retrieve association rules efficiently is also validated. Integration tests ensure that the Flask-
based web application correctly communicates with the backend processing logic and displays
results without errors.
Performance testing evaluates the system’s efficiency in handling large datasets. The
execution time of the Apriori algorithm is measured under different dataset sizes to ensure
scalability. The response time of the web application is also assessed to optimize user
experience. Memory usage and computational efficiency are analyzed to identify bottlenecks,
and optimizations are applied where necessary.
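Execution time under different dataset sizes can be measured with a small timing harness such as the one sketched below. To keep the example self-contained, it times a simple pair-counting routine on randomly generated baskets rather than the project's Apriori implementation; `count_frequent_pairs`, `make_transactions`, `time_workload`, and all sizes and thresholds are assumptions chosen for illustration.

```python
import random
import time
from collections import Counter
from itertools import combinations


def count_frequent_pairs(transactions, min_support):
    """Stand-in workload: count item pairs and keep those meeting min_support."""
    pair_counts = Counter()
    for transaction in transactions:
        pair_counts.update(combinations(sorted(set(transaction)), 2))
    threshold = min_support * len(transactions)
    return {pair: n for pair, n in pair_counts.items() if n >= threshold}


def make_transactions(n, n_items=30, basket_size=5, seed=0):
    """Generate n synthetic baskets (random data, for timing only)."""
    rng = random.Random(seed)
    items = [f"item_{i}" for i in range(n_items)]
    return [rng.sample(items, basket_size) for _ in range(n)]


def time_workload(sizes, min_support=0.01):
    """Return (size, seconds, n_frequent_pairs) for each dataset size."""
    results = []
    for size in sizes:
        data = make_transactions(size)
        start = time.perf_counter()
        frequent = count_frequent_pairs(data, min_support)
        elapsed = time.perf_counter() - start
        results.append((size, elapsed, len(frequent)))
    return results


timings = time_workload([1_000, 5_000, 10_000])
for size, seconds, n_pairs in timings:
    print(f"{size:>6} transactions: {seconds:.4f} s, {n_pairs} frequent pairs")
```

The same harness could wrap the real mining function in place of `count_frequent_pairs`; the measured times depend on the machine and are therefore not quoted here.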
User acceptance testing (UAT) is conducted by allowing end-users, such as retail analysts
and business professionals, to interact with the system. Their feedback is gathered to improve
usability, interface design, and overall system functionality. Real-world transaction data is used
to validate the accuracy and relevance of generated association rules. Any inconsistencies
reported by users are addressed before the final deployment.
Security testing is also performed to ensure that the system is protected against potential
threats. Input validation is implemented to prevent SQL injection and cross-site scripting (XSS)
attacks. Secure authentication mechanisms are applied to restrict unauthorized access to
sensitive data. These security measures safeguard the integrity and confidentiality of
transaction data.
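One possible form of such input validation is an allow-list check on the product names entered in the web form, sketched below. The permitted character set, the length limit, the item limit, and the function name `parse_products` are assumptions for illustration, not the project's actual rules. A check like this complements, rather than replaces, parameterized database queries and template auto-escaping.

```python
import re
from html import escape

# Assumed policy: letters, digits, spaces, and hyphens, 1 to 50 characters
ALLOWED_NAME = re.compile(r"[A-Za-z0-9 \-]{1,50}")


def parse_products(raw, max_items=20):
    """Split a comma-separated product string and reject unexpected input."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    if not names:
        raise ValueError("No products supplied")
    if len(names) > max_items:
        raise ValueError("Too many products supplied")
    for name in names:
        if not ALLOWED_NAME.fullmatch(name):
            # Escape the offending value before echoing it back in a message
            raise ValueError(f"Invalid product name: {escape(name)}")
    return names


products = parse_products("Milk, Bread")
```

Rejecting anything outside the allow-list means that quotes, angle brackets, and similar characters never reach the query or the rendered page.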
The system undergoes multiple rounds of debugging and testing to achieve stability before
deployment. Test cases and results are documented for future reference, ensuring
maintainability and future enhancements. By following a structured testing approach, the
Market Basket Analysis system delivers accurate, reliable, and scalable results, providing
valuable insights into consumer purchasing patterns.
To evaluate the system's performance, different datasets are used to test the execution time
and accuracy of the Apriori algorithm. The following table summarizes the test results:
Dataset Size | Transactions | Frequent Itemsets Found | Association Rules Generated | Execution Time (s)
Summary of Apriori Algorithm-Based Research Studies, Their Datasets, and Findings
Study: This Research Study – Market Basket Analysis Using Apriori Algorithm
Dataset: Large Retail Transaction Dataset (CSV Format)
Domain: Retail & Consumer Behavior Analysis
Key findings:
- Developed a Market Basket Analysis system using Apriori.
- Implemented chunk-based processing for handling large datasets efficiently.
- Visualized top-selling products using bar charts and association rules using scatter plots.
- Measured support, confidence, and lift for rule evaluation.
- Compared Apriori with FP-Growth, highlighting performance trade-offs.

Study: Dubey et al. (2021) – Comparative Analysis of Market Basket Analysis through Data Mining Techniques
Dataset: Groceries Dataset (UCI Machine Learning Repository)
Domain: Retail & E-Commerce
Key findings:
- Identified milk and bread as the most frequently bought-together items.
- Apriori generated strong association rules, but its scalability was limited.
- FP-Growth outperformed Apriori in execution speed but lacked interpretability.

Study: Hossain et al. (2019) – Market Basket Analysis Using Apriori and FP-Growth Algorithm
Dataset: Instacart Online Grocery Shopping Dataset (Kaggle)
Domain: Retail & Consumer Behavior
Key findings:
- Found fruits, vegetables, and dairy products as frequently purchased together.
- Confidence and lift metrics improved rule filtering.
- Apriori provided clearer rules, while FP-Growth was faster.

Study: Maske et al. (2018) – Survey on Frequent Item-Set Mining Approaches in Market Basket Analysis
Dataset: Retail Market Basket Dataset (IBM Quest Data)
Domain: Retail & Supply Chain Management
Key findings:
- Found that low support thresholds generated excessive rules, making interpretation difficult.
- Apriori is more suitable for small datasets, while Eclat performs better for dense data.
- Recommended hybrid models combining Apriori with FP-Growth.

Study: Nengsih (2020) – A Comparative Study on Market Basket Analysis and Apriori Association Technique
Dataset: Supermarket Transaction Dataset
Domain: Retail & Sales Optimization
Key findings:
- Discovered frequent itemsets such as rice, sugar, and cooking oil.
- Manual rule generation was less effective than Apriori-based mining.
- Suggested combining Apriori with advanced filtering techniques.

Study: J. Han & M. Kamber (2006) – Data Mining: Concepts and Techniques
Dataset: Breast Cancer Wisconsin Dataset (UCI Machine Learning Repository)
Domain: Healthcare & Disease Prediction
Key findings:
- Used Apriori to identify risk factors for breast cancer.
- Found strong associations between tumor size, patient age, and malignancy probability.
- Demonstrated the value of association rule mining in medical diagnostics.

Study: M. J. Kargar & F. Hajilou (2019) – An Analysis on Characteristics of Negative Association Rules
Dataset: National Health and Nutrition Examination Survey (NHANES)
Domain: Healthcare & Medical Data Analysis
Key findings:
- Applied negative association rules to find risk factors linked to chronic diseases.
- Showed that certain medications were negatively associated with specific diseases.
- Suggested applying Apriori in drug interaction studies.

Study: P. Parthiban & S. Selvakumar (2016) – Big Data Architecture for Analyzing Web Server Logs
Dataset: NASA HTTP Web Server Log Dataset
Domain: Web Usage Mining & User Behavior Analysis
Key findings:
- Used Apriori for web traffic pattern analysis.
- Identified frequent browsing sequences to enhance website navigation.
- Suggested using Apriori for real-time web analytics.

Study: Philip K. Chan et al. (1999) – Distributed Data Mining in Credit Card Fraud Detection
Dataset: Credit Card Fraud Detection Dataset (Kaggle)
Domain: Fraud Detection & Security
Key findings:
- Applied Apriori for detecting fraudulent transactions.
- Found strong correlations between unusual spending behaviors and fraud cases.
- Suggested combining Apriori with anomaly detection models for better fraud detection.
CHAPTER 5: FUTURE ENHANCEMENT
Future enhancements for the Market Basket Analysis system focus on improving the
efficiency, scalability, and accuracy of the recommendation engine. One of the primary
enhancements includes integrating real-time transaction processing, which allows businesses
to analyze customer purchasing behavior instantly. By incorporating a streaming data pipeline,
the system can continuously update association rules and provide immediate product
recommendations, improving the overall shopping experience for customers.
Another key enhancement is the adoption of machine learning models to complement the
Apriori algorithm. Traditional association rule mining techniques rely on predefined thresholds
for support and confidence, which may not always yield optimal results. By integrating deep
learning models such as recurrent neural networks (RNNs) or transformer-based models, the
system can dynamically adjust recommendation strategies based on evolving customer
preferences and seasonal trends. This will enhance the accuracy of predictions and provide
more personalized recommendations.
Finally, improving database performance and scalability will ensure the system remains
efficient even with large datasets. Implementing distributed databases, optimizing query
performance, and leveraging cloud-based storage solutions will enhance the system's capability
to handle extensive transaction data. These enhancements will make the Market Basket
Analysis system more robust, adaptable, and useful for businesses of all sizes, driving data-
driven decision-making and improving customer engagement.
CHAPTER 6: RESULTS AND DISCUSSION
The system uses the Apriori algorithm to identify frequent itemsets and generate association
rules. The application allows users to upload transaction datasets, process them in real time,
and visualize relationships between products. Key functionalities include:
Performance Evaluation
To evaluate the system's performance, several datasets were tested with different
transaction sizes. The following observations were made:
Insights on Dataset Impact
Transaction Length:
Short transactions (2-5 items) generate fewer association rules, because their item
combinations often fall below the minimum support threshold. Longer transactions (8+ items)
produce stronger and more meaningful rules.
Item Frequency:
Commonly purchased products (e.g., bread, milk, eggs) form high-confidence rules. Rare
items may not appear frequently enough to generate rules unless support is lowered.
Higher support values restrict rule discovery to only the most frequent products. Lower
support/confidence thresholds allow more rules but may include less significant
relationships.
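The effect of the support threshold can be illustrated with a small sweep, sketched below, that counts how many single items and item pairs remain frequent as the minimum support rises. The function name `frequent_itemsets_up_to_pairs`, the six toy transactions, and the restriction to itemsets of size one and two are simplifying assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets_up_to_pairs(transactions, min_support):
    """Return {itemset: support} for single items and pairs meeting min_support."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        counts.update((item,) for item in items)
        counts.update(combinations(items, 2))
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}


# Hypothetical toy transactions; "saffron" plays the role of a rare item
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["bread", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk", "saffron"],
    ["bread"],
]
sweep = {s: len(frequent_itemsets_up_to_pairs(transactions, s)) for s in (0.1, 0.3, 0.5)}
print(sweep)  # {0.1: 9, 0.3: 6, 0.5: 4}
```

At a minimum support of 0.1 every item and pair survives, including those involving the rare item. Raising the threshold to 0.3 removes everything containing saffron, and at 0.5 only bread, milk, eggs, and the pair (bread, milk) remain.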
Data Analysis and Visualization
Figure 1 presents a bar chart visualization of the top 10 products with the highest sales derived
from the dataset analyzed in this study. The visualization reveals that Chili Powder holds the
leading position with the highest sales count, followed by Garlic and Jeera, underscoring their
widespread demand among customers. Other products such as Olive Oil, Spinach, and
Tomatoes also rank prominently, highlighting their essential role in consumer purchasing
patterns. The results emphasize the significance of commonly used cooking ingredients, which
form a substantial portion of customer transactions.
The scatter plot provided in this study visualizes the relationship between support and
confidence for all association rules generated during the Market Basket Analysis. Each point
represents a specific rule, while the color gradient denotes the corresponding lift value. The
graph reveals that many rules exhibit high confidence values (above 0.8), indicating strong
associations between items. However, these rules often have low support values, which is
typical in retail datasets where highly specific combinations of products occur less frequently.
This pattern highlights the potential for identifying niche product combinations that can inform
tailored marketing strategies or promotions.
The addition of lift as a color gradient enriches the analysis by emphasizing rules with
significant correlations. Rules with higher lift values, particularly those exceeding 100, stand
out as exceptionally strong associations, suggesting opportunities for cross-selling or bundling
strategies. The visualization underscores the efficacy of the Apriori algorithm in uncovering
meaningful patterns in large datasets. By combining the metrics of support, confidence, and lift
in an intuitive graphical format, the study enhances interpretability and provides actionable
insights that align with the goals of Market Basket Analysis.
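A plot of this kind could be produced along the lines of the sketch below, which assumes each rule is available as a dictionary with `support`, `confidence`, and `lift` keys. The function names, the output file name, and the three example rules are made up for illustration and do not reproduce the values obtained in this study.

```python
def rules_to_points(rules):
    """Extract parallel lists of support, confidence, and lift from rule dicts."""
    supports = [r["support"] for r in rules]
    confidences = [r["confidence"] for r in rules]
    lifts = [r["lift"] for r in rules]
    return supports, confidences, lifts


def plot_support_confidence(rules, path="support_confidence.png"):
    """Scatter support against confidence, colored by lift, and save to path."""
    # Imported here so the data helper above works without matplotlib installed
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, suitable for a web server
    import matplotlib.pyplot as plt

    supports, confidences, lifts = rules_to_points(rules)
    fig, ax = plt.subplots(figsize=(8, 5))
    points = ax.scatter(supports, confidences, c=lifts, cmap="viridis")
    fig.colorbar(points, ax=ax, label="Lift")
    ax.set_xlabel("Support")
    ax.set_ylabel("Confidence")
    ax.set_title("Relationship between Support and Confidence")
    fig.savefig(path)
    plt.close(fig)
    return path


# Made-up rule values, for illustration only
example_rules = [
    {"support": 0.02, "confidence": 0.85, "lift": 12.0},
    {"support": 0.10, "confidence": 0.40, "lift": 1.5},
    {"support": 0.05, "confidence": 0.60, "lift": 3.2},
]
```

Calling `plot_support_confidence(example_rules)` writes the image to disk; rules in the upper-left region of the plot correspond to the high-confidence, low-support pattern discussed above.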
Optimized Data Processing for Large Datasets
Previous research generated static association rules. Our system allows users to input
specific products and dynamically retrieve frequently bought-together recommendations.
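Dynamic retrieval of this kind can be sketched as a lookup over the mined rules: every rule whose antecedents are all contained in the user's input contributes its consequents as candidate recommendations. The function name `recommend`, the ranking by confidence, and the sample rules are illustrative assumptions; the rule dictionaries use the `antecedents`, `consequents`, and `confidence` keys that appear in the source listing.

```python
def recommend(input_products, rules, top_n=3):
    """Return consequents of rules whose antecedents are all in the input."""
    basket = {p.strip().lower() for p in input_products}
    scored = {}
    for rule in rules:
        if set(rule["antecedents"]) <= basket:
            for item in rule["consequents"]:
                if item not in basket:
                    # Keep the best confidence seen for each candidate item
                    scored[item] = max(scored.get(item, 0.0), rule["confidence"])
    # Highest confidence first; ties broken alphabetically for stable output
    ranked = sorted(scored.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical rules, for illustration only
rules = [
    {"antecedents": {"bread"}, "consequents": {"butter"}, "confidence": 0.7},
    {"antecedents": {"bread", "milk"}, "consequents": {"eggs"}, "confidence": 0.8},
    {"antecedents": {"tea"}, "consequents": {"sugar"}, "confidence": 0.9},
    {"antecedents": {"milk"}, "consequents": {"bread"}, "confidence": 0.6},
]
suggestions = recommend(["Milk", " bread"], rules)
print(suggestions)  # ['eggs', 'butter']
```

Items already in the input are excluded from the suggestions, which is why the rule from milk to bread contributes nothing in this example.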
Common Challenges Identified in Past Research and Proposed Solutions
Several studies have highlighted key challenges in applying the Apriori algorithm for Market
Basket Analysis, particularly in terms of scalability, rule redundancy, interpretability, and
accessibility. While the algorithm is widely used for association rule mining, its practical
implementation often encounters computational and usability constraints. This section
discusses these challenges as identified in previous research and presents the solutions
incorporated into our system to address these limitations.
CHAPTER 6: CONCLUSION
The Market Basket Analysis system, implemented using the Apriori algorithm, successfully
identifies frequent itemsets and generates meaningful association rules to improve decision-
making in retail and e-commerce. By leveraging data pre-processing, efficient rule mining,
visualization techniques, and web deployment, the system provides valuable insights into
customer purchasing behavior. This implementation demonstrates how association rule mining
can be effectively used to analyze large transactional datasets and discover patterns that drive
business strategies.
Through the development process, various challenges were addressed, including handling
large datasets efficiently, optimizing the performance of the Apriori algorithm, and ensuring
that the system provides relevant and actionable recommendations. The integration of data
visualization tools further enhances the user experience, allowing stakeholders to interpret
results intuitively. The web-based interface ensures accessibility, making the system a practical
solution for businesses looking to optimize their product offerings and marketing strategies.
System testing confirmed that the implementation is both efficient and scalable, with the
ability to process different dataset sizes while maintaining accuracy and performance. By using
structured testing methodologies, including unit testing, integration testing, and performance
evaluation, the system’s reliability was ensured. The results indicate that the system is capable
of providing useful recommendations for various business scenarios, supporting data-driven
decision-making.
In the future, the system can be enhanced by integrating machine learning techniques for
more accurate recommendations, real-time transaction processing for dynamic rule updates,
and advanced visualization tools for better data interpretation. These improvements will further
expand the capabilities of the Market Basket Analysis system, making it a more powerful tool
for retailers and businesses seeking to enhance customer experience and maximize revenue
opportunities.
CHAPTER 7: REFERENCES
[1] Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items
in large databases. Proceedings of the ACM SIGMOD International Conference on
Management of Data, pp. 207-216.
[2] Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules.
Proceedings of the 20th VLDB Conference, Santiago, Chile.
[3] Andrew, J. H., & Zhang, Y. (2003). General test result checking with log file analysis.
IEEE Transactions on Software Engineering, 29, pp. 634-648.
[4] Baralis, E., Cagliero, I., & Cerquitelli, T. (2014). Mining network data through cloud-
based data mining techniques. IEEE/ACM 7th International Conference on Utility and
Cloud Computing, IEEE, vol 201, pp. 503-504.
[5] Begum, S. H. (2013). Data mining tools and trends – an overview. International Journal
of Emerging Research in Management & Technology, 6-12.
[6] Bhalodiya, D., Patel, K. M., & Patel, C. (2013). An efficient way to find frequent
pattern with dynamic programming approach. Nirma University International
Conference on Engineering (NUiCONE), Ahmedabad, pp. 1-5.
[7] Bhalodiya, D., Patel, K. M., & Patel, C. (2013). An Efficient way to Find Frequent
Pattern with Dynamic Programming Approach. University International Conference
On Engineering.
[8] Buche, P., Dibie-Barthelemy, J., Ibanescu, L., & Soler, L. (2013). Fuzzy web data
tables integration guided by an ontological and terminological resource. IEEE
Transactions on Knowledge and Data Engineering, 25(4), pp. 805-814.
[9] Chan, K., & Wai-Ho. (1997). An effective algorithm for mining interesting quantitative
association rules. Proceedings of the ACM Symposium on Applied Computing, pp. 88-
90.
[10] Chan, P. K., Fan, W., Prodromidis, A. L., & Stolfo, S. J. (1999). Distributed data
mining in credit card fraud detection. IEEE, pp. 67-73.
[11] Cheng, A., Su, S., Xu, S., & Li, Z. (2015). DP-Apriori: A differentially private frequent
itemset mining algorithm.
[12] Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World
Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1(1),
pp. 5-32.
[13] Eremic, Z., Radosav, D., & Markoski, B. (2010). Mining user logs to optimize
[25] Mayil, V. V. (2012). Web navigation path pattern prediction using first-order Markov
model and depth-first evaluation. International Journal of Computer Applications,
45(16).
[26] Midhunchakkaravarthy, J., & Brunda, S. S. (2012). An enhanced web mining approach
for product usability evaluation in feature fatigue analysis using LDA model
association rule mining with the Fruit Fly algorithm. Indian Journal of Science &
Technology, 9(8).
[27] Mobasher, B., Cooley, R., & Srivastava, J. (1999). Creating adaptive websites through
usage-based clustering of URLs. Proceedings of Knowledge and Data Engineering
Exchange, 1(1), pp. 19-25.
[28] Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2002). Discovery and evaluation of
aggregate usage profiles for web personalization. Proceedings of Data Mining and
Knowledge Discovery, pp. 61-82.
[29] Parthiban, P., & Selvakumar, S. (2016). Big data architecture for capturing, storing,
analyzing, and visualizing web server logs. Indian Journal of Science and Technology,
9(4).
[30] Park, J. S., Chen, M.-S., & Yu, P. S. (1997). Using a hash-based method with
transaction trimming for mining association rules. IEEE Transactions on Knowledge
and Data Engineering, pp. 813-825.
[31] Patel, K. B., & Patel, A. R. (2012). Process of web usage mining to find interesting
patterns from web usage data. International Journal of Computers & Technology, 3(1).
[32] Saurkar, A. V., Bhujade, V., Bhagat, P., & Khaparde, A. (2014). A review paper on
various data mining techniques. International Journal of Advanced Research in
Computer Science and Software Engineering, 4(4), pp. 98-101.
[33] Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. N. (2000). Web usage mining:
Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2),
pp. 12-33.
[34] Suneetha, K. R., & Krishnamoorthi, R. (2009). Identifying user behavior by analyzing
web server access log files. International Journal of Computer Science and Network
Security, 9(4).
[35] Suneetha, K. R., & Krishnamoorthi, R. (2010). Advanced version of the Apriori
algorithm.
SOURCE CODE
APP.py
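from flask import Flask, render_template

app = Flask(__name__)


# The route, view name, and error placeholder are assumptions: the view body
# was lost from the listing and only its final lines survive.
@app.route("/", methods=["GET", "POST"])
def index():
    error = None  # assumed placeholder for the error returned by the analysis step
    if error:
        return render_template("index.html", error=error)
    return render_template("index.html")


if __name__ == "__main__":
    app.run(debug=True, port=5001)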
MARKET BASKET ANALYSIS.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from mlxtend.frequent_patterns import apriori, association_rules
from scipy.sparse import csr_matrix
from collections import defaultdict
DATA_PATH = "data/store_data.csv"
# Load dataset
def load_dataset(file_path=DATA_PATH):
    """Read the CSV in chunks and return (transactions, error_message)."""
    try:
        transactions = []
        # Read 5,000 rows at a time to keep memory usage bounded
        for chunk in pd.read_csv(file_path, chunksize=5000):
            chunk = chunk.dropna(subset=['Transaction'])
            transactions.extend(chunk['Transaction'].astype(str).str.split(',').tolist())
        return transactions, None
    except Exception as e:
        return None, f"Error loading dataset: {e}"
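# Preprocess transactions
def preprocess_transactions(transactions):
    """One-hot encode the transactions as a sparse boolean DataFrame."""
    unique_items = sorted(set(item.strip() for transaction in transactions for item in transaction))
    item_to_index = {item: idx for idx, item in enumerate(unique_items)}
    # The construction of rows, cols, and data is missing from this listing;
    # the loop below is an assumed reconstruction (one True per transaction-item pair).
    rows, cols, data = [], [], []
    for row_idx, transaction in enumerate(transactions):
        for item in {item.strip() for item in transaction}:
            rows.append(row_idx)
            cols.append(item_to_index[item])
            data.append(True)
    sparse_matrix = csr_matrix((data, (rows, cols)), shape=(len(transactions), len(unique_items)))
    return pd.DataFrame.sparse.from_spmatrix(sparse_matrix, columns=unique_items), unique_items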
# Run Apriori
def run_apriori(input_products, min_support=0.005):
    transactions, error = load_dataset()
    if error:
        return None, None, error
    # ... (the listing is incomplete here) ...
).head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x="Count", y="Product", data=product_counts_df, palette="viridis")
plt.title("Top 10 Products with Highest Sales")
plt.xlabel("Sales Count")
plt.ylabel("Products")
plt.savefig("static/top_products.png") # Save image for display in HTML
plt.close()
graph = nx.Graph()
for rule in rules:
    for ant in rule['antecedents']:
        for con in rule['consequents']:
            graph.add_edge(ant, con, weight=rule['confidence'])
plt.figure(figsize=(10, 6))
pos = nx.spring_layout(graph)
nx.draw(graph, pos, with_labels=True, node_color="lightblue", edge_color="gray",
        font_size=10, node_size=2000)
plt.title("Association Rules Network Graph")
plt.savefig("static/association_rules.png")
plt.close()
INDEX.html
<!DOCTYPE html>
<html lang="en">
<head>
<title>Market Basket Analysis</title>
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css">
<style>
body {
background-color: #f8f9fa;
}
.hero-section {
background: linear-gradient(45deg, #007bff, #6610f2);
color: white;
padding: 80px 0;
text-align: center;
}
.container-box {
background: white;
padding: 30px;
border-radius: 10px;
box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
margin-top: -50px;
}
.btn-custom {
width: 100%;
background: #007bff;
color: white;
font-size: 18px;
}
.btn-custom:hover {
background: #0056b3;
}
.footer {
text-align: center;
padding: 10px;
background: #343a40;
color: white;
margin-top: 50px;
}
</style>
</head>
<body>
<div class="hero-section">
<h1>Market Basket Analysis</h1>
<p>Discover frequently bought-together products</p>
</div>
<div class="container">
<div class="container-box">
<h3 class="text-center">Enter Products to Analyze</h3>
<form method="POST">
<div class="mb-3">
<input type="text" name="products" class="form-control" placeholder="e.g., Milk, Bread" required>
</div>
<button type="submit" class="btn btn-custom">Analyze</button>
</form>
{% if error %}
<div class="alert alert-danger mt-3 text-center">
{{ error }}
</div>
{% endif %}
</div>
</div>
<div class="footer">
<p>© 2025 Market Basket Analysis | Powered by Flask & Apriori Algorithm</p>
</div>
</body>
</html>
RESULTS.html
<!DOCTYPE html>
<html lang="en">
<head>
<title>Analysis Results</title>
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css">
<style>
body {
background-color: #f8f9fa;
}
.container-box {
background: white;
padding: 30px;
border-radius: 10px;
box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
margin-top: 30px;
}
.section-header {
text-align: center;
margin-bottom: 20px;
}
.footer {
text-align: center;
padding: 10px;
background: #343a40;
color: white;
margin-top: 50px;
}
.btn-back {
width: 100%;
background: #007bff;
color: white;
font-size: 18px;
}
.btn-back:hover {
background: #0056b3;
}
img {
width: 100%;
border-radius: 5px;
margin-top: 10px;
}
</style>
</head>
<body>
<div class="container">
<div class="container-box">
<h2 class="section-header">Market Basket Analysis Results</h2>
</div>
</div>
<div class="card-body">
<img src="{{ url_for('static', filename='association_rules.png') }}"
alt="Association Rules">
</div>
</div>
<div class="footer">
<p>© 2025 Market Basket Analysis | Powered by Flask & Apriori Algorithm</p>
</div>
</body>
</html>
OUTPUT
Relationship between Support and Confidence.
Hardware and Software Requirements for Market Basket Analysis System
Hardware Requirements
Software Requirements