Association Rules Ans
Instructions:
Please share your answers filled in-line in the word document. Submit code separately wherever
applicable.
Grading Guidelines:
1. An assignment submission is considered complete only when correct and executable code(s) are
submitted along with the documentation explaining the method and results. Failing to submit either of
those will be considered an invalid submission and will not be considered for evaluation.
2. Assignments submitted after the deadline will affect your grades.
Grading:
Ans                  | Date    | Grade | Score | Ans         | Date
Correct              | On time | A     | 100   |             |
80% & above          | On time | B     | 85    | Correct     | Late
50% & above          | On time | C     | 75    | 80% & above | Late
50% & below          | On time | D     | 65    | 50% & above | Late
                     |         | E     | 55    | 50% & below |
Copied/No Submission |         | F     | 45    |             |
● Grade A: (>= 90): When all assignments are submitted on or before the given deadline.
● Grade B: (>= 80 and < 90):
o When assignments are submitted on time but less than 80% of problems are completed.
(OR)
o All assignments are submitted after the deadline.
1. Business Problem
1.1. What is the business objective?
1.2. Are there any constraints?
2. Work on each feature of the dataset to create a data dictionary as displayed in the
below image:
3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Model Building
4.1 Application of Apriori Algorithm
4.2 Build most frequent item sets and plot the rules
4.3 Work on Codes
5. Deployment
5.1 Deploy solutions
6. Write about the benefits/impact of the solution - in what way does the business
(client) benefit from the solution provided?
Problem Statement: -
Kitabi Duniya is a famous book store in India that was established before Independence. The company grew
incrementally year by year, but with the online selling of books and widespread Internet access its annual
growth started to collapse, with sharp downfalls. As a Data Scientist, help this heritage book store regain
its popularity and increase customer footfall, and suggest ways the business can improve exponentially.
Apply the Association Rule algorithm, explain the rules, and visualize the graphs for a clear understanding
of the solution.
1.) Books.csv
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
# book_series and item_frequencies are assumed to be built earlier in the submitted script
# Sort the item frequencies in ascending order of count
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])
book_series.columns = ["trans"]
# Create a dummy column for each item in each transaction, using the item names as column names
X = book_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)
# Bar plot of the support of the first 11 frequent itemsets
plt.bar(x=list(range(0, 11)), height=frequent_itemsets.support[0:11], color=list("rgmyk"))
plt.xticks(list(range(0, 11)), frequent_itemsets.itemsets[0:11])
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()
# Generate rules with lift >= 1 and list the 10 rules with the highest lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)
rules.sort_values("lift", ascending=False).head(10)
Insights:
· From the books dataset, 164 rules were created using the Apriori algorithm.
· Three variables are plotted in the scatterplot: confidence, support and lift.
· From the above visualization, the lift ratio of the rules ranges from approximately 0 to 14.5.
· The rules are widely scattered, and more than 50 percent of them lie within a support value
of 0.03.
· The rules with the highest lift ratio have a support value between 0.025 and 0.03
and a confidence value between 0.6 and 0.7.
· The highest lift ratio is for the rules with the items CookBks, DoItYBks, ArtBks, ItalCook.
· The lowest lift ratio is for the items ItalArts, ArtBks, CookBks, DoItYBks.
· There are 11 rules with the highest lift value; their items include ItalCook, ArtBks and 4 other
items.
· There are 6 rules with the highest support; their items include DoItYBks, GeogBks and
4 other items.
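The three measures discussed in these insights can be computed by hand. The following is a minimal sketch that assumes a handful of made-up transactions (they are not taken from Books.csv); the helper names `support` and `rule_metrics` are hypothetical and chosen only for this illustration.

```python
# Toy transactions invented for illustration; NOT the contents of Books.csv
transactions = [
    {"CookBks", "ArtBks"},
    {"CookBks", "ArtBks", "GeogBks"},
    {"CookBks"},
    {"GeogBks"},
    {"ArtBks", "GeogBks"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes both sides occur in at least one transaction (no zero division).
    """
    sup_a = support(antecedent, transactions)
    sup_c = support(consequent, transactions)
    sup_rule = support(set(antecedent) | set(consequent), transactions)
    confidence = sup_rule / sup_a   # how often the consequent follows the antecedent
    lift = confidence / sup_c       # > 1 means the items co-occur more than by chance
    return sup_rule, confidence, lift


sup, conf, lift = rule_metrics({"CookBks"}, {"ArtBks"}, transactions)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

A lift above 1 suggests the two sides of the rule appear together more often than they would if they were independent, which is why the rules are filtered and ranked by lift.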
© 2013 - 2021 360DigiTMG. All Rights Reserved.
Problem Statement: -
The departmental store has gathered data on the products it sells on a daily basis.
Using association rules concepts, provide insights on the rules and the plots.
2.) Groceries.csv
Python code:
# Import matplotlib to plot the data, and the Apriori helpers from mlxtend
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
# groceries_series and item_frequencies are assumed to be built earlier in the submitted script
# Sort the item frequencies in ascending order of count
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])
groceries_series.columns = ["trans"]
# Create a dummy column for each item in each transaction, using the item names as column names
X = groceries_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)
# Bar plot of the support of the first 11 frequent itemsets
plt.bar(x=list(range(0, 11)), height=frequent_itemsets.support[0:11], color=list("rgmyk"))
plt.xticks(list(range(0, 11)), frequent_itemsets.itemsets[0:11])
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()
# Generate rules with lift >= 1 and list the 10 rules with the highest lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)
rules.sort_values("lift", ascending=False).head(10)
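To make the role of `min_support` and `max_len` more concrete, here is a minimal pure-Python sketch of a level-wise (Apriori-style) frequent-itemset search. It is not mlxtend's implementation; the function name `find_frequent_itemsets` and the toy baskets are assumptions made for illustration only.

```python
from itertools import combinations


def find_frequent_itemsets(transactions, min_support, max_len):
    """Return {frozenset: support} for itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates and k <= max_len:
        # Count step: keep the size-k candidates that meet the support threshold
        level = {}
        for cand in candidates:
            sup = sum(cand <= t for t in transactions) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        # Join step: merge frequent size-k sets into size-(k+1) candidates
        next_candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            # Prune step: every size-k subset of the candidate must be frequent
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                next_candidates.add(union)
        candidates = list(next_candidates)
        k += 1
    return frequent


# Toy baskets invented for illustration; NOT the contents of Groceries.csv
baskets = [
    {"whole milk", "other vegetables", "root vegetables"},
    {"whole milk", "other vegetables"},
    {"whole milk", "citrus fruit"},
    {"other vegetables", "root vegetables"},
    {"whole milk", "other vegetables", "root vegetables"},
]
itemsets = find_frequent_itemsets(baskets, min_support=0.6, max_len=2)
for itemset, sup in sorted(itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(sup, 2))
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most larger candidates are discarded without counting them.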
Exploratory Data Analysis:
· The above is the two-key plot of 39 rules from the groceries dataset, plotted on the basis
of the order of the rules (the number of items in each rule). Three variables are plotted.
· From the visualization we can infer that rules with 4 items are more numerous than
rules with 3 or 5 items.
· The majority of the rules with 4 items have a support value between 0.002 and 0.0032
and a confidence between 0.75 and 0.78.
· There is a rule with 5 items that has the highest confidence, with a support value of
approximately 0.0032.
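The "order" used in the two-key plot is simply the number of distinct items in a rule. A minimal sketch of counting rules per order follows; the rules listed are hypothetical examples, not the 39 rules from the analysis, and `rule_order` is a helper name invented here.

```python
from collections import Counter

# Hypothetical rules as (antecedent, consequent) pairs, invented for illustration
rules = [
    (frozenset({"onions", "whole milk"}), frozenset({"other vegetables"})),
    (frozenset({"citrus fruit", "root vegetables", "whole milk"}),
     frozenset({"other vegetables"})),
    (frozenset({"tropical fruit", "root vegetables", "yogurt"}),
     frozenset({"whole milk"})),
    (frozenset({"citrus fruit", "tropical fruit", "root vegetables", "whole milk"}),
     frozenset({"other vegetables"})),
]


def rule_order(antecedent, consequent):
    """Order of a rule: number of distinct items across both sides."""
    return len(antecedent | consequent)


# Count how many rules fall into each order
order_counts = Counter(rule_order(a, c) for a, c in rules)
for order in sorted(order_counts):
    print(f"order {order}: {order_counts[order]} rule(s)")
```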
2. Graph chart
· Among the 4 rules, there is a rule with the highest lift ratio whose items include whipped/sour cream,
onions, whole milk and other vegetables.
· There is a rule with the highest support value whose items include tropical fruit, frozen
vegetables, root vegetables and whole milk. Whole milk is also common to 3 different rules.
3. Grouped chart
· There is one rule with the maximum lift ratio; its items include citrus fruit, whole
milk and 2 other items.
· There is one rule with the highest support and a good lift value; its items include citrus fruit,
tropical fruit and 1 other item.
Problem Statement: -
Python code:
# Import pandas, the Apriori helpers from mlxtend, and matplotlib to plot the data
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
# movie_series and item_frequencies are assumed to be built earlier in the submitted script
# Sort the item frequencies in ascending order of count
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])
movie_series.columns = ["trans"]
# Create a dummy column for each item in each transaction, using the item names as column names
X = movie_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)
# Bar plot of the support of the first 11 frequent itemsets
plt.bar(x=list(range(0, 11)), height=frequent_itemsets.support[0:11], color=list("rgmyk"))
plt.xticks(list(range(0, 11)), frequent_itemsets.itemsets[0:11])
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()
# Generate rules with lift >= 1 and list the 10 rules with the highest lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)
rules.sort_values("lift", ascending=False).head(10)
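The last step, filtering rules by a lift threshold of 1 and sorting by lift, can be sketched without any library. The version below is an illustration in the spirit of that step, not mlxtend's code; `generate_rules` is a hypothetical helper, and the support values are invented rather than taken from the movies dataset.

```python
from itertools import combinations


def generate_rules(itemset_support, min_lift=1.0):
    """Build rules A -> C from {frozenset: support}, keeping lift >= min_lift.

    Assumes the dict also holds the support of every subset used as a rule side.
    """
    rules = []
    for itemset, sup in itemset_support.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                consequent = itemset - antecedent
                confidence = sup / itemset_support[antecedent]
                lift = confidence / itemset_support[consequent]
                if lift >= min_lift:
                    rules.append({
                        "antecedent": antecedent,
                        "consequent": consequent,
                        "support": sup,
                        "confidence": confidence,
                        "lift": lift,
                    })
    # Highest lift first
    return sorted(rules, key=lambda rule: rule["lift"], reverse=True)


# Invented support values for illustration; NOT results from the movies data
itemset_support = {
    frozenset({"Gladiator"}): 0.7,
    frozenset({"Patriot"}): 0.6,
    frozenset({"Green Mile"}): 0.2,
    frozenset({"Gladiator", "Patriot"}): 0.6,
    frozenset({"Gladiator", "Green Mile"}): 0.1,
}
movie_rules = generate_rules(itemset_support, min_lift=1.0)
for rule in movie_rules:
    print(sorted(rule["antecedent"]), "->", sorted(rule["consequent"]),
          round(rule["lift"], 3))
```

In this toy input the Gladiator/Green Mile pair is dropped because its lift is below 1, while both directions of the Gladiator/Patriot rule are kept.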
Insights:
· From the above chart we can infer that one rule has a higher lift than the others but
low support; its movies include LOTR, Green Mile and Gladiator.
· The rule with the movies Gladiator and Patriot has the highest support value.
· The movies Gladiator and Sixth Sense are part of other rules as well.
2. Grouped chart
· There is a rule which has the highest lift ratio value and has the movies Gladiator and Green Mile.
· The rule which has higher support as compared to other rules includes the movies Gladiator
and Patriot.
Python code:
# Import matplotlib to plot the data, and the Apriori helpers from mlxtend
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
# data_series is assumed to be built earlier in the submitted script
data_series.columns = ["trans"]
# Create a dummy column for each item in each transaction, using the item names as column names
X = data_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)
# Bar plot of the support of the first 5 frequent itemsets
plt.bar(x=list(range(0, 5)), height=frequent_itemsets.support[0:5], color=list("rgmyk"))
plt.xticks(list(range(0, 5)), frequent_itemsets.itemsets[0:5], rotation=15)
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()
# Generate rules with lift >= 1 and list the 5 rules with the highest lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(5)
rules.sort_values("lift", ascending=False).head(5)
Insights:
· Among the 12 rules, the rule with the highest lift ratio has the items (phone colors)
green, white and red.
· The rule with the highest support value includes only the color white.
· From the graph above we can infer that white and red are part of most of the rules,
hence we can go ahead with both of those colors.
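The observation that some colors appear in most of the rules can be checked by counting, for each item, the number of rules it takes part in. A small sketch follows; the rules below are hypothetical and `item_rule_counts` is a name chosen only for this example.

```python
from collections import Counter

# Hypothetical (antecedent, consequent) rules, invented for illustration
rules = [
    ({"white"}, {"red"}),
    ({"red"}, {"white"}),
    ({"green"}, {"white", "red"}),
    ({"blue"}, {"white"}),
]


def item_rule_counts(rules):
    """Count, for each item, how many rules contain it on either side."""
    counts = Counter()
    for antecedent, consequent in rules:
        # Use the union so an item is counted once per rule
        counts.update(set(antecedent) | set(consequent))
    return counts


counts = item_rule_counts(rules)
for item, n_rules in counts.most_common():
    print(f"{item}: appears in {n_rules} of {len(rules)} rules")
```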
2. Grouped chart
Insights:
· From the chart we can infer that 6 rules have the highest support, and the colors
in those rules are blue, red and white.
· The rule with the highest lift ratio among all the rules has the colors white and green,
but its support value is lower than that of the other rules.
Problem Statement: -
A retail store in India has its transaction data and would like to know the buying patterns of the
consumers in its locality. You have been assigned the task of providing the manager with rules
on how products should be placed on the shelves so that it can improve the buying
Python code:
# Import matplotlib to plot the data, and the Apriori helpers from mlxtend
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
# tr_series and item_frequencies are assumed to be built earlier in the submitted script
# Sort the item frequencies in ascending order of count
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])
tr_series.columns = ["trans"]
# Create a dummy column for each item in each transaction, using the item names as column names
X = tr_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)
# Bar plot of the support of the first 5 frequent itemsets
plt.bar(x=list(range(0, 5)), height=frequent_itemsets.support[0:5], color=list("rgmyk"))
plt.xticks(list(range(0, 5)), frequent_itemsets.itemsets[0:5], rotation=15)
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()
# Generate rules with lift >= 1 and list the 10 rules with the highest lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(20)
rules.sort_values("lift", ascending=False).head(10)
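The `item_frequencies` object sorted at the top of this code is not shown being built. One plausible way to obtain it, assuming each transaction is a list of item strings, is a `collections.Counter` over all items; the sketch below uses invented transactions and is only an illustration of that assumption.

```python
from collections import Counter

# Toy transactions invented for illustration; NOT the retail store's data
transactions = [
    ["paper", "kit", "jam"],
    ["jam", "set"],
    ["paper", "jam", "set"],
]

# Flatten the transactions and count how often each item occurs
all_items = [item for transaction in transactions for item in transaction]
item_frequencies = Counter(all_items)

# Sort ascending by count, as in the snippet above
sorted_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])

# The most frequent items sit at the end of the ascending list
most_frequent = sorted_frequencies[-1]
print(sorted_frequencies)
print("most frequent:", most_frequent)
```

Reversing the sorted list, or passing `reverse=True` to `sorted`, gives the items in the order typically wanted for a "top 10" bar plot.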
Insights:
· There are other rules with approximately the same lift ratio: one rule has the items
paper, Christmas and kit, and another has the items wooden, frame and white.
2. Grouped plot:
Insights:
· Above is the grouped chart for 204 rules.
· The rule with the highest lift ratio among all the rules has the items kneeling and pad.
· The rule with a higher support value than the other rules has the items Jam and set.
Data Dictionaries:
1. Books dataset
2. Groceries dataset
It is a transaction dataset and does not have specific columns. For our analysis we convert the
items in the transactions into columns of the dataset.
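Turning the items of each transaction into columns is a one-hot encoding. The following pure-Python sketch shows the idea on invented transactions; the function name `one_hot_encode` is hypothetical, and the real analysis performs this step with pandas rather than with this helper.

```python
def one_hot_encode(transactions):
    """Return (column_names, rows) where rows[i][j] is True if
    transaction i contains the item named column_names[j]."""
    # One column per distinct item, in alphabetical order
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([item in present for item in columns])
    return columns, rows


# Toy transactions invented for illustration
transactions = [
    ["whole milk", "citrus fruit"],
    ["whole milk"],
    ["root vegetables", "whole milk"],
]
columns, rows = one_hot_encode(transactions)
print(columns)
for row in rows:
    print(row)
```

Each row has the same length regardless of how many items the original transaction held, which is the tabular shape that frequent-itemset routines expect.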
It is a dataset in which the columns are in encoded format, apart from those mentioned below.
4. My phone data
It is a dataset in which the columns are in encoded format, apart from those mentioned below.
It is a transaction dataset and does not have specific columns. For our analysis we convert the
items in the transactions into columns of the dataset.