
Association Rules

Instructions:
Please share your answers filled in-line in the word document. Submit code separately wherever
applicable.

Please ensure you update all the details:


Name: Vinutha N Batch ID: DSWDE24062021
Topic: Association Rules

Grading Guidelines:
1. An assignment submission is considered complete only when correct and executable code(s) are
submitted along with the documentation explaining the method and results. Failing to submit either of
those will be considered an invalid submission and will not be considered for evaluation.
2. Assignments submitted after the deadline will affect your grades.

Grading:

Grade | Score | Ans (On time)          | Ans (Late)
A     | 100   | Correct                | -
B     | 85    | 80% & above            | Correct
C     | 75    | 50% & above            | 80% & above
D     | 65    | 50% & below            | 50% & above
E     | 55    | -                      | 50% & below
F     | 45    | Copied / No submission

● Grade A: (>= 90): When all assignments are submitted on or before the given deadline.
● Grade B: (>= 80 and < 90):
o When assignments are submitted on time but less than 80% of problems are completed.
(OR)
o All assignments are submitted after the deadline.

● Grade C: (>= 70 and < 80):


o When assignments are submitted on time but less than 50% of the problems are completed.
(OR)
o Less than 80% of problems in the assignments are submitted after the deadline.

● Grade D: (>= 60 and < 70):


o Assignments submitted after the deadline and with 50% or less problems.

● Grade E: (>= 50 and < 60):


o Less than 30% of problems in the assignments are submitted after the deadline.
(OR)
o Less than 30% of problems in the assignments are submitted before the deadline.

● Grade F: (< 50): No submission (or) malpractice.


Hints:

1. Business Problem
1.1. What is the business objective?
1.2. Are there any constraints?
2. Work on each feature of the dataset to create a data dictionary (feature name, description, type, relevance), as in the Data Dictionaries section at the end of this document.

3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Model Building
4.1 Application of Apriori Algorithm
4.2 Build most frequent item sets and plot the rules
4.3 Work on Codes
5. Deployment
5.1 Deploy solutions
6. Write about the benefits/impact of the solution - in what way does the business
(client) benefit from the solution provided?
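
The rules in the solutions below are evaluated with three metrics: support (how often an itemset occurs), confidence (how often the consequent occurs given the antecedent) and lift (confidence relative to the consequent's baseline support). A quick illustration on made-up toy transactions (not from any of the assignment datasets):

# Toy illustration of support, confidence and lift for the rule {A} -> {B}.
# The five transactions below are invented for this example only.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B"}, {"C"}]
n = len(transactions)

support_A  = sum("A" in t for t in transactions) / n         # 3/5 = 0.6
support_B  = sum("B" in t for t in transactions) / n         # 3/5 = 0.6
support_AB = sum({"A", "B"} <= t for t in transactions) / n  # 2/5 = 0.4

confidence = support_AB / support_A   # P(B | A) = 2/3
lift = confidence / support_B         # ~1.11: A and B co-occur slightly
                                      # more often than by chance
print(support_AB, confidence, lift)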

Problem Statement: -
Kitabi Duniya, a famous bookstore in India established before Independence, grew incrementally year after year, but due to online book selling and widespread Internet access its annual growth collapsed into sharp declines. As a Data Scientist, help this heritage bookstore regain its popularity and increase customer footfall, and provide ways the business can improve exponentially: apply the Association Rules algorithm, explain the rules, and visualize the graphs for a clear understanding of the solution.
1.) Books.csv


Python code:

# import libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# loading the data set as raw text: each line is one transaction
with open("C:/Users/Vinutha/Downloads/Datasets(6)/book.csv") as f:
    book = f.read()

# splitting the data: one transaction per line, then one item per comma
book = book.split("\n")
book_list = []
for i in book:
    book_list.append(i.split(","))

# flatten the transactions into a single list of all items
all_book_list = [i for item in book_list for i in item]

# Counter gives the number of occurrences of each item
from collections import Counter

item_frequencies = Counter(all_book_list)

# sort items by frequency (ascending)
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])

# separate frequencies and item names, most frequent first
frequencies = list(reversed([i[1] for i in item_frequencies]))
items = list(reversed([i[0] for i in item_frequencies]))

# bar plot of the top 11 items
import matplotlib.pyplot as plt

plt.bar(height=frequencies[0:11], x=list(range(0, 11)), color=list("rgbkymc"))
plt.xticks(list(range(0, 11)), items[0:11])
plt.xlabel("items")
plt.ylabel("Count")
plt.show()

# creating a data frame of the transactions data
book_series = pd.DataFrame(pd.Series(book_list))
book_series = book_series.iloc[:2000, :]
book_series.columns = ["trans"]

# creating a dummy column for each item in each transaction, using the
# item names as column names
X = book_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)

# most frequent item sets based on support
frequent_itemsets.sort_values("support", ascending=False, inplace=True)

plt.bar(x=list(range(0, 11)), height=frequent_itemsets.support[0:11], color=list("rgmyk"))
plt.xticks(list(range(0, 11)), frequent_itemsets.itemsets[0:11])
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)
rules.sort_values("lift", ascending=False).head(10)
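
One optional post-processing step, not part of the original submission: association_rules reports mirrored rules (X -> Y and Y -> X) separately, so the list can be de-duplicated by keying each rule on the unordered set of items it involves. A sketch, assuming the rules data frame above:

# drop mirrored duplicates by keying each rule on the unordered set of all
# items it involves, keeping the variant with the highest lift
rules["itemset_key"] = rules.apply(
    lambda r: frozenset(r["antecedents"]) | frozenset(r["consequents"]), axis=1)
rules_unique = (rules.sort_values("lift", ascending=False)
                     .drop_duplicates("itemset_key")
                     .drop(columns="itemset_key"))
print(len(rules), "->", len(rules_unique), "rules after de-duplication")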

Exploratory Data Analysis:

1. Scatterplot of the rules with three different variables on the plot.

Insights:

· From the books dataset, 164 rules were created using the Apriori algorithm.

· Three different variables are plotted in the scatterplot: confidence, support and lift.

· From the visualization we can see that the lift ratio for the rules ranges from approximately 0 to 14.5.

· The rules are widely scattered, and more than 50 percent of them lie within a support range of up to 0.03.

· The rules with the highest lift lie within a support range of 0.025 to 0.03 and a confidence range of 0.6 to 0.7.
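
The scatterplot itself is not reproduced here. A minimal sketch that produces an equivalent plot from the rules data frame built above, assuming only pandas and matplotlib (support on the x-axis, confidence on the y-axis, lift mapped to colour):

import matplotlib.pyplot as plt

# support vs. confidence for every rule, with lift encoded as colour
sc = plt.scatter(rules["support"], rules["confidence"], c=rules["lift"], cmap="viridis")
plt.colorbar(sc, label="lift")
plt.xlabel("support")
plt.ylabel("confidence")
plt.title("Rules: support vs. confidence (colour = lift)")
plt.show()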

2. Graph plot for random rules

Insights:

· We have a graph plot of 3 rules amongst all the rules generated.

· The highest lift ratio is for the rule with the items CookBks, DoItYBks, ArtBks and ItalCook.

· The lowest lift ratio is for the rule with the items ItalArt, ArtBks, CookBks and DoItYBks.
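
The graph plot is not reproduced here. A rough sketch of how such a plot can be drawn in Python, assuming the rules data frame above and the third-party networkx package (not used in the original submission):

import networkx as nx
import matplotlib.pyplot as plt

# draw the top-3 rules by lift as a directed graph: antecedent items point
# into a rule node, which points out to the consequent items
G = nx.DiGraph()
for idx, row in rules.sort_values("lift", ascending=False).head(3).iterrows():
    rule_node = "R%d" % idx
    for item in row["antecedents"]:
        G.add_edge(item, rule_node)
    for item in row["consequents"]:
        G.add_edge(rule_node, item)

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue", font_size=8)
plt.show()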

3. Grouped chart for all 164 rules

Insights:

· Above is a grouped chart of all 164 rules.

· We have 11 rules with the highest lift value, and the items include ItalCook, ArtBks and 4 other items.

· There are 6 rules which have the highest support, and the items include DoItYBks, GeogBks and 4 other items.
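
Grouped charts of this kind come from R's arulesViz package; a rough Python approximation, assuming the rules data frame above, is to aggregate the rules by antecedent set and compare their mean support and lift:

# group rules by their (sorted) antecedent items and average the metrics
grouped = (
    rules.assign(antecedent=rules["antecedents"].apply(lambda s: ", ".join(sorted(s))))
         .groupby("antecedent")[["support", "lift"]]
         .mean()
         .sort_values("lift", ascending=False)
)
print(grouped.head(10))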
Problem Statement: -
The Departmental Store has gathered the data of the products it sells on a daily basis.
Using Association Rules concepts, provide insights on the rules and the plots.
2.) Groceries.csv

Python code:

# import libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# loading the data set as raw text: each line is one transaction
with open("C:/Users/Vinutha/Downloads/Datasets(6)/groceries.csv") as f:
    groceries = f.read()

# splitting the data: one transaction per line, then one item per comma
groceries = groceries.split("\n")
groceries_list = []
for i in groceries:
    groceries_list.append(i.split(","))

# flatten the transactions into a single list of all items
all_groceries_list = [i for item in groceries_list for i in item]

# Counter gives the number of occurrences of each item
from collections import Counter

item_frequencies = Counter(all_groceries_list)

# sort items by frequency (ascending)
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])

# separate frequencies and item names, most frequent first
frequencies = list(reversed([i[1] for i in item_frequencies]))
items = list(reversed([i[0] for i in item_frequencies]))

# bar plot of the top 11 items
import matplotlib.pyplot as plt

plt.bar(height=frequencies[0:11], x=list(range(0, 11)), color=list("rgbkymc"))
plt.xticks(list(range(0, 11)), items[0:11])
plt.xlabel("items")
plt.ylabel("Count")
plt.show()

# creating a data frame of the transactions data
groceries_series = pd.DataFrame(pd.Series(groceries_list))
groceries_series = groceries_series.iloc[:2000, :]
groceries_series.columns = ["trans"]

# creating a dummy column for each item in each transaction, using the
# item names as column names
X = groceries_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)

# most frequent item sets based on support
frequent_itemsets.sort_values("support", ascending=False, inplace=True)

plt.bar(x=list(range(0, 11)), height=frequent_itemsets.support[0:11], color=list("rgmyk"))
plt.xticks(list(range(0, 11)), frequent_itemsets.itemsets[0:11])
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)
rules.sort_values("lift", ascending=False).head(10)
Exploratory Data Analysis:

1. Two-key plot for 39 rules

Insights:

· Above is the two-key plot of the 39 rules from the groceries dataset, plotted on the basis of the order of the rules (the number of items per rule), with 3 different variables shown.

· From the visualization we can infer that rules with 4 items are more numerous than rules with 3 or 5 items.

· The majority of the rules with 4 items lie within a support range of 0.002 to 0.0032 and a confidence range of 0.75 to 0.78.

· There is a rule with 5 items which has the highest confidence and a support value of approximately 0.0032.
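
A minimal sketch of a two-key plot in Python, assuming the rules data frame built above: support against confidence, coloured by the order of each rule.

import matplotlib.pyplot as plt

# order of a rule = number of items on both sides combined
order = rules["antecedents"].apply(len) + rules["consequents"].apply(len)
sc = plt.scatter(rules["support"], rules["confidence"], c=order, cmap="tab10")
plt.colorbar(sc, label="order (items per rule)")
plt.xlabel("support")
plt.ylabel("confidence")
plt.show()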

2. Graph chart

Insights:

· Above is the graph chart of 4 rules.

· Amongst the 4 rules there is a rule with the highest lift ratio, and its items include whipped/sour cream, onions, whole milk and other vegetables.

· The item "other vegetables" is common to 3 different rules.

· There is a rule which has the highest support value, and its items include tropical fruit, frozen vegetables, root vegetables and whole milk. Whole milk is likewise common to 3 different rules.

3. Grouped chart

Insights:

· Above is the grouped chart for the 39 rules.

· There is one rule which has the maximum lift ratio, and its items include citrus fruit, whole milk and 2 other items.

· There is one rule which has the highest support and a good lift value, and its items include citrus fruit, tropical fruit and 1 other item.

Problem Statement: -

A film distribution company wants to target its audience based on their likes and dislikes. As Chief Data Scientist, analyze the data and come up with different rules over the movie list so that the business objective is achieved.
3.) my_movies.csv

Python code:

# import libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# loading the data set as raw text: each line is one transaction
with open("C:/Users/Vinutha/Downloads/Datasets(6)/my_movies.csv") as f:
    movie = f.read()

# splitting the data: one transaction per line, then one item per comma
movie = movie.split("\n")
movie_list = []
for i in movie:
    movie_list.append(i.split(","))

# flatten the transactions into a single list of all items
all_movie_list = [i for item in movie_list for i in item]

# Counter gives the number of occurrences of each item
from collections import Counter

item_frequencies = Counter(all_movie_list)

# sort items by frequency (ascending)
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])

# separate frequencies and item names, most frequent first
frequencies = list(reversed([i[1] for i in item_frequencies]))
items = list(reversed([i[0] for i in item_frequencies]))

# bar plot of the top 11 items
import matplotlib.pyplot as plt

plt.bar(height=frequencies[0:11], x=list(range(0, 11)), color=list("rgbkymc"))
plt.xticks(list(range(0, 11)), items[0:11])
plt.xlabel("items")
plt.ylabel("Count")
plt.show()

# creating a data frame of the transactions data
movie_series = pd.DataFrame(pd.Series(movie_list))
movie_series = movie_series.iloc[:2000, :]
movie_series.columns = ["trans"]

# creating a dummy column for each item in each transaction, using the
# item names as column names
X = movie_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)

# most frequent item sets based on support
frequent_itemsets.sort_values("support", ascending=False, inplace=True)

plt.bar(x=list(range(0, 11)), height=frequent_itemsets.support[0:11], color=list("rgmyk"))
plt.xticks(list(range(0, 11)), frequent_itemsets.itemsets[0:11])
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(10)
rules.sort_values("lift", ascending=False).head(10)

Exploratory Data Analysis:

1. Graph chart for 6 rules

Insights:

· Above is the graph chart for 6 rules.

· From the chart we can infer that there is a rule which has a higher lift than the others but low support; the movies involved are LOTR, Green Mile and Gladiator.

· The rule with the movies Gladiator and Patriot has the highest support value.

· The movies Gladiator and Sixth Sense are part of other rules as well.

2. Grouped chart

Insights:

· Above is the grouped chart of 30 rules.

· There is a rule which has the highest lift ratio, involving the movies Gladiator and Green Mile.

· The rule with higher support than the others involves the movies Gladiator and Patriot.


Problem Statement: -
A mobile phone manufacturing company wants to launch three brand new phones into the market, but before going with its traditional marketing approach, this time it wants to analyze the sales data of its previous models in different regions. You have been hired as a Data Scientist to help them out: use the Association Rules concept and provide your insights to the company's marketing team to improve its sales.
4.) myphonedata.csv

Python code:

# import libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from collections import Counter
import matplotlib.pyplot as plt

# loading the data set as raw text: each line is one transaction
with open("C:/Users/Vinutha/Downloads/Datasets(6)/myphonedata.csv") as f:
    data = f.read()

# splitting the data: one transaction per line, then one item per comma
data = data.split("\n")
data_list = []
for i in data:
    data_list.append(i.split(","))

# flatten the transactions into a single list of all items
all_data_list = [i for item in data_list for i in item]

# count the number of occurrences of each item
item_frequencies = Counter(all_data_list)

# sort items by frequency (ascending)
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])

# separate frequencies and item names, most frequent first
frequencies = list(reversed([i[1] for i in item_frequencies]))
items = list(reversed([i[0] for i in item_frequencies]))

# bar plot of the top 11 items
plt.bar(height=frequencies[0:11], x=list(range(0, 11)), color=list("rgbkymc"))
plt.xticks(list(range(0, 11)), items[0:11], rotation=30)
plt.xlabel("items")
plt.ylabel("Count")
plt.show()

# creating a data frame of the transactions data
data_series = pd.DataFrame(pd.Series(data_list))
data_series.columns = ["trans"]

# creating a dummy column for each item in each transaction, using the
# item names as column names
X = data_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)

# most frequent item sets based on support
frequent_itemsets.sort_values("support", ascending=False, inplace=True)

plt.bar(x=list(range(0, 5)), height=frequent_itemsets.support[0:5], color=list("rgmyk"))
plt.xticks(list(range(0, 5)), frequent_itemsets.itemsets[0:5], rotation=15)
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(5)
rules.sort_values("lift", ascending=False).head(5)


Exploratory Data Analysis:

1. Graph chart for the 12 rules generated

Insights:

· Above is the graph chart for the 12 rules.

· Amongst the 12 rules there is a rule with the highest lift ratio, and its items (the colors of the phones) are green, white and red.

· There is a rule with the highest support value, and it involves only the color white.

· From the graph we can infer that white and red are part of most of the rules, hence we can go ahead with both of those colors.

2. Grouped chart

Insights:

· Above is a grouped chart of the 12 rules.

· From the chart we can infer that there are 6 rules which have the highest support, and the colors in those rules are blue, red and white.

· We have a rule with the highest lift ratio amongst all the rules; it involves the colors white and green, but its support is lower than that of the other rules.

Problem Statement: -
A retail store in India has its transaction data and would like to know the buying patterns of the consumers in its locality. You have been assigned the task of providing the manager with rules on how products should be placed on the shelves so that the store can improve the buying patterns of consumers and increase customer footfall.
5.) transaction_retail.csv

Python code:

# import libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# loading the data set as raw text: each line is one transaction
with open("C:/Users/Vinutha/Downloads/Datasets(6)/transaction_retail1.csv") as f:
    tr = f.read()

# splitting the data: one transaction per line, then one item per comma
tr = tr.split("\n")
tr_list = []
for i in tr:
    tr_list.append(i.split(","))

# flatten the transactions into a single list of all items
all_tr_list = [i for item in tr_list for i in item]

# Counter gives the number of occurrences of each item
from collections import Counter

item_frequencies = Counter(all_tr_list)

# sort items by frequency (ascending)
item_frequencies = sorted(item_frequencies.items(), key=lambda x: x[1])

# separate frequencies and item names, most frequent first
frequencies = list(reversed([i[1] for i in item_frequencies]))
items = list(reversed([i[0] for i in item_frequencies]))

# bar plot of the top 5 items
import matplotlib.pyplot as plt

plt.bar(height=frequencies[0:5], x=list(range(0, 5)), color=list("rgbkymc"))
plt.xticks(list(range(0, 5)), items[0:5], rotation=30)
plt.xlabel("items")
plt.ylabel("Count")
plt.show()

# creating a data frame of the transactions data
tr_series = pd.DataFrame(pd.Series(tr_list))
tr_series = tr_series.iloc[:2000, :]
tr_series.columns = ["trans"]

# creating a dummy column for each item in each transaction, using the
# item names as column names
X = tr_series["trans"].str.join(sep="*").str.get_dummies(sep="*")
frequent_itemsets = apriori(X, min_support=0.0075, max_len=4, use_colnames=True)

# most frequent item sets based on support
frequent_itemsets.sort_values("support", ascending=False, inplace=True)

plt.bar(x=list(range(0, 5)), height=frequent_itemsets.support[0:5], color=list("rgmyk"))
plt.xticks(list(range(0, 5)), frequent_itemsets.itemsets[0:5], rotation=15)
plt.xlabel("item-sets")
plt.ylabel("support")
plt.show()

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(20)
rules.sort_values("lift", ascending=False).head(10)


Exploratory Data Analysis

1. Graph plot for 6 rules

Insights:

· Above we have a graph plot of 6 rules.

· From the chart we can infer that there is a rule which has a higher support value than the other rules, and its items (tokens from the product descriptions) are "wicker", "heart" and "of". "Wicker" and "of" are also part of a different rule.

· There are other rules with approximately similar lift values: one involves the items paper, Christmas and kit, and another the items wooden, frame and white.

2. Grouped plot:

Insights:

· Above is the grouped chart for the 204 rules.

· There is a rule which has the highest lift ratio amongst the others, and its items include kneeling and pad.

· We have a rule with a higher support value than the other rules, and its items include jam and set.

Data Dictionaries:

1. Books dataset

Name of Feature | Description                                  | Type    | Relevance
ChildBks        | Category of book is children's books         | Nominal | Relevant
YouthBks        | Category of book is youth books              | Nominal | Relevant
CookBks         | Category of book is cooking recipes          | Nominal | Relevant
DoItYBks        | Category of book is do-it-yourself books     | Nominal | Relevant
RefBks          | Category of book is reference books          | Nominal | Relevant
ArtBks          | Category of book is arts                     | Nominal | Relevant
GeogBks         | Category of book is geography                | Nominal | Relevant
ItalCook        | Category of book is Italian cooking recipes  | Nominal | Relevant
ItalAtlas       | Category of book is Italian atlas            | Nominal | Relevant
ItalArt         | Category of book is Italian arts             | Nominal | Relevant
Florence        | Category of book is Florence                 | Nominal | Relevant

2. Groceries dataset

It is a transaction dataset and does not have specific columns. For our analysis we convert the items in the transactions into the columns of the dataset.

3. My movies dataset

It is a dataset in which the columns are in encoded format, apart from those mentioned below.

Name of Feature | Description             | Type    | Relevance
V1              | Consists of movie names | Nominal | Relevant
V2              | Consists of movie names | Nominal | Relevant
V3              | Consists of movie names | Nominal | Relevant
V4              | Consists of movie names | Nominal | Relevant
V5              | Consists of movie names | Nominal | Relevant

4. My phone data

It is a dataset in which the columns are in encoded format, apart from those mentioned below.

Name of Feature | Description              | Type                 | Relevance
V1              | Consists of phone colors | Nominal, Qualitative | Relevant
V2              | Consists of phone colors | Nominal, Qualitative | Relevant
V3              | Consists of phone colors | Nominal, Qualitative | Relevant

5. Transaction retail 1 dataset

It is a transaction dataset and does not have specific columns. For our analysis we convert the items in the transactions into the columns of the dataset.
