CSC 177 Assignment 1 Chetan Nagarkar
2) a) Define the term “BIG DATA”
Explain why it is very important in the near future with two concrete innovative
applications from the BIG DATA
Describe the role of data mining research in the two innovations
Related presentation: http://www.youtube.com/watch?v=_HI5pLCFbu0
Ans- Big data refers to collections of extremely large and complex datasets that are created, transformed and updated at high speed, and that must also be analysed very quickly. Traditional database tools cannot handle such data, which comes in a wide variety of forms and structures; moreover, a majority of these datasets consist of unstructured data (pictures, videos and other multimedia files).
Why important?
Big data has had, and will continue to have, repercussions in all spheres of activity.
1) Big data has helped thousands of businesses evolve and grow drastically. Social networking giants like Facebook, Twitter and LinkedIn crunch vast amounts of data every day to give their users exactly what they are looking for. Yelp can suggest places to visit based on your location and user reviews.
2) Visualizations created by analysing big data obtained from censuses all over the world have helped health scientists and doctors analyse disease data better and come up with the most effective treatments and medicines. Patient records worldwide form a very large dataset.
To manage such enormous amounts of data, some companies have come up with a new kind of database known as NoSQL. It may store unstructured data in the form of documents, or in some proprietary form created by the company.
Role of data mining:
Data mining plays a key role, as retrieving useful information from these big data sets is a
very difficult task.
e.g. Facebook keeps track of all the posts one ‘likes’ and makes suggestions for friends or pages accordingly. Twitter has become a major playground for social data mining: one may search all the tweets with a particular hashtag and gauge the public mood about a particular event. LinkedIn can suggest jobs with exactly the skills you are looking for, from the hundreds of thousands of openings currently available.
In healthcare, UNESCO has collaborated with health organizations worldwide to collect and visualize data about HIV/AIDS. This has helped them take appropriate steps and implement solutions at the right time in the most affected areas.
b) Give a concise description on each of the following terms using a real-world data
preprocessing case: data cleaning, data integration, data reduction, data
transformation, data discretization
Data cleaning is the process of removing noise and inconsistent data.
Data integration is the process of combining data from disparate sources into a combined
view to represent some useful information. Most of the major database companies(IBM,
Oracle, Informatica) provide facilities for data integration.
Data transformation is the process where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data discretization is a form of data transformation, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
Data reduction is the process of obtaining a reduced representation of the dataset that is much smaller in volume yet produces (almost) the same analytical results. It also increases storage efficiency and reduces costs.
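The discretization step described above can be sketched in Python; the 10-year bin width and the youth/adult/senior cut-offs used here are illustrative assumptions, not fixed rules.

```python
def interval_label(age):
    """Replace a raw age by a 10-year interval label, e.g. 27 -> '21-30'."""
    if age <= 10:
        return "0-10"
    lo = (age - 1) // 10 * 10 + 1
    return f"{lo}-{lo + 9}"

def concept_label(age):
    """Replace a raw age by a higher-level concept label (assumed cut-offs)."""
    if age < 18:
        return "youth"
    if age < 65:
        return "adult"
    return "senior"

ages = [5, 27, 42, 70]
print([interval_label(a) for a in ages])  # ['0-10', '21-30', '41-50', '61-70']
print([concept_label(a) for a in ages])   # ['youth', 'adult', 'adult', 'senior']
```

Replacing the interval labels by the concept labels is exactly the recursive step that produces a concept hierarchy for the attribute.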
c) Describe various methods for handling missing values for some attributes in real
world data.
d) Use following methods to normalize the following group of data:
200, 300, 400, 600, 1000
Normalization is done to avoid dependence on the choice of measurement units. In general, an attribute measured in smaller units will have a larger ‘weight’. Hence we normalize or standardize the data to fall within a smaller, common range, usually [-1, 1] or [0.0, 1.0]. There are two common methods to normalize the data:
(1) min-max normalization by setting new.min = 0 and new.max = 1
In this method we find the normalized value by the formula
V’ = (V - min)/(max - min) * (new.max - new.min) + new.min
Since new.min and new.max are fixed, the factor (new.max - new.min) and the added term new.min are the same for every data value: the factor is 1 - 0 = 1 and the added term is 0.
So, V’ = (V - 200)/(1000 - 200) * 1 = (V - 200)/800
V       (V - 200)/800    Final value of V’
200     0/800            0
300     100/800          0.125
400     200/800          0.25
600     400/800          0.5
1000    800/800          1
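The table above can be reproduced with a short sketch that applies the min-max formula directly:

```python
# Min-max normalization with new_min = 0, new_max = 1, following
# V' = (V - min)/(max - min) * (new_max - new_min) + new_min.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max_normalize(data))  # [0.0, 0.125, 0.25, 0.5, 1.0]
```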
(2) z-score normalization
This data transformation technique uses the mean and standard deviation of the given set of values. A value Vi is normalized to Vi’ by the formula
Vi’ = (Vi – Mean) / Standard Deviation
Given dataset is {200, 300, 400, 600, 1000}
Mean = (200+300+400+600+1000)/5 = 500
Variance = ((200-500)^2 + (300-500)^2 + (400-500)^2 + (600-500)^2 + (1000-500)^2) / 5
= (90000 + 40000 + 10000 + 10000 + 250000)/5
= (4*10^5)/5
= 8*10^4
Standard deviation = sqrt(8*10^4) ≈ 282.84
Applying the formula stated above for Vi’, we have,

Value of Vi    (Vi - Mean)/Standard Deviation    Final Value
200            -300/282.84                       -1.061
300            -200/282.84                       -0.707
400            -100/282.84                       -0.354
600            100/282.84                        0.354
1000           500/282.84                        1.768
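The same calculation as a sketch, using the population standard deviation (dividing by n) as in the working above:

```python
import math

def z_score_normalize(values):
    """z-score normalization: V' = (V - mean) / std, population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

data = [200, 300, 400, 600, 1000]
print([round(z, 3) for z in z_score_normalize(data)])
# [-1.061, -0.707, -0.354, 0.354, 1.768]
```

Note that the z-scores of a dataset always sum to zero, which is a quick sanity check on the arithmetic.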
3) Find all association rules with s = 20% and a = 40% using the Apriori algorithm from the
following grocery store data set. Trace the results level by level and be sure to show the
candidates and large itemsets for each pass of database scan. Also indicate the association
rules that will be generated.
Ans- The Apriori algorithm builds candidate itemsets level by level, keeping at each level only the itemsets that meet the minimum support requirement (the large itemsets).
Level 1:
Candidate Sets Support
Bread 4
Jelly 1
Peanut Butter 3
Milk 2
Beer 2
All candidate sets meet the minimum support s = 20% (at least 1 of the 5 transactions), so all five form the L1 large itemset.
Level 2:
Candidate Sets Support
{Bread, Jelly} 1
{Bread, Peanut Butter} 3
{Bread, Milk} 1
{Bread, Beer} 1
{Jelly, Peanut Butter} 1
{Milk, Beer} 1
{Peanut Butter, Milk} 1
{Jelly, Beer} 0
{Peanut Butter, Beer} 0
{Jelly, Milk} 0
L2 Large itemset:
Dropping the candidate sets with support < 20% (minsup)
{Bread, Jelly}
{Bread, Peanut Butter}
{Bread, Milk}
{Bread, Beer}
{Jelly, Peanut Butter}
{Peanut Butter, Milk}
{Milk, Beer}
Level 3:
{Bread, Jelly, Peanut Butter} 1
{Bread, Peanut Butter, Milk} 1
{Bread, Milk, Beer} 0
L3 large itemset :
{Bread, Jelly, Peanut Butter}
{Bread, Peanut Butter, Milk}
Level 4:
{Bread, Peanut Butter, Jelly, Milk} 0
L4 candidate set does not fulfill minimum support. Hence, it is dropped.
Association Rules:
An association rule is of the form: X => Y
X => Y: if someone buys X, they also buy Y.
The confidence is the conditional probability that, given X is present in a transaction, Y will also be present.
Confidence measure, by definition:
Confidence, α (X=>Y) = support(X,Y) / support(X)
Abbreviations used below: Bread = Br, Peanut Butter = PB, Milk = M, Beer = B, Jelly = J. The confidence threshold is αT = 40%.
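The confidence checks that follow can be sketched directly from the support counts in the tables above (counts out of 5 transactions, using the same abbreviations):

```python
# Confidence from support counts: alpha(X => Y) = support(X u Y) / support(X).
# The counts below are read off the Level 1 and Level 2 tables above.
count = {
    frozenset({"Br"}): 4,
    frozenset({"PB"}): 3,
    frozenset({"Br", "PB"}): 3,
}

def confidence(lhs, rhs):
    return count[frozenset(lhs) | frozenset(rhs)] / count[frozenset(lhs)]

print(confidence({"Br"}, {"PB"}))  # 0.75, i.e. 75%: passes the 40% threshold
print(confidence({"PB"}, {"Br"}))  # 1.0, i.e. 100%: passes as well
```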
a) Check for Br => PB , α = Support of (Br U PB) / Support of (Br)
= 60/80 = 75% > αT
b) Check for PB => Br , α = Support of (Br U PB) / Support of (PB)
= 60/60 = 100% > αT
c) Check for Br => M , α = Support of (Br U M) / Support of (Br)
= 40/80 = 50% > αT
d) Check for M => Br , α = Support of (Br U M) / Support of (M)
= 40/40 = 100% > αT
e) Check for Br => B , α = Support of (Br U B) / Support of (Br)
= 20/80 = 25% < αT
f) Check for B => Br , α = Support of (Br U B) / Support of (B)
= 20/40 = 50% > αT
g) Check for M => B , α = Support of (M U B) / Support of (M)
= 20/40 = 50% > αT
h) Check for B => M , α = Support of (M U B) / Support of (B)
= 20/40 = 50% > αT
i) Check for Br => J , α = Support of (Br U J) / Support of (Br)
= 20/80 = 25% < αT
j) Check for J => Br , α = Support of (Br U J) / Support of (J)
= 20/20 = 100% > αT
k) Check for PB => J , α = Support of (PB U J) / Support of (PB)
= 20/60 = 33.3% < αT
l) Check for J => PB , α = Support of (PB U J) / Support of (J)
= 20/20 = 100% > αT
Similarly, for the pair {PB, M}: M => PB has α = Support of (PB U M) / Support of (M) = 20/40 = 50% > αT, while PB => M has α = 20/60 = 33.3% < αT
m) Check for {Br,PB} => {J} , α = Support of (BrU PB U J) / Support of {Br,PB}
= 20/60 = 33.33% < αT
n) Check for {J}=> {Br,PB} , α = Support of (BrU PB U J) / Support of {J}
= 20/20 = 100% > αT
o) Check for {Br,J} => {PB} , α = Support of (BrU PB U J) / Support of {Br,J}
= 20/20 = 100% > αT
p) Check for {PB} => {Br,J} , α = Support of (BrU PB U J) / Support of {PB}
= 20/60 = 33.33% < αT
q) Check for {PB,J} => {Br} , α = Support of (BrU PB U J) / Support of {PB,J}
= 20/20 = 100% > αT
r) Check for {Br} => {PB,J}, α = Support of (BrU PB U J) / Support of {Br}
= 20/80 = 25% < αT
s) Check for {Br,PB} => {M} , α = Support of (BrU PB U M) / Support of {Br,PB}
= 20/60 = 33.33% < αT
t) Check for {M}=> {Br,PB} , α = Support of (BrU PB U M) / Support of {M}
= 20/40 = 50% > αT
u) Check for {Br,M} => {PB} , α = Support of (BrU PB U M) / Support of {Br,M}
= 20/40 = 50% > αT
v) Check for {PB} => {Br,M} , α = Support of (BrU PB U M) / Support of {PB}
= 20/60 = 33.33% < αT
w) Check for {PB,M} => {Br} , α = Support of (BrU PB U M) / Support of {PB,M}
= 20/20 = 100% > αT
x) Check for {Br} => {PB,M}, α = Support of (BrU PB U M) / Support of {Br}
= 20/80 = 25% < αT
Final Association Rules:
Jelly => Bread
Bread => Peanut Butter
Peanut Butter => Bread
Milk => Bread
Beer => Bread
Jelly => Peanut Butter
Milk => Peanut Butter
Milk => Beer
Beer => Milk
Bread, Jelly => Peanut Butter
Jelly, Peanut Butter => Bread
Jelly => Bread, Peanut Butter
Bread, Milk => Peanut Butter
Milk, Peanut Butter => Bread
Milk => Bread, Peanut Butter
4) Consider the market basket transactions shown in the above table. Use the data set to
answer the questions listed below.
a) How many possible association rules can be extracted from this data (including rules that
have zero support)?
Left of rule    Right of rule    Combinations
1               1                5C1 * 4C1 = 5 * 4 = 20
1               2                5C1 * 4C2 = 5 * 6 = 30
1               3                5C1 * 4C3 = 5 * 4 = 20
1               4                5C1 * 4C4 = 5 * 1 = 5
2               1                5C2 * 3C1 = 10 * 3 = 30
2               2                5C2 * 3C2 = 10 * 3 = 30
2               3                5C2 * 3C3 = 10 * 1 = 10
3               1                5C3 * 2C1 = 10 * 2 = 20
3               2                5C3 * 2C2 = 10 * 1 = 10
4               1                5C4 * 1C1 = 5 * 1 = 5
Total = 180
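The total can be cross-checked in code: the sum below reproduces the table row by row, and the closed form 3^d - 2^(d+1) + 1 gives the same answer.

```python
from math import comb

# Number of possible association rules over d = 5 items: choose a nonempty
# left side of size L, then a nonempty right side from the remaining items.
d = 5
by_table = sum(comb(d, L) * comb(d - L, r)
               for L in range(1, d)
               for r in range(1, d - L + 1))
closed_form = 3 ** d - 2 ** (d + 1) + 1
print(by_table, closed_form)  # 180 180
```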
b) Given the transactions shown above, what is the largest size of an item set we can
extract?
The largest itemset we can extract that still meets the minimum support is of size 3 (e.g., {Bread, Jelly, Peanut Butter}).
c) What is the maximum number of size-3 itemsets that can be derived from this data
set?
The number of items possible is 5.
Hence number of size-3 itemsets possible
= 5C3 = 5!/(5-3)!3! = 5!/3!2! = 10 itemsets of size 3
d) Which item set (of size 2 or larger) has the largest support?
The itemset {Bread, Peanut Butter} has the largest support, s = 60%
e) From this data set, can we find a pair of association rules, A => B and B => A that have
the same confidence?
Yes, there is one pair of association rules that has the same confidence value in both directions.
The itemset is {Beer, Milk}
Association Rules are,
Milk => Beer: 1/2 = 50%
Beer => Milk: 1/2 = 50%
5) A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
For all X in transaction: buys (X, item1) ^ buys (X, item2) => buys (X, item3) [s, c]
TID Items
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y }
T300 { M, A, K, E}
T400 { M, U, C, K, Y}
T500 { C, O, O, K, I, E}
Ans-
Apriori:
The Apriori algorithm iteratively generates candidate itemsets and keeps the ones satisfying the minimum support.
Min. support = 60%
Min. confidence = 80%
Level 1:
Candidate Sets Support
K 5
E 4
M 3
O 3
Y 3
N 2
C 2
A 1
D 1
U 1
I 1
L1 Large Itemsets : K, E, M, O, Y
Level 2:
{K,E} 4
{K,M} 3
{K,O} 3
{K,Y} 3
{E,M} 2
{E,O} 3
{E,Y} 2
{M,O} 1
{M,Y} 2
{O,Y} 2
L2 Large Itemsets = {K,E},{K,M},{K,O},{K,Y},{E,O}
Level 3
{K, E, O} 3
L3 Large Itemsets = {K,E,O}
{K,E,O} is the only possible candidate set at Level 3, as every other level-3 candidate contains a level-2 subset already shown to be infrequent in the previous database scan.
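The frequent itemsets found above can be verified by brute-force enumeration; this sketch is not the level-wise Apriori pruning itself, but on a dataset this small it must agree with the trace.

```python
from itertools import combinations

# Count support for every candidate itemset and keep those appearing in
# at least 3 of the 5 transactions (min_sup = 60%).
transactions = [
    set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE"),
]
items = sorted(set().union(*transactions))
min_count = 3

frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        support = sum(set(cand) <= t for t in transactions)
        if support >= min_count:
            frequent[frozenset(cand)] = support

print(len(frequent))               # 11 frequent itemsets: 5 + 5 + 1, as above
print(frequent[frozenset("KEO")])  # 3, matching the L3 large itemset
```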
FP Growth:
F-List (item : support): K : 5, E : 4, M : 3, O : 3, Y : 3
FPDB (transactions reordered by the F-List, ready for FP-tree construction):
T100: {K, E, M, O, Y}
T200: {K, E, O, Y}
T300: {K, E, M}
T400: {K, M, Y}
T500: {K, E, O}
Item    Conditional Pattern Base                  Frequent Itemsets Generated
Y       {K,E,M,O: 1}, {K,E,O: 1}, {K,M: 1}        {K,Y: 3}
O       {K,E,M: 1}, {K,E: 1}, {K,E: 1}            {K,O: 3}, {E,O: 3}, {K,E,O: 3}
M       {K,E: 1}, {K,E: 1}, {K: 1}                {K,M: 3}
E       {K: 4}                                    {K,E: 4}
(Each conditional FP-tree is a small prefix tree rooted at K, built from the corresponding pattern base.)
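The conditional pattern bases in the table can be reproduced by taking, for each item, the prefix of every ordered transaction that contains it; this is a sketch over the FPDB listed above rather than a full FP-tree implementation.

```python
# Ordered transactions (FPDB) in F-List order K, E, M, O, Y.
fpdb = [
    ["K", "E", "M", "O", "Y"],  # T100
    ["K", "E", "O", "Y"],       # T200
    ["K", "E", "M"],            # T300
    ["K", "M", "Y"],            # T400
    ["K", "E", "O"],            # T500
]

def conditional_pattern_base(item):
    """Prefix paths preceding `item` in each transaction containing it."""
    return [tuple(t[:t.index(item)]) for t in fpdb if item in t]

print(conditional_pattern_base("Y"))
# [('K', 'E', 'M', 'O'), ('K', 'E', 'O'), ('K', 'M')]
print(conditional_pattern_base("E"))
# [('K',), ('K',), ('K',), ('K',)]
```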
The itemsets generated by FP-growth are the same as those from Apriori.
The differences lie in the following:
Database scans: Apriori needs to scan the database repeatedly to count the support of each level's candidates. FP-growth needs just two scans: one to identify the frequent items and a second to build the FP-tree; the later stages work on the tree and its conditional projections.
Candidate generation: the Apriori algorithm generates an exponential number of candidate sets, and the self-join step of candidate generation is itself expensive. The FP-growth algorithm does not generate any candidate sets.
(b) List all the strong association rules (with support s and confidence c) matching the
following metarule, where X is a variable representing customers, and item i denotes
variables representing items (e.g., “A,” “B,”):
For this part, a strong rule of the form given below is required:
item1 ^ item2 => item3
To generate this kind of rule we need an itemset of size 3 or greater. With such itemsets we can also form rules of the type item1 => item2 ^ item3, but we are only interested in the form mentioned in the problem statement. Since there is only one itemset of size 3 in the example above, we have,
As before, Large Itemset = {K,E,O}
Association Rules    Confidence
K, E => O            3/4 = 75%
K, O => E            3/3 = 100%
E, O => K            3/3 = 100%
Association Rules:
K, O => E
E, O => K
So the association rules satisfying the given metarule are,
Buys(X, K) ^ Buys(X, O) => Buys(X, E) [60%, 100%]
Buys(X, E) ^ Buys(X, O) => Buys(X, K) [60%, 100%]
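The support and confidence of the two rules can be checked against the transaction table from part (a):

```python
# Support = fraction of the 5 transactions containing the itemset;
# confidence = support(LHS u RHS) / support(LHS).
transactions = [
    set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE"),
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

for lhs, rhs in [({"K", "O"}, {"E"}), ({"E", "O"}, {"K"})]:
    s = support(lhs | rhs)
    c = s / support(lhs)
    print(sorted(lhs), "=>", sorted(rhs), f"[{s:.0%}, {c:.0%}]")
# ['K', 'O'] => ['E'] [60%, 100%]
# ['E', 'O'] => ['K'] [60%, 100%]
```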