CSC 177 Assignment 1 Chetan Nagarkar
2) a) Define the term “BIG DATA”
Explain why it is very important in the near future with two concrete innovative
applications from the BIG DATA
Describe the role of data mining research in the two innovations
Related presentation: http://www.youtube.com/watch?v=_HI5pLCFbu0
Ans- Big data refers to collections of extremely large and complex datasets that are created, transformed and updated at high speed, and that must also be analysed very quickly. Traditional database tools cannot handle such data, which comes in a wide variety of forms and structures; moreover, a majority of these datasets consist of unstructured data (pictures, videos and other multimedia files).
Why important?
Big data has had, and will continue to have, repercussions in all spheres of activity.
1) Big data has helped thousands of businesses evolve and grow drastically. Social networking giants like Facebook, Twitter and LinkedIn crunch vast amounts of data every day to give their users exactly what they are looking for. Yelp can suggest places to visit based on your location and user reviews.
2) Visualizations created by analysing big data obtained from censuses all over the world have helped health scientists and doctors analyse disease data better and come up with the most effective treatments and medicines. Patient records worldwide form a very large dataset.
To manage such enormous amounts of data, some companies have come up with a new kind of database known as NoSQL. It may store unstructured data in the form of documents, or in some proprietary form created by the company.
Role of data mining:
Data mining plays a key role, as retrieving useful information from these big data sets is a
very difficult task.
e.g. Facebook keeps track of all the posts one ‘likes’ and makes suggestions for friends or pages accordingly. Twitter has become a major playground for social data mining: one may search all the tweets with a particular hashtag and gauge the public mood about a particular event. LinkedIn can suggest jobs with exactly the skills you are looking for, from the hundreds of thousands of openings currently available.
In healthcare, UNESCO has collaborated with health organizations worldwide to collect and visualize data about HIV/AIDS. This has helped them take appropriate steps and implement solutions at the right time in the most affected areas.
b) Give a concise description on each of the following terms using a real-world data
preprocessing case: data cleaning, data integration, data reduction, data
transformation, data discretization
Data cleaning is the process of removing noise and inconsistent data.
Data integration is the process of combining data from disparate sources into a combined
view to represent some useful information. Most of the major database companies(IBM,
Oracle, Informatica) provide facilities for data integration.
Data transformation is the process where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data discretization is a form of data transformation, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
Data reduction is the process of obtaining a reduced representation of the dataset that is much smaller in volume yet produces (almost) the same analytical results. It also increases storage efficiency and reduces costs.
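The discretization step described above can be sketched in Python; the 10-year bin width and the youth/adult/senior cut-offs used here are illustrative assumptions, not fixed rules.

```python
def interval_label(age):
    """Replace a raw age by a 10-year interval label, e.g. 27 -> '21-30'."""
    if age <= 10:
        return "0-10"
    lo = (age - 1) // 10 * 10 + 1
    return f"{lo}-{lo + 9}"

def concept_label(age):
    """Replace a raw age by a higher-level concept label (assumed cut-offs)."""
    if age < 18:
        return "youth"
    if age < 65:
        return "adult"
    return "senior"

ages = [5, 27, 42, 70]
print([interval_label(a) for a in ages])  # ['0-10', '21-30', '41-50', '61-70']
print([concept_label(a) for a in ages])   # ['youth', 'adult', 'adult', 'senior']
```

Replacing the interval labels by the concept labels is exactly the recursive step that produces a concept hierarchy for the attribute.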
c) Describe various methods for handling missing values for some attributes in real
world data.
d) Use following methods to normalize the following group of data:
200, 300, 400, 600, 1000
Normalization is done to avoid dependence on the choice of measurement units. In general, an attribute measured in smaller units will have a larger ‘weight’. Hence we normalize or standardize the data to fall within a smaller, common range, usually [-1, 1] or [0.0, 1.0]. There are two common methods to normalize the data:
(1) min-max normalization by setting new.min = 0 and new.max = 1
In this method we find the normalized value by the formula
V’ = (V - min)/(max - min) * (new.max - new.min) + new.min
Since new.min and new.max are fixed, the factor (new.max - new.min) and the added term new.min are the same for every data value: the factor is 1 - 0 = 1 and the added term is 0.
So, V’ = (V - 200)/(1000 - 200) * 1 = (V - 200)/800
V       (V - 200)/800    Final value of V’
200     0/800            0
300     100/800          0.125
400     200/800          0.25
600     400/800          0.5
1000    800/800          1
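The table above can be reproduced with a short sketch that applies the min-max formula directly:

```python
# Min-max normalization with new_min = 0, new_max = 1, following
# V' = (V - min)/(max - min) * (new_max - new_min) + new_min.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max_normalize(data))  # [0.0, 0.125, 0.25, 0.5, 1.0]
```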
(2) z-score normalization
This data transformation technique uses the mean and standard deviation of the given set of values. A value Vi is normalized to Vi’ by the formula
Vi’ = (Vi – Mean) / Standard Deviation
Given dataset is {200, 300, 400, 600, 1000}
Mean = (200+300+400+600+1000)/5 = 500
Variance = ((200-500)^2 + (300-500)^2 + (400-500)^2 + (600-500)^2 + (1000-500)^2) / 5
= (90000 + 40000 + 10000 + 10000 + 250000)/5
= (4*10^5)/5
= 8*10^4
Standard deviation = sqrt(8*10^4) ≈ 282.84
Applying the formula stated above for Vi’, we have,

Value of Vi    (Vi - Mean)/Standard Deviation    Final Value
200            -300/282.84                       -1.061
300            -200/282.84                       -0.707
400            -100/282.84                       -0.354
600            100/282.84                        0.354
1000           500/282.84                        1.768
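The same calculation as a sketch, using the population standard deviation (dividing by n) as in the working above:

```python
import math

def z_score_normalize(values):
    """z-score normalization: V' = (V - mean) / std, population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

data = [200, 300, 400, 600, 1000]
print([round(z, 3) for z in z_score_normalize(data)])
# [-1.061, -0.707, -0.354, 0.354, 1.768]
```

Note that the z-scores of a dataset always sum to zero, which is a quick sanity check on the arithmetic.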
3) Find all association rules with s = 20% and a = 40% using the Apriori algorithm from the
following grocery store data set. Trace the results level by level and be sure to show the
candidates and large itemsets for each pass of database scan. Also indicate the association
rules that will be generated.
Ans- The Apriori algorithm builds candidate itemsets level by level, keeping at each level only the itemsets that meet the minimum support requirement (the large itemsets).
Level 1:
Candidate Sets Support
Bread 4
Jelly 1
Peanut Butter 3
Milk 2
Beer 2
All candidate sets meet the minimum support s = 20% (at least 1 of the 5 transactions), so all five form the L1 large itemset.
Level 2:
Candidate Sets Support
{Bread, Jelly} 1
{Bread, Peanut Butter} 3
{Bread, Milk} 1
{Bread, Beer} 1
{Jelly, Peanut Butter} 1
{Milk, Beer} 1
{Peanut Butter, Milk} 1
{Jelly, Beer} 0
{Peanut Butter, Beer} 0
{Jelly, Milk} 0
L2 Large itemset:
Dropping the candidate sets with support < 20% (minsup)
{Bread, Jelly}
{Bread, Peanut Butter}
{Bread, Milk}
{Bread, Beer}
{Jelly, Peanut Butter}
{Peanut Butter, Milk}
{Milk, Beer}
Level 3:
{Bread, Jelly, Peanut Butter} 1
{Bread, Peanut Butter, Milk} 1
{Bread, Milk, Beer} 0
L3 large itemset :
{Bread, Jelly, Peanut Butter}
{Bread, Peanut Butter, Milk}
Level 4:
{Bread, Peanut Butter, Jelly, Milk} 0
L4 candidate set does not fulfill minimum support. Hence, it is dropped.
Association Rules:
An association rule is of the form: X => Y
X => Y: if someone buys X, they also buy Y.
The confidence is the conditional probability that, given X is present in a transaction, Y will also be present.
Confidence measure, by definition:
Confidence, α (X=>Y) = support(X,Y) / support(X)
Abbreviations used below: Bread = Br, Peanut Butter = PB, Milk = M, Beer = B, Jelly = J. The confidence threshold is αT = 40%.
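The confidence checks that follow can be sketched directly from the support counts in the tables above (counts out of 5 transactions, using the same abbreviations):

```python
# Confidence from support counts: alpha(X => Y) = support(X u Y) / support(X).
# The counts below are read off the Level 1 and Level 2 tables above.
count = {
    frozenset({"Br"}): 4,
    frozenset({"PB"}): 3,
    frozenset({"Br", "PB"}): 3,
}

def confidence(lhs, rhs):
    return count[frozenset(lhs) | frozenset(rhs)] / count[frozenset(lhs)]

print(confidence({"Br"}, {"PB"}))  # 0.75, i.e. 75%: passes the 40% threshold
print(confidence({"PB"}, {"Br"}))  # 1.0, i.e. 100%: passes as well
```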
a) Check for Br => PB , α = Support of (Br U PB) / Support of (Br)
= 60/80 = 75% > αT
b) Check for PB => Br , α = Support of (Br U PB) / Support of (PB)
= 60/60 = 100% > αT
c) Check for Br => M , α = Support of (Br U M) / Support of (Br)
= 40/80 = 50% > αT
d) Check for M => Br , α = Support of (Br U M) / Support of (M)
= 40/40 = 100% > αT
e) Check for Br => B , α = Support of (Br U B) / Support of (Br)
= 20/80 = 25% < αT
f) Check for B => Br , α = Support of (Br U B) / Support of (B)
= 20/40 = 50% > αT
g) Check for M => B , α = Support of (M U B) / Support of (M)
= 20/40 = 50% > αT
h) Check for B => M , α = Support of (M U B) / Support of (B)
= 20/40 = 50% > αT
i) Check for Br => J , α = Support of (Br U J) / Support of (Br)
= 20/80 = 25% < αT
j) Check for J => Br , α = Support of (Br U J) / Support of (J)
= 20/20 = 100% > αT
k) Check for PB => J , α = Support of (PB U J) / Support of (PB)
= 20/60 = 33.3% < αT
l) Check for J => PB , α = Support of (PB U J) / Support of (J)
= 20/20 = 100% > αT
Similarly, for the pair {PB, M}: M => PB has α = Support of (PB U M) / Support of (M) = 20/40 = 50% > αT, while PB => M has α = 20/60 = 33.3% < αT
m) Check for {Br,PB} => {J} , α = Support of (BrU PB U J) / Support of {Br,PB}
= 20/60 = 33.33% < αT
n) Check for {J}=> {Br,PB} , α = Support of (BrU PB U J) / Support of {J}
= 20/20 = 100% > αT
o) Check for {Br,J} => {PB} , α = Support of (BrU PB U J) / Support of {Br,J}
= 20/20 = 100% > αT
p) Check for {PB} => {Br,J} , α = Support of (BrU PB U J) / Support of {PB}
= 20/60 = 33.33% < αT
q) Check for {PB,J} => {Br} , α = Support of (BrU PB U J) / Support of {PB,J}
= 20/20 = 100% > αT
r) Check for {Br} => {PB,J}, α = Support of (BrU PB U J) / Support of {Br}
= 20/80 = 25% < αT
s) Check for {Br,PB} => {M} , α = Support of (BrU PB U M) / Support of {Br,PB}
= 20/60 = 33.33% < αT
t) Check for {M}=> {Br,PB} , α = Support of (BrU PB U M) / Support of {M}
= 20/40 = 50% > αT
u) Check for {Br,M} => {PB} , α = Support of (BrU PB U M) / Support of {Br,M}
= 20/40 = 50% > αT
v) Check for {PB} => {Br,M} , α = Support of (BrU PB U M) / Support of {PB}
= 20/60 = 33.33% < αT
w) Check for {PB,M} => {Br} , α = Support of (BrU PB U M) / Support of {PB,M}
= 20/20 = 100% > αT
x) Check for {Br} => {PB,M}, α = Support of (BrU PB U M) / Support of {Br}
= 20/80 = 25% < αT
Final Association Rules:
Jelly => Bread
Bread => Peanut Butter
Peanut Butter => Bread
Milk => Bread
Beer => Bread
Jelly => Peanut Butter
Milk => Peanut Butter
Milk => Beer
Beer => Milk
Bread, Jelly => Peanut Butter
Jelly, Peanut Butter => Bread
Jelly => Bread, Peanut Butter
Bread, Milk => Peanut Butter
Milk, Peanut Butter => Bread
Milk => Bread, Peanut Butter
4) Consider the market basket transactions shown in the above table. Use the data set to
answer the questions listed below.
a) How many possible association rules can be extracted from this data (including rules that
have zero support)?
Left of rule    Right of rule    Combinations
1               1                5C1 * 4C1 = 5 * 4 = 20
1               2                5C1 * 4C2 = 5 * 6 = 30
1               3                5C1 * 4C3 = 5 * 4 = 20
1               4                5C1 * 4C4 = 5 * 1 = 5
2               1                5C2 * 3C1 = 10 * 3 = 30
2               2                5C2 * 3C2 = 10 * 3 = 30
2               3                5C2 * 3C3 = 10 * 1 = 10
3               1                5C3 * 2C1 = 10 * 2 = 20
3               2                5C3 * 2C2 = 10 * 1 = 10
4               1                5C4 * 1C1 = 5 * 1 = 5
Total = 180
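The total can be cross-checked in code: the sum below reproduces the table row by row, and the closed form 3^d - 2^(d+1) + 1 gives the same answer.

```python
from math import comb

# Number of possible association rules over d = 5 items: choose a nonempty
# left side of size L, then a nonempty right side from the remaining items.
d = 5
by_table = sum(comb(d, L) * comb(d - L, r)
               for L in range(1, d)
               for r in range(1, d - L + 1))
closed_form = 3 ** d - 2 ** (d + 1) + 1
print(by_table, closed_form)  # 180 180
```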
b) Given the transactions shown above, what is the largest size of an item set we can
extract?
The largest itemset we can extract that still meets the minimum support is of size 3 (e.g., {Bread, Jelly, Peanut Butter}).
c) What is the maximum number of size-3 itemsets that can be derived from this data
set?
The number of items possible is 5.
Hence number of size-3 itemsets possible
= 5C3 = 5!/(5-3)!3! = 5!/3!2! = 10 itemsets of size 3
d) Which item set (of size 2 or larger) has the largest support?
The itemset {Bread, Peanut Butter} has the largest support, s = 60%
e) From this data set, can we find a pair of association rules, A => B and B => A that have
the same confidence?
Yes, there is one pair of association rules that has the same confidence value in both directions.
The itemset is {Beer, Milk}
Association Rules are,
Milk => Beer: 1/2 = 50%
Beer => Milk: 1/2 = 50%
5) A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
For all X in transaction: buys (X, item1) ^ buys (X, item2) => buys (X, item3) [s, c]
TID Items
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y }
T300 { M, A, K, E}
T400 { M, U, C, K, Y}
T500 { C, O, O, K, I, E}
Ans-
Apriori:
The Apriori algorithm iteratively generates candidate itemsets and keeps the ones satisfying the minimum support.
Min. support = 60%
Min. confidence = 80%
Level 1:
Candidate Sets Support
K 5
E 4
M 3
O 3
Y 3
N 2
C 2
A 1
D 1
U 1
I 1
L1 Large Itemsets : K, E, M, O, Y
Level 2:
{K,E} 4
{K,M} 3
{K,O} 3
{K,Y} 3
{E,M} 2
{E,O} 3
{E,Y} 2
{M,O} 1
{M,Y} 2
{O,Y} 2
L2 Large Itemsets = {K,E},{K,M},{K,O},{K,Y},{E,O}
Level 3
{K, E, O} 3
L3 Large Itemsets = {K,E,O}
{K,E,O} is the only possible candidate set at Level 3, as every other level-3 candidate contains a level-2 subset already shown to be infrequent in the previous database scan.
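The frequent itemsets found above can be verified by brute-force enumeration; this sketch is not the level-wise Apriori pruning itself, but on a dataset this small it must agree with the trace.

```python
from itertools import combinations

# Count support for every candidate itemset and keep those appearing in
# at least 3 of the 5 transactions (min_sup = 60%).
transactions = [
    set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE"),
]
items = sorted(set().union(*transactions))
min_count = 3

frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        support = sum(set(cand) <= t for t in transactions)
        if support >= min_count:
            frequent[frozenset(cand)] = support

print(len(frequent))               # 11 frequent itemsets: 5 + 5 + 1, as above
print(frequent[frozenset("KEO")])  # 3, matching the L3 large itemset
```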
FP Growth:
F-List (item : support): K : 5, E : 4, M : 3, O : 3, Y : 3
FPDB (transactions reordered by the F-List, ready for FP-tree construction):
T100: {K, E, M, O, Y}
T200: {K, E, O, Y}
T300: {K, E, M}
T400: {K, M, Y}
T500: {K, E, O}
Item    Conditional Pattern Base                  Frequent Itemsets Generated
Y       {K,E,M,O: 1}, {K,E,O: 1}, {K,M: 1}        {K,Y: 3}
O       {K,E,M: 1}, {K,E: 1}, {K,E: 1}            {K,O: 3}, {E,O: 3}, {K,E,O: 3}
M       {K,E: 1}, {K,E: 1}, {K: 1}                {K,M: 3}
E       {K: 4}                                    {K,E: 4}
(Each conditional FP-tree is a small prefix tree rooted at K, built from the corresponding pattern base.)
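The conditional pattern bases in the table can be reproduced by taking, for each item, the prefix of every ordered transaction that contains it; this is a sketch over the FPDB listed above rather than a full FP-tree implementation.

```python
# Ordered transactions (FPDB) in F-List order K, E, M, O, Y.
fpdb = [
    ["K", "E", "M", "O", "Y"],  # T100
    ["K", "E", "O", "Y"],       # T200
    ["K", "E", "M"],            # T300
    ["K", "M", "Y"],            # T400
    ["K", "E", "O"],            # T500
]

def conditional_pattern_base(item):
    """Prefix paths preceding `item` in each transaction containing it."""
    return [tuple(t[:t.index(item)]) for t in fpdb if item in t]

print(conditional_pattern_base("Y"))
# [('K', 'E', 'M', 'O'), ('K', 'E', 'O'), ('K', 'M')]
print(conditional_pattern_base("E"))
# [('K',), ('K',), ('K',), ('K',)]
```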
The itemsets generated by FP-growth are the same as those from Apriori.
The differences lie in the following:
Database scans: Apriori needs to scan the database repeatedly to count the support of each level's candidates. FP-growth needs just two scans: one to identify the frequent items and a second to build the FP-tree; the later stages work on the tree and its conditional projections.
Candidate generation: the Apriori algorithm generates an exponential number of candidate sets, and the self-join step of candidate generation is itself expensive. The FP-growth algorithm does not generate any candidate sets.
(b) List all the strong association rules (with support s and confidence c) matching the
following metarule, where X is a variable representing customers, and item i denotes
variables representing items (e.g., “A,” “B,”):
For this part, a strong rule of the form given below is required:
item1 ^ item2 => item3
To generate this kind of rule we need an itemset of size 3 or greater. With such itemsets we can also form rules of the type item1 => item2 ^ item3, but we are only interested in the form mentioned in the problem statement. Since there is only one itemset of size 3 in the example above, we have,
As before, Large Itemset = {K,E,O}
Association Rules    Confidence
K, E => O            3/4 = 75%
K, O => E            3/3 = 100%
E, O => K            3/3 = 100%
Association Rules:
K, O => E
E, O => K
So the association rules satisfying the given metarule are,
Buys(X, K) ^ Buys(X, O) => Buys(X, E) [60%, 100%]
Buys(X, E) ^ Buys(X, O) => Buys(X, K) [60%, 100%]
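The support and confidence of the two rules can be checked against the transaction table from part (a):

```python
# Support = fraction of the 5 transactions containing the itemset;
# confidence = support(LHS u RHS) / support(LHS).
transactions = [
    set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE"),
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

for lhs, rhs in [({"K", "O"}, {"E"}), ({"E", "O"}, {"K"})]:
    s = support(lhs | rhs)
    c = s / support(lhs)
    print(sorted(lhs), "=>", sorted(rhs), f"[{s:.0%}, {c:.0%}]")
# ['K', 'O'] => ['E'] [60%, 100%]
# ['E', 'O'] => ['K'] [60%, 100%]
```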