
ADVANCED DATABASES

AND DATA MINING


CSCI-527
PROJECT REPORT PRESENTATION

ANALYSIS ON
AUTOMOBILE DATASET
TEAM MEMBERS:
Anusha Vadlamudi Narasimha Rao 50134597
Deepthi Chidura 50129270
Namitha Yellokonda 50126906
Shravya Beerakayala 50124534

Abstract

The goal of a data mining process is to extract information from a dataset and transform it into a format that can be used for any purpose in the field concerned.

We have examined the Auto dataset, in which the performance of various cars is analyzed based on attributes such as mpg, cylinders, displacement, horsepower, weight, acceleration, year, and origin.

We have analyzed the data using the Apriori algorithm, a data mining technique.

INTRODUCTION
This Auto dataset contains the car model, mpg (miles per gallon), cylinders, displacement, horsepower, weight, acceleration, and origin.

Using the Apriori algorithm, support and confidence allow us to generate important decisions: how these attributes work as combined factors, and how that combination points to ways of increasing performance.

The minimum support and confidence values are set according to the application; the itemsets that satisfy this criterion are retained, to finally find out which attributes together satisfy the confidence and support thresholds.

DATA OF AUTO:

ATTRIBUTES AND DESCRIPTION

MPG: miles per gallon (fuel efficiency)
CYLINDERS: number of engine cylinders
DISPLACEMENT: engine displacement
HORSEPOWER: engine horsepower
WEIGHT: vehicle weight
ACCELERATION: time to accelerate (0 to 60 mph)
YEAR: model year
ORIGIN: region of origin (1 = American, 2 = European, 3 = Japanese)
NAME: car model name
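
Apriori operates on transactions of items rather than on numeric columns, so each car record is treated as one transaction whose items are that car's attribute values. A minimal sketch of this reading step (the file name here is illustrative, not taken from the project):

import csv

# Minimal sketch: read the CSV so that each car record becomes one
# "transaction" whose items are that car's attribute values.
# 'Auto.csv' is an assumed, illustrative file name.
with open('Auto.csv') as fin:
    transactions = [row for row in csv.reader(fin)]
print(transactions[0])  # one car as a list of attribute-value items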

APRIORI ALGORITHM
The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules that have support and confidence greater than a minimum support (min-sup) and a minimum confidence (min-conf), respectively.

The problem of discovering all association rules can be broken down into two parts:

Find all sets of items that have support greater than the minimum support. These itemsets are called large itemsets.

Use the large itemsets to generate the desired rules.

Two factors affect the significance of association rules:

Support: The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.

Confidence: The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
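
As a minimal sketch in Python (toy transactions, not the Auto dataset), both measures can be computed directly from these definitions:

# Toy example: support and confidence of the rule X -> Y.
transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
X, Y = {'a'}, {'b'}

n = len(transactions)
count_xy = sum(1 for t in transactions if (X | Y) <= t)  # transactions containing X ∪ Y
count_x = sum(1 for t in transactions if X <= t)         # transactions containing X

support = count_xy / n           # 3/5 = 0.6
confidence = count_xy / count_x  # 3/4 = 0.75
print(support, confidence)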

PSEUDO CODE
L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);  // new candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);  // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};
end
answer = ∪k Lk;
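
For example, given the three transactions {a, b}, {a, c}, {a, b, c} and minsup = 2, the first pass gives L1 = {{a}, {b}, {c}} with counts 3, 2, 2. apriori-gen joins L1 into C2 = {{a, b}, {a, c}, {b, c}}; counting over D gives L2 = {{a, b}, {a, c}} (counts 2 and 2), while {b, c} (count 1) is pruned. C3 = {{a, b, c}} has count 1, so L3 is empty, the loop terminates, and the answer is L1 ∪ L2.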

DATA CLEANING
Unclean data refers to data that contains erroneous information.

The term may also be used for data that is in memory and not yet loaded into a database. There are some missing values in the data fields of this dataset.

UNCLEAN DATA

CODE FOR DATA CLEANING


autoData <- read.csv(file = "~/Documents//data//Auto.csv", header = TRUE)
# coerce horsepower to character so the "?" placeholders can be matched
horsepwr <- as.character(autoData$horsepower)
# flag each missing value ("?") with 0
horsepwr <- ifelse(horsepwr == "?", 0, horsepwr)

After data cleaning, the records with these flagged fields are removed.

PYTHON CODE FOR APRIORI ALGORITHM


import csv

def create_candidate_keys(data, verbose=False):
    """Collect every distinct item as a candidate 1-itemset (C1)."""
    can_keys = []
    for transac in data:
        for item in transac:
            if [item] not in can_keys:
                can_keys.append([item])
    can_keys.sort()
    # frozensets can serve as dictionary keys for the support counts
    return list(map(frozenset, can_keys))

def back_prune(data, candidates, min_support, verbose=False):
    """Count candidate occurrences and keep those meeting min_support."""
    sscount = {}
    for tid in data:
        for candidate in candidates:
            if candidate.issubset(tid):
                sscount.setdefault(candidate, 0)
                sscount[candidate] += 1
    num_items = float(len(data))
    ret_list = []
    supporting_data = {}
    for key in sscount:
        support = sscount[key] / num_items
        if support >= min_support:
            ret_list.insert(0, key)
            supporting_data[key] = support
    if verbose:
        for key in ret_list:
            print("{" + ", ".join(str(i) for i in key) + "}"
                  + ": supp = " + str(round(supporting_data[key], 3)))
    return ret_list, supporting_data

def apriori_generation(frequency_sets, key):
    """Join step: merge (key-1)-itemsets sharing their first key-2 items."""
    returnList = []
    lenLk = len(frequency_sets)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            a = sorted(frequency_sets[i])
            b = sorted(frequency_sets[j])
            if a[:key - 2] == b[:key - 2]:
                returnList.append(frequency_sets[i] | frequency_sets[j])
    return returnList

def apriori_generation_algo(data, min_support=0.3, verbose=False):
    """Level-wise search for all frequent itemsets."""
    can_keys = create_candidate_keys(data)
    D_map = list(map(set, data))  # a list, so it can be scanned repeatedly
    F1, supporting_data = back_prune(D_map, can_keys, min_support)
    F = [F1]
    key = 2
    while len(F[key - 2]) > 0:
        candidate_keys = apriori_generation(F[key - 2], key)
        F_key, support_K = back_prune(D_map, candidate_keys, min_support)
        supporting_data.update(support_K)
        F.append(F_key)
        key += 1
    if verbose:
        for kset in F:
            for item in kset:
                print("{" + ", ".join(str(i) for i in item) + "}"
                      + ": supp = " + str(round(supporting_data[item], 3)))
    return F, supporting_data

def cal_conf(frequency_set, H, supporting_data, rules, min_confidence=0.9,
             verbose=False):
    """Keep the consequents in H whose rules meet min_confidence."""
    pruned_H = []
    for consequence in H:
        confidence = (supporting_data[frequency_set]
                      / supporting_data[frequency_set - consequence])
        if confidence >= min_confidence:
            rules.append((frequency_set - consequence, consequence, confidence))
            pruned_H.append(consequence)
            if verbose:
                print("{" + ", ".join(str(i) for i in frequency_set - consequence) + "}"
                      + " --> "
                      + "{" + ", ".join(str(i) for i in consequence) + "}"
                      + ": conf = " + str(round(confidence, 3))
                      + ", supp = " + str(round(supporting_data[frequency_set], 3)))
    return pruned_H

def rules_from_conseq(frequency_set, H, supporting_data, rules,
                      min_confidence=0.9, verbose=False):
    """Recursively grow rule consequents for one frequent itemset."""
    m = len(H[0])
    if m == 1:
        cal_conf(frequency_set, H, supporting_data, rules, min_confidence, verbose)
    if len(frequency_set) > (m + 1):
        Hmp1 = apriori_generation(H, m + 1)
        Hmp1 = cal_conf(frequency_set, Hmp1, supporting_data, rules,
                        min_confidence, verbose)
        if len(Hmp1) > 1:
            rules_from_conseq(frequency_set, Hmp1, supporting_data, rules,
                              min_confidence, verbose)

def gen_rules(F, supporting_data, min_confidence=0.9, verbose=True):
    """Generate association rules from itemsets of size 2 and up.
    (The loop body was truncated in the source; this reconstruction
    follows the pattern implied by rules_from_conseq above.)"""
    rules = []
    for i in range(1, len(F)):
        for frequency_set in F[i]:
            H = [frozenset([item]) for item in frequency_set]
            rules_from_conseq(frequency_set, H, supporting_data, rules,
                              min_confidence, verbose)
    return rules

def import_data():
    with open('C:/Users/Anusha/Desktop/Auto_clean_data.csv', "r") as fin:
        data = [row for row in csv.reader(fin.read().splitlines())]
    return data

data = import_data()
D_map = list(map(set, data))
can_keys = create_candidate_keys(data, verbose=True)
# first pass shown separately, then the full level-wise run
F1, supporting_data = back_prune(D_map, can_keys, 0.3, verbose=True)
F, supporting_data = apriori_generation_algo(data, min_support=0.05,
                                             verbose=True)
H = gen_rules(F, supporting_data, min_confidence=0.9, verbose=True)
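
Assuming the functions above, the pipeline can be sanity-checked on a small hand-made transaction list before pointing it at the cleaned CSV (the items below are illustrative, not drawn from the dataset):

toy = [['mpg=14', 'cyl=8', 'origin=1'],
       ['mpg=14', 'cyl=8', 'origin=1'],
       ['mpg=30', 'cyl=4', 'origin=3']]
F_toy, supp_toy = apriori_generation_algo(toy, min_support=0.5, verbose=True)
rules_toy = gen_rules(F_toy, supp_toy, min_confidence=0.9, verbose=True)
# with this toy data every frequent pair yields a rule with conf = 1.0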

OBSERVATION
If mpg equals 14, cylinders equals 8, and origin equals 1, then confidence = 1.0 and support = 0.063.
If mpg equals 13, cylinders equals 8, and origin equals 1, then confidence = 0.929 and support = 0.066.
If cylinders equals 8, year equals 73, and origin equals 1, then confidence = 1.0 and support = 0.051.
If horsepower equals 150, cylinders equals 8, and origin equals 1, then confidence = 1.0 and support = 0.056.

CONCLUSION

In our project we have studied the Apriori algorithm and generated rules by considering minimum support and confidence. The dataset was cleaned using R programming and the algorithm was implemented in Python. The Python code was run in the Java environment and the results were obtained.
