
LAB MANUAL

DATA MINING (AI-413)

BS ARTIFICIAL INTELLIGENCE

FALL 2023

Prepared By:
Ms. Khushbakht Awar

FACULTY OF ENGINEERING SCIENCES AND TECHNOLOGY

HAMDARD UNIVERSITY, ISLAMABAD CAMPUS


AI-413 Data Mining
LABORATORY MANUAL

Student Name:
Revision History

First Edition: 2023


CLO – PLO Mapping

S. No.  Course Learning Outcomes (CLOs)                            Taxonomy Level  PLOs

CLO_1   Apply preprocessing techniques to any given raw data      C3 (Applying)   PLO_02 (Knowledge for
        and find valid patterns.                                                  Solving Computing Problems)

CLO_2   Apply data mining techniques for data mining operations.  C3 (Applying)   PLO_02 (Knowledge for
                                                                                  Solving Computing Problems)

C = Cognitive domain, P = Psychomotor domain, A = Affective domain, BT = Bloom's Taxonomy
PLO = Program Learning Outcome
Foreword

Efforts have been made to prepare a comprehensive, accessible, and practice-oriented laboratory
manual that fulfills the laboratory work requirements of the course titled Data Mining. As Hamdard
University offers sixteen laboratory classes per semester, this manual contains fourteen
laboratories; during the two remaining laboratory classes, students will be encouraged to repeat
laboratories and/or work on their semester project as necessary.

…………………………
Ms. Khushbakht Awar,
Department of Computing
It is certified that Mr./Ms. ____________________________

bearing University CMS ID ____________ has completed his/her laboratory work

of the course titled “Data Mining” with course code AI-413 in the Artificial Intelligence

Lab of the Department of Computing.

Opening Date of Lab Work:


Closing Date of Lab Work:
Grade / Marks:

Faculty Member:
Signature of Faculty Member:
Date:
List of Experiments

Sr. No.  Title of Experiment                                                  CLO    PLO    Taxonomy Level

1.   Installation of the WEKA Tool                                            CLO_1  PLO_2  P3 (Guided Response)
2.   Creating a new ARFF File                                                 CLO_1  PLO_2  P3 (Guided Response)
3.   Implementation of Pre-Processing Techniques on a Data Set                CLO_1  PLO_2  P3 (Guided Response)
4.   Generating Association Rules using the Apriori Algorithm                 CLO_1  PLO_2  P4 (Mechanism)
5.   Generating Association Rules using the FP-Growth Algorithm               CLO_1  PLO_2  P4 (Mechanism)
6.   Building a Decision Tree using the J48 Algorithm + Open-Ended Lab        CLO_1  PLO_2  P4 (Mechanism)
7.   Familiarization with Naïve Bayes Classification on a Given Data Set      CLO_1  PLO_2  P3 (Guided Response)
8.   Applying k-Means Clustering on a Given Data Set                          CLO_2  PLO_2  P3 (Guided Response)
9.   Clustering Employee Data using the Make Density Based Cluster Algorithm  CLO_2  PLO_2  P4 (Mechanism)
10.  Clustering Weather Data using the EM Algorithm                           CLO_2  PLO_2  P4 (Mechanism)
11.  Clustering Buying Data using the Cobweb Algorithm                        CLO_2  PLO_2  P4 (Mechanism)
12.  Constructing a Decision Tree for Customer Data and Classifying It        CLO_2  PLO_2  P3 (Guided Response)
13.  Clustering Customer Data using the Simple KMeans Algorithm               CLO_2  PLO_2  P3 (Guided Response)
14.  Clustering Banking Data using the Farthest First Algorithm               CLO_2  PLO_2  P3 (Guided Response)
LAB 1: Installation of the WEKA Tool
Aim: (a) To investigate the application interfaces of the WEKA tool.

Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together with
graphical user interfaces for easy access to these functions. The original non-Java version of
Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other
programming languages, plus data preprocessing utilities in C, and a Makefile-based system for
running machine learning experiments. This original version was primarily designed as a tool for
analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3),
for which development started in 1997, is now used in many different application areas, in
particular for educational purposes and research. Advantages of Weka include:

▪ Free availability under the GNU General Public License.


▪ Portability, since it is fully implemented in the Java programming language and thus runs
on almost any modern computing platform
▪ A comprehensive collection of data preprocessing and modeling techniques
▪ Ease of use due to its graphical user interfaces

Description:
Open the program. Once the program has been loaded on the user’s machine it is opened by
navigating to the program’s start option and that will depend on the user’s operating system.
Figure 1.1 is an example of the initial opening screen on a computer.
There are four options available on this initial screen:

Fig: 1.1 Weka GUI

1. Explorer - the graphical interface used to conduct experimentation on raw data. After
clicking the Explorer button, the Weka explorer interface appears.

Fig: 1.2 Pre-processor

Inside the weka explorer window there are six tabs:
1. Preprocess- used to choose the data file to be used by the application.
Open File- allows for the user to select files residing on the local machine or recorded medium
Open URL- provides a mechanism to locate a file or data source from a different location
specified by the user
Open Database- allows the user to retrieve files or data from a database source provided by user
2. Classify- used to test and train different learning schemes on the preprocessed data file under
experimentation

Fig: 1.3 choosing Zero set from classify


Again there are several options to be selected inside the Classify tab. The Test options panel gives
the user the choice of four different test modes on the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split

3. Cluster- used to apply different tools that identify clusters within the data file.
The Cluster tab opens the process that is used to identify commonalties or clusters of occurrences
within the data set and produce information for the user to analyze.

4. Association- used to apply different rules to the data file that identify association within the
data. The associate tab opens a window to select the options for associations within the dataset.

5. Select attributes - used to apply different rules to reveal changes based on the inclusion or
exclusion of selected attributes from the experiment

6. Visualize - used to see what the various manipulations produced on the data set in a 2D format,
as scatter plot and bar graph output.

2. Experimenter - this option allows users to conduct different experimental variations on data
sets and perform statistical manipulation. The Weka Experiment Environment enables the user to
create, run, modify, and analyze experiments in a more convenient manner than is possible when
processing the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyze the results to determine if one of the
schemes is (statistically) better than the other schemes.

Fig: 1.6 Weka experiment

Results destination: ARFF file, CSV file, JDBC database.


Experiment type: Cross-validation (default), Train/Test Percentage Split (data randomized).
Iteration control: Number of repetitions, Data sets first/Algorithms first.
Algorithms: filters

3. Knowledge Flow - provides basically the same functionality as the Explorer, with drag-and-drop
support. The advantage of this option is that it supports incremental learning from previous
results.
4. Simple CLI - provides users without a graphical interface the ability to execute commands from a
terminal window.
(b) Explore the default datasets in the WEKA tool.

Click the "Open file…" button to open a data set and double-click on the "data" directory.
Weka provides a number of small common machine learning datasets that you can use to practice on.
Select the "iris.arff" file to load the Iris dataset.

Fig: 1.7 Different Data Sets in weka
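The same data can also be loaded programmatically. A minimal sketch using the WEKA Java API
(assuming weka.jar is on the classpath and iris.arff is in the working directory):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadIris {
    public static void main(String[] args) throws Exception {
        // read the ARFF file into an Instances object
        Instances data = DataSource.read("iris.arff");
        // by convention the class attribute is the last one
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
    }
}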

References:
[1] Witten, I.H. and Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques,
2nd edition, Morgan Kaufmann, San Francisco.
[2] Ross Quinlan (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
San Mateo, CA.
[3] CVS - http://weka.sourceforge.net/wiki/index.php/CVS
[4] Weka Doc - http://weka.sourceforge.net/wekadoc/

Exercise:
1. Normalize the data using min-max normalization
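Hint: min-max normalization rescales each numeric value v of an attribute to
v' = (v - min) / (max - min), mapping it into [0, 1]. In WEKA this corresponds to the unsupervised
Normalize filter; a sketch of the programmatic route (file name assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class MinMaxDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        Normalize norm = new Normalize(); // scales every numeric attribute to [0,1]
        norm.setInputFormat(data);
        Instances scaled = Filter.useFilter(data, norm);
        System.out.println(scaled);
    }
}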

Experiment 2: Creating a new ARFF File
Aim: To create a new ARFF file.

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of
instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at
the Department of Computer Science of The University of Waikato for use with the Weka machine
learning software. In WEKA, each data entry is an instance of the Java class weka.core.Instance,
and each instance consists of a number of attribute values. For loading datasets, WEKA can read
ARFF files. The Attribute-Relation File Format has two sections:

1. The Header section defines the relation (dataset) name and the attribute names and types.
2. The Data section lists the data instances.

The textbook shows an ARFF file for the weather data; a similar example is given below. Lines
beginning with a % sign are comments, and there are three basic keywords: @relation, @attribute
and @data.

The external representation of an Instances object consists of:
▪ A header: describes the attribute types
▪ A data section: a comma-separated list of data
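As an illustration, the weather data mentioned above can be written as the following ARFF file
(the full dataset is typed out in Lab 10; only the first few instances are shown here):

% ARFF file for the weather data
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes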

References:
https://www.cs.auckland.ac.nz/courses/compsci367s1c/tutorials/IntroductionToWeka.pdf

Exercise:

1. Creating a sample dataset for supermarket (supermarket.arff)

Experiment 3: Pre-Processing Techniques on a Data Set
Aim: (3a) Pre-process a given dataset based on attribute selection.

To search through all possible combinations of attributes in the data and find which subset of
attributes works best for prediction, make sure that you set the attribute evaluator to
"CfsSubsetEval" and the search method to "BestFirst". The evaluator determines what method is used
to assign a worth to each subset of attributes. The search method determines what style of search
is performed. The options that you can set for selection in the "Attribute Selection Mode" are
shown in Fig 3.2:

1. Use full training set. The worth of the attribute subset is determined using the full set of
training data.

2. Cross-validation. The worth of the attribute subset is determined by a process of
cross-validation. The "Fold" and "Seed" fields set the number of folds to use and the random seed
used when shuffling the data.

Specify which attribute to treat as the class in the drop-down box below the test options. Once all
the test options are set, you can start the attribute selection process by clicking on the "Start"
button.

Fig: 3.1 Choosing Cross validation

When it is finished, the results of selection are shown on the right part of the window and an
entry is added to the "Result list".
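The same selection can be run through the WEKA Java API. A minimal sketch with the evaluator and
search method configured as described above (file name assumed):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);  // attribute to treat as the class
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());    // assigns a worth to each attribute subset
        selector.setSearch(new BestFirst());           // search style
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());
    }
}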

2. Visualizing Results

Fig: 3.2 Data Visualization


WEKA's visualization allows you to visualize a 2-D plot of the current working relation.
Visualization is very useful in practice; it helps to determine the difficulty of the learning
problem. WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D
visualizations (Xgobi-style). WEKA has a "Jitter" option to deal with nominal attributes and to
detect "hidden" data points.

Fig: 3.3 Preprocessing with jitter


Exercise
1. Explain data preprocessing steps for heart disease dataset.

Aim: (3b) Pre-process a given dataset based on handling missing values.

Process: Replacing missing attribute values by the attribute mean. This method is used for data
sets with numerical attributes. An example of such a data set is presented in Fig 3.4.

Fig: 3.4 Missing values

Fig: 3.5 Choosing a dataset

In this method, every missing attribute value for a numerical attribute is replaced by the
arithmetic mean of the known attribute values. In Fig 3.4, the mean of the known attribute values
for Temperature is 99.2, hence all missing attribute values for Temperature should be replaced by
99.2. The table with missing attribute values replaced by the mean is presented in Fig 3.6. For the
symbolic attributes Headache and Nausea, missing attribute values were replaced using the most
common value by the ReplaceMissingValues filter.

Fig: 3.6 Replaced values
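The same replacement can be applied programmatically with the unsupervised ReplaceMissingValues
filter, which substitutes the mean for numeric attributes and the mode for nominal ones. A sketch
(the file name is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class FillMissingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("patients.arff"); // hypothetical file with missing values
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        Instances clean = Filter.useFilter(data, rmv);     // means/modes substituted for '?'
        System.out.println(clean);
    }
}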
Exercise
1. Create your own dataset having missing values included.

Record Notes

Experiment 4. Generate Association Rules using the Apriori Algorithm
Description:
The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean
association rules. It uses a "bottom-up" approach, where frequent subsets are extended one item at
a time (a step known as candidate generation), and groups of candidates are tested against the
data.

❖ Problem:

TID ITEMS
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5

To find frequent item sets for the above transactions with a minimum support of 2 and a confidence
measure of 70% (i.e., 0.7).

Procedure:
Step 1:
Count the number of transactions in which each item occurs

ITEM  NO. OF TRANSACTIONS
1     2
2     3
3     3
4     1
5     3

Step 2:
Eliminate all those items whose transaction count is less than the minimum support (2 in this
case).

ITEM  NO. OF TRANSACTIONS
1     2
2     3
3     3
5     3

These are the single items that are bought frequently. Now let's say we want to find pairs of items
that are bought frequently. We continue from the above table (the table in Step 2).

Step 3:
We start making pairs from the first item, like (1,2), (1,3), (1,5), and then from the second item,
like (2,3), (2,5). We do not form (2,1) because we already formed (1,2) when making pairs with 1,
and buying 1 and 2 together is the same as buying 2 and 1 together. After making all the pairs we
get:

ITEM PAIRS

1,2
1,3
1,5
2,3
2,5
3,5
Step 4:
Now, we count how many times each pair is bought together.

ITEM PAIRS  NO. OF TRANSACTIONS
1,2         1
1,3         2
1,5         1
2,3         2
2,5         3
3,5         2

Step 5:
Again remove all item pairs having number of transactions less than 2.

ITEM PAIRS  NO. OF TRANSACTIONS
1,3         2
2,3         2
2,5         3
3,5         2

These pairs of items are bought frequently together. Now, let's say we want to find a set of three
items that are bought together. We use the above table (from Step 5) and make a set of three items.

Step 6:
To make the set of three items we need one more rule (termed self-join): from the item pairs in the
above table, we find two pairs with the same first item, so we get (2,3) and (2,5), which gives
(2,3,5). Then we find how many times (2,3,5) are bought together in the original table, and we get
the following:

ITEM SET  NO. OF TRANSACTIONS
(2,3,5)   2

Thus, the set of three items that are bought together from this data are (2, 3, 5).

Confidence:
We can take our frequent item set knowledge even further, by finding association rules using the
frequent item set. In simple words, we know (2, 3, 5) are bought together frequently, but what is
the association between them. To do this, we create a list of all subsets of frequently bought
items (2, 3, 5) in our case we get following subsets:

▪ {2}
▪ {3}
▪ {5}
▪ {2,3}
▪ {3,5}
▪ {2,5}

Now, we find the associations among all the subsets.
{2} => {3,5}: (if '2' is bought, what is the probability that '3' and '5' would be bought in the
same transaction?)
Confidence = P(2∩3∩5)/P(2) = 2/3 = 67%
{3} => {2,5}: P(2∩3∩5)/P(3) = 2/3 = 67%
{5} => {2,3}: P(2∩3∩5)/P(5) = 2/3 = 67%
{2,3} => {5}: P(2∩3∩5)/P(2∩3) = 2/2 = 100%
{3,5} => {2}: P(2∩3∩5)/P(3∩5) = 2/2 = 100%
{2,5} => {3}: P(2∩3∩5)/P(2∩5) = 2/3 = 67%
Also, considering the remaining 2-item sets, we get the following associations:
{1} => {3}: P(1∩3)/P(1) = 2/2 = 100%
{3} => {1}: P(1∩3)/P(3) = 2/3 = 67%
{2} => {3}: P(2∩3)/P(2) = 2/3 = 67%
{3} => {2}: P(2∩3)/P(3) = 2/3 = 67%
{2} => {5}: P(2∩5)/P(2) = 3/3 = 100%
{5} => {2}: P(2∩5)/P(5) = 3/3 = 100%
{3} => {5}: P(3∩5)/P(3) = 2/3 = 67%
{5} => {3}: P(3∩5)/P(5) = 2/3 = 67%
Eliminate all those having confidence less than 70%. Hence, the rules are:
{2,3} => {5}, {3,5} => {2}, {1} => {3}, {2} => {5}, {5} => {2}.
➢ Now these manual results should be checked with the rules generated in WEKA.

First, create a CSV file for the above problem; it will contain the rows and columns shown in the
figure. The file can be written in an Excel sheet and saved in CSV format.

Procedure for running the rules in weka:
Step 1:
Open weka explorer and open the file and then select all the item sets. The figure gives a better
understanding of how to do that.

Step 2: Now select the Associate tab and then choose the Apriori algorithm, setting the minimum
support and confidence as shown in the figure.

Step 3:
Now run the Apriori algorithm with the set values of minimum support and confidence. After
running, WEKA generates the association rules and the respective confidence with minimum support,
as shown in the figure.

The above csv file has generated 5 rules as shown in the figure:

Conclusion:
As we have seen, the rules generated manually and the rules generated by WEKA match; in both cases
five rules are produced.
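The run can also be scripted. A sketch using the WEKA API, under the assumption that the
transactions are encoded as nominal attributes (for example, one yes/? attribute per item, as the
GUI expects for Apriori):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("transactions.arff"); // hypothetical encoding of the table
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.5); // support of 2 out of 4 transactions
        apriori.setMinMetric(0.7);            // 70% minimum confidence
        apriori.buildAssociations(data);
        System.out.println(apriori);          // prints the generated rules
    }
}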

Exercise:

1. Apply the Apriori algorithm on the airport noise monitoring dataset, or on the dataset for
discriminating between patients with Parkinson's disease and other neurological diseases using
voice recordings.
[https://archive.ics.uci.edu/ml/machine-learning-databases/00000/ - refer to this link for datasets]

Page 32
Page 33
Page 34
Page 35
Experiment 5: Generating Association Rules Using the FP-Growth Algorithm
(5a) Aim: To generate association rules using the FP-Growth algorithm

PROBLEM:
To find all frequent item sets in the following dataset using the FP-growth algorithm, with minimum
support = 2 and confidence = 70%.
TID ITEMS
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5

Solution:
Similar to the Apriori algorithm, find the frequency of occurrence of each item in the dataset and
then prioritize the items in descending order of frequency. Eliminating those items whose count is
less than the minimum support and assigning the priorities, we obtain the following table.

ITEM  NO. OF TRANSACTIONS  PRIORITY
1     2                    4
2     3                    1
3     3                    2
5     3                    3

Re-arranging the original table (each transaction sorted by item priority, with the infrequent
item 4 removed), we obtain:

TID  ITEMS
100  3,1
200  2,3,5
300  2,3,5,1
400  2,5

Construction of tree:
Note that all FP-trees have a 'null' node as the root node. So, draw the root node first, attach
the items of row 1 one by one in priority order, and write their occurrence counts next to them.
The tree is further expanded by adding nodes according to the prefixes formed and by incrementing
the counts every time an item occurs again; hence the tree is built.

Prefixes (conditional pattern bases):

▪ 1 -> {3}:1, {2,3,5}:1
▪ 5 -> {2,3}:2, {2}:1
▪ 3 -> {2}:2

Frequent item sets:

▪ 1 -> 3:2  /* 2 and 5 are eliminated because their counts are less than the minimum support, and
  the count of 3 is obtained by adding its occurrences in both prefix paths */
▪ Similarly, 5 -> 2,3:2; 2:3; 3:2
▪ 3 -> 2:2

Therefore, the frequent item sets are {3,1}, {2,3,5}, {2,5}, {2,3}, {3,5}.
The tree is constructed as below:

null
├── 3:1
│   └── 1:1
└── 2:3
    ├── 3:2
    │   └── 5:2
    │       └── 1:1
    └── 5:1

Generating the association rules from the above tree and calculating the confidence measures, we
get:
▪ {3}=>{1}=2/3=67%
▪ {1}=>{3}=2/2=100%
▪ {2}=>{3,5}=2/3=67%

▪ {2,5}=>{3}=2/3=67%
▪ {3,5}=>{2}=2/2=100%
▪ {2,3}=>{5}=2/2=100%
▪ {3}=>{2,5}=2/3=67%
▪ {5}=>{2,3}=2/3=67%
▪ {2}=>{5}=3/3=100%
▪ {5}=>{2}=3/3=100%
▪ {2}=>{3}=2/3=67%
▪ {3}=>{2}=2/3=67%

Thus, eliminating all the rules having confidence less than 70%, we obtain the following
conclusions: {1}=>{3}, {3,5}=>{2}, {2,3}=>{5}, {2}=>{5}, {5}=>{2}.

As we see, there are 5 rules generated manually, and these are to be checked against the results
in WEKA. In order to check the results in the tool we follow a procedure similar to that used for
Apriori.

First, create a CSV file for the above problem; it will contain the rows and columns shown in the
figure. The file can be written in an Excel sheet and saved in CSV format.

Procedure for running the rules in weka:
Step 1:
Open weka explorer and open the file and then select all the item sets. The figure gives a better
understanding of how to do that.

Step 2: Now select the Associate tab and then choose the FP-Growth algorithm, setting the minimum
support and confidence as shown in the figure.

Step 3:
Now run the FP-Growth algorithm with the set values of minimum support and confidence. After
running, WEKA generates the association rules and the respective confidence with minimum support,
as shown in the figure.
The above CSV file has generated 5 rules, as shown in the figure:

Conclusion:
As we have seen, the rules generated manually and the rules generated by WEKA match; in both cases
five rules are produced.
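A corresponding API sketch uses weka.associations.FPGrowth, which expects binary (market-basket
style) attributes; the item encoding and file name are assumptions:

import weka.associations.FPGrowth;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FPGrowthDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("transactions.arff"); // hypothetical binary item file
        FPGrowth fp = new FPGrowth();
        fp.setLowerBoundMinSupport(0.5); // minimum support = 2 out of 4 transactions
        fp.setMinMetric(0.7);            // 70% minimum confidence
        fp.buildAssociations(data);
        System.out.println(fp);          // prints the generated rules
    }
}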

Exercise

1. Apply the FP-Growth algorithm on the Blood Transfusion Service Center dataset.

Record Notes

Page 42
Page 43
Page 44
Experiment 6: Build a Decision Tree using the J48 Algorithm

(6a) Aim: To generate a decision tree using the J48 algorithm.

DESCRIPTION:
Decision tree learning is one of the most widely used and practical methods for inductive
inference over supervised data. It represents a procedure for classifying categorical data based
on their attributes. This representation of acquired knowledge in tree form is intuitive and easy
for humans to assimilate.

ILLUSTRATION:
Build a decision tree for the following data

AGE          INCOME  STUDENT  CREDIT_RATING  BUYS_COMPUTER
Youth        High    No       Fair           No
Youth        High    No       Excellent      No
Middle-aged  High    No       Fair           Yes
Senior       Medium  No       Fair           Yes
Senior       Low     Yes      Fair           Yes
Senior       Low     Yes      Excellent      No
Middle-aged  Low     Yes      Excellent      Yes
Youth        Medium  No       Fair           No
Youth        Low     Yes      Fair           Yes
Senior       Medium  Yes      Fair           Yes
Youth        Medium  Yes      Excellent      Yes
Middle-aged  Medium  No       Excellent      Yes
Middle-aged  High    Yes      Fair           Yes
Senior       Medium  No       Excellent      No

The entropy is a measure of the uncertainty associated with a random variable. As uncertainty
increases, so does entropy; for a two-class problem the values range from 0 to 1.

Entropy(D) = - Σ pi log2(pi)

Information gain is used as an attribute selection measure; pick the attribute having the highest
information gain. The gain is calculated by:

Gain(D, A) = Entropy(D) - Σ (|Dj| / |D|) × Entropy(Dj)

where D is a given data partition and A is an attribute. Suppose we partition the tuples in D on
attribute A having v distinct values; D is split into v partitions or subsets (D1, D2, ..., Dv),
where Dj contains those tuples in D that have outcome aj of A, and the sum runs over j = 1 ... v.

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Entropy(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940


Compute the expected information requirement for each attribute, starting with the attribute age:

Gain(age, D) = Entropy(D) - (5/14) Entropy(D_youth) - (4/14) Entropy(D_middle-aged)
               - (5/14) Entropy(D_senior)
             = 0.940 - 0.694
             = 0.246

Similarly, for the other attributes:

Gain(income, D) = 0.029
Gain(student, D) = 0.151
Gain(credit_rating, D) = 0.048
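These numbers can be verified with a few lines of plain Java (no WEKA needed); a sketch that
reproduces Entropy(D) and Gain(age, D) from the class counts in the table above:

public class InfoGainDemo {
    // entropy of a two-class split with p positive ("yes") and n negative ("no") tuples
    static double entropy(int p, int n) {
        double total = p + n, e = 0.0;
        for (double c : new double[] {p, n}) {
            if (c > 0) e -= (c / total) * (Math.log(c / total) / Math.log(2));
        }
        return e;
    }
    public static void main(String[] args) {
        double entropyD = entropy(9, 5);            // 0.940
        // age partitions: youth (2 yes, 3 no), middle-aged (4 yes, 0 no), senior (3 yes, 2 no)
        double expected = 5 / 14.0 * entropy(2, 3)
                        + 4 / 14.0 * entropy(4, 0)
                        + 5 / 14.0 * entropy(3, 2); // 0.694
        System.out.printf("Entropy(D)   = %.3f%n", entropyD);
        System.out.printf("Gain(age, D) = %.3f%n", entropyD - expected); // 0.246
    }
}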

The attribute age has the highest information gain and therefore becomes the splitting attribute
at the root node of the decision tree. Branches are grown for each outcome of age, and the tuples
are partitioned accordingly. For the branch age = youth, the partition is:

INCOME  STUDENT  CREDIT_RATING  CLASS
High    No       Fair           No
High    No       Excellent      No
Medium  No       Fair           No
Low     Yes      Fair           Yes
Medium  Yes      Excellent      Yes

Now, calculating the information gain for this subtable (age = youth), which has 2 "yes" and 3
"no" tuples and hence entropy 0.971:

Income = "high":   S11 = 0, S12 = 2, I = 0
Income = "medium": S21 = 1, S22 = 1, I(S21, S22) = 1
Income = "low":    S31 = 1, S32 = 0, I = 0

Entropy for income:
E(income) = (2/5)(0) + (2/5)(1) + (1/5)(0) = 0.4
Gain(income) = 0.971 - 0.4 = 0.571

Similarly, Gain(student) = 0.971 and Gain(credit_rating) = 0.0208.
Gain(student) is highest, so student becomes the splitting attribute at this node.

This yields a decision tree for the concept buys_computer, indicating whether a customer at
AllElectronics is likely to purchase a computer. Each internal (non-leaf) node represents a test
on an attribute. Each leaf node represents a class (either buys_computer = "yes" or
buys_computer = "no").

First create a CSV file for the above problem; it will contain the rows and columns shown in the
figure. The file can be written in an Excel sheet and saved in CSV format.

Procedure for running the rules in weka:
Step 1:
Open the WEKA explorer, open the file, and then select all the attributes. The figure gives a
better understanding of how to do that.

Step 2:
Now select the Classify tab in the tool and click on the Start button; we can then see the result
of the problem as below.

Step 3:
Check the result we computed manually against the result in WEKA by right-clicking on the result
list and visualizing the tree.

The visualized tree in WEKA is as shown below:

Conclusion:
The solution obtained manually and the solution produced by WEKA are the same.
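The same model can also be built programmatically; a sketch (file name assumed; J48 is WEKA's
implementation of the C4.5 algorithm):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("buys_computer.arff"); // hypothetical file for the table above
        data.setClassIndex(data.numAttributes() - 1);           // buys_computer is the class
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree); // textual form of the decision tree
    }
}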

Exercise:

1. Apply the decision tree algorithm to a dataset for booking a table in a hotel, a train ticket,
or a movie ticket.

Experiment 7: Naïve Bayes Classification on a Given Data Set

AIM: To apply the naïve Bayes classifier on a given data set.

Description:
In machine learning, naïve Bayes classifiers are a family of simple probabilistic classifiers
based on applying Bayes' theorem with strong (naïve) independence assumptions between the
features.

Example:

AGE    INCOME  STUDENT  CREDIT_RATING  BUYS_COMPUTER
<=30   High    No       Fair           No
<=30   High    No       Excellent      No
31-40  High    No       Fair           Yes
>40    Medium  No       Fair           Yes
>40    Low     Yes      Fair           Yes
>40    Low     Yes      Excellent      No
31-40  Low     Yes      Excellent      Yes
<=30   Medium  No       Fair           No
<=30   Low     Yes      Fair           Yes
>40    Medium  Yes      Fair           Yes
<=30   Medium  Yes      Excellent      Yes
31-40  Medium  No       Excellent      Yes
31-40  High    Yes      Fair           Yes
>40    Medium  No       Excellent      No

CLASSES:
C1: buys_computer = "yes"
C2: buys_computer = "no"

DATA TO BE CLASSIFIED:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(C1) = P(buys_computer = "yes") = 9/14 = 0.643
P(C2) = P(buys_computer = "no") = 5/14 = 0.357
Computing P(X|C1) and P(X|C2), we get:

1. P(age = "<=30" | buys_computer = "yes") = 2/9
2. P(age = "<=30" | buys_computer = "no") = 3/5
3. P(income = "medium" | buys_computer = "yes") = 4/9
4. P(income = "medium" | buys_computer = "no") = 2/5
5. P(student = "yes" | buys_computer = "yes") = 6/9
6. P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
7. P(credit_rating = "fair" | buys_computer = "yes") = 6/9
8. P(credit_rating = "fair" | buys_computer = "no") = 2/5

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):

P(X|C1) = P(X | buys_computer = "yes") = 2/9 × 4/9 × 6/9 × 6/9 = 0.044
P(X|C2) = P(X | buys_computer = "no") = 3/5 × 2/5 × 1/5 × 2/5 = 12/625 = 0.019

P(C1|X) ∝ P(X|C1) × P(C1) = 0.044 × (9/14) = 0.028
P(C2|X) ∝ P(X|C2) × P(C2) = 0.019 × (5/14) = 0.007

Therefore, the conclusion is that the given data belongs to C1, since P(C1|X) > P(C2|X).

Checking the result in the WEKA tool:

In order to check the result in the tool we need to follow a procedure.
Step 1:
Create a CSV file with the table considered in the example above. The resulting ARFF file will
look as shown below:

Step 2:
Now open weka explorer and then select all the attributes in the table.

Step 3:
Select the Classify tab in the tool, choose the bayes folder, and then the NaiveBayes classifier
to see the result as shown below.
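The classifier can also be run through the WEKA Java API; a sketch (file name assumed):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("buys_computer.arff"); // hypothetical file for the table above
        data.setClassIndex(data.numAttributes() - 1);           // buys_computer is the class
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        // posterior class probabilities for the first instance; the order of the
        // entries follows the class values declared in the ARFF header
        double[] posterior = nb.distributionForInstance(data.instance(0));
        for (int i = 0; i < posterior.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + posterior[i]);
        }
    }
}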

Exercise

1. Classify data (lung cancer / diabetes / liver disorder) using the Bayesian approach.

Record Notes

Experiment 8: Applying k-means clustering on a given data set

DESCRIPTION:

The k-means algorithm aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results
in a partitioning of the data space into Voronoi cells.

ILLUSTRATION:

As a simple illustration of the k-means algorithm, consider the following data set consisting of
the scores of two variables for each of five individuals.

Individual  X1  X2
A           1   1
B           1   0
C           0   2
D           2   4
E           3   5

This data set is to be grouped into two clusters. As a first step in finding a sensible partition,
let the values of the two individuals A and C define the initial cluster means, giving:

Cluster    Individual  Mean Vector (Centroid)
Cluster 1  A           (1, 1)
Cluster 2  C           (0, 2)

The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:

Individual  Distance to A (1, 1)  Distance to C (0, 2)
A           0                     1.4
B           1                     2.2
C           1.4                   0
D           3.2                   2.8
E           4.5                   4.2

The initial partition has changed, and the two clusters at this stage have the following
characteristics:

Cluster    Individuals  Mean Vector (Centroid)
Cluster 1  A, B         (1, 0.5)
Cluster 2  C, D, E      (1.7, 3.7)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual's distance to its own cluster mean and to that of the opposite cluster,
and we find:

Individual  Distance to Cluster 1 mean (1, 0.5)  Distance to Cluster 2 mean (1.7, 3.7)
A           0.5                                  2.7
B           0.5                                  3.7
C           1.8                                  2.4
D           3.6                                  0.4
E           4.9                                  1.9

Individual C is now relocated to Cluster 1, since it is closer to the Cluster 1 centroid than to
that of its own cluster. This results in the new partition:

Cluster    Individuals  Mean Vector (Centroid)
Cluster 1  A, B, C      (0.7, 1)
Cluster 2  D, E         (2.5, 4.5)

The iterative relocation would now continue from this new partition until no more relocations
occur. However, in this example each individual is now nearer its own cluster mean than that of
the other cluster, so the iteration stops, leaving the latest partitioning as the final cluster
solution.
Also, it is possible that the k-means algorithm does not converge to a final solution. In this
case, it is a better idea to stop the algorithm after a pre-chosen maximum number of iterations.
Checking the solution in weka:
In order to check the result in the tool we need to follow a procedure.
Step 1:
Create a CSV file with the table considered in the example above. The CSV file will look as shown
below:

Step 2:
Now open weka explorer and then select all the attributes in the table.

Step 3:
Select the Cluster tab in the tool and choose the SimpleKMeans technique to see the result as
shown below.
Exercise

1. Implement k-means clustering using a crime dataset.

LAB 9

Aim: Write a procedure for Employee data using Make Density Based Cluster Algorithm.

Description:

Cluster analysis or clustering is the task of assigning a set of objects into groups (called
clusters) so that the objects in the same cluster are more similar (in some sense or another) to each
other than to those in other clusters. Clustering is a main task of explorative data mining, and a
common technique for statistical data analysis used in many fields, including machine learning,
pattern recognition, image analysis, information retrieval, and bioinformatics.

Creation of Employee Table:

Procedure:

1) Open Start → Programs → Accessories → Notepad


2) Type the following training data set with the help of Notepad for the Employee Table:

@relation employee
@attribute eid numeric
@attribute ename {raj,ramu,anil,sunil,rajiv,sunitha,kavitha,suresh,ravi,ramana,ram,kavya,navya}
@attribute salary numeric
@attribute exp numeric
@attribute address {pdtr,kdp,nlr,gtr}

@data
101,raj,10000,4,pdtr
102,ramu,15000,5,pdtr
103,anil,12000,3,kdp
104,sunil,13000,3,kdp
105,rajiv,16000,6,kdp
106,sunitha,15000,5,nlr
107,kavitha,12000,3,nlr
108,suresh,11000,5,gtr
109,ravi,12000,3,gtr
110,ramana,11000,5,gtr
111,ram,12000,3,kdp
112,kavya,13000,4,kdp
113,navya,14000,5,kdp
3) After that the file is saved with .arff file format.
4) Minimize the arff file and then open Start → Programs → weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff file
8) Click on edit button which shows employee table on weka.

Training Data Set → Employee Table
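To run the clustering itself, choose MakeDensityBasedClusterer from the Cluster tab (the steps
mirror those in Labs 10-14). Programmatically, MakeDensityBasedClusterer wraps a base clusterer
such as SimpleKMeans and fits a density (probability) model over its clusters; a sketch (cluster
count assumed):

import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DensityClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("employee.arff");
        SimpleKMeans base = new SimpleKMeans();
        base.setNumClusters(2);                  // assumed cluster count
        MakeDensityBasedClusterer mdbc = new MakeDensityBasedClusterer();
        mdbc.setClusterer(base);                 // wrap the base clusterer
        mdbc.buildClusterer(data);
        System.out.println(mdbc);                // per-cluster priors and densities
    }
}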

LAB 10

Aim: Write a procedure for Clustering Weather data using EM Algorithm.

Description:

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so
that the objects in the same cluster are more similar (in some sense or another) to each other than to those in
other clusters. Clustering is a main task of explorative data mining, and a common technique for statistical
data analysis used in many fields, including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics.

Creation of Weather Table:

Procedure:

1) Open Start → Programs → Accessories → Notepad


2) Type the following training data set with the help of Notepad for the Weather Table:

@relation weather
@attribute outlook {sunny, rainy, overcast}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

3) After that the file is saved with .arff file format.


4) Minimize the arff file and then open Start → Programs → weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on 'open file' and select the arff file.
8) Click on edit button which shows weather table on weka.
Training Data Set → Weather Table

Procedure:
9) Click Start -> Programs -> Weka 3.4
10) Click on Explorer.
11) Click on open file & then select Weather.arff file.
12) Click on Cluster menu. In this menu there are different algorithms.
13) Click on Choose button and then select EM algorithm.
14) Click on Start button and then output will be displayed on the screen.
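The EM run can also be scripted; a sketch using the WEKA API (EM is the expectation-maximization
clusterer):

import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EMDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        EM em = new EM();
        em.setNumClusters(-1);   // -1 (the default) selects the cluster count by cross-validation
        em.buildClusterer(data);
        System.out.println(em);  // per-cluster attribute distributions
    }
}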

LAB 11

Aim: Write a procedure for Clustering Buying data using Cobweb Algorithm.

Description:

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so
that the objects in the same cluster are more similar (in some sense or another) to each other than to those in
other clusters. Clustering is a main task of explorative data mining, and a common technique for statistical
data analysis used in many fields, including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics.

Creation of Buying Table:

Procedure:

1) Open Start → Programs → Accessories → Notepad


2) Type the following training data set with the help of Notepad for the Buying Table:

@relation buying
@attribute age {L20,20-40,G40}
@attribute income {high,medium,low}
@attribute stud {yes,no}
@attribute creditrate {fair,excellent}
@attribute buyscomp {yes,no}

@data
L20,high,no,fair,yes
20-40,low,yes,fair,yes
G40,medium,yes,fair,yes
L20,low,no,fair,no
G40,high,no,excellent,yes
L20,low,yes,fair,yes
20-40,high,yes,excellent,no
G40,low,no,fair,yes
L20,high,yes,excellent,yes
G40,high,no,fair,yes
L20,low,yes,excellent,no
G40,high,yes,excellent,no
20-40,medium,yes,excellent,yes
L20,medium,yes,fair,yes
G40,high,yes,excellent,yes

3) After that the file is saved with .arff file format.


4) Minimize the arff file and then open Start → Programs → weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff file
8) Click on edit button which shows buying table on weka.
Training Data Set → Buying Table

Procedure:
1) Click Start -> Programs -> Weka 3.4
2) Click on Explorer.
3) Click on open file & then select Buying.arff file.
4) Click on Cluster menu. In this menu there are different algorithms.
5) Click on Choose button and then select Cobweb algorithm.
6) Click on Start button and then the output will be displayed on the screen.
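Programmatically, Cobweb builds an incremental, hierarchical concept tree; a sketch (the cutoff
value shown is an assumption close to WEKA's default):

import weka.clusterers.Cobweb;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CobwebDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("buying.arff");
        Cobweb cw = new Cobweb();  // incremental, hierarchical conceptual clustering
        cw.setAcuity(1.0);         // acuity and cutoff control how the concept tree grows
        cw.setCutoff(0.002);       // assumed value close to the default
        cw.buildClusterer(data);
        System.out.println(cw);    // prints the concept hierarchy
    }
}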

Output:

Result:
The program has been successfully executed.

LAB 12

Aim:

To Construct Decision Tree for Customer data and classify it.

Description:

Classification & Prediction:

Classification is the process of finding a model that describes data values and concepts for the
purpose of prediction.

Decision Tree:

A decision tree is a classification scheme that generates a tree consisting of a root node,
internal nodes, and external nodes.

The root node and internal nodes represent attributes; the external nodes are the classes, and
each branch represents a value of an attribute.

A decision tree also comes with a set of rules for a given data set. There are two subsets in
decision tree learning: a training data set and a testing data set. The training data set is
previously classified data; the testing data set is newly generated data.

Creation of Customer Table:

Procedure:

1) Open Start → Programs → Accessories → Notepad


2) Type the following training data set with the help of Notepad for Customer Table.
@relation customer
@attribute name {x,y,z,u,v,l,w,q,r,n}
@attribute age {youth,middle,senior}
@attribute income {high,medium,low}
@attribute class {A,B}

@data
x,youth,high,A
y,youth,low,B
z,middle,high,A
u,middle,low,B
v,senior,high,A
l,senior,low,B
w,youth,high,A
q,youth,low,B
r,middle,high,A
n,senior,high,A

3) After that the file is saved with .arff file format.


4) Minimize the arff file and then open Start → Programs → weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff file
8) Click on edit button which shows customer table on weka.

Training Data Set → Customer Table

Procedure for Decision Trees:

1) Open Start → Programs → Weka-3-4 → Weka-3-4


2) Open explorer.
3) Click on open file and select customer.arff
4) Select Classifier option on the top of the Menu bar.
5) Select Choose button and click on Tree option.
6) Click on J48.
7) Click on Start button and output will be displayed on the right side of the window.
8) Select the result list and right click on result list and select Visualize Tree option.
9) Then Decision Tree will be displayed on new window.
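The tree and its accuracy can also be checked programmatically; a sketch that builds J48 on the
customer data and runs a 10-fold cross-validation (file name assumed):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CustomerTreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customer.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class attribute {A,B}
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1)); // 10-fold CV
        System.out.println(tree);                  // the decision tree
        System.out.println(eval.toSummaryString()); // accuracy statistics
    }
}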

Output:

Decision Tree:

LAB 13

Aim: Write a procedure for Clustering Customer data using Simple KMeans Algorithm.

Description:

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so
that the objects in the same cluster are more similar (in some sense or another) to each other than to those in
other clusters. Clustering is a main task of explorative data mining, and a common technique for statistical
data analysis used in many fields, including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics.

Creation of Customer Table:

Procedure:

1) Open Start → Programs → Accessories → Notepad


2) Type the following training data set with the help of Notepad for the Customer Table.
@relation customer
@attribute name {x,y,z,u,v,l,w,q,r,n}
@attribute age {youth,middle,senior}
@attribute income {high,medium,low}
@attribute class {A,B}

@data
x,youth,high,A
y,youth,low,B
z,middle,high,A
u,middle,low,B
v,senior,high,A
l,senior,low,B
w,youth,high,A
q,youth,low,B
r,middle,high,A
n,senior,high,A

3) After that the file is saved with .arff file format.


4) Minimize the arff file and then open Start → Programs → weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff file
8) Click on edit button which shows customer table on weka.

Training Data Set → Customer Table

Procedure:
1) Click Start -> Programs -> Weka 3.4
2) Click on Explorer.
3) Click on open file & then select Customer.arff file.
4) Click on Cluster menu. In this menu there are different algorithms.
5) Click on Choose button and then select SimpleKMeans algorithm.
6) Click on Start button and then the output will be displayed on the screen.
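Programmatically, ClusterEvaluation can summarize the SimpleKMeans result, including per-instance
cluster assignments; a sketch (file name assumed):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CustomerKMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customer.arff");
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);
        km.buildClusterer(data);
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data); // assigns each instance to a cluster
        System.out.println(eval.clusterResultsToString());
    }
}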

LAB 14

Aim: Write a procedure for Banking data using Farthest First Algorithm.

Description:

Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so
that the objects in the same cluster are more similar (in some sense or another) to each other than to those in
other clusters. Clustering is a main task of explorative data mining, and a common technique for statistical
data analysis used in many fields, including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics.

Creation of Banking Table:

Procedure:

1) Open Start → Programs → Accessories → Notepad


2) Type the following training data set with the help of Notepad for Banking Table.
@relation bank
@attribute cust {male,female}
@attribute accno {0101,0102,0103,0104,0105,0106,0107,0108,0109,0110,0111,0112,0113,0114,0115}
@attribute bankname {sbi,hdfc,sbh,ab,rbi}
@attribute location {hyd,jmd,antp,pdtr,kdp}
@attribute deposit {yes,no}
@data
male,0101,sbi,hyd,yes
female,0102,hdfc,jmd,no
male,0103,sbh,antp,yes
male,0104,ab,pdtr,yes
female,0105,sbi,jmd,no
male,0106,ab,hyd,yes
female,0107,rbi,jmd,yes
female,0108,hdfc,kdp,no
male,0109,sbh,kdp,yes
male,0110,ab,jmd,no
female,0111,rbi,kdp,yes
male,0112,sbi,jmd,yes
female,0113,rbi,antp,no
male,0114,hdfc,pdtr,yes
female,0115,sbh,pdtr,no
3) After that the file is saved with .arff file format.
4) Minimize the arff file and then open Start → Programs → weka-3-4.
5) Click on weka-3-4, then Weka dialog box is displayed on the screen.
6) In that dialog box there are four modes, click on explorer.
7) Explorer shows many options. In that click on ‘open file’ and select the arff file
8) Click on edit button which shows banking table on weka.

Training Data Set → Banking Table


Procedure:
1) Click Start -> Programs -> Weka 3.4
2) Click on Explorer.
3) Click on open file & then select Banking.arff file.
4) Click on Cluster menu. In this menu there are different algorithms.
5) Click on Choose button and then select FarthestFirst algorithm.
6) Click on Start button and then the output will be displayed on the screen.
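Programmatically, the FarthestFirst clusterer picks initial centers that are as far apart as
possible and then assigns points to the nearest center; a sketch (cluster count assumed):

import weka.clusterers.FarthestFirst;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BankClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("banking.arff");
        FarthestFirst ff = new FarthestFirst();
        ff.setNumClusters(2);     // assumed cluster count
        ff.buildClusterer(data);
        System.out.println(ff);   // prints the chosen cluster centers
    }
}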
