DM Lab Internal
1. Write a Python program to generate frequent itemsets / association rules using the Apriori algorithm
Aim: To generate frequent itemsets / association rules using the Apriori algorithm
Source code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori
store_data = pd.read_csv('D:\\Datasets\\store_data.csv')
store_data.head()
Output:
records = []
for i in range(0, 7501):
    records.append([str(store_data.values[i, j]) for j in range(0, 20)])
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
for item in association_results:  # iterate over the materialized list, not the already-consumed generator
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
Output:
Rule: light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801126
Confidence: 0.3006993006993007
Lift: 3.790832696715049
=====================================
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
=====================================
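The support, confidence and lift figures printed above can be recomputed by hand from their definitions; a minimal sketch using a hypothetical five-transaction toy list (not the real store_data.csv):

```python
# Hedged sketch: recomputing support, confidence and lift from first
# principles, on made-up transactions for illustration.
transactions = [
    {"light cream", "chicken"},
    {"light cream", "chicken", "eggs"},
    {"chicken", "eggs"},
    {"light cream", "milk"},
    {"milk", "eggs"},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / n

rule_from, rule_to = {"light cream"}, {"chicken"}
sup = support(rule_from | rule_to)   # P(A and B)
conf = sup / support(rule_from)      # P(B | A)
lift = conf / support(rule_to)       # confidence relative to P(B)
print(f"Support: {sup}")
print(f"Confidence: {conf}")
print(f"Lift: {lift}")
```

A lift above 1 means the consequent is more likely when the antecedent is present than it is overall, which is why the assignment filters on min_lift=3.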
2. Write a program to cluster your choice of data using the simple k-means algorithm in Java (JDK)
Aim: To write a program to cluster your choice of data using the simple k-means algorithm in Java (JDK)
Source code:
import java.util.*;

class KmeansJ
{
    public static void main(String args[])
    {
        int dataset[][] = {
            {2,1}, {5,2}, {2,2}, {4,1}, {4,3},
            {7,5}, {3,6}, {5,7}, {1,4}, {4,1}
        };
        int i, k = 2;
        int part1[][] = new int[10][2];
        int part2[][] = new int[10][2];
        float mean1[][] = new float[1][2];
        float mean2[][] = new float[1][2];
        float temp1[][] = new float[1][2], temp2[][] = new float[1][2];
        int sum11 = 0, sum12 = 0, sum21 = 0, sum22 = 0;
        double dist1, dist2;
        int i1 = 0, i2 = 0, itr = 0;

        System.out.println("Dataset: ");
        for(i = 0; i < 10; i++)
        {
            System.out.println(dataset[i][0] + " " + dataset[i][1]);
        }
        System.out.println("\nNumber of partitions: " + k);

        // initial means
        mean1[0][0] = 2;
        mean1[0][1] = 2;
        mean2[0][0] = 5;
        mean2[0][1] = 7;

        // iterate until the means stop changing
        while(!Arrays.deepEquals(mean1, temp1) || !Arrays.deepEquals(mean2, temp2))
        {
            for(i = 0; i < 10; i++)
            {
                part1[i][0] = 0;
                part1[i][1] = 0;
                part2[i][0] = 0;
                part2[i][1] = 0;
            }
            i1 = 0; i2 = 0;
            // assign each point to the nearer mean
            for(i = 0; i < 10; i++)
            {
                dist1 = Math.sqrt(Math.pow(dataset[i][0] - mean1[0][0], 2) +
                                  Math.pow(dataset[i][1] - mean1[0][1], 2));
                dist2 = Math.sqrt(Math.pow(dataset[i][0] - mean2[0][0], 2) +
                                  Math.pow(dataset[i][1] - mean2[0][1], 2));
                if(dist1 < dist2)
                {
                    part1[i1][0] = dataset[i][0];
                    part1[i1][1] = dataset[i][1];
                    i1++;
                }
                else
                {
                    part2[i2][0] = dataset[i][0];
                    part2[i2][1] = dataset[i][1];
                    i2++;
                }
            }
            // remember the current means to test for convergence
            temp1[0][0] = mean1[0][0];
            temp1[0][1] = mean1[0][1];
            temp2[0][0] = mean2[0][0];
            temp2[0][1] = mean2[0][1];
            // recompute each mean as the average of its partition
            sum11 = 0; sum12 = 0; sum21 = 0; sum22 = 0;
            for(i = 0; i < i1; i++)
            {
                sum11 += part1[i][0];
                sum12 += part1[i][1];
            }
            for(i = 0; i < i2; i++)
            {
                sum21 += part2[i][0];
                sum22 += part2[i][1];
            }
            mean1[0][0] = (float)sum11 / i1;
            mean1[0][1] = (float)sum12 / i1;
            mean2[0][0] = (float)sum21 / i2;
            mean2[0][1] = (float)sum22 / i2;
            itr++;
        }

        System.out.println("\nFinal Partition: ");
        System.out.println("Part1:");
        for(i = 0; i < i1; i++)
        {
            System.out.println(part1[i][0] + " " + part1[i][1]);
        }
        System.out.println("\nPart2:");
        for(i = 0; i < i2; i++)
        {
            System.out.println(part2[i][0] + " " + part2[i][1]);
        }
        System.out.println("\nFinal Mean: ");
        System.out.println("Mean1 : " + mean1[0][0] + " " + mean1[0][1]);
        System.out.println("Mean2 : " + mean2[0][0] + " " + mean2[0][1]);
        System.out.println("\nTotal Iteration: " + itr);
    }
}
Output:
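For comparison, the same procedure can be sketched in Python; a minimal re-implementation (not a line-by-line port of the Java program) using the same ten points and the same initial means (2,2) and (5,7):

```python
# Hedged sketch: 2-means in plain Python, same data and seeds as above.
points = [(2,1),(5,2),(2,2),(4,1),(4,3),(7,5),(3,6),(5,7),(1,4),(4,1)]
means = [(2.0, 2.0), (5.0, 7.0)]
iterations = 0
while True:
    # assign each point to its nearest mean (squared Euclidean distance)
    parts = [[], []]
    for p in points:
        d = [(p[0] - m[0])**2 + (p[1] - m[1])**2 for m in means]
        parts[d.index(min(d))].append(p)
    # recompute each mean as the per-cluster coordinate average
    new_means = [
        (sum(p[0] for p in part) / len(part), sum(p[1] for p in part) / len(part))
        for part in parts
    ]
    iterations += 1
    if new_means == means:   # converged: the means no longer change
        break
    means = new_means
print("Final means:", means, "after", iterations, "iterations")
```

With these points the loop converges quickly: one pass moves the means, and the second pass confirms the assignments no longer change.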
3. Write a program of cluster analysis using the simple k-means algorithm in the Python programming language
Aim: To write a program of cluster analysis using the simple k-means algorithm in the Python programming language
Source code:
import matplotlib.pyplot as plt
x = [4, 5, 10, 4, 3, 11, 14 , 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
plt.scatter(x, y)
plt.show()
Output:
from sklearn.cluster import KMeans
data = list(zip(x, y))
inertias = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)
Output:
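What `kmeans.inertia_` measures, the within-cluster sum of squared distances, can be recomputed by hand; a minimal sketch of the definition (an illustration, not the scikit-learn implementation):

```python
# Hedged sketch: inertia = sum of squared distances from each point
# to the centroid of the cluster it is assigned to.
def inertia(points, centroids, labels):
    total = 0.0
    for (px, py), lab in zip(points, labels):
        cx, cy = centroids[lab]
        total += (px - cx) ** 2 + (py - cy) ** 2
    return total

# Toy check: two points sharing one centroid at their midpoint.
pts = [(0, 0), (2, 0)]
print(inertia(pts, [(1, 0)], [0, 0]))  # 1 + 1 = 2.0
```

Plotting the collected inertias against the number of clusters gives the usual elbow curve for choosing k: inertia always decreases as k grows, and the "elbow" marks where additional clusters stop paying off.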
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()
Output:
4. Visualize datasets using matplotlib in Python (histogram, box plot, bar chart, pie chart, etc.)
Aim: To visualize datasets using matplotlib in Python (histogram, box plot, bar chart, pie chart, etc.)
Source code:
PIE CHART:
import matplotlib.pyplot as plt
cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]
plt.pie(data, labels=cars)
plt.title("Car data")
plt.show()
BOXPLOT:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(10)
data_1 = np.random.normal(100, 10, 200)
data_2 = np.random.normal(90, 20, 200)
data_3 = np.random.normal(80, 30, 200)
data_4 = np.random.normal(70, 40, 200)
data = [data_1, data_2, data_3, data_4]
fig = plt.figure(figsize =(10, 7))
ax = fig.add_axes([0, 0, 1, 1])
bp = ax.boxplot(data)
plt.show()
HISTOGRAM:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('d:\\tips.csv')
x = data['total_bill']
plt.hist(x, bins=25, color='green', edgecolor='blue',
         linestyle='--', alpha=0.5)
plt.title("Tips Dataset")
plt.ylabel('Frequency')
plt.xlabel('Total Bill')
plt.show()
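`plt.hist` both draws the bars and computes the bin counts; the underlying equal-width binning can be sketched in plain Python (using made-up bill values rather than the real tips.csv):

```python
# Hedged sketch of equal-width histogram binning.
def bin_counts(values, bins, lo, hi):
    # equal-width bins over [lo, hi); the last bin also includes hi,
    # mirroring the usual histogram convention
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

bills = [10.5, 12.0, 15.3, 20.0, 24.9, 30.0]
print(bin_counts(bills, 4, 10, 30))  # four bins of width 5
```

This mirrors what `bins=25` does in the histogram above: the range of `total_bill` is divided into 25 equal-width intervals and each interval's frequency is plotted.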
Bar Chart:
import matplotlib.pyplot as plt
year = ['2010', '2002', '2004', '2006', '2008']
production = [25, 15, 35, 30, 10]
plt.bar(year, production)
plt.savefig("output.jpg")
plt.savefig("output1", facecolor='y', bbox_inches="tight",
pad_inches=0.3, transparent=True)
plt.show()
4. Demonstration of association rule process on dataset test.arff using Apriori algorithm
Aim: This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is test.arff.
Step 1: Open the data file in the WEKA Explorer. It is presumed that the required data fields have been discretized; in this example it is the age attribute.
Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.
Step 4: In order to change the parameters for the run (e.g. support, confidence), we click on the text box immediately to the right of the Choose button.
Dataset test.arff
@relation test
@data
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
%
The following screenshot shows the association rules that were generated when the Apriori algorithm was applied to the given dataset.
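The counting that Apriori performs can be illustrated on the test.arff rows in plain Python (a hand-rolled sketch of the support/confidence arithmetic, not WEKA's implementation):

```python
from collections import Counter

# the (year, branch) rows from test.arff
rows = [("2005","cse"),("2005","it"),("2005","cse"),("2006","mech"),
        ("2006","it"),("2006","ece"),("2007","it"),("2007","cse"),
        ("2008","it"),("2008","cse"),("2009","it"),("2009","ece")]
n = len(rows)
pair_counts = Counter(rows)               # count of each (year, branch) pair
year_counts = Counter(y for y, _ in rows) # count of each year on its own

# support and confidence of the candidate rule 2005 -> cse
support = pair_counts[("2005", "cse")] / n                       # 2 of 12 rows
confidence = pair_counts[("2005", "cse")] / year_counts["2005"]  # 2 of 3 "2005" rows
print(support, confidence)
```

WEKA's Apriori does the same counting over all candidate itemsets and keeps only the rules that clear the configured support and confidence thresholds.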
5. Demonstration of classification rule process on dataset student.arff using J48 algorithm
Aim: This experiment illustrates the use of the J48 classifier in WEKA. The sample dataset used in this experiment is the "student" data, available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
Step 1: We begin the experiment by loading the data (student.arff) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the J48 classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values. The default version does perform some pruning, but does not perform error pruning.
Step 4: Under the Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation dataset, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed, either in preprocessing or in selecting better parameters for the classification.
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.
Step 9: In the main panel, under Test options, click the "Supplied test set" radio button and then click the "Set" button. This will pop up a window which allows you to open the file containing the test instances.
Dataset student .arff
@relation student
@data
%
The following screenshot shows the classification rules that were generated when the J48 algorithm was applied to the given dataset.
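The 10-fold cross-validation chosen in Step 4 can be sketched in plain Python: the instances are split into 10 folds, and each fold in turn is held out for testing while the classifier is trained on the remaining nine (an illustrative sketch of the splitting only, not WEKA's implementation):

```python
def kfold_indices(n_instances, k=10):
    # deal instance indices round-robin into k folds
    folds = [[] for _ in range(k)]
    for i in range(n_instances):
        folds[i % k].append(i)
    return folds

folds = kfold_indices(20, k=10)
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(20) if i not in held_out]
    # a classifier would be trained on `train` and evaluated on `test_fold`
print([len(f) for f in folds])
```

Averaging the accuracy over the ten held-out folds is what gives the "about 69%" figure a meaning: every instance is tested exactly once, on a model that never saw it during training.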
6. Demonstration of classification rule process on dataset employee.arff using J48 algorithm
Aim: This experiment illustrates the use of the J48 classifier in WEKA. The sample dataset used in this experiment is the "employee" data, available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the J48 classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values. The default version does perform some pruning, but does not perform error pruning.
Step 4: Under the Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation dataset, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed, either in preprocessing or in selecting better parameters for the classification.
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.
Step 9: In the main panel, under Test options, click the "Supplied test set" radio button and then click the "Set" button. This will pop up a window which allows you to open the file containing the test instances.
Data set employee.arff:
@relation employee
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k}
@data
48, 32k,good
%
The following screenshot shows the classification rules that were generated when the J48 algorithm was applied to the given dataset.
7. Demonstration of classification rule process on dataset employee.arff using ID3 algorithm
Aim: This experiment illustrates the use of the ID3 classifier in WEKA. The sample dataset used in this experiment is the "employee" data, available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the Id3 classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values.
Step 4: Under the Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation dataset, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed, either in preprocessing or in selecting better parameters for the classification.
Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.
Step 9: In the main panel, under Test options, click the "Supplied test set" radio button and then click the "Set" button. This will pop up a window which allows you to open the file containing the test instances.
Data set employee.arff:
@relation employee
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k}
@data
%
The following screenshot shows the classification rules that were generated when the ID3 algorithm was applied to the given dataset.
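ID3 selects the splitting attribute by information gain; the computation can be sketched in plain Python on a hypothetical two-class sample (the salary values are illustrative, not the actual employee.arff contents):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of the class distribution, in bits
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    # entropy before the split minus the weighted entropy after it
    n = len(rows)
    gain = entropy(labels)
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(lab)
    for subset in by_value.values():
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [("30k",), ("30k",), ("10k",), ("10k",)]
labels = ["good", "good", "avg", "avg"]
print(info_gain(rows, 0, labels))  # a perfect split: gain == 1.0
```

ID3 picks, at each node, the attribute with the highest gain; J48 (C4.5) refines this with gain ratio and pruning, which is why the two experiments can yield different trees on the same data.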
8. Demonstration of classification rule process on dataset employee.arff using naïve Bayes algorithm
Aim: This experiment illustrates the use of the naïve Bayes classifier in WEKA. The sample dataset used in this experiment is the "employee" data, available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
Step 1: We begin the experiment by loading the data (employee.arff) into WEKA.
Step 2: Next we select the Classify tab and click the Choose button to select the NaiveBayes classifier.
Step 3: Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example we accept the default values.
Step 4: Under the Test options in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation dataset, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: We now click Start to generate the model. The model summary, as well as the evaluation statistics, will appear in the right panel when the model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed, either in preprocessing or in selecting better parameters for the classification.
Step 7: WEKA also lets us explore the result further. Right-clicking the last result set in the result list brings up a pop-up menu with the available visualization options.
Step 9: In the main panel, under Test options, click the "Supplied test set" radio button and then click the "Set" button. This will pop up a window which allows you to open the file containing the test instances.
Data set employee.arff:
@relation employee
@attribute salary{10k,15k,17k,20k,25k,30k,35k,32k}
@data
%
The following screenshot shows the classification rules that were generated when the naïve Bayes algorithm was applied to the given dataset.
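The naïve Bayes computation itself can be sketched in plain Python: a class prior multiplied by per-attribute conditional likelihoods, here on a hypothetical miniature employee table (illustrative rows, not the real employee.arff):

```python
# Hedged sketch of the naive Bayes score for a single attribute.
# Hypothetical (salary, performance) training rows:
rows = [("30k", "good"), ("30k", "good"), ("10k", "avg"),
        ("10k", "avg"), ("30k", "avg")]

def nb_score(salary, cls):
    # P(class) * P(salary | class), using plain relative frequencies
    cls_rows = [s for s, c in rows if c == cls]
    prior = len(cls_rows) / len(rows)
    likelihood = cls_rows.count(salary) / len(cls_rows)
    return prior * likelihood

for cls in ("good", "avg"):
    print(cls, nb_score("30k", cls))
```

The predicted class is the one with the higher score; with several attributes, the per-attribute likelihoods are simply multiplied together, which is the "naïve" independence assumption. (WEKA's NaiveBayes additionally smooths zero counts.)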
9. Demonstration of clustering rule process on dataset iris.arff using simple k-means
Aim: This experiment illustrates the use of simple k-means clustering with the WEKA explorer. The sample dataset used for this example is based on the iris data, available in ARFF format. This document assumes that appropriate preprocessing has been performed. The iris dataset includes 150 instances.
Step 1: Run the WEKA explorer and load the data file iris.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the Cluster tab in the explorer and click on the Choose button. This step results in a dropdown list of available clustering algorithms.
Step 4: Next, click on the text box to the right of the Choose button to get the pop-up window shown in the screenshots. In this window we enter 6 as the number of clusters and leave the seed value as it is. The seed value is used in generating a random number, which in turn is used for making the initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm, making sure that "Use training set" is selected in the Cluster mode panel, and then click the Start button. This process and the resulting window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. The cluster centroids are the mean vectors of each cluster and can be used to characterize the clusters. For example, the centroid of cluster 1 shows that for the class Iris-versicolor the mean sepal length is 5.4706, sepal width 2.4765, petal width 1.1294 and petal length 3.7941.
Step 7: Another way of understanding the characteristics of each cluster is through visualization. To do this, right-click the result set in the result list panel and select "Visualize cluster assignments".
The following screenshot shows the clusters that were generated when the simple k-means algorithm was applied to the given dataset.
Interpretation of the above visualization
From the above visualization we can understand the distribution of sepal length and petal length in each cluster. For instance, each cluster differs noticeably in petal length. By changing the color dimension to other attributes, we can see their distribution within each cluster.
Step 8: We can save the resulting dataset, which includes each instance along with its assigned cluster. To do so, we click the Save button in the visualization window and save the result as iris k-mean. The top portion of this file is shown in the following figure.
10. Demonstration of clustering rule process on dataset student.arff using simple k-means
Aim: This experiment illustrates the use of simple k-means clustering with the WEKA explorer. The sample dataset used for this example is based on the student data, available in ARFF format. This document assumes that appropriate preprocessing has been performed. The student dataset includes 14 instances.
Step 1: Run the WEKA explorer and load the data file student.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the Cluster tab in the explorer and click on the Choose button. This step results in a dropdown list of available clustering algorithms.
Step 4: Next, click on the text box to the right of the Choose button to get the pop-up window shown in the screenshots. In this window we enter 6 as the number of clusters and leave the seed value as it is. The seed value is used in generating a random number, which in turn is used for making the initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm, making sure that "Use training set" is selected in the Cluster mode panel, and then click the Start button. This process and the resulting window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. The cluster centroids are the mean vectors of each cluster and can be used to characterize the clusters.
From the above visualization we can understand the distribution of age and instance number in each cluster. For instance, each cluster differs noticeably in age. By changing the color dimension to other attributes, we can see their distribution within each cluster.
Step 8: We can save the resulting dataset, which includes each instance along with its assigned cluster. To do so, we click the Save button in the visualization window and save the result as student k-mean. The top portion of this file is shown in the following figure.
Dataset student .arff
@relation student
@data
%
The following screenshot shows the clusters that were generated when the simple k-means algorithm was applied to the given dataset.