DWDM Lab Manual


Exercise 1: INTRODUCTION

1. Introduction to Python libraries for Data Mining: NumPy, Pandas, Matplotlib etc.
Write a Python program to do the following operations: Library: NumPy
a) Create multi-dimensional arrays and find its shape and dimension
b) Create a matrix full of zeros and ones
c) Reshape and flatten data in the array
d) Append data vertically and horizontally
e) Apply indexing and slicing on array
f) Use statistical functions on array - Min, Max, Mean, Median and Standard Deviation
PROCEDURE:
1. Create: Open a new file in the Python IDLE editor, write the program and save it with a .py
extension.
2. Execute: Go to Run -> Run Module (F5)
a) Create multi-dimensional arrays and find its shape and dimension
import numpy as np
#creation of multi-dimensional array
a=np.array([[1,2,3],[2,3,4],[3,4,5]])
#shape
b=a.shape
print("shape:")
print(a.shape)
#dimension
c=a.ndim
print("dimensions:")
print(a.ndim)
b) Create a matrix full of zeros and ones
import numpy as np
#matrix full of zeros
z=np.zeros((2,2))
print("zeros:")
print(z)
#matrix full of ones
o=np.ones((2,2))
print("ones:")
print(o)

c) Reshape and flatten data in the array


import numpy as np
a=np.array([[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]])
b=a.reshape(4,2,2)
print("reshape:")
print(b)
#matrix flatten
c=a.flatten()
print("flatten:")
print(c)
d) Append data vertically and horizontally
import numpy as np
#Appending data vertically
x=np.array([[10,20],[80,90]])
y=np.array([[30,40],[60,70]])
v=np.vstack((x,y))
print("vertically:")
print(v)
#Appending data horizontally
h=np.hstack((x,y))
print("horizontally:")
print(h)
e) Apply indexing and slicing on array
import numpy as np
a=np.array([[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7]])
#indexing
temp = a[[0, 1, 2, 3],[1, 1, 1, 1]]
print('indexing:')
print(temp)
#slicing
i=a[:4,::2]
print('slicing:')
print(i)

f) Use statistical functions on array - Min, Max, Mean, Median and Standard Deviation

import numpy as np
#min for finding minimum of an array
a=np.array([[1,3,-1,4],[3,-2,1,4]])
b=a.min()
print("minimum:",b)
#max for finding maximum of an array
c=a.max()
print("maximum:",c)
#mean
a=np.array([1,2,3,4,5])
d=a.mean()
print("mean:",d)
#median
e=np.median(a)
print("median:",e)
#standard deviation
f=a.std()
print("standard deviation:",f)

Exercise 2: UNDERSTANDING DATA
Write Python programs to do the following operations:
1. Loading data from CSV file
2. Compute the basic statistics of given data - shape, no. of columns, mean
3. Splitting a data frame on values of categorical variables
4. Visualize data using Scatter plot
Dataset: brain_size.csv
Library: Pandas, matplotlib
1. Loading data from CSV file
import pandas as pd
a=pd.read_csv("D:/data.csv")
print(a)
2. Compute the basic statistics of given data - shape, no. of columns, mean
import pandas as pd
a=pd.read_csv("D:/data.csv")
print('shape :',a.shape)
#no of columns
cols=len(a.axes[1])
print('no of columns:',cols)
#mean of data

m=a["marks"].mean()

print('mean of marks:',m)

3. Splitting a data frame on values of categorical variables

import pandas as pd
a=pd.read_csv("D:/data.csv")
print("Before:")
print(a)
a_split=a['address'].str.split(' ', n=1)
a['district']=a_split.str.get(0)
a['state']=a_split.str.get(1)
del(a['address'])
print("After:")
print(a)

4. Visualize data using Scatter plot

import pandas as pd
import matplotlib.pyplot as plt
a=pd.read_csv("D:/data.csv")
print("Before:",a)
a_split=a['address'].str.split(',', n=1)
a['district']=a_split.str.get(0)
a['state']=a_split.str.get(1)
del(a['address'])
print("After:",a)
a.plot(kind='scatter',x='marks',y='rollno',c='red')
plt.show()

Exercise 3: CORRELATION MATRIX


Write Python programs to load the dataset and understand the input data:
1. Load data, describe the given data and identify missing, outlier data items
2. Find correlation among all attributes
3. Visualize correlation matrix
1. Load data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("D:/diabetes.csv")
print(df.describe())
print(df.head(10))
# count missing values in each column
print(df.isnull().sum())
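
The exercise also asks to identify outlier data items, which the snippet above does not do. A minimal sketch (not part of the original manual) that flags outliers with the 1.5*IQR rule, assuming the same diabetes.csv file and numeric columns:

import pandas as pd
df = pd.read_csv("D:/diabetes.csv")
num = df.select_dtypes(include="number")
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
# a value is flagged if it lies more than 1.5*IQR outside the inter-quartile range
outliers = (num < (q1 - 1.5 * iqr)) | (num > (q3 + 1.5 * iqr))
print(outliers.sum())   # number of flagged values per column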
2. Find correlation among all attributes
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats
# Making data frame from the csv file
df = pd.read_csv("nba.csv")
# Printing the first 10 rows of the data frame for visualization
print(df[:10])

# To find the correlation among columns
# using the 'pearson' method (numeric_only=True skips non-numeric columns; requires pandas >= 1.5)
print(df.corr(method='pearson', numeric_only=True))
# using the 'kendall' method
print(df.corr(method='kendall', numeric_only=True))

3. Visualize correlation matrix


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("D:/diabetes.csv")
# compute the correlation matrix and draw it as a heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

Exercise 4: DATA PREPROCESSING – HANDLING MISSING VALUES

Write a Python program to impute missing values using various techniques on a given dataset.
1. Remove rows/ attributes
2. Replace with mean or mode
3. Write a python program to perform transformation of data using Discretization (Binning)
and normalization (MinMaxScaler or MaxAbsScaler) on given dataset.
1. Remove rows/ attributes
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("D:/diabetes.csv")
# filling missing value using fillna()
print(df.fillna(0))
# filling a missing value with previous value
print(df.fillna(method ='pad'))
#Filling null value with the next ones
print(df.fillna(method ='bfill'))
# filling null values in a single column (assumes the data has a 'gender' column)
df["gender"] = df["gender"].fillna("No Gender")
print(df["gender"])
# replace every NaN value in the dataframe with -99
print(df.replace(to_replace = np.nan, value = -99))
# using dropna() function to remove rows having one Nan
print(df.dropna())
# using dropna() function to remove rows with all Nan
print(df.dropna(how = 'all'))
# using dropna() function to remove column having one Nan
print(df.dropna(axis = 1))

2. Replace with mean or mode


import pandas as pd
df = pd.read_csv("D:/diabetes.csv")
# replace missing values in a numeric column with the column mean
print(df["Age"].fillna(df["Age"].mean()))
# replace missing values in the same column with the most frequent value (mode)
print(df["Age"].fillna(df["Age"].mode()[0]))

3. Perform transformation of data using Discretization (Binning)


Binning can also be used as a discretization technique. Discretization is the process of
converting or partitioning continuous attributes, features or variables into discretized or
nominal attributes/features/intervals. For example, attribute values can be discretized by
applying equal-width or equal-frequency binning and then replacing each bin value by the bin
mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. The
continuous values are thus converted to a nominal or discretized value that is the same as the
value of their corresponding bin.
There are basically two types of binning approaches –
Equal width (or distance) binning: The simplest binning approach is to partition the range of
the variable into k equal-width intervals. The interval width is simply the range [A, B] of the
variable divided by k: w = (B - A) / k.
Thus, the i-th interval range is [A + (i-1)w, A + iw], where i = 1, 2, 3, ..., k.
This method does not handle skewed data well.
Equal depth (or frequency) binning: In equal-frequency binning we divide the range [A, B] of the
variable into intervals that contain (approximately) an equal number of points; exactly equal
frequency may not be possible due to repeated values.
There are three approaches to perform smoothing –
Smoothing by bin means : In smoothing by bin means, each value in a bin is replaced
by the mean value of the bin.
Smoothing by bin median : In this method each bin value is replaced by its bin
median value.
Smoothing by bin boundary : In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
Example:
Sorted data for price (in dollars): 2, 6, 7, 9, 13, 20, 21, 25, 30
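
As a compact worked illustration (not part of the original manual): with three equal-depth bins {2, 6, 7}, {9, 13, 20} and {21, 25, 30}, smoothing by bin means replaces the values by 5, 14 and 25.33 respectively. A short sketch using pandas (pd.cut for equal-width and pd.qcut for equal-frequency binning) on the same price data:

import pandas as pd
price = pd.Series([2, 6, 7, 9, 13, 20, 21, 25, 30])
# equal-width binning into 3 intervals of width (30 - 2) / 3
print(pd.cut(price, bins=3))
# equal-frequency (equal-depth) binning into 3 bins of about 3 values each
print(pd.qcut(price, q=3))
# smoothing by bin means: replace each value by the mean of its equal-depth bin
print(price.groupby(pd.qcut(price, q=3)).transform('mean'))

The loop-based programs below implement the same three smoothing strategies from scratch.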

Smoothing by bin means:


import math
from collections import OrderedDict

print("enter the data:")
x = list(map(float, input().split()))
print("enter the number of bins:")
bi = int(input())
# X_dict stores the data keyed by its original index; x_old keeps the original data
X_dict = OrderedDict()
x_old = {}
# x_new will store the data after binning
x_new = {}
for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]
# x_dict holds the (index, value) pairs sorted by value
x_dict = sorted(X_dict.items(), key=lambda item: item[1])
# list of bin means
binn = []
# running sum used to find the mean of each bin
avrg = 0
i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))
# performing binning on the sorted data
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        avrg = avrg + h
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append(round(avrg / num_of_data_in_each_bin, 3))
        avrg = 0
        avrg = avrg + h
        i = i + 1
# close the last (possibly smaller) bin
rem = len(x) % num_of_data_in_each_bin
if rem == 0:
    binn.append(round(avrg / num_of_data_in_each_bin, 3))
else:
    binn.append(round(avrg / rem, 3))
# store the new (smoothed) value of each data point
i = 0
j = 0
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        x_new[g] = binn[j]
        i = i + 1
    else:
        i = 0
        j = j + 1
        x_new[g] = binn[j]
        i = i + 1
print("number of data in each bin")
print(math.ceil(len(x) / bi))
for i in range(0, len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))

Smoothing by bin medians:

import math
import statistics
from collections import OrderedDict

print("enter the data")
x = list(map(float, input().split()))
print("enter the number of bins")
bi = int(input())
# X_dict stores the data keyed by its original index; x_old keeps the original data
X_dict = OrderedDict()
x_old = {}
# x_new will store the data after binning
x_new = {}
for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]
# x_dict holds the (index, value) pairs sorted by value
x_dict = sorted(X_dict.items(), key=lambda item: item[1])
# list of bin medians
binn = []
# values collected for the current bin
avrg = []
i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))
# performing binning on the sorted data
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        avrg.append(h)
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append(statistics.median(avrg))
        avrg = []
        avrg.append(h)
        i = i + 1
binn.append(statistics.median(avrg))
# store the new (smoothed) value of each data point
i = 0
j = 0
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        x_new[g] = round(binn[j], 3)
        i = i + 1
    else:
        i = 0
        j = j + 1
        x_new[g] = round(binn[j], 3)
        i = i + 1
print("number of data in each bin")
print(math.ceil(len(x) / bi))
for i in range(0, len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))

Smoothing by bin boundaries:

import math
from collections import OrderedDict

print("enter the data")
x = list(map(float, input().split()))
print("enter the number of bins")
bi = int(input())
# X_dict stores the data keyed by its original index; x_old keeps the original data
X_dict = OrderedDict()
x_old = {}
# x_new will store the data after binning
x_new = {}
for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]
# x_dict holds the (index, value) pairs sorted by value
x_dict = sorted(X_dict.items(), key=lambda item: item[1])
# list of [min, max] boundaries of each bin
binn = []
# values collected for the current bin
avrg = []
i = 0
k = 0
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))
# performing binning on the sorted data
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        avrg.append(h)
        i = i + 1
    elif i == num_of_data_in_each_bin:
        k = k + 1
        i = 0
        binn.append([min(avrg), max(avrg)])
        avrg = []
        avrg.append(h)
        i = i + 1
binn.append([min(avrg), max(avrg)])
# replace each value by the closest boundary of its bin
i = 0
j = 0
for g, h in x_dict:
    if i < num_of_data_in_each_bin:
        if abs(h - binn[j][0]) >= abs(h - binn[j][1]):
            x_new[g] = binn[j][1]
        else:
            x_new[g] = binn[j][0]
        i = i + 1
    else:
        i = 0
        j = j + 1
        if abs(h - binn[j][0]) >= abs(h - binn[j][1]):
            x_new[g] = binn[j][1]
        else:
            x_new[g] = binn[j][0]
        i = i + 1
print("number of data in each bin")
print(math.ceil(len(x) / bi))
for i in range(0, len(x)):
    print('index {2} old value {0} new value {1}'.format(x_old[i], x_new[i], i))

3. Perform transformation of data using normalization (MinMaxScaler or MaxAbsScaler) on a given
dataset.

In preprocessing, scaling (normalization) of data is one of the transformation tasks. It rescales
features to lie between a given minimum and maximum value, often between zero and one, or so that
the maximum absolute value of each feature is scaled to unit size. This can be achieved using
MinMaxScaler or MaxAbsScaler, respectively.

The motivation for this scaling includes robustness to very small standard deviations of features
and preserving zero entries in sparse data.

# example of a normalization
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# define data
data = asarray([[100, 0.001],[8, 0.05],[50, 0.005],[88, 0.07],[4, 0.1]])
print(data)

# define min max scaler


scaler = MinMaxScaler()

# transform data
scaled = scaler.fit_transform(data)
print(scaled)
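
MaxAbsScaler is mentioned above but not demonstrated. A minimal sketch on the same sample data (assuming scikit-learn is installed) could look like this:

# example of scaling by the maximum absolute value
from numpy import asarray
from sklearn.preprocessing import MaxAbsScaler
# define data (same sample values as above)
data = asarray([[100, 0.001],[8, 0.05],[50, 0.005],[88, 0.07],[4, 0.1]])
# define max abs scaler and transform the data
scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)
print(scaled)

Each column is divided by its maximum absolute value, so the results lie in the range [-1, 1] and zero entries are preserved.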

Exercise 5: ASSOCIATION RULE MINING- APRIORI

Write a Python program to find rules that describe associations by using the Apriori algorithm.

Steps in Apriori:

1. Set a minimum value for support and confidence. This means that we are only interested in
rules for items that occur with at least a certain frequency (support) and whose co-occurrence
with other items meets a minimum strength (confidence).

2. Extract all the itemsets whose support is higher than the minimum threshold.

3. Select all the rules from these itemsets whose confidence is higher than the minimum threshold.

4. Order the rules in descending order of lift.

Example:

from apyori import apriori

transactions = [ ['beer', 'nuts'], ['beer', 'cheese'], ]

#CASE 1: default thresholds
results = list(apriori(transactions))
association_results = list(results)
print(results[0])

#CASE 2: min_support=0.5, min_confidence=0.8
results = list(apriori(transactions, min_support=0.5, min_confidence=0.8))
association_results = list(results)
print(len(results))
print(association_results)

OUTPUT: 5

RelationRecord(items=frozenset({'beer'}), support=1.0,
ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'beer'}),
confidence=1.0, lift=1.0)])

Case 2:

[RelationRecord(items=frozenset({'beer'}), support=1.0,
ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'beer'}),
confidence=1.0, lift=1.0)]),
RelationRecord(items=frozenset({'cheese', 'beer'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'cheese'}), items_add=frozenset({'beer'}),
confidence=1.0, lift=1.0)]),

RelationRecord(items=frozenset({'nuts', 'beer'}), support=0.5,


ordered_statistics=[OrderedStatistic(items_base=frozenset({'nuts'}), items_add=frozenset({'beer'}),
confidence=1.0, lift=1.0)])]

Three major measures to validate Association Rules:

• Support

• Confidence

• Lift

Suppose we have a record of 1,000 customer transactions and consider two items, e.g. burgers and
ketchup. Out of the 1,000 transactions, 100 contain ketchup while 150 contain a burger. Of the
150 transactions where a burger is purchased, 50 also contain ketchup. Using this data, find the
support, confidence, and lift.

Support:

Support(B) = (Transactions containing (B))/(Total Transactions)

For instance if out of 1000 transactions, 100 transactions contain Ketchup then the support for item
Ketchup can be calculated as:

Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)

Support(Ketchup) = 100/1000 = 10%

Confidence:

Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be
calculated by finding the number of transactions where A and B are bought together, divided by total
number of transactions where A is bought.

Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)

There are a total of 50 transactions where Burger and Ketchup were bought together, while a
burger was bought in 150 transactions. The likelihood of buying ketchup when a burger is bought
is the confidence of Burger -> Ketchup, written as:

Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing Burger)

Confidence(Burger→Ketchup) = 50/150 = 33.3%

Lift:

Lift(A→B) refers to the increase in the ratio of the sale of B when A is sold. It is calculated
as Confidence(A→B) divided by Support(B). Mathematically it can be represented as:

Lift (A→B) = (Confidence (A→B))/(Support (B))

In Burger and Ketchup problem, the Lift (Burger -> Ketchup) can be calculated as:

Lift (Burger → Ketchup) = (Confidence (Burger → Ketchup))/(Support (Ketchup))

Lift(Burger → Ketchup) = 33.3/10 = 3.33
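
A short sketch (not part of the original manual) that reproduces these numbers in Python:

# support, confidence and lift for the burger/ketchup example above
total = 1000      # total transactions
ketchup = 100     # transactions containing ketchup
burger = 150      # transactions containing a burger
both = 50         # transactions containing both items

support_ketchup = ketchup / total        # 0.10
confidence = both / burger               # 0.333...
lift = confidence / support_ketchup      # 3.333...
print(support_ketchup, confidence, lift)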

a) Display top 5 rows of data


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori
store_data = pd.read_csv("D:/datasets/store_data.csv")
print(store_data.head())
print('Structure of store data\n',str(store_data))
b) Find the rules with min_confidence=0.2, min_support=0.0045, min_lift=3, min_length=2
Let's suppose that we want rules only for those items that are purchased at least 5 times a day,
or 7 x 5 = 35 times in one week, since our dataset covers a one-week time period.
The support for those items can then be calculated as 35/7500 ≈ 0.0045.
The minimum confidence for the rules is 20% or 0.2.
Similarly, we set the minimum lift to 3, and min_length to 2 since at least two products should
appear in every rule.
#Converting data frame to list
records = []
for i in range(0, 7500):
    records.append([str(store_data.values[i,j]) for j in range(0, 20)])
#Generating association rules using apriori()
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
print(len(association_results))
print(association_results[0])
for item in association_results:
    # first index of the inner list contains the base item and the add item
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    # second index of the inner list is the support
    print("Support: " + str(item[1]))
    # third index holds the ordered statistics: confidence and lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")


Exercise 6: CLASSIFICATION - DECISION TREES

Write a Python program

1. To build a decision tree classifier to determine the kind of flower from the given dimensions.

2. Training with various split measures (Gini index, Entropy and Information Gain)
3. Compare the accuracy

A decision tree is a machine learning algorithm that uses a tree-like model of decisions and their
subsequent consequences to arrive at a particular decision. It is a Supervised Machine Learning model,
where the data is continuously split according to a certain parameter, and finally, a decision is made.

Usually, a decision tree is drawn upside down, with the root node at the top and the leaf nodes at the
bottom. A decision tree usually contains 3 types of nodes.

1. Root node: The very top node that represents the entire population or sample.
2. Decision nodes: Sub-nodes that split from the root node.
3. Leaf nodes: Nodes with no children, also known as terminal nodes.

Build a Decision Tree using IRIS dataset in Python:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Loading the Iris dataset bundled with scikit-learn
data = load_iris()

# Extracting Attributes / Features


X = data.data

# Extracting Target / Class Labels


y = data.target

# Import Library for splitting data


from sklearn.model_selection import train_test_split

# Creating Train and Test datasets


X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 50, test_size = 0.25)

# Creating Decision Tree Classifier


from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)

# Predict Accuracy Score


y_pred = clf.predict(X_test)
print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred=clf.predict(X_train)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred=y_pred))

[Figure: Decision tree built from the Iris dataset]
Exercise 7: CLASSIFICATION –BAYESIAN NETWORK

1. Build Bayesian network model using existing default data

2. Visualize Tree Augmented Naïve Bayes model

o The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it
helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.

Bayes' theorem: P(A|B) = (P(B|A) * P(A)) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence B given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

PROGRAM:-

from sklearn import metrics


from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset=pd.read_csv("Iris.csv")
X=dataset.iloc[:,:4].values
Y=dataset['Species'].values
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
classifier=GaussianNB()
classifier.fit(X_train,Y_train)
print(X_test[0])
y_pred=classifier.predict(X_test)
print(y_pred)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy:",accuracy)

Exercise 8: CLUSTERING – K-MEANS

Write a Python program

1. To perform Preprocessing

2. To perform clustering using k-means algorithm to cluster the records

Program:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X=np.array([[1,1],[1.5,2],[3,4],[5,7],[3.5,5],[4.5,5],[3.5,4.5]])
print(X)
plt.scatter(X[:,0],X[:,1])
kmeans=KMeans(n_clusters=2)
kmeans.fit(X)
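
The program stops after fitting the model. A minimal continuation (an illustrative sketch, not part of the original program) that adds the preprocessing step asked for in item 1 and visualizes the resulting clusters:

from sklearn.preprocessing import StandardScaler

# preprocessing: standardize the features before clustering
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(X_scaled)
# inspect the cluster assignments and centroids
print("labels:", kmeans.labels_)
print("centroids:\n", kmeans.cluster_centers_)
# visualize the clusters in the scaled feature space
plt.scatter(X_scaled[:,0], X_scaled[:,1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], marker='x', c='red')
plt.show()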
