Data Mining Using Python
PROGRAMME OUTCOMES (POs):
PROGRAMME SPECIFIC OUTCOMES (PSOs):
PSO 1: The ability to apply Software Engineering practices and strategies in software project development using an open-source programming environment for the success of the organization.
PSO 2: The ability to design and develop computer programs in networking, web applications and IoT as per the needs of society.
PSO 3: To inculcate an ability to analyze, design and implement database applications.
AIM: Demonstrating the following data preprocessing tasks using Python libraries
Description/Theory:
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm. Data preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format, which is not feasible for the analysis.
PROCEDURE/CODE:
a) Loading the dataset
import numpy as np
import pandas as pd
dataset=pd.read_csv('D:/DM/file1.csv')
print(dataset)
x=dataset.iloc[:,:-1].values
print(x)
b) Identifying the dependent and independent variables
dataset=pd.read_csv('D:/DM/file2.csv')
print(dataset)
x=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
print(x)
print(y)
x=dataset.iloc[1:,1:-1].values
y=dataset.iloc[:,[0,4]].values
z=dataset.iloc[:,0].values
print(x)
print(y)
print(z)
c) Dealing with missing data
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(x[:,1:3])
x[:,1:3]=imputer.transform(x[:,1:3])
print(x)
imputer=imputer.fit(x[:,1:])
x[:,1:]=imputer.transform(x[:,1:])
print(x)
import numpy as np
from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy='mean')
imp.fit([[1,2],[np.nan,3],[7,6]])
x=[[np.nan,2],[6,np.nan],[7,6]]
print(imp.transform(x))
OUTPUT:
a)   Roll    Name  Age  Branch  Marks1  Marks2  Marks3
0       1  Ramesh   19     CSE      28      27      25
1       2  Suresh   19      IT      24      29      26
2       3  Geetha   18     EEE      19      28      27
3       4  Seetha   19     ECE      26      19      28
4       5  Sheela   19    MECH      27      18      29
5       6     Ram   20     CSE      20      27      20
6       7    Ravi   19      IT      21      26      30
7       8    Mani   19     ECE      30      25      19
8       9   Srinu   20     EEE      28      24      18
9      10    Devi   19     CSE      18      23      23
[[1 'Ramesh' 19 'CSE' 28 27]
 [2 'Suresh' 19 'IT' 24 29]
 [3 'Geetha' 18 'EEE' 19 28]
 [4 'Seetha' 19 'ECE' 26 19]
 [5 'Sheela' 19 'MECH' 27 18]
 [6 'Ram' 20 'CSE' 20 27]
 [7 'Ravi' 19 'IT' 21 26]
 [8 'Mani' 19 'ECE' 30 25]
 [9 'Srinu' 20 'EEE' 28 24]
 [10 'Devi' 19 'CSE' 18 23]]
4  Germany  Female  40.0      NaN  Yes
5   France  Female  35.0  58000.0  Yes
6    Spain  Female   NaN  52000.0   No
c) [['Female' 27.0 48000.0]
 ['Male' 30.0 54000.0]
 ['Male' 38.0 61000.0]
 ['Female' 40.0 63000.0]
 ['Female' 35.0 58000.0]
 ['Female' 38.125 52000.0]
 ['Male' 48.0 79000.0]
 ['Male' 50.0 83000.0]
 ['Male' 37.0 69000.0]]
[[4. 2. ]
[6. 3.66666667]
[7. 6. ]]
DATE:
AIM: Demonstrate the following data preprocessing tasks using Python libraries
Description/Theory:
Categorical data is a type of data that is used to group information with similar characteristics, while numerical data is a type of data that expresses information in the form of numbers.
Example of categorical data: gender
Categorical variables can be divided into two categories:
Nominal: no particular order
Ordinal: there is some order between values
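As a small illustration of the difference (a minimal sketch; the column names and values below are made up and not part of the lab dataset), a nominal column can be one-hot encoded, while an ordinal column can be mapped to integer codes that respect its order:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
# Toy frame: 'gender' is nominal (no order), 'size' is ordinal (Small < Medium < Large)
df_demo = pd.DataFrame({'gender': ['Male', 'Female', 'Female'],
                        'size': ['Small', 'Large', 'Medium']})
# Nominal -> one-hot columns (no order implied)
onehot = OneHotEncoder()
print(onehot.fit_transform(df_demo[['gender']]).toarray())
# Ordinal -> integer codes that respect the given order
ordenc = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
print(ordenc.fit_transform(df_demo[['size']]))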
PROCEDURE/CODE:
a) Dealing with categorical data
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
x=np.array(ct.fit_transform(x))
print(x)
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y=le.fit_transform(y)
print(y)
b) Scaling the features
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
# x_train and x_test are produced by the train/test split shown in part (c)
x_train[:,6:]=sc.fit_transform(x_train[:,6:])
x_test[:,6:]=sc.transform(x_test[:,6:])
print(x_train)
print(x_test)
c) Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
OUTPUT
a) [[0.0 1.0 0.0 0.0 'Male' 44.0 72000.0]
 [1.0 0.0 0.0 1.0 'Female' 27.0 48000.0]
 [1.0 0.0 1.0 0.0 'Male' 30.0 54000.0]
 [1.0 0.0 0.0 1.0 'Male' 38.0 61000.0]
 [1.0 0.0 1.0 0.0 'Female' 40.0 nan]
 [0.0 1.0 0.0 0.0 'Female' 35.0 58000.0]
 [1.0 0.0 0.0 1.0 'Female' nan 52000.0]
 [0.0 1.0 0.0 0.0 'Male' 48.0 79000.0]
 [1.0 0.0 1.0 0.0 'Male' 50.0 83000.0]
 [0.0 1.0 0.0 0.0 'Male' 37.0 69000.0]]
[0 1 0 0 1 1 0 1 0 1]
b) [[1.0 0.0 0.0 1.0 'Female' nan -1.0182239953527132]
 [1.0 0.0 1.0 0.0 'Female' 40.0 nan]
 [0.0 1.0 0.0 0.0 'Male' 44.0 0.5834766714942513]
 [1.0 0.0 0.0 1.0 'Male' 38.0 -0.2974586952715791]
 [1.0 0.0 0.0 1.0 'Female' 27.0 -1.3385641287221062]
 [0.0 1.0 0.0 0.0 'Male' 48.0 1.1440719048906889]
 [1.0 0.0 1.0 0.0 'Male' 50.0 1.4644120382600818]
 [0.0 1.0 0.0 0.0 'Female' 35.0 -0.5377137952986238]]
[[1.0 0.0 1.0 0.0 'Male' 30.0 -0.8580539286680167]
 [0.0 1.0 0.0 0.0 'Male' 37.0 0.3432215714672067]]
c) [[1.0 0.0 0.0 1.0 'Female' nan 52000.0]
 [1.0 0.0 1.0 0.0 'Female' 40.0 nan]
 [0.0 1.0 0.0 0.0 'Male' 44.0 72000.0]
 [1.0 0.0 0.0 1.0 'Male' 38.0 61000.0]
 [1.0 0.0 0.0 1.0 'Female' 27.0 48000.0]
 [0.0 1.0 0.0 0.0 'Male' 48.0 79000.0]
 [1.0 0.0 1.0 0.0 'Male' 50.0 83000.0]
 [0.0 1.0 0.0 0.0 'Female' 35.0 58000.0]]
[[1.0 0.0 1.0 0.0 'Male' 30.0 54000.0]
 [0.0 1.0 0.0 0.0 'Male' 37.0 69000.0]]
[0 1 0 0 1 1 0 1]
[0 1]
DATE:
AIM: Demonstrate the following similarity and distance measures using Python libraries
Description/Theory:
Pearson's correlation coefficient measures the strength of the linear relationship between two variables and ranges from -1 to +1.
Cosine similarity is a metric used to measure the similarity of two vectors. Specifically, it measures the similarity in the direction or orientation of the vectors, ignoring differences in their magnitude or scale.
The Jaccard similarity measures the similarity between two sets of data to see which members are shared and distinct. The Jaccard similarity is calculated by dividing the number of observations in both sets by the number of observations in either set.
Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary distance between two points, and it is one of the most used measures in cluster analysis; one of the algorithms that uses this formula is K-means. Mathematically, it computes the root of the squared differences between the coordinates of two objects.
Manhattan distance determines the absolute difference between the pair of coordinates. Suppose we have two points P and Q; to determine the distance between these points we simply calculate the perpendicular distances of the points from the X-axis and Y-axis.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|
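As a quick check of this formula (a minimal sketch with made-up points, not part of the prescribed procedure), the Manhattan distance can be computed directly and compared with SciPy's cityblock function:
from scipy.spatial.distance import cityblock
# Two points P(x1, y1) and Q(x2, y2)
P = (1, 2)
Q = (4, 6)
# Direct use of |x1 - x2| + |y1 - y2|
print(abs(P[0] - Q[0]) + abs(P[1] - Q[1]))   # 7
# Same result with SciPy
print(cityblock(P, Q))                        # 7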
PROCEDURE/CODE:
a) Pearson's Correlation
# Pearson's Correlation
from scipy.stats import pearsonr
X=[-2,-1,0,1,2]
Y=[4,1,3,2,0]
corr=pearsonr(X,Y)
print('Pearson Correlation:',corr)
b) Cosine Similarity
import numpy as np
from numpy.linalg import norm
# define two arrays
a=np.array([2,1,2,3,2,9])
b=np.array([3,4,2,4,5,5])
print("a : ",a)
print("b : ",b)
# cosine similarity = (a . b) / (|a| * |b|)
cosine=np.dot(a,b)/(norm(a)*norm(b))
print("Cosine Similarity:",cosine)
c) Jaccard Similarity
import numpy as np
from scipy.spatial.distance import jaccard
a=np.array([1,0,0,1,1,1])
b=np.array([0,0,1,1,1,1])
d=jaccard(a,b)
print("Distance : ",d)
d) Euclidean Distance
from sklearn.metrics.pairwise import euclidean_distances
X=[[0,1],[1,1]]
euclidean_distances(X,X)
# Calculating Euclidean distance between vectors
from scipy.spatial.distance import euclidean
row1=[10,20,15,10,5]
row2=[12,24,18,8,7]
dist=euclidean(row1,row2)
print(dist)
e) Minkowski Distance
from scipy.spatial import minkowski_distance
row1=[10,20,15,10,5]
row2=[12,24,18,8,7]
#Calculate distance (p=1)
dist=minkowski_distance(row1,row2,1)
print(dist)
#Calculate distance (p=2)
dist=minkowski_distance(row1,row2,2)
print(dist)
# Pearson's correlation without using library methods
import math
n=int(input('Enter number of rows : '))
x,y=[],[]
for i in range(n):
    print('Enter '+str(i+1)+' row x value : ',end="")
    x.append(int(input()))
    print('Enter '+str(i+1)+' row y value : ',end="")
    y.append(int(input()))
print("Entered values :")
print(x)
print(y)
x2=[]  # List for storing x^2
y2=[]  # List for storing y^2
xy=[]  # List for storing x*y
for i in range(n):
    x2.append(math.pow(x[i],2))
    y2.append(math.pow(y[i],2))
    xy.append(x[i]*y[i])
print('x square : ',x2)
print('y square : ',y2)
print('xy value : ',xy)
# Use true division (/) so the coefficient is not truncated by floor division
rxy=sum(xy)/n-((sum(x)/n)*(sum(y)/n))
rx=math.sqrt(sum(x2)/n-math.pow(sum(x)/n,2))
ry=math.sqrt(sum(y2)/n-math.pow(sum(y)/n,2))
result=rxy/(rx*ry)
print("Pearson's correlation coefficient :",result)
OUTPUT:
a) Pearson Correlation: PearsonRResult(statistic=-0.7000000000000001, pvalue=0.18812040437418737)
b) a :  [2 1 2 3 2 9]
b :  [3 4 2 4 5 5]
Cosine Similarity: 0.8188504723485274
c) Distance: 0.4
d) array([[0.,1.],
[1.,0.]])
6.082762530298219
e) Minkowski distance
13.0
6.082762530298219
Pearson
Enter number of rows : 5
Enter 1 row x value : -2
Enter 1 row y value : 4
Enter 2 row x value : -1
Enter 2 row y value : 1
Enter 3 row x value : 0
Enter 3 row y value : 3
Enter 4 row x value : 1
Enter 4 row y value : 2
Enter 5 row x value : 2
Enter 5 row y value : 0
Entered values :
[-2,-1,0,1,2]
[4,1,3,2, 0]
x square : [4.0, 1.0, 0.0, 1.0, 4.0]
y square : [16.0, 1.0, 9.0, 4.0, 0.0]
xy value : [-8, -1, 0, 2, 0]
Pearson's correlation coefficient : -0.7
DATE:
AIM: Build a model using the linear regression algorithm on any dataset.
Description/Theory:
Linear regression is the type of regression that forms a relationship between the target variable and one or more independent variables using a straight line. The given equation represents the equation of linear regression:
Y = a + b*X + e
where a is the intercept, b is the slope and e is the error term.
PROCEDURE/CODE:
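Before the library-based implementation, the coefficients a and b can be estimated by ordinary least squares directly; the following is a minimal sketch with illustrative values (not the lab dataset):
import numpy as np
X = np.array([5, 15, 25, 35, 45])
Y = np.array([5, 21, 22, 32, 28])
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)   # slope
a = Y.mean() - b * X.mean()                                                 # intercept
print("slope b :", b, " intercept a :", a)
print("fitted values :", a + b * X)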
# Building a linear regression model for the diabetes dataset
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error,r2_score
diabetes_X,diabetes_y=datasets.load_diabetes(return_X_y=True)
diabetes_X=diabetes_X[:,np.newaxis,2]   # use only one feature
# Split the data into training and testing sets (last 20 samples for testing)
diabetes_X_train=diabetes_X[:-20]
diabetes_X_test=diabetes_X[-20:]
diabetes_y_train=diabetes_y[:-20]
diabetes_y_test=diabetes_y[-20:]
# Create an object for LinearRegression and fit it on the training data
regr=linear_model.LinearRegression()
regr.fit(diabetes_X_train,diabetes_y_train)
# Predict on the testing data
diabetes_y_pred=regr.predict(diabetes_X_test)
print(diabetes_y_pred)
Plotting:
plt.scatter(diabetes_X_test,diabetes_y_test,color='black')
plt.plot(diabetes_X_test,diabetes_y_pred,color='blue',linewidth=2)
plt.show()
Grid:
plt.scatter(diabetes_X_test,diabetes_y_test,color='black')
plt.plot(diabetes_X_test,diabetes_y_pred,color='blue',linewidth=2,marker='*',markerfacecolor='red')
plt.grid()
plt.show()
Linear Regression:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
x=np.array([5,15,45,25,35,45]).reshape((-1,1))
y=np.array([5,21,14,22,32,28])
print(x,y)
x=np.array([5,15,45,25,35,45]).reshape((-1,1))
y=np.array([5,21,14,22,32,28])
print(x,y)
model=LinearRegression()
model.fit(x,y)
result=model.score(x,y)
print("Score :",result)model.fit(x,y)
result=model.score(x,y)print("Score
: ",result)
print("Intercept:",model.intercept_)
print("Slope : ",model.coef_)
y_pred=model.predict(x)   # predicted y values
print("Actual values of y : ",y)
print("Predicted values of y : ",y_pred)
plt.plot(x,y_pred,color='blue',linewidth=2,marker='o',markerfacecolor='red')
plt.show()
plt.plot(x,y_pred,color='blue',linewidth=2,marker='o',markerfacecolor='red')
plt.grid()
plt.show()
Multiple Linear Regression
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Loading the dataset (file name and path assumed; the output below corresponds to a 50-row startups dataset)
dataset=pd.read_csv('D:/DM/50_Startups.csv')
print(dataset.shape)
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values
print(X)
# Encoding categorical data (the 'State' column at index 3)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[3])],remainder='passthrough')
X=np.array(ct.fit_transform(X))
print(X)
# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
# Training the Multiple Linear Regression model on the training set
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,y_train)
# Predicting the test set results
y_pred=regressor.predict(X_test)
plt.scatter(y_test,y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
pred_df=pd.DataFrame({'Actual Value':y_test,'Predicted Value':y_pred,'Difference':y_test-y_pred})
pred_df
OUTPUT:
[225.9732401  115.74763374 163.27610621 114.73638965 120.80385422
 158.21988574 236.08568105 121.81509832  99.56772822 123.83758651
 204.73711411  96.53399594 154.17490936 130.91629517  83.3878227
 171.36605897 137.99500384 137.99500384 189.56845268  84.3990668 ]
Plotting:
GRID:
[[ 5]
 [15]
 [45]
 [25]
 [35]
 [45]] [ 5 21 14 22 32 28]
[[ 5]
 [15]
 [45]
 [25]
 [35]
 [45]] [ 5 21 14 22 32 28]
Score : 0.3114260563380282
Score : 0.3114260563380282
Intercept : 10.912499999999996
Slope : [0.3325]
Actual values of y : [ 5 21 14 22 32 28]
Predicted values of y : [12.575 15.9   25.875 19.225 22.55  25.875]
plt.scatter(x,y,color='black')
Multiple Linear Regression:
(50,5)
[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
DATE:
AIM: Build a classification model using the Decision Tree algorithm on the iris dataset
Description/Theory:
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
PROCEDURE/CODE:
#importlibraries
fromsklearn.datasetsimportload_iris
fromsklearn.treeimportDecisionTreeClassifier,plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
#Load iris dataset
iris=load_iris()
X=iris.data
y=iris.target
#Split the data into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=None)
#Create an instance of the DecisionTreeClassifier class
tree_clf=DecisionTreeClassifier(max_depth=3)
#Fit the model on the training data
tree_clf.fit(X_train,y_train)
#predict on the testing data
y_pred=tree_clf.predict(X_test)
#Calculate accuracy of the model
accuracy=accuracy_score(y_test,y_pred)
print('Accuracy : ',accuracy)
# Visualize the decision tree using the plot_tree function
plt.figure(figsize=(15,10))
plot_tree(tree_clf,filled=True,feature_names=iris.feature_names,class_names=iris.target_names)
plt.show()
OUTPUT:
Accuracy: 0.9555555555555556
DATE:
AIM: Apply the Naive Bayes classification algorithm on any dataset.
Description/Theory:
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems. Bayes' theorem states that P(class | data) = P(data | class) * P(class) / P(data).
It is mainly used in text classification that includes a high-dimensional training dataset.
Some popular examples of the Naïve Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.
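As a small illustration of Bayes' theorem behind the classifier (a minimal sketch; all counts below are made up and not taken from any dataset used in this lab), the posterior probability that an email is spam given that it contains a particular word can be computed directly:
spam_total, ham_total = 40, 60            # emails of each class (made-up counts)
spam_with_word, ham_with_word = 30, 6     # emails containing the word "offer"
p_spam = spam_total / (spam_total + ham_total)                         # prior P(spam)
p_word_given_spam = spam_with_word / spam_total                        # likelihood P(word | spam)
p_word = (spam_with_word + ham_with_word) / (spam_total + ham_total)   # evidence P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word                # posterior P(spam | word)
print(p_spam_given_word)                                               # about 0.83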
PROCEDURE/CODE:
fromsklearnimportmetrics
fromsklearn.naive_bayesimportGaussianNB import
pandas as pd
importnumpyasnp
fromsklearn.model_selectionimporttrain_test_split
from sklearn.metrics import confusion_matrix
fromsklearn.metricsimportaccuracy_score
dataset=pd.read_csv('D:\DM\iris.csv')
print(dataset)
X=dataset.iloc[:,:4].values
Y=dataset['Species'].values
print(Y)
print(X)
#buildconfusionmatrix
fromsklearn.metricsimportconfusion_matrix
cm=confusion_matrix(Y_test,y_pred)
from sklearn.metrics import accuracy_score
print("Accuracy: ",accuracy_score(Y_test,y_pred))
cm
df=pd.DataFrame({'Real Values':Y_test,'Predicted Values':y_pred})
print(df)
Naive Bayes:
# Predict the class label for a new observation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np
# import iris dataset
iris=load_iris()
X=iris.data
Y=iris.target
le=LabelEncoder()
Y=le.fit_transform(Y)
# Split the dataset into training and testing sets
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=42)
#Train a Naive bayes model on the training data
nb_model = GaussianNB()
nb_model.fit(X_train, Y_train)
#Make predictions on the test data
y_pred=nb_model.predict(X_test)
y_pred = le.inverse_transform(y_pred)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy: ",accuracy)
new_observation = np.array([[5.8, 3.0, 4.5, 1.5]])
predicted_class = nb_model.predict(new_observation)
predicted_class=le.inverse_transform(predicted_class)
print("Predicted class: ", predicted_class)
OUTPUT:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0      1            5.1           3.5            1.4           0.2
1      2            4.9           3.0            1.4           0.2
2      3            4.7           3.2            1.3           0.2
3      4            4.6           3.1            1.5           0.2
4      5            5.0           3.6            1.4           0.2
..   ...            ...           ...            ...           ...
145  146            6.7           3.0            5.2           2.3
146  147            6.3           2.5            5.0           1.9
147  148            6.5           3.0            5.2           2.0
148  149            6.2           3.4            5.4           2.3
149  150            5.9           3.0            5.1           1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
[150 rows x 6 columns]
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
[[ 1. 5.1 3.5 1.4]
[ 2. 4.9 3. 1.4]
[ 3. 4.7 3.2 1.3]
[ 4. 4.6 3.1 1.5]
[ 5. 5. 3.6 1.4]
[ 6. 5.4 3.9 1.7]
[ 7. 4.6 3.4 1.4]
[ 8. 5. 3.4 1.5]
[ 9. 4.4 2.9 1.4]
[ 10. 4.9 3.1 1.5]
[11. 5.4 3.7 1.5]
['Iris-setosa' 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa'
 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-setosa' 'Iris-virginica' 'Iris-virginica'
 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
Accuracy: 1.0
Accuracy: 1.0
       Real Values  Predicted Values
0      Iris-setosa       Iris-setosa
1   Iris-virginica    Iris-virginica
2  Iris-versicolor   Iris-versicolor
3      Iris-setosa       Iris-setosa
4  Iris-versicolor   Iris-versicolor
5   Iris-virginica    Iris-virginica
Accuracy: 0.9777777777777777
Predicted class: [1]
DATE:
AIM: Generate frequent itemsets using the Apriori algorithm in Python and also generate association rules for any market basket data.
Description/Theory:
The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for boolean association rules. The name of the algorithm is Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative approach or level-wise search where k-frequent itemsets are used to find (k+1)-itemsets, as sketched below.
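The level-wise idea can be sketched by hand on the small market basket used in the procedure below (a minimal, illustrative sketch; the min_support value here is chosen only for demonstration):
from itertools import combinations
transactions = [{'Bread', 'Milk', 'Eggs'},
                {'Bread', 'Diapers', 'Beer', 'Eggs'},
                {'Milk', 'Diapers', 'Beer', 'Cola'},
                {'Bread', 'Milk', 'Diapers', 'Beer', 'Cola', 'Eggs'}]
min_support = 0.5
n = len(transactions)
# Level 1: frequent single items
items = {i for t in transactions for i in t}
L1 = {i for i in items if sum(i in t for t in transactions) / n >= min_support}
# Level 2: candidate pairs are built only from frequent 1-itemsets, then counted against the data
L2 = {pair for pair in combinations(sorted(L1), 2)
      if sum(set(pair) <= t for t in transactions) / n >= min_support}
print("Frequent 1-itemsets:", L1)
print("Frequent 2-itemsets:", L2)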
PROCEDURE/CODE:
pip install mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
data=[['Bread','Milk','Eggs'],
      ['Bread','Diapers','Beer','Eggs'],
      ['Milk','Diapers','Beer','Cola'],
      ['Bread','Milk','Diapers','Beer','Cola','Eggs']]
te=TransactionEncoder()
te_ary=te.fit(data).transform(data)
df=pd.DataFrame(te_ary,columns=te.columns_)
df
frequent_itemsets=apriori(df,min_support=0.75,use_colnames=True)
print(frequent_itemsets)
rules=association_rules(frequent_itemsets,metric='confidence',min_threshold=0.7)
print(rules)
selected_columns=['antecedents','consequents','antecedent support','consequent support','support','confidence']
print(rules[selected_columns])
Apriori:
pip install apyori
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from apyori import apriori
store_data=pd.read_csv("D:/DM/store_data.csv",header=None)
display(store_data.head())
store_data.shape
records=[]
for i in range(1,7501):
    records.append([str(store_data.values[i,j]) for j in range(0,20)])
print(type(records))
association_rules=apriori(records,min_support=0.0045,min_confidence=0.2,min_lift=3,min_length=2)
association_results=list(association_rules)
print("There are {} Relations derived.".format(len(association_results)))
for i in range(0,len(association_results)):
    print(association_results[i][0])
print(association_results[i][0])
for item in association_results:
    pair=item[0]
    items=[x for x in pair]
    print("Rule: "+items[0]+" -> "+items[1])
    print("Support: "+str(item[1]))
    print("Confidence: "+str(item[2][0][2]))
    print("Lift: "+str(item[2][0][3]))
    print("==========================================")
OUTPUT:
   support         itemsets
0     0.75           (Beer)
1     0.75        (Diapers)
2     0.75           (Eggs)
3     0.75  (Beer, Diapers)
  antecedents consequents  antecedent support  consequent support  support  \
0      (Beer)   (Diapers)                0.75                0.75     0.75
1   (Diapers)      (Beer)                0.75                0.75     0.75
   confidence
0         1.0
1         1.0
               0          1           2                 3             4                  5   ...         19
0         shrimp    almonds     avocado    vegetables mix  green grapes  whole wheat flour   ...  olive oil
1        burgers  meatballs        eggs               NaN           NaN                NaN   ...        NaN
2        chutney        NaN         NaN               NaN           NaN                NaN   ...        NaN
3         turkey    avocado         NaN               NaN           NaN                NaN   ...        NaN
4  mineral water       milk  energy bar  whole wheat rice     green tea                NaN   ...        NaN
(7501,20)
<class 'list'>
There are 48 Relations derived.
frozenset({'chicken', 'light cream'})
frozenset({'escalope', 'mushroom cream sauce'})
frozenset({'escalope', 'pasta'})
frozenset({'ground beef', 'herb & pepper'})
frozenset({'tomato sauce', 'ground beef'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'chicken', 'light cream'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'cooking oil', 'ground beef', 'spaghetti'})
DATE:
AIM: Apply the K-Means clustering algorithm on any dataset
Description/Theory:
K-Means Clustering is an unsupervised learning algorithm, which groups the unlabeled dataset into different clusters.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to the particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and it is away from other clusters.
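These two tasks can be sketched by hand for the same points used in the procedure below (a minimal, illustrative sketch; the initial centroids are assumed):
import numpy as np
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])     # assumed initial K = 2 centers
# Assignment step: each point gets the label of its nearest centroid
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
# Update step: each centroid moves to the mean of its assigned points
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(labels)
print(centroids)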
PROCEDURE/CODE:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
X=np.array([[1,1],[1.5,2],[3,4],[5,7],[3.5,5],[4.5,5],[3.5,4.5]])
print(X)
kmeans=KMeans(n_clusters=2)
kmeans.fit(X)
print("\nClusters:",kmeans.cluster_centers_)
print("\nLabels: ",kmeans.labels_)
plt.scatter(X[:,0],X[:,1],c=kmeans.labels_,cmap="rainbow")
OUTPUT:
[[1.  1. ]
 [1.5 2. ]
 [3.  4. ]
 [5.  7. ]
 [3.5 5. ]
 [4.5 5. ]
 [3.5 4.5]]
KMeans(n_clusters=2)
Clusters: [[1.25 1.5 ]
 [3.9  5.1 ]]
Labels: [0 0 1 1 1 1 1]
DATE:
AIM: Apply the Hierarchical Clustering algorithm on any dataset
Description/Theory:
HierarchicalClusteringinMachineLearning
Hierarchicalclusteringisanotherunsupervisedmachinelearningalgorithm,whichisusedto group
theunlabeleddatasets into a clusterand alsoknown ashierarchical cluster analysis or HCA.
Inthisalgorithm,wedevelopthehierarchyofclustersintheformofatree,andthistree-shaped structure
is known as the dendrogram.
SometimestheresultsofK-meansclusteringandhierarchicalclusteringmaylooksimilar,but they
both differ depending on how they work. As there is no requirement to predetermine the
number of clusters as we did in the K-Means algorithm.
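The bottom-up idea can be sketched in a few lines (a minimal, illustrative sketch with made-up points): every point starts as its own cluster and the pair with the smallest distance is merged first, which is exactly what the dendrogram in the procedure below visualizes level by level:
import numpy as np
from scipy.spatial.distance import pdist, squareform
pts = np.array([[0.3, 0.9], [0.3, 0.5], [0.6, 0.9], [0.7, 0.2], [0.2, 0.7]])
D = squareform(pdist(pts))          # pairwise distance matrix
np.fill_diagonal(D, np.inf)         # ignore zero self-distances
i, j = np.unravel_index(D.argmin(), D.shape)
print("first merge: points", i, "and", j, "at distance", D[i, j])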
PROCEDURE/CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from scipy.spatial.distance import squareform, pdist
a=np.random.random_sample(size=5)
b=np.random.random_sample(size=5)
point=['p1','p2','p3','p4','p5']
data=pd.DataFrame({'Point':point,'a':np.round(a,2),'b':np.round(b,2)})
data=data.set_index('Point')
data
plt.figure(figsize=(8,5))
plt.xlabel('Column a')
plt.ylabel('Column b')
plt.title('Scatter plot of x and y')
plt.scatter(data['a'],data['b'],c='r',marker='*')
for j in data.itertuples():
    plt.annotate(j.Index,(j.a,j.b),fontsize=15)
dist=pd.DataFrame(squareform(pdist(data[['a','b']],metric='euclidean')),
                  columns=data.index.values,
                  index=data.index.values)
dist
DENDROGRAM
plt.figure(figsize=(12,5))
plt.title('Dendrogram with Single linkage')
dend=shc.dendrogram(shc.linkage(data[['a','b']],method='single'),labels=data.index)
OUTPUT:
        a     b
Point
p1   0.30  0.96
p2   0.31  0.54
p3   0.64  0.88
p4   0.67  0.15
p5   0.25  0.70
<Figure size 800x500 with 0 Axes>
<Figure size 800x500 with 0 Axes>
p1 p2 p3 p4 p5
p1 0.000000 0.420119 0.349285 0.890505 0.264764
p2 0.420119 0.000000 0.473814 0.530754 0.170880
p3 0.349285 0.473814 0.000000 0.730616 0.429535
p4 0.890505 0.530754 0.730616 0.000000 0.692026
p5 0.264764 0.170880 0.429535 0.692026 0.000000
Dendrogram
DATE:
AIM: Apply the DBSCAN clustering algorithm on any dataset
Description/Theory:
Steps used in the DBSCAN algorithm:
Find all the neighbor points within eps and identify the core points, i.e., the points with more than MinPts neighbors.
For each core point, if it is not already assigned to a cluster, create a new cluster.
Find recursively all its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
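The core-point test in the first step can be sketched directly (a minimal, illustrative sketch with made-up points and parameters, separate from the procedure below):
import numpy as np
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8], [1.1, 0.9], [5.0, 5.0]])
eps, min_samples = 0.5, 3
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
neighbour_counts = (dists <= eps).sum(axis=1)                   # each point counts itself
core_mask = neighbour_counts >= min_samples
print("core points:\n", X[core_mask])                           # the isolated point (5, 5) is not a core point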
PROCEDURE/CODE:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
centers=[[0.5,2],[-1,-1],[1.5,-1]]
X,y=make_blobs(n_samples=100,centers=centers,cluster_std=0.5,random_state=0)
print(X,y)
db=DBSCAN(eps=0.4,min_samples=5)
db.fit(X)
labels=db.labels_
n_clusters_=len(set(labels))-(1 if -1 in labels else 0)
print("Estimated number of clusters: %d"%n_clusters_)
y_pred=db.fit_predict(X)
plt.figure(figsize=(6,4))
plt.scatter(X[:,0],X[:,1],c=y_pred,cmap='Paired')
plt.title("Clusters determined by DBSCAN")
plt.savefig("DBSCAN.jpg")
K-Means for the DBSCAN data:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
plt.scatter(X[:,0],X[:,1])
kmeans=KMeans(n_clusters=2)
kmeans.fit(X)
print("\nClusters:",kmeans.cluster_centers_)
print("\nLabels:",kmeans.labels_)
OUTPUT:
[[ 0.57747371  2.18908126]
 [ 1.1781908  -2.11170158]
 [ 0.53325861  2.15123595]
 [ 0.94780833 -0.97391746]
 [ 1.27223375 -0.99126042]
 [ 1.26638961  2.73467938]
 [ 0.58871307  1.79910953]
DBSCAN(eps=0.4)
Estimated number of clusters: 3
(cluster labels for the 100 samples, with values in {-1, 0, 1, 2}; -1 marks noise points)
KMeans(n_clusters=2)
Clusters: [[ 0.25333922 -0.89314775]
 [ 0.49210963  2.00254955]]