
LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

(AUTONOMOUS)

Department of Computer Science & Engineering

20CS58 Data Mining Using Python Lab

Name of the Student:

Registered Number:

Branch & Section:

Academic Year: 2022-23


LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING
(AUTONOMOUS)

CERTIFICATE

Certified that this is a bona fide record of the practical work

done in ______________________________________________________________Laboratory

by __________________________________ with Regd. No. ______________________________

of __________________B.Tech Course__________ ___________Semester in

______________________________________________________ Branch during the

Academic Year 2022-23

No. of Experiments held: ________

No. of Experiments Done: ________

2023 Signature of the Faculty

INTERNAL EXAMINER EXTERNAL EXAMINER


Module 1
1.Loading the dataset:
Dataset:
A dataset is a collection of data that is used to train the model. A dataset acts as an example to
teach the machine learning algorithm how to make predictions.
Program

import pandas as pd
data=pd.read_csv(r"D:\New folder\emp.csv")
data

Output

   Country  Gender   Age   Salary Purchased
0   France    Male  44.0  72000.0        No
1    Spain  Female  27.0  48000.0       Yes
2  Germany    Male  30.0  54000.0        No
3    Spain    MAle  38.0  61000.0        No
4  Germany  Female  40.0      NaN       Yes
5   France  Female  35.0  58000.0       Yes
6    Spain  Female   NaN  52000.0        No
7   France    Male  48.0  79000.0       Yes
8  Germany    Male  50.0  83000.0        No
9   France    Male  37.0  67000.0       Yes

2.Identifying dependent and Independent variables


Dependent Variables:
The values of this variable depend on other variables. It is the outcome that you're studying. It's
also known as the response variable, outcome variable, or left-hand variable. Statisticians commonly
denote it using Y. Traditionally, graphs place dependent variables on the vertical, or Y, axis.
Independent Variables:
The name helps you understand their role in statistical analysis. These variables are
independent: they stand alone, and other variables in the model do not influence them. The
researchers are not seeking to understand what causes the independent variables to change.
Program

x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
print(x)
print(y)
Output

[['France' 'Male' 44.0 72000.0]
['Spain' 'Female' 27.0 48000.0]
['Germany' 'Male' 30.0 54000.0]
['Spain' 'MAle' 38.0 61000.0]
['Germany' 'Female' 40.0 nan]
['France' 'Female' 35.0 58000.0]
['Spain' 'Female' nan 52000.0]
['France' 'Male' 48.0 79000.0]
['Germany' 'Male' 50.0 83000.0]
['France' 'Male' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

3.Accessing specific columns

Program

x=data.iloc[:,1:-1].values
y=data.iloc[:,[0,4]].values
print(x)
print(y)

Output

[['Male' 44.0 72000.0]
['Female' 27.0 48000.0]
['Male' 30.0 54000.0]
['MAle' 38.0 61000.0]
['Female' 40.0 nan]
['Female' 35.0 58000.0]
['Female' nan 52000.0]
['Male' 48.0 79000.0]
['Male' 50.0 83000.0]
['Male' 37.0 67000.0]]
[['France' 'No']
['Spain' 'Yes']
['Germany' 'No']
['Spain' 'No']
['Germany' 'Yes']
['France' 'Yes']
['Spain' 'No']
['France' 'Yes']
['Germany' 'No']
['France' 'Yes']]

4.Dealing with missing values


Missing Values:
Missing data is defined as the values or data that are not stored (or not present) for some
variable(s) in the given dataset. In the emp.csv data loaded earlier, for example, the 'Age' and 'Salary'
columns each contain a missing entry. In Pandas, missing values are usually represented by NaN,
which stands for Not a Number.
Handling Missing Values:
 Deleting the Missing Values.
 Imputing the Missing Values
Deleting the Missing Values:
This approach is not recommended; it is one of the quick and dirty techniques one can
use to deal with missing values. If the missing value is of the type Missing Not At Random
(MNAR), then it should not be deleted. A minimal deletion sketch is shown below.
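As a quick illustration of the deletion approach (a minimal sketch using pandas' dropna; the toy
values are made up to mirror the Age/Salary gaps in emp.csv, and this is not part of the lab program):

import numpy as np
import pandas as pd
df = pd.DataFrame({'Age': [44.0, 27.0, np.nan], 'Salary': [72000.0, np.nan, 54000.0]})
print(df.dropna())                 # drops every row containing any NaN
print(df.dropna(subset=['Age']))   # drops rows only where 'Age' is missing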
Imputing Missing Values:
This approach focuses on replacing the missing data with the mean or the most frequent value of
that feature.
Program
import numpy as np

# Importing the SimpleImputer class
from sklearn.impute import SimpleImputer

# Imputer object using the mean strategy and
# missing_values type for imputation
imputer = SimpleImputer(missing_values = np.nan,
                        strategy ='mean')

data = [[12, np.nan, 34], [10, 32, np.nan],
        [np.nan, 11, 20]]

print("Original Data : \n", data)

# Fitting the data to the imputer object
imputer = imputer.fit(data)

# Imputing the data
data = imputer.transform(data)

print("Imputed Data : \n", data)

Output

Original Data :
[[12, nan, 34], [10, 32, nan], [nan, 11, 20]]
Imputed Data :
[[12. 21.5 34. ]
[10. 32. 27. ]
[11. 11. 20. ]]

Module-2
Demonstrate the following data preprocessing tasks using python libraries.
Data Preprocessing:
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning
model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted
data, and before doing any operation with data, it is necessary to clean it and put it in a formatted way.
This is why we use data preprocessing tasks.
1.Dealing with the Categorical Data:
Categorical data:
Categorical data is statistical data consisting of variables whose values are divided into
categories.
One-hot Encoding:
In this method, each category is mapped to a vector of 1s and 0s denoting the
presence or absence of the feature. The number of vectors depends on the number of categories of
the feature.
Program:
#loading the libraries
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#x is assumed to already hold a feature matrix whose fourth column (index 3) is
#categorical; the output below corresponds to a startups-style dataset with a
#categorical 'State' column, not the emp.csv data loaded in Module 1
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[3])],remainder='passthrough')
x=np.array(ct.fit_transform(x))
print(x)

Output
[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
[1.0 0.0 0.0 162597.7 151377.59 443898.53]
[0.0 1.0 0.0 153441.51 101145.55 407934.54]
[0.0 0.0 1.0 144372.41 118671.85 383199.62]
[0.0 1.0 0.0 142107.34 91391.77 366168.42]
[0.0 0.0 1.0 131876.9 99814.71 362861.36]
[1.0 0.0 0.0 134615.46 147198.87 127716.82]
[0.0 1.0 0.0 130298.13 145530.06 323876.68]
[0.0 0.0 1.0 120542.52 148718.95 311613.29]
[1.0 0.0 0.0 123334.88 108679.17 304981.62]
[0.0 1.0 0.0 101913.08 110594.11 229160.95]
[1.0 0.0 0.0 100671.96 91790.61 249744.55]
[0.0 1.0 0.0 93863.75 127320.38 249839.44]
[1.0 0.0 0.0 91992.39 135495.07 252664.93]
[0.0 1.0 0.0 119943.24 156547.42 256512.92]
[0.0 0.0 1.0 114523.61 122616.84 261776.23]
[1.0 0.0 0.0 78013.11 121597.55 264346.06]
[0.0 0.0 1.0 94657.16 145077.58 282574.31]
[0.0 1.0 0.0 91749.16 114175.79 294919.57]
[0.0 0.0 1.0 86419.7 153514.11 0.0]
[1.0 0.0 0.0 76253.86 113867.3 298664.47]
[0.0 0.0 1.0 78389.47 153773.43 299737.29]
[0.0 1.0 0.0 73994.56 122782.75 303319.26]
[0.0 1.0 0.0 67532.53 105751.03 304768.73]
[0.0 0.0 1.0 77044.01 99281.34 140574.81]
[1.0 0.0 0.0 64664.71 139553.16 137962.62]
[0.0 1.0 0.0 75328.87 144135.98 134050.07]
[0.0 0.0 1.0 72107.6 127864.55 353183.81]
[0.0 1.0 0.0 66051.52 182645.56 118148.2]
[0.0 0.0 1.0 65605.48 153032.06 107138.38]
[0.0 1.0 0.0 61994.48 115641.28 91131.24]
[0.0 0.0 1.0 61136.38 152701.92 88218.23]
[1.0 0.0 0.0 63408.86 129219.61 46085.25]
[0.0 1.0 0.0 55493.95 103057.49 214634.81]
[1.0 0.0 0.0 46426.07 157693.92 210797.67]
[0.0 0.0 1.0 46014.02 85047.44 205517.64]
[0.0 1.0 0.0 28663.76 127056.21 201126.82]
[1.0 0.0 0.0 44069.95 51283.14 197029.42]
[0.0 0.0 1.0 20229.59 65947.93 185265.1]
[1.0 0.0 0.0 38558.51 82982.09 174999.3]
[1.0 0.0 0.0 28754.33 118546.05 172795.67]
[0.0 1.0 0.0 27892.92 84710.77 164470.71]
[1.0 0.0 0.0 23640.93 96189.63 148001.11]
[0.0 0.0 1.0 15505.73 127382.3 35534.17]
[1.0 0.0 0.0 22177.74 154806.14 28334.72]
[0.0 0.0 1.0 1000.23 124153.04 1903.93]
[0.0 1.0 0.0 1315.46 115816.21 297114.46]
[1.0 0.0 0.0 0.0 135426.92 0.0]
[0.0 0.0 1.0 542.05 51743.15 0.0]
[1.0 0.0 0.0 0.0 116983.8 45173.06]]

Label Encoder
In label encoding, each category is assigned an integer value from 0 through N-1, where N is the
number of categories for the feature. There is no relation or order implied by these assignments.

Program
#importing libraries
import pandas as pd
from sklearn import preprocessing
df=pd.read_csv(r"D:\New folder\iris.csv")
# label_encoder object knows
# how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'variety'.
df['variety']= label_encoder.fit_transform(df['variety'])

df['variety']

Output
0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: variety, Length: 150, dtype: int32

2.Feature Scaling:
When your data has different values, and even different measurement units, it can be difficult to
compare them. What is kilograms compared to meters? Or altitude compared to time? The answer to
this problem is scaling. We can scale data into new values that are easier to compare.
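A note on the program below: StandardScaler standardizes each column as z = (x - mean) / standard
deviation, so every feature ends up with zero mean and unit variance.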
Program:
import pandas
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
print(scaledX)
Output:
[[-2.10389253 -1.59336644]
[-0.55407235 -1.07190106]
[-1.52166278 -1.59336644]
[-1.78973979 -1.85409913]
[-0.63784641 -0.28970299]
[ 0.15800719 -0.0289703 ]
[ 0.3046118 -0.0289703 ]
[-0.05142797 1.53542584]
[-0.72580918 -0.0289703 ]
[ 0.14962979 1.01396046]
[ 1.2219378 -0.0289703 ]
[ 0.5685001 1.01396046]
[ 0.3046118 1.27469315]
[ 0.51404696 -0.0289703 ]
[ 0.51404696 1.01396046]
[ 0.72348212 -0.28970299]
[ 0.8281997 1.01396046]
[ 1.81254495 1.01396046]
[ 0.96642691 -0.0289703 ]
[ 1.72877089 1.01396046]]
3.Write a program for splitting the data into training and testing
Training Data:
The training data is the biggest (in size) subset of the original dataset, which is used to train or fit the
machine learning model. The training data is fed to the ML algorithm, which lets it learn how to make
predictions for the given task.
Testing Data:
This dataset evaluates the performance of the model and ensures that the model can generalize well to
new or unseen data. The test dataset is another subset of the original data, which is independent of the
training dataset.
Splitting:
For splitting the dataset, we can use the train_test_split function of scikit-learn.
Program:
#importing the packages
import numpy as np
from sklearn.model_selection import train_test_split
#splitting the data into training and testing
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)

Output
[[ 3 4]
[13 14]
[ 1 2]
[15 16]
[23 24]
[19 20]
[17 18]
[11 12]]
[[ 5 6]
[ 7 8]
[ 9 10]
[21 22]]
[1 0 0 1 0 0 1 0]
[1 0 1 1]

Module-3
1.Similarity and Dissimilarity Measures:
Similarity Measure:
Measures of similarity provide a numerical value which indicates the strength of association between
objects or variables. The extent to which the variables correspond with each other is usually indicated
between “0” and “1”, where “0” means no similarity (or exclusion) and “1” means perfect similarity (or identity).
Dissimilarity Measure:
A numerical measure of how different two data objects are, ranging from 0 (objects are alike) to ∞
(objects are different).

Pearson's coefficient:
The Pearson coefficient is a type of correlation coefficient that represents the relationship between two
variables measured on the same interval or ratio scale. The Pearson coefficient is a measure of
the strength of the association between two continuous variables.

Program

from scipy.stats import pearsonr


x=[-2,-1,0,1,2]
y=[4,1,3,2,0]
corr=pearsonr(x,y)
print('pearson correlation: ',corr)

Output

pearson correlation: PearsonRResult(statistic=-0.7000000000000001, pvalue=0.1881204043741873)

Cosine similarity:
Cosine similarity is a metric used to measure the similarity of two vectors. Specifically, it
measures the similarity in the direction or orientation of the vectors ignoring differences in their
magnitude or scale. Both vectors need to be part of the same inner product space, meaning they must
produce a scalar through inner product multiplication. The similarity of two vectors is measured by the
cosine of the angle between them.

Program

import numpy as np
from numpy.linalg import norm
#defining two arrays
A=np.array([2,1,2,3,3,9])
B=np.array([3,4,2,4,5,5])
print("A: ",A)
print("B: ",B)
#compute cosine similarity (note the parentheses around the product of the norms)
cosine=np.dot(A,B)/(norm(A)*norm(B))
print("cosine Similarity :",cosine)

Output

A: [2 1 2 3 3 9]
B: [3 4 2 4 5 5]
cosine Similarity : 0.8490334
Jaccard similarity:
Jaccard Similarity is a common proximity measurement used to compute the similarity between
two objects, such as two text documents. Jaccard similarity can be used to find the similarity between
two asymmetric binary vectors.

Program
import numpy as np
from scipy.spatial.distance import jaccard
A=np.array([1,0,0,1,1,1])
B=np.array([0,0,1,1,1,1])
d=jaccard(A,B)
print("distance:",d)

Output

distance: 0.4
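Note that scipy's jaccard function returns the Jaccard distance (a dissimilarity); the corresponding
Jaccard similarity here is 1 - 0.4 = 0.6.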

Euclidean:
The Euclidean distance between two points is the length of the straight-line segment connecting
them. For two real-valued vectors, it is the square root of the sum of the squared differences between
the corresponding elements.
Program

from sklearn.metrics.pairwise import euclidean_distances


x=[[0,1],[1,1]]
euclidean_distances(x,x)
#calculating euclidean distance between two vectors
from scipy.spatial.distance import euclidean
row1=[10,20,15,10,5]
row2=[12,24,18,8,7]
dist=euclidean(row1,row2)
print(dist)

Output

6.082762530298219
Minkowski Distance:
Minkowski distance calculates the distance between two real-valued vectors. It is a
generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the
“order” or “p”, that allows different distance measures to be calculated.
Program

from scipy.spatial import minkowski_distance


row1=[10,20,15,10,5]
row2=[12,24,18,8,7]
#calculate distance (p=1)
dist=minkowski_distance(row1,row2,1)
print(dist)
#calculate distance (p=2)
dist=minkowski_distance(row1,row2,2)
print(dist)

Output
13.0
6.082762530298219

Manhattan Distance:
This determines the absolute difference between a pair of coordinates. Suppose we have
two points P and Q; to determine the distance between these points we simply calculate the sum of
the absolute differences of their coordinates along each axis. In a plane with P at coordinate (x1, y1)
and Q at (x2, y2), the distance is |x1 - x2| + |y1 - y2|.
Program:
#cityblock from scipy computes the Manhattan (city block) distance
from scipy.spatial.distance import cityblock
x1 = (1,2,3,4,5,6)
x2 = (10,20,30,1,2,3)
print(cityblock(x1, x2))
Output:
63

Module-4
Build a model using linear regression algorithm on any dataset:
Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric
variables such as sales, salary, age, product price, etc.
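The first program below fits the line y = w0 + w1*x by ordinary least squares, using the closed-form
estimates w1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and w0 = ȳ - w1*x̄, where x̄ and ȳ are the means of x and y.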
Program:
import pandas as pd
import numpy as np
ds2 = pd.read_csv(r'C:\21761A05F6_PYTHON_SCRIPTS\SalaryInfo.csv')
print(ds2)
x = ds2.iloc[:,0].values
y = ds2.iloc[:,1].values
mean_x = np.mean(x)
mean_y = np.mean(y)
print("mean of Experience: ",mean_x)
print("mean of Salary: ",mean_y)
sum_add = 0
denom = 0
for (i,j) in zip(x,y):
    sum_add = sum_add+((i-mean_x)*(j-mean_y))
    denom = denom + (i-mean_x)*(i-mean_x)
w1_val = sum_add/denom
print("w1 Value is: ",w1_val)
w0_val = mean_y-(w1_val*mean_x)
print("w0 Value is: ",w0_val)
experience = int(input("Enter the Experience: "))
salary_predict = w0_val + w1_val*experience
print("Predicted Salary for ",experience," years experience is: ",salary_predict)

Output:
YEARS_EXPERIENCE SALARY(in $1000)
0 3 30
1 8 57
2 9 64
3 13 72
4 3 36
5 6 43
6 11 59
7 21 90
8 1 20
9 16 83
mean of Experience: 9.1
mean of Salary: 55.4
w1 Value is: 3.5374756199498467
w0 Value is: 23.208971858456394
Enter the Experience: 3
Predicted Salary for 3 years experience is: 33.821398718305936
Program:
#import required packages
import matplotlib.pyplot as pt
import numpy as np
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error,r2_score

#Building a linear regression model for diabetes dataset


diabetes_X,diabetes_Y = datasets.load_diabetes(return_X_y=True)
print("X is: \n",diabetes_X)
print("Y is: \n",diabetes_Y)
diabetes_X = diabetes_X[:,np.newaxis,2]

#split data into training and testing sets

#split the independent variables


diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

#split the dependent variables


diabetes_Y_train = diabetes_Y[:-20]
diabetes_Y_test = diabetes_Y[-20:]

#create an object for linear regression


regr = linear_model.LinearRegression()

#train the model using the training sets


regr.fit(diabetes_X_train,diabetes_Y_train)

#make predictions using the testing set


diabetes_Y_pred = regr.predict(diabetes_X_test)

print(diabetes_Y_pred)

#plotting the result


pt.scatter(diabetes_X_test,diabetes_Y_test,color='red')
pt.plot(diabetes_X_test,diabetes_Y_pred,color='green',linewidth=2,marker='o',markerfacecolor='yellow')
pt.grid()
pt.show()

#calculating the r2 score


print('R2 Score: %f' %r2_score(diabetes_Y_test,diabetes_Y_pred))

Output:
[225.9732401 115.74763374 163.27610621 114.73638965 120.80385422
158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]
Plot:
Module-5
Build a classification model using Decision Tree algorithm on iris dataset
Classification:
Classification in data mining is a common technique that separates data points into different
classes. It allows you to organize data sets of all sorts, including complex and large datasets as well as
small and simple ones. It primarily involves using algorithms that you can easily modify to improve the
data quality.
Decision Tree Algorithm:
 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain possible values for the best attribute.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in step-3.
Continue this process until a stage is reached where you cannot further classify the nodes;
the final node is then called a leaf node.
Attribute Selection Measures:
While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for
the nodes of the tree. The two popular techniques for ASM are Information Gain and the Gini Index;
Information Gain is used here:
Information Gain:

Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.

Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]

Entropy:
Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness
in the data. Entropy can be calculated as:
Entropy(S) = -P(yes)·log2 P(yes) - P(no)·log2 P(no)

Where,

o S = the total set of samples
o P(yes) = probability of yes
o P(no) = probability of no

Program:
#importing libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
iris=load_iris()
#splitting data

x=iris.data
y=iris.target

#splitting data into training and testing

X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.2,random_state=None)
tree_clf=DecisionTreeClassifier(max_depth=6)
tree_clf.fit(X_train,Y_train)
y_pred=tree_clf.predict(X_test)

#determining Accuracy

accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy :",accuracy)
plt.figure(figsize=(15,10))

#plotting tree

plot_tree(tree_clf,filled=True,feature_names=iris.feature_names,class_names=iris.target_names)
plt.show()

Output

Accuracy : 1.0

Plot:
2. Diabetes dataset

Program

#loading libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error,r2_score
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X=diabetes_X[:,np.newaxis,2]
diabetes_x_train=diabetes_X[:-20]
diabetes_x_test=diabetes_X[-20:]
diabetes_y_train=diabetes_y[:-20]
diabetes_y_test=diabetes_y[-20:]
regr=linear_model.LinearRegression()
regr.fit(diabetes_x_train,diabetes_y_train)
diabetes_y_pred=regr.predict(diabetes_x_test)
print(diabetes_y_pred)

#building a decision tree and finding Accuracy

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
pima=pd.read_csv(r"D:\New folder\diabetes.csv")
feature_cols=["Pregnancies","Glucose","BloodPressure","SkinThickness",
              "DiabetesPedigreeFunction","BMI","Insulin","Age"]
X=pima[feature_cols]
y=pima.Outcome
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.3,random_state=None)
clf=DecisionTreeClassifier(max_depth=6)
clf=clf.fit(X_train,Y_train)
y_pred=clf.predict(X_test)
print("Accuracy :",metrics.accuracy_score(Y_test,y_pred))

Output

[225.9732401 115.74763374 163.27610621 114.73638965 120.80385422


158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]

Accuracy:1.00000000000000
Module-6

1.Demonstrate the Naive Bayes algorithm using any dataset


Naive Bayes Algorithm:
The Naive Bayes algorithm is a supervised learning algorithm based on Bayes' theorem
and used for solving classification problems. It is mainly used in text classification with high-
dimensional training datasets. The Naive Bayes classifier is one of the simplest and most effective
classification algorithms; it helps in building fast machine learning models that can make quick
predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
Bayes Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
Bayes Theorem Formula:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
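As a quick numeric illustration of the formula (a minimal sketch with made-up probabilities, not part
of the lab exercise):

# assumed prior P(A)=0.3, likelihood P(B|A)=0.8, evidence P(B)=0.5
p_a, p_b_given_a, p_b = 0.3, 0.8, 0.5
p_a_given_b = (p_b_given_a * p_a) / p_b   # posterior via Bayes' theorem
print(p_a_given_b)                        # prints 0.48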

Program

#loading libraries
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset=pd.read_csv(r"D:\New folder\iris.csv")
print(dataset)
dataset.info()

#loading data into x and y

X=dataset.iloc[:,:4].values
Y=dataset['variety'].values
print(X)
print(Y)

#splitting the data into training and testing

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,random_state=None,test_size=0.3)
classifier=GaussianNB()
classifier.fit(X_train,Y_train)
print(X_test[0])
y_pred=classifier.predict(X_test)
print(y_pred)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy :",accuracy)

#making a dataframe of real and predicted values

df=pd.DataFrame({'Real Values':Y_test,'predicted_values':y_pred})
print(df)

Output
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica

[150 rows x 5 columns]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal.length 150 non-null float64
1 sepal.width 150 non-null float64
2 petal.length 150 non-null float64
3 petal.width 150 non-null float64
4 variety 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

[5.2 4.1 1.5 0.1]


[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
['Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
'Setosa' 'Setosa' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica' 'Virginica'
'Virginica' 'Virginica' 'Virginica']

[5.4 3.9 1.3 0.4]


['Setosa' 'Virginica' 'Virginica' 'Versicolor' 'Setosa' 'Setosa'
'Versicolor' 'Setosa' 'Versicolor' 'Versicolor' 'Setosa' 'Virginica'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Setosa' 'Versicolor'
'Setosa' 'Virginica' 'Setosa' 'Setosa' 'Versicolor' 'Setosa' 'Versicolor'
'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor' 'Versicolor'
'Versicolor' 'Virginica' 'Setosa' 'Setosa' 'Virginica' 'Virginica'
'Setosa' 'Virginica' 'Versicolor' 'Versicolor' 'Setosa' 'Virginica'
'Setosa' 'Versicolor' 'Versicolor']
Accuracy : 0.9333333333333333

Real Values predicted_values


0 Setosa Setosa
1 Virginica Virginica
2 Virginica Virginica
3 Versicolor Versicolor
4 Setosa Setosa
5 Setosa Setosa
6 Versicolor Versicolor
7 Setosa Setosa
8 Virginica Versicolor
9 Versicolor Versicolor
10 Setosa Setosa
11 Virginica Virginica
12 Versicolor Versicolor
13 Versicolor Versicolor
14 Virginica Versicolor
15 Versicolor Versicolor
16 Setosa Setosa
17 Versicolor Versicolor
18 Setosa Setosa
19 Virginica Virginica
20 Setosa Setosa
21 Setosa Setosa
22 Versicolor Versicolor
23 Setosa Setosa
24 Versicolor Versicolor
25 Versicolor Versicolor
26 Versicolor Versicolor
27 Versicolor Versicolor
28 Versicolor Versicolor
29 Versicolor Versicolor
30 Versicolor Versicolor
31 Virginica Virginica
32 Setosa Setosa
33 Setosa Setosa
34 Virginica Virginica
35 Virginica Virginica
36 Setosa Setosa
37 Virginica Virginica
38 Virginica Versicolor
39 Versicolor Versicolor
40 Setosa Setosa
41 Virginica Virginica
42 Setosa Setosa
43 Versicolor Versicolor
44 Versicolor Versicolor

Program:
#importing Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

#loading iris dataset

iris=load_iris()
X=iris.data
Y=iris.target
le=LabelEncoder()
Y=le.fit_transform(Y)

#splitting dataset into training and testing

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
nb_model=GaussianNB()

#Fitting data and predicting on the test set

nb_model.fit(X_train,Y_train)
Y_pred=nb_model.predict(X_test)

#calculating accuracy

accuracy=accuracy_score(Y_test,Y_pred)
print("Accuracy ",accuracy)
new_observation=np.array([[5.8,3.0,4.5,1.5]])
predicted_class=nb_model.predict(new_observation)
predicted_class=le.inverse_transform(predicted_class)
print("predicted class ",predicted_class)

Output:
Accuracy 1.0
predicted class [1]

2.Write a program for calculating the entropies of a dataset:


Entropy:
Entropy is the measure of uncertainty in the data. The effort is to reduce the entropy and
maximize the information gain. The feature providing the most information is considered the most
important by the algorithm and is used for training the model. By using information gain you are
actually using entropy.
Program:
#importing dependencies

from sklearn import tree


import pandas as pd
import numpy as np

#loading dataset

df=pd.read_csv(r"D:New folder\playtennis.csv")
print(df)
df.value_counts()

#calculating Entropy for Play Tennis and printing those records

Entropy_play=-(9/14)*np.log2(9/14)-(5/14)*np.log2(5/14)
print(Entropy_play)

#calculating Entropy for Outlook Sunny and printing those records

df[df.Outlook=='Sunny']
Entropy_play_Outlook_Sunny=-(3/5)*np.log2(3/5)-(2/5)*np.log2(2/5)
print(Entropy_play_Outlook_Sunny)

#calculating Entropy for Outlook Overcast and printing those records

df[df["Outlook"]=='Overcast']
#a pure node (all four Overcast records are 'Yes') has zero entropy;
#evaluating -(0/4)*np.log2(0/4) directly would produce nan
Entropy_play_outlook_overcast=0
print(Entropy_play_outlook_overcast)

#calculating the entropy for Outlook Rain and printing those records

df[df.Outlook=='Rain']
Entropy_play_Outlook_Rain=-(2/5)*np.log2(2/5)-(3/5)*np.log2(3/5)
Entropy_play_Outlook_Rain

#calculating the gain

Gain=Entropy_play-(5/14)*Entropy_play_Outlook_Sunny-(4/14)*0-(5/14)*Entropy_play_Outlook_Rain
Gain
Output

Outlook Temperature Humidity Wind Play Tennis


0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
5 Rain Cool Normal Strong No
6 Overcast Cool Normal Strong Yes
7 Sunny Mild High Weak No
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
13 Rain Mild High Strong No

#Entropy for play

0.9402859586706311

#Entropy for sunny

   Outlook Temperature Humidity    Wind Play Tennis
0    Sunny         Hot     High    Weak          No
1    Sunny         Hot     High  Strong          No
7    Sunny        Mild     High    Weak          No
8    Sunny        Cool   Normal    Weak         Yes
10   Sunny        Mild   Normal  Strong         Yes

0.9709505944546686

#Entropy for Overcast

     Outlook Temperature Humidity    Wind Play Tennis
2   Overcast         Hot     High    Weak         Yes
6   Overcast        Cool   Normal  Strong         Yes
11  Overcast        Mild     High  Strong         Yes
12  Overcast         Hot   Normal    Weak         Yes

0
#Entropy for rain

   Outlook Temperature Humidity    Wind Play Tennis
3     Rain        Mild     High    Weak         Yes
4     Rain        Cool   Normal    Weak         Yes
5     Rain        Cool   Normal  Strong          No
9     Rain        Mild   Normal    Weak         Yes
13    Rain        Mild     High  Strong          No

0.9709505944546686

#Gain

0.24674981977443933

Module-7

1.Apply Apriori algorithm on market basket data

Using our own data to apply the Apriori algorithm:

Apriori Algorithm:
The Apriori algorithm determines how two or more objects are related to one another. In other
words, we can say that the apriori algorithm is an association rule learning method that analyzes
whether people who bought product A also bought product B.
Support:
Support refers to the default popularity of any product. You find the support by dividing
the number of transactions comprising that product by the total number of transactions.
Confidence:
Confidence refers to the possibility that customers who bought biscuits also bought
chocolates. You need to divide the number of transactions that comprise both biscuits and
chocolates by the number of transactions comprising biscuits to get the confidence.
Lift:
Lift refers to the increase in the ratio of the sale of chocolates when you sell
biscuits.
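As a quick illustration of the three measures (a minimal sketch over a made-up four-transaction list,
not part of the lab program), computing them by hand for the rule biscuits -> chocolates:

transactions = [{"biscuits","chocolates"}, {"biscuits"}, {"chocolates"}, {"biscuits","chocolates"}]
n = len(transactions)
support_ab = sum({"biscuits","chocolates"} <= t for t in transactions) / n   # 2/4 = 0.5
support_a  = sum("biscuits" in t for t in transactions) / n                  # 3/4 = 0.75
support_b  = sum("chocolates" in t for t in transactions) / n                # 3/4 = 0.75
confidence = support_ab / support_a                                          # about 0.667
lift       = confidence / support_b                                          # about 0.889
print(support_ab, confidence, lift)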
Program:

#importing libraries

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori,association_rules
data=[["milk","Diapers","Beer","Cola"],
["bread","Milk","Diapers","Beer","Cola","Eggs"],
["Beer","Milk","Bread","Diapers"],
["Beer","Milk","Cola"]
]

#creating object for Transaction encoder

te=TransactionEncoder()
te_ary=te.fit(data).transform(data)
df=pd.DataFrame(te_ary,columns=te.columns_)
print(df)
frequent_itemsets=apriori(df,min_support=0.75,use_colnames=True)
print(frequent_itemsets)
itemsets=apriori(df,min_support=0.75)
print(itemsets)
rules=association_rules(frequent_itemsets,metric="confidence",min_threshold=0.7)
print(rules)

#Accessing specific columns in the rules

selected_columns=['antecedents','consequents','antecedent support',
                  'consequent support','support','confidence']
print(rules[selected_columns])

Output

Beer Bread Cola Diapers Eggs Milk bread milk


0 True False True True False False False True
1 True False True True True True True False
2 True True False True False True False False
3 True False True False False True False False
support itemsets
0 1.00 (Beer)
1 0.75 (Cola)
2 0.75 (Diapers)
3 0.75 (Milk)
4 0.75 (Beer, Cola)
5 0.75 (Diapers, Beer)
6 0.75 (Beer, Milk)
support itemsets
0 1.00 (0)
1 0.75 (2)
2 0.75 (3)
3 0.75 (5)
4 0.75 (0, 2)
5 0.75 (0, 3)
6 0.75 (0, 5)
antecedents consequents antecedent support consequent support support \
0 (Beer) (Cola) 1.00 0.75 0.75
1 (Cola) (Beer) 0.75 1.00 0.75
2 (Diapers) (Beer) 0.75 1.00 0.75
3 (Beer) (Diapers) 1.00 0.75 0.75
4 (Beer) (Milk) 1.00 0.75 0.75
5 (Milk) (Beer) 0.75 1.00 0.75

confidence lift leverage conviction zhangs_metric


0 0.75 1.0 0.0 1.0 0.0
1 1.00 1.0 0.0 inf 0.0
2 1.00 1.0 0.0 inf 0.0
3 0.75 1.0 0.0 1.0 0.0
4 0.75 1.0 0.0 1.0 0.0
5 1.00 1.0 0.0 inf 0.0

antecedents consequents antecedent support consequent support support \


0 (0) (2) 1.00 0.75 0.75
1 (2) (0) 0.75 1.00 0.75
2 (0) (3) 1.00 0.75 0.75
3 (3) (0) 0.75 1.00 0.75
4 (0) (5) 1.00 0.75 0.75
5 (5) (0) 0.75 1.00 0.75

confidence
0 0.75
1 1.00
2 0.75
3 1.00
4 0.75
5 1.00

Apriori Program on Market Basket Data:

Program:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('store_data.csv')
transactions=[]
for i in range(0, 7500):
    transactions.append([str(dataset.values[i,j]) for j in range(0,20)])
from apyori import apriori
rules= apriori(transactions= transactions, min_support=0.0045, min_confidence = 0.2,
               min_lift=3, min_length=2, max_length=2)
results= list(rules)
print(len(results))
print(results)
for item in results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Output

7
[RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004533333333333334,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'chicken'}), confidence=0.2905982905982906, lift=4.843304843304844)]),
RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}),
support=0.005733333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}),
items_add=frozenset({'escalope'}), confidence=0.30069930069930073, lift=3.7903273197390845)]),
RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005866666666666667,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}),
items_add=frozenset({'escalope'}), confidence=0.37288135593220345, lift=4.700185158809287)]),
RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.016,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}),
items_add=frozenset({'ground beef'}), confidence=0.3234501347708895, lift=3.2915549671393096)]),
RelationRecord(items=frozenset({'tomato sauce', 'ground beef'}), support=0.005333333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}),
items_add=frozenset({'ground beef'}), confidence=0.37735849056603776, lift=3.840147461662528)]),
RelationRecord(items=frozenset({'olive oil', 'whole wheat pasta'}), support=0.008,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'whole wheat pasta'}),
items_add=frozenset({'olive oil'}), confidence=0.2714932126696833, lift=4.130221288078346)]),
RelationRecord(items=frozenset({'shrimp', 'pasta'}), support=0.005066666666666666,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'shrimp'}),
confidence=0.3220338983050848, lift=4.514493901473151)])]

Rule: chicken -> light cream


Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
=====================================
Rule: escalope -> mushroom cream sauce
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
=====================================
Rule: escalope -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
=====================================
Rule: ground beef -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
=====================================
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
=====================================
Rule: olive oil -> whole wheat pasta
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
=====================================
Rule: shrimp -> pasta
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
=====================================

Module-8
1.Apply K-Means clustering algorithm on any dataset
K-Means Algorithm:
K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, then go to step-4; else go to FINISH.
Step-7: The model is ready.
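A note on the program below: sklearn's KMeans class performs steps 2-6 internally (using the
k-means++ centroid initialization by default), so only the number of clusters has to be supplied.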
Program:
#K-Means Clustering:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
x = np.array([[1,1],[1.5,2],[3,4],[5,7],[3.5,5],[4.5,5],[3.5,4.5]])
print(x)

plt.scatter(x[:,0],x[:,1])

kmeans = KMeans(n_clusters=2)
kmeans.fit(x)

print("CLusters: ",kmeans.cluster_centers_)
print("Labels: ",kmeans.labels_)

plt.scatter(x[:,0],x[:,1],c=kmeans.labels_,cmap="rainbow")

Output:
[[1. 1. ]
[1.5 2. ]
[3. 4. ]
[5. 7. ]
[3.5 5. ]
[4.5 5. ]
[3.5 4.5]]
Clusters: [[1.25 1.5 ]
[3.9 5.1 ]]
Labels: [0 0 1 1 1 1 1]

Plot:
Module-9
1.Apply Hierarchical Clustering algorithm on any dataset
Hierarchical Clustering Algorithm:
Hierarchical clustering is another unsupervised machine learning algorithm, used to
group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis, or
HCA. Sometimes the results of K-means clustering and hierarchical clustering may look similar, but
they differ in how they work. Unlike the K-Means algorithm, there is no requirement to predetermine
the number of clusters.
The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by
taking all data points as single clusters and merges them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down
approach.

Program:
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from scipy.spatial.distance import squareform,pdist
import pandas as pd
a = np.random.random_sample(size=5)
b = np.random.random_sample(size=5)
point = ['P1','P2','P3','P4','P5']
data = pd.DataFrame({'Point':point,'a':np.round(a,2),'b':np.round(b,2)})
data = data.set_index('Point')
print(data)
plt.figure(figsize=(8,5))
plt.xlabel('Column a')
plt.ylabel('Column b')

plt.title('Scatter plot of x and y')


plt.scatter(data['a'],data['b'],c='r',marker='*')
for j in data.itertuples():
    plt.annotate(j.Index,(j.a,j.b),fontsize=15)
dist=pd.DataFrame(squareform(pdist(data[['a','b']],'euclidean')),
                  columns=data.index.values,index=data.index.values)
plt.figure(figsize=(12, 5))
plt.title('Dendrogram with Single linkage')
dend = shc.dendrogram(shc.linkage(data[['a','b']],method='single'),labels=data.index)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2,affinity='euclidean',linkage='single')
print(cluster.fit_predict(data))
Output:
          a     b
Point
P1     0.86  0.13
P2     0.27  0.04
P3     0.89  0.65
P4     0.05  0.43
P5     0.44  0.05
[0 0 1 0 0]

Plots:
Module-10
1. Apply DBSCAN clustering algorithm on any dataset.
DBSCAN Clustering Algorithm:
DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are
dense regions in space separated by regions of lower density. It groups 'densely grouped' data
points into a single cluster.
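A note on the two parameters used in the program below: eps sets the radius of the neighbourhood
around each point, and min_samples sets the minimum number of points required within that radius for
a point to count as a core point; points that belong to no dense region are labelled -1 (noise).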
Program:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

centers = [[0.5,2],[-1,-1],[1.5,-1]]
x,y = make_blobs(n_samples=100,centers=centers,cluster_std=0.5,random_state=0)
print(x,y)

db = DBSCAN(eps=0.4,min_samples=5)
print(db.fit(x))

labels = db.labels_
n_clusters_ = len(set(labels))-(1 if -1 in labels else 0)
print("Estimated number of clusters: %d" %n_clusters_)
y_pred = db.fit_predict(x)

print(db.labels_)
plt.figure(figsize=(6,4))
plt.scatter(x[:,0],x[:,1],c=y_pred,cmap='Paired')
plt.title("Clusters determined by DBSCAN")
plt.savefig("DBSCAN.jpg")

from sklearn.cluster import KMeans

plt.scatter(x[:,0],x[:,1])
kmeans = KMeans(n_clusters=2)
kmeans.fit(x)

print("Clusters: ",kmeans.cluster_centers_)
print("Labels: ",kmeans.labels_)

plt.title("K-Means for DBSCAN Algorithm")


plt.scatter(x[:,0],x[:,1],c=kmeans.labels_,cmap="Paired")
plt.show()
plt.savefig("KMeansDBSCAN.jpg")
Output:
[[ 0.57747371 2.18908126]
[ 1.1781908 -2.11170158]
[ 0.53325861 2.15123595]
[ 0.94780833 -0.97391746]
[ 1.27223375 -0.99126042]
[ 1.26638961 2.73467938]
[ 0.58871307 1.79910953]
[ 1.18207696 -0.66178335]
[-0.41061021 -1.08996242]
[-0.3531351 2.9753877 ]
[ 2.0940149 -0.84152869]
[-1.45364918 -0.9740273 ]
[-0.66385262 -0.79626908]
[ 1.45077374 -1.33173914]
[ 2.58161797 -0.33173603]
[ 0.57202179 2.72713675]
[ 1.4658792 -0.14332864]
[-0.69296031 -0.53889666]
[ 1.78829541 -1.10414938]
[-1.34728393 -1.07481727]
[-1.38495804 -0.7303754 ]
[ 0.24459743 1.40968391]
[-0.89586251 -0.51168048]
[-0.79882918 -1.34240505]
[-0.8218168 -0.64671342]
[ 0.9322181 1.62891749]
[-1.20158847 -0.38877746]
[ 1.433779 1.51136106]
[-1.58257492 -0.54958676]
[-0.05842465 -1.67387953]
[ 1.11514534 2.60118992]
[ 1.63487731 1.27281716]
[-1.5865617 -0.02818941]
[ 1.81261573 -1.80102883]
[ 1.88589528 -0.58824792]
[ 0.99989233 -1.77238555]
[ 0.85357155 -0.86647457]
[ 0.09342686 1.1368587 ]
[ 0.98287858 -0.65920274]
[ 2.69157239 -0.52776026]
[-0.77649491 2.3268093 ]
[ 2.06331796 -1.53996575]
[ 0.84204629 -1.2307923 ]
[ 1.96042941 -0.84063617]
[-0.85088091 -0.33680705]
[ 0.98936899 3.1204466 ]
[-0.2558739 -0.05205541]
[ 0.97504421 1.9243214 ]
[-0.99474999 -0.10706475]
[ 1.69800336 -1.54653075]
[ 0.32604393 2.07817448]
[-0.43029966 -1.61741291]
[ 0.30633659 1.84884862]
[ 1.38202617 2.2000786 ]
[ 0.65653385 1.57295213]
[ 1.48035859 -1.58404675]
[-0.76716878 -1.76812184]
[ 1.09829517 -1.34477489]
[ 0.88728224 -0.57781851]
[ 1.92841531 -1.3255128 ]
[-1.21757678 -0.07536814]
[-0.06622052 -0.54697767]
[-1.20680949 -1.37372741]
[-1.6352425 -0.51530165]
[-0.31509917 2.23139113]
[ 0.18283895 1.81862942]
[ 0.48590889 2.21416594]
[ 1.12762259 -1.41321927]
[-1.13400169 -0.5987718 ]
[-0.02427648 1.28999103]
[ 0.05226672 2.19345125]
[-0.03852899 -0.2597426 ]
[ 0.16376978 1.82022342]
[ 1.76163833 -1.08577317]
[ 0.44839057 2.20529925]
[-0.30694892 1.89362986]
[-0.12639768 2.38874518]
[-1.43061284 -0.04496752]
[ 0.05610713 1.00960177]
[-0.81178723 -1.5497004 ]
[-1.33716633 -0.98408472]
[ 1.58333675 -0.68248428]
[ 1.9747104 -0.95622438]
[ 0.88051886 2.06083751]
[ 1.32300304 -1.68747565]
[ 0.92626567 -1.21891002]
[ 0.24517391 1.78096285]
[ 1.25098377 -0.03523397]
[ 0.72193162 2.16683716]
[ 1.1302185 -0.2284927 ]
[ 1.24703954 1.89742087]
[-1.15577627 -0.97191733]
[-1.43539857 -1.28942483]
[ 0.7543712 -0.78030415]
[ 0.52287926 1.90640807]
[-1.53537631 -0.47277414]
[ 1.04358889 -0.44149186]
[-0.93654395 -0.79900532]
[-0.52637402 -1.07750505]
[-0.63545472 -0.93550854]] [0 2 0 2 2 0 0 2 1 0 2 1 1 2 2 0 2 1 2 1 1 0 1 1 1 0 1 0 1 1 0 0 1 2 2 2 2
 0 2 2 0 2 2 2 1 0 1 0 1 2 0 1 0 0 0 2 1 2 2 2 1 1 1 1 0 0 0 2 1 0 0 1 0 2
0 0 0 1 0 1 1 2 2 0 2 2 0 2 0 2 0 1 1 2 0 1 2 1 1 1]
DBSCAN(eps=0.4)
Estimated number of clusters: 3
[ 0 -1 0 1 1 -1 0 1 2 -1 1 2 2 1 -1 -1 -1 2 1 2 2 0 2 2
2 0 2 -1 2 -1 -1 -1 2 1 1 1 1 -1 1 -1 -1 1 1 1 2 -1 -1 0
2 1 0 -1 0 -1 0 1 -1 1 1 1 2 -1 2 2 0 0 0 1 2 -1 0 -1
0 1 0 -1 0 2 -1 -1 2 1 1 0 1 1 0 -1 0 1 0 2 2 1 0 2
1 2 2 2]
Clusters: [[ 0.49210963 2.00254955]
[ 0.25333922 -0.89314775]]
Labels: [0 1 0 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1
 0 1 1 0 1 1 1 1 0 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1
0 0 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1]

Plot:
