DWDM Lab Manual
import pandas as pd
#load the employee dataset and display it
data=pd.read_csv(r"D:\New folder\emp.csv")
data
Output
#x: all columns except the last (features); y: the last column (target)
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
print(x)
print(y)
Output
[['France' 'Male' 44.0 72000.0]
['Spain' 'Female' 27.0 48000.0]
['Germany' 'Male' 30.0 54000.0]
['Spain' 'MAle' 38.0 61000.0]
['Germany' 'Female' 40.0 nan]
['France' 'Female' 35.0 58000.0]
['Spain' 'Female' nan 52000.0]
['France' 'Male' 48.0 79000.0]
['Germany' 'Male' 50.0 83000.0]
['France' 'Male' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
Program
#x: columns 1 to 3 of the dataset; y: columns 0 and 4 (the first and last columns)
x=data.iloc[:,1:-1].values
y=data.iloc[:,[0,4]].values
print(x)
print(y)
Output
Original Data :
[[12, nan, 34], [10, 32, nan], [nan, 11, 20]]
Imputed Data :
[[12. 21.5 34. ]
[10. 32. 27. ]
[11. 11. 20. ]]
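The listing that produced the Original Data / Imputed Data output above is not included in this copy of the manual. The following is a minimal sketch that reproduces it, assuming scikit-learn's SimpleImputer with column-mean imputation (the variable names are illustrative):
import numpy as np
from sklearn.impute import SimpleImputer
#the same small matrix as in the output above, with missing entries
original = [[12, np.nan, 34], [10, 32, np.nan], [np.nan, 11, 20]]
print("Original Data :")
print(original)
#replace every missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print("Imputed Data :")
print(imputer.fit_transform(original))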
MODULE-2
Demonstrate the following data preprocessing tasks using python libraries.
Data Preprocessing:
Data preprocessing is a process of preparing the raw data and making it suitable for a machine learning
model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, we do not always come across clean, well-formatted data, and before doing any operation with the data it must be cleaned and put into a consistent format. Data preprocessing is used for exactly this purpose.
1.Dealing with the Categorical Data:
Categorical data:
Categorical data consists of variables whose values are labels drawn from a fixed set of categories (for example, country or gender) rather than numeric measurements.
One-hot Encoding:
In this method, each category is mapped to a binary vector of 1s and 0s denoting the presence or absence of that category. The number of new columns created equals the number of categories of the feature.
Program:
#loading the libraries
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#x must already hold the feature matrix, with the categorical column at index 3
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[3])],remainder='passthrough')
x=np.array(ct.fit_transform(x))
print(x)
Output
[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
[1.0 0.0 0.0 162597.7 151377.59 443898.53]
[0.0 1.0 0.0 153441.51 101145.55 407934.54]
[0.0 0.0 1.0 144372.41 118671.85 383199.62]
[0.0 1.0 0.0 142107.34 91391.77 366168.42]
[0.0 0.0 1.0 131876.9 99814.71 362861.36]
[1.0 0.0 0.0 134615.46 147198.87 127716.82]
[0.0 1.0 0.0 130298.13 145530.06 323876.68]
[0.0 0.0 1.0 120542.52 148718.95 311613.29]
[1.0 0.0 0.0 123334.88 108679.17 304981.62]
[0.0 1.0 0.0 101913.08 110594.11 229160.95]
[1.0 0.0 0.0 100671.96 91790.61 249744.55]
[0.0 1.0 0.0 93863.75 127320.38 249839.44]
[1.0 0.0 0.0 91992.39 135495.07 252664.93]
[0.0 1.0 0.0 119943.24 156547.42 256512.92]
[0.0 0.0 1.0 114523.61 122616.84 261776.23]
[1.0 0.0 0.0 78013.11 121597.55 264346.06]
[0.0 0.0 1.0 94657.16 145077.58 282574.31]
[0.0 1.0 0.0 91749.16 114175.79 294919.57]
[0.0 0.0 1.0 86419.7 153514.11 0.0]
[1.0 0.0 0.0 76253.86 113867.3 298664.47]
[0.0 0.0 1.0 78389.47 153773.43 299737.29]
[0.0 1.0 0.0 73994.56 122782.75 303319.26]
[0.0 1.0 0.0 67532.53 105751.03 304768.73]
[0.0 0.0 1.0 77044.01 99281.34 140574.81]
[1.0 0.0 0.0 64664.71 139553.16 137962.62]
[0.0 1.0 0.0 75328.87 144135.98 134050.07]
[0.0 0.0 1.0 72107.6 127864.55 353183.81]
[0.0 1.0 0.0 66051.52 182645.56 118148.2]
[0.0 0.0 1.0 65605.48 153032.06 107138.38]
[0.0 1.0 0.0 61994.48 115641.28 91131.24]
[0.0 0.0 1.0 61136.38 152701.92 88218.23]
[1.0 0.0 0.0 63408.86 129219.61 46085.25]
[0.0 1.0 0.0 55493.95 103057.49 214634.81]
[1.0 0.0 0.0 46426.07 157693.92 210797.67]
[0.0 0.0 1.0 46014.02 85047.44 205517.64]
[0.0 1.0 0.0 28663.76 127056.21 201126.82]
[1.0 0.0 0.0 44069.95 51283.14 197029.42]
[0.0 0.0 1.0 20229.59 65947.93 185265.1]
[1.0 0.0 0.0 38558.51 82982.09 174999.3]
[1.0 0.0 0.0 28754.33 118546.05 172795.67]
[0.0 1.0 0.0 27892.92 84710.77 164470.71]
[1.0 0.0 0.0 23640.93 96189.63 148001.11]
[0.0 0.0 1.0 15505.73 127382.3 35534.17]
[1.0 0.0 0.0 22177.74 154806.14 28334.72]
[0.0 0.0 1.0 1000.23 124153.04 1903.93]
[0.0 1.0 0.0 1315.46 115816.21 297114.46]
[1.0 0.0 0.0 0.0 135426.92 0.0]
[0.0 0.0 1.0 542.05 51743.15 0.0]
[1.0 0.0 0.0 0.0 116983.8 45173.06]]
Label Encoder:
In label encoding, each category is assigned an integer from 0 to N-1, where N is the number of categories of the feature. The assigned integers are arbitrary and do not reflect any order or relationship between the categories.
Program
#importing libraries
import pandas as pd
from sklearn import preprocessing
df=pd.read_csv(r"D:\New folder\iris.csv")
# label_encoder object knows
# how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'variety'.
df['variety']= label_encoder.fit_transform(df['variety'])
df['variety']
Output
0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: variety, Length: 150, dtype: int32
2. Feature Scaling:
When features have different value ranges, and even different measurement units, they are difficult to compare: what is a kilogram compared to a metre, or altitude compared to time? The answer to this problem is scaling, which transforms the data into new values that are easier to compare. StandardScaler, used below, standardizes each column by subtracting its mean and dividing by its standard deviation.
Program:
import pandas
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
df = pandas.read_csv("data.csv")
#standardize the Weight and Volume columns
X = df[['Weight', 'Volume']]
scaledX = scale.fit_transform(X)
print(scaledX)
Output:
[[-2.10389253 -1.59336644]
[-0.55407235 -1.07190106]
[-1.52166278 -1.59336644]
[-1.78973979 -1.85409913]
[-0.63784641 -0.28970299]
[ 0.15800719 -0.0289703 ]
[ 0.3046118 -0.0289703 ]
[-0.05142797 1.53542584]
[-0.72580918 -0.0289703 ]
[ 0.14962979 1.01396046]
[ 1.2219378 -0.0289703 ]
[ 0.5685001 1.01396046]
[ 0.3046118 1.27469315]
[ 0.51404696 -0.0289703 ]
[ 0.51404696 1.01396046]
[ 0.72348212 -0.28970299]
[ 0.8281997 1.01396046]
[ 1.81254495 1.01396046]
[ 0.96642691 -0.0289703 ]
[ 1.72877089 1.01396046]]
3.Write a program for splitting the data into training and testing
Training Data:
The training data is the largest subset of the original dataset and is used to train or fit the machine learning model: it is fed to the ML algorithm, which learns from it how to make predictions for the given task.
Testing Data:
This dataset evaluates the performance of the model and ensures that the model can generalize well with
the new or unseen dataset. The test dataset is another subset of original data, which is independent of the training
dataset.
Splitting:
For splitting the dataset, we can use the train_test_split function of scikit-learn.
Program:
#importing the packages
import numpy as np
from sklearn.model_selection import train_test_split
#splitting the data into training and testing
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)
Output
[[ 3 4]
[13 14]
[ 1 2]
[15 16]
[23 24]
[19 20]
[17 18]
[11 12]]
[[ 5 6]
[ 7 8]
[ 9 10]
[21 22]]
[1 0 0 1 0 0 1 0]
[1 0 1 1]
Module-3
1.Similarity and Dissimilarity Measures:
Similarity Measure:
Measures of similarity provide a numerical value indicating the strength of association between objects or variables. The degree to which the variables correspond with each other is usually expressed on a scale from 0 to 1, where 0 means no similarity and 1 means perfect similarity (identity).
Dissimilarity Measure:
A numerical measure of how different two data objects are. It ranges from 0 (the objects are alike) to ∞ (the objects are completely different).
Pearson's coefficient:
The Pearson coefficient is a type of correlation coefficient that represents the relationship between two
variables that are measured on the same interval or ratio scale. The Pearson coefficient is a measure of
the strength of the association between two continuous variables.
Program
Output
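The manual's program and output for the Pearson coefficient are not reproduced in this copy. As a reference, here is a minimal sketch using NumPy's corrcoef; the two example arrays are assumptions chosen only for illustration:
import numpy as np
#two example variables (hypothetical values, for illustration only)
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 4, 5, 7])
#np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
#is the Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print("Pearson coefficient :", r)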
Cosine similarity:
Cosine similarity is a metric used to measure the similarity of two vectors. Specifically, it
measures the similarity in the direction or orientation of the vectors ignoring differences in their
magnitude or scale. Both vectors need to be part of the same inner product space, meaning they must
produce a scalar through inner product multiplication. The similarity of two vectors is measured by the
cosine of the angle between them.
Program
import numpy as np
from numpy.linalg import norm
#defining two lists
A=np.array([2,1,2,3,3,9])
B=np.array([3,4,2,4,5,5])
print("A: ",A)
print("B: ",B)
#compute cosine similarity: dot product divided by the product of the norms
cosine=np.dot(A,B)/(norm(A)*norm(B))
print("cosine Similarity :",cosine)
Output
A: [2 1 2 3 3 9]
B: [3 4 2 4 5 5]
cosine Similarity : 0.849033391454699
Jaccard similarity:
Jaccard Similarity is a common proximity measurement used to compute the similarity between
two objects, such as two text documents. Jaccard similarity can be used to find the similarity between
two asymmetric binary vectors.
Program
import numpy as np
from scipy.spatial.distance import jaccard
A=np.array([1,0,0,1,1,1])
B=np.array([0,0,1,1,1,1])
d=jaccard(A,B)
print("distance:",d)
Output
distance: 0.4
Euclidean:
Euclidean distance is the straight-line distance between two points in Euclidean space. For two real-valued vectors it is the square root of the sum of the squared differences between their corresponding elements.
Program
Output
6.082762530298219
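The program that produced the value above is not included in this copy. Below is a minimal sketch; the two vectors are assumptions chosen because they reproduce the printed result (their squared differences sum to 37, and sqrt(37) ≈ 6.0828):
import numpy as np
#example vectors (assumed; not taken from the manual's listing)
row1 = np.array([10, 20, 15, 10, 5])
row2 = np.array([12, 24, 18, 8, 7])
#Euclidean distance = square root of the sum of squared differences
print(np.linalg.norm(row1 - row2))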
Minkowski Distance:
Minkowski distance calculates the distance between two real-valued vectors. It is a
generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the
“order” or “p“, that allows different distance measures to be calculated.
Program
Output
13.0
6.082762530298219
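The Minkowski program is likewise missing from this copy. A minimal sketch using SciPy's minkowski function follows; the example vectors are assumptions chosen to reproduce the two printed values (order p=1 gives the Manhattan distance 13.0, p=2 gives the Euclidean distance above):
from scipy.spatial.distance import minkowski
#example vectors (assumed; not taken from the manual's listing)
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
#p=1 -> Manhattan distance, p=2 -> Euclidean distance
print(minkowski(row1, row2, p=1))
print(minkowski(row1, row2, p=2))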
Manhattan Distance:
Manhattan distance is the sum of the absolute differences between the coordinates of a pair of points. Suppose we have two points P and Q; to determine the distance between them we add up the absolute differences of their coordinates along each axis, so for P at (x1, y1) and Q at (x2, y2) in a plane the distance is |x1 - x2| + |y1 - y2|.
Program:
#Manhattan distance as the sum of absolute coordinate differences
def manhattan_distance(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))
x1 = (1,2,3,4,5,6)
x2 = (10,20,30,1,2,3)
print(manhattan_distance(x1, x2))
Output:
63
Module-4
Build a model using linear regression algorithm on any dataset:
Linear Regression:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric
variables such as sales, salary, age, product price, etc.
Program:
import pandas as pd
import numpy as np
ds2 = pd.read_csv(r'C:\21761A05F6_PYTHON_SCRIPTS\SalaryInfo.csv')
print(ds2)
x = ds2.iloc[:,0].values
y = ds2.iloc[:,1].values
mean_x = np.mean(x)
mean_y = np.mean(y)
print("mean of Experience: ",mean_x)
print("mean of Experience: ",mean_y)
sum_add = 0
denom = 0
for (i,j) in zip(x,y):
sum_add = sum_add+((i-mean_x)*(j-mean_y))
denom = denom + (i-mean_x)*(i-mean_x)
w1_val = sum_add/denom
print("w1 Value is: ",w1_val)
w0_val = mean_y-(w1_val*mean_x)
print("w0 Value is: ",w0_val)
experience = int(input("Enter the Experience: "))
salary_predict = w0_val + w1_val*experience
print("Predicted Salary for ",experience," years experience is: ",salary_predict)
Output:
YEARS_EXPERIENCE SALARY(in $1000)
0 3 30
1 8 57
2 9 64
3 13 72
4 3 36
5 6 43
6 11 59
7 21 90
8 1 20
9 16 83
mean of Experience: 9.1
mean of Salary: 55.4
w1 Value is: 3.5374756199498467
w0 Value is: 23.208971858456394
Enter the Experience: 3
Predicted Salary for 3 years experience is: 33.821398718305936
Program:
#import required packages
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error,r2_score
#load the diabetes dataset, keep a single feature, and hold out the last 20 rows for testing
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:,np.newaxis,2]
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes_y[:-20], diabetes_y[-20:]
#fit a linear regression model and print its predictions on the test rows
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)
print(diabetes_y_pred)
Output:
[225.9732401 115.74763374 163.27610621 114.73638965 120.80385422
158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]
Plot:
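The figure itself is not reproduced here. The following is a minimal sketch of plotting code that would produce such a figure (a scatter of the test points with the fitted line); it assumes the variables from the program above, and the axis labels are illustrative additions:
#scatter of the actual test points and the fitted regression line
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=2)
plt.xlabel("BMI feature")
plt.ylabel("Disease progression")
plt.show()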
Module-5
Build a classification model using Decision Tree algorithm on iris dataset
Classification:
Classification in data mining is a common technique that assigns data points to predefined classes. It can be applied to data sets of all sorts, from small and simple to large and complex ones, and it relies on algorithms that learn a model from labelled training data.
Decision Tree Algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; such a
final node is called a leaf node.
Attribute Selection Measures:
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. The technique used to solve this problem is called an Attribute Selection Measure (ASM); with it, we can easily select the best attribute for each node of the tree. Two popular ASM techniques are Information Gain and the Gini Index.
Information Gain:
Information gain is the measurement of the change in entropy after the dataset is segmented on an attribute.
Entropy:
Entropy is a metric that measures the impurity in a given attribute; it specifies the randomness in the data. It can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where S is the set of samples, P(yes) is the probability of 'yes' and P(no) is the probability of 'no'. For example, with 9 'yes' and 5 'no' outcomes out of 14 samples, Entropy(S) = -(9/14)log2(9/14) - (5/14)log2(5/14) ≈ 0.94.
Program:
#importing libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
iris=load_iris()
#splitting data
x=iris.data
y=iris.target
X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.2,random_state=None)
tree_clf=DecisionTreeClassifier(max_depth=6)
tree_clf.fit(X_train,Y_train)
y_pred=tree_clf.predict(X_test)
#determining Accuracy
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy :",accuracy)
plt.figure(figsize=(15,10))
#plotting tree
plot_tree(tree_clf,filled=True,feature_names=iris.feature_names,class_names=iris.target_names)
plt.show()
Output
Accuracy : 1.0
Plot:
2. Diabetes dataset
Program
#loading libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
#load the Pima Indians diabetes dataset
pima=pd.read_csv(r"D:\New folder\diabetes.csv")
feature_cols=["Pregnancies","Glucose","BloodPressure","SkinThickness","DiabetesPedigreeFunction","BMI","Insulin","Age"]
X=pima[feature_cols]
y=pima.Outcome
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.3,random_state=None)
#fit a decision tree and evaluate it on the held-out rows
clf=DecisionTreeClassifier(max_depth=6)
clf=clf.fit(X_train,Y_train)
y_pred=clf.predict(X_test)
print("Accuracy :",metrics.accuracy_score(Y_test,y_pred))
Output
Accuracy:1.00000000000000
Module-6
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
Bayes Theorem Formula:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Program
#loading libraries
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
dataset=pd.read_csv(r"D:\New folder\iris.csv")
print(dataset)
dataset.info()
X=dataset.iloc[:,:4].values
Y=dataset['variety'].values
print(X)
print(Y)
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,random_state=None,test_size=0.3)
classifier=GaussianNB()
classifier.fit(X_train,Y_train)
print(X_test[0])
y_pred=classifier.predict(X_test)
print(y_pred)
accuracy=accuracy_score(Y_test,y_pred)
print("Accuracy :",accuracy)
df=pd.DataFrame({'Real Values':Y_test,'predicted_values':y_pred})
print(df)
Output
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica
Program:
#importing Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np
iris=load_iris()
X=iris.data
Y=iris.target
le=LabelEncoder()
Y=le.fit_transform(Y)
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3)
nb_model=GaussianNB()
#Fitting data
nb_model.fit(X_train,Y_train)
#predict the classes of the test data
Y_pred=nb_model.predict(X_test)
#calculating accuracy
accuracy=accuracy_score(Y_test,Y_pred)
print("Accuracy ",accuracy)
new_observation=np.array([[5.8,3.0,4.5,1.5]])
predicted_class=nb_model.predict(new_observation)
predicted_class=le.inverse_transform(predicted_class)
print("predicted class ",predicted_class)
Output:
Accuracy 1.0
predicted class [1]
#loading dataset
df=pd.read_csv(r"D:\New folder\playtennis.csv")
print(df)
df.value_counts()
#entropy of the Play attribute (9 yes, 5 no out of 14 samples)
Entropy_play=-(9/14)*np.log2(9/14)-(5/14)*np.log2(5/14)
print(Entropy_play)
#entropy of Play for Outlook = Sunny
df[df.Outlook=='Sunny']
Entropy_play_Outlook_Sunny=-(3/5)*np.log2(3/5)-(2/5)*np.log2(2/5)
print(Entropy_play_Outlook_Sunny)
#entropy of Play for Outlook = Overcast; the subset is pure, so the true entropy is 0,
#but 0*log2(0) evaluates to nan in NumPy, which is why nan is printed below
df[df["Outlook"]=='Overcast']
Entropy_play_outlook_overcast=-(0/4)*np.log2(0/4)-(4/4)*np.log2(4/4)
print(Entropy_play_outlook_overcast)
#entropy of Play for Outlook = Rain
df[df.Outlook=='Rain']
Entropy_play_Outlook_Rain=-(2/5)*np.log2(2/5)-(3/5)*np.log2(3/5)
Entropy_play_Outlook_Rain
#information gain of Outlook (the pure Overcast branch contributes 0)
Gain=Entropy_play-(5/14)*Entropy_play_Outlook_Sunny-(4/14)*0-(5/14)*Entropy_play_Outlook_Rain
Gain
Output
0.9402859586706311        #entropy of Play
0.9709505944546686        #entropy for Outlook = Sunny
nan                       #entropy for Outlook = Overcast (pure subset; the true entropy is 0)
0.9709505944546686        #entropy for Outlook = Rain
0.24674981977443933       #information gain of Outlook
Module-7
#importing libraries
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori,association_rules
data=[["milk","Diapers","Beer","Cola"],
["bread","Milk","Diapers","Beer","Cola","Eggs"],
["Beer","Milk","Bread","Diapers"],
["Beer","Milk","Cola"]
]
te=TransactionEncoder()
te_ary=te.fit(data).transform(data)
df=pd.DataFrame(te_ary,columns=te.columns_)
print(df)
frequent_itemsets=apriori(df,min_support=0.75,use_colnames=True)
print(frequent_itemsets)
itemsets=apriori(df,min_support=0.75)
print(itemsets)
rules=association_rules(frequent_itemsets,metric="confidence",min_threshold=0.7)
print(rules)
selected_columns=['antecedents','consequents','antecedent support','consequent support','support','confidence']
print(rules[selected_columns])
Output
confidence
0 0.75
1 1.00
2 0.75
3 1.00
4 0.75
5 1.00
Program:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('store_data.csv')
#build the list of transactions: one list of item names per row
transactions=[]
for i in range(0, 7500):
    transactions.append([str(dataset.values[i,j]) for j in range(0,20)])
#mine association rules with the Apriori algorithm
from apyori import apriori
rules= apriori(transactions=transactions, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)
results= list(rules)
print(len(results))
print(results)
#print each rule as "antecedent -> consequent"
for item in results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
Output
7
[RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004533333333333334,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}),
items_add=frozenset({'chicken'}), confidence=0.2905982905982906, lift=4.843304843304844)]),
RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}),
support=0.005733333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}),
items_add=frozenset({'escalope'}), confidence=0.30069930069930073, lift=3.7903273197390845)]),
RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005866666666666667,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}),
items_add=frozenset({'escalope'}), confidence=0.37288135593220345, lift=4.700185158809287)]),
RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.016,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}),
items_add=frozenset({'ground beef'}), confidence=0.3234501347708895, lift=3.2915549671393096)]),
RelationRecord(items=frozenset({'tomato sauce', 'ground beef'}), support=0.005333333333333333,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}),
items_add=frozenset({'ground beef'}), confidence=0.37735849056603776, lift=3.840147461662528)]),
RelationRecord(items=frozenset({'olive oil', 'whole wheat pasta'}), support=0.008,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'whole wheat pasta'}),
items_add=frozenset({'olive oil'}), confidence=0.2714932126696833, lift=4.130221288078346)]),
RelationRecord(items=frozenset({'shrimp', 'pasta'}), support=0.005066666666666666,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'shrimp'}),
confidence=0.3220338983050848, lift=4.514493901473151)])]
Module-8
1.Apply K- Means clustering algorithm on any dataset
K-Means Algorithm:
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they need not be taken from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
Program:
#K-Means Clustering:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
x = np.array([[1,1],[1.5,2],[3,4],[5,7],[3.5,5],[4.5,5],[3.5,4.5]])
print(x)
plt.scatter(x[:,0],x[:,1])
kmeans = KMeans(n_clusters=2)
kmeans.fit(x)
print("CLusters: ",kmeans.cluster_centers_)
print("Labels: ",kmeans.labels_)
plt.scatter(x[:,0],x[:,1],c=kmeans.labels_,cmap="rainbow")
Output:
[[1. 1. ]
[1.5 2. ]
[3. 4. ]
[5. 7. ]
[3.5 5. ]
[4.5 5. ]
[3.5 4.5]]
Clusters: [[1.25 1.5 ]
[3.9 5.1 ]]
Labels: [0 0 1 1 1 1 1]
Plot:
Module-9
1.Apply Hierarchical Clustering algorithm on any dataset
Hierarchical Clustering Algorithm:
Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA). The results of K-means clustering and hierarchical clustering may sometimes look similar, but the two differ in how they work; in particular, hierarchical clustering does not require the number of clusters to be fixed in advance as the K-Means algorithm does.
The hierarchical clustering technique has two approaches: agglomerative (bottom-up, repeatedly merging the closest clusters) and divisive (top-down, repeatedly splitting clusters).
Program:
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from scipy.spatial.distance import squareform,pdist
import pandas as pd
a = np.random.random_sample(size=5)
b = np.random.random_sample(size=5)
point = ['P1','P2','P3','P4','P5']
data = pd.DataFrame({'Point':point,'a':np.round(a,2),'b':np.round(b,2)})
data = data.set_index('Point')
print(data)
plt.figure(figsize=(8,5))
plt.xlabel('Column a')
plt.ylabel('Column b')
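The listing above stops after setting up the figure and never performs the clustering step itself. The following is a minimal sketch of the missing part; the single-linkage method is an assumption, not taken from the manual:
#plot the random points on the prepared figure
plt.scatter(data['a'], data['b'])
for p in data.index:
    plt.annotate(p, (data.loc[p,'a'], data.loc[p,'b']))
plt.show()
#agglomerative clustering: build the linkage matrix and draw the dendrogram
plt.figure(figsize=(8,5))
plt.title('Dendrogram')
shc.dendrogram(shc.linkage(data[['a','b']], method='single'), labels=data.index.tolist())
plt.show()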
Plots:
Module-10
1. Apply DBSCAN clustering algorithm on any dataset.
DBSCAN Clustering Algorithm:
DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density. It groups 'densely grouped' data points into clusters while leaving points in low-density regions marked as noise.
Program:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
centers = [[0.5,2],[-1,-1],[1.5,-1]]
x,y = make_blobs(n_samples=100,centers=centers,cluster_std=0.5,random_state=0)
print(x,y)
db = DBSCAN(eps=0.4,min_samples=5)
print(db.fit(x))
labels = db.labels_
n_clusters_ = len(set(labels))-(1 if -1 in labels else 0)
print("Estimated number of clusters: %d" %n_clusters_)
y_pred = db.fit_predict(x)
print(db.labels_)
plt.figure(figsize=(6,4))
plt.scatter(x[:,0],x[:,1],c=y_pred,cmap='Paired')
plt.title("Clusters determined by DBSCAN")
plt.savefig("DBSCAN.jpg")
Plot: