r20 Datamining Lab (2-2 Sem Lab)
LAB MANUAL
Regulation : R20
Branch : CAI
List of Experiments
1. Demonstrate the following data preprocessing tasks using python libraries: loading the dataset and dealing with missing data.
2. Demonstrate the following data preprocessing tasks using python libraries: dealing with categorical data, splitting the dataset and feature scaling.
3. Demonstrate the following similarity and dissimilarity measures using python:
a) Pearson’s Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance
4. Build a model using linear regression algorithm on any dataset.
5. Build a classification model using Decision Tree algorithm on iris dataset.
6. Apply Naïve Bayes Classification algorithm on any dataset.
7. Generate frequent itemsets using Apriori Algorithm in python and also generate association rules for any market basket data.
8. Apply K-Means clustering algorithm on any dataset.
9. Apply Hierarchical Clustering algorithm on any dataset.
10. Apply DBSCAN clustering algorithm on any dataset.
Step 8: Choose whether to register Anaconda as your default Python. Unless you
plan on installing and running multiple versions of Anaconda, or multiple versions
of Python, accept the default and leave this box checked.
Step 9: Click the Install button. If you want to watch the packages Anaconda is
installing, click Show Details.
Step 11: Optional: To install VS Code, click the Install Microsoft VS Code
button. After the install completes, click the Next button.
Step 12: After a successful installation you will see the “Thanks for installing
Anaconda” dialog box:
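To confirm the environment is ready for the experiments, a minimal check can be run from a Python prompt or Jupyter notebook:

import sys
print(sys.version)   # should report the Anaconda Python build

# Core libraries used throughout these experiments
import numpy, pandas, sklearn, matplotlib
print(numpy.__version__, pandas.__version__, sklearn.__version__, matplotlib.__version__)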
Experiment-1:
Aim: Demonstrate the following data preprocessing tasks using python libraries: loading the dataset, identifying the dependent and independent variables, and dealing with missing data.
Description:
Importing the pandas library and loading the dataset:
import pandas as pd
dataset = pd.read_csv("age_salary.csv")
print(dataset)
Note:
The ‘nan’ you see in some cells of the data frame denotes a missing field.
The dependent and independent values are stored in different arrays (X and Y). After imputation, a column vector such as Y can be flattened back to a 1-D array with
Y = Y.reshape(-1)
The scikit-learn library’s SimpleImputer class allows us to impute the missing fields in a
dataset with valid data. Here we use the default strategy for filling missing values, which is
the mean. The imputer cannot be applied to 1-D arrays, and since Y is a 1-D array it needs to
be converted to a compatible shape; the reshape function allows us to do this. The
fit_transform() method fits the imputer object and then transforms the array.
dataset.isnull().sum()   # count the missing values in each column
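A minimal sketch of the imputation described above, assuming the age column is the independent variable X and the salary column is the dependent variable Y:

import numpy as np
from sklearn.impute import SimpleImputer

# Assumed column layout: age (independent, X) and salary (dependent, Y)
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # default strategy: mean
X = imputer.fit_transform(X)

# The imputer needs 2-D input, so reshape Y before transforming and flatten it back afterwards
Y = imputer.fit_transform(Y.reshape(-1, 1)).reshape(-1)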
Experiment-2:
Aim: Demonstrate the following data preprocessing tasks using python libraries: dealing with categorical data, splitting the dataset into training and test sets, and feature scaling.
Description:
For example, consider the dataset below with two categorical features, nation and
purchased_item. Let us assume that the dataset is a record of how the age, salary and country of a
person determine whether an item is purchased or not. Thus purchased_item is the dependent factor,
and age, salary and nation are the independent factors.
It has 3 countries listed. In a larger dataset, these may be large groups of data. Since countries
don’t have a mathematical relation between them (unless we are considering known factors such as
size or population), coding them as plain numbers will not work, because one number would be less
than or greater than another. Dummy variables are the solution. Using one-hot encoding we create a
dummy variable for each category in the column and use binary (0/1) values for each dummy
variable. We do not need to create dummy variables for the feature purchased_item, as it has only 2
categories, yes or no.
dataset = pd.read_csv("dataset.csv")
X = dataset.iloc[:,[0,2,3]].values
Y = dataset.iloc[:,1].values
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
le_X = LabelEncoder()
X[:,0] = le_X.fit_transform(X[:,0])
ohe_X = OneHotEncoder(categorical_features = [0])
X = ohe_X.fit_transform(X).toarray()
Output
The first 3 columns are the dummy features representing Germany, India and Russia
respectively. The 1’s in each column indicate that the person belongs to that specific country.
le_Y = LabelEncoder()
Y = le_Y.fit_transform(Y)
Output:
To evaluate the model we also need to split the data into a training set and a test set.
Scikit-learn comes with a method called train_test_split to help us with this task; it splits X and Y into two subsets each.
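A minimal sketch of the split, assuming an 80/20 train/test ratio:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)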
The StandardScaler class from the scikit-learn library can help us scale the dataset.
from sklearn.preprocessing import StandardScaler
sc_y = StandardScaler()
Y_train = Y_train.reshape((len(Y_train), 1))
Y_train = sc_y.fit_transform(Y_train)
Y_train = Y_train.ravel()
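A similar sketch for scaling the independent variables, assuming X_train and X_test come from the split above:

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)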
Output: X_train before and after scaling
Experiment-3:
Aim: Demonstrate the following similarity and dissimilarity measures using python libraries:
a) Pearson’s Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance
Description:
Many data science techniques are based on measuring similarity and dissimilarity between
objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In
unsupervised learning, K-Means is a clustering method which uses the Euclidean distance to
compute the distance between the cluster centroids and their assigned data points.
Recommendation engines use neighborhood-based collaborative filtering methods which identify
an individual’s neighbors based on their similarity/dissimilarity to other users.
Similarity-based methods treat the objects with the highest values as the most similar, since a
higher value implies the objects are closer to each other.
a) Pearson’s Correlation
Correlation is a technique for investigating the relationship between two quantitative, continuous
variables, for example, age and blood pressure. Pearson’s correlation coefficient is a measure
of the strength and direction of a linear relationship. For two vectors x and y it is calculated as

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ( √Σ(xᵢ − x̄)² · √Σ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of x and y.
Pearson’s correlation can take a range of values from -1 to +1. A value of exactly +1 or -1
requires a perfectly linear relationship; two variables that merely tend to increase or decrease
together give a value between these extremes.
Source: Wikipedia
Source Code:
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed the random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# compute Pearson's correlation coefficient
corr, p_value = pearsonr(x, y)
print("Pearson's correlation: %.3f" % corr)

# plot x and y with a fitted regression line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
b) Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order to calculate
it, we divide the dot product of the vectors by the product of their lengths:

cos(θ) = (x · y) / (‖x‖ ‖y‖)

Two vectors pointing in the same direction have a cosine similarity of 1, orthogonal vectors have
a similarity of 0, and vectors pointing in opposite directions have a similarity of -1.
Source code:
We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for
a single sample.
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))
print('Cosine similarity: %.3f' % cos_sim)
c) Jaccard Similarity
Cosine similarity is for comparing two real-valued vectors, whereas Jaccard similarity is for comparing two binary vectors (sets): it is the size of the intersection divided by the size of the union of the two sets.
Source Code:
from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)
Distance based methods prioritize objects with the lowest values to detect similarity amongst
them.
d) Euclidean Distance
The Euclidean distance between two vectors is the straight-line distance: the square root of the
sum of the squared differences of their components.
Source Code:
from scipy.spatial import distance
dst = distance.euclidean(x, y)
print('Euclidean distance: %.3f' % dst)
e) Manhattan Distance
Different from the Euclidean distance is the Manhattan distance, also called ‘city block’ distance,
which sums the absolute differences between the components of two vectors. You can imagine this
metric as the distance between two points when you have to follow a street grid and cannot cut
through buildings.
Source Code:
from scipy.spatial import distance
dst = distance.cityblock(x, y)
print('Manhattan distance: %.3f' % dst)
Experiment-4:
Aim: Build a model using linear regression algorithm on any dataset.
Description:
In the code below a few modules are imported; here’s a breakdown of what they do:
SciPy – a collection of tools for statistics in python; scipy.stats is the submodule that provides the
regression analysis functions. Pandas, NumPy, Matplotlib and Seaborn handle the data frames,
numerical arrays and plotting.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
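The data-loading step is missing above; a minimal sketch, assuming the King County house-sales file is named kc_house_data.csv (the file name is an assumption):

df = pd.read_csv('kc_house_data.csv')  # assumed file name
df.head()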
Output:
id date price bedrooms bathrooms sqft_living sqft_lot
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080
df.isnull().any()
Output:
id False
date False
price False
bedrooms False
bathrooms False
sqft_living False
sqft_lot False
...
dtype: bool
Checking to see if any of our data has null values. If there were any, we’d drop or filter the null
values out.
df.dtypes
Output:
id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
...
dtype: object
df.describe()
Output:
price bedrooms bathrooms sqft_living
count 21613 21613 21613 21613
mean 540088.10 3.37 2.11 2079.90
std 367127.20 0.93 0.77 918.44
min 75000.00 0.00 0.00 290.00
25% 321950.00 3.00 1.75 1427.00
50% 450000.00 3.00 2.25 1910.00
75% 645000.00 4.00 2.50 2550.00
max 7700000.00 33.00 8.00 13540.00
We are working with a data set that contains 21,613 observations; the mean price is approximately
$540k, the median price is approximately $450k, and the average house’s living area is about 2,080 ft².
# Create two stacked subplots for the histograms
fig, (sqft, cost) = plt.subplots(2, 1, figsize=(8, 8))
sqft.hist(df.sqft_living, bins=80)
sqft.set_xlabel('Ft^2')
sqft.set_title("Histogram of House Square Footage")
cost.hist(df.price, bins=80)
cost.set_xlabel('Price ($)')
cost.set_title("Histogram of Housing Prices")
plt.tight_layout()
plt.show()
Output:
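The regression model itself is not built in the code above; a minimal sketch using scipy.stats.linregress, assuming sqft_living as the predictor and price as the target:

# Fit a simple linear regression: price as a function of living area
slope, intercept, r_value, p_value, std_err = stats.linregress(df['sqft_living'], df['price'])
print('price = %.2f * sqft_living + %.2f (R^2 = %.3f)' % (slope, intercept, r_value ** 2))

# Plot the fitted line over the data
plt.scatter(df['sqft_living'], df['price'], alpha=0.2)
plt.plot(df['sqft_living'], intercept + slope * df['sqft_living'], color='red')
plt.xlabel('Ft^2')
plt.ylabel('Price ($)')
plt.show()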
Experiment-5:
Aim: Build a classification model using Decision Tree algorithm on iris dataset.
Description:
On what basis should we make decisions?
In other words, what should we select as the yes-or-no questions used to classify our data? We
could take an educated guess (e.g. all mice with a weight over 5 pounds are obese). However, that
isn’t necessarily the best way to categorize our samples. What if we could use some kind of machine
learning algorithm to learn what questions to ask in order to do the best job at classifying our data?
That is the purpose behind decision tree models.
Source code:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed; io.StringIO serves the same purpose
from IPython.display import Image
from pydot import graph_from_dot_data
import pandas as pd
import numpy as np
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
In the proceeding section, we’ll attempt to build a decision tree classifier to determine the kind of
flower given its dimensions.
X.head()
Although decision trees can handle categorical data, we still encode the targets in terms of digits
(i.e. setosa=0, versicolor=1, virginica=2) in order to create a confusion matrix at a later point.
Fortunately, the pandas library provides a method for this very purpose.
y = pd.get_dummies(y)
Next, we create and train an instance of the DecisionTreeClassifier class. We provide the X and y values when fitting, because decision trees are a supervised learning algorithm; a sketch of this step follows.
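A sketch of the missing split-and-fit step (the 25% test size and the random_state are assumptions):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)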
Let’s see how our decision tree does when it’s presented with test data.
y_pred = dt.predict(X_test)
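The confusion_matrix import above is never used in the snippet; one way to use it is to convert the dummy columns back to class indices first (a sketch):

# Convert the one-hot rows back to class indices before comparing
species = np.argmax(y_test.values, axis=1)
predictions = np.argmax(y_pred, axis=1)
print(confusion_matrix(species, predictions))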
Experiment-6:
Aim: Apply Naïve Bayes Classification algorithm on any dataset.
Description:
Naive Bayes is a statistical classification technique based on Bayes’ theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and reliable
algorithm, with high accuracy and speed on large datasets.
Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of
other features. For example, a loan applicant is desirable or not depending on his/her income,
previous loan and transaction history, age, and location. Even if these features are
interdependent, these features are still considered independently. This assumption simplifies
computation, and that's why it is considered as naive. This assumption is called class conditional
independence.
Bayes’ theorem states that P(h|D) = P(D|h) · P(h) / P(D), where:
• P(h): the probability of hypothesis h being true (regardless of the data). This is known as the
prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence, or
the prior probability of the data.
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
• P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
Source Code:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then
we have scaled the feature variables.
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below
is the code for it:
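A sketch of this step, assuming the classifier object is named classifier:

# Fitting the Gaussian Naive Bayes classifier to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)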
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We
can also use other classifiers as per our requirement.
Output:
Now we will predict the test set result. For this, we will create a new predictor variable y_pred,
and will use the predict function to make the predictions.
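A sketch of the prediction step (using the classifier object from the sketch above):

# Predicting the Test set results
y_pred = classifier.predict(x_test)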
Output:
The above output shows the prediction vector y_pred and the real vector y_test. We can see that
some predictions are different from the real values; these are the incorrect predictions.
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and
65+25=90 correct predictions.
Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for
it:
Output:
In the above output we can see that the Naïve Bayes classifier has segregated the data points with
a fine boundary. The boundary is a Gaussian curve because we have used the GaussianNB classifier in our code.
Finally, we visualize the test set result. Below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Colour the plane according to the classifier's prediction (the decision regions)
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The above output is the final output for the test set data. As we can see, the classifier has
created a Gaussian-shaped boundary to divide the "purchased" and "not purchased" classes. There
are some wrong predictions, which we counted in the confusion matrix, but it is still a
pretty good classifier.
Experiment-7:
Aim: Generate frequent item sets using Apriori Algorithm in python and also generate
association rules for any market basket data.
Description:
Apriori is an algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the database and
extending them to larger and larger item sets as long as those item sets appear sufficiently often in
the database. The frequent item sets determined by Apriori can be used to determine association
rules which highlight general trends in the database: this has applications in domains such
as market basket analysis.
Source Code:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
## Use this to read data from the csv file on local system.
df = pd.read_csv('./data/retail_data.csv', sep=',')
These NaNs make it hard to read the table. Let’s find out how many unique items there actually are:
items = set()
for col in df.columns:
    items.update(df[col].dropna().unique())  # drop the NaN padding before collecting item names
print(items)
Out:
{'Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil', 'Diaper', 'Milk'}
Applying Apriori
The apriori module from the mlxtend library provides a fast and efficient apriori implementation.
It expects a one-hot encoded data frame with one boolean column per item and one row per
transaction. The call below is a sketch: ohe_df stands for that one-hot encoded frame (its
construction is not shown above) and the min_support value is an assumption.
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True, low_memory=False)
The output is a data frame with the support for each itemset.
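The association rules step named in the aim is not shown; a minimal sketch using the association_rules function imported above (the confidence threshold is an assumption):

# Generate association rules from the frequent itemsets
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())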
Experiment-8:
Aim: Apply K- Means clustering algorithm on any dataset.
Description:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and
so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group of points with similar properties.
Source Code:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
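The elbow-method code that produces the plot discussed below is not shown; a sketch, assuming the Annual Income and Spending Score columns (indices 3 and 4) are used as features:

x = dataset.iloc[:, [3, 4]].values  # assumed feature columns

# Compute the within-cluster sum of squares (WCSS) for k = 1..10
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss)
mtp.title('The Elbow Method')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()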
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at k = 5, so the number of clusters here will be 5.
The fitting step for k = 5 is sketched below.
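A sketch of training K-Means with the chosen k = 5 and visualising the clusters (colours are arbitrary):

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(x)

# Plot each cluster and the centroids
for i in range(5):
    mtp.scatter(x[y_kmeans == i, 0], x[y_kmeans == i, 1], s=100, label='Cluster %d' % (i + 1))
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()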
Experiment-9:
Aim: Apply Hierarchical Clustering algorithm on any dataset.
Description:
Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster
unlabeled data points. Like K-means clustering, hierarchical clustering also groups together the
data points with similar characteristics. In some cases the result of hierarchical and K-Means
clustering can be similar. Before implementing hierarchical clustering using Scikit-Learn, let's
first understand the theory behind hierarchical clustering.
Source Code:
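The data-loading step is not shown for this experiment; a sketch, assuming the same Mall_Customers dataset and feature columns as in Experiment 8:

import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Mall_Customers_data.csv')   # assumed file name, as in Experiment 8
X = dataset.iloc[:, [3, 4]].values                 # Annual Income, Spending Score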
# Agglomerative Hierarchical Clustering. We choose the Euclidean distance and the ward linkage
# method for our algorithm class
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')  # 'affinity' was renamed to 'metric' in newer scikit-learn
# Lets try to fit the hierarchical clustering algorithm to dataset X while creating the
# clusters vector that tells for each customer which cluster the customer belongs to.
y_hc=hc.fit_predict(X)
#5 Visualizing the clusters. This code is similar to k-means visualization code.
#We only replace the y_kmeans vector name to y_hc for the hierarchical clustering
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='cyan', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='magenta', label ='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()
Experiment-10:
Aim: Apply DBSCAN clustering algorithm on any dataset.
Description:
• DBSCAN algorithm steps, following the original research paper by Martin Ester et al. [1]
• The key concept of directly density-reachable points is used to classify the core and border points
of a cluster; this also helps us identify noise in the data.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("/content/drive/MyDrive/Mall_Customers.csv")
# importing the dataset
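The code that produced the outputs below is not shown; a sketch, assuming the Annual Income and Spending Score columns are selected as the two features:

print("Dataset shape:", data.shape)
print(data.isnull().values.any())        # check for missing values
x = data.iloc[:, [3, 4]].values          # assumed feature columns
print(x.shape)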
Output:
Dataset shape: (200, 5)
False
(200, 2)
from sklearn.neighbors import NearestNeighbors   # importing the library
neighb = NearestNeighbors(n_neighbors=2)          # creating an object of the NearestNeighbors class
nbrs = neighb.fit(x)                              # fitting the data to the object
distances, indices = nbrs.kneighbors(x)           # finding the nearest neighbours
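The k-distance plot and the DBSCAN fit itself are missing above; a minimal sketch (the eps and min_samples values are assumptions, normally read off the elbow of the k-distance plot):

# Sort the distances to the 2nd nearest neighbour; the "elbow" of this curve suggests a value for eps
distances = np.sort(distances[:, 1])
plt.plot(distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('2nd nearest-neighbour distance')
plt.show()

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=8, min_samples=4)    # assumed parameter values
labels = dbscan.fit_predict(x)

# Clusters plot: colour points by their DBSCAN label (-1 marks noise)
plt.scatter(x[:, 0], x[:, 1], c=labels, cmap='viridis')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('DBSCAN clusters')
plt.show()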
Output
Clusters plot