r20 Datamining Lab (2-2 Sem Lab)
LAB MANUAL
Regulation : R20
Branch : CAI
List of Experiments
1. Demonstrate the following data preprocessing tasks using python libraries: loading the dataset and dealing with missing data.
2. Demonstrate the following data preprocessing tasks using python libraries: dealing with categorical data, splitting the dataset and feature scaling.
3. Demonstrate the following similarity and dissimilarity measures using python:
a) Pearson’s Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance
4. Build a model using linear regression algorithm on any dataset.
5. Build a classification model using Decision Tree algorithm on iris dataset.
6. Apply Naïve Bayes Classification algorithm on any dataset.
7. Generate frequent itemsets using Apriori Algorithm in python and also generate association rules for any market basket data.
8. Apply K-Means clustering algorithm on any dataset.
9. Apply Hierarchical Clustering algorithm on any dataset.
10. Apply DBSCAN clustering algorithm on any dataset.
Step 8: Choose whether to register Anaconda as your default Python. Unless you
plan on installing and running multiple versions of Anaconda, or multiple versions
of Python, accept the default and leave this box checked.
Step 9: Click the Install button. If you want to watch the packages Anaconda is
installing, click Show Details.
Step 11: Optional: To install VS Code, click the Install Microsoft VS Code
button. After the install completes, click the Next button.
Step 12: After a successful installation you will see the “Thanks for installing
Anaconda” dialog box:
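To confirm the environment is ready for the experiments, a minimal check can be run from a Python prompt or Jupyter notebook:

import sys
print(sys.version)   # should report the Anaconda Python build

# Core libraries used throughout these experiments
import numpy, pandas, sklearn, matplotlib
print(numpy.__version__, pandas.__version__, sklearn.__version__, matplotlib.__version__)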
Experiment-1:
Aim: Demonstrate the following data preprocessing tasks using python libraries: loading the dataset, identifying the dependent and independent variables, and dealing with missing data.
Description:
Importing the pandas library and loading the dataset:
import pandas as pd
dataset = pd.read_csv("age_salary.csv")
print(dataset)
Note:
The ‘nan’ you see in some cells of the data frame denotes a missing field.
The dependent and independent values are stored in different arrays (X and Y). After imputation, a column vector such as Y can be flattened back to a 1-D array with
Y = Y.reshape(-1)
The scikit-learn library’s SimpleImputer class allows us to impute the missing fields in a
dataset with valid data. Here we use the default strategy for filling missing values, which is
the mean. The imputer cannot be applied to 1-D arrays, and since Y is a 1-D array it needs to
be converted to a compatible shape; the reshape function allows us to do this. The
fit_transform() method fits the imputer object and then transforms the array.
dataset.isnull().sum()   # count the missing values in each column
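A minimal sketch of the imputation described above, assuming the age column is the independent variable X and the salary column is the dependent variable Y:

import numpy as np
from sklearn.impute import SimpleImputer

# Assumed column layout: age (independent, X) and salary (dependent, Y)
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # default strategy: mean
X = imputer.fit_transform(X)

# The imputer needs 2-D input, so reshape Y before transforming and flatten it back afterwards
Y = imputer.fit_transform(Y.reshape(-1, 1)).reshape(-1)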
Experiment-2:
Aim: Demonstrate the following data preprocessing tasks using python libraries: dealing with categorical data, splitting the dataset into training and test sets, and feature scaling.
Description:
For example, consider the dataset below with two categorical features, nation and
purchased_item. Let us assume that the dataset is a record of how the age, salary and country of a
person determine whether an item is purchased or not. Thus purchased_item is the dependent factor,
and age, salary and nation are the independent factors.
It has 3 countries listed. In a larger dataset, these may be large groups of data. Since countries
don’t have a mathematical relation between them (unless we are considering known factors such as
size or population), coding them as plain numbers will not work, because one number would be less
than or greater than another. Dummy variables are the solution. Using one-hot encoding we create a
dummy variable for each category in the column and use binary (0/1) values for each dummy
variable. We do not need to create dummy variables for the feature purchased_item, as it has only 2
categories, yes or no.
dataset = pd.read_csv("dataset.csv")
X = dataset.iloc[:,[0,2,3]].values
Y = dataset.iloc[:,1].values
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
le_X = LabelEncoder()
X[:,0] = le_X.fit_transform(X[:,0])
ohe_X = OneHotEncoder(categorical_features = [0])
X = ohe_X.fit_transform(X).toarray()
Output
The first 3 columns are the dummy features representing Germany, India and Russia
respectively. The 1’s in each column indicate that the person belongs to that specific country.
le_Y = LabelEncoder()
Y = le_Y.fit_transform(Y)
Output:
To evaluate the model we also need to split the data into a training set and a test set.
Scikit-learn comes with a method called train_test_split to help us with this task; it splits X and Y into two subsets each.
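A minimal sketch of the split, assuming an 80/20 train/test ratio:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)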
The StandardScaler class from the scikit-learn library can help us scale the dataset.
from sklearn.preprocessing import StandardScaler
sc_y = StandardScaler()
Y_train = Y_train.reshape((len(Y_train), 1))
Y_train = sc_y.fit_transform(Y_train)
Y_train = Y_train.ravel()
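A similar sketch for scaling the independent variables, assuming X_train and X_test come from the split above:

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)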
Output: X_train before and after scaling
Experiment-3:
Aim: Demonstrate the following similarity and dissimilarity measures using python libraries:
a) Pearson’s Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance
Description:
Many data science techniques are based on measuring similarity and dissimilarity between
objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In
unsupervised learning, K-Means is a clustering method which uses the Euclidean distance to
compute the distance between the cluster centroids and their assigned data points.
Recommendation engines use neighborhood-based collaborative filtering methods which identify
an individual’s neighbors based on their similarity/dissimilarity to other users.
Similarity-based methods treat the objects with the highest values as the most similar, since a
higher value implies the objects are closer to each other.
a) Pearson’s Correlation
Correlation is a technique for investigating the relationship between two quantitative, continuous
variables, for example, age and blood pressure. Pearson’s correlation coefficient is a measure
of the strength and direction of a linear relationship. For two vectors x and y it is calculated as

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ( √Σ(xᵢ − x̄)² · √Σ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of x and y.
Pearson’s correlation can take a range of values from -1 to +1. A value of exactly +1 or -1
requires a perfectly linear relationship; two variables that merely tend to increase or decrease
together give a value between these extremes.
Source: Wikipedia
Source Code:
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed the random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# compute Pearson's correlation coefficient
corr, p_value = pearsonr(x, y)
print("Pearson's correlation: %.3f" % corr)

# plot x and y with a fitted regression line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
b) Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order to calculate
it, we divide the dot product of the vectors by the product of their lengths:

cos(θ) = (x · y) / (‖x‖ ‖y‖)

Two vectors pointing in the same direction have a cosine similarity of 1, orthogonal vectors have
a similarity of 0, and vectors pointing in opposite directions have a similarity of -1.
Source code:
We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for
a single sample.
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))
print('Cosine similarity: %.3f' % cos_sim)
c) Jaccard Similarity
Cosine similarity is for comparing two real-valued vectors, whereas Jaccard similarity is for comparing two binary vectors (sets): it is the size of the intersection divided by the size of the union of the two sets.
Source Code:
from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)
Distance based methods prioritize objects with the lowest values to detect similarity amongst
them.
d) Euclidean Distance
The Euclidean distance between two vectors is the straight-line distance: the square root of the
sum of the squared differences of their components.
Source Code:
from scipy.spatial import distance
dst = distance.euclidean(x, y)
print('Euclidean distance: %.3f' % dst)
e) Manhattan Distance
Different from the Euclidean distance is the Manhattan distance, also called ‘city block’ distance,
which sums the absolute differences between the components of two vectors. You can imagine this
metric as the distance between two points when you have to follow a street grid and cannot cut
through buildings.
Source Code:
from scipy.spatial import distance
dst = distance.cityblock(x, y)
print('Manhattan distance: %.3f' % dst)
Experiment-4:
Aim: Build a model using linear regression algorithm on any dataset.
Description:
In the code below a few modules are imported; here’s a breakdown of what they do:
SciPy – a collection of tools for statistics in python; scipy.stats is the submodule that provides the
regression analysis functions. Pandas, NumPy, Matplotlib and Seaborn handle the data frames,
numerical arrays and plotting.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
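The data-loading step is missing above; a minimal sketch, assuming the King County house-sales file is named kc_house_data.csv (the file name is an assumption):

df = pd.read_csv('kc_house_data.csv')  # assumed file name
df.head()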
Output:
id date price bedrooms bathrooms sqft_living sqft_lot
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080
df.isnull().any()
Output:
id False
date False
price False
bedrooms False
bathrooms False
sqft_living False
sqft_lot False
...
dtype: bool
Checking to see if any of our data has null values. If there were any, we’d drop or filter the null
values out.
df.dtypes
Output:
id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
...
dtype: object
df.describe()
Output:
price bedrooms bathrooms sqft_living
count 21613 21613 21613 21613
mean 540088.10 3.37 2.11 2079.90
std 367127.20 0.93 0.77 918.44
min 75000.00 0.00 0.00 290.00
25% 321950.00 3.00 1.75 1427.00
50% 450000.00 3.00 2.25 1910.00
75% 645000.00 4.00 2.50 2550.00
max 7700000.00 33.00 8.00 13540.00
We are working with a data set that contains 21,613 observations; the mean price is approximately
$540k, the median price is approximately $450k, and the average house’s living area is about 2,080 ft².
# Create two stacked subplots for the histograms
fig, (sqft, cost) = plt.subplots(2, 1, figsize=(8, 8))
sqft.hist(df.sqft_living, bins=80)
sqft.set_xlabel('Ft^2')
sqft.set_title("Histogram of House Square Footage")
cost.hist(df.price, bins=80)
cost.set_xlabel('Price ($)')
cost.set_title("Histogram of Housing Prices")
plt.tight_layout()
plt.show()
Output:
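The regression model itself is not built in the code above; a minimal sketch using scipy.stats.linregress, assuming sqft_living as the predictor and price as the target:

# Fit a simple linear regression: price as a function of living area
slope, intercept, r_value, p_value, std_err = stats.linregress(df['sqft_living'], df['price'])
print('price = %.2f * sqft_living + %.2f (R^2 = %.3f)' % (slope, intercept, r_value ** 2))

# Plot the fitted line over the data
plt.scatter(df['sqft_living'], df['price'], alpha=0.2)
plt.plot(df['sqft_living'], intercept + slope * df['sqft_living'], color='red')
plt.xlabel('Ft^2')
plt.ylabel('Price ($)')
plt.show()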
Experiment-5:
Aim: Build a classification model using Decision Tree algorithm on iris dataset.
Description:
On what basis should we make decisions?
In other words, what should we select as the yes-or-no questions used to classify our data? We
could take an educated guess (e.g. all mice with a weight over 5 pounds are obese). However, that
isn’t necessarily the best way to categorize our samples. What if we could use some kind of machine
learning algorithm to learn what questions to ask in order to do the best job at classifying our data?
That is the purpose behind decision tree models.
Source code:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed; io.StringIO serves the same purpose
from IPython.display import Image
from pydot import graph_from_dot_data
import pandas as pd
import numpy as np
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
In the proceeding section, we’ll attempt to build a decision tree classifier to determine the kind of
flower given its dimensions.
X.head()
Although decision trees can handle categorical data, we still encode the targets in terms of digits
(i.e. setosa=0, versicolor=1, virginica=2) in order to create a confusion matrix at a later point.
Fortunately, the pandas library provides a method for this very purpose.
y = pd.get_dummies(y)
Next, we create and train an instance of the DecisionTreeClassifier class. We provide the X and y values when fitting, because decision trees are a supervised learning algorithm; a sketch of this step follows.
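A sketch of the missing split-and-fit step (the 25% test size and the random_state are assumptions):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)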
Let’s see how our decision tree does when it’s presented with test data.
y_pred = dt.predict(X_test)
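The confusion_matrix import above is never used in the snippet; one way to use it is to convert the dummy columns back to class indices first (a sketch):

# Convert the one-hot rows back to class indices before comparing
species = np.argmax(y_test.values, axis=1)
predictions = np.argmax(y_pred, axis=1)
print(confusion_matrix(species, predictions))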
Experiment-6:
Aim: Apply Naïve Bayes Classification algorithm on any dataset.
Description:
Naive Bayes is a statistical classification technique based on Bayes’ theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and reliable
algorithm, with high accuracy and speed on large datasets.
Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of
other features. For example, a loan applicant is desirable or not depending on his/her income,
previous loan and transaction history, age, and location. Even if these features are
interdependent, these features are still considered independently. This assumption simplifies
computation, and that's why it is considered as naive. This assumption is called class conditional
independence.
Bayes’ theorem states that P(h|D) = P(D|h) · P(h) / P(D), where:
• P(h): the probability of hypothesis h being true (regardless of the data). This is known as the
prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence, or
the prior probability of the data.
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
• P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
Source Code:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then
we have scaled the feature variables.
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below
is the code for it:
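A sketch of this step, assuming the classifier object is named classifier:

# Fitting the Gaussian Naive Bayes classifier to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)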
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We
can also use other classifiers as per our requirement.
Output:
Now we will predict the test set result. For this, we will create a new predictor variable y_pred,
and will use the predict function to make the predictions.
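A sketch of the prediction step (using the classifier object from the sketch above):

# Predicting the Test set results
y_pred = classifier.predict(x_test)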
Output:
The above output shows the prediction vector y_pred and the real vector y_test. We can see that
some predictions are different from the real values; these are the incorrect predictions.
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and
65+25=90 correct predictions.
Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for
it:
Output:
In the above output we can see that the Naïve Bayes classifier has segregated the data points with
a fine boundary. The boundary is a Gaussian curve because we have used the GaussianNB classifier in our code.
Finally, we visualize the test set result. Below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# Colour the plane according to the classifier's prediction (the decision regions)
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:
The above output is the final output for the test set data. As we can see, the classifier has
created a Gaussian-shaped boundary to divide the "purchased" and "not purchased" classes. There
are some wrong predictions, which we counted in the confusion matrix, but it is still a
pretty good classifier.
Experiment-7:
Aim: Generate frequent item sets using Apriori Algorithm in python and also generate
association rules for any market basket data.
Description:
Apriori is an algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the database and
extending them to larger and larger item sets as long as those item sets appear sufficiently often in
the database. The frequent item sets determined by Apriori can be used to determine association
rules which highlight general trends in the database: this has applications in domains such
as market basket analysis.
Source Code:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
## Use this to read data from the csv file on local system.
df = pd.read_csv('./data/retail_data.csv', sep=',')
These NaNs make it hard to read the table. Let’s find out how many unique items there actually are:
items = set()
for col in df.columns:
    items.update(df[col].dropna().unique())  # drop the NaN padding before collecting item names
print(items)
Out:
{'Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil', 'Diaper', 'Milk'}
Applying Apriori
The apriori module from the mlxtend library provides a fast and efficient apriori implementation.
It expects a one-hot encoded data frame with one boolean column per item and one row per
transaction. The call below is a sketch: ohe_df stands for that one-hot encoded frame (its
construction is not shown above) and the min_support value is an assumption.
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True, low_memory=False)
The output is a data frame with the support for each itemset.
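The association rules step named in the aim is not shown; a minimal sketch using the association_rules function imported above (the confidence threshold is an assumption):

# Generate association rules from the frequent itemsets
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())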
Experiment-8:
Aim: Apply K- Means clustering algorithm on any dataset.
Description:
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and
so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group of points with similar properties.
Source Code:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
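The elbow-method code that produces the plot discussed below is not shown; a sketch, assuming the Annual Income and Spending Score columns (indices 3 and 4) are used as features:

x = dataset.iloc[:, [3, 4]].values  # assumed feature columns

# Compute the within-cluster sum of squares (WCSS) for k = 1..10
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss)
mtp.title('The Elbow Method')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()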
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at k = 5, so the number of clusters here will be 5.
The fitting step for k = 5 is sketched below.
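A sketch of training K-Means with the chosen k = 5 and visualising the clusters (colours are arbitrary):

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(x)

# Plot each cluster and the centroids
for i in range(5):
    mtp.scatter(x[y_kmeans == i, 0], x[y_kmeans == i, 1], s=100, label='Cluster %d' % (i + 1))
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()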
Experiment-9:
Aim: Apply Hierarchical Clustering algorithm on any dataset.
Description:
Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster
unlabeled data points. Like K-means clustering, hierarchical clustering also groups together the
data points with similar characteristics. In some cases the result of hierarchical and K-Means
clustering can be similar. Before implementing hierarchical clustering using Scikit-Learn, let's
first understand the theory behind hierarchical clustering.
Source Code:
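The data-loading step is not shown for this experiment; a sketch, assuming the same Mall_Customers dataset and feature columns as in Experiment 8:

import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Mall_Customers_data.csv')   # assumed file name, as in Experiment 8
X = dataset.iloc[:, [3, 4]].values                 # Annual Income, Spending Score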
# Agglomerative Hierarchical Clustering. We choose the Euclidean distance and the ward linkage
# method for our algorithm class
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')  # 'affinity' was renamed to 'metric' in newer scikit-learn
# Lets try to fit the hierarchical clustering algorithm to dataset X while creating the
# clusters vector that tells for each customer which cluster the customer belongs to.
y_hc=hc.fit_predict(X)
#5 Visualizing the clusters. This code is similar to k-means visualization code.
#We only replace the y_kmeans vector name to y_hc for the hierarchical clustering
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='cyan', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='magenta', label ='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()
Experiment-10:
Aim: Apply DBSCAN clustering algorithm on any dataset.
Description:
• DBSCAN algorithm steps, following the original research paper by Martin Ester et al. [1]
• The key concept of directly density-reachable points is used to classify the core and border points
of a cluster; this also helps us identify noise in the data.
Source Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("/content/drive/MyDrive/Mall_Customers.csv")
# importing the dataset
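The code that produced the outputs below is not shown; a sketch, assuming the Annual Income and Spending Score columns are selected as the two features:

print("Dataset shape:", data.shape)
print(data.isnull().values.any())        # check for missing values
x = data.iloc[:, [3, 4]].values          # assumed feature columns
print(x.shape)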
Output:
Dataset shape: (200, 5)
False
(200, 2)
from sklearn.neighbors import NearestNeighbors   # importing the library
neighb = NearestNeighbors(n_neighbors=2)          # creating an object of the NearestNeighbors class
nbrs = neighb.fit(x)                              # fitting the data to the object
distances, indices = nbrs.kneighbors(x)           # finding the nearest neighbours
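The k-distance plot and the DBSCAN fit itself are missing above; a minimal sketch (the eps and min_samples values are assumptions, normally read off the elbow of the k-distance plot):

# Sort the distances to the 2nd nearest neighbour; the "elbow" of this curve suggests a value for eps
distances = np.sort(distances[:, 1])
plt.plot(distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('2nd nearest-neighbour distance')
plt.show()

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=8, min_samples=4)    # assumed parameter values
labels = dbscan.fit_predict(x)

# Clusters plot: colour points by their DBSCAN label (-1 marks noise)
plt.scatter(x[:, 0], x[:, 1], c=labels, cmap='viridis')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('DBSCAN clusters')
plt.show()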
Output
Clusters plot