
(R20)Data Mining using Python Lab

INTERNATIONAL SCHOOL OF TECHNOLOGY AND SCIENCES


(FOR WOMEN)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DATA MINING USING PYTHON

LAB MANUAL

Year : 2021 – 2022

Subject Name : Data Mining using Python

Regulation : R20

Class/Sem : II B.Tech- II Sem

Branch : CAI


List of Experiments

1. Demonstrate the following data preprocessing tasks using python libraries.

a) Loading the dataset


b) Identifying the dependent and independent variables
c) Dealing with missing data

2. Demonstrate the following data preprocessing tasks using python libraries.

a) Dealing with categorical data


b) Scaling the features
c) Splitting dataset into Training and Testing Sets

3. Demonstrate the following Similarity and Dissimilarity Measures using python

a) Pearson’s Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance

4. Build a model using linear regression algorithm on any dataset.

5. Build a classification model using Decision Tree algorithm on iris dataset

6. Apply Naïve Bayes Classification algorithm on any dataset

7. Generate frequent itemsets using Apriori Algorithm in python and also generate association
rules for any market basket data.

8. Apply K- Means clustering algorithm on any dataset.

9. Apply Hierarchical Clustering algorithm on any dataset.

10. Apply DBSCAN clustering algorithm on any dataset.


Tools and Technology Used


Operating System: Windows

IDE : Anaconda Navigator


Language: Python
Editors : Spyder, Jupyter, Eclipse with PyDev


INSTALLATION STEPS:
Step 1: Visit www.anaconda.com
Step 2: Click Download.
Step 3: Click the 64-Bit Command Line installer (542 MB).
Step 4: Click Next.
Step 5: Read the licensing terms and click “I Agree”.
Step 6: Select an install for “Just Me” unless you’re installing for all users (which requires Windows Administrator privileges) and click Next.

Step 7: Choose whether to add Anaconda to your PATH environment variable. We


recommend not adding Anaconda to the PATH environment variable, since this
can interfere with other software. Instead, use Anaconda software by opening
Anaconda Navigator or the Anaconda Prompt from the Start Menu.

Step 8: Choose whether to register Anaconda as your default Python. Unless you
plan on installing and running multiple versions of Anaconda, or multiple versions
of Python, accept the default and leave this box checked.

Step 9: Click the Install button. If you want to watch the packages Anaconda is
installing, click Show Details.

Step 10: Click the Next button.

Step 11: Optional: To install VS Code, click the Install Microsoft VS Code
button. After the install completes click the Next button.

Step 12: After a successful installation you will see the “Thanks for installing
Anaconda” dialog box:


Experiment-1:

Aim: Demonstrate the following data preprocessing tasks using python libraries.

a) Loading the dataset


b) Identifying the dependent and independent variables
c) Dealing with missing data

Description:
Importing the pandas library:

import pandas as pd

a) Loading the dataset

dataset = pd.read_csv("age_salary.csv")

print(dataset)

The data set used here is as simple as shown below:
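The screenshot of the file is not reproduced here. Purely as an illustrative stand-in (the real values in age_salary.csv will differ), a small frame of the same shape, with an age column, a salary column and a few missing fields, can be built directly if the CSV is not available:

import numpy as np
import pandas as pd

# illustrative stand-in for age_salary.csv: two numeric columns with a few missing fields
dataset = pd.DataFrame({
    "age":    [25, 27, np.nan, 35, 42, np.nan, 50],
    "salary": [40000, 42000, 47000, np.nan, 60000, 52000, np.nan],
})
print(dataset)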


Note:

The ‘nan’ values you see in some cells of the data frame denote missing fields.

b) Classifying the dependent and Independent Variables


Having seen the data we can clearly identify the dependent and independent factors. Here we just have 2 factors, age and salary. Salary is the dependent factor that changes with the independent factor age. Now let’s classify them programmatically.

X = dataset.iloc[:, :-1].values   # takes all rows of all columns except the last column
Y = dataset.iloc[:, -1].values    # takes all rows of the last column

• X : independent variable set


• Y : dependent variable set


The dependent and independent values are stored in different arrays. In case of
multiple independent variables use

X = dataset.iloc[:,a:b].values

where a is the starting column index and b is the ending column index (the end index is excluded, as is usual for slicing). You can also specify the column indices in a list to select specific columns.
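For example (the column indices below are purely illustrative, adjust them to your own dataset):

# columns 0 and 1 as features, column 2 as the target
X = dataset.iloc[:, 0:2].values
Y = dataset.iloc[:, 2].values

# or pick specific, non-contiguous columns by listing their indices
X = dataset.iloc[:, [0, 2]].values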

c) Dealing with Missing Data


We have already noticed the missing fields in the data denoted by “nan”. Machine learning
models cannot accommodate missing fields in the data they are provided with. So the missing
fields must be filled with values that will not affect the variance of the data or make it noisier.

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imp.fit_transform(X)
Y = Y.reshape(-1, 1)
Y = imp.fit_transform(Y)
Y = Y.reshape(-1)

The scikit-learn library’s SimpleImputer class allows us to impute the missing fields in a
dataset with valid data. In the above code, we have used the strategy "mean" (which is also the default) for filling missing values. The imputer cannot be applied on 1D arrays, and since Y is
a 1D array, it needs to be converted to a compatible shape first. The reshape function allows us to
reshape any array. The fit_transform() method fits the imputer object and then transforms
the arrays.

Another method to find missing values using pandas:

dataset.isnull().sum()

It returns the count of NaN values in each column.
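A pandas-only way to then fill those missing fields is a sketch like the following, assuming the columns are numeric; each NaN is replaced with the mean of its own column:

# replace every NaN with the mean of its column (numeric columns only)
dataset_filled = dataset.fillna(dataset.mean(numeric_only=True))
print(dataset_filled.isnull().sum())   # should now report 0 missing values per column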


Output


Experiment-2:

Aim: Demonstrate the following data preprocessing tasks using python libraries.

a) Dealing with categorical data


b) Scaling the features
c) Splitting dataset into Training and Testing Sets

Description:

a) Dealing with Categorical Data


When dealing with large and real-world datasets, categorical data is almost
inevitable. Categorical variables represent types of data which may be divided into groups.
Examples of categorical variables are race, sex, age group, educational level etc. These
variables often have letters or words as their values. Since machine learning models are all about
numbers and calculations, these categorical variables need to be encoded into numbers. Having
coded the categorical variable into numbers may still not be enough.

For example, consider the dataset below with 2 categorical features, nation and
purchased_item. Let us assume that the dataset is a record of how age, salary and country of a
person determine if an item is purchased or not. Thus purchased_item is the dependent factor
and age, salary and nation are the independent factors.


It has 3 countries listed. In a larger dataset, these may be large groups of data. Since countries
don’t have a mathematical relation between them (unless we are considering some known
factors such as size or population), coding them as plain numbers will not work, as one number
may be less than or greater than another. Dummy variables are the solution: using
one-hot encoding we create a dummy variable for each category in the column,
and each dummy variable takes a binary value. We do not need to create dummy
variables for the feature purchased_item as it has only 2 categories, yes or no.

dataset = pd.read_csv("dataset.csv")
X = dataset.iloc[:,[0,2,3]].values
Y = dataset.iloc[:,1].values
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
le_X = LabelEncoder()
X[:,0] = le_X.fit_transform(X[:,0])
ohe_X = OneHotEncoder(categorical_features = [0])
X = ohe_X.fit_transform(X).toarray()
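Note: the categorical_features argument used above was removed from OneHotEncoder in recent scikit-learn releases (0.22 and later). A minimal sketch of the equivalent preprocessing with ColumnTransformer is shown below; it assumes, as in the code above, that the nation name is in column 0 of X, and with this approach the LabelEncoder step is not needed because OneHotEncoder accepts string categories directly.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = dataset.iloc[:, [0, 2, 3]].values                    # same feature columns as above
ct = ColumnTransformer(
    transformers=[('nation', OneHotEncoder(), [0])],     # one-hot encode column 0 (the nation)
    remainder='passthrough')                              # keep the remaining columns unchanged
X = ct.fit_transform(X)   # note: may return a SciPy sparse matrix depending on the scikit-learn version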


Output

The first 3 columns are the dummy features representing Germany, India and Russia
respectively. A 1 in a column indicates that the person belongs to that specific country.

Y = le_X.fit_transform(Y)


Output:

b) Splitting the Dataset into Training and Testing sets


All machine learning models require us to provide a training set for the machine so that the
model can train on that data to understand the relations between features and can predict for
new observations. When we are provided a single huge dataset with many observations,
it is a good idea to split the dataset into two, a training_set and a test_set, so that we can
test our model after it has been trained with the training_set.

Scikit-learn comes with a method called train_test_split to help us with this task.

from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

The above code will split X and Y into two subsets each.

• test_size: the desired size of the test_set. 0.3 denotes 30%.


• random_state: fixes the seed of the random shuffling, so the same split is reproduced every time the code is run with that value.


c) Scaling the features


Since machine learning models rely on numbers to solve relations, it is important to have
similarly scaled data in a dataset. Scaling ensures that all data in a dataset falls in the same
range. Unscaled data can cause inaccurate or false predictions. Some machine learning
algorithms can handle feature scaling on their own and don’t require it explicitly.

The StandardScaler class from the scikit-learn library can help us scale the dataset.

from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

sc_y = StandardScaler()
Y_train = Y_train.reshape((len(Y_train), 1))
Y_train = sc_y.fit_transform(Y_train)
Y_train = Y_train.ravel()

Output
X_train before scaling :

X_train after scaling :



Experiment-3:

Aim:Demonstrate the following Similarity and Dissimilarity Measures using python

a) Pearson’s Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance

Description:
Many data science techniques are based on measuring similarity and dissimilarity between
objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In
Unsupervised Learning, K-Means is a clustering method which uses Euclidean distance to
compute the distance between the cluster centroids and their assigned data points.
Recommendation engines use neighborhood based collaborative filtering methods which identify
an individual’s neighbor based on the similarity/dissimilarity to the other users.

Similarity Based Metrics

Similarity-based methods assign the highest values to the most similar objects, as higher similarity implies they live in closer neighborhoods.

a) Pearson’s Correlation


Correlation is a technique for investigating the relationship between two quantitative, continuous

variables, for example, age and blood pressure. Pearson’s correlation coefficient is a measure

related to the strength and direction of a linear relationship. For two vectors x and y it is calculated as

r = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² · Σ (y_i − ȳ)² ]

where x̄ and ȳ are the means of x and y.

The Pearson’s correlation can take a range of values from -1 to +1. The extreme values +1 and -1 occur only when the relationship between the two variables is perfectly linear; two variables that merely increase or decrease together will not, in general, give a correlation of exactly +1 or -1.


Source Code:
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# plot x and y with a fitted straight line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# calculate Pearson's correlation


corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.810

b) Cosine Similarity

The cosine similarity calculates the cosine of the angle between two vectors. For two vectors x and y it is computed as

cos(θ) = (x · y) / (‖x‖ ‖y‖)

that is, the dot product of the vectors divided by the product of their lengths.


Recall the cosine function: it maps the angle between two vectors to a value between -1 (vectors pointing in opposite directions) and +1 (vectors pointing in the same direction).

Source code:

We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for
a single sample.
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))
print('Cosine similarity: %.3f' % cos_sim)

Output:Cosine similarity: 0.773

c) Jaccard Similarity

Cosine similarity is for comparing two real-valued vectors, but Jaccard similarity is for
comparing two binary vectors (sets). It is the size of the intersection of the two sets divided by the size of their union.

Source Code:
from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]


jacc = jaccard_score(A,B)
print('Jaccard similarity: %.3f' % jacc)

Output:Jaccard similarity: 0.500

Distance Based Metrics

Distance based methods prioritize objects with the lowest values to detect similarity amongst

them.

d) Euclidean Distance

The Euclidean distance is the straight-line distance between two vectors. For the two vectors x and y it is computed as

d(x, y) = √( Σ (x_i − y_i)² )

Source Code:
from scipy.spatial import distance
dst = distance.euclidean(x,y)
print('Euclidean distance: %.3f' % dst)

Output:Euclidean distance: 3.273

e) Manhattan Distance

Different from the Euclidean distance is the Manhattan distance, also called ‘cityblock’ distance,
from one vector to another. You can imagine this metric as a way to compute the distance
between two points when you are not able to go through buildings.


We calculate the Manhattan distance as

d(x, y) = Σ |x_i − y_i|

Source Code:
from scipy.spatial import distance
dst = distance.cityblock(x,y)
print('Manhattan distance: %.3f' % dst)

Output: Manhattan distance: 10.468

Experiment-4:
Aim: Build a model using linear regression algorithm on any dataset.
Description:
We start by importing a few modules; here’s a breakdown of what they do:

1. Numpy – a necessary package for scientific computation. It includes an incredibly


versatile structure for working with arrays, which are the primary data format that scikit-
learn uses for input data.
2. Matplotlib – the fundamental package for data visualization in Python. This module
allows for the creation of everything from simple scatter plots to 3-dimensional contour
plots. Note that from matplotlib we install pyplot, which is the highest order state-
machine environment in the modules hierarchy (if that is meaningless to you don’t worry
about it, just make sure you get it imported to your notebook). Using ‘%matplotlib inline’
is essential to make sure that all plots show up in your notebook.

3. Scipy – a collection of tools for statistics in Python. stats is the scipy module that provides the regression analysis functions.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns


from matplotlib import rcParams


df = pd.read_csv('/Users/michaelrundell/Desktop/kc_house_data.csv')
df.head()

Output:
id date price bedrooms bathrooms sqft_living sqft_lot
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080

Reading the csv file from Kaggle using pandas (pd.read_csv).

df.isnull().any()

Output:
id False
date False
price False
bedrooms False
bathrooms False
sqft_living False
sqft_lot False
...
dtype: bool

Checking to see if any of our data has null values. If there were any, we’d drop or filter the null
values out.

df.dtypes

Output:
id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
...
dtype: object


df.describe()

Output:
price bedrooms bathrooms sqft_living
count 21613 21613 21613 21613
mean 540088.10 3.37 2.11 2079.90
std 367127.20 0.93 0.77 918.44
min 75000.00 0.00 0.00 290.00
25% 321950.00 3.00 1.75 1427.00
50% 450000.00 3.00 2.25 1910.00
75% 645000.00 4.00 2.50 2550.00
max 7700000.00 33.00 8.00 13540.00
We are working with a data set that contains 21,613 observations; the mean price is approximately
$540k, the median price is approximately $450k, and the average house area is about 2,080 ft².

fig = plt.figure(figsize=(12, 6))


sqft = fig.add_subplot(121)
cost = fig.add_subplot(122)

sqft.hist(df.sqft_living, bins=80)
sqft.set_xlabel('Ft^2')
sqft.set_title("Histogram of House Square Footage")

cost.hist(df.price, bins=80)
cost.set_xlabel('Price ($)')
cost.set_title("Histogram of Housing Prices")

plt.show()
output:
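The histograms above conclude the exploratory part, but no model has been fitted yet. A minimal sketch that completes the aim, assuming we predict price from sqft_living with scikit-learn's LinearRegression (other columns of the CSV could be added to X in the same way):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# single-feature linear regression: predict house price from living area
X = df[['sqft_living']].values
y = df['price'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)

print('Intercept :', reg.intercept_)
print('Coefficient :', reg.coef_[0])
print('R^2 on test set :', r2_score(y_test, reg.predict(X_test)))

The coefficient can be read as the estimated increase in price for each additional square foot of living area.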


Experiment-5:
Aim: Build a classification model using Decision Tree algorithm on iris dataset.
Description:
On what basis should we make decisions?
In other words, what should we select as the yes or no questions which are used to classify our
data. We could take an educated guess (i.e. all mice with a weight over 5 pounds are obese).
However, it isn’t necessarily the best way to categorize our samples. What if we could use some
kind of machine learning algorithm to learn what questions to ask in order to do the best job at
classifying our data? That is the purpose behind decision tree models.
Source code:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn versions; io.StringIO works here
from IPython.display import Image
from pydot import graph_from_dot_data
import pandas as pd
import numpy as np
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)

In the following section, we’ll attempt to build a decision tree classifier to determine the kind of
flower given its dimensions.
X.head()

Although decision trees can handle categorical data, we still encode the targets numerically, with one indicator column per class (setosa, versicolor, virginica), in order to create a confusion matrix at a later point. Fortunately, the pandas library provides a method for this very purpose.
y = pd.get_dummies(y)


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Next, we create and train an instance of the DecisionTreeClassifer class. We provide the y values

because our model uses a supervised machine learning algorithm


dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
(graph, ) = graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

Let’s see how our decision tree does when it’s presented with test data.
y_pred = dt.predict(X_test)
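The confusion_matrix import above is never actually used; a short sketch to finish the evaluation is given below. Since the targets were one-hot encoded with get_dummies, the indicator columns are first converted back to class indices with argmax.

from sklearn.metrics import confusion_matrix, accuracy_score

# convert the one-hot targets and predictions back to class indices (0, 1, 2)
species = np.argmax(np.array(y_test), axis=1)
predictions = np.argmax(np.array(y_pred), axis=1)

print(confusion_matrix(species, predictions))
print('Accuracy:', accuracy_score(species, predictions))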

Experiment-6:
Aim: Apply Naïve Bayes Classification algorithm on any dataset.
Description:
Naive Bayes is a statistical classification technique based on Bayes Theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and reliable
algorithm, with high accuracy and speed on large datasets.

Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of
other features. For example, a loan applicant is desirable or not depending on his/her income,
previous loan and transaction history, age, and location. Even if these features are
interdependent, these features are still considered independently. This assumption simplifies
computation, and that's why it is considered as naive. This assumption is called class conditional
independence.


Bayes’ theorem relates these quantities as P(h|D) = P(D|h) · P(h) / P(D), where:

• P(h): the probability of hypothesis h being true (regardless of the data). This is known as the
prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the prior
probability of the data (the evidence).
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
• P(D|h): the probability of data D given that the hypothesis h was true. This is known as the likelihood.

Source Code:
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test sets, and then
we have scaled the feature variables.

The output for the dataset is given as:


2) Fitting Naive Bayes to the Training Set:

After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below
is the code for it:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We
can also use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)

3) Prediction of the test set result:

Now we will predict the test set result. For this, we will create a new predictor variable y_pred,
and will use the predict function to make the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:


The above output shows the result for the prediction vector y_pred and the real vector y_test. We can
see that some predictions are different from the real values; these are the incorrect predictions.

4) Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:

As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and
65+25=90 correct predictions.
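The same check can be expressed as a single number (a sketch using scikit-learn's accuracy_score; 90 correct out of 100 corresponds to 0.90):

from sklearn.metrics import accuracy_score

# fraction of test samples predicted correctly
print('Accuracy:', accuracy_score(y_test, y_pred))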

5) Visualizing the training set result:

Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for
it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

In the above output we can see that the Naïve Bayes classifier has segregated the data points with
a fine boundary. The boundary is a Gaussian curve because we have used the GaussianNB classifier in our code.

6) Visualizing the Test set result:


# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above output is the final output for the test set data. As we can see, the classifier has
created a Gaussian curve to divide the "purchased" and "not purchased" regions. There
are some wrong predictions, which we have already counted in the confusion matrix, but it is
still a pretty good classifier.

Experiment-7:
Aim: Generate frequent item sets using Apriori Algorithm in python and also generate
association rules for any market basket data.
Description:
Apriori is an algorithm for frequent item set mining and association rule learning over
relational databases. It proceeds by identifying the frequent individual items in the database and

extending them to larger and larger item sets as long as those item sets appear sufficiently often in
the database. The frequent item sets determined by Apriori can be used to determine association
rules which highlight general trends in the database: this has applications in domains such
as market basket analysis.

Source Code:

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

## Use this to read the data directly from github
df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv', sep=',')

## Alternatively, uncomment this to read the data from a csv file on the local system.
# df = pd.read_csv('./data/retail_data.csv', sep=',')

## Print first 10 rows


df.head(10)

These NaNs make it hard to read the table. Let’s find out how many unique items are actually

there in the table.


items = set()
for col in df:
    items.update(df[col].unique())
print(items)

Out:
{'Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil', 'Diaper', 'Milk'}

Custom One Hot Encoding


itemset = set(items)
encoded_vals = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)
    commons = list(itemset.intersection(rowset))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)
encoded_vals[0]
ohe_df = pd.DataFrame(encoded_vals)

Applying Apriori

The apriori module from the mlxtend library provides a fast and efficient apriori implementation. Its signature is:

apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)

freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True, verbose=1)
freq_items.head(7)


The output is a data frame with the support for each itemset.
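The aim also asks for association rules. The association_rules function imported at the top of this experiment can derive them from the frequent itemsets found above; the confidence threshold of 0.6 used here is an arbitrary illustrative choice.

# derive association rules from the frequent itemsets
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)

# antecedent, consequent, support, confidence and lift of each rule
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())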

Experiment-8:
Aim: Apply K- Means clustering algorithm on any dataset.

Description:

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and
so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group, made up of points with similar properties.

Source Code:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values   # selecting the Annual Income and Spending Score columns, used as x below


Step-2: Finding the optimal number of clusters using the elbow method

#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []   #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number of clusters
here will be 5.


Step- 3: Training the K-means algorithm on the training dataset


#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
y_predict = kmeans.fit_predict(x)

Step-4: Visualizing the Clusters


#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')     #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')    #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')      #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')     #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Output:


Experiment-9:
Aim: Apply Hierarchical Clustering algorithm on any dataset.

Description:
Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster
unlabeled data points. Like K-means clustering, hierarchical clustering also groups together the
data points with similar characteristics. In some cases the result of hierarchical and K-Means
clustering can be similar. Before implementing hierarchical clustering using Scikit-Learn, let's
first understand the theory behind hierarchical clustering.
Source Code:

#1 Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#2 Importing the Mall_Customers dataset by pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3,4]].values
#3 Using the dendrogram to find the optimal numbers of clusters.
# First thing we're going to do is to import scipy library. scipy is an open source
# Python library that contains tools to do hierarchical clustering and building dendrograms.
# Only import the needed tool.
import scipy.cluster.hierarchy as sch
#Lets create a dendrogram variable
# linkage is actually the algorithm itself of hierarchical clustering and then in
#linkage we have to specify on which data we apply and engage. This is X dataset
dendrogram = sch.dendrogram(sch.linkage(X, method = "ward"))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()


# Agglomerative Hierarchical Clustering. We choose Euclidean distance and the ward linkage method.
# Note: in scikit-learn 1.2 and later the 'affinity' parameter is renamed to 'metric'.
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
# Lets try to fit the hierarchical clustering algorithm to dataset X while creating the
# clusters vector that tells for each customer which cluster the customer belongs to.
y_hc = hc.fit_predict(X)
#5 Visualizing the clusters. This code is similar to k-means visualization code.
#We only replace the y_kmeans vector name to y_hc for the hierarchical clustering
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='cyan', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='magenta', label ='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income(k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()


Experiment-10:
Aim: Apply DBSCAN clustering algorithm on any dataset.
Description:

• DBSCAN algorithm steps, following the original research paper by Martin Ester et al. [1]

• Key concept of directly density-reachable points to classify the core and border points of a cluster.
This also helps us to identify noise in the data.

• Example of applying the DBSCAN algorithm with python and scikit-learn: below we cluster the
customers of the Mall_Customers dataset based on their annual income and spending score.

Source Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("/content/drive/MyDrive/Mall_Customers.csv")
# importing the dataset


• Checking the head of the data.


data.head()

• Output

• Checking the shape of the dataset.


print("Dataset shape:", data.shape)

Output
Dataset shape: (200, 5)

Next, we check if the dataset has any missing values.


# checking for NULL data in the dataset
data.isnull().any().any()

Output
False

# extracting the Annual Income and Spending Score columns
x = data.loc[:, ['Annual Income (k$)', 'Spending Score (1-100)']].values


• Let us check the shape of x.


print(x.shape)

Output
(200, 2)
from sklearn.neighbors import NearestNeighbors  # importing the library
neighb = NearestNeighbors(n_neighbors=2)  # creating an object of the NearestNeighbors class
nbrs = neighb.fit(x)  # fitting the data to the object
distances, indices = nbrs.kneighbors(x)  # finding the nearest neighbours

Sorting and plotting the distances between the data points

# Sort and plot the distances results
distances = np.sort(distances, axis = 0)  # sorting the distances
distances = distances[:, 1]  # taking the second column of the sorted distances
plt.rcParams['figure.figsize'] = (5, 3)  # setting the figure size
plt.plot(distances)  # plotting the distances
plt.show()  # showing the plot

Output

Executing the code above, we obtain the following plot:

Implementing the DBSCAN model


from sklearn.cluster import DBSCAN
# cluster the data into five clusters
dbscan = DBSCAN(eps = 8, min_samples = 4).fit(x)  # fitting the model


labels = dbscan.labels_  # getting the labels

# Plot the clusters
plt.scatter(x[:, 0], x[:, 1], c = labels, cmap = "plasma")  # plotting the clusters
plt.xlabel("Income")  # X-axis label
plt.ylabel("Spending Score")  # Y-axis label
plt.show()  # showing the plot

Clusters plot
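As a small follow-up, the label vector can also be summarised numerically; DBSCAN marks noise points with the label -1, so they are excluded when counting clusters.

# summarising the DBSCAN result: number of clusters and noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("Estimated number of clusters:", n_clusters)
print("Estimated number of noise points:", n_noise)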
