Mlfile

The document outlines a series of practical exercises focused on implementing various machine learning techniques, including data preprocessing, linear regression, decision trees, random forests, Naive Bayes, K-means clustering, and K-nearest neighbors. Each practical includes objectives, steps for implementation using Python, and concludes with a summary of key takeaways and potential applications. The document emphasizes the importance of data preparation and model evaluation in machine learning projects.

Practical no. 1
Aim: Implementing Data Preprocessing.

Objective: This practical demonstrates the importance of data preprocessing in machine
learning projects and provides hands-on experience with common data preprocessing
techniques.

1. Introduction to Data Preprocessing:

 Data preprocessing is a crucial step in machine learning pipelines to prepare raw data
for model training.
 It involves cleaning, transforming, and organizing the data to make it suitable for
machine learning algorithms.

2. Common Data Preprocessing Techniques:

a. Handling Missing Values:
 Identify missing values in the dataset.
 Options for handling missing values include:
o Removing rows or columns with missing values.
o Imputing missing values using mean, median, or mode.
o Using advanced techniques like K-nearest neighbors (KNN) imputation or predictive modeling.

b. Encoding Categorical Variables:
 Convert categorical variables into numerical format suitable for machine learning algorithms.
 Techniques include:
o One-hot encoding: Create binary columns for each category.
o Label encoding: Encode categories as integer values.

c. Scaling and Normalization:
 Scale numerical features to a similar range to prevent features with large values from dominating the model.
 Techniques include:
o Standardization: Scale features to have mean=0 and variance=1.
o Min-max scaling: Scale features to a specified range, typically [0, 1].

d. Feature Engineering:
 Create new features from existing ones to capture additional information or improve model performance.
 Techniques include:
o Polynomial features: Generate interaction terms and polynomial features.
o Feature scaling: Scale features to have similar magnitudes.

e. Handling Imbalanced Classes:
 Address class imbalance in classification tasks where one class significantly outnumbers the other.
 Techniques include:
o Resampling: Oversample the minority class or undersample the majority class.
o Algorithmic techniques: Use algorithms that handle class imbalance well, such as ensemble methods.

3. Implementation in Python using scikit-learn:

 Use the scikit-learn library to implement common data preprocessing techniques.


 Example code snippets demonstrate how to:
 Handle missing values using SimpleImputer.
 Encode categorical variables using OneHotEncoder or LabelEncoder.
 Scale and normalize features using StandardScaler or MinMaxScaler.
 Perform feature engineering using PolynomialFeatures.

 Handle imbalanced classes using RandomOverSampler or
RandomUnderSampler from imbalanced-learn.
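
The code snippets referred to above appeared as screenshots in the original file. As one illustration, here is a minimal sketch of two steps not covered elsewhere in this practical, polynomial feature engineering and random oversampling with imbalanced-learn; the toy arrays are hypothetical:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from imblearn.over_sampling import RandomOverSampler

# Hypothetical toy features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# Feature engineering: degree-2 polynomial and interaction terms
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly.shape)  # (4, 5): x1, x2, x1^2, x1*x2, x2^2

# Handling imbalanced classes: oversample the minority class (here class 1)
y = np.array([0, 0, 0, 1])
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # both classes now have 3 samples each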

4. Example Data Preprocessing Pipeline:

 Construct a data preprocessing pipeline using scikit-learn's Pipeline class.


 Sequentially apply data preprocessing steps to the dataset before feeding it into the
machine learning model.
 Include steps for handling missing values, encoding categorical variables, scaling
features, and any other necessary preprocessing steps, as sketched below.
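
The pipeline code appeared as screenshots in the original file; a minimal sketch of such a pipeline, using a small hypothetical DataFrame with numeric and categorical columns (the column names and values are illustrative only), might look like the following:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with missing values and a categorical column
df = pd.DataFrame({
    "Age": [44, 27, 30, None, 38],
    "Salary": [72000, 48000, 54000, 61000, None],
    "Country": ["France", "Spain", "Germany", "Spain", "France"],
})

numeric_features = ["Age", "Salary"]
categorical_features = ["Country"]

# Impute then scale numeric columns; impute then one-hot encode categorical columns
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Apply all preprocessing steps in sequence before model training
X_processed = preprocess.fit_transform(df)
print(X_processed)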

Feature scaling

Standardisation

Dataset

Extract the dependent and independent variables

Categorical data

Splitting the dataset into training and test set

Feature scaling
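
The screenshots corresponding to the captions above are not reproduced here; a compact sketch of the same sequence of steps on a small hypothetical dataset (values chosen only for illustration) could be:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Dataset (hypothetical values, mirroring the usual Country/Age/Salary example)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Salary": [72000, 48000, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# Extract the independent (X) and dependent (y) variables
X = df[["Country", "Age", "Salary"]].copy()
y = df["Purchased"]

# Categorical data: encode the text columns as integers
X["Country"] = LabelEncoder().fit_transform(X["Country"])
y = LabelEncoder().fit_transform(y)

# Splitting the dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling (standardisation: mean 0, variance 1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)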

Practical – 2
Aim: Deploying a Simple Linear Regression Model.

Objective: In this practical, we will deploy a simple linear regression model using Python
and Flask. We will follow a step-by-step approach to prepare the data, train the model, save it
to disk, create a Flask web application for deployment, and test the deployed model.

Step 1: Install Required Libraries
Ensure you have Python installed on your system. Install the required libraries using pip:

Step 2: Import Libraries

Step 3: Prepare Data


Step 4: Split Data into Training and Testing Sets

Step 5: Train the Model

Step 6: Evaluate the Model
Step 7: Save the Model

Step 8: Draw the line of regression
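
The code for Steps 2 to 8 appeared as screenshots in the original file; a minimal sketch following the same steps, with hypothetical data and an illustrative pickle file name, might look like this:

# Steps 2-3: import libraries and prepare data (hypothetical values)
import pickle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)  # e.g. years of experience
y = np.array([15, 18, 24, 28, 33, 36, 42, 45], dtype=float)          # e.g. salary (in thousands)

# Step 4: split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 5: train the model
model = LinearRegression().fit(X_train, y_train)

# Step 6: evaluate the model
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred), "R2:", r2_score(y_test, y_pred))

# Step 7: save the model to disk (illustrative file name)
with open("linreg_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Step 8: draw the line of regression
plt.scatter(X, y, color="blue")
plt.plot(X, model.predict(X), color="red")
plt.show()

The Flask application mentioned in the objective and the conclusion is likewise not shown; a minimal sketch of such an app (the route name and JSON format are assumptions) could be saved as app.py and run with python app.py:

# app.py: serve the saved model over a simple JSON endpoint
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the model trained and saved above
with open("linreg_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])  # route name chosen for illustration
def predict():
    # Expects JSON such as {"features": [[5.0]]}
    features = np.array(request.get_json()["features"], dtype=float)
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(debug=True)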

Conclusion: In this practical, we successfully deployed a simple linear regression model
using Python and Flask. We trained the model, saved it to disk, created a Flask web
application for deployment, and tested the deployed model using sample input data. This
practical demonstrates the process of deploying machine learning models for real-world
applications.

Practical - 3
Aim: Implementing Multiple Linear Regression

Objective: In this practical, we will implement multiple linear regression using Python and
Scikit-learn. We'll follow a step-by-step approach to prepare the data, train the model,
evaluate its performance, and make predictions.

Step 1: Install Required Libraries

Step 2: Import Libraries

Step 3: Prepare Data

Step 4: Split Data into Training and Testing Sets

Step 5: Train the Model

Step 6: Evaluate the Model

Step 7: Interpret Results

Step 8: Make Predictions

You can now make predictions using the trained model. Provide new input data in the same
format as the training data and use the predict() method of the model.
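
The code for these steps appeared as screenshots; a minimal sketch with a hypothetical two-feature dataset might look like the following:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Step 3: prepare data with two independent variables (hypothetical values)
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5], [7, 8], [8, 7]], dtype=float)
y = np.array([8, 7, 15, 14, 22, 21, 29, 28], dtype=float)

# Step 4: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 5: train the multiple linear regression model
model = LinearRegression().fit(X_train, y_train)

# Step 6: evaluate the model
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred), "R2:", r2_score(y_test, y_pred))

# Step 7: interpret results (one coefficient per feature, plus an intercept)
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)

# Step 8: make predictions for new input in the same format as the training data
print("Prediction for [9, 10]:", model.predict([[9, 10]]))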

Conclusion: In this practical, we implemented multiple linear regression using Python and
Scikit-learn. We prepared the data, trained the model, evaluated its performance, and made
predictions. This practical demonstrates the process of implementing and using multiple
linear regression for predictive modelling tasks.

Practical – 4
Aim: Implement Decision Tree.

Introduction: In this practical, we will learn how to implement a Decision Tree classifier, a
popular supervised learning algorithm used for classification tasks. We will use Python to
build a Decision Tree classifier from scratch and apply it to a sample dataset.

1. Import Necessary Libraries:

 Import libraries such as NumPy and pandas for data manipulation, and scikit-learn for
implementing the Decision Tree classifier.

2. Load and Prepare Data:

 Load a sample dataset suitable for classification tasks.


 Preprocess the data as needed, including handling missing values, encoding
categorical variables, and splitting into training and testing sets.

3. Implement Decision Tree Classifier:

 Write a Python function to calculate the Gini impurity or entropy for a given dataset.
 Implement a function to find the best split for a dataset based on the calculated
impurity measure.
 Recursively build the decision tree by splitting the dataset at each node based on the
best split.
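
A minimal sketch of the impurity and best-split calculations described above (using the Gini criterion and a simple exhaustive threshold search) might look like this; the recursive tree-building step would then call best_split on each resulting subset until a stopping condition is met:

import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum over classes of p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Find the (feature, threshold) pair that minimizes the weighted
    Gini impurity of the two resulting child nodes."""
    n_samples, n_features = X.shape
    best = {"impurity": np.inf, "feature": None, "threshold": None}
    for f in range(n_features):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if weighted < best["impurity"]:
                best = {"impurity": weighted, "feature": f, "threshold": t}
    return best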

4. Train and Evaluate the Model:

 Train the Decision Tree classifier on the training data.


 Evaluate the classifier's performance on the testing data using metrics such as
accuracy, precision, recall, and F1-score.
 Compare the performance of the implemented Decision Tree algorithm with scikit-
learn's implementation for validation.
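
For the scikit-learn side of this comparison, a brief sketch (using the Iris dataset purely as a stand-in sample dataset) could be:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Sample classification dataset (Iris as a stand-in) and train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a Decision Tree and evaluate it on the held-out test set
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall and F1-score per class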

5. Visualize the Decision Tree:

 Visualize the trained Decision Tree classifier using libraries such as Graphviz or
Matplotlib.
 Interpret the decision tree structure and explain how it makes predictions.
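
A short sketch of this visualization step, assuming the clf object trained in the previous sketch, might be:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import plot_tree

# Draw the trained tree; each node shows its split condition, impurity and class counts
plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=load_iris().feature_names, filled=True)
plt.show()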

6. Experimentation and Analysis:

 Experiment with different hyperparameters such as maximum depth, minimum
samples per leaf, and criterion (Gini impurity or entropy).
 Analyze the results and discuss any insights or observations.

7. Conclusion:

 Summarize the key takeaways from implementing the Decision Tree classifier.
 Reflect on the strengths and limitations of Decision Trees and their suitability for
different types of datasets.
 Discuss potential applications and real-world use cases where Decision Trees can be
effective.

Practical – 5
Aim: Deploy Random Forest classification.
Random forest
A random forest is a supervised machine learning algorithm used to solve regression and
classification problems. As the name suggests, a forest contains multiple trees, and in the same
way a random forest also contains multiple trees. More trees mean a more robust forest. Different
decision trees are created on data samples, the prediction from each of them is collected, and the
final decision is selected by means of voting. Random forest eliminates the drawbacks of decision
trees by reducing the overfitting of datasets and increasing precision. This type of technique is
called ensemble learning. The motivation behind ensemble learning is the belief that a group
of experts working together is more likely to be accurate than an individual expert.
Working of random forest:
Before looking at how random forest works, let's briefly understand ensemble learning, because
random forest is based on it. "Ensemble" literally means a group.
So ensemble learning is a technique for combining the outputs of different models. These
models are called weak learners. Rather than relying on an individual tree, different trees make
predictions and the output is selected according to majority voting (or, equivalently, model
averaging). Suppose there are 5 models, out of which 3 have predicted
YES and 2 have predicted NO. Then the final prediction is taken as YES.

About dataset
The dataset used in this is ‘titanic.csv’ which is available for free, which is available on
Kaggle.com. This dataset includes the following features
1. Importing Libraries and reading dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("titanic.csv")

2. Data preprocessing
df.drop(['Cabin','PassengerId','Name','Ticket'],axis=1,inplace=True)
df = df.fillna(0)
3. Handling categorical data
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Sex']=le.fit_transform(df['Sex'])
df['Embarked']=le.fit_transform(df['Embarked'])
df
4. Dependent and independent variables
# Putting feature variable to X
X = df.drop('Survived',axis=1)
# Putting response variable to y
y = df['Survived']

5. Splitting dataset into Training and Testing Set


# Splitting the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
train_size=0.7, random_state=42)
Next, split both X and y into training and testing sets with the help of the train_test_split() function.
Here train_size is 0.7, which means 70% of the data is used for training.
6. Implementing a Random Forest classifier
# Import the Random Forest model
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)
# Train the model using the training set
clf.fit(X_train, y_train)
Different parameters are used in the Random Forest algorithm:
1. n_estimators: The number of decision trees in the forest.
2. criterion {"gini", "entropy"}, default="gini": Measures the quality of a split; this is the
criterion by which the decision tree actually splits the variables.
 "gini" for the Gini impurity
 "entropy" for the information gain
3. max_depth (int, default=None): The maximum depth of the tree (root node to terminal node).
4. min_samples_split (int or float, default=2): The minimum number of samples required to split
an internal node.
5. min_samples_leaf (int or float, default=1): The minimum number of samples required to be at
a leaf node.
6. max_features {"auto", "sqrt", "log2"}, int or float, default="auto": The maximum number of
features the random forest considers when looking for the best split.
7. n_jobs (int, default=None): The number of jobs to run in parallel. This is used when you have
the capability to do parallel processing: n_jobs=-1 means using all processors, while n_jobs=1
uses only one processor.
8. random_state (int, RandomState instance or None, default=None): Controls the randomness of
the samples used when building the trees.
9. verbose (int, default=0): Controls the verbosity when fitting and predicting; it gives you
run-time information.
You can hyper-tune these parameters by changing their values, as sketched below.
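
The hyper-parameter tuning itself is not shown in the original file; a minimal sketch using scikit-learn's GridSearchCV (reusing the X_train and y_train defined above, with an illustrative parameter grid) might look like:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid of candidate values for a few of the parameters described above
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 3],
}

# 5-fold cross-validated grid search over all parameter combinations
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)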
7. Predicting test cases using random forest
# Predicting the test set results
pred = clf.predict(X_test)
print(pred)
Output:
[0 1 1 0 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1
 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1]
8. Checking the accuracy score
from sklearn.metrics import classification_report
rand_score = clf.score(X_test, y_test)
classification_report_rf = classification_report(y_test, pred)
print("Accuracy score:", rand_score)
Output:
Accuracy score: 0.8268156424581006
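
The classification report computed above is never printed; a short addition to display it, together with the confusion matrix (reusing the pred, y_test and classification_report_rf objects from the steps above), could be:

from sklearn.metrics import confusion_matrix

# Per-class precision, recall and F1-score of the random forest predictions
print(classification_report_rf)

# Confusion matrix: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, pred))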

Practical – 6

Aim: Simulating Naive Bayes Algorithm for Classification

Introduction: In this practical, we will learn how to implement and simulate the Naive
Bayes algorithm, a popular classification algorithm based on Bayes' theorem and the
assumption of feature independence. We will use Python to build a simple Naive Bayes
classifier and apply it to a sample dataset.

1. Import Necessary Libraries:

 Import libraries such as NumPy and pandas for data manipulation, and scikit-learn for
implementing the Naive Bayes classifier.

2. Load and Prepare Data:

 Load a sample dataset suitable for classification tasks.


 Preprocess the data as needed, including handling missing values, encoding
categorical variables, and splitting into training and testing sets.

3. Implement Naive Bayes Classifier:

 Write a Python function to calculate the probability of each class using the Naive
Bayes algorithm.
 Implement functions to calculate the likelihood of each feature given the class and the
prior probability of each class.
 Combine these probabilities using Bayes' theorem to make predictions.
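
A minimal sketch of such a simulated classifier (a Gaussian Naive Bayes written from scratch; the class name is my own) might look like this. Working in log space avoids numerical underflow when many small probabilities are multiplied:

import numpy as np

class SimpleGaussianNB:
    """Minimal Gaussian Naive Bayes sketch: per-class priors plus per-feature
    Gaussian likelihoods, combined with Bayes' theorem in log space."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.vars_ = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes_}
        return self

    def _log_likelihood(self, c, X):
        # Log of the Gaussian pdf, summed over features (feature-independence assumption)
        mean, var = self.means_[c], self.vars_[c]
        return np.sum(-0.5 * np.log(2 * np.pi * var) - (X - mean) ** 2 / (2 * var), axis=1)

    def predict(self, X):
        # Posterior is proportional to prior times likelihood; pick the largest log-posterior
        scores = np.column_stack([
            np.log(self.priors_[c]) + self._log_likelihood(c, X) for c in self.classes_
        ])
        return self.classes_[np.argmax(scores, axis=1)]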

4. Train and Evaluate the Classifier:

 Train the Naive Bayes classifier on the training data.


 Evaluate the classifier's performance on the testing data using metrics such as
accuracy, precision, recall, and F1-score.
 Compare the performance of the simulated Naive Bayes classifier with scikit-learn's
implementation for validation.
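
A brief sketch of this comparison, using the Iris dataset as a stand-in sample and the SimpleGaussianNB class from the sketch above, could be:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Sample dataset (Iris as a stand-in) and train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulated classifier (from the sketch above) versus scikit-learn's GaussianNB
ours = SimpleGaussianNB().fit(X_train, y_train)
sk = GaussianNB().fit(X_train, y_train)

print("Simulated NB accuracy:", accuracy_score(y_test, ours.predict(X_test)))
print("scikit-learn NB accuracy:", accuracy_score(y_test, sk.predict(X_test)))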

5. Experimentation and Analysis:

 Experiment with different variations of the Naive Bayes algorithm (e.g., Gaussian
Naive Bayes, Multinomial Naive Bayes, etc.) and observe the impact on performance.
 Analyse the results and discuss any insights or observations

Practical – 7 ( a )
Aim : Deploy K means clustering algorithm.

Introduction: In this practical, we will demonstrate how to apply a trained K-means
clustering model to new data for cluster assignment. K-means is an unsupervised learning
algorithm used for clustering data points into groups based on their similarities. We will use
Python to deploy the K-means model and assign cluster labels to new data points.

1. Train a K-means Model:

 Load a dataset suitable for clustering tasks.


 Train a K-means clustering model on the dataset using an appropriate number of
clusters (K).

2. Save the Trained Model:

 Serialize the trained K-means model using libraries like pickle or joblib.
 Save the model to disk for later use.

3. Load the Trained Model:

 Load the saved K-means model from disk when needed for deployment.

4. Deploying the Model:

 When new data becomes available, load the trained K-means model.
 Apply the model to the new data to assign cluster labels using the predict method.
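
The code for steps 1 to 4 appeared as screenshots in the original file; a minimal sketch of the same workflow (synthetic blob data and an illustrative file name) might look like:

import numpy as np
import joblib
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Steps 1-2: train a K-means model on sample data and save it to disk
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
joblib.dump(kmeans, "kmeans_model.joblib")  # illustrative file name

# Steps 3-4: later, load the saved model and assign clusters to new data
model = joblib.load("kmeans_model.joblib")
new_points = np.array([[0.0, 1.5], [-4.0, 8.0]])  # hypothetical new data
labels = model.predict(new_points)
print("Cluster assignments:", labels)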

5. Visualize Cluster Assignments:

 Optionally, visualize the cluster assignments of the new data along with the centroids
of each cluster.
 Use scatter plots or other visualization techniques to display the clustered data points.
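
A short sketch of this visualization, reusing the X, kmeans, model, new_points and labels objects from the previous sketch, could be:

import matplotlib.pyplot as plt

# Colour the training points by cluster, overlay the new points and the trained centroids
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=15, alpha=0.4)
plt.scatter(new_points[:, 0], new_points[:, 1], c=labels, marker="*", s=200, edgecolors="black")
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], c="red", marker="X", s=150)
plt.title("Cluster assignments for new data")
plt.show()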

6. Conclusion:

 Summarize the process of deploying a K-means clustering model for cluster
assignment.
 Discuss potential applications of K-means clustering in real-world scenarios.
 Highlight the importance of understanding the underlying data and choosing an
appropriate number of clusters (K).

Practical – 7 ( b )
Aim: Implement K- nearest neighbors algorithm.

Introduction: In this practical, we will learn how to implement the K-nearest neighbors
(KNN) algorithm, a simple yet effective supervised learning algorithm for classification and
regression tasks. We will use Python to build a KNN classifier from scratch and apply it to a
sample dataset.

Implementation of KNN Algorithm in Python

Let’s now get into the implementation of KNN in Python. We’ll go over the steps to help you
break the code down and make better sense of it.

1. Importing the modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
2. Creating Dataset

Scikit-learn has a lot of tools for creating synthetic datasets, which are great for testing
machine learning algorithms. I'm going to utilize the make_blobs method.
X, y = make_blobs(n_samples = 500, n_features = 2, centers = 4,cluster_std = 1.5,
random_state = 4)

This code generates a dataset of 500 samples separated into four classes with a total of two
characteristics. Using associated parameters, you may quickly change the number of samples,
characteristics, and classes. We may also change the distribution of each cluster (or class).

3. Visualize the Dataset


plt.style.use('seaborn')
plt.figure(figsize = (10,10))
plt.scatter(X[:,0], X[:,1], c=y, marker= '*',s=100,edgecolors='black')
plt.show()

Data Visualization KNN

4. Splitting Data into Training and Testing Datasets

It is critical to partition a dataset into train and test sets for every supervised machine learning
method. We first train the model and then put it to the test on various portions of the dataset.
If we don’t separate the data, we’re simply testing the model with data it already knows.
Using the train_test_split method, we can easily split the data into train and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)



With the train size and test size options, we may determine how much of the original data is
utilized for train and test sets, respectively. The default separation is 75% for the train set and
25% for the test set.

5. KNN Classifier Implementation

After that, we’ll build a kNN classifier object. I develop two classifiers with k values of 1 and
5 to demonstrate the relevance of the k value. The models are then trained using a train set.
The k value is chosen using the n_neighbors argument. It does not need to be explicitly
specified because the default value is 5.

knn5 = KNeighborsClassifier(n_neighbors = 5)
knn1 = KNeighborsClassifier(n_neighbors=1)
6. Predictions for the KNN Classifiers

Then, in the test set, we forecast the target values and compare them to the actual values.

knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)

y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)
7. Predict Accuracy for both k values
from sklearn.metrics import accuracy_score
print("Accuracy with k=5", accuracy_score(y_test, y_pred_5)*100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1)*100)

The accuracy for the values of k comes out as follows:

Accuracy with k=5 93.60000000000001
Accuracy with k=1 90.4
8. Visualize Predictions

Let’s view the test set and predicted values with k=5 and k=1 to see the influence of k values.

plt.figure(figsize = (15,5))
plt.subplot(1,2,1)
plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_5, marker= '*', s=100, edgecolors='black')
plt.title("Predicted values with k=5", fontsize=20)

plt.subplot(1,2,2)
plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_1, marker= '*', s=100, edgecolors='black')
plt.title("Predicted values with k=1", fontsize=20)
plt.show()

Visualize Predictions KNN


How to find the best k value to implement KNN
1. k=1: The model is too narrow and not properly generalized. It also has a high
sensitivity to noise. The model predicts the training data points with a high degree of
accuracy, but it is a poor predictor on fresh, previously unseen data points. As a
result, we're likely to have an overfit model.
2. k=100: The model is overly broad and unreliable on both the train and test sets.
Underfitting is the term for this circumstance.
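
The original file does not show how to search for a good k between these two extremes; a common approach is cross-validation over a range of k values, sketched below (reusing the X_train and y_train from the earlier steps):

import numpy as np
from sklearn.model_selection import cross_val_score

# Try a range of k values and keep the one with the best cross-validated accuracy
k_values = range(1, 31)
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(scores))]
print("Best k:", best_k, "with CV accuracy:", max(scores))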
Limitations of KNN Algorithm

KNN is a straightforward algorithm to grasp. It does not rely on any internal machine
learning model to generate predictions: it simply stores the labelled training examples and
classifies a new point from the categories of its nearest neighbours. This means a new
category can be added quickly, just by adding labelled examples of it, without retraining a model.

The drawback of this simplicity is that KNN cannot anticipate classes it has never seen (like
new diseases), since it has no way of knowing what the prevalence of such a rare item would
be in the population.

Although KNN achieves high accuracy on the testing set, it is slower and more expensive in
terms of time and memory. It needs a considerable amount of memory in order to store the
whole training dataset for prediction. Furthermore, because Euclidean distance is very
sensitive to magnitudes, characteristics in the dataset with large magnitudes will always
outweigh those with small magnitudes.

Finally, considering all we've discussed so far, we should keep in mind that KNN isn't ideal
for high-dimensional datasets.

Practical – 8
Aim: Deploy support vector machine, Apriori Algorithm.
Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or
nonlinear classification, regression, and even outlier detection tasks. SVMs can be used for
a variety of tasks, such as text classification, image classification, spam
detection, handwriting identification, gene expression analysis, face detection, and anomaly
detection. SVMs are adaptable and efficient in a variety of applications because they can
manage high-dimensional data and nonlinear relationships.
SVM algorithms are very effective as we try to find the maximum separating hyperplane
between the different classes available in the target feature.
Let's consider two independent variables x1, x2, and one dependent variable which is either
a blue circle or a red circle.

Linearly Separable Data points

From the figure above it’s very clear that there are multiple lines (our hyperplane here is a
line because we are considering only two input features x1, x2) that segregate our data
points or do a classification between red and blue circles. So how do we choose the best
line or in general the best hyperplane that segregates our data points?

How does SVM work?

One reasonable choice as the best hyperplane is the one that represents the largest
separation or margin between the two classes.

Multiple hyperplanes separate the data from two classes

So we choose the hyperplane whose distance from it to the nearest data point on each side
is maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2. Let’s consider a
scenario like shown below

Selecting hyperplane for data with outlier

Here we have one blue ball in the boundary of the red ball. So how does SVM classify the
data? It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls. The
SVM algorithm has the characteristics to ignore the outlier and finds the best hyperplane
that maximizes the margin. SVM is robust to outliers.

Hyperplane which is the most optimized one

So for this type of data, what SVM does is find the maximum margin as it did with the
previous data sets, and along with that it adds a penalty each time a point crosses the margin.
The margins in these cases are called soft margins. When there is a soft margin on the data
set, the SVM tries to minimize (1/margin) + λ(∑penalty). Hinge loss is a commonly
used penalty: if there are no violations there is no hinge loss, and if there are violations the
hinge loss is proportional to the distance of the violation.
Till now, we were talking about linearly separable data (the group of blue balls and red balls
is separable by a straight/linear line). What should we do if the data are not linearly separable?

Original 1D dataset for classification

Say our data is as shown in the figure above. SVM solves this by creating a new variable
using a kernel. We take a point xi on the line and create a new variable yi as a function
of its distance from the origin o. If we plot this, we get something like the figure shown below.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates a new variable is referred to as a kernel.
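
A tiny sketch of this idea, with hypothetical 1D points and the squared distance from the origin used as the new kernel-style feature, might look like:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 1D data: two classes that are not linearly separable on a line
x = np.array([-4, -3, -2, 2, 3, 4, -1, -0.5, 0, 0.5, 1], dtype=float)
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# New variable as a function of distance from the origin (here simply x squared)
y_new = x ** 2

# In the (x, x^2) plane the two classes can be separated by a horizontal line
plt.scatter(x, y_new, c=labels, edgecolors="k")
plt.xlabel("x")
plt.ylabel("x squared (distance-based feature)")
plt.show()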

Types of Support Vector Machine

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
 Linear SVM: Linear SVMs use a linear decision boundary to separate the data
points of different classes. When the data can be precisely linearly separated,
linear SVMs are very suitable. This means that a single straight line (in 2D) or a
hyperplane (in higher dimensions) can entirely divide the data points into their
respective classes. A hyperplane that maximizes the margin between the classes
is the decision boundary.
 Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot
be separated into two classes by a straight line (in the case of 2D). By using
kernel functions, nonlinear SVMs can handle nonlinearly separable data. The
original input data is transformed by these kernel functions into a higher-
dimensional feature space, where the data points can be linearly separated. A
linear SVM is used to locate a nonlinear decision boundary in this modified
space.
SVM implementation in Python
Predict whether a cancer is benign or malignant: using historical data about patients diagnosed
with cancer, together with the given independent attributes, enables doctors to differentiate
malignant cases from benign ones.
Steps
 Load the breast cancer dataset from sklearn.datasets

 Separate input features and target variables.
 Build and train the SVM classifier using an RBF kernel.
 Plot the scatter plot of the input features.
 Plot the decision boundary.

# Load the important packages
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

# Load the dataset
cancer = load_breast_cancer()
X = cancer.data[:, :2]
y = cancer.target

# Build the model
svm = SVC(kernel="rbf", gamma=0.5, C=1.0)

# Train the model
svm.fit(X, y)

# Plot the decision boundary
DecisionBoundaryDisplay.from_estimator(
    svm,
    X,
    response_method="predict",
    cmap=plt.cm.Spectral,
    alpha=0.8,
    xlabel=cancer.feature_names[0],
    ylabel=cancer.feature_names[1],
)

# Scatter plot
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolors="k")
plt.show()

Output

Apriori Algorithm
Apriori Algorithm is a Machine Learning algorithm which is used to gain insight into the
structured relationships between different items involved. The most prominent practical
application of the algorithm is to recommend products based on the products already
present in the user’s cart. Walmart especially has made great use of the algorithm in
suggesting products to its users.
Step 1: Importing the required libraries
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
Step 2: Loading and exploring the data
# Changing the working location to the location of the file
# (in a Jupyter/IPython session, use the %cd magic)
%cd C:\Users\Dev\Desktop\Kaggle\Apriori Algorithm

# Loading the Data
data = pd.read_excel('Online_Retail.xlsx')
data.head()
Output

# Exploring the columns of the data
data.columns
Output

# Exploring the different regions of transactions
data.Country.unique()
Output

Step 3: Cleaning the Data

# Stripping extra spaces in the description
data['Description'] = data['Description'].str.strip()

# Dropping the rows without any invoice number
data.dropna(axis = 0, subset =['InvoiceNo'], inplace = True)
data['InvoiceNo'] = data['InvoiceNo'].astype('str')

# Dropping all transactions which were done on credit
data = data[~data['InvoiceNo'].str.contains('C')]
Step 4: Splitting the data according to the region of transaction

# Transactions done in France
basket_France = (data[data['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Transactions done in the United Kingdom
basket_UK = (data[data['Country'] =="United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Transactions done in Portugal
basket_Por = (data[data['Country'] =="Portugal"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Transactions done in Sweden
basket_Sweden = (data[data['Country'] =="Sweden"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
Step 5: Hot encoding the Data

# Defining the hot encoding function to make the data suitable
# for the concerned libraries
def hot_encode(x):
    if (x <= 0):
        return 0
    if (x >= 1):
        return 1

# Encoding the datasets
basket_encoded = basket_France.applymap(hot_encode)
basket_France = basket_encoded

basket_encoded = basket_UK.applymap(hot_encode)
basket_UK = basket_encoded

basket_encoded = basket_Por.applymap(hot_encode)
basket_Por = basket_encoded

basket_encoded = basket_Sweden.applymap(hot_encode)
basket_Sweden = basket_encoded
Step 6: Building the models and analyzing the results
a) France:
# Building the model
frq_items = apriori(basket_France, min_support = 0.05, use_colnames = True)

# Collecting the inferred rules in a dataframe
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())

From the above output, it can be seen that paper cups and paper plates are bought
together in France. This is because the French have a culture of having a get-together with
their friends and family at least once a week. Also, since the French government has banned
the use of plastic in the country, people have to purchase paper-based alternatives.
b) United Kingdom:
frq_items = apriori(basket_UK, min_support = 0.01, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())

If the rules for British transactions are analyzed a little deeper, it is seen that the British
people buy different colored tea-plates together. A reason behind this may be because
typically the British enjoy tea very much and often collect different colored tea-plates for
different occasions.
c) Portugal:
frq_items = apriori(basket_Por, min_support = 0.05, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())

On analyzing the association rules for Portuguese transactions, it is observed that Tiffin sets
(Knick Knack Tins) and color pencils are bought together. These two products typically belong
to a primary-school-going kid, who needs them to carry lunch and for creative work
respectively, so it logically makes sense for them to be paired together.
d) Sweden:

frq_items = apriori(basket_Sweden, min_support = 0.05, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())
