Mlfile
Practical – 1
Aim: Implementing Data Preprocessing.
Data preprocessing is a crucial step in machine learning pipelines to prepare raw data
for model training.
It involves cleaning, transforming, and organizing the data to make it suitable for
machine learning algorithms.
a. Handling Missing Values:
- Identify missing values in the dataset.
- Options for handling missing values include (a sketch of two of these options follows below):
  - Removing rows or columns with missing values.
  - Imputing missing values using mean, median, or mode.
  - Using advanced techniques like K-nearest neighbors (KNN) imputation or predictive modeling.
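A minimal sketch of dropping versus imputing missing values with scikit-learn's SimpleImputer, using a small illustrative DataFrame rather than the practical's actual dataset:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data with missing values (not the practical's dataset)
df = pd.DataFrame({"Age": [22, np.nan, 35, 58], "Fare": [7.25, 71.28, np.nan, 8.05]})

# Option 1: drop rows that contain missing values
df_dropped = df.dropna()

# Option 2: impute missing values with the column mean
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)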
c. Scaling and Normalization:
- Scale numerical features to a similar range to prevent features with large values from dominating the model.
- Techniques include (see the sketch below):
  - Standardization: Scale features to have mean=0 and variance=1.
  - Min-max scaling: Scale features to a specified range, typically [0, 1].
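A minimal sketch of both techniques using scikit-learn's StandardScaler and MinMaxScaler on illustrative values:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # illustrative values

# Standardization: each column is rescaled to mean=0, variance=1
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
print(X_std)
print(X_minmax)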
d. Feature Engineering:
- Create new features from existing ones to capture additional information or improve model performance.
- Techniques include:
  - Polynomial features: Generate interaction terms and polynomial features (see the sketch below).
  - Feature scaling: Scale features to have similar magnitudes.
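A minimal sketch of generating degree-2 polynomial and interaction features with scikit-learn's PolynomialFeatures, on two illustrative features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0], [4.0, 5.0]])  # two illustrative features

# Degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())
print(X_poly)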
e. Handling Imbalanced Classes:
- Address class imbalance in classification tasks where one class significantly outnumbers the other.
- Techniques include:
  - Resampling: Oversample the minority class or undersample the majority class.
  - Algorithmic techniques: Use algorithms that handle class imbalance well, such as ensemble methods.
Handle imbalanced classes using RandomOverSampler or RandomUnderSampler from imbalanced-learn, as sketched below.
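A minimal sketch on a synthetic imbalanced dataset; the sampler settings and the roughly 90/10 class split are illustrative assumptions:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset (roughly 90% / 10% class split)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Original:", Counter(y))

# Oversample the minority class
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("Oversampled:", Counter(y_over))

# Undersample the majority class
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("Undersampled:", Counter(y_under))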
Feature scaling
Standardisation
Dataset
Categorical data
Splitting the dataset into training and test set
Feature scaling
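The code for the steps labelled above (loading the dataset, encoding categorical data, splitting into training and test sets, and feature scaling/standardisation) appears as screenshots in the original file. A minimal sketch, assuming a generic CSV with a categorical 'Country' column and a 'Purchased' target column:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical dataset; the actual file used in the practical is not shown
df = pd.read_csv("data.csv")

# Encode the categorical column into integer labels
df["Country"] = LabelEncoder().fit_transform(df["Country"])

# Separate features and target, then split into training and test sets
X = df.drop("Purchased", axis=1)
y = df["Purchased"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardise the features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)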
Practical – 2
Aim: Deploying a Simple Linear Regression Model.
Objective: In this practical, we will deploy a simple linear regression model using Python
and Flask. We will follow a step-by-step approach to prepare the data, train the model, save it
to disk, create a Flask web application for deployment, and test the deployed model.
Step 1: Install Required Libraries
Ensure you have Python installed on your system. Install the required libraries using pip:
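The exact package list is not shown in the extracted text; a reasonable set for this practical, assuming Flask, scikit-learn and joblib are used as described, would be installed with:

pip install flask scikit-learn pandas numpy joblib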
Step 6: Evaluate the Model
Step 7: Save the Model
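The code for the intermediate steps appears as screenshots in the original file. A minimal end-to-end sketch, assuming a single-feature synthetic dataset, joblib for persistence, and an illustrative /predict route (the actual data, file names and route names used in the practical are not shown):

import numpy as np
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative data: y is roughly 3x + 4 with some noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = 3 * X.ravel() + 4 + rng.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train and evaluate the model (Step 6)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))

# Save the model to disk (Step 7)
joblib.dump(model, "linear_model.pkl")

A matching minimal Flask application (app.py) that loads the saved model and serves predictions could look like this:

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("linear_model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    value = float(request.json["x"])              # expects JSON such as {"x": 5.0}
    prediction = model.predict([[value]])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(debug=True)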
Conclusion: In this practical, we successfully deployed a simple linear regression model
using Python and Flask. We trained the model, saved it to disk, created a Flask web
application for deployment, and tested the deployed model using sample input data. This
practical demonstrates the process of deploying machine learning models for real-world
applications.
Practical – 3
Aim: Implementing Multiple Linear Regression.
Objective: In this practical, we will implement multiple linear regression using Python and
Scikit-learn. We'll follow a step-by-step approach to prepare the data, train the model,
evaluate its performance, and make predictions.
You can now make predictions using the trained model. Provide new input data in the same
format as the training data and use the predict() method of the model.
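The data preparation and training code for this practical appears as screenshots in the original file. A minimal sketch, assuming a synthetic three-feature regression dataset as a stand-in:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data with three predictors (stand-in for the practical's dataset)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train the multiple linear regression model
model = LinearRegression().fit(X_train, y_train)

# Evaluate its performance on the test set
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))

# Predict for new input data given in the same three-feature format
new_data = np.array([[0.5, -1.2, 0.3]])
print("Prediction:", model.predict(new_data))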
Conclusion: In this practical, we implemented multiple linear regression using Python and
Scikit-learn. We prepared the data, trained the model, evaluated its performance, and made
predictions. This practical demonstrates the process of implementing and using multiple
linear regression for predictive modelling tasks.
Practical – 4
Aim: Implement Decision Tree.
Introduction: In this practical, we will learn how to implement a Decision Tree classifier, a
popular supervised learning algorithm used for classification tasks. We will use Python to
build a Decision Tree classifier from scratch and apply it to a sample dataset.
- Import libraries such as NumPy and pandas for data manipulation, and scikit-learn for implementing the Decision Tree classifier.
- Write a Python function to calculate the Gini impurity or entropy for a given dataset.
- Implement a function to find the best split for a dataset based on the calculated impurity measure.
- Recursively build the decision tree by splitting the dataset at each node based on the best split.
- Visualize the trained Decision Tree classifier using libraries such as Graphviz or Matplotlib.
- Interpret the decision tree structure and explain how it makes predictions (a sketch of these steps appears after this list).
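A minimal sketch of the impurity and best-split steps, followed by fitting and plotting a scikit-learn DecisionTreeClassifier; the Iris data is used here as an assumed stand-in for the practical's sample dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

def gini_impurity(labels):
    # Gini = 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Exhaustively search every feature/threshold pair for the lowest weighted Gini
    best = (None, None, float("inf"))
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left, right = y[X[:, feature] <= threshold], y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / len(y)
            if score < best[2]:
                best = (feature, threshold, score)
    return best

X, y = load_iris(return_X_y=True)
print("Root impurity:", gini_impurity(y))
print("Best root split (feature, threshold, weighted Gini):", best_split(X, y))

# The recursive tree building is left to scikit-learn here; fit and visualize the tree
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
plot_tree(clf, filled=True)
plt.show()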
7. Conclusion:
- Summarize the key takeaways from implementing the Decision Tree classifier.
- Reflect on the strengths and limitations of Decision Trees and their suitability for different types of datasets.
- Discuss potential applications and real-world use cases where Decision Trees can be effective.
Practical – 5
Aim: Deploy Random Forest classification.
Random forest
A random forest is a supervised machine learning algorithm used to solve regression and classification problems. As the name suggests, just as a forest contains multiple trees, a random forest contains multiple decision trees, and more trees make a more robust forest. Different decision trees are built on samples of the data, the prediction from each of them is collected, and the final decision is selected by voting. Random forest reduces the drawbacks of decision trees by reducing overfitting and increasing precision. This type of technique is called ensemble learning. The motivation behind ensemble learning is the belief that a group of experts working together is more likely to be accurate than individual experts.
Working of random forest:
Before looking at how a random forest works, let's briefly understand ensemble learning, because random forest is based on it. "Ensemble" literally means a group. Ensemble learning is a technique for combining the outputs of different models, which are called weak learners. Rather than relying on an individual tree, different trees make predictions and the output is selected by majority voting (or by model averaging). Suppose there are 5 models, of which 3 have predicted YES and 2 have predicted NO; then the final prediction is taken as YES.
About the dataset
The dataset used here is 'titanic.csv', which is freely available on Kaggle.com. It includes features such as PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.
1. Importing Libraries and reading dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("titanic.csv")
2. Data preprocessing
df.drop(['Cabin','PassengerId','Name','Ticket'],axis=1,inplace=True)
df = df.fillna(0)
3. Handling categorical data
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Sex']=le.fit_transform(df['Sex'])
df['Embarked']=le.fit_transform(df['Embarked'])
df
4. Dependent and independent variables
# Putting feature variable to X
X = df.drop('Survived',axis=1)
# Putting response variable to y
y = df['Survived']
7. n_jobs(int, default=None)
It is the number of jobs to run in parallel. This is used when you have the capability to do parallel processing: n_jobs=-1 means using all processors, while n_jobs=1 uses only one processor.
8. random_state(int, RandomState instance or None, default=None)
It controls both the randomness of the bootstrap samples used when building trees and the sampling of the features considered at each split.
9. verbose(int, default=0)
Controls the verbosity when fitting and predicting. It gives you all the run-time information.
You can tune these hyperparameters by changing their values.
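The train-test split and the creation of the random forest classifier fall on a page that is not reproduced above. A minimal sketch, assuming an 80/20 split and illustrative hyperparameter values, using the X and y defined earlier:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split the Titanic features and target defined above (the 80/20 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the random forest (n_estimators and random_state values are assumptions)
classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
classifier.fit(X_train, y_train)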
7. Predicting test cases using random forest
# Predicting the test set results
Pred = classifier.predict(X_test)
print(Pred)
Output:
[0 1 1 0 1 1 1 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1
1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 1
0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1]
8. Checking the accuracy score
from sklearn.metrics import classification_report
rand_score = classifier.score(X_test, y_test)
# Equivalently: from sklearn.metrics import accuracy_score; accuracy_score(y_test, Pred)
classification_report_rf = classification_report(y_test, Pred)
print("Accuracy score:", rand_score)
Output:
Accuracy score: 0.8268156424581006
Practical – 6
Aim: Implement and Simulate the Naive Bayes Algorithm.
Introduction: In this practical, we will learn how to implement and simulate the Naive Bayes algorithm, a popular classification algorithm based on Bayes' theorem and the assumption of feature independence. We will use Python to build a simple Naive Bayes classifier and apply it to a sample dataset.
- Import libraries such as NumPy and pandas for data manipulation, and scikit-learn for implementing the Naive Bayes classifier.
- Write a Python function to calculate the probability of each class using the Naive Bayes algorithm.
- Implement functions to calculate the likelihood of each feature given the class and the prior probability of each class.
- Combine these probabilities using Bayes' theorem to make predictions.
- Experiment with different variations of the Naive Bayes algorithm (e.g., Gaussian Naive Bayes, Multinomial Naive Bayes, etc.) and observe the impact on performance.
- Analyse the results and discuss any insights or observations (a sketch of these steps appears after this list).
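A minimal sketch using scikit-learn's GaussianNB, with the Iris data as an assumed stand-in for the practical's sample dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a sample dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian Naive Bayes: class priors and per-feature Gaussian likelihoods are
# estimated from the training data and combined via Bayes' theorem at prediction time
model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Variations such as MultinomialNB or BernoulliNB can be swapped in for count or binary features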
Practical – 7 ( a )
Aim: Deploy the K-means clustering algorithm.
- Serialize the trained K-means model using libraries like pickle or joblib, and save the model to disk for later use.
- Load the saved K-means model from disk when needed for deployment.
- When new data becomes available, load the trained K-means model and apply it to the new data to assign cluster labels using the predict method.
- Optionally, visualize the cluster assignments of the new data along with the centroids of each cluster, using scatter plots or other visualization techniques to display the clustered data points (a sketch of these steps appears after this list).
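A minimal sketch, assuming illustrative make_blobs data, three clusters, joblib for persistence, and a hypothetical kmeans_model.pkl file name:

import numpy as np
import joblib
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Train K-means on illustrative data (stand-in for the practical's dataset)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Serialize the trained model to disk and load it back when needed
joblib.dump(kmeans, "kmeans_model.pkl")
model = joblib.load("kmeans_model.pkl")

# Assign cluster labels to new data with the predict method
new_data = np.array([[0.0, 2.0], [3.0, -1.0]])
print("Cluster labels for new data:", model.predict(new_data))

# Visualize the clustered points, the centroids, and the new data
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=15, alpha=0.5)
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], c="red", marker="x", s=200)
plt.scatter(new_data[:, 0], new_data[:, 1], c="black", marker="*", s=200)
plt.show()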
6. Conclusion:
Practical – 7 ( b )
Aim: Implement the K-nearest neighbors algorithm.
Introduction: In this practical, we will learn how to implement the K-nearest neighbors
(KNN) algorithm, a simple yet effective supervised learning algorithm for classification and
regression tasks. We will use Python to build a KNN classifier from scratch and apply it to a
sample dataset.
Let’s now get into the implementation of KNN in Python. We’ll go over the steps to help you
break the code down and make better sense of it.
import numpy as np
import pandas as pd
Scikit-learn has a lot of tools for creating synthetic datasets, which are great for testing machine learning algorithms. We are going to use the make_blobs method.
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples = 500, n_features = 2, centers = 4, cluster_std = 1.5, random_state = 4)
This code generates a dataset of 500 samples separated into four classes, with two features. Using the corresponding parameters, you can easily change the number of samples, features, and classes, as well as the spread (cluster_std) of each cluster.
4. Splitting Data into Training and Testing Datasets
It is critical to partition a dataset into train and test sets for every supervised machine learning method. We first train the model and then test it on a portion of the dataset it has not seen; if we don't separate the data, we are simply testing the model on data it already knows. Using the train_test_split method, we can easily split the data. With the train_size and test_size options, we can determine how much of the original data is used for the train and test sets, respectively. The default split is 75% for the train set and 25% for the test set.
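The split itself appears as a screenshot in the original file; a minimal sketch using the default 75/25 split on the make_blobs data above (the random_state value is an assumption):

from sklearn.model_selection import train_test_split

# Default split: 75% of the samples for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 4)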
After that, we build kNN classifier objects. We create two classifiers with k values of 1 and 5 to demonstrate the relevance of the k value, and then train them on the train set. The k value is chosen with the n_neighbors argument; it does not need to be specified explicitly, because the default value is 5.
from sklearn.neighbors import KNeighborsClassifier

knn5 = KNeighborsClassifier(n_neighbors = 5)
knn1 = KNeighborsClassifier(n_neighbors = 1)
6. Predictions for the KNN Classifiers
Then we predict the target values for the test set and compare them to the actual values.
knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)
y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)
7. Predict Accuracy for both k values
from sklearn.metrics import accuracy_score
print("Accuracy with k=5", accuracy_score(y_test, y_pred_5)*100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1)*100)
Let’s view the test set and predicted values with k=5 and k=1 to see the influence of k values.
import matplotlib.pyplot as plt

plt.figure(figsize = (15,5))
plt.subplot(1,2,1)
plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_5, marker= '*', s=100,edgecolors='black')
plt.title("Predicted values with k=5", fontsize=20)
plt.subplot(1,2,2)
plt.scatter(X_test[:,0], X_test[:,1], c=y_pred_1, marker= '*', s=100,edgecolors='black')
plt.title("Predicted values with k=1", fontsize=20)
plt.show()
KNN is a straightforward algorithm to grasp. It does not fit an internal machine learning model to generate predictions; it simply stores the training examples and classifies a new point by looking at its nearest neighbours, so it works with any number of classes and a new class can be added just by adding labelled examples for it. The drawback of this simplicity is that KNN cannot anticipate genuinely unusual cases (like a new disease), since it has no way of knowing how prevalent such a rare item would be in the population it was trained on.
Although KNN achieves high accuracy on the test set, it is slower and more expensive in terms of time and memory: it needs to store the whole training dataset in order to make predictions. Furthermore, because Euclidean distance is very sensitive to magnitudes, features with large magnitudes will always outweigh those with small magnitudes unless the data is scaled. Finally, considering all we have discussed so far, we should keep in mind that KNN is not ideal for high-dimensional datasets.
Practical – 8
Aim: Deploy support vector machine, Apriori Algorithm.
Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or
nonlinear classification, regression, and even outlier detection tasks. SVMs can be used for
a variety of tasks, such as text classification, image classification, spam
detection, handwriting identification, gene expression analysis, face detection, and anomaly
detection. SVMs are adaptable and efficient in a variety of applications because they can
manage high-dimensional data and nonlinear relationships.
SVM algorithms are very effective as we try to find the maximum separating hyperplane
between the different classes available in the target feature.
Let's consider two independent variables x1 and x2, and one dependent variable which is either a blue circle or a red circle.
From the figure above it is clear that there are multiple lines (our hyperplane here is a line because we are considering only the two input features x1 and x2) that segregate our data points, i.e. perform a classification between the red and blue circles. So how do we choose the best line, or in general the best hyperplane, that segregates our data points?
One reasonable choice as the best hyperplane is the one that represents the largest
separation or margin between the two classes.
Multiple hyperplanes separate the data from two classes
So we choose the hyperplane whose distance from it to the nearest data point on each side
is maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2. Now let's consider a scenario like the one shown below.
Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the data? It's simple: the blue ball among the red ones is an outlier of the blue class. The SVM algorithm has the ability to ignore outliers and still find the hyperplane that maximizes the margin, so SVM is robust to outliers.
Hyperplane which is the most optimized one
For this kind of data, SVM finds the maximum margin as it did with the previous data sets, and in addition it adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. With a soft margin, the SVM tries to minimize an objective of the form (1/margin) + λ·Σ(penalty). Hinge loss is a commonly used penalty: if there is no violation there is no hinge loss, and if there is a violation the hinge loss is proportional to the distance of the violation.
Till now, we were talking about linearly separable data (the group of blue balls and red balls can be separated by a straight line). What do we do if the data are not linearly separable? Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. We take a point xi on the line and create a new variable yi as a function of its distance from the origin o; if we plot this, we get something like the figure shown below.
Mapping 1D data to 2D to make the two classes separable
In this case, the new variable y is created as a function of distance from the origin. A non-linear function that creates such a new variable is referred to as a kernel.
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
- Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of different classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data points into their respective classes. A hyperplane that maximizes the margin between the classes is the decision boundary.
- Non-Linear SVM: Non-linear SVMs can be used to classify data when it cannot be separated into two classes by a straight line (in the case of 2D). By using kernel functions, non-linear SVMs can handle non-linearly separable data. The original input data is transformed by these kernel functions into a higher-dimensional feature space, where the data points can be linearly separated. A linear SVM is then used to locate a non-linear decision boundary in this modified space.
SVM implementation in Python
Predict whether a tumour is benign or malignant. Historical data about patients diagnosed with cancer, given as independent attributes, enables doctors (and the model) to differentiate malignant cases from benign ones.
Steps:
- Load the breast cancer dataset from sklearn.datasets.
- Separate the input features and the target variable.
- Build and train the SVM classifier using an RBF kernel.
- Plot a scatter plot of the input features.
- Plot the decision boundary.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.inspection import DecisionBoundaryDisplay

# Load the dataset and keep the first two features so the boundary can be drawn in 2D
cancer = load_breast_cancer()
X = cancer.data[:, :2]
y = cancer.target

# Build and train an SVM classifier with an RBF kernel (hyperparameter values are assumptions)
svm = SVC(kernel="rbf", gamma="auto")
svm.fit(X, y)

# Plot the decision boundary
DecisionBoundaryDisplay.from_estimator(
    svm,
    X,
    response_method="predict",
    cmap=plt.cm.Spectral,
    alpha=0.8,
    xlabel=cancer.feature_names[0],
    ylabel=cancer.feature_names[1],
)

# Scatter plot of the input features coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolors="k")
plt.show()
Output
Apriori Algorithm
Apriori Algorithm is a Machine Learning algorithm which is used to gain insight into the
structured relationships between different items involved. The most prominent practical
application of the algorithm is to recommend products based on the products already
present in the user’s cart. Walmart especially has made great use of the algorithm in
suggesting products to its users.
Step 1: Importing the required libraries
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
Step 2: Loading and exploring the data
# Changing the working location to the location of the file
cd C:\Users\Dev\Desktop\Kaggle\Apriori Algorithm
# Reading the transactions data (the file name is an assumption; the original tutorial uses the Online Retail dataset)
data = pd.read_excel("Online_Retail.xlsx")
# Exploring the columns of the data
data.columns
Output
# The basket definitions for France, the UK and Portugal, and the hot_encode
# function, fall on pages not reproduced here; each basket follows this pattern.
basket_Sweden = (data[data['Country'] == "Sweden"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
def hot_encode(x):
    return 1 if x >= 1 else 0   # hot-encode quantities into 0/1 values
basket_encoded = basket_France.applymap(hot_encode)
basket_France = basket_encoded
basket_encoded = basket_UK.applymap(hot_encode)
basket_UK = basket_encoded
basket_encoded = basket_Por.applymap(hot_encode)
basket_Por = basket_encoded
basket_encoded = basket_Sweden.applymap(hot_encode)
basket_Sweden = basket_encoded
Step 6: Building the models and analyzing the results
a) France:
# Building the model
frq_items = apriori(basket_France, min_support = 0.05, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())
From the above output, it can be seen that paper cups and paper plates are bought together in France. This is because the French have a culture of having a get-together with their friends and family at least once a week. Also, since the French government has banned the use of plastic in the country, people have to purchase paper-based alternatives.
b) United Kingdom:
frq_items = apriori(basket_UK, min_support = 0.01, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())
If the rules for British transactions are analysed a little more deeply, it is seen that British people buy different coloured tea plates together. A reason behind this may be that the British typically enjoy tea very much and often collect different coloured tea plates for different occasions.
c) Portugal:
frq_items = apriori(basket_Por, min_support = 0.05, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())
On analysing the association rules for Portuguese transactions, it is observed that Tiffin sets (Knick Knack Tins) and color pencils are bought together. These two products typically belong to a primary-school-going kid, who needs them to carry lunch and for creative work respectively, so it logically makes sense for them to be paired together.
d) Sweden:
frq_items = apriori(basket_Sweden, min_support = 0.05, use_colnames = True)
rules = association_rules(frq_items, metric ="lift", min_threshold = 1)
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False])
print(rules.head())