Lab Manual
Program: BS(AI/CS/SE)
Lab No. 01
Google Colab:
Google Colab, short for Google Colaboratory, is a popular cloud-based platform that allows you
to write and execute Python code in a web-based interactive environment. It's a great tool for data
analysis, machine learning, and collaborative coding projects. In this introduction, we'll cover the
basics of Google Colab, Python libraries, and working with data.
Working with Data:
In data analysis and machine learning, working with data is fundamental. Here's a simplified
overview of the data workflow (a short end-to-end sketch follows the list):
Data Collection: Acquire data from various sources, such as files, databases, APIs, or web
scraping.
Data Preprocessing: Clean and prepare the data by handling missing values, scaling,
encoding categorical variables, and more.
Exploratory Data Analysis (EDA): Use libraries like Pandas and visualization tools to
understand the data's characteristics, distributions, and relationships.
Data Modeling: Build, train, and evaluate machine learning models using libraries like
Scikit-Learn, TensorFlow, or PyTorch.
Model Evaluation: Assess the model's performance using metrics like accuracy, precision,
recall, or custom evaluation criteria.
Deployment: If the model is satisfactory, deploy it to production for real-world use.
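A minimal end-to-end sketch of this workflow, using scikit-learn's built-in Iris dataset and a
logistic regression model purely as illustrative stand-ins for a real data source and model:

# Data collection: load a ready-made dataset (stand-in for a real source)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Data preprocessing: scale the features
X = StandardScaler().fit_transform(X)

# Split the data so the model can be evaluated on unseen samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Data modeling: build and train a simple classifier
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Model evaluation: accuracy on the held-out test set
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))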
Lab Task
Write the details of the following Python libraries (a quick import-and-version check is sketched after the list):
1. Numpy
2. Pandas
3. Matplotlib
4. Scikit-Learn
5. Seaborn
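As a quick sanity check before writing about each library, you can import all five in a Colab
notebook and print their versions; they come pre-installed in Colab:

# Import the five libraries and print their versions
import numpy as np
import pandas as pd
import matplotlib
import sklearn
import seaborn as sns

for name, module in [("NumPy", np), ("Pandas", pd), ("Matplotlib", matplotlib),
                     ("Scikit-Learn", sklearn), ("Seaborn", sns)]:
    print(name, module.__version__)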
Lab No. 02
Supervised Learning:
Supervised learning is learning from labeled data: the value or result that we want to predict is
present in the training data. The value we want to predict is known as the Target, Dependent
Variable, or Response Variable. All the other columns in the dataset are known as Features,
Predictor Variables, or Independent Variables.
Supervised Learning is classified into two categories:
Classification: Here our target variable consists of categories.
Regression: Here our target variable is continuous, and we usually try to find the line or
curve that best fits the data.
As we have understood, to carry out supervised learning we need labeled data. How can we get
labeled data? There are various ways:
Historical labeled Data
Experiment to get data: We can perform experiments to generate labeled data like A/B
Testing.
Crowd-sourcing
Now it’s time to understand algorithms that can be used to solve supervised machine learning
problems. In this lab, we will be using the popular scikit-learn package.
Lab Task:
Implement the K-Nearest Neighbors algorithm for the Iris dataset. The implementation should
involve the following steps (a sketch follows the list):
1. Import the k-nearest neighbors algorithm from the scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Generate a k-NN model using a neighbors value.
5. Train (fit) the model on the training data.
6. Predict on the test data.
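A minimal sketch of these steps; the choice of n_neighbors=5 and the 70/30 split are illustrative
assumptions, not requirements:

# Step 1: import the k-NN classifier from scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 2: create feature and target variables
X, y = load_iris(return_X_y=True)

# Step 3: split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: generate a k-NN model using a neighbors value
knn = KNeighborsClassifier(n_neighbors=5)

# Step 5: train (fit) the model on the training data
knn.fit(X_train, y_train)

# Step 6: predict on the unseen test data
print(knn.predict(X_test))
print("Test accuracy:", knn.score(X_test, y_test))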
Lab No. 03
Naive Bayes Classification:
Naive Bayes is one of the most straightforward and fastest classification algorithms, and it is
suitable for large chunks of data. The Naive Bayes classifier is used successfully in various
applications such as spam filtering, text classification, sentiment analysis, and recommender
systems. It uses Bayes' theorem of probability for the prediction of an unknown class.
Classification Workflow
Whenever you perform classification, the first step is to understand the problem and identify
potential features and the label. Features are those characteristics or attributes which affect the results
of the label. For example, in the case of a loan distribution, bank managers identify the customer’s
occupation, income, age, location, previous loan history, transaction history, and credit score.
These characteristics are known as features that help the model classify customers.
Classification has two phases: a learning phase and an evaluation phase. In the learning phase,
the classifier trains its model on a given dataset, and in the evaluation phase, it tests the classifier's
performance. Performance is evaluated on the basis of various parameters such as accuracy, error,
precision, and recall.
What is Naive Bayes Classifier?
Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate, and reliable
algorithm, with high accuracy and speed on large datasets.
The Naive Bayes classifier assumes that the effect of a particular feature in a class is independent of
other features. For example, whether a loan applicant is desirable or not depends on his/her income,
previous loan and transaction history, age, and location. Even if these features are interdependent,
they are still considered independently. This assumption simplifies computation, and that's why it
is considered naive. This assumption is called class conditional independence.
Bayes' theorem relates the following quantities:
P(h|D) = P(D|h) P(h) / P(D)
P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior
probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence, or
the marginal probability of the data.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): the probability of data D given that the hypothesis h was true. This is known as the
likelihood.
How Does the Naive Bayes Classifier Work?
Let’s understand the working of Naive Bayes through an example. Given a record of weather
conditions and whether sports were played, you need to classify whether players will play or not
based on the weather condition. The Naive Bayes classifier calculates the probability of an event
in the following steps:
Step 1: Calculate the prior probability for the given class labels.
Step 2: Find the likelihood probability of each attribute for each class.
Step 3: Put these values into Bayes' formula and calculate the posterior probability.
Step 4: Assign the input to the class with the higher posterior probability.
To simplify the prior and likelihood calculations, you can use two kinds of tables: a frequency
table and likelihood tables. Both of these tables will help you to calculate the probabilities in
Bayes' formula. The frequency table contains the occurrence of labels for all features. There are
two likelihood tables: Likelihood Table 1 shows the prior probabilities of the labels, and
Likelihood Table 2 shows the likelihood of each feature value given each label.
Now suppose you want to calculate the probability of playing when the weather is overcast.
Probability of playing:
P(Yes | Overcast) = P(Overcast | Yes) P(Yes) / P (Overcast) .....................(1)
Calculate Prior Probabilities:
P(Overcast) = 4/14 = 0.29
P(Yes)= 9/14 = 0.64
Calculate the Likelihood:
P(Overcast | Yes) = 4/9 = 0.44
Put the prior and likelihood probabilities into equation (1):
P(Yes | Overcast) = 0.44 * 0.64 / 0.29 = 0.98 (higher)
Similarly, you can calculate the probability of not playing:
Probability of not playing:
P(No | Overcast) = P(Overcast | No) P(No) / P (Overcast) .....................(2)
Calculate Prior Probabilities:
P(Overcast) = 4/14 = 0.29
P(No) = 5/14 = 0.36
Calculate the Likelihood:
P(Overcast | No) = 0/5 = 0
Put the prior and likelihood probabilities into equation (2):
P(No | Overcast) = 0 * 0.36 / 0.29 = 0
The probability of the 'Yes' class is higher, so you can conclude that if the weather is overcast,
players will play the sport.
Lab Task:
Generate synthetic data using scikit-learn, then train and evaluate the Gaussian Naive Bayes
algorithm. Use the following outline for the task (a sketch follows the list):
Create a dataset with six features, three classes, and 800 samples using the
`make_classification` function.
Use matplotlib.pyplot’s `scatter` function to visualize the dataset.
Split the dataset into training and testing sets for model evaluation.
Build a generic Gaussian Naive Bayes model and train it on the training dataset.
Predict the values for the test dataset and use them to calculate accuracy and F1 score.
Visualize the confusion matrix.
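One possible sketch of this task; the make_classification settings beyond six features, three
classes, and 800 samples (n_informative, n_redundant, random_state) and the 75/25 split are my
assumptions:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, ConfusionMatrixDisplay

# Create a dataset with six features, three classes, and 800 samples
X, y = make_classification(n_samples=800, n_features=6, n_informative=4,
                           n_redundant=2, n_classes=3, random_state=42)

# Visualize the first two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y, marker="*")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Build and train a Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict the test set and report accuracy and (weighted) F1 score
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred, average="weighted"))

# Visualize the confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()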
Lab No. 04
Introduction
SVM is a powerful supervised algorithm that works best on smaller but complex datasets. Support
Vector Machine, abbreviated as SVM, can be used for both regression and classification tasks, but
generally it works best in classification problems. SVMs were very famous around the time they
were created, during the 1990s, and they remain a go-to method for a high-performing algorithm
with a little tuning.
It is a supervised machine learning algorithm in which we try to find a hyperplane that best
separates the two classes. Note: don’t get confused between SVM and logistic regression. Both
algorithms try to find the best hyperplane, but the main difference is that logistic regression is a
probabilistic approach, whereas SVM is based on statistical approaches.
1. Linear SVM
When the data is perfectly linearly separable, we can use Linear SVM. Perfectly linearly separable
means that the data points can be classified into 2 classes by using a single straight line (if 2D).
2. Non-Linear SVM
When the data is not linearly separable, we can use Non-Linear SVM. This means that when the
data points cannot be separated into 2 classes by using a straight line (if 2D), we use advanced
techniques like kernel tricks to classify them. In most real-world applications we do not find
linearly separable data points; hence, we use the kernel trick to solve them.
Important Terms
Now let’s define two main terms which will be repeated again and again in this lab:
Support Vectors: These are the points that are closest to the hyperplane. A separating line
will be defined with the help of these data points.
Margin: It is the distance between the hyperplane and the observations closest to the
hyperplane (support vectors). In SVM, a large margin is considered a good margin. There are
two types of margins: hard margin and soft margin.
Lab Task:
Implement the Support Vector Machine (SVM) algorithm for the Iris dataset. The implementation
should involve the following steps (a sketch follows the list):
1. Import the SVM algorithm from the scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Generate an SVM model.
5. Train (fit) the model on the training data.
6. Predict on the test data.
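A minimal sketch of these steps; the linear kernel, C value, and the 70/30 split are illustrative
assumptions:

# Step 1: import the SVM classifier from scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Step 2: create feature and target variables
X, y = load_iris(return_X_y=True)

# Step 3: split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: generate an SVM model (a linear kernel is one possible choice)
svm = SVC(kernel="linear", C=1.0)

# Step 5: train (fit) the model on the training data
svm.fit(X_train, y_train)

# Step 6: predict on the unseen test data
print(svm.predict(X_test))
print("Test accuracy:", svm.score(X_test, y_test))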
Lab No. 05
K-means
K-means is an unsupervised learning method for clustering data points. The algorithm iteratively
divides data points into K clusters by minimizing the variance in each cluster. Here, we will show
you how to estimate the best value for K using the elbow method, then use K-means clustering to
group the data points into clusters.
Lab Task:
Implement the k-Means algorithm for any dataset. The implementation should involve the following
steps (a sketch follows the list):
1. Import the k-Means algorithm from the scikit-learn package.
2. Make clusters.
3. Implement the k-Means algorithm.
4. Visualize the final outcome.
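A minimal sketch using a small made-up 2-D dataset; the data values, the K range for the elbow
plot, and the final choice of K = 2 are illustrative assumptions:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = list(zip(x, y))

# Elbow method: plot inertia (within-cluster variance) for K = 1..10
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker="o")
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

# Fit the final model with the K suggested by the elbow (here K = 2)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)

# Visualize the final clusters
plt.scatter(x, y, c=kmeans.labels_)
plt.show()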
Lab No. 06
K-Medoids
K-Medoids is an unsupervised learning method for clustering data points. Like K-means, the
algorithm iteratively divides data points into K clusters, but instead of minimizing the variance
around a cluster mean, it minimizes the total distance between each point and its cluster's medoid,
which is always an actual data point. Here, we will implement K-Medoids from scratch and use it
to group the data points into clusters.
Lab Task:
Implement the k-Medoids algorithm for any dataset. The implementation should involve the following
steps (a sketch follows the list):
1. Import all the relevant libraries.
2. Make a k-Medoids class.
3. For a set of random numbers, implement the k-Medoids algorithm.
4. Visualize the final outcome.
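scikit-learn itself does not ship a k-medoids estimator, so the sketch below implements a small
k-medoids class from scratch with NumPy; the alternating update scheme, the random
initialization, and all parameter defaults are my assumptions:

import numpy as np
import matplotlib.pyplot as plt

class KMedoids:
    # Unlike k-means, each cluster center (medoid) is an actual data point
    def __init__(self, n_clusters=3, max_iter=100, random_state=0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.default_rng(self.random_state)
        # Pairwise Euclidean distances between all points
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        # Start from k distinct random points as the initial medoids
        medoids = rng.choice(len(X), self.n_clusters, replace=False)
        for _ in range(self.max_iter):
            labels = dist[:, medoids].argmin(axis=1)
            new_medoids = medoids.copy()
            for k in range(self.n_clusters):
                members = np.flatnonzero(labels == k)
                # The new medoid is the member with the smallest total
                # distance to the rest of its cluster
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[costs.argmin()]
            if np.array_equal(new_medoids, medoids):
                break  # medoids stopped moving: converged
            medoids = new_medoids
        self.medoid_indices_ = medoids
        self.labels_ = dist[:, medoids].argmin(axis=1)
        return self

# Cluster a set of random numbers and visualize the final outcome
X = np.random.default_rng(42).random((60, 2))
model = KMedoids(n_clusters=3).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=model.labels_)
plt.scatter(X[model.medoid_indices_, 0], X[model.medoid_indices_, 1],
            c="red", marker="x", s=100)
plt.show()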
Lab No. 07
Introduction
In data mining and statistics, hierarchical clustering analysis is a method of cluster analysis that
seeks to build a hierarchy of clusters, i.e., a tree-type structure based on the hierarchy.
In machine learning, clustering is the unsupervised learning technique that groups the data based
on the similarity between sets of data. There are different types of clustering algorithms in
machine learning:
Connectivity-based clustering: This type of clustering algorithm builds the cluster based on the
connectivity between the data points. Example: Hierarchical clustering.
Centroid-based clustering: This type of clustering algorithm forms clusters around the centroids
of the data points. Example: K-Means clustering, K-Mode clustering.
Distribution-based clustering: This type of clustering algorithm is modeled using statistical
distributions. It assumes that the data points in a cluster are generated from a particular probability
distribution, and the algorithm aims to estimate the parameters of the distribution to group similar
data points into clusters. Example: Gaussian Mixture Models (GMM).
Density-based clustering: This type of clustering algorithm groups together data points that are in
high-density regions and separates points in low-density regions. The basic idea is that it identifies
regions in the data space that have a high density of data points and groups those points together
into clusters. Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Lab Task:
Implement the Agglomerative Hierarchical Clustering algorithm for any dataset. The implementation
should involve the following steps (a sketch follows the list):
1. Import all the relevant libraries.
2. Generate a random dataset.
3. Decide the number of clusters.
4. Deploy the agglomerative hierarchical clustering algorithm.
5. Print the class labels.
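A minimal sketch of this task on a random 2-D dataset; the dataset, the choice of three clusters,
and the ward linkage are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Generate a random dataset
X = np.random.default_rng(7).random((50, 2))

# Decide the number of clusters and deploy agglomerative clustering
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

# Print the class labels and visualize the result
print(labels)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()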
Lab No. 08
Regular Expression
Lab Task:
Write a program that looks for lines of the form “New Revision: 39772” in the file mbox.txt (a
sketch follows). The text file is uploaded on MS Teams along with the task.
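A minimal sketch, assuming mbox.txt sits in the working directory after downloading it from MS
Teams:

import re

with open("mbox.txt") as handle:
    for line in handle:
        line = line.rstrip()
        # Match lines such as "New Revision: 39772"
        if re.search(r"^New Revision: \d+", line):
            print(line)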
Lab No. 09
Text Processing:
Whenever we have textual data, we need to apply several pre-processing steps to the data to transform
words into numerical features that work with machine learning algorithms. The pre-processing steps for a
problem depend mainly on the domain and the problem itself, hence, we don’t need to apply all steps to
every problem. We will be using the NLTK (Natural Language Toolkit) library here.
Convert text to lowercase:

def text_lowercase(text):
    return text.lower()

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
text_lowercase(input_str)
Example:
Input: “Hey, did you know that the summer break is coming? Amazing right!! It’s only 5 more days!!”
Output: “hey, did you know that the summer break is coming? amazing right!! it’s only 5 more days!!”
Remove numbers:
We can either remove numbers or convert the numbers into their textual representations. We can use
regular expressions to remove the numbers.
# Remove numbers
import re

def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result

input_str = "There are 3 balls in this bag, and 12 in the other one."
remove_numbers(input_str)
Lab Task:
Convert the numbers into words:
Input: “There are 3 balls in this bag, and 12 in the other one.”
Output: ‘There are three balls in this bag, and twelve in the other one.’
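A possible sketch of this task using the third-party inflect package (installed with pip install
inflect); the package choice is an assumption, and any number-to-words approach would do:

import inflect

engine = inflect.engine()

def numbers_to_words(text):
    words = []
    for token in text.split():
        # Convert purely numeric tokens such as "3" to "three"
        if token.isdigit():
            words.append(engine.number_to_words(token))
        else:
            words.append(token)
    return " ".join(words)

input_str = "There are 3 balls in this bag, and 12 in the other one."
print(numbers_to_words(input_str))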