Lab Manual

The document is a laboratory manual for the Fundamentals of Machine Learning course at SAL Institute of Diploma Studies, detailing various experiments for the academic year 2024-2025. Each experiment focuses on different machine learning concepts and techniques, including numerical computing, data analysis with Pandas, decision trees, and K-nearest neighbors. The manual includes objectives, theoretical background, exercises, and quizzes for each experiment to enhance students' understanding of machine learning algorithms.

Uploaded by

shubhamvj41
Copyright
© All Rights Reserved

SAL Institute of Diploma Studies,

AHMEDABAD

Department of Information Technology


Fundamentals of Machine Learning (4341603)

Laboratory Manual

Year: 2024-2025

INDEX

Sr. No.  Experiment                                                                  CO    Page No. (From / To)  Date  Marks  Signature
1.       Numerical Computing with Python (NumPy, Matplotlib)                         CO5
2.       Introduction to Pandas to analyse data.                                     CO5
3.       To implement the Find-S concept learning algorithm.                         CO2
4.       Import Pima Indian diabetes data to select best features.                   CO2
5.       Learn a decision tree algorithm for prediction.                             CO3
6.       Implement the K-nearest neighbour algorithm for predicting class labels.    CO3
7.       Import vgsales.csv from Kaggle platform.                                    CO3
8.       Project on regression                                                       CO3
9.       Implement the K-means clustering algorithm for clustering a set of points.  CO4
10.      Import Iris dataset                                                         CO4

Experiment No: 1 Date: / /

TITLE: Numerical Computing with Python (NumPy, Matplotlib)

OBJECTIVES:

After completing this experiment students will be able to…
● Get familiar with Python libraries for computation and visualization.

THEORY:
Refer to Unit 6 of the course curriculum.


● NumPy, short for Numerical Python, is a powerful library in Python used for numerical
computing. It provides support for multidimensional arrays and matrices along with a
collection of mathematical functions to operate on these arrays efficiently. Developed to
tackle the shortcomings of Python's native data structures when dealing with large datasets
and complex mathematical operations, NumPy is widely utilized in various fields such as
scientific computing, data analysis, machine learning, and more.
● Matplotlib is a comprehensive library in Python widely used for creating static,
interactive, and animated visualizations. It offers a wide range of plotting capabilities,
making it a go-to tool for data visualization tasks in fields such as data analysis, scientific
research, engineering, and more.
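
To make the NumPy description concrete, here is a minimal sketch of creating, indexing and slicing arrays. The array contents and variable names are made up for illustration and are not part of the manual.

```python
import numpy as np

# Creating arrays
a = np.array([10, 20, 30, 40, 50])    # from a Python list
zeros = np.zeros((2, 3))              # 2x3 array filled with 0.0
r = np.arange(2, 8)                   # integers 2..7 (the stop value 8 is excluded)
m = np.arange(1, 13).reshape(3, 4)    # 3x4 matrix holding 1..12

# Accessing elements (indices start at 0)
first = a[0]                          # first element
last = a[-1]                          # last element
cell = m[1, 2]                        # row 1, column 2

# Slicing
middle = a[1:4]                       # elements at positions 1, 2 and 3
col = m[:, 0]                         # first column of every row
block = m[:2, 1:3]                    # rows 0-1, columns 1-2

print(r, m.shape, cell)
print(middle, col)
print(block)
```

Slices such as `a[1:4]` are views on the original array, so assigning into a slice also changes `a`; call `.copy()` when an independent array is needed.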

EXERCISES:
1) Create arrays using NumPy.
2) Access array elements using NumPy.
3) Slice arrays using NumPy.
4) Draw different types of graphs using Matplotlib.
5) Write a program to draw the following graph using Matplotlib.
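
The specific graph referred to in the last exercise is not reproduced here, so the sketch below only illustrates four common chart types with Matplotlib. The data, titles and output file name are assumptions chosen for the example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                 # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for the illustration
x = np.linspace(0, 2 * np.pi, 50)
categories = ["A", "B", "C"]
counts = [5, 3, 8]
samples = np.random.default_rng(0).normal(size=200)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(x, np.sin(x))         # plot() draws a line chart
axes[0, 0].set_title("Line")
axes[0, 1].bar(categories, counts)    # bar chart
axes[0, 1].set_title("Bar")
axes[1, 0].hist(samples, bins=15)     # histogram
axes[1, 0].set_title("Histogram")
axes[1, 1].pie(counts, labels=categories)   # pie chart
axes[1, 1].set_title("Pie")
fig.tight_layout()

# Hypothetical output location: a PNG in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "charts.png")
fig.savefig(out_path)
```

In an interactive session, `plt.show()` can replace the `savefig` call.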

QUIZ:
1. Which of the following functions creates a NumPy array?
a) empty() b) array() c) ones() d) All of the above

2. What is the output of the code below?
import numpy as np
a = np.arange(2, 8)
print(a)
a) [2 3 4 5 6 7] b) [3 4 5 6 7]
c) [2 3 4 5 6 7 8] d) [3 4 5 6 7 8]

3. Find the output of the given code.
a = np.array([[[1,2,3],[4,5,6]]])
print(a.ndim)
a) 1 b) (1,3) c) 3 d) (3,1)

4. By default, what does the plot() function draw?
a) Bar chart b) Line chart c) Pie chart d) Horizontal bar chart

5. Which of the following types of chart is not supported by pyplot?
a) Pie b) Boxplot c) Histogram d) None of the above

6. Which function does pyplot provide to create a histogram?
a) hist() b) histo() c) histg() d) histogram()

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with date : _________________________

Experiment No: 2 Date: / /

TITLE: Introduction to Pandas to analyse data.

OBJECTIVES:

After completing this experiment students will be able to…
• Get familiar with Python machine learning libraries.

THEORY:
Refer to Unit 6 of the course curriculum.


• Pandas is a powerful and popular open-source library in Python used for data manipulation
and analysis. It provides easy-to-use data structures and functions for working with structured
data, making it an essential tool for tasks such as data cleaning, transformation, exploration,
and aggregation.

EXERCISES:
1) Load a CSV file into a Pandas DataFrame
2) Create a simple Pandas Series from a list
3) Create a simple Pandas DataFrame
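
The three exercises can be sketched as follows. This is a minimal example under stated assumptions: the CSV text, column names and values are invented, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# 1) Load CSV data into a DataFrame.
#    With a real file, pass its path instead, e.g. pd.read_csv("students.csv")
#    (hypothetical file name).
csv_text = "name,marks\nAsha,82\nRavi,74\nMeena,91\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# 2) Create a Series from a list
s = pd.Series([3, 7, 11], name="values")

# 3) Create a DataFrame from a dictionary of columns
df = pd.DataFrame({"x": [1, 2, 3], "y": [2.5, 5.0, 7.5]})

print(df_csv)
print(s)
print(df.shape)      # (rows, columns) as a tuple
print(df.dtypes)     # data type of each column
```

`df.shape` and `df.dtypes` are attributes, not methods, so they are used without parentheses.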

QUIZ:
1. Which of the following features is not provided by the Pandas module?
a) Merge and join the data sets b) Filter data using a condition
c) Plot and visualize the data d) None of the above

2. From which of the following file formats can Pandas read data?
a) JSON b) Excel c) HTML d) All of the above

3. Given a dataset named ‘data’ containing 5 columns and 10 rows, what is the output
of the following code? print(len(data.columns))
a) 5 b) 10 c) 15 d) 50

4. What does the shape attribute return?
a) It returns the number of rows and columns respectively in the form of a tuple
b) It returns the number of columns and rows respectively in the form of a list
c) It returns the number of rows and columns respectively in the form of a list
d) It returns the number of columns and rows respectively in the form of a tuple

5. Which of the following commands returns the data type of the values in each column
of the DataFrame?
a) print(df.dtype) b) print(dtypes(df))
c) print(df.dtypes) d) None of the above

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with date : _________________________

Experiment No: 3 Date: / /

TITLE: To implement the Find-S concept learning algorithm.

OBJECTIVES:
After completing this experiment students will be able to…
• Understand the concept of machine learning.
• Use the Find-S algorithm to find the most specific hypothesis that is consistent with all
positive instances in the training data. Find-S does not examine negative instances, so
the result is not guaranteed to exclude them.

THEORY:
Refer to Unit 2 of the course curriculum.
• The Find-S algorithm is a simple concept learning algorithm that learns a hypothesis from
training data given as labelled instances of a target concept. It is described in Tom
Mitchell's textbook Machine Learning and is used in supervised learning, where the learner
generalizes from examples.
• Hypothesis Representation: A hypothesis in the Find-S algorithm is a conjunction of
attribute constraints. Each constraint either names a specific value that a positive
instance must have for that attribute or is "?", which accepts any value.

EXERCISES:
1. Initialize the hypothesis to the most specific hypothesis in the hypothesis space.
2. For each positive training instance, generalize the hypothesis just enough to cover it:
keep every attribute constraint the instance satisfies and relax every other constraint
(to the instance's value on the first positive instance, to "?" afterwards).
3. Skip each negative training instance; Find-S leaves the hypothesis unchanged for negatives.
4. Return the final hypothesis.
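
A minimal sketch of Find-S follows. It uses the textbook form of the algorithm, in which negative examples are skipped and never alter the hypothesis. The training rows are the well-known EnjoySport example from Mitchell's Machine Learning, not data from this manual, and the function name `find_s` is chosen for illustration.

```python
def find_s(examples):
    """Return the most specific hypothesis consistent with the positive examples.

    Each example is (attribute_tuple, label), where label is True for a positive
    instance. In the hypothesis, None means "no value accepted yet" and "?"
    means "any value is accepted".
    """
    n_attributes = len(examples[0][0])
    hypothesis = [None] * n_attributes        # most specific hypothesis
    for attributes, label in examples:
        if not label:
            continue                          # negative examples are ignored
        for i, value in enumerate(attributes):
            if hypothesis[i] is None:
                hypothesis[i] = value         # first positive example: copy its value
            elif hypothesis[i] != value:
                hypothesis[i] = "?"           # conflicting values: generalize
    return hypothesis


# EnjoySport data: Sky, AirTemp, Humidity, Wind, Water, Forecast -> label
training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

print(find_s(training))
# prints ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

Because only positive examples drive the updates, the returned hypothesis should still be checked against the negative examples before it is trusted.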

QUIZ:
State True or False.
1. The Find-S algorithm considers only positive training examples and ignores negative training
examples.
2. In the Find-S algorithm we move from bottom to top, i.e. from a general hypothesis to a specific hypothesis.
3. A maximally specific hypothesis covers none of the negative training examples.

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with date : _________________________

Experiment No: 4 Date: / /

TITLE: Import Pima Indian diabetes data to select best features.

OBJECTIVES:
After completing this experiment students will be able to…
• Understand data pre-processing and identify the various types of data.

THEORY:
Refer to Unit 2 of the course curriculum.


• Feature subset selection, also known as feature selection, is a process in machine learning
and data mining where a subset of relevant features (variables, predictors) is selected from
the original set of features to build models. The goal of feature subset selection is to
improve the performance of the model by reducing the dimensionality of the feature space,
enhancing model interpretability, and potentially reducing overfitting.

EXERCISES:
1. Apply SelectKBest with the chi2 score function for feature selection.
2. Identify the best features.
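
The sketch below shows how scikit-learn's `SelectKBest` and `chi2` fit together. It assumes scikit-learn is installed and uses a tiny made-up array rather than the Pima Indian diabetes data, so the scores say nothing about the real features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up data (NOT the Pima data): 8 samples, 4 non-negative features.
# Columns 0 and 3 differ strongly between the two classes; 1 and 2 do not.
X = np.array([
    [1, 5, 3, 10],
    [2, 5, 1, 12],
    [1, 5, 2, 11],
    [2, 5, 2, 9],
    [9, 5, 2, 30],
    [8, 5, 3, 28],
    [9, 5, 1, 31],
    [10, 5, 2, 29],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Keep the k=2 features with the highest chi-squared score.
# chi2 requires non-negative feature values.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                      # one score per original feature
print(selector.get_support(indices=True))    # indices of the selected columns
print(X_selected.shape)
```

With the real dataset, `X` would hold the feature columns and `y` the outcome column loaded with Pandas, and the selector would normally be fitted on the training split only.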

QUIZ:
1. What is the main advantage of using feature selection?
a) Speeding up the training of an algorithm
b) Fine-tuning the model's performance
c) Removing noisy features

2. When selecting features, the decision should be made using:
a) the entire dataset b) the training set c) the testing set

3. Given 20 potential features, how many models must be evaluated by the all-subsets
algorithm?
a) 20 b) 40 c) 1048576 d) 1048596

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with Date : __________________________

Experiment No: 5 Date: / /

TITLE: Learn a decision tree algorithm for prediction.

OBJECTIVES:

After completing this experiment students will be able to…
• Learn a decision tree and use it to predict class labels.

THEORY:

A decision tree is a popular machine learning algorithm used for both classification and regression
tasks. It's a predictive modeling tool that maps observations about an item to conclusions about its
target value, typically represented in a tree-like structure.

The learned tree should be tested on test instances with unknown class labels, and the predicted
class labels for the test instances should be printed as output. Predicted class labels (0/1) for the
test data must be exactly in the order in which the test instances are present in the test file.

Refer to Unit 3 of the course curriculum. Students are advised to read Chapter 3 of Machine Learning
by Dutt, Chandramouli and Das.

EXERCISES:
1. Predict class labels of test data.
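
The manual's training and test files are not included here, so the sketch below uses a small made-up binary dataset to show the workflow with scikit-learn's `DecisionTreeClassifier` (scikit-learn is assumed to be installed). Predictions are printed one per line, in the same order as the test instances.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the label is 1 exactly when the first feature is 1.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
y_train = [0, 0, 1, 1, 0, 1]

# Test instances with unknown labels, in file order
X_test = [[1, 0], [0, 1], [1, 1], [0, 0]]

# criterion="entropy" selects splits by information gain
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)     # one label per test row, order preserved
for label in predictions:
    print(int(label))
```

`predict` returns labels in the order of the rows passed to it, which satisfies the ordering requirement stated in the theory section.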

QUIZ:
1. What is a decision tree?
a) A visual representation of decision-making using nodes and branches
b) A mathematical formula for predicting outcomes
c) A statistical model for regression analysis

2. What is the purpose of a decision tree?
a) To predict outcomes based on input variables
b) To summarize data using a graphical representation
c) To perform hypothesis testing on a dataset

3. What is a split in a decision tree?
a) A branch that represents a decision based on a feature or attribute
b) A point where the tree branches into different paths
c) A method for reducing the complexity of a decision tree

4. What is pruning in a decision tree?
a) A technique for simplifying the tree by removing branches that don't
contribute to accuracy
b) A method for reducing the number of input variables
c) A way to increase the complexity of a decision tree

5. What is overfitting in a decision tree?
a) When the tree is too simple and doesn't capture all the relevant information
b) When the tree is too complex and fits the training data too closely
c) When the input variables are not correlated with the outcome variable

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with date : _______________________

Experiment No: 6 Date: / /

TITLE: Implement the K-nearest neighbor algorithm for predicting class labels.

OBJECTIVES:

After completing this experiment students will be able to…
• Learn one of the simplest supervised machine learning algorithms used for classification.

THEORY:

The k-nearest neighbors (k-NN) algorithm is a straightforward and intuitive machine learning
algorithm used for both classification and regression tasks. It's a type of instance-based learning,
where the algorithm memorizes the training dataset and makes predictions for new instances based
on their similarity to existing instances in the training data.
Training data: data.csv
1 1 1 1 1 1 0 1 1
1 1 1 1 1 1 0 0 1
1 1 1 1 1 1 1 1 0
1 1 1 1 1 0 0 1 1
1 1 1 1 1 0 0 0 1
1 1 1 0 1 1 0 1 1
1 1 0 1 1 1 0 1 0
1 1 1 0 1 1 0 0 1
1 1 1 0 1 0 0 1 1
1 1 1 0 1 0 0 0 1
0 1 1 1 1 1 0 1 1
0 1 1 1 1 1 0 0 1
1 0 1 1 1 1 0 1 0
0 1 1 1 1 0 0 1 1
1 1 0 1 0 1 0 1 0
1 0 0 1 1 1 0 1 0
1 0 0 1 0 1 1 1 0
0 1 1 1 1 0 0 0 1
1 0 1 1 1 1 1 1 0
0 1 1 0 1 1 0 1 1

Test Data: test.csv
0 1 1 1 1 1 1 1
1 0 0 0 0 0 0 0
0 1 1 0 1 0 0 0
0 1 1 1 1 0 0 0
Refer to Unit 4 of the course curriculum. Students are advised to read Chapter 7 of Machine Learning
by Dutt, Chandramouli and Das.

EXERCISES:
1. Load the training and test datasets.
2. Implement a function to calculate the Euclidean distance between two data points.
3. For each data point in the test dataset, find the K nearest neighbors from the training dataset.
4. Determine the majority class among the K nearest neighbors.
5. Assign the predicted class label to the test data point.
6. Evaluate the performance of the algorithm using metrics such as accuracy, precision, recall, and F1
score.
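
A minimal pure-Python sketch of steps 1 to 5 follows. It embeds the rows printed above as text instead of reading data.csv and test.csv, and it assumes that the ninth value of each training row is the class label (the test rows have only eight values). The choice k=3 and the function names are also assumptions.

```python
import math
from collections import Counter

TRAIN_TEXT = """
1 1 1 1 1 1 0 1 1
1 1 1 1 1 1 0 0 1
1 1 1 1 1 1 1 1 0
1 1 1 1 1 0 0 1 1
1 1 1 1 1 0 0 0 1
1 1 1 0 1 1 0 1 1
1 1 0 1 1 1 0 1 0
1 1 1 0 1 1 0 0 1
1 1 1 0 1 0 0 1 1
1 1 1 0 1 0 0 0 1
0 1 1 1 1 1 0 1 1
0 1 1 1 1 1 0 0 1
1 0 1 1 1 1 0 1 0
0 1 1 1 1 0 0 1 1
1 1 0 1 0 1 0 1 0
1 0 0 1 1 1 0 1 0
1 0 0 1 0 1 1 1 0
0 1 1 1 1 0 0 0 1
1 0 1 1 1 1 1 1 0
0 1 1 0 1 1 0 1 1
"""

TEST_TEXT = """
0 1 1 1 1 1 1 1
1 0 0 0 0 0 0 0
0 1 1 0 1 0 0 0
0 1 1 1 1 0 0 0
"""


def parse(text):
    """Turn whitespace-separated rows of integers into a list of lists."""
    return [[int(v) for v in line.split()] for line in text.strip().splitlines()]


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_rows, query, k=3):
    """Majority label among the k training rows closest to query.

    Each training row is its features followed by the label (assumed layout).
    """
    by_distance = sorted(train_rows, key=lambda row: euclidean_distance(row[:-1], query))
    votes = Counter(row[-1] for row in by_distance[:k])
    return votes.most_common(1)[0][0]


train = parse(TRAIN_TEXT)
test = parse(TEST_TEXT)
predictions = [knn_predict(train, query, k=3) for query in test]
print(predictions)
```

The test rows as printed carry no labels, so accuracy, precision, recall and F1 cannot be computed from them directly; one option is to hold back some labelled training rows for evaluation.
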
QUIZ:
1. What does KNN stand for?
a) K-Nearest Neighbors b) Kernel Nonlinear Network
c) K-Means Nearest Neighbors d) None of the above

2. In KNN, how is the distance between a new data point and its neighbors typically measured?
a) Euclidean distance b) Manhattan distance c) Cosine similarity d) All of the above

3. In what type of machine learning problems is KNN generally used?
a) Regression problems b) Classification problems
c) Clustering problems d) Dimensionality reduction problems

4. What are some advantages of using KNN for machine learning?
a) It is a simple and easy-to-implement algorithm.
b) It can handle both continuous and categorical data.
c) It can adapt to complex decision boundaries.
d) All of the above.

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with Date : __________________________

Experiment No: 7 Date: / /

TITLE: Import vgsales.csv from the Kaggle platform.

OBJECTIVES:
After completing this experiment students will be able to…
• Understand the imported data from known repositories

THEORY:
https://pandas.pydata.org/docs/user_guide/index.html
Refer to Unit 4 of the course curriculum.

EXERCISES:
a. Find the number of rows and columns in the dataset.
b. Find basic statistics of the dataset using the describe() method.
c. Get the underlying values using the values attribute.
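
The three exercises can be sketched on a small stand-in DataFrame. The rows and column names below are made up for illustration; with the real file downloaded from Kaggle, the DataFrame would instead come from `pd.read_csv("vgsales.csv")`.

```python
import pandas as pd

# Stand-in rows with invented values (hypothetical column names)
df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": [2006, 2010, 2015],
    "Global_Sales": [1.5, 0.7, 3.2],
})

rows, cols = df.shape         # a. number of rows and columns
summary = df.describe()       # b. count, mean, std, min, quartiles, max per numeric column
values = df.values            # c. underlying data as a NumPy array

print(rows, cols)
print(summary)
print(values)
```

By default `describe()` summarizes only the numeric columns, so the text column does not appear in its output.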

QUIZ:
1. What is Pandas used for?
a) Data analysis and manipulation b) Web development
c) Machine learning d) Image processing

2. What are the two main data structures in Pandas?
a) Series and DataFrames b) Arrays and lists
c) Dictionaries and tuples d) Matrices and vectors

3. How do you read a CSV file into a Pandas DataFrame?
a) pd.read_csv('filename.csv') b) pd.read_excel('filename.csv')
c) pd.read_table('filename.csv') d) pd.read_json('filename.csv')

4. How do you select a subset of rows and columns from a Pandas DataFrame?
a) df.loc[row_index, column_index] b) df.iloc[row_index, column_index]
c) df[row_index, column_index] d) df.select(row_index, column_index)

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with Date : __________________________

Experiment No: 8 Date: / /

TITLE: Project on regression

OBJECTIVES:

After completing this experiment students will be able to…
• Understand the linear model.

THEORY:

Linear Regression: Linear regression is a foundational and widely used statistical technique for
modeling the relationship between a dependent variable (often denoted as y) and one or more
independent variables (often denoted as x). It is a supervised learning algorithm used for predictive
modeling and for understanding the relationship between variables.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$
where:
• $y$ is the dependent variable.
• $x_1, x_2, \dots, x_n$ are the independent variables.
• $\beta_0$ is the intercept (the value of $y$ when all independent variables are zero).
• $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients (slopes) of the independent variables.
• $\epsilon$ is the error term, representing the difference between the observed and predicted values.
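
To illustrate the equation, the sketch below estimates the intercept and coefficients by ordinary least squares with NumPy. The data is made up and generated without noise from y = 2 + 3·x1 − 1·x2, so the fit should recover those numbers; it is not the housing data used in the exercises.

```python
import numpy as np

# Made-up data: two independent variables, five observations
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
y = 2 + 3 * X[:, 0] - 1 * X[:, 1]            # noise-free, so the error term is zero

# Prepend a column of ones so the first fitted value is the intercept
A = np.column_stack([np.ones(len(X)), X])

# Least squares: choose beta to minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

predicted = A @ beta
residuals = y - predicted                    # observed minus predicted values

print(beta)          # intercept followed by the two slopes
print(residuals)
```

scikit-learn's `LinearRegression` fits the same model through its `fit(X, y)` method and exposes the results as `intercept_` and `coef_`.
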
https://scikit-learn.org/stable/
Refer to Unit 4 of the course curriculum.

EXERCISES:
a. Import home_data.csv from Kaggle using Pandas.
b. Understand the data by running the head(), info() and describe() methods.
c. Plot the price of the house against its area using the Matplotlib library.
d. Apply a linear regression model to predict the price of a house.

QUIZ:
1. What is linear regression used for?
a) Data visualization b) Clustering c) Predictive modeling d) Dimensionality reduction

2. In linear regression, what is the objective?
a) To minimize the mean squared error between the predicted and actual values
b) To maximize the correlation coefficient between the features and target variable
c) To maximize the R-squared value between the features and target variable
d) To minimize the sum of absolute errors between the predicted and actual values

3. How is linear regression implemented in Scikit-Learn?
a) By instantiating a LinearRegression object and calling its fit method
b) By instantiating a Regression object and calling its fit method
c) By instantiating a LinearModel object and calling its fit method
d) By instantiating a LinearSolver object and calling its fit method

4. What is the R-squared value in linear regression?
a) A measure of how well the model fits the data
b) A measure of the correlation between the features and target variable
c) A measure of the variance in the target variable that can be explained by the features
d) A measure of the error between the predicted and actual values

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with date : __________________________

Experiment No: 9 Date: / /

TITLE: Implement the K-means clustering algorithm for clustering a set of points.

OBJECTIVES:

After completing this experiment students will be able to…


• Determine the correct number of clusters.

THEORY:

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a
dataset into a predefined number of clusters. It aims to group similar data points together and
discover underlying patterns or structures in the data.
Refer to Unit 5 of the course curriculum.

EXERCISES:
1. Load the dataset containing the points to be clustered.
2. Initialize K cluster centroids randomly.
3. Repeat until convergence (i.e., cluster assignments do not change):
   a. Assign each data point to the nearest cluster centroid.
   b. Update each cluster centroid to be the mean of all data points assigned to it.
4. Assign each test data point to the nearest cluster centroid.
5. Evaluate the performance of the algorithm.
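
The steps above can be sketched with NumPy as follows. The six points are made up, the initial centroids are drawn from the data points with a fixed seed, and the function names `kmeans` and `nearest_centroid` are chosen for illustration.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: returns (centroids, labels) for a 2-D array of points."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3a: squared distance from every point to every centroid
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # assignments unchanged: converged
        labels = new_labels
        # Step 3b: move each centroid to the mean of its assigned points
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:               # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def nearest_centroid(point, centroids):
    """Step 4: index of the centroid closest to a new point."""
    return int(((centroids - np.asarray(point, dtype=float)) ** 2).sum(axis=1).argmin())


# Made-up data: two well-separated groups of three points
data = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                 [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
centroids, labels = kmeans(data, k=2)

# Step 5 (one option): sum of squared distances of points to their centroids
inertia = float(((data - centroids[labels]) ** 2).sum())
print(labels, centroids, inertia)
```

The cluster numbers themselves are arbitrary; only the grouping matters. A lower sum of squared distances for the same K indicates tighter clusters.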

QUIZ:
1. What is K-means clustering used for?
a) Dimensionality reduction b) Data cleaning c) Data clustering d) Model selection

2. What is the objective of K-means clustering?
a) To minimize the sum of squared distances between data points and their centroids
b) To maximize the variance between data points and their centroids
c) To minimize the sum of absolute distances between data points and their centroids
d) To maximize the correlation between data points and their centroids

3. What is the value of K in K-means clustering?
a) The number of clusters b) The number of data points
c) The number of features d) The number of centroids

4. How is the initial centroid for K-means clustering selected?
a) Randomly b) Based on the mean of the data points
c) Based on the median of the data points d) Based on the mode of the data points

5. How do you evaluate the quality of the clustering in K-means clustering?
a) By calculating the sum of squared distances between data points and their centroids
b) By calculating the silhouette score
c) By calculating the F1 score
d) By calculating the Pearson correlation coefficient

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with date : __________________________

Experiment No: 10 Date: / /

TITLE: Import Iris dataset

OBJECTIVES:

After completing this experiment students will be able to…
• Differentiate between supervised and unsupervised learning approaches.

THEORY:
Refer to Unit 5 of the course curriculum.

EXERCISES:
a. Find the number of rows and columns using the shape attribute.
b. Print the first 30 instances using the head() method.
c. Find the number of data instances in each class (use groupby() and size()).
d. Plot the univariate graphs (box plots and histograms).
e. Plot the multivariate plot (scatter matrix).
f. Split the data, using 80% of the instances to train the model.
g. Apply K-NN and K-means clustering, check their accuracy, and decide which is better.
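
A minimal sketch of exercises a, b, c, f and g follows; the plotting exercises are left out. It assumes scikit-learn and Pandas are installed and loads the copy of the Iris data bundled with scikit-learn. The column name `species`, the random seed, and the choices of 5 neighbours and 3 clusters are assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target                   # class labels 0, 1, 2

print(df.shape)                               # a. rows and columns
print(df.head(30))                            # b. first 30 instances
print(df.groupby("species").size())           # c. instances per class

# f. 80% of the rows for training, the rest for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.8, random_state=0, stratify=iris.target
)

# g. Supervised: K-NN is trained with labels and scored on held-out data
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)
print(knn_accuracy)

# g. Unsupervised: K-means never sees the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)
# Cluster numbers are arbitrary, so compare them with the species in a table
print(pd.crosstab(iris.target, kmeans.labels_))
```

Because K-means cluster numbers do not correspond to class labels, an accuracy figure for it requires first mapping each cluster to its most common species; the cross-tabulation shows that mapping.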

QUIZ:
1. Which algorithm is supervised and which one is unsupervised?
a) K-means clustering is supervised, KNN algorithm is unsupervised
b) K-means clustering is unsupervised, KNN algorithm is supervised
c) Both K-means clustering and KNN algorithm are supervised
d) Both K-means clustering and KNN algorithm are unsupervised

2. What is the output of K-means clustering?
a) A classification of the data points into different classes
b) A prediction of the target variable for a given data point
c) A grouping of similar data points into K clusters
d) The K nearest neighbors for a given data point

3. What is the output of KNN algorithm?
a) A classification of the data points into different classes
b) A prediction of the target variable for a given data point
c) A grouping of similar data points into K clusters
d) The K nearest neighbors for a given data point

4. What is the primary objective of K-means clustering?
a) To classify data points into different classes
b) To find the K nearest neighbors for a given data point
c) To group similar data points into K clusters
d) To predict the target variable for a given data point

5. What is the primary objective of KNN algorithm?
a) To classify data points into different classes
b) To find the K nearest neighbors for a given data point
c) To group similar data points into K clusters
d) To predict the target variable for a given data point

EVALUATION:
Problem Analysis    Understanding    Timely        Mock    Total
and Solution        Level            Completion
(3)                 (3)              (2)           (2)     (10)

Signature with date : __________________________