
Introduction to AI & ML
Lab Manual

CSL236

Department of Computer Science and Engineering
The NorthCap University, Gurugram - 122001, India

Session 2022-23

Published by:

School of Engineering and Technology

Department of Computer Science & Engineering

The NorthCap University Gurugram



• Laboratory Manual is for internal circulation only

© Copyright Reserved

No part of this Practical Record Book may be reproduced, used, or stored without prior permission of The NorthCap University.

Copying or facilitating copying of lab work constitutes cheating and is considered use of unfair means. Students indulging in copying or facilitating copying shall be awarded zero marks for that particular experiment. Frequent cases of copying may lead to disciplinary
action. Attendance in lab classes is mandatory.

Labs are open up to 7 PM upon request. Students are encouraged to make full use of labs
beyond normal lab hours.

PREFACE
The Machine Learning Lab Manual is designed to meet the course and program requirements of the NCU curriculum for B.Tech. III year students of the CSE branch. The aim of the lab work is to give students brief practical experience of basic lab skills. It provides the space and scope for self-study so that students can come up with new and creative ideas.

The lab manual is written on the basis of a “teach yourself” pattern, and it is expected that students who come with proper preparation should be able to perform the experiments without any difficulty. A brief introduction to each experiment, with information about self-study material, is provided. The pre-requisite is a basic working knowledge of Python. The laboratory exercises include familiarization with data pre-processing techniques for ML such as handling missing data, duplicate data and outliers, along with feature scaling and encoding. Feature Selection and Dimensionality Reduction are included to enhance performance and reduce computational time. Various ML classification and regression techniques are taught. Students learn the corresponding algorithms and implement them using a high-level language, i.e. Python. Students are expected to come thoroughly prepared for the lab. General discipline, safety guidelines and report writing are also discussed.

The lab manual is a part of the curriculum of The NorthCap University, Gurugram. The teacher's copy of the experimental results and answers to the questions are available as sample guidelines.

We hope that the lab manual will be useful to students of the CSE, IT, ECE and BSc branches, and the authors request readers to kindly forward their suggestions / constructive criticism for further improvement of the workbook.

The authors express deep gratitude to the Members of the Governing Body, NCU, for their encouragement and motivation.
Authors
The NorthCap University
Gurugram, India

CONTENTS

S.No.  Details
       Syllabus
1      Introduction
2      Lab Requirements
3      General Instructions
4      List of Experiments
5      List of Flip Experiments
6      List of Projects
7      Rubrics
8      Annexure 1 (Format of Lab Report)



SYLLABUS

1. Department: Department of Computer Science and Engineering
2. Course Name: Machine Learning
3. Course Code: CSL236
4. L-T-P: 3-0-2
5. Credits: 4
6. Type of Course (check one): Program Core ✔   Program Elective   Open Elective
7. Pre-requisite(s), if any: Introduction to AI and ML
8. Frequency of offering (check one): Odd ✔   Even   Either semester   Every semester

9. Brief Syllabus:

Introduction to artificial intelligence, History of AI, Proposing and evaluating AI applications, Preprocessing and Feature Engineering, Case study: Exploratory Analysis of Delhi Pollution, Simple Linear Regression, Multiple Regression, Polynomial Regression, Support Vector Regression (SVR), Decision Tree Regression, Random Forest Regression, Logistic Regression, K Nearest Neighbors, Support Vector Machine, Kernel SVM, Naïve Bayes, Decision Tree Classification, Random Forest Classification, Basic Terminologies: Overfitting, Underfitting, Bias and Variance of a model, Bootstrapping, Cross-Validation and Resampling Methods, Performance Measures: Confusion matrix, ROC. Comparing two classification algorithms: McNemar's Test, paired t-test.

Total Lecture, Tutorial and Practical Hours for this course (taking 15 teaching weeks per semester): 75 hours
Lectures: 40 hours   Tutorials: 0 hours   Lab Work (Practice): 35 hours

The class size is a maximum of 30 learners.

10. Course Outcomes (COs)

On successful completion of this course students will be able to:

CO 1  Understand and implement the preprocessing of the data to be used for machine learning models.
CO 2  Understand the strengths and limitations of various ML algorithms.
CO 3  Understand why models degrade and how to maintain them.
CO 4  Implement and use model grading metrics.
CO 5  Apply ML techniques and technologies to solve real world business problems.

11. UNIT WISE DETAILS No. of Units: 5

Unit Number: 1 Title: Introduction to AI and ML No. of hours: 4

Content Summary:

Introduction to artificial intelligence, History of AI, Overview of machine learning, techniques in


machine learning, deep learning, differences between deep learning, machine learning and AI,
different applications of machine learning, different types of data.

Unit Number: 2 Title: Data Preprocessing and Engineering No. of hours: 10

Content Summary:

Introduction to Data Preprocessing, different preprocessing techniques, data cleaning, data


transformation: standardization and normalization, data smoothing, dimensionality reduction,
different encoding schemes for categorical and numerical features.

Unit Number: 3 Title: Regression Techniques No. of hours: 12

Content Summary:

Simple Linear Regression, Multiple Regression, Polynomial Regression, Support Vector Regression
SVR, Decision Tree Regression, Random Forest Regression

Unit Number: 4 Title: Classification algorithm techniques No. of hours:12

Logistic Regression, K Nearest Neighbors, Support Vector Machine, Kernel SVM, Naïve
Bayes, Decision Trees Classification, Random Forest Classification

Unit Number: 5 Title: Analysis of various algorithms No. of hours:7

Basic Terminologies: Overfitting, Underfitting, Bias and Variance of a model, Bootstrapping, Cross-Validation and Resampling Methods, Performance Measures: Confusion matrix, ROC.

12. Brief Description of Self-learning Components by students (through books/resource material etc.):

Data-preprocessing techniques

13. Advanced Learning Components:

Probability and Statistics and Linear Algebra

14. Books Recommended :

Text Books:

1. Michael Bowles, “Machine Learning in Python ” Wiley, Third Edition, 2019

2. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques,
3rd ed.
Introduction to AI & ML
CSL236
Reference Books:

1. Ian H. Witten & Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan Kaufmann Publishers, Second Edition, 2020

2. Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press, Third Edition, 2015

3. Tom Mitchell, “Machine Learning”, McGraw Hill

Reference Websites: (NPTEL, Swayam, Coursera, Edx, Udemy, LMS, official documentation
weblink)

● https://nculms.ncuindia.edu/

● https://www.simplilearn.com/big-data-and-analytics/machine-learning-certification-training-course

● https://www.coursera.org/learn/machine-learning

1. INTRODUCTION

That ‘learning is a continuous process’ cannot be over-emphasized. The theoretical knowledge gained during lecture sessions needs to be strengthened through practical experimentation. Thus, practical work forms an integral part of the learning process.

OBJECTIVES:
The purpose of conducting experiments can be stated as follows:

● To familiarize the students with the basic concepts of Machine Learning like
supervised, unsupervised and reinforcement learning.
● The lab sessions will be based on exploring the concepts discussed in class.

● Learning and understanding Data Preprocessing techniques.

● Learning and understanding regression and classification problems and algorithms.

● Learning and understanding Feature Selection and Dimensionality Reduction.

● Learning and understanding performance metrics.

● Gaining hands-on experience.

2. LAB REQUIREMENTS

S.No.  Requirements             Details
1      Software Requirements    Python 3
2      Operating System         Windows (64-bit), Linux
3      Hardware Requirements    8 GB RAM (recommended), 2.60 GHz processor (recommended)
4      Required Bandwidth       NA

3. GENERAL INSTRUCTIONS

3.1 General discipline in the lab

● Students must turn up in time and contact concerned faculty for the experiment
they are supposed to perform.
● Students will not be allowed to enter late in the lab.

● Students will not leave the class till the period is over.

● Students should come prepared for their experiment.

● Experimental results should be entered in the lab report format and certified/signed by the concerned faculty / lab instructor.
● Students must get the connection of the hardware setup verified before
switching on the power supply.
● Students should maintain silence while performing the experiments. If any
necessity arises for discussion amongst them, they should discuss with a very
low pitch without disturbing the adjacent groups.
● Violating the above code of conduct may attract disciplinary action.

● Damaging lab equipment or removing any component from the lab may invite
penalties and strict disciplinary action.

3.2 Attendance

● Attendance in the lab class is compulsory.

● Students should not attend a different lab group/section other than the one
assigned at the beginning of the session.

● If a student misses his/her lab classes on account of illness or family problems, he/she may be assigned a different group, in consultation with the concerned faculty / lab instructor, to make up the loss, or he/she may work in the lab during spare/extra hours to complete the experiment. No attendance will be granted in such cases.

3.3 Preparation and Performance

● Students should come to the lab thoroughly prepared for the experiments they are assigned to perform on that day. A brief introduction to each experiment, with information about self-study references, is provided on the LMS.
● Students must bring the lab report to each practical class, with written records of the last experiments performed, complete in all respects.
● Each student is required to write a complete report of the experiment he/she has performed and bring it to the lab class for evaluation in the next working lab. Sufficient space is provided in the workbook for independent writing of theory, observations, calculations and conclusions.
● A zero-tolerance policy applies to copying / plagiarism. Zero marks will be awarded if work is found to be copied; repeated cases will lead to disciplinary action.
● Refer to Annexure 1 for the Lab Report Format.

4. LIST OF EXPERIMENTS

Sr. No. | Title of the Experiment | Software Used | Unit Covered | CO Covered | Time Required

1. To introduce various Python libraries used for machine learning. [Python (Jupyter); Unit 1; CO1; 4 hours]
2. To apply various data pre-processing techniques used for effective machine learning on the given dataset. [Python (Jupyter); Unit 1; CO1; 2 hours]
3. To apply feature encoding schemes such as LabelEncoder and OneHotEncoder. [Python (Jupyter); Unit 1; CO1; 3 hours]
4. To apply different feature selection techniques in machine learning. [Python (Jupyter); Unit 1; CO1; 3 hours]
5. To apply PCA as a feature reduction technique on the IRIS dataset. [Python (Jupyter); Unit 2; CO1; 2 hours]
6. To apply Simple Linear Regression on the given dataset. [Python (Jupyter); Unit 2; CO2, CO3, CO4; 2 hours]
7. To apply multiple linear regression on any regression dataset. [Python (Jupyter); Unit 2; CO2, CO3, CO4; 3 hours]
8. To apply Polynomial Linear Regression on the given dataset. [Python (Jupyter); Unit 2; CO2, CO3, CO4; 3 hours]
9. To solve classification problems using Logistic Regression. [Python (Jupyter); Unit 3; CO2, CO3, CO4; 2 hours]
10. To solve classification problems using KNN classification. [Python (Jupyter); Unit 3; CO2, CO3, CO4; 2 hours]
11. To solve classification problems using Naïve Bayes. [Python (Jupyter); Unit 4; CO2, CO3, CO4; 2 hours]
12. To apply Support Vector Machines (SVM) on classification problems. [Python (Jupyter); Unit 4; CO2, CO3, CO4; 3 hours]
13. To apply Decision Trees for classification problems. [Python (Jupyter); Unit 4; CO2, CO3, CO4; 3 hours]

Value Added Experiments

14. Build an ML model from scratch using data-preprocessing and regression algorithms and calculate various performance metrics. [Python (Jupyter); Units 1, 2, 3, 4; CO1, CO2, CO3, CO4, CO5; 5 hours]
15. Build an ML model from scratch using data-preprocessing and classification algorithms and calculate various performance metrics. [Python (Jupyter); Units 1, 2, 3, 4; CO1, CO2, CO3, CO4, CO5; 5 hours]

5. LIST OF FLIP EXPERIMENTS

5.1 Project – Dimensionality reduction using LDA.


5.2 Competition on Kaggle

6. LIST OF PROJECTS

1. Titanic Challenge: The sinking of the Titanic is one of the most infamous
shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely
considered “unsinkable” RMS Titanic sank after colliding with an iceberg.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in
the death of 1502 out of 2224 passengers and crew. While there was some
element of luck involved in surviving, it seems some groups of people were more
likely to survive than others. In this project, the students need to build a predictive
model that answers the question: “what sorts of people were more likely to
survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).
2. House Price Prediction Using Advanced Regression Techniques: Ask a home buyer
to describe their dream house, and they probably won't begin with the height of
the basement ceiling or the proximity to an east-west railroad. But Kaggle’s
advanced house price prediction dataset proves that much more influences price
negotiations than the number of bedrooms or a white-picket fence. With 79
explanatory variables describing (almost) every aspect of residential homes in
Ames, Iowa, this dataset can be used to predict the final price of each home.
3. Mechanism of Action (MoA) Prediction: Mechanism of action means the biochemical interactions through which a drug generates its pharmacological effect. If we know a disease affects some particular receptor or downstream set of cell activity, we can develop drugs faster if we can predict how cells and genes affect various receptor sites. The dataset combines gene expression and cell viability data with the MoA annotations of more than 5,000 drugs. Each drug was tested at two doses (cp_dose) and three time points (cp_time), so six samples basically correspond to one drug. We need to train a model that classifies drugs based on their biological activity. This is a multi-label classification problem, which means we have multiple targets (not multiple classes). In this project, perform exploratory data analysis and then train a model using deep neural networks with Keras.

7. RUBRICS

Marks Distribution

Continuous Evaluation (30 Marks): Each experiment shall be evaluated for 10 marks, and at the end of the semester proportional marks shall be awarded out of a total of 10. In addition, 15 marks are for the viva and 5 marks for the MOOC.

Project Evaluations (40 Marks): Both projects shall be evaluated for 40 marks each, and at the end of the semester a viva will be conducted covering the projects as well as the concepts learned in the labs; this component carries 40 marks.

Annexure 1
Introduction to AI and ML
(CSL 236)
Lab Practical Report

Faculty name: Dr. Ruchika Lalit Student name: Kunal Verma

Roll No.: 21CSU393

Semester: 5th

Group: CC

Department of Computer Science and Engineering


NorthCap University, Gurugram- 122001, India
Session 2022-23

EXPERIMENT NO. 1

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● To understand the basic libraries of Python.

● To differentiate between NumPy and Pandas.

Outcome:

Students will be familiarized with handling datasets using basic Python libraries and with applying various operations on datasets using them.

Problem Statement:

To introduce various Python libraries used for machine learning.

Background Study:

Basic Python libraries are necessary to import datasets and to apply various data pre-processing and machine learning techniques to them.

Question Bank:

1. How can Pandas be used to read data from the internet and from your system?

To read data from the internet, use Pandas functions like pd.read_csv("https://example.com/data.csv"). For local files, use pd.read_csv("path/to/your/local/file.csv").

2. How can a Pandas DataFrame be converted to a NumPy array and vice versa?

To convert a Pandas DataFrame to a NumPy array, use df.to_numpy(); for the reverse,
create a DataFrame with pd.DataFrame(numpy_array). This facilitates seamless
interchange between the two structures.

3. How to access different rows and columns using loc and iloc?

Use loc for label-based indexing, accessing rows and columns by their labels, and iloc for integer-based indexing, accessing rows and columns by their integer positions.
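A small illustrative sketch covering these three points; the DataFrame below is invented for demonstration, while a real dataset would normally be loaded with pd.read_csv as described above:

import pandas as pd
import numpy as np

# Toy DataFrame standing in for a dataset read via pd.read_csv("data.csv")
df = pd.DataFrame({"age": [25, 32, 47], "salary": [50000, 64000, 120000]},
                  index=["a", "b", "c"])

# DataFrame -> NumPy array and back
arr = df.to_numpy()                        # 2-D NumPy array of the values
df2 = pd.DataFrame(arr, columns=df.columns)

# Label-based vs. integer-based indexing
print(df.loc["a", "salary"])               # row labelled "a", column "salary"
print(df.iloc[0, 1])                       # first row, second column (same cell)
print(df.loc[:, ["age"]])                  # all rows, "age" column by label
print(df.iloc[0:2, 0:1])                   # first two rows, first column by position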

4. Differentiate between Feature Selection and Dimensionality Reduction.

Feature selection involves choosing a subset of relevant features from the original set,
enhancing model interpretability and efficiency. Dimensionality reduction, such as PCA,
transforms data into a lower-dimensional space, preserving essential information while
reducing computational complexity.

5. What are the advantages of Wrapper methods over filter methods for feature selection?

Wrapper methods for feature selection assess subsets of features using the actual
model's performance, capturing complex relationships. This can outperform filter
methods by considering interactions and dependencies within the data.

6. Explain Regularization methods for Feature Selection.

Regularization methods such as L1 (Lasso) and L2 (Ridge) penalize large coefficients during model training, encouraging simpler models; L1 in particular performs implicit feature selection by shrinking the coefficients of less relevant features exactly to zero.

7. What are embedded feature selection methods?

Embedded feature selection methods integrate feature selection into the model training
process. Examples include LASSO regression, decision trees, and random forests, which
naturally identify and prioritize important features during training.
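As a brief sketch of regularization-based (embedded) feature selection, the code below fits a Lasso model on a synthetic dataset; the dataset, its size and the alpha value are arbitrary choices for illustration, and the features whose coefficients remain non-zero act as the selected subset.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)

# Lasso (L1) shrinks the coefficients of uninformative features towards exactly zero
lasso = Lasso(alpha=1.0).fit(StandardScaler().fit_transform(X), y)
selected = [i for i, c in enumerate(lasso.coef_) if c != 0]
print("Selected feature indices:", selected)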

Student Work Area


Algorithm/Flowchart/Code/Sample Outputs

EXPERIMENT NO. 2
Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● To understand the importance of data pre-processing techniques.

● To handle missing values, duplicate values, feature scaling etc.

Outcome:

Students will be familiarized with the understanding and importance of applying various data pre-
processing techniques.

Problem Statement:

Write a program to perform data pre-processing techniques for effective machine learning.

Background Study: Data preprocessing in Machine Learning is a crucial step that helps enhance the quality of data to promote the extraction of meaningful insights from it. Data preprocessing refers to the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training Machine Learning models.

data_set = pd.read_csv('Dataset.csv')

Here data_set is the name of the variable that stores our dataset, and inside the function we pass the name of the dataset file. Once we execute the above line of code, it successfully imports the dataset into our program. We also use the scikit-learn library, which contains various modules for building machine learning models.

Question Bank:
1. What are different ways to handle missing values both for numerical as well as categorical
data?

For numerical data, options include mean or median imputation, interpolation, or advanced
methods like K-nearest neighbors. For categorical data, common approaches are mode
imputation, creating a separate category, or using advanced techniques like model-based
imputation.

2. What is the function in python used for finding duplicate rows in data?

In Python, the Pandas library provides the duplicated() function to identify duplicate rows
in a DataFrame. It returns a Boolean Series indicating whether each row is a duplicate of a
previous row.
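A brief sketch of both ideas on a tiny hand-built DataFrame (all values invented for illustration):

import pandas as pd

df = pd.DataFrame({"age": [25.0, None, 47.0, 47.0],
                   "city": ["Delhi", "Pune", None, None],
                   "salary": [50000, 64000, 120000, 120000]})

# Numerical column: fill missing values with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill missing values with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Find and drop duplicate rows
print(df.duplicated())        # Boolean Series, True for rows repeating an earlier row
df = df.drop_duplicates()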

3. Differentiate between the two scaling methods used for feature scaling.

● Min-Max Scaling (Normalization): Scales features to a specific range (e.g., [0, 1]), preserving relative relationships.

● Standardization (Z-score Normalization): Scales features to have zero mean and unit variance, maintaining relative relationships but not restricting values to a specific range.
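A minimal comparison of the two scalers on a toy column of numbers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # values rescaled into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # values with zero mean, unit variance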

Student Work Area


Algorithm/Flowchart/Code/Sample Outputs

EXPERIMENT NO. 3
Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● To perform different encoding schemes.

● To prepare dataset by converting categorical data to numeric form for machine learning.

● To understand and implement label encoding and one hot encoding.

Outcome:

Students will be able to understand different encoding schemes to prepare data for machine
learning.

Problem Statement:

To apply different feature encoding schemes on the given dataset.

Background Study: Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical variables, you must encode them as numbers before you can fit and evaluate a model. The two most popular techniques are Label Encoding and One-Hot Encoding.

The LabelEncoder class from the sklearn.preprocessing library encodes categorical values as integer labels.

The OneHotEncoder class of the preprocessing library encodes each category as a binary (0/1) column, so a feature with three categories is divided into three columns.
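A short sketch of both encoders on an invented colour feature:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colours = np.array(["red", "green", "blue", "green"])

# Label encoding: one integer per category (categories are sorted alphabetically)
labels = LabelEncoder().fit_transform(colours)
print(labels)                                                   # [2 1 0 1]

# One-hot encoding: one binary column per category
onehot = OneHotEncoder().fit_transform(colours.reshape(-1, 1)).toarray()
print(onehot)                                                   # 4 x 3 matrix of 0s and 1s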
Question Bank:

1. Can ML algorithms handle categorical data directly?

Many machine learning algorithms require numerical input, so handling categorical data
often involves encoding it into numerical form. Techniques like one-hot encoding or label
encoding are commonly used for this purpose.

2. What are the different schemes for encoding categorical data?

● One-Hot Encoding: Creates binary columns for each category.

● Label Encoding: Assigns a unique numerical label to each category.

● Ordinal Encoding: Maps ordinal categories to numerical values based on their order.

3. Differentiate between Label Encoding and One-Hot Encoding.

● Label Encoding: Assigns unique numerical labels to categories, preserving ordinal relationships.

● One-Hot Encoding: Creates binary columns for each category, suitable for nominal data, but doesn't preserve ordinal relationships.
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 4

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● To understand the importance of feature selection

● To differentiate between different types of feature selection.

● Build a model using feature selection techniques.


Outcome:

Students will be familiarized with model building using feature selection techniques and
optimization.
Problem Statement:

Write a program to apply filter feature selection techniques.

Background Study: Feature selection is the process of reducing the number of input variables
when developing a predictive model. It is desirable to reduce the number of input variables to both
reduce the computational cost of modeling and, in some cases, to improve the performance of the
model.

Question Bank:

1. What are different filter feature selection techniques?

Filter feature selection techniques include Pearson correlation, Chi-square, Information Gain, and Mutual Information. These methods assess the relevance of features independently of the learning algorithm, aiding in dimensionality reduction.

2. How feature selection techniques depend on the data type of input features and output
variable?

Feature selection techniques depend on data types; for numerical features, methods like
correlation are suitable, while for categorical features, techniques like Chi-square are more
appropriate. The choice aligns with the output variable type.

3. What is the mathematics behind Pearson’s Correlation to rank features?

Pearson's Correlation measures linear relationships between two numerical variables,


ranging from -1 to 1. The formula calculates the covariance of the variables divided by the
product of their standard deviations.
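Written out for samples x_i and y_i with means x-bar and y-bar, the standard sample formula is:

r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}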

Student Work Area


Algorithm/Flowchart/Code/Sample Outputs

# Pearson’s Correlation with f_regression function

# Creating regression dataset with make_regression function
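One possible way to flesh out the outline above (a sketch only, not the graded solution; the dataset size, number of features and k value are arbitrary choices):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression, SelectKBest

# Creating a regression dataset with the make_regression function
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=1)

# Pearson-correlation-based scoring with the f_regression function
f_scores, p_values = f_regression(X, y)
print(np.round(f_scores, 2))

# Keep the 3 highest-scoring features
X_best = SelectKBest(score_func=f_regression, k=3).fit_transform(X, y)
print(X_best.shape)        # (200, 3)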


EXPERIMENT NO. 5

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Study Dimensionality Reduction.

● Understand the basic principle behind Principal Component Analysis.

Outcome:

Students will be familiarized with Dimensionality Reduction especially Principal Component


Analysis (PCA).
Problem Statement:

Reduce dimensionality of Iris dataset using Principal Component Analysis.

Background Study: Principal component analysis is a statistical technique that is used to analyze
the interrelationships among a large number of variables and to explain these variables in terms of
a smaller number of variables, called principal components, with a minimum loss of information.

PCA is affected by scale, so you need to scale the features in your data before applying PCA. Use StandardScaler to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.

explained_variance_ratio_ reports the fraction of the total variance captured by each principal component, so you can see how much information the first and second components retain individually and together.
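A minimal sketch of this procedure on the Iris dataset (two components chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Standardize, then project the 4 original features onto 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component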
Question Bank:

1. What is dimensionality reduction?

Dimensionality reduction is a process of reducing the number of features or variables in a dataset. It aims to simplify data representation, mitigate the curse of dimensionality, and enhance computational efficiency.

2. Differentiate between Feature Selection, Feature Engineering and Dimensionality Reduction.

● Feature Selection: Involves choosing relevant features to improve model performance.

● Feature Engineering: Involves creating new features or transforming existing ones to enhance data representation.

● Dimensionality Reduction: Aims to reduce the feature space by transforming data into a lower-dimensional form, preserving essential information while minimizing redundancy and computational complexity.

3. What are principal components?

Principal components are linear combinations of original features in a dataset. They are
derived through techniques like Principal Component Analysis (PCA) to capture the
maximum variance in data, reducing dimensionality while retaining crucial information.
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs

EXPERIMENT NO. 6
Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):
● Understand Simple Linear Regression (SLR).

● Study about the different performance metrics of SLR.

Outcome:

Students will be familiarized with regression problems and with SLR as a solution to single-feature problems.

Problem Statement:

To apply Simple Linear Regression on the given dataset.

Background Study:

Simple linear regression is an approach for predicting a response using a single feature. It is
assumed that the two variables are linearly related. Hence, we try to find a linear function that
predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
statsmodels — a module that provides classes and functions for the estimation of many different
statistical models, as well as for conducting statistical tests, and statistical data exploration.
scikit-learn — a module that provides simple and efficient tools for data mining and data analysis.
Calling model.params will show us the model’s parameters.
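A small sketch of SLR with statsmodels on synthetic data; the true slope 3.0, intercept 5.0 and noise level are invented purely for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, 50)   # roughly linear relationship

X = sm.add_constant(x)            # add the intercept term to the design matrix
model = sm.OLS(y, X).fit()

print(model.params)               # fitted intercept and slope
print(model.rsquared)             # coefficient of determination (R-squared)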
Question Bank:

1. What is a regression problem?

A regression problem involves predicting a continuous output variable based on input features. The goal is to establish a relationship between the independent variables and the dependent variable, enabling the prediction of numerical values within a given range.

2. How Simple Linear Regression (SLR) helps in solving regression problems containing an
input feature and an output variable?

Simple Linear Regression (SLR) models the linear relationship between an input feature
and an output variable, providing a straightforward approach to predict and understand the
connection in regression problems.

3. What are the different performance metrics that can be used for evaluating SLR?

● Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.

● Root Mean Squared Error (RMSE): The square root of MSE, offering error interpretation in the original unit.

● Mean Absolute Error (MAE): Calculates the average absolute difference between predicted and actual values.

● R-squared (R2): Represents the proportion of variance in the dependent variable explained by the model, ranging from 0 to 1.
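A quick sketch computing the metrics above with scikit-learn on invented predictions:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.6])

mse = mean_squared_error(y_true, y_pred)
print(mse)                                   # MSE
print(np.sqrt(mse))                          # RMSE
print(mean_absolute_error(y_true, y_pred))   # MAE
print(r2_score(y_true, y_pred))              # R-squared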
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 7

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Understand mathematics behind Multiple Linear Regression (MLR).

● Solving linear regression problems containing more than one independent feature using
MLR.

Outcome:

Students will be familiarized with Multiple Linear Regression for solving linear regression problems.

Problem Statement:

To apply multiple linear regression on any regression dataset.

Background Study:

Multiple Linear Regression attempts to model the relationship between two or more features and a response by fitting a linear equation to the observed data. statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
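A brief sketch of MLR using scikit-learn on a synthetic dataset (statsmodels OLS, shown in the previous experiment, works equally well with multiple features; the dataset below is generated only for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=3, noise=8.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

mlr = LinearRegression().fit(X_train, y_train)
print(mlr.intercept_, mlr.coef_)     # one coefficient per input feature
print(mlr.score(X_test, y_test))     # R-squared on the held-out data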

Question Bank:

1. What is MLR?

MLR stands for Multiple Linear Regression. It's a statistical technique in machine learning
where the relationship between multiple independent variables and a single dependent
variable is modeled using a linear equation.

2. Differentiate between SLR and MLR?

Simple Linear Regression (SLR) involves modeling the relationship between a single
independent variable and a dependent variable using a linear equation. Multiple Linear
Regression (MLR), on the other hand, extends this to include multiple independent
variables, providing a more complex model to capture nuanced relationships.
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 8

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Study and understand about Polynomial Regression on non-linear regression data.

● Study the mathematics behind Polynomial Regression.

Outcome:

Students will be familiarized with handling of regression data having non-linear relationship
between input and output.

Problem Statement:

To apply Polynomial Linear Regression on the given dataset.

Background Study:

Polynomial Regression is a form of linear regression in which the relationship between the
independent variable x and dependent variable y is modeled as an nth degree polynomial.
Polynomial regression fits a nonlinear relationship between the value of x and the corresponding
conditional mean of y, denoted E(y |x). statsmodels — a module that provides classes and
functions for the estimation of many different statistical models, as well as for conducting statistical
tests, and statistical data exploration.
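A short sketch fitting a degree-2 polynomial with scikit-learn's PolynomialFeatures; the quadratic data below are generated only for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 2.0 * x.ravel() ** 2 - x.ravel() + rng.normal(0, 1.0, 60)   # non-linear trend

# Degree-2 polynomial regression is still linear in its coefficients
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)
print(model.score(x, y))             # R-squared of the polynomial fit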

Question Bank:

1. What is non-linear relationship between input and output?

A non-linear relationship between input and output indicates that the change in the output
variable is not proportional to the change in the input variable. The pattern or trend
between the variables does not follow a straight line, but rather a curved or nonlinear
pattern.

2. How Polynomial Regression is used to handle non-linear relationship?

Polynomial Regression addresses non-linear relationships by introducing polynomial terms (e.g., quadratic, cubic) into the regression equation. This allows the model to capture and represent curved patterns, offering a more flexible approach than traditional linear regression for non-linear data.

Student Work Area


Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 9

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Study Logistic Regression.

● How Logistic Regression is used to solve classification problems.

Outcome:

Students will be familiarized with Logistic Regression and performance metrics to calculate its
performance on the given dataset.

Problem Statement:

To solve classification problems using Logistic Regression.

Background Study:

Logistic regression is a classification technique which helps to predict the probability of an outcome that can only have two values. Logistic Regression is used when the dependent variable (target) is categorical. A logistic regression produces a logistic curve, which is limited to values between 0 and 1.
confusion_matrix and accuracy_score functions are used to evaluate the model. A confusion
matrix is visualized using a heatmap from the seaborn package, and Boxplot from seaborn is used
to check for the outliers in the dataset. The confusion matrix consists of the matrix elements with
True Positive, True Negative, False Positive, and False Negative values.
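A sketch tying these pieces together on the built-in breast cancer dataset, chosen only because it is a readily available binary-classification dataset:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
y_pred = clf.predict(scaler.transform(X_test))

print(accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)    # counts of TN, FP, FN, TP
sns.heatmap(cm, annot=True, fmt="d")     # visualize the confusion matrix as a heatmap
plt.show()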

Question Bank:

1. What is Logistic Regression?

Logistic Regression is a statistical method used for binary classification. It models the
probability that a dependent variable belongs to a particular category, employing the
logistic function to constrain predictions between 0 and 1.

2. How Logistic Regression is used for solving classification problems?

Logistic Regression is used in classification by modeling the probability of an instance belonging to a particular class. It applies the logistic function to transform a linear combination of input features, mapping predictions between 0 and 1. A threshold is then chosen to classify instances into classes.

3. Why sigmoid function is used in it?

The sigmoid function (logistic function) is used in Logistic Regression because it maps any
real-valued number to the range [0, 1]. This is crucial for interpreting the output as
probabilities in binary classification, where the outcome is either 0 or 1. The sigmoid
function ensures a smooth transition between the two classes.
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 10

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Study K-Nearest Neighbor algorithm (KNN).

● Understand the working principle behind KNN.

Outcome:

Students will be familiarized with classification technique using KNN.

Problem Statement:

To solve classification problems using KNN classification.

Background Study:

The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification and regression problems.

Lazy learning algorithm: KNN is a lazy learning algorithm because it does not have a specialized training phase; it stores the training data and uses it only at prediction time.

Non-parametric learning algorithm: KNN is also a non-parametric learning algorithm because it doesn't assume anything about the underlying data distribution.

K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new
datapoints which further means that the new data point will be assigned a value based on how
closely it matches the points in the training set. The most common metric used to perform matching is the Euclidean distance between the points.
The sklearn library provides a layer of abstraction on top of Python, so in order to use the KNN algorithm it is sufficient to create an instance of KNeighborsClassifier. By default, KNeighborsClassifier looks for the 5 nearest neighbors and uses the Minkowski metric with p=2, which is equivalent to Euclidean distance; a different distance metric can be passed explicitly if required.
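A minimal sketch of KNN on the Iris dataset; k = 5 and the Euclidean metric are stated explicitly for clarity, and the scaling step is a common (not mandatory) practice:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(scaler.transform(X_train), y_train)

print(knn.score(scaler.transform(X_test), y_test))   # test-set accuracy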

Question Bank:

1. What is KNN classifier?

KNN (K-Nearest Neighbors) classifier is a supervised machine learning algorithm used for
classification. It assigns a data point to the majority class among its k nearest neighbors in
the feature space.

2. How KNN makes use of Euclidean distance to calculate nearest neighbor?

KNN calculates Euclidean distance between data points in the feature space. It measures
the straight-line distance between two points, considering each feature as a dimension.
Smaller Euclidean distances indicate closer proximity, and the k nearest neighbors are
determined based on these distances for classification.

3. What are the other distances that can be used for nearest neighbor?

● Manhattan Distance (L1 Norm): Sum of absolute differences along each dimension.

● Minkowski Distance: Generalization of Euclidean and Manhattan distances, where the degree parameter influences the distance measure.

● Hamming Distance: Suitable for categorical data, counts the number of positions at which corresponding elements are different.

● Cosine Similarity: Measures the cosine of the angle between vectors, often used for high-dimensional data or text classification.

4. What are the various performance metrics used for classification problems?

● Accuracy: Proportion of correctly classified instances.

● Precision: Proportion of true positive predictions among all positive predictions.

● Recall (Sensitivity): Proportion of true positive predictions among all actual positives.

● F1 Score: Harmonic mean of precision and recall.

● Area Under the Receiver Operating Characteristic Curve (AUC-ROC): Measures the classifier's ability to distinguish between classes.

Student Work Area


Algorithm/Flowchart/Code/Sample Outputs

EXPERIMENT NO. 11
Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Understand and study Naïve Bayes (NB) Classifier.

● Understand Naïve Bayes theorem behind it.

Outcome:

Students will be familiarized with NB classification technique.

Problem Statement:

To solve classification problems using Naïve Bayes.

Background Study:

Naïve Bayes Classifier is a probabilistic classifier and is based on Bayes' Theorem.

In Machine Learning, a classification problem represents the selection of the best hypothesis given the data. Given a new data point, we try to classify which class label this new data instance belongs to. The prior knowledge about the past data helps us in classifying the new data point. Bayes' theorem,

P(A|B) = P(B|A) * P(A) / P(B),

gives us the probability of event A happening given that event B has occurred.
The receiver operating characteristic curve also known as roc_curve is a plot that tells about the
interpretation potential of a binary classifier system. It is plotted between the true positive rate and
the false positive rate at different thresholds.
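A brief sketch of Gaussian Naïve Bayes together with an ROC curve, using the built-in breast cancer dataset as a convenient binary problem:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

nb = GaussianNB().fit(X_train, y_train)
scores = nb.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores) # points of the ROC curve
print(roc_auc_score(y_test, scores))             # area under the ROC curve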

Question Bank:
1. What is Bayes theorem?

Bayes' Theorem is a mathematical formula that describes the probability of an event based
on prior knowledge of conditions that might be related to the event. It is expressed as P(A|B)
= P(B|A) * P(A) / P(B), where A and B are events, and P(A|B) is the conditional probability of
event A given that event B has occurred.

2. How Naïve Bayes classifier helps for solving classification problems?

The Naïve Bayes classifier uses Bayes' Theorem to predict the probability of a data point
belonging to a particular class. It assumes independence between features, simplifying
computations. By calculating conditional probabilities, it efficiently handles classification
problems, especially in natural language processing and spam filtering.

3. What is the condition on features that should be fulfilled for successful application of Naïve
Bayes method?

Naïve Bayes assumes feature independence given the class label. For successful
application, features should be conditionally independent, meaning that the presence of one
feature does not provide information about the presence or absence of other features, given
the class label.
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 12

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Understand and study Support Vector Machines (SVM).

● To study how linear hyperplane is calculated to differentiate between two classes.

● Basic understanding of the different variants of SVM.

Outcome:

Students will be familiarized with Support Vector Machines classifier.

Problem Statement:

To solve classification problems using SVM.

Background Study:

In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues, SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974).
Given a set of training examples, each marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new examples to one category or the other, making
it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
In Python, scikit-learn is a widely used library for implementing machine learning algorithms. SVM is also available in the scikit-learn library, and we follow the same structure for using it (import the library, create the object, fit the model, and predict). Tuning the parameters' values for machine learning algorithms effectively improves model performance. Let's look at the parameters available with SVM.
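A small sketch showing SVC with a few of the commonly tuned parameters; the RBF kernel, C = 1.0 and gamma = "scale" are just typical starting values, not tuned settings, and the breast cancer dataset is chosen only for convenience:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

scaler = StandardScaler().fit(X_train)
# kernel, C (regularization strength) and gamma are the most frequently tuned parameters
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(scaler.transform(X_train), y_train)

print(svm.score(scaler.transform(X_test), y_test))   # test-set accuracy
print(svm.support_vectors_.shape)                     # the support vectors found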
Question Bank:

1. What is SVM?

SVM, or Support Vector Machine, is a supervised machine learning algorithm used for
classification and regression tasks. It finds the optimal hyperplane that maximally
separates data points of different classes in the feature space.

2. What are the advantages of using SVM over other classifiers?

● Effective in high-dimensional spaces.

● Robust against overfitting, especially in high-dimensional data.

● Versatile, as it supports linear and non-linear classification.

● Can handle both binary and multi-class classification.

● Effective in cases where the number of features is greater than the number of samples.

3. What do you mean by support vectors?

Support vectors are the data points in a dataset that are crucial for defining the optimal
hyperplane in a Support Vector Machine (SVM). These are the points closest to the
decision boundary, influencing the positioning and orientation of the hyperplane, and
ultimately determining the separation between different classes.

Student Work Area


Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 13

Student Name and Roll Number: Kunal Verma 21CSU393

Semester /Section: 5th / CC - A

Link to Code:

Date:

Faculty Signature:

Grade:

Objective(s):

● Understand and study Decision Trees for classification problems.

● Study about the information gain used to create decision trees.

Outcome:

Students will be familiarized with creation of decision trees.

Problem Statement:

Apply Decision Tree classifier for solving classification problems.

Background Study:

Decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions. Decision trees are among the most powerful algorithms that fall under the category of supervised algorithms. They can be used for both classification and regression tasks. The two main entities of a tree are decision nodes, where the data is split, and leaves, where we get the outcomes.
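A minimal sketch of a depth-limited decision tree on the Iris dataset; criterion="entropy" makes the splits use information gain, and max_depth = 3 is an arbitrary cap chosen for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=8)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))    # test-set accuracy
print(export_text(tree))             # the learned splitting rules as text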

Question Bank:

1. What is a decision tree?

A decision tree is a tree-like model used for both classification and regression in machine
learning. It recursively splits the dataset into subsets based on the most significant attribute,
creating a tree structure of decisions to reach a final prediction or decision.

2. How decision tree is created to solve problems?

A decision tree is created by recursively splitting the dataset based on features that best
separate the data into homogeneous subsets. The splitting process continues until a stopping
criterion is met, such as a predefined tree depth or a minimum number of samples in a leaf
node. Each node in the tree represents a decision based on a feature, leading to the creation
of a hierarchical structure that facilitates classification or regression based on the learned
rules.

3. List the advantages and disadvantages of Decision Tree classifiers.

● Advantages:
I. Interpretable and easy to understand.
II. Requires minimal data preprocessing.
III. Handles both numerical and categorical data.
IV. Implicitly performs feature selection.

● Disadvantages:
I. Prone to overfitting, especially on complex datasets.

Student Work Area


Algorithm/Flowchart/Code/Sample Outputs
Annexure 2

Introduction to AI and ML
CSL236

Project Report

Faculty name: Dr. Ruchika Lalit Student name: Kunal Verma

Roll No.: 21CSU393

Semester: 5th

Group: CC - A

Department of Computer Science and Engineering


The NorthCap University, Gurugram- 122001, India
Session 2022-23

Table of Contents

S.No.  Details
1.     Project Description
2.     Problem Statement
3.     Analysis
       3.1 Hardware Requirements
       3.2 Software Requirements
4.     Design
       4.1 Data / Input-Output Description
       4.2 Algorithmic Approach / Algorithm / DFD / ER Diagram / Program Steps
5.     Implementation and Testing (stage/module wise)
6.     Output (Screenshots)
7.     Conclusion and Future Scope
