21CSU393 Kunal Verma - AI&ML Lab Manual
Lab Manual
Department of Computer Science and Engineering
The NorthCap University, Gurugram
Introduction to AI & ML
CSL236
Session 2022-23
Published by:
© Copyright Reserved
Copying or facilitating copying of lab work comes under cheating and is considered as use
of unfair means. Students indulging in copying or facilitating copying shall be awarded zero
marks for that particular experiment. Frequent cases of copying may lead to disciplinary
action. Attendance in lab classes is mandatory.
Labs are open up to 7 PM upon request. Students are encouraged to make full use of labs
beyond normal lab hours.
PREFACE
This Machine Learning Lab Manual is designed to meet the course and program requirements of
the NCU curriculum for B.Tech. III year students of the CSE branch. The lab work is intended
to give students brief practical experience in basic lab skills. It provides the space and
scope for self-study so that students can come up with new and creative ideas.
The lab manual is written on the “teach yourself” pattern, and it is expected that
students who come with proper preparation should be able to perform the experiments
without any difficulty. A brief introduction to each experiment, with information about
self-study material, is provided. The prerequisite is a basic working knowledge of
Python. The laboratory exercises include familiarization with data pre-processing
techniques for ML such as handling missing data, duplicate data and outliers, feature scaling and
encoding. Feature Selection and Dimensionality Reduction are included to enhance
performance and reduce computational time. Various ML classification and regression
techniques are taught. Students learn the algorithms pertaining to these and
implement them in a high-level language, i.e. Python. Students are expected to come
thoroughly prepared for the lab. General discipline, safety guidelines and report writing
are also discussed.
The lab manual is a part of the curriculum of The NorthCap University, Gurugram.
The teacher’s copy of the experimental results and answers to the questions are available as
sample guidelines.
We hope that the lab manual will be useful to students of the CSE, IT, ECE and BSc branches.
The author requests readers to kindly forward their suggestions and constructive criticism for
further improvement of the workbook.
The author expresses deep gratitude to the Members, Governing Body-NCU, for their encouragement and
motivation.
Authors
The NorthCap University
Gurugram, India
CONTENTS
S.N. Details
Syllabus
1 Introduction
2 Lab Requirement
3 General Instructions
4 List of Experiments
6 List of Projects
7 Rubrics
SYLLABUS
1. Department: Department of Computer Science and Engineering
6. Type of Course
(Check one): Program Core ✔ Program Elective Open Elective
9. Brief Syllabus:
Introduction to artificial intelligence, History of AI, Proposing and evaluating AI application, Preprocessing and
Feature Engineering, Case study: Exploratory Analysis of Delhi Pollution, Simple Linear Regression, Multiple
Regression, Polynomial Regression, Support Vector Regression SVR, Decision Tree Regression, Random Forest
Regression, Logistic Regression, K Nearest Neighbors, Support Vector Machine, Kernel SVM, Naïve Bayes,
Decision Trees Classification, Random Forest Classification, Basic Terminologies: Overfitting, Underfitting, Bias
and Variance model, Bootstrapping, Cross-Validation and Resampling Methods, Performance Measures: Confusion
matrix, ROC. Comparing two classification Algorithms: McNemar’s Test, paired t-test.
Total lecture, Tutorial and Practical Hours for this course (Take 15 teaching weeks per
semester): 75 hours
Practice
Lectures: 40 hours Tutorials: 0 hours Lab Work: 35 hours
CO 1: Understand and implement the preprocessing of the data to be used for machine
learning models.
Content Summary:
Content Summary:
No. of hours:12
Content Summary:
Simple Linear Regression, Multiple Regression, Polynomial Regression, Support Vector Regression
SVR, Decision Tree Regression, Random Forest Regression
Logistic Regression, K Nearest Neighbors, Support Vector Machine, Kernel SVM, Naïve
Bayes, Decision Trees Classification, Random Forest Classification
Basic Terminologies: Over fitting, Under fitting, Bias and Variance model, Bootstrapping, Cross-
Validation and Resampling Methods, Performance Measures: Confusion matrix, ROC.
Data-preprocessing techniques
Text Books:
2. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques,
3rd ed.
Reference Books:
1. Ian H. Witten & Eibe Frank, “Data Mining: Practical Machine Learning Tools and
Techniques”, Morgan Kaufmann Publishers, Second Edition, 2020
2. Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press, Third Edition, 2015
Reference Websites: (NPTEL, Swayam, Coursera, Edx, Udemy, LMS, official documentation
weblink)
● https://nculms.ncuindia.edu/
● https://www.simplilearn.com/big-data-and-analytics/machine-learning-certification-training-course
● https://www.coursera.org/learn/machine-learning
1. INTRODUCTION
OBJECTIVES:
The purpose of conducting experiments can be stated as follows:
● To familiarize the students with the basic concepts of Machine Learning like
supervised, unsupervised and reinforcement learning.
● The lab sessions will be based on exploring the concepts discussed in class.
● Hands on experience
2. LAB REQUIREMENTS
1 Software Requirements: Python 3
3. GENERAL INSTRUCTIONS
● Students must turn up in time and contact concerned faculty for the experiment
they are supposed to perform.
● Students will not be allowed to enter late in the lab.
● Students will not leave the class till the period is over.
● Damaging lab equipment or removing any component from the lab may invite
penalties and strict disciplinary action.
3.2 Attendance
● Students should not attend a different lab group/section other than the one
assigned at the beginning of the session.
● Students should come to the lab thoroughly prepared on the experiments they
are assigned to perform on that day. Brief introduction to each experiment
with information about self study reference is provided on LMS.
● Students must bring the lab report during each practical class with written
records of the last experiments performed complete in all respect.
● Each student is required to write a complete report of the experiment he has
performed and bring to lab class for evaluation in the next working lab.
Sufficient space in work book is provided for independent writing of theory,
observation, calculation and conclusion.
● Students should follow the Zero tolerance policy for copying / plagiarism. Zero
marks will be awarded if found copied. If caught further, it will lead to
disciplinary action.
● Refer to Annexure 1 for the Lab Report Format.
4. LIST OF EXPERIMENTS
6. LIST OF PROJECTS
1. Titanic Challenge: The sinking of the Titanic is one of the most infamous
shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely
considered “unsinkable” RMS Titanic sank after colliding with an iceberg.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in
the death of 1502 out of 2224 passengers and crew. While there was some
element of luck involved in surviving, it seems some groups of people were more
likely to survive than others. In this project, the students need to build a predictive
model that answers the question: “what sorts of people were more likely to
survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
2. House Price Prediction Using Advanced Regression Techniques: Ask a home buyer
to describe their dream house, and they probably won't begin with the height of
the basement ceiling or the proximity to an east-west railroad. But Kaggle’s
advanced house price prediction dataset proves that much more influences price
negotiations than the number of bedrooms or a white-picket fence. With 79
explanatory variables describing (almost) every aspect of residential homes in
Ames, Iowa, this dataset can be used to predict the final price of each home.
3. Mechanism of Action (MoA) Prediction: Mechanism of action means the
biochemical interactions through which a drug generates its pharmacological
effect. If we know a disease affects some particular receptor or downstream set of
cell activity, we can develop drugs faster if we can predict how cells and genes
affect various receptor sites. The project uses a dataset that combines gene expression and
cell viability data with the MoA annotations of more than 5,000 drugs. Each drug
was tested at two doses (cp_dose) and three treatment times (cp_time), so
six samples correspond to one drug. We need to train a model that
classifies drugs based on their biological activity. This is a multi-label
classification problem, which means we have multiple targets (not multiple classes). In this
project, perform exploratory data analysis and then train a model using deep
neural networks with Keras.
7. RUBRICS
Marks Distribution
Each experiment shall be evaluated for 10 marks and at the end of the semester
proportional marks shall be awarded out of total 10.
Both the projects shall be evaluated for 40 marks each and at the end of the
semester viva will be conducted related to the projects as well as concepts
learned in labs and this component carries 40 marks.
15 Marks: For viva
5 Marks: MOOC
Annexure 1
Introduction to AI and ML
(CSL 236)
Lab Practical Report
Semester: 5th
Group: CC
EXPERIMENT NO. 1
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Students will be familiarized with handling datasets using basic Python libraries and applying
various operations on datasets using them.
Problem Statement:
Background Study:
Basic Python libraries are necessary for importing datasets and applying various data
pre-processing and machine learning techniques to them.
Question Bank:
1. How can pandas be used to read data from the internet and from your system?
To read data from the internet, pass the URL to a pandas reader function, for example
pd.read_csv("https://example.com/data.csv"). For local files, pass the file path:
pd.read_csv("path/to/your/local/file.csv").
2. How pandas dataframe can be converted to numpy arrays and vice versa?
To convert a Pandas DataFrame to a NumPy array, use df.to_numpy(); for the reverse,
create a DataFrame with pd.DataFrame(numpy_array). This facilitates seamless
interchange between the two structures.
3. How to access different rows and columns using loc and iloc?
Use loc for label-based indexing, accessing rows and columns by their labels, and iloc for
integer-based indexing, accessing rows and columns by their integer positions in Python.
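The answers to questions 1 to 3 can be tried out together in a short sketch. The CSV text, column names and index below are made-up illustration data; in practice a file path or URL would be passed to pd.read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content; a local path or a URL could be passed to read_csv instead.
csv_text = "name,age,score\nAsha,21,88.5\nRavi,22,79.0\nMeera,20,93.0\n"
df = pd.read_csv(io.StringIO(csv_text), index_col="name")

# DataFrame -> NumPy array, and NumPy array -> DataFrame.
arr = df.to_numpy()
df_back = pd.DataFrame(arr, columns=df.columns, index=df.index)

# loc selects by label, iloc selects by integer position.
ravi_score = df.loc["Ravi", "score"]
first_row_first_col = df.iloc[0, 0]
first_two_rows = df.iloc[:2]          # end position is excluded
label_slice = df.loc["Asha":"Ravi"]   # end label is included
```

Note that to_numpy() returns a single-dtype array, so the integer age column is upcast to float when it is mixed with the float score column.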
Feature selection involves choosing a subset of relevant features from the original set,
enhancing model interpretability and efficiency. Dimensionality reduction, such as PCA,
transforms data into a lower-dimensional space, preserving essential information while
reducing computational complexity.
5. What are the advantages of Wrapper methods over filter methods for feature selection?
Wrapper methods for feature selection assess subsets of features using the actual
model's performance, capturing complex relationships. This can outperform filter
methods by considering interactions and dependencies within the data.
Embedded feature selection methods integrate feature selection into the model training
process. Examples include LASSO regression, decision trees, and random forests, which
naturally identify and prioritize important features during training.
EXPERIMENT NO. 2
Student Name and Roll Number: Kunal Verma 21CSU393
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Students will be familiarized with the understanding and importance of applying various data pre-
processing techniques.
Problem Statement:
Write a program to perform data pre-processing techniques for effective machine learning.
Background Study: Data preprocessing is a crucial step in Machine Learning that enhances
the quality of data so that meaningful insights can be extracted from it. It refers to
the technique of preparing (cleaning and organizing)
the raw data to make it suitable for building and training Machine Learning models.
1. data_set = pd.read_csv('Dataset.csv')
data_set is the name of the variable that stores our dataset, and inside the function we have passed the
name of the dataset file. Once we execute the above line of code, it will import the dataset
into our code.
We also use the Scikit-learn library in our code, which contains various modules for building machine learning
models.
Question Bank:
1. What are different ways to handle missing values both for numerical as well as categorical
data?
For numerical data, options include mean or median imputation, interpolation, or advanced
methods like K-nearest neighbors. For categorical data, common approaches are mode
imputation, creating a separate category, or using advanced techniques like model-based
imputation.
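As a minimal sketch of the simplest options named above, the snippet below applies median imputation to a numerical column and mode imputation to a categorical column using pandas only. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Delhi", "Gurugram", np.nan, "Delhi"],
})

# Numerical column: replace missing values with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: replace missing values with the most frequent category (mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

Creating a separate category would instead use something like fillna("Unknown") on the categorical column.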
2. What is the function in python used for finding duplicate rows in data?
In Python, the Pandas library provides the duplicated() function to identify duplicate rows
in a DataFrame. It returns a Boolean Series indicating whether each row is a duplicate of a
previous row.
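A short sketch of duplicated() on made-up data follows; drop_duplicates() is shown alongside it because it is the usual next step once duplicates have been found.

```python
import pandas as pd

# Hypothetical data: rows 2 and 4 repeat earlier rows exactly.
df = pd.DataFrame({"id": [1, 2, 2, 3, 1], "grade": ["A", "B", "B", "C", "A"]})

# Boolean Series: True where a row repeats an earlier row.
dup_mask = df.duplicated()

# Keep only the first occurrence of each row.
deduped = df.drop_duplicates()
```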
Min-Max Scaling (Normalization): Scales features to a specific range (e.g., [0, 1])
preserving relative relationships.
Standardization (Z-score Normalization): Scales features to have zero mean and unit
variance, maintaining relative relationships but not restricting to a specific range.
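The two scaling schemes can be compared on a single toy feature. This is only a sketch using scikit-learn's MinMaxScaler and StandardScaler with their default settings; the values are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One hypothetical feature with five observations.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, giving zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
```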
EXPERIMENT NO. 3
Student Name and Roll Number: Kunal Verma 21CSU393
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
● To prepare dataset by converting categorical data to numeric form for machine learning.
Outcome:
Students will be able to understand different encoding schemes to prepare data for machine
learning.
Problem Statement:
Background Study: Machine learning models require all input and output variables to be numeric.
This means that if your data contains categorical data, you must encode it to numbers before you
can fit and evaluate a model. The two most popular techniques are Label Encoding and One-Hot
Encoding.
LabelEncoder class from the sklearn.preprocessing module: this class encodes the categories of a
variable as integers.
OneHotEncoder class from the sklearn.preprocessing module: this class encodes each category as a
separate column of 0s and 1s, so a variable with three categories is split into three columns.
Question Bank:
Many machine learning algorithms require numerical input, so handling categorical data
often involves encoding it into numerical form. Techniques like one-hot encoding or label
encoding are commonly used for this purpose.
Ordinal Encoding: Maps ordinal categories to numerical values based on their order.
One-Hot Encoding: Creates binary columns for each category, suitable for nominal data,
but doesn't preserve ordinal relationships.
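The difference between the encoding schemes can be seen in a small sketch. The columns and the category order small < medium < large are assumptions made for illustration; pd.get_dummies is used for the one-hot step as a convenient alternative to OneHotEncoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data: "size" is ordinal, "country" is nominal.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "country": ["India", "France", "Spain", "India"],
})

# Ordinal encoding with an explicitly stated order: small < medium < large.
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ord_enc.fit_transform(df[["size"]]).ravel()

# One-hot encoding: one 0/1 column per country.
one_hot = pd.get_dummies(df["country"], prefix="country")

# Label encoding: integers assigned in alphabetical order of the categories.
labels = LabelEncoder().fit_transform(df["country"])
```

Here the three countries produce three one-hot columns, while the label codes imply an order (France < India < Spain) that has no real meaning, which is why label encoding is usually reserved for target variables.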
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 4
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Students will be familiarized with model building using feature selection techniques and
optimization.
Problem Statement:
Background Study: Feature selection is the process of reducing the number of input variables
when developing a predictive model. It is desirable to reduce the number of input variables to both
reduce the computational cost of modeling and, in some cases, to improve the performance of the
model.
Question Bank:
2. How feature selection techniques depend on the data type of input features and output
variable?
Feature selection techniques depend on data types; for numerical features, methods like
correlation are suitable, while for categorical features, techniques like Chi-square are more
appropriate. The choice aligns with the output variable type.
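For numerical inputs and a numerical output, a filter method can be sketched with scikit-learn's SelectKBest and the f_regression score, which is based on the correlation between each feature and the target. The data below is synthetic, and the choice of k = 2 is an assumption for illustration; for categorical inputs and a categorical output, chi2 would be the corresponding score function.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only features 0 and 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the two features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept.
selected_idx = selector.get_support(indices=True)
```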
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Background Study: Principal component analysis is a statistical technique that is used to analyze
the interrelationships among a large number of variables and to explain these variables in terms of
a smaller number of variables, called principal components, with a minimum loss of information.
PCA is affected by scale, so you need to scale the features in your data before applying PCA. Use
StandardScaler to standardize the dataset’s features onto unit scale (mean = 0 and
variance = 1), which is a requirement for the optimal performance of many machine learning
algorithms.
explained_variance_ratio_: reports the fraction of the total variance captured by each principal
component. Summing the ratios of the first two components gives the share of the
information that the two components retain together.
Question Bank:
Principal components are linear combinations of original features in a dataset. They are
derived through techniques like Principal Component Analysis (PCA) to capture the
maximum variance in data, reducing dimensionality while retaining crucial information.
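A minimal sketch of the usual PCA workflow is given below: standardize, project onto two components, and inspect the explained variance ratios. The three-feature dataset is synthetic and was constructed so that two of the features are strongly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: features 0 and 1 share a common source, feature 2 is independent.
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base * 10.0 + rng.normal(scale=0.5, size=(100, 1)),
    base + rng.normal(scale=0.05, size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# PCA is scale sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component.
ratios = pca.explained_variance_ratio_
```

Because two of the three features carry almost the same information, two components are expected to retain nearly all of the variance in this particular dataset.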
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 6
Student Name and Roll Number: Kunal Verma 21CSU393
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
● Understand Simple Linear Regression (SLR).
Outcome:
Student will be familiarized with regression problems and SLR as a solution to single feature
problem.
Problem Statement:
Background Study:
Simple linear regression is an approach for predicting a response using a single feature. It is
assumed that the two variables are linearly related. Hence, we try to find a linear function that
predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
statsmodels — a module that provides classes and functions for the estimation of many different
statistical models, as well as for conducting statistical tests, and statistical data exploration.
scikit-learn — a module that provides simple and efficient tools for data mining and data analysis.
Calling model.params will show us the model’s parameters.
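The fitting step can be sketched with scikit-learn's LinearRegression on made-up points that lie exactly on y = 2x + 1. This sketch uses scikit-learn rather than statsmodels; the fitted intercept_ and coef_ attributes carry the same information that model.params reports for a statsmodels fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)  # one feature, as a column
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Fit the line y = intercept + slope * x by least squares.
model = LinearRegression().fit(x, y)
slope = model.coef_[0]
intercept = model.intercept_

# Predict the response for a new value of x.
prediction = model.predict(np.array([[6.0]]))[0]
```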
Question Bank:
2. How Simple Linear Regression (SLR) helps in solving regression problems containing an
input feature and an output variable?
Simple Linear Regression (SLR) models the linear relationship between an input feature
and an output variable, providing a straightforward approach to predict and understand the
connection in regression problems.
4. What are the different performance metrics that can be used for evaluating SLR?
Mean Squared Error (MSE): Measures the average squared difference between
predicted and actual values.
Root Mean Squared Error (RMSE): The square root of MSE, offering error
interpretation in the original unit.
Mean Absolute Error (MAE): Calculates the average absolute difference between
predicted and actual values.
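The three metrics can be computed on a small set of made-up actual and predicted values, as in the sketch below. RMSE is taken as the square root of MSE with NumPy.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values; the errors are -0.5, 0, 1 and 0.5.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mse = mean_squared_error(y_true, y_pred)    # mean of squared errors
rmse = np.sqrt(mse)                         # same unit as y
mae = mean_absolute_error(y_true, y_pred)   # mean of absolute errors
```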
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
● Solving linear regression problems containing more than one independent feature using
MLR.
Outcome:
Students will be familiarized with Multiple Linear Regression for solving linear regression problems.
Problem Statement:
Background Study:
Multiple Linear Regression attempts to model the relationship between two or more features and a
response by fitting a linear equation to observed data. statsmodels is a Python module that
provides classes and functions for the estimation of many different statistical models, as well as
for conducting statistical tests and statistical data
exploration.
Question Bank:
1. What is MLR?
MLR stands for Multiple Linear Regression. It's a statistical technique in machine learning
where the relationship between multiple independent variables and a single dependent
variable is modeled using a linear equation.
Simple Linear Regression (SLR) involves modeling the relationship between a single
independent variable and a dependent variable using a linear equation. Multiple Linear
Regression (MLR), on the other hand, extends this to include multiple independent
variables, providing a more complex model to capture nuanced relationships.
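The extension from one feature to several can be sketched with scikit-learn's LinearRegression. The data is synthetic and noise-free, generated from y = 4 + 1.5*x1 - 0.5*x2, so the fit should recover those coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with two independent features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 2))
y = 4.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1]

# One coefficient is estimated per feature, plus a single intercept.
model = LinearRegression().fit(X, y)
coefficients = model.coef_
intercept = model.intercept_
```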
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 8
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Students will be familiarized with handling of regression data having non-linear relationship
between input and output.
Problem Statement:
Background Study:
Polynomial Regression is a form of linear regression in which the relationship between the
independent variable x and dependent variable y is modeled as an nth degree polynomial.
Polynomial regression fits a nonlinear relationship between the value of x and the corresponding
conditional mean of y, denoted E(y|x). It is still regarded as linear regression because the
model is linear in its coefficients. statsmodels is a module that provides classes and
functions for the estimation of many different statistical models, as well as for conducting statistical
tests and statistical data exploration.
Question Bank:
A non-linear relationship between input and output indicates that the change in the output
variable is not proportional to the change in the input variable. The pattern or trend
between the variables does not follow a straight line, but rather a curved or nonlinear
pattern.
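A curved relationship of this kind can be handled by adding polynomial terms before fitting a linear model. The sketch below uses scikit-learn's PolynomialFeatures in a pipeline; the quadratic data and the choice of degree 2 are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following the curve y = 0.5x^2 - x + 2.
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + 2.0

# A straight line cannot follow the curve.
linear_fit = LinearRegression().fit(x, y)

# Adding the x^2 term lets a linear model fit the curve.
poly_fit = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

# R^2 on the training data for both models.
linear_r2 = linear_fit.score(x, y)
poly_r2 = poly_fit.score(x, y)
```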
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Students will be familiarized with Logistic Regression and performance metrics to calculate its
performance on the given dataset.
Problem Statement:
Background Study:
Question Bank:
Logistic Regression is a statistical method used for binary classification. It models the
probability that a dependent variable belongs to a particular category, employing the
logistic function to constrain predictions between 0 and 1.
The sigmoid function (logistic function) is used in Logistic Regression because it maps any
real-valued number to the open interval (0, 1). This is crucial for interpreting the output as
a probability in binary classification, where the outcome is either 0 or 1. The sigmoid
function ensures a smooth transition between the two classes.
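The role of the sigmoid can be checked directly: applying it to the linear score of a fitted model should reproduce the predicted probabilities. The sketch below assumes a hypothetical hours-studied versus pass/fail dataset and scikit-learn's LogisticRegression with default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical data: hours studied and whether the student passed.
hours = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# Probability of class 1 as reported by the model.
proba = clf.predict_proba(hours)[:, 1]

# The same probabilities, obtained by applying the sigmoid to the linear score.
manual = sigmoid(clf.decision_function(hours))
```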
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 10
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Problem Statement:
Background Study:
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification as
well as regression problems.
● Lazy learning algorithm: KNN is a lazy learning algorithm because it does not have a specialized training phase.
● Non-parametric learning algorithm: KNN is also a non-parametric learning algorithm because it doesn’t assume
anything about the underlying data.
K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new
data points, which means that a new data point is assigned a value based on how
closely it matches the points in the training set. The most common measure used for
matching is the Euclidean distance between the points.
The sklearn library provides a layer of abstraction on top of Python. Therefore, in order to
make use of the KNN algorithm, it’s sufficient to create an instance of KNeighborsClassifier. By
default, the KNeighborsClassifier looks for the 5 nearest neighbors and uses the Minkowski metric
with p = 2, which is equivalent to Euclidean distance. The distance therefore does not have to be
set explicitly, although metric='euclidean' can be passed to make the choice visible.
Question Bank:
KNN (K-Nearest Neighbors) classifier is a supervised machine learning algorithm used for
classification. It assigns a data point to the majority class among its k nearest neighbors in
the feature space.
KNN calculates Euclidean distance between data points in the feature space. It measures
the straight-line distance between two points, considering each feature as a dimension.
Smaller Euclidean distances indicate closer proximity, and the k nearest neighbors are
determined based on these distances for classification.
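A minimal sketch of the classifier on two well-separated, made-up clusters is shown below. The value k = 3 is an assumption for this tiny dataset, and the metric is set explicitly only for readability.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two clusters in a two-dimensional feature space.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Each new point receives the majority class among its 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

pred = knn.predict(np.array([[1.2, 1.5], [6.8, 6.2]]))
```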
4. What are the other distances that can be used for nearest neighbor?
Manhattan Distance (L1 Norm): Sum of absolute differences along each dimension.
Hamming Distance: Suitable for categorical data, counts the number of positions at
which corresponding elements are different.
Cosine Similarity: Measures the cosine of the angle between vectors, often used for
high-dimensional data or text classification.
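The distance measures listed above can be computed by hand with NumPy, with Euclidean distance included for comparison. The vectors are made up for illustration.

```python
import numpy as np

# Two hypothetical numerical points; their difference is (-3, -4, 0).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences
cosine_similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two hypothetical categorical records that differ in one position.
u = np.array(["red", "small", "round"])
v = np.array(["red", "large", "round"])
hamming = int(np.sum(u != v))               # number of differing positions
```

Cosine similarity is a similarity rather than a distance, so 1 minus the similarity is what is normally used when a distance is required.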
5. What are the various performance metrics used for classification problems?
Recall (Sensitivity): Proportion of true positive predictions among all actual positives.
Area Under the Receiver Operating Characteristic Curve (AUC-ROC): Measures the
classifier's ability to distinguish between classes.
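These metrics can be computed with sklearn.metrics, as in the sketch below. The labels, predictions and scores are made up; recall is derived from hard predictions, while AUC-ROC needs the predicted scores or probabilities.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

# Hypothetical true labels, hard predictions and predicted scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.9, 0.4, 0.7])

# For binary labels the matrix is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

recall = recall_score(y_true, y_pred)   # TP / (TP + FN)
auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
```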
EXPERIMENT NO. 11
Student Name and Roll Number: Kunal Verma 21CSU393
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Problem Statement:
Background Study:
Bayes’ theorem gives us the probability of event A happening given that event B has occurred.
The receiver operating characteristic curve, computed in scikit-learn by roc_curve, is a plot that shows the
discriminating ability of a binary classifier. It plots the true positive rate against
the false positive rate at different thresholds.
Question Bank:
1. What is Bayes theorem?
Bayes' Theorem is a mathematical formula that describes the probability of an event based
on prior knowledge of conditions that might be related to the event. It is expressed as P(A|B)
= P(B|A) * P(A) / P(B), where A and B are events, and P(A|B) is the conditional probability of
event A given that event B has occurred.
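A worked example makes the formula concrete. All the probabilities below are hypothetical: A is "the email is spam" and B is "the email contains a particular word". P(B) is obtained from the law of total probability before Bayes' theorem is applied.

```python
# Hypothetical probabilities for a spam-filtering example.
p_spam = 0.2              # P(A): prior probability that an email is spam
p_word_given_spam = 0.6   # P(B|A): the word appears in a spam email
p_word_given_ham = 0.05   # P(B|not A): the word appears in a non-spam email

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
p_spam_given_word = p_word_given_spam * p_spam / p_word
```

With these numbers P(B) = 0.12 + 0.04 = 0.16 and P(A|B) = 0.12 / 0.16 = 0.75, so seeing the word raises the probability of spam from 0.2 to 0.75.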
The Naïve Bayes classifier uses Bayes' Theorem to predict the probability of a data point
belonging to a particular class. It assumes independence between features, simplifying
computations. By calculating conditional probabilities, it efficiently handles classification
problems, especially in natural language processing and spam filtering.
3. What is the condition on features that should be fulfilled for successful application of Naïve
Bayes method?
Naïve Bayes assumes feature independence given the class label. For successful
application, features should be conditionally independent, meaning that the presence of one
feature does not provide information about the presence or absence of other features, given
the class label.
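A minimal sketch of a Naïve Bayes classifier together with its ROC curve follows. It assumes continuous features and therefore uses scikit-learn's GaussianNB; the two-class data is synthetic, with each feature drawn independently, which matches the independence assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic data: two classes whose features are drawn independently.
rng = np.random.default_rng(7)
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Fit the classifier and obtain the predicted probability of class 1.
nb = GaussianNB().fit(X, y)
proba = nb.predict_proba(X)[:, 1]

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y, proba)
accuracy = nb.score(X, y)
```

The fpr and tpr arrays can be passed to a plotting library to draw the ROC curve; evaluation here is on the training data only to keep the sketch short.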
Student Work Area
Algorithm/Flowchart/Code/Sample Outputs
EXPERIMENT NO. 12
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Problem Statement:
Background Study:
1. What is SVM?
SVM, or Support Vector Machine, is a supervised machine learning algorithm used for
classification and regression tasks. It finds the optimal hyperplane that maximally
separates data points of different classes in the feature space.
Effective in cases where the number of features is greater than the number of samples.
Support vectors are the data points in a dataset that are crucial for defining the optimal
hyperplane in a Support Vector Machine (SVM). These are the points closest to the
decision boundary, influencing the positioning and orientation of the hyperplane, and
ultimately determining the separation between different classes.
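Support vectors can be inspected directly on a fitted model. The sketch below assumes a linear kernel and a small, linearly separable, made-up dataset; support_vectors_ holds the training points that define the margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable data.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel finds a separating hyperplane (a line in two dimensions).
svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Training points that lie on or inside the margin.
support_vectors = svm.support_vectors_

pred = svm.predict(np.array([[1.5, 1.5], [5.5, 5.5]]))
```

Only the points closest to the boundary appear in support_vectors_; moving any of the other points slightly would leave the hyperplane unchanged.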
Link to Code:
Date:
Faculty Signature:
Grade:
Objective(s):
Outcome:
Problem Statement:
Decision tree analysis is a predictive modelling tool that can be applied across many
areas. Decision trees are constructed by an algorithmic approach that splits the
dataset in different ways based on different conditions. Decision trees are among the most
powerful algorithms in the category of supervised learning.
They can be used for both classification and regression tasks. The two main entities of a
tree are decision nodes, where the data is split, and leaves, where we get the outcome.
Question Bank:
A decision tree is a tree-like model used for both classification and regression in machine
learning. It recursively splits the dataset into subsets based on the most significant attribute,
creating a tree structure of decisions to reach a final prediction or decision.
A decision tree is created by recursively splitting the dataset based on features that best
separate the data into homogeneous subsets. The splitting process continues until a stopping
criterion is met, such as a predefined tree depth or a minimum number of samples in a leaf
node. Each node in the tree represents a decision based on a feature, leading to the creation
of a hierarchical structure that facilitates classification or regression based on the learned
rules.
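The splitting process and the stopping criteria can be sketched with scikit-learn's DecisionTreeClassifier. The features (age, owns_car), the labels and the max_depth value are hypothetical; export_text prints the learned rules as text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: in this toy set the label depends only on age.
X = np.array([[22, 0], [25, 1], [47, 0], [52, 1],
              [46, 1], [56, 0], [23, 1], [60, 0]])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# Stopping criteria: at most 2 levels of splits, at least 1 sample per leaf.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1, random_state=0)
tree.fit(X, y)

# Human-readable form of the learned decision rules.
rules = export_text(tree, feature_names=["age", "owns_car"])

pred = tree.predict(np.array([[30, 1], [50, 0]]))
```

Printing rules shows which feature and threshold were chosen at each decision node, which is what makes a decision tree easy to interpret.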
Advantages:
I. Interpretable and easy to understand.
II. Requires minimal data preprocessing.
III. Handles both numerical and categorical data.
IV. Implicitly performs feature selection.
Disadvantages:
I. Prone to overfitting, especially on complex datasets.
Introduction to AI and ML
CSL236
Project Report
Semester: 5th
Group: CC - A
Table of Contents
S.No. Page No.
1. Project Description
2. Problem Statement
3. Analysis
4. Design
6. Output (Screenshots)