6.2 Unit-2 ML Handouts

The document consists of lecture handouts from Muthayammal Engineering College on Machine Learning, specifically focusing on Decision Tree Learning and Ensemble Learning. It covers topics such as constructing decision trees, recursive induction, splitting attributes using entropy and information gain, and the implications of Occam's razor in machine learning. Each section includes prerequisite knowledge, detailed content, and resources for further learning.


MUTHAYAMMAL ENGINEERING COLLEGE

(An Autonomous Institution)


(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-10
LECTURE HANDOUTS

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:

Topic of Lecture: Representing concepts as decision trees

Introduction: (Maximum 5 sentences): Decision trees are among the most powerful and popular tools for
classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label.

Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four
important topics)
Theory of computation basics
Non-Deterministic Finite Automata
Deterministic Finite Automata
Detailed content of the Lecture:
 Construction of Decision Tree: A tree can be “learned” by splitting the source set into
subsets based on an attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning.
 The recursion is completed when all examples in the subset at a node have the same value of the target
variable, or when splitting no longer adds value to the predictions.
 The construction of a decision tree classifier does not require any domain knowledge or
parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision
trees can handle high-dimensional data.
 In general, decision tree classifiers have good accuracy. Decision tree induction is a typical
inductive approach to learning classification knowledge.
 Decision Tree Representation: Decision trees classify instances by sorting them down the
tree from the root to some leaf node, which provides the classification of the instance.
 An instance is classified by starting at the root node of the tree, testing the attribute specified by
this node, then moving down the tree branch corresponding to the value of the attribute as
shown in the above figure.
 This process is then repeated for the subtree rooted at the new node. The decision tree in the
figure above classifies a particular morning according to whether it is suitable for playing
tennis and returns the classification associated with the particular leaf (in this case Yes or
No). An instance such as (Outlook = Sunny, Humidity = High), for example, would be sorted down the
leftmost branch of this decision tree and would therefore be classified as a negative instance. A
small sketch of this classification procedure is given after the list below.
 In other words, we can say that the decision tree represents a disjunction of conjunctions of
constraints on the attribute values of instances.
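The sorting procedure described above can be written as a short recursive function. The following is a
minimal Python sketch assuming the tree is stored as nested dictionaries; the PlayTennis-style tree and
attribute names are illustrative assumptions, not taken verbatim from the handout's figure.

def classify(tree, instance):
    # Leaf nodes are stored as plain class labels; internal nodes as dicts.
    if not isinstance(tree, dict):
        return tree
    attribute = next(iter(tree))       # attribute tested at this node
    value = instance[attribute]        # outcome of the test for this instance
    return classify(tree[attribute][value], instance)

# Illustrative PlayTennis-style tree (assumed structure).
play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}
print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "High"}))  # No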
Gini Index:
 Gini Index is a score that evaluates how accurate a split is among the classified groups. The Gini
index takes values between 0 and 1, where 0 means all observations in a group belong to one class (a
pure split) and values close to 1 mean the elements are distributed randomly across classes.
 In this case, we want the Gini index score to be as low as possible. The Gini index is the evaluation
metric we shall use to evaluate our decision tree model; a short computation sketch follows below.
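A minimal sketch of how the Gini score of a candidate split could be computed (the function names and
toy labels are illustrative, not part of the handout):

from collections import Counter

def gini(labels):
    # Gini impurity of one group: 0 when every label in the group is the same.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(groups):
    # Weighted Gini impurity of a candidate split into several groups.
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini(group) for group in groups)

# Toy labels: a split that separates the classes well gets a low score.
left, right = ["Yes", "Yes", "Yes"], ["No", "No", "Yes"]
print(gini_of_split([left, right]))   # about 0.22; lower is better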
Video Content / Details of website for further learning (if any):
https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=IpGxLWOIZy4

Important Books/Journals for further learning including the page nos.:


Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-11
LECTURE HANDOUTS

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:

Topic of Lecture: Recursive induction of decision trees

Introduction: (Maximum 5 Sentences): Construction of Decision Tree: A tree can be “learned” by


splitting the source set into subsets based on an attribute value test. This process is repeated on each
derived subset in a recursive manner called recursive partitioning.
Prerequisite knowledge for Complete understanding and learning of Topic: (Max.
Four important topics)
Supervised learning basics
Unsupervised learning basics
Reinforcement Learning basics
Detailed content of the Lecture:
 Decision Tree is a supervised learning method used in data mining for classification and
regression methods. It is a tree that helps us in decision-making purposes.
 The decision tree creates classification or regression models as a tree structure. It separates a data
set into smaller subsets, and at the same time, the decision tree is steadily developed. The final
tree is a tree with the decision nodes and leaf nodes.
 A decision node has at least two branches. The leaf nodes show a classification or decision; we
cannot accomplish further splits on leaf nodes. The uppermost decision node in a tree, which relates to
the best predictor, is called the root node. Decision trees can deal with both categorical and
numerical data.
Decision tree Algorithm:
 The decision tree algorithm may appear long, but the basic algorithmic technique is quite simple.
The algorithm is based on three parameters: D, attribute_list, and Attribute_selection_method.

Generally, we refer to D as a data partition.


 Initially, D is the entire set of training tuples and their associated class labels (input training data).
 The parameter attribute_list is the set of attributes describing the tuples. Attribute_selection_method
specifies a heuristic procedure for choosing the attribute that "best" discriminates the given tuples
according to class; this procedure applies an attribute selection measure. A simplified sketch of the
recursive algorithm is given below.
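A minimal Python sketch of this recursive induction, assuming D is a list of (attribute-dictionary,
class-label) pairs and Attribute_selection_method is passed in as a function; the structure is an
illustrative simplification, not the exact algorithm from the handout:

from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    # D is a list of (attributes_dict, class_label) training tuples.
    labels = [label for _, label in D]
    # Stopping criteria: the partition is pure, or no attributes remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]   # majority class
    # The attribute selection measure picks the "best" splitting attribute.
    best = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != best]
    tree = {best: {}}
    # Recursive partitioning: grow one subtree per value of the chosen attribute.
    for value in {attributes[best] for attributes, _ in D}:
        subset = [(attrs, label) for attrs, label in D if attrs[best] == value]
        tree[best][value] = generate_decision_tree(
            subset, remaining, attribute_selection_method)
    return tree

In practice, the attribute selection method would score each candidate attribute on D (for example with
information gain or the Gini index) and return the best-scoring one.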
Advantages of using decision trees:
 A decision tree does not need scaling of the data.

 Missing values in the data also do not influence the process of building a decision tree to any
considerable extent.
Video Content/Details of website for further learning (if any):
https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=ukzFI9rgwfU
Important Books/Journals for further learning including the page nos.:
Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-12
LECTURE HANDOUTS

IT IV/VII-A
Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:
Topic of Lecture: Picking the best splitting attribute entropy and information gain

Introduction: (Maximum 5 Sentences) A decision tree is a supervised learning algorithm used for
both classification and regression problems. Simply put, it takes the form of a tree with branches
representing the potential answers to a given question. There are metrics used to train decision trees.
One of them is information gain. In this article, we will learn how information gain is computed, and
how it is used to train decision trees.
Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four
important topics)
Concepts of Supervised Learning
Application of supervised learning
Detailed content of the Lecture:
 Decision trees are one of the predictive modeling approaches used in machine learning. As a
predictive model, a decision tree travels from observations about an item (represented by the branches)
to inferences about the item's target value (represented by the leaves).
 A decision tree's main idea is to locate the features that contain the most information about the
target feature and then split the dataset along their values. The feature that best removes
uncertainty about the target feature is the most informative. The
search for the most informative attribute continues until all we are left with are pure leaf nodes.

Decision Tree Terminologies


 Root Node: Represents the entire sample. This will further get divided into two or
more homogeneous sets.
 Decision Node: Nodes branched from the root node are decision nodes.
 Branch: A section of the tree formed by splitting.
 To summarize, the inputs are routed through the root node of every tree. This root node is
further segmented into decision nodes that are conditionally dependent on results and
observations.
 The process of splitting a single node into many nodes is known as splitting. A leaf node, also
known as a terminal node, is a node that does not split into other nodes. A branch,
sometimes known as a sub-tree, is a section of a decision tree. Pruning, the removal of sub-nodes
from a decision node, is the concept diametrically opposite to splitting.
 Decision trees classify cases by sorting them from the root to some leaf/terminal node, with
the leaf/terminal node categorizing the example. Each node in the tree is a test case for an
attribute, and each edge descending from it represents one of the test case's possible outcomes.
This recursive procedure is carried out for each subtree rooted at a new node.

 Entropy is a measure of disorder (impurity): a high degree of disorder indicates a high level of
impurity, i.e., a low level of purity. For two classes entropy ranges from 0 to 1; it can exceed 1
when more groups or classes are present in the data collection, but its meaning is the same. A short
computation of entropy and information gain follows below.
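A minimal sketch of how entropy and information gain could be computed for one attribute; the toy
weather-style rows and column names are illustrative assumptions, not data from the handout:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a collection of class labels; 0 means the collection is pure.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Reduction in entropy of `target` obtained by splitting on `attribute`.
    base = entropy([row[target] for row in rows])
    n = len(rows)
    gain = base
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]
print(information_gain(data, "outlook", "play"))   # 1.0 for this toy split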

Video Content/Details of website for further learning (if any):


https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxKSK2a0
Important Books/Journals for further learning including the page nos.:
Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-13
LECTURE HANDOUTS

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:
Topic of Lecture: Searching for simple trees and computational complexity
Introduction: (Maximum 5 Sentences): The time complexity for creating a tree is O(1). The time
complexity for searching, inserting or deleting a node depends on the height of the tree h, so the worst
case is O(h) in case of skewed trees.

Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four
important topics)
Learning a single class
Concepts of supervised learning

Detailed content of the Lecture:


 Binary Tree: In a binary tree, a node can have at most two children. Consider the left-
skewed binary tree shown in Figure 1.

 Searching: For searching element 2, we have to traverse all elements (assuming we do


breadth first traversal). Therefore, searching in binary tree has worst case complexity of O(n).
 Insertion: For inserting element as left child of 2, we have to traverse all elements.
Therefore, insertion in binary tree has worst case complexity of O(n).
 Deletion: For deletion of element 2, we have to traverse all elements to find 2 (assuming we
do breadth first traversal). Therefore, deletion in binary tree has worst case complexity of
O(n).
 Binary Search Tree (BST): BST is a special type of binary tree in which left child of a node has
value less than the parent and right child has value greater than parent. Consider the left skewed
BST shown in Figure 2.
 Searching: For searching element 1, we have to traverse all elements (in the order 3, 2, 1).
Therefore, searching in a binary search tree has a worst-case complexity of O(n); in general, the time
complexity is O(h), where h is the height of the BST (a small code sketch of this worst case follows
the list below).
 Insertion: For inserting element 0, it must be inserted as left child of 1. Therefore, we need
to traverse all elements (in order 3, 2, 1) to insert 0 which has worst case complexity of
O(n). In general, time complexity is O(h).
 Deletion: For deletion of element 1, we have to traverse all elements to find 1 (in order 3, 2, 1).
Therefore, deletion in binary tree has worst case complexity of O(n). In general, time
complexity is O(h).
 AVL / Height-Balanced Tree: An AVL tree is a binary search tree with the additional property that the
difference between the heights of the left and right sub-trees of any node cannot be more than 1. For
example, the BST shown in Figure 2 is not AVL, as the height difference at node 3 is 2. However, the
BST shown in Figure 3 is an AVL tree.
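The O(n) worst case on a skewed tree described above can be demonstrated with a small sketch; the
left-skewed BST built from the keys 3, 2, 1 mirrors the Figure 2 example, and the helper names are
illustrative:

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # Standard BST insert; cost is proportional to the height h of the tree.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search(root, key, comparisons=0):
    # Returns (found, comparisons); worst case O(h), which is O(n) when skewed.
    if root is None:
        return False, comparisons
    if key == root.key:
        return True, comparisons + 1
    child = root.left if key < root.key else root.right
    return search(child, key, comparisons + 1)

# Inserting 3, 2, 1 produces a left-skewed BST, so searching for 1 visits
# every node: the O(n) worst case.
root = None
for k in (3, 2, 1):
    root = insert(root, k)
print(search(root, 1))   # (True, 3)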

Video Content/Details of website for further learning (if any):


https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxKSK2a0
Important Books/Journals for further learning including the page nos.:
Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-14
LECTURE HANDOUTS

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:
Topic of Lecture: Occam's razor, overfitting, noisy data, and pruning

Introduction: (Maximum 5 Sentences) : Ockham's razor (also spelled Occam's razor, pronounced AHK-
uhmz RAY-zuhr) is the idea that, in trying to understand something, getting unnecessary information out of the
way is the fastest way to the truth or to the best explanation.

Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four important topics)
Classification
Regression
Detailed content of the Lecture:
 Many philosophers throughout history have advocated the idea of parsimony. One of the
greatest Greek philosophers, Aristotle, went so far as to say, "Nature operates in the shortest
way possible". As a consequence, humans may be biased to choose a simpler explanation from the set of
all possible explanations with the same descriptive power. This section gives a brief overview of
Occam's razor, the relevance of the principle, and ends with a note on the usage of this razor as an
inductive bias in machine learning (decision tree learning in particular).
What is Occam’s razor?
 Occam’s razor is a law of parsimony popularly stated as (in William’s words) “Plurality must
never be posited without necessity”. Alternatively, as a heuristic, it can be viewed as, when
there are multiple hypotheses to solve a problem, the simpler one is to be preferred.
 It is not clear to whom this principle can be conclusively attributed, but William of
Occam's (c. 1287 – 1347) preference for simplicity is well documented. Hence this principle
goes by the name "Occam's razor". Applying it often means cutting off or shaving away other
possibilities or explanations, which is why "razor" is appended to the name of the principle. It should
be noted that these explanations or hypotheses should lead to the same result.
Relevance of Occam’s razor.
 There are many findings that favor a simpler approach, either as an inductive bias or as a constraint
to begin with. Some of them are:
 Studies have suggested that preschoolers are sensitive to
simpler explanations during their initial years of learning and development.
 Preference for a simpler approach and explanations to achieve the same goal is seen in various
facets of sciences; for instance, the parsimony principle applied to the understanding of
evolution.
 The information gain of every attribute (which is not already included in the tree) is calculated to
infer which attribute should be considered as the next node. Information gain is the essence of the ID3
algorithm. It gives a quantitative measure of the information that an attribute can provide about
the target variable, i.e., assuming only information about that attribute is available, how efficiently
we can infer the target. It can be defined as:
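Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v), where Values(A) is the set of
possible values of attribute A and S_v is the subset of S for which attribute A takes value v (the
standard ID3 definition).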

 There can be many decision trees that are consistent with a given set of training
examples, but the inductive bias of the ID3 algorithm results in a preference for simpler (or
shorter) trees.

 This preference bias of ID3 arises from the fact that there is an ordering of the hypotheses in
its search strategy. It leads to the additional bias that attributes with high information gain are
placed closer to the root. Therefore, there is a definite order the algorithm follows until it
terminates on reaching a hypothesis that is consistent with the training data.

Video Content/Details of website for further learning (if any):


https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxKSK2a0
Important Books/Journals for further learning including the page nos.:
Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-15
LECTURE HANDOUTS

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning


Date of Lecture:
Topic of Lecture: Ensemble Learning, Active learning with ensembles

Introduction: (Maximum 5 Sentences): Ensemble learning is the process by which multiple models, such as
classifiers or experts, are strategically generated and combined to solve a particular computational
intelligence problem. Ensemble learning is primarily used to improve the (classification, prediction,
function approximation, etc.) performance of a model.

Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four
important topics)
Concepts of Supervised Learning
Application of supervised learning
Probability and Inference

Detailed content of the Lecture:


 Ensemble learning is a general meta approach to machine learning that seeks better
predictive performance by combining the predictions from multiple models.
 Although there are a seemingly unlimited number of ensembles that you can develop for your
predictive modelling problem, there are three methods that dominate the field of ensemble
learning. So much so, that rather than algorithms per se, each is a field of study that has
spawned many more specialized methods.
 The three main classes of ensemble learning methods are bagging, stacking, and boosting, and
it is important to both have a detailed understanding of each method and to consider them on
your predictive modelling project.
 But before that, you need a gentle introduction to these approaches and the key ideas behind
each method prior to layering on math and code. In this tutorial, you will discover the three
standard ensemble learning techniques for machine learning; a brief code sketch of all three follows
the summary below.
After completing this tutorial, you will know:
 Bagging involves fitting many decision trees on different samples of the same dataset
and averaging the predictions.
 Stacking involves fitting many different model types on the same data and using another
model to learn how to best combine the predictions.
 Boosting involves adding ensemble members sequentially that correct the predictions made
by prior models and outputs a weighted average of the predictions.
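A brief scikit-learn sketch of all three approaches (this assumes scikit-learn is available; the toy
dataset, model choices, and parameters are illustrative, not prescribed by the handout):

from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

models = {
    # Bagging: many trees fit on bootstrap samples, predictions averaged/voted.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: members added sequentially to correct earlier mistakes.
    "boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: different model types combined by a meta-learner.
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()),
                    ("logreg", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())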
Video Content/Details of website for further learning(if any):
https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxKSK2a0

Important Books/Journals for further learning including the page nos.:


Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-16
LECTURE HANDOUTS

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:
Topic of Lecture: Measuring the accuracy of learned hypotheses
Introduction: (Maximum 5 Sentences): Ensemble learning refers to the process of combining multiple
models, such as classifiers or experts into a committee, in order to solve a computational problem. The
main objective of using ensemble learning is to improve model performance, such as classification
and prediction accuracy.
Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four
important topics)
 Concepts of Supervised Learning
 Bayes rule
Detailed content of the Lecture:
 Labeled data can be expensive to acquire in several application domains, including medical
imaging, robotics, and computer vision.
 To efficiently train machine learning models under such high labeling costs, active learning
(AL) judiciously selects the most informative data instances to label on-the-fly. This active
sampling process can benefit from a statistical function model, that is typically captured by a
Gaussian process (GP).
 While most GP-based AL approaches rely on a single kernel function, the present contribution
advocates an ensemble of GP models (EGP) with weights adapted to the labeled data collected
incrementally. Building on this EGP model, a suite of acquisition functions emerges based
on the uncertainty and disagreement rules.
 An adaptively weighted ensemble of EGP-based acquisition functions is also introduced to
further robustify performance. Extensive tests on synthetic and real datasets showcase the merits
of the proposed EGP-based approaches with respect to the single GP-based AL alternatives.
 Active Learning (AL) is an emerging field of machine learning focusing on creating a closed loop of
learner (statistical model) and oracle (expert able to label examples) in order to exploit the vast
amounts of accessible unlabeled datasets in the most effective way from the classification point of
view.
 This paper analyzes the problem of multiclass active learning methods and proposes to approach it
in a new way through substitution of the original concept of predefined utility function with an
ensemble of learners.
 As opposed to known ensemble methods in AL, where learners vote for a particular example, we
use them as black-box mechanisms for which we try to model the current competence value using an
adaptive training scheme.
 We show that modeling this problem as a multi-armed bandit problem and applying even very basic
strategies brings significant improvement to the AL process. A generic sketch of the learner-oracle
loop that AL relies on is given below.
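The learner-oracle loop can be illustrated with a generic uncertainty-sampling sketch; this is not the
EGP or bandit method described above, and the dataset, model, and query budget are assumptions made
for the example:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Seed set: a few labeled examples per class; the rest form the unlabeled pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                                  # 20 querying rounds
    model.fit(X[labeled], y[labeled])
    # Uncertainty sampling: query the pool instance the learner is least sure about.
    probabilities = model.predict_proba(X[pool])
    uncertainty = 1.0 - probabilities.max(axis=1)
    query = pool[int(np.argmax(uncertainty))]
    labeled.append(query)                            # the oracle reveals its label
    pool.remove(query)

print("accuracy after active learning:", model.score(X, y))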
Video Content/Details of website for further learning(if any):
https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxKSK2a0

Important Books/Journals for further learning including the page nos.:


Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

LECTURE HANDOUTS L-17

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:
Topic of Lecture: Comparing learning algorithms: cross validation
Introduction:(Maximum 5 Sentences):
This is made clear by distinguishing between the true error of a model and the estimated or sample
error. One is the error rate of the hypothesis over the sample of data that is available. The other is the
error rate of the hypothesis over the entire unknown distribution D of examples.

Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four important topics)
 Concepts of Supervised Learning
 Application of supervised learning

Detailed content of the Lecture:

 Depending on the nature of the particular set of test examples, even if the hypothesis
accuracy is measured over an unbiased set of test instances independent of the training examples,
the measured accuracy can still differ from the true accuracy. The expected variance
increases as the number of test examples decreases.
 When evaluating a learned hypothesis, we want to know how accurate it will be at classifying
future instances, and also to know the likely error in that accuracy estimate. There is some space X of
possible instances over which target functions may be defined, and we presume that different instances
of X will be encountered with different frequencies.
The following two questions are of particular relevance to us in this context:

1. Given a hypothesis h and a data sample containing n examples picked at random according to the
distribution D, what is the best estimate of the accuracy of h over future instances drawn from the
same distribution?
2. Given that one hypothesis outperforms another over this sample of data, how probable is it that
this hypothesis is also more accurate in general?
True Error and Sample Error:
We must distinguish between two notions of accuracy or, to put it another way, of error. One is the
hypothesis's error rate over the available data sample; the other is its error rate over the
complete unknown distribution D of examples. These are referred to as the sample error and the true
error, respectively. The sample error of a hypothesis with respect to some sample S of examples drawn
from X is the fraction of S that the hypothesis misclassifies.
Sample Error:
The sample error of hypothesis h with respect to target function f and data sample S, denoted
error_S(h), is
error_S(h) = (1/n) · Σ_{x ∈ S} δ(f(x), h(x)),
where n is the number of examples in S and δ(f(x), h(x)) is 1 if f(x) != h(x) and 0 otherwise. A small
sketch of this computation follows below.
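A small sketch of this computation; the target function f, hypothesis h, and sample S below are
illustrative assumptions:

def sample_error(h, f, S):
    # Fraction of the sample S that hypothesis h misclassifies relative to f.
    return sum(1 for x in S if h(x) != f(x)) / len(S)

f = lambda x: x >= 5       # true target function (illustrative)
h = lambda x: x >= 6       # hypothesis that disagrees with f only at x = 5
S = list(range(10))
print(sample_error(h, f, S))   # 0.1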

Video Content/Details of website for further learning (if any):


https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxKSK2a0

Important Books/Journals for further learning including the page nos.:


Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
MUTHAYAMMAL ENGINEERING COLLEGE
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Accredited by NAAC & Affiliated to Anna University)
Rasipuram-637408, Namakkal Dist., TamilNadu

L-18
LECTURE HANDOUTS

IT IV/VII-A

Course Name with Code : Machine Learning - 19ITE33
Course Teacher : Mr. M. Dhamodaran

Unit : II- Decision Tree Learning and Ensemble Learning

Date of Lecture:
Topic of Lecture: Learning curves and statistical hypothesis testing

Introduction: (Maximum 5 Sentences): Cross-Validation is a statistical method of evaluating and
comparing learning algorithms by dividing data into two segments: one used to learn or train a model
and the other used to validate the model.
Prerequisite knowledge for Complete understanding and learning of Topic: (Max. Four important topics)
Concepts of Supervised Learning
Application of supervised learning
Detailed content of the Lecture:
 Cross-Validation (CV) is one of the key topics around testing your learning models. Although
the subject is widely known, there are still misconceptions about some of its aspects. When we
train a model, we split the dataset into two main sets: training and testing. The training set
represents all the examples that the model learns from, while the testing set simulates the unseen
examples it will be evaluated on, as in Figure 1.

 Figure 1: Typical dataset split into training and testing sets.

Why cross-validation?
 CV provides the ability to estimate model performance on unseen data not used while training.
 Data scientists rely on cross-validation for several reasons when building Machine Learning (ML)
models: for instance, tuning the model hyperparameters, testing different properties of the overall
dataset, and iterating the training process. Also, when the training dataset is small, the way it is
split into training, validation, and testing sets can significantly affect accuracy. The following main
points summarize the reasons we use CV, but they overlap, so the list is presented here in a simplified
way (a minimal k-fold example is given at the end of this section):

(1) Testing on unseen data


 One of the critical pillars of validating a learning model before putting it into production is
making accurate predictions on unseen data. Unseen data is any data that the model has never
learned from before. Ideally, testing data would flow directly to the model in many testing
iterations. However, in reality, access to such data is limited or not yet available in a new environment.
 The typical 80-20 rule of splitting data into training and testing can still be vulnerable to
accidentally ending up with a lucky split that boosts the measured accuracy while the model cannot
perform the same way in a real environment. Sometimes, the accuracy calculated this way is
mostly a matter of luck! The 80-20 split is not an actual rule per se, and you will find alternative
ratios that range between 25~30% for testing and 70~75% for training.

(2) Tuning model hyperparameters

Finding the best combination of model parameters is a common step in tuning an algorithm toward
learning the dataset's hidden patterns. However, doing this step on a simple training-testing split is
typically not recommended. Model performance is usually very sensitive to such parameters, and
adjusting them based on a single predefined dataset split should be avoided: it can cause the model to
overfit and reduce its ability to generalize.
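A minimal k-fold cross-validation sketch (assumes scikit-learn; the iris dataset and decision-tree
model are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5 folds: each fold is held out once for validation while the other 4 train.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores, scores.mean())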
Video Content/Details of website for further learning (if any):
https://lecturenotes.in/notes/24274-note-for-machine-learning-ml-by-new-swaroop
https://www.youtube.com/watch?v=WpxKSK2a0

Important Books/Journals for further learning including the page nos.:


Tom Mitchell, Machine Learning, Tata McGraw-Hill, 1997.

Course Teacher

Verified by HoD
