Module 01

Uploaded by

62.SHRUTI KAMBLE
Copyright
© © All Rights Reserved

Machine Learning (MU - Sem 6 - ECS & AIDS) (Introduction to Machine Learning)...Page no. (1-3)

1.1 MACHINE LEARNING

UQ. What is Machine learning? How is it different from data mining?
(Ref. MU (Comp.) - May 17, 5 Marks, May 19, 5 Marks)

UQ. Define Machine learning and explain with example the importance of Machine Learning.
(Ref. MU (Comp.) - Dec. 19, 5 Marks)

Machines that are as intellectually capable as humans have always attracted writers and early computer scientists who were excited about artificial intelligence and machine learning.

The first machine learning system was developed in the 1950s. In 1952, Arthur Samuel developed a program to play checkers. The program was able to observe positions in the game and learn a model that gives better moves for the machine player.

In 1957, Frank Rosenblatt designed the Perceptron, which is a simple classifier, but when many perceptrons are combined in a network it becomes a powerful tool.

In 1969, Minsky and Papert described the limitations of the perceptron. They showed that the XOR problem could not be represented by a single perceptron and that such linearly inseparable data distributions cannot be handled. Following this work, neural network research lay dormant until the 1980s.

Machine learning became very popular in the 1990s, due to the introduction of statistics. The combination of computer science and statistics led to probabilistic approaches in Artificial Intelligence. The area then shifted further towards data-driven techniques. As huge amounts of data became available, scientists started to design intelligent systems that are able to analyze and learn from data.

Machine learning is a category of Artificial Intelligence. In machine learning, computers have the ability to learn by themselves; explicit programming is not required.

Machine learning focuses on the study and development of algorithms that can learn from data and also make predictions on data.

Machine learning is defined by Tom Mitchell as: "A program learns from experience 'E' with respect to some class of tasks 'T' and performance measure 'P', if its performance on tasks in 'T' as measured by 'P' improves with 'E'." Here 'E' represents the past experienced data and 'T' represents tasks such as prediction, classification, etc. As an example of 'P', we might want to increase accuracy in prediction.

Machine learning mainly focuses on the design and development of computer programs that can teach themselves to grow and change when exposed to new data.

Using machine learning we can collect information from a dataset by asking the computer to make some sense of the data. Machine learning is turning data into information.

Data -> Computer Program -> Output

Fig. 1.1.1 : Machine Learning

Fig. 1.1.1 is the schematic representation of the ML system. The ML system takes the training data and background knowledge as the input. Background knowledge and data help the Learner program to provide a solution for a particular task or problem. The performance corresponding to the solution can also be measured. The ML system mainly comprises two components, a Learner and a Reasoner. The Learner uses the training data and background knowledge to build the model, and this model can be used by the Reasoner to provide the solution for a task.

Machine learning can be applied to many fields, from politics to geosciences. It is a tool that can be applied to many problems. Any application which needs to extract some information from data, and also take some action on that data, can benefit from machine learning methods.
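The Learner and Reasoner roles, together with Mitchell's E, T and P, can be illustrated with a very small sketch. Everything in it is an assumption made for illustration: the data are hypothetical email lengths, the "model" is a single threshold halfway between the two class means, and the function names `learner`, `reasoner` and `performance` are not from the text.

```python
from statistics import mean

def learner(training_data):
    """Build a model from labelled (value, label) pairs.

    The 'model' here is one threshold halfway between the two class means.
    """
    low = mean(x for x, label in training_data if label == "short")
    high = mean(x for x, label in training_data if label == "long")
    return (low + high) / 2

def reasoner(model, x):
    """Use the learned model to produce a solution for one new input."""
    return "long" if x > model else "short"

def performance(model, labelled_data):
    """Performance measure P: fraction of correct answers (accuracy)."""
    correct = sum(reasoner(model, x) == label for x, label in labelled_data)
    return correct / len(labelled_data)

# Experience E: hypothetical email lengths (in words) with known labels.
training_data = [(20, "short"), (35, "short"), (180, "long"), (240, "long")]
model = learner(training_data)   # (27.5 + 210) / 2 = 118.75

print(model)
print(reasoner(model, 150))      # task T: classify a new input
print(performance(model, [(10, "short"), (300, "long"), (100, "long")]))
```

Here the third test example (length 100, labelled "long") falls below the learned threshold, so two of three answers are correct. Giving the learner more experience would move the threshold and could change that score.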

(M6-131) Tech-Neo Publications...A SACHIN SHAH Venture


Some of the applications are spam filtering in email, face recognition, product recommendations from Amazon.com and handwritten digit recognition.

In detecting spam email, checking for the occurrence of a single word is not very helpful. But by checking the occurrences of certain words used together, and combining this with the length of the email and other parameters, you can get a much clearer idea of whether the email is spam or not.

Machine learning is used by most companies to increase productivity, forecast weather, improve business decisions, detect disease and do many more things.

Machine learning uses statistics. There are many problems where the solution is not deterministic. There are certain problems for which we do not have enough information, and also do not have enough computing power, to properly model the problem. For these problems we need statistics. An example of such a problem is the prediction of the motivation and behavior of humans. The behavior and motivation of humans is a problem that is currently very difficult to model.

[Fig. 1.1.2 : Schematic diagram of Machine Learning. Blocks: Data, Background Knowledge, Learner, Models, Reasoner, Problem/Task, Solution.]

Machine learning = Take data + understand it + process it + extract value from it + visualize it + communicate it

1.2 KEY TERMINOLOGY

UQ. What are the key tasks of Machine Learning?
(Ref. MU (Comp.) - May 16, 5 Marks)

Expert System

An expert system is a system which is developed using a training set, a testing set, knowledge representation, features, an algorithm and classification terminology.

(i) Training Set : A training set comprises training examples which will be used to train machine learning algorithms.
(ii) Testing Set : To test machine learning algorithms, what is usually done is to have a training set of data and a separate dataset, called a test set.
(iii) Knowledge Representation : Knowledge representation may be stored in the form of a set of rules. It may be an example from the training set or a probability distribution.
(iv) Features : Important properties or attributes.
(v) Classification : We classify the data based on features.

Process : Suppose we want to use a machine learning algorithm for classification. The next step is to train the algorithm, or allow it to learn. To train the algorithm we give it, as input, quality data called a training set.

Each training example has some features and one target variable. The target variable is what we will be trying to predict with our machine learning algorithms. In a training dataset the target variable is known. The machine learns by finding some relationship between the features and the target variable. In classification tasks the target variables are known as classes. It is assumed that there will be a limited number of classes.

The class or target variable that the training example belongs to is then compared to the predicted value, and we can get an idea about the accuracy of the algorithm.
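Comparing the known class of each example with the predicted value, as described above, can be sketched in a few lines. The labels below are hypothetical and the function name `accuracy` is our own choice.

```python
def accuracy(actual, predicted):
    """Fraction of examples whose predicted class equals the known class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Hypothetical known classes and an algorithm's predictions for five examples.
actual = ["spam", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "ham"]

print(accuracy(actual, predicted))  # 4 of the 5 predictions match
```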

Example : First we will see some terminology that is frequently used in machine learning methods. Let's take an example in which we want to design a classification system that will classify instances as either Acceptable or Unacceptable. This kind of system is a fascinating topic often related to machine learning, called expert systems.

Four features of various cars are stored in Table 1.2.1. The features or attributes selected are Buying_Price, Maintenance_Price, Lug_Boot and Safety. Each example in Table 1.2.1 represents a record comprising these features.

In Table 1.2.1 all the features are categorical in nature and take a limited number of disjoint values. The first two features represent the buying price and maintenance price of a car, such as high, medium and low. The third feature shows the luggage capacity of a car as small, medium or big. The fourth feature represents whether the car has safety measures or not, and takes the value low, medium or high.

Classification is one of the important tasks in machine learning. In this application we want to evaluate a car out of a group of other cars. Suppose we have all the information about a car's Buying_Price, Maintenance_Price, Lug_Boot and Safety. A classification method is used to evaluate a given car as Acceptable or Unacceptable. Many machine learning algorithms can be used for classification. The target or response variable in this example is the evaluation of a car.

Suppose we have selected a machine learning algorithm to use for classification. The main task in classification is to train the algorithm, or allow it to learn. The experienced data that we give as input to train the algorithm is called training data.

Let's assume the training dataset contains the 14 training records in Table 1.2.1. Each training record has four features and one target or response variable, as shown in Fig. 1.2.1. The machine learning algorithm is used to predict the target variable.

In a classification task the target variable takes a discrete value, and in a regression task its value can be continuous.

In a training dataset we have the value of the target variable. The relationship that exists between the features and the target variable is used by the machine for learning. Here the target variable is the evaluation of the car.

Classes are the target variables in a classification task. In classification systems it is assumed that the classes are limited in number.

Attributes or features are the individual values that, when combined with other features, make up a training example. These are usually the columns in a training or test set.

A training dataset and a testing dataset are used to test machine learning algorithms. First the training dataset is given as input to the program. The program uses this data to learn. Next, the test set is given to the program. The program decides which instance of the test data belongs to which class.

The predicted output is compared with the known class of each test instance, and we can get an idea about the accuracy of the algorithm.

There are better ways to use all the information in the training dataset and test dataset.

Assume that, in the car evaluation classification system, we have tested the program and it meets the desired level of accuracy.

Knowledge representation is used to check what the machine has learned. There are many ways in which knowledge can be represented.

We can use a set of rules or a probability distribution to represent the knowledge.

Some algorithms represent the knowledge in a form that is more interpretable to humans than others.
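As a rough illustration of the two knowledge representations just mentioned, the sketch below counts the records of Table 1.2.1 to obtain a probability distribution over classes for each Buying_Price value, and then turns that into a set of rules. The rows are our transcription of the scanned table (the twelfth row is partly reconstructed), and looking at a single feature is a simplification chosen for brevity, not the method of the text.

```python
from collections import Counter, defaultdict

# Rows of Table 1.2.1:
# (Buying_Price, Maintenance_Price, Lug_Boot, Safety, Evaluation)
ROWS = [
    ("High", "High", "Small", "High", "Unacceptable"),
    ("High", "High", "Small", "Low", "Unacceptable"),
    ("Medium", "High", "Small", "High", "Acceptable"),
    ("Low", "Medium", "Small", "High", "Acceptable"),
    ("Low", "Low", "Big", "High", "Acceptable"),
    ("Low", "Low", "Big", "Low", "Unacceptable"),
    ("Medium", "Low", "Big", "Low", "Acceptable"),
    ("High", "Medium", "Small", "High", "Unacceptable"),
    ("High", "Low", "Big", "High", "Acceptable"),
    ("Low", "Medium", "Big", "High", "Acceptable"),
    ("High", "Medium", "Big", "Low", "Acceptable"),
    ("Medium", "Medium", "Small", "Low", "Acceptable"),
    ("Medium", "High", "Big", "High", "Acceptable"),
    ("Low", "Medium", "Small", "Low", "Unacceptable"),
]

def class_distribution(rows, feature_index):
    """Estimate P(class | feature value) by counting training examples."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature_index]][row[-1]] += 1
    return {
        value: {cls: n / sum(c.values()) for cls, n in c.items()}
        for value, c in counts.items()
    }

# Knowledge as a probability distribution, per Buying_Price value.
dist = class_distribution(ROWS, 0)
# Knowledge as a set of rules: predict the most probable class for each value.
rules = {value: max(probs, key=probs.get) for value, probs in dist.items()}

for value in dist:
    print(f"Buying_Price={value}: {dist[value]} -> rule: {rules[value]}")
```

The rule form ("if Buying_Price is Medium then Acceptable") is easier for a person to read, while the distribution keeps the information about how certain each rule is.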

In some situations we may not want to build an expert system, but are interested only in the knowledge representation that is acquired from training a machine learning algorithm.

Table 1.2.1 : Car evaluation classification based on four features

Buying_Price   Maintenance_Price   Lug_Boot   Safety   Evaluation
High           High                Small      High     Unacceptable
High           High                Small      Low      Unacceptable
Medium         High                Small      High     Acceptable
Low            Medium              Small      High     Acceptable
Low            Low                 Big        High     Acceptable
Low            Low                 Big        Low      Unacceptable
Medium         Low                 Big        Low      Acceptable
High           Medium              Small      High     Unacceptable
High           Low                 Big        High     Acceptable
Low            Medium              Big        High     Acceptable
High           Medium              Big        Low      Acceptable
Medium         Medium              Small      Low      Acceptable
Medium         High                Big        High     Acceptable
Low            Medium              Small      Low      Unacceptable

Buying_Price   Maintenance_Price   Lug_Boot   Safety   Evaluation
Low            Low                 Big        High     Acceptable

Features : Buying_Price, Maintenance_Price, Lug_Boot, Safety
Target Variable : Evaluation

Fig. 1.2.1 : Features and target variable identified

1.3 TYPES OF MACHINE LEARNING

1.3.1 Supervised Learning

GQ. What is supervised learning?
GQ. Explain supervised learning with the help of an example.
GQ. How supervised learning works?

Learning that takes place based on a class of examples is referred to as supervised learning. It is learning based on labelled data. In short, while learning, the system has knowledge of a set of labelled data. This is one of the most common and frequently used learning methods.

The supervised learning method comprises a series of algorithms that build mathematical models of data sets that contain both the inputs and the desired outputs.

The data being input into the supervised learning method is known as training data, and essentially consists of training examples which contain one or more inputs and typically only one desired output. This output is known as a "supervisory signal."

In the training examples for the supervised learning method, each training example is represented by an array, also known as a vector or a feature vector, and the training data is represented by a matrix.

The algorithm uses the iterative optimization of an objective function to predict the output that will be associated with new inputs.

Ideally, if the supervised learning algorithm is working properly, the machine will be able to correctly determine the output for inputs that were not a part of the training data.

Supervised learning uses classification and regression techniques to develop predictive models. Classification techniques predict categorical responses. Regression techniques predict continuous responses, for example, changes in temperature or fluctuations in power demand. Typical applications include electricity load forecasting and algorithmic trading.

Let us begin by considering the simplest machine learning task : supervised learning for classification. Let us take an example of classification of documents. In this particular case a learner learns based on the available documents and their classes. This is also referred to as labelled data.
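The idea that each training example is a feature vector and the training data is a matrix can be made concrete with three rows of the car table. The numeric encodings below are assumptions (the text does not prescribe any coding), and the 1-nearest-neighbour rule in `predict` is used only because it is short; it stands in for whatever supervised algorithm is actually chosen.

```python
# Assumed ordinal encodings of the categorical values.
LEVELS = {"Low": 0, "Medium": 1, "High": 2}
LUG_BOOT = {"Small": 0, "Big": 1}

def to_feature_vector(buying, maintenance, lug_boot, safety):
    """Encode one car as a list of numbers (a feature vector)."""
    return [LEVELS[buying], LEVELS[maintenance], LUG_BOOT[lug_boot], LEVELS[safety]]

training_examples = [
    ("High", "High", "Small", "High", "Unacceptable"),
    ("Medium", "High", "Small", "High", "Acceptable"),
    ("Low", "Low", "Big", "Low", "Unacceptable"),
]
X = [to_feature_vector(*row[:4]) for row in training_examples]  # training matrix
y = [row[4] for row in training_examples]                       # desired outputs

def predict(x_new):
    """1-nearest-neighbour: return the label of the closest training vector."""
    distances = [sum((a - b) ** 2 for a, b in zip(x_new, x)) for x in X]
    return y[distances.index(min(distances))]

print(X)
# An input that was not part of the training data:
print(predict(to_feature_vector("Medium", "High", "Big", "High")))
```

The new car differs from the second training row only in Lug_Boot, so it receives that row's label.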
The program that can map the input documents to appropriate classes is called a classifier, because it assigns a class (i.e., document type) to an object (i.e., a document). The task of supervised learning is to construct a classifier given a set of classified training examples. A typical classification is depicted in Fig. 1.3.4.

Fig. 1.3.4 represents a hyperplane that has been generated after learning, separating two classes, class A and class B, into different parts. Each input point represents an input-output instance from the sample space. In the case of document classification, these points are documents.

[(1D1) Fig. 1.3.4 : Supervised learning. Points of Class A and Class B separated by a line.]

Learning computes a separating line or hyperplane among documents. The type of an unknown document will be decided by its position with respect to the separator.

There are a number of challenges in supervised classification, such as generalization, selection of the right data for learning, and dealing with variations. Labelled examples are used for training in the case of supervised learning. The set of labelled examples provided to the learning algorithm is called the training set.

Supervised learning is not just about classification; it is the overall process that, with guidelines, maps to the most appropriate decision.

1.3.1(A) How Supervised Learning Works?

In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a held-out subset of the labelled data that was not used for training), and then it predicts the output.

The working of supervised learning can be easily understood from the example and diagram below (Fig. 1.3.5).

[(1D2) Fig. 1.3.5 : How Supervised learning works? Labeled data (Square, Triangle, Hexagon) -> Model training; Test data -> Prediction (Square, Triangle).]

Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and polygon. Now the first step is that we need to train the model for each shape.
• If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides, and predicts the output.

Following are the steps involved in Supervised Learning :
• First determine the type of training dataset.
• Collect/gather the labelled training data.

• Split the labelled dataset into a training dataset, a test dataset, and a validation dataset.
• Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
• Determine the suitable algorithm for the model, such as support vector machine, decision tree, etc.
• Execute the algorithm on the training dataset. Sometimes we need validation sets to tune the control parameters; these are subsets of the training dataset.
• Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, our model is accurate.

Supervised learning can be further divided into two types of problems : Regression and Classification.

Regression

Regression algorithms are used when there is a relationship between the input variable and a continuous output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Below are some popular regression algorithms which come under supervised learning :
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression

Classification

Classification algorithms are used when the output variable is categorical, which means there are two or more classes, such as Yes-No, Male-Female, True-False, etc.
• Random Forest
• Logistic Regression
• Decision Trees
• Support Vector Machines

1.3.1(B) Advantages of Supervised Learning

(1) With the help of supervised learning, the model can predict the output on the basis of prior experiences.
(2) In supervised learning, we can have an exact idea about the classes of objects.
(3) Supervised learning models help us to solve various real-world problems such as fraud detection, spam filtering, etc.

1.3.1(C) Disadvantages of Supervised Learning

(1) Supervised learning models are not well suited to handling some very complex tasks.
(2) Supervised learning cannot predict the correct output if the test data is different from the training dataset.
(3) Training requires a lot of computation time.
(4) In supervised learning, we need enough knowledge about the classes of objects.

1.3.2 Unsupervised learning

GQ. What is Unsupervised Learning?
GQ. What are the types of unsupervised learning?
GQ. What are the advantages and disadvantages of unsupervised learning?

Unsupervised learning refers to learning from unlabeled data. It is based more on similarities and differences than on anything else. In this type of learning, all similar items are clustered together in a particular class where the label of the class is not known.

It is not possible to learn in a supervised way in the absence of properly labeled data. In these scenarios there is a need to learn in an unsupervised way. Here the learning is based more on the similarities and differences that are visible. These differences and similarities are mathematically represented in unsupervised learning.

Given a large collection of objects, we often want to be able to understand these objects and visualize their relationships.
For example, based on similarities, a kid can separate birds from other animals. The kid may use some property or similarity while separating, such as that birds have wings.

The criterion in the initial stages is the most visible aspects of those objects. Linnaeus devoted much of his life to arranging living organisms into a hierarchy of classes, with the goal of arranging similar organisms together at all levels of the hierarchy.

Many unsupervised learning algorithms create similar hierarchical arrangements based on similarity-based mappings.

The task of hierarchical clustering is to arrange a set of objects into a hierarchy such that similar objects are grouped together.

Non-hierarchical clustering seeks to partition the data into some number of disjoint clusters. The process of clustering is depicted in Fig. 1.3.6.

A learner is fed with a set of scattered points, and it generates two clusters with representative centroids after learning. The clusters show that points with similar properties and closeness are grouped together.

[(1D3) Fig. 1.3.6 : Unsupervised learning. Unlabeled data -> Algorithm -> Clusters.]

Unsupervised learning is a set of algorithms where the only information being supplied is the inputs.

The device itself, then, is responsible for grouping the data together and creating ideal outputs based on the data it discovers. Often, unsupervised learning algorithms have certain goals, but they are not controlled in any manner.

Instead, the developers believe that they have created strong enough inputs to ultimately program the machine to create stronger results than they themselves possibly could.

The idea here is that the machine is programmed to run flawlessly to the point where it can be intuitive and inventive in the most effective manner possible.

The information in the algorithms being run by unsupervised learning methods is not labelled, classified, or categorized by humans. Instead, the unsupervised algorithm rejects responding to feedback in favour of identifying commonalities in the data. It then reacts based on the presence, or absence, of such commonalities in each new piece of data that is input into the machine.

It is used to draw inferences from datasets consisting of input data without labelled responses. Clustering is the most common unsupervised learning technique. It is used for exploratory data analysis to find hidden patterns or groupings in data.

Applications for clustering include gene sequence analysis, market research, and object recognition.

1.3.2(A) Types of Unsupervised Learning Algorithm

The unsupervised learning algorithm can be further categorized into two types of problems :
1. Clustering
2. Association

(1) Clustering

Clustering is a method of grouping objects into clusters such that the objects with the most similarities remain in one group and have few or no similarities with the objects of another group.

Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
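The clustering described above, in which scattered points are grouped into two clusters with representative centroids, can be sketched with the k-means procedure (Lloyd's algorithm). The points, the starting centroids and the fixed number of iterations are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the nearest centroid (squared distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])

print(centroids)
print(clusters)
```

No labels are supplied at any point; the grouping comes only from which centroid each point is closest to.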

(2) Association

An association rule is an unsupervised learning method which is used for finding the relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategy more effective. For example, people who buy item X (suppose bread) also tend to purchase item Y (butter/jam). A typical example of association rule learning is Market Basket Analysis.

Below is the list of some popular unsupervised learning algorithms :
• K-means clustering
• KNN (k-nearest neighbors) search (note that the KNN classifier itself is a supervised method)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition

1.3.2(B) Advantages of Unsupervised Learning

(1) Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labeled input data.
(2) Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to labeled data.

1.3.2(C) Disadvantages of Unsupervised Learning

(1) Unsupervised learning is intrinsically more difficult than supervised learning as it does not have the corresponding output.
(2) The result of the unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithms do not know the exact output in advance.

In practical scenarios there is always a need to learn from both labeled and unlabeled data. Even while learning in an unsupervised way, there is a need to make the best use of the labeled data available. This is referred to as semi-supervised learning. Semi-supervised learning makes the best use of two paradigms of learning, that is, learning based on similarity and learning based on inputs from a teacher. Semi-supervised learning tries to get the best of both worlds.

1.3.2(D) Difference between Supervised and Unsupervised Learning

Supervised and unsupervised learning are two techniques of machine learning, but they are used in different scenarios and with different datasets. An explanation of both learning methods is given below, along with a table of their differences.

Supervised learning is a machine learning method in which models are trained using labeled data. In supervised learning, models need to find the mapping function that maps the input variable (X) to the output variable (Y).

Y = f (X)

Supervised learning needs supervision to train the model, which is similar to how a student learns things in the presence of a teacher. Supervised learning can be used for two types of problems : Classification and Regression.

Example : Suppose we have images of different types of fruits. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the image in supervised learning, we give the input data as well as the output for it, which means we train the model using the shape, size, color, and taste of each fruit. Once the training is completed, we test the model by giving it a new set of fruit. The model will identify the fruit and predict the output using a suitable algorithm.

Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision. Instead, it finds patterns in the data on its own.

Unsupervised learning can be used for two types of problems : Clustering and Association.

Example : To understand unsupervised learning, we will use the example given above. Unlike supervised learning, here we will not provide any supervision to the model. We will just provide the input dataset to the model and allow the model to find the patterns in the data. With the help of a suitable algorithm, the model will train itself and divide the fruits into different groups according to the most similar features between them.
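For the Association type of problem, the bread-and-butter example from market basket analysis can be quantified with two standard measures, support and confidence. The sketch below assumes a handful of hypothetical baskets; the function names are our own.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

print(support(baskets, {"bread", "butter"}))       # 2 of the 4 baskets
print(confidence(baskets, {"bread"}, {"butter"}))  # 2 of the 3 bread baskets
```

Algorithms such as Apriori search for all item sets whose support and confidence exceed chosen thresholds; the sketch only evaluates one candidate rule.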

UQ. Explain how supervised learning is different from unsupervised learning.
(Ref. MU (Comp.) - May 17, 5 Marks)

The main differences between Supervised and Unsupervised learning are given below :

Table 1.3.1

Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning model takes direct feedback to check if it is predicting the correct output or not. | Unsupervised learning model does not take any feedback.
Supervised learning model predicts the output. | Unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be categorized into Clustering and Association problems.
Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
Supervised learning model generally produces a more accurate result. | Unsupervised learning model may give a less accurate result as compared to supervised learning.
Supervised learning is not close to true Artificial Intelligence, as in this we first train the model on each data item, and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, KNN, Bayesian Logic, etc. | It includes various algorithms such as K-means clustering, hierarchical clustering, and the Apriori algorithm.
a 1.3.3 Reinforcement Learning environment and learns to act within that." How a
Robotic dog learns the movement of his arms is an
GQ. What is Reinforcement Learning? Explain with an example.

Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions. For each good action the agent gets positive feedback, and for each bad action it gets negative feedback or a penalty.

In Reinforcement Learning, the agent learns automatically from feedback without any labelled data, unlike supervised learning. Since there is no labelled data, the agent is bound to learn from its experience only.

RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.

The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.

The agent learns by a process of trial and error ("hit and trial"), and based on this experience it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment ..."

Reinforcement learning is a core part of Artificial Intelligence, and many AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.

Example : Suppose there is an AI agent in a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions; based on those actions the state of the agent changes, and it also receives a reward or penalty as feedback.

The agent keeps doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so it learns and explores the environment.

The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative feedback (penalties). As a reward the agent gets a positive point, and as a penalty it gets a negative point.
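The act / change state / get feedback loop from the maze example can be sketched in a few lines. This is only an illustration under assumptions: the one-dimensional "maze", the positions of the diamond and the pit, and the names `step` and `run_episode` are all hypothetical, and the agent here only explores at random without learning anything yet.

```python
import random

# Hypothetical 1-D "maze": cells 0..4, diamond at cell 4, pit at cell 0.
DIAMOND, PIT, START = 4, 0, 2

def step(state, action):
    """Apply an action (-1 = left, +1 = right); return (next_state, reward, done)."""
    next_state = max(PIT, min(DIAMOND, state + action))
    if next_state == DIAMOND:
        return next_state, +1, True    # positive point for reaching the diamond
    if next_state == PIT:
        return next_state, -1, True    # negative point (penalty)
    return next_state, 0, False

def run_episode(rng, max_steps=50):
    """Repeat: take action -> change state -> get feedback, until the episode ends."""
    state, total = START, 0
    history = []
    for _ in range(max_steps):
        action = rng.choice([-1, +1])          # pure exploration, no learning yet
        state, reward, done = step(state, action)
        history.append((action, state, reward))
        total += reward
        if done:
            break
    return total, history

total, history = run_episode(random.Random(0))
print(total, len(history))
```

A learning agent would replace the random `rng.choice` with a rule that prefers actions which earned positive points in the past.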
Fig. 1.3.7 (labels: Environment, Agent, State, Reward, Actions)

For machine learning, the environment is typically represented by an "MDP" or Markov Decision Process. Reinforcement learning algorithms do not necessarily assume knowledge of an exact model; instead, they are used when exact models are infeasible. In other words, they are not as precise as exact methods, but they still serve as a strong method in various applications across different technology systems.

The key features of Reinforcement Learning are mentioned below.
- In RL, the agent is not instructed about the environment or about what actions need to be taken.
- It is based on the hit and trial process.
- The agent takes the next action and changes states according to the feedback of the previous action.
- The agent may get a delayed reward.
- The environment is stochastic, and the agent needs to explore it to get the maximum positive rewards.

1.3.3(A) Approaches to Implement Reinforcement Learning

GQ. What are the approaches for Reinforcement Learning?

There are mainly three ways to implement reinforcement learning in ML, which are :
1. Value-based : The value-based approach is about finding the optimal value function, which is the maximum value at a state under any policy. Therefore, the agent expects the long-term return at any state(s) under policy π.
2. Policy-based : The policy-based approach is to find the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply a policy such that the action performed in each step helps to maximize the future reward. The policy-based approach has mainly two types of policy:
   Deterministic : For a given state, the policy (π) always produces the same action.
   Stochastic : In this policy, probability determines the produced action.
3. Model-based : In the model-based approach, a virtual model is created for the environment, and the agent explores that environment to learn it. There is no particular solution or algorithm for this approach because the model representation is different for each environment.

Here are important characteristics of reinforcement learning
- There is no supervisor, only a real number or reward signal
- Sequential decision making
- Time plays a crucial role in Reinforcement problems
- Feedback is always delayed, not instantaneous
- Agent's actions determine the subsequent data it receives
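The difference between the deterministic and stochastic policy types named under the policy-based approach can be illustrated with a small sketch. The two states, the actions and the probabilities below are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical two-state example; state and action names are illustrative.
deterministic_policy = {"s0": "right", "s1": "left"}

stochastic_policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def act_deterministic(state):
    """A deterministic policy maps each state to exactly one action."""
    return deterministic_policy[state]

def act_stochastic(state, rng):
    """A stochastic policy samples an action from a per-state distribution."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
print(act_deterministic("s0"))
print([act_stochastic("s0", rng) for _ in range(5)])
```

Calling `act_deterministic("s0")` repeatedly always gives the same action, while repeated calls to `act_stochastic("s0", rng)` can give different actions for the same state.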

RL can be applied in a wide range of applications. It is an experience-based learning algorithm, a decision-making algorithm, an algorithm that learns autonomously, and an optimization algorithm that over time learns to maximize its reward; the reward can be defined by the engineer to match the objective of the problem.

1.3.3(B) Challenges of Reinforcement Learning

Here are the major challenges you will face while doing Reinforcement learning :
(1) Feature/reward design, which can be very involved.
(2) Parameters may affect the speed of learning.
(3) Realistic environments can have partial observability.
(4) Too much reinforcement may lead to an overload of states, which can diminish the results.
(5) Realistic environments can be non-stationary.

1.3.3(C) Applications of Reinforcement Learning

Here are applications of Reinforcement Learning :
(1) Robotics for industrial automation.
(2) Business strategy planning
(3) Machine learning and data processing
(4) Aircraft control and robot motion control
(5) It helps you to create training systems that provide custom instruction and materials according to the requirement of students.

1.3.3(D) Reinforcement Learning Vs. Supervised Learning

GQ. What is the difference between Reinforcement Learning and Supervised Learning?

Parameters | Reinforcement Learning | Supervised Learning
Decision style | Reinforcement learning helps you to take your decisions sequentially. | In this method, a decision is made on the input given at the beginning.
Works on | Works on interacting with the environment. | Works on examples or given sample data.
Dependency on decision | In the RL method the learning decisions are dependent. Therefore, you should give labels to all the dependent decisions. | In supervised learning the decisions are independent of each other, so labels are given for every decision.
Best suited | Supports and works better in AI, where human interaction is prevalent. | It is mostly operated with an interactive software system or applications.
Example | Chess game | Object recognition

1.4 ISSUES IN MACHINE LEARNING

UQ. What are the issues in Machine learning? (Ref. MU (Comp.) May 15, 5 Marks)

1. Which algorithm should we select to learn general target functions from a specific training dataset? What should the settings of a particular algorithm be so that it converges to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
2. How much training data is sufficient? What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
3. When and how can prior knowledge held by the learner guide the process of generalizing from examples? Can prior knowledge be helpful even when it is only approximately correct?
4. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
5. What is the best way to reduce the learning task to one or more function approximation problems? What specific functions should the system attempt to learn? Can this process itself be automated?
6. How can the learner automatically alter its representation to improve its ability to represent and learn the target function?

1.5 HOW TO CHOOSE THE RIGHT ALGORITHM ?

UQ. Explain the steps required for selecting the right machine learning algorithm. (Ref. MU(Comp.) - May 16, 8 Marks)

With all the different algorithms available in machine learning, how can you select which one to use ? First you need to focus on your goal. What are you trying to get out of this? What data do you have or can you collect ? Secondly you have to consider the data.

1. Goal : If you are trying to predict or forecast a target value, then you need to look into supervised learning. Otherwise, you have to use unsupervised learning.
(a) If you have chosen supervised learning, then next you need to focus on what your target value is.
    If the target value is discrete (e.g. Yes/No, 1/2/3, A/B/C), then use Classification.
    If the target value is continuous, i.e. a range of numeric values (e.g. 0 to 100, -99 to 99), then use Regression.
(b) If you have chosen unsupervised learning, then next you need to focus on what your aim is.
    If you want to fit your data into some discrete groups, then use Clustering.
    If you want a numerical estimate of how strongly the data fits into each group, then use a density estimation algorithm.
2. Data : Are the features continuous or nominal ? Are there missing values in the features? If yes, what is the reason for the missing values? Are there outliers in the data? All of these properties of your data can help you narrow the algorithm selection process.

Table 1.5.1 : Selection of Algorithm

Target value | Supervised Learning | Unsupervised Learning
Discrete | Classification | Clustering
Continuous | Regression | Density Estimation

1.6 STEPS IN DEVELOPING A MACHINE LEARNING APPLICATION

UQ. Explain the steps of developing Machine Learning applications. (Ref. MU (Comp) - May 19, 10 Marks)

1. Collection of Data
- You could collect the samples by scraping a website and extracting data.
- From an RSS feed or an API
- From a device that collects wind speed measurements
- Publicly available data.
2. Preparation of the input data
- Once you have the input data, you need to check whether it is in a usable format or not.
- Some algorithms can accept target variables and features as strings; some need them to be integers.
- Some algorithms accept features only in a special format.
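As an illustration of the data preparation step, where some algorithms need target values as integers rather than strings, the conversion can be sketched as follows. The helper names `fit_label_encoder` and `encode` and the sample labels are hypothetical; real libraries offer their own encoders.

```python
def fit_label_encoder(values):
    """Map each distinct string to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def encode(values, mapping):
    """Replace each string by its integer code."""
    return [mapping[v] for v in values]

# Hypothetical target column, made up for illustration.
targets = ["Yes", "No", "Yes", "No", "Yes"]
mapping = fit_label_encoder(targets)
print(mapping)                    # {'No': 0, 'Yes': 1}
print(encode(targets, mapping))   # [1, 0, 1, 0, 1]
```

The mapping should be built once from the training data and reused for any new data, so that the same string always receives the same code.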

3. Analyse the input data
- Look at the data you have prepared, for example in a text editor, to check that the collection and preparation steps are working properly and that you don't have a bunch of empty values.
- You can also look at the data to find out whether you can see any patterns or anything obvious, such as a few data points that differ greatly from the rest of the data.
- Plotting data in 1, 2 or 3 dimensions can also help.
- Distil multiple dimensions down to 2 or 3 so that you can visualize the data.
4. The importance of this step is that it assures you that no garbage values are coming in.
5. Train the algorithm
Good clean data from the first two steps is given as input to the algorithm. The algorithm extracts information or knowledge. This knowledge is mostly stored in a format that is readily usable by a machine for the next 2 steps.
In case of unsupervised learning, the training step is not there because the target value is not present. The complete data is used in the next step.
6. Test the algorithm
In this step the information learned in the previous step is used. When you are checking an algorithm, you test it to find out whether it works properly or not. In the supervised case, you have some known values that can be used to evaluate the algorithm.
In the unsupervised case, you may have to use some other metrics to evaluate the success. In either case, if you are not satisfied, you can go back to step 4, change some things and test again.
Mostly the problem occurs in the collection or preparation of data, and you will have to go back to step 1.
7. Use it
In this step a real program is developed to do some task, and once again it is checked whether all the previous steps worked as you expected. You might encounter some new data and have to revisit steps 1-5.

Fig. 1.6.1 : Typical example of Machine Learning Application (training phase: input, feature extractor, features and label, machine learning algorithm; testing phase: input, feature extractor, features, classifier model, label)

1.7 APPLICATIONS OF MACHINE LEARNING

UQ. Write short note on: Machine learning applications. (Ref. MU (Comp.) - May 16, May 17, 10 Marks)

1. Learning Associations
In retail, for example in a supermarket chain, one application of machine learning is basket analysis, which is finding associations between products bought by customers.
If people who buy P typically also buy Q, and if there is a customer who buys P and does not buy Q, he or she is a potential Q customer. Once we identify such customers, we can target them for cross-selling.
In finding an association rule, we are interested in learning a conditional probability of the form P(Q | P), where Q is the product we would like to condition on P, which is the product or set of products which we know that the customer has already purchased.

P(Milk | Bread) = 0.7
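A conditional probability of this form can be estimated by counting baskets. The sketch below is a minimal illustration; the function name `confidence` and the five transactions are hypothetical and made up for the example, so the resulting value differs from the 0.7 quoted in the text.

```python
def confidence(baskets, given, target):
    """Estimate P(target | given) as the share of baskets with `given` that also hold `target`."""
    with_given = [b for b in baskets if given in b]
    if not with_given:
        raise ValueError(f"no basket contains {given!r}")
    return sum(target in b for b in with_given) / len(with_given)

# Hypothetical transactions, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]
print(confidence(baskets, "bread", "milk"))   # 0.75
```

Here four baskets contain bread and three of those also contain milk, so the estimate is 3/4.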

P(Milk | Bread) = 0.7 implies that 70% of customers who buy bread also buy milk.

2. Classification
A credit is an amount of money loaned by a financial institution. It is important for the bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back.
In credit scoring, the bank calculates the risk given the amount of credit and the information about the customer (income, savings, collaterals, profession, age, past financial history). The aim is to infer a general rule from this data, coding the association between a customer's attributes and his or her risk.
The Machine Learning system fits a model to the past data to be able to calculate the risk for a new application, and then decides to accept or refuse it accordingly.

IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Other classification examples are optical character recognition, face recognition, medical diagnosis, speech recognition and biometrics.

Fig. 1.7.1 : Classification for credit scoring (axes: income, savings; regions: low-risk, high-risk)

3. Regression
Suppose we want to design a system that can predict the price of a flat. Let's take the inputs as the area of the flat, location, purchase year and other information that affects the rate of the flat. The output is the price of the flat. Applications where the output is numeric are regression problems.
Let X represent the flat's features and Y the price of the flat. We can collect training data by surveying past purchase transactions, and the Machine Learning algorithm fits a function to this data to learn Y as a function of X for suitable values of w and w_0.

$$Y = wX + w_0$$

Fig. 1.7.2 : Regression for prediction of price of flat (X: area of flat, Y: price of flat)

4. Unsupervised Learning
One of the important unsupervised learning problems is clustering. In clustering, a dataset is partitioned into meaningful subclasses known as clusters. For example, suppose you want to decorate your home using given items.
Now you will group them using unsupervised learning (no prior knowledge), and this grouping can be on the basis of the colour of the items, the shape of the items, the material used for the items, the type of the items, or whatever way you would like.

5. Reinforcement Learning
There are some applications where the output of the system is a sequence of actions. In such applications the sequence of correct actions, rather than a single action, is important in order to reach the goal. An action is said to be good if it is part of a good policy. The machine learning program generates a policy by learning from previous good action sequences. Such methods are called reinforcement methods.
A good example of reinforcement learning is chess playing. In artificial intelligence and machine learning, one of the most important research areas is game

playing. Games can be easily described, but at the same time they are quite difficult to play well. Let's take the example of chess, which has a limited number of rules, but the game is very difficult because for each state there can be a large number of possible moves.

Another application of reinforcement learning is robot navigation. The robot can move in all possible directions at any point of time. The algorithm should reach the goal state from an initial state by learning the correct sequence of actions after conducting a number of trial runs.

When the system has unreliable and partial sensory information, reinforcement learning becomes complex. Let's take the example of a robot with incomplete camera information. Here the robot does not know its exact location.

1.8 TRAINING ERROR AND GENERALIZATION ERROR

1.8.1 Training, Testing and Validation Dataset

When a labeled dataset is used to train machine learning models, it is common to break up the dataset into three parts :
- Training : used to directly improve the model's parameters.
- Validation : used to evaluate a model's performance while optimizing the model's hyperparameters.
- Test : used to evaluate a model after hyperparameter optimization is complete.

1.8.2 Why the Validation and Test Datasets are Necessary

During the process of training a machine learning model, it is common for a model's parameters to overfit to the training dataset. These models report artificially high accuracy against the training dataset, but they perform poorly against data not in the training dataset.

To realistically measure a model's performance, it is better to evaluate it against a validation dataset that was not used when training the model.

You can train multiple models with different hyperparameters and compare them with the same validation dataset. But this creates a new problem: just like a model's parameters can overfit to the training dataset, a model's parameters and hyperparameters can overfit to the validation dataset.

To realistically measure a set of models, it is better to evaluate them against a test dataset not used for training or validation. The test dataset is used to measure the performance of your various models at the end of the training process. Be careful not to repeatedly use the test dataset to re-train models or choose models, otherwise you risk creating models that have overfit to the test dataset.

1.8.3 Picking the Size of the Validation and Test Datasets

The validation and test datasets need to be larger than a certain minimum size. Otherwise, the model's validation and test accuracy will not be representative of the "real-world" accuracy. If your input dataset is very small, you can use cross-validation to train and evaluate a model against many different training/validation/test splits.

For medium-sized datasets, it is typical for the validation and test datasets to each be 10%-30% of the total amount of data. For example, a common training/validation/test split is 60%/20%/20%.

However, the validation and test datasets do not need to be larger than a certain absolute size. Above that size, adding more validation or test data does not make the model's performance metrics more realistic. If your input dataset contains millions of data samples, you may only need about 1% each for the validation and test datasets.
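The 60%/20%/20% split mentioned above can be sketched as follows. The function name `train_val_test_split` and its parameters are assumptions made for this illustration, and random shuffling is only appropriate when the samples have no time order.

```python
import random

def train_val_test_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of `samples` and cut it into training/validation/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))   # 60 20 20
```

For a dataset with millions of samples, calling the same function with `val_frac=0.01` and `test_frac=0.01` would give the roughly 1% validation and test parts described in the text.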

1.8.4 How to Balance the Validation and Test Datasets

Preserve imbalanced classes

If you are working on a classification problem with imbalanced classes, such as a dataset where one class is 99% of the dataset and the other class is 1% of the dataset, then you might consider improving the training process by oversampling the smaller class. But for your validation and test datasets, you want to measure your model's performance against the same class balance that your model would encounter in the real world.

Validation and test datasets should have "newer" samples

If you are training a model on time series data, typically your goal is to predict something about the future using data from the past or present.

In order to properly evaluate a time series model, your training/validation/test split must obey the "arrow of time":
- All of the data samples in your validation dataset should be newer than your training dataset.
- All of the data samples in your test dataset should be newer than your validation dataset.

If your training dataset contains data samples that are newer than your validation dataset, then your model's validation accuracy will be misleadingly high. Your model is effectively traveling backward in time if it trains on new data and evaluates on old data.

Many people make the mistake of randomly shuffling the input dataset before splitting it into training, validation, and test datasets, effectively violating the arrow of time.

Do not apply data augmentation

Data augmentation is the use of computer algorithms to create or modify training data samples. The goal of data augmentation is to increase the size of the training dataset and to act as a regularizer, that is, something that reduces a model's ability to overfit.

However, when a machine learning model is deployed to the "real world" and is making predictions, typically the model will not perform any augmentation or regularization on its input. To mirror the real world, a model should not perform augmentation or regularization on the validation or test dataset.

There are a few exceptions to the rule:
- If the validation and/or test datasets are too small for a model to be reliably evaluated, then it might make sense to use data augmentation to add data samples.
- If the entire training dataset is computer generated, like a dataset of images generated from a video game, then it may be reasonable for the validation and test datasets to also be entirely computer-generated.

Cross Validation

In machine learning, we cannot simply fit the model on the training data and claim that it will work accurately on real data. We must make sure that our model has picked up the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique.

Cross-validation is a technique in which we train our model using a subset of the data set and then evaluate it using the complementary subset of the data set.

The three steps involved in cross-validation are as follows:
1. Reserve some portion of the sample data set.
2. Train the model using the rest of the data set.
3. Test the model using the reserved portion of the data set.

Methods of Cross Validation

Validation

In this method, we perform training on 50% of the given data set and the remaining 50% is used for testing.

The major drawback of this method is that we perform training on only 50% of the dataset; it is possible that the remaining 50% of the data contains some important information which we leave out while training our model, i.e. higher bias.

LOOCV (Leave One Out Cross Validation)

In this method, we perform training on the whole data set except for one data point, which is left out for testing, and then iterate so that each data point is left out once. It has advantages as well as disadvantages.

An advantage of using this method is that we make use of all data points and hence it has low bias.

The major drawback of this method is that it leads to higher variation in the testing of the model, as we are testing against one data point. If the data point is an outlier, it can lead to higher variation. Another drawback is that it takes a lot of execution time, as it iterates as many times as there are data points.

K-Fold Cross Validation

In this method, we split the data set into k subsets (known as folds), then we perform training on k-1 of the subsets and leave one subset out for the evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.

A value of k = 10 is commonly suggested, since a lower value of k moves towards the simple validation method and a higher value of k leads towards the LOOCV method.

Example

Fig. 1.8.1 shows an example of the training subsets and evaluation subsets generated in k-fold cross validation. Here, we have a total of 25 instances. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training ([1-5] testing and [6-25] training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets of the data for training ([6-10] testing and [1-5 and 11-25] training), and so on.

Fig. 1.8.1 : Cross Validation (in each iteration a different fold is marked Test and the remaining folds are used for Training)

1.8.5 Advantages of train/test split

(1) It runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.
(2) It is simpler to examine the detailed results of the testing process.

1.8.6 Advantages of cross-validation

(1) More accurate estimate of out-of-sample accuracy.
(2) More "efficient" use of data, as every observation is used for both training and testing.

1.8.7 Training Error

In machine learning, training a predictive model means finding a function which maps a set of values x to a value y. If we apply the model to the data it was trained on, we are calculating the training error. If we calculate the error on data which was unknown in the training phase, we are calculating the test error.

Training error is calculated as follows:

$$E_{train} = \frac{1}{n} \sum_{i=1}^{n} error(f_D(X_i), Y_i)$$

In the above equation n represents the number of training examples, f_D(X_i) represents the predicted value and Y_i represents the true or actual value. error(f_D(X_i), Y_i) expresses whether these two values are the same and, if not, by how much they differ.
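The training error and test error described in this section are the same average computed on different data. The sketch below assumes a 0-1 error function and a hypothetical threshold classifier; the data values are made up for illustration.

```python
def mean_error(model, xs, ys, error):
    """Average of error(model(x), y) over the given examples."""
    return sum(error(model(x), y) for x, y in zip(xs, ys)) / len(xs)

def zero_one(predicted, actual):
    """0-1 error: 0 if the two values agree, 1 otherwise."""
    return 0 if predicted == actual else 1

# Hypothetical threshold classifier and toy data, made up for illustration.
def model(x):
    return 1 if x >= 5 else 0

train_x, train_y = [1, 2, 6, 8], [0, 0, 1, 1]
test_x, test_y = [3, 4, 5, 9], [0, 1, 1, 1]

print(mean_error(model, train_x, train_y, zero_one))  # training error: 0.0
print(mean_error(model, test_x, test_y, zero_one))    # test error: 0.25
```

The model fits its training examples perfectly but misclassifies one of the four unseen examples, so the test error is higher than the training error.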

1.8.8 Generalization Error

For supervised learning applications in machine learning and statistical learning theory, generalization error is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.

Generalization error is calculated as follows:

$$E_{gen} = \iint error(f_D(X), Y)\, P(X, Y)\, dX\, dY$$

In the above equation the error is calculated over all possible values of X and Y. error(f_D(X), Y) expresses whether the predicted and actual values are the same and, if not, by how much they differ. P(X, Y) represents how often we expect to see such X and Y.

1.8.9 Training Error versus Generalization Error

The training error is the error of our model as calculated on the training dataset, while the generalization error is the expectation of our model's error were we to apply it to an infinite stream of additional data examples drawn from the same underlying data distribution as our original sample.

Problematically, we can never calculate the generalization error exactly. That is because the stream of infinite data is an imaginary object. In practice, we must estimate the generalization error by applying our model to an independent test set constituted of a random selection of data examples that were withheld from our training set.

Let's see an example. Consider a college student trying to prepare for the final exam. A diligent student will strive to practice well and test his or her abilities using exams from previous years. Nonetheless, doing well on past exams is no guarantee that the student will excel when it matters.

For instance, the student might try to prepare by rote learning the answers to the exam questions. This requires the student to memorize many things. The student might even remember the answers for past exams perfectly.

Another student might prepare by trying to understand the reasons for giving certain answers. In most cases, the latter student will do much better.

Let's see one more example. Consider the problem of trying to classify the outcomes of coin tosses (class 0: heads, class 1: tails) based on some contextual features that might be available.

Suppose that the coin is fair. No matter what algorithm we come up with, the generalization error will always be 1/2. However, for most algorithms, we should expect our training error to be considerably lower, depending on the luck of the draw, even if we did not have any features! Consider the dataset {0, 1, 1, 1, 0, 1}.

Our feature-less algorithm would have to fall back on always predicting the majority class, which appears from our limited sample to be 1. In this case, the model that always predicts class 1 will incur an error of 1/3, considerably better than our generalization error.

As we increase the amount of data, the probability that the fraction of heads will deviate significantly from 1/2 diminishes, and our training error would come to match the generalization error.

When we train our models, we attempt to search for a function that fits the training data as well as possible. If the function is so flexible that it can catch on to spurious patterns just as easily as to true associations, then it might perform too well on the training data without producing a model that generalizes well to unseen data. This is precisely what we want to avoid or at least control.

Many of the techniques in deep learning are heuristics and tricks aimed at guarding against overfitting.

When we have simple models and abundant data, we expect the generalization error to resemble the training error. When we work with more complex models and fewer examples, we expect the training error to go down but the generalization gap to grow.
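The coin-toss argument can be checked numerically. The sketch below computes the training error of the always-predict-the-majority model for the six-toss dataset from the text, and then for simulated fair-coin samples of growing size; the function name and the sample sizes are assumptions made for illustration.

```python
import random
from collections import Counter

def majority_class_training_error(labels):
    """Training error of the model that always predicts the most common label."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1 - majority_count / len(labels)

# Dataset from the text: four tails (1) and two heads (0), so the error is about 1/3.
print(majority_class_training_error([0, 1, 1, 1, 0, 1]))

# Simulated fair coin: with more tosses the training error moves towards 1/2.
rng = random.Random(0)
for n in (6, 100, 10_000):
    tosses = [rng.randint(0, 1) for _ in range(n)]
    print(n, round(majority_class_training_error(tosses), 3))
```

By construction the majority-class error can never exceed 1/2, which is why the training error sits at or below the generalization error of 1/2 for a fair coin.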

Fig. 1.8.2 : Training Error and Generalization Error (error plotted against capacity; curves: training error, generalization error; marked: underfitting zone, overfitting zone, generalization gap, optimal capacity)

1.9 UNDERFITTING, OVERFITTING, BIAS AND VARIANCE TRADE OFF

Let us consider that we are designing a machine learning model. A model is said to be a good machine learning model if it generalizes to any new input data from the problem domain in a proper way. This helps us to make predictions on future data that the model has never seen.

Now, suppose we want to check how well our machine learning model learns and generalizes to new data. For that we have over fitting and under fitting, which are majorly responsible for the poor performance of machine learning algorithms.

Before diving further let's understand two important terms:

Bias - Assumptions made by a model to make a function easier to learn. (The algorithm's error rate on the training set is the algorithm's bias.)

Variance - If you train your model on training data and obtain a very low error, but upon changing the data and then training the same model you experience a high error, this is variance. (How much worse the algorithm does on the test set than on the training set is known as the algorithm's variance.)

Under fitting

A statistical model or a machine learning algorithm is said to have under fitting when it cannot capture the underlying trend of the data.

Under fitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or the algorithm does not fit the data well enough.

It usually happens when we have too little data to build an accurate model, and also when we try to build a linear model with non-linear data. In such cases the rules of the machine learning model are too simple to capture the pattern in the data, and therefore the model will probably make a lot of wrong predictions.

Under fitting can be reduced by using more data and, as listed below, by increasing the model complexity or the number of features.

In a nutshell, Under fitting - High bias and low variance

Techniques to reduce under fitting :
1. Increase model complexity
2. Increase the number of features, performing feature engineering
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.

Over fitting

A statistical model is said to be over fitted when it fits the training data too closely. When a model is trained so much that it starts learning from the noise and inaccurate data entries in our data set, it no longer categorizes new data correctly, because of too many details and noise.

The causes of over fitting are often the non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and therefore they can build unrealistic models.

A solution to avoid over fitting is using a linear algorithm if we have linear data, or using parameters like the maximal depth if we are using decision trees.

In a nutshell, Over fitting - High variance and low bias

Techniques to reduce overfitting:
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the loss begins to increase, stop training).
4. Ridge Regularization and Lasso Regularization.
5. Use dropout for neural networks to tackle over fitting.

Ideally, the case when the model makes the predictions with 0 error is said to have a good fit on the data. This situation is achievable at a spot between over fitting and under fitting.
In order to understand it, we will have to look at the performance of our model with the passage of time, while it is learning from the training dataset.
With the passage of time, our model will keep on learning, and thus the error for the model on the training and testing data will keep on decreasing.
If it learns for too long, the model will become more prone to overfitting due to the presence of noise and less useful details. Hence the performance of our model will decrease.
In order to get a good fit, we will stop at a point just before where the error starts increasing. At this point the model is said to have good skills on the training dataset as well as on our unseen testing dataset.

Bias-variance trade-off
So what is the right measure? Depending on the model at hand, a performance that lies between over fitting and under fitting is more desirable. This trade-off is the most integral aspect of Machine Learning model training. As we discussed, Machine Learning models fulfil their purpose when they generalize well. Generalization is bound by the two undesirable outcomes - high bias and high variance. Detecting whether the model suffers from either one is the sole responsibility of the model developer.
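The early-stopping rule described above (stop just before the error starts increasing) can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `early_stopping_epoch`, the `patience` parameter and the loss values are all assumptions made for the example, and the losses are taken to be measured on data held out from training.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return the index of the epoch whose model should be kept.

    val_losses: loss on held-out data after each epoch (hypothetical input).
    patience:   number of consecutive non-improving epochs to tolerate.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # The error is still decreasing: remember this epoch.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            # The error stopped improving: count it and maybe stop.
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Made-up loss curve: it falls, bottoms out at epoch 3, then rises again.
losses = [0.90, 0.70, 0.50, 0.45, 0.47, 0.52]
print(early_stopping_epoch(losses))  # 3
```

A larger `patience` tolerates short-lived bumps in the loss before training is stopped, at the cost of a few extra epochs.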

Fig. 1.9.1 : Underfitting and Overfitting (panels: Under-fitting, too simple to explain the variance; Appropriate-fitting; Over-fitting, force-fitting, too good to be true)


Fig. 1.9.2 : Bias-variance trade-off as a function of model capacity (x-axis: capacity; curves: bias, variance and generalization error; regions: underfitting zone, optimal capacity, overfitting zone)
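The effect of model capacity on training and test error can be tried out with a small experiment. The sketch below uses k-nearest-neighbour regression as one example of a non-parametric method; that choice, the sine-shaped data, the noise level and all function names are assumptions made for illustration and are not taken from the text. A small k gives a very flexible model and a large k a very rigid one.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN regressor on `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


def make_data(n, rng):
    """Noisy samples of sin(x) on [0, 6]; the noise level 0.3 is arbitrary."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]


rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(60, rng)
test = make_data(200, rng)

# k = 1 is the most flexible model, k = 60 (all points) the most rigid.
for k in (1, 5, 15, 60):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is exactly zero although the model has memorised the noise (high variance). With k = 60 the model predicts the training mean everywhere and cannot follow the trend (high bias). The test error is usually lowest for a k somewhere in between.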

1.10 PERFORMANCE METRICS

1.10.1 Performance Metrics for Regression
Regression analysis is a subfield of supervised machine learning. It aims to model the relationship between a certain number of features and a continuous target variable. Following are the performance metrics used for evaluating a regression model:

1. Mean Absolute Error (MAE)
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$
Where $y_i$ is the actual expected output and $\hat{y}_i$ is the model's prediction. It is the simplest evaluation metric for a regression scenario and is less popular than the following metrics.
Say, $y_i = [5, 10, 15, 20]$ and $\hat{y}_i = [4.8, 10.6, 14.3, 20.1]$.
Thus, MAE = (1/4) * (|5 - 4.8| + |10 - 10.6| + |15 - 14.3| + |20 - 20.1|) = 0.4

2. Mean Squared Error (MSE)
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Here, the error term is squared and thus more sensitive to outliers as compared to Mean Absolute Error (MAE).
Thus, MSE = (1/4) * (|5 - 4.8|^2 + |10 - 10.6|^2 + |15 - 14.3|^2 + |20 - 20.1|^2) = 0.225

3. Root Mean Squared Error (RMSE)
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
Since MSE includes squared error terms, we take the square root of the MSE, which gives rise to Root Mean Squared Error (RMSE). Thus, RMSE = (0.225)^0.5 = 0.474

4. R-Squared
$$R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$
R-squared is calculated by dividing the sum of squares of residuals ($SS_{RES}$) from the regression model by the total sum of squares ($SS_{TOT}$) of errors from the average model, and then subtracting the result from 1. R-squared is also known as the Coefficient of Determination. It explains the degree to which the input variables explain the variation of the output/predicted variable.
An R-squared value of 0.81 tells us that the input variables explain 81% of the variation in the output variable. The higher the R-squared, the more variation is explained by the input variables and the better is the model. However, there exists a limitation in this metric, which is solved by the Adjusted R-squared.

5. Adjusted R-squared
$$R^2_{adj} = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$$
Here, N is the total sample size (number of rows) and p is the number of predictors (number of columns). The limitation of R-squared is that it will either stay the same or increase with the addition of more variables, even if they do not have any relationship with the output variable.
To overcome this limitation, Adjusted R-squared comes into the picture, as it penalizes you for adding variables which do not improve your existing model.
Hence, if you are building Linear regression on multiple variables, it is always suggested that you use Adjusted R-squared to judge the goodness of the model. If there exists only one input variable, R-squared and Adjusted R-squared are nearly the same when N is large.

1.10.2 Performance Metrics for Classification
Classification is the problem of identifying to which of a set of categories/classes a new observation belongs, based on the training set of data containing records whose class label is known.
Following are the performance metrics used for evaluating a classification model :
To understand different metrics, we must understand the Confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

              Predicted 0    Predicted 1
Actual 0          TN             FP
Actual 1          FN             TP

TN - True negatives (actual 0, predicted 0) & TP - True positives (actual 1, predicted 1)
FP - False positives (actual 0, predicted 1) & FN - False negatives (actual 1, predicted 0)

Consider the following values for the confusion matrix:
True negatives (TN) = 300
True positives (TP) = 500
False negatives (FN) = 150
False positives (FP) = 50

1. Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
Accuracy is defined as the ratio of the number of correct predictions to the total number of predictions. It lies between [0,1]. In general, higher accuracy means a better model (TP and TN must be high).
However, accuracy is not a useful metric in case of an imbalanced dataset (datasets with uneven distribution of classes). Say we have data of 1000 patients, out of which 50 are having cancer and 950 are not. A dumb model which always predicts "no cancer" will have an accuracy of 95%, but it is of no practical use, since in this case we want the number of False Negatives to be at a minimum. Thus, we have different metrics like recall, precision, F1-score etc.
Thus, Accuracy using above values will be (500+300)/(500+50+150+300) = 800/1000 = 80%

2. Precision and Recall
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
Recall is a useful metric in case of cancer detection, where we want to minimize the number of False negatives for any practical use, since we don't want our model to mark a patient suffering from cancer as safe. On the other hand, predicting a healthy patient as cancerous is not a big issue since, in further diagnosis, it will become clear that the patient does not have cancer. Recall is also known as Sensitivity.
Thus, Recall using above values will be 500/(500+150) = 500/650 = 76.92%
Precision is useful when we want to reduce the number of False Positives. Consider a system that predicts whether the e-mail received is spam or not. Taking spam as a positive class, we do not want our system to predict non-spam e-mails (important e-mails) as spam, i.e., the aim is to reduce the number of False Positives.
Thus, Precision using above values will be 500/(500+50) = 500/550 = 90.91%

3. Specificity
Specificity is defined as the ratio of True negatives to True negatives + False positives. We want the value of specificity to be high. Its value lies between [0,1].
$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$
Thus, Specificity using above values will be 300/(300+50) = 300/350 = 85.71%

4. F1-score
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
F1-score is a metric that combines both Precision and Recall and equals the harmonic mean of precision and recall. Its value lies between [0,1] (the higher the value, the better the F1-score).
Using values of precision = 0.9091 and recall = 0.7692, F1-score = 0.8333 = 83.33%

5. AUC-ROC
AUC (Area Under The Curve) - ROC (Receiver Operating Characteristics) curve is one of the most important evaluation metrics for checking any classification model's performance.
It is plotted between FPR (X-axis) and TPR (Y-axis). If the value is less than 0.5, then the model is even worse than a random guessing model.
$$\text{True Positive Rate (TPR)} = \frac{TP}{TP + FN}$$
$$\text{False Positive Rate (FPR)} = \frac{FP}{FP + TN}$$

Fig. 1.10.1 : Comparing ROC curves (x-axis: false positive rate; y-axis: true positive rate; curves labelled Worthless, Good and Excellent)

1.10.3 Kappa Statistics
It is a statistic which measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, since κ takes into account the agreement occurring by chance. Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.
Cohen's kappa coefficient is defined and given by the following function :
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
Where:
$p_o$ = relative observed agreement among raters.
$p_e$ = the hypothetical probability of chance agreement.
$p_o$ and $p_e$ are computed using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by $p_e$), κ = 0, and κ becomes negative when the agreement is worse than chance.

Example
Ex. 1.10.1: Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers and each reader either said "Yes" or "No" to the proposal. Suppose the agreement and disagreement count data were as follows, where A and B are readers, data on the diagonal slanting left shows the count of agreements and the data on the diagonal slanting right, disagreements :

            B: Yes    B: No
A: Yes        20        5
A: No         10       15

Calculate Cohen's kappa coefficient.

Soln.:
Note that there were 20 proposals that were granted by both reader A and reader B and 15 proposals that were rejected by both readers. Thus, the observed proportional agreement is
$$p_o = \frac{20 + 15}{50} = 0.70$$
To calculate $p_e$ (the probability of random agreement) we note that:
Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.
Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.
Using the formula P(A and B) = P(A) x P(B), where P is the probability of an event occurring and the two readers are assumed to decide independently, the probability that both of them would say "Yes" randomly is 0.50 x 0.60 = 0.30 and the probability that both of them would say "No" is 0.50 x 0.40 = 0.20. Thus the overall probability of random agreement is
$$p_e = 0.30 + 0.20 = 0.50$$
So now applying our formula for Cohen's Kappa we get:
$$\kappa = \frac{0.70 - 0.50}{1 - 0.50} = 0.40$$
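The worked numbers in Sections 1.10.1 to 1.10.3 can be reproduced with a short script. This is only a checking aid written with the standard library; the variable names are chosen for the example, and the value p = 1 used for Adjusted R-squared is an assumption, since the worked regression example does not state the number of predictors.

```python
import math

# --- Regression metrics (Section 1.10.1), using the worked example values ---
y_true = [5, 10, 15, 20]
y_pred = [4.8, 10.6, 14.3, 20.1]
n = len(y_true)

mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)

y_mean = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot
num_predictors = 1  # assumed; not given in the worked example
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - num_predictors - 1)

# --- Classification metrics (Section 1.10.2), from the confusion matrix ---
tp, tn, fp, fn = 500, 300, 50, 150
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also the true positive rate
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

# --- Cohen's kappa (Ex. 1.10.1): rows are reader A, columns are reader B ---
table = [[20, 5],     # A says Yes: B says Yes, B says No
         [10, 15]]    # A says No:  B says Yes, B says No
total = sum(sum(row) for row in table)
p_o = sum(table[i][i] for i in range(2)) / total            # observed agreement
a_share = [sum(row) / total for row in table]               # reader A: Yes, No
b_share = [sum(row[j] for row in table) / total for j in range(2)]  # reader B
p_e = sum(a * b for a, b in zip(a_share, b_share))          # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
# MAE=0.400  MSE=0.225  RMSE=0.474
print(f"R2={r2:.4f}  adjusted R2={adj_r2:.4f}")
print(f"accuracy={accuracy:.4f}  precision={precision:.4f}  recall={recall:.4f}")
# accuracy=0.8000  precision=0.9091  recall=0.7692
print(f"specificity={specificity:.4f}  F1={f1:.4f}")
# specificity=0.8571  F1=0.8333
print(f"p_o={p_o:.2f}  p_e={p_e:.2f}  kappa={kappa:.2f}")
# p_o=0.70  p_e=0.50  kappa=0.40
```

The R-squared values are not part of the worked examples in the text; they are printed only to show how the two formulas are applied to the same four data points.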
Chapter Ends...