MLT Unit 4 and 5 Part 2
Advantages:
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
Performance is highly dependent on input data.
Training can take a long time.
The matrix-based approach is preferred over a mini-batch.
Deep learning is a family of computer programs that mimic the network of neurons in a brain. It is a subset of
machine learning and is called deep learning because it makes use of deep neural networks.
MACHINE LEARNING 50
DEPARTMENT OF CSE AY:2023-24
Each hidden layer is composed of neurons. The neurons are connected to each other. A neuron
processes the input signal it receives from the layer above it and then propagates it. The strength of the signal given to the
neuron in the next layer depends on the weight, bias and activation function.
The network consumes large amounts of input data and processes them through multiple layers; the network
can learn increasingly complex features of the data at each layer.
Deep learning is a powerful tool for turning predictions into actionable results. Deep learning excels in pattern
discovery (unsupervised learning) and knowledge-based prediction. Big data is the fuel for deep learning.
When both are combined, an organization can reap unprecedented results in terms of productivity, sales,
management, and innovation.
Deep learning can outperform traditional methods. For instance, deep learning algorithms are reported to be 41% more
accurate than traditional machine learning algorithms in image classification, 27% more accurate in facial recognition
and 25% more accurate in voice recognition.
It has been shown that simple deep learning techniques like CNN can, in some cases, imitate the knowledge
of experts in medicine and other fields. The current wave of machine learning, however, requires training
data sets that are not only labeled but also sufficiently broad and universal.
Deep-learning methods require thousands of observations for models to become relatively good at
classification tasks and, in some cases, millions for them to perform at the level of humans. Unsurprisingly,
deep learning is popular in giant tech companies, which accumulate petabytes of data. This
allows them to create impressive and highly accurate deep learning models.
Unsupervised
Unsupervised feature learning is learning features from unlabeled data. The goal of unsupervised feature
learning is often to discover low-dimensional features that capture some structure underlying the
high-dimensional input data. When the feature learning is performed in an unsupervised way, it enables a form of
semi-supervised learning where features learned from an unlabeled dataset are then employed to improve
performance in a supervised setting with labeled data. Several approaches are introduced in the following.
A Recurrent Neural Network is architected in the same way as a “traditional” Neural Network. We
have some inputs, we have some hidden layers and we have some outputs.
The only difference is that each hidden unit is doing a slightly different function. So, let’s explore how this
hidden unit works.
A recurrent hidden unit computes a function of an input and its own previous output, also known as the cell
state. For textual data, an input could be a vector representing a word x(i) in a sentence of n words (also
known as word embedding).
W and U are weight matrices and tanh is the hyperbolic tangent function.
Similarly, at the next step, it computes a function of the new input and its previous cell state:
s2 = tanh(W x1 + U s1). This behavior is similar to a hidden unit in a feed-forward network. The difference, specific
to sequences, is that we add an extra term to incorporate the unit's own previous state.
A common way of viewing recurrent neural networks is by unfolding them across time. We can notice that
we are using the same weight matrices W and U throughout the sequence. This solves our problem of
parameter sharing. We don’t have new parameters for every point of the sequence. Thus, once we learn
something, it can be applied at any point in the sequence.
The fact of not having new parameters for every point of the sequence also helps us deal with variable-
length sequences. In case of a sequence that has a length of 4, we could unroll this RNN to four timesteps.
In other cases, we can unroll it to ten timesteps since the length of the sequence is not prespecified in the
algorithm. By unrolling we simply mean that we write out the network for the complete sequence. For
example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-
layer neural network, one layer for each word.
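The unrolling described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the sizes (3-dimensional word vectors, 4 hidden units), the random weights and the helper name `unroll_rnn` are all made up here; the update rule is the s = tanh(Wx + Us) form used in the text.

```python
import numpy as np

def unroll_rnn(inputs, W, U, s0):
    """Apply s_next = tanh(W @ x + U @ s) once per element of the sequence."""
    states = []
    s = s0
    for x in inputs:                  # one "layer" of the unrolled network per timestep
        s = np.tanh(W @ x + U @ s)    # the same W and U are reused at every step
        states.append(s)
    return states

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 3, 4          # illustrative sizes, not taken from the notes
W = rng.normal(size=(hidden_dim, embed_dim))
U = rng.normal(size=(hidden_dim, hidden_dim))
s0 = np.zeros(hidden_dim)

# Two sequences of different length share the same parameters.
sequence_of_4 = rng.normal(size=(4, embed_dim))
sequence_of_10 = rng.normal(size=(10, embed_dim))

states_4 = unroll_rnn(sequence_of_4, W, U, s0)
states_10 = unroll_rnn(sequence_of_10, W, U, s0)
print(len(states_4), len(states_10))  # 4 10
```

Nothing in `W` or `U` depends on the sequence length, which is why the same network can be unrolled to four or ten timesteps.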
1. Input Layer: This is the layer in which we give input to our model. The number of neurons in
this layer is equal to the total number of features in our data (number of pixels in the case of an
image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can
be many hidden layers depending upon our model and data size. Each hidden layer can have
a different number of neurons, which is generally greater than the number of features. The output from
each layer is computed by matrix multiplication of the output of the previous layer with the learnable weights
of that layer, then addition of the learnable biases, followed by an activation function which makes
the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like
sigmoid or softmax, which converts the output of each class into the probability score of each class.
The data is then fed into the model and the output from each layer is obtained; this step is called
feedforward. We then calculate the error using an error function; some common error functions are
cross-entropy, squared-error loss, etc. After that, we backpropagate through the model by calculating the
derivatives. This step is called backpropagation and is used to minimize the
loss. Here’s the basic Python code for a neural network with random inputs and two hidden
layers.
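A minimal NumPy sketch of such a network follows. It is not the original listing: the layer sizes (5 inputs, hidden layers of 6 and 4 neurons, one sigmoid output), the random data, the learning rate and the number of iterations are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random inputs and random binary targets (illustrative data only).
X = rng.normal(size=(8, 5))                        # 8 samples, 5 features
y = rng.integers(0, 2, size=(8, 1)).astype(float)

# Two hidden layers (6 and 4 neurons) followed by one sigmoid output neuron.
sizes = [5, 6, 4, 1]
weights = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, n)) for n in sizes[1:]]

def forward(X):
    """Feedforward pass; returns the activations of every layer."""
    activations = [X]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations

def cross_entropy(y_true, y_pred):
    eps = 1e-12
    return float(-np.mean(y_true * np.log(y_pred + eps)
                          + (1 - y_true) * np.log(1 - y_pred + eps)))

learning_rate = 0.5
losses = []
for _ in range(200):
    acts = forward(X)
    losses.append(cross_entropy(y, acts[-1]))
    # Backpropagation. For a sigmoid output with cross-entropy loss, the
    # derivative of the loss w.r.t. the output pre-activation is (p - y) / N.
    delta = (acts[-1] - y) / len(X)
    for i in reversed(range(len(weights))):
        grad_W = acts[i].T @ delta
        grad_b = delta.sum(axis=0, keepdims=True)
        if i > 0:
            # Push the error back through layer i before its weights change.
            delta = (delta @ weights[i].T) * acts[i] * (1 - acts[i])
        weights[i] -= learning_rate * grad_W
        biases[i] -= learning_rate * grad_b

print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

Sigmoid is used in every layer only to keep the derivative simple; ReLU hidden layers or a softmax output would follow the same feedforward and backpropagation pattern.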
Now imagine taking a small patch of this image and running a small neural network on it, with, say, k
outputs, and representing them vertically. Now slide that neural network across the whole image; as a result,
we will get another image with a different width, height, and depth. Instead of just R, G, and B channels,
we now have more channels but smaller width and height. This operation is called convolution. If the
patch size is the same as that of the image, it will be a regular neural network. Because of this small
patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
Convolution layers consist of a set of learnable filters (a patch in the above image). Every filter has a
small width and height and the same depth as that of the input volume (3 if the input layer is an image input).
For example, suppose we have to run convolution on an image with dimension 34 x 34 x 3. The possible size
of filters can be a x a x 3, where 'a' can be 3, 5, 7, etc., but small compared to the image dimension.
During the forward pass, we slide each filter across the whole input volume step by step, where each
step size is called the stride (which can have value 2 or 3 or even 4 for high-dimensional images), and compute
the dot product between the weights of the filter and the patch from the input volume.
As we slide our filters we’ll get a 2-D output for each filter; we stack these together and, as a
result, get an output volume having a depth equal to the number of filters. The network will learn all
the filters.
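The forward pass described above can be sketched as a naive loop. This is an illustration rather than an efficient implementation; it assumes no padding, and the function name `conv2d_forward` and the random filter values are made up for the example.

```python
import numpy as np

def conv2d_forward(volume, filters, stride=1):
    """Naive convolution without padding.

    volume:  input of shape (H, W, C)
    filters: K filters of shape (K, a, a, C), same depth C as the input
    returns: output volume of shape (H_out, W_out, K)
    """
    H, W, C = volume.shape
    K, a, _, _ = filters.shape
    H_out = (H - a) // stride + 1
    W_out = (W - a) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                        # one 2-D output per filter
        for i in range(H_out):
            for j in range(W_out):
                r, c = i * stride, j * stride
                patch = volume[r:r + a, c:c + a, :]
                # Dot product between the filter weights and the patch.
                out[i, j, k] = np.sum(patch * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(34, 34, 3))          # the 34 x 34 x 3 example above
filters = rng.normal(size=(12, 3, 3, 3))      # 12 filters with a = 3

out = conv2d_forward(image, filters, stride=1)
print(out.shape)                              # (32, 32, 12)
```

With stride 1 and no padding, each spatial dimension shrinks from 34 to 34 - 3 + 1 = 32, and the output depth equals the number of filters.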
Types of layers:
1. Input Layer: This layer holds the raw input of the image with width 32, height 32, and depth 3.
2. Convolution Layer: This layer computes the output volume by computing the dot product between all filters
and image patches. If we use a total of 12 filters for this layer, we get an output volume of
dimension 32 x 32 x 12.
3. Activation Function Layer: This layer will apply an element-wise activation function to the output of
the convolution layer. Some common activation functions are RELU: max(0, x), Sigmoid: 1/(1+e^-x),
Tanh, Leaky RELU, etc. The volume remains unchanged hence output volume will have dimension 32 x 32
x 12.
4. Pool Layer: This layer is periodically inserted in the convnets and its main function is to reduce the size
of the volume, which makes the computation faster, reduces memory use and also helps prevent overfitting. Two common
types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and
stride 2, the resultant volume will be of dimension 16 x 16 x 12.
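The effect of the activation and pooling layers on the volume can be checked with a short sketch. The helper names `relu` and `max_pool` and the random input are assumptions for illustration; the 32 x 32 x 12 input matches the convolution layer output described above.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: max(0, x). The shape is unchanged."""
    return np.maximum(0, x)

def max_pool(volume, size=2, stride=2):
    """Max pooling applied independently to every channel of an (H, W, C) volume."""
    H, W, C = volume.shape
    H_out = (H - size) // stride + 1
    W_out = (W - size) // stride + 1
    out = np.zeros((H_out, W_out, C))
    for i in range(H_out):
        for j in range(W_out):
            r, c = i * stride, j * stride
            window = volume[r:r + size, c:c + size, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

conv_output = np.random.default_rng(0).normal(size=(32, 32, 12))
activated = relu(conv_output)
pooled = max_pool(activated, size=2, stride=2)
print(activated.shape, pooled.shape)   # (32, 32, 12) (16, 16, 12)
```

A 2 x 2 window with stride 2 halves the width and height and leaves the depth untouched.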
Performance Metrics
• Accuracy can be calculated by summing the values lying across the “main diagonal” of the confusion matrix and dividing by the total, i.e.
Accuracy = (True Positives + True Negatives) / Total Number of Samples
• Precision :- It is the number of correct positive results divided by the number of positive results predicted by the
classifier.
• Recall :- It is the number of correct positive results divided by the number of all relevant samples (all samples that are actually positive).
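As a worked example, the three metrics can be computed directly from the four cells of a binary confusion matrix. The counts below are made-up numbers for illustration, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total      # main diagonal divided by all samples
    precision = tp / (tp + fp)        # correct positives / predicted positives
    recall = tp / (tp + fn)           # correct positives / actual positives
    return accuracy, precision, recall

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives and 30 true negatives.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(accuracy, precision, round(recall, 3))   # 0.7 0.8 0.667
```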
Structured prediction is an umbrella term for supervised machine learning techniques that involve predicting structured objects,
rather than scalar discrete or real values.
Similar to commonly used supervised learning techniques, structured prediction models are typically trained
by means of observed data in which the true prediction value is used to adjust model parameters. Due to the
complexity of the model and the interrelations of predicted variables, the processes of prediction using a trained
model and of training itself are often computationally infeasible, so approximate inference and learning
methods are used.
For example, the problem of translating a natural language sentence into a syntactic representation such as a
parse tree can be seen as a structured prediction problem in which the structured output domain is the set of
all possible parse trees. Structured prediction is also used in a wide variety of application
domains including bioinformatics, natural language processing, speech recognition, and computer vision.
Sequence tagging is a class of problems prevalent in natural language processing, where input data are often
sequences (e.g. sentences of text). The sequence tagging problem appears in several guises, e.g. part-of-
speech tagging and named entity recognition. In POS tagging, for example, each word in a sequence must
receive a "tag" (class label) that expresses its "type" of word:
DT - Determiner
VB - Verb
JJ - Adjective
NN - Noun
Ranking :-
Learning to Rank (LTR) is a class of techniques that apply supervised machine learning (ML) to solve
ranking problems. The main difference between LTR and traditional supervised ML is this:
The most common application of LTR is search engine ranking, but it's useful anywhere you need to produce
a ranked list of items.
The training data for an LTR model consists of a list of items and a "ground truth" score for each of those
items. For search engine ranking, this translates to a list of results for a query and a relevance rating for each
of those results with respect to the query. The most common way used by major search engines to generate
these relevance ratings is to ask human raters to rate results for a set of queries.
Learning to rank algorithms have been applied in areas other than information retrieval:
UNIT - IV
Model Validation in Classification : Cross Validation - Holdout Method, K-Fold, Stratified K-Fold, Leave-
One-Out Cross Validation. Bias-Variance tradeoff, Regularization , Overfitting, Underfitting. Ensemble
Methods: Boosting, Bagging, Random Forest.
Model validation is the process of evaluating a trained model on a test data set. This measures the
generalization ability of the trained model. Here I provide a step-by-step approach to complete a first iteration of
model validation in minutes.
The basic recipe for applying a supervised machine learning model is:
Cross-validation (CV) is a technique used to train and evaluate an ML model using several portions of a dataset.
Rather than splitting the dataset into only two parts, one to train on and another to test on, the
dataset is divided into more slices, or “folds”. The model is trained and tested on different combinations
of these folds so as to assess its predictive capability and hence its accuracy.
In the process of building a training set, different portions of data are gathered, while the remaining ones
are reserved for constructing a validation set. This strategic approach ensures that the model
continuously leverages new and diverse data during training and testing stages, promoting its ability to
adapt to various scenarios and challenges.
One key objective of employing cross-validation is to safeguard the model against overfitting.
Overfitting occurs when a model simply memorizes the samples in the training set, resulting in an
artificially high predictive test score. However, such a model may struggle to generalize well on unseen
data, leading to a lack of useful results. By validating the model's performance on a separate validation
set, CV helps identify if the model has truly learned meaningful patterns and can generalize to new and
unseen scenarios effectively.
1. Slice and reserve portions of the dataset for the training set,
2. Using what's left, test the ML model.
3. Use CV techniques to test the model using the reserve portions of the dataset created in step 1.
1. CV assists in realizing the optimal tuning of hyperparameters (or model settings) that increase
the overall efficiency of the ML model's performance.
2. Training data is efficiently utilized as every observation is employed for both testing and
training.
1. One of the main considerations with cross-validation (CV) is the significant increase in testing
and training time it requires for machine learning models. This is because CV involves multiple
iterative training and testing cycles to estimate the accuracy of the model.
Each cycle includes steps such as test preparation, execution, and analysis of the results
to fine-tune the model. Therefore, understanding the time commitment
involved in CV is crucial for effectively leveraging its potential benefits.
2. Additional computation translates to increased resource demands. Cross Validation is known for
its high computational expense, necessitating ample processing power. This results in the first
drawback of extended time, which further inflates the budgetary requirements for an ML model
project.
Cross validation in machine learning is a crucial technique for evaluating the performance of predictive
models. It involves dividing the available data into multiple subsets, or folds, to train and test the model
iteratively. Non-exhaustive methods, such as k-fold cross-validation, randomly partition the data into k
subsets and train the model on k-1 folds while evaluating it on the remaining fold. On the other hand,
exhaustive methods, like leave-one-out cross-validation, systematically leave out one data point at a
time for testing while training the model on the remaining data points. These methods provide a
comprehensive assessment of the model's performance and help in addressing overfitting or underfitting
issues effectively.
1. Holdout Method
2. K-Fold CV
3. Stratified K-Fold CV
4. Leave-P-Out CV
5. Leave-One-Out CV
Holdout Method
The holdout method is a basic CV approach in which the original dataset is divided into two discrete
segments:
1. Training Data - As a reminder this set is used to fit and train the model.
2. Test Data - This set is used to evaluate the model.
As a non-exhaustive method, the holdout method trains the ML model on the training dataset and
evaluates it using the testing dataset.
In the majority of cases, the training dataset is much larger than the test dataset.
Therefore, a standard holdout split ratio is 70:30 or 80:20. Furthermore, the overall dataset is
randomly shuffled before it is divided into the training and test portions using the predetermined
ratio.
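A holdout split of this kind can be sketched with the standard library alone. The function name `holdout_split` and the toy dataset of 100 numbered samples are assumptions for illustration.

```python
import random

def holdout_split(samples, test_ratio=0.2, seed=0):
    """Shuffle the samples, then cut them into a training and a test portion."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)          # random rearrangement first
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]    # (train, test)

data = list(range(100))                            # stand-in for 100 samples
train, test = holdout_split(data, test_ratio=0.2)  # an 80:20 split
print(len(train), len(test))                       # 80 20
```

Changing `seed` produces a different split, which is one way to observe the inconsistency between runs that the holdout method can suffer from.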
There are several disadvantages to the holdout method that need to be considered. One drawback is that
as the model trains on distinct combinations of data points, it can sometimes yield inconsistent results,
which can introduce doubt into the validity of the model and the overall validation process.
Another concern is that there is no certainty that the training dataset selected fully represents the
complete dataset. If the original data sample is not large enough, there is a possibility that the test data
may contain information that the model will fail to recognize because it was not included in the original
training data portion.
However, despite these limitations, the Holdout CV method can be considered ideal in situations where
time is a scarce project resource and there is an urgency to train and test an ML model using a large
dataset.
K fold Cross-Validation
The k-fold cross-validation method is considered an improvement over the holdout method due to its
ability to provide additional consistency to the overall testing score of machine learning models. This
improvement is achieved by applying a specific procedure for selecting and dividing the training and
testing datasets.
To implement k-fold cross-validation, the original dataset is divided into k number of partitions. The
holdout method is then performed k number of occasions, each time using a different partition as the
testing set, while the remaining partitions are used for training. This repeated process helps to obtain a
more reliable and robust evaluation of the model's performance by leveraging a larger amount of data for
testing and training purposes.
Let us look at an example: if the value of k is set to six, there will be six subsets of equivalent sizes, or
folds, of data. In the first iteration, the model trains on five of the subsets and validates on the remaining one. In the
second iteration, the model re-trains on a different combination of five subsets and is tested on the subset that was left out. And so
on for six iterations in total, so that each fold is used exactly once for testing.
The k-fold cross-validation randomly splits the original dataset into k folds.
The test results of each iteration are then averaged; the average is called the CV accuracy. Finally, CV
accuracy is employed as a performance metric to compare the efficiencies of different ML
models. It is important to note that the value of k is arbitrary. However, the k value is
commonly set to ten within the data science field. The k-fold cross-validation approach is widely
recognized for evaluating ML models with reduced subjectivity. By ensuring that each data point is
used in both testing and training (in different iterations), this technique enhances the objectivity of the
evaluation. Moreover, the k-fold method proves to be particularly advantageous for data science projects
with a finite amount of data. It maximizes the utilization of available data by repeatedly training and testing on
different subsets.
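The k-fold procedure can be sketched without any library support. In the sketch below the helper names, the 30 toy labels and the trivial "majority class" model are all assumptions for illustration; any real model could be plugged in through the `evaluate` callback.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random split into folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]                        # fold i is held out
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

def cross_val_accuracy(n_samples, k, evaluate):
    """Average the per-fold score returned by evaluate(train_idx, test_idx)."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Toy data: 18 samples of class 0 and 12 samples of class 1.
labels = [0] * 18 + [1] * 12

def majority_class_accuracy(train_idx, test_idx):
    """Hypothetical 'model': always predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

cv_accuracy = cross_val_accuracy(len(labels), k=6, evaluate=majority_class_accuracy)
print(round(cv_accuracy, 3))
```

Every sample lands in exactly one test fold and in the training portion of the other k-1 iterations, which is the property the text relies on.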
Jake VanderPlas gives the process of model validation in four simple and clear steps. There is also a whole
process needed before we even get to his first step, such as fetching all the information we need from the data to
make a good judgement when choosing a class of model, and providing finishing touches to confirm the results
afterwards. I will go into depth about these steps and break them down further.
Data cleansing and wrangling.
Feature engineering to optimize the metrics. (Skip this during first pass).
Data pre-processing.
Feature selection.
Model selection.
Model validation.
Get the best model and check it against test data set.
Domain knowledge of the problem at hand will be of great use for feature engineering. This is a bigger topic
in itself and requires extensive investment of time and resources.
Data pre-processing.
Data pre-processing converts features into a format that is more suitable for the estimators. In general,
machine learning models prefer a standardized data set. I will make use of RobustScaler for our
example.
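To show what this kind of scaling does, here is a small NumPy sketch of the idea behind scikit-learn's RobustScaler in its default configuration: centre each feature on its median and divide by its interquartile range. The function name `robust_scale` and the sample values are assumptions; this is not the library's implementation.

```python
import numpy as np

def robust_scale(X):
    """Centre each column on its median and divide by its interquartile range."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0               # avoid dividing by zero for constant columns
    return (X - median) / iqr

# One feature with an obvious outlier (100).
X = [[1.0], [2.0], [3.0], [4.0], [100.0]]
print(robust_scale(X).ravel())        # [-1.  -0.5  0.   0.5 48.5]
```

Because the median and the quartiles barely move when an outlier is added, the scaled values of the ordinary points stay in a small range, unlike with mean and standard deviation scaling.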
Feature selection.
High variance: The model is very sensitive to the provided inputs for the learned features.
Low accuracy: One model (or one algorithm) to fit the entire training data might not provide
you with the nuance your project requires.
Features noise and bias: The model relies heavily on too few features while making a
prediction.
Ensemble Algorithm
A single algorithm may not make the perfect prediction for a given data set. Machine learning
algorithms have their limitations and producing a model with high accuracy is challenging. If we
build and combine multiple models, we have the chance to boost the overall accuracy. We then
implement the combination of models by aggregating the output from each model with two
objectives:
1. Bagging
2. Boosting
3. Stacking
4. Blending
BAGGING
The idea of bagging (bootstrap aggregating) is to train several models independently, each on a
slightly different subset of the training data set, and then to combine their predictions. Unlike in
boosting, no model learns from the errors of a previous one. Bagging reduces variance and minimizes overfitting. One example of such a
technique is the random forest algorithm.
This technique is based on bootstrap sampling. Bootstrapping creates multiple sets
from the original training data by sampling with replacement. Replacement allows sample
instances to be duplicated within a set. Each subset has the same size and can be used to train models in parallel.
The method involves:
Creating multiple subsets from the original dataset with replacement,
Building a base model for each of the subsets,
Running all the models in parallel,
Combining predictions from all models to obtain final predictions.
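The four steps above can be sketched with the standard library. The base model here is a deliberately simple "nearest class mean" classifier on one feature, and the six data points are invented; both are assumptions made to keep the example short, and the models are trained one after another although nothing prevents training them in parallel.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, so duplicates can occur."""
    return [rng.choice(data) for _ in data]

def fit_mean_classifier(sample):
    """Toy base model: remember the mean feature value of each class."""
    sums, counts = {}, Counter()
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_one(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def bagging_predict(models, x):
    """Combine the base models by majority vote."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (7.0, "b"), (8.0, "b"), (9.0, "b")]

subsets = [bootstrap_sample(data, rng) for _ in range(25)]   # step 1
models = [fit_mean_classifier(s) for s in subsets]           # steps 2 and 3
print(bagging_predict(models, 1.2), bagging_predict(models, 8.5))  # step 4
```

An individual bootstrap sample may happen to miss a class entirely; the vote across many models is what makes the combined prediction stable.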
Boosting
Boosting is a machine learning ensemble technique that reduces bias and variance by converting weak learners
into strong learners. The weak learners are applied to the dataset in a sequential manner. The first step is building
an initial model and fitting it to the training set.
A second model that tries to fix the errors generated by the first model is then fitted. Here’s what the entire
process looks like:
Create a subset from the original data,
Build an initial model with this data,
Run predictions on the whole data set,
Calculate the error using the predictions and the actual values,
Assign more weight to the incorrect predictions,
Create another model that attempts to fix errors from the last model,
Run predictions on the entire dataset with the new model,
Create several models with each model aiming at correcting the errors generated by the previous one,
Obtain the final model as the weighted mean of all the models.
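One concrete instance of this scheme is AdaBoost with decision stumps. The notes do not name a specific algorithm, so choosing AdaBoost is an assumption, as are the five toy data points and the helper names. The sketch reweights the samples instead of creating subsets, which is the usual AdaBoost formulation.

```python
import numpy as np

def stump_predict(x, threshold, polarity):
    """Predict `polarity` for x <= threshold and the opposite label otherwise."""
    return np.where(x <= threshold, polarity, -polarity)

def fit_stump(x, y, w):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in np.unique(x):
        for polarity in (1, -1):
            err = np.sum(w[stump_predict(x, threshold, polarity) != y])
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(x, y, n_rounds):
    w = np.full(len(x), 1.0 / len(x))           # start with equal sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = fit_stump(x, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this model in the final vote
        pred = stump_predict(x, threshold, polarity)
        w = w * np.exp(-alpha * y * pred)       # misclassified samples gain weight
        w = w / w.sum()
        ensemble.append((alpha, threshold, polarity))
    return ensemble

def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted sum of all stump predictions."""
    return np.sign(sum(a * stump_predict(x, t, p) for a, t, p in ensemble))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, 1])                 # no single threshold separates this

first_only = ensemble_predict(adaboost(x, y, n_rounds=1), x)
boosted = ensemble_predict(adaboost(x, y, n_rounds=3), x)
print(first_only, boosted)
```

The first stump alone gets the last point wrong; after three rounds the weighted combination classifies all five training points correctly.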
The Random Forest algorithm's widespread popularity stems from its user-friendly nature and
adaptability, enabling it to tackle both classification and regression problems effectively. The
algorithm’s strength lies in its ability to handle complex datasets and mitigate overfitting.
One of the most important features of the Random Forest Algorithm is that it can handle data
sets containing continuous variables, as in the case of regression, and categorical variables, as in
the case of classification. It performs well on both classification and regression tasks. In this tutorial,
we will understand the working of random forest and implement random forest on a
classification task.
As mentioned earlier, Random forest works on the Bagging principle. Now let’s dive in and
understand bagging in detail.
Step 1: In the Random forest model, a subset of data points and a subset of features is selected
for constructing each decision tree. Simply put, n random records and m features are taken from
the data set having k number of records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or Averaging for Classification and
regression, respectively.
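Steps 1 and 4 can be sketched on their own; building the decision trees themselves (steps 2 and 3) is left out. The helper names, the sizes and the tree outputs below are hypothetical, and drawing the records with replacement is an assumption that follows the bagging principle.

```python
import random
from collections import Counter

def sample_for_tree(n_records_total, n_features_total, n, m, rng):
    """Step 1: choose n random records and m random features for one tree."""
    rows = [rng.randrange(n_records_total) for _ in range(n)]   # with replacement
    features = rng.sample(range(n_features_total), m)           # distinct features
    return rows, features

def aggregate(tree_outputs, task):
    """Step 4: majority voting for classification, averaging for regression."""
    if task == "classification":
        return Counter(tree_outputs).most_common(1)[0][0]
    return sum(tree_outputs) / len(tree_outputs)

rng = random.Random(0)
rows, features = sample_for_tree(n_records_total=100, n_features_total=8,
                                 n=100, m=3, rng=rng)

# Hypothetical outputs of four individual trees.
print(aggregate(["apple", "banana", "apple", "apple"], "classification"))  # apple
print(aggregate([2.0, 3.0, 4.0], "regression"))                            # 3.0
```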
For example
Consider the fruit basket as the data, as shown in the figure below. Now n samples are
taken from the fruit basket, and an individual decision tree is constructed for each sample. Each
decision tree will generate an output, as shown in the figure. The final output is decided by
majority voting. In the figure below, you can see that most of the decision trees give apple as their
output rather than banana, so the final output is taken as apple.
Diversity: Not all attributes/variables/features are considered while making an individual tree;
each tree is different.
Less affected by the curse of dimensionality: Since each tree does not consider all the features, the
feature space is reduced.
Parallelization: Each tree is created independently out of different data and attributes. This
means we can fully use the CPU to build random forests.
Train-Test split: In a random forest, we don’t necessarily have to segregate the data for train and test, as
roughly a third of the data (the out-of-bag samples) is not seen by any given decision tree.
Stability: Stability arises because the result is based on majority voting/ averaging.
UNIT – V
Unsupervised Learning : Clustering-K-means, K-Modes, K-Prototypes, Gaussian Mixture
Models, Expectation-Maximization.
Reinforcement Learning: Exploration and exploitation trade-offs, non-associative learning, Markov decision
processes, Q-learning.
Introduction to clustering
As the name suggests, unsupervised learning is a machine learning technique in which models are
not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights
in the given data. It can be compared to the learning which takes place in the human brain while
learning new things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.”
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning works on unlabeled and uncategorized data,
which makes unsupervised learning more important.
o In the real world, we do not always have input data with the corresponding
output, so to solve such cases we need unsupervised learning.
Here, we have taken unlabeled input data, which means it is not categorized and the corresponding outputs
are also not given. This unlabeled input data is fed to the machine learning model in order to
train it. First, the model interprets the raw data to find the hidden patterns in the data and then
applies a suitable algorithm such as k-means clustering.
Once the suitable algorithm is applied, it divides the data objects into groups according to
the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering: Clustering is a method of grouping objects into clusters such that the objects with the most
similarities remain in one group and have few or no similarities with the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
o Hierarchical clustering
o Anomaly detection
o Neural Networks
One of the most used clustering algorithms is k-means. It groups the data into k clusters according
to the existing similarities among them, where k is given as input to the algorithm. I'll
start with a simple example.
Let’s imagine we have 5 objects (say 5 people) and for each of them we know two features
As you probably already know, I'm using Python libraries to analyze my data. The k-means
algorithm is implemented in the scikit-learn package. To use it, you will just need the following line
in your script:
At this point, you may have noticed something. The basic concept of k-means rests on
mathematical calculations (means, Euclidean distances). But what if our data is non-numerical or, in
other words, categorical? Imagine, for instance, having the ID code and date of birth of the five
people of the previous example, instead of their heights and weights.
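The calculations k-means relies on, means and Euclidean distances, can be written out directly in NumPy. This is a bare-bones sketch rather than the scikit-learn implementation, and the heights (cm) and weights (kg) of the five people are invented values used only for illustration.

```python
import numpy as np

def k_means(points, k, n_iters=10, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids

# Hypothetical (height, weight) pairs for five people.
people = np.array([
    [185.0, 72.0],
    [170.0, 56.0],
    [168.0, 60.0],
    [179.0, 68.0],
    [182.0, 72.0],
])

labels, centroids = k_means(people, k=2)
print(labels)
```

With these numbers the three taller, heavier people end up in one cluster and the other two in the second cluster; which cluster receives label 0 depends on the random initialization.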
We could think of transforming our categorical values into numerical values and then applying
k-means. But beware: k-means uses numerical distances, so it could consider two really different
objects to be close merely because they have been assigned two close numbers.
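A tiny example makes the pitfall concrete. The colour and size attributes and the integer codes are invented for illustration; the second measure, which simply counts mismatching attributes, is the kind of dissimilarity that k-modes uses for categorical data.

```python
def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

# An arbitrary integer encoding of three colours.
codes = {"red": 0, "green": 1, "blue": 2}

# Numerical distance depends on the arbitrary codes: "blue" looks twice as
# far from "red" as "green" does, although all three are simply different.
print(abs(codes["red"] - codes["green"]))   # 1
print(abs(codes["red"] - codes["blue"]))    # 2

# Counting mismatches treats every pair of different categories alike.
print(matching_dissimilarity(("red", "small"), ("green", "small")))  # 1
print(matching_dissimilarity(("red", "small"), ("blue", "small")))   # 1
```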
Expectation-step is used to assign data points to the nearest cluster, and the Maximization-step is
When using the K-means algorithm, we must keep the following points in mind:
It is suggested to normalize the data while dealing with clustering algorithms such as K-
Means since such algorithms employ distance-based measurement to identify the similarity
Because of the iterative nature of K-Means and the random initialization of centroids, K-
Means may become stuck in a local optimum and fail to converge to the global optimum. As
k-Prototype
One of the conventional clustering methods, commonly used and efficient
for large data, is the K-Means algorithm. However, it is not suitable for data
that contains categorical variables. This problem arises because the cost function in K-Means is
calculated using the Euclidean distance, which is only suitable for numerical data. K-Modes, in turn, is
suitable only for categorical data, not for mixed data types.
Facing these problems, Huang proposed an algorithm called K-Prototype, which is designed to
cluster data with mixed data types (numerical and categorical variables). K-
Reinforcement learning
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its
environment can learn to choose optimal actions to achieve its goals.
Introduction
Consider building a learning robot. The robot, or agent, has a set of sensors
to observe the state of its environment, and a set of actions it can perform to alter this state.
Its task is to learn a control strategy, or policy, for choosing actions that achieve its goals.
The goals of the agent can be defined by a reward function that assigns a
numerical value to each distinct action the agent may take from each distinct state.
This reward function may be built into the robot, or known only to an external
teacher who provides the reward value for each action performed by the robot.
The task of the robot is to perform sequences of actions, observe their
consequences, and learn a control policy.
The control policy is one that, from any initial state, chooses actions that
maximize the reward accumulated over time by the agent.
Example:
A mobile robot may have sensors such as a camera and sonars, and actions such as "move forward" and
"turn."
The robot may have a goal of docking onto its battery charger whenever its battery level is low.
The goal of docking to the battery charger can be captured by assigning a positive reward (e.g., +100)
to state-action transitions that immediately result in a connection to the charger and a reward of
zero to every other state-action transition.
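The reward function in this example can be sketched in a few lines of Python. The state names, the
action names and the toy transition function below are hypothetical; only the +100 / 0 reward scheme
comes from the example.

```python
DOCKED = "docked"  # hypothetical name for the state "connected to the charger"


def transition(state, action):
    """Hypothetical deterministic successor function for a toy robot."""
    if state == "at_charger" and action == "connect":
        return DOCKED
    return state  # every other action leaves the toy state unchanged


def reward(state, action):
    """+100 for a transition that immediately connects to the charger, else 0."""
    connects_now = state != DOCKED and transition(state, action) == DOCKED
    return 100 if connects_now else 0
```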
1. Delayed reward: The task of the agent is to learn a target function π that maps from the current
state s to the optimal action a = π(s). In reinforcement learning, training information is not
available in the form (s, π(s)). Instead, the trainer provides only a sequence of immediate reward
values as the agent executes its sequence of actions. The agent therefore faces the problem of
temporal credit assignment: determining which of the actions in its sequence are to be credited with
producing the eventual rewards.
3. Partially observable states: Although it is convenient to assume that the agent's sensors can
perceive the entire state of the environment at each time step, in many practical situations sensors
provide only partial information. In such cases, the agent needs to consider its previous
observations together with its current sensor data when choosing actions, and the best policy may be
one that chooses actions specifically to improve the observability of the environment.
4. Life-long learning: A robot is often required to learn several related tasks within the same
environment, using the same sensors. For example, a mobile robot may need to learn how to dock on
its battery charger, how to navigate through narrow corridors, and how to pick up output from laser
printers. This setting raises the possibility of using previously obtained experience or knowledge
to reduce sample complexity when learning new tasks.
Learning Task
Consider a Markov decision process (MDP) in which the agent can perceive a set S of distinct states of
its environment and has a set A of actions that it can perform.
At each discrete time step t, the agent senses the current state s_t, chooses a current action a_t,
and performs it.
The environment responds by giving the agent a reward r_t = r(s_t, a_t) and by producing the
succeeding state s_{t+1} = δ(s_t, a_t). Here the functions δ(s_t, a_t) and r(s_t, a_t) depend only on
the current state and action, and not on earlier states or actions.
The task of the agent is to learn a policy, π: S → A, for selecting its next action a_t based on the
current observed state s_t; that is, π(s_t) = a_t.
How shall we specify precisely which policy π we would like the agent to learn?
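One approach is to require the policy that produces the greatest possible cumulative reward for the
agent over time. Here V^π(s_t) denotes the discounted cumulative reward obtained by following policy
π from state s_t, in which a reward received i time steps in the future is weighted by γ^i, where
0 ≤ γ < 1 is the discount factor. Another possible criterion considers the average reward per time
step over the entire lifetime of the agent.
We require that the agent learn a policy π that maximizes V^π(s) for all states s. Such a policy is
called an optimal policy and is denoted by π*.
We refer to the value function V^{π*}(s) of an optimal policy as V*(s). V*(s) gives the maximum
discounted cumulative reward that the agent can obtain starting from state s.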
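The displayed equations for these definitions are not reproduced in these notes. A reconstruction in
the standard notation, with γ denoting the discount factor and r_{t+i} the reward received i steps
after time t while following π, would read:

```latex
\begin{aligned}
V^{\pi}(s_t) &\equiv r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots
             = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}, \qquad 0 \le \gamma < 1 \\
\text{average reward:}\quad & \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i} \\
\pi^{*} &\equiv \operatorname*{arg\,max}_{\pi} V^{\pi}(s) \quad \text{for all } s \\
V^{*}(s) &\equiv V^{\pi^{*}}(s)
\end{aligned}
```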
Example:
The six grid squares in this diagram represent six possible states, or locations, for the agent.
Each arrow in the diagram represents a possible action the agent can take to move from one state to
another.
The number associated with each arrow represents the immediate reward r(s, a) the agent receives if
it executes the corresponding state-action transition.
The immediate reward in this environment is defined to be zero for all state-action transitions
except for those leading into the state labelled G. The state G is the goal state, and the agent can
receive reward by entering this state.
Once the states, actions, and immediate rewards are defined, we choose a value for the discount
factor γ and determine the optimal policy π* and its value function V*(s).
Let’s choose γ = 0.9. The diagram at the bottom of the figure shows one optimal policy.
The values of V*(s) and Q(s, a) follow from r(s, a) and the discount factor γ = 0.9. An optimal
policy, corresponding to actions with maximal Q values, is also shown.
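To make the relationship between r(s, a), γ and V*(s) concrete, the sketch below computes V* by
repeated sweeps of value iteration on a six-state grid. Because the figure is not reproduced here,
the layout is an assumption: a 2 x 3 grid with G in the top-right corner, moves between adjacent
squares, G treated as absorbing, and a reward of 100 for entering G (the value 100 is taken from the
later Q learning example). All function names are illustrative.

```python
GAMMA = 0.9
ROWS, COLS = 2, 3            # assumed layout: 2 x 3 grid
GOAL = (0, 2)                # assumed position of the goal state G
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]


def delta(state, action):
    """Deterministic successor state delta(s, a)."""
    dr, dc = MOVES[action]
    return (state[0] + dr, state[1] + dc)


def actions(state):
    """Actions that keep the agent on the grid; G is absorbing (no actions)."""
    if state == GOAL:
        return []
    return [a for a in MOVES if delta(state, a) in STATES]


def reward(state, action):
    """Immediate reward r(s, a): 100 for entering G, zero otherwise."""
    return 100 if delta(state, action) == GOAL else 0


def value_iteration(sweeps=20):
    """Repeatedly apply V(s) <- max_a [r(s, a) + GAMMA * V(delta(s, a))]."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {
            s: max((reward(s, a) + GAMMA * V[delta(s, a)] for a in actions(s)),
                   default=0.0)
            for s in STATES
        }
    return V


def greedy_policy(V):
    """One optimal policy: in each state pick an action maximizing r + GAMMA * V."""
    return {
        s: max(actions(s), key=lambda a: reward(s, a) + GAMMA * V[delta(s, a)])
        for s in STATES if s != GOAL
    }


V_star = value_iteration()
policy = greedy_policy(V_star)
```

Under these assumptions the sweeps settle at V* = 100 for the two squares next to G, 90 for the
squares two moves away, and 81 for the remaining square. Where two actions tie, the sketch simply
keeps the first one, which is why it yields only one of several optimal policies.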
Non-Associative Learning:
As applied to animal behavior, non-associative learning refers to instances where behavior toward a
stimulus changes in the absence of any apparent associated stimulus or event (such as a reward or
punishment).
In non-associative learning, the person is being trained on how to respond to a certain situation.
There is a right and a wrong answer.
Supervised learning algorithms use non-associative learning. These algorithms learn from the
training data. Primarily, they are taught based on the assumption there is a right or wrong answer.
The cost function, or loss, associated with the algorithm, is a similar concept to ‘punishment.’
In non-associative machine learning, you use the training data set to teach the machine learning
algorithm how to make predictions, instead of letting the algorithm learn for itself what the
outcome should be.
1. REGRESSION ANALYSIS
The classic example of supervised ML using regression is the prediction of house prices.
For example, the training data might pair the number of rooms a house has (input) with the price of
the house (output).
This training data teaches the machine how the number of rooms and the price are related, allowing
it to predict the output (the price of a house) from the input (the number of rooms).
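As a small illustration of the house-price example, the sketch below fits a straight line
price = slope * rooms + intercept by ordinary least squares. The room counts and prices are made-up
numbers chosen only to keep the arithmetic easy to follow, and the function name is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rooms = [1, 2, 3, 4]            # hypothetical inputs
prices = [100, 150, 200, 250]   # hypothetical outputs, in thousands

slope, intercept = fit_line(rooms, prices)
predicted_price = slope * 5 + intercept   # prediction for an unseen 5-room house
```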
2. CLASSIFICATION ANALYSIS
In classification analysis, we use machine learning to determine which group an object belongs to.
One classic example is deciding whether a tumor is malignant or benign. Another is predicting, yes
or no, whether someone is likely to pass an exam.
A further example is: will this person develop diabetes? Yes or no.
In classification analysis, the labeled training data set will have a sample set of people and their
characteristics alongside whether or not they developed diabetes.
This training data is there to teach the machine how different characteristics of a person’s genetics
or lifestyle contribute to whether or not they would get diabetes.
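A labeled training set of this kind can be used by many different classifiers. The sketch below uses
one of the simplest, a 1-nearest-neighbour rule, on the pass/fail exam example: a new case receives
the label of the most similar training case. The two features (hours studied, classes attended) and
all data values are hypothetical.

```python
def nearest_neighbour_label(query, examples):
    """Return the label of the training example closest to `query`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(examples, key=lambda ex: squared_distance(query, ex[0]))
    return label


# Hypothetical labeled training data: (hours studied, classes attended) -> outcome
training = [
    ((1.0, 4.0), "fail"),
    ((2.0, 5.0), "fail"),
    ((6.0, 9.0), "pass"),
    ((7.0, 8.0), "pass"),
]

prediction = nearest_neighbour_label((6.5, 8.0), training)
```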
Q LEARNING
The training information available to the learner is the sequence of immediate rewards r(s_i, a_i)
for i = 0, 1, 2, ....
Given this kind of training information, it is easier to learn a numerical evaluation function
defined over states and actions, and then implement the optimal policy in terms of this evaluation
function.
What evaluation function should the agent attempt to learn?
One obvious choice is V*. The agent should prefer state s_1 over state s_2 whenever
V*(s_1) > V*(s_2), because the cumulative future reward will be greater from s_1.
The optimal action in state s is the action a that maximizes the sum of the immediate reward r(s, a)
plus the value V* of the immediate successor state, discounted by γ.
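In symbols, with δ(s, a) denoting the successor state as in the learning task above, this statement
can be written as follows (a reconstruction in standard notation, since the original equation is not
reproduced in these notes):

```latex
\pi^{*}(s) = \operatorname*{arg\,max}_{a} \left[ r(s, a) + \gamma \, V^{*}\big(\delta(s, a)\big) \right]
```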
The Q Function
The value of the evaluation function Q(s, a) is the reward received immediately upon executing
action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter.
The key problem is finding a reliable way to estimate training values for Q, given only a sequence
of immediate rewards r spread out over time.
Rewriting Equation
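The equations that belong here are missing from these notes. In the standard formulation for a
deterministic MDP they would read as follows (a reconstruction, with δ(s, a) the successor state and
a' ranging over the actions available in that state):

```latex
\begin{aligned}
Q(s, a) &\equiv r(s, a) + \gamma \, V^{*}\big(\delta(s, a)\big) \\
\pi^{*}(s) &= \operatorname*{arg\,max}_{a} Q(s, a) \\
V^{*}(s) &= \max_{a'} Q(s, a') \\
Q(s, a) &= r(s, a) + \gamma \max_{a'} Q\big(\delta(s, a), a'\big)
\end{aligned}
```

The last line is the recursive form obtained by substituting the third line into the first; it is
the basis for iteratively approximating Q.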
Q learning algorithm:
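The algorithm listing itself is not reproduced in these notes. A minimal Python sketch of tabular
Q learning for a deterministic MDP, using the training rule Q̂(s, a) ← r + γ max_a' Q̂(s', a'), might
look like the following. The grid layout (2 x 3, goal G in the top-right corner, reward 100 for
entering G), the purely random action selection, the episode count and all names are assumptions
made for illustration.

```python
import random

GAMMA = 0.9
ROWS, COLS = 2, 3            # assumed layout: 2 x 3 grid
GOAL = (0, 2)                # assumed position of the goal state G
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]


def delta(state, action):
    """Deterministic successor state delta(s, a)."""
    dr, dc = MOVES[action]
    return (state[0] + dr, state[1] + dc)


def actions(state):
    """Legal actions; the goal state is absorbing and has none."""
    if state == GOAL:
        return []
    return [a for a in MOVES if delta(state, a) in STATES]


def reward(state, action):
    """Immediate reward: 100 for entering G, zero otherwise."""
    return 100 if delta(state, action) == GOAL else 0


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry Q_hat(s, a) to zero.
    Q = {(s, a): 0.0 for s in STATES for a in actions(s)}
    start_states = [s for s in STATES if s != GOAL]
    for _ in range(episodes):
        s = rng.choice(start_states)          # each episode starts in a random state
        while s != GOAL:
            a = rng.choice(actions(s))        # select an action (here: at random)
            r, s_next = reward(s, a), delta(s, a)
            best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)),
                            default=0.0)
            Q[(s, a)] = r + GAMMA * best_next  # update the table entry
            s = s_next                         # move to the new state
    return Q


Q_hat = q_learning()
```

Random action selection is used here only because it eventually tries every state-action pair; other
selection strategies are possible as long as every pair keeps being visited.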
An Illustrative Example
To illustrate the operation of the Q learning algorithm, consider a single action taken by an agent,
and the corresponding refinement to the estimate Q̂.
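The agent moves one cell to the right in its grid world and receives an immediate reward of zero for
this transition.
It then refines its estimate Q̂ for the transition it just executed: the new estimate is the sum of
the received reward (zero) and the highest Q̂ value associated with the resulting state (100),
discounted by γ (0.9).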
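Written out with the numbers given in the text (reward 0, highest Q̂ value in the resulting state 100,
γ = 0.9), the update is as follows. The labels s_1, s_2 and a_right for the two cells and the move
are names chosen here for illustration:

```latex
\hat{Q}(s_1, a_{\mathrm{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')
  = 0 + 0.9 \times 100 = 90
```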
Convergence
Will the Q learning algorithm converge toward an estimate Q̂ equal to the true Q function?
Yes, under certain conditions.
1. Assume the system is a deterministic MDP.
2. Assume the immediate reward values are bounded; that is, there exists some positive constant c
such that for all states s and actions a, |r(s, a)| < c.
3. Assume the agent selects actions in such a fashion that it visits every possible state-action
pair infinitely often.
Here are four machine learning trends that could become a reality in the near future:
Algorithms can help companies unearth insights about their business, but this proposition can be
expensive, with no guarantee of a bottom-line increase. Companies often have to collect data, hire
data scientists, and train them to deal with changing databases. Now that more data metrics are
becoming available, the cost of storing them is dropping thanks to the cloud. There will no longer
be a need to manage infrastructure, as cloud systems can generate new models as the scale of an
operation increases, while also delivering more accurate results. More open-source ML frameworks are
coming into the fold, with pre-trained platforms that can tag images, recommend products, and
perform natural language processing tasks.
One of the tasks that ML can help companies deal with is the manipulation and classification of
large quantities of vectors in high-dimensional spaces. Current algorithms take a large amount of
time to solve these problems, costing companies more to complete their business processes. Quantum
computers are expected to gain prominence because they could manipulate high-dimensional vectors in
a fraction of the time, increasing the number of vectors and dimensions that can be processed in a
given period compared with traditional algorithms.
3) Improved Personalization
Retailers are already making waves in developing recommendation engines that reach their target
audience more accurately. Taking this a step further, ML will be able to improve the personalization
techniques of these engines in more precise ways. The technology will offer more specific data that
retailers can then use in ads to improve the shopping experience for consumers.
4) Data on Data
As the amount of data available increases, the cost of storing this data decreases at roughly the
same rate. ML has great potential in generating data of the highest quality that will lead to better
models, an improved user experience, and more data that helps repeat and improve upon this cycle.
Companies such as Tesla add a million miles of driving data every hour to enhance their self-driving
capabilities. Tesla's Autopilot feature learns from this data and improves the software that propels
these self-driving vehicles forward as the company gathers more data on the possible pitfalls of
autonomous driving technology.