21AI71 Module 5 Textbook
Notes of Module 5 of VTU 7th Sem Advanced AIML

MODULE 5

Clustering

Clustering is an unsupervised machine learning technique that groups similar data points into
clusters based on their features. Unlike supervised learning, clustering does not rely on labeled data;
instead, it identifies inherent patterns and structures within the dataset. It is widely used for
exploratory data analysis, pattern recognition, and preprocessing in machine learning pipelines.

The goal of clustering is to maximize intra-cluster similarity (data points within the same
cluster are similar) and minimize inter-cluster similarity (data points in different clusters are
distinct). Common applications include:

• Customer segmentation in marketing.


• Document grouping in text analytics.
• Image compression in computer vision.

How Clustering Works

Clustering algorithms use different distance (or similarity/dissimilarity) measures to derive
different clusters, and the measure used plays a crucial role in the final cluster formation. A
larger distance implies that observations are far away from one another, whereas a higher
similarity indicates that the observations are alike. To understand how different clusters can
exist in a dataset, we will use a small dataset of customers containing their age and income
information (File Name: Income Data.csv). We can analyze the customer segments that might
exist and identify the key attributes of each segment.

We will plot these two attributes using a scatter plot and visualize what the segments look like.
This is possible because we are dealing with only two features: age and income; it may not be
possible to visualize the clusters when there are many features.

import pandas as pd
customers_df = pd.read_csv("Income Data.csv")

Use the following code to print the first few records from the dataset.

customers_df.head(5)

The scatter plot of age against income (Figure 7.1) can be drawn as follows:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

# Older seaborn versions use lmplot("age", "income", ..., size=4); newer versions use height.
sn.lmplot(x="age", y="income", data=customers_df, fit_reg=False, height=4)

From the scatter plot (Figure 7.1), it can be observed that there are three customer segments,
which can be described as follows:
1. One on the top-left side of the graph, depicting a low-age–high-income group.
2. One on the top-right side of the graph, depicting a high-age–medium-income group.
3. One at the bottom of the graph, depicting a low-income group with an age spread from low
to high.

Finding Similarities Using Distances


Clustering techniques assume that there are subsets in the data that are similar or homogeneous.
One approach for measuring similarity is through distances measured using different metrics. A
few distance measures used in clustering are discussed in the following sections.
7.2.1.1 Euclidean Distance
Euclidean distance is the radial distance between two observations or records. If there are
many attributes (features), then the distance across all attributes is combined to obtain the
overall distance. The Euclidean distance between two observations X1 and X2 with n features is

   d(X1, X2) = √( Σ_{i=1}^{n} (Xi1 − Xi2)² )

where Xi1 is the value of the ith feature for the first observation and Xi2 is the value of the ith
feature for the second observation. For example, the distance between two customers,
customer1 and customer2, is calculated by applying this formula to their age and income values.

Other Distance Metrics


Some of the other widely used distances are as follows:
1. Minkowski Distance: It is the generalized distance measure between two observations.
2. Jaccard Similarity Coefficient: It is a measure used when the data is qualitative, especially
when attributes can be represented in binary form.
3. Cosine Similarity: Here X1 and X2 are two n-dimensional vectors, and the similarity measures
the angle between them (hence also called the vector space model).
4. Gower’s Similarity Coefficient: The distance and similarity measures that we have
discussed so far are valid either for quantitative data or qualitative data. Gower’s similarity
coefficient can be used when both quantitative and qualitative features are present.
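As a quick illustration, a few of these measures can be computed with scipy's distance module. The feature vectors below are made up purely for illustration:

from scipy.spatial import distance

# Two customers described by (age, income) -- illustrative values only.
x1 = [25, 40000]
x2 = [45, 42000]

print(distance.euclidean(x1, x2))        # straight-line (radial) distance
print(distance.minkowski(x1, x2, p=3))   # generalized Minkowski distance
print(1 - distance.cosine(x1, x2))       # cosine similarity (scipy returns the distance)

# Jaccard similarity is defined on binary attributes.
b1 = [1, 0, 1, 1]
b2 = [1, 1, 1, 0]
print(1 - distance.jaccard(b1, b2))      # shared 1s relative to the union of 1s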

Types of Clustering

1. Partitioning Clustering:
Divides the data into non-overlapping subsets.
o Example: k-means, k-medoids.
o Best for datasets with a clear cluster structure.
2. Hierarchical Clustering:
Builds a tree-like structure of clusters.
o Types: Agglomerative (bottom-up) and Divisive (top-down).

o Useful for analyzing data with nested clusters.

K-MEANS CLUSTERING
K-means clustering is one of the frequently used clustering algorithms. It is a non-hierarchical
clustering method in which the number of clusters (K) is decided a priori. The observations in
the sample are assigned to one of the clusters (say C1, C2, …, CK) based on the distance between
the observation and the centroid of the clusters.
The following steps are used in the K-means clustering algorithm:
1. Decide the value of K (which can be fine-tuned later).
2. Choose K observations from the data that are likely to be in different clusters. There are
many ways of choosing these initial K observations; the easiest approach is to choose
observations that are farthest apart (in one of the parameters of the data).
3. The K observations selected in step 2 are the centroids of those clusters.
4. For the remaining observations, find the centroid closest to each observation. Add the new
observation (say observation j) to the cluster with the closest centroid and adjust the centroid
after adding it. The closest centroid is chosen based upon an appropriate distance measure.
5. Repeat step 4 until all observations are assigned to a cluster.
For the data described in Section 7.2 (Income Data.csv), we can create three clusters, as we
know there are three segments from Figure 7.1. The sklearn library provides the KMeans
algorithm. Initialize KMeans with the number of clusters (k) as an argument and call the fit()
method with the DataFrame that contains the entities and their features to be clustered.

from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=3)
clusters.fit(customers_df)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
The attribute clusters.labels_ contains the labels that identify the cluster to which each
observation belongs. We can concatenate it with the customers' data and verify.
customers_df["clusterid"] = clusters.labels_

Plotting Customers with Their Segments


We will plot each customer segment with a different marking (see Figure 7.2), plotting the
age and income of customers from different segments with different markers. The markers
[+, ^, .] will depict the clusters differently.

markers = ['+', '^', '.']
sn.lmplot(x="age", y="income",
    data=customers_df,
    hue="clusterid",
    fit_reg=False,
    markers=markers,
    height=4)

The above clusters are segmented mostly on income. This is because salary is on a much larger
scale than age: age ranges from 0 to 60, while salary ranges from 0 to 50000. For example, the
difference in age between two customers aged 20 and 70 is significant, but the numerical
difference is only 50; the difference in income between two customers earning 10000 and
11000 is not significant, yet the numerical difference is 1000. So the distance will always be
dominated by the difference in salary rather than in age. Hence, before creating clusters, all
features need to be normalized and brought to the same scale. StandardScaler in
sklearn.preprocessing standardizes each feature by subtracting its mean and dividing by its
standard deviation.
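A minimal sketch of this scaling step followed by re-clustering, assuming the column names age and income used above:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Bring age and income to the same scale (zero mean, unit variance) so that
# income does not dominate the Euclidean distances.
scaler = StandardScaler()
scaled_customers = scaler.fit_transform(customers_df[["age", "income"]])

# Re-run K-means on the scaled features.
scaled_clusters = KMeans(n_clusters=3, n_init=10, random_state=42)
scaled_clusters.fit(scaled_customers)
customers_df["clusterid"] = scaled_clusters.labels_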

Hierarchical Clustering

Hierarchical clustering is an unsupervised machine learning method that creates a hierarchy


or tree-like structure (called a dendrogram) of clusters. It is especially useful for analyzing
data with nested or hierarchical relationships.
Hierarchical clustering is a clustering algorithm which uses the following steps to develop
clusters:
1. Start with each data point in a single cluster.
2. Find the data points with the shortest distance (using an appropriate distance measure)
and merge them to form a cluster.
3. Repeat step 2 until all data points are merged together to form a single cluster.
The above procedure is called agglomerative hierarchical clustering. AgglomerativeClustering
in sklearn.cluster provides an algorithm for hierarchical clustering and also takes the number
of clusters to be created as an argument.
Agglomerative hierarchical clustering can be represented and understood by using a
dendrogram, which we had created earlier.
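As a rough sketch, a dendrogram can be drawn with scipy's hierarchy utilities, assuming a scaled feature DataFrame named scaled_beer_df as used in the code below:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Ward linkage merges, at each step, the pair of clusters that least increases
# the total within-cluster variance.
linkage_matrix = linkage(scaled_beer_df, method="ward")

plt.figure(figsize=(8, 4))
dendrogram(linkage_matrix)
plt.xlabel("observation index")
plt.ylabel("merge distance")
plt.show()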

from sklearn.cluster import AgglomerativeClustering

We will create clusters using AgglomerativeClustering and store the new cluster labels in the
h_clusterid column.

h_clusters = AgglomerativeClustering(n_clusters=3)
h_clusters.fit(scaled_beer_df)
beer_df["h_clusterid"] = h_clusters.labels_

Compare the Clusters Created by K-Means and Hierarchical Clustering


We will print each cluster independently and interpret the characteristics of each cluster.
beer_df[beer_df.h_clusterid == 0]

beer_df[beer_df.h_clusterid == 1]

beer_df[beer_df.h_clusterid == 2]

Instance-Based Learning

Introduction

Instance-based learning is a type of lazy learning method where the algorithm stores the
training data and makes predictions directly based on it. Instead of creating a generalized
model during training, the algorithm defers computation until a query is made. It primarily
relies on comparing new instances with stored instances using a distance metric.

• Key Features:
o No explicit training phase (lazy learning).
o Predictions are made based on the similarity of the input to stored examples.
o Works well with smaller datasets and low-dimensional spaces.
k-Nearest Neighbor (k-NN) Learning

Overview

The k-NN algorithm is one of the simplest instance-based learning methods. It predicts the
output for a new data point based on the majority class or average of its nearest neighbors in
the feature space.

Key Steps in k-NN:

1. Store the Dataset:


o Retain all training data points.
2. Calculate Distance:
o Compute the distance between the query point and all training data points using a
distance metric (e.g., Euclidean, Manhattan, or Minkowski distance).
3. Identify k Nearest Neighbors:
o Select the k closest training points to the query point based on the computed
distances.
4. Predict the Output:
o For classification: Assign the majority class among the k neighbors.
o For regression: Compute the average of the outputs of the k neighbors.
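These steps map directly onto scikit-learn's KNeighborsClassifier; a minimal sketch on synthetic data (the dataset and parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# k = 5 neighbors with the default Minkowski metric (p=2, i.e., Euclidean).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "lazy" learning: fit mainly stores the data
print(knn.score(X_test, y_test))   # majority-vote accuracy on held-out points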

Key Parameters in k-NN

1. Value of k:
o Small k: Sensitive to noise, may overfit.
o Large k: Smoothens predictions but may underfit.
2. Distance Metric:
o The choice of metric affects performance.
o Euclidean distance works well for continuous features, while Manhattan distance is
robust for discrete features.

Advantages:

• Simple and easy to implement.


• Flexible to different types of data (numeric or categorical).
• No explicit training phase, so it adapts to data changes instantly.

Disadvantages:

• Computationally expensive for large datasets (O(n) for each prediction).


• Sensitive to irrelevant or redundant features.
• Performance depends on the choice of k and the distance metric.

Locally Weighted Regression (LWR)


Introduction

Locally Weighted Regression (LWR) is a form of regression that gives higher importance (or
weight) to data points near the query point. It models the data locally around the input point,
making it well-suited for non-linear relationships.

Key Features:

• A type of lazy learning where no global model is built.


• Predictions are made using a weighted combination of neighboring points.
• The weights decrease as the distance from the query point increases (e.g., Gaussian kernel).

Working:

1. Assign a weight to each training point based on its distance to the query point. A common
weighting function is the Gaussian kernel

   w_i = exp( −d(x, x_i)² / (2τ²) )

where d(x, x_i) is the distance and τ is the bandwidth parameter.
2. Perform a weighted linear regression using these weights.
3. Make predictions for the query point using the local model.
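A minimal from-scratch sketch of this procedure; the function name, the toy data and the bandwidth value are illustrative:

import numpy as np

def locally_weighted_prediction(x_query, X, y, tau=0.5):
    # Predict y at x_query by fitting a weighted linear model around it.
    # X: (n, d) training inputs, y: (n,) targets, tau: bandwidth.
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])           # add a bias column
    xq = np.hstack([1.0, np.atleast_1d(x_query)])  # query point with bias term
    # Gaussian weights: points near the query get weights close to 1.
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted least squares: theta = (Xb^T W Xb)^+ Xb^T W y
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return xq @ theta

# Example: noisy sine curve; the prediction follows the local trend near x = 3.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)
print(locally_weighted_prediction(np.array([3.0]), X, y, tau=0.3))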

Advantages:

• Handles non-linearity effectively.


• Flexible, as it adapts to local data patterns.

Disadvantages:

• Computationally expensive for large datasets.


• Sensitive to the choice of bandwidth (τ).

Radial Basis Function (RBF)

Introduction

Radial Basis Function (RBF) is a kernel-based method used in function approximation,


interpolation, and classification tasks. It works by creating a weighted sum of radial basis
functions, which are typically Gaussian functions centered at specific data points.

Key Features:

• Non-linear technique that maps inputs to a high-dimensional space.


• Often used in neural networks and support vector machines.
RBF Equation:

The output of an RBF model is given by:

   f(x) = Σ_i w_i ϕ( ||x − c_i|| )

where:

• ϕ: the radial basis function (e.g., the Gaussian ϕ(||x − c||) = e^(−γ ||x − c||²)).
• c_i: the center of the ith basis function.
• w_i: the weights.
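A minimal sketch of fitting such a model on toy data, using Gaussian basis functions with centres taken from a few training points and the weights fitted by least squares (all names and values are illustrative):

import numpy as np

def gaussian_rbf(X, centers, gamma=1.0):
    # Matrix of Gaussian activations exp(-gamma * ||x - c||^2) for every (x, c) pair.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Toy 1-D regression problem.
rng = np.random.default_rng(1)
X = np.linspace(0, 6, 30).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.05, 30)

centers = X[::5]                              # a few training points used as centres
Phi = gaussian_rbf(X, centers, gamma=2.0)     # design matrix of basis functions
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear weights of the RBF model

y_hat = Phi @ w                               # weighted sum of radial basis functions
print(np.round(y_hat[:5], 3))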

Applications:

• Image reconstruction.
• Time-series prediction.
• Pattern recognition.

Advantages:

• Flexible for non-linear problems.


• Supports local and global fitting.

Disadvantages:

• Choice of parameters (centers, bandwidth) impacts performance.


• Computational cost increases with large datasets.

Case-Based Reasoning (CBR)

Introduction

Case-Based Reasoning (CBR) is a problem-solving technique that uses previous experiences


(cases) to solve new problems. Instead of relying on generalized rules, it matches a new
problem with past cases and adapts their solutions.

Key Features:

• Relies on a database of past cases.


• Involves storing, retrieving, adapting, and revising solutions.

Steps in CBR:

1. Retrieve: Find similar cases in the case base.


2. Reuse: Adapt the solution of the most similar case to the current problem.
3. Revise: Test and refine the solution if necessary.
4. Retain: Store the new solution as a case for future use.
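A toy sketch of the retrieve/reuse/retain loop; the case base, the feature encoding and the solution labels are purely illustrative, and a real system would also adapt and revise the reused solution:

import numpy as np

# Toy case base: each case has a feature vector and a stored solution.
case_base = [
    {"features": np.array([1.0, 0.2]), "solution": "solution-A"},
    {"features": np.array([0.1, 0.9]), "solution": "solution-B"},
    {"features": np.array([0.8, 0.8]), "solution": "solution-C"},
]

def solve(query, case_base):
    # Retrieve: find the most similar stored case (Euclidean distance here).
    nearest = min(case_base, key=lambda c: np.linalg.norm(c["features"] - query))
    # Reuse: adopt (and, in a real system, adapt) the retrieved solution.
    solution = nearest["solution"]
    # Revise: a real system would test and refine the solution here.
    # Retain: store the solved query as a new case for future use.
    case_base.append({"features": query, "solution": solution})
    return solution

print(solve(np.array([0.9, 0.3]), case_base))   # reuses the closest case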

Advantages:

• Leverages existing knowledge and reduces problem-solving effort.


• Can handle complex problems by analogy.

Disadvantages:

• Performance depends on the quality of stored cases.


• Requires a large and well-maintained case base.

Applications:

• Medical diagnosis.
• Legal reasoning.
• Customer support systems.
MODULE 5

INSTANCE BASED LEARNING

INTRODUCTION

• Instance-based learning methods such as nearest neighbor and locally weighted


regression are conceptually straightforward approaches to approximating real-valued or
discrete-valued target functions.
• Learning in these algorithms consists of simply storing the presented training data.
When a new query instance is encountered, a set of similar related instances is retrieved
from memory and used to classify the new query instance
• Instance-based approaches can construct a different approximation to the target function
for each distinct query instance that must be classified

Advantages of Instance-based learning


1. Training is very fast
2. Learn complex target function
3. Don’t lose information

Disadvantages of Instance-based learning


• The cost of classifying new instances can be high. This is due to the fact that nearly all
computation takes place at classification time rather than when the training examples
are first encountered.
• Another drawback, especially of nearest-neighbor approaches, is that they typically
consider all attributes of the instances when attempting to retrieve similar training
examples from memory. If the target concept depends on only a few of the many
available attributes, then the instances that are truly most "similar" may well be a large
distance apart.
k- NEAREST NEIGHBOR LEARNING

• The most basic instance-based method is the K- Nearest Neighbor Learning. This
algorithm assumes all instances correspond to points in the n-dimensional space Rn.
• The nearest neighbors of an instance are defined in terms of the standard Euclidean
distance.
• Let an arbitrary instance x be described by the feature vector
   (a1(x), a2(x), …, an(x))
   where ar(x) denotes the value of the rth attribute of instance x.

• Then the distance between two instances xi and xj is defined to be

   d(xi, xj) = √( Σ_{r=1}^{n} ( ar(xi) − ar(xj) )² )

• In nearest-neighbor learning the target function may be either discrete-valued or real-


valued.

Let us first consider learning discrete-valued target functions of the form f : ℝⁿ → V,
where V is the finite set {v1, …, vs}.

The k-Nearest Neighbor algorithm for approximating a discrete-valued target function
estimates f(xq) as

   𝑓̂(xq) ← argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(xi))

where x1, …, xk are the k training instances nearest to xq and δ(a, b) = 1 if a = b, and 0 otherwise.
• The value 𝑓̂(xq) returned by this algorithm as its estimate of f(xq) is just the most
common value of f among the k training examples nearest to xq.
• If k = 1, then the 1-Nearest Neighbor algorithm assigns to 𝑓̂(xq) the value f(xi), where
xi is the training instance nearest to xq.
• For larger values of k, the algorithm assigns the most common value among the k nearest
training examples.

• Below figure illustrates the operation of the k-Nearest Neighbor algorithm for the case where the
instances are points in a two-dimensional space and where the target function is Boolean
valued.

• The positive and negative training examples are shown by “+” and “-” respectively. A
query point xq is shown as well.
• The 1-Nearest Neighbor algorithm classifies xq as a positive example in this figure,
whereas the 5-Nearest Neighbor algorithm classifies it as a negative example.

• Below figure shows the shape of this decision surface induced by 1- Nearest Neighbor over
the entire instance space. The decision surface is a combination of convex polyhedra
surrounding each of the training examples.

• For every training example, the polyhedron indicates the set of query points whose
classification will be completely determined by that training example. Query points
outside the polyhedron are closer to some other training example. This kind of diagram
is often called the Voronoi diagram of the set of training example
The k-Nearest Neighbor algorithm for approximating a real-valued target function simply
computes the mean value of the k nearest training examples:

   𝑓̂(xq) ← ( Σ_{i=1}^{k} f(xi) ) / k

Distance-Weighted Nearest Neighbor Algorithm

• One refinement to the k-Nearest Neighbor algorithm is to weight the contribution of each
of the k neighbors according to their distance to the query point xq, giving greater weight
to closer neighbors.
• For example, in the k-Nearest Neighbor algorithm, which approximates discrete-valued
target functions, we might weight the vote of each neighbor according to the inverse
square of its distance from xq.

Distance-Weighted Nearest Neighbor Algorithm for approximating a discrete-valued target
function:

   𝑓̂(xq) ← argmax_{v ∈ V} Σ_{i=1}^{k} wi δ(v, f(xi)),   where wi = 1 / d(xq, xi)²

Distance-Weighted Nearest Neighbor Algorithm for approximating a real-valued target
function:

   𝑓̂(xq) ← ( Σ_{i=1}^{k} wi f(xi) ) / ( Σ_{i=1}^{k} wi ),   where wi = 1 / d(xq, xi)²
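As a rough illustration of the discrete-valued case, here is a minimal from-scratch sketch using inverse-square distance weights; the data and the function name are illustrative:

import numpy as np
from collections import defaultdict

def distance_weighted_knn(x_query, X, y, k=5):
    # Classify x_query by a vote of its k nearest neighbours,
    # each vote weighted by 1 / d^2 (closer neighbours count more).
    d = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if d[i] == 0.0:                  # exact match: return its label directly
            return y[i]
        votes[y[i]] += 1.0 / d[i] ** 2   # inverse-square distance weight
    return max(votes, key=votes.get)

# Tiny illustration with two classes in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(distance_weighted_knn(np.array([4.5, 4.5]), X, y, k=3))   # predicts class 1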

Terminology

• Regression means approximating a real-valued target function.


• Residual is the error 𝑓̂(x) - f (x) in approximating the target function.
• Kernel function is the function of distance that is used to determine the weight of each
training example. In other words, the kernel function is the function K such that
wi = K(d(xi, xq))

LOCALLY WEIGHTED REGRESSION

• The phrase "locally weighted regression" is called local because the function is
approximated based only on data near the query point, weighted because the
contribution of each training example is weighted by its distance from the query point,
and regression because this is the term used widely in the statistical learning community for
the problem of approximating real-valued functions.

• Given a new query instance xq, the general approach in locally weighted regression is to
construct an approximation 𝑓̂ that fits the training examples in the neighborhood
surrounding xq. This approximation is then used to calculate the value 𝑓̂(xq), which is
output as the estimated target value for the query instance.
Locally Weighted Linear Regression

• Consider locally weighted regression in which the target function f is approximated near
xq using a linear function of the form

   𝑓̂(x) = w0 + w1 a1(x) + … + wn an(x)

where ai(x) denotes the value of the ith attribute of the instance x.

• Gradient descent can be used to choose weights that minimize the squared error summed
over the set D of training examples,

   E ≡ ½ Σ_{x ∈ D} ( f(x) − 𝑓̂(x) )²

which leads to the gradient descent training rule

   Δwj = η Σ_{x ∈ D} ( f(x) − 𝑓̂(x) ) aj(x)        ...(3)

where η is a constant learning rate.

• Need to modify this procedure to derive a local approximation rather than a global one.
The simple way is to redefine the error criterion E to emphasize fitting the local training
examples. Three possible criteria are given below.

1. Minimize the squared error over just the k nearest neighbors:

   E1(xq) ≡ ½ Σ_{x ∈ k nearest nbrs of xq} ( f(x) − 𝑓̂(x) )²

2. Minimize the squared error over the entire set D of training examples, while weighting the
error of each training example by some decreasing function K of its distance from xq:

   E2(xq) ≡ ½ Σ_{x ∈ D} ( f(x) − 𝑓̂(x) )² K( d(xq, x) )

3. Combine 1 and 2:

   E3(xq) ≡ ½ Σ_{x ∈ k nearest nbrs of xq} ( f(x) − 𝑓̂(x) )² K( d(xq, x) )

If we choose criterion three and re-derive the gradient descent rule, we obtain the following
training rule:

   Δwj = η Σ_{x ∈ k nearest nbrs of xq} K( d(xq, x) ) ( f(x) − 𝑓̂(x) ) aj(x)

The differences between this new rule and the rule given by Equation (3) are that the
contribution of instance x to the weight update is now multiplied by the distance penalty
K(d(xq, x)), and that the error is summed over only the k nearest training examples.

RADIAL BASIS FUNCTIONS

• One approach to function approximation that is closely related to distance-weighted


regression and also to artificial neural networks is learning with radial basis functions
• In this approach, the learned hypothesis is a function of the form

   𝑓̂(x) = w0 + Σ_{u=1}^{k} wu Ku( d(xu, x) )        ...(1)
• Where, each xu is an instance from X and where the kernel function Ku(d(xu, x)) is
defined so that it decreases as the distance d(xu, x) increases.
• Here k is a user provided constant that specifies the number of kernel functions to be
included.
• Although 𝑓̂ is a global approximation to f(x), the contribution from each of the Ku(d(xu, x))
terms is localized to a region nearby the point xu.

Choose each function Ku(d(xu, x)) to be a Gaussian function centred at the point xu with some
variance σu²:

   Ku( d(xu, x) ) = e^( −d²(xu, x) / (2σu²) )

• The functional form of equation (1) can approximate any function with arbitrarily small
error, provided a sufficiently large number k of such Gaussian kernels is used and the
width σ² of each kernel can be separately specified.
• The function given by equ(1) can be viewed as describing a two layer network where
the first layer of units computes the values of the various Ku(d(xu, x)) and where the
second layer computes a linear combination of these first-layer unit values
Example: Radial basis function (RBF) network

Given a set of training examples of the target function, RBF networks are typically trained in
a two-stage process.
1. First, the number k of hidden units is determined and each hidden unit u is defined by
choosing the values of xu and 𝜎u2 that define its kernel function Ku(d(xu, x))
2. Second, the weights wu are trained to maximize the fit of the network to the training
data, using the global error criterion

   E ≡ ½ Σ_{x ∈ D} ( f(x) − 𝑓̂(x) )²

Because the kernel functions are held fixed during this second stage, the linear weight
values wu can be trained very efficiently.

Several alternative methods have been proposed for choosing an appropriate number of hidden
units or, equivalently, kernel functions.
• One approach is to allocate a Gaussian kernel function for each training example
(xi,f (xi)), centring this Gaussian at the point xi.
Each of these kernels may be assigned the same width 𝜎2. Given this approach, the RBF
network learns a global approximation to the target function in which each training
example (xi, f (xi)) can influence the value of f only in the neighbourhood of xi.
• A second approach is to choose a set of kernel functions that is smaller than the number
of training examples. This approach can be much more efficient than the first approach,
especially when the number of training examples is large.

Summary
• Radial basis function networks provide a global approximation to the target function,
represented by a linear combination of many local kernel functions.
• The value for any given kernel function is non-negligible only when the input x falls
into the region defined by its particular centre and width. Thus, the network can be
viewed as a smooth linear combination of many local approximations to the target
function.
• One key advantage to RBF networks is that they can be trained much more efficiently
than feedforward networks trained with BACKPROPAGATION.
CASE-BASED REASONING

• Case-based reasoning (CBR) is a learning paradigm based on lazy learning methods and
they classify new query instances by analysing similar instances while ignoring
instances that are very different from the query.
• In CBR, instances are not represented as real-valued points; instead, they use a rich
symbolic representation.
• CBR has been applied to problems such as conceptual design of mechanical devices
based on a stored library of previous designs, reasoning about new legal cases based on
previous rulings, and solving planning and scheduling problems by reusing and
combining portions of previous solutions to similar problems

A prototypical example of a case-based reasoning

• The CADET system employs case-based reasoning to assist in the conceptual design of
simple mechanical devices such as water faucets.
• It uses a library containing approximately 75 previous designs and design fragments to
suggest conceptual designs to meet the specifications of new design problems.
• Each instance stored in memory (e.g., a water pipe) is represented by describing both its
structure and its qualitative function.
• New design problems are then presented by specifying the desired function and
requesting the corresponding structure.

The problem setting is illustrated in below figure


• The function is represented in terms of the qualitative relationships among the water-
flow levels and temperatures at its inputs and outputs.
• In the functional description, an arrow with a "+" label indicates that the variable at the
arrowhead increases with the variable at its tail. A "-" label indicates that the variable at
the head decreases with the variable at the tail.
• Here Qc refers to the flow of cold water into the faucet, Qh to the input flow of hot water,
and Qm to the single mixed flow out of the faucet.
• Tc, Th, and Tm refer to the temperatures of the cold water, hot water, and mixed water
respectively.
• The variable Ct denotes the control signal for temperature that is input to the faucet, and
Cf denotes the control signal for waterflow.
• The controls Ct and Cf are to influence the water flows Qc and Qh, thereby indirectly
influencing the faucet output flow Qm and temperature Tm.

• CADET searches its library for stored cases whose functional descriptions match the
design problem. If an exact match is found, indicating that some stored case implements
exactly the desired function, then this case can be returned as a suggested solution to the
design problem. If no exact match occurs, CADET may find cases that match various
subgraphs of the desired functional specification.
REINFORCEMENT LEARNING
Reinforcement learning addresses the question of how an autonomous agent that senses and
acts in its environment can learn to choose optimal actions to achieve its goals.

INTRODUCTION

• Consider building a learning robot. The robot, or agent, has a set of sensors to observe
the state of its environment, and a set of actions it can perform to alter this state.
• Its task is to learn a control strategy, or policy, for choosing actions that achieve its goals.
• The goals of the agent can be defined by a reward function that assigns a numerical
value to each distinct action the agent may take from each distinct state.
• This reward function may be built into the robot, or known only to an external teacher
who provides the reward value for each action performed by the robot.
• The task of the robot is to perform sequences of actions, observe their consequences,
and learn a control policy.
• The control policy is one that, from any initial state, chooses actions that maximize the
reward accumulated over time by the agent.

Example:
• A mobile robot may have sensors such as a camera and sonars, and actions such as
"move forward" and "turn."
• The robot may have a goal of docking onto its battery charger whenever its battery level
is low.
• The goal of docking to the battery charger can be captured by assigning a positive reward
(Eg., +100) to state-action transitions that immediately result in a connection to the
charger and a reward of zero to every other state-action transition.

Reinforcement Learning Problem


• An agent interacting with its environment. The agent exists in an environment described
by some set of possible states S.
• The agent can perform any of a set of possible actions A. Each time it performs an action
at in some state st, the agent receives a real-valued reward rt that indicates the immediate
value of this state-action transition. This produces a sequence of states si, actions ai, and
immediate rewards ri as shown in the figure.
• The agent's task is to learn a control policy, 𝝅: S → A, that maximizes the expected sum
of these rewards, with future rewards discounted exponentially by their delay.
Reinforcement learning problem characteristics

1. Delayed reward: The task of the agent is to learn a target function 𝜋 that maps from the
current state s to the optimal action a = 𝜋 (s). In reinforcement learning, training
information is not available in (s, 𝜋 (s)). Instead, the trainer provides only a sequence of
immediate reward values as the agent executes its sequence of actions. The agent,
therefore, faces the problem of temporal credit assignment: determining which of the
actions in its sequence are to be credited with producing the eventual rewards.

2. Exploration: In reinforcement learning, the agent influences the distribution of training


examples by the action sequence it chooses. This raises the question of which
experimentation strategy produces most effective learning. The learner faces a trade-off
in choosing whether to favor exploration of unknown states and actions, or exploitation
of states and actions that it has already learned will yield high reward.

3. Partially observable states: Although it is convenient to assume that the agent's sensors
can perceive the entire state of the environment at each time step, in many practical
situations sensors provide only partial information. In such cases, the agent needs to
consider its previous observations together with its current sensor data when choosing
actions, and the best policy may be one that chooses actions specifically to improve the
observability of the environment.
4. Life-long learning: Robot requires to learn several related tasks within the same
environment, using the same sensors. For example, a mobile robot may need to learn
how to dock on its battery charger, how to navigate through narrow corridors, and how
to pick up output from laser printers. This setting raises the possibility of using
previously obtained experience or knowledge to reduce sample complexity when
learning new tasks.

THE LEARNING TASK

• Consider Markov decision process (MDP) where the agent can perceive a set S of
distinct states of its environment and has a set A of actions that it can perform.
• At each discrete time step t, the agent senses the current state st, chooses a current action
at, and performs it.
• The environment responds by giving the agent a reward rt = r(st, at) and by producing
the succeeding state st+l = δ(st, at). Here the functions δ(st, at) and r(st, at) depend only on
the current state and action, and not on earlier states or actions.

The task of the agent is to learn a policy, 𝝅: S → A, for selecting its next action a, based on the
current observed state st; that is, 𝝅(st) = at.

How shall we specify precisely which policy π we would like the agent to learn?

1. One approach is to require the policy that produces the greatest possible cumulative reward
for the robot over time.
• To state this requirement more precisely, define the cumulative value Vπ(st) achieved
by following an arbitrary policy π from an arbitrary initial state st as follows:

   Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + … = Σ_{i=0}^{∞} γⁱ rt+i

• where the sequence of rewards rt+i is generated by beginning at state st and repeatedly
using the policy π to select actions.
• Here 0 ≤ γ ≤ 1 is a constant that determines the relative value of delayed versus
immediate rewards. If we set γ = 0, only the immediate reward is considered; as we set
γ closer to 1, future rewards are given greater emphasis relative to the immediate reward.
• The quantity Vπ(st) is called the discounted cumulative reward achieved by policy π
from initial state st. It is reasonable to discount future rewards relative to immediate
rewards because, in many cases, we prefer to obtain the reward sooner rather than later.
2. Another definition of total reward is the finite horizon reward, Σ_{i=0}^{h} rt+i, which
considers the undiscounted sum of rewards over a finite number h of steps.

3. A third approach is the average reward, lim_{h→∞} (1/h) Σ_{i=0}^{h} rt+i, which
considers the average reward per time step over the entire lifetime of the agent.

We require that the agent learn a policy π that maximizes Vπ(s) for all states s. Such a policy
is called an optimal policy and is denoted by π*.

We refer to the value function Vπ*(s) of an optimal policy as V*(s). V*(s) gives the maximum
discounted cumulative reward that the agent can obtain starting from state s.

Example:

A simple grid-world environment is depicted in the diagram

• The six grid squares in this diagram represent six possible states, or locations, for the
agent.
• Each arrow in the diagram represents a possible action the agent can take to move from
one state to another.
• The number associated with each arrow represents the immediate reward r(s, a) the
agent receives if it executes the corresponding state-action transition
• The immediate reward in this environment is defined to be zero for all state-action
transitions except for those leading into the state labelled G. The state G is the goal state,
and the agent can receive a reward only by entering this state.

Once the states, actions, and immediate rewards are defined, we choose a value for the discount
factor γ and determine the optimal policy π* and its value function V*(s).
Let us choose γ = 0.9. The diagram at the bottom of the figure shows one optimal policy for
this setting.

Values of V*(s) and Q(s, a) follow from r(s, a), and the discount factor γ = 0.9. An optimal
policy, corresponding to actions with maximal Q values, is also shown.

The discounted future reward from the bottom centre state is


0 + γ·100 + γ²·0 + γ³·0 + … = 0.9 × 100 = 90
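A minimal value-iteration sketch for a deterministic grid world like this one; the state numbering and the transition table below are assumptions made for illustration, and only the +100 reward for entering G and γ = 0.9 come from the example:

import numpy as np

# States 0..5: top row (0, 1, 2) and bottom row (3, 4, 5); state 2 plays the role of G.
# transitions[s] maps an action name to (next_state, immediate_reward).
transitions = {
    0: {"right": (1, 0), "down": (3, 0)},
    1: {"left": (0, 0), "right": (2, 100), "down": (4, 0)},   # entering G gives +100
    2: {},                                                    # absorbing goal state
    3: {"up": (0, 0), "right": (4, 0)},
    4: {"up": (1, 0), "left": (3, 0), "right": (5, 0)},
    5: {"up": (2, 100), "left": (4, 0)},
}
gamma = 0.9

# Value iteration for a deterministic MDP: V*(s) = max_a [ r(s, a) + gamma * V*(delta(s, a)) ]
V = np.zeros(6)
for _ in range(50):
    for s, actions in transitions.items():
        if actions:
            V[s] = max(r + gamma * V[s_next] for s_next, r in actions.values())

print(V)   # the bottom-centre state (4) gets 0 + 0.9 * 100 = 90, as computed above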
