UNIT-IV
CONTENTS
Dimensionality Reduction
Linear Discriminant Analysis
Principal Component Analysis
Factor Analysis
Independent Component Analysis
Locally Linear Embedding
Isomap
Least Squares Optimization
Evolutionary Learning
Genetic algorithms
Genetic Offspring
Genetic Operators
Using Genetic Algorithms
Reinforcement Learning
Overview
Getting Lost Example
DIMENSIONALITY REDUCTION
Dimensionality reduction refers to techniques for reducing the number of input variables in training data. It is commonly divided into two components: feature selection and feature extraction.
Disadvantages
To handle non-linear data, the data is first mapped and transformed onto a higher-dimensional space, and PCA is then used to reduce the dimensions. One downside of this approach is that it is computationally very expensive.
LINEAR DISCRIMINANT ANALYSIS (LDA)
PRINCIPAL COMPONENTS ANALYSIS (PCA)
PCA APPROACH
Goals
The main goal of PCA is to identify patterns in data. PCA aims to detect the correlation between variables, and it attempts to reduce the dimensionality of the data.
Transformation
The transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the next highest possible variance.
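As an illustration of this transformation, here is a minimal PCA sketch in Python (NumPy), assuming the data is stored with one sample per row; the function and variable names are illustrative only.

import numpy as np

def pca(X, n_components=2):
    # Centre the data so that each variable has zero mean.
    X_centred = X - X.mean(axis=0)
    # Covariance matrix of the variables.
    cov = np.cov(X_centred, rowvar=False)
    # Eigen-decomposition; eigh is used because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort eigenvectors by decreasing eigenvalue (largest variance first).
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # Project the centred data onto the leading components.
    return X_centred @ components

# Example: reduce 5-dimensional data to 2 principal components.
X = np.random.randn(100, 5)
Z = pca(X, n_components=2)
print(Z.shape)   # (100, 2)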
LIMITATIONS OF PCA
PCA is a linear method: it only captures the linear correlation between variables.
Applications
Spike-triggered covariance analysis in neuroscience.
Quantitative finance.
Image compression.
Facial recognition.
PCA RELATION WITH THE MULTI-LAYER PERCEPTRON
PCA is linear: it just rotates and translates the axes. This makes it clear that hidden nodes computing PCA are effectively a bit like a Perceptron, in that they can only perform linear tasks.
When the predictor variables are multicollinear, this can be overcome by using Principal Component Analysis (PCA), which produces a new set of uncorrelated variables that are then used for prediction with a Multi-Layer Perceptron (MLP) model.
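A hedged sketch of this idea using scikit-learn, with the iris dataset standing in for the multicollinear predictors; the pipeline and parameter choices below are illustrative, not the model from the slides.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA decorrelates the predictors; the MLP is then trained on the components.
model = make_pipeline(PCA(n_components=2),
                      MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))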
KERNEL PCA
PCA is a linear method. Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature space, where it becomes linearly separable. It is similar to the idea used in Support Vector Machines. There are various kernel functions, such as linear, polynomial, sigmoid, and Gaussian.
In the field of multivariate statistics, kernel principal component
analysis (KPCA) is an extension of principal component
analysis (PCA) using techniques of kernel methods. Using a kernel,
the originally linear operations of PCA are performed in a
reproducing kernel Hilbert space.
Comparing the projections of the points onto the leading eigenvector (the new coordinate), the KPCA projection follows a circle while the PCA projection is a straight line, so KPCA captures more of the variance than PCA on such data.
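A minimal sketch of this behaviour, assuming scikit-learn's KernelPCA with a Gaussian (RBF) kernel and the two-concentric-circles toy dataset; the gamma value is an arbitrary illustrative choice.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric circles: not linearly separable in the original space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA just rotates the axes, so the two circles stay mixed.
Z_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with a Gaussian (RBF) kernel separates the circles along the first component.
Z_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)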
THE KERNEL PCA ALGORITHM
FACTOR ANALYSIS
Factor analysis is a technique that is used to reduce a large
number of variables into fewer numbers of factors. This
technique extracts maximum common variance from all
variables and puts them into a common score.
Factor analysis is a statistical data reduction and
analysis technique that strives to explain correlations among
multiple outcomes as the result of one or more underlying
explanations, or factors. The technique involves data
reduction, as it attempts to represent a set of variables
by a smaller number.
The difference between factor analysis and principal component analysis is that factor analysis explicitly assumes the existence of latent factors underlying the observed data, whereas PCA instead seeks to identify variables that are composites of the observed variables.
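A short sketch of factor analysis as data reduction, assuming scikit-learn's FactorAnalysis and a synthetic dataset in which six observed variables are generated from two latent factors; the sizes and noise level are illustrative.

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: 6 observed variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))            # the underlying factors
loadings = rng.normal(size=(2, 6))            # how each factor loads on each variable
X = latent @ loadings + 0.1 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)                  # common factor scores for each observation
print(fa.components_.shape)                   # (2, 6) estimated factor loadings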
INDEPENDENT COMPONENTS ANALYSIS (ICA)
Independent Component Analysis (ICA) is a machine learning technique used to separate independent sources from a mixed signal or dataset. Unlike principal component analysis, which focuses on maximizing the variance of the data points, independent component analysis focuses on independence.
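A minimal sketch of source separation with ICA, assuming scikit-learn's FastICA and two artificially mixed signals; the mixing matrix is an illustrative choice.

import numpy as np
from sklearn.decomposition import FastICA

# Two independent sources (a sine wave and a square wave) mixed together.
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 1.0]])   # mixing matrix
X = S @ A.T                              # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # recovered independent sources (up to sign and scale)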
THE LOCALLY LINEAR EMBEDDING ALGORITHM
The LLE algorithm produces a very interesting result on the iris dataset: it separates the three groups into three points (Figure 6.12). This shows that the algorithm works very well on this type of data, but doesn't give us any hints as to what else it can do.
Figure 6.13 shows a common demonstration dataset for these algorithms. Known as the swissroll for obvious reasons, it is tricky to find a 2D representation of the 3D data because it is rolled up. The right of Figure 6.13 shows that LLE can successfully unroll it.
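A short sketch of this demonstration, assuming scikit-learn's LocallyLinearEmbedding and its built-in swissroll generator; k = 12 neighbours is used here simply as a typical value.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3D swissroll data.
X, colour = make_swiss_roll(n_samples=1500, random_state=0)

# LLE unrolls the manifold into 2 dimensions using k = 12 neighbours.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Z = lle.fit_transform(X)
print(Z.shape)   # (1500, 2)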
MULTI-DIMENSIONAL SCALING (MDS)
Like PCA, MDS tries to find a linear approximation to the full data
space that embeds the data into a lower dimensionality.
In the case of MDS the embedding tries to preserve the distances
between all pairs of points. It turns out that if the space is
Euclidean, then the two methods are identical.
We use the same notational setup as previously, starting with data points x1, x2, ..., xN ∈ R^M. We choose a new dimensionality L < M and compute the embedding so that the data points are z1, z2, ..., zN ∈ R^L. As usual, we need a cost function to minimise. There are lots of choices for MDS cost functions; the most common ones measure how well the pairwise distances are preserved.
This classical MDS algorithm works fine on flat manifolds (Euclidean data spaces).
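A minimal sketch of classical MDS (double-centring the squared-distance matrix and taking the leading eigenvectors), written in NumPy; the random data is purely illustrative.

import numpy as np

def classical_mds(X, L=2):
    # Pairwise squared Euclidean distances between the data points.
    sq_norms = (X ** 2).sum(axis=1)
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    # Double-centre the squared distance matrix: B = -1/2 * J D^2 J.
    N = X.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ D2 @ J
    # Eigen-decompose B and keep the L largest eigenvalues/eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:L]
    # Embedding coordinates z_n in R^L.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

Z = classical_mds(np.random.randn(50, 5), L=2)
print(Z.shape)   # (50, 2)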
EVOLUTIONARY LEARNING
Each adult in the mating pair passes one of their two chromosomes to their offspring.
THE GENETIC ALGORITHM (GA)
STRING REPRESENTATION
POPULATION
For the current generation we need to select those strings that will
be used to generate new offspring. The idea here is that average
fitness will improve if we select strings that are already relatively
fit compared to the other members of the population (following
natural selection), which is exploitation of our current population.
Crossover
Crossover is the operator that performs global exploration, since
the strings that are produced are radically different to both parents
in at least some places. The hope is that sometimes we will take
good parts of both solutions and put them together to make an
even better solution. There are several different forms of the crossover operator, such as single-point and uniform crossover, as sketched below.
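Two common forms, single-point and uniform crossover, sketched in Python under the assumption that the parent strings are Python strings or lists; this is an illustration, not the only possible implementation.

import random

def single_point_crossover(parent1, parent2):
    # Choose a crossover point and swap the tails of the two strings.
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def uniform_crossover(parent1, parent2):
    # Each position is taken from one parent or the other with equal probability.
    child1, child2 = [], []
    for a, b in zip(parent1, parent2):
        if random.random() < 0.5:
            child1.append(a); child2.append(b)
        else:
            child1.append(b); child2.append(a)
    return child1, child2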
Map Colouring
Graph colouring is a typical discrete optimisation problem. We want to
colour a graph using only k colours, and choose them in such a way that
adjacent regions have different colours. It has been mathematically proven
that any two-dimensional planar graph can be coloured with four colours,
which was the first ever proof that used a computer program to check the
cases.
Encode possible solutions as strings: for this problem, we'll choose our alphabet to consist of the three possible shades (black (b), dark (d), and light (l)).
Choose a suitable fitness function: the thing that we want to minimise (a cost function) is the number of times that two adjacent regions have the same colour.
Choose suitable genetic operators: we'll use the standard genetic operators for this, since this example makes the operations of crossover and mutation clear, as in the sketch below.
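A minimal GA sketch for the colouring problem. The adjacency list, population size, and mutation rate below are hypothetical illustrative choices and do not correspond to the map in the figure.

import random

# Hypothetical adjacency list for the regions of the map (illustrative only).
ADJACENT = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
COLOURS = "bdl"          # black, dark, light
N_REGIONS = 5

def fitness(string):
    # Cost = number of adjacent region pairs that share a colour (to be minimised).
    return sum(1 for i, j in ADJACENT if string[i] == string[j])

def mutate(string, rate=0.1):
    # Each position is re-coloured at random with a small probability.
    return "".join(random.choice(COLOURS) if random.random() < rate else c
                   for c in string)

def crossover(p1, p2):
    point = random.randint(1, N_REGIONS - 1)
    return p1[:point] + p2[point:]

# Basic generational GA with truncation selection (keep the fitter half as parents).
population = ["".join(random.choice(COLOURS) for _ in range(N_REGIONS))
              for _ in range(50)]
for generation in range(100):
    population.sort(key=fitness)
    if fitness(population[0]) == 0:
        break                      # a valid colouring has been found
    parents = population[:25]
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(50)]

population.sort(key=fitness)
print(population[0], fitness(population[0]))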
PUNCTUATED EQUILIBRIUM
The argument runs that if humans evolved from apes, then there should be evidence of a whole set of intermediary species that existed during the transition phase, and there isn't. Interestingly, GAs demonstrate one of the explanations of why this argument is not correct, which is that the way that evolution actually seems to work is known as punctuated equilibrium.
EXAMPLES
The Knapsack Problem
The knapsack problem states: given a set of items, each with a mass and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.
The Genetic Algorithm provides a heuristic way to find good approximate solutions to the knapsack problem efficiently. An attribute reduction technique that incorporates Rough Set Theory can find the important genes, reducing the search space while ensuring that the effective information is not lost.
Genetic algorithms thus prove to be a strong approach for obtaining solutions to problems traditionally thought of as computationally infeasible, such as the knapsack problem.
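A hedged GA sketch for the knapsack problem, using a penalised fitness (infeasible solutions score zero); the items, weight limit, and GA parameters are invented for illustration.

import random

# Illustrative items: (mass, value) pairs and a weight limit (not from the slides).
ITEMS = [(3, 60), (4, 100), (5, 120), (2, 30), (6, 90)]
LIMIT = 10
N = len(ITEMS)

def fitness(bits):
    mass = sum(m for b, (m, v) in zip(bits, ITEMS) if b)
    value = sum(v for b, (m, v) in zip(bits, ITEMS) if b)
    # Solutions over the weight limit get zero fitness.
    return value if mass <= LIMIT else 0

def mutate(bits, rate=0.1):
    return [1 - b if random.random() < rate else b for b in bits]

def crossover(p1, p2):
    point = random.randint(1, N - 1)
    return p1[:point] + p2[point:]

population = [[random.randint(0, 1) for _ in range(N)] for _ in range(30)]
for generation in range(60):
    population.sort(key=fitness, reverse=True)
    parents = population[:15]
    population = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                            for _ in range(15)]

population.sort(key=fitness, reverse=True)
print(population[0], fitness(population[0]))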
EXAMPLE 2: THE FOUR PEAKS PROBLEM
The four peaks problem is a toy problem that is quite often used to test out GAs and various developments of them. It is an invented fitness function that rewards strings with lots of consecutive 0s at the start of the string, and lots of consecutive 1s at the end. The fitness consists of counting the number of 0s at the start and the number of 1s at the end, and returning the maximum of them as the fitness.
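A sketch of this fitness function. The maximum of the two counts follows the description above; the bonus term added when both counts exceed a threshold T is the usual extension that creates the four peaks in the fitness landscape, and is included here as an assumption.

def four_peaks_fitness(string, T=3, bonus=100):
    # Number of consecutive 0s at the start of the string.
    leading_zeros = len(string) - len(string.lstrip("0"))
    # Number of consecutive 1s at the end of the string.
    trailing_ones = len(string) - len(string.rstrip("1"))
    fitness = max(leading_zeros, trailing_ones)
    # Bonus when both runs exceed the threshold T (assumed extension, see lead-in).
    if leading_zeros > T and trailing_ones > T:
        fitness += bonus
    return fitness

print(four_peaks_fitness("0000011111"))   # 5 leading 0s and 5 trailing 1s -> 5 + bonus = 105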
LIMITATIONS OF GA
TRAINING NEURAL NETWORKS WITH GENETIC ALGORITHMS
We trained our neural networks, most notably the MLP, using gradient descent. Alternatively, we could encode the problem of finding the correct weights as a set of strings, with the fitness function measuring the sum-of-squares error. This has been done, with good reported results. However, there are some problems with this approach.
Problems:
The first is that we turn all of the local information from the targets about the error at each output node of the network into just one number, the fitness, which throws away useful information; the second is that we ignore the gradient information, which also throws away useful information.
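A hedged sketch of the idea: the weights of a tiny 2-4-1 MLP are encoded as a flat vector (the "string"), the fitness is the negated sum-of-squares error on XOR, and a simple keep-the-fitter-half, mutation-only scheme stands in for a full GA. All sizes and rates are illustrative assumptions.

import numpy as np

# XOR data and a 2-4-1 network whose weights are encoded as one flat vector.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
N_WEIGHTS = 2 * 4 + 4 + 4 * 1 + 1   # weights and biases of the 2-4-1 MLP

def unpack(w):
    W1 = w[:8].reshape(2, 4); b1 = w[8:12]
    W2 = w[12:16].reshape(4, 1); b2 = w[16:]
    return W1, b1, W2, b2

def fitness(w):
    W1, b1, W2, b2 = unpack(w)
    hidden = np.tanh(X @ W1 + b1)
    output = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
    # Fitness is the negated sum-of-squares error: one number per candidate.
    return -np.sum((output - T) ** 2)

rng = np.random.default_rng(0)
population = rng.normal(size=(40, N_WEIGHTS))
for generation in range(300):
    scores = np.array([fitness(w) for w in population])
    parents = population[np.argsort(scores)[-20:]]              # keep the fitter half
    children = parents + 0.3 * rng.normal(size=parents.shape)   # mutation only
    population = np.vstack([parents, children])

best = max(population, key=fitness)
print("final sum-of-squares error:", -fitness(best))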
SOLUTION
REINFORCEMENT LEARNING
Reinforcement learning fills the gap between supervised
learning, where the algorithm is trained on the correct answers
given in the target data, and unsupervised learning, where the
algorithm can only exploit similarities in the data to cluster it.
Reinforcement learning is usually described in terms of the
interaction between some agent and its environment. The agent is
the thing that is learning, and the environment is where it is
learning, and what it is learning about. The environment has
another task, which is to provide information about how good a
strategy is, through some reward function.
The importance of reinforcement learning for psychological
learning theory comes from the concept of trial-and-error learning,
which has been around for a long time, and is also known as the
Law of Effect.
A robot perceives the current state of its environment through its sensors, and performs actions by moving its motors. The reinforcement learner (agent) within the robot tries to predict the next state and reward.
Reinforcement learning maps states or situations to actions in order
to maximise some numerical reward. That is, the algorithm knows
about the current input (the state), and the possible things it can do
(the actions), and its aim is to maximise the reward. There is a clear
distinction drawn between the agent that is doing the learning and
the environment, which is where the agent acts, and which produces
the state and the rewards.
The possible ways that the robot can drive its motors are the actions, which move the robot in the environment, and the reward could be how well it does its task without crashing into things.
The reinforcement learning cycle: the learning agent performs action a_t in state s_t and receives reward r_{t+1} from the environment, ending up in state s_{t+1}.
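A minimal sketch of this cycle in Python, with a hypothetical toy environment (a five-state chain) and a random policy standing in for the learner; the reset/step interface is an assumption, loosely in the style of common RL toolkits.

import random

class ChainEnvironment:
    # Hypothetical toy environment: move left/right along 5 states; reward 1 at the right end.
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):          # action: +1 (right) or -1 (left)
        self.state = min(max(self.state + action, 0), 4)
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def random_policy(state):
    return random.choice([-1, +1])

def run_episode(env, policy, max_steps=100):
    state = env.reset()                              # initial state s_0
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                       # agent performs a_t in state s_t
        state, reward, done = env.step(action)       # environment returns r_{t+1} and s_{t+1}
        total += reward
        if done:
            break
    return total

print(run_episode(ChainEnvironment(), random_policy))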
EXAMPLE: GETTING LOST
You arrive in a foreign city exhausted after many hours of flying, catch the train into town, and stagger into a backpacker's hostel without noticing much of your surroundings. When you wake up it is dark and you are starving. You set out to find somewhere to eat, and you remember that you only walked through the old part of the city, so you don't need to worry about any street that takes you out of the old part. So at the next bus stop you come to, you have a proper look at the map, and note down the map of the old town squares, which turns out to look like the figure.
You decide that the backpacker's is almost definitely in the square labelled F on the map, because its name seems vaguely familiar. You decide to work out a reward structure so that you can follow a reinforcement learning algorithm to get to the backpacker's. The first thing you work out is that staying still means that you are sleeping on your feet, which is bad. So you assign a reward of −5 for that (while negative reinforcement can be viewed as punishment, it doesn't necessarily correspond to it exactly, but you might want to imagine it as pinching yourself so that you stay awake).
The state diagram if you are correct and the backpacker's is in square (state) F. The connections from each state back into itself (meaning that you don't move) are not shown, to avoid the figure getting too complicated. They are each worth −5 (except for staying in state F, which means that you are in the backpacker's).
THE FOLLOWING CONCEPTS ARE DISCUSSED IN REINFORCEMENT LEARNING
POLICY
MARKOV DECISION PROCESSES
The Markov Property
A simple example of a Markov decision process is deciding on the state of your mind tomorrow given your state of mind today.
A reinforcement learning problem that follows the Markov property is known as a Markov Decision Process (MDP). It means that we can compute the likely next reward, and what the next state will be, from only the current state and action, based on previous experience.
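A small illustration of the Markov property: the next state is sampled using only the current state. The mood states and transition probabilities below are invented for illustration.

import random

# Hypothetical transition probabilities for the "state of mind tomorrow given today" example.
TRANSITIONS = {
    "happy":   {"happy": 0.7, "neutral": 0.2, "sad": 0.1},
    "neutral": {"happy": 0.3, "neutral": 0.4, "sad": 0.3},
    "sad":     {"happy": 0.2, "neutral": 0.3, "sad": 0.5},
}

def next_state(current):
    # The next state depends only on the current state: the Markov property.
    states = list(TRANSITIONS[current])
    probs = [TRANSITIONS[current][s] for s in states]
    return random.choices(states, weights=probs, k=1)[0]

state = "neutral"
for day in range(5):
    state = next_state(state)
    print(day + 1, state)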
PROBABILITIES IN MARKOV DECISION PROCESSES
There are three actions that can be taken in state E (shown by the
black circles), with associated probabilities and expected rewards.
Learning and using this transition diagram can be seen as the aim
of any reinforcement learner.
VALUES
The reinforcement learner is trying to decide on what action to take
in order to maximize the expected reward into the future. This
expected reward is known as the value. There are two ways that we
can compute a value.
We can consider the current state, and average across all of the
actions that can be taken, leaving the policy to sort this out for itself
(the state-value function, V (s)), or we can consider the current
state and each possible action that can be taken separately, the
action-value function, Q(s, a). In either case we are thinking about
what the expected reward would be if we started in state s (where
E(·) is the statistical expectation):
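In standard notation, assuming a discount factor γ with 0 ≤ γ < 1, these can be written as:

\begin{aligned}
V(s)   &= E\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s\right] \\
Q(s,a) &= E\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a\right]
\end{aligned}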
BACK ON HOLIDAY: USING REINFORCEMENT LEARNING
The connections between the squares are shown in the figure, and can be written out as a matrix, where 1 means that there is a link and 0 means that there is not.
THE DIFFERENCE BETWEEN SARSA AND Q-LEARNING
The most important difference between the two is how Q is updated after each action. SARSA uses the Q value of the next state-action pair exactly as given by its ε-greedy policy, since the next action A' is drawn from that policy. In contrast, Q-learning uses the maximum Q value over all possible actions for the next step.
Both algorithms start out with no information about the environment, and will therefore explore randomly, using the ε-greedy policy. However, over time, the strategies that the two algorithms produce are quite different.
The main reason for the difference is that Q-learning always attempts to follow the optimal path, which is the shortest one. This takes it close to the cliff, and the ε-greedy part means that inevitably it will sometimes fall over. By way of contrast, the SARSA algorithm will converge to a much safer route that keeps it well away from the cliff, even though it takes longer.
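A side-by-side sketch of the two update rules, assuming a Q table stored as a dictionary; the learning rate alpha, discount gamma, and the state/action names in the usage example are illustrative.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise pick the greedy action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # SARSA is on-policy: it uses the Q value of the action actually chosen next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q-learning is off-policy: it uses the maximum Q value over all next actions.
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage on hypothetical states "A", "B" and actions "left", "right".
Q = defaultdict(float)
actions = ["left", "right"]
a = epsilon_greedy(Q, "A", actions)
q_learning_update(Q, "A", a, 0.0, "B", actions)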
USES OF REINFORCEMENT LEARNING
Reinforcement learning has been used successfully for many problems, and the
results of computer modeling of reinforcement learning have been of great
interest to psychologists, as well as computer scientists, because of the close
links to biological learning.
Reinforcement learning has been used in other robotic applications, including
robots learning to follow each other, travel towards bright lights, and even
navigate.
In general, reinforcement learning is fairly slow, because it has to build up all of
the information through exploration and exploitation in order to find the better
solutions.
It is also very dependent upon a carefully chosen reward function: get that
wrong and the algorithm will do something completely unexpected.
A famous example of reinforcement learning was TD-Gammon, which was
produced by Gerald Tesauro. His idea was that reinforcement learning should be
very good at learning to play games, because games were clearly episodic—you
played until somebody won—and there was a clear reward structure, with a
positive reward for winning.
ASSIGNMENT QUESTIONS
10. What is The Curse of Dimensionality?
11. How is dimensionality reduction performed using latent variables?
12. Write algorithm for Principal Component Analysis.
13. Explain about Probabilistic PCA.
14. Differentiate between Probabilistic PCA and Independent Components Analysis.
15. What is Factor analysis?
16. (i) Describe in detail about Linear Discriminants. (ii) Discuss: Generalizing the Linear Model and Geometry of the Linear Discriminant.
17. Point out why dimensionality reduction is useful.
18. Define Factor Analysis or latent variables.
19. Distinguish between within-class scatter and between-class scatter.
20. Define PCA.
21. Describe what Isomap is.
22. Discover the Locally Linear Embedding algorithm with k = 12.
23. Explain the three different ways to do dimensionality reduction.
24. Explain what Least Squares Optimization is.
25. Differentiate between action space and state space.
26. What is Punctuated Equilibrium?
27. How does a reinforcement learner relate its experience to the corresponding action?
28. Express the basic tasks that need to be performed for GA.
29. Identify how reinforcement learning maps states to actions.
30. Examine Genetic Programming.
31. Differentiate Sarsa and Q-learning.
32. Explain Least Squares Optimization.
33. (i) Describe in detail about Generating Offspring: Genetic Operators. (ii) Discuss the Basic Genetic Algorithm.