3.1.1 Stochastic Process
Definition 1. Stochastic Process: A stochastic process is a collection of random variables, often used to represent the evolution of some random value over time. There is indeterminacy in a stochastic process: even if we know the initial conditions, the system can evolve in many different ways.
Figure 3.1: An example Hidden Markov Model with three urns
Source: Prof. Pushpak Bhattacharyya's lecture slides on HMM from the course CS 344 - Artificial Intelligence at IIT Bombay, spring 2013
3.1.6 Formulating the Part-of-Speech tagging problem using HMM
The POS tagging problem can be described as follows. We are given a sentence, which is a sequence of words. Each word has a POS tag which is unknown. The task is to find the POS tag of each word and return the POS tag sequence corresponding to the sentence. Here the POS tags constitute the hidden states. As in the urn problem, we assume that words (balls) are emitted by POS tags (urns), a property called the lexical assumption: the probability of seeing a particular word depends only on the POS tag that emits it. Also, as in the urn problem, the probability of a word having a particular POS tag depends only on the POS tag of the previous word (the urn-to-urn transition probability). Having modelled the problem this way, we need to explain how the transition tables are constructed. The transition probabilities come from data: this is a data-driven approach to POS tagging, and the tables are estimated from sentences which are already POS tagged. Given this formulation, we next present an algorithm which, given an input sentence and the transition tables, outputs the most probable POS tag sequence.
• The states are the POS tags. The state transition probabilities are pre-computed
using a POS-tagged corpus.
Next, we observe that due to the Markov assumption, once we have traversed a part of the sentence, the transition probabilities do not depend on the entire sentence seen so far; they depend only on the previous POS tag. This crucial observation gives rise to the Viterbi algorithm:
Suppose we are given an HMM with S possible POS tags (states), initial probabilities π_i of being in state s_i, transition probabilities P(s_j | s_i) of going from state s_i to state s_j, and emission probabilities P(x_t | s_i) of emitting x_t from state s_i. If the input sentence is x_1, x_2, . . . , x_T, then the most probable state sequence y_1, y_2, . . . , y_T that produces the sentence is given by the recurrence relations

$$V_{1,k} = P(x_1 \mid s_k)\,\pi_k$$
$$V_{t,k} = P(x_t \mid s_k)\,\max_{x}\left(P(s_k \mid s_x)\,V_{t-1,x}\right)$$

where V_{t,k} is the probability of the most probable state sequence which emits the first t words and has s_k as its final state. The Viterbi path (most likely state sequence) can be recovered by storing back pointers which record the state s_x chosen in the second equation. The complexity of the algorithm is O(|T| · |S|²), where T is the input sequence of words and S is the set of POS tags.
3.1.8 Pseudocode
Pseudocode for the Viterbi algorithm is given below:
# Given
# Set of states: Array S
# Start state: s0
# End state: se
# Symbol sequence: Array w
# State transition probabilities: Matrix a
# Symbol emission probabilities: Matrix b
# Viterbi probabilities: Matrix alpha
# Back pointers: Matrix back
# Returns
# Probability of the best path: p
# Initialisation F1
foreach s in S do
    alpha[1][s] := a[s0][s]*b[s][w[1]]
done
# Induction F2
for i := 1 to length(w)-1 do
    foreach s in S do
        # choose the best predecessor state s'
        alpha[i+1][s] := max over s' in S of alpha[i][s']*a[s'][s]
        back[i+1][s] := argmax over s' in S of alpha[i][s']*a[s'][s]
        alpha[i+1][s] *= b[s][w[i+1]]
    done
done
# Termination F3
p := max over s in S of alpha[length(w)][s]*a[s][se]
# The Viterbi path is read off by following the back pointers from the best final state
return p
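As a concrete companion to the pseudocode, the following is a minimal Python sketch of the same algorithm. The dictionary-based probability tables and the explicit start-probability argument are illustrative assumptions, not the implementation used in this thesis.

import operator  # not required; the sketch uses only built-ins

def viterbi(states, start_p, trans_p, emit_p, words):
    # V[t][s]: probability of the best path emitting words[:t+1] and ending in s
    V = [{s: start_p[s] * emit_p[s][words[0]] for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor state for s at position t
            prev = max(states, key=lambda sp: V[t - 1][sp] * trans_p[sp][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s][words[t]]
            back[t][s] = prev
    # best final state, then follow the back pointers to recover the tag sequence
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

Called with a tag set, start/transition/emission tables, and a word sequence, it returns the most probable tag sequence together with its probability.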
Definition 3. Precision: Overall precision is the fraction of tagged words which were tagged correctly. The precision for a particular tag is defined as the number of times the tag was correctly given as output divided by the total number of times it was given as output.
Definition 4. Recall: Overall recall is defined as the fraction of words in the corpus which were tagged correctly. If every word in the corpus is assigned a tag, then recall equals precision. Recall for a specific tag is defined as the number of times the tag was correctly given as output divided by the number of times it should have been given as output.
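A short sketch of how these per-tag metrics might be computed, assuming parallel lists of predicted and gold-standard tags (the function name and data layout are illustrative):

from collections import Counter

def tag_precision_recall(predicted, gold, tag):
    # correct: positions where the tag was output and matches the gold tag
    correct = sum(1 for p, g in zip(predicted, gold) if p == tag and p == g)
    output = Counter(predicted)[tag]    # times the tag was given as output
    expected = Counter(gold)[tag]       # times the tag should have been output
    precision = correct / output if output else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall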
In the next section, we present Conditional Random Fields (CRFs), which, like HMMs, are used for sequence labeling tasks in NLP. However, they take a different approach to the task.
3.2.1 Random Fields
A random field is a generalization of a stochastic process such that the underlying
parameter of the process, namely time, need no longer be a simple real value but
can take on values that are multi-dimensional vectors.
A topological space is one that allows for the definition of concepts such as continuity, connectedness, and neighbourhood. For our purposes, a rigorous mathematical definition of a topological space is not essential and hence will be skipped.
There are many types of random fields; among the most common are Markov random fields, Gibbs random fields, conditional random fields, and Gaussian random fields. A more detailed description of the types of random fields is outside the scope of this thesis, but we will need the description of a Markov random field to understand the definition of a conditional random field, and hence it is discussed briefly here.
A Markov random field is derived from the definition of a random field in a way similar to the derivation of a Markov process from a stochastic process. A Markov random field exhibits the Markov property, that is, the probability that a random variable assumes a certain value depends on the other random variables only through its immediate neighbours. Mathematically,

$$P(X_i = x_i \mid X_j = x_j,\; j \neq i) = P(X_i = x_i \mid X_j = x_j,\; j \in N(i))$$

where N(i) denotes the set of neighbours of node i in the underlying graph G. The Hammersley-Clifford theorem states that the joint distribution of a Markov random field can be factorized into a product of positive functions defined on cliques that cover all the nodes and edges of G. That is,
$$P(X) = \frac{1}{Z} \prod_{C \in \mathcal{C}_G} \phi_C(X_C)$$

where $\mathcal{C}_G$ is the set of all maximal cliques in G and Z is the normalizing factor $Z = \sum_x \prod_{C \in \mathcal{C}_G} \phi_C(x_C)$.
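For instance, for a simple chain X_1 - X_2 - X_3 (a toy example chosen here for illustration), the maximal cliques are the two edges, and the factorization reads

$$P(X_1, X_2, X_3) = \frac{1}{Z}\,\phi_{12}(X_1, X_2)\,\phi_{23}(X_2, X_3), \qquad Z = \sum_{x_1, x_2, x_3} \phi_{12}(x_1, x_2)\,\phi_{23}(x_2, x_3)$$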
This theorem is used in deriving the functional form of the conditional proba-
bility distribution for conditional random fields. In the next section, we formally
define conditional random fields and then we go on to study their application to
sequence labeling tasks.
A CRF is very similar to a Markov random field; the difference is that the Markov condition now holds on the conditional probability distribution P(Y|X). Although G could be any graph, we will be concerned only with sequences; hence we restrict G to be a simple chain and X and Y to be sequences. By the Hammersley-Clifford theorem of random fields (Hammersley and Clifford, 1971), the joint distribution of the label sequence Y given X has the form
$$P_\theta(y \mid x) \propto \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \right)$$
where y|S is the set of components of y associated with the vertices in the subgraph
S. The features f and g are known and given. The next step would be to estimate
the parameters θ using the training data. This is done using an iterative scaling
algorithm similar to the Improved Iterative Scaling algorithm.
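As an illustration (these particular features are hypothetical, not taken from the thesis), for POS tagging an edge feature f_k and a vertex feature g_k might be indicator functions such as

$$f_k(e, y|_e, x) = \mathbf{1}\{y_{t-1} = \text{DET},\ y_t = \text{NOUN}\}, \qquad g_k(v, y|_v, x) = \mathbf{1}\{y_t = \text{NOUN},\ x_t = \text{dog}\}$$

so that the weights λ_k and µ_k score the corresponding tag-transition and word-tag co-occurrences.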
It can easily be seen that CRFs encompass the class of HMMs. The class of CRFs is much more expressive, as they allow arbitrary dependencies on the observation sequence. Most sequential classifiers are trained to make the best local decision and hence cannot trade off decisions at different positions against each other; they are myopic about the impact of their current decisions on later decisions.
Consider a log-linear model of the conditional distribution,

$$P_\Lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_{y} \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$$

where Λ = {λ_1, . . . , λ_n} are the parameters and the f_i are the features. The likelihood of the training data is its probability under the assumption that the model is the correct underlying distribution, and hence is a function of the parameters of the model. The likelihood of the training data is expressed as follows (N is the number of training instances):
$$M(\Lambda) = \prod_{i=1}^{N} P(x_i, y_i) = \prod_{i=1}^{N} P_\Lambda(y_i \mid x_i)\, P(x_i)$$
Now, we note that log(x) is a monotonically increasing function for x > 0; therefore the value of x which maximizes f(x) is the same as the one that maximizes log(f(x)). Henceforth we work with the logarithm of the likelihood expression, as it is mathematically more convenient. Define the empirical distribution of the training data,

$$\tilde{p}(x, y) = \frac{c(x, y)}{\sum_{x,y} c(x, y)}$$

where c(x, y) is the number of times the instance (x, y) occurs in the training data.
The log-likelihood expression becomes the following:
$$L_{\tilde{p}}(\Lambda) = \sum_{x,y} \log P_\Lambda(y \mid x)^{c(x,y)} = \sum_{x,y} \tilde{p}(x, y)\, \log P_\Lambda(y \mid x)$$

where we ignore the constant factor $\sum_{x,y} c(x, y)$ (= N), as it does not affect the maximization.
3.3.3 The objective to optimize
Hence we arrive at the objective to be maximized. The maximum likelihood problem is to find Λ* ≡ argmax_Λ L_p̃(Λ) where
$$L_{\tilde p}(\Lambda) = \sum_{x,y} \tilde p(x,y) \log P_\Lambda(y \mid x)$$
$$= \sum_{x,y} \tilde p(x,y) \sum_i \lambda_i f_i(x,y) - \sum_{x,y} \tilde p(x,y) \log \sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x,y)\right)$$
$$= \sum_{x,y} \tilde p(x,y) \sum_i \lambda_i f_i(x,y) - \sum_{x} \tilde p(x) \log \sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x,y)\right)$$
From the expression for the log-likelihood we note that L_p̃(Λ) ≤ 0 always; the closer it is to zero, the better the fit. To maximize the log-likelihood we could take the partial derivative of the expression with respect to each parameter λ_i, which gives

$$\frac{\partial L_{\tilde p}(\Lambda)}{\partial \lambda_i} = \sum_{x,y} \tilde p(x,y)\, f_i(x,y) - \sum_{x} \tilde p(x) \sum_{y} f_i(x,y)\, \frac{\exp\left(\sum_i \lambda_i f_i(x,y)\right)}{\sum_{y} \exp\left(\sum_i \lambda_i f_i(x,y)\right)}$$
$$= \sum_{x,y} \tilde p(x,y)\, f_i(x,y) - \sum_{x,y} \tilde p(x)\, f_i(x,y)\, P_\Lambda(y \mid x)$$

Setting these derivatives to zero yields equations that contain all the parameters λ_i and are hence difficult to solve. Since the direct approach via differentiation yields too complex a problem, we take an iterative approach similar to the gradient descent algorithm, which is described next.
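In modern practice this gradient (the empirical feature expectation minus the model feature expectation) is often used directly. The following is a minimal NumPy sketch of one gradient-ascent step; the data layout (dictionaries of per-label feature vectors and raw empirical counts) is an assumption made for illustration.

import numpy as np

def gradient_step(lam, feats, x_counts, xy_counts, lr=0.1):
    # feats[x][y]: feature vector f(x, y); lam: current parameter vector
    # xy_counts[(x, y)] and x_counts[x]: raw counts from the training data
    grad = np.zeros_like(lam)
    for x, fy in feats.items():
        scores = np.array([f @ lam for f in fy])   # sum_i lambda_i f_i(x, y)
        p = np.exp(scores - scores.max())
        p /= p.sum()                               # P_Lambda(y | x)
        for y, f in enumerate(fy):
            grad += xy_counts.get((x, y), 0) * f   # empirical term
            grad -= x_counts.get(x, 0) * p[y] * f  # model expectation term
    return lam + lr * grad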
Each iteration updates the parameters from Λ to Λ + ∆, where ∆ = {δ_1, . . . , δ_n}. Using the inequality −log(x) ≥ 1 − x for x > 0, we can derive a lower bound on the resulting change in the likelihood expression.
$$L_{\tilde p}(\Lambda + \Delta) - L_{\tilde p}(\Lambda) \ge \sum_{x,y} \tilde p(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde p(x)\, \frac{Z_{\Lambda+\Delta}(x)}{Z_{\Lambda}(x)}$$
$$= \sum_{x,y} \tilde p(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde p(x)\, \frac{\sum_y \exp\left(\sum_i (\lambda_i + \delta_i) f_i(x,y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x,y)\right)}$$
$$= \sum_{x,y} \tilde p(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde p(x) \sum_y \frac{\exp\left(\sum_i \lambda_i f_i(x,y)\right)}{Z_{\Lambda}(x)}\, \exp\left(\sum_i \delta_i f_i(x,y)\right)$$
$$= \sum_{x,y} \tilde p(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde p(x) \sum_y P_\Lambda(y \mid x)\, \exp\left(\sum_i \delta_i f_i(x,y)\right)$$
$$= A(\Delta \mid \Lambda)$$
Now, if we can find a ∆ such that A(∆|Λ) > 0, then we have an improvement in the likelihood. Hence, we try to maximize A(∆|Λ) with respect to each δ_i. Unfortunately, the derivative of A(∆|Λ) with respect to δ_i yields an equation containing all of {δ_1, δ_2, . . . , δ_n}, and hence the constraint equations for the δ_i are coupled.
To get around this, we first observe that the coupling is due to the summation over the δ_i inside the exponential. We consider a counterpart expression with the summation placed outside the exponential and compare the two expressions. We find that we can indeed establish an inequality using an important property called Jensen's inequality. First, we define the quantity

$$f^{\#}(x, y) = \sum_i f_i(x, y)$$

If the f_i are binary-valued, then f^{\#}(x, y) is just the total number of features which are non-zero (applicable) at the point (x, y). We rewrite A(∆|Λ) in terms of f^{\#}(x, y) as follows:
$$A(\Delta \mid \Lambda) = \sum_{x,y} \tilde p(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde p(x) \sum_y P_\Lambda(y \mid x)\, \exp\left(f^{\#}(x,y) \sum_i \frac{\delta_i f_i(x,y)}{f^{\#}(x,y)}\right)$$
Now, we note that for fixed (x, y), the quantities f_i(x,y)/f^{\#}(x,y) form a p.d.f. over i (they are non-negative and sum to one). Jensen's inequality states that for a p.d.f. p(x),

$$\exp\left(\sum_x p(x)\, q(x)\right) \le \sum_x p(x)\, \exp(q(x))$$
Now, using Jensen's inequality, we get

$$A(\Delta \mid \Lambda) \ge \sum_{x,y} \tilde p(x,y) \sum_i \delta_i f_i(x,y) + 1 - \sum_x \tilde p(x) \sum_y P_\Lambda(y \mid x) \sum_i \frac{f_i(x,y)}{f^{\#}(x,y)}\, \exp\left(\delta_i f^{\#}(x,y)\right)$$
$$= B(\Delta \mid \Lambda)$$
where B(∆|Λ) is a new lower bound on the change in likelihood. B(∆|Λ) can be maximized easily because there is no coupling of variables in its derivative. The derivative of B(∆|Λ) with respect to δ_i is

$$\frac{\partial B(\Delta \mid \Lambda)}{\partial \delta_i} = \sum_{x,y} \tilde p(x,y)\, f_i(x,y) - \sum_x \tilde p(x) \sum_y P_\Lambda(y \mid x)\, f_i(x,y)\, \exp\left(\delta_i f^{\#}(x,y)\right)$$

Setting this derivative to zero gives, for each i, an equation in δ_i alone, which can be solved numerically (for instance by Newton's method); this yields the update rule of the iterative scaling algorithm.
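A minimal Python sketch of this per-parameter update using Newton's method, assuming the relevant expectations have been precomputed; all names and the data layout are illustrative.

import math

def solve_delta(empirical, terms, iters=50):
    # empirical: the empirical expectation of f_i
    # terms: list of (w, fi, fs) with w = p_tilde(x) * P_Lambda(y|x),
    #        fi = f_i(x, y) and fs = f#(x, y), one triple per (x, y) pair
    delta = 0.0
    for _ in range(iters):
        g = sum(w * fi * math.exp(delta * fs) for w, fi, fs in terms) - empirical
        dg = sum(w * fi * fs * math.exp(delta * fs) for w, fi, fs in terms)
        if dg == 0:
            break
        delta -= g / dg  # Newton step on the zero-derivative condition
    return delta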
3.4.1 Motivation
Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector, and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane such that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum margin classifier. The support vector machine model achieves precisely this maximum margin classifier.
Figure 3.3: An example of maximum margin hyperplane classification over per-
fectly separable data.
Any hyperplane can be written as the set of points x satisfying

$$w \cdot x - b = 0$$

where w is the normal vector to the hyperplane. If the training data are linearly separable, we can select two hyperplanes such that they separate the data and there are no points between them, and then try to maximize their distance. The region bounded by them is called "the margin". These hyperplanes can be described by the equations

$$w \cdot x - b = 1 \qquad \text{and} \qquad w \cdot x - b = -1$$

The distance between these two hyperplanes is $\frac{2}{\|w\|}$; hence our aim is to minimize $\|w\|$. Also, we want to penalize those data points which fall within the margin. So, ideally, for each i,

$$w \cdot x_i - b \ge 1 \quad \text{if } y_i = 1 \qquad \text{and} \qquad w \cdot x_i - b \le -1 \quad \text{if } y_i = -1$$

This can be rewritten as

$$y_i (w \cdot x_i - b) \ge 1 \quad \forall\, i.$$
If the data is not linearly separable, then finding such a hyperplane is impossible.
To still get a ’good’ classifier we introduce non-negative slack variables ξi , which
measure the degree of misclassification of the point xi .
The final objective function has two terms, one corresponding to the maxi-
mization of the margin size, the other corresponding to penalising the misclassified
examples (non-zero ξi ), and the optimization becomes a trade-off between a large
margin and a small error penalty.
$$\underset{w,\,\xi,\,b}{\operatorname{arg\,min}}\ \left( \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \right)$$

subject to

$$y_i (w \cdot x_i - b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0 \quad \forall\, i$$
Introducing Lagrange multipliers α_i, β_i ≥ 0 for the two families of constraints, we solve the above problem using the technique of primal-dual algorithms; that is, we first write its dual. The dual of the above problem is
$$\underset{\alpha}{\operatorname{arg\,max}}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j)$$

subject to

$$0 \le \alpha_i \le C \quad \forall\, i \qquad \text{and} \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
The theory of how the dual is generated from the primal is outside the scope of this
thesis. The dual problem is convex and hence can be solved by classical convex
optimization techniques. From the strong duality theorem we have that the optimal
values of the primal and the dual objectives coincide. From a property of duality
not presented here, we get that the solution to the primal can be expressed as a
linear combination of the training vectors,
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i$$
Since α_i ≥ 0, we notice that in the optimization of the primal, if y_i(w · x_i − b) > 1, the corresponding α_i will be set to zero. The x_i whose corresponding α_i are greater than 0 are precisely the support vectors; they lie on the margin and satisfy y_i(w · x_i − b) = 1.
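As a practical illustration, a linear soft-margin SVM of exactly this form can be trained with scikit-learn; the toy data below is made up, and the fitted attributes expose the quantities discussed above.

import numpy as np
from sklearn.svm import SVC

# Toy two-class data (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0]])
y = np.array([-1, -1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)                 # the x_i with alpha_i > 0
print(clf.dual_coef_)                       # the products alpha_i * y_i
w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i
print(w, clf.intercept_)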
Figure 3.4: An example transformation of input space using kernel functions and
the corresponding decision boundaries in the two spaces.
3.5.1 Model
A potential solution to a problem is an individual, which can be represented by a set of parameters. These parameters are regarded as the genes of a chromosome and can be structured as a string of values in binary form. A positive value, called the fitness value, is used to reflect the degree of goodness of the chromosome for solving the problem; this value is closely related to the chromosome's objective value. A fitter chromosome tends to yield good-quality offspring. A population pool of chromosomes has to be initialized, and the chromosomes can be set randomly. The size of this population varies from one problem to another.
In each cycle of the genetic process, the subsequent generation is created from the chromosomes of a set of parents in the current generation, called the mating pool, which is selected via a selection routine. There are multiple possible selection routines; a popular one, called roulette wheel selection, is presented next. The cycle of evolution is repeated until a desired termination criterion is reached. This criterion can be set by the number of evolutionary cycles, the amount of variation of individuals between generations, or a predefined value of fitness.
• Sum the fitness of all the population members and call it the total fitness (N ).
• Generate a random number n between 0 and the total fitness N .
• Return the first population member whose fitness, added to the fitness of the preceding population members, is greater than or equal to n. A sketch of this routine in code follows.
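A minimal Python sketch of this selection routine (the function and variable names are illustrative):

import random

def roulette_select(population, fitness):
    # population: list of chromosomes; fitness: parallel list of positive values
    total = sum(fitness)              # the total fitness N
    n = random.uniform(0, total)      # random point on the wheel
    running = 0.0
    for chrom, fit in zip(population, fitness):
        running += fit
        if running >= n:              # first member whose cumulative fitness reaches n
            return chrom
    return population[-1]             # guard against floating-point round-off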
Next we look at genetic operators. These operators act on chromosomes and give
rise to the next pool of chromosomes (or offspring). These operators result in a
change in the fitness value of the chromosomes in the pool and hence are ultimately
responsible for finding the optimal chromosome.
The Crossover Operation: Given two parents from the mating pool, the crossover operator partitions the chromosomes into segments of random sizes and probabilistically exchanges the corresponding segments between the parents. The new chromosome generated at the end of the crossover operation is called the child chromosome. An example of a single-point crossover (the chromosomes are split into only two segments) is shown in Figure 3.5.
The Mutation Operation: The mutation operation changes each bit in the
chromosome with some probability. If each bit is allowed to take on values from a
set of size more than 2, then the mutation operator might end up exchanging two
bits in the chromosome, replacing a bit by another at a different position, creating
a new bit with no relation to existing bits, et cetera. However, since we are only concerned with chromosomes whose bits take binary values, for our purposes the mutation operator simply flips each bit of the chromosome with a certain probability. An example of mutation is shown in Figure 3.6. Although in the figure only a single bit is shown to be flipped, in practice multiple bits can get flipped in a single mutation operation.
Figure 3.6: A depiction of the mutation operator. Here only a single bit is shown
to be mutated.
Choosing the probability parameters for both the operators can be a complex,
non-linear optimization problem. This selection issue continues to remain largely
open but some guidelines exist as to how to choose the parameters given the popu-
lation size.
Neither operator necessarily improves the population; improvement is enforced by the parent selection and survivor selection routines. However, the apparently random nature of the operations is what allows genetic algorithms to overcome a serious limitation of many other optimization frameworks: they do not get trapped at local optima, in contrast to some gradient descent algorithms. Even when a local optimum is reached, a genetic algorithm will produce some offspring which explore other possibilities beyond the current local optimum.
• Recombine the genes of the parents in the mating pool using crossover to
generate a new child population.
• Perturb the genes of the child population via the mutation operator.
• Select the chromosomes with the best N fitness values among the parent and
child populations as survivors for the next generation.
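The following is a minimal Python sketch of the complete evolutionary loop, assuming binary-string chromosomes, a user-supplied fitness function, and the roulette_select routine sketched earlier; all names and parameter defaults are illustrative.

import random

def genetic_algorithm(fitness, pop_size=50, length=32, generations=100,
                      p_cross=0.7, p_mut=0.01):
    # Random initial population of binary-string chromosomes
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(c) for c in pop]
        children = []
        while len(children) < pop_size:
            # Parent selection via the roulette wheel
            p1 = roulette_select(pop, fits)
            p2 = roulette_select(pop, fits)
            # Single-point crossover with probability p_cross
            if random.random() < p_cross:
                cut = random.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # Mutation: flip each bit with probability p_mut
            for c in (c1, c2):
                for i in range(length):
                    if random.random() < p_mut:
                        c[i] ^= 1
                children.append(c)
        # Survivor selection: keep the pop_size fittest of parents and children
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)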
The genetic operators, crossover and mutation, have the ability to generate, promote, and juxtapose building blocks to form the optimal strings. Crossover tends to conserve the genetic information present in the strings; thus, when the strings selected for crossover are similar, its capacity to generate new building blocks diminishes. Mutation, however, is not a conservative operator and is capable of generating radically new building blocks.
In addition, parent selection is an important procedure to be devised. It tends to be biased toward building blocks with higher fitness values and, in the end, ensures their representation from generation to generation.
3.5.6 Limitations
Genetic algorithms have their share of limitations. The first among them was already touched upon: the question of what parameters to choose for the crossover and mutation operations does not have a satisfactory answer, and it appears that arriving at an answer is complex and might be computationally expensive too. The second major concern is the efficiency of the algorithm. Genetic algorithms might perform well on average, but in the worst case they can take a long time to improve the solution.
In the next section, we look at another probabilistic heuristic for optimization, one inspired by a metallurgical process.
1. A finite set S of states.
2. A real-valued cost function J defined on S, which is to be minimized.
3. For each state i ∈ S, a neighbourhood set S(i) ⊆ S of states reachable from i in one step.
4. A cooling schedule T(t), giving the temperature at each time step t.
This is a complete and abstract description of the model, which can be applied to problems beyond thermodynamics, in particular to optimization problems in computer science. Next, we look at the heuristic of simulated annealing and how it solves the energy minimization problem.
3.6.2 The Method
Given the above elements, the SA algorithm consists of a discrete-time inhomogeneous Markov chain x(t), which evolves as follows. If the current state x(t) is i, a neighbour j of i is chosen at random; if the move decreases the cost, it is always accepted, and otherwise it is accepted with probability exp(−(J(j) − J(i))/T(t)). Hence, the SA algorithm can be viewed as a local search algorithm in which there are occasional 'upward' moves that lead to a cost increase. These moves help the algorithm escape from local minima. This is shown pictorially in Figure 3.7. The function T(t), called the cooling schedule, is crucial in deciding the performance of SA on a problem. From the probability expression above, we can see that lower temperatures imply a smaller chance of accepting an upward move. Hence, as we near our optimum, the temperature should fall so that we converge to that state.
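A minimal Python sketch of this chain, assuming a user-supplied cost function, a neighbour-sampling routine, and a geometric cooling schedule (all of these choices are illustrative):

import math
import random

def simulated_annealing(cost, neighbour, x0, t0=1.0, alpha=0.99, steps=10000):
    # Geometric cooling schedule T(t) = t0 * alpha^t (one common choice)
    x, t = x0, t0
    best = x
    for _ in range(steps):
        y = neighbour(x)                 # random neighbour of the current state
        dj = cost(y) - cost(x)
        # Downhill moves are always accepted; uphill moves with prob exp(-dJ/T)
        if dj <= 0 or random.random() < math.exp(-dj / t):
            x = y
            if cost(x) < cost(best):
                best = x
        t *= alpha                       # lower the temperature
    return best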
3.6.3 Performance
SA has gained popularity among researchers mainly due to its speed of convergence. Despite the lack of a rigorous theoretical justification for this convergence speed, SA has grown in popularity and is widely used in image processing. In a comprehensive study of SA, Johnson et al. (1990, 1991, 1992) present the performance of SA on combinatorial optimization problems.
was mixed. In some problems such as the traveling salesman problem, SA outper-
formed the best known heuristics. In other cases, such as the graph partitioning
problem, specialized heuristics performed better. In the case of the graph coloring
problem there was no significant difference between the solutions given by SA and
specialized heuristics.
We now conclude this chapter. We have looked at various classical techniques used for search and optimization, keeping in mind throughout the main motive of developing quantum techniques. This broad exploration of classical techniques was essential because these are the techniques we would like to improve using the power of quantum mechanics and quantum computing. We next look at the development of quantum computing ideas in the literature by studying some popular and landmark quantum computing algorithms.