Computer Science Notes: Mathematics for Computer Science
This minisite contains notes taken by Chris Northwood whilst studying Computer Science at the
University of York between 2005-09 and the University of Sheffield 2009-10.
They are published here in case others find them useful, but I provide no warranty for their accuracy,
completeness or whether or not they are up-to-date.
The contents of this page have dubious copyright status, as great portions of some of my revision notes
are verbatim from the lecture slides, what the lecturer wrote on the board, or what they said.
Additionally, lots of the images have been captured from the lecture slides.
Types of Numbers
Natural Numbers
N = {0, 1, 2, 3, ...}
Integers
Z = {..., -2, -1, 0, 1, 2, ...}
Rationals
Q = {p/q : p, q ∈ Z, q ≠ 0}
Irrationals
Real numbers that cannot be expressed as a ratio of integers, e.g., √2, π
Reals
R, the rationals together with the irrationals
• Closed interval: [a, b] = {x ∈ R : a ≤ x ≤ b}
• Open interval: (a, b) = {x ∈ R : a < x < b}
Informal definition
f : A -> B
Function f maps every value from domain A onto a single value in the range B.
Formal definition
The image is the portion of the range onto which some value of the domain is actually mapped.
Combining Functions
Classes of Function
Injective Function
Surjective Function
Every value in the range is mapped onto from the domain; the image is identical to the range.
Properties of Functions
A function is even if f(-x) = f(x) for all x in the domain.
A function is odd if f(-x) = -f(x) for all x in the domain.
Piecewise Function
The domain of f(x) is the union of the intervals R1, R2, etc.: R1 ∪ R2 ∪ ...
To ensure that the piecewise function is indeed a function, the intervals must be pairwise disjoint: Ri ∩ Rj = ∅ for i ≠ j. If any intervals overlap, it is not pairwise disjoint, and it is not a function.
Function Inverse
Function Composition
Consider two functions, f : B -> C and g : A -> B, where the domain of f must be identical to the image of g, i.e., g(A) = B.
(f∘g)(x) = f(g(x))
f∘g has domain A and range C. g must be surjective, otherwise f∘g is not a function.
Function Limits
• As x gets closer and closer to a, does the value of function f(x) get closer and closer to some
limit, L?
• Can we make the function value f(x) as close to L as desired by making x sufficiently close to a?
Limit Rules
• Constant: limx→a c = c
• Linear: limx→a x = a
• Power rule: limx→a x^n = a^n
• Functions to powers: limx→a [f(x)]^n = [limx→a f(x)]^n
• Linear roots: limx→a x^(1/n) = a^(1/n), if a > 0 and n is a positive integer, or a < 0 and n is an odd positive integer
• Roots of functions: limx→a [f(x)]^(1/n) = [limx→a f(x)]^(1/n), if n is an odd positive integer, or n is an even positive integer and limx→a f(x) > 0
• Constant multiplier: limx→a c.f(x) = c limx→a f(x)
• Sum: limx→a [f(x) + g(x)] = limx→a f(x) + limx→a g(x)
• Difference: limx→a [f(x) - g(x)] = limx→a f(x) - limx→a g(x)
• Product: limx→a [f(x).g(x)] = limx→a f(x) . limx→a g(x)
• Quotient: limx→a [f(x)/g(x)] = limx→a f(x) / limx→a g(x), provided limx→a g(x) ≠ 0
Tolerances
Infinite Limits
The statement limx→∞ f(x) = L means that for every ε > 0 there exists a number M such that if x > M then
|f(x) - L| < ε. On a graph, this is represented as the function f(x) lying between the horizontal
lines y = L ± ε. The line y = L is called the horizontal asymptote of the graph of f.
Continuity of a Function
• no breaks
• no holes
• no vertical asymptotes
Therefore, we can say that there are the following types of discontinuity:
• removable discontinuity
• jump discontinuity
• infinite discontinuity
A function f : A -> B is continuous at c if:
• f(c) ∈ B and
• limx→c f(x) = f(c)
• A rational function q(x) = f(x)/g(x) is continuous at every real number x ∈ R, except where g(x) = 0.
Continuity on an Interval
Let function f : A -> B be defined on the open interval (a,b) ⊆ A ⊆ R. The function is said to be
continuous on (a, b) if:
• limx→c f(x) = f(c) for every c ∈ (a, b)
If we want the function to be defined on the closed interval [a,b], then we must tighten up the above
definition to include:
• limx→a+ f(x) = f(a) and limx→b- f(x) = f(b) (the one-sided limits at the endpoints)
Continuity of Compound Functions
If functions f and g are both continuous at c, then so are:
• the sum f + g
• the difference f - g
• the product f.g
Consider f : B -> C and g : A -> B (g is surjective). The domain of f must be identical to the image of g.
Consider the composition f∘g. If limx→c g(x) = b and f is continuous at b, then limx→c f(g(x)) = f(b) = f(limx→c g(x)).
If, in addition, g is continuous at c and f is continuous at g(c), then limx→c f(g(x)) = f(b) = f(limx→c g(x)) = f(g(c)),
so the composite function f∘g is continuous at c.
If f is continuous on [a, b] and w ∈ [min(f(a), f(b)), max(f(a), f(b))], then there exists at least one number
c in [a, b] such that f(c) = w (the intermediate value theorem).
Zeroes of a Function
Calculus
• f is differentiable at c if f′(c) = limh→0 (f(c + h) - f(c)) / h exists
Derivative of an Inverse Function
Dy f⁻¹(y) = 1 / Dx f(x), provided that Dx f(x) ≠ 0.
However, this is not well formed, as y is expressed in terms of x. We must use our knowledge of the
original formula and derivative to reform the derivative to be all in terms of y.
f(g(x)) = (f∘g)(x)
We can use the chain rule to differentiate this. The chain rule is (f∘g)′(x) = f′(g(x)).g′(x).
The Exponential
The exponential function is of the form a^x, where a ∈ (0, ∞) is a positive real constant. The
exponential function is defined in five stages:
1. If x is a positive integer, then a^x = a.a.....a (x times)
2. If x = 0, then a^0 = 1
3. If x is a negative integer, x = -n, then a^x = 1/a^n
4. If x is rational, x = p/q, then a^x = (a^p)^(1/q)
5. If x ∈ R, then a^x is defined as the limit of a^r over rationals r approaching x
It is important to remember that this is not the same as the polynomial function. In a polynomial, the
function varies in the base, but here it varies in the exponent: f(x) = a^x is exponential, but f(x) = x^n is
polynomial.
Monotonicity
Exponential Limits
Derivative of Exponential
If we try to derive the derivative from first principles, we get Dx a^x = a^x limh→0 (a^h - 1)/h, where the limit is
an indeterminate form 0/0 at h = 0; this limit is exactly f′(0). Hence, we can say that f′(x) = f′(0)a^x.
Natural Exponential
This is the exponential where f′(0) = 1; we call this base e. Where f(x) = e^x, f′(x) = e^x - the function is
equal everywhere to its own derivative.
Natural Logarithm
To figure out the derivative of the natural logarithm, we can use the rules we already know about
inverses to get Dy ln y = 1/e^x. To get things in terms of y, we substitute y = e^x to get Dy ln y = 1/y.
The natural logarithm takes the normal logarithmic rules.
a^x = e^(f(x)) for some f, and taking natural logarithms of both sides gives x ln a = f(x); therefore, by
substituting back in, a^x = e^(x ln a).
By differentiating from first principles, we showed Dx a^x = a^x f′(0), but since we now know that
a^x = e^(x ln a), we get Dx a^x = a^x ln a.
Using the rules we learnt earlier for inverses, we can derive the derivative for loga y, which gives us
Dy loga y = 1/(y ln a).
Iteration Algorithms
Consider xr+1 = (xr² + 3) / 4. This has fixed points at 1 and 3. A fixed point is a number which, when put in,
returns itself (xr+1 = xr). If we start with a number near a fixed point (e.g., x0 = 1.5) we get a sequence that
converges towards a point (in this case 1). If we put in x0 = 2.9, this converges to 1 also, but x0 = 3.1
diverges to ∞.
Types of Algorithm
The iterative algorithm for computing x will be based on a function of preceding values from this
sequence. The function will determine convergence, and care should be taken to ensure convergence
exists.
MCS covers two main types of algorithm: one-stage (e.g., Newton-Raphson) and two-stage (e.g.,
Aitken's method). Two-stage algorithms tend to have faster convergence, but one-stage algorithms have
simpler initial conditions.
The type of algorithm we looked at above is a one-stage algorithm called fixed-point iteration. To find the
fixed points, a graph is used.
• How fast does it move along this path? Can we choose F(x) to optimise convergence?
xr+1 = F(xr)
If it converges to s, limr→∞ xr = s and s = F(s), then s is called the fixed point of the iteration scheme. If the
iterative process is initialised with x0 = s then it will remain static at the fixed point.
1. Expand F about the fixed point: F(s + ε) = a0 + a1ε + a2ε² + ...
2. Differentiate: F′(s + ε) = a1 + 2εa2 + 3ε²a3 + ..., then set ε = 0 to eliminate terms containing ε.
As s = F(s), we can subtract it from both sides to give us: εr+1 = εr F′(s) + (εr²/2) F″(s) + ... For small values of εr, we
can ignore εr² and higher powers. This is the recurrence relation for the distance from the fixed point.
So, |εr+1 / εr| ≈ |F′(s)|, and |F′(s)| < 1 is the condition for convergence: for some range of x0, FPI converges to s =
F(s); if there is no such range, it diverges.
Convergence Paths
The sign of F′(s) determines the sign of the error: sgn(εr+1) = sgn(F′(s))^(r+1).sgn(ε0).
• staircase, if F′(s) ∈ (0, 1). This converges to a fixed point from one direction
• cobweb or spiral, if F′(s) ∈ (-1, 0). This jumps from side to side of the fixed point.
Rate of Convergence
Quadratic convergence is much faster than linear convergence. We have the freedom to choose F(x) so
that F′(s) = 0.
There are limitations to solving non-linear equations using FPI. Non-uniqueness is a problem when
choosing the iteration function: for any one equation G(x) = 0 there may be many roots, and many different ways of
choosing F(x). There may be no universal F(x) for finding all roots; each root may require its own F(x),
and even if there is a universal F(x), the initial values may require careful choice. F(x) may also converge at
different rates for different roots.
Due to the shortcomings with FPI, we can use the Newton-Raphson method.
Newton-Raphson
In Newton-Raphson, the gradient in the proximity of the root is used to guide the search. It is a universal
scheme, where only one function is needed to find all the roots - provided that there are good initial
estimates. Newton-Raphson is also a faster iteration method: its convergence is quadratic.
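A minimal sketch of the scheme in Python, under the standard update xr+1 = xr - f(xr)/f′(xr); the example equation x² - 4x + 3 = 0 (roots at 1 and 3) is an assumed illustration:

```python
def newton_raphson(f, df, x0, tol=1e-12, max_iter=100):
    """Iterate x - f(x)/f'(x) until successive estimates agree to within tol."""
    x = x0
    for _ in range(max_iter):
        x_next = x - f(x) / df(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("failed to converge - try a better initial estimate")

f = lambda x: x * x - 4 * x + 3   # roots at 1 and 3
df = lambda x: 2 * x - 4          # the gradient used to guide the search

print(newton_raphson(f, df, 0.0))  # finds the root at 1
print(newton_raphson(f, df, 5.0))  # finds the root at 3
```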
Probability
• Union - A ∪ B
• Intersection - A ∩ B
• Disjoint sets - A ∩ B = ∅
• Difference - A - B
• Complement - Ac = U - A
Random Events
Events can be one of two types: deterministic, where the same experimental conditions always give
identical answers, or random, where the same experimental conditions give different answers, but the
average of a large number of experiments are identical.
Sample space
The sample space is the set of all possible outcomes of a random experiment. S is the universal set for
the random experiment. Members of S are called sample points. e.g., for a coin toss, S = {H, T}
Sample spaces come in various forms. The first is a discrete sample space which can be either finite (e.g.,
a dice throw or coin toss) or infinite (defined on the set of positive integers, and are countable). The
other sample space is the continuous sample space, which is defined over the set of reals.
Events
• either A or B - A ∪ B
• both A and B - A ∩ B
Probability is the likelihood of a specific event occurring and is represented as a real number in the range
[0, 1]. If P(A) = 1, this is a sure event; if P(A) = 0, then the event is impossible.
The odds against is the probability of an event not occurring: P(Ac) = 1 - P(A)
Estimating Probability
Epistemic means
relating to knowledge
There are two approaches to estimating probability, the first is the a priori (epistemic) approach.
If event A can occur in h different ways out of a total of n possible ways, P(A) = h/n. This assumes the
different occurrences of A are equally likely and is often referred to as the uniform distribution of
ignorance.
Aleatoric means
relating to experience
The second approach is the a posteriori (aleatoric) approach. If an experiment is repeated n times and
the event A is observed to occur h times, P(A) = h/n, however, n must be a very large value for this to be
accurate.
We can break down probability in terms of atomic events into a union of events (an event or another
event) or an intersection of events (an event and another event). These can be simplified using mutual
exclusivity and independence respectively.
Axioms of Probability
If sample space S is finite, the set of events is the power-set of S. If sample space S is continuous, the
events are measurable subsets.
For every event, P(A) ≥ 0, and for the sure event S, P(S) = 1. For mutually exclusive events
A1, A2, A3, ..., P(A1 ∪ A2 ∪ A3 ∪ ...) = P(A1) + P(A2) + P(A3) + ...
The theorems of probability follow from the axioms. If A1 ⊂ A2 then P(A1) ≤ P(A2) and P(A2 - A1) = P(A2) - P(A1). For
every event P(A) ∈ [0,1]. The impossible event has 0 probability, P(∅) = 0, and the complement has P(Ac) =
1 - P(A).
Conditional Probability
If you consider two events, A and B, the probability of B given A has occurred has the notation P(B|A),
which can be expressed as P(B|A) = P(A ∩ B) / P(A), or equivalently P(A ∩ B) = P(B|A).P(A).
The probability of a set of independent events which are anded together (e.g., event A and event B, etc.)
can be expressed as P(A1 ∩ A2 ∩ ... ∩ An) = ∏(i=1..n) P(Ai).
Bayes Theorem
Consider an event B where A1, A2, ..., An are mutually exclusive events which could have caused B. The
sample space S = ∪(k=1..n) Ak, i.e., one of these events has to occur. Bayes' theorem then gives
P(Ak|B) = P(B|Ak)P(Ak) / Σ(j=1..n) P(B|Aj)P(Aj).
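A small numeric sketch of the theorem; the priors and likelihoods below are hypothetical values chosen purely for illustration:

```python
# Three mutually exclusive causes A1..A3 partition the sample space
p_A = [0.5, 0.3, 0.2]          # priors P(Ak), summing to 1
p_B_given_A = [0.1, 0.4, 0.8]  # likelihoods P(B|Ak)

# Total probability: P(B) = sum over k of P(B|Ak) P(Ak)
p_B = sum(b * a for b, a in zip(p_B_given_A, p_A))

# Bayes' theorem: P(Ak|B) = P(B|Ak) P(Ak) / P(B)
posterior = [b * a / p_B for b, a in zip(p_B_given_A, p_A)]
print(posterior)  # [0.1515..., 0.3636..., 0.4848...] - sums to 1
```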
Random Variables
If you associate a variable with each point in the sample space, this is referred to as a random or
stochastic variable. A random variable is discrete if the sample space is countable, or continuous if the
sample space is uncountable.
If you consider a discrete random variable X ∈ {xk : k ∈ S} defined over the sample space S, the
probability distribution is P(X = xk) = f(xk), where f : S → [0, 1], f(x) ≥ 0 and Σ(k ∈ S) f(xk) = 1.
If X is a continuous random variable, the probability that X takes on the value x is 0, as the sample space
is uncountably large.
Density Functions
In a continuous distribution, f(x) is defined to mean P(A ≤ x ≤ B) = ∫(A..B) f(x) dx, i.e., the probability that x lies
between A and B is equal to the corresponding area beneath the function f(x).
Another method of considering density is the cumulative distribution. Here, the function F(x) = P(X ≤ x) = P(-∞
< X ≤ x). So, we could say that F(x) = ∫(-∞..x) f(t) dt, hence f(x) = dF(x)/dx.
Bernoulli Trials
A random event (e.g., heads in a coin toss, 4 on a dice) occurs with probability p and fails to occur with
probability 1 - p. A random variable X is the number of times the event occurs in n Bernoulli trials.
The random variable X is the event that there are x successes and n - x failures in n trials. This gives us
the probability of the successes as p^x (where p is the probability of success) and the probability of the failures
as (1 - p)^(n - x).
There are n!/(x!(n - x)!) ways of getting this outcome, so the probability distribution is P(X = x) =
(n!/(x!(n - x)!)) p^x (1 - p)^(n - x), which we call the binomial distribution.
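The distribution is straightforward to compute directly; a minimal sketch (the values n = 10, p = 0.5, e.g., heads in ten fair coin tosses, are assumed for illustration):

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) p^x (1 - p)^(n - x): x successes in n Bernoulli trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5
print(binomial_pmf(5, n, p))                             # 0.24609375
print(sum(binomial_pmf(x, n, p) for x in range(n + 1)))  # 1.0 - a valid distribution
```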
Expectation
The expectation of random variable X is E(X) = Σj xj P(X = xj) if the variable is discrete and E(X) = ∫ x f(x)
dx if it is continuous. This gives us a value μ for X, which is the expectation of X.
Variance
Variance is the expectation of (X - μ)²: Var(X) = E[(X - μ)²]. The standard deviation is σX = √Var(X).
• If X and Y are independent random variables, Var(X + Y) = Var(X) + Var(Y) and Var(X - Y) = Var(X) +
Var(Y) (the variances still add).
Binomial Distribution
The binomial distribution has mean μ = np and variance Var(X) = np(1 - p), which gives a standardised
random variable Z = (X - np)/√(np(1 - p)).
Poisson Distribution
The binomial distribution is limited for rare events. If there is a large sample (i.e., n → ∞) and a small
probability (i.e., p → 0), with mean μ = np and variance Var(X) ≈ np, then the distribution can be
expressed as P(X = x) = (μ^x e^(-μ))/x!
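A short sketch comparing the binomial probability with its Poisson approximation for a rare event; n, p and x below are assumed illustrative values:

```python
from math import comb, exp, factorial

n, p = 1000, 0.002     # large sample, small probability
mu = n * p             # mean = 2
x = 3

binomial = comb(n, x) * p**x * (1 - p)**(n - x)
poisson = mu**x * exp(-mu) / factorial(x)   # P(X = x) = mu^x e^-mu / x!

print(binomial)  # ~0.1805
print(poisson)   # ~0.1804 - a close approximation
```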
Normal/Gaussian Distribution
The limiting case of the binomial distribution as n → ∞ is P(Z = z) = (1/√(2π)) e^(-z²/2), which is the normal
distribution.
A non-standardised random variable x is Gaussian if P(X = x) = (1/(σX√(2π))) e^(-(x - μ)²/(2σ²)). The sample space here
is the set of reals (i.e., x ∈ R), which means x is a continuous random variable.
This approximates the binomial with a standardised random variable Z = (X - np)/√(np(1 - p)) when n is large
and p is neither close to 0 nor close to 1.
We have the problem of an Ada compiler needing to know whether or not a program is legal Ada, and
not English, FORTRAN or some other illegal language. To start with, we can look at the problem of
recognising individual elements (variable names, integers, real numbers, etc) and then building these
together to make a full language recogniser (a lexical analyser). One definition may be that variable
names always start with a letter which is then followed by a sequence of digits and letters. We can draw
a transition graph to describe what a legal variable is composed of.
Transition Graphs
From the transition graph, a transition table can be drawn up, for the above example:
State | letter | digit | status
1 | 3 | 2 | error
3 | 3 | 3 | accept
Assume we have an on-off switch; we can define S1(n) as "the automaton is off after n pushes if and only if n is
even" and S2(n) as "the automaton is on after n pushes if and only if n is odd". We can prove ∀n, S1(n) ^ S2(n)
using induction.
As usual, we need to prove the base case (S1(0) ^ S2(0)) by considering implication in both directions, and
prove for n + 1.
Central Concepts
An alphabet is a finite non-empty set of symbols, e.g., Σ = {0, 1} or Σ = {a, b, c ... z}, etc
A string is a sequence of symbols of the alphabet, including the empty string (λ). Σk is the set of strings of
length k (however, Σ ≠ Σ1). Σ* is the union of all sets of strings of length k ≥ 0. Σ0 = {λ}. We can also define the set
that excludes the empty string as Σ+ = Σ* - {λ}
In strings (assume a ∈ Σ), a^n means a...a (a repeated n times). If w = abc and v = bcd, then wv = abcbcd. Prefixes are
anything that forms the start of a string and a suffix is anything that forms the end of a string. The length
of a string is the number of symbols in it and this can be represented using |w|.
A language is a set of strings chosen from some Σ*. L is a language over Σ: L ⊆ Σ*. Languages are sets, so
any operation that can be performed on a set can be performed on a language, e.g., the complement
of L is Σ* - L.
Models of Computation
Automata are abstract models of computing machines. They can read input, have a control mechanism
(a finite internal state), a possibly infinite storage capability and an output. In this course, we will mainly
be considering finite state automata (FSA) - machines without storage or memory. An important point to
consider is that some problems are insoluble using finite state automata, and more powerful concepts
are needed.
DFAs accept a string S if and only if δ(i, S) ∈ F, where the transition function δ is extended to deal with
strings, such that δ(q, aw) = δ(δ(q, a), w). λ is accepted by M if and only if i ∈ F.
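A minimal sketch of this definition of acceptance; the example machine (assumed for illustration) accepts strings over {0, 1} containing an even number of 1s:

```python
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd", ("odd", "1"): "even"}
initial, accepting = "even", {"even"}

def accepts(string):
    """Extend delta over the string one symbol at a time; accept if delta(i, S) is in F."""
    state = initial
    for symbol in string:
        state = delta[(state, symbol)]
    return state in accepting

print(accepts("1011"))  # False - three 1s
print(accepts("1001"))  # True - two 1s
print(accepts(""))      # True - lambda is accepted since i is in F
```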
A language L is accepted by an FSA M if and only if M accepts every string S in L and nothing else.
The definition of a DFA is such that for every element of the input alphabet, there is exactly one next
state (i.e., δ is a function). An NDFA relaxes this restriction, so there can be more than one next state
from a given input alphabet, so the transition function becomes a relation ρ.
There are advantages to non-determinism. Although digital computers are deterministic, it is often
convenient to express problems using NDFAs, especially when using search. For example, the following
NDFA defining awa, where w ∈ {a, b}*, is a natural way of expression:
Any NDFA can be converted into a DFA by considering all possible transition paths in a NDFA and then
drawing a DFA where a node is a set of the nodes in an NDFA.
• i is an initial state (i ∈ Q)
ρ can also be defined as a transition function which maps a state and a symbol (or λ) to a set of states.
Hence, we can say a string S is accepted by NDFA M = ⟨Q, Σ, ρ, i, F⟩ if and only if there exists a
state q such that (i, S, q) ∈ ρ and q ∈ F, where the transition relation ρ is extended to strings by (q, aw, r)
∈ ρ, if and only if there exists a state p such that (q, a, p) ∈ ρ ^ (p, w, r) ∈ ρ. The empty string λ is
accepted by M if and only if i ∈ F.
λ Transitions
λ transitions are a feature of an NDFA. A transition does not consume an input symbol and these
transitions are labelled with λ.
When converting an NDFA to a DFA with a λ transition, you always follow a λ transition when plotting the
possible paths.
We could say that a DFA was a special case of an NDFA, where there are no non-deterministic elements.
From the formal definitions of a DFA and NDFA, we could hence prove that for any NDFA, an equivalent
DFA could be constructed.
If M is an NDFA ⟨S, Σ, ρ, i, F⟩, then we can construct DFA M′ = ⟨S′, Σ, δ, i′, F′⟩, where:
• S′ = 2S
• i′ = {i}
• δ is a function from S′ × Σ into S′ such that for each x ∈ Σ, s′ ∈ S′, δ(s′, x) is the set of all s ∈ S such
that (u, x, s) ∈ ρ for some u ∈ s′. In other words, δ(s′, x) is the set of all states in S that can be
reached from a state in s′ over an arc named x.
Since δ is a function, M′ is a DFA. All that needs to be done is to prove that M and M′ accept the same
strings. We can do this using induction on string length, using path equivalence to show that any
string accepted by the NDFA is accepted by the equivalent DFA.
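A sketch of the subset construction (without λ transitions, which are covered later); each DFA state is a frozenset of NDFA states, and the example relation ρ is an assumed illustration accepting strings over {a, b} that end in a:

```python
def ndfa_to_dfa(rho, sigma, initial, accepting):
    """rho maps (state, symbol) to a set of next states; returns the equivalent DFA."""
    start = frozenset({initial})
    states, delta, todo = {start}, {}, [start]
    while todo:
        s = todo.pop()
        for x in sigma:
            # delta(s', x): all NDFA states reachable from some u in s' over an arc named x
            t = frozenset(v for u in s for v in rho.get((u, x), set()))
            delta[(s, x)] = t
            if t not in states:
                states.add(t)
                todo.append(t)
    final = {s for s in states if s & accepting}   # contains an accepting NDFA state
    return states, delta, start, final

rho = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
states, delta, start, final = ndfa_to_dfa(rho, {"a", "b"}, "q0", {"q1"})
print(sorted(sorted(s) for s in states))  # [['q0'], ['q0', 'q1']]
```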
Regular Languages
A language L is regular if a DFA or NDFA exists that accepts L. If M is a DFA, L(M) is the language accepted
by M. Every finite language is regular.
If we consider a regular language as a set, we can perform set operations on one or more regular
languages, and still get regular languages out.
However, not all languages are regular, and these cannot be represented by a DFA. For example, if we consider
Scheme expressions, these have a balanced number of opening and closing brackets, but such balancing cannot
be represented by a DFA.
We have a theorem called the pumping lemma which implies that if a regular language contains strings of the
form x^n y^n for arbitrarily large n ∈ N, then it must also contain strings of the form x^n y^m where n ≠ m.
Regular Expressions
Instead of using machines to describe languages, we switch to an algebraic description - this is a more
concise way of representing regular languages.
In regular expressions, + represents union, juxtaposition represents concatenation (note, some books
use . for concatenation) and * is the star closure. For example:
Language | Regular expression
{a} | a
{a, b} | a + b
{a, b} ∪ {b, c} | (a + b) + (b + c)
Formal Definition
Let Σ be an alphabet. ∅, λ and a ∈ Σ are regular expressions called primitive regular expressions.
If r1 and r2 are regular expressions, then so are r1 + r2, r1r2, r1* and (r1). A string is a regular expression if
and only if it can be derived from primitive regular expressions by a finite number of applications of the
previous statement. Every regular expression defines a set of strings; a language.
We can extend transition diagrams to have transitions which consist of regular expressions, making them
more flexible and powerful.
Moore Machines
A Moore machine is a DFA where the states emit output signals. Moore machines always emit one extra
symbol (i.e., the output string has one more symbol than the input string) and have no
accepting state. In a Moore machine, the state name and the output symbol are written inside the circle.
Mealy Machines
A Mealy machine is the same as a Moore machine, but it is the transitions that emit the output symbols.
This means that Mealy machines generate output strings of the same length as the input string. With
Mealy machines, the output symbols are written alongside the transition with an input symbol.
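A minimal sketch of a Mealy machine run; the example machine (assumed) emits the complement of each binary input symbol, so the output has exactly the input's length:

```python
# (state, input) -> (next state, output symbol emitted on the transition)
machine = {("q", "0"): ("q", "1"),
           ("q", "1"): ("q", "0")}

def run_mealy(machine, start, string):
    state, output = start, []
    for symbol in string:
        state, out = machine[(state, symbol)]
        output.append(out)
    return "".join(output)

print(run_mealy(machine, "q", "1011"))  # 0100 - same length as the input
```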
Previously, our definition of equivalence meant that both machines accept the same strings; here, we
need a new concept of equivalence, so we say that two machines are equivalent if they both give the
same output. Under this strict definition, equivalence would be impossible, as Mealy and Moore machines
always generate strings of different lengths given the same input, so we must disregard the first
character output by the Moore machine.
To change Moore machines into Mealy, each state needs to be transformed where the output on the
state is put on each transition leading to that state. This can be proved by induction, but it is easier to
prove by inspection.
Converting a Mealy machine to a Moore machine is harder, as multiple transitions with different outputs
can arrive at the same state. To work round this, you need to replicate a state, such that all transitions
into a particular state give the same output and then the above transformation process can be reversed.
Again, this is proved by inspection.
Let L be any infinite regular language. There then exist strings x, y, z (where y ≠ λ) such that ∀n >
0, xy^n z ∈ L
• With this information, we can break the string into three parts:
o let x be the substring of w that takes the machine from the start state to the first repeated state
o let y be the substring of w which follows x, such that if it travels round the circuit exactly
once it returns to the same state (this must not be λ)
o let z be the remainder of w
• Thus, w = xyz
• Hence, M accepts any string of the form xy^n z (where n > 0), which could be represented as:
To apply the pumping lemma, you should assume some language L is regular and generate a
contradiction. However, this is not perfect, for example in the case of palindromes, so additional
methods must be applied, based on the number of states in the FSA, to create a contradiction.
Decidability
A problem is effectively soluble if there is an algorithm that gives you the answer in a finite number of
steps, no matter what the inputs are. The maximum number of steps must be predictable before commencing
execution of the procedure.
An effective solution to a problem that has a yes or no answer is called a decision procedure. A problem
that has a decision procedure is called decidable.
How do we conclude whether or not two regular expressions are equal? "Generating words and praying
for inspiration is not an efficient process".
A regular expression can be converted into a finite automaton by means of an effective procedure, and
vice versa. The length of the procedure is related to the number of symbols in the regular expression or
the number of states in the finite automaton. We can test the languages are equivalent by:
• producing an effective procedure for testing that an automaton does not accept any strings
• looking at the automaton that is constructed by looking at the differences between the two
machines
• if this difference automaton accepts no strings, then the languages are equivalent
• The DFA M = (Q, Σ, δ, i, F) such that L = L(M) and the DFA is complete (no missing transitions -
that is, δ(q, a) is always defined).
• L = L(N), where N is the DFA N = (Q, Σ, δ, i, Q-F), that is, N is exactly like M, but the accepting
states have become non-accepting states, and vice versa.
Proving that the intersection of two regular languages is regular is easy, as we can use boolean
operations, specifically de Morgan's law. That is, L1 ∩ L2 = ¬(¬L1 ∪ ¬L2)
This can also be constructed explicitly. If L1 and L2 are regular languages, so is L1 ∩ L2. This can be
proved by:
• considering two languages, L1 and L2 and automata M1 and M2. Here, assume L1 = L(M1) and L2 =
L(M2) and that M1 = {Q1, Σ, δ1, i1, F1}, M2 = {Q2, Σ, δ2, i2, F2}
• notice the alphabets of M1 and M2 are the same, this can be done because in general, Σ = Σ1 ∪ Σ2
• construct M such that it simulates both M1 and M2, as sketched below. The states of M are pairs of states, the first from
M1 and the second from M2. When considering the transitions, suppose M is in the state (p, q)
where p ∈ Q1 and q ∈ Q2. If a is an input symbol, we have δ((p, q), a) = (δ1(p, a), δ2(q, a)),
presuming that we have complete DFAs.
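A sketch of this product construction; the two example DFAs are assumed illustrations over {a, b}, M1 tracking the parity of a's and M2 the parity of b's, so the intersection accepts strings with an even number of each:

```python
d1 = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}  # state = count of a's mod 2
d2 = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0}  # state = count of b's mod 2

def intersect_accepts(string):
    p, q = 0, 0                        # start in the pair state (i1, i2)
    for x in string:
        p, q = d1[(p, x)], d2[(q, x)]  # delta((p, q), a) = (delta1(p, a), delta2(q, a))
    return p == 0 and q == 0           # accept when both components accept

print(intersect_accepts("aabb"))  # True - even a's, even b's
print(intersect_accepts("aab"))   # False - odd number of b's
```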
If we have two languages, L1 and L2 defined either by regular expressions or FSAs, we can produce an FSA
which accepts (L1 - L2) + (L2 - L1) - this is the language that accepts strings that are in L1 but not L2, or are
in L2 but not L1. If L1 and L2 are the same language, then the automaton can not accept any strings.
To make this procedure an effective procedure, we need to show we can test emptiness. We can create
an effective procedure using induction. The base case is that the start state is reachable from the start state.
The inductive step is that if state q is reachable from the start state and there is an arc
from q to p with any label (an input symbol or λ), then p is reachable. In this manner, we can compute
the set of reachable states. If an accepting state is among them, then the language of the FSA is not empty;
otherwise it is.
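A sketch of this emptiness test; arcs is an assumed adjacency structure mapping each state to the states reachable by a single transition (over any symbol, or λ):

```python
def is_empty(arcs, start, accepting):
    """Compute the reachable states; the language is empty iff none is accepting."""
    reachable, todo = {start}, [start]
    while todo:
        q = todo.pop()
        for p in arcs.get(q, set()):
            if p not in reachable:   # inductive step: q is reachable and q -> p is an arc
                reachable.add(p)
                todo.append(p)
    return not (reachable & accepting)

# Assumed example: the accepting state 2 cannot be reached from the start state 0
arcs = {0: {1}, 1: {0}, 2: {2}}
print(is_empty(arcs, start=0, accepting={2}))  # True - the FSA accepts no strings
```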
You can also test by regular expression to see if the FSA is empty. Here, the base step is that ∅ denotes
the empty language and any other primitive expression does not. The inductive step is that if r is a
regular expression, there are four cases:
1. r = r1 + r2: L(r) is empty if and only if both L(r1) and L(r2) are empty.
2. r = r1r2: L(r) is empty if and only if either L(r1) or L(r2) is empty.
3. r = r1*: L(r) is never empty, as it always contains λ.
4. r = (r1): L(r) is empty if and only if L(r1) is empty.
Now we can give a procedure for checking whether two regular languages are equivalent:
1. Construct DFAs M1 and M2 for the two languages
2. Construct the complement machines M̄1 and M̄2
3. Construct automata accepting L(M1) ∩ L(M̄2) and L(M2) ∩ L(M̄1)
4. Test both for emptiness; the languages are equivalent if and only if both accept no strings
If F is an FSA with n states, and it accepts an input string w such that n < length(w), then F accepts an
infinite language. This is proved with the pumping lemma.
If F accepts infinitely many strings then F accepts some strings such that n < length(w) < 2n. This is
proved by saying:
4. Then one can construct a path with length at least n + 1 but less than 2n by going round the
circuit the required number of times.
We can then generate an effective procedure. If we have an FSA with n states, we can generate all strings
in the language with length at least n and less than 2n and check if any is accepted by the FSA. If it is, the
FSA accepts an infinite language; if not, it does not.
Minimising a DFA
For an example,
see lecture 7
Is there a way to minimise a DFA? States p and q are equivalent if and only if for all input
strings w, δ(p, w) is accepting if and only if δ(q, w) is accepting. That is, starting from p and q with a given
input, either both end up in an accepting state (not necessarily the same one) or neither does.
States p and q are distinguishable if there is at least one string w such that one
of δ(p, w) and δ(q, w) is accepting and the other is not.
One efficient way of accomplishing this is using the table-filling system. For DFA M = (Q, Σ, δ, i, F), if p is
an accepting state and q is non-accepting, then the pair {p, q} is distinguishable. The inductive step is
that if p and q are states such that for some input symbol a, r = δ(p, a) and s = δ(q, a) are a pair of states
known to be distinguishable, then {p, q} is a pair of distinguishable states.
The procedure can be performed by filling up a table starting with the states we can distinguish
immediately. At the end of the process, those states not marked will show the equivalent states.
State equivalence is an equivalence relation: reflexive, symmetric and transitive. The
automaton can be minimised by merging equivalent states, as in the sketch below.
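A sketch of the table-filling procedure; the small example DFA is an assumed illustration in which states 1 and 2 turn out to be equivalent:

```python
from itertools import combinations

def distinguishable_pairs(states, sigma, delta, accepting):
    # Basis: one state accepting, the other not
    marked = {frozenset({p, q}) for p, q in combinations(states, 2)
              if (p in accepting) != (q in accepting)}
    changed = True
    while changed:   # Induction: some symbol leads to an already-marked pair
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset({p, q})
            if pair in marked:
                continue
            for a in sigma:
                successors = frozenset({delta[(p, a)], delta[(q, a)]})
                if len(successors) == 2 and successors in marked:
                    marked.add(pair)
                    changed = True
                    break
    return marked

delta = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 1,
         (0, "b"): 0, (1, "b"): 0, (2, "b"): 0}
marked = distinguishable_pairs({0, 1, 2}, {"a", "b"}, delta, accepting={1, 2})
print(sorted(sorted(p) for p in marked))  # [[0, 1], [0, 2]] - so 1 and 2 are equivalent
```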
Context-Free Grammars
Context-Free Grammars (CFGs) are very easy to write parsers for, and many computer languages are
context free, such as DTDs for XML.
CFGs consist of a series of productions (rules which define what the language can do), non-terminal
symbols (which are defined by productions) and terminal symbols (which exist in the language we are
trying to define).
The context-free grammar for the language a^n b^n is S → aSb | λ, where our non-terminal symbol is S, the
terminals are a and b, and "S → aSb | λ" is our production.
To make a more formal definition of a context-free grammar, we could start with a grammar and then
restrict it to be context-free.
• P is a finite set of productions (or rules) of the form α → β, where α, β ∈ (V ∪ T)* and α
contains at least one non-terminal
By convention, we use capital letters for non-terminals and small letters for terminal symbols.
To tighten up the above definition to be a context-free grammar, α should consist of a single non-
terminal.
Parsing
There are two ways to show a particular string is in the language of a grammar. The first involves working
from the string towards S, e.g.,
• S → NP VP
• VP → VT NP
• NP → peter
• NP → mary
• VP → runs
• VT → likes
• VT → hates
So, to parse the string "peter hates mary" using the above CFG, the parsing process would look like:
• NP hates mary
• NP VT mary
• NP VT NP
• NP VP
• S
It is also possible to work from S towards the string, and this process is called deriving. For example, we
could have:
• start with S
• NP VP ⇒ peter VP
• peter VP ⇒ peter VT NP
• peter VT NP ⇒ peter hates NP
• peter hates NP ⇒ peter hates mary
Formally, we can express derivations as, given a string of symbols αβγ, where α, γ are strings from (V ∪
T)* and β is a string from (V ∪ T)+ and a rule β → δ, we can derive αδγ replacing β with δ:
• we write αβγ ⇒ αδγ to indicate that αδγ is directly derived from αβγ
• we are generally interested in the question "Is there a sequence of derivations such that α ⇒ ...
⇒ δ?"
In our example, we could have written peter VP ⇒ ... ⇒ peter hates mary. We say that peter VP derives
peter hates mary, and this is written as peter VP ⇒* peter hates mary.
To reduce some (but not all) non-determinism from a derivation, we can consider two types of
derivations, left most and right most. In the left most derivation, the left most nonterminal is replaced,
and in a right most derivation, the right most nonterminal is replaced. A left most derivation is indicated
by: ⇒lm* and a right most one by ⇒rm*.
S ⇒* w ⇔ S ⇒lm* w, and S ⇒* w ⇔ S ⇒rm* w
Parse/Derivation Trees
These are graphical representations of derivations. For example, for the left most derivation given above,
the parse tree would look like:
Ambiguity
In natural language, ambiguity is part of the language, however in computer languages, ambiguity is to
be avoided.
Parsers
• Given a string as input, a parser returns a (set of) parse tree(s) that derive the input string.
• A grammar G generates a language L(G), where L(G) is the set of all terminal strings that can be
generated from G, i.e., L(G) = {α ∈ T* | S ⇒* α}
Not all context-free grammars are regular; however, some are. All regular languages can be expressed by
a CFG. A CFG cannot express all languages, however.
For a given CFG, a semiword is a string of terminals (maybe none) concatenated with exactly one
variable on the right, e.g., tttV.
To prove this, we have to theorise that for any given finite automaton there is a CFG that generates
exactly the language accepted by the FSA, i.e., all regular languages are context free. This is proved using
a construction algorithm - we start with an FSA and generate a CFG:
• The variables of the CFG will be all the names of the states in the FSA, although the start state is
renamed S.
• For every edge labelled a ∈ Σ ∪ {λ}, the transition a between two nodes (A and B) gives a
production A → aB (where B = A in the case of a loop)
• For every accepting state A, add the production A → λ (this construction is sketched below)
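A sketch of this construction; the example FSA (assumed) accepts a(a|b)*, and "" stands for λ:

```python
def fsa_to_cfg(transitions, accepting):
    """One production A -> aB per transition (state, symbol, next), plus A -> lambda
    for every accepting state A."""
    productions = [(src, sym + dst) for src, sym, dst in transitions]
    productions += [(state, "") for state in accepting]
    return productions

transitions = [("S", "a", "Q"), ("Q", "a", "Q"), ("Q", "b", "Q")]
for head, body in fsa_to_cfg(transitions, accepting={"Q"}):
    print(head, "->", body or "lambda")
# S -> aQ, Q -> aQ, Q -> bQ, Q -> lambda
```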
Now we can prove that this CFG generates exactly the language accepted by the original FSA by showing
that every word accepted by the FSA can be generated by the CFG and every word generated by the CFG
is accepted by the FSA (we know in general that this is not possible, but our CFG has a particular form).
If we consider some word, say abbaa, accepted by the FSA, then we can construct the path through the
FSA to an accepting state through a sequence of semi-paths. These semi-paths are formed from the
string read from the input followed by the name of the state to which the string takes us.
• ...
This corresponds to a derivation of the string using the productions already generated from the FSA.
To prove the second statement, note that every production of this CFG has the form N → tN or N → λ, thus
every working string found in a derivation from this CFG contains only one nonterminal, at the very end.
Thus all derivations from this CFG give a sequence of semiwords, starting with S and
corresponding to a growing semipath through the FSA. The word itself is only generated when the final
accepting nonterminal is turned into λ - this means the semipath ends in an accepting state, and
the word generated is accepted by the FSA.
We have a final theorem to complete our proof: if all the productions of a CFG are either of the form N
→ semiword or N → word (where the word may be λ), then the language generated is regular.
• Draw a state for each nonterminal and an additional state for the accept state
• For every production of the form Nx → wyNz, draw a directed arc from Nx to Nz
• If Nx = Nz then loop
• Any path in the transition graph from start to accept corresponds to a word in the language of
the transition graph and also to the sequence of productions from the CFG which generate the
same word. Conversely, every derivation of a word corresponds to a graph.
Regular Grammars
A grammar G = <V, T, S, P> is a regular grammar if all rules in P are of the form:
• A → xB
• A→x
where A and B are non-terminals and x ∈ T* (i.e., x is either λ or a sequence of terminal symbols).
Thus, we see that all regular languages can be generated by regular grammars, and all regular grammars
generate regular languages. But, it's still quite possible for a CFG which is not in the format of a regular
grammar to generate a regular language.
Ambiguity in CFGs
Ambiguity is particularly a problem if we want to compare parsed structures, e.g., structured files. A CFG
is ambiguous if there exists at least one string in the language accepted by the CFG for which there is
more than one parse tree.
The question we need to ask is whether or not we can develop an algorithm that removes ambiguity
from CFGs. There is no general algorithm that can tell us whether or not a CFG is ambiguous (this is
discussed further in TOC).
If L is a CFG for which there exists an unambiguous grammar, then L is said to be unambiguous. However,
if every grammar that generates L is ambiguous then the language is inherently ambiguous.
Are there techniques that might help us avoid ambiguity? We can look at the two causes of ambiguity:
• a group of identical operators can group either from the left or from the right. It does not matter
which, but we must pick one
• two different operators can bind with different strengths; we must fix the relative binding
strength (precedence) of each operator
A technique for removing ambiguity is introducing several new variables each representing those
expressions that can share a level of binding strength.
However, if an identical node appears twice in a total language tree, this does not necessarily mean the
language is ambiguous, as the parse trees could just be the difference between left most and right most
derivations.
Pushdown Automata
Pushdown stack automata are like NDFAs, but they also have a potentially infinite stack (last in, first out)
that can be read, pushed and popped. The automaton observes the stack symbol at the top of the stack.
It uses this and the input symbol to make a transition to another state, and replaces the top of the stack
by a string of stack symbols.
The automaton has accept states, but we will also consider the possibility of acceptance by empty stack.
A non-deterministic pushdown stack automaton (NPDA) is defined by P = (Q, Σ, Γ, δ, q0, z0, F).
• Q, Σ, q0 and F are defined as before: states, input symbols, initial state and accept states
• Γ is the stack alphabet - the finite set of symbols that may be pushed onto the stack
• z0 ∈ Γ is the stack start symbol. The NPDA stack consists of one instance of this symbol at the
outset
• δ is revised for NPDAs. As this is a non-deterministic automaton, the result is a set, so
we can define δ as a function to a set of pairs of states and stack strings. δ takes a state q ∈ Q, an input
symbol a ∈ Σ (or λ, the empty string) and a stack symbol z ∈ Γ, which is removed from the top of
the stack, and produces a set of pairs: a new state and a new string of stack symbols that is
pushed onto the stack. If the stack is just to be popped, then the new string will be empty (λ). Hence,
δ : Q × (Σ ∪ {λ}) × Γ → 2^(Q × Γ*) (note, this is restricted to finite subsets)
Instantaneous Descriptions
To compute with an NPDA, we need to describe sequences of configurations. In the case of an FSA, the state
was enough. In the case of an NPDA, we need stacks and states.
The configuration of an NPDA is called the instantaneous description (ID) and is represented by a triple
(q, w, γ), where q is the state, w is the remaining input and γ is the stack. As before, we assume the top
of the stack is the left hand end.
In DFAs and NDFAs, the δ notation was sufficient to describe sequences of transitions through which a
configuration could move, however for NPDAs, a different notation is required.
├P, or just ├ (when P is assumed), is used to connect two IDs. We also use the symbol ├* to signify
that there are zero or more transitions of the NPDA. The basis of this is that I ├* I for any ID I, and
inductively, I ├* J, if there exists some ID K such that I ├ K and K ├* J, that is, I ├* J if there exists a
sequence of IDs K1,...,Kn, and ∀i = 1, 2, ..., n - 1, Ki ├ Ki+1
• If a computation is legal, then the computation formed by adding the same string of stack
symbols below the stacks for each ID is also legal
• If a computation is legal and some tail of inputs is not consumed then removing this tail in each
case will not affect the legality
• If P = (Q, Σ, Γ, δ, q0, z0, F) is an NPDA and (q, x, α) ├*P (p, y, β), then for any strings w ∈ Σ*
and γ ∈ Γ* it is also true that (q, xw, αγ) ├*P (p, yw, βγ). This captures the first two principles
above, and can be proved by induction.
• If P = (Q, Σ, Γ, δ, q0, z0, F) is an NPDA and (q, xw, α) ├*P (p, yw, β), then it is also true that
(q, x, α) ├*P (p, y, β). This captures the third principle above
The language accepted by final state by an NPDA P = (Q, Σ, Γ, δ, q0, z0, F) is: L(P) = {w ∈ Σ* | (q0, w, z0) ├
*P (q, λ, α), q ∈ F, α ∈ Γ*}. Note that the final stack content is irrelevant to this definition
of acceptance.
For an NPDA P, the language accepted by empty stack is defined by N(P) = {w|(q0, w, z0) ├* (q, λ, λ)}, for
any state q ∈ Q. (note that the N here indicates null stack).
The class of languages that are L(P) for some NPDA P is the same as the class of languages that are N(P),
perhaps for a different P. This class is exactly the context-free languages. The proof of this is based on
constructing the relevant automata.
When discussing acceptance by empty stack, the final states (final component) of the NPDA are
irrelevant, so are sometimes omitted.
If L = N(PN) for some NPDA PN = (Q, Σ, Γ, δN, q0, z0), then there is an NPDA PF such that L = L(PF).
Diagrammatically, this looks like:
We let x0 be a new stack symbol, the initial stack symbol for PF and we construct a new start state,
p0 whose sole function is to push z0, the start symbol of PN onto the top of the stack and enter state q0,
the start state of PN. PF simulates PN until the stack of PN is empty. PF detects this because it detects x0 on
the top of the stack. The state pf is a new state in PF which is the accepting state. The new automaton
transfers to pf whenever it discovers that PN would have emptied its stack.
We can make a formal definition of this accepting state machine as: PF = (Q ∪ {p0, pf}, Σ, Γ ∪ {x0}, δF, p0, x0,
{pf}), where δF is defined by:
• δF(p0, λ, x0) = {(q0, z0x0)}. PF makes a spontaneous transition to the start state of PN pushing its
start symbol z0 onto the stack.
It can be shown that w ∈ L(PF) ⇔ w ∈ N(PN). This fact is later used as a proof about languages that an
NPDA will accept.
• δN(p0, λ, x0) = {(q0, z0x0)}. The start symbol z0 of PF is pushed onto the stack at the outset, and the
automaton then goes to q0, the start state of PF.
• (p, λ) ∈ δN(q, λ, y), ∀q ∈ F, y ∈ Γ ∪ {x0} - whenever PF accepts, PN can start emptying its stack
without consuming any more input.
• ∀y ∈ Γ ∪ {x0}, δN(p, λ, y) = {(p, λ)}. Once in p when PF has accepted, PN pops every symbol on its
stack until the stack is empty. No further input is consumed.
We have shown from our final-state to empty-stack theorem that if we let L be L(PF) for some NPDA
PF = (Q, Σ, Γ, δF, q0, z0, F), then there is an NPDA PN such that L = N(PN). We can then show that the
following classes of language are equivalent:
If we have a CFG G, the mechanism is to construct an NPDA that simulates the leftmost derivations of G.
At each stage of the derivation, we have a sentential form that describes the state of the derivation: xAα,
where x is a string of terminals. A is the first nonterminal and α is a string of terminals and nonterminals.
We can call Aα the tail of the sentential form, that is, the string that still needs processing. If we have a
sentential form of terminals only, then the tail is λ.
The proposed NPDA has one state and simulates the sequence of left sentential forms that the grammar
uses to derive w. The tail of xAα appears on the stack, with A on the top.
To figure out how the NPDA works, suppose that the NPDA is in an ID (q, y, Aα), representing the left
sentential form xAα. The string being parsed, w, has the form xy, so we are trying to match y against Aα. The
NPDA will guess non-deterministically one of the rules, say A → β, as appropriate for the next step in the
derivation. This guess causes A to be replaced by β on the stack, so (q, y, Aα) ├ (q, y, βα). However,
this does not necessarily lead to the next sentential form, as β may be prefixed by terminals. At this stage,
the terminals need to be removed. Each terminal on the top of the stack is checked against the next
input symbol to ensure that we have guessed right. If we have not guessed right, then this NPDA branch dies. If
we succeed, then we shall eventually reach the left sentential form w, i.e., the terminal string where all
the symbols of the stack have been expanded (if nonterminals) or matched against input (if terminals).
At this point, the stack is empty and we accept by empty stack.
We can make this more precise by supposing G = (V, T, S, R) is a CFG; the construction of NPDA P is as
follows: P = ({q}, T, V ∪ T, δ, q, S) (this NPDA accepts by empty stack), where δ is defined as:
• for each A ∈ V, δ(q, λ, A) = {(q, β) | A → β ∈ R}
• for each terminal a ∈ T, δ(q, a, a) = {(q, λ)}, matching the input symbol against the stack top (a sketch of this machine follows)
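A sketch simulating this one-state NPDA by exploring IDs (remaining input, stack) depth-first; the grammar S → aSb | λ for a^n b^n is the assumed example, with the stack held as a string whose head is the top (note that a left-recursive grammar would make this naive search loop):

```python
rules = {"S": ["aSb", ""]}   # upper-case symbols are nonterminals, "" is lambda

def accepts(string, stack="S", seen=None):
    seen = set() if seen is None else seen
    if (string, stack) in seen:       # don't re-explore an ID
        return False
    seen.add((string, stack))
    if not stack:                     # empty stack: accept iff all input consumed
        return not string
    top, rest = stack[0], stack[1:]
    if top.isupper():                 # guess a rule A -> beta: replace A by beta
        return any(accepts(string, beta + rest, seen) for beta in rules[top])
    # terminal on top of the stack: it must match the next input symbol
    return bool(string) and string[0] == top and accepts(string[1:], rest, seen)

print(accepts("aaabbb"))  # True
print(accepts("aab"))     # False
```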
To convert from an NPDA to a CFG, a similar approach is used, but in the other direction. Our problem is
that given an NPDA P, we need to find a CFG G whose language is the same language that P accepts by
empty stack. We can observe that:
• a fundamental event as far as an NPDA is concerned is the popping of a symbol from the stack
while consuming some input - this is a net process and may involve a sequence of steps, including
the pushing of other symbols and their later removal
• this process of popping the element may involve a sequence of state transitions, however there
is a starting state at which the process begins and an ending state where it finishes.
In the construction of the grammar below, we use the following notation: nonterminals are associated
with the net popping of a symbol, say X and the start and finish states, say p and q for the popping of X.
Such a symbol is in the form [pXq].
If we let P = (Q, Σ, Γ, δ, q0, z0, F) be an NDPA, then there is a CFG G such that L(G) = N(P). We can
construct G = (V, Σ, S, R) as follows:
• Productions (R): for all states p we have S → [q0z0p]. This generates all the strings that cause P to
pop z0 while going from q0 to p, thus for any such string w, (q0, w, z0) ├* (p, λ, λ)
• Suppose δ(q, a, X) contains the pair (r, Y1...Yn), where a ∈ Σ ∪ {λ}. Then for all lists of states r1...rn, G
includes the production [qXrn] → a[rY1r1][r1Y2r2]...[rn-1Ynrn]. This production says that one way to pop X
and go from q to state rn is to read a (which may be λ), then use some input to pop Y1 off the
stack while going from r to r1, then read some input that pops Y2 while going from r1 to r2, etc.
In order to prove that P will accept strings that satisfy this grammar, we have to prove that [qXp] ⇒* w ⇔
(q, w, X) ├* (p, λ, λ)
A DPDA provides no choice of move in any situation. Such choice might come from:
• more than one pair being available in δ(q, a, z) for some state, input symbol and stack symbol
• a λ-transition being available when a transition consuming an input symbol is also available
DPDAs accept by final state all the regular languages, but also some non-regular languages.
DPDAs do not accept some context-free languages. All languages accepted by DPDAs
are unambiguous; however, not all unambiguous languages are accepted by a DPDA.
Simplifying CFGs
Every context-free language can be generated by a CFG in which all productions are of the form A → BC
or A → a. This form is called the Chomsky Normal Form. To get to this form, we need to perform various
steps.
Firstly, we remove useless symbols. For a symbol to be useful, it must be both generating and reachable.
X is generating if X ⇒* w for some terminal string w. Every terminal is therefore generating. X is
reachable if there is a derivation S ⇒* αXβ for some α and β.
To compute generating symbols, we have the basis that every terminal in T is generating. The inductive
step is that if there is a production A → α and every symbol of α is known to be generating then A is
generating.
To compute reachable symbols, we have the basis of S being reachable and the inductive step of
supposing variable A is reachable. Then, for all productions with A as the head (LHS), all the symbols of
the bodies of those productions are also reachable.
For the next step, we need to eliminate nullable symbols. A is nullable if A ⇒* λ. If A is nullable, then
whenever A appears in a production body, A may or may not derive λ.
To compute nullable symbols, we have the base step that if A → λ is a production of G, then A is nullable.
Inductively, if there is a production B → C1C2...Ck where each Ci is nullable, then B is nullable (each Ci must
be a variable to be nullable). The procedure iterates until no new nullable symbols are found, as in the sketch below.
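A sketch of this computation; productions are (head, body) pairs with "" for λ, and the grammar is an assumed example:

```python
def nullable_symbols(productions):
    # Basis: A -> lambda makes A nullable
    nullable = {head for head, body in productions if body == ""}
    changed = True
    while changed:   # Induction: B -> C1...Ck with every Ci nullable
        changed = False
        for head, body in productions:
            if head not in nullable and body and all(c in nullable for c in body):
                nullable.add(head)
                changed = True
    return nullable

productions = [("S", "AB"), ("A", "a"), ("A", ""), ("B", "A")]
print(sorted(nullable_symbols(productions)))  # ['A', 'B', 'S']
```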
The final step is to remove unit productions: A → B, where A and B are variables. These productions
serve no use, so can be removed. However, we need to take care when identifying cycles. The procedure
for identifying unit pairs, A ⇒* B using only unit productions, has the base step of (A, A) is a unit pair, as
A ⇒* A. Then by induction, if (A, B) is a unit pair and B → C is a production where C is a variable, then (A,
C) is a unit pair.
To eliminate unit productions, if we're given a CFG G = (V, T, S, P), then construct G1 = (V, T, S, P1) by
finding all the unit pairs, and then for pair (A, B), add to P1 all the productions A → α where B → α is a
non-unit production of P.
Now that we have the three processes for converting a CFG to an equivalent minimised CFG, we have to
consider a safe order for performing these operations to ensure a minimised CFG. This is:
1. Remove λ productions
2. Remove unit productions
3. Remove useless symbols
To reduce a grammar to the Chomsky Normal Form, we have to perform the following procedure:
1. Remove λ productions, unit productions and useless symbols, as above. As a result, every
production has the form A → a, or has a body of length 2 or more.
2. For every terminal a that appears in a production body of length 2 or more, create a new variable
A and a production A → a. Replace all occurrences of a with A in production bodies of length 2 or
more. Now each body is either a single terminal or a string of at least 2 variables with no
terminals.
3. For bodies of length greater than 2, introduce new variables until we have all bodies down to
length 2.
The Chomsky Normal Form is helpful in decidability results. For instance, parse trees become
binary trees, so the length of the longest path can be calculated, giving a bound on execution time.
Chomsky proposed the following classification for grammars in terms of the languages they generate.

| Grammar Name | Restriction on α → β | Language Generated | Accepting Automata |
|---|---|---|---|
| Unrestricted (type 0) | α contains at least one non-terminal | Recursively enumerable | Turing machine (a general purpose computing machine with infinite memory and read/write operations on memory; transitions can take into account the contents of any memory location) |
| Context-sensitive (type 1) | as type 0, with β at least as long as α | Context-sensitive | Linear bounded automaton |
| Context-free (type 2) | α is a single non-terminal | Context-free | Non-deterministic pushdown automata |
| Regular (type 3) | α is a single non-terminal and β is a sequence of terminals (possibly zero) followed by at most one non-terminal | Regular | Finite state automata |