Coding Theory and Linguistics

Matilde Marcolli
CS101: Mathematical and Computational Linguistics

Winter 2015

CS101 Win2015: Linguistics Coding Theory


Error-correcting codes
• Alphabet: finite set A with #A = q ≥ 2.
• Code: subset $C \subset A^n$, length $n = n(C) \ge 1$.
• Code words: elements $x = (a_1, \dots, a_n) \in C$.
• Code language: $W_C = \cup_{m \ge 1} W_{C,m}$, words $w = x_1, \dots, x_m$; $x_i \in C$.
• ω-language: $\Lambda_C$, infinite words $w = x_1, \dots, x_m, \dots$; $x_i \in C$.
• Special case: $A = \mathbb{F}_q$, linear codes: $C \subset \mathbb{F}_q^n$ linear subspace
• in general: unstructured codes
• $k = k(C) := \log_q \# C$ and $[k] = [k(C)]$ integer part of $k(C)$

$$ q^{[k]} \le \# C = q^k < q^{[k]+1} $$



• Hamming distance: for $x = (a_i)$ and $y = (b_i)$ in $C$

$$ d((a_i), (b_i)) := \#\{ i \in \{1, \dots, n\} \mid a_i \neq b_i \} $$

• Minimal distance $d = d(C)$ of the code

$$ d(C) := \min\{ d(a, b) \mid a, b \in C,\ a \neq b \} $$

Code parameters
• R = k/n = transmission rate of the code
• δ = d/n = relative minimum distance of the code
Small R: fewer code words, easier decoding, but longer encoding
signal; small δ: too many code words close to received one, more
difficult decoding
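
As an illustration of the definitions above, the following minimal Python sketch computes $n$, $k$, $d$, $R$ and $\delta$ for a small code. The function names `hamming` and `code_parameters` are hypothetical helpers introduced here, and the sample code is the four-word binary code of length 3 that these slides use as an example.

```python
from itertools import combinations
from math import log

def hamming(x, y):
    """Number of positions where two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))

def code_parameters(code, q):
    """Return (n, k, d, R, delta) for a code given as a collection of tuples.

    Assumes at least two distinct code words, all of the same length.
    """
    words = sorted(set(code))
    n = len(words[0])
    k = log(len(words), q)  # k = log_q #C, not necessarily an integer
    d = min(hamming(x, y) for x, y in combinations(words, 2))
    return n, k, d, k / n, d / n

C = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
n, k, d, R, delta = code_parameters(C, q=2)
print(n, round(k, 6), d, round(R, 4), round(delta, 4))
```

The minimum is taken over all unordered pairs, so this brute-force approach is only practical for small codes.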



Language of a Code
• strings of code words $W_C = \cup_{m \ge 1} W_{C,m}$
• ω-language $\Lambda_C$ of code $C$: infinite sequences of code words
• $\Lambda_C$ is a fractal in the hypercube $[0, 1]^n$
• Hausdorff dimension $\dim_H(\Lambda_C) = R(C)$, the rate of the code
• minimum distance $d(C)$ is a threshold dimension: lower-dimensional slices of $\Lambda_C$ (in all directions parallel to the coordinate axes) are empty or singletons; in higher dimensions some sections have positive Hausdorff dimension



Example: unstructured $[3, 2, 2]_2$ code
$C = \{(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)\}$

M. Marcolli, C. Perez, Codes as fractals and noncommutative spaces, Mathematics in Computer Science, Vol.6 (2012) N.3, 199–215 (SURF 2010)
ω-language and complexity

Yu.I. Manin, M. Marcolli, Error-correcting codes and phase transitions, Mathematics in Computer Science, Vol.5 (2011) 133–170.

• Entropy of language $W_C$, generating function:

$$ s_C(m) = \# W_{C,m}, \qquad G_C(t) = \sum_m s_C(m)\, t^m $$

Entropy: $S_C = -\log_q \rho(G_C(t))$ with $\rho$ = radius of convergence
• $G_C(q^{-s}) = Z_C(s)$: the partition function is the generating function of the language structure functions
• Entropy of language is code rate $R = R(C)$



• complexity $K_{T_U}(w)$ of words in a language
• for infinite words in ω-language $\Lambda_C$ complexity

$$ \kappa(w) = \liminf_{w_n \to w} \frac{K(w_n)}{\ell(w_n)} $$

• Levin (semi)measure

$$ \kappa(w) = \liminf_{w_n \to w} \frac{-\log_q \mu_U(w_n)}{\ell(w_n)} $$

universal enumerable semi-measure $\mu_U$
• bounds: uniform Bernoulli measure $\mu$ on $\Lambda_C$

$$ \kappa(x) \le \lim \frac{-\log_q \mu(w)}{\ell(w)} = R(C) $$

achieved on full measure subset



The space of code parameters:
• Optimization problem: increase R and δ... how good are codes?
• $\mathrm{Codes}_q$ = set of all codes $C$ on an alphabet with $\# A = q$
• function $cp : \mathrm{Codes}_q \to [0, 1]^2 \cap \mathbb{Q}^2$ to code parameters, $cp : C \mapsto (R(C), \delta(C))$
• the function $C \mapsto (R(C), \delta(C))$ is a total recursive map
• Multiplicity of a code point $(R, \delta)$ is $\# cp^{-1}(R, \delta)$

M.A. Tsfasman, S.G. Vladut, Algebraic-geometric codes, Mathematics and its Applications (Soviet Series), Vol. 58, Kluwer Academic Publishers, 1991.



Spoiling operations on codes: $C$ an $[n, k, d]_q$ code
• $C_1 := C *_i f \subset A^{n+1}$

$$ (a_1, \dots, a_{n+1}) \in C_1 \iff (a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_{n+1}) \in C \ \text{and} \ a_i = f(a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_{n+1}) $$

$C_1$ is an $[n + 1, k, d]_q$ code when $f$ is a constant function
• $C_2 := C *_i \subset A^{n-1}$

$$ (a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_n) \in C_2 \iff \exists\, b \in A,\ (a_1, \dots, a_{i-1}, b, a_{i+1}, \dots, a_n) \in C $$

$C_2$ is an $[n - 1, k, d']_q$ code with $d - 1 \le d' \le d$ (for $d \ge 2$)
• $C_3 := C(a, i) \subset C \subset A^n$

$$ (a_1, \dots, a_n) \in C_3 \iff a_i = a $$

$C_3$ is an $[n - 1,\ k - 1 \le k' < k,\ d' \ge d]_q$ code
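
The three spoiling operations can be tried out on a small code. The sketch below is only an illustration: the function names (`spoil_insert`, `spoil_project`, `spoil_restrict`, `min_distance`) are hypothetical, positions are 1-based as in the slides, and the restriction $C(a, i)$ is returned with the constant coordinate dropped so that its length is $n - 1$.

```python
from itertools import combinations

def min_distance(code):
    """Minimum Hamming distance; assumes at least two code words."""
    return min(sum(a != b for a, b in zip(x, y))
               for x, y in combinations(sorted(code), 2))

def spoil_insert(code, i, f):
    """C *_i f: insert the letter f(word) as the new i-th coordinate."""
    return {w[:i - 1] + (f(w),) + w[i - 1:] for w in code}

def spoil_project(code, i):
    """C *_i: delete the i-th coordinate from every word."""
    return {w[:i - 1] + w[i:] for w in code}

def spoil_restrict(code, a, i):
    """C(a, i): keep words whose i-th letter is a, then drop that coordinate."""
    return {w[:i - 1] + w[i:] for w in code if w[i - 1] == a}

C = {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}
C1 = spoil_insert(C, 1, lambda w: 0)   # constant function f = 0
C2 = spoil_project(C, 3)
C3 = spoil_restrict(C, 0, 1)
for name, code in [("C", C), ("C1", C1), ("C2", C2), ("C3", C3)]:
    # columns: name, length n, number of words, minimum distance
    print(name, len(next(iter(code))), len(code), min_distance(code))
```

In this small example, deleting the third coordinate lowers the minimum distance from 2 to 1, so the parameters of a spoiled code are best read off from the computed values.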


Asymptotic bound in the space of code parameters
• $V_q \subset [0, 1]^2$: all code points $(R, \delta) = cp(C)$, $C \in \mathrm{Codes}_q$
• $U_q$: set of limit points of $V_q$
• Isolated code points: $V_q \setminus (V_q \cap U_q)$

• Fact: existence of Asymptotic Bound: $U_q$ consists of all points below the graph of a function

$$ U_q = \{ (R, \delta) \in [0, 1]^2 \mid R \le \alpha_q(\delta) \} $$

Yu.I. Manin, What is the maximum number of points on a curve over $\mathbb{F}_2$? J. Fac. Sci. Tokyo, IA, Vol. 28 (1981), 715–720.



Method for establishing asymptotic bound: controlling quadrangles

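[Figure: controlling quadrangles]

$R = \alpha_q(\delta)$ is a continuous decreasing function with $\alpha_q(0) = 1$ and $\alpha_q(\delta) = 0$ for $\delta \in [\frac{q-1}{q}, 1]$; it has an inverse function on $[0, (q-1)/q]$; $U_q$ is the union of all lower cones of points in $\Gamma_q = \{R = \alpha_q(\delta)\}$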



Code points and multiplicities
• Set of code points of infinite multiplicity: $U_q \cap V_q = \{ (R, \delta) \in [0, 1]^2 \cap \mathbb{Q}^2 \mid R \le \alpha_q(\delta) \}$, below the asymptotic bound
• Code points of finite multiplicity: all above the asymptotic bound, $V_q \setminus (U_q \cap V_q)$, and isolated (each has an open neighborhood containing $(R, \delta)$ as its unique code point)
• again based on using spoiling operations on codes
Yu.I. Manin, What is the maximum number of points on a
curve over F2 ? J. Fac. Sci. Tokyo, IA, Vol. 28 (1981),
715–720.
Yu.I. Manin, M. Marcolli, Error-correcting codes and phase
transitions, Mathematics in Computer Science, Vol.5 (2011)
133–170.



Other bounds in the space of code parameters
• singleton bound: $R + \delta \le 1$
• Gilbert–Varshamov line: $R = \frac{1}{2}(1 - H_q(\delta))$

$$ H_q(\delta) = \delta \log_q(q - 1) - \delta \log_q \delta - (1 - \delta) \log_q(1 - \delta) $$

q-ary entropy (for linear codes the GV line is $R = 1 - H_q(\delta)$)



Statistics of codes and the Gilbert–Varshamov bound
Known statistical approach to the GV bound: random codes
Shannon Random Code Ensemble (SRCE): ω-language with alphabet $A$; uniform Bernoulli measure on $\Lambda_A$; choose the code words of $C$ as independent random variables in this measure
Volume estimate:

$$ q^{(H_q(\delta) - o(1)) n} \le \mathrm{Vol}_q(n, d = n\delta) = \sum_{j=0}^{d} \binom{n}{j} (q - 1)^j \le q^{H_q(\delta) n} $$

This gives that the parameter δ of a code in the SRCE meets the GV bound with probability exponentially (in $n$) near 1: expectation

$$ E \sim \binom{q^k}{2} \mathrm{Vol}_q(n, d)\, q^{-n} \sim q^{n(H_q(\delta) - 1 + 2R) + o(n)} $$

... a priori no good statistical description of the asymptotic bound
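
The volume estimate can be checked numerically. The following sketch, with hypothetical helper names `H` and `ball_volume`, compares the exponent $\frac{1}{n} \log_q \mathrm{Vol}_q(n, d)$ with $H_q(\delta)$ for growing $n$; the choice $q = 2$, $\delta = 0.25$ is an arbitrary assumption made for illustration.

```python
from math import comb, log

def H(q, x):
    """q-ary entropy H_q(x) for 0 <= x <= 1, with the convention 0*log(0) = 0."""
    if x == 0:
        return 0.0
    if x == 1:
        return log(q - 1, q)
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)

def ball_volume(q, n, d):
    """Vol_q(n, d): number of words of A^n within Hamming distance d of a fixed word."""
    return sum(comb(n, j) * (q - 1) ** j for j in range(d + 1))

q, delta = 2, 0.25
for n in (8, 40, 200, 1000):
    d = n // 4                                    # d = n * delta
    exponent = log(ball_volume(q, n, d), q) / n   # (1/n) log_q Vol_q(n, d)
    print(n, round(exponent, 4), round(H(q, delta), 4))
```

The printed exponent stays below $H_q(\delta)$ and approaches it as $n$ grows, which is the content of the two-sided estimate.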


Estimates on the asymptotic bound

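• Plotkin bound:

$$ \alpha_q(\delta) = 0, \qquad \delta \ge \frac{q - 1}{q} $$

• singleton bound:

$$ \alpha_q(\delta) \le 1 - \delta $$

• Hamming bound:

$$ \alpha_q(\delta) \le 1 - H_q\!\left(\frac{\delta}{2}\right) $$

• Gilbert–Varshamov bound:

$$ \alpha_q(\delta) \ge 1 - H_q(\delta) $$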
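
Since no explicit expression for $\alpha_q(\delta)$ is known, the estimates above only give a window in which it lies. The sketch below combines them into a pair (lower, upper); the function names `H` and `alpha_bounds` are hypothetical and the code is just a direct transcription of the four listed bounds, not a computation of $\alpha_q$ itself.

```python
from math import log

def H(q, x):
    """q-ary entropy H_q(x) for 0 <= x <= 1, with the convention 0*log(0) = 0."""
    if x == 0:
        return 0.0
    if x == 1:
        return log(q - 1, q)
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)

def alpha_bounds(q, delta):
    """Return (lower, upper) estimates for alpha_q(delta) from the listed bounds."""
    if delta >= (q - 1) / q:
        return 0.0, 0.0                              # Plotkin: alpha_q vanishes here
    lower = max(0.0, 1 - H(q, delta))                # Gilbert-Varshamov
    upper = min(1 - delta, 1 - H(q, delta / 2))      # singleton and Hamming
    return lower, upper

for delta in (0.0, 0.1, 0.25, 0.4, 0.5):
    lo, up = alpha_bounds(2, delta)
    print(delta, round(lo, 4), round(up, 4))
```

At $\delta = 0$ both estimates equal 1, consistent with $\alpha_q(0) = 1$, and the window closes again at the Plotkin point $\delta = (q - 1)/q$.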



Computability question
• Note: only the asymptotic bound marks a significant change of
behavior of codes across the curve (isolated and finite
multiplicity/accumulation points and infinite multiplicity)
• in this sense it is very different from all the other bounds in the
space of code parameters
• .... but no explicit expression for the curve R = αq (δ)
• ... is the function R = αq (δ) computable?

Yu.I. Manin, A computability challenge: asymptotic bounds and isolated error-correcting codes, arXiv:1107.4246



The asymptotic bound and Kolmogorov complexity
• the asymptotic bound $R = \alpha_q(\delta)$ becomes computable given an
oracle that can list codes by increasing Kolmogorov complexity
• given such an oracle: iterative (algorithmic) procedure for
constructing the asymptotic bound
• ... it is at worst as “non-computable” as Kolmogorov complexity
• asymptotic bound can be realized as phase transition curve of a
statistical mechanical system based on Kolmogorov complexity

Yu.I. Manin, M. Marcolli, Kolmogorov complexity and the asymptotic bound for error-correcting codes, Journal of Differential Geometry, Vol.97 (2014) 91–108



Structural numbering for codes
• structural numbering of $X$: computable bijection $\nu_X : \mathbb{N} \to X$, principal homogeneous space over the group of total recursive permutations $\sigma : \mathbb{N} \to \mathbb{N}$
• construct an enumeration $\nu_X : \mathbb{N} \to X$ for $X = \mathrm{Codes}_q$, the space of q-ary codes
• $A = \{0, \dots, q - 1\}$ ordered, $A^n$ lexicographically; computable total order $\nu_X$:
(i) if $n_1 < n_2$, all $C \subset A^{n_1}$ come before all $C' \subset A^{n_2}$;
(ii) if $k_1 < k_2$, all $[n, k_1, d]_q$-codes come before $[n, k_2, d']_q$-codes;
(iii) for fixed $n$ and $q^k$: lexicographic order of code words, concatenated into a single word $w(C)$ (which determines the code); order all the $w(C)$ lexicographically
• also fix an enumeration $\nu_Y : \mathbb{N} \to Y$ of the rational points $Y = [0, 1]^2 \cap \mathbb{Q}^2$
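
The ordering rules (i) to (iii) translate fairly directly into a generator. The sketch below is one possible reading, with a hypothetical function name `codes_in_structural_order`; it represents a code as a tuple of words and includes one-word codes, which is an assumption made here for simplicity.

```python
from itertools import combinations, count, islice, product

def codes_in_structural_order(q):
    """Yield q-ary codes as tuples of words, following rules (i)-(iii)."""
    alphabet = range(q)
    for n in count(1):                              # (i) shorter lengths first
        words = list(product(alphabet, repeat=n))   # A^n in lexicographic order
        for size in range(1, len(words) + 1):       # (ii) smaller #C = q^k first
            # (iii) combinations() emits sorted tuples of words in lexicographic
            # order, which agrees with the order of the concatenated words w(C)
            # because all words have the same length n
            yield from combinations(words, size)

for index, code in enumerate(islice(codes_in_structural_order(2), 6), start=1):
    print(index, code)
```

The number of codes of length $n$ is $2^{q^n} - 1$, so the generator is meant for inspecting the beginning of the enumeration, not for exhaustive search.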



• Kolmogorov ordering: $\mathbf{K}_{T_U}(x)$ = position of $x$ when ordering by growing Kolmogorov complexity $K_{T_U}(x)$

$$ c_1\, K_{T_U}(x) \le \mathbf{K}_{T_U}(x) \le c_2\, K_{T_U}(x) $$

• Parameters map: $f : X \to Y$ with $X = \mathrm{Codes}_q$, $Y = [0, 1]^2 \cap \mathbb{Q}^2$ and $f = cp : C \mapsto (R(C), \delta(C))$ code parameters
• total recursive map $f = cp : \mathrm{Codes}_q \to [0, 1]^2 \cap \mathbb{Q}^2$
• total recursive function $f : X \to Y$ ⇒ $\forall y \in f(X)$, $\exists x \in X$, $y = f(x)$ and $\exists$ computable $c = c(f, \nu_X, \nu_Y) > 0$

$$ K_{T_U}(x) \le c \cdot \nu_Y^{-1}(y) $$

• meaning: increasing the Kolmogorov ordering of $x$ also increases the structural ordering of $f(x)$



Algorithmic construction of the asymptotic bound = by successive approximations, separate out the subsets of $f(X) \subset Y$ that have finite and infinite multiplicity
Multiplicities:
• take $F(x) = (f(x), n(x))$ with

$$ n(x) = \#\{ x' \mid \nu_X^{-1}(x') \le \nu_X^{-1}(x),\ f(x') = f(x) \} $$

total recursive function ⇒ $F(X) \subset Y \times \mathbb{Z}_+$ enumerable
• $X_m := \{ x \in X \mid n(x) = m \}$ and $Y_m := f(X_m) \subset Y$ enumerable
• multiplicities: $\mathrm{mult}(y) := \# f^{-1}(y)$

$$ Y_\infty \subset \cdots \subset f(X_{m+1}) \subset f(X_m) \subset \cdots \subset f(X_1) = f(X) $$

$$ Y_\infty = \cap_m f(X_m) \quad \text{and} \quad Y_{fin} = f(X) \setminus Y_\infty $$
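
The counting function $n(x)$ is easy to compute along any finite initial segment of an enumeration. The sketch below uses a hypothetical helper `with_multiplicity_index` and a toy map on integers standing in for $f = cp$; it is meant only to make the definition of $F(x) = (f(x), n(x))$ concrete.

```python
from collections import defaultdict

def with_multiplicity_index(xs, f):
    """For x in enumeration order, return triples (x, f(x), n(x)).

    n(x) counts the elements x' at or before x in the enumeration
    with f(x') = f(x).
    """
    seen = defaultdict(int)
    triples = []
    for x in xs:
        y = f(x)
        seen[y] += 1
        triples.append((x, y, seen[y]))
    return triples

# Toy stand-in: X = {0, ..., 9} in its natural order, f(x) = x mod 3
triples = with_multiplicity_index(range(10), lambda x: x % 3)
X2 = [x for x, y, n in triples if n == 2]           # X_m for m = 2
Y2 = sorted({y for x, y, n in triples if n == 2})   # Y_m = f(X_m)
print(triples)
print(X2, Y2)
```

Because $n(x)$ only looks backwards in the enumeration, it can be computed in a single pass, which is why $F$ is total recursive whenever $f$ is.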



Complexity counting:
• for $x \in X_1$ and $y = f(x)$: complexity

$$ K_{T_U}(x) \le c \cdot \nu_Y^{-1}(y) $$

• $y \in Y_\infty$ and $m \ge 1$: $\exists$ unique $x_m \in X$ with $y = f(x_m)$, $n(x_m) = m$, and $c = c(f, u, v, \nu_X, \nu_Y) > 0$

$$ K_{T_U}(x_m) \le c \cdot \nu_Y^{-1}(y)\, m \log(\nu_Y^{-1}(y)\, m) $$

• both of these complexity estimates follow from general formulae for Kolmogorov complexity under composition of total recursive functions
• again the meaning is that increasing the Kolmogorov ordering of $x_m$ also increases the structural ordering of $Y$ and the multiplicities $m$



Oracle mediated recursive construction of Y∞ and Yfin

• Choose sequence $(N_m, m)$, $m \ge 1$, $N_{m+1} > N_m$
• Step 1: $A_1$ = list of $y \in f(X)$ with $\nu_Y^{-1}(y) \le N_1$; $B_1 = \emptyset$
• Step m + 1: Given $A_m$ and $B_m$, list $y \in f(X)$ with $\nu_Y^{-1}(y) \le N_{m+1}$; $A_{m+1}$ = elements in this list for which $\exists\, x \in X$, $y = f(x)$, $n(x) = m + 1$; $B_{m+1}$ = remaining elements in the list
• Note: at this step invoke the oracle: produce the list of $x \in X$ with explicitly bounded complexity

$$ K_{T_U}(x_m) \le c \cdot \nu_Y^{-1}(y)\, m \log(\nu_Y^{-1}(y)\, m) $$

to ensure that this $x$ with $n(x) = m + 1$ appears in this list (if it exists)



• obtain $A_m \cup B_m \subset A_{m+1} \cup B_{m+1}$; the union is all of $f(X)$
• $B_m \subset B_{m+1}$ and $Y_{fin} = \cup_m B_m$
• $Y_\infty = \cup_{m \ge 1} (\cap_{n \ge 0} A_{m+n})$
• from $A_m$ to $A_{m+1}$: first add all new $y$ with $N_m < \nu_Y^{-1}(y) \le N_{m+1}$, then subtract those that have no more elements in the fiber $f^{-1}(y)$: these will be in $B_{m+1}$
• $B_m$ is the m-th step approximation of the set of isolated code points
• $A_m$ successively approximates the region of code points below the asymptotic bound
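
The bookkeeping of the sets $A_m$ and $B_m$ can be illustrated on a finite toy model. In the sketch below the oracle is replaced by complete knowledge of a finite list of values $f(x)$, the rank $\nu_Y^{-1}(y)$ is taken to be the order of first appearance, and the function name `approximate` is hypothetical; none of this is the actual oracle-mediated procedure, only its combinatorial skeleton. A value whose multiplicity exceeds the number of steps plays the role of a point of infinite multiplicity.

```python
from collections import Counter

def approximate(points, cutoffs):
    """Return the list of pairs (A_m, B_m) for m = 1, 2, ...

    points:  finite list of values f(x), in enumeration order of x
             (toy stand-in for the image f(X))
    cutoffs: increasing sequence N_1 < N_2 < ... bounding the rank of y
    """
    rank = {}
    for y in points:
        rank.setdefault(y, len(rank) + 1)   # toy rank: order of first appearance
    mult = Counter(points)                  # full fiber sizes, known in the toy model
    steps = []
    for m, N_m in enumerate(cutoffs, start=1):
        listed = {y for y in rank if rank[y] <= N_m}
        A = {y for y in listed if mult[y] >= m}   # some x with f(x) = y, n(x) = m
        B = listed - A                            # remaining elements of the list
        steps.append((A, B))
    return steps

points = ["p", "q", "p", "r", "p", "q", "p", "p"]
steps = approximate(points, cutoffs=[1, 2, 3, 4])
for m, (A, B) in enumerate(steps, start=1):
    print(m, sorted(A), sorted(B))
```

In this run the value "p" (multiplicity 5) stays in every $A_m$, while "q" and "r" move into $B_m$ once $m$ exceeds their multiplicities, and the sets $B_m$ are nested as in the list above.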



Partition function for code complexity
$$ Z(X, \beta) = \sum_{x \in X} K_{T_U}(x)^{-\beta} $$

weights elements by inverse complexity, with β = inverse temperature, thermodynamic parameter
• variant with prefix-free complexity $Z_P(X, \beta) = \sum_{x \in X} KP(x)^{-\beta}$
• prefix-free complexity: intrinsic characterization by Levin in terms of maximality for all probabilities enumerable from below, $p : X \to \mathbb{R}_+ \cup \{\infty\}$

$$ \{ (r, x) \mid r < p(x) \} \subset \mathbb{Q} \times X \quad \text{enumerable} $$



Convergence properties
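• Kolmogorov complexity $K$ and Kolmogorov ordering $\mathbf{K}$

$$ c_1\, K(x) \le \mathbf{K}(x) \le c_2\, K(x) $$

• convergence of $Z(X, \beta)$ controlled by the series

$$ \sum_{x \in X} \mathbf{K}_u(x)^{-\beta} = \sum_{n \ge 1} n^{-\beta} = \zeta(\beta) $$

• Partition function $Z(X, \beta)$ converges for $\beta > 1$; phase transition at the pole $\beta = 1$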
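
The change of behavior at $\beta = 1$ can be seen already in the comparison series $\sum_{n \ge 1} n^{-\beta}$. The short sketch below, with a hypothetical helper `partial_sum`, prints partial sums for $\beta = 2$ (which settle near $\pi^2/6$) and for $\beta = 1$ (which keep growing like $\log N$).

```python
from math import log, pi

def partial_sum(beta, terms):
    """Partial sum of the series sum_{n >= 1} n^(-beta), up to n = terms."""
    return sum(n ** -beta for n in range(1, terms + 1))

for terms in (10**2, 10**4, 10**6):
    convergent = partial_sum(2.0, terms)   # beta > 1: approaches zeta(2) = pi^2 / 6
    divergent = partial_sum(1.0, terms)    # beta = 1: harmonic series, grows like log(terms)
    print(terms, round(convergent, 6), round(divergent, 3), round(log(terms), 3))
print(round(pi ** 2 / 6, 6))
```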



Asymptotic bound as a phase transition
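• $\delta = \beta_q(R)$ inverse of $\alpha_q(\delta)$ on $R \in [0, 1 - 1/q]$
• Fix $R \in \mathbb{Q} \cap (0, 1)$ and $\Delta \in \mathbb{Q} \cap (0, 1)$

$$ Z(R, \Delta; \beta) = \sum_{C :\ R(C) = R;\ 1 - \Delta \le \delta(C) \le 1} K_u(C)^{-\beta + \delta(C) - 1} $$

• Phase transition at the asymptotic bound:
• $1 - \Delta > \beta_q(R)$: partition function $Z(R, \Delta; \beta)$ real analytic in β
• $1 - \Delta < \beta_q(R)$: partition function $Z(R, \Delta; \beta)$ real analytic for $\beta > \beta_q(R)$ and divergence for $\beta \to \beta_q(R)^+$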



Application to Linguistics: Syntactic Parameters and Coding

M. Marcolli, Principles and Parameters: a coding theory perspective, arXiv:1407.7169

• idea: assign a (binary or ternary) code to a family of languages and use the position of the code parameters with respect to the asymptotic bound to test relatedness
• $N$ = number of syntactic parameters $\Pi = (\Pi_\ell)_{\ell=1}^N$, each $\Pi_\ell$ with values in $\mathbb{F}_2 = \{0, 1\}$ (or $\mathbb{F}_3 = \{-1, 0, +1\}$ if including parameters that are not set in certain languages)
• $\mathcal{F} = \{L_k\}_{k=1}^m$ a set of natural languages (language “family”)
• Code $C = C(\mathcal{F})$ in $\mathbb{F}^N$ ($\mathbb{F}_2^N$ or $\mathbb{F}_3^N$) with $m$ code words $w_k = \Pi(L_k)$, the string of syntactic parameters for the language $L_k$



Interpretation of Code Parameters
• $R = R(C)$ measures the ratio between the logarithmic size of the number of languages in $\mathcal{F}$ and the total number of parameters: how $\mathcal{F}$ is distributed in the ambient $\mathbb{F}^N$
• $\delta = \delta(C)$ is the minimum, over all pairs of languages $L_i, L_j$ in $\mathcal{F}$, of the relative Hamming distance

$$ \delta(C(\mathcal{F})) = \min_{L_i \neq L_j \in \mathcal{F}} \delta_H(L_i, L_j) $$

$$ \delta_H(L_i, L_j) = \frac{1}{N} \sum_{\ell=1}^{N} |\Pi_\ell(L_i) - \Pi_\ell(L_j)| $$

• code parameter δ used in the Parameter Comparison Method for reconstruction of phylogenetic trees



Interpretation of Spoiling Operations

• first spoiling operation: effect of including one syntactic parameter in the list which is dependent on the other parameters
• second spoiling operation: forgetting one of the syntactic
parameters
• third spoiling operation: forming subfamilies by considering
languages that have a common value of one of the parameters



Parameters from Modularized Global Parameterization Method
G. Longobardi, Methods in parametric linguistics and cognitive
history, Linguistic Variation Yearbook, Vol.3 (2003) 101–138
G. Longobardi, C. Guardiano, Evidence for syntax as a signal
of historical relatedness, Lingua 119 (2009) 1679–1706.

• Determiner Phrase Module:
- syntactic parameters dealing with person, number, gender (1–6)
- parameters of definiteness (7–16)
- parameters of countability (17–24)
- genitive structure (25–31)
- adjectival and relative modification (32–41)
- position and movement of the head noun (42–50)
- demonstratives and other determiners (51–55 and 60–63)
- possessive pronouns (56–59)



Simple Example:

• group of three languages $\mathcal{F} = \{\ell_1, \ell_2, \ell_3\}$: Italian, Spanish, French, using the first group of 6 parameters
• code $C = C(\mathcal{F})$

        Π_1  Π_2  Π_3  Π_4  Π_5  Π_6
  ℓ_1    1    1    1    0    1    1
  ℓ_2    1    1    1    1    1    1
  ℓ_3    1    1    1    0    1    0

• code parameters: ($R = \log_2(3)/6 = 0.2642$, $\delta = 1/6$)
• code parameters satisfy $R < 1 - H_2(\delta)$: below the Gilbert–Varshamov curve
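
The numbers in this example can be reproduced in a few lines. The sketch below uses hypothetical names (`params`, `relative_hamming`, `H2`) and keys "l1", "l2", "l3" for the three rows of the table; for binary parameters the sum of $|\Pi_\ell(L_i) - \Pi_\ell(L_j)|$ is just the number of differing positions, which is what the helper counts.

```python
from itertools import combinations
from math import log2

# Rows of the table above: values of the first 6 parameters for each language
params = {
    "l1": (1, 1, 1, 0, 1, 1),
    "l2": (1, 1, 1, 1, 1, 1),
    "l3": (1, 1, 1, 0, 1, 0),
}

def relative_hamming(u, v):
    """Fraction of parameters on which two languages differ (binary case)."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def H2(x):
    """Binary entropy, with the convention 0*log(0) = 0."""
    return 0.0 if x in (0, 1) else -x * log2(x) - (1 - x) * log2(1 - x)

N = 6
R = log2(len(params)) / N
delta = min(relative_hamming(u, v) for u, v in combinations(params.values(), 2))
print(round(R, 4), round(delta, 4), R < 1 - H2(delta))
```

The same helper also gives the pairwise distances, so it can be used to check whether some pair of languages reaches a relative distance of 1/2.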



Spoiling operations in this example:
• first spoiling operation: the first two parameters have the same value 1 in all three languages, so $C = C' *_1 f_1 = (C'' *_2 f_2) *_1 f_1$ with $f_1$ and $f_2$ constant equal to 1 and $C'' \subset \mathbb{F}_2^4$ the code without the first two letters
• second spoiling operation: conversely, $C'' = C' *_2$ and $C' = C *_1$
• third spoiling operation: $C(0, 4) = \{\ell_1, \ell_3\}$ and $C(1, 6) = \{\ell_1, \ell_2\}$



What if languages are not in the same historical family?
Example: F = {L1 , L2 , L3 }: Arabic, Wolof, Basque
• excluding parameters that are not set, or are entailed by other
parameters, for these languages: left with 25 parameters from
original list (number 1–5, 7, 10, 20–21, 25, 27–29, 31–32, 34, 37,
42, 50–53, 55–57)
• code C = C (F)

• code parameters: $\delta = 0.52$ and $R > 0$ violate the Plotkin bound ⇒ isolated code above the asymptotic bound



Asymptotic bound and language relatedness

• For binary syntactic parameters: a code $C = C(\mathcal{F})$ violates the Plotkin bound if every pair $L_i \neq L_j$ of languages in $\mathcal{F}$ has $\delta_H(L_i, L_j) \ge 1/2$
• Li and Lj differ in at least half of the parameters: it would not
happen in a group of historically related languages
• but what about codes above the asymptotic bound that do not
violate the Plotkin bound?
• Expect: C = C (F) above the asymptotic bound
⇒ F not a historical language family
(quantitative test of historical relatedness)



Why the asymptotic bound?
• Why look at position with respect to asymptotic bound as a test
of historical relatedness? because it is the only true “bound” in the
space of code parameters across which behavior truly changes
• codes below the asymptotic bound are easily deformable
(as long as number of syntactic parameters is large)
• if think of language evolution as a process of parameter change,
expect languages that have evolved in the same family to
determine codes in this zone of the space of code parameters
• codes C = C (F) above the asymptotic bound should be a clear
sign that list of languages in F do not belong to same historical
family
• though there can be codes C = C (F) below the asymptotic
bound that also don’t come from historically related languages:
converse implication does not hold
