Algebraic Approaches For Coded Caching and Distributed Computing
2020
Recommended Citation
Tang, Li, "Algebraic approaches for coded caching and distributed computing" (2020). Graduate Theses
and Dissertations. 17873.
https://lib.dr.iastate.edu/etd/17873
Algebraic approaches for coded caching and distributed computing
by
Li Tang
DOCTOR OF PHILOSOPHY
The student author, whose presentation of the scholarship herein was approved by the program of
study committee, is solely responsible for the content of this dissertation. The Graduate College
will ensure this dissertation is globally accessible and will not permit alterations after a degree is
conferred.
Ames, Iowa
2020
DEDICATION
I would like to dedicate this thesis to my mother Dan Liu and father Xiaobo Tang. Without
their support I would not have been able to complete this work.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Coded Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Coded Caching for Networks with the Resolvability Property . . . . . . . . . 3
1.1.2 Coded Caching Schemes with Reduced Subpacketization from Linear Block
Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Distributed computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Erasure coding for distributed matrix multiplication for matrices with bounded
entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Numerically stable coded matrix computations via circulant and rotation
matrix embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
LIST OF TABLES
Page
Table 2.1 Comparison of three schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Table 3.1 A summary of the different constructions of CCP matrices in Section 3.3 . . 57
Table 3.2 List of k values for Example 16. The values of n0 , α and z are obtained by
following Algorithm 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Table 4.1 Effect of bound (L) on the decoding error . . . . . . . . . . . . . . . . . . . 69
Table 5.1 Comparison for matrix-vector case with n = 31, A has size 28000 × 19720
and x has length 28000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Table 5.2 Comparison for A^T B matrix-matrix multiplication case with n = 31, kA = 4, kB = 7. A has size 8000 × 14000, B has size 8400 × 14000. . . . . . . . . 93
Table 5.3 Comparison for matrix-matrix A^T B multiplication case with n = 17, uA = 2, uB = 2, p = 2. A is of size 4000 × 16000, B is of size 4000 × 16000. . . . . . . 95
Table 5.4 Performance of matrix inversion over a large prime order field in Python 3.7. The table shows the computation time for inverting an ℓ × ℓ matrix G over a finite field of order p. Let Ĝ^{-1} denote the inverse obtained by applying
LIST OF FIGURES
Page
Figure 1.1 Model of coded caching in Maddah-Ali and Niesen (2014b). . . . . . . . . . 2
Figure 1.2 Caching strategy for N = 2 files and K = 2 users with cache size M = 1
with all four possible user requests. Each file is split into 2 subfiles. The
schemes achieve rate R = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Figure 2.1 The figure shows a (4, 2) combination network. It also shows the cache placement when M = 2, N = 6. Here, Z1 = ∪_{n=1}^{6} {W^1_{n,1}, W^2_{n,1}}, Z2 = ∪_{n=1}^{6} {W^1_{n,2}, W^2_{n,2}} and Z3 = ∪_{n=1}^{6} {W^1_{n,3}, W^2_{n,3}}. It can be observed that each relay node sees the same caching pattern in the users that it is connected to, i.e., the users connected to each Γi together have Z1, Z2 and Z3 represented in their caches. . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 2.2 Performance comparison of the different schemes for a (6, 2) combination network with K = 15, K̃ = 5 and N = 50. . . . . . . . . . . . . . . . . . . . . 19
Figure 3.1 Recovery set bipartite graph . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 3.2 A comparison of rate and subpacketization level vs. M/N for a system with
K = 64 users. The left y-axis shows the rate and the right y-axis shows
the logarithm of the subpacketization level. The green and the blue curves
correspond to two of our proposed constructions. Note that our schemes
allow for multiple orders of magnitude reduction in subpacketization level
at the expense of a small increase in coded caching rate. . . . . . 50
Figure 3.3 The plot shows the gain in the scaling exponent obtained using our techniques for different values of M/N = 1/q. Each curve corresponds to a choice
of η = k/n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 4.1 Comparison of total computation latency by simulating up to 8 stragglers . 67
Figure 5.1 Consider a matrix-vector A^T x multiplication system with n = 31, τ = 29. A has size 28000 × 19720 and x has length 28000. . . . . . . . . . . . . . . . . . . 92
Figure 5.2 Consider a matrix-matrix A^T B multiplication system with n = 31, kA = 4, kB = 7. A is of size 8000 × 14000, B is of size 8400 × 14000. . . . . . . . . . . 93
Figure 5.3 Consider a matrix-matrix A^T B multiplication system with n = 18, uA = 2, uB = 2, p = 2. A is of size 4000 × 16000, B is of size 4000 × 16000. . . . . . . . 96
ACKNOWLEDGMENTS
I would like to take this opportunity to express my thanks to those who helped me with various
aspects of conducting research and the writing of this thesis. First and foremost, Dr. Aditya
Ramamoorthy for his guidance, patience and support throughout this research and the writing of
this thesis. He always encouraged me when I hit rock bottom in my research and life.
I would also like to thank my committee members for their efforts and contributions to this
work: Dr. Baskar Ganapathysubramanian, Dr. Chinmay Hegde, Dr. Zhengdao Wang and Dr.
ABSTRACT
This dissertation examines the power of algebraic methods in two areas of modern interest:
caching for large scale content distribution and straggler mitigation within distributed computation.
Caching is a popular technique for facilitating large scale content delivery over the Internet.
Traditionally, caching operates by storing popular content closer to the end users. Recent work
within the domain of information theory demonstrates that allowing coding in the cache and coded
transmission from the server (referred to as coded caching) to the end users can allow for significant
reductions in the number of bits transmitted from the server to the end users. The first part of this dissertation examines coded caching.
The original formulation of the coded caching problem assumes that the server and the end
users are connected via a single shared link. In Chapter 2, we consider a more general topology
where there is a layer of relay nodes between the server and the users. We propose novel schemes
for a class of such networks that satisfy a so-called resolvability property and demonstrate that
the performance of our scheme is strictly better than previously proposed schemes. Moreover, the
original coded caching scheme requires that each file hosted in the server be partitioned into a large
number (i.e., the subpacketization level) of non-overlapping subfiles. From a practical perspective,
this is problematic as it means that prior schemes are only applicable when the size of the files is
extremely large. In Chapter 3, we propose a novel coded caching scheme that enjoys a significantly
lower subpacketization level than prior schemes, while only suffering a marginal increase in the
transmission rate. We demonstrate that several schemes with subpacketization levels that are exponentially smaller than prior schemes can be obtained.
The second half of this dissertation deals with large scale distributed matrix computations. Such computations are a key primitive in several applications, including the learning of neural networks. It is well recognized that the computation times on distributed clusters
are often dominated by the slowest workers (called stragglers). Recently, techniques from coding
theory have found applications in straggler mitigation in the specific context of matrix-vector and matrix-matrix multiplication.
In Chapter 4, we consider matrix multiplication under the assumption that the absolute values
of the matrix entries are sufficiently small. Under this condition, we present a method with a
significantly smaller recovery threshold than prior work. Moreover, the prior work suffers from
serious numerical issues owing to the condition number of the corresponding real Vandermonde-
structured recovery matrices; this condition number grows exponentially in the number of workers.
In Chapter 5, we present a novel approach that leverages the properties of circulant permutation
matrices and rotation matrices for coded matrix computation. In addition to having an optimal
recovery threshold, we demonstrate an upper bound on the worst case condition number of our recovery matrices that grows only polynomially in the number of workers.
CHAPTER 1. INTRODUCTION
Caching and distributed computing are two important technologies in the Big Data era that
are widely used in a variety of commercial settings. Broadly speaking, both of these can be considered as applications of network coding Ahlswede et al. (2000); Li et al. (2003). Network coding, introduced in Ahlswede et al. (2000), is a generalization of routing in which intermediate nodes in a network combine information as it travels through the network. Several problems within network
a network combine information as it travels through a network. Several problems within network
coding have been investigated over the years and various types of network connections have been
considered within this domain. These include the original multicast Ho et al. (2006), multiple
unicast Dougherty et al. (2005) and Huang et al. (2011), Huang and Ramamoorthy (2013, 2014) and
function computation Ramamoorthy and Langberg (2013); Langberg and Ramamoorthy (2009).
1.1 Coded Caching
Caching is a popular technique for facilitating content delivery over the Internet. It exploits
local cache memory that is often available at the end users for reducing transmission rates from
the central server. In particular, when the users request files from a central server, the system
first attempts to satisfy the user demands in part from the local content. Thus, the overall rate
of transmission from the server is reduced which in turn reduces overall network congestion. The
work of Maddah-Ali and Niesen (2014b) demonstrated that huge rate savings are possible when
coding in the caches and coded transmissions from the server to the users are considered. This approach is referred to as coded caching.
In Maddah-Ali and Niesen (2014b), the scenario considered was as follows. There is a server
that contains N files, a collection of K users that are connected to the central server by a single
[Figure 1.1: Model of coded caching. A server is connected over a shared link of rate R to users U1, U2, . . . , UK−1, UK, each with a cache of size M.]
shared link. Each user also has a local cache of size M . The focus is on reducing the rate of
transmission on the shared link. There are two distinct phases in the coded caching setting.
• Placement phase: In the placement phase, the caches of the users are populated. This phase should not depend on the actual user requests, which are assumed to be arbitrary. The placement phase is typically executed during off-peak periods.
• Delivery phase: In the delivery phase, the server sends coded signals over the shared link such that each user's demand is satisfied. The delivery phase is executed during peak traffic periods.
Since the original work of Maddah-Ali and Niesen (2014b), there have been several aspects of coded
caching that have been investigated. Maddah-Ali and Niesen (2014a) considered decentralized coded caching, where the placement phase is driven by the users, who randomly populate their
caches. Information theoretic lower bounds on the transmission rate were considered in Ghasemi
and Ramamoorthy (2016), Ghasemi and Ramamoorthy (2017c), Yu et al. (2018). Synchronization
issues in this problem setting were investigated in Ghasemi and Ramamoorthy (2017a), Ghasemi
and Ramamoorthy (2017b). More recently, techniques inspired by coded caching have been employed for speeding up distributed computing Li et al. (2017); Konstantinidis and Ramamoorthy.
We explain the idea of coded caching with the following example from Maddah-Ali and Niesen (2014b).
Example 1. Consider the case where K = N = 2 and M = 1, so that there are two files A and B in the server and two users, each with a cache of size M = 1 file. We describe the coded caching scheme as follows. In the placement phase, file A is split into two non-overlapping subfiles A = {A1, A2}. Similarly, B is split into B = {B1, B2}. User 1 caches A1 and B1, so its cache occupies 1 file. User 2 caches A2 and B2, so its cache also occupies 1 file. In the delivery phase, suppose for example that user 1 requests file A and user 2 requests file B. User 1 needs A2 and user 2 needs B1 from the server. The server can certainly transmit A2 and B1 separately over the shared link, for a transmission rate of 1 file; we call this the uncoded transmission rate. Coded caching introduces another transmission strategy: the server can simply transmit A2 ⊕ B1, where ⊕ denotes bitwise XOR. User 1 already has B1, thus it can recover A2 from A2 ⊕ B1. Similarly, user 2 can recover B1 since it already has A2. Therefore, the single transmission A2 ⊕ B1 guarantees that both users recover their requests, and the transmission rate is half a file; we call this the coded transmission rate. In this example, the coded transmission rate is only half of the uncoded transmission rate. Figure 1.2 shows that the coded transmission rate is always the equivalent of half a file for all four possible user requests.
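The delivery step in Example 1 can be sketched in a few lines of Python (an illustrative toy, not part of the original text; the file contents and variable names are invented):

```python
# Hypothetical sketch of Example 1 (K = N = 2, M = 1): files as byte
# strings, XOR-coded delivery over the shared link.
def xor(a: bytes, b: bytes) -> bytes:
    # bitwise XOR of two equal-length byte strings
    return bytes(x ^ y for x, y in zip(a, b))

# Each file is split into two equal non-overlapping subfiles.
A1, A2 = b"AAAA", b"aaaa"   # file A = (A1, A2)
B1, B2 = b"BBBB", b"bbbb"   # file B = (B1, B2)

# Placement: user 1 caches (A1, B1); user 2 caches (A2, B2).
cache1, cache2 = {"A1": A1, "B1": B1}, {"A2": A2, "B2": B2}

# Delivery for the demands (user 1 wants A, user 2 wants B):
# the server broadcasts the single coded subfile A2 XOR B1.
signal = xor(A2, B1)

# Each user cancels its cached subfile to recover the one it is missing.
user1_A2 = xor(signal, cache1["B1"])   # user 1 recovers A2
user2_B1 = xor(signal, cache2["A2"])   # user 2 recovers B1
assert user1_A2 == A2 and user2_B1 == B1
```

One half-file broadcast thus serves both demands, matching the rate R = 0.5 discussed above.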
1.1.1 Coded Caching for Networks with the Resolvability Property
Maddah-Ali and Niesen (2014b) demonstrate that by carefully designing the cache contents of the users, coded transmissions from the central server can significantly reduce the transmission rate when the server and the end users are connected via a single shared link. In the first part of this dissertation, we consider a more general topology where there is a layer of relay nodes between the server and the users. We demonstrate that our proposed scheme outperforms two previously proposed schemes in Ji et al. (2015a). This work has appeared in Tang and Ramamoorthy (2016a) and is discussed in Chapter 2.
Figure 1.2 Caching strategy for N = 2 files and K = 2 users with cache size M = 1 with
all four possible user requests. Each file is split into 2 subfiles. The schemes
achieve rate R = 0.5.
1.1.2 Coded Caching Schemes with Reduced Subpacketization from Linear Block
Codes
Maddah-Ali and Niesen (2014b) proposed a coded caching scheme and showed that compared
with conventional caching, coded caching can achieve a much lower transmission rate. However,
in the placement phase of Maddah-Ali's scheme, each file is split into a very large number of non-overlapping subfiles of equal size; this number is called the subpacketization level of the scheme. This means that Maddah-Ali's scheme is applicable only in the regime where the underlying file sizes are very large.
In the second part of this dissertation, we propose coded caching schemes based on combinatorial
structures called resolvable designs. These structures can be obtained in a natural manner from
linear block codes whose generator matrices possess certain rank properties. We demonstrate that
several schemes with subpacketization levels that are exponentially smaller than the basic scheme
can be obtained. The subpacketization level of our scheme is exponentially lower than that of memory-sharing within the scheme of Maddah-Ali and Niesen (2014b). This work has appeared in Tang and Ramamoorthy.
1.2 Distributed computing
The current Big Data era routinely requires the processing of large scale data on massive
distributed computing clusters. In these applications, data sets are often so large that they cannot
be housed in the memory and/or the disk of any one computer. Thus, the data and the processing are
typically distributed across multiple nodes. Distributed computation is thus a necessity rather than
a luxury. The widespread usage of such clusters presents several opportunities and advantages over
traditional computing paradigms. However, it also presents newer challenges where coding-theoretic
ideas have recently had a significant impact. Large scale clusters (which can be heterogeneous in
nature) suffer from the problem of stragglers which refer to slow or failed worker nodes in the
system. Thus, the overall speed of a computation is typically dominated by the slowest node in the cluster.
The conventional approach for tackling stragglers in distributed computation has been to run
multiple copies of tasks on various machines, with the hope that at least one copy finishes on time.
However, coded computation offers significant benefits for specific classes of problems. We illustrate this with the following example. Suppose that a master node wishes to compute A^T B with the help of five workers that each store the equivalent of half of the matrices A and B. We first split the matrices evenly along the column dimension to obtain A = [A0, A1], B = [B0, B1]. We want to compute the four products Ai^T Bj for i, j ∈ {0, 1}. Now we design a computation strategy which is resilient to a single straggler. Let the master node assign worker i, for i = 1, . . . , 5, the coded matrices Â_i = A0 + i·A1 and B̂_i = B0 + i^2·B1. Each worker computes Ĉ_i = Â_i^T B̂_i. It follows that as soon as any four out of the five worker nodes return
the results of their computation, the master node can decode and recover C. We discuss this
through a representative scenario, where the master receives the computation results from workers
1, 2, 3 and 4. Then we have

[Ĉ_1]   [1^0  1^1  1^2  1^3] [A0^T B0]
[Ĉ_2] = [2^0  2^1  2^2  2^3] [A1^T B0]
[Ĉ_3]   [3^0  3^1  3^2  3^3] [A0^T B1]
[Ĉ_4]   [4^0  4^1  4^2  4^3] [A1^T B1] .
The coefficient matrix in the above equation is a Vandermonde matrix, which is non-singular over the reals Horn and Johnson (1991). Therefore, the four components A0^T B0, A1^T B0, A0^T B1 and A1^T B1 can be recovered once the master node has Ĉ_1, Ĉ_2, Ĉ_3 and Ĉ_4.
The above coded computation strategy is resilient to 1 straggler with 5 workers. On the other hand, for an uncoded replication strategy to be resilient to 1 straggler, each of the four components of C has to be computed by 2 workers, i.e., 8 workers are needed in total.
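A minimal NumPy sketch of this strategy, using assumed small random integer matrices (dimensions and names are illustrative, not taken from the experiments):

```python
import numpy as np

# Sketch of the polynomial-coded strategy above: 5 workers, 1 straggler.
rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(6, 4)).astype(float)
B = rng.integers(-3, 4, size=(6, 4)).astype(float)
A0, A1 = np.hsplit(A, 2)          # split evenly along the columns
B0, B1 = np.hsplit(B, 2)

# Worker i receives A0 + i*A1 and B0 + i^2*B1 and returns their product.
workers = [1, 2, 3, 4, 5]
results = {i: (A0 + i * A1).T @ (B0 + i**2 * B1) for i in workers}

# C_i is a degree-3 polynomial in i with matrix coefficients
# A0^T B0, A1^T B0, A0^T B1, A1^T B1, so any 4 responses suffice.
done = [1, 2, 3, 4]               # worker 5 is the straggler
V = np.vander(np.array(done, dtype=float), 4, increasing=True)
stacked = np.stack([results[i].ravel() for i in done])
coeffs = np.linalg.solve(V, stacked)       # invert the Vandermonde system
C00, C10, C01, C11 = (c.reshape(2, 2) for c in coeffs)

# Reassemble C = A^T B from its four blocks.
C = np.block([[C00, C01], [C10, C11]])
assert np.allclose(C, A.T @ B)
```

The decode step is exactly the Vandermonde solve shown in the displayed equation, applied entrywise to the workers' result matrices.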
The tutorial paper Ramamoorthy et al. (2020) overviews recent developments in the field of coded matrix computation.
1.2.1 Erasure coding for distributed matrix multiplication for matrices with bounded
entries
A key metric of distributed matrix multiplication is the minimum number of workers that the
master needs to wait for in order to compute C; this is called the recovery threshold of the scheme.
In the third part of this dissertation, we present a novel coding strategy for this problem when the
absolute values of the matrix entries are sufficiently small. We demonstrate a trade-off between the
assumed absolute value bounds on the matrix entries and the recovery threshold. At one extreme,
we are optimal with respect to the recovery threshold, and at the other extreme, we match the threshold of prior work. Experimental results on cloud-based clusters validate the benefits of our approach.
1.2.2 Numerically stable coded matrix computations via circulant and rotation ma-
trix embeddings
Ideas from coding theory have recently been used in several works for mitigating the effect of stragglers in distributed matrix computations over the reals. In particular, a polynomial code based approach spreads out the computation of the matrix product across n worker nodes. This allows for an "optimal" recovery threshold whereby the intended result can be decoded as
long as at least (n − s) worker nodes complete their tasks; s is the number of stragglers that the
scheme can handle. However, a major issue with these approaches is the high condition number of the corresponding recovery matrices.
It can be shown that the condition number of n × n real Vandermonde matrices grows exponentially in n Pan (2016). On the other hand, the condition numbers of Vandermonde matrices with
parameters on the unit circle are much better behaved. However, using complex evaluation points increases the computation and communication costs at the worker nodes. In this work we leverage the properties of circulant permutation matrices and
rotation matrices to obtain coded computation schemes with significantly lower worst case condi-
tion numbers; these matrices have eigenvalues that lie on the unit circle. Our technique essentially
works by evaluating polynomials at matrices rather than scalars. Our analysis demonstrates that
the associated recovery matrices have a condition number corresponding to Vandermonde matrices
with parameters given by the eigenvalues of the corresponding circulant permutation and rotation
matrices. Finally, we demonstrate an upper bound on the worst case condition number of these ma-
trices which grows as ≈ O(n^{s+6}). In essence, we leverage the well-behaved conditioning of complex
Vandermonde matrices with parameters on the unit circle, while still working with computation over the reals. Experimental results demonstrate that our proposed method has condition numbers that are significantly lower than those of competing approaches.
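The contrast that motivates this approach can be illustrated numerically (an assumed toy check, not the dissertation's experiment): real-node Vandermonde matrices become ill-conditioned very quickly, while a Vandermonde matrix with nodes on the unit circle (the DFT matrix) has condition number 1.

```python
import numpy as np

def cond_real_vander(n: int) -> float:
    # Vandermonde matrix with real nodes 1, 2, ..., n
    return np.linalg.cond(np.vander(np.arange(1, n + 1, dtype=float)))

def cond_dft(n: int) -> float:
    # Vandermonde matrix with nodes at the n-th roots of unity; this is
    # the (unnormalized) DFT matrix, a scalar multiple of a unitary matrix.
    roots = np.exp(2j * np.pi * np.arange(n) / n)
    return np.linalg.cond(np.vander(roots, increasing=True))

for n in [5, 10, 15]:
    print(n, cond_real_vander(n), cond_dft(n))
# The real-node condition number grows exponentially in n, while the
# unit-circle one stays at 1 up to floating point error.
```

This is the behavior the circulant and rotation matrix embeddings exploit: their eigenvalues lie on the unit circle, so the effective recovery matrix inherits the well-conditioned regime while the computation itself stays real-valued.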
Our work in this area is currently published as a preprint Ramamoorthy and Tang (2019) and is discussed in Chapter 5.

CHAPTER 2. CODED CACHING FOR NETWORKS WITH THE RESOLVABILITY PROPERTY
In this chapter, we consider the coded caching problem in a more general setting where there
is a layer of relay nodes between the server and the users (see Ji et al. (2015b) for related work).
Specifically, the server is connected to a set of relay nodes and the users are connected to certain
subsets of the relay nodes. A class of such networks have been studied in network coding and are
referred to as "combination networks" Ngai and Yeung (2004). In a combination network there are h relay nodes and C(h, r) users, each of which is connected to an r-subset of the relay nodes. Combination
networks were the first example where an unbounded gap between network coding and routing for
the case of multicast was shown. While this setting is still far from an arbitrary network connecting
the server and the users, it is rich enough to admit several complex strategies that may shed light
In this chapter we consider a class of networks that satisfy a so-called resolvability property.
These networks include combination networks where r divides h, but there are many other examples.
We propose a coded caching scheme for these networks and demonstrate its advantages.
This chapter is organized as follows. Section 2.1 presents the problem formulation, background
and our main contribution. In Section 2.2, we describe our proposed coded caching scheme, Section
2.3 presents a performance analysis and comparison, and Section 2.4 concludes the chapter.
In this work we consider a class of networks that contain an intermediate layer of nodes be-
tween the main server and the end user nodes. Combination networks are a specific type of such
networks and have been studied in some depth in the literature on network coding Ngai and Yeung
(2004). However, as explained below, we actually consider a larger class of networks that encompass
combination networks.
The networks we consider consist of a server node denoted S and h relay nodes, Γ1 , Γ2 , . . . , Γh
such that the server is connected to each of the relay nodes by a single edge. The set of relay nodes
is denoted by H. Let [m] = {1, 2, . . . , m}. If A ⊂ [h] we let ΓA = ∪i∈A {Γi }. There are K users in
the system and each user is connected to a subset of H of size r. Let V ⊂ {1, . . . , h} with |V| = r.
For convenience, we will assume that the set V is written in ascending order even though the subset
structure does not impose any ordering on the elements. Under this condition, we let V[i] represent
the i-th element of V. For instance, if V = {1, 3}, then V[1] = 1 and V[2] = 3. Likewise, Inv-V will denote the corresponding inverse map, i.e., Inv-V[i] = j if V[j] = i.
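As a small illustration, the map V[i] and its inverse Inv-V can be realized with dictionaries (a hypothetical helper, 1-indexed to match the text):

```python
# Sketch of the index maps used in this chapter: for a sorted subset V,
# fwd[i] plays the role of V[i] and inv plays the role of Inv-V,
# so inv[fwd[i]] == i for every position i.
def subset_map(V):
    V = sorted(V)
    fwd = {i + 1: v for i, v in enumerate(V)}   # position -> element, V[i]
    inv = {v: i + 1 for i, v in enumerate(V)}   # element -> position, Inv-V
    return fwd, inv

fwd, inv = subset_map({1, 3})
assert fwd[1] == 1 and fwd[2] == 3   # V[1] = 1, V[2] = 3
assert inv[1] == 1 and inv[3] == 2   # Inv-V[1] = 1, Inv-V[3] = 2
```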
Each user is labeled by the subset of relay nodes it is connected to. Thus, UV denotes the user
that is connected to ΓV . The set of all users is denoted U and the set of all subsets that specify
the users is denoted V, i.e., V ∈ V if UV is a user. We consider networks where V satisfies the resolvability property defined below.
Definition 1. Resolvability property. The set V defined above is said to be resolvable if there exists a partition of V into parallel classes P1, . . . , PK̃ such that the subsets in each parallel class are disjoint and their union is [h].
Each relay node Γi is thus connected to a set of users that is denoted by N (Γi). A simple example of such a network is described next.
Suppose that r divides h and let V be the set of all subsets of size r of [h]. In this case, the network defined above is the combination network Ji et al. (2015b) with K = C(h, r) users. The fact
that this network satisfies the resolvability property is not obvious and follows from a result of
Baranyai (1975).
Example 3. The combination network for the case of h = 4, r = 2 is shown in Fig. 2.1, along with the corresponding cache placement.
On the other hand, there are other networks where |V| is strictly smaller than C(h, r).
Example 4. Let h = 9, r = 3 and let V = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 4, 7}, {2, 5, 8}, {3, 6, 9}}.
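A quick way to check the resolvability property for a candidate V, as in Example 4, is to verify that every parallel class partitions [h]; a sketch with assumed names:

```python
from itertools import chain

# Resolvability check: every parallel class must partition [h], i.e., its
# member subsets are disjoint and together cover each relay index once.
def is_resolvable(h, parallel_classes):
    ground = sorted(range(1, h + 1))
    return all(
        sorted(chain.from_iterable(P)) == ground   # exact cover, no repeats
        for P in parallel_classes
    )

# Example 4: h = 9, r = 3, with V split into two parallel classes.
P1 = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
P2 = [{1, 4, 7}, {2, 5, 8}, {3, 6, 9}]
assert is_resolvable(9, [P1, P2])

# A class whose members overlap (or miss an element) is rejected.
assert not is_resolvable(9, [[{1, 2, 3}, {3, 4, 5}, {6, 7, 9}]])
```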
The server S contains a library of N files where each file is of size F bits (we will interchangeably
refer to F as the subpacketization level). The files are represented by random variables Wi , i =
1, . . . , N , where Wi is distributed uniformly over the set [2F ]. Each user has a cache of size M F
bits. There are two distinct phases in the coded caching problem. In the placement phase, the
content of the user’s caches is populated. This phase should not depend on the file requests of the
users. In the delivery phase, user UV requests a file denoted WdV from the library; the set of all
user requests is denoted D = {WdV : UV is a user}. It can be observed that there are a total of N^K
distinct request sets. The server responds by transmitting a certain number of bits that satisfies
the demands of all the users. A (M, R1, R2) caching system also requires the specification of the encoding and decoding functions listed below.
[Figure 2.1: source S connected to relays Γ1, Γ2, Γ3, Γ4; the users they serve hold caches Z1, Z2, Z3, Z3, Z2, Z1.]
• h·N^K server to relay encoding functions: The signal ψS→Γi,D (W1, . . . , WN) is the encoding function for the edge from S to relay node Γi. Here, ψS→Γi,D : [2^{NF}] → [2^{R1F}], so that the rate of transmission on server to relay edges is at most R1. The signal on the edge is denoted XS→Γi.
• h·K̃·N^K relay to user encoding functions: Let UV ∈ N (Γi). The signal ϕΓi→UV,D (ψS→Γi,D (W1, . . . , WN)) is the encoding function for the edge Γi → UV. Here ϕΓi→UV,D : [2^{R1F}] → [2^{R2F}], so that the rate of transmission on the relay to user edges is at most R2. The signal on the corresponding edge is denoted XΓi→UV.
• K·N^K decoding functions: Every user has a decoding function for a specific request set D, denoted by µD,UV (XΓV[1]→UV, . . . , XΓV[r]→UV, ZV). Here µD,UV : [2^{R2F}] × · · · × [2^{R2F}] × [2^{MF}] → [2^F].
For this coded caching system, the probability of error Pe is defined as Pe = maxD maxV P(ŴD,UV ≠ WdV).
The triplet (M, R1, R2) is said to be achievable if for every ε > 0 and every large enough file size F there exists a (M, R1, R2) caching scheme with probability of error less than ε. The subpacketization level of a scheme, i.e., F, is also an important metric, because it is directly connected
to the implementation complexity of the scheme. For example, the original scheme of Maddah-Ali and Niesen (2014b), which operates when there is a shared link between the server and the users, requires a subpacketization level F ≈ C(K, KM/N), which grows exponentially with K. Thus, the scheme of Maddah-Ali and Niesen (2014b) is applicable only when the files are very large. In general, schemes with lower subpacketization levels are preferable.
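For concreteness, the growth of this subpacketization level can be tabulated (illustrative parameters, assuming KM/N is an integer):

```python
from math import comb

# Subpacketization F ~ C(K, KM/N) of the original shared-link scheme;
# it grows exponentially in the number of users K.
def subpacketization(K: int, M: int, N: int) -> int:
    return comb(K, K * M // N)

for K in [8, 16, 32, 64]:
    # each user caches half the library (M/N = 1/2)
    print(K, subpacketization(K, M=1, N=2))
```

Already at K = 64 users the level exceeds 10^17 subfiles per file, which is the practical obstacle that the reduced-subpacketization constructions of Chapter 3 target.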
Prior work has presented two coded caching schemes for combination networks. In the routing scheme, coding is not considered: each user simply caches a M/N fraction of each file. In the delivery phase, the total number of bits that need to be transmitted is K(1 − M/N)F. As there are h outgoing edges from S, we have that R1 = (K/h)(1 − M/N). Moreover, as there are r incoming edges at each user, R2 = (1/r)(1 − M/N). The second, coded scheme was proposed in Ji et al. (2015b) for combination networks. This scheme uses a decentralized cache placement
phase, where each user randomly caches a M/N fraction of each file. In the delivery phase, the
server executes the CM step where it encodes all requested files by a decentralized multicasting
caching scheme. Following this, in the CNC step, the server divides each coded signal into r equal-size signals and encodes these r signals with an (h, r) binary MDS code. The h coded signals are
transmitted over the h links from the server to relay nodes. The relay nodes forward the signals
to the users. Thus, each user receives r coded signals and can recover its demand owing to the MDS property.
Our main contributions in this chapter are as follows.
• Our schemes work for any network that satisfies the resolvability property. For a class of
combination networks, we demonstrate that our achievable rates are strictly better than those of previously proposed schemes.
• The subpacketization level of our scheme is also significantly lower than competing methods. As discussed in Section 2.1, the subpacketization level of a given scheme directly correlates with its implementation complexity.
Consider a network where V satisfies the resolvability property and let the parallel classes be P1, · · · , PK̃. It is evident that each user belongs to exactly one parallel class. Let ∆(V) indicate the parallel class that a user belongs to, i.e., ∆(V) = j if user UV belongs to Pj. Now, recall
that N (Γi ) is the set of users UV such that i ∈ V. By the resolvability property it has to be the
case that each user in N (Γi) belongs to a different parallel class. In fact, it can be observed that

{∆(V) : UV ∈ N (Γi)} = [K̃].   (2.1)

This implies that each relay node "sees" exactly the same set of parallel classes represented in the
users that it is connected to. This observation inspires our placement phase in the caching scheme.
We populate the user caches based on the parallel class that a given user belongs to. Loosely
speaking, it turns out that we can design a symmetric uncoded placement such that the overall distribution of the cache content seen by each relay node is the same.
Our proposed placement and delivery phase schemes are formally specified in Algorithm 1 and
are discussed below. Assume each user has a storage capacity of M ∈ {0, N/K̃, 2N/K̃, · · · , N} files, and let t = K̃M/N = KrM/(hN). The users can be partitioned into K̃ groups Gi, where Gi = {UV : V ∈ Pi}.
In the placement phase, each file Wn is split into r·C(K̃, t) non-overlapping subfiles of equal size that are labeled as

Wn = (W^l_{n,T} : T ⊂ [K̃], |T| = t, l ∈ [r]).
Thus, the subpacketization mechanism is such that each subfile has a superscript in [r] in addition to the subset-based subscript that was introduced in the work of Maddah-Ali and Niesen (2014b). Subfile W^l_{n,T} is placed in the cache of the users in Gi if i ∈ T; equivalently, W^l_{n,T} is stored in the cache of user UV if ∆(V) ∈ T. Each subfile has size F/(r·C(K̃, t)). This requires

N·r·C(K̃ − 1, t − 1) × F/(r·C(K̃, t)) = F·(Nt/K̃) = MF

bits, demonstrating that our scheme uses only MF bits of cache memory at each user.
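This counting argument can be sanity-checked numerically for one assumed parameter set (names are illustrative):

```python
from fractions import Fraction
from math import comb

# Cache occupancy of one user: for all N files and all r superscripts,
# the C(Kt - 1, t - 1) subfiles whose index set T contains the user's
# class, each of size F / (r * C(Kt, t)); here Kt plays the role of K~.
def cache_bits(N, r, Kt, t, F):
    return Fraction(N * r * comb(Kt - 1, t - 1) * F, r * comb(Kt, t))

# Assumed parameters: K~ = 3 classes, N = 6 files, M = 2 so t = K~M/N = 1.
N, r, Kt, F = 6, 2, 3, 12
t = 1
M = Fraction(N * t, Kt)          # recover M = 2 from t = K~M/N
assert cache_bits(N, r, Kt, t, F) == M * F   # exactly M*F bits used
```

The exact-arithmetic check confirms that the placement fills, and never exceeds, each user's MF-bit cache.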
Consider the network in Example 3 with N = 6 and M = 2. The placement is as follows.

G1 = {U12, U34} cache W^1_{n,1}, W^2_{n,1};
G2 = {U13, U24} cache W^1_{n,2}, W^2_{n,2}; and
G3 = {U14, U23} cache W^1_{n,3}, W^2_{n,3}.
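The placement above can be expressed programmatically; a sketch with hypothetical labels, where a subfile is identified by the tuple (n, i, l) for file n, class index i and superscript l:

```python
# Sketch of the group-based placement for the h = 4, r = 2 network:
# every user in group Gi caches the subfiles W^l_{n,i} for all files n
# and all superscripts l in [r]. (N = 2 here just to keep it small.)
r, N = 2, 2
groups = {1: ["U12", "U34"], 2: ["U13", "U24"], 3: ["U14", "U23"]}

placement = {
    user: {(n, i, l) for n in range(1, N + 1) for l in range(1, r + 1)}
    for i, users in groups.items()
    for user in users
}

# Users in the same group share a cache; users in G1 hold subscript 1.
assert placement["U12"] == placement["U34"]
assert placement["U12"] == {(n, 1, l) for n in (1, 2) for l in (1, 2)}
```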
Note that by eq. (2.1), we have that each relay node is connected to a user from each parallel
class. Our placement scheme depends on the parallel class that a user belongs to. Thus, it ensures
that the overall distribution of the cache content seen by each relay node is the same. This can
be seen in Fig. 2.1 for the example considered above. We note here that the routing scheme (cf.
Now, we briefly outline the main idea of our achievable scheme. Our file parts are of the form W^j_{n,T}, where for a given T, j ∈ [r]. Note that each user is also connected to r different relay nodes in
H. Our proposed scheme is such that each user recovers a missing file part with a certain superscript
from one of the relay nodes it is connected to. In particular, we convey enough information from the
server to the relay nodes such that each relay node uses the scheme proposed in Maddah-Ali and
Niesen (2014b) for one set of superscripts. Crucially, the symmetrization afforded by the placement phase makes this possible.
Theorem 1. Consider a network satisfying the resolvability property with h relay nodes and K
users such that each user is connected to r relay nodes. Suppose that the server has N files and
each user has a cache of size M files. Then the following rate pair (R1, R2) is achievable:

R1 = min{ K(1 − M/N) / [h(1 + KrM/(hN))], (N/r)(1 − M/N) },     (2.2)

R2 = (1 − M/N)/r.
Proof. Let the set of user requests be denoted D = {WdV : UV is a user}. For each relay node
Γi, we focus on the users connected to it, N(Γi), and a subset C ⊂ {∆(V) : UV ∈ N(Γi)} with
|C| = t + 1. The server transmits

⊕_{V : ∆(V) ∈ C} W^{Inv−V[i]}_{dV, C\{∆(V)}}     (2.3)

to the relay node Γi (⊕ denotes bitwise XOR), and Γi forwards it to the users UV with ∆(V) ∈ C.
We now argue that each user can recover its requested file. Evidently, a user UV is missing the
subfiles

{W^{Inv−V[i]}_{dV,T} : T ⊂ [K̃] \ {∆(V)}, |T| = t}.

This is because the transmission in eq. (2.3) is such that UV caches all subfiles that are involved
in the XOR except the one that it is interested in. This implies it can decode its missing subfile.
In addition, UV is also connected to r relay nodes so that ∪i∈V {Inv − V[i]} = [r], i.e., it can
recover all of its missing subfiles.
Next, we determine R1 and R2. Each coded subfile results in F/(r C(K̃, t)) bits being sent over
the link from source S to Γi. Since the number of subsets C is C(K̃, t + 1), the total number of
bits sent from S to Γi is

C(K̃, t + 1) × F/(r C(K̃, t)) = F K̃(1 − M/N) / [r(1 + K̃M/N)],

and hence R1 = K̃(1 − M/N) / [r(1 + K̃M/N)]. Next, note that each coded subfile is forwarded
to |C| = t + 1 users. Thus, each user receives |C| C(K̃, t + 1)/K̃ coded subfiles, so that the total
number of bits sent from a relay node to a user is

(|C| C(K̃, t + 1)/K̃) × F/(r C(K̃, t)) = |C| F (K̃ − t)/[r K̃(t + 1)] = F (1 − M/N)/r.

Hence R2 = (1 − M/N)/r.

Thus, the triplet (M, R1, R2) = (M, K̃(1 − M/N)/[r(1 + K̃M/N)], (1 − M/N)/r) is achievable
for M ∈ {0, Nh/(Kr), 2Nh/(Kr), · · · , N}. Points for general values of M can be obtained by
memory sharing between triplets of this form.
If K̃ > N , it is clear that some users connecting to a given relay node Γi request the same file.
N M
In this case, the routing scheme can attain R1 = r (1 − N ), which is better than the proposed
N
scheme if M ≤ 1 − K̃
. This explains the second term within the minimum in the RHS of eq.
(2.2).
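The achievable pair in eq. (2.2) is easy to evaluate numerically. The sketch below is an illustrative helper (the function name `rates` and the sample parameters h = 4, r = 2, K = 6, N = 3, M = 1, matching the earlier example network, are our own choices) that computes (R1, R2) with exact fractions.

```python
from fractions import Fraction

def rates(K, h, r, N, M):
    """(R1, R2) from eq. (2.2): R1 is the minimum of the coded rate on the
    server-to-relay links and the routing alternative; R2 = (1 - M/N)/r."""
    f = 1 - Fraction(M, N)                                   # 1 - M/N
    coded = Fraction(K, h) * f / (1 + Fraction(K * r * M, h * N))
    routing = Fraction(N, r) * f
    return min(coded, routing), f / r

# h = 4 relays, r = 2, K = 6 users, N = 3 files, cache size M = 1 file.
R1, R2 = rates(K=6, h=4, r=2, N=3, M=1)
print(R1, R2)  # 1/2 1/3
```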
Example 6. Assume that user UV , V ⊂ {1, . . . , 4}, |V| = 2, requires file WdV . The users connected
to Γ1 correspond to subsets {1, 2}, {1, 3} and {1, 4} so that Inv − V[1] = 1 for all of them. Thus,
the users recover missing subfiles with superscript of 1 from Γ1 . In particular, the transmissions
are as follows.
The users connected to Γ2 correspond to subsets {1, 2}, {2, 3} and {2, 4} in which case Inv −
{1, 2}[2] = 2 while Inv − {2, 3}[2] = 1, Inv − {2, 4}[2] = 1. Thus, user U12 recovers missing subfiles
with superscript 2 from Γ2 while users U23 and U24 recover missing subfiles with superscript 1.
In a similar manner, the other transmissions can be determined, and it can be verified that all
user demands are satisfied.
It is important to note that the resolvability property is key to our proposed scheme. For
example, if r does not divide h, the combination network does not have the resolvability property.
In this case, it can be shown that a symmetric uncoded placement is impossible. We demonstrate
this in the example below.
Example 7. Consider the combination network with h = 3, r = 2, and V = {{1, 2}, {1, 3}, {2, 3}}.
Here 2 does not divide 3, and it is easy to check that the network does not satisfy the resolvability
property. Suppose that there exists a symmetric uncoded placement, and suppose U12 caches Z1
and U13 caches Z2. By the hypothesis, Γ1 and Γ2 have to see the same cache content. Since
N(Γ1) = {U12, U13} and N(Γ2) = {U12, U23}, U23 has to cache Z2. As a result, since N(Γ3) =
{U13, U23}, Γ3 sees Z2 twice, which is different from the cache content seen by Γ1 and Γ2. This is
a contradiction.
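The argument in Example 7 can also be checked mechanically. The brute-force checker below is a hypothetical helper (not part of the thesis machinery): it searches for a partition of the blocks into parallel classes and finds none for the h = 3, r = 2 network, while confirming that the design of Example 8 is resolvable.

```python
from itertools import combinations

def is_resolvable(points, blocks):
    """Brute-force check: can `blocks` be partitioned into parallel classes,
    i.e., groups of disjoint blocks whose union is `points`?"""
    points = frozenset(points)
    if not blocks:
        return True
    first = blocks[0]
    # Enumerate candidate parallel classes containing the first block, recurse.
    for size in range(1, len(blocks) + 1):
        for cls in combinations(blocks, size):
            if first not in cls:
                continue
            union = set().union(*cls)
            if len(union) != sum(len(b) for b in cls):  # blocks not disjoint
                continue
            if union != points:                          # class does not cover X
                continue
            if is_resolvable(points, [b for b in blocks if b not in cls]):
                return True
    return False

blocks = [frozenset(v) for v in [{1, 2}, {1, 3}, {2, 3}]]
print(is_resolvable({1, 2, 3}, blocks))  # False: 2 does not divide 3
```

Running the same checker on the six 2-subsets of {1, 2, 3, 4} returns True, consistent with Example 8.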
We emphasize that a large class of networks satisfy the resolvability property. For instance, if r
divides h, Baranyai (1975) shows that the set of all C(h, r) r-subsets of an h-set can be partitioned
into parallel classes. More generally, one can use resolvable designs Stinson (2003), which are set
systems that satisfy the resolvability property. Such designs include affine planes, which correspond
to networks where for prime q we have h = q^2 and r = q; the set V is given by the specification
of the affine plane. Furthermore, one can obtain resolvable designs from affine geometry over Fq
that will correspond to networks with h = q^m and r = q^d.

Performance Analysis

[Figure 2.2: Performance comparison of the different schemes for a C(6, 2) combination network
with K = 15, K̃ = 5, and N = 50; the plot shows R1 and R2 of the proposed, CM-CNC, and
routing schemes as a function of the cache size M.]
We now compare the performance of our proposed scheme with the CM-CNC scheme of Ji et al.
(2015b) and with the routing scheme. For a given value of M we compare the achievable (R1, R2)
pairs of the different schemes. Furthermore, we also compare the required subpacketization levels of
the different schemes, as subpacketization directly impacts the complexity of implementing a given
scheme. Table 2.1 summarizes the comparison. We note here that the rate of the CM-CNC scheme
is derived in Ji et al. (2015b) for a decentralized placement. The rate in Table 2.1 corresponds to a
derivation of the corresponding rate for a centralized placement; it is lower than the one for the
decentralized placement.
The following conclusions can be drawn. Let (R1*, R2*) and F* denote the rates and the
subpacketization level of our proposed scheme. A direct comparison shows that our scheme is better
in both rate metrics. Next,

F*/F^{CM−CNC} ≈ exp(−K(1 − r/h) He(M/N)),

where He(·) represents the binary entropy function in nats. Thus, the subpacketization level of our
scheme is exponentially smaller than that of the CM-CNC scheme.
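The entropy approximation can be compared against exact binomials. The sketch below assumes the centralized CM-CNC subpacketization is the Maddah-Ali and Niesen level C(K, KM/N) (an assumption made for illustration) and uses the C(6, 2) network parameters; the estimate agrees with the exact ratio up to polynomial factors.

```python
from math import comb, exp, log

def He(p):
    """Binary entropy in nats."""
    return -p * log(p) - (1 - p) * log(1 - p)

# C(6, 2) combination network: h = 6, r = 2, K = 15, K_tilde = 5, N = 50, M = 10.
K, K_tilde, r, h, N, M = 15, 5, 2, 6, 50, 10
t = K_tilde * M // N                  # t = 1
F_star = r * comb(K_tilde, t)         # subpacketization of the proposed scheme
F_cm = comb(K, K * M // N)            # assumed CM-CNC (Maddah-Ali--Niesen) level
ratio = F_star / F_cm
approx = exp(-K * (1 - r / h) * He(M / N))
print(F_star, F_cm, ratio, approx)    # 10, 455, ~2.2e-2, ~6.7e-3
```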
For a C(6, 2) combination network with parameters K = 15, K̃ = 5, N = 50, we plot the
performance of the different schemes in Fig. 2.2, which compares R1 and R2 of the three schemes.
It can be observed that for R1, the proposed scheme is the best for all cache sizes M. At the same
time, we can see that R2 of the routing scheme and the proposed scheme are identical and
significantly better than that of the CM-CNC scheme.
2.4 Conclusions

In this work, we proposed a coded caching scheme for networks that satisfy the resolvability
property. This family of networks includes a class of combination networks as a special case. The
rate required by our scheme for transmission over the server-to-relay edges and over the relay-to-user
edges is strictly less than that of schemes proposed in prior work. In addition, the subpacketization
level of our scheme is also significantly lower than that of prior work. The generalization to networks
that do not satisfy the resolvability property and to networks with arbitrary topologies is an
interesting direction for future work.
In this chapter, we examine an important aspect of the coded caching problem that is closely
tied to its adoption in practice. It is important to note that the huge gains of coded caching require
each file to be partitioned into Fs ≈ C(K, KM/N) non-overlapping subfiles of equal size; Fs is
referred to as the subpacketization level. It can be observed that for a fixed cache fraction M/N,
Fs grows exponentially with K. This can be problematic in practical implementations. For instance,
suppose that K = 64 and M/N = 0.25, so that Fs = C(64, 16) ≈ 4.8 × 10^14, with a rate R ≈ 2.82.
In this case, it is evident that at the bare minimum, the size of each file has to be at least 480
terabits for leveraging the
gains in Maddah-Ali and Niesen (2014b). The situation is even worse in practice. The atomic unit
of storage on present day hard drives is a sector of size 512 bytes, and the trend in the disk drive
industry is to move this to 4096 bytes Fitzpatrick (2011). As a result, the minimum size of each file
needs to be much higher than 480 terabits. Therefore, the scheme in Maddah-Ali and Niesen (2014b)
is not practical even for moderate values of K. Furthermore, even for smaller values of K, schemes
with low subpacketization levels are desirable. This is because any practical scheme will require each
of the subfiles to have some header information that allows for decoding at the end users. When
there are a large number of subfiles, the header overhead may be non-negligible. For these same
parameters (K = 64, M/N = 0.25), our proposed approach in this work allows us to obtain, e.g., the
following operating points: (i) Fs ≈ 1.07 × 10^9 and R = 3, (ii) Fs ≈ 1.6 × 10^4 and R = 6, (iii)
Fs = 64 and R = 12. For the first point, it is evident that the subpacketization level drops by
over five orders of magnitude with only a very small increase in the rate. Points (ii) and (iii) show
that the proposed scheme allows us to operate at various points on the trade-off between the
subpacketization level and the rate.
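The numbers quoted above for the original scheme are straightforward to reproduce; the following sketch evaluates its subpacketization and rate at K = 64 and M/N = 0.25.

```python
from math import comb

K, t = 64, 16                      # K users, t = K * (M/N) with M/N = 0.25
Fs = comb(K, t)                    # subpacketization C(K, KM/N)
R = K * (1 - t / K) / (1 + t)      # rate K(1 - M/N)/(1 + KM/N)

print(f"Fs = {Fs:.3e}")            # on the order of 4.8e14 subfiles
print(f"R  = {R:.2f}")             # ~2.82
# At one bit per subfile, each file must contain at least Fs bits,
# i.e., roughly 480 terabits.
```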
The issue of subpacketization was first considered in the work of Shanmugam et al. (2014, 2016)
in the decentralized coded caching setting. In the centralized case it was considered in the work
of Yan et al. (2017a). They proposed a low subpacketization scheme based on placement delivery
arrays. Reference Shangguan et al. (2018) viewed the problem from a hypergraph perspective
and presented several classes of coded caching schemes. The work of Shanmugam et al. (2017)
has recently shown that there exist coded caching schemes where the subpacketization level grows
linearly with the number of users K; however, this result only applies when the number of users is
very large.
In this work, we propose low subpacketization level schemes for coded caching. Our proposed
schemes leverage the properties of combinatorial structures known as resolvable designs and their
natural relationship with linear block codes. Our schemes are applicable for a wide variety of
parameter ranges and allow the system designer to tune the subpacketization level and the gain
of the system with respect to an uncoded system. We note here that designs have also been used
to obtain results in distributed data storage Olmez and Ramamoorthy (2016) and network coding
based function computation in recent work Tripathy and Ramamoorthy (2015, 2017).
This chapter is organized as follows. Section 3.1 discusses the background and related work
and summarizes the main contributions of our work. Section 3.2 outlines our proposed scheme; it
includes all the constructions and the essential proofs. A central object of study in our work is the
class of matrices that satisfy a property that we call the consecutive column property (CCP).
Section 3.3 overviews several constructions of matrices that satisfy this property. Several of the
longer and more involved proofs of statements in Sections 3.2 and 3.3 appear in the Appendix. In
Section 3.4 we perform an in-depth comparison of our work with existing constructions in the
literature. We conclude the chapter with a discussion of opportunities for future work in Section 3.5.
We consider a scenario where the server has N files, each of which consists of Fs subfiles. There
are K users, each equipped with a cache of size M Fs subfiles. The coded caching scheme is specified
by means of the placement scheme and an appropriate delivery scheme for each possible demand
pattern. In this work, we use combinatorial designs Stinson (2003) to specify the placement scheme.

Definition 2. A design is a pair (X, A) such that

1. X is a set of elements called points, and

2. A is a collection of nonempty subsets of X called blocks, where each block contains the same
number of points.
Definition 3. The incidence matrix N of a design (X, A) is a binary matrix of dimension |X|×|A|,
where the rows and columns correspond to the points and blocks respectively. Let i ∈ X and j ∈ A.
Then,

N(i, j) = 1 if i ∈ j, and N(i, j) = 0 otherwise.
It can be observed that the transpose of an incidence matrix also specifies a design. We will
refer to this as the transposed design. In this work, we will utilize resolvable designs, which are a
special class of designs.

Definition 4. A parallel class P in a design (X, A) is a subset of disjoint blocks from A whose
union is X. A partition of A into several parallel classes is called a resolution, and (X, A) is said
to be a resolvable design if A has at least one resolution.

For resolvable designs, it follows that each point also appears in the same number of blocks.
Example 8. Let X = {1, 2, 3, 4} and

A = {{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}}.

It can be observed that this design is resolvable with the following parallel classes:

P1 = {{1, 2}, {3, 4}}, P2 = {{1, 3}, {2, 4}}, P3 = {{1, 4}, {2, 3}}.
In the sequel we let [n] denote the set {1, . . . , n}. We emphasize here that the original scheme
of Maddah-Ali and Niesen (2014b) can be viewed as an instance of the trivial design. For example,
consider the setting when t = KM/N is an integer. Let X = [K] and A = {B : B ⊂ [K], |B| = t}.
In the scheme of Maddah-Ali and Niesen (2014b), the users are associated with X and the subfiles
with A. User i ∈ [K] caches subfile Wn,B , n ∈ [N ] for B ∈ A if i ∈ B. The main message of our
work is that carefully constructed resolvable designs can be used to obtain coded caching schemes
with low subpacketization levels, while retaining much of the rate gains of coded caching. The
basic idea is to associate the users with the blocks and the subfiles with the points of the design.
The roles of the users and subfiles can also be interchanged by simply working with the transposed
design.
Example 9. Consider the resolvable design from Example 8. The blocks in A correspond to six
users U12, U34, U13, U24, U14, U23. Each file is partitioned into Fs = 4 subfiles Wn,1, Wn,2, Wn,3, Wn,4,
which correspond to the four points in X. The cache of user UB, denoted ZB, is specified as

ZB = {Wn,i : i ∈ B, n ∈ [N]}.

We note here that the caching scheme is symmetric with respect to the files in the server.
Furthermore, each user caches half of each file so that M/N = 1/2. Suppose that in the delivery
phase user UB requests file WdB where dB ∈ [N]. These demands can be satisfied as follows. We
pick three blocks, one each from the parallel classes P1, P2, P3, and generate the corresponding
transmitted signal; for the blocks {1, 2}, {1, 3}, {2, 3} this is

Wd12,3 ⊕ Wd13,2 ⊕ Wd23,1.     (3.1)

The three terms in eq. (3.1) above correspond to blocks from different parallel classes:
{1, 2} ∈ P1, {1, 3} ∈ P2, {2, 3} ∈ P3. This equation has the all-but-one structure that was also
exploited in Maddah-Ali and Niesen (2014b), i.e., eq. (3.1) is such that each user caches all but
one of the subfiles participating in the equation. Specifically, user U12 caches Wn,1 and Wn,2 for
all n ∈ [N]. Thus, it can decode the subfile Wd12,3 that it needs. A similar argument applies to
users U13 and U23. It can be verified that the other three equations also have this property. Thus,
at the end of the delivery phase, each user obtains its missing subfiles.
This scheme corresponds to a subpacketization level of 4 and a rate of 1. In contrast, the scheme
of Maddah-Ali and Niesen (2014b) would require a subpacketization level of C(6, 3) = 20 with a
rate of 0.75. Thus, it is evident that we gain significantly in terms of the subpacketization while
sacrificing only a small amount in terms of the rate.
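Example 9's delivery phase can be verified exhaustively. The sketch below (our own illustration) rebuilds the design, keeps only the block triples, one per parallel class, whose common intersection is empty, forms the corresponding all-but-one equations, and checks that every user recovers exactly its two missing subfile indexes (file indexes are omitted since the placement is symmetric across files).

```python
from itertools import product

points = {1, 2, 3, 4}
classes = [[frozenset({1, 2}), frozenset({3, 4})],
           [frozenset({1, 3}), frozenset({2, 4})],
           [frozenset({1, 4}), frozenset({2, 3})]]

delivered = {b: set() for cls in classes for b in cls}
num_eqs = 0
for triple in product(*classes):
    if set.intersection(*(set(b) for b in triple)):
        continue  # the all-but-one structure needs an empty common intersection
    num_eqs += 1
    for i, user in enumerate(triple):
        others = [b for j, b in enumerate(triple) if j != i]
        (idx,) = frozenset.intersection(*others)  # unique common point
        assert idx not in user                    # user does not cache this subfile
        delivered[user].add(idx)

# Every user recovers both of its missing subfiles; 4 equations => rate 4/4 = 1.
assert num_eqs == 4
assert all(delivered[b] == points - b for b in delivered)
print(num_eqs, "equations; all demands satisfied")
```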
As shown in Example 9, we can obtain a scheme by associating the users with the blocks and
the subfiles with the points. In this work, we demonstrate that this basic idea can be significantly
generalized, and several schemes with low subpacketization levels that continue to leverage much of
the rate gains of coded caching can be obtained.
Coded caching has been the subject of much investigation in recent work as discussed briefly
earlier on. We now overview existing literature on the topic of low subpacketization schemes for
coded caching. In the original paper Maddah-Ali and Niesen (2014b), for given problem parameters
K (number of users) and M/N (cache fraction), the authors showed that when N ≥ K, the rate
equals

R = K(1 − M/N) / (1 + KM/N)

when M is an integer multiple of N/K. Other points are obtained via memory sharing. Thus,
in the regime when KM/N is large, the coded caching rate is approximately N/M − 1, which is
independent of K. Crucially, though, this requires the subpacketization level Fs ≈ C(K, KM/N).
It can be observed that for a fixed M/N, Fs grows exponentially with K. This is one of the main
drawbacks of this approach.
The subpacketization issue was first discussed in the work of Shanmugam et al. (2014, 2016)
in the context of decentralized caching. Specifically, Shanmugam et al. (2016) showed that in the
decentralized setting for any subpacketization level Fs such that Fs ≤ exp(KM/N ) the rate would
scale linearly in K, i.e., R ≥ cK. Thus, much of the rate benefits of coded caching would be lost
if Fs did not scale exponentially in K. Following this work, the authors in Yan et al. (2017a)
introduced a technique for designing low subpacketization schemes in the centralized setting which
they called placement delivery arrays. In Yan et al. (2017a), they considered the setting when
M/N = 1/q or M/N = 1 − 1/q and demonstrated a scheme where the subpacketization level was
exponentially smaller than the original scheme, while the rate was marginally higher. This scheme
can be viewed as a special case of our work. We discuss these aspects in more detail in Section 3.4.
In Shangguan et al. (2018), the design of coded caching schemes was achieved through the design
of hypergraphs with appropriate properties. In particular, for specific problem parameters, they
were able to establish the existence of schemes where the subpacketization scales as exp(c√K).
Reference Yan et al. (2017b) presented results in this setting by considering strong edge coloring of
bipartite graphs.
Very recently, Shanmugam et al. (2017) showed the existence of coded caching schemes where
the subpacketization grows linearly with the number of users, but the coded caching rate grows as
O(K δ ) where 0 < δ < 1. Thus, while the rate is not a constant, it does not grow linearly with
K either. Both Shangguan et al. (2018) and Shanmugam et al. (2017) are interesting results that
demonstrate the existence of regimes where the subpacketization scales in a manageable manner.
Nevertheless, it is to be noted that these results come with several caveats. For example, the result
of Shanmugam et al. (2017) is only valid in the regime when K is very large and is unlikely to be
of use for practical values of K. The result of Shangguan et al. (2018) has significant restrictions
on the number of users, e.g., in their paper, K needs to be of the form C(n, a) or q^t C(n, a).
In this work, the subpacketization levels we obtain are typically exponentially smaller than that
of the original scheme. However, they still continue to scale exponentially in K, albeit with much
smaller exponents. On the other hand, our construction has the advantage of being applicable for
a large range of problem parameters. Our main contributions are as follows.
• We uncover a simple and natural relationship between an (n, k) linear block code and a coded
caching scheme. We first show that any linear block code over GF(q), and in some cases over
Z mod q (where q is not a prime or a prime power), generates a resolvable design. This
design in turn specifies a coded caching scheme with K = nq users where the cache fraction is
M/N = 1/q. A complementary cache fraction point where M/N = 1 − α/(nq), where α is some
integer between 1 and k + 1, can also be obtained. Intermediate points can be obtained by
memory sharing between these points.
• We consider a class of (n, k) linear block codes whose generator matrices satisfy a specific rank
property. For such codes, we are able to identify an efficient delivery phase and determine
the precise coded caching rate. We demonstrate that the subpacketization level is at most
q^k (k + 1), whereas the coded caching gain scales as k + 1 with respect to an uncoded caching
scheme. Thus, different choices of k allow the system designer significant flexibility to choose
the operating point of the system.
• We discuss several constructions of generator matrices that satisfy the required rank property.
We characterize the ranges of alphabet sizes (q) over which these matrices can be constructed.
If one has a given subpacketization budget in a specific setting, we are able to find a set of
schemes that fit the budget while leveraging the rate gains of coded caching.
All our constructions of low subpacketization schemes will stem from resolvable designs (cf.
Definition 4). Our overall approach is to first show that any (n, k) linear block code over GF(q)
can be used to obtain a resolvable block design. The placement scheme obtained from this resolvable
design is such that M/N = 1/q. Under certain (mild) conditions on the generator matrix, we show
that a delivery phase scheme can be designed that allows for a significant rate gain over the uncoded
scheme while having a subpacketization level that is significantly lower than that of Maddah-Ali and
Niesen (2014b). Furthermore, our scheme can be transformed into another scheme that operates at
the point M/N = 1 − (k + 1)/(nq). We also discuss situations under which we can operate over
modular arithmetic Zq = Z mod q, where q is not necessarily a prime or a prime power; this allows
us to obtain a larger range of parameters.
Consider an (n, k) linear block code over GF(q). To avoid trivialities, we assume that its
generator matrix does not have an all-zeros column. We collect its q^k codewords and construct a
matrix T of size n × q^k as follows:

T = [c_0^T  c_1^T  · · ·  c_{q^k−1}^T],

where the 1 × n vector c_ℓ represents the ℓ-th codeword of the code. Let X = {0, 1, · · · , q^k − 1} be
the point set and let A be the collection of all subsets B_{i,l} for 0 ≤ i ≤ n − 1 and 0 ≤ l ≤ q − 1,
where B_{i,l} = {j : T(i, j) = l}.
Lemma 1. The construction procedure above results in a design (X, A) where X = {0, 1, · · · , q^k −
1} and |B_{i,l}| = q^{k−1} for all 0 ≤ i ≤ n − 1 and 0 ≤ l ≤ q − 1. Furthermore, the design is
resolvable with parallel classes P_i = {B_{i,l} : 0 ≤ l ≤ q − 1} for 0 ≤ i ≤ n − 1.

Proof. Each codeword is of the form uG, where u = [u_0, · · · , u_{k−1}]. Let a* be such that
g_{a*b} ≠ 0. Consider the equation

Σ_{a ≠ a*} u_a g_{ab} = ∆_b − u_{a*} g_{a*b},

where ∆_b is fixed. For arbitrary values of u_a, a ≠ a*, this equation has a unique solution for u_{a*},
which implies that for any ∆_b, |B_{b,∆_b}| = q^{k−1} and that P_b forms a parallel class.
Remark 1. A k × n generator matrix over GF (q) where q is a prime power can also be considered
as a matrix over an extension field GF (q m ) where m is an integer. Thus, one can obtain a resolvable
design in this case as well; the corresponding parameters can be calculated in an easy manner.
Remark 2. We can also consider linear block codes over Z mod q where q is not necessarily a
prime or a prime power. In this case the conditions under which a resolvable design can be obtained
by forming the matrix T are a little more involved. We discuss this in Lemma 5 in the Appendix.
Example 10. Consider a (4, 2) linear block code over GF(3) with generator matrix

G = [ 1 0 1 1
      0 1 1 2 ].

Using T, we generate the resolvable block design (X, A) where the point set is X = {0, 1, 2, 3, 4, 5, 6, 7, 8}.
For instance, block B_{0,0} is obtained by identifying the column indexes of zeros in the first row of
T. The full collection of blocks is

A = {{0, 1, 2}, {3, 4, 5}, {6, 7, 8}, {0, 3, 6}, {1, 4, 7}, {2, 5, 8},
{0, 5, 7}, {1, 3, 8}, {2, 4, 6}, {0, 4, 8}, {2, 3, 7}, {1, 5, 6}}.

It can be observed that A has a resolution (cf. Definition 4) with the following parallel classes:

P_0 = {{0, 1, 2}, {3, 4, 5}, {6, 7, 8}},   P_1 = {{0, 3, 6}, {1, 4, 7}, {2, 5, 8}},
P_2 = {{0, 5, 7}, {1, 3, 8}, {2, 4, 6}},   P_3 = {{0, 4, 8}, {2, 3, 7}, {1, 5, 6}}.
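The construction in Example 10 can be reproduced directly. The sketch below assumes the columns of T are ordered with u0 as the most significant base-3 digit of the message (an ordering consistent with the blocks listed above); it rebuilds the blocks B_{i,l} and checks the conclusions of Lemma 1.

```python
from itertools import product

# Generator matrix of the (4, 2) code over GF(3) from Example 10.
G = [[1, 0, 1, 1],
     [0, 1, 1, 2]]
q, k, n = 3, 2, 4

# Column l of T is the codeword u * G, with u enumerated in lexicographic order.
codewords = [[sum(u[a] * G[a][i] for a in range(k)) % q for i in range(n)]
             for u in product(range(q), repeat=k)]

# Block B_{i,l} collects the column indexes whose i-th codeword symbol is l.
blocks = {(i, l): frozenset(j for j, c in enumerate(codewords) if c[i] == l)
          for i in range(n) for l in range(q)}

assert blocks[(0, 0)] == frozenset({0, 1, 2})
# Each row i yields a parallel class P_i: q disjoint blocks of size q^(k-1)
# that partition the point set {0, ..., q^k - 1}.
for i in range(n):
    cls = [blocks[(i, l)] for l in range(q)]
    assert all(len(b) == q ** (k - 1) for b in cls)
    assert frozenset.union(*cls) == frozenset(range(q ** k))
```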
We now introduce a special class of linear block codes whose generator matrices satisfy specific
rank properties. It turns out that resolvable designs obtained from these codes are especially suited
for usage in coded caching.

Consider the generator matrix G of an (n, k) linear block code over GF(q). The i-th column
of G is denoted by g_i. Let z be the least positive integer such that k + 1 divides nz (denoted by
(k + 1) | nz). We consider collections of k + 1 consecutive columns of G, where wraparounds over
the boundaries are allowed. For this purpose, let T_a = {a(k + 1), · · · , a(k + 1) + k} and
S_a = {j mod n : j ∈ T_a} for 0 ≤ a ≤ zn/(k + 1) − 1, and let G_{S_a} denote the submatrix of G
specified by the columns in S_a. We now define the (k, k + 1)-consecutive column property that is
central to the rest of the discussion.

Definition 5. (k, k + 1)-consecutive column property. We say that G satisfies the (k, k + 1)-
consecutive column property (CCP) if, for each 0 ≤ a ≤ zn/(k + 1) − 1, every k columns of G_{S_a}
are linearly independent.

Example 11. Consider the (4, 2) code of Example 10, for which z = 3. Here S_0 = {0, 1, 2}, S_1 =
{3, 0, 1}, S_2 = {2, 3, 0} and S_3 = {1, 2, 3}. The corresponding generator matrix G satisfies the
(k, k + 1) CCP, as any two columns of each of the submatrices G_{S_i}, i = 0, . . . , 3, are linearly
independent over GF(3).
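For k = 2, linear independence of two columns amounts to a nonzero 2 × 2 determinant mod q, so the CCP in Example 11 can be checked directly, as in this illustrative sketch (the helper `det2_mod` is our own).

```python
from itertools import combinations

G = [[1, 0, 1, 1],
     [0, 1, 1, 2]]           # columns g0..g3 over GF(3)
q, k, n = 3, 2, 4
z = 3                        # least z with (k + 1) | n z
S = [[(a * (k + 1) + j) % n for j in range(k + 1)]
     for a in range(z * n // (k + 1))]
assert S == [[0, 1, 2], [3, 0, 1], [2, 3, 0], [1, 2, 3]]

def det2_mod(ci, cj, q):
    """Determinant of the 2x2 matrix [g_ci g_cj] modulo q."""
    return (G[0][ci] * G[1][cj] - G[0][cj] * G[1][ci]) % q

# (k, k+1)-CCP with k = 2: every pair of columns inside each submatrix G_{S_a}
# must be linearly independent, i.e., have a nonzero 2x2 determinant mod q.
for Sa in S:
    for ci, cj in combinations(Sa, 2):
        assert det2_mod(ci, cj, q) != 0
print("G satisfies the (2, 3)-CCP")
```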
We note here that one can also define different levels of the consecutive column property. Let
T_a^α = {aα, · · · , aα + α − 1} and S_a^α = {j mod n : j ∈ T_a^α}, where z is now the least positive
integer such that α divides nz.

Definition 6. (k, α)-consecutive column property. Consider the submatrices of G specified by
G_{S_a^α} for 0 ≤ a ≤ zn/α − 1. We say that G satisfies the (k, α)-consecutive column property,
where α ≤ k, if each G_{S_a^α} has full rank. In other words, the α columns in each G_{S_a^α} are
linearly independent.
As pointed out in the sequel, codes that satisfy the (k, α)-CCP, where α ≤ k, will result in
caching systems that have a multiplicative rate gain of α over an uncoded system. Likewise, codes
that satisfy the (k, k + 1)-CCP will have a gain of k + 1 over an uncoded system. In the remainder
of this chapter, we will use the term CCP to refer to the (k, k + 1)-CCP if the value of k is clear
from the context.
A resolvable design generated from a linear block code that satisfies the CCP can be used in a
coded caching scheme as follows. We associate the users with the blocks. Each subfile is associated
with a point and an additional index. The placement scheme follows the natural incidence between
the blocks and the points; a formal description is given in Algorithm 2 and illustrated further in
Example 12.
Example 12. Consider the resolvable design from Example 10, where we recall that z = 3. The
blocks in A correspond to twelve users U012, U345, U678, U036, U147, U258, U057, U138, U246, U048,
U237, U156. Each file is partitioned into Fs = 9 × z = 27 subfiles, each of which is denoted by
W^s_{n,t}, t = 0, · · · , 8, s = 0, 1, 2. The cache of user U_abc, denoted Z_abc, is specified as

Z_abc = {W^s_{n,t} : t ∈ {a, b, c}, s ∈ {0, 1, 2}, n ∈ [N]}.

This corresponds to a coded caching system where each user U_B caches subfile W^s_{n,t}, with
B ∈ A, if t ∈ B. Therefore, each user caches a total of N q^{k−1} z subfiles. As each file consists of
q^k z subfiles, the cache fraction is M/N = 1/q.
It remains to show that we can design a delivery phase scheme that satisfies any possible demand
pattern. Suppose that in the delivery phase user UB requests file WdB where dB ∈ [N ]. The server
responds by transmitting several equations that satisfy each user. Each equation allows k + 1 users
from different parallel classes to simultaneously obtain a missing subfile. Our delivery scheme is
such that the set of transmitted equations can be classified into various recovery sets that correspond
to appropriate collections of parallel classes. For example, in Fig. 3.1, PS0 = {P0 , P1 , P2 }, PS1 =
{P0, P1, P3} and so on. It turns out that these recovery sets correspond precisely to the sets
S_a, 0 ≤ a ≤ zn/(k + 1) − 1, defined earlier. We illustrate this by means of the example below.
Example 13. Consider the placement scheme specified in Example 12. Let each user UB request
file WdB . The recovery sets are specified by means of the recovery set bipartite graph shown in Fig.
3.1, e.g., PS1 corresponds to S1 = {0, 1, 3}. The outgoing edges from each parallel class are labeled
arbitrarily with numbers 0, 1 and 2. Our delivery scheme is such that each user recovers missing
subfiles with a specific superscript from each recovery set that its corresponding parallel class
participates in. For instance, a user in parallel class P1 recovers missing subfiles with superscript 0
[Figure 3.1: The recovery set bipartite graph, with the parallel classes P0, P1, P2, P3 on one side,
the recovery sets on the other, and edge labels drawn from {0, 1, 2}.]
from PS0, superscript 1 from PS1 and superscript 2 from PS3; these superscripts are the labels of
the corresponding edges in Fig. 3.1. It can be verified, e.g., that user U012, which lies in P0,
recovers missing subfiles from the equations generated for the recovery set PS1 (which contains P0).
Each of these equations benefits three users.
P0 , any block from P1 and the last block from P3 so that the intersection of all these blocks is
empty. The fact that these equations are useful for the problem at hand is a consequence of the
CCP. The process of generating these equations can be applied to all possible recovery sets. It can
be shown that this allows all users to be satisfied at the end of the procedure.
In what follows, we first show that for the recovery set PSa it is possible to generate equations
that benefit k + 1 users simultaneously.

Claim 1. Consider the resolvable design (X, A) constructed as described in Section III.A by an
(n, k) linear block code that satisfies the CCP. Let PSa = {Pi | i ∈ Sa} for 0 ≤ a ≤ zn/(k + 1) − 1.
Then any k blocks B_{i1,l_{i1}}, . . . , B_{ik,l_{ik}} (where l_{ij} ∈ {0, . . . , q − 1}) that are picked from
any k distinct parallel classes of PSa intersect in exactly one point.
Before proving Claim 1, we discuss its application in the delivery phase. Note that the claim
asserts that k blocks chosen from k distinct parallel classes intersect in precisely one point. Now,
suppose that one picks k + 1 users from k + 1 distinct parallel classes, such that their intersection is
empty. These blocks (equivalently, users) can participate in an equation that benefits k +1 users. In
particular, each user will recover a missing subfile indexed by the intersection of the other k blocks.
We emphasize here that Claim 1 is at the core of our delivery phase. Of course, we need to justify
that enough equations can be found that allow all users to recover all their missing subfiles. This
follows from a natural counting argument that is presented more formally in the subsequent discussion.
The superscripts s ∈ {0, . . . , z − 1} are needed for the counting argument to go through.
Proof. Following the construction in Section III.A, we note that a block B_{i,l} ∈ P_i is specified by
the message vectors u that satisfy u g_i = l. Now consider B_{i1,l_{i1}}, . . . , B_{ik,l_{ik}} (where i_j ∈ S_a,
l_{ij} ∈ {0, . . . , q − 1}) that are picked from k distinct parallel classes of PSa. W.l.o.g. we assume
that i_1 < i_2 < · · · < i_k. Let I = {i_1, . . . , i_k} and let T_I denote the submatrix of T obtained by
retaining the rows in I. We will show that the system of equations u g_{ij} = l_{ij}, j = 1, . . . , k, has
a unique solution. By the CCP, the vectors g_{i1}, g_{i2}, . . . , g_{ik} are linearly independent. Therefore
this system of k equations in k variables has a unique solution over GF(q). The result follows.
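Claim 1 can be verified exhaustively for the running (4, 2) example. Since this code satisfies the pairwise independence condition for every pair of columns (Example 11), the sketch below (our own illustration) checks all pairs of parallel classes, not just those inside one recovery set.

```python
from itertools import product, combinations

# (4, 2) code over GF(3) from Example 10; every pair of its columns is
# linearly independent (Example 11), so the hypothesis of Claim 1 holds
# for any two parallel classes.
G = [[1, 0, 1, 1],
     [0, 1, 1, 2]]
q, k, n = 3, 2, 4
cw = [[sum(u[a] * G[a][i] for a in range(k)) % q for i in range(n)]
      for u in product(range(q), repeat=k)]

def block(i, l):
    """Block B_{i,l}: points (column indexes) whose i-th codeword symbol is l."""
    return {j for j in range(q ** k) if cw[j][i] == l}

# Claim 1 with k = 2: blocks from two distinct parallel classes (rows i1 != i2)
# intersect in exactly one point.
for i1, i2 in combinations(range(n), 2):
    for l1, l2 in product(range(q), repeat=2):
        assert len(block(i1, l1) & block(i2, l2)) == 1
print("Claim 1 verified for the (4, 2) example")
```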
We now provide an intuitive argument for the delivery phase. Recall that we form a recovery set
bipartite graph (see Fig. 3.1 for an example) with parallel classes and recovery sets as the disjoint
vertex subsets. The edges incident on each parallel class are labeled arbitrarily from 0, . . . , z − 1.
For a parallel class P ∈ PSa we denote this label by label(P − PSa ). For a given recovery set PSa ,
the delivery phase proceeds by choosing blocks from distinct parallel classes in PSa such that their
intersection is empty; this provides an equation that benefits k + 1 users. It turns out that the
equation allows a user in parallel class P ∈ PSa to recover a missing subfile with the superscript
label(P − PSa ).
The formal argument is made in Algorithm 3. For ease of notation in Algorithm 3, we denote
label(P − PSa) simply by E(P) when the recovery set is clear from the context.
Claim 2. Consider a user UB belonging to parallel class P ∈ PSa . The signals generated in
Algorithm 3 can recover all the missing subfiles needed by UB with superscript E(P).
Proof. Let Pα ∈ PSa. In the arguments below, we argue that the user U_{Bα,lα} that demands file
W_{κα,lα} can recover all its missing subfiles with superscript E(Pα). Note that |B_{α,lα}| = q^{k−1}.
Thus, user U_{Bα,lα} needs to obtain q^k − q^{k−1} missing subfiles with superscript E(Pα). Consider
an iteration of the while loop where block B_{α,lα} is picked in step 2. The equation in Algorithm 3
allows it to recover W^{E(Pα)}_{κα,lα, l̂α}, where l̂α = ∩_{j ∈ Sa\{α}} B_{j,lj}. This follows from Claim 1
and the fact that ∩_{j ∈ Sa} B_{j,lj} = ∅.
Next we count the number of equations that U_{Bα,lα} participates in. We can pick k − 1 users
from some k − 1 distinct parallel classes in PSa; this can be done in q^{k−1} ways. Claim 1 ensures
that the blocks so chosen intersect in a single point. Next we pick a block from the only remaining
parallel class in PSa such that the intersection of all the blocks is empty. This can be done in q − 1
ways. Thus, there are a total of q^{k−1}(q − 1) = q^k − q^{k−1} equations in which user U_{Bα,lα}
participates.
It remains to argue that each equation provides a distinct subfile. Towards this end, suppose
that two different equations provide the same subfile, i.e., there exist choices l_{ij} and l'_{ij} with
l_{ij} ≠ l'_{ij} for at least one index j, but ∩_{j=1}^k B_{ij,l_{ij}} = ∩_{j=1}^k B_{ij,l'_{ij}} = β. This is a
contradiction, since it implies that β ∈ B_{ij,l_{ij}} ∩ B_{ij,l'_{ij}} for an index j where l_{ij} ≠ l'_{ij},
which is impossible since two distinct blocks from the same parallel class have an empty intersection.
As the algorithm is symmetric with respect to all blocks in parallel classes belonging to PSa,
the same argument applies to every user, which concludes the proof.
Lemma 2. The proposed delivery scheme terminates and allows each user's demand to be satisfied.
Furthermore, the transmission rate of the server is (q − 1)n/(k + 1) and the subpacketization level
is q^k z.
The main requirement for Lemma 2 to hold is that the recovery set bipartite graph be biregular,
where multiple edges between the same pair of nodes are disallowed and the degree of each parallel
class is z. It is not too hard to see that this follows from the definition of the recovery sets.
In an analogous manner, if one starts with the generator matrix of a code that satisfies the (k, α)-CCP for α ≤ k, then we obtain the following result. The details are similar to the discussion for the (k, k + 1)-CCP and can be found in the Appendix (Section A).
Corollary 1. Consider a coded caching scheme obtained by forming the resolvable design obtained from an (n, k) code that satisfies the (k, α)-CCP where α ≤ k. Let z be the least positive integer such that α | nz. Then, a delivery scheme can be constructed such that the transmission rate is (q − 1)n/α and the subpacketization level is q^k z.
3.2.4 Obtaining a scheme for M/N = 1 − (k + 1)/(nq).
The construction above works for a system where M/N = 1/q. It turns out that this can be converted into a scheme for M/N = 1 − (k + 1)/(nq). Thus, any convex combination of these two points can be obtained by memory-sharing.
Towards this end, we note that the class of coded caching schemes considered here can be
specified by an equation-subfile matrix. This is inspired by the hypergraph formulation and the
placement delivery array (PDA) based schemes for coded caching in Shangguan et al. (2018) and
Yan et al. (2017a). Each equation is assumed to be of the all-but-one type, i.e., it is of the form W_{d_{t1},A_{j1}} ⊕ W_{d_{t2},A_{j2}} ⊕ · · · ⊕ W_{d_{tm},A_{jm}}, where for each ℓ ∈ [m], we have the property that user U_{tℓ} does not cache subfile W_{n,A_{jℓ}} but caches all subfiles W_{n,A_{js}} for s ∈ [m], s ≠ ℓ.
Specifically, the scheme is specified by a ∆ × F_s matrix S; we associate each row of S with an equation and each column with a subfile. We denote the i-th row
of S by Eqi and j-th column of S by Aj . The value S(i, j) = t if in the i-th equation, user Ut
recovers subfile Wdt ,Aj , otherwise, S(i, j) = 0. Suppose that these ∆ equations allow each user to
satisfy their demands, i.e., S corresponds to a valid coded caching scheme. It is not too hard to
see that the placement scheme can be obtained by examining S. Namely, user Ut caches the subfile
corresponding to the j-th column if integer t does not appear in the j-th column.
Example 14. Consider a coded caching system in Maddah-Ali and Niesen (2014b) with K = 4, M/N = 1/2, and the following equation-subfile matrix S.

       A1  A2  A3  A4  A5  A6
 Eq1    3   2   0   1   0   0
 Eq2    4   0   2   0   1   0
 Eq3    0   4   3   0   0   1
 Eq4    0   0   0   4   3   2

Upon examining S it is evident, for instance, that user U1 caches subfiles A1, . . . , A3, as the number 1 does not appear in the corresponding columns. Similarly, the cache placement of the other users can be obtained. Interpreting this placement scheme in terms of the user-subfile assignment, it can be verified that the design so obtained corresponds to the transpose of the scheme considered in Example 3.1 (and also to the scheme of Maddah-Ali and Niesen (2014b) for K = 4, M/N = 1/2).
Lemma 3. Consider a ∆×Fs equation-subfile matrix S whose entries belong to the set {0, 1, . . . , K}.
It corresponds to a valid coded caching system if the following three conditions are satisfied.
Proof. The placement scheme is obtained as discussed earlier, i.e., user Ut caches subfiles Wn,Aj if
integer t does not appear in column Aj . Therefore, matrix S corresponds to a placement scheme.
Next we discuss the delivery scheme. Note that Eq_i corresponds to an equation of the form W_{d_{t1},A_{j1}} ⊕ W_{d_{t2},A_{j2}} ⊕ · · · ⊕ W_{d_{tm},A_{jm}}, where S(i, j1) = t1, · · · , S(i, jm) = tm. The above equation can allow m users to recover subfiles
simultaneously if (a) Ut` does not cache Wn,Aj` and (b) Ut` caches all Wn,Ajs where {js : s ∈
[m], s 6= `}. It is evident that Ut` does not cache Wn,Aj` owing to the placement scheme. Next, to
guarantee the condition (b), we need to show that integer t` = S(i, j` ) will not appear in column
Ajs in S where {js : s ∈ [m], s 6= `}. Towards this end, t` 6= S(i, js ) because of Condition 2. Next,
consider the non-zero entries that lie in column A_{js} but not in row Eq_i. Assume there exists an entry S(i′, js) such that S(i′, js) = S(i, jℓ) = tℓ with i′ ≠ i; then, since S(i, js) = ts ≠ 0, we obtain a contradiction to Condition 3. Finally, Condition 1 guarantees that each missing subfile is recovered only once.
User U_t caches a fraction M_t/N = L_t/F_s, where L_t is the number of columns of S that do not contain the entry t. Similarly, the transmission rate is given by R = ∆/F_s.
The crucial point is that the transpose of S, i.e., ST also corresponds to a coded caching scheme.
This follows directly from the fact that ST also satisfies the conditions in Lemma 3. In particular,
S^T corresponds to a coded caching system with K users and ∆ subfiles. In the placement phase, the cache fraction of U_t is M′_t/N = (∆ − F_s + L_t)/∆. In the delivery phase, by transmitting the F_s equations corresponding to the rows of S^T, all missing subfiles can be recovered. Then, the transmission rate is R = F_s/∆.
Applying the above discussion in our context, consider the equation-subfile matrix S corresponding to the coded caching system with K = nq, M_t/N = 1/q for 1 ≤ t ≤ nq, F_s = q^k z and ∆ = q^k (q − 1) nz/(k + 1). Then S^T corresponds to a system with K′ = nq, M′/N = 1 − (k + 1)/(nq), F′_s = (q − 1)q^k zn/(k + 1), and transmission rate R′ = F_s/∆ = (k + 1)/((q − 1)n). The following theorem is the main result of this chapter.
Theorem 2. Consider an (n, k) linear block code over GF(q) that satisfies the (k, k + 1)-CCP. This corresponds to a coded caching scheme with K = nq users and N files in the server, where each user has a cache of size M ∈ {(1/q)N, (1 − (k + 1)/(nq))N}. Let z be the least positive integer such that (k + 1) | nz.
When M/N = 1/q, we have

R = (q − 1)n/(k + 1),  and
F_s = q^k z.

When M/N = 1 − (k + 1)/(nq), we have

R = (k + 1)/((q − 1)n),  and
F_s = (q − 1)q^k zn/(k + 1).
In a similar manner, for an (n, k) linear block code that satisfies the (k, α)-CCP over GF(q), the caching system where M/N = 1/q can be converted into a system where K′ = nq, M′/N′ = 1 − α/(nq), F′_s = (q − 1)q^k zn/α and R′ = α/((q − 1)n) using the equation-subfile technique. The arguments are essentially the same as for the (k, k + 1)-CCP.
At this point we have established that linear block codes that satisfy the CCP are attractive
candidates for usage in coded caching. In this section, we demonstrate that there are a large class
of generator matrices that satisfy the CCP. For most of the section we work with matrices over
a finite field of order q. In the last subsection, we discuss some constructions for matrices over Z
mod q when q is not a prime or a prime power. We summarize the constructions presented in this section in Table 3.1.
(n, k) MDS codes with minimum distance n − k + 1 are clearly a class of codes that satisfy the CCP. In fact, for these codes any k columns of the generator matrix can be shown to be full rank. Note, however, that MDS codes typically need a large field size, e.g., q + 1 ≥ n (assuming that the MDS conjecture is true) Roth (2006). In our construction, the value of M/N = 1/q and the number of users is K = nq. Thus, for large n, we will only obtain systems with small values of M/N = 1/q, or equivalently large values of M/N = 1 − (k + 1)/(nq) (by Theorem 2 above). This may be restrictive in practice.
A cyclic code is a linear block code where the circular shift of each codeword is also a codeword Lin and Costello (2004). An (n, k) cyclic code over GF(q) is specified by a monic polynomial g(X) = Σ_{i=0}^{n−k} g_i X^i with coefficients from GF(q), where g_{n−k} = 1 and g_0 ≠ 0; g(X) needs to divide the polynomial X^n − 1. The generator matrix of the cyclic code is obtained as below.

G = [ g_0   g_1   · · ·  g_{n−k}   0        · · ·  0
      0     g_0   g_1    · · ·     g_{n−k}  · · ·  0
      ·                  ·                         ·
      0     · · ·  0     g_0       g_1      · · ·  g_{n−k} ]
The following claim shows that for verifying the CCP for a cyclic code it suffices to check any set of k + 1 consecutive columns of G.
Claim 3. Consider an (n, k) cyclic code with generator matrix G. Let G_S denote a set of k + 1 consecutive columns of G. If each k × k submatrix of G_S is full rank, then G satisfies the (k, k + 1)-CCP.

Proof. Let the generator polynomial of the cyclic code be g(X), where we note that g(X) has degree n − k. Let G_S = [g_{(a)_n}, g_{(a+1)_n}, · · · , g_{(a+k)_n}], where we assume that each k × k submatrix of G_S is full rank. Let

G_{S′\j} = [g_{(a+i)_n}, . . . , g_{(a+j−1+i)_n}, g_{(a+j+1+i)_n}, . . . , g_{(a+k+i)_n}].

We need to show that if G_{S\j} has full rank, then G_{S′\j} has full rank, for any 0 ≤ j ≤ k. As G_{S\j} has full rank, there is no codeword c ≠ 0 such that c((a)_n) = · · · = c((a + j − 1)_n) = c((a + j + 1)_n) = · · · = c((a + k)_n) = 0. By the definition of a cyclic code, any circular shift of a codeword results in another codeword that belongs to the code. Therefore, there is no codeword c′ ≠ 0 such that c′((a + i)_n) = · · · = c′((a + j − 1 + i)_n) = c′((a + j + 1 + i)_n) = · · · = c′((a + k + i)_n) = 0, i.e., G_{S′\j} has full rank.
Claim 3 implies a low complexity search algorithm to determine if a cyclic code satisfies the CCP. Instead of checking all G_{S_a}, 0 ≤ a ≤ zn/(k + 1) − 1, in Definition 5, we only need to check an arbitrary G_S = [g_{(i)_n}, g_{(i+1)_n}, · · · , g_{(i+k)_n}], for 0 ≤ i < n. To further simplify the search, we choose i = n − ⌊k/2⌋ − 1.
For this choice of i, Claim 4 shows that G_S is such that we only need to check the rank of a list of small-dimension matrices to determine if each k × k submatrix of G_S is full rank.

Claim 4. A cyclic code with generator matrix G satisfies the CCP if the following conditions hold.
Example 15. Consider the polynomial g(X) = X^4 + X^3 + X + 2 over GF(3). Since it divides X^8 − 1, it is the generator polynomial of an (8, 4) cyclic code over GF(3). The generator matrix G of this code can be formed as discussed above, and the set of five rightmost columns of G is such that all 4 × 4 submatrices of it are full rank. Thus, by Claim 3 the code satisfies the CCP.
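The rank condition of Example 15 can be verified with a few lines of code. A minimal sketch, assuming (as stated) that g(X) divides X^8 − 1 over GF(3); `rank_mod_q` is an illustrative helper:

```python
# Sketch: build the 4 x 8 generator matrix of the cyclic code with
# g(X) = X^4 + X^3 + X + 2 over GF(3) and check that every 4 x 4
# submatrix of its 5 rightmost columns is full rank (cf. Claim 3).
from itertools import combinations

Q = 3
g = [2, 1, 0, 1, 1]          # coefficients g0, ..., g4 of g(X)
n, k = 8, 4

# Row i of G is g shifted i places to the right, zero-padded to length n.
G = [[0] * i + g + [0] * (n - len(g) - i) for i in range(k)]

def rank_mod_q(rows, q):
    """Rank over GF(q) via Gaussian elimination (q prime)."""
    m = [r[:] for r in rows]
    rank, col = 0, 0
    while rank < len(m) and col < len(m[0]):
        piv = next((r for r in range(rank, len(m)) if m[r][col] % q), None)
        if piv is None:
            col += 1
            continue
        m[rank], m[piv] = m[piv], m[rank]
        inv = pow(m[rank][col], q - 2, q)            # inverse by Fermat
        m[rank] = [(x * inv) % q for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col] % q:
                f = m[r][col]
                m[r] = [(x - f * y) % q for x, y in zip(m[r], m[rank])]
        rank, col = rank + 1, col + 1
    return rank

last5 = [row[3:] for row in G]   # the 5 rightmost columns of G
full = all(
    rank_mod_q([[row[j] for j in cols] for row in last5], Q) == k
    for cols in combinations(range(5), k)
)
print(full)   # True
```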
Remark 3. Cyclic codes form an important class of codes that satisfy the (k, k)-CCP (cf. Definition 6). This is because it is well-known Lin and Costello (2004) that any k consecutive columns of the generator matrix of a cyclic code are linearly independent.
It is well recognized that cyclic codes do not necessarily exist for any choice of parameters. This
is because of the divisibility requirement on the generator polynomial. We now discuss a more
general construction of generator matrices that satisfy the CCP. As we shall see, this construction
provides a more or less satisfactory solution for a large range of system parameters.
Our first simple observation is that the Kronecker product (denoted by ⊗ below) of a z × α generator matrix that satisfies the (z, z)-CCP with a t × t identity matrix I_{t×t} immediately yields a larger generator matrix that satisfies the corresponding CCP.
Claim 5. Consider an (n, k) linear block code over GF(q) whose generator matrix is specified as G = A ⊗ I_{t×t}, where A is a z × α matrix that satisfies the (z, z)-CCP. Then, G satisfies the (k, k)-CCP with k = zt and n = αt.

Proof. The recovery set for A is specified as S^z_a = {(az)_α, · · · , (az + z − 1)_α} and the recovery set for G is specified as S^k_a = {(ak)_n, · · · , (ak + k − 1)_n}. Since A satisfies the (z, z)-CCP, A_{S^z_a} has full rank. Note that G_{S^k_a} = A_{S^z_a} ⊗ I_{t×t}. Then det(G_{S^k_a}) = det(A_{S^z_a} ⊗ I_{t×t}) = det(A_{S^z_a})^t ≠ 0. Therefore, G satisfies the (k, k)-CCP.

Remark 4. Let A be the generator matrix of a cyclic code over GF(q); then G = A ⊗ I_{t×t} satisfies the (k, k)-CCP.
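The determinant identity invoked in the proof of Claim 5 can be sanity-checked numerically. A minimal sketch with a hypothetical 2 × 2 integer matrix A and t = 3 (det(A ⊗ I_t) = det(A)^t over the integers, hence the same conclusion modulo any q):

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def det(M):
    """Determinant by Laplace expansion (fine for the small sizes here)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1)**j * M[0][j] * det([r[:j] + r[j + 1:] for r in M[1:]])
               for j in range(len(M)))

A = [[1, 2], [3, 5]]                                      # det(A) = -1
I3 = [[int(i == j) for j in range(3)] for i in range(3)]  # I_{3x3}
print(det(kron(A, I3)) == det(A)**3)   # True
```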
Our next construction addresses the (k, k + 1)-CCP. In what follows, we use the following notation.
• 1_a : the all-ones column vector [1, · · · , 1]^T of length a;
• C(c_1, c_2)_{a×b} : the a × b matrix whose first row is [c_1, c_2, 0, · · · , 0] and in which each subsequent row is the cyclic shift (one place to the right) of the row above it.
Claim 6 below constructs an (n, k) linear code that satisfies the CCP over GF(q) where q > α. Since α = n/t, the required field size in Claim 6 is lower than that of the MDS code considered in Section 3.3.1.
Claim 6. Consider an (n, k) linear block code over GF(q) whose generator matrix is specified as follows,

G = [ b_{00} I_{t×t}                        · · ·  b_{0(α−1)} I_{t×t}
      b_{10} I_{t×t}                        · · ·  b_{1(α−1)} I_{t×t}
      ·                                            ·
      b_{(z−2)0} I_{t×t}                    · · ·  b_{(z−2)(α−1)} I_{t×t}
      C(b_{(z−1)0}, b_{(z−1)0})_{(t−1)×t}   · · ·  C(b_{(z−1)(α−1)}, b_{(z−1)(α−1)})_{(t−1)×t} ]   (3.3)

where the coefficients b_{ij} form the z × α matrix

B = [ b_{00}       b_{01}       · · ·  b_{0(α−1)}
      b_{10}       b_{11}       · · ·  b_{1(α−1)}
      ·                                ·
      b_{(z−1)0}   b_{(z−1)1}   · · ·  b_{(z−1)(α−1)} ]

which is chosen to be a Vandermonde matrix with distinct nonzero parameters from GF(q) (recall that q > α). Then, G satisfies the (k, k + 1)-CCP.
Proof. The proof again leverages the idea that G can be expressed succinctly by using Kronecker products.

Consider the case when α = z + 1. We construct an (n, k) linear code satisfying the CCP over GF(q) where q ≥ z. It can be noted that this constraint on the field size is looser than the corresponding constraint in Claim 6.
Given an (n, k) code that satisfies the CCP, we can use it to obtain higher values of n in a simple manner.
Claim 8. Consider an (n, k) linear block code over GF(q) with generator matrix G that satisfies the CCP. Let the first k + 1 columns of G be denoted by the submatrix D. Then the matrix G′ = [D | · · · | D | G], containing s copies of D, of dimension k × (n + s(k + 1)), also satisfies the CCP.
Claim 8 can provide more parameter choices and more possible code constructions. For example,
given n, k, q, where k + 1 + (n)k+1 ≤ q + 1 < n, there may not exist a (n, k)-MDS code over GF (q).
However, there exists a (k + 1 + (n)k+1 , k)-MDS code over GF (q). By Claim 8, we can obtain a
(n, k) linear block code over GF (q) that satisfies the CCP. Similarly, combining Claim 4, Claim 6,
Claim 7 with Claim 8, we can obtain more linear block codes that satisfy the CCP.
A result very similar to Claim 8 can be obtained for the (k, α)-CCP. Specifically, consider an (n, k) linear block code with generator matrix G that satisfies the (k, α)-CCP and let D be the first α columns of G. Then, G′ = [D | · · · | D | G], containing s copies of D, of dimension k × (n + sα), also satisfies the (k, α)-CCP.
We now discuss constructions where q is not a prime or a prime power. We attempt to construct
matrices over the ring Z mod q in this case. The issue is somewhat complicated by the fact that
a square matrix over Z mod q has linearly independent rows if and only if its determinant is a unit
in the ring Dummit and Foote (2003). In general, this fact makes it harder to obtain constructions
such as those in Claim 6 that exploit the Vandermonde structure of the matrices. Specifically, the
difference of units in a ring is not guaranteed to be a unit. However, we can still provide some
constructions. It can be observed that Claim 5 and Claim 8 hold for linear block codes over Z mod q as well.
Claim 9. Let G = [I_{k×k} | 1_k], i.e., the generator matrix of a (k + 1, k) single parity check (SPC) code, where the entries are from Z mod q. Then G satisfies the (k, k + 1)-CCP and the (k, k)-CCP.
Proof. It is not too hard to see that when G = [Ik×k |1k ], any k×k submatrix of G has a determinant
which is ±1, i.e., it is a unit over Z mod q. Thus, the result holds in this case.
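The ±1 determinant claim is easy to confirm exhaustively for a small k. A minimal sketch for k = 4 (the construction generalizes in the obvious way):

```python
# Sketch: every k x k submatrix of G = [I_k | 1_k] has determinant +1 or
# -1, hence is a unit in Z mod q for any q (cf. Claim 9).
from itertools import combinations

def det(M):
    """Determinant by Laplace expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1)**j * M[0][j] * det([r[:j] + r[j + 1:] for r in M[1:]])
               for j in range(len(M)))

k = 4
G = [[int(i == j) for j in range(k)] + [1] for i in range(k)]   # [I_k | 1_k]
dets = {det([[row[j] for j in cols] for row in G])
        for cols in combinations(range(k + 1), k)}
print(sorted(dets))   # [-1, 1]
```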
Claim 10. The following matrix with entries from Z mod q satisfies the (k, k + 1)-CCP. Here k = 2t − 1 and n = 3t.

G = [ I_{t×t}   0                  1_t       I_{t×t}
      0         I_{(t−1)×(t−1)}    1_{t−1}   C(1, −1)_{(t−1)×t} ]
Proof. This can be proved by following the arguments in the proof of Claim 7, restricted to the submatrices for which we need to check the property. These correspond to simpler instances of the submatrices considered in Types I – III in the proof of Claim 7. In particular, the corresponding determinants can be verified to be ±1, i.e., units in Z mod q.
Remark 5. We note that the general construction in Claim 7 can potentially fail in the case when
the matrices are over Z mod q. This is because in one of the cases under consideration (specifically,
Type III, Case 1), the determinant depends on the difference of the bi values. The difference of
units in Z mod q is not guaranteed to be a unit, thus there is no guarantee that the determinant
is a unit.
Remark 6. We can use Claim 8 to obtain higher values of n based on the above two classes of codes.
While most constructions of cyclic codes are over GF(q), there has been some work on constructing cyclic codes over Z mod q. Specifically, Blake (1972) provides a construction where q = q_1 q_2 · · · q_d is a product of distinct primes. By the Chinese remainder theorem any element γ ∈ Z mod q has a unique representation in terms of its residues modulo q_i, for i = 1, . . . , d. Let ψ : Z mod q → GF(q_1) × · · · × GF(q_d) denote this map.
• Suppose that (n, ki ) cyclic codes over GF (qi ) exist for all i = 1, . . . , d. Each individual code
is denoted C i .
• Let C denote the code over Z mod q. Let c^{(i)} ∈ C^i for i = 1, . . . , d. The codeword c ∈ C is obtained as follows: the j-th component of c is c_j = ψ^{−1}(c_j^{(1)}, . . . , c_j^{(d)}).
Therefore, there are q_1^{k_1} q_2^{k_2} · · · q_d^{k_d} codewords in C. It is also evident that C is cyclic. As discussed in Section 3.2.1, we form the matrix T for the codewords in C. It turns out that using T and the technique discussed in Section 3.2.1, we can obtain a resolvable design. Furthermore, the gain of the system in the delivery phase can be shown to be k_min = min{k_1, k_2, · · · , k_d}. We discuss these points next.
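The CRT combining step ψ^{−1} is easy to illustrate. A minimal sketch for q = 2 · 3 = 6, with hypothetical component codewords c1 and c2 (chosen only for illustration, not taken from any specific code):

```python
# Sketch: componentwise CRT combining of codewords over GF(2) and GF(3)
# into a codeword over Z mod 6 (the inverse of the map psi).
def crt_pair(r1, q1, r2, q2):
    """Find the unique x in Z mod (q1*q2) with x = r1 (mod q1), x = r2 (mod q2)."""
    for x in range(q1 * q2):
        if x % q1 == r1 and x % q2 == r2:
            return x

c1 = [1, 0, 1]          # hypothetical binary codeword of length n = 3
c2 = [2, 1, 0]          # hypothetical ternary codeword of the same length
c = [crt_pair(a, 2, b, 3) for a, b in zip(c1, c2)]
print(c)   # [5, 4, 3]
```

Reducing c modulo 2 and modulo 3 recovers c1 and c2 respectively, so cyclic shifts of c reduce to cyclic shifts of the components, which is why C is again cyclic.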
3.4.1 Discussion

When the number of users is K = nq and the cache fraction is M/N = 1/q, we have shown in Theorem 2 that the gain is g = k + 1 and F_s = q^k z. Therefore, both the gain and the subpacketization level increase with larger k. Thus, for our approach, given a subpacketization budget F_s^0, the highest coded gain that can be obtained is g_max = k_max + 1, where k_max is the largest integer such that q^{k_max} z ≤ F_s^0 and there exists an (n, k_max) linear block code that satisfies the CCP.
For determining k_max, we have to characterize the collection of values of k such that there exists an (n, k) linear code that satisfies the CCP over GF(q) or Z mod q. We use our proposed constructions
(MDS code, Claim 4, Claim 6, Claim 7, Claim 8, Claim 9, Claim 10) for this purpose. We call this
collection C(n, q) and generate it in Algorithm 4. We note here that it is entirely possible that there
are other linear block codes that fit the appropriate parameters and are outside the scope of our
constructions. Thus, the list may not be exhaustive. In addition, we note that we only check for
the (k, k + 1)-CCP. Working with the (k, α)-CCP where α ≤ k can provide more operating points.
Example 16. Consider a caching system with K = nq = 12 × 5 = 60 users and cache fraction M/N = 1/5. Suppose that the subpacketization budget is 1.5 × 10^6. By checking all k < n we can construct C(n, q) (see Table 3.2). As a result, C(n, q) = {1, 2, 3, 4, 5, 6, 7, 8, 9, 11}. Then k_max = 8, F_s ≈ 1.17 × 10^6 and the maximal coded gain we can achieve is g_max = 9. By contrast, the scheme in Maddah-Ali and Niesen (2014b) can achieve coded gain g = KM/N + 1 = 13 but requires subpacketization level F_s = \binom{K}{KM/N} ≈ 1.4 × 10^12.
We can achieve almost the same rate by performing memory-sharing within the scheme of Maddah-Ali and Niesen (2014b) in this example. In particular, we divide each file of size Ω into two smaller subfiles W_n^1 and W_n^2, where |W_n^1| = (9/10)Ω and |W_n^2| = (1/10)Ω. The scheme of Maddah-Ali and Niesen (2014b) is then applied separately on W_n^1 and W_n^2 with M_1/N_1 = 2/15 (corresponding to W_n^1) and M_2/N_2 = 13/15 (corresponding to W_n^2). Thus, the overall cache fraction is 0.9 × (2/15) + 0.1 × (13/15) ≈ 1/5. The overall coded gain of this scheme is g ≈ 9. However, the
Figure 3.2 A comparison of rate and subpacketization level vs. M/N for a system with K = 64 users. The left y-axis shows the rate and the right y-axis shows the logarithm of the subpacketization level. The green and the blue curves correspond to two of our proposed constructions. Note that our schemes allow for multiple orders of magnitude reduction in subpacketization level at the expense of a small increase in coded caching rate.
subpacketization level is F_s^{MS} = \binom{K}{KM_1/N_1} + \binom{K}{KM_2/N_2} ≈ 5 × 10^9, which is much greater than the subpacketization budget.
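The subpacketization count of this memory-sharing alternative is a two-term binomial sum:

```python
# Sketch: Fs^MS for K = 60 with corner points M1/N1 = 2/15, M2/N2 = 13/15.
from math import comb

K = 60
t1, t2 = K * 2 // 15, K * 13 // 15       # t1 = 8, t2 = 52
fs_ms = comb(K, t1) + comb(K, t2)
print(fs_ms)   # 5117241690, i.e. about 5.1e9
```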
In Fig. 3.2, we present another comparison for system parameters K = 64 and different values
of M/N . The scheme of Maddah-Ali and Niesen (2014b) works for all M/N such that KM/N is an
integer. In Fig. 3.2, our plots have markers corresponding to M/N values that our scheme achieves.
For ease of presentation, both the rate (left y-axis) and the logarithm of the subpacketization
level (right y-axis) are shown on the same plot. We present results corresponding to two of our
construction techniques: (i) the SPC code and (ii) a smaller SPC code coupled with Claim 8. It
can be seen that our subpacketization levels are several orders of magnitude smaller with only a modest increase in rate.
An in-depth comparison for general parameters is discussed next. In the discussion below, we
shall use the superscript ∗ to refer to the rates and subpacketization levels of our proposed scheme.
3.4.2 Comparison with memory-sharing within the scheme of Maddah-Ali and Niesen
(2014b)
Suppose that for given K, M and N, a given rate R can be achieved by memory-sharing within the scheme of Maddah-Ali and Niesen (2014b) between the corner points (M_1, R_1), (M_2, R_2), · · · , (M_d, R_d), where M_i = t_i N/K for some integer t_i. Then

R_i = K(1 − M_i/N)/(1 + KM_i/N),   R = Σ_{i=1}^{d} λ_i R_i,   M/N = Σ_{i=1}^{d} λ_i M_i/N,   and   Σ_{i=1}^{d} λ_i = 1.

The subpacketization level is F_s^{MS} = Σ_{i=1}^{d} \binom{K}{KM_i/N}. In addition, we note that the function h(x) = K(1 − x)/(1 + Kx) is convex in the parameter 0 ≤ x ≤ 1. This can be verified by computing its second derivative.
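The convexity of h(·) can be checked directly by differentiation:

```latex
h(x) = \frac{K(1-x)}{1+Kx}, \qquad
h'(x) = -\frac{K(K+1)}{(1+Kx)^{2}}, \qquad
h''(x) = \frac{2K^{2}(K+1)}{(1+Kx)^{3}} > 0 \quad \text{for } x \ge 0,
```

so h is convex on [0, 1].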
To see this, consider the following argument. Suppose that the above statement is not true. Then, there exists a scheme that operates via memory-sharing between points (M_1, R_1), · · · , (M_d, R_d) such that F_s^{MS} < \binom{K}{KM′/N}. Note that \binom{K}{KM_1/N} < \binom{K}{KM_2/N} if M_1/N < M_2/N ≤ 1/2 or M_1/N > M_2/N ≥ 1/2. By the convexity of h(·), we can conclude that (M, R) is not in the convex hull of the corner points, which is a contradiction.
Next, we compare this lower bound on F_s^{MS} to the subpacketization level of our proposed scheme. In principle, we can solve the system of equations (3.5) and (3.6) for R = n(q − 1)/(k + 1) and M/N = 1/q and obtain the appropriate λ and M^∗ values. Unfortunately, doing this analytically becomes quite messy and does not yield much intuition. Instead, we illustrate the reduction in subpacketization by means of an example.
Example 17. Consider a (9, 5) linear block code over GF(2) with generator matrix specified below.

G = [ 1 0 0 0 0 1 1 0 0
      0 1 0 0 0 1 0 1 0
      0 0 1 0 0 1 0 0 1
      0 0 0 1 0 1 1 1 0
      0 0 0 0 1 1 0 1 1 ]

It can be checked that G satisfies the (5, 6)-CCP. Thus, it corresponds to a coded caching system with K = 9 × 2 = 18 users. Our scheme achieves the points M_1/N = 1/2, R_1 = 3/2, F^∗_{s,1} = 64 and M_2/N = 2/3, R_2 = 2/3, F^∗_{s,2} = 96.
On the other hand, for M_1/N = 1/2, R_1 = 3/2, by numerically solving (3.5) and (3.6) we obtain M_1^∗/N ≈ 0.227 and therefore M_1′/N = 5/18. Then F_{s,1}^{MS} ≥ \binom{18}{5} = 8568, which is much higher than F^∗_{s,1} = 64. A similar calculation shows that M_2^∗/N ≈ 1/4 and therefore M_2′/N = 5/18. Thus F_{s,2}^{MS} is also at least as large as 8568, which is still much higher than F^∗_{s,2} = 96.
The next set of comparisons are with other proposed schemes in the literature. We note here
that several of these are restrictive in the parameters that they allow.
3.4.3 Comparison with Maddah-Ali and Niesen (2014b), Yan et al. (2017a), Yan et al. (2017b) and Shangguan et al. (2018)

For comparison with Maddah-Ali and Niesen (2014b), let R^{MN} and F_s^{MN} denote the rate and the subpacketization level of the scheme of Maddah-Ali and Niesen (2014b), respectively.
For the two operating points of Theorem 2, we have

R^∗/R^{MN} = (1 + n)/(1 + k),     for M/N = 1/q,
R^∗/R^{MN} = (nq − k)/(nq − n),   for M/N = 1 − (k + 1)/(nq).
Figure 3.3 The plot shows the gain in the scaling exponent obtained using our techniques for different values of M/N = 1/q. Each curve corresponds to a choice of η = k/n.
• If M/N = 1/q, we have

lim_{n→∞} (1/K) log_2 (F_s^{MN}/F_s^∗) = H_2(1/q) − (η/q) log_2 q.   (3.7)

• If M/N = 1 − (k + 1)/(nq), we have

lim_{n→∞} (1/K) log_2 (F_s^{MN}/F_s^∗) = H_2(η/q) − (η/q) log_2 q.   (3.8)
In the above expressions, 0 < η = k/n ≤ 1 and H2 (·) represents the binary entropy function.
Proof. Both results are simple consequences of approximating \binom{K}{Kp} ≈ 2^{KH_2(p)}, Graham et al. (1994).
It is not too hard to see that F_s^∗ is exponentially lower than F_s^{MN}. Thus, our rate is higher, but the subpacketization level is exponentially lower. The gain in the scaling exponent with respect to the scheme of Maddah-Ali and Niesen (2014b) depends on the choice of R and the value of M/N. In Fig. 3.3 we plot this gain for different values of R and q. The plot assumes that codes satisfying the CCP can be found for these rates and corresponds to the gain in eq. (3.7).
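Equation (3.7) can be sanity-checked numerically. A sketch for q = 2, η = 1/2, treating the bounded factor z as o(1) per user:

```python
# Sketch: (1/K) log2(Fs^MN / Fs*) should approach
# H2(1/q) - (eta/q) log2 q = 1 - 0.25 = 0.75 for q = 2, eta = 1/2.
from math import comb, log2

q, eta = 2, 0.5
for n in (50, 200, 800):
    K, k = n * q, int(eta * n)
    # Fs^MN = C(K, K/q); Fs* = q^k z with bounded z, so z contributes o(1)
    exponent = (log2(comb(K, K // q)) - k * log2(q)) / K
    print(n, round(exponent, 4))
# The printed values climb toward 0.75 as n grows.
```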
In Yan et al. (2017a) a scheme for the case when M/N = 1/q or M/N = 1 − 1/q with subpacke-
tization level exponentially smaller with respect to Maddah-Ali and Niesen (2014b) was presented.
This result can be recovered as a special case of our work (Theorem 2) when the linear block code is
chosen as a single parity check code over Z mod q. In this specific case, q does not need to be a
prime power. Thus, our results subsume the results of Yan et al. (2017a).
In a more recent preprint, reference Yan et al. (2017b) proposed a caching system with K = \binom{m}{a}, F_s = \binom{m}{b} and M/N = 1 − λ\binom{m−a}{b−β}/\binom{m}{b}, where λ = \binom{a}{β}. The corresponding rate is

R = (\binom{m}{a+b−2β}/\binom{m}{b}) · min{ \binom{m−(a+b−2β)}{β}, \binom{a+b−2β}{a−β} },
where m, a, b, β are positive integers and 0 < a < m, 0 < b < m, 0 ≤ β ≤ min{a, b}. While a
precise comparison is somewhat hard, we can compare the schemes for certain parameter choices,
Let a = 2, β = 1, m = 2b. This corresponds to a coded caching system with K = b(2b − 1) ≈ 2b², M/N = (b − 1)/(2b − 1) ≈ 1/2, F_s = \binom{2b}{b} ≈ 2^{2b}, and R = b. For comparison with our scheme we keep the transmission rates of both schemes roughly the same and let n = b², q = 2, k = b − 1. We assume that the corresponding linear block code exists. Then F_s^∗ ≈ 2^b, which is better than F_s.
On the other hand, if we let β = 0, a = 2, m = 2qb, we obtain a coded caching system with K = m(m − 1)/2, M/N ≈ 1/q, F_s = \binom{m}{m/(2q)} ≈ (2q)^{m/(2q)}, and R^{YAN} ≈ (2q − 1)². For keeping the rates the same, we let n = m(m − 1)/(2q) and k = m(m − 1)/(4q(2q − 1)) − 1, so that F_s^∗ ≈ q^{m(m−1)/(4q(2q−1))} ≈ q^{m²/(8q²)}. In this regime, the subpacketization level of Yan et al. (2017b) is lower than that of our scheme.
The work of Shangguan et al. (2018) proposed caching schemes with parameters (i) K = \binom{m}{a}, M/N = 1 − \binom{m−a}{b}/\binom{m}{b}, F_s = \binom{m}{b} and R = \binom{m}{a+b}/\binom{m}{b}, where a, b, m are positive integers and a + b ≤ m, and (ii) K = \binom{m}{t} q^t, M/N = 1 − 1/q^t, F_s = q^m (q − 1)^t and R = 1/(q − 1)^t, where q, t, m are positive integers.
Their scheme (i) is the special case of the scheme in Yan et al. (2017b) when β = 0. For the second scheme, if we let t = 2, Shangguan et al. (2018) shows that R ≈ R^∗, F_s ≈ q^{√(2K)/q}(q − 1)² and F_s^∗ ≈ (q − 1)q^{K/q − 1}, which means F_s is again better than F_s^∗. We emphasize here that these results apply only for specific parameter choices.
Finally, we consider the work of Shanmugam et al. (2017). In their work, they leveraged the
results of Alon et al. (2012) to arrive at coded caching schemes where the subpacketization is linear
in K. Specifically, they show that for any constant M/N , there exists a scheme with rate K δ , where
δ > 0 can be chosen arbitrarily small by choosing K large enough. From a theoretical perspective, this is a positive result that indicates that there are regimes where linear subpacketization scaling is possible. However, these results are only valid when the value of K is very large. Specifically, K = C^n and the result is asymptotic in the parameter n. For these parameter ranges, the result of Shanmugam et al. (2017) is primarily of theoretical interest.
In this work we have demonstrated a link between specific classes of linear block codes and
the subpacketization problem in coded caching. Crucial to our approach is the consecutive col-
umn property which enforces that certain consecutive column sets of the corresponding generator
matrices are full-rank. We present several constructions of such matrices that cover a large range
of problem parameters. Leveraging this approach allows us to construct families of coded caching
schemes where the subpacketization level is exponentially smaller compared to the approach of Maddah-Ali and Niesen (2014b).
There are several opportunities for future work. Even though our subpacketization level is sig-
nificantly lower than Maddah-Ali and Niesen (2014b), it still scales exponentially with the number
of users. Of course, the rate of growth with the number of users is much smaller. There have been
some recent results on coded caching schemes that demonstrate the existence of schemes with significantly smaller subpacketization levels. It would be interesting to investigate whether some of these ideas can be leveraged to obtain schemes that work for practical system parameters.
Table 3.1 A summary of the different constructions of CCP matrices in Section 3.3
Table 3.2 List of k values for Example 16. The values of n0 , α and z are obtained by
following Algorithm 4.
k n0 z α Construction Notes
11 12 1 1 (12, 11) SPC code k + 1 = n0
10 12 11 12 - -
9 12 5 6 Claim 7 α = z + 1 and q ≥ z
8 12 3 4 Claim 7 α = z + 1 and q ≥ z
7 12 2 3 Claim 7 α = z + 1 and q ≥ z
6 12 7 12 Claim 4 Generator polynomial is X 6 +X 5 +3X 4 +
3X 3 + X 2 + 4X + 3
5 6 1 1 (6,5) SPC code and Claim 8 Extend (6,5) SPC code to (12,5) code
4 7 5 7 Claim 4 Generator polynomial is X 8 +X 7 +4X 6 +
3X 5 + 2X 3 + X 2 + 4X + 4
3 4 1 1 (4,3) SPC code and Claim 8 Extend (4,3) SPC code to (12,3) code
2 3 1 1 (3,2) SPC code and Claim 8 Extend (3,2) SPC code to (12,2) code
1 2 1 1 (2,1) SPC code and Claim 8 Extend (2,1) SPC code to (12,1) code
The work of Yu et al. (2020) considers the distributed computation of the product of two large
matrices AT and B, which are respectively partitioned into p × m and p × n blocks of submatrices
of equal size by the master node. The key result of Yu et al. (2020) shows that the product A^T B can be recovered as long as any pmn + p − 1 worker nodes complete their tasks. Interestingly, related ideas were investigated in a different context by Yagle (1995) in the mid 90's.
that work was fast matrix multiplication using pseudo-number theoretic transforms, rather than
fault tolerance. There have been other contributions in this area Dutta et al. (2016); Lee et al.
(2018, 2017); Mallick et al. (2019); Wang et al. (2018) as well, some of which predate Yu et al.
(2020).
Let A (size v × r) and B (size v × t) be two integer matrices¹. We are interested in computing
C , AT B in a distributed fashion. Specifically, each worker node can store a 1/mp fraction of
matrix A and a 1/np fraction of matrix B. The job given to the worker node is to compute the
product of the submatrices assigned to it. The master node waits for a sufficient number of the
submatrix products to be communicated to it. It then determines the final result after further
¹Floating point matrices with limited precision can be handled with appropriate scaling.
processing at its end. More precisely, matrices A and B are first block decomposed as

A = [A_{ij}], i ∈ [p], j ∈ [m],   and   B = [B_{kl}], k ∈ [p], l ∈ [n],

where the A_{ij}'s and the B_{kl}'s are of dimension (v/p) × (r/m) and (v/p) × (t/n) respectively. The master node forms the polynomials

Ã(s, z) = Σ_{i,j} A_{ij} s^{λ_{ij}} z^{ρ_{ij}},   and
B̃(s, z) = Σ_{k,l} B_{kl} s^{γ_{kl}} z^{δ_{kl}},

where λ_{ij}, ρ_{ij}, γ_{kl} and δ_{kl} are suitably chosen integers. Following this, the master node evaluates
Ã(s, z) and B̃(s, z) at a fixed positive integer s and carefully chosen points z ∈ {z1 , . . . , zK } (which
can be real or complex) where K is the number of worker nodes. Note that this only requires
scalar multiplication and addition operations on the part of the master node. Subsequently, it sends the evaluations Ã(s, z_i) and B̃(s, z_i) to the i-th worker node.
The i-th worker node computes the product Ã^T(s, z_i)B̃(s, z_i) and sends it back to the master
node. Let 1 ≤ τ ≤ K denote the minimum number of worker nodes such that the master node can
determine the required product (i.e., matrix C) once any τ of the worker nodes have completed
their assigned jobs. We call τ the recovery threshold of the scheme. In Yu et al. (2020), τ is shown
to be pmn + p − 1.
In this work, we demonstrate that as long as the entries in A and B are bounded by sufficiently
small numbers, the recovery threshold (τ ) can be significantly reduced as compared to the approach
of Yu et al. (2020). Specifically, the recovery threshold in our work can be of the form p0 mn + p0 − 1
depending on our assumptions on the matrix entries. We show that the required upper bound on
the matrix entries can be traded off with the corresponding threshold in a simple manner. Finally,
we present experimental results that demonstrate the superiority of our method via an Amazon Web Services (AWS) based experiment.
We now illustrate the main idea by means of an example with p = 2 and m = n = 2. We let

Ã(s, z) = (A_{00} + A_{10}s^{−1}) + (A_{01} + A_{11}s^{−1})z,   and
B̃(s, z) = (B_{00} + B_{10}s) + (B_{01} + B_{11}s)z².

Then the product Ã^T(s, z)B̃(s, z) can be written as

s^{−1}(A^T_{10}B_{00} + A^T_{11}B_{00}z + A^T_{10}B_{01}z² + A^T_{11}B_{01}z³)   (4.1)
+ (C_{00} + C_{10}z + C_{01}z² + C_{11}z³)   (4.2)
+ s(A^T_{00}B_{10} + A^T_{01}B_{10}z + A^T_{00}B_{11}z² + A^T_{01}B_{11}z³),   (4.3)

where C_{ij} = A^T_{0i}B_{0j} + A^T_{1i}B_{1j}. Evidently, the product above contains the useful terms in (4.2) as coefficients of z^k for k = 0, . . . , 3. The other two lines contain terms (coefficients of s^{−1}z^k and sz^k, k = 0, . . . , 3) that we are not interested in; we refer to these as interference terms. Rearranging the terms, we obtain a polynomial of degree 3 in z whose coefficients are superpositions of useful and interference terms; these coefficients can be determined from the evaluations at any four distinct points, i.e., we recover the superposed useful and interference terms even in the presence of K − 4 erasures.
Now, suppose that the absolute value of each entry in C and of each of the interference terms
is < L. Furthermore, assume that s ≥ 2L. The Cij ’s can then be recovered by exploiting the fact
that s ≥ 2L, e.g., for non-negative matrices A and B, we can simply extract the integer part of
each Xij and compute its remainder upon division by s. The case of general A and B is treated in
Section 4.2.2.
To summarize, under our assumptions on the maximum absolute value of the entries of the matrix C and the interference matrix products, we can obtain a scheme with a threshold of 4. In contrast, the threshold of the approach of Yu et al. (2020) for these parameters is pmn + p − 1 = 9.
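The example can be simulated end to end. The following sketch (the block size b, the entry range, and the bound L are illustrative assumptions, not values from the text) encodes A and B as above, evaluates at K = 6 points, and recovers C = A^T B from any mn = 4 worker results:

```python
import numpy as np

rng = np.random.default_rng(42)
p, m, n, b = 2, 2, 2, 4                 # 2x2 block grid, b x b blocks (illustrative)
A = rng.integers(0, 3, (p * b, m * b))  # small non-negative entries
B = rng.integers(0, 3, (p * b, n * b))
L = 2 * 2 * (p * b) + 1                 # crude bound on |C_ij| and interference entries
s = 2 * L

Ab = lambda u, i: A[u*b:(u+1)*b, i*b:(i+1)*b]   # block A_{ui}
Bb = lambda v, j: B[v*b:(v+1)*b, j*b:(j+1)*b]   # block B_{vj}

# Encoded polynomials: interference lands on the s^{-1} and s "levels"
Atil = lambda z: (Ab(0, 0) + Ab(1, 0)/s) + z*(Ab(0, 1) + Ab(1, 1)/s)
Btil = lambda z: (Bb(0, 0) + s*Bb(1, 0)) + z**2*(Bb(0, 1) + s*Bb(1, 1))

zs = np.linspace(-1, 1, 6)                       # K = 6 workers
Y = [Atil(z).T @ Btil(z) for z in zs]            # worker computations

idx = [0, 2, 3, 5]                               # any mn = 4 responses suffice
V = np.vander(zs[idx], 4, increasing=True)       # 4x4 Vandermonde system
X = np.linalg.solve(V, np.stack([Y[i].reshape(-1) for i in idx]))

# Round away the s^{-1} interference, then reduce mod s (entries here are >= 0)
C_blocks = np.rint(X).astype(np.int64) % s       # row k holds C_{ij}, k = m*j + i
```

Row k of C_blocks, reshaped to b × b, equals the (i, j)-th block of C = A^T B with i = k mod m and j = ⌊k/m⌋.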
Remark 7. We emphasize that the choice of the polynomials Ã(s, z) and B̃(s, z) is quite different in our work as compared to Yu et al. (2020); this can be verified by setting s = 1 in the expressions. In
particular, our choice of polynomials deliberately creates the controlled superposition of useful and
interference terms (the choice of coefficients in Yu et al. (2020) explicitly avoids the superposition).
interference terms. We disentangle the superposition by using our assumptions on the matrix entries later. To the best of our knowledge, this disentangling idea first appeared in the work of Yagle (1995), though its motivations
were different.
We now present the most general form of our result. Let the block decomposed matrices A and
B be of size p × m and p × n respectively. We form the polynomials Ã(s, z) and B̃(s, z) as follows
Ã(s, z) = ∑_{i=0}^{m−1} z^i ( ∑_{u=0}^{p−1} A_{ui} s^{−u} ), and

B̃(s, z) = ∑_{j=0}^{n−1} z^{mj} ( ∑_{v=0}^{p−1} B_{vj} s^{v} ).
To better understand the behavior of the product Ã^T(s, z) B̃(s, z), we divide its terms into the following cases.
• Case 1: Useful terms. These are the terms with coefficients of the form A_{ui}^T B_{uj}; they are the coefficients of z^{mj+i} s^0. Summing over u, the coefficient of z^{mj+i} at s^0 is precisely C_{ij} = ∑_{u} A_{ui}^T B_{uj}.
• Case 2: Interference terms. Conversely, the terms in (4.4) with coefficient A_{ui}^T B_{vj}, u ≠ v, are the interference terms, and they are the coefficients of z^{mj+i} s^{v−u} (for v ≠ u).
Collecting terms, we can write

Ã^T(s, z) B̃(s, z) = ∑_{j=0}^{n−1} ∑_{i=0}^{m−1} X_{ij} z^{mj+i},   (4.5)

where each X_{ij} is of the form ∗s^{−(p−1)} + · · · + ∗s^{−1} + C_{ij} + ∗s + · · · + ∗s^{p−1}, and ∗ denotes an interference term. Note that (4.5) consists of consecutive powers z^k for k = 0, . . . , mn − 1.
We choose a distinct value zi for the i-th worker (real or complex). Suppose that the absolute value of
each Cij and of each interference term (marked with ∗) is at most L − 1. We choose s ≥ 2L.
We now show that as long as at least mn of the worker nodes return their computations, the master node can recover the matrix C. Suppose the master node obtains the results Y_{i_l} = Ã^T(s, z_{i_l}) B̃(s, z_{i_l}) from the mn workers indexed by i_1, . . . , i_{mn}. Then we have the following equations,
[ Y_{i_1}  ]   [ 1  z_{i_1}   z_{i_1}^2   · · ·  z_{i_1}^{mn−1}  ] [ X_{00}          ]
[ Y_{i_2}  ] = [ 1  z_{i_2}   z_{i_2}^2   · · ·  z_{i_2}^{mn−1}  ] [ ⋮               ]
[ ⋮        ]   [ ⋮   ⋮         ⋮          · · ·  ⋮               ] [ ⋮               ]
[ Y_{i_mn} ]   [ 1  z_{i_mn}  z_{i_mn}^2  · · ·  z_{i_mn}^{mn−1} ] [ X_{(m−1)(n−1)} ]
The Vandermonde form of the above matrix guarantees the uniqueness of the solution. This is because the determinant of a Vandermonde matrix can be expressed as ∏_{1≤a<b≤mn} (z_{i_b} − z_{i_a}), which is nonzero whenever the evaluation points are distinct.
Note that Xij = ∗s−(p−1) + · · · + ∗s−1 + Cij + ∗s + · · · + ∗sp−1 . The master node can recover
Cij from Xij as follows. We first round Xij to the closest integer; this yields the integer Cij + ∗s + · · · + ∗s^{p−1}, since

|∗s^{−(p−1)} + · · · + ∗s^{−1}| ≤ (L − 1)/(s − 1) ≤ (L − 1)/(2L − 1) < 1/2.
Next, we determine Ĉij = (Cij + ∗s + · · · + ∗s^{p−1}) mod s (we work under the convention that the modulo output always lies between 0 and s − 1). It is easy to see that if Ĉij ≤ s/2 then Cij = Ĉij; otherwise Cij is negative and Cij = −(s − Ĉij). If s is a power of 2, the modulo operation can be implemented efficiently with bitwise operations.
The maximum and the minimum values (integer or floating point) that can be stored and
manipulated on a computer have certain limits. Assuming s = 2L, it is easy to see that |Xij | is at
most (2L)p /2. Therefore, large values of L and p can potentially cause numerical issues (overflow
and/or underflow). We note here that a simple but rather conservative way to estimate the value of L is to upper bound each entry of the relevant matrix products by the product of the maximum absolute entries of A and B and the inner dimension of the product.
The method presented in Section 4.2 achieves a threshold of mn while requiring that the LHS of (4.5) remain within the range of numeric values that can be represented on the machine. In general, the terms in (4.5) will depend on the choice of the zi's and the values of the |Xij|'s; e.g., choosing the zi's to be complex roots of unity implies that our method requires mn × (2L)^p /2 to be within the representable range.
We now present a scheme that allows us to trade off the precision requirements with the recovery
threshold of the scheme, i.e., we can loosen the requirement on L and p at the cost of an increased
threshold.
Assume that p0 is an integer that divides p. We form the polynomials Ã(s, z) and B̃(s, z) as
follows,
Ã(s, z) = ∑_{i=0}^{m−1} ∑_{j=0}^{p0−1} ∑_{k=0}^{p/p0−1} z^{j+p0·i} A_{(k+(p/p0)·j), i} s^{k}, and

B̃(s, z) = ∑_{u=0}^{n−1} ∑_{v=0}^{p0−1} ∑_{w=0}^{p/p0−1} z^{m·p0·u+(p0−1−v)} B_{(w+(p/p0)·v), u} s^{−w}.
Note that in the expressions above we use A_{i,j} to represent the (i, j)-th block sub-matrix of A (rather than its (i, j)-th scalar entry).
To better understand the behavior of (4.6), we again divide it into useful terms and interference terms.
• Case 1: Useful terms. These are the terms with coefficients of the form A^T_{(k+(p/p0)j), i} B_{(k+(p/p0)j), u}. The term A^T_{(k+(p/p0)j), i} B_{(k+(p/p0)j), u} is the coefficient of z^{m·p0·u + p0·i + p0 − 1}.
• Case 2: Interference terms. The interference terms are associated with the terms with coefficients of the form

A^T_{(k+(p/p0)j), i} B_{(w+(p/p0)v), u} z^{m·p0·u + (p0−1−v) + j + p0·i} s^{k−w}.
We now verify that the interference terms and useful terms are distinct. This is evident when k ≠ w, since such interference terms carry a nonzero power of s; when k = w and j ≠ v, the z-degree of the interference term differs from that of every useful term, since (j − v) is not a multiple of p0.
Next, we discuss the degree of Ã(s, z)T B̃(s, z) in the variable z. In (4.6), the terms with maximal
z-degree are the terms with u = n − 1, v = 0, j = p0 − 1 and i = m − 1. Thus, the maximal degree
of z in the expression is mnp0 + p0 − 2. It can be verified that terms with z-degree from 0 to mnp0 + p0 − 2 appear in (4.6), and that the z-degree of the useful term C_{iu} is m·p0·u + p0·i + p0 − 1, for i = 0, · · · , m − 1, u = 0, · · · , n − 1.
Likewise, the s-degree of Ã(s, z)^T B̃(s, z) varies over −(p/p0 − 1), . . . , 0, . . . , (p/p0 − 1), with the useful terms appearing as the coefficients of s^0.
Evidently, the recovery threshold is mnp0 + p0 − 1, which is higher than that of the construction in Section 4.2.2. However, with s = 2L, the maximum value of |Xij| is at most (2L)^{p/p0} /2, which is less than the corresponding bound of (2L)^p /2 for the scheme of Section 4.2.2.
As an example, let m = n = 2, p = 4 and p0 = 2. Following the construction above, we let

Ã(s, z) = (A_{0,0} + A_{1,0} s) + (A_{2,0} + A_{3,0} s) z + (A_{0,1} + A_{1,1} s) z^2 + (A_{2,1} + A_{3,1} s) z^3, and
B̃(s, z) = (B_{2,0} + B_{3,0} s^{−1}) + (B_{0,0} + B_{1,0} s^{−1}) z + (B_{2,1} + B_{3,1} s^{−1}) z^4 + (B_{0,1} + B_{1,1} s^{−1}) z^5.
The product of the above polynomials can be verified to contain the useful terms with coefficients
z, z 3 , z 5 , z 7 ; the others are interference terms. For this scheme the corresponding |Xij | can at most
be 2L^2, though the recovery threshold is 9. Applying the method of Section 4.2.2 would instead result in a threshold of mn = 4, but with |Xij| as large as (2L)^4 /2 = 8L^4.
We ran our experiments on AWS EC2 r3.large instances. Our code is available online com (2019).
The input matrices A and B were randomly generated integer matrices of size 8000 × 8000 with
elements in the set {0, 1, . . . , 50}. These matrices were pre-generated (for the different straggler
counts) and remained the same for all experiments. The master node was responsible for the 2 × 2
block decomposition of A and B, computing Ã(s, zi ) and B̃(s, zi ) for i = 1, . . . , 10 and sending
them to the worker nodes. The evaluation points (zi ’s) were chosen as 10 equally spaced reals
within the interval [−1, 1]. The stragglers were simulated by having S randomly chosen machines delay the return of their results.
We compared the performance of our method (cf. Section 4.2) with Yu et al. (2020). For fairness, we chose the same evaluation points in both methods. In fact, the choice of points in their code available online nip (2017) (which we adapted for the case when p > 1) provides worse results for their scheme.
Computation latency refers to the elapsed time from the point when all workers have received their inputs until enough of them finish their computations, including the decoding time. The decoding time for our method is slightly higher owing to the modulo-s operation (cf. Section 4.2.3).
It can be observed in Fig. 4.1 that for our method there is no significant change in the latency
for the values of S ∈ {0, 2, 4, 6} and it remains around 9.83 seconds. When S = 7, as expected
the straggler effects start impacting our system and the latency jumps to approximately 16.14
seconds. In contrast, the performance of Yu et al. (2020) deteriorates in the presence of two or more stragglers.
Real Vandermonde matrices are well-known to have bad condition numbers. The condition
number is better when we consider complex Vandermonde matrices with entries from the unit circle
Gautschi (1990). In our method, the |Xij | and |Yij | values can be quite large. This introduces small
errors in the decoding process. Let Ĉ be the decoded matrix and C ≜ A^T B be the actual product. Our error metric is e = ||C − Ĉ||_F / ||C||_F (the subscript F refers to the Frobenius norm). The results in Fig. 4.1 had an error e of at most 10^{−7}. We studied the effect of increasing the average value of the entries in A and B in Table 4.1. The error is consistently low up to a bound of L = 1000, beyond which the calculation is useless owing to numerical overflow issues. We point out that in our experiments
Bound (L)    s       Error
100          2^28    6.31 · 10^−7
200          2^30    8.87 · 10^−7
500          2^32    6.40 · 10^−6
1000         2^34    9.52 · 10^−6
2000         2^36    1
the error e was identically zero if the zi's were chosen from the unit circle. However, this requires complex arithmetic, which increases the computation cost.
Polynomial based methods have been used in several works for mitigating the effect of stragglers in distributed matrix computations. However, they suffer from serious numerical issues owing to the poor conditioning of the associated real Vandermonde matrices. In this chapter, we present a novel approach that leverages the properties of circulant permutation
matrices and rotation matrices for coded matrix computation. In addition to having an optimal
recovery threshold, our scheme has condition numbers that are orders of magnitude lower than
prior work.
Consider a scenario where the master node has a large t × r matrix A ∈ R^{t×r} and either a t × 1 vector x or a t × w matrix B. The master node wishes to compute A^T x or A^T B in a distributed manner over n worker nodes in the matrix-vector and matrix-matrix setting
respectively. Towards this end, the master node partitions A (respectively B) into ∆A (respectively
∆B ) block columns. Each worker node is assigned δA ≤ ∆A and δB ≤ ∆B linearly encoded block-columns of A and B respectively.
In the matrix-vector case, the i-th worker is assigned encoded block-columns of A and the vector
x and computes their inner product. In the matrix-matrix case it computes all pairwise products of
block-columns assigned to it. We say that a given scheme has computation threshold τ if the master
node can decode the intended result as long as any τ out of n worker nodes complete their jobs. In
this case we say that the scheme is resilient to s = n − τ stragglers. We say that this threshold is
optimal if the value of τ is the smallest possible for the given storage capacity constraints.
The overall goal is to (i) design schemes that are resilient to s stragglers (s is a design parameter), while ensuring that (ii) the desired result can be decoded in an efficient manner, and (iii) the decoded result is numerically robust even in the presence of round-off errors and other sources of noise.
A significant amount of prior work Yu et al. (2017, 2020); Dutta et al. (2016, 2019) has demon-
strated interesting and elegant approaches based on embedding the distributed matrix computation
into the structure of polynomials. Specifically, the encoding at the master node can be viewed as
evaluating certain polynomials at distinct real values. Each worker node gets a particular evaluation. Once at least τ workers finish their tasks, the master node can decode the intended result
by performing polynomial interpolation. The work of Yu et al. (2017) demonstrates that when A and B are partitioned into block-columns, the optimal recovery threshold is the product of the number of block-columns of A and B, and that their constructions (referred to as polynomial codes) achieve this threshold. Prior work has also considered other ways
in which matrices A and B can be partitioned. For instance, they can be partitioned both along
rows and columns. The work of Yu et al. (2020); Dutta et al. (2019) has obtained threshold results
in those cases as well. The so-called Entangled Polynomial and Mat-Dot codes Yu et al. (2020); Dutta et al. (2019) also use polynomial encodings. The key point is that in all these approaches, the decoding requires polynomial interpolation, i.e., solving a real Vandermonde system of equations at the master node.
A major part of our work in this paper revolves around understanding the numerical stability
of various distributed matrix computations. This is closely related to the condition number of
matrices. Let ||M|| denote the maximum singular value of a matrix M of dimension l × l. The condition number of a nonsingular M is defined as κ(M) = ||M|| ||M^{−1}||. If κ(M) ≈ 10^b, then the decoded result loses approximately b digits of precision Higham (2002). In
particular, matrices that are ill-conditioned lead to significant numerical problems when solving
linear equations.
Recall that decoding requires solving a system of linear equations at the master node. In the work of Yu et al. (2017), this would require solving a ∆A ∆B × ∆A ∆B
Vandermonde system. Unfortunately, it can be shown that the condition number of these matrices
grows exponentially in ∆A ∆B Pan (2016). This is a significant drawback and even for systems
with around ∆A ∆B ≈ 30, the condition number is so high that the decoded results are essentially
useless.
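This blow-up is easy to reproduce. A short numpy check (equally spaced real evaluation points are an illustrative choice) shows the condition number growing rapidly with the size of the system:

```python
import numpy as np

# Condition number of real Vandermonde systems vs. system size
conds = []
for size in (10, 20, 30):
    z = np.linspace(-1, 1, size)          # distinct real evaluation points
    V = np.vander(z, increasing=True)
    conds.append(np.linalg.cond(V))
    print(size, conds[-1])
```

By size 30 the condition number is so large that, in double precision, essentially no digits of the decoded solution can be trusted.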
In Section VII of Yu et al. (2020), it is remarked that when operating over infinite fields such as
the reals, one can embed the computation into finite fields to avoid numerical errors. For instance,
they advocate encoding and decoding over a finite field of order p. However, this method would
require “quantizing” the real matrices A and B so that the entries are integers. We demonstrate that this approach also suffers from numerical issues.
For this method to work, the maximum possible absolute value of each entry of the quantized
matrices, α should be such that α2 t < p, since each entry corresponds to the inner product of
columns of A and columns of B. This “dynamic range constraint (DRC)” means that the error in
the computation depends strongly on the actual matrix entries and the value of t is quite limited.
If the DRC is violated, the error in the underlying computation can be catastrophic. Even if the
DRC is not violated, the dependence of the error on the entries can make it very bad. We discuss this in more detail in the sequel.
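A minimal sketch of the DRC, with hypothetical modulus p_mod, length t, and entry bound alpha: recovery from the mod-p result succeeds exactly when alpha^2 t < p guarantees that the true inner product does not wrap around the modulus.

```python
import numpy as np

p_mod = 2**31 - 1      # prime modulus (illustrative choice)
t = 1000               # inner-product (column) dimension
rng = np.random.default_rng(1)

def inner_mod(a, b, p):
    # inner product computed entirely in arithmetic mod p
    return int(np.sum((a * b) % p) % p)

results = {}
for alpha in (1000, 50_000):          # DRC satisfied vs. violated
    a = rng.integers(0, alpha, t)
    b = rng.integers(0, alpha, t)
    exact = int(a @ b)
    # the finite-field result matches the true value only if exact < p
    results[alpha] = (inner_mod(a, b, p_mod) == exact)
print(results)
```

For alpha = 1000 we have alpha^2 t < p_mod, so the result is always recoverable; for alpha = 50000 the true value exceeds p_mod and the mod-p computation is useless.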
The main goal of our work is to consider alternate embeddings of distributed matrix computations that are based on rotation and circulant permutation matrices. We demonstrate that these embeddings are significantly better behaved numerically while continuing to enjoy the optimal recovery threshold.
recent works Tang et al. (2019); Ramamoorthy et al. (2019); Das et al. (2018); Das and Ramamoor-
thy (2019); Das et al. (2019); Fahim and Cadambe (2019); Subramaniam et al. (2019). The work
of Ramamoorthy et al. (2019); Das and Ramamoorthy (2019) presented strategies for distributed
matrix-vector multiplication and demonstrated some schemes that empirically have better numer-
ical performance than polynomial based schemes for some values of n and s. However, both these
approaches work only for the matrix-vector problem. Preprint Das et al. (2019) presents a ran-
dom convolutional coding approach that applies for both the matrix-vector and the matrix-matrix
multiplications problems. Their work demonstrates a computable upper bound on the worst case
condition number of the decoding matrices by drawing on connections with the asymptotic analysis
of large Toeplitz matrices. The recent preprint Subramaniam et al. (2019) presents constructions
that are based on random linear coding ideas where the encoding coefficients are chosen at random. The work most closely related to ours is Fahim and Cadambe (2019), which considers an alternative approach for polynomial based schemes by working within the basis of orthogonal polynomials. They demonstrate an upper bound on the worst case condition number of the decoding
matrices which grows as O(n^{2s}), where s is the number of stragglers that the scheme is resilient to. They also demonstrate experimentally that their performance is significantly better than the
polynomial code approach. In contrast, we demonstrate an upper bound that is ≈ O(n^{s+6}). Furthermore, in Section 5.6 we show that in practice our worst case condition numbers are far better than what the upper bound suggests.
matrix lie on the unit circle, its condition number is badly behaved. However, most of these
parameters are complex-valued (except ±1), whereas our matrices A and B are real-valued. Using complex evaluation points in the polynomial code scheme will increase the cost of computations approximately four times for matrix-matrix multiplication and around two times for matrix-vector multiplication.
• Our main finding in this paper is that we can work with real-valued matrix embeddings that (i)
continue to have the optimal threshold of polynomial based approaches, and (ii) enjoy the low
condition number of complex Vandermonde matrices with all parameters on the unit circle.

• We show that rotation and circulant permutation matrices of appropriate sizes can be used within the framework of polynomial codes. At the top level, instead of evaluating polynomials at real values, our approach evaluates the polynomials at matrices.
• Using these embeddings, we show that the worst case condition number over all (n choose n−s) possible sets of surviving workers is upper bounded by ≈ O(n^{s+6}). Moreover, our experimental results indicate that the actual values are significantly smaller, i.e., the analysis yields pessimistic upper bounds.
Our schemes in this work will be defined by the encoding matrices used by the master node,
which are such that the master node only needs to perform scalar multiplications and additions.
The computationally intensive tasks, i.e., matrix operations are performed by the worker nodes.
We begin by defining certain classes of matrices, discuss their relevant properties and present an example that outlines the basic idea of our work. We let i = √−1.
Definition 9. (Circulant permutation matrix) Consider the length-m row vector e = [0 1 0 . . . 0]. Let P be the m × m matrix with e as its first row. The remaining rows are obtained by cyclically shifting the first row, with the shift index equal to the row index. Then P^i, i ∈ [m], are said to be circulant permutation matrices. Let W denote the m-point Discrete Fourier Transform (DFT) matrix, i.e., W(i, j) = (1/√m) ω_m^{ij} for i ∈ [m], j ∈ [m], where ω_m = e^{−i2π/m} denotes the m-th root of unity. Then, it can be shown Gray (2006) that P = W diag(1, ω_m, ω_m^2, . . . , ω_m^{m−1}) W^∗, where W^∗ denotes the conjugate transpose of W.
Remark 8. Rotation matrices and circulant permutation matrices (see Appendix B for an example) have the useful property that they are real matrices whose complex eigenvalues lie on the unit circle.
Consider the matrix-vector case with A partitioned into ∆A = 3 block-columns A0, A1, A2. In the polynomial approach, the master node forms A(z) = A0 + A1 z + A2 z^2 and evaluates it at distinct real values z1, . . . , zn. The i-th evaluation is sent to the i-th worker node which computes A^T(zi)x.
From polynomial interpolation, it follows that as long as the master node receives results from any
three workers, it can decode AT x. However, when ∆A is large, the interpolation is numerically
unstable.
The basic idea of our approach is as follows. We further split each Ai into two equal sized block-columns. Thus we now have six block-columns, indexed as A0, . . . , A5. Consider the 6 × 2 matrix Gi obtained by vertically stacking I2, Rθ^i, and Rθ^{2i} (cf. (5.5)), and let g0 and g1 denote its columns. The master node forms “two” encoded matrices for the i-th worker: ∑_{j=0}^{5} Aj g0(j) and ∑_{j=0}^{5} Aj g1(j) (where gi(l) denotes the l-th component of the vector gi). Thus, the storage capacity constraint
fraction γA is still 1/3. Worker node i computes the inner product of these two encoded matrices
with x and sends the result to the master node. It turns out that in this case, when any three workers i0, i1, and i2 complete their tasks, the decodability and numerical stability of recovering A^T x depend on the block matrix

[ I         I         I        ]
[ Rθ^{i0}   Rθ^{i1}   Rθ^{i2}  ]
[ Rθ^{2i0}  Rθ^{2i1}  Rθ^{2i2} ].
Using the eigendecomposition of Rθ (cf. (5.6)), the above block matrix can be expressed as

[ Q 0 0 ]   [ I        I        I       ]   [ Q∗ 0  0  ]
[ 0 Q 0 ] × [ Λ^{i0}   Λ^{i1}   Λ^{i2}  ] × [ 0  Q∗ 0  ]
[ 0 0 Q ]   [ Λ^{2i0}  Λ^{2i1}  Λ^{2i2} ]   [ 0  0  Q∗ ].
As the pre- and post-multiplying matrices are unitary, the condition number of the above matrix
only depends on the properties of the middle matrix, denoted by Σ. In what follows we show that
upon appropriate column and row permutations, Σ can be shown equivalent to a block diagonal
matrix where each of the blocks is a Vandermonde matrix with parameters on the unit circle.
Thus, the matrix is invertible. Furthermore, even though we use real computation, the numerical
stability of our scheme depends on Vandermonde matrices with parameters on the unit circle. The
upcoming Theorem 3 shows that the condition number of such matrices is much better behaved.
In the sequel we show that this argument can be significantly generalized and adapted for the case of circulant permutation embeddings. The matrix-matrix case requires the development of additional techniques, which we present below.
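The claim can be spot-checked numerically. The sketch below (illustrative parameters q = 11, kA = 3, and surviving workers {0, 4, 7}) builds the 6 × 6 recovery matrix of the rotation embedding and compares its condition number against a real Vandermonde matrix of the same size:

```python
import numpy as np

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

kA, q = 3, 11                      # q odd, q >= n
theta = 2 * np.pi / q
workers = [0, 4, 7]                # any kA surviving workers

# Recovery matrix: (i, j)-th 2x2 block is R_theta^(worker_j * i), cf. (5.5)
G = np.block([[np.linalg.matrix_power(rot(theta), j * i) for j in workers]
              for i in range(kA)])

# Real Vandermonde matrix of the same size, for comparison
V = np.vander(np.linspace(-1, 1, 2 * kA), increasing=True)
print(np.linalg.cond(G), np.linalg.cond(V))
```

The rotation-based matrix is (up to unitary factors) equivalent to two unit-circle Vandermonde matrices, and its condition number is far smaller than that of the real Vandermonde matrix.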
Notation: Let [m] denote the set {0, . . . , m − 1}. For a matrix M, M(i, j) denotes its (i, j)-
th entry, whereas Mi,j denotes the (i, j)-th block sub-matrix of M. We use MATLAB inspired
notation at certain places. For instance, diag(a1, a2, . . . , am) denotes an m × m diagonal matrix with the ai's on the diagonal, and M(:, j) denotes the j-th column of matrix M. The notation M1 ⊗ M2 denotes the Kronecker product of M1 and M2.
The encoding matrix for A will be specified by a kA ` × n` “generator” matrix G such that
Âhi,ji = ∑_{α∈[kA], β∈[`]} G(α` + β, i` + j) Ahα,βi   (5.4)
for i ∈ [n], j ∈ [`]. A similar rule will apply for B and result in encoded matrices B̂hi,ji . Thus,
in the matrix-vector case worker node i stores Âhi,ji for j ∈ [`] and x, whereas in the matrix-
matrix case it stores Âhi,ji and B̂hi,ji , for j ∈ [`]. Thus worker i stores γA = `/∆A = 1/kA and
γB = `/∆B = 1/kB fractions of matrices A and B respectively. In the matrix-vector case, worker
node i computes ÂThi,ji x for j ∈ [`] and transmits them to the master node. In the matrix-matrix
case, it computes all `2 pairwise products ÂThi,l1 i B̂hi,l2 i for l1 ∈ [`], l2 ∈ [`].
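The encode/compute/decode pipeline for the matrix-vector case can be sketched as follows (the sizes are illustrative; the generator follows the rotation-matrix choice of (5.5) with kA = 2, ` = 2, n = 4):

```python
import numpy as np

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(0)
kA, ell, n, q = 2, 2, 4, 5
theta = 2 * np.pi / q
dA = kA * ell                              # Delta_A = 4 block-columns

# Scalar generator G (kA*ell x n*ell): the (i, j)-th 2x2 block is R_theta^(j*i)
G = np.block([[np.linalg.matrix_power(rot(theta), j * i) for j in range(n)]
              for i in range(kA)])

b = 3                                      # columns per block-column
A = rng.standard_normal((10, dA * b))
x = rng.standard_normal(10)
blocks = [A[:, r*b:(r+1)*b] for r in range(dA)]

# Worker i stores ell encoded block-columns (cf. (5.4)) and multiplies them by x
def worker(i):
    return [sum(G[r, i*ell + j] * blocks[r] for r in range(dA)).T @ x
            for j in range(ell)]

# Master: any kA = 2 workers suffice; solve one dA x dA system per output chunk
survivors = [1, 3]
Gsub = np.hstack([G[:, i*ell:(i+1)*ell] for i in survivors])   # dA x dA
c = np.vstack([res for i in survivors for res in worker(i)])   # dA x b
u = np.linalg.solve(Gsub.T, c)                                 # u[r] = blocks[r]^T x
```

Here u.reshape(-1) equals A^T x; the decoder inverts one small dA × dA system and reuses it across all b output columns, matching the description above.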
Decoding Scheme: With the above encoding, the decoding process corresponds to solving
linear equations. We discuss the matrix-vector case here; the matrix-matrix case is quite similar.
In the matrix-vector case, the master node receives ÂThi,ji x of length r/∆A for j ∈ [`] from a certain
number of worker nodes and wants to decode AT x of length r. Based on our encoding scheme, this
can be done by solving a ∆A × ∆A linear system of equations r/∆A times. The structure of this system is inherited from the generator matrix. Recall that an m × m Vandermonde matrix V with parameters z0, z1, . . . , zm−1 ∈ C is such that V(i, j) = z_j^i, i ∈ [m], j ∈ [m]. If the zi's are distinct, then V is nonsingular Horn and Johnson (1991). In this work, we will also assume that the zi's are non-zero.
Consider an m × m Vandermonde matrix V with parameters s0, s1, . . . , sm−1. The following facts about κ(V) follow from prior work Pan (2016).
• Real Vandermonde matrices. If all the parameters si are real, then κ(V) grows exponentially in m.
• Complex Vandermonde matrices with parameters “not” on the unit circle. Suppose that a β-fraction of the parameters have absolute value different from 1; then κ(V) grows exponentially in βm.
Based on the above facts, the only scenario where the condition number is somewhat well-behaved
is if most or all of the parameters of V are complex and lie on the unit-circle. In the Appendix B,
we show the following result which is one of our key technical contributions.
Theorem 3. Consider an m × m Vandermonde matrix V, where m < q (q odd), with distinct parameters from the set {1, ωq, ωq^2, . . . , ωq^{q−1}}. Then κ(V) = O(q^{q−m+6}).
Let q be an odd number such that q ≥ n, θ = 2π/q and ` = 2 (cf. block column decomposition
in (5.3)). We choose the generator matrix such that its (i, j)-th block submatrix for i ∈ [kA ], j ∈ [n]
is given by

Grot_{i,j} = Rθ^{ji}.   (5.5)
Theorem 4. The threshold for the rotation matrix based scheme specified above is kA. Furthermore, the worst case condition number of the recovery matrices is upper bounded by O(q^{q−kA+6}).
Proof. Suppose that workers indexed by i0, . . . , ikA−1 complete their tasks. We extract the corresponding block-columns of Grot to form the recovery matrix G̃rot.
We note here that the decoder attempts to recover each entry of AThi,ji x from the results sent by
the worker nodes. Thus, we can equivalently analyze the decoding by considering the system of
equations as
mG̃rot = c,

where m = [m0, · · · , mkA−1] and c = [ci0, · · · , cikA−1].
In the expression above, terms of the form mhi,ji and chi,ji are scalars. We need to analyze κ(G̃rot ).
Using the eigendecomposition of Rθ, we can write G̃rot = (IkA ⊗ Q) Λ̃ (IkA ⊗ Q)^∗ (cf. (5.6)), where the (α, l)-th block of Λ̃ is Λ^{il·α} and Λ is specified in (5.2). Note that the pre- and post-multiplying matrices in (5.6) above are
both unitary. Therefore κ(G̃rot ) is the same as κ(Λ̃) Horn and Johnson (1991).
Using Claim 14 in the Appendix B, we can permute Λ̃ to put it in block-diagonal form so that
Λ̃d = [ Λ̃d[0]   0     ]
      [ 0       Λ̃d[1] ],
where Λ̃d [0] and Λ̃d [1] are Vandermonde matrices with parameter sets {eiθi0 , . . . , eiθikA −1 } and
{e−iθi0 , . . . , e−iθikA −1 } respectively. Note that these parameters are distinct points on the unit
circle. Thus, Λ̃d [0] and Λ̃d [1] are both invertible which implies that Λ̃ is invertible. This allows
us to conclude that the threshold of the scheme is kA. The upper bound on the condition number follows by applying Theorem 3 to Λ̃d[0] and Λ̃d[1].
We note here that the decoding process involves inverting a ∆A × ∆A matrix once and using the inverse to solve r/∆A systems of equations. Thus, the overall decoding complexity is O(∆A^3 + r∆A).
For the circulant permutation scheme, let q̃ ≥ n be a prime number and suppose that A is sub-divided into kA(q̃ − 1) block-columns as in (5.3). In this embedding we have an additional step: each group of q̃ − 1 block-columns is augmented with a dummy (zero) block-column. In the subsequent discussion, we work with the resulting set of block-columns Ahi,ji for i ∈ [kA], j ∈ [q̃]. The
coded submatrices Âhi,ji for i ∈ [n], j ∈ [q̃] are generated by means of a kA q̃ × nq̃ matrix Gcirc as
follows.
Âhi,ji = ∑_{α∈[kA], β∈[q̃]} Gcirc(αq̃ + β, iq̃ + j) Ahα,βi,   (5.8)

where

Gcirc_{i,j} = P^{ji}, for i ∈ [kA], j ∈ [n].   (5.9)
The matrix P denotes the q̃ × q̃ circulant permutation matrix introduced in Definition 9. For this
scheme the storage fraction γA = q̃/(kA (q̃ − 1)), i.e., it is slightly higher than 1/kA .
Remark 10. The Âhi,ji ’s can simply be generated by additions since Gcirc is a binary matrix.
Theorem 5. The threshold for the circulant permutation based scheme specified above is kA. Furthermore, the worst case condition number of the recovery matrices is upper bounded by O(q̃^{q̃−kA+6}).
The proof appears in the Appendix B. It is conceptually similar to the proof of Theorem 4
and relies critically on the fact that all eigenvalues of P lie on the unit circle and that P can be
diagonalized by the DFT matrix W. It suggests an efficient decoding algorithm where the fast
Fourier Transform (FFT) plays a key role (see Algorithm 5 and Claim 12).
Remark 11. Both circulant permutation matrices and rotation matrices allow us to achieve a specified threshold for distributed matrix-vector multiplication. The required storage fraction γA is slightly higher for the circulant permutation case, and it requires q̃ to be prime. However, it
allows for an efficient FFT based decoding algorithm. On the other hand, the rotation matrix case
requires a smaller ∆A , but the decoding requires solving the corresponding system of equations
the complexity of which can be cubic in ∆A . We note that when the matrix sizes are large, the
decoding time will be negligible as compared to the worker node computation time; we discuss this
in Section 5.6. In Section 5.6, we show results that demonstrate that the normalized mean-square
error when circulant permutation matrices are used is lower than the rotation matrix case.
The circulant matrix embedding idea can also be applied to the fast encoders of quasi-cyclic (QC) LDPC codes in Huang et al. (2014); Tang et al. (2013); Zhang et al. (2014).
The matrix-matrix case requires the introduction of newer ideas within this overall framework.
In this case, a given worker obtains encoded block-columns of both A and B and representing the
underlying computations is somewhat more involved. Once again we let θ = 2π/q, where q ≥ n (n
is the number of worker nodes) is an odd integer and set ` = 2. Furthermore, let kA kB < n. The generator matrices for A and B are given by

GA_{i,j} = Rθ^{ji}, for i ∈ [kA], j ∈ [n], and

GB_{i,j} = Rθ^{(jkA)i}, for i ∈ [kB], j ∈ [n].
The master node operates according to the encoding rule discussed previously (cf. (5.3)) for both A
and B. Thus, each worker node stores γA = 1/kA and γB = 1/kB fraction of A and B respectively.
The i-th worker node computes the pairwise product of the matrices ÂThi,l1 i B̂hi,l2 i for l1 , l2 = 0, 1
and returns the result to the master node. Thus, the master node needs to recover all pair-wise
products of the form A^T_{hi,αi} Bhj,βi for i ∈ [kA], j ∈ [kB] and α, β = 0, 1. Let Z denote a 1 × 4kA kB block-vector containing these pairwise products.
Theorem 6. The threshold for the rotation matrix based matrix-matrix multiplication scheme is τ = kA kB.

Proof. Let GA_l denote the l-th block column of GA (with similar notation for GB). Note that for k1, k2 ∈ {0, 1}, the l-th worker node computes Â^T_{hl,k1i} B̂hl,k2i, which can be written as
( ∑_{α∈[kA], β∈{0,1}} GA(2α + β, 2l + k1) Ahα,βi )^T ( ∑_{α∈[kB], β∈{0,1}} GB(2α + β, 2l + k2) Bhα,βi ),

and expanded using the properties of the Kronecker product. Based on this, it can be observed that the decodability of Z at the master node is equivalent to checking whether the following matrix is full-rank:

G̃ = [GA_{i0} ⊗ GB_{i0} | GA_{i1} ⊗ GB_{i1} | . . . | GA_{iτ−1} ⊗ GB_{iτ−1}].
Note that

GA_l ⊗ GB_l = ( (IkA ⊗ Q) [I ; Λ^{l} ; · · · ; Λ^{l(kA−1)}] Q^∗ ) ⊗ ( (IkB ⊗ Q) [I ; Λ^{lkA} ; · · · ; Λ^{lkA(kB−1)}] Q^∗ )
            = ( (IkA ⊗ Q) ⊗ (IkB ⊗ Q) ) Xl (Q^∗ ⊗ Q^∗),

where the first equality uses the eigenvalue decomposition of Rθ, the second applies the properties of the Kronecker product, and Xl = [I ; Λ^{l} ; · · · ; Λ^{l(kA−1)}] ⊗ [I ; Λ^{lkA} ; · · · ; Λ^{lkA(kB−1)}] (with [· ; ·] denoting vertical stacking of blocks). Writing Q̃1 = (IkA ⊗ Q) ⊗ (IkB ⊗ Q) and Q̃2 = Q^∗ ⊗ Q^∗, we have
[GA_{i0} ⊗ GB_{i0} | GA_{i1} ⊗ GB_{i1} | . . . | GA_{iτ−1} ⊗ GB_{iτ−1}] = Q̃1 [Xi0 | Xi1 | . . . | Xiτ−1] diag(Q̃2, Q̃2, . . . , Q̃2).
Thus, we can conclude that the invertibility and the condition number of G̃ only depends on
[Xi0 |Xi1 | . . . |Xiτ −1 ] as the matrices pre- and post- multiplying it are both unitary. The invert-
ibility of [Xi0 |Xi1 | . . . |Xiτ −1 ] follows from an application of Claim 15 in the Appendix B. The
proof of Claim 15 also shows that upon appropriate permutation, the matrix [Xi0 |Xi1 | . . . |Xiτ −1 ]
can be expressed as a block-diagonal matrix with four blocks, each of size τ × τ. Each of these blocks is a Vandermonde matrix with parameters from the set {1, ωq, ωq^2, . . . , ωq^{q−1}}. Therefore, [Xi0 |Xi1 | . . . |Xiτ−1] is non-singular and it follows that the threshold of our scheme is kA kB. An application of Theorem 3 implies that the worst case condition number is at most O(q^{q−τ+6}).
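The full-rank property underlying Theorem 6 can be spot-checked numerically (illustrative parameters kA = kB = 2, n = q = 5, and an arbitrary set of τ = 4 surviving workers):

```python
import numpy as np

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

kA, kB, n, q = 2, 2, 5, 5
theta = 2 * np.pi / q
Rp = lambda e: np.linalg.matrix_power(rot(theta), e)

# Block columns of the generators: GA_l stacks R^(l*i), GB_l stacks R^(l*kA*i)
GA = [np.vstack([Rp(l * i) for i in range(kA)]) for l in range(n)]
GB = [np.vstack([Rp(l * kA * i) for i in range(kB)]) for l in range(n)]

survivors = [1, 2, 3, 4]                         # any tau = kA*kB workers
Gt = np.hstack([np.kron(GA[l], GB[l]) for l in survivors])
print(Gt.shape, np.linalg.matrix_rank(Gt), np.linalg.cond(Gt))
```

The resulting 16 × 16 matrix is full-rank, i.e., any τ = kA kB = 4 workers suffice for decoding.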
In the previous sections, we considered the case where A and B are partitioned into block-columns. In this section, we consider a more general scenario where A and B are partitioned into both block-columns and block-rows. This construction resembles the entangled polynomial codes of Yu
et al. (2020).
We partition A into 2p block-rows and ∆A = kA block-columns, i.e., A = [A(hi,li,j)], i ∈ [p], l ∈ {0, 1}, j ∈ [kA], where A(hi,li,j) denotes the submatrix indexed by the
hi, li-th block row and the j-th block-column of A. Similarly, we partition B into 2p block-rows
and ∆B = kB block-columns. We let θ = 2π/q, where q ≥ n > 2kA kB p − 1 (recall that n is the number of worker nodes).
The encoding in this scenario is somewhat more complicated to express. We simplify this by
leveraging the following simple lemma whose proof follows from basic Kronecker product properties
Lemma 4. Suppose that matrices M1 and M2 both have ζ rows and the same column dimension, and let U be a ζ × ζ unitary matrix. If M1 and M2 are real-valued, then M1^T M2 = (UM1)^∗ (UM2).

The k-th worker node stores Âhk,ti, B̂hk,ti, t = 0, 1. Thus, each worker node stores γA = 2/(2pkA) = 1/(pkA) and γB = 2/(2pkB) = 1/(pkB) fractions of A and B respectively. Worker node k computes
[Âhk,0i ; Âhk,1i]^T [B̂hk,0i ; B̂hk,1i],   (5.10)

where [X ; Y] denotes the vertical stacking of X and Y.
Before presenting our decoding algorithm and the main result of this section, we discuss the main ideas by means of an example.

Example 19. Let p = 2 and kA = kB = 1, with A and B partitioned as follows.
     [ A(h0,0i,0) ]        [ B(h0,0i,0) ]
A =  [ A(h0,1i,0) ] ,  B = [ B(h0,1i,0) ] .
     [ A(h1,0i,0) ]        [ B(h1,0i,0) ]
     [ A(h1,1i,0) ]        [ B(h1,1i,0) ]
In this example, since kA = kB = 1, there is only one block column in A and B. Therefore,
the index j in A(hi,li,j) and B(hi,li,j) is always 0. Accordingly, to simplify our presentation, we only use the indices i and l to refer to the respective constituent block-rows of A and B. That is, we simplify A(hi,li,j) and B(hi,li,j) to Ahi,li and Bhi,li, respectively. Our scheme aims to let the master node recover A^T B = ∑_{i∈[p]} ∑_{l∈{0,1}} A^T_{hi,li} Bhi,li.
Suppose that Ahi,li and Bhi,li have ζ rows. The encoding process can be defined as
[Âhk,0i ; Âhk,1i] = ∑_{i=0}^{1} (R_{−θ}^{k(i−1)} ⊗ Iζ) [Ahi,0i ; Ahi,1i], and

[B̂hk,0i ; B̂hk,1i] = ∑_{i=0}^{1} (R_{θ}^{k(1−i)} ⊗ Iζ) [Bhi,0i ; Bhi,1i].
The computation in worker node k (cf. (5.10)) can be analyzed as follows. Let

[A^F_{hi,0i} ; A^F_{hi,1i}] = (Q ⊗ Iζ) [Ahi,0i ; Ahi,1i]  and  [B^F_{hi,0i} ; B^F_{hi,1i}] = (Q ⊗ Iζ) [Bhi,0i ; Bhi,1i].

Then
[Âhk,0i ; Âhk,1i]^T [B̂hk,0i ; B̂hk,1i]

(a)
= ( (Q ⊗ Iζ)[Âhk,0i ; Âhk,1i] )^∗ ( (Q ⊗ Iζ)[B̂hk,0i ; B̂hk,1i] )

= ( (Q ⊗ Iζ)(R_{−θ}^{−k} ⊗ Iζ)[Ah0,0i ; Ah0,1i] + (Q ⊗ Iζ)(I2 ⊗ Iζ)[Ah1,0i ; Ah1,1i] )^∗
  ( (Q ⊗ Iζ)(R_{θ}^{k} ⊗ Iζ)[Bh0,0i ; Bh0,1i] + (Q ⊗ Iζ)(I2 ⊗ Iζ)[Bh1,0i ; Bh1,1i] )

(b)
= ( (QR_{−θ}^{−k}Q^∗ ⊗ Iζ)(Q ⊗ Iζ)[Ah0,0i ; Ah0,1i] + (QI2Q^∗ ⊗ Iζ)(Q ⊗ Iζ)[Ah1,0i ; Ah1,1i] )^∗
  ( (QR_{θ}^{k}Q^∗ ⊗ Iζ)(Q ⊗ Iζ)[Bh0,0i ; Bh0,1i] + (QI2Q^∗ ⊗ Iζ)(Q ⊗ Iζ)[Bh1,0i ; Bh1,1i] )

(c)
= ( (diag(ωq^{k}, ωq^{−k}) ⊗ Iζ)[A^F_{h0,0i} ; A^F_{h0,1i}] + [A^F_{h1,0i} ; A^F_{h1,1i}] )^∗
  ( (diag(ωq^{k}, ωq^{−k}) ⊗ Iζ)[B^F_{h0,0i} ; B^F_{h0,1i}] + [B^F_{h1,0i} ; B^F_{h1,1i}] )

(d)
= (A^{F∗}_{h0,0i} B^F_{h1,0i} + A^{F∗}_{h1,1i} B^F_{h0,1i}) ωq^{−k}
+ (A^{F∗}_{h0,0i} B^F_{h0,0i} + A^{F∗}_{h1,0i} B^F_{h1,0i} + A^{F∗}_{h0,1i} B^F_{h0,1i} + A^{F∗}_{h1,1i} B^F_{h1,1i})
+ (A^{F∗}_{h1,0i} B^F_{h0,0i} + A^{F∗}_{h0,1i} B^F_{h1,1i}) ωq^{k},
where
• (a) follows from Lemma 4 applied with the unitary matrix Q ⊗ Iζ ;
• (b) holds because

(Q ⊗ Iζ)(R_{−θ}^{−k} ⊗ Iζ) = (QR_{−θ}^{−k}) ⊗ Iζ = (QR_{−θ}^{−k}Q^∗Q) ⊗ Iζ = (QR_{−θ}^{−k}Q^∗ ⊗ Iζ)(Q ⊗ Iζ) ;

• (c) holds because QRθQ^∗ = diag(ωq, ωq^{−1}) ; and
• (d) follows from the definitions of A^F_{hi,li} and B^F_{hi,li} and collecting powers of ωq.
Thus, it is clear that whenever the master node collects the results of any three distinct worker nodes, it can recover A^{F∗}_{h0,0i} B^F_{h0,0i} + A^{F∗}_{h1,0i} B^F_{h1,0i} + A^{F∗}_{h0,1i} B^F_{h0,1i} + A^{F∗}_{h1,1i} B^F_{h1,1i}. However, we observe that
∗ T
F F
Ahi,0i Bhi,0i Ahi,0i Bhi,0i
=
AF
hi,1i BF
hi,1i Ahi,1i Bhi,1i
Theorem 7. The threshold for the scheme in this section is 2pk_Ak_B − 1. The worst case condition number over the different recovery submatrices is O(q^{q−τ+6}), where τ is the recovery threshold.
Proof. We proceed in a similar manner as in Example 19. Following the encoding rules, the computation at worker node k can be written as
\[
\begin{aligned}
\hat{C}_k &= \hat{A}_k^{F*} \hat{B}_k^{F} \\
&= \sum_{i=0}^{p-1} \sum_{j=0}^{k_A-1} \omega_q^{k((j-1)p+i+1)} A^{F*}_{(\langle i,0\rangle,j)}
\sum_{i=0}^{p-1} \sum_{j=0}^{k_B-1} \omega_q^{k(p-1-i+jpk_A)} B^{F}_{(\langle i,0\rangle,j)} \\
&\quad + \sum_{i=0}^{p-1} \sum_{j=0}^{k_A-1} \omega_q^{-k((j-1)p+i+1)} A^{F*}_{(\langle i,1\rangle,j)}
\sum_{i=0}^{p-1} \sum_{j=0}^{k_B-1} \omega_q^{-k(p-1-i+jpk_A)} B^{F}_{(\langle i,1\rangle,j)}.
\end{aligned}
\tag{5.11}
\]
To better understand the behavior of the sum in (5.11), we divide its terms into two cases.
• Case 1: Useful terms. The master node wants to recover C = A^T B = [C_{i,j}], i ∈ [k_A], j ∈ [k_B],
where each C_{i,j} is a block matrix of size r/k_A × w/k_B. Note that
\[
C_{i,j} = \sum_{u=0}^{p-1} \left( A^{T}_{(\langle u,0\rangle,i)} B_{(\langle u,0\rangle,j)} + A^{T}_{(\langle u,1\rangle,i)} B_{(\langle u,1\rangle,j)} \right).
\]
Thus, the "useful" terms in (5.11) are those of the form $A^{F*}_{(\langle u,0\rangle,i)} B^{F}_{(\langle u,0\rangle,j)} + A^{F*}_{(\langle u,1\rangle,i)} B^{F}_{(\langle u,1\rangle,j)}$, since
\[
\begin{aligned}
A^{F*}_{(\langle u,0\rangle,i)} B^{F}_{(\langle u,0\rangle,j)} + A^{F*}_{(\langle u,1\rangle,i)} B^{F}_{(\langle u,1\rangle,j)}
&= \begin{bmatrix} A^F_{(\langle u,0\rangle,i)} \\ A^F_{(\langle u,1\rangle,i)} \end{bmatrix}^{*}
\begin{bmatrix} B^F_{(\langle u,0\rangle,j)} \\ B^F_{(\langle u,1\rangle,j)} \end{bmatrix} \\
&= \left( (Q \otimes I_\zeta) \begin{bmatrix} A_{(\langle u,0\rangle,i)} \\ A_{(\langle u,1\rangle,i)} \end{bmatrix} \right)^{*}
(Q \otimes I_\zeta) \begin{bmatrix} B_{(\langle u,0\rangle,j)} \\ B_{(\langle u,1\rangle,j)} \end{bmatrix} \\
&= \begin{bmatrix} A_{(\langle u,0\rangle,i)} \\ A_{(\langle u,1\rangle,i)} \end{bmatrix}^{T}
\begin{bmatrix} B_{(\langle u,0\rangle,j)} \\ B_{(\langle u,1\rangle,j)} \end{bmatrix}.
\end{aligned}
\]
It is easy to check that $A^{F*}_{(\langle u,0\rangle,i)} B^{F}_{(\langle u,0\rangle,j)}$ is the coefficient of $\omega_q^{k(ip+jpk_A)}$ and $A^{F*}_{(\langle u,1\rangle,i)} B^{F}_{(\langle u,1\rangle,j)}$ is the coefficient of $\omega_q^{-k(ip+jpk_A)}$.

• Case 2: Interference terms. These are the remaining cross terms in (5.11). It can be verified that the useful terms have no intersection with the interference terms, since |u − v| < p.
Next, we determine the threshold of the proposed scheme. Towards this end, we find the maximum and minimum exponents of ω_q appearing in (5.11), as multiples of k. The threshold can then be obtained as the difference between the maximum and minimum degree, divided by k, plus one. The maximum degree term is
\[
\omega_q^{k(pk_Ak_B-1)} \, A^{F*}_{(\langle p-1,0\rangle, k_A-1)} B^{F}_{(\langle 0,0\rangle, k_B-1)},
\]
and the minimum degree term is
\[
\omega_q^{-k(pk_Ak_B-1)} \, A^{F*}_{(\langle p-1,1\rangle, k_A-1)} B^{F}_{(\langle 0,1\rangle, k_B-1)}.
\]
Moreover, every intermediate power appears in (5.11): for any power d with 0 ≤ d ≤ pk_Ak_B − 1, we can always find a solution such that j₂ = ⌊d/(pk_A)⌋, j₁ = ⌊(d mod pk_A)/p⌋, and i₁ − i₂ = (d mod pk_A) mod p. The same result generalizes to negative d.
Eq. (5.11) shows that the proposed encoding matrix in the transformed domain is a Vandermonde matrix with parameters $\{\omega_q^{-(pk_Ak_B-1)}, \cdots, \omega_q^{pk_Ak_B-1}\}$. An application of Theorem 3 then implies the claimed bound on the worst case condition number.
The proof of Theorem 7 also illustrates the decoding algorithm. Let the k-th worker node compute Ĉ_k = Â_k^T B̂_k. Suppose that the master node receives the computation results from any 2pk_Ak_B − 1 worker nodes, indexed i₀, · · · , i_{2pk_Ak_B−2}. Then the useful terms, for u ∈ [p], i ∈ [k_A], j ∈ [k_B], can be decoded by multiplying the vector [Ĉ_{i₀}, · · · , Ĉ_{i_{2pk_Ak_B−2}}] by the inverse of the Vandermonde matrix
\[
\begin{bmatrix}
\omega_q^{-i_0(pk_Ak_B-1)} & \omega_q^{-i_1(pk_Ak_B-1)} & \cdots & \omega_q^{-i_{2pk_Ak_B-2}(pk_Ak_B-1)} \\
\omega_q^{-i_0(pk_Ak_B-2)} & \omega_q^{-i_1(pk_Ak_B-2)} & \cdots & \omega_q^{-i_{2pk_Ak_B-2}(pk_Ak_B-2)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
\omega_q^{i_0(pk_Ak_B-1)} & \omega_q^{i_1(pk_Ak_B-1)} & \cdots & \omega_q^{i_{2pk_Ak_B-2}(pk_Ak_B-1)}
\end{bmatrix}.
\]
Finally, the result C = [C_{i,j}], i ∈ [k_A], j ∈ [k_B], can be recovered since $C_{i,j} = \sum_{u=0}^{p-1} \left( A^{T}_{(\langle u,0\rangle,i)} B_{(\langle u,0\rangle,j)} + A^{T}_{(\langle u,1\rangle,i)} B_{(\langle u,1\rangle,j)} \right)$.
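The decoding step above, solving a Vandermonde system whose nodes are powers of a root of unity, can be sketched as follows. The tiny parameters p, kA, kB, q and the particular worker indices are illustrative assumptions, not values from the thesis experiments.

```python
import numpy as np

# Toy instance of the decoding step: each worker k returns
#   C_hat_k = sum_{d=-m}^{m} c_d * omega^(k*d),  with m = p*kA*kB - 1.
p, kA, kB = 2, 1, 1            # illustrative parameters
m = p * kA * kB - 1            # recovery threshold is 2m + 1 = 3
q = 7
omega = np.exp(2j * np.pi / q)

rng = np.random.default_rng(1)
c = rng.standard_normal(2 * m + 1) + 1j * rng.standard_normal(2 * m + 1)

workers = [0, 2, 5]            # any 2m + 1 distinct worker indices suffice
results = np.array([sum(c[d + m] * omega ** (k * d) for d in range(-m, m + 1))
                    for k in workers])

# Vandermonde system: row l corresponds to worker i_l, column to degree d.
V = np.array([[omega ** (k * d) for d in range(-m, m + 1)] for k in workers])
c_rec = np.linalg.solve(V, results)
assert np.allclose(c_rec, c)
print("Laurent coefficients recovered from", len(workers), "workers")
```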
We now compare the proposed threshold to that of Fahim and Cadambe (2019), whose threshold we denote by τ_{M−V}. Since for p = 1 our threshold in Section 5.4 is already better than τ_{M−V}, we are only interested in the case p > 1. Let τ_dif = τ_{M−V} − τ_proposed = 2k_Ak_Bp − 2(k_Ak_B + pk_A + pk_B) + k_A + k_B + 2p. We show that τ_dif > 0, i.e., that our threshold is lower, in most cases.
Suppose that the number of workers n is odd, so that we can pick q = n for the rotation matrix
embedding. From a theoretical perspective our schemes have a worst case condition number (over
the different recovery submatrices) of O(q q−τ +6 ) where τ is the recovery threshold. In contrast, the
scheme of Yu et al. (2017) has condition numbers that are exponential in n. The work most closely
related to ours is by Fahim and Cadambe (2019), which demonstrates an upper bound of O(q 2(q−τ ) )
on the worst case condition number. It can be noted that this grows much faster than our upper
bound in the parameter q − τ . In numerical experiments, our worst case condition numbers are
much smaller than the work of Fahim and Cadambe (2019); we discuss this in detail below.
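The contrast between real-valued and unit-circle Vandermonde conditioning can be illustrated with a small numerical sketch. The node choices here are our own illustration (one representative recovery set), not the exact recovery submatrices of any particular scheme.

```python
import numpy as np

n, tau = 31, 29
rng = np.random.default_rng(0)
# Real Vandermonde: nodes sampled uniformly from [-1, 1].
real_nodes = rng.uniform(-1, 1, n)
# Unit-circle Vandermonde: nodes are the n-th roots of unity.
unit_nodes = np.exp(2j * np.pi * np.arange(n) / n)

def worst_cond(nodes, tau):
    """Condition number for one representative recovery set of tau workers."""
    cols = np.asarray(nodes)[:tau]
    V = np.vander(cols, tau, increasing=True)
    return np.linalg.cond(V)

print("real nodes :", worst_cond(real_nodes, tau))   # astronomically large
print("unit circle:", worst_cond(unit_nodes, tau))   # small by comparison
```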
Certain approaches Mallick et al. (2019); Das et al. (2018); Das and Ramamoorthy (2019);
Ramamoorthy et al. (2019) only apply for matrix-vector multiplication and furthermore do not
provide any explicit guarantees on the worst case condition number. Other approaches include the
work of Subramaniam et al. (2019) which uses random linear encoding of the A and B matrices
and the work of Das et al. (2019) that uses a convolutional coding approach to this problem. We
note that both these approaches work via random sampling and do not have a theoretical upper
bound on the worst case condition number. We present numerical experiments comparing our work
with Yu et al. (2017); Fahim and Cadambe (2019); Subramaniam et al. (2019) below.
The central point of our paper is that we can leverage the well-conditioned behavior of Vander-
monde matrices with parameters on the unit circle while continuing to work with computation over
the reals. We compare our results with the work of Yu et al. (2017) (called “Real Vandermonde”),
Table 5.1  Comparison for the matrix-vector case with n = 31; A has size 28000 × 19720 and x has length 28000.

Scheme             | γ_A  | τ  | Avg. Cond. Num. | Max. Cond. Num. | Avg. Worker Comp. Time (s) | Dec. Time (s)
-------------------|------|----|-----------------|-----------------|----------------------------|--------------
Real Vand.         | 1/29 | 29 | 1.1 × 10^13     | 2.9 × 10^13     | 1.2 × 10^−3                | 9 × 10^−5
Complex Vand.      | 1/29 | 29 | 12              | 55              | 2.9 × 10^−3                | 2.8 × 10^−4
Circ. Perm. Embed. | 1/28 | 29 | 12              | 55              | 1.2 × 10^−3                | 3.7 × 10^−4
Rot. Mat. Embed.   | 1/29 | 29 | 12              | 55              | 1.3 × 10^−3                | 10^−4
Figure 5.1  Normalized MSE (worst case) vs. SNR (dB) for the matrix-vector A^T x system with n = 31, τ = 29; A has size 28000 × 19720 and x has length 28000. Curves shown: Complex Vand. and Circ. Perm. Embed.
a “Complex Vandermonde” scheme where the evaluation points are chosen from the complex unit
circle, the work of Fahim and Cadambe (2019) and Subramaniam et al. (2019).
All experiments were run on the AWS EC2 system with a t2.2xlarge instance (for the master node).
In Table 5.1, we compare the average and worst case condition number of the different schemes
for matrix-vector multiplication. The system under consideration has n = 31 worker nodes and a
threshold specified by the third column (labeled as τ ). The evaluation points for Yu et al. (2017)
were uniformly sampled from the interval [−1, 1] Berrut and Trefethen (2004). The Complex
Figure 5.2  Normalized MSE (worst case) vs. SNR (dB) for the matrix-matrix A^T B system with n = 31, k_A = 4, k_B = 7; A is of size 8000 × 14000 and B is of size 8000 × 14000. Curves shown: Complex Vand. and Rot. Mat. Embed.
Vandermonde scheme has evaluation points which are the 31-st roots of unity. The Fahim and
Cadambe (2019) and Subramaniam et al. (2019) schemes are not applicable to the matrix-vector
case. It can be observed from Table 5.1 that both the worst case and the average condition
numbers of our schemes are over eleven orders of magnitude better than the Real Vandermonde
scheme. Furthermore, there is an exact match of the condition number values for all the other
schemes. This can be understood by following the discussion in Section 5.3. Specifically, our schemes
have the property that the condition number only depends on the eigenvalues of the corresponding
circulant permutation matrix and rotation matrix, respectively. These eigenvalues lie precisely on the unit circle.
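This eigenvalue property is easy to verify numerically; the size q = 7 below is an arbitrary illustrative choice.

```python
import numpy as np

q = 7
# Circulant permutation matrix: a cyclic shift of the coordinates.
P = np.roll(np.eye(q), 1, axis=0)
assert np.allclose(np.abs(np.linalg.eigvals(P)), 1.0)   # q-th roots of unity

theta = 2 * np.pi / q
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(np.abs(np.linalg.eigvals(R)), 1.0)   # e^{+-i theta}
print("all eigenvalues lie on the unit circle")
```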
It can be observed that the decoding flop count for both matrix-vector and matrix-matrix
multiplication is independent of t, i.e., in the regime where t is very large the decoding time may
be neglected with respect to the worker node computation time. Nevertheless, from a practical standpoint, the measured decoding times are still of interest.
When the matrix A is of dimension 28000 × 19720 and x is of length 28000, the last two
columns in Table 5.1 indicate the average worker node computation time and the master node
decoding time for the different schemes. These numbers were obtained by averaging over several
runs of the algorithm. It can be observed that the Complex Vandermonde scheme requires about
twice the worker computation time of our schemes; thus, it is wasteful of worker node computation
resources. On the other hand, our schemes achieve the same condition numbers with computation
over the reals. The decoding times of almost all the schemes are quite small. However, the Circulant
Permutation Matrix scheme requires a decoding time that is somewhat higher than the rotation
matrix embedding, even though we can use FFT based approaches for it. We expect that for much
larger matrices, the decoding time will remain negligible compared to the worker computation time.
Our next set of results compares the mean-squared error (MSE) in the decoded result for the
different schemes. Let $A^T x$ denote the precise value of the computation and $\widehat{A^T x}$ denote the decoded result. To model
numerical precision problems, we added i.i.d. Gaussian noise (of different SNRs) to the result of
the worker node computation. The master node then performs decoding on the noisy vectors. The
plots in Figure 5.1 correspond to the worst case choice of worker nodes for each of the schemes.
It can be observed that the Circulant Permutation Matrix Embedding has the best performance.
This is because many of the matrices on the block-diagonal in (B.2) (see Appendix B) have
well-behaved condition numbers and only a few correspond to the worst case. We have not shown
the results for the Real Vandermonde case here because the normalized MSE was close to 1.0.
In the matrix-matrix scenario we again consider a system with n = 31 worker nodes and k_A = 4
and k_B = 7, so that the threshold is τ = k_Ak_B = 28. Once again we observe that the worst case
condition number of the Rotation Matrix Embedding is about eleven orders of magnitude lower
than the Real Vandermonde case. Furthermore, the schemes of Fahim and Cadambe (2019) and
Subramaniam et al. (2019) have worst case condition numbers that are three orders of magnitude
and two orders of magnitude higher than our scheme, respectively. For the Subramaniam et al. (2019) scheme we
performed 200 random trials and picked the scheme with the lowest worst case condition number.

When the matrix A is of dimension 8000 × 14000 and B is of dimension 8000 × 14000, the worker
node computation times and decoding times are listed in Table 5.2. As expected, the Complex
Vandermonde scheme takes much longer for the worker node computations, whereas the Rotation
Matrix Embedding, Fahim and Cadambe (2019), and Subramaniam et al. (2019) take about the
same time. The decoding times are also very similar. For the matrix-matrix case the normalized
MSE is defined as $\|A^T B - \widehat{A^T B}\|_F / \|A^T B\|_F$, where $A^T B$ is the true product and $\widehat{A^T B}$ is the decoded product
using one of the methods. As shown in Figure 5.2, the normalized MSE of our Rotation Matrix
Embedding scheme is about five orders of magnitude lower than the scheme of Fahim and
Cadambe (2019). The normalized MSE of the Real Vandermonde case is almost 1.0, so we do not
plot it.
Figure 5.3  Normalized MSE (worst case) vs. SNR (dB). Curves shown: Real Vandermonde, Complex Vandermonde, Rotation Matrix Embedding, and the [11] scheme.
We next consider a system with n = 17 worker nodes and u_A = 2, u_B = 2, p = 2. We emphasize that the four schemes in Table 5.3 have different
thresholds. The Vandermonde-based schemes have a lower threshold than the Rotation Matrix Embedding
scheme and the Fahim and Cadambe (2019) scheme. As we have discussed in Claim 16, in most
cases, the Rotation Matrix Embedding scheme has a lower threshold than the Fahim and Cadambe
(2019) scheme. However, for a fair comparison, we consider a case where they have the same threshold.
We observe that the Rotation Matrix Embedding scheme has a much lower condition number than the
other three schemes. Notice that it is even lower than the Complex Vandermonde scheme. For the Complex Vandermonde
scheme with p ≥ 2, the encoding matrix is a Vandermonde matrix of size 9 × 17 whose
generators are roots of unity; by Claim 3, its condition number is bounded by O(17^14). The
encoding matrix of the Rotation Matrix Embedding scheme is a Vandermonde matrix of size 15 × 17,
whose condition number is upper bounded by O(17^8). As shown in Figure 5.3, the normalized
MSE of our Rotation Matrix Embedding scheme is much lower than the other schemes. As for the
computation and decoding times in Table 5.3, the Complex Vandermonde scheme requires higher
computation and decoding time than the Real Vandermonde scheme (about 4×) since it operates
over the complex field. Fahim and Cadambe (2019) requires higher decoding time than the Real
Vandermonde scheme (about 2×) since its threshold is higher, but they have the same computation time.
The Rotation Matrix Embedding scheme has the highest decoding time, since it has a higher threshold
than the Real/Complex Vandermonde schemes and its decoding algorithm operates over the complex field.
We now consider the finite field embedding proposed in Yu et al. (2020). As discussed before,
this is mentioned as a potential solution to the numerical issues encountered when operating over
the reals in Section VII of Yu et al. (2020). For this purpose the real entries need to be multiplied
by large enough integers and then quantized so that each entry lies between 0 and p − 1 for a large
enough prime p. All computations are performed within the finite field of order p, i.e., by
reducing the computations modulo p. This technique requires that each A_i^T B_j has all entries between 0 and p − 1; we refer to this as the dynamic range constraint (DRC).
Let α be an upper bound on the absolute value of the matrix entries in A and B. Then, the condition
\[
\alpha^2 t \leq p - 1
\]
needs to be satisfied. Otherwise, the modulo-p operation will cause arbitrarily large errors.
We note here that the publicly available code for Yu et al. (2017) uses p = 65537. Now consider
a system with k_A = 3, k_B = 2. Even for small matrices with A of size 400 × 200, B of size
400 × 300, and entries chosen as random integers between 0 and 30, the DRC is violated for p = 65537,
since 30² × 400 > 65537. In this scenario, the normalized MSE of the Yu et al. (2017) approach
is 0.7746. In contrast, our method has a normalized MSE ≈ 2 × 10^−28 for the same system with
k_A = 3, k_B = 2.
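The effect of violating the DRC can be reproduced with a small sketch; the dimensions and random entries below are illustrative and this is not the referenced implementation.

```python
import numpy as np

p = 65537                     # prime used in the public implementation
t, alpha = 400, 30            # inner dimension and entry bound
rng = np.random.default_rng(2)
A = rng.integers(0, alpha + 1, size=(t, 4))
B = rng.integers(0, alpha + 1, size=(t, 3))

exact = A.T @ B               # entries can be as large as alpha^2 * t = 360000
finite = ((A % p).T @ (B % p)) % p   # the same product, reduced modulo p

assert alpha**2 * t > p - 1   # the DRC alpha^2 * t <= p - 1 is violated
assert not np.array_equal(exact, finite)   # modulo wraparound corrupts entries
print("max exact entry:", exact.max(), "vs p - 1 =", p - 1)
```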
When working over 64-bit integers, the largest representable integer is ≈ 10^19. Thus, even if t < 10^5, the
method can only support α ≤ 10^7, so the admissible range is rather limited. Furthermore, assuming
matrices of limited dynamic range is not always valid. In machine learning scenarios such
as deep neural networks, matrix multiplications are applied repeatedly, and the output of one
stage serves as the input to the next. Thus, over several iterations the dynamic range of the
matrix entries will grow, and applying the finite field embedding technique will necessarily incur
quantization error.
The most serious limitation of the method comes from the fact that the error in the computation
(owing to quantization) depends very strongly on the actual entries of the A and B matrices.
We demonstrate this next. In fact, we can generate structured integer matrices A and B such that
the normalized MSE of their approach is exactly 1.0. Towards this end we first pick the prime
p = 2147483647 (much larger than the prime in their publicly available code) so that their method can
support a higher dynamic range. Next let r = w = t = 400; this implies that α ≤ 1000 satisfies the DRC.
Each A_{ij} and B_{ij} is a matrix of size 200 × 200, with entries chosen from the following distributions:
A₁₁, A₁₂ distributed Unif(0, . . . , 9999) and A₂₁, A₂₂ distributed Unif(0, . . . , 9); next, B₁₁, B₁₂
distributed Unif(0, . . . , 9) and B₂₁, B₂₂ distributed Unif(0, . . . , 9999). In this scenario, the DRC
requires us to multiply each matrix by 0.1 and quantize each entry between 0 and 999. Note that
this implies that A₂₁, A₂₂, B₁₁, B₁₂ are all quantized into zero submatrices, since every entry of these
four submatrices is less than 10. We emphasize that the finite field embedding technique can only compute the product of the quantized matrices, i.e.,
\[
\tilde{A}^T \tilde{B} = \begin{bmatrix} \tilde{A}_{11} & \tilde{A}_{12} \\ 0 & 0 \end{bmatrix}^{T}
\begin{bmatrix} 0 & 0 \\ \tilde{B}_{21} & \tilde{B}_{22} \end{bmatrix} = 0.
\]
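The adversarial construction above can be reproduced in a few lines. This is a small stand-in using the same block distributions; the mod-p step is omitted since the quantization alone already zeroes the product.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200                        # block size, as in the construction above
A11, A12 = rng.integers(0, 10000, (n, n)), rng.integers(0, 10000, (n, n))
A21, A22 = rng.integers(0, 10, (n, n)), rng.integers(0, 10, (n, n))
B11, B12 = rng.integers(0, 10, (n, n)), rng.integers(0, 10, (n, n))
B21, B22 = rng.integers(0, 10000, (n, n)), rng.integers(0, 10000, (n, n))
A = np.block([[A11, A12], [A21, A22]])
B = np.block([[B11, B12], [B21, B22]])

# DRC-mandated scaling by 0.1 and quantization to integers in {0, ..., 999}:
Aq = (A * 0.1).astype(int)
Bq = (B * 0.1).astype(int)
assert not Aq[n:].any() and not Bq[:n].any()   # low blocks quantize to zero

exact = A.T @ B
recovered = (Aq.T @ Bq) * 100                  # undo the two 0.1 scalings
mse = np.linalg.norm(exact - recovered) / np.linalg.norm(exact)
assert np.isclose(mse, 1.0)                    # the entire product is lost
print("normalized MSE:", mse)
```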
implies that the normalized MSE of their scheme is exactly 1.0.

Table 5.4  Performance of matrix inversion over a large prime order field in Python 3.7. The table shows the computation time for inverting an ℓ × ℓ matrix G over a finite field of order p, where Ĝ⁻¹ denotes the inverse obtained by applying the sympy library.

Thus, the error of the finite field embedding
technique has a very strong dependence on the matrix entries. We note here that even if we consider
other quantization schemes or larger 64-bit primes, one can arrive at adversarial examples such as
the ones shown above. Once again for these examples, our methods have a normalized MSE of at
most 10−27 .
In our experience, the finite field embedding technique also suffers from significant computational
issues in implementation. Note that the technique requires the computation, at the master node, of the
inverse of the matrix that is required for decoding the final result. We implemented this
within Python 3.7 using the sympy library (see the Tang (2020) GitHub repository). We performed
experiments with p = 65537 and p = 2147483647. As shown in Table 5.4, for the smaller prime
p = 65537, the computation time of the inverse is rather high and can dominate the overall execution time. On the other hand,
for the larger prime p = 2147483647, the error in the computed inverse is very high for 12 × 12
and 15 × 15 matrices; the corresponding time taken is even higher. It is possible that very careful
implementations can avoid these issues; however, we are unaware of any such publicly
available code.
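For reference, a minimal sketch of modular matrix inversion using sympy's `inv_mod`; the matrix, its size, and the prime are illustrative assumptions, not the exact matrices timed in Table 5.4.

```python
import time
from sympy import Matrix

p, ell = 65537, 8
# A Vandermonde-type matrix with distinct nodes 1..ell (invertible mod p).
G = Matrix(ell, ell, lambda i, j: pow(i + 1, j, p))

start = time.perf_counter()
G_inv = G.inv_mod(p)                       # exact inverse over GF(p)
elapsed = time.perf_counter() - start

# Exact arithmetic: the product reduces to the identity modulo p.
prod = (G * G_inv).applyfunc(lambda x: x % p)
assert prod == Matrix.eye(ell)
print(f"inverted {ell}x{ell} matrix mod {p} in {elapsed:.3f} s")
```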
To summarize, the finite field embedding technique suffers from major dynamic range limitations
and associated computational issues and cannot be used to support real computations.
BIBLIOGRAPHY
Ahlswede, R., Cai, N., Li, S.-Y., and Yeung, R. W. (2000). Network information flow. IEEE Trans.
on Inf. Theory, 46(4):1204–1216.
Alon, N., Moitra, A., and Sudakov, B. (2012). Nearly complete graphs decomposable into large
induced matchings and their applications. In Proc. of the 44-th Annual ACM symposium on
Theory of computing, pages 1079–1090.
Baranyai, Z. (1975). On the factorization of the complete uniform hypergraph. In Infinite and
finite sets (Colloq., Keszthely, 1973; dedicated to P. Erdos on his 60th birthday), pages 91–108.
Berrut, J.-P. and Trefethen, L. N. (2004). Barycentric Lagrange interpolation. SIAM Review,
46(3):501–517.
Blake, I. F. (1972). Codes over certain rings. Information and Control, 20(4):396–404.
Das, A. B., Ramamoorthy, A., and Vaswani, N. (2019). Random convolutional coding
for robust and straggler resilient distributed matrix computation. [Online] Available at:
https://arxiv.org/abs/1907.08064.
Das, A. B., Tang, L., and Ramamoorthy, A. (2018). C3 LES : Codes for coded computation that
leverage stragglers. In IEEE Inf. Th. Workshop, pages 1–5.
Dougherty, R., Freiling, C., and Zeger, K. (2005). Insufficiency of linear coding in network infor-
mation flow. IEEE Trans. on Inf. Theory, 51(8):2745–2759.
Dutta, S., Cadambe, V., and Grover, P. (2016). Short-dot: Computing large linear transforms
distributedly using coded short dot products. In Proc. of Adv. in Neural Inf. Proc. Sys., pages
2100–2108.
Dutta, S., Fahim, M., Haddadpour, F., Jeong, H., Cadambe, V., and Grover, P. (2019). On
the optimal recovery threshold of coded matrix multiplication. IEEE Trans. on Inf. Theory,
66(1):278–301.
Fahim, M. and Cadambe, V. R. (2019). Numerically stable polynomially coded computing. [Online]
Available at: https://arxiv.org/abs/1903.08326.
Gautschi, W. (1990). How (un) stable are Vandermonde systems? Asymptotic and Computational
Analysis (Lecture Notes in Pure and Applied Mathematics), 124:193–210.
Ghasemi, H. and Ramamoorthy, A. (2016). Further results on lower bounds for coded caching. In
IEEE Int. Symp. on Inf. Theory, pages 2319–2323.
Ghasemi, H. and Ramamoorthy, A. (2017b). Asynchronous coded caching. In IEEE Int. Symp. on
Inf. Theory, pages 2438–2442.
Ghasemi, H. and Ramamoorthy, A. (2017c). Improved lower bounds for coded caching. IEEE
Trans. on Inf. Theory, 63(7):4388–4413.
Graham, R. L., Knuth, D. E., and Patashnik, O. (1994). Concrete mathematics: a foundation for
computer science (2nd ed.). Addison-Wesley Professional.
Gray, R. M. (2006). Toeplitz and circulant matrices: A review. Foundations and Trends in
Communications and Information Theory, 2(3):155–239.
Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms. SIAM:Society for Industrial
and Applied Mathematics.
Ho, T., Médard, M., Koetter, R., Karger, D. R., Effros, M., Shi, J., and Leong, B. (2006). A random
linear network coding approach to multicast. IEEE Trans. on Inf. Theory, 52(10):4413–4430.
Horn, R. A. and Johnson, C. R. (1991). Topics in matrix analysis. Cambridge University Press.
Huang, Q., Tang, L., He, S., Xiong, Z., and Wang, Z. (2014). Low-complexity encoding of quasi-
cyclic codes based on Galois Fourier transform. IEEE Trans. on Comm., 62(6):1757–1767.
Huang, S. and Ramamoorthy, A. (2013). An achievable region for the double unicast problem based
on a minimum cut analysis. IEEE Trans. on Comm., 61(7):2890–2899.
Huang, S. and Ramamoorthy, A. (2014). On the multiple unicast capacity of 3-source, 3-terminal
directed acyclic networks. IEEE/ACM Trans. Netw., 22(1):285–299.
Huang, S., Ramamoorthy, A., and Medard, M. (2011). Minimum cost mirror sites using network
coding: Replication versus coding at the source nodes. IEEE Trans. on Inf. Theory, 57(2):1080–
1091.
Ji, M., Tulino, A., Llorca, J., and Caire, G. (2015a). Caching-aided coded multicasting with multiple
random requests. In IEEE Inf. Th. Workshop, pages 1–5.
Ji, M., Wong, M. F., Tulino, A. M., Llorca, J., Caire, G., Effros, M., and Langberg, M. (2015b). On
the fundamental limits of caching in combination networks. In IEEE 16th International Workshop
on Signal Processing Advances in Wireless Communications (SPAWC), pages 695–699.
Konstantinidis, K. and Ramamoorthy, A. (2020, to appear). Resolvable designs for speeding up
distributed computing. IEEE/ACM Trans. Netw.
Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., and Ramchandran, K. (2018). Speeding up
distributed machine learning using codes. IEEE Trans. on Inf. Theory, 64(3):1514–1529.
Lee, K., Suh, C., and Ramchandran, K. (2017). High-dimensional coded matrix multiplication. In
IEEE Int. Symp. on Inf. Theory, pages 2418–2422.
Li, S., Maddah-Ali, M. A., Yu, Q., and Avestimehr, A. S. (2017). A fundamental tradeoff be-
tween computation and communication in distributed computing. IEEE Trans. on Inf. Theory,
64(1):109–128.
Li, S.-Y., Yeung, R. W., and Cai, N. (2003). Linear network coding. IEEE Trans. on Inf. Theory,
49(2):371–381.
Lin, S. and Costello, D. J. (2004). Error Control Coding, 2nd Ed. Prentice Hall.
Maddah-Ali, M. A. and Niesen, U. (2014b). Fundamental limits of caching. IEEE Trans. on Info.
Theory, 60(5):2856–2867.
Mallick, A., Chaudhari, M., Sheth, U., Palanikumar, G., and Joshi, G. (2019). Rateless codes for
near-perfect load balancing in distributed matrix-vector multiplication. Proceedings of the ACM
on Measurement and Analysis of Computing Systems, 3(3):1–40.
Ngai, C. K. and Yeung, R. W. (2004). Network coding gain of combination networks. In IEEE Inf.
Th. Workshop, pages 283–287.
Olmez, O. and Ramamoorthy, A. (2016). Fractional repetition codes with flexible repair from
combinatorial designs. IEEE Trans. on Inf. Theory, 62(4):1565 –1591.
Pan, V. (2016). How bad are Vandermonde matrices? SIAM Journal on Matrix Analysis and
Applications, 37(2):676–694.
Pan, V. Y. (2013). Polynomial Evaluation and Interpolation: Fast and Stable Approximate Solution.
Citeseer.
Rai, B. K. and Dey, B. K. (2012). On network coding for sum-networks. IEEE Trans. on Inf.
Theory, 58(1):50–63.
Ramamoorthy, A., Das, A. B., and Tang, L. (2020). Straggler-resistant distributed matrix com-
putation via coding theory: Removing a bottleneck in large-scale data processing. IEEE Signal
Processing Magazine, 37(3):136–145.
Ramamoorthy, A. and Langberg, M. (2013). Communicating the sum of sources over a network.
IEEE Journal on Selected Areas in Communications, 31(4):655–665.
Ramamoorthy, A. and Tang, L. (2019). Numerically stable coded matrix computations via circulant
and rotation matrix embeddings. [Online] Available at: https://arxiv.org/abs/1910.06515.
Ramamoorthy, A., Tang, L., and Vontobel, P. O. (2019). Universally decodable matrices for
distributed matrix-vector multiplication. In IEEE Int. Symp. on Inf. Theory, pages 1777–1781.
Shangguan, C., Zhang, Y., and Ge, G. (2018). Centralized coded caching schemes: A hypergraph
theoretical approach. IEEE Trans. on Inf. Theory, 64(8):5755–5766.
Shanmugam, K., Ji, M., Tulino, A. M., Llorca, J., and Dimakis, A. G. (2014). Finite length analysis
of caching-aided coded multicasting. In 52nd Annual Allerton Conference on Communication,
Control, and Computing, pages 914–920.
Shanmugam, K., Ji, M., Tulino, A. M., Llorca, J., and Dimakis., A. G. (2016). Finite-length
analysis of caching-aided coded multicasting. IEEE Trans. on Inf. Theory, 62(10):5524–5537.
Shanmugam, K., Tulino, A. M., and Dimakis, A. G. (2017). Coded caching with linear subpack-
etization is possible using Ruzsa-Szeméredi graphs. In IEEE Int. Symp. on Inf. Theory, pages
1237–1241.
Tang, L., Huang, Q., Wang, Z., and Xiong, Z. (2013). Low-complexity encoding of binary quasi-
cyclic codes based on Galois Fourier transform. In IEEE Int. Symp. on Inf. Theory, pages 131–135.
Tang, L., Konstantinidis, K., and Ramamoorthy, A. (2019). Erasure coding for distributed matrix
multiplication for matrices with bounded entries. IEEE Comm. Lett., 23(1):8–11.
Tang, L. and Ramamoorthy, A. (2016a). Coded caching for networks with the resolvability property.
In IEEE Int. Symp. on Inf. Theory, pages 420–424.
Tang, L. and Ramamoorthy, A. (2016b). Coded caching with low subpacketization levels. In
Workshop on Network Coding (NetCod), pages 1–6.
Tang, L. and Ramamoorthy, A. (2017). Low subpacketization schemes for coded caching. In IEEE
Int. Symp. on Inf. Theory, pages 2790–2794.
Tang, L. and Ramamoorthy, A. (2018). Coded caching schemes with reduced subpacketization
from linear block codes. IEEE Trans. on Inf. Theory, 64(4):3099–3120.
Wang, S., Liu, J., and Shroff, N. B. (2018). Coded sparse matrix multiplication. In Proc. 35th Int.
Conf. on Mach. Learning, pages 5139–5147.
Yan, Q., Cheng, M., Tang, X., and Chen, Q. (2017a). On the placement delivery array design for
centralized coded caching scheme. IEEE Trans. on Inf. Theory, 63(9):5821–5833.
Yan, Q., Tang, X., Chen, Q., and Cheng, M. (2017b). Placement delivery array design through
strong edge coloring of bipartite graphs. IEEE Comm. Lett., 22(2):236–239.
Yu, Q., Maddah-Ali, M. A., and Avestimehr, A. S. (2017). Polynomial codes: an optimal design for
high-dimensional coded matrix multiplication. In Proc. of Adv. in Neural Inf. Proc. Sys., pages
4403–4413.
Yu, Q., Maddah-Ali, M. A., and Avestimehr, A. S. (2018). Characterizing the rate-memory tradeoff
in cache networks within a factor of 2. IEEE Trans. on Inf. Theory, 65(1):647–663.
Yu, Q., Maddah-Ali, M. A., and Avestimehr, A. S. (2020). Straggler mitigation in distributed
matrix multiplication: Fundamental limits and optimal coding. IEEE Trans. on Inf. Theory,
66(3):1920–1933.
Zhang, M., Tang, L., Huang, Q., and Wang, Z. (2014). Low complexity encoding algorithm of
RS-based QC-LDPC codes. In IEEE Inf. Th. Workshop, pages 1–4.
Lemma 5. An (n, k) linear block code over Z mod q with generator matrix G = [g_{ab}] can construct
a resolvable block design by the procedure in Section 3.2.1 if gcd(q, g_{0b}, g_{1b}, · · · , g_{(k−1)b}) = 1 for
0 ≤ b < n.

Proof. Let q = q₁q₂ · · · q_d be the factorization of q into powers of distinct primes. If
gcd(q, g_{0b}, g_{1b}, · · · , g_{(k−1)b}) = 1, then it is evident that gcd(q_i, g_{0b}, g_{1b}, · · · , g_{(k−1)b}) = 1 for 1 ≤ i ≤ d.
As q_i is either a prime or a prime power, it follows that there exists a g_{a*b} which is relatively prime
to q_i. Consider the equation
\[
\sum_{a=0}^{k-1} u_a g_{ab} = \Delta_b, \tag{A.1}
\]
where u = [u₀, · · · , u_{k−1}]. We consider eq. (A.1) over the ring Z mod q_i and rewrite it as
\[
\Delta_b - u_{a^*} g_{a^*b} = \sum_{a \neq a^*} u_a g_{ab}.
\]
For arbitrary u_a, a ≠ a*, this equation has a unique solution for u_{a*}, since g_{a*b} is a unit in Z
mod q_i. This implies that there are q_i^{k−1} distinct solutions to (A.1) over Z mod q_i. Using the
Chinese remainder theorem, eq. (A.1) has q₁^{k−1} × q₂^{k−1} × · · · × q_d^{k−1} = q^{k−1} solutions over Z mod q.
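The solution count in Lemma 5 can be checked by brute force for a small ring; the values q = 12, k = 3 and the column g below are illustrative choices.

```python
from itertools import product
from math import gcd

q, k = 12, 3                 # q = 2^2 * 3, a ring with zero divisors
g = [4, 3, 6]                # one column of G; gcd(12, 4, 3, 6) = 1
assert gcd(q, *g) == 1

# Count, for each target residue Delta, the vectors u with u . g = Delta (mod q).
counts = [0] * q
for u in product(range(q), repeat=k):
    counts[sum(ua * ga for ua, ga in zip(u, g)) % q] += 1

assert counts == [q ** (k - 1)] * q      # exactly q^(k-1) solutions each
print("each residue is hit exactly", q ** (k - 1), "times")
```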
Remark 12. From Lemma 5, it can be easily verified that a linear block code over Z mod q can
construct a resolvable block design if, for each column g_i of the generator matrix, one of the following conditions holds:

• at least one entry of g_i is a unit in Z mod q; or
• all non-zero entries in g_i are zero divisors, but their greatest common divisor is 1.

For the SPC code over Z mod q, all the non-zero entries in the generator matrix are 1, which is a unit.
Proof of Lemma 2
First, we show that the proposed delivery scheme allows each user's demand to be satisfied.
Note that Claim 2 shows that each user in a parallel class that belongs to the recovery set S_a
recovers all missing subfiles with a specified superscript from it. Thus, if signals are generated
(according to Claim 2) for each recovery set, we are done, provided each parallel class appears in
enough recovery sets. This is equivalent to showing that the bipartite recovery set graph is such
that each parallel class has degree z and each recovery set has degree k + 1.
Towards this end, consider a fixed parallel class indexed by j, with j < n. We claim that there exist exactly z solutions (a_α, b_α), α = 1, · · · , z, to
\[
a_\alpha (k + 1) + b_\alpha = j + n(\alpha - 1), \tag{A.2}
\]
such that a_{α₁} ≠ a_{α₂} for α₁ ≠ α₂. The existence of a solution for each equation above
follows from the division algorithm. Note that a_α < nz/(k + 1), as the RHS < nz. Furthermore,
for 1 ≤ α₁ ≤ z and 1 ≤ α₂ ≤ z with α₁ ≠ α₂, we cannot have solutions to eq. (A.2) such that a_{α₁} = a_{α₂},
as this would imply that |b_{α₁} − b_{α₂}| ≥ n, which is a contradiction. This shows that each parallel
class has degree z in the bipartite graph.
The following facts follow easily from the construction of the recovery sets. The degree of each
recovery set in the bipartite graph is k + 1 and there are nz/(k + 1) of them; multiple edges between a
recovery set and a parallel class are disallowed. Therefore, the total number of edges in the bipartite
graph is nz. As each parallel class participates in at least z recovery sets, by this counting argument
each parallel class participates in exactly z recovery sets.
Finally, we calculate the rate of the delivery phase. In total, the server transmits $q^k (q - 1) \frac{zn}{k+1}$
equations, where each transmitted symbol has the size of a subfile. Thus, the rate is
\[
R = q^k (q - 1) \frac{zn}{k+1} \cdot \frac{1}{q^k z} = \frac{(q-1)n}{k+1}.
\]
Proof of Claim 4
In the above expression and the subsequent discussion, if i is such that i < 0, we set g_i = 0.
By Claim 3, a cyclic code with generator matrix G satisfies the CCP if all submatrices
\[
G_{S \setminus (a+j)_n} = [g_{(a)_n}, \cdots, g_{(a+j-1)_n}, g_{(a+j+1)_n}, \cdots, g_{(a+k)_n}],
\]
where a = n − ⌊k/2⌋ − 1 and 0 ≤ j ≤ k, have full rank. In what follows, we argue that this is true. Note
that in the generator matrix of a cyclic code, any k consecutive columns are linearly independent
Lin and Costello (2004). Therefore, for j = 0 and j = k, G_{S\(a+j)_n} has full rank without any further argument.
For the remaining values of j, G_{S\(a+j)_n} can be partitioned into blocks involving the matrices
\[
A_j = \begin{bmatrix}
g_0 & g_1 & \cdots & g_{\lceil k/2 \rceil - 1} \\
0 & g_0 & \cdots & g_{\lceil k/2 \rceil - 2} \\
\vdots & & \ddots & \vdots \\
0 & \cdots & 0 & g_0
\end{bmatrix},
\qquad
C_j = \begin{bmatrix}
g_{n-k-1} & g_{n-k} & 0 & \cdots & 0 \\
g_{n-k-2} & g_{n-k-1} & g_{n-k} & \cdots & 0 \\
\vdots & & \ddots & & \vdots \\
g_{n-k-j+1} & \cdots & \cdots & \cdots & g_{n-k} \\
g_{n-k-j} & \cdots & \cdots & \cdots & g_{n-k-1}
\end{bmatrix},
\]
and
\[
E_j = \begin{bmatrix}
g_{n-k} & 0 & \cdots & 0 \\
g_{n-k-1} & g_{n-k} & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
g_{n-k-\lfloor k/2 \rfloor + j + 1} & \cdots & \cdots & g_{n-k}
\end{bmatrix}.
\]
The matrices A_j and E_j have full rank, as they are respectively upper triangular and lower triangular
with non-zero entries on the diagonal (g₀ and g_{n−k} are non-zero in a cyclic code). Therefore,
G_{S\(a+j)_n} has full rank if C_j has full rank. For ⌊k/2⌋ < j < k, G_{S\(a+j)_n} can be partitioned in a similar manner and an analogous argument applies.
Proof of Claim 6
We need to argue that all k × k submatrices of G_{S_a}, where 0 ≤ a < α, are full rank. In what
follows we argue that all k × k submatrices of G_{S₀} are full rank; the proof for any G_{S_a} is similar.
Note that G_{S₀} can be written compactly as follows by using Kronecker products:
\[
G_{S_0} = \begin{bmatrix} \mathbf{A} \otimes I_t \\ \mathbf{B} \otimes C_{(t-1)\times t}(1,1) \end{bmatrix},
\]
where
\[
\mathbf{A} = \begin{bmatrix}
b_{00} & \cdots & b_{0(z-1)} \\
\vdots & \ddots & \vdots \\
b_{(z-2)0} & \cdots & b_{(z-2)(z-1)}
\end{bmatrix}.
\]
Next, we check the determinant of the submatrix $\mathbf{G}_{S_0 \setminus i}$ obtained by deleting the $i$-th column of $\mathbf{G}_{S_0}$. W.l.o.g., we let $i = (z-1)t + j$ where $0 \le j < t$. The block form of the resultant matrix $\mathbf{G}_{S_0 \setminus i}$ can be expressed as
$$\mathbf{G}_{S_0 \setminus i} = \begin{bmatrix} \mathbf{A}' \otimes \mathbf{I}_t & \mathbf{A}'' \otimes \Delta_1 \\ \mathbf{B}' \otimes \mathbf{C}_{(t-1)\times t}(1,1) & \mathbf{B}'' \otimes \Delta_2 \end{bmatrix},$$
where $\mathbf{A}'$ and $\mathbf{A}''$ are the first $z-1$ columns and the last column of $\mathbf{A}$, respectively. Likewise, $\mathbf{B}'$ and $\mathbf{B}''$ are the first $z-1$ components and the last component of $\mathbf{B}$. The matrices $\Delta_1$ and $\Delta_2$ are obtained by deleting the $j$-th column of $\mathbf{I}_t$ and $\mathbf{C}_{(t-1)\times t}(1,1)$, respectively. Then, using the Schur determinant
where (1) holds by the properties of the Kronecker product Horn and Johnson (1991) and (2) holds since $\mathbf{C}_{(t-1)\times t}(1,1)\Delta_1 = \Delta_2$. Next, note that $\det(\Delta_2) \neq 0$. This is because $\Delta_2$ can be written as
$$\Delta_2 = \begin{bmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0} & \mathbf{B} \end{bmatrix},$$
and
$$\det(\mathbf{B}'' - \mathbf{B}'\mathbf{A}'^{-1}\mathbf{A}'') = \frac{\det(\mathbf{F})}{\det(\mathbf{A}')} \neq 0,$$
since $\det(\mathbf{F})$ and $\det(\mathbf{A}')$ are both non-zero as their columns have the Vandermonde form. In $\mathbf{F}$, the columns correspond to distinct and non-zero elements from $GF(q)$; therefore, $q > z$. Note, however, that the above discussion focused only on $\mathbf{G}_{S_0}$. As the argument needs to apply for all $\mathbf{G}_{S_a}$
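The Schur determinant step used above can be checked numerically. The sketch below verifies the general identity $\det\begin{bmatrix}\mathbf{A} & \mathbf{B}\\ \mathbf{C} & \mathbf{D}\end{bmatrix} = \det(\mathbf{A})\det(\mathbf{D} - \mathbf{C}\mathbf{A}^{-1}\mathbf{B})$ on random matrices; the block sizes are illustrative and unrelated to the thesis parameters.

```python
import numpy as np

# Schur determinant identity: det([[A, B], [C, D]]) = det(A) det(D - C A^{-1} B),
# valid whenever A is invertible. Shifting A by 3I keeps it well conditioned.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)) + 3 * np.eye(3)
B = rng.standard_normal((3, 2))
C = rng.standard_normal((2, 3))
D = rng.standard_normal((2, 2))
M = np.block([[A, B], [C, D]])
lhs = np.linalg.det(M)
rhs = np.linalg.det(A) * np.linalg.det(D - C @ np.linalg.inv(A) @ B)
assert np.isclose(lhs, rhs)
```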
Proof of Claim 7
Note that the matrix in eq. (3.4) is the generator matrix of an $(n, k)$ linear block code over $GF(q)$, where $nz = (z+1)(k+1)$. Since $z$ and $z+1$ are coprime, $z$ is the least positive integer such that $k+1 \mid nz$. To show that $\mathbf{G}$ satisfies the CCP, we need to argue that all $k \times k$ submatrices of $\mathbf{G}_{S_a}$, where $0 \le a \le z$, are full rank. It is easy to check that $S_a = \{0, \cdots, n-1\} \setminus \{t(z-a), t(z-a)+1, \cdots, t(z-a)+t-1\}$. We verify three types of matrices $\mathbf{G}_{S_a}$ as follows: I. $a = 0$; II. $a = 1$; III. $a > 1$.
• Type I
When a = 0, it is easy to verify that any k × k submatrix of GS0 has full rank since GS0 has
the form [Ik×k |1k ], which is the generator matrix of the SPC code.
• Type II
$$\mathbf{G}_{S_1} = \begin{bmatrix} \mathbf{I}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & b_1 \mathbf{I}_{t\times t} \\ \mathbf{0}_{t\times t} & \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & b_2 \mathbf{I}_{t\times t} \\ \vdots & & \ddots & & \vdots \\ \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{I}_{t\times t} & b_{z-1} \mathbf{I}_{t\times t} \\ \mathbf{0}_{(t-1)\times t} & \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \end{bmatrix} \tag{A.3}$$
The first $(z-1)t$ columns correspond to Case 1 below and the last $t$ columns to Case 2.
Case 1: Suppose that we delete any of the first $(z-1)t$ columns in $\mathbf{G}_{S_1}$ (this set of columns is depicted by the underbrace in eq. (A.3)), say the $i$-th column of $\mathbf{G}_{S_1}$, where $z_1 t \le i < (z_1+1)t$. The block form of the resultant matrix can be expressed as follows.
$$\mathbf{G}_{S_1 \setminus i} = \begin{bmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{B} & \mathbf{D} \end{bmatrix},$$
where $\mathbf{A} = \mathbf{I}_{i\times i}$, $\mathbf{B} = \mathbf{0}_{(k-i)\times i}$, and
$$\mathbf{C} = \begin{bmatrix} \mathbf{0}_{t\times(k-t-i)} & b_1 \mathbf{I}_{t\times t} \\ \mathbf{0}_{t\times(k-t-i)} & b_2 \mathbf{I}_{t\times t} \\ \vdots & \vdots \\ \mathbf{0}_{t\times(k-t-i)} & b_{z_1} \mathbf{I}_{t\times t} \\ \mathbf{0}_{i_1\times(k-t-i)} & b_{z_1+1} \mathbf{I}_{i_1\times i_1} \;\; \mathbf{0}_{i_1\times(i_2+1)} \end{bmatrix}.$$
Note that if $z_1 = 0$, $\mathbf{C} = [\mathbf{0}_{i_1\times(k-t-i)} \;\; b_1 \mathbf{I}_{i_1\times i_1} \;\; \mathbf{0}_{i_1\times(i_2+1)}]$, and if $z_1 = z-2$,
$$\mathbf{D} = \begin{bmatrix} \mathbf{0}_{1\times i_2} & \mathbf{0}_{1\times i_1} \; b_{z-1} \; \mathbf{0}_{1\times i_2} \\ \mathbf{I}_{i_2\times i_2} & \mathbf{0}_{i_2\times(i_1+1)} \; b_{z-1}\mathbf{I}_{i_2\times i_2} \\ \mathbf{0}_{(t-1)\times i_2} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \end{bmatrix}.$$
$$\mathbf{D} = \begin{bmatrix} \mathbf{0}_{1\times i_2} & \mathbf{0}_{1\times t} & \cdots & \mathbf{0}_{1\times t} & \mathbf{0}_{1\times i_1} \; b_{z_1+1} \; \mathbf{0}_{1\times i_2} \\ \mathbf{I}_{i_2\times i_2} & \mathbf{0}_{i_2\times t} & \cdots & \mathbf{0}_{i_2\times t} & \mathbf{0}_{i_2\times(i_1+1)} \; b_{z_1+1}\mathbf{I}_{i_2\times i_2} \\ \vdots & & \ddots & & \vdots \\ \mathbf{0}_{t\times i_2} & \mathbf{0}_{t\times t} & \cdots & \mathbf{I}_{t\times t} & b_{z-1}\mathbf{I}_{t\times t} \\ \mathbf{0}_{(t-1)\times i_2} & \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \end{bmatrix} \tag{A.4}$$
To verify that $\mathbf{G}_{S_1 \setminus i}$ has full rank, we just need to check that $\mathbf{D}$ has full rank (as $\mathbf{A}$ is full rank). Checking that $\mathbf{D}$ has full rank can be further simplified as follows. As $b_{z_1+1} \neq 0$, we can move the corresponding column that has $b_{z_1+1}$ as its first entry so that it is the first column of $\mathbf{D}$. Following this, consider $\mathbf{C}(c_1, c_2)_{(t-1)\times t} \setminus c_{i_1}$, which is obtained by deleting the $i_1$-th column of $\mathbf{C}(c_1, c_2)_{(t-1)\times t}$:
$$\mathbf{C}(c_1, c_2)_{(t-1)\times t} \setminus c_{i_1} = \begin{bmatrix} \mathbf{D}_1 & \mathbf{D}_2 \\ \mathbf{D}_3 & \mathbf{D}_4 \end{bmatrix},$$
where $\mathbf{D}_4$ is an $i_2 \times i_2$ matrix given by
$$\mathbf{D}_4 = \begin{bmatrix} c_2 & 0 & \cdots & 0 & 0 \\ c_1 & c_2 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & c_1 & c_2 & 0 \\ 0 & \cdots & 0 & c_1 & c_2 \end{bmatrix},$$
and $\mathbf{D}_2$ and $\mathbf{D}_3$ are $i_1 \times i_2$ and $i_2 \times i_1$ all-zero matrices, respectively. Then $\det(\mathbf{C}(c_1, c_2)_{(t-1)\times t} \setminus$
$$\mathbf{G}_{S_a} = \begin{bmatrix}
\mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_1\mathbf{I}_{t\times t} \\
\vdots & \ddots & \vdots & \vdots & & \vdots & \vdots & \vdots & \vdots \\
\mathbf{0}_{t\times t} & \cdots & \mathbf{I}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a}\mathbf{I}_{t\times t} \\
\mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+1}\mathbf{I}_{t\times t} \\
\mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+2}\mathbf{I}_{t\times t} \\
\vdots & & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
\mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{I}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-1}\mathbf{I}_{t\times t} \\
\mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{I}_{(t-1)\times(t-1)} & \mathbf{1}_{t-1} & \mathbf{C}(c_1, c_2)_{(t-1)\times t}
\end{bmatrix} \tag{A.5}$$
The five underbraced column groups in the original display correspond to Cases 1-5 analyzed below.
Case 2: By deleting any of the last $t$ columns in $\mathbf{G}_{S_1}$, say the $i$-th column of $\mathbf{G}_{S_1}$, where $(z-1)t \le i < zt$, the block form of the resultant matrix $\mathbf{G}_{S_1 \setminus i}$ can be expressed as follows.
$$\mathbf{G}_{S_1 \setminus i} = \begin{bmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{B} & \mathbf{D} \end{bmatrix},$$
column of the matrix $[b_1 \mathbf{I}_{t\times t} \; b_2 \mathbf{I}_{t\times t} \cdots b_{z-1}\mathbf{I}_{t\times t}]^T$, and $\mathbf{D}$ is $\mathbf{C}(c_1, c_2)_{(t-1)\times t} \setminus c_{i-(z-1)t}$. Since
$$\det(\mathbf{D}) = c_1^{i-(z-1)t} c_2^{zt-i-1} \neq 0, \qquad \det(\mathbf{G}_{S_1 \setminus i}) = \pm c_1^{i-(z-1)t} c_2^{zt-i-1} \neq 0,$$
and therefore $\mathbf{G}_{S_1 \setminus i}$ has full rank.
• Type III
When $a > 1$, $\mathbf{G}_{S_a}$ has the form in eq. (A.5). As before, we perform a case analysis.
Case 1: By deleting the $i$-th column of $\mathbf{G}_{S_a}$, where $z_1 t \le i < (z_1+1)t$, $z_1 \le z-a-1$, $i_1 = i - z_1 t$, and $i_2 = (z_1+1)t - i - 1$, the block form of the resultant matrix $\mathbf{G}_{S_a \setminus i}$ can be expressed as follows,
$$\mathbf{G}_{S_a \setminus i} = \begin{bmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{B} & \mathbf{D} \end{bmatrix},$$
$$\mathbf{C} = \begin{bmatrix} \mathbf{0}_{t\times i_2} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_1 \mathbf{I}_{t\times t} \\ \vdots & & & \vdots & \vdots & \vdots \\ \mathbf{0}_{t\times i_2} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z_1} \mathbf{I}_{t\times t} \\ \mathbf{0}_{i_1\times i_2} & \mathbf{0}_{i_1\times t} & \cdots & \mathbf{0}_{i_1\times(t-1)} & \mathbf{1}_{i_1} & b_{z_1+1} \mathbf{I}_{i_1\times i_1} \;\; \mathbf{0}_{i_1\times(i_2+1)} \end{bmatrix} \tag{A.6}$$
$$\mathbf{D} = \begin{bmatrix} \mathbf{0}_{1\times i_2} & \mathbf{0}_{1\times t} & \cdots & \cdots & \mathbf{0}_{1\times(t-1)} & 1 & \mathbf{0}_{1\times i_1} \; b_{z_1+1} \; \mathbf{0}_{1\times i_2} \\ \mathbf{I}_{i_2\times i_2} & \mathbf{0}_{i_2\times t} & \cdots & \cdots & \mathbf{0}_{i_2\times(t-1)} & \mathbf{1}_{i_2} & \mathbf{0}_{i_2\times(i_1+1)} \; b_{z_1+1}\mathbf{I}_{i_2\times i_2} \\ \mathbf{0}_{t\times i_2} & \mathbf{I}_{t\times t} & \cdots & \cdots & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z_1+2}\mathbf{I}_{t\times t} \\ \vdots & & & & \vdots & \vdots & \vdots \\ \mathbf{0}_{t\times i_2} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+1}\mathbf{I}_{t\times t} \\ \mathbf{0}_{t\times i_2} & \mathbf{0}_{t\times t} & \cdots & \mathbf{I}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+2}\mathbf{I}_{t\times t} \\ \vdots & & & & \vdots & \vdots & \vdots \\ \mathbf{0}_{(t-1)\times i_2} & \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{I}_{(t-1)\times(t-1)} & \mathbf{1}_{t-1} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \end{bmatrix} \tag{A.7}$$
where $\mathbf{A} = \mathbf{I}_{i\times i}$, $\mathbf{B} = \mathbf{0}_{(k-i)\times i}$, and $\mathbf{C}$ and $\mathbf{D}$ have the forms in eq. (A.6) and eq. (A.7), respectively. Note that if $z_1 = 0$, $\mathbf{C} = [\mathbf{0}_{i_1\times i_2} \;\; \mathbf{0}_{i_1\times t} \cdots \mathbf{0}_{i_1\times(t-1)} \;\; \mathbf{1}_{i_1} \;\; b_1\mathbf{I}_{i_1\times i_1} \;\; \mathbf{0}_{i_1\times(i_2+1)}]$ and if
$$\mathbf{D} = \begin{bmatrix} \mathbf{0}_{1\times i_2} & \mathbf{0}_{1\times t} & \cdots & \cdots & \mathbf{0}_{1\times(t-1)} & 1 & \mathbf{0}_{1\times i_1} \; b_{z-a} \; \mathbf{0}_{1\times i_2} \\ \mathbf{I}_{i_2\times i_2} & \mathbf{0}_{i_2\times t} & \cdots & \cdots & \mathbf{0}_{i_2\times(t-1)} & \mathbf{1}_{i_2} & \mathbf{0}_{i_2\times(i_1+1)} \; b_{z-a}\mathbf{I}_{i_2\times i_2} \\ \mathbf{0}_{t\times i_2} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+1}\mathbf{I}_{t\times t} \\ \mathbf{0}_{t\times i_2} & \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+2}\mathbf{I}_{t\times t} \\ \vdots & & & & \vdots & \vdots & \vdots \\ \mathbf{0}_{(t-1)\times i_2} & \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{I}_{(t-1)\times(t-1)} & \mathbf{1}_{t-1} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \end{bmatrix} \tag{A.8}$$
To verify that $\mathbf{G}_{S_a \setminus i}$ has full rank, we just need to check that $\mathbf{D}$ has full rank. Owing to the
Case 2: By deleting the $i$-th column of $\mathbf{G}_{S_a}$, where $z_1 t \le i < (z_1+1)t$ and $z-a \le z_1 \le z-3$, the proof that the resultant matrix has full rank is similar to the case $z_1 \le z-a-1$ and we omit it here.
$$\mathbf{A} = \mathbf{I}_{(z-a)t\times(z-a)t}, \qquad \mathbf{B} = \mathbf{0}_{(k-(z-a)t)\times(z-a)t},$$
$$\mathbf{C} = \begin{bmatrix} \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-2)} & \mathbf{1}_t & b_1\mathbf{I}_{t\times t} \\ \vdots & & \vdots & \vdots & \vdots \\ \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-2)} & \mathbf{1}_t & b_{z-a}\mathbf{I}_{t\times t} \end{bmatrix},$$
$$\mathbf{D} = \begin{bmatrix} \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-2)} & \mathbf{1}_t & b_{z-a+1}\mathbf{I}_{t\times t} \\ \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-2)} & \mathbf{1}_t & b_{z-a+2}\mathbf{I}_{t\times t} \\ \vdots & & \vdots & \vdots & \vdots \\ \mathbf{0}_{(t-1)\times t} & \cdots & \begin{bmatrix} \mathbf{I}_{i_1\times i_1} & \mathbf{0}_{i_1\times i_2} \\ \mathbf{0}_{1\times i_1} & \mathbf{0}_{1\times i_2} \\ \mathbf{0}_{i_2\times i_1} & \mathbf{I}_{i_2\times i_2} \end{bmatrix} & \mathbf{1}_{t-1} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \end{bmatrix}.$$
$$\mathbf{G}_{S_a \setminus i} = \begin{bmatrix} \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & b_1\mathbf{I}_{t\times t} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots \\ \mathbf{0}_{t\times t} & \cdots & \mathbf{I}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & b_{z-a}\mathbf{I}_{t\times t} \\ \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & b_{z-a+1}\mathbf{I}_{t\times t} \\ \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & b_{z-a+2}\mathbf{I}_{t\times t} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots \\ \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{I}_{(t-1)\times(t-1)} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \end{bmatrix} \tag{A.9}$$
To verify that $\mathbf{G}_{S_a \setminus i}$ has full rank, we need to check the determinant of $\mathbf{D}$. Owing to the construction,
where $\mathbf{C}(c_1, c_2)_{(t-1)\times t}(i_1)$ denotes the $i_1$-th row of $\mathbf{C}(c_1, c_2)_{(t-1)\times t}$, $0 \le i \le t-2$,
$$\det \mathbf{D}' = b_{z-a+1}^t \left(1 - b_{z-a+1}^{-1}(c_1 + c_2)\right).$$
Since $b_{z-a+1} \neq 0$ and $c_1 + c_2 = 0$, $\det \mathbf{D}' \neq 0$ and $\mathbf{D}'$ has full rank. Then $\det(\mathbf{D}) = b_{z-a+1}^t \left(1 - b_{z-a+1}^{-1}(c_1 + c_2)\right) \neq 0$ and thus $\mathbf{G}_{S_a \setminus i}$ is full rank.
Case 4: By deleting the $i$-th column of $\mathbf{G}_{S_a}$, where $i = (z-1)t - 1$, the block form of the resultant matrix $\mathbf{G}_{S_a \setminus i}$ can be expressed as eq. (A.9). Evidently, $\det(\mathbf{G}_{S_a \setminus i}) = \pm b_{z-a+1}^t$, so that
Case 5: By deleting the $i$-th column of $\mathbf{G}_{S_a}$, where $(z-1)t \le i < zt$ and $i_1 = i - (z-1)t$, the block form of the resultant matrix $\mathbf{G}_{S_a \setminus i}$ can be expressed as eq. (A.10), where $b_s \mathbf{I}_{t\times t} \setminus c_{i_1}$ denotes the submatrix obtained by deleting the $i_1$-th column of $b_s \mathbf{I}_{t\times t}$ and $\mathbf{C}(c_1, c_2)_{(t-1)\times t} \setminus c_{i_1}$ denotes the submatrix obtained by deleting the $i_1$-th column of $\mathbf{C}(c_1, c_2)_{(t-1)\times t} \setminus$
$$\mathbf{G}_{S_a \setminus i} = \begin{bmatrix} \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_1\mathbf{I}_{t\times t} \setminus c_{i_1} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & \vdots \\ \mathbf{0}_{t\times t} & \cdots & \mathbf{I}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a}\mathbf{I}_{t\times t} \setminus c_{i_1} \\ \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+1}\mathbf{I}_{t\times t} \setminus c_{i_1} \\ \mathbf{0}_{t\times t} & \cdots & \mathbf{0}_{t\times t} & \mathbf{I}_{t\times t} & \cdots & \mathbf{0}_{t\times(t-1)} & \mathbf{1}_t & b_{z-a+2}\mathbf{I}_{t\times t} \setminus c_{i_1} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & \vdots \\ \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{0}_{(t-1)\times t} & \mathbf{0}_{(t-1)\times t} & \cdots & \mathbf{I}_{(t-1)\times(t-1)} & \mathbf{1}_{t-1} & \mathbf{C}(c_1, c_2)_{(t-1)\times t} \setminus c_{i_1} \end{bmatrix} \tag{A.10}$$
$c_{i_1}$. To verify that $\mathbf{G}_{S_a \setminus i}$ has full rank, we just need to check that $[\mathbf{1}_t \,|\, b_{z-a+1}\mathbf{I}_{t\times t} \setminus c_{i_1}]$ has full rank. By deleting any column of the above matrix, it is obvious that $\det([\mathbf{1}_t \,|\, b_{z-a+1}\mathbf{I}_{t\times t} \setminus c_{i_1}]) = \pm b_{z-a+1}^{t-1}$ and $\det(\mathbf{G}_{S_a \setminus i}) \neq 0$.
Proof of Claim 8
Let $z$ be the least integer such that $k+1 \mid nz$. First, we argue that $z$ is the least integer such that $k+1 \mid (n + s(k+1))z$. Assume that this is not true; then there exists $z' < z$ such that
contradiction.
Next, we argue that $\mathbf{G}'$ satisfies the CCP, i.e., all $k \times k$ submatrices of each $\mathbf{G}'_{S'_a}$, where $T'_a = \{a(k+1), \cdots, a(k+1)+k\}$, $S'_a = \{(t)_{n+s(k+1)} \mid t \in T'_a\}$, and $0 \le a \le \frac{nz}{k+1} + sz$, are full rank. Let
• Case 1. The first column of $\mathbf{G}'_{S'_a}$ lies in the first $s(k+1)$ columns of $\mathbf{G}'$.
Suppose $ln' \le a(k+1) < ln' + s(k+1)$ where $0 \le l < z-1$. By the construction of $\mathbf{G}'$, $\mathbf{G}'_{S'_a} = \mathbf{D}$. Since $\mathbf{D} = \mathbf{G}_{S_0}$, all $k \times k$ submatrices of $\mathbf{D}$ have full rank and so does $\mathbf{G}'_{S'_a}$.
• Case 2. The first and last columns of $\mathbf{G}'_{S'_a}$ lie in the last $n$ columns of $\mathbf{G}'$.
Suppose $ln' + s(k+1) \le a(k+1)$ and $a(k+1) + k < (l+1)n'$ where $0 \le l < z-1$. As $n' > s(k+1)$, $a(k+1) - (l+1)s(k+1) > 0$ and $k+1 \mid a(k+1) - (l+1)s(k+1)$. Let $a' = a - (l+1)s$; then $\mathbf{G}'_{S'_a} = \mathbf{G}_{S_{a'}}$ and hence all $k \times k$ submatrices of $\mathbf{G}'_{S'_a}$ have full rank.
• Case 3. The first column of $\mathbf{G}'_{S'_a}$ lies in the last $n$ columns of $\mathbf{G}'$ but the last column lies in
Suppose $ln' + s(k+1) \le a(k+1)$ and $a(k+1) + k > (l+1)n'$ where $0 \le l < z-2$. Let $S'^1 = \{(a(k+1))_{n'}, \cdots, (ln' + n' - 1)_{n'}\}$ and $S^1 = \{(a'(k+1))_n, \cdots, (ln + n - 1)_n\}$. As $(ln' + n' -$
$\mathbf{G}'_{S'_a} = [\mathbf{G}'_{S'^1} \; \mathbf{G}'_{S'^2}] = [\mathbf{G}_{S^1} \; \mathbf{G}_{S^2}] = \mathbf{G}_{S_{a'}}$ and hence all $k \times k$ submatrices of $\mathbf{G}'_{S'_a}$ have full rank.
Proof of Claim 11
• $\frac{M}{N} = \frac{1}{q}$. We have
$$\frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = \frac{1}{K}\log_2 \binom{K}{K/q} - \frac{1}{K}\log_2 z - \frac{k}{K}\log_2 q.$$
$$\lim_{n\to\infty} \frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = H_2\!\left(\frac{1}{q}\right) - \frac{\eta}{q}\log_2 q.$$
• $\frac{M}{N} = 1 - \frac{k+1}{nq}$. We have
$$\frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = \frac{1}{K}\log_2 \binom{K}{k+1} - \frac{k+1}{K}\log_2 q - \frac{1}{K}\log_2 \frac{zn}{k+1}.$$
$$\lim_{n\to\infty} \frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = H_2\!\left(\frac{\eta}{q}\right) - \frac{\eta}{q}\log_2 q.$$
Consider the $(k, \alpha)$-CCP (cf. Definition 6) where $\alpha \le k$. Let $z$ be the least integer such that $\alpha \mid nz$, and let $T_a^\alpha = \{a\alpha, \cdots, a\alpha + \alpha - 1\}$ and $S_a^\alpha = \{(t)_n \mid t \in T_a^\alpha\}$. Let $\mathbf{G}_{S_a^\alpha} = [\mathbf{g}_{i_0}, \cdots, \mathbf{g}_{i_{\alpha-1}}]$ be the submatrix of $\mathbf{G}$ specified by the columns in $S_a^\alpha$, i.e., $\mathbf{g}_{i_j} \in \mathbf{G}_{S_a^\alpha}$ if $i_j \in S_a^\alpha$. We demonstrate that the resolvable design generated from a linear block code that satisfies the $(k, \alpha)$-CCP can also be used in a coded caching scheme. First, we construct a resolvable design $(X, \mathcal{A})$ as described in Section 3.2.A., which can be partitioned into $n$ parallel classes $P_i = \{B_{i,j} : 0 \le j < q\}$, $0 \le i < n$. Using the constructed resolvable design, we partition each file $W_n$ into $q^k z$ subfiles $W_n = \{W_{n,t}^s \mid 0 \le t < q^k,\; 0 \le s < z\}$ and operate the placement scheme in Algorithm 2. In the delivery phase, for each recovery set, several equations are generated, each of which benefits $\alpha$ users simultaneously. Furthermore, the equations generated by all the recovery sets can recover all the missing subfiles. In this section, we only show that for the recovery set $\mathcal{P}_{S_a^\alpha}$, it is possible to generate equations which benefit $\alpha$ users and allow the recovery of all missing subfiles with a given superscript. The subsequent discussion exactly mirrors the discussion in the $(k, k+1)$-CCP case and is skipped.
Towards this end, we first show that picking $\alpha$ users from $\alpha$ distinct parallel classes can always form $q^{k-\alpha+1} - q^{k-\alpha}$ signals. More specifically, consider blocks $B_{i_1,l_{i_1}}, \ldots, B_{i_\alpha,l_{i_\alpha}}$ (where $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from $\alpha$ distinct parallel classes of $\mathcal{P}_{S_a^\alpha}$. Then, $|\cap_{j=1}^{\alpha-1} B_{i_j,l_{i_j}}| = q^{k-\alpha+1}$.
Claim 13. Consider the resolvable design $(X, \mathcal{A})$ constructed by an $(n, k)$ linear block code that satisfies the $(k, \alpha)$-CCP. Let $\mathcal{P}_{S_a^\alpha} = \{P_i \mid i \in S_a^\alpha\}$ for $0 \le a < \frac{zn}{\alpha}$, i.e., it is the set of parallel classes corresponding to $S_a^\alpha$. We emphasize that $|\mathcal{P}_{S_a^\alpha}| = \alpha \le k$. Consider blocks $B_{i_1,l_{i_1}}, \ldots, B_{i_{\alpha'},l_{i_{\alpha'}}}$ (where $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from any $\alpha'$ distinct parallel classes of $\mathcal{P}_{S_a^\alpha}$ where $\alpha' \le \alpha$. Then, $|\cap_{j=1}^{\alpha'} B_{i_j,l_{i_j}}| = q^{k-\alpha'}$.
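The counting in Claim 13 can be checked exhaustively for a small code. The sketch below uses an arbitrary illustrative $(4, 2)$ code over $GF(3)$ (not a code from the thesis); a block $B_{i,l}$ collects the messages whose codeword has value $l$ in position $i$, and blocks from parallel classes whose columns are linearly independent intersect in $q^{k-\alpha'}$ points.

```python
import numpy as np
from itertools import product

q, k = 3, 2
G = np.array([[1, 0, 1, 1],
              [0, 1, 1, 2]])       # illustrative (4, 2) code over GF(3)
messages = list(product(range(q), repeat=k))
codewords = {u: tuple(np.array(u) @ G % q) for u in messages}

def block(i, l):
    """Points (messages) whose codeword has value l in position i."""
    return {u for u, c in codewords.items() if c[i] == l}

# Columns 2 and 3 of G are linearly independent over GF(3), so blocks from
# parallel classes 2 and 3 intersect in q^(k - 2) = 1 point.
for l2, l3 in product(range(q), repeat=2):
    assert len(block(2, l2) & block(3, l3)) == q ** (k - 2)
# Each single block has q^(k - 1) points, matching |B_{i,l}| = q^(k-1).
assert all(len(block(i, l)) == q ** (k - 1) for i in range(4) for l in range(q))
```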
The above argument implies that any $\alpha-1$ blocks from any $\alpha-1$ distinct parallel classes of $\mathcal{P}_{S_a^\alpha}$ have $q^{k-\alpha+1}$ points in common, and any $\alpha$ blocks $B_{i_1,l_{i_1}}, \ldots, B_{i_\alpha,l_{i_\alpha}}$ from any $\alpha$ distinct parallel classes of $\mathcal{P}_{S_a^\alpha}$ have $q^{k-\alpha}$ points in common. These blocks (or users) can participate in $q^{k-\alpha+1} - q^{k-\alpha}$ equations, each of which benefits $\alpha$ users. In particular, each user will recover a missing subfile indexed by an element belonging to the intersection of the other $\alpha-1$ blocks in each equation. A very similar argument to Lemma 2 can be made to justify that enough equations can be found that
Proof. Recall that by the construction in Section III.A, block $B_{i,l} \in P_i$ is specified as follows.
Now consider $B_{i_1,l_{i_1}}, \ldots, B_{i_{\alpha'},l_{i_{\alpha'}}}$ (where $i_j \in S_a^\alpha$, $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from $\alpha'$ distinct parallel classes of $\mathcal{P}_{S_a^\alpha}$. W.l.o.g. we assume that $i_1 < i_2 < \cdots < i_{\alpha'}$. Let $I = \{i_1, \ldots, i_{\alpha'}\}$ and let $\mathbf{T}_I$ denote the submatrix of $\mathbf{T}$ obtained by retaining the rows in $I$. We will show that the vector $[l_{i_1}\, l_{i_2}\, \ldots\, l_{i_{\alpha'}}]^T$ is a column in $\mathbf{T}_I$ and appears $q^{k-\alpha'}$ times in it.
We note here that by the $(k, \alpha)$-CCP, the vectors $\mathbf{g}_{i_1}, \mathbf{g}_{i_2}, \ldots, \mathbf{g}_{i_\alpha}$ are linearly independent, and thus the subset of these vectors $\mathbf{g}_{i_1}, \cdots, \mathbf{g}_{i_{\alpha'}}$ are linearly independent. W.l.o.g., we assume that the top $\alpha' \times \alpha'$ submatrix of the matrix $[\mathbf{g}_{i_1}\, \mathbf{g}_{i_2}\, \ldots\, \mathbf{g}_{i_{\alpha'}}]$ is full rank. Next, consider the system of
By the assumed condition, it is evident that this system of $\alpha'$ equations in $\alpha'$ variables has a unique solution for a given vector $\mathbf{v} = [u_{\alpha'}, \cdots, u_{k-1}]$ over $GF(q)$. Since there are $q^{k-\alpha'}$ possible $\mathbf{v}$ vectors,
As in the case of the $(k, k+1)$-CCP, we form a recovery set bipartite graph with parallel classes and recovery sets as the disjoint vertex subsets, and the edges incident on each parallel class are labeled arbitrarily from $0$ to $z-1$. For a parallel class $P \in \mathcal{P}_{S_a^\alpha}$ we denote this label by $\mathrm{label}(P - \mathcal{P}_{S_a^\alpha})$. For a given recovery set $\mathcal{P}_{S_a^\alpha}$, the delivery phase proceeds by choosing blocks from $\alpha$ distinct parallel classes in $\mathcal{P}_{S_a^\alpha}$, and it provides $q^{k-\alpha+1} - q^{k-\alpha}$ equations that benefit $\alpha$ users. Note that in the $(k, \alpha)$-CCP case, randomly picking $\alpha$ blocks from $\alpha$ parallel classes in $\mathcal{P}_{S_a^\alpha}$ will always result in an intersection of size $q^{k-\alpha}$, which is different from the $(k, k+1)$-CCP. It turns out that each equation allows a user in $P \in \mathcal{P}_{S_a^\alpha}$ to recover a missing subfile with superscript $\mathrm{label}(P - \mathcal{P}_{S_a^\alpha})$.
Let the demand of user $U_{B_{i,j}}$ for $0 \le i \le n-1$, $0 \le j \le q-1$ be $W_{\kappa_{i,j}}$. We formalize the argument in Algorithm 6 and prove that the equations generated in each recovery set $\mathcal{P}_{S_a^\alpha}$ can recover all missing
For the sake of convenience, we argue that user $U_{B_{\beta,l_\beta}}$ that demands $W_{\kappa_{\beta,l_\beta}}$ can recover all its missing subfiles with superscript $E(P_\beta)$. Note that $|B_{\beta,l_\beta}| = q^{k-1}$. Thus, user $U_{B_{\beta,l_\beta}}$ needs to obtain $q^k - q^{k-1}$ missing subfiles with superscript $E(P_\beta)$. The delivery phase scheme repeatedly picks $\alpha$ users from different parallel classes of $\mathcal{P}_{S_a^\alpha}$. The equations in Algorithm 6 allow $U_{B_{\beta,l_\beta}}$ to recover all $W^{E(P_\beta)}_{\kappa_{\beta,l_\beta}, \hat{L}_\beta[t]}$, where $\hat{L}_\beta = \cap_{j \in S_a^\alpha \setminus \{\beta\}} B_{j,l_j} \setminus \cap_{j \in S_a^\alpha} B_{j,l_j}$ and $t = 1, \cdots, q^{k-\alpha+1} - q^{k-\alpha}$. This is because of Claim 13.
Next, we count the number of equations that $U_{B_{\beta,l_\beta}}$ participates in. We can pick $\alpha-1$ users from $\alpha-1$ parallel classes in $\mathcal{P}_{S_a^\alpha}$. There are in total $q^{\alpha-1}$ ways to pick them, each of which generates $q^{k-\alpha+1} - q^{k-\alpha}$ equations. Thus, there are a total of $q^k - q^{k-1}$ equations in which user
It remains to argue that each equation provides a distinct file part of user $U_{B_{\beta,l_\beta}}$. Towards this end, when we pick the same set of blocks $\{B_{i_1,l_{i_1}}, \cdots, B_{i_{\alpha-1},l_{i_{\alpha-1}}}\}$, it is impossible that the recovered subfiles $W^{E(P_\beta)}_{\kappa_{\beta,l_\beta}, \hat{L}_\beta[t_1]}$ and $W^{E(P_\beta)}_{\kappa_{\beta,l_\beta}, \hat{L}_\beta[t_2]}$ are the same, since the points in $\hat{L}_\beta$ are distinct. Next, suppose that there exist sets of blocks $\{B_{i_1,l_{i_1}}, \cdots, B_{i_{\alpha-1},l_{i_{\alpha-1}}}\}$ and $\{B_{i_1,l'_{i_1}}, \cdots, B_{i_{\alpha-1},l'_{i_{\alpha-1}}}\}$ such that
$B_{\beta,l'_\beta}$. This is a contradiction since this in turn implies that $\gamma \in \cap_{j=2}^\alpha B_{i_j,l_{i_j}} \cap \cap_{j=2}^\alpha B_{i_j,l'_{i_j}}$, which is impossible since two blocks from the same parallel class have an empty intersection.
Finally, we calculate the transmission rate. In Algorithm 6, for each recovery set, we transmit $q^{k+1} - q^k$ equations and there are in total $\frac{zn}{\alpha}$ recovery sets. Since each equation has size equal to a
of this system is a little higher compared to the $(k, k+1)$-CCP system with almost the same subpacketization level.
However, by comparing Definitions 5 and 6, it is evident that the rank constraints of the $(k, \alpha)$-CCP are weaker than those of the $(k, k+1)$-CCP. Therefore, in general we can find more instances of generator matrices that satisfy the $(k, \alpha)$-CCP. For example, a large class of codes that satisfy the $(k, k)$-CCP are $(n, k)$ cyclic codes, since any $k$ consecutive columns in their generator matrices are linearly independent Lin and Costello (2004). Thus, $(n, k)$ cyclic codes always satisfy the $(k, k)$-CCP but satisfy the $(k, k+1)$-CCP only if they satisfy the additional constraints discussed in Claim 3.
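The "any $k$ consecutive columns" property invoked above is easy to verify computationally for a concrete cyclic code. The sketch below (illustrative; the thesis does not prescribe this particular code) checks it for the $(7, 4)$ binary cyclic Hamming code with generator polynomial $g(x) = 1 + x + x^3$, using a small Gaussian elimination over $GF(2)$.

```python
import numpy as np

def rank_gf2(M: np.ndarray) -> int:
    """Rank over GF(2) via Gaussian elimination on a 0/1 integer matrix."""
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]     # row reduction mod 2
        rank += 1
    return rank

n, k = 7, 4
g = np.array([1, 1, 0, 1])          # coefficients of g(x) = 1 + x + x^3
G = np.zeros((k, n), dtype=int)
for i in range(k):                  # rows of G are cyclic shifts of g(x)
    G[i, i:i + len(g)] = g
for j in range(n - k + 1):          # every window of k consecutive columns
    assert rank_gf2(G[:, j:j + k]) == k
```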
First, we show that the matrix $\mathbf{T}$ constructed by the approach outlined in Section 3.3.4 still results in a resolvable design. Let $\Delta = [\Delta_0\, \Delta_1 \cdots \Delta_{n-1}]$ be a codeword of the cyclic code. By the Chinese remaindering map $\psi$ (discussed in Section 3.3.4), $\Delta$ can be uniquely mapped into $d$ codewords $c^{(i)}$, $i = 1, \ldots, d$, where each $c^{(i)}$ is a codeword of $\mathcal{C}_i$ (the cyclic code over $GF(q_i)$). Thus, the $b$-th component $\Delta_b$ can be mapped to $(c_b^{(1)}, c_b^{(2)}, \ldots, c_b^{(d)})$.
Let $\mathbf{G}_i = [g_{ab}^{(i)}]$ represent the generator matrix of the code $\mathcal{C}_i$. Based on prior arguments, it is evident that there are $q_i^{k_i - 1}$ distinct solutions over $GF(q_i)$ to the equation $\sum_{a=0}^{k_i-1} u_a g_{ab}^{(i)} = c_b^{(i)}$. In turn, this implies that $\Delta_b$ appears $q_1^{k_1-1} q_2^{k_2-1} \cdots q_d^{k_d-1}$ times in the $b$-th row of $\mathbf{T}$ and the result follows.
Next, we show that any $\alpha$ blocks from distinct parallel classes of $\mathcal{P}_{S_a^{k_{\min}}}$ have $q_1^{k_1-\alpha} q_2^{k_2-\alpha} \cdots q_d^{k_d-\alpha}$ points in common, where $\alpha \le k_{\min}$ and $S_a^{k_{\min}} = \{(ak_{\min})_n, (ak_{\min}+1)_n, \cdots, (ak_{\min}+k_{\min}-1)_n\}$. Towards this end, consider $B_{i_1,l_{i_1}}, \ldots, B_{i_\alpha,l_{i_\alpha}}$ (where $i_j \in S_a^{k_{\min}}$, $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from $\alpha$ distinct parallel classes of $\mathcal{P}_{S_a^{k_{\min}}}$. W.l.o.g. we assume that $i_1 < i_2 < \cdots < i_\alpha$. Let $I = \{i_1, \ldots, i_\alpha\}$ and let $\mathbf{T}_I$ denote the submatrix of $\mathbf{T}$ obtained by retaining the rows in $I$. We will show that the vector $[l_{i_1}\, l_{i_2}\, \ldots\, l_{i_\alpha}]^T$ is a column in $\mathbf{T}_I$ and appears $q_1^{k_1-\alpha} q_2^{k_2-\alpha} \cdots q_d^{k_d-\alpha}$ times.
Let $\psi_m(l_{i_j})$ for $m = 1, \ldots, d$ represent the $m$-th component of the map $\psi$. Consider the $(n, k_1)$ cyclic code over $GF(q_1)$ and the system of equations in the variables $u_0, \ldots, u_{\alpha-1}$ that lie in $GF(q_1)$:
$$\sum_{b=0}^{\alpha-1} u_b g_{bi_1}^{(1)} = \psi_1(l_{i_1}) - \sum_{b=\alpha}^{k_1-1} u_b g_{bi_1}^{(1)},$$
$$\sum_{b=0}^{\alpha-1} u_b g_{bi_2}^{(1)} = \psi_1(l_{i_2}) - \sum_{b=\alpha}^{k_1-1} u_b g_{bi_2}^{(1)},$$
$$\vdots$$
$$\sum_{b=0}^{\alpha-1} u_b g_{bi_\alpha}^{(1)} = \psi_1(l_{i_\alpha}) - \sum_{b=\alpha}^{k_1-1} u_b g_{bi_\alpha}^{(1)}.$$
By arguments identical to those made in Claim 13, it can be seen that this system of equations has $q_1^{k_1-\alpha}$ solutions. Applying the same argument to the other cyclic codes, we conclude that the vector $[l_{i_1}, l_{i_2}, \cdots, l_{i_\alpha}]$ appears $q_1^{k_1-\alpha} q_2^{k_2-\alpha} \cdots q_d^{k_d-\alpha}$ times in $\mathbf{T}_I$ and the result follows.
Example 20. For $m = 4$, the four possible circulant permutation matrices are
$$\mathbf{P} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}, \quad \mathbf{P}^0 = \mathbf{I}_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
$$\mathbf{P}^2 = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \quad \mathbf{P}^3 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$
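The matrices of Example 20 can be generated as cyclic column shifts of the identity; the sketch below confirms $\mathbf{P}^s$ for $m = 4$ and the wrap-around property $\mathbf{P}^m = \mathbf{I}$.

```python
import numpy as np

def cyclic_perm_power(m: int, s: int) -> np.ndarray:
    """P^s for the m x m circulant permutation matrix P of Example 20."""
    return np.roll(np.eye(m, dtype=int), s % m, axis=1)

P = cyclic_perm_power(4, 1)
assert np.array_equal(P[0], np.array([0, 1, 0, 0]))            # first row of P
assert np.array_equal(P @ P, cyclic_perm_power(4, 2))          # P^2
assert np.array_equal(np.linalg.matrix_power(P, 4), np.eye(4, dtype=int))
```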
Proof of Claim 12
Proof. Note that Algorithm 5 is applied for recovering the corresponding entries of $\mathbf{A}_{i,j}^T \mathbf{x}$ for $i \in [k_A]$, $j \in [\tilde{q}]$ separately. There are $r/(k_A(\tilde{q}-1))$ such entries. The complexity of computing an $N$-point FFT is $O(N \log N)$ in terms of the required floating point operations (flops). Computing the permutation does not cost any flops and its complexity is negligible as compared to the other steps. Step 1 of Algorithm 5 therefore has complexity $O(k_A \tilde{q} \log \tilde{q})$. In Step 2, we solve the degree-$(k_A-1)$ polynomial interpolation $(\tilde{q}-1)$ times. This takes $O((\tilde{q}-1) k_A \log^2 k_A)$ time Pan (2013). Finally, Step 3 requires applying the inverse permutation and the inverse FFT; this requires $O(k_A \tilde{q} \log \tilde{q})$ flops as well. The total complexity is therefore
$$\frac{r}{k_A(\tilde{q}-1)} \left( O(k_A \tilde{q} \log \tilde{q}) + O((\tilde{q}-1) k_A \log^2 k_A) \right).$$
Proof of Theorem 5
Proof. The arguments are conceptually similar to the proof of Theorem 4. Suppose that the workers indexed by $i_0, \ldots, i_{k_A-1}$ complete their tasks. The corresponding block columns of $\mathbf{G}_{circ}$ can be extracted to form
$$\tilde{\mathbf{G}} = \begin{bmatrix} \mathbf{I} & \mathbf{I} & \cdots & \mathbf{I} \\ \mathbf{P}^{i_0} & \mathbf{P}^{i_1} & \cdots & \mathbf{P}^{i_{k_A-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{P}^{i_0(k_A-1)} & \mathbf{P}^{i_1(k_A-1)} & \cdots & \mathbf{P}^{i_{k_A-1}(k_A-1)} \end{bmatrix}.$$
As in the proof of Theorem 4, we can equivalently analyze the decoding by considering the system of equations
$$\mathbf{m}\tilde{\mathbf{G}} = \mathbf{c}, \qquad \mathbf{m} = [\mathbf{m}_0, \cdots, \mathbf{m}_{k_A-1}], \qquad \mathbf{c} = [\mathbf{c}_{i_0}, \cdots, \mathbf{c}_{i_{k_A-1}}].$$
Note that not all variables in $\mathbf{m}$ are independent owing to (5.7). Let $\mathbf{m}^F$ and $\mathbf{c}^F$ denote the $\tilde{q}$-point DFTs, where $\mathbf{W}$ is the $\tilde{q}$-point DFT matrix. Let $\tilde{\mathbf{G}}_{k,l} = \mathbf{P}^{i_l k}$ denote the $(k,l)$-th block of $\tilde{\mathbf{G}}$. Using the fact that circulant matrices are diagonalized by the DFT,
$$\tilde{\mathbf{G}}_{k,l} = \mathbf{W}\, \mathrm{diag}(1, \omega_{\tilde{q}}^{i_l k}, \omega_{\tilde{q}}^{2 i_l k}, \ldots, \omega_{\tilde{q}}^{(\tilde{q}-1) i_l k})\, \mathbf{W}^*.$$
Let $\tilde{\mathbf{G}}_{k,l}^F = \mathrm{diag}(1, \omega_{\tilde{q}}^{i_l k}, \omega_{\tilde{q}}^{2 i_l k}, \ldots, \omega_{\tilde{q}}^{(\tilde{q}-1) i_l k})$, and let $\tilde{\mathbf{G}}^F$ represent the $k_A \times k_A$ block matrix with blocks $\tilde{\mathbf{G}}_{k,l}^F$. Then
$$\mathbf{m}\tilde{\mathbf{G}} = \mathbf{c} \implies [\mathbf{m}_0^F, \cdots, \mathbf{m}_{k_A-1}^F]\, \tilde{\mathbf{G}}^F = [\mathbf{c}_{i_0}^F, \cdots, \mathbf{c}_{i_{k_A-1}}^F]$$
upon right multiplication by the block-diagonal matrix $\mathrm{diag}(\mathbf{W}, \ldots, \mathbf{W})$. Next, we note that as each block within $\tilde{\mathbf{G}}^F$ has a diagonal structure, we can rewrite the system of equations as a block diagonal matrix upon applying an appropriate permutation (cf. Claim 14 in Appendix B). Thus, we can rewrite it
as
$$[\mathbf{m}_0^{F,\pi}, \cdots, \mathbf{m}_{\tilde{q}-1}^{F,\pi}]\, \tilde{\mathbf{G}}_d^F = [\mathbf{c}_0^{F,\pi}, \cdots, \mathbf{c}_{\tilde{q}-1}^{F,\pi}]. \tag{B.1}$$
implies that $\mathbf{m}_0^{F,\pi}$ is a $1 \times k_A$ zero row-vector and thus $\mathbf{c}_0^{F,\pi}$ is too.
It follows that $[\mathbf{m}_0^F, \cdots, \mathbf{m}_{k_A-1}^F]$, and consequently $\mathbf{m}$, can be determined by solving the system of equations in (B.1). Towards this end, we note that the $k$-th diagonal block ($1 \le k \le \tilde{q}-1$) of $\tilde{\mathbf{G}}_d^F$, denoted by $\tilde{\mathbf{G}}_d^F[k]$, can be expressed as follows.
$$\tilde{\mathbf{G}}_d^F[k] = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \omega_{\tilde{q}}^{i_0 k} & \omega_{\tilde{q}}^{i_1 k} & \cdots & \omega_{\tilde{q}}^{i_{k_A-1} k} \\ \vdots & \vdots & \ddots & \vdots \\ \omega_{\tilde{q}}^{(k_A-1) i_0 k} & \omega_{\tilde{q}}^{(k_A-1) i_1 k} & \cdots & \omega_{\tilde{q}}^{(k_A-1) i_{k_A-1} k} \end{bmatrix}. \tag{B.2}$$
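The diagonalization step that produces the blocks $\tilde{\mathbf{G}}_{k,l}^F$ can be checked numerically. The sketch below (illustrative values of $\tilde{q}$ and the exponent) verifies $\mathbf{P}^s = \mathbf{W}\,\mathrm{diag}(1, \omega^s, \ldots, \omega^{(\tilde{q}-1)s})\,\mathbf{W}^*$ for a cyclic permutation matrix $\mathbf{P}$ and a unitary DFT-style matrix $\mathbf{W}$.

```python
import numpy as np

q = 5
s = 3                                       # plays the role of the exponent i_l * k
P = np.roll(np.eye(q), 1, axis=1)           # q x q cyclic permutation matrix
W = np.fft.ifft(np.eye(q)) * np.sqrt(q)     # W[j,k] = w^{jk}/sqrt(q), unitary
w = np.exp(2j * np.pi / q)
D = np.diag(w ** (s * np.arange(q)))
assert np.allclose(W @ W.conj().T, np.eye(q))                       # W is unitary
assert np.allclose(np.linalg.matrix_power(P, s), W @ D @ W.conj().T)
```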
The above matrix is a complex Vandermonde matrix with parameters $\omega_{\tilde{q}}^{i_0 k}, \ldots, \omega_{\tilde{q}}^{i_{k_A-1} k}$. Thus, as
for $i_\alpha, i_\beta \in \{0, \ldots, n-1\}$ and $1 \le k \le \tilde{q}-1$. A necessary and sufficient condition for this to hold
Pan (2016). Finding an upper bound on $\|\mathbf{V}^{-1}\|$ is more complicated and we discuss this in detail
In what follows, we establish an upper bound on the condition number of Vandermonde matrices.
Proof. Recall that $\omega_q = e^{i\frac{2\pi}{q}}$ and $\omega_m = e^{i\frac{2\pi}{m}}$, and define $t_j = f\omega_m^j$, $j = 0, \ldots, m-1$, where $f$ is a complex number with $|f| = 1$. We let $\mathbf{C}_{s,f}$ denote the Cauchy matrix with parameters $\{s_0, \ldots, s_{m-1}\}$ and $\{t_0, \ldots, t_{m-1}\}$. Let $\mathbf{W}$ be the $m$-point DFT matrix. The work of Pan (2016) shows that
$$\mathbf{V}^{-1} = \mathrm{diag}(f^{m-1-j})_{j=0}^{m-1}\, \mathbf{W}^*\, \mathrm{diag}(\omega_m^{-j})_{j=0}^{m-1}\, \mathbf{C}_{s,f}^{-1}\, \mathrm{diag}\!\left(\frac{1}{s_j^m - f^m}\right)_{j=0}^{m-1}.$$
It can be seen that the matrix $\mathrm{diag}(f^{m-1-j})_{j=0}^{m-1}\, \mathbf{W}^*\, \mathrm{diag}(\omega_m^{-j})_{j=0}^{m-1}$ is unitary. Therefore,
$$\|\mathbf{V}^{-1}\| = \left\|\mathbf{C}_{s,f}^{-1}\, \mathrm{diag}\!\left(\frac{1}{s_j^m - f^m}\right)_{j=0}^{m-1}\right\| \le \|\mathbf{C}_{s,f}^{-1}\| \times \frac{1}{\min_{i=0}^{m-1} |s_i^m - f^m|} \le m \times \left(\max_{i',j'} |\mathbf{C}_{s,f}^{-1}(i',j')|\right) \times \frac{1}{\min_{i=0}^{m-1} |s_i^m - f^m|}, \tag{B.3}$$
where the first inequality holds as the norm of a product of matrices is upper bounded by the product of the individual norms, and the second inequality holds since for any $\mathbf{M}$, we have $\|\mathbf{M}\| \le \|\mathbf{M}\|_F$.
In what follows, we upper bound the RHS of (B.3). Let $s(x)$ denote the function $s(x) = \Pi_{i=0}^{m-1}(x - s_i)$. The $(i', j')$-th entry of $\mathbf{C}_{s,f}^{-1}$ can be expressed as Pan (2016)
$$\mathbf{C}_{s,f}^{-1}(i', j') = (-1)^m\, s(t_{j'})\,(s_{i'}^m - f^m)/(s_{i'} - t_{j'}), \text{ so that}$$
$$|\mathbf{C}_{s,f}^{-1}(i', j')| = |s(t_{j'})|\,|s_{i'}^m - f^m|/|s_{i'} - t_{j'}| \le |s(t_{j'})|\,(|s_{i'}^m| + |f^m|)/|s_{i'} - t_{j'}|.$$
Let $M = \{1, \omega_q, \omega_q^2, \ldots, \omega_q^{q-1}\} \setminus \{s_0, s_1, \ldots, s_{m-1}\}$ denote the $q$-th roots of unity that are not parameters of $\mathbf{V}$. Then
$$s(t_{j'}) = \Pi_{i=0}^{m-1}(t_{j'} - s_i) = \left.\frac{x^q - 1}{\Pi_{\alpha_j \in M}(x - \alpha_j)}\right|_{x = t_{j'}}, \text{ so that}$$
$$|s(t_{j'})| = \frac{|t_{j'}^q - 1|}{\Pi_{\alpha_j \in M}|t_{j'} - \alpha_j|} \le \frac{2}{\Pi_{\alpha_j \in M}|t_{j'} - \alpha_j|} \quad \text{(since } |t_{j'}| = 1 \text{ and by the triangle inequality)}.$$
Therefore,
$$|\mathbf{C}_{s,f}^{-1}(i', j')| \le 4 \max_{i',j'} \frac{1}{\Pi_{\alpha_j \in M}|t_{j'} - \alpha_j|\; |s_{i'} - t_{j'}|} \tag{B.4}$$
$$= 4\, \frac{1}{\min_{i',j'} \Pi_{\alpha_j \in M}|t_{j'} - \alpha_j|\; |s_{i'} - t_{j'}|}. \tag{B.5}$$
Note that in the expression above, $s_{i'}$ is a parameter of $\mathbf{V}$ while the $\alpha_j$'s are the points within $\Omega_q = \{1, \omega_q, \omega_q^2, \ldots, \omega_q^{q-1}\}$ that are "not" parameters of $\mathbf{V}$. We choose $f = e^{i\frac{\pi}{m}}$ so that $t_{j'} = f\omega_m^{j'} = e^{i\pi/m}\omega_m^{j'}$. Next, we determine an upper bound on the RHS of (B.5). Towards this end, we note that the distance between two points on the unit circle can be expressed as $2\sin(\theta/2)$ if $\theta$ is the angle between them. It can be seen that the closest point to $t_{j'}$ that lies within $\Omega_q$ has an induced angle
$$\left|\frac{2\pi\ell}{q} - \frac{2\pi(j' + \frac{1}{2})}{m}\right| \ge \frac{2\pi}{qm} \cdot \frac{1}{2} \ge \frac{\pi}{q^2}.$$
Therefore, the corresponding distance is lower bounded by $2/q^2$. Similarly, the next closest distance
$$\Pi_{\alpha_j \in M}|t_{j'} - \alpha_j| \;\min_{i',j'} |s_{i'} - t_{j'}|$$
Therefore,
$$|\mathbf{C}_{s,f}^{-1}(i', j')| \le \frac{q^{d+3}}{C_d},$$
where $C_d = 2^{d-1}(d-1)!$ is a constant. Let the $i$-th parameter be $s_i = e^{i2\pi\ell/q}$. Then,
$$|s_i^m - f^m| = |e^{i2\pi\ell m/q} + 1| = 2|\cos(\pi\ell m/q)|.$$
The term $\ell m$ can be expressed as $\ell m = \beta q + \eta$ for integers $\beta$ and $\eta$ such that $0 \le \eta \le q-1$. Now note that $\eta \neq q/2$ since by assumption $q$ is odd. Thus, $|\cos(\pi\ell m/q)|$ takes its smallest value when
$$\|\mathbf{V}^{-1}\| \le m\, \frac{q^{d+3}}{C_d}\, q \le \frac{q^{d+5}}{C_d} \quad \text{(since } m < q\text{)}.$$
It follows that
$$\kappa(\mathbf{V}) \le \frac{q^{d+6}}{C_d}.$$
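The flavor of this bound is easy to see numerically. The sketch below (illustrative parameter values, not the worst case of the proof) checks that a Vandermonde matrix built from all $q$ of the $q$-th roots of unity has $\kappa(\mathbf{V}) = 1$ (it is a scaled unitary DFT matrix), while a subsampled version with $m < q$ parameters remains invertible with a modest condition number.

```python
import numpy as np

q = 11                                            # odd, as assumed in the proof
params = np.exp(2j * np.pi * np.arange(q) / q)    # the q-th roots of unity
V_full = np.vander(params, increasing=True)       # square DFT-like Vandermonde
assert np.isclose(np.linalg.cond(V_full), 1.0)    # scaled unitary: kappa = 1
m = 7
V_sub = np.vander(params[:m], m, increasing=True) # m consecutive roots of unity
assert np.linalg.cond(V_sub) < 1e6                # loose sanity check only
```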
Auxiliary Claims
Claim 14. Let $\mathbf{M}$ be an $l_1 q \times l_2 q$ matrix consisting of blocks of size $q \times q$ denoted by $\mathbf{M}_{i,j}$ for $i \in [l_1]$, $j \in [l_2]$, where each $\mathbf{M}_{i,j}$ is a diagonal matrix. Then, the rows and columns of $\mathbf{M}$ can be permuted to obtain $\mathbf{M}^\pi$, which is a block diagonal matrix where each block matrix is of size $l_1 \times l_2$.
Proof. For an integer $a$, let $(a)_q$ denote $a \bmod q$. In what follows, we establish two permutations and show that applying row-permutation $\pi_{l_1}$ and column-permutation $\pi_{l_2}$ to $\mathbf{M}$ will result in a
We observe that the $(i, j)$-th entry in $\mathbf{M}$ is the $((i)_q, (j)_q)$-th entry in $\mathbf{M}_{\lfloor i/q \rfloor, \lfloor j/q \rfloor}$. Under the applied permutations, the $(i, j)$-th entry in $\mathbf{M}$ is mapped to the $(l_1 (i)_q + \lfloor i/q \rfloor,\; l_2 (j)_q + \lfloor j/q \rfloor)$-th entry in $\mathbf{M}^\pi$. Recall that $\mathbf{M}_{\lfloor i/q \rfloor, \lfloor j/q \rfloor}$ is a diagonal matrix, which implies that for $(i)_q \neq (j)_q$, the $(l_1 (i)_q + \lfloor i/q \rfloor,\; l_2 (j)_q +$
We use the row permutation $\pi_{row} = (0, 2, 1, 3)$, which means that the $0, 1, 2, 3$-th rows of $\mathbf{M}$ permute to the $0, 2, 1, 3$-th rows. Similarly, the column permutation is $\pi_{col} = (0, 3, 1, 4, 2, 5)$. Thus, $\mathbf{M}^\pi$ becomes
$$\mathbf{M}^\pi = \begin{bmatrix} 1 & 1 & 1 & & & \\ 1 & \omega_q & \omega_q^2 & & & \\ & & & 1 & 1 & 1 \\ & & & 1 & \omega_q^{-1} & \omega_q^{-2} \end{bmatrix}.$$
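The permutation of Claim 14 can be implemented directly from the index maps in the proof. The sketch below builds a matrix of random diagonal blocks (illustrative sizes) and checks that, after the row map $i \mapsto l_1 (i)_q + \lfloor i/q \rfloor$ and the analogous column map, everything outside the $q$ diagonal $l_1 \times l_2$ blocks vanishes.

```python
import numpy as np

def block_diagonalize(M: np.ndarray, l1: int, l2: int, q: int) -> np.ndarray:
    """Apply the Claim 14 row/column permutations to M."""
    rows = [l1 * (i % q) + i // q for i in range(l1 * q)]
    cols = [l2 * (j % q) + j // q for j in range(l2 * q)]
    Mp = np.zeros_like(M)
    for i in range(l1 * q):
        for j in range(l2 * q):
            Mp[rows[i], cols[j]] = M[i, j]
    return Mp

rng = np.random.default_rng(1)
l1, l2, q = 2, 3, 4
blocks = [[np.diag(rng.standard_normal(q)) for _ in range(l2)] for _ in range(l1)]
M = np.block(blocks)                 # l1*q x l2*q matrix of diagonal blocks
Mp = block_diagonalize(M, l1, l2, q)
# Keep only the q diagonal l1 x l2 blocks; the remainder must be zero.
Mp_check = np.zeros_like(Mp)
for t in range(q):
    Mp_check[t*l1:(t+1)*l1, t*l2:(t+1)*l2] = Mp[t*l1:(t+1)*l1, t*l2:(t+1)*l2]
assert np.allclose(Mp, Mp_check)
```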
Claim 15. (i) Let $a_0(z) = \sum_{j=0}^{\ell_a-1} a_{j0} z^j$, $a_1(z) = \sum_{j=0}^{\ell_a-1} a_{j1} z^{-j}$, $b_0(z) = \sum_{j=0}^{\ell_b-1} b_{j0} z^{j\ell_a}$, and $b_1(z) = \sum_{j=0}^{\ell_b-1} b_{j1} z^{-j\ell_a}$. Then, $a_{k_1}(z) b_{k_2}(z)$ for $k_1, k_2 = 0, 1$ are polynomials that can be recovered from $\ell_a \ell_b$ non-zero distinct evaluation points, i.e., the matrix $[\mathbf{X}(z_1)|\mathbf{X}(z_2)|\ldots|\mathbf{X}(z_{\ell_a \ell_b})]$ is nonsingular.
(ii) The matrix $[\mathbf{X}_{i_0}|\mathbf{X}_{i_1}|\ldots|\mathbf{X}_{i_{\tau-1}}]$ (defined in the proof of Theorem 6) is permutation equivalent to a block-diagonal matrix with four blocks, each of size $\tau \times \tau$. Each of these blocks is a Vandermonde matrix with parameters from the set $\{1, \omega_q, \omega_q^2, \ldots, \omega_q^{q-1}\}$.
Proof. First we show that $a_{k_1}(z) b_{k_2}(z)$ for $k_1, k_2 = 0, 1$ are polynomials that can be recovered from $\ell_a \ell_b$ distinct evaluation points in $\mathbb{C}$. Towards this end, these four polynomials can be written as
$$a_0(z)b_0(z) = \sum_{i=0}^{\ell_a-1}\sum_{j=0}^{\ell_b-1} a_{i0} b_{j0} z^{i+j\ell_a},$$
$$a_0(z)b_1(z) = \sum_{i=0}^{\ell_a-1}\sum_{j=0}^{\ell_b-1} a_{i0} b_{j1} z^{i-j\ell_a},$$
$$a_1(z)b_0(z) = \sum_{i=0}^{\ell_a-1}\sum_{j=0}^{\ell_b-1} a_{i1} b_{j0} z^{-i+j\ell_a}, \text{ and}$$
$$a_1(z)b_1(z) = \sum_{i=0}^{\ell_a-1}\sum_{j=0}^{\ell_b-1} a_{i1} b_{j1} z^{-i-j\ell_a}.$$
Upon inspection, it can be seen that each of the polynomials above has $\ell_a \ell_b$ consecutive powers of $z$. Therefore, each of these can be interpolated from $\ell_a \ell_b$ non-zero distinct evaluation points in $\mathbb{C}$.
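The interpolation argument can be checked numerically for the first product. The sketch below (illustrative sizes and random coefficients) exploits the fact that the exponents $i + j\ell_a$ with $i < \ell_a$, $j < \ell_b$ are exactly $0, \ldots, \ell_a\ell_b - 1$, and recovers the product's coefficients from $\ell_a\ell_b$ evaluations at roots of unity via a Vandermonde solve.

```python
import numpy as np

rng = np.random.default_rng(2)
la, lb = 3, 4
a = rng.standard_normal(la)               # coefficients a_{i0}
b = rng.standard_normal(lb)               # coefficients b_{j0}
true_coeffs = np.kron(b, a)               # coefficient of z^(i + j*la)
N = la * lb
pts = np.exp(2j * np.pi * np.arange(N) / N)       # distinct non-zero points
vals = np.array([np.polyval(true_coeffs[::-1], z) for z in pts])
V = np.vander(pts, N, increasing=True)            # evaluation (Vandermonde) map
recovered = np.linalg.solve(V, vals)
assert np.allclose(recovered, true_coeffs)
```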
The second part of the claim follows from the above discussion. To see this, we note that
$$[a_0(z)\; a_1(z)] = [a_{00}\; a_{01}\; a_{10}\; a_{11}\; \ldots\; a_{(\ell_a-1)0}\; a_{(\ell_a-1)1}] \begin{bmatrix} \mathbf{I}_2 \\ \mathbf{D}(z) \\ \vdots \\ \mathbf{D}(z^{\ell_a-1}) \end{bmatrix} \text{ and}$$
$$[b_0(z)\; b_1(z)] = [b_{00}\; b_{01}\; b_{10}\; b_{11}\; \ldots\; b_{(\ell_b-1)0}\; b_{(\ell_b-1)1}] \begin{bmatrix} \mathbf{I}_2 \\ \mathbf{D}(z^{\ell_a}) \\ \vdots \\ \mathbf{D}(z^{\ell_a(\ell_b-1)}) \end{bmatrix}.$$
We have previously shown that all polynomials in $[a_0(z)\; a_1(z)] \otimes [b_0(z)\; b_1(z)]$ can be interpolated by obtaining their values on $\ell_a \ell_b$ non-zero distinct evaluation points. This implies that we can equivalently obtain
$$[a_{00}\; a_{01}\; a_{10}\; a_{11}\; \ldots\; a_{(\ell_a-1)0}\; a_{(\ell_a-1)1}] \otimes [b_{00}\; b_{01}\; b_{10}\; b_{11}\; \ldots\; b_{(\ell_b-1)0}\; b_{(\ell_b-1)1}],$$
which means that $[\mathbf{X}(z_1)|\mathbf{X}(z_2)|\ldots|\mathbf{X}(z_{\ell_a\ell_b})]$ is non-singular. This proves the statement in part (i).
The proof of the statement in (ii) is essentially an exercise in showing the permutation equivalence of several matrices by using Claim 14 and the permutation equivalence properties of Kronecker products.
Recall that we are analyzing the matrix $\mathbf{X} = [\mathbf{X}_{i_0}|\mathbf{X}_{i_1}|\ldots|\mathbf{X}_{i_{\tau-1}}]$. An application of Claim 14 shows that
$$\mathbf{X}_{l,A}^P = \begin{bmatrix} \mathbf{V}_{l,A,1} \\ \mathbf{V}_{l,A,2} \end{bmatrix}, \quad \text{and} \quad \mathbf{X}_{l,B}^P = \begin{bmatrix} \mathbf{V}_{l,B,1} \\ \mathbf{V}_{l,B,2} \end{bmatrix},$$
$$\mathbf{V}_{l,A,i} \otimes \mathbf{X}_{l,B}^P = \mathbf{V}_{l,A,i} \otimes \begin{bmatrix} \mathbf{V}_{l,B,1} \\ \mathbf{V}_{l,B,2} \end{bmatrix}, \quad \begin{bmatrix} \mathbf{V}_{l,B,1} \\ \mathbf{V}_{l,B,2} \end{bmatrix} \otimes \mathbf{V}_{l,A,i} = \begin{bmatrix} \mathbf{V}_{l,B,1} \otimes \mathbf{V}_{l,A,i} \\ \mathbf{V}_{l,B,2} \otimes \mathbf{V}_{l,A,i} \end{bmatrix}, \quad \begin{bmatrix} \mathbf{V}_{l,A,i} \otimes \mathbf{V}_{l,B,1} \\ \mathbf{V}_{l,A,i} \otimes \mathbf{V}_{l,B,2} \end{bmatrix}.$$
$$\mathbf{V}_{l,A,2} \otimes \mathbf{V}_{l,B,1} = [\omega_q^{-l(k_A-1)}, \omega_q^{-l(k_A-2)}, \cdots, \omega_q^{-l}, 1, \omega_q^{l}, \cdots, \omega_q^{l(k_A(k_B-1)-1)}, \omega_q^{l k_A(k_B-1)}]^T,$$
$$\mathbf{V}_{l,A,1} \otimes \mathbf{V}_{l,B,2} = [\omega_q^{-l k_A(k_B-1)}, \omega_q^{-l(k_A(k_B-1)-1)}, \cdots, \omega_q^{-l}, 1, \omega_q^{l}, \cdots, \omega_q^{l(k_A-2)}, \omega_q^{l(k_A-1)}]^T, \text{ and}$$
Claim 16. Let $\tau_{dif} = 2k_A k_B p - 2(k_A k_B + p k_A + p k_B) + k_A + k_B + 2p$ and $p > 1$. Then $\tau_{dif} < 0$ only if $k_A = 1$ or $k_B = 1$.
Treating $\tau_{dif}$ as a function of $k_A$, we have
$$\tau_{dif}'(k_A) = 2k_B p - 2k_B - 2p + 1 > 0,$$
which means that $\tau_{dif}(k_A)$ is a strictly increasing function. We consider the following cases.
• $k_A = 2$. In this case, $\tau_{dif}(2) = 2k_B p - 3k_B - 2p + 2$. It is not straightforward to evaluate the sign of $\tau_{dif}(2)$ directly; thus we consider $\tau_{dif}(2)$ as a function of $k_B$. Its derivative is $\tau_{dif}(2)'(k_B) = 2p - 3 > 0$, which means that $\tau_{dif}(2)$ is a strictly increasing function of $k_B$, since in this proof we assume $p > 2$. When $k_B = 2$, $\tau_{dif}(2) = 2p - 4 \ge 0$. Then we conclude $\tau_{dif}(2) \ge 0$.
• $k_A > 2$. It is clear that $\tau_{dif}(k_A) > 0$ since $\tau_{dif}(k_A)$ is strictly increasing and $\tau_{dif}(2) \ge 0$.
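The conclusion of Claim 16 is also easy to check by brute force over a finite range. The sketch below (ranges are illustrative; the claim itself holds for all valid parameters) verifies that $\tau_{dif} \ge 0$ whenever $k_A, k_B \ge 2$ and $p \ge 2$, and exhibits a case with $k_A = 1$ where $\tau_{dif} < 0$.

```python
def tau_dif(kA: int, kB: int, p: int) -> int:
    """tau_dif from Claim 16."""
    return 2*kA*kB*p - 2*(kA*kB + p*kA + p*kB) + kA + kB + 2*p

# Non-negative on a grid with kA, kB, p >= 2 (finite spot check only).
assert all(tau_dif(kA, kB, p) >= 0
           for kA in range(2, 30)
           for kB in range(2, 30)
           for p in range(2, 30))
# The exception in the claim: kA = 1 can make tau_dif negative.
assert tau_dif(1, 2, 2) < 0
```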