Algebraic Approaches For Coded Caching and Distributed Computing

The dissertation by Li Tang, titled 'Algebraic approaches for coded caching and distributed computing,' explores innovative algebraic methods in the fields of coded caching and distributed computing. It includes a comprehensive analysis of caching schemes, erasure coding, and matrix computations, contributing to advancements in electrical engineering. The work is part of the requirements for a Doctor of Philosophy degree at Iowa State University, completed in 2020.


Graduate Theses and Dissertations
Iowa State University Capstones, Theses and Dissertations

2020

Algebraic approaches for coded caching and distributed computing
Li Tang
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation
Tang, Li, "Algebraic approaches for coded caching and distributed computing" (2020). Graduate Theses
and Dissertations. 17873.
https://lib.dr.iastate.edu/etd/17873

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and
Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and
Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information,
please contact [email protected].
Algebraic approaches for coded caching and distributed computing

by

Li Tang

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Electrical Engineering

Program of Study Committee:


Aditya Ramamoorthy, Major Professor
Baskar Ganapathysubramanian
Chinmay Hegde
Zhengdao Wang
Sung Yell Song

The student author, whose presentation of the scholarship herein was approved by the program of
study committee, is solely responsible for the content of this dissertation. The Graduate College
will ensure this dissertation is globally accessible and will not permit alterations after a degree is
conferred.

Iowa State University

Ames, Iowa

2020

Copyright © Li Tang, 2020. All rights reserved.



DEDICATION

I would like to dedicate this thesis to my mother Dan Liu and father Xiaobo Tang. Without

their support I would not have been able to complete this work.

TABLE OF CONTENTS

Page

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Coded Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Coded Caching for Networks with the Resolvability Property . . . . . . . . . 3
1.1.2 Coded Caching Schemes with Reduced Subpacketization from Linear Block
Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Distributed computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Erasure coding for distributed matrix multiplication for matrices with bounded
entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Numerically stable coded matrix computations via circulant and rotation
matrix embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

CHAPTER 2. CODED CACHING FOR NETWORKS WITH THE RESOLVABILITY PROP-


ERTY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Problem Formulation, Background and Main Contribution . . . . . . . . . . . . . . . 9
2.1.1 Problem formulation and background . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Proposed Caching Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

CHAPTER 3. CODED CACHING SCHEMES WITH REDUCED SUBPACKETIZATION


FROM LINEAR BLOCK CODES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 Background, Related Work and Summary of Contributions . . . . . . . . . . . . . . 23
3.1.1 Discussion of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Proposed low subpacketization level scheme . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Resolvable Design Construction . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 A special class of linear block codes . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3 Usage in a coded caching scenario . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Obtaining a scheme for M/N = 1 − (k + 1)/(nq) . . . . . . . . . . . . . . 38

3.3 Some classes of linear codes that satisfy the CCP . . . . . . . . . . . . . . . . . . . . 41


3.3.1 Maximum-distance-separable (MDS) codes . . . . . . . . . . . . . . . . . . . 41
3.3.2 Cyclic Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.3 Constructions leveraging properties of smaller base matrices . . . . . . . . . . 44
3.3.4 Constructions where q is not a prime or a prime power . . . . . . . . . . . . . 47
3.4 Discussion and Comparison with Existing Schemes . . . . . . . . . . . . . . . . . . . 49
3.4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Comparison with memory-sharing within the scheme of Maddah-Ali and
Niesen (2014b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.3 Comparison with Maddah-Ali and Niesen (2014b), Yan et al. (2017a), Yan
et al. (2017b), Shangguan et al. (2018) and Shanmugam et al. (2017) . . . . . 53
3.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

CHAPTER 4. ERASURE CODING FOR DISTRIBUTED MATRIX MULTIPLICATION


FOR MATRICES WITH BOUNDED ENTRIES . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 Problem Formulation and Main Contribution . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.2 Main Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Reduced Recovery Threshold codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.1 Motivating example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.2 General code construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.3 Decoding algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.4 Discussion of precision issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Trading off precision and threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

CHAPTER 5. NUMERICALLY STABLE CODED MATRIX COMPUTATIONS VIA CIR-


CULANT AND ROTATION MATRIX EMBEDDINGS . . . . . . . . . . . . . . . . . . . 70
5.1 Problem Setting, Related Work and Main contributions . . . . . . . . . . . . . . . . 70
5.1.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1.3 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Distributed Matrix Computation Schemes . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Distributed Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Rotation Matrix Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.2 Circulant Permutation Embedding . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Distributed Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Generalized Distributed Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . 84
5.6 Comparisons and Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6.1 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

APPENDIX A. SUPPLEMENT FOR CODED CACHING SCHEMES WITH REDUCED


SUBPACKETIZATION FROM LINEAR BLOCK CODES . . . . . . . . . . . . . . . . . 106

APPENDIX B. SUPPLEMENT FOR NUMERICALLY STABLE CODED MATRIX COM-


PUTATIONS VIA CIRCULANT AND ROTATION MATRIX EMBEDDINGS . . . . . . 127
B.0.1 Proof of Lemma 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

LIST OF TABLES

Page
Table 2.1 Comparison of three schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Table 3.1 A summary of the different constructions of CCP matrices in Section 3.3 . . 57
Table 3.2 List of k values for Example 16. The values of n0 , α and z are obtained by
following Algorithm 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Table 4.1 Effect of bound (L) on the decoding error . . . . . . . . . . . . . . . . . . . 69
Table 5.1 Comparison for matrix-vector case with n = 31, A has size 28000 × 19720
and x has length 28000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Table 5.2 Comparison for AT B matrix-matrix multiplication case with n = 31, kA =
4, kB = 7. A has size 8000 × 14000, B has size 8400 × 14000. . . . . . . . . 93
Table 5.3 Comparison for matrix-matrix AT B multiplication case with n = 17, uA =
2, uB = 2, p = 2, A is of size 4000 × 16000, B is of 4000 × 16000. . . . . . . 95
Table 5.4 Performance of matrix inversion over a large prime order field in Python 3.7.
The table shows the computation time for inverting an ℓ × ℓ matrix G over
a finite field of order p. Let Ĝ⁻¹ denote the inverse obtained by applying
the sympy function Matrix(G).inverse_mod(p). The MSE is defined as
(1/ℓ) ||G Ĝ⁻¹ − I||_F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

LIST OF FIGURES

Page
Figure 1.1 Model of coded caching in Maddah-Ali and Niesen (2014b). . . . . . . . . . 2
Figure 1.2 Caching strategy for N = 2 files and K = 2 users with cache size M = 1
with all four possible user requests. Each file is split into 2 subfiles. The
schemes achieve rate R = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Figure 2.1 The figure shows a (4 choose 2) combination network. It also shows the cache
placement when M = 2, N = 6. Here, Z_1 = ∪_{n=1}^{6} {W_{n,1}^1, W_{n,1}^2},
Z_2 = ∪_{n=1}^{6} {W_{n,2}^1, W_{n,2}^2} and Z_3 = ∪_{n=1}^{6} {W_{n,3}^1, W_{n,3}^2}.
It can be observed that each relay node sees the same caching pattern in the
users that it is connected to, i.e., the users connected to each Γ_i together
have Z_1, Z_2 and Z_3 represented in their caches. . . . . . . . . . . . . . . 12
Figure 2.2 Performance comparison of the different schemes for a (6 choose 2) combination
network with K = 15, K̃ = 5 and N = 50. . . . . . . . . . . . . . . . . . . . 19
Figure 3.1 Recovery set bipartite graph . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 3.2 A comparison of rate and subpacketization level vs. M/N for a system with
K = 64 users. The left y-axis shows the rate and the right y-axis shows
the logarithm of the subpacketization level. The green and the blue curves
correspond to two of our proposed constructions. Note that our schemes
allow for multiple orders of magnitude reduction in subpacketization level
at the expense of a small increase in coded caching rate. . . . . . . . . . . 50
Figure 3.3 The plot shows the gain in the scaling exponent obtained using our techniques
for different values of M/N = 1/q. Each curve corresponds to a choice of
η = k/n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 4.1 Comparison of total computation latency by simulating up to 8 stragglers . 67
Figure 5.1 Consider matrix-vector AT x multiplication system with n = 31, τ = 29. A has
size 28000 × 19720 and x has length 28000. . . . . . . . . . . . . . . . . . . . . . 92
Figure 5.2 Consider matrix-matrix AT B multiplication system with n = 31, kA = 4, kB = 7,
A is of size 8000 × 14000, B is of size 8400 × 14000. . . . . . . . . . . . . . . 93
Figure 5.3 Consider matrix-matrix AT B multiplication system with n = 18, uA = 2, uB = 2,
p = 2, A is of size 4000 × 16000, B is of size 4000 × 16000. . . . . . . . . . . 96

ACKNOWLEDGMENTS

I would like to take this opportunity to express my thanks to those who helped me with various

aspects of conducting research and the writing of this thesis. First and foremost, Dr. Aditya

Ramamoorthy for his guidance, patience and support throughout this research and the writing of

this thesis. He always encouraged me when I hit rock bottom in my research and life.

I would also like to thank my committee members for their efforts and contributions to this

work: Dr. Baskar Ganapathysubramanian, Dr. Chinmay Hegde, Dr. Zhengdao Wang and Dr.

Sung Yell Song.

Finally, I would like to thank all my wonderful friends at Ames, Iowa.



ABSTRACT

This dissertation examines the power of algebraic methods in two areas of modern interest:

caching for large scale content distribution and straggler mitigation within distributed computation.

Caching is a popular technique for facilitating large scale content delivery over the Internet.

Traditionally, caching operates by storing popular content closer to the end users. Recent work

within the domain of information theory demonstrates that allowing coding in the cache and coded

transmission from the server (referred to as coded caching) to the end users can allow for significant

reductions in the number of bits transmitted from the server to the end users. The first part of

this dissertation examines problems within coded caching.

The original formulation of the coded caching problem assumes that the server and the end

users are connected via a single shared link. In Chapter 2, we consider a more general topology

where there is a layer of relay nodes between the server and the users. We propose novel schemes

for a class of such networks that satisfy a so-called resolvability property and demonstrate that

the performance of our scheme is strictly better than previously proposed schemes. Moreover, the

original coded caching scheme requires that each file hosted in the server be partitioned into a large

number (i.e., the subpacketization level) of non-overlapping subfiles. From a practical perspective,

this is problematic as it means that prior schemes are only applicable when the size of the files is

extremely large. In Chapter 3, we propose a novel coded caching scheme that enjoys a significantly

lower subpacketization level than prior schemes, while only suffering a marginal increase in the

transmission rate. We demonstrate that several schemes with subpacketization levels that are

exponentially smaller than the basic scheme can be obtained.

The second half of this dissertation deals with large scale distributed matrix computations.

Distributed matrix multiplication is an important problem, especially in domains such as deep

learning of neural networks. It is well recognized that the computation times on distributed clusters

are often dominated by the slowest workers (called stragglers). Recently, techniques from coding

theory have found applications in straggler mitigation in the specific context of matrix-matrix and

matrix-vector multiplication. The computation can be completed as long as a certain number of

workers (called the recovery threshold) complete their assigned tasks.

In Chapter 4, we consider matrix multiplication under the assumption that the absolute values

of the matrix entries are sufficiently small. Under this condition, we present a method with a

significantly smaller recovery threshold than prior work. Besides, the prior work suffers from

serious numerical issues owing to the condition number of the corresponding real Vandermonde-

structured recovery matrices; this condition number grows exponentially in the number of workers.

In Chapter 5, we present a novel approach that leverages the properties of circulant permutation

matrices and rotation matrices for coded matrix computation. In addition to having an optimal

recovery threshold, we demonstrate an upper bound on the worst case condition number of our

recovery matrices that grows polynomially in the number of workers.



CHAPTER 1. INTRODUCTION

Caching and distributed computing are two important technologies in the Big Data era that

are widely used in a variety of commercial settings. Broadly speaking, both can be considered

as applications of network coding Ahlswede et al. (2000), Li et al. (2003). Network coding,

introduced in Ahlswede et al. (2000), is a generalization of routing where intermediate nodes in

a network combine information as it travels through the network.

coding have been investigated over the years and various types of network connections have been

considered within this domain. These include the original multicast Ho et al. (2006), multiple

unicast Dougherty et al. (2005) and Huang et al. (2011), Huang and Ramamoorthy (2013, 2014) and

function computation Ramamoorthy and Langberg (2013); Langberg and Ramamoorthy (2009),

Rai and Dey (2012) among other topics.

1.1 Coded Caching

Caching is a popular technique for facilitating content delivery over the Internet. It exploits

local cache memory that is often available at the end users for reducing transmission rates from

the central server. In particular, when the users request files from a central server, the system

first attempts to satisfy the user demands in part from the local content. Thus, the overall rate

of transmission from the server is reduced which in turn reduces overall network congestion. The

work of Maddah-Ali and Niesen (2014b) demonstrated that huge rate savings are possible when

coding in the caches and coded transmissions from the server to the users are considered. This

problem is referred to as coded caching.

In Maddah-Ali and Niesen (2014b), the scenario considered was as follows. There is a server

that contains N files, a collection of K users that are connected to the central server by a single

Figure 1.1 Model of coded caching in Maddah-Ali and Niesen (2014b).

shared link. Each user also has a local cache of size M . The focus is on reducing the rate of

transmission on the shared link. There are two distinct phases in the coded caching setting.

• Placement phase: In the placement phase, the caches of the users are populated. This phase

should not depend on the actual user requests, which are assumed to be arbitrary. The

placement phase is executed during off-peak traffic periods.

• Delivery phase: In the delivery phase, the server transmits coded signals over the shared link

such that each user’s demand is satisfied. The delivery phase is executed during peak traffic periods.

Since the original work of Maddah-Ali and Niesen (2014b), there have been several aspects of coded

caching that have been investigated. Maddah-Ali and Niesen (2014a) considered the decentralized

coded caching where the placement phase is driven by the users who randomly populate their

caches. Information theoretic lower bounds on the transmission rate were considered in Ghasemi

and Ramamoorthy (2016), Ghasemi and Ramamoorthy (2017c), Yu et al. (2018). Synchronization

issues in this problem setting were investigated in Ghasemi and Ramamoorthy (2017a), Ghasemi

and Ramamoorthy (2017b). More recently, techniques inspired by coded caching have been em-

ployed for speeding up distributed computing Li et al. (2017), Konstantinidis and Ramamoorthy

(2018), Konstantinidis and Ramamoorthy (2019), Konstantinidis and Ramamoorthy (to appear).



We explain the idea of coded caching by the following example in Maddah-Ali and Niesen

(2014b).

Example 1. Consider the case where K = N = 2 and M = 1, so that there are two files A and B in the server and two users, each with a cache of size M = 1 file. The coded caching scheme is as follows. In the placement phase, file A is split into two non-overlapping subfiles, A = {A1, A2}; similarly, B is split into B = {B1, B2}. User 1 caches A1 and B1, so its cache holds exactly 1 file. User 2 caches A2 and B2, so its cache also holds 1 file. In the delivery phase, suppose that user 1 requests file A and user 2 requests file B. Then user 1 needs A2 and user 2 needs B1 from the server. The server can certainly transmit A2 and B1 separately over the shared link, for a transmission rate of 1 file; we refer to this as the uncoded transmission rate. Coded caching introduces another transmission strategy: the server simply transmits A2 ⊕ B1, where ⊕ denotes bitwise XOR. User 1 already has B1, so it can recover A2 from A2 ⊕ B1; similarly, user 2 can recover B1 since it already has A2. Therefore, the single transmission A2 ⊕ B1 guarantees that both users recover their requests, and the transmission rate is half a file; we refer to this as the coded transmission rate. In this example, the coded transmission rate is only half the uncoded rate. Figure 1.2 shows that the coded transmission rate is half a file for all four possible user requests.
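The delivery step in this example is simple enough to sketch in code. Below is a minimal Python sketch; the 4-byte strings are hypothetical stand-ins for the files A and B, and the subfile names mirror the example above.

```python
# Hypothetical stand-ins for the two files.
A = b"AAAA"
B = b"BBBB"

# Placement phase: each file is split into two subfiles.
# User 1 caches (A1, B1); user 2 caches (A2, B2).
A1, A2 = A[:2], A[2:]
B1, B2 = B[:2], B[2:]

def xor(x, y):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))

# Delivery phase: user 1 requests A, user 2 requests B.
# Instead of sending A2 and B1 separately (rate 1 file), the server
# broadcasts the single coded subfile A2 XOR B1 (rate 1/2 file).
coded = xor(A2, B1)

# Each user cancels the subfile it already holds in its cache.
recovered_A2 = xor(coded, B1)   # user 1 uses cached B1
recovered_B1 = xor(coded, A2)   # user 2 uses cached A2

assert A1 + recovered_A2 == A   # user 1 reconstructs file A
assert recovered_B1 + B2 == B   # user 2 reconstructs file B
```

The same cancellation argument underlies the general scheme: every coded transmission is useful to multiple users simultaneously because each can subtract what it already caches.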

1.1.1 Coded Caching for Networks with the Resolvability Property

Maddah-Ali and Niesen (2014b) demonstrate that, by carefully designing the users’ cache contents,

coded transmissions from the central server can significantly reduce the transmission rate when

the server and the end users are connected via a single shared link. In the first part of this dissertation,

we consider a more general topology where there is a layer of relay nodes between the server and

the users. We demonstrate that our proposed scheme outperforms two previous proposed schemes

in Ji et al. (2015a). This work has appeared in Tang and Ramamoorthy (2016a) and is discussed

in Chapter 2.
[Figure 1.2: for the four demand patterns (A, A), (A, B), (B, A) and (B, B), the server transmits A2 ⊕ A1, A2 ⊕ B1, B2 ⊕ A1 and B2 ⊕ B1, respectively.]

Figure 1.2 Caching strategy for N = 2 files and K = 2 users with cache size M = 1 with
all four possible user requests. Each file is split into 2 subfiles. The schemes
achieve rate R = 0.5.

1.1.2 Coded Caching Schemes with Reduced Subpacketization from Linear Block

Codes

Maddah-Ali and Niesen (2014b) proposed a coded caching scheme and showed that compared

with conventional caching, coded caching can achieve a much lower transmission rate. However,

in the placement phase of their scheme, each file is split into a very large number of non-overlapping subfiles of equal size; this number is called the subpacketization level of the scheme. Consequently, the scheme is applicable only in the regime where the underlying file sizes are very large.

In the second part of this dissertation, we propose coded caching schemes based on combinatorial

structures called resolvable designs. These structures can be obtained in a natural manner from

linear block codes whose generator matrices possess certain rank properties. We demonstrate that

several schemes with subpacketization levels that are exponentially smaller than the basic scheme

can be obtained. The subpacketization level of our scheme is exponentially lower than that achieved by memory-sharing within the scheme of Maddah-Ali and Niesen (2014b). This work has appeared in Tang

and Ramamoorthy (2018, 2016b, 2017) and is discussed in Chapter 3.



1.2 Distributed computing

The current Big Data era routinely requires the processing of large scale data on massive

distributed computing clusters. In these applications, data sets are often so large that they cannot

be housed in the memory and/or the disk of any one computer. Thus, the data and the processing are

typically distributed across multiple nodes. Distributed computation is thus a necessity rather than

a luxury. The widespread usage of such clusters presents several opportunities and advantages over

traditional computing paradigms. However, it also presents newer challenges where coding-theoretic

ideas have recently had a significant impact. Large scale clusters (which can be heterogeneous in

nature) suffer from the problem of stragglers which refer to slow or failed worker nodes in the

system. Thus, the overall speed of a computation is typically dominated by the slowest node in the

absence of a sophisticated assignment of tasks to the worker nodes.

The conventional approach for tackling stragglers in distributed computation has been to run

multiple copies of tasks on various machines, with the hope that at least one copy finishes on time.

However, coded computation offers significant benefits for specific classes of problems. We illustrate

this by means of a matrix-matrix multiplication example in Yu et al. (2017).

Example 2. Consider a distributed matrix multiplication task of computing C = A^T B using five workers, each of which stores the equivalent of half of the matrices A and B. We first split the matrices evenly along the column dimension to obtain A = [A0, A1] and B = [B0, B1]. We want to compute the four matrix products shown below.

              [ A_0^T B_0   A_0^T B_1 ]
C = A^T B =   [                       ]
              [ A_1^T B_0   A_1^T B_1 ]

Now we design a computation strategy which is resilient to a single straggler. Let the master node

give worker i ∈ {1, 2, 3, 4, 5} the following two submatrices.

Â_i = A_0 + i A_1 , and

B̂_i = B_0 + i^2 B_1 .

Each worker computes Â_i^T B̂_i . It follows that as soon as any four out of the five worker nodes return

the results of their computation, the master node can decode and recover C. We discuss this

through a representative scenario, where the master receives the computation results from workers

1, 2, 3, and 4. The computation result from worker i is

Ĉ_i = Â_i^T B̂_i = A_0^T B_0 + i A_1^T B_0 + i^2 A_0^T B_1 + i^3 A_1^T B_1 .

Then we have

[ Ĉ_1 ]   [ 1^0  1^1  1^2  1^3 ] [ A_0^T B_0 ]
[ Ĉ_2 ] = [ 2^0  2^1  2^2  2^3 ] [ A_1^T B_0 ]
[ Ĉ_3 ]   [ 3^0  3^1  3^2  3^3 ] [ A_0^T B_1 ]
[ Ĉ_4 ]   [ 4^0  4^1  4^2  4^3 ] [ A_1^T B_1 ]

The coefficient matrix in the above equation is a Vandermonde matrix, which is non-singular over the reals Horn and Johnson (1991). Therefore, the four components A_0^T B_0, A_1^T B_0, A_0^T B_1 and A_1^T B_1 can be recovered once the master node has Ĉ_1, Ĉ_2, Ĉ_3 and Ĉ_4.

The above coded computation strategy is resilient to 1 straggler with 5 workers. On the other hand, for an uncoded (replication) strategy to be resilient to 1 straggler, each of the four components of C has to be computed by 2 workers; it therefore needs 8 workers.
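The entire strategy of Example 2 — encoding, a straggling worker, and Vandermonde decoding — can be sketched with NumPy. The matrix sizes below are small illustrative choices, not those from the dissertation’s experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))        # A = [A0, A1], split along columns
B = rng.standard_normal((6, 4))        # B = [B0, B1]
A0, A1 = A[:, :2], A[:, 2:]
B0, B1 = B[:, :2], B[:, 2:]

# Encoding: worker i receives A0 + i*A1 and B0 + i^2*B1 and returns
# C_hat_i = (A0 + i*A1)^T (B0 + i^2*B1), a degree-3 matrix polynomial in i.
C_hat = {i: (A0 + i * A1).T @ (B0 + i**2 * B1) for i in range(1, 6)}

# Decoding: the recovery threshold is 4.  Suppose worker 5 straggles; the
# master solves the 4x4 Vandermonde system at evaluation points 1..4.
fast = [1, 2, 3, 4]
V = np.vander(np.array(fast, dtype=float), 4, increasing=True)
rhs = np.stack([C_hat[i] for i in fast]).reshape(4, -1)
coeffs = np.linalg.solve(V, rhs).reshape(4, 2, 2)
A0B0, A1B0, A0B1, A1B1 = coeffs        # coefficients of i^0, i^1, i^2, i^3

C = np.block([[A0B0, A0B1], [A1B0, A1B1]])
assert np.allclose(C, A.T @ B)         # master recovered C = A^T B
```

The decode step works for any 4 of the 5 evaluation points, which is exactly the statement that the scheme tolerates one straggler.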

The tutorial paper Ramamoorthy et al. (2020) overviews recent developments in the field of

coding for straggler-resilient distributed matrix computations.

1.2.1 Erasure coding for distributed matrix multiplication for matrices with bounded

entries

A key metric of distributed matrix multiplication is the minimum number of workers that the

master needs to wait for in order to compute C; this is called the recovery threshold of the scheme.

In the third part of this dissertation, we present a novel coding strategy for this problem when the

absolute values of the matrix entries are sufficiently small. We demonstrate a trade-off between the

assumed absolute value bounds on the matrix entries and the recovery threshold. At one extreme,

we are optimal with respect to the recovery threshold and on the other extreme, we match the

threshold of prior work. Experimental results on cloud-based clusters validate the benefits of our

method. This work has appeared in Tang et al. (2019).

1.2.2 Numerically stable coded matrix computations via circulant and rotation ma-

trix embeddings

Ideas from coding theory have recently been used in several works for mitigating the effect

of stragglers in distributed matrix computations (matrix-vector and matrix-matrix multiplication)

over the reals. In particular, a polynomial code based approach spreads out the computation of

AT B among n distributed worker nodes by means of polynomial evaluations Yu et al. (2020).

This allows for an “optimal” recovery threshold whereby the intended result can be decoded as

long as at least (n − s) worker nodes complete their tasks; s is the number of stragglers that the

scheme can handle. However, a major issue with these approaches is the high condition number

of the corresponding Vandermonde-structured recovery matrices. This presents serious numerical

precision issues when decoding the desired result.

It can be shown that the condition number of n × n real Vandermonde matrices grows exponen-

tially in n Pan (2016). On the other hand, the condition numbers of Vandermonde matrices with

parameters on the unit circle are much better behaved. However, using complex evaluation points

in a straightforward manner causes an unacceptable multiplicative increase in the computation time

at the worker nodes. In this work we leverage the properties of circulant permutation matrices and

rotation matrices to obtain coded computation schemes with significantly lower worst case condi-

tion numbers; these matrices have eigenvalues that lie on the unit circle. Our technique essentially

works by evaluating polynomials at matrices rather than scalars. Our analysis demonstrates that

the associated recovery matrices have a condition number corresponding to Vandermonde matrices

with parameters given by the eigenvalues of the corresponding circulant permutation and rotation

matrices. Finally, we demonstrate an upper bound on the worst case condition number of these matrices which grows as ≈ O(n^(s+6)). In essence, we leverage the well-behaved conditioning of complex

Vandermonde matrices with parameters on the unit circle, while still working with computation

over the reals. Experimental results demonstrate that our proposed method has condition numbers

that are several orders of magnitude better than prior work.
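The contrast between real and unit-circle evaluation points is easy to observe numerically. In the sketch below (the choice n = 20 is arbitrary), a real Vandermonde matrix with equispaced nodes is severely ill-conditioned, while a Vandermonde matrix whose nodes are the n-th roots of unity is a scaled DFT matrix with condition number 1.

```python
import numpy as np

n = 20

# Real Vandermonde matrix with distinct real nodes 1, 2, ..., n.
V_real = np.vander(np.arange(1, n + 1, dtype=float), increasing=True)

# Vandermonde matrix with nodes on the unit circle (n-th roots of unity).
# This is the DFT matrix: sqrt(n) times a unitary matrix, so cond = 1.
nodes = np.exp(2j * np.pi * np.arange(n) / n)
V_unit = np.vander(nodes, increasing=True)

print(f"cond, real nodes:        {np.linalg.cond(V_real):.3e}")   # huge
print(f"cond, unit-circle nodes: {np.linalg.cond(V_unit):.3e}")   # ~1
```

The schemes in Chapter 5 aim to capture this well-conditioned regime while keeping all worker computations over the reals.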

Our work in this area is currently available as a preprint Ramamoorthy and Tang (2019) and will be submitted to a journal in due course.



CHAPTER 2. CODED CACHING FOR NETWORKS WITH THE


RESOLVABILITY PROPERTY

In this chapter, we consider the coded caching problem in a more general setting where there

is a layer of relay nodes between the server and the users (see Ji et al. (2015b) for related work).

Specifically, the server is connected to a set of relay nodes and the users are connected to certain

subsets of the relay nodes. A class of such networks have been studied in network coding and are

referred to as “combination networks” Ngai and Yeung (2004). In a combination network there are

h relay nodes and $\binom{h}{r}$ users, each of which is connected to an r-subset of the relay nodes. Combination


networks were the first example where an unbounded gap between network coding and routing for

the case of multicast was shown. While this setting is still far from an arbitrary network connecting

the server and the users, it is rich enough to admit several complex strategies that may shed light

on coded caching for general networks.

In this chapter we consider a class of networks that satisfy a so called resolvability property.

These networks include combination networks where r divides h, but there are many other examples.

We propose a coded caching scheme for these networks and demonstrate its advantages.

This chapter is organized as follows. Section 2.1 presents the problem formulation, background

and our main contribution. In Section 2.2, we describe our proposed coded caching scheme, Section

2.3 presents a performance analysis and comparison and Section 2.4 concludes the paper.

2.1 Problem Formulation, Background and Main Contribution

In this work we consider a class of networks that contain an intermediate layer of nodes be-

tween the main server and the end user nodes. Combination networks are a specific type of such

networks and have been studied in some depth in the literature on network coding Ngai and Yeung

(2004). However, as explained below, we actually consider a larger class of networks that encompass

combination networks.

2.1.1 Problem formulation and background

The networks we consider consist of a server node denoted S and h relay nodes, Γ1 , Γ2 , . . . , Γh

such that the server is connected to each of the relay nodes by a single edge. The set of relay nodes

is denoted by H. Let [m] = {1, 2, . . . , m}. If A ⊂ [h] we let ΓA = ∪i∈A {Γi }. There are K users in

the system and each user is connected to a subset of H of size r. Let V ⊂ {1, . . . , h} with |V| = r.

For convenience, we will assume that the set V is written in ascending order even though the subset

structure does not impose any ordering on the elements. Under this condition, we let V[i] represent

the i-th element of V. For instance, if V = {1, 3}, then V[1] = 1 and V[2] = 3. Likewise, Inv − V

will denote the corresponding inverse map, i.e. Inv − V[i] = j if V[j] = i.

Each user is labeled by the subset of relay nodes it is connected to. Thus, UV denotes the user

that is connected to ΓV . The set of all users is denoted U and the set of all subsets that specify

the users is denoted V, i.e., V ∈ V if UV is a user. We consider networks where V satisfies the

resolvability property that is defined below.

Definition 1. Resolvability property. The set V defined above is said to be resolvable if there

exists a partition of V into subsets P1 , P2 , . . . , PK̃ such that

• for any i ∈ [K̃], if V ∈ Pi and V′ ∈ Pi with V ≠ V′, then V ∩ V′ = ∅, and

• for any i ∈ [K̃], we have ∪V:V∈Pi V = [h].

The subsets Pi are referred to as the parallel classes of V.

Each relay node Γi is thus connected to a set of users that is denoted by N (Γi ). A simple

counting argument shows that |N (Γi )| = Kr/h = K̃.

Suppose that r divides h and let V be the set of all subsets of size r of [h]. In this case, the

network defined above is the combination network Ji et al. (2015b) with K = $\binom{h}{r}$ users. The fact

that this network satisfies the resolvability property is not obvious and follows from a result of

Baranyai (1975).

Example 3. The combination network for the case of h = 4, r = 2 is shown in Fig. 2.1 and the

corresponding parallel classes are

P1 = {{1, 2}, {3, 4}},

P2 = {{1, 3}, {2, 4}}, and

P3 = {{1, 4}, {2, 3}}.

On the other hand, there are other networks where |V| is strictly smaller than $\binom{h}{r}$.

Example 4. Let h = 9, r = 3 and let V = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 4, 7}, {2, 5, 8}, {3, 6, 9}}.

In this case, the parallel classes are

P1 = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}, and

P2 = {{1, 4, 7}, {2, 5, 8}, {3, 6, 9}}.
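The two conditions of Definition 1 are easy to check mechanically; the following small sketch (ours) applies them to the parallel classes of Examples 3 and 4:

```python
def is_parallel_class(blocks, h):
    """Check Definition 1 for one class: the blocks are pairwise disjoint
    and their union is [h] = {1, ..., h}."""
    points = [p for b in blocks for p in b]
    return len(points) == len(set(points)) and set(points) == set(range(1, h + 1))

# Example 3: h = 4, r = 2.
assert is_parallel_class([{1, 2}, {3, 4}], 4)
assert is_parallel_class([{1, 3}, {2, 4}], 4)
assert is_parallel_class([{1, 4}, {2, 3}], 4)

# Example 4: h = 9, r = 3.
assert is_parallel_class([{1, 2, 3}, {4, 5, 6}, {7, 8, 9}], 9)
assert is_parallel_class([{1, 4, 7}, {2, 5, 8}, {3, 6, 9}], 9)

# A non-example: overlapping blocks do not form a parallel class.
assert not is_parallel_class([{1, 2}, {2, 3}], 3)
print("all parallel-class checks pass")
```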

We discuss generalizations of these examples at the end of Section 2.2.

The server S contains a library of N files where each file is of size F bits (we will interchangeably

refer to F as the subpacketization level). The files are represented by random variables Wi , i =

1, . . . , N , where Wi is distributed uniformly over the set [2F ]. Each user has a cache of size M F

bits. There are two distinct phases in the coded caching problem. In the placement phase, the

content of the user’s caches is populated. This phase should not depend on the file requests of the

users. In the delivery phase, user UV requests a file denoted WdV from the library; the set of all

user requests is denoted D = {W_{dV} : UV is a user}. It can be observed that there are a total of N^K

distinct request sets. The server responds by transmitting a certain number of bits that satisfies

the demands of all the users. A (M, R1 , R2 ) caching system also requires the specification of the

following encoding and decoding functions.

• K caching functions: ZV = φV (W1 , . . . , WN ) which represents the cache content of user UV .

Here, φV : [2^{NF}] → [2^{MF}].



[Figure: source S at the top, connected to relays Γ1, Γ2, Γ3, Γ4; below them, users U12, U13, U14, U23, U24, U34 with caches Z1, Z2, Z3, Z3, Z2, Z1 respectively.]

Figure 2.1 The figure shows a $\binom{4}{2}$ combination network. It also shows the cache placement when M = 2, N = 6. Here, Z1 = ∪_{n=1}^{6} {W^1_{n,1}, W^2_{n,1}}, Z2 = ∪_{n=1}^{6} {W^1_{n,2}, W^2_{n,2}} and Z3 = ∪_{n=1}^{6} {W^1_{n,3}, W^2_{n,3}}. It can be observed that each relay node sees the same caching pattern in the users that it is connected to, i.e., the users connected to each Γi together have Z1, Z2 and Z3 represented in their caches.

• hN^K server-to-relay encoding functions: The signal ψ_{S→Γi,D}(W1, ..., WN) is the encoding function for the edge from S to relay node Γi. Here, ψ_{S→Γi,D} : [2^{NF}] → [2^{R1 F}], so that the rate of transmission on server-to-relay edges is at most R1. The signal on the edge is denoted X_{S→Γi}.

• hK̃N^K relay-to-user encoding functions: Let UV ∈ N(Γi). The signal ϕ_{Γi→UV,D}(ψ_{S→Γi,D}(W1, ..., WN)) is the encoding function for the edge Γi → UV. Here, ϕ_{Γi→UV,D} : [2^{R1 F}] → [2^{R2 F}], so that the rate of transmission on the relay-to-user edges is at most R2. The signal on the corresponding edge is denoted X_{Γi→UV}, which is assumed to be defined only if UV ∈ N(Γi).

• KN^K decoding functions: Every user has a decoding function for a specific request set D, denoted by µ_{D,UV}(X_{Γ_{V[1]}→UV}, ..., X_{Γ_{V[r]}→UV}, ZV). Here, µ_{D,UV} : [2^{R2 F}] × ··· × [2^{R2 F}] × [2^{MF}] → [2^F]. The decoded file is denoted by Ŵ_{D,UV}.

For this coded caching system, the probability of error Pe is defined as Pe = max_D max_V P(Ŵ_{D,UV} ≠ W_{dV}).

The triplet (M, R1, R2) is said to be achievable if for every ε > 0 and every large enough file size F there exists an (M, R1, R2) caching scheme with probability of error less than ε. The subpacketization level F of a scheme is also an important metric, because it is directly connected

to the implementation complexity of the scheme. For example, the original scheme of Maddah-Ali

and Niesen (2014b), which operates when there is a shared link between the server and the users, requires a subpacketization level F ≈ $\binom{K}{KM/N}$, which grows exponentially with K. Thus, the

scheme of Maddah-Ali and Niesen (2014b) is applicable when the files are very large. In general,

lower subpacketization levels for a given rate are preferable.

Prior work has presented two coded caching schemes for combination networks. In the routing

scheme, coding is not considered. Each user simply caches a M/N fraction of each file. In the

delivery phase, the total number of bits that need to be transmitted is K(1 − M/N)F. As there are h outgoing edges from S, we have that R1 = (K/h)(1 − M/N). Moreover, as there are r incoming edges into each user, R2 = (1/r)(1 − M/N). An alternate scheme, called the CM-CNC scheme, was presented
N ). An alternate scheme, called the CM-CNC scheme was presented

in Ji et al. (2015b) for combination networks. This scheme uses a decentralized cache placement

phase, where each user randomly caches a M/N fraction of each file. In the delivery phase, the

server executes the CM step where it encodes all requested files by a decentralized multicasting

caching scheme. Following this, in the CNC step, the server divides each coded signal into r equal-

size signals and encodes these r signals by a (h, r) binary MDS code. The h coded signals are

transmitted over the h links from the server to relay nodes. The relay nodes forward the signals

to the users. Thus, each user receives r coded signals and can recover its demand due to MDS

property.

2.1.2 Main contributions

• Our schemes work for any network that satisfies the resolvability property. For a class of

combination networks, we demonstrate that our achievable rates are strictly better than

those proposed in prior work.



• The subpacketization level of our scheme is also significantly lower than competing methods.

As discussed in Section 2.1, the subpacketization level of a given scheme directly correlates

with the complexity of implementation.

2.2 Proposed Caching Scheme

Consider a network where V satisfies the resolvability property and let the parallel classes be P1, ..., P_K̃. It is evident that each user belongs to exactly one parallel class. Let ∆(V) indicate the parallel class that a user belongs to, i.e., ∆(V) = j if user UV belongs to Pj. Now, recall

that N (Γi ) is the set of users UV such that i ∈ V. By the resolvability property it has to be the

case that each user in N (Γi ) belongs to a different parallel class. In fact, it can be observed that

{∆(V) : UV ∈ N (Γi )} = [K̃], for all i. (2.1)

This implies that each relay node “sees” exactly the same set of parallel classes represented in the

users that it is connected to. This observation inspires our placement phase in the caching scheme.

We populate the user caches based on the parallel class that a given user belongs to. Loosely

speaking, it turns out that we can design a symmetric uncoded placement such that the overall

cache content seen by every relay node is the same.

Our proposed placement and delivery phase schemes are formally specified in Algorithm 1 and are discussed below. Assume each user has a storage capacity of M ∈ {0, N/K̃, 2N/K̃, ..., N} files, and let t = K̃M/N = KrM/(hN). The users can be partitioned into K̃ groups Gi, where Gi = {UV : V ∈ Pi}.

In the placement phase, each file Wn is split into $r\binom{K̃}{t}$ non-overlapping subfiles of equal size that are labeled as

Wn = (W^l_{n,T} : T ⊂ [K̃], |T| = t, l ∈ [r]).

Thus, the subpacketization mechanism is such that each subfile has a superscript in [r] in addition to the subset-based subscript that was introduced in the work of Maddah-Ali and Niesen (2014b). Subfile W^l_{n,T} is placed in the cache of the users in Gi if i ∈ T. Equivalently, W^l_{n,T} is stored in user UV if ∆(V) ∈ T. Thus, each user caches a total of $Nr\binom{K̃-1}{t-1}$ subfiles, and each subfile has size

Algorithm 1: Coded Caching in Networks Satisfying the Resolvability Property

1 1. procedure: PLACEMENT PHASE ;
2 for n = 1 to N do
3 Partition Wn into (W^l_{n,T} : T ⊂ [K̃], |T| = t, l ∈ [r])
4 end
5 for V ∈ V do
6 UV caches W^l_{n,T} if ∆(V) ∈ T, for l ∈ [r] and n ∈ [N].
7 end
8 2. procedure: DELIVERY PHASE ;
9 Θ ← {C : C ⊂ [K̃], |C| = t + 1};
10 for i = 1 to h do
11 Source sends {⊕_{V:∆(V)∈C} W^{Inv−V[i]}_{dV, C\{∆(V)}} : i ∈ V, C ∈ Θ} to Γi ;
12 for each user UV ∈ N(Γi) do
13 Γi forwards {⊕_{V′:∆(V′)∈C} W^{Inv−V′[i]}_{dV′, C\{∆(V′)}} : C ∈ Θ, ∆(V) ∈ C} to UV
14 end
15 end

$F/\big(r\binom{K̃}{t}\big)$. This requires

$$Nr\binom{K̃-1}{t-1} \cdot \frac{F}{r\binom{K̃}{t}} = F\,\frac{Nt}{K̃} = MF$$

bits, demonstrating that our scheme uses only MF bits of cache memory at each user.

Example 5. Consider the combination network in Example 3 with M = 2, N = 6 and K = 6. In the placement phase, each file is partitioned into six subfiles W^l_{n,i}, i = 1, 2, 3, l = 1, 2. The cache placement is as follows.

G1 = {U12, U34} cache W^1_{n,1}, W^2_{n,1};
G2 = {U13, U24} cache W^1_{n,2}, W^2_{n,2}; and
G3 = {U14, U23} cache W^1_{n,3}, W^2_{n,3}.
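A small sketch of the placement rule of Algorithm 1 for these parameters (the helper names are ours) confirms that each user's cache budget is exactly met:

```python
from itertools import combinations

# Parameters of Example 5: h = 4, r = 2, K = 6, N = 6, M = 2,
# so K_tilde = Kr/h = 3 and t = K_tilde * M / N = 1.
K_tilde, r, N, M = 3, 2, 6, 2
t = K_tilde * M // N

# Delta(V): the parallel class each user belongs to (Example 3).
parallel_class = {(1, 2): 1, (3, 4): 1, (1, 3): 2, (2, 4): 2, (1, 4): 3, (2, 3): 3}

def cached_subfiles(V):
    """Subfiles (n, T, l) stored by user U_V: those with Delta(V) in T."""
    return [(n, T, l)
            for n in range(1, N + 1)
            for T in combinations(range(1, K_tilde + 1), t)
            for l in range(1, r + 1)
            if parallel_class[V] in T]

# Each user stores N * r * C(K_tilde - 1, t - 1) = 12 subfiles; each subfile
# occupies F / (r * C(K_tilde, t)) = F/6 of a file, so 12 * F/6 = 2F = M*F bits.
for V in parallel_class:
    assert len(cached_subfiles(V)) == 12
print("every user caches exactly M*F bits")
```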

Note that by eq. (2.1), we have that each relay node is connected to a user from each parallel

class. Our placement scheme depends on the parallel class that a user belongs to. Thus, it ensures

that the overall distribution of the cache content seen by each relay node is the same. This can

be seen in Fig. 2.1 for the example considered above. We note here that the routing scheme (cf.

Section 2.1) is also applicable for this placement.

Now, we briefly outline the main idea of our achievable scheme. Our file parts are of the form W^j_{n,T}, where for a given T, j ∈ [r]. Note that each user is also connected to r different relay nodes in

H. Our proposed scheme is such that each user recovers a missing file part with a certain superscript

from one of the relay nodes it is connected to. In particular, we convey enough information from the

server to the relay nodes such that each relay node uses the scheme proposed in Maddah-Ali and

Niesen (2014b) for one set of superscripts. Crucially, the symmetrization afforded by the placement

scheme, allows each relay node to operate in this manner.

Theorem 1. Consider a network satisfying the resolvability property with h relay nodes and K users such that each user is connected to r relay nodes. Suppose that there are N files in the server and each user has a cache of size M ∈ {0, Nh/(Kr), 2Nh/(Kr), ..., N}. Then, the following rate pair (R1, R2) is achievable.

$$R_1 = \min\left\{ \frac{K(1 - M/N)}{h\left(1 + \frac{KrM}{hN}\right)},\ \frac{N}{r}\left(1 - \frac{M}{N}\right) \right\}, \qquad (2.2)$$

$$R_2 = \frac{1 - M/N}{r}.$$

For general 0 ≤ M ≤ N , the lower convex envelope of these points is achievable.

Proof. Let the set of user requests be denoted D = {WdV : UV is a user}. For each relay node

Γi , we focus on the users connected to it N (Γi ) and a subset C ⊂ {∆(V) : UV ∈ N (Γi )} where

|C| = t + 1. For each subset C, the server transmits

⊕_{V:∆(V)∈C} W^{Inv−V[i]}_{dV, C\{∆(V)}}   (2.3)

to the relay node Γi (⊕ denotes bitwise XOR) and Γi forwards it to the users UV with ∆(V) ∈ C.

We now argue that each user can recover its requested file. Evidently, a user UV is missing

subfiles of the form W^j_{dV,T} where ∆(V) ∉ T. If UV is connected to Γi, it can recover the following
/ T . If UV is connected to Γi , it can recover the following

set of subfiles using the transmissions from Γi .

{W^{Inv−V[i]}_{dV,T} : T ⊂ [K̃] \ {∆(V)}, |T| = t}.

This is because the transmission in eq. (2.3) is such that UV caches all subfiles that are involved

in the XOR except the one that it is interested in. This implies it can decode its missing subfile. In

addition, UV is also connected to r relay nodes so that ∪i∈V {Inv − V[i]} = [r], i.e., it can recover

all its missing subfiles.

Next, we determine R1 and R2. Each of the coded subfiles results in $F/\big(r\binom{K̃}{t}\big)$ bits being sent over the link from source S to Γi. Since the number of subsets C is $\binom{K̃}{t+1}$, the total number of bits sent from S to Γi is

$$\binom{K̃}{t+1}\frac{F}{r\binom{K̃}{t}} = F\,\frac{K̃(1 - M/N)}{r(1 + K̃M/N)},$$

and hence R1 = K̃(1 − M/N)/(r(1 + K̃M/N)). Next, note that each coded subfile is forwarded to |C| = t + 1 users. Thus, each user receives $(t+1)\binom{K̃}{t+1}/K̃$ coded subfiles, so that the total number of bits sent from a relay node to a user is

$$\frac{(t+1)\binom{K̃}{t+1}}{K̃} \times \frac{F}{r\binom{K̃}{t}} = F\,\frac{K̃ - t}{rK̃} = F\,\frac{1 - M/N}{r}.$$

Hence R2 = (1 − M/N)/r.

Thus, the triplet (M, R1, R2) = (M, K̃(1 − M/N)/(r(1 + K̃M/N)), (1 − M/N)/r) is achievable for M ∈ {0, Nh/(Kr), 2Nh/(Kr), ..., N}. Points for general values of M can be obtained by memory sharing between triplets of this form.

If K̃ > N, it is clear that some users connected to a given relay node Γi request the same file. In this case, the routing scheme can attain R1 = (N/r)(1 − M/N), which is better than the proposed scheme if M ≤ 1 − N/K̃. This explains the second term within the minimum on the RHS of eq. (2.2).
(2.2).

We illustrate our achievable scheme by considering the setup in Example 5.

Example 6. Assume that user UV , V ⊂ {1, . . . , 4}, |V| = 2, requires file WdV . The users connected

to Γ1 correspond to subsets {1, 2}, {1, 3} and {1, 4} so that Inv − V[1] = 1 for all of them. Thus,

the users recover missing subfiles with superscript of 1 from Γ1 . In particular, the transmissions

are as follows.

S → Γ1 : W^1_{d12,2} ⊕ W^1_{d13,1}, W^1_{d12,3} ⊕ W^1_{d14,1}, W^1_{d13,3} ⊕ W^1_{d14,2},
Γ1 → U12 : W^1_{d12,2} ⊕ W^1_{d13,1}, W^1_{d12,3} ⊕ W^1_{d14,1},
Γ1 → U13 : W^1_{d12,2} ⊕ W^1_{d13,1}, W^1_{d13,3} ⊕ W^1_{d14,2}, and
Γ1 → U14 : W^1_{d12,3} ⊕ W^1_{d14,1}, W^1_{d13,3} ⊕ W^1_{d14,2}.



The users connected to Γ2 correspond to subsets {1, 2}, {2, 3} and {2, 4} in which case Inv −

{1, 2}[2] = 2 while Inv − {2, 3}[2] = 1, Inv − {2, 4}[2] = 1. Thus, user U12 recovers missing subfiles

with superscript 2 from Γ2 while users U23 and U24 recover missing subfiles with superscript 1. The

specific transmissions are given below.

S → Γ2 : W^1_{d23,1} ⊕ W^2_{d12,3}, W^2_{d12,2} ⊕ W^1_{d24,1}, W^1_{d23,2} ⊕ W^1_{d24,3},
Γ2 → U12 : W^1_{d23,1} ⊕ W^2_{d12,3}, W^2_{d12,2} ⊕ W^1_{d24,1},
Γ2 → U23 : W^1_{d23,1} ⊕ W^2_{d12,3}, W^1_{d23,2} ⊕ W^1_{d24,3}, and
Γ2 → U24 : W^2_{d12,2} ⊕ W^1_{d24,1}, W^1_{d23,2} ⊕ W^1_{d24,3}.

In a similar manner, the other transmissions can be determined and it can be verified that the

demands of the users are satisfied and R1 = 1/2, R2 = 1/3.
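As a sanity check, evaluating the rate expressions of Theorem 1 with exact arithmetic reproduces the values in this example (a sketch; the function name is ours):

```python
from fractions import Fraction

def theorem1_rates(K_tilde, r, M, N):
    """Coded-delivery branch of eq. (2.2): R1 = K~(1 - M/N)/(r(1 + K~M/N)),
    and R2 = (1 - M/N)/r, computed exactly with fractions."""
    mu = Fraction(M, N)
    R1 = K_tilde * (1 - mu) / (r * (1 + K_tilde * mu))
    R2 = (1 - mu) / r
    return R1, R2

# Example 6 parameters: K_tilde = 3, r = 2, M = 2, N = 6.
R1, R2 = theorem1_rates(K_tilde=3, r=2, M=2, N=6)
print(R1, R2)  # 1/2 1/3, matching the rates verified above
```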

It is important to note that the resolvability property is key to our proposed scheme. For

example, if r does not divide h, the combination network does not have the resolvability property.

In this case, it can be shown that a symmetric uncoded placement is impossible. We demonstrate

this by means of the example below.

Example 7. Consider the combination network with h = 3, r = 2, and V = {{1, 2}, {1, 3}, {2, 3}}.

Here 2 does not divide 3 and it is easy to check that it does not satisfy the resolvability property.

Next, we argue that a symmetric uncoded placement is impossible by contradiction. Assume

there exists a symmetric uncoded placement and suppose U12 caches Z1 , U13 caches Z2 . By the

hypothesis, Γ1 and Γ2 have to see the same cache content. Since N (Γ1 ) = {U12 , U13 } and N (Γ2 ) =

{U12 , U23 }, U23 has to cache Z2 . As a result, since N (Γ3 ) = {U13 , U23 }, Γ3 sees Z2 and Z2 , which

are different from the cache content seen by Γ1 and Γ2 . This is a contradiction.

We emphasize that a large class of networks satisfy the resolvability property. For instance, if r divides h, Baranyai (1975) shows that the set of all $\binom{h}{r}$ r-subsets of an h-set can be partitioned into disjoint parallel classes Pi, i = 1, 2, ..., $\binom{h-1}{r-1}$. More generally, one can consider resolvable designs Stinson (2003), which are set systems that satisfy the resolvability property. Such designs include

[Figure: R1 and R2 versus cache size M for the proposed, CM-CNC and routing schemes.]

Figure 2.2 Performance comparison of the different schemes for a $\binom{6}{2}$ combination network with K = 15, K̃ = 5 and N = 50.

affine planes, which correspond to networks where for prime q we have h = q^2 and r = q; the set V is given by the specification of the affine plane. Furthermore, one can obtain resolvable designs from affine geometry over F_q that correspond to networks with h = q^m and r = q^d.
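For the special case r = 2 (and h even), such a partition can be constructed explicitly by the classical round-robin method; the sketch below (the function name is ours) recovers the parallel classes of Example 3 when h = 4:

```python
def round_robin_parallel_classes(h):
    """Partition all 2-subsets of {1, ..., h} (h even) into h - 1 parallel
    classes: fix point 1 and rotate the remaining points between classes."""
    classes = []
    rot = list(range(2, h + 1))
    for _ in range(h - 1):
        pairs = [{1, rot[0]}]
        pairs += [{rot[i], rot[-i]} for i in range(1, h // 2)]
        classes.append(pairs)
        rot = rot[1:] + rot[:1]   # rotate for the next parallel class
    return classes

# For h = 4 this yields {{1,2},{3,4}}, {{1,3},{2,4}}, {{1,4},{2,3}}.
print(round_robin_parallel_classes(4))
```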

2.3 Performance Analysis

We now compare the performance of our proposed scheme with the CM-CNC scheme Ji et al.

(2015b) and the routing scheme. For a given value of M we compare the achievable R1 , R2 pairs

of the different schemes. Furthermore, we also compare the required subpacketization levels of the

different schemes, as it directly impacts the complexity of implementation of a given scheme. Table

2.1 summarizes the comparison. We note here that the rate of the CM-CNC scheme is derived in

Ji et al. (2015b) for a decentralized placement. The rate in Table 2.1 corresponds to a derivation

of the corresponding rate for a centralized placement; it is lower than the one for the decentralized

placement.

Table 2.1 Comparison of three schemes

        Routing                  CM-CNC                           New Scheme
F       $r\binom{K̃}{K̃M/N}$      $r\binom{K}{KM/N}$               $r\binom{K̃}{K̃M/N}$
R1      (K/h)(1 − M/N)           K(1 − M/N)/(r(1 + KM/N))         K̃(1 − M/N)/(r(1 + K̃M/N))
R2      (1/r)(1 − M/N)           K(1 − M/N)/(r(1 + KM/N))         (1/r)(1 − M/N)

The following conclusions can be drawn. Let (R1*, R2*) and F* denote the rates and subpacketization level of our proposed scheme. Then,

$$\frac{R_1^*}{R_1^{CM-CNC}} = \frac{\frac{1}{K} + \frac{M}{N}}{\frac{1}{K̃} + \frac{M}{N}} < 1, \quad \text{and}$$

$$\frac{R_2^*}{R_2^{CM-CNC}} = \frac{1}{K} + \frac{M}{N} \leq \frac{N - \frac{N}{K̃}}{N} + \frac{1}{K} < 1.$$

This implies that our scheme is better in both rate metrics. Next,

$$\frac{F^*}{F^{CM-CNC}} \approx \exp\Big(-K\Big(1 - \frac{r}{h}\Big)H_e\Big(\frac{M}{N}\Big)\Big),$$

where He(·) represents the binary entropy function in nats. Thus, the subpacketization level of our scheme is exponentially smaller than that of the scheme of Ji et al. (2015b).

For a $\binom{6}{2}$ combination network with parameters K = 15, K̃ = 5, N = 50, we plot the performance of the different schemes in Fig. 2.2, which compares R1 and R2 of the three schemes. It can be observed that for R1, the proposed scheme is the best for all cache sizes M. At the same time, we can see that the R2 of the routing scheme and the proposed scheme are identical but significantly better than that of the CM-CNC scheme.

2.4 Conclusions

In this work, we proposed a coded caching scheme for networks that satisfy the resolvability property. This family of networks includes a class of combination networks as a special case. The rate required by our scheme for transmission over the server-to-relay edges and over the relay-to-user edges is strictly lower than that of prior work. In addition, the subpacketization

level of our scheme is also significantly lower than prior work. The generalization to networks that

do not satisfy the resolvability property and to networks with arbitrary topologies is an interesting

direction for future work.



CHAPTER 3. CODED CACHING SCHEMES WITH REDUCED


SUBPACKETIZATION FROM LINEAR BLOCK CODES

In this chapter, we examine an important aspect of the coded caching problem that is closely

tied to its adoption in practice. It is important to note that the huge gains of coded caching require each file to be partitioned into Fs ≈ $\binom{K}{KM/N}$ non-overlapping subfiles of equal size; Fs is referred to as the subpacketization level. It can be observed that for a fixed cache fraction M/N, Fs grows exponentially

with K. This can be problematic in practical implementations. For instance, suppose that K = 64, with M/N = 0.25, so that Fs = $\binom{64}{16}$ ≈ 4.8 × 10^14 with a rate R ≈ 2.82. In this case, it is evident

that at the bare minimum, the size of each file has to be at least 480 terabits for leveraging the

gains in Maddah-Ali and Niesen (2014b). It is even worse in practice. The atomic unit of storage

on present day hard drives is a sector of size 512 bytes and the trend in the disk drive industry is

to move this to 4096 bytes Fitzpatrick (2011). As a result, the minimum size of each file needs to

be much higher than 480 terabits. Therefore, the scheme in Maddah-Ali and Niesen (2014b) is not

practical even for moderate values of K. Furthermore, even for smaller values of K, schemes with

low subpacketization levels are desirable. This is because any practical scheme will require each

of the subfiles to have some header information that allows for decoding at the end users. When

there are a large number of subfiles, the header overhead may be non-negligible. For these same

parameters (K = 64, M/N = 0.25) our proposed approach in this work allows us to obtain, e.g., the

following operating points: (i) Fs ≈ 1.07 × 10^9 and R = 3, (ii) Fs ≈ 1.6 × 10^4 and R = 6, (iii)

Fs = 64 and R = 12. For the first point, it is evident that the subpacketization level drops by

over five orders of magnitude with only a very small increase in the rate. Points (ii) and (iii) show that the proposed scheme allows us to operate at various points on the trade-off between subpacketization

level and rate.
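These baseline numbers are easy to reproduce; the short sketch below (variable names are ours) recomputes the subpacketization and rate of the original scheme for these parameters:

```python
from math import comb

K = 64
t = K // 4          # t = K * M/N with M/N = 0.25
Fs = comb(K, t)     # subpacketization of the Maddah-Ali--Niesen scheme
print(f"Fs = C(64, 16) = {Fs:.2e}")  # on the order of 10^14, as stated above

# Rate of the original scheme: R = K(1 - M/N) / (1 + K M/N) = 48/17.
R = K * (1 - 0.25) / (1 + K * 0.25)
print(f"R = {R:.2f}")  # approximately 2.82, matching the text
# At >= 1 bit per subfile, each file must be hundreds of terabits long
# before this subpacketization level can even be realized.
```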



The issue of subpacketization was first considered in the work of Shanmugam et al. (2014, 2016)

in the decentralized coded caching setting. In the centralized case it was considered in the work

of Yan et al. (2017a). They proposed a low subpacketization scheme based on placement delivery

arrays. Reference Shangguan et al. (2018) viewed the problem from a hypergraph perspective

and presented several classes of coded caching schemes. The work of Shanmugam et al. (2017)

has recently shown that there exist coded caching schemes where the subpacketization level grows

linearly with the number of users K; however, this result only applies when the number of users is

very large. We elaborate on related work in Section 3.1.1.

In this work, we propose low subpacketization level schemes for coded caching. Our proposed

schemes leverage the properties of combinatorial structures known as resolvable designs and their

natural relationship with linear block codes. Our schemes are applicable for a wide variety of

parameter ranges and allow the system designer to tune the subpacketization level and the gain

of the system with respect to an uncoded system. We note here that designs have also been used

to obtain results in distributed data storage Olmez and Ramamoorthy (2016) and network coding

based function computation in recent work Tripathy and Ramamoorthy (2015, 2017).

This chapter is organized as follows. Section 3.1 discusses the background and related work

and summarizes the main contributions of our work. Section 3.2 outlines our proposed scheme. It

includes all the constructions and the essential proofs. A central object of study in our work is the class of matrices that satisfy a property that we call the consecutive column property (CCP). Section 3.3

overviews several constructions of matrices that satisfy this property. Several of the longer and

more involved proofs of statements in Sections 3.2 and 3.3 appear in the Appendix. In Section

3.4 we perform an in-depth comparison of our work with existing constructions in the literature. We

conclude the paper with a discussion of opportunities for future work in Section 3.5.

3.1 Background, Related Work and Summary of Contributions

We consider a scenario where the server has N files, each of which consists of Fs subfiles. There

are K users each equipped with a cache of size M Fs subfiles. The coded caching scheme is specified

by means of the placement scheme and an appropriate delivery scheme for each possible demand

pattern. In this work, we use combinatorial designs Stinson (2003) to specify the placement scheme

in the coded caching system.

Definition 2. A design is a pair (X, A) such that

1. X is a set of elements called points, and

2. A is a collection of nonempty subsets of X called blocks, where each block contains the same

number of points.

A design is in one-to-one correspondence with an incidence matrix N which is defined as follows.

Definition 3. The incidence matrix N of a design (X, A) is a binary matrix of dimension |X|×|A|,

where the rows and columns correspond to the points and blocks respectively. Let i ∈ X and j ∈ A.

Then,

$$N(i,j) = \begin{cases} 1 & \text{if } i \in j, \\ 0 & \text{otherwise.} \end{cases}$$

It can be observed that the transpose of an incidence matrix also specifies a design. We will

refer to this as the transposed design. In this work, we will utilize resolvable designs which are a

special class of designs.

Definition 4. A parallel class P in a design (X, A) is a subset of disjoint blocks from A whose

union is X. A partition of A into several parallel classes is called a resolution, and (X, A) is said

to be a resolvable design if A has at least one resolution.

For resolvable designs, it follows that each point also appears in the same number of blocks.

Example 8. Consider a block design specified as follows.

X = {1, 2, 3, 4}, and

A = {{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}}.

Its incidence matrix is given below.

$$N = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 \end{bmatrix}.$$

It can be observed that this design is resolvable with the following parallel classes.

P1 = {{1, 2}, {3, 4}},

P2 = {{1, 3}, {2, 4}}, and

P3 = {{1, 4}, {2, 3}}.
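As a quick check, the incidence matrix and the parallel classes of this design can be verified programmatically (a sketch; the variable names are ours):

```python
import numpy as np

# Design of Example 8: points {1, 2, 3, 4}, blocks = all 2-subsets.
points = [1, 2, 3, 4]
blocks = [{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}]
N_mat = np.array([[int(p in b) for b in blocks] for p in points])

# Each column has exactly 2 ones (the block size); each row has 3 ones
# (each point lies in 3 blocks), consistent with resolvability.
assert (N_mat.sum(axis=0) == 2).all() and (N_mat.sum(axis=1) == 3).all()

# The columns of a parallel class sum to the all-ones vector: every point
# is covered exactly once. Check P1 = {{1,2},{3,4}}.
P1 = [blocks.index({1, 2}), blocks.index({3, 4})]
assert (N_mat[:, P1].sum(axis=1) == 1).all()
print(N_mat)
```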

In the sequel we let [n] denote the set {1, . . . , n}. We emphasize here that the original scheme

of Maddah-Ali and Niesen (2014b) can be viewed as an instance of the trivial design. For example,

consider the setting when t = KM/N is an integer. Let X = [K] and A = {B : B ⊂ [K], |B| = t}.

In the scheme of Maddah-Ali and Niesen (2014b), the users are associated with X and the subfiles

with A. User i ∈ [K] caches subfile Wn,B , n ∈ [N ] for B ∈ A if i ∈ B. The main message of our

work is that carefully constructed resolvable designs can be used to obtain coded caching schemes

with low subpacketization levels, while retaining much of the rate gains of coded caching. The

basic idea is to associate the users with the blocks and the subfiles with the points of the design.

The roles of the users and subfiles can also be interchanged by simply working with the transposed

design.

Example 9. Consider the resolvable design from Example 8. The blocks in A correspond to six

users U12 , U34 , U13 , U24 , U14 , U23 . Each file is partitioned into Fs = 4 subfiles Wn,1 , Wn,2 , Wn,3 , Wn,4

which correspond to the four points in X. The cache in user UB , denoted ZB is specified as

Z_{ij} = (W_{n,i}, W_{n,j})_{n=1}^{N}. For example, Z_{12} = (W_{n,1}, W_{n,2})_{n=1}^{N}.

We note here that the caching scheme is symmetric with respect to the files in the server.

Furthermore, each user caches half of each file so that M/N = 1/2. Suppose that in the delivery

phase user UB requests file WdB where dB ∈ [N ]. These demands can be satisfied as follows. We

pick three blocks, one each from parallel classes P1 , P2 , P3 and generate the signals transmitted in

the delivery phase as follows.

Wd12 ,3 ⊕ Wd13 ,2 ⊕ Wd23 ,1 , (3.1)

Wd12 ,4 ⊕ Wd24 ,1 ⊕ Wd14 ,2 ,

Wd34 ,1 ⊕ Wd13 ,4 ⊕ Wd14 ,3 , and

Wd34 ,2 ⊕ Wd24 ,3 ⊕ Wd23 ,4 .

The three terms in eq. (3.1) above correspond to blocks from different parallel classes

{1, 2} ∈ P1 , {1, 3} ∈ P2 , {2, 3} ∈ P3 . This equation has the all-but-one structure that was also

exploited in Maddah-Ali and Niesen (2014b), i.e., eq. (3.1) is such that each user caches all but

one of the subfiles participating in the equation. Specifically, user U12 contains Wn,1 and Wn,2 for

all n ∈ [N ]. Thus, it can decode subfile Wd12 ,3 that it needs. A similar argument applies to users

U13 and U23 . It can be verified that the other three equations also have this property. Thus, at the

end of the delivery phase, each user obtains its missing subfiles.
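The all-but-one property of these four equations can also be checked mechanically; in the sketch below (ours), subfiles are tracked symbolically rather than as actual bit strings:

```python
# Users are labeled by their blocks; user U_B caches subfiles W_{d,i} for i in B.
users = [frozenset(b) for b in [{1, 2}, {3, 4}, {1, 3}, {2, 4}, {1, 4}, {2, 3}]]

# Each transmitted signal is an XOR over (requesting user, subfile index) pairs.
signals = [
    [({1, 2}, 3), ({1, 3}, 2), ({2, 3}, 1)],   # eq. (3.1)
    [({1, 2}, 4), ({2, 4}, 1), ({1, 4}, 2)],
    [({3, 4}, 1), ({1, 3}, 4), ({1, 4}, 3)],
    [({3, 4}, 2), ({2, 4}, 3), ({2, 3}, 4)],
]

recovered = {u: set() for u in users}
for sig in signals:
    for block, point in sig:
        u = frozenset(block)
        # All-but-one: U_block caches every other subfile in the signal,
        # so it can cancel them and decode `point`.
        others = [(frozenset(b), q) for b, q in sig if (frozenset(b), q) != (u, point)]
        assert all(q in u for _, q in others)
        recovered[u].add(point)

# Cached points plus recovered points give every user the whole file.
assert all(set(u) | recovered[u] == {1, 2, 3, 4} for u in users)
print("all six users recover their requested files")
```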

This scheme corresponds to a subpacketization level of 4 and a rate of 1. In contrast, the scheme

of Maddah-Ali and Niesen (2014b) would require a subpacketization level of $\binom{6}{3}$ = 20 with a rate of


0.75. Thus, it is evident that we gain significantly in terms of the subpacketization while sacrificing

some rate gains.

As shown in Example 9, we can obtain a scheme by associating the users with the blocks and

the subfiles with the points. In this work, we demonstrate that this basic idea can be significantly

generalized and several schemes with low subpacketization levels that continue to leverage much of

the rate benefits of coded caching can be obtained.

3.1.1 Discussion of Related Work

Coded caching has been the subject of much investigation in recent work as discussed briefly

earlier on. We now overview existing literature on the topic of low subpacketization schemes for

coded caching. In the original paper Maddah-Ali and Niesen (2014b), for given problem parameters

K (number of users) and M/N (cache fraction), the authors showed that when N ≥ K, the rate

equals

$$R = \frac{K(1 - M/N)}{1 + KM/N}$$

when M is an integer multiple of N/K. Other points are obtained via memory sharing. Thus,

in the regime when KM/N is large, the coded caching rate is approximately N/M − 1, which is independent of K. Crucially, though, this requires the subpacketization level $F_s \approx \binom{K}{KM/N}$. It can

be observed that for a fixed M/N, $F_s$ grows exponentially with K. This is one of the main drawbacks of the original scheme: deploying this solution in practice may be difficult.
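To make the growth concrete, the rate and subpacketization of the original scheme can be computed directly. The sketch below is ours (the function name is not from the literature) and assumes $t = KM/N$ is an integer:

```python
from math import comb

def mn_rate_and_subpacketization(K, cache_fraction):
    """Rate and subpacketization of the original centralized scheme
    when t = K*M/N is an integer."""
    t = round(K * cache_fraction)
    rate = K * (1 - cache_fraction) / (1 + t)
    return rate, comb(K, t)

# The K = 6, M/N = 1/2 system of Example 9: rate 0.75, F_s = C(6,3) = 20.
print(mn_rate_and_subpacketization(6, 0.5))
# For fixed M/N = 1/2 the subpacketization explodes with K:
# C(32,16) is already about 6.0e8.
print(mn_rate_and_subpacketization(32, 0.5)[1])
```

This exponential growth in $F_s$ is exactly what the low-subpacketization constructions of this chapter aim to avoid.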

The subpacketization issue was first discussed in the work of Shanmugam et al. (2014, 2016)

in the context of decentralized caching. Specifically, Shanmugam et al. (2016) showed that in the

decentralized setting for any subpacketization level Fs such that Fs ≤ exp(KM/N ) the rate would

scale linearly in K, i.e., R ≥ cK. Thus, much of the rate benefits of coded caching would be lost

if Fs did not scale exponentially in K. Following this work, the authors in Yan et al. (2017a)

introduced a technique for designing low subpacketization schemes in the centralized setting which

they called placement delivery arrays. In Yan et al. (2017a), they considered the setting when

M/N = 1/q or M/N = 1 − 1/q and demonstrated a scheme where the subpacketization level was

exponentially smaller than the original scheme, while the rate was marginally higher. This scheme

can be viewed as a special case of our work. We discuss these aspects in more detail in Section 3.4.

In Shangguan et al. (2018), the design of coded caching schemes was achieved through the design

of hypergraphs with appropriate properties. In particular, for specific problem parameters, they

were able to establish the existence of schemes where the subpacketization scaled as $\exp(c\sqrt{K})$.

Reference Yan et al. (2017b) presented results in this setting by considering strong edge coloring of

bipartite graphs.

Very recently, Shanmugam et al. (2017) showed the existence of coded caching schemes where

the subpacketization grows linearly with the number of users, but the coded caching rate grows as

O(K δ ) where 0 < δ < 1. Thus, while the rate is not a constant, it does not grow linearly with

K either. Both Shangguan et al. (2018) and Shanmugam et al. (2017) are interesting results that

demonstrate the existence of regimes where the subpacketization scales in a manageable manner.

Nevertheless, it is to be noted that these results come with several caveats. For example, the result

of Shanmugam et al. (2017) is only valid in the regime when K is very large and is unlikely to be

of use for practical values of K. The result of Shangguan et al. (2018) has significant restrictions

on the number of users, e.g., in their paper, K needs to be of the form $\binom{n}{a}$ and $q^t\binom{n}{a}$.
 

3.1.2 Summary of Contributions

In this work, the subpacketization levels we obtain are typically exponentially smaller than the

original scheme. However, they still continue to scale exponentially in K, albeit with much smaller

exponents. However, our construction has the advantage of being applicable for a large range of

problem parameters. Our specific contributions include the following.

• We uncover a simple and natural relationship between a (n, k) linear block code and a coded

caching scheme. We first show that any linear block code over GF (q) and in some cases

Z mod q (where q is not a prime or a prime power) generates a resolvable design. This

design in turn specifies a coded caching scheme with K = nq users where the cache fraction

M/N = 1/q. A complementary cache fraction point where M/N = 1 − α/(nq), where α is some

integer between 1 and k + 1 can also be obtained. Intermediate points can be obtained by

memory sharing between these points.

• We consider a class of (n, k) linear block codes whose generator matrices satisfy a specific rank

property. In particular, we require collections of consecutive columns to have certain rank

properties. For such codes, we are able to identify an efficient delivery phase and determine

the precise coded caching rate. We demonstrate that the subpacketization level is at most

$q^k(k+1)$, whereas the coded caching gain scales as k + 1 with respect to an uncoded caching

scheme. Thus, different choices of k allow the system designer significant flexibility to choose

the appropriate operating point.

• We discuss several constructions of generator matrices that satisfy the required rank property.

We characterize the ranges of alphabet sizes (q) over which these matrices can be constructed.

If one has a given subpacketization budget in a specific setting, we are able to find a set of

schemes that fit the budget while leveraging the rate gains of coded caching.

3.2 Proposed low subpacketization level scheme

All our constructions of low subpacketization schemes will stem from resolvable designs (cf.

Definition 4). Our overall approach is to first show that any (n, k) linear block code over GF (q)

can be used to obtain a resolvable block design. The placement scheme obtained from this resolvable

design is such that M/N = 1/q. Under certain (mild) conditions on the generator matrix we show

that a delivery phase scheme can be designed that allows for a significant rate gain over the uncoded

scheme while having a subpacketization level that is significantly lower than Maddah-Ali and Niesen

(2014b). Furthermore, our scheme can be transformed into another scheme that operates at the

point $M/N = 1 - \frac{k+1}{nq}$. Thus, intermediate values of M/N can be obtained via memory sharing. We

also discuss situations under which we can operate over modular arithmetic Zq = Z mod q where

q is not necessarily a prime or a prime power; this allows us to obtain a larger range of parameters.

3.2.1 Resolvable Design Construction

Consider a (n, k) linear block code over GF (q). To avoid trivialities we assume that its generator

matrix does not have an all-zeros column. We collect its q k codewords and construct a matrix T

of size n × q k as follows.

$$\mathbf{T} = [\mathbf{c}_0^T, \mathbf{c}_1^T, \cdots, \mathbf{c}_{q^k-1}^T], \qquad (3.2)$$

where the $1 \times n$ vector $\mathbf{c}_\ell$ represents the $\ell$-th codeword of the code. Let $X = \{0, 1, \cdots, q^k - 1\}$ be

the point set and A be the collection of all subsets Bi,l for 0 ≤ i ≤ n − 1 and 0 ≤ l ≤ q − 1, where

Bi,l = {j : Ti,j = l}.

Using this construction, we can obtain the following result.



Lemma 1. The construction procedure above results in a design (X, A) where $X = \{0, 1, \cdots, q^k - 1\}$ and $|B_{i,l}| = q^{k-1}$ for all 0 ≤ i ≤ n − 1 and 0 ≤ l ≤ q − 1. Furthermore, the design is resolvable with parallel classes given by $P_i = \{B_{i,l} : 0 \le l \le q - 1\}$, for 0 ≤ i ≤ n − 1.

Proof. Let $G = [g_{ab}]$, for 0 ≤ a ≤ k − 1, 0 ≤ b ≤ n − 1, $g_{ab} \in GF(q)$. Note that for $\Delta = [\Delta_0\, \Delta_1 \ldots \Delta_{n-1}] = \mathbf{u}G$, we have

$$\Delta_b = \sum_{a=0}^{k-1} u_a g_{ab},$$

where $\mathbf{u} = [u_0, \cdots, u_{k-1}]$. Let $a^*$ be such that $g_{a^*b} \neq 0$. Consider the equation

$$\sum_{a \neq a^*} u_a g_{ab} = \Delta_b - u_{a^*} g_{a^*b},$$

where $\Delta_b$ is fixed. For arbitrary values of $u_a$, $a \neq a^*$, this equation has a unique solution for $u_{a^*}$, which implies that for any $\Delta_b$, $|B_{b,\Delta_b}| = q^{k-1}$ and that $P_b$ forms a parallel class.

Remark 1. A k × n generator matrix over GF (q) where q is a prime power can also be considered

as a matrix over an extension field GF (q m ) where m is an integer. Thus, one can obtain a resolvable

design in this case as well; the corresponding parameters can be calculated in an easy manner.

Remark 2. We can also consider linear block codes over Z mod q where q is not necessarily a

prime or a prime power. In this case the conditions under which a resolvable design can be obtained

by forming the matrix T are a little more involved. We discuss this in Lemma 5 in the Appendix.

Example 10. Consider a (4, 2) linear block code over GF(3) with generator matrix

$$G = \begin{bmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 2 \end{bmatrix}.$$

Collecting the nine codewords, T is constructed as follows.

$$\mathbf{T} = \begin{bmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\
0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\
0 & 1 & 2 & 1 & 2 & 0 & 2 & 0 & 1 \\
0 & 2 & 1 & 1 & 0 & 2 & 2 & 1 & 0
\end{bmatrix}.$$

Using T, we generate the resolvable block design (X, A) where the point set is X = {0, 1, 2, 3, 4, 5, 6, 7, 8}.

For instance, block B0,0 is obtained by identifying the column indexes of zeros in the first row of

T, i.e., B0,0 = {0, 1, 2}. Following this, we obtain

A ={{0, 1, 2}, {3, 4, 5}, {6, 7, 8}, {0, 3, 6}, {1, 4, 7}, {2, 5, 8},

{0, 5, 7}, {1, 3, 8}, {2, 4, 6}, {0, 4, 8}, {2, 3, 7}, {1, 5, 6}}.

It can be observed that A has a resolution (cf. Definition 4) with the following parallel classes.

P0 = {{0, 1, 2}, {3, 4, 5}, {6, 7, 8}},

P1 = {{0, 3, 6}, {1, 4, 7}, {2, 5, 8}},

P2 = {{0, 5, 7}, {1, 3, 8}, {2, 4, 6}}, and

P3 = {{0, 4, 8}, {2, 3, 7}, {1, 5, 6}}.
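The construction of Section 3.2.1 is mechanical enough to script. The sketch below (the helper name is ours) builds the codeword matrix T for the code of Example 10 and reads off the parallel classes:

```python
from itertools import product

def resolvable_design(G, q):
    """Section 3.2.1 construction: the columns of T are the q^k codewords
    uG (arithmetic mod q); block B_{i,l} collects the codeword indices j
    with T[i][j] = l."""
    k, n = len(G), len(G[0])
    codewords = [[sum(u[a] * G[a][b] for a in range(k)) % q for b in range(n)]
                 for u in product(range(q), repeat=k)]
    return [[frozenset(j for j, c in enumerate(codewords) if c[i] == l)
             for l in range(q)]
            for i in range(n)]          # entry i is the parallel class P_i

# The (4, 2) code over GF(3) of Example 10:
P = resolvable_design([[1, 0, 1, 1], [0, 1, 1, 2]], 3)
print(P[0])   # parallel class P_0: blocks {0,1,2}, {3,4,5}, {6,7,8}
```

The output reproduces the parallel classes P0 through P3 listed above, with each block of size $q^{k-1} = 3$ as Lemma 1 guarantees.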

3.2.2 A special class of linear block codes

We now introduce a special class of linear block codes whose generator matrices satisfy specific

rank properties. It turns out that resolvable designs obtained from these codes are especially suited

for usage in coded caching.

Consider the generator matrix G of a (n, k) linear block code over GF (q). The i-th column

of G is denoted by gi . Let z be the least positive integer such that k + 1 divides nz (denoted by

k + 1 | nz). We let (t)n denote t mod n.

In our construction we will need to consider various collections of k +1 consecutive columns of G

(wraparounds over the boundaries are allowed). For this purpose, let Ta = {a(k+1), · · · , a(k+1)+k}

(a is a non-negative integer) and Sa = {(t)n | t ∈ Ta }. Let GSa be the k × (k + 1) submatrix of

G specified by the columns in Sa , i.e., g` is a column in GSa if ` ∈ Sa . Next, we define the

(k, k + 1)-consecutive column property that is central to the rest of the discussion.

Definition 5. (k, k + 1)-consecutive column property. Consider the submatrices of G specified by $G_{S_a}$ for $0 \le a \le \frac{zn}{k+1} - 1$. We say that G satisfies the (k, k + 1)-consecutive column property if all k × k submatrices of each $G_{S_a}$ are full rank.



Henceforth, we abbreviate the (k, k + 1)-consecutive column property as (k, k + 1)-CCP.

Example 11. In Example 10 we have k = 2, n = 4 and hence z = 3. Thus, S0 = {0, 1, 2}, S1 =

{3, 0, 1}, S2 = {2, 3, 0} and S3 = {1, 2, 3}. The corresponding generator matrix G satisfies the (k, k+

1) CCP as any two columns of the each of submatrices GSi , i = 0, . . . , 3 are linearly independent

over GF (3).
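Definition 5 can be checked mechanically. A brute-force sketch (valid for prime q; both function names are ours) that confirms Example 11:

```python
def rank_mod_p(rows, p):
    """Rank of a matrix over GF(p), p prime, via Gaussian elimination."""
    M = [[x % p for x in r] for r in rows]
    rank = 0
    for c in range(len(M[0])):
        piv = next((r for r in range(rank, len(M)) if M[r][c]), None)
        if piv is None:
            continue
        M[rank], M[piv] = M[piv], M[rank]
        inv = pow(M[rank][c], -1, p)          # modular inverse (Python 3.8+)
        M[rank] = [x * inv % p for x in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][c]:
                f = M[r][c]
                M[r] = [(x - f * y) % p for x, y in zip(M[r], M[rank])]
        rank += 1
    return rank

def satisfies_ccp(G, q):
    """(k, k+1)-CCP of Definition 5: every k x k submatrix of each window
    G_{S_a}, S_a = {a(k+1), ..., a(k+1)+k} mod n, must be full rank."""
    k, n = len(G), len(G[0])
    z = next(z for z in range(1, k + 2) if (n * z) % (k + 1) == 0)
    for a in range(z * n // (k + 1)):
        S = [(a * (k + 1) + t) % n for t in range(k + 1)]
        for drop in S:
            sub = [[G[r][c] for c in S if c != drop] for r in range(k)]
            if rank_mod_p(sub, q) < k:
                return False
    return True

# The (4, 2) code of Example 10 satisfies the (2, 3)-CCP, matching Example 11.
print(satisfies_ccp([[1, 0, 1, 1], [0, 1, 1, 2]], 3))
```

A code whose window contains two proportional columns, e.g. the generator matrix [[1,0,1,0],[0,1,0,1]], fails the check.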

We note here that one can also define different levels of the consecutive column property. Let

Taα = {aα, · · · , aα + α − 1} and Saα = {(t)n | t ∈ Taα }.

Definition 6. (k, α)-consecutive column property. Consider the submatrices of G specified by $G_{S_a^\alpha}$ for $0 \le a \le \frac{zn}{\alpha} - 1$. We say that G satisfies the (k, α)-consecutive column property, where α ≤ k, if each $G_{S_a^\alpha}$ has full rank. In other words, the α columns in each $G_{S_a^\alpha}$ are linearly independent.

As pointed out in the sequel, codes that satisfy the (k, α)-CCP, where α ≤ k will result in

caching systems that have a multiplicative rate gain of α over an uncoded system. Likewise, codes

that satisfy the (k, k + 1)-CCP will have a gain of k + 1 over an uncoded system. In the remainder

of the paper, we will use the term CCP to refer to the (k, k + 1)-CCP if the value of k is clear from

the context.

3.2.3 Usage in a coded caching scenario

A resolvable design generated from a linear block code that satisfies the CCP can be used in a

coded caching scheme as follows. We associate the users with the blocks. Each subfile is associated

with a point and an additional index. The placement scheme follows the natural incidence between

the blocks and the points; a formal description is given in Algorithm 2 and illustrated further in

Example 12.

Example 12. Consider the resolvable design from Example 10, where we recall that z = 3. The

blocks in A correspond to twelve users $U_{012}$, $U_{345}$, $U_{678}$, $U_{036}$, $U_{147}$, $U_{258}$, $U_{057}$, $U_{138}$, $U_{246}$, $U_{048}$, $U_{237}$, $U_{156}$. Each file is partitioned into $F_s = 9 \times z = 27$ subfiles, each of which is denoted by $W_{n,t}^s$, t = 0, · · · , 8, s = 0, 1, 2. The cache in user $U_{abc}$, denoted $Z_{abc}$, is specified as $Z_{abc} = \{W_{n,t}^s \mid t \in$

Algorithm 2: Placement Scheme

Input: Resolvable design (X, A) constructed from a (n, k) linear block code. Let z be the least positive integer such that k + 1 | nz.
1. Divide each file $W_n$, for n ∈ [N], into $q^k z$ subfiles. Thus, $W_n = \{W_{n,t}^s : t \in \{0, \ldots, q^k - 1\}$ and $s \in \{0, \ldots, z - 1\}\}$;
2. User $U_B$ for B ∈ A caches $Z_B = \{W_{n,t}^s : n \in [N], t \in B$ and $s \in \{0, \ldots, z - 1\}\}$;
Output: Cache content of user $U_B$, denoted $Z_B$, for B ∈ A.

{a, b, c}, s ∈ {0, 1, 2} and n ∈ [N ]}. This corresponds to a coded caching system where each user

caches 1/3-rd of each file so that M/N = 1/3.

In general (see Algorithm 2), we have K = |A| = nq users. Each file $W_n$, n ∈ [N], is divided into $q^k z$ subfiles $W_n = \{W_{n,t}^s \mid 0 \le t \le q^k - 1, 0 \le s \le z - 1\}$. A subfile $W_{n,t}^s$ is cached in user $U_B$, where B ∈ A, if t ∈ B. Therefore, each user caches a total of $N q^{k-1} z$ subfiles. As each file consists of $q^k z$ subfiles, we have that M/N = 1/q.
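As a sanity check on Algorithm 2, the cache fraction can be counted explicitly for the design of Example 12. A minimal sketch (one file suffices, since placement is identical across files):

```python
q, k, z, N = 3, 2, 3, 1   # parameters of Example 12; N = 1 file suffices

def cache_of(block):
    """Algorithm 2: user U_B caches W_{n,t}^s for all n in [N], all points
    t in its block B, and all superscripts s in {0, ..., z-1}."""
    return {(n, t, s) for n in range(N) for t in block for s in range(z)}

subfiles_per_file = q**k * z                 # 27 subfiles per file
cached = len(cache_of({0, 1, 2})) // N       # 9 subfiles cached per file
print(cached, subfiles_per_file)             # ratio 9/27 = 1/3 = 1/q
```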

It remains to show that we can design a delivery phase scheme that satisfies any possible demand

pattern. Suppose that in the delivery phase user UB requests file WdB where dB ∈ [N ]. The server

responds by transmitting several equations that satisfy each user. Each equation allows k + 1 users

from different parallel classes to simultaneously obtain a missing subfile. Our delivery scheme is

such that the set of transmitted equations can be classified into various recovery sets that correspond

to appropriate collections of parallel classes. For example, in Fig. 3.1, PS0 = {P0 , P1 , P2 }, PS1 =

{P0 , P1 , P3 } and so on. It turns out that these recovery sets correspond precisely to the sets
$S_a$, $0 \le a \le \frac{zn}{k+1} - 1$, defined earlier. We illustrate this by means of the example below.

Example 13. Consider the placement scheme specified in Example 12. Let each user UB request

file WdB . The recovery sets are specified by means of the recovery set bipartite graph shown in Fig.

3.1, e.g., PS1 corresponds to S1 = {0, 1, 3}. The outgoing edges from each parallel class are labeled

arbitrarily with numbers 0, 1 and 2. Our delivery scheme is such that each user recovers missing

subfiles with a specific superscript from each recovery set that its corresponding parallel class

participates in. For instance, a user in parallel class P1 recovers missing subfiles with superscript 0

[Figure 3.1: Recovery set bipartite graph. The parallel classes P0, P1, P2, P3 form one vertex set and the recovery sets PS0, PS1, PS2, PS3 the other; the outgoing edges of each parallel class carry the labels 0, 1, 2.]

from PS0 , superscript 1 from PS1 and superscript 2 from PS3 ; these superscripts are the labels of

outgoing edges from P1 in the bipartite graph.

It can be verified, e.g., that user U012 which lies in P0 recovers all missing subfiles with super-

script 1 from the equations below.

$W_{d_{012},3}^1 \oplus W_{d_{036},2}^1 \oplus W_{d_{237},0}^0$,  $W_{d_{012},6}^1 \oplus W_{d_{036},1}^1 \oplus W_{d_{156},0}^0$,

$W_{d_{012},4}^1 \oplus W_{d_{147},0}^1 \oplus W_{d_{048},1}^0$,  $W_{d_{012},7}^1 \oplus W_{d_{147},2}^1 \oplus W_{d_{237},1}^0$,

$W_{d_{012},8}^1 \oplus W_{d_{258},0}^1 \oplus W_{d_{048},2}^0$,  $W_{d_{012},5}^1 \oplus W_{d_{258},1}^1 \oplus W_{d_{156},2}^0$.

Each of the equations above benefits three users. They are generated simply by choosing U012 from

P0 , any block from P1 and the last block from P3 so that the intersection of all these blocks is

empty. The fact that these equations are useful for the problem at hand is a consequence of the

CCP. The process of generating these equations can be applied to all possible recovery sets. It can

be shown that this allows all users to be satisfied at the end of the procedure.

In what follows, we first show that for the recovery set PSa it is possible to generate equations

that benefit k + 1 users simultaneously.

Claim 1. Consider the resolvable design (X, A) constructed as described in Section 3.2.1 from a (n, k) linear block code that satisfies the CCP. Let $P_{S_a} = \{P_i \mid i \in S_a\}$ for $0 \le a \le \frac{zn}{k+1} - 1$, i.e.,

it is the subset of parallel classes corresponding to Sa . We emphasize that |PSa | = k + 1. Consider



blocks $B_{i_1,l_{i_1}}, \ldots, B_{i_k,l_{i_k}}$ (where $l_{i_j} \in \{0, \ldots, q - 1\}$) that are picked from any k distinct parallel classes of $P_{S_a}$. Then, $|\cap_{j=1}^k B_{i_j,l_{i_j}}| = 1$.

Before proving Claim 1, we discuss its application in the delivery phase. Note that the claim

asserts that k blocks chosen from k distinct parallel classes intersect in precisely one point. Now,

suppose that one picks k + 1 users from k + 1 distinct parallel classes, such that their intersection is

empty. These blocks (equivalently, users) can participate in an equation that benefits k +1 users. In

particular, each user will recover a missing subfile indexed by the intersection of the other k blocks.

We emphasize here that Claim 1 is at the core of our delivery phase. Of course, we need to justify

that enough equations can be found that allow all users to recover all their missing subfiles. This

follows from a natural counting argument that is made more formally in the subsequent discussion.

The superscripts s ∈ {0, . . . , z − 1} are needed for the counting argument to go through.

Proof. Following the construction in Section 3.2.1, we note that a block $B_{i,l} \in P_i$ is specified by $B_{i,l} = \{j : T_{i,j} = l\}$.

Now consider $B_{i_1,l_{i_1}}, \ldots, B_{i_k,l_{i_k}}$ (where $i_j \in S_a$, $l_{i_j} \in \{0, \ldots, q - 1\}$) that are picked from k distinct parallel classes of $P_{S_a}$. W.l.o.g. we assume that $i_1 < i_2 < \cdots < i_k$. Let $I = \{i_1, \ldots, i_k\}$ and $\mathbf{T}_I$ denote the submatrix of $\mathbf{T}$ obtained by retaining the rows in I. We will show that the vector $[l_{i_1}\, l_{i_2} \ldots l_{i_k}]^T$ is a column in $\mathbf{T}_I$ and only appears once.

To see this, consider the system of equations in variables $u_0, \ldots, u_{k-1}$:

$$\sum_{b=0}^{k-1} u_b g_{bi_1} = l_{i_1}, \quad \ldots, \quad \sum_{b=0}^{k-1} u_b g_{bi_k} = l_{i_k}.$$

By the CCP, the vectors $\mathbf{g}_{i_1}, \mathbf{g}_{i_2}, \ldots, \mathbf{g}_{i_k}$ are linearly independent. Therefore this system of k equations in k variables has a unique solution over GF(q). The result follows.
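Claim 1 is easy to confirm exhaustively for the running example. A sketch over the design of Example 10 and the recovery sets of Example 11:

```python
from itertools import combinations, product

# Parallel classes of Example 10 and recovery sets S_0, ..., S_3 of Example 11.
P = [
    [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}],
    [{0, 3, 6}, {1, 4, 7}, {2, 5, 8}],
    [{0, 5, 7}, {1, 3, 8}, {2, 4, 6}],
    [{0, 4, 8}, {2, 3, 7}, {1, 5, 6}],
]
recovery_sets = [[0, 1, 2], [3, 0, 1], [2, 3, 0], [1, 2, 3]]

def claim1_holds(k=2):
    """k blocks drawn from k distinct parallel classes of any recovery set
    must intersect in exactly one point."""
    for S in recovery_sets:
        for classes in combinations(S, k):
            for blocks in product(*(P[c] for c in classes)):
                if len(set.intersection(*blocks)) != 1:
                    return False
    return True

print(claim1_holds())
```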

Algorithm 3: Signal Generation Algorithm for $P_{S_a}$

Input: For $P \in P_{S_a}$, $E(P) = \mathrm{label}(P - P_{S_a})$. Signal set Sig = ∅.
1. while any user $U_B \in P_j$, j ∈ $S_a$, does not recover all its missing subfiles with superscript $E(P_j)$ do
2.   Pick blocks $B_{j,l_j} \in P_j$ for all j ∈ $S_a$ and $l_j \in \{0, \ldots, q - 1\}$ such that $\cap_{j \in S_a} B_{j,l_j} = \emptyset$;
     /* Pick blocks from distinct parallel classes in $P_{S_a}$ such that their intersection is empty */
3.   Let $\hat{l}_s = \cap_{j \in S_a \setminus \{s\}} B_{j,l_j}$ for s ∈ $S_a$;
     /* Determine the missing subfile index that the user from $P_s$ will recover */
4.   Add signal $\oplus_{s \in S_a} W_{\kappa_{s,l_s},\hat{l}_s}^{E(P_s)}$ to Sig;
     /* User $U_{B_{s,l_s}}$ demands file $W_{\kappa_{s,l_s}}$. This equation allows it to recover the missing subfile indexed by $\hat{l}_s$. The superscript is determined by the recovery set bipartite graph */
5. end
Output: Signal set Sig.

We now provide an intuitive argument for the delivery phase. Recall that we form a recovery set

bipartite graph (see Fig. 3.1 for an example) with parallel classes and recovery sets as the disjoint

vertex subsets. The edges incident on each parallel class are labeled arbitrarily from 0, . . . , z − 1.

For a parallel class P ∈ PSa we denote this label by label(P − PSa ). For a given recovery set PSa ,

the delivery phase proceeds by choosing blocks from distinct parallel classes in PSa such that their

intersection is empty; this provides an equation that benefits k + 1 users. It turns out that the

equation allows a user in parallel class P ∈ PSa to recover a missing subfile with the superscript

label(P − PSa ).

The formal argument is made in Algorithm 3. For ease of notation in Algorithm 3, we denote

the demand of user UBi,j for 0 ≤ i ≤ n − 1, 0 ≤ j ≤ q − 1 by Wκi,j .

Claim 2. Consider a user UB belonging to parallel class P ∈ PSa . The signals generated in

Algorithm 3 can recover all the missing subfiles needed by UB with superscript E(P).

Proof. Let $P_\alpha \in P_{S_a}$. In the arguments below, we argue that user $U_{B_{\alpha,l_\alpha}}$ that demands file $W_{\kappa_{\alpha,l_\alpha}}$ can recover all its missing subfiles with superscript $E(P_\alpha)$. Note that $|B_{\alpha,l_\alpha}| = q^{k-1}$. Thus, user $U_{B_{\alpha,l_\alpha}}$ needs to obtain $q^k - q^{k-1}$ missing subfiles with superscript $E(P_\alpha)$. Consider an iteration of the while loop where block $B_{\alpha,l_\alpha}$ is picked in step 2. The equation in Algorithm 3 allows it to recover $W_{\kappa_{\alpha,l_\alpha},\hat{l}_\alpha}^{E(P_\alpha)}$, where $\hat{l}_\alpha = \cap_{j \in S_a \setminus \{\alpha\}} B_{j,l_j}$. This follows from $\cap_{j \in S_a} B_{j,l_j} = \emptyset$ and Claim 1.

Next we count the number of equations that UBα,lα participates in. We can pick k − 1 users

from some k − 1 distinct parallel classes in PSa . This can be done in q k−1 ways. Claim 1 ensures

that the blocks so chosen intersect in a single point. Next we pick a block from the only remaining

parallel class in PSa such that the intersection of all blocks is empty. This can be done in q − 1

ways. Thus, there are a total of $q^{k-1}(q - 1) = q^k - q^{k-1}$ equations in which user $U_{B_{\alpha,l_\alpha}}$ participates.

It remains to argue that each equation provides a distinct subfile. Towards this end, let $\{i_1, \ldots, i_k\} \subset S_a$ be an index set such that $\alpha \notin \{i_1, \ldots, i_k\}$. Suppose that there exist sets of blocks $\{B_{i_1,l_{i_1}}, \ldots, B_{i_k,l_{i_k}}\}$ and $\{B_{i_1,l'_{i_1}}, \ldots, B_{i_k,l'_{i_k}}\}$ such that $\{B_{i_1,l_{i_1}}, \ldots, B_{i_k,l_{i_k}}\} \neq \{B_{i_1,l'_{i_1}}, \ldots, B_{i_k,l'_{i_k}}\}$ but $\cap_{j=1}^k B_{i_j,l_{i_j}} = \cap_{j=1}^k B_{i_j,l'_{i_j}} = \beta$. This is a contradiction: the two collections differ in at least one parallel class, i.e., $l_{i_j} \neq l'_{i_j}$ for some j, which in turn implies that $\beta \in B_{i_j,l_{i_j}} \cap B_{i_j,l'_{i_j}}$. This is impossible, since two distinct blocks from the same parallel class have an empty intersection.

As the algorithm is symmetric with respect to all blocks in parallel classes belonging to PSa ,

we have the required result.
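The counting above can be replayed on the running example: for the recovery set corresponding to $S_0 = \{0, 1, 2\}$, each block should appear in $q^{k-1}(q - 1) = 6$ equations and recover 6 distinct subfile indices. A sketch:

```python
from itertools import product

# Parallel classes P_0, P_1, P_2 of Example 10 (recovery set S_0 = {0, 1, 2}).
P = [
    [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}],
    [{0, 3, 6}, {1, 4, 7}, {2, 5, 8}],
    [{0, 5, 7}, {1, 3, 8}, {2, 4, 6}],
]

# Step 2 of Algorithm 3: one equation per triple of blocks, one block per
# class, whose overall intersection is empty.
equations = [(B0, B1, B2)
             for B0, B1, B2 in product(P[0], P[1], P[2])
             if not (B0 & B1 & B2)]

# The subfile index recovered by user U_{012} in each of its equations is
# the intersection of the other two blocks (a single point, by Claim 1).
mine = [B1 & B2 for B0, B1, B2 in equations if B0 == {0, 1, 2}]
print(len(mine))   # q^{k-1}(q-1) = 6 equations
```

The six recovered indices are exactly the points 3 through 8, i.e., all points outside the user's own block, matching the count $q^k - q^{k-1}$ of missing subfiles per superscript.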

The overall delivery scheme repeatedly applies Algorithm 3 to each of the recovery sets.

Lemma 2. The proposed delivery scheme terminates and allows each user’s demand to be satisfied.
Furthermore, the transmission rate of the server is $\frac{(q-1)n}{k+1}$ and the subpacketization level is $q^k z$.

Proof. See Appendix.

The main requirement for Lemma 2 to hold is that the recovery set bipartite graph be biregular,

where multiple edges between the same pair of nodes is disallowed and the degree of each parallel

class is z. It is not too hard to see that this follows from the definition of the recovery sets (see the

proof in the Appendix for details).



In an analogous manner, if one starts with the generator matrix of a code that satisfies the

(k, α)-CCP for α ≤ k, then we can obtain the following result which is stated below. The details

are similar to the discussion for the (k, k + 1)-CCP and can be found in the Appendix (Section A).

Corollary 1. Consider a coded caching scheme obtained by forming the resolvable design obtained

from a (n, k) code that satisfies the (k, α)-CCP where α ≤ k. Let z be the least positive integer

such that α | nz. Then, a delivery scheme can be constructed such that the transmission rate is
$\frac{(q-1)n}{\alpha}$ and the subpacketization level is $q^k z$.

3.2.4 Obtaining a scheme for $M/N = 1 - \frac{k+1}{nq}$

The construction above works for a system where M/N = 1/q. It turns out that this can be
converted into a scheme for $\frac{M}{N} = 1 - \frac{k+1}{nq}$. Thus, any convex combination of these two points can

be obtained by memory-sharing.

Towards this end, we note that the class of coded caching schemes considered here can be

specified by an equation-subfile matrix. This is inspired by the hypergraph formulation and the

placement delivery array (PDA) based schemes for coded caching in Shangguan et al. (2018) and

Yan et al. (2017a). Each equation is assumed to be of the all-but-one type, i.e., it is of the form

$W_{d_{t_1},A_{j_1}} \oplus W_{d_{t_2},A_{j_2}} \oplus \cdots \oplus W_{d_{t_m},A_{j_m}}$, where for each $\ell \in [m]$, we have the property that user $U_{t_\ell}$ does not cache subfile $W_{n,A_{j_\ell}}$ but caches all subfiles $W_{n,A_{j_s}}$ for $s \in [m]$, $s \neq \ell$.

The coded caching system corresponds to a ∆ × Fs equation-subfile matrix S as follows. We

associate each row of S with an equation and each column with a subfile. We denote the i-th row

of S by Eqi and j-th column of S by Aj . The value S(i, j) = t if in the i-th equation, user Ut

recovers subfile Wdt ,Aj , otherwise, S(i, j) = 0. Suppose that these ∆ equations allow each user to

satisfy their demands, i.e., S corresponds to a valid coded caching scheme. It is not too hard to

see that the placement scheme can be obtained by examining S. Namely, user Ut caches the subfile

corresponding to the j-th column if integer t does not appear in the j-th column.

Example 14. Consider a coded caching system in Maddah-Ali and Niesen (2014b) with K = 4,

∆ = 4 and Fs = 6. We denote the four users as U1 , U2 , U3 , U4 . Suppose that S is

$$\mathbf{S} = \begin{array}{c|cccccc}
 & A_1 & A_2 & A_3 & A_4 & A_5 & A_6 \\ \hline
Eq_1 & 3 & 2 & 0 & 1 & 0 & 0 \\
Eq_2 & 4 & 0 & 2 & 0 & 1 & 0 \\
Eq_3 & 0 & 4 & 3 & 0 & 0 & 1 \\
Eq_4 & 0 & 0 & 0 & 4 & 3 & 2
\end{array}$$

Upon examining S it is evident for instance that user U1 caches subfiles A1 , . . . , A3 as the number

1 does not appear in the corresponding columns. Similarly, the cache placement of the other users

can be obtained. Interpreting this placement scheme in terms of the user-subfile assignment, it can

be verified that the design so obtained corresponds to the transpose of the scheme considered in

Example 3.1 (and also to the scheme of Maddah-Ali and Niesen (2014b) for K = 4, M/N = 1/2).

Lemma 3. Consider a ∆×Fs equation-subfile matrix S whose entries belong to the set {0, 1, . . . , K}.

It corresponds to a valid coded caching system if the following three conditions are satisfied.

• There is no non-zero integer appearing more than once in each column.

• There is no non-zero integer appearing more than once in each row.

• If $S(i_1, j_1) = S(i_2, j_2) \neq 0$, then $S(i_1, j_2) = S(i_2, j_1) = 0$.

Proof. The placement scheme is obtained as discussed earlier, i.e., user Ut caches subfiles Wn,Aj if

integer t does not appear in column Aj . Therefore, matrix S corresponds to a placement scheme.

Next we discuss the delivery scheme. Note that $Eq_i$ corresponds to an equation as follows:

$$W_{d_{t_1},A_{j_1}} \oplus W_{d_{t_2},A_{j_2}} \oplus \cdots \oplus W_{d_{t_m},A_{j_m}},$$

where $S(i, j_1) = t_1, \cdots, S(i, j_m) = t_m$. The above equation can allow m users to recover subfiles simultaneously if (a) $U_{t_\ell}$ does not cache $W_{n,A_{j_\ell}}$ and (b) $U_{t_\ell}$ caches all $W_{n,A_{j_s}}$ for $s \in [m]$, $s \neq \ell$. It is evident that $U_{t_\ell}$ does not cache $W_{n,A_{j_\ell}}$ owing to the placement scheme. Next, to guarantee condition (b), we need to show that the integer $t_\ell = S(i, j_\ell)$ will not appear in column $A_{j_s}$ of S for $s \in [m]$, $s \neq \ell$. Towards this end, $t_\ell \neq S(i, j_s)$ because of Condition 2. Next, consider the non-zero entries that lie in column $A_{j_s}$ but not in row $Eq_i$. Assume there exists an entry $S(i', j_s)$ such that $S(i', j_s) = S(i, j_\ell) = t_\ell$ and $i' \neq i$; then $S(i, j_s) = t_s \neq 0$, which is a contradiction to Condition 3. Finally, Condition 1 guarantees that each missing subfile is recovered only once.

User $U_t$ caches a fraction $\frac{M_t}{N} = \frac{L_t}{F_s}$, where $L_t$ is the number of columns of S that do not have the entry t. Similarly, the transmission rate is given by $R = \frac{\Delta}{F_s}$.

The crucial point is that the transpose of S, i.e., $\mathbf{S}^T$, also corresponds to a coded caching scheme. This follows directly from the fact that $\mathbf{S}^T$ also satisfies the conditions in Lemma 3. In particular, $\mathbf{S}^T$ corresponds to a coded caching system with K users and ∆ subfiles. In the placement phase, the cache size of $U_t$ is $\frac{M'_t}{N} = \frac{\Delta - F_s + L_t}{\Delta}$. In the delivery phase, by transmitting $F_s$ equations corresponding to the rows of $\mathbf{S}^T$, all missing subfiles can be recovered. Then, the transmission rate is $R = \frac{F_s}{\Delta}$.
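Lemma 3's three conditions, and the fact that they survive transposition, can be checked programmatically. A sketch (the function name is ours) using the matrix of Example 14:

```python
def valid_equation_subfile_matrix(S):
    """Check the three conditions of Lemma 3 on an equation-subfile
    matrix (0 marks an empty entry)."""
    rows, cols = len(S), len(S[0])
    for r in S:                                    # Condition 2 (rows)
        nz = [x for x in r if x]
        if len(nz) != len(set(nz)):
            return False
    for j in range(cols):                          # Condition 1 (columns)
        nz = [S[i][j] for i in range(rows) if S[i][j]]
        if len(nz) != len(set(nz)):
            return False
    for i1 in range(rows):                         # Condition 3
        for j1 in range(cols):
            if not S[i1][j1]:
                continue
            for i2 in range(rows):
                for j2 in range(cols):
                    if (i1, j1) != (i2, j2) and S[i2][j2] == S[i1][j1]:
                        if S[i1][j2] or S[i2][j1]:
                            return False
    return True

# The matrix of Example 14 is valid, and so is its transpose.
S = [[3, 2, 0, 1, 0, 0],
     [4, 0, 2, 0, 1, 0],
     [0, 4, 3, 0, 0, 1],
     [0, 0, 0, 4, 3, 2]]
ST = [list(col) for col in zip(*S)]
print(valid_equation_subfile_matrix(S), valid_equation_subfile_matrix(ST))
```

The three conditions are symmetric in rows and columns, which is exactly why transposing a valid matrix yields another valid scheme.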

Applying the above discussion in our context, consider the equation-subfile matrix S corresponding to the coded caching system with $K = nq$, $\frac{M_t}{N} = \frac{1}{q}$ for $1 \le t \le nq$, $F_s = q^k z$ and $\Delta = q^k(q-1)\frac{nz}{k+1}$. Then $\mathbf{S}^T$ corresponds to a system with $K' = nq$, $\frac{M'}{N} = 1 - \frac{k+1}{nq}$, $F_s' = (q-1)q^k\frac{zn}{k+1}$, and transmission rate $R' = \frac{F_s}{\Delta} = \frac{k+1}{(q-1)n}$. The following theorem is the main result of this paper.

Theorem 2. Consider a (n, k) linear block code over GF(q) that satisfies the (k, k + 1)-CCP. This corresponds to a coded caching scheme with K = nq users, N files in the server, where each user has a cache of size $M \in \left\{\frac{1}{q}N, \left(1 - \frac{k+1}{nq}\right)N\right\}$. Let z be the least positive integer such that $k + 1 \mid nz$. When $\frac{M}{N} = \frac{1}{q}$, we have

$$R = \frac{(q-1)n}{k+1}, \quad \text{and} \quad F_s = q^k z.$$

When $\frac{M}{N} = 1 - \frac{k+1}{nq}$, we have

$$R = \frac{k+1}{(q-1)n}, \quad \text{and} \quad F_s = (q-1)q^k \frac{zn}{k+1}.$$

By memory sharing any convex combination of these points is achievable.
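For a given code, the two corner points of Theorem 2 are straightforward to tabulate. A sketch (the function and dictionary keys are our own):

```python
from math import gcd

def theorem2_points(n, k, q):
    """The two corner points of Theorem 2 for an (n, k) code over GF(q)
    that satisfies the (k, k+1)-CCP."""
    z = (k + 1) // gcd(n, k + 1)      # least z with (k+1) | nz
    low = dict(K=n * q, MN=1 / q,
               R=(q - 1) * n / (k + 1), Fs=q**k * z)
    high = dict(K=n * q, MN=1 - (k + 1) / (n * q),
                R=(k + 1) / ((q - 1) * n),
                Fs=(q - 1) * q**k * z * n // (k + 1))
    return low, high

# The (4, 2) code over GF(3) of Example 10: K = 12 users.
low, high = theorem2_points(4, 2, 3)
print(low)    # M/N = 1/3: rate 8/3, subpacketization 27
print(high)   # M/N = 3/4: rate 3/8, subpacketization 72
```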

In a similar manner, for a (n, k) linear block code that satisfies the (k, α)-CCP over GF(q), the caching system where M/N = 1/q can be converted into a system where $K' = nq$, $\frac{M'}{N'} = 1 - \frac{\alpha}{nq}$, $F_s' = (q-1)q^k\frac{zn}{\alpha}$ and $R' = \frac{\alpha}{(q-1)n}$ using the equation-subfile technique. The arguments presented above apply with essentially no change.

3.3 Some classes of linear codes that satisfy the CCP

At this point we have established that linear block codes that satisfy the CCP are attractive

candidates for usage in coded caching. In this section, we demonstrate that there are a large class

of generator matrices that satisfy the CCP. For most of the section we work with matrices over

a finite field of order q. In the last subsection, we discuss some constructions for matrices over Z

mod q when q is not a prime or prime power. We summarize the constructions presented in this

section in Table 3.1.

3.3.1 Maximum-distance-separable (MDS) codes

(n, k)-MDS codes with minimum distance n − k + 1 are clearly a class of codes that satisfy the

CCP. In fact, for these codes any k columns of the generator matrix can be shown to be full rank.

Note, however, that MDS codes typically need a large field size, e.g., q + 1 ≥ n (assuming that the MDS conjecture is true) Roth (2006). In our construction, the value of M/N = 1/q and the number of users is K = nq. Thus, for large n, we will only obtain systems with small values of M/N = 1/q or, equivalently (by Theorem 2 above), large values M/N = 1 − (k + 1)/(nq). This may be restrictive in practice.

3.3.2 Cyclic Codes

A cyclic code is a linear block code, where the circular shift of each codeword is also a codeword

Lin and Costello (2004). A (n, k) cyclic code over GF (q) is specified by a monic polynomial

$g(X) = \sum_{i=0}^{n-k} g_i X^i$ with coefficients from GF(q), where $g_{n-k} = 1$ and $g_0 \neq 0$; g(X) needs to divide

the polynomial $X^n - 1$. The generator matrix of the cyclic code is obtained as below.

$$G = \begin{bmatrix}
g_0 & g_1 & \cdots & g_{n-k} & 0 & \cdots & 0 \\
0 & g_0 & g_1 & \cdots & g_{n-k} & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & \cdots & 0 & g_0 & g_1 & \cdots & g_{n-k}
\end{bmatrix}$$

The following claim shows that for verifying the CCP for a cyclic code it suffices to pick any

set of k + 1 consecutive columns.

Claim 3. Consider a (n, k) cyclic code with generator matrix G. Let GS denote a set of k + 1

consecutive columns of G. If each k × k submatrix of GS is full rank, then G satisfies the (k, k + 1)-

CCP.

Proof. Let the generator polynomial of the cyclic code be g(X), where we note that g(X) has degree

n − k. Let $G_S = [\mathbf{g}_{(a)_n}, \mathbf{g}_{(a+1)_n}, \cdots, \mathbf{g}_{(a+k)_n}]$, where we assume that $G_S$ satisfies the (k, k + 1)-CCP.

Let

$$G_{S\setminus j} = [\mathbf{g}_{(a)_n}, \ldots, \mathbf{g}_{(a+j-1)_n}, \mathbf{g}_{(a+j+1)_n}, \ldots, \mathbf{g}_{(a+k)_n}], \text{ and}$$
$$G_{S'\setminus j} = [\mathbf{g}_{(a+i)_n}, \ldots, \mathbf{g}_{(a+j-1+i)_n}, \mathbf{g}_{(a+j+1+i)_n}, \ldots, \mathbf{g}_{(a+k+i)_n}].$$

We need to show that if $G_{S\setminus j}$ has full rank, then $G_{S'\setminus j}$ has full rank, for any 0 ≤ j ≤ k.

As $G_{S\setminus j}$ has full rank, there is no codeword c ≠ 0 such that $c((a)_n) = \cdots = c((a+j-1)_n) = c((a+j+1)_n) = \cdots = c((a+k)_n) = 0$. By the definition of a cyclic code, any circular shift of a codeword results in another codeword that belongs to the code. Therefore, there is no codeword c′ such that $c'((a+i)_n) = \cdots = c'((a+j-1+i)_n) = c'((a+j+1+i)_n) = \cdots = c'((a+k+i)_n) = 0$. Thus, $G_{S'\setminus j}$ has full rank.



Claim 3 implies a low complexity search algorithm to determine if a cyclic code satisfies the
CCP. Instead of checking all $G_{S_a}$, $0 \le a \le \frac{zn}{k+1} - 1$, in Definition 5, we only need to check an arbitrary $G_S = [\mathbf{g}_{(i)_n}, \mathbf{g}_{(i+1)_n}, \cdots, \mathbf{g}_{(i+k)_n}]$, for 0 ≤ i < n. To further simplify the search, we choose $i = n - \lfloor k/2 \rfloor - 1$.

For this choice of i, Claim 4 shows that GS is such that we only need to check the rank of a

list of small-dimension matrices to determine if each k × k submatrix of GS is full rank (the proof

appears in the Appendix).

Claim 4. A cyclic code with generator matrix G satisfies the CCP if the following conditions hold.

• For $0 < j \le \lfloor k/2 \rfloor$, the submatrices
$$C_j = \begin{bmatrix}
g_{n-k-1} & g_{n-k} & 0 & \cdots & 0 \\
g_{n-k-2} & g_{n-k-1} & g_{n-k} & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
g_{n-k-j+1} & \cdots & & & g_{n-k} \\
g_{n-k-j} & \cdots & & & g_{n-k-1}
\end{bmatrix}$$
have full rank. In the above expression, $g_i = 0$ if $i < 0$.

• For $\lfloor k/2 \rfloor < j < k$, the submatrices
$$C_j = \begin{bmatrix}
g_1 & g_2 & \cdots & \cdots & g_{k-j} \\
g_0 & g_1 & \cdots & \cdots & g_{k-j-1} \\
\vdots & \ddots & \ddots & & \vdots \\
0 & \cdots & 0 & g_0 & g_1
\end{bmatrix}$$
have full rank.

Example 15. Consider the polynomial $g(X) = X^4 + X^3 + X + 2$ over GF(3). Since it divides $X^8 - 1$, it is the generator polynomial of a (8, 4) cyclic code over GF(3). The generator matrix of this code is given below.
$$G = \begin{bmatrix}
2 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 2 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 2 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 2 & 1 & 0 & 1 & 1
\end{bmatrix}.$$
It can be verified that the 4 × 5 submatrix consisting of the two leftmost columns and the three rightmost columns of G is such that all of its 4 × 4 submatrices are full rank. Thus, by Claim 3 the (4, 5)-CCP is satisfied for G.
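The check in Example 15 can be automated. The sketch below is our own illustration (not code from the dissertation): it builds G from $g(X) = X^4 + X^3 + X + 2$ and verifies that every 4 × 4 submatrix of a set of five cyclically consecutive columns is full rank over GF(3), which by Claim 3 establishes the (4, 5)-CCP.

```python
def gf_rank(rows, q):
    """Rank of an integer matrix over GF(q), q prime, via Gaussian elimination."""
    m = [list(r) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        piv = next((r for r in range(rank, len(m)) if m[r][col] % q), None)
        if piv is None:
            continue
        m[rank], m[piv] = m[piv], m[rank]
        inv = pow(m[rank][col], q - 2, q)            # multiplicative inverse in GF(q)
        m[rank] = [(x * inv) % q for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col] % q:
                f = m[r][col]
                m[r] = [(a - f * b) % q for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# generator matrix of the (8, 4) cyclic code with g(X) = X^4 + X^3 + X + 2 over GF(3):
# row r is the cyclic right-shift of the first row by r positions
g = [2, 1, 0, 1, 1, 0, 0, 0]
G = [g[-r:] + g[:-r] if r else g[:] for r in range(4)]

# five cyclically consecutive columns (indices 5, 6, 7, 0, 1); drop each column in
# turn and check that the remaining 4x4 submatrix is full rank over GF(3)
cols = [(5 + i) % 8 for i in range(5)]
full = [gf_rank([[G[r][c] for idx, c in enumerate(cols) if idx != j]
                 for r in range(4)], 3) == 4
        for j in range(5)]
print(all(full))   # True: by Claim 3, G satisfies the (4, 5)-CCP
```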

Remark 3. Cyclic codes form an important class of codes that satisfy the (k, k)-CCP (cf. Definition 6). This is because it is well known Lin and Costello (2004) that any k consecutive columns of the generator matrix of a cyclic code are linearly independent.

3.3.3 Constructions leveraging properties of smaller base matrices

It is well recognized that cyclic codes do not necessarily exist for any choice of parameters, owing to the divisibility requirement on the generator polynomial. We now discuss a more general construction of generator matrices that satisfy the CCP. As we shall see, this construction provides a satisfactory solution for a large range of system parameters.

Our first simple observation is that the Kronecker product (denoted by ⊗ below) of a z × α

generator matrix that satisfies the (z, z)-CCP with a t × t identity matrix, It×t immediately yields

a generator matrix that satisfies the (tz, tz)-CCP.

Claim 5. Consider a (n, k) linear block code over GF (q) whose generator matrix is specified as

G = A ⊗ It×t where A is a z × α matrix that satisfies the (z, z)-CCP. Then, G satisfies the

(k, k)-CCP where k = tz and n = tα.

Proof. The recovery set for A is specified as $S_a^z = \{(az)_\alpha, \cdots, (az+z-1)_\alpha\}$ and the recovery set for G is specified as $S_a^k = \{(ak)_n, \cdots, (ak+k-1)_n\}$. Since A satisfies the (z, z)-CCP, $A_{S_a^z}$ has full rank. Note that $G_{S_a^k} = A_{S_a^z} \otimes I_{t\times t}$. Then $\det(G_{S_a^k}) = \det(A_{S_a^z} \otimes I_{t\times t}) = \det(A_{S_a^z})^t \neq 0$. Therefore, G satisfies the (k, k)-CCP.



Remark 4. Let A be the generator matrix of a cyclic code over GF (q), then G = A ⊗ It×t satisfies

the (k, k)-CCP by Claim 5.
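As a quick sanity check of Claim 5, the following sketch (our own; the (3, 2) SPC generator is the assumed base matrix A) forms $G = A \otimes I_2$ over GF(2) and verifies the (2, 2)-CCP recovery sets of A and the (4, 4)-CCP recovery sets of G.

```python
from math import gcd

def rank_gf2(rows):
    """Rank of a 0/1 matrix over GF(2)."""
    m = [r[:] for r in rows]
    rank = 0
    for col in range(len(m[0])):
        piv = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if piv is None:
            continue
        m[rank], m[piv] = m[piv], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                m[r] = [a ^ b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def kron_identity(A, t):
    """Kronecker product A (x) I_t."""
    return [[A[i][j] if r == c else 0 for j in range(len(A[0])) for c in range(t)]
            for i in range(len(A)) for r in range(t)]

A = [[1, 0, 1], [0, 1, 1]]           # (3, 2) SPC generator; satisfies the (2, 2)-CCP
z, alpha, t = 2, 3, 2
G = kron_identity(A, t)              # 4 x 6 generator; by Claim 5 the (4, 4)-CCP holds
k, n = z * t, alpha * t

def ccp_kk(M, k, n):
    # check every recovery set {(ak)_n, ..., (ak + k - 1)_n} of consecutive columns
    return all(
        rank_gf2([[M[r][(a * k + i) % n] for i in range(k)]
                  for r in range(len(M))]) == k
        for a in range(n // gcd(n, k)))

print(ccp_kk(A, z, alpha), ccp_kk(G, k, n))   # True True
```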

Our next construction addresses the (k, k + 1)-CCP. In what follows, we use the following notation.

• $\mathbf{1}_a$: the all-ones column vector $[1, \cdots, 1]^T$ of length a;

• $C(c_1, c_2)_{a\times b}$: the a × b matrix where each row is the cyclic shift (one place to the right) of the row above it and the first row is $[c_1 \; c_2 \; 0 \cdots 0]$; and

• $\mathbf{0}_{a\times b}$: the a × b matrix with zero entries.

Consider parameters n, k. Let the greatest common divisor of n and k + 1 be gcd(n, k + 1) = t. It is easy to verify that $z = \frac{k+1}{t}$ is the smallest integer such that $k + 1 \mid nz$. Let n = tα and k + 1 = tz. Claim 6 below constructs a (n, k) linear code that satisfies the CCP over GF(q) where q > α. Since α = n/t, the required field size in Claim 6 is lower than that of the MDS code considered in Section 3.3.1.

Claim 6. Consider a (n, k) linear block code over GF(q) whose generator matrix is specified as follows,
$$G = \begin{bmatrix}
b_{00} I_{t\times t} & \cdots & b_{0(\alpha-1)} I_{t\times t} \\
b_{10} I_{t\times t} & \cdots & b_{1(\alpha-1)} I_{t\times t} \\
\vdots & & \vdots \\
b_{(z-2)0} I_{t\times t} & \cdots & b_{(z-2)(\alpha-1)} I_{t\times t} \\
C(b_{(z-1)0}, b_{(z-1)0})_{(t-1)\times t} & \cdots & C(b_{(z-1)(\alpha-1)}, b_{(z-1)(\alpha-1)})_{(t-1)\times t}
\end{bmatrix} \tag{3.3}$$
where
$$\begin{bmatrix}
b_{00} & b_{01} & \cdots & b_{0(\alpha-1)} \\
b_{10} & b_{11} & \cdots & b_{1(\alpha-1)} \\
\vdots & & & \vdots \\
b_{(z-1)0} & b_{(z-1)1} & \cdots & b_{(z-1)(\alpha-1)}
\end{bmatrix}$$
is a Vandermonde matrix and q > α. Then, G satisfies the (k, k + 1)-CCP.

Proof. The proof again leverages the idea that G can be expressed succinctly by using Kronecker

products. The arguments can be found in the Appendix.



Consider the case when α = z + 1. We construct a (n, k) linear code satisfying the CCP over GF(q) where q ≥ z. Note that this field-size constraint is weaker than the corresponding constraint in Claim 6.

Claim 7. Consider n, k with nz = (z + 1)(k + 1). Consider a (n, k) linear block code whose generator matrix (over GF(q)) is specified as follows.
$$G = \begin{bmatrix}
I_{t\times t} & 0_{t\times t} & \cdots & 0_{t\times t} & 0_{t\times(t-1)} & \mathbf{1}_t & b_1 I_{t\times t} \\
0_{t\times t} & I_{t\times t} & \cdots & 0_{t\times t} & 0_{t\times(t-1)} & \mathbf{1}_t & b_2 I_{t\times t} \\
\vdots & & \ddots & & \vdots & \vdots & \vdots \\
0_{t\times t} & 0_{t\times t} & \cdots & I_{t\times t} & 0_{t\times(t-1)} & \mathbf{1}_t & b_{z-1} I_{t\times t} \\
0_{(t-1)\times t} & 0_{(t-1)\times t} & \cdots & 0_{(t-1)\times t} & I_{(t-1)\times(t-1)} & \mathbf{1}_{t-1} & C(c_1, c_2)_{(t-1)\times t}
\end{bmatrix} \tag{3.4}$$
where t = n − k − 1. If q ≥ z, $b_1, b_2, \cdots, b_{z-1}$ are non-zero and distinct, and $c_1 + c_2 = 0$, then G satisfies the CCP.

Proof. See Appendix.

Given a (n, k) code that satisfies the CCP, we can use it to obtain higher values of n in a simple manner, as discussed in the claim below.

Claim 8. Consider a (n, k) linear block code over GF(q) with generator matrix G that satisfies the CCP. Let the first k + 1 columns of G be denoted by the submatrix D. Then the matrix G′ of dimension k × (n + s(k + 1)), where s ≥ 0,
$$G' = [\underbrace{D \mid \cdots \mid D}_{s} \mid G]$$
also satisfies the CCP.

Proof. See Appendix.

Claim 8 can provide more parameter choices and more possible code constructions. For example, given n, k, q with $k + 1 + (n)_{k+1} \le q + 1 < n$, there may not exist a (n, k) MDS code over GF(q). However, there exists a $(k + 1 + (n)_{k+1}, k)$ MDS code over GF(q). By Claim 8, we can then obtain a (n, k) linear block code over GF(q) that satisfies the CCP. Similarly, combining Claim 4, Claim 6 and Claim 7 with Claim 8, we can obtain more linear block codes that satisfy the CCP.

A result very similar to Claim 8 can be obtained for the (k, α)-CCP. Specifically, consider a (n, k) linear block code with generator matrix G that satisfies the (k, α)-CCP and let D be the first α columns of G. Then $G' = [\underbrace{D \mid \cdots \mid D}_{s} \mid G]$ of dimension k × (n + sα) also satisfies the (k, α)-CCP.
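A small sketch (our own illustration) of the extension in Claim 8: starting from the (4, 3) SPC code over GF(2), which satisfies the (3, 4)-CCP, we form G′ = [D | G] and confirm that every 3 × 3 submatrix of each window of four cyclically consecutive columns (stride k + 1) remains full rank.

```python
from itertools import combinations

def det_int(M):
    """Integer determinant by cofactor expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det_int([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

k, s = 3, 1
G = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]   # (4, 3) SPC code, satisfies (3, 4)-CCP
D = [row[:k + 1] for row in G]                   # first k + 1 columns of G
Gp = [d * s + g_row for d, g_row in zip(D, G)]   # G' = [D | G], an (8, 3) code
n = len(Gp[0])

def ccp_k_kplus1(M, q):
    # every window of k+1 cyclically consecutive columns taken with stride k+1, and
    # every k x k submatrix of it, must be nonsingular over GF(q)
    for a in range(n):
        cols = [(a * (k + 1) + i) % n for i in range(k + 1)]
        for keep in combinations(cols, k):
            if det_int([[row[c] for c in keep] for row in M]) % q == 0:
                return False
    return True

print(ccp_k_kplus1(Gp, 2))   # True: the extended code still satisfies the CCP
```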

3.3.4 Constructions where q is not a prime or a prime power

We now discuss constructions where q is not a prime or a prime power. In this case, we attempt to construct matrices over the ring Z mod q. The issue is somewhat complicated by the fact that a square matrix over Z mod q has linearly independent rows if and only if its determinant is a unit in the ring Dummit and Foote (2003). In general, this fact makes it harder to obtain constructions such as those in Claim 6 that exploit the Vandermonde structure of the matrices; specifically, the difference of units in a ring is not guaranteed to be a unit. However, we can still provide some constructions. It can be observed that Claim 5 and Claim 8 hold for linear block codes over Z mod q. We will use them without proof in this subsection.

Claim 9. Let $G = [I_{k\times k} \mid \mathbf{1}_k]$, i.e., the generator matrix of a (k + 1, k) single parity check (SPC) code, where the entries are from Z mod q. Then G satisfies the (k, k + 1)-CCP and the (k, k)-CCP. It can be used as a base matrix for Claim 5.

Proof. It is not too hard to see that when G = [Ik×k |1k ], any k×k submatrix of G has a determinant

which is ±1, i.e., it is a unit over Z mod q. Thus, the result holds in this case.
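The determinant claim is easy to verify directly for small k. The sketch below (our own, with k = 4 and q = 6 as assumed parameters) lists the determinants of all k × k submatrices of $[I_k \mid \mathbf{1}_k]$ and checks that each is ±1, hence a unit in Z mod 6.

```python
from itertools import combinations

def det_int(M):
    """Integer determinant by cofactor expansion (fine for small matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det_int([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

q, k = 6, 4
G = [[1 if c == r else 0 for c in range(k)] + [1] for r in range(k)]   # [I_k | 1_k]
dets = [det_int([[row[c] for c in cols] for row in G])
        for cols in combinations(range(k + 1), k)]
print(dets)   # every determinant is +1 or -1, i.e. a unit in Z mod q
```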

Claim 10. The following matrix with entries from Z mod q satisfies the (k, k + 1)-CCP. Here k = 2t − 1 and n = 3t.
$$G = \begin{bmatrix}
I_{t\times t} & 0 & \mathbf{1}_t & I_{t\times t} \\
0 & I_{(t-1)\times(t-1)} & \mathbf{1}_{t-1} & C(1, -1)_{(t-1)\times t}
\end{bmatrix}.$$

Proof. This can be proved by following the arguments in the proof of Claim 7 while treating elements to be from Z mod q and setting z = 2. We need to consider three different k × (k + 1) submatrices for which we need to check the property. These correspond to simpler instances of the submatrices considered in Types I - III in the proof of Claim 7. In particular, the corresponding determinants will always be ±1, which are units over Z mod q.

Remark 5. We note that the general construction in Claim 7 can potentially fail in the case when

the matrices are over Z mod q. This is because in one of the cases under consideration (specifically,

Type III, Case 1), the determinant depends on the difference of the bi values. The difference of

units in Z mod q is not guaranteed to be a unit, thus there is no guarantee that the determinant

is a unit.

Remark 6. We can use Claim 8 to obtain higher values of n based on the above two classes of

linear block codes over Z mod q.

While most constructions of cyclic codes are over GF (q), there has been some work on con-

structing cyclic codes over Z mod q. Specifically, Blake (1972) provides a construction where

q = q1 × q2 · · · × qd and qi , i = 1, . . . , d are prime. We begin by outlining this construction. By the

Chinese remainder theorem any element γ ∈ Z mod q has a unique representation in terms of its

residues modulo qi , for i = 1, . . . , d. Let ψ : Z mod q → GF (q1 ) × · · · × GF (qd ) denote this map.

• Suppose that (n, ki ) cyclic codes over GF (qi ) exist for all i = 1, . . . , d. Each individual code

is denoted C i .

• Let C denote the code over Z mod q. Let $c^{(i)} \in C^i$ for i = 1, . . . , d. The codeword c ∈ C is obtained as follows: the j-th component of c is $c_j = \psi^{-1}(c^{(1)}_j, \ldots, c^{(d)}_j)$.

Therefore, there are $q_1^{k_1} q_2^{k_2} \cdots q_d^{k_d}$ codewords in C. It is also evident that C is cyclic. As discussed in Section 3.2.1, we form the matrix T for the codewords in C. It turns out that using T and the technique discussed in Section 3.2.1, we can obtain a resolvable design. Furthermore, the gain of the system in the delivery phase can be shown to be $k_{\min} = \min\{k_1, k_2, \cdots, k_d\}$. We discuss these points in detail in the Appendix (Section 5).
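The construction can be illustrated at a very small scale. The sketch below (our own; the (3, 2) cyclic codes over GF(2) and GF(3) are the assumed component codes) combines two length-3 cyclic codes through the inverse CRT map $\psi^{-1}$ and checks that the resulting code over Z mod 6 is cyclic with $2^2 \cdot 3^2$ codewords.

```python
from itertools import product

n = 3
# (3, 2) cyclic code over GF(2) with g(X) = X + 1: the even-weight words
C1 = [c for c in product(range(2), repeat=n) if sum(c) % 2 == 0]
# (3, 2) cyclic code over GF(3) with g(X) = X - 1: words with coordinate sum 0 mod 3
C2 = [c for c in product(range(3), repeat=n) if sum(c) % 3 == 0]

# inverse CRT map psi^{-1}: GF(2) x GF(3) -> Z mod 6
crt = lambda a, b: (3 * a + 4 * b) % 6
C = {tuple(crt(a, b) for a, b in zip(c1, c2)) for c1 in C1 for c2 in C2}

shift = lambda c: c[-1:] + c[:-1]
print(len(C), all(shift(c) in C for c in C))   # 36 codewords, closed under cyclic shift
```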



3.4 Discussion and Comparison with Existing Schemes

3.4.1 Discussion

When the number of users is K = nq and the cache fraction is M/N = 1/q, we have shown in Theorem 2 that the gain g = k + 1 and $F_s = q^k z$. Therefore, both the gain and the subpacketization level increase with larger k. Thus, for our approach, given a subpacketization budget $F_s^0$, the highest coded gain that can be obtained is denoted by $g_{\max} = k_{\max} + 1$ where $k_{\max}$ is the largest integer such that $q^{k_{\max}} z \le F_s^0$ and there exists a $(n, k_{\max})$ linear block code that satisfies the CCP.

For determining $k_{\max}$, we have to characterize the collection of values of k such that there exists a (n, k) linear code satisfying the CCP over GF(q) or Z mod q. We use our proposed constructions (MDS code, Claim 4, Claim 6, Claim 7, Claim 8, Claim 9, Claim 10) for this purpose. We call this collection C(n, q) and generate it in Algorithm 4. We note here that it is entirely possible that there are other linear block codes that fit the appropriate parameters and are outside the scope of our constructions; thus, the list may not be exhaustive. In addition, we note that we only check for the (k, k + 1)-CCP. Working with the (k, α)-CCP where α ≤ k can provide more operating points.

Example 16. Consider a caching system with K = nq = 12 × 5 = 60 users and cache fraction M/N = 1/5. Suppose that the subpacketization budget is $1.5 \times 10^6$. By checking all k < n we can construct C(n, q) (see Table 3.2). As a result, C(n, q) = {1, 2, 3, 4, 5, 6, 7, 8, 9, 11}. Then $k_{\max} = 8$, $F_s \approx 1.17 \times 10^6$ and the maximal coded gain we can achieve is $g_{\max} = 9$. By contrast, the scheme in Maddah-Ali and Niesen (2014b) can achieve coded gain g = KM/N + 1 = 13 but requires subpacketization level $F_s = \binom{K}{KM/N} \approx 1.4 \times 10^{12}$.
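The numbers in Example 16 can be reproduced directly from Theorem 2. The sketch below is our own; the feasible set of k values is taken from Table 3.2.

```python
from math import comb, gcd

K, n, q = 60, 12, 5
budget = 1.5e6
feasible = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]        # C(n, q) from Table 3.2

def F_s(k):
    z = (k + 1) // gcd(n, k + 1)
    return q ** k * z                              # subpacketization F_s = q^k z

k_max = max(k for k in feasible if F_s(k) <= budget)
print(k_max, F_s(k_max))                           # 8 1171875, so g_max = 9
print(comb(K, K // q))                             # Maddah-Ali--Niesen needs ~1.4e12
```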

We can achieve almost the same rate by performing memory-sharing within the scheme of Maddah-Ali and Niesen (2014b) in this example. In particular, we divide each file of size Ω into two smaller subfiles $W_n^1$ and $W_n^2$, where $|W_n^1| = \frac{9}{10}\Omega$ and $|W_n^2| = \frac{1}{10}\Omega$. The scheme of Maddah-Ali and Niesen (2014b) is then applied separately on $W_n^1$ and $W_n^2$ with $M_1/N_1 = \frac{2}{15}$ (corresponding to $W_n^1$) and $M_2/N_2 = \frac{13}{15}$ (corresponding to $W_n^2$). Thus, the overall cache fraction is $0.9 \times \frac{2}{15} + 0.1 \times \frac{13}{15} \approx \frac{1}{5}$. The overall coded gain of this scheme is g ≈ 9. However, the

Figure 3.2 A comparison of rate and subpacketization level vs. M/N for a system with K = 64 users. The left y-axis shows the rate and the right y-axis shows the logarithm of the subpacketization level. The green and the blue curves correspond to two of our proposed constructions. Note that our schemes allow for multiple orders of magnitude reduction in subpacketization level at the expense of a small increase in coded caching rate.

subpacketization level is $F_s^{MS} = \binom{K}{KM_1/N_1} + \binom{K}{KM_2/N_2} \approx 5 \times 10^9$, which is much greater than the subpacketization budget.

In Fig. 3.2, we present another comparison for system parameters K = 64 and different values

of M/N . The scheme of Maddah-Ali and Niesen (2014b) works for all M/N such that KM/N is an

integer. In Fig. 3.2, our plots have markers corresponding to M/N values that our scheme achieves.

For ease of presentation, both the rate (left y-axis) and the logarithm of the subpacketization

level (right y-axis) are shown on the same plot. We present results corresponding to two of our

construction techniques: (i) the SPC code and (ii) a smaller SPC code coupled with Claim 8. It

can be seen that our subpacketization levels are several orders of magnitude smaller with only a

small increase in the rate.



An in-depth comparison for general parameters is discussed next. In the discussion below, we

shall use the superscript ∗ to refer to the rates and subpacketization levels of our proposed scheme.

3.4.2 Comparison with memory-sharing within the scheme of Maddah-Ali and Niesen (2014b)

Suppose that for given K, M and N, a given rate R can be achieved by the memory sharing of the scheme in Maddah-Ali and Niesen (2014b) between the corner points $(M_1, R_1), (M_2, R_2), \cdots, (M_d, R_d)$, where $M_i = \frac{t_i N}{K}$ for some integer $t_i$. Then
$$R_i = \frac{K(1 - \frac{M_i}{N})}{1 + K\frac{M_i}{N}}, \quad R = \sum_{i=1}^d \lambda_i R_i, \quad \frac{M}{N} = \sum_{i=1}^d \lambda_i \frac{M_i}{N}, \quad \text{and} \quad \sum_{i=1}^d \lambda_i = 1.$$
The subpacketization level is $F_s^{MS} = \sum_{i=1}^d \binom{K}{KM_i/N}$. In addition, we note that the function $h(x) = K(1 - x)/(1 + Kx)$ is convex in the parameter 0 ≤ x ≤ 1. This can be verified by a simple second derivative calculation.


We first argue that $F_s^{MS}$ is lower bounded by $\binom{K}{KM'/N}$, where M′ is obtained as follows. For a given M/N, we first determine λ and $M^*/N$ that satisfy the following equations.
$$R = \lambda \frac{K(1 - \frac{M^*}{N})}{1 + K\frac{M^*}{N}} + (1 - \lambda) \frac{K\frac{M^*}{N}}{1 + K(1 - \frac{M^*}{N})}, \quad \text{and} \tag{3.5}$$
$$\frac{M}{N} = \lambda \frac{M^*}{N} + (1 - \lambda)\left(1 - \frac{M^*}{N}\right). \tag{3.6}$$
Here, $\frac{M^*}{N} \le \frac{1}{2}$, and $M' = \frac{t_0 N}{K}$, where $t_0$ is the least integer such that $M' \ge M^*$.

To see this, consider the following argument. Suppose that the above statement is not true. Then, there exists a scheme that operates via memory sharing between points $(M_1, R_1), \cdots, (M_d, R_d)$ such that $F_s < \binom{K}{KM'/N}$. Note that $\binom{K}{KM_1/N} < \binom{K}{KM_2/N}$ if $\frac{M_1}{N} < \frac{M_2}{N} \le \frac{1}{2}$ or $\frac{M_1}{N} > \frac{M_2}{N} \ge \frac{1}{2}$. By the convexity of h(·), we can conclude that (M, R) is not in the convex hull of the corner points $(M_1, R_1), \cdots, (M_d, R_d)$. This is a contradiction.

Next, we compare this lower bound on $F_s^{MS}$ to the subpacketization level of our proposed scheme. In principle, we can solve the system of equations (3.5) and (3.6) for $R = \frac{n(q-1)}{k+1}$ and $\frac{M}{N} = \frac{1}{q}$ and obtain the appropriate λ and $M^*$ values¹. Unfortunately, doing this analytically becomes quite messy and does not yield much intuition. Instead, we illustrate the reduction in subpacketization level by numerical comparisons.


¹Similar results can be obtained for $\frac{M}{N} = 1 - \frac{k+1}{nq}$.

Algorithm 4: C(n, q) Construction Algorithm


Input : n, q, C(n, q) = ∅
1 if q is a prime power then
2 for k = 1 : (n − 1) do
3 n0 ← (n)k+1 + k + 1;
4 z ← (k + 1)/gcd(n0, k + 1);
5 α ← n0/gcd(n0, k + 1);
6 if there exists a (n0 + i(k + 1), k) cyclic code which satisfies the condition in Claim 4
for some i such that n0 + i(k + 1) ≤ n then
7 C(n, q) ← k. Corresponding codes are constructed by using Claim 8.
8 else
9 if z ≤ 2 then
10 C(n, q) ← k. Corresponding codes are constructed by SPC code and Claim 8
when z = 1 or Claim 7 and Claim 8 when z = 2.
11 else
12 if q + 1 ≥ n0 then
13 C(n, q) ← k. Corresponding codes are constructed by MDS code and
Claim 8.
14 else
15 if α = z + 1 and q ≥ z then
16 C(n, q) ← k. Corresponding codes are constructed by Claim 7 and
Claim 8.
17 end
18 if α > z + 1 and q > α then
19 C(n, q) ← k. Corresponding codes are constructed by Claim 6 and
Claim 8.
20 end
21 end
22 end
23 end
24 end
25 end
26 if q is not a prime power then
27 for k = 1 : (n − 1) do
28 if z ≤ 2 then
29 C(n, q) ← k. Corresponding codes are constructed by Claim 9 and Claim 8 when z = 1, or by Claim 10 and Claim 8 when z = 2.
30 end
31 end
32 end
Output: C(n, q)

Example 17. Consider a (9, 5) linear block code over GF(2) with generator matrix specified below.
$$G = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1
\end{bmatrix}.$$
It can be checked that G satisfies the (5, 6)-CCP. Thus, it corresponds to a coded caching system with K = 9 × 2 = 18 users. Our scheme achieves the points $\frac{M_1}{N} = \frac{1}{2}$, $R_1 = \frac{3}{2}$, $F_{s,1}^* = 64$ and $\frac{M_2}{N} = \frac{2}{3}$, $R_2 = \frac{2}{3}$, $F_{s,2}^* = 96$.

On the other hand, for $\frac{M_1}{N} = \frac{1}{2}$, $R_1 = \frac{3}{2}$, by numerically solving (3.5) and (3.6) we obtain $\frac{M_1^*}{N} \approx 0.227$ and therefore $\frac{M_1'}{N} = \frac{5}{18}$. Then $F_{s,1}^{MS} \ge \binom{18}{5} = 8568$, which is much higher than $F_{s,1}^* = 64$. A similar calculation shows that $\frac{M_2^*}{N} \approx \frac{1}{4}$ and therefore $\frac{M_2'}{N} = \frac{5}{18}$. Thus $F_{s,2}^{MS}$ is also at least as large as 8568, which is still much higher than $F_{s,2}^* = 96$.
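The numerical solution quoted in Example 17 can be reproduced with a simple bisection. This is our own sketch; the closed-form expression for λ is obtained by rearranging (3.6).

```python
from math import comb, ceil

K = 18
h = lambda x: K * (1 - x) / (1 + K * x)       # Maddah-Ali--Niesen rate at cache fraction x

def m_star(mn, R):
    """Bisection for M*/N in (0, 1/2) solving (3.5), with lambda eliminated via (3.6)."""
    lo, hi = 1e-9, 0.5 - 1e-9
    for _ in range(100):
        x = (lo + hi) / 2
        lam = (mn - (1 - x)) / (x - (1 - x))  # from (3.6)
        if lam * h(x) + (1 - lam) * h(1 - x) > R:
            lo = x                             # memory-shared rate too high: increase x
        else:
            hi = x
    return x

x = m_star(0.5, 1.5)                           # the point M1/N = 1/2, R1 = 3/2
t0 = ceil(K * x)                               # least t0 with M'/N = t0/K >= M*/N
print(round(x, 2), t0, comb(K, t0))            # M*/N ~ 0.23, t0 = 5, C(18, 5) = 8568
```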

The next set of comparisons are with other proposed schemes in the literature. We note here

that several of these are restrictive in the parameters that they allow.

3.4.3 Comparison with Maddah-Ali and Niesen (2014b), Yan et al. (2017a), Yan et al. (2017b), Shangguan et al. (2018) and Shanmugam et al. (2017)

For comparison with Maddah-Ali and Niesen (2014b), let $R^{MN}$ and $F_s^{MN}$ denote the rate and the subpacketization level of the scheme of Maddah-Ali and Niesen (2014b), respectively. For the rate comparison, we note that
$$\frac{R^*}{R^{MN}} = \frac{1+n}{1+k}, \quad \text{for } \frac{M}{N} = \frac{1}{q}, \quad \text{and}$$
$$\frac{R^*}{R^{MN}} = \frac{nq-k}{nq-n}, \quad \text{for } \frac{M}{N} = 1 - \frac{1+k}{nq}.$$
For the comparison of subpacketization level we have the following results.

Claim 11. When K = nq, the following results hold.



Figure 3.3 The plot shows the gain in the scaling exponent obtained using our techniques for different values of M/N = 1/q. Each curve corresponds to a choice of η = k/n (η = 0.25, 0.5, 0.75, 0.99); the x-axis is q and the y-axis is the scaling exponent gain.

• If $\frac{M}{N} = \frac{1}{q}$, we have
$$\lim_{n\to\infty} \frac{1}{K} \log_2 \frac{F_s^{MN}}{F_s^*} = H_2\!\left(\frac{1}{q}\right) - \frac{\eta}{q} \log_2 q. \tag{3.7}$$

• If $\frac{M}{N} = 1 - \frac{k+1}{nq}$, we have
$$\lim_{n\to\infty} \frac{1}{K} \log_2 \frac{F_s^{MN}}{F_s^*} = H_2\!\left(\frac{\eta}{q}\right) - \frac{\eta}{q} \log_2 q. \tag{3.8}$$

In the above expressions, 0 < η = k/n ≤ 1 and $H_2(\cdot)$ represents the binary entropy function.

Proof. Both results are simple consequences of approximating $\binom{K}{pK} \approx 2^{KH_2(p)}$ Graham et al. (1994). The derivations can be found in the Appendix.
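The limit in (3.7) can be checked numerically. This sketch is our own; q = 2 and η = 1/2 are assumed values, and the z factor in $F_s^*$ is dropped since it contributes only O(log K) bits.

```python
from math import comb, log2

def h2(p):
    """Binary entropy function."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

q, eta, n = 2, 0.5, 1000                 # M/N = 1/q, eta = k/n
K, k = n * q, int(eta * n)
log_fs_mn = log2(comb(K, K // q))        # log2 of binom(K, KM/N)
log_fs_star = k * log2(q)                # log2 of q^k (the z factor is negligible)
gain = (log_fs_mn - log_fs_star) / K
limit = h2(1 / q) - (eta / q) * log2(q)  # right-hand side of (3.7)
print(round(gain, 3), round(limit, 3))
```

Already at n = 1000 the finite-n gain is within 0.01 of the limiting value.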

It is not too hard to see that $F_s^*$ is exponentially lower than $F_s^{MN}$. Thus, our rate is higher, but the subpacketization level is exponentially lower. The gain in the scaling exponent with respect to the scheme of Maddah-Ali and Niesen (2014b) depends on the choice of R and the value of M/N. In Fig. 3.3 we plot this value for different values of R and q. The plot assumes that codes satisfying the CCP can be found for these rates and corresponds to the gain in eq. (3.7).

In Yan et al. (2017a) a scheme for the case when M/N = 1/q or M/N = 1 − 1/q, with subpacketization level exponentially smaller with respect to Maddah-Ali and Niesen (2014b), was presented. This result can be recovered as a special case of our work (Theorem 2) when the linear block code is chosen as a single parity check code over Z mod q. In this specific case, q does not need to be a prime power. Thus, our results subsume the results of Yan et al. (2017a).
In a more recent preprint, reference Yan et al. (2017b) proposed a caching system with $K = \binom{m}{a}$, $F_s = \binom{m}{b}$ and $\frac{M}{N} = 1 - \lambda\binom{m-a}{b-\beta}/\binom{m}{b}$. The corresponding rate is
$$R = \frac{\binom{m}{a+b-2\beta}}{\binom{m}{b}} \min\left\{ \frac{m-(a+b-2\beta)}{\lambda}, \binom{a+b-2\beta}{a-\beta} \right\},$$
where m, a, b, β are positive integers and 0 < a < m, 0 < b < m, 0 ≤ β ≤ min{a, b}. While a precise comparison is somewhat hard, we can compare the schemes for certain parameter choices that were also considered in Yan et al. (2017b).

Let a = 2, β = 1, m = 2b. This corresponds to a coded caching system with $K = b(2b-1) \approx 2b^2$, $\frac{M}{N} = \frac{b-1}{2b-1} \approx \frac{1}{2}$, $F_s = \binom{2b}{b} \approx 2^{2b}$, R = b. For comparison with our scheme we keep the transmission rates of both schemes roughly the same and let $n = b^2$, q = 2, k = b − 1. We assume that the corresponding linear block code exists. Then $F_s^* \approx 2^b$, which is better than $F_s$.

On the other hand, if we let β = 0, a = 2, m = 2qb, we obtain a coded caching system with $K = \frac{m(m-1)}{2}$, $\frac{M}{N} \approx \frac{1}{q}$, $F_s = \binom{m}{m/(2q)} \approx (2q)^{m/(2q)}$, $R^{YAN} = (2q-1)^2$. For keeping the rates the same, we let $n = \frac{m(m-1)}{2q}$, $k = \frac{m(m-1)}{4q(2q-1)} - 1$ so that $F_s^* \approx q^{\frac{m(m-1)}{4q(2q-1)}} \approx q^{\frac{m^2}{8q^2}}$. In this regime, the subpacketization level of Yan et al. (2017b) will typically be lower.

The work of Shangguan et al. (2018) proposed caching schemes with parameters (i) $K = \binom{m}{a}$, $\frac{M}{N} = 1 - \binom{m-a}{b}/\binom{m}{b}$, $F_s = \binom{m}{b}$ and $R = \binom{m}{a+b}/\binom{m}{b}$, where a, b, m are positive integers and a + b ≤ m, and (ii) $K = \binom{m}{t} q^t$, $\frac{M}{N} = 1 - \frac{1}{q^t}$, $F_s = q^m (q-1)^t$ and $R = \frac{1}{(q-1)^t}$, where q, t, m are positive integers.

Their scheme (i) is the special case of the scheme in Yan et al. (2017b) when β = 0. For the second scheme, if we let t = 2, Shangguan et al. (2018) shows that $R \approx R^*$, $F_s \approx q^{\frac{\sqrt{2K}}{q}}(q-1)^2$ and $F_s^* \approx (q-1)q^{\frac{K}{q}-1}$, which means $F_s$ is again better than $F_s^*$. We emphasize here that these results require somewhat restrictive parameter settings.

Finally, we consider the work of Shanmugam et al. (2017). In their work, they leveraged the results of Alon et al. (2012) to arrive at coded caching schemes where the subpacketization is linear in K. Specifically, they show that for any constant M/N, there exists a scheme with rate $K^\delta$, where δ > 0 can be chosen arbitrarily small by choosing K large enough. From a theoretical perspective, this is a positive result that indicates that there are regimes where linear subpacketization scaling is possible. However, these results are only valid when the value of K is very large. Specifically, $K = C^n$ for a constant C and the result is asymptotic in the parameter n. For these parameter ranges, the result of Shanmugam et al. (2017) will clearly be better as compared to our work.

3.5 Conclusions and Future Work

In this work we have demonstrated a link between specific classes of linear block codes and

the subpacketization problem in coded caching. Crucial to our approach is the consecutive col-

umn property which enforces that certain consecutive column sets of the corresponding generator

matrices are full-rank. We present several constructions of such matrices that cover a large range

of problem parameters. Leveraging this approach allows us to construct families of coded caching

schemes where the subpacketization level is exponentially smaller compared to the approach of

Maddah-Ali and Niesen (2014b).

There are several opportunities for future work. Even though our subpacketization level is sig-

nificantly lower than Maddah-Ali and Niesen (2014b), it still scales exponentially with the number

of users. Of course, the rate of growth with the number of users is much smaller. There have been

some recent results on coded caching schemes that demonstrate the existence of schemes where

the subpacketization scales sub-exponentially in the number of users. It would be interesting to

investigate whether some of these ideas can be leveraged to obtain schemes that work for practical

systems with tens or hundreds of users.



Table 3.1 A summary of the different constructions of CCP matrices in Section 3.3

Codes over the field GF(q):
• (n, k) MDS codes: satisfy the (k, k + 1)-CCP; need q + 1 ≥ n.
• (n, k) cyclic codes: existence depends on certain properties of the generator polynomials. All cyclic codes satisfy the (k, k)-CCP; additional conditions are needed for the (k, k + 1)-CCP.
• Kronecker product of a z × α matrix satisfying the (z, z)-CCP with the identity matrix $I_{t\times t}$: satisfies the (k, k)-CCP, where k = tz.
• Kronecker product of Vandermonde and Vandermonde-like matrices with structured base matrices: satisfies the (k, k + 1)-CCP for certain parameters.
• CCP matrix extension: extends a k × n CCP matrix to a k × (n + s(k + 1)) CCP matrix for integer s.

Codes over the ring Z mod q:
• Single parity-check (SPC) code: satisfies the (k, k + 1)-CCP with n = k + 1.
• Cyclic codes over the ring: require that $q = q_1 \times q_2 \times \cdots \times q_d$ where the $q_i$'s are prime; satisfy the (k, k)-CCP.
• Kronecker product of a z × α matrix satisfying the (z, z)-CCP with the identity matrix $I_{t\times t}$: satisfies the (k, k)-CCP, where k = tz.
• CCP matrix extension: extends a k × n CCP matrix to a k × (n + s(k + 1)) CCP matrix for integer s.

Table 3.2 List of k values for Example 16. The values of n0, α and z are obtained by following Algorithm 4.

k = 11: n0 = 12, z = 1, α = 1; (12, 11) SPC code (k + 1 = n0).
k = 10: n0 = 12, z = 11, α = 12; no construction available.
k = 9: n0 = 12, z = 5, α = 6; Claim 7 (α = z + 1 and q ≥ z).
k = 8: n0 = 12, z = 3, α = 4; Claim 7 (α = z + 1 and q ≥ z).
k = 7: n0 = 12, z = 2, α = 3; Claim 7 (α = z + 1 and q ≥ z).
k = 6: n0 = 12, z = 7, α = 12; Claim 4, with generator polynomial $X^6 + X^5 + 3X^4 + 3X^3 + X^2 + 4X + 3$.
k = 5: n0 = 6, z = 1, α = 1; (6, 5) SPC code and Claim 8 (extend the (6, 5) SPC code to a (12, 5) code).
k = 4: n0 = 7, z = 5, α = 7; Claim 4, with generator polynomial $X^8 + X^7 + 4X^6 + 3X^5 + 2X^3 + X^2 + 4X + 4$.
k = 3: n0 = 4, z = 1, α = 1; (4, 3) SPC code and Claim 8 (extend the (4, 3) SPC code to a (12, 3) code).
k = 2: n0 = 3, z = 1, α = 1; (3, 2) SPC code and Claim 8 (extend the (3, 2) SPC code to a (12, 2) code).
k = 1: n0 = 2, z = 1, α = 1; (2, 1) SPC code and Claim 8 (extend the (2, 1) SPC code to a (12, 1) code).

CHAPTER 4. ERASURE CODING FOR DISTRIBUTED MATRIX MULTIPLICATION FOR MATRICES WITH BOUNDED ENTRIES

The work of Yu et al. (2020) considers the distributed computation of the product of two large

matrices AT and B, which are respectively partitioned into p × m and p × n blocks of submatrices

of equal size by the master node. The key result of Yu et al. (2020) shows that the product AT B

can be recovered as long as any τ = pmn + p − 1 workers complete their computation.

Interestingly, similar ideas (relating matrix multiplication to polynomial interpolation) were investigated in a different context by Yagle (1995) in the mid-1990s. However, the motivation for that work was fast matrix multiplication using pseudo-number theoretic transforms, rather than fault tolerance. There have been other contributions in this area Dutta et al. (2016); Lee et al. (2018, 2017); Mallick et al. (2019); Wang et al. (2018) as well, some of which predate Yu et al. (2020).

4.1 Problem Formulation and Main Contribution

4.1.1 Problem Formulation

Let A (size v × r) and B (size v × t) be two integer matrices¹. We are interested in computing $C \triangleq A^T B$ in a distributed fashion. Specifically, each worker node can store a 1/mp fraction of matrix A and a 1/np fraction of matrix B. The job given to the worker node is to compute the product of the submatrices assigned to it. The master node waits for a sufficient number of the submatrix products to be communicated to it. It then determines the final result after further processing at its end.

¹Floating point matrices with limited precision can be handled with appropriate scaling.

More precisely, matrices A and B are first block decomposed as follows:

$$A = [A_{ij}],\ 0 \le i < p,\ 0 \le j < m, \quad \text{and} \quad B = [B_{kl}],\ 0 \le k < p,\ 0 \le l < n,$$
where the $A_{ij}$'s and the $B_{kl}$'s are of dimension $\frac{v}{p} \times \frac{r}{m}$ and $\frac{v}{p} \times \frac{t}{n}$ respectively. The master node forms the polynomials
$$\tilde{A}(s, z) = \sum_{i,j} A_{ij}\, s^{\lambda_{ij}} z^{\rho_{ij}}, \quad \text{and} \quad \tilde{B}(s, z) = \sum_{k,l} B_{kl}\, s^{\gamma_{kl}} z^{\delta_{kl}},$$

where λij , ρij , γkl and δkl are suitably chosen integers. Following this, the master node evaluates

Ã(s, z) and B̃(s, z) at a fixed positive integer s and carefully chosen points z ∈ {z1 , . . . , zK } (which

can be real or complex) where K is the number of worker nodes. Note that this only requires

scalar multiplication and addition operations on the part of the master node. Subsequently, it

sends matrices Ã(s, zi ) and B̃(s, zi ) to the i-th worker node.

The i-th worker node computes the product ÃT (s, zi )B̃(s, zi ) and sends it back to the master

node. Let 1 ≤ τ ≤ K denote the minimum number of worker nodes such that the master node can

determine the required product (i.e., matrix C) once any τ of the worker nodes have completed

their assigned jobs. We call τ the recovery threshold of the scheme. In Yu et al. (2020), τ is shown

to be pmn + p − 1.

4.1.2 Main Contribution

In this work, we demonstrate that as long as the entries in A and B are bounded by sufficiently small numbers, the recovery threshold τ can be significantly reduced as compared to the approach of Yu et al. (2020). Specifically, the recovery threshold in our work can be of the form p′mn + p′ − 1 where p′ is a divisor of p. Thus, we can achieve thresholds as low as mn (which is optimal), depending on our assumptions on the matrix entries. We show that the required upper bound on the matrix entries can be traded off with the corresponding threshold in a simple manner. Finally, we present experimental results that demonstrate the superiority of our method via an Amazon Web Services (AWS) implementation.

4.2 Reduced Recovery Threshold codes

4.2.1 Motivating example

Let m = n = p = 2 so that the following block decomposition holds
$$A = \begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{bmatrix}.$$
We let
$$\tilde{A}(s, z) = A_{00} + A_{10} s^{-1} + (A_{01} + A_{11} s^{-1}) z, \quad \text{and}$$
$$\tilde{B}(s, z) = B_{00} + B_{10} s + (B_{01} + B_{11} s) z^2.$$

The product $\tilde{A}^T(s, z)\tilde{B}(s, z)$ can be verified to be
$$\tilde{A}^T(s,z)\tilde{B}(s,z) = s^{-1}(A_{10}^T B_{00} + A_{11}^T B_{00} z + A_{10}^T B_{01} z^2 + A_{11}^T B_{01} z^3) \tag{4.1}$$
$$+\; C_{00} + C_{10} z + C_{01} z^2 + C_{11} z^3 \tag{4.2}$$
$$+\; s(A_{00}^T B_{10} + A_{01}^T B_{10} z + A_{00}^T B_{11} z^2 + A_{01}^T B_{11} z^3). \tag{4.3}$$
Evidently, the product above contains the useful terms in (4.2) as coefficients of $z^k$ for k = 0, . . . , 3. The other two lines contain terms (coefficients of $s^{-1} z^k$ and $s z^k$, k = 0, . . . , 3) that we are not interested in; we refer to these as interference terms. Rearranging the terms, we have
$$\tilde{A}^T(s,z)\tilde{B}(s,z) = \underbrace{(\ast s^{-1} + C_{00} + \ast s)}_{X_{00}} + \underbrace{(\ast s^{-1} + C_{10} + \ast s)}_{X_{10}}\, z + \underbrace{(\ast s^{-1} + C_{01} + \ast s)}_{X_{01}}\, z^2 + \underbrace{(\ast s^{-1} + C_{11} + \ast s)}_{X_{11}}\, z^3,$$

where ∗ denotes an interference term.

As the above polynomial is of z-degree 3, equivalently we have presented a coding strategy

where we recover superposed useful and interference terms even in the presence of K − 4 erasures.

Now, suppose that the absolute value of each entry in C and of each of the interference terms

is < L. Furthermore, assume that s ≥ 2L. The Cij ’s can then be recovered by exploiting the fact

that s ≥ 2L, e.g., for non-negative matrices A and B, we can simply extract the integer part of

each Xij and compute its remainder upon division by s. The case of general A and B is treated in

Section 4.2.2.

To summarize, under our assumptions on the maximum absolute value of the matrix C and

the interference matrix products, we can obtain a scheme with a threshold of 4. In contrast, the

scheme of Yu et al. (2020) would have a threshold of 9.
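The full pipeline for this example, i.e., evaluation at 4 points, interpolation, and unentangling by s, can be sketched end to end. This is our own illustration: exact rational arithmetic stands in for real or complex evaluation points, and the 2 × 2 blocks are taken to be scalars (v/p = r/m = t/n = 1) so that the demo stays tiny; all numerical choices below are ours.

```python
from fractions import Fraction

A = [[1, 2], [3, 4]]                  # blocks A_00, A_01 / A_10, A_11 (scalars here)
B = [[5, 6], [7, 8]]
L = 50                                # |C_ij| and all interference products are < L
s = 2 * L                             # the scheme requires s >= 2L

def A_tilde(z):
    return (A[0][0] + Fraction(A[1][0], s)) + (A[0][1] + Fraction(A[1][1], s)) * z

def B_tilde(z):
    return (B[0][0] + B[1][0] * s) + (B[0][1] + B[1][1] * s) * z ** 2

# results from any 4 workers suffice to interpolate the degree-3 polynomial in z
zs = [Fraction(i) for i in (1, 2, 3, 4)]
ys = [A_tilde(z) * B_tilde(z) for z in zs]

def poly_coeffs(xs, ys):
    """Solve the Vandermonde system for the coefficients, exactly over the rationals."""
    m = [[x ** j for j in range(len(xs))] + [y] for x, y in zip(xs, ys)]
    for i in range(len(m)):
        piv = next(r for r in range(i, len(m)) if m[r][i] != 0)
        m[i], m[piv] = m[piv], m[i]
        m[i] = [v / m[i][i] for v in m[i]]
        for r in range(len(m)):
            if r != i and m[r][i] != 0:
                f = m[r][i]
                m[r] = [a - f * b for a, b in zip(m[r], m[i])]
    return [row[-1] for row in m]

X = poly_coeffs(zs, ys)               # X_00, X_10, X_01, X_11 in z-degree order
recover = lambda x: int(x) % s        # entries are non-negative, so this unentangles C_ij
C_rec = [[recover(X[0]), recover(X[2])],
         [recover(X[1]), recover(X[3])]]
C = [[sum(A[u][i] * B[u][j] for u in range(2)) for j in range(2)] for i in range(2)]
print(C_rec == C, C_rec)              # True [[26, 30], [38, 44]]
```

The fractional part of each $X_{ij}$ holds the $s^{-1}$ interference, the integer part holds $C_{ij} + s \cdot (\text{interference})$, and the residue mod s isolates $C_{ij}$ exactly because $s \ge 2L$.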

Remark 7. We emphasize that the choice of polynomials Ã(s, z) and B̃(s, z) are quite different in

our work as compared to Yu et al. (2020); this can be verified by setting s = 1 in the expressions. In

particular, our choice of polynomials deliberately creates the controlled superposition of useful and

interference terms (the choice of coefficients in Yu et al. (2020) explicitly avoids the superposition).

We unentangle the superposition by using our assumptions on the matrix entries later. To the best of our knowledge, this unentangling idea first appeared in the work of Yagle (1995), though its motivations

were different.

4.2.2 General code construction

We now present the most general form of our result. Let the block decomposed matrices A and

B be of size p × m and p × n respectively. We form the polynomials Ã(s, z) and B̃(s, z) as follows
\[ \tilde{A}(s, z) = \sum_{i=0}^{m-1} \sum_{u=0}^{p-1} z^{i} A_{ui}\, s^{-u}, \quad \text{and} \]
\[ \tilde{B}(s, z) = \sum_{j=0}^{n-1} \sum_{v=0}^{p-1} z^{mj} B_{vj}\, s^{v}. \]

Under this choice of polynomials Ã(s, z) and B̃(s, z), we have


\[ \tilde{A}^{T}(s, z)\tilde{B}(s, z) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \sum_{u=0}^{p-1} \sum_{v=0}^{p-1} A_{ui}^{T} B_{vj}\, z^{mj+i} s^{v-u}. \quad (4.4) \]

To better understand the behavior of this sum, we divide it into the following cases.

• Case 1: Useful terms. These are the terms with coefficients of the form A_{ui}^T B_{uj}. They are useful since C_{ij} = Σ_{u=0}^{p−1} A_{ui}^T B_{uj}. It is easy to check that the term A_{ui}^T B_{uj} is the coefficient of z^{mj+i}.

• Case 2: Interference terms. Conversely, the terms in (4.4) with coefficient A_{ui}^T B_{vj}, u ≠ v, are the interference terms; they are the coefficients of z^{mj+i} s^{v−u} (for v ≠ u).

Based on the above discussion, we obtain


\[ \tilde{A}^{T}(s, z)\tilde{B}(s, z) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} z^{mj+i} \underbrace{\left(\ast s^{-(p-1)} + \cdots + \ast s^{-1} + C_{ij} + \ast s + \cdots + \ast s^{p-1}\right)}_{X_{ij}}, \quad (4.5) \]

where ∗ denotes an interference term. Note that (4.5) consists of consecutive powers z^k for k = 0, . . . , mn − 1.

We choose distinct values zi for worker i (real or complex). Suppose that the absolute value of

each Cij and of each interference term (marked with ∗) is at most L − 1. We choose s ≥ 2L.

4.2.3 Decoding algorithm

We now show that as long as at least mn of the worker nodes return their computations, the

master node can recover the matrix C.

Suppose the master node obtains the result Y_i = Ã^T(s, z_i)B̃(s, z_i) from any mn workers i_1, i_2, . . . , i_{mn}. Then, it can recover X_{ij}, i = 0, . . . , m − 1, j = 0, . . . , n − 1, by solving the following equations:
\[ \begin{bmatrix} Y_{i_1} \\ Y_{i_2} \\ \vdots \\ Y_{i_{mn}} \end{bmatrix} = \begin{bmatrix} 1 & z_{i_1} & z_{i_1}^{2} & \cdots & z_{i_1}^{mn-1} \\ 1 & z_{i_2} & z_{i_2}^{2} & \cdots & z_{i_2}^{mn-1} \\ \vdots & & \cdots & & \vdots \\ 1 & z_{i_{mn}} & z_{i_{mn}}^{2} & \cdots & z_{i_{mn}}^{mn-1} \end{bmatrix} \begin{bmatrix} X_{00} \\ X_{01} \\ \vdots \\ X_{(m-1)(n-1)} \end{bmatrix}. \]

The Vandermonde form of the above matrix guarantees the uniqueness of the solution. This is because the determinant of a Vandermonde matrix can be expressed as $\prod_{1 \le a < b \le mn} (z_{i_b} - z_{i_a})$, which is non-zero since the $z_{i_j}$, j = 1, . . . , mn, are distinct.

Note that X_{ij} = ∗s^{−(p−1)} + · · · + ∗s^{−1} + C_{ij} + ∗s + · · · + ∗s^{p−1}. The master node can recover C_{ij} from X_{ij} as follows. We first round X_{ij} to the closest integer. This allows us to recover C_{ij} + ∗s + · · · + ∗s^{p−1}. This is because
\[ \left|\ast s^{-(p-1)} + \cdots + \ast s^{-1}\right| \le \frac{L-1}{2L-1} < \frac{1}{2}. \]

Next, we determine $\hat{C}_{ij} = (C_{ij} + \ast s + \cdots + \ast s^{p-1}) \bmod s$ (we work under the convention that the

modulo output always lies between 0 and s − 1). It is easy to see that if Ĉij ≤ s/2 then Cij = Ĉij ,

otherwise Cij is negative and Cij = −(s − Ĉij ). If s is a power of 2, the modulo operation can be

performed by simple bit-shifting; this is the preferred choice.
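The rounding-and-modulo step described above can be sketched as follows; `strip_interference` is a hypothetical helper name, and the sample values of X, s, and p below are illustrative.

```python
import numpy as np

def strip_interference(X, s, p):
    """Recover C from X = sum_{k=1}^{p-1} a_k s^{-k} + C + sum_{k=1}^{p-1} b_k s^k,
    where a_k, b_k, C are integers with absolute value < s/2 (cf. Section 4.2.3)."""
    # Rounding removes the negative powers of s (their sum has magnitude < 1/2).
    Y = np.round(X).astype(np.int64)
    # Reduce mod s; numpy's % always returns values in [0, s).
    C_hat = Y % s
    # Outputs above s/2 encode negative values of C: C = -(s - C_hat).
    return np.where(C_hat <= s // 2, C_hat, C_hat - s)

s = 16  # a power of 2, so the modulo reduces to masking low-order bits
X = np.array([3 + 2 / s + 5 * s, -4 + 1 / s - 7 * s], dtype=float)
assert strip_interference(X, s, p=2).tolist() == [3, -4]
```

Note that negative entries of C come out of the modulo as values in (s/2, s), which is why the final comparison against s/2 is needed.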

4.2.4 Discussion of precision issues

The maximum and the minimum values (integer or floating point) that can be stored and

manipulated on a computer have certain limits. Assuming s = 2L, it is easy to see that |X_{ij}| is at most (2L)^p/2. Therefore, large values of L and p can potentially cause numerical issues (overflow and/or underflow). We note here that a simple but rather conservative way to estimate the value of L would be to set it equal to v · max |A| × max |B| + 1.

4.3 Trading off precision and threshold

The method presented in Section 4.2 achieves a threshold of mn while requiring that the LHS of (4.5) remain within the range of numeric values that can be represented on the machine. In general, the terms in (4.5) will depend on the choice of the z_i's and the values of the |X_{ij}|'s, e.g., choosing the z_i's to be complex roots of unity will imply that our method requires mn × (2L)^p/2 to be within the range of values that can be represented.

We now present a scheme that allows us to trade off the precision requirements with the recovery

threshold of the scheme, i.e., we can loosen the requirement on L and p at the cost of an increased

threshold.

Assume that p′ is an integer that divides p. We form the polynomials Ã(s, z) and B̃(s, z) as follows,
\[ \tilde{A}(s, z) = \sum_{i=0}^{m-1} \sum_{j=0}^{p'-1} \sum_{k=0}^{p/p'-1} z^{j + p'i}\, A_{(k + \frac{p}{p'}j),\,i}\, s^{k}, \quad \text{and} \]
\[ \tilde{B}(s, z) = \sum_{u=0}^{n-1} \sum_{v=0}^{p'-1} \sum_{w=0}^{p/p'-1} z^{mp'u + (p'-1-v)}\, B_{(w + \frac{p}{p'}v),\,u}\, s^{-w}. \]

Note that in the expressions above we use A_{i,j} to represent the (i, j)-th entry of A (rather than A_{ij}). Next, we have
\[ \tilde{A}(s, z)^{T}\tilde{B}(s, z) = \sum_{i=0}^{m-1} \sum_{j=0}^{p'-1} \sum_{k=0}^{p/p'-1} \sum_{u=0}^{n-1} \sum_{v=0}^{p'-1} \sum_{w=0}^{p/p'-1} A_{(k+\frac{p}{p'}j),\,i}^{T} B_{(w+\frac{p}{p'}v),\,u}\, z^{mp'u + (p'-1-v) + j + p'i}\, s^{k-w}. \quad (4.6) \]

To better understand the behavior of (4.6), we again divide it into useful terms and interference

terms.

• Case 1: Useful terms. These are the terms with coefficients of the form $A_{(k+\frac{p}{p'}j),\,i}^{T} B_{(k+\frac{p}{p'}j),\,u}$. The term $A_{(k+\frac{p}{p'}j),\,i}^{T} B_{(k+\frac{p}{p'}j),\,u}$ is the coefficient of $z^{mp'u + p'i + p'-1}$.

• Case 2: Interference terms. The interference terms are associated with the terms with coefficient $A_{(k+\frac{p}{p'}j),\,i}^{T} B_{(w+\frac{p}{p'}v),\,u}$, k ≠ w and/or j ≠ v. They can be written as
\[ A_{(k+\frac{p}{p'}j),\,i}^{T} B_{(w+\frac{p}{p'}v),\,u}\, z^{mp'u + (p'-1-v) + j + p'i}\, s^{k-w}. \]

We now verify that the interference terms and useful terms are distinct. This is evident when k ≠ w by examining the exponent of s. When k = w but j ≠ v we argue as follows. Suppose that there exist some u_1, u_2, i_1, i_2 such that mp′u_1 + p′i_1 + p′ − 1 = mp′u_2 + p′i_2 + p′ − 1 − v + j. Then, mp′(u_1 − u_2) + p′(i_1 − i_2) = j − v. This is impossible since 0 < |j − v| < p′.

Next, we discuss the degree of Ã(s, z)^T B̃(s, z) in the variable z. In (4.6), the terms with maximal z-degree are those with u = n − 1, v = 0, j = p′ − 1 and i = m − 1. Thus, the maximal degree of z in the expression is mnp′ + p′ − 2. It can be verified that terms with z-degree from 0 to mnp′ + p′ − 2 appear in (4.6), and the z-degrees of the useful terms C_{iu} are mp′u + p′i + p′ − 1, i = 0, · · · , m − 1, u = 0, · · · , n − 1. Likewise, the s-degree of Ã(s, z)^T B̃(s, z) varies over −(p/p′ − 1), . . . , 0, . . . , (p/p′ − 1), with the useful terms corresponding to s^0. Based on the above discussion, we obtain


\[ \tilde{A}^{T}(s, z)\tilde{B}(s, z) = \sum_{k=0}^{mnp'+p'-2} X_{k}\, z^{k}, \quad \text{where} \]
\[ X_{k} = \begin{cases} \ast s^{-(\frac{p}{p'}-1)} + \cdots + \ast s^{-1} + C_{ij} + \ast s + \cdots + \ast s^{\frac{p}{p'}-1}, & \text{if } k = mp'j + p'i + p'-1, \\[4pt] \ast s^{-(\frac{p}{p'}-1)} + \cdots + \ast s^{-1} + \ast + \ast s + \cdots + \ast s^{\frac{p}{p'}-1}, & \text{otherwise.} \end{cases} \]

Evidently, the recovery threshold is mnp′ + p′ − 1, which is higher than that of the construction in Section 4.2.2. However, with s = 2L, the maximum value of |X_k| is at most (2L)^{p/p′}/2, which is smaller than in the previous construction if p′ > 1.

Example 18. Let m = n = 2, p = 4 and p′ = 2 so that
\[ \mathbf{A} = \begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \\ A_{20} & A_{21} \\ A_{30} & A_{31} \end{bmatrix} \quad \text{and} \quad \mathbf{B} = \begin{bmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \\ B_{20} & B_{21} \\ B_{30} & B_{31} \end{bmatrix}. \]

Figure 4.1 Comparison of total computation latency by simulating up to 8 stragglers

We let
\[ \tilde{A}(s, z) = A_{00} + A_{10}s^{-1} + (A_{20} + A_{30}s^{-1})z + (A_{01} + A_{11}s^{-1})z^{2} + (A_{21} + A_{31}s^{-1})z^{3}, \quad \text{and} \]
\[ \tilde{B}(s, z) = (B_{00} + B_{10}s)z + B_{20} + B_{30}s + (B_{01} + B_{11}s)z^{5} + (B_{21} + B_{31}s)z^{4}. \]

The product of the above polynomials can be verified to contain the useful terms as the coefficients of z, z^3, z^5, z^7; the others are interference terms. For this scheme the corresponding |X_{ij}| can be at most 2L^2, though the recovery threshold is 9. Applying the method of Section 4.2.2 would result in the |X_{ij}| values being bounded by 8L^4 with a threshold of 4.

4.4 Experimental Results and Discussion

We ran our experiments on AWS EC2 r3.large instances. Our code is available online com (2019). The input matrices A and B were randomly generated integer matrices of size 8000 × 8000 with

elements in the set {0, 1, . . . , 50}. These matrices were pre-generated (for the different straggler

counts) and remained the same for all experiments. The master node was responsible for the 2 × 2

block decomposition of A and B, computing Ã(s, zi ) and B̃(s, zi ) for i = 1, . . . , 10 and sending

them to the worker nodes. The evaluation points (zi ’s) were chosen as 10 equally spaced reals

within the interval [−1, 1]. The stragglers were simulated by having S randomly chosen machines

perform their local computation twice.

We compared the performance of our method (cf. Section 4.2) with Yu et al. (2020). For

fairness, we chose the same evaluation points in both methods. In fact, the choice of points in their

code available online nip (2017) (which we adapted for the case when p > 1) provides worse results

than those reported here.

Computation latency refers to the elapsed time from the point when all workers have received

their inputs until enough of them finish their computations accounting for the decoding time. The

decoding time for our method is slightly higher owing to the modulo s operation (cf. Section 4.2.3).

It can be observed in Fig. 4.1 that for our method there is no significant change in the latency

for the values of S ∈ {0, 2, 4, 6} and it remains around 9.83 seconds. When S = 7, as expected

the straggler effects start impacting our system and the latency jumps to approximately 16.14

seconds. In contrast, the performance of Yu et al. (2020) deteriorates in the presence of two or

more stragglers (average latency ≥ 15.65 seconds).

Real Vandermonde matrices are well-known to have bad condition numbers. The condition

number is better when we consider complex Vandermonde matrices with entries from the unit circle

Gautschi (1990). In our method, the |Xij | and |Yij | values can be quite large. This introduces small

errors in the decoding process. Let Ĉ be the decoded matrix and C ≜ A^T B be the actual product. Our error metric is $e = \|C - \hat{C}\|_F / \|C\|_F$ (the subscript F refers to the Frobenius norm). The results in Fig. 4.1 had an error e of at most 10^{−7}. We studied the effect of increasing the average value of the entries in A and B in Table 4.1. The error is consistently low up to a bound of L = 1000, following which the calculation is useless owing to numerical overflow issues. We point out that in our experiments

Table 4.1 Effect of bound (L) on the decoding error

Bound (L)    s       Error
100          2^28    6.31 · 10^{−7}
200          2^30    8.87 · 10^{−7}
500          2^32    6.40 · 10^{−6}
1000         2^34    9.52 · 10^{−6}
2000         2^36    1
the error e was identically zero if the zi ’s were chosen from the unit circle. However, this requires

complex multiplication, which increases the computation time.



CHAPTER 5. NUMERICALLY STABLE CODED MATRIX COMPUTATIONS VIA CIRCULANT AND ROTATION MATRIX EMBEDDINGS

Polynomial based methods have been used in several works for mitigating the effect of stragglers

in distributed matrix computations. However, they suffer from serious numerical issues owing to

the condition number of the corresponding real Vandermonde-structured recovery matrices. In

this chapter, we present a novel approach that leverages the properties of circulant permutation

matrices and rotation matrices for coded matrix computation. In addition to having an optimal

recovery threshold, our scheme has condition numbers that are orders of magnitude lower than

prior work.

5.1 Problem Setting, Related Work and Main contributions

5.1.1 Problem Setting

Consider a scenario where the master node has a large t × r matrix A ∈ ℝ^{t×r} and either a t × 1 vector x ∈ ℝ^{t×1} or a t × w matrix B ∈ ℝ^{t×w}. The master node wishes to compute A^T x or A^T B in a distributed manner over n worker nodes in the matrix-vector and matrix-matrix setting

respectively. Towards this end, the master node partitions A (respectively B) into ∆A (respectively

∆B ) block columns. Each worker node is assigned δA ≤ ∆A and δB ≤ ∆B linearly encoded block

columns of A0 , . . . , A∆A −1 and B0 , . . . , B∆B −1 , so that δA /∆A ≤ γA and δB /∆B ≤ γB , where γA

and γB represent storage capacity constraints for A and B respectively.

In the matrix-vector case, the i-th worker is assigned encoded block-columns of A and the vector

x and computes their inner product. In the matrix-matrix case it computes all pairwise products of

block-columns assigned to it. We say that a given scheme has computation threshold τ if the master

node can decode the intended result as long as any τ out of n worker nodes complete their jobs. In

this case we say that the scheme is resilient to s = n − τ stragglers. We say that this threshold is

optimal if the value of τ is the smallest possible for the given storage capacity constraints.

The overall goal is to (i) design schemes that are resilient to s stragglers (s is a design parameter), while ensuring that (ii) the desired result can be decoded in an efficient manner, and (iii) the decoded result is numerically robust even in the presence of round-off errors and other sources of noise.

5.1.2 Related Work

A significant amount of prior work Yu et al. (2017, 2020); Dutta et al. (2016, 2019) has demon-

strated interesting and elegant approaches based on embedding the distributed matrix computation

into the structure of polynomials. Specifically, the encoding at the master node can be viewed as

evaluating certain polynomials at distinct real values. Each worker node gets a particular evaluation. Once at least τ workers finish their tasks, the master node can decode the intended result

by performing polynomial interpolation. The work of Yu et al. (2017) demonstrates that when

δA = δB = 1 the optimal threshold is ∆A ∆B and that polynomial based approaches (henceforth

referred to as polynomial codes) achieve this threshold. Prior work has also considered other ways

in which matrices A and B can be partitioned. For instance, they can be partitioned both along

rows and columns. The work of Yu et al. (2020); Dutta et al. (2019) has obtained threshold results

in those cases as well. The so called Entangled Polynomial and Mat-Dot codes Yu et al. (2020);

Dutta et al. (2019), also use polynomial encodings. The key point is that in all these approaches,

polynomial interpolation is required when decoding the required result.

A major part of our work in this paper revolves around understanding the numerical stability

of various distributed matrix computations. This is closely related to the condition number of

matrices. Let ||M|| denote the maximum singular value of a matrix M of dimension l × l.

Definition 7. Condition number. The condition number of an l × l matrix M is defined as κ(M) = ‖M‖‖M^{−1}‖. It is infinite if the minimum singular value of M is zero.

Consider the system of equations My = z, where z is known and y is to be determined. If κ(M) ≈ 10^b, then the decoded result loses approximately b digits of precision Higham (2002). In particular, matrices that are ill-conditioned lead to significant numerical problems when solving

linear equations.

Polynomial interpolation corresponds to solving a real Vandermonde system of equations at

the master node. In the work of Yu et al. (2017), this would require solving a ∆A ∆B × ∆A ∆B

Vandermonde system. Unfortunately, it can be shown that the condition number of these matrices

grows exponentially in ∆A ∆B Pan (2016). This is a significant drawback and even for systems

with around ∆A ∆B ≈ 30, the condition number is so high that the decoded results are essentially

useless.

In Section VII of Yu et al. (2020), it is remarked that when operating over infinite fields such as

the reals, one can embed the computation into finite fields to avoid numerical errors. For instance,

they advocate encoding and decoding over a finite field of order p. However, this method would

require "quantizing" real matrices A and B so that the entries are integers. We demonstrate that

the performance of this method can be catastrophically bad.

For this method to work, the maximum possible absolute value of each entry of the quantized matrices, α, should be such that α^2 t < p, since each entry corresponds to the inner product of

columns of A and columns of B. This “dynamic range constraint (DRC)” means that the error in

the computation depends strongly on the actual matrix entries and the value of t is quite limited.

If the DRC is violated, the error in the underlying computation can be catastrophic. Even if the

DRC is not violated, the dependence of the error on the entries can make it very bad. We discuss

the complete details in Section 5.6.

The main goal of our work is to consider alternate embeddings of distributed matrix computa-

tions that are based on rotation and circulant permutation matrices. We demonstrate that these

are significantly better behaved from a numerical precision perspective.

The issue of numerical stability in the coded computation context has been considered in a few

recent works Tang et al. (2019); Ramamoorthy et al. (2019); Das et al. (2018); Das and Ramamoor-

thy (2019); Das et al. (2019); Fahim and Cadambe (2019); Subramaniam et al. (2019). The work

of Ramamoorthy et al. (2019); Das and Ramamoorthy (2019) presented strategies for distributed

matrix-vector multiplication and demonstrated some schemes that empirically have better numer-

ical performance than polynomial based schemes for some values of n and s. However, both these

approaches work only for the matrix-vector problem. Preprint Das et al. (2019) presents a ran-

dom convolutional coding approach that applies for both the matrix-vector and the matrix-matrix

multiplications problems. Their work demonstrates a computable upper bound on the worst case

condition number of the decoding matrices by drawing on connections with the asymptotic analysis

of large Toeplitz matrices. The recent preprint Subramaniam et al. (2019) presents constructions

that are based on random linear coding ideas where the encoding coefficients are chosen at random

from a continuous distribution. These exhibit better condition number properties.

The work most closely related to ours is Fahim and Cadambe (2019), which considers an alternative approach for polynomial based schemes by working within the basis of orthogonal polynomials. They demonstrate an upper bound on the worst case condition number of the decoding matrices which grows as O(n^{2s}), where s is the number of stragglers that the scheme is resilient to. They also demonstrate experimentally that their performance is significantly better than the polynomial code approach. In contrast, we demonstrate an upper bound that is ≈ O(n^{s+6}). Furthermore, in Section 5.6 we show that in practice our worst case condition numbers are far better than those of Fahim and Cadambe (2019).

5.1.3 Main contributions

The work of Pan (2016) shows that unless all (or almost all) the parameters of the Vandermonde

matrix lie on the unit circle, its condition number is badly behaved. However, most of these

parameters are complex-valued (except ±1), whereas our matrices A and B are real-valued. Using

complex evaluation points in the polynomial code scheme will increase the cost of computations

approximately four times for matrix-matrix multiplication and around two times for matrix-vector

multiplication. This is an unacceptable hit in computation time.

• Our main finding in this paper is that we can work with real-valued matrix embeddings that (i)

continue to have the optimal threshold of polynomial based approaches, and (ii) enjoy the low

condition number of complex Vandermonde matrices with all parameters on the unit circle.

In particular, we demonstrate that rotation matrices and circulant permutation matrices of

appropriate sizes can be used within the framework of polynomial codes. At the top level,

instead of evaluating polynomials at real values, our approach evaluates the polynomials at

matrices.

• Using these embeddings we show that the worst case condition number over all $\binom{n}{n-s}$ possible recovery matrices is upper bounded by ≈ O(n^{s+6}). Furthermore, our experimental results indicate that the actual values are significantly smaller, i.e., the analysis yields pessimistic upper bounds.

5.2 Distributed Matrix Computation Schemes

Our schemes in this work will be defined by the encoding matrices used by the master node,

which are such that the master node only needs to perform scalar multiplications and additions.

The computationally intensive tasks, i.e., matrix operations are performed by the worker nodes.

We begin by defining certain classes of matrices, discuss their relevant properties and present an

example that outlines the basic idea of our work. We let $i = \sqrt{-1}$.

Definition 8. Rotation matrix. The 2 × 2 matrix R_θ below is called a rotation matrix:
\[ R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = Q \Lambda Q^{*}, \quad \text{where} \quad (5.1) \]
\[ Q = \frac{1}{\sqrt{2}} \begin{bmatrix} i & -i \\ 1 & 1 \end{bmatrix}, \quad \text{and} \quad \Lambda = \begin{bmatrix} e^{i\theta} & 0 \\ 0 & e^{-i\theta} \end{bmatrix}. \quad (5.2) \]

Definition 9. Circulant Permutation Matrix. Let e be a row vector of length m with e = [0 1 0 . . . 0]. Let P be an m × m matrix with e as its first row. The remaining rows are obtained by cyclically shifting the first row with the shift index equal to the row index. Then P^i, i ∈ [m], are said to be circulant permutation matrices. Let W denote the m-point Discrete Fourier Transform (DFT) matrix, i.e., $W(i,j) = \frac{1}{\sqrt{m}}\,\omega_m^{ij}$ for i ∈ [m], j ∈ [m], where $\omega_m = e^{-i\frac{2\pi}{m}}$ denotes the m-th root of unity. Then, it can be shown Gray (2006) that $P = W \operatorname{diag}(1, \omega_m, \omega_m^{2}, \ldots, \omega_m^{m-1}) W^{*}$, where the superscript ∗ denotes the complex conjugate operator.

Remark 8. Rotation matrices and circulant permutation matrices (see Appendix B for an example) have the useful property that they are "real" matrices whose complex eigenvalues lie on the unit circle. We use this property extensively in the sequel.
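The property in Remark 8 is easy to verify numerically; the following is a small sketch with illustrative sizes m = 5 and q = 7 (our own choices).

```python
import numpy as np

m = 5
# Circulant permutation matrix P: row i has a 1 in column (i+1) mod m,
# matching Definition 9's first row e = [0 1 0 ... 0].
P = np.roll(np.eye(m), 1, axis=1)
# Real matrix, yet all eigenvalues are m-th roots of unity (unit circle).
assert np.allclose(np.abs(np.linalg.eigvals(P)), 1.0)
assert np.allclose(np.linalg.matrix_power(P, m), np.eye(m))

# A rotation matrix behaves the same way: real entries, eigenvalues e^{+-i*theta}.
theta = 2 * np.pi / 7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(np.abs(np.linalg.eigvals(R)), 1.0)
```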

Illustrative Example: Consider the matrix-vector case where ∆A = 3 and δA = 1. In the

polynomial approach, the master node forms A(z) = A_0 + A_1 z + A_2 z^2 and evaluates it at distinct real values z_1, . . . , z_n. The i-th evaluation is sent to the i-th worker node, which computes A^T(z_i)x.

From polynomial interpolation, it follows that as long as the master node receives results from any

three workers, it can decode AT x. However, when ∆A is large, the interpolation is numerically

unstable.

The basic idea of our approach is as follows. We further split each Ai into two equal sized block

columns. Thus we now have six block-columns, indexed as A0 , . . . A5 . Consider the 6 × 2 matrix

defined below; its columns are specified by g_0 and g_1:
\[ [\,\mathbf{g}_0\ \ \mathbf{g}_1\,] = \begin{bmatrix} \mathbf{I} \\ \mathbf{R}_{\theta}^{i} \\ \mathbf{R}_{\theta}^{2i} \end{bmatrix} \]

The master node forms "two" encoded matrices for the i-th worker: $\sum_{j=0}^{5} \mathbf{A}_j g_0(j)$ and $\sum_{j=0}^{5} \mathbf{A}_j g_1(j)$ (where $g_i(l)$ denotes the l-th component of the vector $g_i$). Thus, the storage capacity constraint fraction γ_A is still 1/3. Worker node i computes the inner product of these two encoded matrices

with x and sends the result to the master node. It turns out that in this case when any three

workers i0 , i1 , and i2 complete their tasks, the decodability and numerical stability of recovering

A^T x depend on the condition number of the following matrix:
\[ \begin{bmatrix} \mathbf{I} & \mathbf{I} & \mathbf{I} \\ \mathbf{R}_{\theta}^{i_0} & \mathbf{R}_{\theta}^{i_1} & \mathbf{R}_{\theta}^{i_2} \\ \mathbf{R}_{\theta}^{2i_0} & \mathbf{R}_{\theta}^{2i_1} & \mathbf{R}_{\theta}^{2i_2} \end{bmatrix} \]

Using the eigendecomposition of R_θ (cf. (5.1)), the above block matrix can be expressed as
\[ \begin{bmatrix} Q & 0 & 0 \\ 0 & Q & 0 \\ 0 & 0 & Q \end{bmatrix} \underbrace{\begin{bmatrix} \mathbf{I} & \mathbf{I} & \mathbf{I} \\ \Lambda^{i_0} & \Lambda^{i_1} & \Lambda^{i_2} \\ \Lambda^{2i_0} & \Lambda^{2i_1} & \Lambda^{2i_2} \end{bmatrix}}_{\Sigma} \begin{bmatrix} Q^{*} & 0 & 0 \\ 0 & Q^{*} & 0 \\ 0 & 0 & Q^{*} \end{bmatrix}, \]

As the pre- and post-multiplying matrices are unitary, the condition number of the above matrix

only depends on the properties of the middle matrix, denoted by Σ. In what follows we show that

upon appropriate column and row permutations, Σ can be shown equivalent to a block diagonal

matrix where each of the blocks is a Vandermonde matrix with parameters on the unit circle.

Thus, the matrix is invertible. Furthermore, even though we use real computation, the numerical

stability of our scheme depends on Vandermonde matrices with parameters on the unit circle. The

upcoming Theorem 3 shows that the condition number of such matrices is much better behaved.

In the sequel we show that this argument can be significantly generalized and adapted for the

case of circulant permutation embeddings. The matrix-matrix case requires the development of

more ideas that we also present.
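The equivalence claimed in the illustrative example can be verified numerically: the real block recovery matrix has exactly the condition number of a complex Vandermonde matrix with parameters on the unit circle. A sketch under assumed parameters (q = 11, and workers 0, 4, 7 finishing):

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

q = 11                       # odd, q >= n
theta = 2 * np.pi / q
finished = [0, 4, 7]         # any three workers that complete their tasks

# 6 x 6 block recovery matrix from the example: rows I, R^i, R^{2i}.
G = np.block([[np.linalg.matrix_power(rot(theta), j * i) for i in finished]
              for j in range(3)])

# Complex Vandermonde matrix with parameters e^{i*theta*i_k} on the unit circle;
# the unitary pre/post factors make the two condition numbers coincide.
Vc = np.vander(np.exp(1j * theta * np.array(finished)), 3, increasing=True).T
assert np.isclose(np.linalg.cond(G), np.linalg.cond(Vc))
```

The assertion holds because the permuted middle factor Σ is block diagonal with a Vandermonde block and its complex conjugate, so G and the complex Vandermonde matrix share singular values.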

Notation: Let [m] denote the set {0, . . . , m − 1}. For a matrix M, M(i, j) denotes its (i, j)-

th entry, whereas Mi,j denotes the (i, j)-th block sub-matrix of M. We use MATLAB inspired

notation at certain places. For instance, diag(a_1, a_2, . . . , a_m) denotes an m × m diagonal matrix with

ai ’s on the diagonal and M(:, j) denotes the j-th column of matrix M. The notation M1 ⊗ M2

denotes the Kronecker product of M1 and M2 .


Encoding schemes: In this work our general strategy will be to first partition the matrices A and B into ∆_A = k_A ℓ and ∆_B = k_B ℓ block-columns respectively. However, we use two indices to refer to their respective constituent block-columns as this simplifies our later presentation. To avoid confusion, we use the subscript ⟨i, j⟩ to refer to the corresponding (i, j)-th block-columns. In particular A_⟨i,j⟩, i ∈ [k_A], j ∈ [ℓ], and B_⟨i,j⟩, i ∈ [k_B], j ∈ [ℓ], refer to the (i, j)-th block-columns of A and B respectively, such that
\[ A = [A_{\langle 0,0\rangle} \ldots A_{\langle 0,\ell-1\rangle} \mid \ldots \mid A_{\langle k_A-1,0\rangle} \ldots A_{\langle k_A-1,\ell-1\rangle}], \]
\[ B = [B_{\langle 0,0\rangle} \ldots B_{\langle 0,\ell-1\rangle} \mid \ldots \mid B_{\langle k_B-1,0\rangle} \ldots B_{\langle k_B-1,\ell-1\rangle}]. \quad (5.3) \]

The encoding matrix for A will be specified by a k_A ℓ × nℓ "generator" matrix G such that
\[ \hat{A}_{\langle i,j\rangle} = \sum_{\alpha \in [k_A],\, \beta \in [\ell]} G(\alpha\ell + \beta,\ i\ell + j)\, A_{\langle \alpha,\beta\rangle} \quad (5.4) \]

for i ∈ [n], j ∈ [ℓ]. A similar rule will apply for B and result in encoded matrices B̂_⟨i,j⟩. Thus, in the matrix-vector case worker node i stores Â_⟨i,j⟩ for j ∈ [ℓ] and x, whereas in the matrix-matrix case it stores Â_⟨i,j⟩ and B̂_⟨i,j⟩, for j ∈ [ℓ]. Thus worker i stores γ_A = ℓ/∆_A = 1/k_A and γ_B = ℓ/∆_B = 1/k_B fractions of matrices A and B respectively. In the matrix-vector case, worker node i computes Â^T_⟨i,j⟩ x for j ∈ [ℓ] and transmits them to the master node. In the matrix-matrix case, it computes all ℓ² pairwise products Â^T_⟨i,l_1⟩ B̂_⟨i,l_2⟩ for l_1 ∈ [ℓ], l_2 ∈ [ℓ].

Decoding Scheme: With the above encoding, the decoding process corresponds to solving linear equations. We discuss the matrix-vector case here; the matrix-matrix case is quite similar. In the matrix-vector case, the master node receives Â^T_⟨i,j⟩ x of length r/∆_A for j ∈ [ℓ] from a certain number of worker nodes and wants to decode A^T x of length r. Based on our encoding scheme, this can be done by solving a ∆_A × ∆_A linear system of equations r/∆_A times. The structure of this linear system is inherited from the encoding matrix G.

Definition 10. Vandermonde Matrix. An m × m Vandermonde matrix V with parameters z_1, z_2, . . . , z_m ∈ ℂ is such that V(i, j) = z_j^i, i ∈ [m], j ∈ [m]. If the z_i's are distinct, then V is nonsingular Horn and Johnson (1991). In this work, we will also assume that the z_i's are non-zero.

Condition Number of Vandermonde Matrices: Let V be an m × m Vandermonde matrix

with parameters s0 , s1 , . . . sm−1 . The following facts about κ(V) follow from prior work Pan (2016).

• Real Vandermonde matrices. If si ∈ R, i ∈ [m], i.e., if V is a real Vandermonde matrix, then

it is known that its condition number is exponential in m.

• Complex Vandermonde matrices with parameters "not" on the unit circle. Suppose that the s_i's are complex and let $s_+ = \max_{i \in [m]} |s_i|$. If s_+ > 1 then κ(V) is exponential in m. Furthermore, if 1/|s_i| ≥ ν > 1 for at least β ≤ m of the m parameters, then κ(V) is exponential in β.
78

Based on the above facts, the only scenario where the condition number is somewhat well-behaved

is if most or all of the parameters of V are complex and lie on the unit-circle. In the Appendix B,

we show the following result which is one of our key technical contributions.

Theorem 3. Consider an m × m Vandermonde matrix V, where m < q (with q odd), with distinct parameters $\{s_0, s_1, \ldots, s_{m-1}\} \subset \{1, \omega_q, \omega_q^{2}, \ldots, \omega_q^{q-1}\}$. Then,
\[ \kappa(V) \le O(q^{\,q-m+6}). \]

Remark 9. If q − m is a constant, then κ(V) grows only polynomially in q. In the subsequent

discussion, we will leverage Theorem 3 extensively.
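The contrast between real parameters and unit-circle parameters is easy to observe numerically; a sketch with illustrative sizes m = 20 and q = 21 (the 10^4 cutoff in the assertion is a loose illustrative threshold, not a bound from Theorem 3).

```python
import numpy as np

m, q = 20, 21  # q odd, m < q, q - m = 1
# m of the q-th roots of unity (unit-circle parameters, as in Theorem 3).
circ = np.exp(2j * np.pi * np.arange(m) / q)
# m equispaced real parameters in [-1, 1].
real = np.linspace(-1, 1, m)

kappa_circ = np.linalg.cond(np.vander(circ, increasing=True))
kappa_real = np.linalg.cond(np.vander(real, increasing=True))
# Unit-circle parameters stay well conditioned; real ones blow up exponentially.
assert kappa_circ < 1e4 < kappa_real
```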

5.3 Distributed Matrix-Vector Multiplication

5.3.1 Rotation Matrix Embedding

Let q be an odd number such that q ≥ n, θ = 2π/q and ℓ = 2 (cf. block-column decomposition in (5.3)). We choose the generator matrix such that its (i, j)-th block submatrix for i ∈ [k_A], j ∈ [n] is given by
\[ G^{rot}_{i,j} = R_{\theta}^{\,ji}. \quad (5.5) \]

Theorem 4. The threshold for the rotation matrix based scheme specified above is k_A. Furthermore, the worst case condition number of the recovery matrices is upper bounded by O(q^{q−k_A+6}).

Proof. Suppose that workers indexed by i_0, . . . , i_{k_A−1} complete their tasks. We extract the corresponding block columns of G^{rot} to obtain
\[ \tilde{G}^{rot} = \begin{bmatrix} \mathbf{I} & \mathbf{I} & \cdots & \mathbf{I} \\ R_{\theta}^{i_0} & R_{\theta}^{i_1} & \cdots & R_{\theta}^{i_{k_A-1}} \\ \vdots & \vdots & \ddots & \vdots \\ R_{\theta}^{i_0(k_A-1)} & R_{\theta}^{i_1(k_A-1)} & \cdots & R_{\theta}^{i_{k_A-1}(k_A-1)} \end{bmatrix}. \]

We note here that the decoder attempts to recover each entry of A^T_⟨i,j⟩ x from the results sent by the worker nodes. Thus, we can equivalently analyze the decoding by considering the system of equations
\[ \mathbf{m}\,\tilde{G}^{rot} = \mathbf{c}, \]
where m, c ∈ ℝ^{1×k_A ℓ} are row vectors such that
\[ \mathbf{m} = [m_0, \cdots, m_{k_A-1}] = [m_{\langle 0,0\rangle}, \cdots, m_{\langle 0,\ell-1\rangle}, \cdots, m_{\langle k_A-1,0\rangle}, \cdots, m_{\langle k_A-1,\ell-1\rangle}], \quad \text{and} \]
\[ \mathbf{c} = [c_{i_0}, \cdots, c_{i_{k_A-1}}] = [c_{\langle i_0,0\rangle}, \cdots, c_{\langle i_0,\ell-1\rangle}, \cdots, c_{\langle i_{k_A-1},0\rangle}, \cdots, c_{\langle i_{k_A-1},\ell-1\rangle}]. \]

In the expression above, terms of the form m_⟨i,j⟩ and c_⟨i,j⟩ are scalars. We need to analyze κ(G̃^{rot}). Towards this end, using the eigenvalue decomposition of R_θ, we have
\[ \tilde{G}^{rot} = \begin{bmatrix} Q & & \\ & \ddots & \\ & & Q \end{bmatrix} \tilde{\Lambda} \begin{bmatrix} Q^{*} & & \\ & \ddots & \\ & & Q^{*} \end{bmatrix}, \quad \text{where} \quad (5.6) \]
\[ \tilde{\Lambda} = \begin{bmatrix} \mathbf{I} & \mathbf{I} & \cdots & \mathbf{I} \\ \Lambda^{i_0} & \Lambda^{i_1} & \cdots & \Lambda^{i_{k_A-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \Lambda^{i_0(k_A-1)} & \Lambda^{i_1(k_A-1)} & \cdots & \Lambda^{i_{k_A-1}(k_A-1)} \end{bmatrix}, \]
and Λ is specified in (5.2). Note that the pre- and post-multiplying matrices in (5.6) above are both unitary. Therefore κ(G̃^{rot}) is the same as κ(Λ̃) Horn and Johnson (1991).

Using Claim 14 in Appendix B, we can permute Λ̃ to put it in block-diagonal form so that
\[ \tilde{\Lambda}_d = \begin{bmatrix} \tilde{\Lambda}_d[0] & 0 \\ 0 & \tilde{\Lambda}_d[1] \end{bmatrix}, \]

Algorithm 5: Decoding Algorithm for Circulant Permutation Scheme

1 Input: G^{circ}_I where |I| = k_A (block-columns of G^{circ} corresponding to block-columns in I); c corresponding to observed values in one system of equations.
2 Output: m, which is the solution to m G^{circ}_I = c.
3 1. procedure: Block Fourier transform and permute c.
4 for j = 0 to k_A − 1 do
5   Apply the FFT to c_{i_j} = [c_⟨i_j,0⟩, · · · , c_⟨i_j,q̃−1⟩] to obtain c^F_{i_j} = [c^F_⟨i_j,0⟩, · · · , c^F_⟨i_j,q̃−1⟩].
6 end
7 Permute c^F = [c^F_{i_0}, · · · , c^F_{i_{k_A−1}}] by π to obtain c^{F,π} = [c^{F,π}_0, · · · , c^{F,π}_{q̃−1}], where c^{F,π}_j = [c^F_⟨i_0,j⟩, c^F_⟨i_1,j⟩, · · · , c^F_⟨i_{k_A−1},j⟩], for j = 0, . . . , q̃ − 1.
8 2. procedure: Decode m^{F,π} from c^{F,π}.
9 For i ∈ {1, . . . , q̃ − 1}, decode m^{F,π}_i from c^{F,π}_i by polynomial interpolation or matrix inversion of G̃_d[i] (see (B.2) in Appendix B). Set m^{F,π}_0 = [0, · · · , 0].
10 3. procedure: Inverse permute and block inverse Fourier transform m^{F,π}.
11 Permute m^{F,π} by π^{−1} to obtain m^F = [m^F_0, · · · , m^F_{k_A−1}]. Apply the inverse FFT to each m^F_i in m^F to obtain m = [m_0, · · · , m_{k_A−1}].

where Λ̃_d[0] and Λ̃_d[1] are Vandermonde matrices with parameter sets {e^{iθi_0}, . . . , e^{iθi_{k_A−1}}} and {e^{−iθi_0}, . . . , e^{−iθi_{k_A−1}}} respectively. Note that these parameters are distinct points on the unit circle. Thus, Λ̃_d[0] and Λ̃_d[1] are both invertible, which implies that Λ̃ is invertible. This allows us to conclude that the threshold of the scheme is k_A. The upper bound on the condition number follows from Theorem 3.

We note here that the decoding process involves inverting a ∆A × ∆A matrix once and using the inverse to solve r/∆A systems of equations. Thus, the overall decoding complexity is O(∆³A + r∆A), where typically r ≫ ∆²A.
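To make the conditioning claim concrete, here is a small numerical sketch (our illustration, using toy parameters n = 7, kA = 4 rather than the sizes in Section 5.6). It builds the rotation-matrix encoding matrix block-wise, computes the worst-case condition number over all recovery matrices, and compares against a real Vandermonde scheme with evaluation points in [−1, 1].

```python
import itertools
import numpy as np

def rotation(theta):
    """The 2x2 rotation matrix R_theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

n, k_A = 7, 4               # toy parameters: 7 workers, threshold k_A = 4
theta = 2 * np.pi / n       # pick q = n (odd), so theta = 2*pi/q
R = rotation(theta)

# Block (i, j) of G^rot is R_theta^{ji}; block-column j is stored by worker j.
G = np.block([[np.linalg.matrix_power(R, j * i) for j in range(n)]
              for i in range(k_A)])

def worst_case_cond(G, k, n):
    """Worst condition number over all recovery matrices (k block-columns)."""
    worst = 0.0
    for S in itertools.combinations(range(n), k):
        recov = np.hstack([G[:, 2 * j:2 * j + 2] for j in S])
        worst = max(worst, np.linalg.cond(recov))
    return worst

worst_rot = worst_case_cond(G, k_A, n)

# Real Vandermonde baseline with evaluation points in [-1, 1].
pts = np.linspace(-1, 1, n)
V = np.vander(pts, k_A, increasing=True).T
worst_vand = max(np.linalg.cond(V[:, list(S)])
                 for S in itertools.combinations(range(n), k_A))

print(worst_rot, worst_vand)   # the rotation scheme is far better conditioned
```

Even at this tiny scale, the gap between the two schemes is visible; at the scale of the experiments in Section 5.6 it grows to many orders of magnitude.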

5.3.2 Circulant Permutation Embedding

Let q̃ be a prime number which is greater than or equal to n. We set ℓ = q̃ − 1, so that A is sub-divided into kA(q̃ − 1) block-columns as in (5.3). In this embedding we have an additional step. Specifically, the master node generates the following "precoded" matrices.

$$
A_{\langle i,\tilde{q}-1\rangle} = -\sum_{j=0}^{\tilde{q}-2} A_{\langle i,j\rangle}, \quad i \in [k_A]. \tag{5.7}
$$

In the subsequent discussion, we work with the set of block-columns A⟨i,j⟩ for i ∈ [kA], j ∈ [q̃]. The coded submatrices Â⟨i,j⟩ for i ∈ [n], j ∈ [q̃] are generated by means of a kA q̃ × n q̃ matrix Gcirc as follows.

$$
\hat{A}_{\langle i,j\rangle} = \sum_{\alpha\in[k_A],\,\beta\in[\tilde{q}]} G^{circ}(\alpha\tilde{q}+\beta,\, i\tilde{q}+j)\, A_{\langle\alpha,\beta\rangle}, \tag{5.8}
$$

where the (i, j)-th block of Gcirc can be expressed as

$$
G^{circ}_{i,j} = P^{ji}, \quad \text{for } i \in [k_A],\ j \in [n]. \tag{5.9}
$$

The matrix P denotes the q̃ × q̃ circulant permutation matrix introduced in Definition 9. For this scheme the storage fraction γA = q̃/(kA(q̃ − 1)), i.e., it is slightly higher than 1/kA.

Remark 10. The Â⟨i,j⟩'s can simply be generated by additions, since Gcirc is a binary matrix.

Theorem 5. The threshold for the circulant permutation based scheme specified above is kA. Furthermore, the worst case condition number of the recovery matrices is upper bounded by O(q̃^{q̃−kA+6}), and the scheme can be decoded by using Algorithm 5.

The proof appears in the Appendix B. It is conceptually similar to the proof of Theorem 4

and relies critically on the fact that all eigenvalues of P lie on the unit circle and that P can be

diagonalized by the DFT matrix W. It suggests an efficient decoding algorithm where the fast

Fourier Transform (FFT) plays a key role (see Algorithm 5 and Claim 12).

Claim 12. The decoding complexity of recovering AT x is O(r(log q̃ + log2 kA )).
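The two facts that drive Algorithm 5 and Claim 12, namely that the eigenvalues of P lie on the unit circle and that the DFT matrix diagonalizes P, are easy to verify numerically. The sketch below (our illustration, for a small prime q̃ = 5; not part of the dissertation's code) does exactly that.

```python
import numpy as np

q = 5                                    # q-tilde: a small prime
P = np.roll(np.eye(q), 1, axis=1)        # q x q circulant permutation matrix
W = np.fft.fft(np.eye(q)) / np.sqrt(q)   # unitary DFT matrix

D = W @ P @ W.conj().T                   # diagonalization of the circulant P

offdiag = D - np.diag(np.diag(D))
eigs = np.diag(D)
print(np.max(np.abs(offdiag)))           # ~ 0: D is diagonal
print(np.abs(eigs))                      # all 1: eigenvalues on the unit circle
print(np.max(np.abs(eigs**q - 1)))       # ~ 0: they are q-th roots of unity
```

Because multiplication by P (and its powers) in the Fourier domain reduces to scaling by roots of unity, the block Fourier transforms in Algorithm 5 convert the decoding problem into q̃ small independent systems, which is the source of the FFT speedup in Claim 12.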

Remark 11. Both circulant permutation matrices and rotation matrices allow us to achieve a specified threshold for distributed matrix-vector multiplication. The required storage fraction γA is slightly higher for the circulant permutation case, and it requires q̃ to be prime. However, it allows for an efficient FFT based decoding algorithm. On the other hand, the rotation matrix case requires a smaller ∆A, but the decoding requires solving the corresponding system of equations, the complexity of which can be cubic in ∆A. We note that when the matrix sizes are large, the decoding time will be negligible as compared to the worker node computation time; we discuss this in Section 5.6, where we also show results demonstrating that the normalized mean-square error when circulant permutation matrices are used is lower than in the rotation matrix case.

The circulant matrix embedding idea can also be applied to the fast encoding of quasi-cyclic (QC) LDPC codes in Huang et al. (2014); Tang et al. (2013); Zhang et al. (2014).

5.4 Distributed Matrix-Matrix Multiplication

The matrix-matrix case requires the introduction of newer ideas within this overall framework.

In this case, a given worker obtains encoded block-columns of both A and B and representing the

underlying computations is somewhat more involved. Once again we let θ = 2π/q, where q ≥ n (n is the number of worker nodes) is an odd integer, and set ℓ = 2. Furthermore, let kA kB < n. The (i, j)-th blocks of the encoding matrices are given by

(i, j)-th blocks of the encoding matrices are given by

ji
GA
i,j = Rθ , for i ∈ [kA ], j ∈ [n], and

(jkA )i
GB
i,j = Rθ , for i ∈ [kB ], j ∈ [n].

The master node operates according to the encoding rule discussed previously (cf. (5.3)) for both A

and B. Thus, each worker node stores γA = 1/kA and γB = 1/kB fraction of A and B respectively.

The i-th worker node computes the pairwise products Â^T_{⟨i,l1⟩} B̂_{⟨i,l2⟩} for l1, l2 = 0, 1 and returns the results to the master node. Thus, the master node needs to recover all pair-wise products of the form A^T_{⟨i,α⟩} B_{⟨j,β⟩} for i ∈ [kA], j ∈ [kB] and α, β = 0, 1. Let Z denote a 1 × 4kAkB block matrix that contains all of these pair-wise products.

Theorem 6. The threshold for the rotation matrix based matrix-matrix multiplication scheme is kAkB. The worst case condition number is bounded by O(q^{q−kAkB+6}).


Proof. Let τ = kAkB and suppose that the workers indexed by i0, ..., iτ−1 complete their tasks. Let G^A_l denote the l-th block-column of G^A (with similar notation for G^B). Note that for k1, k2 ∈ {0, 1} the l-th worker node computes Â^T_{⟨l,k1⟩} B̂_{⟨l,k2⟩}, which can be written as

$$
\Big(\sum_{\alpha\in[k_A],\,\beta\in\{0,1\}} G^A(2\alpha+\beta,\,2l+k_1)\, A^T_{\langle\alpha,\beta\rangle}\Big)
\Big(\sum_{\alpha\in[k_B],\,\beta\in\{0,1\}} G^B(2\alpha+\beta,\,2l+k_2)\, B_{\langle\alpha,\beta\rangle}\Big)
\equiv Z \cdot \big(G^A(:,2l+k_1) \otimes G^B(:,2l+k_2)\big),
$$

using the properties of the Kronecker product. Based on this, it can be observed that the decodability of Z at the master node is equivalent to checking whether the following matrix is full-rank.

$$
\tilde{G} = [\,G^A_{i_0} \otimes G^B_{i_0} \,|\, G^A_{i_1} \otimes G^B_{i_1} \,|\, \ldots \,|\, G^A_{i_{\tau-1}} \otimes G^B_{i_{\tau-1}}\,].
$$

To analyze this matrix, consider the following decomposition of G^A_l ⊗ G^B_l, for l ∈ [n].

$$
G^A_l \otimes G^B_l =
\begin{bmatrix} QQ^* \\ Q\Lambda^{l} Q^* \\ \vdots \\ Q\Lambda^{l(k_A-1)} Q^* \end{bmatrix}
\otimes
\begin{bmatrix} QQ^* \\ Q\Lambda^{lk_A} Q^* \\ \vdots \\ Q\Lambda^{lk_A(k_B-1)} Q^* \end{bmatrix}
=
\left((I_{k_A} \otimes Q)\begin{bmatrix} I \\ \Lambda^{l} \\ \vdots \\ \Lambda^{l(k_A-1)} \end{bmatrix} Q^*\right)
\otimes
\left((I_{k_B} \otimes Q)\begin{bmatrix} I \\ \Lambda^{lk_A} \\ \vdots \\ \Lambda^{lk_A(k_B-1)} \end{bmatrix} Q^*\right),
$$

where the first equality uses the eigenvalue decomposition of Rθ. Applying the properties of Kronecker products, this can be simplified as

$$
\underbrace{\big((I_{k_A} \otimes Q) \otimes (I_{k_B} \otimes Q)\big)}_{\tilde{Q}_1}
\times
\underbrace{\left(
\begin{bmatrix} I \\ \Lambda^{l} \\ \vdots \\ \Lambda^{l(k_A-1)} \end{bmatrix}
\otimes
\begin{bmatrix} I \\ \Lambda^{lk_A} \\ \vdots \\ \Lambda^{lk_A(k_B-1)} \end{bmatrix}
\right)}_{X_l}
\underbrace{(Q^*)^{\otimes 2}}_{\tilde{Q}_2}.
$$

Therefore, we can express

$$
[\,G^A_{i_0} \otimes G^B_{i_0} \,|\, G^A_{i_1} \otimes G^B_{i_1} \,|\, \ldots \,|\, G^A_{i_{\tau-1}} \otimes G^B_{i_{\tau-1}}\,]
= \tilde{Q}_1 \, [X_{i_0} | X_{i_1} | \ldots | X_{i_{\tau-1}}]
\begin{bmatrix}
\tilde{Q}_2 & 0 & \ldots & 0 \\
0 & \tilde{Q}_2 & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & \tilde{Q}_2
\end{bmatrix}.
$$

Thus, we can conclude that the invertibility and the condition number of G̃ only depend on [X_{i0} | X_{i1} | ... | X_{iτ−1}], as the matrices pre- and post-multiplying it are both unitary. The invertibility of [X_{i0} | X_{i1} | ... | X_{iτ−1}] follows from an application of Claim 15 in Appendix B. The proof of Claim 15 also shows that, upon appropriate permutation, the matrix [X_{i0} | X_{i1} | ... | X_{iτ−1}] can be expressed as a block-diagonal matrix with four blocks, each of size τ × τ. Each of these blocks is a Vandermonde matrix with parameters from the set {1, ωq, ω²q, ..., ω^{q−1}q}. Therefore, [X_{i0} | X_{i1} | ... | X_{iτ−1}] is non-singular and it follows that the threshold of our scheme is kAkB. An application of Theorem 3 implies that the worst case condition number is at most O(q^{q−τ+6}).
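Theorem 6 can be checked by brute force for toy parameters (our illustration; n = 7, kA = 2, kB = 3, so τ = 6), by forming the Kronecker block-columns G^A_l ⊗ G^B_l and examining every possible recovery matrix:

```python
import itertools
import numpy as np

def rotation(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

n, kA, kB = 7, 2, 3
tau = kA * kB                      # claimed threshold
R = rotation(2 * np.pi / n)        # q = n, odd

# Block-column l of G^A stacks R^{l*i}, i in [kA]; G^B uses exponents (l*kA)*i.
GA = [np.vstack([np.linalg.matrix_power(R, l * i) for i in range(kA)])
      for l in range(n)]
GB = [np.vstack([np.linalg.matrix_power(R, l * kA * i) for i in range(kB)])
      for l in range(n)]

worst = 0.0
for S in itertools.combinations(range(n), tau):
    Gt = np.hstack([np.kron(GA[l], GB[l]) for l in S])   # 4*tau x 4*tau
    assert np.linalg.matrix_rank(Gt) == 4 * tau          # always invertible
    worst = max(worst, np.linalg.cond(Gt))

print(worst)   # modest worst-case condition number
```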

5.5 Generalized Distributed Matrix Multiplication

In the previous sections, we considered the case where A and B are partitioned into block-columns. In this section, we consider a more general scenario where A and B are partitioned into both block-columns and block-rows. This construction resembles the entangled polynomial codes of Yu et al. (2020).

We partition the matrix A into 2p block-rows and ∆A = kA block-columns. We denote A = [A(⟨i,l⟩,j)], i ∈ [p], l ∈ {0, 1}, j ∈ [kA], where A(⟨i,l⟩,j) denotes the submatrix indexed by the ⟨i, l⟩-th block-row and the j-th block-column of A. Similarly, we partition B into 2p block-rows and ∆B = kB block-columns. We let θ = 2π/q, where q ≥ n > 2kAkBp − 1 (recall that n is the number of worker nodes) is an odd integer.

The encoding in this scenario is somewhat more complicated to express. We simplify this by

leveraging the following simple lemma whose proof follows from basic Kronecker product properties

(see Appendix B.0.1).

Lemma 4. Suppose that matrices M1 and M2 both have ζ rows and the same column dimension. Consider a 2 × 2 matrix Ψ = [Ψi,j], i = 0, 1, j = 0, 1. Then

$$
\begin{bmatrix} \Psi_{0,0} M_1 + \Psi_{0,1} M_2 \\ \Psi_{1,0} M_1 + \Psi_{1,1} M_2 \end{bmatrix}
= (\Psi \otimes I_\zeta) \begin{bmatrix} M_1 \\ M_2 \end{bmatrix}.
$$
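Lemma 4 is a direct consequence of how the Kronecker product acts on stacked blocks; a quick numerical check (our illustration, with arbitrary small dimensions) is:

```python
import numpy as np

rng = np.random.default_rng(1)
zeta = 3                                   # number of rows of M1 and M2
M1 = rng.standard_normal((zeta, 4))
M2 = rng.standard_normal((zeta, 4))
Psi = rng.standard_normal((2, 2))

lhs = np.vstack([Psi[0, 0] * M1 + Psi[0, 1] * M2,
                 Psi[1, 0] * M1 + Psi[1, 1] * M2])
rhs = np.kron(Psi, np.eye(zeta)) @ np.vstack([M1, M2])

print(np.allclose(lhs, rhs))   # True
```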
Definition 11. Suppose that A⟨i,l⟩ and B⟨i,l⟩ have ζ = t/(2p) rows. The encoded matrices for A and B are defined as follows.

$$
\begin{bmatrix} \hat{A}_{\langle k,0\rangle} \\ \hat{A}_{\langle k,1\rangle} \end{bmatrix}
= \sum_{i=0}^{p-1}\sum_{j=0}^{k_A-1} \big(R_{-\theta}^{k((j-1)p+i+1)} \otimes I_\zeta\big)
\begin{bmatrix} A_{(\langle i,0\rangle,j)} \\ A_{(\langle i,1\rangle,j)} \end{bmatrix}, \quad\text{and}
$$

$$
\begin{bmatrix} \hat{B}_{\langle k,0\rangle} \\ \hat{B}_{\langle k,1\rangle} \end{bmatrix}
= \sum_{i=0}^{p-1}\sum_{j=0}^{k_B-1} \big(R_{\theta}^{k(p-1-i+jpk_A)} \otimes I_\zeta\big)
\begin{bmatrix} B_{(\langle i,0\rangle,j)} \\ B_{(\langle i,1\rangle,j)} \end{bmatrix}.
$$

The k-th worker node stores Â⟨k,t⟩, B̂⟨k,t⟩, t = 0, 1. Thus, each worker node stores a γA = 2/(2pkA) = 1/(pkA) fraction of A and a γB = 2/(2pkB) = 1/(pkB) fraction of B. Worker node k computes

$$
\begin{bmatrix} \hat{A}_{\langle k,0\rangle} \\ \hat{A}_{\langle k,1\rangle} \end{bmatrix}^T
\begin{bmatrix} \hat{B}_{\langle k,0\rangle} \\ \hat{B}_{\langle k,1\rangle} \end{bmatrix}. \tag{5.10}
$$

Before presenting our decoding algorithm and the main result of this section, we discuss the

following example that helps clarify the underlying idea.

Example 19. Suppose kA = 1, kB = 1, p = 2, ℓ = 2. Let n = 4. The matrices A and B can be partitioned as follows.

$$
A = \begin{bmatrix} A_{(\langle 0,0\rangle,0)} \\ A_{(\langle 0,1\rangle,0)} \\ A_{(\langle 1,0\rangle,0)} \\ A_{(\langle 1,1\rangle,0)} \end{bmatrix}, \qquad
B = \begin{bmatrix} B_{(\langle 0,0\rangle,0)} \\ B_{(\langle 0,1\rangle,0)} \\ B_{(\langle 1,0\rangle,0)} \\ B_{(\langle 1,1\rangle,0)} \end{bmatrix}.
$$

In this example, since kA = kB = 1, there is only one block-column in A and B. Therefore, the index j in A(⟨i,l⟩,j) and B(⟨i,l⟩,j) is always 0. Accordingly, to simplify our presentation, we only use the indices i and l to refer to the respective constituent block-rows of A and B. That is, we simplify A(⟨i,l⟩,j) and B(⟨i,l⟩,j) to A⟨i,l⟩ and B⟨i,l⟩, respectively. Our scheme aims to let the master node recover A^T B = A^T_{⟨0,0⟩}B_{⟨0,0⟩} + A^T_{⟨0,1⟩}B_{⟨0,1⟩} + A^T_{⟨1,0⟩}B_{⟨1,0⟩} + A^T_{⟨1,1⟩}B_{⟨1,1⟩}.

Suppose that A⟨i,l⟩ and B⟨i,l⟩ have ζ rows. The encoding process can be defined as

$$
\begin{bmatrix} \hat{A}_{\langle k,0\rangle} \\ \hat{A}_{\langle k,1\rangle} \end{bmatrix}
= \sum_{i=0}^{1} \big(R_{-\theta}^{k(i-1)} \otimes I_\zeta\big)
\begin{bmatrix} A_{\langle i,0\rangle} \\ A_{\langle i,1\rangle} \end{bmatrix}, \quad\text{and}\quad
\begin{bmatrix} \hat{B}_{\langle k,0\rangle} \\ \hat{B}_{\langle k,1\rangle} \end{bmatrix}
= \sum_{i=0}^{1} \big(R_{\theta}^{k(1-i)} \otimes I_\zeta\big)
\begin{bmatrix} B_{\langle i,0\rangle} \\ B_{\langle i,1\rangle} \end{bmatrix}.
$$

The computation
 in worker
 node  k (cf. (5.10))  can be analyzed
 as follows.
F F
Ahi,0i  Ahi,0i  Bhi,0i  Bhi,0i 
Let   = (Q ⊗ Iζ )   and   = (Q ⊗ Iζ )  . Then
AF
hi,1i Ahi,1i B F
hi,1i Bhi,1i

 T      
∗
Âhk,0i  B̂hk,0i  (a) Âhk,0i  B̂hk,0i 

    = (Q ⊗ Iζ )   (Q ⊗ Iζ )  
Âhk,1i B̂hk,1i Âhk,1i B̂hk,1i
   
∗
A A

−k  h0,0i   h1,0i 
= (Q ⊗ Iζ )(R−θ ⊗ Iζ )   + (Q ⊗ Iζ )(I2 ⊗ Iζ )  
Ah0,1i Ah1,1i
   
Bh0,0i  Bh1,0i 
 
k
(Q ⊗ Iζ )(Rθ ⊗ Iζ )   + (Q ⊗ Iζ )(I2 ⊗ Iζ )  
Bh0,1i Bh1,1i
   
∗
Ah0,0i  Ah1,0i 

(b) −k ∗ ∗
= (QR−θ Q ⊗ Iζ )(Q ⊗ Iζ )   + (QI2 Q ⊗ Iζ )(Q ⊗ Iζ )  
Ah0,1i Ah1,1i
   
Bh0,0i  Bh1,0i 
 
k ∗ ∗
(QRθ Q ⊗ Iζ )(Q ⊗ Iζ )   + (QI2 Q ⊗ Iζ )(Q ⊗ Iζ )  
Bh0,1i Bh1,1i
87

       
∗ −k ∗
ωq Ah0,0i  1  Ah1,0i 

(c)
= (  ⊗ Iζ )(Q ⊗ Iζ )   + (  ⊗ Iζ )(Q ⊗ Iζ ) 


ωq∗ k Ah0,1i 1 Ah1,1i
       
k
ωq Bh0,0i  1  Bh1,0i 
 
 ⊗ Iζ )(Q ⊗ Iζ )   + (  ⊗ Iζ )(Q ⊗ Iζ ) 

 
ωq −k Bh0,1i 1 Bh1,1i
       
 ∗ −k F F ∗  k F F
(d) ωq Ah0,0i  Ah1,0i   ωq Bh0,0i  Bh1,0i 

=  +   + 
ωq∗ k AF
h0,1i A F
h1,1i ω q
−k BF
h0,1i B F
h1,1i

=(AF ∗ F F∗ F −k
h0,0i Bh1,0i + Ah1,1i Bh0,1i )ωq +

(AF ∗ F F∗ F F∗ F F∗ F
h0,0i Bh0,0i + Ah1,0i Bh1,0i + Ah0,1i Bh0,1i + Ah1,1i Bh1,1i )+

(AF ∗ F F∗ F k
h1,0i Bh0,0i + Ah0,1i Bh1,1i )ωq

where

• (a) holds because Q ⊗ Iζ is unitary,

• (b) holds by the mixed-product property of the Kronecker product. For example,

$$
(Q \otimes I_\zeta)(R_{-\theta}^{-k} \otimes I_\zeta) = (QR_{-\theta}^{-k}) \otimes I_\zeta
= (QR_{-\theta}^{-k}Q^*Q) \otimes I_\zeta
= (QR_{-\theta}^{-k}Q^* \otimes I_\zeta)(Q \otimes I_\zeta).
$$

• (c) holds because $QR_\theta Q^* = \begin{bmatrix} \omega_q & \\ & \omega_q^{-1} \end{bmatrix}$.

• (d) holds by Lemma 4.

Thus, it is clear that whenever the master node collects the results of any three distinct worker nodes, it can recover A^{F*}_{⟨0,0⟩}B^F_{⟨0,0⟩} + A^{F*}_{⟨1,0⟩}B^F_{⟨1,0⟩} + A^{F*}_{⟨0,1⟩}B^F_{⟨0,1⟩} + A^{F*}_{⟨1,1⟩}B^F_{⟨1,1⟩}. However, we observe that

$$
\begin{bmatrix} A^F_{\langle i,0\rangle} \\ A^F_{\langle i,1\rangle} \end{bmatrix}^* \begin{bmatrix} B^F_{\langle i,0\rangle} \\ B^F_{\langle i,1\rangle} \end{bmatrix}
= \begin{bmatrix} A_{\langle i,0\rangle} \\ A_{\langle i,1\rangle} \end{bmatrix}^T \begin{bmatrix} B_{\langle i,0\rangle} \\ B_{\langle i,1\rangle} \end{bmatrix}.
$$

Thus, we can equivalently recover A^T B.
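Example 19 can be checked end-to-end numerically. The sketch below (our illustration, with q = 5, n = 4 and small hypothetical block sizes) encodes with rotation matrices, has each worker form Âᵀₖ B̂ₖ, and recovers AᵀB from any three workers by inverting a 3 × 3 system in the powers of ω_q, exactly as the derivation above suggests.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, p = 4, 5, 2                    # n workers; q >= n odd; threshold 2p - 1 = 3
theta = 2 * np.pi / q
omega = np.exp(2j * np.pi / q)

def rotation(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

zeta, r, w = 3, 2, 2                 # illustrative block sizes
# Stacked blocks A_<0,0>, A_<0,1>, A_<1,0>, A_<1,1> (similarly for B).
A = rng.standard_normal((4 * zeta, r))
B = rng.standard_normal((4 * zeta, w))
A00, A01, A10, A11 = (A[i * zeta:(i + 1) * zeta] for i in range(4))
B00, B01, B10, B11 = (B[i * zeta:(i + 1) * zeta] for i in range(4))

def encode(k):
    # i = 0 contributes R_{-theta}^{-k} (resp. R_theta^{k}); i = 1 contributes I.
    Ahat = (np.kron(np.linalg.matrix_power(rotation(-theta), -k), np.eye(zeta))
            @ np.vstack([A00, A01])) + np.vstack([A10, A11])
    Bhat = (np.kron(np.linalg.matrix_power(rotation(theta), k), np.eye(zeta))
            @ np.vstack([B00, B01])) + np.vstack([B10, B11])
    return Ahat, Bhat

# Any 3 of the 4 workers suffice; use workers 1, 2, 3.
ks = [1, 2, 3]
results = []
for k in ks:
    Ahat, Bhat = encode(k)
    results.append(Ahat.T @ Bhat)    # each worker returns an r x w matrix

# C-hat_k = Z_{-1} omega^{-k} + Z_0 + Z_1 omega^{k}; the constant term Z_0 = A^T B.
V = np.array([[omega ** (k * d) for d in (-1, 0, 1)] for k in ks])
coeffs = np.linalg.solve(V, np.stack([C.reshape(-1) for C in results]))
C0 = coeffs[1].reshape(r, w).real

print(np.allclose(C0, A.T @ B))      # True: the master recovers A^T B
```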

Theorem 7. The threshold for the scheme in this section is 2pkAkB − 1. The worst case condition number of the recovery matrices is upper bounded by O(q^{q−2pkAkB+7}).

Proof. We proceed in a similar manner as in Example 19. Following the encoding rules (cf. Definition 11) and the worker computation rules (cf. (5.10)), we can analyze the computation in worker k as follows. Let

$$
(Q \otimes I_\zeta)\begin{bmatrix} A_{(\langle i,0\rangle,j)} \\ A_{(\langle i,1\rangle,j)} \end{bmatrix}
= \begin{bmatrix} A^F_{(\langle i,0\rangle,j)} \\ A^F_{(\langle i,1\rangle,j)} \end{bmatrix}
\quad\text{and}\quad
(Q \otimes I_\zeta)\begin{bmatrix} B_{(\langle i,0\rangle,j)} \\ B_{(\langle i,1\rangle,j)} \end{bmatrix}
= \begin{bmatrix} B^F_{(\langle i,0\rangle,j)} \\ B^F_{(\langle i,1\rangle,j)} \end{bmatrix}.
$$

Then, we have

$$
\begin{aligned}
\hat{A}^F_k = (Q \otimes I_\zeta)\hat{A}_k
&= \sum_{i=0}^{p-1}\sum_{j=0}^{k_A-1} \big(QR_{-\theta}^{k((j-1)p+i+1)}Q^*Q \otimes I_\zeta\big)\begin{bmatrix} A_{(\langle i,0\rangle,j)} \\ A_{(\langle i,1\rangle,j)} \end{bmatrix} \\
&= \sum_{i=0}^{p-1}\sum_{j=0}^{k_A-1} \big(\Lambda^{*\,k((j-1)p+i+1)} \otimes I_\zeta\big)(Q \otimes I_\zeta)\begin{bmatrix} A_{(\langle i,0\rangle,j)} \\ A_{(\langle i,1\rangle,j)} \end{bmatrix} \\
&= \begin{bmatrix} \sum_{i=0}^{p-1}\sum_{j=0}^{k_A-1} \omega_q^{*\,k((j-1)p+i+1)} A^F_{(\langle i,0\rangle,j)} \\[2pt] \sum_{i=0}^{p-1}\sum_{j=0}^{k_A-1} \omega_q^{*\,-k((j-1)p+i+1)} A^F_{(\langle i,1\rangle,j)} \end{bmatrix}, \\[4pt]
\hat{B}^F_k = (Q \otimes I_\zeta)\hat{B}_k
&= \sum_{i=0}^{p-1}\sum_{j=0}^{k_B-1} \big(QR_{\theta}^{k(p-1-i+jpk_A)}Q^*Q \otimes I_\zeta\big)\begin{bmatrix} B_{(\langle i,0\rangle,j)} \\ B_{(\langle i,1\rangle,j)} \end{bmatrix} \\
&= \sum_{i=0}^{p-1}\sum_{j=0}^{k_B-1} \big(\Lambda^{k(p-1-i+jpk_A)} \otimes I_\zeta\big)(Q \otimes I_\zeta)\begin{bmatrix} B_{(\langle i,0\rangle,j)} \\ B_{(\langle i,1\rangle,j)} \end{bmatrix} \\
&= \begin{bmatrix} \sum_{i=0}^{p-1}\sum_{j=0}^{k_B-1} \omega_q^{k(p-1-i+jpk_A)} B^F_{(\langle i,0\rangle,j)} \\[2pt] \sum_{i=0}^{p-1}\sum_{j=0}^{k_B-1} \omega_q^{-k(p-1-i+jpk_A)} B^F_{(\langle i,1\rangle,j)} \end{bmatrix}.
\end{aligned}
$$

This implies that

$$
\begin{aligned}
\hat{A}^T_k \hat{B}_k &= \big((Q \otimes I_\zeta)\hat{A}_k\big)^* (Q \otimes I_\zeta)\hat{B}_k = \hat{A}^{F*}_k \hat{B}^F_k \\
&= \Big(\sum_{i=0}^{p-1}\sum_{j=0}^{k_A-1} \omega_q^{k((j-1)p+i+1)} A^{F*}_{(\langle i,0\rangle,j)}\Big)\Big(\sum_{i=0}^{p-1}\sum_{j=0}^{k_B-1} \omega_q^{k(p-1-i+jpk_A)} B^F_{(\langle i,0\rangle,j)}\Big) \\
&\quad + \Big(\sum_{i=0}^{p-1}\sum_{j=0}^{k_A-1} \omega_q^{-k((j-1)p+i+1)} A^{F*}_{(\langle i,1\rangle,j)}\Big)\Big(\sum_{i=0}^{p-1}\sum_{j=0}^{k_B-1} \omega_q^{-k(p-1-i+jpk_A)} B^F_{(\langle i,1\rangle,j)}\Big).
\end{aligned} \tag{5.11}
$$

To better understand the behavior of the sum in (5.11), we divide its terms into two cases:

• Case 1: Useful terms. The master node wants to recover C = A^T B = [C_{i,j}], i ∈ [kA], j ∈ [kB], where each C_{i,j} is a block matrix of size r/kA × w/kB. Note that

$$
C_{i,j} = \sum_{u=0}^{p-1} \big(A^T_{(\langle u,0\rangle,i)} B_{(\langle u,0\rangle,j)} + A^T_{(\langle u,1\rangle,i)} B_{(\langle u,1\rangle,j)}\big).
$$

Thus, the "useful" terms in (5.11) are the terms with coefficients A^T_{(⟨u,0⟩,i)}B_{(⟨u,0⟩,j)} and A^T_{(⟨u,1⟩,i)}B_{(⟨u,1⟩,j)}, u ∈ [p]. They correspond to the terms A^{F*}_{(⟨u,0⟩,i)}B^F_{(⟨u,0⟩,j)} and A^{F*}_{(⟨u,1⟩,i)}B^F_{(⟨u,1⟩,j)} in (5.11), since

$$
\begin{aligned}
A^{F*}_{(\langle u,0\rangle,i)} B^F_{(\langle u,0\rangle,j)} + A^{F*}_{(\langle u,1\rangle,i)} B^F_{(\langle u,1\rangle,j)}
&= \begin{bmatrix} A^F_{(\langle u,0\rangle,i)} \\ A^F_{(\langle u,1\rangle,i)} \end{bmatrix}^* \begin{bmatrix} B^F_{(\langle u,0\rangle,j)} \\ B^F_{(\langle u,1\rangle,j)} \end{bmatrix} \\
&= \left((Q \otimes I_\zeta)\begin{bmatrix} A_{(\langle u,0\rangle,i)} \\ A_{(\langle u,1\rangle,i)} \end{bmatrix}\right)^* (Q \otimes I_\zeta)\begin{bmatrix} B_{(\langle u,0\rangle,j)} \\ B_{(\langle u,1\rangle,j)} \end{bmatrix} \\
&= \begin{bmatrix} A_{(\langle u,0\rangle,i)} \\ A_{(\langle u,1\rangle,i)} \end{bmatrix}^T \begin{bmatrix} B_{(\langle u,0\rangle,j)} \\ B_{(\langle u,1\rangle,j)} \end{bmatrix} \\
&= A^T_{(\langle u,0\rangle,i)} B_{(\langle u,0\rangle,j)} + A^T_{(\langle u,1\rangle,i)} B_{(\langle u,1\rangle,j)}.
\end{aligned}
$$

It is easy to check that A^{F*}_{(⟨u,0⟩,i)}B^F_{(⟨u,0⟩,j)} is the coefficient of ω_q^{k(ip+jpkA)} and A^{F*}_{(⟨u,1⟩,i)}B^F_{(⟨u,1⟩,j)} is the coefficient of ω_q^{−k(ip+jpkA)}.

• Case 2: Interference terms. The terms in (5.11) with coefficient A^{F*}_{(⟨u,l⟩,i)}B^F_{(⟨v,l⟩,j)} with u ≠ v are the interference terms, and they are the coefficients of ω_q^{±k(ip+u−v+jpkA)}. We conclude that the useful terms have no intersection with the interference terms since |u − v| < p.

Next, we determine the threshold of the proposed scheme. Towards this end, we find the maximum and minimum degree of Â^{F*}_k B̂^F_k and then argue that (5.11) contains consecutive multiples of powers of k. The threshold can then be obtained as the difference between the maximum and minimum degree divided by k.

The maximum degree term of Â^{F*}_k B̂^F_k is

$$
\omega_q^{k(pk_Ak_B-1)}\, A^{F*}_{(\langle p-1,0\rangle,k_A-1)} B^F_{(\langle 0,0\rangle,k_B-1)},
$$

and the minimum degree term is

$$
\omega_q^{-k(pk_Ak_B-1)}\, A^{F*}_{(\langle p-1,1\rangle,k_A-1)} B^F_{(\langle 0,1\rangle,k_B-1)}.
$$

We observe that the powers of k in (5.11) can be written as ±((j1 − 1)p + i1 + 1 + p − 1 − i2 + j2pkA) = ±(j2pkA + j1p + i1 − i2), where j1 ∈ [kA], j2 ∈ [kB], i1, i2 ∈ [p]. Consider a positive power d ≤ pkAkB − 1. We can always find a solution such that j2 = ⌊d/(pkA)⌋, j1 = ⌊(d mod pkA)/p⌋, and i1 − i2 = (d mod pkA) mod p. The same result can be generalized when d is negative. Then the threshold of this scheme is 2pkAkB − 1.

Equation (5.11) shows that the proposed encoding matrix in the transformed domain is a Vandermonde matrix with parameters {ω_q^{−(pkAkB−1)}, ..., ω_q^{pkAkB−1}}. An application of Theorem 3 implies that the worst case condition number is upper bounded by O(q^{q−2pkAkB+7}).

The proof of Theorem 7 also illustrates the decoding algorithm. Let the k-th worker node compute Ĉ_k = Â^T_k B̂_k. Suppose that the master node receives the computation results from any 2pkAkB − 1 worker nodes, which are denoted Ĉ_{i_0}, ..., Ĉ_{i_{2pkAkB−2}}. By (5.11), the terms A^{F*}_{(⟨u,l⟩,i)}B^F_{(⟨v,l⟩,j)}, u, v ∈ [p], i ∈ [kA], j ∈ [kB], can be decoded by multiplying the vector [Ĉ_{i_0}, ..., Ĉ_{i_{2pkAkB−2}}] by the inverse of the Vandermonde matrix

$$
\begin{bmatrix}
\omega_q^{-i_0(pk_Ak_B-1)} & \omega_q^{-i_1(pk_Ak_B-1)} & \cdots & \omega_q^{-i_{2pk_Ak_B-2}(pk_Ak_B-1)} \\
\omega_q^{-i_0(pk_Ak_B-2)} & \omega_q^{-i_1(pk_Ak_B-2)} & \cdots & \omega_q^{-i_{2pk_Ak_B-2}(pk_Ak_B-2)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
\omega_q^{i_0(pk_Ak_B-1)} & \omega_q^{i_1(pk_Ak_B-1)} & \cdots & \omega_q^{i_{2pk_Ak_B-2}(pk_Ak_B-1)}
\end{bmatrix}.
$$

Finally, the result C = [C_{i,j}], i ∈ [kA], j ∈ [kB], can be recovered since

$$
C_{i,j} = \sum_{u=0}^{p-1} \big(A^T_{(\langle u,0\rangle,i)} B_{(\langle u,0\rangle,j)} + A^T_{(\langle u,1\rangle,i)} B_{(\langle u,1\rangle,j)}\big)
= \sum_{u=0}^{p-1} \big(A^{F*}_{(\langle u,0\rangle,i)} B^F_{(\langle u,0\rangle,j)} + A^{F*}_{(\langle u,1\rangle,i)} B^F_{(\langle u,1\rangle,j)}\big).
$$

We now compare the proposed threshold to that of Fahim and Cadambe (2019). The threshold of Fahim and Cadambe (2019) is

$$
\tau_{M-V} = 4k_Ak_Bp - 2(k_Ak_B + pk_A + pk_B) + k_A + k_B + 2p - 1.
$$

Since our threshold in Section 5.4 is better than τ_{M−V} when p = 1, we are only interested in the case p > 1. Let τ_{dif} = τ_{M−V} − τ_{proposed} = 2kAkBp − 2(kAkB + pkA + pkB) + kA + kB + 2p. We present our comparison in Claim 16.
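The two thresholds are easy to tabulate. The short sketch below (our illustration) evaluates τ_{M−V} and the proposed threshold 2pkAkB − 1 for a few parameter choices, including the (kA, kB, p) = (2, 2, 2) setting used later in Table 5.3, where the two coincide.

```python
def tau_mv(kA, kB, p):
    """Threshold of Fahim and Cadambe (2019)."""
    return 4 * kA * kB * p - 2 * (kA * kB + p * kA + p * kB) + kA + kB + 2 * p - 1

def tau_proposed(kA, kB, p):
    """Threshold of the scheme in this section."""
    return 2 * p * kA * kB - 1

for kA, kB, p in [(2, 2, 2), (2, 3, 2), (3, 3, 3)]:
    print((kA, kB, p), tau_mv(kA, kB, p), tau_proposed(kA, kB, p))
# (2, 2, 2): both 15;  (2, 3, 2): 24 vs 23;  (3, 3, 3): 65 vs 53
```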

5.6 Comparisons and Numerical Experiments

Suppose that the number of workers n is odd, so that we can pick q = n for the rotation matrix embedding. From a theoretical perspective, our schemes have a worst case condition number (over the different recovery submatrices) of O(q^{q−τ+6}), where τ is the recovery threshold. In contrast, the scheme of Yu et al. (2017) has condition numbers that are exponential in n. The work most closely related to ours is by Fahim and Cadambe (2019), which demonstrates an upper bound of O(q^{2(q−τ)}) on the worst case condition number. It can be noted that this grows much faster than our upper bound in the parameter q − τ. In numerical experiments, our worst case condition numbers are much smaller than those of Fahim and Cadambe (2019); we discuss this in detail below.

Certain approaches Mallick et al. (2019); Das et al. (2018); Das and Ramamoorthy (2019);

Ramamoorthy et al. (2019) only apply for matrix-vector multiplication and furthermore do not

provide any explicit guarantees on the worst case condition number. Other approaches include the

work of Subramaniam et al. (2019) which uses random linear encoding of the A and B matrices

and the work of Das et al. (2019) that uses a convolutional coding approach to this problem. We

note that both these approaches work via random sampling and do not have a theoretical upper

bound on the worst case condition number. We present numerical experiments comparing our work

with Yu et al. (2017); Fahim and Cadambe (2019); Subramaniam et al. (2019) below.

5.6.1 Numerical Experiments

The central point of our paper is that we can leverage the well-conditioned behavior of Vander-

monde matrices with parameters on the unit circle while continuing to work with computation over

the reals. We compare our results with the work of Yu et al. (2017) (called “Real Vandermonde”),

Table 5.1 Comparison for the matrix-vector case with n = 31. A has size 28000 × 19720 and x has length 28000.

Scheme               γA     τ    Avg. Cond. Num.  Max. Cond. Num.  Avg. Worker Comp. Time (s)  Dec. Time (s)
Real Vand.           1/29   29   1.1 × 10^13      2.9 × 10^13      1.2 × 10^−3                 9 × 10^−5
Complex Vand.        1/29   29   12               55               2.9 × 10^−3                 2.8 × 10^−4
Circ. Perm. Embed.   1/28   29   12               55               1.2 × 10^−3                 3.7 × 10^−4
Rot. Mat. Embed.     1/29   29   12               55               1.3 × 10^−3                 10^−4

[Figure 5.1 here: Normalized MSE (worst case) vs. SNR (dB) for the Complex Vandermonde, Circulant Permutation Embedding, and Rotation Matrix Embedding schemes.]

Figure 5.1 Consider the matrix-vector A^T x multiplication system with n = 31, τ = 29. A has size 28000 × 19720 and x has length 28000.

a “Complex Vandermonde” scheme where the evaluation points are chosen from the complex unit

circle, the work of Fahim and Cadambe (2019) and Subramaniam et al. (2019).

All experiments were run on the AWS EC2 system with a t2.2xlarge instance (for master node)

and t2.micro instances (for slave nodes).

5.6.1.1 Matrix-vector case

In Table 5.1, we compare the average and worst case condition number of the different schemes

for matrix-vector multiplication. The system under consideration has n = 31 worker nodes and a

threshold specified by the third column (labeled as τ ). The evaluation points for Yu et al. (2017)

were uniformly sampled from the interval [−1, 1] Berrut and Trefethen (2004). The Complex

Table 5.2 Comparison for the A^T B matrix-matrix multiplication case with n = 31, kA = 4, kB = 7. A has size 8000 × 14000, B has size 8400 × 14000.

Scheme                      γA    γB    τ    Avg. Cond. Num.  Max. Cond. Num.  Avg. Worker Comp. Time (s)  Dec. Time (s)
Real Vand.                  1/4   1/7   28   4.9 × 10^12      2.3 × 10^13      2.132                       0.407
Complex Vand.               1/4   1/7   28   27               404              8.421                       1.321
Rot. Mat. Embed.            1/4   1/7   28   27               404              2.121                       0.408
Fahim and Cadambe (2019)    1/4   1/7   28   1449             8.3 × 10^4       2.263                       0.412
Subramaniam et al. (2019)   1/4   1/7   28   255              5.6 × 10^4       2.198                       0.406

[Figure 5.2 here: Normalized MSE (worst case) vs. SNR (dB) for the Complex Vandermonde, Rotation Matrix Embedding, Fahim and Cadambe (2019), and Subramaniam et al. (2019) schemes.]

Figure 5.2 Consider the matrix-matrix A^T B multiplication system with n = 31, kA = 4, kB = 7. A is of size 8000 × 14000, B is of size 8400 × 14000.

Vandermonde scheme has evaluation points which are the 31-st roots of unity. The Fahim and Cadambe (2019) and Subramaniam et al. (2019) schemes are not applicable for the matrix-vector case. It can be observed from Table 5.1 that both the worst case and the average condition numbers of our schemes are over eleven orders of magnitude better than those of the Real Vandermonde scheme. Furthermore, there is an exact match of the condition number values for all the other schemes. This can be understood by following the discussion in Section 5.3. Specifically, our schemes have the property that the condition number only depends on the eigenvalues of the corresponding circulant permutation matrix and rotation matrix, respectively. These eigenvalues lie precisely among the 31-st roots of unity.



It can be observed that the decoding flop count for both matrix-vector and matrix-matrix

multiplication is independent of t, i.e., in the regime where t is very large the decoding time may

be neglected with respect to the worker node computation time. Nevertheless, from a practical

perspective it is useful to understand the decoding times as well.

When the matrix A is of dimension 28000 × 19720 and x is of length 28000, the last two

columns in Table 5.1 indicate the average worker node computation time and the master node

decoding time for the different schemes. These numbers were obtained by averaging over several

runs of the algorithm. It can be observed that the Complex Vandermonde scheme requires about

twice the worker computation time as our schemes. Thus, it is wasteful of worker node computation

resources. On the other hand, our schemes leverage the same condition number with computation

over the reals. The decoding times of almost all the schemes are quite small. However, the Circulant

Permutation Matrix scheme requires decoding time which is somewhat higher than the rotation

matrix embedding even though we can use FFT based approaches for it. We expect that for much

larger scale problems, the FFT based approach may be faster.

Our next set of results compares the mean-squared error (MSE) in the decoded result for the different schemes. Let A^T x denote the precise value of the computation and \widehat{A^T x} denote the result of using one of the discussed methods. The normalized MSE is defined as

$$
\frac{\|A^T x - \widehat{A^T x}\|_2}{\|A^T x\|_2}.
$$

To simulate numerical precision problems, we added i.i.d. Gaussian noise (of different SNRs) to the results of the worker node computations. The master node then performs decoding on the noisy vectors. The plots in Figure 5.1 correspond to the worst case choice of worker nodes for each of the schemes. It can be observed that the Circulant Permutation Matrix Embedding has the best performance. This is because many of the matrices on the block-diagonal in (B.2) (see Appendix B) have well-behaved condition numbers and only a few correspond to the worst case. We have not shown the results for the Real Vandermonde case here because its normalized MSE was close to 1.0.
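The noise-injection experiment just described can be imitated in a few lines. The sketch below (our illustration, with toy sizes and a generic linear-solve decoder rather than the dissertation's exact pipeline) injects Gaussian noise at a given SNR into the observations, decodes, and shows how a well-conditioned unit-circle recovery matrix yields a much lower normalized error than an ill-conditioned real Vandermonde one.

```python
import numpy as np

rng = np.random.default_rng(3)

def normalized_error(G, x, snr_db):
    """Decode x from y = G x + noise and return ||x - x_hat|| / ||x||."""
    y = G @ x
    sigma = (np.linalg.norm(y) / np.sqrt(y.size)) * 10 ** (-snr_db / 20)
    x_hat = np.linalg.solve(G, y + sigma * rng.standard_normal(y.shape))
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)

k = 8
x = rng.standard_normal(k)

# Ill-conditioned: real Vandermonde with equidistant nodes in [-1, 1].
V_real = np.vander(np.linspace(-1, 1, k), k, increasing=True).T
# Well-conditioned: Vandermonde with well-separated nodes among 17th roots of unity.
V_circ = np.vander(np.exp(2j * np.pi * np.arange(0, 16, 2) / 17), k,
                   increasing=True).T

err_real = normalized_error(V_real, x, snr_db=100)
err_circ = normalized_error(V_circ, x, snr_db=100)
print(err_real, err_circ)   # the unit-circle system is far more accurate
```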

Table 5.3 Comparison for the matrix-matrix A^T B multiplication case with n = 17, uA = 2, uB = 2, p = 2. A is of size 4000 × 16000, B is of size 4000 × 16000.

Scheme                     γA    γB    τ    Avg. Cond. Num.  Max. Cond. Num.  Avg. Worker Comp. Time (s)  Dec. Time (s)
Real Vand.                 1/4   1/4   9    1.8 × 10^4       2 × 10^6         2.24                        0.11
Complex Vand.              1/4   1/4   9    51               1.8 × 10^3       8.15                        0.38
Rot. Mat. Embed.           1/4   1/4   15   7                22               2.23                        0.69
Fahim and Cadambe (2019)   1/4   1/4   15   104              2.7 × 10^5       2.23                        0.18

5.6.1.2 Matrix-Matrix case

In the matrix-matrix scenario we again consider a system with n = 31 worker nodes, kA = 4, and kB = 7, so that the threshold is τ = kAkB = 28. Once again, we observe that the worst case condition number of the Rotation Matrix Embedding is about eleven orders of magnitude lower than that of the Real Vandermonde case. Furthermore, the schemes of Fahim and Cadambe (2019) and Subramaniam et al. (2019) have worst case condition numbers that are, respectively, three and two orders of magnitude higher than our scheme. For the Subramaniam et al. (2019) scheme we performed 200 random trials and picked the scheme with the lowest worst case condition number.

When the matrix A is of dimension 8000 × 14000 and B is of dimension 8400 × 14000, the worker node computation times and decoding times are listed in Table 5.2. As expected, the Complex Vandermonde scheme takes much longer for the worker node computations, whereas the Rotation Matrix Embedding, Fahim and Cadambe (2019), and Subramaniam et al. (2019) schemes take about the same time. The decoding times are also very similar. For the matrix-matrix case the normalized MSE is defined as ‖A^T B − \widehat{A^T B}‖_F / ‖A^T B‖_F, where A^T B is the true product and \widehat{A^T B} is the decoded product obtained using one of the methods. As shown in Figure 5.2, the normalized MSE of our Rotation Matrix Embedding scheme is about five orders of magnitude lower than that of the scheme of Fahim and Cadambe (2019). The normalized MSE of the Real Vandermonde case is almost 1.0, so we do not plot it.
plot it.

[Figure 5.3 here: Normalized MSE (worst case) vs. SNR (dB) for the Real Vandermonde, Complex Vandermonde, Rotation Matrix Embedding, and Fahim and Cadambe (2019) schemes.]

Figure 5.3 Consider the matrix-matrix A^T B multiplication system with n = 18, uA = 2, uB = 2, p = 2. A is of size 4000 × 16000, B is of size 4000 × 16000.

In the matrix-matrix multiplication scenario with p ≥ 2, we consider a system with n = 17 worker nodes and uA = 2, uB = 2, p = 2. We emphasize that the four schemes in Table 5.3 have different thresholds. The Vandermonde-based schemes have a lower threshold than the Rotation Matrix Embedding scheme and the Fahim and Cadambe (2019) scheme. As we have discussed in Claim 16, in most cases the Rotation Matrix Embedding scheme has a lower threshold than the Fahim and Cadambe (2019) scheme. However, for a fair comparison, we consider a case where they have the same threshold. We observe that the Rotation Matrix Embedding scheme has a much lower condition number than the other three schemes. Notice that it is even lower than that of the Complex Vandermonde scheme, in contrast to the matrix-vector and matrix-matrix multiplication cases considered earlier. This is because, in the Complex Vandermonde scheme with p ≥ 2, the encoding matrix is a Vandermonde matrix of size 9 × 17 whose generators are roots of unity. Then, by Claim 3, the condition number is bounded by O(17^14). The encoding matrix of the Rotation Matrix Embedding scheme is a Vandermonde matrix of size 15 × 17, whose condition number is upper bounded by O(17^8). As shown in Figure 5.3, the normalized MSE of our Rotation Matrix Embedding scheme is much lower than that of the other schemes. As for the computation and decoding times in Table 5.3, the Complex Vandermonde scheme requires higher computation and decoding times than the Real Vandermonde scheme (4x) since it operates over the complex field. Fahim and Cadambe (2019) requires a higher decoding time than the Real Vandermonde scheme (2x) since its threshold is higher, but they have the same computation time. The Rotation Matrix Embedding scheme has the highest decoding time since it has a higher threshold than the Real/Complex Vandermonde schemes and its decoding algorithm operates over the complex field.

5.6.1.3 Matrix-Matrix case with finite field embedding

We now consider the finite field embedding proposed in Yu et al. (2020). As discussed before, this is mentioned as a potential solution to the numerical issues encountered when operating over the reals in Section VII of Yu et al. (2020). For this purpose, the real entries need to be multiplied by large enough integers and then quantized so that each entry lies between 0 and p − 1 for a large enough prime p. All computations are performed within the finite field of order p, i.e., by reducing the computations modulo p. This technique requires each A_i^T B_j to have all its entries within 0 to p − 1; otherwise there will be errors in the computation.

Let α be an upper bound on the absolute value of the matrix entries in A and B. Then, the following dynamic range constraint (DRC) needs to be satisfied:

$$
\alpha^2 t \le p - 1.
$$

Otherwise, the modulo-p operation will cause arbitrarily large errors.
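The effect of violating the DRC can be seen directly: once the entries of A^T B exceed p − 1, the modulo-p result no longer equals the true product. A minimal sketch (our illustration, mirroring the α = 30, t = 400, p = 65537 numbers discussed below; taking the final residue mod p gives the same result as a true finite-field pipeline, since reduction mod p is a ring homomorphism):

```python
import numpy as np

rng = np.random.default_rng(4)
p, t, alpha = 65537, 400, 30

A = rng.integers(0, alpha + 1, size=(t, 3))
B = rng.integers(0, alpha + 1, size=(t, 2))

exact = A.T @ B            # entries can reach alpha^2 * t = 360000 > p - 1
finite = exact % p         # what a mod-p implementation recovers

print(int(exact.max()))    # exceeds p - 1, so wraparound must occur
print(bool((finite == exact).all()))
```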

We note here that the publicly available code for Yu et al. (2017) uses p = 65537. Now consider a system with kA = 3, kB = 2. Even for small matrices, with A of size 400 × 200, B of size 400 × 300, and entries chosen as random integers between 0 and 30, the DRC is violated for p = 65537, since 30² × 400 > 65537. In this scenario, the normalized MSE of the Yu et al. (2017) approach is 0.7746. In contrast, our method has a normalized MSE ≈ 2 × 10^−28 for the same system with kA = 3, kB = 2.

When working over 64-bit integers, the largest integer is ≈ 10^19. Thus, even if t < 10^5, the method can only support α ≤ 10^7; the range is rather limited. Furthermore, assuming matrices of limited dynamic range is not always valid. In machine learning scenarios such as deep neural networks, matrix multiplications are applied repeatedly, and the output of one stage serves as the input of the next. Thus, over several iterations the dynamic range of the matrix entries will grow, and applying the finite field embedding technique will necessarily incur quantization error.

The most serious limitation of the method comes from the fact that the error in the computation (owing to quantization) depends very strongly on the actual entries of the A and B matrices. We demonstrate this next. In fact, we can generate structured integer matrices A and B such that the normalized MSE of their approach is exactly 1.0. Towards this end, we first pick the prime p = 2147483647 (much larger than the prime in their publicly available code) so that their method can support a higher dynamic range. Next, let r = w = t = 400. This implies that α ≤ 1000 by the dynamic range constraint.

For kA = kB = 2, the matrices have the following block decomposition.

$$
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad\text{and}\quad
B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}.
$$

Each Aij and Bij is a matrix of size 200 × 200, with entries chosen from the following distributions:

A11, A12 distributed Unif{0, . . . , 9999} and A21, A22 distributed Unif{0, . . . , 9}; next, B11, B12

distributed Unif{0, . . . , 9} and B21, B22 distributed Unif{0, . . . , 9999}. In this scenario, the DRC

requires us to multiply each matrix by 0.1 and quantize each entry to lie between 0 and 999. Note that

this implies that A21, A22, B11, B12 are all quantized into zero submatrices, since every entry of these

four submatrices is less than 10. We emphasize that the finite field embedding technique only

recovers the product of these quantized matrices. However, this product is

\[
\tilde{A}^T \tilde{B} = \begin{bmatrix} \tilde{A}_{11} & \tilde{A}_{12} \\ 0 & 0 \end{bmatrix}^T
\begin{bmatrix} 0 & 0 \\ \tilde{B}_{21} & \tilde{B}_{22} \end{bmatrix} = 0.
\]

Thus, the final estimate of the original product A^T B, denoted \widehat{A^T B}, is the all-zeros matrix. This

implies that the normalized MSE of their scheme is exactly 1.0. Thus, the finite field embedding

Table 5.4: Performance of matrix inversion over a large prime order field in Python 3.7.
The table shows the computation time for inverting an ℓ × ℓ matrix G over a finite
field of order p. Let \widehat{G^{-1}} denote the inverse obtained by applying the sympy
function Matrix(G).inverse_mod(p). The MSE is defined as (1/ℓ) ||G \widehat{G^{-1}} − I||_F.

ℓ    p             Computation Time (s)    MSE
9    65537         1.39                    0
12   65537         4.38                    0
15   65537         12.64                   0
9    2147483647    1.39                    0
12   2147483647    4.68                    1.8 × 10^9
15   2147483647    14.45                   4.2 × 10^9

technique has a very strong dependence on the matrix entries. We note here that even if we consider

other quantization schemes or larger 64-bit primes, one can arrive at adversarial examples such as

the ones shown above. Once again for these examples, our methods have a normalized MSE of at

most 10^{−27}.
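The adversarial construction can be checked in a few lines. The sketch below uses small toy blocks (4 × 4 instead of the 200 × 200 blocks in the text) but the same uniform distributions and the same scale-by-0.1 quantization:

```python
# Sketch of the adversarial example above with toy block sizes (assumption:
# 4 x 4 blocks instead of 200 x 200).  Quantizing both factors into the range
# {0,...,999} (multiply by 0.1 and truncate) zeroes out A21, A22, B11, B12,
# so the product of the quantized matrices is identically zero and the
# embedding-based estimate has normalized MSE exactly 1.0.
import random

random.seed(0)
m = 4                                     # block size (200 in the text)
def blk(hi):                              # m x m block with entries Unif{0..hi}
    return [[random.randint(0, hi) for _ in range(m)] for _ in range(m)]

def vstack(T, B):  return T + B
def hstack(L, R):  return [l + r for l, r in zip(L, R)]

A = vstack(hstack(blk(9999), blk(9999)), hstack(blk(9), blk(9)))
B = vstack(hstack(blk(9), blk(9)), hstack(blk(9999), blk(9999)))

quant = lambda M: [[int(0.1 * x) for x in row] for row in M]
Aq, Bq = quant(A), quant(B)
# Every entry of the bottom half of A and the top half of B is < 10, so it
# quantizes to 0; hence (Aq)^T (Bq) = 0 identically.
prod = [[sum(Aq[k][i] * Bq[k][j] for k in range(2 * m)) for j in range(2 * m)]
        for i in range(2 * m)]
assert all(x == 0 for row in prod for x in row)
print("quantized product is the all-zeros matrix")
```

Since the true product A^T B is nonzero with probability one under these distributions, the all-zeros estimate yields a normalized MSE of exactly 1.0.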

In our experience, the finite field embedding technique also suffers from significant computational

issues in implementation. Note that the technique requires computing, at the master node, the

inverse of a matrix that is needed for decoding the final result. We implemented this using the

sympy library in Python 3.7 (see the Tang (2020) GitHub repository). We performed experiments

with p = 65537 and p = 2147483647. As shown in Table 5.4, for the smaller prime

p = 65537, the inverse computation is accurate up to 15 × 15 matrices; however, the computation

time of the inverse is rather high and can dominate the overall execution time. On the other hand,

for the larger prime p = 2147483647, the error in the computed inverse is very high for 12 × 12

and 15 × 15 matrices; the corresponding time taken is even higher. Very careful implementations

could perhaps avoid these issues; however, we are unaware of any such publicly

available code.
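An exact modular inverse need not go through symbolic arithmetic at all. The following sketch (an illustrative alternative, not the sympy routine timed above) does Gauss-Jordan elimination directly over GF(p) using Python's modular inverse `pow(x, -1, p)` (Python 3.8+), so there is no floating-point error by construction:

```python
# A minimal exact inverse over GF(p) via Gauss-Jordan elimination.  This is a
# sketch of an alternative to the sympy call used in the experiments above;
# it assumes the input matrix is invertible mod p and costs O(l^3) field ops.
def inv_mod_p(G, p):
    l = len(G)
    # Augment [G | I] and row-reduce, all arithmetic mod p.
    M = [[x % p for x in row] + [int(i == j) for j in range(l)]
         for i, row in enumerate(G)]
    for c in range(l):
        piv = next(r for r in range(c, l) if M[r][c] % p)  # assumes invertible
        M[c], M[piv] = M[piv], M[c]
        inv = pow(M[c][c], -1, p)                          # modular inverse
        M[c] = [x * inv % p for x in M[c]]
        for r in range(l):
            if r != c and M[r][c]:
                f = M[r][c]
                M[r] = [(a - f * b) % p for a, b in zip(M[r], M[c])]
    return [row[l:] for row in M]

p = 2147483647
G = [[1, 2, 3], [0, 1, 4], [5, 6, 0]]     # det = 1, so invertible mod p
Ginv = inv_mod_p(G, p)
check = [[sum(G[i][k] * Ginv[k][j] for k in range(3)) % p for j in range(3)]
         for i in range(3)]
print(check == [[1, 0, 0], [0, 1, 0], [0, 0, 1]])   # exact: G * Ginv = I mod p
```

Exactness, of course, does not remove the cubic cost at the master node, which is the other half of the objection raised above.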

To summarize, the finite field embedding technique suffers from major dynamic range limitations

and associated computational issues, and cannot be relied upon to support computations over the reals.

BIBLIOGRAPHY

Repository of Erasure coding for distributed matrix multiplication for matrices with bounded
entries (2019). [Online] Available: https://bitbucket.org/kkonstantinidis/stragglermitmm/src/master.

Repository of Polynomial Code for prior implementation (2017). [Online] Available:
https://github.com/AvestimehrResearchGroup/Polynomial-Code.

Ahlswede, R., Cai, N., Li, S.-Y., and Yeung, R. W. (2000). Network information flow. IEEE Trans.
on Inf. Theory, 46(4):1204–1216.

Alon, N., Moitra, A., and Sudakov, B. (2012). Nearly complete graphs decomposable into large
induced matchings and their applications. In Proc. of the 44-th Annual ACM symposium on
Theory of computing, pages 1079–1090.

Baranyai, Z. (1975). On the factorization of the complete uniform hypergraph. In Infinite and
finite sets (Colloq., Keszthely, 1973; dedicated to P. Erdos on his 60th birthday), pages 91–108.

Berrut, J.-P. and Trefethen, L. N. (2004). Barycentric Lagrange interpolation. SIAM Review,
46(3):501–517.

Blake, I. F. (1972). Codes over certain rings. Information and Control, 20(4):396–404.

Das, A. B. and Ramamoorthy, A. (2019). Distributed matrix-vector multiplication: A convolutional


coding approach. In IEEE Int. Symp. on Inf. Theory, pages 3022–3026.

Das, A. B., Ramamoorthy, A., and Vaswani, N. (2019). Random convolutional coding
for robust and straggler resilient distributed matrix computation. [Online] Available at:
https://arxiv.org/abs/1907.08064.

Das, A. B., Tang, L., and Ramamoorthy, A. (2018). C3 LES : Codes for coded computation that
leverage stragglers. In IEEE Inf. Th. Workshop, pages 1–5.

Dougherty, R., Freiling, C., and Zeger, K. (2005). Insufficiency of linear coding in network infor-
mation flow. IEEE Trans. on Inf. Theory, 51(8):2745–2759.

Dummit, D. S. and Foote, R. M. (2003). Abstract algebra. Wiley, 3rd Ed.

Dutta, S., Cadambe, V., and Grover, P. (2016). Short-dot: Computing large linear transforms
distributedly using coded short dot products. In Proc. of Adv. in Neural Inf. Proc. Sys., pages
2100–2108.

Dutta, S., Fahim, M., Haddadpour, F., Jeong, H., Cadambe, V., and Grover, P. (2019). On
the optimal recovery threshold of coded matrix multiplication. IEEE Trans. on Inf. Theory,
66(1):278–301.

Fahim, M. and Cadambe, V. R. (2019). Numerically stable polynomially coded computing. [Online]
Available at: https://arxiv.org/abs/1903.08326.

Fitzpatrick, M. E. (2011). 4K Sector Disk Drives: Transitioning to the Future with Advanced
Format Technologies. Toshiba 4K White Paper. [Online] Available: http://cdaweb01.storage.toshiba.com/docs/services-support-documents/toshiba_4kwhitepaper.pdf.

Gautschi, W. (1990). How (un)stable are Vandermonde systems? Asymptotic and Computational
Analysis (Lecture Notes in Pure and Applied Mathematics), 124:193–210.

Ghasemi, H. and Ramamoorthy, A. (2016). Further results on lower bounds for coded caching. In
IEEE Int. Symp. on Inf. Theory, pages 2319–2323.

Ghasemi, H. and Ramamoorthy, A. (2017a). Algorithms for asynchronous coded caching. In


Asilomar Conference on Signals, Systems, and Computers, pages 636–640.

Ghasemi, H. and Ramamoorthy, A. (2017b). Asynchronous coded caching. In IEEE Int. Symp. on
Inf. Theory, pages 2438–2442.

Ghasemi, H. and Ramamoorthy, A. (2017c). Improved lower bounds for coded caching. IEEE
Trans. on Inf. Theory, 63(7):4388–4413.

Graham, R. L., Knuth, D. E., and Patashnik, O. (1994). Concrete mathematics: a foundation for
computer science (2nd ed.). Addison-Wesley Professional.

Gray, R. M. (2006). Toeplitz and circulant matrices: A review. Foundations and Trends R in
Communications and Information Theory, 2(3):155–239.

Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms. SIAM:Society for Industrial
and Applied Mathematics.

Ho, T., Médard, M., Koetter, R., Karger, D. R., Effros, M., Shi, J., and Leong, B. (2006). A random
linear network coding approach to multicast. IEEE Trans. on Inf. Theory, 52(10):4413–4430.

Horn, R. A. and Johnson, C. R. (1991). Topics in matrix analysis. Cambridge University Press.

Huang, Q., Tang, L., He, S., Xiong, Z., and Wang, Z. (2014). Low-complexity encoding of quasi-cyclic
codes based on Galois Fourier transform. IEEE Trans. on Comm., 62(6):1757–1767.

Huang, S. and Ramamoorthy, A. (2013). An achievable region for the double unicast problem based
on a minimum cut analysis. IEEE Trans. on Comm., 61(7):2890–2899.

Huang, S. and Ramamoorthy, A. (2014). On the multiple unicast capacity of 3-source, 3-terminal
directed acyclic networks. IEEE/ACM Trans. Netw., 22(1):285–299.

Huang, S., Ramamoorthy, A., and Medard, M. (2011). Minimum cost mirror sites using network
coding: Replication versus coding at the source nodes. IEEE Trans. on Inf. Theory, 57(2):1080–
1091.

Ji, M., Tulino, A., Llorca, J., and Caire, G. (2015a). Caching-aided coded multicasting with multiple
random requests. In IEEE Inf. Th. Workshop, pages 1–5.

Ji, M., Wong, M. F., Tulino, A. M., Llorca, J., Caire, G., Effros, M., and Langberg, M. (2015b). On
the fundamental limits of caching in combination networks. In IEEE 16th International Workshop
on Signal Processing Advances in Wireless Communications (SPAWC), pages 695–699.

Konstantinidis, K. and Ramamoorthy, A. (2018). Leveraging coding techniques for speeding up


distributed computing. In Proc. of IEEE Global Communications Conference, pages 1–6.

Konstantinidis, K. and Ramamoorthy, A. (2019). CAMR: Coded aggregated MapReduce. In IEEE


Int. Symp. on Inf. Theory, pages 1427–1431.

Konstantinidis, K. and Ramamoorthy, A. (2020 (to appear)). Resolvable designs for speeding up
distributed computing. IEEE/ACM Trans. Netw.

Langberg, M. and Ramamoorthy, A. (2009). Communicating the sum of sources in a 3-sources/3-


terminals network. In IEEE Int. Symp. on Inf. Theory, pages 2121–2125.

Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., and Ramchandran, K. (2018). Speeding up
distributed machine learning using codes. IEEE Trans. on Inf. Theory, 64(3):1514–1529.

Lee, K., Suh, C., and Ramchandran, K. (2017). High-dimensional coded matrix multiplication. In
IEEE Int. Symp. on Inf. Theory, pages 2418–2422.

Li, S., Maddah-Ali, M. A., Yu, Q., and Avestimehr, A. S. (2017). A fundamental tradeoff be-
tween computation and communication in distributed computing. IEEE Trans. on Inf. Theory,
64(1):109–128.

Li, S.-Y., Yeung, R. W., and Cai, N. (2003). Linear network coding. IEEE Trans. on Inf. Theory,
49(2):371–381.

Lin, S. and Costello, D. J. (2004). Error Control Coding, 2nd Ed. Prentice Hall.

Maddah-Ali, M. A. and Niesen, U. (2014a). Decentralized coded caching attains order-optimal


memory-rate tradeoff. IEEE/ACM Trans. Netw., 23(4):1029–1040.

Maddah-Ali, M. A. and Niesen, U. (2014b). Fundamental limits of caching. IEEE Trans. on Info.
Theory, 60(5):2856–2867.

Mallick, A., Chaudhari, M., Sheth, U., Palanikumar, G., and Joshi, G. (2019). Rateless codes for
near-perfect load balancing in distributed matrix-vector multiplication. Proceedings of the ACM
on Measurement and Analysis of Computing Systems, 3(3):1–40.

Ngai, C. K. and Yeung, R. W. (2004). Network coding gain of combination networks. In IEEE Inf.
Th. Workshop, pages 283–287.

Olmez, O. and Ramamoorthy, A. (2016). Fractional repetition codes with flexible repair from
combinatorial designs. IEEE Trans. on Inf. Theory, 62(4):1565 –1591.

Pan, V. (2016). How bad are Vandermonde matrices? SIAM Journal on Matrix Analysis and
Applications, 37(2):676–694.

Pan, V. Y. (2013). Polynomial Evaluation and Interpolation: Fast and Stable Approximate Solution.
Citeseer.

Rai, B. K. and Dey, B. K. (2012). On network coding for sum-networks. IEEE Trans. on Inf.
Theory, 58(1):50–63.

Ramamoorthy, A., Das, A. B., and Tang, L. (2020). Straggler-resistant distributed matrix com-
putation via coding theory: Removing a bottleneck in large-scale data processing. IEEE Signal
Processing Magazine, 37(3):136–145.

Ramamoorthy, A. and Langberg, M. (2013). Communicating the sum of sources over a network.
IEEE Journal on Selected Areas in Communications, 31(4):655–665.

Ramamoorthy, A. and Tang, L. (2019). Numerically stable coded matrix computations via circulant
and rotation matrix embeddings. [Online] Available at: https://arxiv.org/abs/1910.06515.

Ramamoorthy, A., Tang, L., and Vontobel, P. O. (2019). Universally decodable matrices for
distributed matrix-vector multiplication. In IEEE Int. Symp. on Inf. Theory, pages 1777–1781.

Roth, R. M. (2006). Introduction to Coding Theory. Cambridge University Press.

Shangguan, C., Zhang, Y., and Ge, G. (2018). Centralized coded caching schemes: A hypergraph
theoretical approach. IEEE Trans. on Inf. Theory, 64(8):5755–5766.

Shanmugam, K., Ji, M., Tulino, A. M., Llorca, J., and Dimakis, A. G. (2014). Finite length analysis
of caching-aided coded multicasting. In 52nd Annual Allerton Conference on Communication,
Control, and Computing, pages 914–920.

Shanmugam, K., Ji, M., Tulino, A. M., Llorca, J., and Dimakis., A. G. (2016). Finite-length
analysis of caching-aided coded multicasting. IEEE Trans. on Inf. Theory, 62(10):5524–5537.

Shanmugam, K., Tulino, A. M., and Dimakis, A. G. (2017). Coded caching with linear subpack-
etization is possible using Ruzsa-Szeméredi graphs. In IEEE Int. Symp. on Inf. Theory, pages
1237–1241.

Stinson, D. R. (2003). Combinatorial Designs: Construction and Analysis. Springer.

Subramaniam, A. M., Heidarzadeh, A., and Narayanan, K. R. (2019). Random Khatri-Rao-product


codes for numerically-stable distributed matrix multiplication. In Annual Allerton Conference
on Communication, Control, and Computing (Allerton), pages 253–259.

Tang, L. (2020). GitHub repository for computing matrix inverse over prime order finite field.
[Online] Available: https://github.com/litangsky/inverseoverfield.

Tang, L., Huang, Q., Wang, Z., and Xiong, Z. (2013). Low-complexity encoding of binary quasi-cyclic
codes based on Galois Fourier transform. In IEEE Int. Symp. on Inf. Theory, pages 131–135.

Tang, L., Konstantinidis, K., and Ramamoorthy, A. (2019). Erasure coding for distributed matrix
multiplication for matrices with bounded entries. IEEE Comm. Lett., 23(1):8–11.

Tang, L. and Ramamoorthy, A. (2016a). Coded caching for networks with the resolvability property.
In IEEE Int. Symp. on Inf. Theory, pages 420–424.

Tang, L. and Ramamoorthy, A. (2016b). Coded caching with low subpacketization levels. In
Workshop on Network Coding (NetCod), pages 1–6.

Tang, L. and Ramamoorthy, A. (2017). Low subpacketization schemes for coded caching. In IEEE
Int. Symp. on Inf. Theory, pages 2790–2794.

Tang, L. and Ramamoorthy, A. (2018). Coded caching schemes with reduced subpacketization
from linear block codes. IEEE Trans. on Inf. Theory, 64(4):3099–3120.

Tripathy, A. S. and Ramamoorthy, A. (2015). Capacity of sum-networks for different message


alphabets. In IEEE Int. Symp. on Inf. Theory, pages 606–610.

Tripathy, A. S. and Ramamoorthy, A. (2017). Sum-networks from incidence structures: construc-


tion and capacity analysis. IEEE Trans. on Inf. Theory, 64(5):3461–3480.

Wang, S., Liu, J., and Shroff, N. B. (2018). Coded sparse matrix multiplication. In Proc. 35th Int.
Conf. on Mach. Learning, pages 5139–5147.

Yagle, A. E. (1995). Fast algorithms for matrix multiplication using pseudo-number-theoretic


transforms. IEEE Trans. on Sig. Proc., 43(1):71–76.

Yan, Q., Cheng, M., Tang, X., and Chen, Q. (2017a). On the placement delivery array design for
centralized coded caching scheme. IEEE Trans. on Inf. Theory, 63(9):5821–5833.

Yan, Q., Tang, X., Chen, Q., and Cheng, M. (2017b). Placement delivery array design through
strong edge coloring of bipartite graphs. IEEE Comm. Lett., 22(2):236–239.

Yu, Q., Maddah-Ali, M. A., and Avestimehr, A. S. (2017). Polynomial codes: an optimal design for
high-dimensional coded matrix multiplication. In Proc. of Adv. in Neural Inf. Proc. Sys., pages
4403–4413.

Yu, Q., Maddah-Ali, M. A., and Avestimehr, A. S. (2018). Characterizing the rate-memory tradeoff
in cache networks within a factor of 2. IEEE Trans. on Inf. Theory, 65(1):647–663.

Yu, Q., Maddah-Ali, M. A., and Avestimehr, A. S. (2020). Straggler mitigation in distributed
matrix multiplication: Fundamental limits and optimal coding. IEEE Trans. on Inf. Theory,
66(3):1920–1933.

Zhang, M., Tang, L., Huang, Q., and Wang, Z. (2014). Low complexity encoding algorithm of
RS-based QC-LDPC codes. In IEEE Inf. Th. Workshop, pages 1–4.

APPENDIX A. SUPPLEMENT FOR CODED CACHING SCHEMES WITH


REDUCED SUBPACKETIZATION FROM LINEAR BLOCK CODES

Resolvable design over Z mod q

Lemma 5. A (n, k) linear block code over Z mod q with generator matrix G = [gab ] can construct

a resolvable block design by the procedure in Section 3.2.1 if gcd(q, g0b , g1b , · · · , g(k−1)b ) = 1 for

0 ≤ b < n.

Proof. Assume q = q1 × q2 × · · · × qd where qi , 1 ≤ i ≤ d is a prime or a prime power. If the

gcd(q, g0b , g1b , · · · , g(k−1)b ) = 1, then it is evident that gcd(qi , g0b , g1b , · · · , g(k−1)b ) = 1 for 1 ≤ i ≤ d.

As qi is either a prime or a prime power, it follows that there exists a ga∗ b which is relatively prime

to qi , i.e., ga∗ b is a unit in the ring Z mod qi .

Note that for ∆ = [∆0 ∆1 . . . ∆n−1] = uG, we have

\[
\Delta_b = \sum_{a=0}^{k-1} u_a g_{ab}, \tag{A.1}
\]

where u = [u0, · · · , uk−1]. We consider eq. (A.1) over the ring Z mod qi and rewrite it as

\[
\Delta_b - u_{a^*} g_{a^* b} = \sum_{a \neq a^*} u_a g_{ab}.
\]

For arbitrary ua, a ≠ a∗, this equation has a unique solution for ua∗ since ga∗ b is a unit in Z

mod qi. This implies that there are q_i^{k−1} distinct solutions for (A.1) over Z mod qi. Using the

Chinese remainder theorem, eq. (A.1) has q_1^{k−1} × q_2^{k−1} × · · · × q_d^{k−1} = q^{k−1} solutions over Z mod q

and the result follows.

Remark 12. From Lemma 5, it can be easily verified that a linear block code over Z mod q can

construct a resolvable block design if one of the following conditions for each column gi of the

generator matrix is satisfied.



• At least one non-zero entry of gi is a unit in Z mod q, or

• all non-zero entries in gi are zero divisors but their greatest common divisor is 1.

For the SPC code over Z mod q, all the non-zero entries in the generator matrix are 1, which is a

unit. Therefore, the construction always results in a resolvable design.
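The counting argument in Lemma 5 can be verified exhaustively for small parameters. The sketch below (toy values q = 6, k = 3, chosen here for illustration) checks, for the SPC generator [I_k | 1] over Z mod q, that each value of ∆_b = Σ_a u_a g_{ab} mod q is attained by exactly q^{k−1} messages u in every position b:

```python
# Sketch (toy parameters, assumed small enough for exhaustive enumeration):
# verify Lemma 5's counting argument for the (k+1, k) SPC code over Z mod q.
# Each column b of the generator satisfies gcd(q, g_{0b}, ..., g_{(k-1)b}) = 1,
# so every value of Delta_b = sum_a u_a * g_{ab} mod q is attained by exactly
# q^(k-1) messages u -- the property behind the resolvable design.
from itertools import product
from collections import Counter

q, k = 6, 3                       # q = 2 * 3 is composite; k message symbols
n = k + 1
G = [[1 if a == b else 0 for b in range(k)] + [1] for a in range(k)]  # [I_k | 1]

for b in range(n):
    counts = Counter(sum(u[a] * G[a][b] for a in range(k)) % q
                     for u in product(range(q), repeat=k))
    assert all(c == q ** (k - 1) for c in counts.values()), (b, counts)
print("every value in every position occurs", q ** (k - 1), "times")
```

The same check works for any generator whose columns satisfy the gcd condition of Lemma 5, not just the SPC code.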

Proof of Lemma 2

First, we show that the proposed delivery scheme allows each user’s demand to be satisfied.

Note that Claim 2 shows that each user in a parallel class that belongs to the recovery set Sa

recovers all missing subfiles with a specified superscript from it. Thus, we only need to show that if

signals are generated (according to Claim 2) for each recovery set, we are done. This is equivalent

to showing that the bipartite recovery set graph is such that each parallel class has degree z and

multiple edges between nodes are disallowed.

Towards this end, consider the parallel class Pj; we claim that there exist exactly z solutions

(aα , bα ) for integer values of α = 1, . . . , z to the equation

aα (k + 1) + bα = j + n(α − 1) (A.2)

such that aα1 ≠ aα2 for α1 ≠ α2 and j < n. The existence of the solution for each equation above

follows from the division algorithm. Note that aα < nz/(k + 1) as the RHS < nz. Furthermore,

note that for 1 ≤ α1 ≤ z and 1 ≤ α2 ≤ z, we cannot have solutions to eq. (A.2) such that aα1 = aα2

as this would imply that |bα1 − bα2 | ≥ n which is a contradiction. This shows that each parallel

class Pj participates in at least z different recovery sets.

The following facts follow easily from the construction of the recovery sets. The degree of each

recovery set in the bipartite graph is k + 1 and there are nz/(k + 1) of them; multiple edges between a

recovery set and a parallel class are disallowed. Therefore, the total number of edges in the bipartite

graph is nz. As each parallel class participates in at least z recovery sets, by this argument, it

participates in exactly z recovery sets.
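The degree argument is easy to verify numerically. In the sketch below (toy parameters, and with eq. (A.2) solved by the division algorithm, i.e., a_α = ⌊(j + n(α − 1))/(k + 1)⌋, an assumption made here for illustration), each parallel class P_j meets exactly z distinct recovery sets:

```python
# Sketch: count, for each parallel class P_j, the distinct recovery-set
# indices a_alpha = (j + n*(alpha - 1)) // (k + 1) for alpha = 1..z, per the
# division-algorithm solution of eq. (A.2) (an illustrative reading of the
# proof, with toy parameters).  The proof asserts every class has degree z.
def recovery_degrees(n, k, z):
    assert (n * z) % (k + 1) == 0         # z chosen so that k+1 divides nz
    degrees = []
    for j in range(n):
        sets = {(j + n * (alpha - 1)) // (k + 1) for alpha in range(1, z + 1)}
        degrees.append(len(sets))
    return degrees

# Example: n = 7, k = 2, z = 3 (k + 1 = 3 divides nz = 21).
print(recovery_degrees(7, 2, 3))          # every parallel class has degree 3
```

Two distinct α values can never give the same a_α, since that would force the remainders b_α to differ by at least n while both are below k + 1, matching the contradiction used in the proof.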



Finally, we calculate the rate of the delivery phase. In total, the server transmits q^k (q − 1) zn/(k + 1)

equations, where each transmitted symbol has the size of a subfile. Thus, the rate is

\[
R = q^k (q-1) \frac{zn}{k+1} \cdot \frac{1}{q^k z} = \frac{(q-1)n}{k+1}.
\]

Proof of Claim 4

The matrix GS is shown below.

\[
G_S = \begin{bmatrix}
g_0 & g_1 & \cdots & g_{\lceil k/2\rceil-1} & 0 & \cdots & 0 & 0\\
0 & g_0 & g_1 & \cdots & g_{\lceil k/2\rceil-2} & 0 & \cdots & 0\\
\vdots & & \ddots & & & & & \vdots\\
0 & \cdots & 0 & g_0 & g_1 & 0 & \cdots & 0\\
0 & \cdots & \cdots & 0 & g_0 & g_{n-k} & 0 & \cdots\\
0 & \cdots & \cdots & 0 & 0 & g_{n-k-1} & g_{n-k} & \cdots\\
\vdots & & & & & & \ddots & \vdots\\
0 & \cdots & \cdots & 0 & 0 & g_{n-k-\lfloor k/2\rfloor} & \cdots & g_{n-k}
\end{bmatrix}.
\]

In the above expression and the subsequent discussion if i is such that i < 0, we set gi = 0.

By Claim 3, a cyclic code with generator matrix G satisfies the CCP if all submatrices

GS\(a+j)n =[g(a)n , g(a+1)n , · · · , g(a+j−1)n ,

g(a+j+1)n , · · · , g(a+k)n ],

where a = n − ⌊k/2⌋ − 1, 0 ≤ j ≤ k, have full rank. In what follows, we argue that this is true. Note

that in the generator matrix of cyclic code, any k consecutive columns are linearly independent

Lin and Costello (2004). Therefore for j = 0 and k, GS\(a+j)n has full rank, without needing the

conditions of Claim 4. For 0 < j ≤ ⌊k/2⌋, GS\(a+j)n is as below.


\[
G_{S\setminus(a+j)_n} = \begin{bmatrix}
g_0 & g_1 & \cdots & g_{\lceil k/2\rceil-1} & 0 & \cdots & & & & 0\\
0 & g_0 & \cdots & g_{\lceil k/2\rceil-2} & 0 & \cdots & & & & 0\\
\vdots & & \ddots & & & & & & & \vdots\\
0 & \cdots & g_0 & g_1 & 0 & \cdots & & & & 0\\
0 & \cdots & 0 & g_0 & g_{n-k} & 0 & \cdots & & & 0\\
0 & \cdots & \cdots & 0 & g_{n-k-1} & g_{n-k} & 0 & \cdots & & 0\\
\vdots & & & & \vdots & & \ddots & & & \vdots\\
0 & \cdots & \cdots & 0 & g_{n-k-j+1} & \cdots & g_{n-k} & 0 & \cdots & 0\\
0 & \cdots & \cdots & 0 & g_{n-k-j} & \cdots & g_{n-k-1} & 0 & \cdots & 0\\
0 & \cdots & \cdots & 0 & g_{n-k-j-1} & \cdots & g_{n-k-2} & g_{n-k} & \cdots & 0\\
\vdots & & & & \vdots & & & & \ddots & \vdots\\
0 & \cdots & \cdots & 0 & g_{n-k-\lfloor k/2\rfloor} & \cdots & g_{n-k-\lfloor k/2\rfloor+j-1} & g_{n-k-\lfloor k/2\rfloor+j+1} & \cdots & g_{n-k}
\end{bmatrix}
\]

Rewriting GS\(a+j)n in block form, we get


 
 Aj Bj 
 
GS\(a+j)n =
 Cj 0 ,

0 Dj Ej

where

\[
A_j = \begin{bmatrix}
g_0 & g_1 & \cdots & g_{\lceil k/2\rceil-1}\\
0 & g_0 & \cdots & g_{\lceil k/2\rceil-2}\\
\vdots & & \ddots & \vdots\\
0 & \cdots & 0 & g_0
\end{bmatrix},
\]

\[
C_j = \begin{bmatrix}
g_{n-k-1} & g_{n-k} & 0 & \cdots & \cdots & 0\\
g_{n-k-2} & g_{n-k-1} & g_{n-k} & 0 & \cdots & 0\\
\vdots & & & \ddots & & \vdots\\
g_{n-k-j+1} & \cdots & \cdots & \cdots & \cdots & g_{n-k}\\
g_{n-k-j} & \cdots & \cdots & \cdots & \cdots & g_{n-k-1}
\end{bmatrix},
\]

and

\[
E_j = \begin{bmatrix}
g_{n-k} & 0 & \cdots & 0\\
g_{n-k-1} & g_{n-k} & \cdots & 0\\
\vdots & & \ddots & \vdots\\
g_{n-k-\lfloor k/2\rfloor+j+1} & \cdots & \cdots & g_{n-k}
\end{bmatrix}.
\]

Matrices Aj and Ej have full rank as they are respectively upper triangular and lower triangular,

with non-zero entries on the diagonal (as g0 and gn−k are non-zero in a cyclic code). Therefore,

GS\(a+j)n has full rank if Cj has full rank. For ⌊k/2⌋ < j < k, GS\(a+j)n can be partitioned into a

similar form and the result in Claim 4 follows.
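The key fact invoked above, that any k consecutive columns of a cyclic code's generator matrix are linearly independent (Lin and Costello (2004)), is easy to check numerically. The sketch below does so for the binary (7, 4) cyclic code with generator polynomial g(x) = 1 + x + x^3 (an illustrative choice of code, not one used elsewhere in this chapter), including windows that wrap around cyclically:

```python
# Sketch (assumption: binary (7,4) cyclic code, g(x) = 1 + x + x^3): check
# that every set of k cyclically consecutive columns of the generator matrix
# has full rank over GF(2), the fact quoted from Lin and Costello (2004).
def rank_gf2(M):
    M = [row[:] for row in M]             # Gaussian elimination over GF(2)
    r = 0
    for c in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][c]:
                M[i] = [a ^ b for a, b in zip(M[i], M[r])]
        r += 1
    return r

n, k, g = 7, 4, [1, 1, 0, 1]              # coefficients of g(x) = 1 + x + x^3
G = [[0] * i + g + [0] * (n - len(g) - i) for i in range(k)]  # shifted rows
for s in range(n):                         # every cyclic window of k columns
    cols = [(s + j) % n for j in range(k)]
    sub = [[G[r][c] for c in cols] for r in range(k)]
    assert rank_gf2(sub) == k
print("all", n, "windows of", k, "consecutive columns are full rank")
```

This is equivalent to the standard statement that a cyclic (n, k) code has no nonzero codeword confined to a cyclic burst of length n − k, so any k consecutive positions form an information set.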

Proof of Claim 6

We need to argue that all k × k submatrices of GSa where 0 ≤ a < α are full rank. In what

follows we argue that all k × k submatrices of GS0 are full rank. The proof for any GSa is similar.

Note that GS0 can be written compactly as follows by using Kronecker products.
 
A ⊗ It
GS0 =  ,
 
B ⊗ C(t−1)×t (1, 1)

where

\[
A = \begin{bmatrix}
b_{00} & \cdots & b_{0(z-1)}\\
\vdots & & \vdots\\
b_{(z-2)0} & \cdots & b_{(z-2)(z-1)}
\end{bmatrix}
\quad \text{and} \quad
B = [\, b_{(z-1)0}, \cdots, b_{(z-1)(z-1)} \,].
\]

Next, we check the determinant of the submatrices GS0\i obtained by deleting the i-th column of GS0.

W.l.o.g., we let i = (z − 1)t + j where 0 ≤ j < t. The block form of the resultant matrix GS0\i can

be expressed as

\[
G_{S_0 \setminus i} = \begin{bmatrix}
A' \otimes I_t & A'' \otimes \Delta_1\\
B' \otimes C_{(t-1)\times t}(1,1) & B'' \otimes \Delta_2
\end{bmatrix},
\]

where A′ and A″ are the first z − 1 columns and the last column of A, respectively. Likewise, B′ and B″

are the first z − 1 components and the last component of B. The matrices ∆1 and ∆2 are obtained by

deleting the j-th column of It and C(t−1)×t(1, 1), respectively. Then, using the Schur determinant

identity Horn and Johnson (1991), we have

\begin{align*}
\det(G_{S_0\setminus i}) &= \det(A' \otimes I_t)\,\det\big(B'' \otimes \Delta_2 - B' \otimes C_{(t-1)\times t}(1,1) \cdot (A' \otimes I_t)^{-1} \cdot A'' \otimes \Delta_1\big)\\
&\overset{(1)}{=} \det(A' \otimes I_t)\,\det\big(B'' \otimes \Delta_2 - B' A'^{-1} A'' \otimes C_{(t-1)\times t}(1,1)\Delta_1\big)\\
&\overset{(2)}{=} \det(A' \otimes I_t)\,\det\big((B'' - B' A'^{-1} A'') \otimes \Delta_2\big),
\end{align*}

where (1) holds by the properties of the Kronecker product Horn and Johnson (1991) and (2) holds

since C(t−1)×t(1, 1)∆1 = ∆2. Next note that det(∆2) ≠ 0. This is because ∆2 can be denoted as
 
\[
\Delta_2 = \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix},
\]

where A is a j × j upper-triangular matrix;


 
1 1 0 ··· 0 0
 
0
 1 1 ··· 0 0 
. .. 
 
A =  .. .
 
 
0
 0 0 ··· 1 1 
 
0 0 0 ··· 0 1

and B is a (t − 1 − j) × (t − 1 − j) lower-triangular matrix;


 
1 0 · · · 0 0 0
 
1 1 · · · 0 0 0
 
. .. 
 
B =  .. . .
 
 
0 0 · · · 1 1 0
 
 
0 0 ··· 0 1 1

Next, we define the matrix


 
\[
F = \begin{bmatrix}
b_{00} & b_{01} & \cdots & b_{0(z-1)}\\
b_{10} & b_{11} & \cdots & b_{1(z-1)}\\
\vdots & & & \vdots\\
b_{(z-1)0} & b_{(z-1)1} & \cdots & b_{(z-1)(z-1)}
\end{bmatrix}.
\]

Another application of the Schur determinant identity yields

\[
\det(B'' - B' A'^{-1} A'') = \frac{\det(F)}{\det(A')} \neq 0,
\]

since det(F) and det(A′) are both non-zero, as their columns have the Vandermonde form. In F,

the columns correspond to distinct and non-zero elements from GF(q); therefore, q > z. Note,

however, that the above discussion focused only on GS0. As the argument needs to apply for all GSa

where 0 ≤ a < α, we need q > α.

Proof of Claim 7

Note that the matrix in eq. (3.4) is the generator matrix of (n, k) linear block code over GF (q)

where nz = (z + 1)(k + 1). Since z and z + 1 are coprime, z is the least positive integer such that

k + 1 | nz. To show G satisfies the CCP, we need to argue that all k × k submatrices of GSa where

0 ≤ a ≤ z are full rank. It is easy to check that Sa = {0, · · · , n − 1} \ {t(z − a), t(z − a) + 1, · · · , t(z −

a) + t − 1}. We verify the three types of the matrix GSa as follows: I. a = 0; II. a = 1; III. a > 1.

• Type I

When a = 0, it is easy to verify that any k × k submatrix of GS0 has full rank since GS0 has

the form [Ik×k |1k ], which is the generator matrix of the SPC code.

• Type II

 
It×t 0t×t ··· 0t×t b1 It×t

 0t×t It×t ··· 0t×t b2 It×t 

 .. .. 
GS1

= . . 
 (A.3)
0t×t 0t×t ··· It×t bz−1 It×t
 
 
 0(t−1)×t 0(t−1)×t · · · 0(t−1)×t C(c1 , c2 )(t−1)×t
 

| {z } | {z }
Case 1 Case 2

When a = 1, GS1 has the form in eq. (A.3),

Case 1: Suppose that we delete any of first (z − 1)t columns in GS1 (this set of columns is

depicted by the underbrace in eq. (A.3)), say i-th column of GS1 , where z1 t ≤ i < (z1 + 1)t

and 0 ≤ z1 ≤ z − 2. Let i1 = i − z1 t, i2 = (z1 + 1)t − i − 1. The resultant matrix GS1 \i can

be expressed as follows.
 
 A C 
GS1 \i =  ,
B D

where

A = Ii×i ,

B = 0(k−i)×i ,
 
 0t×(k−t−i) b1 It×t 
 
 0 b2 It×t 
 t×(k−t−i) 
.. ..
 
C= ,
 
 . . 
 
 0 bz1 It×t 
 t×(k−t−i) 
 
0i1 ×(k−t−i) bz1 +1 Ii1 ×i1 0i1 ×(i2 +1)

and D has the form in eq. (A.4).

Note that if z1 = 0, C = [0i1 ×(k−t−i) b1 Ii1 ×i1 0i1 ×(i2 +1) ] and if z1 = z − 2,
 
 01×i2 01×i1 bz−1 01×i2 
 
D =  Ii2 ×i2
 0i2 ×(i1 +1) bz−1 Ii2 ×i2  .
 
0(t−1)×i2 C(c1 , c2 )(t−1)×t

 
01×i2 01×t ··· 01×t 01×i1 bz1 +1 01×i2
Ii2 ×i2 0i2 ×t ··· 0i2 ×t 0i2 ×(i1 +1) bz1 +1 Ii2 ×i2
 
 

D= .. .. 
(A.4)
 . . 

0t×i2 0t×t ··· It×t bz−1 It×t
 
 
0(t−1)×i2 0(t−1)×t ··· 0(t−1)×t C(c1 , c2 )(t−1)×t

To verify GS1 \i has full rank, we just need to check D has full rank (as A is full rank).

Checking that D has full rank can be further simplified as follows. As bz1+1 ≠ 0, we can move

the corresponding column that has bz1 +1 as its first entry so that it is the first column of D.

Following this, consider C(c1 , c2 )(t−1)×t \ ci1 which is obtained by deleting the i1 -th column

of C(c1 , c2 )(t−1)×t .
 
D1 D2 
C(c1 , c2 )(t−1)×t \ ci1 =  ,
D3 D4

where D1 is a i1 × i1 matrix as follows


 
c1 c2 0 0 ··· 0 0
 
0 c c
 1 2 0 ··· 0 0 
. .. 
 
D1 =  .. . ,
 
 
0 0 0 ··· 0 c1 c2 
 
 
0 0 0 ··· 0 0 c1

D4 is a i2 × i2 matrix as follows
 
c2 0 0 ··· 0 0 0
 
c c
 1 2 0 ··· 0 0 0 
. .. 
 
D4 =  .. . ,
 
 
0 0 ··· 0 c1 c2 0
 
 
0 0 ··· 0 0 c1 c2

and D2 and D3 are i1 × i2 and i2 × i1 all-zero matrices, respectively. Then det(C(c1, c2)(t−1)×t \

ci1) = c1^{i1} c2^{i2} and det(GS1\i) = ±bz1+1 c1^{i1} c2^{i2} ≠ 0.



··· ···
 
It×t 0t×t 0t×t 0t×t 0t×(t−1) 1t b1 It×t
 .. .. .. .. .. .. .. 

 . . . . . . . 

 0t×t
 ··· It×t 0t×t ··· 0t×t 0t×(t−1) 1t bz−a It×t 

 0
 t×t ··· 0t×t 0t×t ··· 0t×t 0t×(t−1) 1t bz−a+1 It×t 

GSa =  0t×t ··· 0t×t It×t ··· 0t×t 0t×(t−1) 1t bz−a+2 It×t
 

 .. .. .. .. .. .. .. 

 . . . . . . . 

 0t×t ··· 0t×t 0t×t ··· It×t 0t×(t−1) 1t bz−1 It×t
 

 
 0(t−1)×t ··· 0(t−1)×t 0(t−1)×t · · · 0(t−1)×(t−1) I(t−1)×(t−1) 1t−1 C(c1 , c2 )(t−1)×t 
| {z } | {z } | {z } | {z } | {z }
Case 1 Case 2 Case 3 Case 4 Case 5
(A.5)

Case 2: By deleting any of last t columns in GS1 , say i-th column of GS1 , where (z − 1)t ≤

i < zt, the block form of resultant matrix GS1 \i can be expressed as follows.
 
 A C 
GS1 \i =  ,
B D

where A = I(z−1)t×(z−1)t , B = 0(t−1)×(z−1)t , C is obtained by deleting the (i − (z − 1)t)-th

column of the matrix [b1 It×t b2 It×t · · · bz−1 It×t]^T and D is C(c1, c2)(t−1)×t \ c_{i−(z−1)t}. Since

det(D) = c1^{i−(z−1)t} c2^{zt−i−1} ≠ 0, det(GS1\i) = ±c1^{i−(z−1)t} c2^{zt−i−1} ≠ 0 and therefore GS1\i has

full rank.

• Type III when a > 1, GSa has the form in eq. (A.5). As before we perform a case analysis.

Each of the cases is specified by the corresponding underbrace in eq. (A.5).

Case 1: By deleting the i-th column of GSa , where z1 t ≤ i < (z1 + 1)t, z1 ≤ z − a − 1,

i1 = i − z1 t, and i2 = (z1 + 1)t − i − 1, the block form of the resultant matrix GSa \i can be

expressed as follows,

 
 A C 
GSa \i =  ,
B D

···

0t×i2 0t×t 0t×(t−1) 1t b1 It×t

 .. .. 
C=
 . . 
 (A.6)
 0t×i2 0t×t · · · 0t×(t−1) 1t bz1 It×t 
0i1 ×i2 0i1 ×t · · · 0i1 ×(t−1) 1i1 bz1 +1 Ii1 ×i1 0i1 ×(i2 +1)

 
01×i2 01×t ··· ··· ··· 01×(t−1) 1 01×i1 bz1 +1 01×i2

 Ii2 ×i2 0i2 ×t ··· ··· ··· 0i2 ×(t−1) 1i2 0i2 ×(i1 +1) bz1 +1 Ii2 ×i2 

0t×i2 It×t ··· ··· ··· 0t×(t−1) 1t bz1 +2 It×t
 
 
 .. .. 
. .
 
D=
 
0t×i2 0t×t ··· 0t×t ··· 0t×(t−1) 1t bz−a+1 It×t

 
 
 0t×i2 0t×t ··· It×t ··· 0t×(t−1) 1t bz−a+2 It×t 
.. ..
 
 
 . . 
0(t−1)×i2 0(t−1)×t · · · 0(t−1)×t · · · I(t−1)×(t−1) 1t−1 C(c1 , c2 )(t−1)×t
(A.7)

where A = Ii×i , B = 0(k−i)×i , C and D has the form in eq. (A.6) and eq. (A.7), respectively.

Note that if z1 = 0, C = [0i1 ×i2 0i1 ×t · · · 0i1 ×(t−1) 1i1 b1 Ii1 ×i1 0i1 ×(i2 +1) ] and if

z1 = z − a − 1, D has the form in (A.8).

 
01×i2 01×t ··· ··· ··· 01×(t−1) 1 01×i1 bz−a 01×i2

 Ii2 ×i2 0i2 ×t ··· ··· ··· 0i2 ×(t−1) 1i2 0i2 ×(i1 +1) bz−a Ii2 ×i2 

0t×i2 0t×t ··· 0t×t ··· 0t×(t−1) 1t bz−a+1 It×t
 
 
D=
 0t×i2 0t×t ··· It×t ··· 0t×(t−1) 1t bz−a+2 It×t
.

.. ..
 
. .
 
 
0(t−1)×i2 0(t−1)×t · · · 0(t−1)×t · · · I(t−1)×(t−1) 1t−1 C(c1 , c2 )(t−1)×t
(A.8)

To verify that GSa \i has full rank, we just need to check D has full rank. Owing to the

construction of D, we have to check the determinant of the following (t + 1) × (t + 1) matrix.


 
\[
F = \begin{bmatrix}
1 & 0 & 0 & \cdots & b_{z_1+1} & \cdots & 0\\
1 & b_{z-a+1} & 0 & \cdots & 0 & \cdots & 0\\
1 & 0 & b_{z-a+1} & \cdots & 0 & \cdots & 0\\
\vdots & & & \ddots & & & \vdots\\
1 & 0 & 0 & \cdots & 0 & \cdots & b_{z-a+1}
\end{bmatrix};
\]

det(F) = (bz−a+1 − bz1+1) b_{z−a+1}^{t−1}. Since z1 ≠ z − a and hence bz1+1 ≠ bz−a+1, the above matrix

has full rank and det(GSa\i) = ±(bz−a+1 − bz1+1) b_{z−a+1}^{t−1} ≠ 0, so that GSa\i has full rank.

Case 2: By deleting i-th column of GSa , where z1 t ≤ i < (z1 + 1)t, z − a ≤ z1 ≤ z − 3, the

proof that the resultant matrix has full rank is similar to the case that z1 ≤ z − a − 1 and we

omit it here.

Case 3: By deleting i-th column of GSa , where (z − 2)t ≤ i ≤ (z − 1)t − 2, i1 = i − (z − 2)t

and i2 = (z − 1)t − 2 − i, the resultant matrix is as follows,


 
 A C 
GSa \i =  ,
B D
where

A = I(z−a)t×(z−a)t

B = 0(k−(z−a)t)×(z−a)t
 
0t×t · · · 0t×(t−2) 1t b1 It×t 
 . ..
.

C=  . . 

 
0t×t · · · 0t×(t−2) 1t bz−a It×t
 
0t×t ··· 0t×(t−2) 1t bz−a+1 It×t
 
 It×t ··· 0t×(t−2) 1t bz−a+2 It×t 
 
 .. .. 
. .
D=
 


 Ii1 ×i1 0i1 ×i2 

···
 
 0(t−1)×t 01×i1 01×i2 1t−1 C(c1 , c2 )(t−1)×t 
0i2 ×i1 Ii2 ×i2

 
It×t ··· 0t×t 0t×t ··· 0t×(t−1) b1 It×t
 .. .. 

 . . 

 0
 t×t ··· It×t 0t×t ··· 0t×(t−1) bz−a It×t 

GSa \i =  0t×t ··· 0t×t 0t×t ··· 0t×(t−1) bz−a+1 It×t (A.9)
 

 0t×t ··· 0t×t It×t ··· 0t×(t−1) bz−a+2 It×t
 

.. ..
 
. .
 
 
0(t−1)×t ··· 0(t−1)×t 0(t−1)×t · · · I(t−1)×(t−1) C(c1 , c2 )(t−1)×t

To verify GSa \i has full rank, we need to check the determinant of D. Owing to the construc-

tion of D, the following matrix is required to be full rank,


 
1t bz−a+1 It×t
D0 = 


1 C(c1 , c2 )(t−1)×t (i1 )
 1 bz−a+1 0 ··· 0 0 0 ··· 0 
1 0 bz−a+1 ··· 0 0 0 ··· 0
=  ... .. ,
 
.
1 0 0 ··· 0 0 0 ··· bz−a+1
1 0 ··· 0 c1 c2 0 ··· 0

where C(c1 , c2 )(t−1)×t (i1 ) denotes the i1 -th row of C(c1 , c2 )(t−1)×t , 0 ≤ i ≤ t − 2.

\[
\det D' = \det(b_{z-a+1} I_{t\times t}) \cdot \det\!\big(1 - C(c_1, c_2)_{(t-1)\times t}(i_1) \cdot (b_{z-a+1}^{-1} I_{t\times t}) \cdot \mathbf{1}_t\big)
= b_{z-a+1}^{t}\big(1 - b_{z-a+1}^{-1}(c_1 + c_2)\big).
\]

Since bz−a+1 ≠ 0 and c1 + c2 = 0, det D′ ≠ 0 and D′ has full rank. Then det(D) =

b_{z−a+1}^{t} (1 − b_{z−a+1}^{-1}(c1 + c2)) ≠ 0 and thus GSa\i is full rank.

Case 4: By deleting i-th column of GSa , where i = (z − 1)t − 1, the block form of the resultant

matrix GSa\i can be expressed as eq. (A.9). Evidently, det(GSa\i) = ±b_{z−a+1}^t, so that

GSa \i has full rank.

Case 5: By deleting the $i$-th column of $G_{S_a}$, where $(z-1)t \le i < zt$ and $i_1 = i-(z-1)t$, the block form of the resultant matrix $G_{S_a \setminus i}$ can be expressed as eq. (A.10),
$$G_{S_a \setminus i} = \begin{bmatrix}
I_{t\times t} & \cdots & 0_{t\times t} & 0_{t\times t} & \cdots & 0_{t\times(t-1)} & 1_t & b_1 I_{t\times t} \setminus c_{i_1} \\
\vdots & \ddots & & & & & & \vdots \\
0_{t\times t} & \cdots & I_{t\times t} & 0_{t\times t} & \cdots & 0_{t\times(t-1)} & 1_t & b_{z-a} I_{t\times t} \setminus c_{i_1} \\
0_{t\times t} & \cdots & 0_{t\times t} & 0_{t\times t} & \cdots & 0_{t\times(t-1)} & 1_t & b_{z-a+1} I_{t\times t} \setminus c_{i_1} \\
0_{t\times t} & \cdots & 0_{t\times t} & I_{t\times t} & \cdots & 0_{t\times(t-1)} & 1_t & b_{z-a+2} I_{t\times t} \setminus c_{i_1} \\
\vdots & & & & \ddots & & & \vdots \\
0_{(t-1)\times t} & \cdots & 0_{(t-1)\times t} & 0_{(t-1)\times t} & \cdots & I_{(t-1)\times(t-1)} & 1_{t-1} & C(c_1, c_2)_{(t-1)\times t} \setminus c_{i_1}
\end{bmatrix} \tag{A.10}$$
where $b_s I_{t\times t} \setminus c_{i_1}$ denotes the submatrix obtained by deleting the $i_1$-th column of $b_s I_{t\times t}$ and $C(c_1, c_2)_{(t-1)\times t} \setminus c_{i_1}$ denotes the submatrix obtained by deleting the $i_1$-th column of $C(c_1, c_2)_{(t-1)\times t}$. To verify that $G_{S_a \setminus i}$ has full rank, we just need to check that $[1_t \,|\, b_{z-a+1} I_{t\times t} \setminus c_{i_1}]$ has full rank. Since $[1_t \,|\, b_{z-a+1} I_{t\times t}]$ has the following form,
$$\begin{bmatrix}
1 & b_{z-a+1} & 0 & \cdots & 0 \\
1 & 0 & b_{z-a+1} & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
1 & 0 & 0 & \cdots & b_{z-a+1}
\end{bmatrix},$$
by deleting any column of the above matrix, it is obvious that $\det([1_t \,|\, b_{z-a+1} I_{t\times t} \setminus c_{i_1}]) = \pm b_{z-a+1}^{t-1}$ and $\det(G_{S_a \setminus i}) \neq 0$.

Proof of Claim 8

Let $z$ be the least integer such that $k+1 \mid nz$. First, we argue that $z$ is also the least integer such that $k+1 \mid (n+s(k+1))z$. Assume that this is not true; then there exists $z' < z$ such that $k+1 \mid (n+s(k+1))z'$. As $n \ge k+1$ and $k+1 \mid s(k+1)z'$, this implies that $k+1 \mid nz'$, which is a contradiction.
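The divisibility fact above is easy to sanity-check numerically; the following minimal sketch (the parameter ranges are arbitrary choices, not from the text) confirms that the least z with k+1 | nz is unchanged when n is replaced by n + s(k+1):

```python
def least_z(n: int, k: int) -> int:
    # Smallest positive z such that (k + 1) divides n * z.
    z = 1
    while (n * z) % (k + 1) != 0:
        z += 1
    return z

# Check the claim for a range of parameters with n >= k + 1.
for k in range(1, 8):
    for n in range(k + 1, 30):
        for s in range(1, 4):
            assert least_z(n, k) == least_z(n + s * (k + 1), k)
print("least z is unchanged by n -> n + s(k+1)")
```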

Next we argue that $G'$ satisfies the CCP, i.e., all $k \times k$ submatrices of each $G'_{S'_a}$, where $T'_a = \{a(k+1), \cdots, a(k+1)+k\}$ and $S'_a = \{(t)_{n+s(k+1)} \mid t \in T'_a\}$ and $0 \le a \le \frac{nz}{k+1} + sz$, are full rank. Let $n' = n + s(k+1)$. We argue it in three cases.

• Case 1. The first column of $G'_{S'_a}$ lies in the first $s(k+1)$ columns of $G'$. Suppose $ln' \le a(k+1) < ln' + s(k+1)$ where $0 \le l < z-1$. By the construction of $G'$, $G'_{S'_a} = D$. Since $D = G_{S_0}$, all $k \times k$ submatrices of $D$ have full rank and so does $G'_{S'_a}$.

• Case 2. The first and last columns of $G'_{S'_a}$ lie in the last $n$ columns of $G'$. Suppose $ln' + s(k+1) \le a(k+1)$ and $a(k+1)+k < (l+1)n'$ where $0 \le l < z-1$. As $n' > s(k+1)$, $a(k+1) - (l+1)s(k+1) > 0$ and $k+1 \mid a(k+1) - (l+1)s(k+1)$. Let $a' = a - (l+1)s$; then $G'_{S'_a} = G_{S_{a'}}$ and hence all $k \times k$ submatrices of $G'_{S'_a}$ have full rank.

• Case 3. The first column of $G'_{S'_a}$ lies in the last $n$ columns of $G'$ but the last column lies in the first $(k+1)$ columns of $G'$. Suppose $ln' + s(k+1) \le a(k+1)$ and $a(k+1)+k > (l+1)n'$ where $0 \le l < z-2$. Again, we can get $k+1 \mid a(k+1) - (l+1)s(k+1)$, and let $a' = a - (l+1)s$. Let $S'^1 = \{(a(k+1))_{n'}, \cdots, (ln'+n'-1)_{n'}\}$ and $S^1 = \{(a'(k+1))_n, \cdots, (ln+n-1)_n\}$. As $(ln'+n'-1) - a(k+1) = (ln+n-1) - a'(k+1)$, we have $G'_{S'^1} = G_{S^1}$. Let $S'^2 = \{(ln'+n')_{n'}, \cdots, (a(k+1)+k)_{n'}\}$ and $S^2 = \{(ln+n)_n, \cdots, (a'(k+1)+k)_n\}$. By the construction of $G'$, $G_{S^2} = G'_{S'^2}$. Then $G'_{S'_a} = [G'_{S'^1}\ G'_{S'^2}] = [G_{S^1}\ G_{S^2}] = G_{S_{a'}}$ and hence all $k \times k$ submatrices of $G'_{S'_a}$ have full rank.

Proof of Claim 11

• $\frac{M}{N} = \frac{1}{q}$. We have
$$\frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = \frac{1}{K}\log_2 \binom{K}{K/q} - \frac{1}{K}\log_2 z - \frac{k}{K}\log_2 q.$$
Using the fact that $z \le k+1$ and taking limits as $n \to \infty$, we get that
$$\lim_{n\to\infty} \frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = H_2\!\left(\frac{1}{q}\right) - \frac{\eta}{q}\log_2 q.$$

• $\frac{M}{N} = 1 - \frac{k+1}{nq}$. We have
$$\frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = \frac{1}{K}\log_2 \binom{K}{k+1} - \frac{k+1}{K}\log_2 q - \frac{1}{K}\log_2 \frac{zn}{k+1}.$$
Using the fact that $z \le k+1$ and taking limits as $n \to \infty$, we get that
$$\lim_{n\to\infty} \frac{1}{K}\log_2 \frac{F_s^{MN}}{F_s^*} = H_2\!\left(\frac{\eta}{q}\right) - \frac{\eta}{q}\log_2 q.$$

Discussion on coded caching systems constructed by generator matrices


satisfying the (k, α)-CCP where α ≤ k

Consider the (k, α)-CCP (cf. Definition 6) where α ≤ k. Let $z$ be the least integer such that $\alpha \mid nz$, and let $T_a^\alpha = \{a\alpha, \cdots, a\alpha + \alpha - 1\}$ and $S_a^\alpha = \{(t)_n \mid t \in T_a^\alpha\}$. Let $G_{S_a^\alpha} = [g_{i_0}, \cdots, g_{i_{\alpha-1}}]$ be the submatrix of $G$ specified by the columns in $S_a^\alpha$, i.e., $g_{i_j} \in G_{S_a^\alpha}$ if $i_j \in S_a^\alpha$. We demonstrate that the resolvable design generated from a linear block code that satisfies the (k, α)-CCP can also be used in a coded caching scheme. First, we construct a $(X, \mathcal{A})$ resolvable design as described in Section 3.2.A., which can be partitioned into $n$ parallel classes $P_i = \{B_{i,j} : 0 \le j < q\}$, $0 \le i < n$. Using the constructed resolvable design, we partition each file $W_n$ into $q^k z$ subfiles $W_n = \{W_{n,t}^s \mid 0 \le t < q^k, 0 \le s < z\}$ and operate the placement scheme in Algorithm 2. In the delivery phase, for each recovery set, several equations are generated, each of which benefits α users simultaneously. Furthermore, the equations generated by all the recovery sets can recover all the missing subfiles. In this section, we only show that for the recovery set $P_{S_a^\alpha}$ it is possible to generate equations which benefit α users and allow the recovery of all the missing subfiles with a given superscript. The subsequent discussion exactly mirrors the discussion in the (k, k+1)-CCP case and is skipped.

Towards this end, we first show that picking α users from α distinct parallel classes can always form $q^{k-\alpha+1} - q^{k-\alpha}$ signals. More specifically, consider blocks $B_{i_1,l_{i_1}}, \ldots, B_{i_\alpha,l_{i_\alpha}}$ (where $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from α distinct parallel classes of $P_{S_a^\alpha}$. Then, $|\cap_{j=1}^{\alpha-1} B_{i_j,l_{i_j}}| = q^{k-\alpha+1}$ and $|\cap_{j=1}^{\alpha} B_{i_j,l_{i_j}}| = q^{k-\alpha}$.

Claim 13. Consider the resolvable design $(X, \mathcal{A})$ constructed by a $(n, k)$ linear block code that satisfies the (k, α)-CCP. Let $P_{S_a^\alpha} = \{P_i \mid i \in S_a^\alpha\}$ for $0 \le a < \frac{zn}{\alpha}$, i.e., it is the set of parallel classes corresponding to $S_a^\alpha$. We emphasize that $|P_{S_a^\alpha}| = \alpha \le k$. Consider blocks $B_{i_1,l_{i_1}}, \ldots, B_{i_{\alpha'},l_{i_{\alpha'}}}$ (where $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from any α′ distinct parallel classes of $P_{S_a^\alpha}$ where $\alpha' \le \alpha$. Then, $|\cap_{j=1}^{\alpha'} B_{i_j,l_{i_j}}| = q^{k-\alpha'}$.

The above argument implies that any α−1 blocks from any α−1 distinct parallel classes of $P_{S_a^\alpha}$ have $q^{k-\alpha+1}$ points in common and any α blocks $B_{i_1,l_{i_1}}, \ldots, B_{i_\alpha,l_{i_\alpha}}$ from any α distinct parallel classes of $P_{S_a^\alpha}$ have $q^{k-\alpha}$ points in common. These blocks (or users) can participate in $q^{k-\alpha+1} - q^{k-\alpha}$ equations, each of which benefits α users. In particular, each user will recover a missing subfile indexed by an element belonging to the intersection of the other α−1 blocks in each equation. A very similar argument to Lemma 2 can be made to justify that enough equations can be found to allow all users to recover all their missing subfiles.

Proof. Recall that by the construction in Section III.A, block $B_{i,l} \in P_i$ is specified as follows,
$$B_{i,l} = \{j : T_{i,j} = l\}.$$
Let $G = [g_{ab}]$, for $0 \le a < k$, $0 \le b < n$.

Now consider $B_{i_1,l_{i_1}}, \ldots, B_{i_{\alpha'},l_{i_{\alpha'}}}$ (where $i_j \in S_a^\alpha$, $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from α′ distinct parallel classes of $P_{S_a^\alpha}$. W.l.o.g. we assume that $i_1 < i_2 < \cdots < i_{\alpha'}$. Let $I = \{i_1, \ldots, i_{\alpha'}\}$ and let $T_I$ denote the submatrix of $T$ obtained by retaining the rows in $I$. We will show that the vector $[l_{i_1}\ l_{i_2}\ \ldots\ l_{i_{\alpha'}}]^T$ is a column in $T_I$ and appears $q^{k-\alpha'}$ times in it.

We note here that by the (k, α)-CCP, the vectors $g_{i_1}, g_{i_2}, \ldots, g_{i_\alpha}$ are linearly independent and thus the subset of these vectors $g_{i_1}, \cdots, g_{i_{\alpha'}}$ is linearly independent. W.l.o.g., we assume that the top $\alpha' \times \alpha'$ submatrix of the matrix $[g_{i_1}\ g_{i_2}\ \ldots\ g_{i_{\alpha'}}]$ is full-rank. Next, consider the system of equations in variables $u_0, \ldots, u_{\alpha'-1}$.


$$\begin{aligned}
\sum_{b=0}^{\alpha'-1} u_b g_{b i_1} &= l_{i_1} - \sum_{b=\alpha'}^{k-1} u_b g_{b i_1}, \\
\sum_{b=0}^{\alpha'-1} u_b g_{b i_2} &= l_{i_2} - \sum_{b=\alpha'}^{k-1} u_b g_{b i_2}, \\
&\ \,\vdots \\
\sum_{b=0}^{\alpha'-1} u_b g_{b i_{\alpha'}} &= l_{i_{\alpha'}} - \sum_{b=\alpha'}^{k-1} u_b g_{b i_{\alpha'}}.
\end{aligned}$$

By the assumed condition, it is evident that this system of α′ equations in α′ variables has a unique solution for a given vector $v = [u_{\alpha'}, \cdots, u_{k-1}]$ over $GF(q)$. Since there are $q^{k-\alpha'}$ possible $v$ vectors, the result follows.
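The counting in the proof can be illustrated concretely. The following sketch uses a hypothetical (3, 2) binary code (an assumed example, not one from the text) whose generator-matrix columns are pairwise independent, and checks that blocks from distinct parallel classes intersect in q^{k-2} points:

```python
import itertools

# Illustrative check of the intersection property for a small (n, k) = (3, 2)
# binary code. Any two columns of G are linearly independent, so any two
# blocks from distinct parallel classes should intersect in q^{k-2} = 1 point.
q, n, k = 2, 3, 2
G = [[1, 0, 1],
     [0, 1, 1]]

# T has n rows and q^k columns; column j is the codeword of the j-th message.
messages = list(itertools.product(range(q), repeat=k))
T = [[sum(u[a] * G[a][i] for a in range(k)) % q for u in messages]
     for i in range(n)]

# Parallel class P_i consists of the blocks B_{i,l} = {j : T[i][j] = l}.
blocks = {(i, l): {j for j in range(q ** k) if T[i][j] == l}
          for i in range(n) for l in range(q)}

for i1, i2 in itertools.combinations(range(n), 2):
    for l1 in range(q):
        for l2 in range(q):
            assert len(blocks[(i1, l1)] & blocks[(i2, l2)]) == q ** (k - 2)
print("all cross-class intersections have size", q ** (k - 2))
```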

As in the case of the (k, k + 1)-CCP, we form a recovery set bipartite graph with parallel

classes and recovery sets as the disjoint vertex subsets, and the edges incident on each parallel

class are labeled arbitrarily from 0 to z − 1. For a parallel class P ∈ PSaα we denote this label by

label(P − PSaα ). For a given recovery set PSaα , the delivery phase proceeds by choosing blocks from

α distinct parallel classes in PSaα and it provides q k−α+1 − q k−α equations that benefit α users.

Note that in the (k, α)-CCP case, randomly picking α blocks from α parallel classes in $P_{S_a^\alpha}$ will always result in an intersection of size $q^{k-\alpha}$, which is different from the (k, k+1)-CCP. It turns out that each equation allows a user in $P \in P_{S_a^\alpha}$ to recover a missing subfile with superscript label($P - P_{S_a^\alpha}$). Let the demand of user $U_{B_{i,j}}$ for $0 \le i \le n-1$, $0 \le j \le q-1$ be $W_{\kappa_{i,j}}$. We formalize the argument in Algorithm 6 and prove that the equations generated in each recovery set $P_{S_a^\alpha}$ can recover all missing subfiles with superscript label($P - P_{S_a^\alpha}$).

Algorithm 6: Signal Generation Algorithm for $P_{S_a^\alpha}$

Input : For $P \in P_{S_a^\alpha}$, $E(P) = \text{label}(P - P_{S_a^\alpha})$. Signal set Sig = ∅.
1 while any user $U_B \in P_j$, $j \in S_a^\alpha$ does not recover all its missing subfiles with superscript $E(P_j)$ do
2   Pick blocks $B_{j,l_j} \in P_j$ for all $j \in S_a^\alpha$ and $l_j \in \{0, \ldots, q-1\}$;
    /* Pick blocks from distinct parallel classes in $P_{S_a^\alpha}$. The cardinality of their intersection is always $q^{k-\alpha}$ */
3   Find the set $\hat{L}_s = \cap_{j \in S_a^\alpha \setminus \{s\}} B_{j,l_j} \setminus \cap_{j \in S_a^\alpha} B_{j,l_j}$ for $s \in S_a^\alpha$;
    /* Determine the missing subfile indices that the user from $P_s$ will recover. Note that $|\hat{L}_s| = q^{k-\alpha+1} - q^{k-\alpha}$ */
4   Add the signals $\oplus_{s \in S_a^\alpha} W^{E(P_s)}_{\kappa_{s,l_s},\, \hat{L}_s[t]}$, $0 \le t < q^{k-\alpha+1} - q^{k-\alpha}$, to Sig;
    /* User $U_{B_{s,l_s}}$ demands file $W_{\kappa_{s,l_s}}$. This equation allows it to recover the missing subfile indexed by $\hat{L}_s[t]$, the t-th element of $\hat{L}_s$. The superscript is determined by the recovery set bipartite graph */
5 end
Output: Signal set Sig.

For the sake of convenience, we argue that user $U_{B_{\beta,l_\beta}}$ that demands $W_{\kappa_{\beta,l_\beta}}$ can recover all its missing subfiles with superscript $E(P_\beta)$. Note that $|B_{\beta,l_\beta}| = q^{k-1}$. Thus user $U_{B_{\beta,l_\beta}}$ needs to obtain $q^k - q^{k-1}$ missing subfiles with superscript $E(P_\beta)$. The delivery phase scheme repeatedly picks α users from different parallel classes of $P_{S_a^\alpha}$. The equations in Algorithm 6 allow $U_{B_{\beta,l_\beta}}$ to recover all $W^{E(P_\beta)}_{\kappa_{\beta,l_\beta},\, \hat{L}_\beta[t]}$, where $\hat{L}_\beta = \cap_{j \in S_a^\alpha \setminus \{\beta\}} B_{j,l_j} \setminus \cap_{j \in S_a^\alpha} B_{j,l_j}$ and $t = 1, \cdots, q^{k-\alpha+1} - q^{k-\alpha}$. This is because of Claim 13.

Next, we count the number of equations that $U_{B_{\beta,l_\beta}}$ participates in. We can pick α−1 users from the other α−1 parallel classes in $P_{S_a^\alpha}$. There are $q^{\alpha-1}$ ways to pick them, each of which generates $q^{k-\alpha+1} - q^{k-\alpha}$ equations. Thus there are a total of $q^k - q^{k-1}$ equations in which user $U_{B_{\beta,l_\beta}}$ participates.

It remains to argue that each equation provides a distinct file part of user $U_{B_{\beta,l_\beta}}$. Towards this end, let $\{i_1, \cdots, i_{\alpha-1}\} \subset S_a^\alpha$ be an index set such that $\beta \notin \{i_1, \cdots, i_{\alpha-1}\}$ but $\beta \in S_a^\alpha$. Note that when we pick the same set of blocks $\{B_{i_1,l_{i_1}}, \cdots, B_{i_{\alpha-1},l_{i_{\alpha-1}}}\}$, it is impossible that the recovered subfiles $W^{E(P_\beta)}_{\kappa_{\beta,l_\beta},\, \hat{L}_\beta[t_1]}$ and $W^{E(P_\beta)}_{\kappa_{\beta,l_\beta},\, \hat{L}_\beta[t_2]}$ are the same, since the points in $\hat{L}_\beta$ are distinct. Next, suppose that there exist sets of blocks $\{B_{i_1,l_{i_1}}, \cdots, B_{i_{\alpha-1},l_{i_{\alpha-1}}}\}$ and $\{B_{i_1,l'_{i_1}}, \cdots, B_{i_{\alpha-1},l'_{i_{\alpha-1}}}\}$ such that $\{B_{i_1,l_{i_1}}, \cdots, B_{i_{\alpha-1},l_{i_{\alpha-1}}}\} \neq \{B_{i_1,l'_{i_1}}, \cdots, B_{i_{\alpha-1},l'_{i_{\alpha-1}}}\}$, but $\gamma \in \cap_{j=1}^{\alpha-1} B_{i_j,l_{i_j}} \setminus B_{\beta,l_\beta}$ and $\gamma \in \cap_{j=1}^{\alpha-1} B_{i_j,l'_{i_j}} \setminus B_{\beta,l'_\beta}$. This is a contradiction, since it in turn implies that $\gamma \in \left(\cap_{j=1}^{\alpha-1} B_{i_j,l_{i_j}}\right) \cap \left(\cap_{j=1}^{\alpha-1} B_{i_j,l'_{i_j}}\right)$, which is impossible since two distinct blocks from the same parallel class have an empty intersection.

Finally, we calculate the transmission rate. In Algorithm 6, for each recovery set, we transmit $q^{k+1} - q^k$ equations, and there are $\frac{zn}{\alpha}$ recovery sets in total. Since each equation has size equal to a subfile, the rate is given by
$$R = (q^{k+1} - q^k) \times \frac{zn}{\alpha} \times \frac{1}{zq^k} = \frac{n(q-1)}{\alpha}.$$
The $(n, k)$ linear block codes that satisfy the (k, α)-CCP over $GF(q)$ correspond to a coded caching system with $K = nq$, $\frac{M}{N} = \frac{1}{q}$, $F_s = zq^k$ and have a rate $R = \frac{n(q-1)}{\alpha}$. Thus, the rate of this system is a little higher compared to the (k, k+1)-CCP system with almost the same subpacketization level.
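As a concrete illustration of these system parameters (the code (n, k) = (7, 4) with α = 4 over GF(2) is an assumed example, not one from the text):

```python
# Illustrative computation of the coded caching parameters for a hypothetical
# (n, k) = (7, 4) code over GF(2) used with the (k, alpha)-CCP, alpha = 4.
n, k, q, alpha = 7, 4, 2, 4

# z is the least integer such that alpha divides n * z.
z = next(z for z in range(1, alpha + 1) if (n * z) % alpha == 0)

K = n * q                # number of users
F_s = z * q ** k         # subpacketization level
R = n * (q - 1) / alpha  # transmission rate

print(K, F_s, R)  # → 14 64 1.75
```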

However, by comparing Definitions 5 and 6, it is evident that the rank constraints of the (k, α)-CCP are weaker as compared to those of the (k, k+1)-CCP. Therefore, in general we can find more instances of generator matrices that satisfy the (k, α)-CCP. For example, a large class of codes that satisfy the (k, k)-CCP are $(n, k)$ cyclic codes, since any $k$ consecutive columns in their generator matrices are linearly independent Lin and Costello (2004). Thus, $(n, k)$ cyclic codes always satisfy the (k, k)-CCP but satisfy the (k, k+1)-CCP if they satisfy the additional constraints discussed in Claim 3.

Cyclic codes over Z mod q Blake (1972)

First, we show that the matrix $T$ constructed by the approach outlined in Section 3.3.4 still results in a resolvable design. Let $\Delta = [\Delta_0\ \Delta_1\ \cdots\ \Delta_{n-1}]$ be a codeword of the cyclic code over $\mathbb{Z} \bmod q$, denoted $C$, where $q = q_1 q_2 \cdots q_d$ and the $q_i$, $i = 1, \ldots, d$, are prime. By using the Chinese remaindering map ψ (discussed in Section 3.3.4), Δ can be uniquely mapped into $d$ codewords $c^{(i)}$, $i = 1, \ldots, d$, where each $c^{(i)}$ is a codeword of $C^i$ (the cyclic code over $GF(q_i)$). Thus, the $b$-th component $\Delta_b$ can be mapped to $(c_b^{(1)}, c_b^{(2)}, \ldots, c_b^{(d)})$.

Let $G_i = [g_{ab}^{(i)}]$ represent the generator matrix of the code $C^i$. Based on prior arguments, it is evident that there are $q_i^{k_i-1}$ distinct solutions over $GF(q_i)$ to the equation $\sum_{a=0}^{k_i-1} u_a g_{ab}^{(i)} = c_b^{(i)}$. In turn, this implies that $\Delta_b$ appears $q_1^{k_1-1} q_2^{k_2-1} \cdots q_d^{k_d-1}$ times in the $b$-th row of $T$ and the result follows.

Next we show that any α blocks from distinct parallel classes of $P_{S_a^{k_{min}}}$ have $q_1^{k_1-\alpha} q_2^{k_2-\alpha} \cdots q_d^{k_d-\alpha}$ points in common, where $\alpha \le k_{min}$ and $S_a^{k_{min}} = \{(ak_{min})_n, (ak_{min}+1)_n, \cdots, (ak_{min}+k_{min}-1)_n\}$. Towards this end, consider $B_{i_1,l_{i_1}}, \ldots, B_{i_\alpha,l_{i_\alpha}}$ (where $i_j \in S_a^{k_{min}}$, $l_{i_j} \in \{0, \ldots, q-1\}$) that are picked from α distinct parallel classes of $P_{S_a^{k_{min}}}$. W.l.o.g. we assume that $i_1 < i_2 < \cdots < i_\alpha$. Let $I = \{i_1, \ldots, i_\alpha\}$ and let $T_I$ denote the submatrix of $T$ obtained by retaining the rows in $I$. We will show that the vector $[l_{i_1}\ l_{i_2}\ \ldots\ l_{i_\alpha}]^T$ is a column in $T_I$ and appears $q_1^{k_1-\alpha} q_2^{k_2-\alpha} \cdots q_d^{k_d-\alpha}$ times.

Let $\psi_m(l_{i_j})$ for $m = 1, \ldots, d$ represent the $m$-th component of the map ψ. Consider the $(n, k_1)$ cyclic code over $GF(q_1)$ and the system of equations in variables $u_0, \ldots, u_{\alpha-1}$ that lie in $GF(q_1)$.
$$\begin{aligned}
\sum_{b=0}^{\alpha-1} u_b g_{b i_1}^{(1)} &= \psi_1(l_{i_1}) - \sum_{b=\alpha}^{k_1-1} u_b g_{b i_1}^{(1)}, \\
\sum_{b=0}^{\alpha-1} u_b g_{b i_2}^{(1)} &= \psi_1(l_{i_2}) - \sum_{b=\alpha}^{k_1-1} u_b g_{b i_2}^{(1)}, \\
&\ \,\vdots \\
\sum_{b=0}^{\alpha-1} u_b g_{b i_\alpha}^{(1)} &= \psi_1(l_{i_\alpha}) - \sum_{b=\alpha}^{k_1-1} u_b g_{b i_\alpha}^{(1)}.
\end{aligned}$$
By arguments identical to those made in Claim 13, it can be seen that this system of equations has $q_1^{k_1-\alpha}$ solutions. Applying the same argument to the other cyclic codes, we conclude that the vector $[l_{i_1}, l_{i_2}, \cdots, l_{i_\alpha}]$ appears $q_1^{k_1-\alpha} q_2^{k_2-\alpha} \cdots q_d^{k_d-\alpha}$ times in $T_I$ and the result follows.
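The Chinese remaindering map ψ used here is a bijection between Z_q and the product of the prime residue rings; a minimal sketch with an assumed modulus q = 15:

```python
# Minimal sketch of the Chinese remaindering map psi for an illustrative
# modulus q = q1 * q2 = 15 (the specific primes are assumptions).
q1, q2 = 3, 5
q = q1 * q2

def psi(x):
    # Map Z_q -> Z_{q1} x Z_{q2}.
    return (x % q1, x % q2)

# psi is a bijection, so a symbol Delta_b determines (c_b^(1), c_b^(2))
# uniquely and vice versa.
images = {psi(x) for x in range(q)}
assert len(images) == q
print("psi is a bijection on Z_%d" % q)
```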

APPENDIX B. SUPPLEMENT FOR NUMERICALLY STABLE CODED


MATRIX COMPUTATIONS VIA CIRCULANT AND ROTATION MATRIX
EMBEDDINGS

Examples of 4 × 4 circulant permutation matrices

Example 20. For m = 4, the four possible circulant permutation matrices are
$$P = \begin{bmatrix} 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 1&0&0&0 \end{bmatrix}, \quad
P^0 = I_4 = \begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{bmatrix},$$
$$P^2 = \begin{bmatrix} 0&0&1&0 \\ 0&0&0&1 \\ 1&0&0&0 \\ 0&1&0&0 \end{bmatrix}, \quad
P^3 = \begin{bmatrix} 0&0&0&1 \\ 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \end{bmatrix}.$$
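Example 20 can be reproduced programmatically; the helper below (a sketch, with an assumed constructor name) builds P^i for any m and checks the identities from the example:

```python
import numpy as np

def circ_perm(m: int, i: int) -> np.ndarray:
    # P^i for the m x m circulant permutation matrix P of Example 20:
    # row r of P^i has a 1 in column (r + i) mod m.
    M = np.zeros((m, m), dtype=int)
    for r in range(m):
        M[r, (r + i) % m] = 1
    return M

P = circ_perm(4, 1)
assert np.array_equal(circ_perm(4, 0), np.eye(4, dtype=int))
assert np.array_equal(P @ P, circ_perm(4, 2))
assert np.array_equal(np.linalg.matrix_power(P, 4), np.eye(4, dtype=int))
print("P^0..P^3 behave as in Example 20")
```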

Proof of Claim 12

Proof. Note that Algorithm 5 is applied for recovering the corresponding entries of $A_{i,j}^T x$ for $i \in [k_A]$, $j \in [\tilde{q}]$ separately. There are $r/(k_A(\tilde{q}-1))$ such entries. The complexity of computing an $N$-point FFT is $O(N \log N)$ in terms of the required floating point operations (flops). Computing the permutation does not cost any flops and its complexity is negligible as compared to the other steps. Step 1 of Algorithm 5 therefore has complexity $O(k_A \tilde{q} \log \tilde{q})$. In Step 2, we solve the degree-$(k_A - 1)$ polynomial interpolation $(\tilde{q}-1)$ times. This takes $O((\tilde{q}-1) k_A \log^2 k_A)$ time Pan (2013). Finally, Step 3 requires applying the inverse permutation and the inverse FFT; this requires $O(k_A \tilde{q} \log \tilde{q})$ operations. Therefore, the overall complexity is given by
$$\frac{r}{k_A(\tilde{q}-1)} \left( O(k_A \tilde{q} \log \tilde{q}) + O((\tilde{q}-1) k_A \log^2 k_A) \right) \approx O(r(\log \tilde{q} + \log^2 k_A)).$$
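The FFT steps counted above are what make circulant systems cheap to handle; the following numpy sketch (an illustration, not the decoder of Algorithm 5) solves a circulant system with two FFTs and a pointwise division:

```python
import numpy as np

# A circulant system C x = b (C built from its first column c) diagonalizes
# under the DFT, so it can be solved with two FFTs and a pointwise division.
m = 8
# Diagonally dominant first column, so all eigenvalues fft(c) are nonzero.
c = np.array([4.0, 1.0, 0.5, 0.25, 0.1, 0.0, 0.0, 0.0])
b = np.arange(1.0, m + 1.0)

# Build the circulant matrix explicitly for comparison: C[i, j] = c[(i-j) % m].
C = np.column_stack([np.roll(c, j) for j in range(m)])

x_fft = np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)).real
x_direct = np.linalg.solve(C, b)
assert np.allclose(x_fft, x_direct)
print("FFT-based solve matches direct solve")
```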

Proof of Theorem 5

Proof. The arguments are conceptually similar to the proof of Theorem 4. Suppose that the workers indexed by $i_0, \ldots, i_{k_A-1}$ complete their tasks. The corresponding block columns of $G_{circ}$ can be extracted to form
$$\tilde{G} = \begin{bmatrix}
I & I & \cdots & I \\
P^{i_0} & P^{i_1} & \cdots & P^{i_{k_A-1}} \\
\vdots & \vdots & \ddots & \vdots \\
P^{i_0(k_A-1)} & P^{i_1(k_A-1)} & \cdots & P^{i_{k_A-1}(k_A-1)}
\end{bmatrix}.$$
As in the proof of Theorem 4, we can equivalently analyze the decoding by considering the system of equations
$$m\tilde{G} = c,$$
where $m, c \in \mathbb{R}^{1 \times k_A \tilde{q}}$ are row-vectors such that
$$m = [m_0, \cdots, m_{k_A-1}] = [m_{\langle 0,0 \rangle}, \cdots, m_{\langle 0,\tilde{q}-1 \rangle}, \cdots, m_{\langle k_A-1,0 \rangle}, \cdots, m_{\langle k_A-1,\tilde{q}-1 \rangle}], \text{ and}$$
$$c = [c_{i_0}, \cdots, c_{i_{k_A-1}}] = [c_{\langle i_0,0 \rangle}, \cdots, c_{\langle i_0,\tilde{q}-1 \rangle}, \cdots, c_{\langle i_{k_A-1},0 \rangle}, \cdots, c_{\langle i_{k_A-1},\tilde{q}-1 \rangle}].$$

Note that not all variables in $m$ are independent owing to (5.7). Let $m^F$ and $c^F$ denote the $\tilde{q}$-point "block-Fourier" transforms of these vectors, i.e.,
$$m^F = m \begin{bmatrix} W & & \\ & \ddots & \\ & & W \end{bmatrix} \quad \text{and} \quad c^F = c \begin{bmatrix} W & & \\ & \ddots & \\ & & W \end{bmatrix},$$
where $W$ is the $\tilde{q}$-point DFT matrix. Let $\tilde{G}_{k,l} = P^{i_l k}$ denote the $(k, l)$-th block of $\tilde{G}$. Using the fact that $P$ can be diagonalized by the DFT matrix $W$, we have
$$\tilde{G}_{k,l} = W \,\mathrm{diag}(1, \omega_{\tilde{q}}^{i_l k}, \omega_{\tilde{q}}^{2 i_l k}, \ldots, \omega_{\tilde{q}}^{(\tilde{q}-1) i_l k})\, W^*.$$
Let $\tilde{G}_{k,l}^F = \mathrm{diag}(1, \omega_{\tilde{q}}^{i_l k}, \omega_{\tilde{q}}^{2 i_l k}, \ldots, \omega_{\tilde{q}}^{(\tilde{q}-1) i_l k})$, and let $\tilde{G}^F$ represent the $k_A \times k_A$ block matrix with $\tilde{G}_{k,l}^F$ for $k, l = 0, \ldots, k_A-1$ as its blocks. Therefore, the system of equations $m\tilde{G} = c$ can be further expressed as
$$m \begin{bmatrix} W & & \\ & \ddots & \\ & & W \end{bmatrix} \begin{bmatrix} W & & \\ & \ddots & \\ & & W \end{bmatrix}^* \tilde{G} \begin{bmatrix} W & & \\ & \ddots & \\ & & W \end{bmatrix} = c \begin{bmatrix} W & & \\ & \ddots & \\ & & W \end{bmatrix}$$
$$\implies [m_0^F, \cdots, m_{k_A-1}^F]\, \tilde{G}^F = [c_{i_0}^F, \cdots, c_{i_{k_A-1}}^F]$$
upon right multiplication by the matrix $\mathrm{diag}(W, \ldots, W)$. Next, we note that as each block within $\tilde{G}^F$ has a diagonal structure, we can rewrite the system of equations as a block diagonal matrix upon applying an appropriate permutation (cf. Claim 14 in Appendix B). Thus, we can rewrite it as
$$[m_0^{F,\pi}, \cdots, m_{\tilde{q}-1}^{F,\pi}]\, \tilde{G}_d^F = [c_0^{F,\pi}, \cdots, c_{\tilde{q}-1}^{F,\pi}], \tag{B.1}$$
where the permutation π is such that $m_j^{F,\pi} = [m_{0,j}^F\ m_{1,j}^F\ \ldots\ m_{k_A-1,j}^F]$ and likewise $c_j^{F,\pi} = [c_{i_0,j}^F\ c_{i_1,j}^F\ \ldots\ c_{i_{k_A-1},j}^F]$. Furthermore, $\tilde{G}_d^F$ is a block-diagonal matrix where each block is of size $k_A \times k_A$. Now, according to (5.7), we have $m_{i,0}^F = \sum_{j=0}^{\tilde{q}-1} m_{i,j} = 0$ for $i = 0, \ldots, k_A-1$, which implies that $m_0^{F,\pi}$ is a $1 \times k_A$ zero row-vector and thus $c_0^{F,\pi}$ is too.

In what follows, we show that each of the other diagonal blocks of $\tilde{G}_d^F$ is non-singular. This means that $[m_0^F, \cdots, m_{k_A-1}^F]$, and consequently $m$, can be determined by solving the system of equations in (B.1). Towards this end, we note that the $k$-th diagonal block ($1 \le k \le \tilde{q}-1$) of $\tilde{G}_d^F$, denoted by $\tilde{G}_d^F[k]$, can be expressed as follows.
$$\tilde{G}_d^F[k] = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
\omega_{\tilde{q}}^{i_0 k} & \omega_{\tilde{q}}^{i_1 k} & \cdots & \omega_{\tilde{q}}^{i_{k_A-1} k} \\
\vdots & \vdots & \ddots & \vdots \\
\omega_{\tilde{q}}^{(k_A-1) i_0 k} & \omega_{\tilde{q}}^{(k_A-1) i_1 k} & \cdots & \omega_{\tilde{q}}^{(k_A-1) i_{k_A-1} k}
\end{bmatrix} \tag{B.2}$$
The above matrix is a complex Vandermonde matrix with parameters $\omega_{\tilde{q}}^{i_0 k}, \ldots, \omega_{\tilde{q}}^{i_{k_A-1} k}$. Thus, as long as these parameters are distinct, $\tilde{G}_d^F[k]$ will be non-singular. Note that we need the property to hold for $k = 1, \ldots, \tilde{q}-1$. This condition can be expressed as
$$(i_\alpha - i_\beta)k \not\equiv 0 \pmod{\tilde{q}},$$
for $i_\alpha, i_\beta \in \{0, \ldots, n-1\}$ and $1 \le k \le \tilde{q}-1$. A necessary and sufficient condition for this to hold is that $\tilde{q}$ is prime. An application of Theorem 3 shows that $\kappa(\tilde{G}_d^F[k]) \le O(\tilde{q}^{\tilde{q}-k_A+6})$ for all $k$. As decoding $m$ is equivalent to solving the systems of equations specified by $\tilde{G}_d^F[k]$ for $1 \le k \le \tilde{q}-1$, the worst case condition number is at most $O(\tilde{q}^{\tilde{q}-k_A+6})$.

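The diagonalization of P by the DFT matrix, on which the proof relies, can be checked numerically; the sketch below (sizes are illustrative) verifies P^i = W diag(1, ω^i, ..., ω^{(q̃-1)i}) W* for a small q̃:

```python
import numpy as np

# Numerical check of the key identity used above: powers of the q~ x q~
# cyclic-shift matrix P are diagonalized by the (unitary) DFT matrix.
qt = 7
w = np.exp(2j * np.pi / qt)
W = np.array([[w ** (r * k) for k in range(qt)] for r in range(qt)]) / np.sqrt(qt)

P = np.zeros((qt, qt))
for r in range(qt):
    P[r, (r + 1) % qt] = 1.0  # (P x)_r = x_{r+1 mod q~}

for i in range(qt):
    D = np.diag([w ** (i * k) for k in range(qt)])
    assert np.allclose(np.linalg.matrix_power(P, i), W @ D @ W.conj().T)
print("P^i = W D_i W* verified for q~ =", qt)
```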

Vandermonde Matrix condition number analysis

Let $V$ be an $m \times m$ Vandermonde matrix with parameters $s_0, s_1, \ldots, s_{m-1}$. We are interested in upper bounding $\kappa(V)$. Let $s_+ = \max_{i=0}^{m-1} |s_i|$. Then, it is known that $\|V\| \le m \max(1, s_+^{m-1})$ Pan (2016). Finding an upper bound on $\|V^{-1}\|$ is more complicated and we discuss this in detail below. Towards this end we need the definition of a Cauchy matrix.
below. Towards this end we need the definition of a Cauchy matrix.

Definition 12. An $m \times m$ Cauchy matrix is specified by parameters $s = [s_0\ s_1\ \ldots\ s_{m-1}]$ and $t = [t_0\ t_1\ \ldots\ t_{m-1}]$, such that its $(i, j)$-th entry is
$$C_{s,t}(i, j) = \frac{1}{s_i - t_j} \quad \text{for } i \in [m],\ j \in [m].$$

In what follows, we establish an upper bound on the condition number of Vandermonde matrices

with parameters on the unit circle.

Theorem 8. Consider an $m \times m$ Vandermonde matrix $V$, where $m < q$ ($q$ is odd), with distinct parameters $\{s_0, s_1, \ldots, s_{m-1}\} \subset \{1, \omega_q, \omega_q^2, \ldots, \omega_q^{q-1}\}$. Then,
$$\kappa(V) \le O(q^{q-m+6}).$$

Proof. Recall that $\omega_q = e^{i2\pi/q}$ and $\omega_m = e^{i2\pi/m}$, and define $t_j = f\omega_m^j$, $j = 0, \ldots, m-1$, where $f$ is a complex number with $|f| = 1$. We let $C_{s,f}$ denote the Cauchy matrix with parameters $\{s_0, \ldots, s_{m-1}\}$ and $\{t_0, \ldots, t_{m-1}\}$. Let $W$ be the $m$-point DFT matrix. The work of Pan (2016) shows that
$$V^{-1} = \mathrm{diag}(f^{m-1-j})_{j=0}^{m-1}\, W^*\, \mathrm{diag}(\omega_m^{-j})_{j=0}^{m-1}\, C_{s,f}^{-1}\, \mathrm{diag}\!\left(\frac{1}{s_j^m - f^m}\right)_{j=0}^{m-1}.$$
It can be seen that the matrix $\mathrm{diag}(f^{m-1-j})_{j=0}^{m-1}\, W^*\, \mathrm{diag}(\omega_m^{-j})_{j=0}^{m-1}$ is unitary. Therefore,
$$\begin{aligned}
\|V^{-1}\| &= \left\|C_{s,f}^{-1}\, \mathrm{diag}\!\left(\frac{1}{s_j^m - f^m}\right)_{j=0}^{m-1}\right\| \\
&\le \|C_{s,f}^{-1}\| \times \frac{1}{\min_{i=0}^{m-1} |s_i^m - f^m|} \\
&\le m \times \left(\max_{i',j'} |C_{s,f}^{-1}(i', j')|\right) \times \frac{1}{\min_{i=0}^{m-1} |s_i^m - f^m|},
\end{aligned} \tag{B.3}$$

where the first inequality holds as the norm of a product of matrices is upper bounded by the product of the individual norms, and the second inequality holds since for any $M$ we have $\|M\| \le \|M\|_F$.

In what follows, we upper bound the RHS of (B.3). Let $s(x)$ denote the function $s(x) = \Pi_{i=0}^{m-1}(x - s_i)$. The $(i', j')$-th entry of $C_{s,f}^{-1}$ can be expressed as Pan (2016)
$$C_{s,f}^{-1}(i', j') = (-1)^m s(t_{j'})(s_{i'}^m - f^m)/(s_{i'} - t_{j'}), \text{ so that}$$
$$\begin{aligned}
|C_{s,f}^{-1}(i', j')| &= |s(t_{j'})|\,|s_{i'}^m - f^m| / |s_{i'} - t_{j'}| \\
&\le |s(t_{j'})|(|s_{i'}^m| + |f^m|) / |s_{i'} - t_{j'}| \\
&= 2|s(t_{j'})| / |s_{i'} - t_{j'}| \quad (\text{since } |s_{i'}| = |f| = 1).
\end{aligned}$$

Let $M = \{1, \omega_q, \omega_q^2, \ldots, \omega_q^{q-1}\} \setminus \{s_0, s_1, \ldots, s_{m-1}\}$ denote the $q$-th roots of unity that are not parameters of $V$. Note that
$$s(t_{j'}) = \Pi_{i=0}^{m-1}(t_{j'} - s_i) = \left.\frac{x^q - 1}{\Pi_{\alpha_j \in M}(x - \alpha_j)}\right|_{x = t_{j'}}, \text{ so that}$$
$$|s(t_{j'})| = \frac{|t_{j'}^q - 1|}{\Pi_{\alpha_j \in M} |t_{j'} - \alpha_j|} \le \frac{2}{\Pi_{\alpha_j \in M} |t_{j'} - \alpha_j|} \quad (\text{since } |t_{j'}| = 1 \text{ and by the triangle inequality}).$$

Thus, we can conclude that
$$|C_{s,f}^{-1}(i', j')| \le 4 \max_{i',j'} \frac{1}{\Pi_{\alpha_j \in M} |t_{j'} - \alpha_j|} \cdot \frac{1}{|s_{i'} - t_{j'}|} \tag{B.4}$$
$$= 4 \left( \frac{1}{\min_{i',j'} \Pi_{\alpha_j \in M} |t_{j'} - \alpha_j|\, |s_{i'} - t_{j'}|} \right). \tag{B.5}$$

Note that in the expression above, $s_{i'}$ is a parameter of $V$ while the $\alpha_j$'s are the points within $\Omega_q = \{1, \omega_q, \omega_q^2, \ldots, \omega_q^{q-1}\}$ that are "not" parameters of $V$. We choose $f = e^{i\pi/m}$ so that $t_{j'} = f\omega_m^{j'} = e^{i\pi/m}\omega_m^{j'}$. Next, we determine an upper bound on the RHS of (B.5). Towards this end, we note that the distance between two points on the unit circle can be expressed as $2\sin(\theta/2)$ if θ is the induced angle between them. Furthermore, we have $2\sin(\theta/2) \ge 2\theta/\pi$ as long as $\theta \le \pi$.

It can be seen that the closest point to $t_{j'}$ that lies within $\Omega_q$ has an induced angle
$$\left|\frac{2\pi\ell}{q} - \frac{2\pi(j' + \frac{1}{2})}{m}\right| \ge \frac{2\pi}{qm} \cdot \frac{1}{2} \ge \frac{\pi}{q^2}.$$
Therefore, the corresponding distance is lower bounded by $2/q^2$. Similarly, the next closest distance is lower bounded by $2/q$, followed by $2(2/q), 3(2/q), \ldots, (q-m-1)(2/q)$. Let $d = q - m$. Then,
$$\Pi_{\alpha_j \in M} |t_{j'} - \alpha_j| \min_{i',j'} |s_{i'} - t_{j'}| \ge \frac{2}{q^2} \times \frac{2}{q} \times \frac{4}{q} \times \cdots \times \frac{2(d-1)}{q} \times \frac{2}{q^2} = 2^{d+1}(d-1)!\, \frac{1}{q^{d+3}}.$$

Therefore,
$$|C_{s,f}^{-1}(i', j')| \le \frac{q^{d+3}}{C_d},$$
where $C_d = 2^{d-1}(d-1)!$ is a constant. Let the $i$-th parameter be $s_i = e^{i2\pi\ell/q}$. Then,
$$|s_i^m - f^m| = |e^{i2\pi\ell m/q} + 1| = 2|\cos(\pi\ell m/q)|.$$
The term $\ell m$ can be expressed as $\ell m = \beta q + \eta$ for integers β and η such that $0 \le \eta \le q-1$. Now note that $\eta \neq q/2$ since by assumption $q$ is odd. Thus, $|\cos(\pi\ell m/q)|$ takes its smallest value when $\eta = (q+1)/2$ or $(q-1)/2$. In this case
$$|\cos(\pi\ell m/q)| = \left|\cos\left(\beta\pi + \frac{q+1}{2q}\pi\right)\right| \ge \sin\left(\frac{\pi}{2q}\right) \ge \frac{1}{q}.$$

Thus, we can upper bound the RHS of (B.3) and obtain
$$\|V^{-1}\| \le m\, q\, \frac{q^{d+3}}{C_d} \le \frac{q^{d+5}}{C_d} \quad (\text{since } m < q).$$
Finally, using the fact that $\|V\| \le m < q$, we obtain
$$\kappa(V) \le \frac{q^{d+6}}{C_d}.$$
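Theorem 8 can be sanity-checked numerically; in the sketch below the choice q = 5, m = 3 and the particular parameter subset are arbitrary, and the measured condition number is compared against the bound q^{d+6}/C_d:

```python
import numpy as np
from math import factorial

# A 3 x 3 Vandermonde matrix built from 3 distinct 5-th roots of unity,
# compared against kappa(V) <= q^{d+6} / C_d with d = q - m, C_d = 2^{d-1}(d-1)!.
q, m = 5, 3
params = [np.exp(2j * np.pi * l / q) for l in (0, 1, 3)]  # distinct q-th roots
V = np.array([[s ** r for s in params] for r in range(m)])

d = q - m
bound = q ** (d + 6) / (2 ** (d - 1) * factorial(d - 1))
kappa = np.linalg.cond(V, 2)
assert kappa <= bound
print(f"kappa = {kappa:.2f}, bound = {bound:.1f}")
```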

Auxiliary Claims

Definition 13. Permutation Equivalence. We say that a matrix $M$ is permutation equivalent to $M^\pi$ if $M^\pi$ can be obtained by permuting the rows and columns of $M$. We denote this by $M \sim M^\pi$.

Claim 14. Let $M$ be an $l_1 q \times l_2 q$ matrix consisting of blocks of size $q \times q$, denoted by $M_{i,j}$ for $i \in [l_1]$, $j \in [l_2]$, where each $M_{i,j}$ is a diagonal matrix. Then, the rows and columns of $M$ can be permuted to obtain $M^\pi$, which is a block diagonal matrix where each block is of size $l_1 \times l_2$ and there are $q$ of them.

Proof. For an integer $a$, let $(a)_q$ denote $a \bmod q$. In what follows, we establish two permutations
$$\pi_{l_1}(i) = l_1 (i)_q + \lfloor i/q \rfloor, \quad 0 \le i < l_1 q, \text{ and}$$
$$\pi_{l_2}(j) = l_2 (j)_q + \lfloor j/q \rfloor, \quad 0 \le j < l_2 q,$$
and show that applying row-permutation $\pi_{l_1}$ and column-permutation $\pi_{l_2}$ to $M$ will result in a block diagonal matrix $M^\pi$.

We observe that the $(i, j)$-th entry in $M$ is the $((i)_q, (j)_q)$-th entry in $M_{\lfloor i/q \rfloor, \lfloor j/q \rfloor}$. Under the applied permutations, the $(i, j)$-th entry in $M$ is mapped to the $(l_1(i)_q + \lfloor i/q \rfloor,\ l_2(j)_q + \lfloor j/q \rfloor)$-th entry in $M^\pi$. Recall that $M_{\lfloor i/q \rfloor, \lfloor j/q \rfloor}$ is a diagonal matrix, which implies that for $(i)_q \neq (j)_q$, the $(l_1(i)_q + \lfloor i/q \rfloor,\ l_2(j)_q + \lfloor j/q \rfloor)$-th entry in $M^\pi$ is 0. Therefore $M^\pi$ is a block diagonal matrix with $q$ blocks of size $l_1 \times l_2$.


Example 21. Let $l_1 = 2$, $l_2 = 3$, $q = 2$. Consider a 4 × 6 matrix $M$ which consists of diagonal matrices $M_{i,j}$ of size 2 × 2 for $0 \le i \le 1$, $0 \le j \le 2$:
$$M = \begin{bmatrix} M_{0,0} & M_{0,1} & M_{0,2} \\ M_{1,0} & M_{1,1} & M_{1,2} \end{bmatrix}
= \begin{bmatrix}
1 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 \\
1 & 0 & \omega_q & 0 & \omega_q^2 & 0 \\
0 & 1 & 0 & \omega_q^{-1} & 0 & \omega_q^{-2}
\end{bmatrix}.$$
We use the row permutation $\pi_{row} = (0, 2, 1, 3)$, which means that the 0, 1, 2, 3-th rows of $M$ permute to the 0, 2, 1, 3-th rows. Similarly, the column permutation is $\pi_{col} = (0, 3, 1, 4, 2, 5)$. Thus, $M^\pi$ becomes
$$M^\pi = \begin{bmatrix}
1 & 1 & 1 & 0 & 0 & 0 \\
1 & \omega_q & \omega_q^2 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & \omega_q^{-1} & \omega_q^{-2}
\end{bmatrix}.$$
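Claim 14 (and Example 21) can also be verified numerically; in the sketch below the sizes l1 = 2, l2 = 3, q = 4 are arbitrary, and the permutations are implemented exactly as in the proof:

```python
import numpy as np

# A matrix of diagonal q x q blocks becomes block diagonal (q blocks of size
# l1 x l2) under the permutations pi_{l1}, pi_{l2} from the proof of Claim 14.
l1, l2, q = 2, 3, 4
rng = np.random.default_rng(1)

# Random M whose (bi, bj)-th q x q block is diagonal.
M = np.zeros((l1 * q, l2 * q))
for bi in range(l1):
    for bj in range(l2):
        M[bi * q:(bi + 1) * q, bj * q:(bj + 1) * q] = np.diag(rng.standard_normal(q))

Mpi = np.zeros_like(M)
for i in range(l1 * q):
    for j in range(l2 * q):
        Mpi[l1 * (i % q) + i // q, l2 * (j % q) + j // q] = M[i, j]

# Zero out the q diagonal l1 x l2 blocks; everything else must already be zero.
for r in range(q):
    Mpi[r * l1:(r + 1) * l1, r * l2:(r + 1) * l2] = 0.0
assert np.allclose(Mpi, 0.0)
print("Claim 14 permutation yields a block diagonal matrix")
```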

Claim 15. (i) Let $a_0(z) = \sum_{j=0}^{\ell_a-1} a_{j0} z^j$, $a_1(z) = \sum_{j=0}^{\ell_a-1} a_{j1} z^{-j}$, $b_0(z) = \sum_{j=0}^{\ell_b-1} b_{j0} z^{j\ell_a}$, and $b_1(z) = \sum_{j=0}^{\ell_b-1} b_{j1} z^{-j\ell_a}$. Then $a_{k_1}(z) b_{k_2}(z)$ for $k_1, k_2 = 0, 1$ are polynomials that can be recovered from $\ell_a \ell_b$ distinct evaluation points in $\mathbb{C}$.

Let $D(z^j) = \mathrm{diag}([z^j\ z^{-j}])$ and let
$$X(z) = \begin{bmatrix} I_2 & & & \\ & D(z) & & \\ & & \ddots & \\ & & & D(z^{\ell_a-1}) \end{bmatrix} \otimes \begin{bmatrix} I_2 & & & \\ & D(z^{\ell_a}) & & \\ & & \ddots & \\ & & & D(z^{\ell_a(\ell_b-1)}) \end{bmatrix}.$$
Then, if the $z_i$'s are distinct points in $\mathbb{C}$, the matrix
$$[X(z_1) | X(z_2) | \ldots | X(z_{\ell_a \ell_b})]$$
is nonsingular.

(ii) The matrix $[X_{i_0} | X_{i_1} | \ldots | X_{i_{\tau-1}}]$ (defined in the proof of Theorem 6) is permutation equivalent to a block-diagonal matrix with four blocks, each of size τ × τ. Each of these blocks is a Vandermonde matrix with parameters from the set $\{1, \omega_q, \omega_q^2, \ldots, \omega_q^{q-1}\}$.

Proof. First we show that $a_{k_1}(z) b_{k_2}(z)$ for $k_1, k_2 = 0, 1$ are polynomials that can be recovered from $\ell_a \ell_b$ distinct evaluation points in $\mathbb{C}$. Towards this end, these four polynomials can be written as
$$\begin{aligned}
a_0(z) b_0(z) &= \sum_{i=0}^{\ell_a-1} \sum_{j=0}^{\ell_b-1} a_{i0} b_{j0} z^{i+j\ell_a}, \\
a_0(z) b_1(z) &= \sum_{i=0}^{\ell_a-1} \sum_{j=0}^{\ell_b-1} a_{i0} b_{j1} z^{i-j\ell_a}, \\
a_1(z) b_0(z) &= \sum_{i=0}^{\ell_a-1} \sum_{j=0}^{\ell_b-1} a_{i1} b_{j0} z^{-i+j\ell_a}, \text{ and} \\
a_1(z) b_1(z) &= \sum_{i=0}^{\ell_a-1} \sum_{j=0}^{\ell_b-1} a_{i1} b_{j1} z^{-i-j\ell_a}.
\end{aligned}$$
Upon inspection, it can be seen that each of the polynomials above has $\ell_a \ell_b$ consecutive powers of $z$. Therefore, each of these can be interpolated from $\ell_a \ell_b$ non-zero distinct evaluation points in $\mathbb{C}$.

The second part of the claim follows from the above discussion. To see this we note that
$$[a_0(z)\ a_1(z)] = [a_{00}\ a_{01}\ a_{10}\ a_{11}\ \ldots\ a_{(\ell_a-1)0}\ a_{(\ell_a-1)1}] \begin{bmatrix} I_2 & & & \\ & D(z) & & \\ & & \ddots & \\ & & & D(z^{\ell_a-1}) \end{bmatrix}, \text{ and}$$
$$[b_0(z)\ b_1(z)] = [b_{00}\ b_{01}\ b_{10}\ b_{11}\ \ldots\ b_{(\ell_b-1)0}\ b_{(\ell_b-1)1}] \begin{bmatrix} I_2 & & & \\ & D(z^{\ell_a}) & & \\ & & \ddots & \\ & & & D(z^{\ell_a(\ell_b-1)}) \end{bmatrix}.$$
Furthermore, the four product polynomials under consideration can be expressed as
$$[a_0(z)\ a_1(z)] \otimes [b_0(z)\ b_1(z)] = \left([a_{00}\ a_{01}\ \ldots\ a_{(\ell_a-1)1}] \otimes [b_{00}\ b_{01}\ \ldots\ b_{(\ell_b-1)1}]\right) X(z).$$
We have previously shown that all polynomials in $[a_0(z)\ a_1(z)] \otimes [b_0(z)\ b_1(z)]$ can be interpolated by obtaining their values on $\ell_a \ell_b$ non-zero distinct evaluation points. This implies that we can equivalently obtain
$$[a_{00}\ a_{01}\ a_{10}\ a_{11}\ \ldots\ a_{(\ell_a-1)0}\ a_{(\ell_a-1)1}] \otimes [b_{00}\ b_{01}\ b_{10}\ b_{11}\ \ldots\ b_{(\ell_b-1)0}\ b_{(\ell_b-1)1}],$$
which means that $[X(z_1) | X(z_2) | \ldots | X(z_{\ell_a \ell_b})]$ is non-singular. This proves the statement in part (i).
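The interpolation argument in part (i) can be illustrated with a small numerical sketch (the sizes and evaluation points are arbitrary choices); only the a0(z)b0(z) case is shown here:

```python
import numpy as np

# a0(z) b0(z) has la * lb consecutive powers z^0 .. z^{la*lb - 1}, so its
# coefficients are recovered from la * lb distinct nonzero points by solving
# a Vandermonde system.
la, lb = 3, 2
rng = np.random.default_rng(2)
a = rng.standard_normal(la)  # a0(z) = sum_i a_i z^i
b = rng.standard_normal(lb)  # b0(z) = sum_j b_j z^{j*la}

def a0b0(z):
    return sum(a[i] * z ** i for i in range(la)) * \
           sum(b[j] * z ** (j * la) for j in range(lb))

pts = np.arange(1.0, la * lb + 1.0)  # distinct, non-zero evaluation points
vals = np.array([a0b0(z) for z in pts])
V = np.vander(pts, la * lb, increasing=True)
coeffs = np.linalg.solve(V, vals)

# Compare with the direct coefficient product: coeff of z^{i + j*la} is a_i b_j.
direct = np.zeros(la * lb)
for i in range(la):
    for j in range(lb):
        direct[i + j * la] += a[i] * b[j]
assert np.allclose(coeffs, direct)
print("a0(z) b0(z) recovered from", la * lb, "evaluation points")
```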

The proof of the statement in (ii) is essentially an exercise in showing the permutation equivalence of several matrices by using Claim 14 and the permutation equivalence properties of Kronecker products. For convenience, we define
$$X_{l,A} = \begin{bmatrix} I & & & \\ & \Lambda^l & & \\ & & \ddots & \\ & & & \Lambda^{l(k_A-1)} \end{bmatrix}, \quad \text{and} \quad X_{l,B} = \begin{bmatrix} I & & & \\ & \Lambda^{lk_A} & & \\ & & \ddots & \\ & & & \Lambda^{lk_A(k_B-1)} \end{bmatrix},$$
so that $X_l = X_{l,A} \otimes X_{l,B}$.

Recall that we are analyzing the matrix $X = [X_{i_0} | X_{i_1} | \ldots | X_{i_{\tau-1}}]$. An application of Claim 14 shows that
$$X_{l,A} \sim X_{l,A}^P = \begin{bmatrix} V_{l,A,1} & \\ & V_{l,A,2} \end{bmatrix}, \quad \text{and} \quad X_{l,B} \sim X_{l,B}^P = \begin{bmatrix} V_{l,B,1} & \\ & V_{l,B,2} \end{bmatrix},$$
where $V_{l,A,1} = [1, \omega_q^l, \cdots, \omega_q^{l(k_A-1)}]^T$, $V_{l,A,2} = [1, \omega_q^{-l}, \cdots, \omega_q^{-l(k_A-1)}]^T$, $V_{l,B,1} = [1, \omega_q^{lk_A}, \cdots, \omega_q^{lk_A(k_B-1)}]^T$, and $V_{l,B,2} = [1, \omega_q^{-lk_A}, \cdots, \omega_q^{-lk_A(k_B-1)}]^T$. Then we conclude that $X \sim X^P = [X_{i_0}^P | X_{i_1}^P | \cdots | X_{i_{\tau-1}}^P]$,

where $X_l^P = X_{l,A}^P \otimes X_{l,B}^P$. Next we show that
$$X_l^P = X_{l,A}^P \otimes X_{l,B}^P \sim X_l^{P,\pi} = \begin{bmatrix} V_{l,A,1} \otimes V_{l,B,1} & & & \\ & V_{l,A,2} \otimes V_{l,B,1} & & \\ & & V_{l,A,1} \otimes V_{l,B,2} & \\ & & & V_{l,A,2} \otimes V_{l,B,2} \end{bmatrix}.$$
By the definition of the Kronecker product, we have
$$X_{l,A}^P \otimes X_{l,B}^P = \begin{bmatrix} V_{l,A,1} \otimes X_{l,B}^P & \\ & V_{l,A,2} \otimes X_{l,B}^P \end{bmatrix}.$$
Note that $V_{l,A,i} \otimes V_{l,B,j} \sim V_{l,B,j} \otimes V_{l,A,i}$; then
$$V_{l,A,i} \otimes X_{l,B}^P = V_{l,A,i} \otimes \begin{bmatrix} V_{l,B,1} & \\ & V_{l,B,2} \end{bmatrix} \sim \begin{bmatrix} V_{l,B,1} & \\ & V_{l,B,2} \end{bmatrix} \otimes V_{l,A,i} = \begin{bmatrix} V_{l,B,1} \otimes V_{l,A,i} & \\ & V_{l,B,2} \otimes V_{l,A,i} \end{bmatrix} \sim \begin{bmatrix} V_{l,A,i} \otimes V_{l,B,1} & \\ & V_{l,A,i} \otimes V_{l,B,2} \end{bmatrix}.$$
Thus, we can conclude that $X_l^P \sim X_l^{P,\pi}$. In addition, we have
$$\begin{aligned}
V_{l,A,1} \otimes V_{l,B,1} &= [1, \omega_q^l, \cdots, \omega_q^{l(k_Ak_B-2)}, \omega_q^{l(k_Ak_B-1)}]^T, \\
V_{l,A,2} \otimes V_{l,B,1} &= [\omega_q^{-l(k_A-1)}, \omega_q^{-l(k_A-2)}, \cdots, \omega_q^{-l}, 1, \omega_q^l, \cdots, \omega_q^{l(k_A(k_B-1)-1)}, \omega_q^{lk_A(k_B-1)}]^T, \\
V_{l,A,1} \otimes V_{l,B,2} &= [\omega_q^{-lk_A(k_B-1)}, \omega_q^{-l(k_A(k_B-1)-1)}, \cdots, \omega_q^{-l}, 1, \omega_q^l, \cdots, \omega_q^{l(k_A-2)}, \omega_q^{l(k_A-1)}]^T, \text{ and} \\
V_{l,A,2} \otimes V_{l,B,2} &= [\omega_q^{-l(k_Ak_B-1)}, \omega_q^{-l(k_Ak_B-2)}, \cdots, \omega_q^{-l}, 1]^T.
\end{aligned}$$
Finally, applying Claim 14 again, we obtain the required result.



B.0.1 Proof of Lemma 4

Proof. The proof is essentially an exercise in the definition of Kronecker products.
$$\begin{bmatrix} \Psi_{0,0} M_1 + \Psi_{0,1} M_2 \\ \Psi_{1,0} M_1 + \Psi_{1,1} M_2 \end{bmatrix}
= \begin{bmatrix} \Psi_{0,0} I_\zeta M_1 + \Psi_{0,1} I_\zeta M_2 \\ \Psi_{1,0} I_\zeta M_1 + \Psi_{1,1} I_\zeta M_2 \end{bmatrix}
= \begin{bmatrix} \Psi_{0,0} I_\zeta & \Psi_{0,1} I_\zeta \\ \Psi_{1,0} I_\zeta & \Psi_{1,1} I_\zeta \end{bmatrix} \begin{bmatrix} M_1 \\ M_2 \end{bmatrix}
= \left( \begin{bmatrix} \Psi_{0,0} & \Psi_{0,1} \\ \Psi_{1,0} & \Psi_{1,1} \end{bmatrix} \otimes I_\zeta \right) \begin{bmatrix} M_1 \\ M_2 \end{bmatrix}.$$
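The chain of equalities above can be verified numerically; in the following sketch the dimensions are arbitrary:

```python
import numpy as np

# Stacking Psi_{i,0} M1 + Psi_{i,1} M2 equals (Psi kron I_zeta) [M1; M2].
zeta, w = 3, 4
rng = np.random.default_rng(3)
Psi = rng.standard_normal((2, 2))
M1 = rng.standard_normal((zeta, w))
M2 = rng.standard_normal((zeta, w))

lhs = np.vstack([Psi[0, 0] * M1 + Psi[0, 1] * M2,
                 Psi[1, 0] * M1 + Psi[1, 1] * M2])
rhs = np.kron(Psi, np.eye(zeta)) @ np.vstack([M1, M2])
assert np.allclose(lhs, rhs)
print("(Psi kron I_zeta) stacking identity verified")
```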

Claim 16. Let $\tau_{dif} = 2k_Ak_Bp - 2(k_Ak_B + pk_A + pk_B) + k_A + k_B + 2p$ and $p > 1$. Then $\tau_{dif} < 0$ only if $k_A = 1$ or $k_B = 1$.

Proof. We fix $k_B$ and $p$, and consider $\tau_{dif}$ as a function of $k_A$. Then
$$\tau_{dif}'(k_A) = 2k_Bp - 2k_B - 2p + 1 > 0,$$
which means $\tau_{dif}(k_A)$ is a strictly increasing function. We consider the following three cases.

• $k_A = 1$. In this case, $\tau_{dif}(1) = 1 - k_B$, so $\tau_{dif} \le 0$.

• $k_A = 2$. In this case, $\tau_{dif}(2) = 2k_Bp - 3k_B - 2p + 2$. It is not easy to evaluate the sign of $\tau_{dif}(2)$ directly, so we consider $\tau_{dif}(2)$ as a function of $k_B$. Its derivative $\tau_{dif}(2)'(k_B) = 2p - 3 > 0$, which means $\tau_{dif}(2)$ is a strictly increasing function of $k_B$ since $p \ge 2$. When $k_B = 2$, $\tau_{dif}(2) = 2p - 4 \ge 0$. Then we conclude $\tau_{dif}(2) \ge 0$.

• $k_A > 2$. It is clear that $\tau_{dif}(k_A) > 0$ since $\tau_{dif}(k_A)$ is strictly increasing and $\tau_{dif}(2) \ge 0$.

The same argument can be applied to $k_B$.
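Claim 16 is also easy to confirm by brute force over a small grid (the ranges below are arbitrary):

```python
# For p > 1, tau_dif < 0 forces kA = 1 or kB = 1.
def tau_dif(kA, kB, p):
    return 2 * kA * kB * p - 2 * (kA * kB + p * kA + p * kB) + kA + kB + 2 * p

for p in range(2, 8):
    for kA in range(1, 10):
        for kB in range(1, 10):
            if tau_dif(kA, kB, p) < 0:
                assert kA == 1 or kB == 1
print("tau_dif < 0 only when kA = 1 or kB = 1 on the tested grid")
```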
