Multi-Level Encoding and Decoding in A Scalable Photonic Tensor Processor With A Photonic General


IEEE JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS, VOL. 28, NO. 6, NOVEMBER/DECEMBER 2022 8300714

Multi-Level Encoding and Decoding in a Scalable Photonic Tensor Processor With a Photonic General Matrix Multiply (GeMM) Compiler

Zhimu Guo, Alexander N. Tait, Member, IEEE, Bicky A. Marquez, Member, IEEE, Matthew Filipovich, Hugh Morison, Paul R. Prucnal, Life Fellow, IEEE, Lukas Chrostowski, Senior Member, IEEE, Sudip Shekhar, Senior Member, IEEE, and Bhavin J. Shastri, Senior Member, IEEE

(Invited Paper)

Abstract: The resurgence of artificial intelligence enabled by deep learning and high performance computing has driven a dramatic increase in the demand for accuracy in deep learning models, which has come at the cost of computational complexity. The fundamental operations in deep learning models are matrix multiplications, and large scale matrix operations and data-centric tasks have experienced bottlenecks from current digital electronic hardware in terms of performance and scalability. Recent research on photonic processors has found solutions to enable applications in machine learning, neuromorphic computing and high performance computing using basic photonic processing elements on an integrated silicon photonic platform. However, efficient and scalable photonic computing requires an information encoding/decoding scheme. Here, we propose a multi-level encoding and decoding scheme, and experimentally demonstrate it with a wavelength-multiplexed silicon photonic processor. We also discuss the scalability of our proposed scheme by introducing a photonic general matrix multiply compiler, and consider the effects of speed, bit precision, and noise. Our proposed scheme could be adapted to a variety of photonic information processing architectures for photonic neural networks, photonic tensor cores, and programmable photonics.

Index Terms: Integrated optics, matrix decomposition, matrix multiplication, optical computing, optical neural networks, programmable circuits.

Manuscript received 22 February 2022; revised 14 June 2022; accepted 31 July 2022. Date of publication 5 August 2022; date of current version 23 August 2022. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), in part by the Canadian Foundation for Innovation (CFI), and in part by Queen’s University. (Corresponding author: Zhimu Guo.)

Zhimu Guo, Bicky A. Marquez, Matthew Filipovich, and Hugh Morison are with the Department of Physics, Engineering Physics and Astronomy, Queen’s University, Kingston, ON K7L 3N6, Canada (e-mail: 15zg11@queensu.ca; [email protected]; [email protected]; hugh.moris [email protected]).

Alexander N. Tait is with the Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON K7L 3N6, Canada (e-mail: [email protected]).

Paul R. Prucnal is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]).

Lukas Chrostowski and Sudip Shekhar are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada (e-mail: [email protected]; [email protected]).

Bhavin J. Shastri was with the Department of Physics, Engineering Physics and Astronomy, Queen’s University, Kingston, ON K7L 3N6, Canada. He is now with the Vector Institute, Toronto, ON M5G 1M1, Canada (e-mail: shastri@ieee.org).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSTQE.2022.3196884.

Digital Object Identifier 10.1109/JSTQE.2022.3196884

1077-260X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Queen's University. Downloaded on August 24,2022 at 02:16:24 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Advancements in machine learning (ML) and artificial intelligence (AI) technologies have enabled numerous applications including sophisticated recommendation models, natural language processing, machine vision, augmented reality, and so on [1], [2], [3], [4]. The groundbreaking progress of these AI applications in different fields is enabled by the heavy dependence of ML algorithm training on large data sets. Since the interconnection of neurons in artificial neural networks can be described by a matrix and the data being processed can be represented as a vector, training on large data sets with deep neural networks results in large-scale dense matrix-vector multiplications. The improvement in the performance (i.e., accuracy) of many ML applications comes at the cost of a higher computational power requirement [5]. As such, there has been significant progress in the development of digital electronic application-specific integrated circuits known as AI accelerators that are dedicated to dense matrix computations [6], [7]. However, modern AI accelerators have seen two major bottlenecks when it comes to energy efficiency: data transfer to and from memory, and large matrix-vector multiplications. Both have imposed strict physical limitations on the scalability and performance of digital electronic AI accelerators.

Integrated photonic processors enabled by silicon photonics have shown promising capabilities in accelerating tensor (i.e., multidimensional vector and matrix) operations [8], [9], [10], [11] by exploiting the high bandwidth of photonic devices (modulators and photodetectors), low latency and minimal energy-delay product due to passive optical waveguides [12]. Some of these processors [9], [10], [11] are scalable and can use the parallel nature of light through wavelength-division multiplexing (WDM) to achieve large-scale interconnects and massively parallel data processing and transfer. Recent developments have proven that the wavelength-multiplexed silicon photonic platform can be operated with up to 7-bit precision [13], and most recently 8.5-bit precision [14] on each individual multiplication unit. However, recent studies of these photonic processors have also seen an increasing demand for a rigorous photonic programming scheme to facilitate efficient communication between photonic hardware and its control system [8], [9], [12], [15]. A reliable information encoding and decoding scheme is required to interface between the silicon photonic platform and the rest of the computing system.
The core of a programmable system is a viable and efficient information encoding and decoding method that translates the same information between different hardware platforms and media using their own respective “languages”. For example, in digital electronics, the binary scheme is used as the information encoding and decoding method, where every channel in a binary system has one of two digital states: either “1” or “0”. However, the actual switching of the state at the transistor level is achieved by changing the voltage across the transistors. Therefore, the binary scheme maps “1” to a high voltage value and “0” to a low voltage value, and it serves as the foundation for all digital electronic platforms and hardware. Similarly, silicon photonic systems also require an information encoding and decoding method that can conveniently translate information between the digital user interface and the analog compute hardware, using the specific physical parameters measured on different silicon photonic hardware platforms.

Fig. 1. (a): schematics of an MRR-based photonic TPE for vector dot product between vectors X and W, along with its control system. (b): general mathematical concept for matrix-vector dot product using MRRs, and an optical micrograph of the fabricated silicon MRR with N-doped heater inside a photonic TPE on a silicon-on-insulator (SOI) platform.

In this work, we present a feasible information encoding/decoding solution, the multi-level scheme, for WDM photonic processors based on microring resonators (MRRs) [11], [16], [17]. Unlike the digital binary information system, the proposed multi-level scheme encodes multiple values as distinct amplitude levels using only a single analog input channel. Instead of using multiple channels to achieve a high bit precision, the multi-bit encoding of the multi-level scheme will enable a higher bandwidth per input channel. By designing a dedicated information system for photonic tensor processors, we aim to take full advantage of photonics to create a fully packaged software/hardware photonic tensor processor solution that is capable of large matrix operations. As a dedicated photonic information system, the proposed multi-level encoding scheme can be generalized to different photonic tensor processors implementing an MRR-based architecture, and will also ensure high compatibility with these photonic systems to achieve high scalability at the hardware level. To demonstrate the scalability of our photonic tensor processor, we have implemented a simple general matrix multiplication (GeMM) compiler for the multi-level photonic information system as a software scaling solution. Computation results have verified the viability of this approach, and the computation accuracy is close to ideal for large matrices. A hardware scaling solution is also presented, and we show an example of the actual implementation of this solution in a later iteration of our photonic tensor processor design.

II. MULTI-LEVEL INFORMATION ENCODING AND DECODING

A. Photonic Tensor Processing Element

The multi-level information encoding/decoding scheme is designed for the photonic tensor processing element (TPE) shown in Fig. 1. This architecture was first proposed by Bangari et al. [11] to perform convolution operations and recently demonstrated by Marquez et al. [18] for vector dot products with limited precision. The photonic TPE includes an array of MRRs, each operating on a distinct resonant wavelength, encoding a row vector W. Tunable lasers that are intensity modulated (with variable optical attenuators (VOAs) in our case, or alternatively directly modulated laser (DML) diodes [19]) provide carrier signals for encoding the inputs X to the MRRs using different wavelengths. For a proof-of-concept demonstration, our TPE processes vectors of size n, resulting in n lasers and n coupled MRRs. As shown in Fig. 1(a), the MRRs are in an add/drop configuration and are coupled with two bus waveguides: a shared waveguide for the IN-THRU connection, and another one connecting the DROP. While the input vector X is encoded via the attenuators as the intensities of the input optical power, the weight vector W is encoded as currents to the MRRs that shift their resonances and redistribute the input optical power between the DROP and the THRU ports according to the difference between the resonance of the MRR and the laser wavelength. In short, each input value is encoded onto a channel with a different wavelength, and we use multiple MRRs in parallel, each weighting a different channel. Both the inputs and weights are strictly encoded as the amplitude of the input optical power, as well as the output measurements. Thus, the phase information will be neglected on all optical channels, and we will not discuss any phase-change effects in our photonic TPE. As a proof-of-concept, we exploit the thermo-optic effect for the MRR tuning [20], [21]. More efficient carrier-depletion effects or phase-change materials [22] can also be used to shift the resonant wavelength of the MRR with the current.
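The weighting principle described above can be illustrated with a deliberately idealized numerical model. The sketch below assumes a lossless ring that routes a fraction (1 + w)/2 of each channel's power to DROP and the remainder to THRU, and works in linear power units; this splitting rule, the function name `tpe_dot`, and the sample values are illustrative assumptions and not the measured device response.

```python
import numpy as np


def tpe_dot(x, w):
    """Idealized photonic TPE dot product (illustrative model only).

    x: non-negative input powers, one per wavelength channel (attenuators).
    w: weights in [-1, 1], one per MRR.
    ASSUMPTION: each lossless ring sends a fraction (1 + w) / 2 of its
    channel to DROP and the rest to THRU.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    if x.shape != w.shape:
        raise ValueError("x and w must have the same length")
    if np.any(x < 0) or np.any(np.abs(w) > 1):
        raise ValueError("need x >= 0 and -1 <= w <= 1")
    p_drop = np.sum(x * (1.0 + w) / 2.0)  # total power on the DROP bus
    p_thru = np.sum(x * (1.0 - w) / 2.0)  # total power on the THRU bus
    return float(p_drop - p_thru)         # balanced detection: sum_i w_i * x_i


x = [0.2, 0.5, 0.9]    # hypothetical input powers
w = [1.0, -0.5, 0.25]  # hypothetical weights
print(tpe_dot(x, w), np.dot(x, w))
```

In this model the balanced output equals the ordinary dot product because the per-channel DROP and THRU fractions differ by exactly w. A real ring departs from this through loss and a nonlinear tuning curve, which is what the calibration stage is meant to absorb.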


Fig. 1(b) shows the silicon photonic TPE fabricated on a silicon-on-insulator (SOI) wafer with a silicon thickness of 220 nm and a buried oxide thickness of 2 μm. The bus waveguides have a width of 500 nm. The MRRs have radii of 8.0 μm, 8.01213 μm, 8.02426 μm, 8.03639 μm, and 8.04852 μm. The gap between the ring and the bus waveguide is 200 nm, yielding a Q factor of ∼6000, and the free spectral range is around 12 nm for an MRR with 8 μm radius. The MRRs have N-doped photoconductive heaters [23] that can actuate the weight by thermally tuning the MRR resonance. To implement the N-doped heater, the circular waveguide of each MRR is etched to a 90 nm thick pedestal that hosts the phosphorus dopants. A 10 μm wide N doping section is patterned to follow the MRR, outside of which heavy N++ doping is used to make ohmic contacts. The phosphorus dopant concentrations are N: 5 × 10^17 cm^−3 and N++: 5 × 10^20 cm^−3. Metal vias and traces are deposited to connect the heater contacts of the MRR weight bank to electrical metal pads.

Fig. 2. Operation flowchart for a single MRR, including both calibration and validation stages.

Fig. 3. Experimental data for the MRR profile mapping the measured output, P_DROP − P_THRU, to the applied heating current, I_heat.

The TPE control system consists of source meters that provide the current to the MRRs, and a power meter with a balanced photodetector, all controlled by a computer. The output optical power from both the DROP and THRU ports is collected by the two photodetectors in a balanced push-pull configuration that subtracts the THRU port power from the DROP port power, giving us P_DROP − P_THRU in units of dB [24]. All analog values are passed to and from the computer that regulates the information flow between a user application for ML and the photonic TPE.
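As a rough sanity check on the ring geometry quoted for the fabricated TPE, the following sketch estimates the free spectral range, the linewidth implied by the Q factor, and the resonance offset between neighbouring rings. The operating wavelength (1550 nm) and group index (4.2) are assumptions that the text does not state, the effective index is the value quoted in the Fig. 5 caption, and the function names are hypothetical.

```python
import math

RADII_UM = [8.0, 8.01213, 8.02426, 8.03639, 8.04852]  # ring radii from the text
Q_FACTOR = 6000.0        # approximate Q from the text
LAMBDA0_NM = 1550.0      # ASSUMED resonance wavelength of the 8.0 um ring
GROUP_INDEX = 4.2        # ASSUMED group index (not given in the text)
EFFECTIVE_INDEX = 2.82   # effective index quoted in the Fig. 5 caption


def fsr_nm(radius_um, wavelength_nm, group_index):
    """Free spectral range: FSR = lambda^2 / (n_g * 2 * pi * R)."""
    circumference_nm = 2.0 * math.pi * radius_um * 1e3
    return wavelength_nm ** 2 / (group_index * circumference_nm)


def linewidth_nm(wavelength_nm, q_factor):
    """Full width at half maximum implied by Q = lambda / FWHM."""
    return wavelength_nm / q_factor


def resonance_shift_nm(radius_um, ref_radius_um, wavelength_nm, n_eff, n_g):
    """First-order shift at fixed resonance order: dl/l = (dR/R) * n_eff / n_g."""
    return wavelength_nm * (radius_um - ref_radius_um) / ref_radius_um * n_eff / n_g


fsr = fsr_nm(RADII_UM[0], LAMBDA0_NM, GROUP_INDEX)
fwhm = linewidth_nm(LAMBDA0_NM, Q_FACTOR)
shifts = [
    resonance_shift_nm(r, RADII_UM[0], LAMBDA0_NM, EFFECTIVE_INDEX, GROUP_INDEX)
    for r in RADII_UM
]
spacing = shifts[1] - shifts[0]
print(f"FSR ~ {fsr:.2f} nm, linewidth ~ {fwhm:.3f} nm, spacing ~ {spacing:.2f} nm")
```

Under these assumptions the estimated free spectral range falls between 11 nm and 12 nm, in line with the figure of around 12 nm given for an 8 μm ring, and the neighbouring resonances are separated by several linewidths. The numbers shift with the assumed indices, so they should be read as an order-of-magnitude check only.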
B. Input and Weight Encoding

The encoding scheme requires each photonic channel to represent numbers with n-bit precision using analog signals, and every analog value to be decoded back to its corresponding digital value. The photonic TPE has already shown promising results in its bit precision, and the highest precision achieved on a single photonic channel has been verified to be 7-bit [13], and more recently 8.5-bit [14]. Here, each photonic channel will include one MRR for multiplication, and one attenuator for input encoding.

The proposed multi-level encoding scheme implements a direct value mapping to translate an n-bit digital number to an analog value, and requires calibration and validation stages before and after the computation, respectively, as shown in Fig. 2. The calibration stage starts with the inputs to the MRR, which are encoded as the amplitude of the input optical channel modulated by an attenuator. A direct input mapping is implemented to encode numerical input values onto the attenuation applied on the input optical channel. Next, the calibration performs a heating current sweep for one MRR at a time under a constant laser power, and compares the currents to the measured outputs, P_DROP − P_THRU, of the MRR. After collecting the heating current sweep data, we choose a range of heating currents that produce a relatively linear response in optical output power as the MRR profile.

As shown in Fig. 3, the points in the middle of the heating current range produce a relatively linear trend. The relatively linear region is selected using the result of a linear regression of the heating current sweep data. We choose a specific tolerance of standard deviation that we aim to achieve, and manually adjust the heating current range for the linear regression until the standard deviation is under the specified tolerance. Having created the input mapping and the MRR profile, the next important parameter to define is the “zero point,” or the “reflection point,” of the MRR. The reflection point of the MRR represents the specific current value required to move the resonance of the MRR such that only half of the input optical power couples into the MRR and goes into DROP, while the other half goes into THRU. Thus, the linear power difference between DROP and THRU ports, P_DROP − P_THRU, is essentially a constant regardless of the input power. Therefore, we can perform a two-dimensional sweep on both the heating current and input power level to find the reflection point for the MRR, as shown in Fig. 4. The criterion for choosing the reflection point is the spread of power difference values at every current level. The spread of power differences represents how far away the MRR is from the heating current level that gives the even distribution of power between DROP and THRU ports. A larger spread means the MRR is further away from that current level, and the less even power distribution will make the change in input attenuation more pronounced. On the contrary, the smallest spread means the MRR is almost indifferent to the change in input attenuation, and this only happens when the power distribution


between THRU and DROP is close to even. Ideally, the constant power difference between DROP and THRU will be zero, but because of insertion losses between the waveguides and MRR, the measured reflection point yields a constant, non-zero power difference. However, for a practical MRR, the heating current levels that create the reflection point and the zero point are related to each other; the difference between these two points is determined by the coupling condition of the DROP and THRU ports. Different coupling conditions will introduce different insertion losses on the DROP and THRU ports, which breaks the even power distribution between the two ports in the ideal case.

Fig. 4. Experimental data for the sweep that searches for the reflection point in the output transmission for the MRR. The powers from both DROP and THRU ports are measured at the output of the optical circuit, which is equivalent to the location before the signals come into the balanced photodetectors. The laser pump power is a constant 10 dBm, and with the input attenuation the laser pump power is low enough that it will not cause optical nonlinearities.

Here we use electrical current instead of electrical power as the calibration metric during the search for the reflection point of an MRR. In theory, the thermo-optic effect shifts the resonance of the MRR by applying a heating power to the MRR, and the resonance shift is linear with applied power. When the MRR is on-resonance with the input power, the input light will also induce a small photocurrent that affects the power reading. In addition, the resistance of the MRR will also increase as the temperature increases as a result of the thermo-electric effect, consequently affecting the power measurement. On the other hand, if we focus on the small range of power output values around the reflection point, the output values can be approximated as a linear response. This allows us to use current values during the calibration phase with acceptable accuracy. Another benefit of using a tight range of current around the reflection point is that the small range prevents the use of larger currents and higher heat fluctuations created by large changes in the current from weight updates.

To further explain the reason for this non-zero “reflection point,” we will take a closer look at an MRR under different coupling conditions: with critical coupling between the THRU and DROP ports, and with symmetrical coupling between the two ports. The equations that describe different coupling conditions were originally demonstrated by Stokes et al. [25] and Heebner et al. [26], and were later re-derived by Bogaerts et al. [27]. Here, we assume the coupling coefficient between the MRR and THRU port is r_1, the coupling coefficient between the MRR and DROP port is r_2, the loss in the MRR is a, and the detuning is φ. The detuning can be calculated as follows:

$$\phi = \frac{2\pi n_{\mathrm{eff}}}{\lambda} \cdot 2\pi R = \frac{4\pi^2 R\, n_{\mathrm{eff}}}{\lambda}. \tag{1}$$

Here, n_eff is the effective refractive index of the MRR, R is the radius of the MRR, and λ is the wavelength of the input optical signal. Then we calculate the THRU port transmission, T_THRU, as follows:

$$T_{\mathrm{THRU}} = \frac{r_2^2 a^2 - 2 r_1 r_2 a \cos\phi + r_1^2}{1 - 2 r_1 r_2 a \cos\phi + (r_1 r_2 a)^2}, \tag{2}$$

and the DROP port transmission, T_DROP, as follows:

$$T_{\mathrm{DROP}} = \frac{(1 - r_1)^2 (1 - r_2)^2 a}{1 - 2 r_1 r_2 a \cos\phi + (r_1 r_2 a)^2}. \tag{3}$$

Finally, we can calculate the insertion loss, IL, for the MRR as follows:

$$\mathrm{IL} = 10 \log\left(\frac{T_{\mathrm{THRU}} + T_{\mathrm{DROP}}}{1.0}\right). \tag{4}$$

The transmission curves plotted using (2)–(4) for both coupling conditions are shown in Fig. 5(a), and the insertion loss curves for both coupling conditions are shown in Fig. 5(b). As shown there, there is a non-zero insertion loss at resonance in both coupling conditions, meaning the magnitude of the DROP port transmission will always be less than that of the THRU port transmission. As a result, the “reflection point,” which is calculated as the difference between DROP and THRU port powers at half THRU port transmission, will be non-zero in a real-world MRR with losses regardless of the coupling condition. In addition, we choose the symmetrical coupling condition for all MRRs in our photonic TPE because of fabrication variation. It is hard to hit an exact coupling value, r, because the as-fabricated gap strongly affects r. On the other hand, it is easy to make r_1 = r_2 by making the ring symmetric because the gaps usually come out the same. Most of the MRRs that have been fabricated for our weight banks are over coupled, meaning (1 − a) ≪ (1 − r). This is not optimal in terms of Q-factor, but it takes the loss, a, out of (2) and (3). Thus, we can end up with an expression that has a good extinction ratio and is also robust to fabrication-sensitive parameters.

Now we can combine the “reflection point” location with the MRR profile to choose a proper heating current range and map that range to the other set of inputs that are encoded as heating currents to the MRR. The selected heating current range should center around the reflection point so that we can encode the same range of positive and negative numbers. Since multiplication between two numbers can also be interpreted as one value being weighted by another, we call the mapping between the second set of inputs and heating current the weight mapping.

C. Output Decoding and Calibration

Finally, the output mapping is created using both the weight mapping and the input mapping. This step generates random numbers for both the heating current on the MRR and the attenuator, and the product of the two is represented as the power difference between the DROP and THRU ports of the MRR as P_DROP − P_THRU. The measured outputs from the MRR


are mapped to the range of desired digital values after a linear regression, and the parameters of the regression are used as the output mapping to transform all measured optical output power to the numerical values, thus concluding the entire calibration stage for a single MRR.

Fig. 5. (a) Transmission curves of the THRU (blue) and DROP (red) ports of a lossy MRR under either symmetrical coupling condition (solid lines) or critical coupling condition (dashed lines). For the MRR dimensions, we choose an MRR with an 8 µm radius and an effective refractive index of n = 2.82, the loss is a = 0.99 and the coupling coefficient between the MRR and the THRU port is r_1 = 0.97. For symmetrical coupling condition, we choose the coupling coefficient between the MRR and DROP port to be r_2 = r_1, whereas for critical coupling condition we have r_2 a = r_1. (b) Insertion loss (IL) curves (black) for both coupling conditions on the same plot, calculated using (4).

The calibration stage is executed once at the start of the photonic system, and then the photonic TPE enters the computation-validation cycle. To demonstrate our proposed multi-level encoding and decoding scheme using a single channel, the calibration is performed using only one tunable laser and one MRR. For larger scale photonic TPEs, calibration will require switching on all the optical channels and calibrating only one channel at a time. This will take into account the constant optical power offset contributed from all the other channels at the output.

The validation stage keeps track of the laser inputs, the MRR inputs, and the outputs, together with the three mapping profiles obtained from the calibration stage. To reduce compute latency and operation complexity, validation on the output values will not be executed on every output. Here we can implement our control system to sample the outputs at a fixed frequency (i.e., every 5 minutes), and every sampled output is compared to the expected output calculated with the set of MRR and laser inputs, and the system will trigger a re-calibration if the measured output fails to match the expected output within tolerance.

The step that consumes the most time during a validation-recalibration process is the laser frequency sweep to redefine all the resonances of the MRRs due to resonance shifts over time. This is directly constrained by the tuning speed of the tunable laser (TLS). In our experiments the tuning speed of the TLS is 100 nm/s, and the typical free spectral range for our MRR designs with an 8 μm radius is around 20 nm. Therefore, it takes about 0.2 seconds to complete a frequency sweep to redefine all the MRR resonances. Other steps include the differential comparison between measured and expected values during the validation stage, and laser frequency resets after the resonance calibration. These steps take significantly less time when compared to the TLS frequency sweep. Therefore, we estimate that each validation-recalibration process will take around 0.2 seconds. In addition, our further system stability testing results showed that such a validation-recalibration process would only be required hourly, making the time lost during this process insignificant compared to our system’s actual uptime.

III. OPERATIONAL RULES

A. Precision Flexibility

The multi-level encoding scheme only provides finite precision for number representation, and the total number of different values is also limited. On the other hand, the range of user requested values can vary depending on the specific application intended for the photonic TPE. However, because the proposed encoding and decoding scheme takes advantage of direct value mapping, the range of digital values to which the analog signals are mapped is arbitrary. In addition, the photonic TPE also supports multi-bit precision during operation, and the switch between different bit-precisions only requires a system re-calibration. Therefore, the photonic TPE can be flexible with the value mapping and bit-precision during the encoding and decoding process. For example, if the software requires high computational accuracy but relatively small numerical ranges for input and output values, then the photonic TPE can use lower bit-precision for faster re-calibration, and fit the smaller numerical ranges with better computational accuracy. Here, computational accuracy is defined as the difference between measured and expected outputs. For lower bit-precision, each digital output value can have a larger analog output range, which can greatly reduce inaccuracy due to any kind of signal fluctuations or system instabilities. On the other hand, if the software requires larger numerical input and output ranges but has higher tolerance on accuracy, the photonic TPE can instead incorporate a higher bit-precision encoding scheme to cover more values in the large numerical ranges. In this case, we can fit more digital values within the same overall analog output range at the expense of reducing the analog step size for individual digital output values.


The choice of higher or lower bit-precision is largely dependent on the actual application; thus, the control system will require user specification to configure the bit-precision used in actual multiplication tasks.
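The trade-off between bit-precision and analog step size can be made concrete with a small numerical sketch. It assumes a normalized analog range of [0, 1], linearly spaced levels, a signed range that is symmetric about zero, and nearest-level decoding. The nearest-level decoder, the helper names, and the noise value are illustrative assumptions; the scheme in this paper decodes outputs through a regression-based output mapping instead.

```python
import numpy as np


def make_levels(n_bits, analog_min=0.0, analog_max=1.0):
    """Linearly spaced analog levels for a signed n-bit direct value mapping.

    Digital values span [-(2**(n_bits - 1) - 1), 2**(n_bits - 1) - 1].
    """
    top = 2 ** (n_bits - 1) - 1
    digital = np.arange(-top, top + 1)
    analog = np.linspace(analog_min, analog_max, digital.size)
    return digital, analog


def encode(value, digital, analog):
    """Map a digital value to its analog level."""
    idx = int(value) - int(digital[0])
    if not 0 <= idx < digital.size:
        raise ValueError("value outside the representable range")
    return float(analog[idx])


def decode(measured, digital, analog):
    """ASSUMED decoder: pick the digital value whose level is nearest."""
    return int(digital[np.argmin(np.abs(analog - measured))])


def step_size(analog):
    """Analog spacing between neighbouring digital values."""
    return float(analog[1] - analog[0])


d4, a4 = make_levels(4)   # values -7..7, 15 levels
d6, a6 = make_levels(6)   # values -31..31, 63 levels
noise = 0.4 * step_size(a4)  # hypothetical fluctuation, below half a 4-bit step

print(step_size(a4), step_size(a6))
print(decode(encode(5, d4, a4) + noise, d4, a4))  # still decodes to 5
print(decode(encode(5, d6, a6) + noise, d6, a6))  # no longer decodes to 5
```

The same absolute fluctuation that is harmless at 4 bits exceeds half a step at 6 bits and changes the decoded value, which is the sense in which a lower bit-precision leaves more analog margin per digital value.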

B. Commutative Property

To guarantee the stability of the system, the multi-level encoding scheme also imposes strict operational rules on the inputs for both the MRRs and the lasers. The input encoding for both sides uses the same direct value mapping between digital and analog values, but the underlying operating mechanisms are different. For the attenuators, different digital values are mapped to different optical power levels through different levels of attenuation, where small digital numbers correspond to large attenuation, and vice versa. Since non-linearity will occur at high attenuation, we can only operate within a relatively small range of attenuation. As a result, the input optical power will never go to zero. For the MRRs, the digital values are mapped to the applied heating current values, which shift the resonance of the MRRs. The mismatch between the MRR resonance and the laser wavelength determines how the incoming optical power is distributed between DROP and THRU ports, but the total output power will equal the total input power in the ideal lossless case. Because loss is present in a real-world scenario, a higher laser power is more beneficial for a better performance of the photonic TPE. In addition, the heating current range chosen for the MRR will center around the “zero” point where the output power is evenly distributed between DROP and THRU. This means that the output power range is also centered around the zero point, and only spans a limited range on both sides of the zero point. Therefore, the input mapping can only encode numbers to a non-zero optical power range, whereas the weight mapping encodes numbers that center around zero optical power. As a result, the same numbers going through the attenuator will produce a different optical output than those going through the MRR, and the range of available optical outputs is different for the two. Therefore, multiplication of numbers from both sides does not commute, i.e., a × b does not equal b × a. To circumvent this problem, the multi-level encoding scheme will force the larger number through the lasers when multiplying two numbers with the photonic TPE, since higher input power for the MRR will give better output resolution.

C. Negative Number Encoding

Aside from the non-commutative operation rule mentioned above, we also implement another restriction on the sign of the multiplication. Since only the MRR can encode both positive and negative numbers using the left and right of the “zero” point in its output power but the attenuator can only encode positive numbers, any negative number we encounter will be sent to the MRR automatically. In case of two negative numbers during multiplication, both negative signs will be dropped automatically since that is equivalent to two positive number multiplication.

… separate negative and positive multiplication completely within our control system for the TPEs, and dedicate one photonic TPE to process either all positive/negative multiplications, or mixed positive/negative multiplications. For either TPE, negative signs will be dropped everywhere during multiplication, and the control system will take outputs from the one processing mixed positive/negative multiplications as negative values automatically.

IV. EXPERIMENTAL DEMONSTRATION

Here, we implement an 11-bit signed system with 6-bit signed inputs for our proof-of-concept demonstration. First we perform the calibration stage as mentioned above, including creating an input mapping, an MRR profile, and performing a reflection point search. The input mapping uses an attenuation range between 2 dB and 8 dB for mapping 2^5 positive input digital numbers to their corresponding, linearly spaced, analog optical power levels. From the reflection point search we determine that a heating current of 0.48 mA to the MRR would produce a zero output power calculated from P_DROP − P_THRU. Combining this with the MRR profile, which gives us the heating current range that produces a linear output power level, the weight mapping is finished with a heating current range of [0.37, 0.59] mA that fits 2^6 signed digital numbers.

Fig. 6. Output mapping for an 11-bit signed system with 6-bit signed inputs.

Next, the output mapping is constructed by sweeping both inputs and weights across all possible values using both the input mapping and the weight mapping. All possible input/weight combinations include 2^5 × 2^6 = 2048 pairs, but only a subset of combinations that meet the aforementioned commutative property is selected. The input number range is chosen to be [0, 31], and the weight number range is [−31, 31]. The choice of values inside the matrices is based on the selected precision for the system, which is a 6-bit signed integer system as an example. This range is only a digital representation of the measured analog values, and the example demonstrates how the matrix dot product will work based on an arbitrary value range selection. However, this value range selection can be any numerical range that centers around zero depending on the application, and in many situations, the common choice will
An alternative solution to encode negative numbers in our be the normalized range of [−1, 1]. We collect the experimental
photonic TPE is to have another photonic TPE with the exact results as shown in Fig. 6. Here, the expected output is calculated
same configuration running in parallel. This will allow us to by multiplying the input number with weight number directly


inside the control computer. The measured output is converted from the measured optical output power, $P_{\mathrm{DROP}} - P_{\mathrm{THRU}}$, to the desired output number range via a re-scaling. The re-scaling of experimental data first performs a linear regression using both the measured output power and the expected output values, and then it compares the slope and intercept of experimental data to a theoretical slope of 1.0 and intercept of 0.0. After the re-scaling, the measured experimental data is converted to measured output values that are in the same range as the expected outputs.
Having fully characterized the photonic TPE that includes an MRR and an attenuator, we now incorporate the sign rule and include full positive and negative numbers for both the input and weight. The result of full 6-bit signed multiplication is shown in Fig. 7. Here, both input and weight go from [−31, 31], and the experimentally measured output is shown as colored contour maps on the two-dimensional grid of weight versus input. The measured output values range between (−1000, 1000), and the standard deviation calculated from the measured outputs is $9.34 \times 10^{-6}$.

Fig. 7. Multiplication results with full range of 6-bit signed inputs and weights with the implementation of the above-mentioned operational rules. Here the inputs range between −31 and +31, and the weights have the same range. Different colors in the colorbar represent the product of a weight value and an input value, with purple representing the smallest and yellow representing the largest.

The precision adjustment can be easily made at the generation of the direct mapping stage during calibration. The calibration starts with weight and reflection sweeps, which will estimate the usable heating current range for the MRRs. The next step is to decide how many analog levels we need to represent all digital values up to the chosen precision. We demonstrated an 11-bit signed system with half-precision for weights and inputs. However, the system can be easily adapted to lower precision levels, such as 8-bit signed precision with half-precision for the weights and inputs. In this case, we only need to redo the direct mapping for the weights and inputs to accommodate fewer analog levels.

V. GEMM COMPILER AND SCALABILITY

Having demonstrated the functionality and performance of a single photonic TPE, we will focus on scaling up our system to accommodate higher computing capacity and throughput. Scaling up a processing element architecture generally involves two approaches: hardware scaling and software scaling.

A. GeMM Compiler for Photonic TPE

First, we demonstrate our solution for software scaling. One common software up-scaling approach for tensor processors is the use of a General Matrix Multiplication (GeMM) compiler [28], which helps software map different matrix operations to specific hardware architectures to optimize structure utilization and computation efficiency. As the modern data science industry continues to develop and the computation volume and complexity increase, most data-focused compute hardware finds it helpful to implement a dedicated compiler to efficiently perform sophisticated matrix multiplications. There are many different designs for GeMM compilers depending on their targeted hardware platforms [29], [30], [31], but the basic operating rule for any GeMM compiler focuses on the most prevalent matrix multiplication, the matrix dot product, and its mathematical form can be expressed as (5),

Y = αW · X + βZ. (5)

Here, W, X and Z are input matrices, both α and β are scaling constants, and Y is the output matrix. The math is simple, but the main focus of GeMM compilers is mapping the mathematical expression to the topology of different hardware platforms. Because the sizes of the matrices from data-focused tasks often exceed the physical sizes of the actual compute hardware, GeMM compilers need to first break down these large matrices into smaller matrices or vectors. How the matrices are broken down depends on the core/thread count of the actual hardware, and the overall task of matrix multiplication will be done in multiple batches. Once the matrices are divided, GeMM compilers need to send specific values from the current data batch to the compute units used by the task. After one iteration of computation is finished, GeMM compilers will then collect all the results and send out the next batch. As an example, we have a simple matrix dot product between two matrices W and X as shown in Fig. 8.

Fig. 8. Matrix dot product using a GeMM compiler.

We divided matrix W into four batches each containing four elements, and matrix X into two batches each containing six elements. The number of compute units used for this task will be six, matching the number of elements in the largest data batch. The GeMM compiler will first send out the first data batches from both matrix W, $W_{11}$, and matrix X, $X_{11}$, to all the compute units to calculate the dot product $W_{11} \times X_{11}$. For the second round of operation, the GeMM compiler will send out $W_{12}$ and $X_{21}$ instead, and the same procedure is repeated for


the third and fourth iterations. Once all input data batches are cycled through all compute units and the results are collected, the GeMM compiler then sends out the results in two batches to calculate the elements in the output matrix Y. Having calculated all the elements, the GeMM compiler then reconstructs matrix Y with all the results and sends it back to the user.
Because of this divide and conquer technique, GeMM compilers have enabled many modern tensor processors to achieve a compute capability far beyond their physical topology limit with a high efficiency. Examples of this include NVIDIA’s Tensor Core [6] and Google’s TPU [7]. Therefore, our software scaling solution can take full advantage of the GeMM compiler’s promises and enable high-volume computation on a small but efficient physical architecture.
To demonstrate this idea, we implement a simple GeMM compiler that can break down large matrices and schedule computation tasks among different MRRs of our photonic tensor processor. First, we perform a matrix dot product between two matrices each of size 128 × 128 using a photonic TPE consisting of five MRRs, and all matrix elements are randomly generated and are encoded using 6-bit signed precision. Since the input matrices are larger than the size of the photonic TPE, the matrices are broken down into many vectors each containing five elements during the many iterations of computation. The output matrix is shown in Fig. 9, where each pixel on the matrix plot represents the computational accuracy of each output matrix element from this single trial, presented in different colors shown on the scale on the right. Here the accuracy of each element in the output matrix is calculated as follows:

$$\mathrm{Accuracy} = 1 - \frac{\mathrm{Measured} - \mathrm{Target}}{\mathrm{Measured}}. \quad (6)$$

As shown here, this single trial achieved high accuracy for the majority of the matrix elements, with only a few exceptions represented as the red pixels. The average accuracy across all matrix elements from this trial is 99.71%, with a standard deviation of 0.0285. However, we notice a few elements in the output matrix that have an accuracy value of zero. This is likely due to the fact that the expected outputs for those elements are digital zeros, but our photonic TPE measures non-zero analog values at those points. Such behavior indicates that there is noticeable noise in our photonic circuits, and optimizing our control system to account for such noise will require further work.

Fig. 9. Accuracy of a matrix dot product computation between two matrices both of size 128 × 128. Here, the color grid on the left shows the computation accuracy of each element in the output matrix, where white means 100% accurate and red means the output accuracy is zero.

B. System Scalability

Having validated the functionality of the GeMM compiler, we verify the performance of our photonic tensor processor using the multi-level encoding scheme, together with the GeMM compiler. In the first test we vary the sizes of input matrices from 64 × 64 to 1024 × 1024 while maintaining the same bit-precision for all the matrix elements as signed 6-bit. We run the matrix dot product computation with randomly generated elements for each matrix size 100 times, and collect the average accuracy of each output matrix into a histogram as shown in Fig. 10. As shown here, the spread of average computation accuracy tightens as the matrix size increases, indicating that the multi-level encoding scheme and our photonic tensor processor see a performance improvement over larger matrices.

Fig. 10. Distribution of the matrix dot product accuracy collected from trials with different matrix sizes, each including 100 computations using randomly generated matrices. Here, we simulated four different matrix sizes, including (a) 64 × 64, (b) 128 × 128, (c) 512 × 512, and (d) 1024 × 1024. We used 6-bit signed inputs and weights for all the computations.

A second test is designed to investigate the performance change as a result of changing the bit-precision of matrix elements. We start the test using only 3-bit signed values for all matrix elements, and increase to the original 6-bit signed precision. The sizes of matrices used in this test are the same 128 × 128 for all 100 randomly generated computations, and the results from the second test are shown in Fig. 11. Here, trials using only 3-bit signed precision generate the largest spread of average accuracy, whereas 4-bit signed precision and above produce comparable results. The results show that the multi-level encoding scheme exhibits the best performance when using higher bit-precision, but the accuracy slightly decreases for lower bit-precision.
After performing the two performance tests using different parameters, we condense the histograms shown in Fig. 10 and Fig. 11 by calculating the overall average accuracy and its standard deviation for every 100 trials with a specific parameter as shown in Fig. 12. Here, Fig. 12(a) shows the overall accuracy


from the first test using four different matrix sizes, from 64 × 64 to 1024 × 1024. We see a clear upward trend in the overall average accuracy as aforementioned, together with a decreasing trend for the standard deviation from the overall average accuracy calculation. For the second test, the overall average accuracy also increases with higher bit-precision, but we also see a small increase in standard deviation from the calculation as shown in Fig. 12(b). Because the change in standard deviation in Fig. 12(b) is one order of magnitude smaller than that in Fig. 12(a) and the average accuracy is similar for trials using more than 3-bit precision, the small increase in standard deviation can be a random result since all matrices are randomly generated in all trials. The improved average accuracy in both performance tests is likely a result of larger sample size when running randomized trials. Randomized trials require greater sample sizes to better achieve the ideal normal distribution of test samples, and as the matrix sizes increase these test matrices include more randomized values, which contributes to a better test sample distribution. As the test sample distribution approaches the ideal normal distribution, the accuracy results obtained from such random trials can faithfully represent the true accuracy achievable on our photonic TPE. Therefore, the improved accuracy over larger test matrix sizes indicates that our system can achieve above 99.5% accuracy.

Fig. 11. Distribution of the matrix dot product accuracy collected from trials with different bit-precisions, each including 100 computations using randomly generated matrices. Here, we simulate four different bit-precisions, including (a) 3 bits, (b) 4 bits, (c) 5 bits, and (d) 6 bits to encode input values. We used a matrix size of 128 × 128 for all computations.

Fig. 12. (a) Average accuracy and its standard deviation calculated from the trials with different matrix sizes, as shown in Fig. 10. (b) Average accuracy and its standard deviation calculated from the trials with different bit precisions, as shown in Fig. 11.

The standard deviation in Fig. 12(b) represents the accuracy value fluctuation over multiple repeated trials; the increase in standard deviation results from the noise within our analog system. At higher precision levels, the same level of analog noise will be more likely to cause a misrepresentation of each digital value. Lower precision requires fewer analog levels to include all the digital values, whereas higher precision will require more analog levels within the same analog range. As a result, the system is more susceptible to noise at higher precision. We have observed a larger fluctuation in average accuracy values over multiple tests, leading to a slight increase in the standard deviation value.
Aside from software scaling, hardware scaling is also a crucial part in boosting the computation capacity of our photonic tensor processor. The hardware architecture for a single photonic TPE is shown in Fig. 1, which contains an array of five MRRs sharing both a common THRU connection and a common DROP connection. The photonic TPE is capable of performing five multiplications simultaneously using five sets of inputs through the same bus waveguide, with each set encoding one number through the attenuator as the “input” and the other through the source meters as the “weight”. Thus, the single photonic TPE can compute a dot product between two vectors each with five elements within a single iteration. However, this is only one single photonic TPE, and its architecture can be easily duplicated on chip. In addition, because different copies of the same photonic TPE have their own bus waveguide for inputs, the same laser sources can be used in a multiplexer/splitter fashion to provide the same copies of all signal carriers for all the photonic TPEs. The multiplexer is implemented using a WDM multiplexer that combines all the individual laser sources from separate waveguides, and the splitter evenly distributes the combined signal among all photonic TPEs.
On the other hand, most hardware scaling solutions will benefit from a higher level of integration for lower latency and higher compute throughput. In our current design for the photonic TPE, the input mapping still relies on external attenuators to encode different input values as different optical power levels. However, the same effect can be achieved by using the THRU port output of an on-chip MRR. By tuning the MRR on and off resonance, the THRU port output will carry different output power depending on the wavelength mismatch between the optical signal and the MRR. Therefore, by replacing the attenuators with on-chip MRRs for input encoding, the control mechanism can be applied to both the input encoding MRRs and the multiplication MRRs. Additionally, the balanced photodetectors can also be integrated on chip, and will only require a bias voltage from the external source meters. The output of the balanced photodetectors is in the form of different current levels, which can also be monitored through the same sourcing and measurement units. Thus, both information encoding and decoding will be uniformly implemented through the external source meters for both inputs and weights.
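
The tiling performed by the GeMM compiler in Section V-A, the five-element dot product of a single TPE, and the accuracy metric of Eq. (6) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the compiler used in this work: the names `tpe_dot`, `tiled_matmul` and `accuracy` are hypothetical, the TPE is modelled as an ideal, noise-free dot product of at most five elements, and the scaling constants of Eq. (5) are taken as α = 1 and β = 0.

```python
TPE_SIZE = 5  # assumed number of MRRs, i.e. vector elements per TPE iteration


def tpe_dot(weights, inputs):
    """Idealised stand-in for one TPE iteration (no noise, no quantisation)."""
    assert len(weights) == len(inputs) <= TPE_SIZE
    return sum(w * x for w, x in zip(weights, inputs))


def tiled_matmul(W, X):
    """Compute Y = W . X by cutting every row/column pair into TPE_SIZE chunks."""
    rows, inner, cols = len(W), len(X), len(X[0])
    Y = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            column = [X[k][j] for k in range(inner)]
            # Each chunk corresponds to one batch sent to the TPE.
            for start in range(0, inner, TPE_SIZE):
                stop = start + TPE_SIZE
                Y[i][j] += tpe_dot(W[i][start:stop], column[start:stop])
    return Y


def accuracy(measured, target):
    """Eq. (6): 1 - (measured - target) / measured; assumes measured != 0."""
    return 1 - (measured - target) / measured
```

With an ideal TPE the tiled result equals the direct matrix product, so any accuracy loss in practice comes from the analog hardware rather than from the tiling itself. Note that Eq. (6) evaluates to zero whenever the target is zero and the measured value is not, which matches the zero-accuracy pixels discussed for Fig. 9.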


To illustrate this hardware scaling idea, we consider the following design for a scaled-up version of the photonic tensor processor containing four TPEs, as shown in Fig. 13. Each of the photonic TPEs contains four MRRs with only THRU ports for input encoding, and another four MRRs with both DROP and THRU connections for multiplication. Four laser sources are used to provide carrier signals for each of the input and multiplication MRRs in every TPE. Since each TPE has its own separate bus waveguide, only four lasers will be required to drive all four TPEs. The summation of all photonic channels within each TPE is done by the integrated balanced photodetectors on chip. Both information encoding and decoding will rely on source meters to provide either a heating current or a bias voltage, and to measure the photocurrent output from the balanced photodetectors.

Fig. 13. Scaled-up design of a photonic tensor processor containing four photonic TPEs, each capable of performing a vector dot product with four elements simultaneously. All photonic TPEs in this design integrate input mapping on-chip through the input encoding MRRs, and also retain the same MRR weight bank design as shown before.

The large matrix decomposition and dot product are done in simulation using the experimental data collected from the single MRR experiment. However, we have also performed testing using multiple MRRs and tried to qualitatively observe the effects of thermal crosstalk between neighboring MRRs. Our preliminary results showed that the magnitude of thermal crosstalk mainly depends on the MRR spacing. With a spacing of around 150 µm, the crosstalk effect becomes insignificant relative to other noise sources within our system. The most effective way to minimize thermal crosstalk in our system is to create larger spacing between MRRs; however, this will reduce the compute density of our device. The other solution will require extensive calibration to be performed simultaneously across all active MRRs and sophisticated monitoring procedures during operation. As a result, the reduced compute speed due to the added calibration and monitoring steps will also hamper the compute density.
On the other hand, the best solution to thermal crosstalk would be to eliminate thermal tuning and implement the carrier-depletion effect. The carrier-depletion effect not only offers a high tuning speed that can enable fast weight updates but also generates significantly less heat, allowing for a more compact photonic TPE design to achieve higher compute density.

C. Estimated Energy Consumption

Our photonic TPE design implements a multi-wavelength approach that uses multiple MRRs. As a demonstration we showed the performance and the multi-level encoding/decoding scheme for a single channel TPE, but for multi-channel TPE designs, each MRR will strictly operate on a separate wavelength. First we only consider the tuning power during the operation, which will be the major energy consumption during a photonic MAC operation. For a small photonic TPE, we assume it has a size of $n \times n$, giving us a total of $n^2$ MRRs and $n$ balanced photodetectors for photonic MAC operations. The total number of MACs per cycle for this small photonic TPE will be $n^2$ MACs, and the total energy consumption per cycle will be $n^2 \cdot E_{\mathrm{MAC}}$ if we assume each MAC consumes an energy of $E_{\mathrm{MAC}}$. Now we consider a scaled-up photonic TPE that is $x$ times larger in both dimensions compared to the small photonic TPE, giving us a size of $xn \times xn = x^2 \cdot n^2$. Under the same assumption for energy consumption, we arrive at the total energy per cycle for the larger photonic TPE as $x^2 \cdot (n^2 E_{\mathrm{MAC}})$. If we want to compute a matrix multiplication between two matrices, both of size $m \times l$ with $m \geq xn$ and $l \geq xn$, then it will take $ml$ cycles to complete the operation on the small TPE, whereas it only takes $\frac{ml}{x^2}$ cycles on the scaled-up TPE. If we compute the total energy consumed by tuning the MRRs during the operation, the total energy will be identical since the total workload remains the same.
Next, we will consider the I/O energy consumption in our system. For the larger scale system concept shown in Fig. 13, we chose to use one set of input encoding MRRs for each TPE. As a result, the actual tuning power on the input encoding process will scale linearly with the input matrix size, which is on the same order of magnitude scaling as compared to the photonic MAC. However, we only implement one set of balanced photodetectors for accumulating all the computed results. Therefore, the energy consumption scaling for the output optoelectrical conversion will be sublinear compared to the input and weight matrix sizes. If we break down the input and output energy consumption for a single photonic MAC, the input energy consumption per MAC will not decrease for larger photonic TPEs, but the output energy consumption per MAC will decrease significantly. A recent investigation by Al-Qadasi et al. [32] has also quantified this estimation, where for a thermally tuned MRR-based photonic TPE, the energy per MAC was calculated to be around 1.2 pJ/OP for a network size of 15 MRRs. The energy per MAC will decrease to around 1 pJ/OP for a larger network size of 85 MRRs. However, if we can improve the thermal tuning design by adding insulators to the heaters inside the MRRs, the energy per MAC can drop significantly down to around 0.3 pJ/OP for the smaller network size of 15 MRRs, and down to less than 0.1 pJ/OP for the larger network size of 85 MRRs.
Therefore, the total energy consumption will be slightly less for the large TPE to complete the same workload, as the output optoelectrical conversion will be more efficient. However, the small improvement in energy per MAC is outmatched by improved designs of the heaters, and as a result the overall energy consumption for both the small and the large systems is similar. The main difference between operating the small


and the large TPEs comes from the number of cycles required and how the energy consumption is distributed over time. Here, the larger photonic TPE will require more energy within a short time window, whereas the small TPE uses much less energy per cycle and spreads out the energy consumption over longer periods of time.

VI. SPEED AND PRECISION ANALYSIS

The MRRs shown in Fig. 1 implemented thermal tuning using N-doped heaters, and our photodetectors were SiGe-based PIN junction photodetectors. The tuning speed of N-doped heaters is up to millisecond scale, which can be a limitation in certain scenarios and applications. However, in many other deep learning applications the update rate of the weights can be much slower, especially during inference or convolutions. During inference, the photonic TPE will be loaded with pre-trained weights. Thus, the photonic TPE can perform MAC operations at the speed limit of the photodetectors, which is shown to be 56 GHz for an avalanche photodetector [33] and 67 GHz for a PIN photodetector [34]. In the case of convolutions as demonstrated by Feldman et al. [22], the convolution filters only require a slower update rate compared to the inputs. Therefore, the relatively slow tuning speed of the weights inside the photonic TPE can satisfy a high-speed MAC operation for inference or convolution.
However, in the case of matrix tiling the photonic TPE computation speed will also be limited by the weight updates. Because we are using a wavelength-multiplexed approach, the speed bottleneck for our system during a matrix tiling process will be affected by three factors: weight update speed, detection speed, and the physical size of our photonic TPE. The weight update speed determines how fast the photonic TPE can be updated for the next batch of weights. The detection speed determines how fast the TPE can process all batches of inputs before adjusting the weights. The size of the photonic TPE will affect the number of input batches per weight batch. In a wavelength-multiplexed setup where a single balanced photodetector is paired with multiple MRRs, we can increase the number of MRRs as long as their resonances can all fit within their free spectral ranges. With more MRRs in the weight bank, large matrices require less tiling to finish the computation. Also, we can implement data batching, and the weights inside the photonic TPE will not be updated until all the inputs have been processed through the TPE. Therefore, larger photonic TPEs will go through all the inputs using fewer cycles and require more frequent weight updates when compared to smaller ones. As a result, smaller TPEs rely more on the speed of the photodetector to process all the inputs, but larger TPEs rely more on the weight update speed of the MRRs once all the inputs are processed.
Given these three factors that bottleneck the speed of our photonic TPE, there will be an optimal photonic TPE size that balances the latency between weight updates and photodetection. Currently, we are using the thermo-optic effect in our system, which operates on a millisecond time scale, and the photodetectors in our system have been verified to achieve 10 GHz. Assuming that the thermally tuned MRRs take 1 ms to update, the optimal size for our photonic TPE can have no more than a single MRR per pair of balanced photodetectors for a matrix with a size of 1024 × 1024. In this scenario, the speed of the photonic TPE is bottlenecked by the weight update speed because the photodetector is $10^7$ times faster than the thermally tuned MRRs. However, suppose we were to use the carrier-depletion effect to modulate the optimal photonic weights at up to 56 GHz and use a fast PIN photodetector that operates at 67 GHz. In that case, the optimal TPE size will be around 850 MRRs. In conclusion, thermally tuning the MRR will create a speed bottleneck from weight updates during matrix tiling, but narrowing the speed gap between weight modulation and photodetection will require larger photonic TPE sizes to take full advantage of that fast weight update capability.
A recent analysis of signal resolution in silicon photonic neural networks by Tait [35] summarizes the relation between laser pump power, signal frequency, and bit precision. In the middle of all three terms are different dominating types of noise in different operating regimes of the silicon photonic system. Our silicon photonic system implements an O/E/O operating regime, where the first part is the optical signal from a tunable laser, and then optical weighting uses MRR weight banks with thermal tuning. After the weight bank is the optoelectrical conversion by the balanced photodetector. For such a photonic circuit, there are three major noise regimes that affect the interaction between laser pump power, signal frequency, and bit precision: the thermal regime, the shot regime, and the relative intensity noise (RIN) regime. In the thermal regime, the dominant noise is known as Johnson-Nyquist noise, which comes from the random movement of electrons within the photodetector. Here, the noise equivalent power increases exponentially with higher bit precision, and the relation between laser pump power $P_l^{\mathrm{therm}}$, signal frequency $f$, and bit precision $B$ in the thermal regime can be written as:

$$P_l^{\mathrm{therm}}(f, B) = f \cdot \frac{J^*(B)}{\eta_{\mathrm{net}}}, \qquad J^*(B) \propto 2^{\frac{3}{2}B}. \quad (7)$$

Here, $\eta_{\mathrm{net}}$ is the transmission efficiency of our photonic circuit, and $J^*$ represents the Johnson-Nyquist noise at the given precision $B$. During the operation of the MRR weight bank inside our photonic TPE, the input laser pump power will remain a constant value. As is shown here, there is a trade-off between signal frequency and the bit precision of our system at a given laser pump power level. Thus, higher frequency operations will require lower bit precision to maintain system stability.
During the optoelectrical conversion, photon shot noise will be the dominant noise, and this is called the shot noise regime. Shot noise comes from the randomness in photon detection, and in this regime we still have the same relation between laser pump power $P_l^{\mathrm{shot}}$, signal frequency $f$, and bit precision $B$ as is shown in the thermal regime. Here we have:

$$P_l^{\mathrm{shot}}(f, B) = f \cdot \frac{E_{\mathrm{shot}}(B)}{\eta_{\mathrm{net}}}, \qquad E_{\mathrm{shot}}(B) \propto 2^{3B}. \quad (8)$$

As shown here, the same trade-off between signal frequency and bit precision still remains. In addition to thermal and shot noise regimes, the carrier laser power output also has random changes that create relative intensity noise. In the RIN regime, the noise is independent of laser pump power, but the frequency-precision relation gives us the maximum signal frequency that


can be obtained at a certain bit precision:

$$f \le F_{\mathrm{RIN}}(B), \qquad F_{\mathrm{RIN}}(B) \propto 2^{-3B}. \tag{9}$$

Fig. 14. The trade-off between operating frequency and precision under a constant laser pump power of 10 mW and a round trip loss of 15 dB. The transmission efficiency was taken to be 0.1, and we consider both the thermal regime and the shot noise regime. For the thermal regime, we consider the situation where we are using less than the full designed bandwidth for our single channel. For the shot noise regime, we consider the shot noise amplitude in a typical analog photonic system, which will be larger than the noise amplitude in an ideal system nearing its physical limit.

Fig. 15. The frequency-precision trade-off comparison between multiple photonic systems with different components. In this paper, the photonic TPE design implemented grating couplers (GC), MRRs with N-doped heaters (N-doped MRR), and PIN junction photodetectors (PIN PD). However, recent work has shown that we can replace these components with ones that have higher efficiency and speed, like photonic wirebonds (PWB), PIN junction modulators (PIN mod), and avalanche photodetectors (APD). This plot gives an estimate of how the frequency-precision trade-off would compare with our current design, and we will be implementing these more advanced components in our future designs.
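The scaling relations in (8) and (9) can be explored numerically with a minimal Python sketch. Because (8) and (9) only fix how the quantities scale with B, the prefactors `E0_SHOT` and `F0_RIN` below are hypothetical placeholder values rather than measured device parameters, and all function names are illustrative. The one result that does not depend on those placeholders is the slope of the trade-off: under a 2^(3B) scaling, every tenfold increase in signal frequency costs log2(10)/3, or roughly 1.1 bits, of precision at fixed pump power.

```python
import math

# Hypothetical prefactors: (8) and (9) only give proportionalities, so these
# values are placeholders chosen for illustration, not measured quantities.
E0_SHOT = 1e-18  # J, assumed prefactor in E_shot(B) = E0_SHOT * 2**(3B)
F0_RIN = 1e15    # Hz, assumed prefactor in F_RIN(B) = F0_RIN * 2**(-3B)


def shot_pump_power(f_hz, bits, eta_net):
    """Pump power implied by (8) for signal frequency f_hz and precision bits."""
    return f_hz * E0_SHOT * 2.0 ** (3.0 * bits) / eta_net


def shot_max_bits(p_watt, f_hz, eta_net):
    """Invert (8): precision supported by pump power p_watt at frequency f_hz."""
    return math.log2(p_watt * eta_net / (f_hz * E0_SHOT)) / 3.0


def rin_max_bits(f_hz):
    """Invert (9): largest precision B that still satisfies f_hz <= F_RIN(B)."""
    return math.log2(F0_RIN / f_hz) / 3.0


def bits_lost_per_decade():
    """Precision given up per tenfold increase in frequency for 2**(3B) scaling."""
    return math.log2(10.0) / 3.0


if __name__ == "__main__":
    # Illustrative operating point only: 10 mW pump power, efficiency of 0.1.
    p_watt, eta_net = 10e-3, 0.1
    for f_hz in (1e6, 1e7, 1e8, 1e9, 1e10):
        print(f"{f_hz:8.0e} Hz  shot: {shot_max_bits(p_watt, f_hz, eta_net):5.2f} bits"
              f"  RIN: {rin_max_bits(f_hz):5.2f} bits")
```

The absolute bit values printed by this sketch carry no physical meaning, since they follow directly from the assumed prefactors; only the differences between rows reflect the scaling stated in (8) and (9).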


Therefore, the signal frequency has a device-specific upper limit at any given bit precision, and the upper limit is independent of laser pump power.

To demonstrate the aforementioned trade-off between operating frequency and precision under a constant laser pump power of 10 mW and with a 15 dB round trip loss, we plot the operating frequency and the expected precision in both thermal and shot noise regimes, as shown in Fig. 14. Here, we chose a transmission efficiency of 0.1 for our analog photonic circuit, and we included a wide range of frequency values from 1 Hz to 10 GHz. As shown in Fig. 14, the trade-off between operating frequency and precision is well pronounced, and the thermal regime sets the upper limit on our system precision across all frequency values.

In this paper we demonstrate a single channel system, but we can also scale up our photonic TPE by adding more MRRs and more rails to perform more operations simultaneously. Similar structures for this scale-up idea can be found in the paper by Bangari et al. [11]. However, if we were to scale up our photonic TPE to include multiple channels, then we would need to include noise added by fan-in and fan-out effects. These effects can be categorized into three cases: the singular case (with only one non-zero input), the uncorrelated case (all inputs are independent), and the identical case (all inputs are the same). More specifically, the correlation of inputs affects fan-in gain and therefore signal-to-noise ratio (SNR) [35]. When considering fan-out loss in a multi-channel system, signal root mean square (RMS) value and SNR decrease proportionally. However, accounting for fan-in gain, SNR only increases sub-linearly, which results in an overall decrease of signal RMS with more channels. Therefore, laser input power needs to increase sub-linearly to maintain the same level of SNR with more channels. As a result, with the same optical input power and precision level, we should expect the max frequency to drop with more channels due to the additional MRRs, and the fan-in and fan-out effects on different types of noises. In addition, extra MRRs on the bus waveguide will introduce an insertion loss to all signals, but this loss is only measured at around 0.01 dB per MRR when it is off resonance [36].

On the other hand, we have been using grating couplers for optical coupling on our chips and the N-doped heaters on the MRRs for thermally tuning the weight bank. As previously mentioned, grating couplers have a high insertion loss at around 15 dB, but recent work has already shown that photonic wirebonds can be implemented reliably for much more efficient on/off chip coupling. The round trip loss of a photonic device using photonic wirebonds can be as low as 2 dB [37], which greatly increases the available precision at any given frequency. As shown in Fig. 15, by replacing grating couplers with photonic wirebonds, we can achieve up to a 3-bit improvement on the precision across all frequencies. Simultaneously, we can replace the N-doped MRRs with PIN junction modulators for higher speed modulation using the carrier depletion effect [17]. Moreover, our analog photonic circuit operates with low laser pump power to avoid nonlinearities during weighting. At low laser pump power, if we were to implement an avalanche photodetector with active avalanche gain in our current designs, we could further reduce the thermal noise inside the photodetectors. By replacing the PIN photodetectors with avalanche photodetectors, we can receive a further improvement of around 2 bits on the available precision across all frequencies.

VII. CONCLUSION

We have demonstrated the proposed multi-level encoding/decoding scheme for an MRR-based photonic TPE, and the experimental results have verified the feasibility of such
implementation. We have also noted some unique characteristics of the photonic TPE architecture, and by taking advantage of its flexibility we have refined and improved the details of the multi-level encoding scheme. We also combined the multi-level encoding scheme with a simple GeMM compiler, and explored the scalability of our photonic tensor processor. The results from larger scale matrix computations have verified that the proposed multi-level encoding scheme can achieve a high level of computational accuracy while providing up to 6-bit signed precision. Combining the multi-level encoding/decoding scheme with a GeMM compiler can serve as the operational foundation allowing us to explore larger-scale ML applications using MRR-based photonic tensor processors.

ACKNOWLEDGMENT

The authors thank Mohammed Al-Qadasi, Thomas Ferreira de Lima, and Jagmeet Singh for suggestions and experimental support.

REFERENCES

[1] S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based recommender system: A survey and new perspectives,” ACM Comput. Surv., vol. 52, no. 1, pp. 1–38, 2019.
[2] K. R. Bokka, S. Hora, T. Jain, and M. Wambugu, Deep Learning for Natural Language Processing. Birmingham, U.K.: Packt Publishing, 2019.
[3] L. Shao, H. P. H. Shum, and T. Hospedales, “Editorial: Special issue on machine vision with deep learning,” Int. J. Comput. Vis., vol. 128, no. 4, pp. 771–772, 2020.
[4] L. Abdi and A. Meddeb, “Driver information system: A combination of augmented reality, deep learning and vehicular ad-hoc networks,” Multimedia Tools Appl., vol. 77, no. 12, pp. 14673–14703, 2018.
[5] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” 2017, arXiv:1605.07678.
[6] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter, “Nvidia tensor core programmability, performance and precision,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2018, pp. 522–531.
[7] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. 44th Annu. Int. Symp. Comput. Architecture, ACM, 2017, pp. 1–12.
[8] Y. Shen et al., “Deep learning with coherent nanophotonic circuits,” Nature Photon., vol. 11, pp. 441–446, Jul. 2017.
[9] J. Feldmann et al., “Parallel convolutional processing using an integrated photonic tensor core,” Nature, vol. 589, pp. 52–58, Jan. 2021.
[10] M. Miscuglio and V. J. Sorger, “Photonic tensor cores for machine learning,” Appl. Phys. Rev., vol. 7, Sep. 2020, Art. no. 031404.
[11] V. Bangari et al., “Digital electronics and analog photonics for convolutional neural networks (DEAP-CNNS),” IEEE J. Sel. Topics Quantum Electron., vol. 26, no. 1, pp. 1–13, Jan./Feb. 2020.
[12] B. J. Shastri et al., “Photonics for artificial intelligence and neuromorphic computing,” Nature Photon., vol. 15, pp. 102–114, Jan. 2021.
[13] C. Huang et al., “Demonstration of scalable microring weight bank control for large-scale photonic integrated circuits,” APL Photon., vol. 5, no. 4, 2020, Art. no. 040803.
[14] W. Zhang et al., “Silicon microring synapses enable photonic deep learning beyond 9-bit precision,” Optica, vol. 9, pp. 579–584, 2022.
[15] P. Prucnal, B. Shastri, and M. Teich, Neuromorphic Photonics. Boca Raton, FL, USA: CRC Press, Jan. 2017.
[16] A. N. Tait et al., “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep., vol. 7, no. 1, pp. 1–10, 2017.
[17] A. N. Tait et al., “Silicon photonic modulator neuron,” Phys. Rev. Appl., vol. 11, Jun. 2019, Art. no. 064043.
[18] B. A. Marquez et al., “Photonic pattern reconstruction enabled by on-chip online learning and inference,” J. Phys., Photon., vol. 3, Feb. 2021, Art. no. 024006.
[19] D. Liang et al., “Fully-integrated heterogeneous DML transmitters for high-performance computing,” J. Lightw. Technol., vol. 38, no. 13, pp. 3322–3337, Jul. 2020.
[20] A. N. Tait et al., “Feedback control for microring weight banks,” Opt. Exp., vol. 26, pp. 26422–26443, Oct. 2018.
[21] L.-W. Luo, G. S. Wiederhecker, K. Preston, and M. Lipson, “Power insensitive silicon microring resonators,” Opt. Lett., vol. 37, no. 4, pp. 590–592, 2012.
[22] J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature, vol. 569, pp. 208–214, May 2019.
[23] H. Jayatilleka et al., “Wavelength tuning and stabilization of microring-based filters using silicon in-resonator photoconductive heaters,” Opt. Exp., vol. 23, pp. 25084–25097, Sep. 2015.
[24] M. S. Hai, M. N. Sakib, and O. Liboiron-Ladouceur, “A 16 silicon-based monolithic balanced photodetector with on-chip capacitors for 25 front-end receivers,” Opt. Exp., vol. 21, pp. 32680–32689, Dec. 2013.
[25] L. F. Stokes, M. Chodorow, and H. J. Shaw, “All-single-mode fiber resonator,” Opt. Lett., vol. 7, pp. 288–290, Jun. 1982.
[26] J. E. Heebner, R. Grover, and T. A. Ibrahim, Optical Microresonators: Theory, Fabrication, and Applications, 1st ed., London, U.K.: Springer, 2008, doi: 10.1007/978-0-387-73068-4.
[27] W. Bogaerts et al., “Silicon microring resonators,” Laser Photon. Rev., vol. 6, no. 1, pp. 47–73, 2012.
[28] J. J. Dongarra, J. D. Croz, S. Hammarling, and I. S. Duff, “A set of level 3 basic linear algebra subprograms,” ACM Trans. Math. Softw., vol. 16, pp. 1–17, Mar. 1990.
[29] V. Kelefouras, A. Kritikakou, I. Mporas, and V. Kolonias, “A high-performance matrix–matrix multiplication methodology for and architectures,” J. Supercomputing, vol. 72, pp. 804–844, Mar. 2016.
[30] C. Jhurani and P. Mullowney, “A interface and implementation on Nvidia GPUs for multiple small matrices,” J. Parallel Distrib. Comput., vol. 75, pp. 133–140, 2015.
[31] S. A. Hassan, M. M. Mahmoud, A. Hemeida, and M. A. Saber, “Effective implementation of matrix–vector multiplication on Intel’s multicore processor,” Comput. Lang., Syst. Struct., vol. 51, pp. 158–175, 2018.
[32] M. A. Al-Qadasi, L. Chrostowski, B. J. Shastri, and S. Shekhar, “Scaling up silicon photonic-based accelerators: Challenges and opportunities,” APL Photon., vol. 7, 2022, Art. no. 020902, doi: 10.1063/5.0070992.
[33] M. Huang et al., “56 GHz waveguide Ge/Si avalanche photodiode,” in Proc. IEEE Opt. Fiber Commun. Conf., Optical Society of America, 2018, pp. 1–3.
[34] H. Chen et al., “100-Gbps RZ data reception in 67-GHz Si-contacted germanium waveguide p-i-n photodetectors,” J. Lightw. Technol., vol. 35, no. 4, pp. 722–726, Feb. 2017.
[35] A. N. Tait, “Quantifying power in silicon photonic neural networks,” Phys. Rev. Appl., vol. 17, May 2022, Art. no. 054029.
[36] A. N. Tait et al., “Microring weight banks,” IEEE J. Sel. Topics Quantum Electron., vol. 22, no. 6, pp. 312–325, Nov./Dec. 2016, Art. no. 5900214.
[37] N. Lindenmann et al., “Connecting silicon photonic circuits to multicore fibers by photonic wire bonding,” J. Lightw. Technol., vol. 33, no. 4, pp. 755–760, Feb. 2015.

Zhimu Guo received the B.A.Sc. degree in engineering physics and computing option and the M.A.Sc. degree from Queen’s University, Kingston, ON, Canada, where he is currently working toward the Ph.D. degree. His research focuses on the junction of the hardware and software for computer systems. He is also looking forward to exploring new technologies in the quantum computing realm, including integrated neuromorphic photonic processors for deep learning.

Alexander N. Tait (Member, IEEE) received the Ph.D. degree from Lightwave Communications Research Laboratory, Department of Electrical Engineering, Princeton University, Princeton, NJ, USA, under the direction of Paul Prucnal. He is currently an Assistant Professor of electrical and computer engineering with Queen’s University, Kingston, ON, Canada. He was an NRC Postdoctoral Fellow with the Quantum Nanophotonics and Faint Photonics Group, National Institute of Standards and Technology, Boulder, CO, USA.
Bicky A. Marquez (Member, IEEE) received the bachelor’s degree from the Central University of Venezuela, Caracas, Venezuela, in 2012, the master’s degree from the Venezuelan Institute for Scientific Research, Parroquia Macarao, Venezuela, in 2014, and the Ph.D. degree in optics and photonics from Bourgogne-Franche-Comté University, Besançon, France, in 2018, where she worked for Professor Laurent Larger. Her research interests include nonlinear and complex dynamical systems, machine learning, and AI photonic hardware. She likes to spend her free time traveling and painting/drawing.

Matthew Filipovich received the B.A.Sc. degree in engineering physics from Queen’s University, Kingston, ON, Canada, where he is currently working toward the master’s degree in engineering physics. His research interests include investigating different approaches for neural network training in situ using neuromorphic photonics, including designing a circuit for executing the direct feedback alignment training algorithm.

Hugh Morison received the B.A.Sc. degree in engineering physics with an option in computing from Queen’s University of Technology, Kingston, ON, Canada, where he is currently working toward the graduate degree researching neuromorphic silicon photonic systems. He joined the Shastri Lab with Queen’s after the B.A.Sc. degree. His research interests include novel computing systems, artificial intelligence, and experimental demonstrations of silicon photonic neural networks (ANN tasks and network dynamics).

Paul R. Prucnal (Life Fellow, IEEE) received the A.B. degree (graduating summa cum laude) in mathematics and physics from Bowdoin College, Brunswick, ME, USA, and the M.S., M.Phil., and Ph.D. degrees in electrical engineering from Columbia University, New York, NY, USA. After his Doctorate, he joined the faculty with Columbia University, where, as a Member of the Columbia Radiation Laboratory, he performed groundbreaking work in OCDMA and self-routed photonic switching. In 1988, he joined the Faculty with Princeton University, Princeton, NJ, USA. He has authored or coauthored more than 350 journal articles and book chapters and holds 28 U.S. patents. His research on optical CDMA initiated a new research field in which more than 1000 papers have since been published, exploring applications ranging from information security to communication speed and bandwidth. In 1993, he invented the Terahertz Optical Asymmetric Demultiplexer, the first optical switch capable of processing terabit per second pulse trains. He is the author of the book Neuromorphic Photonics and the Editor of the book Optical Code Division Multiple Access: Fundamentals and Applications. He was an Area Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS. He is a Fellow of the Optical Society of America (OSA) and the National Academy of Inventors (NAI), and a Member of honor societies, including Phi Beta Kappa and Sigma Xi. He was the recipient of the 1990 Rudolf Kingslake Medal for his paper entitled Self-routing photonic switching with optically-processed control, received the Gold Medal from the Faculty of Mathematics, Physics and Informatics, Comenius University, for leadership in the field of Optics in 2006, and has won multiple teaching awards at Princeton, including the E-Council Lifetime Achievement Award for Excellence in Teaching, the School of Engineering and Applied Science Distinguished Teacher Award, and The President’s Award for Distinguished Teaching. He is instrumental in founding the field of neuromorphic photonics and developing the photonic neuron, a high-speed optical computing device modeled on neural networks and integrated optical circuits to improve the wireless signal quality by cancelling radio interferences.

Lukas Chrostowski (Senior Member, IEEE) is currently a Professor of electrical and computer engineering with the University of British Columbia, Vancouver, BC, Canada. He has authored or coauthored more than 300 journal and conference publications. His research interests include silicon photonics devices, optoelectronics and lasers, including design, fabrication, and test, for applications in optical communications, computing, biophotonics, and quantum information. He coauthored the textbook Silicon Photonics Design (Cambridge University Press, 2015). He was the Program Director of the NSERC CREATE Silicon Electronic-Photonic Integrated Circuits research training program in Canada (2012–2018).

Sudip Shekhar (Senior Member, IEEE) received the B.Tech. degree from the Indian Institute of Technology Kharagpur, Kharagpur, India, in 2003, and the M.Sc. and Ph.D. degrees from the University of Washington, Seattle, WA, USA, in 2005 and 2008, respectively. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada. From 2008 to 2013, he was a Research Scientist with the Circuits Research Laboratory, Intel Corporation, Hillsboro, Oregon. He then joined the ECE department in 2013.

Bhavin J. Shastri (Senior Member, IEEE) received the Ph.D. degree in electrical engineering (photonics) from McGill University, Montreal, QC, Canada, in 2012. He is currently an Assistant Professor of engineering physics with Queen’s University, Kingston, ON, Canada, and a Faculty Affiliate with the Vector Institute for Artificial Intelligence, Canada. He was an Associate Research Scholar (2016–2018) and Banting/NSERC Postdoctoral Fellow (2012–2016) with Princeton University, Princeton, NJ, USA. He has authored or coauthored more than 70 journal articles, 100 conference proceedings, and seven book chapters, and has given more than 65 invited talks and lectures including five keynotes and three tutorials. His research interests include silicon photonics, photonic integrated circuits, neuromorphic computing, and machine learning. He is a coauthor of the book Neuromorphic Photonics (CRC Press, 2017), a term he helped coin.

Dr. Shastri was the recipient of the 2022 SPIE Early Career Achievement Award and the 2020 IUPAP Young Scientist Prize in Optics for his pioneering contributions to neuromorphic photonics from ICO. He is a Senior Member of Optica (formerly OSA) and IEEE, and a recipient of the 2014 Banting Postdoctoral Fellowship from the Government of Canada, the 2012 D. W. Ambridge Prize for the top graduating Ph.D. student at McGill, and an IEEE Photonics Society 2011 Graduate Student Fellowship, among other awards.
