Multi-Level Encoding and Decoding in A Scalable Photonic Tensor Processor With A Photonic General
(Invited Paper)
Abstract—The resurgence of artificial intelligence enabled by deep learning and high performance computing has seen a dramatic increase in demand for accuracy in deep learning models, which has come at the cost of computational complexity. The fundamental operations in deep learning models are matrix multiplications, and large scale matrix operations and data-centric tasks have experienced bottlenecks from current digital electronic hardware in terms of performance and scalability. Recent research on photonic processors has found solutions to enable applications in machine learning, neuromorphic computing and high performance computing using basic photonic processing elements on an integrated silicon photonic platform. However, efficient and scalable photonic computing requires an information encoding/decoding scheme. Here, we propose a multi-level encoding and decoding scheme, and experimentally demonstrate it with a wavelength-multiplexed silicon photonic processor. We also discuss the scalability of our proposed scheme by introducing a photonic general matrix multiply compiler, and consider the effects of speed, bit precision, and noise. Our proposed scheme could be adapted to a variety of photonic information processing architectures for photonic neural networks, photonic tensor cores, and programmable photonics.

Index Terms—Integrated optics, matrix decomposition, matrix multiplication, optical computing, optical neural networks, programmable circuits.

Manuscript received 22 February 2022; revised 14 June 2022; accepted 31 July 2022. Date of publication 5 August 2022; date of current version 23 August 2022. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), in part by the Canadian Foundation for Innovation (CFI), and in part by the Queen’s University. (Corresponding author: Zhimu Guo.)
Zhimu Guo, Bicky A. Marquez, Matthew Filipovich, and Hugh Morison are with the Department of Physics, Engineering Physics and Astronomy, Queen’s University, Kingston, ON K7L 3N6, Canada (e-mail: 15zg11@queensu.ca; [email protected]; [email protected]; hugh.moris [email protected]).
Alexander N. Tait is with the Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON K7L 3N6, Canada (e-mail: [email protected]).
Paul R. Prucnal is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]).
Lukas Chrostowski and Sudip Shekhar are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada (e-mail: [email protected]; [email protected]).
Bhavin J. Shastri was with the Department of Physics, Engineering Physics and Astronomy, Queen’s University, Kingston, ON K7L 3N6, Canada. He is now with the Vector Institute, Toronto, ON M5G 1M1, Canada (e-mail: shastri@ieee.org).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSTQE.2022.3196884.
Digital Object Identifier 10.1109/JSTQE.2022.3196884
1077-260X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

ADVANCEMENTS in machine learning (ML) and artificial intelligence (AI) technologies have enabled numerous applications including sophisticated recommendation models, natural language processing, machine vision, augmented reality, and so on [1], [2], [3], [4]. The groundbreaking progress of these AI applications in different fields is enabled by the heavy dependence of ML algorithm training on large data sets. Since the interconnection of neurons in artificial neural networks can be described by a matrix and the data being processed can be represented as a vector, training on large data sets with deep neural networks results in large-scale dense matrix-vector multiplications. The improvement in the performance (i.e., accuracy) of many ML applications comes at the cost of a higher computational power requirement [5]. As such, there has been significant progress in the development of digital electronic application-specific integrated circuits known as AI accelerators that are dedicated to dense matrix computations [6], [7]. However, modern AI accelerators have seen two major bottlenecks when it comes to energy efficiency: data transfer to and from memory, and large matrix-vector multiplications. Both have imposed strict physical limitations on the scalability and performance of digital electronic AI accelerators.

Integrated photonic processors enabled by silicon photonics have shown promising capabilities in accelerating tensor (i.e., multidimensional vector and matrix) operations [8], [9], [10], [11] by exploiting the high bandwidth of photonic devices (modulators and photodetectors), low latency and minimal energy-delay product due to passive optical waveguides [12]. Some of these processors [9], [10], [11] are scalable and can use the parallel nature of light through wavelength-division multiplexing (WDM) to achieve large-scale interconnects and massively parallel data processing and transfer. Recent developments have proven that the wavelength-multiplexed silicon photonic platform can be operated with up to 7-bit precision [13], and most
Authorized licensed use limited to: Queen's University. Downloaded on August 24,2022 at 02:16:24 UTC from IEEE Xplore. Restrictions apply.
8300714 IEEE JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS, VOL. 28, NO. 6, NOVEMBER/DECEMBER 2022
GUO et al.: MULTI-LEVEL ENCODING AND DECODING IN A SCALABLE PHOTONIC TENSOR PROCESSOR WITH A PHOTONIC GENERAL MATRIX 8300714
B. Commutative Property

To guarantee the stability of the system, the multi-level encoding scheme also imposes strict operational rules on the inputs for both the MRRs and the lasers. The input encoding for both sides uses the same direct value mapping between digital and analog values, but the underlying operating mechanisms are different. For the attenuators, different digital values are mapped to different optical power levels through different levels of attenuation, where small digital numbers correspond to large attenuation, and vice versa. Since non-linearity will occur at high attenuation, we can only operate within a relatively small range of attenuation. As a result, the input optical power will never go to zero. For the MRRs, the digital values are mapped to the applied heating current values, which shift the resonance of the MRRs. The mismatch between the MRR resonance and the laser wavelength determines how the incoming optical power is distributed between DROP and THRU ports, but the total output power will equal the total input power in the ideal lossless case. Because loss is present in a real-world scenario, a higher laser power benefits the performance of the photonic TPE. In addition, the heating current range chosen for the MRR will center around the “zero” point where the output power is evenly distributed between DROP and THRU. This means that the output power range is also centered around the zero point, and only spans a limited range on both sides of the zero point. Therefore, the input mapping can only encode numbers to a non-zero optical power range, whereas the weight mapping encodes numbers that center around zero optical power. As a result, the same numbers going through the attenuator will produce a different optical output than those going through the MRR, and the range of available optical outputs is different for the two. Therefore, multiplication of numbers from both sides does not commute, i.e., a × b does not equal b × a. To circumvent this problem, the multi-level encoding scheme will force the larger number through the lasers when multiplying two numbers with the photonic TPE, since higher input power for the MRR will give better output resolution.

C. Negative Number Encoding

Aside from the non-commutative operation rule mentioned above, we also implement another restriction on the sign of the multiplication. Since only the MRR can encode both positive and negative numbers, using the left and right of the “zero” point in its output power, whereas the attenuator can only encode positive numbers, any negative number we encounter will be sent to the MRR automatically. In case of two negative numbers during multiplication, both negative signs will be dropped automatically since that is equivalent to multiplying two positive numbers.

An alternative solution to encode negative numbers in our photonic TPE is to have another photonic TPE with the exact same configuration running in parallel. This will allow us to separate negative and positive multiplication completely within our control system for the TPEs, and dedicate one photonic TPE to process either all positive/negative multiplications, or mixed positive/negative multiplications. For either TPE, negative signs will be dropped everywhere during multiplication, and the control system will take outputs from the one processing mixed positive/negative multiplications as negative values automatically.

Fig. 6. Output mapping for an 11-bit signed system with 6-bit signed inputs.

IV. EXPERIMENTAL DEMONSTRATION

Here, we implement an 11-bit signed system with 6-bit signed inputs for our proof-of-concept demonstration. First we perform the calibration stage as mentioned above, including creating an input mapping, an MRR profile, and performing a reflection point search. The input mapping uses an attenuation range between 2 dB and 8 dB for mapping $2^5$ positive input digital numbers to their corresponding, linearly spaced, analog optical power levels. From the reflection point search we determine that a heating current of 0.48 mA to the MRR would produce a zero output power calculated from $P_{\mathrm{DROP}} - P_{\mathrm{THRU}}$. Combining this with the MRR profile, which gives us the heating current range that produces a linear output power level, the weight mapping is finished with a heating current range of [0.37, 0.59] mA that fits $2^6$ signed digital numbers.

Next, the output mapping is constructed by sweeping both inputs and weights across all possible values using both the input mapping and the weight mapping. All possible input/weight combinations include $2^5 \times 2^6 = 2048$ pairs, but only a subset of combinations that meet the aforementioned commutative property is selected. The input number range is chosen to be [0, 31], and the weight number range is [−31, 31]. The choice of values inside the matrices is based on the selected precision for the system, which is a 6-bit signed integer system as an example. This range is only a digital representation of the measured analog values, and the example demonstrates how the matrix dot product will work based on an arbitrary value range selection. However, this value range selection can be any numerical range that centers around zero depending on the application, and in many situations, the common choice will be the normalized range of [−1, 1]. We collect the experimental results as shown in Fig. 6. Here, the expected output is calculated by multiplying the input number with weight number directly
inside the control computer. The measured output is converted from the measured optical output power, $P_{\mathrm{DROP}} - P_{\mathrm{THRU}}$, to the desired output number range via a re-scaling. The re-scaling of experimental data first performs a linear regression using both the measured output power and the expected output values, and then it compares the slope and intercept of experimental data to a theoretical slope of 1.0 and intercept of 0.0. After the re-scaling, the measured experimental data is converted to measured output values that are in the same range as the expected outputs.

Having fully characterized the photonic TPE that includes an MRR and an attenuator, we now incorporate the sign rule and include full positive and negative numbers for both the input and weight. The result of full 6-bit signed multiplication is shown in Fig. 7. Here, both input and weight go from [−31, 31], and the experimentally measured output is shown as colored contour maps on the two-dimensional grid of weight versus input. The measured output values range between (−1000, 1000), and the standard deviation calculated from the measured outputs is $9.34 \times 10^{-6}$.

Fig. 7. Multiplication results with full range of 6-bit signed inputs and weights with the implementation of the above-mentioned operational rules. Here the inputs range between −31 and +31, and the weights also have the same range. Different colors in the colorbar represent the product of a weight value and an input value, with purple representing the smallest and yellow representing the largest.

The precision adjustment can be easily made at the generation of the direct mapping stage during calibration. The calibration starts with weight and reflection sweeps, which will estimate the usable heating current range for the MRRs. The next step is to decide how many analog levels we need to represent all digital values up to the chosen precision. We demonstrated an 11-bit signed system with half-precision for weights and inputs. However, the system can be easily adapted to lower precision levels, such as 8-bit signed precision with half-precision for the weights and inputs. In this case, we only need to redo the direct mapping for the weights and inputs to accommodate fewer analog levels.

V. GEMM COMPILER AND SCALABILITY

Having demonstrated the functionality and performance of a single photonic TPE, we will focus on scaling up our system to accommodate higher computing capacity and throughput. Scaling up a processing element architecture generally involves two approaches: hardware scaling and software scaling.

A. GeMM Compiler for Photonic TPE

First, we demonstrate our solution for software scaling. One common software up-scaling approach for tensor processors is the use of a General Matrix Multiplication (GeMM) compiler [28], which helps software to map different matrix operations to specific hardware architectures to optimize structure utilization and computation efficiency. As the modern data science industry continues to develop and the computation volume and complexity increase, most data-focused compute hardware finds it helpful to implement a dedicated compiler to efficiently perform sophisticated matrix multiplications. There are many different designs for GeMM compilers depending on their targeted hardware platforms [29], [30], [31], but the basic operating rule for any GeMM compiler focuses on the most prevalent matrix multiplication, the matrix dot product, whose mathematical form can be expressed as (5),

$$Y = \alpha W \cdot X + \beta Z. \qquad (5)$$

Here, W, X and Z are input matrices, both α and β are scaling constants, and Y is the output matrix. The math is simple, but the main focus of GeMM compilers is mapping the mathematical expression to the topology of different hardware platforms. Because the sizes of the matrices from data-focused tasks often exceed the physical sizes of the actual compute hardware, GeMM compilers need to first break down these large matrices into smaller matrices or vectors. How the matrices are broken down depends on the core/thread count of the actual hardware, and the overall task of matrix multiplication will be done in multiple batches. Once the matrices are divided, GeMM compilers need to send specific values from the current data batch to the compute units used by the task. After one iteration of computation is finished, GeMM compilers will then collect all the results and send out the next batch. As an example, we have a simple matrix dot product between two matrices W and X as shown in Fig. 8.

We divided matrix W into four batches each containing four elements, and matrix X into two batches each containing six elements. The number of compute units used for this task will be six, matching the number of elements in the largest data batch. The GeMM compiler will first send out the first data batches from both matrix W, $W_{11}$, and matrix X, $X_{11}$, to all the compute units to calculate the dot product $W_{11} \times X_{11}$. For the second round of operation, the GeMM compiler will send out $W_{12}$ and $X_{21}$ instead, and the same procedure is repeated for
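As a rough illustration of the batching idea behind (5), the following Python sketch evaluates Y = αW · X + βZ one tile at a time. The function name `tiled_gemm`, the square tile size `tile`, and the plain software loop standing in for one round of hardware dot products are all assumptions made for illustration; this is not the compiler used in this work.

```python
def tiled_gemm(W, X, Z, alpha=1.0, beta=1.0, tile=2):
    """Evaluate Y = alpha * (W . X) + beta * Z batch by batch.

    W, X, Z are lists of rows. `tile` is a hypothetical batch edge length
    that stands in for the number of available compute units.
    """
    m, k, n = len(W), len(X), len(X[0])
    if any(len(row) != k for row in W):
        raise ValueError("inner dimensions of W and X do not match")
    if len(Z) != m or any(len(row) != n for row in Z):
        raise ValueError("Z must have the shape of W . X")

    # Start from the beta * Z term, then accumulate the tile products.
    Y = [[beta * z for z in row] for row in Z]
    for i0 in range(0, m, tile):              # row block of W
        for j0 in range(0, n, tile):          # column block of X
            for p0 in range(0, k, tile):      # block along the shared dimension
                # One "round": partial dot products for the current batch.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        partial = sum(
                            W[i][p] * X[p][j]
                            for p in range(p0, min(p0 + tile, k))
                        )
                        Y[i][j] += alpha * partial
    return Y


if __name__ == "__main__":
    W = [[1, 2, 3], [4, 5, 6]]
    X = [[1, 0], [0, 1], [2, 2]]
    Z = [[1, 1], [1, 1]]
    print(tiled_gemm(W, X, Z, alpha=2.0, beta=3.0, tile=2))
```

The result does not depend on the tile size; only the number of rounds changes, which is the quantity a hardware-aware compiler would try to minimize.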
Fig. 11. Distribution of the matrix dot product accuracy collected from trials with different bit-precisions, each including 100 computations using randomly generated matrices. Here, we simulate four different bit-precisions, including (a) 3 bits, (b) 4 bits, (c) 5 bits, and (d) 6 bits to encode input values. We used a matrix size of 128 × 128 for all computations.

Fig. 12. (a) Average accuracy and its standard deviation calculated from the trials with different matrix sizes, as shown in Fig. 10. (b) Average accuracy and its standard deviation calculated from the trials with different bit precisions, as shown in Fig. 11.

from the first test using four different matrix sizes, from 64 × 64 to 1024 × 1024. We see a clear upward trend in the overall average accuracy as aforementioned, together with a decreasing trend for the standard deviation from the overall average accuracy calculation. For the second test, the overall average accuracy also increases with higher bit-precision, but we also see a small increase in standard deviation from the calculation as shown in Fig. 12(b). Because the change in standard deviation in Fig. 12(b) is one order of magnitude smaller than that in Fig. 12(a) and the average accuracy is similar for trials using more than 3-bit precision, the small increase in standard deviation can be a random result since all matrices are randomly generated in all trials. The improved average accuracy in both performance tests is likely a result of larger sample size when running randomized trials. Randomized trials require greater sample sizes to better achieve the ideal normal distribution of test samples, and as the matrix sizes increase these test matrices include more randomized values, which contributes to a better test sample distribution. As the test sample distribution approaches the ideal normal distribution, the accuracy results obtained from such random trials can faithfully represent the true accuracy achievable on our photonic TPE. Therefore, the improved accuracy results over larger test matrix sizes indicate that our system can achieve above 99.5% accuracy.

The standard deviation in Fig. 12(b) represents the accuracy value fluctuation over multiple repeated trials; the increase in standard deviation results from the noise within our analog system. At higher precision levels, the same level of analog noise will be more likely to cause a misrepresentation of each digital value. Lower precision requires fewer analog levels to include all the digital values, whereas higher precision will require more analog levels within the same analog range. As a result, the system is more susceptible to noise at higher precision. We have observed a larger fluctuation in average accuracy values over multiple tests, leading to a slight increase in the standard deviation value.

Aside from software scaling, hardware scaling is also a crucial part in boosting the computation capacity of our photonic tensor processor. The hardware architecture for a single photonic TPE is shown in Fig. 1, which contains an array of five MRRs sharing both a common THRU connection and a common DROP connection. The photonic TPE is capable of performing five multiplications simultaneously using five sets of inputs through the same bus waveguide, where each set encodes one number through the attenuator as the “input” and the other through the source meters as the “weight”. Thus, the single photonic TPE can compute a dot product between two vectors each with five elements within a single iteration. However, this is only one single photonic TPE, and its architecture can be easily duplicated on chip. In addition, because different copies of the same photonic TPE have their own bus waveguide for inputs, the same laser sources can be used in a multiplexer/splitter fashion to provide the same copies of all signal carriers for all the photonic TPEs. The multiplexer is implemented using a WDM multiplexer that combines all the individual laser sources from separate waveguides, and the splitter evenly distributes the combined signal among all photonic TPEs.

On the other hand, most hardware scaling solutions will benefit from a higher level of integration for lower latency and higher compute throughput. In our current design for the photonic TPE, the input mapping still relies on external attenuators to encode different input values as different optical power levels. However, the same effect can be achieved by using the THRU port output of an on-chip MRR. By tuning the MRR on and off resonance, the THRU port output will carry different output power depending on the wavelength mismatch between the optical signal and the MRR. Therefore, by replacing the attenuators with on-chip MRRs for input encoding, the control mechanism can be applied to both the input encoding MRRs and the multiplication MRRs. Additionally, the balanced photodetectors can also be integrated on chip, and will only require a bias voltage from the external source meters. The output of the balanced photodetectors is in the form of different current levels, which can also be monitored through the same sourcing and measurement units. Thus, both information encoding and decoding will be uniformly implemented through the external source meters for both inputs and weights.
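The single-TPE dot product and its duplication across several TPEs can be sketched numerically as follows. This is an idealized, lossless model assumed for illustration only: a weight w in [−1, 1] is taken to route a fraction (1 + w)/2 of its channel power to DROP and (1 − w)/2 to THRU, so that balanced detection returns the sum of input × weight. The names `tpe_dot`, `multi_tpe_matvec`, and `N_CHANNELS` are hypothetical, and calibration, loss, and noise are ignored.

```python
N_CHANNELS = 5  # number of MRRs (wavelength channels) in the TPE of Fig. 1


def tpe_dot(inputs, weights):
    """Idealized balanced-detection output P_DROP - P_THRU of one TPE.

    inputs  : non-negative optical powers, one per wavelength channel
    weights : MRR weights in [-1, 1], one per wavelength channel
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    if len(inputs) > N_CHANNELS:
        raise ValueError("more elements than wavelength channels")

    p_drop = 0.0
    p_thru = 0.0
    for x, w in zip(inputs, weights):
        if x < 0 or not -1.0 <= w <= 1.0:
            raise ValueError("inputs must be >= 0 and weights in [-1, 1]")
        # Assumed lossless split: DROP + THRU equals the channel input power.
        p_drop += x * (1.0 + w) / 2.0
        p_thru += x * (1.0 - w) / 2.0
    return p_drop - p_thru


def multi_tpe_matvec(weight_rows, inputs):
    """Each row of weights sits on its own TPE copy; all copies receive the
    same split input signal, so the outputs form a matrix-vector product."""
    return [tpe_dot(inputs, row) for row in weight_rows]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(tpe_dot(x, [0.5, -0.5, 1.0, 0.0, -1.0]))
    print(multi_tpe_matvec([[1.0] * 5, [-1.0] * 5], x))
```

In this model a weight of zero corresponds to the even DROP/THRU split described earlier, and adding more TPE copies adds rows to the matrix without changing the input encoding.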
and the large TPEs comes from the number of cycles required and how the energy consumption is distributed over time. Here, the larger photonic TPE will require more energy within a short time window, whereas the small TPE uses much less energy per cycle and spreads out the energy consumption over longer periods of time.

VI. SPEED AND PRECISION ANALYSIS

The MRRs shown in Fig. 1 implemented thermal tuning using N-doped heaters, and our photodetectors were SiGe-based PIN junction photodetectors. The tuning speed of N-doped heaters is on the millisecond scale, which can be a limitation in certain scenarios and applications. However, in many other deep learning applications the update rate of the weights can be much slower, especially during inference or convolutions. During inference, the photonic TPE will be loaded with pre-trained weights. Thus, the photonic TPE can perform MAC operations at the speed limit of the photodetectors, which is shown to be 56 GHz for an avalanche photodetector [33] and 67 GHz for a PIN photodetector [34]. In the case of convolutions as demonstrated by Feldmann et al. [22], the convolution filters only require a slower update rate compared to the inputs. Therefore, the relatively slow tuning speed of the weights inside the photonic TPE can satisfy a high-speed MAC operation for inference or convolution.

However, in the case of matrix tiling the photonic TPE computation speed will also be limited by the weight updates. Because we are using a wavelength-multiplexed approach, the speed bottleneck for our system during a matrix tiling process will be affected by three factors: weight update speed, detection speed, and the physical size of our photonic TPE. The weight update speed determines how fast the photonic TPE can be updated for the next batch of weights. The detection speed determines how fast the TPE can process all batches of inputs before adjusting the weights. The size of the photonic TPE will affect the number of input batches per weight batch. In a wavelength-multiplexed setup where a single balanced photodetector is paired with multiple MRRs, we can increase the number of MRRs as long as their resonances can all fit within their free spectral ranges. With more MRRs in the weight bank, large matrices require less tiling to finish the computation. Also, we can implement data batching so that the weights inside the photonic TPE will not be updated until all the inputs have been processed through the TPE. Therefore, larger photonic TPEs will go through all the inputs using fewer cycles and require more frequent weight updates when compared to smaller ones. As a result, smaller TPEs rely more on the speed of the photodetector to process all the inputs, but larger TPEs rely more on the weight update speed of the MRRs once all the inputs are processed.

Given these three factors that bottleneck the speed of our photonic TPE, there will be an optimal photonic TPE size that balances the latency between weight updates and photodetection. Currently, we are using the thermo-optic effect in our system, which operates on a millisecond time scale, and the photodetectors in our system have been verified to achieve 10 GHz. Assuming thermal tuning of the MRRs takes 1 ms, the optimal size for our photonic TPE can have no more than a single MRR per pair of balanced photodetectors for a matrix with a size of 1024 × 1024. In this scenario, the speed of the photonic TPE is bottlenecked by the weight update speed because the photodetector is $10^7$ times faster than the thermally tuned MRRs. However, suppose we were to use the carrier-depletion effect to modulate the optimal photonic weights at up to 56 GHz and use a fast PIN photodetector that operates at 67 GHz. In that case, the TPE size will consist of around 850 MRRs. In conclusion, thermally tuning the MRR will create a speed bottleneck from weight updates during matrix tiling, but narrowing the speed gap between weight modulation and photodetection will require larger photonic TPE sizes to take full advantage of that fast weight update capability.

Recent analysis on signal resolution in silicon photonic neural networks by Tait [35] summarizes the relation between laser pump power, signal frequency, and bit precision. Linking all three quantities are the different types of noise that dominate in different operating regimes of the silicon photonic system. Our silicon photonic system implements an O/E/O operating regime, where the first part is the optical signal from a tunable laser, and then optical weighting uses MRR weight banks with thermal tuning. After the weight bank is the optoelectrical conversion by the balanced photodetector. For such a photonic circuit, there are three major noise regimes that affect the interaction between laser pump power, signal frequency, and bit precision: thermal regime, shot regime, and relative intensity noise (RIN) regime. In the thermal regime, the dominant noise is known as Johnson-Nyquist noise, which comes from the random movement of electrons within the photodetector. Here, the noise equivalent power increases exponentially with higher bit precision, and the relation between laser pump power $P_l^{\mathrm{therm}}$, signal frequency $f$, and bit precision $B$ in the thermal regime can be written as:

$$P_l^{\mathrm{therm}}(f, B) = f \cdot \frac{J^*(B)}{\eta_{\mathrm{net}}}, \qquad J^*(B) \propto 2^{\frac{3}{2}B}. \qquad (7)$$

Here, $\eta_{\mathrm{net}}$ is the transmission efficiency of our photonic circuit, and $J^*$ represents the Johnson-Nyquist noise at the given precision $B$. During the operation of the MRR weight bank inside our photonic TPE, the input laser pump power will remain a constant value. As is shown here, there is a trade-off between signal frequency and the bit precision of our system at a given laser pump power level. Thus, higher frequency operations will require lower bit precision to maintain system stability.

During the optoelectrical conversion, photon shot noise will be the dominant noise, and this is called the shot noise regime. Shot noise comes from the randomness in photon detection, and in this regime we still have the same form of relation between laser pump power $P_l^{\mathrm{shot}}$, signal frequency $f$, and bit precision $B$ as is shown in the thermal regime. Here we have:

$$P_l^{\mathrm{shot}}(f, B) = f \cdot \frac{E_{\mathrm{shot}}(B)}{\eta_{\mathrm{net}}}, \qquad E_{\mathrm{shot}}(B) \propto 2^{3B}. \qquad (8)$$

As shown here, the same trade-off between signal frequency and bit precision still remains. In addition to the thermal and shot noise regimes, the carrier laser power output also has random changes that create relative intensity noise. In the RIN regime, the noise is independent of laser pump power, but the frequency-precision relation gives us the maximum signal frequency that
can be obtained at a certain bit precision:

Fig. 14. The trade-off between operating frequency and precision under a constant laser pump power of 10 mW and a round trip loss of 15 dB. The transmission efficiency was taken to be 0.1, and we consider both the thermal regime and the shot noise regime. For the thermal regime, we consider the situation where we are using less than the full designed bandwidth for our single channel. For the shot noise regime, we consider the shot noise amplitude in a typical analog photonic system, which will be larger than the noise amplitude in an ideal system nearing its physical limit.

Fig. 15. The frequency-precision trade-off comparison between multiple photonic systems with different components. In this paper, the photonic TPE design implemented grating couplers (GC), MRRs with N-doped heaters (N-doped MRR), and PIN junction photodetectors (PIN PD). However, recent work has shown that we can replace these components with ones that have higher efficiency and speed, like photonic wirebonds (PWB), PIN junction modulators (PIN mod), and avalanche photodetectors (APD). This plot gives an estimate of how the frequency-precision trade-off will look compared to our current design, and we will be implementing these more advanced components in our future designs.
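The frequency-precision trade-off summarized in the captions of Figs. 14 and 15 can be illustrated numerically. The sketch below is not the noise model used in this paper. It assumes a textbook shot-noise-limited photodetector with RMS noise current sqrt(2 q I B), and it defines bit precision as log2 of the ratio of signal current to RMS noise current. The responsivity of 1 A/W, the treatment of the transmission efficiency as a simple multiplicative factor, and all function names are illustrative assumptions.

```python
import math

Q_E = 1.602176634e-19  # elementary charge in coulombs


def received_power_w(laser_power_w, loss_db, efficiency):
    """Optical power at the detector after a loss in dB and an efficiency factor.

    Treating the efficiency as a plain multiplier is an assumption of this sketch.
    """
    return laser_power_w * 10 ** (-loss_db / 10) * efficiency


def shot_limited_bits(power_w, bandwidth_hz, responsivity_a_per_w=1.0):
    """Bits resolvable if shot noise were the only noise source (assumed model)."""
    signal_a = responsivity_a_per_w * power_w
    noise_a = math.sqrt(2 * Q_E * signal_a * bandwidth_hz)  # RMS shot noise current
    return math.log2(signal_a / noise_a)


def max_bandwidth_hz(power_w, bits, responsivity_a_per_w=1.0):
    """Invert shot_limited_bits: largest bandwidth that still yields `bits`."""
    signal_a = responsivity_a_per_w * power_w
    return signal_a / (2 * Q_E * 4 ** bits)


if __name__ == "__main__":
    # Parameters quoted in the Fig. 14 caption: 10 mW pump, 15 dB loss, 0.1 efficiency.
    p = received_power_w(10e-3, 15.0, 0.1)
    for f_hz in (1e6, 1e8, 1e10):
        print(f"{f_hz:.0e} Hz -> {shot_limited_bits(p, f_hz):.2f} bits")
```

Under these assumptions, quadrupling the bandwidth costs exactly one bit of precision, which is the qualitative shape of a frequency-precision trade-off. Thermal noise and RIN are not modeled here.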
implementation. We have also noted some unique characteristics of the photonic TPE architecture, and by taking advantage of its flexibility we have refined and improved the details of the multi-level encoding scheme. We also combined the multi-level encoding scheme with a simple GeMM compiler, and explored the scalability of our photonic tensor processor. The results from larger scale matrix computations have verified that the proposed multi-level encoding scheme can achieve a high level of computational accuracy while providing up to 6-bit signed precision. Combining the multi-level encoding/decoding scheme with a GeMM compiler can serve as the operational foundation allowing us to explore larger-scale ML applications using MRR-based photonic tensor processors.

ACKNOWLEDGMENT

The authors thank Mohammed Al-Qadasi, Thomas Ferreira de Lima, and Jagmeet Singh for suggestions and experimental support.

REFERENCES

[1] S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based recommender system: A survey and new perspectives,” ACM Comput. Surv., vol. 52, no. 1, pp. 1–38, 2019.
[2] K. R. Bokka, S. Hora, T. Jain, and M. Wambugu, Deep Learning for Natural Language Processing. Birmingham, U.K.: Packt Publishing, 2019.
[3] L. Shao, H. P. H. Shum, and T. Hospedales, “Editorial: Special issue on machine vision with deep learning,” Int. J. Comput. Vis., vol. 128, no. 4, pp. 771–772, 2020.
[4] L. Abdi and A. Meddeb, “Driver information system: A combination of augmented reality, deep learning and vehicular ad-hoc networks,” Multimedia Tools Appl., vol. 77, no. 12, pp. 14673–14703, 2018.
[5] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” 2017, arXiv:1605.07678.
[6] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter, “Nvidia tensor core programmability, performance and precision,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2018, pp. 522–531.
[7] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. 44th Annu. Int. Symp. Comput. Architecture, ACM, 2017, pp. 1–12.
[8] Y. Shen et al., “Deep learning with coherent nanophotonic circuits,” Nature Photon., vol. 11, pp. 441–446, Jul. 2017.
[9] J. Feldmann et al., “Parallel convolutional processing using an integrated photonic tensor core,” Nature, vol. 589, pp. 52–58, Jan. 2021.
[10] M. Miscuglio and V. J. Sorger, “Photonic tensor cores for machine learning,” Appl. Phys. Rev., vol. 7, Sep. 2020, Art. no. 031404.
[11] V. Bangari et al., “Digital electronics and analog photonics for convolutional neural networks (DEAP-CNNs),” IEEE J. Sel. Topics Quantum Electron., vol. 26, no. 1, pp. 1–13, Jan./Feb. 2020.
[12] B. J. Shastri et al., “Photonics for artificial intelligence and neuromorphic computing,” Nature Photon., vol. 15, pp. 102–114, Jan. 2021.
[13] C. Huang et al., “Demonstration of scalable microring weight bank control for large-scale photonic integrated circuits,” APL Photon., vol. 5, no. 4, 2020, Art. no. 040803.
[14] W. Zhang et al., “Silicon microring synapses enable photonic deep learning beyond 9-bit precision,” Optica, vol. 9, pp. 579–584, 2022.
[15] P. Prucnal, B. Shastri, and M. Teich, Neuromorphic Photonics. Boca Raton, FL, USA: CRC Press, Jan. 2017.
[16] A. N. Tait et al., “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep., vol. 7, no. 1, pp. 1–10, 2017.
[17] A. N. Tait et al., “Silicon photonic modulator neuron,” Phys. Rev. Appl., vol. 11, Jun. 2019, Art. no. 064043.
[18] B. A. Marquez et al., “Photonic pattern reconstruction enabled by on-chip online learning and inference,” J. Phys., Photon., vol. 3, Feb. 2021, Art. no. 024006.
[19] D. Liang et al., “Fully-integrated heterogeneous DML transmitters for high-performance computing,” J. Lightw. Technol., vol. 38, no. 13, pp. 3322–3337, Jul. 2020.
[20] A. N. Tait et al., “Feedback control for microring weight banks,” Opt. Exp., vol. 26, pp. 26422–26443, Oct. 2018.
[21] L.-W. Luo, G. S. Wiederhecker, K. Preston, and M. Lipson, “Power insensitive silicon microring resonators,” Opt. Lett., vol. 37, no. 4, pp. 590–592, 2012.
[22] J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, “All-optical spiking neurosynaptic networks with self-learning capabilities,” Nature, vol. 569, pp. 208–214, May 2019.
[23] H. Jayatilleka et al., “Wavelength tuning and stabilization of microring-based filters using silicon in-resonator photoconductive heaters,” Opt. Exp., vol. 23, pp. 25084–25097, Sep. 2015.
[24] M. S. Hai, M. N. Sakib, and O. Liboiron-Ladouceur, “A 16 silicon-based monolithic balanced photodetector with on-chip capacitors for 25 front-end receivers,” Opt. Exp., vol. 21, pp. 32680–32689, Dec. 2013.
[25] L. F. Stokes, M. Chodorow, and H. J. Shaw, “All-single-mode fiber resonator,” Opt. Lett., vol. 7, pp. 288–290, Jun. 1982.
[26] J. E. Heebner, R. Grover, and T. A. Ibrahim, Optical Microresonators: Theory, Fabrication, and Applications, 1st ed., London, U.K.: Springer, 2008, doi: 10.1007/978-0-387-73068-4.
[27] W. Bogaerts et al., “Silicon microring resonators,” Laser Photon. Rev., vol. 6, no. 1, pp. 47–73, 2012.
[28] J. J. Dongarra, J. D. Croz, S. Hammarling, and I. S. Duff, “A set of level 3 basic linear algebra subprograms,” ACM Trans. Math. Softw., vol. 16, pp. 1–17, Mar. 1990.
[29] V. Kelefouras, A. Kritikakou, I. Mporas, and V. Kolonias, “A high-performance matrix–matrix multiplication methodology for and architectures,” J. Supercomputing, vol. 72, pp. 804–844, Mar. 2016.
[30] C. Jhurani and P. Mullowney, “A interface and implementation on Nvidia GPUs for multiple small matrices,” J. Parallel Distrib. Comput., vol. 75, pp. 133–140, 2015.
[31] S. A. Hassan, M. M. Mahmoud, A. Hemeida, and M. A. Saber, “Effective implementation of matrix–vector multiplication on Intel’s multicore processor,” Comput. Lang., Syst. Struct., vol. 51, pp. 158–175, 2018.
[32] M. A. Al-Qadasi, L. Chrostowski, B. J. Shastri, and S. Shekhar, “Scaling up silicon photonic-based accelerators: Challenges and opportunities,” APL Photon., vol. 7, 2022, Art. no. 020902, doi: 10.1063/5.0070992.
[33] M. Huang et al., “56GHz waveguide Ge/Si avalanche photodiode,” in Proc. IEEE Opt. Fiber Commun. Conf., Optical Society of America, 2018, pp. 1–3.
[34] H. Chen et al., “100-Gbps RZ data reception in 67-GHz Si-contacted germanium waveguide p-i-n photodetectors,” J. Lightw. Technol., vol. 35, no. 4, pp. 722–726, Feb. 2017.
[35] A. N. Tait, “Quantifying power in silicon photonic neural networks,” Phys. Rev. Appl., vol. 17, May 2022, Art. no. 054029.
[36] A. N. Tait et al., “Microring weight banks,” IEEE J. Sel. Topics Quantum Electron., vol. 22, no. 6, pp. 312–325, Nov./Dec. 2016, Art. no. 5900214.
[37] N. Lindenmann et al., “Connecting silicon photonic circuits to multicore fibers by photonic wire bonding,” J. Lightw. Technol., vol. 33, no. 4, pp. 755–760, Feb. 2015.

Zhimu Guo received the B.A.Sc. degree in engineering physics and computing option and the M.A.Sc. degree from Queen’s University, Kingston, ON, Canada, where he is currently working toward the Ph.D. degree. His research focuses on the junction of the hardware and software for computer systems. He is also looking forward to exploring new technologies in the quantum computing realm, including integrated neuromorphic photonic processors for deep learning.

Alexander N. Tait (Member, IEEE) received the Ph.D. degree from Lightwave Communications Research Laboratory, Department of Electrical Engineering, Princeton University, Princeton, NJ, USA, under the direction of Paul Prucnal. He is currently an Assistant Professor of electrical and computer engineering with Queen’s University, Kingston, ON, Canada. He was a NRC Postdoctoral Fellow with the Quantum Nanophotonics and Faint Photonics Group, National Institute of Standards and Technology, Boulder, CO, USA.
Bicky A. Marquez (Member, IEEE) received the bachelor’s degree from the Central University of Venezuela, Caracas, Venezuela, in 2012, the master’s degree from the Venezuelan Institute for Scientific Research, Parroquia Macarao, Venezuela, in 2014, and the Ph.D. degree in optics and photonics from Bourgogne-Franche-Comté University, Besançon, France, in 2018, where she worked for Professor Laurent Larger. Her research interests include nonlinear and complex dynamical systems, machine learning, and AI photonic hardware. She likes to spend her free time by traveling and painting/drawing.

Lukas Chrostowski (Senior Member, IEEE) is currently a Professor of electrical and computer engineering with the University of British Columbia, Vancouver, BC, Canada. He has authored or coauthored more than 300 journal and conference publications. His research interests include silicon photonics devices, optoelectronics and lasers, including design, fabrication and test, for applications in optical communications, computing, biophotonics, and quantum information. He coauthored the textbook Silicon Photonics Design (Cambridge University Press, 2015). He was the Program Director of the NSERC CREATE Silicon Electronic-Photonic Integrated Circuits research training program in Canada (2012–2018).