
8T SRAM Cell as a Multi-bit Dot Product Engine

for Beyond von-Neumann Computing


Akhilesh Jaiswal*, Indranil Chakraborty*, Amogh Agrawal, Kaushik Roy, Fellow, IEEE
*(Equal Contributor)

arXiv:1802.08601v1 [cs.ET] 22 Feb 2018

Abstract—Large scale digital computing almost exclusively relies on the von-Neumann architecture, which comprises separate units for storage and computation. The energy-expensive transfer of data from the memory units to the computing cores results in the well-known von-Neumann bottleneck. Various approaches aimed at bypassing the von-Neumann bottleneck are being extensively explored in the literature. These include in-memory computing based on CMOS and beyond-CMOS technologies, wherein, by making modifications to the memory array, vector computations can be carried out as close to the memory units as possible. Interestingly, in-memory techniques based on CMOS technology are of special importance due to the ubiquitous presence of field-effect transistors and the resultant ease of large-scale manufacturing and commercialization. On the other hand, perhaps the most important computation required for applications like machine learning comprises the dot product operation. Emerging non-volatile memristive technologies have been shown to be very efficient in computing analog dot products in an in-situ fashion. The memristive analog computation of the dot product results in much faster operation as opposed to digital vector in-memory bit-wise Boolean computations. However, challenges with respect to large-scale manufacturing, coupled with the limited endurance of memristors, have hindered rapid commercialization of memristive computing solutions. In this work, we show that the standard 8 transistor (8T) digital SRAM array can be configured as an analog-like in-memory multi-bit dot product engine. By applying appropriate analog voltages to the read ports of the 8T SRAM array and sensing the output current, an approximate analog-digital dot-product engine can be implemented. We present two different configurations for enabling multi-bit dot product computations in the 8T SRAM cell array, without modifying the standard bit-cell structure. Since our proposal preserves the standard 8T-SRAM array structure, it can be used as a storage element with standard read-write instructions, and also as an on-demand analog-like dot product accelerator.

Index Terms—In-memory computing, SRAMs, von Neumann bottleneck, convolution, dot product.

I. INTRODUCTION

State-of-the-art computing platforms are widely based on the von-Neumann architecture [1], which is characterized by distinct spatial units for computing and storage. Such physically separated memory and compute units result in huge energy consumption due to frequent data transfer between the two entities. Moreover, the transfer of data through a dedicated limited-bandwidth bus limits the overall compute throughput. The resulting memory bottleneck is the major throughput concern for hardware implementations of data-intensive applications like machine learning and artificial intelligence.

A possible approach geared towards high-throughput beyond von-Neumann machines is to enable distributed computing characterized by tightly intertwined storage and compute capabilities. If computing can be performed inside the memory array, rather than in a spatially separated computing core, the compute throughput can be considerably increased. As such, one could think of ubiquitous computing on the silicon chip, wherein both the logic cores and the memory unit partake in compute operations. Various proposals for 'in-memory' computing with emerging non-volatile technologies have been presented, for both dot product computations [2], [3] as well as vector Boolean operations [4]. Prototypes based on emerging technologies can be found in [3], [5].

With respect to CMOS technology, Boolean in-memory operations have been presented in [6] and [7]. The authors of [6] presented vector Boolean operations using 6T SRAM cells. Additionally, the authors of [7] demonstrated that 8 transistor (8T) SRAM cells lend themselves easily to vector compute primitives due to their decoupled read and write ports. Both works [6] and [7] are based on vector Boolean operations. However, perhaps the most frequent and compute-intensive function required for numerous applications like machine learning is the dot product operation. Memristors based on resistive RAMs (ReRAMs) have been reported in many works as analog dot product compute engines [4], [8]. A few works based on analog computations in SRAM cells can be found in [9], [10]. Both of these works use 6T SRAM cells and rely on the resultant accumulated voltage on the bit-lines (BLs). Not only are 6T SRAMs prone to read-disturb failures, the failures are also a function of the voltage on the BLs. This leads to a tightly constrained design space for the proposed 6T SRAM based analog computing. In this paper, we employ 8T cells, which are much more robust than 6T cells due to their isolated read port. We show that, without modifying the basic bit-cell of the 8T SRAM cell, it is possible to configure the 8T cell for in-memory dot product computations. Note that, in sharp contrast to previous works on in-memory computing with CMOS technology, we enable current-based, analog-like dot product computations using robust digital 8T bit-cells. The key highlights of the present work are as follows:

1) We show that the conventional 8T SRAM cell can be used as a primitive for analog-like dot product computations, without modifying the bit-cell circuitry. In addition, we present two different configurations for enabling dot product computation using the 8T cell.
2) Apart from the sizing of the individual transistors constituting the read port of the 8T cell, the basic bit-cell
Fig. 1: (a) Schematic of a standard 8T-SRAM bit-cell. It consists of two decoupled ports for reading and writing respectively.
(b) First proposed configuration (Config-A) for implementing the dot product engine using the 8T-SRAM bit-cell. The SL
is connected to the input analog voltage vi , and the RWL is turned ON. The current IRBL through the RBL is sensed and
is proportional to the dot product vi · gi , where gi is the ON/OFF conductance of the transistors M 1 and M 2. (c) Second
proposed configuration (Config-B). The input analog voltages are applied to the RWL, while the SL is supplied with a constant
voltage Vbias . The current through the RBL is sensed in the same way as in Config-A.
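The column-current summation that both configurations in Fig. 1 rely on can be sketched behaviorally for single-bit weights. The conductance values below are illustrative assumptions for a sketch, not extracted device parameters:

```python
# Behavioral sketch (not the authors' HSPICE setup): the RBL current of one
# column, I_RBL = sum_i v_i * g_i, where each stored bit gates a read stack
# (M1, M2) on or off. G_ON is a hypothetical series 'ON' conductance.
G_ON = 1e-3   # assumed 'ON' conductance of the read stack, in siemens
G_OFF = 0.0   # idealized 'OFF' device (leakage ignored)

def column_current(v_inputs, stored_bits, g_on=G_ON, g_off=G_OFF):
    """Currents of all enabled rows sum on the shared read bit-line."""
    return sum(v * (g_on if b else g_off)
               for v, b in zip(v_inputs, stored_bits))

# Example: 4 rows driven by analog SL voltages, the column storing [1,0,1,1]
i_rbl = column_current([0.10, 0.15, 0.05, 0.20], [1, 0, 1, 1])
```

Only the rows storing '1' contribute, so the result tracks the dot product of the input-voltage vector with the stored bit pattern.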

structure remains unaltered. Thereby, the 8T SRAM array can also be used for the usual digital memory read and write operations. As such, the presented 8T cell array can act as a dedicated dot product engine or as an on-demand dot product accelerator.
3) A detailed simulation analysis using 45nm predictive technology models, including layout analysis, is reported, highlighting the various trade-offs presented by each of the two proposed configurations.

II. 8T-SRAM AS A DOT PRODUCT ENGINE

A conventional 8T bit-cell is schematically shown in Fig. 1(a). It consists of the well-known 6T-SRAM bit-cell with two additional transistors that constitute a decoupled read port. To write into the cell, the write word-line (WWL) is enabled, and the write bit-lines (WBL/WBLB) are driven to VDD or ground depending on the bit to be stored. To read a value from the cell, the read bit-line (RBL) is pre-charged to VDD and the read word-line (RWL) is enabled. Note that the source-line (SL) is connected to ground. Depending on whether the bit-cell stores a logic '1' or '0', the RBL discharges to 0V or stays at VDD, respectively. The resulting voltage at the RBL is read out by the sense amplifiers. Although 8T cells incur a ~30% increase in bit-cell area compared to the 6T design, they are read-disturb free and more robust due to separate read and write path optimizations [11].

We now show how such 8T-SRAMs, with no modification to the basic bit-cell circuit (except for the sizing of the read transistors), can behave as a dot product engine, without affecting the stability of the bits stored in the SRAM cells. We propose two configurations, Config-A and Config-B, for enabling dot-product operations in the 8T-SRAMs. Config-A is shown in Fig. 1(b). The inputs vi (encoded as analog voltages) are applied to the SLs of the SRAM array, and the RWL is also enabled. The RBL is connected to a sensing circuitry, which we will describe later. Thus, there is a static current flow from the SL to the RBL, which is proportional to the input vi and the conductance of the two transistors M1 and M2. For simplicity, assume that the weights (stored in the SRAM) have single-bit precision. If the bit-cell stores '0', the transistor M1 is OFF, and the output current through the RBL is close to 0. Whereas if the bit-cell stores a '1', the current is proportional to vi·gON, where gON is the series 'ON' conductance of the transistors. Assume similar inputs vi are applied on the SLs for each row of the memory array. Since the RBL is common throughout the column, the currents from all the inputs vi are summed into the RBL. Moreover, since the SL is common throughout each row, the same inputs vi are supplied to multiple columns. Thus, the final output current through the RBL of each column is proportional to I_RBL^j = Σ(vi · gij), where gij is the 'ON' or 'OFF' conductance of the transistors, depending on whether the bit-cell in the i-th row and j-th column stores a '1' or '0', respectively. The output current vector thus resembles the vector-matrix dot product, where the vector is vi in the form of input analog voltages, and the matrix is gij stored as digital data in the SRAM.

Let us now consider 4-bit precision for the weights. If the weight Wij = w3w2w1w0, where the wk are the bits of the 4-bit weight, the vector-matrix dot product becomes:

Σ(vi · Wij) = Σ[vi · (2^3·w3 + 2^2·w2 + 2^1·w1 + w0)]
            = Σ(vi · 2^3·w3) + Σ(vi · 2^2·w2) + Σ(vi · 2^1·w1) + Σ(vi · w0)

Now, if we size the read transistors M1 and M2 of the SRAM bit-cells in columns 1 through 4 in the ratio 2^3 : 2^2 : 2^1 : 1, as shown in Fig. 2, the transistor conductances in the 'ON' state would also be in the ratio 2^3 : 2^2 : 2^1 : 1. Thus, summing the currents through the RBLs of the four columns yields the required dot product in accordance with the
Fig. 2: 8T-SRAM memory array for computing dot-products with 4-bit weight precision. Only the read port is shown; the 6T storage cell and the write port are not shown. The array columns are grouped in fours, and the transistors M1 and M2 are sized in the ratio 8 : 4 : 2 : 1 for the four columns. The output current I_OUT^j represents the weighted sum of the I_RBL of the four columns, which is approximately equal to the desired dot-product.

equation shown above. This sizing pattern can be repeated throughout the array. In addition, one could also use transistors with different threshold voltages to mimic the required conductance ratio of 2^3 : 2^2 : 2^1 : 1. Note that the currents through the RBLs of the four consecutive columns are summed together; thus we obtain one analog output current value for every group of four columns. In other words, the digital 4-bit word stored in the SRAM array is multiplied by the input voltage vi and summed up by analog addition of the currents on the RBLs. This one-go computation of vector multiplication and summation in a digital memory array would result in high-throughput computation of dot products.

It is worth mentioning that the way the inputs vi are multiplied by the stored weights and summed up is reminiscent of memristive dot product computations [8]. However, a concern with the presented SRAM based computation is the fact that the ON resistance of the transistors (a few kilo-ohms) is much lower than a typical memristor ON resistance, which is in the range of a few tens of kilo-ohms [12]. As such, the static current flowing through the ON transistors M1 and M2 would typically be much higher in the presented proposal. In order to reduce the static current flow, we propose scaling down the supply voltage of the SRAM cell. Interestingly, 8T cells are known to retain their robust operation even at highly scaled supply voltages [13]. In the next section we use a VDD lower than the nominal VDD of 1V. We now describe another way of reducing the current, although with trade-offs, as detailed below.

Config-B is shown in Fig. 1(c). Here, the SLs are connected to a constant voltage Vbias. The input vector vi is connected to the RWLs, i.e., the gate of M2. Similar to Config-A, the output current IRBL is proportional to vi. We will later show from our simulations that for a certain range of input voltage values, we get a linear relationship between IRBL and vi, which can be exploited to calculate the approximate dot product. To implement multi-bit precision, the transistor sizing is done in the same way as in Config-A, as represented in Fig. 2, so that IRBL is directly proportional to the transistor conductances. Key features of the proposed Config-B are as follows. Vbias can be adjusted to reduce the current flowing through the RBLs. The input voltages vi present a capacitive load, as opposed to a resistive load in Config-A. This relaxes the constraints on the input voltage generator circuitry, and is useful while cascading two or more stages of the dot product engine. However, as presented in the next section, Config-B has a small non-zero current corresponding to zero input, as opposed to Config-A, which has zero current for zero input.

Sensing Circuit: In order to sense the output current at the RBLs, we use a current-to-voltage converter. This can most simply be a resistor, as shown in Fig. 1. However, there are a few constraints. As the output current increases, the voltage drop across the output resistor increases, which in turn changes the desired current output. A change in the voltage on the RBL would also change the voltage across the transistors M1 and M2, thereby making their conductance a function of the voltage on the RBL. Thus, at higher currents corresponding to multiple rows of the memory array, IRBL does not approximate the vector-matrix dot product, but deviates from the ideal output. This dependence of the RBL voltage on the current IRBL will be discussed in detail in the next section, along with possible solutions.

III. RESULTS AND DISCUSSIONS

The operation of the proposed configurations (Config-A and Config-B) for implementing a multi-bit dot product engine was simulated using HSPICE on the 45nm PTM technology [14]. For the entire analysis, we have used a scaled-down VDD of 0.65V for the SRAM cells. The main components of the dot-product engine implementation are the input voltages and the conductances of the transistors for the different states of the cells. A summary of the analysis for the two configurations is presented in Fig. 3, where we have assumed a sensing resistance of 50 ohms connected to the RBL. Note that a small sense resistance is required to ensure that the voltage across the sensing resistance is not high enough to drastically alter the conductances of the connected transistors M1 and M2.

In Fig. 3(a)-(b) we plot the output current in the RBL (IRBL) as a function of the input voltage for three 4-bit weight combinations, '1111', '1010' and '0100', for the two different configurations described in the previous section. The results presented are for a single 4-bit cell. To preserve the accuracy of a dot-product operation, it is necessary to operate the cell in voltage ranges such that the current is a linear function of the applied voltage vi. These voltage ranges are marked as the linear region in Fig. 3(a)-(b). The slope of the linear section of the IRBL versus Vin plot varies with weight, thus signifying a dot product operation. Further, at the left voltage extremity of the linear region, IRBL tends to zero irrespective of the weight, thus satisfying the constraint that the output current is zero for zero Vin. It is to be noted that the two configurations show significantly different characteristics due to the different points of application of the input voltages.
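In the linear region, the 8:4:2:1 column sizing makes the slope of IRBL versus Vin scale with the stored 4-bit weight. An idealized sketch of this binary-weighted summation (assuming perfectly linear devices and a hypothetical unit conductance, not the simulated transistor characteristics) is:

```python
# Sketch of the 4-bit scheme: read stacks in the four columns of a group are
# sized 8:4:2:1, so the group's summed RBL current realizes
# sum_i v_i * W_i for 4-bit weights W_i = 8*w3 + 4*w2 + 2*w1 + w0.
G_UNIT = 1e-3  # hypothetical conductance of the minimum-sized (BIT0) stack

def group_current(v_inputs, weights_4bit):
    """Sum the binary-weighted column currents of one 4-column group."""
    total = 0.0
    for col, scale in enumerate((8, 4, 2, 1)):   # BIT3 .. BIT0 sizing
        for v, w in zip(v_inputs, weights_4bit):
            bit = (w >> (3 - col)) & 1           # extract bits w3 .. w0
            total += v * scale * G_UNIT * bit
    return total

# Three rows with the weight patterns used in Fig. 3: '1111', '1010', '0100'
v = [0.05, 0.10, 0.15]
W = [15, 10, 4]
i_group = group_current(v, W)   # tracks G_UNIT * sum(v_i * W_i)
```

Doubling an input voltage doubles its contribution, and each weight level adds one unit-conductance step, reproducing the weight-dependent slopes of Fig. 3(a)-(b) in the ideal limit.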
[Fig. 3 insets, "Without Opamp": in Config. A, Vin drives the SL and the RBL is terminated by a sense resistor Rsense; in Config. B, Vin drives the RWL, the SL is held at Vbias, and the RBL is likewise terminated by Rsense feeding the sensing circuit.]
Fig. 3: IRBL versus Vin characteristics for (a) Config. A and (b) Config. B, showing the linear region of operation for different weights. IRBL versus weight levels for (c) Config. A and (d) Config. B shows the desirable linear relationship at various voltages Vin. IRBL shows significant deviation from the ideal output (IN = N × I1) with increasing number of rows for both (e) Config. A and (f) Config. B, where I1 is the current corresponding to one row and N is the number of rows. The analyses were done for VDD = 0.65V.
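The deviation in panels (e)-(f) can be sketched with a simplified linear feedback model (illustrative, not the HSPICE simulation): the sense resistor raises VBL with the current, which reduces the voltage across every ON read stack and makes IRBL fall short of the ideal N × I1 as more rows conduct.

```python
# Simplified model of the resistive-sensing non-ideality. Assumed values:
# a per-row 'ON' conductance and the 50-ohm sense resistance from the text.
G_ON = 1e-3      # hypothetical per-row 'ON' conductance (S)
R_SENSE = 50.0   # sense resistance (ohms)

def i_rbl(n_rows, v_in, g=G_ON, rs=R_SENSE):
    """Self-consistent solution of I = n * g * (v_in - I*rs).

    Rearranging the feedback loop gives the closed form
    I = n*g*v_in / (1 + n*g*rs), which saturates as n grows.
    """
    return n_rows * g * v_in / (1.0 + n_rows * g * rs)

i1 = i_rbl(1, 0.1)
i64 = i_rbl(64, 0.1)
deviation = 1.0 - i64 / (64 * i1)   # fraction lost versus the ideal 64 * I1
```

Shrinking rs in this model shrinks the deviation, matching the observation in the text that smaller sensing resistances reduce the error.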

Fig. 3(c)-(d) presents the dependence of the current IRBL on the 4-bit weight levels for Config-A at constant voltages Vin = 0.05V, 0.1V, 0.15V, and for Config-B at Vin = 0.5V, 0.55V, 0.6V, respectively. Different voltages were chosen so as to ensure the circuit operates in the linear region depicted in Fig. 3(a)-(b). Desirably, IRBL shows a linear dependence on the weight levels and tends to zero for weight = '0000'. The choice of any voltage in the linear regions of Fig. 3(a)-(b) does not alter the linear dependence of IRBL on the weight levels. To extend the dot-product functionality to multiple rows, we performed an analysis for up to 64 rows in the SRAM array, driven by 64 input voltages. In the worst-case condition, when the 4-bit weight stores '1111', maximum current flows through the RBLs, thereby increasing the voltage drop across the output resistance. Fig. 3(e)-(f) indicates that the total current IRBL deviates from its ideal value with increasing number of rows in the worst-case condition. The deviation in Fig. 3(e)-(f) arises because we sense the output current with an equivalent sensing resistance (Rsense), and hence the final voltage on the bit-line (VBL) is dependent on the current IRBL. At the same time, IRBL is also dependent on VBL, and as a result the effective conductance of the cell varies as VBL changes as a function of the number of rows. It was also observed that the deviation reduces with decreasing sensing resistance, as expected. Another concern with respect to Fig. 3 is the fact that the total summed-up current reaches almost 6mA for 64 rows in the worst-case condition (all the weights being '1111').

There are several ways to circumvent the deviation from ideal behavior with increasing number of simultaneous row accesses, and also to reduce the maximum current flowing through the RBLs. One possibility is to use an operational amplifier (Opamp) at the end of each 4-bit column, where the negative differential input of the Opamp is fed by the bit-line corresponding to a particular column, whereas the positive input is supplied with a combination of the Opamp offset voltage and any desired voltage required for suitable operation of the dot-product, as shown on the left hand side of Fig. 4. The Opamp provides a means of sensing the summed-up current at the RBL while maintaining a constant voltage at the RBL. Opamps in the configuration shown in Fig. 4 have traditionally been used for sensing in memristive crossbars, as in [3].

We performed the same analysis as previously described in Fig. 3 for the two proposed configurations with the bit-line terminated by an Opamp. For our analysis, we have set Vpos = 0.1V for the positive input of the Opamp, and thus the analysis is limited to input voltages above Vpos to maintain a unidirectional current. Note that we have used an ideal Opamp for our simulations, where the voltage Vpos can account for both the non-ideal offset voltage of the Opamp and an externally supplied voltage. Fig. 4(a)-(b) shows the plot of IRBL versus input voltage Vin for the two configurations. Similar behavior as in the case of Fig. 3(a)-(b) is observed even in the presence of the Opamp. However, note that the current ranges have decreased since the RBL is now clamped at Vpos. Further, the dot-product operation is only valid for Vin > Vpos, and thus the acceptable input range is shifted in the presence of an Opamp. Fig. 4(c)-(d) shows the behavior of IRBL versus weight levels for the two configurations, and desirably, linearity is preserved.

Fig. 4(e)-(f) presents the current through the RBL as a function of the number of rows. As expected, due to the high input impedance of the Opamp and the clamping of VBL at the voltage Vpos, the deviation of the summed-up current from the ideal value is mitigated to a huge extent. Although the current levels have reduced significantly as compared to
[Fig. 4 insets, "With Opamp": in both configurations the RBL feeds the negative input of an Opamp with feedback resistance Rf, the positive input is held at Vpos, and the output is Vout = −IRBL·Rf + Vpos.]

Fig. 4: IRBL versus Vin characteristics for (a) Config. A and (b) Config. B, showing the linear region of operation for different weights. IRBL versus weight levels for (c) Config. A and (d) Config. B shows the desirable linear relationship at various voltages Vin. IRBL shows almost zero deviation from the ideal output (IN = N × I1) with increasing number of rows for both (e) Config. A and (f) Config. B, where I1 is the current corresponding to one row and N is the number of rows. These analyses were done for VDD = 0.65V.
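The effect of the Opamp clamp in panels (e)-(f) can be sketched with the same simplified linear model used above (illustrative assumptions, not the simulated devices): with the bit-line held at Vpos, every ON stack sees the same drive voltage regardless of how many rows conduct, so the currents add ideally.

```python
# Sketch of the Opamp-terminated read with an ideal virtual clamp.
G_ON = 1e-3    # hypothetical 'ON' conductance (S)
V_POS = 0.1    # clamp voltage at the Opamp's positive input (V)
R_F = 50.0     # hypothetical feedback resistance (ohms)

def i_rbl_clamped(n_rows, v_in, g=G_ON, vpos=V_POS):
    """With the bit-line clamped at vpos, row currents simply add:
    I = n * g * (v_in - vpos), linear in the number of rows."""
    return n_rows * g * (v_in - vpos)

def v_out(i, rf=R_F, vpos=V_POS):
    """Inverting transimpedance output: Vout = Vpos - I * Rf."""
    return vpos - i * rf

i1 = i_rbl_clamped(1, 0.2)
i64 = i_rbl_clamped(64, 0.2)   # exactly 64 * i1: ideal scaling recovered
```

The same model also shows why the operation is restricted to Vin > Vpos: below the clamp voltage the modeled current would reverse sign.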

Fig. 6: Thin-cell layout for a standard 8T-SRAM bit-cell [11].
N Neurons
Fig. 5: Fully connected network topology consisting of 3 layers: the input layer, the hidden layer and the output layer [12]. We have used M = 784, N = 500 and P = 10.

Fig. 3, the resultant current for 64 rows would still be higher than the electro-migration limit for the metal lines constituting the RBL [15]. One possible solution is to sequentially access a smaller section of the crossbar (say, 16 rows at a time), convert the analog current into its digital counterpart each time, and finally add all the accumulated digital results. In addition, the use of high-threshold transistors for the read port of the SRAM would also help to reduce the maximum current values. Further, the maximum current is obtained only when all the weights are '1111', which is usually not the case due to the sparsity of the matrices involved in various applications, as in [16], [17].

We also performed functional simulations using the proposed dot-product engine based on Config. A in a fully connected artificial neural network consisting of 3 layers, as shown in Fig. 5. The main motivation behind this analysis is to evaluate the impact of the non-linearity in the I-V characteristics on the inference accuracy of the neural network. We chose an input voltage range of 0.1-0.22V. As can be observed in Fig. 4(a), the I-V characteristics are not exactly linear within this range; as such, a network-level functional simulation is required to ascertain the impact of the non-linearity on classification accuracy. The network details are as follows. The hidden layer consisted of 500 neurons. The network was trained using the Backpropagation algorithm [18] on the MNIST digit recognition dataset under ideal conditions using the MATLAB Deep Learning Toolbox [19].

During inferencing, we incorporated the proposed 8T-SRAM based dot-product engine in the evaluation framework by discretizing and mapping the trained weights proportionally to the conductances of the 4-bit synaptic cell. The linear range of the voltage was chosen to be [0.1-0.22V] and normalized to a range of [0, 1]. The dot-product operation was ensured by normalizing the I-V characteristics for all the weight levels such that the current corresponding to the highest input voltage and highest weight level is Imax = Vmax × Gmax.
Fig. 7: Thin-cell layout for the proposed 8T-SRAM array with 4-bit precision weights. The widths of the read transistors at the different bit positions (BIT3 through BIT0) are sized 8W, 4W, 2W and W, i.e., in the ratio 8:4:2:1. This incurs an area overhead of ~15.6% compared to the standard 8T-SRAM bit-cell.
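The weight discretization and input normalization used in the inference evaluation can be sketched as follows. This is an illustrative Python rendering, assuming a min-max mapping onto the 16 levels of the 4-bit cell; it does not reproduce the authors' MATLAB framework:

```python
import numpy as np

V_RANGE = (0.10, 0.22)   # linear input-voltage window from the text (V)

def quantize_weights(w, levels=16):
    """Map trained weights proportionally onto integer levels 0..15,
    i.e., onto the conductance levels of the 4-bit synaptic cell."""
    w = np.asarray(w, dtype=float)
    scaled = (w - w.min()) / (w.max() - w.min())   # normalize into [0, 1]
    return np.round(scaled * (levels - 1)).astype(int)

def satlin(x):
    """Saturating-linear (satlin) activation: clip to [0, 1]."""
    return np.clip(x, 0.0, 1.0)

def to_voltage(x_norm, vr=V_RANGE):
    """Map normalized activations in [0, 1] into the linear voltage window,
    so every dot-product input stays inside the region of Fig. 4(a)."""
    return vr[0] + x_norm * (vr[1] - vr[0])
```

Normalizing currents so that the largest input times the largest weight corresponds to Imax = Vmax × Gmax then reduces the neuron-side bookkeeping to a single scale factor on the satlin activation, as noted in the text.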

The activation function of the neuron was considered to be a behavioral satlin function, scaled according to the scaling factor of the weights to preserve the mathematical integrity of the network. It is to be noted that the normalization of the current and input voltage simplifies the scaling of the neuron activation function. The accuracy on the digit recognition task was calculated to be merely 0.11% lower than the ideal case (98.27%), thus indicating that the proposed dot-product engine can be seamlessly integrated into the neural network framework without significant loss in performance.

Further, it is to be noted that in many cases the inherent resilience of the applications that require dot product computations can be leveraged to circumvent some of the circuit-level non-idealities. Additionally, the proposed technique can either be used as a dedicated CMOS based dot product compute engine or as an on-demand dot product accelerator, wherein the 8T array acts as usual digital storage and can also be configured as a compute engine as and when required. It is also worth mentioning that the 8T cell has also been demonstrated in [7] as a primitive for vector Boolean operations. This work significantly augments the possible use cases for 8T cells by adding analog-like dot product acceleration.

Due to the different sizing of the read transistors, there is an area penalty in using the proposed configurations, compared to the standard 8T-SRAM bit-cell used for storage. Fig. 6 shows the thin-cell layout for a standard 8T-SRAM bit-cell [11]. Note that the rightmost diffusion, with width W, constitutes the read transistors (M1 and M2). To implement the 4-bit precision dot-product, we size the widths of the read transistors in the ratio 8 : 4 : 2 : 1, as described earlier. Thus, the width of the rightmost diffusion is increased to 8W, 4W, and 2W, resulting in an area overhead of ~39.6%, 17.1%, and 5.7% for bits 3, 2 and 1, respectively, compared to the standard minimum-sized 8T bit-cell with diffusion width W. The resulting layout of the first four columns of a particular row in the proposed array is shown in Fig. 7. The overall area overhead for the whole SRAM array with 4-bit weight precision amounts to ~15.6% compared to the standard 8T SRAM array. Note that this low area overhead results from the fact that both read transistors M1 and M2 share a common diffusion layer, and hence an increase in transistor width can be easily accomplished by having a longer diffusion, without worrying about the spacing between metal or poly layers.

Finally, we discuss the power consumed by the proposed dot product engine. The worst-case power consumption occurs when all the inputs are at their highest operating voltage and all the weights are set to '1111'. Considering a column of 16 input nodes (16 rows), for a highest voltage of 0.22V for Config. A with an Opamp, the worst-case power consumption was 128µW. The average power consumption across different voltages and weight levels was determined to be around 33.5µW for the same configuration. Similarly, for a highest voltage of 0.65V and Vbias = 0.3V for Config. B with an Opamp, the worst-case and average power consumption were 196µW and 68.1µW, respectively. It is to be noted that the power in the case of Config. B can be reduced by using a lower Vbias.

IV. CONCLUSION

In the quest for novel in-memory techniques for beyond von-Neumann computing, we have presented the 8T-SRAM as a vector-matrix dot-product compute engine. Specifically, we have shown two different configurations of the 8T SRAM cell for enabling analog-like multi-bit dot product computations. We also highlight the trade-offs presented by each of the proposed configurations. The usual 8T SRAM bit-cell circuit remains unaltered, and as such the 8T cell can still be used for normal digital memory read and write operations. The proposed scheme can either be used as a dedicated dot product compute engine or as an on-demand compute accelerator. The presented work augments the applicability of 8T cells as compute accelerators, in view of the fact that dot products find wide applicability in multiple data-intensive applications and algorithms, including efficient hardware implementations for machine learning and artificial intelligence.

REFERENCES

[1] J. Von Neumann, The computer and the brain. Yale University Press, 2012.
[2] J. J. Yang, D. B. Strukov, and D. R. Stewart, "Memristive devices for computing," Nature Nanotechnology, vol. 8, no. 1, p. 13, 2013.
[3] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Dávila, C. E. Graves, Z. Li, J. P. Strachan, P. Lin, Z. Wang, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "Analogue signal and image processing with large memristor crossbars," Nature Electronics, vol. 1, no. 1, pp. 52–59, Dec 2017. [Online]. Available: https://doi.org/10.1038%2Fs41928-017-0002-z
[4] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, "'Memristive' switches enable 'stateful' logic operations via material implication," Nature, vol. 464, no. 7290, p. 873, 2010.
[5] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu, "Nanoscale memristor device as synapse in neuromorphic systems," Nano Letters, vol. 10, no. 4, pp. 1297–1301, 2010.
[6] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada, S. Miyoshi, D. Blaauw, and D. Sylvester, "A 0.3V VDDmin 4+2T SRAM for searching and in-memory computing using 55nm DDC technology," in VLSI Circuits, 2017 Symposium on. IEEE, 2017, pp. C160–C161.
[7] A. Agrawal, A. Jaiswal, and K. Roy, "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," arXiv preprint arXiv:1712.05096, 2017.
[8] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Dávila, C. E. Graves et al., "Analogue signal and image processing with large memristor crossbars," Nature Electronics, vol. 1, no. 1, p. 52, 2018.
[9] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, “A multi-
functional in-memory inference processor using a standard 6t sram
array,” IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 642–
655, Feb 2018.
[10] J. Lee, D. Shin, Y. Kim, and H. J. Yoo, “A 17.5-fj/bit energy-efficient
analog sram for mixed-signal processing,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2714–2723,
Oct 2017.
[11] K. Takeda, Y. Hagihara, Y. Aimoto, M. Nomura, Y. Nakazawa, T. Ishii,
and H. Kobatake, “A read-static-noise-margin-free sram cell for low-
vdd and high-speed applications,” IEEE journal of solid-state circuits,
vol. 41, no. 1, pp. 113–121, 2006.
[12] I. Chakraborty, D. Roy, and K. Roy, “Technology aware training in
memristive neuromorphic systems based on non-ideal synaptic cross-
bars,” arXiv preprint arXiv:1711.08889, 2017.
[13] L. Chang, Y. Nakamura, R. K. Montoye, J. Sawada, A. K. Martin,
K. Kinoshita, F. H. Gebara, K. B. Agarwal, D. J. Acharyya, W. Haensch
et al., “A 5.3 ghz 8t-sram with operation down to 0.41 v in 65nm cmos,”
in VLSI Circuits, 2007 IEEE Symposium on. IEEE, 2007, pp. 252–253.
[14] W. Zhao and Y. Cao, “New generation of predictive technology model
for sub-45 nm early design exploration,” IEEE Transactions on Electron
Devices, vol. 53, no. 11, pp. 2816–2823, 2006.
[15] G. Posser, V. Mishra, R. Reis, and S. S. Sapatnekar, “Analyzing the
electromigration effects on different metal layers and different wire
lengths,” in Electronics, Circuits and Systems (ICECS), 2014 21st IEEE
International Conference on. IEEE, 2014, pp. 682–685.
[16] S. Changpinyo, M. Sandler, and A. Zhmoginov, “The power of sparsity
in convolutional neural networks,” arXiv preprint arXiv:1702.06257,
2017.
[17] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and con-
nections for efficient neural network,” in Advances in neural information
processing systems, 2015, pp. 1135–1143.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal
representations by error propagation,” Tech. Rep., sep 1985. [Online].
Available: https://doi.org/10.21236%2Fada164453
[19] R. B. Palm, “Prediction as a candidate for learning deep hierarchical
models of data,” Technical University of Denmark, vol. 5, 2012.
