MErging the Interface: Power, Area and Accuracy Co-optimization for RRAM Crossbar-based Mixed-Signal Computing System
(2015; 5 authors, including Huazhong Yang, Tsinghua University; uploaded by Yu Wang, 05 July 2016)
Figure 1: (a) Physical model of the RRAM device. (b) RRAM crossbar-based computing system (RCS). (c) Basic idea of merging the interface (MEI). (d) Basic idea of serial array adaptive boosting (SAAB).
Therefore, the RRAM crossbar array is able to perform analog matrix-vector multiplication, and the parameters of the matrix depend on the RRAM resistance states.

With the RRAM crossbar structure, an RRAM crossbar-based computing system (RCS) can be implemented by realizing analog artificial neural networks (ANNs) [8]. Generally speaking, an ANN processes data by executing the following operation layer by layer:

    y_j = f(W_ij · x_i + b_i)    (3)

where x_i and y_j represent the data in the i-th and j-th layers of the network, W_ij is the weight matrix between Layer i and Layer j, and f(x) is a nonlinear activation function, e.g., the sigmoid function. It can be seen that the basic operations of an ANN are the matrix-vector multiplication and the nonlinear activation function, which can be implemented with RRAM crossbar structures and analog circuits, respectively [8]. An ANN can learn the relationship between the input and output data automatically, which makes the RCS an efficient and powerful tool for a wide range of tasks [11]. For example, an RCS can be configured as a power-efficient approximate computing system by learning to fit complex numerical functions [6, 8].

2.2 Motivation

The RCS achieves significant power efficiency gains by taking advantage of the RRAM crossbar structure. As the RCS processes data in the analog domain, an interface, such as AD/DA, is usually required to connect the RRAM crossbar-based analog accelerator to digital systems. However, compared with the high density and efficiency of RRAM analog units, the interface is drastically area- and power-consuming.

3. THE PROPOSED METHOD

3.1 MEI: MErging the Interface

Fig. 1(c) demonstrates the basic idea of MEI. The technique is based on the idea that, instead of using an RCS to approximate the function between the analog values converted by AD/DA, we can make it directly learn the relationship between the binary 0/1 arrays which represent the input and output digital data. For example, for a traditional 2 × 8 × 2 RCS (an I × H × O RCS denotes a 3-layer ANN with I nodes in the input layer, H nodes in the hidden layer, and O nodes in the output layer) equipped with 8-bit AD/DAs as described in Section 2.2, MEI will directly set 2 × 8 = 16 ports in both the input and output layers of the RCS. Digital 0/1 signals, instead of the analog signals converted by the DA, will be applied to the 16 input ports in parallel, and MEI will directly calculate the corresponding 16 digital outputs. Therefore, MEI is able to directly connect to digital systems without AD/DA.

An important difference between the proposed architecture and the original RCS with AD/DAs is that, after all the input and output ports are exposed, they become independent of each other, and we can treat each port differently to increase the performance and efficiency of MEI.

To be specific, we propose to pay more attention to the ports that represent the most significant bits (MSBs) of a binary number. If we can reduce the error rate of the MSBs, we may significantly decrease the error rate of the whole system. This technique is realized by carefully modifying the loss function of the training algorithm. As described in Section 2.1, an RCS realizes different tasks by realizing an RRAM-based ANN. The training process of an ANN can be described as adjusting the network weights (W_ij in Eq. (3)) to minimize the difference between the target and actual outputs by solving the following optimization problem [15]:

    min Σ_n Σ_p [t_p(n) − o_p(n)]^2    (4)

where n is the index of the training sample and p is the port index of the output "vector". t_p is the target output of the network, e.g., the label of the input training sample, and o_p is the actual output of the network obtained by executing a series of Eq. (3).

In order to suppress the error rate of the MSBs and increase the accuracy of MEI, we revise the loss function of the training optimization problem as follows:

    min Σ_n Σ_p [w_p · (t_p(n) − o_p(n))]^2    (5)

where w_p is the weight of each output port. We set larger weights for the MSBs, while the least significant bits (LSBs) are given smaller weights. For example, we exponentially increase the weight of each bit and set the MSB and LSB weights in an 8-bit output array to 2^0 and 2^(−7), respectively. With the revised loss function, an error on an MSB leads to a much larger penalty than an error on an LSB, so the training process puts more effort into suppressing MSB errors and the accuracy of MEI is improved.

Fig. 3 compares the performance of the different architectures. We use a 1 × N × 1 RCS to perform approximate computing by fitting the calculation of f(x) = exp(−x^2) [6, 7]. We randomly generate 10,000 samples within the range of (0, 1) to train the RCS and test it with another 1,000 samples. It can be seen that the revised training algorithm not only significantly improves the accuracy of the proposed architecture, but can even achieve better performance than the traditional RCS with AD/DAs.

A major side effect of MEI is that the number of input and output ports, and thus the crossbar size, increases massively. However, due to the ultra-high integration density and efficiency of RRAM devices, which can be more than hundreds of times better than AD/DA, MEI still significantly reduces the area and power consumption of RRAM crossbar-based computing systems. A minor problem is that the outputs of the proposed architecture are continuous analog signals. We use flip-flop buffers or analog comparators (working as 1-bit ADCs) to convert them to discrete binary digital signals.

[Figure 3: Comparison of the performance of different architectures (AD/DA; MEI trained with Eq. (4); MEI trained with Eq. (5) and exponential weights) when approximating f(x) = exp(−x^2). Mean square error (MSE) vs. the number of nodes in the hidden layer of the RCS (4 to 128).]

3.2 SAAB: Serial Array Adaptive Boosting

With MEI, we can save the area and power consumption of AD/DAs. We argue that we can use these saved resources to integrate more RRAM devices and analog peripheral circuits and further boost the accuracy and robustness of the RCS. The first step is to choose a proper method. The experimental results of MEI provide us with the following observations:

• As shown in Fig. 3, although the proposed architecture may perform better than the traditional RCS with AD/DAs, it may also require a larger hidden layer to support more output ports and achieve a good result.
• It can also be observed that there is a bottleneck in RCS performance. The accuracy begins to stall after the size of the hidden layer grows beyond a certain value.
• The RRAM devices may suffer from different non-ideal factors [10], such as signal fluctuation and process variation. The robustness of an RCS is important for physical realization.

Based on the above observations, the ensemble method becomes a promising solution to boost the accuracy and robustness of the RCS. Compared with the traditional redundancy method, an ensemble method uses a series of learning machines (learners) with different parameters to provide better results. We propose SAAB, which uses a Serial Array to Adaptively Boost the performance of an RRAM crossbar-based computing system, as shown in Fig. 1(d).

To be specific, SAAB is inspired by the AdaBoost method [16]. The basic idea of AdaBoost, which is also its major advantage, is to train a series of learners sequentially; every time a new learner is trained, the method increases the weights of the examples that are incorrectly classified by previously trained learners and forces the new learner to focus on these "hard" examples with wrong answers in the training set. Compared with the original AdaBoost, SAAB is customized to MEI by relaxing the error calculation, focusing on the MSBs, and introducing the impact of non-ideal factors.

Algorithm 1: Serial Array Adaptive Boosting
  Input:  Training Samples: X = {(x_1, y_1), ..., (x_N, y_N)};
          Bits for Comparison: B_C; Boost Times: K; Non-Ideal Factors: σ
  Output: Trained RCSs {R_1, ..., R_K} with weights {α_1, ..., α_K}
  1   Initialize the weights of the training samples: w_n = 1/N;
  2   for k = 1 → K do
  3       Normalize the distribution of samples: p_n = w_n / Σ_n w_n;
  4       Generate training samples s_k with X and distribution p_n;
  5       Train the RCS R_k with s_k;
  6       Evaluate the error rate of R_k with non-ideal factors σ:
              ε_k = Σ_n p_n · [R_k(x_n, σ)^(B_C) ≠ y_n^(B_C)]*;
  7       Calculate the weight of R_k: α_k = (1/2) ln[(1 − ε_k)/ε_k];
  8       Update the weights of the training samples:
              w_n = w_n × e^(−α_k)  if the first B_C bits of R_k(x_n, σ) are correct;
              w_n = w_n × e^(α_k)   otherwise;
  9   end
  10  Output: h(x) = arg max_{y ∈ Y} Σ_k α_k · [R_k(x) = y];
  *   [R_k(x_n, σ)^B ≠ y_n^B] denotes the operation of comparing the most
      significant B bits of R_k(x_n, σ) and y_n.

Algorithm 1 demonstrates the basic flow of SAAB. We maintain a distribution (p_n) over the training samples according to their weights (w_n), which reflect the "hardness" of a sample; i.e., a larger weight indicates that the corresponding sample is more likely to be misclassified by previous learners. Each time we need to train a new RCS, we use this distribution to generate customized training data (s_k), where the "hard" examples that are incorrectly classified by previous learners take a greater proportion of the training set, making the new learner (R_k) pay more attention to these samples (Lines 4-5). After the training process of a learner finishes, the algorithm tests the learner's performance (ε_k), calculates its contribution to the system (α_k), and updates the weights w_n according to the training results (Lines 6-8).

In order to enhance the robustness of the RCS, in Line 6 we also introduce the non-ideal factors when evaluating the performance of a trained RCS, to find out the "sensitive" samples as well as the "hard" ones under noisy conditions. Moreover, we relax the error calculation by only comparing the most significant B_C bits, e.g., the first 4-6 bits in an 8-bit array, of the RCS in Line 6. Otherwise, most of the training samples would count as either "sensitive" or "hard" in the algorithm, and the performance of SAAB may significantly decrease.

Finally, SAAB provides a balanced output by a weighted voting of the different RCSs, as described in Line 10. Compared with the training process, in which the series of RCSs is configured in sequence, each trained RCS can predict the output of a given input in parallel. And the weighted voting over the outputs of the different RCSs can be executed by the digital system directly connected to these RRAM crossbar-based computing systems.

4. DESIGN SPACE EXPLORATION

4.1 Area and Power Estimation

The area and power consumption are both mainly determined by the architecture and the size of the ANN in an RCS. For example, the area of a traditional I × H × O RCS with AD/DAs can be estimated as follows:

    A_org ≈ I·A_DA + O·A_AD + H·A_P + 2(I + O)H·A_R    (6)

where A_DA, A_AD, A_P, A_R are the circuit sizes of a DAC cell, an ADC cell, the analog peripheral circuits, and an RRAM device, respectively. The area of the RRAM devices is doubled because two crossbars are required to represent a matrix with both positive and negative parameters, as described in Section 2.1.

For the proposed I' × H' × O' RCS with MEI and B-bit accuracy, the area estimation is modified to:

    A_MEI ≈ H'·A_P + B·2(I' + O')H'·A_R    (7)

Moreover, Eq. (6) and (7) can also be used to evaluate the power consumption by replacing the area parameters, such as A_AD and A_DA, with parameters for power estimation.

4.2 Accuracy and Robustness Evaluation

As the RCS is based on an analog realization of an ANN, the accuracy and robustness of an RCS are usually both determined by the scale of the system [15]. Therefore, we discuss them together when exploring the design space.

As discussed in Section 3.2, there are two methods to scale up an RCS which may improve the accuracy and robustness of the system: 1) increasing the scale of a single RCS; and 2) combining several RCSs together with SAAB. Because the dimensions of the input and output data are usually determined by the application, the size of the hidden layer in an RCS and the number of RCSs combined with SAAB become the two major parameters that can be configured to boost accuracy and robustness in the design space.

It should be noted that it is very difficult, if not impossible, to directly predict whether SAAB will achieve better accuracy or robustness to non-ideal factors than the increasing-hidden-layer method. Therefore, we keep both methods when exploring the design space.

4.3 Exploring the Design Space

Because MEI aims to reduce the area and power consumption of an RCS, while SAAB boosts the accuracy at the cost of consuming more power and area, there is a trade-off among power, area, and accuracy in an RCS equipped with MEI and SAAB. Therefore, we propose a design space exploration flow to help convert a traditional RCS to the proposed architecture with MEI and achieve better trade-offs.

Algorithm 2: Design Space Exploration
  Input:  Training and Testing Samples: X and T;
          Initial RCS Size: I × H_i × O;
          Required Bit-Length: B_r; Non-Ideal Factors: σ;
          Error Rate Requirement: ε; Robustness Requirement: γ
  Output: Trained RCS R with a size of B_in·I × H × B_out·O
  1   Search a proper hidden layer size H from H_i for the RCS;
  2   Calculate the maximum SAAB number K_max with H and Eq. (9);
  3   Train the RCS R_1 with a size of B_r·I × H × B_r·O with X;
  4   Test the error rate ε_s and robustness γ_s of R_1 with T and σ;
  5   if ε_s < ε && γ_s > γ then
  6       R ← R_1;
  7   end
  8   else
  9       Calculate α_1, the weight of R_1, with σ as in Algorithm 1;
  10      K ← 1;
  11      while ε_s > ε || γ_s < γ do
  12          K++;
  13          if K > K_max then
  14              Return Mission Impossible;
  15          end
  16          Train R_K and α_K as in Algorithm 1;
  17          Test the error rate ε_s and robustness γ_s of the ensemble of
              {R_1, ..., R_K} with weights {α_1, ..., α_K} with T and σ;
  18          Train an RCS R' with a size of B_r·I × HK × B_r·O;
  19          Compare the error rate and robustness of R' and the ensemble of
              {R_1, ..., R_K}, and set H and R according to the better one;
  20      end
  21  end
  22  Prune the least significant bits in the input and output layers of R
      to B_in and B_out;
  23  Return R;

Algorithm 2 demonstrates the technical flow for exploring the design space. As shown in Line 1, for each specific application, the first step is to determine a proper hidden layer size for a single RCS. Inspired by the results shown in Fig. 3, we search for a proper hidden layer size by gradually increasing the size (with linear or exponential search steps) until the absolute change rate of the error rate or accuracy becomes lower than a certain value (e.g., 5%). The absolute change rate of the i-th training result can be defined as follows:

    η = | (ε_i − ε_{i−1}) / ε_{i−1} |    (8)

where ε can be the error rate, the mean square error (MSE), or any other index that can be used to evaluate the performance of a trained RCS.

After a proper hidden layer size is determined, the maximum number of RCSs that can be used for SAAB is estimated to reduce the design space, as described in Line 2. In our design, both the circuit area and the power consumption of an RCS with MEI should not exceed those of the original architecture with AD/DA. Therefore, the maximum SAAB number is bounded by the following expression:

    K_SAABmax = min{ A_org / A_MEI , P_org / P_MEI }    (9)

where K_SAABmax is the maximum number of RCSs that can be used in SAAB. P_org and P_MEI are the power consumption estimates for the original and proposed architectures, respectively, and they can be estimated as in Eq. (6) and (7).

After obtaining the basic RCS scale and the maximum SAAB number, the algorithm begins to explore the design space by gradually adding new learners to the organization of RCSs with SAAB until the requirements on accuracy and robustness (such as the error rate under noisy conditions) are both satisfied (Lines 13-17).

As discussed in Section 4.2, we retain both SAAB and the increasing-hidden-layer method to enhance the robustness of an RCS. Therefore, in Lines 18-19, the algorithm also trains a single RCS with the same hidden layer size as the K RCSs trained with SAAB, and then compares their performance. The algorithm selects the better one as the final output candidate. In addition, as an I × HK × O RCS can save 2(K − 1)O RRAM devices and (K − 1)O analog peripheral circuits in the output ports compared with K RCSs of size I × H × O, the increasing-hidden-layer method is preferred if the performance of the two methods is similar.
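The control flow of this exploration (the hidden-layer search driven by the change rate of Eq. (8), and the SAAB budget of Eq. (9)) can be condensed into a short sketch. The `train_and_eval` function below is a hypothetical stand-in: a real flow would train the RRAM-based ANN and evaluate it on the SPICE-level crossbar model, whereas here a toy error model with diminishing returns stands in for those measurements.

```python
def train_and_eval(hidden_size):
    # Toy error model: accuracy gains stall as the hidden layer grows,
    # mimicking the bottleneck observed in Fig. 3. Purely illustrative.
    return 0.01 + 1.0 / hidden_size**2

def search_hidden_size(h_init=4, eta_threshold=0.05):
    """Grow the hidden layer with exponential steps until the absolute
    change rate eta = |(e_i - e_{i-1}) / e_{i-1}| of Eq. (8) falls
    below the threshold (e.g. 5%)."""
    h, err = h_init, train_and_eval(h_init)
    while True:
        h_next = 2 * h                      # exponential search step
        err_next = train_and_eval(h_next)
        eta = abs((err_next - err) / err)   # Eq. (8)
        if eta < eta_threshold:
            return h, err                   # further growth no longer pays off
        h, err = h_next, err_next

def max_saab_number(a_org, a_mei, p_org, p_mei):
    """Eq. (9): the SAAB ensemble must not exceed the area or the power
    of the original AD/DA-based architecture."""
    return int(min(a_org / a_mei, p_org / p_mei))
```

With this toy error model the search settles at H = 64; and a design whose MEI variant costs 22% of the original area and 30% of the original power could afford min(100/22, 100/30), i.e. 3 boosted learners.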
Table 1: Benchmark Description and Results

| Name | Type | Digital/AD-DA Topology | Pruned MEI Topology | MSE (Digital ANN) | MSE (AD/DA RCS) | MSE (MEI RCS) | Error Metric | Error (Digital ANN) | Error (AD/DA RCS) | Error (MEI RCS) | Area Saved | Power Saved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFT | Signal Processing | 1×8×2 | (1·7)×16×(2·8) | 0.0046 | 0.0071 | 0.0052 | Average Relative Error | 6.03% | 10.72% | 8.87% | 74.24% | 87.23% |
| Inversek2j | Robotics | 2×8×2 | (2·8)×32×(2·8) | 0.0038 | 0.0053 | 0.0067 | Average Relative Error | 6.57% | 9.07% | 10.45% | 54.63% | 73.73% |
| Jmeint | 3D Gaming | 18×48×2 | (18·6)×64×(2·1) | 0.0117 | 0.0258 | 0.0262 | Miss Rate | 7.19% | 9.50% | 9.96% | 69.67% | 61.82% |
| JPEG | Image Compression | 64×16×64 | (64·6)×64×(64·7) | 0.0081 | 0.0153 | 0.0142 | Image Diff | 6.89% | 11.44% | 9.73% | 86.14% | 79.58% |
| K-Means | Machine Learning | 6×20×1 | (6·6)×32×(1·8) | 0.0052 | 0.0081 | 0.0094 | Image Diff | 3.59% | 7.59% | 8.13% | 67.00% | 70.25% |
| Sobel | Image Processing | 9×8×1 | (9·6)×16×(1·1) | 0.0024 | 0.0028 | 0.0026 | Image Diff | 3.71% | 4.00% | 3.77% | 85.99% | 86.80% |
A special technique in the algorithm is that we propose to prune the least significant bits (LSBs) of a trained RCS to further reduce the area and power consumption. Traditional AD/DAs require time and energy to convert analog signals from/to the LSBs of the output/input binary numbers. This operation is fixed in the AD/DA architecture but may contribute little to the performance of the RCS. However, thanks to MEI, each individual bit of the interface is exposed independently, and this gives us a chance to easily remove the bits of little importance. Therefore, we propose to prune the LSBs of the input and output ports of MEI; this technique is demonstrated in Line 22.

To be specific, for the input ports, we treat all groups of input ports the same and keep reducing the LSBs of each group together until the performance becomes worse than the requirement. For example, an original RCS may require 3 analog input ports. There will then be 3 groups of 8 input ports, i.e., 24 input ports in total, in a proposed architecture with 8-bit accuracy. We try to remove the ports for the least significant 1, 2, ... bits of each group simultaneously, test the pruned architecture's performance, and finally reach the minimum size of the input layer. The pruning of the output layer is much easier and is executed after the size of the input layer is determined. The algorithm first compares the accuracy of the LSBs with the performance, such as the mean square error (MSE), of the trained network, and then tries to prune the LSBs whose weights are much smaller than the RCS error. For example, the LSB of an 8-bit fixed-point binary number may account for a value of 2^(−8), and we can try to remove it once the MSE of the RCS is ~2^(−10) or larger. It should be noted that the algorithm only prunes the LSBs within a given bit-length (B_r), and we set the bit-length to the same as that of the AD/DA in this work.

Finally, by combining the above steps, we can convert a traditional RCS to MEI and SAAB, and achieve trade-offs among accuracy, area, power consumption, and even robustness.

5. EXPERIMENTAL RESULTS

5.1 Experimental Setup

In the experiments, 6 different benchmarks from a wide range of applications are used to evaluate the performance of the proposed method. The benchmarks are the same as those described in Ref. [1, 7], where they are used to test the performance of an analog neural processing unit. The data for the area and power estimation of the analog peripheral circuits and RRAM devices are taken from Ref. [7, 12, 13, 14], as discussed in Section 2.2. For the accuracy and robustness emulation, an RRAM device model packed in Verilog-A [9] is used to build up the SPICE-level crossbar array. We choose the 90nm technology node to build the interconnection of the crossbar array. The accuracy of the AD/DA and the basic bit-length requirement of MEI (B_r) are both set to 8-bit.

5.2 Results of MEI and SAAB

Table 1 summarizes the results of the different methods. In order to reflect the performance of MEI, SAAB is not used in this section. The 'Digital' method refers to an ideal ANN executed on a CPU with 32-bit floating-point numbers, and the 'AD/DA' method represents the traditional RRAM crossbar-based computing system with 8-bit AD/DAs as interfaces. For the 'Pruned MEI Topology', a (D·B) value represents that there are D groups of B ports in the input/output layer, and each group stands for the most significant B bits of a binary number.

Compared with the traditional architecture, the proposed technique reduces more than half of the area and power consumption in all 6 benchmarks. It can be seen that our method achieves comparable, or even better, error rates compared with the traditional architecture. Moreover, although the numbers of required RRAM devices and analog peripheral circuits both increase in the proposed architecture, this overhead is still well compensated by the high density and efficiency of RRAM devices compared with the AD/DA interface. More specifically, the proposed merging-the-interface method benefits most the applications with a larger ratio of interface size to hidden layer size, such as 'JPEG' and 'Sobel'.

On the contrary, for applications like 'Inversek2j' and 'Jmeint', where more hidden nodes are required in the RCS, the gains of the proposed method may decrease. Finally, the topology results of MEI, except for 'Inversek2j', demonstrate that the LSBs in both the input and output ports of many applications can be pruned to further reduce the area and power consumption, which verifies the feasibility and effectiveness of the proposed design space exploration flow.

Fig. 4 illustrates the comparison of the different methods.(2) MEI cannot achieve better performance for all benchmarks. It seems that MEI is more suitable for applications where the output changes more "slowly" with the input, especially for the LSBs, like 'JPEG'. For applications like 'Inversek2j', in which many LSBs of the output results change sensitively with the input data, the relationship between the input and output 0/1 binary arrays may be much more complex than that between the converted analog values. It will be more difficult for MEI to approximate this relationship better than the traditional architecture with AD/DA does. In that case, the performance of MEI may be worse than 'AD/DA'. However, although the accuracy may decrease, the performance of MEI is still within the acceptable range and may be compensated by increasing the bit requirement of MEI from 8 to 10, 12, or a higher level. Moreover, compared with MEI, SAAB is able to further boost the accuracy of all the benchmarks, with an average improvement of 5.76%.

(2) In the SAAB method, we boost the system performance with the maximum SAAB number. For example, the area and power saved in the 'JPEG' benchmark are 86.14% and 79.58%, and we use 4 RCSs in SAAB according to Eq. (9).

[Figure 4: Accuracy improvement of MEI vs. AD/DA (left) and of SAAB vs. MEI (right) across the six benchmarks.]

5.3 Impact of Non-Ideal Factors

We also evaluate the impact of different non-ideal factors on the proposed architecture and compare it with previous methods. In this paper, we mainly focus on two major non-ideal factors in RRAM crossbar-based computing systems: process variation (PV) and signal fluctuation (SF) [10]. The process variation reflects the degree to which the RRAM device deviates from the required resistance state, and the signal fluctuation represents the impact of noise on the electrical signals, such as the input signal. To fully demonstrate the impact of these non-ideal factors, the lognormal distribution is used to generate variations of different levels. Under each noisy condition, we evaluate the system performance 1,000 times and statistically analyze the average result. The simulation results(3) are demonstrated in Fig. 5.

(3) We evaluate all 6 benchmarks; 3 groups of results are presented in this paper as they are enough to reflect all the simulation results.

[Figure 5: System performance under different noisy conditions. Error rate (%) vs. process variation (%) and signal fluctuation (%), comparing AD/DA, MEI, SAAB, and the increasing-hidden-layer method; panels include the 'JPEG' benchmark.]

It can be seen that both SAAB and the increasing-hidden-layer method can improve the robustness of the RCS to non-ideal factors. But, as discussed in Section 3.2, it is difficult to predict which method will perform better in each specific application. For example, SAAB benefits the 'Inversek2j' benchmark more, while the increasing-hidden-layer method is more suitable for 'JPEG'. In 'Sobel', they perform almost the same. This result motivates us to keep both methods in the design space exploration (Lines 18-19 in Algorithm 2). In addition, as MEI only requires discrete 0/1 signals, it is less sensitive to signal fluctuation than the traditional method with AD/DA. Such results suggest that the MEI architecture will be easier to realize physically.

6. CONCLUSION AND FUTURE WORK

The RRAM crossbar-based computing system (RCS) provides a promising solution to boost performance and power efficiency. [...] applied to a broad range of applications, and a large amount of follow-up work, such as reducing the IR drop for a larger RCS under smaller technology nodes, is needed for this emerging architecture. In addition, we set the basic bit-length of MEI according to the AD/DA, e.g., 8-bit, in order to convert a traditional RCS to the proposed architecture. In future work, we may directly use a higher bit-level or even floating-point format in MEI to further improve the system performance.

Acknowledgment

This work was supported by 973 Project 2013CB329000, National Natural Science Foundation of China (No. 61373026), Brain Inspired Computing Research, Tsinghua University (20141080934), Tsinghua University Initiative Scientific Research Program, the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions, and Huawei Technologies.

7. REFERENCES

[1] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in MICRO, 2012, pp. 449-460.
[2] S. A. McKee, "Reflections on the memory wall," in Proceedings of the 1st Conference on Computing Frontiers. ACM, 2004.
[3] B. Liu et al., "Reduction and IR-drop compensation techniques for reliable neuromorphic computing systems," in ICCAD, 2014.
[4] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, "Design implications of memristor-based RRAM cross-point structures," in DATE, 2011.
[5] M. Versace and B. Chandler, "MoNETA: a mind made from memristors," IEEE Spectrum, vol. 47, no. 12, pp. 30-37, 2010.
[6] B. Li, Y. Shan et al., "Memristor-based approximated computation," in ISLPED, 2013, pp. 242-247.
[7] R. St. Amant, A. Yazdanbakhsh et al., "General-purpose code acceleration with limited-precision analog computation," in ISCA, 2014, pp. 505-516.
[8] X. Liu, M. Mao et al., "A heterogeneous computing system with memristor-based neuromorphic accelerators," in IEEE High Performance Extreme Computing (HPEC), 2014.
[9] S. Yu, B. Gao et al., "A low energy oxide-based electronic synaptic device for neuromorphic visual systems with tolerance to device variation," Advanced Materials, 2013.
[10] M. Hu, H. Li et al., "Hardware realization of BSB recall function using memristor crossbar arrays," in DAC, 2012, pp. 498-503.
[11] B. Li, Y. Wang et al., "ICE: inline calibration for memristor crossbar-based computing engine," in DATE, 2014, pp. 184-187.
[12] Y. Deng, H.-Y. Chen et al., "Design and optimization methodology for 3D RRAM arrays," in IEEE International Electron Devices Meeting (IEDM), 2013, pp. 9-11.
[13] W.-H. Tseng and P.-C. Chiu, "A 960MS/s DAC with 80dB SFDR in 20nm CMOS for multi-mode baseband wireless transmitter," in Symposium on VLSI Circuits Digest of Technical Papers. IEEE, 2014, pp. 1-2.
[14] J. Proesel, G. Keskin et al., "An 8-bit 1.5 GS/s flash ADC using post-manufacturing statistical selection," in CICC, 2010.
[15] C. M. Bishop, "Neural networks for pattern recognition," 1995.
[16] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.
[17] ITRS, "International technology roadmap for semiconductors," 2013.