Optimization of NoC Wrapper Design Under Bandwidth and Test Time Constraints
1. Introduction
The NoC [1] provides abundant communication resources,
which makes the traditional approach of adding an extraneous
Test Access Mechanism (TAM) [2, 3] overkill. Several research groups have published works on NoC test scheduling
[4, 5, 6] utilizing the NoC as the test data transportation path
from external testers to the CUTs. Test scheduling for the
NoC router [6, 7] and crosstalk test of the interconnects [8]
have also been discussed. In these approaches, each CUT is
wrapped by an IEEE 1500 [9] compatible wrapper in order to
provide isolation and access during the test application.
Many NoC architectures have been proposed, such as SPIN
[10], Æthereal [11, 12], SoCIN [13], NOSTRUM [14], QNoC
[15], and HERMES [16]; all are based on synchronous communication between nodes. Several other types of NoCs, such
as CHAIN [17], NEXUS [18], and ANoC [19], are based on
Globally Asynchronous Locally Synchronous (GALS) communication. The large number of NoC architectures highlights the
growing interest in the NoC as a next-generation SoC interconnect.
With regard to NoC Design-for-Testability (DFT), the
authors in [20] presented an architecture called ANoC-TEST,
which targets the Asynchronous Network-on-Chip (ANoC)
[19]. In [21], the proposed NoC wrapper takes advantage of
2. NoC Model
The proposed wrapper utilizes the functional communication channel between a test source and a CUT. The delivery
channel can be a dedicated path or a transparent virtual channel. The wrapper is topology independent; it can be used for
any NoC architecture as long as a minimum sustainable bandwidth and latency are guaranteed during the test application
of the target CUT. The quality-of-service guarantees ensure
that the test data are available at the CUT at the right time. In
this paper, the Æthereal [11, 12] NoC is used to explain the
wrapper design and optimization.
The Æthereal NoC routers [11] provide both guaranteed
and best-effort services. The guaranteed throughput (GT)
router guarantees uncorrupted, lossless, and ordered data
transfer, as well as both latency and throughput over a finite time interval. It also implements a network interface (NI) [12],
consisting of an NI kernel and NI shells, which connects the network routers to
[Figure: NoC-based SoC with routers R0-R3, NI kernels (NIK), NI shells (NIS), Cores 1-6, I/O Ports 1 and 2, two ATE channels, and virtual channels.]
[Figure: IP core between the input-side and output-side network interfaces, showing PDI, PCI, PI, PDO, PCO, PO, and the internal scan chains.]
3. IP Core Model
IP core I/Os consist of primary inputs (PI), primary outputs (PO), scan inputs (SI) and scan outputs (SO). A subset
of the PIs can be categorized into primary data inputs (PDI)
and primary control inputs (PCI). Assuming that the CUT
communicates with the NoC by means of the AXI protocol
(Fig. 2), PDI would be made up of WDATA[31:0] signals,
while PCI consists of ADDR[31:0], AVALID, DLAST, and
DVALID signals. Some PO signals can also be categorized
[Figure 2: AXI timing diagram over clock cycles T0-T6, showing CLK, ADDR[31:0], AVALID, WDATA[31:0] (D(A0), D(A1), D(A2)), DVALID, DLAST, and BRESP[1:0] (OK).]
(1)
[Figure: wrapper scan chain diagram between the PDI port and the PDO port, with a legend for the shift path.]
The proposed Type 1 wrapper uses the same approach as in
[2, 3] when forming the wrapper scan chains, which minimizes
the test application time (TAT). For a given number of wrapper scan chains,
nsc, and a given PDI bit-width, the number of PDI bits that
can be used to carry the test data for each wrapper scan chain
is given by equation (2) [21]. To differentiate these PDI
bits, those that can carry the test data are called input data
boundary cells, IDBC (shaded black in Fig. 5). If the PDI bit-width
is not an integer multiple of nsc (Eqn. (3)), some PDI bits cannot be used to carry the test data.
A similar analysis can be done for the output data boundary
cells (ODBC), resulting in equations (4) and (5).
(2)
(3)
(4)
(5)
For the CUT with 8-bit PDI/PDOs and three wrapper scan
chains (Fig. 5), equation (2) evaluates to two (the integer part of 8/3), which means that each
wrapper scan chain is interfaced to two IDBC/ODBC cells.
In addition, equation (3) evaluates to two (8 - 3 x 2), which means that the
remaining two PDI/PDO bits cannot be used to carry the test
data. These unused PDI/PDO bits become part of the wrapper
scan chain, with no extra functionality. Since the PDI and PDO bit-widths are equal in
typical cases, the following discussion on the PDI on the input
port also applies to the PDO on the output port.
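The allocation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' formulation: the function name `idbc_allocation` and its parameter names are hypothetical, and the arithmetic (integer division plus leftover bits) is inferred from the worked example of an 8-bit port with three wrapper scan chains.

```python
def idbc_allocation(port_width: int, n_sc: int) -> tuple[int, int]:
    """Return (data cells per wrapper scan chain, unused port bits).

    Assumption: each chain gets the same whole number of data boundary
    cells, and the leftover port bits carry no test data.
    """
    if port_width < 0 or n_sc <= 0:
        raise ValueError("port_width must be >= 0 and n_sc must be > 0")
    cells_per_chain = port_width // n_sc
    unused_bits = port_width - n_sc * cells_per_chain
    return cells_per_chain, unused_bits


# Worked example from the text: 8-bit PDI/PDO, three wrapper scan chains.
print(idbc_allocation(8, 3))  # (2, 2)
```

The same function applies to the output side (ODBC) by passing the PDO bit-width.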
During the test application, IDBC cells are loaded with the
test data in one clock cycle, in the normal operation mode
(refer to Fig. 5). The IDBC cells change into the test mode,
during which the test data are serially shifted for two clock
cycles to empty the contents into the scan chains. After completion, the IDBC cells change again into the normal mode to
capture the next incoming data from the PDI port. This operation is controlled by a test controller which keeps track of
the number of loads and shifts using counters [23].
For the NoC wrapper with a scan-in depth of nine (Fig. 5),
after four repetitions of loads and shifts, the first eight bits of
each scan chain are loaded with the test data. To load the
last bit, the IDBC cells are loaded with new test data and a
single shift clock is applied. However, before applying the
capture cycle, the IDBC must also be loaded with valid test
data. After the last single shift, only part of the IDBC cells
contain valid test data, and reloading the IDBC data from the PDI
port can corrupt the valid data currently in the IDBC cells.
To overcome this problem, the first shift
cycles of every test pattern must shift dummy bits into the
scan chains. After the scan chains are completely loaded, another clock cycle is required to load the IDBC cells with valid
test data before applying the capture cycle. Here, a formal
definition of a new term based on this new scheme is
given.
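One reading of the load/shift scheme above can be sketched as follows. The expression for the number of dummy shift cycles is not legible in the text, so the padding rule used here (pad the scan-in depth up to a whole number of load/shift rounds) is an assumption, as are the function and key names. The sketch only counts rounds and dummy shifts; it does not reproduce equations (6) to (8).

```python
def type1_scan_in_schedule(scan_in_depth: int, cells_per_chain: int) -> dict:
    """Count load/shift rounds and dummy shifts for one test pattern.

    Assumption: every load fills `cells_per_chain` IDBC cells per chain and
    is followed by that many shift cycles, so the depth is padded with
    leading dummy bits up to a multiple of `cells_per_chain`.
    """
    if scan_in_depth < 0 or cells_per_chain <= 0:
        raise ValueError("depth must be >= 0 and cells_per_chain must be > 0")
    rounds = (scan_in_depth + cells_per_chain - 1) // cells_per_chain
    dummy_shifts = rounds * cells_per_chain - scan_in_depth
    return {
        "load_shift_rounds": rounds,
        "dummy_shift_cycles": dummy_shifts,
        "final_idbc_loads": 1,  # the extra load before the capture cycle
    }


# Example from the text: scan-in depth of nine, two IDBC cells per chain.
print(type1_scan_in_schedule(9, 2))
```

With a depth of nine and two cells per chain, this gives five load/shift rounds, one of whose shifts carries a dummy bit.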
[Definition] The Scan-in (scan-out) elements for the Type
1 NoC wrapper consist of the unused IDBC (ODBC) cells,
bidirectional cells, and internal scan chains (i.e. excluding all
the IDBC/ODBC cells). The maximum scan-in and scan-out
depths are denoted by and , respectively (Fig. 5).
As a result of the new test scheme, the number of shift
cycles required for the Type 1 NoC wrapper is summarized by
equations (6) and (7). Equation (8) gives the total TAT, where
the additional clock cycle represents the final load of the IDBC data
prior to the capture cycle. For the NoC wrapper in Fig. 5,
clock cycles, smaller than
based on
equation (1).
(6)
(7)
(8)
[Figure: required bandwidth (Mbps) versus the number of wrapper scan chains (8 to 64), shown together with the scan bandwidth.]
(9)
Section 4.2 has shown that the Type 1 wrapper is inefficient
in terms of bandwidth utilization. The Type 2 NoC wrapper in
Fig. 7 is designed to complement the Type 1 wrapper in this
respect. Extra load/shift registers are added to the PDI/PDO
ports, similar to the buffer architecture in [23] for the reuse
of the SoC's functional bus and the bandwidth matching registers in [24]. The load/shift registers translate the PDI bit-width into the number of wrapper scan chains using parallel-serial shift registers. As a result, the required NoC bandwidth
matches the scan bandwidth. The TAT for the Type 2 NoC
wrapper is also the same as that of the TAM-based wrappers in equation (1). This is achieved at the cost of the area overhead of the
load/shift registers and a more complex control scheme to realize the bit-width conversion. Therefore, it is important that
the Type 2 wrapper is used only when necessary. The next
section looks at two proposed optimization schemes.
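The bit-width conversion can be pictured with a small behavioural model. This is only a sketch of the idea, not the register-level design of the Type 2 wrapper: the generator name `repack_bits`, the bit-list representation of PDI words, and the zero padding of the final slice are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def repack_bits(words: Iterable[List[int]], n_sc: int) -> Iterator[List[int]]:
    """Yield one n_sc-bit slice per shift cycle from a stream of PDI words.

    Every incoming bit is used, so the delivered bandwidth equals the scan
    bandwidth (n_sc bits per shift cycle).
    """
    if n_sc <= 0:
        raise ValueError("n_sc must be > 0")
    buffer: List[int] = []
    for word in words:
        buffer.extend(word)
        while len(buffer) >= n_sc:
            yield buffer[:n_sc]
            del buffer[:n_sc]
    if buffer:
        # Assumption: pad the last partial slice with zeros.
        yield buffer + [0] * (n_sc - len(buffer))


# Three 8-bit words feed three wrapper scan chains for eight shift cycles.
slices = list(repack_bits([[1] * 8, [0] * 8, [1] * 8], n_sc=3))
print(len(slices))
```

In contrast to the Type 1 example with an 8-bit port and three chains, no port bits are left unused in this model.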
[Figure 7: Type 2 NoC wrapper. Legend: load/shift register element, normal boundary cell, internal scan chain.]
[Figure: bandwidth versus time allocation for Cores 1 to 4, with bandwidth levels B1 and B2 and time T2.]
Table 1. TAT (clock cycles) of the TAM-based and Type 1 NoC wrappers (Core 6)
nsc      TAM [2, 3]   Type 1 NoC   % increase
13       451,577      452,452       0.19%
14       451,358      451,576       0.05%
15       447,197      448,072       0.20%
16-19    341,858      342,076       0.06%
20-21    337,478      338,134       0.19%
22       333,317      333,754       0.13%
23       231,478      231,258      -0.10%
24-38    227,978      228,196       0.10%
39-42    223,598      223,816       0.10%
43-45    219,218      219,436       0.10%
46       115,848      115,847       0.00%
47-64    114,317      114,535       0.19%
value for
. The progression of the binary search is graphically illustrated in Fig. 9. As a result, the Type 1 wrapper achieves
a TAT of 65,098 clock cycles.
For the Type 2 wrapper, the maximum nsc is directly calculated since
Breq is a linear function of nsc. Binary search in step 2
(similar to the Type 1 wrapper) results in a
TAT of 32,766 clock cycles, clearly a better result for the
Type 2 wrapper. In this case (Bmax = 5.6E+09 bps, Fig. 9), the Type
1 wrapper is unable to utilize the allocated bandwidth efficiently because of the constraint in its I/O architecture.
A similar heuristic is implemented for a given Tmax, and some selected cases for both algorithms are presented in Sect. 6.
6. Experimental Results
In order to evaluate the effectiveness of the proposed
methodology, we have conducted experiments on three IP
cores. Core 17 and core 6 (the largest of the p93791 circuit)
from the ITC'02 benchmark [25] are selected in order to offer
comparisons with TAM-based approaches [2, 3]. Another IP
core, an example core from [21], allows some comparison
with an NoC wrapper to be offered.
A TAT comparison between the proposed Type 1 NoC
wrapper and TAM-based approaches is given in Table 1 for
core 6. In all cases, the differences are
less than 0.25%; the proposed Type 1 NoC wrapper
does not incur a noticeable penalty on the TAT. In fact, small reductions are achieved for
nsc = 23 and nsc = 46 scan chains. For
the Type 2 NoC wrapper, the TAT is the same as in the TAM-based approach because the added interface between the CUT
and the NoC port does not constrain the scan chain design.
The Type 2 wrapper's required bandwidth matches the scan
bandwidth, an improvement due to the extra load/shift registers.
For the circuit from [21], the TAT is given in Table 2. Compared to the TAM-based wrapper, the proposed Type 1 NoC
wrapper is better for smaller numbers of wrapper scan chains.
For larger numbers of wrapper scan chains, the TATs are about 3% longer. However, compared to the NoC wrapper design in [21] (footnote 1), the Type
1 wrapper is always superior.
1 Based
Table 2. TAT (clock cycles) for the circuit from [21]
nsc   TAM [2, 3]   Amory [21] (1)   Proposed Type 1   Proposed Type 2   % increase Type 1 / TAM   % increase Type 1 / Amory
1     5,532        5,532            5,300             5,532             -4.19%                    -4.19%
2     2,771        2,771            2,660             2,771             -4.01%                    -4.01%
3     1,858        1,858            1,780             1,858             -4.20%                    -4.20%
4     1,396        1,451            1,428             1,396              2.29%                    -1.59%
5     1,363        1,429            1,406             1,363              3.15%                    -1.61%
6     1,363        1,418            1,395             1,363              2.35%                    -1.62%
Further, we implemented the wrapper architecture proposed in [21] and compared the results (Table 3) for a
larger IP core (core 6 of p93791). The TAT and the required
bandwidth, Breq (column 3), are obtained for selected nsc
(column 1). Using Bmax (column 4), set equal to that Breq, as input
to the bandwidth-constrained heuristic, the corresponding
nsc, Breq, and TAT for the proposed Type 2 wrapper are obtained. Using at most the bandwidth required by [21], the proposed wrapper gives shorter
TATs. For
nsc = 24 scan chains (last row), [21] requires 33%
more bandwidth to obtain a comparable TAT.
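The Type 2 entries of Table 3 can be reproduced with a short sketch of a two-step search. The description of the heuristic itself is not fully legible here, so this is a reconstruction under stated assumptions: the Type 2 TAT equals the TAM column of Table 1; Breq is 100 Mbps per wrapper scan chain, as implied by the Type 2 rows of Tables 3 and 4; and step 2 is read as a binary search for the smallest nsc that still reaches the TAT of the largest affordable nsc. All names are hypothetical.

```python
# First nsc of each TAT plateau for core 6, copied from the TAM column of
# Table 1 (the Type 2 TAT is stated to equal the TAM-based TAT).
PLATEAUS = [(13, 451_577), (14, 451_358), (15, 447_197), (16, 341_858),
            (20, 337_478), (22, 333_317), (23, 231_478), (24, 227_978),
            (39, 223_598), (43, 219_218), (46, 115_848), (47, 114_317)]
N_MIN, N_MAX = 13, 64
RATE_MBPS = 100  # assumption: Breq = 100 Mbps x nsc, as in Tables 3 and 4


def tat(n_sc: int) -> int:
    """TAT in clock cycles for a given number of wrapper scan chains."""
    if not N_MIN <= n_sc <= N_MAX:
        raise ValueError("n_sc outside the range covered by Table 1")
    return [cycles for start, cycles in PLATEAUS if start <= n_sc][-1]


def optimize_type2(b_max_mbps: int) -> tuple[int, int, int]:
    """Return (nsc, Breq in Mbps, TAT) for a bandwidth limit."""
    n_hi = min(N_MAX, int(b_max_mbps // RATE_MBPS))  # step 1: direct
    if n_hi < N_MIN:
        raise ValueError("bandwidth limit too small for this data set")
    target = tat(n_hi)
    lo, hi = N_MIN, n_hi
    while lo < hi:  # step 2: smallest nsc with the same TAT
        mid = (lo + hi) // 2
        if tat(mid) <= target:
            hi = mid
        else:
            lo = mid + 1
    return lo, lo * RATE_MBPS, target


print(optimize_type2(1600))  # (16, 1600, 341858)
print(optimize_type2(3200))  # (24, 2400, 227978)
```

Both outputs agree with the corresponding Type 2 rows of Table 3. The first row (Bmax = 1,280) is not covered because Table 1 starts at nsc = 13.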
Table 4 compares the Type 1 and Type 2 wrappers when
Bmax and Tmax are applied. For the first Bmax constraint, both
wrappers result in similar performance, with a slight advantage
for Type 1 in terms of area overhead. At the second Bmax constraint,
Type 2 is clearly the winner, with only 7.8% bandwidth overhead to achieve a 32.5% TAT reduction. For
Tmax = 70,000, Type 2 requires 31% less bandwidth
with less than 0.7% TAT overhead. On the other hand, at
Tmax = 200,000, the Type 1 wrapper is superior due to its minimal wrapper hardware overhead. The results illustrate the
tradeoffs between the two types of NoC wrappers for a given
constraint, which can be explored during the test schedule optimization.
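The percentages quoted for Table 4 can be checked directly from the table entries. The sketch below is plain arithmetic on values copied from Table 4; the helper names and the row labels are hypothetical.

```python
def pct_change(new: float, ref: float) -> float:
    """Percentage change of `new` relative to `ref`."""
    return 100.0 * (new - ref) / ref


# (Breq in Mbps, TAT in clock cycles), copied from Table 4.
SECOND_BMAX_ROW = {"type1": (2133, 96129), "type2": (2300, 64882)}
TMAX_70000_ROW = {"type1": (3200, 65098), "type2": (2200, 65530)}


def type2_vs_type1(row: dict) -> tuple[float, float]:
    """Return (% change in Breq, % change in TAT) of Type 2 versus Type 1."""
    b1, t1 = row["type1"]
    b2, t2 = row["type2"]
    return pct_change(b2, b1), pct_change(t2, t1)


for name, row in [("second Bmax row", SECOND_BMAX_ROW),
                  ("Tmax = 70,000", TMAX_70000_ROW)]:
    d_bw, d_tat = type2_vs_type1(row)
    print(f"{name}: bandwidth {d_bw:+.1f}%, TAT {d_tat:+.1f}%")
```

For the second bandwidth-constrained row this gives about 7.8% more bandwidth and a 32.5% shorter TAT for Type 2; for Tmax = 70,000 it gives about 31% less bandwidth and a 0.7% longer TAT.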
7. Conclusion
We have proposed two versions of the NoC wrapper that
require minimal overhead in test application time and
area. The previously proposed wrapper design did
not handle the problem of inefficient bandwidth utilization.
In this paper, we have proposed two heuristics that find the
optimal wrapper design for a given maximum bandwidth or
maximum test application time, which is important for test schedule
optimization.
Table 3. TAT comparison with [21] (Core 6)
Amory [21]                          Bmax     Proposed (Type 2)                   % incr.
nsc   TAT        Breq (Mbps)        (Mbps)   nsc   Breq (Mbps)   TAT             TAT
11    562,172    1,280              1,280    12    1,200         455,738         -18.9%
15    448,073    1,600              1,600    16    1,600         341,858         -23.7%
22    333,755    3,200              3,200    24    2,400         227,978         -31.7%
24    228,416    3,200              3,200    24    2,400         227,978          -0.2%
Table 4. Comparison of the Type 1 and Type 2 wrappers under Bmax and Tmax constraints
Constraint         Type 1                           Type 2
                   nsc   Breq (Mbps)   TAT          nsc   Breq (Mbps)   TAT
Bmax               15    1,600          97,648      15    1,500          97,215
Bmax               21    2,133          96,129      23    2,300          64,882
Tmax = 70,000      22    3,200          65,098      22    2,200          65,530
Tmax = 200,000      8      800         193,128       8      800         192,912
[Figure 9 plot: TAT (clock cycles, log scale from 1.E+04 to 1.E+06) and required bandwidth (bps) versus nsc (1 to 64) for the Type 1 and Type 2 wrappers, with Bmax = 5.6E+09 and search steps 1 and 2 marked.]
Figure 9. Optimization of NoC wrapper design for a given Bmax. In step 2 (Type 1), the dotted lines represent the
search space, which halves in every progression of the binary search.
The proposed wrapper does not incur large test time overhead (against TAM-based designs) for the same number of
wrapper scan chains (about 3% for a very small circuit, and
less than 0.25% for larger circuits). The wrappers scale well
for large circuits. The advantage of the proposed wrapper is
that NoC reuse is possible with only small test time overhead. With additional allowances on the area overhead, the
proposed wrapper (Type 2) can efficiently utilize the NoC
bandwidth with zero overhead on the test application time.
Acknowledgements
This work was supported in part by Japan Society for the
Promotion of Science (JSPS) under Grants-in-Aid for Scientific Research B (No.15300018) and for Young Scientists B
(No.18700046).
References
[1] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, 35(1), pp. 70-80, 2002.
[2] V. Iyengar, K. Chakrabarty, and E. J. Marinissen, "Test Wrapper and Test Access Mechanism Co-Optimization for System-on-Chip," Journal of Electronic Testing: Theory and Applications, 18, pp. 213-230, 2002.
[3] S. K. Goel and E. J. Marinissen, "SoC Test Architecture Design for Efficient Utilization of Test Bandwidth," ACM Trans. Design Automation of Electronic Systems, Vol. 8, No. 4, Oct. 2003, pp. 399-429.
[4] E. Cota, L. Carro, and M. Lubaszewski, "Reusing an On-Chip Network for the Test of Core-Based Systems," ACM Trans. Design Automation of Electronic Systems, Vol. 9, No. 4, Oct. 2004, pp. 471-499.
[5] A. M. Amory, E. Cota, M. Lubaszewski, and F. G. Moraes, "Reducing Test Time With Processor Reuse in Network-on-Chip Based Systems," In Proc. Integrated Circuits and Systems Design, 2004, pp. 111-116.
[6] C. Liu, Z. Link, and D. K. Pradhan, "Reuse-Based Test Access and Integrated Test Scheduling for Network-on-Chip," In Proc. Design, Automation and Test in Europe, 2006, pp. 303-308.
[7] A. M. Amory, E. Briao, E. Cota, M. Lubaszewski, and F. G. Moraes, "A Scalable Test Strategy for Network-on-Chip Routers," In Proc. IEEE International Test Conference, 2005, pp. 591-599.
[8] C. Grecu, P. Pande, A. Ivanov, and R. Saleh, "BIST for Network-on-Chip Interconnect Infrastructure," VLSI Test Symposium, 2006, pp. 30-35.
[9] E. J. Marinissen, R. Kapur, M. Lousberg, T. McLaurin, M. Ricchetti, and Y. Zorian, "On IEEE P1500 standard for embedded core test," Journal of Electronic Testing: Theory and Applications, 2002, pp. 365-383.