DFT Strategy For Arm Cores
Testing ARM processor-based designs
Chris Allsup, Kun Chung - January 22, 2013
One of the most significant design trends of the decade is the widespread use of ARM® multicore
processors in systems-on-chip (SoCs). Designers’ ability to easily and cost-effectively employ
multiple, high-performance embedded processors as needed to meet the computational
requirements of the end application has helped fuel the explosive growth in mobile computing,
networking infrastructure, and digital home entertainment systems. But from the design-for-test
(DFT) perspective, is there a strategy for easily and cost-effectively testing multicore designs? A key
challenge is already emerging: as the number of processor cores increases, it has become
increasingly difficult to maintain high test quality without a requisite increase in cost stemming from
the need to allocate substantially more pins for digital test.
This article provides an example of an optimized DFT architecture, referred to as “shared I/O.” It is
enabled by Synopsys’ synthesis-based test solution, which has been used successfully in Samsung’s
multicore processor designs. The experience demonstrates that shared I/O is a better approach than
the standard DFT architecture for testing multicore designs since it reduces test costs by utilizing
fewer pins while providing the same or better test time reduction.
The amount of compression implemented for a particular CODEC determines the number and length
of its scan chains, and is chosen to ensure an approximately uniform scan chain length L across all
the digital logic in the design. While more compression shortens the chain length and achieves
greater test time reduction, the amount of compression applied is constrained in practice by a
minimum number of scan inputs to the CODEC as well as routing considerations. Even so, as the
number of cores increases, it becomes essential to keep the number of scan I/O needed to test each
CODEC reasonably small to avoid exceeding the number of chip-level pins available for testing.
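The trade-off described above can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses illustrative numbers that are assumptions, not figures from this article: it approximates scan test cycles as patterns times (chain length plus one capture cycle), where the uniform chain length L is the flop count divided by the number of chains.

```python
import math

def scan_test_cycles(flops: int, chains: int, patterns: int) -> int:
    """Approximate scan test cycles: each pattern shifts through the
    longest chain, plus one capture cycle per pattern."""
    chain_length = math.ceil(flops / chains)  # uniform chain length L
    return patterns * (chain_length + 1)

# Illustrative numbers only: 400k flops, 1,000 patterns.
baseline = scan_test_cycles(400_000, chains=16, patterns=1_000)      # 16 external chains
compressed = scan_test_cycles(400_000, chains=1_600, patterns=1_000)  # 100x more internal chains
print(baseline, compressed, round(baseline / compressed, 1))
```

As the example shows, multiplying the chain count shortens L proportionally and cuts shift cycles by roughly the same factor, which is why compression ratios are bounded mainly by the minimum scan-input count and routing, as noted above.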
Figure 2. The optimized architecture shares uniformly-connected test input pins and uses
integration logic to observe the CODEC outputs. The pin count increases by just log2(N)
with the number of processor cores, N.
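The log2(N) scaling in Figure 2 can be sketched as follows. The scan-in and scan-out counts per CODEC are illustrative assumptions; the point is the growth rate of pin count with core count under each architecture.

```python
import math

def standard_pins(cores: int, scan_in: int, scan_out: int) -> int:
    # Standard architecture: every core's CODEC gets dedicated scan I/O,
    # so pin count grows linearly with N.
    return cores * (scan_in + scan_out)

def shared_pins(cores: int, scan_in: int, scan_out: int) -> int:
    # Shared I/O: inputs fan out to all cores and outputs merge through
    # the integration logic; only the select lines grow, by log2(N).
    return scan_in + scan_out + math.ceil(math.log2(cores))

for n in (2, 4, 8):
    print(n, standard_pins(n, scan_in=8, scan_out=8), shared_pins(n, scan_in=8, scan_out=8))
```

With these assumed per-CODEC widths, a quad-core design needs 64 dedicated pins under the standard scheme but only 18 under shared I/O, and doubling the core count adds just one select pin.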
Synopsys’ DFTMAX™ compression was used to implement both DFT architectures for a 20-nm
design containing four identical ARM processor cores, one of the latest versions of the Cortex-A
series, plus user-defined logic. Only a few modifications to the original DFT scripts were required to
implement the optimized architecture. Table 1 compares test pin count and normalized TetraMAX®
ATPG stuck-at pattern count results for the standard versus shared I/O architectures using
equivalent chain length and high fault coverage for both scenarios:
Table 1. Shared I/O results in fewer ATPG patterns than the standard architecture even
when only half as many pins are used.
Despite consuming half the test pin resources, shared I/O required substantially fewer patterns (and
tester cycles)—30% fewer for the power-aware patterns, which are generated to avoid false failures
during production testing [3]. The decrease in pattern count can be explained in part by an increase
in ATPG efficiency that comes “for free” when the scan inputs are shared among multiple cores.
However, in this application, designers also used additional tools in the Synopsys product, enabled
when processor cores are identical, that enhance both pattern efficiency and the ability to isolate
defective parts.
Optimizations for identical cores
The select lines going into the integration block's control logic in Figure 2 are used to isolate
defective scan chains, making it possible to determine which faulty values belong to which scan
chains in which cores. The integration block also contains XOR trees that provide virtually the
same high observability as the standard approach that relies on dedicated connections to output
pins. When the processor cores are identical, it is possible to improve diagnostics accuracy using a
technique borrowed from image processing. “Swizzling” or rotating the order of a CODEC’s outputs
with respect to the order used in its neighboring CODEC, depicted in Figure 3, ensures that a fault
can be detected and isolated to a particular core.
Figure 3. “Swizzling” or port bit rotation is one of the techniques DFTMAX uses to improve
diagnostics and ATPG efficiency for designs utilizing identical processor cores.
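The two observation modes described above can be illustrated with a toy model. This is an assumption-level sketch, not DFTMAX's actual implementation: in compact mode the integration logic XOR-merges all core outputs, and when a mismatch is flagged, the select lines bypass the XOR tree to read one core at a time and pinpoint the defective one.

```python
from functools import reduce
from operator import xor

class IntegrationBlock:
    """Toy model of the integration logic: XOR-merge core scan outputs
    in compact mode, or bypass to a single core chosen by the select
    lines in isolation mode."""

    def __init__(self, num_cores: int):
        self.num_cores = num_cores

    def observe(self, core_outputs, select=None):
        # core_outputs: per-core scan-out bit vectors, modeled as ints.
        if select is None:                  # compact mode: XOR tree
            return reduce(xor, core_outputs)
        return core_outputs[select]         # isolation mode

# A defect in core 2 flips one output bit relative to the expected value.
expected = 0b1010
outputs = [expected, expected, expected ^ 0b0100, expected]
blk = IntegrationBlock(4)
assert blk.observe(outputs) != expected     # XOR tree flags a mismatch
faulty = [i for i in range(4) if blk.observe(outputs, select=i) != expected]
print(faulty)  # [2]
```

The XOR tree preserves observability cheaply (any single flipped bit propagates to the merged output), while the select-line bypass recovers the per-core resolution that dedicated output pins would otherwise provide.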
As an added benefit, scan chain isolation and output port rotation give TetraMAX an enhanced
ability to manage unknown logic values (X's) captured across multiple scan chains and cores and,
more generally, to improve fault coverage, pattern count, and runtime. Pattern generation efficiency
is improved further through the use of automation that focuses the ATPG effort on a single processor
core while fault simulating the entire set. These DFT and ATPG optimizations, applied in
combination, make it possible to trade off test pin count against test cycle count reduction to achieve
significant cost savings when utilizing multiple identical processor cores.
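The "ATPG on one core, fault-simulate all" flow mentioned above can be caricatured in a few lines. Everything here is a stand-in (the fault names and one-pattern-per-fault ATPG are hypothetical), but it shows why identical cores multiply the coverage credit earned per generated pattern.

```python
def atpg_one_core(core0_faults):
    # Stand-in for real ATPG: one generated pattern per target fault
    # in a single reference core.
    return {fault: f"pat_{fault}" for fault in core0_faults}

def fault_simulate_all(patterns_by_fault, num_cores):
    # Because the cores are identical and share scan inputs, each
    # pattern exercises the same fault site in every core, so fault
    # simulation credits the corresponding fault in all of them.
    detected = set()
    for fault, pattern in patterns_by_fault.items():
        for core in range(num_cores):
            detected.add((core, fault))
    return detected

pats = atpg_one_core(["U1/A_sa0", "U7/Z_sa1"])
covered = fault_simulate_all(pats, num_cores=4)
print(len(pats), len(covered))  # 2 patterns cover 8 fault instances
```

ATPG effort stays constant as cores are added, while the detected-fault count scales with N, which is one source of the pattern-count savings reported in Table 1.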
Flexible architecting
We have observed that the shared I/O strategy, with minor variations to the architecture of Figure 2,
holds up well as we scale the number of processor cores up or down. If a large portion of the logic in
a design is external to the cores, we can easily implement a hybrid arrangement wherein a subset of
the pins are shared among the identical cores while other pins are dedicated to the top-level
CODEC. When testing this “mixed-shared” variant on the same quad-core design using 33 pins, we
found the pattern counts decreased by about 15% compared with the standard architecture of
Figure 1.
In contrast, when the processor core count is relatively high, the blocks are partitioned across the
SoC based on topology constraints that might be at odds with sharing all the scan inputs and outputs
among all the cores. For these large multi-processor designs, sharing subsets of pin groups among
subsets of identical cores as shown in Figure 4—a feature automated in DFTMAX—avoids routing
congestion and timing issues. For example, we recently implemented group-shared I/O for a large
design with many processor cores, a strategy that led to the same test quality benefits as the
traditional approach but utilized fewer than half the test pins. This freed-up more ATE channels for
multi-site testing that resulted in a 59% reduction in test execution cost per wafer.
Figure 4. Group-shared I/O reduces routing overhead for partitioned designs with many
identical processor cores.
Conclusion
A DFT strategy optimized for testing multicore processor designs is needed to achieve both high-
quality and cost-effective manufacturing test, especially as the number of cores per design
increases. The shared I/O architecture we have highlighted lowers the cost of testing ARM
processor-based designs and other multicore SoCs in two fundamental ways: First, it reduces the
number of test pins required for efficient compression of high-coverage test patterns, which
decreases packaging costs and facilitates deployment of other cost-saving methodologies such as
multisite testing. Second, it reduces ATPG pattern count, which decreases test cycle time and,
therefore, test execution cost. Automation in the test solution enables flexible implementation of
shared I/O architecture variants. Optimizations such as scan chain isolation and output port rotation,
embedded in the integration logic, improve diagnostics accuracy and facilitate greater ATPG
efficiency for making economical pin-count and pattern-count tradeoffs.
where α is a constant. The exponential reflects the efficiency of the system, which declines as
parallelism increases for a variety of reasons. For instance, if M is large, the step pattern over a
wafer may not fit within the probe card's footprint, leading to more touchdowns than the number of
devices per wafer divided by M [4].
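The equation this passage refers to is not reproduced in this excerpt. Purely as a hedged illustration, not the article's formula, one common model of this general shape scales per-device test cost as cost(1)/M^α with 0 < α < 1: perfect M-site parallelism would give α = 1, and the efficiency losses described above (extra touchdowns, shared tester resources) pull α below 1.

```python
def cost_per_device(single_site_cost: float, sites: int, alpha: float) -> float:
    # Assumed model: cost falls sublinearly with site count M because
    # parallel-test efficiency degrades as M grows (alpha < 1).
    return single_site_cost / sites ** alpha

# Illustrative alpha = 0.8; costs normalized to single-site = 1.0.
for m in (1, 4, 16):
    print(m, round(cost_per_device(1.0, m, alpha=0.8), 3))
```

Under this assumed model, quadrupling the site count cuts per-device cost to roughly a third rather than a quarter, which is the kind of diminishing return the constant α captures.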
In many situations, mixed-signal, embedded memory, flash, and quiescent current testing are the
time bottlenecks, with scan testing consuming only a small portion of the total test time even though
it requires the most pins. When production volumes are very high, substantial cost savings can be
achieved by “sacrificing” some of these pins in the interests of parallelism that reduces total test
time even though the scan test time itself increases.
References
1. DFTMAX Compression Backgrounder, Fall 2009, Synopsys, Inc.
2. Allsup, C., “The Economics of Implementing Scan Compression to Reduce Test Data Volume and
Test Application Time,” Proc. International Test Conf., 2006.
3. Bahl, S.; Mattiuzzo, R.; Khullar, S.; Garg, A.; Graniello, S.; Abdel-Hafez, K.S.; Talluto, S., “State
of the Art Low Capture Power Methodology,” Proc. International Test Conf., 2011.
4. Kuntzsch, C.; Shah, M.; Mittermaier, N., “Massive Test Cost Reduction by Advanced SCAN
Testing,” SNUG Germany 2010 Proceedings.
Chris Allsup, marketing manager in Synopsys’ synthesis and test group, has more than 20 years
combined experience in IC design, field applications, sales, and marketing. He earned a BSEE
degree from UC San Diego and an MBA degree from Santa Clara University. Chris has authored
numerous articles and papers on design and test.