Product How To: DFT strategy for ARM processor-based designs

Chris Allsup and Kun Young Chung - January 22, 2013

One of the most significant design trends of the decade is the widespread use of ARM® multicore
processors in systems-on-chip (SoCs). Designers’ ability to easily and cost-effectively employ
multiple, high-performance embedded processors as needed to meet the computational
requirements of the end application has helped fuel the explosive growth in mobile computing,
networking infrastructure, and digital home entertainment systems. But from the design-for-test
(DFT) perspective, is there a strategy for easily and cost-effectively testing multicore designs? A key
challenge is already emerging: as the number of processor cores increases, it has become
increasingly difficult to maintain high test quality without a requisite increase in cost stemming from
the need to allocate substantially more pins for digital test.

This is an important consideration at Samsung Electronics, which designs a variety of SoCs
containing ARM multicore processors. Simply adding more chip-level pins for testing conflicts with
packaging constraints and can potentially undermine other cost-saving techniques that rely on
utilizing fewer pins (see sidebar on multisite testing). What is needed instead is a DFT strategy
optimized for designs that use multicore processors—a strategy in which the architecture and
automation elements work in tandem to lower test cost without compromising test quality or
significantly increasing automatic test pattern generation (ATPG) runtime.

This article provides an example of an optimized DFT architecture, referred to as “shared I/O.” It is
enabled by Synopsys’ synthesis-based test solution, which has been used successfully in Samsung’s
multicore processor designs. The experience demonstrates that shared I/O is a better approach than
the standard DFT architecture for testing multicore designs since it reduces test costs by utilizing
fewer pins while providing the same or better test time reduction.

Dedicated test pins


Figure 1 shows a standard DFT architecture for a design with quad-core processors and user-defined
logic at the top level. The cores need not be identical but all blocks contain an embedded scan
compressor-decompressor (CODEC) to reduce the number of tester cycles required for achieving
high test coverage, which in turn reduces test cost [1, 2].

Figure 1. The standard DFT architecture for a quad-core design with CODECs embedded in
each core and at the top level. Because each CODEC uses its own dedicated pins, the
number of pins needed to test the design grows large as more processor cores are added.

The amount of compression implemented for a particular CODEC determines the number and length
of its scan chains, and is chosen to ensure an approximately uniform scan chain length L across all
the digital logic in the design. While more compression shortens the chain length and achieves
greater test time reduction, the amount of compression applied is constrained in practice by a
minimum number of scan inputs to the CODEC as well as routing considerations. Even so, as the
number of cores increases, it becomes essential to keep the number of scan I/O needed to test each
CODEC reasonably small to avoid exceeding the number of chip-level pins available for testing.
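
The relationship between chain length, chain count, and compression can be illustrated with a rough sketch. It assumes a simplified model in which the number of internal chains is the flop count divided by the target chain length L, and compression is the ratio of internal chains to scan inputs. The function name, block sizes, and pin counts are hypothetical and are not taken from the design described in this article.

```python
import math

def chain_plan(num_flops, target_chain_length, scan_inputs):
    """Return (num_chains, compression_ratio) for one CODEC.

    Simplified, assumed model: chains = ceil(flops / L) and
    compression = chains / scan_inputs. All names are hypothetical.
    """
    num_chains = math.ceil(num_flops / target_chain_length)
    compression = num_chains / scan_inputs
    return num_chains, compression

# Hypothetical block sizes; the same L is used everywhere so that
# chain length stays approximately uniform across the design.
blocks = {"core": 200_000, "top": 50_000}
L = 250
for name, flops in blocks.items():
    chains, ratio = chain_plan(flops, L, scan_inputs=4)
    print(name, chains, ratio)
```

In this toy model, a block with more flops needs proportionally more chains (and therefore more compression) to hold L constant for a fixed number of scan inputs.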

Shared test pins


The optimized DFT architecture is illustrated in Figure 2. The chip-level test pins are uniformly
connected to all the CODECs in the design, and an integration block added to the CODEC outputs
enables full observability of the scan chains. Assuming a design contains N CODECs, then log2(N)
test input pins are allocated to select which logic is being observed on the output pins.

Figure 2. The optimized architecture shares uniformly-connected test input pins and uses
integration logic to observe the CODEC outputs. The pin count increases by just log2(N)
with the number of processor cores, N.
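
The difference in pin growth between the two architectures can be sketched as simple arithmetic. The example below assumes every CODEC uses the same number of scan inputs and outputs, and it rounds log2(N) up to a whole number of select pins when N is not a power of two. Both are assumptions made for illustration, as are the function names and the per-CODEC pin counts.

```python
def dedicated_pins(num_codecs, scan_in, scan_out):
    """Standard architecture: every CODEC has its own scan I/O."""
    return num_codecs * (scan_in + scan_out)

def shared_pins(num_codecs, scan_in, scan_out):
    """Shared I/O: one set of scan I/O plus select pins.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 1; rounding up
    is an assumption for CODEC counts that are not powers of two.
    """
    select = (num_codecs - 1).bit_length()
    return scan_in + scan_out + select

# Hypothetical example: four cores plus top-level logic gives 5 CODECs,
# each with 6 scan inputs and 6 scan outputs.
for n in (2, 5, 9, 17):
    print(n, dedicated_pins(n, 6, 6), shared_pins(n, 6, 6))
```

Under these assumptions the dedicated pin count grows linearly with the number of CODECs, while the shared count grows only by the select pins.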

Synopsys’ DFTMAX™ compression was used to implement both DFT architectures for a 20-nm
design containing four identical ARM processor cores, one of the latest versions of the Cortex-A
series, plus user-defined logic. Only a few modifications to the original DFT scripts were required to
implement the optimized architecture. Table 1 compares test pin count and normalized TetraMAX®
ATPG stuck-at pattern count results for the standard versus shared I/O architectures using
equivalent chain length and high fault coverage for both scenarios:

Table 1. Shared I/O results in fewer ATPG patterns than the standard architecture even
when only half as many pins are used.

Despite consuming half the test pin resources, shared I/O required substantially fewer patterns (and
tester cycles)—30% fewer for the power-aware patterns, which are generated to avoid false failures
during production testing [3]. The decrease in pattern count can be explained in part by an increase
in ATPG efficiency that comes “for free” when the scan inputs are shared among multiple cores.
However, in this application, designers also used additional tools in the Synopsys product, enabled
when processor cores are identical, that enhance both pattern efficiency and the ability to isolate
defective parts.

Optimizations for identical cores
The select lines going into the integration block in Figure 2 control logic used to isolate
defective scan chains, making it possible to determine which of several faulty values belong to which
scan chains in which cores. The integration block also contains XOR trees that provide virtually the
same high observability as the standard approach that relies on dedicated connections to output
pins. When the processor cores are identical, it is possible to improve diagnostics accuracy using a
technique borrowed from image processing. “Swizzling” or rotating the order of a CODEC’s outputs
with respect to the order used in its neighboring CODEC, depicted in Figure 3, ensures that a fault
can be detected and isolated to a particular core.
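
Why rotation helps can be seen in a toy model. This is not the DFTMAX implementation: it assumes the integration block simply XORs corresponding CODEC outputs from each core onto shared pins, and that core c has its outputs rotated by c positions. The functions and error vectors are hypothetical.

```python
def rotate(bits, shift):
    """Rotate a list of bits right by `shift` positions."""
    shift %= len(bits)
    return bits[-shift:] + bits[:-shift] if shift else list(bits)

def observe(core_errors, swizzle):
    """XOR the per-core error bits onto the shared output pins.

    core_errors[c][j] is 1 if output j of core c differs from its
    expected value. With swizzle=True, core c's outputs are rotated by
    c positions before the XOR (an assumed rotation scheme).
    """
    pins = [0] * len(core_errors[0])
    for c, err in enumerate(core_errors):
        routed = rotate(err, c) if swizzle else err
        pins = [p ^ r for p, r in zip(pins, routed)]
    return pins

# The same defect in cores 0 and 1 flips output 0 of each core.
same_defect = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(observe(same_defect, swizzle=False))  # errors cancel in the XOR
print(observe(same_defect, swizzle=True))   # errors land on different pins

# One defect with the same signature, placed in core 1 or in core 2.
in_core1 = [[0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
in_core2 = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
print(observe(in_core1, swizzle=True), observe(in_core2, swizzle=True))
```

In this model, identical errors in two identical cores cancel when the outputs are combined in the same order, and a given error signature looks the same on the pins regardless of which core produced it. With rotation, neither happens for these inputs.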

Figure 3. “Swizzling” or port bit rotation is one of the techniques DFTMAX uses to improve
diagnostics and ATPG efficiency for designs utilizing identical processor cores.

As an added benefit, scan chain isolation and output port rotation give TetraMAX an enhanced
ability to manage unknown logic values (X’s) captured across multiple scan chains and cores and,
more generally, to improve fault coverage, pattern count, and runtime. Pattern generation efficiency
is improved further through the use of automation that focuses the ATPG effort on a single processor
core while fault simulating the entire set. These DFT and ATPG optimizations, applied in
combination, make it possible to trade off test pin count against test cycle count reduction to achieve
significant cost savings when utilizing multiple identical processor cores.

Flexible architecting
We have observed that the shared I/O strategy, with minor variations to the architecture of Figure 2,
holds up well as we scale the number of processor cores up or down. If a large portion of the logic in
a design is external to the cores, we can easily implement a hybrid arrangement wherein a subset of
the pins are shared among the identical cores while other pins are dedicated to the top-level
CODEC. When testing this “mixed-shared” variant on the same quad-core design using 33 pins, we
found the pattern counts decreased by about 15% compared with the standard architecture of
Figure 1.

In contrast, when the processor core count is relatively high, the blocks are partitioned across the
SoC based on topology constraints that might be at odds with sharing all the scan inputs and outputs
among all the cores. For these large multi-processor designs, sharing subsets of pin groups among
subsets of identical cores as shown in Figure 4—a feature automated in DFTMAX—avoids routing
congestion and timing issues. For example, we recently implemented group-shared I/O for a large
design with many processor cores, a strategy that led to the same test quality benefits as the
traditional approach but utilized fewer than half the test pins. This freed up more ATE channels for
multi-site testing, which resulted in a 59% reduction in test execution cost per wafer.

Figure 4. Group-shared I/O reduces routing overhead for partitioned designs with many
identical processor cores.

Conclusion
A DFT strategy optimized for testing multicore processor designs is needed to achieve both high-
quality and cost-effective manufacturing test, especially as the number of cores per design
increases. The shared I/O architecture we have highlighted lowers the cost of testing ARM
processor-based designs and other multicore SoCs in two fundamental ways: First, it reduces the
number of test pins required for efficient compression of high-coverage test patterns, which
decreases packaging costs and facilitates deployment of other cost-saving methodologies such as
multisite testing. Second, it reduces ATPG pattern count, which decreases test cycle time and,
therefore, test execution cost. Automation in the test solution enables flexible implementation of
shared I/O architecture variants. Optimizations such as scan chain isolation and output port rotation,
embedded in the integration logic, improve diagnostics accuracy and facilitate greater ATPG
efficiency for making economical pin-count and pattern-count tradeoffs.

Sidebar: Multi-site Testing


Multi-site testing is a technique that reduces test time and cost by screening multiple dies
simultaneously. It uses an ATE system fitted with a probe card specifically designed for parallel load
operations. The maximum degree of parallelism depends on the number of ATE channels and the
number of pins per die available for testing. The cost saving from multisite testing increases with the
multi-site count (i.e., the degree of parallelism), M, according to:

where α is a constant. The exponential reflects the efficiency of the system, which declines as the
parallelism increases for a variety of reasons. For instance, if M is large, the probe card’s footprint
may not tile the wafer evenly as it steps across it, leading to more touchdowns than the number of
devices per wafer divided by M [4].

In many situations mixed signal, embedded memory, flash, and quiescent current testing are the
time bottlenecks, with scan testing consuming only a small portion of the total test time even though
it requires the most pins. When production volumes are very high, substantial cost savings can be
achieved by “sacrificing” some of these pins in the interests of parallelism that reduces total test
time even though the scan test time itself increases.
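
The trade-off in the preceding paragraph can be made concrete with a back-of-the-envelope sketch. It assumes ideal parallelism (it ignores the efficiency loss described above), treats the multi-site count as ATE channels divided by test pins per die, and assumes that halving the scan pins doubles the scan test time. All function names and numbers are hypothetical.

```python
import math

def max_sites(ate_channels, pins_per_die):
    """Upper bound on the multi-site count under ideal assumptions."""
    return ate_channels // pins_per_die

def wafer_test_time(dies_per_wafer, sites, scan_time, other_time):
    """Total time per wafer: touchdowns times the time per touchdown."""
    touchdowns = math.ceil(dies_per_wafer / sites)
    return touchdowns * (scan_time + other_time)

# Hypothetical numbers: 512 ATE channels, 1000 dies per wafer, and
# 9.0 s of non-scan test time (mixed signal, memory, and so on).
channels, dies, other = 512, 1000, 9.0

# 64 test pins per die with 1.0 s of scan time per die.
t_wide = wafer_test_time(dies, max_sites(channels, 64), 1.0, other)
# 32 test pins per die; scan time is assumed to double to 2.0 s.
t_narrow = wafer_test_time(dies, max_sites(channels, 32), 2.0, other)
print(t_wide, t_narrow)
```

With these made-up figures, giving up half the pins lengthens the scan test but still lowers the total time per wafer, because scan is a small share of the per-die test time.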

References
1. DFTMAX Compression Backgrounder, Fall 2009, Synopsys, Inc.
2. Allsup, C., “The Economics of Implementing Scan Compression to Reduce Test Data Volume and
Test Application Time,” Proc. International Test Conf., 2006.
3. Bahl, S.; Mattiuzzo, R.; Khullar, S.; Garg, A.; Graniello, S.; Abdel-Hafez, K.S.; Talluto, S., “State
of the Art Low Capture Power Methodology,” Proc. International Test Conf., 2011.
4. Kuntzsch, C.; Shah, M.; Mittermaier, N., “Massive Test Cost Reduction by Advanced SCAN
Testing,” SNUG Germany 2010 Proceedings.

About the Authors


Kun Young Chung is the DFT manager of the Design Technology Team at Samsung Electronics. He
received BS and MS degrees from Seoul National University in 1996 and 1998, and an MS and PhD in
electrical engineering from the University of Southern California in 2002 and 2008. Kun Young’s
research interests include delay testing, power-aware DFT, scan compression, and 3D IC testing.

Chris Allsup, marketing manager in Synopsys’ synthesis and test group, has more than 20 years of
combined experience in IC design, field applications, sales, and marketing. He earned a BSEE
degree from UC San Diego and an MBA degree from Santa Clara University. Chris has authored
numerous articles and papers on design and test.
