PHY Link Design and Optimization For High-Speed Low-Power Communication System
Yuan Fang
Dissertation approved by Department 18,
Electrical Engineering and Information Technology,
of Technische Universität Darmstadt
in fulfillment of the requirements for the academic degree of
Doktor-Ingenieurin (Dr.-Ing.)
by
Dipl.-Ing.
Yuan Fang
born on 16 October 1983
in Shanghai, China
D17
Darmstädter Dissertationen 2015
Declaration according to §9 of the doctoral regulations (PromO)
I hereby declare that I have written this dissertation on my own, using only the
literature cited. The work has not previously been submitted for examination purposes.
Dedicated to my lovely parents
Acknowledgment
This thesis is based on work that I started in July 2010 at Fachgebiet Integrierte
Elektronische Systeme, Institut für Datentechnik, Fachbereich Elektrotechnik und
Informationstechnik, Technische Universität Darmstadt, first as a scholarship student for
three years and then as a research assistant for one year. I would therefore like to thank
the graduate college "Tunable Integrated Components in Microwave Technology and Optics"
(TICMO) for providing the scholarship that allowed me to pursue my doctoral degree. Special
thanks are due to my advisor, Prof. Dr.-Ing. Klaus Hofmann, for his advice, guidance and
patience, and for providing me with an excellent working environment for research. I have
benefited greatly from his warm and open personality, his sense of quality and his
curiosity about new things.
I also thank my co-advisor, Prof. Dr.-Ing. Franko Küppers, for his critical reading of
the thesis and his advice.
I gratefully acknowledge Ashok Jaiswal for the time he spent on deep technical discussions
about my research and for sharing his experience in both professional and personal life
with me. I would like to thank all anonymous reviewers of my journal and conference papers
for their constructive criticism and suggestions. Many thanks are due to the current staff
members at Fachgebiet Integrierte Elektronische Systeme, Dr. Mohamed Salem, Harish
Balasubramaniam, Mareiki Kaloumenos, Jing Ning, Muhammad Saif, Alex Schönberger, Lufei Shen,
Eman Soliman, Boris Traskov, Botao Xiong and Haoyuan Ying, for their friendship and
cooperation, as well as to the former staff member Prof. Chunyue Huang.
I would also like to express my appreciation to the staff members of the former Fachgebiet
Mikroelektronische Systeme: the head of the institute, Prof. Dr.-Ing Prof. Dr. Dr. h.c.
mult. Manfred Glesner, his research and teaching assistants Ramkumar Ganesan and Francois
Philipp, and the former staff members Dr.-Ing Fizal Arya Samman and Dr.-Ing Ping Zhao. My
thanks also go to our technical staff, Andres Schmidt and Roland Brand, for their help with
software and hardware matters, and to our secretary Silvia Hermann and our former secretary
Iselona Klenk for their help with many administrative matters.
I am indebted to all the students I supervised who worked with me in pro-seminars,
project seminars and master theses. I would also like to give special thanks to
Francois Philipp, Jan-Mark Metten, Damiano Gadler, Jonas Bargon and Ling Chen for
proofreading my dissertation.
I appreciate all my friends from China, from Germany and from all the other countries,
whom I cannot all mention here. I am especially grateful to my best friends
Lang Yu and Jan-Mark Metten in Munich for their emotional support and encouragement.
I am deeply grateful to all my teachers at my primary school, secondary school and
high school in Shanghai, China, as well as to my lecturers at TU Dresden and TU
Munich, for providing me with basic and advanced knowledge.
Finally, I would like to thank my mother and my father from the bottom of my heart for
cheering me up and standing by me through good and bad times. Without them this
dissertation would not have been possible. In particular, I give my deepest thanks to
my beloved father in heaven, who motivated me the most to make him proud. I would also
like to thank my family and friends in Shanghai, who took good care of my
mother during my stay in Germany.
Abstract
The ever-growing demand for high-bandwidth data transfer has been driving research on
high-performance communication systems. The performance of a single chip has been studied
extensively, e.g. faster multi-core processors and higher system memory capacity. To
further enhance system performance, research has focused on improving the data-transfer
bandwidth of chip-to-chip communication over high-speed serial links. Many solutions have
been proposed to overcome the bottleneck caused by non-idealities such as the
bandwidth-limited electrical channel that connects two link devices and the various kinds
of undesired noise in communication systems. Nevertheless, even with these solutions,
high-speed interfaces running at data rates of multiple gigabits per second on low-cost
Printed Circuit Board (PCB) material with a constrained power budget run into
timing-margin limitations. Therefore, the challenge in designing a physical layer (PHY)
link for high-speed communication systems is to make it power-efficient, reliable and
cost-effective. In this context, this dissertation focuses on the architectural design and
the system-level and circuit-level verification of a PHY link, as well as on system
performance optimization with respect to power, reliability and adaptability in high-speed
communication systems.
The PHY is mainly composed of clock data recovery (CDR), equalizers (EQs) and high-speed
I/O drivers. The PHY link structure is usually duplicated symmetrically in both link
devices for bidirectional data transmission. By introducing training mechanisms into
high-speed communication systems, the timing in one link device is adaptively aligned
to the timing condition specified in the other link device, despite different skews or
induced jitter resulting from process, voltage and temperature (PVT) variations in the
individual link. With reliable timing relationships among the interface signals, the
total system bandwidth is dramatically improved. In addition, interface training
offers high flexibility for reuse without further investment in demanding, costly
components.
In the training mode, a CDR module is essential for reconstructing the transmitted
bitstream to achieve the best data eye and for detecting the edges of the data stream in
asynchronous or source-synchronous systems. Generally, the CDR works as a feedback control
system that aligns its output clock to the center of the received data. In systems that
contain multiple data links, the overall CDR power consumption increases linearly with the
number of links, as one CDR is required for each link. Therefore, a power-efficient CDR
plays a significant role in such systems with parallel links. Furthermore, a
high-performance CDR requires low jitter generation in spite of high input jitter. To
balance the trade-off between power consumption and CDR jitter, a novel CDR architecture
is proposed that uses a proportional-integral (PI) controller and a three times sampling
scheme.
Meanwhile, signal integrity (SI) becomes critical as the data rate exceeds several
gigabits per second. Data distorted by the non-idealities in the system are likely to
degrade the signal quality severely and, in worst-case scenarios, result in intolerable
transmission errors, thus reducing the effective system bandwidth. Hence, additional
trainings for SI purposes, such as transmitter (Tx) and receiver (Rx) EQ trainings, are
inserted into the interface training. In addition, a simplified system architecture with
asymmetrical placement of adaptive Rx and Tx EQs in a single link device is proposed and
analyzed using different coefficient adaptation algorithms. Through the training, this
architecture allows a large number of EQs to be eliminated, especially in the case of
parallel links, which saves considerable power and chip area.
Finally, high-speed I/O drivers that are robust against PVT variations are discussed.
Critical issues such as overshoot and undershoot interfering with the data are primarily
caused by impedance mismatch between the I/O driver and its transmission channel. By
applying a PVT compensation technique, the I/O driver impedances can be effectively
calibrated close to the target value. Different digital impedance calibration algorithms
against PVT variations are implemented and compared with respect to fast calibration and
low power requirements.
Kurzfassung
5 I/O Design
5.1 State-of-the-art
5.1.1 I/O Driver Structure
5.1.2 PVT Compensation Schemes
5.2 Proposed Hybrid Digital Impedance Calibration
5.2.1 Calibration Architecture
5.2.2 Hybrid Calibration Algorithm
5.2.2.1 Hybrid Linear Search
5.2.2.2 Hybrid Reduced Binary Search
5.3 Simulation Results
5.4 Summary
References
Chapter 1
Introduction and Overview
Contents
1.1 Background
1.2 Motivations
1.3 Research Objectives and Scope
1.4 Thesis Outline
1.1 Background
The remarkable growth in data traffic driven by multimedia-on-demand and cloud computing
has led to incessant demand for high-performance communication systems that support
high-bandwidth data transmission. For example, fully integrated backplane transceivers
that operated at 5 Gbps nine years ago [1] now run at up to 28 Gbps [2]. To keep up with
Moore's law, microprocessors in communication systems have shifted towards multi-core or
even many-core designs, because single-core processors hit the power wall as clock speeds
were continuously increased.
Meanwhile, chip-to-chip communication has received considerable attention in
high-bandwidth applications, for example multi-socket connectivity in server systems. As
the number of processors increases, the improvement in data-transfer bandwidth saturates
with the conventional, highly capacitive shared front-side bus (FSB) architecture.
Therefore, a new system concept based on a point-to-point architecture, such as AMD
HyperTransport (HT) [3] or Intel QuickPath Interconnect (QPI) [4], has been introduced in
recent years, which allows processors to be connected directly to each other through a
fast bus. Similarly, direct access between processor and memory is provided by embedding
the north bridge into the individual core. Figure 1.1 illustrates the peak memory
bandwidth of both CPUs and GPUs over the last seven years. The trend shows that the
memory bandwidth of GPUs has increased about four-fold, while that of CPUs has increased
about two-fold.
[Figure 1.1: Peak memory bandwidth (GB/s) of CPUs and GPUs, 2007 to 2013.]
1.2 Motivations
Figure 1.2 illustrates an example of a PHY in a high-speed communication system, which
is composed of two level shifters (LS), a serializer (SER), a transmitter equalizer (Tx EQ),
I/O, a receiver equalizer (Rx EQ), a control unit, a deserializer (DES), a clock data
recovery (CDR) and a phase locked loop (PLL). In communication systems, a PHY plays an
important role in ensuring error-free bit-level data transmission between two devices, and
it also provides the electrical interface to the channel. In general, tasks of a PHY
[Figure 1.2: Example of a PHY in a high-speed communication system. The cores of Chip A
and Chip B are connected through their PHYs and a channel; the PHY blocks shown include
LS, SER, Tx EQ, I/O with impedance calibration, Rx EQ, DES, CDR, control unit, FIFO and
PLL. LS: Level Shifter, SER: Serializer, DES: Deserializer, EQ: Equalizer, CDR: Clock
Data Recovery.]
A PHY link, as its name suggests, is established from a PHY in one link device to a
PHY in the other link device. So far, most studies of PHY link design targeting high
data bandwidth have been carried out only by following the standard protocols of specific
applications, e.g. PCI Express for expansion cards or RapidIO Serial in wireless base
stations. Although many improvements have been made by using advanced interconnect
techniques to minimize data errors and to meet the protocol requirements, data
transmission has run into timing-margin limitations on low-cost Printed Circuit Board
(PCB) material. These limitations prevent higher-speed data transmission in modern
communication systems.
In industry, customized designs are usually made for different demands, which ensures
robust and reliable signal transmission at high data rates. However, redesign is
problematic in terms of cost and time-to-market. In worst-case scenarios, undesired
effects such as process, voltage and temperature (PVT) variations and random noise, which
contributes random timing jitter, may perturb the system.
Last but not least, the complexity of additional advanced interconnect circuits has
pushed the PHY link to its limit in total system power consumption, thus reducing the
power efficiency. The power dissipated in the PHY link is proportional to the number of
links in both Tx and Rx directions, which implies that the number of links that can be
added to increase the system bandwidth is constrained by the limited power budget.
Therefore, the challenge in PHY link design for high-speed communication systems is to
make the link adaptive, reliable, power-efficient and cost-effective. The ideas and
motivations behind the PHY link design and optimization are as follows.
• Low cost and adaptability of a PHY link design for high-speed communication
systems: Components of a PHY link such as the Tx EQ and Rx EQ are frequently
employed in a wide variety of high-speed communication systems because of lossy
channels on the PCB. To save cost, optimizing or redesigning the PCB is not
considered. Hence, a sophisticated adaptive mechanism that allows higher-speed data
transmission across various low-cost PCBs is required for future high-speed
communication systems.
• To optimize a CDR system in the PHY link with respect to the trade-off between
jitter generation and power consumption for high-speed data transmission [12]. A
three times sampling (3TS) scheme in the phase detector (PD) is proposed to decrease
the dither in the recovered sampling clock.
• To introduce two adaptive EQ trainings [13] [14] [15] [16] in the PHY link for
high-speed communication systems, where the different algorithms applied in the EQ
trainings are either optimized or newly proposed, both at system level using
Matlab/Simulink and at gate level using Matlab/Simulink-Cadence co-simulation [17].
The signal quality and hardware usage are evaluated for the different algorithms.
• To reduce the total power dissipated in the PHY link for EQ trainings by simplifying
the system architecture in high-speed communication systems, where the Tx EQ and Rx
EQ are placed asymmetrically in a single link device.
• Optimization of a low-jitter, power-efficient CDR system for read training and write
training at both system level and gate level.
• Chapter 3: This chapter discusses the design of a phase interpolator-based CDR
system that targets low power and low jitter for high-speed communication systems.
A CDR based on digital phase interpolators and a proportional-integral (PI)
controller, the PI2 CDR, is proposed for the graphics double data rate 5 (GDDR5)
interface trainings, i.e. read training and write training. The proposed CDR is
composed of a DES, a PD, a programmable accumulator, a PI controller and four digital
phase interpolators, instead of two, in order to be glitch-free. Issues such as
glitches during phase switching and phase variation caused by different loads are
discussed. Simulation results show that the proposed PI2 CDR offers a favorable
trade-off between low CDR jitter and low power consumption compared to other CDRs
such as the modified CDR and the adaptive CDR. As the proposed CDR has the drawback
of high dither, a 3TS CDR is suggested to keep the phase of the recovered clock
constant and to stop the current training.
• Chapter 4: This chapter describes a simplified system architecture in which the EQs
are applied in only one link device when the Tx and Rx EQ trainings are integrated
into the other interface trainings, e.g. read training and write training. This novel
system architecture is verified at gate level by implementing the Rx equalizer
training with both analog and semi-digital circuits, while the Tx equalizer training
uses different algorithms, namely 1) the Least Mean Square (LMS) algorithm, 2) pilot
signal/peak detection and 3) direct calculation, implemented with semi-digital and
fully digital circuits. Among the investigated algorithms, the results show that LMS
in the semi-digital Rx and Tx equalizers performs best in high-speed communication
systems, with medium die area consumption. Compared to the other algorithms, it
can reduce the ISI caused by both the pre- and post-cursors of the channel.
• Chapter 6: The new contributions of this thesis are summarized in this chapter.
Directions for future work are also briefly described.
Chapter 2
Interconnect Standards in Computer Peripherals
Contents
2.1 PCI Express
2.2 HyperTransport
2.3 DDRx
2.4 GDDRx
2.5 Summary
Many standards that use physical layers are predefined to interconnect devices from
different vendors. The layer that interfaces to the physical world determines both the
physical and the electrical characteristics of the protocol. In general, interconnect
standards can be applied either in networks (e.g. Ethernet) or in computer peripherals
(e.g. PCI Express). As an example, Figure 2.1 illustrates the platform block diagram of
the 4th generation Intel® Core™ processor, a 64-bit multi-core processor, with the Intel®
Z87 Express chipset [19], which supports 16 PCI Express 3.0 lanes, 8 PCI Express 2.0
ports, 6 SATA ports, 14 USB ports, etc. The trend shows that the number of interconnect
standards is increasing and that the various standards are under continuous development
towards faster transfer rates. In this chapter, an overview of various interconnect
standards in computer peripherals is given.
[Figure: PCI Express PHY block diagram. Transmit path: parallel transmit data interface
and serializer (2.5 Gbps), clocked from the system/reference clock through a PLL.
Receive path: CDR and deserializer (2.5 Gbps), producing parallel receive data and a
receive clock at the parallel data rate.]
Furthermore, PCI Express features a point-to-point architecture between the master
and the peripheral using wired interfaces, where one pair of wires is used for
transmitting and one pair for receiving. Differential signaling makes PCI Express
superior to single-ended signaling in noise margin and provides flexibility with respect
to different supply voltages in the Tx and Rx chips. Data are transmitted serially, one
bit per cycle. On a single connection (either transmit or receive), a maximum data
transfer rate of 2.5 Gbps can be achieved. For various uses, e.g. graphics adapters, a
PCI Express link can be configured with up to 32 lanes, which deliver a maximum
throughput of 16 GBps (after the 20% 8b/10b conversion penalty).
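The 16 GBps figure can be reproduced with a few lines of arithmetic. The sketch below is
illustrative only: the function name and its parameters are assumptions made for this
example, and it assumes that the quoted maximum counts both directions of the link.

```python
def pcie_throughput_gbytes(lanes: int,
                           line_rate_gbps: float = 2.5,
                           payload_bits: int = 8,
                           coded_bits: int = 10,
                           directions: int = 2) -> float:
    """Payload throughput in GB/s of a PCI Express link with 8b/10b coding.

    Assumption: 'directions=2' adds up the transmit and receive pairs.
    """
    # 8b/10b coding carries 8 payload bits in every 10 line bits (20% penalty).
    payload_gbps = lanes * line_rate_gbps * payload_bits / coded_bits
    # Convert bits to bytes after summing over the counted directions.
    return payload_gbps * directions / 8


if __name__ == "__main__":
    print(pcie_throughput_gbytes(32))                # both directions
    print(pcie_throughput_gbytes(32, directions=1))  # one direction only
```

With 32 lanes at 2.5 Gbps, each direction carries 64 Gbps of payload, i.e. 8 GB/s, so the
quoted 16 GBps corresponds to the sum of both directions under this assumption.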
2.2 HyperTransport
HT [21] features point-to-point source-synchronous interconnects with one clock signal
per byte. Figure 2.3 illustrates an example of HT links that use a bus width of up to 32
lanes for chip-to-chip interconnects. To minimize the pin count, no separate command or
address lanes are required in the HT interface; instead, a signal CTL indicates whether
the signal CAD carries control or data packets. Moreover, power is saved because there
is no overhead from serializers, deserializers or 8b/10b conversion.
[Figure 2.3: Example of HT links connecting CPUs, memories and I/O buses through HT
buses.]
Requirements for the various compliant channels in the time domain are well defined for
HT in [22]. A number of fixed Tx equalization settings and optional Rx equalization
settings are also provided to compensate for the non-idealities of those channels.
Similar to PCI Express, HT adopts differential signaling. Figure 2.4 shows low-voltage
differential signaling with a differential impedance of 100 Ω and a characteristic line
impedance of 60 Ω. HT also uses training and equalization schemes for high
performance and signal reliability.
[Figure 2.4: HT low-voltage differential signaling between two HT drivers on the
Data/Data# and CLK/CLK# pairs. Annotations: RON(DC), RTT(DC), VLDT, RLINE = 60 Ω ± 10%,
and impedances of 50 Ω ± 10% and 100 Ω ± 10%.]
2.3 DDRx
Double data rate synchronous dynamic random access memories (DDR SDRAMs) [23]
such as DDR2 and DDR3 are memory integrated circuits. As high-performance PC memories,
successive generations of DDR memories have advanced significantly in data rate.
Figure 2.5 shows the growth of the per-pin data rate of DDRx memories, starting from
single data rate (SDR) memories. The growth in data bandwidth is mainly due to higher
clock frequencies.
[Figure 2.5: Per-pin data rate (Mb/s, logarithmic axis from 100 to 2000) of SDR, DDR1,
DDR2 and DDR3 memories over the years 1998 to 2009.]
Figure 2.6 shows a simplified PHY link in DDRx memories using a source-synchronous
clocking scheme. In this architecture, no CDR circuits are needed in either the Tx or
the Rx chip, because a bidirectional data strobe signal (DQS) accompanies the DQs and
indicates when the data are valid.
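[Figure 2.6: Simplified source-synchronous PHY link in DDRx memories. The Tx (MUX,
latch, buffers, Tx PLL) sends DATA together with a latch clock delayed by
Δt = 1/2 Tsym; the Rx (buffers, latch, DEMUX) samples DATA with this latch clock.]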
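As a rough illustration of the half-symbol delay Δt = 1/2 Tsym annotated in Figure 2.6,
the following sketch computes the symbol time and the corresponding latch clock delay
for a given per-pin data rate. The function name and the example rate of 1600 Mb/s are
assumptions chosen for illustration and are not taken from the thesis.

```python
def latch_clock_delay_ps(data_rate_mbps: float) -> float:
    """Delay (in ps) that places the latch clock edge in the middle of a symbol."""
    # One symbol lasts 1 / data rate; 1e6 converts from 1/(Mb/s) to picoseconds.
    t_sym_ps = 1e6 / data_rate_mbps
    # Sampling in the middle of the symbol corresponds to half a symbol time.
    return t_sym_ps / 2


if __name__ == "__main__":
    # Hypothetical example: 1600 Mb/s per pin gives Tsym = 625 ps.
    print(latch_clock_delay_ps(1600))
```

Shifting the latch clock by half a symbol time centers the sampling edge in the data
eye, which is why no CDR is needed as long as clock and data skew stay small.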
Figure 2.7 demonstrates interconnects between north bridge and different memory
modules (DDR2 and DDR3). In DDR2 dual in-line memory module (DIMM) the con-
[Figure 2.7: Interconnects between the north bridge and two memory modules. In both
modules each memory has a DQ bus plus DQS (8+1 lanes per memory); the CA bus plus CLK
uses 28 lanes per memory in one module and 25 lanes per memory in the other.]
are passed through each memory sequentially. To compensate for the skew, the PHY in the
memory controller needs to support write and read leveling, which enables the DQs to
move towards the rising edge of the clock.
2.4 GDDRx
As high-bandwidth memories, graphics DDRx (GDDRx) [23] memories in graphics cards are
commonly used in game consoles and desktop personal computers. Figure 2.8 shows the
burst data time for various memories. It is clear that GDDR5, currently the
highest-bandwidth memory commercially available, has the fastest burst data speed of the
memories compared.
[Figure 2.8: Burst data time (ns, axis from 0 to 6) for various memories.]
Unlike DDR3, GDDR5 has no data strobe signal. Figure 2.9 shows a GDDR5 interface, where
a memory controller (master IC) and a GDDR5 memory (slave IC) are connected via
channels. A source-synchronous clocking scheme is adopted in GDDR5 memory systems.
Unidirectional signals such as address/command (ADDR/CMD) run at the system clock (CK),
while bidirectional signals such as data (DQ) run at the data clock (WCK), whose
frequency is twice that of CK. The error detection code (EDC) signal, which carries
different feedback information in different modes, is also unidirectional and runs
asynchronously. For the sake of simplicity, many other control signals, e.g. scan
enable, are not drawn in the figure.
In GDDR5 memory systems, the timing of the memory controller and the memory is adapted
by introducing training mechanisms, which benefits both flexibility and system cost.
Several interface trainings, i.e. address training, WCK2CK training, read training and
write training, are predefined by JEDEC [8] to be run after power-up. The training
sequence has to be executed to guarantee a defined interface timing relationship for
high-speed data transmission in GDDR5, as illustrated in Figure 2.10.
[Figure 2.9: GDDR5 interface between the memory controller (GPU) and the GDDR5 memory,
with the signals ADDR, CMD, CK, DATA, WCK and EDC.]
[Figure 2.10: GDDR5 training sequence after power-up: address training loop (optional),
further training loops including the WRITE training loop, then start of normal
operation.]
Address training adjusts the phase of the address signals relative to CK to meet the
timing at the DRAM. It is optional as long as the address input setup time (tAS) and the
address input hold time (tAH) fulfill the system specification. WCK2CK training adjusts
the WCK phase relative to CK to meet the timing at the DRAM. Next, read training adjusts
the receive phase for the read data on the DQs to meet the timing at the sampler in the
memory controller. Once read training is successful, write training follows to adjust
the transmit phase for the write data on the DQs to meet the timing at the sampler in
the DRAM.
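The ordering of the trainings described above can be summarized as a small sequencer.
This is a minimal sketch under stated assumptions and not the JEDEC procedure or the
implementation of this thesis: the step names, the callback interface, the retry limit
`max_attempts` and the returned log are all hypothetical, and address training is
treated as skippable because Figure 2.10 marks it as optional.

```python
from typing import Callable, Dict, List

# Order of the interface trainings as listed in the text.
TRAINING_ORDER = ["address", "wck2ck", "read", "write"]


def run_training_sequence(steps: Dict[str, Callable[[], bool]],
                          max_attempts: int = 3) -> List[str]:
    """Run each training in order; every callback returns True when it passes."""
    log: List[str] = []
    for name in TRAINING_ORDER:
        # Address training is optional, so a missing callback is skipped.
        if name == "address" and name not in steps:
            log.append("address:skipped")
            continue
        for _ in range(max_attempts):
            passed = steps[name]()
            log.append(f"{name}:{'pass' if passed else 'fail'}")
            if passed:
                break
        else:
            # The loop ran out of attempts without a pass (hypothetical policy).
            raise RuntimeError(f"{name} training did not converge")
    log.append("normal_operation")
    return log
```

Each later training relies on the earlier ones: write training, for example, reads the
data back over the path that read training has already aligned, so a failing step stops
the sequence instead of continuing with an unreliable interface.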
2.5 Summary
To summarize, many efforts have been made to explore different PHY link architectures
supporting various interconnect standards in computer peripherals. Some PHYs require
CDR circuits because of limited pin counts, while others employ deskew circuits to
compensate for the different arrival times of the signals. With regard to EQs, some
PHYs provide none, because signal quality is not significantly affected at low
transmission speeds when all predefined system requirements are met, while others have
fixed EQ settings. Furthermore, different PHYs specify different signaling requirements
for the I/O interfaces.
The challenge in PHY link design is to provide reliable high-speed data transmission at
considerably low power and cost. Interface timings have to be guaranteed even when
low-cost PCBs are used. Therefore, a smart system that can adapt to different system
environments, in terms of channel characteristics and PVT variations, is a promising
direction for future high-speed communication systems. In the following chapters, the
CDR, EQ trainings and I/O driver are discussed in detail with the aim of power-efficient
and reliable data transmission.
Chapter 3
Phase Interpolator Based Clock Data Recovery
Contents
3.1 State-of-the-art
3.2 PI2 CDR Simulink Model
3.2.1 Model Description
3.2.2 Simulation Results
3.3 PI2 CDR Components Design
3.3.1 Digital Phase Interpolator
3.3.2 Multiplexer
3.3.3 Three Times Sampling CDR
3.4 Simulation Results
3.4.1 Summary
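[Figure: Read training between the training engine in the memory controller (ALPG,
expected read data, PD, LF, phase interpolator) and the FIFO in the GDDR5 memory, using
the ADDR channel with the LDFF command and the DQ channel.]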
address channel that usually has a better SI than high-speed data channel with the com-
mand load FIFO (LDFF). After successfully writing the data to the FIFO, the command
read training (RDTR) is started. By comparing the read data and the expected read data,
the receive phase is adjusted in the memory controller. To achieve the right clock phase,
a CDR circuit composed of PD, low pass filter (LF) and phase interpolator is usually ap-
plied.
[Figure 3.2: Write training between the training engine in the memory controller (ALPG,
write data, expected read data, PD, LF, phase interpolator) and the FIFO in the GDDR5
memory over the DQ channel.]
Once read training is finished, write training [8], depicted in Figure 3.2, begins. Data
are first written into the FIFO through a DQ channel with the write training (WRTR)
command and are then transferred back into the memory controller with RDTR commands;
thanks to the preceding read training, this return path is error-free. By comparing the
write data and the expected read data, the transmit phase is adjusted in the memory
controller.
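To make the compare-and-adjust idea concrete, the sketch below sweeps all phase
settings, marks the ones for which the data read back match the expected pattern, and
picks the middle of the widest passing window. This is a plain sweep shown only for
illustration; it is not the CDR-based loop proposed in this thesis. The callback
`read_back`, the number of phase settings and the function names are assumptions, and
wrap-around of the passing window across the last phase setting is not handled.

```python
from typing import Callable, List, Sequence


def center_of_widest_window(passed: Sequence[bool]) -> int:
    """Return the index in the middle of the longest run of True values."""
    best_start, best_len, start = -1, 0, None
    # A trailing False closes a window that reaches the last phase setting.
    for i, ok in enumerate(list(passed) + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_len == 0:
        raise ValueError("no phase setting passed the comparison")
    return best_start + (best_len - 1) // 2


def train_phase(read_back: Callable[[int], List[int]],
                expected: List[int], n_phases: int) -> int:
    """Sweep the phase settings and return the center of the passing window."""
    # read_back(phase) is a hypothetical hook that returns the data received
    # when the link is operated with the given phase setting.
    passed = [read_back(phase) == expected for phase in range(n_phases)]
    return center_of_widest_window(passed)
```

Choosing the center of the passing window rather than its first passing setting leaves
roughly equal margin to both eye edges, which is the usual goal when timing may drift
with voltage and temperature.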
As a result, data eyes are sent at different times by changing the sampling clock phase
in the controller, so that the data arrive simultaneously at the Rx. Figure 3.3 illustrates the
However, one major problem in both read training and write training is deciding when to
compare the expected data with the received data, because the component delays (channel
delays and other component delays) differ between vendors. To address this problem, the
proposed CDR architecture is largely inherited from the conventional CDR, but focuses
on the trade-off between power and CDR jitter.
3.1 State-of-the-art
A wide variety of CDRs have been proposed in the literature. The authors of [24]
suggested a digital CDR using an analog-to-digital converter (ADC)-based front-end. The
received analog signals are first equalized by a continuous-time analog EQ, whose output
is then converted into 5-bit digital data by flash ADCs. In this architecture, the
number of valid data is adjusted instead of the phase of the sampling clock by applying
a zero-crossing PD. When the data is faster, one additional data is inserted; otherwise,
one duplicated data is removed. The authors of [25] proposed a 2-threshold eye-tracking
CDR for use with a loop-unrolling decision feedback EQ (DFE). Because the time
difference between the edge and the eye center of the post-DFE signal is not equal to
0.5 UI, two different thresholds are required for the edge and data signals. The
drawback of this architecture is its significant circuit complexity. In [26], the
authors suggested a phase-tracking CDR with dynamic control of the jitter transfer
bandwidth to increase the jitter tolerance. In [27], the phase-tracking CDR is optimized
with a pre-detector to reduce the closed-loop latency. The pre-detector improves the
intrinsic jitter caused by the update periods needed in the frequency detector and in
the accumulator of the finite-state machine (FSM). In the pre-detector, the signs of the
input and output of the FSM are compared. The authors of [28] proposed a sub-rate
phase-tracking CDR using a fully differential approach for the tail current sources in
the phase interpolator design, which avoids switching clocks and thus results in lower
clock jitter.
In general, CDRs are classified into several categories [29], such as PLL-based CDRs
[30], delay locked loop (DLL)-based CDRs [31] and phase interpolator-based CDRs [32].
Owing to its dual tracking loop for both phase and frequency, a PLL-based CDR provides
good input jitter rejection. At the same time, however, it suffers from jitter
accumulation and from the stability issues of a high-order PLL. A DLL-based CDR using a
voltage-controlled delay line, in contrast, is advantageous in terms of stability.
However, the DLL-based CDR has a limited phase capture range, which can be overcome by
replacing the voltage-controlled delay line with a phase interpolator. Other CDRs, such
as the dual-edge injection-locking-based CDR [33], baud-rate timing recovery [34], the
phase-tracking CDR with sleep mode [35] and the CDR using a symbol-rate PD [36], which
target fast lock or low power and area consumption, have also been proposed.
Among the different types of CDR structures, the phase interpolator based CDR has been
gaining popularity due to its many advantages. The primary advantage is that a single clock
source can be shared by multiple phase interpolator based CDR circuits. This feature reduces
the system power consumption, the die size overhead and the crosstalk between adjacent
channels [37]. Furthermore, the phase interpolator based CDR offers a short re-locking time,
better noise suppression during phase tracking and a simple compensation scheme for offset
due to device mismatch [38].
Figure 3.4 shows the basic building block of a conventional CDR [39]. After sampling
the data, PD generates early-late information, which is accumulated in the controller. The
phase interpolator adjusts the phase of its output clock according to the output signal of
the controller. The updated clocks are then applied to the sample block.
A modified phase interpolator based CDR [40] is shown in Figure 3.5. Compared to
the conventional CDR, this architecture includes the deserializer in the synchronization
loop, which allows most digital components to work with low-frequency clocks and thus
reduces the complexity and the power of the whole system. The system is composed of
a PLL, a deserializer, a PD, a programmable accumulator, a binary phase accumulator
and two analog phase interpolators. The 1-to-8 deserializer first samples the incoming
data with the half-rate in-phase (ICLK) and quadrature (QCLK) clocks and synchronizes the
outputs to the low-speed system clock (CLKL), which runs at one quarter of the ICLK frequency.
A modified Alexander PD [41] takes two parallel bytes from the deserializer and outputs a
moving average of the phase error to obtain a high probability of phase error detection for
the present jitter. A programmable accumulator (Pre-filter) provides an updated carry signal when a
certain threshold Pre_filt is reached. A phase accumulator accumulates the carry signal
with the PD gain k_PD and converts the result into digital bits. By feeding the digital
information and 4-phase clocks generated by PLL into two phase interpolators, ICLK
and QCLK are recovered. This CDR generates a low dither on the recovered clock due to
the linear accumulators. However, high CDR jitter is a major drawback of this CDR.
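The interplay of the programmable accumulator (Pre-filter) and the phase accumulator can be sketched in a few lines of Python. This is a minimal behavioral sketch, not the circuit of [40]: the function name, the sign convention (+1 advances the phase code), the reset of the pre-filter after a carry and the 64-step wrap-around are all assumptions made for illustration.

```python
def modified_cdr_phase(pd_outputs, pre_filt=2, k_pd=1, n_steps=64):
    """Behavioral sketch of pre-filter plus phase accumulator (illustrative only).

    pd_outputs: iterable of +1 (UP), -1 (DN) or 0 (no transition) from the PD.
    Returns the phase-interpolator code after every PD output.
    """
    pre_acc = 0   # programmable accumulator (Pre-filter)
    phase = 0     # binary phase accumulator
    codes = []
    for up_dn in pd_outputs:
        pre_acc += up_dn
        carry = 0
        if pre_acc >= pre_filt:        # threshold reached: emit carry, restart (assumption)
            carry, pre_acc = 1, 0
        elif pre_acc <= -pre_filt:
            carry, pre_acc = -1, 0
        # accumulate the carry weighted by the PD gain, wrapping around 360 degrees
        phase = (phase + carry * k_pd) % n_steps
        codes.append(phase)
    return codes
```

With `pre_filt=2`, two consecutive UP decisions are needed before the phase code moves by one step, while alternating UP/DN decisions cancel inside the pre-filter and never reach the phase accumulator. This illustrates why a larger threshold reduces how often the recovered clock changes.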
To reduce the CDR jitter, an early-late PD with adaptive algorithm and PI controller
are utilized in an alternative CDR [42] shown in Figure 3.6. The sample and DEMUX
blocks generate 4 oversampling points for each symbol. For eight symbols 32 parallel
binary data are provided at the input of the early-late timing error detector (EL-TED).
[Timing diagram: data symbols D1 ... Dn, sampling points S_{1,n} ... S_{4,n} and transitions δ_{1,n} ... δ_{4,n}]
\[
f_2(k\_SE) = k\_SE \cdot \left( \sum_{n=1}^{L} |\delta_{2,n}| - \sum_{n=1}^{L} |\delta_{3,n-1}| \right) \tag{3.2}
\]
where k_LE, k_SE and f(tr) represent the gain of PD for large errors, the gain of PD for
small errors and a function of input data transitions, respectively. L is the total length
of data symbols over which the early-late information is averaged. In our case, L is set to
8. The data transitions in the middle of one symbol, δ_{1,n} and δ_{4,n−1}, are weighted with
a large value k_LE, which indicates that large phase errors occur at these positions.
Transitions at the beginning and the end of the symbol, however, are weighted with k_SE,
which indicates small phase errors. The PI controller takes the resulting early-late error
information and provides the digital control bits for the four analog phase interpolators,
which can be expressed as
\[
\mathrm{phase}_{b,w} = \left( b_{,w} \cdot \beta \cdot E^{T} + \sum_{n=1}^{L} b_{,w} \cdot \alpha \cdot E^{T} \right) \cdot k\_PD \tag{3.4}
\]
where the vectors α and β are the coefficients of the PI controller, E represents the identity
matrix and k_PD is the fundamental gain of the PD.
The adaptive CDR is suitable for very low CDR jitter requirements and usually provides
a fast locking time because it adapts quickly to the input jitter. However, it has the
drawback of high dither in the output clock and consumes considerable power by using
four phase interpolators. By combining the advantage of the first architecture, low dither,
with the advantage of the second architecture, low CDR jitter, a novel system architecture,
the PI2 CDR, is proposed and described in detail in the following section.
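The proportional-integral structure behind the controller can be illustrated with a scalar sketch. It assumes a single error stream and scalar coefficients `alpha` (integral path) and `beta` (proportional path), whereas the controller described above uses coefficient vectors for several interpolators; the function name and the floating-point output are illustrative choices, not taken from the original design.

```python
def pi_controller(errors, alpha, beta, k_pd=1.0):
    """Scalar PI controller sketch: out = (beta * e + alpha * sum(e)) * k_pd."""
    integral = 0.0
    outputs = []
    for e in errors:
        integral += e                       # integral path: running sum of the error
        outputs.append((beta * e + alpha * integral) * k_pd)
    return outputs


# Example: two early decisions followed by one late decision.
print(pi_controller([1, 1, -1], alpha=0.5, beta=2.0))
```

The proportional term reacts immediately to each early-late decision, which helps to follow fast phase changes, while the integral term keeps the accumulated phase offset.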
Figure 3.8 shows the proposed PI2 CDR architecture using PI controller, where the
phase accumulator in the modified CDR is replaced. The internal block of the PI controller
is given in Figure 3.9. Instead of using b ,w in the adaptive CDR, which is dependent on
the k_LE and k_SE and δ, the output of Pre_filter is connected to the input of the PI
controller in the proposed CDR. Compared to the output of the phase accumulator in the
modified CDR
\[
\mathrm{phase}_{b,w} = \left( \sum_{n=1}^{L} b \right) \cdot k\_PD \tag{3.5}
\]
The complete simulation in Matlab/Simulink covers 280ns, during which 1400 data symbols are
sent (data rate = 5Gbps). The transmitted data is the repeating sequence [1 0 1 1 0 1 1 1 0 1 1 1 1 0 1
1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0] with random jitter that changes every 200ps. The
0 degree reference clock generated by the PLL is assumed to have a 90 degree phase shift
to the ideal sampling clock (middle of the data). Table 3.1 shows the variable settings in
modified, adaptive and PI2 CDRs at system-level simulation.
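The stimulus of this simulation can be reproduced roughly as follows. The sketch is only meant to make the numbers concrete: the uniform jitter distribution, the symmetric jitter range, the seed and the function name are assumptions, since the jitter statistics are not specified beyond the maximal value.

```python
import random

# Repeating 32-bit data pattern used in the system-level simulation.
PATTERN = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
           1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def build_stimulus(n_symbols=1400, ui_ps=200, max_rj_ps=60, seed=0):
    """Return (bits, symbol start times in ps) with one jitter value per symbol."""
    rng = random.Random(seed)
    bits = [PATTERN[i % len(PATTERN)] for i in range(n_symbols)]
    # Assumption: jitter is uniform in [-max_rj_ps, +max_rj_ps], redrawn every UI.
    starts_ps = [i * ui_ps + rng.uniform(-max_rj_ps, max_rj_ps)
                 for i in range(n_symbols)]
    return bits, starts_ps


bits, starts_ps = build_stimulus()
print(len(bits) * 200 / 1000, "ns")   # 1400 symbols of 200 ps each
```

At 5Gbps one symbol lasts 200ps, so 1400 symbols correspond to the 280ns of simulated time stated above.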
Table 3.1: Variable settings in modified, adaptive and PI2 CDR systems
CDR system      Variables
Modified CDR    k_PD=725; Pre_filt=2 or 8
Adaptive CDR    k_PD=26; k_LE=4; k_SE=1
PI2 CDR         k_PD=32; Pre_filt=2 or 8
For the simulation, the loop gain of each CDR is chosen so that the locking time of all
three models is approximately the same. Figure 3.11a and Figure 3.11b depict the phase
error between the ideal sampling point and the actual phase of the adaptive CDR without
random jitter and with a maximal random jitter of 60e-12s, respectively. In Figure 3.11a, the
locking process is complete when the CDR begins to dither around the ideal sampling
point, which in this setup happens between 0.4e-7s and 0.5e-7s.
Table 3.2: Phase errors in modified, adaptive and PI2 CDR systems
Maximal RJ (s)   0        6e-12    18e-12   60e-12   6e-12    18e-12   60e-12
Pre_filt         2        2        2        2        8        8        8
Modified CDR     11.0711  11.0224  12.3257  17.4357  14.2508  16.7797  19.5531
Adaptive CDR     10.6471  10.3966  10.8724  16.3235  10.2693  10.3928  16.1897
Proposed CDR     11.5353  11.3381  11.8110  16.5613  12.9823  12.8890  17.3397
Table 3.2 shows the phase errors in degrees for the modified, adaptive and proposed CDR
systems when the maximal random jitter and the threshold of the programmable accumulator
vary. The CDR jitter is calculated as the sum of the absolute jitter at each simulation
point divided by the number of simulation points. The simulation results show that the
adaptive CDR has the lowest CDR jitter among all three models due to its capability of
following fast input data phase changes, but at the cost of large dither. The modified
CDR and the proposed PI2 CDR do not differ much in phase error, especially for
very small jitter (jitter=0s and 6e-12s). For higher jitter (e.g. jitter=60e-12s), however,
the proposed CDR shows better performance. A high threshold value in the programmable
accumulator is not recommended in either the modified or the proposed CDR when the
target is a low CDR phase error, but it is an option for reducing the frequency at which
the dither changes. For Pre_filt=8, the phase error of the proposed CDR is 1.3 degrees to
3.9 degrees lower than that of the modified CDR. When large CDR jitter is expected, or when
a small phase error with fewer clock changes (high value of Pre_filt) is required,
the proposed model is usually preferred.
[Plots: phase error (-80 to 80) over time (0 to 1.5e-7 s); panel titles: Adaptive CDR, Proposed CDR]
Figure 3.11: Phase error between ideal sampling point and actual phase of adaptive CDR with/without jitter
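The CDR jitter metric described above, the mean of the absolute phase error over all simulation points, can be written down directly. The function name is illustrative; the second part of the example only re-computes the Pre_filt=8 differences between the modified and the proposed CDR from the values in Table 3.2.

```python
def mean_abs_phase_error(phase_errors_deg):
    """Sum of absolute phase errors divided by the number of simulation points."""
    if not phase_errors_deg:
        raise ValueError("at least one simulation point is required")
    return sum(abs(e) for e in phase_errors_deg) / len(phase_errors_deg)


# Differences for Pre_filt=8 taken from Table 3.2 (modified minus proposed CDR).
modified = [14.2508, 16.7797, 19.5531]
proposed = [12.9823, 12.8890, 17.3397]
differences = [round(m - p, 1) for m, p in zip(modified, proposed)]
print(differences)
```

The smallest and largest of these differences correspond to the 1.3 degree and 3.9 degree figures quoted in the text.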
Table 3.3: Comparison among modified, adaptive and PI2 CDRs with constant locking time
               Advantages                                     Disadvantages
Modified CDR   Low dither                                     High CDR jitter
Adaptive CDR   Lowest CDR jitter                              High power consumption; highest dither
Proposed CDR   Low CDR jitter with lower power consumption    High dither
Table 3.3 depicts the comparison among modified, adaptive and proposed PI2 CDR
systems in terms of the CDR jitter, dither and power consumption under the condition
that locking time is kept the same for all models.
3.3 PI2 CDR Components Design
Due to mismatch, high leakage current (especially in modern technologies) and high
power consumption [43], the analog phase interpolator has become less competitive than the
digital phase interpolator. Based on [43], a digital phase interpolator composed of four
inverters, shown in Figure 3.12, is applied in the proposed PI2 CDR. Inverter 1 and inverter 4
produce delayed and inverted outputs from their inputs In1 and In2. Out2 is the interpolated
output.
[Figure 3.12: (a) 1-bit cell schematic; (b) input and output waveforms of a 1-bit cell]
Figure 3.13 illustrates an example of a 3-bit phase interpolator. From the two inputs
(phase 0° and phase 45°), the 2nd phase step (2 ∗ 5.625° = 11.25°), represented by the binary
code "010", is generated. Due to the inverter delay, even identical phases, for example four 0°
phases, have different phase positions.
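The phase arithmetic of this example can be checked with a small helper. It models an ideal interpolator only: the function name is hypothetical and the inverter delays mentioned above are deliberately ignored.

```python
def interpolated_phase_deg(code, phase_a_deg=0.0, phase_b_deg=45.0, n_bits=3):
    """Ideal output phase of an n-bit interpolator between two input phases."""
    steps = 2 ** n_bits
    if not 0 <= code < steps:
        raise ValueError("code out of range for the given number of bits")
    step_deg = (phase_b_deg - phase_a_deg) / steps   # 45 / 8 = 5.625 degrees
    return phase_a_deg + code * step_deg


print(interpolated_phase_deg(0b010))   # second phase step between 0 and 45 degrees
```

For the code "010" the helper returns the 11.25° of the example, i.e. two steps of 5.625°.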
Different from [43], the implemented phase interpolator works in the gigahertz range
(2.5GHz), not in the megahertz range. In TSMC 65nm LP technology, the rise/fall time of the
input signals has to be about 0.4 to 2.6 times longer than the phase delay between the two
input signals, depending on the stage in which the cell lies. Additionally, multiplexers are
required at the input of the phase interpolator to switch between reference clocks with small
phase differences as well as between two stages. However, unwanted signal spikes occur
during the switching process. To avoid them, two 3-bit digital phase interpolators are
used, as illustrated in Figure 3.14a.
When the phase of the input clock changes, a switching protocol runs. Assume
that phase interpolator A is passed to the output (switch="0") at the beginning. Once the
control signal changes, the signal "control B" changes accordingly to the wanted phase. The
logic then waits at least one clock cycle to give phase interpolator B enough time to settle.
At a safe time point, phase interpolator B is passed to the output (switch="1"). To prevent
glitches, switching is only allowed in the safe area (green)
shown in Figure 3.14b, where clkA and clkB (the outputs of phase interpolator A and phase
interpolator B, respectively) have the same value, either "0" or "1".
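The safe-area rule can be illustrated on sampled clock waveforms. This is a behavioral sketch with hypothetical function names; representing the clocks as lists of 0/1 samples and expressing the settling time as a number of samples are simplifications made for the example.

```python
def first_safe_switch(clk_a, clk_b, settle_samples):
    """First sample index at or after settle_samples where clkA equals clkB."""
    for i in range(settle_samples, min(len(clk_a), len(clk_b))):
        if clk_a[i] == clk_b[i]:       # safe area: both outputs have the same value
            return i
    return None                        # no safe point inside the observed window


def switched_output(clk_a, clk_b, switch_index):
    """2-to-1 MUX output: clkA before switch_index, clkB from switch_index on."""
    return list(clk_a[:switch_index]) + list(clk_b[switch_index:])


clk_a = [0, 0, 1, 1, 0, 0, 1, 1]
clk_b = [0, 1, 1, 0, 0, 1, 1, 0]      # same clock with a different phase
idx = first_safe_switch(clk_a, clk_b, settle_samples=1)
print(idx, switched_output(clk_a, clk_b, idx))
```

Because both clocks carry the same value at the switching instant, the output does not see an extra edge caused by the multiplexer itself.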
3.3.2 Multiplexer
Table 3.4 shows the load of the inputs with and without the dummies. By inserting
dummies, the number of loads is made equal for all input signals, which ensures the same
phase difference at the outputs as at the inputs.
MUX_4_2, shown in Figure 3.16a, has 4 inputs and 2 outputs and is required to assign the
reference phases from the PLL to the phase interpolators. To guarantee the same number of
loads for all inputs, a structure different from MUX_3_2 is used, which needs only a small
number of dummy elements. Figure 3.16b shows the 4 inputs, each connected to 2
switches, where the upper switch is connected to Out1 and the lower switch is connected
to Out2. To pass a signal pair (In1 and In2, or In2 and In3, or In3 and In4, or In4 and
In1) to the outputs Out1 and Out2, two switches always have to be closed. The switches
S1..S4, each realized by a transmission gate, must be controlled such that only one switch per
output is closed at a time, which is guaranteed by a control logic. This symmetrical
structure guarantees that, at every switching point, both active inputs always have
one open and one closed transmission gate as loads. The loads of the unused inputs do not
matter.
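A behavioral view of the MUX_4_2 selection is sketched below. The mapping of each selected pair to Out1 and Out2 (first input of the pair to Out1, second to Out2) and the function names are assumptions; the sketch only checks the structural properties stated above, namely one closed switch per output and one closed plus one open switch on each active input.

```python
def mux_4_2_switches(sel):
    """Switch states for pair number sel (0: In1/In2, 1: In2/In3, 2: In3/In4, 3: In4/In1)."""
    if sel not in range(4):
        raise ValueError("sel must be 0, 1, 2 or 3")
    upper = [i == sel for i in range(4)]             # switches towards Out1
    lower = [i == (sel + 1) % 4 for i in range(4)]   # switches towards Out2
    return upper, lower


def mux_4_2(inputs, sel):
    """Route one adjacent input pair to (Out1, Out2)."""
    upper, lower = mux_4_2_switches(sel)
    return inputs[upper.index(True)], inputs[lower.index(True)]


print(mux_4_2([0, 90, 180, 270], sel=3))   # wrap-around pair In4 / In1
```

For every selection exactly two transmission gates are closed, and the two active inputs see the same load (one closed and one open gate), which is the symmetry argument made in the text.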
To reduce the high dither in the proposed PI2 CDR, a 3TS CDR, shown in Figure 3.17,
is proposed. In the figure, an additional DES and a phase interpolator are employed to
stop the phase shifting of the sampling clocks when Iclk is aligned to the center of the
data.
In conventional CDR systems, Iclk is expected to be aligned to the center of the data
and Qclk to the edge of the data. However, both clocks keep moving away from or towards the
expected positions when the CDR is in the aligned state, which is a drawback of the
conventional Alexander PD. In the 3TS CDR, instead of the single clock Qclk, two clocks named
Q1clk and Q2clk are generated with a certain phase difference, which creates a zone for
checking the alignment of the CDR. If the data edge falls in that zone, the CDR is said
to be aligned and the phase of the sampling clocks remains constant. Figure 3.18
illustrates the 3TS method applied in the 3TS CDR.
Accordingly, a modified Alexander PD is proposed that has four possible states,
i.e., early (+1), late (-1), aligned (-2) and no change (0). Inherited from the conventional
Alexander PD, the phase error at time point k for the 3TS CDR is calculated by:
where I(k) and Q2(k) represent the data at time point k sampled by Iclk and Q2clk. To
build an average phase error E(k) over a time period, e(k) is summed up as follows:
\[
E(k) = \sum_{k=0}^{n} \Big[ \big( I(k+1) \oplus Q2(k) \big) - \big( I(k+1) \oplus Q2(k+1) \big) \Big] \tag{3.8}
\]
By comparing E(k) with zero, either early (+1) or late (-1) or no change (0) is provided at
the output of the PD. The aligned state in the modified Alexander PD is conditioned by:
\[
E(k) = \sum_{k=0}^{n} \big( Q1(k) \oplus Q2(k) \big) \tag{3.9}
\]
Once the state of the 3TS CDR is determined, EN/DS system in Figure 3.17 is employed
to either enable or disable the digital controller. After the read training is finished, PD
and DES are further reused in the write training.
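The two sums of Equations (3.8) and (3.9) can be evaluated on sampled bit streams as sketched below. The sums are implemented as printed; the decision rule that maps them to the four states is an assumption made for illustration (aligned as soon as Q1 and Q2 differ at least once in the window, otherwise the sign of the accumulated error), as are the function name and the window handling.

```python
def three_ts_pd(i_bits, q1_bits, q2_bits):
    """Return +1 (early), -1 (late), -2 (aligned) or 0 (no change) for one window."""
    n = len(i_bits) - 1                      # number of terms in the sums
    # Equation (3.8) as printed: accumulated early/late information.
    e_sum = sum((i_bits[k + 1] ^ q2_bits[k]) - (i_bits[k + 1] ^ q2_bits[k + 1])
                for k in range(n))
    # Equation (3.9): counts samples where the data edge lies between Q1clk and Q2clk.
    zone_hits = sum(q1_bits[k] ^ q2_bits[k] for k in range(n))
    if zone_hits > 0:                        # assumed rule for the aligned state
        return -2
    if e_sum > 0:
        return 1
    if e_sum < 0:
        return -1
    return 0


print(three_ts_pd([0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 1, 1]))
```

In the aligned state the controller would be disabled, so the sampling clocks stop moving instead of dithering around the data edge.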
[Plot: delay (ps) versus time (ns); curves: QCLK phase in modified CDR [ps], ICLK phase in modified CDR [ps], QCLK phase in PI2 CDR [ps], ICLK phase in PI2 CDR [ps]]
Table 3.5: CDR jitter of modified and PI2 CDR systems in corner simulation
               min. jitter [ps]   max. jitter [ps]
Modified CDR   8.573              44.909
PI2 CDR        11.57              42.87

Table 3.6: Power of modified and PI2 CDR systems in corner simulation
               min. power [mW] (VDD=1.08V)   max. power [mW] (VDD=1.32V)
Modified CDR   2.788                         5.611
PI2 CDR        2.781                         5.626
it (< 80ps) fulfilled the required valid data length of the eye diagram (120ps) for GDDR5
application. The power consumption lies between 2.8mW and 5.6mW for both models
(only analog part), since the analog parts of both models are identical.
In the 3TS CDR, Q1clk is selected as the reference clock, while the other clocks, Iclk and
Q2clk, are generated so that they lag the reference clock. As the implemented phase
interpolator has 64 steps, 1LSB corresponds to 5.625° (6.25ps). When the jitter in the data
increases, the zone between Q1clk and Q2clk is widened to ensure the alignment of the CDR.
Table 3.7 shows the variation of the LSB with/without maximal random jitter in data.
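The LSB values quoted above follow from a short calculation, assuming that the 64 steps cover one full period of the 2.5GHz clock at which the phase interpolator operates:

```latex
\[
\Delta\varphi_{\mathrm{LSB}} = \frac{360^\circ}{64} = 5.625^\circ, \qquad
T_{\mathrm{clk}} = \frac{1}{2.5\,\mathrm{GHz}} = 400\,\mathrm{ps}, \qquad
\Delta t_{\mathrm{LSB}} = \frac{T_{\mathrm{clk}}}{64} = 6.25\,\mathrm{ps}
\]
```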
3.4.1 Summary
In this section, we present a novel phase interpolator based CDR using a PI controller for
read/write trainings. By combining the advantages of the modified and the adaptive
models, a good compromise between power consumption and phase jitter is achieved in
the proposed CDR. The adaptive model performs well in all cases but consumes twice the
power of the other models, because it needs four interpolators instead of two. Compared to
the adaptive model, the modified model has the disadvantage that it cannot follow fast
phase changes caused by jitter. The PI2 CDR combines the benefits of both the adaptive and
the modified models: it can follow fast phase changes while using nearly the same power as
the modified model. The disadvantage of the proposed CDR is that it is slightly slower in
phase tracking than the adaptive CDR and has high dither overall. To address the high dither
of the proposed CDR, a 3TS CDR with a modified Alexander PD is suggested. With the
help of the 3TS CDR, the system is able to detect whether the recovered clock is aligned
to the middle of the data as well as the read/write training state.
Chapter 4
Signal Integrity in Low-Power Adaptive Equalizer Trainings
Contents
4.1 State-of-the-Art
4.2 Signal Integrity Issues in High-Speed Communication System
4.3 System-Level Modeling of Equalizer Training
4.3.1 System-Level Modeling of Receiver Equalizer Training
4.3.2 System-Level Modeling of Transmitter Equalizer Training
4.4 Design and Implementation of Receiver Equalizer
4.4.1 Analog Receiver Equalizer
4.4.2 LMS Semi-Digital Receiver Equalizer
4.5 Design and Implementation of Transmitter Equalizer
4.5.1 Semi-Digital LMS Transmitter Equalizer
4.5.2 Semi-Digital Pilot Transmitter Equalizer
4.5.3 Full-Digital Direct Transmitter Equalizer
4.5.4 Simulation Results
4.6 Summary
results show that LMS in semi-digital receiver and transmitter equalizers is the best per-
forming algorithm in the high-speed source-synchronous memory systems with medium
consumed areas. Compared to other algorithms, it can reduce the inter-symbol interfer-
ence caused by both pre- and post-cursors of the channel.
4.1 State-of-the-Art
In high-speed communication systems, non-idealities of channels, such as losses and
crosstalk, are the main causes of severe SI problems, for example, ISI. ISI
degrades both the timing margin and the voltage margin of the inter-chip
signaling link, which results in a small or even closed eye opening in the eye diagram. To
compensate for the non-ideal channels, EQs [44] are incorporated in high-speed
communication systems. In some practical applications with unknown or time-varying channels,
adaptive equalization circuits [45] [46] are required. Several adaptive equalization
algorithms, such as the LMS algorithm and the recursive least squares (RLS) algorithm, have been
developed for adaptively adjusting filter coefficients.
pre-emphasis and receive equalization settings. For example, the lowest power for which
the signaling system provides minimum communication bandwidth without exceeding
a desired BER, can be configured. Similarly, Tx and Rx EQs are adjusted when data are
transmitted to the master control IC.
However, this architecture requires EQs and control units on both sides of the
channels, which significantly increases the overall cost of the system. In particular, when
more than one slave IC is present in the high-speed communication system, a large number
of Tx and Rx EQs are required in the slave ICs. For existing slave ICs, a redesign
is usually avoided. Besides, the feedback signals generated by the control units need
additional back channels and pins. The back channels themselves affect the SI, so
the feedback signals can be distorted when adjusting the adaptive Tx EQ.
Amirkhany et al. [49] proposed a tri-modal asymmetric memory controller interface
consisting of a voltage-mode one-tap transmitter equalization in the write direction and
a linear EQ (LEQ)/1-tap DFE in the read direction to compensate for channel ISI. On the
DRAM side, only an LEQ is provided in the write direction. This architecture shifts
most of the complexity to the controller and keeps the DRAM design as simple as possible.
However, it is a non-adaptive architecture, in which the Tx EQ setting is customized to the
known channel characteristics of a 3-inch stripline FR4 trace. With this architecture, the PCB
designs for high-speed communication systems are considerably constrained.
[Plot: S11, S21 and S22 in dB (0 to -40) versus frequency (0 to 5 GHz)]
Figure 4.2: Example of a channel between memory controller and GDDR5 memory
[Flowchart blocks: power up; Address Training Loop (optional); WCK2CK Training Loop; Read Training Loop; Receiver Equalizer Training Loop; WRITE Training Loop; Transmitter Equalizer Training Loop; decision "Write Data Correct" (Yes/No); Start Normal Operation]
to make the eye opening larger. The Rx EQ is set such that the errors between the
training data and the received data stay below a certain
threshold value. Whether this condition has been achieved is checked in a
subsequent query. If the read data are correct, the write training loop follows.
Otherwise, if there are errors in the data due to SI problems, the data cannot be
read correctly. After the SI problems are solved by performing Rx EQ training, the training
sequence jumps back and redoes the read training in order to adjust the correct receive phase
for the corrected read data. A similar process is executed with the Tx EQ training loop. After
the interface training, normal operation can start. Once the BER monitored
in the memory controller exceeds a certain threshold, e.g. 10^-12, due to PVT variations,
the interface needs to be retrained.
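The control flow of this training sequence can be summarized in a short sketch. All four callables are hypothetical placeholders for the respective training steps, and the retry limit is an assumption added so that the loop always terminates; the sketch covers only the read, write and equalizer training loops described above.

```python
def run_interface_training(read_training, rx_eq_training,
                           write_training, tx_eq_training, max_retries=3):
    """Run read and write training, inserting EQ training whenever data is wrong.

    read_training / write_training return True when the data is correct.
    """
    log = []
    steps = (("read", read_training, "Rx EQ", rx_eq_training),
             ("write", write_training, "Tx EQ", tx_eq_training))
    for name, phase_training, eq_name, eq_training in steps:
        for _ in range(max_retries + 1):
            log.append(f"{name} training")
            if phase_training():           # data correct: continue with the next loop
                break
            log.append(f"{eq_name} training")
            eq_training()                  # solve the SI problem, then redo the training
        else:
            raise RuntimeError(f"{name} training did not succeed")
    log.append("normal operation")
    return log


reads = iter([False, True])                # first read attempt fails, second succeeds
print(run_interface_training(lambda: next(reads), lambda: None,
                             lambda: True, lambda: None))
```

A BER monitor in normal operation would simply call the same routine again once the error rate exceeds the chosen threshold.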
Figure 4.4 depicts the proposed architecture [13] by combining the advantages of the
system architecture [47] that uses adaptive EQs and the system architecture [49] that has
a simple slave IC. In order to adjust DFE, LEQ and Tx EQ, two steps are required. The
first step is Rx EQ training which is to adjust the coefficients of Rx EQs by sending data
patterns provided by algorithmic pattern generator (ALPG) into FIFO in the slave IC
through a trained channel e.g. Ch_data0. Alternatively, the trained channel can be a data
line that is provided for other purposes e.g. transmitting address signals with a lower
data transfer speed and a better signal quality. This loop helps to determine the channel
characteristic of Ch_data1. Once channel characteristic is known, the next step is Tx EQ
training by sending data patterns through Ch_data1. This loop is built up by the same
data channel. The control unit uses the known channel characteristic to set the Tx EQ
coefficients.
Compared to the conventional architecture in Figure 4.1, the proposed configuration
enables the reduction of the complexity in the slave IC dramatically in terms of area and
power. All the adaptive EQs and control units are available only in the master control IC.
Instead, a FIFO is required in the slave IC to store the test data patterns for the training
purpose. Furthermore, the number of the control units can be reduced if two or more
Tx EQs can share the same control unit for those nearby data lines which may have the
similar channel characteristics.
The remainder of this chapter deals with 1) system-level modeling of EQ trainings
in Matlab/Simulink, 2) design and implementation of the analog Rx EQ and the semi-digital Rx
EQ, 3) implementation of the Tx EQ as well as investigation of different algorithms (i.e. LMS,
pilot signal/peak detection, direct).
Figure 4.6a and Figure 4.6b depict how the FFE cancels the pre-cursors and the DFE cancels
the post-cursors internally. In both EQs, each coefficient Ci is multiplied with either the
input signal or delayed versions of the input signal, which yields the ISI components. By
subtracting the ISI components caused by pre- and post-cursors, the impulse signal is
recovered with delays.
The coefficients are usually generated adaptively using the LMS
algorithm [51] [52], which can be expressed as:
\[
Y(k) = \sum_{i=1}^{N} C_i \, \nu(k-i+1) + \sum_{j=1}^{M} C_j \, I(k-j) \tag{4.3}
\]
i=1 j=1
As the channel has a low-pass characteristic, it is modeled as a low-pass filter in
Matlab/Simulink, as shown in Figure 4.8. Constants f0...f9 represent the coefficients of the
low-pass filter derived from the impulse response of the channel.
The block prbs produces pseudo random bit sequences (PRBS) as training test patterns
transmitted to the memory controller through a mathematically modeled channel. For
simplicity, the FIFO in the memory is replaced with a prbs block identical to the
prbs block on the memory controller side. The delays caused by the FFE shown in Figure 4.6a are
modeled by three unit delays to generate an error signal between the ideal test patterns
and the output of the Rx EQ.
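A comparable channel and pattern model can be sketched with NumPy. The ten tap values below are placeholders standing in for f0...f9 and are not the coefficients of the channel investigated here; the PRBS order (PRBS7) and the function names are likewise assumptions, since the pattern length is not stated.

```python
import numpy as np

# Hypothetical 10-tap low-pass impulse response standing in for f0...f9.
F_EXAMPLE = np.array([0.05, 0.55, 0.25, 0.10, 0.03, 0.01, 0.005, 0.003, 0.001, 0.001])


def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of a PRBS7 sequence (polynomial x^7 + x^6 + 1)."""
    state = seed & 0x7F                      # any non-zero 7-bit start value
    bits = np.empty(n_bits, dtype=int)
    for n in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits[n] = new_bit
    return bits


def fir_channel(symbols, taps=F_EXAMPLE):
    """Apply the FIR channel model and keep as many samples as were sent."""
    return np.convolve(symbols, taps)[:len(symbols)]


tx = 2.0 * prbs7(20) - 1.0                   # map bits 0/1 to levels -1/+1 (assumption)
print(np.round(fir_channel(tx), 3))
```

Sending a single impulse through `fir_channel` returns the tap values themselves, which mirrors how the filter constants are derived from the impulse response of the channel.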
Figure 4.9 and Figure 4.10 depict FFE and DFE, respectively. Both EQs consist of 4
tap blocks. Figure 4.11 illustrates the tap block, where the adaptation of coefficients in
Equations (4.1) and (4.2) takes place.
Figure 4.12 illustrates the coefficients in FFE and DFE during the Rx EQ training. The
total simulation time adds up to 1.05e-6s. The coefficients converge after 0.4e-6s. Table
4.1 lists the coefficients in FFE and DFE after convergence.
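The convergence behavior of such a training can be reproduced with a small LMS loop. The sketch uses the textbook LMS update (error times input, scaled by the step size mu), which is assumed to correspond to the tap-block adaptation referred to above; it covers a 4-tap feed-forward equalizer only, without the DFE, and the channel taps, step size and seed are illustrative values.

```python
import numpy as np


def lms_equalizer(received, desired, n_taps=4, mu=0.01):
    """Adapt an n_taps FIR equalizer with LMS; returns (coefficients, error trace)."""
    coeffs = np.zeros(n_taps)
    delay_line = np.zeros(n_taps)            # delay_line[0] holds the newest sample
    errors = np.empty(len(received))
    for k, sample in enumerate(received):
        delay_line = np.roll(delay_line, 1)
        delay_line[0] = sample
        output = coeffs @ delay_line          # equalizer output
        err = desired[k] - output             # error against the ideal pattern
        coeffs = coeffs + mu * err * delay_line   # LMS coefficient update
        errors[k] = err
    return coeffs, errors


rng = np.random.default_rng(1)
ideal = rng.choice([-1.0, 1.0], size=5000)                 # training pattern
received = np.convolve(ideal, [1.0, 0.4, 0.2])[:len(ideal)]  # hypothetical channel
coeffs, errors = lms_equalizer(received, ideal)
print(np.round(coeffs, 3), np.mean(errors[-500:] ** 2))
```

As in the simulation described above, the coefficients settle after an initial transient, and the remaining error is determined by the finite number of taps and by the step size.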
[Figure 4.12: two panels showing coefficients c1 to c4 versus time (0 to 1.2e-6 s)]
Figure 4.13 depicts the eye diagrams of distorted data and equalized data. Table 4.2
shows their vertical and horizontal eye openings. With Rx EQ, the vertical eye opening is
improved by 42.8% while the horizontal eye opening by 8.1%.
The principle of the LMS algorithm [14] is to adjust the Tx EQ coefficients depending on the
error at the output of the EQ compared to the expected data, as shown in Equation (4.1). The
EQ output EQ(k) is given as:
\[
EQ(k) = \sum_{j=1}^{N} C_j \, \nu(k-j+1) \tag{4.5}
\]
Figure 4.14 illustrates the adaptive Tx EQ using LMS algorithm after Rx EQ training
in Matlab/Simulink. The block LMS and LE work similar to the block FFE in Figure 4.9.
The block prbs is reused at the input of the block LMS. Through the trained Rx EQ data
are transmitted from memory to Tx EQ without large errors. As shown in Figure 4.15, the
coefficients converge after 300ns.
[Figure 4.15: coefficients c1 to c5 versus time (0 to 3.5e-7 s)]
Pilot signaling and peak detection [15] is another alternative implementation for the con-
trol unit in the Tx EQ training. This method provides the flexibility of adjusting coefficient
values sequentially in adaptive Tx EQ.
[Flowchart: decision "Error < 0" (Yes/No); action "Update the Cn"]
Figure 4.16 illustrates the procedure of the pilot signal/peak detection algorithm. The
current coefficient of the Tx EQ is initialized with a large value so that the data peak
after the channel can be detected in the receiver, while the undetermined coefficients are
set to zero. During the Tx EQ training, several known pilot signals are sent from the
transmitter. The peak is detected and compared with an expected value. When the peak
is larger than the expected value, the current coefficient is reduced by one LSB of
the DAC that converts the digital coefficient into an analog value, until the peak is
smaller than that value. The next pilot signal is then sent. This procedure continues
until all the coefficients are determined. Impulse responses for an EQ and a channel are
given by:
\[
F_{EQ}(n) = \sum_{k=1}^{n} C_k \, \delta(n-k+1) \tag{4.6}
\]
\[
F_{Ch}(n) = \sum_{k=1}^{n} f_k \, \delta(n-k+1) \tag{4.7}
\]
where v denotes pilot signals. Pilot signals, i.e. "10000", "11000", "10100", "10010" and
"10001" are sent to determine the coefficients. By knowing channel coefficients, pilot sig-
nals and DFE outputs, Tx EQ coefficients are derived.
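The sequential search of the pilot signal/peak detection procedure can be sketched as follows. The callback `measure_peak` is a hypothetical stand-in for sending the pilot signal of coefficient n and detecting the peak at the receiver; unsigned integer DAC codes and a decrement of one LSB per iteration are assumptions made for the example.

```python
def pilot_training(measure_peak, expected_peaks, start_code):
    """Determine the Tx EQ coefficients one after another.

    measure_peak(n, codes): peak detected for pilot n with the current DAC codes.
    expected_peaks: one expected peak value per coefficient.
    """
    codes = [0] * len(expected_peaks)        # undetermined coefficients stay at zero
    for n, expected in enumerate(expected_peaks):
        codes[n] = start_code                # start with a large value
        # reduce by one LSB until the detected peak is smaller than expected
        while codes[n] > 0 and measure_peak(n, codes) >= expected:
            codes[n] -= 1
    return codes


# Toy receiver model: the peak is simply proportional to the current code.
print(pilot_training(lambda n, codes: 10 * codes[n], [55, 25], start_code=15))
```

The number of pilot transmissions grows with the distance between the start value and the final code, which is the price paid for needing only a comparator and a counter in the control unit.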
A direct way to calculate the coefficients of the LEQ requires only a single test pattern, for example an impulse signal. The initial coefficients are set to zero. After Rx EQ training, the test pattern is transmitted from the memory to the memory controller. Let y represent the data at the input of the FFE, which can be expressed as:
$$y(k) = v \otimes f = \sum_{j=1}^{k} f_j\, v(k-j+1) \qquad (4.9)$$
and let Y denote the output after the channel and the LEQ according to Equation (4.8). After Rx EQ training, the data at the DFE output can be assumed to be as good as the ideal data. Under the assumption that the channels in both directions have exactly the same characteristics, Y is approximately equal to the DFE output. By comparing y and Y in Equations (4.9) and (4.8), respectively, the Tx EQ coefficients can be derived as:
$$C_k = \begin{cases} Y(1)/y(1) & (k = 1) \\ \left[ Y(k) - \sum_{j=1}^{k-1} C_j\, y(k+1-j) \right] C_1 / Y(1) & (k \ge 2) \end{cases} \qquad (4.10)$$
where k ∈ N.
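The recursion in Equation (4.10) can be sketched in a few lines of Python. The sketch assumes that Equation (4.8) has the convolution form Y(k) = sum over j = 1..k of C_j y(k - j + 1), which is the relation that Equation (4.10) inverts. The function name and the sample values are illustrative only and are not taken from the thesis.

```python
def direct_tx_eq_coefficients(y, Y, n_taps):
    """Solve Y = C (*) y tap by tap, following Equation (4.10).

    y: samples at the FFE input (channel output); y[0] corresponds to y(1)
    Y: target samples, approximated by the DFE output; Y[0] corresponds to Y(1)
    """
    if y[0] == 0:
        raise ValueError("y(1) must be non-zero")
    coeffs = [Y[0] / y[0]]                          # case k = 1
    for k in range(2, n_taps + 1):                  # case k >= 2 (1-based, as in the text)
        # sum_{j=1}^{k-1} C_j * y(k + 1 - j), translated to 0-based list indices
        acc = sum(coeffs[j - 1] * y[k - j] for j in range(1, k))
        # C_1 / Y(1) equals 1 / y(1); dividing by y(1) avoids a 0/0 when Y(1) is zero
        coeffs.append((Y[k - 1] - acc) / y[0])
    return coeffs


# Illustrative (hypothetical) samples of an impulse test pattern after the channel
y_samples = [0.8, 0.3, 0.1, 0.05, 0.02]
Y_samples = [0.8, 0.1, 0.105, 0.07, 0.0225]
print(direct_tx_eq_coefficients(y_samples, Y_samples, 5))
```

Each coefficient depends on all previously calculated coefficients, so the taps have to be evaluated in sequence.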
Figure 4.19 illustrates the adaptive Tx EQ using direct calculation after Rx EQ training in Matlab/Simulink. The delays due to the FFE shown in Figure 4.6a are taken into account for y at the input of the block direct calculation. Figure 4.20 shows the internal structure of the block direct calculation. The coefficients are synchronized by adding different integer delays.
[Figure 4.19: Adaptive Tx EQ using direct calculation after Rx EQ training in Matlab/Simulink]
[Figure 4.20: Internal structure of the block direct calculation, with integer delays Z^-4, Z^-3, Z^-2 and Z^-1 at the outputs C1 to C4]
Table 4.3 compares different Tx EQs based on the LMS algorithm, pilot signal/peak detection and direct calculation. The main parameters considered are the Tx EQ coefficients and the number of test data patterns required for the training. The results show that direct calculation yields coefficients similar to those of the LMS algorithm, but requires only 5 data bits for the training. However, direct calculation assumes the same channel characteristics in both the write and the read direction, which may not hold in reality since the channel characteristics depend on many factors, e.g. impedance matching. The characteristics of the same channel usually differ somewhat when data are transmitted in different directions. In comparison to direct calculation, adaptive EQs using the LMS algorithm and pilot signal/peak detection need much more time to adjust the coefficients. The number of test data bits depends on the individual channel and the number of taps. Moreover, pilot signal/peak detection can only improve the post-cursors in the impulse response of the channel, since peak detection only searches for the maximal output.
The test channel delivers good performance in terms of small loss, as shown in Figure 4.21a. To further explore the adaptivity of the system, a highly lossy channel with smaller eye openings is investigated. Figure 4.22 shows the eye diagrams for the lossy channel using the same algorithms. Table 4.5 compares the vertical and horizontal eye openings. Without EQ, the vertical and horizontal eye openings are 45% and 82.5% of the ideal eye opening, respectively. All three implementations improve the SI compared with the case without EQ. Direct calculation and the LMS algorithm show larger eye openings than pilot signal/peak detection.
An adaptive Rx EQ is mainly composed of one LMS engine per coefficient and the EQ summers in both the FFE and the DFE, as shown in Figure 4.23.
[Figure 4.23: Adaptive Rx EQ; each LMS engine consists of a multiplier and an OTA, and a subtractor (Sub) forms the error from the expected data]
According to Equations (4.1) and (4.2), the LMS engine behaves like an integrator. The two equations differ, however, in their inputs: in the case of the FFE, the expected data are replaced with the distorted data. Based on [53] and [54], the LMS engine is implemented by a multiplier and an integrator.
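To illustrate why the LMS engine behaves like an integrator, the following Python sketch accumulates the product of error and tap input, scaled by the step size. It assumes the standard LMS update C(n+1) = C(n) + mu * e(n) * x(n) for Equations (4.1) and (4.2), which are not reproduced here. The test channel with a single post-cursor and all names are hypothetical.

```python
import random


def lms_step(coeffs, taps, error, mu):
    """One LMS update: every coefficient accumulates (integrates) mu * error * tap input."""
    return [c + mu * error * x for c, x in zip(coeffs, taps)]


def train_ffe(received, expected, n_taps, mu):
    """Adapt an n_taps FFE so that its output follows the expected data."""
    coeffs = [0.0] * n_taps                          # initial coefficients set to zero
    errors = []
    for n in range(n_taps - 1, len(received)):
        taps = [received[n - i] for i in range(n_taps)]   # distorted data in the delay line
        out = sum(c * x for c, x in zip(coeffs, taps))
        err = expected[n] - out                      # output of the subtractor
        coeffs = lms_step(coeffs, taps, err, mu)
        errors.append(err)
    return coeffs, errors


def demo(n_bits=2000, mu=0.05, seed=0):
    """Hypothetical channel with one post-cursor: r(n) = d(n) + 0.4 * d(n-1)."""
    rng = random.Random(seed)
    data = [rng.choice((-1.0, 1.0)) for _ in range(n_bits)]
    received = [data[n] + (0.4 * data[n - 1] if n else 0.0) for n in range(n_bits)]
    return train_ffe(received, data, n_taps=3, mu=mu)


if __name__ == "__main__":
    final_coeffs, err_trace = demo()
    print(final_coeffs, abs(err_trace[-1]))
```

In the analog implementation the accumulation in `lms_step` is performed by the integrator, and the product `error * x` by the multiplier.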
In [55], transconductance multipliers are classified into eight types. Compared to the other transconductance multipliers, the multiplier with Gilbert cell structure shows good linearity and low noise. Figure 4.24a shows the structure of a Gilbert cell four-quadrant multiplier. It consists of four one-quadrant multipliers, as seen in Figure 4.24b. The output current of the one-quadrant structure can be expressed as:
$$I \propto (X + x)(Y + y) \qquad (4.11)$$
where X, Y and x, y denote the dc and ac input voltages, respectively. Therefore,
$$I_{out} \propto \left[(X + x)(Y + y) + (X - x)(Y - y)\right] - \left[(X - x)(Y + y) + (X + x)(Y - y)\right] \qquad (4.12)$$
[Figure 4.24: Gilbert cell multiplier: (a) Four quadrant, (b) One quadrant]
As a result, the output voltage of the Gilbert cell multiplier Vout is proportional to 4xyRL, where RL is the load resistor. By selecting a proper value of RL, Vout can be made equal to xy.
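As a check on the 4xy result, the four one-quadrant products can be expanded term by term. The expansion below assumes that the subtracted branch carries the products (X - x)(Y + y) and (X + x)(Y - y), which is the pairing that reproduces the factor 4xy quoted in the text.

```latex
\begin{aligned}
(X+x)(Y+y) + (X-x)(Y-y) &= 2XY + 2xy \\
(X-x)(Y+y) + (X+x)(Y-y) &= 2XY - 2xy \\
I_{out} \propto (2XY + 2xy) - (2XY - 2xy) &= 4xy
\end{aligned}
```

The dc product XY and the mixed terms Xy and xY cancel, so only the product of the two ac inputs remains at the differential output.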
[Figure 4.25: Schematic of the OTA (input differential pair M1-M2, current mirrors M3-M4, M5-M6 and M7-M8)]
Figure 4.26 shows the simulation result when Vin is swept from -50 mV to 50 mV. The dc output level of the multiplier is set to 0.75 V.
One method to realize an integrator, as introduced in [56], is the Gm-C technique. The transconductor Gm is implemented by a simple operational transconductance amplifier (OTA), shown in Figure 4.25. Three current mirrors, M3-M4, M5-M6 and M7-M8, copy the currents converted from the input voltages by the input differential pair M1-M2. The output voltage is given by:
$$V_{out} = \frac{g_{m1}\, g_{m3}}{g_{m2}\, C} \int (V_{in+} - V_{in-})\, dt \qquad (4.13)$$
where gm1, gm2 and gm3 denote the transconductances of the differential pair M1-M2, the current mirror M3-M4 or M5-M6, and the current mirror M7-M8, respectively. Comparing Equation (4.13) with Equation (4.1), gm1 gm3/(gm2 C) corresponds to the step size µ, which is constrained by the LMS convergence condition. If the step size is too large, the system becomes unstable and the coefficients may not converge to their ideal values. If it is too small, the training takes a long time to converge. Normally, the step size is chosen in the range 0 < µ << 1.
[Figure 4.26: Simulated multiplier outputs for Vin from -50 mV to 50 mV: (a) Two differential outputs, (b) The difference between two outputs]
Figure 4.27 shows the ac simulation of the OTA. Table 4.6 lists the ac performance of the OTA in detail.
[Figure 4.27: AC simulation of the OTA (gain in dB and phase in degrees)]
Based on [56], the summers in the DFE and the FFE are realized using series-gated differential pairs with common load resistors, as shown in Figure 4.28 and Figure 4.29. The dc voltages V1 to V4 and V5 to V8 are connected to the OTA outputs. According to Figure 4.6a, the main tap of the receiver equalizer is positioned at the fourth differential pair in the FFE summer. To subtract the ISI components caused by the pre-cursors, the first three differential pairs are connected with opposite polarity to the fourth one. Similarly, in the DFE summer the first differential pair is connected with opposite polarity to the following four differential pairs in order to cancel the post-cursors. The FFE output is given by:
$$V_{ffeout} = R_L \left( g_{m4}\, \nu(4) - \sum_{i=1}^{3} g_{mi}\, \nu(i-1) \right) \qquad (4.14)$$
where gm1, gm2, gm3 and gm4 represent the transconductances of the four input differential pairs in the FFE summer. Similarly,
$$V_{dfeout} = R'_L \left( g'_m\, V_{ffeout} - \sum_{i=1}^{4} g'_{mi}\, I(i-1) \right) \qquad (4.15)$$
where g'm, g'm1, g'm2, g'm3 and g'm4 represent the transconductances of the five differential pairs in the DFE summer.
[Figures 4.28 and 4.29: EQ summers of the FFE (bias voltages V1 to V4, load resistors RL) and of the DFE (bias voltages V5 to V8, load resistors R'L)]
Figure 4.30 shows the block diagram of a digital LMS engine, which consists of two ROMs for encoding and decoding, an adder and two pipelined multipliers. The Verilog code for the adder and the multiplier is given in Appendix A.3. The coefficients are updated based on Equations (4.1) and (4.2). The encoding ROM converts the ADC codes of the ideal or distorted data into floating data, while the decoding ROM converts the absolute floating number into digital-to-analog converter (DAC) codes. In the implementation, the ADC and the DAC are modeled in Verilog. In order to increase the speed of the coefficient adaptation, pipelined structures are used in the adders and multipliers.
[Figure 4.30: Block diagram of the digital LMS engine with ADC, Rom_in, Rom_out and DAC; inputs: ideal/distorted data, error and step size]
Figure 4.31 shows the implementation of a pipelined multiplier. The ten-bit data consist of one sign bit, one integer bit and eight floating bits. If the input is larger than 1, it is first right-shifted by 1 bit. The complement code of each input is calculated according to its sign bit. After the multiplication, the result is left-shifted if the input was right-shifted before. The highest 9 bits form the final result, and the complement code is calculated again depending on the sign bit of the result.
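A simplified Python model of such a fixed-point multiplication is sketched below. It assumes that the 10-bit words are two's-complement numbers with eight fractional bits, and it saturates the result instead of reproducing the pre-shift of inputs larger than 1 and the pipeline stages of Figure 4.31. The function names are illustrative.

```python
FRAC_BITS = 8      # eight fractional ("floating") bits
WORD_BITS = 10     # sign bit + integer bit + eight fractional bits


def to_signed(word, bits=WORD_BITS):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word


def to_word(value, bits=WORD_BITS):
    """Convert a signed integer back to its two's-complement bit pattern."""
    return value & ((1 << bits) - 1)


def fixed_mul(a_word, b_word):
    """Multiply two fixed-point words via sign and magnitude."""
    a, b = to_signed(a_word), to_signed(b_word)
    negative = (a < 0) != (b < 0)                  # XOR of the two sign bits
    mag = (abs(a) * abs(b)) >> FRAC_BITS           # product of the magnitudes, rescaled
    mag = min(mag, (1 << (WORD_BITS - 1)) - 1)     # saturate to the 9-bit magnitude
    return to_word(-mag if negative else mag)      # complement again if negative


def to_float(word):
    """Real value represented by a fixed-point word."""
    return to_signed(word) / (1 << FRAC_BITS)


if __name__ == "__main__":
    half = 128                                     # 0.5 with eight fractional bits
    print(to_float(fixed_mul(half, half)))
```

Working on magnitudes and restoring the sign afterwards mirrors the two complement steps described in the text.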
[Figure 4.31: Pipelined multiplier: check of the integer bits In1[8] and In2[8] with optional 1-bit right shift and XOR of the sign bits (Pipeline1), complement codes of In1 and In2 (Pipeline2), multiplication In1 * In2 with shift of the result and selection of bits [17:9] (Pipeline3), complement of the result (Pipeline4)]
In the semi-digital equalizer, in contrast to the fixed inputs at each differential pair shown in Figure 4.28 and Figure 4.29, the signs of the coefficients are indicated at the output of the adder. The two differential data inputs (positive/negative) are exchanged by switches according to the sign bit.
The conventional pilot signal/peak detection method shown in Figure 4.16 requires an analog peak detection circuit. In order not to change the memory architecture dramatically, the coefficients in the proposed algorithm in Figure 4.32 are increased by one LSB instead of being reduced by one LSB [15].
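[Figure 4.32: Flow chart of the proposed pilot algorithm: the received data are compared with the pilot data and the error is detected and held; depending on the error, either the bias voltage Vn of the n-th tap in the EQ summer is increased by one LSB and the old pilot data are sent again, or the algorithm changes to the next tap and sends the new pilot data]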
[Figure 4.33: Tx EQ training between memory controller and memory: pilot data, Tx EQ summer with adaptation blocks and DACs, channel, Sampler2 with reference value Vref and FIFO on the memory side; trained Rx EQ, Sampler1 (Vref = 500 mV) and control unit on the memory controller side]
At the memory side, the same sampler, but with a different preset reference voltage (usually set larger than half of the ideal data amplitude), is reused to detect the entire pilot signals, as shown in Figure 4.33. During the Tx EQ training, the output of the sampler is first sent to the FIFO in the memory. After that, the distorted pilot data in the FIFO are sent through the trained Rx EQ to the control unit, where they are compared with the ideal pilot data in the memory controller. If there is an error, the corresponding bias voltage in the equalizer summer is increased by one LSB of the DAC, and the same pilot data are sent again. Otherwise, the adaptation for this tap is finished and the new pilot data for the next tap are sent. In this way, the coefficients are adjusted one by one.
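The tap-by-tap procedure described above can be sketched as a simple loop. In this sketch the callback `has_error` is a hypothetical stand-in for the complete path (sending the pilot data through the Tx EQ and the channel, sampling, buffering in the FIFO, passing the trained Rx EQ and comparing with the ideal pilot data), and the 8-bit code range is an assumption.

```python
def pilot_training(n_taps, has_error, max_code=255):
    """Adapt one tap at a time: add one LSB while the received pilot data differ from the ideal ones.

    has_error(tap, codes) -> bool is a hypothetical model of the link and the comparison.
    """
    codes = [0] * n_taps                    # DAC codes, starting from the minimal value
    for tap in range(n_taps):
        # Error detected: Vn + 1 LSB, then the same pilot data are sent again
        while has_error(tap, codes) and codes[tap] < max_code:
            codes[tap] += 1
        # No error: the coefficient is kept and the pilot data of the next tap are sent
    return codes


if __name__ == "__main__":
    # Toy link model: a tap is error-free once its code reaches a threshold
    thresholds = [10, 3, 7, 0, 5]
    print(pilot_training(5, lambda tap, codes: codes[tap] < thresholds[tap]))
```

Because each tap is only increased and never decreased, a single comparison result per pilot transmission is sufficient, which is why the sampler can replace an analog peak detector.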
Assume that f1 is the pre-cursor and f2 is the main cursor. The training pilot signals and the peak corresponding to each coefficient after the Tx EQ and the channel are given in Table 4.7. When the current coefficient, e.g. C2, is adjusted, Y(1) changes slightly due to the pre-cursor according to Equation (4.8). However, this change is small, since the coefficient is increased from a minimal (negative) value. The drawback of the pilot algorithm is that it cannot cancel the ISI components caused by the pre-cursors of the channel, since it is based on peak detection.
Figure 4.34a shows the block diagram of the comparison in the control unit. The hold component detects the error and keeps it until the reset signal arrives.
[Figure 4.34: (a) The block diagram of control unit, (b) Time sequence of pilot control unit (clk1: 5 GHz, clk2: 500 MHz)]
Based on the signal hold_out, the comparator decides whether to update the current tap coefficient (comp1='1') or to move to the next tap coefficient (comp2='1'), as shown in Figure 4.34b. The signals comp1 and comp2 also control when the shift register loads and sends the pilot data. The control signal generator periodically produces the control signals for hold and comparison. In order to ensure that the pilot data are sent in the center of hold_rst and comp_en, the period of these two control signals is set to ten times the clock period of the generator (2 ns at 500 MHz). Therefore, a coefficient is adapted every 20 ns. The address of the pilot data in the ROM is connected to the output of the counter in the control unit.
To ensure an equalizer data rate of 5 Gbps, two clock frequencies are used: 5 GHz for the hold component and the shift register, and 500 MHz for the other components. Each adaptation block is controlled by two signals, cn_enable and adap_enable, shown in Figure 4.35, which put the adaptation block into one of three states: idle, adapting the coefficient or keeping the coefficient.
Besides refreshing the coefficients of the Tx and Rx EQs by adapting the bias voltages of the equalizer summers in the LMS and pilot algorithms, a full-digital Tx EQ is implemented. Figure 4.36 illustrates the loop training for direct calculation.
Figure 4.37 and Figure 4.38 show the implementation of a full-digital adaptive Tx EQ, which contains multipliers, subtractors, adders and dividers based on Equation (4.10). All of these digital components are designed with a pipeline structure. Each component has a different delay. The delay of each coefficient adds to the delay of its preceding coefficients. The black values indicate the output delays in the behavioral model, while the red ones give the delays after synthesis.
[Figures 4.37 and 4.38: Full-digital adaptive Tx EQ based on Equation (4.10). Legend: C: coefficient; Y: data after Rx EQ; y: data after channel. Annotated output delays: C1: 19 delays; C2: 49 and 50 delays; C3: 82 and 85 delays; C4: 115 and 119 delays; C5: 148 and 153 delays]
Figure 4.39 shows the implementation of a pipelined divider. The 10-bit data consist of a sign bit, an integer bit and eight floating bits. The dividend is first extended to 18 bits, and the integer bit of the divisor is checked to determine whether the divisor is smaller than the dividend. The division starts at the highest bit of the dividend. If the remainder is larger than the divisor, the divisor is subtracted from the remainder. The result of the subtraction is then shifted towards the higher bits of the remainder, and the next bit of the dividend is shifted into the lowest bit of the remainder. Meanwhile, a one is set in the quotient and shifted towards the higher bits. Otherwise, if the remainder is smaller than the divisor, the next bit of the dividend is directly shifted into the lowest bit of the remainder, and a zero is set in the quotient and shifted towards the higher bits. The loop continues until the lowest bit of the dividend has been shifted in. The final result (9 bits + 1 sign bit) is selected depending on the dividend and divisor values. The Verilog code for the pipelined divider is given in Appendix A.3.
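The shift-and-subtract loop described above corresponds to the classic restoring division. The Python sketch below uses the standard formulation on unsigned integers (the next dividend bit is shifted in first, then the divisor is compared with "greater than or equal"); the sign handling, the pre-shift of the divisor and the 18-level pipeline of Figure 4.39 are omitted, and the names are illustrative.

```python
def restoring_divide(dividend, divisor, n_bits=18):
    """Bit-serial division of unsigned integers, most significant bit first."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    rem = 0
    quo = 0
    for i in range(n_bits - 1, -1, -1):              # loop i from 17 down to 0
        rem = (rem << 1) | ((dividend >> i) & 1)     # shift the next dividend bit into the remainder
        quo <<= 1                                    # make room for the next quotient bit
        if rem >= divisor:
            rem -= divisor                           # subtract the divisor ...
            quo |= 1                                 # ... and set a one in the quotient
    return quo, rem


if __name__ == "__main__":
    # 0.75 / 1.5 with eight fractional bits: the dividend is extended by 8 bits first
    quotient, _ = restoring_divide(192 << 8, 384)
    print(quotient / 256)
```

Each loop iteration depends only on the remainder and quotient of the previous one, so the loop can be unrolled into one pipeline stage per bit.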
[Figure 4.39: Pipelined divider: the dividend is extended to 18 bits, the divisor is right-shifted by 1 bit if its integer bit divisor[8] is 1, and an 18-level pipeline realizes the loop (i from 17 to 0) that calculates the quotient; the quotient is taken from the temporary register res depending on dividend[9] and divisor[9] (dd: dividend, rem: remainder, res: temporary register of the quotient)]
Figure 4.40a illustrates a conventional mixed-signal top-down design flow. The flow mainly consists of seven design levels. It begins with the mathematical verification of the system concept. Once the mathematical system model is verified, block-level simulation based on behavioral models is performed. This level enables designers to verify whether the block-level subsystems meet the system-level requirements. After successful block-level simulation, the system is divided into digital and analog portions. The digital sub-flow comprises several levels such as synthesis, timing simulation, and place and route, while the analog sub-flow consists of schematic design and layout. Finally, the mixed-signal layout is integrated and implemented in an ASIC.
For system-level simulation, the Matlab/Simulink framework has been widely adopted for many years because of its available toolset and ease of use for algorithmic implementation. Many efforts have been made to integrate Matlab into popular EDA tool flows for different fields of application, e.g. RFID [57] and power electronics [58]. B. Gestner et al. [59] developed a Modelsim-Matlab interface for RTL debugging and verification.
[Figure 4.40: (a) Conventional mixed-signal top-down design flow, (b) Proposed design flow]
For behavioral-level simulation, Verilog-AMS has been applied in mixed-signal design environments. As an extension of Verilog-HDL and Verilog-A, Verilog-AMS provides accurate behavioral modeling with a single simulator. P. Frey et al. [60] explained the semantics of Verilog-AMS connect modules between analog and digital contexts.
However, a major problem of the conventional design flow is that it requires a higher level of refinement from Matlab/Simulink to Verilog-AMS. Although some existing tools, such as Simulink HDL Coder by Mathworks or Synphony HLS by Synopsys, allow synthesizable HDL code to be generated directly from the Matlab/Simulink environment, the generated RTL is often unsuitable for efficient FPGA or ASIC implementation due to the limited set of supported blocks. Therefore, a "full-custom" Verilog-AMS behavioral model often remains competitive. To benefit from both Matlab/Simulink and Verilog-AMS and to minimize the design time, Matlab/Simulink-HDL co-simulation is proposed in this work.
Co-simulation is indispensable in mixed-signal design. It allows individual components to exchange information collaboratively through different simulation tools running simultaneously. Moreover, components at different abstraction levels can be simulated on a co-simulation platform. In this way, both design acceleration and verification of low-level designs with system-level models can be achieved.
P. Daglio et al. [61] proposed a VHDL-AMS method to explore the mixed-signal domain at different abstraction levels by integrating different EDA tools. Based on the design flow, an embedded flash macro-cell is illustrated as a test case. F. Pecheux et al. [62] compared Verilog-AMS, an alternative HDL for mixed-signal design, with VHDL-AMS with respect to language aspects, expression of structure, signal-flow semantics, etc., and demonstrated an airbag system application as a case study. However, both languages are close to the implementation level. Their relatively slow simulators for system validation make them inefficient for complex applications such as systems-on-chip.
To fill the gap between the algorithmic domain and hardware design, many methods [63] [50] have been proposed. Matlab-VHDL [63] shows its potential in the validation of different system architectures, but it is unsuitable for mixed-signal design due to its lack of analog/mixed-signal support. SystemC-AMS [64], an extension of SystemC, combines powerful means to describe systems from abstract specifications down to RTL models and to model continuous-time and discrete-event systems.
Figure 4.40b illustrates the proposed design flow. Compared to the conventional design flow, Matlab/Simulink-HDL co-simulation is added at level 2. After system modeling, some design blocks are refined in Verilog-AMS. Co-simulation between Matlab and Verilog-AMS accelerates the verification of the block-level implementation. At level 3, the Verilog-AMS models of analog circuits can be replaced by Spectre netlists, if available. Multiple iterations between level 2 and level 3 are needed to fine-tune the Verilog-AMS models against the real circuits. Thus, Matlab/Simulink-Cadence co-simulation is required.
To verify the different algorithms, a Matlab/Simulink-Cadence co-simulation environment [17] is set up. The test patterns, i.e. PRBS for the LMS Tx EQ, pilot patterns for the pilot Tx EQ and an impulse pattern for the direct Tx EQ, are generated in Matlab, and Matlab and Cadence communicate through the coupler, as shown in Figure 4.41. The Tx EQ and Rx EQ are implemented in TSMC 65nm LP technology in Cadence, while eye diagrams, jitter measurements and bathtub curves are evaluated in Matlab (see Appendix A.2). The eye diagram is generated by overlaying the impulse responses of a data sequence once all the coefficients in the Tx and Rx EQs are determined. Jitter and bathtub curves are evaluated according to Appendix A.1.
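The overlay principle behind the eye diagram can be sketched as follows. The sketch is not the Matlab evaluation of Appendix A.2; it assumes that the equalized link is described by a sampled response to a single transmitted '1' (called `pulse` here), and all names and values are hypothetical.

```python
def eye_traces(bits, pulse, samples_per_ui):
    """Superpose one sampled response per transmitted '1' and cut the
    resulting waveform into traces of one unit interval (UI) each."""
    wave = [0.0] * (len(bits) * samples_per_ui + len(pulse))
    for k, bit in enumerate(bits):
        if bit:
            start = k * samples_per_ui
            for i, p in enumerate(pulse):
                wave[start + i] += p                 # linear superposition
    return [wave[k * samples_per_ui:(k + 1) * samples_per_ui]
            for k in range(len(bits))]


def vertical_opening(traces, sample_index, threshold):
    """Gap between the lowest 'high' and the highest 'low' trace at one sampling instant.

    Assumes that the traces contain at least one value on each side of the threshold.
    """
    highs = [t[sample_index] for t in traces if t[sample_index] > threshold]
    lows = [t[sample_index] for t in traces if t[sample_index] <= threshold]
    return min(highs) - max(lows)


if __name__ == "__main__":
    # Hypothetical response: main cursor 0.8 for one UI, post-cursor 0.2 in the next UI
    pulse = [0.8] * 4 + [0.2] * 4
    traces = eye_traces([1, 0, 1, 1, 0, 0], pulse, samples_per_ui=4)
    print(vertical_opening(traces, sample_index=2, threshold=0.5))
```

Plotting all traces on a common time axis of one UI gives the eye diagram; the residual post-cursor directly reduces the vertical opening.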
[Figure 4.41: Co-simulation of Tx and Rx EQs: test patterns and evaluation of eye diagram, jitter and bathtub in Matlab/Simulink; Tx EQ, Rx EQ and channel in Cadence, connected through the coupler]
Figure 4.42 shows the bias voltages during the coefficient adaptation in the gate-level simulation of the adaptive receiver EQs. The peaks on the voltage curves in Figure 4.42c and Figure 4.42d are due to the different delays of the digital components after synthesis. The voltage FFE_V4 has the highest value among the four coefficients, which implies that the fourth tap of the FFE is the main tap. Due to the slight difference in the equalizer summer circuit between the analog and the semi-digital Rx EQ, voltages such as FFE_V3 in Figure 4.42c decrease to zero volts and then settle at some negative value (the sign bit is not shown), while the voltages in the analog Rx EQ are always positive. In the implementation of the semi-digital adaptive Rx EQ, an 8-bit ADC and DAC are used, which implies a resolution of approximately 3.9 mV (Vdd = 1 V). With a fixed step size µ = 0.032, the equalizer is not able to detect errors lower than about 120 mV (= 3.9 mV/0.032), which in turn leads to the coefficients FFE_3, DFE_3 and DFE_4 being zero, although their voltages are negative.
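The two numbers quoted above follow from a short calculation. The second line assumes that a coefficient only changes when the update term, the step size times the error, reaches at least one LSB, with the data input normalized to 1.

```latex
\begin{aligned}
V_{LSB} &= \frac{V_{dd}}{2^{8}} = \frac{1\,\mathrm{V}}{256} \approx 3.9\,\mathrm{mV} \\
\mu\, e_{min} \ge V_{LSB} \;\Rightarrow\; e_{min} &= \frac{V_{LSB}}{\mu} = \frac{3.9\,\mathrm{mV}}{0.032} \approx 122\,\mathrm{mV}
\end{aligned}
```

This matches the value of roughly 120 mV given in the text. A higher converter resolution or a larger step size would lower this detection limit.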
[Figure 4.42: Bias voltages during the coefficient adaptation in the adaptive Rx EQs (voltage versus time in µs)]
Figure 4.43, Figure 4.44 and Figure 4.45 illustrate the eye diagrams, jitter and bathtub curves, respectively.
[Figure 4.43: Eye diagrams (amplitude in V versus t/T)]
[Figure 4.44: Jitter distributions (versus t/T)]
Due to the ISI, the eye is almost closed in Figure 4.43a. As a result, the deterministic jitter (DJ) and random jitter (RJ) are distributed over the entire region, as shown in Figure 4.44a. The green solid lines indicate the additional user-defined RJ distribution, whose center is marked with green dotted lines. The two green dotted lines move towards the eye center when additional user-defined DJ is present. Figure 4.45a shows the corresponding bathtub curves, which intersect at BER = 10^-12. In contrast to Figure 4.43a, the eye openings for both the analog and the semi-digital LMS Rx EQs in Figure 4.43b and Figure 4.43c are improved dramatically.
[Figure 4.45: Bathtub curves (BER versus t/T)]
Figure 4.44b, Figure 4.44c, Figure 4.45b and Figure 4.45c illustrate that the Rx EQs reduce the jitter and widen the bathtub curves. Compared with the analog Rx EQ, the semi-digital Rx EQ has a better eye diagram, less jitter and wider bathtub curves. However, it needs more time to converge, as shown in Table 4.8.
Figure 4.46 shows the bias voltages during the coefficient adaptation in the gate-level simulation of the LMS and pilot adaptive Tx EQs. The main difference between the two algorithms is that in the pilot algorithm a voltage is adjusted only when the previous voltage is stable, while the voltages in the LMS algorithm are updated simultaneously.
Figure 4.46: Adaptive bias voltages using pilot and LMS algorithms
Figure 4.47: Eye diagrams using different algorithms in the adaptive Tx EQs
Figure 4.47 shows the eye diagrams based on the three different adaptation algorithms and demonstrates that all of them improve the data quality dramatically compared with the eye diagram without equalizer. Among the three algorithms, the direct algorithm gives the most promising performance in terms of jitter and bathtub curves, shown in Figure 4.48 and Figure 4.49.
[Figure 4.48: Jitter using different algorithms in the adaptive Tx EQs]
Figure 4.49: Bathtub curves using different algorithms in the adaptive Tx EQs
Table 4.9 compares the three algorithms. The sequential calculation of floating coefficients in the direct algorithm leads to the largest digital area, and its complexity constrains the data rate (up to 1 GHz) after synthesis. The pilot algorithm occupies the smallest area due to its simplicity. However, it has the longest training time, which can be reduced by about 50% if a faster clock is used for the control unit. Regarding the coefficients, both the pilot and the direct algorithm are limited in canceling the ISI components caused by the pre-cursor of the channel.
4.6 Summary
In this chapter, a simplified system architecture for high-speed communication systems and a comparison of different adaptive receiver and transmitter equalizers are presented. In contrast to the conventional architecture using adaptive equalizers, the proposed architecture simplifies the design of the slave IC in terms of hardware overhead. To verify the proposed architecture, the receiver equalizers are designed with analog and semi-digital circuits, while the transmitter equalizers are implemented with semi-digital and full-digital circuits. Eye diagram, jitter, area and training time are used to evaluate the different implementations of receiver and transmitter equalizers. The results show that the SI is significantly improved with the proposed system architecture. Compared with the analog receiver equalizer, the semi-digital receiver equalizer shows a better eye diagram, although it needs more training time. Three different algorithms for the transmitter equalizers are also compared in this chapter. The direct algorithm takes the least training time, but it consumes the largest area and runs at a relatively low data rate. Among the three algorithms, LMS shows the best performance with reasonable area and acceptable training time at a high data rate. It differs from the other two algorithms in that it cancels both pre- and post-cursors.
Chapter 5
I/O Design
Contents
5.1 State-of-the-art
5.1.1 I/O Driver Structure
5.1.2 PVT Compensation Schemes
5.2 Proposed Hybrid Digital Impedance Calibration
5.2.1 Calibration Architecture
5.2.2 Hybrid Calibration Algorithm
5.3 Simulation Results
5.4 Summary
As the demand for high-volume data exchange increases in many fields such as multimedia and telecommunications, significant efforts have been devoted to improving chip I/O performance to achieve high bandwidth. However, an impedance mismatch between the I/O driver and the transmission channel causes signal reflections, which disturb the incoming data in the form of overshoot, undershoot or ringing. To ensure high SI at data rates beyond several gigabits per second, high-speed interfaces require minimal variation of the output voltage level and slew rate over PVT variations.
Figure 5.1 depicts the P-channel Field-Effect Transistor (PFET) impedance of a GDDR5 I/O interface in TSMC 65nm LP technology, which varies from 76.4Ω to 173.8Ω around the target impedance of 120Ω due to PVT changes without calibration. Each label is composed of a marker number, a temperature and an impedance value. Such a large impedance variation is not acceptable in the circuit. Therefore, impedance calibration becomes a key factor in high-speed interfaces such as GDDR5. By controlling the impedance of the off-chip driver (OCD) and the on-die terminator (ODT), signal reflections can be effectively absorbed and the SI is maintained.
In this chapter, different algorithms for digital impedance calibration are compared in terms of power and calibration time.
[Figure 5.1: PFET impedance versus temperature without calibration; recoverable marker labels: M4: -25.0, 76.43232; M6: 25.0, 87.1182; M8: 75.0, 97.45345]
5.1 State-of-the-art
pu_up_n
out
pu_down
The simple programmable I/O driver depicted in Figure 5.2 consists of an array of binary-weighted PMOS transistors and NMOS transistors. Impedance matching is achieved by enabling different combinations of transistors in the array. The first branch is a reference circuit for the nominal condition without any PVT variations, while the other branches serve the calibration. When driving high at the output, the impedance of the driver is equivalent to the resistance of the PMOS array. Similarly, when driving low at the output, the impedance of the driver is equivalent to the resistance of the NMOS array. The total impedance decreases as the number of "on" transistors increases. However, the simple programmable I/O driver is non-linear: the driver impedance changes with its output voltage, which may lead to matching issues at VDD/2.
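The relation between the enabled legs and the total impedance can be illustrated with a small Python sketch. It assumes, purely for illustration, that every leg behaves as an ideal linear resistor whose resistance is inversely proportional to its relative width; the names `driver_impedance`, `r_unit` and `r_ref` and all numeric values are hypothetical and not taken from the design.

```python
def driver_impedance(code, weights, r_unit, r_ref):
    """Impedance of the always-on reference leg in parallel with the enabled legs.

    Assumption: a leg of relative width w acts as an ideal resistor r_unit / w.
    """
    conductance = 1.0 / r_ref            # reference branch is always on
    for enabled, w in zip(code, weights):
        if enabled:
            conductance += w / r_unit    # every "on" leg adds conductance
    return 1.0 / conductance


weights = [4, 2, 1]                      # binary-weighted legs, MSB first (illustrative)
for code in ([0, 0, 0], [0, 0, 1], [1, 0, 0], [1, 1, 1]):
    z = driver_impedance(code, weights, r_unit=960.0, r_ref=160.0)
    print(code, round(z, 1))
```

In this idealized model the impedance falls monotonically as more legs are enabled; the voltage dependence of real transistors, which is the non-linearity discussed above, is deliberately not modeled.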
[Figure 5.3: Poly resistance I/O driver (signals: pu_up_n, out, pu_down)]
To flatten the output impedance curve, the current assist I/O driver shown in Figure 5.4 extends the simple programmable I/O driver with additional transistors. The current assist pull-up NMOS is always on when driving high at the output. Similarly, the current assist pull-down PMOS is always on when driving low at the output. Inserting this additional NMOS or PMOS improves the linearity of the I/O driver dramatically. As the number of current assist transistors increases, the output impedance approaches a constant.
As depicted in Figure 5.3, the poly resistance I/O driver architecture is an alternative option for flattening the output impedance curve: a poly resistor is inserted in each branch of the driver and compensates the non-linearity of the simple programmable I/O driver. The share of the poly resistor in the total output impedance determines the linearity of the driver. A larger share improves the linearity, but at the cost of a high output capacitance, which limits the speed of data transmission.
A combination of the poly resistor I/O driver architecture and the current assist I/O driver architecture is also a possible solution for ensuring the linearity of the I/O driver.
[Figure 5.4: Current assist I/O driver (signals: pu_up_n, pu_up, N_PU, out, pu_down, pu_down_n, P_PD)]
Figure 5.5 illustrates the conventional calibration logic. After the initial state, it is first checked whether the output voltage lies within the specified voltage range. When it is out of that range, the match signal switches from high to low. As soon as the match signal goes low, the calibration starts in the next clock cycle. The next step, indicated by the "reference check" box, is to check whether the output voltage of the I/O is above the upper bound or below the lower bound of the specified range. Depending on the result, the reference transistor is turned on or off. During the calibration, a decision is made for each branch whether to turn it on or off, until the match signal goes high.
[Figure 5.5: Conventional calibration logic (flowchart: INITIAL STATE, MATCH, REFERENCE CHECK, START CALIBRATION, MATCH)]
A review of PVT compensation circuits is given in [66]. Figure 5.6 shows the types of compensation circuits in an I/O driver, which can be either analog or digital. Analog impedance calibration adjusts the gate voltage of the compensating transistor until the drain voltage matches the reference voltage, which is equal to the voltage drop across a known external resistor. In digital impedance calibration, several transistors constituting the I/O driver are connected in parallel. Each transistor leg in the I/O driver can be turned on or off. As in the analog calibration, the drain voltage of the parallel transistors is compared with the reference voltage by a comparator. Depending on the comparator output, the on or off state of the current device is determined. Once the state of the last device is known, the total driver strength is given by all the "on" transistors.
[Figure 5.6: Types of I/O driver: uncompensated, or with a PVT compensation block that compensates the output resistance variation by analog compensation (current control) or digital compensation (legs)]
The primary difference between analog and digital impedance calibration is that in digital impedance calibration the transistors serving as I/O driver are fully turned on or off, while in the analog scheme the transistor is partially on. The gate voltage in analog impedance calibration depends on the current change due to PVT variation, which makes analog impedance calibration sensitive to noise. Compared to the analog scheme, digital impedance calibration is more immune to noise and suitable for chips with high I/O counts. Therefore, this work focuses on digital impedance calibration under PVT variations and explores it further.
The authors in [67] proposed two circuits to detect PVT variations: a DLL based and a ring oscillator based PVT compensation circuit. The DLL based circuit detects the PVT variations by generating the phase error between the input clock signal and the output of an I/O driver, while the ring oscillator based circuit converts the current variations into a clock signal by employing a current controlled oscillator (CCO). Compared to the DLL based circuit, the ring oscillator based circuit is able to detect even skewed corners with the help of a modified current reference circuit built from diode connected PMOS and NMOS transistors.
In their work, a linear search using equal-sized transistors as I/O driver is applied during the impedance calibration. The linear search starts from the center value of the total driver strengths. Based on the output of the comparator, the desired value is searched sequentially. In the linear search, the number of calibration transistors is high, which leads to the high number of comparators illustrated in Figure 5.7. Figure 5.8 illustrates the digital comparator in its digital logic.
[Figure 5.7: Comparator bank between VDD and gnd (labels: AFE, reference voltages VREF0 to VREF3, comparator outputs P0 to P3)]
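The sequential nature of the linear search can be sketched in Python as follows. This is only an illustrative model and not the circuit of [67]: `too_weak` is a hypothetical stand-in for the comparator output, and the strength codes are assumed to be plain integers.

```python
def linear_search(too_weak, n_levels):
    """Walk one strength step per clock cycle, starting at the center code.

    too_weak(code) models the comparator: True if the driver strength selected
    by `code` is still too low. The walk stops at the first code for which the
    comparator output flips, or at the end of the code range.
    """
    code = n_levels // 2                  # start from the center value
    step = 1 if too_weak(code) else -1    # search direction from first comparison
    cycles = 1                            # the comparison at the center code
    while 0 <= code + step < n_levels:
        code += step
        cycles += 1
        if too_weak(code) != (step == 1):  # comparator output flipped
            break
    return code, cycles


code, cycles = linear_search(lambda c: c < 50, n_levels=64)
print(code, cycles)
```

The number of cycles grows with the distance between the center value and the target, which is why the linear search is attractive only for corners that lie close to the nominal setting.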
[Figure 5.8: Digital comparator (labels: D[n:0], P[n:0], C[n:0], register Typ[n:0], CAL_CLOCK). If Typ[n:0] > P[n:0] then shl = 1; if Typ[n:0] < P[n:0] then shr = 1; if Typ[n:0] = P[n:0] then lock = 1]
However, both the DLL based and the ring oscillator based PVT compensation circuits compensate the PVT variations by comparing thermometer codes with a reference code for the nominal condition, and generating these thermometer codes with a high resolution requires a large number of comparators. The number of comparators increases exponentially with the number of calibration legs. Moreover, for different I/O driver circuits, the diode connected PMOS and NMOS transistors are not able to detect skewed corners.
D. Bhattacharya et al. [68] suggested a compensation circuit, shown in Figure 5.9, that includes a monitor circuit indicating the status of the pull-up or pull-down impedance over variations in PVT conditions, and a control circuit, shown in Figure 5.10, that generates the PVT bits. In the monitor circuit, a reference current over a precision resistor is fed into the duplicated monitor device. The resulting voltage over the impedance is then converted into digital control bits with given threshold voltages for making comparison decisions. The drawback of this circuit is its high power consumption and area penalty, caused by the large number of comparators in the ADC circuit when high resolution is required.
[Figure 5.9: Monitor circuit with reference current generator, external precision resistor, P-current mirror, N-current mirror, pull-up monitor device and pull-down monitor device (labels: Vref, Mref, Iref, MP1 to MP4, MN1 to MN3, I1, I2, Ipu, Ipd, R1, R2, PREF, NREF)]
[Figure 5.10: Control circuit: analog to digital converter and latches (labels: PREF, NREF, P Ref voltages, N Ref voltages, CP[n:0], CN[n:0])]
S. Mei [69] suggested a compensation circuit including a driver circuit, a comparator circuit and a binary searcher [70]. Figure 5.11 illustrates the binary search algorithm for four bits. The advantage of the binary search algorithm is its fast convergence. For example, the binary search takes at most six clock cycles to determine a particular strength for the target impedance among 64 driver strength values (6-bit). Its drawback compared to the linear search is the higher power consumption caused by switching each bit sequentially. Furthermore, the resulting switching noise may affect the SI, although the step size of the binary search halves in each clock cycle, which reduces the switching noise as the search proceeds.
[Figure 5.11: Binary search algorithm for four bits (strength values 1 to 15)]
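The convergence in a fixed number of cycles can be sketched in Python. The sketch is a generic successive-approximation search and not the circuit of [69]; `too_strong` is a hypothetical stand-in for the comparator output.

```python
def binary_search(too_strong, n_bits):
    """Resolve one bit per clock cycle, starting with the most significant bit.

    too_strong(code) models the comparator: True if the driver strength
    selected by `code` exceeds the target.
    """
    code = 0
    cycles = 0
    for bit in reversed(range(n_bits)):   # MSB first, step size halves each cycle
        trial = code | (1 << bit)
        cycles += 1
        if not too_strong(trial):         # keep the bit unless it overshoots
            code = trial
    return code, cycles


target = 37
print(binary_search(lambda c: c > target, n_bits=6))   # prints (37, 6)
```

For a 6-bit code the loop always runs six times, independent of the target, which matches the maximum of six clock cycles stated above.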
To overcome these disadvantages, a power-efficient calibration method with less complexity and less silicon area is needed.
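5.2 Proposed Hybrid Digital Impedance Calibration
[Figure 5.12: Proposed hybrid architecture for digital impedance calibration: mode, match, comparator (COMP_IN), calibration logic and controller blocks; PMOS driver, PMOS duplicate driver (120Ω) and NMOS duplicate driver; reference resistor; signals MODE, MATCH, SEL_LINE, CLOCK, OUTPUT VOLTAGE, PMOS_BITS[6:0] and NMOS_BITS[6:0]]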
Figure 5.12 depicts the proposed hybrid architecture for digital impedance calibration. Its main components are the mode, match, comparator and controller blocks. The match block provides the calibration block with the information about calibration start and stop. Its internal architecture is the window comparator architecture shown in Figure 5.13. The upper and lower limits set by the two comparators are VREF + 1% ∗ VREF (= Refup) and VREF − 1% ∗ VREF (= Reflow), respectively, where VREF is half of the power supply VDD. The comparator block provides the calibration block with the information about the status of the output voltage. The negative input of the comparator is set to VREF.
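The behavior of the match block can be sketched in Python as a behavioral model of a window comparator. The function name `window_comparator` and the supply value used in the example are assumptions for illustration only; the ±1% window around VREF = VDD/2 follows the text.

```python
def window_comparator(v_out, vdd, tol=0.01):
    """Behavioral model of the match block.

    Returns (match, above, below): match is True while v_out lies between
    Reflow = VREF - tol*VREF and Refup = VREF + tol*VREF, with VREF = VDD/2.
    """
    vref = vdd / 2.0
    ref_up = vref * (1.0 + tol)
    ref_low = vref * (1.0 - tol)
    above = v_out > ref_up      # output of the upper comparator
    below = v_out < ref_low     # output of the lower comparator
    return (not above and not below), above, below


# Illustrative supply of 1.5 V, i.e. VREF = 0.75 V and a window of 0.7425 V to 0.7575 V.
for v in (0.75, 0.77, 0.73):
    print(v, window_comparator(v, vdd=1.5))
```

A low match output would start the calibration, while the `above` and `below` flags indicate the direction in which the drive strength has to be corrected.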
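[Figure 5.13: Window comparator: two comparators with the limits VREF+5%*VREF and VREF-5%*VREF, inputs Net_P and Net_N selected by a MUX via SEL_LINE, output MODE]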
Similar to the match circuit, the mode block is composed of two comparators, but with different upper and lower limits at the inputs of the comparators. The upper and lower bounds are determined by the strength of the calibration transistors. The function of the mode block is to select between the conventional impedance calibration algorithm (binary search) and an alternative calibration algorithm. Compared to the previous work [67] [68], this architecture contains only five comparators, independent of the required resolution, which saves power and chip area.
As defined in [71], the calibration sequence starts with the PFET impedance, which consists of only PMOS transistors. The PFET impedance is calibrated against the external precision resistor. The voltage across the resistor is fed to the match circuit. If this voltage is within the range specified in the match circuit, no calibration is required. Once a PVT variation causes the voltage across the resistor to drift beyond the specified voltage range, the calibration process starts. At the beginning of the calibration, the strength control is first set to the center value by the control codes. During the calibration process, the drive strength is increased or decreased according to the comparator output by adjusting the calibration codes. The process lasts until the match signal goes high. When the calibration stops, the calibration codes are saved. After the PFET impedance calibration, the N-channel Field-Effect Transistor (NFET) impedance calibration starts. The controller enables the NFET calibration, disables the PFET and selects the corresponding output for each sub-block by using an analog multiplexer. The NFET calibration is performed against a calibrated PFET section. In case the target impedances of NFET and PFET are not equal, the widths of the PFET or NFET may be set differently.
Following the calibration sequence, the controller sequentially selects the PMOS, NMOS and terminator sections during the calibration process, indicated by its 2-bit output "SEL_LINE". When the match signal is high, the controller waits two clock cycles to allow the comparator outputs to settle before switching the "SEL_LINE" signal.
Figure 5.14 illustrates the proposed hybrid calibration logic. In contrast to the conventional calibration logic shown in Figure 5.5, the mode signal is checked as soon as the match signal goes low in order to decide whether the hybrid calibration is used.
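The decision between the two calibration paths can be sketched in Python. This is a behavioral illustration only: the function name `select_search` is hypothetical, and the mode window `mode_tol` of 5% is an assumed value, since the actual bounds of the mode block depend on the strength of the calibration transistors.

```python
def select_search(v_out, vref, match_tol=0.01, mode_tol=0.05):
    """Choose the calibration path from the driver output voltage.

    Assumptions: both windows are symmetric around vref; mode_tol is a
    hypothetical width of the mode window.
    """
    deviation = abs(v_out - vref)
    if deviation <= match_tol * vref:
        return "matched"        # match high: no calibration needed
    if deviation <= mode_tol * vref:
        return "hybrid"         # mode = 1: alternative (new) search
    return "binary"             # mode = 0: conventional binary search


for v in (0.752, 0.77, 0.85):
    print(v, select_search(v, vref=0.75))
```

Corners that drift only slightly outside the match window are thus handled by the shorter alternative search, while large deviations still use the full binary search.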
[Figure 5.14: Proposed hybrid calibration logic (flowchart: INITIAL STATE, MATCH, MODE with the branches 1 (new) and 0 (conventional), each followed by a MATCH check)]
In the proposed hybrid calibration, the poly resistor I/O architecture is used, as it reduces the effect of PVT variations and provides linearity in the I/O driver. To ensure good linearity, 75% of the total impedance is contributed by poly resistors and 25% by transistors. As defined in [71], the I/O cell for a GDDR5 memory consists of a 60Ω PFET and a 40Ω NFET in a pseudo open drain (POD) interface. For reasons of symmetry, the same specification is suitable for the I/O cell in the memory controller. In order to mitigate the effects of SSO noise on SI, two 120Ω PFET sections and two 80Ω NFET sections connected in parallel are implemented, which allows the slew rate to be controlled [72] and meets the specification for the I/O cell impedance as well.
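The impedance figures above can be checked with a short Python calculation. Applying the 75%/25% split to a single 120Ω section is an illustrative assumption here, and the helper name `parallel` is hypothetical.

```python
def parallel(*resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


pfet_cell = parallel(120.0, 120.0)   # two 120 Ohm PFET sections -> 60 Ohm
nfet_cell = parallel(80.0, 80.0)     # two 80 Ohm NFET sections  -> 40 Ohm

# Assumed split of one 120 Ohm section: 75% poly resistor, 25% transistor.
r_poly = 0.75 * 120.0
r_transistor = 0.25 * 120.0

print(round(pfet_cell, 6), round(nfet_cell, 6), r_poly, r_transistor)
```

Enabling only one of the two parallel sections would double the cell impedance, which is one way such a split can be used to shape the output edge.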
Figure 5.15 presents the PFET part of the I/O driver using the binary search algorithm. Similar to the programmable I/O driver, the first branch is a reference circuit for the case without PVT variations. The number of calibration bits depends on both the required resolution and the CMOS technology. The example shows six calibration bits for a GDDR5 interface in TSMC 65nm LP technology, since with five calibration bits not all corners under PVT variations can be calibrated. The transistors in the calibration branches are weighted 8W-4W-2W-1W-0.5W-0.25W while the resistors are weighted 1/4R-1/2R-R-2R-4R-8R, where W represents the width of a finger in each transistor and R is the unit poly resistor resistance in each branch. The strength of the branch decreases with the bit priority order: the most significant bit (MSB) has the highest strength while the least significant bit (LSB) has the lowest strength. It should be noted that the width of the calibrated PMOS driver in Figure 5.12 is increased by 1.5 times, using the same calibration code delivered by the PFET impedance calibration, to meet the target impedance.
[Figure 5.15: PFET part of the I/O driver, connected to VDD: reference branch Ref (x16W, bit [0]) and calibration branches x8W, x4W, x2W, x1W, x0.5W and x0.25W (bits [6] to [1])]
One approach to hybrid impedance calibration is to combine linear search and binary search. If the output voltage of the driver for a corner is within the voltage window specified in the mode circuit, the linear search is applied. For the linear search, three additional branches with a constant step size are used. Each branch consists of a transistor of width 0.5W connected in series with a resistance of 4R. The linear step size is set small enough to calibrate those corners close to the ±1%*VREF variations. The linear search starts from the center value and then traverses sequentially in the direction of the ±1%*VREF variations based on the result of the comparator. If, during the linear search, the PFET output voltage is smaller than Reflow or the NFET output voltage is larger than Refup, the reference transistor is turned on and all calibration transistors are turned off. In the other case, when the PFET output voltage is larger than Refup or the NFET output voltage is smaller than Reflow, the reference transistor is turned off and the three most significant bit transistors are turned on simultaneously to act as a reference. The calibration is then performed using the linear step transistors.
An alternative hybrid impedance calibration is to combine reduced binary (RB) search and binary search. The RB approach reuses the same calibration transistors. When the mode signal is high, the calibration logic uses only the last three calibration transistors to calibrate the impedance. If, during the RB search, the PFET output voltage is smaller than Reflow or the NFET output voltage is larger than Refup, the reference transistor is turned on and the three most significant bit transistors are turned off. The calibration is performed using the three least significant bit transistors. If the PFET output voltage is larger than Refup or the NFET output voltage is smaller than Reflow, the reference transistor is turned off. In this case, the three most significant bit calibration transistors are turned on simultaneously. These transistors act as a reference while the remaining three calibration transistors pull the voltage back into the range of the match circuit.
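The RB search can be sketched in Python under strongly simplifying assumptions: the drive strength is taken to be proportional to the total enabled transistor width in units of W, the series poly resistors are ignored, and the reference width of 16W is read from Figure 5.15. The names `rb_search`, `width_too_large` and `need_more_strength` are hypothetical.

```python
WEIGHTS = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25]   # calibration legs, MSB to LSB, in units of W
REF_WIDTH = 16.0                            # reference branch (assumed from Figure 5.15)


def rb_search(width_too_large, need_more_strength):
    """Reduced binary search over the three least significant legs only.

    need_more_strength=True : reference on, three MSB legs off.
    need_more_strength=False: reference off, three MSB legs on as reference.
    width_too_large(w) models the comparator for a total enabled width w.
    """
    width = REF_WIDTH if need_more_strength else sum(WEIGHTS[:3])
    cycles = 0
    for w in WEIGHTS[3:]:                   # 1W, 0.5W, 0.25W
        cycles += 1
        if not width_too_large(width + w):  # keep the leg unless it overshoots
            width += w
    return width, cycles


print(rb_search(lambda w: w > 17.25, need_more_strength=True))
print(rb_search(lambda w: w > 15.0, need_more_strength=False))
```

Only three decisions are needed in either case, compared with six for the full binary search over all calibration legs.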
5.3 Simulation Results
Table 5.1 shows the significant power gains when the reference transistor is on and the three most significant bit transistors are off. The hybrid linear algorithm consumes less power than the hybrid RB algorithm since the hybrid linear algorithm provides fewer strength levels.
Tables 5.2 and 5.3 show the statistics for PFET and NFET, respectively. In the TSMC CMOS 65nm LP technology, more corners need to be calibrated for NFET than for PFET because the PVT variation in NFET is more severe than that in PFET. Compared to the conventional binary search algorithm, the calibration time for PFET improves by 12% with hybrid linear and by 26% with hybrid RB. In the case of NFET, the calibration time is reduced by 14.3% and 21%, respectively. For both PFET and NFET impedance calibration, power is saved for more corners when using hybrid RB than when using hybrid linear, as hybrid RB provides more strength levels than the linear search.
[Figure 5.17: Calibrated NFET impedance, three panels with an impedance axis from 79 to 83. Recovered marker labels: M1: -25.0, 82.336; M3: 25.0, 82.1263; M6: 75.0, 82.1215; M9: 25.0, 81.7963; M10: -25.0, 81.6407; M12: 75.0, 81.4368; M14: 25.0, 82.3499; M15: -25.0, 81.0066; M18: 75.0, 81.9568]
Figures 5.16 and 5.17 illustrate the calibrated PFET impedance and NFET impedance for three operating temperatures under all process and voltage variations. The calibrated impedance for PFET is within +/-2% of 120Ω while the calibrated impedance for NFET is within +3.3% and -2.8% of 80Ω. The larger variation of the NFET impedance compared to the PFET impedance is due to the fact that it is calibrated against the calibrated PMOS section.
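The tolerance percentages translate into absolute impedance bands as in the following Python sketch. The helper name `tolerance_band` is hypothetical, and the 82.35Ω value is only an illustrative reading.

```python
def tolerance_band(target, plus_pct, minus_pct):
    """Return (lower, upper) impedance limits for an asymmetric tolerance in percent."""
    lower = target * (1.0 - minus_pct / 100.0)
    upper = target * (1.0 + plus_pct / 100.0)
    return lower, upper


pfet_low, pfet_high = tolerance_band(120.0, plus_pct=2.0, minus_pct=2.0)
nfet_low, nfet_high = tolerance_band(80.0, plus_pct=3.3, minus_pct=2.8)
print(round(pfet_low, 2), round(pfet_high, 2))   # 117.6 to 122.4 Ohm
print(round(nfet_low, 2), round(nfet_high, 2))   # 77.76 to 82.64 Ohm

# Illustrative check of one calibrated NFET reading against its band.
print(nfet_low <= 82.35 <= nfet_high)
```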
As to the required chip area, [48] requires 2^n comparators for an n-bit calibration code, i.e., the number of comparators increases exponentially with the number of calibration legs. In contrast, the proposed method needs only 5 comparators, independent of the number of calibration legs.
5.4 Summary
The implemented architecture requires fewer comparators for impedance calibration than the previous work, which saves a considerable amount of power and area. Furthermore, the output voltage of the driver is kept analog in the proposed calibration architecture, which avoids the inaccuracy of the digital domain.
In this chapter, two hybrid impedance calibration algorithms are proposed, which reduce both calibration time and power consumption compared to the conventional binary search algorithm. Furthermore, the hybrid scheme reduces the switching noise in the circuit and thus maintains the SI. Compared to the hybrid linear algorithm, the hybrid RB algorithm gives better timing performance. The hybrid linear algorithm requires extra calibration transistor branches, which increases the area. However, the power difference between the two hybrid algorithms is not significant.
Chapter 6
Contents
6.1 Summary
6.2 Directions for Future Work
This thesis has presented the design and optimization of a PHY link for high-speed low-power communication systems. The architectural design as well as the system-level and circuit-level verification of the primary components of the PHY link, i.e., CDR, EQs and I/O interface, have been described in the previous chapters. The investigation of those components aims to improve the performance of the entire system. This chapter summarizes the contributions of the work (Section 6.1) and outlines some aspects of potential future research (Section 6.2).
6.1 Summary
The contributions of this thesis can be summarized in the following points:
82 C HAPTER 6 S UMMARY AND O UTLOOK
• Design and analysis of a low-jitter CDR with low power consumption [12]: Phase interpolator based CDRs, i.e., the modified CDR, the adaptive CDR and the proposed PI2 CDR, are implemented for read and write training. The proposed CDR achieves a good compromise between power consumption and phase jitter. Furthermore, a mechanism to stop the CDR training using the 3TS scheme is also proposed and implemented.
• Further adaptive trainings, e.g. power management training via dynamic voltage and frequency scaling, can be integrated into the training scheme. Through these trainings, the consumed electrical power can be adapted to the performance currently required by the system, which maximizes the power efficiency for applications with various bandwidths.
\[ RJ(t) = \frac{1}{\sigma\sqrt{2\pi}} \cdot e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^{2}} \tag{A.1} \]
where µ is the center of the Gaussian distribution and σ is the standard deviation of the distribution. In the dual-Dirac model, the standard deviation has to be scaled by the multiplication factor αBER [55], which can be expressed by:
\[ \alpha_{BER} = 2\sqrt{2} \cdot \operatorname{erfc}^{-1}(2 \cdot BER) \tag{A.2} \]
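The multiplication factor of (A.2) can be evaluated numerically with the following Python sketch. It uses only the standard library and relies on the identity erfc^-1(2·BER) = −Φ^-1(BER)/√2, where Φ^-1 is the inverse standard normal CDF; the function name `alpha_ber` is hypothetical.

```python
from statistics import NormalDist


def alpha_ber(ber):
    """Multiplication factor 2*sqrt(2)*erfcinv(2*BER) of the dual-Dirac model.

    Uses erfcinv(2*BER) = -inv_cdf(BER) / sqrt(2), so the factor reduces to
    -2 * inv_cdf(BER) for the standard normal distribution.
    """
    return -2.0 * NormalDist().inv_cdf(ber)


for ber in (1e-9, 1e-12):
    print(ber, round(alpha_ber(ber), 3))
```

For a BER of 10^-12 the factor is approximately 14.07, i.e. the random jitter contributes about 14 standard deviations to the total jitter at that BER.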
A bathtub plot BER(x), as shown in Figure A.2, is derived from the jitter distribution. The eye opening t(BER) results from the separation of the left and right BER curves at a given BER, expressed by:
86 A PPENDIX A I MPROVEMENT OF S IGNAL I NTEGRITY BY A DAPTIVE E QUALIZER T RAININGS
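[Figure A.2: Bathtub plot BER(x) over one bit period from 0 to TB, with the left (L) and right (R) BER curves, DJ, BER axis marks at 10^-6, 10^-9 and 10^-12, and the eye opening t(10^-12) between XL(10^-12) and XR(10^-12)]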
fidRise = fopen(['/victim', '_impulse4', '.txt']);
[chData_impulse, t] = readFData(fidRise); %read channel coefficients
chData_impulse = dataResample(chData_impulse, t, fs);
%--------------------------------------------------------------------------
taps = [leq_c1, leq_c2, leq_c3, leq_c4, leq_c5, leg_c6, leg_c7, leg_c8]; %Tx EQ coefficients
%taps = [ffe_c1, ffe_c2, ffe_c3, ffe_c4, dfe_c1, dfe_c2, dfe_c3, dfe_c4]; %Rx EQ coefficients
A.2 MATLAB C ODE FOR S IGNAL E VALUATION IN A DAPTIVE T RAININGS 87
%--------------------------------------------------------------------------
% eye diagram
%--------------------------------------------------------------------------
order = 10;
nTruncBegin = 50;
nTruncEnd = 50;
prb = prbs(order, nStop);
prb = double(prb);
new_pattern = conv(prb*(Max-Min)+Min, taps, 'same'); % signal with EQ
signal3 = prbsResample(prb*(Max-Min)+Min, fd, fs); % resampled signal without EQ
signal4 = prbsResample(new_pattern, fd, fs); % resampled signal with EQ
signal3 = conv(signal3, chData_impulse)/osr; % signal after channel without EQ
signal4 = conv(signal4, chData_impulse)/osr; % signal after channel with EQ
eyeshift1 = -0.5;
eyeshift2 = -0.35;
eyeshift1 = addEyeshift(eyeshift1, getEyeshift(chData_impulse, osr));
eyeshift2 = addEyeshift(eyeshift2, getEyeshift(chData_impulse, osr));
eye1 = calc_eye(signal3, osr, eyeshift1*osr);
eye2 = calc_eye(signal4, osr, eyeshift2*osr);
eye1 = calc_trunc_eye(eye1, nTruncBegin, nTruncEnd);
eye2 = calc_trunc_eye(eye2, nTruncBegin, nTruncEnd);
figure
plot_eye(eye1);
title(['cadence Eyediagram without equalizer']);
ylim([0 1]);
figure
plot_eye(eye2);
title(['cadence Eyediagram equalizer']);
ylim([0 1]);
%--------------------------------------------------------------------------
% calculate and plot the bathtub
%--------------------------------------------------------------------------
try
    bathtub1 = calc_bathtub(eye1, eyeLevel1, BER, RJ, DJ);
catch
    warning(['Could not calculated bathtub-plot for ' ...
        'TP', num2str(i), '. Reason could be ' ...
        'no eye opening or a wrong shifted eye']);
end
figure
plot_bathtub(bathtub1);
title(['cadence Bathtub TP without EQ']);
try
    bathtub2 = calc_bathtub(eye2, eyeLevel2, BER, RJ, DJ);
catch
    warning(['Could not calculated bathtub-plot for ' ...
        'TP', num2str(i), '. Reason could be ' ...
        'no eye opening or a wrong shifted eye']);
end
figure
plot_bathtub(bathtub2);
title(['cadence Bathtub TP with EQ']);
%--------------------------------------------------------------------------
% calculate and plot the jitter
%--------------------------------------------------------------------------
try
    jitter1 = calc_jitter(eye1, eyeLevel1, BER, RJ, DJ);
catch
    warning(['Could not calculated jitter-plot for ' ...
        'TP', num2str(i), '. Reason could be ' ...
        'no eye opening or a wrong shifted eye']);
end
figure
plot_jitter(jitter1, RJ, DJ, BER);
title(['cadence Jitter TP without EQ']);
try
    jitter2 = calc_jitter(eye2, eyeLevel2, BER, RJ, DJ);
catch
    warning(['Could not calculated jitter-plot for ' ...
        'TP', num2str(i), '. Reason could be ' ...
        'no eye opening or a wrong shifted eye']);
end
figure
plot_jitter(jitter2, RJ, DJ, BER);
title(['cadence Jitter TP with EQ']);
if (Fs >= Fd)
    out = [false];
    out(:) = [];
    time = 0:1/Fs:(length(in))/Fd - 1/Fs;
    j = 1;
    for i = 1:length(time)
        if (((i-1)/Fs) >= (j/Fd))
            j = j+1;
        end
        out = [out, in(j)];
    end
else
    error('Fs is should be larger than Fd');
end
function eyeshift = getEyeshift(imp, osr)
function eyeshift = addEyeshift(shift1, shift2)
eyeshift = mod(shift1 + shift2 + 1, 2) - 1;
osr1 = round(osr/8);
osr2 = 2*osr1;
LL = length(signal);
shift_osr = -shift_osr;
if abs(shift_osr) > osr
    warning('eyeshift > +/- T is not allowed.');
else
    x = signal(round(shift_osr+1-osr1+3*osr/2):LL);
    num = round(shift_osr+1-osr11+3*osr/2);
    L = floor(LL/osr) - 3;
    A = zeros(osr, L);
    A(:) = x(1:L*osr);
    for i = 1:w2+1
        A = [A; [A(i, 2:L) x(L*osr+i)]];
    end
    t = [-osr/2:osr/2+osr2]' + shift_osr - osr1;
    t = t/osr;
end
Eye = [t, A];
end
if RJ < minRJ
    RJ = 0;
end
if RJ == 0 % comparison (the listing had "RJ = 0", which is not a valid condition)
    for i = 1:length(x)
        if x(i) <= muL || x(i) >= muR
            data(i) = 1;
        elseif (x(i) > muL && x(i-1) <= muL) ...
                || (x(i) < muR && x(i+1) >= muR)
            data(i) = 10^(-20);
        else
            data(i) = NaN;
        end
    end
else
    alpha = 2*sqrt(2)*erfcinv(2*BER);
    distrRJLeft = normpdf(x, (muL), RJ/alpha);
    distrRJRight = normpdf(x, (muR), RJ/alpha);
    distrRJ = distrRJRight + distrRJLeft;
    data = erf(distrRJ);
bathtub.xAxes = x;
bathtub.data = data;
bathtub.BER = BER;
level = eyeLevel;
A.3 V ERILOG C ODE FOR LMS E NGINE AND D IRECT T X EQ 91
alpha = 2*sqrt(2)*erfcinv(2*BER);
jitter.distribution.RJ = distrRJ';
jitter.distribution.DJ = crossVect;
jitter.distribution.mu = [muL, muR];
jitter.distribution.xAxes = time;
if RJ == 0 && DJ == 0
    legend('DJ');
elseif RJ == 0 && DJ ~= 0
    line([muL, muL], [0, 1], 'Color', 'b', 'Linestyle', '-.')
    line([muL+DJ/2, muL+DJ/2], [0, 1], 'Color', 'g', 'Linestyle', '-.')
    line([muR, muR], [0, 1], 'Color', 'b', 'Linestyle', '-.')
    line([muR-DJ/2, muR-DJ/2], [0, 1], 'Color', 'g', 'Linestyle', '-.')
    legend('DJ', 'mu', 'mu + user-defined DJ');
elseif RJ ~= 0 && DJ == 0
    plot(xAxes, RJDistr/max(RJDistr), 'g');
    line([muL, muL], [0, 1], 'Color', 'b', 'Linestyle', '-.')
    line([muR, muR], [0, 1], 'Color', 'b', 'Linestyle', '-.')
    legend('DJ', 'user-defined RJ', 'mu');
elseif RJ ~= 0 && DJ ~= 0
    plot(xAxes, RJDistr/max(RJDistr), 'g');
    line([muL, muL], [0, 1], 'Color', 'b', 'Linestyle', '-.')
    line([muL+DJ/2, muL+DJ/2], [0, 1], 'Color', 'g', 'Linestyle', '-.')
    line([muR, muR], [0, 1], 'Color', 'b', 'Linestyle', '-.')
    line([muR-DJ/2, muR-DJ/2], [0, 1], 'Color', 'g', 'Linestyle', '-.')
    legend('DJ', 'user-defined RJ', 'mu', 'mu + user-defined DJ');
end
title('Jitter');
xlabel('t/T');
grid on;
hold off;
end
A.3.1 Multiplier
input clk, rst;
input [9:0] in_a, in_b;
output [9:0] y_out;
always @(posedge clk)
begin
    if (rst == 1)
    begin
        tmp1 <= 1'b0;
        tmp2 <= 1'b0;
        tmp3 <= 1'b0;
    end
    else
    begin
        tmp1 <= in_a[8];
        tmp2 <= tmp1;
        tmp3 <= tmp2;
    end
end
A.3.2 Adder
module adder(out_abs, sign, out, in1, in2, keep, clk, rst);
input [9:0] in1, in2;
input clk, rst;
input keep;
output [9:0] out_abs, out;
output sign;
reg [9:0] out_abs, out;
reg sign;
reg [9:0] temp, temp1, temp2, temp3;
A.3.3 Divider
module pipelined_divi2(res, abs_res, sign, dividend, divisor, clk);
input [9:0] dividend;
input [9:0] divisor;
input clk;
output reg [9:0] res = 10'b0; // result of quotient
output reg [9:0] abs_res = 10'b0;
output reg sign;
reg [8:0] rem = 9'b0; // remainder
reg [17:0] res18, dd18;
reg [17:0] res17, dd17;
reg [17:0] res16, dd16;
reg [17:0] res15, dd15;
reg [17:0] res14, dd14;
reg [17:0] res13, dd13;
reg [17:0] res12, dd12;
reg [17:0] res11, dd11;
reg [17:0] res10, dd10;
reg [17:0] res9, dd9;
reg [17:0] res8, dd8;
reg [17:0] res7, dd7;
reg [17:0] res6, dd6;
reg [17:0] res5, dd5;
reg [17:0] res4, dd4;
reg [17:0] res3, dd3;
reg [17:0] res2, dd2;
reg [17:0] res1, dd1, dd0;
reg [8:0] rem18, dr18, dt18;
reg [8:0] rem17, dr17, dt17;
reg [8:0] rem16, dr16, dt16;
reg [8:0] rem15, dr15, dt15;
reg [8:0] rem14, dr14, dt14;
reg [8:0] rem13, dr13, dt13;
reg [8:0] rem12, dr12, dt12;
reg [8:0] rem11, dr11, dt11;
reg [8:0] rem10, dr10, dt10;
reg [8:0] rem9, dr9, dt9;
reg [8:0] rem8, dr8, dt8;
reg [8:0] rem7, dr7, dt7;
reg [8:0] rem6, dr6, dt6;
reg [8:0] rem5, dr5, dt5;
reg [8:0] rem4, dr4, dt4;
reg [8:0] rem3, dr3, dt3;
reg [8:0] rem2, dr2, dt2;
reg [8:0] rem1, dr1, dt1, dr0, dt0;
reg sign18, sign17, sign16, sign15, sign14,
    sign13, sign12, sign11, sign10, sign9,
    sign8, sign7, sign6, sign5, sign4,
    sign3, sign2, sign1, sign0;
always @(posedge clk) // 18
begin
    rem18 <= 4'b0;
    res18 <= 8'b0;
    dd18 <= {dividend[8:0], 9'b0};
    sign18 <= dividend[9] ^ divisor[9];
    dt18 <= divisor;
    if (divisor[8] == 1)
        dr18 <= divisor[8:0] >> 1;
    else
        dr18 <= divisor[8:0];
end
always @(posedge clk) // 17
begin
    rem17 <= ((rem18 - ((rem18 < dr18) ? 9'b0 : dr18)) << 1) + dd18[17];
    res17 <= (res18 << 1) + ((rem17 < dr17) ? 1'b0 : 1'b1);
    dd17 <= dd18;
    dr17 <= dr18;
    sign10 <= sign11;
end
always @(posedge clk) // 9
begin
    rem9 <= ((rem10 - ((rem10 < dr10) ? 9'b0 : dr10)) << 1) + dd10[9];
    res9 <= (res10 << 1) + ((rem9 < dr9) ? 1'b0 : 1'b1);
    dd9 <= dd10;
    dr9 <= dr10;
    dt9 <= dt10;
    sign9 <= sign10;
end
always @(posedge clk) // 8
begin
    rem8 <= ((rem9 - ((rem9 < dr9) ? 9'b0 : dr9)) << 1) + dd9[8];
    res8 <= (res9 << 1) + ((rem8 < dr8) ? 1'b0 : 1'b1);
    dd8 <= dd9;
    dr8 <= dr9;
    dt8 <= dt9;
    sign8 <= sign9;
end
always @(posedge clk) // 7
begin
    rem7 <= ((rem8 - ((rem8 < dr8) ? 9'b0 : dr8)) << 1) + dd8[7];
    res7 <= (res8 << 1) + ((rem7 < dr7) ? 1'b0 : 1'b1);
    dd7 <= dd8;
    dr7 <= dr8;
    dt7 <= dt8;
    sign7 <= sign8;
end
always @(posedge clk) // 6
begin
    rem6 <= ((rem7 - ((rem7 < dr7) ? 9'b0 : dr7)) << 1) + dd7[6];
    res6 <= (res7 << 1) + ((rem6 < dr6) ? 1'b0 : 1'b1);
    dd6 <= dd7;
    dr6 <= dr7;
    dt6 <= dt7;
    sign6 <= sign7;
end
always @(posedge clk) // 5
begin
    rem5 <= ((rem6 - ((rem6 < dr6) ? 9'b0 : dr6)) << 1) + dd6[5];
    res5 <= (res6 << 1) + ((rem5 < dr5) ? 1'b0 : 1'b1);
    dd5 <= dd6;
    dr5 <= dr6;
    dt5 <= dt6;
    sign5 <= sign6;
end
always @(posedge clk) // 4
begin
    rem4 <= ((rem5 - ((rem5 < dr5) ? 9'b0 : dr5)) << 1) + dd5[4];
    res4 <= (res5 << 1) + ((rem4 < dr4) ? 1'b0 : 1'b1);
    dd4 <= dd5;
    dr4 <= dr5;
    dt4 <= dt5;
    sign4 <= sign5;
end
always @(posedge clk) // 3
begin
    rem3 <= ((rem4 - ((rem4 < dr4) ? 9'b0 : dr4)) << 1) + dd4[3];
    res3 <= (res4 << 1) + ((rem3 < dr3) ? 1'b0 : 1'b1);
    dd3 <= dd4;
    dr3 <= dr4;
    dt3 <= dt4;
    sign3 <= sign4;
end
always @(posedge clk) // 2
begin
    rem2 <= ((rem3 - ((rem3 < dr3) ? 9'b0 : dr3)) << 1) + dd3[2];
    res2 <= (res3 << 1) + ((rem2 < dr2) ? 1'b0 : 1'b1);
    dd2 <= dd3;
    dr2 <= dr3;
    dt2 <= dt3;
    sign2 <= sign3;
end
always @(posedge clk) // 1
begin
    rem1 <= ((rem2 - ((rem2 < dr2) ? 9'b0 : dr2)) << 1) + dd2[1];
    res1 <= (res2 << 1) + ((rem1 < dr1) ? 1'b0 : 1'b1);
    dd1 <= dd2;
    dr1 <= dr2;
    dt1 <= dt2;
    sign1 <= sign2;
end
always @(posedge clk)
begin
    rem <= rem1 - ((rem1 < dr1) ? 9'b0 : dr1);
    dd0 <= dd1;
    dr0 <= dr1;
    dt0 <= dt1;
    sign0 <= sign1;
    if ((dt0[8] == 1) && (!dd0[17] == 1))
    begin
        res <= {sign0, res1[8:0] >> 1};
        sign <= sign0;
        abs_res <= {1'b0, res1[8:0] >> 1};
    end
    else if ((dt0[8] == 1) && (dd0[17] == 1))
    begin
        res <= {sign0, res1[9:1]};
        sign <= sign0;
        abs_res <= {1'b0, res1[9:1]};
    end
    else
    begin
        res <= {sign0, res1[8:0]};
        sign <= sign0;
        abs_res <= {1'b0, res1[8:0]};
    end
end
endmodule
References
99
100 R EFERENCES
[17] A. Jaiswal, Y. Fang, K. Hofmann, and P. Gregorius. “An efficient methodology for
mixed-signal high-speed memory design using Matlab/Simulink-HDL co-simula-
tion”. In: Solid-State and Integrated Circuit Technology (ICSICT), 2012 IEEE 11th In-
ternational Conference on (Oct. 2012), pp. 1–3.
[18] Y. Fang, U. Muhammad, A. Jaiswal, and K. Hofmann. “Low-power design of hy-
brid digital impedance calibration for process, voltage, temperature compensa-
tions”. In: Electronics, Circuits, and Systems (ICECS), 2013 IEEE 20th International
Conference on (Dec. 2013), pp. 37–40.
[19] Intel® Z87 Chipset Platform Diagram. http://www.intel.com/content/www/
us / en / chipsets / performance - chipsets / z87 - chipset - diagram .
html.
[20] http://www.pcisig.com.
[21] http://www.hypertransport.org.
[22] HyperTranspor™ I/O Link Specification. http : / / www . hypertransport . org /
default.cfm?page=HyperTransportSpecifications31.
[23] http://www.jedec.org.
[24] H. Yamaguchi, H. Tamura, Y. Doi, Y. Tomita, T. Hamada, M. Kibune, S. Ohmoto,
K. Tateishi, O. Tyshchenko, A. Sheikholeslami, T. Higuchi, J. Ogawa, T. Saito, H.
Ishida, and K. Gotoh. “A 5Gb/s transceiver with an ADC-based feedforward CDR
and CMA adaptive equalizer in 65nm CMOS”. In: Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), 2010 IEEE International (Feb. 2010), pp. 168–169.
ISSN: 0193-6530.
[28] M.S. Chen, Y.N. Shih, C.L. Lin, H.W. Hung, and J. Lee. “A Fully-Integrated 40-
Gb/s Transceiver in 65-nm CMOS Technology”. In: Solid-State Circuits, IEEE Jour-
nal of (Mar. 2012), Vol. 47, No. 3, 627–640, ISSN: 0018-9200.
[29] M.T. Hsieh and G. Sobelman. “Architectures for multi-gigabit wire-linked clock
and data recovery”. In: Circuits and Systems Magazine, IEEE (Apr. 2008), Vol. 8, No.
4, 45–57, ISSN: 1531-636X.
[30] M.H. Perrott, Y.T. Huang, R.T. Baird, B.W. Garlepp, D. Pastorello, E.T. King, Q.C.
Yu, D.B. Kasha, P. Steiner, L.G. Zhang, J. Hein, and B. Del Signore. “A 2.5-Gb/s
Multi-Rate 0.25-µm CMOS Clock and Data Recovery Circuit Utilizing a Hybrid
Analog/Digital Loop Filter and All-Digital Referenceless Frequency Acquisition”.
In: Solid-State Circuits, IEEE Journal of (Dec. 2006), Vol. 41, No. 12, 2930–2944, ISSN:
0018-9200.
[31] H.H. Chang, R.J. Yang, and S.I. Liu. “Low jitter and multirate clock and data re-
covery circuit using a MSADLL for chip-to-chip interconnection”. In: Circuits and
Systems I: Regular Papers, IEEE Transactions on (Dec. 2004), Vol. 51, No. 12, 2356–
2364, ISSN: 1549-8328.
[32] R. Kreienkamp, U. Langmann, C. Zimmermann, T. Aoyama, and H. Siedhoff. “A
10-Gb/s CMOS clock and data recovery circuit with an analog phase interpolator”.
In: Solid-State Circuits, IEEE Journal of (Mar. 2005), Vol. 40, No. 3, 736–743, ISSN:
0018-9200.
[33] K. Maruko, T. Sugioka, H. Hayashi, Zhou Z.W., Y. Tsukuda, Y. Yagishita, H. Kon-
ishi, T. Ogata, H. Owa, T. Niki, K. Konda, M. Sato, H. Shiroshita, T. Ogura, T. Aoki,
H. Kihara, and S. Tanaka. “A 1.296-to-5.184Gb/s Transceiver with 2.4mW/(Gb/s)
Burst-mode CDR using Dual-Edge Injection-Locked Oscillator”. In: Solid-State Cir-
cuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International (Feb.
2010), pp. 364–365. ISSN: 0193-6530.
[34] F. Spagna, L.D. Chen, M. Deshpande, Y.P. Fan, D. Gambetta, S. Gowder, S. Iyer, R.
Kumar, P. Kwok, R. Krishnamurthy, C.C. Lin, R. Mohanavelu, R. Nicholson, J. Ou,
M. Pasquarella, K. Prasad, H. Rustam, L. Tong, A. Tran, J. Wu, and X.G. Zhang. “A
78mW 11.8Gb/s serial link transceiver with adaptive RX equalization and baud-
rate CDR in 32nm CMOS”. In: Solid-State Circuits Conference Digest of Technical Pa-
pers (ISSCC), 2010 IEEE International (Feb. 2010), pp. 366–367. ISSN: 0193-6530.
[35] Z. Gao, Yu. H., P. Chiang, Y. Yang, and F. Zhang. “A 10Gb/s wire-line transceiver
with half rate period calibration CDR”. In: Circuits and Systems, 2009. ISCAS 2009.
IEEE International Symposium on (May 2009), pp. 1827–1830.
[36] K. Fukuda, H. Yamashita, G. Ono, R. Nemoto, E. Suzuki, N. Masuda, T. Takemoto,
F. Yuki, and T. Saito. “A 12.3-mW 12.5-Gb/s Complete Transceiver in 65-nm CMOS
Process”. In: Solid-State Circuits, IEEE Journal of (Dec. 2010), Vol. 45, No. 12, 2838–
2849, ISSN: 0018-9200.
R EFERENCES 103
[37] S. Sidiropoulos and M.A. Horowitz. “A semidigital dual delay-locked loop”. In:
Solid-State Circuits, IEEE Journal of (Nov. 1997), Vol. 32, No. 11, 1683–1692, ISSN:
0018-9200.
[38] J. Park, J.F. Liu, L.R. Carley, and C.P. Yue. “A 1-V, 1.4-2.5 GHz Charge-Pump-Less
PLL for a Phase Interpolator Based CDR”. In: Custom Integrated Circuits Conference,
2007. CICC ’07. IEEE (Sept. 2007), pp. 281–284.
[39] S.J. Hu, C. Jia, K. Huang, C. Zhang, X.Q. Zheng, and Z.H. Wang. “A 10Gbps CDR
based on phase interpolator for source synchronous receiver in 65nm CMOS”. In:
Circuits and Systems (ISCAS), 2012 IEEE International Symposium on (May 2012),
pp. 309–312. ISSN: 0271-4302.
[40] K. Desai, R. Nagulapalli, V. Krishna, R. Palwai, P. Kumar Venkatesan, and V. Khaw-
she. “High Speed Clock and Data Recovery Circuit with Novel Jitter Reduction
Technique”. In: VLSI Design, 2010. VLSID ’10. 23rd International Conference on (Jan.
2010), pp. 300–305. ISSN: 1063-9667.
[41] M.Y. He and J. Poulton. “A CMOS mixed-signal clock and data recovery circuit for
OIF CEI-6G+ backplane transceiver”. In: Solid-State Circuits, IEEE Journal of (Mar.
2006), Vol. 41, No. 3, 597–606, ISSN: 0018-9200.
[42] P. Gregorius, ed. Verfahren zur Symbolsynchronisation in der schnellen seriellen Dat-
enuebertragung. Shaker, 2007.
[43] S. Kumaki, A.H. Johari, T. Matsubara, I. Hayashi, and H. Ishikuro. “A 0.5V 6-bit
scalable phase interpolator”. In: Circuits and Systems (APCCAS), 2010 IEEE Asia
Pacific Conference on (Dec. 2010), pp. 1019–1022.
[44] Y. Tomita, M. Kibune, J. Ogawa, W.W. Walker, H. Tamura, and T. Kuroda. “A 10-
Gb/s receiver with series equalizer and on-chip ISI monitor in 0.11-µm CMOS”.
In: Solid-State Circuits, IEEE Journal of (Apr. 2005), Vol. 40, No. 4, 986–993, ISSN:
0018-9200.
[45] X.F. Lin, J. Liu, H. Lee, and H. Liu. “A 2.5- to 3.5-Gb/s Adaptive FIR Equalizer With
Continuous-Time Wide-Bandwidth Delay Line in 0.25-µm CMOS”. In: Solid-State
Circuits, IEEE Journal of (Aug. 2006), Vol. 41, No. 8, 1908–1918, ISSN: 0018-9200.
[46] Y. Hidaka, W.X. Gai, H. Takeshi, J.H. Jiang, Y. Koyanagi, and H. Osone. “A 4-
Channel 1.25-10.3 Gb/s Backplane Transceiver Macro With 35 dB Equalizer and
Sign-Based Zero-Forcing Adaptive Control”. In: Solid-State Circuits, IEEE Journal of
(Dec. 2009), Vol. 44, No. 12, 3547–3559, ISSN: 0018-9200.
[47] J. Zerbe, F. Chen, A. Ho, R. Farjad-Rad, J. Poulton, K. Donnelly, and B. Leibowitz.
“High-speed signaling systems with adaptable pre-emphasis and equalization”.
In: (Aug. 2006). US Patent App. 11/336,045.
104 R EFERENCES
[48] J.L. Zerbe, C.W. Werner, V. Stojanovic, F. Chen, J. Wei, G. Tsang, D. Kim, W.F.
Stonecypher, A. Ho, T.P. Thrush, R.T. Kollipara, M.A. Horowitz, and K.S. Don-
nelly. “Equalization and clock recovery for a 2.5-10-Gb/s 2-PAM/4-PAM back-
plane transceiver cell”. In: Solid-State Circuits, IEEE Journal of (Dec. 2003), Vol. 38,
No. 12, 2121–2130, ISSN: 0018-9200.
[49] A. Amirkhany, J. Wei, N.K. Mishra, J. Shen, W.T. Beyene, C. Chen, T. J. Chin, D.
Dressler, C. Huang, V.P. Gadde, M. Hekmat, K. Kaviani, H. Lan, P. Le, Mahabalesh-
wara, C. Madden, S. Mukherjee, L. Raghavan, K. Saito, D. Secker, A. Sendhil, R.
Schmitt, S. Fazeel, G.S. Srinivas, T. Wu, C. Tran, A. Vaidyanath, K. Vyas, L. Yang,
M. Jain, K.Y.K. Chang, and X.C. Yuan. “A 12.8-Gb/s/link Tri-Modal Single-Ended
Memory Interface”. In: Solid-State Circuits, IEEE Journal of (Apr. 2012), Vol. 47, No.
4, 911–925, ISSN: 0018-9200.
[50] M. Lin and K.T. Cheng. “Testable Design for Adaptive Linear Equalizer in High-
Speed Serial Links”. In: Test Conference, 2006. ITC ’06. IEEE International (Oct. 2006),
pp. 1–10. ISSN: 1089-3539.
[51] J. Sonntag, J. Stonick, J. Gorecki, B. Beale, B. Check, X.M. Gong, J. Guiliano, K. Lee,
B. Lefferts, D. Martin, U.K. Moon, A. Sengir, S. Titus, G.Y. Wei, D. Weinlader, and
Y.H. Yang. “An adaptive PAM-4 5 Gb/s backplane transceiver in 0.25 µm CMOS”.
In: Custom Integrated Circuits Conference, 2002. Proceedings of the IEEE 2002 (2002),
pp. 363–366.
[52] D. Tonietto, J. Hogeboon, E. Bensoudane, S. Sadeghi, H. Khor, and P. Krotnev. “A
7.5Gb/s transmitter with self-adaptive FIR”. In: VLSI Circuits, 2008 IEEE Sympo-
sium on (June 2008), pp. 198–199.
[53] M.Q. Le, P.J. Hurst, and K.C. Dyer. “An analog DFE for disk drives using a mixed-
signal integrator”. In: VLSI Circuits, 1998. Digest of Technical Papers. 1998 Symposium
on (June 1998), pp. 156–157.
[54] J.E.C. Brown, P.J. Hurst, I. Agi, and L. Der. “Analog decision-feedback equalizer
architectures”. In: VLSI Signal Processing, VII, 1994., [Workshop on] (1994), pp. 266–
275.
[55] G. Han and E. Sanchez-Sinencio. “CMOS transconductance multipliers: a tuto-
rial”. In: Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transac-
tions on (Dec. 1998), Vol. 45, No. 12, 1550–1563, ISSN: 1057-7130.
[56] M. Kargar and M.M. Green. “A 10 Gb/s adaptive analog decision feedback equal-
izer for multimode fiber dispersion compensation in 0.13 µm CMOS”. In: ESS-
CIRC, 2010 Proceedings of the (Sept. 2010), pp. 550–553. ISSN: 1930-8833.
[57] C. Angerer, R. Langwieser, and M. Rupp. “Evaluation and Exploration of RFID
Systems by Rapid Prototyping”. In: Personal Ubiquitous Comput. (Mar. 2012), Vol.
16, No. 3, 309–321, ISSN: 1617-4909.
R EFERENCES 105
[70] J. Koo, G.S. Kim, J. Song, K.W. Kim, Y.J. Choi, and C. Kim. “Small-area high-
accuracy ODT/OCD by calibration of global on-chip for 512M GDDR5 applica-
tion”. In: Custom Integrated Circuits Conference, 2009. CICC ’09. IEEE (Sept. 2009),
pp. 717–720.
[71] POD15, 1.5V Pseudo Open Drain I/O. www.jedec.org/standards-documents/
docs/jesd-8-20. Accessed: 2013-09-30.
[72] R. Senthinathan and J.L. Prince. “Application specific CMOS output driver circuit
design techniques to reduce simultaneous switching noise”. In: Solid-State Circuits,
IEEE Journal of (Dec. 1993), Vol. 28, No. 12, 1383–1388, ISSN: 0018-9200.
[73] Agilent Technologies. Jitter Analysis: The dual-Dirac Model, RJ/DJ, and Q-Scale. Whi-
te Paper, 2004.
[74] W. El-Reedy, A.A. El-Moursy, and A.H. Fahmy. “High Performance Memory Re-
quests Scheduling Technique for Multicore Processors”. In: High Performance Com-
puting and Communication 2012 IEEE 9th International Conference on Embedded Soft-
ware and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on (June
2012), pp. 127–134.
[75] N. Rafique, W.T. Lim, and M. Thottethodi. “Effective Management of DRAM Band-
width in Multicore Processors”. In: Parallel Architecture and Compilation Techniques,
2007. PACT 2007. 16th International Conference on (Sept. 2007), pp. 245–258. ISSN:
1089-795X.
[76] B.L. Jacob, P.M. Chen, S.R. Silverman, and T.N. Mudge. “An analytical model for
designing memory hierarchies”. In: Computers, IEEE Transactions on (Oct. 1996),
Vol. 45, No. 10, 1180–1194, ISSN: 0018-9340.
[77] A.C. Hsieh, T.T. Hwang, M.T. Chang, M.H. Tsai, C.M. Tseng, and H.C. Li. “TSV
redundancy: Architecture and design issues in 3D IC”. In: Design, Automation Test
in Europe Conference Exhibition (DATE), 2010 (Mar. 2010), pp. 166–171. ISSN: 1530-
1591.
[78] E.J. Marinissen and Y. Zorian. “Testing 3D chips containing through-silicon vias”.
In: Test Conference, 2009. ITC 2009. International (Nov. 2009), pp. 1–11.
[79] H.H.S. Lee and K. Chakrabarty. “Test Challenges for 3D Integrated Circuits”. In:
Design Test of Computers, IEEE (Sept. 2009), Vol. 26, No. 5, 26–35, ISSN: 0740-7475.
[80] S.H. Pan, N. Chang, and T. Hitomi. “3D-IC dynamic thermal analysis with hierar-
chical and configurable chip thermal model”. In: 3D Systems Integration Conference
(3DIC), 2013 IEEE International (Oct. 2013), pp. 1–8.
[81] Y.S. Huang, Y.H. Liu, and J.D. Huang. “Layer-Aware Design Partitioning for Ver-
tical Interconnect Minimization”. In: VLSI (ISVLSI), 2011 IEEE Computer Society
Annual Symposium on (July 2011), pp. 144–149. ISSN: 2159-3469.
R EFERENCES 107
[82] H.L. Chang, H.C. Lai, T.Y. Hsueh, W.K. Cheng, and M.C. Chi. “A 3D IC designs
partitioning algorithm with power consideration”. In: Quality Electronic Design
(ISQED), 2012 13th International Symposium on (Mar. 2012), pp. 137–142. ISSN: 1948-
3287.
[83] S. Mukherjee, O. Dan, A. Vaidyanath, D. Dressler, and A. Sendhil. “Challenges in
extending single-ended graphics memory data rates”. In: Electrical Performance of
Electronic Packaging and Systems (EPEPS), 2012 IEEE 21st Conference on (Oct. 2012),
pp. 39–42.
[84] O. Dan, S. Chang, C. Madden, J.H. Kim, R. Schmitt, M. Li, C. Ware, B. Leibowitz,
Y. Frans, and N. Nguyen. “Design and characterization of a 12.8GB/s low power
differential memory system for mobile applications”. In: Electrical Performance of
Electronic Packaging and Systems, 2009. EPEPS ’09. IEEE 18th Conference on (Oct.
2009), pp. 33–36.
[85] K. Yoo and G. GHan. “An adaptation method for FIR pre-emphasis filter on back-
plane channel”. In: Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE
International Symposium on (May 2006), 4 pp.–.
[86] B. Deutschmann and T. Ostermann. “CMOS output drivers with reduced ground
bounce and electromagnetic emission”. In: Solid-State Circuits Conference, 2003. ES-
SCIRC ’03. Proceedings of the 29th European (Sept. 2003), pp. 537–540.
[87] L. Yang and J.S. Yuan. “Output buffer design for low noise and load adaptability”.
In: Circuits, Devices and Systems, IEE Proceedings - (Apr. 2005), Vol. 152, No. 2, 146–
150, ISSN: 1350-2409.
[88] R.J. Baker. CMOS Circuit Design, Layout, and Simulation. 3rd. Wiley-IEEE Press,
2010. ISBN: 0470881321, 9780470881323.
[89] P.E. Allen and D.R. Holberg. CMOS Analog Circuit Design. Oxford series in electri-
cal and computer engineering. Oxford University Press, 2002. ISBN: 9780195116441.
[90] S. Williams, H. Thompson, M. Hufford, and E. Naviasky. “An improved CMOS
ring oscillator PLL with less than 4ps RMS accumulated jitter”. In: Custom Inte-
grated Circuits Conference, 2004. Proceedings of the IEEE 2004 (Oct. 2004), pp. 151–
154.
[91] P. Raha. “A 0.6-4.2V low-power configurable PLL architecture for 6 GHz-300 MHz
applications in a 90 nm CMOS process”. In: VLSI Circuits, 2004. Digest of Technical
Papers. 2004 Symposium on (June 2004), pp. 232–235.
[92] S.D. Brown and Z.G. Vranesic. Fundamentals of digital logic with VHDL design. McGr
aw-Hill, 2000. ISBN: 9780070125919.
[93] A. Jaiswal, Y. Fang, and K. Hofmann. “A Wide Range Programmable Duty Cycle
Corrector”. In: Patent, filed on 10 Sep. (2012).
[94] A. Jaiswal, Y. Fang, and K. Hofmann. “Method and Device for Correcting a Phase
Shift In a Time Synchronised System”. In: Patent, filed on 5th 29 Jun. (2012).
Invention Disclosures
[16] Y. Fang, A. Jaiswal, and K. Hofmann. “Method and Means for Improving the Data
Transfer Integrity In a Time Synchronised System”. In: Patent, filed on 10th Sep.
(2012).
[93] A. Jaiswal, Y. Fang, and K. Hofmann. “A Wide Range Programmable Duty Cycle
Corrector”. In: Patent, filed on 10 Sep. (2012).
[94] A. Jaiswal, Y. Fang, and K. Hofmann. “Method and Device for Correcting a Phase
Shift In a Time Synchronised System”. In: Patent, filed on 5th 29 Jun. (2012).
List of Own Publications
[12] Y. Fang, J. Bargon, A. Jaiswal, and K. Hofmann. “A Low-Jitter Clock and Data Recovery
Using Proportional-Integral Controller for GDDR5 Interface Trainings”. In:
Integrated Circuit Design and Technology, 2014. ICICDT ’14. International Conference
on (May 2014).
[13] Y. Fang, L. Chen, A. Jaiswal, K. Hofmann, and P. Gregorius. “Adaptive Equalizer
Training for High-Speed Low-Power Communication Systems”. In: Digital System
Design (DSD), 2013 Euromicro Conference on (Sept. 2013), pp. 745–751.
[14] Y. Fang, A. Jaiswal, and K. Hofmann. “Low-power signal integrity trainings for
multi-clock source-synchronous memory systems”. In: SOC Conference (SOCC),
2013 IEEE 26th International (Sept. 2013), pp. 319–324. ISSN: 2164-1676.
[15] Y. Fang, L. Chen, A. Jaiswal, and K. Hofmann. “Transmitter equalizer training
based on pilot signal and peak detection”. In: Electronics, Circuits, and Systems
(ICECS), 2013 IEEE 20th International Conference on (Dec. 2013), pp. 377–380.
[17] A. Jaiswal, Y. Fang, K. Hofmann, and P. Gregorius. “An efficient methodology for
mixed-signal high-speed memory design using Matlab/Simulink-HDL co-simulation”.
In: Solid-State and Integrated Circuit Technology (ICSICT), 2012 IEEE 11th International
Conference on (Oct. 2012), pp. 1–3.
[18] Y. Fang, U. Muhammad, A. Jaiswal, and K. Hofmann. “Low-power design of hybrid
digital impedance calibration for process, voltage, temperature compensations”.
In: Electronics, Circuits, and Systems (ICECS), 2013 IEEE 20th International
Conference on (Dec. 2013), pp. 37–40.
List of Unrelated Publications
Supervised Theses
[97] D. Noll. “A Survey of the High-Speed Chip-to-Chip I/O Circuits Evolution”. In:
Pro-seminar (2011).
[98] T. Casper. “Design of a PLL for chip-to-chip I/O Circuits”. In: Pro-seminar (2011).
[99] X. Ding. “System-level modeling of a phase interpolator based clock and data
recovery using Simulink”. In: Pro-seminar (2012).
[100] K. Nawaz. “High speed duty cycle corrector”. In: Project Seminar (2012).
[101] M.A. Mosabbah. “Level shifters for high speed system”. In: Project Seminar (2012).
[102] M. D. Serrano. “Transceivers for high-speed IO”. In: Project Seminar (2012).
[103] L. Chen. “Equalization circuits for noise cancellation”. In: Master Thesis (2013).
[104] D. Gadler. “Phase interpolator-based PLL”. In: Master Thesis (2013).
[105] U. Muhammad. “Digital impedance calibration for PVT compensation”. In: Master
Thesis (2013).
[106] D. Walk. “Common knowledge of serializer-deserializer”. In: Pro-seminar (2013).
[107] N. Eberlein. “A review on Memory Power Management via Dynamic Voltage/Frequency
Scaling”. In: Pro-seminar (2013).
[108] J. Bargon. “Implementation of clock data recovery”. In: Master Thesis (2014).
Curriculum Vitae
Personal Data :
Surname : Fang
Name : Yuan
Birth date : 16 October 1983
Location : Shanghai, China
School Education :
Year         School level      School name
1989 – 1994  Primary school    Ri Hui Liu Cun School, Shanghai, China
1994 – 1998  Secondary school  Middle School attached to Shanghai Normal University, Shanghai, China
1998 – 2001  High school       Middle School attached to Shanghai Normal University, Shanghai, China
2003 – 2004  Language school   Technische Universität Dresden Institute of advanced studies, Dresden, Germany
Higher Education :
Year         Degree                Institute
2004 – 2006  Intermediate Diploma  Technische Universität Dresden, Dresden, Germany
2006 – 2009  Bachelor              Technische Universität München, Munich, Germany
2006 – 2010  Diploma               Technische Universität München, Munich, Germany
2010 – 2014  Doctoral              Technische Universität Darmstadt, Darmstadt, Germany
Work Experience :
Year         Position            Organizations
2010 – 2014  Research Assistant  Technische Universität Darmstadt, Darmstadt, Germany
2008 – 2009  Internship          Infineon, Munich, Germany
2008 – 2009  Bachelor Thesis     MUNEDA, Munich, Germany