QFabric technology combines all the member switches and enables them to function as a single unit. So, if your Data Center deploys a QFabric system with one hundred QFX3500 nodes, then those one hundred switches will act like a single switch.
Traffic flows differently in this super-sized Virtual Chassis that spans your entire Data Center. Knowing how traffic moves is critical to understanding and architecting Data Center operations, and it is also necessary for efficient day-to-day operations and troubleshooting.
This Week: QFabric System Traffic Flows and Troubleshooting is a deep dive into how the QFabric system externalizes the data plane for both user data and control plane traffic and why that's such a massive advantage from an operations point of view. It will help you operate and effectively troubleshoot issues that you might face with a QFabric deployment.
This Week:
QFabric System Traffic Flows and Troubleshooting
By Ankit Chadha
Information Experience
This book is singularly focused on one aspect of networking technology. There are
other sources at Juniper Networks, from white papers to webinars to online forums
such as J-Net (forums.juniper.net). Look for the following sidebars to directly access
other superb informational resources:
MORE? It's highly recommended that you go through the technical documentation and the minimum requirements to get a sense of QFabric hardware and deployment before you jump in. The technical documentation is located at www.juniper.net/documentation. Use the Pathfinder tool on the documentation site to explore and find the right information for your needs.
Figure A.1
STP works on the basis of blocking certain ports, meaning that some ports
can potentially be overloaded, while the blocked ports do not forward any
traffic at all. This is highly undesirable, especially because the switch ports
deployed in a Data Center are rather costly.
This situation of some ports not forwarding any traffic can be overcome
somewhat by using different flavors of the protocol, like PVST or MSTP,
but STP inherently works on the principle of blocking ports. Hence, even
with PVST or MSTP, complete load balancing of traffic over all the ports
cannot be achieved. Using PVST or MSTP, load balancing can be done across VLANs: one port can block for one VLAN or a group of VLANs, and another port can block for the rest of the VLANs. However, there is no way to provide load balancing for different flows within the same VLAN.
Spanning Tree relies on communication between different switches. If there
is some problem with STP communication, then the topology change
recalculations that follow can lead to small outages across the whole Layer
2 domain. Even small outages like these can cause significant revenue loss
for applications that are hosted in your Data Center.
By comparison, a completely scaled QFabric system can have up to 128 member
switches. This new technology works by combining all the member switches and
making them function as a single unit to other external devices. So if your Data
Center deploys a QFabric with one hundred QFX3500 nodes, then all those one
hundred switches will act as a single switch. In short, that single switch (QFabric)
will have (100x48) 4800 ports!
Since all the different QFX3500 nodes act as a single switch, there is no need to run
any kind of loop prevention protocol like Spanning Tree. At the same time, there is no
compromise on redundancy because all the Nodes have redundant connections to the
backplane (details on the connections between different components of a QFabric
system are discussed throughout this book). This is how the QFabric solution takes
care of the STP problem within the Data Center.
Consider the case of a traditional (layered) Data Center design. Note that if two hosts
connected to different access switches need to communicate with each other, they need
to cross multiple switches in order to do that. In other words, communication in the
same or a different VLAN might need to cross multiple switch hops to be successful.
Since all the Nodes within a QFabric system work together and act as a large single
switch, all the external devices connected to the QFabric Nodes (servers, filers, load
balancers, etc.) are just one hop away from each other. This leads to a lower number
of lookups, and hence, considerably reduces latency.
Physical Components
A QFabric system has the following physical components as shown in Figure A.2:
Nodes: these are the top-of-rack (TOR) switches to which external devices are
connected. All the server-facing ports of a QFabric system reside on the Nodes.
There can be up to 128 Nodes in a QFabric-G system and up to 16 Nodes in a
QFabric-M implementation. Up-to-date details on the differences between
various QFabric systems can be found here: http://www.juniper.net/us/en/
products-services/switching/qfabric-system/#overview.
Interconnects: The Interconnects act as the backplane for all the data plane
traffic. All the Nodes should be connected to all the Interconnects as a best
practice. There can be up to four Interconnects (QFX3008-I) in both QFabric-G
and QFabric-M implementations.
Director Group: There are two Director devices (DG0 and DG1) in both
QFabric-G and QFabric-M implementations. These Director devices are the
brains of the whole QFabric system and host the necessary virtual components
(VMs) that are critical to the health of the system. The two Director devices
operate in a master/slave relationship. Note that all the protocol/route/inventory
states are always synced between the two.
Control Plane Ethernet Switches: These are two independent EX VCs or EX
switches (in case of QFabric-G and QFabric-M, respectively) to which all the
other physical components are connected. These switches provide the necessary
Ethernet network over which the QFabric components can run the internal
protocols that maintain the integrity of the whole system. The LAN segment
created by these devices is called the Control Plane Ethernet segment or the CPE
segment.
Figure A.2
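All of these physical components show up in the fabric inventory on the QFabric CLI. As a quick sketch (heavily abbreviated here; component names are taken from the sample system whose full output appears in Chapter 2), the command and the general shape of its output look like this:

root@qfabric> show fabric administration inventory
Item                  Identifier   Connection   Configuration
Node group
  NW-NG-0                          Connected    Configured
Interconnect device
  IC-BBAK7828                      Connected    Configured
Fabric manager
  FM-0                             Connected    Configured
Fabric control
  FC-0                             Connected    Configured
  FC-1                             Connected    Configured
Diagnostic routing engine
  DRE-0                            Connected    Configured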
Virtual Components
The Director devices host the following Virtual Machines:
Network Node Group VM: The NW-NG-VM is the routing brain of a QFabric system; all the routing protocols like OSPF, BGP, or PIM run here. There are two NW-NG-VMs in a QFabric system (one hosted on each DG) and they operate in an active/backup fashion, with the active VM always being hosted on the master Director device.
Fabric Manager: The Fabric Manager VM is responsible for maintaining the hardware inventory of the whole system. This includes discovering new Nodes and Interconnects as they're added and keeping track of the ones that are removed. The Fabric Manager is also in charge of keeping a complete topological view of how the Nodes are connected to the Interconnects. In addition to this, the FM also needs to provide internal IP addresses to every other component to allow the internal protocols to operate properly. There is one Fabric Manager VM hosted on each Director device and these VMs operate in an active/backup configuration.
Fabric Control: The Fabric Control VM is responsible for distributing various routes (Layer 2 or Layer 3) to the different Nodes of a QFabric system. This VM forms internal BGP adjacencies with all the Nodes and Interconnects and sends the appropriate routes over these BGP peerings. There is one Fabric Control VM hosted on each Director device and these operate in an active/active fashion.
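Each of these VMs runs Junos and can be logged in to from the QFabric CLI when deeper inspection is needed. A minimal sketch, using the Fabric Manager instance name (FM-0) that appears on the systems used later in this book:

root@qfabric> request component login FM-0
qfabric-admin@FM-0>

The same method works for the Fabric Control and NW-NG VMs, as later chapters show.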
Chapter 1
Physical Connectivity and Discovery
This chapter discusses what a plain-vanilla QFabric system is supposed to look like. It
does not discuss any issues in the data plane or about packet forwarding; its only
focus is the internal workings of QFabric and checking the protocols that are instrumental in making the QFabric system function as a single unit.
The important first step in setting up a QFabric system is to cable it correctly.
MORE? Juniper has great documentation about cabling and setting up a QFabric system, so it won't be repeated here. If you need to, review the best practices of QFabric cabling: https://www.juniper.net/techpubs/en_US/junos11.3/information-products/pathway-pages/qfx-series/qfabric-deployment.html
Make sure that the physical connections are made exactly as mentioned in the deployment guide. That's how the test units used for this book were set up. Any variations in your lab QFabric system might cause discrepancies with the correlating output that is shown in this book.
Bond1: This aggregated interface is used mainly for internal Control plane
communication between the Director devices and other QFabric components
like the Nodes and the Interconnects.
Eth0: This is the management interface of the DG. This interface gets connected to the management network and you can SSH to the IP address of this interface from an externally reachable machine. Each Director device has an interface called Eth0, which should be connected to the management network. At the time of installation, the QFabric system prompts the user to enter the IP address for the Eth0 interface of each Director device. In addition to this, the user is required to add a third IP address called the VIP (Virtual IP Address). This VIP is used to manage the operations of QFabric, such as SSH, telnet, etc.
Also, the CLI command show fabric administration inventory director-group
status shows the status of all the interfaces. Here is sample output of this CLI
command:
root@TEST-QFABRIC> show fabric administration inventory director-group status
Director Group Status    Tue Feb 11 08:32:50 CST 2014

Member   Status   Role     Mgmt Address   CPU   Free Memory   VMs   Up Time
-----------------------------------------------------------------------------
dg0      online   master   172.16.16.5    1%    3429452k      4     97 days, 02:14 hrs
dg1      online   backup   172.16.16.6    0%    8253736k      3     69 days, 23:42 hrs

Member   Device Id/Alias   Status   Role
-----------------------------------------
dg0      TSTDG0            online   master

Master Services
---------------
Database Server                    online
Load Balancer Director             online
QFabric Partition Address          offline

Director Group Managed Services
-------------------------------
Shared File System                 online
Network File System                online
Virtual Machine Server             online
Load Balancer/DHCP                 online

Hard Drive Status
-----------------
Volume ID:0  FFF04E1F7778DA3       optimal
Physical ID:0                      online
Physical ID:1                      online
Resync Progress Remaining:0        0%
Resync Progress Remaining:1        0%

Size   Used   Avail   Used%   Mounted on
-----------------------------------------
423G   36G    366G    9%      /
99M    16M    79M     17%     /boot
93G    13G    81G     14%     /pbdata

Director Group Processes
------------------------
Director Group Manager             online
Partition Manager                  online
Software Mirroring                 online
Shared File System master          online
Secure Shell Process               online
Network File System                online
FTP Server                         online
Syslog                             online
Distributed Management             online
SNMP Trap Forwarder                online
SNMP Process                       online
Platform Management                online

Interface Link Status
---------------------
Management Interface               up
Control Plane Bridge               up
Control Plane LAG                  up
CP Link [0/2]                      down
CP Link [0/1]                      up
CP Link [0/0]                      up
CP Link [1/2]                      down
CP Link [1/1]                      up
CP Link [1/0]                      up
Crossover LAG                      up
CP Link [0/3]                      up
CP Link [1/3]                      up

Member   Device Id/Alias   Status   Role
-----------------------------------------
dg1      TSTDG1            online   backup

Director Group Managed Services
-------------------------------
Shared File System                 online
Network File System                online
Virtual Machine Server             online
Load Balancer/DHCP                 online

Hard Drive Status
-----------------
Volume ID:0  A2073D2ED90FED4       optimal
Physical ID:0                      online
Physical ID:1                      online
Resync Progress Remaining:0        0%
Resync Progress Remaining:1        0%

Size   Used   Avail   Used%   Mounted on
-----------------------------------------
423G   39G    362G    10%     /
99M    16M    79M     17%     /boot
93G    13G    81G     14%     /pbdata

Director Group Processes
------------------------
Director Group Manager             online
Partition Manager                  online
Software Mirroring                 online
Shared File System master          online
Secure Shell Process               online
Network File System                online
FTP Server                         online
Syslog                             online
Distributed Management             online
SNMP Trap Forwarder                online
SNMP Process                       online
Platform Management                online

Interface Link Status
---------------------
Management Interface               up
Control Plane Bridge               up
Control Plane LAG                  up
CP Link [0/2]                      down
CP Link [0/1]                      up
CP Link [0/0]                      up
CP Link [1/2]                      down
CP Link [1/1]                      up
CP Link [1/0]                      up
Crossover LAG                      up
CP Link [0/3]                      up
CP Link [1/3]                      up
root@TEST-QFABRIC>
--snip--
Note that this output is taken from a QFabric-M system, and hence, port 0/2 is
down on both the Director devices.
Details on how to connect these ports on the DGs are discussed in the QFabric
Installation Guide cited at the beginning of this chapter, but once the physical
installation of a QFabric system is complete, you should verify the status of all the
ports. You'll find that once a QFabric system is installed correctly, it is ready to
forward traffic, and the plug-and-play features of the QFabric technology make it
easy to install and maintain.
However, a single QFabric system has multiple physical components, so let's assume you've cabled your test bed correctly in your lab and review how a QFabric system
discovers its multiple components and makes sure that those different components
act as a single unit.
qfabric-admin@NW-NG-0> show virtual-chassis protocol adjacency provisioning
Interface       System            State   Hold (secs)
vcp1.32768      P7814-C           Up      28
vcp1.32768      P7786-C           Up      28
vcp1.32768      R4982-C           Up      28
vcp1.32768      TSTS2510b         Up      29
vcp1.32768      TSTS2609b         Up      28
vcp1.32768      TSTS2608a         Up      27
vcp1.32768      TSTS2610b         Up      28
vcp1.32768      TSTS2611b         Up      28
vcp1.32768      TSTS2509b         Up      28
vcp1.32768      TSTS2511a         Up      29
vcp1.32768      TSTS2511b         Up      28
vcp1.32768      TSTS2510a         Up      28
vcp1.32768      TSTS2608b         Up      29
vcp1.32768      TSTS2610a         Up      28
vcp1.32768      TSTS2509a         Up      28
vcp1.32768      TSTS2611a         Up      28
vcp1.32768      TSTS1302b         Up      29
vcp1.32768      TSTS2508a         Up      29
vcp1.32768      TSTS2508b         Up      29
vcp1.32768      TSTNNGS1205a      Up      27
vcp1.32768      TSTS1302a         Up      28
vcp1.32768      __NW-INE-0_RE0    Up      28
vcp1.32768      TSTNNGS1204a      Up      29
vcp1.32768      G0548/RE0         Up      27
vcp1.32768      G0548/RE1         Up      28
vcp1.32768      G0530/RE1         Up      29
vcp1.32768      G0530/RE0         Up      28
vcp1.32768      __RR-INE-1_RE0    Up      29
vcp1.32768      __RR-INE-0_RE0    Up      29
vcp1.32768      __DCF-ROOT.RE0    Up      29
vcp1.32768      __DCF-ROOT.RE1    Up      28
{master}
The same output can also be viewed from the Fabric Manager VM:
root@Test-QFabric> request component login FM-0
Warning: Permanently added 'dcfnode---dcf-root,169.254.192.17' (RSA) to the list of known hosts.
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:44:05 UTC
qfabric-admin@FM-0>
qfabric-admin@FM-0> show virtual-chassis protocol adjacency provisioning
Interface       System            State   Hold (secs)
vcp1.32768      P7814-C           Up      27
vcp1.32768      P7786-C           Up      28
vcp1.32768      R4982-C           Up      29
vcp1.32768      TSTS2510b         Up      29
vcp1.32768      TSTS2609b         Up      28
vcp1.32768      TSTS2608a         Up      29
vcp1.32768      TSTS2610b         Up      28
vcp1.32768      TSTS2611b         Up      28
vcp1.32768      TSTS2509b         Up      27
vcp1.32768      TSTS2511a         Up      29
vcp1.32768      TSTS2511b         Up      29
vcp1.32768      TSTS2510a         Up      27
vcp1.32768      TSTS2608b         Up      28
vcp1.32768      TSTS2610a         Up      28
vcp1.32768      TSTS2509a         Up      28
vcp1.32768      TSTS2611a         Up      28
vcp1.32768      TSTS1302b         Up      28
vcp1.32768      TSTS2508a         Up      27
vcp1.32768      TSTS2508b         Up      29
vcp1.32768      TSTNNGS1205a      Up      28
vcp1.32768      TSTS1302a         Up      29
vcp1.32768      __NW-INE-0_RE0    Up      28
vcp1.32768      TSTNNGS1204a      Up      28
vcp1.32768      G0548/RE0         Up      28
vcp1.32768      G0548/RE1         Up      29
vcp1.32768      G0530/RE1         Up      28
vcp1.32768      G0530/RE0         Up      27
vcp1.32768      __RR-INE-1_RE0    Up      29
vcp1.32768      __NW-INE-0_RE1    Up      28
vcp1.32768      __DCF-ROOT.RE0    Up      29
vcp1.32768      __RR-INE-0_RE0    Up      28
--snip--
VCCPD Hellos are sent every three seconds and the adjacency is lost if the peers don't see each other's Hellos for 30 seconds.
After the Nodes and Interconnects form VCCPD adjacencies with the Fabric Manager VM, the QFabric system has a view of all the connected components.
Note that the VCCPD adjacency only provides details about how many Nodes and Interconnects are present in a QFabric system. VCCPD does not provide any information about the data plane of the QFabric system; that is, it doesn't provide information about the status of connections between the Nodes and the Interconnects.
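In practice this means two different verification commands are involved, both of which appear throughout this book: the first confirms the VCCPD (control plane) adjacencies over the CPE segment, and the second, run on a Node group, shows the VCCPDf data plane adjacencies over the FTE links introduced below. A minimal sketch of the two, with output omitted (RSNG0 is an example Node group name used later in this book):

qfabric-admin@FM-0> show virtual-chassis protocol adjacency provisioning
qfabric-admin@RSNG0> show fabric multicast vccpdf-adjacency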
1. The only connections present are the DG0-DG1 connections and the connections
between the Director devices and the EX Series VC.
Figure 1.1
1.1. Note that DG0 and DG1 would assign IP addresses of 1.1.1.1 and
1.1.1.2 respectively to their bond0 links. This is the link over which the
Director devices sync up with each other.
1.2. The Fabric Manager VM running on the DGs would run VCCPD and
the DGs will send VCCPD Hellos on their links to the EX Series VC. Note
that there would be no VCCPD neighbors at this point in time as Nodes
and Interconnects are yet to be connected. Also, the Control plane switches
(EX Series VC) do not participate in the VCCPD adjacencies. Their
function is only to provide a Layer 2 segment for all the components to
communicate with each other.
Figure 1.2
2. In Figure 1.2 two Interconnects (IC-1 and IC-2) are connected to the EX Series VC.
2.1. The Interconnects start running VCCPD on the link connected to the EX
Series VC. The EX Series VC acts as a Layer 2 switch and only floods the
VCCPD packets.
2.2. The Fabric Manager VMs and the Interconnects see each other's VCCPD Hellos and become neighbors. At this point in time, the DGs know that IC-1 and IC-2 are a part of the QFabric system.
Figure 1.3
3. In Figure 1.3 two new Node devices (Node-1 and Node-2) are connected to the
EX Series VC.
3.1. The Nodes start running VCCPD on the links connected to the EX
Series VC. Now the Fabric Manager VMs know that there are four devices
in the QFabric inventory: IC-1, IC-2, Node-1, and Node-2.
3.2. Note that none of the FTE interfaces of the Nodes are up yet. This means that there is no way for the Nodes to forward traffic (there is no data plane connectivity). Whenever such a condition occurs, Junos disables all the 10GbE interfaces on the Node devices. This is a security measure to make sure that a user cannot connect a production server to a Node device that doesn't have any active FTE ports. This also makes troubleshooting very easy. If all the 10GbE ports of a Node device go down even when devices are connected to it, the first place to check should be the status of the FTE links. If none of the FTE links are in the up/up state, then all the 10GbE interfaces will be disabled. In addition to bringing down all the 10GbE ports, the QFabric system also raises a major system alarm. The alarms can be checked using the show system alarms CLI command (a verification sketch appears at the end of this walkthrough).
Figure 1.4
4.3 The Nodes and the Interconnects will run VCCPDf on the FTE links and
see each other.
4.4 This VCCPDf information is fed to the Director devices. At this point in
time, the Directors know that:
There are four devices in the QFabric system. This was established at
point# 3.1.
Node-1 is connected to IC-1 and Node-2 is connected to IC-2.
5. Note that some of the data plane of the QFabric is connected, but there would be
no connectivity for hosts across Node devices. This is because Node-1 has no way to
reach Node-2 via the data plane and vice-versa, as the Interconnects are never
connected to each other. The only interfaces for the internal data plane of the
QFabric system are the 40GbE FTE interfaces. In this particular example, Node-1 is
connected to IC-1, but IC-1 is not connected to Node-2. Similarly, Node-2 is
connected to IC-2, but IC-2 is not connected to Node-1. Hence, hosts connected
behind Node-1 have no way of reaching hosts connected behind Node-2, and
vice-versa.
Figure 1.5
6. In Figure 1.5 Node-1 is connected to IC-2. At this point, the Fabric Manager has
the following information:
6.4 At this point in time, hosts connected behind Node-1 should be able to
communicate with hosts connected behind Node-2 (provided that the basic
laws of networking like VLAN, routing, etc. are obeyed).
Figure 1.6
Node-2 is Connected to IC-1, which Completes the Data Plane of the QFabric System
7.1 The Nodes and IC-1 discover each other using VCCPDf and send this information to the Fabric Manager VM running on the Directors.
7.2 Now the FM realizes that Node-1 can reach Node-2 via IC-1, also.
7.3 After the FM finishes programming the tables of Node-1 and Node-2,
both Node devices will have two next hops to reach each other. These two
next hops can be used for load-balancing purposes. This is where the
QFabric solution provides excellent High Availability and also effective
load balancing of different flows as we add more 40GbE uplinks to the
Node devices.
At the end of all these steps, the internal VCCPD and VCCPDf adjacencies of the QFabric system will be complete, and the Fabric Manager will have a complete topological view of the system.
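As a final check when bringing up a system this way, it is worth confirming that no major alarms are present and that the FTE links on each Node group are up, as mentioned in step 3.2. A minimal sketch of that check, with output omitted (the alarms are checked from the QFabric CLI, while the interface check is done after logging in to the Node group; RSNG0 is an example component name used later in this book):

root@qfabric> show system alarms
qfabric-admin@RSNG0> show interfaces terse | match fte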
Chapter 2
Accessing Individual Components
Before this book demonstrates how to troubleshoot any problems, this chapter will
educate the reader about logging in to different components and how to check and
retrieve logs at different levels (physical and logical components) of a QFabric system.
Details on how to configure a QFabric system and aliases for individual Nodes are
documented in the QFabric Deployment Guide at www.juniper.net/documentation.
    IC001/RE1               Connected
Fabric manager
  FM-0                      Connected    Configured
Fabric control
  FC-0                      Connected    Configured
  FC-1                      Connected    Configured
Diagnostic routing engine
  DRE-0                     Connected    Configured
This output shows the alias and the serial number (mentioned under the Identifier
column) of every Node that is a part of the QFabric system. It also shows if the
Node is a part of an SNG, Redundant-SNG, or the Network Node Group.
The rightmost column of the output shows the state of each component. Each
component of the QFabric should be in Connected state. If a component shows up
as Disconnected, then there must be an underlying problem and troubleshooting is
required to find out the root cause.
As shown in Figure 2.1, this particular QFabric system has six Nodes and two Interconnects. Node-0 and Node-1 are part of the Network Node Group, Node-2 and Node-3 are part of a Redundant-SNG named RSNG-1, and Node-4 and Node-5 are part of another Redundant-SNG named RSNG-2.
Figure 2.1
The QFabric system uses IP addresses in the 169.254.193.x range to allot IP addresses to Node groups and the Interconnects, and IPs in the 169.254.128.x range are allotted to Node devices and to VMs. These IP addresses are used for internal management and can be used to log in to individual components from the Director devices' Linux prompt. The IP addresses of the components can be seen using the dns.dump utility, which is located under /root on the Director devices. Here is an example showing sample output from dns.dump and explaining how to log in to various components:
[root@dg0 ~]# ./dns.dump
; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> axfr pkg.dcbg.juniper.net. @169.254.0.1
;; global options:  printcmd
pkg.dcbg.juniper.net.  600  IN  SOA  ns.pkg.dcbg.juniper.net. mail.pkg.dcbg.juniper.net. 10 43600 600 7200 3600
pkg.dcbg.juniper.net.  600  IN  NS  ns.pkg.dcbg.juniper.net.
pkg.dcbg.juniper.net.  600  IN  A   169.254.0.1
pkg.dcbg.juniper.net.  600  IN  MX  1 mail.pkg.dcbg.juniper.net.
dcfnode---DCF-ROOT.pkg.dcbg.juniper.net.  45  IN  A  169.254.192.17    <<<<<<< DCF Root (FM's) IP address
dcfnode---DRE-0.pkg.dcbg.juniper.net.  45  IN  A  169.254.3.3
dcfnode-3b46cd08-9331-11e2-b616-00e081c53280.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.15
dcfnode-3d9b998a-9331-11e2-bbb2-00e081c53280.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.16
dcfnode-4164145c-9331-11e2-a365-00e081c53280.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.17
dcfnode-43b35f38-9331-11e2-99b1-00e081c53280.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.18
dcfnode-A9122-RE0.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.5
dcfnode-A9122-RE1.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.8
dcfnode-BBAK0431.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.20
dcfnode-default---FABC-INE-A9122.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.0
dcfnode-default---FABC-INE-IC001.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.1
dcfnode-default---NW-INE-0.pkg.dcbg.juniper.net.  45  IN  A  169.254.192.34    <<<< NW-INE's IP address
dcfnode-default---RR-INE-0.pkg.dcbg.juniper.net.  45  IN  A  169.254.192.35    <<<< FC-0's IP address
dcfnode-default---RR-INE-1.pkg.dcbg.juniper.net.  45  IN  A  169.254.192.36
dcfnode-default-RSNG-1.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.11
dcfnode-default-RSNG-2.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.12
dcfnode-IC001-RE0.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.6    <<<< IC's IP address
dcfnode-IC001-RE1.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.7
dcfnode-P1377-C.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.21
dcfnode-P4423-C.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.19
dcfnode-P6690-C.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.22
dcfnode-P6966-C.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.24    <<<<< node's IP address
dcfnode-P6972-C.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.23    <<<<< node's IP address
mail.pkg.dcbg.juniper.net.  600  IN  A  169.254.0.1
ns.pkg.dcbg.juniper.net.  600  IN  A  169.254.0.1
server.pkg.dcbg.juniper.net.  600  IN  A  169.254.0.1
--snip--
{master}
root@NW-NG-0> exit
root@NW-NG-0% exit
logout
Connection to 169.254.192.34 closed.
The Node devices with serial numbers P1377-C and P4423-C are a part of the Node
group named RSNG-1. This information is present in the output of show fabric
administration inventory shown above.
As mentioned previously, the RSNG abstraction works on the concept of a Virtual
Chassis. Here is a CLI snippet showing the result of a login attempt to the IP
addresses of the Nodes which are a part of RSNG-1:
[root@dg0 ~]# ./dns.dump | grep RSNG-1
dcfnode-default-RSNG-1.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.11
dcfnode-default-RSNG-1.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.11
[root@dg0 ~]# ssh [email protected]
[email protected]'s password:
--- JUNOS 12.2X50-D41.1 built 2013-03-22 21:43:51 UTC
root@RSNG-1%
root@RSNG-1%
root@RSNG-1% cli
{master}    <<<<<<<<< Master prompt. Chapter 3 discusses more about master/backup REs within various Node groups
root@RSNG-1> show virtual-chassis

Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.010b.0000
                                          Mstr
Member ID   Status   Model     prio   Role      Serial No
0 (FPC 0)   Prsnt    qfx3500   128    Master*   P4423-C
1 (FPC 1)   Prsnt    qfx3500   128    Backup    P1377-C

{master}
root@RSNG-1>
[root@dg0~]#[email protected]
Theauthenticityofhost'169.254.128.19(169.254.128.19)'can'tbeestablished.
RSAkeyfingerprintis9e:aa:da:bb:8d:e4:1b:74:0e:57:af:84:80:c3:a8:9d.
Areyousureyouwanttocontinueconnecting(yes/no)?yes
Warning:Permanentlyadded'169.254.128.19'(RSA)tothelistofknownhosts.
[email protected]'spassword:
---JUNOS12.2X50-D41.1built2013-03-2221:43:51UTC
root@RSNG-1%
root@RSNG-1%cli
{master}<<<<<<<<RSNG-masterprompt
Nodes P6966-C and BBAK0431 are the line cards of this NW-NG-0 VM. Since the
REs of these Node devices are not active at all, there is no configuration that is pushed
down to the line cards. Here are the snippets from the login prompt of the member
Nodes of the Network Node Group:
[root@dg0 ~]# ./dns.dump | grep P6966
dcfnode-P6966-C.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.24
dcfnode-P6966-C.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.24
[root@dg0~]#[email protected]
Theauthenticityofhost'169.254.128.24(169.254.128.24)'can'tbeestablished.
RSAkeyfingerprintisf6:64:18:f5:9d:8d:29:e7:95:c0:d7:4f:00:a7:3d:30.
Areyousureyouwanttocontinueconnecting(yes/no)?yes
Warning:Permanentlyadded'169.254.128.24'(RSA)tothelistofknownhosts.
[email protected]'spassword:
Permissiondenied,pleasetryagain.
Note that since no configuration is pushed down to the line cards in the case of NW-NG-0, a user can't log in to the line cards (the credentials wouldn't work as the configuration is not pushed to the line cards at all). You need to connect to the line cards using their console if you intend to check details on the NW-NG line cards. Also, note that the logs from the line cards are reflected in the /var/log/messages file on the NW-NG-0 VM. One more method to log in to the Nodes belonging to the NW-NG is to telnet to them. However, note that no configuration is visible on these Nodes as their Routing Engines are disabled and hence no configuration is pushed to them.
5. From the CLI. Logging in to individual components requires user-level privileges,
which allow such logins. The remote-debug-permission CLI setting needs to be
configured for this. Here is the configuration used on the QFabric system discussed in this chapter:
root@Test-QFABRIC> show configuration system
host-name Test-QFABRIC;
authentication-order [ radius password ];
root-authentication {
    encrypted-password "$1$LHY6NN4P$cnOMoqUj4OXKMaHOm2s.Z."; ## SECRET-DATA
    remote-debug-permission qfabric-admin;
}
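Following the hierarchy shown in the snippet above, the equivalent configuration-mode statement would be entered roughly as follows (a sketch derived from that snippet, not a complete configuration procedure):

root@Test-QFABRIC# set system root-authentication remote-debug-permission qfabric-admin
root@Test-QFABRIC# commit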
There are three permissions that you can set at this hierarchy:
qfabric-admin: Permits a user to log in to individual QFabric switch components, issue show commands, and change component configurations.
qfabric-operator: Permits a user to log in to individual QFabric switch components and issue show commands.
qfabric-user: Prevents a user from logging in to individual QFabric switch components.
Also, note that a user needs to have admin control privileges to add this statement to the device's configuration.
MORE? Complete details on QFabric's system login classes can be found at this link: http://www.juniper.net/techpubs/en_US/junos13.1/topics/concept/access-login-class-qfabric-overview.html.
Once a user has the required remote debug permission set, they can access the
individual components using the request component login command:
root@Test-QFABRIC> request component login ?
Possible completions:
  <node-name>     Inventory name for the remote node
  A9122/RE0       Interconnect device control board
  A9122/RE1       Interconnect device control board
  BBAK0431        Node device
--SNIP--
And from 13.1 onwards, you can check the logs for a specific component from the
QFabric CLI:
root@qfabric> show log messages ?
Possible completions:
  <[Enter]>               Execute this command
  <component>
  director-device         Show logs from a director device
  infrastructure-device   Show logs from a infrastructure device
  interconnect-device     Show logs from a interconnect device
  node-device             Show logs from a node device
  |                       Pipe through a command
root@qfabric>
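For example, to pull the messages log of a specific Node device straight from the QFabric CLI (a sketch that follows the completions shown above; BBAK1280 is a Node alias that appears in the next snippet, and output is omitted here):

root@qfabric> show log messages node-device BBAK1280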
root@qfabric> show fabric administration inventory
Item                 Identifier     Connection    Configuration
Node group
  BBAK1280                          Connected     Configured
    BBAK1280                        Connected
  BBAM7499                          Connected     Configured
    BBAM7499                        Connected
  BBAM7543                          Connected     Configured
    BBAM7543                        Connected
  BBAM7560                          Connected     Configured
    BBAM7560                        Connected
  BBAP0747                          Connected     Configured
    BBAP0747                        Connected
  BBAP0748                          Connected     Configured
    BBAP0748                        Connected
  BBAP0750                          Connected     Configured
    BBAP0750                        Connected
  BBPA0737                          Connected     Configured
    BBPA0737                        Connected
  NW-NG-0                           Connected     Configured
    BBAK6318                        Connected
    BBAM7508                        Connected
  P1602-C                           Connected     Configured
    P1602-C                         Connected
  P2129-C                           Connected     Configured
    P2129-C                         Connected
  P3447-C                           Connected     Configured
    P3447-C                         Connected
  P4864-C                           Connected     Configured
    P4864-C                         Connected
Interconnect device
  IC-BBAK7828                       Connected     Configured
    BBAK7828/RE0                    Connected
  IC-BBAK7840                       Connected     Configured
    BBAK7840/RE0                    Connected
  IC-BBAK7843                       Connected     Configured
    BBAK7843/RE0                    Connected
Fabric manager
  FM-0                              Connected     Configured
Fabric control
  FC-0                              Connected     Configured
  FC-1                              Connected     Configured
Diagnostic routing engine
  DRE-0                             Connected     Configured
root@qfabric>
[root@dg0 ~]# ./dns.dump | grep BBAK1280
dcfnode-BBAK1280.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.7
dcfnode-default---BBAK1280.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.2
dcfnode-BBAK1280.pkg.dcbg.juniper.net.  45  IN  A  169.254.128.7
dcfnode-default---BBAK1280.pkg.dcbg.juniper.net.  45  IN  A  169.254.193.2
[root@dg0 ~]# ssh [email protected]
The authenticity of host '169.254.128.7 (169.254.128.7)' can't be established.
RSA key fingerprint is a3:3e:2f:65:9d:93:8f:e3:eb:83:08:c3:01:dc:b9:c1.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '169.254.128.7' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130306_1309_dc-builder built 2013-03-06 14:56:57 UTC
root@BBAK1280%
After logging in to a component, the logs can be viewed either by using the show log
<filename> Junos CLI command or by logging in to the shell mode and checking out
the contents of the /var/log directory:
root@BBAK1280> show log messages ?
Possible completions:
  <filename>           Name of log file
  messages             Size: 185324, Last changed: Jun 02 20:59:46
  messages.0.gz        Size: 4978, Last changed: Jun 01 00:45:00
--SNIP--
root@BBAK1280> show log chassisd ?
Possible completions:
  <filename>           Name of log file
  chassisd             Size: 606537, Last changed: Jun 02 20:41:04
  chassisd.0.gz        Size: 97115, Last changed: May 18 05:36:24
root@BBAK1280>
root@BBAK1280> exit
root@BBAK1280% cd /var/log
root@BBAK1280% ls -lrt | grep messages
-rw-rw----  1 root  wheel  4630 May 11 13:45 messages.9.gz
-rw-rw----  1 root  wheel  4472 May 13 21:45 messages.8.gz
--SNIP--
root@BBAK1280%
This is true for other components as well (Nodes, Interconnects, and VMs). However, the rule of checking logs only at the active RE of a component still applies.
As with any other Junos platform, /var/log is a very important location as far as log collection is concerned. On the Director devices, the /var/log/messages file records their general logs.
[root@dg0 tmp]# cd /var/log
[root@dg0 log]# ls
add_device_dg0.log   cron.4.gz   messages        secure.3.gz
anaconda.log         cups        messages.1.gz   secure.4.gz
--SNIP--
/tmp
This is the location that contains all the logs pertaining to the configuration push events, within the subdirectory named sfc-captures. Whenever a configuration is committed on the QFabric CLI, it is pushed to the various components. The logs pertaining to these processes can be found at this location, as can any core files:
[root@dg0 sfc-captures]# cd /tmp
[root@dg0 tmp]# ls
1296.sfcauth   26137.sfcauth   32682.sfcauth   corefiles
--SNIP--
[root@dg0 tmp]# cd sfc-captures/
[root@dg0 sfc-captures]# ls
0317  0323  0329  0335  0341  0347  0353  0359  0365  0371  0377
0318  0324  0330  0336  0342  0348  0354  0360  0366  0372  0378
0319  0325  0331  0337  0343  0349  0355  0361  0367  0373  last.txt
0320  0326  0332  0338  0344  0350  0356  0362  0368  0374  misc
0321  0327  0333  0339  0345  0351  0357  0363  0369  0375  sfc-database
0322  0328  0334  0340  0346  0352  0358  0364  0370  0376
A big part of troubleshooting any networking issue is enabling trace options and
analyzing the logs. To enable trace options on a specific component, the user needs to
have superuser access. Here is how that can be done:
qfabric-admin@NW-NG-0> start shell
% su
Password:
root@NW-NG-0% cli
{master}
qfabric-admin@NW-NG-0> configure
Entering configuration mode
{master}[edit]
qfabric-admin@NW-NG-0# set protocols ospf traceoptions flag all
{master}[edit]
qfabric-admin@NW-NG-0# commit    <<<< commit at component level
commit complete
{master}[edit]
NOTE If any trace options are enabled at a component level, and a commit is done from the QFabric CLI, then the trace options configured at the component will be removed.
MORE? There is another method of enabling trace options on QFabric and it is documented
at the following KB article: http://kb.juniper.net/InfoCenter/
index?page=content&id=KB21653.
Whenever trace options are configured at a component level, the corresponding file
containing the logs is saved on the file system of the active RE for that component.
Note that there is no way that an external device connected to QFabric can connect
to the individual components of a QFabric system.
Because the individual components can be reached only by the Director devices, and
the external devices (say, an SNMP server) can only connect to the DGs as well, you
need to follow this procedure to retrieve any files that are located on the file system
of a component:
1. Save the file from the component on to the DG.
2. Save the file from the DG to the external server/device.
This is because the management of the whole QFabric system is done using the VIP
that is allotted to the DGs. Since QFabric is made up of a lot of physical components, always consider a QFabric system as a network of different devices. These
different components are connected to each other on a common LAN segment,
which is the control plane Ethernet segment. In addition to this, all the components
have an internal management IP address in the 169.254 IP address range. These IP
addresses can be used to copy files between different components.
Here is an example of how to retrieve log files from a component (NW-NG in this
case):
root@NW-NG-0% ls -lrt /var/log | grep ospf
-rw-r-----  1 root  wheel  59401 Apr 8 08:26 ospf-traces    <<<< the log file is saved at /var/log on the NW-INE VM
root@NW-NG-0% exit
logout
Connection to 169.254.192.34 closed.
[root@dg0 ~]# ./dns.dump | grep NW-INE
dcf-default---NW-INE-0.pkg.dcbg.juniper.net.  45  IN  A  169.254.192.34
dcf-default---NW-INE-0.pkg.dcbg.juniper.net.  45  IN  A  169.254.192.34
[root@dg0 ~]#
[root@dg0 ~]#
[root@dg0 ~]# ls -lrt | grep ospf
[root@dg0 ~]#
[root@dg0 ~]#
[root@dg0 ~]# scp [email protected]:/var/log/ospf-traces .
[email protected]'s password:
ospf-traces                        100%   59KB  59.0KB/s   00:00
[root@dg0 ~]# ls -lrt | grep ospf
-rw-r-----  1 root  root  60405 Apr 8 01:27 ospf-traces
Here, you've successfully transferred the log file to the DG. Since the DGs have management access to the gateway, you can now transfer this file out of the QFabric system to the required location.
Just like trace options, core files are also saved locally on the file system of the
components. These files can be retrieved the same way as trace option files are
retrieved:
First, save the core file from the component onto the DG.
Once the file is available on the DG it can be accessed via other devices that
have IP connectivity to the Director devices.
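A sketch of that two-step copy, using scp in the same way as the trace-file example above (the core file name and its path under /var/tmp/corefiles are purely illustrative, and the external server address is a placeholder):

[root@dg0 ~]# scp [email protected]:/var/tmp/corefiles/pfed.core.0.gz .
[root@dg0 ~]# scp pfed.core.0.gz user@<external-server>:/var/tmp/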
Inbuilt Scripts
There are several inbuilt scripts in the QFabric system that can be run to check the
health of or gather additional information about the system. These scripts are present
in the /root directory of the DGs. Most of the inbuilt scripts are leveraged by the
system in the background (to do various health checks on the QFabric system). The
names of the scripts are very intuitive and here are a few that can be extremely useful:
dns.dump: Shows the IP addresses corresponding to all the components (it's already been used multiple times in this book).
createpblogs: This script gathers the logs from all the components and stores them as /tmp/pblogs.tgz. From Junos 12.3 and up, this log file is saved at the /pbdata/export/rlogs/ location. This script is extremely useful when troubleshooting QFabric. Best practice suggests running this script before and after every major change that is made on the QFabric system. That way you'll know how the logs looked before and after the change, something useful for both JTAC and yourself when it comes time to troubleshoot issues.
pingtest.sh: This script pings all the components of the QFabric system and
reports their status. If any of the Nodes are not reachable, then a suitable status
is shown for that Node. Here is what a sample output would look like:
[root@dg1 ~]# ./pingtest.sh
----> Detected new host dcfnode---DCF-ROOT
dcfnode---DCF-ROOT - ok
----> Detected new host dcfnode---DRE-0
dcfnode---DRE-0 - ok
----> Detected new host dcfnode-13daf6fc-9b6c-11e2-bafc-00e081ce1e76
dcfnode-13daf6fc-9b6c-11e2-bafc-00e081ce1e76 - ok
----> Detected new host dcfnode-150d8a4e-9b6c-11e2-a1ae-00e081ce1e76
dcfnode-150d8a4e-9b6c-11e2-a1ae-00e081ce1e76 - ok
----> Detected new host dcfnode-16405946-9b6c-11e2-a345-00e081ce1e76
dcfnode-16405946-9b6c-11e2-a345-00e081ce1e76 - ok
----> Detected new host dcfnode-17732b54-9b6c-11e2-a937-00e081ce1e76
dcfnode-17732b54-9b6c-11e2-a937-00e081ce1e76 - ok
----> Detected new host dcfnode-226b5716-9b80-11e2-aea7-00e081ce1e76
--snip--
Certain scripts can cause some traffic disruption and hence should never be run on a QFabric system that is carrying production traffic, for instance: format.sh, dcf_sfc_wipe_cluster.sh, reset_initial_configuration.sh.
Q: What is the IP address range that is allocated to Node groups and Node devices?
Node devices: 169.254.128.x
Node groups: 169.254.193.x
Q: What inbuilt script can be used to obtain the IP addresses allocated to the different components of a QFabric system?
The dns.dump script, which is located under /root on the Director devices.
Chapter 3
Control Plane and Data Plane Flows
One of the goals of this book is to help you efficiently troubleshoot an issue on a QFabric system. To achieve this, it's important to understand exactly how the internal QFabric protocols operate and the packet flow of both the data plane and control plane traffic.
Routing Engines
This chapter discusses the path of packets for control plane and data plane traffic.
The following are bulleted lists about the protocols run on these abstractions or Node
groups.
Figure 3.1
Figure 3.2
Figure 3.3
Route Propagation
As with any other internetworking device, the main job of QFabric is to send traffic
end-to-end. To achieve this, the system needs to learn various kinds of routes (such
as Layer 2 routes, Layer 3 routes, ARP, etc.).
As discussed earlier, there can be multiple active REs within a single QFabric system.
Each of these REs can learn routes locally, but a big part of understanding how
QFabric operates is to know how these routes are exchanged between various REs
within the system.
One approach to exchanging these routes between different REs is to send all the routes learned on one RE to all the other active REs. While this is simple to do, such an implementation would be counterproductive because all the routes eventually need to be pushed down to the PFE so that hardware forwarding can take place. If you send all the routes to every RE, then the complete scale of the QFabric comes down to the table limits of a single PFE. It means that the scale of the complete QFabric solution is only as good as the scale of a single RE. This is undesirable, and the next section discusses how Juniper's QFabric technology maintains scale with a distributed architecture.
Maintaining Scale
One of the key advantages of the QFabric architecture is its scale. The scaling numbers of MAC addresses and IP addresses obviously depend on the number of Nodes
that are a part of a QFabric system because the data plane always resides on the
Nodes and you need the routes to be programmed in the PFE (the data plane) to
ensure end-to-end traffic forwarding.
As discussed earlier, not all the routes learned on an RE are sent to every other RE. Instead, an RE receives only the routes that it needs to forward data. This poses a big question: What parameters decide whether a route should be sent to a Node's PFE or not? The answer is: it depends on the kind of route. The deciding factor for a Layer 2 route is different from the factor for a Layer 3 route. Let's examine them briefly to understand these differences.
Layer 2 Routes
A Layer 2 route is the combination of a VLAN and a MAC address (a VLAN-MAC pair), the same information that is stored in the Ethernet switching table of any Juniper EX Series switch. Now, Layer 2 traffic can be either unicast or BUM (Broadcast, Unknown unicast, or Multicast), in which case the traffic is flooded within the VLAN.
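On an individual Node group, these VLAN-MAC pairs can be inspected with the familiar EX-style command after logging in to the component (a sketch with output omitted; RSNG0 is a component name used later in this chapter):

qfabric-admin@RSNG0> show ethernet-switching table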
Figure 3.4 is a representation of a QFabric system where Node-1 has active ports in VLANs 10 and 20, Node-2 has hosts in VLANs 20 and 30 connected to it, and both Node-3 and Node-4 have hosts in VLANs 30 and 40 connected to them. Active ports means that the Nodes either have hosts directly connected to them, or that the hosts are plugged into access switches and these switches plug into the Nodes. For the sake of simplicity, let's assume that all the Nodes shown in Figure 3.4 are SNGs, meaning that for this section the words Node and RE can be used interchangeably.
Figure 3.4
Consider that Host-2 wants to send traffic to Host-3, and let's assume that the MAC address of Host-3 is already learned. This traffic would be Layer 2 unicast traffic as both the source and destination devices are in the same VLAN. When Node-1 sees this traffic coming in from Host-2, all of it should be sent over to Node-2 internally within the QFabric example in Figure 3.4. When Node-2 receives this traffic, it should be sent unicast to the port where the host is connected. This kind of communication means:
Host-3's MAC address is learned on Node-2. There should be some way to send this Layer 2 route information over to Node-1. Once Node-1 has this information, it knows that everything destined to Host-3's MAC address should be sent to Node-2 over the data plane of the QFabric.
This is true for any other host in VLAN-20 that is connected on any other Node.
Note that if Host-5 wishes to send some traffic to Host-3, then this traffic must be routed at Layer 3, as these hosts are in different VLANs. The regular laws of networking would apply in this case and Host-5 would need to resolve the ARP for its gateway. The same concept would apply if Host-6 wishes to send some data to Host-3. Since none of the hosts behind Node-3 ever need to resolve the MAC address of Host-3 to be able to send data to it, there is no need for Node-2 to advertise Host-3's MAC address to Node-3. However, this would change if a new host in VLAN-20 is connected behind Node-3.
Conclusion: if a Node learns of a MAC address in a specific VLAN, then this MAC
address should be sent over to all the other Nodes that have an active port in that
particular VLAN. Note that this communication of letting other Nodes know about
a certain MAC address would be a part of the internal Control Plane traffic within
the QFabric system. This data will not be sent out to devices that are connected to
the Nodes of the QFabric system. Hence, for Layer 2 routes, the factor that decides
whether a Node gets that route or not is the VLAN.
Layer 2 BUM Traffic
Let's consider that Host-4 sends out Layer 2 broadcast traffic, that is, frames in which the destination MAC address is ff:ff:ff:ff:ff:ff, and that all this traffic should be flooded in VLAN-30. In the QFabric system depicted in Figure 3.4, there are three Nodes that have active ports in VLAN-30: Node-2, Node-3, and Node-4. What happens?
All the broadcast traffic originated by Host-4 should be sent internally to Node-3 and Node-4, and then these Nodes should be able to flood this traffic in VLAN-30.
Since Node-1 doesn't have any active ports in VLAN-30, it doesn't need to flood this traffic out of any revenue ports or server-facing ports. This means that Node-2 should not send this traffic over to Node-1. However, at a later time, if Node-1 gets an active port in VLAN-30 then the broadcast traffic will be sent to Node-1 as well.
These points are true for BUM traffic assuming that IGMP snooping is disabled.
In conclusion, if a Node receives BUM traffic in a VLAN, then all that traffic should
be sent over to all the other Nodes that have an active port in that VLAN and not to
those Nodes that do not have any active ports in this VLAN.
Layer 3 Routes
Layer 3 routes are good old unicast IPv4 routes. Note that only the NW-NG-VM has
the ability to run Layer 3 protocols with externally connected devices, hence at any
given time, the active NW-NG-VM has all the Layer 3 routes learned in all the
routing instances that are configured on a given QFabric system. However, not all
these routes are sent to the PFE of all the Nodes within an NW-NG.
Let's use Figure 3.5, which represents a Network Node Group, for the discussion of Layer 3 unicast routes. All the Nodes shown are part of the NW-NG. Host-1 is connected to Node-1, Host-2 and Host-3 are connected to Node-2, and Host-4 is connected to Node-3. You can see that all the IP addresses and the subnets are shown as well. Additionally, the subnets for Host-1 and Host-2 are in routing instance RED, whereas the subnets for Host-3 and Host-4 are in routing instance BLUE. The default gateways for these hosts are the Routed VLAN Interfaces (RVIs) that are configured and shown in Figure 3.5.
Let's assume that there are hosts and devices connected to all three Nodes in the default (master) routing instance, although they are not shown in the diagram. The case of IPv4 routes is much simpler than Layer 2 routes. Basically, it's the routing instance that decides whether a route should be sent to other REs or not.
Figure 3.5
In the QFabric configuration shown in Figure 3.5, the following takes place:
Node-1 and Node-2 have one device each connected in routing instance RED.
The default gateway (interface vlan.100) for these devices resides on the active
NW-NG-VM, meaning that the NW-NG-VM has two direct routes in this
routing instance, one for the subnet 1.1.1.0/24 and the other for 2.2.2.0/24.
Since the route propagation deciding factor for Layer 3 routes is the routing instance, the active NW-NG-VM sends the routes for 1.1.1.0/24 and 2.2.2.0/24 to both Node-1 and Node-2 so that these routes can be programmed in the data plane (the PFE of the Nodes).
The active NW-NG-VM will not send the information about the directly connected routes in routing instance BLUE over to Node-1 at all. This is because Node-1 doesn't have any directly connected devices in the BLUE routing instance.
This is true for all kinds of routes learned within a routing instance; they could
either be directly connected, static, or learned via routing protocols like BGP,
OSPF, or IS-IS.
All of the above applies to Node-2 and Node-3 for the routing instance named
BLUE.
All of the above applies to SNGs, RSNGs, and the master routing instance.
In conclusion, the route learning always takes place at the active NW-NG-VM and only selective routes are propagated to the individual Nodes for programming the data plane (the PFE of the Nodes). The individual Nodes get the Layer 3 routes from the active NW-NG-VM only if the Node has an active port in that routing instance.
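For example, after logging in to the active NW-NG VM, the routes held in a given instance can be checked with a standard Junos command (a sketch with output omitted; the instance name RED follows the example above):

qfabric-admin@NW-NG-0> show route table RED.inet.0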
This concept of sending routes to an RE/PFE only if it needs them ensures that you do not send all the routes everywhere. That is what allows a QFabric system to operate at such high scale. Now let's discuss how those routes are sent over to the different Nodes.
Both Layer 2 and Layer 3 routes are tied to an internal hardware token: for Layer 2 routes the token corresponds to the VLAN, and for Layer 3 routes it corresponds to the routing instance. This token acts as the RD/RT and is the deciding factor for whether a route should be sent to a Node or not.
Each active RE within the QFabric system forms a BGP peering with the VMs called
FC-0 and FC-1. All the active REs send all Layer 2 and Layer 3 routes over to the
FC-0 and FC-1 VMs via BGP. These VMs only send the appropriate routes over to
individual REs (only the routes that the REs need).
The FC-0 and FC-1 VMs act as route reflectors. However, these VMs follow the rules of QFabric technology when deciding which routes are sent to which RE (not sending the routes that an RE doesn't need).
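A quick way to confirm that these internal peerings are established is to log in to one of the Fabric Control VMs and look at the BGP summary with the standard Junos command (a sketch with output omitted; how the internal peers are displayed can vary by release):

root@Test-QFabric> request component login FC-0
qfabric-admin@FC-0> show bgp summary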
Figure 3.6 shows all the components (SNG, RSNG, and NW-NG VMs) sending all
of their learned routes (Layer 2 and Layer 3) over to the Fabric Control VM.
Figure 3.6
However, the Fabric Control VM sends only those routes to a component that are
relevant to it. In Figure 3.7, the different colored arrows signify that the relevant
routes that the Fabric Control VMs send to each component may be different.
Let's look at some show command snippets that demonstrate how a local route gets sent to the FC, and then how the other Nodes see it. They are separated into Layer 2 and Layer 3 routes, and most of the snippets have notes preceded by <<.
Figure 3.7
Here the Node named MLRSNG01a is a member of the RSNG named RSNG0:
root@TEST-QFABRIC# run show fabric administration inventory node-groups RSNG0
Item          Identifier   Connection   Configuration
Node group
  RSNG0                    Connected    Configured
    MLRSNG01a   P6810-C    Connected
    MLRSNG02a   P7122-C    Connected
[edit]
Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.0103.0000
                                          Mstr
Member ID   Status   Model     prio   Role      Serial No
0 (FPC 0)   Prsnt    qfx3500   128    Master*   P6810-C    << MLRSNG01a or fpc0
1 (FPC 1)   Prsnt    qfx3500   128    Backup    P7122-C

{master}
qfabric-admin@RSNG0>
The hardware token for a VLAN can be obtained from the CLI using the following commands:
qfabric-admin@RSNG0> show vlans V709---qfabric extensive
VLAN: V709---qfabric, Created at: Thu Nov 14 05:39:28 2013
802.1Q Tag: 709, Internal index: 4, Admin State: Enabled, Origin: Static
Protocol: Port Mode, Mac aging time: 300 seconds
Number of interfaces: Tagged 0 (Active = 0), Untagged 0 (Active = 0)
{master}
qfabric-admin@RSNG0> show fabric vlan-domain-map vlan 4
Vlan   L2Domain   L3-Ifl   L3-Domain
4      12         0        0
{master}
qfabric-admin@RSNG0>
The Layer 2 domain shown in the output of show fabric vlan-domain-map vlan <internal-index> contains the same value as the hardware token of the VLAN, and it is also called the L2Domain-Id for a particular VLAN.
As discussed earlier, this route is sent over to the FC-VM. This is how the route looks on the FC-VM (note that the FC-VM uses a unique table called bgp.bridgevpn.0):
qfabric-admin@FC-0> show route fabric table bgp.bridgevpn.0
--snip--
65534:1:12.ac:4b:c8:f8:68:97/152
        *[BGP/170] 6d 07:42:56, localpref 100
           AS path: I, validation-state: unverified
         > to 128.0.130.6 via dcfabric.0, Push 1719, Push 10, Push 25(top)
         [BGP/170] 6d 07:42:56, localpref 100, from 128.0.128.8
           AS path: I, validation-state: unverified
         > to 128.0.130.6 via dcfabric.0, Push 1719, Push 10, Push 25(top)
So, the next hop for this route is shown as 128.0.130.6. It's clear from the output snippets mentioned earlier that this is the internal IP address for the RSNG. The token value embedded in the route (the 12 in 65534:1:12.ac:4b:c8:f8:68:97) is the hardware token of the VLAN; the output snippets above showed that the token for VLAN.709 is 12.
The labels that are being shown in the output of the route at the FC-VM are specific
to the way the FC-VM communicates with this particular RE (the RSNG). The
origination and explanation of these labels is beyond the scope of this book.
As discussed earlier, a Layer 2 route should be sent across to all the Nodes that have
active ports in that particular VLAN. In this specific example, here are the Nodes that
have active ports in VLAN.709:
root@TEST-QFABRIC# run show vlans 709
Name    Tag    Interfaces
V709    709
               MLRSNG01a:xe-0/0/8.0*, NW-NG-0:ae0.0*, NW-NG-0:ae34.0*,
               NW-NG-0:ae36.0*, NW-NG-0:ae38.0*
[edit]
Since the NW-NG Nodes are active for VLAN 709, the active NW-NG-VM should have the Layer 2 route under discussion (ac:4b:c8:f8:68:97 in VLAN 709) learned from the FC-VM via the internal BGP protocol. Here are the corresponding show snippets from the NW-NG-VM (note that whenever the individual REs learn Layer 2 routes from the FC, they are stored in the table named default.bridge.0):
root@TEST-QFABRIC# run request component login NW-NG-0
Warning: Permanently added 'dcfnode-default---nwine-0,169.254.192.34' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130618_0737_dc-builder built 2013-06-18 08:51:07 UTC
At least one package installed on this device has limited support.
Run 'file show /etc/notices/unsupported.txt' for details.
{master}
qfabric-admin@NW-NG-0> show route fabric table default.bridge.0
--snip--
12.ac:4b:c8:f8:68:97/88
        *[BGP/170] 1d 10:53:47, localpref 100, from 128.0.128.6
           AS path: I, validation-state: unverified
         > to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1719 PFE Id 10 Port Id 25
         [BGP/170] 1d 10:53:47, localpref 100, from 128.0.128.8
           AS path: I, validation-state: unverified
         > to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1719 PFE Id 10 Port Id 25
The leading 12 in the snippet is again the token for VLAN.709; the destination PFE-ID and the Port-ID are data plane entities. This is the information that gets pushed down to the PFE of the member Nodes, and these details are then used to forward data in hardware. In this example, whenever a member Node of the NW-NG gets traffic for this MAC address, it sends this data via the FTE links to the Node with a PFE-ID of 10. The PFE-IDs of all the Nodes within a Node group can be seen by logging in to the corresponding VM and correlating the outputs of show fabric multicast vccpdf-adjacency and show virtual-chassis. In this example, it's the RSNG that locally learns the Layer 2 route of ac:4b:c8:f8:68:97 in VLAN 709. Here are the outputs of commands that show which Node has the PFE-ID of 10:
root@TEST-QFABRIC# run request component login RSNG0
Warning: Permanently added 'dcfNode-default-rsng0,169.254.193.3' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130618_0737_dc-builder built 2013-06-18 08:50:01 UTC
At least one package installed on this device has limited support.
Run 'file show /etc/notices/unsupported.txt' for details.
{master}
qfabric-admin@RSNG0> show fabric multicast vccpdf-adjacency
Flags: S - Stale
Src    Src      Src       Dest                               Src    Dest
Devid  INE      Devtype   Devid   Interface          Flags   Port   Port
9      34       TOR       256     n/a                        -1     -1
9      34       TOR       512     n/a                        -1     -1
10     259(s)   TOR       256     fte-0/1/1.32768            1      3
10     259(s)   TOR       512     fte-0/1/0.32768            0      3
11     34       TOR       256     n/a                        -1     -1
11     34       TOR       512     n/a                        -1     -1
12     259(s)   TOR       256     fte-1/1/1.32768            1      2
12     259(s)   TOR       512     fte-1/1/0.32768            0      2
256    260      F2        9       n/a                        -1     -1
256    260      F2        10      n/a                        -1     -1
256    260      F2        11      n/a                        -1     -1
256    260      F2        12      n/a                        -1     -1
512    261      F2        9       n/a                        -1     -1
512    261      F2        10      n/a                        -1     -1
512    261      F2        11      n/a                        -1     -1
512    261      F2        12      n/a                        -1     -1
{master}
The Src Devid column shows the PFE-IDs of the member Nodes and the Interface column shows the FTE interface that goes to the Interconnects. The output shows that the device that is fpc-0 has the PFE-ID of 10 (fte-0/1/1 means that the port belongs to the member Node which is fpc-0). The output of show virtual-chassis confirms which Node that is:
qfabric-admin@RSNG0> show virtual-chassis

Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.0103.0000
                                               Mstr
Member ID   Status   Model        prio  Role       Serial No
0 (FPC 0)   Prsnt    qfx3500      128   Master*    P6810-C
1 (FPC 1)   Prsnt    qfx3500      128   Backup     P7122-C
{master}
These two snippets show that the device with fpc-0 is the Node with the serial number P6810-C. Also, the MAC address was originally learned on port xe-0/0/8 (refer to the preceding outputs).
The last part of the data plane information on the NW-NG was the port-ID of the Node with PFE-ID = 10. The PFE-ID generation is Juniper confidential information and beyond the scope of this book. However, when QFX3500s are used as Nodes, the port-ID shown in the output of show route fabric table default.bridge.0 is always 17 more than the actual port number on the ingress Node. In this example, the MAC address was learned on xe-0/0/8 on the RSNG Node, so the port-ID shown on the NW-NG should be 8 + 17 = 25. This is exactly the information that was seen in the output of show route fabric table default.bridge.0 earlier.
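When chasing a single MAC address, it can be handier to jump straight to its fabric route instead of paging through the whole table. A minimal sketch, using the MAC address from this example (substitute your own MAC address, and a different fabric table name if needed):

qfabric-admin@NW-NG-0> show route fabric table default.bridge.0 | find ac:4b:c8:f8:68:97

The | find pipe starts the output at the first match, so the next-hop lines that carry the Layer 2 Fabric Label, PFE Id, and Port Id are displayed together with the route entry itself.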
This information is similar to what was seen in the case of a Layer 2 route. Since this particular route is a direct route on the NW-NG, the IP address of 128.0.128.4 and the corresponding data plane information (PFE-ID: 9 and Port-ID: 21) should reside on the NW-NG. Here are the verification commands:
qfabric-admin@NW-NG-0> show fabric summary
Autonomous System: 100
INE Id: 128.0.128.4    <<<< this is correct
INE Type: Network
Simulation Mode: SI
{master}
qfabric-admin@NW-NG-0> show fabric multicast vccpdf-adjacency
Flags: S - Stale
Src    Src   Src       Dest                             Src   Dest
Devid  INE   Devtype   Devid  Interface         Flags   Port  Port
9      34    (s)TOR    256    fte-2/1/1.32768           1     0
9      34    (s)TOR    512    fte-2/1/0.32768           0     0
10     259   TOR       256    n/a                       -1    -1
10     259   TOR       512    n/a                       -1    -1
11     34    (s)TOR    256    fte-1/1/1.32768           1     1
11     34    (s)TOR    512    fte-1/1/0.32768           0     1
12     259   TOR       256    n/a                       -1    -1
12     259   TOR       512    n/a                       -1    -1
256    260   F2        9      n/a                       -1    -1
256    260   F2        10     n/a                       -1    -1
256    260   F2        11     n/a                       -1    -1
256    260   F2        12     n/a                       -1    -1
512    261   F2        9      n/a                       -1    -1
512    261   F2        10     n/a                       -1    -1
512    261   F2        11     n/a                       -1    -1
512    261   F2        12     n/a                       -1    -1
{master}
So the PFE-ID of 9 indeed resides on the NW-NG. According to the output of show
route fabric table bgp.l3vpn.0 taken from the RSNG, the port-ID of the remote
Node is 21. This means that the corresponding port number on the NW-NG should
be xe-2/0/4 (4 + 17 = 21). Note that the original Layer 3 route was a direct route
because of the configuration on ae4 on the NW-NG. Hence one should expect
xe-2/0/4 to be a part of ae4. Here is what the configuration looks like on the NW-NG:
qfabric-admin@NW-NG-0> show configuration interfaces xe-2/0/4
description "NW-NG-0:ae4 to TSTRa xe-4/2/2";
metadata MLNNG02a:xe-0/0/4;
ether-options {
    802.3ad ae4;    <<< this is exactly the expected information
}
{master}
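If you want to see every member interface of ae4 in one shot, rather than checking ports one at a time, the configuration itself can be filtered. A minimal sketch (the bundle name ae4 comes from this example; adjust it for your own system):

qfabric-admin@NW-NG-0> show configuration interfaces | display set | match "802.3ad ae4"

Each line of the display set output names one interface whose ether-options place it in ae4, so the full membership of the bundle is visible at a glance.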
BUM Traffic
A QFabric system can have 4095 VLANs configured on it and can also be comprised
of multiple Node groups. A Node group may or may not have any active ports in a
specific VLAN. To maintain scale within a QFabric system, whenever data has to be
flooded it is sent only to those Nodes which have an active port in the VLAN in
question.
To make sure that flooding takes place according to these rules, the QFabric technology introduces the concept of a Multicast Core Key. A Multicast Core Key is a 7-bit
value and it identifies a group of Nodes for the purposes of replicating BUM traffic.
This value is always generated by the active NW-NG-0 VM and is advertised to all
the Nodes so that correct replication and forwarding of BUM traffic can take place.
As discussed, a Node should receive BUM traffic in a VLAN only if it has an active
port (which is in up/up status) in that given VLAN. To achieve this, whenever a
Node's interface becomes an active member of a VLAN, that Node relays this
information to the NW-NG-0 VM over the CPE network. The NW-NG-0 VM
processes this information from all the Nodes and generates a Multicast Core Key for
that VLAN. This Multicast Core Key has an index of all the Nodes that subscribe to
this VLAN (that is, the Nodes which have an active port in this VLAN). The Core
Key is then advertised to all the Nodes and all the Interconnects by the NW-NG-0
VM over the CPE network. This process is hereafter referred to as a Node subscribing to the VLAN.
Once the Nodes and Interconnects receive this information, they install a broadcast
route in their default.bridge.0 table and the next hop for this route is the Multicast
Core Key number. With this information, the Nodes and Interconnects are able to
send the BUM data only to Nodes that subscribe to this VLAN.
Note that there is a specific table called default.fabric.0 that contains all the information regarding the Multicast Core Keys. This includes the information that the
NW-NG-0 VM receives from the Nodes when they subscribe to a VLAN.
Here is a step-by-step explanation of this process for vlan.29:
1. Vlan.29 is present only on the Nodes that are a part of the Network Node group:
root@TEST-QFABRIC> show vlans vlan.29
Name           Tag     Interfaces
vlan.29        29
                       NW-NG-0:ae0.0*, NW-NG-0:ae34.0*, NW-NG-0:ae36.0*,
                       NW-NG-0:ae38.0*
3. Since vlan.29 has active ports only on the Network Node Group, this VLAN shouldn't exist on any other Node group:
root@TEST-QFABRIC> request component login RSNG0
Warning: Permanently added 'dcfnode-default-rsng0,169.254.193.3' (RSA) to the list of known hosts.
Password:
--- JUNOS 13.1I20130618_0737_dc-builder built 2013-06-18 08:50:01 UTC
At least one package installed on this device has limited support.
Run 'file show /etc/notices/unsupported.txt' for details.
{master}
qfabric-admin@RSNG0> show vlans 29
error: vlan with tag 29 does not exist
{master}
qfabric-admin@RSNG0>
4. At this point in time, the NW-NG-0's default.fabric.0 table contains only local information:
qfabric-admin@NW-NG-0> show fabric summary
Autonomous System: 100
INE Id: 128.0.128.4
INE Type: Network
Simulation Mode: SI
{master}
qfabric-admin@NW-NG-0> ...0 fabric-route-type mcast-routes l2domain-id 5
default.fabric.0: 88 destinations, 92 routes (88 active, 0 holddown, 0 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

5.ff:ff:ff:ff:ff:ff:128.0.128.4:128:000006c3(L2D_PORT)/184
        *[Fabric/40] 11w1d 02:59:50
        > to 128.0.128.4:128 (NE_PORT) via ae0.0, Layer 2 Fabric Label 1731
5.ff:ff:ff:ff:ff:ff:128.0.128.4:162:000006d3(L2D_PORT)/184
        *[Fabric/40] 6w3d 09:03:04
        > to 128.0.128.4:162 (NE_PORT) via ae34.0, Layer 2 Fabric Label 1747
5.ff:ff:ff:ff:ff:ff:128.0.128.4:164:000006d1(L2D_PORT)/184
        *[Fabric/40] 11w1d 02:59:50
        > to 128.0.128.4:164 (NE_PORT) via ae36.0, Layer 2 Fabric Label 1745
5.ff:ff:ff:ff:ff:ff:128.0.128.4:166:000006d5(L2D_PORT)/184
        *[Fabric/40] 11w1d 02:59:50
        > to 128.0.128.4:166 (NE_PORT) via ae38.0, Layer 2 Fabric Label 1749
{master}
The command executed above is show route fabric table default.fabric.0
fabric-route-type mcast-routes l2domain-id 5.
5. The user configures a port on the Node group named RSNG0 in vlan.29. After this, RSNG0 starts displaying the details for vlan.29:
root@TEST-QFABRIC# ...hernet-switching port-mode trunk vlan members 29
[edit]
root@TEST-QFABRIC# commit
commit complete
[edit]
root@TEST-QFABRIC# show | compare rollback 1
[edit interfaces]
+   P7122-C:xe-0/0/9 {
+       unit 0 {
+           family ethernet-switching {
+               port-mode trunk;
+               vlan {
+                   members 29;
+               }
+           }
+       }
+   }
[edit]
qfabric-admin@RSNG0> show vlans 29 extensive
VLAN: vlan.29 --- qfabric, Created at: Thu Feb 27 13:16:39 2014
802.1Q Tag: 29, Internal index: 7, Admin State: Enabled, Origin: Static
Protocol: Port Mode, Mac aging time: 300 seconds
Number of interfaces: Tagged 1 (Active = 1), Untagged 0 (Active = 0)
    xe-1/0/9.0*, tagged, trunk
{master}
> to 128.0.130.6:49174 (NE_PORT) via xe-1/0/9.0, Layer 2 Fabric Label 1729
{master}
This route is then sent over to NW-NG-0 via the Fabric Control VM.
7. NW-NG-0 receives the route from RSNG0 and updates its default.fabric.0 table:
qfabric-admin@NW-NG-0> ...ic-route-type mcast-routes l2domain-id 5
--snip--
5.ff:ff:ff:ff:ff:ff:128.0.130.6:49174:000006c1(L2D_PORT)/184
        *[BGP/170] 00:07:34, localpref 100, from 128.0.128.6    <<<< 128.0.128.6 is RSNG0's IP address
AS path: I, validation-state: unverified
> to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1729 PFE Id 12 Port Id 26
[BGP/170] 00:07:34, localpref 100, from 128.0.128.8
AS path: I, validation-state: unverified
> to 128.0.130.6 via dcfabric.0, Layer 2 Fabric Label 1729 PFE Id 12 Port Id 26
8. NW-NG-0 VM checks its database to find out the list of Nodes that already
subscribe to vlan.29 and generates a PFE-map. This PFE-map contains the indices of
all the Nodes that subscribe to vlan.29:
qfabric-admin@NW-NG-0> show fabric multicast root vlan-group-pfe-map
L2domain   Group                  Flag   PFE map   Mrouter PFE map
2          2.255.255.255.255      6      1A00/3    0/0
5          5.255.255.255.255      6      1A00/3    0/0
--snip--
Check the entry corresponding to the L2Domain-ID for the corresponding VLAN. In
this case, the L2Domain-ID for vlan.29 is 5.
9. The NW-NG-0 VM creates a Multicast Core-Key for the PFE-map (4101 in this
case):
qfabric-admin@NW-NG-0> ...multicast root layer2-group-membership-entries
Group Membership Entries:
--snip--
L2domain: 5
Group: Source: 5.255.255.255.255
Multicast key: 4101
Packet Forwarding map: 1A00/3
--snip--
The command used here was show fabric multicast root layer2-group-membership-entries. This command is only available in Junos 13.1 and higher. In earlier versions of Junos, the show fabric multicast root map-to-core-key command can be used instead.
11. The NW-NG-0 VM sends out a broadcast route for the corresponding VLAN to
all the Nodes and the Interconnects. The next hop for this route is set to the
Multicast Core Key number. This route is placed in the default.bridge.0 table and is
used to forward and flood the data traffic. The Nodes and Interconnects will install
this route only if they have information for the Multicast Core Key in their default.fabric.0 table. In this example, note that the next hop contains the information for
the Multicast Core Key as well:
qfabric-admin@RSNG0> show route fabric table default.bridge.0 l2domain-id 5
--snip--
5.ff:ff:ff:ff:ff:ff/88
        *[BGP/170] 00:38:13, localpref 100, from 128.0.128.6
          AS path: I, validation-state: unverified
        > to 128.0.128.4:57005 (NE_PORT) via dcfabric.0, MultiCast-Core key: 4101 Keylen: 7
        [BGP/170] 00:38:13, localpref 100, from 128.0.128.8
          AS path: I, validation-state: unverified
        > to 128.0.128.4:57005 (NE_PORT) via dcfabric.0, MultiCast-Core key: 4101 Keylen: 7
The eleven steps mentioned here are a deep dive into how the Nodes of a QFabric
system subscribe to a given VLAN. The aim of this technology is to make sure that
all the Nodes and the Interconnects have consistent information regarding which
Nodes subscribe to a specific VLAN. This information is critical to ensuring that
there is no excessive flooding within the data plane of a QFabric system.
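As a quick recap, the VLAN-subscription state can be spot-checked with the same commands already used in the walkthrough above. A minimal checklist, reusing vlan.29 (l2domain-id 5) from this example; substitute your own VLAN name and L2 domain ID:

root@TEST-QFABRIC> show vlans vlan.29
qfabric-admin@NW-NG-0> show fabric multicast root vlan-group-pfe-map
qfabric-admin@NW-NG-0> show fabric multicast root layer2-group-membership-entries
qfabric-admin@RSNG0> show route fabric table default.bridge.0 l2domain-id 5

If the PFE map, the Multicast Core Key, and the broadcast route on the Node groups all agree, the subscription state is consistent; a mismatch at any one of these points is usually where a flooding problem starts.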
At any point in time, there may be multiple Nodes that subscribe to a VLAN, raising
the question of where a QFabric system should replicate BUM traffic. QFabric
systems replicate BUM traffic at the following places:
Ingress Node: Replication takes place only if:
    There are any local ports in the VLAN where the BUM traffic was received. The BUM traffic is replicated and sent out on the server-facing ports.
    There are any remote Nodes that subscribe to the VLAN in question. The BUM traffic is replicated and sent out towards these specific Nodes over the 40GbE FTE ports.
Interconnects: Replication takes place if there are any directly connected Nodes
that subscribe to the given VLAN.
Egress Node: Replication takes place only if there are any local ports that are
active in the given VLAN.
Differences Between Control Plane Traffic and Internal Control Plane Traffic
Most of this chapter has discussed the various control plane characteristics of the
QFabric system and how the routes are propagated from one RE to another. With
this background, note the following functions that a QFabric system has to perform
to operate:
Control plane tasks: form adjacencies with other networking devices, learn
Layer 2 and Layer 3 routes
Data plane tasks: forward data end-to-end
Internal control plane tasks: discover Nodes and Interconnects, maintain
VCCPD, VCCPDf adjacencies, health of VMs, exchange routes within the
QFabric system to enable communication between hosts connected on different Nodes
The third bullet here makes QFabric a special system. All the control plane traffic
that is used for the internal workings of the QFabric system is referred to as internal
control plane traffic. And the last pages of this chapter are dedicated to bringing out
the differences between the control plane and the internal control plane traffic. Let's consider the QFabric system shown in Figure 3.8.
Figure 3.8
In Figure 3.8, the data plane is shown using blue lines and the CPE is shown using
green lines. There are four Nodes, two Interconnects, and four Hosts. Host-1 and
Host-2 are in the RED-VLAN (vlan.100), and Host-3 and Host-4 are in the YELLOW-VLAN (vlan.200). Node-1, Node-2, and Node-3 are SNGs, whereas Node-4 is an
NW-NG Node and has Host-4, as well as a router (R1), directly connected to it.
Finally, BGP is running between the QFabric and R1.
Let's discuss the following traffic profiles: Internal Control Plane and Control Plane.
Internal Control Plane
VCCPDf Hellos between the Nodes and the Interconnects are an example of
internal control plane traffic.
Similarly, the BGP sessions between the Node groups and the FC-VM are also an example of internal control plane traffic.
Note that the internal control plane traffic is also generated by the CPU, but it's used for forming and maintaining the states of protocols that are critical to the inner workings of a QFabric system.
Also, the internal control plane traffic stays entirely within the QFabric system; it is never sent out of any Node.
Control Plane
Start a ping from Host-1 to its default gateway. Note that the default-gateway
for Host-1 resides on the QFabric. In order for the ICMP pings to be successful, Host-1 will need to resolve ARP for the gateway's IP address. Note that Host-1 is connected to Node-1, which is an SNG, and the RE functionality is always locally active on an SNG. Hence the ARP replies will be generated locally by Node-1's CPU (SNG). The ARP replies are sent out to Host-1 using
the data plane on Node-1.
The BGP keepalives and updates exchanged between the QFabric system and R1 will be generated by the active NW-NG-VM, which is running on the DGs. Even though R1 is directly connected to Node-4, the RE functionality on Node-4 is disabled because it is a part of the Network Node Group. These BGP packets are sent out to R1 using the data plane link between Node-4 and R1.
The control plane traffic is always between the QFabric system and an external entity. This means that the control plane traffic eventually crosses the data
plane, too, and goes out of the QFabric system via some Node(s).
To be an effective QFabric administrator, it is extremely important to know which
RE/PFE would be active for a particular abstraction or Node group. Note that all
the control plane traffic for a particular Node group is always originated by the
active RE for that Node group. This control plane traffic is responsible for forming
and maintaining peerings and neighborships with external devices. Here are some
specific examples:
A server connected to an SNG (the active RE is the Node's RE) via a single link: In
this situation, if LLDP is running between the server and the QFabric, then the
RE of the Node is responsible for discovering the server via LLDP. The LLDP
PDUs will be generated by the RE of the Node, which will help the server with
the discovery of the QFabric system.
RSNG: For an RSNG, the REs of the Nodes have an active/passive relationship. For example:
A server with two NICs connected to each Node of an RSNG: This is the
classic use case for an RSNG for directly connecting servers to a QFabric
system. Now if LACP is running between the server and the QFabric system,
then the active RE is responsible for exchanging LACP PDUs with the server to
make sure that the aggregate link stays up.
An access switch is connected to each Node of an RSNG: This is a popular use case in which the RSNG (QFabric) acts as an aggregation point. Here, you can eliminate STP by connecting the access switch to each Node of the RSNG and by running LACP on the aggregate port, which leads to a flat network design. Again, the active RE of the RSNG is responsible for exchanging LACP PDUs with the access switch to make sure that the aggregate link stays up.
Running BGP between NW-NG-0 and an MX (external router): As expected, it's the responsibility of the active NW-NG-0 VM (located on the active DG) to make sure that the necessary BGP communication (keepalives, updates, etc.) takes place with the external router (see the short verification sketch below).
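For each of the examples above, the protocol state can be checked from the RE that owns it. A minimal sketch, assuming you have already logged in to the relevant component with request component login; these are standard Junos operational commands, and the Node group names are simply the ones used in this book:

qfabric-admin@RSNG0> show lldp neighbors
qfabric-admin@RSNG0> show lacp interfaces
qfabric-admin@NW-NG-0> show bgp summary

Running these on the component VMs, rather than guessing at a particular Node, ensures that the output reflects the RE that actually originates the LLDP, LACP, or BGP packets for that Node group.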
Chapter 4
Data Plane Forwarding
This chapter concerns how a QFabric system forwards data. While the previous
chapters in this book have explained how to verify the working of the control plane
of a QFabric system, this chapter focuses only on the data plane.
The following phrases are used in this chapter:
Ingress Node: the Node to which the source of a traffic stream is connected.
Egress Node: the Node to which the destination of a traffic stream is connected.
NOTE
The data plane on the Nodes and the Interconnects resides on the ASIC chip. Accessing and troubleshooting the ASIC is Juniper confidential and beyond the scope of this
book.
This chapter covers the packet paths for the following kinds of traffic within a
QFabric system:
ARP resolution at the QFabric for end-to-end traffic
Layer 2 traffic (known destination MAC address with source and destination
connected on the same Node)
Layer 2 traffic (known destination MAC address with source and destination
connected on different Nodes)
Layer 2 traffic (BUM traffic)
Layer 3 traffic (destination prefix is learned on the local Node)
Layer 3 traffic (destination prefix is learned on a remote Node)
Example with an end-to-end ping between two hosts connected to different
Nodes on the QFabric
Figure 4.1
Think of QFabric as a large switch. When a regular switch needs to resolve ARP, it floods an ARP request in the corresponding VLAN. The QFabric should behave in a similar way. In this particular example, the QFabric should
send out the ARP request for Host-B on all the ports that have VLAN-2 enabled.
There are three such ports: one locally on Node-1, and one each on Node-2 and
Node-3.
Since Node-1 has VLAN-2 active locally, it would also have the information about
VLAN-2s broadcast tree. Whenever a Node needs to send BUM traffic on a VLAN
that is active locally as well, that traffic is always sent out on the broadcast tree.
In this example, Node-1, Node-2, and Node-3 subscribe to the broadcast tree for
VLAN-2. Hence, this request is sent out of the FTE ports towards Node-2 and
Node-3. Once these broadcast frames (ARP requests) reach Node-2, they are flooded
locally on all the ports that are active in VLAN-2.
Here is the sequence of steps:
1. Host-A wants to send some data to Host-B. Host-A is in VLAN-1 and Host-B is in
VLAN-2. The IP and MAC addresses are shown in Figure 4.2.
2. Host-A sends this data to the default gateway (the QFabric's RVI for VLAN-1).
3. The QFabric system needs to generate an ARP request for Host-B's IP address.
4. Since the destination VLAN is also active locally, the ARP request is generated locally on the ingress Node (Node-1).
Figure 4.2
Node-1 sends out this ARP request on the local ports that are active for VLAN-2. In
addition, Node-1 consults its VLAN broadcast tree and finds out that Node-2 and
Node-3 also subscribe to VLAN-2s broadcast tree. Node-1 sends out the ARP
request over the FTE links. An extra header called the fabric header is added on all
the traffic going on the FTE links to make sure that only Node-2 and Node-3 receive
this ARP request.
The IC receives this ARP request from Node-1. The IC looks at the header appended
by Node-1 and finds out that this traffic should be sent only to Node-2 and Node-3.
The IC has no knowledge of the kind of traffic that is encapsulated within the fabric
header.
In Figure 4.3, Node-2 and Node-3 receive one ARP request each from their FTE
links. These Nodes flood the request on all ports that are active in VLAN-2.
Host-B replies to the ARP request.
Figure 4.3
Node-2 and Node-3 Receive Request from the Data Plane (Interconnect)
At this point in time, the QFabric system learns the ARP entry (ARP route) for Host-B (see Figure 4.4). Using the Fabric Control VM, this route is advertised to all
the Nodes that have active ports in this VRF. Note that the ARP route will be
advertised to relevant Nodes based on the same criteria as regular Layer 3 routes,
that is, based on the VRF.
Figure 4.4
1. Node-1 now knows how to resolve the ARP for Host-B's IP address. This is the only information that Node-1 needs to be able to send traffic to Host-B.
2. Host-A's data is successfully sent over to Host-B via the QFabric system.
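If you want to confirm that the ARP entry really exists once the exchange completes, the standard Junos ARP table can be checked on the Node group that hosts the RVI. A minimal sketch; the SNG0 prompt is a placeholder for whichever Node group owns the gateway in your system, and the IP address is hypothetical, standing in for Host-B's address:

qfabric-admin@SNG0> show arp no-resolve | match 10.2.2.20

The no-resolve option skips reverse DNS lookups, and the match pipe narrows the output to the single host being verified.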
The Ingress Node Does Not Have the Destination VLAN Configured On It
Refer to Figures 4.5 through 4.7. In this case, Host-A starts sending data to Host-E. Node-1
is the ingress Node and it does not have any active port in the destination VLAN
(VLAN-3).
This is a special case in which you need additional steps to make end-to-end traffic work properly. That's because the ingress Node doesn't have the destination VLAN and hence doesn't subscribe to that VLAN's broadcast tree. Since Node-1 doesn't subscribe to the destination VLAN's broadcast tree, it has no way to know which Nodes
should receive BUM traffic in that VLAN.
Note that the Network Node Group is the abstraction that holds most of the routing
functionality of the QFabric. Hence, you'll need to make use of the Network Node
Group to resolve ARP in such a scenario.
Here is the sequence of steps that will take place:
1. Host-A wants to send some data to Host-E. Host-A is in VLAN-1 and Host-E is in
VLAN-3. The IP and MAC addresses are shown in Figure 4.2.
2. Host-A sends this data to the default gateway (the QFabric's RVI for VLAN-1).
3. The QFabric system needs to generate an ARP request for Host-E's IP address.
4. Since the destination-prefix (VLAN) is not active locally, Node-1 has no way of
knowing where to send the ARP request in VLAN-3. Because of this, Node-1 cannot
generate the ARP request locally.
5. Node-1 is aware that Node-3 belongs to the Network Node Group. Since the
NW-NG hosts the routing-functionality of a QFabric, Node-1 must send this data to
the NW-NG for further processing.
Figure 4.5
Node-1 Encapsulates the Data with Fabric Header and Sends It to the Interconnects
6. Node-1 encapsulates the data received from Host-A with a fabric-header and
sends it over to the NW-NG Nodes (Node-3 in this example).
Figure 4.6
7. Node-3 receives the data from Node-1 and immediately knows that ARP must be
resolved. Since resolving ARP is a Control plane function, this packet is sent over to
the active NW-NG-VM. Since the VM resides on the active DG, this packet is now
sent over the CPE links so that it reaches the active NW-NG-VM.
Figure 4.7
NW-NG-0 VM Generates the ARP Request and Sends It Towards the Nodes
8. The active NW-NG-VM does a lookup and knows that an ARP request needs to be generated for Host-E's IP address. The ARP request is generated locally. Note that this part of the process takes place on the NW-NG-VM that is located on the master DG. However, the ARP request must be sent out by the Nodes so that it can reach the correct host. For this to happen, the active NW-NG-VM sends out one copy of the ARP request to each Node that is active for the destination VLAN. The ARP requests from the active NW-NG-VM are sent out on the CPE network.
9. In this specific example (the QFabric system depicted in Figure 4.2), there is only
one Node that has VLAN-3 configured on it: Node-4. As a result, the NW-NG VM
sends the ARP request only to Node-4. This Node receives the ARP request on its
CPE links and floods it locally in VLAN-3. This is how the ARP request reaches the
correct destination host.
10. Host-E replies to the ARP request.
11. At this point in time, the QFabric system learns the ARP entry (ARP route) for Host-E. Using the Fabric Control VM, this route is advertised to all the Nodes that
have active ports in this VRF. This is the same process that was discussed in section
4.2.
12. Since the QFabric knows how to resolve the ARP for Host-E's IP address, Host-A's data is successfully sent to Host-E via the QFabric system.
Layer 2 Traffic (Known Destination MAC Address with Source and Destination
Connected on the Same Node)
This is the simplest traffic forwarding case wherein the traffic is purely Layer 2 and
both the source and destination are connected to the same Node.
In this scenario, the Node acts as a regular standalone switch as far as data plane
forwarding is concerned. Note that QFabric will need to learn MAC addresses in
order to forward the Layer 2 traffic as unicast. Once the active RE for the ingress
Node group learns the MAC address, it will interact with the Fabric Control VM and
send that MAC address to all the other Nodes that are active in that VLAN.
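Locally learned MAC addresses can be checked on the Node group in the usual switching table before looking at the fabric routes. A minimal sketch; the MAC address is hypothetical, and the exact table options vary slightly between Junos releases:

qfabric-admin@RSNG0> show ethernet-switching table | match 00:10:94:00:00:02

A locally learned entry should list the local xe- interface on which the host is attached; from there, the distribution of that address to other Node groups follows the Fabric Control VM process described above.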
Layer 2 Traffic (Known Destination MAC Address with Source and Destination
Connected on Different Nodes)
In this scenario, refer again to Figure 4.2, where Host-C wants to send some data to
Host-B. Note that they are both in VLAN-2 and hence the communication between
them would be purely Layer 2 from QFabric's perspective. Node-1 is the ingress Node and Node-2 is the egress Node. Since the MAC address of Host-B is already known to the QFabric system, the traffic from Host-C to Host-B will be forwarded as unicast by
the QFabric system.
Here is the sequence of steps that will take place:
1. Node-1 receives data from Host-C and looks up the Ethernet-header. The
destination-MAC address is that of Host-B. This MAC address is already learned by
the QFabric system.
2. At Node-1, this MAC address would be present in the default.bridge.0 table.
3. The next-hop for this MAC address would point to Node-2.
4. Node-1 adds the fabric-header on this data and sends the traffic out on its FTE
link. The fabric header contains the PFE-id of Node-2.
5. The IC receives this information and does a lookup on the fabric-header. This
reveals that the data should be sent towards Node-2. The IC then sends the data
towards Node-2.
6. Node-2 receives this traffic on its FTE link. The fabric-header is removed and a
lookup is done on the Ethernet-header.
7. The destination-MAC is learned locally and points to the interface connected to
Host-B.
8. Traffic is sent out towards Host-B.
Figure 4.8
In Figure 4.8, the prefixes for Host-A and Host-B are learned by the QFabric's
Network Node Group-VM. Both these prefixes are learned from the routers that are
physically located behind Node-1. Assuming that Host-A starts sending some traffic
to Host-B, here is the sequence of steps that would take place:
1. Data reaches Node-1. Since this is a case for routing, the destination MAC
address would be that of the QFabric. The destination IP address would be that of
Host-B.
2. Node-1 does a local lookup and finds that the prefix is learned locally.
3. Node-1 decrements the TTL and sends the data towards R2 after making
appropriate changes to the Ethernet header.
4. Note that since the prefix was learned locally, the data is never sent out on the
FTE links.
As Figure 4.8 illustrates, the QFabric system acts as a regular networking router in
this case. The functionality here is to make sure that the QFabric system obeys all
the basic laws of networking, such as resolving ARP for the next-hop router's IP
address, etc. Also, the ARP resolution shown is for a locally connected route and that
was discussed as a separate case study earlier in this chapter.
Just like a regular router, the QFabric system makes sure that the TTL is also decremented for all IP-routed traffic before it leaves the egress Node.
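A locally learned prefix can be confirmed with an ordinary route lookup on the Node group VM that owns it. A minimal sketch; the prefix 192.168.20.0/24 is hypothetical and stands in for Host-B's subnet:

qfabric-admin@NW-NG-0> show route 192.168.20.0/24

For a locally learned route, the next hop should point at a local interface rather than at a fabric next hop, which matches step 4 above: the data never needs to leave on the FTE links in this case.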
Figure 4.9
Figure 4.9 shows a QFabric system that is typical of one that might be deployed in a
Data Center. This system has one Redundant-Server-Node group called RSNG-1.
Node-1 and Node-2 are part of RSNG-1 and Node-1 is the master Node. Node-3
and Node-4 are member Nodes of the Network Node Group abstraction. Let's
assume DG0 is the master, and hence, the active Network Node Group-VM resides
on DG0. The CPE and the Director devices are not shown in Figure 4.9.
Server-1 is dual-homed and is connected to both the Nodes of RSNG-1. The links
coming from Server-1 are bundled together as a LAG on the QFabric system.
Node-3 and Node-4 are connected to a router R1. R1's links are also bundled up as a LAG. There is OSPF running between the QFabric system and router R1, and R1 is advertising the subnet of Host-2 towards the QFabric system. Note that the routing functionality of the QFabric system resides on the active Network Node Group-VM. Hence the OSPF adjacency is really formed between the VM and R1.
Finally, let's assume that this is a new QFabric system (no MAC addresses or IP
routes have been learned). So, the sequence of steps would be as follows.
At RSNG-1
At RSNG-1 the sequence of steps for learning Server-1's MAC address would be:
1. Node-1 is the master Node within RSNG-1. This means that the active RE resides
on Node-1.
2. Server-1 sends some data towards the QFabric system. Since Server-1 is
connected to both Node-1 and Node-2 using a LAG, this data can be received on
either of the Nodes.
2a. If the data is received on Node-1, then the MAC address of Server-1 is
learned locally in VLAN-1.
2b. If the data is received on Node-2, then it must be first sent over to the
active-RE so that MAC-address learning can take place. Note that this first
frame is sent to Node-1 over the CPE links. Once the MAC address of Server-1
is learned locally, this data is no longer sent to Node-1.
3. Once the MAC address is learned locally on Node-1, it must also send this Layer 2
route to all other Nodes that are active in VLAN-1. This is done using the Fabric
Control-VM as discussed in Chapter 3.
Network Node Group
The sequence for learning Host-2's prefix for the Network Node Group would be:
1. R1 is connected via a LAG to both Node-3 and Node-4. R1 is running OSPF and
sends out an OSPF Hello towards the QFabric system.
2. The first step is to learn the MAC address of R1 in VLAN-2.
3. In this case, the traffic is incoming on a Node that is a part of the Network Node
Group. This means that the REs on the Nodes are disabled and all the learning needs
to take place at the Network Node Group VM.
4. This initial data is sent over the CPE links towards the master-DG (DG0). Once
the DG receives the data, it is sent to the Network Node Group-VM.
5. The Network Node Group-VM learns the MAC address of R1 in VLAN-2 and
distributes this route to all the other Nodes that are active in VLAN-2. This is again
done using the Fabric Control-VM.
6. Note that the OSPF Hello was already sent to the active-Network Node Group
VM. Since OSPF is enabled on the QFabric system as well, this Hello is processed.
7. Following the rules of OSPF, the necessary OSPF-packets (Hellos, DBD, etc.) are
exchanged between the active-Network Node Group-VM and R1 and the adjacency
is established and routes are exchanged between the QFabric and R1.
8. Note that whenever Node-3 or Node-4 receive any OSPF packets, they send the
packets out of their CPE links towards the active DG so that this data can reach the
Network Node Group-VM. This Control plane data is never sent out on the FTE
links.
9. Once the Network Node Group-VM learns these OSPF routes from R1, it again
leverages the internal-BGP peering with the Fabric Control-VM to distribute these
routes to all the Nodes that are a part of this routing instance.
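The state built up in these steps can be checked directly on the active Network Node Group VM. A minimal sketch; the neighbor and the advertised prefix belong to this example's R1 and Host-2, so adjust names and prefixes for your own system:

qfabric-admin@NW-NG-0> show ospf neighbor
qfabric-admin@NW-NG-0> show route protocol ospf

A Full OSPF adjacency with R1 and an OSPF route for Host-2's subnet confirm that steps 6 through 9 completed; the subsequent distribution of those routes to the other Node groups rides on the internal BGP sessions with the Fabric Control VM.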
Ping on Server-1
After QFabric has learned the Layer 2 route for Server-1 and the Layer 3 route for
Host-2, let's assume that a user initiates a ping on Server-1. The destination of the
ping is entered as the IP address of Host-2. Here is the sequence of steps that would
take place in this situation:
1. A ping is initiated on Server-1. The destination for this ping is not in the same
subnet as Server-1. As a result, Server-1 sends out this traffic to its default gateway
(which is the RVI for VLAN-1 on the QFabric system).
2. This data reaches the QFabric system. Let's say this data comes in on Node-2.
3. At this point in time, the destination MAC address would be the QFabric's MAC address. Node-2 does a lookup and finds out that the destination IP address is that of Host-2. A routing lookup on this prefix reveals the next hop of R1.
4. In order to send this data to R1, the QFabric system also needs to resolve the ARP for R1's IP address. The connected interface that points to R1 is the RVI for VLAN-2. Also, VLAN-2 doesn't exist locally on Node-2.
Technically, the QFabric would have resolved the ARP for R1 while forming
the OSPF adjacency. That fact was omitted here to illustrate the complete
sequence of steps for end-to-end data transfer within a QFabric system.
5. This is the classic use case in which the QFabric must resolve an ARP for a VLAN
that doesn't exist on the ingress Node.
6. As a result, Node-2 encapsulates this data with a fabric header and sends it out of
its FTE links towards the Network Node Group Nodes (Node-3 in this example.)
7. The fabric header would have Node-3's PFE-id. The Interconnects would do a
lookup on the fabric header and send this data over to Node-3 so that it could be
sent further along to the Network Node Group VM for ARP resolution.
8. Node-3 sends this data to the master DG over the CPE links. The master DG in
turn sends it to the active Network Node Group VM.
9. Once the active Network Node Group VM receives this, it knows that ARP must
be resolved for R1s IP address in VLAN-2. The Network Node Group VM
generates an ARP request packet and sends it to all the Nodes that are active in
VLAN-2. (Note that this communication takes place over the CPE network.)
10. Each Node that is active in VLAN-2 receives this ARP-request packet on its CPE
links. This ARP request is then replicated by the Nodes and flooded on all the
revenue ports that are active in VLAN-2.
11. This is true for Node-3 and Node-4 as well. Since the link to R1 is a LAG, only
one of these Nodes sends out the ARP request towards R1.
12. R1 sends out an ARP reply and it is received on either Node-3 or Node-4.
13. Since ARP learning is a control plane function, this ARP reply is sent towards
the master DG so that it can reach the active Network Node Group VM.
14. The VM learns the ARP for R1 and then sends out this information to all the
Nodes that are active in the corresponding routing instance.
15. At this point in time, Node-1 and Node-2 know how to reach R1.
16. Going back to Step #4, Node-2 now knows how to reach R1, and the local tables on Node-2 point to Node-3 as the next hop to reach R1.
17. Since the data to be sent between Server-1 and Host-2 has to be routed at Layer 3, Node-2 decrements the IP TTL and adds the fabric-header to the traffic. The
fabric-header contains the PFE-id of Node-3.
18. After adding the fabric-header, this data is sent out on the FTE links towards
one of the Interconnects.
19. The Interconnects do a lookup on the fabric-header and determine that all this
traffic should be sent to Node-3.
20. Node-3 receives this traffic on its FTE links and sends this data out towards R1
after modifying the Ethernet-header.
This is a rather high-level sequence of events for the OSPF peering between the QFabric and R1 and the end-to-end traffic flow, and it does not take into account all the things that need to be done before peering, such as learning R1's MAC address, learning R1's ARP, etc.
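From the ingress Node group's point of view, the forwarding decision made in steps 16 through 18 above can be inspected in the fabric Layer 3 table used earlier in this book. A minimal sketch from RSNG-1's VM (the RSNG0 prompt is from this book's lab system and stands in for RSNG-1); the prefix is hypothetical and stands in for Host-2's subnet, and the table name can differ if the route sits in a non-default routing instance:

qfabric-admin@RSNG0> show route fabric table bgp.l3vpn.0 | find 192.168.40.0

The next-hop line should carry the PFE Id of the Network Node Group member (Node-3 here) and a Port Id that, on QFX3500 Nodes, is 17 higher than the physical port towards R1, mirroring the Layer 3 example in Chapter 3.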
the VLAN and also the 40GbE FTE links (one or more) in case some other
Nodes also have ports active in this VLAN.
Q: What extra information is added to the data that is sent out on the 40GbE FTE
links?
Every Node device that is a part of a QFabric system adds a fabric header to
data before sending it out of the FTE links. The fabric header contains the
PFE-ID of the remote Node device where the data should be sent.
Q: How can the PFE-ID of a Node be obtained?
Using the CLI command show fabric multicast vccpdf-adjacency. Then correlate this output with the output of the show virtual-chassis CLI command.
Consider the following snippets taken from an RSNG:
qfabric-admin@RSNG0> show fabric multicast vccpdf-adjacency
Flags: S - Stale
Src    Src   Src       Dest                             Src   Dest
Devid  INE   Devtype   Devid  Interface         Flags   Port  Port
9      34    TOR       256    n/a                       -1    -1
9      34    TOR       512    n/a                       -1    -1
10     259   (s)TOR    256    fte-0/1/1.32768           1     3
10     259   (s)TOR    512    fte-0/1/0.32768           0     3
11     34    TOR       256    n/a                       -1    -1
11     34    TOR       512    n/a                       -1    -1
12     259   (s)TOR    256    fte-1/1/1.32768           1     2
12     259   (s)TOR    512    fte-1/1/0.32768           0     2
The Src Dev Id column shows the PFE-IDs for all the Nodes, while the Interface
column shows the IDs of all the interfaces that are connected to the Interconnects, but
only for those Node devices that are a part of the Node group (RSNG0 in this case).
(Note that the traditional Junos interface format is used: namely, FPC/PIC/PORT.)
You can see from the bolded output that the Node device with the PFE-ID of 10 corresponds to FPC-0 and the Node device with the PFE-ID of 12 corresponds to FPC-1.
The next step is to correlate this output with the show
virtual-chassis command:
qfabric-admin@RSNG0> show virtual-chassis

Preprovisioned Virtual Chassis
Virtual Chassis ID: 0000.0103.0000
                                               Mstr
Member ID   Status   Model        prio  Role       Serial No
0 (FPC 0)   Prsnt    qfx3500      128   Master*    P6810-C
1 (FPC 1)   Prsnt    qfx3500      128   Backup     P7122-C
{master}
You can see that the Node device that corresponds to FPC-0 has the serial number P6810-C and the one that corresponds to FPC-1 has the serial number P7122-C. The aliases of
these Nodes can then be checked by either looking at the configuration or by issuing
the show fabric administration inventory command.
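Putting the whole chain together, a minimal set of commands for mapping a PFE-ID back to a Node alias might look like the following; all three commands appear earlier in this book, and only the component name needs to be adjusted for your own Node group:

qfabric-admin@RSNG0> show fabric multicast vccpdf-adjacency
qfabric-admin@RSNG0> show virtual-chassis
root@TEST-QFABRIC> show fabric administration inventory

The first command gives the PFE-ID to FPC mapping, the second maps the FPC slot to a serial number, and the inventory command (run from the QFabric CLI on the Director group) ties that serial number to the alias used in the configuration.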