Troubleshoot Networks Notes

Physical layer troubleshooting

POTS failure
Many systems still run on analog phone lines: standard phones, fire alarm dialers, elevator emergency phones, ATMs, and the trusty fax machine. These lines can be delivered either by the phone company or from a VoIP system through an analog converter. Once the signal is analog on the copper, the troubleshooting process is the same. It starts with a tool called a lineman's handset, or butt set.

A butt set is a hardened analog phone with test leads attached. It is built to absorb abuse while clipped to a tool belt, and the leads are generally wrapped in a woven material for strength. The leads terminate in clips that attach to a common POTS terminal block. Many butt sets also have a bed of nails, a pad of spikes that can bite directly through a cable's jacket. Most modern butt sets can also plug directly into an RJ11-connectorized cable for testing.

I generally begin troubleshooting by testing the line right where the device plugs into it. Simply connect the phone cable directly to the butt set and test the line. If the butt set gets dial tone and can dial out, I know the issue is with the device. If there's no dial tone here, I'll go straight to the DMARC (demarcation point), the closest point to my connection on the provider's network.
The majority of telco DMARCs consist of a 66 block, which is why most butt sets have clips that connect to 66-block pins. I find that issues generally originate either with the end device or within the provider's network, which is why I test these first. Once I locate the proper pair at the DMARC, I clip my butt set's test leads onto its terminals. If I still don't get dial tone, I've effectively pinpointed the issue as within the provider's network, and I'll contact them to repair it.
Before I call the provider, I'll usually pull their cable off the DMARC and re-terminate it, just to be sure a connection hasn't come loose or corroded. If I do get dial tone at the DMARC, I know the issue is in the cable connecting the DMARC to the end device.
The easiest way to troubleshoot from here is to work from the DMARC toward the end device, testing at each connection until the fault is located. A standard $5 analog phone can stand in for a butt set and works fine for occasional use. When testing at the DMARC with one, I strip about half an inch of the phone cable and wrap the bare wires around the terminals on the telco's DMARC. Other than that, all testing steps are the same.

Copper Ethernet failure


When a corporate network connection is down, I like to eliminate the end device as the cause first. This effectively cuts the troubleshooting process in half, which can save a tremendous amount of time. I generally begin by unplugging the cable from the end device and connecting it directly to my laptop. If the connection comes up, the issue is with the device. If it still fails, the fault lies somewhere between the cable and the switch.
Assuming the problem lies between the cable and the switch, I'll go directly to the switch to rule out the cabling.
Using a known-good cable, I plug my laptop directly into the switch port. If the port comes up, I know the issue is with the cabling.
If the port does not come up, the issue is within the switch: the port has failed, is administratively shut down, or is err-disabled.
If the switch has other available ports, I'll verify one of them and then move the connection to it. If I've confirmed the issue is with the cabling, I'll test each segment between connection points.
Often, cables run from wall ports to patch panels and then into the switch. At this point, if I have a cable tester, I'll use it. If I don't have one available, I can plug my laptop into each segment until I find the failure.
A standard cable tester has a Test Unit and a Remote Module. The Remote Module connects to the far end of the cable, while the Test Unit connects locally.

The cable tester shows me which pair, or even which wire, has failed. Nicer test units, usually around $80 to $100, often include a Time Domain Reflectometer (TDR) feature. A TDR reports the distance to a fault in a cable by measuring reflections of a signal sent down the conductor. I can use it to determine whether a connector is faulty or the break is further down the cable.
Some modern routers and switches can test the cables connected to them, which aids remote troubleshooting.
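The TDR distance calculation is simple arithmetic: the pulse travels to the fault and back, so the one-way distance is half the round trip multiplied by the signal's speed in the cable. Below is a minimal Python sketch, assuming the tester gives a round-trip time and assuming a nominal velocity of propagation (NVP) of 0.66; real twisted-pair cable varies, so the datasheet value should be used. The function name is illustrative, not any tester's API.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def tdr_distance_m(round_trip_s: float, velocity_factor: float = 0.66) -> float:
    """Distance to a reflection, given the pulse's round-trip time in seconds.

    velocity_factor (NVP) of 0.66 is an assumed default; check the cable's datasheet.
    """
    # Signal speed in the cable is NVP times c; halve because the pulse goes out and back.
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2


if __name__ == "__main__":
    # Example: a reflection arriving 500 ns after the pulse was sent
    print(f"{tdr_distance_m(500e-9):.1f} m")
```

With these assumptions, a 500 ns round trip works out to roughly 49.5 m, which is enough to tell a bad termination a few meters away from a break deep inside the wall.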

If it's ultimately determined that the end device is at fault, and it happens to be a PC, a few standard steps can be followed. First, check the adapter settings to see whether the interface has been disabled. If it hasn't been disabled and still doesn't come up, the PC's interface is most likely faulty and should be replaced. A USB NIC makes a good temporary replacement, and it can also be used for testing if a laptop isn't available.
A loopback connector can also be built to test a NIC or switch port. This is an RJ45 jack or plug whose transmit and receive pairs are looped back to each other. For a gigabit loopback, connect pin 1 to pin 3, pin 2 to pin 6, pin 4 to pin 7, and pin 5 to pin 8. When this is plugged into a NIC or switch port, the interface should come up.
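The loopback wiring is just a symmetric pin map: pair 1-2 loops onto pair 3-6, and pair 4-5 loops onto pair 7-8. This small Python sketch encodes the four pin pairs listed above and sanity-checks that every RJ45 pin is used exactly once; the helper name is hypothetical, and it's meant as a wiring checklist rather than a tool.

```python
# Pin pairs for the gigabit loopback plug described above.
LOOPBACK_PAIRS = [(1, 3), (2, 6), (4, 7), (5, 8)]


def build_pin_map(pairs):
    """Return a symmetric pin -> pin map, checking every RJ45 pin is used exactly once."""
    pin_map = {}
    for a, b in pairs:
        if a in pin_map or b in pin_map:
            raise ValueError(f"pin used twice in pair {a}-{b}")
        pin_map[a] = b  # a wire connects both ways, so record each direction
        pin_map[b] = a
    if sorted(pin_map) != list(range(1, 9)):
        raise ValueError("a gigabit loopback plug must wire all 8 pins")
    return pin_map


if __name__ == "__main__":
    pins = build_pin_map(LOOPBACK_PAIRS)
    for pin in range(1, 9):
        print(f"pin {pin} -> pin {pins[pin]}")
```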

Fiber interface failure

When a new connection refuses to come up, the first step is to make sure the wrong media type wasn't used. I verify that the optics on both sides are either multimode or single-mode, and that all of the cables used are of the same type.
Next, I try swapping the transmit and receive connectors on one of my interfaces. About 50% of the time, when I run a cross-connect between two devices, I accidentally connect transmit to transmit and receive to receive. Transmit on one end must land on receive at the other, so the strands have to be crossed between the two devices.
I've yet to see a laptop with a fiber interface, but there are several workarounds for testing. One is to loop the fiber: I plug a cable into the transmit side of an optic and connect its other end directly to the receive side of the same optic.
I generally do loopback testing with a jumper. Jumpers are simply fiber patch cables that come in a variety of styles and lengths.
I usually keep a couple of one-meter jumpers with various connectors for quick replacements or loopback testing. The connectors are usually labeled, so I can distinguish one strand from the other. Using a single strand of my jumper, I loop the end device's optic. If the interface doesn't come up, I know the issue is with the end device.
On my end device, I verify that the interface isn't disabled in software.
If it isn't, I try replacing the optic on the device with a spare. Methods differ between operating systems and hardware platforms, but the approach is the same: make sure the interface isn't disabled, then test by swapping.
If I've previously determined that the end device is sound, I go to the switch or router port on my side and perform the loop test there. If the loop test fails, I check that the interface isn't disabled in software. If the interface is in fact enabled, I change out the optic and, if possible, move to a new switchport. If the switchport loops successfully, I move one connection point further away and loop the fiber again, continuing away from the switch until I locate the fault. Most infrastructure consists of trunk cables that span long distances, plus short jumpers that connect them to switches or end devices.

A trunk is simply a single cable with many strands of fiber inside. Replacing a failed jumper only takes a couple of minutes, but replacing a trunk is usually impractical. If the fibers in use on a trunk have failed, I test the next available strands, and if they test clean, I move the connection to the new trunk pair.
Some specialized tools can also be used for testing. A light meter determines how much light is lost through a fiber optic cable. The kit consists of a light source that sends a consistent laser signal into the fiber and a meter that measures the level of light coming out the far end.
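Optical power is usually reported in dBm (decibels relative to 1 mW), so the loss through the path is just the difference between launched and received power. A minimal Python sketch follows; the example readings and the 6 dB budget are made-up values for illustration, and a real loss budget should come from the optics' and cable plant's specifications.

```python
import math


def mw_to_dbm(power_mw: float) -> float:
    """Convert optical power in milliwatts to dBm (0 dBm = 1 mW)."""
    return 10 * math.log10(power_mw)


def link_loss_db(source_dbm: float, received_dbm: float) -> float:
    """Loss through the fiber path is the drop in power, in dB."""
    return source_dbm - received_dbm


def within_budget(loss_db: float, budget_db: float) -> bool:
    """True if the measured loss fits the link's loss budget (an assumed input)."""
    return loss_db <= budget_db


if __name__ == "__main__":
    # Hypothetical readings: source launches -3.0 dBm, meter reads -7.5 dBm at the far end
    loss = link_loss_db(-3.0, -7.5)
    print(f"loss = {loss:.1f} dB, within a 6 dB budget: {within_budget(loss, 6.0)}")
```

Working in dB keeps the arithmetic additive: each connector, splice, and length of fiber subtracts from the same budget.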

An engineer can also find the distance to a break in an optical cable with an optical time-domain reflectometer, or OTDR. Much like a copper TDR, an OTDR detects reflections of a signal from a break in the cable, telling me how far the fault is from my current location. Troubleshooting optical fiber can get hairy, but even with no specialized tools, you should be able to test with confidence.

Switching loop
Switching loops are the bane of every Layer Two network, and will absolutely ruin
your day. If protection mechanisms don't kick in, you can expect heavy packet loss if
you're lucky. If you aren't lucky, you can expect every connected switch to become
unresponsive. 
Most often, a loop begins with a spare cable connected to a wall port. For some reason, when users see a dangling cable, they feel the need to plug it in somewhere, and that can be another open wall port.
Loops can also come from VoIP phones. Some phones have an additional Ethernet port that lets users piggyback their computers off the phone. If this piggyback port is plugged back into the switch, it can cause a loop.
When connecting switches together, if multiple ports are connected and STP or
bonding isn't properly configured, then a loop can occur.
Let me start by saying: don't use unmanaged switches in your network. They provide zero visibility for monitoring and have no protection mechanisms to speak of. If I've inherited an unmanaged network, then when a loop occurs, I have little recourse other than to begin unplugging switches.

Imagine my router plugs into Switch A, which plugs into Switch B, which in turn plugs into Switch C. During a loop, first ask everyone whether they made any changes to the network. If not, I power cycle all of the switches but leave Switch C disconnected. Did this fix the loop? If so, the loop is on Switch C, so I unplug its uplink port and power it back up. I should see some of the ports flashing extremely rapidly and in sync; those are likely my loop. If it wasn't Switch C, I reboot A and C while leaving B off, and repeat the process until the loop source is discovered. The moral of the story: don't use unmanaged switches.
Theoretically, if everything is properly configured in a managed switch environment, a loop shouldn't occur. Most switches share the same concepts, even if they go about configuration in different ways. Spanning Tree Protocol should obviously be properly configured, along with additional features like Unidirectional Link Detection (UDLD) for optical interfaces, BPDU guard on access interfaces, root guard on ports facing edge switches, VLAN filtering on trunk ports, and so on.

There are a lot of moving parts in every network, and sometimes things fall through the gaps. If I'm troubleshooting, that means I have issues, and at this point I'm just trying to find the source and mitigate it as quickly as possible. Depending on the make and model of switch and the volume of loop traffic, the switch will react differently. Some lower-end switches lock up completely or become unresponsive. Higher-end switches can often absorb a loop but may provide degraded service. On these chassis, if service isn't significantly degraded, I may get a notification that a MAC address is showing up on multiple interfaces or VLANs. If the switched infrastructure becomes unresponsive and I can no longer administer it remotely, mitigation is the same as described above for unmanaged switches: administrative access has to be restored by any means, and rebooting or disconnecting may be the quickest.

If remote access is maintained, investigation can begin. If the switch reports that it sees a MAC address on multiple interfaces, I can track that MAC through the forwarding table. The command differs by switch model, but on Cisco it is show mac address-table, which lists all known MAC addresses and the interfaces they are reachable on. Cisco lets you filter the output of any show command, which makes it easier to pick out the information I want. Here I use the include filter with the last few characters of the MAC address: show mac address-table | include 001.
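When the switch doesn't flag the MAC for me, comparing a few snapshots of show mac address-table can reveal the same address moving between ports. The Python sketch below assumes the common IOS column layout (Vlan, Mac Address, Type, Ports); other platforms and older IOS releases format this table differently, so the regular expression would need adjusting. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Data rows in the usual IOS layout: "  10    0011.2233.0001    DYNAMIC     Gi1/0/5"
ROW = re.compile(
    r"^[ \t]*(\d+)[ \t]+([0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})[ \t]+\S+[ \t]+(\S+)[ \t\r]*$",
    re.IGNORECASE | re.MULTILINE,
)


def flapping_macs(snapshots):
    """Return {mac: [(vlan, port), ...]} for MACs seen in more than one place.

    `snapshots` is a list of raw 'show mac address-table' outputs taken a few
    seconds apart; a single snapshot lists each MAC only once per VLAN.
    """
    seen = defaultdict(set)
    for output in snapshots:
        for vlan, mac, port in ROW.findall(output):
            seen[mac.lower()].add((int(vlan), port))
    # Keep only MACs that appeared on more than one (VLAN, port) combination
    return {mac: sorted(places) for mac, places in seen.items() if len(places) > 1}
```

Feed it the text of two or three captures taken a few seconds apart; any MAC returned with more than one port is a candidate for the tracing described next.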

Once I've determined which ports are participating in the loop, I shut them down if they're access ports. If one of the ports is an uplink to another switch, I connect to that device and run the same command there, rinsing and repeating until the issue has been mitigated. Loop mitigation is a messy business, and it's better avoided than experienced.

Duplicate addressing
Duplicate IPs on your network are a vexing problem. They can be troubleshot fairly easily, and in some instances avoided altogether. Symptoms vary: often a user reports that they can't access a resource, or that their access to something is very intermittent. The normal first step is to ping the client's IP, which will generally work consistently.

When the client tries to ping out, however, it sees intermittent responses. This asymmetric behavior can be quite perplexing, since in networking one generally sees the same behavior to and from a host. The cause is that two devices share the same IP. When I ping from a different subnet, the router that directly terminates that subnet ARPs for the MAC address of the IP's owner, and it doesn't matter which device answers: the router sends to that device, the device replies, and my pings always get through. From the end device's perspective, though, when it pings out, the router may deliver the returning ICMP replies to it or to the other host with the same IP.

A suspected duplicate IP is fairly easy to test. First I start a persistent ping to the host from my Windows PC with ping -t followed by the IP address. I should know which switch port the host hangs off of, so I shut that interface down. An engineer could simply unplug the host, but I generally administer remotely, so shutting the port down is easiest. With the host's switch port down, I leave the ping running for about 30 seconds. If the host is disconnected and the IP still responds to ICMP, I've identified the duplicate. Even if it doesn't respond, I connect to the local router or another host on the subnet, attempt the ping from there, and then check the ARP table to see whether anything responds with the IP in question. On a Windows host that's arp -a from the CLI; on a Cisco device it's show ip arp. Once I have the offender's MAC address, I track it down to the switch port terminating the rogue device. On a Cisco switch, show mac address-table gives the MAC-address-to-interface mappings.
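Because a host's ARP table holds only one MAC per IP at any moment, a duplicate often shows up as the MAC for an address changing between successive arp -a runs. This Python sketch, assuming Windows arp -a formatting (dash-separated MACs), collects several snapshots and reports any IP seen with more than one MAC. The function is a hypothetical helper, not a built-in tool.

```python
import re
from collections import defaultdict

# Windows 'arp -a' rows: "  192.168.0.50          aa-bb-cc-00-00-01     dynamic"
ARP_ROW = re.compile(
    r"^[ \t]*(\d{1,3}(?:\.\d{1,3}){3})[ \t]+([0-9a-f]{2}(?:-[0-9a-f]{2}){5})[ \t]+\w+",
    re.IGNORECASE | re.MULTILINE,
)


def duplicate_ips(snapshots):
    """Return {ip: [mac, ...]} for any IP that mapped to more than one MAC across snapshots."""
    seen = defaultdict(set)
    for output in snapshots:
        # Header lines ("Interface: ...", "Internet Address ...") don't match the row pattern
        for ip, mac in ARP_ROW.findall(output):
            seen[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}
```

Running arp -a every few seconds while the persistent ping is going, and passing the captured outputs in, gives a quick list of MACs to look up in the switch's MAC address table.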

One method of preventing duplicate IPs is DHCP snooping. Unfortunately, this is only effective in a DHCP environment. To enable snooping, I designate the trusted ports that are allowed to have DHCP servers, then enable snooping on specific VLANs. When a host on a snooping port comes up, it isn't allowed to send any traffic except a DHCP request. Once the DHCP server responds, the switch makes an entry in a special table called the DHCP snooping binding table, which maps MAC addresses to IP addresses and the interface each host lives on. After this, the only IP addresses allowed from that port are those in the binding table. (Strictly speaking, this source-address filtering is done by IP Source Guard, which builds on the snooping binding table.) If a user tries to statically assign an IP address to their host, possibly a duplicate, the port won't pass its traffic, effectively stopping the interference.
If all users are experiencing issues, my default gateway router's IP address could be duplicated. This can be a huge problem, effectively killing the subnet. Fortunately, most routers detect when one of their IP addresses has been duplicated on the network segment. If I export syslog messages to a collector and alert on duplicate IPs, I can quickly discover and mitigate the issue. While duplicate IPs can be troublesome, they shouldn't be a show stopper.
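The snooping and source-guard decisions described above boil down to a lookup against the binding table. This is a toy Python model for illustration only, assuming one binding per access port; real switches also track lease times, VLANs, and multiple bindings per port, and every name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SnoopingSwitch:
    """Toy model of DHCP snooping plus IP Source Guard; not any vendor's implementation."""
    trusted_ports: set
    bindings: dict = field(default_factory=dict)  # access port -> (mac, ip)

    def learn_ack(self, port: str, mac: str, ip: str) -> None:
        """Record a binding when the DHCP server's ACK for a client on `port` passes through."""
        self.bindings[port] = (mac.lower(), ip)

    def permit(self, port: str, mac: str, src_ip: str, dhcp_role: Optional[str] = None) -> bool:
        """Decide whether traffic arriving on `port` may be forwarded.

        dhcp_role is "client" for DHCP requests, "server" for offers/acks, else None.
        """
        if port in self.trusted_ports:
            return True                 # uplinks and real DHCP servers are not filtered
        if dhcp_role == "server":
            return False                # untrusted ports may not act as DHCP servers
        if dhcp_role == "client":
            return True                 # any host may ask for a lease
        # Everything else must match the binding learned from DHCP
        return self.bindings.get(port) == (mac.lower(), src_ip)
```

A host that statically configures a different (possibly duplicate) address fails the final check, because its source IP doesn't match the binding recorded for its port.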

DNS/IP addressing issues


Normally it starts with a call from a user saying, "I can't get to the Internet."
My first question is: how long have you had the issue? This tells me whether they just moved into this apartment or cubicle and whether it ever worked from there, or whether they've been in the same spot for a year and it just now broke. It also allows correlations: did this report coincide with a power flicker, maintenance, or any recent change?
My next question is: what are the symptoms? Users will generally report that they can't access email, and that when they open their browser, the page cannot be displayed. At this point, it could really be anything.
Starting at the beginning, I have the user check their network settings from the CLI with ipconfig /all and verify that their IP address is valid. If it is, I have them ping something on the internet, like Google's DNS server at 8.8.8.8. If the ping responds with no loss, I know basic network connectivity is functioning. Next, I try pinging something by DNS name; my go-to is to have the user ping google.com. If the user reports a message like "Ping request could not find host google.com. Please check the name and try again.", the user is most likely experiencing DNS issues.
At this point, I go back to the ipconfig results and check which DNS servers the user has configured. If anything other than my designated servers is configured, I verify their network settings through the Control Panel. My clients are generally configured for DHCP, so a manual DNS entry means something's awry.
I can also point nslookup at a different DNS server and check whether the name resolves there.
Some organizations do web filtering through DNS services, so a user could be attempting to circumvent the system. An admin can prevent this behavior by blocking UDP port 53 outbound on the corporate firewall from user subnets. Another reason the DNS settings may differ is a DNS-hijacking virus on the PC. These bad boys send all of a client's DNS queries to a malicious DNS server that redirects users to sites that steal client information or install malware. It could also be that the client had their laptop in an environment that required a static configuration and accidentally left it that way. If that's the case, I have the client set their interface back to obtain DNS automatically, verify the change with ipconfig, and then verify connectivity again.
If the client had the correct DNS settings and it still doesn't work, I verify that the client can reach any DNS server at all. I usually use Google's, since it's easy to remember. To do this, I use the nslookup tool: from the CLI, I type nslookup, and once the utility loads, I use the server command to change the query server (server 8.8.8.8). Then I simply query google.com. If I get a reply, I know the client can successfully query an outside server, so the problem reaching our own DNS server may be the local Windows Firewall, local antivirus software, a network firewall, a firewall on the DNS server, or a configuration issue on the DNS server itself. If the client can't resolve off of Google's DNS servers either, it could be the local firewall or antivirus software, or perhaps a network firewall or filter. At this point, I disable the local firewall and antivirus software and do a quick check to see whether service is restored.
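The nslookup check (server 8.8.8.8, then a query for google.com) can also be scripted. The Python sketch below builds a bare DNS query for an A record by hand, following the RFC 1035 message layout, and sends it over UDP port 53 to a chosen server, bypassing the OS's configured resolvers just as nslookup's server command does. It assumes IPv4, uses a single fixed transaction ID, and treats any answer record as success; it's a diagnostic sketch, not a full resolver.

```python
import socket
import struct


def build_dns_query(name: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record with the recursion-desired bit set."""
    # Header: ID, flags (0x0100 = RD), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length; the name ends with a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def server_resolves(name: str, server: str = "8.8.8.8", timeout: float = 2.0) -> bool:
    """Query `server` directly; True if it returns NOERROR with at least one answer."""
    query_id = 0x1A2B
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_dns_query(name, query_id), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeouts and unreachable errors both land here
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _qdcount, answers = struct.unpack("!HHHH", reply[:8])
    # Same transaction ID, RCODE 0 (NOERROR), and at least one answer record
    return reply_id == query_id and (flags & 0x000F) == 0 and answers > 0
```

Running server_resolves("google.com") against 8.8.8.8 and then against an internal DNS server quickly separates "can't reach any DNS server" from "our server is the problem".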
If all other hosts on the subnet are performing correctly, then it is likely an issue with
this machine alone and further PC troubleshooting should be performed. If all other
users begin to report similar issues, then I would continue to troubleshoot deeper in
the network. 
If other subnets within the network can resolve successfully and I know the server is functioning, I need to test whether my requests are making it to the server, which depends on the route, the reverse route, and firewall rules. Step one is to ping the server. If this succeeds, I move on; otherwise, I look for issues in the path.

If this is a DNS server within my control, I connect to it and run tcpdump if it's a Linux machine, or Wireshark if it's a Windows machine. I set the capture filter to capture only traffic from my test host, using ip host followed by its IP address. If I see DNS queries coming in from my host, I know the network infrastructure from host to server is working, so the issue is likely a configuration setting on the server (or the reverse route). If I see no incoming queries from the host, some kind of filtering is likely preventing the traffic, or a route is missing. I then check any firewalls or ACLs in the path.
An engineer will generally run into IP addressing issues, such as a misconfigured subnet mask, when integrating a new device. This can manifest as a device being unable to leave its subnet, or as one device on a subnet being unable to reach another. When two hosts can't reach each other, it comes down to ARP. Host A is configured with 192.168.0.2/24, but Host B is improperly configured with 192.168.0.250/25. Host B believes its directly connected subnet ranges from 192.168.0.129 to 192.168.0.254, while Host A's range is .1 to .254. So when Host B wants to reach Host A, it thinks it must send packets to the default gateway, when in reality it should simply ARP for Host A's address and deliver directly. This asymmetric behavior means the connection won't work, or won't work consistently.
Another manifestation is a new host that can't reach the Internet or basic network resources. In the example above, imagine the proper default gateway of 192.168.0.1 is set, but the host is configured as 192.168.0.250/25. The host isn't in the same subnet as its default gateway, so it can't reach it. Catching this is as easy as using ipconfig to verify the configured subnet mask.
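Python's standard ipaddress module makes the mask mismatch easy to see. This sketch reuses the example addresses above and checks which addresses each host considers on-link; same_subnet is a hypothetical helper name.

```python
import ipaddress


def same_subnet(host_cidr: str, other_ip: str) -> bool:
    """True if `other_ip` falls inside the subnet that `host_cidr` believes it is on."""
    return ipaddress.ip_address(other_ip) in ipaddress.ip_interface(host_cidr).network


host_a = "192.168.0.2/24"
host_b = "192.168.0.250/25"   # misconfigured mask from the example above
gateway = "192.168.0.1"

if __name__ == "__main__":
    net_b = ipaddress.ip_interface(host_b).network
    hosts = list(net_b.hosts())
    print(f"Host B thinks its subnet is {net_b} ({hosts[0]} - {hosts[-1]})")
    print("A sees B as local:", same_subnet(host_a, "192.168.0.250"))
    print("B sees A as local:", same_subnet(host_b, "192.168.0.2"))
    print("B sees its gateway as local:", same_subnet(host_b, gateway))
```

It reports Host B's subnet as 192.168.0.128/25 (usable .129 to .254), and shows that Host A treats B as local while B treats neither A nor the .1 gateway as local, which is exactly the asymmetry described above.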

Rogue DHCP
Nothing used to wreak havoc like a rogue DHCP server. A rogue is nothing more than an unauthorized DHCP server on a network. First I'll cover techniques to troubleshoot and mitigate a rogue; then I'll cover a couple of methods to prevent one. A rogue DHCP server can show up for a couple of reasons. A rogue can be used to perform a man-in-the-middle attack: a malicious user hands out IPs to other hosts and becomes their default gateway, so traffic proxied through the attacker can be manipulated or collected. It's far more common, though, for a user to plug in a device incorrectly.
If the user's IP address is in an unexpected subnet and the client is configured for DHCP, I immediately know I have a rogue on my hands. These apartment rogues generally come from a user getting a new wireless router and promptly plugging it in backwards, with a LAN port connected to the wall jack. At this point, I need to find the rogue's MAC address so I can shut it down. From a machine that has accepted an IP address from the rogue server, I issue three commands. First, I run ipconfig /all | more, since there's a lot of output, and scroll down until I find the Ethernet adapter Local Area Connection. Here I can see that the default gateway is set to 192.168.88.1. Next, I ping the default gateway, 192.168.88.1, which forces the local machine to ARP for the rogue DHCP server's MAC address. I then have the user type arp -a, which lists the host's ARP table, and make note of the rogue's MAC address. Finally, I connect to the core switch and examine its MAC address table.
In a large network, this provides the port connecting to the next switch in the chain. Once I've connected to the edge switch, I check that device's MAC address table, which should give me the rogue DHCP server's access port. The quickest way to mitigate the rogue is to simply shut down the port. Of course, this technique only works if I have managed switches.
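Two small steps in this hunt are easy to script: converting the dash-separated MAC from arp -a into Cisco's dotted format, and following that MAC from the core switch through uplinks until it lands on a non-uplink port. The Python sketch below assumes you've already gathered each switch's MAC table and know which ports are uplinks; the data structures and function names are hypothetical.

```python
import re


def to_cisco_mac(mac: str) -> str:
    """Convert 00-11-22-aa-bb-cc (Windows) or 00:11:22:AA:BB:CC to Cisco's 0011.22aa.bbcc."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def trace_mac(mac, start, mac_tables, uplinks, max_hops=16):
    """Follow `mac` from the `start` switch through uplinks to its access port.

    mac_tables: {switch: {cisco_mac: port}}, gathered from each switch's MAC table
    uplinks:    {(switch, port): neighbor_switch}
    Returns (switch, port) for the access port, or None if the MAC isn't learned.
    """
    target = to_cisco_mac(mac)
    switch = start
    for _ in range(max_hops):
        port = mac_tables.get(switch, {}).get(target)
        if port is None:
            return None                      # MAC not in this switch's table
        neighbor = uplinks.get((switch, port))
        if neighbor is None:
            return switch, port              # not an uplink, so this is the access port
        switch = neighbor                    # hop to the next switch in the chain
    raise RuntimeError("hop limit reached; the uplink map may contain a cycle")
```

The returned switch and port are the ones to shut down.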
The best way to mitigate a rogue DHCP server is to prevent one in the first place. A technique called DHCP snooping on the switched infrastructure can prevent rogues at the port level. For snooping to work properly, we first designate trusted ports that are allowed to have DHCP servers on them; these will generally be uplink ports. I then designate which VLANs I want snooping to operate on, and enable it. The switch builds a special table called the DHCP snooping binding table that maps the end user's MAC address, the IP address assigned via DHCP, and the host's access port. Snooping itself drops DHCP server messages arriving on untrusted ports, which blocks an untrusted port from acting as a DHCP server. With IP Source Guard layered on top, an untrusted port passes only the DHCP request and response until its host obtains a lease, and afterward only traffic sourced from an IP in the binding table.
If my particular switch model doesn't support snooping, there is sometimes another option. On untrusted ports, an admin can block inbound UDP traffic sourced from port 67. A DHCP server always communicates with clients from UDP port 67, so blocking this on user ports effectively stops the port from handing out addresses. For detection, I configure a router interface as a DHCP client on each of my user subnets. If this detection interface receives an address from a rogue, it generates a syslog message to my monitoring system, which then sends an email I can quickly react to. I also employ this on networks with snooping or filtering configured, just in case a switch is replaced and some configuration is overlooked. It never hurts to have an insurance policy in place.
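A monitoring system can also inspect DHCP offers directly. In a DHCP message, the options follow a fixed 236-byte header and the magic cookie 99.130.83.99, and option 54 (Server Identifier) names the server that made the offer. This Python sketch, assuming you already have the raw DHCP payload (for example from a packet capture), flags offers from any server outside a hypothetical allowlist. It's a different technique from the syslog-based detection above, shown only as a complement.

```python
from __future__ import annotations

import ipaddress

MAGIC_COOKIE = bytes([99, 130, 83, 99])  # marks the start of the DHCP options
OPTION_PAD, OPTION_SERVER_ID, OPTION_END = 0, 54, 255


def dhcp_options(payload: bytes) -> dict[int, bytes]:
    """Parse the DHCP options that follow the 236-byte fixed header and magic cookie."""
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        raise ValueError("not a DHCP message")
    options = {}
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == OPTION_PAD:          # pad is a single byte with no length field
            i += 1
            continue
        if code == OPTION_END or i + 1 >= len(payload):
            break
        length = payload[i + 1]
        options[code] = payload[i + 2 : i + 2 + length]
        i += 2 + length
    return options


def rogue_server(payload: bytes, authorized: set[str]) -> str | None:
    """Return the offering server's IP if it is not on the allowlist, else None."""
    server_id = dhcp_options(payload).get(OPTION_SERVER_ID)
    if server_id is None or len(server_id) != 4:
        return None
    server_ip = str(ipaddress.IPv4Address(server_id))
    return None if server_ip in authorized else server_ip
```

An offer carrying Server Identifier 192.168.88.1, like the rogue in the example above, would be flagged whenever only the real DHCP server is on the allowlist.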

Failed routes/routing loop


The vast majority of our troubleshooting is done with the same two tools: ping and traceroute. The process usually begins when a client calls saying, "I'm unable to access this resource on the internet."
If the customer controls only their side of the connection, I have them run a traceroute from their infrastructure and send it to me. If they control both sides, which makes troubleshooting much simpler, I have them provide a traceroute from side A to side B, and then the reverse, from side B to side A.
Service providers generally take great pains to prevent failure with multiple layers of redundancy. Redundancy adds multiple paths a packet can take, which can create asymmetric routing, where packets take different paths in each direction. This is especially true on the internet, and it's where traceroutes from both sides of the connection really help. Examining the traceroute will often show the point of failure or help identify possible causes. I follow the customer's traces to see how far the traffic gets before it fails.
I also ask them to verify the simple things.
I usually say something to the effect of, "I know you've likely already checked this, but would you mind pinging IP address X?" Then I duplicate their traceroute as best I can, first from my desk and then from the router they use as their default gateway, which helps separate problems between the user's device and that router from problems further out. If I can't duplicate their failure, there's a high probability the problem is inside their network rather than mine. If I can reproduce the issue, or it seems to be intermittent, I jump to each of my network devices in the path and try to duplicate it, starting with pings from my probes.
Additionally, I connect to external resources like public route servers to see whether I can duplicate the issue from outside the network as well. A route server, or looking glass, is a device an admin can telnet to, or a website they can visit, that allows pings and traceroutes from an external vantage point. There are also cloud services that let me add my infrastructure IPs and test them from multiple locations across the globe simultaneously.
Often, I can also do a packet capture right on the probe, then open it in Wireshark
for further study. I don't typically have to go to this degree, but when I do, it's
indispensable. 
So what happens when I find a common point of failure? If it's inside my network, I first connect to the last responding point in my network and review its route table, looking for the next hop toward the failed resource's destination. Does the router know how to reach this resource? If the route table indicates that it does, I issue a ping and traceroute from this router. If the router can reach the destination, it's likely one of a few issues. It could be a filtering issue, in which case I verify any access lists or firewall rules in the path. It could also be that this router doesn't know how to reach the source host. If I'm doing dynamic routing, I need to verify any route filtering I may be doing and check the routers that are injecting routes for these destinations.
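The question "does the router know how to reach this resource?" comes down to longest-prefix matching: the most specific route containing the destination wins, and the default route only catches what nothing else matches. This Python sketch illustrates the rule with a hypothetical three-entry route table; real routers also weigh administrative distance and metrics when deciding which routes get installed, which this ignores.

```python
import ipaddress


def lookup(routes: dict, destination: str):
    """Longest-prefix match: return the next hop of the most specific matching route."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else best[1]


# Hypothetical route table: prefix -> next hop
routes = {
    "0.0.0.0/0": "ISP",          # default route
    "10.0.0.0/8": "core",
    "10.1.2.0/24": "branch",
}
```

Here lookup(routes, "10.1.2.5") picks "branch", 10.9.9.9 falls back to "core", and anything else rides the default route. If the specific route disappears, traffic silently shifts to a less specific one, which is exactly how the routing loop described next can start.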
Very rarely, I might see "TTL expired in transit" on my pings. This often, but not always, indicates a routing loop. Issuing a traceroute to the destination device can usually help diagnose it. The trace will generally act normally, but at some point it will hop to router A, then to router B, then back to A, then back to B, continuing until the trace exhausts its maximum number of allowed hops. This can happen due to a failed destination route combined with a default route. Router A has a route for the destination that sends packets to router B. Router B has lost its route for that destination and uses its default route to send the packets back to router A. As one can imagine, this can be quite taxing on routers. Each time a packet hops to the next router, its TTL is decremented, so looping packets won't loop indefinitely, but depending on traffic volume you can still create heavy packet loss and high CPU conditions on the routers involved. Route verification is the same here as above.
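The A, B, A, B pattern is easy to spot programmatically: a router address that reappears later in the same trace. This Python sketch assumes the traceroute has already been parsed into a list of responding IPs per hop, with None for timed-out hops; find_loop is a hypothetical helper. As noted above, a repeat is a strong hint of a loop, not proof.

```python
def find_loop(hops):
    """Return the repeating cycle of router IPs in a traceroute, or None.

    `hops` lists the responding IP for each TTL (None for '*' timeouts).
    A loop shows up as an address reappearing later in the trace.
    """
    first_seen = {}
    for ttl, ip in enumerate(hops, start=1):
        if ip is None:
            continue
        if ip in first_seen:
            start = first_seen[ip]
            return hops[start - 1 : ttl - 1]   # hops from first sighting up to the repeat
        first_seen[ip] = ttl
    return None
```

For a trace of 10.0.0.1, 192.0.2.1, 198.51.100.1, 203.0.113.1, 198.51.100.1, 203.0.113.1 (documentation addresses), it returns ["198.51.100.1", "203.0.113.1"], the two routers bouncing the packet between them.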
If I'm experiencing issues outside my network, out on the internet, what recourse do I have? First, how deep outside my network is the problem? If it's only one or two hops out, it's likely an issue directly with my upstream ISP, so I screenshot all of the traces and pings from my probes and open a ticket with that provider. If the failure is deeper out on the internet, what can be done? Oftentimes, issues on the internet happen at peering points between providers. Peering points are where two ISPs connect their routed networks together. An issue often arises here when one of the links between the routers fails and traffic congests the remaining links, causing heavy packet loss. If I can identify issues at a peering point between ISP1 and ISP2, I take my evidence and submit it to my ISP as well as to ISP1 and ISP2. I can't tell you how many times I've had to argue with ISP engineers that there are issues at specific points in their networks. The majority of my troubleshooting time is spent on precisely these steps.

Port already in use, netstat


When connecting to a service on a server, like HTTP or Telnet, I'm connecting via a
specific port and protocol dedicated to that application. The combination of an IP
address, a protocol, and a port number is referred to as a socket, and each
bidirectional connection runs between two sockets.
When a server wants to accept HTTP connections, it will listen on TCP port 80. Normally,
only a single application can listen on a given IP address, protocol, and port number at
a time. When multiple applications want to use the same port, it is called a port
conflict.
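A port conflict is easy to demonstrate: ask the operating system to bind a second socket to a port that is already held, and it refuses with an "address already in use" error. A minimal Python sketch; `port_in_use` is a hypothetical helper, bind semantics differ slightly between operating systems, and netstat remains the authoritative answer.

```python
import errno
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket is already bound to host:port.

    Tries to bind a throwaway socket to the same address; the OS refuses
    with EADDRINUSE when another application already holds it.
    """
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise  # a different problem, e.g. no permission for ports below 1024 on Linux
    finally:
        probe.close()
    return False
```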

Some applications will check whether their port is free, and these will generally warn me
of the port conflict. Other applications will simply fail with a generic error. The best
tool to determine whether a port is in use is netstat. Luckily, this command is available
for both Windows and Linux, though the syntax is slightly different for each.
On Windows, my go-to command line parameters for netstat are -anob. The -a option
displays all connections and listening ports (ports of applications that are listening,
as well as established connections); -n tells netstat not to resolve IPs to names, which
speeds up the command, and I prefer to see IPs at any rate; -o shows the process ID of the
owning process, which matters because the same application, such as Chrome, can run
several processes with different process IDs; and -b shows the name of the executable
holding the port open. On newer versions of Windows, such as Windows 10, I need to run the
command from an administrator prompt, because the -b option requires elevation.
Since I'm looking for HTTP, I'll look for anything listening on TCP port 80. When I find
the culprit, I can tell by name which app is holding the port. If it reports something
generic, like Java, I can use the process ID to find and kill it. I'll generally use Task
Manager to kill the process. Then I'll attempt to fire up my program again.
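When the -anob output is long, it can help to save netstat -ano output (without -b, so each connection stays on one line) and pull out the owning process IDs. A minimal Python sketch; `pids_listening_on` is a hypothetical helper, and it assumes English-language output where the state reads LISTENING. The PID can then be matched to a name with tasklist /FI "PID eq <pid>" and ended with taskkill /PID <pid> if Task Manager isn't handy.

```python
from typing import List


def pids_listening_on(netstat_output: str, port: int) -> List[int]:
    """Return the PIDs of TCP sockets in LISTENING state on `port`.

    Expects the text of Windows `netstat -ano`, where TCP rows look like:
      TCP    0.0.0.0:80    0.0.0.0:0    LISTENING    4
    """
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have exactly 5 columns; headers and UDP rows are skipped
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "LISTENING":
            continue
        local_port = fields[1].rsplit(":", 1)[-1]  # also handles [::]:80
        if local_port == str(port):
            pids.add(int(fields[4]))
    return sorted(pids)
```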
If I'm troubleshooting on Linux, I'll issue a slightly different command that yields
almost the same results. Issuing netstat -nap gives me a massive list of services, so I
like to use grep to filter the output for the specific port. In our case, the command
looks like netstat -nap | grep :80. On Linux, the n and a parameters perform the same
function as on Windows, and p replaces the b and o options from Windows, showing which
process (PID and program name) owns the connection. netstat can only show this for other
users' processes when run as root, so I'll usually prefix the command with sudo. Once I
find the offending process, I can use the kill command to kill it.
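One catch with grep :80 is that it also matches :8080 and connections whose remote end is port 80. A minimal Python sketch that parses netstat -nap output and keeps only TCP sockets listening on exactly the requested local port; `listeners_on` is a hypothetical helper, and it assumes the standard column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program name).

```python
from typing import Dict


def listeners_on(netstat_output: str, port: int) -> Dict[int, str]:
    """Map PID -> program name for TCP sockets in LISTEN state on `port`.

    Expects Linux `netstat -nap` rows such as:
      tcp   0   0 0.0.0.0:80   0.0.0.0:*   LISTEN   1234/nginx: master
    """
    found = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or not fields[0].startswith("tcp") or fields[5] != "LISTEN":
            continue  # headers, UDP, unix sockets, and non-listening TCP rows
        if fields[3].rsplit(":", 1)[-1] != str(port):
            continue  # exact local-port match, so :8080 is not confused with :80
        pid_prog = " ".join(fields[6:])  # program names may contain spaces
        if pid_prog == "-":
            continue  # not run as root: netstat cannot see the owning process
        pid, _, prog = pid_prog.partition("/")
        found[int(pid)] = prog
    return found
```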

I can now attempt to run my program again. Obviously, killing an errant process isn't
the ultimate fix. At this point, I would need to determine why the conflict existed in
the first place.
If I need both applications to be available, one possible solution would be to move
the new application to a different server, or I could run them both on the same server
and bind my new application to a different unused port. Port conflicts aren't all that
common, so a little preparation and practice can be the key to a quick resolution.

Test service connectivity using telnet (especially for HTTP and SMTP)
The vast majority of services I find myself testing are TCP-based: HTTP, SMTP, and so
on. To test TCP, I need a TCP-based utility, and my go-to tool on Windows is Telnet. This
can be done through the Windows telnet client or via apps like PuTTY. The majority of
HTTP servers will respond when an admin telnets to them on port 80.
For this example, I'll telnet to google.com on port 80 (telnet google.com 80). Once
connected, I type GET / HTTP/1.1 and hit Enter twice. You should get some sort of output
indicating that you sent a bad request, which means that our test was successful: the
fact that anything came back shows that we have a TCP connection in both directions.
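The same check can be scripted when the telnet client isn't installed (it is an optional feature on current Windows). A minimal Python sketch using only the standard library; `http_probe` is a hypothetical name. It sends exactly the request typed above. Because an HTTP/1.1 request without a Host header is invalid, a 400 Bad Request is the expected answer, and, as with telnet, any reply proves two-way TCP.

```python
import socket


def http_probe(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Send the bare telnet-style request and return the server's status line."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # Same as typing "GET / HTTP/1.1" and pressing Enter twice
        conn.sendall(b"GET / HTTP/1.1\r\n\r\n")
        reply = conn.recv(4096)
    return reply.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
```

Usage: `http_probe("google.com")` should return a status line such as a 400 response.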
SMTP can also be tested simply with telnet. If I'm unsure of a mail server's IP, I can
use nslookup to look up the domain's MX record. MX stands for mail exchanger, and these
records list the usable mail servers for a domain. In nslookup, I set the query type to
MX by typing set type=MX and pressing Enter. I then type the domain name in question;
for this example, it will be google.com. This should ultimately supply the primary mail
server's address (the record with the lowest preference value). I'll highlight it and
right-click to copy it.
I'll then telnet to the server on port 25; in PuTTY, I set the connection type to Telnet
and the port to 25. Once it connects, I'll enter the command EHLO gregsowell.com, or
sometimes HELO gregsowell.com. If it fails, I'll usually just try typing HELO
gregsowell.com again. The server should respond that it is ready to accept an email from
me (a 250 reply), which confirms our connectivity.
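A scripted version of the EHLO test, as a minimal Python sketch using raw sockets so each step maps onto what I type in telnet. `smtp_ehlo_probe` and `read_smtp_reply` are hypothetical names. The sketch assumes the server greets with a 220 line and that multi-line replies use the "250-" continuation form defined by SMTP. Python's smtplib can do the same job; raw sockets just keep the steps visible.

```python
import socket


def read_smtp_reply(reader) -> int:
    """Read one (possibly multi-line) SMTP reply and return its 3-digit code."""
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("server closed the connection")
        # Continuation lines look like "250-..."; the final line is "250 ..."
        if line[3:4] != b"-":
            return int(line[:3])


def smtp_ehlo_probe(host: str, helo_name: str, port: int = 25, timeout: float = 10.0) -> int:
    """Connect, read the greeting, send EHLO, and return the EHLO reply code."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("rb") as reader:
            greeting = read_smtp_reply(reader)
            if greeting != 220:
                return greeting  # server is not ready to talk SMTP
            conn.sendall(f"EHLO {helo_name}\r\n".encode("ascii"))
            code = read_smtp_reply(reader)
            conn.sendall(b"QUIT\r\n")  # end the session politely
            return code
```

A return value of 250 corresponds to the "ready to accept an email" response seen in telnet.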
Telnet won't be able to replicate all services, but I can often verify if a TCP session will
establish, which means network connectivity is there. 

Test service connectivity using wireshark


I can test specialized services using packet capture applications like Wireshark on
Windows or tcpdump on Linux. The average laptop is constantly sending hundreds of
packets, which makes it nearly impossible to spot the interesting traffic without a
filter. If I want to narrow the capture down to a single host, I'll set my filter to ip
host followed by its address. I can also limit the traffic by port with tcp port and the
number, or udp port and the number. Once I have my Wireshark capture running and
filtered, I can test my TCP connection again. I'm going to configure Wireshark to listen
on the WiFi interface, set the filter to tcp port 8728, and then hit Enter. To simulate
traffic, I'll use PuTTY to telnet to 192.168.88.1 on port 8728. I'm looking for the TCP
SYN packet I sent to the remote device to be answered in some form. Usually it will be
answered with a SYN-ACK, though it may also be a reset (RST). Here in my Wireshark
capture, you can see the remote host complete the TCP three-way handshake with my laptop.
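While watching Wireshark, the three outcomes I care about are a SYN-ACK, a reset, or silence. A minimal Python sketch that classifies a connection attempt the same way from the client side; `tcp_probe` is a hypothetical helper, and the labels "open", "refused", and "filtered" are shorthand chosen for this sketch. A timeout only says nothing came back, not which device dropped the packet, which is where the capture comes in.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify how a TCP connection attempt to host:port was answered.

    "open"     - SYN-ACK came back and the three-way handshake completed
    "refused"  - the remote end (or a device in the path) answered with RST
    "filtered" - nothing came back before the timeout
    Other errors, such as an ICMP host-unreachable, are raised to the caller.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
```

Usage: `tcp_probe("192.168.88.1", 8728)` while the Wireshark capture from above is running.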
UDP traffic can be much more difficult to test unless I control both ends of the
connection. If I only have access to the client, I'll only be able to verify if my client is
sending traffic. This is due to the connectionless nature of UDP. If I control both the
server and client, then I'll initiate a Wireshark or Tcpdump on both devices and test. 

In Wireshark on my laptop, I'll set the interface to the first VM interface and then set
the filter to UDP port 2323 and hit enter. I'll then connect to my test server and start
Tcpdump with the command sudo tcpdump udp port 2323. As you can see, my filters
work very similarly between Wireshark and Tcpdump. From my laptop, I'll send a few
test packets. 
In Wireshark, I should see the traffic leaving my device, which I do. If I don't see it,
then it is likely a local client firewall issue. Switching back to the server's tcpdump
output shows that it is receiving the traffic. If the client is sending but the server
isn't seeing the traffic, I'll begin troubleshooting any filters (firewalls, routing
issues, etc.) between the client and server.
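The same two-ended UDP test can be sketched in Python with the standard library: one function fires a few datagrams from the client, and the other collects whatever arrives on a bound socket at the server. The function names and the payload format are hypothetical; port 2323 from the example above would be passed as `port`. Unlike tcpdump, the receiver has to bind the port, so it only works where nothing else is already listening on it.

```python
import socket


def send_udp_probes(host: str, port: int, count: int = 3) -> None:
    """Fire a few test datagrams; UDP itself gives no confirmation they arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for i in range(count):
            sock.sendto(f"probe {i}".encode(), (host, port))


def receive_udp_probes(sock: socket.socket, timeout: float = 2.0) -> list:
    """Collect (payload, sender IP) pairs on a bound socket until it goes quiet."""
    sock.settimeout(timeout)
    received = []
    try:
        while True:
            data, addr = sock.recvfrom(2048)
            received.append((data.decode(errors="replace"), addr[0]))
    except socket.timeout:
        pass  # no datagram for `timeout` seconds: assume the test is over
    return received
```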
By the way, this testing method works just as well for TCP traffic. In the service
provider environment, I'll connect to one of my MikroTik probes and use its Telnet
tool to test while doing a packet capture. Another tremendous feature is the ability
to VPN into the probe and test just as the user would.
WiFi intermittent service
Few things can be as frustrating or problematic as flaky wireless. Usually it comes in as a
report from the user saying, "the wireless is terrible," but by the time I get there, everything is
fine. This could be any of a very large number of issues. 
I usually prefer to start at the physical layer and work my way up the stack. The physical layer
can be really tricky depending on the environment I'm running in. Occasionally, I'll be in an
environment out of my control. Think of an office complex. I generally can't dictate what other
tenants are allowed to do with their wireless, which means they could be the source of my
issues. 

All of my APs will be in some sort of network monitoring system that watches for
connectivity failures or high CPU conditions on the devices.
If I've ruled out issues with the access point itself, I'm going to look for interference
issues. Interference usually comes into play with 2.4 gigahertz networks, though it can
affect 5 gigahertz networks also. 2.4 is especially susceptible because it only has three
non-overlapping 20 megahertz channels to work with. To make matters worse, newer
protocols like 802.11n allow wider 40 megahertz channels, which means I only have two
channels to work with. A lot of consumer-grade routers default to these 40 megahertz
channels.
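The channel arithmetic behind the three-channel limit can be checked with a few lines of Python. This sketch assumes the standard 2.4 GHz channel centers (2407 + 5 × channel MHz, for channels 1 to 13) and treats two channels as overlapping when their centers are closer than one channel width. The 22 MHz figure is the legacy 802.11b channel width that gives the familiar 1/6/11 plan; the helper names are hypothetical.

```python
from itertools import combinations


def center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz channel (channels 1-13)."""
    return 2407 + 5 * channel


def overlaps(ch_a: int, ch_b: int, width_mhz: int) -> bool:
    """Two equal-width channels overlap if their centers are closer than one width."""
    return abs(center_mhz(ch_a) - center_mhz(ch_b)) < width_mhz


def mutually_clear(channels, width_mhz: int) -> bool:
    """True if no pair in `channels` overlaps at the given channel width."""
    return not any(overlaps(a, b, width_mhz) for a, b in combinations(channels, 2))
```

With 40 MHz channels, no three channels between 1 and 11 are mutually clear, which is why wide channels shrink the 2.4 plan so badly.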
Interference really bites you when two access points within range of each other are
transmitting heavily on the same channel. WiFi is a contention-based medium. Think of
it like walkie-talkies: only a single person can talk on a channel at a time. If I want
to send information, my NIC will listen to ensure the channel is quiet, then attempt to
send. A WiFi NIC can't detect a collision while it is transmitting, so if it doesn't
receive an acknowledgment, it assumes another device transmitted at the same time. It
then waits for the channel to clear, backs off for a random amount of time, and attempts
to send again. Now imagine there's another AP within range, and its users are
transmitting on the same channel. The easiest way to detect this interference is to use
a WiFi analyzer. There are some simple free ones that run on a laptop, or, to make things
even easier, I can run one on a mobile device.
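The random wait is what keeps two busy cells from colliding over and over. A simplified Python model of 802.11-style random backoff, where the contention window roughly doubles after each failed attempt. The CW_MIN and CW_MAX values are typical figures used here as assumptions, the function names are hypothetical, and real NICs also account for slot timing and interframe spaces, which this sketch ignores.

```python
import random

CW_MIN = 15    # assumption: typical minimum contention window, in slots
CW_MAX = 1023  # assumption: typical maximum contention window, in slots


def backoff_slots(retry: int, rng: random.Random) -> int:
    """Pick a random backoff (in slot times) for the given retry number.

    The window grows 15, 31, 63, ... up to CW_MAX after each missing ACK,
    which spreads competing stations out in time.
    """
    cw = min((CW_MIN + 1) * (2 ** retry) - 1, CW_MAX)
    return rng.randint(0, cw)


def same_slot_rate(retry: int, trials: int = 20000, seed: int = 1) -> float:
    """Fraction of trials in which two contending stations pick the same slot."""
    rng = random.Random(seed)
    hits = sum(backoff_slots(retry, rng) == backoff_slots(retry, rng) for _ in range(trials))
    return hits / trials
```

For two stations, the chance of picking the same slot is about 1/(CW + 1), so each doubling of the window roughly halves the repeat-collision rate.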

I prefer WiFi Analyzer on my Android devices. It displays SSIDs, the channels they are
running on, and their signal strength. The easiest thing to do is walk around the areas
where users will be working wirelessly, look for the least utilized channel, and then
switch the AP's channel to it. Ensure your own APs aren't the source of interference.
When I have complete control of the wireless space, I should be able to create a channel
plan that maximizes signal distribution while keeping neighboring APs on different channels.
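Picking the quietest channel from a walk-around can be sketched in Python. This assumes the scan has been written down as (SSID, channel, signal in dBm) tuples, that the candidates are 1, 6, and 11, and that any neighbor within four channels of a candidate overlaps it; `quietest_channel` is a hypothetical helper. Signals are converted from dBm to milliwatts before summing, so one strong neighbor outweighs several faint ones.

```python
NON_OVERLAPPING = (1, 6, 11)


def quietest_channel(scan, candidates=NON_OVERLAPPING) -> int:
    """Return the candidate channel with the least overlapping neighbor signal."""

    def interference_mw(candidate: int) -> float:
        # dBm -> mW: 10 ** (dBm / 10); only neighbors that overlap the candidate count
        return sum(10 ** (rssi_dbm / 10)
                   for _ssid, channel, rssi_dbm in scan
                   if abs(channel - candidate) < 5)

    return min(candidates, key=interference_mw)
```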
If I can't detect other APs causing interference, then it could be coming from a device
that isn't WiFi gear at all. The 2.4 gigahertz range is also used by things like wireless
mics and baby monitors. Believe it or not, microwave ovens operate at about 2.45
gigahertz, right in the middle of the band, and can cause issues. If I'm in the kitchen at
work microwaving my burrito, I tend to lose access to the AP. In business, I hear that
sales cures all. Well, in WiFi, more spectrum does just about the same. If I find that I'm
running 2.4 in a supersaturated environment, then I'll look at switching to the 5 gigahertz
band. Depending on what country I'm installing in, I can expect up to roughly 25
non-overlapping 20 megahertz channels to work with. Not all legacy equipment supports the
5 gigahertz band, nor does it penetrate objects like walls as well as 2.4, so I'd suggest
testing before doing a full deployment.

Frequencies aren't the only answer. Some vendors employ different hardware
techniques that will reject noise, use band steering to prefer the 5 gigahertz band when
possible, or use beam forming to direct wireless signals towards the client. I've seen very few
circumstances where commercial equipment didn't outperform consumer-grade gear.
My last step is usually to drop a small WiFi probe in. I can monitor the device with my
existing network management system, looking for failure, trending connectivity, and also
connect to it for reliable, repeatable tests. Unfortunately, a wireless network is never quite
done. An admin must stay vigilant as some new source of interference may appear at any
time.

The secret weapon in every network admin's holster should be a set of probes. Probes can
take many forms, have varying features, and come at different price points. The most
basic probe is any device that can be remotely connected to and used for basic
troubleshooting. More advanced features allow an admin to perform complex remote
operations. Mandatory features for a probe are ping, traceroute, and a single network
interface. Advanced features include remote access (through some sort of remote desktop
protocol or a VPN), the ability to view and capture packets, and multiple interfaces.
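The mandatory probe features boil down to running ping and traceroute from a remote box. A minimal Python sketch that builds the right command lines for the probe's operating system; `probe_commands` is a hypothetical helper. The flags used (-c or -n for the ping count, -n or -d to skip name resolution in traceroute/tracert) are the standard ones on Linux and Windows.

```python
import platform
from typing import Dict, List, Optional


def probe_commands(target: str, count: int = 4, system: Optional[str] = None) -> Dict[str, List[str]]:
    """Return ping and traceroute argument lists suited to the probe's OS."""
    system = system or platform.system()  # "Windows", "Linux", "Darwin", ...
    if system == "Windows":
        return {"ping": ["ping", "-n", str(count), target],
                "traceroute": ["tracert", "-d", target]}
    return {"ping": ["ping", "-c", str(count), target],
            "traceroute": ["traceroute", "-n", target]}
```

Usage: `subprocess.run(probe_commands("8.8.8.8")["ping"], capture_output=True, text=True)` runs the ping and captures its output for the ticket.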

Destination server, printer, or website cannot be accessed (assuming the device gets an IP)
The troubleshooting below assumes the device gets an IP address. If the device doesn't get an IP, check the separate article on that issue.
1. Trace the source and destination devices and check whether firewall rules at both ends allow access on the
necessary ports. If not, allow it.
2. If firewall rules permit access, ask the user for traceroute, ping, and telnet output and troubleshoot based
on where it is failing:
a. Even if traceroute completes successfully, that does not confirm the destination can be accessed on a
particular port, since traceroute only tests its own probe packets (ICMP for Windows tracert, UDP by default
for Linux traceroute), not the application's port.
b. If the user gets an 'operation not permitted' error on a server when attempting to traceroute, ask someone
with administrative privileges to run traceroute, ping, etc. If that person also gets the error, check whether
traceroute is permitted in iptables.
c. Traceroute and ping are not allowed from one side of the firewall to the other. Running traceroute on the
firewall to the destination with the source (default gateway) explicitly defined won't work either. See the
firewall characteristics/features article for more info.
d. If traceroute drops after crossing our firewall, it could be either a routing issue with the ISP or the
destination server blocking it. Check the name of the last hop's IP, look up the destination IP on
ip2location.com, and run a traceroute from an internet-based site to determine whether it's an ISP or
destination server issue.
3. Ping:
a. To check whether the destination server is UP and reachable, ping internal/internet-based destination IPs
from the VDI as well as from the user's PC. For internet IPs, if there is no ping response from the VDI, also
ping from the internet. This helps isolate whether it is a user subnet issue, a zebra network issue, or a
destination server issue.
b. If you are unable to ping the IP from the user's PC or VDI, try pinging from the core switch or firewall to
which the destination device is connected. This step helps in cases where ping is blocked by access list rules
or a route/return route is not available for a particular subnet/site.
c. If you are unable to ping from the core, also try pinging firewall subnets and internet IPs from the firewall,
to check for an SFR issue and the other issues mentioned above.
d. On an ASA firewall, you can't ping the default gateway IP of any interface except the interface to which you
are directly connected. On a PA firewall, you can't ping from a PC to its default gateway if a management
profile allowing ping is not associated with the layer 3 interface on which the default gateway IP is defined.
When pinging any IP from the PA firewall CLI, a source IP has to be defined (a source IP in the same subnet).
e. If the ARP table and MAC address table show the device but you can't ping its IP, clear the ARP table and
MAC address table, shut and unshut the port, and then check again.
4. Access the destination URL in a browser:
a. Access the destination URL from the VDI and from the user's PC and compare. Does a load balancer or PingID
authentication page load before the destination URL? If so, it also has to be allowed at the source and load
balancer end. You can tell when you enter the destination URL on the user's PC and the URL changes to another
one (sometimes we may not see this redirection when loading the destination URL on the VDI, as it probably
happens too fast to notice).
b. For internet sites, if you can't access the site from the VDI or the user's PC, also try the destination
website from a mobile device, so that we can isolate whether the issue is on the destination site side or the
zebra network side.
5. Do a log capture in ASDM for the source or destination IP, bi-directionally. This shows whether the request
is reaching the firewall and passing through successfully, whether we are getting a response, whether anything
is getting blocked, whether our IP needs to be whitelisted on the destination side, whether the RDP service is
not running on the destination server, etc. Also check whether packets are reaching the firewall by watching for
the hit count to increase on the access list rule. If the ASDM log capture doesn't work, install Wireshark and do
a packet capture.
a. If the log capture shows bi-directional traffic passing successfully, then the issue needs to be checked on
the server side.
6. Check whether an unknown port that hasn't been opened is the reason the destination is unreachable by adding
an ip any any rule.
a. Identify unknown ports with a log or packet capture, an ACL rule with the log keyword on the router, by
checking with the user, by reading documentation, or by visiting the support site.
b. Check whether an extra port number is mentioned in the destination URL. For example, if the URL is
http://crp1-omwinternal.zebra.com:9431/soa-infra/services/default/ZEB3PLDHLRequestorSe, then access needs to be
allowed for port 9431.
c. If we are allowing access by application, check which protocols the application 'depends on' and ensure
those are allowed as well. The ports associated with an application are one thing and the protocols the
application depends on are another, but both need to be opened.
7. For internet-based destination IPs, check these additional things:
a. Ensure NAT exists, and that it exists for the correct ISP interface in the case of multiple ISPs
(private-to-private NAT may apply for the 157.235.XX.XX and 10.11.XX.XX subnets). If the user can ping the
default gateway but traceroute shows packets not leaving our device, check NAT.
b. If traceroute drops after crossing our firewall, it could be either a routing issue with the ISP or the
destination server blocking it. See the traceroute section. Does the site administrator have to whitelist our
IP for us to access the destination? Or is the correct IP whitelisted?
c. If Netskope is running, bypass it and check.
d. If access is allowed for the URL (using an FQDN) instead of an IP, verify by allowing access to the URL's IP,
as an FQDN sometimes doesn't work. This can happen when the URL has a lot of IPs.
8. If the user is unable to access an internal IP from the internet, check these things:
a. Check whether DNS resolves the URL to its public IP from the internet and to its private IP internally.
Compare the two to isolate whether the issue is DNS-related. The cause below can also sometimes lead to a DNS
resolution issue.
b. Apart from static NAT, firewall rules, routes, and server firewall rules, also check whether a route exists
in the INET router/VRF that sits between the firewall and the internet. My guess is that DNS may still resolve
in this case, since DNS records may be sent to the internet on a separate IP/port.
9. Check for issues on the server side:
a. If the source or destination is a Linux server, check iptables or firewalld on BOTH the source and
destination servers to see whether access is allowed (iptables is like an internal firewall running on Linux
servers; some Linux servers also run firewalld).
i. If a proxy server sits in front of the application server, then iptables needs to be bypassed on both
servers. See the separate article for info on proxy servers.
b. Check whether the subnet mask and default gateway are configured properly on servers. Once, a server's
subnet mask was wrong, so the server was inaccessible from internal servers but, for some reason, accessible
from a VPN connection.
c. Hitesh once mentioned something about a load balancer. Check with the server team for anything related to
this, depending on the issue.
d. Once, a transfer failed when the user initiated traffic from a 017 site server to his VPN PC. We later found
that some traffic was also initiated in the reverse direction without the user's knowledge, and that traffic
was getting blocked.
e. If the server is hosted in an AWS environment, then rules need to be allowed in AWS as well. These servers
are usually connected through a PA site-to-site VPN tunnel, so if the PA allows the traffic, we need to check
the AWS side.
10. DNS: Run nslookup on the destination URL. If nothing resolves, or if you can ping or access the destination
by IP but not by URL/hostname, this might indicate a DNS issue.
a. Ensure a DNS server IP is shown in the ipconfig output.
b. Check whether the appropriate DNS server is configured on the device. Try to resolve the URL from your own
PC. If it resolves there, then it is the DNS server IP they have configured that is not resolving.
c. Ping the DNS server from the VDI and the user's PC to check whether it's UP and reachable.
d. Change the DNS server on the user's PC and check. Also try to ping or traceroute the destination URL from the
core switch; if the switch's DNS resolves the URL to a different IP, change the DNS server IP on the user's PC.
e. Check whether the DNS server IP is allowed in the access list applied on the server's interface. Filter the
access list with the domain keyword to find the object group associated with the DNS servers.
f. If the destination URL ends with any domain name other than zebra.lan, we need to ensure that the domain name
is added to 'some centralized resolution database??' so that the DNS server is able to resolve the URL.
Otherwise, the DNS server may append the usual domain name zebra.lan to the end of the URL and try to resolve
that, which obviously returns no IP because no such URL exists.
g. Newly commissioned servers need to be allowed access to the DNS servers, LDAP, etc. in order to sync with the
zebra.LAN domain and resolve hostnames, and this is the first step before anything else can happen. On firewall
10.80.254.49, check 'access-list lutron_access_in' and filter for IP 10.80.55.75 to see which IPs and ports need
to be opened for a new server.
h. In one case, a failure to resolve was due to an issue on the server. See the article "User can access
destination server using IP but not using URL".
11. Check whether the route/reverse route exists on ALL devices in the path.
a. If the user is unable to access an internal server from the internet, check whether a route exists for the
public IP in the internet router/INET VRF, as it sits between the firewall and the internet.
b. If the source or destination is in a 192.168.xx.xx subnet, we need to check routes for these lab networks,
as route info for these subnets is sometimes only available to devices within the site. Dhivahar once said that
the same 192.168.xx.xx subnet may exist at multiple sites. If so, and if users insist on access to other sites,
we may need to NAT to a 10.xx.xx.xx subnet, which is L3's scope. Routing is also slightly different for the
157.235.XX.XX and 10.11.XX.XX subnets.
c. If the destination is connected to a network device we don't have access to, check for a return route issue
or firewall rules blocking traffic by comparing traceroute and ping to the destination from 2 different sites.
d. Once, the route for a destination in the core was shown as load balanced across 2 firewalls (the normal
firewall at the site and the ASA DMZ firewall at the site) due to a faulty redistribution config on the ASA DMZ
firewall.
12. Bypass the source & destination IPs from SFR monitoring on ALL appropriate firewalls and then check whether
it helps. (Once, firewall rules allowed access, but neither the user nor I (from the VDI) could access an
internet-based site. I could not traceroute the site's IP from the core, although I could traceroute other sites
like Google (8.8.8.8). However, I could traceroute this site successfully from the firewall. So I excluded this
site from SFR monitoring on the user's site firewall. After that, I could traceroute from the core and the user
could access the site. This was only temporary testing, and we put the rule back. Dhivahar asked me to send a
ticket to GISO for approval to bypass this site in SFR. Once we get approval, we need to move the ticket to the
L3 team to bypass it permanently.)
13. RDP issue: In order to RDP, the following need to be enabled on the PC in addition to firewall rules (local
IT, not the Windows server team, takes care of RDP issues on PCs).
a. The user's PC must be added to the Active Directory security group.
b. Under Settings, System, Remote Desktop, the Enable Remote Desktop toggle must be turned on.
c. The user must be a member of the "Remote Desktop Users" local group (check with net localgroup "Remote
Desktop Users").
d. For RDP access to servers, the Windows server team said they will add the user to a group that allows RDP
access.
14. When the source or destination is a relay server, be careful and double-check the true source or
destination IP. The relay server had functionality similar to the ip helper-address command, so it was only
forwarding the traffic from the source server to the destination.
15. Check whether access to a server is permitted based on the user's user ID/group in FMC.
16. If a user is statically configuring an IP on a host machine, server, or printer, ensure that the correct
subnet mask and default gateway are entered. For printers, in a few cases we have found the static IP on the
printer clashing with its DNS record (verify using nslookup).
17. If the source, the destination, or both are behind a device we don't have access to, we can check with the
local IT contact about this device to see whether access list rules are applied there. Try both the local
username and the TACACS username for access. If the login disclaimer mentions another brand, like HP, we can be
sure we don't have access. If both source and destination are behind such a device, ask for a traceroute to
learn which devices sit in between.
18. Based on the error message, consider whether this could be a Java, browser, or other application error. For
example, if it shows Status 500 and Java errors, then it might be a Java error, and Java may need to be updated.
19. Once, users in one particular subnet were unable to access an internal website, while users in other
subnets at other sites could access it, even though firewall rules permitted access for all subnets. The issue
turned out to be corrupted entries in an LDAP group shared by the affected users. Once the LDAP group cache was
cleared, the users could log in and work properly.
20. Use this step and the steps below it only as a last resort, if all other troubleshooting doesn't help. The
context of the problem is also important when doing this.
a. Sometimes restarting the destination server might help.
21. Check whether other components are involved in this traffic flow, and restart them if necessary.
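Several of the steps above (pulling the port out of the URL in 6b, running nslookup in 10, and testing the port the way telnet would) can be combined into one quick first pass. A minimal Python sketch using only the standard library; `target_from_url` and `check_destination` are hypothetical helpers, not tools the team uses. A successful connect only shows the path and port are open; it doesn't replace the ASDM log capture or the server-side checks.

```python
import socket
from typing import Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def target_from_url(url: str) -> Tuple[str, int]:
    """Return (hostname, port) for a URL, honoring an explicit :port (step 6b)."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if parts.hostname is None or port is None:
        raise ValueError(f"cannot work out host/port from {url!r}")
    return parts.hostname, port


def check_destination(url: str, timeout: float = 5.0) -> str:
    """Rough first pass: resolve the name first, then try a TCP connect to the port."""
    host, port = target_from_url(url)
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return f"DNS: {host} does not resolve"
    last: Optional[OSError] = None
    for addr in sorted({info[4][0] for info in infos}):
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                return f"TCP {addr}:{port} open"
        except OSError as exc:
            last = exc  # remember the failure and try the next resolved address
    return f"TCP {host}:{port} failed on all addresses ({last})"
```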

What to do when a device does not get a DYNAMIC IP or gets an APIPA IP after connecting to a switchport
A 169.254.x.x IP is an APIPA address. Windows machines assign these addresses to themselves when they are unable
to get an IP address from a DHCP server.
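Spotting an APIPA address in pasted ipconfig output can be sketched in Python with the ipaddress module. This assumes English-language ipconfig output, where the line reads either "IPv4 Address" or "Autoconfiguration IPv4 Address"; `apipa_addresses` is a hypothetical helper.

```python
import ipaddress
import re
from typing import List

APIPA = ipaddress.ip_network("169.254.0.0/16")
IPV4_LINE = re.compile(r"IPv4 Address[ .]*:\s*([\d.]+)")


def apipa_addresses(ipconfig_output: str) -> List[str]:
    """Return any self-assigned (APIPA) IPv4 addresses found in ipconfig output."""
    return [addr for addr in IPV4_LINE.findall(ipconfig_output)
            if ipaddress.ip_address(addr) in APIPA]
```

An empty list means every adapter shown has a non-APIPA address; anything returned points to a DHCP failure on that adapter.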
1. All of these things need to be in place before a device can get a dynamic IP:
a. Check whether the port is configured for the correct VLAN.
i. Once, the access VLAN and voice VLAN were set to the same VLAN, and no IP was received. After removing the
voice VLAN, an IP was received.
b. Check whether the VLAN is allowed through the trunk between core and access, as well as between core and
firewall if applicable. Ensure it is allowed on both ends of the trunk.
i. Also verify the status of the ports in the port channel (PO).
c. Check whether a DHCP server IP/DHCP relay is defined for the VLAN on the core switch or firewall.
d. Check whether the subnet has been added to the DHCP server. Even if a DHCP scope has not yet been created for
a subnet on the DHCP server, any static IPs assigned to devices will still show up in the ARP table.
e. Ensure the L2 VLAN exists and check whether it is shut on the core switch. If so, unshut it, but confirm with
Dhivahar before unshutting it.
f. Allow the DHCP application from the subnet in the Palo Alto firewall security policy. Refer to CHG0069458 for
a sample subnet config. If a host machine is in a subnet connected to the firewall, then ASA firewall rules don't
need to allow DHCP ports for the DHCP request to go through. See the article "Firewall access list rules apply
only for traffic PASSING through the firewall…."

2. DHCP application related:
The server team configures DHCP servers as primary and secondary. Only the primary gives out IPs while it is UP.
Even if a request goes to the secondary (because the secondary is listed as primary on our device), the secondary
will pass the request to the primary server, assuming failover is configured properly.
a. Check in the DHCP app whether free IPs are available for this subnet. If there are no free IPs, ask the
server team to clear IPs, increase the scope, or reduce the lease time. See the wireless document for the
command to check for an IP address on the controller.
i. Sometimes the ARP table shows fewer entries while the DHCP app shows the entire DHCP pool as utilized, which
can prevent users from getting an IP.
ii. Once, users did not get an IP when 9% of IPs were free; they got an IP only after 60% of IPs were cleared in
the DHCP app.
iii. Are new IPs learned after clearing the ARP cache/DHCP app pingable?
b. Ensure the subnet mask, gateway, domain name, etc. are configured correctly both on the network device and in
the scope options in the DHCP app.
c. Are you able to ping the DHCP server IP from the source subnet?
d. Do the route to the DHCP server and the reverse route to the source subnet exist?
e. Once, the server team broke the failover config between the primary and secondary DHCP servers, then
deactivated the scope and enabled it again. This fixed the issue.
f. For each subnet, check whether the primary & secondary DHCP server IPs configured on the core switch match the
primary & secondary server IPs configured on the DHCP server. According to Ravi, if they are reversed on the
server side, the secondary server (which is the primary on the switch side) would not respond while the primary
server is UP; however, verify this.
g. Once, the primary server did not have the scope configured; only the secondary had it, so no IP was assigned.
Failover was also not configured properly between the primary and secondary, so our request hit the secondary
and was dropped. The primary had not been configured because we were not given read-only access to it, so it
appeared to be down to us, and we therefore did not request configuration on it.
h. Once, the DHCP scope wasn't full, yet users were not getting IPs. When the Windows server team restarted the
DHCP server, the issue was fixed.
i. Ensure a static IP is not already assigned to the device in the DHCP app.
3. Check if ISE config is interfering.
4. If the device still does not get an IP, or if its IP is displayed in the ARP table but the user is unable to
use the device, clear the MAC address table & ARP table and shut & unshut the port. Then check for an IP and ping
to verify. Sometimes the ARP & MAC tables need to be cleared for the old IP to be removed and the new IP to be
assigned to the device.
5. Check whether it is a zebra-imaged PC. All zebra-imaged PCs are associated with the user's ID, and this user
ID is mapped in AD in order to get a DHCP IP. So non-zebra PCs will not get a DHCP IP, although once a non-zebra
PC got a VLAN 144 IP but not an IP from any other VLAN.
6. If a device does not get an IP but other devices in the same subnet do:
a. If the layer 1 and layer 2 status is down for this device, it might be a cable or port patching issue.
b. If layer 1 and layer 2 are UP, then it might be a port/switch issue; connect to a neighboring port and check.
c. Once in 3OP, we had to change to another working VLAN, shut and unshut the port, and then change back to the
old VLAN for the device to get an IP.
7. If the above troubleshooting doesn't help, change the port, device, user VLAN, or switch to isolate whether the
issue is due to any of these.
8. Additionally, a static IP can be assigned to a PC or a connecting switch, and the default gateway then pinged,
to isolate whether the issue is in communicating with the DHCP server.
9. If the default gateway of a subnet cannot be pinged, check whether it is a layer 3 issue, such as the subnet
not being advertised to other sites.
10. Check whether the ipconfig details show DHCP as enabled. If it is not enabled, check whether a static IP has
been assigned on the PC. If so, change it to the 'Obtain an IP address automatically' option.
11. WIRELESS: Check these additional things when this issue occurs on the Zwireless or Zguest subnet.
a. Is the VLAN allowed on the AP's trunk port?
b. Check 3 things in the Cisco controller settings: is the SSID added to the AP group, is the SSID added to the
FlexConnect group, and is the SSID allowed in the AP settings? Details on how to check these are in the wireless
document.
c. Filter for the wireless MAC in the logs to get a clue. On the Extreme controller, run 'show event-history on
site-008 | i ZWireless'. If it shows a completed WPA2-AES handshake on wlan 'ZGuest-sm-site' radio
'008-FL1-AP-010:R2', the device has successfully synced with the controller, and we now need to check whether
there is any issue with the DHCP discover, offer, request, and acknowledgement. If no logs are shown for a
particular wireless network, then traffic is not even reaching the controller.
d. If everything else has been checked, configure a port for the Zguest VLAN to isolate whether the issue is
SSID-related or DHCP server-related.
12. If the device is an IP phone, check whether it is powered over PoE (with PoE, the switch port supplies power
over the Ethernet cable, so no separate power cable is needed). Identify the phone's port and run the
'sh power inline' command to see the PoE status for the port. If the operational status is 'off', the phone may
not be getting power over the cable. Turn it on with the command 'power inline auto' in interface configuration
mode so the phone powers on. Ravi said we could probably see problems if a PoE-capable Ethernet cable was not
used for a PoE phone.
13. Enable DHCP debugging or do a packet capture on the end device to check whether DHCP discover and the other
DHCP messages are sent back and forth.
14. Change the DHCP server and test if other troubleshooting doesn't help. However, confirm with others before
doing this.
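For step 2a, the free-address figure is simple arithmetic on the scope range. A minimal Python sketch, assuming the scope's first and last address and the number of active leases are known from the DHCP app; `scope_usage` and its `excluded` parameter (for exclusion ranges and reservations) are hypothetical. As the 9% example in 2a.ii shows, users can still fail to get an IP while some addresses show as free.

```python
import ipaddress


def scope_usage(first: str, last: str, leased: int, excluded: int = 0) -> dict:
    """Summarize a DHCP scope: usable size, free addresses, and percent free."""
    size = int(ipaddress.ip_address(last)) - int(ipaddress.ip_address(first)) + 1 - excluded
    if size <= 0:
        raise ValueError("scope range is empty after exclusions")
    free = size - leased
    return {"size": size, "free": free, "percent_free": round(100 * free / size, 1)}
```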
