Troubleshoot Networks Notes
POTS failure
Many systems still run on analog phone lines, from standard phones, fire system
dialers, elevator emergency phones, and ATMs to the trusty fax machine. These
lines can be delivered either by the phone company or from a live VoIP system via an
analog converter. Once they are analog signals on the copper, the troubleshooting
process is the same. The process starts with a tool called a lineman's handset, or a
butt set.
I generally begin my troubleshooting by testing the line right where the device plugs
into it. Simply connect the phone cable directly into the butt set, and test the
line. If the line comes up and dials out, I know I have an issue with the device. If the
butt set gets no dial tone here, I'll go straight to the DMARC, or the closest point to
my connection on the provider's network.
The majority of telco DMARCs consist of a 66 block. It's for this reason that the
majority of butt sets have clips that can connect to 66-block pins. I find that issues
generally originate either with the end device or within the provider's network, which
is why I test these first. Once I locate the proper cable pair at the DMARC, I'll clip
my butt set's test leads onto it. If I still don't get dial tone, then I've effectively
pinpointed the issue as within the provider's network, and I'll contact them to repair
the issue.
Before I call the provider, I'll usually pull their cable out of the DMARC in an attempt
to re-terminate it, just to be sure a connection hasn't come loose or corroded. If I do
get dial tone at the DMARC, I know that the issue is in the cabling between the DMARC
and the end device.
The easiest method to troubleshoot from here is to work from the DMARC towards
the end device, testing each connection in the cable until the fault is located. If no
butt set is available, a standard $5 analog phone can be used in its place and should
work just fine for occasional use. When testing at the DMARC, I can strip about half
an inch of phone cable and wrap the bare wires around the terminals on the telco's
DMARC. Other than this, all testing steps are the same.
This will effectively cut the troubleshooting process in half which can save a
tremendous amount of time. I generally begin by unplugging the cable from the end
device, and connecting it directly to my laptop. If the connection comes up, then it is
an issue with the device. If the connection still fails, then the fault lies somewhere
between the cable and the switch.
Assuming that the connection issue is from the cable to the switch, I'll move directly
to the switch to rule out cable issues.
Using a known good cable, I plug my laptop directly into the switch port. If the port
comes up, I know the issue is with the cabling.
If the port does not come up, I know that the issue is within the switch (the port has
failed, is shut down, or is err-disabled).
If the switch has additional available ports, I'd verify one of them, then move the
connection. If I was able to verify that the issue is with the cabling, I'd test between each
connection of the cabling.
Often, cables run from wall ports to patch panels then into the switch. At this point, if
I have a cable tester, I'll begin to test using it.
If I don't have one available, I can plug my laptop into each segment until I find the
failure.
A standard cable tester will have the Test Unit and a Remote Module. The Remote
Module is connected to the far side of the cable, while the Test Unit is connected
locally.
The cable tester will show me which pair, or even which wire has failed. Nicer test
units, usually somewhere around $80 to $100, will often include a Time Domain
Reflectometer or TDR feature. The TDR can supply the distance to the fault in a
cable. It does this by measuring reflections in a conductor. I can use the TDR feature
to verify if it is a connector that is faulty or if the break is further in the cable.
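The TDR distance calculation can be sketched in a few lines of Python. The pulse travels to the fault and back, so the distance is half the round-trip time multiplied by the signal speed in the cable. The nominal velocity of propagation (NVP) of 0.66 used here is an illustrative assumption, not a value from these notes; a real tester uses the NVP of the specific cable.

```python
# Rough TDR arithmetic: distance = (c * NVP * round-trip time) / 2.
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def distance_to_fault_m(round_trip_seconds, nvp=0.66):
    """Estimate the distance to a cable fault in meters.

    nvp is the cable's nominal velocity of propagation as a fraction of
    the speed of light (0.66 is an assumed, illustrative value).
    """
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    if not 0 < nvp <= 1:
        raise ValueError("nvp must be between 0 and 1")
    return SPEED_OF_LIGHT_M_PER_S * nvp * round_trip_seconds / 2


if __name__ == "__main__":
    # A reflection that returns after 100 nanoseconds.
    print(round(distance_to_fault_m(100e-9), 2), "meters")
```

A reflection only a few nanoseconds out points at the near connector, while a longer delay puts the break further down the run.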
Some modern routers and switches actually have the ability to test cables connected
to them, which can aid in remote troubleshooting.
If it is ultimately determined that the end device is experiencing the issue, and it
happens to be a PC, then a few standard steps can be followed. First, check the
adapter settings and determine if the interface has been disabled. If the interface has
not been disabled and it still doesn't come up, the fault is most likely the interface on the
PC, which should be replaced. A good temporary fix is a USB NIC, which can also be used
for testing purposes if we don't have a laptop.
A loopback connector can also be created to test a NIC or switch port. This is an
RJ45 jack or plug that has the send and receive wires looped back to each other. The
following pins need to be connected for a gigabit loopback: pin one to pin three, pin
two to pin six, pin four to pin seven, and pin five to pin eight. When this is connected
to a NIC or a switch port, the interface should successfully come up.
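The pin pairs above can be written down as data and sanity-checked before crimping. This is only a sketch in Python; the function names are made up for illustration, and the pairs are exactly the ones listed in these notes.

```python
# Gigabit RJ45 loopback pairs as listed in the notes: 1-3, 2-6, 4-7, 5-8.
GIGABIT_LOOPBACK_PAIRS = [(1, 3), (2, 6), (4, 7), (5, 8)]


def loopback_wiring(pairs):
    """Return a dict mapping every pin to the pin it is looped to."""
    wiring = {}
    for a, b in pairs:
        wiring[a] = b
        wiring[b] = a
    return wiring


def is_complete_loopback(pairs, pin_count=8):
    """True when each pin of the plug appears in exactly one pair."""
    pins = [pin for pair in pairs for pin in pair]
    return sorted(pins) == list(range(1, pin_count + 1))


if __name__ == "__main__":
    for pin, peer in sorted(loopback_wiring(GIGABIT_LOOPBACK_PAIRS).items()):
        print(f"pin {pin} -> pin {peer}")
```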
When a new connection refuses to come up, the first step is to ensure the wrong
media type wasn't used. I will ensure the optics on both sides are either all multimode
or all single mode, and that all of the cables used match the optics.
Next, I'll try swapping the transmit and receive connectors on one of my
interfaces. About 50% of the time, when I run a cross-connect between two devices, I
end up connecting both transmit sides and both receive sides together accidentally.
(This is probably because TX on one end has to land on RX at the other end, so the TX
and RX strands must be crossed between the two devices.)
I've yet to see a laptop with a fiber interface, but there are several workarounds to
test with. One is to loop the fiber. If I loop the interface, I plug a cable in the transmit
side then directly connect it to the receive side on the same optic.
I generally do loopback testing with a jumper. Jumpers are simply fiber patch cables
that come in a variety of styles and lengths.
I usually keep a couple of one meter cables with varying connectors to use for quick
replacements, or loopback testing. If I look at the connectors, they are usually
labeled so I can distinguish one strand from another. I take my jumper, using the
same strand, and loop the end-device optic. If the interface doesn't come up, then I
know it is an issue with the end device.
On my end device, I'll verify that the interface isn't disabled in software.
If it isn't, I'll try replacing the optic on the device with a spare. Though methods will
differ between operating systems and hardware platforms, it equates to the
same. Make sure it isn't disabled, then test by swapping.
If I have previously determined that the end device was sound, I will go to the switch
or router port on my side, and perform the loop test there. If the loop test fails, I will
check the interface isn't disabled in software. If the interface is, in fact, enabled, I will
change out the optic, and, if possible, move it to a new switchport. If the switchport
successfully loops, then I will move one connection point further away, and loop the
fiber again. I will continue moving further away from the switch until I locate the
fault. Most infrastructure consists of trunk cables that span large distances, then small
jumpers that connect to switches or end devices.
Switching loop
Switching loops are the bane of every Layer Two network, and will absolutely ruin
your day. If protection mechanisms don't kick in, you can expect heavy packet loss if
you're lucky. If you aren't lucky, you can expect every connected switch to become
unresponsive.
Most often, a loop will begin when a user has a free cable connected to a wall
port. For some reason, when a user sees a dangling cable, they feel the need to plug
it in somewhere, and that can be another open wall port.
Loops can also be seen from VoIP phones. Some phones have an additional
ethernet port allowing users to piggyback their computers off of the phone. If this
piggyback port is plugged back into the switch, it can cause a loop.
When connecting switches together, if multiple ports are connected and STP or
bonding isn't properly configured, then a loop can occur.
I'll first start by saying don't use unmanaged switches in your network. They provide
zero visibility for monitoring and they have no protection mechanisms to speak of. If
I have inherited an unmanaged network, then when a loop occurs, I have little
recourse other than to begin unplugging switches.
Imagine my router plugged into Switch A, that plugs into Switch B, that then plugs
into Switch C. During a loop, first ask everyone if they made any changes to the
network. If not, I would reboot all switches by power cycling them, but I would leave
Switch C disconnected. Did this fix the loop? If yes, then I know my loop is on Switch
C, and I will unplug its uplink port and power it back up. I should see some of the
ports flashing extremely rapidly and in sync. This will likely be my loop. If it wasn't
Switch C, then reboot A and C while leaving B off. I'll repeat the previous
process until the loop source is discovered. The moral of the story is don't use
unmanaged switches.
Theoretically, if everything is properly configured in a managed switch
environment, then a loop shouldn't occur. Most of the switches will have many of the
same concepts, even if they go about configuration in different ways. Obviously
Spanning Tree Protocol should be properly configured with additional features
like Unidirectional Link Detection for optical interfaces, BPDU guard for access
interfaces, root guard for edge switches, filtering VLANs on trunk ports, and so on.
There are a lot of moving parts in every network, and sometimes things fall between
the gaps. If I am troubleshooting, that means I have issues, and at this point, I'm
really just trying to find the source of the issue and mitigate it as quickly as
possible. Depending on the make and model of switch I use and the volume of
loop traffic, the switch will react differently. Some lower-end switches will completely
lock up or become unresponsive. Higher-end switches can often absorb a loop, but
may have degraded service. On these chassis, if service isn't degraded by a
significant amount, I may get a notification saying that a MAC address is showing up
on multiple interfaces or VLANs. If the switched infrastructure becomes
unresponsive, and I can no longer remotely administer it, mitigation is the same as it
was above for unmanaged switches. Administration has to be brought back by any
means. Rebooting or disconnecting may be the quickest.
If remote access is maintained, investigation can begin. If the switch does provide me
with a message stating it sees a MAC on multiple interfaces, I can begin to track this
MAC through the forwarding table. Depending on switch model, this command will
be different, but on Cisco, it will be something akin to show mac address-table. This
will show all of the known MAC addresses and the interfaces they are reachable
on. Cisco allows for some refinement of any show command, which can make it
easier to quickly identify our desired information. In this instance, I'm going to use
the include filter along with the trailing characters of the MAC address: show
mac address-table | include 001.
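When the table is large, the same search can be scripted. The sketch below, in Python, takes several saved copies of the show mac address-table output and reports any MAC that was learned on more than one port. It assumes the common Cisco IOS column layout (VLAN, MAC, type, port); the function names and sample addresses are hypothetical, and other platforms format this output differently.

```python
import re
from collections import defaultdict

# Matches rows such as "   1    0011.2233.0001    DYNAMIC     Gi0/1".
MAC_ROW = re.compile(
    r"^\s*(?P<vlan>\d+)\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s+"
    r"\S+\s+(?P<port>\S+)\s*$"
)


def ports_by_mac(snapshots):
    """Collect every port each MAC was seen on across all snapshots."""
    seen = defaultdict(set)
    for output in snapshots:
        for line in output.splitlines():
            match = MAC_ROW.match(line)
            if match:
                seen[match.group("mac").lower()].add(match.group("port"))
    return seen


def flapping_macs(snapshots):
    """Return {mac: [ports]} for MACs learned on more than one port."""
    return {
        mac: sorted(ports)
        for mac, ports in ports_by_mac(snapshots).items()
        if len(ports) > 1
    }


if __name__ == "__main__":
    first = "   1    0011.2233.0001    DYNAMIC     Gi0/1\n"
    second = "   1    0011.2233.0001    DYNAMIC     Gi0/2\n"
    print(flapping_macs([first, second]))
```

A MAC that moves between two access ports is a strong hint about where the loop is. A MAC that shows up on an uplink means the search continues on the next switch.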
Once I have determined which ports are supporting a loop, if the interfaces are
designated as access ports, I'll shut them down. If one of the ports is an uplink to
another switch, I will connect to that device, and continue with the same
command. I'll simply rinse and repeat until the issue has been mitigated. Loop
mitigation is a messy business, and it's better avoided than experienced.
Duplicate addressing
Having duplicate IPs on your network is a very vexing problem. They can be
troubleshot fairly easily, but can also be avoided in some instances. Symptoms of a duplicate
IP can vary. It often manifests as a user saying they can't access a resource, or have very
intermittent access to something. Normal troubleshooting would be to ping the
client IP which will, generally, work consistently.
When the client tries to ping out, however, it will see intermittent responses. This
asymmetric behavior can be quite perplexing. In networking, one generally sees the
same behavior to and from a host. The problem lies in the fact that since two devices
have the same IP and I ping them from a different subnet, the router that directly
terminates that subnet will arp for the MAC address of the IP owner. It doesn't matter
which end device responds. The router will send to one of them and that device will
respond back, so my pings will always get through. From the opposite perspective, if
I'm the end device and I attempt to ping out, then when the ICMP reply returns, the router may
send the traffic to me or it may send the traffic to the other host with the same IP.
A suspected duplicate IP is fairly easy to test. First I'll start a persistent ping to the
host from my Windows PC with ping -t and the IP address. I should know the switch
port the host hangs off of, so I'll shut that interface down. An engineer could simply
unplug the host too, but I generally do my administration remotely. So the easiest
thing for me to do is to just shut the port down. While the host switch port is
down I'll leave my ping running for about 30 seconds. If my new host is
disconnected and the IP still responds to ICMP, then I have identified the duplicate
IP. Even if it doesn't respond, I will connect to the local router or another host on the
subnet, and attempt the ping. I will then check the ARP table to see if anything
responds with the IP in question. On a Windows host this is arp -a from the CLI, and on a
Cisco device it is show ip arp. Once I have the MAC address of the offender, I'll
track it down to the switch port terminating the rogue device. On a Cisco switch, I
issue show mac address-table to get the MAC address to interface mappings.
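The ARP comparison can also be automated. The sketch below, in Python, parses saved arp -a output from a Windows host (for example, one capture before shutting the port and one after) and flags any IP that answered with more than one MAC. It assumes the usual Windows layout with dash-separated MAC addresses; the function name and sample addresses are hypothetical.

```python
import re
from collections import defaultdict

# Matches rows such as "  192.168.0.10   00-11-22-33-44-55   dynamic".
ARP_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"((?:[0-9a-fA-F]{2}-){5}[0-9a-fA-F]{2})\s+\w+"
)


def duplicate_ips(arp_outputs):
    """Return {ip: [macs]} for IPs seen with more than one MAC address."""
    macs = defaultdict(set)
    for output in arp_outputs:
        for line in output.splitlines():
            match = ARP_ROW.match(line)
            if match:
                macs[match.group(1)].add(match.group(2).lower())
    return {ip: sorted(found) for ip, found in macs.items() if len(found) > 1}


if __name__ == "__main__":
    before = "  192.168.0.10          00-11-22-33-44-55     dynamic\n"
    after = "  192.168.0.10          00-aa-bb-cc-dd-ee     dynamic\n"
    print(duplicate_ips([before, after]))
```

Two MACs for one IP is not proof on its own, since a replaced NIC would look the same, but combined with the port shutdown test it identifies the offender.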
If this is a DNS server within my control, I would connect to it and run tcpdump if it's
a Linux machine, or Wireshark if it's a Windows machine. I would edit the capture
filter to capture only traffic from my test host with ip host and its IP address. If I see
DNS queries coming in from my host, I know that the network infrastructure from
host to server is working, so the issue is likely a configuration setting on the server
(or the return route). If I see no incoming queries from the host, then it is likely that
some kind of filtering is blocking the traffic or a route is missing. I'll
now check any firewalls or ACLs in the path.
An engineer will generally run into IP addressing issues like subnet mask
configuration when integrating a new device. It can manifest as a device not being
able to exit the subnet, or one device on the subnet not being able to reach the
other. In the case of two hosts not reaching each other, it comes down to ARP. Host
A is configured with address 192.168.0.2/24, but Host B is improperly configured with
address 192.168.0.250/25. Host B believes the addresses in its directly connected
subnet range from 192.168.0.129 to 192.168.0.254, while Host A has a subnet range
of .1 to .254. This means if Host B wants to reach Host A, it thinks it needs to send
packets to the default gateway to reach A. In reality, Host B should simply ARP for
Host A's address, then communicate directly. This asymmetric behavior means the
connections either won't work or won't work consistently.
Another manifestation is that a new host can't access the Internet or basic network
resources. In our above example, imagine that the proper default gateway
of 192.168.0.1 is set, but the host is configured for 192.168.0.250/25. This means it's
not in the same subnet as its default gateway and thus can't reach it. Mitigation is as
easy as using ipconfig to verify the configured subnet mask and correcting it.
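The subnet arithmetic in this example can be checked with Python's standard ipaddress module. This is a small sketch using the addresses from the notes; the helper name on_link is made up for illustration.

```python
import ipaddress

host_a = ipaddress.ip_interface("192.168.0.2/24")
host_b = ipaddress.ip_interface("192.168.0.250/25")  # the wrong mask
gateway = ipaddress.ip_address("192.168.0.1")


def on_link(local, remote):
    """True when `local` believes `remote` is in its own subnet."""
    return remote in local.network


if __name__ == "__main__":
    usable = list(host_b.network.hosts())
    print("Host B subnet:", host_b.network)
    print("Host B usable range:", usable[0], "to", usable[-1])
    print("A thinks B is local:", on_link(host_a, host_b.ip))
    print("B thinks A is local:", on_link(host_b, host_a.ip))
    print("B thinks gateway is local:", on_link(host_b, gateway))
```

Host A will ARP for B directly, while B sends everything for A toward a gateway it also considers off-link, which is the asymmetry described above.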
Rogue DHCP
Nothing wreaks havoc quite like a rogue DHCP server. A rogue is nothing more than
an unauthorized DHCP server on a network. First, I'll cover techniques to troubleshoot
and mitigate a rogue. Then I'll cover a couple of methods to prevent it. A
rogue DHCP server can show up for a couple of reasons. A rogue can be used to
perform a man-in-the-middle attack. A malicious user can hand out IPs to other
hosts, becoming a default gateway for them.
Some applications will check for an open port, and these will generally warn me of
the port conflict. Some other applications will simply fail with a generic error. The
best tool to determine if a port is in use is netstat. Luckily, this command is available
for both Windows and Linux, though the syntax is slightly different for each.
On Windows, my go-to command line parameters for netstat are -anob: -a displays
all connections and listening ports (ports of applications that are listening or have a
connection established); -n tells netstat not to resolve IPs to domain names, which
speeds up the command, and I prefer to see IPs at any rate; -o shows the process ID of
the running application (the process ID matters because a single application, like Chrome,
may run several processes with different IDs); and -b shows the name of
the application holding the port open. On newer versions of Windows, such as Windows 10,
I will need to run the command as an administrator.
Since I'm looking for HTTP, I'll look for anything listening on TCP port 80. When I find
the culprit, I can determine by name, what app is holding the port. If it is reporting
something generic, like Java, I can use the Process ID to find and kill it. I'll generally
use task manager to kill the process. Then I'll attempt to fire up my program again.
If I'm troubleshooting on Linux, I'll issue a slightly different command that will yield
almost the same results. Issuing netstat -nap will give me a massive list of
services. I like to use the grep command to filter the output for the specific port. In
our case, the command will look like netstat -nap | grep :80. In Linux, the -n and -a
parameters perform the same function as on Windows; -p replaces the -b and -o
options from Windows, showing which process is specifically operating the
connection. Once I find the offending process, I can use the kill command to kill the
process.
I can now attempt to run my program again. Obviously killing an errant process isn't
the ultimate fix. At this point, I would need to determine why the conflict existed in
the first place.
If I need both applications to be available, one possible solution would be to move
the new application to a different server, or perhaps I could run them both on the
same server and just bind my new application to a different unused port. Port
conflicts aren't all that common. So a little preparation and practice can be the key to
a quick resolution.
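A quick way to confirm a port conflict before starting an application is to try binding the port yourself. This is a minimal sketch in Python, assuming a TCP service on the loopback address; the function name is hypothetical. It only says whether the port is taken. Finding out which process holds it still takes netstat as described above. Binding ports below 1024 usually requires elevated privileges, so a failure there may be a permissions error rather than a conflict.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Try to bind a TCP socket; a failure means the port is unavailable."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # 8080 is just an example port.
    print("port 8080 free:", port_is_free(8080))
```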
Test service connectivity using telnet (especially for HTTP and SMTP)
The vast majority of services I find myself testing will be TCP based: HTTP, SMTP, et
cetera. To test TCP, I need a TCP-based utility. My go-to tool on Windows is
Telnet. This can be done through the Windows telnet utility or via apps like
PuTTY. The majority of HTTP servers will respond when an admin telnets to them on port
80.
For this example, I'll telnet to google.com on port 80. Once connected, I type
GET / HTTP/1.1 and hit enter
twice. You should get some sort of output that indicates you entered a bad request
which means that our test was successful. The fact that anything came back shows
that we have a TCP connection in both directions.
SMTP can also be simply tested with telnet. If I'm unsure of what the IP of a mail
server is, I can use nslookup to verify the domain's MX record. MX stands for mail
exchanger and lists usable mail servers. From the nslookup prompt, I set the query type
to MX by typing set type=MX and hitting enter. I then type the domain name in
question. For this example, it will be google.com. This should ultimately supply you
with the primary mail server's address. I'll highlight it and right click to copy it.
I'll then telnet to the server on port 25. In an app like PuTTY, I set the connection type
to telnet and set the port to 25. Once it
connects, I'll enter the command EHLO gregsowell.com or sometimes HELO
gregsowell.com. If it fails, I'll usually just try typing HELO gregsowell.com again. The
server should respond back preparing to accept an email from me, which confirms
our connectivity.
Telnet won't be able to replicate all services, but I can often verify if a TCP session will
establish, which means network connectivity is there.
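Where telnet isn't installed, the same check can be approximated with a few lines of Python. This sketch opens a TCP session, optionally sends some bytes, and returns whatever comes back first. The function name and defaults are hypothetical. As with telnet, any reply at all (even a bad request error) shows that TCP works in both directions.

```python
import socket


def tcp_check(host, port, payload=b"", timeout=3.0):
    """Open a TCP session and return the first chunk of the reply.

    Returns None if the session could not be established, and b"" if
    the session opened but nothing came back before the timeout.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return None
    with conn:
        if payload:
            conn.sendall(payload)
        try:
            return conn.recv(1024)
        except socket.timeout:
            return b""


# Example usage (not run here because it needs network access):
#   tcp_check("google.com", 80, b"GET / HTTP/1.1\r\n\r\n")
#   tcp_check("mail.example.com", 25)   # SMTP servers greet first
```

For SMTP no payload is needed, because the server sends its greeting banner as soon as the session opens. The mail server name in the comment is a placeholder.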
In Wireshark on my laptop, I'll set the interface to the first VM interface and then set
the filter to udp port 2323 and hit enter. I'll then connect to my test server and start
tcpdump with the command sudo tcpdump udp port 2323. As you can see, my filters
work very similarly between Wireshark and tcpdump. From my laptop, I'll send a few
test packets.
In Wireshark, I should see the traffic leaving my device, which I do. If I don't see it,
then it is likely a local client firewall issue. Switching back to the server's tcpdump
output shows that it is receiving the traffic. If the client is sending but the server
isn't seeing the traffic, I'll begin troubleshooting anything in the path (firewalls, routing issues, etc.)
between the client and server.
By the way, this testing method works just as well for TCP traffic. In the service
provider environment, I'll connect to one of my Mikrotik probes and use the Telnet
tool to test while doing a packet capture. Another tremendous feature is the ability
to VPN into the probe and test similarly to the user.
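The notes don't say which tool sends the test packets. One option is a few lines of Python, sketched below. It fires numbered UDP datagrams at the port being captured (2323 in the example above). The function name and payload text are hypothetical. Because UDP is connectionless, a successful send says nothing about delivery, which is exactly why the capture on the far side is needed.

```python
import socket


def send_udp_probes(host, port=2323, count=3, payload=b"udp-test"):
    """Send `count` numbered UDP datagrams and return how many were sent."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for i in range(count):
            # Number each datagram so it is easy to spot in the capture.
            sock.sendto(payload + b" %d" % i, (host, port))
            sent += 1
    return sent


# Example usage, with a placeholder server address:
#   send_udp_probes("192.0.2.10", port=2323)
```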
WiFi intermittent service
Few things can be as frustrating or problematic as flaky wireless. Usually it comes in as a
report from the user saying, "the wireless is terrible," but by the time I get there, everything is
fine. This could be any of a very large number of issues.
I usually prefer to start at the physical layer and work my way up the stack. The physical layer
can be really tricky depending on the environment I'm running in. Occasionally, I'll be in an
environment out of my control. Think an office complex. I generally can't dictate what other
tenants are allowed to do with their wireless, which means they could be the source of my
issues.
All of my APs will be in some sort of network monitoring system, looking for connectivity
failures (of AP?) or high CPU conditions on the device.
If I've ruled out issues with the access point itself, I'm going to look for interference
issues. Interference usually comes into play with 2.4 gigahertz networks, though it can
affect 5 gigahertz networks also. 2.4 is especially susceptible because it only has three
non-overlapping 20 megahertz channels to work with. To make matters worse, some modern
protocols like 802.11n allow for wider channel widths of 40 megahertz. This means I have
at most two channels to work with. A lot of consumer-grade routers will default to these 40
megahertz channels.
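The three-channel limit falls out of simple arithmetic, sketched here in Python. Channel centers in the 2.4 gigahertz band are 5 megahertz apart, so neighboring channels overlap heavily. The sketch assumes channels 1 through 11 and the classic 22 megahertz 802.11b channel width as the overlap threshold; both are assumptions for illustration, and the function names are made up.

```python
def center_mhz(channel):
    """Center frequency in MHz of a 2.4 GHz channel (1 through 13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this formula only covers channels 1-13")
    return 2407 + 5 * channel


def overlaps(ch_a, ch_b, width_mhz=22):
    """True when two channels of the given width share spectrum."""
    return abs(center_mhz(ch_a) - center_mhz(ch_b)) < width_mhz


def non_overlapping_plan(channels=range(1, 12), width_mhz=22):
    """Greedily pick channels that do not overlap any already chosen."""
    plan = []
    for ch in channels:
        if all(not overlaps(ch, used, width_mhz) for used in plan):
            plan.append(ch)
    return plan


if __name__ == "__main__":
    print(non_overlapping_plan())  # [1, 6, 11]
```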
Interference really bites you when two access points within range of each other are
transmitting information heavily on the same channel. WiFi is a contention-based
medium. Think of the radios as walkie-talkies. Only a single person can communicate on a single
channel at a time. If I want to send information, my NIC will listen to ensure the channel is
quiet, then attempt to send. If the NIC detects that there was a collision with another device
transmitting at the same time, it will wait a random amount of time, wait for the channel to
clear, then attempt to send again. Now, imagine there's another AP within range, and their
users are transmitting on the same channel. The easiest way to detect this interference is to
use a WiFi Analyzer. There are some simple free ones that will run on a laptop, or to make
things even easier, I can run one on a mobile device.
I prefer WiFi Analyzer on my Android devices. This will display SSIDs, the channel they are
running in, and their signal strength. The easiest thing to do is walk around areas users will
be working wirelessly in, and look for the least utilized frequency, then switch the AP's
channel to this preferred frequency. Ensure your own APs aren't the source of
interference. When I have complete control of the wireless space, I should be able to create a
channel plan that will maximize signal distribution.
If I can't detect other APs causing interference, then it could be from another device that's
not necessarily WiFi gear. The 2.4 gigahertz range can also be used for things like wireless
mics or baby monitors. Believe it or not, microwave ovens run in the low 2.4 gigahertz range
and can cause issues. If I'm in the kitchen at work and I'm microwaving my burrito, I tend to
lose access to the AP. In business, I hear that sales cures all. Well in WiFi, more spectrum
does just about the same. If I find that I'm running 2.4 in a supersaturated environment, then
I'll look at switching to the 5 gigahertz frequency. Depending on what country I'm installing
in, I can expect over 40 20-megahertz channels to work with. Not all legacy equipment will
support the 5 gigahertz range, nor will 5 gigahertz penetrate objects like walls as well as 2.4. So I'd
suggest testing before doing a full deployment.
Destination server or printer or website could not be accessed (assuming device gets IP)
The troubleshooting below assumes the device gets an IP. If the device doesn’t get an IP, check another article.
1. Trace the source and destination devices and check whether firewall rules allow access at both ends for the
necessary ports. If not, allow it.
2. If firewall rules permit access, ask the user for traceroute, ping and telnet output and troubleshoot based
on where it is failing.
a. Even if traceroute completes successfully, that does not confirm the destination can be
accessed on a particular port, since traceroute works as long as ICMP packets are allowed.
b. If the user gets an ‘operation not permitted’ error on the server when attempting a
traceroute, ask someone with administrative privileges to run traceroute, ping, etc.
If that person also gets the error, check whether traceroute is permitted in iptables.
c. Traceroute and ping are not allowed from one side of the firewall to the other. Running the
traceroute command to the destination with the source default gateway explicitly defined won’t
work on the firewall. See the firewall characteristics/features article for more info.
d. If traceroute drops after crossing our firewall, it could be either a route issue with the ISP or the
destination server blocking. Check the last hop IP’s name, put the destination IP in
ip2location.com, and run a traceroute from an internet site to determine whether it is an ISP or
destination server issue.
3. Ping:
a. To check if the destination server is up and reachable, ping internal or internet-based destination
IPs from VDI as well as from the user’s PC. Additionally, ping internet IPs from the internet if there is no
response when pinging from VDI. This will help to isolate whether this is a user subnet issue, a
zebra network issue, or a destination server issue.
b. If unable to ping the IP from the user’s PC or VDI, try pinging from the core switch or firewall where the
destination device is connected. This step helps when ping is blocked by access list rules
or when a route/return route is not available for a particular subnet/site.
c. If unable to ping from the core, also try pinging firewall subnets and internet IPs from the firewall to
check for an SFR issue and the other issues mentioned above.
d. You can’t ping the default gateway of any interface on an ASA firewall except the interface to which
you are directly connected. You can’t ping from a PC to its default gateway on a PA firewall if
a management interface allowing ping is not associated with the layer 3 interface on which the
default gateway IP is defined. On a PA firewall in CLI mode, the source IP has to be defined when
pinging any IP (source IP in the same subnet).
e. If the ARP table and MAC address table show the device info but the IP can’t be pinged, clear the ARP
table and MAC address table, shut and unshut the port, and then check again.
4. Access destination URL in browser:
a. Access the destination URL from VDI and from the user PC and compare. Is there a load balancer or
ping id authentication page loading before the destination URL? If so, this also has to be allowed
at the source and load balancer ends. This can be seen when we enter the destination URL on the user’s
PC and the URL changes to another (sometimes we may not see this redirection when loading the
destination URL in VDI, as it may happen too fast to notice).
b. For internet sites, if unable to access from VDI or the user PC, additionally access the destination
website from a mobile device so that we can isolate the issue to either the destination site
side or the zebra network side.
5. Do a log capture in ASDM for the source or destination IP bi-directionally. This shows whether the request is
reaching the firewall and passing through successfully, whether we are getting a response, whether anything is
getting blocked, whether our IP needs to be whitelisted on the destination side, whether the RDP service is not
running on the destination server, etc. Also check whether packets are reaching the firewall by looking for a
hitcount increase on the access list rule. If ASDM log capture doesn’t work, install Wireshark and do a packet capture.
a. If the log capture shows bi-directional traffic passing successfully, then check on the server
side.
6. Check whether an unopened, unknown port is the reason the destination is unreachable by adding
an ip any any rule.
a. Identify unknown ports by log or packet capture, by an ACL rule with the log keyword on the router,
by checking with the user, by reading documentation, or by visiting the support site.
b. Check if any extra port number is mentioned in the destination URL. Ex: if the URL is
http://crp1-omwinternal.zebra.com:9431/soa-infra/services/default/ZEB3PLDHLRequestorSe, then
access needs to be allowed for port number 9431.
c. If we are allowing access by application, check which protocols the application
‘depends on’ and ensure those are allowed as well. The ports associated with an application are one
thing and the protocols on which the application depends are another, but both need to be
opened.
7. For internet-based destination IPs, check these additional things:
a. Ensure NAT exists, and ensure it exists for the correct ISP interface in case of multiple ISPs
(private to private NAT may apply for the 157.235.XX.XX and 10.11.XX.XX subnets). If the
user can ping the default gateway but traceroute shows packets not leaving our device, check
NAT.
c. If traceroute drops after crossing our firewall, it could be either a route issue with the ISP or the
destination server blocking. See the traceroute section. Does the site administrator have to
whitelist our IP in order for us to access the destination? Is the correct IP whitelisted?
d. If netscope is running, bypass it and check.
e. If access is allowed for the URL (using FQDN) instead of the IP, verify by allowing access to the URL’s IP, as
FQDN sometimes doesn’t work. This might happen when the URL has a lot of IPs.
8. If the user is unable to access an internal IP from the internet, check these things:
a. Check if DNS resolves the URL to the public IP from the internet and to the private IP
internally. Compare to isolate whether it is DNS related. The cause below could also sometimes produce a DNS
resolution issue.
b. Apart from static NAT, firewall rules, routes, and server firewall rules, additionally check whether a route
exists in the INET router/VRF that sits between the firewall and internet traffic. My guess is that DNS might
still resolve, since DNS records may be sent to the internet on a separate IP/port.
9. Check for issues on the server side:
a. If the source or destination server is a Linux server, then check iptables or firewalld on BOTH
source and destination servers to see if access is allowed (iptables is like an internal firewall running
only on Linux servers. Some Linux servers can also have firewalld running on them).
i. If a proxy server comes before the application server, then iptables needs to be bypassed
on both servers. See the separate article for info on proxy servers.
b. Check if the subnet mask and default gateway are configured properly on servers. Once a subnet
mask was wrong on a server, so that server was inaccessible from internal servers but
accessible from a VPN connection for some reason.
c. Once Hitesh said something about a load balancer. Check with the server team for anything
related to this depending on the issue.
d. Once a transfer failed when the user initiated traffic from the 017 site server to his VPN PC. However,
we later found that some traffic also gets initiated in the reverse direction, unknown to the
user, and that was getting blocked.
e. If the server is hosted in an AWS environment, then rules need to be allowed in AWS as well. They
will usually be connected through a PA site-to-site VPN tunnel, and if PA allows the rules, then we
need to check on the AWS side.
10. DNS. Run nslookup against the destination URL. If nothing resolves, or if the destination can be
pinged or accessed by IP but not by URL/hostname, this may indicate a DNS issue.
a. Ensure a DNS server IP is shown in the ipconfig /all output.
b. Check that the appropriate DNS server is configured on the device. Try to resolve the URL from
your own PC. If it resolves there, the DNS server IP configured on the user’s device is the
one failing to resolve.
c. Ping the DNS server from the VDI and from the user’s PC to check that it is up and reachable.
d. Change the DNS server on the user’s PC and check again. Also try to ping or traceroute the
destination URL from the core switch; if the DNS on the switch resolves the URL to a
different IP, change the DNS server IP on the user’s PC.
e. Check that the DNS server IP is allowed in the access list applied on the server’s interface.
Filter the access list with the keyword “domain” to find the object group associated with the
DNS servers.
f. If the destination URL ends with a domain name other than zebra.lan, we need to ensure that
the domain name is added to ‘some centralized resolution database??’ so that the DNS
server can resolve the URL when we enter it. Otherwise the DNS server may append the usual
domain name zebra.lan to the end of the URL and try to resolve that instead, which returns
no IP because no such URL exists.
g. A newly commissioned server needs to be allowed access to the DNS servers, LDAP, etc. in
order to sync with the zebra.LAN domain and resolve hostnames. This is the first step before
anything else can happen. In firewall 10.80.254.49, check ‘access-list lutron_access_in’
and filter for IP 10.80.55.75 to see the IPs and ports to be opened for a new server.
h. In one case, the failure to resolve was due to an issue on the server. See the article “User can
access destination server using IP but not using URL”.
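The decision in item 10 (works by IP but not by name, or resolves to a different IP) can be sketched
in Python. This is only a minimal sketch, not a tool we actually use. The function names resolve and
classify_dns_symptom are made up for illustration, expected_ip is whatever IP we already know the
destination answers on, and ip_reachable is the result of our own ping or access test by IP.

```python
import socket


def resolve(hostname):
    """Return the sorted IPs the local resolver gives for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def classify_dns_symptom(resolved_ips, expected_ip, ip_reachable):
    """Hypothetical helper: turn the observations from item 10 into a verdict."""
    if not ip_reachable:
        # Even the IP fails, so look at routes, NAT and firewall rules first.
        return "not DNS: destination unreachable even by IP"
    if not resolved_ips:
        return "DNS suspect: name does not resolve but IP works"
    if expected_ip not in resolved_ips:
        return "DNS suspect: name resolves to a different IP"
    return "DNS looks fine"
```

For example, classify_dns_symptom(resolve("app.zebra.lan"), "10.80.55.75", True) would flag DNS when
the (hypothetical) name app.zebra.lan does not resolve from the machine running the check. Running
it once from our own PC and once from the user’s PC gives the comparison described in (b).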
11. Check that the route and reverse route exist in ALL devices in the path.
b. If the user is unable to access an internal server from the internet, check that a route exists
for the public IP in the internet router/INET VRF, as it sits between the firewall and the
internet.
c. If the source or destination is in a 192.168.xx.xx subnet, check the route for these lab
networks, as route info for these subnets is sometimes available only to devices within the
site. Dhivahar once said that the same 192.168.xx.xx subnet may exist at multiple sites. If so,
and if the user insists on access to other sites, we may need to NAT to a 10.xx.xx.xx subnet,
which is in L3’s scope. Routing is also slightly different for the 157.235.XX.XX and
10.11.XX.XX subnets.
d. If the destination is connected to a network device to which we don’t have access, check for
a return route issue or a firewall rule blocking the traffic by comparing traceroute and ping
to the destination from 2 different sites.
e. Once, the route for a destination in the core was shown as load balanced across 2 firewalls,
the normal firewall at the site and the ASA DMZ firewall at the site, due to a faulty
redistribution config in the ASA DMZ firewall.
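The “route and reverse route in ALL devices” check in item 11 amounts to a longest-prefix lookup
toward the destination and toward the source on every hop. Below is a minimal Python sketch of that
idea using the standard ipaddress module. The device names, prefixes and next hops are invented for
illustration, and the routing tables are assumed to have been copied by hand from each device.

```python
import ipaddress


def best_route(routes, ip):
    """Return (prefix, next_hop) for the longest prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = []
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net, next_hop))
    if not matches:
        return None
    net, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), next_hop


def check_path(tables, src, dst):
    """For each device, look up the forward route (to dst) and reverse route (to src)."""
    return {
        device: {"forward": best_route(routes, dst), "reverse": best_route(routes, src)}
        for device, routes in tables.items()
    }


# Hypothetical tables: the firewall has no route back to the source subnet.
tables = {
    "core": {"10.20.0.0/16": "fw1", "0.0.0.0/0": "isp"},
    "fw1": {"10.20.0.0/16": "directly connected"},
}
report = check_path(tables, "10.10.1.5", "10.20.3.4")
```

Any device whose "forward" or "reverse" entry comes back as None is the place to look for a missing
route. A match on only the default route (0.0.0.0/0) is also worth a second look for internal
destinations.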
12. Bypass the source and destination IPs from SFR monitoring in ALL appropriate firewalls, then check
whether it helps. (Once, the firewall rules allowed access, but neither the user nor I from the VDI
could access an internet-based site. I could not traceroute the site’s IP from the core, although I
could traceroute other sites like Google (8.8.8.8), and I could traceroute this site successfully from
the firewall. So I excluded this site from SFR monitoring in the firewall at the user’s site. After
that I could traceroute from the core and the user could access the site. This was only temporary
testing and we put the rule back. Dhivahar asked us to send a ticket to GISO for approval to bypass
this site in SFR. Once we get approval, we need to move the ticket to the L3 team to make the bypass
permanent.)
13. RDP issue: for RDP to work, the things below need to be enabled on the PC in addition to firewall
rules (local IT takes care of RDP issues on PCs, not the Windows server team).
a. The user’s PC must be added to the Active Directory security group.
b. Under Settings, System, Remote Desktop, the Enable Remote Desktop button must be turned
on.
c. The user must be a member of the local group "Remote Desktop Users" (check with net
localgroup "Remote Desktop Users").
d. For RDP access to servers, the Windows server team said they will add the user to a group
that allows RDP access.
14. When the source or destination is a relay server, be careful and double check the true source or
destination IP. In one case the relay server worked much like the ip helper-address command: it was
only forwarding traffic from the source server to the destination.
15. Check in FMC whether access to the server is permitted based on the user’s user ID/group.
16. If a user is statically configuring an IP on a host machine, server or printer, ensure that the correct
subnet mask and default gateway are entered. For printers, in a few cases we have found the static IP
on the printer clashing with its DNS record (verify using nslookup).
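The subnet mask and default gateway check in item 16 can be sanity-tested before the values go onto
the device. The sketch below is a minimal Python example using the standard ipaddress module. The
function name check_static_config is hypothetical and the addresses in the usage note are sample
values only.

```python
import ipaddress


def check_static_config(ip, mask, gateway):
    """Return a list of problems found in a static IP/mask/gateway triple."""
    try:
        iface = ipaddress.ip_interface(f"{ip}/{mask}")
        gw = ipaddress.ip_address(gateway)
    except ValueError as exc:
        # Covers malformed addresses and non-contiguous masks such as 255.255.0.255.
        return [f"invalid IP, mask or gateway: {exc}"]

    problems = []
    net = iface.network
    if gw not in net:
        problems.append(f"gateway {gw} is outside the host subnet {net}")
    if gw == iface.ip:
        problems.append("gateway is the same address as the host")
    # Network and broadcast addresses are not usable host addresses (except on /31 and /32).
    if net.prefixlen < 31 and iface.ip in (net.network_address, net.broadcast_address):
        problems.append(f"{iface.ip} is the network or broadcast address of {net}")
    return problems
```

An empty list means the triple is at least self-consistent. For example,
check_static_config("10.80.55.75", "255.255.255.0", "10.80.56.1") reports that the gateway is
outside the host subnet. It cannot tell whether the mask matches what is configured on the switch
or firewall, so that still has to be compared by hand.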
17. If the source, the destination or both are behind a device to which we don’t have access, check with
the local IT contact whether access list rules are applied on that device. Try both the local username
and the TACACS username to log in. If the login disclaimer mentions another brand, such as HP, we can
be sure we don’t have access. If both source and destination are behind such a device, ask for a
traceroute to learn which devices sit in between.
18. Based on the error message, consider whether this could be a Java, browser or other application
error. For example, if it shows Status 500 and Java errors, it may be a Java error and Java may need
to be updated.
19. Once, users in one particular subnet could not access an internal website, while users in other
subnets at other sites could, even though firewall rules permitted access from all subnets. The
cause was corrupted entries in an LDAP group shared by the affected users. Once the LDAP group
cache was cleared, the users were able to log in and work properly.
20. Do this and the 2 steps below only as a last resort, if all other troubleshooting doesn’t help. The
context of the problem is also important when doing this.
a. Sometimes restarting the destination server helps.
21. Check whether other components are involved in this traffic flow and restart them if necessary.
What to do when a device does not get a DYNAMIC IP or gets an APIPA IP after connecting to a switchport
A 169.254.xx.xx IP is an APIPA address. A Windows machine allocates such an address to itself when it
is not able to get an IP address from the DHCP server.
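APIPA addresses come from the IPv4 link-local range 169.254.0.0/16, so spotting one in a list of
collected addresses is a simple range test. The sketch below uses Python’s standard ipaddress
module. The function name is_apipa and the sample addresses are for illustration only.

```python
import ipaddress

# IPv4 link-local range used for APIPA self-assignment.
APIPA_NET = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(ip):
    """Return True if ip is a self-assigned APIPA (IPv4 link-local) address."""
    return ipaddress.ip_address(ip) in APIPA_NET


# Sample addresses as they might be reported by users; values are made up.
reported = ["169.254.12.7", "10.80.55.75"]
needs_dhcp_check = [ip for ip in reported if is_apipa(ip)]
```

Any address that lands in needs_dhcp_check means the device never got a DHCP lease, so the checks
in the list that follows apply to its switchport and VLAN.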
1. All of these things need to be in place before a device can get a dynamic IP:
a. Check that the port is configured in the correct VLAN.
i. Once, the access VLAN and the voice VLAN were the same VLAN and no IP was received.
After removing the voice VLAN, the IP was received.
b. Check that the VLAN is allowed through the trunk between core and access, as well as between
core and firewall if applicable. Ensure it is allowed on both ends of the trunk.
i. Also verify the status of the ports in the PO.
c. Check that the DHCP server IP/DHCP relay is defined for the VLAN in the core switch or firewall.
d. Check that the subnet is added to the DHCP server. Even if a DHCP scope has not yet been created
for a subnet in the DHCP server, any static IPs assigned to devices will still show up in the
ARP table.
e. Ensure the L2 VLAN exists and check whether it is shut in the core switch. If so, unshut it, but
confirm with Dhivahar before doing so.
f. Allow the DHCP application from the subnet in the Palo Alto firewall security policy. Refer to
CHG0069458 for a sample subnet config. If a host machine is in a subnet connected to the
firewall, the ASA firewall rules don’t need to allow DHCP ports for the DHCP request to go
through. See the article “Firewall access list rules apply only for traffic PASSING through the
firewall….”