The great re-route.

I enjoy writing blogs that explain an issue, detail the troubleshooting, and find a resolution. These posts don’t just showcase the technical side but also the mindset. You don’t have to know every command or every port of every protocol, but knowing how to find that information and investigate an issue is critical. Even the great ‘hunch’ or gut feeling can be a valuable tool; it’s our brain trying to let another part of our body give some input, especially when the brain is overloaded with racing thoughts. Why isn’t this working?

The gut feeling or hunch also comes from many years of experience, so you know it is a valid input that deserves to be checked.

When something doesn’t work as expected and the outputs and symptoms make no sense, my gut says it’s a bug.

I will set the scene as best I can, without a diagram, for an issue I worked on recently.

Server A is in Data Centre A; its traffic traverses a firewall and then an SDWAN cloud (2 x ASRs) to reach Server B in Data Centre B (2 x ASRs). Data Centre B also contains a firewall. The traffic is TCP port 445 (SMB), the mapping of a drive via a script. It had been working until one weekend, when a firewall at Data Centre B started to drop traffic and was rebooted.
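(For context, the drive mapping the script performs is just a standard SMB mount over TCP 445; something along the lines of the command below, with a made-up server and share name.)

net use Z: \\ServerB\Share /persistent:no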

Shortly after the reboot an incident was logged, as the users had been trying to map this drive during a change window at the same time the firewall was playing up. I assumed the firewall issue was the cause, updated the ticket and asked the user to test again when possible.

The issue was still there.

The next step was to check the firewall logs to see if this traffic was traversing the correct path. I searched for the source, destination and port across all firewalls and saw traffic at Data Centres A and B, but also at Data Centre C. Data Centre C is another DC that shouldn’t even be in this path. Only one or two packets were seen at Data Centre C; the rest were at DC A and B.

I decided to install psping on the source server; this tool is excellent. Now that Telnet is all but disabled these days, psping allows you to send a TCP ping to a specific port, testing the path at the transport layer rather than just ICMP.
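As a rough sketch (the count and IP here are placeholders), a TCP psping against the SMB port looks something like this –

psping -n 50 10.x.x.x:445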

I ran psping and observed, making a note of the source ports generated so I knew exactly which packets went missing.

Every eighth or ninth ping would time out, repeatedly. I took a packet capture at DC C and, sure enough, the missing packets, with the source ports I had noted, were there.

How could this be, I wondered? How could the network path send one or two packets to an entirely different DC? At first, I looked for some type of policy-based routing issue but found nothing. I also checked the SDWAN at Data Centre A and saw that the routing table was correct. So how could two packets be routed the wrong way? I did a packet capture on the SDWAN routers at Data Centre A and found an interesting clue. The packets that were sent the wrong way always went through router #2. So, I decided to concentrate on this device.
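(For reference, captures on the routers can be taken with IOS-XE Embedded Packet Capture; a rough sketch, with placeholder interface and IPs, is below. This is one common way of doing it, not necessarily the exact commands used here.)

monitor capture CAP interface TenGigabitEthernet0/0/0 both
monitor capture CAP match ipv4 host 10.x.x.x host 10.x.x.x
monitor capture CAP start
monitor capture CAP stop
show monitor capture CAP buffer brief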

Before I go further into the details, I must explain the topology a bit more. Data Centre A is a spoke site of the SDWAN; it has any-to-any connectivity across the entire fabric and can connect to any hub or spoke.

Data Centre B is a hub site. It has any-to-any connectivity to its various spokes but must use a backbone VPN to reach sites outside its own hub.

Data Centre C is also a hub site and operates the same way as DC B.

So far, we are seeing one or two packets arriving at DC C, and we cannot work out why they arrive at this site. When they arrive, they enter the network and are lost. Remember, DC C is a hub, so to get from DC C to DC B (the destination) it must send the traffic via the backbone, which it tries to do, and the traffic ages out.

So, as the user tries to set up a TCP connection and perform the drive mapping, packets go missing. TCP alerts the source that certain segments were never seen and asks for them to be resent; once again some of those packets are lost, the process repeats over and over, and the connection fails.

So, I returned to the routing table of Data Centre A, as this is the only decision point that could possibly send traffic somewhere else. The only way I thought this could occur is if the route for the destination was missing, because there is a default route from DC A to DC C, effectively sending any unknown traffic to hub site DC C.
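(Checking the route and CEF programming for the destination is just the standard set of lookups; something like the following, with the VRF and destination as placeholders –)

show ip route vrf XX 10.x.x.x
show ip cef vrf XX 10.x.x.x detail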

I checked the routing table repeatedly while my psping was running. It never changed. The route was always present. I checked the CEF table; it didn’t change either, so how was the traffic ending up at DC C? I decided to run the SDWAN tunnel-path command, which shows you which tunnel is being selected –

show sdwan policy tunnel-path vpn XX interface Te0/0/0.152 source-ip 10.x.x.x dest-ip 10.x.x.x protocol 6

Notice protocol 6, which is TCP. When you enter this command, it shows you the tunnel selected. Every time I ran it, it showed me the correct next hop of DC B.

It was at this point I decided I needed more commands and more detail of what happens inside the router itself. I opened a TAC case for ASR #2 at DC A.

I explained the issue to TAC, sent a diagram and my findings, and awaited a response. We did a WebEx and worked on the issue; they first acknowledged that what I was seeing was correct. Two packets were being routed the wrong way, and they confirmed this by doing a special packet capture on the ASR router. It is known as a Datapath Packet Trace, and it shows you exactly how the router decides what to do with a packet.

https://www.cisco.com/c/en/us/support/docs/content-networking/adaptive-session-redundancy-asr/117858-technote-asr-00.html
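(If you want to try it yourself, the Datapath Packet Trace is driven by conditional debugging on IOS-XE; a rough sketch, with the source IP and packet count as placeholders, looks like this –)

debug platform packet-trace packet 128 fia-trace
debug platform condition ipv4 10.x.x.x/32 both
debug platform condition start
! generate the traffic (e.g. the psping), then
debug platform condition stop
show platform packet-trace summary
show platform packet-trace packet 0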

When SDWAN places traffic onto the fabric, it makes a decision regarding the tunnels that are built between sites. This SDWAN router at DC A has two internet transports and two colours. It actually uses both links active/active at both DC A and DC B, so it has a possibility of eight paths to choose from. The command I used above only showed me the currently selected path, not all eight. Every time I ran it, the output never changed, but if you add the ‘all’ keyword at the end it will show you all available paths –

show sdwan policy tunnel-path vpn XX interface Te0/0/0.152 source-ip 10.x.x.x dest-ip 10.x.x.x protocol 6 all

So, I ran this command and the information the router actually uses, the information I had been looking for, appeared. Out of the eight paths available, six were pointing to DC B and two were pointing to a spoke site. Those two had been programmed incorrectly; the spoke site didn’t even advertise an IP range similar to the destination, so it appeared to have been randomly selected and added to this table.

So how did the traffic end up at DC C if it was sent to a spoke site?

The incorrectly programmed spoke site simply had a default route to DC C. Because the router load-shares across all of these paths (ECMP), it cycles through the eight entries from top to bottom: two packets are sent to the random spoke site, follow its default route to DC C and are lost, while the remaining packets are sent to the correct destination. This next hop is known in the SDWAN world as a TLOC; it is built from the system IP of an SDWAN router (plus its colour and encapsulation) and is used as the next hop for traffic.
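(If you want to see the TLOCs a router has learned and the tunnels built to them, a couple of commands worth knowing are below; the output will obviously vary per environment.)

show sdwan omp tlocs
show sdwan bfd sessions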

So how did this happen? And how do you fix it?

The how was a software bug, which is documented here –

https://quickview.cloudapps.cisco.com/quickview/bug/CSCvw61731

While you wait to apply a permanent fix you may need a workaround, and for me it was to reset all SDWAN tunnels. This would cause a small outage while the tunnels were reprogrammed. Luckily for me, over another weekend this prefix and its associated tunnels were reset by the unreliable internet, so the issue was resolved.

The permanent fix was, you guessed it, an upgrade of code.

This was a tough issue to locate, but I trusted my gut feeling even though my brain was saying there was no way it would route two packets out of ten the wrong way every time.

Although this is SDWAN, the tunnel programming is equivalent to a CEF table, where prefixes are programmed in advance so the routing decision is made once and the hardware can do the forwarding. I am sure there is a similar CEF bug out there in some Cisco device as well.

As networks become more advanced, the possibility of software bugs grows as well. So, any time you are troubleshooting, go to your vendor’s bug toolkit, enter your code version, and type some keywords from your issue to see if anything matches up. I am not sure I would have located this myself, because I needed TAC to show me the hardware-level packet tracing command and confirm my theory, but from now on I will be adding the Datapath Packet Trace to my list of tools.

Happy New Year.

~Brad.

Please note: Opinions expressed are my own.
