This was a fun one. You come in to work on a Friday morning hoping for an uneventful segue into the weekend, and your colleague looks up from his fast-scrolling terminal screen and says “we may have a problem”.
Unable to download NAT policy for ACE
The change he had implemented was simple enough: add a couple of new sub-interfaces to the Cisco ASA firewall, add the required security ACLs and configure the NAT and no-NAT (NAT0) rules. The firewall was still running pre-8.3 code, 8.0(4), so it used the older NAT syntax.
The problem arose when the no-NAT config was applied, specifically when adding ACEs to the ACL referenced by the no-NAT statements on the new interfaces. The firewall threw up the following message:
Unable to download NAT policy for ACE
In the context of the above sequence of events, this message isn’t actually that obscure. Prior to version 8.3, the Cisco ASA uses policy-based NAT. For the no-NAT, it uses an ACL to decide which traffic should not be NAT’d as it comes into an interface. As the new ACEs were being put into the firewall, the above message was effectively telling us that the firewall was unable to apply them to the no-NAT policy. So the ACE shows up in the config, but it has no effect.
In addition to this, we had also lost management access to certain networks through the firewall as part of this change.
The config was rolled back as a matter of course but the issue remained. Running packet tracer on the firewall showed that the issue was down to the no-NAT, although comparing the config with a backup showed no differences.
Based on our gut feelings and the message we saw, the NAT0 statement was removed and re-added and the issue vanished. Searching Cisco.com brought up this bug (CSCsl46310). Cisco recommend reloading the firewall as a workaround prior to reapplying the NAT0 statement, but that wasn’t required in our case.
Known fixed releases are supposedly 8.2(0.79), 8.0(3.2) and 8.1(0.130), although on the download site, 8.2.5 is a recommended version so I think that will be my first stop.
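As a rough sanity check, the running release can be compared with the fixed release listed for the same train. The Python below is only a minimal sketch: the helper names are made up for illustration, and it assumes version strings of the form major.minor(maintenance) or major.minor(maintenance.interim), ordered naively as numbers. By that naive ordering, 8.0(4) sorts after 8.0(3.2), which fits the "supposedly" above. Whether a given image really carries a fix is a question for Cisco's release notes, not for this comparison.

```python
import re

# Fixed releases quoted in the bug entry for CSCsl46310 (from the text above).
FIXED_RELEASES = ["8.2(0.79)", "8.0(3.2)", "8.1(0.130)"]


def parse_asa_version(text):
    """Turn '8.0(3.2)' into (8, 0, 3, 2); a missing interim build counts as 0."""
    match = re.fullmatch(r"(\d+)\.(\d+)\((\d+)(?:\.(\d+))?\)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, maintenance, interim = match.groups()
    return (int(major), int(minor), int(maintenance), int(interim or 0))


def fixed_release_for_train(running, fixed_releases=FIXED_RELEASES):
    """Return (fixed release for the same major.minor train, running >= fixed).

    Hypothetical helper: the ordering is a plain tuple comparison and is an
    assumption, not Cisco's own definition of which image contains a fix.
    """
    running_version = parse_asa_version(running)
    for fixed in fixed_releases:
        fixed_version = parse_asa_version(fixed)
        if fixed_version[:2] == running_version[:2]:
            return fixed, running_version >= fixed_version
    return None, False


if __name__ == "__main__":
    print(fixed_release_for_train("8.0(4)"))
```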
It is actually a pleasant surprise when a bug at least produces behaviour and a system message that can be used to troubleshoot without too much effort.
Just in case anybody was thinking ‘wow, this guy must have a cushy job, only three trouble tickets over the last 18 months’, then think again. The purpose of these trouble ticket posts is to write about those issues that don’t occur every day, the crafty quirks that have you scratching your head. In this case, I was witnessing something with the Cisco Secure Desktop feature that didn’t make sense and was frustrating to say the least.
Cisco Secure Desktop
For those not in the know, Cisco Secure Desktop (CSD) is a feature of the Cisco ASA platform that allows remote access VPN clients to go through a number of health checks before being allowed to connect. In the scenario discussed here, I had configured a couple of client certificate checks and wanted to add a third for a specific client type.
The diagram above shows the third certificate check added at the bottom right. If this third check fails, the client is denied login.
When I saved the changes, everything looked as it should. The policy in the diagram above is not saved in the normal configuration file on the ASA but in this file: ‘flash:/sdesktop/data.xml’. As you can see from the snippet below, immediately after entering the third certificate check, the GUI and CLI agree on the configuration (certain details changed\removed):
The problem reared its ugly head when I closed ASDM and reloaded it. The XML file always shows correctly what the ASA is doing, but ASDM was now missing that third check in the GUI, so it appeared to be carrying out only the original certificate checks. That meant that to delete the third check, or add yet another check, I would need to hack the XML file.
I tried reloading the ASA. That made the GUI update to reflect the current XML file, but again, as soon as I made any changes and reloaded ASDM, the GUI would revert to the two certificate checks, regardless of what other checks I added. It was as if there was a cache somewhere.
I upgraded CSD to the latest version and still no joy. At that point, I got Cisco TAC involved. The person I was dealing with was helpful and took all the details so he could lab it up. At first he reported he couldn’t replicate the issue, but a few days later he got back to me to report that it was occurring, but intermittently.
He requested some Java debug logs, which I duly gave and then waited for about a week while he escalated it to the development team.
Finally, he came back to me to advise that this was being listed as a new bug:
Lucky me, I got my first bug listed. Something tells me it won’t be my last. There is a better workaround than rebooting the ASA each time though:
ASA(config-webvpn)#no csd enable
ASA(config-webvpn)#csd enable
Basically, disabling and re-enabling CSD re-syncs the ASDM with the XML file, so until the bug is fixed, if you see this issue, I’d suggest doing this workaround before you make any changes to your CSD policies. FYI, I was using version 1.7 of Java and apparently 1.6 doesn’t have the same problem. So, to manage your firewall security device, feel free to use a less secure Java client. Sigh…
Welcome to my 2nd trouble ticket post. It has come a bit later than I would have liked but I have been having all sorts of other fun to deal with, and let’s not forget the two weeks of the Olympics when I hung up my study cap completely.
To recap, in this category of post, I cover an issue I have come across that stands out for one reason or another. In Trouble Ticket #1, I covered a problem that put Cisco TAC on the back foot. In this ticket, I discuss an initially strange-looking problem which did get resolved but not before I got led on a wild goose chase.
The problem reared its head when I was trying to determine why, from a management station, the first /26 of a /24 appeared to be reachable yet the second /26 was not. What stirred things up was a traceroute to an IP that was known to be available: a hop that I thought should be in the path was missing. The diagram below is a simplified topology.
When I did a traceroute from MGM1 to host 172.16.3.66, it only got as far as the default gateway, 172.16.1.1, and died. A traceroute to 172.16.3.2 got there but with the following hops:
1. 172.16.1.1 (default gateway, expected)
2. 172.16.2.1 (layer 3 switch that routes to the 172.16.3.0 subnet, again expected)
3. 172.16.3.2 (the destination host itself, expected)
What I didn’t see, and was expecting to see between steps 1 and 2, was the ASA firewall at 172.16.1.3. There was no ‘Request timed out’ message. The step was missing entirely. For the curious, the 172.16.1.1 address is the gateway address for all management hosts but holds routes for all the infrastructures it needs access to, hence the two hops within the same subnet.
I started to check the path being taken and as both traceroutes showed a successful reply from the GW, I ruled out any shenanigans on MGM1, but double checked the hosts file and ARP cache for anything untoward. Next, I checked out the GW and this is where the red herring reared its ugly head. Despite the routing table telling me that the next hop for both destination subnets was the firewall, I could see from the port descriptions (not always to be trusted!) and CDP output (more reliable!) that there was a switch directly connected that led through to the same subnets. I traced the path using CDP and found a 2nd layer 3 path but this one wasn’t firewalled. So here is the updated diagram:
The strange thing was, it was the same number of hops either way so there was still something missing regardless of which path was being taken. I decided to err on the side of caution and went back to the GW, checking things like ARP caches and CEF adjacencies. Everything looked as it should.
So off to the Cisco Support Forums I went and after several attempts to craft the correct search, I came across the answer I had been looking for. It turns out that there was something the firewall was not doing, and this was hiding it from the traceroute output. In brief, a traceroute works by sending an ICMP packet to the destination IP with a TTL of 1. When the next hop (in the first case, the default gateway) receives this packet, it decrements the TTL to 0 and sends a ‘TTL expired in transit’ message back to the source. The source then sends another ICMP packet to the destination with a TTL value of 2. The GW will forward this on to its own next hop for the destination, after decrementing the TTL to 1. As this packet hits the next hop, the same ‘TTL expired in transit’ message is sent back to the source. This continues until the destination hopefully responds. The traceroute command can therefore display each hop by using the IP address of every device that responds with these ICMP messages.
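The mechanism is easy to model. The Python below is a toy simulation rather than a real traceroute: the Hop class, the decrements_ttl flag and the traceroute function are all invented for illustration, and the path is the firewalled one from this post. Setting decrements_ttl to False models a device that forwards packets without touching the TTL.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    address: str
    # Hypothetical flag: False models a device that forwards without
    # decrementing the TTL, so it can never send 'TTL expired in transit'.
    decrements_ttl: bool = True


def traceroute(path, max_hops=30):
    """Return the responder seen for each probe; the last entry in path is the destination."""
    responders = []
    destination = path[-1].address
    for initial_ttl in range(1, max_hops + 1):
        ttl = initial_ttl
        responder = destination  # reached if no transit hop expires the TTL
        for hop in path[:-1]:
            if not hop.decrements_ttl:
                continue  # packet passes through with its TTL unchanged
            ttl -= 1
            if ttl == 0:
                responder = hop.address  # 'TTL expired in transit'
                break
        responders.append(responder)
        if responder == destination:
            break
    return responders


def build_path(asa_decrements_ttl):
    return [
        Hop("172.16.1.1"),                                     # default gateway
        Hop("172.16.1.3", decrements_ttl=asa_decrements_ttl),  # ASA firewall
        Hop("172.16.2.1"),                                     # layer 3 switch
        Hop("172.16.3.2"),                                     # destination host
    ]


if __name__ == "__main__":
    print(traceroute(build_path(asa_decrements_ttl=False)))
    print(traceroute(build_path(asa_decrements_ttl=True)))
```

With the flag set to False the firewall address never appears in the result, even though every probe still passes through it.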
The root cause of my issue was that the ASA was not decrementing the TTL. Therefore, the ICMP packet was forwarded on from the firewall to the next hop with a TTL of 1, where it was replied to with the standard ‘TTL expired in transit’ message. In this way, the traceroute would still complete, but the ASA would be hidden from the output. OK, I can see how that might be useful from a security point of view but it makes troubleshooting a real pain in the backside, so let’s look at how to change this behaviour from configuration mode:
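firewall(config)#class-map ttl-decrement
firewall(config-cmap)#match any
! This creates a new class map to match any traffic (the class map name shown here is illustrative)
firewall(config-cmap)#policy-map global_policy
firewall(config-pmap)#class ttl-decrement
firewall(config-pmap-c)#set connection decrement-ttl
! This is the key command. It decrements the TTL on all traffic passing through the firewall
firewall(config-pmap-c)#exit
firewall(config-pmap)#exit
! Exit out to configuration mode
firewall(config)#service-policy global_policy global
! This makes the policy active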
The traceroute output was now as I was expecting:
Tracing route to 172.16.3.2 over a maximum of 30 hops
1 <1 ms <1 ms <1 ms 172.16.1.1
2 <1 ms <1 ms <1 ms 172.16.1.3
3 2 ms <1 ms <1 ms 172.16.2.1
4 <1 ms <1 ms <1 ms 172.16.3.2
From the contextual help on the ASA’s CLI, it appears that this behaviour is applied to all IP-based traffic, not just ICMP traffic. It should be noted that the config above only applies to ASA versions 8.0(3) and later. It should also be noted that the initial issue that got me to this point (part of a subnet responding, part of it not) was down to the fact that this subnet had previously been addressed differently. When the subnet was enlarged to a /24, all devices were readdressed correctly and the security ACLs on the firewall were updated, but there was a NONAT ACL that was still configured for the previous /26 subnet. I updated that too and return traffic, now matching the NONAT ACL, was not NAT’d and was returned as expected.
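The half-working subnet is easy to see with a little prefix arithmetic. The Python sketch below uses the standard ipaddress module and makes some assumptions: that the enlarged subnet is 172.16.3.0/24, that the stale NONAT ACL entry covered its first /26, and that the ACL match can be reduced to "is this host inside the network". A real ACE matches on both source and destination, and the names here are invented for illustration.

```python
import ipaddress

# Assumed values: the stale NONAT ACL entry and its corrected replacement.
STALE_NONAT = ipaddress.ip_network("172.16.3.0/26")
UPDATED_NONAT = ipaddress.ip_network("172.16.3.0/24")


def exempt_from_nat(host, nonat_network):
    """Simplified check: True if the host falls inside the NONAT ACL network."""
    return ipaddress.ip_address(host) in nonat_network


if __name__ == "__main__":
    for host in ("172.16.3.2", "172.16.3.66"):
        print(host,
              "stale ACL:", exempt_from_nat(host, STALE_NONAT),
              "updated ACL:", exempt_from_nat(host, UPDATED_NONAT))
```

172.16.3.2 matches both entries, while 172.16.3.66 sits in the second /26 and only matches once the ACL covers the whole /24.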
This is the first of what I am calling a Trouble Ticket post, a review of an issue I’ve run into which I believe warrants sharing with others. This could be because I came across something I’ve not seen before, that I ended up pulling my hair out over or perhaps it simply involved something that was particularly interesting to me. I aim to lay down as much information as I can within the confines of confidentiality, where applicable, and explain my troubleshooting process. So let’s get started with the first ticket.
A customer’s satellite site had a 1 Gb/s fibre link laid to the main data centre using a third-party supplier. The diagram below shows all the key details of the issue but some more background may help. R1 has an RJ45 to the LAN behind it, as does SW1, and both connect to the WAN with LC SFPs. The port on SW1 is layer 2, with the layer 3 endpoint being an ASA behind it.
When R1 was rebooted, its fibre port would initialise but would show as down/down.
It soon became apparent that this wasn’t going to be a slam dunk fix. Doing a shutdown/no shutdown didn’t help. Removing the LC connector from R1’s SFP and reinserting it brought the link back up, and removing one of the ST connectors from the patch panel and reinserting it also brought the link back up. At this point the link would remain up and stable…until the next reboot. It initially looked like a layer 1 issue.
R1 was returned to my head office where I powered it up with a 2nd SFP in a 2nd port, with a loopback fibre. No matter how hard I tried, I could not replicate the issue i.e. both ports came up as expected after every reboot. This led me to believe that the issue was probably with the link itself, somewhere from R1’s SFP to SW1’s SFP. I took R1 back to the remote site myself with a spare patch lead and another SFP and swapped these out but the issue still persisted.
I then got on to the WAN provider and requested that they test the link. They did an end-to-end test and reported that they could see no issue, but when I presented the results of my loopback test, they agreed to move on to a different fibre pair from the site’s patch panel to their POP site and down to the data centre. This still left the patch panel from their kit in the data centre to SW1, but was otherwise a complete replacement. Again, the issue remained.
I then went back to site and replaced R1 with a temporary Cisco 2960G (which I shall call SW2) for testing. For the test I simply removed the SFP from R1 with fibre intact and plugged in to SW2. Each time I rebooted, the link would come up as expected. This test by itself would strongly suggest that the link is actually OK and that the issue lies with the 3925. At this point, I should point out that I did all testing both with the required config on the relevant ports and with default config on there too. This seemed to leave the following causes for the issue:
Hardware issue with 3925
Incompatibility at layer 1 between hardware and WAN provider
Regardless of which of these was the issue, I decided to get our hardware support provider involved and when they tried to get me to do all the things I’d already done, I asked them to provide a replacement router, which arrived within hours. The original router had IOS 15.1(2)T2 on it for some reason and when the replacement came, I made the mistake of slapping the latest version of the same T train, 15.2(2)T, on it.
My next stop was Cisco themselves so I asked for the ticket to be escalated to Cisco TAC. After several days of chasing, I was asked to downgrade to 15.1(4)M3, a known stable release. This was done within minutes of being requested and yet the problem still persisted, and Cisco were telling me that this was the only fix they could think of at that time. A reach out to my online network drew no help either. Knowing that this was stumping everybody who looked at it did not make it any more palatable.
Eventually, I got direct access to TAC, which is another story in itself, and we set up a Webex session so they could see the issue ‘live’. They ran all the standard commands to get a report they could spend some time looking through but were still unable to see what the root cause was several days later. Prior to the testing, they had even sent out another two SFPs that they claimed were compatible with my setup, just to play it safe.
At this point, I was aware that the customer was keen to get the fault resolved ASAP so they could use the circuit with confidence so I presented three options:
1. Get the provider to terminate their line at the customer site on active kit
2. Get a different platform entirely, perhaps a 3945 or an ASR
3. Terminate the fibre on an intermediate switch
Option 1 came in at a cost that seemed deliberately prohibitive i.e. the provider just didn’t want to do it so added a zero to the cost. Option 2 looked the most likely but the hardware support provider were playing hardball regarding who should front the cost of the more expensive model. At this point, I told the customer to try option 3, but only as a short-term fix whilst I would chase option 2 to its conclusion. I also took this opportunity to streamline the config on the link, which had originally been set up as a trunk with allowed VLANs that had no place being there.
It was at this point that I was told of the underlying purpose of the link and it turned out that a 3925 would never have met the design requirements regardless. The colleague that came up with the kit list had moved on to pastures new but his replacement pointed out that a 3560X switch would not only be much cheaper but also exceed the requirements. The switch landed on my desk and I configured it, installed it and confirmed the issue was not replicating on this platform. The customer soak tested the link and was more than happy with performance. Job done.
In summary, sometimes you can’t make things right from the CLI or by wiggling cables. If you adhere to an IT life-cycle model, whether it’s ITIL, the Microsoft Operations Framework or Cisco’s PPDIOO, you avoid many of these kinds of issues in the first place, as you are putting your kit list together based on business requirements and strategy and not just on what has worked before in another scenario, on what tech looks cool this week or on a gut feeling. I am a strong proponent of a suitable planning and design phase but all too often, these are seen as time-drains when networks could be getting built and used. The loss of productivity that often follows during the operational phase, chasing your tail due to poor planning and design, can end up costing much more, not just in money but in staff morale and customer goodwill.
I still believe strongly that this is a hardware issue with the 3925. Has anybody seen anything similar to this issue, or have other stories of faults they were ultimately unable to resolve to their complete satisfaction?