Trouble Ticket #2: Missing hop in Traceroute output

Introduction

Welcome to my 2nd trouble ticket post. It has come a bit later than I would have liked, but I have had all sorts of other fun to deal with, and let’s not forget the two weeks of the Olympics when I hung up my study cap completely.

To recap, in this category of post, I cover an issue I have come across that stands out for one reason or another. In Trouble Ticket #1, I covered a problem that put Cisco TAC on the back foot. In this ticket, I discuss an initially strange-looking problem which did get resolved but not before I got led on a wild goose chase.

Description

The problem reared its head when I was working out why, from a management station, the first /26 of a /24 was reachable yet the second /26 was not. What stirred things up was a traceroute to an IP that was known to be available: a hop that I thought should be in the path was missing. The diagram below shows a simplified topology.

Diagram

How things initially looked

Problem

When I did a traceroute from MGM1 to host 172.16.3.66, I only got as far as the default gateway, 172.16.1.1, and then it died. A traceroute to 172.16.3.2 got there, with the following hops:

  1. 172.16.1.1 (default gateway, expected)
  2. 172.16.2.1 (layer 3 switch that routes to the 172.16.3.0 subnet, again expected)
  3. 172.16.3.2 (the destination host itself, expected)

What I didn’t see, and was expecting to see between hops 1 and 2, was the ASA firewall at 172.16.1.3. There was no ‘Request timed out’ message; the hop was missing entirely. For the curious, 172.16.1.1 is the gateway address for all management hosts and also holds routes for all the infrastructure it needs access to, hence the two hops within the same subnet.

Troubleshooting steps

I started to check the path being taken and as both traceroutes showed a successful reply from the GW, I ruled out any shenanigans on MGM1, but double checked the hosts file and ARP cache for anything untoward. Next, I checked out the GW and this is where the red herring reared its ugly head. Despite the routing table telling me that the next hop for both destination subnets was the firewall, I could see from the port descriptions (not always to be trusted!) and CDP output (more reliable!) that there was a switch directly connected that led through to the same subnets. I traced the path using CDP and found a 2nd layer 3 path but this one wasn’t firewalled. So here is the updated diagram:

The new layout after investigation

The strange thing was, it was the same number of hops either way so there was still something missing regardless of which path was being taken. I decided to err on the side of caution and went back to the GW, checking things like ARP caches and CEF adjacencies. Everything looked as it should.

So off to the Cisco Support Forums I went and, after several attempts to craft the correct search, I came across the answer I had been looking for. It turns out that something the firewall was not doing was hiding it from the traceroute output. In brief, a traceroute works by sending a probe packet (an ICMP echo request in the case of Windows tracert) to the destination IP with a TTL of 1. When the next hop (in the first case, the default gateway) receives this packet, it decrements the TTL to 0, drops the packet and sends a ‘TTL expired in transit’ message back to the source. The source then sends another probe to the destination with a TTL of 2. The GW decrements the TTL to 1 and forwards the packet on to its own next hop for the destination. When the packet hits that next hop, the TTL reaches 0 and the same ‘TTL expired in transit’ message is sent back to the source. This continues, with the TTL increasing by one each time, until the destination hopefully responds. The traceroute command can therefore display each hop by using the source IP address of each device that responds with one of these ICMP messages.

The root cause of my issue was that the ASA was not decrementing the TTL. The probe that should have expired at the firewall was therefore forwarded on to the next hop with a TTL of 1, and that next hop replied with the standard ‘TTL expired in transit’ message. In this way, the traceroute would still complete, but the ASA would be hidden from the output. OK, I can see how that might be useful from a security point of view, but it makes troubleshooting a real pain in the backside, so let’s look at how to change this behaviour from configuration mode:

firewall(config)#class-map ICMP_TTL
firewall(config-cmap)#match any
! This creates a new class map to match any traffic
firewall(config-cmap)#exit
firewall(config)#policy-map global_policy
! This policy-map should already exist
firewall(config-pmap)#class ICMP_TTL
! This adds our new class-map to this policy
firewall(config-pmap-c)#set connection decrement-ttl
! This is the key command. It decrements the TTL on all traffic passing through the firewall
firewall(config-pmap-c)#exit
firewall(config-pmap)#exit
! Exit out to configuration mode
firewall(config)#service-policy global_policy global
! This makes the policy active

The traceroute output was now as I was expecting:

C:\>tracert 172.16.3.2

Tracing route to 172.16.3.2 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  172.16.1.1
  2    <1 ms    <1 ms    <1 ms  172.16.1.3
  3     2 ms    <1 ms    <1 ms  172.16.2.1
  4    <1 ms    <1 ms    <1 ms  172.16.3.2
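
For anyone who wants to see the before-and-after behaviour without a lab, here is a minimal Python sketch that models the TTL handling described above. It is only a toy simulation, not how tracert or the ASA is implemented: the `Hop` class and the `build_path` and `simulate_traceroute` functions are names I have made up for illustration, and the path is the firewalled one from this post.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    address: str
    decrements_ttl: bool = True  # False models a device that forwards without touching the TTL


def build_path(asa_decrements):
    """Hypothetical helper: the firewalled path from this post, destination last."""
    return [
        Hop("172.16.1.1"),                                  # default gateway
        Hop("172.16.1.3", decrements_ttl=asa_decrements),   # ASA firewall
        Hop("172.16.2.1"),                                  # layer 3 switch
        Hop("172.16.3.2"),                                  # destination host
    ]


def simulate_traceroute(path, max_hops=30):
    """Return the addresses a traceroute would list for the given path."""
    reported = []
    for initial_ttl in range(1, max_hops + 1):
        ttl = initial_ttl
        responder = None
        # Every device except the destination may decrement the TTL in transit.
        for hop in path[:-1]:
            if hop.decrements_ttl:
                ttl -= 1
                if ttl == 0:
                    # TTL expired here, so this device answers the probe.
                    responder = hop.address
                    break
        if responder is None:
            # The probe survived every intermediate hop and reached the destination.
            reported.append(path[-1].address)
            return reported
        reported.append(responder)
    return reported


if __name__ == "__main__":
    print("ASA not decrementing:", simulate_traceroute(build_path(False)))
    print("ASA decrementing:    ", simulate_traceroute(build_path(True)))
```

With the ASA left at its original behaviour the model lists three hops and the firewall is missing; with decrementing switched on it lists four, matching the tracert output above.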

Summary

From the contextual help on the ASA’s CLI, it appears that this behaviour is applied to all IP-based traffic, not just ICMP traffic. It should be noted that the config above only applies to ASA versions 8.0(3) and later.

It should also be noted that the initial issue that got me to this point, i.e. part of a subnet responding and part of it not, was down to the fact that this subnet had previously been addressed differently. When the subnet was enlarged to a /24, all devices were readdressed correctly and the security ACLs on the firewall were updated, but there was a NONAT ACL that was still configured for the previous /26 subnet. I updated that too, and return traffic, now matching the NONAT ACL, was no longer NAT’d and was returned as expected.
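
To make the /26 versus /24 mismatch concrete, here is a small Python sketch using the standard `ipaddress` module. It assumes the stale NONAT entry covered 172.16.3.0/26, i.e. the first /26 of the range; I haven't quoted the actual ACL in this post, so treat that network as an illustration. The `matches` function is just a helper name made up for the example.

```python
import ipaddress

# Assumption: the stale NONAT ACL covered the first /26 of the range.
old_nonat = ipaddress.ip_network("172.16.3.0/26")
# The subnet after it was enlarged.
new_subnet = ipaddress.ip_network("172.16.3.0/24")


def matches(host, network):
    """Return True if the host address falls inside the given network."""
    return ipaddress.ip_address(host) in network


if __name__ == "__main__":
    for host in ("172.16.3.2", "172.16.3.66"):
        print(host,
              "old /26:", matches(host, old_nonat),
              "new /24:", matches(host, new_subnet))
```

Under that assumption, 172.16.3.2 falls inside both networks, while 172.16.3.66 only falls inside the /24, which lines up with the first host being reachable and the second not until the NONAT ACL was widened.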

Now to just remove that non-firewalled path…

Till the next time.