This is the first of what I am calling Trouble Ticket posts: reviews of issues I’ve run into that I believe are worth sharing with others. That might be because I came across something I’d not seen before, because I ended up pulling my hair out over it, or simply because it involved something particularly interesting to me. I aim to lay down as much information as I can within the confines of confidentiality, where applicable, and explain my troubleshooting process. So let’s get started with the first ticket.
A customer’s satellite site had a 1 Gb/s fibre link laid to the main data centre by a third-party supplier. The diagram below shows all the key details of the issue, but some more background may help. R1, a Cisco 3925 router at the satellite site, and SW1, a switch in the data centre, each have an RJ45 connection to the LAN behind them and connect to the WAN with LC SFPs. The port on SW1 is layer 2, with the layer 3 endpoint being an ASA behind it.
When R1 was rebooted, its fibre port would initialise but would show as down/down.
It soon became apparent that this wasn’t going to be a slam-dunk fix. Doing a shutdown/no shutdown on the interface didn’t help. Removing the LC connector from R1’s SFP and reinserting it brought the link back up, as did removing and reinserting one of the ST connectors at the patch panel. At this point the link would remain up and stable…until the next reboot. Initially, it looked like a layer 1 issue.
R1 was returned to my head office, where I powered it up with a second SFP in a second port and a loopback fibre between the two. No matter how hard I tried, I could not replicate the issue: both ports came up as expected after every reboot. This led me to believe that the issue was probably with the link itself, somewhere between R1’s SFP and SW1’s SFP. I took R1 back to the remote site myself with a spare patch lead and another SFP and swapped these in, but the issue persisted.
I then got on to the WAN provider and asked them to test the link. They did an end-to-end test and reported that they could see no issue, but when I presented the results of my loopback test, they agreed to move the circuit on to a different fibre pair from the site’s patch panel to their POP site and down to the data centre. This still left the patch panel from their kit in the data centre to SW1 unchanged, but was otherwise a complete replacement. Again, the issue remained.
I then went back to site and replaced R1 with a temporary Cisco 2960G (which I shall call SW2) for testing. For the test I simply removed the SFP from R1, with the fibre still attached, and plugged it into SW2. Each time I rebooted, the link came up as expected. This test by itself strongly suggested that the link was actually OK and that the issue lay with the 3925. I should point out that I did all testing both with the required config on the relevant ports and with default config. This seemed to leave the following causes for the issue:
- Hardware issue with 3925
- IOS bug
- Incompatibility at layer 1 between hardware and WAN provider
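The elimination behind that shortlist can be sketched in a few lines of Python. This is only an illustration, assuming a single root cause; the component labels, `SwapTest` and `remaining_suspects` are names made up for this sketch and not part of any tool used on the ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapTest:
    """One on-site test: which parts were swapped, and did the fault survive?"""
    swapped: frozenset
    fault_persisted: bool


def remaining_suspects(components, tests):
    """Narrow the suspect list, assuming exactly one component is at fault."""
    suspects = set(components)
    for test in tests:
        if test.fault_persisted:
            # The fault survived the swap, so the swapped parts are cleared.
            suspects -= test.swapped
        else:
            # The fault vanished after the swap, so it lies in the swapped parts.
            suspects &= test.swapped
    return suspects


# Hypothetical labels for the parts swapped during the on-site tests.
COMPONENTS = {"router", "sfp", "patch lead", "provider fibre pair"}

TESTS = [
    # Spare SFP and patch lead fitted at the remote site: fault still there.
    SwapTest(frozenset({"sfp", "patch lead"}), fault_persisted=True),
    # Provider moved the circuit to a different fibre pair: fault still there.
    SwapTest(frozenset({"provider fibre pair"}), fault_persisted=True),
    # Router replaced by the 2960G, same SFP and fibre: fault gone.
    SwapTest(frozenset({"router"}), fault_persisted=False),
]

print(remaining_suspects(COMPONENTS, TESTS))
```

With the three tests above, only the router is left standing. Note that "router" here lumps together all three causes in the list (hardware, IOS and how the box behaves at layer 1 on this particular link), so the sketch narrows things down to the device, not to the cause.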
Regardless of which of these was the issue, I decided to get our hardware support provider involved. When they tried to get me to do all the things I’d already done, I asked them to provide a replacement router, which arrived within hours. The original router had IOS 15.1(2)T2 on it for some reason and, when the replacement came, I made the mistake of slapping the latest version of the same T train, 15.2(2)T, on it.
My next stop was Cisco themselves, so I asked for the ticket to be escalated to Cisco TAC. After several days of chasing, I was asked to downgrade to 15.1(4)M3, a known stable release. This was done within minutes of being requested, yet the problem persisted and Cisco were telling me that this was the only fix they could think of at that time. Reaching out to my online network drew no help either. Knowing that this was stumping everybody who looked at it did not make it any more palatable.
Eventually, I got direct access to TAC, which is another story in itself, and we set up a Webex session so they could see the issue ‘live’. They ran all the standard commands to gather a report they could spend some time looking through, but several days later they were still unable to identify the root cause. Prior to the testing, they had even sent out another two SFPs that they said were compatible with my setup, just to play it safe.
By now I was aware that the customer was keen to get the fault resolved ASAP so that they could use the circuit with confidence, so I presented three options:
- Get the provider to terminate their line at the customer site on active kit
- Get a different platform entirely, perhaps a 3945 or an ASR
- Terminate the fibre on an intermediate switch
Option 1 came in at a cost that seemed deliberately prohibitive, i.e. the provider just didn’t want to do it so added a zero to the price. Option 2 looked the most likely, but the hardware support provider were playing hardball over who should front the cost of the more expensive model. I told the customer to try option 3, but only as a short-term fix while I chased option 2 to its conclusion. I also took this opportunity to streamline the config on the link, which had originally been set up as a trunk with allowed VLANs that had no place being there.
It was only then that I was told of the underlying purpose of the link, and it turned out that a 3925 would never have met the design requirements regardless. The colleague who came up with the kit list had moved on to pastures new, but his replacement pointed out that a 3560X switch would not only be much cheaper but would also exceed the requirements. The switch landed on my desk and I configured it, installed it and confirmed that the issue did not occur on this platform. The customer soak tested the link and was more than happy with performance. Job done.
In summary, sometimes you can’t make things right from the CLI or by wiggling cables. If you adhere to an IT life-cycle model, whether it’s ITIL, the Microsoft Operations Framework or Cisco’s PPDIOO, you avoid many of these kinds of issues in the first place, because you are putting your kit list together based on business requirements and strategy and not just on what has worked before in another scenario, on what tech looks cool this week or on a gut feeling. I am a strong proponent of a suitable planning and design phase but, all too often, these are seen as time-drains when networks could be getting built and used. The loss of productivity that often follows during the operational phase, chasing your tail due to poor planning and design, can cost much more, not just in money but in staff morale and customer goodwill.
I still strongly believe that this was a hardware issue with the 3925. Has anybody seen anything similar, or do you have other stories of faults you were ultimately unable to resolve to your complete satisfaction?
Till the next time.