Trouble Ticket #1: High fibre diet

Introduction

This is the first of what I am calling a Trouble Ticket post, a review of an issue I’ve run into which I believe warrants sharing with others. This could be because I came across something I’ve not seen before, that I ended up pulling my hair out over or perhaps it simply involved something that was particularly interesting to me. I aim to lay down as much information as I can within the confines of confidentiality, where applicable, and explain my troubleshooting process. So let’s get started with the first ticket.

Description

A customer’s satellite site had a 1Gb/s fibre link laid to the main data centre using a 3rd party supplier. The diagram below shows all the key details of the issue but some more background may help. R1, a Cisco 3925 router, has an RJ45 connection to the LAN behind it, as does SW1, and both connect to the WAN with LC SFPs. The port on SW1 is layer 2, with the layer 3 endpoint being an ASA behind it.

Diagram

Network diagram

Problem

When R1 was rebooted, its fibre port would initialise but would show as down/down.

Troubleshooting steps

It soon became apparent that this wasn’t going to be a slam dunk fix. Doing a shutdown/no shutdown didn’t help. Removing the LC connector from R1’s SFP and reinserting it brought the link back up, as did removing and reinserting one of the ST connectors at the patch panel. At this point the link would remain up and stable…until the next reboot. Initially, it looked like a layer 1 issue.
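For anyone repeating this kind of reboot test, the distinction between a port that is shut and one that is enabled but has no link can be pulled out of captured `show ip interface brief` output. What follows is only a minimal sketch: the sample text, interface names and addresses are invented for illustration and are not taken from R1, and the regular expression assumes the usual six-column layout of that command.

```python
import re

# Illustrative 'show ip interface brief' output. The interface names and
# addresses are made up for this example and are NOT from the real R1.
SAMPLE = """\
Interface                  IP-Address      OK? Method Status                Protocol
GigabitEthernet0/0         192.0.2.1       YES NVRAM  up                    up
GigabitEthernet0/1         unassigned      YES NVRAM  administratively down down
GigabitEthernet0/2         198.51.100.1    YES NVRAM  down                  down
"""

# One data row: name, address, OK?, method, status, protocol.
# The header row does not match because its third column is "OK?".
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<ip>\S+)\s+(?P<ok>YES|NO)\s+(?P<method>\S+)\s+"
    r"(?P<status>administratively down|up|down)\s+(?P<protocol>up|down)\s*$"
)


def parse_brief(text):
    """Return {interface: (status, protocol)} for every data row."""
    states = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            states[match.group("name")] = (
                match.group("status"),
                match.group("protocol"),
            )
    return states


def down_down(text):
    """Interfaces that are enabled but show status down and protocol down."""
    return [
        name
        for name, state in parse_brief(text).items()
        if state == ("down", "down")
    ]


if __name__ == "__main__":
    print(down_down(SAMPLE))
```

A port reported as "administratively down" has simply been shut, so it is deliberately left out of the result; the state described in this ticket is the plain down/down one.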

R1 was returned to my head office where I powered it up with a 2nd SFP in a 2nd port, with a loopback fibre. No matter how hard I tried, I could not replicate the issue, i.e. both ports came up as expected after every reboot. This led me to believe that the issue was probably with the link itself, somewhere from R1’s SFP to SW1’s SFP. I took R1 back to the remote site myself with a spare patch lead and another SFP and swapped these in, but the issue persisted.

I then got on to the WAN provider and requested that they test the link. They did an end-to-end test and reported that they could see no issue but when I presented the results of my loopback test, they agreed to move on to a different fibre pair from the site’s patch panel to their POP site and down to the data centre. This still left the patch panel from their kit in the data centre to SW1, but otherwise it was a complete replacement. Again, the issue remained.

I then went back to site and replaced R1 with a temporary Cisco 2960G (which I shall call SW2) for testing. For the test I simply removed the SFP from R1 with the fibre still attached and plugged it into SW2. Each time I rebooted, the link would come up as expected. This test by itself would strongly suggest that the link is actually OK and that the issue lies with the 3925. At this point, I should point out that I did all testing both with the required config on the relevant ports and with default config. This seemed to leave the following causes for the issue:

  • Hardware issue with 3925
  • IOS bug
  • Incompatibility at layer 1 between hardware and WAN provider
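The elimination reasoning up to this point can be sketched in a few lines of Python. This is an illustration only: the factor labels are shorthand invented for the example, each test is reduced to just the platform and the link it was run over, and things like SFPs, patch leads and IOS versions are deliberately ignored. The idea is to look for the smallest set of factors that is present in every failing test and never appears together in a passing one.

```python
from itertools import combinations

# Each test is (factors involved, did the link come up after a reboot?).
# The labels are shorthand for this example; the outcomes paraphrase the
# troubleshooting steps described above.
TESTS = [
    (frozenset({"3925", "loopback fibre"}), True),     # bench test at head office
    (frozenset({"3925", "provider circuit"}), False),  # original fault on site
    (frozenset({"3925", "provider circuit"}), False),  # after new SFP and patch lead
    (frozenset({"3925", "provider circuit"}), False),  # after provider moved fibre pair
    (frozenset({"2960G", "provider circuit"}), True),  # temporary switch SW2
]


def suspects(tests, max_size=2):
    """Smallest factor sets seen in every failure and in no passing test."""
    failing = [factors for factors, ok in tests if not ok]
    passing = [factors for factors, ok in tests if ok]
    if not failing:
        return []
    # Only factors common to every failing test can explain all failures.
    common = frozenset.intersection(*failing)
    for size in range(1, max_size + 1):
        found = [
            set(combo)
            for combo in combinations(sorted(common), size)
            if not any(set(combo) <= good for good in passing)
        ]
        if found:
            return found
    return []


if __name__ == "__main__":
    print(suspects(TESTS))
```

With these inputs neither the 3925 nor the provider circuit is implicated on its own, since each appears in a passing test; only the pairing of the two is left. That fits all three causes listed above and does not separate them, because a hardware fault or IOS bug that only shows itself on this particular circuit would produce the same pattern.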

Regardless of which of these was the issue, I decided to get our hardware support provider involved and when they tried to get me to do all the things I’d already done, I asked them to provide a replacement router, which arrived within hours. The original router had IOS 15.1(2)T2 on it for some reason and when the replacement came, I made the mistake of slapping the latest version of the same T train, 15.2(2)T, on it.

My next stop was Cisco themselves so I asked for the ticket to be escalated to Cisco TAC. After several days of chasing, I was asked to downgrade to 15.1(4)M3, a known stable release. This was done within minutes of being requested and yet the problem still persisted, and Cisco were telling me that this was the only fix they could think of at that time. Reaching out to my online network drew no help either. Knowing that this was stumping everybody that looked at it did not make it any more palatable.

Eventually, I got direct access to TAC, which is another story in itself, and we set up a Webex session so they could see the issue ‘live’. They ran all the standard commands to get a report they could spend some time looking through but were still unable to see what the root cause was several days later. Prior to the testing, they had even sent out another two SFPs that they claimed were compatible with my setup, just to play it safe.

At this point, I was aware that the customer was keen to get the fault resolved ASAP so they could use the circuit with confidence so I presented three options:

  1. Get the provider to terminate their line at the customer site on active kit
  2. Get a different platform entirely, perhaps a 3945 or an ASR
  3. Terminate the fibre on an intermediate switch

Option 1 came in at a cost that seemed deliberately prohibitive i.e. the provider just didn’t want to do it so added a zero to the cost. Option 2 looked the most likely but the hardware support provider were playing hardball regarding who should front the cost of the more expensive model. At this point, I told the customer to try option 3, but only as a short-term fix whilst I would chase option 2 to its conclusion. I also took this opportunity to streamline the config on the link, which had originally been set up as a trunk with allowed VLANs that had no place being there.

It was at this point that I was told of the underlying purpose of the link and it turned out that a 3925 would never have met the design requirements regardless. The colleague that came up with the kit list had moved on to pastures new but his replacement pointed out that a 3560X switch would not only be much cheaper but also exceed the requirements. The switch landed on my desk and I configured it, installed it and confirmed the issue was not replicating on this platform. The customer soak tested the link and was more than happy with performance. Job done.

Summary

In summary, sometimes you can’t make things right from the CLI or by wiggling cables. If you adhere to an IT life-cycle model, whether it’s ITIL, the Microsoft Operations Framework or Cisco’s PPDIOO, you avoid many of these kinds of issues in the first place, as you are putting your kit list together based on business requirements/strategy and not just on what has worked before in another scenario, on what tech looks cool this week or on a gut feeling. I am a strong proponent of a suitable planning and design phase but all too often, these are seen as time-drains when networks could be getting built and used. The loss of productivity that often follows during the operational phase, chasing your tail due to poor planning and design, can end up costing much more, not just in money but in staff morale and customer goodwill.

I still believe strongly that this is a hardware issue with the 3925. Has anybody seen anything similar to this issue, or have other stories of faults they were ultimately unable to resolve to their complete satisfaction?

Till the next time.

Cisco Live London 2012 Day 3

Day 3 at Cisco Live London 2012 and yes, it’s true. I have whored myself today with no shame nor remorse, but more on that later. The day started off so well too!! Today, the primary theme for me was simply WAN. Optimisation, high availability, security and best design. Both sessions were delivered by Adam Groudan, a man who touts himself as Cisco’s WAN evangelist and it was soon clear why. It’s always nice to sit and listen to somebody who really knows their shit, especially when you yourself might not! If I were to give you two topics to go away and read up on, they would be DMVPN and Performance Routing (PfR). I am looking forward to trying this stuff out in the lab. Then came the first whoring of the day. A tweet I sent out on Monday:

Just put my hand to head and found brain tissue leaking out of ears. Thanks @CiscoLiveEurope! That was some technical seminar #CLEUR

This caught the attention of some of the guys in the social lounge and they asked if they could do a quick video interview on how I was finding the event and if they could use both the video and the tweet in their marketing material. Sure, I said, as long as my Twitter handle is included! I have just started blogging after all and knowing that there might be more people reading it keeps the motivation going…..no…..please don’t go!!

Following on from that, it was off for the 2nd and final keynote speech of the week, presented by Cisco Futurist Dave Evans with guest Richard Noble, the holder of the land speed record until 1997. Dave presented a very intriguing 10 things to look out for in the next 10 years. I unfortunately had to bomb out at number 8 for a meeting with Cisco Scotland so will watch the keynote on Cisco Live Virtual. If you like tech and progress, I strongly suggest you do too…it was very interesting and Richard’s part just showed what an amazing field engineering is. The Bloodhound car (picture posted in last blog at the end) is at the pinnacle of technological progress. The thing that really blew my mind was the fact that this car uses a Cosworth F1 engine….its job is to pump the fuel required for the jet engine!! An F1 engine required effectively as a pump for a bigger engine. If I recall correctly, that car throws out something like 70000bhp. I will be watching the television coverage when the new record attempt is made, hopefully next year.

Lunch today was provided at the Crowne Plaza hotel courtesy of the Cisco Scotland team for attendees from a Scottish company. Hell, it was a free bit of tasty lunch so I didn’t want to tell them I am actually English in case they barred me. Of course, there is no such thing as a free lunch but the 30 minute marketing pitch on their UCS offerings was actually quite informative.

The afternoon brought the 2nd WAN session mentioned above and then I attended a useful 30 minute session on the value of certification and how it can help your career. This was presented by David Mallory, the CTO for Cisco Learning, and we had a good 15 minute chat after the session on the value of different study methods and materials, how to approach the CCIE lab and what to expect, and what Cisco are doing to keep the very high standard of their different tracks and levels of certification. Where else could you get that kind of high value information in such a condensed time?

And now, for some more whoring news. Before Dave Evans began his keynote speech this morning, Darren Campbell came on to take part in an Xbox 360 Kinect competition with some of the attendees who had somehow managed to find the time to play a Cisco Live game. In the early afternoon, Darren was doing a meet and greet at the social lounge and with him being from Manchester too, I thought I’d go and have a chat. Now, for those that don’t know me, I’m not shy in the slightest so I charged up to him and asked for a photo opportunity, which he willingly supplied. Please note the Gold medal around my neck that he picked up at Athens 2004 for the 4x100m relay. He’s only 3 months younger than me but still looks like he’s in his 20s. Makes you sick really! Joking aside, he’s a really nice bloke.

Nice bloke
Fastest man at Cisco Live for sure

Another whoring alert just in, I recently tweeted Jimmy Ray Purser of Cisco fame asking for a photo to which he replied in the affirmative. So when I turned a corner in the World of Solutions and saw both him and Robb Boyd having their photos taken, I introduced myself and asked him to uphold his end of the bargain, despite me offering him nothing in return! They were in the middle of a photo shoot themselves but dropped everything straight away and Jimmy had a good chat with me about things in a completely relaxed way before I stopped annoying them any further.

Network rock stars
Thanks Robb for the monkey face!

The final ‘this whoring news just in’ was when, at the morning’s WAN session, Adam had about 10 little boxes of magnetic Visio style network icons to hand out to people who asked the best question. Of course, as soon as he said that, hands were popping up all over the place. When my question, which deserved a box for being the most retarded of the week, didn’t get such recognition, I ended up approaching him at the end of the session, noticed a spare box on his desk and told him that I was trying to get my daughter into network design and that the box would allow her to do this over her cornflakes in the morning. Box…in the bag. Thanks Adam. She is only four at the moment, I should add, but I’ll be showing her, using the icons, how one might design a redundant WAN solution!

OK, I am seriously goosed but they are handing out free beer so I’m off for the night. Planning on being sensible…ish tonight so I can give it my all for the last full day, then on Friday, it’s off to the Cisco store for some much coveted books.

Till the next time…