I was first made aware of an issue when the hosting provider where I host this blog at were tweeting apologies on 12/08/14 for an interrupted service and I later received an excellently worded apology and explanation from them. A couple of colleagues also got in touch later that evening with reports from further afield.
The facts in no particular order
- Essentially, routers and switches either make the decision to forward packets in hardware using a special type of very fast memory called TCAM or more software based, using cheaper and somewhat slower RAM. The advantage of TCAM is its speed and its ability to provide an output with a single CPU cycle but it is costly and also a finite resource. RAM on the other hand is slower, but you can usually throw more of it at a problem. Depending on which model of router/switch you have depends on which forwarding method is used
- The number of IPv4 routes on the Internet has been growing steadily and increasingly since its creation. Back in early May 2014, this global routing table hit 500K routes
- The devices that use TCAM are not only restricted by the finite size available to it, but this TCAM is used for other things besides IPv4 routing information, e.g. access lists (ACLs), QoS policy information, IPv6 routes, MPLS information, multicast routes. So in effect, TCAM is partitioned according to the use the device is being put to. Cisco’s 6500 and 7600 switch and router platforms (respectively) have a default setting for each of these. On many of the devices, the limit for IPv4 routes is set to 512K
- Verizon have a big block of IP addresses that they advertise as an aggregated prefix
- On Tuesday, for some reason, Verizon started advertising a large amount of subnets within their block as /24 networks instead, to the tune of several thousand, causing the global routing table to exceed the 512K limit on those devices configured as such
- This had the impact that those affected devices did not have enough TCAM to hold the full Internet routing table and so the prefixes that didn’t make it in to the table would not be reachable. As prefixes come up and down on the Internet all the time, these routes would have been random in nature throughout the issue i.e. it would not have just been the Verizon routes affected
Are you affected?
If you have Cisco 6500 or 7600 devices running full BGP tables, you need to run the following command:
6500>show mls cef maximum-routes
FIB TCAM maximum routes :
IPv4 + MPLS - 512k
IPv6 + IP Multicast - 32k (default)
If the IPv4 line of output is 512k or lower, you are in a pickle and will need to change the settings by entering the command below:
6500(config)#mls cef maximum-routes 1000
Where the 1000 is the number of 1K entries i.e. the setting as shown in the first output would be 512. Typing a ‘?’ instead of the number will return the maximum available on your platform, so you could in theory be requiring a hardware refresh to add to your woes.
If you have an ASR9K, follow the instructions here to get to your happy place:
Most other router platforms use RAM and so the more you have, the more routes it can handle. The performance varies widely from platform to platform. You should check the vendor’s documentation for specifics e.g. the Cisco ASR1002-X will do 500,000 IPv4 routes with 4GB of RAM and 1,000,000 with 8GB RAM
Who is to blame?
There is an ongoing debate at the moment about whether Cisco are liable or the service providers. I would argue that it is predominantly the latter but Cisco could have done a better job of advising their customers. Cisco did post an announcement about this on their website a number of months ago, but I didn’t spot it so I’m assuming many other customers didn’t also.
Having said that, if you buy a bit of kit to do something, you need to take some responsibility for failing to include capacity planning in to your operational strategy.
Till the next time. (#768K!)
One Reply to “The 512K route issue”
This has been discussed at length over on the NANOG mailing list, that a service provider was caught out by this is pretty shocking. I agree with you that the blame firmly lies with the SP, I suspect they were being cheap and hoping to get another year out of their kit in the hopes that table growth would slow.