Paul Venezia
Senior Contributing Editor

Redundant Redundant

analysis
Apr 15, 2003 · 4 mins


The holy grail of peace of mind: network redundancy. What better way to make sure that hardware failures won’t take the network down than to have two of everything? It stands to reason that if one is good, two are immediately better. In some situations, this is absolutely the case. Core switching, for example. Spanning-tree handles the failover duties, and shouldn’t take more than 10 seconds to heal the network. With Rapid Spanning Tree, that can be dropped to one second. Depending on the vendor, switched networks can take advantage of both core switches simultaneously, with the full network load collapsing onto one when the other fails. Wonderful technology.
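The “two are immediately better” intuition can be made concrete with a toy availability model. A minimal sketch in Python, using illustrative availability figures rather than vendor data: two independent core switches in parallel leave the network down only when both are down.

```python
# Toy availability model for redundant core switches.
# All figures are illustrative assumptions, not vendor data.

def parallel_availability(a: float, n: int = 2) -> float:
    """Availability of n independent components in parallel:
    the system is down only if every component is down."""
    return 1.0 - (1.0 - a) ** n

single = 0.999                          # one core at "three nines"
dual = parallel_availability(single)    # two cores in parallel

hours_per_year = 24 * 365
print(f"one core:  {single:.6f} (~{(1 - single) * hours_per_year:.1f} h/yr down)")
print(f"two cores: {dual:.6f} (~{(1 - dual) * hours_per_year:.2f} h/yr down)")
```

On paper a three-nines switch paired with its twin lands near six nines. The catch is the independence assumption: a shared misconfiguration takes out both at once, which is a point worth holding onto.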

It can also be taken too far.

Any network designed for redundancy has to set a stopping point. There are places within the network where redundancy is simply redundant. A scenario: redundant core switches, chassis-based closet switches with redundant power supplies, failover-teamed gigabit NICs in the servers, server and edge switches uplinked to both cores, PVST+ (Per-VLAN Spanning Tree) to balance VLAN traffic across the cores, redundant firewalls, redundant Internet edge switches, redundant WAN routers… waitaminute. Redundant WAN routers? Everything up to that item is fairly easily accomplished, and the costs are capital costs; you’ve bought the hardware. Once we start talking about redundant WAN routers, things get expensive very quickly. It’s one thing to have redundant hardware; it’s quite another to have redundant ATM, point-to-point, or frame-relay links to remote sites. The most likely cause of a remote-site data link failure is the line itself, whether a physical fault or a provider issue. Redundant routers don’t help with that. Most providers can create dark frame-relay DLCIs that are mapped to another site, or another router, but the cost isn’t small, and it shows up every month. It’s also not automatic; someone has to turn up those maps in the switch. This scenario will help if the main site is destroyed by fire, but here we are again: setting the boundaries of redundancy.

If the business case can be made for a warm or hot site, and the money is in the budget for this expensive undertaking, then by all means, it should be implemented. If the budget is tight, there are better ways to spend the IT dollar. When talking redundancy, we need to determine what we’re protecting ourselves against. Is it the destruction of a datacenter? That moves into BCM (Business Continuity Management), or disaster recovery. Is it the failure of a mission-critical system? That may be worth it. Is it the failure of a core switch? How about a patch cable? Should thousands of dollars be spent on hardware and engineering time to implement automatic failover for a faulty patch cable?

For all these devices, the criteria for redundancy dig deeper. In the mission-critical system, are we providing redundancy all the way down? From disk to power supplies to NICs to RAM to system board to CPU? Which is most likely to fail? It’s actually in that order. For switching, it’s almost the same: power supplies to linecards to individual ports to gang controllers to system board. What are we betting our money on, anyway?
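To make the “which is most likely to fail” question concrete, here’s a sketch that ranks server components by an assumed annual failure rate. The numbers are invented for illustration, chosen only to reflect the ordering above; real figures would come from your own failure logs or vendor reliability data.

```python
# Rank server components by an assumed annual failure rate (AFR).
# AFR values below are illustrative assumptions, not measured data;
# they simply encode the disk-first ordering described in the text.

annual_failure_rate = {
    "disk": 0.05,
    "power supply": 0.03,
    "NIC": 0.02,
    "RAM": 0.01,
    "system board": 0.005,
    "CPU": 0.002,
}

# Sorting by AFR shows where a redundancy dollar buys the most protection:
# spend at the top of this list before the bottom.
for part, afr in sorted(annual_failure_rate.items(), key=lambda kv: -kv[1]):
    print(f"{part:13s} {afr:.1%}")
```

The point of the exercise isn’t the exact percentages; it’s that a ranked list like this forces the “what are we betting on” question before the purchase order goes out.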

For most networks, the implementation of redundant core switching with failover NIC teaming and redundant firewalls is the sweet spot. If it’s cost-effective, redundant ISPs are definitely desirable. Money spent outside of these efforts is nearly always subject to diminishing returns. If the mainboard in a server is flaky, then all the redundant power supplies, RAID arrays and network equipment you can think of won’t help that server maintain viability. Redundant or clustered servers might, but that’s another discussion.
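The flaky-mainboard point falls out of simple availability arithmetic: the components a server depends on sit in series, so their availabilities multiply and the weakest link dominates. A sketch with illustrative numbers:

```python
# A server is up only if every component it depends on is up, so
# component availabilities multiply (a series system). Figures are
# illustrative assumptions, not measurements.

def series_availability(avails):
    """Availability of components in series: the product of each one."""
    total = 1.0
    for a in avails:
        total *= a
    return total

flaky_board = 0.95                       # the flaky mainboard
redundant_psu = 1 - (1 - 0.99) ** 2      # two PSUs in parallel: 0.9999
redundant_nic = 1 - (1 - 0.99) ** 2      # teamed NICs: 0.9999

server = series_availability([flaky_board, redundant_psu, redundant_nic])
print(f"server availability: {server:.4f}")
```

However much redundancy surrounds it, the server’s availability stays pinned just below the flaky board’s own 0.95; the multiplication can only pull the number down. Fixing or replacing the weak component beats adding redundancy around it.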

If you’ve stayed with me so far, you’ll notice that there’s a significant piece missing from the above. The single greatest threat to the network and to the overall infrastructure is human error. We can have all the redundant hardware and links we want, and a mistake in a routing table or inadvertent configuration error can take everything down in a heartbeat. Perhaps we should be putting some of our IT dollars into technical and change-management training to help reduce the effects of human error on the enterprise network. While you can’t control the humans at your carrier or ISP, you can control what occurs within your organization. Only YOU can prevent infrastructure fires.