by Alan Zeichick

No room for server failure

Sep 12, 2003

Stratus ftServer 6600 high-availability server delivers solid fault tolerance to Windows

Sometimes, you simply can’t afford a server failure. Maybe it’s a system that monitors safety conditions at a factory, or an application that records financial transactions. Maybe it’s a bottleneck back-end database server that supports a distributed Web farm.

Stratus Technologies Bermuda’s ftServer 6600 fault-tolerant server is expressly designed for these grave situations.

Nearly everything is redundant on the ftServer 6600, including the processors. That degree of high availability comes via a modular approach. The server consists of a 10U-high (17.5 inches) rack-mountable chassis that holds up to seven slide-out pizza-box modules: two for redundant hard disks, two for redundant PCI I/O cards, and two or three for redundant motherboards and processors.

All of these field-replaceable modules are cross-connected using a passive backplane, and each contains its own AC power supply and cooling system.

Everything on the server runs in lockstep, even the Intel Xeon processors and the memory chips within the processor modules. If one module of a pair fails, the other instantly picks up the workload without dropping a transaction.

This approach isn’t new. Many proprietary servers, including those from Tandem (now owned by Hewlett-Packard), offer similar lockstep-based protection. Stratus has sold a line of fault-tolerant servers for some time.

But what sets Stratus’ approach apart is that it uses Intel Xeon processors and runs Windows 2000 Advanced Server or Windows Server 2003. Compared with Stratus’ older ftServer 6500, the new 6600 model packs faster processors, disks, and I/O into about half the space.

Available this month at under $100,000, this truly is fault-tolerant hardware for the masses, but it’s more expensive than setting up clusters of lesser servers.

The ftServer 6600 boots up using only a single set of processor, disk, and I/O modules. After the operating system starts, it loads a set of drivers and utilities that brings the redundant hardware online. This software is designed to keep each module operating in lockstep with its partner.

One module of each type is always set as the active component while its partner monitors the active component’s behavior using watchdog circuits and backplane cross-connects. The partner automatically takes over if it detects a failure in the active component. It also sends out alerts via SNMP, e-mail, and modem when a component fails.
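The active/partner arrangement described above can be sketched in a few lines of Python. This is only an illustrative model, not Stratus code: the `Module` and `RedundantPair` classes, the heartbeat mechanism, and the one-second timeout are all assumptions made for the example, standing in for the watchdog circuits and backplane cross-connects. Only the three alert channels come from the product description.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """One half of a redundant pair (hypothetical model, not Stratus code)."""
    name: str
    last_heartbeat: float = 0.0

    def beat(self, now: float) -> None:
        # Record that the module was seen alive at time `now`.
        self.last_heartbeat = now


@dataclass
class RedundantPair:
    active: Module
    partner: Module
    timeout: float = 1.0  # assumed watchdog interval, in seconds
    alerts: list = field(default_factory=list)

    def check(self, now: float) -> Module:
        """Promote the partner if the active module has gone silent."""
        if now - self.active.last_heartbeat > self.timeout:
            failed = self.active
            self.active, self.partner = self.partner, failed
            # Alert channels named in the review: SNMP, e-mail, and modem.
            for channel in ("snmp", "email", "modem"):
                self.alerts.append((channel, f"{failed.name} failed"))
        return self.active


# Hypothetical usage: io-0 stops sending heartbeats, io-1 keeps going.
pair = RedundantPair(Module("io-0"), Module("io-1"))
pair.partner.beat(2.0)
current = pair.check(2.0)  # io-0 has been silent for longer than the timeout
```

In this sketch, as in the design described above, the partner takes over once the active module goes quiet, and an alert is queued on each channel.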

At least that’s what it should do. The test system didn’t quite meet expectations.

I pulled the disk modules one at a time and replaced them easily. The server did not hiccup, and tasks continued to run without delay. When the disk slice module was replaced, LEDs indicated that the module was powering up and synchronizing; within a few minutes, the status lights showed that full fault tolerance was restored.

There was similar fault-tolerant behavior with the processor modules; the server worked perfectly with either module installed.

But when the third module type was tested (the I/O slice that contains the PCI slots), the server crashed a few seconds after the module was reinserted. When the problem was simulated through the administrative software, with a software-based shutdown and restart of the I/O module, the server crashed again, so faulty connectors or wiring were not to blame.

After much investigation, engineers at Stratus concluded that this was a known issue with that generation of preproduction server, caused by electrical signaling issues on the server’s backplane. According to the company, this fault was discovered and repaired before customer beta hardware was released.

That problem aside, the server is impressive. Stratus clearly put effort not only into building the fault-tolerant hardware and software drivers, but also into making the server manageable via a Microsoft Management Console snap-in or a browser. Each processor module contains a modem designed to phone back to Stratus’ datacenter if faults are discovered in the server, and to allow Stratus’ engineers to call in to diagnose the server. This let the company gather data to debug the crash problem during my tests.

The only fault with this extremely fault-tolerant server is the aforementioned electrical signaling problem. But assuming the company has fixed that blip, the ftServer 6600 is highly recommended for Windows applications that need this extreme level of hardware high availability.