What’s the best strategy for ensuring hardware reliability? Today’s server hardware is very reliable, more so than most commercial operating systems. But it’s not infallible. When “hardware failure is not an option,” there are two general ways of improving system reliability: using one fault-tolerant server or using a cluster of ordinary servers.

In a cluster, if one server crashes, another server takes over. However, clusters suffer from several fundamental limitations. First, there is a delay while faults are detected. Second, if a server is acting strangely but hasn’t crashed, the fault may not be detected at all. Third, even when failures are detected, the flawed server’s current transactions may be lost or dropped during the hand-over. Finally, some in-house or commercial applications may not lend themselves to clustering at all. In those cases, you’ll need to deploy a single fault-tolerant server, like the Stratus ftServer 6600, instead of a cluster of lesser, semi-fault-tolerant servers.

On the other hand, fault-tolerant servers aren’t perfect. A crashed driver stack, an unchecked buffer, a “blue screen of death,” an application time-out: any of these will bring down a $100,000 fault-tolerant server just as thoroughly as it clobbers a commodity server at one-twentieth the cost. But unless lightning strikes twice simultaneously, a server cluster can withstand these “soft” errors and keep on processing new transactions. In my experience, software, not hardware, is the cause of most server failures; given that, server clusters should be your first choice when designing a high-availability server farm. Save the expensive fault-tolerant machines for applications that are truly hardware-fault intolerant.

There is, of course, a middle ground: install a cluster of reasonably affordable servers that offer limited, but not complete, fault tolerance.
Many manufacturers offer servers with redundant power supplies, redundant disk arrays, redundant cooling systems, even redundant memory, as in the HP ProLiant DL740. But even there, a CPU or motherboard failure, operating system fault, or application failure could still crash the system. A cluster of fault-tolerant servers will give the best of both worlds.
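
To make the first two cluster limitations concrete, here is a minimal Python sketch of heartbeat-based failure detection, one common way for a cluster to notice a dead server. It is an illustration only, not the mechanism of any particular cluster product: the names Node and detect_failed, the healthy_output flag, and the 3-second timeout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float        # time (seconds) of the most recent heartbeat
    healthy_output: bool = True  # hypothetical: is the node returning correct results?


def detect_failed(nodes, now, timeout):
    """Return the names of nodes whose last heartbeat is older than `timeout`.

    The monitor sees only heartbeats. It cannot see `healthy_output`,
    so a node that misbehaves but keeps sending heartbeats passes.
    """
    return [n.name for n in nodes if now - n.last_heartbeat > timeout]


# Illustrative state: "b" crashed just after its heartbeat at t=4.0;
# "c" is still sending heartbeats but producing bad results.
nodes = [
    Node("a", last_heartbeat=10.0),
    Node("b", last_heartbeat=4.0),
    Node("c", last_heartbeat=10.0, healthy_output=False),
]

TIMEOUT = 3.0  # assumed value, for illustration only

# At t=6.0 the crash of "b" has not been noticed yet (detection delay).
undetected_yet = detect_failed(nodes[1:2], now=6.0, timeout=TIMEOUT)

# At t=10.5 "b" is flagged, but the misbehaving "c" is not.
flagged = detect_failed(nodes, now=10.5, timeout=TIMEOUT)
```

In this sketch the crashed server goes unnoticed until its silence outlasts the timeout, and the server that is “acting strangely” is never flagged, because the monitor only checks whether heartbeats arrive. Shortening the timeout reduces the delay but raises the risk of declaring a slow, healthy server dead.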