by Alan Zeichick

Making the choice: Clustered servers vs. fault-tolerant server

Sep 12, 2003, 2 mins

What’s the best strategy for ensuring hardware reliability?

Today’s server hardware is very reliable, more so than most commercial operating systems. But it’s not infallible. When “hardware failure is not an option,” there are two general ways of improving system reliability: using one fault-tolerant server or using a cluster of ordinary servers.

In a cluster, if one server crashes, another server takes over. However, clusters suffer from several fundamental limitations. First, there is a delay while faults are detected. Second, if a server is acting strangely but hasn’t crashed, the fault may not be detected at all. Third, even when failures are detected, the failed server’s current transactions may be lost or dropped during the hand-over.
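To make those three limitations concrete, here is a minimal sketch of heartbeat-based failure detection, written in Python purely for illustration. It does not describe any particular clustering product; the `Node` class, the `detect_failure` function, the tick-based polling and the `missed_limit` threshold are all hypothetical names and assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical cluster member used only for this illustration."""
    name: str
    alive: bool = True  # becomes False once the node has crashed
    in_flight: List[str] = field(default_factory=list)  # uncommitted transactions

    def heartbeat(self) -> bool:
        # A crashed node stops answering. A node that is misbehaving but
        # still running keeps answering, so this check cannot see it.
        return self.alive


def detect_failure(primary: Node, ticks: int, crash_at: Optional[int],
                   missed_limit: int = 3) -> Tuple[Optional[int], List[str]]:
    """Poll the primary once per tick, as a standby node might.

    Returns the tick at which failover is triggered (or None if it never
    is) and the list of transactions lost in the hand-over.
    """
    missed = 0
    for tick in range(ticks):
        if crash_at is not None and tick == crash_at:
            primary.alive = False  # simulate the crash
        if primary.heartbeat():
            missed = 0
            continue
        missed += 1
        if missed >= missed_limit:
            # Work the primary had not committed is gone by the time
            # the standby takes over.
            lost = list(primary.in_flight)
            primary.in_flight.clear()
            return tick, lost
    return None, []


primary = Node("primary", in_flight=["tx-101", "tx-102"])
failover_tick, lost = detect_failure(primary, ticks=20, crash_at=5)
print(failover_tick, lost)
```

In this toy model the crash happens at tick 5, but failover is not triggered until three consecutive heartbeats have been missed, and the two in-flight transactions are dropped. Passing `crash_at=None` models a server that keeps answering heartbeats while behaving badly: the function never reports a failure.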

Finally, some in-house or commercial applications may not lend themselves to clustering at all. In those cases, you’ll need to deploy a single fault-tolerant server, like the Stratus ftServer 6600, instead of a cluster of lesser, semi-fault-tolerant servers.

On the other hand, fault-tolerant servers aren’t perfect. A crashed driver stack, an unchecked buffer, a “blue screen of death,” an application time-out — all of these will bring down a $100,000 fault-tolerant server just as thoroughly as they clobber a commodity server at one-twentieth the cost. But unless lightning strikes twice simultaneously, a server cluster can withstand these “soft” errors and keep on processing new transactions.

In my experience, software, not hardware, is the cause of most server failures; given that, server clusters should be your first choice when designing a high-availability server farm. Save the expensive fault-tolerant machines for applications that are truly hardware-fault intolerant.
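A rough back-of-the-envelope calculation shows why soft errors favor clusters. The Python sketch below assumes that each server is independently down for some fraction of the time because of software faults, and that failover is instantaneous; both are simplifications, and the 0.1 percent figure is a made-up illustrative value, not a measurement.

```python
HOURS_PER_YEAR = 24 * 365


def single_server_downtime(p_soft: float) -> float:
    """Fraction of time a lone server is down from software faults.

    Hardware fault tolerance does not help here: a software crash takes
    the machine out of service no matter how redundant the hardware is.
    """
    return p_soft


def cluster_downtime(p_soft: float, nodes: int = 2) -> float:
    """Fraction of time every node in the cluster is down at once.

    Assumes node failures are independent and failover costs no time.
    """
    return p_soft ** nodes


p = 0.001  # hypothetical: each server is down 0.1% of the time
print("single server hours/year:", single_server_downtime(p) * HOURS_PER_YEAR)
print("two-node cluster hours/year:", cluster_downtime(p, 2) * HOURS_PER_YEAR)
```

Under these assumptions the two-node cluster's expected outage time is smaller than the single server's by a factor of a thousand. Real clusters will do worse than this idealized figure, because detection delays and correlated failures (the same bug hitting every node) are ignored.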

There is, of course, a middle ground: Install a cluster of reasonably affordable servers that offer limited, but not complete, fault tolerance. Many manufacturers offer servers with redundant power supplies, redundant disk arrays, redundant cooling systems, even redundant memory, as in the HP ProLiant DL740. But even there, a CPU or motherboard failure, operating system fault, or application failure could still crash an individual system. A cluster of these partially fault-tolerant servers will give the best of both worlds.
