High-priced servers with proprietary architectures have the edge in guarding against failure. In the off-the-shelf world of value servers, surmounting challenges to high availability is your job. Management solutions make remote observation easier, and clustering is getting closer to standard fare for OSes.

But I wonder: Will a rack of value servers with the equivalent computing power of a large, monolithic multiprocessor server ever be able to sense and respond to availability problems the way big iron and its OSes can?

The OS on a value server is likely to blast alerts out to applications when something goes wrong in the hardware, drivers, or OS. No application should hear about an availability issue that it didn't create or can't address.

IBM knows exactly what's in every Power5 server it sells. That level of understanding of each component yields a major boost in availability and in the ability to capture errors. If something goes wrong with an IBM server at the OS or hardware platform level, an underlying protective layer guards against data corruption, connection loss, and errant behavior.

An application should typically run forever. Mastery of the art of graceful failure under all conditions shouldn't be a prerequisite for enterprise application developers, and it's work that's done over and over, project after project.

Consider that user-mode applications are often saddled with the responsibility of sorting out arcane, platform-peculiar errors related to availability. Does a disk write fault mean that the OS has already done all it can, or might it be in the midst of fail-over and I'm supposed to keep trying? What do I do in my application code when I get a memory parity error? If a TCP socket closes and the same IP requests a new connection a few seconds later, how can I tell whether it's a new session or the continuation of an old one after a fail-over? I've got two files open and the system says it's out of file handles.
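That retry question has no portable answer, so applications end up encoding guesses. Here is a minimal sketch of that guesswork in Python; the set of "maybe transient" errno values is a hypothetical assumption on my part, because no OS actually promises which errors mean "the fail-over will finish, keep trying":

```python
import errno
import os
import time

# Errors we *hope* are transient (a fail-over in progress). Which errno
# values really mean "retry" is platform-peculiar guesswork -- exactly
# the burden this column complains about.
MAYBE_TRANSIENT = {errno.EIO, errno.EAGAIN, errno.EINTR}

def write_with_retry(fd, data, attempts=5, delay=0.1):
    """Write all of `data` to `fd`, retrying errors we guess are transient."""
    for attempt in range(attempts):
        try:
            while data:
                written = os.write(fd, data)
                data = data[written:]
            return
        except OSError as exc:
            if exc.errno not in MAYBE_TRANSIENT or attempt == attempts - 1:
                raise  # out of guesses; log and abort upstream
            # Back off and hope the OS finishes whatever it's doing.
            time.sleep(delay * (2 ** attempt))
```

Every enterprise application that cares about availability ends up carrying some variant of this boilerplate, which is the "done over and over for project after project" problem in miniature.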
It isn't my fault, so it shouldn't be my problem! With so many possibilities (I like to ponder why an OS notifies an application of a fatal error at all), it's not surprising that the standard response to a problem is to log and abort. That's a cop-out in some large projects, but in most it's the surest way to get an application through QA.

One example of lousy handling of an availability challenge is the sequence of events that occurs when a network interface fails over. Network hardware and drivers do very smart things to make the spare network interface a perfect duplicate of the one that failed. But the fail-over takes time, and in applications throughout the system, network connections close and it's left to each application to decide how to respond. It would be smarter to pause threads that have open network connections so they at least have a chance of staying connected. Or perhaps the OS should attempt to reconnect on the application's behalf after the fail-over completes.

An availability-related error that makes it to an application is a sign that a system designer, device driver developer, or OS engineer punted. But I can't rag on them too much. The fact that value server OSes and drivers can't rely on minimum capabilities in the underlying hardware is an unintended consequence of open systems. That tradeoff means that, for now, value servers and portable OSes will burden applications with availability issues more than they should. Value servers can't be called high-availability systems until they see to availability themselves.
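Absent that OS help, "reconnect on the application's behalf" becomes application plumbing. A rough Python sketch of such a wrapper follows; it is a hypothetical illustration of the idea, not a facility any shipping OS provides:

```python
import socket
import time

class ReconnectingSocket:
    """Thin TCP client wrapper that quietly redials after a dropped
    connection, approximating what the column suggests the OS could do.
    An application-level sketch only -- not a real OS facility."""

    def __init__(self, host, port, retries=5, delay=0.2):
        self.addr = (host, port)
        self.retries = retries
        self.delay = delay
        self.sock = socket.create_connection(self.addr)

    def sendall(self, data):
        for attempt in range(self.retries):
            try:
                self.sock.sendall(data)
                return
            except OSError:
                # Connection dropped -- maybe a NIC fail-over in progress.
                time.sleep(self.delay * (attempt + 1))
                self.sock.close()
                self.sock = socket.create_connection(self.addr)
        raise ConnectionError("gave up after %d attempts" % self.retries)
```

Note what the wrapper cannot know: whether bytes from a failed attempt actually arrived, or whether the peer it redials is continuing the old session or starting a new one. That is precisely the ambiguity applications are left to sort out today.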