Admins lavish attention on a critical database upgrade, but the good ol' human factor derails their plans in this tech story.

At the time of this story, I worked for a large tech corporation, and we ran into an example of how, no matter how much pre-planning is done, human error still creeps into day-to-day work. Several administrators, located in different countries due to outsourcing, were asked to take on a project. One particular customer had a large, outdated database. We supported basically everything, and everything needed an upgrade, including server hardware, operating system, cluster, and database software.

This being a critical database, all precautions would be taken, including having a current backup in place and a neat fallback plan in case anything went wrong. Instructions for the change were written down and agreed upon by all the technical parties involved. It was a pretty straightforward operation, and everything was ready for an easy, quick, and painless upgrade with minimal downtime.

The change took place, for some reason, on a Sunday evening. First a full backup was taken and stored on one of the file systems that was about to be moved to the new server. Then the migration started.

Everything went fine for a while: All the file systems were mounted on the new servers, which were already running a shiny new operating system and clustering software. Then the server administrator ran a script, provided by the database administrator, that was supposed to upgrade the database. One crucial mistake was made at this point: The script was designed both to install a new instance and to upgrade an existing one.
The two behaviors were controlled by a single command-line switch, which indicated an upgrade; otherwise, a new install was assumed. This command-line option was missing from the written change instructions that the server administrator executed. As a feature, when installing a new instance, the script would make sure there were no old files lying around that could conflict with the install, so for good measure it issued a command to wipe out all of the database-related file systems before installing the binaries.

The server administrator watched as the script ran for several minutes. He somehow overlooked the script's messages about removing all files from here and there, as they were buried among several screens of output. The script finished with no errors, so the database administrator proceeded to start the database. At this point he found a small problem: There was no data. The database binaries were all there, newly installed, but all the data file systems were empty.

They figured the problem was file system corruption, which had plagued other recent projects. They tried to unmount and mount the file systems again. Still no data. Then they ran some file system checks, which came back clean. Yet again, no data. They called SAN support to verify the disks. There was nothing wrong with them.

After several of these futile attempts, they turned to the fallback plan: moving the disks back to the old server, restoring the backup, and starting the database. So they did. After mounting the file systems on the old server, they noticed the file systems were still empty (in their minds, corrupted), including the file system where the backup was stored.
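The failure mode here, where one forgotten switch turns "upgrade" into "wipe and reinstall," can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual script: the flag name (`--upgrade`), the directory layout under `DB_ROOT`, and the file names are all illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the install/upgrade script's dangerous default.
# The --upgrade flag and all paths below are illustrative, not from the
# real script. DB_ROOT points at a demo directory so this is safe to run.

DB_ROOT="${DB_ROOT:-/tmp/db-demo}"

run_installer() {
    mode=install                       # default: treat the run as a fresh install
    for arg in "$@"; do
        [ "$arg" = "--upgrade" ] && mode=upgrade
    done

    if [ "$mode" = "install" ]; then
        # The "feature": a fresh install clears anything that might conflict.
        # Forget the flag and this silently destroys live data; in the story
        # it also took out the backup stored on the same file systems.
        for d in "$DB_ROOT/data" "$DB_ROOT/logs" "$DB_ROOT/backup"; do
            echo "removing all files under $d"
            rm -rf "$d" && mkdir -p "$d"
        done
    fi
    echo "installing binaries ($mode mode)"
}

# The fatal run: the written instructions omitted the flag.
mkdir -p "$DB_ROOT/data" && touch "$DB_ROOT/data/customers.dbf"
run_installer                          # no --upgrade: data dirs are wiped
```

A safer design inverts the default: make the destructive path the one that requires an explicit flag (say, `--fresh-install`), and refuse to delete non-empty data directories without a loud, separate confirmation.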
Again they made several troubleshooting attempts to recover these "corrupt" file systems. By then it was almost 8 a.m. on Monday, end-users needed to start working, and everybody was getting nervous. They finally realized the data was gone for good and there was no choice but to restore from the daily tape backups.

First problem: They hadn't checked the backup tape before starting the change, and the backup for Sunday had not completed by the time they began. So they had to use Saturday's backup, and one day's worth of data was lost. Second problem: Restoring close to 1TB from tape takes time. The data wasn't fully restored until Monday afternoon.

Then a whole bunch of different issues occurred. Consistency checks failed. Logs were missing. They had to downgrade the database binaries. Users' access and privileges were lost. Finally, on Tuesday night, the database was back to business as usual.

The problem had occurred because the database administrator's script removed all the files, including the backup, and the server administrator couldn't tell a corrupted file system from an empty one. To say the least, a thorough review was conducted and action plans were implemented to (hopefully) prevent a repeat of such an incident, including ways to communicate more effectively across different countries, time zones, and first languages. But no matter how sophisticated technology becomes or how far a company reaches globally, there's still a human factor involved, for better or for worse.

This story, "Where have all the files gone?," was originally published at InfoWorld.com.