Rsync doesn't get the credit it deserves. Here's what you might be missing

Over a decade ago, Andrew Tridgell was quoted as saying, “In 50 years’ time, I doubt anyone will have ever heard of Samba, but they’ll probably be using rsync in one way or another.” As one of the principal developers of both utilities, he might have known a bit about what he was talking about.

More than 10 years later, I think that forecast is fairly safe. That’s not to say that Samba is going to be forgotten anytime soon, but rsync is — or should be — a staple in just about every infrastructure across the globe.

On its face, rsync seems simple. Give it a source and a target directory, along with a method of communication, and it will make the directories identical. However, there’s much more going on under the hood than many people realize. Rsync isn’t just checking that one or more files are the same on either end; it uses a delta-transfer algorithm to compare parts of each file and thus significantly reduces the time and bandwidth needed to synchronize them.

Take, for example, a 5GB file. It’s easy to run an MD5 sum on the file at each end of a synchronization path to see whether the copies differ. If the sums don’t match, a simple synchronization utility would ship the entire source file over to the target, resulting in a 5GB transfer. Rsync, by contrast, runs a rolling checksum across the entire file, comparing checksums for small segments, and transfers only those blocks that do not match. This is a very simplified description of what’s actually going on, but it gets the point across.
If only 2MB of data actually changed in that 5GB file, only about 2MB of data will be transmitted from the source to the target. This is a huge benefit in time and bandwidth savings. In addition, rsync uses compression to further reduce bandwidth where applicable, and it defaults to using SSH as its transport on most *nix systems for security. As a result, a simple rsync command performs more work than you might think.

There are caveats, however, such as when dealing with compressed files. Even when only a small number of actual changes have been made to a file, compression algorithms may substantially alter the final compressed output, resulting in much more data transfer than would otherwise be required. But even here lurk rsync capabilities that don’t get much fanfare, such as gzip’s --rsyncable mode, which causes only minor alterations in the final compressed file, allowing rsync to transfer only the necessary data. This can yield massive performance and speed gains when synchronizing compressed files.

There are other oft-overlooked elements to rsync, such as the ability to create snapshots of directories or file systems without requiring that all of the data be synchronized during every pass. Say you have a 10GB directory tree that you wish to synchronize to another server every night. You want to retain two weeks’ worth of backups at the target, but you’d rather not rsync 10GB of data every night, nor store 10GB for every day of retention. Using rsync’s --link-dest parameter, you can create a single 10GB backup, then instruct rsync to back up only the changes within that hierarchy in subsequent passes. It does this by creating hard links to all of the files that exist, unchanged, in the main directory.

Let’s say that in that 10GB directory tree, a few files change during the day, resulting in 30MB of total changes.
When your nightly rsync runs, it creates hard links to the files that have not changed and copies over the 30MB in files that have changed. The new backup directory appears to hold all 10GB worth of data, when in reality it holds a whole lot of hard links plus only the files that changed since the last pass. If you synchronize nightly and keep 14 days of backups, you store the full 10GB only once, plus the much smaller set of changes from each subsequent pass. Instead of storing and transmitting 140GB, you might only need 15GB, depending on how much churn is in that directory tree. If you’re interested in knowing more about using rsync for backups, here’s a good example you might want to read.

Of course, as anyone who uses kernel.org is likely to know, rsync can function as a server as well as a client. In fact, this is how many “prosumer” NAS storage arrays handle backup functions: they run rsync servers that listen for incoming requests for data. It’s about as easy as it sounds. An rsync listener runs on one system, and certain directories or file systems are exposed for synchronization. Instead of requiring authentication via the normal SSH transport, rsync can use its native transport, based on predetermined secrets, to allow authenticated or anonymous synchronization requests. This is best suited to nonsensitive data because the transport itself is not encrypted.
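A minimal daemon setup might look like the following sketch; the module name, paths, and hostname are hypothetical:

```shell
# Hypothetical /etc/rsyncd.conf exposing one read-only module:
#
#   [backups]
#       path = /srv/exports/backups
#       read only = yes
#       auth users = backup
#       secrets file = /etc/rsyncd.secrets
#
# (Omit "auth users" and "secrets file" to allow anonymous access.)

# Start the listener on the server (port 873 by default):
rsync --daemon

# A client pulls from the module over rsync's native transport; note the
# rsync:// scheme (the host::module double-colon syntax is equivalent):
rsync -av rsync://nas.example.com/backups/ /local/mirror/
```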
If secured transmission is desired, you can use either SSH with public keys or a VPN.

There are other goodies in rsync, such as the ability to limit bandwidth consumption during transfers (--bwlimit) to reduce the impact on network connections, or fuzzy matching (--fuzzy), which looks for a similarly named file on the target that can serve as the basis for a rolling-checksum transfer when no exact match exists — along with other elements that only increase this utility’s usefulness.

Regardless, it’s clear why Tridgell has more faith in the longevity of rsync than of Samba. Once the sands of time have washed the SMB protocol away for good and Samba is resting in peace alongside NetBEUI and NetWare, we’ll still need fast and fluid file synchronization throughout our infrastructures. Luckily, we already have the solution.

This story, “Why you should be using rsync,” was originally published at InfoWorld.com. Read more of Paul Venezia’s The Deep End blog at InfoWorld.com.