21 February 2015

The Death of Hardware RAID... Good Riddance!

There was a point in the not-too-distant past when hardware-backed RAID storage was an indispensable part of the toolkit of any server administrator.  The basic idea of RAID (Redundant Array of Inexpensive Disks, for the uninitiated) is to mitigate the risk of data loss problems associated with hard disk failures.  You have a set of disks, i.e. an array, and if one (or sometimes if more than one, depending on the configuration) fails, your data is still available... often in a manner transparent to the operating system and/or end user.

Like everything else in computing, RAID evolved in generations, with each generation adding configurations (levels) that had different capabilities:

  • RAID 1: The earliest arrangements that could be considered RAID were shipped in 1983, though the actual term RAID wouldn't be coined for several years.  These earliest configurations consisted of mirrored pairs of drives, i.e. two drives with the same contents that appeared as a single drive to the system.  For two drives of the same size, the total amount of available storage is that of a single drive.  RAID 1, also known as mirroring, is still in common use today for some scenarios.
  • RAID 2: This early implementation striped individual bits across drives for high transfer rates, and increased the available capacity while still providing error correction via Hamming codes stored on one or more dedicated drives.
  • RAID 3: A rare implementation that stripes bytes (as opposed to the individual bits of RAID 2) across drives and keeps parity on a dedicated drive.  This level was never deployed widely.
  • RAID 4: Very similar to both RAID 2 and RAID 3, except that entire blocks, instead of bits or bytes, are the unit striped across drives and protected by parity.
  • RAID 5: The de facto RAID mode for server admins for many years is novel in striping both data and parity across all drives, which improves performance and ensures that all drives wear evenly, unlike levels 2 through 4.  A minimum of three drives is required for RAID 5, and one drive's worth of space goes to parity, so a three-drive array offers the capacity of approximately two drives for data.  RAID 6 is almost identical to RAID 5 save that parity is doubly-redundant, allowing up to two drives to fail without compromising data.
There have also been various stacked / combined levels, one of the most notable of which is referred to as RAID 10, a stripe of mirror pairs.
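The parity idea behind RAID 5 is easier to see in a few lines of code.  The sketch below is a toy model in Python, not how any real controller or driver is implemented: it treats each drive's share of a stripe as a short byte string, computes parity as the XOR of the data blocks, and rebuilds a lost block from the survivors.  The helper names (xor_blocks, usable_drives) are invented for this illustration, and the capacity figures assume equal-sized drives and two-way mirrors for RAID 10.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, plus a parity block on a fourth drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1], then rebuild its block
# from the surviving data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)


def usable_drives(level, n):
    """Usable capacity, in units of one drive, for n equal-sized drives.

    Hypothetical helper for illustration only.
    """
    if level == "1":    # every drive holds the same data
        return 1
    if level == "5":    # one drive's worth of parity
        return n - 1
    if level == "6":    # two drives' worth of parity
        return n - 2
    if level == "10":   # stripe of two-way mirrored pairs
        return n // 2
    raise ValueError(f"unsupported RAID level: {level}")
```

XOR is its own inverse, which is why the same function both computes the parity and recovers the missing block.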

All of these methods also have some serious shortcomings:
  • While all of these systems address the problem of data availability, only the most advanced (and therefore expensive) implementations give a tinker's d*** about data integrity.  A recent well-written article by Jim Salter, including some handy experiments, documents the problems of RAID and bit rot.  The short version: RAID protects against complete disk failure, not against subtle disk corruption.
  • Hardware RAID is (or was, until very recently) both expensive to implement, requiring dedicated controllers, and tedious to maintain, requiring experienced personnel and complex drivers for every supported platform.  It was formerly a hobby of hardware and software manufacturers to gang up on IT professionals by only allowing certain hardware to work with certain software in certain supported deployments.  Of course, those days are long gone, right? *wink, wink, nudge, nudge*
  • Software RAID, while virtually free to implement and much easier than hardware to maintain, suffers from significant performance problems.  It's also very rarely portable across operating systems.
After all of these years, the relationship between IT folks and RAID has turned sour like a shotgun marriage...
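To make the bit rot point concrete, here is a small Python sketch, under the assumption of a simple two-way mirror.  It is not taken from the article mentioned above and does not model any particular RAID product; the record contents and variable names are made up.  One copy silently flips a single bit.  The mirror alone can see that the copies differ, but it has no way of knowing which one is right; a checksum recorded at write time does.

```python
import hashlib

good = b"payroll record 0042"

# Flip one bit in the second copy to simulate silent corruption (bit rot).
rotted = bytes([good[0] ^ 0x01]) + good[1:]

mirror = [good, rotted]

# A plain mirror can notice that its copies disagree,
# but nothing in the copies themselves says which one is correct.
copies_disagree = mirror[0] != mirror[1]

# A digest recorded when the block was written, and stored apart from
# the data, identifies the intact copy.
stored_digest = hashlib.sha256(good).hexdigest()
intact = [
    index
    for index, copy in enumerate(mirror)
    if hashlib.sha256(copy).hexdigest() == stored_digest
]
```

Here intact ends up listing only the first copy, which is exactly the information a traditional array without checksums does not have.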

So what's the alternative?  Enter next-generation file systems.  That's right- file systems.  Traditionally, file systems just organized files, and it was up to volume management solutions like RAID to keep the file systems consistent and happy.  As the problems with RAID were exposed, and personal computing devices (where RAID is rarely an option) became more widely used, file systems themselves became more resilient to compensate.

The next generation of file systems is a bit different.  These solutions combine both volume management, typically including RAID-like capability, and file systems into a single all-encompassing data storage mechanism.  This allows them to do all sorts of amazing things around data storage, with tricks to both save space and protect data integrity, all while incurring a negligible performance penalty when properly provisioned.  Examples include:
  • The cross-platform ZFS, originally developed by Sun Microsystems and sent out in an open-source lifeboat before the company sank, was the pioneer of next-generation filesystems.  It supports mirroring, striping, and RAID-Z (like RAID-5) and also supports nesting of those different types.  It also supports block checksums, in-line block-level deduplication, live snapshots and rollback, filesystem clones, transparent compression, transparent encryption in some implementations, differential backup and restore, etc. ad nauseam... Short version: it's awesome, there's a Free & Open Source version available via the OpenZFS project, and it's available on most *nix platforms.
  • Linux's Btrfs has a similar feature set, with integrated volume management and RAID-like capabilities, plus snapshots, plus transparent compression (notably with file-level granularity).  Deduplication is done outside of the write path, to conserve memory.  Btrfs also has a special mode of operation for solid-state drives which reduces unnecessary writes, in addition to the TRIM command also supported by some ZFS implementations.  While not as mature as ZFS, Btrfs is quite stable and usable in most environments, and seems to be preferred over the former for Linux usage.
  • A lesser-known option, HAMMER (no connection to the nefarious baddies in the Marvel universe), is available only on DragonFly BSD at the present time.  It has a somewhat limited feature set compared to the previous two options, but still does deduplication, checksums, etc.  RAID-like functionality is not yet implemented directly; it is instead approximated through a streaming backup mechanism.
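Two of the features that keep coming up in that list, block checksums and block-level deduplication, fall out of the same trick: naming each block by a hash of its contents.  The following Python sketch is a deliberately simplified illustration of that idea and makes no claim about the on-disk formats of ZFS, Btrfs, or HAMMER.  The DedupStore class, its tiny 4-byte block size, and the file names are all hypothetical.

```python
import hashlib


class DedupStore:
    """Toy content-addressed block store (illustrative only)."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes, each unique block stored once
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data):
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Identical blocks hash to the same digest, so they share storage.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        out = bytearray()
        for digest in self.files[name]:
            block = self.blocks[digest]
            # Verify every block on read; corruption is detected, not returned.
            if hashlib.sha256(block).hexdigest() != digest:
                raise IOError(f"checksum mismatch in {name}")
            out += block
        return bytes(out)


store = DedupStore()
store.write("a.txt", b"AAAABBBBAAAA")
store.write("b.txt", b"BBBBCCCC")

logical_blocks = sum(len(d) for d in store.files.values())
unique_blocks = len(store.blocks)
```

In this example five logical blocks are written but only three are actually stored, and every read is verified against the digest that names the block.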
These next-generation filesystem-and-volume-manager combos all attempt to address the shortcomings of the traditional filesystem / volume manager model, and do so rather effectively.  But hang on a second, you ask: we've heard about Solaris, the BSDs, and Linux.  What about the two major consumer operating systems: Mac OS X and Windows?

In terms of major consumer operating systems, while Apple's default HFS+ file system is practically a geriatric patient in technological terms (having not changed much structurally since the mid-90s), there is a stable, well-maintained, and free build of OpenZFS available for Mac OS X.  Recently some work has even been done to make it bootable.  It's not officially blessed (pun intended- please tell me that somebody gets it) by Apple, but it does the trick for those who want and/or need it.

Windows users are almost out of luck!  The most capable file system offered by Microsoft at this time is ReFS which, when combined with Windows Storage Spaces, isn't terrible... but it's also nothing amazing either.  Even an article by a Microsoft Kool-Aid drinker fanboy advocate admits (starting at paragraph 16, for those looking for it) that its feature set is unimpressive compared to those other solutions.  The only selling point that he can offer over ZFS is reduced memory footprint... which any server admin knows is practically a farce when talking about Windows- quite possibly the most notorious modern OS when it comes to memory thrashing.  It's also still a discrete filesystem / volume manager solution, with all of the inherent limitations of that design.  Nevertheless, it still has significant advantages over hardware RAID, like the ability to detect bit-flips with opt-in (seriously?  you have to turn it on manually, from a command line?) block-level checksums.

While there are those out there still defending hardware RAID, they must do so with the caveat that it should no longer be considered the panacea that it was regarded as for many years.  If you're a server admin in 2015, and you're still using hardware RAID as your go-to redundancy solution without any hesitation, you should probably take a good, long look in the mirror and reevaluate your life.  Start by asking which of those solutions above could bring amazing data integrity goodness to your microcosm.
