Fault Tolerance Levels

Overview

To succeed, businesses must maximize the availability of corporate data by eliminating points of failure between the end-user and the desired data. The key is in establishing and maintaining a high level of fault tolerance without making huge capital investments. By explaining the different levels of and approaches to fault tolerance, this document can help you decide which ones are best for your computing environment.

Foundations of Fault Tolerance

Maintaining a high level of data access and maximizing a user's efficiency are two major concerns of corporate IT departments. Administrators can greatly increase the overall availability of data by deploying robust fault tolerant systems from the data center to the desktop, featuring built-in redundancy and fault monitoring of the critical components used to store, manage and transfer data. Depending on a company's budgets and needs, fault tolerant fortifications to existing implementations can occur incrementally or in a complete retrofit. Often, reinforcement begins with the disk storage itself and grows outward toward the end-user through successive upgrades to disk subsystems, disk controllers, server-to-storage links, servers, and - ultimately - server-to-user distribution networks.

With today's information-intensive applications, the rate at which corporations store data can easily exceed the ability of the IT or IS department to readily provide additional storage capacity. As the amount of data grows, storing it in a flexible way with the appropriate level of fault tolerance becomes a major challenge.

Historically, one approach was to use the servers themselves. Unfortunately, servers offer limited options for adding storage capacity. Packing additional hard drives into server housings often requires a delicate hand and strong assembly skills, and usually results in significant network down-time. When servers run out of chassis space for additional drives, administrators traditionally have only a few, unappealing alternatives available. They can replace smaller drives with higher-capacity disks and suffer through lengthy downtime, while data gets backed up from the smaller disks and restored to the larger ones. Otherwise, they can add disks to alternate servers having the internal space and "mount" them across the network - which will consume valuable network bandwidth, increase system complexity, and slow down disk access to network speeds with file-sharing overhead. Additionally, rackmount servers offer limited space to add additional disk drives internally.

A better approach is the use of external disk drives, the first step toward a rapidly deployable response to storage demands. Disk drives that come with their own fan-cooled enclosures, power supplies, and external bus interface connectors offer a quick fix to storage needs. Beyond just a few external devices, however, such solutions simply don't scale well: power and space requirements, not to mention cabling, quickly become too unruly to manage.

Fault Tolerant Storage Enclosures

Designed exclusively for storage devices, JBOD (or "Just a Bunch Of Disks") is an externally attached storage solution capable of efficiently housing multiple drives. Unlike external stand-alone disk drives, JBOD solutions incorporate power and cooling systems that are capable of supporting a large number of drives simultaneously. The drives are connected to an internal bus, reducing external bus cabling to a single connection between the server and the JBOD system. JBOD systems also support hot-swap disk drives that can be added and replaced without interrupting data storage or server operations. High quality JBODs incorporate redundant components to guard against total system failure. A JBOD system can be connected to internal PCI RAID controllers or to complete external storage subsystems. A highly efficient solution for quickly adding storage capacity, JBOD outclasses internal storage expansion for even the most demanding requirements.

As a dedicated storage solution, every component of a JBOD system that implements fault tolerance adds to the overall integrity and availability of stored data. In particular, arrays of disk drives have stringent cooling and power requirements, making these systems prime candidates for fault tolerant design.

Cooling Systems
Restricted airflow puts at risk servers packed tightly with temperature-sensitive components, but moving from internal drives to an external JBOD solution allows for optimal cooling of disk drives and server components. To efficiently cool components, JBOD systems employ multiple cooling fans. But unlike typical fans that ventilate an entire chassis and a wide variety of components, JBOD cooling systems are specifically engineered to remove heat from arrays of disk drives.

From fan velocity to airflow to exhaust ports, JBOD cooling systems are built to do one thing and do it well: keep your data from overheating. Because multiple cooling fans are employed, any one of them can fail without jeopardizing the entire cooling system. The purposefully engineered JBOD chassis allows replacement of a failed cooling fan without interrupting the operation of the JBOD system.

Power Supplies
Another focus of fault tolerance in data storage systems is the power supply. Fault tolerant power supply systems are usually implemented with redundant power supplies, either of which is capable of providing the necessary power for the system. When combined, both power supplies contribute to handling the system's overall requirements. When a problem is detected, the failed or failing power supply is isolated until it can be repaired or replaced.

As with cooling, a JBOD solution's power supply system benefits both the storage system and the server by off-loading disk drive power requirements from the server's own power system. This is a significant benefit, because disk drives are demanding and unpredictable power consumers. By taking even a couple of drives off the server's general power supply and moving them into a JBOD system, the longevity of the server's power supply increases.

Self-Monitoring Enclosures

Fault tolerance in active power, cooling, data bus, and other components helps data storage systems survive a number of otherwise crippling hardware failures, but it doesn't provide any means to identify and repair a failure. Ideally, JBOD solutions would notify an administrator by sending standardized messages to monitoring software on a management station.

In 1995, a standard called SAF-TE was drafted to meet the challenge, and since then it has become widely accepted. SAF-TE, short for SCSI Accessed Fault Tolerant Enclosure, is the method by which SCSI-based storage devices, controllers, power supplies, and other components communicate their status to monitoring applications. Environmental monitoring, such as storage chassis temperature, can also be tracked through SAF-TE.

Messaging from enclosures that are compliant with SAF-TE standards can be translated to audible and visible notifications on the JBOD system itself - status lights and alarms - to indicate failure of critical system components. While less frequently discussed, given its passive role in fault tolerance and high availability, SAF-TE is a very important element when attempting to maintain a high degree of availability. Using a software-monitoring tool helps retain a high level of fault tolerance for continuous data protection. By implementing the appropriate enclosure monitoring facility, the operator can easily be alerted when there's a failure. If a power supply or fan module fails, for instance, redundant modules simply carry the additional load. Monitoring these components is a part of maintaining a good fault tolerant system design. Using the SAF-TE status information from the storage devices, software on the management host can:

1) alert operators in the event of a problem,

2) identify which component or environmental threshold the problem originates from, and

3) allow for a quick re-establishment of fault tolerance. Alerts can be sent to an event log, an e-mail system, a pager, or a service provider's technical support team.
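Those three steps can be sketched in software. The snippet below is a minimal, hypothetical illustration of a management-host monitor acting on enclosure status reports; the component names and status values are illustrative stand-ins, not the actual SAF-TE wire format.

```python
# Hypothetical monitor on a management host: scan an enclosure status
# report and raise an alert for every component that is not healthy.
# Status values and component names are illustrative only.

OK, FAILED, DEGRADED = "ok", "failed", "degraded"

def triage(status_report):
    """Return (component, alert_text) pairs for anything not OK."""
    alerts = []
    for component, state in status_report.items():
        if state != OK:
            alerts.append((component, f"{component} is {state}: dispatch service"))
    return alerts

report = {"fan-1": OK, "fan-2": FAILED, "psu-1": OK, "temp-sensor": DEGRADED}
for component, message in triage(report):
    print(message)   # in practice, route to an event log, e-mail, or pager
```

In a real deployment, the `print` call would be replaced by whichever alert channels the site uses, so that redundancy lost to a failed fan or power supply is restored before a second failure occurs.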

RAID Fault Tolerant Disk Arrays

Even a significant amount of unallocated, raw storage capacity provides little benefit if users can't access it. A failed disk drive is often the culprit when data is unexpectedly inaccessible. Among personal computer users, it's not rare to have lost a good day's work when the single drive relied on for all data storage "crashed." The standard response to such a significant failure point is RAID, an acronym that stands for Redundant Array of Inexpensive Disks. As implied, redundancy is implemented by an array of disk drives configured in such a way that failures can occur without any loss of user data or interruption in availability.

RAID Level 1
The simplest configuration for fault tolerant RAID storage is called RAID Level 1, or "mirroring," where data stored on one drive is mirrored (duplicated) on a second, identical drive. High-performance RAID Level 1 implementations can access both drives concurrently, so writing out a block of data to a mirrored configuration requires only a single concurrent write to both disks rather than consecutive writes to each.

An added benefit of concurrent disk access is doubled throughput for data reads from disk. Both drives contain identical data, so half of the desired data can be read from one drive, while the other half is read simultaneously from the second drive. In the event of a disk failure, the mirrored copy is used for data access until the failed drive is replaced. Performance during a disk failure is the same as that of accessing a single disk. Of course, it's not possible to double throughput on data reads until the failed drive is replaced and full mirroring is restored.

While highly efficient for reading and writing, RAID Level 1 does not use the disk drives themselves very efficiently. Physical storage requirements end up being double the fault tolerant storage yielded by the configuration. For this reason, RAID Level 1 most often is found in applications that require very high data availability.
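The mirroring semantics described above can be captured in a few lines. This is a toy in-memory sketch, not a controller implementation: every write goes to both mirror members, and a read succeeds from whichever member survives.

```python
# Toy illustration of RAID Level 1 (mirroring): two in-memory "disks"
# stand in for the mirror members. Real controllers do this in hardware.

class Mirror:
    def __init__(self):
        self.disks = [dict(), dict()]   # two identical members
        self.failed = set()             # indices of failed members

    def write(self, block_no, data):
        # Every write lands on all surviving members of the mirror.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                disk[block_no] = data

    def read(self, block_no):
        # A read is satisfied by any surviving member.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                return disk[block_no]
        raise IOError("both mirror members failed")

m = Mirror()
m.write(0, b"payroll")
m.failed.add(0)                       # simulate losing one drive
assert m.read(0) == b"payroll"        # data remains available from the survivor
```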

RAID Level 5
A common configuration for fault tolerant RAID storage is called RAID Level 5. RAID Level 5, or "distributed parity," requires more than two drives to implement. When data is written to disk, it gets broken down into blocks and spread across multiple drives. The "distributed parity" component is a signature generated by the data blocks. Parity blocks are also distributed among the array's drives. An algorithm is used to determine where to place the parity block generated by a group of data blocks, and this algorithm allows for a drive failure without loss of data. That's because a missing data block can be regenerated from calculations based on the remaining data blocks and the parity block. Losing a parity block on a failed drive is also not a problem, because the blocks used to generate the data are guaranteed to be on the remaining drives.
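The parity arithmetic behind that reconstruction is simple XOR. The sketch below shows the principle only, under the simplifying assumption of fixed-size blocks; the rotation algorithm that places parity blocks across the drives is omitted.

```python
# Sketch of RAID Level 5 parity math: the parity block is the XOR of the
# data blocks in a stripe, so any single missing block can be rebuilt by
# XOR-ing the survivors. (Parity placement/rotation is omitted.)

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

stripe = [b"AAAA", b"BBBB", b"CCCC"]    # data blocks on three drives
parity = xor_blocks(stripe)             # parity block stored on a fourth drive

# The drive holding b"BBBB" fails; rebuild its block from the rest + parity:
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == b"BBBB"
```

Because XOR is its own inverse, the same routine serves both parity generation and reconstruction, which is why a single drive failure (data block or parity block) costs no data.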

RAID Level 5 has fast read access, because data is segmented across multiple drives and can be read concurrently. Write access performance is slightly affected by parity generation, but compared to mirroring, distributed parity allows for much more efficient disk utilization because a smaller percentage of disk space needs to be reserved for parity protection. Accordingly, RAID Level 5 is popular for file and application servers, including Web, e-mail, news, and database systems.

RAID Level 0
While it is not a fault tolerant RAID configuration itself, RAID Level 0, or "striping," is nevertheless an essential component of another popular RAID format, RAID Level 0+1. Striping refers to the way blocks of data get spread across multiple disk drives. RAID Level 0 performance gains are realized in both reading and writing to disk, in configurations that support concurrent access to striped disks. RAID Level 0 simply offers no fault tolerant protection.

In a striped four-disk array, for example, a file that's composed of four data blocks might have one block residing on each of the four disks. Reading or writing this file requires only a single concurrent read or write to the four disks, instead of four sequential reads or writes to a single disk. In the same four-drive set, the difference between RAID Level 0 and RAID Level 5 shows up when a drive fails: all four drives provide concurrent data access in both cases, but the RAID Level 0 file data would no longer be accessible, whereas the RAID Level 5 data would remain available.
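The block placement just described follows a simple round-robin rule. A minimal sketch, assuming sequential block numbering across a four-disk stripe:

```python
# Round-robin striping: block i of a file lands on disk i % NUM_DISKS,
# so a four-block file touches each of four disks exactly once and all
# four transfers can proceed in parallel.

NUM_DISKS = 4

def placement(num_blocks):
    """Map each block number to the disk it resides on."""
    return [(block, block % NUM_DISKS) for block in range(num_blocks)]

print(placement(4))   # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

A fifth block would wrap around to disk 0, beginning the next stripe.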

RAID Level 0 is often used in applications dealing with large files that must be read or written quickly - video production, image editing, and pre-press, for example. Such environments typically rely on backup systems to protect recent data, so any loss in terms of data or downtime is usually minimal.

RAID Level 0+1
Building on RAID Level 0, RAID Level 0+1 mirrors two equal RAID Level 0 configurations, resulting in a fault tolerant configuration with very high throughput. This type of configuration is appropriate for applications that require fast data transfer as well as reliability, such as file sharing and imaging applications.

Fault Tolerant RAID Controllers

A RAID controller is responsible for managing data on the disk array, monitoring the status of the disk drives, and maintaining user data integrity in the event of a drive failure. RAID systems can also employ "hot standby" disk drives that are readily available to replace a failed drive in a RAID configuration, allowing unattended fault tolerance in the data center. It's the controller's job to replace any failed drive with a hot standby disk drive, allowing the virtual array to regain its level of data protection. (The failed drive can be replaced at a later time.)

This flexibility is a key benefit that RAID controllers bring to the system administrator. If a drive fails when an operator or administrator is not available to quickly remedy the situation - overnight, say, or during a weekend or at a remote site - then the hot standby automatically joins the array, taking the identity of the failed drive. The RAID controller quickly restores the fault tolerant nature of the disk array.
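The controller's hot-standby behavior amounts to a small decision procedure. The following is a hypothetical sketch (drive names and the rebuild step are illustrative), not any particular controller's firmware:

```python
# Hypothetical hot-standby logic: when a member drive fails, promote a
# spare into the array in its place; the rebuild would then start on it.

def handle_failure(array, spares, failed_drive):
    array.remove(failed_drive)
    if not spares:
        return None           # no spare: array runs degraded until serviced
    spare = spares.pop(0)
    array.append(spare)       # spare takes the failed drive's place
    return spare              # rebuild onto this drive restores redundancy

array, spares = ["drive0", "drive1", "drive2"], ["spare0"]
promoted = handle_failure(array, spares, "drive1")
assert promoted == "spare0" and "spare0" in array
```

Once the rebuild completes, the array is fully redundant again, and the physically failed drive can be swapped out at the operator's convenience to replenish the spare pool.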

Even though our disk drives, where the data resides, may be fault tolerant, by using a single RAID controller we have introduced a new potential failure point. Without RAID protection, any drive failure can be catastrophic for the data - but a controller failure only results in losing access to the data. Fortunately, fault tolerant configurations allow for redundant RAID controllers, allowing one controller to fail without users ever losing access to the data.

There are two common implementations for dual RAID controller systems: active/active and active/standby. In dual-controller active/active configurations, each controller operates independently of - or in concert with - the other, enabling fast data transfer from storage to the host. Each controller in an active/active configuration can serve information to different hosts, and there is constant checking (called a "heartbeat") between them to guarantee the other controller's operational status. In the event of a controller failure, the second controller automatically assumes the responsibilities of the failed controller without any disruption to data access.

Replacing the failed controller effectively reestablishes the heartbeat, and both controllers automatically return to normal active/active behavior. This is transparent to the user, with no interruption in availability or data loss.

Controllers can also be configured in "active/standby" configuration, where a single controller is responsible for all attached devices. Monitoring its heartbeat, the standby controller verifies that the active controller is operational. If it determines that the active controller has failed, then the standby controller will take over until the failed controller recovers or is replaced. This operation also works transparently to end users, while assuring their continued access.
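The standby controller's heartbeat check reduces to counting missed intervals. A minimal sketch, assuming a miss threshold of three (timing, fencing, and the actual state hand-off are omitted; names are illustrative):

```python
# Minimal active/standby heartbeat sketch: the standby declares failover
# once the active controller misses more than MISSES_ALLOWED heartbeats.

MISSES_ALLOWED = 3

class Standby:
    def __init__(self):
        self.misses = 0
        self.active = False          # True once this controller has taken over

    def on_heartbeat_interval(self, heartbeat_seen):
        if heartbeat_seen:
            self.misses = 0          # healthy: reset the miss counter
        else:
            self.misses += 1
            if self.misses > MISSES_ALLOWED:
                self.active = True   # take over the failed controller's devices

s = Standby()
for seen in [True, False, False, False, False]:
    s.on_heartbeat_interval(seen)
assert s.active   # four consecutive misses exceed the threshold
```

Requiring several consecutive misses, rather than one, keeps a transient hiccup on the heartbeat link from triggering an unnecessary failover.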

With redundant power and cooling systems, dual active/active RAID controllers, and SAF-TE-compliant enclosure monitoring, your data storage systems have now become highly fault tolerant. You can shift your focus to the systems connecting the servers and users to the data storage. And so, the path from the servers to the storage system is the next level of protection.

Fault Tolerant Host Bus Adapters

The use of dual Host Bus Adapters (HBAs) at the server level establishes multiple paths between the server and the data storage subsystem. Implementing software on the server to monitor these redundant paths adds another layer of fault tolerance.

If a path is interrupted or broken, the software redirects the data request through an alternate path. This is known as "upstream" or "I/O path" failover. Like the heartbeat between redundant RAID controllers, there is a virtual heartbeat between redundant paths to data. If one of the paths fails to return the heartbeat, the data path is automatically re-routed and access to data is maintained without interruption to the user.
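The re-routing itself is a retry over an ordered list of paths. This hypothetical sketch stands in for the multipath driver; `send_via` is an illustrative stand-in for an actual HBA driver call:

```python
# Sketch of "I/O path" failover: try the primary path first, and re-route
# the same request through the alternate path if the primary is down.

class PathDown(Exception):
    pass

def send(request, paths, send_via):
    for path in paths:
        try:
            return send_via(path, request)
        except PathDown:
            continue                  # path broken: fail over to the next one
    raise PathDown("all paths to storage are down")

def send_via(path, request):
    # Stand-in driver call that simulates a failed primary path.
    if path == "hba0":
        raise PathDown()
    return f"{request} via {path}"

print(send("READ blk 7", ["hba0", "hba1"], send_via))   # READ blk 7 via hba1
```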

Server Clustering

At this point, we have discussed redundancy in every component from the server attachment to the physical drives on which data is stored. The server itself remains the most significant weakness in the quest for complete fault tolerance. Servers are responsible for stored data access through applications such as Web servers, databases, and file sharing. Failure of any component on the server systems can undermine the most fault tolerant storage implementations. The solution to this problem is clustering.

The term "clustering" is used loosely to describe a group of servers acting as a single virtual server. While many implementations use clustering to build a more powerful "virtual" server, they also provide another level of fault tolerance. Clustering shares the computing resource among multiple separate systems by running some user requests on one physical server, and other requests on another server. Deciding which server to use for any request is handled by the clustering management software, so the user does not have to query a number of different systems and pick the most available one. This allows multiple servers to be connected and support users' requests. If one server were to fail, the request is simply re-routed to the next available server.
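That routing decision can be sketched as a round-robin dispatcher that skips failed servers. The server names and health map below are illustrative assumptions, not any particular clustering product's API:

```python
# Sketch of cluster request routing: round-robin across the servers,
# skipping any marked unhealthy, so users never pick a server themselves.

from itertools import cycle

def make_dispatcher(servers, healthy):
    ring = cycle(servers)
    def dispatch(request):
        for _ in range(len(servers)):
            server = next(ring)
            if healthy[server]:
                return (server, request)   # route the request to this server
        raise RuntimeError("no healthy servers in the cluster")
    return dispatch

healthy = {"web1": True, "web2": False, "web3": True}
dispatch = make_dispatcher(["web1", "web2", "web3"], healthy)
print(dispatch("GET /"))   # ('web1', 'GET /')
print(dispatch("GET /"))   # web2 is down, so ('web3', 'GET /')
```

Production clustering software layers health probes, session state, and load metrics on top of this basic idea, but the fault tolerant effect is the same: a failed server is simply routed around.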

For some applications, such as Web services, this type of clustered environment is used to maintain continuous access while providing multiple simultaneous connections. Some clustering solutions are built for the simple reason of increasing operating system or server stability. In today's computing environment, the purchase of dual or quad processing servers with a clustering strategy allows for a high level of server fault tolerance at a very cost-effective price. Under these circumstances, a user or application can be migrated flawlessly from a failing or failed server to an available "healthy" one.

The progression from local internal server storage, to external RAID storage, to a clustered server solution in a direct-attached or Storage Area Network (SAN) environment encompasses a high degree of fault tolerance and follows a logical path to data protection. Well-planned storage products will follow the same path, allowing today's JBOD solution to become part of tomorrow's RAID system, with future migration to a SAN clustered environment when the need arises. Each step along the path enables greater fault tolerance and therefore increased availability and protection for the most important element of any storage system: your data.

