FAB: Building Distributed Enterprise Disk Arrays from Commodity Components


Yasushi Saito, Svend Frølund, Alistair Veitch, Arif Merchant, Susan Spence
Hewlett-Packard Laboratories


ABSTRACT

This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a collection of bricks, small storage appliances containing commodity disks, CPU, NVRAM, and network interface cards. FAB deploys a new majority-voting-based algorithm to replicate or erasure-code logical blocks across bricks and a reconfiguration algorithm to move data in the background when bricks are added or decommissioned. We argue that voting is practical and necessary for reliable, high-throughput storage systems such as FAB. We have implemented a FAB prototype on a 22-node Linux cluster. This prototype sustains 85MB/second of throughput for a database workload, and 270MB/second for a bulk-read workload. In addition, it can outperform traditional master-slave replication through performance decoupling and can handle brick failures and recoveries smoothly without disturbing client requests.

Categories and Subject Descriptors D.4.5 [Software]: Operating systems—Reliability; C.5.5 [Computer system implementation]: Servers; H.3.4 [Information storage and retrieval]: Systems and software—Distributed systems

General Terms Algorithms, Management, Performance, Reliability

Keywords Storage, disk array, replication, erasure coding, voting, consensus

1. INTRODUCTION

A Federated Array of Bricks (FAB) is a distributed disk array that provides reliable access to logical volumes using only commodity hardware. It solves the two problems associated with traditional monolithic disk arrays: scalability and cost.

Traditional disk arrays drive collections of disks using centralized controllers. They achieve reliability via highly customized, redundant, and hot-swappable hardware components. They do not scale well, because there is a high up-front cost for even a minimally configured array, and a single system can only grow to a limited size. These limitations force manufacturers to develop multiple products for different system scales, which multiplies the engineering effort required. These issues, coupled with relatively low manufacturing volumes, drive up their cost—high-end arrays retail for many millions of dollars, at least 20 times more than the price of consumer-class systems with equivalent capacity.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASPLOS'04, October 7–13, 2004, Boston, Massachusetts, USA. Copyright 2004 ACM 1-58113-804-0/04/0010 …$5.00.

FAB consists of a collection of bricks—small rack-mounted computers built from commodity disks, CPU, and NVRAM—connected by standard networks such as Ethernet. Bricks autonomously distribute data and functionality across the system to present a highly available set of logical volumes to clients through standard disk-access interfaces such as iSCSI [32]. FAB can scale incrementally, starting from just a few bricks and adding more bricks as demand grows, up to several hundred bricks. It is also cheaper than traditional arrays: due to the economies of scale inherent in high-volume production, a brick with 12 SATA disks and 1GB of NVRAM can be built for less than $2000, for a total system cost of about 20% to 80% of traditional arrays, even with three-way replication.

Commodity hardware is, of course, far less reliable than its enterprise counterparts. Using the reliability figures reported in [4, 3], we expect the mean time between failures of a typical network switch to be 4 years, and that of a typical brick to be 4 to 30 years, depending on the quality of disks and the internal disk organization (e.g., RAID-5 is more reliable than RAID-0). FAB inevitably faces frequent changes to the system, including brick failures or additions, and network partitioning.

The FAB project tries to achieve two goals in such environments. First, FAB should provide continuous service, masking failures transparently and ensuring stable performance over diverse workloads. Second, it should ensure high reliability, comparable to that of today's high-end disk arrays: 10,000+ mean years before the first data loss, tolerating the failures of disks, CPUs, or networks.

The key idea behind FAB to achieve these goals is replication and erasure coding by voting. Acting on behalf of a client, a read or write request coordinator communicates with a subset (quorum) of the bricks that store the data. Voting allows FAB to tolerate failed bricks and network partitioning safely without blocking. It also enables performance decoupling [24]—tolerating overloaded bricks by simply ignoring them, as long as others are responsive. This is especially effective in systems like FAB, in which brick response times fluctuate due to the randomness inherent in disk-head mechanisms. Voting-based replication is not new, but it has seen little use in high-throughput systems because of concerns about inefficiency, as reading data must involve multiple remote nodes [36]. In this paper, we show that voting is indeed practical and often necessary for reliable, high-throughput storage systems. Specifically, our contributions are:

New replication and erasure-coding algorithms: We present asynchronous voting-based algorithms that ensure strictly linearizable accesses [17, 2] to replicated or erasure-coded data. They can handle any non-Byzantine failures, including brick failures, network partitioning, and slow bricks. Existing algorithms [5, 27], in contrast, not only lack erasure-coding support, but also could break consistency when a brick that coordinates a request crashes in the middle.

A new dynamic quorum reconfiguration algorithm: FAB can adjust quorum configurations dynamically, while allowing I/O requests from clients to proceed unimpeded. It improves reliability by allowing the system to tolerate more failures than in a system with fixed-quorum voting, and by adding a new brick after another brick is decommissioned.

Efficient implementation and evaluation of FAB: We present several techniques that improve the efficiency of these algorithms and implement them in FAB.

We have implemented a FAB prototype on a 22-node Linux cluster. As we show in Section 7, this prototype sustains 85MB/second of throughput for a database workload, and 270MB/second for a bulk-read workload. In addition, it can outperform traditional master-slave replication through performance decoupling and can handle brick failures and recoveries smoothly without disturbing client requests.

2. RELATED WORK

Today's standard solution for building reliable storage systems is centralized disk arrays employing RAID [7], such as EMC Symmetrix, Hitachi Lightning, HP EVA, and IBM ESS. To ensure reliability, these systems incorporate tightly synchronized hardware-level redundancy at each layer of their functionality, including processing, cache, disk controllers, and RAID control. As reviewed in the previous section, this architecture limits their capacity, throughput, and availability. FAB distributes the functionality of array controllers across bricks while maintaining the consistency semantics of a single disk.

The idea of distributed, composable disk arrays was pioneered by TickerTAIP [6] and Petal [22]. Petal uses a master-slave replication protocol, which cannot tolerate network partitioning. In addition, it has a period (∼30 seconds) of unavailability during fail-over, which can cause clients to take disruptive recovery actions, such as database-log or file-system scanning. In contrast, FAB can mask failures safely and instantaneously using voting, and it supports Reed-Solomon erasure coding in addition to replication. Recently, LeftHand Networks [23] and IBM [19] have proposed FAB-like storage systems, but no details about them have been published.

Network-attached secure disks (NASD) [13] let clients access network-attached disks directly and safely. Both FAB and NASD try to build scalable distributed storage, but with different emphases: FAB focuses on availability and reliability through redundancy, whereas NASD focuses on safety through access-control mechanisms. The two systems complement each other.

The ability of voting algorithms to tolerate failures or slow nodes has led to their recent adoption in storage systems. FarSite [1] is a distributed serverless file system that uses voting-based algorithms to tolerate Byzantine failures. Self-* is also a serverless file system that uses voting-based erasure-coding algorithms [12, 16]. OceanStore [31] is a wide-area file system that uses voting to

[Figure 1 diagram: client file/DB servers connect over commodity networks to FAB bricks; each brick runs the same modules—iSCSI frontend, admin frontend, coordinator, Paxos, status monitor, RPC, volume layouts & seggroups, diskmap, timestamp table, buffer cache, and storage handler.]

Figure 1: The structure of a FAB system. Bricks are connected to each other and to clients by commodity networks. All bricks run the same set of software modules, shown in the right-hand picture. Volume layouts, seggroups, and diskmaps are on-disk data structures, normally cached in memory. The buffer cache and timestamp table are stored in NVRAM.

tolerate Byzantine failures and erasure coding for long-term, space-efficient data storage. Unlike these systems, FAB is designed as a high-throughput local-area storage system. It tolerates only stopping failures, but it ensures consistent data accesses without changing the clients or exploiting file-system semantics. Ling [24] and Huang [18] use voting to build a high-throughput storage system, but they support only replication, with only single-client accesses, and require a special protocol to run on each client.

Consistent reconfiguration has been studied in viewstamped replication [29], which uses two-phase commits to update data and Paxos [20, 21] to transition views. More recently, RAMBO [27] proposed the idea of concurrent active views and background state synchronization. This idea is used in FAB as well, but whereas RAMBO is based on single-register (logical block) emulation, FAB runs more efficient voting algorithms over multiple logical blocks.

3. OVERVIEW

Figure 1 shows the structure of a FAB system. FAB is a symmetrically distributed system—each brick runs the same set of software modules and manages the same types of data structures. FAB clients, usually file or database servers, use iSCSI [32] for reading and writing logical blocks, and a proprietary protocol for administrative tasks, such as creating and deleting logical volumes. At a high level, a read or write request is processed as follows:

1. The client sends an iSCSI request of the form 〈volume-id, offset, length〉 to a coordinator, that is, a brick that acts as a gateway for the request. Because of FAB's symmetric structure, the client can choose any brick as the coordinator to access any logical volume. Different requests, even from the same client, can be coordinated by different bricks. In practice, the client uses either hard-wired knowledge or a protocol such as iSNS [33] (a name service for iSCSI) to pick a coordinator.

2. The coordinator finds the set of bricks that store the requested blocks. These are the storage bricks for the request.

3. The coordinator runs the replication or erasure-coding protocol against the storage bricks, passing the tuple 〈volume-id, offset, length〉 to them.

4. Each storage brick converts the tuple 〈volume-id, offset, length〉 to physical disk offsets and accesses the requested data.
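The request path above can be condensed into a small sketch. The following Python model is purely illustrative—`volume_layout`, the seggroup dictionaries, and `coordinate` are hypothetical stand-ins for FAB's real modules and 256MB segment-granularity metadata, not its actual API:

```python
# Hypothetical sketch of FAB's request path (steps 1-4 above).
# Names and data structures are illustrative, not FAB's real code.

SEGMENT = 256 * 2**20  # 256MB, the unit of data distribution

# Global metadata: per-volume list of seggroups, one per segment.
volume_layout = {
    "vol0": [{"bricks": ["B", "D", "E"]},   # segment 0: 0-256MB
             {"bricks": ["A", "C", "F"]}],  # segment 1: 256-512MB
}

def coordinate(volume_id, offset, length):
    """Step 2: locate the storage bricks for a request."""
    seggroup = volume_layout[volume_id][offset // SEGMENT]
    # Step 3 would run the replication or erasure-coding protocol
    # against these bricks; here we just return them.
    return seggroup["bricks"]

assert coordinate("vol0", 100 * 2**20, 1024) == ["B", "D", "E"]
assert coordinate("vol0", 300 * 2**20, 1024) == ["A", "C", "F"]
```

Because any brick holds a replica of this global metadata, any brick can serve as the coordinator for any volume.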

3.1 Key data structures and software modules

The steps described above are carried out using the following key data structures:

• Volume layout maps a logical offset to a seggroup at segment granularity for each volume. A segment, set to 256MB, is the unit of data distribution.



• Seggroup describes the layout of a segment, including the set of bricks that store the segment. The volume layout and seggroups are used in step 2 to locate the set of storage bricks for a request. A seggroup is also the unit of reconfiguration, as we discuss further in Section 5.

• Diskmap maps a logical offset to the tuple 〈disk-number, disk-offset〉 at page granularity for each logical volume. A page, set to 8MB, is the unit of disk allocation. Diskmap contents are unique to each brick. Diskmaps are used in step 4.

• Timestamp table stores timestamp information for recently modified blocks. The contents of this table are unique to each brick. This data structure is used in steps 3 and 4 to access replicated or erasure-coded blocks in a consistent fashion. We discuss FAB's replication and erasure-coding algorithms and their use of timestamp tables in more detail in Section 4.
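The brick-local diskmap lookup of step 4 can be sketched as follows. This is an assumed model for illustration—the dictionary, the lazy round-robin placement, and the function name `resolve` are ours, not FAB's on-disk format (FAB places pages randomly, and only on first write):

```python
# Illustrative model of a brick-local diskmap: logical offset ->
# <disk-number, disk-offset> at 8MB page granularity.

PAGE = 8 * 2**20  # 8MB, the unit of disk allocation

diskmap = {}  # (volume-id, page index) -> (disk-number, disk-offset)

def resolve(volume_id, offset):
    """Translate a logical offset to <disk-number, disk-offset>."""
    key = (volume_id, offset // PAGE)
    if key not in diskmap:
        # Pages are allocated lazily, on first write (Section 3.2);
        # this round-robin placement over 12 disks stands in for
        # FAB's random placement.
        diskmap[key] = (len(diskmap) % 12, len(diskmap) * PAGE)
    disk, base = diskmap[key]
    return disk, base + offset % PAGE

disk, off = resolve("vol0", 768 * 2**20)  # the block of Figure 2
```

Repeated accesses to the same page must resolve to the same physical location; only the offset within the page changes.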

Figure 2 shows an example of I/O request processing. Volume layouts and seggroups are called the global metadata, because they are replicated on every brick and are read by the request coordinator. Following the approach pioneered by Petal [22], we use Paxos [20, 21], an atomic broadcast protocol, to maintain the consistency of the global metadata across bricks. Paxos allows bricks to receive exactly the same sequence of metadata updates, even when updates are issued concurrently and bricks fail and recover. Thus, by letting bricks initially boot from the same (empty) global metadata and use Paxos for updates, they can keep their metadata consistent. As discussed further in Section 5.2, FAB is designed to withstand stale global metadata, so long as bricks eventually receive metadata updates. As such, reading global metadata is done directly against the local copy.

These data structures are managed by software modules that are roughly divided into three groups. The frontend receives requests from clients (step 1). The core contains modules needed to locate logical blocks and maintain data consistency (steps 2 and 3). In particular, the coordinator module is responsible for communicating with the backend modules of remote bricks to access blocks consistently. The status monitor keeps track of the disk usage and load of other bricks. It is used to assign less-utilized segment groups to volumes while creating volumes (Section 3.2), and to pick a brick in the quorum that reads data from disk (Section 4.4). It currently deploys two mechanisms. First, the status information is piggybacked on every message exchanged between bricks; this gives a timely view of the status of a small set of bricks. Second, we use a variation of the gossip-based failure detector [34] to advertise the status to a random brick every three seconds; this gives an older, but more comprehensive, view of the system. Finally, the backend modules are responsible for managing and accessing NVRAM and physical disks (step 4).

3.2 Data layout and load balancing

All the segments assigned to a seggroup must use the same redundancy policy: replication of the same degree, or erasure coding with the same layout. FAB's policy is to create, for each redundancy policy, an average of four seggroups that contain a specific brick. Logical volume segments are assigned to seggroups semi-randomly when the volume is created, favoring seggroups containing bricks with less-utilized disks (the status monitor is consulted for this purpose). The assignment of physical disk blocks to pages (i.e., the diskmap) is done randomly by each brick when the page is written for the first time.

The choice of the number of seggroups per brick reveals a tension between load balancing and reliability. After a brick b fails, the "read" requests normally handled by b are now served by the other bricks in the seggroups that b belongs to. Thus, the more seggroups


[Figure 2 diagram: the volume table maps a volume ID to its Paxos-replicated volume layout and seggroups at 256MB segment granularity (offsets 0, 256MB, 512MB, 768MB, 1024MB, 1280MB); each brick's locally managed diskmap maps 8MB pages to 〈diskID, offset〉 on its disks.]

Figure 2: Example of locating a logical 1KB block at offset 768MB of a volume. The client sends a request of the form 〈volume-id, 768MB, 1KB〉 to a random coordinator. In the top half of the diagram, the coordinator locates the volume layout from the local copy of the global metadata and finds the seggroup for the offset 768MB. The seggroup shows that the data is stored on bricks B, D, and E. The coordinator then executes the replication or erasure-coding protocol against bricks B, D, and E. In the bottom half of the diagram, each of the bricks B, D, and E consults its local diskmap to convert the offset 768MB to disk addresses.





[Figure 3 plot: x-axis, segment groups per brick (0–20); y-axis, MTTDL in years; curves for 3-replication, 2,4 erasure coding, and 5,7 erasure coding.]

Figure 3: Mean time to data loss (MTTDL) of FAB in systems with 256TB logical capacity.

per brick, the more evenly the extra load is spread. Creating too many seggroups, however, reduces the system's reliability, since this increases the number of combinations of brick failures that can lead to data loss. Figure 3 shows how the reliability changes with the number of seggroups per brick. This analysis is based on a Markov model assuming bricks with twelve 256GB SATA disks. Failures are assumed to be independent. We assume a disk mean time to failure (MTTF) of 57 years, based on manufacturers' specifications, and a brick (enclosure) MTTF of 30 years, based on data from [4]. The time to repair a failure depends on the failure type and is based on the time required to copy the data to spare space—we assume that spare space is always available. Based on this, we pick an average of four seggroups per brick, because this meets our goal of a 10,000-year MTTDL while still allowing the load to be spread evenly.

The choice of segment and page sizes involves several trade-offs. A larger segment size reduces the global-metadata management overhead, but at the cost of less storage-allocation freedom, because bricks in a seggroup must store all its segments. The page is chosen to be smaller than the segment to reduce the storage waste for erasure-coded volumes (Section 4.2), or for logical volumes whose size is not segment-aligned. Too small a page size, however, could also hurt performance by increasing disk-head movement. We find that the current setting of 256MB segments and 8MB pages offers a good balance for the next few years—even with bricks with 10TB raw capacity and one thousand 1TB logical volumes in the system, the size of the global metadata and diskmaps would be only 5MB and 10MB, respectively.
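The back-of-the-envelope arithmetic behind these figures can be checked directly. The per-entry sizes below are our assumptions—the paper does not state them—but they are consistent with the quoted totals:

```python
# Sanity-checking the metadata-size claims. Entry sizes are assumed.
TB, MB = 2**40, 2**20

# Diskmap: one <disk-number, disk-offset> entry per 8MB page of a
# brick's 10TB raw capacity; 8-byte entries are our assumption.
pages_per_brick = 10 * TB // (8 * MB)
assert pages_per_brick == 1_310_720
assert pages_per_brick * 8 == 10 * MB        # matches the 10MB figure

# Global metadata: one seggroup reference per 256MB segment across
# one thousand 1TB logical volumes.
segments = 1000 * TB // (256 * MB)
assert segments == 4_096_000
# A 5MB total implies a compact encoding of ~1.3 bytes per segment:
assert round(5 * MB / segments, 2) == 1.28
```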

4. REPLICATION AND ERASURE CODING

FAB provides two redundancy mechanisms, replication and erasure coding. Both are based on the idea of voting: each request makes progress after receiving replies from a (random) quorum of storage bricks. Our protocols require no persistent state on the request coordinator. This feature allows any brick to act as a coordinator and helps FAB become truly decentralized without changing clients.

Section 4.1 describes our basic replication protocol for a single logical block, and Section 4.2 describes how it can be extended for erasure coding. Multi-block requests are logically handled by running multiple instances of these algorithms in parallel, but in practice, we batch and run them as efficiently as single-block requests. We discuss this and other implementation-related issues in later sections.

4.1 Replication

The task of a request coordinator is straightforward in theory: when writing, it generates a new unique timestamp and writes the new block value and timestamp to a majority of the storage bricks; when reading, it reads from a majority and returns the value with the newest timestamp. The challenge lies in handling the failure of participants in the middle of a "write" request: the new value may end up on only a minority of bricks. A storage system must ensure strict linearizability [2, 17]—it must present a single global ordering of (either successful or failed) I/O requests, even when they are coordinated by different bricks. Put another way, after a "write" coordinator fails, future "read" requests to the same block must all return the old block value or all return the new value, until the block is overwritten by a newer "write" request. Prior approaches fall short: Gifford's use of two-phase commits [14] cannot ensure a quick fail-over, and Ling et al.'s use of end-to-end consistency checking [24] conflicts with our goal of leaving the client interface (iSCSI) unchanged.

FAB takes an alternative approach, performing recovery lazily when a client tries to read the block after an incomplete write. Figure 4 shows the pseudocode of FAB's algorithm. Each replicated block keeps two persistent timestamps: valTs is the timestamp of the block value currently stored, and ordTs is the timestamp of the newest ongoing "write" request. An incomplete "write" request is indicated by ordTs > valTs on some brick. A "write" runs in two phases. First, in the Order phase, the replicas update their ordTs to indicate a new ongoing update and ensure that no request with an older timestamp is accepted. In the second, Write, phase, the replicas update the actual disk block and valTs. A "read" request usually runs in one phase, but takes two additional phases when it detects an incomplete past "write"—the coordinator first discovers the value with the newest timestamp from a majority, and then writes that value back to a majority with a timestamp greater than that of any previous write. In this protocol, a "write" request still tries to write to all the bricks in the seggroup; the coordinator just does not wait for all the replies. Thus, a read-recovery phase usually happens only when there is an actual failure. Figure 5 shows an example of I/Os using this algorithm.

One unusual feature of our protocol is that a request may abort when it encounters a concurrent request with a newer timestamp. In this case it is up to the client or the coordinator to retry. In practice, aborts are rare, given that protocols such as NTP can synchronize

// I/O coordinator code.
proc write(val)
    ts ← NewTimestamp()
    send [Order, {}, ts] to bricks in the seggroup
    if a majority reply "yes"
        send [Write, val, ts] to bricks in the seggroup
        if a majority reply "yes"
            return OK
    return ABORTED

proc read()
    send [Read] to bricks in the seggroup
    if a majority reply "yes" and all timestamps are equal
        return the val in a reply
    ts ← NewTimestamp()                  // Slow "recover" path starts
    send [Order, "all", ts] to bricks in the seggroup
    if a majority reply "yes"
        val ← the value with the highest valTs from the replies
        send [Write, val, ts] to bricks in the seggroup
        if a majority reply "yes"
            return val
    return ABORTED

// Storage handler code. Variable val stores the block contents.
when Receive [Read]
    status ← (valTs ≥ ordTs)
    reply [status, valTs, val]

when Receive [Order, targets, ts]
    status ← (ts > max(valTs, ordTs))
    if status
        ordTs ← ts
    if targets = "all" or this block ∈ targets
        reply [valTs, val, status]
    else
        reply [valTs, status]

when Receive [Write, newVal, ts]
    status ← (ts > valTs and ts ≥ ordTs)
    if status
        val ← newVal; valTs ← ts
    reply [status]

Figure 4: FAB's replication algorithm for a single logical block. The function NewTimestamp generates a locally monotonically increasing timestamp by combining the real-time clock value and the brick ID (used as a tie-breaker).

clocks with sub-millisecond precision [28, 10]. Being able to abort requests, however, offers two benefits. First, it allows for an efficient protocol—a "read" request can complete in a single round, as opposed to two in previous algorithms [5, 27], skipping the round to discover the latest timestamp. Second, abortion enables strict linearizability—that is, only by sometimes aborting requests can an algorithm properly linearize requests whose coordinators could crash in the middle. A theoretical treatment of this issue appears in separate papers [11, 2].
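The timestamped majority voting of Figure 4 can be condensed into a toy in-memory model. This is our own sketch under strong simplifications—no network, no failures, a single block, and the slow recovery path omitted—so it illustrates only the Order/Write phases and the fast read path, not FAB's real implementation:

```python
# Toy model of Figure 4's voting over three in-memory replicas.
# Simplification (ours): no RPC, no failures, one logical block.

replicas = [{"val": None, "valTs": 0, "ordTs": 0} for _ in range(3)]
clock = [0]

def new_ts():
    """Stand-in for NewTimestamp (real FAB mixes clock and brick ID)."""
    clock[0] += 1
    return clock[0]

def write(val):
    ts = new_ts()
    # Order phase: reserve the timestamp on a majority.
    ok = [r for r in replicas if ts > max(r["valTs"], r["ordTs"])]
    if len(ok) <= len(replicas) // 2:
        return "ABORTED"
    for r in ok:
        r["ordTs"] = ts
    # Write phase: install the value and valTs.
    for r in ok:
        if ts > r["valTs"] and ts >= r["ordTs"]:
            r["val"], r["valTs"] = val, ts
    return "OK"

def read():
    # Fast path: a majority agrees on the timestamp and none has a
    # pending newer Order (valTs >= ordTs).
    maj = replicas[: len(replicas) // 2 + 1]
    if all(r["valTs"] >= r["ordTs"] for r in maj) and \
       len({r["valTs"] for r in maj}) == 1:
        return maj[0]["val"]
    return None  # would enter Figure 4's slow recovery path

assert write("x") == "OK"
assert read() == "x"
```

A stale write (one whose timestamp loses to a concurrent newer Order) fails the `ts > max(valTs, ordTs)` check on a majority and aborts, which is exactly the abort behavior discussed above.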

4.2 Erasure coding

FAB also supports generic m,n Reed-Solomon erasure coding. Reed-Solomon codes have two characteristics. First, they generate n−m parity blocks out of m data blocks, and can reconstruct the original data blocks from any m out of the n blocks. Second, they provide a simple function, which we call Delta, that enables incremental updates of parity blocks [30]. Using this function, when writing to a logical block X, the new value of any parity block can be computed by xor(old-parity, Delta(old-x, new-x)), where old-parity is the old parity block value, and old-x and new-x are the old and new values of block X.
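For the single-parity case the incremental update rule is easy to verify, because Delta reduces to a plain XOR. The sketch below is ours, for illustration only—FAB's Delta for general m,n Reed-Solomon codes works over a Galois field, not plain integer XOR—and uses small integers in place of blocks:

```python
# Incremental parity update for a 2,3 (single-parity) code, where
# Delta(old, new) is simply old XOR new. Ints stand in for blocks.

def delta(old_x, new_x):
    return old_x ^ new_x

d1, d2 = 0b1010, 0b0110
parity = d1 ^ d2                      # initial parity block

new_d1 = 0b1111
parity ^= delta(d1, new_d1)           # update parity without reading d2
d1 = new_d1

assert parity == d1 ^ d2              # parity is consistent again
assert d1 == parity ^ d2              # and can reconstruct a lost d1
```

The point of the Delta form is the middle step: the parity brick is updated from the old and new values of the written block alone, without reading the other data blocks.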

Figure 7 shows our data-access algorithm for erasure-coded volumes. Supporting erasure-coded data requires three key changes to the basic replication protocol: segment layout, quorum size, and update logging.

We currently use the entire segment as the erasure-code chunk, as shown in Figure 6, unlike typical RAID systems that use smaller chunk sizes such as 64KB. We chose this layout because it lets a large logical sequential request be translated into a large sequential



[Figure 5 timeline: coordinators C1 and C2 issue Write, Read, Order, and Write messages to replicas X, Y, Z over numbered steps; the left half shows a failure-free execution, the right half recovery from a coordinator failure.]

Figure 5: A logical block is replicated on bricks X, Y, and Z. In steps (1) and (2), coordinator C1 writes to the block in two rounds. Coordinator C2 reads from {Y, Z}, discovers that the timestamps are consistent, and finishes (in practice, C2 reads the block value from only one replica; Section 4.4). Steps (4) to (8) show why a write needs two rounds. C1 tries to write, but crashes after sending Write to only Y. Later, while trying to read, C2 discovers the partial write by observing valTs < ordTs on Z. C2 discovers the newest value in step (7) and writes it back to a majority (in fact, all) in step (8), so that future requests will read the same value. In a different scenario, C2 could contact only {X, Z} in step (6), and C2 would find and write back the old value. This causes no problem—when a write fails, the client cannot assume its outcome.

[Figure 6 diagram: a logical volume's blocks D1, D2 are split across data brick 1 (chunk d1), data brick 2 (chunk d2), and a parity brick (p); a horizontal block-height slice across the bricks at index j−1 forms strip j−1.]

Figure 6: An example of a 2,3 erasure-coded segment. An m,n erasure-coding scheme splits one segment into m equal-size chunks (D1, D2), and adds n−m parity chunks. A horizontal, block-size-height slice is called a "strip". Bricks in the seggroup maintain the set of timestamps and the update log for each strip. In this example, with a 1KB logical block, the 3rd strip of the segment will occupy regions {(2KB, 3KB), (131074KB, 131075KB)} of the segment.

disk I/O at each brick. The downside is that it may abort writes spuriously, when two blocks that happen to be in the same strip are updated concurrently. With a database transaction workload (Section 7.3), however, the conflict rate is measured to be < 0.001%, and we consider that the benefits outweigh the downsides.

As in replication, each request contacts a subset of the bricks that store the segment. However, with m,n erasure coding, a coordinator must collect replies from m + ⌈(n−m)/2⌉ bricks—that is, the intersection of any two quorums must contain at least m bricks—to be able to reconstruct the strip value during a future "read". We call this quorum system an m-quorum. For instance, the m-quorum size is 3 for a 2,4 erasure code, and 8 for a 6,10 erasure code.
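The quorum size m + ⌈(n−m)/2⌉ and its intersection property are easy to check numerically (a sketch of our own, not FAB code):

```python
# m-quorum size for m,n erasure coding, and a check that any two
# quorums of that size overlap in at least m bricks.
import math

def m_quorum(m, n):
    return m + math.ceil((n - m) / 2)

assert m_quorum(2, 4) == 3    # the 2,4 example in the text
assert m_quorum(6, 10) == 8   # the 6,10 example in the text

# Two quorums Q1, Q2 over n bricks overlap in >= |Q1| + |Q2| - n
# bricks, which must be at least m to reconstruct the strip.
for m, n in [(2, 4), (6, 10), (5, 7)]:
    q = m_quorum(m, n)
    assert 2 * q - n >= m
```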

The final change involves the need for strip recovery. Suppose that a "write" coordinator crashes after writing the new value to fewer than m bricks in the second round. The subsequent "read" request must recover the old value, which might become impossible if the "write" request simply overwrote the blocks and if n < 2m (which is a rather common setting). We solve this situation by update logging—a storage brick merely logs the new value in the second round of the "write". A "read" request, when recovering the old value, scans the log on an m-quorum of bricks and finds the newest strip value that can be fully reconstructed. The "write" coordinator, after it replies to the client, instructs the bricks to overwrite the old block value, and thus compress their log, in an asynchronous Commit phase. In practice, the log is implemented in each brick's NVRAM cache, and the third round—replacing the block value with the log entry—is performed simply by modifying the cache

// I/O coordinator code. "idx" is the block number within the strip.
proc write(val, idx)
    ts ← NewTimestamp()
    send [Order, {idx}, ts] to bricks in the seggroup
    if an m-quorum reply "yes" and the idx'th brick replies with oldval
        delta ← Delta(oldval, val, idx)
        send [Write-EC, val, ts] to the idx'th brick
        send [Write-EC, NULL, ts] to the other data bricks
        send [Write-EC, delta, ts] to the parity bricks
        if an m-quorum reply "yes"
            send [Commit, ts] to bricks in the seggroup
            return OK
