Flat Datacenter Storage

Reference: Nightingale, Edmund B., et al. "Flat Datacenter Storage." OSDI. 2012.

0. Abstract

Flat Datacenter Storage (FDS) is a high-performance, fault-tolerant, large-scale, locality-oblivious blob store. Using a novel combination of full bisection bandwidth networks, data and metadata striping, and flow control, FDS multiplexes an application’s large-scale I/O across the available throughput and latency budget of every disk in a cluster. FDS therefore makes many optimizations around data locality unnecessary. Disks also communicate with each other at their full bandwidth, making recovery from disk failures extremely fast. FDS is designed for datacenter scale, fully distributing metadata operations that might otherwise become a bottleneck. FDS applications achieve single-process read and write performance of more than 2GB/s. We measure recovery of 92GB data lost to disk failure in 6.2 s and recovery from a total machine failure with 655GB of data in 33.7 s. Application performance is also high: we describe our FDS-based sort application which set the 2012 world record for disk-to-disk sorting.

1. Introduction

  • Traditional datacenter network topologies are oversubscribed and therefore impose locality constraints, which are a poor fit for computations that must move data and can even hinder efficient resource utilization. When bandwidth was scarce, these sacrifices were necessary to achieve the best performance.
  • However, recently developed Clos networks (large numbers of small commodity switches with redundant interconnections) have made it economical, for the first time, to build non-oversubscribed full bisection bandwidth networks at datacenter scale.
  • Flat Datacenter Storage (FDS) is a datacenter storage system designed from first principles (a shared, centralized model of storage) under the formerly unrealistic assumption that datacenter bandwidth is abundant, made possible by full bisection bandwidth networks.
  • Flat storage model: all compute nodes can access all storage with equal throughput.
  • High performance: an FDS-based sort application set the 2012 world record for disk-to-disk sorting.

2. Design Overview

2.1 Blobs and Tracts

Blobs and Tracts

  • A blob is a byte sequence named with a 128-bit GUID. Blobs can be any length up to the system’s storage capacity.
  • Reads from and writes to a blob are done in units called tracts. Each tract within a blob is numbered sequentially starting from 0. Tracts are sized so that random and sequential access achieve nearly the same throughput. In their cluster, tracts are 8MB (configurable if the storage media changes).
  • Blobs and tracts are mutable.

TractServer

  • Every disk is managed by a process called a tractserver that services read and write requests arriving over the network from clients. Tractservers do not use a file system; instead, they lay tracts out directly on disk using the raw disk interface. Since there are only about 125,000 tracts per disk (for a 1TB disk with 8MB tracts), all tracts’ metadata is cached in memory, eliminating many disk accesses.

API

  • All calls in FDS are non-blocking; the library invokes the application’s callback when the operation completes.
  • Tract reads are not guaranteed to arrive in order of issue. Writes are not guaranteed to be committed in order of issue.
  • FDS guarantees write atomicity: a write is either committed completely or fails completely. (The call pattern is sketched below.)
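
To make the call pattern concrete, here is a minimal sketch of a non-blocking, callback-based tract read. The class and method names are hypothetical stand-ins, not the real FDS client library; only the call pattern (non-blocking issue, callback on completion, unordered completions) mirrors the notes above.

```python
from concurrent.futures import ThreadPoolExecutor

class ToyTractClient:
    """Toy stand-in for a non-blocking tract client; names are hypothetical."""

    def __init__(self, workers=8):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def read_tract(self, blob_guid, tract_number, callback):
        # Non-blocking: returns immediately; the callback fires when the read completes.
        def work():
            data = b"\x00" * (8 * 2**20)          # pretend this is the 8MB tract payload
            callback(blob_guid, tract_number, data)
        self._pool.submit(work)

    def close(self):
        self._pool.shutdown(wait=True)            # wait for outstanding callbacks (toy only)

def on_read_done(blob_guid, tract_number, data):
    # Completions may arrive in any order, as noted above.
    print(f"tract {tract_number} of blob {blob_guid:#x}: {len(data)} bytes")

client = ToyTractClient()
for i in range(4):
    client.read_tract(blob_guid=0x1234, tract_number=i, callback=on_read_done)
client.close()
```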

2.2 Deterministic Data Placement

Metadata Server

  • Many systems solve data placement with a metadata server that stores the location of data blocks, giving maximum flexibility of data placement and visibility into the system’s state. However, this has drawbacks: the metadata server is a central point of failure, is usually implemented as a replicated state machine, and sits on the critical path for all reads and writes.
  • FDS also uses a metadata server, but with a simpler role: it collects a list of the system’s active tractservers and distributes it to clients as the tract locator table (TLT).

Deterministic Distribution

  • To locate tract i of blob g, a client computes Tract_Locator = (Hash(g) + i) mod TLT_Length and looks up that row of the TLT. FDS uses SHA-1 for this hash; adding the tract number outside the hash ensures that large blobs use all entries in the TLT uniformly. (A lookup sketch follows this list.)
  • Once clients find the proper tractserver address in the TLT, they send read and write requests containing the blob GUID, tract number, and (in the case of writes) the data payload.
  • In a single-replicated system, the TLT is constructed by concatenating m random permutations of the tractserver list. Setting m > 1 ensures that after being delayed in a slow queue, clients will fan out to m other tractservers for their next operation.
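
A minimal sketch of the lookup rule above. The TLT contents and addresses are made up; only the (SHA-1(GUID) + tract number) mod TLT-length indexing comes from the notes.

```python
import hashlib

def tract_locator(blob_guid: bytes, tract_number: int, tlt_len: int) -> int:
    """TLT row index = (SHA-1(blob GUID) + tract number) mod TLT length."""
    digest = hashlib.sha1(blob_guid).digest()
    return (int.from_bytes(digest, "big") + tract_number) % tlt_len

# Hypothetical single-replicated TLT: each row names one tractserver address.
tlt = [f"10.0.0.{i}:5000" for i in range(1, 9)]

guid = bytes.fromhex("00112233445566778899aabbccddeeff")  # 128-bit blob GUID
for tract in range(5):
    row = tract_locator(guid, tract, len(tlt))
    print(f"tract {tract} -> TLT row {row} -> {tlt[row]}")
```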

Advantages

  • The metadata server is in the critical path only when a client process starts. This is the key factor that makes it practical to keep tract sizes arbitrarily small.
  • The TLT can be cached long-term since it changes only on cluster configuration, not each read and write, eliminating all traffic to the metadata server in a running system under normal conditions.
  • The metadata server stores metadata only about the hardware configuration, not about blobs. Since traffic to it is low, its implementation is simple and lightweight.
  • Since the TLT contains random permutations of the list of tractservers, sequential reads and writes by independent clients are highly likely to utilize all tractservers uniformly and are unlikely to organize into synchronized convoys.

2.3 Per-Blob Metadata

  • FDS stores each blob’s metadata in a special metadata tract belonging to that blob. Clients find a blob’s metadata on a tractserver using the same TLT used to find regular data.
  • Applications must extend a blob before writing past the end of it. The extend operation is atomic and is safe to execute concurrently with other clients. (A sketch of the concurrent-append pattern follows.)
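
A minimal sketch of why an atomic extend helps concurrent appenders, assuming (as one reading of the paper) that extend reports where the newly allocated region begins so each writer gets a disjoint range. The extend_blob call and in-memory metadata here are hypothetical, not the FDS API.

```python
import threading

class ToyBlobMetadata:
    """Toy in-memory stand-in for a blob's metadata tract."""

    def __init__(self):
        self._lock = threading.Lock()
        self._num_tracts = 0

    def extend_blob(self, additional_tracts: int) -> int:
        """Atomically grow the blob; return the first tract index of the new region."""
        with self._lock:
            first_new = self._num_tracts
            self._num_tracts += additional_tracts
            return first_new

meta = ToyBlobMetadata()

def appender(name, tracts_to_write):
    start = meta.extend_blob(tracts_to_write)   # reserve a disjoint range before writing
    print(f"{name} writes tracts {start}..{start + tracts_to_write - 1}")

threads = [threading.Thread(target=appender, args=(f"writer-{i}", 4)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```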

2.4 Dynamic Work Allocation

  • In FDS, since storage and compute are no longer co-located, the assignment of work to workers can be done dynamically, at fine granularity, during task execution.
  • The best practice for FDS applications is to centrally hand out small units of work to each worker as it nears completion of its previous unit (sketched below). This ensures that the maximum dispersion in completion times across the cluster is only the time the slowest worker needs to complete a single unit.
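
A minimal sketch of the central fine-grained dispatch pattern above, using a shared queue as the coordinator; all names are illustrative.

```python
import queue
import random
import threading
import time

# Workers pull one small unit as they finish the previous one, so a slow worker
# delays the overall job by at most a single unit. Purely illustrative.
work = queue.Queue()
for unit in range(32):
    work.put(unit)

def worker(worker_id):
    while True:
        try:
            unit = work.get_nowait()                 # ask the coordinator for the next unit
        except queue.Empty:
            return
        time.sleep(random.uniform(0.01, 0.05))       # simulate uneven per-unit cost
        print(f"worker {worker_id} finished unit {unit}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```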

3. Replication and Failure Recovery

3.1 Replication

  • Each entry of the TLT in an n-way replicated cluster contains n tractservers. When an application writes a tract, the client library finds the appropriate row of the TLT and sends the write to every tractserver it contains. Applications are notified that their writes have completed only after the client library receives write acknowledgments from all replicas. Reads select a single tractserver at random.
  • Operations that modify blob metadata (such as create, delete, and extend) are sent only to the tractserver acting as the primary replica, which executes a two-phase commit with the other replicas. (The replicated read/write path is sketched after this list.)
  • FDS also supports per-blob variable replication.
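
A minimal sketch of the replicated tract I/O path described above, assuming a toy send transport and in-memory "tractservers": writes fan out to every server in the TLT row and complete only when all acknowledge, while reads go to a single randomly chosen replica.

```python
import random

def write_tract(tlt_row, blob_guid, tract_number, data, send):
    """Send the write to every replica in the row; succeed only if all ack."""
    acks = [send(server, ("WRITE", blob_guid, tract_number, data)) for server in tlt_row]
    return all(acks)

def read_tract(tlt_row, blob_guid, tract_number, send):
    """Read from a single randomly chosen replica."""
    server = random.choice(tlt_row)
    return send(server, ("READ", blob_guid, tract_number, None))

# Toy transport: each "server" is just a key in a shared dict.
fake_store = {}
def fake_send(server, request):
    op, guid, tract, data = request
    if op == "WRITE":
        fake_store[(server, guid, tract)] = data
        return True
    return fake_store.get((server, guid, tract))

row = ["ts-a", "ts-b", "ts-c"]                      # 3-way replicated TLT entry
assert write_tract(row, 0x1234, 0, b"payload", fake_send)
print(read_tract(row, 0x1234, 0, fake_send))
```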

3.2 Failure Recovery

  • Each row of the TLT lists several tractservers and also carries a version number assigned by the metadata server.
  • Tractservers send heartbeat messages to the metadata server. When the metadata server detects a tractserver timeout, it declares the tractserver dead and then: (1) invalidates the current TLT by incrementing the version number of each row in which the failed tractserver appears; (2) picks random tractservers to fill the empty spaces in the TLT where the dead tractserver appeared; (3) sends updated TLT assignments to every server affected by the changes; (4) waits for each tractserver to acknowledge the new TLT assignments, and then begins giving out the new TLT to clients when queried for it. (Steps (1) and (2) are sketched after this list.)
  • If a client attempts an operation using a stale TLT entry, the tractserver detects the version mismatch and rejects the operation, prompting the client to retrieve an updated TLT from the metadata server.
  • After a tractserver failure, the TLT converges immediately, giving applications the current location at which to read or write data. This differentiates FDS from other hash-based approaches, where requests may be routed through the network multiple times before an up-to-date location for the data is found.
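
A minimal sketch of steps (1) and (2) above over a toy in-memory TLT: every row containing the failed tractserver has its version bumped and its vacated slot filled with a random live tractserver. Pushing new assignments and waiting for acks (steps (3) and (4)) are not modelled.

```python
import random

def recover_tlt(tlt, failed, live_servers):
    for row in tlt:
        if failed in row["servers"]:
            row["version"] += 1                                    # step 1: invalidate the row
            candidates = [s for s in live_servers if s not in row["servers"]]
            replacement = random.choice(candidates)                # step 2: random fill-in
            row["servers"] = [replacement if s == failed else s for s in row["servers"]]
    return tlt

tlt = [
    {"version": 7, "servers": ["ts-a", "ts-b"]},
    {"version": 3, "servers": ["ts-b", "ts-c"]},
    {"version": 5, "servers": ["ts-a", "ts-c"]},
]
print(recover_tlt(tlt, failed="ts-b", live_servers=["ts-a", "ts-c", "ts-d", "ts-e"]))
```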

3.3 Replicated Data Layout

  • For any replication level k > 2, FDS starts with the “all-pairs” TLT, then expands each entry with k − 2 additional random disks (subject to failure domain constraints); a construction sketch follows this list.
  • Advantages: (1) performance during recovery still involves every disk in the cluster, since every pair of disks is still represented in the TLT; (2) for 3-way replication, a triple disk failure within the recovery window now has only about a 2/n chance of causing permanent data loss; (3) adding more replicas further decreases the probability of data loss.
  • One possible disadvantage of a TLT with O(n²) entries is its size. To mitigate this: (1) a tractserver can manage multiple disks, which reduces n by a factor of 5–10; (2) the number of disks that participate in failure recovery can be limited; to build an n-disk cluster where m disks are involved in recovery, the TLT only needs O(nm) entries.
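
A minimal sketch of the layout rule above: one TLT row per pair of disks (hence O(n²) rows), each padded with k − 2 additional random disks. Failure-domain constraints are omitted for brevity, and the disk names are made up.

```python
import itertools
import random

def build_tlt(disks, k):
    """Build a k-way replicated TLT: every pair of disks, plus k-2 random extras per row."""
    assert k >= 2
    tlt = []
    for a, b in itertools.combinations(disks, 2):    # every pair of disks appears in some row
        extras = random.sample([d for d in disks if d not in (a, b)], k - 2)
        tlt.append([a, b] + extras)
    return tlt

disks = [f"disk-{i}" for i in range(6)]
tlt = build_tlt(disks, k=3)
print(f"{len(tlt)} rows (O(n^2) for n = {len(disks)})")
print(tlt[:3])
```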

Failure Domains

  • A failure domain is a set of machines that have a high probability of experiencing a correlated failure.
  • FDS guarantees that none of the disks in a single row of the TLT share the same failure domain. This policy is also followed during failure recovery: when a disk is replaced, the new disk must be in a different failure domain than the other tractservers in that particular row.

3.4 Cluster Growth

  1. The metadata server gives the new tractserver TLT assignments, marked as “pending”, and increments the TLT version for each affected entry. The new tractserver then begins copying data from the other replicas. During this phase, clients write data to both the existing servers and the new server, so the new tractserver is kept up-to-date.
  2. Once the new tractserver has finished copying the old data, the metadata server “commits” the TLT entry by incrementing its version and changing its assignment to the new tractserver. It also notifies the now-replaced tractserver that it can safely garbage-collect all tracts associated with that TLT entry.

3.5 Consistency Guarantees

  • The current protocol for replication depends upon the client to issue all writes to all replicas. This decision means that FDS provides weak consistency guarantees to clients.
  • FDS could be modified to use chain replication to provide strong consistency guarantees for all updates to individual tracts.
  • Tractservers may also be inconsistent during failure recovery. While in this state, tractservers reject read requests; clients use other replicas instead.

4. Networking

  • FDS creates an uncongested path from disks to CPUs by: (1) giving each storage node network bandwidth equal to its disk bandwidth, preventing bottlenecks between disk and network; (2) using a full bisection bandwidth network, preventing bottlenecks in the network core; (3) giving compute nodes as much network bandwidth as the I/O bandwidth they need, preventing bottlenecks at the client.
  • FDS’ data interfaces pass the zero-copy model all the way to the application. They also use buffer pools to avoid the large page fault penalty associated with frequent allocation of large buffers.
  • A disadvantage of short flows is that TCP’s bandwidth allocation algorithms perform poorly. Under the high degree of fan-in seen during reads, heavy packet loss can occur as switch queues fill during bursts; this is sometimes called incast.
  • FDS mitigates incast with a request-to-send/clear-to-send (RTS/CTS) flow scheduling system (sketched below). Large sends are queued at the sender, and the receiver is notified with an RTS. The receiver limits the number of outstanding CTSs, thus limiting the number of senders competing for its receive bandwidth and colliding at its switch port.
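
A minimal sketch of the receiver-driven idea above, using a semaphore to bound the number of outstanding CTSs; the RTS/CTS exchange is only mimicked in-process, not implemented over a real network, and all names are illustrative.

```python
import threading
import time

class ToyReceiver:
    """Grants a bounded number of CTSs so only a few large sends converge on its link."""

    def __init__(self, max_outstanding_cts=2):
        self._cts = threading.Semaphore(max_outstanding_cts)

    def request_to_send(self, sender_id):
        self._cts.acquire()                    # RTS queued until the receiver grants a CTS
        print(f"CTS granted to sender {sender_id}")

    def transfer_done(self, sender_id):
        print(f"sender {sender_id} finished")
        self._cts.release()                    # free a CTS slot for the next queued sender

def sender(receiver, sender_id):
    receiver.request_to_send(sender_id)        # large send waits at the sender until cleared
    time.sleep(0.05)                           # simulate the large transfer
    receiver.transfer_done(sender_id)

rx = ToyReceiver(max_outstanding_cts=2)
threads = [threading.Thread(target=sender, args=(rx, i)) for i in range(6)]
for t in threads: t.start()
for t in threads: t.join()
```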

5. Microbenchmarks

5.1 Raw Disk Performance

  • Sequential performance peaks at about 131MB/s
  • 8MB random reads reach ≈ 117MB/s

5.2 Remote Reading and Writing

  • Throughput increases linearly at the rate of about 1,150MB/s/client for writing and 950MB/s/client for reading, roughly 90% and 74%, respectively, of the 10Gbps interface.

5.3 Failure Recovery

  • 1TB disk in a 3,000 disk cluster could be recovered in ≈ 17s

6. Applications

6.1 Sorting

  • Set the MinuteSort world record in April 2012.

6.2 Cointegration

6.3 Serving an Index of the Web

7. Conclusion

  • FDS is the first high-performance blob storage system designed for datacenter scale with a flat storage model.
Written on March 15, 2017