The Google File System

Reference: Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review 37.5 (2003).

0. Abstract

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

1. Introduction

  • Component failures are the norm rather than the exception
  • Files are huge by traditional standards
  • Most files are mutated by appending new data rather than overwriting existing data
  • Co-designing the applications and the file system API benefits the overall system by increasing our flexibility

2. Design Overview

2.1 Assumptions

  • The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
  • The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
  • The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.
  • The workloads also have many large, sequential writes that append data to files. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
  • The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file.
  • High sustained bandwidth is more important than low latency.

2.2 Interface

  • The usual operations: create, delete, open, close, read, and write, plus two GFS-specific operations: snapshot and record append

2.3 Architecture

  • Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation.
  • For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

  • The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
  • It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.
  • The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

  • Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.
  • Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to cache; chunkservers need not cache file data because chunks are stored as local files, so Linux’s buffer cache already keeps frequently accessed data in memory.
  • Clients cache metadata.

2.4 Single Master

  • The client can ask for several chunks in a single request; the master replies with the chunk handle and replica locations for each. The client caches this information and then reads the actual data directly from one of the replicas, usually the closest one (see the sketch below).
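
As a concrete illustration of this read path, here is a minimal client-side sketch. The GFSClient class, the master.find_chunk RPC, and replica.read_chunk are hypothetical names invented for this note, not the real GFS client library.

```python
# Hypothetical sketch of the GFS client read path (names are illustrative,
# not the real GFS client library API).

CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

class GFSClient:
    def __init__(self, master):
        self.master = master
        self.chunk_cache = {}  # (filename, chunk_index) -> (handle, replica locations)

    def read(self, filename, offset, length):
        """Translate (offset, length) into chunk indices, ask the master once
        per uncached chunk, then fetch the bytes directly from a chunkserver."""
        data = b""
        while length > 0:
            chunk_index = offset // CHUNK_SIZE
            key = (filename, chunk_index)
            if key not in self.chunk_cache:
                # A single master RPC can return handles/locations for several chunks.
                self.chunk_cache[key] = self.master.find_chunk(filename, chunk_index)
            handle, replicas = self.chunk_cache[key]

            chunk_offset = offset % CHUNK_SIZE
            n = min(length, CHUNK_SIZE - chunk_offset)
            # Data flows directly from a chunkserver, typically the closest replica.
            data += replicas[0].read_chunk(handle, chunk_offset, n)
            offset += n
            length -= n
        return data
```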

2.5 Chunk Size

  • 64 MB, much larger than typical file system block sizes; each chunk replica is stored as a plain Linux file on a chunkserver and extended only as needed, so lazy space allocation avoids wasting space through internal fragmentation.

2.6 Metadata

  • The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory.
  • The first two types are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines.
  • The master does not store chunk location information persistently. It asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.

In-Memory Data Structures

  • The master periodically scans its entire in-memory state. This scan is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers.
  • The master maintains less than 64 bytes of metadata for each 64 MB chunk.
  • The file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
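
A rough sketch of how these three kinds of metadata might be laid out in the master's memory; the class and field names are invented for illustration, and only the first two structures are made persistent through the operation log.

```python
# Illustrative layout of the master's in-memory metadata (names invented).
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int = 0                                 # chunk version number (persistent)
    locations: list = field(default_factory=list)    # chunkserver addresses (NOT persistent)

@dataclass
class MasterMetadata:
    # 1. File and chunk namespaces (persisted via the operation log).
    namespace: dict = field(default_factory=dict)    # full pathname -> file attributes
    # 2. Mapping from files to chunks (persisted via the operation log).
    file_chunks: dict = field(default_factory=dict)  # full pathname -> [chunk handle, ...]
    # 3. Chunk replica locations (rebuilt from chunkserver reports at startup).
    chunks: dict = field(default_factory=dict)       # chunk handle -> ChunkInfo
```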

Chunk Locations

  • The master asks each chunkserver which chunks it holds at master startup and whenever a chunkserver joins the cluster. It can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.

Operation Log

  • The operation log is the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations.
  • We replicate the log on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely.
  • The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.
  • The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing.
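
The following is a simplified sketch, under the assumption that log records are pickled Python objects, of how batched flushing and size-triggered checkpointing could fit together; OperationLog and its methods are illustrative, not GFS's actual implementation.

```python
# Simplified write-ahead operation log (illustrative only).
import os, pickle

class OperationLog:
    def __init__(self, path, checkpoint_threshold=10_000):
        self.path = path
        self.pending = []                  # records batched before the next flush
        self.records_since_checkpoint = 0
        self.checkpoint_threshold = checkpoint_threshold

    def append(self, record):
        self.pending.append(record)        # not yet durable

    def flush(self, remote_replicas=()):
        """Make batched records durable locally and remotely; only after this
        returns may the master acknowledge the corresponding client operations."""
        with open(self.path, "ab") as f:
            for record in self.pending:
                f.write(pickle.dumps(record))
            f.flush()
            os.fsync(f.fileno())
        for replica in remote_replicas:    # hypothetical remote log replicas
            replica.append_and_sync(self.pending)
        self.records_since_checkpoint += len(self.pending)
        self.pending.clear()

    def maybe_checkpoint(self, in_memory_state):
        """When the log grows past the threshold, dump the in-memory state and
        start a fresh log so recovery replays only a short suffix of records."""
        if self.records_since_checkpoint < self.checkpoint_threshold:
            return
        with open(self.path + ".ckpt", "wb") as f:   # GFS uses a B-tree-like image;
            pickle.dump(in_memory_state, f)          # pickle is a stand-in here
        open(self.path, "wb").close()                # truncate: begin a new log segment
        self.records_since_checkpoint = 0
```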

2.7 Consistency Model

Guarantees by GFS

  • Record append causes data to be appended atomically at least once even in the presence of concurrent mutations, at an offset of GFS’s choosing (at-least-once semantics).
  • After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by: (1) applying mutations to a chunk in the same order on all its replicas, (2) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down.

Implications for Applications

  • Applications cope with this model by relying on appends rather than overwrites, by checkpointing, and by writing self-validating, self-identifying records: readers use checksums to detect and skip padding, and unique record identifiers to remove duplicates (see the sketch below).
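
A tiny sketch of reader-side duplicate filtering, under the assumption that each record carries a writer-assigned record_id field; this is an application-level convention, not a GFS API.

```python
# Reader-side duplicate filtering (application convention, not a GFS API).
def deduplicate(records):
    """Yield each logical record once, assuming every record carries a unique
    'record_id' assigned by the writer; duplicates arise when a writer retries
    a record append after a partial failure."""
    seen = set()
    for record in records:
        if record["record_id"] in seen:
            continue            # duplicate produced by an append retry
        seen.add(record["record_id"])
        yield record
```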

3. System Interactions

3.1 Leases and Mutation Order

  • We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk.
  • Lease extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers.
  1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses.
  2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
  3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out.
  4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.
  5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
  6. The secondaries all reply to the primary indicating that they have completed the operation.
  7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client.
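
The seven steps above condense into the following client-side sketch; every RPC name (grant_or_get_lease, push_data, apply_write) is invented for illustration and does not reflect the real interfaces.

```python
# Illustrative client-side view of the write protocol (all RPC names invented).
import uuid

def write_chunk(master, chunk_handle, data):
    # Steps 1-2: learn who holds the lease (primary) and where the secondaries are;
    # a real client caches this reply for subsequent mutations.
    primary, secondaries = master.grant_or_get_lease(chunk_handle)

    # Step 3: push the data to all replicas, in any order; each chunkserver keeps
    # it in an internal LRU buffer until it is used or aged out.
    data_id = uuid.uuid4().hex
    for replica in [primary] + secondaries:
        replica.push_data(chunk_handle, data_id, data)

    # Step 4: ask the primary to commit; it assigns a serial number and applies
    # the mutation to its local state in serial-number order.
    # Steps 5-7: the primary forwards the request to all secondaries, collects
    # their replies, and reports success or any replica errors back to the client.
    return primary.apply_write(chunk_handle, data_id, secondaries)
```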

3.2 Data Flow

  • We decouple the flow of data from the flow of control to use the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion.
  • To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the “closest” machine in the network topology that has not received it. Our network topology is simple enough that “distances” can be accurately estimated from IP addresses.
  • We use full-duplex links to support receiving and sending data simultaneously.
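
The paper estimates the ideal elapsed time to push B bytes to R replicas as B/T + RL, where T is the network throughput and L is the per-hop latency; a quick calculation shows why the pipelined chain is cheap.

```python
# Ideal elapsed time for the pipelined push: B/T + R*L (from the paper).
def pipelined_transfer_time(num_bytes, num_replicas, throughput_bps, hop_latency_s):
    # Each chunkserver forwards bytes as soon as they arrive, so the chain adds
    # only per-hop latency rather than a full serial retransmission per replica.
    return num_bytes / throughput_bps + num_replicas * hop_latency_s

# With 100 Mbps links and ~1 ms per-hop latency, 1 MB reaches 3 replicas in ~80 ms.
print(pipelined_transfer_time(1_000_000, 3, 100_000_000 / 8, 0.001))  # ~0.083 seconds
```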

3.3 Atomic Record Appends
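
  • The client specifies only the data; GFS appends it to the file atomically (as one continuous sequence of bytes) at least once, at an offset of GFS’s choosing, and returns that offset to the client.
  • If the record would not fit in the current last chunk, the primary pads the chunk, tells the secondaries to do the same, and asks the client to retry on the next chunk; record appends are restricted to at most one fourth of the chunk size to keep such fragmentation acceptable.
  • If the append fails at any replica, the client retries, so replicas of the same chunk may contain duplicates or padding; GFS only guarantees that the data is written at least once as an atomic unit.

A minimal client-side retry loop illustrating the at-least-once behavior; master.last_chunk and primary.record_append are hypothetical RPC names, not the real client interface:

```python
# Hypothetical client-side record append loop (RPC names invented).
def append_record(master, filename, record):
    """Retry until the append succeeds; every failed or repeated attempt is a
    potential source of the duplicates and padding that readers must tolerate."""
    while True:
        chunk_handle, primary, secondaries = master.last_chunk(filename)
        ok, offset = primary.record_append(chunk_handle, record, secondaries)
        if ok:
            return offset   # offset chosen by GFS; the record sits there on every replica
        # Failure at any replica (or a request to retry after the chunk was
        # padded): simply try again, possibly on a new last chunk.
```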

3.4 Snapshot

  • We use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder.
  • After the leases have been revoked or have expired, the master logs the operation to disk. It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.
  • The first time a client wants to write to a chunk after the snapshot, the master notices that the chunk’s reference count is greater than one. It picks a new chunk handle, asks each chunkserver holding the current replica to copy the chunk locally, and only then grants a lease on the new chunk and replies to the client (see the sketch below).
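
A condensed sketch of the copy-on-write bookkeeping on the master; the data structures, reference counts, and clone_chunk call are illustrative stand-ins, not the real implementation.

```python
# Copy-on-write bookkeeping on the master (illustrative stand-in).
import uuid

class SnapshotMaster:
    def __init__(self):
        self.file_chunks = {}   # pathname -> [chunk handle, ...]
        self.refcount = {}      # chunk handle -> number of files referencing it

    def snapshot(self, src, dst):
        """Duplicate metadata only; the snapshot shares every chunk with the source."""
        handles = list(self.file_chunks[src])
        self.file_chunks[dst] = handles
        for h in handles:
            self.refcount[h] = self.refcount.get(h, 1) + 1

    def chunk_for_write(self, path, index, chunkservers):
        """First write to a shared chunk after a snapshot triggers a local copy."""
        handle = self.file_chunks[path][index]
        if self.refcount.get(handle, 1) > 1:
            new_handle = uuid.uuid4().hex            # stand-in for a fresh 64-bit handle
            for cs in chunkservers:                  # copied locally, not over the network
                cs.clone_chunk(handle, new_handle)
            self.refcount[handle] -= 1
            self.refcount[new_handle] = 1
            self.file_chunks[path][index] = new_handle
            handle = new_handle
        return handle   # the master would now grant a lease on this handle
```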

4. Master Operation

4.1 Namespace Management and Locking

  • GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. With prefix compression, this table can be efficiently represented in memory. Each node in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock.
  • Each operation acquires read locks on the directory names along the path and a read or write lock on the full pathname itself. This allows concurrent mutations in the same directory while serializing operations that conflict: for example, snapshotting /home/user into /save/user (write lock on /home/user) and creating /home/user/foo (read lock on /home/user) cannot proceed at the same time, as worked out in the sketch below.
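
The paper's example can be checked with a small helper that computes each operation's lock set; the helper is illustrative only, not how the master represents locks.

```python
# Which locks each operation needs (illustrative); a conflict on any pathname
# (write vs. anything) serializes the two operations.
def lock_set(path, write_target):
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    locks = {p: "read" for p in ancestors}           # read locks along the path
    locks[path] = "write" if write_target else "read" # lock on the full pathname
    return locks

# Paper's example: snapshotting /home/user into /save/user while another
# client creates /home/user/foo.
snapshot = {**lock_set("/home/user", True), **lock_set("/save/user", True)}
create   = lock_set("/home/user/foo", True)
print(snapshot)  # {'/home': 'read', '/home/user': 'write', '/save': 'read', '/save/user': 'write'}
print(create)    # {'/home': 'read', '/home/user': 'read', '/home/user/foo': 'write'}
# The two conflict on /home/user (write vs. read), so they cannot run concurrently.
```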

4.2 Replica Placement

  • Replicas are typically placed on two machines in one rack and on a third machine in a different rack, trading off reliability and availability against network bandwidth utilization.

4.3 Creation, Re-replication, Rebalancing

  • Factors considered when placing the initially empty chunk: (1) place new replicas on chunkservers with below-average disk space utilization; (2) limit the number of recent creations on each chunkserver, since creation usually predicts imminent heavy write traffic; (3) spread replicas of a chunk across racks.
  • Re-replication with priority: the master re-replicates a chunk as soon as the number of available replicas falls below the user-specified goal, prioritizing chunks that are furthest from their replication goal or that are blocking client progress.
  • Rebalancing: the master periodically moves replicas for better disk space and load balancing, and gradually fills up a new chunkserver rather than instantly swamping it with new chunks.

4.4 Garbage Collection

Mechanism

  • When a file is deleted by the application, the file is just renamed to a hidden name that includes the deletion timestamp. During the master’s regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable).
  • In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks. In HeartBeat exchanges, each chunkserver reports a subset of the chunks it holds, the master replies with the identity of chunks no longer present in its metadata, and the chunkserver is then free to delete its replicas of those chunks (see the sketch below).
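
A compressed sketch of the lazy deletion mechanism; the hidden-name convention and function names are illustrative, and files here maps pathnames to chunk handle lists.

```python
# Lazy deletion sketch (illustrative). Deleted files are renamed to a hidden
# name carrying the deletion timestamp; a background scan reclaims them later.
import time

GRACE_PERIOD = 3 * 24 * 3600       # three days by default (configurable)

def delete(files, path):
    """Deletion just renames the file to a hidden name recording when it was deleted."""
    files[f"{path}.__deleted__.{int(time.time())}"] = files.pop(path)

def namespace_scan(files, now=None):
    """Master's regular scan: drop hidden files older than the grace period.
    Their chunks become orphaned and are reclaimed by the chunk-namespace scan
    plus the HeartBeat exchanges with chunkservers."""
    now = now or time.time()
    for name in list(files):
        if ".__deleted__." in name:
            deleted_at = int(name.rsplit(".", 1)[-1])
            if now - deleted_at > GRACE_PERIOD:
                del files[name]    # metadata gone; replicas are deleted lazily
```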

4.5 Stale Replica Detection

  • For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas. Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas.
  • The master removes stale replicas in its regular garbage collection.
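
A minimal sketch of how version numbers expose staleness, with the master's chunk record and the replicas represented as plain dictionaries for illustration.

```python
# Stale replica detection via chunk version numbers (illustrative).
def grant_lease(master_chunk, live_replicas):
    """Before granting a new lease, bump the master's version number and inform
    the up-to-date replicas; a replica whose chunkserver is down misses the bump."""
    master_chunk["version"] += 1
    for replica in live_replicas:
        replica["version"] = master_chunk["version"]

def check_reported_replica(master_chunk, replica):
    """Run when a chunkserver (re)reports a replica, e.g., after restarting."""
    if replica["version"] < master_chunk["version"]:
        return "stale"      # removed in the regular garbage collection pass
    return "ok"
```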

5. Fault Tolerance and Diagnosis

5.1 High Availability

  • Fast Recovery
  • Chunk Replication
  • Master Replication: the operation log and checkpoints are replicated on multiple machines, and "shadow" masters provide read-only access even when the primary master is down. Clients use only the canonical name of the master (e.g., gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.

5.2 Data Integrity

  • A chunk is broken up into 64 KB blocks. Each has a corresponding 32 bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.
  • For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver.
  • For writes that append to the end of a chunk, the checksum of the last partial block is incrementally updated and new checksums are computed for any brand-new blocks filled by the append. For overwrites, the first and last blocks of the range being overwritten must first be read and verified; otherwise corruption hidden in the unwritten portions of those blocks could go undetected.
  • During idle periods, chunkservers can scan and verify the contents of inactive chunks.
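
A small sketch of read-time verification, assuming CRC32 as the 32-bit checksum (the paper does not specify the exact function) and representing the chunk as an in-memory byte string.

```python
# Per-block checksum verification on a chunkserver read (illustrative).
import zlib

BLOCK_SIZE = 64 * 1024             # a chunk is divided into 64 KB blocks

def verify_and_read(chunk_data, checksums, offset, length):
    """Verify the 32-bit checksum of every block overlapping [offset, offset+length)
    before returning any data; on mismatch, report the corruption so the master
    can re-replicate the chunk from a good replica."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return chunk_data[offset:offset + length]
```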

5.3 Diagnostic Tools

  • Diagnostic logs

6. Measurements

6.1 Micro-benchmarks

  • Reads
  • Writes
  • Record Appends

6.2 Real World Clusters

  • Storage
  • Metadata
  • Read and Write Rates
  • Master Load
  • Recovery Time

6.3 Workload Breakdown

  • Methodology and Caveats
  • Chunkserver Workload
  • Appends versus Writes

7. Experiences

  • Permissions and Quotas
  • Disk drives and the Linux driver must agree: some disks claimed to support a range of IDE protocol versions but responded reliably only to the more recent ones, and the resulting mismatches silently corrupted data. This problem motivated GFS’s use of checksums.
  • fsync() in the Linux 2.2 kernel is costly because its cost is proportional to the size of the whole file rather than the size of the modified portion, which was a problem for the growing operation log; we eventually migrated to Linux 2.4.
  • A single Linux reader-writer lock on the address space blocked the primary network thread whenever disk threads paged in previously mmap()ed data; mmap() was eventually replaced with pread() at the cost of an extra copy.

8. Related Work

  • AFS: like AFS, GFS provides a location-independent namespace, but unlike AFS it spreads a file’s data across multiple storage servers for aggregate performance and fault tolerance.
  • RAID: GFS relies on plain replication for redundancy rather than more sophisticated RAID-style schemes, accepting higher raw storage cost for simplicity on commodity hardware.
  • NASD: GFS most closely resembles the NASD architecture, but uses commodity machines as chunkservers instead of network-attached disk drives.

Written on January 30, 2018