Remus: High Availability via Asynchronous Virtual Machine Replication

Reference: Cully, Brendan, et al. "Remus: High availability via asynchronous virtual machine replication." Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation. 2008.

0. Abstract

Allowing applications to survive hardware failure is an expensive undertaking, which generally involves re-engineering software to include complicated recovery logic as well as deploying special-purpose hardware; this represents a severe barrier to improving the dependability of large or legacy applications. We describe the construction of a general and transparent high availability service that allows existing, unmodified software to be protected from the failure of the physical machine on which it runs. Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections. Our approach encapsulates protected software in a virtual machine, asynchronously propagates changed state to a backup host at frequencies as high as forty times a second, and uses speculative execution to concurrently run the active VM slightly ahead of the replicated system state.

1. Introduction

1.1 Goal

  • Generality. High availability should be provided as a low-level service.
  • Transparency. High availability should not require that OS or application code be modified to support facilities such as failure detection or state recovery.
  • Seamless failure recovery. (1) No externally visible state should ever be lost in the case of single-host failure; (2) failure recovery should proceed rapidly enough that it appears as nothing more than temporary packet loss from the perspective of external users.

Common HA systems are based on asynchronous storage mirroring followed by application-specific recovery code.

1.2 Approach

  • VM-based whole-system replication
  • Speculative execution
  • Asynchronous replication
[Figure from the paper: Speculative execution and asynchronous replication in Remus.]

2. Design and Implementation

Overview
  • Based on the Xen virtual machine monitor
  • Remus achieves high availability by propagating frequent checkpoints of an active VM to a backup physical host
  • Disk state is not externally visible, so it only needs to be consistent with the replicated memory state at checkpoint boundaries
  • The virtual machine does not actually execute on the backup host until a failure occurs

2.1 Failure Model

Properties

  • The fail-stop failure of any single host is tolerable.
  • Should both the primary and backup hosts fail concurrently, the protected system’s data will be left in a crash-consistent state.
  • No output will be made externally visible until the associated system state has been committed to the replica.

2.2 Pipelined Checkpoints

Stages

  1. Once per epoch, pause the running VM and copy any changed state into a buffer (similar to the stop-and-copy stage of live migration). Once the changed state has been preserved in the buffer, the VM is unpaused and speculative execution resumes.
  2. Buffered state is transmitted and stored in memory on the backup host.
  3. Once the completed set of state has been received, the checkpoint is acknowledged to the primary.
  4. Buffered network output is released. (A minimal sketch of the four stages follows this list.)
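
The following is a minimal Python sketch of this pipeline, assuming dictionaries stand in for VM state and a synchronous call stands in for the network. All names here (Primary, Backup, run_epoch, nic_send) are invented for illustration; the real implementation lives inside Xen's migration machinery, not in Python.

    # Hypothetical model of Remus's checkpoint pipeline; not the actual Xen code.
    class Backup:
        """Holds the replica; acknowledges once a complete checkpoint arrives."""
        def __init__(self):
            self.state = {}                     # last committed replica state

        def receive(self, checkpoint):
            # Stage 3: the complete state set has been received; commit and ack.
            self.state.update(checkpoint)
            return "ack"

    class Primary:
        def __init__(self, backup, nic_send):
            self.backup = backup
            self.nic_send = nic_send            # hands a packet to the real NIC
            self.dirty = {}                     # state changed this epoch
            self.net_buffer = []                # speculative outbound packets

        def write(self, key, value):
            self.dirty[key] = value             # guest writes tracked per epoch

        def send_packet(self, pkt):
            self.net_buffer.append(pkt)         # held until the epoch commits

        def run_epoch(self):
            # Stage 1: pause, copy changed state into a buffer, then resume.
            snapshot = dict(self.dirty)
            self.dirty.clear()                  # VM resumes speculatively here
            # Stage 2: transmit the buffered state to the backup host.
            if self.backup.receive(snapshot) == "ack":
                # Stage 4: only now does buffered output become visible.
                for pkt in self.net_buffer:
                    self.nic_send(pkt)
                self.net_buffer.clear()

Driving run_epoch from a timer gives the checkpoint frequency; the invariant is that send_packet output never reaches nic_send before the checkpoint covering it has been acknowledged.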

2.3 Memory and CPU

Live Migration

  • Memory is copied to the new location while the VM continues to run at the old location. During migration, writes to memory are intercepted, and dirtied pages are copied to the new location in rounds.
  • Xen provides the ability to track guest writes to memory using a mechanism called shadow page tables: the VMM maintains a private (“shadow”) version of the guest’s page tables and exposes these to the hardware MMU. Page protection is used to trap guest access to its internal version of the page tables, allowing the hypervisor to track updates, which are propagated to the shadow versions as appropriate. (A toy version of this iterative pre-copy loop is sketched below.)
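
A toy sketch of the pre-copy loop, where a dirty-page callback stands in for the shadow-page-table tracking; MAX_ROUNDS and every function name here are invented for this sketch, not Xen's API.

    # Toy iterative pre-copy; the guest keeps running between rounds.
    MAX_ROUNDS = 5

    def live_migrate(memory, pages_dirtied_since_last_call, pause_vm, send):
        """memory: dict page -> bytes; pages_dirtied_since_last_call: callable
        returning (and resetting) the set of pages written since last asked;
        pause_vm: callable pausing the guest; send: transmit one page."""
        to_copy = set(memory)                       # round 0: copy everything
        for _ in range(MAX_ROUNDS):
            for page in to_copy:
                send(page, memory[page])            # guest still running
            to_copy = pages_dirtied_since_last_call()
            if not to_copy:
                break
        pause_vm()                                  # final stop-and-copy
        to_copy |= pages_dirtied_since_last_call()
        for page in to_copy:
            send(page, memory[page])

Remus's checkpoints are effectively this final stop-and-copy phase repeated once per epoch, sending only the pages dirtied since the previous epoch.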

Optimization

  • Migration enhancements. Remus optimized checkpoint signaling and increased the efficiency of the memory copying process.
  • Checkpoint support. The suspend/migration code was changed so that after each checkpoint the guest resumes execution on the same host instead of being destroyed, turning one-shot live migration into repeated checkpointing.
  • Asynchronous transmission. While the guest is paused, dirty pages are copied into a local staging buffer; they are transmitted to the backup after the guest resumes, shortening the pause time.
  • Guest modifications. The paravirtual guest’s suspend path was streamlined (including a hypercall-based suspend request) so that checkpoints can be taken at high frequency.

2.4 Network buffering

  • Outbound packets generated since the previous checkpoint are queued until the current state has been checkpointed and that checkpoint has been acknowledged by the backup site.
  • Because Linux queuing disciplines operate only on outgoing traffic, and the guest’s transmitted packets appear as inbound traffic on the host’s backend interface, Remus routes them through a special intermediate queuing device that converts them to outbound traffic so the queuing discipline can buffer them. (See the sketch below.)
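
A sketch of the effect of this output-commit buffering, in Python. This models only the behavior; the real mechanism is a plugged Linux qdisc fed through the intermediate queuing device, and the class name here is invented.

    from collections import deque

    class OutputCommitQueue:
        def __init__(self, nic_send):
            self.nic_send = nic_send        # hands a packet to the wire
            self.held = deque()             # uncommitted, speculative output

        def transmit(self, pkt):
            # The guest believes this packet was sent; externally it stays
            # invisible until the current epoch commits.
            self.held.append(pkt)

        def on_checkpoint_ack(self):
            # The state that produced these packets is now replicated, so
            # releasing them can never expose unrecoverable speculation.
            while self.held:
                self.nic_send(self.held.popleft())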

2.5 Disk Buffering

  • Writes to disk from the active VM are treated as write-through: they are immediately applied to the primary disk image, and asynchronously mirrored to an in-memory buffer on the backup.
  • At the time that the backup acknowledges that a checkpoint has been received, disk updates reside completely in memory; they are applied to the backup’s disk only once the checkpoint commits, so the replica image is always crash consistent at epoch granularity. (A sketch of the backup side follows.)
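
A minimal sketch of the backup side of this disk protocol; the class and method names are invented, and a dictionary stands in for the replica disk image.

    class BackupDisk:
        def __init__(self, image):
            self.image = image              # dict block -> bytes (replica disk)
            self.buffer = {}                # in-RAM writes for the current epoch

        def mirror_write(self, block, data):
            self.buffer[block] = data       # staged in memory, not on disk

        def commit_checkpoint(self):
            # Checkpoint committed: apply the whole write set at once, keeping
            # the replica consistent at epoch granularity.
            self.image.update(self.buffer)
            self.buffer.clear()

        def discard_epoch(self):
            # Primary failed mid-epoch: drop the speculative writes and stay
            # at the last committed checkpoint.
            self.buffer.clear()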

2.6 Detecting Failure

  • Failure is detected by a timeout of the backup responding to commit requests, or by a timeout of new checkpoints being transmitted from the primary. If the primary fails, the backup resumes the VM from the last committed checkpoint; if the backup fails, the primary disables protection and continues unprotected. (A toy detector is sketched below.)
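
A toy watchdog under the paper’s fail-stop assumption; the timeout value and function names are arbitrary choices for this sketch.

    import time

    def watch(last_event_time, on_failure, timeout=1.0):
        """last_event_time: callable returning the timestamp of the most recent
        checkpoint (on the backup) or acknowledgment (on the primary)."""
        while True:
            if time.time() - last_event_time() > timeout:
                on_failure()    # backup: resume the VM from committed state;
                return          # primary: disable protection, run unprotected
            time.sleep(timeout / 10)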

3. Evaluation

  • Although Remus is efficient at state replication, it does introduce significant network delay, particularly for applications that exhibit poor locality in memory writes. Applications that are very sensitive to network latency may not be well suited to this type of high availability service.
  • Checkpoint frequency is bounded by replication time; higher checkpoint frequency has a significant impact on network latency but little impact on disk performance.
  • Potential optimizations: (1) deadline scheduling: to provide stricter scheduling guarantees, the rate at which the guest operates could be deliberately slowed between checkpoints, depending on the number of pages dirtied; (2) page compression: to reduce the amount of state requiring replication, send only the delta from a previous transmission of the same page (a sketch of such delta encoding follows this list); (3) copy-on-write checkpoints: mark dirty pages as copy-on-write and resume the domain immediately.
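
An illustration of the page-compression idea: XOR each page against the previously transmitted copy and compress the sparse delta. zlib stands in for whatever encoder a real implementation would choose, and the function names are invented.

    import zlib

    def encode_page(addr, page, cache):
        prev = cache.get(addr)
        cache[addr] = page
        if prev is None:
            return "raw", zlib.compress(page)       # first transmission
        delta = bytes(a ^ b for a, b in zip(page, prev))
        return "xor", zlib.compress(delta)          # mostly zeros when writes are sparse

    def decode_page(addr, kind, payload, cache):
        data = zlib.decompress(payload)
        if kind == "xor":
            data = bytes(a ^ b for a, b in zip(data, cache[addr]))
        cache[addr] = data
        return data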

4. Related Work

  • Virtual machine migration
  • Virtual machine logging and replay
  • Operating system replication
  • Library approaches
  • Replicated storage
  • Speculative execution
Written on February 22, 2017