On Wednesday we had a presentation and discussion of the Ocean Vista (OV) replication and distributed transaction protocol. OV works in the WANs, where each region has all data-partitions, and transactions can originate in any region. OV separates replication from transaction execution, by making replication conflict-free with a FastPaxos-inspired protocol. For the transaction execution, OV maintains the visibility watermarks, such that any transaction ongoing the replication is not yet visible, and all transactions below the watermark are visible. The protocol computes the watermark via a hierarchical gossiping protocol by taking the minimum watermark from each region. The regional watermark, in its turn, is a minimum of the server watermarks in the region, and a server watermark is an ongoing transaction with a minimal timestamp on that server. A few additional optimizations exist to allow reading from a single server, but these require an additional watermark to designate the full replication of a transaction.
Below you can find a presentation from our reading group by Balaji Arun
Our discussion focused on a few points/questions:
(1) Why does OV protocol use the FastPaxos-like algorithm for replication? FastPaxos requires the use of larger “super quorums”, however, OV replication is not client-driven, and a server specifically picks a unique timestamp for the replication. In FastPaxos, multiple commands may be tried on the same instance (i.e. timestamp), requiring a larger quorum for recovery in phase-1 of Paxos with a smaller majority quorum. However, in OV we do not see such conflicts: the timestamps for transactions are unique and are assigned by one node that coordinates replication. The only possible conflict that we saw happening is when a transaction first tries to replicate the command and then issues an abort on the same timestamp. This arguably creates a write-write conflict on the same instance (timestamp), but we think it can be resolved by establishing fixed precedence to make aborts always win over the regular writes. With precedence order established, writing abort to a majority of nodes should be sufficient to make abort persistent and recoverable. We have not reached a definite conclusion on why OV uses larger fast quorums, so we may still be missing something in our understanding of the protocol.
(2) Another discussion point was concerning the evaluation and comparison. A recent SLOG paper solves a similar problem, so we were wondering how it may behave compared to the OV. The group’s conclusion was a definite “it depends.” From one perspective, OV is more decentralized, so it may be able to achieve higher throughput when there are many multi-partition transactions. SLOG is more centralized, having both a dedicated master per partition and a dedicated ordering layer for the transactions involving many partitions. Both systems, however, do not abort transactions in most of the cases (unless there are significant failures), so their performance may be close to each other.
(3) Related to (2). Is comparison with TAPIR a fair one? Two protocols operate quite differently. We thought that it would have been nice to see a comparison with Calvin and SLOG.
(4) Performance for geo-sharded setup could suffer greatly. If we do not have all partitions/shards in every region, we will need to do a cross-region replication. In the protocol, the visibility watermark gets updated after the replication is complete, and having WAN replication will delay that process. Moreover, this may delay the visibility of other transactions that do not have geo-replication but have a timestamp after the geo-replicated transaction.