Tag Archives: transactions

Reading Group. Performance-Optimal Read-Only Transactions

Last meeting we looked at “Performance-Optimal Read-Only Transactions” from OSDI’20. This paper covers the important topic of transactional reads in database/data-management systems. In particular, the paper discusses “one-shot” read-only transactions that complete in one network round-trip-time (RTT) without blocking and without bloated, expensive messages. If this sounds too good to be true, it is. Before presenting these types of transactions, the authors discuss why it is impossible to have Non-blocking, One round-trip, Constant size messaging, Strictly Serializable (NOCS) read-only transactions. This becomes a “pick 3 out of 4” kind of deal.

The Performance-Optimal Read-Only Transaction (PORT) system shows how to get away with NOC and come as close as possible to S. To satisfy the One round-trip constraint, the paper makes the clients coordinate their own read transactions and control the ordering of reads, all in one round of message exchange. This requires the clients to send some recency/progress metadata over to the servers. In the case of PORT, the metadata is a Version Clock, a type of logical clock. It is just a number, so it is Constant-size. Finally, the servers can use this metadata to return the latest value that satisfies the recency constraint imposed by the Version Clock in a Non-blocking manner. The servers also avoid coordination, again satisfying the One round-trip requirement. To make sure the reads do not block, PORT never considers in-progress operations. PORT separates the in-progress operations from the completed ones with a stable frontier time/version. Clients must request reads at what they know to be the latest stable, immutable state of the system and never try to request state from in-progress operations. Since different clients may have different and stale knowledge of the stable frontier, the system needs to support reading different versions of data, hence PORT relies on a multi-version store.
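To make this concrete, here is a minimal sketch of the server-side read path, assuming a multi-version store. The names (MVStore, stable_frontier) are mine, and this is a simplification of the idea rather than PORT’s actual code:

```python
# Toy multi-version server read path, loosely following the PORT idea:
# the client ships a constant-size Version Clock, and the server answers
# from completed (stable) versions only, without blocking or coordination.

class MVStore:
    def __init__(self):
        self.versions = {}        # key -> sorted list of (version, value)
        self.stable_frontier = 0  # all versions <= this are complete/immutable

    def read(self, key, version_clock):
        # The client's Version Clock is at most the stable frontier it knows,
        # so reads only ever touch completed versions and never block.
        at = min(version_clock, self.stable_frontier)
        candidates = [(v, val) for (v, val) in self.versions.get(key, [])
                      if v <= at]
        return max(candidates)[1] if candidates else None
```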

PORT also does some clever trickery to improve consistency. For example, a promotion mechanism is used to block out a range of versions from writing in some cases. If some data was written with version v=10 and then a read transaction requests the value at version v=15, the v=10 value is promoted to occupy the entire range of versions [10, 15], and servers are disallowed from writing anything in that range. This, however, does not cause a write in that version range to abort; instead, such a write lands at version v=16.
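The promotion trick can be sketched as follows; again, this is my own simplification with made-up names, where values occupy version ranges, reads widen the range of the value they return, and writes slide past promoted ranges instead of aborting:

```python
class PromotingStore:
    def __init__(self):
        self.ranges = {}  # key -> list of (lo, hi, value), ascending by lo

    def read(self, key, at_version):
        ranges = self.ranges.get(key, [])
        for i in range(len(ranges) - 1, -1, -1):   # latest value first
            lo, hi, value = ranges[i]
            if lo <= at_version:
                # Promote: the value now covers [lo, max(hi, at_version)],
                # so no future write may claim a version inside this range.
                ranges[i] = (lo, max(hi, at_version), value)
                return value
        return None

    def write(self, key, version, value):
        ranges = self.ranges.setdefault(key, [])
        if ranges and version <= ranges[-1][1]:
            version = ranges[-1][1] + 1  # slide past the promoted range
        ranges.append((version, version, value))

# The example from the text: write at v=10, read at v=15 promotes [10, 15],
# and a subsequent write lands at v=16 rather than aborting.
store = PromotingStore()
store.write("x", 10, "a")
store.read("x", 15)
store.write("x", 12, "b")   # slides to v=16
```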

The paper implements PORT in ScyllaDB and Eiger and shows nearly identical throughput in read-heavy workloads to that of non-transactional reads, while also beating Eiger’s native transactions. There are quite a few important details and nuances in implementing PORT. The implementation on top of Eiger is full of surprises, as the promotion mechanism described above no longer works for transactional writes, and PORT uses another clever trick there.

Here is the presentation by Alex Miller, which goes into a bit more detail than my summary:

Discussion

1) SNOW theorem. The NOCS theorem the authors discuss sounds similar to SNOW. Well, it is by the same first author, so this makes some sense. Both are about read-only transactions, and both concern the trade-offs between performance (or latency) and consistency. NOCS focuses on performance-optimal transactions, while SNOW talks about latency-optimal ones. Both establish the impossibility of having the highest consistency and being “x-optimal” (fill in the “x”). Moreover, NOC (non-blocking, one round trip, constant metadata) implies that performance here largely means latency as well. It just happens that if we stop doing all the extra work, then the throughput improves too. In a sense, it appears that NOCS is a rebranding of SNOW to some extent. We can even map letters in both abbreviations to similar concepts. S = (strict) serializability in both cases. N = Non-blocking. O = one round trip (in SNOW it is coordinate/retry, which is pretty much whether we add more messages or not). So three letters are the same; W & C are the difference, but even there we can find some similarities. W in SNOW stands for write conflict avoidance, and one way to achieve it may require violating C, the constant metadata. The paper itself mentions that NOCS is similar to SNOW.

2) Other causal systems. Occult and PaRiS were brought up in the discussion briefly. We did not spend too much time on them, though. Occult is a causal system that avoids “slowdown cascades” due to dependencies and the need to enforce causality. PORT, with its one-RTT non-blocking mechanism, seems similar in this regard, so a comparison would be interesting.

3) HLC for the logical clock? HLCs are used for transactions in MongoDB and CockroachDB. HLCs are logical clocks, constant in size, and do help identify consistent cuts/snapshots for transactions. MongoDB uses HLCs for cross-partition causal transactions, and that seems to fit well within NOC. CockroachDB is more involved, but it also uses HLC. Another important property of HLC is that it can provide a single serial order, but this is something PORT actually avoids in Eiger-PORT, since it needs to provide a different serial order to different clients to enforce the read-your-writes property without blocking.
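For reference, here is a minimal sketch of the standard HLC update rules (per the original HLC paper, not anything from PORT); the timestamp is a constant-size (l, c) pair:

```python
import time

class HLC:
    """Hybrid logical clock: timestamps are constant-size (l, c) pairs."""
    def __init__(self):
        self.l = 0  # highest physical time witnessed
        self.c = 0  # logical counter for events sharing the same l

    def _pt(self):
        return int(time.time() * 1000)  # physical clock in milliseconds

    def send(self):  # also used for local events
        l_old, pt = self.l, self._pt()
        self.l = max(l_old, pt)
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def recv(self, l_msg, c_msg):  # on receiving timestamp (l_msg, c_msg)
        l_old, pt = self.l, self._pt()
        self.l = max(l_old, l_msg, pt)
        if self.l == l_old and self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        return (self.l, self.c)
```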

4) On the importance of a stable frontier. A stable frontier is the point in the system’s execution that separates what is safe to read from what is not. Everything before the stable frontier is committed/executed and safe; any operation after the frontier may not have been fully written/committed yet and is not safe. This separation is clear in Scylla-PORT but gets blurred in Eiger-PORT with its read-your-writes reordering.

5) Replication. The paper does not address replication issues at all, so one has to wonder how it handles replication and the associated failures. For example, in Cassandra/Scylla, a read succeeds after being completed by some read quorum that may be smaller than the full set of replicas for the object. This means that you can promote the value on a subset of replicas, then do a write on some quorum containing the un-promoted replicas, and end up with the same write recorded under different versions on different replicas. This may or may not be a huge problem, but a conversation on replication/failures would be very useful. The code (which is open source) may help shed light on this, but we did not have a chance to look at it during the discussion.

6) Eiger-PORT. This one is very different from the ScyllaDB version. It is different from how the paper describes PORT all along, since Eiger-PORT cannot promote the operations: now writes are in a transaction, and all writes from one transaction must be promoted to a higher version atomically. If you do that, you need to coordinate between servers and add messages, losing the O part of NOC. The authors go into more detail describing the Eiger-PORT protocol, which is not the easiest thing to grasp on a first read. It is also mind-twisting when you start reordering operations for different clients. Actually, as of the time of this writing, we were still discussing some aspects of Eiger-PORT in our group’s slack channel.

7) Evaluation. We liked the choice of and rationale for the baseline systems to evaluate against. PORT indeed shows low overhead in ScyllaDB while improving the database’s consistency semantics.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and, most importantly, manage Zoom invites to the presentations. Please join the slack group to get involved!

Reading Group. Aria: A Fast and Practical Deterministic OLTP Database.

In our 33rd reading group meeting, we discussed “Aria: A Fast and Practical Deterministic OLTP Database.” by Yi Lu, Xiangyao Yu, Lei Cao, Samuel Madden. We had a very nice presentation by Alex Miller:

Quick Summary

Aria is a transaction protocol heavily influenced by Calvin, and it largely adopts Calvin’s transaction model, with one big difference. In Calvin, the read and write sets of a transaction must be known beforehand, while Aria has no such strict requirement, allowing for generally more flexible transactions. Aria’s main goal is parallelizing transactions as much as possible to maximize throughput. To that end, Aria adopts batching and processes multiple transactions in each batch concurrently.

Each batch operates in two phases: execution and commit. All transactions in a batch start the execution phase from the same snapshot (the result of committing and applying the previous batch). In the execution phase, each transaction executes concurrently and produces a new state, which is stored temporarily. Once all transactions in a batch have finished executing, the protocol moves into the commit phase, where Aria aborts any transaction that has a Write-after-Write (WAW) conflict or Read-after-Write (RAW) conflict. Aria uses unique and sortable transaction IDs to determine the order of transactions within the batch when finding these conflicts. Any aborted transaction goes into a future batch, while the remaining successful transactions commit and apply their temporary execution results. Once the commit has finished, a new batch can start, and the protocol keeps moving in these lock-steps, synchronizing after each step. Every replica can run the protocol without synchronizing/communicating with other replicas unless the system is sharded/partitioned and one replica needs to read data from another shard. An important optimization of Aria is deterministic reordering, where transactions within a batch can be reordered (i.e., not follow the order of their TIDs) to reduce the number of aborts and, consequently, the amount of wasted work in the execution phase.
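To make the two phases concrete, here is a simplified, single-node sketch based on my reading of the paper. The real system executes transactions in parallel and tracks conflicts with per-key reservation tables; run_batch and its helpers are made-up names:

```python
def run_batch(snapshot, batch):
    """snapshot: dict state; batch: list of (tid, txn_fn) with unique tids.
    Each txn_fn takes a read(key) callback and returns its buffered writes."""
    writes, reads = {}, {}
    for tid, txn_fn in sorted(batch):          # execution phase
        read_set = set()
        def read(key, read_set=read_set):
            read_set.add(key)
            return snapshot.get(key)           # everyone reads the same snapshot
        writes[tid] = txn_fn(read)             # writes are buffered, not applied
        reads[tid] = read_set

    # "Reservations": the smallest TID in the batch writing each key.
    min_writer = {}
    for tid in sorted(writes):
        for key in writes[tid]:
            min_writer.setdefault(key, tid)

    committed, aborted = [], []
    for tid in sorted(writes):                 # commit phase, in TID order
        waw = any(min_writer[k] < tid for k in writes[tid])
        raw = any(min_writer.get(k, tid) < tid for k in reads[tid])
        if waw or raw:
            aborted.append(tid)                # moved to a future batch
        else:
            committed.append(tid)
            snapshot.update(writes[tid])       # apply buffered results
    return committed, aborted
```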

Discussion Points

We had good participation and touched on quite a few things in the discussion, but to the credit of the paper, we were able to find lots of answers there just by reading more carefully.

1) Transaction Serializability. The paper claims serializability, but it applies a bunch of transactions to the same snapshot. How does this compare to snapshot isolation? And what is the difference that enables serializability?

We think the batch processing and the addition of the RAW check within the batch make the difference. Snapshot isolation checks for WAW conflicts only and allows some anomalies. By also disallowing RAW conflicts, we can eliminate the problem. However, there are nuances with how a read set can be defined for conflict resolution. For example, what is the read set of a transaction running UPDATE items SET x=0 WHERE x=1? It can be either all items, or only the items with x=1 if some indexing is used, and this difference may result in serializability issues. In the paper’s defense, it does not explicitly consider a SQL model, and it mentions that things like the above are up to the users to decide, with a fall-back to a Calvin-like approach.
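As an illustration, feeding the classic write-skew anomaly through the run_batch sketch from the summary above shows why the extra RAW check matters: a WAW-only check (as in snapshot isolation) would commit both transactions, while the RAW rule aborts one:

```python
state = {"x": 1, "y": 1}
t1 = (1, lambda read: {"y": 0} if read("x") == 1 else {})  # reads x, writes y
t2 = (2, lambda read: {"x": 0} if read("y") == 1 else {})  # reads y, writes x
committed, aborted = run_batch(state, [t1, t2])
# Write sets are disjoint, so there is no WAW conflict and SI would commit
# both. But t2 read "y", which lower-TID t1 writes: a RAW conflict, so t2
# aborts and retries in a future batch against the new snapshot.
assert committed == [1] and aborted == [2]
```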

2) Batch size. This is a batched protocol, so the batch size may play a role in performance. Batches that are too small create more frequent barriers between batches and phases. Batches that are too large have a higher chance of containing a super long transaction that stalls the entire batch, hurting performance. The paper mentions that, for best performance, transactions should take about the same time to execute, to avoid the case when one slow transaction stalls the entire batch.

3) Paper readability. The overall consensus in the group was that this was one of the easier transaction papers to read.

4) Comparison with other transaction systems. The SLOG paper we discussed a few months ago also uses determinism on each node to run transactions. Unlike Aria, though, SLOG uses determinism for the execution order, while Aria executes unordered (from the same state) and deterministically aborts the conflicting transactions. Overall, it seems that Aria’s use of determinism is more extensive: the same snapshot, transactions that are deterministic themselves, deterministic conflict search and abort, etc.

There was also a mention of CockroachDB’s transactions, since they also use snapshots and produce temporary results, but Cockroach is more optimistic than deterministic. Also, Cockroach is more interested in low-latency transactions, while Aria is all about high throughput, even at the expense of latency.

5) Deterministic reordering. The paper mentions that the reordering is a best-effort algorithm and not the most optimal one. It also seems that reordering plays a big role in achieving high throughput, so can we squeeze out more performance with a better reordering algorithm? Obviously, it is not efficient to brute-force all possible permutations, but maybe some better heuristic approach exists?


6) Performance variance under conflict. Aria works best for workloads without much conflict between transactions. Reordering helps, but not in all cases, as evidenced in the evaluation.


Join our reading group on slack for more discussions, paper schedule, and zoom participation.

Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions


On Wednesday we had a presentation and discussion of the Ocean Vista (OV) replication and distributed transaction protocol. OV works in WANs where each region has all data partitions and transactions can originate in any region. OV separates replication from transaction execution by making replication conflict-free with a FastPaxos-inspired protocol. For transaction execution, OV maintains visibility watermarks, such that any transaction still undergoing replication is not yet visible, while all transactions below the watermark are visible. The protocol computes the watermark via a hierarchical gossip protocol by taking the minimum watermark across regions. The regional watermark, in its turn, is the minimum of the server watermarks in the region, and a server watermark is the minimal timestamp of an ongoing transaction on that server. A few additional optimizations exist to allow reading from a single server, but these require an additional watermark to designate the full replication of a transaction.
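To make the hierarchy concrete, here is a tiny sketch of the watermark computation. The function names are mine, and the real protocol gossips these minima asynchronously instead of computing them in one place:

```python
def server_watermark(pending_tx_timestamps, local_clock):
    # The smallest timestamp still replicating on this server; if nothing is
    # in flight, the server's clock bounds future transaction timestamps.
    return min(pending_tx_timestamps, default=local_clock)

def region_watermark(server_watermarks):
    # A region's watermark is the minimum over its servers.
    return min(server_watermarks)

def visibility_watermark(region_watermarks):
    # Gossiped across regions: every transaction below this timestamp has
    # finished replication everywhere and is safe to make visible.
    return min(region_watermarks)

# Example: two regions, each with two servers.
r1 = region_watermark([server_watermark([17, 25], 30), server_watermark([], 30)])
r2 = region_watermark([server_watermark([21], 29), server_watermark([19], 29)])
print(visibility_watermark([r1, r2]))  # 17: transactions below 17 are visible
```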

Below you can find a presentation from our reading group by Balaji Arun


Discussion

Our discussion focused on a few points/questions:

(1) Why does the OV protocol use a FastPaxos-like algorithm for replication? FastPaxos requires the use of larger “super quorums”; however, OV replication is not client-driven, and a server specifically picks a unique timestamp for the replication. In FastPaxos, multiple commands may be tried on the same instance (i.e., timestamp), requiring a larger fast quorum so that phase-1 recovery of classic Paxos with a smaller majority quorum can still identify a potentially chosen command. However, in OV we do not see such conflicts: the timestamps for transactions are unique and are assigned by the one node that coordinates the replication. The only possible conflict we saw happening is when a transaction first tries to replicate the command and then issues an abort on the same timestamp. This arguably creates a write-write conflict on the same instance (timestamp), but we think it can be resolved by establishing a fixed precedence that makes aborts always win over regular writes. With the precedence order established, writing the abort to a majority of nodes should be sufficient to make the abort persistent and recoverable. We have not reached a definite conclusion on why OV uses larger fast quorums, so we may still be missing something in our understanding of the protocol.

(2) Another discussion point concerned the evaluation and comparison. The recent SLOG paper solves a similar problem, so we were wondering how it might behave compared to OV. The group’s conclusion was a definite “it depends.” From one perspective, OV is more decentralized, so it may be able to achieve higher throughput when there are many multi-partition transactions. SLOG is more centralized, having both a dedicated master per partition and a dedicated ordering layer for transactions involving many partitions. Both systems, however, do not abort transactions in most cases (unless there are significant failures), so their performance may be close to each other.

(3) Related to (2): is the comparison with TAPIR a fair one? The two protocols operate quite differently. We thought it would have been nice to see a comparison with Calvin and SLOG.

(4) Performance in a geo-sharded setup could suffer greatly. If we do not have all partitions/shards in every region, we will need to do cross-region replication. In the protocol, the visibility watermark gets updated only after replication is complete, and WAN replication will delay that process. Moreover, this may delay the visibility of other transactions that do not require geo-replication but have a timestamp after the geo-replicated transaction.

Review: Implementing Linearizability at Large Scale and Low Latency

In this post I will talk about the “Implementing Linearizability at Large Scale and Low Latency” SOSP 2015 paper.

Linearizability, the strongest form of consistency, can be very important in large-scale data storage systems, yet many such systems either do not implement linearizability or do not fully expose linearizable operations to the clients. The latter type of system may maintain linearizability for internal operations between servers but does not provide the same consistency to the clients.

The authors of the paper provide a linearizability framework, called RIFL, suitable for use in existing non-linearizable RPC-based distributed systems. The framework allows converting existing RPCs into linearizable ones in just a few lines of code and with minimal impact on overall performance. The paper only discusses RPC-based systems since, according to the paper, linearizability requires a request-response protocol to operate. I think it may be possible to use a RIFL-like system for message-passing approaches, as long as receiving each message eventually produces an ack to the sender.

Linearizability

In order to better understand RIFL and how it benefits a data-store system, we need to talk about linearizability. According to the paper, linearizable operations appear to happen instantaneously and only once at some point in the execution of a system. It is important to understand that in a real system an operation can take some time to execute and can potentially fail midway through its execution. A linearizable system must make it appear to all its clients as if the operation happened right away. The ability to execute operations only once is another important point, as many existing systems retry the execution of an operation upon failure. The authors say such operations follow at-least-once semantics, whereas linearizable operations have exactly-once semantics. In order to achieve certain consistency guarantees, many existing systems use idempotent operations, which produce the same outcome regardless of how many times they have been executed. The authors show an example in which running such an operation more than once can break linearizability after a certain failure.

Example of at-least-once semantics breaking linearizability.

In this example we have two clients interacting with a single server. Client B writes 2 to the server, but it crashes before the server has a chance to respond. When client A reads the data, it will get the value written by B. Later, client A can write a different value while client B is recovering from the failure. Once client B is back up, it does not know that its previous operation has succeeded, so it retries it and overwrites the later value written by A. The authors do not mention how likely such an example is to occur in practice, but given a large-scale system with thousands or even millions of clients, it would be unwise to discount the possibility of such a failure. Nothing is mentioned about whether any of the existing data-store systems address the issue.

RIFL

The RIFL framework allows the conversion of a system relying on at-least-once RPC operations into one with linearizable, exactly-once operations. The main idea behind RIFL is storing the results of RPC execution, so that in case of a retry of an RPC call the system can use the already-known result without having to re-run the procedure. The results of an RPC invocation are stored in a completion record, and each such record is associated with a unique RPC. This ensures the exactly-once operation of the RPCs in the system, but it also opens up a number of problems that had to be solved.


High-level representation of RIFL logic

In order to operate properly, the system must be able to detect retried calls. To make such detection easy, each RPC is assigned a 128-bit ID consisting of a 64-bit client ID and a 64-bit sequence number at that client. If an operation is retried by the client, it must use the same ID. Before executing an RPC, the server checks whether it has a completion record for that RPC: if no record exists, the RPC proceeds, but if a completion record is present, the server returns the results stored in the completion record instead of running the RPC.
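A minimal sketch of this retry check, with made-up names; in the real RIFL, the completion record is made durable together with the object before the reply is sent:

```python
class ExactlyOnceServer:
    def __init__(self):
        self.completion_records = {}  # (client_id, seq) -> saved result

    def handle_rpc(self, client_id, seq, operation):
        rpc_id = (client_id, seq)     # the 128-bit ID: 64-bit client + 64-bit seq
        if rpc_id in self.completion_records:
            return self.completion_records[rpc_id]  # retry: replay the result
        result = operation()                         # first execution
        self.completion_records[rpc_id] = result     # made durable in real RIFL
        return result
```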

Migrating completion records is essential in the event of a failure, as the system relies on the presence of such records to decide whether an RPC needs to run. From time to time, data can migrate from one server to another, especially in case of a server crash. The new server must have the completion records available to it after the migration, so each completion record is attached to one of the data objects modified by its RPC, so that moving the object also moves all the completion records for RPCs acting on that object. Unfortunately, the authors do not explain in detail how the migration is performed, as this part is probably left to the underlying system. It is very likely that completion records also get replicated with the objects they belong to for durability reasons, although no mechanism for such replication is described either, so it is fair to assume that completion record replication is also left to the underlying system.

Over time, many completion records will be created for each object, increasing the storage requirements and the bandwidth used for replication and migration of objects. In order to improve resource utilization, a garbage collection mechanism for old completion records was devised. In RIFL, a completion record can be removed from the system once a client acknowledges that it knows of the successful RPC execution and will not retry it in the future. Such acknowledgments are piggybacked on new RPC requests and, as a result, incur minimal overhead. In case of a client failure, no acknowledgment will be sent to the server, causing certain completion records to persist. To deal with this problem, RIFL uses a lease manager to grant leases to all the clients. If a client’s lease is not renewed, all completion records for that client are purged. It is not clear how a centralized lease system can impact the overall performance of a system implementing RIFL. Time synchronization between the client, the lease manager, and a server is used to reduce the need to communicate: the server contacts the lease manager only when it estimates that the client’s lease will expire soon. This portion of the lease protocol raises some questions about the reliability of the lease sub-system. What is going to happen if the time skew is greater than what the server estimates for? If the server’s time is ahead of the lease manager’s time, the server will start issuing more check requests to the lease manager, but if it is lagging behind the lease manager’s time, then the server may think a lease is still good while the client may have already died. I think the worst-case scenario is that GC does not collect all dead completion records, which may not be a big immediate problem, but may eventually lead to excess memory consumption by the server applications.
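Continuing the sketch above, the piggybacked acknowledgment might look as follows. I am assuming the client reports the smallest sequence number it may still retry, which is my simplification rather than RIFL’s exact wire format:

```python
def handle_rpc_with_ack(server, client_id, seq, operation, first_incomplete):
    # The client promises never to retry anything below first_incomplete,
    # so the matching completion records can be garbage-collected.
    stale = [r for r in server.completion_records
             if r[0] == client_id and r[1] < first_incomplete]
    for rpc_id in stale:
        del server.completion_records[rpc_id]
    return server.handle_rpc(client_id, seq, operation)
```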

Transactions with RIFL

The authors implemented a transaction system using RIFL for linearizability on top of RAMCloud, a distributed in-memory key-value datastore. A two-phase commit protocol similar to Sinfonia is used to implement transactions. In the first phase, usually called the prepare phase, a set of read, write, or delete commands is sent to the servers, and each server, upon receiving the prepare, determines whether it can proceed with the commit. If all servers can commit, then the second phase finalizes the transaction. RIFL makes crash recovery simpler compared to Sinfonia: since each prepare operation is linearizable, retries of the prepare will not cause any adverse effects. Upon a more serious crash, a recovery manager can learn the results of the prepare operations without knowing the original commit commands, and if all prepares have succeeded, it can finalize the transaction; in case of some prepare failures, the transaction is simply aborted.
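A rough sketch of this prepare/decide flow, heavily simplified and with made-up names; the real protocol also makes the votes and the decision durable:

```python
def run_transaction(participants):
    """participants: list of (server, ops) pairs. Each prepare is issued as a
    linearizable (exactly-once) RPC, so retrying a timed-out prepare simply
    replays the recorded vote instead of re-executing it."""
    votes = [server.prepare(ops) for server, ops in participants]
    if all(votes):
        for server, ops in participants:
            server.commit(ops)        # phase two: finalize the writes
        return "committed"
    for server, _ in participants:
        server.abort()                # some vote was "no": roll back
    return "aborted"
```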

One important point the authors make in the paper is about the traditional way of implementing linearizability in datastores and how it differs from their implementation. In existing systems, linearizability is implemented on top of a transaction system, and according to the authors, this approach creates a more cumbersome transaction mechanism. With RIFL, transactions are implemented on top of a linearizability layer, which the authors claim is a better approach.


Evaluation

RIFL was implemented and evaluated in RAMCloud. Overall, the authors report only about a 5% latency penalty for RIFL linearizable write RPCs compared to the original writes. No significant difference in throughput was observed when using RIFL.


Added overhead of RIFL to the RAMCloud system. Left: latency; right: throughput.

Transaction performance was evaluated with the TPC-C benchmark, typically used for performance evaluation of Online Transaction Processing (OLTP) systems. RIFL RAMCloud was compared against the H-Store database. Both systems are in-memory databases, but they differ significantly in their purpose and typical use cases. As a matter of fact, the RIFL RAMCloud solution had to be specifically implemented for the TPC-C benchmark. When comparing the two systems, the authors found that RAMCloud with RIFL significantly outperforms H-Store in all tests. I am a bit skeptical about these results, at least without more knowledge about how RAMCloud was made to work with the TPC-C benchmark and whether the RIFL & RAMCloud interface for TPC-C was specifically tailored to the tests performed by the benchmark. It may have been a good idea to compare the system against other transaction protocols implemented in RAMCloud, such as ones based on consensus.

Overall Thoughts

When reading the paper, I thought that the idea of caching the results of RPC calls was very straightforward and simple, and I am surprised it has not been exploited before. Yes, storing such a cache presents a few challenges, mainly in the memory management of the overall system, as the cache can grow large, but as shown in RIFL, these are not very big challenges and can be solved with simple protocols and existing tools, such as ZooKeeper. The authors claim that implementing transactions on top of a linearizability layer is a better and faster approach. The implementation of transactions (mini-transactions?) became easier with RIFL, but I am not sure the performance benefit is obvious. In my opinion, the performance comparison with H-Store seems somewhat unfair.