In the 68th reading group session, we discussed scheduling in dataflow-like systems with Cameo. The paper, titled “Move Fast and Meet Deadlines: Fine-grained Real-time Stream Processing with Cameo,” appeared at NSDI’21.
This paper discusses some scheduling issues in data processing pipelines. When a system answers a query, it breaks the query into several steps or operators, such as groupBys, aggregations, and various other processing steps. Each of these operators executes on some underlying serverless actor (in this work, the underlying platform is Orleans), so naturally, the operators must be scheduled to run on some actors when the actors become available.
Scheduling is easy when we always have idle actors/workers that can pick up news tasks. But this can lead to overprovisioning of resources and becomes expensive for all but the most latency-critical applications. As such, resource-shared environments that allow processing multiple queries from different tenants are more cost-efficient and practical. Of course, the problem with shared environments is keeping all tenants happy and isolate them from each other. This means that scheduling needs to be aware of SLAs a system must provide to tenants and try not to blow past the deadline. Cameo is a dataflow processing system that aims for fine-grained scheduling of tasks in the system to make sure all queries, consisting of many tasks, finish within some deadline or SLA guarantee.
Keeping a deadline is tricky though. One complication arises from queries that require more than one operation to compute. Some of these query operations may be dependant on each other and need to happen in sequence. As a result, the query deadline must be broken down into sub-deadlines for all the sequential execution steps. Instead of enforcing the execution deadlines, Came enforces a start deadline for each task. This deadline is the latest time a task must begin to compute, such that it will not impact the downstream tasks and overall query deadline.
Things can get more complicated from here. If one step takes longer to compute than anticipated, then the start deadlines of downstream tasks/steps may be violated. The system, however, tries to adjust/reprioritize the consecutive steps to accelerate their execution and hopefully still meet the overall query deadline/SLA. But there is more. See, if the scheduler made a mistake once and misjudged the execution time of a task, this can happen for the same query again. To counter this problem, Cameo propagates the task execution information to the scheduler so that it can take better actions next time. This feedback mechanism allows the scheduler to learn from past executions of a query/task and improve in the future.
Now about the scheduler. It has two major components – the scheduling mechanism and the scheduling policy. The scheduling mechanism executes the scheduling policies, and the two parts are rather separate. Such separation allows the scheduling mechanism to take tasks with high priority (sooner start deadline) regardless of how this priority was computed by the policy. The scheduling policy component is pluggable and can be changed for different applications, as long as it creates the priorities that the execution component understands. This part must follow Cameo API to allow rescheduling of downstream/consecutive tasks and learning the stats of step execution to adjust in the future.
The authors claim significant latency improvements of up to 4.6 times when using Cameo.
As always, we had a presentation in our reading group. The video presentation by Ajit Yagaty is available on YouTube:
Discussion
1) Spark Streaming. The scenarios presented in the paper operate on dynamic data sets that continuously ingest new data. This is similar to other streaming dataflow solutions, such as Spark Streaming, for example. The paper does not provide a direct comparison in evaluation but claims that Cameo has a much more fine-grained scheduling at the per-task level.
2) Fault Tolerance. One natural question in the reading group was around fault tolerance. The paper does not touch on this topic at all. Since the system is built using the Orleans framework, we suspect that most fault tolerance is taken care of by the framework.
3) Spiky Workloads. The feedback mechanism used to learn how long tasks may take and adjust the scheduling accordingly works great in a streaming system when we have a steady inflow of data and the same queries that process this new data. Adding a new query will likely create some problems (at least for that query) until the system learns the stats of all the tasks this query uses. Similarly, having a big spike in data ingestion can cause some mayhem. If a query operates on the premise that some task needs to process n events/messages, but receives 10n, its execution time may become too large and cause the downstream deadlines to violate. And such momentary spikes in a steady flow of new data may happen rather often — all that is needed is some transient network glitch that delayed some messages causing a spikey delivery moments later.
4) Query Optimizers? We are not query optimization experts here in the reding group, but we wonder if using some query optimization techniques can help here. So instead of learning on historical stats to predict how long different steps of a query may take, can we be smarter and optimize based on the actual data at hand? A simple and naive thing to do for a previous example in (3) is to account for data cardinalities when computing the start deadlines so that a spike in data ingestion can be noted and dealt with immediately.
Reading Group
Our reading group takes place over Zoom every Thursday at 1:00 pm EST. We have a slack group where we post papers, hold offline discussions, and most importantly manage Zoom invites to paper presentations. Please join the slack group to get involved!