I lead a cloud computing group at the University of New Hampshire, where we focus on the performance, reliability, and efficiency of distributed systems in the cloud. We are a diverse group of undergraduate, master's, and PhD students.
Current Projects and Research Directions
While we broadly focus on the performance, reliability, and efficiency of distributed systems, and do not shy away from exploring a variety of research in this space, our lab concentrates on the following main focus areas:
Metastable Failures in Distributed Systems
Metastable failures refer to a class of catastrophic system failures that cause a permanent, self-sustaining overload of the impacted system. The distinguishing characteristics of metastable failures are an initial trigger that temporarily overloads the system and a sustaining effect that kicks in because of that overload and keeps the system overloaded even after the initial trigger is fixed. Once in this permanently overloaded state, called the metastable failure state, the system is perpetually busy but unable to complete any useful work until drastic manual measures, such as restarting the system, are taken. Metastable failures have led to several prominent cloud outages in recent years.
While these failures are relatively common in large systems, they manifest differently in each instance, making them difficult to identify and mitigate. For instance, many different triggers, such as server and network failures, unexpected workloads, and bugs, may cause the initial overload. Similarly, diverse mechanisms may create sustaining effects, ranging from retry behaviors to unoptimized and rarely exercised execution paths. Our lab studies metastable failures and works toward reducing their likelihood.
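To make the trigger and sustaining effect concrete, here is a minimal simulation sketch of the retry-driven case. All numbers are made up for illustration, and the model is not taken from any particular production system: a fixed-capacity server drains a FIFO queue, clients retry requests that take too long, and the server wastes capacity on requests whose clients have already given up.

```python
# Minimal, hypothetical sketch: client retries as a sustaining effect.
# The server drains a FIFO queue at a fixed capacity; requests that sit in the
# queue longer than TIMEOUT ticks are retried once by their clients, yet the
# server still spends capacity processing the stale originals. A large enough
# trigger pushes the queue past the timeout horizon, after which wasted work
# plus retries keep goodput collapsed even though the steady-state load is
# well below capacity.

from collections import deque

CAPACITY = 100           # requests the server can process per tick
BASE_LOAD = 80           # steady new requests per tick (below capacity)
TIMEOUT = 3              # ticks a client waits before retrying
TRIGGER = {10: 400}      # a one-off burst of extra requests at tick 10

queue = deque()          # holds the tick at which each request was enqueued
for tick in range(40):
    # Requests that just exceeded the timeout generate one retry each.
    retries = sum(1 for t in queue if tick - t == TIMEOUT + 1)
    arrivals = BASE_LOAD + TRIGGER.get(tick, 0) + retries
    queue.extend([tick] * arrivals)

    # The server blindly processes up to CAPACITY requests; stale ones are wasted.
    goodput = 0
    for _ in range(min(CAPACITY, len(queue))):
        enqueued_at = queue.popleft()
        if tick - enqueued_at <= TIMEOUT:
            goodput += 1
    print(f"tick {tick:2d}: queue={len(queue):4d} goodput={goodput:3d}")

# With a smaller burst, e.g. TRIGGER = {10: 100}, the queue drains and goodput
# recovers on its own; with the burst of 400 the system stays overloaded until
# someone intervenes, even though the trigger itself lasted a single tick.
```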
Hardware-Accelerated Distributed Systems
Many distributed systems do not take advantage of the advanced hardware features present in modern commodity hardware. In this project, we aim to use these features and built-in accelerators to create efficient data pipelines that combine networking, checksumming, compression, encryption, and other data handling while largely bypassing general-purpose CPUs.
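As a point of reference, the sketch below shows the kind of CPU-bound software pipeline such accelerators would offload. The chunk size and stages are illustrative only, and the code uses Python's standard zlib for checksumming and compression rather than any accelerator API.

```python
# Hypothetical baseline sketch: every stage of the data pipeline runs on the
# general-purpose CPU. Offload engines for CRC, compression, or crypto would
# perform these stages without burning CPU cycles.

import zlib

CHUNK_SIZE = 64 * 1024  # illustrative chunk size


def software_pipeline(payload: bytes):
    """Yield (crc, compressed_chunk) pairs, all computed in software."""
    for offset in range(0, len(payload), CHUNK_SIZE):
        chunk = payload[offset:offset + CHUNK_SIZE]
        crc = zlib.crc32(chunk)              # checksumming stage
        compressed = zlib.compress(chunk)    # compression stage
        yield crc, compressed                # hand off to the networking stage


# Example: push 1 MiB of zero bytes through the software pipeline.
total = sum(len(c) for _, c in software_pipeline(bytes(1024 * 1024)))
print(f"compressed to {total} bytes on the CPU")
```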
Efficiency of Replicated Systems
At their core, many modern data-intensive systems rely on consensus and replication protocols to provide strong consistency and to ensure reliability and durability. Although there has been much work on scaling and improving these protocols, most of the effort has focused on raising absolute performance with no regard for resource efficiency. Consequently, resource utilization and overall throughput have gone up, while the amount of work produced per unit of resource has decreased.
The current approach of increasing resource usage to boost throughput works well in environments with dedicated nodes that have plenty of idle resources. Increasingly, however, NewSQL systems are deployed in the cloud, where resource sharing through task-packing is a common strategy for improving utilization, and where tenants pay for individual resources, such as compute, network bandwidth, and storage, based on demand and consumption. As we move toward a world where computing is a utility offered by a handful of cloud providers, resource efficiency becomes paramount for the economically viable operation of computing services. In this project, we study the resource efficiency of the current generation of replicated systems and use the acquired knowledge to develop new replication protocols that increase performance and reduce cost through higher resource efficiency.
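As an illustration of what we mean by resource efficiency, the sketch below computes work produced per unit of each billed resource for two hypothetical runs. All numbers are invented, and the metric names are ours rather than from any benchmark suite.

```python
# Hypothetical sketch of a per-resource efficiency metric: useful operations
# delivered per unit of each billed resource, rather than raw throughput alone.

from dataclasses import dataclass


@dataclass
class RunStats:
    ops_completed: int      # client operations committed during the run
    cpu_seconds: float      # total CPU time consumed across all replicas
    network_bytes: int      # total replication traffic across all replicas


def efficiency(stats: RunStats) -> dict:
    """Work produced per unit of each resource."""
    return {
        "ops_per_cpu_second": stats.ops_completed / stats.cpu_seconds,
        "ops_per_gib_network": stats.ops_completed / (stats.network_bytes / 2**30),
    }


# Two made-up runs: the second has higher absolute throughput but burns far
# more of every resource per committed operation, i.e., it is less efficient.
baseline = RunStats(ops_completed=1_000_000, cpu_seconds=400, network_bytes=8 * 2**30)
scaled   = RunStats(ops_completed=1_500_000, cpu_seconds=900, network_bytes=20 * 2**30)
print(efficiency(baseline))
print(efficiency(scaled))
```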
Data and Access Locality for Geo-distributed Databases
Consensus-based strongly consistent distributed storage systems serve as a backbone for many large cloud applications and services. Such storage systems are desirable due to their fault-tolerance, durability, consistency, and programmability properties. Unfortunately, strong consistency becomes more expensive to achieve as the geography of systems grows. For this reason, many current state-of-the-art storage solutions limit their available geography when operating in strongly consistent mode. In this project, we design a new generation of fast, strongly consistent distributed storage systems for planetary scale. Our solutions rely on a two-pronged approach, combining adaptability to workload access-locality and core algorithmic improvements to communication and replication. We plan to make access-locality a “first-class” concept in our systems to allow tighter and more efficient integration with replication, transactions, and load-balancing components. We also make algorithmic improvements to enable more efficient communication and payload replication in larger systems with many nodes distributed over the cloud and edge.
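As a toy illustration of treating access locality as a first-class signal, the sketch below places an object's leader in the region that minimizes access-weighted client latency. The region names, latencies, and placement policy are hypothetical and greatly simplified compared to what a real replicated system would need.

```python
# Hypothetical sketch: pick a leader region for an object based on where its
# recent requests came from. Latencies (in milliseconds) are made up.

from collections import Counter

REGIONS = ["us-east", "eu-west", "ap-south"]
LATENCY = {
    ("us-east", "us-east"): 1,   ("us-east", "eu-west"): 75,  ("us-east", "ap-south"): 190,
    ("eu-west", "us-east"): 75,  ("eu-west", "eu-west"): 1,   ("eu-west", "ap-south"): 120,
    ("ap-south", "us-east"): 190, ("ap-south", "eu-west"): 120, ("ap-south", "ap-south"): 1,
}


def best_leader(access_log: list[str]) -> str:
    """Return the region that minimizes access-weighted latency to clients."""
    weights = Counter(access_log)

    def expected_latency(leader: str) -> float:
        return sum(count * LATENCY[(client, leader)] for client, count in weights.items())

    return min(REGIONS, key=expected_latency)


# An object accessed mostly from Europe should be led from eu-west.
print(best_leader(["eu-west"] * 90 + ["us-east"] * 8 + ["ap-south"] * 2))
```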
Students
Current Students
- PhD
  - Owen Hilyard
  - Alakbar Askarov
  - Catrina Janos
- Master's
  - Adam Hassick
  - Lusha Zhang
Graduated
- Hannah Marsh – BS’24, now PhD @ Tufts
- Owen Hilyard – BS’23
- Marielle Webster – MS’23
- Jacob Berg – MS’23
- Joshua Guarnieri – MS’21
I am currently looking for students interested in distributed and edge systems, large-scale distributed databases and data stores, and the fault tolerance and reliability of systems. Please send your CV to aleksey.charapko@unh.edu.