Local Algorithms for Robust Disconnected Systems (LARDS)

(in collaboration with Radim Bartoš)

Project outline

Propagation pattern
Propagation of information from the center of a 31×31 grid using the UrUlUlUr meeting pattern.

We consider systems that are naturally (and often unavoidably) distributed. The participants in these systems are varied (sensors and actuators; mobile robots; cellphones; desktop, portable and wearable computers; etc.) and we refer to them by the generic term agents. These agents are deployed for a purpose and need to coordinate in order to achieve their tasks. These tasks usually involve computing, but not to the exclusion of other duties such as sensing, actuation, motion and interaction with other agents. We refer to the specification of the agents’ tasks as their mission.

This mission often demands that agents react to their environment safely and quickly in spite of large-scale deployment, uncertainty, dynamic changes and limited resources. This is best achieved if agents are designed in a structured but not rigid way. Structure is necessary to control engineering complexity, but many assumptions of traditional distributed models (static topologies, known sets of participants, low failure rates, unlimited power, high bandwidth, end-to-end routing, etc.) are too rigid for these systems. In particular, we expect agents to be able to react locally in those scenarios where coordination over the entire network or even communication with a central authority is impossible. In other words, in addition to being physically distributed, the systems we consider are inherently decentralized.

Coordination among agents requires communication. Our agents, however, tend to inhabit a world in which communication can be difficult, for instance because of mobility, wireless or underwater transmissions or power limitations. One approach to this challenge, best exemplified by the work on Delay Tolerant Networks (DTNs), is to minimize the necessary changes to distributed algorithms by presenting them with a networking abstraction as close as possible to what they were designed for. The difficulty, of course, is to achieve such well-behaved networks in spite of environmental challenges, which is not always possible in a strict sense (hence the “Delay” in DTN). The main benefit of this approach is that designers can rely on familiar frameworks (possibly with unusual parameters, like high latency) when implementing missions.

Recently, a different strategy has started to emerge. It hinges on the idea of forgoing standard networking abstractions such as routing and end-to-end connections, which are ill-suited to the new environments in which agents tend to evolve, in favor of new computation models. In contrast to DTNs, this is a radical step in which the baby is thrown out with the bath water: it requires a complete redesign of distributed algorithms and applications. Our project belongs to this school of thought and, to emphasize the fact that agents still form cooperating networks despite the absence of a standard communication network, we refer to such systems using the slightly oxymoronic term disconnected networks.

Propagation delay
Lengths of shortest paths to the top edge (UrUrUrUr meeting pattern). In this experiment, 42% of the nodes (in black) have failed and another 9% (in yellow) are isolated. Lighter shades of blue indicate longer shortest paths.

To illustrate the concepts of disconnected networks, consider the population protocols model of Angluin et al. (or our own somewhat related self-similar algorithms). These are computation models in which agents nondeterministically interact with each other but have no control over these interactions. Computations proceed opportunistically when local groups of agents are formed and perform steps independently from all the other agents in the system. This is a radical departure from classic message-based computing, where an agent A sends a message to a specific agent B. Here, agent A must work with whatever agents are currently reachable (and possibly use them as intermediaries if some information that originates on a specific agent B is needed). Communication evolves from a service that is more or less always available into a resource that needs to be used efficiently and opportunistically. (Note, however, that standard distributed algorithms and communication mechanisms are still needed to implement local group steps.)
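As a concrete (if minimal) illustration of this style of computation, the sketch below simulates the classic leader-election population protocol: every agent starts as a leader, and whenever two leaders meet, one demotes itself to follower. The random pairing here stands in for a fair scheduler; the code is our own toy example, not an artifact of the project.

```python
import random

def leader_election(n, seed=0):
    """Classic leader-election population protocol: all agents start as
    leaders ('L'); the transition rule (L, L) -> (L, F) demotes one of
    two meeting leaders. The scheduler picks interacting pairs uniformly
    at random; the agents have no control over these interactions."""
    rng = random.Random(seed)
    states = ['L'] * n
    steps = 0
    while states.count('L') > 1:
        i, j = rng.sample(range(n), 2)   # scheduler picks a pair to interact
        if states[i] == 'L' and states[j] == 'L':
            states[j] = 'F'              # one leader survives the encounter
        steps += 1
    return steps

print(leader_election(100))              # interactions until a unique leader
```

Note that no agent ever addresses a specific peer: progress comes entirely from whatever encounters the environment happens to provide.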

Population protocols and self-similar algorithms investigate such group-based computations under a minimalist set of assumptions (basically, that the formation of groups is in the hands of an adversarial environment but that there are enough encounters of agents for the system to make progress). Such extreme assumptions are suitable for an analysis of computability questions and worst-case performance of algorithms, but in reality agents have some degree of control over group formation, and how they use it can have a huge impact on the performance and reliability of systems. The core of our project is the study of how agents should employ their group-forming capabilities to create the interaction opportunities needed by their algorithms and to benefit the implementation of their mission. In other words, we seek strategies that will result in efficient implementations of emerging group-based computation models.

There are many facets to this problem depending on the type of agents and the characteristics of their missions. For instance, agents can be mobile and autonomous and form groups based on distances and locations. Or they can be static but intermittently powered and use sleep/wake times as their group organizing principle. In either case, there can be tensions between group formation and mission parameters. For instance, it may be desirable for a mobile agent A to move to location x for the purpose of joining a group, but A must also visit a different location y according to its scanning and sensing mission. In the same way, the tradeoff between long wake periods (more groups, better agent coordination) and short wake periods (better network longevity) can be studied. It turns out that there are more subtle parameters than these obvious tradeoffs. For instance, it can be shown that, without changing group locations and visit rates (or wake/sleep durations), performance can be greatly improved by carefully choosing the order in which agents participate in their different groups, using limited knowledge of the agents’ mission.
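The effect of meeting order can be seen even in a deliberately tiny setting. In the sketch below (our own toy example, unrelated to the project’s actual grid patterns), agents sit on a line and adjacent pairs meet once per round; sweeping the meetings left to right carries a token across the whole line in a single round, while the reverse order moves it only one hop per round:

```python
def propagate(order, n=10):
    """Toy experiment: n agents on a line, agent 0 holds a token.
    Each round, adjacent pairs (i, i+1) meet in the given order and
    merge their knowledge. Returns the number of rounds until every
    agent knows the token."""
    knows = [False] * n
    knows[0] = True
    rounds = 0
    while not all(knows):
        for i in order(n):                 # meeting order within the round
            if knows[i] or knows[i + 1]:   # a meeting shares all knowledge
                knows[i] = knows[i + 1] = True
        rounds += 1
    return rounds

left_to_right = lambda n: range(n - 1)
right_to_left = lambda n: range(n - 2, -1, -1)
print(propagate(left_to_right))   # token crosses the line in 1 round
print(propagate(right_to_left))   # same meetings, 9 rounds
```

The two schedules use exactly the same meetings at the same rate; only their order differs, yet propagation time changes from constant to linear in the number of agents.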

The main reason that we depart from traditional computing models for the systems under consideration is the undesirability (and often impossibility) of establishing and maintaining fully connected networks of agents at all times. Notwithstanding environmental challenges (high latencies, high failure rates, motion, noise, interferences, short communication ranges, etc.) that need to be taken into account, we seek systems that can react locally with limited global coordination. Agents should form local groups based on the nature of the mission and the type of event they respond to.

We therefore assume that agents have some control over which groups to form, but in general they will not be able to follow an optimal schedule of meetings with other agents exactly. For this reason, it is important that any scheme used to organize meetings of agents be robust and able to withstand delays, failures and generally unfriendly environments. As part of our project, we study network reliability under various models of faults. We have also devised mechanisms that can be triggered to “repair” networks by recomposing group membership and meeting schedules dynamically. The key challenge in defining such repair mechanisms is that they often need to be coordinated, but the only way agents can coordinate is by using the current (and failing) group building strategy. Nevertheless, we show that simple (local, autonomous, loosely coordinated) strategies are capable of handling many network failures, including catastrophic scenarios.
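Experiments of the kind shown in the figure above can be approximated with a few lines of code. The sketch below (an illustrative stand-in for the project’s fault models, with parameters chosen by us) fails grid nodes independently at random and then measures, by breadth-first search, shortest-path lengths from the surviving nodes to the top edge; survivors that the search never reaches are isolated:

```python
import random
from collections import deque

def grid_reachability(n=31, fail_prob=0.4, seed=0):
    """On an n x n grid, fail each node independently with probability
    fail_prob. Surviving nodes communicate with surviving 4-neighbors.
    BFS from the (surviving) top-edge nodes yields shortest-path
    lengths; survivors never reached are isolated from the top edge."""
    rng = random.Random(seed)
    alive = {(r, c) for r in range(n) for c in range(n)
             if rng.random() >= fail_prob}
    dist = {v: 0 for v in alive if v[0] == 0}    # top-edge sources
    queue = deque(dist)
    while queue:
        r, c = queue.popleft()
        for v in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if v in alive and v not in dist:
                dist[v] = dist[(r, c)] + 1
                queue.append(v)
    isolated = len(alive) - len(dist)
    return len(alive), isolated, dist

survivors, isolated, dist = grid_reachability()
print(f"{survivors} survivors, {isolated} isolated, "
      f"longest shortest path {max(dist.values(), default=0)}")
```

Varying the failure probability in such a model gives a first quantitative feel for how much damage a meeting pattern can absorb before repair mechanisms become necessary.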

These examples illustrate how grouping and scheduling strategies can be evaluated in terms of performance and robustness. In addition to these intrinsic properties, strategies can be compared in relation to specific algorithms and the computation models in which they are expressed. We plan to extend the embryonic models of population protocols and self-similar algorithms into a full-fledged model capable of expressing distributed solutions to all the aspects of a mission. This will allow us to compare the performance and the robustness of a particular algorithm under different group-forming strategies. In contrast to a classic measure based on the number of messages exchanged and their sizes, this comparison makes use of a measure of complexity that combines the number of group steps and the complexity of each individual step.

Disconnected networks of agents can be quite complex because computation, motion, sensing, communication, actuation and mission specification are all intertwined, which makes their analytical study a difficult task. In order to get some insight into the influence of various parameters, we have implemented a software simulator that lets us implement various schedules, meeting policies and repair strategies and allows us to visualize key aspects of the system such as mission-related activities and information propagation. This simulator facilitates a research process in which we formulate an assumption, design a corresponding experiment, simulate the system behavior to validate (or invalidate) our assumption, and finally proceed with a rigorous explanation of the observed behavior.
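The skeleton below suggests the shape such a simulator can take; it is a sketch of our own (all names and interfaces are illustrative, not the actual tool). A meeting policy decides which groups form in each round, and a step function updates the joint state of one group; separating the two makes it easy to swap in different schedules and repair strategies:

```python
import random

class Simulator:
    """Minimal round-based simulation skeleton. A policy maps a round
    number to an iterable of groups (tuples of agent ids); a step
    function maps the list of group-member states to their new states."""

    def __init__(self, states, policy, step):
        self.states = states        # agent id -> local state
        self.policy = policy
        self.step = step
        self.round = 0

    def run(self, until):
        """Advance rounds until the predicate on the global state holds;
        return the number of rounds used."""
        while not until(self.states):
            self.round += 1
            for group in self.policy(self.round):
                new = self.step([self.states[a] for a in group])
                for a, s in zip(group, new):
                    self.states[a] = s
        return self.round

# Example: gossip of a single token among 6 agents meeting in random pairs.
rng = random.Random(0)
ids = list(range(6))

def random_pairs(_round):
    rng.shuffle(ids)
    return list(zip(ids[::2], ids[1::2]))

def merge(group_states):
    token = any(group_states)       # a meeting shares the token
    return [token] * len(group_states)

sim = Simulator({a: a == 0 for a in range(6)}, random_pairs, merge)
print("rounds:", sim.run(lambda s: all(s.values())))
```

Instrumenting `run` with per-round snapshots is then enough to visualize information propagation under different policies, in the spirit of the figures above.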

Altogether, this project focuses on an emerging class of distributed systems for which conventional algorithms—and the networking abstractions that support them—are inadequate. We explore new computational models and devise suitable strategies to implement them reliably. The ultimate objective of this research is to simplify the software design of these systems and to improve their robustness.