Why are distributed systems so hard to design? Well, one challenge is getting them to make decisions. In a centralized model, you can write up a set of rules, hand those rules to a trusted authority, and just let it start arbitrating decisions for you. In a distributed model, you have to weave those rules into the very fabric of the system itself.
Access control is a very common need for all kinds of applications. To be practical, many applications will not only want to restrict certain actions to people who have permission, but they'll also want some way for people to delegate their permissions granularly. In a centralized system, implementing access control is pretty easy: All actions go through a central authority, which at any given time knows the full state of the system. Whenever someone tries to take an action, the authority checks whether the current state allows it, and can immediately decide whether to accept it or reject it. Only if it's accepted does it change the system's state.
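To make that concrete, here is a minimal sketch of the centralized flow in TypeScript. All of the names and types are mine and purely illustrative; the point is just that a single authority checks each action against the one true current state before letting it take effect.

```typescript
// Hypothetical sketch of centralized access control: one authority, one state.
type UserId = string;
type Action = string;

interface Event {
  author: UserId;
  action: Action;
}

interface State {
  permissions: Map<UserId, Set<Action>>;
  log: Event[];
}

// The authority knows the full, current state, so this check is definitive.
function authorize(state: State, event: Event): boolean {
  return state.permissions.get(event.author)?.has(event.action) ?? false;
}

function submit(state: State, event: Event): boolean {
  if (!authorize(state, event)) return false; // rejected: the state is untouched
  state.log.push(event);                      // accepted: only now does the state change
  return true;
}
```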
In a distributed system, access control gets more complicated. Some systems use network-wide consensus so that they can come to a more or less "final" conclusion about whether an event actually changes the state, and how it is ordered with respect to other events. This is how you get blockchains. It works, I guess, but you have to trade off some amount of decentralization or scalability in the process.
Personally, I'm much more interested in exploring systems without consensus, in which nothing is final and the state is eventually consistent. This is the design space in which we find collaborative, local-first documents.
But how can you possibly authorize anything when the system doesn't even guarantee a consistent view of who has what permissions? Whatever the answer, it's going to require a healthy dose of optimism and a touch of magic.
Now, Matrix has already explored this same problem space. Matrix is organized into rooms (chat rooms), with the main data structure of interest here being room state: a flat, replicated key-value store which holds not just "cosmetic" data like the room's name and topic, but also information governing the authorization of changes to the room state itself. The protocol then defines a conflict resolution algorithm which deterministically merges divergent room states together. The specifics of this algorithm are critical to Matrix's eventual consistency and access control guarantees.
There are a few things I find unsatisfying about its design, though. In particular:
Matrix requires there to be a total order on power levels, rather than just a partial order. This means that you can't give Alice permission to do X and Bob permission to do Y without giving at least one of them the ability to do both X and Y. Delegation of permissions isn't as granular as it could be.
The conflict resolution algorithm depends on room state being a flat key-value store. But what if another data structure is better suited to the application at hand? Ideally, we would have a generic way to add access control to any CRDT with arbitrary semantics.
Finally, it's not especially clear what access control guarantees Matrix actually provides. The conflict resolution algorithm ultimately reads as a (dauntingly complex!) procedure rather than a set of transparent semantic rules. Without a formal proof, it's hard to work out what properties it actually has.
My goal here is to show that a simple, principled, and generic approach to access control for local-first applications is within reach. Let's get going!
Take One
For a spell to be effective, you must first give it direction. What makes a good access control system? What does it mean for something to be authorized, when the very state that authorized it might yet shift out from under you?
Let's first review what it means for a centralized access control system to be correct. I think this can be summed up neatly into a single rule:
Principle of Authorization
An event e sent at time t changes the system's state only if the state at t authorizes e.
If we can just get a distributed system to satisfy this same principle, then we should be good, right? Well, there are two major obstacles to doing so.
The first is that this principle assumes the existence of an authoritative clock. In a distributed system there is no such thing; each peer has their own clock, and events can happen concurrently. We can overcome this by requiring the Principle of Authorization to hold for every (causally consistent) clock. In theory, this ensures that authorization decisions are fair to all peers and nobody can manipulate the sequence of events to their advantage.
The second challenge is that, since no part of the state is ever final, it's impossible to know whether the state you have for time t is really the full state for time t. However, if we can change our perspective a little, this isn't such a big problem. Whenever you use an eventually consistent system, the best the system can do is be optimistic that it has complete information. So we need to apply the same optimism to authorization: when a peer decides whether an event is authorized, they must base this on the assumption that they already know the full state of the system. If more information arrives later, they must be prepared to alter past authorization decisions—they can never be final!
In other words, as long as peers consider all authorization decisions tentative and ensure that they are consistent with any possible clock, we should expect to get a valid access control system.
Now it's time to give some force to our magic.
Let there be events. Linked to one another by sheer causality. Always growing downward, never cyclic.
Alone, they each hold an effect. Together, they give life to a document, forking and joining in a semilattice dance.
Many events are good. But some of them dare to overstep the rules. Tread carefully, we mustn't allow their effects to set in.
Somewhere a branch breaks free, carrying events with independent destinies. They threaten to circumvent others that would deauthorize them—but you have a plan. In your mind, an image forms: one event happening before the other, the possibilities collapsing to form a single strand. Summon all possible extensions, search them for answers. Do any of them block these events?
There—you catch the anomaly, return everything to its place. Fuse the branches into a single definitive state.
So, we can certainly fulfill the Principle of Authorization by checking every possible total ordering of the event graph. If we think of time as a partial order, we've essentially implemented this stronger guarantee:
Continuous Authorization
An event e sent at time t changes the system's state only if the state at every t' which is equal to or concurrent with t authorizes e.
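Concretely, the brute-force version of this might look like the sketch below. Nothing here comes from a real implementation: `authorize` and `apply` stand in for whatever application-specific rules and effects you have, and the event type is deliberately bare.

```typescript
// Brute-force sketch of Continuous Authorization: the target event takes effect
// only if it is authorized at its position in *every* linear extension of the
// causal order. Illustrative names; deps are assumed to reference ids in `events`.

interface Ev { id: string; deps: string[] } // deps: ids of causal predecessors

function* linearExtensions(pending: Ev[], prefix: Ev[] = []): Generator<Ev[]> {
  if (pending.length === 0) { yield prefix; return; }
  for (const e of pending) {
    // e may come next only once all of its causal predecessors are in the prefix
    if (!e.deps.every(d => prefix.some(p => p.id === d))) continue;
    yield* linearExtensions(pending.filter(x => x.id !== e.id), [...prefix, e]);
  }
}

function continuouslyAuthorized<S>(
  events: Ev[],
  target: Ev,
  initial: S,
  authorize: (state: S, e: Ev) => boolean,
  apply: (state: S, e: Ev) => S,
): boolean {
  for (const order of linearExtensions(events)) {
    let state = initial;
    for (const e of order) {
      const ok = authorize(state, e);
      if (e.id === target.id && !ok) return false; // some ordering rejects it
      if (ok) state = apply(state, e);             // only authorized events take effect
    }
  }
  return true;
}
```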
But is this efficient? Not at all. The number of ways to order n independent events grows factorially. Upholding access control guarantees with respect to real time is a noble goal, but sadly it's prohibitively expensive.
Hmm, what if we changed the rules? If we could somehow check each event against only a single state, authorization checks could be much more efficient. Check this out:
Eventual Authorization
An event e sent at time t changes the system's state only if the state at the latest t' which is equal to or concurrent with t authorizes e.
This is weaker than Continuous Authorization in a significant way: it allows an event that would normally have been unauthorized to later become authorized once the state has changed. Intuitively, this should be okay; if someone tries to send an event for which they lack permission, but then they later receive permission, they could just as well have sent the same event upon receiving permission.
Take Two
To enter this realm, we will need a protective charm. It's a simple one, wielded by the likes of Zermelo and Kripke, useful for guarding against any form of self-reference. This time, it reads:
No Cyclic Dependencies
Whether an event e is authorized must not depend on whether e actually changes the system's state.
You cross the threshold. On the surface, this world is very similar to the last one, but you know they work by different rules. Still, you can see that events in linear sequence behave familiarly, even here.
You blink, and new events appear. Perhaps they were there all along? One of them blocks its sibling's effects, the other allows them. Which one do you believe?
The last world would have stopped the event in its tracks, but this one is much more forgiving. Carve a seam into the figure: below it, the event. Above, everything that might have happened before it. This is the state that will judge the event.
But what is that state? It depends on whether the other events are authorized. Carve two more seams, each like the last. The states of these branches form a self-referential spiral. Look for shortcuts, simplifications. These are how you will escape.
Now you are ready. Dive into the spiral and seek the fixed point.
So this is another possible access control system. To summarize, it gives us these semantics:
An event e changes the state if and only if the state at {e' | ¬(e ≤ e')} authorizes it.
It should be easy enough to see that any access control system implementing this rule satisfies Eventual Authorization: the state at {e' | ¬(e ≤ e')} reflects every event that happened before e or concurrently with e, which is by definition the latest such state available. The main difficulty lies in how you reach a fixed point, since the authorization checks for concurrent events are mutually recursive. Lazy data structures are always a good option for computing fixed points.
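One way to write the rule down directly is as a single step: given a working guess about which other events are authorized, build the state at {e' | ¬(e ≤ e')} from the guessed-authorized events and judge e against it. The sketch below does exactly that, with names of my own choosing; the actual authorization assignment is then a fixed point of this step, which you might reach lazily (as the demo below does) or by iteration (as discussed later under Caveats).

```typescript
// Sketch of one step of the Eventual Authorization rule. Illustrative only.
interface Ev {
  id: string;
  deps: string[]; // ids of causal predecessors
}

// Is `a` causally at-or-after `b`, i.e. b ≤ a?
function atOrAfter(a: Ev, b: Ev, byId: Map<string, Ev>): boolean {
  if (a.id === b.id) return true;
  return a.deps.some(d => atOrAfter(byId.get(d)!, b, byId));
}

// Assuming `authorized` is our current guess about every other event, does the
// state at { e' | ¬(e ≤ e') } authorize e? Assumes `events` is listed in some
// causal (topological) order; for CRDT-style effects the result should not
// depend on how concurrent events are interleaved.
function authorizedGivenGuess<S>(
  e: Ev,
  events: Ev[],
  authorized: (x: Ev) => boolean,
  initial: S,
  authorize: (state: S, x: Ev) => boolean,
  apply: (state: S, x: Ev) => S,
): boolean {
  const byId = new Map(events.map(v => [v.id, v] as const));
  // Everything that did not happen at-or-after e, i.e. { e' | ¬(e ≤ e') } ...
  const judgingSet = events.filter(x => !atOrAfter(x, e, byId));
  // ...contributes to the judging state only if it is itself authorized.
  let state = initial;
  for (const x of judgingSet) {
    if (authorized(x)) state = apply(state, x);
  }
  return authorize(state, e);
}
```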
Demo: Do Not Erase
As a proof of concept, I've created a whiteboard application demonstrating some basic access control rules. The document state is simple enough that I was able to model it as a handful of conflict-free replicated sets, which I treat as a lazy data structure in order to perform authorization queries. Check out the video for details:
The code is available at rad:z2pFBc3GY68cDyCtWx8CsSWDXS7X.
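For a flavor of what "treating the sets as a lazy data structure" could mean, here is a hypothetical sketch of my own, not the demo's actual code: a membership query only walks the add/remove events for the element being asked about, and only asks whether those events are authorized, so just the slice of the event graph relevant to the query ever gets examined. The `isAuthorized` parameter is assumed to implement the Eventual Authorization check from earlier.

```typescript
// Hypothetical sketch of a lazily queried, authorization-aware replicated set.

interface SetEvent {
  id: string;
  deps: string[];                 // ids of causal predecessors
  op: "add" | "remove";
  element: string;
}

function makeSet(events: SetEvent[], isAuthorized: (e: SetEvent) => boolean) {
  const byId = new Map(events.map(e => [e.id, e] as const));

  // Is `a` causally at-or-after `b`?
  function atOrAfter(a: SetEvent, b: SetEvent): boolean {
    if (a.id === b.id) return true;
    return a.deps.some(d => atOrAfter(byId.get(d)!, b));
  }

  // Add-wins semantics: `x` is present if some *authorized* add of `x` has not
  // been causally overridden by an *authorized* remove of `x`. Thanks to
  // short-circuiting, `isAuthorized` is only consulted for events that touch `x`.
  function contains(x: string): boolean {
    const adds = events.filter(e => e.op === "add" && e.element === x && isAuthorized(e));
    const removes = events.filter(e => e.op === "remove" && e.element === x && isAuthorized(e));
    return adds.some(add => !removes.some(rm => atOrAfter(rm, add)));
  }

  return { contains };
}
```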
Caveats
The semantics of Continuous Authorization and Eventual Authorization are really sensible, I think, as long as we assume that their notion of 'time' reflects the order in which events actually happen in the real world. But famously, time in distributed systems is more complicated. For one thing, networks are slow, devices go offline, and so the partial order of events given by the system can always end up being weaker than actual causality. There's not much you can do about this; peers can even be dishonest and pretend not to know about events that they have actually received. In a system satisfying Continuous/Eventual Authorization, this means that there are any number of ways to force peers to consider the possibility that an event happened earlier than it actually did.
Let's examine the impact of this problem. With Eventual Authorization, if you are dishonest and send an event that doesn't actually reference the most recent state, peers will still authorize it against the same state either way. (Good, so you still can't circumvent events that restrict your power.)
But it does change the state that other, concurrent events will be authorized against. If you have power, you can abuse this to cause the authorization decisions of earlier events to be changed—even if they happened before you got your power!
Is this really okay? Well, consider that whenever you delegate power to someone, you're trusting them to use that power responsibly. It shouldn't be a stretch to consider this trust as an indicator of their honesty as well. Notably, Matrix makes a similar assumption: it resolves conflicts in favor of the user with the higher power level, on the grounds that they are more likely to be honest. I think this is a reasonable assumption to have as part of the threat model. The most important job of an access control system is to stop people who will never have power from performing restricted actions, and that's not what's at stake here.
One notable consequence of the ability of dishonest peers to roll back their clocks is that you should avoid designing systems which have mutually exclusive events. If, for example, your system allows admins to strip each other of their power, then it will always be possible for admins to enter a battle of "no, I demoted you first—no, you" and evade having their power taken away.
This constraint is probably a good thing: in centralized systems that allow admins to demote each other, any admin that has their account compromised can initiate a takeover of the system, locking all other admins out of it forever. A better design might require admins to voluntarily demote themselves (as Matrix does), or else retain information about who delegated their power to who, so that people can only be demoted by someone higher up in the power hierarchy.
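As an illustration of that last idea (again my own sketch, not something taken from Matrix or the demo): if each admin records who granted them their power, a demotion can be required to come from somewhere above the target in that chain.

```typescript
// Hypothetical sketch: power can only be revoked from above in the delegation chain.
type UserId = string;

// grantedBy.get(u) is whoever delegated power to u; the document's creator has
// no entry. Assumes the chain is acyclic, which holds if you can only delegate
// power you already have.
function canDemote(actor: UserId, target: UserId, grantedBy: Map<UserId, UserId>): boolean {
  for (let cur = grantedBy.get(target); cur !== undefined; cur = grantedBy.get(cur)) {
    if (cur === actor) return true; // actor sits above target in the hierarchy
  }
  return false;
}
```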
Other than that, the primary challenge that remains is to make this technique fast. The Do Not Erase demo is completely unoptimized, because I wanted it to read as a direct translation of the Eventual Authorization semantics into runnable code. In particular, it traverses the entire event graph many times over while performing authorization checks! Instead of implementing this in terms of lazy data structures and recursion, a better bet might be to iteratively revise authorization decisions until you reach a fixed point. In general, there seem to be lots of opportunities to reuse previous authorization results and focus only on the parts of the event graph that change, so I feel optimistic.
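To gesture at what that iterative approach might look like, here is a sketch under my own assumptions rather than a tested optimization: keep a current guess of every event's authorization status, recompute each event against the state induced by that guess, and repeat until nothing changes. The `judge` callback stands in for the one-step check described earlier, and the loop is bounded because convergence needs care when mutually exclusive concurrent events are present.

```typescript
// Sketch of iteratively revising authorization decisions to a fixed point.
// `judge(e, authorized)` is assumed to check e against the state at
// { e' | ¬(e ≤ e') } built from the events the current guess authorizes.

interface Ev { id: string; deps: string[] }

function fixpointAuthorization(
  events: Ev[],
  judge: (e: Ev, authorized: (x: Ev) => boolean) => boolean,
): Map<string, boolean> {
  // Start optimistically: assume every event is authorized.
  const guess = new Map<string, boolean>(events.map(e => [e.id, true] as const));

  for (let round = 0; round < events.length + 1; round++) {
    let changed = false;
    for (const e of events) {
      const next = judge(e, x => guess.get(x.id) ?? false);
      if (next !== guess.get(e.id)) {
        guess.set(e.id, next);
        changed = true;
      }
    }
    if (!changed) return guess; // reached a fixed point
  }
  // Did not stabilize within the bound (e.g. mutually exclusive concurrent events).
  return guess;
}
```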
Closing thoughts
One thing that inhibits the development of new distributed systems is the lack of reusable components. While I think I've managed to distill a reusable technique for access control here, it would be really interesting to see if this can be turned into a reusable library or perhaps even a standard. A dream of mine is to have a standard for distributed documents that define their own presentation logic, business logic, and access control rules. If such a thing is possible, this could provide a basis for collaborative experiences that's as malleable as the open Web, yet as personal as a file stored on your own device—local-first software, but as a reusable platform.
That's about all! Note that I am not a security professional, so you should definitely take my analyses and judgements here with a grain of salt. But I really hope that the idea I've presented here can be of use to future systems needing access control :) In particular, the flexibility and semantic transparency of this approach, as compared to the status quo of Matrix state resolution, appears to be a promising direction for improvement.